469 Commits

Author SHA1 Message Date
Pierre Bourdon
8e02589ac8 queue-runner: switch to pseudorandom ordering of builds processing
We no longer rely on processing build IDs in sequential / monotonic order, so
randomizing actually has the advantage of mixing builds for different
systems together, to avoid only one chunk of builds for a single system
getting processed while builders for other systems are starved.
2025-04-07 12:33:35 -04:00
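As a rough illustration of the idea in commit 8e02589ac8 (not the queue runner's actual code), the sketch below orders queued builds with a seeded pseudorandom shuffle instead of by ascending ID, so that builds for different systems tend to be interleaved. The `Build` tuple, the `shuffled_order` helper, the system names and the seed are all hypothetical.

```python
import random
from collections import namedtuple

# Hypothetical minimal view of a queued build: its ID and target system.
Build = namedtuple("Build", ["id", "system"])


def shuffled_order(builds, seed):
    """Return the builds in a pseudorandom order instead of by ascending ID."""
    rng = random.Random(seed)  # seeded, so the order is reproducible
    order = list(builds)       # copy, so the caller's list is left untouched
    rng.shuffle(order)
    return order


# Two contiguous chunks of build IDs, one per system.
queued = [Build(i, "x86_64-linux") for i in range(1, 6)] + [
    Build(i, "aarch64-linux") for i in range(6, 11)
]

order = shuffled_order(queued, seed=42)
for build in order:
    print(build.id, build.system)
```

Every build is still processed exactly once; only the order changes, which is why dropping the reliance on monotonic IDs is a prerequisite.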
Pierre Bourdon
52a0199a9b queue-runner: introduce some parallelism for remote paths lookup
Each output for a given step being ingested is looked up in parallel,
which should roughly multiply the speed of build ingestion by the
average number of outputs per derivation.
2025-04-07 12:33:35 -04:00
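A minimal sketch of the parallel lookup described in commit 52a0199a9b, assuming each lookup is an independent, latency-bound query. `lookup_remote_path` is a hypothetical stand-in for the real remote store query, and the store paths are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def lookup_remote_path(path):
    """Hypothetical stand-in for a remote store query with network latency."""
    time.sleep(0.05)
    # Pretend that only paths ending in "-out" are already present remotely.
    return path, path.endswith("-out")


def lookup_outputs(paths):
    """Query all output paths of one step concurrently, one worker per path."""
    if not paths:
        return {}
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        return dict(pool.map(lookup_remote_path, paths))


outputs = [
    "/nix/store/aaa-hello-out",
    "/nix/store/bbb-hello-dev",
    "/nix/store/ccc-hello-doc",
]
result = lookup_outputs(outputs)
print(result)
```

With three outputs the lookups overlap, so the step takes about one round trip rather than three, which is where the "average number of outputs per derivation" factor comes from.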
Pierre Bourdon
9265fc5002 queue-runner: reduce the time between queue monitor restarts
This will induce more DB queries (though these are fairly cheap), with
the benefit of processing bumps within 1m instead of within 10m.
2025-04-07 12:33:35 -04:00
Pierre Bourdon
d8ffa6b56a queue-runner: remove id > X from new builds query
Running the query with/without it shows that it makes no difference to
postgres, since there's an index on finished=0 already. This allows a
few simplifications, but also paves the way towards running multiple
parallel monitor threads in the future.
2025-04-07 12:33:35 -04:00
Pierre Bourdon
efcf6815d9 queue-runner: add prom metrics to allow detecting internal bottlenecks
By looking at the ratio of running vs. waiting for the dispatcher and
the queue monitor, we should get better visibility into what hydra is
currently bottlenecked on.

There are other side effects we can try to measure to get to the same
result, but having a simple way doesn't cost us much.
2025-04-07 12:33:35 -04:00
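To make the "running vs. waiting" ratio from commit efcf6815d9 concrete, here is a small sketch, assuming each thread accumulates the seconds it spent in either state. The `PhaseTimer` class and the sample numbers are hypothetical and do not reflect the metric names Hydra actually exports.

```python
class PhaseTimer:
    """Accumulates the seconds a thread spent working vs. waiting for work."""

    def __init__(self):
        self.running = 0.0
        self.waiting = 0.0

    def busy_fraction(self):
        """Share of total time spent running; 0.0 if nothing was recorded."""
        total = self.running + self.waiting
        return self.running / total if total > 0 else 0.0


# Made-up sample: the dispatcher was busy 45s and idle 15s in a 60s window.
dispatcher = PhaseTimer()
dispatcher.running += 45.0
dispatcher.waiting += 15.0

# Made-up sample: the queue monitor was busy 6s and idle 54s.
queue_monitor = PhaseTimer()
queue_monitor.running += 6.0
queue_monitor.waiting += 54.0

print("dispatcher busy fraction:", dispatcher.busy_fraction())
print("queue monitor busy fraction:", queue_monitor.busy_fraction())
```

A thread whose busy fraction stays close to 1.0 is the likely bottleneck; in this made-up sample that would be the dispatcher rather than the queue monitor.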
Pierre Bourdon
1e2d3211d9 queue-runner: limit parallelism of CPU intensive operations
My current theory is that running more parallel xz than available CPU
cores is reducing our overall throughput by requiring more scheduling
overhead and more cache thrashing.
2025-04-07 12:33:35 -04:00
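The commit message for 1e2d3211d9 does not say how the limit is enforced. One common way, shown here purely as an assumption, is a semaphore sized to the number of CPU cores, so that extra compression jobs wait instead of competing for the same cores. `compress_payload` and the sample data are hypothetical; Python's `lzma` module stands in for xz.

```python
import lzma
import os
import threading
from concurrent.futures import ThreadPoolExecutor

# Assumption: allow at most one compression job per available CPU core.
MAX_COMPRESSORS = os.cpu_count() or 1
compress_slots = threading.BoundedSemaphore(MAX_COMPRESSORS)


def compress_payload(data: bytes) -> bytes:
    """Compress with xz/LZMA while holding one of the limited CPU slots."""
    with compress_slots:  # blocks while MAX_COMPRESSORS jobs are running
        return lzma.compress(data)


# Many worker threads may exist, but only MAX_COMPRESSORS compress at once.
payloads = [b"build log line\n" * 1000 for _ in range(16)]
with ThreadPoolExecutor(max_workers=16) as pool:
    compressed = list(pool.map(compress_payload, payloads))

print(len(compressed), "payloads compressed with at most", MAX_COMPRESSORS, "in flight")
```

The worker pool can stay large for I/O-bound work, while the semaphore caps only the CPU-heavy section.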
John Ericson
d6a5df25bf Fix the build 2025-04-07 11:36:59 -04:00
John Ericson
1cb1e139c4 Fix build (due to C++ API changes) 2025-04-07 11:12:12 -04:00
John Ericson
9a5bd39d4c Revert "Use LegacySSHStore"
There were some hangs caused by this. Need to fix them, ideally
reproducing the issue in a test, before trying this again.

This reverts commit 4a4a0f901c70676ee47f830d2ff6a72789ba1baf.
2025-03-03 10:12:38 -05:00
John Ericson
4a4a0f901c Use LegacySSHStore
In https://github.com/NixOS/nix/pull/10748 it is extended with
everything we need.
2025-02-18 14:07:42 -05:00
John Ericson
80241fc8be Make code change necessary for building with Nix 2.25 2025-02-13 19:10:09 -05:00
John Ericson
9a6928d93b Use new CommonSSHStoreConfig::createSSHMaster
This avoids some duplicated code, leveraging the same `StoreReference`
type that also undergirds the machine file dedup we just did.

By using `LegacySSHStoreConfig`, we're also taking a baby step towards
using the store interface rather than messing around with the protocol
internals.
2025-02-13 18:13:38 -05:00
John Ericson
af9b0663f2 Merge branch 'master' into nix-next 2025-02-13 17:54:15 -05:00
Pierre Bourdon
182a48c9fb autotools -> meson
Original commit message:

> There are some known regressions regarding local testing setups - since
> everything was kinda half written with the expectation that build dir =
> source dir (which should not be true anymore). But everything builds and
> the test suite runs fine, after several hours spent debugging random
> crashes in libpqxx with MALLOC_PERTURB_...

I have not experienced regressions with local testing.

(cherry picked from commit 4b886d9c45cd2d7fe9b0a8dbc05c7318d46f615d)
2024-11-24 15:58:26 -05:00
John Ericson
f974891c76
Merge pull request #1420 from NixOS/nix-2.23
`sshPublicHostKey` fix for `master`
2024-10-24 17:03:20 +02:00
John Ericson
8515cb183e Merge branch 'nix-2.22' into nix-2.23 2024-10-21 11:23:41 -04:00
John Ericson
60dd7ec187 Merge branch 'nix-2.21' into nix-2.22 2024-10-21 11:23:30 -04:00
John Ericson
53b04ddf74 Merge branch 'nix-2.20' into nix-2.21 2024-10-21 11:23:20 -04:00
Martin Weinelt
4e2c06ec2c queue-runner: don't decode base64 hostkey in hydra
Nix expects a base64 encoded hostkey in SSHMaster, so make sure we don't
decode this prematurely in hydra.

Reported-By: Puck Meerburg <puck@puck.moe>
2024-10-21 11:22:44 -04:00
John Ericson
750275d6e8 Avoid trailing slash that broke lookup 2024-10-07 11:43:58 -04:00
John Ericson
ceb8b48cce Fix type error with NAR accessor 2024-09-24 12:14:23 -04:00
John Ericson
012cbd43f5 Add missing include 2024-09-24 11:51:17 -04:00
John Ericson
029116422d Update to Nix 2.23
Flake lock file updates:

• Updated input 'nix':
    'github:NixOS/nix/1c8150ac312b5f9ba1b3f6768ff43b09867e5883' (2024-04-23)
  → 'github:NixOS/nix/5ffd239adc9b7fddca7a2a59a8b87da5af14ec4d' (2024-09-23)
2024-09-24 11:38:01 -04:00
Jörg Thalheim
2dad87ad89 hydra-queue-runner: fix compilation warning
Instead of converting to double, we can convert to float right away.
2024-09-20 07:50:24 +02:00
John Ericson
cd925e876f Merge branch 'master' into nix-next 2024-05-29 17:05:04 -04:00
Pierre Bourdon
5728011da1 queue-runner: try larger pipe buffer sizes
(cherry picked from commit 18466e83261d39b997a73bbd9f0f249c3a91fbeb)
2024-05-23 11:42:35 -04:00
John Ericson
09a1e64ed2 Dedup with nix: use nix::Machine::parseConfig
Companion to https://github.com/NixOS/nix/pull/10763
2024-05-23 09:59:46 -04:00
John Ericson
d55bea2a1e Utilize nix::Machine more fully
With https://github.com/NixOS/nix/pull/9839, the `storeUri` field is
much better structured, so we can use it while still opening the SSH
connection ourselves.
2024-05-22 22:02:46 -04:00
John Ericson
71c4e2dc5b Dedup more protocol code
Use https://github.com/NixOS/nix/pull/10749
2024-05-20 18:19:59 -04:00
John Ericson
ef7bf1e67b
Merge pull request #1375 from NixOS/nix-2.21
Nix 2.21
2024-04-12 17:28:37 -04:00
Maximilian Bosch
99afff03b0
hydra-queue-runner: drop broken connections from pool
Closes #1336

When restarting postgresql, the connections are still reused in
`hydra-queue-runner` causing errors like this

    main thread: Lost connection to the database server.
    queue monitor: Lost connection to the database server.

and no more builds being processed.

`hydra-evaluator` doesn't have that issue since it crashes right away.
We could let it retry indefinitely as well (see below), but I don't
want to change too much.

If the DB is still unreachable 10s later, the process will stop with a
non-zero exit code because of a missing DB connection. This however
isn't such a big deal because it will be immediately restarted
afterwards. With the current configuration, Hydra will never give up,
but restart (and retry) infinitely. To me that seems reasonable, i.e. to
retry DB connections on a long-running process. If this doesn't work
out, the monitoring should fire anyway because the queue fills up, but
I'm open to discussing that.

Please note that this isn't reproducible with the DB and the queue
runner on the same machine when using `services.hydra-dev`, because of
the `Requires=` dependency `hydra-queue-runner.service` ->
`hydra-init.service` -> `postgresql.service` that causes the queue
runner to be restarted on `systemctl restart postgresql`.

Internally, Hydra uses Nix's pool data structure: it basically has N
slots (here DB connections), and whenever a new one is requested, an idle
slot is provided or a new one is created (when all N slots are active,
the caller waits until one becomes free). The issue in the code here is
that whenever an error is encountered, the slot is released, but the
same broken connection is reused the next time. By using
`Pool::Handle::markBad`, Nix will drop a broken slot. This is now
done whenever `pqxx::broken_connection` is caught.
2024-03-15 14:09:31 +01:00
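The pool behaviour described in commit 99afff03b0 can be sketched as follows. This is a simplified Python model, not Nix's `Pool` implementation: the `Pool`, `Handle` and `BrokenConnection` classes and the `run_query` helper are hypothetical, and the limit of N slots (with blocking when all are in use) is left out for brevity.

```python
class BrokenConnection(Exception):
    """Stand-in for pqxx::broken_connection."""


class Pool:
    """Hands out idle connections, creating a new one only when none is idle."""

    def __init__(self, factory):
        self._factory = factory
        self._idle = []

    def get(self):
        conn = self._idle.pop() if self._idle else self._factory()
        return Handle(self, conn)


class Handle:
    """Returns its connection to the pool on release unless marked bad."""

    def __init__(self, pool, conn):
        self._pool = pool
        self.conn = conn
        self._bad = False

    def mark_bad(self):
        self._bad = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if not self._bad:
            self._pool._idle.append(self.conn)  # healthy: reuse it later
        return False  # never swallow exceptions


created = []


def connect():
    conn = object()  # placeholder for a real DB connection
    created.append(conn)
    return conn


def run_query(pool, fail):
    with pool.get() as handle:
        try:
            if fail:
                raise BrokenConnection("Lost connection to the database server.")
            return handle.conn
        except BrokenConnection:
            handle.mark_bad()  # drop the broken connection instead of reusing it
            raise


pool = Pool(connect)
first = run_query(pool, fail=False)
second = run_query(pool, fail=False)  # reuses the idle connection
try:
    run_query(pool, fail=True)
except BrokenConnection:
    pass
third = run_query(pool, fail=False)  # a fresh connection is created
print(first is second, third is first, len(created))
```

Without the `mark_bad()` call, the failed connection would go back into `_idle` and be handed out again on the next request, which matches the failure mode the commit describes.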
Maximilian Bosch
e499509595
Switch to new Nix bindings, update Nix for that
Implements support for Nix's new Perl bindings[1]. The current state
basically does `openStore()`, but always uses `auto` and doesn't support
stores at other URIs.

Even though the stores are cached inside the Perl implementation, I
decided to instantiate those once in the Nix helper module. That way
store openings aren't cluttered across the entire codebase. Also, there
are two stores used later on - MACHINE_LOCAL_STORE for `auto`,
BINARY_CACHE_STORE for the one from `store_uri` in `hydra.conf` - and
using consistent names should make the intent clearer then.

This doesn't contain any behavioral changes, i.e. the build product
availability issue from #1352 isn't fixed. This patch only contains the
migration to the new API.

[1] https://github.com/NixOS/nix/pull/9863
2024-02-12 18:50:56 +01:00
John Ericson
7b826ec5ad Merge branch 'nix-next' into nix-2.20 2024-01-30 13:26:45 -05:00
John Ericson
fcde5908d8 More CA derivations prep
Again, with care not to change the schema in any way.
2024-01-25 21:32:22 -05:00
John Ericson
7a53b866f6 Merge branch 'master' into nix-next
• Updated input 'nix' (merge):
    'github:NixOS/nix/212ba69e6f995992f8b4e4c0656d19c0156c8714'
    'github:NixOS/nix/2c4bb93ba5a97e7078896ebc36385ce172960e4e' (2024-01-25)
  → 'github:NixOS/nix/8df68a213fc52a57b02a57005b0e06cc8de40ce3' (2024-01-25)
2024-01-25 16:26:07 -05:00
John Ericson
c64eed7d07 Simplify StoreConfig::getDefaultSystemFeatures call
That method is now static.
2024-01-25 15:58:07 -05:00
John Ericson
b1fa6b3aac Use StoreConfig::getDefaultSystemFeatures for default machine config
We have to oddly make a `StoreConfig` subclass to get it, but
https://github.com/NixOS/nix/pull/9848 will fix that.

The purpose of this is to ensure that, absent an explicit config,
`localhost` includes `ca-derivations` and `recursive-nix` if those
experimental features are enabled.

Very much the complement of #1342, the previous PR.
2024-01-24 21:37:13 -05:00
John Ericson
07cb5d1b7c Use nix::ParsedDerivation::getRequiredSystemFeatures()
A slight dedup, and also ensures that floating CA derivations require a
`ca-derivations` experimental feature. This fixes the scheduling issue
that @SuperSandro2000 found.
2024-01-24 21:04:14 -05:00
John Ericson
449eb2d873 Use more nix::Machine fields
The upstream fields were made to match Hydra, so we can get rid of the
extra fields temporarily added in
70e5469303b422bdb4b123be222bdea4d7f9611c.
2024-01-24 20:14:31 -05:00
John Ericson
9e7ac58042 Merge branch 'master' into nix-next 2024-01-24 18:36:03 -05:00
John Ericson
d45e14fd43
Merge pull request #1316 from NixOS/ca-derivations-prep
Prepare for CA derivation support with lower impact changes
2024-01-24 18:12:42 -05:00
John Ericson
9a86da0e7b Merge branch 'master' into nix-next 2024-01-23 15:49:14 -05:00
John Ericson
70e5469303 Use Nix's Machine type in a minimal way
This is *just* using the fields from that type, and only where the types
coincide. (There are two fields with different types, `speedFactor` most
interestingly.) No code is reused, so we can be sure that no behavior is
changed.

Once the types are reconciled on the Nix side, then we can start
carefully actually reusing code.

Progress on #1164
2024-01-23 12:18:57 -05:00
John Ericson
2e6ee28f9b Machine -> ::Machine so we don't conflict with Nix's 2024-01-23 11:03:19 -05:00
John Ericson
7386caaecf Use Nix's SSHMaster 2024-01-23 10:24:02 -05:00
John Ericson
84c46b6b68 Update to newer Nix
Flake lock file updates:

• Updated input 'nix':
    'github:NixOS/nix/74534829f23b668fb9b2f2a14ff6afa4d5e71d4a' (2024-01-22)
  → 'github:NixOS/nix/b6aee9a93f6646bbffd919d362a5c75c37bb9caa' (2024-01-23)
2024-01-23 10:21:48 -05:00
John Ericson
f1d9230f25 Merge remote-tracking branch 'upstream/master' into nix-next 2024-01-23 01:18:13 -05:00
John Ericson
4e8fbaa3d6 Replace Child with SSHMaster::Connection
Nix defines basically an identical struct for the same purpose, so let's
just use that.
2024-01-23 01:11:46 -05:00
John Ericson
4ac31c89df Use nix::serv_proto::BasicConnection in build_remote.cc
- Use the type itself

  This lays the foundation for being able to dedup the protocol code.

- Use `BasicConnection::handshake`, replacing ours.

- Use `BasicConnection::queryValidPaths`

- Use `BasicConnection::putBuildDerivationRequest`
2024-01-22 14:20:39 -05:00
John Ericson
89cfe26533 Merge remote-tracking branch 'upstream/master' into nix-next 2024-01-22 13:01:40 -05:00