This allows for better builder utilization when the queue runner is busy. To
avoid running into uncontrollable imbalances between the builders and the
queue runner, we only release the machine reservation after the local
throttler has found a slot to start copying the outputs for that build.
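As a rough sketch of the intended ordering (the `Throttler` and
`MachineReservation` names below are illustrative stand-ins, not the actual
queue-runner types):

```cpp
#include <condition_variable>
#include <memory>
#include <mutex>

// Illustrative local throttler: bounds how many finished builds may
// copy their outputs concurrently.
struct Throttler {
    std::mutex m;
    std::condition_variable cv;
    unsigned freeSlots;
    explicit Throttler(unsigned slots) : freeSlots(slots) {}

    void acquire() {
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [&] { return freeSlots > 0; });
        --freeSlots;
    }

    void release() {
        { std::lock_guard<std::mutex> lk(m); ++freeSlots; }
        cv.notify_one();
    }
};

struct MachineReservation { /* holds a slot on a remote builder */ };

void finishStep(std::unique_ptr<MachineReservation> reservation, Throttler & copyThrottler)
{
    // Keep the machine reserved until the local throttler grants a copy
    // slot; only then hand the builder slot back, so the builders and the
    // queue runner cannot drift into an uncontrolled imbalance.
    copyThrottler.acquire();
    reservation.reset();     // the builder may now pick up the next build
    // ... copy the outputs of the finished build ...
    copyThrottler.release();
}
```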
Rather than asserting uniqueness to track resource utilization, we simply
switch to using `std::unique_ptr`.
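Illustratively (member and type names are placeholders, not the real ones),
the change amounts to encoding single ownership in the type instead of
checking it at runtime:

```cpp
#include <memory>

struct MachineReservation { /* ... */ };

// Before (roughly): shared ownership plus a runtime check that we are the
// only remaining holder, e.g. assert(reservation.use_count() == 1).
// After: unique ownership is stated in the type itself, so the compiler
// enforces it and no assertion is needed to reason about utilization.
void runStep(std::unique_ptr<MachineReservation> reservation)
{
    // ... the reservation is owned here and destroyed exactly once ...
}
```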
We no longer rely on builds being processed in sequential / monotonic ID order, so
randomizing actually has the advantage of mixing builds for different
systems together, to avoid only one chunk of builds for a single system
getting processed while builders for other systems are starved.
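As a minimal sketch of what "randomizing" means here (the `Build` record is a
stand-in, not Hydra's actual type):

```cpp
#include <algorithm>
#include <random>
#include <string>
#include <vector>

struct Build { unsigned id; std::string system; };

// Shuffle newly fetched builds instead of processing them in ascending ID
// order, so builds for different systems are interleaved and a single
// system cannot monopolise the ingestion loop.
void shuffleNewBuilds(std::vector<Build> & newBuilds)
{
    static thread_local std::mt19937 rng{std::random_device{}()};
    std::shuffle(newBuilds.begin(), newBuilds.end(), rng);
}
```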
Each output of a given step being ingested is looked up in parallel, which
should roughly multiply the speed of build ingestion by the average number of
outputs per derivation.
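A sketch of the pattern, assuming each output can be looked up independently
(`lookupOutput` is a placeholder for the real per-output query):

```cpp
#include <cstddef>
#include <future>
#include <map>
#include <optional>
#include <string>
#include <vector>

// Stand-in for the real per-output lookup (e.g. querying the destination
// store for an already-built output path).
std::optional<std::string> lookupOutput(const std::string & outputName)
{
    // ... real implementation performs the store / DB query ...
    return std::nullopt;
}

// Launch all lookups for a step at once and collect the results; with N
// outputs per derivation this roughly divides the per-step lookup latency by N.
std::map<std::string, std::optional<std::string>>
lookupOutputsInParallel(const std::vector<std::string> & outputNames)
{
    std::vector<std::future<std::optional<std::string>>> futures;
    futures.reserve(outputNames.size());
    for (const auto & name : outputNames)
        futures.push_back(std::async(std::launch::async, lookupOutput, name));

    std::map<std::string, std::optional<std::string>> results;
    for (std::size_t i = 0; i < outputNames.size(); ++i)
        results[outputNames[i]] = futures[i].get();
    return results;
}
```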
Running the query with and without it shows that it makes no difference to
Postgres, since there is already an index on finished = 0. This allows a
few simplifications, but also paves the way towards running multiple
parallel monitor threads in the future.
By looking at the ratio of running vs. waiting time for the dispatcher and
the queue monitor, we should get better visibility into what Hydra is
currently bottlenecked on.
There are other side effects we could measure to get at the same result, but
having a simple, direct measure costs us little.
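For example, accumulating active vs. waiting time per loop would already be
enough; the counters and names below are hypothetical, not Hydra's actual
metrics:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical per-thread counters, to be exposed via the status output.
struct LoopActivity {
    std::atomic<std::uint64_t> activeMs{0};
    std::atomic<std::uint64_t> waitingMs{0};
};

// Run `f` and charge its wall-clock duration to `counter`.
template<typename F>
void timed(std::atomic<std::uint64_t> & counter, F && f)
{
    auto start = std::chrono::steady_clock::now();
    f();
    auto elapsed = std::chrono::steady_clock::now() - start;
    counter += std::chrono::duration_cast<std::chrono::milliseconds>(elapsed).count();
}

// In a dispatcher-style loop:
//   timed(activity.waitingMs, [&] { /* wait to be woken up */ });
//   timed(activity.activeMs,  [&] { /* do one round of dispatching */ });
// The ratio activeMs / (activeMs + waitingMs) then shows whether the loop
// itself is the bottleneck or mostly idle.
```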
My current theory is that running more parallel xz processes than there are
CPU cores reduces our overall throughput by requiring more scheduling
overhead and causing more cache thrashing.
This caused some hangs. They need to be fixed, ideally by reproducing the
issue in a test, before trying this again.
This reverts commit 4a4a0f901c70676ee47f830d2ff6a72789ba1baf.
This avoids some duplicated code by leveraging the same `StoreReference`
type that also undergirds the machine-file dedup we did just prior.
By using `LegacySSHStoreConfig`, we're also taking a baby step towards
using the store interface rather than messing around with the protocol
internals.
Original commit message:
> There are some known regressions regarding local testing setups - since
> everything was kinda half written with the expectation that build dir =
> source dir (which should not be true anymore). But everything builds and
> the test suite runs fine, after several hours spent debugging random
> crashes in libpqxx with MALLOC_PERTURB_...
I have not experienced regressions with local testing.
(cherry picked from commit 4b886d9c45cd2d7fe9b0a8dbc05c7318d46f615d)
With https://github.com/NixOS/nix/pull/9839, the `storeUri` field is
much better structured, so we can use it while still opening the SSH
connection ourselves.
Closes #1336
When restarting PostgreSQL, the connections are still reused in
`hydra-queue-runner`, causing errors like

    main thread: Lost connection to the database server.
    queue monitor: Lost connection to the database server.

and no builds are processed anymore.
`hydra-evaluator` doesn't have that issue since it crashes right away.
We could let it retry indefinitely as well (see below), but I don't
want to change too much.
If the DB is still unreachable 10s later, the process will stop with a
non-zero exit code because of the missing DB connection. This isn't such a
big deal, however, because it will be restarted immediately afterwards. With
the current configuration, Hydra will never give up, but will restart (and
retry) indefinitely. That seems reasonable to me, i.e. retrying DB
connections in a long-running process. If this doesn't work out, the
monitoring should fire anyway because the queue fills up, but I'm open to
discussing that.
Please note that this isn't reproducible with the DB and the queue
runner on the same machine when using `services.hydra-dev`, because the
`Requires=` dependency chain `hydra-queue-runner.service` ->
`hydra-init.service` -> `postgresql.service` causes the queue runner to be
restarted on `systemctl restart postgresql`.
Internally, Hydra uses Nix's pool data structure: it basically has N
slots (here, DB connections); whenever one is requested, an idle slot is
handed out or a new one is created (if all N slots are active, the request
waits until one becomes free). The issue in the code here is that whenever
an error is encountered, the slot is released, but the same broken
connection will be reused the next time. By using `Pool::Handle::markBad`,
Nix will drop a broken slot. This is now done whenever a
`pqxx::broken_connection` is caught.
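A sketch of the resulting pattern, assuming Nix's `Pool` from `pool.hh` and a
libpqxx connection pool; the `dbPool` and `withTransaction` names are
illustrative, not Hydra's actual ones:

```cpp
#include <pqxx/pqxx>

#include "pool.hh"   // nix::Pool, from Nix's libutil (header path may differ)

void withTransaction(nix::Pool<pqxx::connection> & dbPool)
{
    auto conn(dbPool.get());     // borrow an idle slot or create a new one
    try {
        pqxx::work txn(*conn);
        // ... run queries ...
        txn.commit();
    } catch (pqxx::broken_connection &) {
        // Don't return this slot to the pool: mark it bad so the next
        // dbPool.get() opens a fresh connection instead of reusing the
        // broken one.
        conn.markBad();
        throw;
    }
}
```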
Implements support for Nix's new Perl bindings[1]. The current state
basically does `openStore()`, but always uses `auto` and doesn't support
stores at other URIs.
Even though the stores are cached inside the Perl implementation, I
decided to instantiate them once in the Nix helper module. That way, store
openings aren't scattered across the entire codebase. Also, two stores are
used later on - MACHINE_LOCAL_STORE for `auto`, BINARY_CACHE_STORE for the
one from `store_uri` in `hydra.conf` - and using consistent names should
make the intent clearer.
This doesn't contain any behavioral changes, i.e. the build product
availability issue from #1352 isn't fixed. This patch only contains the
migration to the new API.
[1] https://github.com/NixOS/nix/pull/9863
Oddly, we have to make a `StoreConfig` subclass to get it, but
https://github.com/NixOS/nix/pull/9848 will fix that.
The purpose of this is to ensure that, absent an explicit config,
`localhost` includes `ca-derivations` and `recursive-nix` if those
experimental features are enabled.
Very much the complement of #1342, the previous PR.
A slight dedup, which also ensures that floating CA derivations require the
`ca-derivations` experimental feature. This fixes the scheduling issue
that @SuperSandro2000 found.
This is *just* using the fields from that type, and only where the types
coincide. (There are two fields with different types, `speedFactor` most
interestingly.) No code is reused, so we can be sure that no behavior is
changed.
Once the types are reconciled on the Nix side, we can start carefully
reusing code.
Progress on #1164