471 Commits

Author SHA1 Message Date
Théophane Hufschmitt
143c31734f Move all the build remote utils to their namespace
Just don't pollute the global one
2022-10-25 10:04:29 +02:00
Jörg Thalheim
94d19e1972 hydra: fix localhost detection when protocol prefix are used 2022-09-29 20:46:13 +02:00
Eelco Dolstra
44e1efff7f Send the right nix-serve client version
We were using protocol version 6 but requesting version 4. The only
reason that this worked was because of a broken version check in
'nix-store --serve'. That was fixed in
c2d7456926,
which had the side-effect of breaking hydra-queue-runner.
2022-09-08 11:51:13 +02:00
Maximilian Bosch
5c01800fbe
flake: Update Nix to 2.9.1
NOTE: I'm well-aware that we have to be careful with this to avoid new
regressions on hydra.nixos.org, so this should only be merged after
extensive testing from more people.

Motivation: I updated Nix in my deployment to 2.9.1 and decided to also
update Hydra in one go (and compile it against the newer Nix). Given
that this also updates the C++ code in `hydra-{queue-runner,eval-jobs}`
this patch might become useful in the future though.
2022-06-16 14:54:57 +02:00
Graham Christensen
e1965250b5
Merge pull request #1173 from DeterminateSystems/queue-runner-exporter
hydra-queue-runner metrics
2022-04-07 12:27:33 -04:00
Cole Helbling
f8dc48f171
hydra-queue-runner: fixup: remove extraneous newline 2022-04-06 17:53:11 -07:00
Graham Christensen
59ac96a99c Track the number of steps created 2022-04-06 20:23:02 -04:00
Graham Christensen
1c12c5882f hydra queue runner: instrument the process of loading new builds with prom 2022-04-06 20:18:29 -04:00
Graham Christensen
5de08d412e queue metrics: refactor the metrics into a struct 2022-04-06 20:00:30 -04:00
Graham Christensen
46f52b4c4e bring back the working version Cole made 2022-04-06 15:49:38 -04:00
Cole Helbling
5bff730f2c WIP: I love it when they delete the assignment operator :) 2022-04-06 11:41:40 -07:00
Cole Helbling
edf3c348f2 hydra-queue-runner: make entire address configurable 2022-04-06 10:59:45 -07:00
Cole Helbling
33bc60b83c hydra-queue-runner: move exporter back to State::run
It's (arguably) better than risking pinning the thread at 100% due to
the busy `while` loop.
2022-04-06 10:49:14 -07:00
Cole Helbling
8c5636fe18
hydra-queue-runner: use port 9198 by default
Co-authored-by: Graham Christensen <graham@grahamc.com>
2022-04-02 17:32:14 -07:00
Eelco Dolstra
bcaad1c934 openConnection(): Don't throw exceptions in forked child
On hydra.nixos.org the queue runner had child processes that were
stuck handling an exception:

  Thread 1 (Thread 0x7f501f7fe640 (LWP 1413473) "bld~v54h5zkhmb3"):
  #0  futex_wait (private=0, expected=2, futex_word=0x7f50c27969b0 <_rtld_local+2480>) at ../sysdeps/nptl/futex-internal.h:146
  #1  __lll_lock_wait (futex=0x7f50c27969b0 <_rtld_local+2480>, private=0) at lowlevellock.c:52
  #2  0x00007f50c21eaee4 in __GI___pthread_mutex_lock (mutex=0x7f50c27969b0 <_rtld_local+2480>) at ../nptl/pthread_mutex_lock.c:115
  #3  0x00007f50c1854bef in __GI___dl_iterate_phdr (callback=0x7f50c190c020 <_Unwind_IteratePhdrCallback>, data=0x7f501f7fb040) at dl-iteratephdr.c:40
  #4  0x00007f50c190d2d1 in _Unwind_Find_FDE () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
  #5  0x00007f50c19099b3 in uw_frame_state_for () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
  #6  0x00007f50c190ab90 in uw_init_context_1 () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
  #7  0x00007f50c190b08e in _Unwind_RaiseException () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
  #8  0x00007f50c1b02ab7 in __cxa_throw () from /nix/store/dd8swlwhpdhn6bv219562vyxhi8278hs-gcc-10.3.0-lib/lib/libstdc++.so.6
  #9  0x00007f50c1d01abe in nix::parseURL (url="root@cb893012.packethost.net") at src/libutil/url.cc:53
  #10 0x0000000000484f55 in extraStoreArgs (machine="root@cb893012.packethost.net") at build-remote.cc:35
  #11 operator() (__closure=0x7f4fe9fe0420) at build-remote.cc:79
  ...

Maybe the fork happened while another thread was holding some global
stack unwinding lock
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744). Anyway, since
the hanging child inherits all file descriptors to SSH clients,
shutting down remote builds (via 'child.to = -1' in
State::buildRemote()) doesn't work and 'child.pid.wait()' hangs
forever.

So let's not do any significant work between fork and exec.
2022-03-30 22:39:48 +02:00
ajs124
089da272c7 fix build against nix 2.7.0
fix build after such commits as df552ff53e68dff8ca360adbdbea214ece1d08ee
and e862833ec662c1bffbe31b9a229147de391e801a
2022-03-29 15:38:24 -04:00
ajs124
c64c5f0a7e hydra-queue-runner: rename build-result.hh to hydra-build-result.hh 2022-03-29 15:34:29 -04:00
Graham Christensen
3b048ed136 Revert "Revert "Use copyClosure instead of computeFSClosure + copyPaths""
This reverts commit 8e3ada2afcc2dd5153d3ae162afbb0633a570285.
2022-03-29 15:28:47 -04:00
Cole Helbling
4789eba92c hydra-queue-runer: split metrics functionality into its own function 2022-03-29 10:55:28 -07:00
Cole Helbling
928b3b8268 hydra-queue-runner: fix priority of flag over config file 2022-03-29 10:42:07 -07:00
Cole Helbling
5ddb9a98ca fixup! hydra-queue-runner: log message before and after exporter is started 2022-03-29 08:47:41 -07:00
Cole Helbling
905a7a7beb hydra-queue-runner: read metrics port from queue_runner_metrics_port config 2022-03-29 08:46:43 -07:00
Cole Helbling
9cdc5aceed hydra-queue-runner: log message before and after exporter is started
This way, if something goes wrong between the two, it's easier to narrow
down where the issue lies.
2022-03-29 08:41:19 -07:00
Théophane Hufschmitt
6e571e26ff Build the resolved derivation and not the original one 2022-03-29 17:05:30 +02:00
Théophane Hufschmitt
92b627ac1b Remove an accidental re-indenting of a comment
Co-authored-by: Eelco Dolstra <edolstra@gmail.com>
2022-03-29 17:04:19 +02:00
Théophane Hufschmitt
b430d41afd Use the BuildOptions more eagerly 2022-03-29 17:04:19 +02:00
Théophane Hufschmitt
fd0ae78eba Factor out the copying from the build store 2022-03-29 17:04:19 +02:00
Théophane Hufschmitt
a778a89f04 Factor out the queryPathInfos part of the build 2022-03-29 17:04:19 +02:00
Théophane Hufschmitt
365776f5d7 Factor out the building part 2022-03-29 17:04:19 +02:00
Théophane Hufschmitt
9f1b911625 Factor more stuff out 2022-03-29 17:04:17 +02:00
Théophane Hufschmitt
2f494b7834 Factor out the creation of the log file 2022-03-29 16:52:59 +02:00
Théophane Hufschmitt
5db8642224 Factor out a struct representing a connection to a machine 2022-03-29 16:52:59 +02:00
Cole Helbling
8e3ada2afc Revert "Use copyClosure instead of computeFSClosure + copyPaths"
This reverts commit f14c583ce5188903f7c9db6f99c8c3fb42c77416.
2022-03-28 09:54:02 -07:00
Eelco Dolstra
962bf36939
Merge pull request #1162 from obsidiansystems/less-ref
Make `copyClosureTo` take a regular C++ ref to the store
2022-03-23 16:25:59 +01:00
Cole Helbling
8503a7917b fixup! hydra-queue-runner: make registry member of State, configurable metrics port 2022-03-22 13:38:13 -07:00
Cole Helbling
c0f826b92d hydra-queue-runner: get the listening port from the exposer itself
Otherwise, when the port is randomly chosen (e.g. by specifying no port,
or a port of 0), it will just show that the port is 0 and not the port
that is actually serving the metrics.
2022-03-14 08:41:45 -07:00
Cole Helbling
52a29d43e6 hydra-queue-runner: make registry member of State, configurable metrics port
Thanks to the updated prometheus-cpp library, specifying a port of 0
will cause it to pick a random (available) port -- ideal for tests.
2022-03-11 11:58:10 -08:00
Cole Helbling
3bf31bd6a6 hydra-queue-runner: add simple "up" exporter
There are probably better ways to achieve this (and will likely need to
be refactored a bit to support further metrics).
2022-03-10 12:36:58 -08:00
John Ericson
445bba337b Make copyClosureTo take a regular C++ ref to the store
This is syntactically lighter wait, and demonstates there are no weird
dynamic lifetimes involved, just regular passing reference to callee
which it only borrows for the duration of the call.
2022-02-20 17:22:43 +00:00
John Ericson
f14c583ce5 Use copyClosure instead of computeFSClosure + copyPaths
It is more terse, and in the future it is possible `copyClosure` will
become more sophisticated.
2022-02-19 11:59:17 -05:00
Graham Christensen
4acaf9c8b0 hydra-queue-runner: don't dispatch until the machines parser has completed one run
Periodically, I have seen tests fail because of out of order queue runner behavior:

    checking the queue for builds > 0...
    loading build 1 (tests:basic:empty_dir)
    aborting unsupported build step '...-empty-dir.drv' (type 'x86_64-linux')
    marking build 1 as failed
    adding new machine ‘localhost’

This patch should prevent the dispatcher from running before any machines are
made available.
2022-02-10 10:54:30 -05:00
Graham Christensen
f6e86efc9f
Merge pull request #1091 from Ma27/ssh-remote-store-location
hydra-queue-runner: support store URIs declaring an alternate store location
2022-01-24 14:10:54 -05:00
Graham Christensen
3a4ea6e563
Merge pull request #1124 from obsidiansystems/simplify--closure-of-path-set
simplify, `computeFSClosure` can take a set now
2022-01-24 14:09:35 -05:00
Graham Christensen
ba96a13407 Record metrics when getting the closure to localhost 2022-01-21 15:38:05 -05:00
Graham Christensen
7e9e82398d build-remote: copy missing paths from the binary cache to localhost
In a Hydra instance I saw:

    possibly transient failure building ‘/nix/store/X.drv’ on ‘localhost’:
      dependency '/nix/store/Y' of '/nix/store/Y.drv' does not exist,
      and substitution is disabled

This is confusing because the Hydra in question does have substitution enabled.

This instance uses:

  keep-outputs = true
  keep-derivations = true

and an S3 binary cache which is not configured as a substituter in the nix.conf.

It appears this instance encountered a situation where store path Y was built
and present in the binary cache, and Y.drv was GC rooted on the instance,
however Y was not on the host.

When Hydra would try to build this path locally, it would look in the binary
cache to see if it was cached:

    (nix)
    439      bool valid = isValidPathUncached(storePath);
    440
    441      if (diskCache && !valid)
    442          // FIXME: handle valid = true case.
    443          diskCache->upsertNarInfo(getUri(), hashPart, 0);
    444
    445      return valid;

Since it was cached, the store path was considered Valid.

The queue monitor would then not put this input in for substitution, because
the path is valid:

    (hydra)
    470          if (!destStore->isValidPath(*i.second.path(*localStore, step->drv->name, i.first))) {
    471              valid = false;
    472              missing.insert_or_assign(i.first, i.second);
    473          }

Hydra appears to correctly handle the case of missing paths that need
to be substituted from the binary cache already, but since most
Hydra instances use `keep-outputs` *and* all paths in the binary cache
originate from that machine, it is not common for a path to be cached
and not GC rooted locally.

I'll run Hydra with this patch for a while and see if we run in to the
problem again.

A big thanks to John Ericson who helped debug this particular issue.
2022-01-21 15:26:45 -05:00
John Ericson
e7a1ae87aa simplify, computeFSClosure can take a set now 2022-01-20 14:53:01 -05:00
Graham Christensen
72c3110002 queue-runner: track jobsets by ID 2022-01-15 14:06:00 -05:00
Maximilian Bosch
a18b487403
hydra-queue-runner: support store URIs declaring an alternate store location
When having a builder like this in `/etc/nix/machines`

    ssh://mfbuild?remote-store=/home/bosch/store

Hydra cannot build there since it tries to pass the entire value to
`ssh(1)` which doesn't work. Also, an alternate store-location is e.g.
used if the user isn't a trusted user on the remote system and thus
cannot use `/nix/store`.

If such a URI is given, Hydra will now add a `--store /home/bosch/store`
to the `ssh`-command to select the appropriate location remotely.
2022-01-12 15:56:05 +01:00
Eelco Dolstra
5edb58b314 Fix build 2021-08-10 13:47:16 +02:00
regnat
abff212d06 Use system-features from the Nix conf in the default machine file
Fix #936
2021-04-28 11:43:04 +02:00