It has a performance cost, and as the comment says, we should implement
the better solution instead. But we want to land this preparatory change
on prod while the rest is still on staging, so we should just skip it
for now.
Skipping it will not affect regular fixed-output and input-addressed
derivations, which are the only ones prod would deal with upon getting
this code.
The main CA derivations support branch will revert this commit so it
still works.
These are C++ changes only, with no Perl / Frontend / SQL Schema
changes.
The idea is that it should be possible to redeploy Hydra with these
changes with (a) no schema migration and (b) no regressions. We should
be able to much more safely deploy these to a staging server and then
to production `hydra.nixos.org`.
Extracted from #875
Co-Authored-By: Théophane Hufschmitt <theophane.hufschmitt@tweag.io>
Co-Authored-By: Alexander Sosedkin <monk@unboiled.info>
Co-Authored-By: Andrea Ciceri <andrea.ciceri@autistici.org>
Co-Authored-By: Charlotte 🦝 Delenk <lotte@chir.rs>
Co-Authored-By: Sandro Jäckel <sandro.jaeckel@gmail.com>
We were using protocol version 6 but requesting version 4. The only
reason this worked was a broken version check in
'nix-store --serve'. That was fixed in
c2d7456926,
which had the side-effect of breaking hydra-queue-runner.
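A hedged sketch of the intended handshake, assuming the
'nix-store --serve' wire format of magic numbers followed by a version
exchange (the constants mirror Nix's serve-protocol.hh, but the I/O
helpers here are simplified stand-ins):

    #include <algorithm>
    #include <cstdint>
    #include <iostream>
    #include <stdexcept>

    // Illustrative constants; the real values live in serve-protocol.hh.
    constexpr uint64_t SERVE_MAGIC_1 = 0x390c9deb;
    constexpr uint64_t SERVE_MAGIC_2 = 0x5452eecb;
    constexpr uint64_t OUR_VERSION   = 0x206; // protocol version 6

    uint64_t handshake(std::ostream & to, std::istream & from)
    {
        auto writeInt = [&](uint64_t n) { to.write(reinterpret_cast<char *>(&n), sizeof n); };
        auto readInt  = [&]() { uint64_t n; from.read(reinterpret_cast<char *>(&n), sizeof n); return n; };

        // Advertise the version we actually implement, rather than
        // hardcoding an older one that the peer may now reject.
        writeInt(SERVE_MAGIC_1);
        writeInt(OUR_VERSION);
        to.flush();

        if (readInt() != SERVE_MAGIC_2)
            throw std::runtime_error("protocol mismatch");
        uint64_t remoteVersion = readInt();

        // Both sides proceed with the lower of the two versions.
        return std::min(OUR_VERSION, remoteVersion);
    }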
On hydra.nixos.org the queue runner had child processes that were
stuck handling an exception:
Thread 1 (Thread 0x7f501f7fe640 (LWP 1413473) "bld~v54h5zkhmb3"):
#0 futex_wait (private=0, expected=2, futex_word=0x7f50c27969b0 <_rtld_local+2480>) at ../sysdeps/nptl/futex-internal.h:146
#1 __lll_lock_wait (futex=0x7f50c27969b0 <_rtld_local+2480>, private=0) at lowlevellock.c:52
#2 0x00007f50c21eaee4 in __GI___pthread_mutex_lock (mutex=0x7f50c27969b0 <_rtld_local+2480>) at ../nptl/pthread_mutex_lock.c:115
#3 0x00007f50c1854bef in __GI___dl_iterate_phdr (callback=0x7f50c190c020 <_Unwind_IteratePhdrCallback>, data=0x7f501f7fb040) at dl-iteratephdr.c:40
#4 0x00007f50c190d2d1 in _Unwind_Find_FDE () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#5 0x00007f50c19099b3 in uw_frame_state_for () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#6 0x00007f50c190ab90 in uw_init_context_1 () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#7 0x00007f50c190b08e in _Unwind_RaiseException () from /nix/store/65hafbsx91127farbmyyv4r5ifgjdg43-glibc-2.33-117/lib/libgcc_s.so.1
#8 0x00007f50c1b02ab7 in __cxa_throw () from /nix/store/dd8swlwhpdhn6bv219562vyxhi8278hs-gcc-10.3.0-lib/lib/libstdc++.so.6
#9 0x00007f50c1d01abe in nix::parseURL (url="root@cb893012.packethost.net") at src/libutil/url.cc:53
#10 0x0000000000484f55 in extraStoreArgs (machine="root@cb893012.packethost.net") at build-remote.cc:35
#11 operator() (__closure=0x7f4fe9fe0420) at build-remote.cc:79
...
Maybe the fork happened while another thread was holding some global
stack unwinding lock
(https://gcc.gnu.org/bugzilla/show_bug.cgi?id=71744). Anyway, since
the hanging child inherits all file descriptors to SSH clients,
shutting down remote builds (via 'child.to = -1' in
State::buildRemote()) doesn't work and 'child.pid.wait()' hangs
forever.
So let's not do any significant work between fork and exec.
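A minimal sketch of the pattern, with illustrative names rather than
the actual Hydra code: do everything that may allocate or throw in the
parent, before fork(), so the child only makes async-signal-safe calls
on its way to exec():

    #include <unistd.h>
    #include <string>
    #include <vector>

    pid_t spawnSsh(const std::vector<std::string> & args)
    {
        // Anything that may allocate or throw (e.g. parseURL) happens
        // here, in the parent, before fork().
        std::vector<char *> argv;
        for (auto & a : args)
            argv.push_back(const_cast<char *>(a.c_str()));
        argv.push_back(nullptr);

        pid_t pid = fork();
        if (pid == 0) {
            // Child: only async-signal-safe calls. No exceptions, no
            // stack unwinding, nothing that could touch a lock some
            // other thread was holding at fork time.
            execvp(argv[0], argv.data());
            _exit(1); // exec failed; _exit skips atexit handlers
        }
        return pid;
    }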
This is syntactically lighter weight, and demonstrates that there are
no weird dynamic lifetimes involved: we just pass a regular reference
to the callee, which borrows it only for the duration of the call.
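A tiny illustration of the contract (the types and names are
hypothetical, not the actual Hydra ones):

    #include <memory>

    struct Step { int id = 0; };

    // Passing a smart pointer suggests the callee might take shared
    // ownership and extend the object's lifetime.
    void processOwned(std::shared_ptr<Step> step) { /* ... */ }

    // A plain reference states the intent exactly: the callee borrows
    // 'step' for the duration of the call and no longer.
    void processBorrowed(Step & step) { /* ... */ }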
In a Hydra instance I saw:
possibly transient failure building ‘/nix/store/X.drv’ on ‘localhost’:
dependency '/nix/store/Y' of '/nix/store/Y.drv' does not exist,
and substitution is disabled
This is confusing because the Hydra in question does have substitution enabled.
This instance uses:
keep-outputs = true
keep-derivations = true
and an S3 binary cache which is not configured as a substituter in the nix.conf.
It appears this instance encountered a situation where store path Y was
built and present in the binary cache, and Y.drv was GC rooted on the
instance; however, Y itself was not present on the host.
When Hydra would try to build this path locally, it would look in the binary
cache to see if it was cached:
(nix)
439 bool valid = isValidPathUncached(storePath);
440
441 if (diskCache && !valid)
442 // FIXME: handle valid = true case.
443 diskCache->upsertNarInfo(getUri(), hashPart, 0);
444
445 return valid;
Since it was cached, the store path was considered Valid.
The queue monitor would then not put this input in for substitution, because
the path is valid:
(hydra)
470 if (!destStore->isValidPath(*i.second.path(*localStore, step->drv->name, i.first))) {
471 valid = false;
472 missing.insert_or_assign(i.first, i.second);
473 }
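Tying the two excerpts together, a condensed sketch of the failure mode
(illustrative stubs, not the real Store classes): the cache-backed
validity check and the queue monitor's check compose into "never
substitute a path the binary cache has seen", which breaks when that
path is absent locally:

    #include <set>
    #include <string>

    // Stand-in for the binary-cache store's cached validity check.
    struct BinaryCacheStore {
        std::set<std::string> narInfoCache; // paths the disk cache has seen
        bool isValidPath(const std::string & path) {
            return narInfoCache.count(path) > 0; // answers from the cache
        }
    };

    // Stand-in for the queue monitor's decision (cf. the Hydra excerpt
    // above): a path the cache knows is never marked missing, so it is
    // never substituted, even though the local store lacks it.
    bool needsSubstitution(BinaryCacheStore & destStore, const std::string & path) {
        return !destStore.isValidPath(path);
    }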
Hydra already appears to correctly handle the case of missing paths
that need to be substituted from the binary cache, but since most
Hydra instances use `keep-outputs` *and* all paths in the binary cache
originate from that machine, it is not common for a path to be cached
but not GC rooted locally.
I'll run Hydra with this patch for a while and see if we run into the
problem again.
A big thanks to John Ericson who helped debug this particular issue.
Given a builder like this in `/etc/nix/machines`:
ssh://mfbuild?remote-store=/home/bosch/store
Hydra cannot build there, since it tries to pass the entire value to
`ssh(1)`, which doesn't work. Such an alternate store location is used,
for example, when the user isn't a trusted user on the remote system
and thus cannot use `/nix/store`.
If such a URI is given, Hydra will now add `--store /home/bosch/store`
to the `ssh` command to select the appropriate store location remotely.
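A hedged, self-contained sketch of the idea (simplified: the real code
also has to handle the URI scheme and shell escaping; the helper name
echoes `extraStoreArgs` from the backtrace above):

    #include <string>
    #include <vector>

    // If the machine URI carries a 'remote-store' query parameter,
    // strip it from the host name and forward it as '--store <path>'
    // on the remote command line instead of handing it to ssh(1).
    std::vector<std::string> extraStoreArgs(std::string & machine)
    {
        std::vector<std::string> args;
        auto q = machine.find('?');
        if (q == std::string::npos) return args;

        std::string query = machine.substr(q + 1);
        machine = machine.substr(0, q); // e.g. "ssh://mfbuild"

        const std::string key = "remote-store=";
        if (query.compare(0, key.size(), key) == 0)
            args = {"--store", query.substr(key.size())};

        return args;
    }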
In Nix the protocol was slightly altered[1] to also contain more
information about realisations. However, this extra information was
never read from the pipe that was used to read from the store.
After the `cmdBuildDerivation` command which caused this issue, Hydra
will issue a `cmdQueryPathInfos` that tries to read from the remote
store as well. However, there's still data left over from the previous
command, and thus Nix fails to properly allocate the expected string.
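A hedged sketch of the shape of the problem and fix (the framing and
field layout are assumptions, not the actual Nix wire format): every
version-gated field has to be consumed from the pipe, even when unused,
so that the next command starts at a clean message boundary:

    #include <cstdint>
    #include <istream>

    uint64_t readInt(std::istream & from)
    {
        uint64_t n;
        from.read(reinterpret_cast<char *>(&n), sizeof n);
        return n;
    }

    void readBuildResult(std::istream & from, unsigned remoteVersion)
    {
        (void) readInt(from); // status (assumed layout)
        (void) readInt(from); // timesBuilt (assumed layout)

        if (remoteVersion >= 6) {
            // Newer Nix also sends realisation info here. It must be
            // drained even if Hydra ignores it; otherwise these bytes
            // get parsed as the reply to the next cmdQueryPathInfos.
            uint64_t n = readInt(from);  // assumed: number of entries
            for (uint64_t i = 0; i < n; ++i)
                (void) readInt(from);    // placeholder per-entry read
        }
    }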
[1] See rev a2b69660a9b326b95d48bd222993c5225bbd5b5f
Fixes #898