We've seen many failures on ofborg; a lot of them ultimately appear to come down to
a timeout being hit, resulting in something like this:

```
Failure executing slapadd -F /<path>/slap.d -b dc=example -l /<path>/load.ldif.
```

Hopefully this resolves it for most cases. I've done some endurance testing and
this helps a lot.
Some other commands also regularly time out under high load:
- hydra-init
- hydra-create-user
- nix-store --delete
This should address most issues with tests randomly failing.
I used the following script for endurance testing:
```python
import os
import subprocess

run_counter = 0
fail_counter = 0
while True:
    try:
        run_counter += 1
        print(f"Starting run {run_counter}")
        # Copy the environment so we don't mutate os.environ in place
        env = os.environ.copy()
        env["YATH_JOB_COUNT"] = "20"
        result = subprocess.run(["perl", "t/test.pl"], env=env)
        if result.returncode != 0:
            fail_counter += 1
        print(f"Finished run {run_counter}, total fail count: {fail_counter}")
    except KeyboardInterrupt:
        print(f"Finished {run_counter} runs with {fail_counter} fails")
        break
```
In case someone else wants to run it on their system :).
Note that YATH_JOB_COUNT may need to be adjusted loosely based on your
core count. I only have 4 cores (8 threads), so on bigger machines higher
numbers might be more effective at shaking out unstable tests.
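As a starting point for picking that value, one could oversubscribe the logical CPU count; the `suggested_job_count` helper and its factor below are my own assumptions, not part of the test harness:

```python
import os

def suggested_job_count(factor=2.5, minimum=8):
    """Hypothetical heuristic: oversubscribe logical CPUs to provoke the
    load-dependent timeouts and races the endurance script hunts for."""
    cpus = os.cpu_count() or 1
    return max(minimum, int(cpus * factor))

print(suggested_job_count())
```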
56 lines · 1.9 KiB · Perl
```perl
use strict;
use warnings;
use Setup;
use Data::Dumper;
use Test2::V0;
use Hydra::Helper::Exec;

my $ctx = test_context(
    use_external_destination_store => 1
);

require Hydra::Helper::Nix;

# This test is regarding https://github.com/NixOS/hydra/pull/1126
#
# A hydra instance was regularly failing to build derivations with:
#
#     possibly transient failure building ‘/nix/store/X.drv’ on ‘localhost’:
#     dependency '/nix/store/Y' of '/nix/store/Y.drv' does not exist,
#     and substitution is disabled
#
# However it would only fail when building on localhost, and it would only
# fail if the build output was already in the binary cache.
#
# This test replicates this scenario by having two jobs, underlyingJob and
# dependentJob. dependentJob depends on underlyingJob. We first build
# underlyingJob and copy it to an external cache. Then forcefully delete
# the output of underlyingJob, and build dependentJob. In order to pass
# it must either rebuild underlyingJob or fetch it from the cache.

subtest "Building, caching, and then garbage collecting the underlying job" => sub {
    my $builds = $ctx->makeAndEvaluateJobset(
        expression => "dependencies/underlyingOnly.nix",
        build => 1
    );

    my $path = $builds->{"underlyingJob"}->buildoutputs->find({ name => "out" })->path;

    ok(unlink(Hydra::Helper::Nix::gcRootFor($path)), "Unlinking the GC root for underlying Dependency succeeds");

    (my $ret, my $stdout, my $stderr) = captureStdoutStderr(15, "nix-store", "--delete", $path);
    is($ret, 0, "Deleting the underlying dependency should succeed");
};

subtest "Building the dependent job should now succeed, even though we're missing a local dependency" => sub {
    my $builds = $ctx->makeAndEvaluateJobset(
        expression => "dependencies/dependentOnly.nix"
    );

    ok(runBuild($builds->{"dependentJob"}), "building the job should succeed");
};

done_testing;
```