We should look into how to resolve this, but I tried a few things and nothing really worked.
Let's mark it as skipped for now until someone comes along to improve it.
Only log issues/failures when something's actually up.
It has irked me for a long time that running the tests produced so much
output; this change seems to silence it.
It does hide some warnings, but I think it makes the output
so much more readable that it's worth the tradeoff.
This helps when running jobs highly in parallel; sometimes they would not
produce any output for a while, and setting this timeout higher appears to help.
Not completely sure if this is the right place to do it, but it works fine for me.
We've seen many failures on ofborg; a lot of them ultimately appear to come down to
a timeout being hit, resulting in something like this:
Failure executing slapadd -F /<path>/slap.d -b dc=example -l /<path>/load.ldif.
Hopefully this resolves it for most cases.
I've done some endurance testing and this helps a lot.
Some other commands also regularly time out under high load:
- hydra-init
- hydra-create-user
- nix-store --delete
This should address most issues with tests randomly failing.
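For illustration, the idea is simply to give these commands a much more generous timeout; here is a minimal Python sketch of that pattern (`run_with_timeout` is a made-up name for this sketch, not anything from the test suite):
```
import subprocess

def run_with_timeout(cmd, timeout=600):
    # On an idle machine these commands return quickly; under heavy
    # parallel test load they can legitimately take minutes, so allow
    # a generous timeout instead of failing the test run early.
    return subprocess.run(cmd, check=True, timeout=timeout)

# e.g. run_with_timeout(["nix-store", "--delete", "/nix/store/<some-path>"])
```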
Used the following script for endurance testing:
```
import os
import subprocess

run_counter = 0
fail_counter = 0
while True:
    try:
        run_counter += 1
        print(f"Starting run {run_counter}")
        # Run the test suite with a high job count to provoke races and timeouts.
        env = os.environ
        env["YATH_JOB_COUNT"] = "20"
        result = subprocess.run(["perl", "t/test.pl"], env=env)
        if result.returncode != 0:
            fail_counter += 1
        print(f"Finished run {run_counter}, total fail count: {fail_counter}")
    except KeyboardInterrupt:
        print(f"Finished {run_counter} runs with {fail_counter} fails")
        break
```
In case someone else wants to do it on their system :).
Note that YATH_JOB_COUNT may need to be adjusted based on your core count.
I only have 4 cores (8 threads), so for others higher numbers might
yield better results in flushing out unstable tests.
Nixpkgs only contains a `hydra_unstable` package, not `hydra`, so
adjust the default accordingly, and then override it to our package in
the separate module that does that.
This fixes:
> Caught exception in Hydra::Controller::Root->realisations "Undefined subroutine &Hydra::Controller::Root::queryRawRealisation called at /nix/store/v842xb35ph8ka1yi1xanjhk4xh1pn5nm-hydra-2024-04-22/libexec/hydra/lib/Hydra/Controller/Root.pm line 371."
When content-addressed derivations are built on the Hydra server,
one may run into an issue where some builds suddenly don't load anymore.
This seems to be caused by outPaths that are NULL (which is
allowed for ca-derivations). Filter them out to prevent querying the
database for them, which is not supported by the database abstraction
layer that's currently in use.
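Conceptually the fix is tiny; here's a hedged sketch of the idea in Python against a throwaway SQLite table (the real code is Hydra's Perl, and none of the names below come from it):
```
import sqlite3

# Sketch only: drop NULL outPaths (allowed for ca-derivations) before
# they end up inside an "outPath IN (...)" clause that the DB
# abstraction layer refuses to translate.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE Builds (id INTEGER PRIMARY KEY, outPath TEXT)")
db.executemany("INSERT INTO Builds (outPath) VALUES (?)",
               [("/nix/store/aaa-foo",), (None,)])

out_paths = ["/nix/store/aaa-foo", None]           # None ~ NULL outPath
present = [p for p in out_paths if p is not None]  # the actual filtering

placeholders = ", ".join("?" for _ in present)
rows = db.execute(
    f"SELECT id, outPath FROM Builds WHERE outPath IN ({placeholders})",
    present,
).fetchall()
print(rows)  # only the build with a real outPath comes back
```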
On my instance this appears to resolve the issue.
I feel like I might be doing this at the wrong abstraction layer, but on
the other hand, it seems to resolve the issue and it doesn't really look
like it will hurt anything.
The test added in a previous commit uncovers this issue, and this commit
resolves it, so I'm happy with this patch for now.
The issue I was seeing on my server:
hydra-server[2549]: [error] Couldn't render template "undef error - DBIx::Class::SQLMaker::ClassicExtensions::puke(): Fatal: NULL-within-IN not implemented: The upcoming SQL::Abstract::Classic 2.0 will emit the logically correct SQL instead of raising this exception. at /nix/store/<hash>-hydra-unstable-2024-03-08_nix_2_20/libexec/hydra/lib/Hydra/Helper/Nix.pm line 190
See also short discussion here: https://github.com/NixOS/nixpkgs/pull/297392#issuecomment-2035366263
Closes #1336
When restarting PostgreSQL, the connections are still reused in
`hydra-queue-runner`, causing errors like this:
main thread: Lost connection to the database server.
queue monitor: Lost connection to the database server.
and no more builds being processed.
`hydra-evaluator` doesn't have that issue since it crashes right away.
We could let it retry indefinitely as well (see below), but I don't
want to change too much.
If the DB is still unreachable 10s later, the process will stop with a
non-zero exit code because of a missing DB connection. This however
isn't such a big deal because it will be immediately restarted
afterwards. With the current configuration, Hydra will never give up,
but will restart (and retry) indefinitely. That seems reasonable to me,
i.e. retrying DB connections in a long-running process. If this doesn't
work out, the monitoring should fire anyway because the queue fills up,
but I'm open to discussing that.
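To make that behaviour concrete, here is a rough Python sketch (not Hydra's actual C++ code, just the shape of what's described above): wait a bit, retry once, and otherwise exit non-zero and let the service manager restart the process, which amounts to retrying forever.
```
import sys
import time

def get_db_connection(connect, retry_delay=10):
    # Sketch of the described behaviour: one retry after a short delay,
    # then exit non-zero and rely on the unit's restart policy, so the
    # process as a whole keeps retrying indefinitely.
    try:
        return connect()
    except ConnectionError:
        time.sleep(retry_delay)
    try:
        return connect()
    except ConnectionError:
        sys.exit(1)
```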
Please note that this isn't reproducible with the DB and the queue
runner on the same machine when using `services.hydra-dev`, because of
the `Requires=` dependency `hydra-queue-runner.service` ->
`hydra-init.service` -> `postgresql.service` that causes the queue
runner to be restarted on `systemctl restart postgresql`.
Internally, Hydra uses Nix's pool data structure: it basically has N
slots (here, DB connections), and whenever a new one is requested, an idle
slot is provided or a new one is created (when all N slots are active, it
waits until one becomes free). The issue in the code here is that whenever
an error is encountered the slot is released, but the same broken
connection will be reused the next time. By using `Pool::Handle::markBad`,
Nix will drop a broken slot. This is now done when a
`pqxx::broken_connection` is caught.
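For anyone unfamiliar with Nix's `Pool`, here is a rough, hedged Python model of the behaviour described above (not the actual C++ API, just its shape):
```
import queue

class Pool:
    """Toy model of Nix's Pool: hand out an idle slot if there is one,
    create a new one otherwise, and (the fix) discard slots that were
    marked bad instead of handing the broken connection out again."""

    def __init__(self, create):
        self._create = create
        self._idle = queue.Queue()

    def get(self):
        try:
            return self._idle.get_nowait()
        except queue.Empty:
            return self._create()

    def put(self, slot, bad=False):
        # Before the fix, a connection that had just thrown
        # pqxx::broken_connection went back into the idle queue and was
        # reused; marking it bad simply drops it here.
        if not bad:
            self._idle.put(slot)
```
(The real pool also caps the number of live slots and blocks when all of them are busy; that part is omitted here.)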