Bug#986527: Patches for flaky build and cython unavailability

Tue Aug 3 12:12:21 BST 2021

Thanks a lot for the patches, Ahzo. Fixing the file handle leak in particular should help a lot.

I guess it's too late for bullseye now, but I can at least upload a fixed package to experimental. I'll also try to fix many of the failing tests by including sage's (large) patch to support pari 2.13, which was finished in June [1]. I'll have to see whether I can backport that to sage 9.2 or whether I should update to sage 9.4 right away.

Best,
Tobias

[1] https://trac.sagemath.org/ticket/30801

On 7/31/21 8:47 PM, Ahzo wrote:
> Control: tags -1 patch
>
> Hi,
>
> the main problem making the sagemath testsuite flaky is that it randomly aborts due to 'Too many open files'.
> Thus only a small part of the test suite actually gets run when the build is heavily parallelized.
> This can be seen by reporting not only the number of failed tests but also the number of tests run, which fluctuates significantly.
>
> The problem occurs because every worker that has finished but has not yet been logged holds an open fd (a pipe used to read the output of the child actually doing the tests).
> Thus, while following a long-running worker, i.e. logging its messages while it is still running, so many finished tests can accumulate that the open files limit (ulimit -n) is reached.
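
The accumulation described above can be reproduced in isolation. The following is only a minimal sketch and not sage's forker code: it assumes a system that exposes /dev/fd (as Linux does), and the helper names open_fd_count and fd_growth are made up for illustration. Each simulated finished worker leaves one pipe read end open until it is "logged".

```python
import os
import resource

def open_fd_count():
    """Number of fds open in this process (relies on /dev/fd, as on Linux)."""
    # Listing the directory briefly opens one fd itself; that offset is the
    # same on every call, so differences between two counts are unaffected.
    return len(os.listdir("/dev/fd"))

def fd_growth(n_workers, close_read_end):
    """Open one pipe per simulated finished worker; return the fd increase."""
    before = open_fd_count()
    held = []
    for _ in range(n_workers):
        r, w = os.pipe()
        os.close(w)          # the child side is finished
        if close_read_end:
            os.close(r)      # parent releases its end right away
        else:
            held.append(r)   # parent keeps the read end until "logged"
    growth = open_fd_count() - before
    for fd in held:          # clean up so the sketch itself does not leak
        os.close(fd)
    return growth

if __name__ == "__main__":
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft open-files limit (ulimit -n):", soft)
    print("growth with read ends held:  ", fd_growth(100, close_read_end=False))
    print("growth with read ends closed:", fd_growth(100, close_read_end=True))
```

With the read ends held, the count grows by one per finished worker, so a sufficiently long wait on a single running worker would eventually run into the soft limit.
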
>
> However, there should be no open pipe per finished worker, as the test suite calls 'os.close(self.rmessages)' before waiting to log the messages.
> So this seems to be caused by something that python does behind the scenes.
> Removing the single line 'finished.append(w)' in src/sage/doctest/forker.py prevents the open fd increase, though at the cost of hardly logging any test output.
>
> This problem can be avoided by simply logging every finished test, but never a running one.
>
> With only the 0001-Report-the-number-of-total-tests-run.patch, the result is something like:
> Success: 5 of 71435 tests failed, up to 200 failures are tolerated
>
> Adding the dt-Do-not-follow-a-running-worker.patch, the result becomes:
> Success: 194 of 361139 tests failed, up to 200 failures are tolerated
>
> These 194 failures are pretty close to the threshold of 200, so it is not particularly surprising that this can fail in some environments.
> Slightly exceeding this threshold triggered the build failure in this bug and also the one in bug #983931.
>
> Increasing the threshold to 300 should make that rather unlikely, though.
> And considering that there are more than 360 thousand tests, fewer than 300 failures means more than 99.9 % of the tests succeeded.
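
The arithmetic behind these figures can be checked quickly. This is just a sketch using the numbers quoted in this mail; the function names pass_rate and coverage are illustrative and not part of any sage tooling.

```python
def pass_rate(failed, run):
    """Percentage of run tests that succeeded."""
    return 100.0 * (run - failed) / run

def coverage(run, total):
    """Percentage of the full suite that was actually run."""
    return 100.0 * run / total

TOTAL_TESTS = 361139      # tests run once the fd leak is avoided
TRUNCATED_RUN = 71435     # tests run in the build that aborted early

if __name__ == "__main__":
    for failures in (194, 200, 300):
        print(f"{failures} failures: {pass_rate(failures, TOTAL_TESTS):.3f} % passed")
    print(f"truncated run covered {coverage(TRUNCATED_RUN, TOTAL_TESTS):.1f} % of the suite")
```
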
>
> The "cython: not found" issue is trivial to fix and important, because otherwise 'sage --cython' does not work and there is no '--cython3' option (unlike e.g. the '--ipython3' option).
>
> After adding the 0002-Tolerate-up-to-300-failing-tests.patch and the u2-Adapt-to-python2-removal.patch the test result is:
> Success: 189 of 361139 tests failed, up to 300 failures are tolerated
>
> It would also be a good idea to include a backport of commit 5cf493ca51 ("Avoid libgmp's new lazy allocation") in the next sagemath upload, as that fixes a severe memory leak (see bug #964848).
>
> As to the crashes, I can't reproduce any crash when testing interfaces/singular.py:
> sage -t --long --random-seed=0 src/sage/interfaces/singular.py
>     [404 tests, 3.87 s]
>
> This crash does not always happen in the reproducible builds either, e.g. the following log shows this test first crashing and then passing:
> https://tests.reproducible-builds.org/debian/rbuild/bullseye/amd64/sagemath_9.2-2.rbuild.log.gz
> [...]
> sage -t --long --random-seed=0 src/sage/interfaces/singular.py
>     Killed due to segmentation fault
> [...]
> sage -t --long --random-seed=0 src/sage/interfaces/singular.py
>     [404 tests, 21.06 s]
> [...]
>
> However, a number of other crashes happen during every test run, but only one of them causes a test failure:
> sage -t --long --random-seed=0 src/sage/interfaces/tests.py
> **********************************************************************
> File "src/sage/interfaces/tests.py", line 34, in sage.interfaces.tests
> Failed example:
>     subprocess.call("echo syntax error | ecl", **kwds) in (0, 255)
> Expected:
>     True
> Got:
>     False
> **********************************************************************
>
> Similar crashes sometimes also occur when testing interfaces/lisp.py, but without causing the test to fail.
> This is a problem in ecl, which crashes when both stdout and stderr are full, see bug #710953.
>
> Then there is a crash in nauty-gentourng triggered by src/sage/graphs/digraph_generators.py.
> For details see bug #991750.
>
> There are also two SIGABRT crashes in mwrank triggered by src/sage/interfaces/mwrank.py.
> These seem to be intentional due to invalid input.
>
> Finally, there are some python crashes (5 SIGQUIT, 1 SIGABRT, 1 SIGSEGV) that are all caused intentionally by the test suite.
>
> So none of these crashes are problems in sagemath itself.
>
> Regards,
> Ahzo