Bug#848859: FTBFS randomly (failing tests)

Wed Jan 4 19:57:14 UTC 2017

On Wed, Jan 04, 2017 at 07:58:43PM +0100, Anton Gladky wrote:
> 2017-01-04 13:26 GMT+01:00 Santiago Vila <sanvila at unex.es>:
> > No matter how much glitch-free is the autobuilder you use to build the
> > above package, it will fail to build 1 every 147 times on average,
> > mathematically, because the test is wrongly designed.
> 
> That is not always true. If you look in many tests from numerical
> simulation packages, there is usually a "threshold" for test result
> which should not be exceeded. And the test result varies in
> the limits, which are set by upstream authors. This result
> can be different even on the same machine, running the simulation
> several time. And it is normal.
> [...]

I know what you mean. I've seen it several times in statistical packages.

In my opinion, upstream authors may do as they wish, but Debian aims
at having reproducible builds (at some point in the future). Reproducible
builds mean that each time you build the package, the same .deb is created.

This is of course not policy yet, but I see it as fundamentally
incompatible with packages which FTBFS from time to time. If we want
the end result to be always the same, then failing from time to time
(which is sometimes the end result) should never happen.

In other words, if we don't take deterministic builds seriously
(as in "every time I try to build the package, the build does not fail")
how can we expect to be reproducible in the future?

What you mention about thresholds, statistical packages, and
simulations is interesting, however, so here is the math as I
would apply it to Debian:

Let's say that we have 25000 source packages in stretch, and
I want to build all of them and not have a single failure.

Since, as you point out, there are quite a few statistical
packages with tests based on random numbers, the mathematical
probability that there is some failure will always be > 0.

Ok, then. Let's suppose that I'm happy enough if the expected number
of packages that fail to build is closer to 0 than to 1.

So, let's make the probability of failure for each package half
of 1/25000. That would be 0.002%, i.e. an expected 0.5 failures
per full rebuild.

Not realistic enough? Let's assume additionally that 24950 source
packages build ok all the time, and only 50 of them have a probability
of failure > 0.

I still want to build all packages and have the expected number of
failures be closer to 0 than to 1, so in this case the probability
should be 1/50/2, i.e. 1%.

I think this is still feasible.
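
For anyone who wants to redo the arithmetic above with other numbers,
here is a minimal sketch. The function names are mine, and it assumes
that all the flaky packages share the same failure probability and
that they fail independently of each other, which real packages need
not do.

```python
def max_failure_probability(flaky_packages: int,
                            expected_failures: float = 0.5) -> float:
    """Per-package failure probability that keeps the expected number
    of failed builds in one full rebuild at `expected_failures`.

    Assumption: every flaky package has the same failure probability.
    """
    if flaky_packages <= 0:
        raise ValueError("flaky_packages must be positive")
    return expected_failures / flaky_packages


def prob_at_least_one_failure(flaky_packages: int, p: float) -> float:
    """Probability that a full rebuild sees one or more failures.

    Assumption: failures are independent between packages.
    """
    return 1.0 - (1.0 - p) ** flaky_packages


if __name__ == "__main__":
    # 25000: every package in the archive may fail.
    # 50: only the packages with randomized tests may fail.
    for n in (25000, 50):
        p = max_failure_probability(n)
        at_least_one = prob_at_least_one_failure(n, p)
        print(f"{n} flaky packages: p <= {p:.4%}, "
              f"P(at least one failure) = {at_least_one:.1%}")
```

Under those assumptions, an expected 0.5 failures still means that
roughly 39% of full rebuilds would see at least one failure, in both
scenarios.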

> The "fix" for such cases is the increasing of the threshold or disabling
> the test completely. Because you can do nothing with it due to the
> nature of numerical simulations.

I really wish it would always be as simple as that.

But as long as there are people in this project who consider that a
package which FTBFS on single-CPU machines more than 50% of the time
is ok "because it does not usually fail in buildd.debian.org",
we are doomed.

See the FTBFS-randomly bugs open against rygel, libsecret
or libsoup2.4, for example.

> > Really, we need more people doing QA, and not stop doing it "because
> > we are near the freeze".
> 
> If you are maintaining the package several years, fixing most of its
> bugs, hoping to see it in release and trying to escape major changes
> several months before the freeze.. Sure, it will actively be defended
> from maintainers if some pseudo-reasons for its removal appear just
> before the freeze. This fact has to be considered as well.

Well, you will see that I've downgraded all the bugs of this type to
important (btw: please do not call this "pseudo-reasons").

The problem I see with this threshold thing is that every maintainer
seems to have their own threshold, different from the others.

If we decide RC-ness based on the probability of failure:
What threshold do you think we should use for a single package, and why?

[ I say that a 1% probability of failure is the maximum we should allow,
  and I've explained why, but I would love to hear your opinion on this ].

Thanks a lot.