[Debian-med-packaging] Bug#918047: bali-phy FTBFS building for armhf on arm64

Thu Jan 3 19:09:55 GMT 2019

On 1/2/19 11:21 AM, Steve McIntyre wrote:
> Package: src:bali-phy
> Version: 3.4+dfsg-1
> Severity: important
>
> Hi!
>
> I've been doing a full rebuild of the Debian archive, building all
> source packages targeting armel and armhf using arm64 hardware. We are
> planning in future to move all of our 32-bit armel/armhf builds to
> using arm64 machines, so this rebuild is to identify packages that
> might have problems with this configuration.
>
> I've tried to build bali-phy for armhf on top of arm64, and it's
> failing a test at the end of the build:
>
> ...
>     dh_auto_test -O--buildsystem=meson
>          cd obj-arm-linux-gnueabihf && LC_ALL=C.UTF-8 MESON_TESTTHREADS=8 ninja test
> [1/4] Generating git-version.h with a custom command.
> [1/2] Running all tests.
>   1/30 bali-phy version                        OK       0.04 s
>   2/30 bali-phy help                           OK       0.04 s
>   3/30 bali-phy 5d test                        OK       3.61 s
>   4/30 bali-phy 5d +A 50                       OK      17.14 s
>   5/30 bali-phy 5d -A 200                      FAIL     7.87 s (exit status 1)
>   6/30 model_P --help                          OK       0.04 s
>   7/30 statreport --help                       OK       0.03 s
>   8/30 stats-select --help                     OK       0.04 s
>   9/30 alignment-gild --help                   OK       0.01 s
> 10/30 alignment-consensus --help              OK       0.01 s
> 11/30 alignment-max --help                    OK       0.01 s
> 12/30 alignment-chop-internal --help          OK       0.03 s
> 13/30 alignment-indices --help                OK       0.01 s
> 14/30 alignment-info --help                   OK       0.01 s
> 15/30 alignment-cat --help                    OK       0.01 s
> 16/30 alignment-translate --help              OK       0.02 s
> 17/30 alignment-find --help                   OK       0.02 s
> 18/30 trees-consensus --help                  OK       0.02 s
> 19/30 tree-mean-lengths --help                OK       0.01 s
> 20/30 mctree-mean-lengths --help              OK       0.01 s
> 21/30 trees-to-SRQ --help                     OK       0.01 s
> 22/30 pickout --help                          OK       0.01 s
> 23/30 cut-range --help                        OK       0.01 s
> 24/30 trees-distances --help                  OK       0.01 s
> 25/30 alignment-thin --help                   OK       0.01 s
> 26/30 alignments-diff --help                  OK       0.01 s
> 27/30 tree-tool --help                        OK       0.01 s
> 28/30 alignment-distances --help              OK       0.01 s
> 29/30 subsample --help                        OK       0.02 s
> 30/30 bali-phy testsuite                      OK      272.10 s
>
> Ok:                   29
> Expected Fail:         0
> Fail:                  1
> Unexpected Pass:       0
> Skipped:               0
> Timeout:               0
>
>
> The output from the failed tests:
>
>   5/30 bali-phy 5d -A 200                      FAIL     7.87 s (exit status 1)
>
> --- command ---
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/../examples/sequences/5S-rRNA/5d.fasta --iter=200 --package-path=/<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/builtins:/<<PKGBUILDDIR>> -Inone
> --- stdout ---
> T:topology ~ uniform on tree topologies
> T:lengths ~ iid[num_branches[T],gamma[0.5,div[2,num_branches[T]]]]
>
> Partition P1:
>      file = /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/../examples/sequences/5S-rRNA/5d.fasta
>      alphabet = DNA
>      subst = tn93 (S1)
>      indel = none
>      scale ~ gamma[0.5,2] (Scale1)
>
> Substitution model S1 priors:
>      tn93:kappaPur ~ log_normal[log[2],0.25]
>      tn93:kappaPyr ~ log_normal[log[2],0.25]
>      tn93:pi ~ dirichlet_on[letters[@a],1]
>
> Beginning pre-burnin: 3 iterations.
>   Start #1   prior = 2.85078   likelihood = -1171.56   |T| = 0.468531   |A| = 110   Scale1*|T| = 0.80032
>
>   Tree (S)ize #1   prior = 2.21294   likelihood = -1097.14   |T| = 0.448637   |A| = 110   Scale1*|T| = 1.30021
>   Tree (S)ize #2   prior = -13.3031   likelihood = -1001.31   |T| = 0.572728   |A| = 110   Scale1*|T| = 16.6362
>   Tree (S)ize #3   prior = -12.5843   likelihood = -999.563   |T| = 0.738133   |A| = 110   Scale1*|T| = 18.3298
>
>   (S)+Branch (L)engths #1   prior = -8.39757   likelihood = -984.367   |T| = 1.63073   |A| = 110   Scale1*|T| = 6.22256
>   (S)+Branch (L)engths #2   prior = -15.5928   likelihood = -943.676   |T| = 3.95652   |A| = 110   Scale1*|T| = 13.9505
>   (S)+Branch (L)engths #3   prior = -6.1693   likelihood = -933.719   |T| = 1.54154   |A| = 110   Scale1*|T| = 6.65305
>
>   (S)+(L)+(P)arameters #1   prior = -6.85447   likelihood = -820.042   |T| = 0.83291   |A| = 110   Scale1*|T| = 7.19549
>   (S)+(L)+(P)arameters #2   prior = -13.9795   likelihood = -804.932   |T| = 2.52913   |A| = 110   Scale1*|T| = 14.9393
>   (S)+(L)+(P)arameters #3   prior = -14.8455   likelihood = -804.004   |T| = 1.78642   |A| = 110   Scale1*|T| = 11.4421
>
>   (S)+(L)+(P)+NNI #1   prior = -14.7193   likelihood = -800.519   |T| = 1.80833   |A| = 110   Scale1*|T| = 11.1644
>   (S)+(L)+(P)+NNI #2   prior = -14.8403   likelihood = -794.947   |T| = 1.29684   |A| = 110   Scale1*|T| = 8.27748
>   (S)+(L)+(P)+NNI #3   prior = -18.6064   likelihood = -794.606   |T| = 1.68838   |A| = 110   Scale1*|T| = 6.26695
>   (S)+(L)+(P)+NNI #4   prior = -17.5519   likelihood = -796.853   |T| = 1.68336   |A| = 110   Scale1*|T| = 5.20335
>
>   SPR #1   prior = -17.4975   likelihood = -795.163   |T| = 1.47967   |A| = 110   Scale1*|T| = 4.62688
>
>   (S)+(L)+(P)+NNI #1   prior = -14.4695   likelihood = -797.104   |T| = 1.03049   |A| = 110   Scale1*|T| = 4.62992
>   (S)+(L)+(P)+NNI #2   prior = -17.9617   likelihood = -791.809   |T| = 2.04283   |A| = 110   Scale1*|T| = 8.75168
>   (S)+(L)+(P)+NNI #3   prior = -17.1506   likelihood = -788.311   |T| = 0.917332   |A| = 110   Scale1*|T| = 8.90935
>
> Finished pre-burnin in 0.26 seconds.
>
>
> BAli-Phy does NOT detect how many iterations is sufficient:
>     You need to monitor convergence and kill it when done.
>     Maximum number of iterations set to 200.
>
> Beginning MCMC computations.
>     - Future screen output sent to '5d-1/C1.out'
>     - Future debugging output sent to '5d-1/C1.err'
>     - Sampled trees logged to '5d-1/C1.trees'
>     - Sampled alignments logged to '5d-1/C1.P<partition>.fastas'
>     - Run info written to '5d-1/C1.run.json'
>     - Sampled numerical parameters logged to '5d-1/C1.log'
>
> You can examine 'C1.log' using BAli-Phy tool statreport (command-line) or the BEAST program Tracer (graphical).
>
> See the manual at http://www.bali-phy.org/README.xhtml for further information.
> --- stderr ---
> Home directory '/sbuild-nonexistent' does not exist!
> Partition #1: 126 columns -> 110 unique patterns.
> Created directory '5d-1/' for output files.
> bali-phy: Error!  move = sampler
>     submove = tree
>   move = tree
>     submove = topology
>   move = topology
>     submove = SPR
>   move = SPR
>     submove = SPR_all
>   [single] move = SPR_all
> bool sample_SPR_search_one(Parameters&, MCMC::MoveStats&, const tree_edge&, const spr_range&, bool)int choose_MH(int, const std::vector<T>&) [with F = log_double_t]:
> No option chosen! (current = 0)
> *log(Pr[0]) = nan
> log(Pr[1]) = nan
> log(Pr[2]) = nan
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN16choose_exceptionI12log_double_tEC2EiRKSt6vectorIS0_SaIS0_EE+0xccb) [0xc09a3c]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_Z9choose_MHI12log_double_tEiiRKSt6vectorIT_SaIS2_EE+0xa0d) [0xc3ecce]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_Z21sample_SPR_search_oneR10ParametersRN4MCMC9MoveStatsERK9tree_edgeRKSt3mapIS4_bSt4lessIS4_ESaISt4pairIS5_bEEEb+0x513) [0xc7ceb4]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_Z21sample_SPR_search_oneR10ParametersRN4MCMC9MoveStatsERK9tree_edgeb+0x3d) [0xc7dc86]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_Z14sample_SPR_allR9owned_ptrI5ModelERN4MCMC9MoveStatsE+0x7b) [0xc7e78c]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC10SingleMove7iterateER9owned_ptrI5ModelERNS_9MoveStatsEi+0x39) [0xc01c76]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC9MoveGroup7iterateER9owned_ptrI5ModelERNS_9MoveStatsEi+0x43) [0xc01d80]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC9MoveGroup7iterateER9owned_ptrI5ModelERNS_9MoveStatsEi+0x43) [0xc01d80]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC9MoveGroup7iterateER9owned_ptrI5ModelERNS_9MoveStatsEi+0x43) [0xc01d80]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC9MoveGroup7iterateER9owned_ptrI5ModelERNS_9MoveStatsEi+0x43) [0xc01d80]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC7Sampler2goER9owned_ptrI5ModelEiiRSo+0x2ab) [0xc03818]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_Z11do_samplingRKN5boost15program_options13variables_mapER9owned_ptrI5ModelElRSoRKSt6vectorISt8functionIFvRKS5_lEESaISE_EE+0xa2f) [0xcbfe68]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(main+0x1b31) [0xb99f5a]
> /lib/arm-linux-gnueabihf/libc.so.6(__libc_start_main+0x97) [0xf6cef4e4]
>
> -------
>
> In fact, checking I can see that the buildd arm-arm-01 (also an arm64
> host configured to build armhf) fails in the same way. See
>
>    https://buildd.debian.org/status/fetch.php?pkg=bali-phy&arch=armhf&ver=3.4%2Bdfsg-1&stamp=1544787942&raw=0
>
> for the build log. Oddly, I'm not seeing any similar problems
> building/testing for armel on top of arm64...
>
> -- System Information:
> Debian Release: 9.6
>    APT prefers stable-updates
>    APT policy: (500, 'stable-updates'), (500, 'stable-debug'), (500, 'stable')
> Architecture: amd64 (x86_64)
> Foreign Architectures: i386
>
> Kernel: Linux 4.9.0-8-amd64 (SMP w/4 CPU cores)
> Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8), LANGUAGE=en_GB.UTF-8 (charmap=UTF-8)
> Shell: /bin/sh linked to /bin/dash
> Init: systemd (via /run/systemd/system)
>

Thanks for the bug report.  Note that I'm the original author of the 
software, as well as maintaining the package.

1. The error message indicates that one of the MCMC moves computes three 
probabilities as NaNs and then tries to sample from that weighted set, 
which throws an exception that prints a backtrace.
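
To make point 1 concrete, here is a minimal sketch of a weighted choice 
that throws when no weight is usable.  This is not the actual choose_MH 
code; the name choose_index and its signature are made up for 
illustration, and it takes plain doubles rather than log_double_t.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a weighted choice; NOT BAli-Phy's choose_MH.
// Picks index i with probability proportional to exp(log_weights[i]),
// using a caller-supplied uniform draw u in [0,1).
std::size_t choose_index(const std::vector<double>& log_weights, double u)
{
    // Largest finite log-weight, used to rescale before exponentiating.
    bool any_finite = false;
    double max_lw = 0.0;
    for (double lw : log_weights)
        if (std::isfinite(lw) && (!any_finite || lw > max_lw)) {
            max_lw = lw;
            any_finite = true;
        }

    // If every weight is NaN (or infinite) there is nothing to sample from.
    if (!any_finite)
        throw std::runtime_error("No option chosen!");

    double total = 0.0;
    for (double lw : log_weights)
        if (std::isfinite(lw)) total += std::exp(lw - max_lw);

    // Walk the cumulative sum until it passes the target.
    const double target = u * total;
    double cumulative = 0.0;
    std::size_t last = 0;
    for (std::size_t i = 0; i < log_weights.size(); ++i) {
        if (!std::isfinite(log_weights[i])) continue;
        cumulative += std::exp(log_weights[i] - max_lw);
        last = i;
        if (target < cumulative) return i;
    }
    return last;  // guards against rounding when u is very close to 1
}
```

With three NaN inputs, as in the log above, this sketch reaches the throw 
because no entry passes the finiteness check.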

2. The question is what is yielding a NaN.

3. I'm surprised that all three numbers are NaN.  This suggests that 
the current probability was already NaN before the function was called.  
In that case running with '-V4' to enable extra logging might show where 
the NaN originates.

4. The stack trace indicates that the probabilities that are NaNs are 
coming from sample_SPR_search_one(Parameters&, MCMC::MoveStats&, 
tree_edge const&, std::map<tree_edge, bool, std::less<tree_edge>, 
std::allocator<std::pair<tree_edge const, bool> > > const&, bool)+0x513

This is line 1281 "C = choose_MH(0, PrL)" of src/mcmc/sample-topology-SPR.cc

5. The build log from arm-arm-01 seems to be getting NaNs in the same 
function, although they are -nan instead of +nan.  This is only a 
sample size of 2, but suggests the problem occurs mostly in that 
particular function.

6. The fact that it takes 7 seconds to crash, together with the fact 
that the previous test succeeded, suggests that this error occurs only 
after several iterations.  So for most inputs, no NaN is generated.

7. Since armel does not crash, it looks like there might be a difference 
in how IEEE math errors are handled between armel and armhf.  That is, 
the floating-point emulation code may not behave exactly the same as the 
hardware implementation.  Does that sound possible?

On the other hand, it's possible that armel would crash too if you reran 
it, since the test uses random numbers.  However, since there are no 
errors on x86 that I can find, this lends some weight to armel also 
being fine.

8. This makes me wonder what happens if the -ffast-math flag is removed 
from this line in src/meson.build:

add_project_arguments(['-DNDEBUG','-DNDEBUG_DP','-O3','-funroll-loops','-ffast-math'], 
language : 'cpp')

It could be that armel and armhf differ in how they handle math errors 
when told to ignore NaN and Inf.

9. We might be able to find out where the error is happening by changing 
the line

    feclearexcept(FE_DIVBYZERO|FE_OVERFLOW|FE_INVALID);

in `src/bali-phy.cc`.  If we change this to just 
feclearexcept(FE_INVALID); then I think we'll find the first NaN when it 
gets generated.  But we might need to run this inside gdb to find out 
where that occurs.
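
As a rough illustration of the idea in point 9, the sketch below uses 
the standard <cfenv> flags to test whether a piece of code raised 
FE_INVALID.  The helper raises_invalid is hypothetical and not part of 
bali-phy.  It only polls the flag afterwards; glibc also has a 
feenableexcept() extension that can turn the flag into a SIGFPE, but 
that only helps where the hardware supports trapping.

```cpp
#include <cfenv>

// Hypothetical helper: run f() and report whether it raised the
// IEEE "invalid operation" flag, which is set when a NaN is produced
// from non-NaN inputs (for example 0.0/0.0).
template <typename F>
bool raises_invalid(F&& f)
{
    std::feclearexcept(FE_INVALID);   // start from a clean flag
    volatile double sink = f();       // volatile keeps the computation
    (void)sink;                       // from being optimized away
    return std::fetestexcept(FE_INVALID) != 0;
}
```

Wrapping successive stages of the SPR move in a check like this would 
narrow down which computation first produces the NaN, without needing 
gdb on the build machine.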

I hope this detailed response is helpful... if I could reproduce the 
error that would make it easier to fix.

I don't have any arm hardware though.  How do you typically handle cases 
like this?

-BenRI