[Debian-med-packaging] Bug#918047: bali-phy FTBFS building for armhf on arm64
Benjamin Redelings
benjamin.redelings at gmail.com
Thu Jan 3 19:09:55 GMT 2019
On 1/2/19 11:21 AM, Steve McIntyre wrote:
> Package: src:bali-phy
> Version: 3.4+dfsg-1
> Severity: important
>
> Hi!
>
> I've been doing a full rebuild of the Debian archive, building all
> source packages targeting armel and armhf using arm64 hardware. We are
> planning in future to move all of our 32-bit armel/armhf builds to
> using arm64 machines, so this rebuild is to identify packages that
> might have problems with this configuration.
>
> I've tried to build bali-phy for armhf on top of arm64, and it's
> failing a test at the end of the build:
>
> ...
> dh_auto_test -O--buildsystem=meson
> cd obj-arm-linux-gnueabihf && LC_ALL=C.UTF-8 MESON_TESTTHREADS=8 ninja test
> [1/4] Generating git-version.h with a custom command.
> [1/2] Running all tests.
> 1/30 bali-phy version OK 0.04 s
> 2/30 bali-phy help OK 0.04 s
> 3/30 bali-phy 5d test OK 3.61 s
> 4/30 bali-phy 5d +A 50 OK 17.14 s
> 5/30 bali-phy 5d -A 200 FAIL 7.87 s (exit status 1)
> 6/30 model_P --help OK 0.04 s
> 7/30 statreport --help OK 0.03 s
> 8/30 stats-select --help OK 0.04 s
> 9/30 alignment-gild --help OK 0.01 s
> 10/30 alignment-consensus --help OK 0.01 s
> 11/30 alignment-max --help OK 0.01 s
> 12/30 alignment-chop-internal --help OK 0.03 s
> 13/30 alignment-indices --help OK 0.01 s
> 14/30 alignment-info --help OK 0.01 s
> 15/30 alignment-cat --help OK 0.01 s
> 16/30 alignment-translate --help OK 0.02 s
> 17/30 alignment-find --help OK 0.02 s
> 18/30 trees-consensus --help OK 0.02 s
> 19/30 tree-mean-lengths --help OK 0.01 s
> 20/30 mctree-mean-lengths --help OK 0.01 s
> 21/30 trees-to-SRQ --help OK 0.01 s
> 22/30 pickout --help OK 0.01 s
> 23/30 cut-range --help OK 0.01 s
> 24/30 trees-distances --help OK 0.01 s
> 25/30 alignment-thin --help OK 0.01 s
> 26/30 alignments-diff --help OK 0.01 s
> 27/30 tree-tool --help OK 0.01 s
> 28/30 alignment-distances --help OK 0.01 s
> 29/30 subsample --help OK 0.02 s
> 30/30 bali-phy testsuite OK 272.10 s
>
> Ok: 29
> Expected Fail: 0
> Fail: 1
> Unexpected Pass: 0
> Skipped: 0
> Timeout: 0
>
>
> The output from the failed tests:
>
> 5/30 bali-phy 5d -A 200 FAIL 7.87 s (exit status 1)
>
> --- command ---
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/../examples/sequences/5S-rRNA/5d.fasta --iter=200 --package-path=/<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/builtins:/<<PKGBUILDDIR>> -Inone
> --- stdout ---
> T:topology ~ uniform on tree topologies
> T:lengths ~ iid[num_branches[T],gamma[0.5,div[2,num_branches[T]]]]
>
> Partition P1:
> file = /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/../examples/sequences/5S-rRNA/5d.fasta
> alphabet = DNA
> subst = tn93 (S1)
> indel = none
> scale ~ gamma[0.5,2] (Scale1)
>
> Substitution model S1 priors:
> tn93:kappaPur ~ log_normal[log[2],0.25]
> tn93:kappaPyr ~ log_normal[log[2],0.25]
> tn93:pi ~ dirichlet_on[letters[@a],1]
>
> Beginning pre-burnin: 3 iterations.
> Start #1 prior = 2.85078 likelihood = -1171.56 |T| = 0.468531 |A| = 110 Scale1*|T| = 0.80032
>
> Tree (S)ize #1 prior = 2.21294 likelihood = -1097.14 |T| = 0.448637 |A| = 110 Scale1*|T| = 1.30021
> Tree (S)ize #2 prior = -13.3031 likelihood = -1001.31 |T| = 0.572728 |A| = 110 Scale1*|T| = 16.6362
> Tree (S)ize #3 prior = -12.5843 likelihood = -999.563 |T| = 0.738133 |A| = 110 Scale1*|T| = 18.3298
>
> (S)+Branch (L)engths #1 prior = -8.39757 likelihood = -984.367 |T| = 1.63073 |A| = 110 Scale1*|T| = 6.22256
> (S)+Branch (L)engths #2 prior = -15.5928 likelihood = -943.676 |T| = 3.95652 |A| = 110 Scale1*|T| = 13.9505
> (S)+Branch (L)engths #3 prior = -6.1693 likelihood = -933.719 |T| = 1.54154 |A| = 110 Scale1*|T| = 6.65305
>
> (S)+(L)+(P)arameters #1 prior = -6.85447 likelihood = -820.042 |T| = 0.83291 |A| = 110 Scale1*|T| = 7.19549
> (S)+(L)+(P)arameters #2 prior = -13.9795 likelihood = -804.932 |T| = 2.52913 |A| = 110 Scale1*|T| = 14.9393
> (S)+(L)+(P)arameters #3 prior = -14.8455 likelihood = -804.004 |T| = 1.78642 |A| = 110 Scale1*|T| = 11.4421
>
> (S)+(L)+(P)+NNI #1 prior = -14.7193 likelihood = -800.519 |T| = 1.80833 |A| = 110 Scale1*|T| = 11.1644
> (S)+(L)+(P)+NNI #2 prior = -14.8403 likelihood = -794.947 |T| = 1.29684 |A| = 110 Scale1*|T| = 8.27748
> (S)+(L)+(P)+NNI #3 prior = -18.6064 likelihood = -794.606 |T| = 1.68838 |A| = 110 Scale1*|T| = 6.26695
> (S)+(L)+(P)+NNI #4 prior = -17.5519 likelihood = -796.853 |T| = 1.68336 |A| = 110 Scale1*|T| = 5.20335
>
> SPR #1 prior = -17.4975 likelihood = -795.163 |T| = 1.47967 |A| = 110 Scale1*|T| = 4.62688
>
> (S)+(L)+(P)+NNI #1 prior = -14.4695 likelihood = -797.104 |T| = 1.03049 |A| = 110 Scale1*|T| = 4.62992
> (S)+(L)+(P)+NNI #2 prior = -17.9617 likelihood = -791.809 |T| = 2.04283 |A| = 110 Scale1*|T| = 8.75168
> (S)+(L)+(P)+NNI #3 prior = -17.1506 likelihood = -788.311 |T| = 0.917332 |A| = 110 Scale1*|T| = 8.90935
>
> Finished pre-burnin in 0.26 seconds.
>
>
> BAli-Phy does NOT detect how many iterations is sufficient:
> You need to monitor convergence and kill it when done.
> Maximum number of iterations set to 200.
>
> Beginning MCMC computations.
> - Future screen output sent to '5d-1/C1.out'
> - Future debugging output sent to '5d-1/C1.err'
> - Sampled trees logged to '5d-1/C1.trees'
> - Sampled alignments logged to '5d-1/C1.P<partition>.fastas'
> - Run info written to '5d-1/C1.run.json'
> - Sampled numerical parameters logged to '5d-1/C1.log'
>
> You can examine 'C1.log' using BAli-Phy tool statreport (command-line) or the BEAST program Tracer (graphical).
>
> See the manual at http://www.bali-phy.org/README.xhtml for further information.
> --- stderr ---
> Home directory '/sbuild-nonexistent' does not exist!
> Partition #1: 126 columns -> 110 unique patterns.
> Created directory '5d-1/' for output files.
> bali-phy: Error! move = sampler
> submove = tree
> move = tree
> submove = topology
> move = topology
> submove = SPR
> move = SPR
> submove = SPR_all
> [single] move = SPR_all
> bool sample_SPR_search_one(Parameters&, MCMC::MoveStats&, const tree_edge&, const spr_range&, bool)int choose_MH(int, const std::vector<T>&) [with F = log_double_t]:
> No option chosen! (current = 0)
> *log(Pr[0]) = nan
> log(Pr[1]) = nan
> log(Pr[2]) = nan
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN16choose_exceptionI12log_double_tEC2EiRKSt6vectorIS0_SaIS0_EE+0xccb) [0xc09a3c]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_Z9choose_MHI12log_double_tEiiRKSt6vectorIT_SaIS2_EE+0xa0d) [0xc3ecce]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_Z21sample_SPR_search_oneR10ParametersRN4MCMC9MoveStatsERK9tree_edgeRKSt3mapIS4_bSt4lessIS4_ESaISt4pairIS5_bEEEb+0x513) [0xc7ceb4]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_Z21sample_SPR_search_oneR10ParametersRN4MCMC9MoveStatsERK9tree_edgeb+0x3d) [0xc7dc86]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_Z14sample_SPR_allR9owned_ptrI5ModelERN4MCMC9MoveStatsE+0x7b) [0xc7e78c]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC10SingleMove7iterateER9owned_ptrI5ModelERNS_9MoveStatsEi+0x39) [0xc01c76]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC9MoveGroup7iterateER9owned_ptrI5ModelERNS_9MoveStatsEi+0x43) [0xc01d80]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC9MoveGroup7iterateER9owned_ptrI5ModelERNS_9MoveStatsEi+0x43) [0xc01d80]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC9MoveGroup7iterateER9owned_ptrI5ModelERNS_9MoveStatsEi+0x43) [0xc01d80]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC9MoveGroup7iterateER9owned_ptrI5ModelERNS_9MoveStatsEi+0x43) [0xc01d80]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_ZN4MCMC7Sampler2goER9owned_ptrI5ModelEiiRSo+0x2ab) [0xc03818]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(_Z11do_samplingRKN5boost15program_options13variables_mapER9owned_ptrI5ModelElRSoRKSt6vectorISt8functionIFvRKS5_lEESaISE_EE+0xa2f) [0xcbfe68]
> /<<PKGBUILDDIR>>/obj-arm-linux-gnueabihf/src/bali-phy(main+0x1b31) [0xb99f5a]
> /lib/arm-linux-gnueabihf/libc.so.6(__libc_start_main+0x97) [0xf6cef4e4]
>
> -------
>
> In fact, checking I can see that the buildd arm-arm-01 (also an arm64
> host configured to build armhf) fails in the same way. See
>
> https://buildd.debian.org/status/fetch.php?pkg=bali-phy&arch=armhf&ver=3.4%2Bdfsg-1&stamp=1544787942&raw=0
>
> for the build log. Oddly, I'm not seeing any similar problems
> building/testing for armel on top of arm64...
>
> -- System Information:
> Debian Release: 9.6
> APT prefers stable-updates
> APT policy: (500, 'stable-updates'), (500, 'stable-debug'), (500, 'stable')
> Architecture: amd64 (x86_64)
> Foreign Architectures: i386
>
> Kernel: Linux 4.9.0-8-amd64 (SMP w/4 CPU cores)
> Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8), LANGUAGE=en_GB.UTF-8 (charmap=UTF-8)
> Shell: /bin/sh linked to /bin/dash
> Init: systemd (via /run/systemd/system)
>
Thanks for the bug report. Note that I'm the original author of the
software, as well as maintaining the package.
1. The error message indicates that one of the MCMC moves computes three
probabilities as NaNs, and then tries to sample from that weighted set,
which throws an exception that prints a backtrace.
2. The question is what is yielding a NaN.
3. I'm surprised that all three numbers are NaN. This suggests that
the current probability was already NaN before the function was called.
In that case running with '-V4' to enable extra logging might show where
the NaN originates.
4. The stack trace indicates that the NaN probabilities come from
sample_SPR_search_one(Parameters&, MCMC::MoveStats&,
tree_edge const&, std::map<tree_edge, bool, std::less<tree_edge>,
std::allocator<std::pair<tree_edge const, bool> > > const&, bool)+0x513.
This is line 1281, "C = choose_MH(0, PrL)", of src/mcmc/sample-topology-SPR.cc.
5. The build log from arm-arm-01 seems to be getting NaNs in the same
function, although they are -nan instead of +nan. This is only a
sample size of 2, but it suggests the problem occurs mostly in that
particular function.
6. The fact that it takes 7 seconds to crash, and the fact that the
previous test succeeds, suggest that this error occurs only after several
iterations. So for most inputs, no NaN is generated.
7. Since armel does not crash, it looks like there might be a difference
in how IEEE math errors are handled between armel and armhf. That is, the
floating-point emulation code may not behave exactly the same as the
hardware implementation. Does that sound possible?
On the other hand, it's possible that armel would crash too if you reran
it, since the test uses random numbers. However, since there are no
errors on x86 that I can find, this lends some weight to armel also
being fine.
8. This makes me wonder what happens if the -ffast-math flag is removed
from this line in src/meson.build:
add_project_arguments(['-DNDEBUG','-DNDEBUG_DP','-O3','-funroll-loops','-ffast-math'],
language : 'cpp')
It could be that armel and armhf differ in how they handle math errors
when told to ignore NaN and Inf.
9. We might be able to find out where the error is happening by changing
the line
feclearexcept(FE_DIVBYZERO|FE_OVERFLOW|FE_INVALID);
in src/bali-phy.cc. If we change this to just
feclearexcept(FE_INVALID);
then I think we'll find the first NaN when it gets generated. But we
might need to run this inside gdb to find out where that occurs.
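The idea in point 9, catching the first NaN where it is generated, can
also be approximated without traps or gdb by polling the floating-point
status flags. The following is a minimal sketch, not code from bali-phy:
raises_invalid, zero_over_zero and one_over_two are hypothetical names, and
it assumes the build does not use -ffast-math and that the platform really
sets FE_INVALID (which may not hold for every soft-float configuration).
Note that feclearexcept only resets the flags; std::fetestexcept is what
reads them back.

```cpp
#include <cfenv>

// Hypothetical helper (not part of bali-phy): runs f() and reports whether
// it raised the IEEE "invalid operation" flag, which is set when an
// operation such as 0/0 produces a NaN.
template <typename F>
bool raises_invalid(F f)
{
    std::feclearexcept(FE_ALL_EXCEPT);      // start from a clean flag state
    volatile double result = f();           // volatile: keep the computation alive
    (void)result;
    return std::fetestexcept(FE_INVALID) != 0;
}

// The volatile inputs stop the compiler from folding the arithmetic at
// compile time, so the division really happens at run time.
inline double zero_over_zero()
{
    volatile double zero = 0.0;
    return zero / zero;                     // produces NaN, raises FE_INVALID
}

inline double one_over_two()
{
    volatile double one = 1.0;
    volatile double two = 2.0;
    return one / two;                       // ordinary result, no FE_INVALID
}
```

Wrapping successive stages of a move in a check like raises_invalid would
narrow down which stage first produces a NaN, at the cost of only locating
it to the granularity of the wrapped call.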
I hope this detailed response is helpful... if I could reproduce the
error that would make it easier to fix.
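For what it's worth, the failure mode described in point 1 can be imitated
on any hardware. The sketch below is a hypothetical stand-in, not
bali-phy's choose_MH: choose_weighted and throws_on are made-up names, and
it works on plain doubles rather than log_double_t. Because every
comparison involving NaN is false, a cumulative-weight chooser never
selects an option once a weight is NaN. It assumes compilation without
-ffast-math, which allows the compiler to assume NaNs never occur.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <stdexcept>
#include <vector>

// Hypothetical weighted chooser. u is a uniform draw in [0,1), passed in
// explicitly so that the behaviour is deterministic.
int choose_weighted(const std::vector<double>& weights, double u)
{
    assert(!weights.empty());

    double total = 0.0;
    for (double w : weights) total += w;    // a single NaN makes total NaN

    double cumulative = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i)
    {
        cumulative += weights[i];
        // Any comparison involving NaN is false, so with NaN weights this
        // branch is never taken and the loop falls through.
        if (u * total < cumulative) return static_cast<int>(i);
    }
    throw std::runtime_error("No option chosen!");
}

// Returns true if choosing from the given weights throws.
bool throws_on(const std::vector<double>& weights)
{
    try { choose_weighted(weights, 0.5); }
    catch (const std::runtime_error&) { return true; }
    return false;
}
```

With weights {1, 2, 1} and u = 0.5 the chooser returns index 1, while any
weight vector containing a NaN ends in the exception, whatever the value
of u.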
I don't have any arm hardware though. How do you typically handle cases
like this?
-BenRI