[Debian-med-packaging] Bug#918047: bali-phy FTBFS building for armhf on arm64
Steve McIntyre
steve at einval.com
Mon Jan 7 17:33:21 GMT 2019
Hi Benjamin!
On Thu, Jan 03, 2019 at 11:09:55AM -0800, Benjamin Redelings wrote:
...
>1. The error message indicates that one of the MCMC moves computes three
>probabilities as NaN's, and then tries to sample from that weighted set,
>which throws an exception that prints a backtrace.
>
>2. The question is what is yielding an NaN.
>
>3. I'm surprised that all three numbers are NaN. This suggests that the
>current probability was already NaN before the function was called. In that
>case running with '-V4' to enable extra logging might show where the NaN
>originates.
>
>4. The stack trace indicates that the probabilities that are NaNs are coming
>from sample_SPR_search_one(Parameters&, MCMC::MoveStats&, tree_edge const&,
>std::map<tree_edge, bool, std::less<tree_edge>,
>std::allocator<std::pair<tree_edge const, bool> > > const&, bool)+0x513
>
>This is line 1281 "C = choose_MH(0, PrL)" of src/mcmc/sample-topology-SPR.cc
>
>5. The build log from arm-arm-01 seems to be getting NaN's in the same
>function, although they are -nan instead of +nan. This is only a sample-size
>of 2, but suggests the problem occurs mostly in that particular function.
>
>6. The fact that it takes 7 seconds to crash, and the fact that the previous
>test succeeds, suggest that this error occurs only after several iterations.
>So for most inputs, no NaN is generated.
>
>7. Since armel does not crash, it looks like there might be a difference in
>how IEEE math errors are handled between armel and armhf. So, the floating
>point emulation code is not exactly the same as the hardware implementation.
>Does that sound possible?
Totally, yes. I *believe* the ARMv5 and ARMv7 configurations differ
here, but I'm hazy on the exact details, I'll be honest.
>On the other hand, it's possible that armel would crash too if you reran it,
>since the test uses random numbers. However, since there are no errors on
>x86 that I can find, this lends some weight to armel also being fine.
OK.
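On points 1 and 4, here is a minimal sketch of why sampling from a
weighted set fails once the weights are NaN. It is not bali-phy's
choose_MH; choose_index is a hypothetical helper, and it assumes a
build without -ffast-math so that std::isnan is reliable.

```cpp
#include <cmath>
#include <cstddef>
#include <random>
#include <stdexcept>
#include <vector>

// Hypothetical helper (not bali-phy's choose_MH): pick an index with
// probability proportional to its weight, refusing NaN or negative weights.
int choose_index(const std::vector<double>& weights, std::mt19937& rng) {
    double total = 0.0;
    for (double w : weights) {
        if (std::isnan(w) || w < 0.0)
            throw std::domain_error("choose_index: weight is NaN or negative");
        total += w;
    }
    if (!(total > 0.0) || std::isinf(total))
        throw std::domain_error("choose_index: weights have no finite positive sum");

    // Draw r uniformly in [0, total) and walk the cumulative sum.
    std::uniform_real_distribution<double> u(0.0, total);
    const double r = u(rng);
    double acc = 0.0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        acc += weights[i];
        if (r < acc) return static_cast<int>(i);
    }
    return static_cast<int>(weights.size()) - 1;  // guard against rounding
}
```

With three NaN weights there is no meaningful cumulative sum to walk,
so throwing is about the only sane option.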
>8. This makes me wonder what happens if the -ffast-math flag is removed from
>this line in src/meson.build:
>
>add_project_arguments(['-DNDEBUG','-DNDEBUG_DP','-O3','-funroll-loops','-ffast-math'],
>language : 'cpp')
>
>It could be that armel and armhf differ in how they handle math errors when
>told to ignore NaN and Inf.
>
>9. We might be able to find out where the error is happening by changing the
>line
>
> feclearexcept(FE_DIVBYZERO|FE_OVERFLOW|FE_INVALID);
>
>in `src/bali-phy.cc`. If we change this to just feclearexcept(FE_INVALID);
>then I think we'll find the first NaN when it gets generated. But we might
>need to run this inside gdb to find out where that occurs.
>
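On point 9, a rough sketch of checking the IEEE "invalid" flag with the
standard <cfenv> calls, which may be enough to bracket where the first
NaN shows up. division_raises_invalid is a hypothetical helper for
illustration only; nothing here is taken from src/bali-phy.cc.

```cpp
#include <cfenv>

// Hypothetical helper: perform one division and report whether it raised
// FE_INVALID, the flag set by invalid operations such as 0.0 / 0.0.
bool division_raises_invalid(double num, double den) {
    std::feclearexcept(FE_ALL_EXCEPT);    // start from a clean flag state
    volatile double n = num, d = den;     // volatile keeps the division at run time
    volatile double q = n / d;
    (void)q;
    return std::fetestexcept(FE_INVALID) != 0;
}
```

Clearing the flags before a stage of the move and testing FE_INVALID
after it should narrow things down without gdb. glibc also has
feenableexcept(FE_INVALID), a GNU extension, which turns the flag into
a SIGFPE where the platform supports trapping.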
>I hope this detailed response is helpful... if I could reproduce the error
>that would make it easier to fix.
>
>I don't have any arm hardware though. How do you typically handle cases like
>this?
I'm more than happy to give out access to one of my machines to help
you fix this. Contact me off-list and we can set that up if you like.
--
Steve McIntyre, Cambridge, UK. steve at einval.com
Welcome my son, welcome to the machine.