[Debian-med-packaging] Bug#918047: bali-phy FTBFS building for armhf on arm64

Mon Jan 7 17:33:21 GMT 2019

Hi Benjamin!

On Thu, Jan 03, 2019 at 11:09:55AM -0800, Benjamin Redelings wrote:

...

>1. The error message indicates that one of the MCMC moves computes three
>probabilities as NaN's, and then tries to sample from that weighted set,
>which throws an exception that prints a backtrace.
>
>2. The question is what is yielding an NaN.
>
>3. I'm surprised that all three numbers are NaN.  This suggests that the
>current probability was already NaN before the function was called.  In that
>case running with '-V4' to enable extra logging might show where the NaN
>originates.
>
>4. The stack trace indicates that the probabilities that are NaNs are coming
>from sample_SPR_search_one(Parameters&, MCMC::MoveStats&, tree_edge const&,
>std::map<tree_edge, bool, std::less<tree_edge>,
>std::allocator<std::pair<tree_edge const, bool> > > const&, bool)+0x513
>
>This is line 1281 "C = choose_MH(0, PrL)" of src/mcmc/sample-topology-SPR.cc
>
>5. The build log from arm-arm-01 seems to be getting NaN's in the same
>function, although they are -nan instead of +nan.  This is only a sample-size
>of 2, but suggests the problem occurs mostly in that particular function.
>
>6. The fact that it takes 7 seconds to crash, and the fact that the previous
>test succeeds, suggest that this error occurs only after several iterations.
>So for most inputs, no NaN is generated.
>
>7. Since armel does not crash, it looks like there might be a difference in
>how IEEE math errors are handled between armel and armhf.  So, the floating
>point emulation code is not exactly the same as the hardware implementation.
>Does that sound possible?

Totally, yes. I *believe* the ARMv5 and ARMv7 configurations differ
here, but I'm hazy on the exact details, I'll be honest.
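
One way to compare the two ports would be a tiny probe that looks at the
raw bits of a result, which also shows the sign behind the -nan/+nan
difference in point 5. This is only a sketch: classify_bits is a
hypothetical helper, not anything in bali-phy, and it assumes double is
IEEE 754 binary64. It inspects the bits rather than calling std::isnan,
since a compiler may assume that check is always false under -ffast-math.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>

// Hypothetical helper (not part of bali-phy): classify a double by its
// raw bits, assuming the IEEE 754 binary64 layout
// (1 sign bit, 11 exponent bits, 52 mantissa bits).
std::string classify_bits(double x)
{
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);   // well-defined way to read the bits

    const bool negative = (bits >> 63) != 0;
    const std::uint64_t exponent = (bits >> 52) & 0x7ffu;
    const std::uint64_t mantissa = bits & 0x000fffffffffffffULL;

    // All-ones exponent: NaN if the mantissa is non-zero, otherwise infinity.
    if (exponent == 0x7ffu && mantissa != 0) return negative ? "-nan" : "+nan";
    if (exponent == 0x7ffu)                  return negative ? "-inf" : "+inf";
    return "finite";
}
```

Printing classify_bits() for the three probabilities on armel and armhf
would show whether the ports disagree on the value, or only on its sign.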

>On the other hand, it's possible that armel would crash too if you reran it,
>since the test uses random numbers.  However, since there are no errors on
>x86 that I can find, this lends some weight to armel also being fine.

OK.

>8. This makes me wonder what happens if the -ffast-math flag is removed from
>this line in src/meson.build:
>
>add_project_arguments(['-DNDEBUG','-DNDEBUG_DP','-O3','-funroll-loops','-ffast-math'],
>language : 'cpp')
>
>It could be that armel and armhf differ in how they handle math errors when
>told to ignore NaN and Inf.
>
>9. We might be able to find out where the error is happening by changing the
>line
>
>   feclearexcept(FE_DIVBYZERO|FE_OVERFLOW|FE_INVALID);
>
>in `src/bali-phy.cc`.  If we change this to just feclearexcept(FE_INVALID);
>then I think we'll find the first NaN when it gets generated.  But we might
>need to run this inside gdb to find out where that occurs.
>
>I hope this detailed response is helpful... if I could reproduce the error
>that would make it easier to fix.
>
>I don't have any arm hardware though.  How do you typically handle cases like
>this?

I'm more than happy to give out access to one of my machines to help
you fix this. Contact me off-list and we can set that up if you like.
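
For point 9, something like the following might help narrow things down
once you are on real hardware. It is a sketch only: raises_invalid,
zero_over_zero and harmless are hypothetical names, not bali-phy code,
and it assumes the platform defines FE_INVALID and actually sets the flag
(a soft-float port may not). It clears the IEEE exception flags, runs one
step, and reports whether that step raised "invalid operation".

```cpp
#include <cfenv>

// Hypothetical helper: run f() and report whether it raised FE_INVALID.
// Assumes the target defines FE_INVALID and records the flag.
template <typename F>
bool raises_invalid(F f)
{
    std::feclearexcept(FE_ALL_EXCEPT);      // start from a clean slate
    volatile double sink = f();             // volatile keeps the call in place
    (void)sink;
    return std::fetestexcept(FE_INVALID) != 0;
}

// Hypothetical stand-ins for "one MCMC step"; the volatile operands stop
// the compiler from folding the arithmetic at build time.
double zero_over_zero() { volatile double z = 0.0; return z / z; }
double harmless()       { volatile double a = 1.0, b = 2.0; return a / b; }
```

Here raises_invalid(zero_over_zero) should report true and
raises_invalid(harmless) false. Wrapping successive moves this way would
show which one first produces a NaN without needing the FPU to trap. On
glibc, feenableexcept(FE_INVALID) is an extension that turns the flag
into a SIGFPE for gdb to catch, but not every ARM FPU implements
trapping, so polling the flag may be the more portable route.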

-- 
Steve McIntyre, Cambridge, UK.                                steve at einval.com
Welcome my son, welcome to the machine.