Bug#918369: FTBFS for armhf on arm64, fails MPI-based tests

Steve McIntyre steve at einval.com
Tue Jan 8 14:49:13 GMT 2019


On Tue, Jan 08, 2019 at 02:12:06PM +0000, Steve McIntyre wrote:
>On Tue, Jan 08, 2019 at 10:42:01AM +0000, Alastair McKinstry wrote:
>>Hi Steve
>>
>>On 05/01/2019 15:40, Steve McIntyre wrote:
>>> Package: src:p4est
>>> Version: 1.1-5
>>> Severity: important
>>> 
>>> I've been doing a full rebuild of the Debian archive, building all
>>> source packages targeting armel and armhf using arm64 hardware. We are
>>> planning in future to move all of our 32-bit armel/armhf builds to
>>> using arm64 machines, so this rebuild is to identify packages that
>>> might have problems with this configuration.
>>> 
>>You submitted this and a number of other MPI-test based bugs. I believe they
>>may be due to #918157, a regression in pmix that breaks MPI on 32-bit archs.
>>
>>I've uploaded a new openmpi (3.1.3-8) that uses its internal pmix until a fix
>>is found for #918157, so these should now work. Can you re-test?
>
>Sure, no problem. I've just queued all those builds to rebuild again
>now. I'll let you know how that goes.

Not looking good, I'm afraid. The errors are different, but I'm still
seeing failures. From the dune-common rebuild:

...
 1/91 Test  #1: indexsettest ...........................   Passed    0.01 sec
      Start  2: remoteindicestest
 2/91 Test  #2: remoteindicestest ......................***Failed    0.05 sec
[mjolnir:16218] mca_base_component_repository_open: unable to open mca_pmix_pmix2x: /usr/lib/arm-linux-gnueabihf/openmpi/lib/openmpi3/mca_pmix_pmix2x.so: undefined symbol: OPAL_MCA_PMIX2X_PMIx_Get_version (ignored)
[mjolnir:16218] [[42547,0],0] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 325
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[mjolnir:16217] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 532
[mjolnir:16217] [[INVALID],INVALID] ORTE_ERROR_LOG: Unable to start a daemon on the local node in file ess_singleton_module.c at line 166
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_init failed
  --> Returned value Unable to start a daemon on the local node (-127) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Unable to start a daemon on the local node" (-127) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[mjolnir:16217] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

...
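
For reference, the failure happens before any of dune-common's own test code
runs, so a trivial MPI "hello world" should reproduce it under the same
setup. This is just a sketch (not from the dune sources), assuming the armhf
mpicc from the openmpi packages on the arm64 build machine:

  /* hello.c -- minimal reproducer sketch, not part of dune-common */
  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      /* the log above aborts inside MPI_Init (ompi_rte_init / orte_init) */
      if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
          fprintf(stderr, "MPI_Init failed\n");
          return 1;
      }
      int rank = 0;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      printf("hello from rank %d\n", rank);
      MPI_Finalize();
      return 0;
  }

Compiled with "mpicc hello.c -o hello" and run directly (no mpirun), i.e. as
a singleton like the test above, that should be enough to tell whether the
pmix2x plugin problem is still there independent of the package's test
harness.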



-- 
Steve McIntyre, Cambridge, UK.                                steve at einval.com
"Because heaters aren't purple!" -- Catherine Pitt


