[Debichem-devel] Bug#1005951: nwchem (ARMCI) fails in multi-node execution with openmpi

Drew Parsons dparsons at debian.org
Fri Feb 18 15:08:19 GMT 2022


Package: nwchem
Followup-For: Bug #1005951

While running more tests for upstream, I found that armci-mpi fails its
own tests when configured to run over two nodes with openmpi, although
the tests do not report the same gmr_create error directly.

Running the armci-mpi tests manually gives

$ mpirun.openmpi -H host-1:1,host-2:1 -n 2   tests/contrib/non-blocking/simple
[host-1:53732] *** An error occurred in MPI_Win_allocate
[host-1:53732] *** reported by process [2077097985,0]
[host-1:53732] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[host-1:53732] *** MPI_ERR_WIN: invalid window
[host-1:53732] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host-1:53732] ***    and potentially your MPI job)
[host-1:53727] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[host-1:53727] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages

and, with ARMCI_USE_WIN_ALLOCATE=0,

$ ARMCI_USE_WIN_ALLOCATE=0  mpirun.openmpi -H host-1:1,host-2:1 -n 2   tests/contrib/non-blocking/simple
[host-1:53740] *** An error occurred in MPI_Win_create
[host-1:53740] *** reported by process [2079719425,0]
[host-1:53740] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[host-1:53740] *** MPI_ERR_WIN: invalid window
[host-1:53740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host-1:53740] ***    and potentially your MPI job)
[host-1:53735] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[host-1:53735] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages



At the same time, an mpich build of armci-mpi/ga/nwchem runs normally
over multiple nodes.

Jeff Hammond upstream concludes that Open-MPI is once again unusable
for RMA purposes.

The simplest workaround in the meantime is to recompile
nwchem/armci-mpi/ga using mpich.

This can be done relatively easily in the existing packages (rather
than providing two separate MPI builds).  Users would then need to be
aware that they must launch nwchem with mpirun.mpich rather than mpirun
(for as long as mpirun still defaults to openmpi).
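
As a rough illustration of what that launch might look like, here is a
minimal sketch. It assumes the mpich-based package still installs the
binary as "nwchem" and that mpirun.mpich is MPICH's Hydra launcher,
which takes a comma-separated host list via -hosts. The helper name
nwchem_mpich_cmd, the hostnames, and the input file name are
hypothetical placeholders. The helper only prints the command (a dry
run), so it can be checked before anything is launched.

```shell
# Hypothetical helper: build the launch command for an mpich-based nwchem.
# Arguments: number of processes, comma-separated host list, input file.
# Assumption: the binary is named "nwchem" and mpirun.mpich is Hydra.
nwchem_mpich_cmd() {
    nprocs=$1
    hosts=$2
    input=$3
    printf 'mpirun.mpich -hosts %s -n %s nwchem %s\n' "$hosts" "$nprocs" "$input"
}

# Dry run for a two-node job; this prints the command and launches nothing.
nwchem_mpich_cmd 2 host-1,host-2 input.nw
```

Running the printed command by hand (or piping it to sh) would start
the job through MPICH's launcher instead of the default mpirun.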


