[Debichem-devel] Bug#1005951: nwchem (ARMCI) fails in multi-node execution with openmpi
Drew Parsons
dparsons at debian.org
Fri Feb 18 15:08:19 GMT 2022
Package: nwchem
Followup-For: Bug #1005951
Running more tests for upstream, I find that armci-mpi fails its own
tests when configured to run over two nodes with openmpi, though the
tests don't report the same gmr_create error directly.
Running the armci-mpi tests manually:
$ mpirun.openmpi -H host-1:1,host-2:1 -n 2 tests/contrib/non-blocking/simple
[host-1:53732] *** An error occurred in MPI_Win_allocate
[host-1:53732] *** reported by process [2077097985,0]
[host-1:53732] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[host-1:53732] *** MPI_ERR_WIN: invalid window
[host-1:53732] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host-1:53732] *** and potentially your MPI job)
[host-1:53727] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[host-1:53727] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
and
$ ARMCI_USE_WIN_ALLOCATE=0 mpirun.openmpi -H host-1:1,host-2:1 -n 2 tests/contrib/non-blocking/simple
[host-1:53740] *** An error occurred in MPI_Win_create
[host-1:53740] *** reported by process [2079719425,0]
[host-1:53740] *** on communicator MPI COMMUNICATOR 3 DUP FROM 0
[host-1:53740] *** MPI_ERR_WIN: invalid window
[host-1:53740] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[host-1:53740] *** and potentially your MPI job)
[host-1:53735] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal
[host-1:53735] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
At the same time, an mpich build of armci-mpi/ga/nwchem runs normally
over multiple nodes.
Jeff Hammond upstream concludes that Open-MPI is once again unusable
for RMA purposes.
The simplest work-around in the meantime is to recompile
nwchem/armci-mpi/ga using mpich.
This can be done relatively easily in the existing packages (rather
than providing two separate mpi builds). Users would then need to be
aware that they must launch nwchem with mpirun.mpich, not mpirun
(while mpirun still defaults to openmpi).
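
As a rough illustration of what that means for users, here is a minimal
sketch of a wrapper that insists on the mpich launcher rather than
silently falling back to plain mpirun. It assumes the mpich launcher is
on PATH as mpirun.mpich (the name used above). The helper name
pick_launcher and the input file input.nw are hypothetical placeholders,
not part of any package.

```shell
#!/bin/sh
# Hypothetical helper: print the first launcher named in the arguments
# that exists on PATH; return non-zero if none of them is found.
pick_launcher() {
    for candidate in "$@"; do
        if command -v "$candidate" >/dev/null 2>&1; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Only accept the mpich launcher for an mpich-built nwchem; plain
# mpirun is deliberately not tried, since it may still be openmpi.
if launcher=$(pick_launcher mpirun.mpich); then
    # input.nw is a placeholder input file name (assumption).
    echo "would run: $launcher -n 2 nwchem input.nw"
else
    echo "mpirun.mpich not found on PATH" >&2
fi
```

The sketch only prints the command it would run. Replacing the echo
with the command itself would start the job.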