[Debichem-devel] Bug#1005951: nwchem (ARMCI) fails in multi-node execution with openmpi
Drew Parsons
dparsons at debian.org
Thu Feb 17 22:37:43 GMT 2022
Package: nwchem
Version: 7.0.2-1
Severity: important
Control: forwarded -1 https://github.com/pmodels/armci-mpi/issues/33
Control: affects -1 libarmci-mpi-dev openmpi-bin
The Debian testing build of nwchem is currently failing to run across multiple nodes. It runs fine on one node.
The nodes form a cluster managed by OpenStack, with 16 CPUs per node.
Testing against the sample water script at https://nwchemgit.github.io/Sample.html, one node runs successfully with
mpirun -n 16 nwchem water.nw
I can also run successfully on a different (single) node (here launching from node-1 to execute on node-2)
mpirun -H node-2:16 -n 16 nwchem water.nw
The failure occurs when I try to run across both nodes together. Whether with -n 32 or -N 16,
mpirun -H node-1:16,node-2:16 -n 32 nwchem water.nw
or
mpirun -H node-1:16,node-2:16 -N 16 nwchem water.nw
both fail the same way.
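To narrow down whether the failure depends on the number of ranks or simply on spanning two nodes, the same launch could be repeated with fewer ranks per node. The following is only a sketch: the helper names host_list and repro_cmd are made up for illustration, the node names and water.nw are taken from the commands above, and the commands are printed rather than executed so they can be reviewed first.

```shell
#!/bin/sh
# Build an Open MPI -H host list such as "node-1:16,node-2:16".
# Usage: host_list SLOTS HOST [HOST...]
# (hypothetical helper, not part of nwchem or Open MPI)
host_list() {
    slots=$1; shift
    list=""
    for host in "$@"; do
        # Append "host:slots", comma-separated after the first entry.
        list="${list:+$list,}$host:$slots"
    done
    printf '%s\n' "$list"
}

# Print (do not run) the mpirun command for SLOTS ranks on each HOST.
# Usage: repro_cmd SLOTS HOST [HOST...]
repro_cmd() {
    slots=$1; shift
    total=$(( slots * $# ))   # total ranks = ranks per node * number of nodes
    printf 'mpirun -H %s -n %s nwchem water.nw\n' \
        "$(host_list "$slots" "$@")" "$total"
}

# Two-node runs with a shrinking rank count per node.
for per_node in 16 4 1; do
    repro_cmd "$per_node" node-1 node-2
done
```

If the two-rank case (one rank per node) already fails, the rank count is probably not the trigger.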
The error message is:
$ mpirun -H node-1:16,node-2:16 -n 32 nwchem water.nw
[31] ARMCI assert fail in gmr_create() [src/gmr.c:109]: "alloc_slices[alloc_me].base != NULL"
[31] Backtrace:
[31] 10 - nwchem(+0x2836605) [0x55fe1ee26605]
[31] 9 - nwchem(+0x282cc1c) [0x55fe1ee1cc1c]
[31] 8 - nwchem(+0x282c358) [0x55fe1ee1c358]
[31] 7 - nwchem(+0x2819f68) [0x55fe1ee09f68]
[31] 6 - nwchem(+0x2819cba) [0x55fe1ee09cba]
[31] 5 - nwchem(+0x2819d76) [0x55fe1ee09d76]
[31] 4 - nwchem(+0x2818fe9) [0x55fe1ee08fe9]
[31] 3 - nwchem(+0x11b79) [0x55fe1c601b79]
[31] 2 - nwchem(+0x12659) [0x55fe1c602659]
[31] 1 - /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xcd) [0x7fb2c8ffa7ed]
[31] 0 - nwchem(+0x1069a) [0x55fe1c60069a]
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 31 in communicator MPI_COMM_WORLD
with errorcode -1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
An MPI communication peer process has unexpectedly disconnected. This
usually indicates a failure in the peer process (e.g., a crash or
otherwise exiting without calling MPI_FINALIZE first).
Although this local MPI process will likely now behave unpredictably
(it may even hang or crash), the root cause of this problem is the
failure of the peer -- that is what you need to investigate. For
example, there may be a core file that you can examine. More
generally: such peer hangups are frequently caused by application bugs
or other external events.
Local host: node-1
Local PID: 1264980
Peer host: node-2
--------------------------------------------------------------------------
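The backtrace above only gives offsets into the nwchem binary. As a sketch of how they might be turned into function names, the offsets can be extracted from the log and handed to addr2line. This assumes debug symbols for nwchem are available, and the function name backtrace_offsets and the log file name mpirun.log are made up for illustration.

```shell
#!/bin/sh
# Read an ARMCI backtrace on stdin and print the module-relative offsets
# from lines of the form
#   [31] 10 - nwchem(+0x2836605) [0x55fe1ee26605]
# Frames from other objects (e.g. libc) are skipped.
# (hypothetical helper name)
backtrace_offsets() {
    sed -n 's/^.*nwchem(+\(0x[0-9a-fA-F]*\)).*$/\1/p'
}

# Example with two frames copied from the report above.
printf '%s\n' \
    '[31] 10 - nwchem(+0x2836605) [0x55fe1ee26605]' \
    '[31] 9 - nwchem(+0x282cc1c) [0x55fe1ee1cc1c]' |
    backtrace_offsets

# With debug symbols installed, the offsets could then be resolved with:
#   backtrace_offsets < mpirun.log | xargs addr2line -f -e "$(command -v nwchem)"
```

The offsets in parentheses are used, not the absolute addresses in square brackets, because the binary is position-independent and the absolute addresses change between runs.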
I've tried a fresh rebuild of armci-mpi, ga and nwchem, but the failure persists.
I've tried running with ARMCI_USE_WIN_ALLOCATE=0, as suggested in the
armci-mpi README, but it doesn't avoid the failure.
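One thing worth ruling out in a multi-node test is whether the variable actually reaches the ranks on the remote node. Open MPI's mpirun does not forward arbitrary environment variables by default, but its -x option exports a named variable (optionally with a value) to all ranks. The sketch below only prints the resulting command. The helper name mpirun_with_env is made up for illustration, and nothing here says how the variable was set in the test described above.

```shell
#!/bin/sh
# Print (do not run) an mpirun command that forwards one environment
# setting to every rank via Open MPI's -x option.
# Usage: mpirun_with_env NAME=VALUE HOSTLIST NPROCS
# (hypothetical helper name)
mpirun_with_env() {
    printf 'mpirun -x %s -H %s -n %s nwchem water.nw\n' "$1" "$2" "$3"
}

mpirun_with_env ARMCI_USE_WIN_ALLOCATE=0 node-1:16,node-2:16 32
```

If the failure is unchanged with the variable forwarded this way, the setting can be excluded as a workaround with more confidence.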
After rebuilding against mpich (rebuilding armci-mpi and ga), an mpich build
of nwchem runs fine. That suggests the problem lies in how openmpi
works with armci.
I'm inclined to work around the problem by just proceeding with mpich
builds of nwchem. The change is only two packages deep (armci-mpi and
ga), and in practice they both belong to nwchem anyway, so it wouldn't
be too disruptive.
-- System Information:
Debian Release: bookworm/sid
APT prefers unstable
APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386
Kernel: Linux 5.16.0-1-amd64 (SMP w/8 CPU threads; PREEMPT)
Kernel taint flags: TAINT_PROPRIETARY_MODULE, TAINT_OOT_MODULE
Locale: LANG=en_AU.UTF-8, LC_CTYPE=en_AU.UTF-8 (charmap=UTF-8), LANGUAGE=en_AU:en
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled
Versions of packages nwchem depends on:
ii libatlas3-base [liblapack.so.3] 3.10.3-12
ii libblas3 [libblas.so.3] 3.10.0-2
ii libblis3-openmp [libblas.so.3] 0.8.1-2
ii libblis3-pthread [libblas.so.3] 0.8.1-2
ii libc6 2.33-6
ii libgcc-s1 11.2.0-16
ii libgfortran5 11.2.0-16
ii liblapack3 [liblapack.so.3] 3.10.0-2
ii libopenblas0-openmp [liblapack.so.3] 0.3.19+ds-3
ii libopenblas0-pthread [liblapack.so.3] 0.3.19+ds-3
ii libopenmpi3 4.1.2-1
ii libpython3.9 3.9.10-1
ii libscalapack-openmpi2.1 2.1.0-4
ii mpi-default-bin 1.14
ii nwchem-data 7.0.2-2
nwchem recommends no packages.
nwchem suggests no packages.
-- no debconf information