Bug#1074350: nvidia-kernel-dkms: Trying to modprobe nvidia-peermem to use NCCL/RDMA/Infiniband with GPUs
Jeffrey Mark Siskind
qobi at qobi.org
Sat Jul 6 20:58:29 BST 2024
If you have more hints for other people trying to get that running, too,
you could leave them here in this bug.
Enclosed.
Jeff (http://engineering.purdue.edu/~qobi)
---------------------------------------------------------------------------------
Debian 12.6 provides packages for OFED based on the standard OFED
release.
rdma-core
libibverbs1
librdmacm1
libibmad5
libibumad3
ibverbs-providers
rdmacm-utils
infiniband-diags
libfabric1
ibverbs-utils
firmware-linux-nonfree
librdmacm-dev
libibverbs-dev
ibutils
Mellanox (now NVidia) provides modified OFED packages (sometimes
called MLNX_OFED or MOFED).
There are at least two reasons why you might want the MOFED variants.
1. If you have an unmanaged switch, you will need a subnet
manager. OFED and MOFED provide opensm, a software subnet
manager. However, the version of opensm in Debian 12.6 does not
support NDR, the 400Gb/s data rate supported by the ConnectX-7
generation of Mellanox Infiniband hardware. With it, you will not
be able to use IPoIB, ib_write_bw, or NCCL on such hardware. For
this, you need the version of opensmd in MOFED.
2. If you have SXM GPUs, such as those produced by the NVidia DGX
A100 and DGX H100, and variants like those produced by SMC (A100s
use SXM4 and H100s use SXM5), you might have a dedicated IB NIC
associated with each GPU, distinct from the IB NIC on the
chassis. To get the highest possible bandwidth from this
hardware with NCCL, you need GPU Direct RDMA. This is provided by
a kernel module, nvidia-peermem. In Debian 12.6, this is called
nvidia-current-peermem.ko and is loaded with
# modprobe nvidia-peermem
This is included in the Debian 12.6 nvidia-kernel-dkms package,
which is included in the nvidia-driver meta package. However, if the
module is built by dkms with the standard Debian 12.6 OFED
packages, it will not have the requisite symbols and so will not
load. (Actually, it may load but immediately exit.) The
requisite symbols are only available with MOFED.
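Since the module may appear to load and then go away, it is worth checking afterwards that it is really there. A minimal sketch, assuming the loaded module shows up in /proc/modules as nvidia_peermem or nvidia_current_peermem (which name appears may depend on the driver packaging); peermem_loaded is a hypothetical helper name:

```shell
# Sketch: succeed if a peermem module is listed in a /proc/modules-style file.
# Assumption: the module name is nvidia_peermem or nvidia_current_peermem.
peermem_loaded() {
    # $1: file to inspect; defaults to the live kernel module list
    modules_file="${1:-/proc/modules}"
    grep -Eq '^nvidia(_current)?_peermem ' "$modules_file"
}

# Possible use, after running modprobe:
#   peermem_loaded && echo "peermem loaded" || echo "peermem NOT loaded"
```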
There are currently two versions of MOFED: MLNX_OFED and DOCA-OFED.
The plan is for NVidia to stop supporting MLNX_OFED and switch to
DOCA-OFED in the future. Even though MLNX_OFED is still available,
you will likely want DOCA-OFED for the following reason.
MLNX_OFED is distributed as a tgz file with an install shell script.
That script removes the (Debian 12.6) OFED packages, and with them
all packages that depend on them. For me, that removed
libboost-all-dev
libopenmpi-dev
libopencv-dev
python3-opencv
python3-torch
ros-desktop-full-dev
For me, this is a showstopper.
The reason is that MOFED contains three packages: hcoll, sharp, and
ucx. hcoll depends on sharp which, in turn, depends on ucx. ucx
conflicts with the standard Debian 12.6 libucx0. All of the above
ultimately depend on libucx0.
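One way to preview this kind of collateral removal before committing to it is apt-get's simulation mode (-s), which prints a "Remv" line for every package it would remove. The sketch below only filters that output; would_remove is a hypothetical helper name, and simulating the removal of libucx0 is just an approximation of what the MLNX_OFED script does, not a replay of it.

```shell
# Sketch: read `apt-get -s ...` output on stdin and print the names of the
# packages that would be removed (lines starting with "Remv").
would_remove() {
    awk '$1 == "Remv" { print $2 }'
}

# Possible use (simulation only, nothing is changed):
#   apt-get -s remove libucx0 | would_remove
```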
In contrast, DOCA-OFED is distributed as a standard Debian (meta)
package. If you install it with apt, it will detect that the MOFED
rdma-core replaces the Debian 12.6 rdma-core and not remove
everything. Sort of. See below.
There are several problems with the install instructions at
https://developer.nvidia.com/doca-downloads
1. The package signature doesn't work. So you need to instead do
# echo "deb [trusted=yes signed-by=/etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub] $DOCA_URL ./" > /etc/apt/sources.list.d/doca.list
2. If you previously installed Debian 12.6 rdma-core, it has a
service called iwpmd.service which runs /usr/sbin/iwpmd.
Installing DOCA-OFED replaces rdma-core which doesn't use iwpmd.
It will remove /usr/sbin/iwpmd but it won't remove
/etc/init.d/iwpmd. The postinst script for DOCA-OFED rdma-core
detects the presence of /etc/init.d/iwpmd and attempts to start
iwpmd.service. This fails because /usr/sbin/iwpmd is gone. The
install of rdma-core and all dependencies then fails. To solve
this you need to do
# systemctl stop iwpmd
# systemctl disable iwpmd
# rm /etc/init.d/iwpmd
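To tell whether a machine is in the state described in (2), you can test for the init script being present while the daemon binary is missing. A small sketch; iwpmd_is_stale and its optional root-prefix argument are hypothetical, the prefix being there only so the check can be tried against a scratch directory:

```shell
# Sketch: succeed when /etc/init.d/iwpmd exists but /usr/sbin/iwpmd is gone
# (or not executable), i.e. the leftover state that breaks the postinst script.
iwpmd_is_stale() {
    # $1: optional root prefix (hypothetical, for testing); empty means "/"
    root="${1:-}"
    [ -e "$root/etc/init.d/iwpmd" ] && [ ! -x "$root/usr/sbin/iwpmd" ]
}

# Possible use:
#   iwpmd_is_stale && echo "stale iwpmd init script found"
```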
3. doca-ofed has a dependency on ibutils2 provided by doca-ofed.
This conflicts with ibutils provided by Debian 12.6. But it
doesn't remove it. To solve this you need to do
# apt remove ibutils
# apt autoremove
4. Now, nominally, you can do
# apt install doca-ofed
But you likely don't want to do this, because this will remove
libucx0. So instead you can do
# apt install \
ibsim-doc \
mlnx-tools \
mlnx-ofed-kernel-utils \
mlnx-ofed-kernel-dkms \
rdma-core \
libibumad3 \
libibmad5 \
infiniband-diags \
libibnetdisc5 \
libibverbs-dev \
libibverbs1 \
ibverbs-providers \
libibumad-dev \
libibmad-dev \
ibverbs-utils \
librdmacm-dev \
librdmacm1 \
opensm-doc \
mlnx-ethtool \
srp-dkms \
knem \
mft \
openmpi \
iser-dkms \
ibarr \
ibdump \
ibacm \
mlnx-iproute2 \
srptools \
libopensm \
libopensm-devel \
ofed-scripts \
opensm \
ibutils2 \
kernel-mft-dkms \
mpitests \
isert-dkms \
rdmacm-utils \
ibsim \
knem-dkms \
perftest \
rshim
This will install all of doca-ofed except hcoll, sharp, and ucx,
allowing you to keep libucx0.
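Rather than maintaining the list in (4) by hand, it might be possible to derive it from the meta package. The sketch below filters the output of apt-cache depends and drops hcoll, sharp, and ucx. It assumes doca-ofed lists its components as direct Depends, which is not verified here, and doca_ofed_without_ucx is a hypothetical helper name.

```shell
# Sketch: read `apt-cache depends PKG` output on stdin and print every
# "Depends:" target except hcoll, sharp, and ucx.
# Assumption: the meta package lists its components as direct Depends.
doca_ofed_without_ucx() {
    awk '$1 ~ /^[|]?Depends:$/ && $2 !~ /^(hcoll|sharp|ucx)$/ { print $2 }'
}

# Possible use (inspect the list before passing it to apt install):
#   apt-cache depends doca-ofed | doca_ofed_without_ucx
```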
5. The right way to handle this would be for Debian to officially
package doca-ofed and use the alternatives mechanism to make its
ucx an alternative for libucx0, libucx-dev, and ucx-utils, and
also to make a meta package for OFED and use the alternatives
mechanism to make ofed and doca-ofed alternatives.
6. Debian 12.6 OFED provides the package ibverbs-providers. This
provides components that support Infiniband hardware by various
vendors. The MLNX_OFED and DOCA-OFED variants of
ibverbs-providers only contain the component provided by NVidia.
You will get warnings about missing libraries if you use NCCL.
To eliminate those warnings, you can do the following BEFORE you
do step (4) above.
# mkdir -p ~/lib/libibverbs
# rsync -a -v -z /usr/lib/x86_64-linux-gnu/libibverbs/ ~/lib/libibverbs/
# rsync -a -v -z /usr/lib/x86_64-linux-gnu/libefa.so.1.2.44.0 ~/lib/
# rsync -a -v -z /usr/lib/x86_64-linux-gnu/libmana.so.1.0.44.0 ~/lib/
# rsync -a -v -z /usr/lib/x86_64-linux-gnu/libmlx4.so.1.0.44.0 ~/lib/
# rm ~/lib/libibverbs/libmlx5-rdmav34.so
Then when running NCCL do
$ export LD_LIBRARY_PATH=:~/lib/libibverbs/
There has got to be a cleaner way of doing this.
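A small tidy-up, not a real fix: the leading colon in the export above adds an empty entry to LD_LIBRARY_PATH, which the dynamic loader treats as the current directory, and the assignment discards any previous value. The sketch below prepends a directory without either side effect; prepend_ld_path is a hypothetical helper name.

```shell
# Sketch: put a directory first on LD_LIBRARY_PATH without creating an empty
# entry and without adding it twice.
prepend_ld_path() {
    # $1: directory to put first on LD_LIBRARY_PATH
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="$1${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}" ;;
    esac
    export LD_LIBRARY_PATH
}

# Possible use before running NCCL:
#   prepend_ld_path "$HOME/lib/libibverbs"
```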
7. If you wish to run GPU Direct RDMA, you will need to do
# dkms --force build -m nvidia-current -v 535.183.01
# dkms --force install -m nvidia-current -v 535.183.01
8. You may need to reboot.
DOCA-OFED requires running openibd.service. This should properly
start upon installation and reboot.
9. If you need to run opensm, you need to do
# systemctl start opensmd.service
# systemctl enable opensmd.service
This should run on exactly one host on an IB network.
10. If you want GPU Direct RDMA, you need to do
# modprobe nvidia-peermem
manually upon reboot. I don't know how to automate this. There
probably should be a service that does this.
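One possible way to automate the modprobe, untested here, is systemd's modules-load.d mechanism, which loads the modules named in /etc/modules-load.d/*.conf at boot. The sketch below writes such a file. The file name nvidia-peermem.conf and the helper name are made up for illustration, and it is an assumption that the module loads cleanly that early in boot (for instance before openibd.service has run).

```shell
# Sketch: write a modules-load.d entry asking for nvidia-peermem at boot.
# Assumptions: the file name is arbitrary, and loading this early works on
# your system; neither is confirmed by the report above.
write_peermem_autoload() {
    # $1: target directory; defaults to the system location
    dir="${1:-/etc/modules-load.d}"
    printf '%s\n' 'nvidia-peermem' > "$dir/nvidia-peermem.conf"
}

# Possible use (as root):
#   write_peermem_autoload
```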
11. It would be nice if all this were properly packaged up for
Debian, so all one would need to do is either
# apt install ofed
or
# apt install doca-ofed
12. As an aside, if you have SXM GPUs, you need what is called a
fabricmanager. Without one, you will not be able to use your GPUs
for CUDA (PyTorch). This is not available for Debian 12.6. But
one can download one from NVidia.
# wget -P /tmp https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/nvidia-fabricmanager-535_535.183.01-1_amd64.deb
# apt remove nvidia-fabricmanager-525
# dpkg -i /tmp/nvidia-fabricmanager-535_535.183.01-1_amd64.deb
# systemctl daemon-reload
# systemctl start nvidia-fabricmanager.service
The version of fabricmanager is specific to the version of the
NVidia driver. You need to remove the old one before installing
the new one.
A word to the wise: be careful when doing an apt upgrade. The
version of the NVidia driver might change. This will leave you
in a state where you don't have the correct version of
fabricmanager. So check that the NVidia website has a version of
fabricmanager for the driver version you are about to install, or
else your GPUs will become unusable.
Again, it would be nice if the official Debian nvidia-driver
package included the appropriate fabricmanager.
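The version check in (12) can be scripted. A minimal sketch, assuming the fabricmanager package version is the driver version followed by a Debian revision suffix (as in 535.183.01-1 above); fm_matches_driver is a hypothetical helper name:

```shell
# Sketch: succeed if the fabricmanager package version matches the driver.
# Assumption: the package version is "<driver version>-<revision>".
fm_matches_driver() {
    # $1: driver version, e.g. 535.183.01
    # $2: fabricmanager package version, e.g. 535.183.01-1
    [ "${2%%-*}" = "$1" ]
}

# Possible use, taking both versions from the running system:
#   drv=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)
#   fm=$(dpkg-query -W -f='${Version}' nvidia-fabricmanager-535)
#   fm_matches_driver "$drv" "$fm" || echo "fabricmanager does not match driver"
```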