Bug#1074350: nvidia-kernel-dkms: Trying to modprobe nvidia-peermem to use NCCL/RDMA/Infiniband with GPUs

Jeffrey Mark Siskind qobi at qobi.org
Sat Jul 6 20:58:29 BST 2024


   If you have more hints for other people trying to get that running, too, 
   you could leave them here in this bug.

Enclosed.

    Jeff (http://engineering.purdue.edu/~qobi)
---------------------------------------------------------------------------------
Debian 12.6 provides packages for OFED based on the standard OFED
release.

    rdma-core
    libibverbs1
    librdmacm1
    libibmad5
    libibumad3
    ibverbs-providers
    rdmacm-utils
    infiniband-diags
    libfabric1
    ibverbs-utils
    firmware-linux-nonfree
    librdmacm-dev
    libibverbs-dev
    ibutils

Mellanox (now NVidia) provides modified OFED packages (sometimes
called MLNX_OFED or MOFED).

There are at least two reasons why you might want the MOFED variants.

 1. If you have an unmanaged switch, you will need a subnet
    manager.  OFED and MOFED both provide opensm, a software subnet
    manager.  However, the version of opensm in Debian 12.6 does not
    support NDR, the 400Gb/s data rate supported by the ConnectX-7
    generation of Mellanox Infiniband hardware.  With that version
    you will not be able to use IPoIB, ib_write_bw, or NCCL on such
    hardware.  For this, you need the version of opensm in MOFED.
 2. If you have SXM GPUs, such as those in the NVidia DGX A100 and
    DGX H100 and in variants produced by SMC (A100s use SXM4 and
    H100s use SXM5), you might have a dedicated IB NIC associated
    with each GPU, distinct from the IB NIC on the chassis.  To get
    the highest possible bandwidth from this hardware with NCCL you
    need GPU Direct RDMA.  This is provided by a kernel module,
    nvidia-peermem.  In Debian 12.6, the module file is called
    nvidia-current-peermem.ko and is loaded with

    # modprobe nvidia-peermem

    This is included in the Debian 12.6 nvidia-kernel-dkms package,
    which is pulled in by the nvidia-driver meta package.  However, if
    the module is built by dkms against the standard Debian 12.6 OFED
    packages, it will not have the requisite symbols and thus will not
    load.  (Actually, it may load but immediately exit.)  The
    requisite symbols are only available with MOFED.
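
Because the module may load and then immediately exit, it helps to
check whether it is still listed after the modprobe.  The following
is a minimal sketch that looks in /proc/modules.  The helper name
peermem_loaded is made up for illustration, and the two module names
it accepts (nvidia_peermem and nvidia_current_peermem) are an
assumption about how the kernel reports the Debian build.

```shell
# Succeed if a peermem module is listed in a /proc/modules-style file.
# $1 (optional): file to inspect; defaults to /proc/modules.
# The accepted module names are an assumption, not confirmed for Debian.
peermem_loaded() {
    grep -Eq '^nvidia(_current)?_peermem ' "${1:-/proc/modules}" 2>/dev/null
}

if peermem_loaded; then
    echo "peermem module is loaded"
else
    echo "peermem module is not loaded; dmesg may show why"
fi
```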

There are currently two versions of MOFED: MLNX_OFED and DOCA-OFED.
The plan is for NVidia to stop supporting MLNX_OFED and switch to
DOCA-OFED in the future.  Even though MLNX_OFED is still available,
you will likely want DOCA-OFED for the following reason.

MLNX_OFED is distributed as a tgz file with an install shell script.
That script removes the (Debian 12.6) OFED packages, which in turn
removes every package that depends on them.  For me, that removed

    libboost-all-dev
    libopenmpi-dev
    libopencv-dev
    python3-opencv
    python3-torch
    ros-desktop-full-dev

For me, this is a showstopper.

The reason is that MOFED contains three packages: hcoll, sharp, and
ucx. hcoll depends on sharp which, in turn, depends on ucx.  ucx
conflicts with the standard Debian 12.6 libucx0.  All of the above
ultimately depend on libucx0.
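
To see how far this reaches on a given machine before committing to
anything, apt can simulate the removal.  A minimal sketch follows:
apt-get -s only prints what it would do and changes nothing, and the
helper name would_remove is made up for illustration.

```shell
# Print the names of packages that a simulated apt run would remove.
# Reads `apt-get -s ...` output on stdin; removal lines start with "Remv".
would_remove() {
    awk '$1 == "Remv" { print $2 }'
}

# Simulate removing libucx0 (nothing is changed) and list what would go.
if command -v apt-get >/dev/null 2>&1; then
    apt-get -s remove libucx0 2>/dev/null | would_remove
fi
```

If the list includes packages you cannot do without, the MLNX_OFED
install script is likely to take them out as well.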

In contrast, DOCA-OFED is distributed as a standard Debian (meta)
package.  If you install it with apt, it will detect that the MOFED
rdma-core replaces the Debian 12.6 rdma-core and not remove
everything.  Sort of.  See below.

There are several problems with the install instructions at

    https://developer.nvidia.com/doca-downloads

 1. The package signature doesn't work.  So you need to instead do

    # echo "deb [trusted=yes signed-by=/etc/apt/trusted.gpg.d/GPG-KEY-Mellanox.pub] $DOCA_URL ./" > /etc/apt/sources.list.d/doca.list

 2. If you previously installed Debian 12.6 rdma-core, it has a
    service called iwpmd.service which runs /usr/sbin/iwpmd.
    Installing DOCA-OFED replaces rdma-core with a version that
    doesn't use iwpmd.  It will remove /usr/sbin/iwpmd but it won't
    remove /etc/init.d/iwpmd.  The postinst script for the DOCA-OFED
    rdma-core detects the presence of /etc/init.d/iwpmd and attempts
    to start iwpmd.service.  This fails because /usr/sbin/iwpmd is
    gone.  The install of rdma-core and all packages that depend on
    it then fails.  To solve this you need to do

    # systemctl stop iwpmd
    # systemctl disable iwpmd
    # rm /etc/init.d/iwpmd

 3. doca-ofed has a dependency on ibutils2, which it provides itself.
    This conflicts with ibutils provided by Debian 12.6, but
    installing doca-ofed doesn't remove ibutils.  To solve this you
    need to do

    # apt remove ibutils
    # apt autoremove

 4. Now, nominally, you can do

    # apt install doca-ofed

    But you likely don't want to do this, because this will remove
    libucx0.  So instead you can do

    # apt install\
	  ibsim-doc\
	  mlnx-tools\
	  mlnx-ofed-kernel-utils\
	  mlnx-ofed-kernel-dkms\
	  rdma-core\
	  libibumad3\
	  libibmad5\
	  infiniband-diags\
	  libibnetdisc5\
	  libibverbs-dev\
	  libibverbs1\
	  ibverbs-providers\
	  libibumad-dev\
	  libibmad-dev\
	  ibverbs-utils\
	  librdmacm-dev\
	  librdmacm1\
	  opensm-doc\
	  mlnx-ethtool\
	  srp-dkms\
	  knem\
	  mft\
	  openmpi\
	  iser-dkms\
	  ibarr\
	  ibdump\
	  ibacm\
	  mlnx-iproute2\
	  srptools\
	  libopensm\
	  libopensm-devel\
	  ofed-scripts\
	  opensm\
	  ibutils2\
	  kernel-mft-dkms\
	  mpitests\
	  isert-dkms\
	  rdmacm-utils\
	  ibsim\
	  knem-dkms\
	  perftest\
	  rshim

    This will install all of doca-ofed except hcoll, sharp, and ucx,
    allowing you to keep libucx0.

  5. The right way to handle this would be for Debian to officially
     package up doca-ofed and use the alternatives mechanism to make
     ucx an alternative for libucx0, libucx-dev, and ucx-utils.  And
     also to make a meta package for OFED and use the alternatives
     mechanism to make ofed and doca-ofed alternatives.

  6. Debian 12.6 OFED provides the package ibverbs-providers.  This
     provides components that support Infiniband hardware from various
     vendors.  The MLNX_OFED and DOCA-OFED variants of
     ibverbs-providers contain only the component provided by NVidia.
     You will get warnings about missing libraries if you use NCCL.
     To eliminate those warnings, you can do the following BEFORE you
     do step (4) above.

     # mkdir -p ~/lib/libibverbs
     # rsync -a -v -z /usr/lib/x86_64-linux-gnu/libibverbs/ ~/lib/libibverbs/
     # rsync -a -v -z /usr/lib/x86_64-linux-gnu/libefa.so.1.2.44.0 ~/lib/
     # rsync -a -v -z /usr/lib/x86_64-linux-gnu/libmana.so.1.0.44.0 ~/lib/
     # rsync -a -v -z /usr/lib/x86_64-linux-gnu/libmlx4.so.1.0.44.0 ~/lib/
     # rm ~/lib/libibverbs/libmlx5-rdmav34.so

     Then when running NCCL do

     $ export LD_LIBRARY_PATH=:~/lib/libibverbs/

     There has got to be a cleaner way of doing this.

  7. If you wish to run GPU Direct RDMA, you will need to rebuild the
     NVidia kernel modules against MOFED with

     # dkms --force build -m nvidia-current -v 535.183.01
     # dkms --force install -m nvidia-current -v 535.183.01

  8. You may need to reboot.

     DOCA-OFED requires running openibd.service.  This should properly
     start upon installation and reboot.

  9. If you need to run opensm, you need to do

     # systemctl start opensmd.service
     # systemctl enable opensmd.service

     This should run on exactly one host on an IB network.

 10. If you want GPU Direct RDMA, you need to do

     # modprobe nvidia-peermem

     manually upon reboot.  I don't know how to automate this.  There
     probably should be a service that does this.

 11. It would be nice if all this were properly packaged up for
     Debian, so all one would need to do is either

     # apt install ofed

     or

     # apt install doca-ofed

 12. As an aside, if you have SXM GPUs, you need what is called a
     fabricmanager.  Without one, you will not be able to use your GPUs
     for CUDA (PyTorch).  This is not available for Debian 12.6.  But
     one can download one from NVidia.

     # wget -P /tmp https://developer.download.nvidia.com/compute/cuda/repos/debian11/x86_64/nvidia-fabricmanager-535_535.183.01-1_amd64.deb
     # apt remove nvidia-fabricmanager-525
     # dpkg -i /tmp/nvidia-fabricmanager-535_535.183.01-1_amd64.deb
     # systemctl daemon-reload
     # systemctl start nvidia-fabricmanager.service

     The version of fabricmanager is specific to the version of the
     NVidia driver.  You need to remove the old one before installing
     the new one.

     A word to the wise: be careful when doing an apt upgrade.  The
     version of the NVidia driver might change.  This will leave you
     in a state where you don't have the correct version of
     fabricmanager.  So check that the NVidia website has a
     fabricmanager version for the driver version you are about to
     install or else your GPUs will become unusable.

     Again, it would be nice if the official Debian nvidia-driver
     package included the appropriate fabricmanager.
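
Returning to step (10): one untested possibility for loading
nvidia-peermem at boot is a small oneshot systemd unit ordered after
openibd.service, so that the MOFED stack is up before the modprobe
runs.  The sketch below only prints such a unit.  The unit name
nvidia-peermem-load.service and the helper name peermem_unit are
made up for illustration, and whether ordering after openibd.service
is sufficient on a given system is an assumption.

```shell
# Print a oneshot unit that loads nvidia-peermem once openibd.service
# has started.  The unit and function names are hypothetical.
peermem_unit() {
    cat <<'EOF'
[Unit]
Description=Load nvidia-peermem for GPU Direct RDMA
After=openibd.service
Wants=openibd.service

[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/modprobe nvidia-peermem

[Install]
WantedBy=multi-user.target
EOF
}

# Show the unit.  To install it, as root:
#   peermem_unit > /etc/systemd/system/nvidia-peermem-load.service
#   systemctl daemon-reload
#   systemctl enable --now nvidia-peermem-load.service
peermem_unit
```

A one-line file in /etc/modules-load.d would be simpler, but it
would load the module early in boot with no ordering relative to
openibd.service.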


