Bug#1074350: nvidia-kernel-dkms: Trying to modprobe nvidia-peermem to use NCCL/RDMA/Infiniband with GPUs

Jeffrey Mark Siskind qobi at qobi.org
Wed Jul 3 23:49:29 BST 2024


   > How can we/I compile the nvidia-peermem driver with Mellanox
     ib_peer_mem symbols?

   Probably not a problem of the nvidia-peermem module but of the kernel 
   (or a third-party module) that needs to provide these symbols.

First, I upgraded to Debian 12.6, so I am now running nvidia-driver 535.183.01.

I then installed doca-ofed.

   https://developer.nvidia.com/doca-downloads?deployment_platform=Host-Server&deployment_package=DOCA-Host&target_os=Linux&Architecture=x86_64&Profile=doca-ofed&Distribution=Debian&version=12.1&installer_type=deb_online

(I'll spare you the details. It did not install out of the box. If you
wish, I can tell you exactly what I did to install it.)

doca-ofed (aka MLNX_OFED) provides the necessary symbols.

Then I did this:

   dkms --force build -m nvidia-current -v 535.183.01
   dkms --force install -m nvidia-current -v 535.183.01

I did this to attempt to get nvidia-peermem to work. Indeed, now I get
a different error message:

   root at vuku:~# modprobe nvidia-peermem
   modprobe: ERROR: could not insert 'nvidia_current_peermem': Unknown symbol in module, or unknown parameter (see dmesg)
   modprobe: ERROR: ../libkmod/libkmod-module.c:1047 command_do() Error running install command 'modprobe nvidia ; modprobe -i nvidia-current-peermem ' for module nvidia_peermem: retcode 1
   modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
   root at vuku:~# dmesg|tail
   [161810.093263] Compat-mlnx-ofed backport release: 91fb8cd
   [161810.093272] Backport based on https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git 91fb8cd
   [161810.093274] compat.git: https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git
   [161810.342539] nvidia_peermem: Unknown symbol ib_register_peer_memory_client (err -2)
   [161810.342554] nvidia_peermem: Unknown symbol ib_unregister_peer_memory_client (err -2)
   root at vuku:~# 

I think I am getting close to getting nvidia-peermem to work.

   /usr/src/nvidia-current-535.183.01/nvidia-peermem/nvidia-peermem.Kbuild

has

   OFA_DIR := /usr/src/ofa_kernel
   OFA_CANDIDATES = $(OFA_DIR)/$(OFA_ARCH)/$(KERNELRELEASE) $(OFA_DIR)/$(KERNELRELEASE) $(OFA_DIR)/default /var/lib/dkms/mlnx-ofed-kernel

And three of the four candidate directories exist:

   qobi at vuku>ls /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/
   compat/           compat_base_tree_version  configure@           Module.symvers
   compat_base       compat.config             configure.mk.kernel  ofed_scripts/
   compat_base_tree  compat_version            include/
   qobi at vuku>ls /usr/src/ofa_kernel/6.1.0-22-amd64/
   ls: cannot access '/usr/src/ofa_kernel/6.1.0-22-amd64/': No such file or directory
   qobi at vuku>ls /usr/src/ofa_kernel/default
   /usr/src/ofa_kernel/default@
   qobi at vuku>ls /usr/src/ofa_kernel/default/
   compat/           compat_base_tree_version  configure@           Module.symvers
   compat_base       compat.config             configure.mk.kernel  ofed_scripts/
   compat_base_tree  compat_version            include/
   qobi at vuku>ls /var/lib/dkms/mlnx-ofed-kernel
   24.04.OFED.24.04.0.7.0.1/  kernel-6.1.0-22-amd64-x86_64@
   qobi at vuku>

There appear to be two different Module.symvers files, but they appear to be identical:

   qobi at vuku>ls -l /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
   -rw-r--r-- 1 root root 92655 Jul  3 15:49 /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
   qobi at vuku>ls -l /usr/src/ofa_kernel/default/Module.symvers
   -rw-r--r-- 1 root root 92655 Jul  3 15:49 /usr/src/ofa_kernel/default/Module.symvers
   qobi at vuku>diff /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers /usr/src/ofa_kernel/default/Module.symvers
   qobi at vuku>

And they appear to have the requisite symbols:

   qobi at vuku>fgrep ib_register_peer_memory_client /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
   0xaba78e45        ib_register_peer_memory_client  /var/lib/dkms/mlnx-ofed-kernel/24.04.OFED.24.04.0.7.0.1/build/drivers/infiniband/core/ib_core   EXPORT_SYMBOL   
   qobi at vuku>fgrep ib_unregister_peer_memory_client /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
   0xbde5c050    ib_unregister_peer_memory_client        /var/lib/dkms/mlnx-ofed-kernel/24.04.OFED.24.04.0.7.0.1/build/drivers/infiniband/core/ib_core   EXPORT_SYMBOL   
   qobi at vuku>
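
The two lookups above can also be folded into one small helper. This is
only a sketch: symvers_lookup is a name made up here, and it assumes the
usual whitespace-separated Module.symvers layout (CRC, symbol, exporting
module, export type) seen in the output above.

```shell
# symvers_lookup SYMBOL FILE
# Hypothetical helper: print the CRC and the exporting module recorded
# for SYMBOL in a Module.symvers-style FILE.
# Assumed field order: CRC, symbol name, module path, export type.
symvers_lookup() {
    awk -v sym="$1" '$2 == sym { print $1, $3 }' "$2"
}
```

For example:

   symvers_lookup ib_register_peer_memory_client /usr/src/ofa_kernel/default/Module.symvers

should name the ib_core built under /var/lib/dkms/mlnx-ofed-kernel as the
exporter, matching the fgrep output.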

And the built module appears to contain those symbol names:

   qobi at vuku>fgrep ib_register_peer_memory_client /usr/lib/modules/6.1.0-22-amd64/updates/dkms/nvidia-current-peermem.ko
   grep: /usr/lib/modules/6.1.0-22-amd64/updates/dkms/nvidia-current-peermem.ko: binary file matches
   qobi at vuku>strings /usr/lib/modules/6.1.0-22-amd64/updates/dkms/nvidia-current-peermem.ko|fgrep ib_register_peer_memory_client
   ib_register_peer_memory_client
   ib_register_peer_memory_client
   qobi at vuku>

So I don't know why the module doesn't load.
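
One further check might narrow this down. Module.symvers is consulted
when the module is built, whereas the "Unknown symbol ... (err -2)" lines
in dmesg are produced at load time, when the running kernel cannot
resolve the name. The sketch below tests whether a symbol name is
currently listed in /proc/kallsyms. kernel_has_symbol is a hypothetical
name, and the optional second argument exists only so that the check can
be run against a saved copy of the symbol table.

```shell
# kernel_has_symbol SYMBOL [KALLSYMS_FILE]
# Hypothetical helper: succeed (exit 0) if SYMBOL appears as a symbol name
# in the kallsyms-style file, fail (exit 1) otherwise.
# Assumed line format: address, type, name, optional [module].
kernel_has_symbol() {
    awk -v sym="$1" '$3 == sym { found = 1 } END { exit !found }' \
        "${2:-/proc/kallsyms}"
}
```

It could be used like this:

   kernel_has_symbol ib_register_peer_memory_client && echo present || echo missing
   modinfo -n ib_core

If the symbol is missing while ib_core is loaded, one possible
explanation (an assumption, not something confirmed here) is that the
loaded ib_core is the stock kernel one rather than the one built by
mlnx-ofed-kernel; modinfo -n prints the file that modprobe would pick.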

Any ideas?

    Thanks,
    Jeff (http://engineering.purdue.edu/~qobi)