Bug#1074350: nvidia-kernel-dkms: Trying to modprobe nvidia-peermem to use NCCL/RDMA/Infiniband with GPUs
Jeffrey Mark Siskind
qobi at qobi.org
Wed Jul 3 23:49:29 BST 2024
> How can we/I compile the nvidia-peermem driver with Mellanox
> ib_peer_mem symbols?
Probably not a problem of the nvidia-peermem module but of the kernel
(or a third-party module) that needs to provide these symbols.
First, I upgraded to Debian 12.6. So I am running nvidia-driver 535.183.01.
I then installed doca-ofed.
https://developer.nvidia.com/doca-downloads?deployment_platform=Host-Server&deployment_package=DOCA-Host&target_os=Linux&Architecture=x86_64&Profile=doca-ofed&Distribution=Debian&version=12.1&installer_type=deb_online
(I'll spare you the details. It did not install out of the box. If you
wish, I can tell you exactly what I did to install it.)
doca-ofed (aka MLNX_OFED) provides the necessary symbols.
Then I did this:
dkms --force build -m nvidia-current -v 535.183.01
dkms --force install -m nvidia-current -v 535.183.01
I did this to attempt to get nvidia-peermem to work. Indeed, now I get
a different error message:
root@vuku:~# modprobe nvidia-peermem
modprobe: ERROR: could not insert 'nvidia_current_peermem': Unknown symbol in module, or unknown parameter (see dmesg)
modprobe: ERROR: ../libkmod/libkmod-module.c:1047 command_do() Error running install command 'modprobe nvidia ; modprobe -i nvidia-current-peermem ' for module nvidia_peermem: retcode 1
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument
root@vuku:~# dmesg|tail
[161810.093263] Compat-mlnx-ofed backport release: 91fb8cd
[161810.093272] Backport based on https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git 91fb8cd
[161810.093274] compat.git: https://:@git-nbu.nvidia.com/r/a/mlnx_ofed/mlnx-ofa_kernel-4.0.git
[161810.342539] nvidia_peermem: Unknown symbol ib_register_peer_memory_client (err -2)
[161810.342554] nvidia_peermem: Unknown symbol ib_unregister_peer_memory_client (err -2)
root@vuku:~#
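For reference, "err -2" is -ENOENT: when the module was inserted, nothing in the running kernel or in the already loaded modules provided a symbol of that name. One way to check this directly is to look the names up in /proc/kallsyms. The helper below is only a sketch; the name kernel_has_symbol is made up for illustration, and it says nothing about why a symbol is missing.

```sh
# Succeed if the named symbol appears in a kallsyms-format listing.
# The listing defaults to /proc/kallsyms; each line is "address type name [module]".
# (kernel_has_symbol is a hypothetical helper name, not part of any package.)
kernel_has_symbol() {
    awk -v sym="$1" '$3 == sym { found = 1 } END { exit !found }' "${2:-/proc/kallsyms}"
}

# Example use on the affected machine:
# for s in ib_register_peer_memory_client ib_unregister_peer_memory_client; do
#     if kernel_has_symbol "$s"; then echo "$s: present"; else echo "$s: missing"; fi
# done
```

If both names come back missing while ib_core is loaded, the ib_core that is loaded is probably not the one built by doca-ofed, but that is a guess to be verified, not a conclusion.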
I think I am getting close to getting nvidia-peermem to work.
/usr/src/nvidia-current-535.183.01/nvidia-peermem/nvidia-peermem.Kbuild
has
OFA_DIR := /usr/src/ofa_kernel
OFA_CANDIDATES = $(OFA_DIR)/$(OFA_ARCH)/$(KERNELRELEASE) $(OFA_DIR)/$(KERNELRELEASE) $(OFA_DIR)/default /var/lib/dkms/mlnx-ofed-kernel
And three of the four candidate directories exist:
qobi@vuku>ls /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/
compat/ compat_base_tree_version configure@ Module.symvers
compat_base compat.config configure.mk.kernel ofed_scripts/
compat_base_tree compat_version include/
qobi@vuku>ls /usr/src/ofa_kernel/6.1.0-22-amd64/
ls: cannot access '/usr/src/ofa_kernel/6.1.0-22-amd64/': No such file or directory
qobi@vuku>ls /usr/src/ofa_kernel/default
/usr/src/ofa_kernel/default@
qobi@vuku>ls /usr/src/ofa_kernel/default/
compat/ compat_base_tree_version configure@ Module.symvers
compat_base compat.config configure.mk.kernel ofed_scripts/
compat_base_tree compat_version include/
qobi@vuku>ls /var/lib/dkms/mlnx-ofed-kernel
24.04.OFED.24.04.0.7.0.1/ kernel-6.1.0-22-amd64-x86_64@
qobi@vuku>
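Which of these directories the build ends up using depends on how the Kbuild consumes OFA_CANDIDATES, which the two lines quoted above do not show. Assuming it takes the first candidate that exists as a directory, and assuming OFA_ARCH is the output of uname -m, the selection can be reproduced by hand with something like the following (first_existing_dir is a made-up helper name):

```sh
# Print the first argument that is an existing directory; return 1 if none is.
# [ -d ] follows symlinks, so a symlink such as .../default@ counts as a directory.
# (first_existing_dir is a hypothetical helper, not taken from the Kbuild.)
first_existing_dir() {
    for d in "$@"; do
        if [ -d "$d" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    return 1
}

# Example call with the candidate list from the Kbuild
# (OFA_ARCH is assumed here to be `uname -m`):
# first_existing_dir \
#     "/usr/src/ofa_kernel/$(uname -m)/$(uname -r)" \
#     "/usr/src/ofa_kernel/$(uname -r)" \
#     /usr/src/ofa_kernel/default \
#     /var/lib/dkms/mlnx-ofed-kernel
```

Under those assumptions the first candidate, /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64, would be the one selected on this machine.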
There appear to be two different Module.symvers, but they appear to be identical:
qobi@vuku>ls -l /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
-rw-r--r-- 1 root root 92655 Jul 3 15:49 /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
qobi@vuku>ls -l /usr/src/ofa_kernel/default/Module.symvers
-rw-r--r-- 1 root root 92655 Jul 3 15:49 /usr/src/ofa_kernel/default/Module.symvers
qobi@vuku>diff /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers /usr/src/ofa_kernel/default/Module.symvers
qobi@vuku>
And they appear to have the requisite symbols:
qobi@vuku>fgrep ib_register_peer_memory_client /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
0xaba78e45 ib_register_peer_memory_client /var/lib/dkms/mlnx-ofed-kernel/24.04.OFED.24.04.0.7.0.1/build/drivers/infiniband/core/ib_core EXPORT_SYMBOL
qobi@vuku>fgrep ib_unregister_peer_memory_client /usr/src/ofa_kernel/x86_64/6.1.0-22-amd64/Module.symvers
0xbde5c050 ib_unregister_peer_memory_client /var/lib/dkms/mlnx-ofed-kernel/24.04.OFED.24.04.0.7.0.1/build/drivers/infiniband/core/ib_core EXPORT_SYMBOL
qobi@vuku>
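The third field of those Module.symvers lines names the module that exports the symbol, here the ib_core built under /var/lib/dkms/mlnx-ofed-kernel. A small helper to pull that field out, so that it can be compared with the ib_core that modprobe would actually load, might look like this (symvers_exporter is a made-up name, and the comparison is a suggestion, not something I have run):

```sh
# Print the exporting module recorded for a symbol in a Module.symvers file.
# Fields are whitespace-separated: CRC, symbol, module path, export type.
# (symvers_exporter is a hypothetical helper name.)
symvers_exporter() {
    awk -v sym="$2" '$2 == sym { print $3; exit }' "$1"
}

# Example use:
# symvers_exporter /usr/src/ofa_kernel/default/Module.symvers ib_register_peer_memory_client
# modinfo -n ib_core    # path of the ib_core.ko that modprobe would load
```

If modinfo -n ib_core points at the stock kernel's ib_core rather than at a module installed by the mlnx-ofed-kernel DKMS build, the symbols would be known at build time but absent at load time.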
And the module appears to contain those symbols:
qobi@vuku>fgrep ib_register_peer_memory_client /usr/lib/modules/6.1.0-22-amd64/updates/dkms/nvidia-current-peermem.ko
grep: /usr/lib/modules/6.1.0-22-amd64/updates/dkms/nvidia-current-peermem.ko: binary file matches
qobi@vuku>strings /usr/lib/modules/6.1.0-22-amd64/updates/dkms/nvidia-current-peermem.ko|fgrep ib_register_peer_memory_client
ib_register_peer_memory_client
ib_register_peer_memory_client
qobi@vuku>
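A caveat on this check: a match from strings only shows that the name occurs somewhere in the file. A module that merely references a symbol defined elsewhere also stores the name in its symbol table, so the match is expected either way. Listing the symbols with nm separates the two cases, since undefined references are marked "U". A sketch of a filter over nm output (undefined_matching is a made-up name):

```sh
# Read `nm` output on stdin and print the undefined symbols ("U" entries)
# whose names match the given pattern.
# (undefined_matching is a hypothetical helper name.)
undefined_matching() {
    awk -v pat="$1" '$1 == "U" && $2 ~ pat { print $2 }'
}

# Example use:
# nm /usr/lib/modules/6.1.0-22-amd64/updates/dkms/nvidia-current-peermem.ko \
#     | undefined_matching peer_memory_client
```

Names printed by this filter are ones the module expects some other module, or the kernel itself, to provide when it is loaded.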
So I don't know why the module doesn't load.
Any ideas?
Thanks,
Jeff (http://engineering.purdue.edu/~qobi)