Bug#1074350: nvidia-kernel-dkms: Trying to modprobe nvidia-peermem to use NCCL/RDMA/Infiniband with GPUs

Jeffrey Mark Siskind qobi at qobi.org
Thu Jun 27 09:54:59 BST 2024


   After modprobe failed, is there any output in the kernel log (dmesg)?

   Do you get more helpful output when using 'modprobe -v' to load the module?

   What happens if you manually try to load it, bypassing the config 
   file: 'modprobe -i -v nvidia-current-peermem'

See enclosed.

   Next weekend with the next bookworm point release we will switch to the 
   535 driver series in bookworm. Packages are already available in 
   stable-proposed-updates.
   Please try that new version, and if the problem persists, we will dig 
   deeper.

I will try that and get back to you.

    Jeff (http://engineering.purdue.edu/~qobi)
---------------------------------------------------------------------------------
root@poto:~# dmesg|tail
[  362.880934] Key type id_resolver registered
[  362.880935] Key type id_legacy registered
[ 1450.194437] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 1450.216460] nvidia-uvm: Loaded the UVM driver, major device number 236.
[ 3026.469023] nvidia_peermem: unknown parameter '0' ignored
[ 3028.248692] nvidia_peermem: unknown parameter '1' ignored
[ 6978.950808] process '@rootfs/usr/bin/ikarus-scheme-script' started with executable stack
[16005.412570] systemd-journald[3524]: Data hash table of /var/log/journal/da0d3b6a9b2a49d78d775c6393b77418/system.journal has a fill level at 75.0 (174764 of 233016 items, 50331648 file size, 287 bytes per hash table item), suggesting rotation.
[16005.412584] systemd-journald[3524]: /var/log/journal/da0d3b6a9b2a49d78d775c6393b77418/system.journal: Journal header limits reached or header out-of-date, rotating.
[54652.534145] perf: interrupt took too long (2686 > 2500), lowering kernel.perf_event_max_sample_rate to 74250

root@poto:~# modprobe nvidia-peermem
modprobe: ERROR: could not insert 'nvidia_current_peermem': Invalid argument
modprobe: ERROR: ../libkmod/libkmod-module.c:1047 command_do() Error running install command 'modprobe nvidia ; modprobe -i nvidia-current-peermem ' for module nvidia_peermem: retcode 1
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument

root@poto:~# dmesg|tail
[  362.880934] Key type id_resolver registered
[  362.880935] Key type id_legacy registered
[ 1450.194437] nvidia_uvm: module uses symbols nvUvmInterfaceDisableAccessCntr from proprietary module nvidia, inheriting taint.
[ 1450.216460] nvidia-uvm: Loaded the UVM driver, major device number 236.
[ 3026.469023] nvidia_peermem: unknown parameter '0' ignored
[ 3028.248692] nvidia_peermem: unknown parameter '1' ignored
[ 6978.950808] process '@rootfs/usr/bin/ikarus-scheme-script' started with executable stack
[16005.412570] systemd-journald[3524]: Data hash table of /var/log/journal/da0d3b6a9b2a49d78d775c6393b77418/system.journal has a fill level at 75.0 (174764 of 233016 items, 50331648 file size, 287 bytes per hash table item), suggesting rotation.
[16005.412584] systemd-journald[3524]: /var/log/journal/da0d3b6a9b2a49d78d775c6393b77418/system.journal: Journal header limits reached or header out-of-date, rotating.
[54652.534145] perf: interrupt took too long (2686 > 2500), lowering kernel.perf_event_max_sample_rate to 74250

root@poto:~# modprobe -v nvidia-peermem
install modprobe nvidia ; modprobe -i nvidia-current-peermem $CMDLINE_OPTS 
insmod /lib/modules/6.1.0-21-amd64/updates/dkms/nvidia-current-peermem.ko 
modprobe: ERROR: could not insert 'nvidia_current_peermem': Invalid argument
modprobe: ERROR: ../libkmod/libkmod-module.c:1047 command_do() Error running install command 'modprobe nvidia ; modprobe -i nvidia-current-peermem ' for module nvidia_peermem: retcode 1
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument

root@poto:~# modprobe nvidia-peermem
modprobe: ERROR: could not insert 'nvidia_current_peermem': Invalid argument
modprobe: ERROR: ../libkmod/libkmod-module.c:1047 command_do() Error running install command 'modprobe nvidia ; modprobe -i nvidia-current-peermem ' for module nvidia_peermem: retcode 1
modprobe: ERROR: could not insert 'nvidia_peermem': Invalid argument

root@poto:~# modprobe -i -v nvidia-current-peermem
insmod /lib/modules/6.1.0-21-amd64/updates/dkms/nvidia-current-peermem.ko 
modprobe: ERROR: could not insert 'nvidia_current_peermem': Invalid argument
root@poto:~#
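
In the dmesg tails above, only two lines mention nvidia_peermem; the rest are unrelated (journald rotation, perf, and so on). As a minimal sketch, the kernel log could be narrowed to the module in question with a small filter. The function name peermem_lines is hypothetical and not part of any package; it only assumes that grep is available.

```shell
# Hypothetical helper (not part of any package): keep only kernel log lines
# that mention the peermem module under either of the names seen above
# (nvidia_peermem, nvidia-current-peermem / nvidia_current_peermem).
# It reads log text on stdin, so it works with dmesg output or a saved log.
peermem_lines() {
    grep -E 'nvidia[_-](current[_-])?peermem'
}
```

It would be used as 'dmesg | peermem_lines', run as root like the commands above. Note that grep exits non-zero when nothing matches.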


