Bug#1074032: [#1074032] nvidia-smi locks up machine with new file descriptor limit
Mark Glines
mark at glines.org
Fri Aug 16 17:32:48 BST 2024
Steve, what happens when you run it with strace?
I am running the same version of nvidia-smi as you, and noticing that it allocates a TON of memory now, for no (apparent) reason. That started happening sometime in the past few months. I think it may be related to the symptoms you describe, but I am not completely sure.
Here's the relevant snippet from the output of `strace nvidia-smi`:
2285205 stat("/var/run/nvidia-persistenced/socket", {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
2285205 socket(AF_UNIX, SOCK_STREAM, 0) = 9
2285205 connect(9, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, 37) = 0
2285205 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
2285205 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1073741816}) = 0
2285205 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5709400000
2285205 mmap(NULL, 51539607552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b09400000
2285205 getpeername(9, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, [128 => 38]) = 0
Quick summary: it opens the persistenced socket, reads the file descriptor limits (soft limit 1024, hard limit 1073741816, roughly 1 billion), allocates 4 GiB of memory, allocates another 48 GiB (about 51.5 GB) of memory (!!!), and then proceeds to use the persistenced socket.
This happens after reporting the card and driver versions, and before listing the processes. Whatever it's doing during this time period, it delays execution for ~25 seconds, too. The `top` command says that the nvidia-smi process has 52.0g virt, 49.6g resident.
PID VIRT RES S %CPU %MEM TIME+ COMMAND
2285205 52.0g 49.6g R 100.0 39.4 0:22.14 nvidia-smi
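The two mmap lengths in the strace output line up suspiciously well with the hard limit reported by prlimit64. The sketch below is only arithmetic on the numbers quoted above; it assumes a 4096-byte page size, and the idea that nvidia-smi sizes these buffers from rlim_max is an inference from the numbers, not something confirmed by the driver. The helper names are made up for illustration.

```python
PAGE_SIZE = 4096           # assumption: x86-64 default page size

RLIM_MAX = 1073741816      # rlim_max from the prlimit64 line
FIRST_MMAP = 4294967296    # length of the first mmap call
SECOND_MMAP = 51539607552  # length of the second mmap call


def round_up_to_page(nbytes, page=PAGE_SIZE):
    """Round a byte count up to the next multiple of the page size."""
    return -(-nbytes // page) * page


def bytes_per_descriptor(mmap_len, nfiles=RLIM_MAX):
    """Return the whole-number per-descriptor size whose page-rounded
    total equals mmap_len, or None if the nearest one does not match."""
    guess = round(mmap_len / nfiles)
    if round_up_to_page(guess * nfiles) == mmap_len:
        return guess
    return None


for length in (FIRST_MMAP, SECOND_MMAP):
    print(length, "bytes =", length / 2**30, "GiB,",
          bytes_per_descriptor(length), "bytes per possible descriptor")
```

Under those assumptions the first mapping works out to 4 bytes and the second to 48 bytes per possible descriptor, and the two together come to 52 GiB, which matches the VIRT column in `top`.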
If you have less memory than it's asking for, that might be a reason for your machine to go into swap hell and eventually freeze.
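One experiment that might help narrow this down is to run the command with a much smaller RLIMIT_NOFILE hard limit and see whether the huge allocation goes away. This is a minimal sketch, not a confirmed workaround: the function name `run_with_nofile_cap` and the cap of 4096 are hypothetical, and whether nvidia-smi behaves differently under a lower limit is untested here. It uses only Python's standard `resource` and `subprocess` modules.

```python
import resource
import subprocess
import sys


def run_with_nofile_cap(cmd, cap=4096, **kwargs):
    """Run cmd in a child process whose RLIMIT_NOFILE soft and hard
    limits are lowered to at most cap. Extra keyword arguments are
    passed through to subprocess.run."""

    def lower_limits():
        # Runs in the child after fork and before exec, so the parent
        # process keeps its original limits.
        soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        if hard == resource.RLIM_INFINITY:
            new_hard = cap
        else:
            new_hard = min(hard, cap)
        if soft == resource.RLIM_INFINITY:
            new_soft = new_hard
        else:
            new_soft = min(soft, new_hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, new_hard))

    return subprocess.run(cmd, preexec_fn=lower_limits, **kwargs)


# Example (not executed here): run_with_nofile_cap(["nvidia-smi"], cap=4096)
```

Comparing `strace` output and memory use with and without the cap would show whether the mmap sizes really follow the hard limit.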
I'm seeing this on Debian Trixie. Package versions:
||/ Name                     Version      Architecture
+++-========================-============-============
ii  libcuda1:amd64           535.183.01-1 amd64
ii  libnvidia-ml1:amd64      535.183.01-1 amd64
ii  linux-image-6.8.12-amd64 6.8.12-1     amd64
ii  nvidia-kernel-dkms       535.183.01-1 amd64
ii  nvidia-persistenced      535.171.04-1 amd64
ii  nvidia-smi               535.183.01-1 amd64
Thanks,
Mark