Bug#1074032: [#1074032] nvidia-smi locks up machine with new file descriptor limit

Mark Glines mark at glines.org
Fri Aug 16 17:32:48 BST 2024


Steve, what happens when you run it with strace?

I am running the same version of nvidia-smi as you, and noticing that it allocates a TON of memory now, for no (apparent) reason.  That started happening sometime in the past few months.  I think it may be related to the symptoms you describe, but I am not completely sure.

Here's the relevant snippet from the output of `strace nvidia-smi`:

2285205 stat("/var/run/nvidia-persistenced/socket", {st_mode=S_IFSOCK|0777, st_size=0, ...}) = 0
2285205 socket(AF_UNIX, SOCK_STREAM, 0) = 9
2285205 connect(9, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, 37) = 0
2285205 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
2285205 prlimit64(0, RLIMIT_NOFILE, NULL, {rlim_cur=1024, rlim_max=1073741816}) = 0
2285205 mmap(NULL, 4294967296, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f5709400000
2285205 mmap(NULL, 51539607552, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f4b09400000
2285205 getpeername(9, {sa_family=AF_UNIX, sun_path="/var/run/nvidia-persistenced/socket"}, [128 => 38]) = 0

Quick summary: it opens the persistenced socket, reads the file descriptor limits (the hard limit is 1073741816, roughly 1 billion), allocates 4 GiB of memory, allocates another 48 GiB, about 51.5 GB (!!!), and then proceeds to use the persistenced socket.
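The sizes in that trace can be checked with a little arithmetic. The sketch below uses only the numbers copied from the strace output; the idea that the two buffers are sized from the hard RLIMIT_NOFILE value is an inference from those numbers, not something confirmed by NVIDIA. The names `RLIM_MAX`, `MMAP_SIZES` and `describe` are made up for illustration.

```python
# Values copied from the strace output above.
RLIM_MAX = 1_073_741_816                        # rlim_max reported by prlimit64
MMAP_SIZES = [4_294_967_296, 51_539_607_552]    # lengths of the two anonymous mmap calls

GIB = 1 << 30  # bytes per GiB


def describe(size, rlim_max=RLIM_MAX):
    """Return (size in GiB, bytes per possible file descriptor)."""
    return size / GIB, size / rlim_max


for size in MMAP_SIZES:
    gib, per_fd = describe(size)
    print(f"{size} bytes = {gib:.0f} GiB, about {per_fd:.2f} bytes per descriptor")

total_gib = sum(MMAP_SIZES) / GIB
print(f"total = {total_gib:.0f} GiB")
```

The two mappings come out at 4 GiB and 48 GiB, which is close to 4 and 48 bytes for every descriptor the hard limit would allow, and the 52 GiB total lines up with the virtual size that `top` reports. That is at least consistent with a table being allocated per possible file descriptor.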

This happens after reporting the card and driver versions, and before listing the processes.  Whatever it's doing during this time period, it delays execution for ~25 seconds, too.  The `top` command shows the nvidia-smi process at 52.0g virtual and 49.6g resident.

     PID  VIRT    RES S  %CPU  %MEM     TIME+ COMMAND
2285205 52.0g  49.6g R 100.0  39.4   0:22.14 nvidia-smi

If you have less memory than it's asking for, that might be a reason for your machine to go into swap hell and eventually freeze.
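One way to test whether the file descriptor limit is really what drives the allocation would be to run nvidia-smi in a child process with a lowered hard limit and compare the memory use. The following is an untested sketch of that experiment, not a known fix: the helper names `capped_nofile` and `run_with_nofile_cap` and the default cap of 4096 are assumptions chosen for illustration, and whether nvidia-smi behaves differently under the lower limit is exactly what the experiment would have to show.

```python
import resource
import subprocess


def capped_nofile(hard_cap):
    """Return a (soft, hard) pair no larger than hard_cap or the current limits."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    inf = resource.RLIM_INFINITY
    # An unprivileged process may lower its hard limit but never raise it.
    new_hard = hard_cap if hard == inf else min(hard, hard_cap)
    # The soft limit must not exceed the hard limit.
    new_soft = new_hard if soft == inf else min(soft, new_hard)
    return new_soft, new_hard


def run_with_nofile_cap(cmd, hard_cap=4096):
    """Run cmd in a child whose RLIMIT_NOFILE hard limit is lowered first."""
    limits = capped_nofile(hard_cap)

    def lower_limit():
        # Runs in the child between fork and exec; the parent's limits are untouched.
        resource.setrlimit(resource.RLIMIT_NOFILE, limits)

    return subprocess.run(cmd, preexec_fn=lower_limit, capture_output=True, text=True)
```

Calling `run_with_nofile_cap(["nvidia-smi"])` while watching `top` would show whether the large allocation and the ~25 second delay go away when the hard limit is small.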

I'm seeing this on Debian Trixie.  Package versions:

||/ Name                     Version      Architecture
+++-========================-============-============
ii  libcuda1:amd64           535.183.01-1 amd64
ii  libnvidia-ml1:amd64      535.183.01-1 amd64
ii  linux-image-6.8.12-amd64 6.8.12-1     amd64
ii  nvidia-kernel-dkms       535.183.01-1 amd64
ii  nvidia-persistenced      535.171.04-1 amd64
ii  nvidia-smi               535.183.01-1 amd64

Thanks,
Mark


