Bug#932851: systemd causes diskless nodes to stop working / require hard reset

Timothy Pearson tpearson at raptorengineering.com
Tue Jul 23 23:01:47 BST 2019


Package: systemd
Version: 241-5

Ever since the switch to systemd we have been experiencing significant problems with our diskless nodes: if the NFS connection is dropped for any reason (NFS server reboot, network router state reset, etc.), there is a high chance that a node will enter an unrecoverable state and require a hard reset (power off, power on).

We had been working around this for a while on the assumption that it was just a Debian quirk, but today I was able to capture the following trace from the console of a hung system:

[820689.313769] nfs: server 192.168.1.1 not responding, still trying
[820693.530338] nfs: server 192.168.1.1 not responding, still trying
[820693.530451] nfs: server 192.168.1.1 not responding, still trying
[820696.994677] nfs: server 192.168.1.1 not responding, still trying
[820697.218891] nfs: server 192.168.1.1 not responding, still trying
[820697.698918] nfs: server 192.168.1.1 not responding, still trying
[820698.106834] nfs: server 192.168.1.1 not responding, still trying
[820721.177609] nfs: server 192.168.1.1 not responding, still trying
[820725.466102] nfs: server 192.168.1.1 not responding, still trying
[820818.681006] watchdog: BUG: soft lockup - CPU#2 stuck for 21s! [systemd-logind:273]
[820932.960202] INFO: task openvpn:5096 blocked for more than 120 seconds.
[820937.889046] nfs: server 192.168.1.1 OK
[820937.889226] nfs: server 192.168.1.1 OK
[820937.889374] nfs: server 192.168.1.1 OK
[820937.889381] nfs: server 192.168.1.1 OK
[820937.889448] nfs: server 192.168.1.1 OK
[820937.889503] nfs: server 192.168.1.1 OK
[820937.889574] nfs: server 192.168.1.1 OK
[820937.889665] nfs: server 192.168.1.1 OK
[820937.889670] nfs: server 192.168.1.1 OK
[820937.889674] nfs: server 192.168.1.1 OK
[820937.903880] systemd-journald[171]: Failed to open system journal: Permission denied
[820938.083071] systemd[1]: systemd-journald.service: Main process exited, code=killed, status=6/ABRT
[820938.111157] systemd[1]: systemd-journald.service: Failed to kill control group /system.slice/systemd-journald.service, ignoring: Permission denied
[820938.124774] systemd[1]: systemd-journald.service: Failed to kill control group /system.slice/systemd-journald.service, ignoring: Permission denied
[820938.131244] systemd[1]: systemd-journald.service: Unit entered failed state.
[820938.131418] systemd[1]: systemd-journald.service: Failed with result 'watchdog'.
[820938.144754] systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=6/ABRT
[820938.170807] systemd[1]: systemd-udevd.service: Failed to kill control group /system.slice/systemd-udevd.service, ignoring: Permission denied
[820938.177666] systemd[1]: systemd-udevd.service: Unit entered failed state.
[820938.177798] systemd[1]: systemd-udevd.service: Failed with result 'watchdog'.
[820938.189036] systemd[1]: systemd-udevd.service: Service has no hold-off time, scheduling restart.
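
(For reference, the 'watchdog' result above is the per-service watchdog, i.e. the unit's WatchdogSec= setting; the configured timeout can be read back with the stock commands below, shown here only for illustration:

  systemctl show systemd-journald.service -p WatchdogUSec
  systemctl show systemd-udevd.service -p WatchdogUSec

WatchdogUSec is the runtime view of WatchdogSec=.)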

This fairly clearly puts the blame somewhere in systemd, which matches our experience: our older non-systemd machines recover perfectly fine from even extended NFS server failures. At a minimum, the systemd watchdog should probably be disabled (or its timeout relaxed) while the NFS server is unavailable.
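As a possible stopgap on the client side (just a sketch, not something I have verified against a long NFS outage), the per-service watchdog can be switched off with a drop-in for the affected units, e.g.:

  # /etc/systemd/system/systemd-journald.service.d/no-watchdog.conf
  # (and the same under systemd-udevd.service.d/)
  [Service]
  WatchdogSec=0

followed by "systemctl daemon-reload" and a restart of the two services. WatchdogSec=0 disables the watchdog entirely; a large value instead of 0 would merely relax it, which may be the safer option if the watchdog is otherwise wanted.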


