[Pkg-sysvinit-devel] Bug#719273: Acknowledgement (sysvinit-utils: /bin/pidof fails when there are stuck NFS mount points, preventing shutdown)
Daniel Povey
dpovey at gmail.com
Sun Dec 31 19:39:07 UTC 2017
Guys, I just want to give a bit more context that I've found out since then
which might be relevant.
As you probably know (but to document it here), if you set the variable
PIDOF_NETFS or set the flag -n for 'pidof', it will avoid trying to stat
files on NFS partitions. This will solve some of these types of issues--
but not all of them, because a binary that's local can still be in D state
due to accessing a stuck partition. It's also possible (I haven't confirmed
this) that a process that is not in a D or Z state still has its binary on a
stuck mount point, so that calling "stat" on it would fail. But I don't want
to complicate things too much: a partial fix is better than nothing.
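To make the two ways of enabling this concrete, here is a minimal sketch that builds a pidof invocation either with the -n flag or with PIDOF_NETFS set in the environment. The helper name pidof_invocation and the value "1" for the variable are assumptions for illustration; the text above only says the variable has to be set.

```python
import os


def pidof_invocation(program, avoid_netfs=True, use_env=False):
    """Return (argv, env) for running pidof on `program`.

    Hypothetical helper: it only builds the command, it does not run it.
    """
    argv = ["pidof"]
    env = dict(os.environ)  # copy so the caller's environment is untouched
    if avoid_netfs:
        if use_env:
            # Assumption: any non-empty value such as "1" is enough;
            # the source only says the variable must be set.
            env["PIDOF_NETFS"] = "1"
        else:
            # -n asks pidof to avoid stat() on binaries on network filesystems.
            argv.append("-n")
    argv.append(program)
    return argv, env
```

The result could then be passed to something like subprocess.run(argv, env=env, timeout=10), where the timeout is an extra guard against the hang described here and not something pidof itself provides.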
Also, at the current time (and IIRC this wasn't the case when we submitted
the original patch), start-stop-daemon is a binary not a script, and it
doesn't call pidof or killall. Instead it uses its own code, and that code
is subject to the same issue where it hangs on stuck NFS partitions.
Therefore, as it stands, applying this patch to 'pidof' alone will no longer
resolve the issue; similar changes would have to be made to start-stop-daemon.
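For illustration only, the kind of check discussed in this thread (skipping a process that is in D or Z state before touching its executable) could look roughly like the sketch below. This is not the code from the patch or from start-stop-daemon; the function names are hypothetical, and it assumes the usual /proc/<pid>/stat layout of "pid (comm) state ...".

```python
def proc_state(stat_line):
    """Return the one-letter process state from the text of /proc/<pid>/stat."""
    # The command name is wrapped in parentheses and may itself contain
    # spaces or ')', so split after the LAST ')' rather than on whitespace.
    rest = stat_line[stat_line.rindex(")") + 1:].split()
    return rest[0]


def safe_to_inspect(stat_line, skip_states=("D", "Z")):
    """True if the process is not in one of the states we want to skip."""
    return proc_state(stat_line) not in skip_states
```

A caller would read /proc/<pid>/stat first and only go on to stat the executable when safe_to_inspect returns True. As noted earlier in this message, this is a partial measure: a process in another state may still have its binary on a stuck mount point.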
We have a cluster of Linux servers running GridEngine, and they each export
volumes via NFS. We have a problem when one of the NFS servers dies. The
stuck mount point on other nodes causes user processes there to enter the D
state (e.g. if they have pending output to the disk that's down). This
makes it impossible to do routine maintenance on the other nodes, so pretty
soon they need a reboot, and very often in this condition the reboot fails
because a shutdown task hangs. The NFS volumes those nodes export then become
unavailable too, causing failures on further nodes in turn, unless someone is
available to physically reboot them. (We're planning to eventually use virtualization to
prevent this failure; previously there were complications relating to
NVidia GPUs).
Dan
On Sun, Dec 31, 2017 at 9:31 AM, Ian Jackson <
ijackson at chiark.greenend.org.uk> wrote:
> Daniel Povey writes ("Bug#719273: Acknowledgement (sysvinit-utils:
> /bin/pidof fails when there are stuck NFS mount points, preventing
> shutdown)"):
> > We're going to check it out early next week, whether it still applies.
> >
> > The bug it fixes is a situation where pidof reaches a process that's in a
> > bad state such as D or Z, that might have an executable on a stuck mount
> > point or something like that, and gets stuck, preventing system maintenance
> > tasks or shutdown.
>
> Yes. The principle of the patch LGTM. Hence my tagging the bug
> accordingly.
>
> Thanks for your persistence.
>
> Ian.
>