jessie: help debugging NFS shares not mounted at boot, double mounts with mount -a, and @reboot cronjobs

Sandro Tosi morph at debian.org
Fri Feb 19 12:11:47 GMT 2016


Hey Felipe,

On Thu, Feb 18, 2016 at 5:39 PM, Felipe Sateler <fsateler at debian.org> wrote:
> That mail has:
> 1711:Feb 10 16:44:40 SERVER systemd[1]: mnt-NFSSERVER.mount changed dead -> mounting
> 1817:Feb 10 16:44:43 SERVER systemd[1]: mnt-NFSSERVER.mount changed mounting -> mounted
> 1818:Feb 10 16:44:43 SERVER systemd[1]: Job mnt-NFSSERVER.mount/start finished, result=done
> 1819:Feb 10 16:44:43 SERVER systemd[1]: Mounted /mnt/NFSSERVER.
> 2106:Feb 10 16:44:43 SERVER systemd[1]: mnt-NFSSERVER.mount changed mounted -> dead
> 2107:Feb 10 16:44:43 SERVER systemd[1]: Failed to destroy cgroup /system.slice/mnt-NFSSERVER.mount: Device or resource busy
> 2632:Feb 10 16:44:54 SERVER systemd[1]: mnt-NFSSERVER.mount changed dead -> mounted
>
> So somehow the mount finishes successfully but the mount is declared
> dead in rapid succession. This would suggest that systemd regards
> remote-fs up at the time of the mount success exit, but the mount then
> dies and revives some seconds later.

Thanks for reading through this for me! I somehow missed the
mounted -> dead -> mounted-again dance.

> Are there any possibly relevant logs during that time?

I extended the grep on the logs a bit:

1005:Feb 10 16:44:38 SERVER systemd[1]: Installed new job mnt-NFSSERVER.mount/start as 97
1707:Feb 10 16:44:40 SERVER systemd[1]: Mounting /mnt/NFSSERVER...
1709:Feb 10 16:44:40 SERVER systemd[1]: About to execute: /bin/mount -n XXX.YYY.16.226:/VOL /mnt/NFSSERVER -t nfs -o rw,intr,tcp,bg,rdirplus,noatime,_netdev
1710:Feb 10 16:44:40 SERVER systemd[1]: Forked /bin/mount as 583
1711:Feb 10 16:44:40 SERVER systemd[1]: mnt-NFSSERVER.mount changed dead -> mounting
1713:Feb 10 16:44:40 SERVER systemd[583]: Executing: /bin/mount -n XXX.YYY.16.226:/VOL /mnt/NFSSERVER -t nfs -o rw,intr,tcp,bg,rdirplus,noatime,_netdev
1767:Feb 10 16:44:43 SERVER mount[586]: mount to NFS server 'XXX.YYY.16.226' failed: No route to host, retrying
1814:Feb 10 16:44:43 SERVER systemd[1]: Child 583 (mount) died (code=exited, status=0/SUCCESS)
1815:Feb 10 16:44:43 SERVER systemd[1]: Child 583 belongs to mnt-NFSSERVER.mount
1816:Feb 10 16:44:43 SERVER systemd[1]: mnt-NFSSERVER.mount mount process exited, code=exited status=0
1817:Feb 10 16:44:43 SERVER systemd[1]: mnt-NFSSERVER.mount changed mounting -> mounted
1818:Feb 10 16:44:43 SERVER systemd[1]: Job mnt-NFSSERVER.mount/start finished, result=done
1819:Feb 10 16:44:43 SERVER systemd[1]: Mounted /mnt/NFSSERVER.
1875:Feb 10 16:44:43 SERVER mount[583]: mount.nfs: backgrounding "XXX.YYY.16.226:/VOL"
1876:Feb 10 16:44:43 SERVER mount[583]: mount.nfs: mount options: "rw,noatime,intr,tcp,bg,rdirplus,_netdev"
2106:Feb 10 16:44:43 SERVER systemd[1]: mnt-NFSSERVER.mount changed mounted -> dead
2107:Feb 10 16:44:43 SERVER systemd[1]: Failed to destroy cgroup /system.slice/mnt-NFSSERVER.mount: Device or resource busy
2632:Feb 10 16:44:54 SERVER systemd[1]: mnt-NFSSERVER.mount changed dead -> mounted

So you can see the "No route to host" error, but also note that the very
same error happened for all the NFS shares, and upon inspection all of
them show the same behavior:

Feb 10 16:44:40 SERVER systemd[1]: mnt-ANOTHERSERVER.mount changed dead -> mounting
Feb 10 16:44:43 SERVER systemd[1]: mnt-ANOTHERSERVER.mount changed mounting -> mounted
Feb 10 16:44:43 SERVER systemd[1]: mnt-ANOTHERSERVER.mount changed mounted -> dead
Feb 10 16:44:44 SERVER systemd[1]: mnt-ANOTHERSERVER.mount changed dead -> mounted

They all recover at 16:44:44, though, while this one takes longer (16:44:54).
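
(For completeness, I'm pulling these transitions out with a plain grep over
the debug logs, roughly the following — adjust the path to wherever your
syslog lands; that's also where the line-number prefixes above come from:

    grep -nE 'mnt-.*\.mount changed' /var/log/syslog
)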

So is it possible that systemd considers the mount done (line 1819 above)
while the mount process is actually still backgrounding the mount? Can we
disable the backgrounding of NFS mounts? If so, how?
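
If it's just a matter of dropping the "bg" option, I guess the fstab entry
would become something like the following (options otherwise copied from the
logs above; nfs(5) says "fg" is the default and keeps mount.nfs in the
foreground until the mount actually succeeds or fails, so its exit status
would mean something again):

    # /etc/fstab -- "bg" replaced with "fg" (the default)
    XXX.YYY.16.226:/VOL  /mnt/NFSSERVER  nfs  rw,intr,tcp,fg,rdirplus,noatime,_netdev  0  0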

>> Let me know if you prefer to investigate this latest state (the
>> machine is still in that state and has not been touched since, and it
>> appears to be somehow relevant to the situation at hand) or do you
>> want me to start rebooting the node until we are able to replicate the
>> same situation as above.
>
>
> Well, lets debug the most reproducible one ;) But, the above seems to

makes sense :)

> imply that the problems do not happen always? I presume it happens
> frequently enough to be a problem, but does the system manage to boot
> successfully sometimes?

Apologies if I didn't make this clearer before, but this error started
appearing after I changed the way I'm collecting these failures. I have an
@reboot cronjob that previously did:

1. sleep for 2m (as the initial error was about mounts timing out at 1m30s)
2. check if the number of mounts is the correct one
2.1. if not, grab some system logs
3. reboot

I changed it slightly into:

1. sleep for 10s
2. check if the number of mounts is the correct one
2.1. if not, grab some system logs
3. sleep for 2m (to have a chance to log into the system and, if needed,
stop the reboot loop)
4. reboot

Now that the cronjob "does stuff" much earlier than before, this issue is
triggered, and it might very well be the "(3.) we have some @reboot
cronjobs to start programs stored on some" problem mentioned in the
original post. (A rough sketch of the check script follows.)
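
For reference, the new check is roughly this — the script path, the
expected mount count, and the log destination below are illustrative, not
the exact values on the box:

    #!/bin/sh
    # invoked from root's crontab as: @reboot /usr/local/sbin/check-mounts.sh
    sleep 10                                  # 1. short initial delay
    expected=12                               # 2. expected number of NFS mounts (illustrative)
    if [ "$(mount -t nfs | wc -l)" -ne "$expected" ]; then
        cp /var/log/syslog "/root/boot-failure-$(date +%s).log"  # 2.1. grab some system logs
    fi
    sleep 120                                 # 3. window to log in and stop the loop
    reboot                                    # 4. and go again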

>>>>> Do you have more
>>>>> interfaces in these machines? Are all of them configured as auto or
>>>>> static?
>>>>
>>>> on this particular machine there is a single eth0 interface configured
as auto
>>>
>>> So this is not the same setup as the previous one you posted? I'm
>>> getting a bit confused...
>>
>> yes this has always been the same setup; my question about multiple
>> NICs is because we do have seen this behavior on machines with
>> multiple interfaces, and so we were wondering if that could make the
>> issue more likely to happen, but it was more of a curiosity.
>>
>> the machine I am providing logs from is always the same with the same
>> exact configuration, unless specified (like disabling services as
>> requested, etc etc).
>
> Well, that does not match with the info you sent in the Feb 9 mail (an
> inet static configuration).

Gaah, sorry about the confusion :( Yeah, that is the exact network config:
"auto eth0 \ iface eth0 inet static". I highlighted the "auto eth0"
(driven a bit by your comment about auto) and forgot to mention the second
config line.
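
I.e., the relevant /etc/network/interfaces stanza is just:

    auto eth0
    iface eth0 inet static
        # (address/netmask/gateway lines omitted here)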

thanks!

-- 
Sandro "morph" Tosi
My website: http://sandrotosi.me/
Me at Debian: http://wiki.debian.org/SandroTosi
G+: https://plus.google.com/u/0/+SandroTosi