jessie: help debugging NFS shares not mounted at boot, double mounts with mount -a, and @reboot cronjobs

Felipe Sateler fsateler at debian.org
Thu Feb 25 19:19:59 GMT 2016


On 24 February 2016 at 09:29, Sandro Tosi <morph at debian.org> wrote:
> Hey Felipe, thanks a lot for your help!
>
>
> On Tue, Feb 23, 2016 at 10:16 PM, Felipe Sateler <fsateler at debian.org>
> wrote:
>>
>> On 23 February 2016 at 12:12, Sandro Tosi <morph at debian.org> wrote:
>> > On Tue, Feb 23, 2016 at 9:19 AM, Sandro Tosi <morph at debian.org> wrote:
>> >> quick update: we had a couple of (real) nfs issues and
>> >> misconfiguration (meeh) that made the script fail even if it shouldnt
>> >> have, so no news yet; the reboot loop just restarted and will
>> >> periodically check it and report back if something comes up.
>> >
>> > so well, we just needed to wait a bit :)
>> >
>> > here it is:
>> >
>> > # journalctl -alb | grep -nE "cron|NFSSERVER"
>> > 1012:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
>> > mnt-NFSSERVER_VOL.mount/start as 99
>> > 1014:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
>> > cron.service/start
>> > as 101
>> > 1714:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER_VOL...
>> > 1716:Feb 23 06:50:43 SERVER systemd[1]: About to execute: /bin/mount -n
>> > XXX.YYY.32.75:/vol/VOL /mnt/NFSSERVER_VOL -t nfs -o
>> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
>> > 1718:Feb 23 06:50:43 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed
>> > dead
>> > -> mounting
>> > 1720:Feb 23 06:50:43 SERVER systemd[574]: Executing: /bin/mount -n
>> > XXX.YYY.32.75:/vol/VOL /mnt/NFSSERVER_VOL -t nfs -o
>> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
>> > 1905:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount mounting
>> > timed out. Stopping.
>> > 1906:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed
>> > mounting -> mounting-sigterm
>> > 1915:Feb 23 06:52:13 SERVER systemd[1]: Child 574 belongs to
>> > mnt-NFSSERVER_VOL.mount
>> > 1916:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount mount
>> > process exited, code=killed status=15
>> > 1917:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed
>> > mounting-sigterm -> mounted
>> > 1918:Feb 23 06:52:13 SERVER systemd[1]: Job
>> > mnt-NFSSERVER_VOL.mount/start
>> > finished, result=done
>> > 1919:Feb 23 06:52:13 SERVER systemd[1]: Mounted /mnt/NFSSERVER_VOL.
>> > 2025:Feb 23 06:52:13 SERVER systemd[1]: About to execute: /usr/sbin/cron
>> > -f
>> > $EXTRA_OPTS
>> > 2026:Feb 23 06:52:13 SERVER systemd[1]: Forked /usr/sbin/cron as 786
>> > 2027:Feb 23 06:52:13 SERVER systemd[1]: cron.service changed dead ->
>> > running
>> > 2028:Feb 23 06:52:13 SERVER systemd[1]: Job cron.service/start finished,
>> > result=done
>> > 2029:Feb 23 06:52:13 SERVER systemd[786]: Executing: /usr/sbin/cron -f
>> > 2038:Feb 23 06:52:13 SERVER cron[786]: (CRON) INFO (pidfile fd = 3)
>> > 2128:Feb 23 06:52:13 SERVER cron[786]: (CRON) INFO (Running @reboot
>> > jobs)
>> > 2300:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed
>> > mounted -> failed
>> > 2301:Feb 23 06:52:13 SERVER systemd[1]: Failed to destroy cgroup
>> > /system.slice/mnt-NFSSERVER_VOL.mount: Device or resource busy
>> > 2302:Feb 23 06:52:13 SERVER systemd[1]: Unit mnt-NFSSERVER_VOL.mount
>> > entered
>> > failed state.
>> > 2303:Feb 23 06:52:13 SERVER systemd[1]: Sent message type=signal
>> > sender=n/a
>> > destination=n/a
>> > object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER_5fVOL_2emount
>> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
>> > cookie=30
>> > reply_cookie=0 error=n/a
>> > 2304:Feb 23 06:52:13 SERVER systemd[1]: Sent message type=signal
>> > sender=n/a
>> > destination=n/a
>> > object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER_5fVOL_2emount
>> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
>> > cookie=31
>> > reply_cookie=0 error=n/a
>> >
>> > so 1m30s passed and the mount didnt come up, which is (one of) the
>> > original
>> > issue (usually, running mount -t nfs -a will bring it up, even just
>> > right
>> > after the failure at boot, so it seems like it's not retried?). i check
>> > in
>> > the journalctl output around those lines, there was additional msg
>> > relevant
>> > to this.
>>
>> Is the network up at this point?
>
> i think so (and the other 11 NFS mounts were correctly mounted at this point
> anyway): here's some logs of the network and the failing mount:
>
> # journalctl -alb | grep -inE "network|NFSSERVER"
> 653:Feb 23 06:50:42 SERVER kernel: igb: Intel(R) Gigabit Ethernet Network
> Driver - version 5.2.15-k
> 656:Feb 23 06:50:42 SERVER kernel: igb 0000:01:00.0: Intel(R) Gigabit
> Ethernet Network Connection
> 661:Feb 23 06:50:42 SERVER kernel: igb 0000:01:00.1: Intel(R) Gigabit
> Ethernet Network Connection
> 678:Feb 23 06:50:42 SERVER kernel: drop_monitor: Initializing network drop
> monitor service
> 962:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
> networking.service/start as 46
> 963:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
> network.target/start as 47
> 969:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
> network-online.target/start as 53
> 1012:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
> mnt-NFSSERVER_VOL.mount/start as 99
> 1548:Feb 23 06:50:42 SERVER systemd[1]: Starting LSB: Raise network
> interfaces....
> 1550:Feb 23 06:50:42 SERVER systemd[1]: About to execute:
> /etc/init.d/networking start
> 1551:Feb 23 06:50:42 SERVER systemd[1]: Forked /etc/init.d/networking as 407
> 1552:Feb 23 06:50:42 SERVER systemd[1]: networking.service changed dead ->
> start
> 1553:Feb 23 06:50:42 SERVER systemd[407]: Executing: /etc/init.d/networking
> start
> 1591:Feb 23 06:50:43 SERVER networking[407]: Configuring network
> interfaces...Cannot get EEE settings: Operation not supported
> 1648:Feb 23 06:50:43 SERVER networking[407]: done.
> 1649:Feb 23 06:50:43 SERVER systemd[1]: Received SIGCHLD from PID 407
> (networking).
> 1650:Feb 23 06:50:43 SERVER systemd[1]: Child 407 (networking) died
> (code=exited, status=0/SUCCESS)
> 1651:Feb 23 06:50:43 SERVER systemd[1]: Child 407 belongs to
> networking.service
> 1652:Feb 23 06:50:43 SERVER systemd[1]: networking.service: control process
> exited, code=exited status=0
> 1653:Feb 23 06:50:43 SERVER systemd[1]: networking.service got final SIGCHLD
> for state start
> 1654:Feb 23 06:50:43 SERVER systemd[1]: networking.service changed start ->
> running
> 1655:Feb 23 06:50:43 SERVER systemd[1]: Job networking.service/start
> finished, result=done
> 1656:Feb 23 06:50:43 SERVER systemd[1]: Started LSB: Raise network
> interfaces..
> 1657:Feb 23 06:50:43 SERVER systemd[1]: Starting Network.
> 1658:Feb 23 06:50:43 SERVER systemd[1]: network.target changed dead ->
> active
> 1659:Feb 23 06:50:43 SERVER systemd[1]: Job network.target/start finished,
> result=done
> 1660:Feb 23 06:50:43 SERVER systemd[1]: Reached target Network.
> 1661:Feb 23 06:50:43 SERVER systemd[1]: Starting Network is Online.
> 1662:Feb 23 06:50:43 SERVER systemd[1]: network-online.target changed dead
> -> active
> 1663:Feb 23 06:50:43 SERVER systemd[1]: Job network-online.target/start
> finished, result=done
> 1664:Feb 23 06:50:43 SERVER systemd[1]: Reached target Network is Online.
> 1714:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER_VOL...
> 1716:Feb 23 06:50:43 SERVER systemd[1]: About to execute: /bin/mount -n
> XXX.YYY.32.75:/vol/VOL /mnt/NFSSERVER_VOL -t nfs -o
> ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> 1718:Feb 23 06:50:43 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed dead
> -> mounting
> 1720:Feb 23 06:50:43 SERVER systemd[574]: Executing: /bin/mount -n
> XXX.YYY.32.75:/vol/VOL /mnt/NFSSERVER_VOL -t nfs -o
> ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> 1877:Feb 23 06:50:58 SERVER systemd[1]: Child 489 belongs to
> networking.service
> 1879:Feb 23 06:50:58 SERVER systemd[1]: Child 501 belongs to
> networking.service
> 1891:Feb 23 06:51:19 SERVER systemd[1]: Child 539 belongs to
> networking.service

This is weird. Why would there be lingering processes of
networking.service? Note that these processes exit after the nfs
mounts are started.

> 1892:Feb 23 06:51:19 SERVER systemd[1]: networking.service: cgroup is empty
> 1893:Feb 23 06:51:19 SERVER systemd[1]: networking.service changed running
> -> exited
> 1905:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount mounting
> timed out. Stopping.
> 1906:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed
> mounting -> mounting-sigterm
> 1915:Feb 23 06:52:13 SERVER systemd[1]: Child 574 belongs to
> mnt-NFSSERVER_VOL.mount
> 1916:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount mount
> process exited, code=killed status=15
> 1917:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed
> mounting-sigterm -> mounted

This is definitely weird.

> 1918:Feb 23 06:52:13 SERVER systemd[1]: Job mnt-NFSSERVER_VOL.mount/start
> finished, result=done
> 1919:Feb 23 06:52:13 SERVER systemd[1]: Mounted /mnt/NFSSERVER_VOL.
> 2300:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed
> mounted -> failed
> 2301:Feb 23 06:52:13 SERVER systemd[1]: Failed to destroy cgroup
> /system.slice/mnt-NFSSERVER_VOL.mount: Device or resource busy
> 2302:Feb 23 06:52:13 SERVER systemd[1]: Unit mnt-NFSSERVER_VOL.mount entered
> failed state.
> 2303:Feb 23 06:52:13 SERVER systemd[1]: Sent message type=signal sender=n/a
> destination=n/a
> object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER_5fVOL_2emount
> interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=30
> reply_cookie=0 error=n/a
> 2304:Feb 23 06:52:13 SERVER systemd[1]: Sent message type=signal sender=n/a
> destination=n/a
> object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER_5fVOL_2emount
> interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=31
> reply_cookie=0 error=n/a
> 2449:Feb 23 06:52:13 SERVER systemd[1]: Starting Host and Network Name
> Lookups.
> 2452:Feb 23 06:52:13 SERVER systemd[1]: Reached target Host and Network Name
> Lookups.
>
> in particular
>
> 1664:Feb 23 06:50:43 SERVER systemd[1]: Reached target Network is Online.
> 1714:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER_VOL...
>
> so the net online target is reached (if we believe systemd :) from man
> systemd.special for network-online.target, "What precisely this requires is
> left to the implementation of the network managing service.", how can we
> know how this is implemented, and so what is meant here as 'online'?) before
> attempting the mount.
>
> I checked at what point the /mnt/NFSSERVER_VOL is executed, and it was the
> very last one NFS mount attempted (dunno if it's useful, i was expecting to
> be the first, so with a possibly network still unstable), here below what I
> called before /mnt/NFSSERVER_VOL is /mnt/NFSSERVER1_VOL (as i included also
> the other NFS mounts):
>
> 1665:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER12_VOL...
> 1669:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER7_VOL...
> 1674:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER10...
> 1679:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER7_VOL...
> 1684:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER8...
> 1689:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER7_VOL...
> 1694:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER5...
> 1699:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER4...
> 1704:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER3...
> 1709:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER2...
> 1714:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER1_VOL...
> 1757:Feb 23 06:50:46 SERVER systemd[1]: mnt-NFSSERVER12_VOL.mount changed
> mounting -> mounting-done
> 1759:Feb 23 06:50:46 SERVER systemd[1]: Mounted /mnt/NFSSERVER12_VOL.
> 1777:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER8.mount changed
> mounting -> mounting-done
> 1779:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER8.
> 1780:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER7_VOL.mount changed
> mounting -> mounting-done
> 1782:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER7_VOL.
> 1783:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER3.mount changed
> mounting -> mounting-done
> 1785:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER3.
> 1786:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER5.mount changed
> mounting -> mounting-done
> 1788:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER5.
> 1789:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER7_VOL.mount changed
> mounting -> mounting-done
> 1791:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER7_VOL.
> 1792:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER7_VOL.mount changed
> mounting -> mounting-done
> 1794:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER7_VOL.
> 1795:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER10.mount changed
> mounting -> mounting-done
> 1797:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER10.
> 1830:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER4.mount changed
> mounting -> mounted
> 1832:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER4.
> 1836:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER2.mount changed
> mounting -> mounted
> 1838:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER2.
> 1906:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed
> mounting -> mounting-sigterm
> 1919:Feb 23 06:52:13 SERVER systemd[1]: Mounted /mnt/NFSSERVER1_VOL.
>
> but it is the first one defined in /etc/fstab:
>
> XXX.YYY.32.75:/vol/vol3    /mnt/NFSSERVER1_VOL      nfs
> ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> XXX.YYY.26.132:/ifs    /mnt/NFSSERVER2    nfs
> rw,intr,tcp,rdirplus,noatime,_netdev
> XXX.YYY.20.176:/ifs    /mnt/NFSSERVER3      nfs
> ro,tcp,rdirplus,noatime,_netdev
> XXX.YYY.16.226:/ifs    /mnt/NFSSERVER4     nfs
> rw,intr,tcp,rdirplus,noatime,_netdev
> XXX.YYY.21.22:/ifs/scratcho    /mnt/NFSSERVER5     nfs
> rw,intr,tcp,rdirplus,noatime,_netdev
> XXX.YYY.27.74:/vol/vol5    /mnt/NFSSERVER7_VOL     nfs
> rw,intr,tcp,rdirplus,noatime,_netdev
> XXX.YYY.32.111:/buildcache     /mnt/NFSSERVER8   nfs
> ro,soft,intr,tcp,rdirplus,noatime,_netdev
> XXX.YYY.27.74:/vol/vol1    /mnt/NFSSERVER7_VOL     nfs
> ro,soft,intr,tcp,rdirplus,noatime,_netdev
> XXX.YYY.20.176:/ifs    /mnt/NFSSERVER10     nfs
> ro,nocto,actimeo=600,tcp,rdirplus,noatime,_netdev
> XXX.YYY.27.74:/vol/vol6    /mnt/NFSSERVER7_VOL     nfs
> ro,soft,intr,tcp,rdirplus,noatime,_netdev
> XXX.YYY.20.14:/vol/vol3 /mnt/NFSSERVER12_VOL nfs
> ro,intr,nolock,tcp,rdirplus,noatime,_netdev
>
> something else that sounds fishy is:
>
> Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER1_VOL...
> Feb 23 06:50:43 SERVER systemd[1]: About to execute: /bin/mount -n
> XXX.YYY.32.75:/vol/vol3 /mnt/NFSSERVER1_VOL -t nfs -o
> ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> Feb 23 06:50:43 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed dead ->
> mounting
> Feb 23 06:50:43 SERVER systemd[574]: Executing: /bin/mount -n
> XXX.YYY.32.75:/vol/vol3 /mnt/NFSSERVER1_VOL -t nfs -o
> ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount mounting timed
> out. Stopping.
> Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed mounting
> -> mounting-sigterm
> Feb 23 06:52:13 SERVER systemd[1]: Child 574 belongs to
> mnt-NFSSERVER1_VOL.mount
> Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount mount process
> exited, code=killed status=15
> Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed
> mounting-sigterm -> mounted
> Feb 23 06:52:13 SERVER systemd[1]: Job mnt-NFSSERVER1_VOL.mount/start
> finished, result=done
> Feb 23 06:52:13 SERVER systemd[1]: Mounted /mnt/NFSSERVER1_VOL.
> Feb 23 06:52:13 SERVER systemd[1]: Starting Remote File Systems.
> Feb 23 06:52:13 SERVER systemd[1]: remote-fs.target changed dead -> active
> Feb 23 06:52:13 SERVER systemd[1]: Job remote-fs.target/start finished,
> result=done
> Feb 23 06:52:13 SERVER systemd[1]: Reached target Remote File Systems.
> Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed mounted
> -> failed
>
> so it is actually marked as done/mounted (?!?): "mnt-NFSSERVER1_VOL.mount
> changed mounting-sigterm -> mounted", which triggers "Job
> remote-fs.target/start finished, result=done" - but that seems wrong, it was
> sigterm'ed, why systemd thinks this is actually completed correctly? at the
> end it doesnt, as it's marked as failed, but that "mounting-sigterm ->
> mounted" transition appears to be a mistake

Yes, that is definitely weird.

>
>
>>
>> >
>> > also note like cron.service is started, even if we configured:
>> >
>> > # grep remote-fs /etc/systemd/system/cron.service
>> > Requires=remote-fs.target
>> > After=remote-fs.target
>>
>> What does `systemctl show remote-fs.target` say?
>
> Id=remote-fs.target
> Names=remote-fs.target
> Requires=mnt-NFSSERVER12_VOL.mount mnt-NFSSERVER7_vol6.mount
> mnt-NFSSERVER32.mount mnt-NFSSERVER9_vol1.mount mnt-NFSSERVER8.mount
> mnt-NFSSERVER11_vol5.mount mnt-NFSSERVER5.mount mnt-NFSSERVER4.mount
> mnt-gonzo.mount mnt-NFSSERVER2.mount mnt-NFSSERVER1_vol3.mount
> Wants=remote-fs-pre.target
> RequiredBy=cron.service
> WantedBy=multi-user.target
> Conflicts=shutdown.target
> Before=console-setup.service kbd.service systemd-journal-flush.service
> systemd-user-sessions.service mcelog.service ntp.service ipmidetectd.service
> kexec-load.service nagios-nrpe-server.service nutcracker.service
> lsyncd.service irqbalance.service ipwatchd.service ganglia-monitor.service
> unbound.service bmc-watchdog.service setup_hugetlbfs.service sysstat.service
> postfix.service cron.service
> After=remote-fs-pre.target local-fs.target mnt-NFSSERVER1_VOL.mount
> mnt-NFSSERVER2.mount mnt-NFSSERVER3.mount mnt-NFSSERVER4.mount
> mnt-NFSSERVER5.mount mnt-NFSSERVER7_vol5.mount mnt-NFSSERVER8.mount
> mnt-NFSSERVER9_vol1.mount mnt-NFSSERVER10.mount mnt-NFSSERVER11_vol6.mount
> mnt-NFSSERVER12_vol3.mount
> Documentation=man:systemd.special(7)
> Description=Remote File Systems
> LoadState=loaded
> ActiveState=active
> SubState=active
> FragmentPath=/lib/systemd/system/remote-fs.target
> DropInPaths=/run/systemd/generator/remote-fs.target.d/50-insserv.conf.conf
> UnitFileState=enabled
> InactiveExitTimestamp=Tue 2016-02-23 06:52:13 EST
> InactiveExitTimestampMonotonic=104052746
> ActiveEnterTimestamp=Tue 2016-02-23 06:52:13 EST
> ActiveEnterTimestampMonotonic=104052746
> ActiveExitTimestampMonotonic=0
> InactiveEnterTimestampMonotonic=0
> CanStart=yes
> CanStop=yes
> CanReload=no
> CanIsolate=no
> StopWhenUnneeded=no
> RefuseManualStart=no
> RefuseManualStop=no
> AllowIsolate=no
> DefaultDependencies=no
> OnFailureJobMode=replace
> IgnoreOnIsolate=no
> IgnoreOnSnapshot=no
> NeedDaemonReload=no
> JobTimeoutUSec=0
> ConditionResult=yes
> ConditionTimestamp=Tue 2016-02-23 06:52:13 EST
> ConditionTimestampMonotonic=104052734
> Transient=no
>
>>
>> >
>> > checking the status of that target:
>> >
>> > # systemctl status remote-fs.target
>> > ● remote-fs.target - Remote File Systems
>> >    Loaded: loaded (/lib/systemd/system/remote-fs.target; enabled)
>> >   Drop-In: /run/systemd/generator/remote-fs.target.d
>> >            └─50-insserv.conf.conf
>> >    Active: active since Tue 2016-02-23 06:52:13 EST; 3h 15min ago
>> >      Docs: man:systemd.special(7)
>> >
>> > Feb 23 06:52:13 SERVER systemd[1]: Starting Remote File Systems.
>> > Feb 23 06:52:13 SERVER systemd[1]: Job remote-fs.target/start finished,
>> > result=done
>> > Feb 23 06:52:13 SERVER systemd[1]: Reached target Remote File Systems.
>> >
>> > so at the same time when mnt-NFSSERVER_VOL.mount is marked as failed,
>> > the
>> > remote-fs.target is marked as loaded successfully (which seems the wrong
>> > status to me), and in fact the only failed unit is:
>> >
>> > # systemctl --failed
>> >   UNIT                    LOAD   ACTIVE SUB    DESCRIPTION
>> > ● mnt-NFSSERVER_VOL.mount loaded failed failed /mnt/NFSSERVER_VOL
>> >
>> > 1 loaded units listed. Pass --all to see loaded but inactive units, too.
>>
>> Looks like remote-fs.target does not Require its mounts, only Wants.
>
> from the 'show' command above it seems it actually Requires the mount, where
> did you find the info about only 'Want'ing the mounts?

I was guessing, but your output suggests the relation is correct. I am
very confused...

>
>>
>> >
>> >
>> > can I provide more logs/info? do you see anything wrong on this
>> > configuration that we might want to change?
>>
>> It seems strange that the mount fails.
>
> yeah I agree it's strange, but there might be situations when the particular
> nfs server is overloaded, and/or the network path is a bit congested, and
> the mount might take a while to come up; we consider a slow mount acceptable
> (we dont reboot machines that often, and servers are already slow beasts to
> start, a few secs more is not an issue), we just feel like if something goes
> a bit out of the usual, the mount is never retried and it's just left it
> there, dying

Maybe the timeout is just too short. Maybe adding
x-systemd.device-timeout=90s helps?

A completely different alternative is to setup the nfs mounts as
automounts instead of real mounts (ie, set x-systemd.automount
option). This would have the aditional benefit of removing the need to
specify dependency against remote-fs.target.

This option will cause systemd to attempt to mount whenever a subpath
is accessed and the mount is not yet active.

>
>> Any logs which might shed some
>> light on why the mount command fails (especially network status)?
>
> sadly i cant find anything relevant: dmesg has nothing, journalctl contains
> nothing helpful/related, ad so do syslog or messages - any other place you
> want me to look?
>
>> Does
>> the mount complete successfully if started via systemd post-boot ?
>> that is, `systemctl start mnt-NFSSERVER_VOL.mount`.
>
> it usually does, up to the point that we setup a "mount -t nfs -a" as an
> @reboot cronjob just to get the mounts up
>
> on this specific machine. I'm running that command now, so 24hrs after when
> that happened, and it completes just fine (but i have no doubts it would
> have succeeded even just after the boot failure):
>
> # systemctl --failed --no-legend
> mnt-NFSSERVER_VOL.mount loaded failed failed /mnt/NFSSERVER_VOL
> # systemctl start mnt-NFSSERVER_VOL.mount
>
> Feb 24 07:24:21 SERVER systemd[1]: Trying to enqueue job
> mnt-NFSSERVER1_VOL.mount/start/replace
> Feb 24 07:24:21 SERVER systemd[1]: Installed new job
> mnt-NFSSERVER1_VOL.mount/start as 565
> Feb 24 07:24:21 SERVER systemd[1]: Enqueued job
> mnt-NFSSERVER1_VOL.mount/start as 565
> Feb 24 07:24:21 SERVER systemd[1]: Mounting /mnt/NFSSERVER1_VOL...
> Feb 24 07:24:21 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount: Directory
> /mnt/majestix_vol3 to mount over is not empty, mounting anyway.
> Feb 24 07:24:21 SERVER systemd[1]: About to execute: /bin/mount -n
> XXX.YYY.32.75:/vol/vol3 /mnt/NFSSERVER1_VOL -t nfs -o
> ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> Feb 24 07:24:21 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed failed
> -> mounting
> Feb 24 07:24:21 SERVER systemd[1]: Sent message type=signal sender=n/a
> destination=n/a
> object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER1_5fvol3_2emount
> interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=3
> reply_cookie=0 error=n/a
> Feb 24 07:24:21 SERVER systemd[1]: Sent message type=signal sender=n/a
> destination=n/a
> object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER1_5fvol3_2emount
> interface=org.freedesktop.DBus.Properties member=PropertiesChanged cookie=4
> reply_cookie=0 error=n/a
> Feb 24 07:24:21 SERVER systemd[1]: Got message type=method_call sender=n/a
> destination=org.freedesktop.systemd1
> object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER1_5fvol3_2emount
> interface=org.freedesktop.DBus.Properties member=Get cookie=3 reply_cookie=0
> error=n/a
> Feb 24 07:24:21 SERVER systemd[27575]: Executing: /bin/mount -n
> XXX.YYY.32.75:/vol/vol3 /mnt/NFSSERVER1_VOL -t nfs -o
> ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> Feb 24 07:24:21 SERVER systemd[1]: Child 27575 belongs to
> mnt-NFSSERVER1_VOL.mount
> Feb 24 07:24:21 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount mount process
> exited, code=exited status=0
> Feb 24 07:24:21 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed mounting
> -> mounted
> Feb 24 07:24:21 SERVER systemd[1]: Job mnt-NFSSERVER1_VOL.mount/start
> finished, result=done
> Feb 24 07:24:21 SERVER systemd[1]: Mounted /mnt/NFSSERVER1_VOL.
>
> # systemctl status | head
> ● SERVER
>     State: running
>      Jobs: 0 queued
>    Failed: 0 units
>

OK, so no systemd-specific environment snafu, and more likely just a
timing issue.

I have to confess I don't have much more ideas on where to look...

-- 

Saludos,
Felipe Sateler




More information about the Pkg-systemd-maintainers mailing list