jessie: help debugging NFS shares not mounted at boot, double mounts with mount -a, and @reboot cronjobs
Sandro Tosi
morph at debian.org
Wed Mar 2 14:46:23 GMT 2016
Thanks Felipe to assist me with this and sorry for the late reply!
On Thu, Feb 25, 2016 at 7:19 PM, Felipe Sateler <fsateler at debian.org> wrote:
> On 24 February 2016 at 09:29, Sandro Tosi <morph at debian.org> wrote:
> > Hey Felipe, thanks a lot for your help!
> >
> >
> > On Tue, Feb 23, 2016 at 10:16 PM, Felipe Sateler <fsateler at debian.org>
> > wrote:
> >>
> >> On 23 February 2016 at 12:12, Sandro Tosi <morph at debian.org> wrote:
> >> > On Tue, Feb 23, 2016 at 9:19 AM, Sandro Tosi <morph at debian.org>
> wrote:
> >> >> quick update: we had a couple of (real) nfs issues and
> >> >> misconfiguration (meeh) that made the script fail even if it shouldnt
> >> >> have, so no news yet; the reboot loop just restarted and will
> >> >> periodically check it and report back if something comes up.
> >> >
> >> > so well, we just needed to wait a bit :)
> >> >
> >> > here it is:
> >> >
> >> > # journalctl -alb | grep -nE "cron|NFSSERVER"
> >> > 1012:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
> >> > mnt-NFSSERVER_VOL.mount/start as 99
> >> > 1014:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
> >> > cron.service/start
> >> > as 101
> >> > 1714:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER_VOL...
> >> > 1716:Feb 23 06:50:43 SERVER systemd[1]: About to execute: /bin/mount
> -n
> >> > XXX.YYY.32.75:/vol/VOL /mnt/NFSSERVER_VOL -t nfs -o
> >> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> >> > 1718:Feb 23 06:50:43 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount
> changed
> >> > dead
> >> > -> mounting
> >> > 1720:Feb 23 06:50:43 SERVER systemd[574]: Executing: /bin/mount -n
> >> > XXX.YYY.32.75:/vol/VOL /mnt/NFSSERVER_VOL -t nfs -o
> >> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> >> > 1905:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount
> mounting
> >> > timed out. Stopping.
> >> > 1906:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount
> changed
> >> > mounting -> mounting-sigterm
> >> > 1915:Feb 23 06:52:13 SERVER systemd[1]: Child 574 belongs to
> >> > mnt-NFSSERVER_VOL.mount
> >> > 1916:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount mount
> >> > process exited, code=killed status=15
> >> > 1917:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount
> changed
> >> > mounting-sigterm -> mounted
> >> > 1918:Feb 23 06:52:13 SERVER systemd[1]: Job
> >> > mnt-NFSSERVER_VOL.mount/start
> >> > finished, result=done
> >> > 1919:Feb 23 06:52:13 SERVER systemd[1]: Mounted /mnt/NFSSERVER_VOL.
> >> > 2025:Feb 23 06:52:13 SERVER systemd[1]: About to execute:
> /usr/sbin/cron
> >> > -f
> >> > $EXTRA_OPTS
> >> > 2026:Feb 23 06:52:13 SERVER systemd[1]: Forked /usr/sbin/cron as 786
> >> > 2027:Feb 23 06:52:13 SERVER systemd[1]: cron.service changed dead ->
> >> > running
> >> > 2028:Feb 23 06:52:13 SERVER systemd[1]: Job cron.service/start
> finished,
> >> > result=done
> >> > 2029:Feb 23 06:52:13 SERVER systemd[786]: Executing: /usr/sbin/cron -f
> >> > 2038:Feb 23 06:52:13 SERVER cron[786]: (CRON) INFO (pidfile fd = 3)
> >> > 2128:Feb 23 06:52:13 SERVER cron[786]: (CRON) INFO (Running @reboot
> >> > jobs)
> >> > 2300:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount
> changed
> >> > mounted -> failed
> >> > 2301:Feb 23 06:52:13 SERVER systemd[1]: Failed to destroy cgroup
> >> > /system.slice/mnt-NFSSERVER_VOL.mount: Device or resource busy
> >> > 2302:Feb 23 06:52:13 SERVER systemd[1]: Unit mnt-NFSSERVER_VOL.mount
> >> > entered
> >> > failed state.
> >> > 2303:Feb 23 06:52:13 SERVER systemd[1]: Sent message type=signal
> >> > sender=n/a
> >> > destination=n/a
> >> > object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER_5fVOL_2emount
> >> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> >> > cookie=30
> >> > reply_cookie=0 error=n/a
> >> > 2304:Feb 23 06:52:13 SERVER systemd[1]: Sent message type=signal
> >> > sender=n/a
> >> > destination=n/a
> >> > object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER_5fVOL_2emount
> >> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> >> > cookie=31
> >> > reply_cookie=0 error=n/a
> >> >
> >> > so 1m30s passed and the mount didnt come up, which is (one of) the
> >> > original
> >> > issue (usually, running mount -t nfs -a will bring it up, even just
> >> > right
> >> > after the failure at boot, so it seems like it's not retried?). i
> check
> >> > in
> >> > the journalctl output around those lines, there was additional msg
> >> > relevant
> >> > to this.
> >>
> >> Is the network up at this point?
> >
> > i think so (and the other 11 NFS mounts were correctly mounted at this
> point
> > anyway): here's some logs of the network and the failing mount:
> >
> > # journalctl -alb | grep -inE "network|NFSSERVER"
> > 653:Feb 23 06:50:42 SERVER kernel: igb: Intel(R) Gigabit Ethernet Network
> > Driver - version 5.2.15-k
> > 656:Feb 23 06:50:42 SERVER kernel: igb 0000:01:00.0: Intel(R) Gigabit
> > Ethernet Network Connection
> > 661:Feb 23 06:50:42 SERVER kernel: igb 0000:01:00.1: Intel(R) Gigabit
> > Ethernet Network Connection
> > 678:Feb 23 06:50:42 SERVER kernel: drop_monitor: Initializing network
> drop
> > monitor service
> > 962:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
> > networking.service/start as 46
> > 963:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
> > network.target/start as 47
> > 969:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
> > network-online.target/start as 53
> > 1012:Feb 23 06:50:42 SERVER systemd[1]: Installed new job
> > mnt-NFSSERVER_VOL.mount/start as 99
> > 1548:Feb 23 06:50:42 SERVER systemd[1]: Starting LSB: Raise network
> > interfaces....
> > 1550:Feb 23 06:50:42 SERVER systemd[1]: About to execute:
> > /etc/init.d/networking start
> > 1551:Feb 23 06:50:42 SERVER systemd[1]: Forked /etc/init.d/networking as
> 407
> > 1552:Feb 23 06:50:42 SERVER systemd[1]: networking.service changed dead
> ->
> > start
> > 1553:Feb 23 06:50:42 SERVER systemd[407]: Executing:
> /etc/init.d/networking
> > start
> > 1591:Feb 23 06:50:43 SERVER networking[407]: Configuring network
> > interfaces...Cannot get EEE settings: Operation not supported
> > 1648:Feb 23 06:50:43 SERVER networking[407]: done.
> > 1649:Feb 23 06:50:43 SERVER systemd[1]: Received SIGCHLD from PID 407
> > (networking).
> > 1650:Feb 23 06:50:43 SERVER systemd[1]: Child 407 (networking) died
> > (code=exited, status=0/SUCCESS)
> > 1651:Feb 23 06:50:43 SERVER systemd[1]: Child 407 belongs to
> > networking.service
> > 1652:Feb 23 06:50:43 SERVER systemd[1]: networking.service: control
> process
> > exited, code=exited status=0
> > 1653:Feb 23 06:50:43 SERVER systemd[1]: networking.service got final
> SIGCHLD
> > for state start
> > 1654:Feb 23 06:50:43 SERVER systemd[1]: networking.service changed start
> ->
> > running
> > 1655:Feb 23 06:50:43 SERVER systemd[1]: Job networking.service/start
> > finished, result=done
> > 1656:Feb 23 06:50:43 SERVER systemd[1]: Started LSB: Raise network
> > interfaces..
> > 1657:Feb 23 06:50:43 SERVER systemd[1]: Starting Network.
> > 1658:Feb 23 06:50:43 SERVER systemd[1]: network.target changed dead ->
> > active
> > 1659:Feb 23 06:50:43 SERVER systemd[1]: Job network.target/start
> finished,
> > result=done
> > 1660:Feb 23 06:50:43 SERVER systemd[1]: Reached target Network.
> > 1661:Feb 23 06:50:43 SERVER systemd[1]: Starting Network is Online.
> > 1662:Feb 23 06:50:43 SERVER systemd[1]: network-online.target changed
> dead
> > -> active
> > 1663:Feb 23 06:50:43 SERVER systemd[1]: Job network-online.target/start
> > finished, result=done
> > 1664:Feb 23 06:50:43 SERVER systemd[1]: Reached target Network is Online.
> > 1714:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER_VOL...
> > 1716:Feb 23 06:50:43 SERVER systemd[1]: About to execute: /bin/mount -n
> > XXX.YYY.32.75:/vol/VOL /mnt/NFSSERVER_VOL -t nfs -o
> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> > 1718:Feb 23 06:50:43 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed
> dead
> > -> mounting
> > 1720:Feb 23 06:50:43 SERVER systemd[574]: Executing: /bin/mount -n
> > XXX.YYY.32.75:/vol/VOL /mnt/NFSSERVER_VOL -t nfs -o
> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> > 1877:Feb 23 06:50:58 SERVER systemd[1]: Child 489 belongs to
> > networking.service
> > 1879:Feb 23 06:50:58 SERVER systemd[1]: Child 501 belongs to
> > networking.service
> > 1891:Feb 23 06:51:19 SERVER systemd[1]: Child 539 belongs to
> > networking.service
>
> This is weird. Why would there be lingering processes of
> networking.service? Note that these processes exit after the nfs
> mounts are started.
>
> > 1892:Feb 23 06:51:19 SERVER systemd[1]: networking.service: cgroup is
> empty
> > 1893:Feb 23 06:51:19 SERVER systemd[1]: networking.service changed
> running
> > -> exited
> > 1905:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount mounting
> > timed out. Stopping.
> > 1906:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed
> > mounting -> mounting-sigterm
> > 1915:Feb 23 06:52:13 SERVER systemd[1]: Child 574 belongs to
> > mnt-NFSSERVER_VOL.mount
> > 1916:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount mount
> > process exited, code=killed status=15
> > 1917:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed
> > mounting-sigterm -> mounted
>
> This is definitely weird.
>
> > 1918:Feb 23 06:52:13 SERVER systemd[1]: Job mnt-NFSSERVER_VOL.mount/start
> > finished, result=done
> > 1919:Feb 23 06:52:13 SERVER systemd[1]: Mounted /mnt/NFSSERVER_VOL.
> > 2300:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER_VOL.mount changed
> > mounted -> failed
> > 2301:Feb 23 06:52:13 SERVER systemd[1]: Failed to destroy cgroup
> > /system.slice/mnt-NFSSERVER_VOL.mount: Device or resource busy
> > 2302:Feb 23 06:52:13 SERVER systemd[1]: Unit mnt-NFSSERVER_VOL.mount
> entered
> > failed state.
> > 2303:Feb 23 06:52:13 SERVER systemd[1]: Sent message type=signal
> sender=n/a
> > destination=n/a
> > object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER_5fVOL_2emount
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> cookie=30
> > reply_cookie=0 error=n/a
> > 2304:Feb 23 06:52:13 SERVER systemd[1]: Sent message type=signal
> sender=n/a
> > destination=n/a
> > object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER_5fVOL_2emount
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> cookie=31
> > reply_cookie=0 error=n/a
> > 2449:Feb 23 06:52:13 SERVER systemd[1]: Starting Host and Network Name
> > Lookups.
> > 2452:Feb 23 06:52:13 SERVER systemd[1]: Reached target Host and Network
> Name
> > Lookups.
> >
> > in particular
> >
> > 1664:Feb 23 06:50:43 SERVER systemd[1]: Reached target Network is Online.
> > 1714:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER_VOL...
> >
> > so the net online target is reached (if we believe systemd :) from man
> > systemd.special for network-online.target, "What precisely this requires
> is
> > left to the implementation of the network managing service.", how can we
> > know how this is implemented, and so what is meant here as 'online'?)
> before
> > attempting the mount.
> >
> > I checked at what point the /mnt/NFSSERVER_VOL is executed, and it was
> the
> > very last one NFS mount attempted (dunno if it's useful, i was expecting
> to
> > be the first, so with a possibly network still unstable), here below
> what I
> > called before /mnt/NFSSERVER_VOL is /mnt/NFSSERVER1_VOL (as i included
> also
> > the other NFS mounts):
> >
> > 1665:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER12_VOL...
> > 1669:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER7_VOL...
> > 1674:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER10...
> > 1679:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER7_VOL...
> > 1684:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER8...
> > 1689:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER7_VOL...
> > 1694:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER5...
> > 1699:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER4...
> > 1704:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER3...
> > 1709:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER2...
> > 1714:Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER1_VOL...
> > 1757:Feb 23 06:50:46 SERVER systemd[1]: mnt-NFSSERVER12_VOL.mount changed
> > mounting -> mounting-done
> > 1759:Feb 23 06:50:46 SERVER systemd[1]: Mounted /mnt/NFSSERVER12_VOL.
> > 1777:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER8.mount changed
> > mounting -> mounting-done
> > 1779:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER8.
> > 1780:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER7_VOL.mount changed
> > mounting -> mounting-done
> > 1782:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER7_VOL.
> > 1783:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER3.mount changed
> > mounting -> mounting-done
> > 1785:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER3.
> > 1786:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER5.mount changed
> > mounting -> mounting-done
> > 1788:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER5.
> > 1789:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER7_VOL.mount changed
> > mounting -> mounting-done
> > 1791:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER7_VOL.
> > 1792:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER7_VOL.mount changed
> > mounting -> mounting-done
> > 1794:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER7_VOL.
> > 1795:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER10.mount changed
> > mounting -> mounting-done
> > 1797:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER10.
> > 1830:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER4.mount changed
> > mounting -> mounted
> > 1832:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER4.
> > 1836:Feb 23 06:50:47 SERVER systemd[1]: mnt-NFSSERVER2.mount changed
> > mounting -> mounted
> > 1838:Feb 23 06:50:47 SERVER systemd[1]: Mounted /mnt/NFSSERVER2.
> > 1906:Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed
> > mounting -> mounting-sigterm
> > 1919:Feb 23 06:52:13 SERVER systemd[1]: Mounted /mnt/NFSSERVER1_VOL.
> >
> > but it is the first one defined in /etc/fstab:
> >
> > XXX.YYY.32.75:/vol/vol3 /mnt/NFSSERVER1_VOL nfs
> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> > XXX.YYY.26.132:/ifs /mnt/NFSSERVER2 nfs
> > rw,intr,tcp,rdirplus,noatime,_netdev
> > XXX.YYY.20.176:/ifs /mnt/NFSSERVER3 nfs
> > ro,tcp,rdirplus,noatime,_netdev
> > XXX.YYY.16.226:/ifs /mnt/NFSSERVER4 nfs
> > rw,intr,tcp,rdirplus,noatime,_netdev
> > XXX.YYY.21.22:/ifs/scratcho /mnt/NFSSERVER5 nfs
> > rw,intr,tcp,rdirplus,noatime,_netdev
> > XXX.YYY.27.74:/vol/vol5 /mnt/NFSSERVER7_VOL nfs
> > rw,intr,tcp,rdirplus,noatime,_netdev
> > XXX.YYY.32.111:/buildcache /mnt/NFSSERVER8 nfs
> > ro,soft,intr,tcp,rdirplus,noatime,_netdev
> > XXX.YYY.27.74:/vol/vol1 /mnt/NFSSERVER7_VOL nfs
> > ro,soft,intr,tcp,rdirplus,noatime,_netdev
> > XXX.YYY.20.176:/ifs /mnt/NFSSERVER10 nfs
> > ro,nocto,actimeo=600,tcp,rdirplus,noatime,_netdev
> > XXX.YYY.27.74:/vol/vol6 /mnt/NFSSERVER7_VOL nfs
> > ro,soft,intr,tcp,rdirplus,noatime,_netdev
> > XXX.YYY.20.14:/vol/vol3 /mnt/NFSSERVER12_VOL nfs
> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> >
> > something else that sounds fishy is:
> >
> > Feb 23 06:50:43 SERVER systemd[1]: Mounting /mnt/NFSSERVER1_VOL...
> > Feb 23 06:50:43 SERVER systemd[1]: About to execute: /bin/mount -n
> > XXX.YYY.32.75:/vol/vol3 /mnt/NFSSERVER1_VOL -t nfs -o
> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> > Feb 23 06:50:43 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed dead
> ->
> > mounting
> > Feb 23 06:50:43 SERVER systemd[574]: Executing: /bin/mount -n
> > XXX.YYY.32.75:/vol/vol3 /mnt/NFSSERVER1_VOL -t nfs -o
> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> > Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount mounting
> timed
> > out. Stopping.
> > Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed
> mounting
> > -> mounting-sigterm
> > Feb 23 06:52:13 SERVER systemd[1]: Child 574 belongs to
> > mnt-NFSSERVER1_VOL.mount
> > Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount mount process
> > exited, code=killed status=15
> > Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed
> > mounting-sigterm -> mounted
> > Feb 23 06:52:13 SERVER systemd[1]: Job mnt-NFSSERVER1_VOL.mount/start
> > finished, result=done
> > Feb 23 06:52:13 SERVER systemd[1]: Mounted /mnt/NFSSERVER1_VOL.
> > Feb 23 06:52:13 SERVER systemd[1]: Starting Remote File Systems.
> > Feb 23 06:52:13 SERVER systemd[1]: remote-fs.target changed dead ->
> active
> > Feb 23 06:52:13 SERVER systemd[1]: Job remote-fs.target/start finished,
> > result=done
> > Feb 23 06:52:13 SERVER systemd[1]: Reached target Remote File Systems.
> > Feb 23 06:52:13 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed
> mounted
> > -> failed
> >
> > so it is actually marked as done/mounted (?!?): "mnt-NFSSERVER1_VOL.mount
> > changed mounting-sigterm -> mounted", which triggers "Job
> > remote-fs.target/start finished, result=done" - but that seems wrong, it
> was
> > sigterm'ed, why systemd thinks this is actually completed correctly? at
> the
> > end it doesnt, as it's marked as failed, but that "mounting-sigterm ->
> > mounted" transition appears to be a mistake
>
> Yes, that is definitely weird.
>
> >
> >
> >>
> >> >
> >> > also note like cron.service is started, even if we configured:
> >> >
> >> > # grep remote-fs /etc/systemd/system/cron.service
> >> > Requires=remote-fs.target
> >> > After=remote-fs.target
> >>
> >> What does `systemctl show remote-fs.target` say?
> >
> > Id=remote-fs.target
> > Names=remote-fs.target
> > Requires=mnt-NFSSERVER12_VOL.mount mnt-NFSSERVER7_vol6.mount
> > mnt-NFSSERVER32.mount mnt-NFSSERVER9_vol1.mount mnt-NFSSERVER8.mount
> > mnt-NFSSERVER11_vol5.mount mnt-NFSSERVER5.mount mnt-NFSSERVER4.mount
> > mnt-gonzo.mount mnt-NFSSERVER2.mount mnt-NFSSERVER1_vol3.mount
> > Wants=remote-fs-pre.target
> > RequiredBy=cron.service
> > WantedBy=multi-user.target
> > Conflicts=shutdown.target
> > Before=console-setup.service kbd.service systemd-journal-flush.service
> > systemd-user-sessions.service mcelog.service ntp.service
> ipmidetectd.service
> > kexec-load.service nagios-nrpe-server.service nutcracker.service
> > lsyncd.service irqbalance.service ipwatchd.service
> ganglia-monitor.service
> > unbound.service bmc-watchdog.service setup_hugetlbfs.service
> sysstat.service
> > postfix.service cron.service
> > After=remote-fs-pre.target local-fs.target mnt-NFSSERVER1_VOL.mount
> > mnt-NFSSERVER2.mount mnt-NFSSERVER3.mount mnt-NFSSERVER4.mount
> > mnt-NFSSERVER5.mount mnt-NFSSERVER7_vol5.mount mnt-NFSSERVER8.mount
> > mnt-NFSSERVER9_vol1.mount mnt-NFSSERVER10.mount
> mnt-NFSSERVER11_vol6.mount
> > mnt-NFSSERVER12_vol3.mount
> > Documentation=man:systemd.special(7)
> > Description=Remote File Systems
> > LoadState=loaded
> > ActiveState=active
> > SubState=active
> > FragmentPath=/lib/systemd/system/remote-fs.target
> >
> DropInPaths=/run/systemd/generator/remote-fs.target.d/50-insserv.conf.conf
> > UnitFileState=enabled
> > InactiveExitTimestamp=Tue 2016-02-23 06:52:13 EST
> > InactiveExitTimestampMonotonic=104052746
> > ActiveEnterTimestamp=Tue 2016-02-23 06:52:13 EST
> > ActiveEnterTimestampMonotonic=104052746
> > ActiveExitTimestampMonotonic=0
> > InactiveEnterTimestampMonotonic=0
> > CanStart=yes
> > CanStop=yes
> > CanReload=no
> > CanIsolate=no
> > StopWhenUnneeded=no
> > RefuseManualStart=no
> > RefuseManualStop=no
> > AllowIsolate=no
> > DefaultDependencies=no
> > OnFailureJobMode=replace
> > IgnoreOnIsolate=no
> > IgnoreOnSnapshot=no
> > NeedDaemonReload=no
> > JobTimeoutUSec=0
> > ConditionResult=yes
> > ConditionTimestamp=Tue 2016-02-23 06:52:13 EST
> > ConditionTimestampMonotonic=104052734
> > Transient=no
> >
> >>
> >> >
> >> > checking the status of that target:
> >> >
> >> > # systemctl status remote-fs.target
> >> > ● remote-fs.target - Remote File Systems
> >> > Loaded: loaded (/lib/systemd/system/remote-fs.target; enabled)
> >> > Drop-In: /run/systemd/generator/remote-fs.target.d
> >> > └─50-insserv.conf.conf
> >> > Active: active since Tue 2016-02-23 06:52:13 EST; 3h 15min ago
> >> > Docs: man:systemd.special(7)
> >> >
> >> > Feb 23 06:52:13 SERVER systemd[1]: Starting Remote File Systems.
> >> > Feb 23 06:52:13 SERVER systemd[1]: Job remote-fs.target/start
> finished,
> >> > result=done
> >> > Feb 23 06:52:13 SERVER systemd[1]: Reached target Remote File Systems.
> >> >
> >> > so at the same time when mnt-NFSSERVER_VOL.mount is marked as failed,
> >> > the
> >> > remote-fs.target is marked as loaded successfully (which seems the
> wrong
> >> > status to me), and in fact the only failed unit is:
> >> >
> >> > # systemctl --failed
> >> > UNIT LOAD ACTIVE SUB DESCRIPTION
> >> > ● mnt-NFSSERVER_VOL.mount loaded failed failed /mnt/NFSSERVER_VOL
> >> >
> >> > 1 loaded units listed. Pass --all to see loaded but inactive units,
> too.
> >>
> >> Looks like remote-fs.target does not Require its mounts, only Wants.
> >
> > from the 'show' command above it seems it actually Requires the mount,
> where
> > did you find the info about only 'Want'ing the mounts?
>
> I was guessing, but your output suggests the relation is correct. I am
> very confused...
>
> >
> >>
> >> >
> >> >
> >> > can I provide more logs/info? do you see anything wrong on this
> >> > configuration that we might want to change?
> >>
> >> It seems strange that the mount fails.
> >
> > yeah I agree it's strange, but there might be situations when the
> particular
> > nfs server is overloaded, and/or the network path is a bit congested, and
> > the mount might take a while to come up; we consider a slow mount
> acceptable
> > (we dont reboot machines that often, and servers are already slow beasts
> to
> > start, a few secs more is not an issue), we just feel like if something
> goes
> > a bit out of the usual, the mount is never retried and it's just left it
> > there, dying
>
> Maybe the timeout is just too short. Maybe adding
> x-systemd.device-timeout=90s helps?
>
i think that's already 90s by default? at least we see the mount fail after
90s, so maybe we should set to a lower or higher value? I'm unsure if
that's what could trigger a "retry" of the mount, because as soon as I see
the machine online, i can login and issue a mount -t nfs -a and all the
missing mountpoints (still pending for systemd) are promptly mounted, so
it's like if they are "frozen" and a retry would just make them successful.
>
> A completely different alternative is to setup the nfs mounts as
> automounts instead of real mounts (ie, set x-systemd.automount
> option). This would have the aditional benefit of removing the need to
> specify dependency against remote-fs.target.
>
i dont think we want that: we prefer (for various reason) to have all the
mountpoint, services and processes running at the machine boot, and be
already there when we want to use them
>
> This option will cause systemd to attempt to mount whenever a subpath
> is accessed and the mount is not yet active.
>
> >
> >> Any logs which might shed some
> >> light on why the mount command fails (especially network status)?
> >
> > sadly i cant find anything relevant: dmesg has nothing, journalctl
> contains
> > nothing helpful/related, ad so do syslog or messages - any other place
> you
> > want me to look?
> >
> >> Does
> >> the mount complete successfully if started via systemd post-boot ?
> >> that is, `systemctl start mnt-NFSSERVER_VOL.mount`.
> >
> > it usually does, up to the point that we setup a "mount -t nfs -a" as an
> > @reboot cronjob just to get the mounts up
> >
> > on this specific machine. I'm running that command now, so 24hrs after
> when
> > that happened, and it completes just fine (but i have no doubts it would
> > have succeeded even just after the boot failure):
> >
> > # systemctl --failed --no-legend
> > mnt-NFSSERVER_VOL.mount loaded failed failed /mnt/NFSSERVER_VOL
> > # systemctl start mnt-NFSSERVER_VOL.mount
> >
> > Feb 24 07:24:21 SERVER systemd[1]: Trying to enqueue job
> > mnt-NFSSERVER1_VOL.mount/start/replace
> > Feb 24 07:24:21 SERVER systemd[1]: Installed new job
> > mnt-NFSSERVER1_VOL.mount/start as 565
> > Feb 24 07:24:21 SERVER systemd[1]: Enqueued job
> > mnt-NFSSERVER1_VOL.mount/start as 565
> > Feb 24 07:24:21 SERVER systemd[1]: Mounting /mnt/NFSSERVER1_VOL...
> > Feb 24 07:24:21 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount: Directory
> > /mnt/majestix_vol3 to mount over is not empty, mounting anyway.
> > Feb 24 07:24:21 SERVER systemd[1]: About to execute: /bin/mount -n
> > XXX.YYY.32.75:/vol/vol3 /mnt/NFSSERVER1_VOL -t nfs -o
> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> > Feb 24 07:24:21 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed
> failed
> > -> mounting
> > Feb 24 07:24:21 SERVER systemd[1]: Sent message type=signal sender=n/a
> > destination=n/a
> > object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER1_5fvol3_2emount
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> cookie=3
> > reply_cookie=0 error=n/a
> > Feb 24 07:24:21 SERVER systemd[1]: Sent message type=signal sender=n/a
> > destination=n/a
> > object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER1_5fvol3_2emount
> > interface=org.freedesktop.DBus.Properties member=PropertiesChanged
> cookie=4
> > reply_cookie=0 error=n/a
> > Feb 24 07:24:21 SERVER systemd[1]: Got message type=method_call
> sender=n/a
> > destination=org.freedesktop.systemd1
> > object=/org/freedesktop/systemd1/unit/mnt_2dNFSSERVER1_5fvol3_2emount
> > interface=org.freedesktop.DBus.Properties member=Get cookie=3
> reply_cookie=0
> > error=n/a
> > Feb 24 07:24:21 SERVER systemd[27575]: Executing: /bin/mount -n
> > XXX.YYY.32.75:/vol/vol3 /mnt/NFSSERVER1_VOL -t nfs -o
> > ro,intr,nolock,tcp,rdirplus,noatime,_netdev
> > Feb 24 07:24:21 SERVER systemd[1]: Child 27575 belongs to
> > mnt-NFSSERVER1_VOL.mount
> > Feb 24 07:24:21 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount mount process
> > exited, code=exited status=0
> > Feb 24 07:24:21 SERVER systemd[1]: mnt-NFSSERVER1_VOL.mount changed
> mounting
> > -> mounted
> > Feb 24 07:24:21 SERVER systemd[1]: Job mnt-NFSSERVER1_VOL.mount/start
> > finished, result=done
> > Feb 24 07:24:21 SERVER systemd[1]: Mounted /mnt/NFSSERVER1_VOL.
> >
> > # systemctl status | head
> > ● SERVER
> > State: running
> > Jobs: 0 queued
> > Failed: 0 units
> >
>
> OK, so no systemd-specific environment snafu, and more likely just a
> timing issue.
>
> I have to confess I don't have much more ideas on where to look...
>
ok, at least we are not alone in not understanding what's going on :)
but as you can imagine, this is a rather nasty issue and -while we are
moving forward with the adoption of jessie- is making a lot of people
uncomfortable and skeptical (i just want to express the feelings we have
not any complains on the quality of systemd or debian :) )
would you think it might be viable to contact systemd upstream about this?
Jessie runs 215 while upstream released 229 so the risk of a "get the
latest version and report back" is high, and it's not something easily
doable i guess (?). would you prefer to start this discussion yourself, as
you might have a well established relationship with systemd upstream? i can
trigger the discussion myself as well, and copying the systemd debiam maint
ml, as you prefer
do we have any other option to explore here? the first thing that comes to
mind is to go back to sysv but i really really dont want to do it (it's
looking at the past not towards the future), is there something else?
thanks for your continuous support!
--
Sandro "morph" Tosi
My website: http://sandrotosi.me/
Me at Debian: http://wiki.debian.org/SandroTosi
G+: https://plus.google.com/u/0/+SandroTosi
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://lists.alioth.debian.org/pipermail/pkg-systemd-maintainers/attachments/20160302/d31fb1de/attachment-0001.html>
More information about the Pkg-systemd-maintainers
mailing list