Bug#854197: systemd: please handle the case where socket activation leads to restart loop better

Sat Feb 4 23:05:03 GMT 2017

Package: systemd
Version: 232-15
Severity: wishlist

Dear maintainers,

while helping out on debian-mentors@ with #854192 I noticed that systemd
doesn't handle the following case very well: dbus is installed but not
configured properly (here due to a bug in the usbguard package, which was
missing a dependency on dbus). Trying to start a Type=dbus service (one
that makes DBus requests) then causes a nasty restart loop that you can
only get out of by stopping dbus.socket, and it is far from obvious that
this is what you should do.

Steps to reproduce:

 - install a stretch system, minimal (tasksel empty), no DBus
 - Recreate a broken DBus installation:
      apt-get download libdbus-1-3 dbus
      dpkg --install libdbus-1-3_*.deb
      dpkg --unpack dbus_*.deb
 - Create a dummy service:
      cat > /etc/systemd/system/dummy.service
      [Service]
      BusName=org.example.dummy
      ExecStart=/usr/bin/dbus-monitor --system
      (Ctrl+D)
 - Try to start that service
      systemctl daemon-reload
      systemctl start dummy

The dbus-monitor startup will cause dbus.socket to be triggered, which
in turn will cause systemd to try to start dbus.service. The problem here
is that dbus's postinst won't have run yet, so the "messagebus" user won't
exist, so dbus-daemon won't start up properly.

Problem: this creates a restart loop, since systemd tries to start
the service over and over again because there's data on the DBus socket.
I'm pretty sure you could also reproduce this with other socket-activated
services, but the steps above definitely reproduce it.

Doing systemctl stop dummy or systemctl stop dbus doesn't help here;
masking dbus.service or dummy.service doesn't either. journalctl doesn't
say anything useful except "Looping too fast" being printed every 1s or
so. systemctl daemon-reexec has no effect (it does reexec though). The
only way to get out of this problem is to stop dbus.socket, which is
not very obvious to a user - even I didn't think of that immediately,
and rebooted my test VM a couple of times while figuring this out. I
suspect users with less knowledge of systemd than I will not fare
better.

What I would like to see is: systemd could maybe print a message when
a service (repeatedly) fails to start as a result of socket activation
(including which socket is responsible), so that users have an idea of
what they could do to make systemd cooperate again. Also, one could
think about a mode where a socket is stopped (in failed state)
automatically after the service associated with it has failed to start
more than N times (configurable in the socket's unit file), with N
defaulting to 30 or something similar. This would really help in this
kind of situation.
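
To make the proposal concrete, here is a minimal sketch of the idea as a
toy Python model. It is not systemd code: SocketUnit, max_start_failures
and run_activation_loop are hypothetical names made up for illustration,
and the iteration cap exists only so the unlimited case terminates.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SocketUnit:
    """Toy stand-in for a socket unit (hypothetical, not systemd's structure)."""
    name: str
    # Hypothetical per-socket setting: None models today's unlimited behaviour.
    max_start_failures: Optional[int] = None
    state: str = "listening"
    failures: int = 0
    messages: List[str] = field(default_factory=list)


def run_activation_loop(sock: SocketUnit,
                        service_starts: Callable[[], bool],
                        max_iterations: int = 1000) -> int:
    """Model the cycle: data on socket -> start service -> failure -> repeat.

    Returns the number of start attempts made. max_iterations only bounds
    the simulation; it is not part of the proposal.
    """
    attempts = 0
    while sock.state == "listening" and attempts < max_iterations:
        attempts += 1
        if service_starts():
            sock.state = "running"
            break
        sock.failures += 1
        # Proposed behaviour: once the service has failed more than N times,
        # stop the socket, mark it failed, and name it in the message.
        limit = sock.max_start_failures
        if limit is not None and sock.failures > limit:
            sock.state = "failed"
            sock.messages.append(
                f"{sock.name}: triggered service failed to start "
                f"{sock.failures} times, stopping socket"
            )
    return attempts


def always_broken() -> bool:
    """Stands in for a dbus-daemon that cannot start (no messagebus user)."""
    return False


limited = SocketUnit("dbus.socket", max_start_failures=30)
attempts = run_activation_loop(limited, always_broken)
print(attempts, limited.state, limited.messages)
```

With max_start_failures left at None the loop keeps going until the
simulation cap is hit and the socket stays in the listening state, which
is the current situation; with a limit set, the socket ends up failed and
the message names the socket the user has to look at.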

Not sure about the severity of this bug, because the current behavior
of systemd does indeed work as designed (data on the socket -> try to
start service -> service fails -> service marked inactive -> systemd
looks at socket again -> data on the socket -> rinse and repeat ...),
but the consequences are rather nasty IMHO. I've filed it under
wishlist for now because of the "works as designed" argument, but my
annoyance level with this bug would easily make this 'normal' or
'important'. I'll leave this up to you.

Regards,
Christian