Bug#852461: systemd: Creating scopes with systemd-run unreliable

Michael Gebetsroither michael at mgeb.org
Fri Jan 27 17:56:31 GMT 2017


On 2017-01-24 20:20, Michael Biebl wrote:
Hi Michael,

>> The creation of scopes with systemd-run --scope is unreliable.
>> It seems to depend on system "load" or some other factors affected
>> by system "load".
>> On production systems we've seen an error rate of up to 20%.
> 
>> % for i in `seq 50`; do (systemd-run --scope -- sleep 5 2>&1 ) & done 2>&1 |grep -ni failed
> 
> 
> You mean you see this problem under high load? What kind of load?

We are seeing this problem especially under high load, but for the commands above and the output
this was on a test-system without any load on it (4 cores kvm).

> Could it be that you are running into D-Bus timeouts?

doesn't seem like it, see below.

> Do you get any error messages in the journal?

no

> Can you reproduce the issue with v232 from stretch or tell us how we can
> reproduce the issue?
>
> Even my small PI (running raspbian + systemd 215-17+deb8u6) seems to
> handle your for loop without a hitch even if under load.

thx for trying so hard to reproduce the issue!
And sorry for wasting your time and not just sending some debug logs :/

# strace log of an unsuccessful systemd-run run

The error sent from systemd back to systemd-run seems to be:
"Unit run-24908.scope already exists."

A working run prints the scope with pid of the child process, not the pid of the caller.

> 24908 socket(PF_LOCAL, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 3
> 24908 setsockopt(3, SOL_SOCKET, SO_PASSCRED, [1], 4) = 0
> 24908 setsockopt(3, SOL_SOCKET, 0x22 /* SO_??? */, [0], 4) = 0
> 24908 getsockopt(3, SOL_SOCKET, SO_RCVBUF, [212992], [4]) = 0
> 24908 setsockopt(3, SOL_SOCKET, 0x21 /* SO_??? */, [8388608], 4) = 0
> 24908 getsockopt(3, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
> 24908 setsockopt(3, SOL_SOCKET, 0x20 /* SO_??? */, [8388608], 4) = 0
> 24908 connect(3, {sa_family=AF_LOCAL, sun_path="/run/systemd/private"}, 22) = 0
> 24908 getsockopt(3, SOL_SOCKET, SO_PEERCRED, {pid=1, uid=0, gid=0}, [12]) = 0
> 24908 fstat(3, {st_dev=makedev(0, 7), st_ino=8642755, st_mode=S_IFSOCK|0777, st_nlink=1, st_uid=0, st_gid=0, st_blksize=4096, st_blocks=0, st_size=0, st_atime=0, st_mtime=0, st_ctime=0}) = 0
> 24908 getsockopt(3, SOL_SOCKET, SO_ACCEPTCONN, [0], [4]) = 0
> 24908 getsockname(3, {sa_family=AF_LOCAL, sun_path=@"37a95"}, [8]) = 0
> 24908 geteuid()                         = 0
> 24908 sendmsg(3, {msg_name(0)=NULL, msg_iov(3)=[{"\0AUTH EXTERNAL ", 15}, {"30", 2}, {"\r\nNEGOTIATE_UNIX_FD\r\nBEGIN\r\n", 28}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 45
> 24908 getsockopt(3, SOL_SOCKET, SO_PEERCRED, {pid=1, uid=0, gid=0}, [12]) = 0
> 24908 recvmsg(3, 0x7ffe1836b9d0, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
> 24908 ppoll([{fd=3, events=POLLIN}], 1, {24, 999718000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {24, 999061963})
> 24908 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"OK 68040979bb0143a29a072e055d9fa717\r\nAGREE_UNIX_FD\r\n", 256}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 52
> 24908 sendmsg(3, {msg_name(0)=NULL, msg_iov(2)=[{"l\1\0\1p\0\0\0\1\0\0\0\266\0\0\0\1\1o\0\31\0\0\0/org/freedesktop/systemd1\0\0\0\0\0\0\0\3\1s\0\22\0\0\0StartTransientUnit\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\6\1s\0\30\0\0\0org.freedesktop.systemd1\0\0\0\0\0\0\0\0\10\1g\0\20ssa(sv)a(sa(sv))\0\0\0", 200}, {"\17\0\0\0run-24908.scope\0\4\0\0\0fail\0\0\0\0@\0\0\0\0\0\0\0\v\0\0\0Description\0\1s\0\0\f\0\0\0/bin/sleep 5\0\0\0\0\4\0\0\0PIDs\0\2au\0\0\0\0\4\0\0\0La\0\0\0\0\0\0\0\0\0\0", 112}], msg_controllen=0, msg_flags=0}, MSG_DONTWAIT|MSG_NOSIGNAL) = 312
> 24908 recvmsg(3, 0x7ffe1836bad0, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
> 24908 ppoll([{fd=3, events=POLLIN}], 1, {24, 999950000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {24, 999721214})
> 24908 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\4\1\1K\0\0\0\1\0\0\0p\0\0\0\1\1o\0\31\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
> 24908 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\7\0\0\0UnitNew\0\10\1g\0\2so\0\17\0\0\0run-24918.scope\0002\0\0\0/org/freedesktop/systemd1/unit/run_2d24918_2escope\0", 179}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 179
> 24908 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\4\1\1D\0\0\0\2\0\0\0q\0\0\0\1\1o\0\31\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
> 24908 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"/org/freedesktop/systemd1\0\0\0\0\0\0\0\2\1s\0 \0\0\0org.freedesktop.systemd1.Manager\0\0\0\0\0\0\0\0\3\1s\0\6\0\0\0JobNew\0\0\10\1g\0\3uos\0\0\0\0\0\0\0\0\344\322\f\0$\0\0\0/org/freedesktop/systemd1/job/840420\0\0\0\0\17\0\0\0run-24918.scope\0", 180}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 180
> 24908 recvmsg(3, 0x7ffe1836bad0, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = -1 EAGAIN (Resource temporarily unavailable)
> 24908 ppoll([{fd=3, events=POLLIN}], 1, {24, 999467000}, NULL, 8) = 1 ([{fd=3, revents=POLLIN}], left {24, 999362896})
> 24908 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"l\3\1\1)\0\0\0\3\0\0\0?\0\0\0\5\1u\0\1\0\0\0", 24}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 24
> 24908 recvmsg(3, {msg_name(0)=NULL, msg_iov(1)=[{"\4\1s\0#\0\0\0org.freedesktop.systemd1.UnitExists\0\0\0\0\0\10\1g\0\1s\0\0$\0\0\0Unit run-24908.scope already exists.\0", 97}], msg_controllen=32, {cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS{pid=1, uid=0, gid=0}}, msg_flags=MSG_CMSG_CLOEXEC}, MSG_DONTWAIT|MSG_NOSIGNAL|MSG_CMSG_CLOEXEC) = 97
> 24908 writev(2, [{"Failed to create message: Input/output error", 44}, {"\n", 1}], 2) = 45
> 24908 exit_group(1)                     = ?
> 24908 +++ exited with 1 +++

gebi




More information about the Pkg-systemd-maintainers mailing list