[pkg-lxc-devel] Bug#961584: Bug#961584: Bug#961584: lxc-stop fails with exit code 1
Antonio Terceiro
terceiro at debian.org
Thu Sep 17 13:57:47 BST 2020
Hi,
On Wed, Sep 16, 2020 at 11:14:58PM +0200, Pierre-Elliott Bécue wrote:
> Le mercredi 16 septembre 2020 à 22:08:24+0200, Pierre-Elliott Bécue a écrit :
> > Le vendredi 11 septembre 2020 à 11:12:25+0200, Iñaki Malerba a écrit :
> > > Hi Pebs,
> > >
> > > Thanks for checking this.
> > >
> > > On Sat, 5 Sep 2020 23:23:30 +0200 Pierre-Elliott =?utf-8?B?QsOpY3Vl?=
> > > <peb at debian.org> wrote:>
> > > > LXC's devs told me that 4.0.4 should solve it. I'm uploading this
> > > > release now. Please don't hesitate to tell me if it helps.
> > >
> > > Run a pipeline removing the pinning of lxc, and the behaviour seems to
> > > be the same.
> > >
> > > Image building:
> > > https://salsa.debian.org/ina/pipeline/-/jobs/990332
> > > > Setting up lxc (1:4.0.4-1) ..
> > >
> > > Running lxc:
> > > https://salsa.debian.org/ina/pipeline/-/jobs/990352
> > > > <VirtSubproc>: failure: ['sudo', 'timeout', '600', 'lxc-stop',
> > > '--quiet', '--kill', '--name', 'ci-254-b2fcad5f'] failed (exit status 1,
> > > stderr '')
> > >
> > > Please let me know if you want us to test something else.
> > >
> > > Abrazos,
> >
> > Could you get me a full trace like the previous time? I have no
> > technical means of running proper tests currently, sorry. :/
> >
> > Cheers!
>
> I found a way to run tests on my own.
>
> Turns out I tried to add a lxc-attach autopkgtest-stable-amd64 -- ps
> auxf to see the process tree in case I could find something useful,
> and… the container successfully stopped that time. I retried and it kept
> working.
>
> The process tree I see is:
> ─( 23:09:35 )─< /home/becue/tmp >───────────────────────────────────────────────[ 0 ]─
> root at dawaj # docker run --rm --privileged -i autopkgtest
> Starting LXC network bridge: :Starting LXC autoboot containers: :USER
> PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 4 0.0 0.0 7644 2760 ? R 21:09 0:00 ps auxf
> root 1 0.0 0.0 20904 7492 ? Ds 21:09 0:00 /sbin/init
> ok
>
> After some more tests, it seems that lxc-start && lxc-stop isn't working
> properly because the signal is sent before the container is ready to
> handle it.
>
> After this test I decided to add a sleep 2 before the lxc-attach ... --
> ps command:
>
> Starting LXC network bridge: :Starting LXC autoboot containers: :USER
> PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> root 52 0.0 0.0 7644 2768 ? R 21:10 0:00 ps auxf
> root 1 2.5 0.1 21524 9596 ? Ss 21:10 0:00 /sbin/init
> root 17 0.5 0.1 27444 8404 ? Ss 21:10 0:00 /lib/systemd/systemd-journald
> root 27 0.0 0.0 2348 1772 ? Ss 21:10 0:00 /sbin/ifup -a --read-environment
> root 42 0.0 0.0 2392 764 ? S 21:10 0:00 \_ /bin/sh -c /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases eth0 .
> root 43 0.0 0.0 8456 1936 ? S 21:10 0:00 \_ /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases eth0
> root 44 0.0 0.0 9492 5644 ? S 21:10 0:00 \_ /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases eth0
> message+ 50 0.0 0.0 8696 3636 ? Ss 21:10 0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
> root 51 0.5 0.0 19308 6376 ? Ss 21:10 0:00 /lib/systemd/systemd-logind
> ok
>
> Turns out your lxc-stop is really fast, and therefore, not catched
> properly by LXC.
>
> While I appreciate it shouldn't be a corner case that makes things
> explode, do you think there's a way to take this realization into
> account to lower the severity of this bug, having a temporary fix set up
> in place?
>
> I'll still try to see what upstream could offer to handle this in a
> better way.
I did a little bit of debugging on this today. I think the provided
Dockerfile does not really reproduce the real issue. The lxc-stop call
from lxc comes after the tests are run, so plenty of time after the
lxc-start call.
I installed autopkgtest inside the container, did manually the mount
steps listed in the Dockerfile, and tried it directly. Failed as
expected:
root at 665d38b2f9e3:~/pkg# autopkgtest -B . -- lxc autopkgtest-stable-amd64
autopkgtest [12:41:01]: starting date: 2020-09-17
autopkgtest [12:41:01]: version 5.14
autopkgtest [12:41:01]: host 665d38b2f9e3; command line: /usr/bin/autopkgtest -B . -- lxc autopkgtest-stable-amd64
autopkgtest [12:41:07]: testbed dpkg architecture: amd64
autopkgtest [12:41:08]: testbed running kernel: Linux 5.8.0-1-amd64 #1 SMP Debian 5.8.7-1 (2020-09-05)
autopkgtest [12:41:08]: @@@@@@@@@@@@@@@@@@@@ built-tree .
autopkgtest [12:41:08]: test command1: preparing testbed
autopkgtest [12:41:08]: test command1: true
autopkgtest [12:41:08]: test command1: [-----------------------
autopkgtest [12:41:09]: test command1: -----------------------]
autopkgtest [12:41:09]: test command1: - - - - - - - - - - results - - - - - - - - - -
command1 PASS
autopkgtest [12:41:09]: @@@@@@@@@@@@@@@@@@@@ summary
command1 PASS
<VirtSubproc>: failure: ['lxc-stop', '--quiet', '--kill', '--name', 'autopkgtest-lxc-jodrho'] failed (exit status 1, stderr '')
autopkgtest [12:41:09]: ERROR: autopkgtest
: error cleaning up:
autopkgtest [12:41:09]: ERROR: testbed failure: cannot send to testbed: [Errno 32] Broken pipe
Note however the despite lxc-stop call returning non-zero, the container
is stopped. I then tried starting the container again, and stopping it
by hand. Failed as well:
root at 665d38b2f9e3:~/pkg# lxc-start --name autopkgtest-lxc-jodrho
root at 665d38b2f9e3:~/pkg# lxc-attach --name autopkgtest-lxc-jodrho -- sh -c 'sleep 1 && echo HI'
HI
root at 665d38b2f9e3:~/pkg# lxc-stop --kill --name autopkgtest-lxc-jodrho
lxc-stop: autopkgtest-lxc-jodrho: commands.c: lxc_cmd_stop: 775 No such file or directory - Failed to stop container "autopkgtest-lxc-jodrho"
Note however that this time I omitted the --quiet flag that autopkgtest
uses, which gives us at least some extra info. No such file of
directory.
I tried lxc-stop with the --kill flag:
root at 665d38b2f9e3:~/pkg# lxc-start --name autopkgtest-lxc-jodrho
root at 665d38b2f9e3:~/pkg# lxc-attach --name autopkgtest-lxc-jodrho -- sh -c 'sleep 1 && echo HI'
HI
root at 665d38b2f9e3:~/pkg# lxc-stop --name autopkgtest-lxc-jodrho
Works!
So this seems related to the --kill flag.
I then tried to strace the lxc-stop --kill call. By the end of the log,
there is this:
openat(AT_FDCWD, "/lxc/autopkgtest-lxc-jodrho/partial", O_RDWR|O_CLOEXEC) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0
getpid() = 3567
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\232\r\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0
getpid() = 3567
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0
getpid() = 3567
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
close(3) = 0
write(2, "lxc-stop: autopkgtest-lxc-jodrho"..., 34lxc-stop: autopkgtest-lxc-jodrho: ) = 34
write(2, "commands.c: lxc_cmd_stop: 775 ", 30commands.c: lxc_cmd_stop: 775 ) = 30
write(2, "No such file or directory - Fail"..., 77No such file or directory - Failed to stop container "autopkgtest-lxc-jodrho") = 77
write(2, "\n", 1
) = 1
exit_group(1) = ?
+++ exited with 1 +++
First I though that it could be related to /lxc/autopkgtest-lxc-jodrho/partial,
but strace'ing a lxc-stop call without --kill, there is a similar error, so
that's not it. For comparison, the end of the logs in that case look like this:
openat(AT_FDCWD, "/lxc/test/partial", O_RDWR|O_CLOEXEC) = -1 ENOENT (No such file or directory)
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\f\17\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\22\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[4]}], msg_controllen=24, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\f\17\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
close(3) = 0
openat(AT_FDCWD, "/proc/3852/status", O_RDONLY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
read(3, "Name:\tsystemd\nUmask:\t0000\nState:"..., 1024) = 1024
close(3) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
getpid() = 3926
getuid() = 0
getgid() = 0
sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\n\0\0\0 \0\0\0\320\210s\352\376\177\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
sendto(3, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32, MSG_NOSIGNAL, NULL, 0) = 32
recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\10\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
pidfd_send_signal(4, SIGRT_5, NULL, 0) = 0
poll([{fd=4, events=POLLIN}], 1, 60000) = 1 ([{fd=4, revents=POLLIN}])
setsockopt(3, SOL_SOCKET, SO_RCVTIMEO_OLD, "<\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
recvfrom(3, "", 264, 0, NULL, NULL) = 0
close(3) = 0
close(4) = 0
exit_group(0) = ?
+++ exited with 0 +++
I just ran out of time. Maybe the above is useful for someone else to
figure the issue out.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/pkg-lxc-devel/attachments/20200917/181e90b6/attachment.sig>
More information about the Pkg-lxc-devel
mailing list