[pkg-lxc-devel] Bug#961584: Bug#961584: Bug#961584: lxc-stop fails with exit code 1

Antonio Terceiro terceiro at debian.org
Thu Sep 17 13:57:47 BST 2020


Hi,

On Wed, Sep 16, 2020 at 11:14:58PM +0200, Pierre-Elliott Bécue wrote:
> Le mercredi 16 septembre 2020 à 22:08:24+0200, Pierre-Elliott Bécue a écrit :
> > Le vendredi 11 septembre 2020 à 11:12:25+0200, Iñaki Malerba a écrit :
> > > Hi Pebs,
> > > 
> > > Thanks for checking this.
> > > 
> > > On Sat, 5 Sep 2020 23:23:30 +0200 Pierre-Elliott =?utf-8?B?QsOpY3Vl?=
> > > <peb at debian.org> wrote:>
> > > > LXC's devs told me that 4.0.4 should solve it. I'm uploading this
> > > > release now. Please don't hesitate to tell me if it helps.
> > > 
> > > Run a pipeline removing the pinning of lxc, and the behaviour seems to
> > > be the same.
> > > 
> > > Image building:
> > > https://salsa.debian.org/ina/pipeline/-/jobs/990332
> > > > Setting up lxc (1:4.0.4-1) ..
> > > 
> > > Running lxc:
> > > https://salsa.debian.org/ina/pipeline/-/jobs/990352
> > > > <VirtSubproc>: failure: ['sudo', 'timeout', '600', 'lxc-stop',
> > > '--quiet', '--kill', '--name', 'ci-254-b2fcad5f'] failed (exit status 1,
> > > stderr '')
> > > 
> > > Please let me know if you want us to test something else.
> > > 
> > > Abrazos,
> > 
> > Could you get me a full trace like the previous time? I have no
> > technical  means of running proper tests currently, sorry. :/
> > 
> > Cheers!
> 
> I found a way to run tests on my own.
> 
> Turns out I tried to add a lxc-attach autopkgtest-stable-amd64 -- ps
> auxf to see the process tree in case I could find something useful,
> and… the container successfully stopped that time. I retried and it kept
> working.
> 
> The process tree I see is:
> ─( 23:09:35 )─< /home/becue/tmp >───────────────────────────────────────────────[ 0 ]─
> root at dawaj # docker run --rm --privileged -i autopkgtest
> Starting LXC network bridge: :Starting LXC autoboot containers: :USER
> PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root           4  0.0  0.0   7644  2760 ?        R    21:09   0:00 ps auxf
> root           1  0.0  0.0  20904  7492 ?        Ds   21:09   0:00 /sbin/init
> ok
> 
> After some more tests, it seems that lxc-start && lxc-stop isn't working
> properly because the signal is sent before the container is ready to
> handle it.
> 
> After this test I decided to add a sleep 2 before the lxc-attach ... --
> ps command:
> 
> Starting LXC network bridge: :Starting LXC autoboot containers: :USER
> PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
> root          52  0.0  0.0   7644  2768 ?        R    21:10   0:00 ps auxf
> root           1  2.5  0.1  21524  9596 ?        Ss   21:10   0:00 /sbin/init
> root          17  0.5  0.1  27444  8404 ?        Ss   21:10   0:00 /lib/systemd/systemd-journald
> root          27  0.0  0.0   2348  1772 ?        Ss   21:10   0:00 /sbin/ifup -a --read-environment
> root          42  0.0  0.0   2392   764 ?        S    21:10   0:00  \_ /bin/sh -c /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases eth0 .
> root          43  0.0  0.0   8456  1936 ?        S    21:10   0:00 \_ /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases eth0
> root          44  0.0  0.0   9492  5644 ?        S    21:10   0:00 \_ /sbin/dhclient -4 -v -i -pf /run/dhclient.eth0.pid -lf /var/lib/dhcp/dhclient.eth0.leases -I -df /var/lib/dhcp/dhclient6.eth0.leases eth0
> message+      50  0.0  0.0   8696  3636 ?        Ss   21:10   0:00 /usr/bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation --syslog-only
> root          51  0.5  0.0  19308  6376 ?        Ss   21:10   0:00 /lib/systemd/systemd-logind
> ok
> 
> Turns out your lxc-stop is really fast, and therefore, not catched
> properly by LXC.
> 
> While I appreciate it shouldn't be a corner case that makes things
> explode, do you think there's a way to take this realization into
> account to lower the severity of this bug, having a temporary fix set up
> in place?
> 
> I'll still try to see what upstream could offer to handle this in a
> better way.

I did a little bit of debugging on this today. I think the provided
Dockerfile does not really reproduce the real issue. The lxc-stop call
from lxc comes after the tests are run, so plenty of time after the
lxc-start call.

I installed autopkgtest inside the container, did manually the mount
steps listed in the Dockerfile, and tried it directly. Failed as
expected:

    root at 665d38b2f9e3:~/pkg# autopkgtest -B . -- lxc autopkgtest-stable-amd64
    autopkgtest [12:41:01]: starting date: 2020-09-17
    autopkgtest [12:41:01]: version 5.14
    autopkgtest [12:41:01]: host 665d38b2f9e3; command line: /usr/bin/autopkgtest -B . -- lxc autopkgtest-stable-amd64
    autopkgtest [12:41:07]: testbed dpkg architecture: amd64
    autopkgtest [12:41:08]: testbed running kernel: Linux 5.8.0-1-amd64 #1 SMP Debian 5.8.7-1 (2020-09-05)
    autopkgtest [12:41:08]: @@@@@@@@@@@@@@@@@@@@ built-tree .
    autopkgtest [12:41:08]: test command1: preparing testbed
    autopkgtest [12:41:08]: test command1: true
    autopkgtest [12:41:08]: test command1: [-----------------------
    autopkgtest [12:41:09]: test command1: -----------------------]
    autopkgtest [12:41:09]: test command1:  - - - - - - - - - - results - - - - - - - - - -
    command1             PASS
    autopkgtest [12:41:09]: @@@@@@@@@@@@@@@@@@@@ summary
    command1             PASS
    <VirtSubproc>: failure: ['lxc-stop', '--quiet', '--kill', '--name', 'autopkgtest-lxc-jodrho'] failed (exit status 1, stderr '')
    autopkgtest [12:41:09]: ERROR: autopkgtest
    : error cleaning up:

autopkgtest [12:41:09]: ERROR: testbed failure: cannot send to testbed: [Errno 32] Broken pipe

Note however the despite lxc-stop call returning non-zero, the container
is stopped. I then tried starting the container again, and stopping it
by hand. Failed as well:

    root at 665d38b2f9e3:~/pkg# lxc-start --name autopkgtest-lxc-jodrho
    root at 665d38b2f9e3:~/pkg# lxc-attach --name autopkgtest-lxc-jodrho -- sh -c 'sleep 1 && echo HI'
    HI
    root at 665d38b2f9e3:~/pkg# lxc-stop --kill --name autopkgtest-lxc-jodrho
    lxc-stop: autopkgtest-lxc-jodrho: commands.c: lxc_cmd_stop: 775 No such file or directory - Failed to stop container "autopkgtest-lxc-jodrho"

Note however that this time I omitted the --quiet flag that autopkgtest
uses, which gives us at least some extra info. No such file of
directory.

I tried lxc-stop with the --kill flag:

    root at 665d38b2f9e3:~/pkg# lxc-start --name autopkgtest-lxc-jodrho
    root at 665d38b2f9e3:~/pkg# lxc-attach --name autopkgtest-lxc-jodrho -- sh -c 'sleep 1 && echo HI'
    HI
    root at 665d38b2f9e3:~/pkg# lxc-stop --name autopkgtest-lxc-jodrho

Works!

So this seems related to the --kill flag.

I then tried to strace the lxc-stop --kill call. By the end of the log,
there is this:

    openat(AT_FDCWD, "/lxc/autopkgtest-lxc-jodrho/partial", O_RDWR|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
    connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0
    getpid()                                = 3567
    getuid()                                = 0
    getgid()                                = 0
    sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
    recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\232\r\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
    close(3)                                = 0
    socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
    connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0
    getpid()                                = 3567
    getuid()                                = 0
    getgid()                                = 0
    sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
    recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
    close(3)                                = 0
    socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
    connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/autopkgtest-lxc-jodrho/command"}, 38) = 0
    getpid()                                = 3567
    getuid()                                = 0
    getgid()                                = 0
    sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3567, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
    recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
    close(3)                                = 0
    write(2, "lxc-stop: autopkgtest-lxc-jodrho"..., 34lxc-stop: autopkgtest-lxc-jodrho: ) = 34
    write(2, "commands.c: lxc_cmd_stop: 775 ", 30commands.c: lxc_cmd_stop: 775 ) = 30
    write(2, "No such file or directory - Fail"..., 77No such file or directory - Failed to stop container "autopkgtest-lxc-jodrho") = 77
    write(2, "\n", 1
    )                       = 1
    exit_group(1)                           = ?
    +++ exited with 1 +++

First I though that it could be related to /lxc/autopkgtest-lxc-jodrho/partial,
but strace'ing a lxc-stop call without --kill, there is a similar error, so
that's not it. For comparison, the end of the logs in that case look like this:

    openat(AT_FDCWD, "/lxc/test/partial", O_RDWR|O_CLOEXEC) = -1 ENOENT (No such file or directory)
    socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
    connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
    getpid()                                = 3926
    getuid()                                = 0
    getgid()                                = 0
    sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
    recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\f\17\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
    close(3)                                = 0
    socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
    connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
    getpid()                                = 3926
    getuid()                                = 0
    getgid()                                = 0
    sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
    recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
    close(3)                                = 0
    socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
    connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
    getpid()                                = 3926
    getuid()                                = 0
    getgid()                                = 0
    sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\3\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
    recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\2\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
    close(3)                                = 0
    socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
    connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
    getpid()                                = 3926
    getuid()                                = 0
    getgid()                                = 0
    sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\22\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
    recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[4]}], msg_controllen=24, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
    close(3)                                = 0
    socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
    connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
    getpid()                                = 3926
    getuid()                                = 0
    getgid()                                = 0
    sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
    recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\0\0\0\0\0\0\0\f\17\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
    close(3)                                = 0
    openat(AT_FDCWD, "/proc/3852/status", O_RDONLY|O_CLOEXEC) = 3
    fstat(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
    read(3, "Name:\tsystemd\nUmask:\t0000\nState:"..., 1024) = 1024
    close(3)                                = 0
    socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 3
    connect(3, {sa_family=AF_UNIX, sun_path=@"/lxc/test/command"}, 20) = 0
    getpid()                                = 3926
    getuid()                                = 0
    getgid()                                = 0
    sendmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\n\0\0\0 \0\0\0\320\210s\352\376\177\0\0", iov_len=16}], msg_iovlen=1, msg_control=[{cmsg_len=28, cmsg_level=SOL_SOCKET, cmsg_type=SCM_CREDENTIALS, cmsg_data={pid=3926, uid=0, gid=0}}], msg_controllen=32, msg_flags=0}, MSG_NOSIGNAL) = 16
    sendto(3, "\1\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 32, MSG_NOSIGNAL, NULL, 0) = 32
    recvmsg(3, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\10\0\0\0\0\0\0\0\10\0\0\0\0\0\0\0", iov_len=16}], msg_iovlen=1, msg_controllen=0, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 16
    pidfd_send_signal(4, SIGRT_5, NULL, 0)  = 0
    poll([{fd=4, events=POLLIN}], 1, 60000) = 1 ([{fd=4, revents=POLLIN}])
    setsockopt(3, SOL_SOCKET, SO_RCVTIMEO_OLD, "<\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", 16) = 0
    recvfrom(3, "", 264, 0, NULL, NULL)     = 0
    close(3)                                = 0
    close(4)                                = 0
    exit_group(0)                           = ?
    +++ exited with 0 +++

I just ran out of time. Maybe the above is useful for someone else to
figure the issue out.
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/pkg-lxc-devel/attachments/20200917/181e90b6/attachment.sig>


More information about the Pkg-lxc-devel mailing list