[Pkg-samba-maint] Bug#801690: Bug#801690: 'smbstatus -b' leads to broken ctdb cluster

Mon Nov 2 06:24:29 UTC 2015

2015-10-13 15:44 GMT+02:00 Adi Kriegisch <adi at kriegisch.at>:
> Package: ctdb
> Version: 2.5.4+debian0-4
>
> Dear maintainers,

Hello Adi,

Sorry for my late reply.

> I recently upgraded a samba cluster from Wheezy (with Kernel, ctdb, samba
> and glusterfs from backports) to Jessie. The cluster itself is way older
> and basically always worked. Since the upgrade to Jessie 'smbstatus -b'
> (almost always) just hangs the whole cluster; I need to interrupt the call
> with ctrl+c (or run with 'timeout 2') to avoid a complete cluster lockup
> leading to the other cluster nodes being banned and the node I run smbstatus
> on to have ctdbd run at 100% load but not being able to recover.

How do you recover then? By killing ctdbd (SIGKILL)?

> The cluster itself consists of three nodes sharing three cluster ips. The
> only service ctdb manages is Samba. The lock file is located on a mirrored
> glusterfs volume.
>
> running and interrupting the hanging smbstatus leads to the following log
> messages in /var/log/ctdb/log.ctdb:
>   | 2015/10/13 15:09:24.923002 [19378]: Starting traverse on DB
>   |                  smbXsrv_session_global.tdb (id 2592646)
>   | 2015/10/13 15:09:25.505302 [19378]: server/ctdb_traverse.c:644 Traverse
>   |                  cancelled by client disconnect for database:0x6b06a26d
>   | 2015/10/13 15:09:25.505492 [19378]: Could not find idr:2592646
>   | [...]
>   | 2015/10/13 15:09:25.507553 [19378]: Could not find idr:2592646
>
> 'ctdb getdbmap' lists that database, but also lists a second entry for
> smbXsrv_session_global.tdb:
>   | dbid:0x521b7544 name:smbXsrv_version_global.tdb path:/var/lib/ctdb/smbXsrv_version_global.tdb.0
>   | dbid:0x6b06a26d name:smbXsrv_session_global.tdb path:/var/lib/ctdb/smbXsrv_session_global.tdb.0
> (I have no idea if that has always been the case or if that happened after
> the upgrade).
>
> Calling 'smbstatus --locks' and 'smbstatus --shares' works just fine.

Have you checked which of --processes and --notify hangs? Does it also
hang with "-b --fast"?
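
Something along these lines could narrow it down. This is only a sketch: the
helper name `probe` is made up for illustration, the 2 second limit is taken
from your 'timeout 2' workaround, and it assumes the coreutils `timeout`
command, which exits with status 124 when the limit is hit.

```shell
# Run a command with a time limit and report whether it finished or hung.
# Usage: probe SECONDS COMMAND [ARGS...]
probe() {
    limit=$1
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "hang"            # timeout(1) killed the command
    elif [ "$rc" -eq 0 ]; then
        echo "ok"
    else
        echo "failed($rc)"     # the command itself returned an error
    fi
}

# Try each mode separately, but only if smbstatus is installed.
if command -v smbstatus >/dev/null 2>&1; then
    for opts in "--processes" "--notify" "-b" "-b --fast"; do
        # $opts is intentionally unquoted so "-b --fast" splits into two arguments
        printf '%-12s %s\n' "$opts" "$(probe 2 smbstatus $opts)"
    done
fi
```

Since a hanging call can disturb the cluster, it is probably best to run this
on one node only and to check 'ctdb status' between runs.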

> 'strace'ing ctdbd leads to a massive amount of these messages:
>   | write(58,"\240\4\0\0BDTC\1\0\0\0\215U\336\25\5\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"...,
>   |                          1184) = -1 EAGAIN (Resource temporarily unavailable)

fd 58 is probably the ctdb socket. Can you confirm?
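
One way to check, assuming a Linux /proc filesystem and root access (the
helper name `fd_target` is made up for illustration):

```shell
# Print what file descriptor FD of process PID points to.
# Usage: fd_target PID FD
fd_target() {
    readlink "/proc/$1/fd/$2"
}

# Example on the affected node (58 is the fd from the strace output):
# fd_target "$(pidof ctdbd)" 58
```

A socket shows up as "socket:[inode]". Something like 'ss -x -p' should then
help to match that inode to a Unix socket and its owner.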

To get more useful information, can you install gdb, ctdb-dbg and samba-dbg
and send a stack trace of ctdbd taken while it is stuck in that write?
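
A rough sketch of how to grab it non-interactively, to be run as root while
ctdbd is spinning. The helper name `dump_backtrace` and the GDB override are
only illustrative. As ctdbd seems to be looping on that write, attaching at
any moment will most likely land in or near it, though that is an assumption.

```shell
# Dump backtraces of all threads of a running process, non-interactively.
# Usage: dump_backtrace PID
# The gdb binary can be overridden through the GDB environment variable.
dump_backtrace() {
    "${GDB:-gdb}" -p "$1" -batch -ex "thread apply all bt"
}

# Example on the affected node:
# dump_backtrace "$(pidof ctdbd)" > ctdbd-backtrace.txt
```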

> Running 'ctdb_diagnostics' is only possible shortly after  the cluster is
> started (ie. while smbstatus -b works) and yields the following messages:
>   | ERROR[1]: /etc/krb5.conf is missing on node 0
>   | ERROR[2]: File /etc/hosts is different on node 1
>   | ERROR[3]: File /etc/hosts is different on node 2
>   | ERROR[4]: File /etc/samba/smb.conf is different on node 1
>   | ERROR[5]: File /etc/samba/smb.conf is different on node 2
>   | ERROR[6]: File /etc/fstab is different on node 1
>   | ERROR[7]: File /etc/fstab is different on node 2
>   | ERROR[8]: /etc/multipath.conf is missing on node 0
>   | ERROR[9]: /etc/pam.d/system-auth is missing on node 0
>   | ERROR[10]: /etc/default/nfs is missing on node 0
>   | ERROR[11]: /etc/exports is missing on node 0
>   | ERROR[12]: /etc/vsftpd/vsftpd.conf is missing on node 0
>   | ERROR[13]: Optional file /etc/ctdb/static-routes is not present on node 0
> '/etc/hosts' differs in some newlines and comments while 'smb.conf' only
> has some different log levels on the nodes. The rest of the messages does
> not affect ctdb as it only manages samba.

Yes. Nothing relevant here.

> Feel free to ask if you need any more information.

Regards

-- 
Mathieu