[Pkg-openldap-devel] Bug#549642: slapd: suddenly stops working, and starts using 100% CPU

Mon Oct 5 07:33:20 UTC 2009

Package: slapd
Version: 2.4.11-1
Severity: important

Hello,

I've run into a problem with slapd and have tried to file a bug through OpenLDAP's bug system: http://www.openldap.org/its/index.cgi?findid=6322 . As you can see, I was asked to upgrade to a newer version of OpenLDAP, which isn't much of a solution for us, as we'd like to continue using official Debian packages.

I'm hoping there are already patches for this that can be applied to the current package (I have not found anything by searching for this problem, though). If there are no patches or ideas about what's causing this, I would be happy to try debugging it, but as I explain later in this bug report, running slapd with the args/trace loglevels or running it under gdb is something we want to avoid. Of course, this problem doesn't seem to go away by itself, so at some point something has to be done.

The bug report, originally written for OpenLDAP's issue tracking system:

In the last two days, slapd has randomly stopped working on our server, about
once every two or three hours. Before this, slapd had been running without
any problems for about two months. slapd.conf has not been touched in that time.

What happens is that slapd stops answering queries (postfix, dovecot, NSS etc.
start getting timeouts), and it starts using exactly 100% of one core until we
restart it.
After a restart, everything works again (until the next occurrence).

It seems to only happen when slapd is under "stress" (we have an 8-core system
with 8G RAM, and so slapd is really never under much "stress"; the biggest CPU
load I've seen is about 40% on one core). The last ldap search before slapd
stops working is always the same:
Oct  2 10:57:38 oi-mail slapd[24550]: conn=1543 op=1 SRCH
base="ou=groups,dc=web,dc=xxxx,dc=net" scope=1 deref=0
filter="(&(objectClass=posixGroup))"
.. which should return about 1800 entries. Different connections might query
this simultaneously, and it's a query that happens every ~10 seconds on
average.

I've tried to reproduce this by doing ..
# while [ 1 ]; do ldapsearch -x -H ldaps://.. -b
'ou=groups,dc=web,dc=xxxx,dc=net' '(&(objectClass=posixAccount))'; done
.. for two minutes without any "luck" :). slapd just uses a bit more CPU, but
never stops working.

With "loglevel stats stats2", I get nothing out of the logs:
Oct  2 08:42:15 oi-mail slapd[15836]: conn=34798 op=1 ENTRY
dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net"
... time of "crash" (we're mid-search at this point); nothing gets logged until
we do a restart: ... 
Oct  2 08:42:30 oi-mail slapd[15836]: daemon: shutdown requested and initiated.
Oct  2 08:42:30 oi-mail slapd[15836]: conn=34391 fd=31 closed (slapd shutdown)
... many entries like the one above, and then I seem to get the next
entry from the same connection, right before it closes: ...
Oct  2 08:42:30 oi-mail slapd[15836]: conn=34798 op=1 ENTRY
dn="cn=xxxx,ou=groups,dc=web,dc=xxxx,dc=net"
Oct  2 08:42:30 oi-mail slapd[15836]: conn=34798 fd=81 closed (slapd shutdown)

I've tried logging more than just stats, but the interesting loglevels (trace
and args) also cause a big performance hit, and I'm not really sure which of
the other loglevels could be helpful.
This is a live ldap server, on a server that also hosts other services, so I
simply can't take the performance hit.

I also did a strace -p <slapd pid>, which only gives "futex(0x41c499e0,
FUTEX_WAIT," until I restart slapd, but the "restarting routine" could be
useful, so here it is: http://www.pastie.org/639170
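
Note that strace -p on the main PID follows only that one thread, so the futex
wait above may just be the main thread idling while a worker thread spins. A
one-off snapshot taken while slapd is already hung would avoid the continuous
overhead of the args/trace loglevels. The sketch below is an assumption about
how that could be done and was not run for this report: list_threads and
snapshot_hung are hypothetical helper names, the output prefix is arbitrary, it
relies on Linux /proc, and it needs strace, gdb and timeout to be installed.
Backtraces are only really readable if debug symbols are available.

```shell
#!/bin/sh
# List the thread IDs of a process by reading /proc (Linux only).
list_threads() {
    for t in /proc/"$1"/task/*; do
        if [ -d "$t" ]; then
            basename "$t"
        fi
    done
}

# Hypothetical one-shot snapshot of a hung process; run as root.
snapshot_hung() {
    pid=$1
    out=${2:-/tmp/slapd-hang}   # arbitrary output prefix
    # Scheduler state of every thread (R = running, S = sleeping);
    # the spinning thread should be the one in state R.
    for tid in $(list_threads "$pid"); do
        printf '%s ' "$tid"
        grep '^State:' "/proc/$pid/task/$tid/status"
    done > "$out.threads"
    # Follow all threads (-f) for 10 seconds, then detach.
    timeout 10 strace -f -p "$pid" -o "$out.strace"
    # One backtrace per thread, then detach again.
    gdb -p "$pid" -batch -ex 'thread apply all bt' > "$out.bt" 2>&1
}

# Example: snapshot_hung "$(pidof slapd)"
```

gdb stops the process while it is attached, which should not matter much once
slapd has already stopped answering.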

I understand that this probably isn't enough data to see what the problem is,
and so I'm hoping to get some help in debugging this properly. As I've said,
the args/trace loglevels take too much of a performance hit, but if this is the
only way, we might reconsider.

Interesting parts of slapd.conf below.
We use slapd.conf instead of cn=config, everything goes over TLS, we're AFAIK
only using LDAPv3 binding, and we only have one database entry in slapd.conf.

moduleload  back_hdb
sizelimit 99999
tool-threads 8
threads 16
concurrency 32
backend     hdb 
database        hdb
cachesize 60000
dbconfig set_cachesize 0 52428800 0
dbconfig set_lk_max_objects 1500
dbconfig set_lk_max_locks 1500
dbconfig set_lk_max_lockers 1500
lastmod         on
checkpoint      512 30

Best regards,
Helge Milde

-- System Information:
Debian Release: 5.0.2
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.26-2-amd64 (SMP w/8 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages slapd depends on:
ii  adduser           3.110                  add and remove users and groups
ii  coreutils         6.10-6                 The GNU core utilities
ii  debconf [debconf- 1.5.24                 Debian configuration management sy
ii  libc6             2.7-18                 GNU C Library: Shared libraries
ii  libdb4.2          4.2.52+dfsg-5          Berkeley v4.2 Database Libraries [
ii  libgnutls26       2.4.2-6+lenny1         the GNU TLS library - runtime libr
ii  libldap-2.4-2     2.4.11-1               OpenLDAP libraries
ii  libltdl3          1.5.26-4               A system independent dlopen wrappe
ii  libperl5.10       5.10.0-19              Shared Perl library
ii  libsasl2-2        2.1.22.dfsg1-23+lenny1 Cyrus SASL - authentication abstra
ii  libslp1           1.2.1-7.5              OpenSLP libraries
ii  libwrap0          7.6.q-16               Wietse Venema's TCP wrappers libra
ii  perl [libmime-bas 5.10.0-19              Larry Wall's Practical Extraction 
ii  psmisc            22.6-1                 Utilities that use the proc filesy
ii  unixodbc          2.2.11-16              ODBC tools libraries

Versions of packages slapd recommends:
pn  libsasl2-modules              <none>     (no description available)

Versions of packages slapd suggests:
ii  ldap-utils                    2.4.11-1   OpenLDAP utilities

-- debconf-show failed