Bug#377857: occasional SIGSEGV in exim4 under heavy load

Tue Jul 11 17:19:38 UTC 2006

Package: exim4-daemon-heavy
Version: 4.50-8sarge2

Under heavy SMTP load, we occasionally observe the exim4
daemon crashing (with the result that no further
connections can be accepted, obviously). We can reproduce
this here with the `postal' SMTP benchmark (package
postal) and the following command line:

    postal -p 10 -c 10 -m 1 localhost users -

(run on the same host as exim4). The file `users'
contains a single line with an email address; in this case
I used a local address aliased to /dev/null in
/etc/aliases. While running these tests I had the whole of
/var/spool/exim4 mounted on a tmpfs (to simulate a
hardware configuration with very fast disks); since the bug
is likely timing-related you may have to do the same to
reproduce it. Here the crash typically occurs within the
first five minutes of postal's run.

Here's a stack trace from the exim4 daemon when it
crashes:

    #0  0x00000000 in ?? ()
    #1  0x40361825 in __pthread_sighandler () from /lib/libpthread.so.0
    #2  <signal handler called>
    #3  0x403d05d9 in __libc_sigaction () from /lib/libc.so.6
    #4  0x4035e828 in sigaction () from /lib/libpthread.so.0
    #5  0x080866c5 in os_non_restarting_signal (sig=17, handler=0x805c930 <main_sigchld_handler>) at os.c:267
    #6  0x0805e9f3 in daemon_go () at daemon.c:1842
    #7  0x0806e06b in main (argc=3, cargv=0xbfffdbc4) at exim.c:3922

-- for some reason the distributed binaries are built
without debugging symbols, so I had to rebuild the package
to get this trace. Also, there's some horrific ugliness
going on with os.c in the distribution
(``#include "../src/os.c"''!) which confuses gdb about
which os.c is which; to get a working debug build I had to
concatenate the generated build-exim4-daemon-heavy/os.c
and the one in src/. Hence the line numbers for os.c in
the backtrace don't correspond directly to anything in the
exim4 source package.

Looking at the backtrace, it appears that what's happened
is that a signal (presumably SIGCHLD) has arrived while
os_non_restarting_signal is running. The SIGCHLD handler
itself calls os_non_restarting_signal, and a crash
results. I'm not sure why, though -- there's nothing in
the code for that function that's obviously nonreentrant
(it only uses automatic variables and calls sigaction(2),
which is async-signal-safe).

Note that exim in this case is linked against -lpthread,
presumably because of -lpq, and it's possible that this
alters the behaviour of sigaction (frames #1 and #4 of the
backtrace are in libpthread). I haven't had an opportunity
to check whether the -light version of the daemon has the
same problem.

The following patch to src/os.c blocks the signal for
which a handler is being installed for the duration of the
call to sigaction. It appears to fix the problem, which is
at least consistent with the above hypothesis, though it's
not a great fix.

--- os.c.orig   2006-07-11 18:02:09.000000000 +0100
+++ os.c        2006-07-11 18:05:15.000000000 +0100
@@ -261,11 +261,20 @@
 
 #ifdef SA_RESTART
 struct sigaction act;
+sigset_t mask, curmask;
+
+/* Block sig on top of the current mask while installing the handler. */
+sigemptyset(&mask);
+sigaddset(&mask, sig);
+sigprocmask(SIG_BLOCK, &mask, &curmask);
+
 act.sa_handler = handler;
 sigemptyset(&(act.sa_mask));
 act.sa_flags = 0;
 sigaction(sig, &act, NULL);
 
+sigprocmask(SIG_SETMASK, &curmask, NULL);
+
 #ifdef STAND_ALONE
 printf("Used sigaction() with flags = 0\n");
 #endif

-- 
Chris Lightfoot
mySociety