[Pkg-exim4-users] mail system near-death experience

Ross Boylan ross at biostat.ucsf.edu
Fri Jun 13 21:56:26 UTC 2008


For reasons that are still unclear to me, my mail system stopped working
early this morning.  I think rebooting the system has cured the problem.
I'm reporting it here in case anyone else runs into something similar,
and in case anyone has any ideas what the problem was.  I'm running
exim4-daemon-heavy 4.69-5+b1.  Actually, there seem to have been several
different failures; I'm not sure how they are related.

Mail delivery started malfunctioning at least as early as 1am; this was
shortly after an invoke-rc.d exim4 stop and start.

At 3am this morning, and continuously there after, I started seeing
2008-06-13 03:03:31 Connection from [127.0.0.1] refused: too many
connections
in /var/log/exim4/mainlog.  Sometimes the messages involved remote IPs.
I restarted exim, and that seemed to fix that problem (but not
others--see close to the bottom).  The immediately preceding entries
were
2008-06-13 02:55:28 1K760D-0005xE-Ap <= hershel_sniderqs at att.net
H=localhost (corn.betterworld.us) [127.0.0.1] U=fetchmail P=esmtp S=3131
id=20080613095328.8367D52CFC2 at cmsfwd02.mx.net
2008-06-13 02:55:32 1K760D-0005xE-Ap => ross+junk.spamlo
<ross at localhost> R=imap_user T=cyrus
2008-06-13 02:55:32 1K760D-0005xE-Ap Completed
2008-06-13 02:58:40 Start queue run: pid=22924
2008-06-13 02:58:40 End queue run: pid=22924

This is quite odd, because I wasn't doing anything at 3am, and none of
the software upgraded the previous day had anything to do with exim.

There were a lot of messages like these beforehand:
2008-06-13 01:38:52 1K74m5-0005Qq-Ck spam acl condition: error reading
from spamd socket: Connection timed out
2008-06-13 01:43:37 1K74qk-0005RO-H7 spam acl condition: error reading
from spamd socket: Connection timed out
2008-06-13 01:46:09 1K74tA-0005Rw-3J spam acl condition: error reading
from spamd socket: Connection timed out
2008-06-13 01:52:15 1K74z9-0005Sw-2u spam acl condition: error reading
from spamd socket: Connection timed out
These started at 2008-06-12 15:38:00.  I usually see these errors a few
times around 3am, but the last set was on 2006-06-08.

My backup programs ran into trouble sending mail as early as 1am:
13-Jun 01:07 corn-dir JobId 994: Error: message.c:493 Mail program
terminated in error.
CMD=/usr/lib/bacula/bsmtp -h localhost -f "(Bacula) <root at localhost>" -s
"Bacula: Backup OK of corn-fd Incremental" root at localhost
ERR=Child died from signal 15: Termination
However, fetchmail continued to deliver until
Jun 13 03:03:40 corn fetchmail[6433]: SMTP connect to localhost/26
failed 
Jun 13 03:03:45 corn fetchmail[6433]: SMTP transaction error while
fetching from ross at pop.rawbw.com and delivering to SMTP host
localhost/26 
Jun 13 03:03:49 corn fetchmail[6433]: Query status=10 (SMTP) 
(I dedicate port 26 for deliveries from fetchmail).

Key events:
00:50  daemons, including exim, shutdown, snapshots made, daemon restart
01:05  backups begin
02:18  backups end
03:00 cleanup backup job starts; shuts down all snapshots

Additionally, I think there is some other system process that runs
somewhere 2-3:30am, because I have characteristically seen "connection
timed out" to spamd and locking problems (from cyrus) about that time.
Until very recently, my backups were still running at that time.  The
recent logs show the Cyrus squatter beginning to index mailboxes at 1:17
until after 4am (maybe well after, since that was debian-user).

I noticed even after restarting exim that I could not send mail, whether
via sendmail or SMTP to localhost.  These failures didn't generate much
logging: the exim logs didn't seem to indicate even attempted contact
and .xsession-errors (I was running evolution) didn't seem to have
anything relevant.  I tried
$ mail -v ross at localhost
Subject: test
total bust?
Cc:
^C
exim: SIGINT received - message abandoned
LOG: MAIN
  SIGINT received while reading local message
Can't send mail: sendmail process failed with error code 1

This was on a stock Debian 2.6.24 kernl, using LVM for the filesystem.
bacula does the backups; since it compresses in software, it can use a
fair amount of CPU.  It also hits the disk pretty hard, especially in
the cyrus mail store.  The latter is so slow to traverse that it doesn't
use much CPU while it's doing so.  I recently added a patch to speed up
bacula's disk traversal; it is in an LD_PRELOAD that only bacula gets.

My configuration deliberately stalls during the SMTP exchanges as an
anti-spam measure; however, I seem to be delaying even connections from
localhost as well.

All very weird.  I hope I've seen the last of it.

Ross




More information about the Pkg-exim4-users mailing list