Bug#644832: boinc-client: WUs crash with stack overflow almost every night

Ove Kaaven ovek at arcticnet.no
Sun Oct 9 14:40:55 UTC 2011

Package: boinc-client
Version: 6.13.1+dfsg-2
Severity: normal

I hope someone can help me track this down. Relevant information about this is:

- I use World Community Grid.

- I have an Intel Core i7 2.8GHz (i.e. 4 cores, each with hyperthreading,
so I can run 8 WUs in parallel)

- I've symlinked /var/lib/boinc-client into a directory of a 1.8-Terabyte
ext4 filesystem. My backup system (backuppc), running on the same machine,
also does backups (of my home network) into that same filesystem (in a
different directory).

- The occurrence of this bug *appears* to coincide with significant I/O
load on the system - in particular, all the boinc WUs appear to crash
(simultaneously) roughly 10-15 minutes after the start of backuppc's nightly
pool cleanup (BackupPC_nightly). (A while ago I also often saw the WUs crash
when using aptitude to upgrade stuff, but this no longer seems to trigger
it; perhaps a kernel upgrade or something made a difference there.)

- Obviously, boinc keeps my CPU permanently at its thermal limit, with my
syslog full of MCE messages about automatic throttling. I won't rule out
a hardware failure, but if that were the cause, I'd expect to see other
things fail as well (which doesn't seem to happen), or the crashes to
happen less predictably than they do. If the problem isn't in userspace,
it'd seem more likely to be an ext4 bug or something.

Anyway, recently, after starting to see a pattern, I tried to attach
strace to some of the processes before the nightly backup thing started.
It showed sudden SIGSEGVs without anything extraordinary before them,
so the next night I tried to attach gdb to a process and wait for it
to crash. When it did, I saw that the stack pointer (%esp) was outside
the bounds of what appeared to be a 16K thread stack, so it looked like
the stack had overflowed. But since the WCG applications don't have debug
symbols, it wasn't clear why.
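
The "is %esp still inside the thread stack?" check described above can
also be done mechanically by comparing the stack pointer against the
process's memory mappings. The following is only a minimal sketch, not
what was actually run: it assumes the mappings were copied from
/proc/<pid>/maps while the process was stopped in gdb, and the sample
mappings, the application path and the addresses are all made up for
illustration.

```python
def parse_maps(maps_text):
    """Parse text in /proc/<pid>/maps format into (start, end, perms, name) tuples."""
    regions = []
    for line in maps_text.splitlines():
        # Fields: address range, perms, offset, device, inode, optional pathname.
        fields = line.split(None, 5)
        if len(fields) < 2:
            continue
        start_s, end_s = fields[0].split("-")
        name = fields[5].strip() if len(fields) > 5 else ""
        regions.append((int(start_s, 16), int(end_s, 16), fields[1], name))
    return regions


def find_region(regions, addr):
    """Return the mapping that contains addr, or None if addr is unmapped."""
    for region in regions:
        start, end, _perms, _name = region
        if start <= addr < end:
            return region
    return None


# Hypothetical mappings: the anonymous rw-p region is 0x4000 bytes (16K),
# standing in for the small thread stack seen in gdb.
SAMPLE_MAPS = """\
08048000-08100000 r-xp 00000000 08:01 123456 /hypothetical/wcg_app
b7400000-b7404000 rw-p 00000000 00:00 0
bffdf000-c0000000 rw-p 00000000 00:00 0 [stack]
"""

regions = parse_maps(SAMPLE_MAPS)

# A stack pointer inside the 16K region is fine ...
inside = find_region(regions, 0xB7401F00)
# ... while one just below its lowest address has run off the end of the
# stack (stacks grow downwards on x86).
below = find_region(regions, 0xB73FFFF0)

print("inside:", inside)
print("below:", below)
```

If the address from `info registers` falls in no mapping, or in a guard
page just below the stack region, that is consistent with an overflow
rather than a stray pointer somewhere else.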

It'd be interesting to try to increase the stack size, but I'm not sure
how to tell boinc to do that. Besides, since the crash happens in all
running WUs simultaneously, regardless of application (they wouldn't
all use the stack in the exact same way, would they?), perhaps it
wouldn't help much. Perhaps there is something like an infinite
recursion problem common to all WCG applications, though?
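
One generic way to experiment with the stack size, independent of boinc
itself, is to raise the RLIMIT_STACK soft limit in the process that
launches the client, since resource limits are inherited by child
processes. The sketch below is only an illustration under assumptions:
`run_with_stack_limit` is a hypothetical helper, not a boinc option, and
it may well have no effect here, because a 16K thread stack suggests the
applications request an explicit stack size for their threads, which
this limit does not override.

```python
import resource
import subprocess


def run_with_stack_limit(cmd, soft_bytes):
    """Run cmd with the RLIMIT_STACK soft limit set to soft_bytes.

    The limit is applied in the child just before exec, so it is
    inherited by everything the command starts. Returns the command's
    standard output as text.
    """
    def set_limit():
        _soft, hard = resource.getrlimit(resource.RLIMIT_STACK)
        soft = soft_bytes
        # An unprivileged process cannot raise the soft limit above the hard one.
        if hard != resource.RLIM_INFINITY:
            soft = min(soft, hard)
        resource.setrlimit(resource.RLIMIT_STACK, (soft, hard))

    result = subprocess.run(
        cmd,
        preexec_fn=set_limit,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout


# Example: show the limit (in kilobytes) that a child shell would see.
print(run_with_stack_limit(["sh", "-c", "ulimit -s"], 4 * 1024 * 1024).strip())
```

The same effect can be had with `ulimit -s` in the shell that starts the
client; whether the WCG applications honour it is something that would
have to be tested.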

Any ideas on how to proceed?

-- Package-specific info:
-- Contents of /etc/default/boinc-client:
# This file is /etc/default/boinc-client, it is a configuration file for the
# /etc/init.d/boinc-client init script.

# Set this to 1 to enable and to 0 to disable the init script.

# Set this to 1 to enable advanced scheduling of the BOINC core client and
# all its sub-processes (reduces the impact of BOINC on the system's
# performance).

# The BOINC core client will be started with the permissions of this user.

# This is the data directory of the BOINC core client.

# This is the location of the BOINC core client, that the init script uses.
# If you do not want to use the client program provided by the boinc-client
# package, you can specify here an alternative client program.

# Here you can specify additional options to pass to the BOINC core client.
# Type 'boinc --help' or 'man boinc' for a full summary of allowed options.

-- System Information:
Debian Release: wheezy/sid
  APT prefers testing
  APT policy: (900, 'testing'), (600, 'stable'), (1, 'unstable')
Architecture: i386 (i686)

Kernel: Linux 3.0.0-1-686-pae (SMP w/8 CPU cores)
Locale: LANG=nb_NO.utf8, LC_CTYPE=nb_NO.utf8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages boinc-client depends on:
ii  adduser                3.113           
ii  ca-certificates        20110502+nmu1   
ii  debconf [debconf-2.0]  1.5.40          
ii  libc6                  2.13-21         
ii  libcurl3               7.21.7-3        
ii  libgcc1                1:4.6.1-4       
ii  libssl1.0.0            1.0.0e-2        
ii  libstdc++6             4.6.1-4         
ii  python                 2.6.7-3         
ii  zlib1g                 1:

boinc-client recommends no packages.

Versions of packages boinc-client suggests:
ii  boinc-app-seti     <none>          
ii  boinc-manager      6.12.33+dfsg-1.1
ii  x11-xserver-utils  7.6+3           

-- Configuration Files:
/etc/boinc-client/global_prefs_override.xml changed:

/etc/boinc-client/gui_rpc_auth.cfg [Errno 13] Permission denied: u'/etc/boinc-client/gui_rpc_auth.cfg'
/etc/boinc-client/remote_hosts.cfg changed:

-- debconf information:
  boinc-client/remove_boinc_dir: false
