Bug#647520: boinc-client: SIGSEGV (segmentation fault) after some hours of computation
John Feuerstein
john at feurix.com
Thu Nov 3 14:34:42 UTC 2011
Package: boinc-client
Version: 6.13.10+dfsg-1
Severity: important
/usr/bin/boinc crashes (SIGSEGV) after some hours of computation. This
happens every day, so it is reproducible. What follows is a journey to
track this down:
My recent upgrades of boinc-related packages include:
Mon, Oct 24 2011 20:09:36 +0200
[UPGRADE] boinc 6.13.1+dfsg-2 -> 6.13.6+dfsg-2
[UPGRADE] boinc-client 6.13.1+dfsg-2 -> 6.13.6+dfsg-2
[UPGRADE] boinc-manager 6.13.1+dfsg-2 -> 6.13.6+dfsg-2
Tue, Nov 1 2011 10:43:55 +0100
[UPGRADE] boinc 6.13.6+dfsg-2 -> 6.13.10+dfsg-1
[UPGRADE] boinc-client 6.13.6+dfsg-2 -> 6.13.10+dfsg-1
[UPGRADE] boinc-manager 6.13.6+dfsg-2 -> 6.13.10+dfsg-1
I noticed the missing system load (and thereby the broken boinc
client) only a few days ago, so I suspect the bug was introduced by
either boinc-client version 6.13.10+dfsg-1 or 6.13.6+dfsg-2.
I'm sure this never happened with version 6.13.1+dfsg-2.
For the record, I'm currently running these projects:
- climateprediction.net
- rosetta@home
- SETI@home
I was never attached using boinc-manager or any other client, and I did
not change the configuration of any boinc-related software in the past
month. I also did not add or remove projects.
-----------------------------------------------------------
On to the bug hunting:
/var/lib/boinc-client/std{out,err}dae.txt do not contain anything
interesting leading to this crash.
So I went on and ran boinc under gdb, in the environment as created by
the init script. The following crash happened after 3 hours of
computation.
# apt-get install boinc-dbg
# sudo -u boinc -H -- sh -c 'cd ~ && gdb -q -s /usr/lib/debug/usr/bin/boinc -e /usr/bin/boinc'
Reading symbols from /usr/lib/debug/usr/bin/boinc...done.
(gdb) run --check_all_logins --redirectio --dir /var/lib/boinc-client
Starting program: /usr/bin/boinc --check_all_logins --redirectio --dir /var/lib/boinc-client
[Thread debugging using libthread_db enabled]
Program received signal SIGSEGV, Segmentation fault.
__mempcpy_sse2 () at ../sysdeps/x86_64/memcpy.S:436
436 ../sysdeps/x86_64/memcpy.S: No such file or directory.
in ../sysdeps/x86_64/memcpy.S
(gdb) bt 10
#0 __mempcpy_sse2 () at ../sysdeps/x86_64/memcpy.S:436
#1 0x00007ffff6276d66 in _IO_default_xsputn (f=0x7ffffffedb80, data=<optimized out>, n=150842) at genops.c:468
#2 0x00007ffff6249f04 in _IO_vfprintf_internal (s=0x7ffffffedb80,
format=0x4a5ec0 "<scheduler_request>\n <authenticator>%s</authenticator>\n <hostid>%d</hostid>\n <rpc_seqno>%d</rpc_seqno>\n <core_client_major_version>%d</core_client_major_version>\n <core_client_minor_ver"..., ap=0x7ffffffedca0) at vfprintf.c:1620
#3 0x00007ffff626c3b9 in __IO_vsprintf (
string=0x7ffffffeddb0 "<scheduler_request>\n <authenticator>4a31871bf59efd4895a7ca5a65402602</authenticator>\n <hostid>1171553</hostid>\n <rpc_seqno>122</rpc_seqno>\n <core_client_major_version>6</core_client_major_"...,
format=0x4a5ec0 "<scheduler_request>\n <authenticator>%s</authenticator>\n <hostid>%d</hostid>\n <rpc_seqno>%d</rpc_seqno>\n <core_client_major_version>%d</core_client_major_version>\n <core_client_minor_ver"..., args=0x7ffffffedca0) at iovsprintf.c:43
#4 0x00007ffff62532f8 in __sprintf (s=0x7fffffffefe0 "L*W53'G?R8ALXQ]?<\"TT$^2.&3REI]0W" <Address 0x7ffffffff000 out of bounds>,
format=0x7ffff7fba050 "L*W53'G?R8ALXQ]?<\"TT$^2.&3REI]0WIP,']_K^-2]C=SQF;M\"XB'\\-('\n:]'0L-?/P^1N,V-5L8R1>YXA( W at +ZJF0'.?-*CF5L0\"*IJP at 7:/H$E>M</EE#67\nI\\WA[#6?4\\WY'Z^?LUAKSY:!/J&9:_6R2TZ,ED'RM at Y\\G-]Q!DWW1!.FW#).F\n: O#_9L'[76"...) at sprintf.c:34
#5 0x000000000045e189 in trickle_up_request_message (
buf=0x7ffffffeddb0 "<scheduler_request>\n <authenticator>4a31871bf59efd4895a7ca5a65402602</authenticator>\n <hostid>1171553</hostid>\n <rpc_seqno>122</rpc_seqno>\n <core_client_major_version>6</core_client_major_"..., t=1320089068,
result_name=0x7fffffffdee0 "\\C)1*3M4 $V1D60'GS'4%\">(>+)M'=\\5#!X/Z,?AWY:2#^65>%\\.7+8\n\\7Y(+D6F:N-U*6 !3[452\\NBC62?W8J+<2<L>K_&*=!+ I9&U '9EC&F6'%SNM)Z\n$IV+P=X@?]10&:\\+(D$%099=N 1,QK)XA)M'4>6+$?$X#IO>\"^<6N3/9^\\U[*Y-(\nUW92Z18_3H2XQC"...,
msg=0x7ffff7fa9010 "<variety>year</variety>\n<wu>hadcm3n_ymgx_1900_40_007524528</wu>\n<result>hadcm3n_ymgx_1900_40_007524528_1</result>\n<ph>1</ph>\n<ts>25920</ts>\n<cp>46188</cp>\n<vr>6.07</vr>\n<ppname>\ntrickle_hadcm3n_ymgx_1"..., p=0x72dad0) at cs_trickle.cpp:195
#6 send_replicated_trickles (p=0x72dad0,
msg=0x7ffff7fa9010 "<variety>year</variety>\n<wu>hadcm3n_ymgx_1900_40_007524528</wu>\n<result>hadcm3n_ymgx_1900_40_007524528_1</result>\n<ph>1</ph>\n<ts>25920</ts>\n<cp>46188</cp>\n<vr>6.07</vr>\n<ppname>\ntrickle_hadcm3n_ymgx_1"...,
result_name=0x7fffffffdee0 "\\C)1*3M4 $V1D60'GS'4%\">(>+)M'=\\5#!X/Z,?AWY:2#^65>%\\.7+8\n\\7Y(+D6F:N-U*6 !3[452\\NBC62?W8J+<2<L>K_&*=!+ I9&U '9EC&F6'%SNM)Z\n$IV+P=X@?]10&:\\+(D$%099=N 1,QK)XA)M'4>6+$?$X#IO>\"^<6N3/9^\\U[*Y-(\nUW92Z18_3H2XQC"..., now=1320089068) at cs_trickle.cpp:202
#7 0x000000000000000d in ?? ()
#8 0x0000000000000063 in ?? ()
#9 0x00000000007240f8 in ?? ()
(More stack frames follow...)
Note: thousands of further stack frames follow, none of them with matching symbols...
(gdb) bt full 10
#0 __mempcpy_sse2 () at ../sysdeps/x86_64/memcpy.S:436
No locals.
#1 0x00007ffff6276d66 in _IO_default_xsputn (f=0x7ffffffedb80, data=<optimized out>, n=150842) at genops.c:468
count = 150842
s = 0x7ffff7fcdd4a ""
more = 150842
#2 0x00007ffff6249f04 in _IO_vfprintf_internal (s=0x7ffffffedb80,
format=0x4a5ec0 "<scheduler_request>\n <authenticator>%s</authenticator>\n <hostid>%d</hostid>\n <rpc_seqno>%d</rpc_seqno>\n <core_client_major_version>%d</core_client_major_version>\n <core_client_minor_ver"..., ap=0x7ffffffedca0) at vfprintf.c:1620
len = <optimized out>
string_malloced = 1163014950
step0_jumps = {0, -15179, -14483, -14390, -14293, -14199, -15032, -14766, -13119, -13894, -13811, -13241, -12583, -2939, -3410, -3367, -3217, -3202, -10653, -8358,
-2123, -12480, -2322, -2771, -1385, -1830, -2607, -3126, -3022, -14856}
space = 0
is_short = 0
use_outdigits = 0
step1_jumps = {0, 0, 0, 0, 0, 0, 0, 0, 0, -13894, -13811, -13241, -12583, -2939, -3410, -3367, -3217, -3202, -10653, -8358, -2123, -12480, -2322, -2771, -1385, -1830,
-2607, -3126, -3022, 0}
group = 0
prec = <optimized out>
step2_jumps = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -13811, -13241, -12583, -2939, -3410, -3367, -3217, -3202, -10653, -8358, -2123, -12480, -2322, -2771, -1385, -1830, -2607,
-3126, -3022, 0}
string = 0x2e325e245454223c <Address 0x2e325e245454223c out of bounds>
left = 0
is_long_double = 0
width = <optimized out>
step3a_jumps = {0, 0, 0, 0, 0, 0, 0, 0, 0, 0, -13341, 0, 0, 0, -3410, -3367, -3217, -3202, -10653, 0, 0, 0, 0, -2771, 0, 0, 0, 0, 0, 0}
alt = 0
showsign = 0
is_long = 0
is_char = 0
pad = <optimized out>
step3b_jumps = {0 <repeats 11 times>, -12583, 0, 0, -3410, -3367, -3217, -3202, -10653, -8358, -2123, -12480, -2322, -2771, -1385, -1830, -2607, 0, 0, 0}
step4_jumps = {0 <repeats 14 times>, -3410, -3367, -3217, -3202, -10653, -8358, -2123, -12480, -2322, -2771, -1385, -1830, -2607, 0, 0, 0}
is_negative = <optimized out>
base = 10
the_arg = {pa_wchar = 0 L'\000', pa_int = 0, pa_long_int = 0, pa_long_long_int = 0, pa_u_int = 0, pa_u_long_int = 0, pa_u_long_long_int = 0, pa_double = 0,
pa_long_double = 0, pa_string = 0x0, pa_wstring = 0x0, pa_pointer = 0x0, pa_user = 0x0}
spec = <optimized out>
_buffer = {__routine = 0, __arg = 0x0, __canceltype = 0, __prev = 0x0}
_avail = 0
thousands_sep = 0x0
grouping = 0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>
done = 496
f = <optimized out>
lead_str_end = 0x4a5ee7 "%s</authenticator>\n <hostid>%d</hostid>\n <rpc_seqno>%d</rpc_seqno>\n <core_client_major_version>%d</core_client_major_version>\n <core_client_minor_version>%d</core_client_minor_version>\n "...
work_buffer = '\000' <repeats 990 times>, "1320089068"
workstart = 0x0
workend = 0x7ffffffeda38 ""
ap_save = {{gp_offset = 16, fp_offset = 48, overflow_arg_area = 0x7ffffffedd80, reg_save_area = 0x7ffffffedcc0}}
nspecs_done = <optimized out>
save_errno = 0
readonly_format = 0
jump_table = "\001\000\000\004\000\016\000\006\000\000\a\002\000\003\t\000\005\b\b\b\b\b\b\b\b\b\000\000\000\000\000\000\000\032\000\031\000\023\023\023\000\035\000\000\f\000\000\000\000\000\000\025\000\000\000\000\022\000\r\000\000\000\000\000\000\032\000\024\017\023\023\023\n\017\034\000\v\030\027\021\026\f\000\025\033\020\000\000\022\000\r"
__PRETTY_FUNCTION__ = "_IO_vfprintf_internal"
#3 0x00007ffff626c3b9 in __IO_vsprintf (
string=0x7ffffffeddb0 "<scheduler_request>\n <authenticator>4a31871bf59efd4895a7ca5a65402602</authenticator>\n <hostid>1171553</hostid>\n <rpc_seqno>122</rpc_seqno>\n <core_client_major_version>6</core_client_major_"...,
format=0x4a5ec0 "<scheduler_request>\n <authenticator>%s</authenticator>\n <hostid>%d</hostid>\n <rpc_seqno>%d</rpc_seqno>\n <core_client_major_version>%d</core_client_major_version>\n <core_client_minor_ver"..., args=0x7ffffffedca0) at iovsprintf.c:43
sf = {_sbf = {_f = {_flags = -72515583,
_IO_read_ptr = 0x7ffffffeddb0 "<scheduler_request>\n <authenticator>4a31871bf59efd4895a7ca5a65402602</authenticator>\n <hostid>1171553</hostid>\n <rpc_seqno>122</rpc_seqno>\n <core_client_major_version>6</core_client_major_"...,
_IO_read_end = 0x7ffffffeddb0 "<scheduler_request>\n <authenticator>4a31871bf59efd4895a7ca5a65402602</authenticator>\n <hostid>1171553</hostid>\n <rpc_seqno>122</rpc_seqno>\n <core_client_major_version>6</core_client_major_"...,
_IO_read_base = 0x7ffffffeddb0 "<scheduler_request>\n <authenticator>4a31871bf59efd4895a7ca5a65402602</authenticator>\n <hostid>1171553</hostid>\n <rpc_seqno>122</rpc_seqno>\n <core_client_major_version>6</core_client_major_"...,
_IO_write_base = 0x7ffffffeddb0 "<scheduler_request>\n <authenticator>4a31871bf59efd4895a7ca5a65402602</authenticator>\n <hostid>1171553</hostid>\n <rpc_seqno>122</rpc_seqno>\n <core_client_major_version>6</core_client_major_"...,
_IO_write_ptr = 0x7ffffffedfa0 "<variety>year</variety>\n<wu>hadcm3n_ymgx_1900_40_007524528</wu>\n<result>hadcm3n_ymgx_1900_40_007524528_1</result>\n<ph>1</ph>\n<ts>25920</ts>\n<cp>46188</cp>\n<vr>6.07</vr>\n<ppname>\ntrickle_hadcm3n_ymgx_1"..., _IO_write_end = 0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>,
_IO_buf_base = 0x7ffffffeddb0 "<scheduler_request>\n <authenticator>4a31871bf59efd4895a7ca5a65402602</authenticator>\n <hostid>1171553</hostid>\n <rpc_seqno>122</rpc_seqno>\n <core_client_major_version>6</core_client_major_"..., _IO_buf_end = 0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>, _IO_save_base = 0x0,
_IO_backup_base = 0x0, _IO_save_end = 0x0, _markers = 0x0, _chain = 0x0, _fileno = 0, _flags2 = 0, _old_offset = 0, _cur_column = 0, _vtable_offset = 0 '\000',
_shortbuf = "", _lock = 0x0, _offset = 0, _codecvt = 0x0, _wide_data = 0x0, _freeres_list = 0x0, _freeres_buf = 0x0, _freeres_size = 0, _mode = -1,
_unused2 = '\000' <repeats 19 times>}, vtable = 0x7ffff6583740}, _s = {_allocate_buffer = 0, _free_buffer = 0}}
ret = <optimized out>
---Type <return> to continue, or q <return> to quit---
#4 0x00007ffff62532f8 in __sprintf (s=0x7fffffffefe0 "L*W53'G?R8ALXQ]?<\"TT$^2.&3REI]0W" <Address 0x7ffffffff000 out of bounds>,
format=0x7ffff7fba050 "L*W53'G?R8ALXQ]?<\"TT$^2.&3REI]0WIP,']_K^-2]C=SQF;M\"XB'\\-('\n:]'0L-?/P^1N,V-5L8R1>YXA( W at +ZJF0'.?-*CF5L0\"*IJP at 7:/H$E>M</EE#67\nI\\WA[#6?4\\WY'Z^?LUAKSY:!/J&9:_6R2TZ,ED'RM at Y\\G-]Q!DWW1!.FW#).F\n: O#_9L'[76"...) at sprintf.c:34
arg = {{gp_offset = 48, fp_offset = 48, overflow_arg_area = 0x7ffffffeddb0, reg_save_area = 0x7ffffffedcc0}}
done = 894904908
#5 0x000000000045e189 in trickle_up_request_message (
buf=0x7ffffffeddb0 "<scheduler_request>\n <authenticator>4a31871bf59efd4895a7ca5a65402602</authenticator>\n <hostid>1171553</hostid>\n <rpc_seqno>122</rpc_seqno>\n <core_client_major_version>6</core_client_major_"..., t=1320089068,
result_name=0x7fffffffdee0 "\\C)1*3M4 $V1D60'GS'4%\">(>+)M'=\\5#!X/Z,?AWY:2#^65>%\\.7+8\n\\7Y(+D6F:N-U*6 !3[452\\NBC62?W8J+<2<L>K_&*=!+ I9&U '9EC&F6'%SNM)Z\n$IV+P=X@?]10&:\\+(D$%099=N 1,QK)XA)M'4>6+$?$X#IO>\"^<6N3/9^\\U[*Y-(\nUW92Z18_3H2XQC"...,
msg=0x7ffff7fa9010 "<variety>year</variety>\n<wu>hadcm3n_ymgx_1900_40_007524528</wu>\n<result>hadcm3n_ymgx_1900_40_007524528_1</result>\n<ph>1</ph>\n<ts>25920</ts>\n<cp>46188</cp>\n<vr>6.07</vr>\n<ppname>\ntrickle_hadcm3n_ymgx_1"..., p=0x72dad0) at cs_trickle.cpp:195
No locals.
#6 send_replicated_trickles (p=0x72dad0,
msg=0x7ffff7fa9010 "<variety>year</variety>\n<wu>hadcm3n_ymgx_1900_40_007524528</wu>\n<result>hadcm3n_ymgx_1900_40_007524528_1</result>\n<ph>1</ph>\n<ts>25920</ts>\n<cp>46188</cp>\n<vr>6.07</vr>\n<ppname>\ntrickle_hadcm3n_ymgx_1"...,
result_name=0x7fffffffdee0 "\\C)1*3M4 $V1D60'GS'4%\">(>+)M'=\\5#!X/Z,?AWY:2#^65>%\\.7+8\n\\7Y(+D6F:N-U*6 !3[452\\NBC62?W8J+<2<L>K_&*=!+ I9&U '9EC&F6'%SNM)Z\n$IV+P=X@?]10&:\\+(D$%099=N 1,QK)XA)M'4>6+$?$X#IO>\"^<6N3/9^\\U[*Y-(\nUW92Z18_3H2XQC"..., now=1320089068) at cs_trickle.cpp:202
buf = "<scheduler_request>\n <authenticator>4a31871bf59efd4895a7ca5a65402602</authenticator>\n <hostid>1171553</hostid>\n <rpc_seqno>122</rpc_seqno>\n <core_client_major_version>6</core_client_major_"...
#7 0x000000000000000d in ?? ()
No symbol table info available.
#8 0x0000000000000063 in ?? ()
No symbol table info available.
#9 0x00000000007240f8 in ?? ()
No symbol table info available.
(More stack frames follow...)
Sorry, I'm not familiar with any of the boinc code, so I leave the
interpretation of this data to the boinc programmers.
Please let me know if you need more. I've saved the core dump and can
inspect it further (or send the core, along with the binary containing
all symbols, to interested parties via private email).
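As an outsider's reading of the trace only: frame #1 is asked to copy
n=150842 bytes, the destination in frames #3 and #5 is the local "buf" of
send_replicated_trickles() (frame #6), and frames #7 and up hold values
such as 0xd and 0x63 that do not look like return addresses. That would be
consistent with an unbounded sprintf() running past a fixed-size stack
buffer, but I have not checked it against the boinc source. Below is a
minimal sketch of the bounded alternative. The names build_request and
REQ_BUF_SIZE, the buffer size, and the XML layout are all hypothetical and
not taken from cs_trickle.cpp.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical size: the real buffer size is not visible in the backtrace. */
#define REQ_BUF_SIZE 8192

/*
 * Format a request into buf without ever writing past buf[size - 1].
 * Returns 0 on success, -1 if the formatted request does not fit
 * (in which case buf holds a truncated, NUL-terminated prefix).
 *
 * snprintf() writes at most `size` bytes including the terminating NUL
 * and returns the length the complete output would have had, so a
 * return value >= size means the output was truncated.
 */
static int build_request(char *buf, size_t size,
                         const char *result_name, const char *msg)
{
    assert(buf != NULL && size > 0);

    int n = snprintf(buf, size,
                     "<scheduler_request>\n"
                     "<result_name>%s</result_name>\n"
                     "%s"
                     "</scheduler_request>\n",
                     result_name, msg);

    if (n < 0 || (size_t)n >= size) {
        return -1; /* encoding error or message too large for buf */
    }
    return 0;
}
```

A caller would declare "char buf[REQ_BUF_SIZE];" and call
build_request(buf, sizeof buf, result_name, msg), then skip or log the
trickle when -1 comes back instead of sending a truncated request. Whether
the right fix is a bounded write like this, a larger buffer, or a heap
allocation sized to the message is for the boinc developers to decide.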
Hope this helps,
John
-- Package-specific info:
-- Contents of /etc/default/boinc-client:
# This file is /etc/default/boinc-client, it is a configuration file for the
# /etc/init.d/boinc-client init script.
# Set this to 1 to enable and to 0 to disable the init script.
ENABLED="1"
# Set this to 1 to enable advanced scheduling of the BOINC core client and
# all its sub-processes (reduces the impact of BOINC on the system's
# performance).
SCHEDULE="1"
# The BOINC core client will be started with the permissions of this user.
BOINC_USER="boinc"
# This is the data directory of the BOINC core client.
BOINC_DIR="/var/lib/boinc-client"
# This is the location of the BOINC core client, that the init script uses.
# If you do not want to use the client program provided by the boinc-client
# package, you can specify here an alternative client program.
#BOINC_CLIENT="/usr/local/bin/boinc"
BOINC_CLIENT="/usr/bin/boinc"
# Here you can specify additional options to pass to the BOINC core client.
# Type 'boinc --help' or 'man boinc' for a full summary of allowed options.
#BOINC_OPTS="--allow_remote_gui_rpc"
BOINC_OPTS=""
-- System Information:
Debian Release: wheezy/sid
APT prefers unstable
APT policy: (500, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Kernel: Linux 3.0.0-1 (SMP w/4 CPU cores; PREEMPT)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages boinc-client depends on:
ii adduser 3.113
ii ca-certificates 20111025
ii debconf [debconf-2.0] 1.5.41
ii libc6 2.13-21
ii libcurl3 7.21.7-3
ii libgcc1 1:4.6.2-3
ii libssl1.0.0 1.0.0e-2
ii libstdc++6 4.6.2-3
ii libx11-6 2:1.4.4-2
ii libxss1 1:1.2.1-2
ii python 2.7.2-9
ii zlib1g 1:1.2.3.4.dfsg-3
Versions of packages boinc-client recommends:
ii ia32-libs 20111001
Versions of packages boinc-client suggests:
ii boinc-app-seti <none>
ii boinc-manager 6.13.10+dfsg-1
ii x11-xserver-utils 7.6+3
-- Configuration Files:
/etc/boinc-client/global_prefs_override.xml changed:
<global_preferences>
<run_on_batteries>0</run_on_batteries>
<run_if_user_active>1</run_if_user_active>
<run_gpu_if_user_active>0</run_gpu_if_user_active>
<idle_time_to_run>3.000000</idle_time_to_run>
<suspend_cpu_usage>25.000000</suspend_cpu_usage>
<start_hour>0.000000</start_hour>
<end_hour>0.000000</end_hour>
<net_start_hour>0.000000</net_start_hour>
<net_end_hour>0.000000</net_end_hour>
<leave_apps_in_memory>0</leave_apps_in_memory>
<confirm_before_connecting>0</confirm_before_connecting>
<hangup_if_dialed>0</hangup_if_dialed>
<dont_verify_images>0</dont_verify_images>
<work_buf_min_days>0.000000</work_buf_min_days>
<work_buf_additional_days>0.250000</work_buf_additional_days>
<max_ncpus_pct>50.000000</max_ncpus_pct>
<cpu_scheduling_period_minutes>60.000000</cpu_scheduling_period_minutes>
<disk_interval>60.000000</disk_interval>
<disk_max_used_gb>100.000000</disk_max_used_gb>
<disk_max_used_pct>50.000000</disk_max_used_pct>
<disk_min_free_gb>0.000000</disk_min_free_gb>
<vm_max_used_pct>75.000000</vm_max_used_pct>
<ram_max_used_busy_pct>50.000000</ram_max_used_busy_pct>
<ram_max_used_idle_pct>90.000000</ram_max_used_idle_pct>
<max_bytes_sec_up>24995.840000</max_bytes_sec_up>
<max_bytes_sec_down>249999.360000</max_bytes_sec_down>
<cpu_usage_limit>100.000000</cpu_usage_limit>
<daily_xfer_limit_mb>0.000000</daily_xfer_limit_mb>
<daily_xfer_period_days>0</daily_xfer_period_days>
</global_preferences>
--
John Feuerstein <john at feurix.com>