[Debichem-devel] Bug#687266: aces3: some jobs hang when run sequentially
Michael Banck
mbanck at debian.org
Tue Sep 11 10:22:02 UTC 2012
Package: aces3
Version: 3.0.6-1
Severity: important
When I run the same job as mentioned in #687264 with only one process, the
job hangs in tran_rhf_ao_sv1.sio:
Gather on company_rank succeeded.
Static pre-defined array # 2 is first used on line 328
Allocated 800 bytes for static arrays.
Allocated 896466560 bytes for blkmgr.
Total memory usage 855 MBytes.
Max. possible usage 900 MBytes
Total blocks used = 65759
At this point, no more output is written and no temporary files are
written or updated, while the xaces3 process spins at 100% CPU.
This is a representative backtrace:
0x000000000042e968 in one_pass_of_server () at sumz.c:439
439 MPI_Iprobe(MPI_ANY_SOURCE, readytag, newcomm, &flag, &status);
(gdb) bt
#0 0x000000000042e968 in one_pass_of_server () at sumz.c:439
#1 0x000000000042f73d in exec_thread_server_ (bflag=bflag@entry=0x729b20) at sumz.c:1248
#2 0x00000000004df2a4 in wait_on_block (array=23, block=1, blkndx=56362, type=201, request=4, instruction_timer=35, comm_timer=95) at wait_on_block.f:50
#3 0x000000000048b6a5 in compute_block (op=..., array_table=..., narray_table=198, index_table=..., nindex_table=32, block_map_table=..., nblock_map_table=55,
segment_table=..., nsegment_table=43, scalar_table=..., nscalar_table=13, address_table=..., debugit=.FALSE., validate=.FALSE., flopcount=0, comm=3, comm_timer=95,
instruction_timer=35) at compute_block.F:759
#4 0x00000000004d1e53 in optable_loop (optable=..., noptable=245, array_table=..., narray_table=198, array_labels=..., index_table=..., nindex_table=32, segment_table=...,
nsegment_table=43, block_map_table=..., nblock_map_table=55, scalar_table=..., nscalar_table=13, proctab=..., address_table=..., debug=.FALSE., validate=.FALSE., comm=3,
comm_timer=95, _array_labels=_array_labels@entry=10) at optable_loop.f:274
#5 0x00000000004423e5 in master.0.sip_fmain_init (__entry=1, ncompany_workers_min=<error reading variable: Cannot access memory at address 0x0>,
ierr_return=<error reading variable: Cannot access memory at address 0x0>) at sip_fmain.F:582
#6 0x000000000042f8b8 in sumz_work_ (dryrun_flag=0x2, dryrun_flag@entry=0x7fff04aff4e8, fmbuffer=0xffff8002, fmbuffer@entry=0x23d0448c, dbg_flag=0x1,
dbg_flag@entry=0x7fff04aff4e4, totalrecvbuffer=0x36c66) at sumz.c:1294
#7 0x0000000000423bea in worker_work () at worker_work.F:79
#8 0x000000000041a613 in aces3 () at beta.F:914
#9 0x000000000041959d in main (argc=<optimized out>, argv=<optimized out>) at beta.F:1014
#10 0x00007f0a0f6b4ead in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x00000000004195c9 in _start ()
Another:
0x00007f0a122f0f63 in PMPI_Iprobe () from /usr/lib/libmpi.so.0
(gdb) bt
#0 0x00007f0a122f0f63 in PMPI_Iprobe () from /usr/lib/libmpi.so.0
#1 0x000000000042e9d0 in one_pass_of_server () at sumz.c:445
#2 0x000000000042f73d in exec_thread_server_ (bflag=bflag@entry=0x729b20) at sumz.c:1248
#3 0x00000000004df2a4 in wait_on_block (array=23, block=1, blkndx=56362, type=201, request=4, instruction_timer=35, comm_timer=95) at wait_on_block.f:50
#4 0x000000000048b6a5 in compute_block [...] at compute_block.F:759
And another:
0x00007f0a10c4e369 in opal_progress () from /usr/lib/libopen-pal.so.0
(gdb) bt
#0 0x00007f0a10c4e369 in opal_progress () from /usr/lib/libopen-pal.so.0
#1 0x00007f0a122cd9c9 in ?? () from /usr/lib/libmpi.so.0
#2 0x00007f0a122f84e3 in PMPI_Test () from /usr/lib/libmpi.so.0
#3 0x00007f0a1110e122 in pmpi_test__ () from /usr/lib/libmpi_f77.so.0
#4 0x00000000004df2bd in wait_on_block (array=23, block=1, blkndx=56362, type=201, request=4, instruction_timer=35, comm_timer=95) at wait_on_block.f:48
#5 0x000000000048b6a5 in compute_block [...] at compute_block.F:759
I did not encounter any other backtraces after a few more tries.
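All three backtraces pass through wait_on_block, once inside PMPI_Test and twice inside MPI_Iprobe via exec_thread_server_. That would be consistent with a polling loop that alternates between testing a request and running one pass of the message server, and which spins at full CPU if the request never completes. The following is only a minimal sketch of that pattern under that assumption, not the ACES3 code: fake_request, fake_test, fake_server_pass and wait_with_cap are hypothetical stand-ins, no MPI calls are made, and the spin cap exists only so that the sketch terminates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a nonblocking request; this is not an MPI type. */
typedef struct {
    long polls_until_done; /* negative means the request never completes */
    long polls_seen;       /* number of completion tests performed so far */
} fake_request;

/* Stand-in for a nonblocking completion test (in the spirit of MPI_Test):
 * it returns immediately whether or not the request is done. */
static bool fake_test(fake_request *req)
{
    req->polls_seen++;
    return req->polls_until_done >= 0 &&
           req->polls_seen >= req->polls_until_done;
}

/* Stand-in for one nonblocking pass of a message server (in the spirit of
 * a probe such as MPI_Iprobe). It only counts how often it was called. */
static void fake_server_pass(long *passes)
{
    (*passes)++;
}

/* Poll the request, running one server pass between tests.
 * Returns the number of tests needed, or -1 if max_spins was reached.
 * Without the cap, a request that never completes makes this loop spin
 * forever without ever blocking, i.e. at 100% CPU. */
static long wait_with_cap(fake_request *req, long max_spins,
                          long *server_passes)
{
    assert(req != NULL && server_passes != NULL && max_spins > 0);
    for (long spin = 1; spin <= max_spins; spin++) {
        if (fake_test(req))
            return spin;
        fake_server_pass(server_passes);
    }
    return -1;
}
```

In this sketch, a request that completes on its third test returns after two server passes, while a request that never completes exhausts the cap. A debugger attached at arbitrary moments to the uncapped version would always land either in the test or in the server pass, which matches the small set of backtraces seen above.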
Michael