[Debichem-devel] Bug#687266: aces3: some jobs hang when run sequentially

Tue Sep 11 10:22:02 UTC 2012

Package: aces3
Version: 3.0.6-1
Severity: important

When I run the same job as mentioned in #687264 with only one process, the
job hangs in tran_rhf_ao_sv1.sio:

 Gather on company_rank succeeded.
 Static pre-defined array #           2  is first used on line 328
 Allocated                   800  bytes for static arrays.
 Allocated             896466560  bytes for blkmgr.
 Total memory usage                   855  MBytes.
 Max. possible usage          900  MBytes
 Total blocks used =        65759

At this point, no more output is written and also no temporary files are
written or updated, while the xaces3 process spins at 100% CPU.

This is a representative backtrace:

0x000000000042e968 in one_pass_of_server () at sumz.c:439
439	           MPI_Iprobe(MPI_ANY_SOURCE, readytag, newcomm, &flag, &status);
(gdb) bt
#0  0x000000000042e968 in one_pass_of_server () at sumz.c:439
#1  0x000000000042f73d in exec_thread_server_ (bflag=bflag@entry=0x729b20) at sumz.c:1248
#2  0x00000000004df2a4 in wait_on_block (array=23, block=1, blkndx=56362, type=201, request=4, instruction_timer=35, comm_timer=95) at wait_on_block.f:50
#3  0x000000000048b6a5 in compute_block (op=..., array_table=..., narray_table=198, index_table=..., nindex_table=32, block_map_table=..., nblock_map_table=55, 
    segment_table=..., nsegment_table=43, scalar_table=..., nscalar_table=13, address_table=..., debugit=.FALSE., validate=.FALSE., flopcount=0, comm=3, comm_timer=95, 
    instruction_timer=35) at compute_block.F:759
#4  0x00000000004d1e53 in optable_loop (optable=..., noptable=245, array_table=..., narray_table=198, array_labels=..., index_table=..., nindex_table=32, segment_table=..., 
    nsegment_table=43, block_map_table=..., nblock_map_table=55, scalar_table=..., nscalar_table=13, proctab=..., address_table=..., debug=.FALSE., validate=.FALSE., comm=3, 
    comm_timer=95, _array_labels=_array_labels@entry=10) at optable_loop.f:274
#5  0x00000000004423e5 in master.0.sip_fmain_init (__entry=1, ncompany_workers_min=<error reading variable: Cannot access memory at address 0x0>, 
    ierr_return=<error reading variable: Cannot access memory at address 0x0>) at sip_fmain.F:582
#6  0x000000000042f8b8 in sumz_work_ (dryrun_flag=0x2, dryrun_flag@entry=0x7fff04aff4e8, fmbuffer=0xffff8002, fmbuffer@entry=0x23d0448c, dbg_flag=0x1, 
    dbg_flag@entry=0x7fff04aff4e4, totalrecvbuffer=0x36c66) at sumz.c:1294
#7  0x0000000000423bea in worker_work () at worker_work.F:79
#8  0x000000000041a613 in aces3 () at beta.F:914
#9  0x000000000041959d in main (argc=<optimized out>, argv=<optimized out>) at beta.F:1014
#10 0x00007f0a0f6b4ead in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#11 0x00000000004195c9 in _start ()

Another:

0x00007f0a122f0f63 in PMPI_Iprobe () from /usr/lib/libmpi.so.0
(gdb) bt
#0  0x00007f0a122f0f63 in PMPI_Iprobe () from /usr/lib/libmpi.so.0
#1  0x000000000042e9d0 in one_pass_of_server () at sumz.c:445
#2  0x000000000042f73d in exec_thread_server_ (bflag=bflag@entry=0x729b20) at sumz.c:1248
#3  0x00000000004df2a4 in wait_on_block (array=23, block=1, blkndx=56362, type=201, request=4, instruction_timer=35, comm_timer=95) at wait_on_block.f:50
#4  0x000000000048b6a5 in compute_block [...] at compute_block.F:759

And another:

0x00007f0a10c4e369 in opal_progress () from /usr/lib/libopen-pal.so.0
(gdb) bt
#0  0x00007f0a10c4e369 in opal_progress () from /usr/lib/libopen-pal.so.0
#1  0x00007f0a122cd9c9 in ?? () from /usr/lib/libmpi.so.0
#2  0x00007f0a122f84e3 in PMPI_Test () from /usr/lib/libmpi.so.0
#3  0x00007f0a1110e122 in pmpi_test__ () from /usr/lib/libmpi_f77.so.0
#4  0x00000000004df2bd in wait_on_block (array=23, block=1, blkndx=56362, type=201, request=4, instruction_timer=35, comm_timer=95) at wait_on_block.f:48
#5  0x000000000048b6a5 in compute_block [...] at compute_block.F:759

I did not encounter any other backtraces after a few more tries.
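
Taken together, the three backtraces are consistent with a polling loop in
wait_on_block that tests a request (wait_on_block.f:48) and, while it is not
complete, runs one server pass (wait_on_block.f:50) that probes for incoming
messages. This is only one reading of the traces, not something confirmed
from the source. The following toy model in plain C, without MPI, illustrates
why such a loop would spin at 100% CPU if no process ever sends the block.
All names (fake_request, fake_test, fake_server_pass, wait_on_block_model)
and the max_spins bound are hypothetical; the real loop appears to have no
such bound.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a pending request. The block "arrives" after
 * arrives_after polls, or never if arrives_after is negative (no sender). */
struct fake_request {
    long polls;          /* number of times the request has been tested */
    long arrives_after;  /* < 0 means the request can never complete */
};

/* Models a non-blocking test such as MPI_Test: returns immediately and
 * sets *flag only once the request has completed. */
static void fake_test(struct fake_request *req, bool *flag)
{
    req->polls++;
    *flag = (req->arrives_after >= 0 && req->polls > req->arrives_after);
}

/* Models one server pass (cf. one_pass_of_server): it probes for work,
 * finds nothing, and returns. Here it only counts the probes. */
static void fake_server_pass(long *probes)
{
    (*probes)++;
}

/* Models the suspected wait loop: test the request, and if it is not
 * complete, do one server pass and try again. Returns true if the block
 * arrived within max_spins iterations. The bound exists only so that the
 * model terminates. */
bool wait_on_block_model(struct fake_request *req, long max_spins, long *probes)
{
    assert(max_spins > 0);
    bool flag = false;
    for (long i = 0; i < max_spins; i++) {
        fake_test(req, &flag);
        if (flag)
            return true;
        fake_server_pass(probes);
    }
    return false;  /* still waiting: every iteration burned CPU */
}
```

With arrives_after set to a negative value, the model never returns true no
matter how large max_spins is, and every iteration does a test plus a probe,
which matches the observed pattern of a busy process that produces no output.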

Michael