[Debian-ha-maintainers] Oops breaks resource failover in RHCS

Ernesto Rodriguez Reina erreina at gmail.com
Wed Feb 17 02:07:22 UTC 2010


Hi, I wrote to you once before about a very similar problem, and I
thought it was completely solved. Unfortunately I saw the oops again. We
have repeated the test several times and always get the same result.
Here is my scenario:

Node master with nodeid=1;
Node spare with nodeid=2;
Node slave1 with nodeid=3;
Node slave2 with nodeid=4;

We shut down node master. Services are correctly relocated. We turn on
node master and again services are correctly relocated. We then
shut down node master again, and the oops appears, but only on node
spare; nodes slave1 and slave2 seem to be OK, with services running.
We tested with two different kernels, 2.6.32.8 and 2.6.31.5 (with patch
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=063c4c99630c0b06afad080d2a18bda64172c1a2).

We are using RHCS 3.0.4-2 from the Debian mirror. Any ideas on how to
solve this? We are going to test with RHCS 3.0.6-5 next.

Hoping you can help me. Best regards,
Ernesto

The oops:

with kernel 2.6.32.8:
Feb 16 19:48:22 spare kernel: [ 1080.523027] INFO: task rgmanager:6531
blocked for more than 120 seconds.
Feb 16 19:48:22 spare kernel: [ 1080.523091] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 16 19:48:22 spare kernel: [ 1080.523166] rgmanager     D
0000000000000000     0  6531   2363 0x00000000
Feb 16 19:48:22 spare kernel: [ 1080.523170]  ffffffff826fc080
0000000000000086 0000000000000296 ffffffff8104d7d9
Feb 16 19:48:22 spare kernel: [ 1080.523175]  ffff8801ab0a5038
000000000000e1c8 ffff8801ac083fd8 ffff8801ab915000
Feb 16 19:48:22 spare kernel: [ 1080.523178]  ffff8801ab9154b0
ffff8801ab0a5010 ffffffff82a420d8 ffff8801ab9154b0
Feb 16 19:48:22 spare kernel: [ 1080.523181] Call Trace:
Feb 16 19:48:22 spare kernel: [ 1080.523189]  [<ffffffff8104d7d9>] ?
try_to_wake_up+0x109/0x2d0
Feb 16 19:48:22 spare kernel: [ 1080.523194]  [<ffffffff81234bc4>] ?
cpumask_any_but+0x24/0x40
Feb 16 19:48:22 spare kernel: [ 1080.523199]  [<ffffffff8140d7a5>] ?
__down_read+0x85/0xb5
Feb 16 19:48:22 spare kernel: [ 1080.523208]  [<ffffffffa04b7960>] ?
dlm_user_request+0x60/0x240 [dlm]
Feb 16 19:48:22 spare kernel: [ 1080.523212]  [<ffffffff8110a72c>] ?
__kmalloc+0x11c/0x250
Feb 16 19:48:22 spare kernel: [ 1080.523217]  [<ffffffffa04c2196>] ?
device_write+0x686/0x790 [dlm]
Feb 16 19:48:22 spare kernel: [ 1080.523221]  [<ffffffff81111f7b>] ?
vfs_write+0xcb/0x1a0
Feb 16 19:48:22 spare kernel: [ 1080.523224]  [<ffffffff81112153>] ?
sys_write+0x53/0xa0
Feb 16 19:48:22 spare kernel: [ 1080.523227]  [<ffffffff8100bf82>] ?
system_call_fastpath+0x16/0x1b

with kernel 2.6.31.5 (with patch
http://git.kernel.org/gitweb.cgi?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=063c4c99630c0b06afad080d2a18bda64172c1a2):
Feb 16 20:35:27 spare kernel: [ 1320.436213] INFO: task
rgmanager:13795 blocked for more than 120 seconds.
Feb 16 20:35:27 spare kernel: [ 1320.436277] "echo 0 >
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Feb 16 20:35:27 spare kernel: [ 1320.436352] rgmanager     D
0000000000000000     0 13795   2247 0x00000000
Feb 16 20:35:27 spare kernel: [ 1320.436357]  ffff8801ae219000
0000000000000086 ffff88019293fd88 ffff8801a1cfbe90
Feb 16 20:35:27 spare kernel: [ 1320.436360]  0000000000013f80
000000000000e168 ffff8801928a1000 ffff8801928a14b8
Feb 16 20:35:27 spare kernel: [ 1320.436364]  0000000200000002
00000001000bf260 ffff8801ab843038 ffff8801928a14b8
Feb 16 20:35:27 spare kernel: [ 1320.436367] Call Trace:
Feb 16 20:35:27 spare kernel: [ 1320.436376]  [<ffffffff813ea425>] ?
__down_read+0x85/0xb5
Feb 16 20:35:27 spare kernel: [ 1320.436389]  [<ffffffffa052c970>] ?
dlm_user_request+0x60/0x240 [dlm]
Feb 16 20:35:27 spare kernel: [ 1320.436393]  [<ffffffff81077aef>] ?
wake_futex+0x3f/0x80
Feb 16 20:35:27 spare kernel: [ 1320.436397]  [<ffffffff810d4c40>] ?
shmem_delete_inode+0x0/0x110
Feb 16 20:35:27 spare kernel: [ 1320.436401]  [<ffffffff8100caee>] ?
invalidate_interrupt0+0xe/0x20
Feb 16 20:35:27 spare kernel: [ 1320.436406]  [<ffffffff810fc1cc>] ?
__kmalloc+0x11c/0x250
Feb 16 20:35:27 spare kernel: [ 1320.436414]  [<ffffffffa05370f6>] ?
device_write+0x686/0x790 [dlm]
Feb 16 20:35:27 spare kernel: [ 1320.436418]  [<ffffffff8105c7a3>] ?
do_sigaction+0x1b3/0x1d0
Feb 16 20:35:27 spare kernel: [ 1320.436421]  [<ffffffff8105c691>] ?
do_sigaction+0xa1/0x1d0
Feb 16 20:35:27 spare kernel: [ 1320.436424]  [<ffffffff81102e0b>] ?
vfs_write+0xcb/0x1a0
Feb 16 20:35:27 spare kernel: [ 1320.436427]  [<ffffffff81102fe3>] ?
sys_write+0x53/0xa0
Feb 16 20:35:27 spare kernel: [ 1320.436430]  [<ffffffff8100bf02>] ?
system_call_fastpath+0x16/0x1b
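
To compare the two traces side by side, a minimal Python sketch like the
following could pull the blocked task and its call-trace frames out of the
syslog text. The function name parse_hung_tasks and both regular
expressions are illustrative assumptions, not part of RHCS or the kernel;
the sample input is copied from the 2.6.32.8 trace above.

```python
import re

# Hung-task header; \s+ tolerates the line wrapping added by the mail client.
HUNG_RE = re.compile(
    r"INFO:\s+task\s+(?P<name>\S+):(?P<pid>\d+)\s+blocked\s+for\s+more\s+than"
    r"\s+(?P<secs>\d+)\s+seconds"
)
# One call-trace frame, e.g. "[<ffffffffa04b7960>] ? dlm_user_request+0x60/0x240 [dlm]".
FRAME_RE = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(?P<func>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?:\s+\[(?P<mod>\w+)\])?"
)


def parse_hung_tasks(log_text):
    """Return one dict per hung-task report found in log_text (hypothetical helper)."""
    # Collapse all whitespace so wrapped log lines become one stream.
    text = " ".join(log_text.split())
    headers = list(HUNG_RE.finditer(text))
    reports = []
    for i, head in enumerate(headers):
        # A report runs until the next hung-task header or the end of the text.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(text)
        body = text[head.end():end]
        frames = [(f.group("func"), f.group("mod")) for f in FRAME_RE.finditer(body)]
        reports.append({
            "task": head.group("name"),
            "pid": int(head.group("pid")),
            "seconds": int(head.group("secs")),
            "frames": frames,
        })
    return reports


SAMPLE = """\
Feb 16 19:48:22 spare kernel: [ 1080.523027] INFO: task rgmanager:6531
blocked for more than 120 seconds.
Feb 16 19:48:22 spare kernel: [ 1080.523199]  [<ffffffff8140d7a5>] ?
__down_read+0x85/0xb5
Feb 16 19:48:22 spare kernel: [ 1080.523208]  [<ffffffffa04b7960>] ?
dlm_user_request+0x60/0x240 [dlm]
"""

if __name__ == "__main__":
    for report in parse_hung_tasks(SAMPLE):
        print(report["task"], report["pid"], report["frames"])
```

Run against the full log, this makes it easy to check that both kernels
show the same frames: each trace above contains __down_read and
dlm_user_request [dlm] under device_write [dlm].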



-- 
Ernesto Rodriguez Reina


