[Pkg-xen-devel] Bug#603727: xen-hypervisor-4.0-amd64: i386 Dom0 crashes after doing some I/O on local storage (software Raid1 on SAS-drives with mpt2sas driver)

Jordan Pittier - Rezel jordan at rezel.net
Wed Jan 19 18:24:51 UTC 2011


I have the exact same issue on several Sun Fire V20z servers (hardware LSI RAID
controller). Under high I/O load, the RAID controller starts reporting
disk errors:

[163442.483878] mptbase: ioc0: LogInfo(0x11070000): F/W: DMA Error
[163445.172333] sd 2:0:0:0: [sda] Unhandled error code
[163445.172338] sd 2:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[163445.172345] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 ba ee 4d 00 04 00 00
[163445.172365] end_request: I/O error, dev sda, sector 12250701
[163445.172529] __ratelimit: 13242 callbacks suppressed
[163445.172534] Buffer I/O error on device dm-0, logical block 1408794
[163445.172694] lost page write due to I/O error on dm-0
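
As a cross-check on the excerpt above, the failed command can be decoded from the CDB bytes. The following is a minimal sketch, assuming the standard SCSI WRITE(10) layout (opcode 0x2a, a 4-byte big-endian LBA in bytes 2-5, a 2-byte transfer length in bytes 7-8); the function name decode_write10 is hypothetical and only for illustration.

```python
def decode_write10(cdb_hex):
    """Decode the LBA and transfer length from a WRITE(10) CDB hex dump."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a 10-byte WRITE(10) CDB")
    return {
        # Bytes 2..5: logical block address, big-endian.
        "lba": int.from_bytes(cdb[2:6], "big"),
        # Bytes 7..8: number of logical blocks to transfer, big-endian.
        "blocks": int.from_bytes(cdb[7:9], "big"),
    }


# CDB copied from the "CDB: Write(10)" log line above.
info = decode_write10("2a 00 00 ba ee 4d 00 04 00 00")
print(info)
```

The decoded LBA is 12250701, the same sector that end_request reports for sda, and the transfer length is 1024 blocks (512 KiB if the disk uses 512-byte logical blocks, which is an assumption here). So the I/O error corresponds to this one large write.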

When I reboot the server, everything is "back to normal". This issue
only appears with the Xen hypervisor loaded
(multiboot /xen-4.0-amd64.gz placeholder in grub2). The same kernel
(2.6.32-5-xen-amd64 #1 SMP Fri Dec 10) without the Xen hypervisor works
flawlessly.


Here are some logs:
[163440.613034] INFO: task kjournald:299 blocked for more than 120 seconds.
[163440.613208] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[163440.613466] kjournald     D 0000000000000000     0   299      2 0x00000000
[163440.613475]  ffffffff814771f0 0000000000000246 0000000000000000 ffffffff810e7e95
[163440.613485]  00011200028d6400 0000000000000010 000000000000f9e0 ffff8800024a7fd8
[163440.613494]  0000000000015780 0000000000015780 ffff8800024ab170 ffff8800024ab468
[163440.613503] Call Trace:
[163440.613520]  [<ffffffff810e7e95>] ? kmem_cache_alloc+0x8c/0xf0
[163440.613528]  [<ffffffff8102ddc0>] ? pvclock_clocksource_read+0x3a/0x8b
[163440.613537]  [<ffffffff8130b5f1>] ? io_schedule+0x73/0xb7
[163440.613546]  [<ffffffff8118053b>] ? get_request_wait+0xf0/0x188
[163440.613553]  [<ffffffff81065d0a>] ? autoremove_wake_function+0x0/0x2e
[163440.613559]  [<ffffffff811808ca>] ? __make_request+0x2f7/0x428
[163440.613565]  [<ffffffff8117f0a7>] ? generic_make_request+0x299/0x2f9
[163440.613574]  [<ffffffff81176f2d>] ? elv_rb_latter_request+0x0/0x23
[163440.613580]  [<ffffffff8117f1dd>] ? submit_bio+0xd6/0xf2
[163440.613588]  [<ffffffff8110dc59>] ? submit_bh+0x103/0x123
[163440.613614]  [<ffffffffa00b60cf>] ? journal_commit_transaction+0x406/0xe2b [jbd]
[163440.613623]  [<ffffffff8100e63d>] ? xen_force_evtchn_callback+0x9/0xa
[163440.613629]  [<ffffffff8100ecf2>] ? check_events+0x12/0x20
[163440.613636]  [<ffffffff8130c8b2>] ? _spin_unlock_irqrestore+0xd/0xe
[163440.613642]  [<ffffffff8100ecdf>] ? xen_restore_fl_direct_end+0x0/0x1
[163440.613648]  [<ffffffff8130c8b2>] ? _spin_unlock_irqrestore+0xd/0xe
[163440.613656]  [<ffffffffa00b9423>] ? kjournald+0xdf/0x226 [jbd]
[163440.613662]  [<ffffffff81065d0a>] ? autoremove_wake_function+0x0/0x2e
[163440.613670]  [<ffffffffa00b9344>] ? kjournald+0x0/0x226 [jbd]
[163440.613675]  [<ffffffff81065a3d>] ? kthread+0x79/0x81
[163440.613682]  [<ffffffff81012baa>] ? child_rip+0xa/0x20
[163440.613688]  [<ffffffff81011d61>] ? int_ret_from_sys_call+0x7/0x1b
[163440.613693]  [<ffffffff8101251d>] ? retint_restore_args+0x5/0x6
[163440.613699]  [<ffffffff8102ddc0>] ? pvclock_clocksource_read+0x3a/0x8b
[163440.613704]  [<ffffffff81012ba0>] ? child_rip+0x0/0x20
[163440.613711] INFO: task rs:main Q:Reg:14995 blocked for more than 120 seconds.
[163440.613960] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[163440.614217] rs:main Q:Reg D 0000000000000000     0 14995      1 0x00000000
[163440.614224]  ffffffff814771f0 0000000000000282 0000000000000000 ffffffff8117fc90
[163440.614233]  ffff880002bb88d0 ffffffff8117fe91 000000000000f9e0 ffff8800ef85ffd8
[163440.614241]  0000000000015780 0000000000015780 ffff88007e7b69f0 ffff88007e7b6ce8
[163440.614250] Call Trace:
[163440.614255]  [<ffffffff8117fc90>] ? blk_remove_plug+0xb/0x8e
[163440.614261]  [<ffffffff8117fe91>] ? __generic_unplug_device+0x12/0x2c
[163440.614266]  [<ffffffff8102ddc0>] ? pvclock_clocksource_read+0x3a/0x8b
[163440.614273]  [<ffffffff8110ed8a>] ? sync_buffer+0x0/0x40
[163440.614279]  [<ffffffff8130b5f1>] ? io_schedule+0x73/0xb7
[163440.614284]  [<ffffffff8110edc5>] ? sync_buffer+0x3b/0x40
[163440.614290]  [<ffffffff8130c8b2>] ? _spin_unlock_irqrestore+0xd/0xe
[163440.614295]  [<ffffffff8130bafe>] ? __wait_on_bit+0x41/0x70
[163440.614301]  [<ffffffff8110ed8a>] ? sync_buffer+0x0/0x40
[163440.614306]  [<ffffffff8130bb98>] ? out_of_line_wait_on_bit+0x6b/0x77
[163440.614312]  [<ffffffff81065d38>] ? wake_bit_function+0x0/0x23
[163440.614318]  [<ffffffff8110f1e9>] ? sync_dirty_buffer+0x5b/0x93
[163440.614326]  [<ffffffffa00b4e04>] ? journal_dirty_data+0xd1/0x1b0 [jbd]
[163440.614339]  [<ffffffffa0113f1f>] ? ext3_journal_dirty_data+0xf/0x34 [ext3]
[163440.614347]  [<ffffffffa01123f9>] ? walk_page_buffers+0x65/0x8b [ext3]
[163440.614356]  [<ffffffffa0113f44>] ? journal_dirty_data_fn+0x0/0x13 [ext3]
[163440.614365]  [<ffffffffa0115a66>] ? ext3_ordered_write_end+0x73/0x10f [ext3]
[163440.614373]  [<ffffffff810b5b8d>] ? generic_file_buffered_write+0x18d/0x278
[163440.614381]  [<ffffffff810b6029>] ? __generic_file_aio_write+0x25f/0x293
[163440.614387]  [<ffffffff810b60b6>] ? generic_file_aio_write+0x59/0x9f
[163440.614394]  [<ffffffff810ef716>] ? do_sync_write+0xce/0x113
[163440.614400]  [<ffffffff8100e63d>] ? xen_force_evtchn_callback+0x9/0xa
[163440.614406]  [<ffffffff81065d0a>] ? autoremove_wake_function+0x0/0x2e
[163440.614412]  [<ffffffff8100e63d>] ? xen_force_evtchn_callback+0x9/0xa
[163440.614418]  [<ffffffff8100ecf2>] ? check_events+0x12/0x20
[163440.614426]  [<ffffffff81153bf3>] ? cap_cred_prepare+0x0/0x3
[163440.614431]  [<ffffffff8102ddc0>] ? pvclock_clocksource_read+0x3a/0x8b
[163440.614438]  [<ffffffff810f008e>] ? vfs_write+0xa9/0x102
[163440.614443]  [<ffffffff810f01a3>] ? sys_write+0x45/0x6e
[163440.614449]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
[163440.614462] INFO: task flush-254:0:12175 blocked for more than 120 seconds.
[163440.614631] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[163440.614888] flush-254:0   D 0000000000000000     0 12175      2 0x00000000
[163440.614895]  ffff8800f0b30e20 0000000000000246 0000000000000000 ffff8800028cb9ec
[163440.614904]  ffff88000245ea00 0000000000000336 000000000000f9e0 ffff8800028cbfd8
[163440.614912]  0000000000015780 0000000000015780 ffff88007ec22a60 ffff88007ec22d58
[163440.614921] Call Trace:
[163440.614926]  [<ffffffff8102ddc0>] ? pvclock_clocksource_read+0x3a/0x8b
[163440.614932]  [<ffffffff8110ed8a>] ? sync_buffer+0x0/0x40
[163440.614937]  [<ffffffff8130b5f1>] ? io_schedule+0x73/0xb7
[163440.614943]  [<ffffffff8110edc5>] ? sync_buffer+0x3b/0x40
[163440.614949]  [<ffffffff8130c8b2>] ? _spin_unlock_irqrestore+0xd/0xe
[163440.614955]  [<ffffffff8130ba01>] ? __wait_on_bit_lock+0x3f/0x84
[163440.614960]  [<ffffffff8110ed8a>] ? sync_buffer+0x0/0x40
[163440.614966]  [<ffffffff8130bab1>] ? out_of_line_wait_on_bit_lock+0x6b/0x77
[163440.614972]  [<ffffffff81065d38>] ? wake_bit_function+0x0/0x23
[163440.614978]  [<ffffffff81110157>] ? __block_write_full_page+0x159/0x2ac
[163440.614984]  [<ffffffff8110ef54>] ? end_buffer_async_write+0x0/0x13b
[163440.614990]  [<ffffffff810bb3a2>] ? __writepage+0xa/0x25
[163440.614996]  [<ffffffff810bba29>] ? write_cache_pages+0x20b/0x327
[163440.615001]  [<ffffffff810bb398>] ? __writepage+0x0/0x25
[163440.615008]  [<ffffffff81108b56>] ? writeback_single_inode+0xe7/0x2da
[163440.615014]  [<ffffffff8110985c>] ? writeback_inodes_wb+0x424/0x4ff
[163440.615019]  [<ffffffff81109a63>] ? wb_writeback+0x12c/0x1ab
[163440.615025]  [<ffffffff8100ecdf>] ? xen_restore_fl_direct_end+0x0/0x1
[163440.615031]  [<ffffffff81109bfd>] ? wb_do_writeback+0x73/0x165
[163440.615037]  [<ffffffff81109d20>] ? bdi_writeback_task+0x31/0xaa
[163440.615045]  [<ffffffff810c9cf2>] ? bdi_start_fn+0x0/0xd2
[163440.615050]  [<ffffffff810c9d62>] ? bdi_start_fn+0x70/0xd2
[163440.615056]  [<ffffffff810c9cf2>] ? bdi_start_fn+0x0/0xd2
[163440.615061]  [<ffffffff81065a3d>] ? kthread+0x79/0x81
[163440.615067]  [<ffffffff81012baa>] ? child_rip+0xa/0x20
[163440.615072]  [<ffffffff81011d61>] ? int_ret_from_sys_call+0x7/0x1b
[163440.615077]  [<ffffffff8101251d>] ? retint_restore_args+0x5/0x6
[163440.615083]  [<ffffffff8102ddc0>] ? pvclock_clocksource_read+0x3a/0x8b
[163440.615088]  [<ffffffff81012ba0>] ? child_rip+0x0/0x20
[163440.615096] INFO: task bonnie++:14991 blocked for more than 120 seconds.
[163440.615261] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[163440.615518] bonnie++      D 0000000000000002     0 14991  12179 0x00000004
[163440.615526]  ffff88007f345bd0 0000000000000286 000000000006001b 000000009b952edb
[163440.615534]  ffff880002b53800 ffffffff811982cb 000000000000f9e0 ffff880002a65fd8
[163440.615542]  0000000000015780 0000000000015780 ffff88007ec254c0 ffff88007ec257b8
[163440.615551] Call Trace:
[163440.615557]  [<ffffffff811982cb>] ? __bitmap_weight+0x3a/0x7e
[163440.615563]  [<ffffffff8102ddc0>] ? pvclock_clocksource_read+0x3a/0x8b
[163440.615569]  [<ffffffff810b4ee5>] ? sync_page+0x0/0x46
[163440.615574]  [<ffffffff810b4ee5>] ? sync_page+0x0/0x46
[163440.615579]  [<ffffffff8130b5f1>] ? io_schedule+0x73/0xb7
[163440.615584]  [<ffffffff810b4f26>] ? sync_page+0x41/0x46
[163440.615590]  [<ffffffff8130c8b2>] ? _spin_unlock_irqrestore+0xd/0xe
[163440.615596]  [<ffffffff8130ba01>] ? __wait_on_bit_lock+0x3f/0x84
[163440.615604]  [<ffffffff810b4eb2>] ? __lock_page+0x5d/0x63
[163440.615609]  [<ffffffff81065d38>] ? wake_bit_function+0x0/0x23
[163440.615616]  [<ffffffff810bcdde>] ? pagevec_lookup_tag+0x1a/0x21
[163440.615622]  [<ffffffff810bb9cb>] ? write_cache_pages+0x1ad/0x327
[163440.615627]  [<ffffffff810bb398>] ? __writepage+0x0/0x25
[163440.615633]  [<ffffffff8100e63d>] ? xen_force_evtchn_callback+0x9/0xa
[163440.615639]  [<ffffffff8100ecf2>] ? check_events+0x12/0x20
[163440.615645]  [<ffffffff8130c8b2>] ? _spin_unlock_irqrestore+0xd/0xe
[163440.615650]  [<ffffffff81108b56>] ? writeback_single_inode+0xe7/0x2da
[163440.615656]  [<ffffffff8110985c>] ? writeback_inodes_wb+0x424/0x4ff
[163440.615662]  [<ffffffff810bc18f>] ? balance_dirty_pages_ratelimited_nr+0x192/0x332
[163440.615669]  [<ffffffff810b5bf5>] ? generic_file_buffered_write+0x1f5/0x278
[163440.615676]  [<ffffffff810b4329>] ? generic_segment_checks+0x50/0x76
[163440.615682]  [<ffffffff810b6029>] ? __generic_file_aio_write+0x25f/0x293
[163440.615688]  [<ffffffff810b60b6>] ? generic_file_aio_write+0x59/0x9f
[163440.615694]  [<ffffffff810ef716>] ? do_sync_write+0xce/0x113
[163440.615699]  [<ffffffff8101079c>] ? __switch_to+0x285/0x297
[163440.615705]  [<ffffffff8101251d>] ? retint_restore_args+0x5/0x6
[163440.615711]  [<ffffffff81065d0a>] ? autoremove_wake_function+0x0/0x2e
[163440.615717]  [<ffffffff8130ce6a>] ? error_exit+0x2a/0x60
[163440.615722]  [<ffffffff8101251d>] ? retint_restore_args+0x5/0x6
[163440.615728]  [<ffffffff8102ddc0>] ? pvclock_clocksource_read+0x3a/0x8b
[163440.615734]  [<ffffffff810f008e>] ? vfs_write+0xa9/0x102
[163440.615739]  [<ffffffff810f01a3>] ? sys_write+0x45/0x6e
[163440.615745]  [<ffffffff81011b42>] ? system_call_fastpath+0x16/0x1b
[163442.483878] mptbase: ioc0: LogInfo(0x11070000): F/W: DMA Error
[163443.829405] mptbase: ioc0: LogInfo(0x11070000): F/W: DMA Error
[163445.172319] mptbase: ioc0: LogInfo(0x11070000): F/W: DMA Error
[163445.172333] sd 2:0:0:0: [sda] Unhandled error code
[163445.172338] sd 2:0:0:0: [sda] Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
[163445.172345] sd 2:0:0:0: [sda] CDB: Write(10): 2a 00 00 ba ee 4d 00 04 00 00
[163445.172365] end_request: I/O error, dev sda, sector 12250701
[163445.172529] __ratelimit: 13242 callbacks suppressed
[163445.172534] Buffer I/O error on device dm-0, logical block 1408794
[163445.172694] lost page write due to I/O error on dm-0
[163445.172704] Buffer I/O error on device dm-0, logical block 1408795
[163445.172864] lost page write due to I/O error on dm-0
[163445.172869] Buffer I/O error on device dm-0, logical block 1408796
[163445.173030] lost page write due to I/O error on dm-0
[163445.173035] Buffer I/O error on device dm-0, logical block 1408797
[163445.173194] lost page write due to I/O error on dm-0
[163445.173199] Buffer I/O error on device dm-0, logical block 1408798
[163445.173358] lost page write due to I/O error on dm-0
[163445.173363] Buffer I/O error on device dm-0, logical block 1408799
[163445.173522] lost page write due to I/O error on dm-0
[163445.173527] Buffer I/O error on device dm-0, logical block 1408800
[163445.173686] lost page write due to I/O error on dm-0
[163445.173690] Buffer I/O error on device dm-0, logical block 1408801
[163445.173850] lost page write due to I/O error on dm-0
[163445.173854] Buffer I/O error on device dm-0, logical block 1408802
[163445.174013] lost page write due to I/O error on dm-0
[163445.174018] Buffer I/O error on device dm-0, logical block 1408803
[163445.174177] lost page write due to I/O error on dm-0
[163445.437120] mptbase: ioc0: LogInfo(0x00000000): F/W: unknown
[163447.843551] mptbase: ioc0: LogInfo(0x11070000): F/W: DMA Error
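
For anyone comparing their own dmesg output against the log above, here is a minimal sketch of a summarizer that pulls out the three kinds of events seen here: hung tasks, mptbase firmware messages, and buffer I/O errors. The regular expressions are written against the message formats shown in this post only, and the function name summarize is hypothetical. It expects unwrapped log lines, one message per line.

```python
import re
from collections import Counter

# Patterns modelled on the messages in this report; other kernels may differ.
HUNG = re.compile(r"INFO: task (.+):(\d+) blocked for more than (\d+) seconds")
FIRMWARE = re.compile(r"mptbase: \w+: LogInfo\(0x[0-9a-fA-F]+\): F/W: (.+)")
IO_ERROR = re.compile(r"Buffer I/O error on device (\S+), logical block (\d+)")


def summarize(log_text):
    """Collect hung tasks, firmware messages and failed blocks from dmesg text."""
    hung_tasks = []        # list of (task name, pid)
    firmware = Counter()   # firmware message text -> occurrences
    io_errors = {}         # device -> list of failed logical blocks
    for line in log_text.splitlines():
        m = HUNG.search(line)
        if m:
            # Task names may contain colons, so the greedy group keeps them.
            hung_tasks.append((m.group(1), int(m.group(2))))
            continue
        m = FIRMWARE.search(line)
        if m:
            firmware[m.group(1).strip()] += 1
            continue
        m = IO_ERROR.search(line)
        if m:
            io_errors.setdefault(m.group(1), []).append(int(m.group(2)))
    return {"hung_tasks": hung_tasks, "firmware": firmware, "io_errors": io_errors}


# A few lines taken from the log above, with wrapped lines rejoined.
SAMPLE = """\
[163440.613034] INFO: task kjournald:299 blocked for more than 120 seconds.
[163440.613711] INFO: task rs:main Q:Reg:14995 blocked for more than 120 seconds.
[163442.483878] mptbase: ioc0: LogInfo(0x11070000): F/W: DMA Error
[163445.172534] Buffer I/O error on device dm-0, logical block 1408794
[163445.172704] Buffer I/O error on device dm-0, logical block 1408795
[163445.437120] mptbase: ioc0: LogInfo(0x00000000): F/W: unknown
"""

summary = summarize(SAMPLE)
print(summary)
```

Running the same function on a log captured without the hypervisor loaded would make it easy to confirm that the DMA errors only show up under Xen.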




