[Nut-upsdev] a nasty kernel oops
Alfred Ganz
alfred-ganz+nut at agci.com
Wed Jan 12 15:08:35 UTC 2011
Ladies and Gentlemen,
I have been trying to sort out a nasty kernel oops for which I would
like to ask for some advice. I don't actually think that this is a nut
problem, although it is triggered by upsdrvctl or usbhid-ups. I rather
suspect the USB library or the associated kernel code.
Here is the configuration information:
* system: Centos-5.5
* kernel: 2.6.18-194.32.1.el5, latest vanilla for Centos-5.5
* nut: nut-2.4.3-1, a build provided by Arnaud last fall
* libusb: libusb-0.1.12-5.1, the latest for this system
* the UPS:
device.mfr: APC
device.model: Back-UPS ES 650
device.serial: QB0514232934
device.type: ups
ups.firmware: 818.w1.D
ups.mfr.date: 2005/08/10
ups.productid: 0002
Sorry Arnaud!
* the driver:
driver.name: usbhid-ups
driver.parameter.pollfreq: 30
driver.parameter.pollinterval: 1
driver.parameter.port: auto
driver.version: 2.4.3
driver.version.data: APC HID 0.95
driver.version.internal: 0.34
The problem:
* under certain circumstances system boot fails with a kernel oops
during the upsdrvctl/usbhid-ups phase of the startup script.
Note that we have *not* reached the upsd part of the startup script,
and the startup script of the printer package (cups) would be
reached much later.
* the boot process works fine if another USB device, the USB printer,
is powered up. Unfortunately, I would like to keep it powered off
most of the time if I could do so.
Note that the printer is not the only other USB device on the system,
so powering it on or off only changes the number of active USB devices.
* Once the system is completely booted, there is no problem starting
and stopping the ups using the startup script.
* the problem appeared late in the lifetime of Centos-5: before it
first showed up, the system had shut down due to power loss and
restarted with the printer off.
* the problem probably appeared with the transition from Centos-5.4
(kernel-2.6.18-164.15.1.el5) to Centos-5.5 (kernel-2.6.18-194.3.1.el5),
although I noticed it much later.
* libusb has been the same since 2008/04/22; I don't believe the problem
appeared that long ago.
* the problem happens with the locally rebuilt (with no changes)
nut-2.2.0-6.1 from Fedora 8, with the nut-2.4.3-1 provided by
Arnaud, and with a locally rebuilt version of nut-2.4.3-1 using
a spec file provided by Arnaud.
Here is what I have tried:
* I have booted the system with crash dumps enabled; an extract from
the crash log is below.
* I have tried to recreate the problem on a virtual machine, but it
doesn't happen there.
* I have introduced a delay before the start of the ups startup script
to make sure that the suspected timing problem doesn't happen earlier.
* I have hacked the startup script to invoke usbhid-ups directly with
the proper device parameter and -D, and the problem goes away.
* I have used the hacked startup script to invoke usbhid-ups directly
with the proper device parameter but *no* -D, and we have a crash!
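To make the last two experiments concrete: they differ only in the -D
flag. The following is a minimal Python sketch that builds both command
lines, not the actual startup script. The driver path /sbin/usbhid-ups
and the ups.conf section name "myups" are hypothetical placeholders, and
it assumes the "device parameter" is passed with the standard NUT driver
option -a.

```python
def driver_command(driver_path, ups_name, debug_level=0):
    """Build the argv for starting a NUT driver directly.

    -a selects the ups.conf section; each -D raises the debug level by one.
    """
    argv = [driver_path, "-a", ups_name]
    argv += ["-D"] * debug_level
    return argv


DRIVER = "/sbin/usbhid-ups"  # hypothetical install path
UPS_NAME = "myups"           # hypothetical ups.conf section name

with_debug = driver_command(DRIVER, UPS_NAME, debug_level=1)
without_debug = driver_command(DRIVER, UPS_NAME)

print(" ".join(with_debug))     # the variant that did not crash
print(" ".join(without_debug))  # the variant that crashed
```

As far as I know, raising the debug level in NUT drivers of this era
also keeps the driver in the foreground instead of letting it fork into
the background. If that holds for this build (it should be checked
against the driver documentation), the two runs differ in more than log
verbosity.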
So I am now left with the observation that we have a timing problem
somewhere among usbhid-ups, libusb, and the kernel, and I have no
idea how to further narrow things down.
I would be glad to do some more tests, but I would need your help with
this.
Thanks for any advice, AG
----------------------------------------------------------------------------
BUG: unable to handle kernel paging request at virtual address 0000190c
printing eip:
c05954c6
*pde = 74ecb067
Oops: 0000 [#1]
SMP
....
CPU: 0
EIP: 0060:[<c05954c6>] Not tainted VLI
EFLAGS: 00010206 (2.6.18-194.26.1.el5 #1)
EIP is at hid_close+0x2/0x1f
eax: 00000000 ebx: d216bd20 ecx: f7fff080 edx: 00000000
esi: d21ebc00 edi: c06b3470 ebp: 00000000 esp: f768ad34
ds: 007b es: 007b ss: 0068
Process usbhid-ups (pid: 2437, ti=f768a000 task=f7682000 task.ti=f768a000)
Stack: c05976fd d20fe000 c0595668 d21ebc00 c06b3440 c058e1c5 d21ebc8c d21ebc14
c055e859 d21ebc14 d21ebc14 d21ebc00 c055ea91 d21ebc00 c0588351 ffffffc3
00000000 c0590cc0 f75ee340 00000000 00005516 00000000 c00c5512 d2173400
Call Trace:
[<c05976fd>] hiddev_disconnect+0x3f/0x5e
[<c0595668>] hid_disconnect+0x81/0xbf
[<c058e1c5>] usb_unbind_interface+0x34/0x6a
[<c055e859>] __device_release_driver+0x7d/0xbb
[<c055ea91>] device_release_driver+0x1c/0x2b
[<c0588351>] usb_driver_release_interface+0x38/0x60
[<c0590cc0>] proc_ioctl_default+0x10d/0x1d0
[<c0591ffd>] usbdev_ioctl+0x1027/0x10de
[<c048b06b>] __d_lookup+0x98/0xdb
[<c04c7eff>] inode_has_perm+0x54/0x5c
[<c04c78ab>] avc_has_perm+0x3c/0x46
[<c04c7eff>] inode_has_perm+0x54/0x5c
[<c0464c32>] __handle_mm_fault+0x463/0xaac
[<c0486290>] do_ioctl+0x47/0x5d
[<c04867f9>] vfs_ioctl+0x47b/0x4d3
[<c0486899>] sys_ioctl+0x48/0x5f
[<c0404f17>] syscall_call+0x7/0xb
=======================
Code: 00 05 b0 01 86 83 b4 0c 00 00 fb 8d 83 54 0c 00 00 e8 4c 87 e9 ff 8b 83 a8 0c 00 00 e8 7d 74 ff ff 31 c0 5b c3 31 d2 eb be 89 c2 <8b> 80 0c 19 00 00 48 85 c0 89 82 0c 19 00 00 75 0b 8b 82 a8 0c
EIP: [<c05954c6>] hid_close+0x2/0x1f SS:ESP 0068:f768ad34
----------------------------------------------------------------------------
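The faulting instruction can be read directly from the "Code:" line,
where the byte in angle brackets marks EIP. The following Python sketch
decodes just that one instruction and combines it with the register dump
above. It is a hand-rolled decoder for this single x86 encoding (opcode
8b with a 32-bit displacement), not a general disassembler, and only the
tail of the "Code:" line is copied in.

```python
# Tail of the "Code:" line from the oops; <..> marks the byte at EIP.
CODE_TAIL = ("Code: 89 c2 <8b> 80 0c 19 00 00 48 85 c0 "
             "89 82 0c 19 00 00 75 0b 8b 82 a8 0c")

# Register values taken from the oops register dump.
REGISTERS = {"eax": 0x00000000, "ebx": 0xD216BD20,
             "ecx": 0xF7FFF080, "edx": 0x00000000}

REG_NAMES = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def faulting_bytes(code_line):
    """Return the bytes from the <..> marker onward as integers."""
    tokens = code_line.split()[1:]  # drop the leading "Code:"
    start = next(i for i, t in enumerate(tokens) if t.startswith("<"))
    return [int(t.strip("<>"), 16) for t in tokens[start:]]


def decode_mov_load(b):
    """Decode 'mov r32, [r32 + disp32]' (opcode 0x8b, ModRM mod=10)."""
    if b[0] != 0x8B:
        raise ValueError("not a 'mov r32, r/m32' instruction")
    mod, reg, rm = b[1] >> 6, (b[1] >> 3) & 7, b[1] & 7
    if mod != 2 or rm == 4:
        raise ValueError("only the disp32, no-SIB form is handled")
    disp = int.from_bytes(bytes(b[2:6]), "little")
    return REG_NAMES[reg], REG_NAMES[rm], disp


dst, base, disp = decode_mov_load(faulting_bytes(CODE_TAIL))
fault_address = (REGISTERS[base] + disp) & 0xFFFFFFFF
print(f"mov {dst}, [{base}+{disp:#x}] -> fault at {fault_address:#010x}")
```

With eax = 00000000 the load targets address 0x190c, which matches the
"paging request at virtual address 0000190c" in the first line of the
oops. That is consistent with hid_close being handed a NULL pointer and
reading a field at offset 0x190c of it, although the trace alone does
not show why the pointer is NULL.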
--
----------------------------------------------------------------------
Alfred Ganz alfred-ganz:at:agci.com
AG Consulting, Inc. (203) 624-9667
440 Prospect Street # 11
New Haven, CT 06511
----------------------------------------------------------------------