[Nut-upsdev] a nasty kernel oops
Alfred Ganz
alfred-ganz+nut at agci.com
Tue Jan 25 02:22:38 UTC 2011
Charles,
Subject: Re: [Nut-upsdev] a nasty kernel oops
Date: Sat, 15 Jan 2011 10:27:26 -0500
>* I am now able to get a crash on a virtual machine, so life has
> become a bit easier
kernel oops, usbhid-ups crash, or simply a failure to launch usbhid-ups?
Kernel oops and failure to launch usbhid-ups, but *no* usbhid-ups
crash! However, it looks like I can now quite consistently reproduce
the oops with libhid-detach-device. Whoever does the USB stuff in the
group is off the hook!
I don't want to downplay the significance of the problem on your end,
but it is really up to the kernel to protect itself from race
conditions and crashes caused by userspace applications accessing
devices. To that end, I agree that something should be done outside
usbhid-ups.
Yes, I agree, the more I play around, the more I see this as a kernel
problem. The fact that I can cause the oops with libhid-detach-device
seems to confirm this.
....
Yes, I agree with your comments about starting NUT: if you can't rely
on the NUT + device combination, why use it?
>Any advice on what might work would of course be much appreciated.
One workaround would be to patch the kernel to blacklist the UPS from
the kernel raw HID driver. Of course, this doesn't play well with
prebuilt binary kernels.
Along these lines, it should be possible to blacklist the kernel HID
module which has attached to the UPS. I haven't followed this portion
of the kernel much lately (and all bets are off in RedHat kernels),
but with any luck, it might be separate from the keyboard/mouse HID-to-
input-layer module.
There is a way to do this with kernel modules, but RHEL kernels seem to
have the HID driver built in. So I would rather not go there; it would
probably also eliminate any chance to recreate the same race condition.
A less intrusive way might be to watch the /dev space for the node
corresponding to the HID interface, and wait a few seconds after that
appears before detaching.
Unfortunately, the names of those entries reflect the device number,
which is dynamically assigned, and they change at least with every
{re-}connection of the device. I think the only fixed way to identify
the device may be vendor:product, although there seems to be an internal
logical naming scheme that is static, which shows up as <x>-<y> in kernel
log entries.
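As a rough illustration of mapping vendor:product to that static name,
here is a minimal sketch. It assumes that the <x>-<y> name matches the
directory names under /sys/bus/usb/devices and that each directory
carries idVendor and idProduct files in lowercase hex; I have not
verified this on the CentOS kernel in question. The function name
find_usb_port_name and the USB_SYSFS variable are made up for this
example.

```shell
#!/bin/sh
# Assumption: USB devices are listed under this sysfs directory, one
# subdirectory per device, named after its bus/port position (e.g. 1-2).
USB_SYSFS="${USB_SYSFS:-/sys/bus/usb/devices}"

# find_usb_port_name VENDOR:PRODUCT
# Prints the sysfs directory name of the first device whose idVendor and
# idProduct match, and returns 0. Returns 1 if nothing matches.
# Both IDs are expected as 4-digit lowercase hex, as lsusb prints them.
find_usb_port_name() {
    want_vendor="${1%%:*}"
    want_product="${1##*:}"
    for dev in "$USB_SYSFS"/*; do
        # Skip entries (such as interfaces) that lack the ID files.
        [ -r "$dev/idVendor" ] && [ -r "$dev/idProduct" ] || continue
        vendor=$(cat "$dev/idVendor")
        product=$(cat "$dev/idProduct")
        if [ "$vendor" = "$want_vendor" ] && [ "$product" = "$want_product" ]; then
            basename "$dev"
            return 0
        fi
    done
    return 1
}
```

Calling find_usb_port_name <vendor-id>:<product-id> would then print
the name to look for, independent of the dynamically assigned device
number.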
For what it's worth, here is a sequence that I can put before the
upsdrvctl invocation in the startup script that seems to consistently
eliminate an oops:
    lsusb -d <vendor-id>:<product-id>
    sleep 5
    libhid-detach-device <vendor-id>:<product-id>
I have tried to play around with this sequence, but eliminating the
sleep will result in an oops out of libhid-detach-device, and using
libhid-detach-device as the first component may also lead to an oops.
I have no idea why any of this works or doesn't work, but it sure
seems to point at some sort of a race condition.
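For the startup script, the same sequence could be wrapped in a small
function, sketched below. This is only an illustration of the workaround
above, not a tested fix: the function name settle_and_detach and the
LSUSB_CMD and DETACH_CMD variables are invented for this example, and it
assumes that lsusb -d exits non-zero when no matching device is present.

```shell
#!/bin/sh
# Commands are held in variables so they can be replaced when testing
# without a UPS attached. These variable names are hypothetical.
LSUSB_CMD="${LSUSB_CMD:-lsusb}"
DETACH_CMD="${DETACH_CMD:-libhid-detach-device}"

# settle_and_detach VENDOR:PRODUCT [SECONDS]
# Queries the device with lsusb first, waits for the given number of
# seconds (default 5), and only then runs the detach command.
# Returns 1 without detaching if lsusb does not find the device.
settle_and_detach() {
    id="$1"
    delay="${2:-5}"
    # Assumption: a non-zero exit status here means "device not found".
    if ! "$LSUSB_CMD" -d "$id" >/dev/null 2>&1; then
        echo "USB device $id not found, not detaching" >&2
        return 1
    fi
    # The pause is the part that seemed to avoid the oops in my tests.
    sleep "$delay"
    "$DETACH_CMD" "$id"
}
```

The call settle_and_detach <vendor-id>:<product-id> would go right
before the upsdrvctl invocation. The 5 second default is simply the
value I used; I have no evidence that it is the minimum safe delay.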
Since this is CentOS-5.5 and a new version with a much newer kernel is
expected relatively soon, I will wait and hope for the problem to go
away with the new system.
Thanks for your help, AG
--
----------------------------------------------------------------------
Alfred Ganz alfred-ganz:at:agci.com
AG Consulting, Inc. (203) 624-9667
440 Prospect Street # 11
New Haven, CT 06511
----------------------------------------------------------------------