[Nut-upsdev] a nasty kernel oops

Tue Jan 25 02:22:38 UTC 2011

Charles,
	Subject: Re: [Nut-upsdev] a nasty kernel oops
	Date: Sat, 15 Jan 2011 10:27:26 -0500

	>* I am now able to get a crash on a virtual machine, so life has
	>  become a bit easier
	kernel oops, usbhid-ups crash, or simply a failure to launch usbhid-ups?
Kernel oops and failure to launch usbhid-ups, but *no* usbhid-ups
crash! However, it looks like I can now reproduce the oops quite
consistently with libhid-detach-device. Whoever does the USB stuff in the
group is off the hook!

	I don't want to downplay the significance of the problem on your end,  
	but it is really up to the kernel to protect itself from race  
	conditions and crashes caused by userspace applications accessing  
	devices. To that end, I agree that something should be done outside  
	usbhid-ups.
Yes, I agree. The more I play around, the more I see this as a kernel
problem. The fact that I can cause the oops with libhid-detach-device
seems to confirm this.
	....
Yes, I agree with your comments about starting NUT: if you can't rely
on the NUT + device combination, why use it?

	>Any advice on what might work would of course be much appreciated.

	One workaround would be to patch the kernel to blacklist the UPS from  
	the kernel raw HID driver. Of course, this doesn't play well with  
	prebuilt binary kernels.

	Along these lines, it should be possible to blacklist the kernel HID  
	module which has attached to the UPS. I haven't followed this portion  
	of the kernel much lately (and all bets are off in RedHat kernels),  
	but with any luck, it might be separate from the keyboard/mouse HID-to- 
	input-layer module.
There is a way to do this with kernel modules, but RHEL kernels seem to
have the HID driver built in. So I would rather not go there; it would
probably also eliminate any chance to recreate the same race condition.

	A less intrusive way might be to watch the /dev space for the node  
	corresponding to the HID interface, and wait a few seconds after that  
	appears before detaching.
Unfortunately, the names of those entries reflect the device number,
which is dynamically assigned, and they change at least with every
(re)connection of the device. I think the only fixed way to identify
the device may be vendor:product, although there seems to be an internal
logical naming scheme that is static, which shows up as <x>-<y> in kernel
log entries.
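
A minimal sketch of looking the device up by vendor:product instead of by
device number. It assumes the kernel exposes idVendor and idProduct files
under /sys/bus/usb/devices, and that the directory names there are the same
<x>-<y> names seen in the kernel log; neither has been checked on this
system. The function name find_usb_name and the optional third argument
are hypothetical, added only for illustration.

```shell
# Print the sysfs name (e.g. <x>-<y>) of the first USB device whose
# vendor and product IDs match. Returns non-zero if nothing matches.
# $1: vendor ID, $2: product ID, $3: sysfs directory to scan (assumed default)
find_usb_name() {
    vendor=$1
    product=$2
    root=${3:-/sys/bus/usb/devices}
    for dev in "$root"/*; do
        # Interface entries have no ID files; skip them.
        [ -r "$dev/idVendor" ] && [ -r "$dev/idProduct" ] || continue
        if [ "$(cat "$dev/idVendor")" = "$vendor" ] &&
           [ "$(cat "$dev/idProduct")" = "$product" ]; then
            basename "$dev"
            return 0
        fi
    done
    return 1
}
```

The IDs would be given as lowercase four-digit hex, the way lsusb prints
them, for example: find_usb_name <vendor-id> <product-id>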

For what it's worth, here is a sequence that I can put before the
upsdrvctl invocation in the startup script that seems to consistently
eliminate an oops:
	lsusb -d <vendor-id>:<product-id>
	sleep 5
	libhid-detach-device <vendor-id>:<product-id>
I have tried variations of this sequence: eliminating the sleep
results in an oops out of libhid-detach-device, and using
libhid-detach-device as the first component may also lead to an oops.
I have no idea why any of this works or doesn't work, but it sure
seems to point at some sort of a race condition.
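
The same sequence could be wrapped in a function for the startup script,
roughly as sketched here. The function name detach_ups, the optional delay
argument, and the DETACH_CMD override are hypothetical additions; the sketch
also assumes that lsusb -d exits non-zero when no matching device is
present. It is not a fix for the race, only the workaround above in a
reusable form.

```shell
# Probe the UPS with lsusb, wait, then detach the kernel HID driver.
# $1: vendor:product ID, $2: settle delay in seconds (default 5)
detach_ups() {
    id=$1
    delay=${2:-5}
    # DETACH_CMD is a hypothetical override, mainly useful for dry runs.
    detach=${DETACH_CMD:-libhid-detach-device}
    # Assumption: lsusb -d fails when the device is not on the bus.
    if ! lsusb -d "$id" >/dev/null; then
        echo "UPS $id not found, skipping detach" >&2
        return 1
    fi
    # Without this pause the detach was seen to trigger the oops.
    sleep "$delay"
    "$detach" "$id"
}
```

It would be called just before upsdrvctl, for example:
detach_ups <vendor-id>:<product-id>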

Since this is CentOS 5.5, and a new version with a much newer kernel is
expected relatively soon, I will wait and hope for the problem to go
away with the new system.

Thanks for your help, AG

-- 
 ----------------------------------------------------------------------
   Alfred Ganz					alfred-ganz:at:agci.com
   AG Consulting, Inc.				(203) 624-9667
   440 Prospect Street # 11
   New Haven, CT 06511
 ----------------------------------------------------------------------