[Nut-upsuser] usbhid-ups causes hang/crash in ohci driver
clepple at gmail.com
Sat Nov 22 15:17:17 UTC 2014
On Nov 21, 2014, at 7:53 PM, Kristian Rasmussen <kristian_rasmussen at fastmail.co.uk> wrote:
> When this happens, reloading the usbhid-ups driver and even unplugging
> and re-inserting the USB cable does not fix the problem, and lsusb does
> not list the UPS at all. On one system, an x86_64 server running kernel
> 3.17.3, the following can be seen in the log:
> Nov 21 23:24:07 test-svr1 kernel: [111233.920039] ohci-pci 0000:00:02.0:
> frame counter not updating; disabled
> Nov 21 23:24:07 test-svr1 kernel: [111233.920047] ohci-pci 0000:00:02.0:
> HC died; cleaning up
> Unloading and re-inserting the ohci kernel modules (ohci_hcd and
> ohci_pci) does temporarily resolve the issue on this host, but after a
> few hours the problem appears again.
The "frame counter not updating" message seems like it could indicate an IRQ issue. While these were much more common before the advent of PCI, there is still potential for mismatches between the hardware routing of the PCI interrupt lines, and the interrupt tables that get passed in to the kernel from the BIOS/EFI.
Alternatively, you might have load-related problems where another hardware device is not properly sharing the interrupt with the OHCI controller. If I remember correctly, OHCI can only handle two or four physical USB ports per PCI device (or per function? it's been a while), so if you check /proc/interrupts and find another group of ports on a different OHCI device and IRQ, you might have better luck plugging the UPS in there.
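If reading /proc/interrupts by eye is tedious, something like the following could pull out the relevant rows. This is only a rough sketch: the function name usb_irqs is made up for illustration, and it assumes the usual Linux layout in which the handler names (e.g. "ohci_hcd:usb2") form a comma-separated list at the end of each row.

```python
import os
import re

# Matches the host controller driver names ohci_hcd, ehci_hcd, uhci_hcd, xhci_hcd.
USB_HCD = re.compile(r"[oeux]hci_hcd")

def usb_irqs(text):
    """Map IRQ label -> list of handler names, for rows naming a USB host controller.

    Hypothetical helper; assumes handler names contain no spaces and are
    separated by commas at the end of each /proc/interrupts row.
    """
    result = {}
    for line in text.splitlines():
        irq, sep, rest = line.partition(":")
        if not sep or not USB_HCD.search(rest):
            continue  # header row, or a row with no USB host controller
        parts = rest.split(",")
        # The first chunk still carries the counters and chip name;
        # the handler name is its last whitespace-separated token.
        names = [parts[0].split()[-1]] + [p.strip() for p in parts[1:]]
        result[irq.strip()] = names
    return result

if __name__ == "__main__" and os.path.exists("/proc/interrupts"):
    with open("/proc/interrupts") as f:
        for irq, names in usb_irqs(f.read()).items():
            note = "shared" if len(names) > 1 else "not shared"
            print(f"IRQ {irq}: {', '.join(names)} ({note})")
```

A row that lists ohci_hcd next to some other driver means the two share an IRQ, which is the case worth looking at first.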
Are these ports also USB2 or USB3? Does the addition of a USB 2.0 hub change the symptoms?
Also, are you using non-default values for pollinterval or pollfreq?
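For what it's worth, a quick way to check is to scan ups.conf for those two options. The sketch below is hypothetical (the helper name poll_settings is mine) and assumes the plain "key = value" layout with "[section]" headers and "#" comments; the location of ups.conf depends on the prefix you built NUT with. If I recall the man pages correctly, the defaults are 2 seconds for pollinterval and 30 seconds for pollfreq, so anything this finds is worth comparing against those.

```python
def poll_settings(conf_text):
    """Map section name -> {option: value} for pollinterval/pollfreq entries.

    Hypothetical helper; options set before any [section] header are
    reported under the label "(global)".
    """
    found = {}
    section = "(global)"
    for raw in conf_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
        elif "=" in line:
            key, value = (s.strip() for s in line.split("=", 1))
            if key in ("pollinterval", "pollfreq"):
                found.setdefault(section, {})[key] = value.strip('"')
    return found
```

Calling poll_settings(open(path).read()) with the path to your ups.conf returns an empty dict if neither option is set anywhere.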
I have no idea if this matters, but are you using libusb-0.1, or libusb-1.0.x with libusb-compat?
> On the other affected system, a 32-bit VIA router running kernel 3.12.6,
> the issue causes a kernel panic and nothing regarding the kernel USB
> drivers is logged. I haven't yet had the opportunity to set up serial
> console logging to conclusively verify that the panic occurs in the ohci
> module, but it does seem likely; if I catch the problem in time, after
> the warning about "stale data" from upsd but before the actual panic
> occurs, I can unload and reload the ohci drivers and prevent the crash.
Same comments apply, but without the exact crash info, I'd say it's a little early to draw conclusions.
> Both systems are running nut-2.7.1 compiled from source. The UPS units
> involved are not identical; one is an older MGE Pulsar while the other
> is a newer Eaton, but both use the same USB identifier (0463:ffff, "MGE
> UPS Systems UPS"). The USB chipsets involved are quite dissimilar
> (Nvidia vs. VIA).
Given that the OHCI drivers are typically the same for the different chipsets, it could well be OHCI-related. I am also not aware of any changes in behavior for the chip(s) behind the 0463:ffff ID between MGE and Eaton, although that ID is used on a wide array of their USB devices.
> Is this likely to be a bug in nut, or has the nut usbhid-ups driver
> perhaps triggered an underlying kernel bug in the USB driver subsystem?
> Anything I can do to narrow down what's causing this?
I think it's the latter. While we have had our share of USB-related bugs in the NUT userspace code, they typically don't cause kernel crashes across the board. (I know it doesn't help much in your case, but we have many counterexamples of kernel/hardware/NUT versions that do work long-term; maybe we should be recording those to find patterns.) The usual NUT USB problem manifests itself as an inscrutable errno value from the kernel, which doesn't bring the rest of the system down with it.
Unfortunately, we don't often hear back when we recommend that users take the issue up with the Linux USB lists, so I don't know if these bugs are getting fixed, or if people are just moving to other hardware (UPS or motherboard).
The bottom line is that we are happy to try and work with you to debug this, but so far there are more questions than answers in this particular problem space.
clepple at gmail