[Nut-upsuser] battery not installed, but battery still 100% and NUT 2.7.2-4 does not catch this and report a error
clepple at gmail.com
Tue Apr 11 01:36:43 UTC 2017
On Apr 10, 2017, at 10:42 AM, Jon Bendtsen wrote:
> Actually maybe it is within NUT's control. Maybe NUT should only claim that a UPS is ONLINE if ONLINE is the only thing it is?
The problem is that a lot of the UPS status values are more of a de-facto standard, but they have been generally defined in such a way that simpler UPSes only need to report basic status. We don't have an equivalent of "no ALARM", just the absence of "ALARM" in the status line.
Also, I think you are reading more into the OL status than is intended. OL simply means that AC power is available and passing through the UPS (contrast with OB and OFF). Again, because of the de-facto nature of this, we would have to consult the ups.type value to accurately reflect whether the UPS is truly an online (double-conversion) system, or "offline" with the relay feeding power directly from line to load (which is different from an online UPS in bypass). Unfortunately, ups.type is marked as opaque, and is not available everywhere.
The short answer here is that if a monitoring system wants to represent the overall health of the system, ALARM needs to be taken into account. (Never mind the fact that a basic Back-UPS LS 500 uses a more common HID PDC Usage that maps to RB when its battery test fails...) I think we have established that the monitoring in upsmon was not sufficient, but by extension, that means the Nagios plugin probably needs a change to expose the ALARM bit and message.
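As a rough illustration of what "taking ALARM into account" could look like in a monitoring check, here is a minimal Python sketch. It assumes only that ups.status is a space-separated list of tokens such as OL, OB, OFF, RB and ALARM. The function name and the Nagios-style severity mapping are hypothetical choices for this example, not what upsmon or any existing plugin does.

```python
# Hypothetical sketch: derive an overall health level from a ups.status line.
# The severity assigned to each token is an assumption, not NUT behavior.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

def classify_status(status_line):
    """Map a space-separated ups.status string to a Nagios-style code."""
    tokens = set(status_line.split())
    if not tokens:
        return UNKNOWN
    # No AC power passing through the UPS: treat as the most severe case.
    if "OB" in tokens or "OFF" in tokens:
        return CRITICAL
    # OL alone is not enough: ALARM or RB means something needs attention.
    if "ALARM" in tokens or "RB" in tokens:
        return WARNING
    if "OL" in tokens:
        return OK
    return UNKNOWN
```

With this mapping, "OL" alone is reported as OK, while "OL ALARM" is reported as a warning even though line power is present.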
Not sure if this got answered already, but is the "No battery installed" alarm accurate, or is it just an old battery? If old, does the battery.runtime value get adjusted downwards after a battery test? Either way, we would need to establish which reading should take priority, and I don't think this is straightforward.
I almost think we need another layer of logic to handle priorities like this, as well as scale values. It irks me that we add scale values to the driver without knowing the extent of the error (is it only for one firmware revision? for a whole line?). Such a layer would offer some hope of being able to silence false alarms (I vaguely remember some "life cycle alarm" in one UPS that contradicted another, more direct, status bit). But this is the sort of thing that should be designed, rather than slapped together, and it shouldn't get in the way of a UPS that behaves predictably. And I think it should be a separate layer so that we can always go directly to the driver to see the raw values that the UPS is returning.
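To make the idea of a separate layer a bit more concrete, here is a small Python sketch, not a design. It assumes the driver's variables arrive as a plain dictionary of strings. The rule helper, the firmware string "X1" and the scale factor are all invented for illustration and do not describe any real UPS.

```python
# Hypothetical rule layer sitting between the driver and the monitor.
# Firmware "X1" and the factor 10.0 below are made-up example values.

def scale_rule(var, factor, firmware):
    """Build a rule that scales one variable for one firmware revision only."""
    def rule(raw):
        if raw.get("ups.firmware") == firmware and var in raw:
            return {var: str(float(raw[var]) * factor)}
        return {}  # rule does not apply; change nothing
    return rule

def apply_rules(raw, rules):
    """Return an adjusted copy; the raw driver values are never modified."""
    cooked = dict(raw)
    for rule in rules:
        # Each rule only ever looks at the raw values, not at other rules' output.
        cooked.update(rule(raw))
    return cooked

rules = [scale_rule("battery.voltage", 10.0, "X1")]
raw = {"ups.firmware": "X1", "battery.voltage": "1.5", "ups.status": "OL"}
cooked = apply_rules(raw, rules)
```

Because the rules produce a copy, the raw dictionary is still available for anyone who wants to see exactly what the UPS returned, and a UPS that matches no rule passes through unchanged.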