[Nut-upsdev] Bug#441342: Nut can kill power to UPSs that never went on battery

Robert Woodcock rcw at debian.org
Fri Sep 14 23:21:14 UTC 2007


On Fri, Sep 14, 2007 at 05:41:58PM +0200, Arjen de Korte wrote:
> From this I conclude that you must be using the usbhid-ups driver, right?

Yes.

> You may not be aware of it, but this is where the root of the problem
> lies. Normally, when a UPS is told to shutdown, two things can happen:
> 
> 1) The input power is gone and the UPS powers off and switches back on
> when the power returns.
> 
> 2) Input power is still available and the UPS cycles the power so that the
> systems that receive power from it, can restart.

In this case, strictly speaking, neither of those two conditions was the
cause.

> Apparently, #2 is not happening for you and from looking at the driver, I
> can understand why. The shutdown sequence in this driver is not doing what
> it is supposed to do. This needs fixing, but since I don't have an APC
> UPS, I can't do that for you. We'll have to track down the developer that
> wrote this driver, to correct this.

I have pulled the UPS in question out of production and I should have time
next week to attach a spare PC to it and get nut talking to it. For
starters, I'll want to look into why the Smart-UPSs aren't starting back up
a few seconds after a forced shutdown.

> There is definitely something broken here too. Under no circumstance should
> a driver indicate both 'on battery' and 'low battery' if the power is not
> actually out.

I don't know for certain if this is applicable to the APC Smart-UPS 1500,
but I think there is a hardware limitation with many UPSs in that they are
not able to communicate whether an on-battery condition is due to a
self-test or a mains failure.
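
To make that ambiguity concrete, here is a rough sketch of the
'on battery' plus 'low battery' check discussed above. OL, OB and LB are the
status tokens NUT drivers report; the function and variable names are
hypothetical and this is not the actual upsmon code.

```python
def parse_status(ups_status: str) -> set:
    """Split a space-separated status string such as "OB LB" into flags."""
    return set(ups_status.split())


def is_critical(ups_status: str) -> bool:
    """Treat a UPS as critical when it is on battery AND low on battery."""
    flags = parse_status(ups_status)
    return "OB" in flags and "LB" in flags


# Assumption for illustration: a self-test on a weak battery and a real
# mains failure both surface as the same status string, so the monitor
# has nothing to tell them apart with.
mains_failure = "OB LB"
self_test_weak_battery = "OB LB"

print(is_critical(mains_failure))           # True
print(is_critical(self_test_weak_battery))  # True
print(is_critical("OL"))                    # False
```

If the hardware reports nothing beyond these flags, no amount of logic on
the monitoring side can separate the two cases.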

> Interesting question is why the UPS under test is indicating 'low battery'
> as soon as you start a test. This could be justified (but in that case,
> I would expect a 'replace battery' warning as well)

The UPS in question is out of warranty; we replaced its batteries 3 months
ago but it's still a little messed in the head.

> or might be happening because the subdriver is misinterpreting a value
> read from the UPS. Running the driver in debug mode (to see what values
> it reads from the UPS) could help fixing this.

I'll do this as soon as I get a test system set up.

> >> Nut would then immediately shut down the Linux system, and once
> >> that was done, it would proceed to force a shutdown of all the other
> >> UPSs.
> 
> This is by design. If the power is really out, the systems that are
> connected to the NUT server *must* be shutdown, if the NUT server must go
> down. Otherwise, they would not be able to be shutdown if the UPS that
> feeds them is low on battery.

While I see where you are coming from, I maintain that this isn't proper
behavior. If the power never went out to the UPS in question, then it's
safer to assume that power will not go out to that UPS (and risk an unclean
shutdown of those hosts) than to assume that that UPS must be shut down (and
guarantee the unavailability of those hosts). Every power-related cascading
failure I have witnessed has run its course in seconds, not hours. 
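
The difference between the two policies can be sketched roughly as follows.
This is only an illustration of the argument: the class, the function names
and the UPS names are all made up, and neither function is taken from
upsmon.

```python
from dataclasses import dataclass


@dataclass
class Ups:
    name: str
    status: set                 # status flags, e.g. {"OL"} or {"OB", "LB"}
    feeds_master: bool = False  # True for the UPS powering the NUT server


def master_is_critical(upses):
    """The server must go down: its own UPS is on battery and low."""
    return any(u.feeds_master and {"OB", "LB"} <= u.status for u in upses)


def fsd_targets_current(upses):
    """Behaviour described in this thread: force down every monitored UPS."""
    return [u.name for u in upses] if master_is_critical(upses) else []


def fsd_targets_proposed(upses):
    """Alternative argued for here: leave alone any UPS still on mains."""
    if not master_is_critical(upses):
        return []
    return [u.name for u in upses if u.feeds_master or "OB" in u.status]


upses = [
    Ups("ups-a", {"OB", "LB"}, feeds_master=True),
    Ups("ups-b", {"OB"}),
    Ups("ups-c", {"OL"}),  # never went on battery
]
print(fsd_targets_current(upses))   # ['ups-a', 'ups-b', 'ups-c']
print(fsd_targets_proposed(upses))  # ['ups-a', 'ups-b']
```

Under the second policy the hosts on ups-c keep running and accept the risk
of an unclean shutdown if their power does fail later.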

It's much easier for a system administrator to explain to their manager
that systems X, Y, and Z were shut down uncleanly because the UPS monitoring
system did not have power at the time that UPSs A and B lost power, than to
try to explain why systems X, Y, and Z shut down because power was lost to
UPS C (which does not feed systems X, Y, and Z). As a systems administrator,
I did not even attempt to say to my boss, "Great news! It's working as
designed!" - I had to say "sorry about that - I know why it happened and it
won't happen again."

If a nut server simply isn't running, then client hosts won't refuse to
mount their disks read-write on startup. I don't see any reason to be so
paranoid in one situation but not another.

> In a single NUT server, multiple UPS system, it is impossible to deal
> with situations where some of the UPS'es monitored receive power
> from the mains and the one powering the NUT server does not.

I agree, it is impossible to do perfectly. There are pitfalls to both
approaches. I believe the current approach violates the principle of least
surprise.

> So if the UPS powering the NUT server is critical, all the UPS'es we're
> monitoring should be at the very least on battery as well.

Yes, but if they are not in fact on battery, it isn't reasonable to assume
that they will be on battery eventually. For example, it's very possible to
lose just one phase line of commercial 3-phase power, which will leave 1/3
of a building's circuits running with full power. Or, a breaker feeding just
one of the UPSs may trip.

> Not shutting down all clients connected to the NUT server when that has
> to go down (i.e., the power is out and the batteries of the UPS feeding it
> are low), would be a sure fire way to crash them.

I would rather take the slight risk of data corruption than shut down
systems when there is no evidence that a shutdown is needed. In a perfect
world, all services would automatically come back up with full functionality
after a mass reboot. In the real world, I have managed several server rooms,
each with a few dozen systems, and despite our best efforts, none have had a
startup procedure less than 10 bullet points long.

> >> Or, at the very least, document the hidden assumption that none of your
> >> monitored UPS's runtimes will exceed that of your master server's UPS
> >> runtime in particular.
> 
> This is documented in the 'upsmon' man page for instance:
> 
> > FORCED SHUTDOWNS
> >
> > When upsmon is forced to bring down the local system, it sets the
> > "FSD" (forced shutdown) flag on any UPSes that it is running in master
> > mode.  This is used to synchronize slaves in the event that a master UPS
> > that is otherwise OK needs to be brought down due to some pressing event
> > on the master.

This is the most euphemistic paragraph I have read in months. Consider
replacing the word 'synchronize'.
-- 
Robert Woodcock - rcw at debian.org
"When failure is not an option, success can get expensive."
	-- Peter Stibrany


