[Nut-upsdev] some fixes, improvements, and new features (EPO and DYING) for NUT

Greg A. Woods woods at planix.com
Thu Mar 8 23:21:11 UTC 2012


Here is a series of my recent changes to NUT.

The first few in the set are primarily little fixes and improvements.

In among those are a few for .gitignore files which of course you can
ignore for SVN, and there's one for a commit to a generated file which
of course should not be tracked in any VCS.

Then there are a couple or three to do with generating the header files
used by nut-scanner.  These probably could have been collapsed into one,
but I left them separate to show more clearly what some of the problems
are with the crazy attempts to use scripts to parse C code instead of
using the compiler.  The final one in that group is a half-assed attempt
to generate one of the headers using a helper function directly from the
compiled data structures it is derived from, and thus totally
eliminating the need for the broken python script in the first place.
Even this, though, is wrong -- the code needing the data structures from
the driver should be linking directly with shared .o files to access them
instead of inventing new data structures and trying to populate them
from the existing ones.  The same thing should be done to eliminate the
horrid perl script in there too.

I then made some improvements to the SNMP driver to make it actually
work properly with my AP9605 SNMP card; these should also make it work
properly now with any SNMP agent implementing APC's POWERNET MIB.

I also discovered the blazer driver does work pretty well with my GE
Digital Energy GT Series UPS, at least with the 1000-3000 VA models.

I added some more info about APC cables that I'd been keeping track of
independently.

I had independently made a change to the apcsmart driver, similar to one
made upstream, to keep it from failing when tcgetattr() reported some
irrelevant differences in the port settings.  What's actually in the
patch now is my merge of the change from upstream, which is basically
just an "improved" log message.
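
To illustrate the general idea (this is not the code from the patch), a
check of this kind can compare only the settings that matter rather than
the whole termios structure.  The function name, the mask, and the
choice of which bits count as relevant are all assumptions made for this
sketch:

```c
#include <termios.h>

/*
 * Hypothetical mask: the control-mode bits assumed to matter for a plain
 * serial link.  Anything outside it (HUPCL, for instance) is treated as
 * an irrelevant difference.
 */
#define EX_CFLAG_MASK	(CSIZE | CSTOPB | PARENB | PARODD | CREAD | CLOCAL)

/*
 * Return non-zero if 'got' (as read back by tcgetattr()) matches 'want'
 * in line speed and in the masked control-mode bits.
 */
static int
ex_termios_equivalent(const struct termios *want, const struct termios *got)
{
	if (cfgetispeed(want) != cfgetispeed(got))
		return 0;
	if (cfgetospeed(want) != cfgetospeed(got))
		return 0;
	return (want->c_cflag & EX_CFLAG_MASK) == (got->c_cflag & EX_CFLAG_MASK);
}
```

A driver would typically call something like this after tcsetattr() and
tcgetattr(), and complain only when it returns zero.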

I've also added some suggested coding improvements which I think will
make things easier to maintain down the line, notably using clear syntax
that's easy to modify safely for defining bit flag value macros, as well
as a strong suggestion to NEVER EVER use comment syntax to comment out
code blocks -- always use the pre-processor -- it's much safer!
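
A minimal sketch of what these two suggestions might look like in
practice.  The flag and function names are hypothetical and are not
taken from the NUT sources:

```c
/*
 * Hypothetical flag names, for illustration only.  Writing each value as
 * a shift makes the bit position obvious, and adding a new flag is a
 * one-line change that needs no hex arithmetic.
 */
#define EX_FLAG_ONLINE	(1U << 0)
#define EX_FLAG_ONBATT	(1U << 1)
#define EX_FLAG_LOWBATT	(1U << 2)
#define EX_FLAG_ALARM	(1U << 3)

/* Return non-zero if every bit in 'mask' is set in 'flags'. */
static int
ex_flags_all_set(unsigned int flags, unsigned int mask)
{
	return (flags & mask) == mask;
}

static unsigned int
ex_default_flags(void)
{
	unsigned int flags = EX_FLAG_ONLINE;

#if 0	/* disabled with the pre-processor, not with comment syntax */
	flags |= EX_FLAG_ALARM;	/* this inner comment would have ended an
				 * enclosing comment block early */
#endif
	return flags;
}
```

Because C comments do not nest, wrapping the disabled statement in
comment syntax would end at the first inner comment terminator and leave
the rest of the block live or broken; the #if 0 form has no such problem.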

Finally I introduce the first of my new features:  The "EPO" command.
This is very similar to "FSD", but fundamentally different in that it
goes a bit deeper into the infrastructure and it has a different purpose
and ultimate effect on the systems being managed.  The basic idea is to
provide the moral equivalent, though not in quite such a draconian and
dangerous hard-core way, of an Emergency Power Off (big red) switch.
The critical difference from FSD is that EPO is intended to require
manual human intervention to recover from, and that it is also intended
to completely and entirely remove power from everything if at all
possible, even if mains power is still fully and smoothly functioning.

I'm really not sure if "FSD" has a true purpose other than as a test
command to see if everything will restart after mains power returns,
since of course FSD tries to simulate the effect of mains power
returning after a full shutdown has been committed to and is in
progress.

EPO on the other hand is a key requirement of my next feature:  The
ability of a UPS driver to declare that the UPS is dying of some
critical condition and that it must be shut down in such a way that
manual human intervention is required to restart it.  EPO is also
intended to be triggered automatically, whereas FSD (I think) is always
intended to be manually introduced by a human systems manager.

I.e. in an ideal configuration everything should restart, reboot, and
return to operational status after "upsmon -c fsd" once mains power
returns or if power was never actually off; whereas after "upsmon -c epo"
everything should power down and stay off even if mains power remains on
and steady.

For example, I would have used "FSD" to shut down in power blackouts
where I knew the power could not return before the batteries ran low,
and thus I would have conserved battery charge for the inevitable short
hiccups that occur after a long blackout, but still been able to enjoy
automatic restart after the blackout in case power returned while I was
sleeping, etc.

Finally I add some features to the three drivers I was able to test
which make use of this new "DYING" state to power things down safely but
quickly when they detect operating temperatures above a configurable
maximum value.  The one driver that already supported use of the ALARM
state also sets an alarm when the temperature rises above a configurable
warning value.  The idea here is that if the HVAC fails in your computer
room then you can have everything automatically shut down _AND_ stop
pouring BTUs out into the room, and of course hopefully first raise an
alarm so that a human can try to intervene before an emergency power off
is actually necessary to prevent equipment damage.
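
The two-threshold logic described above can be sketched roughly as
follows.  This is an illustration only, not the code in the patches: the
type and function names are hypothetical, and the assumption that both
limits are plain Celsius values compared with a strict "greater than" is
mine:

```c
/* Hypothetical result states for a temperature check. */
enum ex_temp_state {
	EX_TEMP_OK,	/* at or below the warning limit */
	EX_TEMP_ALARM,	/* above the warning limit: raise an alarm */
	EX_TEMP_DYING	/* above the maximum: declare the UPS DYING */
};

/* Hypothetical holder for the two configurable limits (degrees C). */
struct ex_temp_limits {
	double warning;
	double maximum;
};

/*
 * Classify a temperature reading against the configured limits.  The
 * maximum is checked first so that a reading above both limits is
 * reported as DYING rather than as a mere alarm.
 */
static enum ex_temp_state
ex_classify_temperature(double temp, const struct ex_temp_limits *lim)
{
	if (temp > lim->maximum)
		return EX_TEMP_DYING;
	if (temp > lim->warning)
		return EX_TEMP_ALARM;
	return EX_TEMP_OK;
}
```

A driver without ALARM support would simply ignore the middle state and
act only on the maximum.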

The motivation behind these new features is that HVAC fails far more
frequently in my client's server room, and with far more dire
consequences, than the power does.  Indeed they have only one tiny UPS
that can run only the most critical core equipment, but everything has
come near to suffering serious physical damage when ambient temperatures
have shot up above 45C in an extremely short time after HVAC failure,
which of course is usually on a Saturday night.

These changes are a work in progress to some extent -- I still have not
fully tested the EPO of a running network, but I hope to do that very
soon.  The drivers do report alarms (where implemented) and they report
the "DYING" status when their temperature sensors report above-maximum
values.




