[Nut-upsdev] some fixes, improvements, and new features (EPO and DYING) for NUT
Arnaud Quette
aquette.dev at gmail.com
Fri Mar 9 13:40:55 UTC 2012
Hi Greg,
first, thanks a lot for your many contribs.
it's always appreciated to get new people looking at NUT with fresh
eyes, and different needs / approach.
next, Charles has summed up the situation well: we're currently trying
to address issues in our process and infra, to get an improved
workflow.
that said, would you be motivated to join the team in the long run,
to participate in the many improvements that are underway, and to
solve the many issues and bottlenecks that have accumulated over the
years?
2012/3/9 Charles Lepple <clepple at gmail.com>:
> On Mar 8, 2012, at 6:21 PM, Greg A. Woods wrote:
>
>> Here are a series of my recent changes to NUT.
>>
>> The first few in the set are primarily little fixes and improvements.
>>
>> In among those are a few for .gitignore files which of course you can
>> ignore for SVN, and there's one for a commit to a generated file which
>> of course should not be tracked in any VCS.
>
> We are actually in the process of trying to move the NUT source code over to Git, but both conversions by git-svn and Eric S. Raymond's reposurgeon are not quite there yet. (We are leaning towards reposurgeon, which involves a little more tweaking of commits, but produces better results for a one-way SVN-to-Git conversion, including .gitignore files generated from svn:ignore properties.)
>
> That said, while we could easily apply these first few patches, I would like to preserve what is left of my sanity (we are still working through a horrible Git/SVN hybrid merge of the NSS SSL code), and defer applying them until we have a native Git tree. This will also prevent some from falling through the cracks.
seconded
>> Then there are a couple or three to do with generating the header files
>> used by nut-scanner. These probably could have been collapsed into one,
>> but I left them separate to show more clearly what some of the problems
>> are with the crazy attempts to use scripts to parse C code instead of
>> using the compiler. The final one in that group is a half-assed attempt
>> to generate one of the headers using a helper function directly from the
>> compiled data structures it is derived from, and thus totally
>> eliminating the need for the broken python script in the first place.
>> Even this though is wrong -- the code needing the data structures from
>> the driver should be linking directly with shared .o files to access it
>> instead of re-inventing new data structures and trying to populate them
>> from the existing data structures. The same thing should be done to
>> eliminate the horrid perl script in there too.
>
> I have vented about other issues related to nut-scanner in the past, but with the CI and source control stuff, I haven't had time to fix it personally. My vote would be for applying these, but I'll give the Eaton folks a chance to look at it first.
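
To illustrate the idea of generating a header from the compiled data structures rather than parsing C source with a script, here is a minimal sketch. Everything in it is hypothetical (the `demo_device` struct, the table contents, and the `demo_emit_header` helper are assumptions made for illustration, not the nut-scanner code or the patch under discussion).

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical device table entry; NUT's real structures differ. */
struct demo_device {
	unsigned int vendor_id;
	unsigned int product_id;
	const char *driver;
};

/* Hypothetical table, standing in for data compiled into a driver. */
static const struct demo_device demo_table[] = {
	{ 0x0001, 0x0010, "demo-driver-a" },
	{ 0x0002, 0x0020, "demo-driver-b" },
};

/*
 * Write a C header describing demo_table into 'buf'.
 * Returns the number of characters written, or -1 if 'buf' is too small.
 */
static int demo_emit_header(char *buf, size_t size)
{
	const size_t count = sizeof demo_table / sizeof demo_table[0];
	size_t used;
	size_t i;
	int n;

	n = snprintf(buf, size,
		"/* generated file, do not edit */\n"
		"static const struct demo_device demo_scan_table[] = {\n");
	if (n < 0 || (size_t)n >= size)
		return -1;
	used = (size_t)n;

	for (i = 0; i < count; i++) {
		n = snprintf(buf + used, size - used,
			"\t{ 0x%04x, 0x%04x, \"%s\" },\n",
			demo_table[i].vendor_id,
			demo_table[i].product_id,
			demo_table[i].driver);
		if (n < 0 || (size_t)n >= size - used)
			return -1;
		used += (size_t)n;
	}

	n = snprintf(buf + used, size - used, "};\n");
	if (n < 0 || (size_t)n >= size - used)
		return -1;
	used += (size_t)n;

	return (int)used;
}
```

Because the compiler has already parsed the table, the generator cannot disagree with the driver about what the table contains. Linking the consumer directly against the shared object files, as suggested above, would remove even this step.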
there are indeed lots of glitches around this, but we now have the
overall visibility we lacked in the past.
I've planned to address some of these when revamping the source tree
layout, and/or during the rewrite of some drivers (as for snmp-ups).
that said, having some more manpower would help ;-)
>> I then made some improvements to the SNMP driver to make it actually
>> work properly with my AP9605 SNMP card, and which should make it work
>> properly now with any SNMP agent implementing APC's POWERNET MIB.
>
> SNMP isn't my area, but sounds good.
a side note on SNMP: a rewrite of the engine has been scheduled for
years [1], to switch from this 1st generation engine to the 2nd one
(a la usbhid-ups), and possibly end up with a 3rd generation that
fixes the known issues.
>> I also discovered the blazer driver does work pretty well with my GE
>> Digital Energy GT Series UPS, at least with the 1000-3000 VA models.
>
> Trivial to apply.
>
>> I added some more info about APC cables that I'd been keeping track of
>> independently.
>
> Very useful, thanks.
>
>> I had independently made a similar change to the apcsmart driver to keep
>> it from failing when tcgetattr() reported some irrelevant differences in
>> the port settings. What's actually in the patch now is my merge of the
>> change from upstream which is basically just an "improved" log message.
>
> Agreed.
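
As a rough sketch of the tcgetattr() point, the comparison below checks only a chosen subset of the port settings instead of the whole structure. The function name and the choice of which fields count as relevant are assumptions for illustration; this is not the apcsmart code.

```c
#include <termios.h>

/*
 * Compare only the settings assumed to matter for a serial UPS link:
 * character size, stop bits, parity, receiver enable, modem control
 * handling, and line speed. Differences elsewhere (for example in
 * c_lflag) are ignored. Returns 1 on match, 0 otherwise.
 */
static int demo_port_settings_match(const struct termios *want,
	const struct termios *got)
{
	const tcflag_t cflag_mask =
		CSIZE | CSTOPB | PARENB | PARODD | CREAD | CLOCAL;

	if ((want->c_cflag & cflag_mask) != (got->c_cflag & cflag_mask))
		return 0;
	if (cfgetispeed(want) != cfgetispeed(got))
		return 0;
	if (cfgetospeed(want) != cfgetospeed(got))
		return 0;

	return 1;
}
```

A driver would typically call tcsetattr(), read the result back with tcgetattr(), and pass both structures to a check of this kind.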
>
>> I've also added some suggested coding improvements which I think will
>> make things easier to maintain down the line, notably using clear syntax
>> that's easy to modify safely for defining bit flag value macros, as well
>> as a strong suggestion to NEVER EVER use comment syntax to comment out
>> code blocks -- always use the pre-processor -- it's much safer!
>
> Agreed in principle, although I haven't looked to see if collapsing any of the unused bits will lead to binary incompatibility. Given how distributions tend to lag behind the latest code, we often suggest that people just drop in a replacement driver to test certain changes without disrupting the rest of the install. This could be completely unwarranted fears on my part, though.
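
A minimal sketch of the two coding suggestions above: shift-style bit flag definitions, and `#if 0` instead of comment syntax to disable a block. The `DEMO_FLAG_*` names and helper functions are hypothetical, not NUT's actual macros.

```c
/* Hypothetical flag names for illustration; not NUT's actual macros. */
#define DEMO_FLAG_RW      (1U << 0)	/* variable is writable */
#define DEMO_FLAG_STRING  (1U << 1)	/* variable holds a string */
#define DEMO_FLAG_STATIC  (1U << 2)	/* value is read only once */

/* Returns 1 if every bit in 'wanted' is set in 'flags', else 0. */
static int demo_has_flags(unsigned int flags, unsigned int wanted)
{
	return (flags & wanted) == wanted;
}

/* Returns the flags for a hypothetical writable string variable. */
static unsigned int demo_default_flags(void)
{
	unsigned int flags = DEMO_FLAG_RW | DEMO_FLAG_STRING;

#if 0	/* disabled with the pre-processor */
	flags |= DEMO_FLAG_STATIC;	/* this inner comment cannot end the block */
#endif

	return flags;
}
```

Writing each flag as a shift makes it obvious which bit it occupies, so adding a flag is a one-line change. Leaving existing bit positions alone when doing so keeps the values seen by a separately built drop-in driver unchanged.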
>
>> Finally I introduce the first of my new features: The "EPO" command.
>> This is very similar to "FSD", but fundamentally different in that it
>> goes a bit deeper into the infrastructure and it has a different purpose
>> and ultimate effect on the systems being managed. The basic idea is to
>> provide the moral equivalent, though not in quite such draconian and
>> dangerous hard-core way, of an Emergency Power Off (big red) switch.
>> The critical difference with FSD is that EPO is intended to require
>> manual human intervention to recover from, and that it is also intended
>> to completely and entirely remove power from everything if at all
>> possible, even if mains power is still fully and smoothly functioning.
>>
>> I'm really not sure if "FSD" has a true purpose other than as a test
>> command to see if everything will restart after mains power returns,
>> since of course FSD tries to simulate the effect of mains power
>> returning after a full shutdown has been committed to and is in
>> progress.
>
> Recently, we discussed adding the option for drivers to set FSD if an external shutdown signal has been applied (e.g. if NUT is not the master):
>
> http://article.gmane.org/gmane.comp.monitoring.nut.devel/5925
>
>> EPO on the other hand is a key requirement of my next feature: The
>> ability of a UPS driver to declare that the UPS is dying of some
>> critical condition and that it must be shut down in such a way that
>> manual human intervention is required to restart it. EPO is also
>> intended to be triggered automatically, whereas FSD (I think) is always
>> intended to be manually introduced by a human systems manager.
>>
>> I.e. in an ideal configuration everything should restart and reboot and
>> return to operational status after "upsmon -c fsd" once mains power
>> returns or if power was never actually off; whereas with "upsmon -c epo"
>> then everything should power down and stay off even if mains power
>> remains on and steady.
>
> This is an interesting distinction (one that a few drivers make in their different shutdown commands, but that is not currently tied to FSD).
>
> The reason why I advocated usurping the "FSD" status was because it is the only other status besides "OB LB" that is currently guaranteed to trigger a shutdown. I wonder if we could just use FSD with some other status option to indicate whether the driver should request a restart when the power returns.
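
The FSD/EPO distinction discussed here can be sketched as a small decision function over the status string. This is only an illustration under stated assumptions: "EPO" is the proposed status from this thread, the `demo_*` names are hypothetical, and none of this reflects how upsmon is actually implemented.

```c
#include <stddef.h>
#include <string.h>

enum demo_action {
	DEMO_KEEP_RUNNING,
	DEMO_SHUTDOWN_AND_RESTART,	/* come back when mains power returns */
	DEMO_SHUTDOWN_AND_STAY_OFF	/* needs a human to power back on */
};

/* Returns 1 if 'token' is a whole word in the space-separated 'status'. */
static int demo_has_token(const char *status, const char *token)
{
	const size_t len = strlen(token);
	const char *p = status;

	if (len == 0)
		return 0;

	while ((p = strstr(p, token)) != NULL) {
		int starts = (p == status) || (p[-1] == ' ');
		int ends = (p[len] == '\0') || (p[len] == ' ');

		if (starts && ends)
			return 1;
		p++;
	}
	return 0;
}

/* Map a status string to an action; EPO takes precedence over FSD. */
static enum demo_action demo_decide(const char *status)
{
	if (demo_has_token(status, "EPO"))
		return DEMO_SHUTDOWN_AND_STAY_OFF;
	if (demo_has_token(status, "FSD"))
		return DEMO_SHUTDOWN_AND_RESTART;
	if (demo_has_token(status, "OB") && demo_has_token(status, "LB"))
		return DEMO_SHUTDOWN_AND_RESTART;
	return DEMO_KEEP_RUNNING;
}
```

The alternative raised above, keeping FSD as the only trigger and adding a second status to say whether a restart is wanted, would change only the first branch of `demo_decide`.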
>
> I've CC'd Bill Elliot to get his thoughts on the use cases that led to suggesting the external shutdown trigger - it might dovetail with this.
>
>> For example I would have used "FSD" to shut down in power blackouts
>> where I knew the power could not return before the batteries ran low,
>> and thus I would have conserved battery charge for the inevitable short
>> hiccups that occur after a long blackout, but still been able to enjoy
>> automatic restart after the blackout in case power returns while I'm
>> sleeping, etc.
>>
>> Finally I add some features to the three drivers I was able to test
>> which make use of this new "DYING" state to power things down safely but
>> quickly when they detect operating temperatures above a configurable
>> maximum value. The one driver that already supported use of the ALARM
>> state also sets an alarm when the temperature rises above a configurable
>> warning value. The idea here is that if the HVAC fails in your computer
>> room then you can have everything automatically shut down _AND_ stop
>> pouring BTUs out into the room, and of course hopefully first raise an
>> alarm so that a human can try to intervene before an emergency power off
>> is actually necessary to prevent equipment damage.
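
The two-threshold behaviour described above can be sketched as follows. The names and the threshold values in the usage note are hypothetical; the actual drivers in the patch set may structure this differently.

```c
enum demo_temp_state {
	DEMO_TEMP_OK,
	DEMO_TEMP_ALARM,	/* above the warning value: raise an alarm */
	DEMO_TEMP_DYING		/* above the maximum value: report DYING */
};

/*
 * Classify a temperature reading against two configurable limits.
 * "Above" is taken as strictly greater than, which is an assumption.
 */
static enum demo_temp_state demo_classify_temp(double temp_c,
	double warn_c, double max_c)
{
	if (temp_c > max_c)
		return DEMO_TEMP_DYING;
	if (temp_c > warn_c)
		return DEMO_TEMP_ALARM;
	return DEMO_TEMP_OK;
}
```

With a warning value of 40 and a maximum of 45 (example figures only), a reading of 42 would raise the alarm and a reading of 46 would report DYING.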
>>
>> Indeed the motivation behind these new features is because HVAC fails
>> far more frequently in my client's server room, and with far more dire
>> consequences, than the power fails. Indeed they have only one tiny UPS
>> that can run only the most critical core equipment, but everything has
>> come near to suffering serious physical damage when ambient temperatures
>> have shot up above 45C in an extremely short time after HVAC failure,
>> which of course is usually on a Saturday night.
>>
>> These changes are a work in progress to some extent -- I still have not
>> fully tested the EPO of a running network, but I hope to do that very
>> soon. The drivers do report alarms (where implemented) and they report
>> the "DYING" status when their temperature sensors report above-maximum
>> values.
>
>
> It's definitely a feature I would like to see merged at some point. Now that you mention this, I think there are several UPS protocols which support a bitmask for alarm conditions which will trigger a shutdown (including overtemp). We will want to make sure that the procedure for setting that event mask is not terribly different depending on whether the shutdown is triggered by the UPS hardware, or by NUT monitoring other UPS status (as I believe you are proposing with the DYING status).
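
One way to picture the event mask mentioned here: the same mask decides which active alarms trigger a shutdown, whether the check runs in the UPS hardware or in NUT. The alarm names and bit layout below are hypothetical and do not correspond to any particular UPS protocol.

```c
/* Hypothetical alarm bits; real protocols define their own layout. */
#define DEMO_ALARM_OVERTEMP       (1U << 0)
#define DEMO_ALARM_OVERLOAD       (1U << 1)
#define DEMO_ALARM_BATTERY_FAULT  (1U << 2)

/*
 * Returns 1 if any active alarm is selected by the shutdown mask,
 * 0 otherwise. Alarms outside the mask are reported but do not
 * trigger a shutdown.
 */
static int demo_should_shut_down(unsigned int active_alarms,
	unsigned int shutdown_mask)
{
	return (active_alarms & shutdown_mask) != 0;
}
```

If both the hardware path and the NUT path were configured from one mask value of this kind, the procedure for setting it would look the same to the administrator in either case.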
>
> I admit I haven't had time to read all of the patches that implement this, though, so please correct me if I am making any incorrect assumptions.
there is definitely a need to complete the status notion, and related
behavior, as well as providing means to easily configure extended
behaviors.
I.e. configuration that allows considering things other than the
simple ones provided, and triggering actions for outlets, temperature,
remaining runtime, ...
NUT has grown a lot over the recent years, and some issues have not
received enough care.
Just give us some more time to digest the numerous things underway,
and possibly consider working from the inside for better results ;)
cheers,
Arnaud
--
[1] https://alioth.debian.org/pm/task.php?func=detailtask&project_task_id=198&group_id=30602&group_project_id=31
--
Linux / Unix Expert R&D - Eaton - http://powerquality.eaton.com
Network UPS Tools (NUT) Project Leader - http://www.networkupstools.org/
Debian Developer - http://www.debian.org
Free Software Developer - http://arnaud.quette.free.fr/