Hi<br><br><div class="gmail_quote">2012/1/11 William Seligman <span dir="ltr"><<a href="mailto:seligman@nevis.columbia.edu">seligman@nevis.columbia.edu</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
The problem is solved, but first things first:<br>
<div><div class="h5"><br>
On 1/11/12 6:43 AM, Arnaud Quette wrote:<br>
> 2012/1/9 William Seligman <<a href="mailto:seligman@nevis.columbia.edu">seligman@nevis.columbia.edu</a>><br>
><br>
>> On 1/9/12 9:53 AM, Arnaud Quette wrote:<br>
>><br>
>>> 2012/1/6 William Seligman <<a href="mailto:seligman@nevis.columbia.edu">seligman@nevis.columbia.edu</a>><br>
>>><br>
>>>> I've googled and RTFM'ed, but still can't solve this one. I hope you<br>
>>>> folks can.<br>
>>>><br>
>>>> This affects my entire computer cluster, but let's start simple: I've<br>
>>>> got a computer running NUT; OS is Scientific Linux 5.5; kernel<br>
>>>> 2.6.18-274.12.1.el5xen. It connects to an APC SMART-UPS via an APC<br>
>>>> SmartCard using the snmp-ups driver. It generally works: upsmon will<br>
>>>> detect if the battery is low (I get an e-mail message); I can control<br>
>>>> the UPS, inspect its variables, set variables, issue commands, and so<br>
>>>> on.<br>
>>><br>
>>> If "On battery" and "Low battery" are both detected, there should be no<br>
>>> issue.<br>
>>><br>
>>>> There's just one thing that does not happen: when the UPS goes critical,<br>
>>>> the computer does not shut down. The upsmon daemon does not display any<br>
>>>> messages, does not write to the syslog, does not send e-mail, etc.; even<br>
>>>> though I've configured it to do so in upsmon.conf.<br>
>>>><br>
>>>> I've tried nut-2.2.2, nut-2.4.3, and nut-2.6.2, and the symptom is the<br>
>>>> same.<br>
>>><br>
>>> Using the latest version, when possible, is always a good idea.<br>
>><br>
>> Installing nut-2.6.2 on a Scientific Linux 5.5 system was a bit difficult,<br>
>> and played havoc with my regular yum updates. After I've finished<br>
>> debugging this problem, I'm going to completely reinstall the OS to make<br>
>> sure I've got a consistent set of RPMs.<br>
><br>
> you might have preferred to rebuild an SRPM like this one:<br>
> <a href="http://zid-luxinst.uibk.ac.at/linux/rpm2html/fedora/14/i386/updates/nut-2.6.2-1.fc14.i686.html" target="_blank">http://zid-luxinst.uibk.ac.at/linux/rpm2html/fedora/14/i386/updates/nut-2.6.2-1.fc14.i686.html</a><br>
<br>
</div></div>That's what I did, at first. The rebuild process for that RPM involves "-devel"<br>
libraries that are not part of an RHEL5-style distribution. So I tried to<br>
download and compile the SRPMs for those libraries (neon-devel, powerman-devel,<br>
net-snmp-devel, etc.). This led to a chain of installs and the usual RPM hell; I<br>
had not appreciated how different RHEL6+ was from RHEL5.<br>
<br>
Even with all the dependent libraries installed, the nut-2.6.2 SRPM would still<br>
not rebuild; even though the neon and neon-devel libraries were present, the<br>
configure script couldn't find them and so the rebuild failed.<br>
<br>
Finally, I did what I should have done from the start: I just used the<br>
nut-2.6.2.tar.gz file and built it manually. The configure script still couldn't<br>
find the neon libraries, but I didn't need that functionality for my tests, and<br>
this did not block the compilation. The only problem was getting the various<br>
directory options set so files/binaries would be installed in the same<br>
directories as in a Red Hat distribution. Even then, I had to move binaries<br>
around post-install.<br>
<br>
And after all that work, it still didn't solve the problem. Read on...<br>
<div><div class="h5"><br>
>>>> I tried issuing a "graceful reboot" command via the APC SmartCard's web<br>
>>>> and telnet interface. It made no difference; the system still did not<br>
>>>> shut down.<br>
>>>><br>
>>>> Now let's extend the problem to my cluster: I have a variety of<br>
>>>> different computers, all running Scientific Linux 5.5, connecting<br>
>>>> through different switches, connecting to different flavors of APC<br>
>>>> SMART-UPSes, via SmartCards, each ranging in age from six months to<br>
>>>> five years. They all exhibit this same symptom, as I painfully<br>
>>>> discovered during a recent power outage: they all sent me e-mail when<br>
>>>> the UPSes went to low battery, but none turned off when the UPS went<br>
>>>> critical. Given the range of hardware involved, this must be a common<br>
>>>> software problem.<br>
>>>><br>
>>>> The systems will shut down properly if I do "upsmon -c fsd", so it<br>
>>>> doesn't appear to be a permissions problem.<br>
>>>><br>
>>>> I don't think this is the upsdrv_shutdown() issue described in the<br>
>>>> snmp-ups man page; I do not care if the UPS shuts down when the<br>
>>>> computer does, nor do I want it to. I just want upsmon to shut down the<br>
>>>> system when the UPS goes critical.<br>
>>>><br>
>>>> Here are my config files; the system is tanya, its UPS is tanya-ups.<br>
>>>> Any advice?<br>
>>>><br>
>>>> ups.conf:<br>
>>>><br>
>>>> [tanya-ups]<br>
>>>> driver = snmp-ups<br>
>>>> port = tanya-ups<br>
>>>> community = private<br>
>>>> mibs = apcc<br>
>>>><br>
>>>> upsd.conf:<br>
>>>><br>
>>>> # LISTEN 0.0.0.0 3493<br>
>>>><br>
>>>> upsd.users:<br>
>>>><br>
>>>> [admin]<br>
>>>> password = nowayjose<br>
>>>> actions = SET<br>
>>>> instcmds = all<br>
>>>> upsmon master<br>
>>>><br>
>>><br>
>>> it's also a good idea to separate monitoring and administrative users.<br>
>>> E.g.:<br>
>>> [admin]<br>
>>> password = XXX<br>
>>> actions = SET<br>
>>> instcmds = all<br>
>>><br>
>>> [monuser]<br>
>>> password = XXX<br>
>>> upsmon master<br>
>>><br>
>>>> upsmon.conf:<br>
>>>><br>
>>>> MONITOR tanya-ups@localhost 1 admin nowayjose master<br>
>>>> MINSUPPLIES 1<br>
>>>> SHUTDOWNCMD "/sbin/shutdown -h +0"<br>
>>>> NOTIFYCMD /home/bin/notify.sh # sends me e-mail<br>
>>>> POLLFREQ 5<br>
>>>> POLLFREQALERT 5<br>
>>>> HOSTSYNC 15<br>
>>>> DEADTIME 15<br>
>>>> POWERDOWNFLAG /etc/killpower<br>
>>>> NOTIFYFLAG ONLINE SYSLOG<br>
>>>> NOTIFYFLAG ONBATT SYSLOG+WALL<br>
>>>> NOTIFYFLAG LOWBATT SYSLOG+WALL<br>
>>>> NOTIFYFLAG FSD SYSLOG+WALL+EXEC<br>
>>>> NOTIFYFLAG COMMOK SYSLOG<br>
>>>> NOTIFYFLAG COMMBAD SYSLOG<br>
>>>> NOTIFYFLAG SHUTDOWN SYSLOG+WALL+EXEC<br>
>>>> NOTIFYFLAG REPLBATT SYSLOG+WALL+EXEC<br>
>>>> NOTIFYFLAG NOCOMM SYSLOG<br>
>>>> NOTIFYFLAG NOPARENT SYSLOG+WALL<br>
>>>> RBWARNTIME 43200<br>
>>>> NOCOMMWARNTIME 300<br>
>>>> FINALDELAY 5<br>
>>><br>
>>> Your config seems fine.<br>
>>> An interesting test to do would be to stop upsmon, but keep snmp-ups and<br>
>>> upsd, then discharge your UPS and ensure that you indeed get an<br>
>>> ups.status == "OB LB", which triggers the call to<br>
>>> upsmon.conf->SHUTDOWNCMD. Note that you need both "OB" and "LB", since<br>
>>> you may have "low battery" and be "online" at the same time!<br>
>><br>
>> This is a good idea, and I ran the test. I disconnected the UPS, and<br>
>> periodically checked the output of:<br>
>><br>
>> upsc tanya-ups@localhost ups.status<br>
>><br>
>> Eventually this command returned "OB LB" as you said. But upsmon did<br>
>> nothing. I waited and eventually the UPS shut power to the system in a hard<br>
>> crash.<br>
><br>
> ooch, mea culpa!<br>
> I was too brief in my answer, and forgot to tell you the obvious: remove<br>
> your computer from the UPS, in order to avoid such a crash.<br>
><br>
>> So the UPS is sending the correct signals, and snmp-ups is reporting the<br>
>> correct status. Is there anything else I can check to trace the cause of<br>
>> the problem?<br>
><br>
> indeed, though there is an issue, as you've reported initially.<br>
><br>
> Could you do this test again, but this time:<br>
> - remove your server from the UPS,<br>
> - start upsmon in debug mode. If it's already started, just call "upsmon -c<br>
> stop ; upsmon -DDDDD"<br>
> and send us back the output, at least when it should see the "OB LB"<br>
> condition, to see what's going on.<br>
<br>
</div></div>I solved the problem by looking at the code in upsmon.c. I did two stupid things:<br>
<br>
- I didn't RTFM as much as I thought I had.<br>
<br>
- In my rush to trim down the config files for my first message to nut-upsuser,<br>
I left out the crucial bits that would have enabled anyone else to help me.<br>
<br>
Here's the key: In my upsmon.conf, I actually have two MONITOR lines:<br>
<br>
MONITOR tanya-ups@localhost 1 monuser acdc master<br>
MONITOR network-ups@localhost 1 monuser acdc master<br>
<br>
(Note the change to "monuser", indicating that I followed Arnaud's advice.)<br>
<br>
I'm using snmp-ups to communicate with my UPS. If the UPS that supplies power to<br>
the network switch goes critical, I want tanya to power down as well; after all,<br>
if tanya can't talk to its UPS anymore, it won't know when tanya-ups goes critical.<br>
<br>
So the intent of the two MONITOR lines is: If either tanya-ups OR network-ups<br>
goes critical, shut down the system.<br>
<br>
But I also had this line in upsmon.conf:<br>
<br>
MINSUPPLIES 1<br>
<br>
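That means the effect of the two MONITOR lines is: If tanya-ups AND network-ups<br>
go critical, shut down the system. With only one of them critical, the other<br>
still counts as one working supply, which satisfies MINSUPPLIES 1.<br>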
<br>
Since all my tests involved just cutting the power via tanya-ups, upsmon wasn't<br>
shutting down tanya. It was doing what the configuration file told it to do.<br>
<br>
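A minimal sketch of that rule in Python, as I understand it: upsmon adds up the<br>
power values of the UPSes that are not critical and shuts down only when the<br>
total drops below MINSUPPLIES. This paraphrases the behaviour and is not the<br>
actual upsmon.c code; the function and variable names are invented for<br>
illustration.<br>
<pre>
```python
def should_shut_down(power_values, critical, minsupplies):
    """Return True if the power value left over is below minsupplies.

    power_values: dict mapping UPS name to the power value on its MONITOR line
    critical:     set of UPS names that are currently critical
    """
    remaining = sum(pv for name, pv in power_values.items() if name not in critical)
    return remaining < minsupplies


# The two MONITOR lines above, each with a power value of 1.
monitors = {"tanya-ups": 1, "network-ups": 1}

# MINSUPPLIES 1: losing only tanya-ups leaves a power value of 1, so no shutdown.
print(should_shut_down(monitors, {"tanya-ups"}, 1))                 # False
print(should_shut_down(monitors, {"tanya-ups", "network-ups"}, 1))  # True

# MINSUPPLIES 2: losing either UPS leaves 1, which is below 2, so shutdown.
print(should_shut_down(monitors, {"tanya-ups"}, 2))                 # True
```
</pre>
<br>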
The solution is to change the MINSUPPLIES line:<br>
<br>
MINSUPPLIES 2<br>
<br>
Then upsmon does what I want it to do. I've already confirmed this with direct<br>
tests. (I also discovered that I had to increase the "low-battery duration"<br>
parameter on tanya-ups, but that's another story.)<br>
<br>
In general, at least for my cluster configuration, the argument to MINSUPPLIES<br>
should be equal to the number of MONITOR lines I have in upsmon.conf (each of<br>
which has a power value of 1).<br>
<br>
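To double-check a config against this rule, something like the following could<br>
be used. It is only a rough sketch: the parsing is deliberately naive (it reads<br>
just the MONITOR and MINSUPPLIES directives in the simple whitespace-separated<br>
form shown above, and would trip over a "#" inside a password), and the helper<br>
names are hypothetical, not part of NUT.<br>
<pre>
```python
def parse_upsmon_conf(text):
    """Extract MONITOR power values and MINSUPPLIES from upsmon.conf-style text."""
    power_values = {}
    minsupplies = 1  # documented default when the directive is absent
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()  # drop comments, split on whitespace
        if not fields:
            continue
        if fields[0] == "MONITOR" and len(fields) >= 3:
            # Field order: MONITOR system powervalue username password type
            power_values[fields[1]] = int(fields[2])
        elif fields[0] == "MINSUPPLIES" and len(fields) >= 2:
            minsupplies = int(fields[1])
    return power_values, minsupplies


def single_failures_that_shut_down(power_values, minsupplies):
    """List the UPSes whose going critical alone would trigger a shutdown."""
    total = sum(power_values.values())
    return [name for name, pv in power_values.items() if total - pv < minsupplies]


conf = """
MONITOR tanya-ups@localhost 1 monuser acdc master
MONITOR network-ups@localhost 1 monuser acdc master
MINSUPPLIES 2
"""

power_values, minsupplies = parse_upsmon_conf(conf)
print(single_failures_that_shut_down(power_values, minsupplies))
```
</pre>
With MINSUPPLIES 2 both UPSes are listed; with MINSUPPLIES 1 the list is empty,<br>
which matches the symptom I was seeing.<br>
<br>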
My confusion was due to my misinterpretation of the language of the<br>
documentation. The upsmon.conf man page and big-servers.txt both speak about<br>
power supplies directly connected to the system; I skipped over those parts<br>
because I thought of only one UPS supplying power to my system. In my<br>
configuration I have to think of the network switch as part of "the system." I<br>
should have paid more attention.<br>
<br>
Thanks for trying to help me out, Arnaud. It wasn't your fault that I didn't<br>
give you enough information.<br></blockquote></div><br>Glad to hear that your issue is fixed.<br>I'll check whether this wording can be improved to avoid confusion.<br><br>Cheers,<br clear="all">Arnaud<br>-- <br>
Linux / Unix Expert R&D - Eaton - <a href="http://powerquality.eaton.com" target="_blank">http://powerquality.eaton.com</a><br>Network UPS Tools (NUT) Project Leader - <a href="http://www.networkupstools.org/" target="_blank">http://www.networkupstools.org/</a><br>
Debian Developer - <a href="http://www.debian.org" target="_blank">http://www.debian.org</a><br>Free Software Developer - <a href="http://arnaud.quette.free.fr/" target="_blank">http://arnaud.quette.free.fr/</a><br><br>