[Nut-upsuser] upsmon+snmp-ups does not shut down system
William Seligman
seligman at nevis.columbia.edu
Wed Jan 11 16:50:59 UTC 2012
The problem is solved, but first things first:
On 1/11/12 6:43 AM, Arnaud Quette wrote:
> 2012/1/9 William Seligman <seligman at nevis.columbia.edu>
>
>> On 1/9/12 9:53 AM, Arnaud Quette wrote:
>>
>>> 2012/1/6 William Seligman <seligman at nevis.columbia.edu>
>>>
>>>> I've googled and RTFM'ed, but still can't solve this one. I hope you
>>>> folks can.
>>>>
>>>> This affects my entire computer cluster, but let's start simple: I've
>>>> got a computer running NUT; OS is Scientific Linux 5.5; kernel
>>>> 2.6.18-274.12.1.el5xen. It connects to an APC SMART-UPS via an APC
>>>> SmartCard using the snmp-ups driver. It generally works: upsmon will
>>>> detect if the battery is low (I get an e-mail message); I can control
>>>> the UPS, inspect its variables, set variables, issue commands, and so
>>>> on.
>>>
>>> If "On battery" and "Low battery" are both detected, there should be no
>>> issue.
>>>
>>>> There's just one thing that does not happen: when the UPS goes critical,
>>>> the computer does not shut down. The upsmon daemon does not display any
>>>> messages, does not write to the syslog, does not send e-mail, etc.; even
>>>> though I've configured it to do so in upsmon.conf.
>>>>
>>>> I've tried nut-2.2.2, nut-2.4.3, and nut-2.6.2, and the symptom is the
>>>> same.
>>>
>>> Using the latest version, when possible, is always a good idea.
>>
>> Installing nut-2.6.2 on a Scientific Linux 5.5 system was a bit difficult,
>> and played havoc with my regular yum updates. After I've finished
>> debugging this problem, I'm going to completely reinstall the OS to make
>> sure I've got a consistent set of RPMs.
>>
>
> you may have preferred to rebuild an SRPM like that:
> http://zid-luxinst.uibk.ac.at/linux/rpm2html/fedora/14/i386/updates/nut-2.6.2-1.fc14.i686.html
That's what I did, at first. The rebuild process for that RPM involves "-devel"
libraries that are not part of an RHEL5-style distribution. So I tried to
download and compile the SRPMs for those libraries (neon-devel, portman-devel,
net-snmp-devel, etc.). This led to a chain of installs and the usual RPM hell; I
had not appreciated how different RHEL6+ was from RHEL5.
Even with all the dependent libraries installed, the nut-2.6.2 SRPM would still
not rebuild; even though the neon and neon-devel libraries were present, the
configure script couldn't find them and so the rebuild failed.
Finally, I did what I should have done from the start: I just used the
nut-2.6.2.tar.gz file and built it manually. The configure script still couldn't
find the neon libraries, but I didn't need that functionality for my tests, and
this did not block the compilation. The only problem was getting the various
directory options set so files/binaries would be installed in the same
directories as in a Redhat distribution. Even then, I had to move binaries
around post-install.
And after all that work, it still didn't solve the problem. Read on...
>>>> I tried issuing a "graceful reboot" command via the APC SmartCard's web
>>>> and telnet interface. It made no difference; the system still did not
>>>> shut down.
>>>>
>>>> Now let's extend the problem to my cluster: I have a variety of
>>>> different computers, all running Scientific Linux 5.5, connecting
>>>> through different switches, connecting to different flavors of APC
>>>> SMART-UPSes, via SmartCards, each ranging in age from six months to
>>>> five years. They all exhibit this same symptom, as I painfully
>>>> discovered during a recent power outage: they all sent me e-mail when
>>>> the UPSes went to low battery, but none turned off when the UPS went
>>>> critical. Given the range of hardware involved, this must be a common
>>>> software problem.
>>>>
>>>> The systems will shut down properly if I do "upsmon -c fsd", so it
>>>> doesn't appear to be a permissions problem.
>>>>
>>>> I don't think this is the upsdrv_shutdown() issue described in the
>>>> snmp-ups man page; I do not care if the UPS shuts down when the
>>>> computer does, nor do I want it to. I just want upsmon to shut down the
>>>> system when the UPS goes critical.
>>>>
>>>> Here are my config files; the system is tanya, its UPS is tanya-ups.
>>>> Any advice?
>>>>
>>>> ups.conf:
>>>>
>>>> [tanya-ups]
>>>> driver = snmp-ups
>>>> port = tanya-ups
>>>> community = private
>>>> mibs = apcc
>>>>
>>>> upsd.conf:
>>>>
>>>> # LISTEN 0.0.0.0 3493
>>>>
>>>> upsd.users:
>>>>
>>>> [admin]
>>>> password = nowayjose
>>>> actions = SET
>>>> instcmds = all
>>>> upsmon master
>>>>
>>>
>>> it's also a good idea to separate monitoring and administrative users.
>>> Ie:
>>> [admin]
>>> password = XXX
>>> actions = SET
>>> instcmds = all
>>>
>>> [monuser]
>>> password = XXX
>>> upsmon master
>>>
>>>> upsmon.conf:
>>>>
>>>> MONITOR tanya-ups@localhost 1 admin nowayjose master
>>>> MINSUPPLIES 1
>>>> SHUTDOWNCMD "/sbin/shutdown -h +0"
>>>> NOTIFYCMD /home/bin/notify.sh # sends me e-mail
>>>> POLLFREQ 5
>>>> POLLFREQALERT 5
>>>> HOSTSYNC 15
>>>> DEADTIME 15
>>>> POWERDOWNFLAG /etc/killpower
>>>> NOTIFYFLAG ONLINE SYSLOG
>>>> NOTIFYFLAG ONBATT SYSLOG+WALL
>>>> NOTIFYFLAG LOWBATT SYSLOG+WALL
>>>> NOTIFYFLAG FSD SYSLOG+WALL+EXEC
>>>> NOTIFYFLAG COMMOK SYSLOG
>>>> NOTIFYFLAG COMMBAD SYSLOG
>>>> NOTIFYFLAG SHUTDOWN SYSLOG+WALL+EXEC
>>>> NOTIFYFLAG REPLBATT SYSLOG+WALL+EXEC
>>>> NOTIFYFLAG NOCOMM SYSLOG
>>>> NOTIFYFLAG NOPARENT SYSLOG+WALL
>>>> RBWARNTIME 43200
>>>> NOCOMMWARNTIME 300
>>>> FINALDELAY 5
>>>
>>> Your config seems fine.
>>> An interesting test to do would be to stop upsmon, but keep snmp-ups and
>>> upsd, then discharge your UPS and ensure that you indeed get an
>>> ups.status == "OB LB", which triggers the call to
>>> upsmon.conf->SHUTDOWNCMD. Note that you need both "OB" and "LB", since
>>> you may have "low battery" and be "online" at the same time!
>>
>> This is a good idea, and I ran the test. I disconnected the UPS, and
>> periodically checked the output of:
>>
>> upsc tanya-ups@localhost ups.status
>>
>> Eventually this command returned "OB LB" as you said. But upsmon did
>> nothing. I waited and eventually the UPS shut power to the system in a hard
>> crash.
>
> ooch, mea culpa!
> I was too brief in my answer, and forgot to tell you the obvious: remove
> your computer from the UPS, in order to avoid such a crash.
>
>> So the UPS is sending the correct signals, and snmp-ups is reporting the
>> correct status. Is there anything else I can check to trace the cause of
>> the problem?
>
> indeed, though there is an issue, as you've reported initially.
>
> Could you do this test again, but this time:
> - remove your server from the UPS,
> - start upsmon in debug mode. If it's already started, just call "upsmon -c
> stop ; upsmon -DDDDD"
> and send us back the output, at least when it should see the "OB LB"
> condition, to see what's going on.
I solved the problem by looking at the code in upsmon.c. I did two stupid things:
- I didn't RTFM as much as I thought I had.
- In my rush to trim down the config files for my first message to nut-upsuser,
I left out the crucial bits that would have enabled anyone else to help me.
Here's the key: In my upsmon.conf, I actually have two MONITOR lines:
MONITOR tanya-ups@localhost 1 monuser acdc master
MONITOR network-ups@localhost 1 monuser acdc master
(Note the change to "monuser", indicating that I followed Arnaud's advice.)
I'm using snmp-ups to communicate with my UPS. If the UPS that supplies power to
the network switch goes critical, I want tanya to power down as well; after all,
if tanya can't talk to its UPS anymore, it won't know when tanya-ups goes critical.
So the intent of the two MONITOR lines is: If either tanya-ups OR network-ups
goes critical, shut down the system.
But I also had this line in upsmon.conf:
MINSUPPLIES 1
That means the effect of the two MONITOR lines is: If tanya-ups AND network-ups
go critical, shut down the system.
Since all my tests involved just cutting the power via tanya-ups, upsmon wasn't
shutting down tanya. It was doing what the configuration file told it to do.
The solution is to change the MINSUPPLIES line:
MINSUPPLIES 2
Then upsmon does what I want it to do. I've already confirmed this with direct
tests. (I also discovered that I had to increase the "low-battery duration"
parameter on tanya-ups, but that's another story.)
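The difference between the two settings can be sketched in a few lines of
Python. This is only a simplified model of the counting rule, not the code in
upsmon.c: the names Ups, is_critical and should_shut_down are invented for
illustration, and the sketch ignores timing details such as HOSTSYNC and
DEADTIME. It assumes a UPS is critical when its status contains both "OB" and
"LB" (or "FSD"), and that the system shuts down when the power values of the
remaining UPSes add up to less than MINSUPPLIES.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Ups:
    """One MONITOR line: UPS name, power value, and last reported ups.status."""
    name: str
    power_value: int
    status: str  # e.g. "OL", "OB", "OB LB"


def is_critical(ups: Ups) -> bool:
    """Assumed rule: critical means FSD, or on battery AND low battery."""
    flags = ups.status.split()
    return "FSD" in flags or ("OB" in flags and "LB" in flags)


def should_shut_down(monitored: list[Ups], minsupplies: int) -> bool:
    """Shut down when the non-critical power values sum to less than MINSUPPLIES."""
    healthy = sum(u.power_value for u in monitored if not is_critical(u))
    return healthy < minsupplies


if __name__ == "__main__":
    # The situation in my tests: tanya-ups is critical, network-ups is fine.
    upses = [Ups("tanya-ups", 1, "OB LB"), Ups("network-ups", 1, "OL")]
    print(should_shut_down(upses, minsupplies=1))  # False: one supply is still enough
    print(should_shut_down(upses, minsupplies=2))  # True: one supply is no longer enough
```

With one of the two UPSes critical, only one power value is left, so the
threshold of 1 is still met and nothing happens; with a threshold of 2 it is
not met and the shutdown is triggered.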
In general, at least for my cluster configuration, the argument to MINSUPPLIES
should be equal to the number of MONITOR lines I have in upsmon.conf.
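A quick sanity check for this rule might look like the following Python
sketch. It is a hypothetical helper, not part of NUT: check_minsupplies is an
invented name, the parsing is deliberately naive (it splits on whitespace and
does not handle quoted values or a "#" inside a password), and it assumes
MINSUPPLIES defaults to 1 when the directive is absent. It adds up the power
values from the MONITOR lines so they can be compared with MINSUPPLIES.

```python
from __future__ import annotations


def check_minsupplies(conf_text: str) -> tuple[int, int]:
    """Return (MINSUPPLIES, sum of the power values of all MONITOR lines)."""
    minsupplies = 1  # assumed default when the directive is missing
    total_power = 0
    for raw in conf_text.splitlines():
        fields = raw.split("#", 1)[0].split()  # naive comment stripping
        if not fields:
            continue
        if fields[0] == "MINSUPPLIES" and len(fields) >= 2:
            minsupplies = int(fields[1])
        elif fields[0] == "MONITOR" and len(fields) >= 3:
            # MONITOR <system> <powervalue> <username> <password> <type>
            total_power += int(fields[2])
    return minsupplies, total_power


if __name__ == "__main__":
    conf = """\
MONITOR tanya-ups@localhost 1 monuser acdc master
MONITOR network-ups@localhost 1 monuser acdc master
MINSUPPLIES 1
"""
    minsupplies, total_power = check_minsupplies(conf)
    if minsupplies < total_power:
        print(f"MINSUPPLIES {minsupplies} < total power value {total_power}: "
              "one critical UPS alone will not trigger a shutdown")
```

The warning only makes sense for a setup like mine, where losing any one UPS
should bring the system down. On a machine with genuinely redundant power
supplies, a MINSUPPLIES lower than the total is the intended configuration.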
My confusion was due to my mis-interpretation of the language of the
documentation. The upsmon.conf man page and big-servers.txt both speak about
power supplies directly connected to the system; I skipped over those parts
because I thought of only one UPS supplying power to my system. In my
configuration I have to think of the network switch as part of "the system." I
should have paid more attention.
Thanks for trying to help me out, Arnaud. It wasn't your fault that I didn't
give you enough information.
--
Bill Seligman | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137 |
Irvington NY 10533 USA | http://www.nevis.columbia.edu/~seligman/