[Nut-upsuser] upsmon+snmp-ups does not shut down system

William Seligman seligman at nevis.columbia.edu
Wed Jan 11 16:50:59 UTC 2012


The problem is solved, but first things first:

On 1/11/12 6:43 AM, Arnaud Quette wrote:
> 2012/1/9 William Seligman <seligman at nevis.columbia.edu>
> 
>> On 1/9/12 9:53 AM, Arnaud Quette wrote:
>>
>>> 2012/1/6 William Seligman <seligman at nevis.columbia.edu>
>>>
>>>> I've googled and RTFM'ed, but still can't solve this one. I hope you
>>>> folks can.
>>>> 
>>>> This affects my entire computer cluster, but let's start simple: I've
>>>> got a computer running NUT; OS is Scientific Linux 5.5; kernel 
>>>> 2.6.18-274.12.1.el5xen. It connects to an APC SMART-UPS via an APC 
>>>> SmartCard using the snmp-ups driver. It generally works: upsmon will
>>>> detect if the battery is low (I get an e-mail message); I can control
>>>> the UPS, inspect its variables, set variables, issue commands, and so
>>>> on.
>>>
>>> If "On battery" and "Low battery" are both detected, there should be no
>>> issue.
>>>
>>>> There's just one thing that does not happen: when the UPS goes critical,
>>>> the computer does not shut down. The upsmon daemon does not display any
>>>> messages, does not write to the syslog, does not send e-mail, etc.; even
>>>> though I've configured it to do so in upsmon.conf.
>>>>
>>>> I've tried nut-2.2.2, nut-2.4.3, and nut-2.6.2, and the symptom is the
>>>> same.
>>>
>>> Using the latest version, when possible, is always a good idea.
>>
>> Installing nut-2.6.2 on a Scientific Linux 5.5 system was a bit difficult, 
>> and played havoc with my regular yum updates. After I've finished
>> debugging this problem, I'm going to completely reinstall the OS to make
>> sure I've got a consistent set of RPMs.
>>
> 
> you may have preferred to rebuild an SRPM like this one:
> http://zid-luxinst.uibk.ac.at/linux/rpm2html/fedora/14/i386/updates/nut-2.6.2-1.fc14.i686.html

That's what I did, at first. The rebuild process for that RPM involves "-devel"
libraries that are not part of an RHEL5-style distribution. So I tried to
download and compile the SRPMs for those libraries (neon-devel, powerman-devel,
net-snmp-devel, etc.). This led to a chain of installs and the usual RPM hell; I
had not appreciated how different RHEL6+ was from RHEL5.

Even with all the dependent libraries installed, the nut-2.6.2 SRPM would still
not rebuild; even though the neon and neon-devel libraries were present, the
configure script couldn't find them and so the rebuild failed.

Finally, I did what I should have done from the start: I just used the
nut-2.6.2.tar.gz file and built it manually. The configure script still couldn't
find the neon libraries, but I didn't need that functionality for my tests, and
this did not block the compilation. The only problem was getting the various
directory options set so files/binaries would be installed in the same
directories as in a Redhat distribution. Even then, I had to move binaries
around post-install.

And after all that work, it still didn't solve the problem. Read on...

>>>> I tried issuing a "graceful reboot" command via the APC SmartCard's web
>>>> and telnet interface. It made no difference; the system still did not
>>>> shut down.
>>>> 
>>>> Now let's extend the problem to my cluster: I have a variety of
>>>> different computers, all running Scientific Linux 5.5, connecting
>>>> through different switches, connecting to different flavors of APC
>>>> SMART-UPSes, via SmartCards, each ranging in age from six months to
>>>> five years. They all exhibit this same symptom, as I painfully
>>>> discovered during a recent power outage: they all sent me e-mail when
>>>> the UPSes went to low battery, but none turned off when the UPS went
>>>> critical. Given the range of hardware involved, this must be a common
>>>> software problem.
>>>> 
>>>> The systems will shut down properly if I do "upsmon -c fsd", so it
>>>> doesn't appear to be a permissions problem.
>>>> 
>>>> I don't think this is the upsdrv_shutdown() issue described in the
>>>> snmp-ups man page; I do not care if the UPS shuts down when the
>>>> computer does, nor do I want it to. I just want upsmon to shut down the
>>>> system when the UPS goes critical.
>>>> 
>>>> Here are my config files; the system is tanya, its UPS is tanya-ups.
>>>> Any advice?
>>>>
>>>> ups.conf:
>>>>
>>>> [tanya-ups]
>>>>        driver = snmp-ups
>>>>        port = tanya-ups
>>>>        community = private
>>>>        mibs = apcc
>>>>
>>>> upsd.conf:
>>>>
>>>> # LISTEN 0.0.0.0 3493
>>>>
>>>> upsd.users:
>>>>
>>>> [admin]
>>>>        password = nowayjose
>>>>        actions = SET
>>>>        instcmds = all
>>>>        upsmon master
>>>>
>>>
>>> it's also a good idea to separate monitoring and administrative users.
>>> Ie:
>>> [admin]
>>>        password = XXX
>>>        actions = SET
>>>        instcmds = all
>>>
>>> [monuser]
>>>        password = XXX
>>>        upsmon master
>>>
>>>> upsmon.conf:
>>>>
>>>> MONITOR tanya-ups@localhost 1 admin nowayjose master
>>>> MINSUPPLIES 1
>>>> SHUTDOWNCMD "/sbin/shutdown -h +0"
>>>> NOTIFYCMD /home/bin/notify.sh # sends me e-mail
>>>> POLLFREQ 5
>>>> POLLFREQALERT 5
>>>> HOSTSYNC 15
>>>> DEADTIME 15
>>>> POWERDOWNFLAG /etc/killpower
>>>> NOTIFYFLAG ONLINE       SYSLOG
>>>> NOTIFYFLAG ONBATT       SYSLOG+WALL
>>>> NOTIFYFLAG LOWBATT      SYSLOG+WALL
>>>> NOTIFYFLAG FSD          SYSLOG+WALL+EXEC
>>>> NOTIFYFLAG COMMOK       SYSLOG
>>>> NOTIFYFLAG COMMBAD      SYSLOG
>>>> NOTIFYFLAG SHUTDOWN     SYSLOG+WALL+EXEC
>>>> NOTIFYFLAG REPLBATT     SYSLOG+WALL+EXEC
>>>> NOTIFYFLAG NOCOMM       SYSLOG
>>>> NOTIFYFLAG NOPARENT     SYSLOG+WALL
>>>> RBWARNTIME 43200
>>>> NOCOMMWARNTIME 300
>>>> FINALDELAY 5
>>>
>>> Your config seems fine.
>>> An interesting test to do would be to stop upsmon, but keep snmp-ups and 
>>> upsd, then discharge your UPS and check that you indeed get an 
>>> ups.status == "OB LB", which triggers the call to
>>> upsmon.conf->SHUTDOWNCMD. Note that you need both "OB" and "LB", since
>>> you may have "low battery" and be "online" at the same time!
>>
>> This is a good idea, and I ran the test. I disconnected the UPS, and
>> periodically checked the output of:
>>
>> upsc tanya-ups@localhost ups.status
>>
>> Eventually this command returned "OB LB" as you said. But upsmon did
>> nothing. I waited and eventually the UPS shut power to the system in a hard
>> crash.
> 
> ooch, mea culpa!
> I was too brief in my answer, and forgot to tell you the obvious: remove
> your computer from the UPS, in order to avoid such a crash.
> 
>> So the UPS is sending the correct signals, and snmp-ups is reporting the
>> correct status. Is there anything else I can check to trace the cause of
>> the problem?
> 
> indeed, though there is an issue, as you've reported initially.
> 
> Could you do this test again, but this time:
> - remove your server from the UPS,
> - start upsmon in debug mode. If it's already started, just call "upsmon -c
> stop ; upsmon -DDDDD"
> and send us back the output, at least when it should see the "OB LB"
> condition, to see what's going on.

I solved the problem by looking at the code in upsmon.c. I did two stupid things:

- I didn't RTFM as much as I thought I had.

- In my rush to trim down the config files for my first message to nut-upsuser,
I left out the crucial bits that would have enabled anyone else to help me.

Here's the key: In my upsmon.conf, I actually have two MONITOR lines:

MONITOR tanya-ups@localhost 1 monuser acdc master
MONITOR network-ups@localhost 1 monuser acdc master

(Note the change to "monuser", indicating that I followed Arnaud's advice.)

I'm using snmp-ups, so tanya talks to its UPS over the network. If the UPS that
supplies power to the network switch goes critical, I want tanya to power down
as well; after all, once the switch loses power, tanya can't talk to its UPS
anymore, and it won't know when tanya-ups goes critical.

So the intent of the two MONITOR lines is: If either tanya-ups OR network-ups
goes critical, shut down the system.

But I also had this line in upsmon.conf:

MINSUPPLIES 1

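Each MONITOR line has a power value of 1, and upsmon only shuts down when the
power value of the UPSes that are not critical drops below MINSUPPLIES. That
means the effect of the two MONITOR lines is: If tanya-ups AND network-ups
both go critical, shut down the system.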

Since all my tests involved just cutting the power via tanya-ups, upsmon wasn't
shutting down tanya. It was doing what the configuration file told it to do.
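
To make that concrete, here is a minimal Python sketch of the decision as I
understand it. It is a simplified model, not the actual code from upsmon.c: I
am assuming a UPS counts as critical only when ups.status contains both OB and
LB, and I am ignoring the FSD flag and the lost-communication cases. The
function names are made up for illustration.

```python
# Simplified model of upsmon's shutdown decision (NOT the real upsmon.c code).
# Assumption: a UPS is critical only when its status has both OB and LB;
# the FSD flag and lost-communication handling are left out.

def is_critical(status):
    """True if an ups.status string such as 'OB LB' means on battery AND low."""
    flags = status.split()
    return "OB" in flags and "LB" in flags

def should_shut_down(monitors, minsupplies):
    """monitors: list of (ups.status string, power value) pairs,
    one pair per MONITOR line."""
    # Add up the power values of every UPS that is not critical.
    available = sum(pv for status, pv in monitors if not is_critical(status))
    # Shut down only when what is left falls below MINSUPPLIES.
    return available < minsupplies

# Two MONITOR lines with power value 1 each, and MINSUPPLIES 1:
only_tanya_critical = should_shut_down([("OB LB", 1), ("OL", 1)], 1)
both_critical = should_shut_down([("OB LB", 1), ("OB LB", 1)], 1)
print(only_tanya_critical)  # False: network-ups still counts as one supply
print(both_critical)        # True: nothing is left
```

With one UPS critical, the other still provides a power value of 1, which is
not below MINSUPPLIES 1, so nothing happens.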

The solution is to change the MINSUPPLIES line:

MINSUPPLIES 2

Then upsmon does what I want it to do. I've already confirmed this with direct
tests. (I also discovered that I had to increase the "low-battery duration"
parameter on tanya-ups, but that's another story.)

In general, at least for my cluster configuration, the argument to MINSUPPLIES
should be equal to the number of MONITOR lines I have in upsmon.conf.
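
A quick way to check a configuration for this is to list which UPSes would
trigger a shutdown by going critical on their own. The following Python sketch
does that under some assumptions: the parser is deliberately naive (it treats
everything after a "#" as a comment and does not handle quoted values), it
assumes MINSUPPLIES defaults to 1 when the line is absent, and the function
names are hypothetical.

```python
def parse_upsmon_conf(text):
    """Return ([(ups name, power value), ...], MINSUPPLIES) from upsmon.conf text.

    Naive parser: strips '#' comments and splits on whitespace only.
    """
    monitors = []
    minsupplies = 1  # assumed default when no MINSUPPLIES line is present
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if not fields:
            continue
        if fields[0] == "MONITOR" and len(fields) >= 3:
            # MONITOR <system> <powervalue> <username> <password> <type>
            monitors.append((fields[1], int(fields[2])))
        elif fields[0] == "MINSUPPLIES" and len(fields) >= 2:
            minsupplies = int(fields[1])
    return monitors, minsupplies

def single_points_of_shutdown(monitors, minsupplies):
    """Names of UPSes that cause a shutdown when they alone go critical."""
    total = sum(pv for _, pv in monitors)
    return [name for name, pv in monitors if total - pv < minsupplies]

conf = """
MONITOR tanya-ups@localhost 1 monuser acdc master
MONITOR network-ups@localhost 1 monuser acdc master
MINSUPPLIES 1
"""
monitors, minsupplies = parse_upsmon_conf(conf)
print(single_points_of_shutdown(monitors, minsupplies))  # []
print(single_points_of_shutdown(monitors, 2))
```

An empty list means no single UPS can shut the system down by itself, which
was exactly my situation. With MINSUPPLIES 2 both UPS names are listed.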

My confusion was due to my misinterpretation of the language of the
documentation. The upsmon.conf man page and big-servers.txt both speak about
power supplies directly connected to the system; I skipped over those parts
because I thought of only one UPS supplying power to my system. In my
configuration I have to think of the network switch as part of "the system." I
should have paid more attention.

Thanks for trying to help me out, Arnaud. It wasn't your fault that I didn't
give you enough information.

-- 
Bill Seligman             | Phone: (914) 591-2823
Nevis Labs, Columbia Univ | mailto://seligman@nevis.columbia.edu
PO Box 137                |
Irvington NY 10533 USA    | http://www.nevis.columbia.edu/~seligman/
