[Nut-upsuser] Heartbeat validation of NUT integrity

Roger Price roger at rogerprice.org
Sun Apr 2 21:23:40 UTC 2017


This note describes a heartbeat technique for validating the integrity of 
a NUT installation.

Introduction
------------

A NUT configuration may run for months producing little or no output to 
assure the system administrator that the combined processes are running 
correctly.  The technique described in this note verifies that the UPS 
driver, upsd, upsmon, upssched and upssched-cmd components are operational 
and that data flows correctly between them.  The system administrator is 
warned if any link in this chain breaks.

Overview of the technique
-------------------------

An 11 minute upssched timer runs permanently, and when it completes, 
upssched-cmd sends a warning message to the sysadmin.  During normal 
operation the timer is prevented from completing by a timed process with a 
shorter 10 minute period running in a dummy UPS known as "heartbeat". The 
dummy UPS "heartbeat" cycles through an OL and an OB every 10 minutes, and 
the status changes are communicated to upsd and then to upsmon and 
upssched.  Thus every 10 minutes upssched stops and restarts the 11 minute 
timer.  During normal operation the 11 minute timer never completes, but 
if the driver -> upsd -> upsmon -> upssched chain is broken, the timer 
completes and the sysadmin is warned.
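
The relationship between the two periods can be sketched in a few lines of 
shell.  This is only an illustration of the arithmetic; the function name 
heartbeat_margin is my own and is not part of NUT.

```shell
#!/bin/sh
# Print the margin in seconds between the upssched timer and the
# heartbeat period.  A positive margin means the timer is restarted
# before it can complete; zero or negative means it will complete.
# $1 = heartbeat period in seconds, $2 = upssched timer in seconds
heartbeat_margin() {
    echo $(( $2 - $1 ))
}

# Values from this note: 10 minute heartbeat, 11 minute timer.
heartbeat_margin $(( 10 * 60 )) $(( 11 * 60 ))
```

With the values used in this note the margin is 60 seconds, which leaves 
room for the polling delays between the driver, upsd and upsmon.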

The technique requires a working NUT installation and an understanding of 
upssched timers and the upssched-cmd script.

Changes to configuration files
------------------------------

1. In ups.conf, add

[heartbeat]
         driver = dummy-ups
         port = heartbeat.dev
         desc = "Heart beat validation of NUT"

2. Create heartbeat.dev in the same directory as ups.conf with the 
contents

ups.status: OL
TIMER 300
ups.status: OB
TIMER 300

Remember that there are no comments in NUT .dev files.

3. In upsmon.conf, add

MONITOR heartbeat@localhost 1 upsmaster s3cr3t master

and make sure that you have specified

NOTIFYCMD /usr/sbin/upssched
NOTIFYFLAG ONBATT   SYSLOG+WALL+EXEC
NOTIFYFLAG ONLINE   SYSLOG+WALL+EXEC

Your upssched executable may be elsewhere. You may want to remove the 
WALL.

4. In upssched.conf, add

# Heart beat validation that NUT is operational.
# Restart timer which completes only if the dummy-ups heart beat has stopped.
# See timer values in heartbeat.dev 
AT ONBATT heartbeat@localhost CANCEL-TIMER heartbeat-failure-timer
AT ONBATT heartbeat@localhost START-TIMER  heartbeat-failure-timer 660

and make sure that there are no entries such as

AT ONLINE * ...
AT ONBATT * ...

A wildcard would also match the events from "heartbeat".  Replace the "*" 
with the full address of the UPS unit, e.g. myups@localhost

Make sure that you have specified

CMDSCRIPT /usr/sbin/upssched-cmd

Your upssched-cmd may be elsewhere.

5. In upssched-cmd, test for completion of the heartbeat-failure-timer and, 
when it completes, send a warning to the sysadmin by e-mail, SMS, pigeon, ...
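
A minimal sketch of step 5 as a shell script follows.  It relies on 
upssched calling the CMDSCRIPT with the timer name as its first argument. 
The function names, the message text and the SYSADMIN variable are my own 
assumptions, and the script assumes that logger and/or a mail command 
accepting -s are installed; adapt the notify function to whatever delivery 
method you use.

```shell
#!/bin/sh
# Sketch of an upssched-cmd script.  upssched runs CMDSCRIPT with the
# name of the completed timer as its first argument.

# Hypothetical recipient; replace with a real address.
SYSADMIN="${SYSADMIN:-root}"

# Print the warning text for a known timer name.
# Return 1 without output for any other name.
warning_for() {
    case "$1" in
        heartbeat-failure-timer)
            echo "NUT heartbeat failure: heartbeat-failure-timer completed"
            ;;
        *)
            return 1
            ;;
    esac
}

# Deliver the warning through whichever of logger and mail is installed.
notify() {
    if command -v logger >/dev/null 2>&1; then
        logger -t upssched-cmd "$1"
    fi
    if command -v mail >/dev/null 2>&1; then
        echo "$1" | mail -s "NUT heartbeat failure" "$SYSADMIN"
    fi
}

# Act only when called with a timer name, as upssched does.
if [ "$#" -gt 0 ]; then
    if msg="$(warning_for "$1")"; then
        notify "$msg"
    else
        echo "upssched-cmd: unrecognized timer: $1" >&2
    fi
fi
```

If your upssched-cmd already handles other timers, only the 
heartbeat-failure-timer branch of the case statement needs to be merged in.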

Testing the heartbeat setup
---------------------------

1. Test that you can send a warning to the sysadmin with the command

    upssched-cmd heartbeat-failure-timer

2. When you start NUT, check that "heartbeat" is running. The command 
ps aux | grep ups should show something like

upsd     14785  0.0  0.0  13228   652 ?        Ss   22:48   0:00 /usr/lib/ups/driver/usbhid-ups -a myups
upsd     14787  0.0  0.0  19624   704 ?        Ss   22:48   0:00 /usr/lib/ups/driver/dummy-ups -a heartbeat
upsd     14791  0.0  0.0  17560   744 ?        Ss   22:48   0:00 /usr/sbin/upsd -u upsd
root     14794  0.0  0.0  19432   664 ?        Ss   22:48   0:00 /usr/sbin/upsmon
upsd     14795  0.0  0.0  19856  1616 ?        S    22:48   0:00 /usr/sbin/upsmon
upsd     14845  0.0  0.0   6408   448 ?        S    22:53   0:00 /usr/sbin/upssched UPS heartbeat@localhost: On battery

3. Shorten the heartbeat-failure-timer in upssched.conf to 540 seconds, 
and you should receive a warning every 10 minutes.

4. If you leave the WALL in the NOTIFYFLAG ONBATT and NOTIFYFLAG ONLINE 
declarations in upsmon.conf you will see the action of the dummy-ups 
displayed in an xterm or equivalent console.
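
As a complement to the ps check in step 2, the dummy UPS can also be 
queried through upsd with the upsc client.  The sketch below assumes upsc 
is on the PATH and that the dummy driver reports ups.status exactly as 
written in heartbeat.dev; the function names are hypothetical.  Note that 
this only exercises the driver -> upsd part of the chain, not upsmon or 
upssched.

```shell
#!/bin/sh
# Print the current ups.status of the dummy UPS as reported by upsd.
# The default name matches the [heartbeat] section in ups.conf.
heartbeat_status() {
    upsc "${1:-heartbeat@localhost}" ups.status 2>/dev/null
}

# Succeed only if upsd answers with one of the two states that
# heartbeat.dev cycles through.
heartbeat_alive() {
    case "$(heartbeat_status "$@")" in
        OL|OB) return 0 ;;
        *)     return 1 ;;
    esac
}
```

For example, heartbeat_alive && echo "heartbeat visible in upsd" can be run 
by hand after starting NUT.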

I have tested this setup with NUT 2.7.4 on openSUSE 13.2 and 42.2.
Comments and suggestions welcome.

Roger



