[Nut-upsuser] UPS Socomec is unavailable appears in the logs

Mon Feb 8 13:46:17 UTC 2016

Hello,

we have two UPSes - one Galaxy and one from Socomec, both monitored by the NUT server.

The NUT server runs the version 2.7.2.

We have roughly 3500 clients running upsmon, reading the status of both
UPSes. They run mainly the version 2.6.4 and some 2.7.2.

We observed the following on both client versions, regardless whether
we run upsmon in the daemon or debug mode:

Running
upsc socomec at ip_socomec
returns a list of all required values.

OTOH, we find on all clients the folloging logs:
 upsmon[1608]: UPS socomec at ip_socomec is unavailable

This appears regularly each little bit more than 300s and corresponds to
the NOCOMMWARNTIME in the upsmon.conf.

The log entry appears only for the Socomec UPS, not for the Galaxy.

The process list shows that /sbin/upsmon is forking occasionally, generating a child in the Z state

root      3294  0.0  0.0  10448   704 ?        Ss   14:02   0:00   /sbin/upsmon
nut       3295  0.0  0.0  10448   580 ?        S    14:02   0:00     /sbin/upsmon
nut       3497  0.0  0.0      0     0 ?        Z    14:02   0:00       [upsmon] <defunct>

As far as I can see, strace provides not much. Here is a snipped:

4275  14:07:50.688200 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdc214c79d0) = 4580
4275  14:07:50.688650 select(6, [5], NULL, NULL, {0, 0}) = 0 (Timeout)
4275  14:07:50.688873 wait4(-1, NULL, WNOHANG, NULL) = 4521
4275  14:07:50.688968 rt_sigprocmask(SIG_BLOCK, [CHLD], [], 8) = 0
4275  14:07:50.689007 rt_sigaction(SIGCHLD, NULL, {SIG_DFL, [], 0}, 8) = 0
4275  14:07:50.689046 rt_sigprocmask(SIG_SETMASK, [],  <unfinished ...>
4580  14:07:50.689065 exit_group(0)     = ?
4275  14:07:50.689089 <... rt_sigprocmask resumed> NULL, 8) = 0
4275  14:07:50.689109 nanosleep({15, 0}, {14, 999864784}) = ? ERESTART_RESTARTBLOCK (To be restarted)
4275  14:07:50.689334 --- SIGCHLD (Child exited) @ 0 (0) ---
4275  14:07:50.689355 restart_syscall(<... resuming interrupted call ...>) = 0
4275  14:08:05.689412 alarm(10)         = 0
4275  14:08:05.689530 select(5, NULL, [4], NULL, {0, 0}) = 1 (out [4], left {0, 0})
4275  14:08:05.689595 write(4, "GET VAR galaxy ups.status\n", 26) = 26
4275  14:08:05.689721 select(5, [4], NULL, NULL, {5, 0}) = 1 (in [4], left {4, 996929})
4275  14:08:05.692927 read(4, "VAR galaxy ups.status \"OL\"\n", 64) = 27
4275  14:08:05.693050 rt_sigaction(SIGALRM, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, 8) = 0
4275  14:08:05.693108 alarm(0)          = 10
4275  14:08:05.693139 rt_sigaction(SIGALRM, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, 8) = 0
4275  14:08:05.693187 alarm(0)          = 0
4275  14:08:05.693228 alarm(10)         = 0
4275  14:08:05.693266 select(7, NULL, [6], NULL, {0, 0}) = 1 (out [6], left {0, 0})
4275  14:08:05.693306 write(6, "GET VAR socomec ups.status\n", 27) = 27
4275  14:08:05.693373 select(7, [6], NULL, NULL, {5, 0}) = 1 (in [6], left {4, 997653})
4275  14:08:05.695899 read(6, "VAR socomec ups.status \"\"\n", 64) = 26
4275  14:08:05.696006 rt_sigaction(SIGALRM, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, 8) = 0
4275  14:08:05.696062 alarm(0)          = 10
4275  14:08:05.696140 rt_sigaction(SIGALRM, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, {SIG_IGN, [ALRM], SA_RESTORER|SA_RESTART, 0x7fdc20b471e0}, 8) = 0
4275  14:08:05.696187 alarm(0)          = 0
4275  14:08:05.696251 write(2, "  50.070507\t", 12) = 12
4275  14:08:05.696305 write(2, "UPS socomec at ip_socomec is unav"..., 40) = 40
4275  14:08:05.696346 clone(child_stack=0, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7fdc214c79d0) = 4631
4275  14:08:05.696826 select(6, [5], NULL, NULL, {0, 0}) = 0 (Timeout)
4275  14:08:05.697101 wait4(-1, NULL, WNOHANG, NULL) = 4580

where process 4580 is the zombi process.

Here is the ups.conf on the NUT server:

[galaxy]
        driver = netxml-ups
        port = http://ip_galaxy
        pollinterval = 10
        desc = "GALAXY 6000"

[socomec]
        driver = snmp-ups
        port = ip_socomec
        community = public
        snmp_version = v1
        mibs = netvision
        pollinterval = 10
        desc = "Socomec"

The upsd.users is
[monuser]
        password = XXX

On the client side the ups.conf is
[galaxy]
        driver = galaxy
        port = ip_NUTserver
        desc = "GALAXY 6000"

[socomec]
        driver = socomec
        port = ip_NUTserver
        desc = "SOCOMEC"

The upsmon.conf reads:

MONITOR galaxy at 172.28.22.40 1 monuser XXX master
MONITOR socomec at 172.28.22.40 1 monuser XXX master

MINSUPPLIES 1
SHUTDOWNCMD "/sbin/shutdown -h now"
NOTIFYCMD "/sbin/upssched"
POLLFREQ 15
POLLFREQALERT 15
HOSTSYNC 15
DEADTIME 45
POWERDOWNFLAG /etc/killpower
NOTIFYMSG ONBATT "%s runs on battery"
NOTIFYMSG FSD "%s forced shutdown"
NOTIFYFLAG COMMOK  SYSLOG
NOTIFYFLAG COMMBAD SYSLOG
NOTIFYFLAG NOCOMM SYSLOG
NOTIFYFLAG ONLINE SYSLOG+EXEC+WALL
NOTIFYFLAG ONBATT SYSLOG+EXEC+WALL
NOTIFYFLAG FSD SYSLOG+EXEC+WALL
RBWARNTIME 43200
NOCOMMWARNTIME 300

If some other logs or configs are required, please let us know. Is there something,
we've missed in the config, which might have caused this problem?

Thank you and cheers,
Henning