[Nut-upsuser] Upssched 100% CPU after updating Debian 12

Jim Klimov jimklimov+nut at gmail.com
Tue Jun 13 21:40:03 BST 2023


Thanks for the feedback.

Just the bit about "when provided by Debian" concerns me - we have a
slow-ish cadence of releases, and then distros take another year to ship
them :\

You might be reasonably served by a custom build (in-place or even
package). YMMV...

Jim

On Tue, Jun 13, 2023, 18:38 Kari Lempiäinen <kari.lempiainen at summerday.net>
wrote:

> Hi,
>
>
>
> Great work Jim! I’m glad you could reproduce the problem and found a
> potential culprit.
>
>
>
> Just for my own interest I restored upsshed from my backups (version
> 2.7.4-13) and it seems to running ok, so no big runtime changes regarding
> that with Debian 12. It is not hogging CPU. From the daemon log the
> heartbeat seems to be working ok. Only difference between the old logs (pre
> Debian 12 update) is the there is this line (Network UPS Tools upsmon
> 2.8.0) every five minutes.
>
>
>
> Jun 13 19:17:08 fricka upssched[1896873]: Timer daemon started
>
> Jun 13 19:17:08 fricka upssched[1896873]: New timer:
> heartbeat-failure-timer (660 seconds)
>
> Jun 13 19:17:08 fricka nut-monitor[1896870]: Network UPS Tools upsmon 2.8.0
>
> Jun 13 19:22:08 fricka nut-monitor[1899911]: Network UPS Tools upsmon 2.8.0
>
> Jun 13 19:27:08 fricka upssched[1896873]: Cancelling timer:
> heartbeat-failure-timer
>
> Jun 13 19:27:08 fricka upssched[1896873]: New timer:
> heartbeat-failure-timer (660 seconds)
>
> Jun 13 19:27:08 fricka nut-monitor[1903379]: Network UPS Tools upsmon 2.8.0
>
> Jun 13 19:32:13 fricka nut-monitor[1906677]: Network UPS Tools upsmon 2.8.0
>
>
>
> That line wasn’t there previously. I will leave the old version in place
> until nut gets fixed and it’s been provided by Debian.
>
>
>
> This actually reminded me of a similar problem I had myself a loooong time
> ago. I was writing a large software distribution program for a Windows
> platform. The connection between server and client were done by TCP
> sockets. In my server code there was place where I read reply from the
> client and the read (from socket) function didn’t return error code, but
> the data length was 0. In the documentation, at least at the time, it
> wasn’t specified directly that it was an error situation. My server code
> looped in a thread hogging CPU. When I modified the code to treat 0 bytes
> read as an error, everything worked fine.
>
>
>
> Best regards,
>
> Kari
>
>
>
> *From: *Jim Klimov <jimklimov+nut at gmail.com>
> *Date: *Tuesday, 13. June 2023 at 18.36
> *To: *Kari Lempiäinen <kari.lempiainen at summerday.net>
> *Cc: *Arnaud Quette via Nut-upsuser <nut-upsuser at alioth-lists.debian.net>,
> nut-upsdev <nut-upsdev at alioth-lists.debian.net>, Dimitris Economou <
> dimitris.s.economou at gmail.com>
> *Subject: *Re: [Nut-upsuser] Upssched 100% CPU after updating Debian 12
>
> So... determining that FD is to be reaped proved hard. Internet lore
> suggests fcntl() and poll() on the FD, but it just seems valid to them. The
> errno is also usually not raised (once I saw a "111: Connection refused"
> though).
>
> So the best dumb idea so far is to bail out if we spent the whole loop
> (128 attempts) and only got zero-sized read replies and no errors.
>
>
>
> https://github.com/networkupstools/nut/pull/1965
>
>
>
> Jim
>
>
>
> On Tue, Jun 13, 2023 at 4:06 PM Jim Klimov <jimklimov+nut at gmail.com>
> wrote:
>
> After launching the command several times, with debug (posted by new code
> in a new branch for the investigation) confirming that the same daemon
> handles operations from the new client instances, its strace now has
> numerous FDs to report after select() - so I guess it is a problem of
> detecting an exit of the counterpart.
>
>
>
>    0.000000     [D2] parse_at: is 'heartbeat at localhost' in AT command the
> 'heartbeat at localhost' we were launched to process?
>    0.000034     [D1] parse_at: 'heartbeat at localhost' in AT command
> matched the 'heartbeat at localhost' UPSNAME we were launched to process
>    0.000041     [D1] parse_at: processing CANCEL-TIMER
>    0.000054     [D2] parse_at: is 'heartbeat at localhost' in AT command the
> 'heartbeat at localhost' we were launched to process?
>    0.000057     [D1] parse_at: 'heartbeat at localhost' in AT command
> matched the 'heartbeat at localhost' UPSNAME we were launched to process
>    0.000060     [D1] parse_at: processing START-TIMER
>    0.000151     [D1] Keeping stderr open due to debug verbosity 8
>    0.000195     Timer daemon started
>    0.000204     [D2] Timer daemon waiting for connections on pipefd 10
>    0.250325     [D3] new connection on fd 7
>    0.250377     New timer: heartbeat-failure-timer (660 seconds)
>    0.250423     [D1] Exiting upssched (CLI process)
>
>
>
>    0.000000     [D2] parse_at: is 'heartbeat at localhost' in AT command the
> 'heartbeat at localhost' we were launched to process?
>    0.000039     [D1] parse_at: 'heartbeat at localhost' in AT command
> matched the 'heartbeat at localhost' UPSNAME we were launched to process
>    0.000047     [D1] parse_at: processing CANCEL-TIMER
>   14.745773     [D3] new connection on fd 8
>   14.745829     Cancelling timer: heartbeat-failure-timer
>    0.000153     [D2] parse_at: is 'heartbeat at localhost' in AT command the
> 'heartbeat at localhost' we were launched to process?
>    0.000159     [D1] parse_at: 'heartbeat at localhost' in AT command
> matched the 'heartbeat at localhost' UPSNAME we were launched to process
>    0.000163     [D1] parse_at: processing START-TIMER
>   14.745890     [D3] new connection on fd 9
>   14.745969     New timer: heartbeat-failure-timer (660 seconds)
>    0.000293     [D1] Exiting upssched (CLI process)
>
>
>
>    0.000001     [D2] parse_at: is 'heartbeat at localhost' in AT command the
> 'heartbeat at localhost' we were launched to process?
>    0.000042     [D1] parse_at: 'heartbeat at localhost' in AT command
> matched the 'heartbeat at localhost' UPSNAME we were launched to process
>    0.000050     [D1] parse_at: processing CANCEL-TIMER
>   47.674032     [D3] new connection on fd 11
>   47.674160     Cancelling timer: heartbeat-failure-timer
>    0.000266     [D2] parse_at: is 'heartbeat at localhost' in AT command the
> 'heartbeat at localhost' we were launched to process?
>    0.000277     [D1] parse_at: 'heartbeat at localhost' in AT command
> matched the 'heartbeat at localhost' UPSNAME we were launched to process
>    0.000284     [D1] parse_at: processing START-TIMER
>   47.674235     [D3] new connection on fd 12
>   47.674388     New timer: heartbeat-failure-timer (660 seconds)
>    0.000487     [D1] Exiting upssched (CLI process)
>
>
>
> Note that `select` only lists remnants of "new connection" but not the
> FD#10 (pipefd):
>
>
>
>     select(13, [7 8 9 10 11 12], NULL, NULL, {tv_sec=1, tv_usec=0}) = 5
> (in [7 8 9 11 12], left {tv_sec=0, tv_usec=999997})
>
>
>
> Jim
>
>
>
>
>
>
>
> On Tue, Jun 13, 2023 at 3:41 PM Jim Klimov <jimklimov+nut at gmail.com>
> wrote:
>
> So, got some good news: I hear(*) I managed to reproduce the problem with
> current NUT master and an adapted copy of your posted configs and script :D
>
> Experimental debugging now sounds possible.
>
>
>
> (*) PC under the desk wails with all its cooling fans as soon as I started
> the client which spawned a daemon and itself had exited:
>
>
>
> $ UPSNAME=heartbeat at localhost NOTIFYTYPE=ONBATT NUT_STATEPATH=/dev/shm
> NUT_CONFPATH=`pwd`/tests/NIT/tmp/etc ./clients/upssched -DDDDDDDD
>    0.000000     [D2] parse_at: is 'heartbeat at localhost' in AT command the
> 'heartbeat at localhost' we were launched to process?
>    0.000053     [D1] parse_at: 'heartbeat at localhost' in AT command
> matched the 'heartbeat at localhost' UPSNAME we were launched to process
>    0.000066     [D1] parse_at: processing CANCEL-TIMER
>    0.000089     [D2] parse_at: is 'heartbeat at localhost' in AT command the
> 'heartbeat at localhost' we were launched to process?
>    0.000096     [D1] parse_at: 'heartbeat at localhost' in AT command
> matched the 'heartbeat at localhost' UPSNAME we were launched to process
>    0.000104     [D1] parse_at: processing START-TIMER
>
>
>
> $
>
>
>
> Jim
>
>
>
>
>
> On Tue, Jun 13, 2023 at 1:58 PM Kari Lempiäinen <
> kari.lempiainen at summerday.net> wrote:
>
> Hi,
>
>
>
> Thanks Jim! Here is more system information from the commands you
> mentioned.
>
>
>
> Kari
>
>
>
> root at fricka:~# lsof -p 1716171
>
> lsof: WARNING: can't stat() fuse.gvfsd-fuse file system /run/user/1002/gvfs
>
>       Output information may be incomplete.
>
> COMMAND      PID USER   FD   TYPE             DEVICE SIZE/OFF     NODE NAME
>
> upssched 1716171 root  cwd    DIR                8,2     4096        2 /
>
> upssched 1716171 root  rtd    DIR                8,2     4096        2 /
>
> upssched 1716171 root  txt    REG                8,2    39240 21762148
> /usr/sbin/upssched
>
> upssched 1716171 root  mem    REG                8,2  1922136 21764230
> /usr/lib/x86_64-linux-gnu/libc.so.6
>
> upssched 1716171 root  mem    REG                8,2   210968 21764476
> /usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2
>
> upssched 1716171 root    0u   CHR                1,3      0t0        4
> /dev/null
>
> upssched 1716171 root    1u   CHR                1,3      0t0        4
> /dev/null
>
> upssched 1716171 root    2u   CHR                1,3      0t0        4
> /dev/null
>
> upssched 1716171 root    3u  unix 0x00000000eeb52825      0t0 16427010
> type=DGRAM (CONNECTED)
>
> upssched 1716171 root    4r   REG                8,2      545 39847858
> /etc/nut/upssched.conf
>
> upssched 1716171 root    5u  unix 0x00000000a035b466      0t0 16426351
> type=STREAM (UNCONNECTED)
>
> upssched 1716171 root    6u  unix 0x0000000050820b2b      0t0 16426352
> type=STREAM (UNCONNECTED)
>
> upssched 1716171 root    7u  unix 0x00000000447aac0e      0t0 16424724
> /var/run/nut/upssched.pipe type=STREAM (CONNECTED)
>
> upssched 1716171 root   10u  unix 0x00000000d7bd8088      0t0 16424722
> /var/run/nut/upssched.pipe type=STREAM (LISTEN)
>
>
>
>
>
> root at fricka:~# ps -ef|grep ups
>
> root        1425       1  0 Jun11 ?        00:00:00 sshd: /usr/sbin/sshd
> -D [listener] 0 of 10-100 startups
>
> root     1156722       1  0 00:00 ?        00:00:00 /usr/sbin/cupsd -l
>
> root     1156723       1  0 00:00 ?        00:00:00 /usr/sbin/cups-browsed
>
> lp       1156748 1156722  0 00:00 ?        00:00:00
> /usr/lib/cups/notifier/dbus dbus://
>
> nut      1713117       1  0 14:30 ?        00:00:00 /lib/nut/dummy-ups -a
> heartbeat
>
> nut      1713154       1  0 14:30 ?        00:00:00 /lib/nut/usbhid-ups -a
> eaton5s
>
> nut      1713155       1  0 14:30 ?        00:00:00 /lib/nut/upsd -F
>
> root     1713156       1  0 14:30 ?        00:00:00 /lib/nut/upsmon -F
>
> root     1713157 1713156  0 14:30 ?        00:00:00 /lib/nut/upsmon -F
>
> root     1716171       1 99 14:35 ?        00:05:10 /sbin/upssched UPS
> heartbeat at localhost: On battery.
>
> root     1719510 1550505  0 14:40 pts/2    00:00:00 grep ups
>
>
>
>
>
>
>
>
>
> root at fricka:~# fuser /var/run/nut/upssched.pipe
>
> /run/nut/upssched.pipe: 1716171
>
>
>
>
>
> root at fricka:/proc/1716171/fd# ls -la
>
> total 0
>
> dr-x------ 2 root root  0 Jun 13 14:36 .
>
> dr-xr-xr-x 9 root root  0 Jun 13 14:35 ..
>
> lrwx------ 1 root root 64 Jun 13 14:36 0 -> /dev/null
>
> lrwx------ 1 root root 64 Jun 13 14:36 1 -> /dev/null
>
> lrwx------ 1 root root 64 Jun 13 14:36 10 -> 'socket:[16424722]'
>
> lrwx------ 1 root root 64 Jun 13 14:36 2 -> /dev/null
>
> lrwx------ 1 root root 64 Jun 13 14:36 3 -> 'socket:[16427010]'
>
> lr-x------ 1 root root 64 Jun 13 14:36 4 -> /etc/nut/upssched.conf
>
> lrwx------ 1 root root 64 Jun 13 14:36 5 -> 'socket:[16426351]'
>
> lrwx------ 1 root root 64 Jun 13 14:36 6 -> 'socket:[16426352]'
>
> lrwx------ 1 root root 64 Jun 13 14:36 7 -> 'socket:[16424724]'
>
>
>
>
>
>
> ------------------------------
>
> *From:* Jim Klimov <jimklimov+nut at gmail.com>
> *Sent:* Tuesday, June 13, 2023 14:17
> *To:* Kari Lempiäinen <kari.lempiainen at summerday.net>
> *Cc:* Arnaud Quette via Nut-upsuser <nut-upsuser at alioth-lists.debian.net>;
> nut-upsdev <nut-upsdev at alioth-lists.debian.net>; Dimitris Economou <
> dimitris.s.economou at gmail.com>
> *Subject:* Re: [Nut-upsuser] Upssched 100% CPU after updating Debian 12
>
>
>
> FWIW, cross-posted the issue to NUT GitHub tracker:
> https://github.com/networkupstools/nut/issues/1964
>
>
>
> Jim
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://alioth-lists.debian.net/pipermail/nut-upsuser/attachments/20230613/4ff32544/attachment-0001.htm>


More information about the Nut-upsuser mailing list