[Nut-upsdev] [Nut-upsuser] Upssched 100% CPU after updating Debian 12

Jim Klimov jimklimov+nut at gmail.com
Tue Jun 13 09:41:01 BST 2023


Thanks for the interesting puzzle to crunch!

Looking at these few breadcrumbs, I'd wager an educated guess that the
loop is in `sendcmd()` (where CLI child processes talk to a daemonized copy
which tracks the timers for events), around here:
https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L583-L626
for the daemon and here for the child:
https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L762-L787

In both cases we run an infinite loop with `select()` to wait up to 1
second and then `read()` from the pipe. The common expectation is that if
there is nothing to read, the program blocks in `select()` waiting for the
kernel, and only returns after the specified timeout - so reliably that
some sub-second-precision `sleep()` implementations are built on exactly
this behavior on systems that lack `usleep()`.

With `select()` in your capture polling an FD set of `[7 10]`, I'd say this
is the parent loop: one of those is the `pipefd`, and the other is the FD
for the socket connection to a particular child on the other side of the
pipe.

Here I must guess that either the `pselect6()` implementation behaves
differently - which would be surprising after decades of settled behavior -
or that it for some reason decides that FD #7 (whoever that is) raised its
hand but was then silent (0 bytes read).

The next clue is the 128 reads, which leads to
https://github.com/networkupstools/nut/blob/v2.8.0/clients/upssched.c#L485
the `sock_read()` function, which indeed makes up to `US_MAX_READ==128`
attempts to read one byte at a time from a connection (so from the
CLI/child counterpart) - so that is who FD #7 must be.

So it is just at this point that I get stuck explaining what happens and
why :) Possibly there was a child or CLI counterpart process which died and
somehow the connection did not get closed (a proper close would return a
`read()` error and evict it from the conn list) - e.g. it is
frozen/zombified - or something else connected to the pipe (antivirus,
heh?) and loiters there, or something else I could not think of quickly...

The next clue could be to look at who is connected to the pipe file, maybe
with `fuser`, or to check `ps -ef` for other `upssched` processes, or to
look in `/proc/PIDNUM/fd/` of the daemonized, looping `upssched`...

If anyone decides to trace through the code, the potential processing loop
goes
    checkconf -> conf_arg -> parse_at -> sendcmd -> daemon if needed ->
start_daemon -> conn_add(pipefd) or sock_read(conn)
with short-lived child processes (launched from the `upsmon` command line)
parsing the configuration to pick the part relevant to them per envvars,
and talking to the daemon (forking one if needed) to start/cancel the
timers.

Jim

On Tue, Jun 13, 2023 at 8:33 AM Kari Lempiäinen <
kari.lempiainen at summerday.net> wrote:

> Hi,
>
> I ran the strace command while upssched was 100% CPU hungry. This is what
> I got:
>
> 1686633611.702798 read(7, "", 1)        = 0 <0.000004>
> 1686633611.702816 read(7, "", 1)        = 0 <0.000004>
> 1686633611.702834 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0},
> NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999998800}) <0.000006>
> 1686633611.702862 read(7, "", 1)        = 0 <0.000004>
> 1686633611.702880 read(7, "", 1)        = 0 <0.000004>
> .
> .
> .
> 1686633611.705206 read(7, "", 1)        = 0 <0.000004>
> 1686633611.705224 read(7, "", 1)        = 0 <0.000004>
> 1686633611.705242 pselect6(11, [7 10], NULL, NULL, {tv_sec=1, tv_nsec=0},
> NULL) = 1 (in [7], left {tv_sec=0, tv_nsec=999999204}) <0.000005>
> 1686633611.705268 read(7, "", 1)        = 0 <0.000004>
> 1686633611.705288 read(7, "", 1)        = 0 <0.000302>
>
> there are 128 "read(7" lines between every pair of pselect6 lines (I
> counted the read lines for a couple of blocks, 128 in each of them).
> Nothing else in the strace log, just read and pselect lines for the short
> time I let strace run.
>
> Any help from this?
>
> Best regards,
> Kari
> ------------------------------
>
> *<more history snipped>*
>

