[Nut-upsdev] git stable, cppunit?
Greg Troxel
gdt at lexort.com
Mon Dec 2 14:06:31 GMT 2024
Jim Klimov <jimklimov+nut at gmail.com> writes:
> you're likely on to something!
>
> While NUT CI farm runs, give or take, 10^3 builds and tests across the
> matrix of platforms, toolkits and dependencies for each iteration, and most
> of these pass green or catch true coding errors, I did occasionally see
> failed C++ tests (and also NIT where it waits for both OL and OB to be seen
> on a dummy-ups over certain time).
>
> Mostly this correlated with slow-down of build agents (esp. VMs on
> congested hosts), and maybe kernel or its context-switching-under-stress
> tuning (openbsd and macos seen more often than others), but I did not
> succeed pinpointing the problem for the C++ case.
This is running on a real computer, not a VM, and it's a 9th gen i7 with
32GB of RAM which does things speedily in general.
> In that OL-OB test of NIT, had to sort of write it off - if the VM is too
> busy that a 1-second timer flip is not happening/detected over 10 seconds,
> it is a SUT problem more than NUT problem. A real system on battery and
> frantically shutting down (causing stress/slowness) might have power lost
> during that time though.
Yes, if a signal doesn't show up in 10s, that's an issue. But I don't
think that's what we are seeing here.
> IPC tests are similarly flawed by nature, communicating two processes
> that both have to get a slice of CPU in a given time frame for the test (or
> real-life reaction), but if you can get something to fail reliably in
> reasonable conditions (relevant under normal load) - that's really
> encouraging for the prospect of fixing it.
I would say that if a test fails, we need to be able to say that the SUT
is broken. But I don't see that here.
So:
Is it expected that a failed test will dump core? This is surprising
to me. I'd expect that failures would just be counted and printed
out.
Are you sure the tests reliably use the as-built libs and do not
reach into any previous nut installation on the system?
(I just de-installed 2.8.2 and get the same issue.)
Running the unit test by hand (now that the other libs are gone, it
feels safe):
$ ./cppunittest
D: Getting test suite...
D: Preparing test runner...
D: Setting test runner outputter...
D: Launching the test run...
.................................F.terminate called after throwing an instance of 'std::runtime_error'
what(): Poll on communication pipe read end 5 failed: 4
Abort trap (core dumped)
4 is EINTR, and that means select was interrupted, perhaps by a signal.
That does not seem necessarily buggy. There is a comment in nutipc.hpp
indicating that recovery probably should happen, but it's not
implemented.
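The usual recovery for EINTR is simply to retry the call. Below is a minimal sketch of that idea, assuming a plain blocking select on a single read descriptor. The names wait_readable_retrying and demo_pipe_wait are hypothetical and are not taken from nutipc.hpp; the real code may need a timeout or different error reporting.

```cpp
#include <cassert>
#include <cerrno>
#include <sys/select.h>
#include <unistd.h>

// Hypothetical helper (not part of NUT): block until fd is readable,
// retrying whenever select() is interrupted by a signal (EINTR).
// Returns 0 on success, or the errno value of a genuine failure.
int wait_readable_retrying(int fd)
{
    assert(fd >= 0 && fd < FD_SETSIZE);

    for (;;) {
        fd_set rfds;
        FD_ZERO(&rfds);
        FD_SET(fd, &rfds);  // rebuild each time, select() may modify the set

        const int rc = select(fd + 1, &rfds, nullptr, nullptr, nullptr);
        if (rc < 0) {
            if (errno == EINTR)
                continue;   // a handler ran; nothing is wrong, wait again
            return errno;   // real error, report it to the caller
        }
        return 0;           // no timeout was given, so fd is readable
    }
}

// Usage sketch: a pipe that already holds one byte is readable at once.
bool demo_pipe_wait()
{
    int fds[2];
    if (pipe(fds) != 0)
        return false;

    const bool wrote = (write(fds[1], "x", 1) == 1);
    const bool ok = wrote && (wait_readable_retrying(fds[0]) == 0);

    close(fds[0]);
    close(fds[1]);
    return ok;
}
```

POSIX leaves it implementation-defined whether select restarts or returns EINTR when the handler was installed with SA_RESTART, so an explicit retry loop is the portable choice.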
ktrace shows
3507 3507 cppunittest 1733147496.035727910 CALL _lwp_create(0x7f7fffeaf860,0,0x7c38b65a80a0)
3507 3507 cppunittest 1733147496.035739244 RET _lwp_create 0
3507 3507 cppunittest 1733147496.035761910 CALL __sigaction_sigtramp(SIGUSR1,0x7f7fffeafc10,0,0x7c38b497f620,2)
3507 3507 cppunittest 1733147496.035773243 RET __sigaction_sigtramp 0
3507 3507 cppunittest 1733147496.035784577 CALL __sigaction_sigtramp(SIGUSR2,0x7f7fffeafc10,0,0x7c38b497f620,2)
3507 3507 cppunittest 1733147496.035795910 RET __sigaction_sigtramp 0
3507 24482 cppunittest 1733147496.035800910 CALL _lwp_ctl(1,0x7c38b65a8148)
3507 3507 cppunittest 1733147496.035812243 CALL getpid
3507 24482 cppunittest 1733147496.035823577 RET _lwp_ctl 0
3507 3507 cppunittest 1733147496.035834910 RET getpid 3507/0xdb3, 4602/0x11fa
3507 24482 cppunittest 1733147496.035846243 CALL __select50(0x100,0x7c38b41ffde0,0,0,0)
3507 3507 cppunittest 1733147496.035857535 CALL kill(0xdb3, SIGUSR1)
3507 3507 cppunittest 1733147496.035891535 RET kill 0
3507 3507 cppunittest 1733147496.035914201 CALL kill(0xdb3, SIGUSR2)
3507 3507 cppunittest 1733147496.035948202 RET kill 0
3507 3507 cppunittest 1733147496.035959535 CALL kill(0xdb3, SIGUSR2)
3507 3507 cppunittest 1733147496.035993493 RET kill 0
3507 3507 cppunittest 1733147496.036016201 CALL kill(0xdb3, SIGUSR1)
3507 3507 cppunittest 1733147496.036027493 RET kill 0
3507 24482 cppunittest 1733147496.036004868 RET __select50 -1 errno 4 Interrupted system call
3507 3507 cppunittest 1733147496.036038826 CALL kill(0xdb3, SIGUSR1)
3507 3507 cppunittest 1733147496.036061493 RET kill 0
3507 3507 cppunittest 1733147496.036086159 CALL __nanosleep50(0x7f7fffeafdf0,0x7f7fffeafe00)
3507 24482 cppunittest 1733147496.036179118 PSIG SIGUSR1 caught handler=0x48555c mask=(): code=SI_USER sent by pid=3507, uid=10853)
3507 24482 cppunittest 1733147496.036190451 PSIG SIGUSR2 caught handler=0x48555c mask=(30): code=SI_USER sent by pid=3507, uid=10853)
So I'm not really sure what's going on, but it looks like select got
interrupted so the handler could run, and that seems OK. Maybe other
systems restart it automatically?