[Nut-upsdev] git stable, cppunit?

Greg Troxel gdt at lexort.com
Mon Dec 2 14:06:31 GMT 2024


Jim Klimov <jimklimov+nut at gmail.com> writes:

>   you're likely on to something!
>
>   While NUT CI farm runs, give or take, 10^3 builds and tests across the
> matrix of platforms, toolkits and dependencies for each iteration, and most
> of these pass green or catch true coding errors, I did occasionally see
> failed C++ tests (and also NIT where it waits for both OL and OB to be seen
> on a dummy-ups over certain time).
>
>   Mostly this correlated with slow-down of build agents (esp. VMs on
> congested hosts), and maybe kernel or its context-switching-under-stress
> tuning (openbsd and macos seen more often than others), but I did not
> succeed pinpointing the problem for the C++ case.

This is running on a real computer, not a VM, and it's a 9th gen i7 with
32GB of RAM which does things speedily in general.

>   In that OL-OB test of NIT, had to sort of write it off - if the VM is too
> busy that a 1-second timer flip is not happening/detected over 10 seconds,
> it is a SUT problem more than NUT problem. A real system on battery and
> frantically shutting down (causing stress/slowness) might have power lost
> during that time though.

Yes, if a signal doesn't show up in 10s, that's an issue.  But I don't
think that's what we are seeing here.

>   IPC tests are similarly flawed by nature, communicating two processes
> that both have to get a slice of CPU in a given time frame for the test (or
> real-life reaction), but if you can get something to fail reliably in
> reasonable conditions (relevant under normal load) - that's really
> encouraging for the prospect of fixing it.

I would say that if a test fails, we need to be able to say that the SUT
is broken.  But I don't see that here.



So:

  Is it expected that a failed test will dump core?  This is surprising
  to me.  I'd expect that failures would just be counted and printed
  out.

  Are you sure the tests reliably use the as-built libs and do not
  reach into any previous nut installation on the system?
  (I just de-installed 2.8.2 and get the same issue.)


Running the unit test by hand (now that the other libs are gone, it
feels safe):

  $ ./cppunittest 
  D: Getting test suite...
  D: Preparing test runner...
  D: Setting test runner outputter...
  D: Launching the test run...
  .................................F.terminate called after throwing an instance of 'std::runtime_error'
    what():  Poll on communication pipe read end 5 failed: 4
  Abort trap (core dumped)


4 is EINTR, which means select was interrupted, perhaps by a signal.
That does not seem necessarily buggy.  There is a comment in nutipc.hpp
that indicates that recovery probably should happen, but it's not
implemented.

ktrace shows

  3507   3507 cppunittest 1733147496.035727910 CALL  _lwp_create(0x7f7fffeaf860,0,0x7c38b65a80a0)
  3507   3507 cppunittest 1733147496.035739244 RET   _lwp_create 0
  3507   3507 cppunittest 1733147496.035761910 CALL  __sigaction_sigtramp(SIGUSR1,0x7f7fffeafc10,0,0x7c38b497f620,2)
  3507   3507 cppunittest 1733147496.035773243 RET   __sigaction_sigtramp 0
  3507   3507 cppunittest 1733147496.035784577 CALL  __sigaction_sigtramp(SIGUSR2,0x7f7fffeafc10,0,0x7c38b497f620,2)
  3507   3507 cppunittest 1733147496.035795910 RET   __sigaction_sigtramp 0
  3507  24482 cppunittest 1733147496.035800910 CALL  _lwp_ctl(1,0x7c38b65a8148)
  3507   3507 cppunittest 1733147496.035812243 CALL  getpid
  3507  24482 cppunittest 1733147496.035823577 RET   _lwp_ctl 0
  3507   3507 cppunittest 1733147496.035834910 RET   getpid 3507/0xdb3, 4602/0x11fa
  3507  24482 cppunittest 1733147496.035846243 CALL  __select50(0x100,0x7c38b41ffde0,0,0,0)
  3507   3507 cppunittest 1733147496.035857535 CALL  kill(0xdb3, SIGUSR1)
  3507   3507 cppunittest 1733147496.035891535 RET   kill 0
  3507   3507 cppunittest 1733147496.035914201 CALL  kill(0xdb3, SIGUSR2)
  3507   3507 cppunittest 1733147496.035948202 RET   kill 0
  3507   3507 cppunittest 1733147496.035959535 CALL  kill(0xdb3, SIGUSR2)
  3507   3507 cppunittest 1733147496.035993493 RET   kill 0
  3507   3507 cppunittest 1733147496.036016201 CALL  kill(0xdb3, SIGUSR1)
  3507   3507 cppunittest 1733147496.036027493 RET   kill 0
  3507  24482 cppunittest 1733147496.036004868 RET   __select50 -1 errno 4 Interrupted system call
  3507   3507 cppunittest 1733147496.036038826 CALL  kill(0xdb3, SIGUSR1)
  3507   3507 cppunittest 1733147496.036061493 RET   kill 0
  3507   3507 cppunittest 1733147496.036086159 CALL  __nanosleep50(0x7f7fffeafdf0,0x7f7fffeafe00)
  3507  24482 cppunittest 1733147496.036179118 PSIG  SIGUSR1 caught handler=0x48555c mask=(): code=SI_USER sent by pid=3507, uid=10853)
  3507  24482 cppunittest 1733147496.036190451 PSIG  SIGUSR2 caught handler=0x48555c mask=(30): code=SI_USER sent by pid=3507, uid=10853)


So I'm not really sure what's going on, but it looks like select got
interrupted so the handler could run, and that seems ok.  Maybe other
systems auto-restart it?
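
One way to check the restart question on a given system is a small
probe like the sketch below.  As far as I know, POSIX leaves it
implementation-defined whether select is restarted when the handler was
installed with SA_RESTART, and Linux documents that it never restarts
select.  The function name select_gets_eintr is made up for this
example, and it uses SIGALRM from a timer rather than the SIGUSR1 and
SIGUSR2 signals the test sends.

```cpp
#include <cassert>
#include <cerrno>
#include <csignal>
#include <stdexcept>
#include <sys/select.h>
#include <sys/time.h>
#include <unistd.h>

extern "C" void on_alarm(int) {}  // empty handler; it only has to run

// Hypothetical probe: returns true if a select() blocked on an empty
// pipe fails with EINTR when SIGALRM arrives, false if it was restarted
// and ran into its 2 second timeout instead.
bool select_gets_eintr(bool use_sa_restart) {
    int fds[2];
    if (pipe(fds) != 0) throw std::runtime_error("pipe failed");
    assert(fds[0] < FD_SETSIZE);

    struct sigaction sa {};
    struct sigaction old {};
    sa.sa_handler = on_alarm;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = use_sa_restart ? SA_RESTART : 0;
    if (sigaction(SIGALRM, &sa, &old) != 0) {
        close(fds[0]);
        close(fds[1]);
        throw std::runtime_error("sigaction failed");
    }

    itimerval it {};               // one-shot timer, fires after 100 ms
    it.it_value.tv_usec = 100000;
    setitimer(ITIMER_REAL, &it, nullptr);

    fd_set rfds;
    FD_ZERO(&rfds);
    FD_SET(fds[0], &rfds);
    timeval tv;
    tv.tv_sec = 2;
    tv.tv_usec = 0;
    const int rc = select(fds[0] + 1, &rfds, nullptr, nullptr, &tv);
    const int err = errno;

    itimerval off {};              // cancel the timer, restore old handler
    setitimer(ITIMER_REAL, &off, nullptr);
    sigaction(SIGALRM, &old, nullptr);
    close(fds[0]);
    close(fds[1]);

    return rc == -1 && err == EINTR;
}
```

Calling select_gets_eintr(false) should return true on any POSIX
system; the interesting case is select_gets_eintr(true), whose result
tells you whether that platform restarts select under SA_RESTART.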



