[Nut-upsdev] Bug: upssched and 3+ UPSes

Jørgensen, Richard Richard.Jorgensen at beumergroup.com
Tue Mar 26 14:43:44 GMT 2019


Hi,

I've stumbled over a race condition in upssched when using 3 or more UPSes in combination with START-TIMER/CANCEL-TIMER.
I have a solution that works but is ugly, so I'm going to describe the bug here in the hope somebody can come up with a better solution. 

The problem is in upssched.c:629 check_parent() (see https://github.com/networkupstools/nut/blob/master/clients/upssched.c ). 
When all three UPSes goes on battery at the same time (think major power outage) upssched is started three times. This first upssched to get the lockfn file should start the timer-daemon while the other two upssched instances should wait for it to start up and then try to connect again. It works fine with two UPSes, but with three UPSes the following happens:

upssched#1: UPS1 sees the timer daemon is not running, takes the lockfn file and starts to launch a timer daemon.
upssched#2: UPS2 sees the timer daemon is not running, and the lockfn is in use, so it deletes the lockfn and goes to sleep for 0.25 seconds.
upssched#3: UPS3 UPS1 sees the timer daemon is not running, takes the lockfn file(released by upssched#2, not upssched#1) and starts to launch a timer daemon.
Now TIMER-START event will go to two different upssched timer daemons. 
When power comes back on a little while later all CANCEL-TIMER events are sent to the timer daemon started by upssched#3 and the timer events from upssched#1 are never canceled.

My workaround is to make sure other upssched instances only deletes lockfn as a last resort, not the first time it sees it. It isn't a good solution because it will still have the same race condition if a lockfn was left by a crashed upssched:


diff -up nut-2.7.2/clients/upssched.c.racecondition nut-2.7.2/clients/upssched.c
--- nut-2.7.2/clients/upssched.c.racecondition  2014-02-19 20:31:10.000000000 +0100
+++ nut-2.7.2/clients/upssched.c        2019-03-26 15:30:19.972184228 +0100
@@ -648,8 +648,10 @@ static int check_parent(const char *cmd,
 
                /* we didn't get the lock - must be two upsscheds running */
 
-               /* blow this away in case we crashed before */
-               unlink(lockfn);
+               if (tries >= MAX_TRIES-1) {
+                 /* blow this away in case we crashed before */
+                 unlink(lockfn);
+               }
 
                /* give the other one a chance to start it, then try again */
                usleep(250000);






----
Regards, Richard



More information about the Nut-upsdev mailing list