[Nut-upsuser] Shutdown the servers first, keep the network running

Greg Troxel gdt at lexort.com
Sat Nov 23 13:13:13 GMT 2024


Dan Langille via Nut-upsuser <nut-upsuser at alioth-lists.debian.net>
writes:

> I have an idea for my shutdown process at home. My goal: maximize the network run-time. At present, the UPS has a run-time of about 57 minutes.
>
> This is my idea:
>
> * shutdown the servers after 15 minutes of downtime (for me, that's when battery.runtime hits 40)
> * leave the network gear (switches, firewall, wifi) running so I can continue with Internet access
>
> Optionally:
> * when we get down to 10 minutes, let everything else shutdown
>
> The goal: I can keep working from my home office - there's a separate UPS up there.

I live in a town with a high ratio of

  trees that could fall on a line
  /
  electric meters

and thus we have a fairly large number of outages, even though our power
company does a great job.

I will second Kelly's point that once power has been out 5 minutes it is
unlikely to be back soon.  I have been keeping track, and basically

  - There are a lot of 3-5s outages.  I believe these are faults that
    clear themselves (squirrel stops conducting :-( or branch falls the
    rest of the way) or have cleared by the time the recloser recloses.
    Sometimes it is 5s out, 2s on, 5s out, 2s on, 5s out, back on.
    Sometimes that and then just out.  Often just out cleanly, and
    sometimes pretty messy.

  - There was one outage recently of just about 3 minutes.  I don't
    understand what happened, but it was apparently substation-wide.  I
    am guessing some protection tripped and because they were there it
    could be brought up again fast.   I have no memory of this ever
    happening before.

  - There was a scheduled outage at 0300, when it was well below
    freezing, to remove branches from a 115 kV line, that lasted only
    13m.  A huge round of applause for the guy in the bucket truck who
    did not damage the transmission line!  Also, I had a 19m outage,
    which I suspect was also planned (if not announced) as part of
    restoring others after damage.

  - After that, I think the fastest was 28m, and 40-70m typical, for
    things that were "minor".  There have been changes to distribution
    protection and these are rarer; I think reclosers are effective at
    ensuring that close-to-fault protection devices open, enabling the
    rest to stay on.

  - Plus some longer ones (multiple hours to small numbers of days),
    resulting from more serious damage, from trees on wires to broken
    poles.

So aside from the single 3m outage, I would have told you that once
power has been out for 30s, it's going to be 30m at least, maybe more.
But given that, the 5m guidance sounds good.


The other thing is that I more or less believe that running the UPS all
the way out is probably rougher on batteries than shutting down when it
claims 10m.  But I also believe that UPS service is really tough on
batteries and they seem to be reliably in need of replacement at 4 years.
And, almost every battery I have pulled from a UPS (which I do when it
becomes troubled) has been messed up, usually a shorted cell or a very
weak cell.  Whereas batteries proactively pulled after 5y from a FiOS
ONT are often ok.  So I am not at all sure that trying to be nice to
the batteries is a good strategy.


So I would recommend:

  - shut down servers after 5 minutes of outage
  - shut down firewall and killpower when runtime <10m (or maybe 5m)
  - have some way to start servers, such as a switch controllable by
    the firewall
  - once you have a way to bring servers back hands off, consider server
    shutdown at 30s of outage, and restore after 15m of no outage

  - go over all your non-server stuff and think about whether you can reduce
    usage

  - log outages and also transfer-to-battery events.  Log remaining
    runtime vs time so you can see what the mapping is from reported
    runtime to actual runtime
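
As a rough illustration, the timing rules in the list above could be
written down as a small decision function.  This is only a sketch under
stated assumptions: the function name, the action strings, and the idea
of polling state are hypothetical, it does not talk to NUT at all, and
the thresholds are simply the numbers suggested above (5 minutes on
battery, under 10 minutes of runtime, 15 minutes of steady mains).

```python
# Thresholds copied from the recommendations above; tune them to your own log.
SERVER_SHUTDOWN_AFTER_S = 5 * 60     # servers go down after 5 minutes on battery
KILLPOWER_BELOW_RUNTIME_S = 10 * 60  # everything goes down below 10 minutes of runtime
SERVER_RESTORE_AFTER_S = 15 * 60     # servers come back after 15 minutes of steady mains


def decide(on_battery: bool, seconds_in_state: float,
           runtime_s: float, servers_up: bool) -> str:
    """Return a (hypothetical) action name for the current state.

    on_battery        True while the UPS is running from battery
    seconds_in_state  how long power has been continuously out (or back)
    runtime_s         remaining runtime reported by the UPS, in seconds
    servers_up        whether the servers are currently running
    """
    if on_battery:
        # Low runtime wins over everything else: stop firewall, kill power.
        if runtime_s < KILLPOWER_BELOW_RUNTIME_S:
            return "shutdown_all_and_killpower"
        # Outage has lasted long enough that it is unlikely to end soon.
        if servers_up and seconds_in_state >= SERVER_SHUTDOWN_AFTER_S:
            return "shutdown_servers"
        return "wait"
    # On mains: only restart servers once power has been stable for a while.
    if not servers_up and seconds_in_state >= SERVER_RESTORE_AFTER_S:
        return "start_servers"
    return "wait"
```

For example, decide(True, 300, 2400, True) gives "shutdown_servers",
and decide(False, 600, 3000, False) gives "wait" because mains has
only been back for 10 minutes.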

Your outage patterns may be different, so I may be off about the precise
timings.  I suspect though, that there is a gulf between "protection
device restores power in seconds" and "truck roll".   28m for drive to
fault, visually inspect, decide it's cleared, replace fuse, is amazing
and only happens if the people are in the office next to the truck, and
even then it needs more luck.
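
To check whether your own outages show the same gulf, something like
the following could be run over a log of outage durations.  It is a
minimal sketch: the bucket boundaries (30s, 5m, 30m) are the ones
discussed in this message, the function names are made up, and the
sample durations in the usage note are illustrative values, not my
actual log.

```python
from collections import Counter


def bucket(duration_s: float) -> str:
    """Place one outage duration (in seconds) into a coarse bucket."""
    if duration_s < 30:
        return "under 30 seconds"
    if duration_s < 5 * 60:
        return "30 seconds to 5 minutes"
    if duration_s < 30 * 60:
        return "5 to 30 minutes"
    return "30 minutes or more"


def summarize(durations_s) -> Counter:
    """Count how many logged outages fall into each bucket."""
    return Counter(bucket(d) for d in durations_s)
```

Calling summarize([4, 5, 180, 780, 1140, 1680, 2400]) counts two
outages under 30 seconds, one between 30 seconds and 5 minutes, three
between 5 and 30 minutes, and one of 30 minutes or more.  If the middle
buckets are nearly empty in your log, the 5m (or even 30s) guidance
probably applies to you too.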


