[Nut-upsdev] Asking hard questions about the NUT architecture

Tue May 29 21:48:06 UTC 2007

On 5/29/07, Eric S. Raymond <esr at thyrsus.com> wrote:
> If it's not reliable enough for you, the right thing to do is fix the
> journaling, not pile on several layers of userspace kluges.  But
> you're worrying about a phantom, anyway.  It's already reliable enough
> for groups like the old OSDL's Linux High Availability project to not
> have considered it a serious issue.  And those guys were aiming at
> *telecoms-grade* uptime, which is better than you probably need.

Journaling isn't magical. No amount of "fixing" is going to change that.

And telecom-grade uptime is just that... _uptime_. Uptime for
services, not for individual machines supporting that service.

If I have three boxes in an HA setup, and someone trips over the power
cord for one of them, and it doesn't come back up again because the
filesystem got corrupted, a data file got corrupted, or there's just
some lock file lying around, I still have 100% uptime, but I also have
a dead machine.

And carriers usually have UPSes backing their HA setups too (and
diesel generators, meaning they don't lose power, period).

> Yes, yes, I *do* understand about atomic database-transaction groups.
> Note that these are normally handled by what is, in effect,
> userspace-level journaling (which is what "two-phase commit" means).
> If the filesystem gets its part of the job right, so will any
> competently-written database.  If your database isn't that competently
> written, you've got bigger problems than a UPS will solve.

Well, I guess a UPS would fix at least the problems related to
unexpected corruption due to power outages, which is much better than
not having a UPS.

> Because normally the only response needed is to sync the file
> descriptors, which is what happens anyway.  The few apps that get
> SIGPWR wrong are probably screwing the pooch on SIGTERM too, which is
> how they see a UPS-controlled shutdown.

Every application running as a daemon has a way to shut itself down.
Whether it does that by handling SIGTERM or through some other
mechanism is completely irrelevant. What matters is that running
"/etc/init.d/someservice stop" does the trick, and a UPS-triggered
shutdown is going to do just that.
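
To make that concrete, here is a minimal sketch in Python of the usual
pattern: the signal handler only records that a stop was requested, and
the main loop does the actual cleanup. This is not taken from NUT or
from any particular daemon; run_daemon, request_stop, work and cleanup
are hypothetical names made up for the illustration.

```python
import os
import signal
import threading

# Set by the signal handler, polled by the main loop.
stop_requested = threading.Event()


def request_stop(signum, frame):
    # Keep the handler trivial: just record the request.
    stop_requested.set()


def run_daemon(work, cleanup, poll_seconds=0.1):
    """Call work() repeatedly until SIGTERM arrives, then call cleanup().

    work and cleanup are caller-supplied callables (hypothetical here).
    Must be called from the main thread, since signal.signal requires it.
    """
    signal.signal(signal.SIGTERM, request_stop)
    while not stop_requested.is_set():
        work()
        # Sleep, but wake up early if a stop was requested meanwhile.
        stop_requested.wait(poll_seconds)
    # Flush data, remove lock files, and so on.
    cleanup()
```

An init script's "stop" action would typically just send SIGTERM to the
daemon's PID, and a UPS-triggered shutdown ends up going through that
same path.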

> You don't.  Each machine subject to power loss raises SIGPWR itself.
> See my reply to Charles Lepple on how this works.

Well, your reply to Charles reads just like a fairy tale.

Even if you could have something that triggered an interrupt once the
power went below some threshold, that thing would still have to be
fast and accurate. Monitor chips like those supported by lm_sensors
are neither, and I believe SMC cards aren't either.

But it doesn't end here... the work that the system would have to do
to sync everything that needs syncing would take far longer than the
available "critical" window, even if we ignore the fact that there
would be no way to know whether we are just writing garbage from RAM
(as someone else mentioned). Try stopping an idle mysql instance one of
these days to see what I mean... And then try stopping VMware (for
something that takes a *really* long time)...

> "Not cool" isn't the issue.  "Makes zero configuration impossible"
> is the issue.  I'm willing to toss an awful lot of obsolete hardware
> on the scrapheap to get zero configuration.  And so should you be,
> if you have any sense at all of the value of your own time.  Stack
> the replacement cost of that UPS up against the dollars-per-hour cost
> of troubleshooting NUT config files and *think*.

There's a lot of stuff out there that "makes zero configuration
impossible". That's just the way it is.

And I know the value of my time, yes. I also know the value of my
money, and the (environmental) cost of disposing of hardware in
perfectly working order.

When I first used NUT, it took me about half an hour to configure it.
And the UPS had a *serial* connection. Please explain why supporting
serial models gets in the way of having zero configuration for the USB
ones.

> I understand your conservative instincts; I'm an old Unix hand from
> minicomputer days myself.  But there comes a point at which holding
> onto all that legacy-driven complexity is a mal-investment.  When USB
> UPSes are $30 at Computer Center, we have reached it.

But the fact is, NUT's complexity isn't legacy-driven. It only looks
that way in your narrow vision of a world where nothing exists but
"Aunt Tillie" users...

> Sure they do.  But inflicting all the complexity they require on
> every single-desktop user is incompetent system design.  That's
> what's bothering me here.

What bothers me more is not being able to do the kind of configuration
needed for servers because things have to be super-easy for the
desktop users. And don't forget that most servers have UPSes (or
should have) and most desktops don't.

Simplicity for the desktop user can be achieved with the current
design, and that is obvious from Arnaud's words. There is no need to
throw the baby out with the bath water.

-- 
Carlos Rodrigues