[Debian-ha-maintainers] Bug#576511: ditto

Tue Oct 23 07:13:05 UTC 2012

Control: severity -1 grave

On Tue, Oct 23, 2012 at 08:53:39AM +0200, Josip Rodin wrote:
> 
> These defaults are really bad.
> 
> I had this happen to a machine running DRBD yesterday:
> 
> block drbd0: PingAck did not arrive in time.
> block drbd0: peer( Secondary -> Unknown ) conn( Connected -> NetworkFailure ) pdsk( UpToDate -> DUnknown )
> block drbd0: asender terminated
> block drbd0: Terminating drbd0_asender
> block drbd0: helper command: /sbin/drbdadm pri-on-incon-degr minor-0
> SysRq : Resetting
> 
> None of this was written in syslog, which it really should have been,
> because DRBD on this machine does *not* operate on the root/var filesystems.
> 
> This is a state that can be detected, and if so, the scripts should instead
> forcefully kill just DRBD, or whatever more limited component.
> 
> They really should not default to forcefully killing the entire machine
> when they have no tangible proof that DRBD would be vital to it.
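
To illustrate what "can be detected" might mean in practice: a handler could look at /proc/mounts before doing anything drastic. What follows is only a rough sketch under stated assumptions. The function name, the list of "critical" mount points and the /dev/drbd prefix match are my own illustrative choices, not anything shipped in drbd8-utils, and it does not cover stacked setups such as LVM on top of DRBD.

```python
# Mount points whose loss would arguably justify taking the whole machine
# down. This list is an assumption made for illustration only.
CRITICAL_MOUNT_POINTS = ("/", "/var")


def drbd_backs_critical_fs(mounts_text, critical=CRITICAL_MOUNT_POINTS):
    """Return True if a /dev/drbd* device is mounted on a critical mount point.

    mounts_text is expected to be in /proc/mounts format: one mount per
    line, whitespace-separated, device first and mount point second.
    """
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        device, mount_point = fields[0], fields[1]
        if device.startswith("/dev/drbd") and mount_point in critical:
            return True
    return False


def read_proc_mounts(path="/proc/mounts"):
    """Read the kernel's current mount table (Linux only)."""
    with open(path) as f:
        return f.read()
```

A handler built on this would call drbd_backs_critical_fs(read_proc_mounts()) and only consider a reset if it returns True; otherwise it would log the event and deal with DRBD alone.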

And that goes double for the local-io-error handler executing a shutdown.

So not only do you get random data loss because of SysRq abuse, but an I/O
error on a DRBD device also takes the entire machine down for longer than a
reboot would (and a reboot may well be long itself).

No reasonable person could infer any of that from the description of the
drbd8-utils package, the inline comments in the default configuration files,
the drbd.conf(5) and drbdsetup(8) manual pages, or the DRBD User's Guide.
In fact, the latter says:
http://www.drbd.org/users-guide/s-configure-io-error-behavior.html

6.13. Configuring I/O error handling strategies

<strategy> may be one of the following options:

    detach
		This is the default and recommended option. [...]

    call-local-io-error
		[...] It is entirely left to the administrator's discretion
		to implement I/O error handling using the command (or
		script) invoked by local-io-error.

	Note

	Early DRBD versions (prior to 8.0) included another option, panic,
	which would forcibly remove the node from the cluster by way of a
	kernel panic, whenever a local I/O error occurred. While that option
	is no longer available, the same behavior may be mimicked via the
	local-io-error/call-local-io-error interface.

	You should do so only if you fully understand the implications of
	such behavior.

The package does not in any way seem to verify that the admin fully
understands the implications of such behavior, so it's really just
setting itself up for disaster.

-- 
     2. That which causes joy or happiness.