[Debian-ha-maintainers] drbd8-utils: dual primary under pacemaker leads to split brain on resource stop

Tue Jan 8 10:01:40 UTC 2013

Subject: drbd8-utils: dual primary under pacemaker leads to split brain on resource stop
Package: drbd8-utils
Version: 2:8.3.11-3~bpo60+1
Severity: important

Every time a Pacemaker-managed dual-primary DRBD device is stopped, e.g.
with "crm resource stop $DEVICE", DRBD ends up in a split brain.

I am seeing this

	block drbd0: meta connection shut down by peer.
	block drbd0: Sending state for detaching disk failed

on the machine's console.
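
To check a node for these messages after a stop, something like the
following could be used. It is only a sketch: the function name
drbd_stop_symptoms is made up for illustration, and it matches exactly the
two message texts quoted above, reading log text from standard input.

```shell
#!/bin/sh
# Hypothetical helper (not part of drbd8-utils): print the lines of a kernel
# log read from stdin that match the two messages quoted in this report.
drbd_stop_symptoms() {
    grep -E 'meta connection shut down by peer|Sending state for detaching disk failed'
}

# Example use on a live node (commented out so the snippet has no side effects):
# dmesg | drbd_stop_symptoms
```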

Stopping both sides of the drbd device by hand does not result in a split
brain.
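
For comparison, a manual stop on one node might look like the sketch below.
This is an assumption about what "by hand" means: it demotes the resource
and then takes it down with drbdadm. The function name manual_stop and the
DRBDADM override variable are hypothetical and only there to allow a dry run.

```shell
#!/bin/sh
# Sketch of a manual stop on one node. Assumption: "by hand" means demoting
# and then taking the resource down with drbdadm; the report does not list
# the exact commands. DRBDADM can be overridden (e.g. with "echo") for a dry run.
DRBDADM="${DRBDADM:-drbdadm}"

manual_stop() {
    res="$1"
    "$DRBDADM" secondary "$res" || return 1   # demote Primary -> Secondary
    "$DRBDADM" down "$res"                     # disconnect and detach
}

# Example (run on each node in turn): manual_stop debian7
```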

Adding

	sleep 1

to the linbit drbd resource agent on one or both nodes fixes this.
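
The effect of the workaround can be illustrated with a small wrapper that
pauses before running a stop command. This is not the actual change to the
resource agent (the exact location of the sleep is not given here); the names
stop_with_delay and STOP_DELAY are hypothetical.

```shell
#!/bin/sh
# Illustration only: the report does not say where in the resource agent the
# "sleep 1" was placed. stop_with_delay and STOP_DELAY are hypothetical names.
STOP_DELAY="${STOP_DELAY:-1}"

stop_with_delay() {
    sleep "$STOP_DELAY"   # pause briefly, as in the workaround
    "$@"                  # then run the stop command passed as arguments
}

# Example: stop_with_delay drbdadm down debian7
```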

I can reproduce both the bug and the fix on two 2-node clusters, both 64-bit.

Resource:

resource debian7 {
  disk {
    fencing resource-only;
  }
  net {
    after-sb-0pri discard-zero-changes;
    after-sb-1pri discard-secondary;
    after-sb-2pri disconnect;

    sndbuf-size 0;
    max-buffers 8000;
    max-epoch-size 8000;
    allow-two-primaries;
  }
  syncer {
    rate 45M;
  }
  on debian8 {
    device    /dev/drbd0;
    disk      /dev/sys/debian7;
    address   192.168.1.8:7791;
    meta-disk internal;
  }
  on debian9 {
    device    /dev/drbd0;
    disk      /dev/sys/debian7;
    address   192.168.1.9:7791;
    meta-disk internal;
  }
}

-- System Information:
Debian Release: 6.0.6
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 3.2.0-0.bpo.4-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages drbd8-utils depends on:
ii  debconf [debconf-2.0]         1.5.36.1   Debian configuration management sy
ii  libc6                         2.11.3-4   Embedded GNU C Library: Shared lib
ii  pacemaker                     1.1.7-1~bpo60+1 HA cluster resource manager
ii  corosync                      1.4.2-1~bpo60+1 Standards-based cluster framework (daemon and

drbd8-utils recommends no packages.

Versions of packages drbd8-utils suggests:
pn  heartbeat                     <none>     (no description available)

-- Configuration Files:
/etc/drbd.d/global_common.conf changed:
global {
	usage-count no;
	# minor-count dialog-refresh disable-ip-verification
}
common {
	protocol C;
	handlers {
		pri-on-incon-degr "/usr/lib/drbd/notify-pri-on-incon-degr.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
		pri-lost-after-sb "/usr/lib/drbd/notify-pri-lost-after-sb.sh; /usr/lib/drbd/notify-emergency-reboot.sh; echo b > /proc/sysrq-trigger ; reboot -f";
		local-io-error "/usr/lib/drbd/notify-io-error.sh; /usr/lib/drbd/notify-emergency-shutdown.sh; echo o > /proc/sysrq-trigger ; halt -f";
		fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
		# split-brain "/usr/lib/drbd/notify-split-brain.sh root";
		# out-of-sync "/usr/lib/drbd/notify-out-of-sync.sh root";
		# before-resync-target "/usr/lib/drbd/snapshot-resync-target-lvm.sh -p 15 -- -c 16k";
		# after-resync-target /usr/lib/drbd/unsnapshot-resync-target-lvm.sh;
	}
	startup {
		# wfc-timeout degr-wfc-timeout outdated-wfc-timeout wait-after-sb
	}
	disk {
		fencing resource-only;
		# on-io-error fencing use-bmbv no-disk-barrier no-disk-flushes
		# no-disk-drain no-md-flushes max-bio-bvecs
	}
	net {
		after-sb-0pri discard-zero-changes;
		after-sb-1pri discard-secondary;
		after-sb-2pri disconnect;
		# sndbuf-size rcvbuf-size timeout connect-int ping-int ping-timeout max-buffers
		# max-epoch-size ko-count allow-two-primaries cram-hmac-alg shared-secret
		# after-sb-0pri after-sb-1pri after-sb-2pri data-integrity-alg no-tcp-cork
	}
	syncer {
		# rate after al-extents use-rle cpu-mask verify-alg csums-alg
	}
}

-- no debconf information
