[Debian-ha-maintainers] Bug#629662: cluster-agents: VirtualDomain OCF script erroneously fails on stop (triggering unjustified fencing)

Cedric Dufour cedric.dufour at idiap.ch
Wed Jun 8 13:16:12 UTC 2011


Package: cluster-agents
Version: 1:1.0.3-3.1
Severity: important
Tags: patch


Hello,

The VirtualDomain OCF script erroneously fails on the 'stop' action,
triggering unjustified fencing of the node on which it fails.

The reason is that, in the context of a 'stop' action, the
VirtualDomain_Status function bails out immediately (unconditionally)
as soon as it receives an empty status back from 'virsh domstate'.
It should instead keep checking the domain state a few more times,
waiting a second between retries, before bailing out and triggering
the cascade of errors that eventually prompts the HA layer to fence
the node.
(See the attached syslog output for the chain of events.)
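
To illustrate, here is a simplified sketch of the VirtualDomain_Status
polling loop with the fix applied (the one-second sleep between retries
is assumed to live in the function's existing "no state" branch, as in
the shipped script; the attached diff shows only the changed lines):

    loop=0
    status="no state"
    while [ "$status" = "no state" ]; do
        loop=$(( $loop + 1 ))
        status="`virsh $VIRSH_OPTIONS domstate $DOMAIN_NAME`"
        # an empty answer means libvirt could not report a state
        [ -z "$status" ] && status="no state"
        if [ "$status" = "no state" ]; then
            if [ "$__OCF_ACTION" = "stop" -a $loop -gt 5 ]; then
                # give up only after several consecutive failures,
                # letting the stop operation escalate to 'destroy'
                return $OCF_ERR_GENERIC
            fi
            sleep 1  # brief pause before polling again
        fi
    done

(The real function then maps a non-empty $status to the proper OCF
return code via its case statement, which the sketch omits.)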

A patch to fix this behavior is attached.

A conceptually identical fix has already been applied upstream
(see https://github.com/ClusterLabs/resource-agents).
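
Should anyone want to verify the patched agent outside the cluster,
ocf-tester from the cluster-glue package can exercise all of its
actions; a sketch, where the resource name and the domain XML path
are placeholders for your own setup:

    # runs start/stop/monitor/... against the agent, as the cluster would;
    # 'config' must point at an existing libvirt domain definition
    ocf-tester -n test-vm \
        -o config=/etc/libvirt/qemu/test-vm.xml \
        /usr/lib/ocf/resource.d/heartbeat/VirtualDomain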

I hope this can help.

Regards,

Cedric Dufour

-- System Information:
Debian Release: 6.0.1
  APT prefers stable
  APT policy: (990, 'stable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.32-5-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash

Versions of packages cluster-agents depends on:
ii  cluster-glue            1.0.6-1          The reusable cluster components fo
ii  libc6                   2.11.2-10        Embedded GNU C Library: Shared lib
ii  libcluster-glue         1.0.6-1          The reusable cluster components fo
ii  libglib2.0-0            2.24.2-1         The GLib library of C routines
ii  libnet1                 1.1.4-2          library for the construction and h
ii  python                  2.6.6-3+squeeze6 interactive high-level object-orie

cluster-agents recommends no packages.

cluster-agents suggests no packages.

-- no debconf information
-------------- next part --------------
Jun  8 11:36:55 n501 VirtualDomain[28606]: INFO: Issuing graceful shutdown request for domain VRNSIDU_intranet.
Jun  8 11:36:55 n501 lrmd: [3232]: info: RA output: (VRNSIDU_intranet-VirtualDomain:stop:stdout) Domain VRNSIDU_intranet is being shutdown
Jun  8 11:36:55 n501 sSMTP[28598]: SSL connection using DHE_RSA_AES_128_CBC_SHA1
Jun  8 11:36:55 n501 crmd: [3235]: info: match_graph_event: Action STONITH-mfsys:0_start_0 (8) confirmed on n502.havc (rc=0)
Jun  8 11:36:55 n501 crmd: [3235]: info: te_rsc_command: Initiating action 9: monitor STONITH-mfsys:0_monitor_60000 on n502.havc
Jun  8 11:36:55 n501 crmd: [3235]: info: te_pseudo_action: Pseudo action 13 fired and confirmed
Jun  8 11:36:56 n501 crmd: [3235]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=0, tag=transient_attributes, id=n502.havc, magic=NA, cib=0.33.20) : Transient attribute: update
Jun  8 11:36:56 n501 crmd: [3235]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
Jun  8 11:36:56 n501 crmd: [3235]: info: update_abort_priority: Abort action done superceeded by restart
Jun  8 11:36:56 n501 sSMTP[28598]: Sent mail for root at n501.havc.idiap.ch (221 2.0.0 Bye) uid=0 username=root outbytes=617
Jun  8 11:36:56 n501 crmd: [3235]: info: process_lrm_event: LRM operation GW.LUSTRE_intranet-MailTo_start_0 (call=246, rc=0, cib-update=1626, confirmed=true) ok
Jun  8 11:36:56 n501 crmd: [3235]: info: match_graph_event: Action GW.LUSTRE_intranet-MailTo_start_0 (22) confirmed on n501.havc (rc=0)
Jun  8 11:36:56 n501 crmd: [3235]: info: match_graph_event: Action STONITH-mfsys:0_monitor_60000 (9) confirmed on n502.havc (rc=0)
Jun  8 11:37:16 n501 kernel: [1044136.919946] br0: port 3(vnet1) entering disabled state
Jun  8 11:37:16 n501 kernel: [1044136.950311] device vnet1 left promiscuous mode
Jun  8 11:37:16 n501 kernel: [1044136.950315] br0: port 3(vnet1) entering disabled state
Jun  8 11:37:16 n501 libvirtd: 11:37:16.887: error : qemuMonitorCommandWithHandler:255 : cannot send monitor command 'info balloon': Connection reset by peer
Jun  8 11:37:16 n501 libvirtd: 11:37:16.887: error : qemuMonitorTextGetBalloonInfo:555 : operation failed: could not query memory balloon allocation
Jun  8 11:37:16 n501 lrmd: [3232]: info: RA output: (VRNSIDU_intranet-VirtualDomain:stop:stderr) error: operation failed: could not query memory balloon allocation
[BUG] Jun  8 11:37:16 n501 VirtualDomain[28606]: ERROR: Virtual domain VRNSIDU_intranet has no state during stop operation, bailing out.
[BUG] Jun  8 11:37:16 n501 VirtualDomain[28606]: INFO: Issuing forced shutdown (destroy) request for domain VRNSIDU_intranet.
[BUG] Jun  8 11:37:16 n501 libvirtd: 11:37:16.899: error : qemudDomainDestroy:4767 : Requested operation is not valid: domain is not running
[BUG] Jun  8 11:37:16 n501 lrmd: [3232]: info: RA output: (VRNSIDU_intranet-VirtualDomain:stop:stderr) error: Failed to destroy domain VRNSIDU_intranet#012error: Requested operation is not valid: domain is not running
[BUG] Jun  8 11:37:16 n501 crmd: [3235]: info: process_lrm_event: LRM operation VRNSIDU_intranet-VirtualDomain_stop_0 (call=248, rc=1, cib-update=1627, confirmed=true) unknown error
[BUG] Jun  8 11:37:16 n501 crmd: [3235]: WARN: status_from_rc: Action 27 (VRNSIDU_intranet-VirtualDomain_stop_0) on n501.havc failed (target: 0 vs. rc: 1): Error
[BUG] Jun  8 11:37:16 n501 crmd: [3235]: WARN: update_failcount: Updating failcount for VRNSIDU_intranet-VirtualDomain on n501.havc after failed stop: rc=1 (update=INFINITY, time=1307525836)
Jun  8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=VRNSIDU_intranet-VirtualDomain_stop_0, magic=0:1;27:1068:0:a09be5db-e85d-429a-a8f7-2e876b
Jun  8 11:37:16 n501 crmd: [3235]: info: match_graph_event: Action VRNSIDU_intranet-VirtualDomain_stop_0 (27) confirmed on n501.havc (rc=4)
Jun  8 11:37:16 n501 attrd: [3233]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-VRNSIDU_intranet-VirtualDomain (INFINITY)
Jun  8 11:37:16 n501 crmd: [3235]: info: run_graph: ====================================================
Jun  8 11:37:16 n501 crmd: [3235]: notice: run_graph: Transition 1068 (Complete=18, Pending=0, Fired=0, Skipped=9, Incomplete=0, Source=/var/lib/pengine/pe-error-5.bz2): Stopped
Jun  8 11:37:16 n501 crmd: [3235]: info: te_graph_trigger: Transition 1068 is now complete
Jun  8 11:37:16 n501 crmd: [3235]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun  8 11:37:16 n501 crmd: [3235]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jun  8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1628: Requesting the current CIB: S_POLICY_ENGINE
Jun  8 11:37:16 n501 attrd: [3233]: info: attrd_perform_update: Sent update 513: fail-count-VRNSIDU_intranet-VirtualDomain=INFINITY
Jun  8 11:37:16 n501 attrd: [3233]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-VRNSIDU_intranet-VirtualDomain (1307525836)
Jun  8 11:37:16 n501 attrd: [3233]: info: attrd_perform_update: Sent update 515: last-failure-VRNSIDU_intranet-VirtualDomain=1307525836
Jun  8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke_callback: Invoking the PE: query=1628, ref=pe_calc-dc-1307525836-1642, seq=40, quorate=1
Jun  8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=n501.havc, magic=NA, cib=0.33.24) : Transient attribute: update
Jun  8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=n501.havc, magic=NA, cib=0.33.25) : Transient attribute: update
Jun  8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1629: Requesting the current CIB: S_POLICY_ENGINE
Jun  8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1630: Requesting the current CIB: S_POLICY_ENGINE
Jun  8 11:37:16 n501 pengine: [3234]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun  8 11:37:16 n501 pengine: [3234]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun  8 11:37:16 n501 pengine: [3234]: info: determine_online_status: Node n501.havc is online
[BUG] Jun  8 11:37:16 n501 pengine: [3234]: WARN: unpack_rsc_op: Processing failed op VRNSIDU_intranet-VirtualDomain_stop_0 on n501.havc: unknown error (1)
[BUG] Jun  8 11:37:16 n501 pengine: [3234]: WARN: pe_fence_node: Node n501.havc will be fenced to recover from resource failure(s)

-------------- next part --------------
--- VirtualDomain.orig	2011-06-08 13:53:21.000000000 +0200
+++ VirtualDomain	2011-06-08 14:21:27.000000000 +0200
@@ -147,9 +147,12 @@
 }
 
 VirtualDomain_Status() {
+    local loop
     rc=$OCF_ERR_GENERIC
     status="no state"
+    loop=0
     while [ "$status" = "no state" ]; do
+        loop=$(( $loop+1 ))
         status="`virsh $VIRSH_OPTIONS domstate $DOMAIN_NAME`"
         case "$status" in
 	    "shut off")
@@ -173,10 +176,11 @@
 		# whenever virsh can't reliably obtain the domain
 		# state.
 		status="no state"
-		if [ "$__OCF_ACTION" = "stop" ]; then
+		if [ "$__OCF_ACTION" = "stop" -a $loop -gt 5 ]; then
 		    # During the stop operation, we want to bail out
 		    # quickly, so as to be able to force-stop (destroy)
-		    # the domain if necessary.
+		    # the domain if necessary (but only after making sure
+		    # we won't eventually get a valid state).
 		    ocf_log error "Virtual domain $DOMAIN_NAME has no state during stop operation, bailing out."
 		    return $OCF_ERR_GENERIC;
 		else

