[Debian-ha-maintainers] Bug#629662: cluster-agents: VirtualDomain OCF script erroneously fails on stop (triggering unjustified fencing)
Cedric Dufour
cedric.dufour at idiap.ch
Wed Jun 8 13:16:12 UTC 2011
Package: cluster-agents
Version: 1:1.0.3-3.1
Severity: important
Tags: patch
Hello,
The VirtualDomain OCF script fails erroneously on stop
(triggering unjustified fencing of the node where it fails).
The reason for this is that the VirtualDomain_Status function
bails out too quickly (inconditionously) when receiving an empty status
back from 'virsh domstate' and when in the context of a 'Stop' action.
It should loop/wait a further few seconds and keep checking the domain
state a few more times before bailing out and triggering a cascade of
errors which eventually prompt the HA layer to fence the node.
(see attached Syslog output for the chain of events)
A patch to fix this behavior is attached.
Conceptually the same fix has actually been applied upstream.
(see https://github.com/ClusterLabs/resource-agents)
I hope this can help.
Regards,
Cedric Dufour
-- System Information:
Debian Release: 6.0.1
APT prefers stable
APT policy: (990, 'stable')
Architecture: amd64 (x86_64)
Kernel: Linux 2.6.32-5-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Versions of packages cluster-agents depends on:
ii cluster-glue 1.0.6-1 The reusable cluster components fo
ii libc6 2.11.2-10 Embedded GNU C Library: Shared lib
ii libcluster-glue 1.0.6-1 The reusable cluster components fo
ii libglib2.0-0 2.24.2-1 The GLib library of C routines
ii libnet1 1.1.4-2 library for the construction and h
ii python 2.6.6-3+squeeze6 interactive high-level object-orie
cluster-agents recommends no packages.
cluster-agents suggests no packages.
-- no debconf information
-------------- next part --------------
Jun 8 11:36:55 n501 VirtualDomain[28606]: INFO: Issuing graceful shutdown request for domain VRNSIDU_intranet.
Jun 8 11:36:55 n501 lrmd: [3232]: info: RA output: (VRNSIDU_intranet-VirtualDomain:stop:stdout) Domain VRNSIDU_intranet is being shutdown
Jun 8 11:36:55 n501 sSMTP[28598]: SSL connection using DHE_RSA_AES_128_CBC_SHA1
Jun 8 11:36:55 n501 crmd: [3235]: info: match_graph_event: Action STONITH-mfsys:0_start_0 (8) confirmed on n502.havc (rc=0)
Jun 8 11:36:55 n501 crmd: [3235]: info: te_rsc_command: Initiating action 9: monitor STONITH-mfsys:0_monitor_60000 on n502.havc
Jun 8 11:36:55 n501 crmd: [3235]: info: te_pseudo_action: Pseudo action 13 fired and confirmed
Jun 8 11:36:56 n501 crmd: [3235]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=0, tag=transient_attributes, id=n502.havc, magic=NA, cib=0.33.20) : Transient attribute: update
Jun 8 11:36:56 n501 crmd: [3235]: info: update_abort_priority: Abort priority upgraded from 0 to 1000000
Jun 8 11:36:56 n501 crmd: [3235]: info: update_abort_priority: Abort action done superceeded by restart
Jun 8 11:36:56 n501 sSMTP[28598]: Sent mail for root at n501.havc.idiap.ch (221 2.0.0 Bye) uid=0 username=root outbytes=617
Jun 8 11:36:56 n501 crmd: [3235]: info: process_lrm_event: LRM operation GW.LUSTRE_intranet-MailTo_start_0 (call=246, rc=0, cib-update=1626, confirmed=true) ok
Jun 8 11:36:56 n501 crmd: [3235]: info: match_graph_event: Action GW.LUSTRE_intranet-MailTo_start_0 (22) confirmed on n501.havc (rc=0)
Jun 8 11:36:56 n501 crmd: [3235]: info: match_graph_event: Action STONITH-mfsys:0_monitor_60000 (9) confirmed on n502.havc (rc=0)
Jun 8 11:37:16 n501 kernel: [1044136.919946] br0: port 3(vnet1) entering disabled state
Jun 8 11:37:16 n501 kernel: [1044136.950311] device vnet1 left promiscuous mode
Jun 8 11:37:16 n501 kernel: [1044136.950315] br0: port 3(vnet1) entering disabled state
Jun 8 11:37:16 n501 libvirtd: 11:37:16.887: error : qemuMonitorCommandWithHandler:255 : cannot send monitor command 'info balloon': Connection reset by peer
Jun 8 11:37:16 n501 libvirtd: 11:37:16.887: error : qemuMonitorTextGetBalloonInfo:555 : operation failed: could not query memory balloon allocation
Jun 8 11:37:16 n501 lrmd: [3232]: info: RA output: (VRNSIDU_intranet-VirtualDomain:stop:stderr) error: operation failed: could not query memory balloon allocation
[BUG] Jun 8 11:37:16 n501 VirtualDomain[28606]: ERROR: Virtual domain VRNSIDU_intranet has no state during stop operation, bailing out.
[BUG] Jun 8 11:37:16 n501 VirtualDomain[28606]: INFO: Issuing forced shutdown (destroy) request for domain VRNSIDU_intranet.
[BUG] Jun 8 11:37:16 n501 libvirtd: 11:37:16.899: error : qemudDomainDestroy:4767 : Requested operation is not valid: domain is not running
[BUG] Jun 8 11:37:16 n501 lrmd: [3232]: info: RA output: (VRNSIDU_intranet-VirtualDomain:stop:stderr) error: Failed to destroy domain VRNSIDU_intranet#012error: Requested operation is not valid: domain is not running
[BUG] Jun 8 11:37:16 n501 crmd: [3235]: info: process_lrm_event: LRM operation VRNSIDU_intranet-VirtualDomain_stop_0 (call=248, rc=1, cib-update=1627, confirmed=true) unknown error
[BUG] Jun 8 11:37:16 n501 crmd: [3235]: WARN: status_from_rc: Action 27 (VRNSIDU_intranet-VirtualDomain_stop_0) on n501.havc failed (target: 0 vs. rc: 1): Error
[BUG] Jun 8 11:37:16 n501 crmd: [3235]: WARN: update_failcount: Updating failcount for VRNSIDU_intranet-VirtualDomain on n501.havc after failed stop: rc=1 (update=INFINITY, time=1307525836)
Jun 8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: match_graph_event:272 - Triggered transition abort (complete=0, tag=lrm_rsc_op, id=VRNSIDU_intranet-VirtualDomain_stop_0, magic=0:1;27:1068:0:a09be5db-e85d-429a-a8f7-2e876b
Jun 8 11:37:16 n501 crmd: [3235]: info: match_graph_event: Action VRNSIDU_intranet-VirtualDomain_stop_0 (27) confirmed on n501.havc (rc=4)
Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_trigger_update: Sending flush op to all hosts for: fail-count-VRNSIDU_intranet-VirtualDomain (INFINITY)
Jun 8 11:37:16 n501 crmd: [3235]: info: run_graph: ====================================================
Jun 8 11:37:16 n501 crmd: [3235]: notice: run_graph: Transition 1068 (Complete=18, Pending=0, Fired=0, Skipped=9, Incomplete=0, Source=/var/lib/pengine/pe-error-5.bz2): Stopped
Jun 8 11:37:16 n501 crmd: [3235]: info: te_graph_trigger: Transition 1068 is now complete
Jun 8 11:37:16 n501 crmd: [3235]: info: do_state_transition: State transition S_TRANSITION_ENGINE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=notify_crmd ]
Jun 8 11:37:16 n501 crmd: [3235]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1628: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_perform_update: Sent update 513: fail-count-VRNSIDU_intranet-VirtualDomain=INFINITY
Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_trigger_update: Sending flush op to all hosts for: last-failure-VRNSIDU_intranet-VirtualDomain (1307525836)
Jun 8 11:37:16 n501 attrd: [3233]: info: attrd_perform_update: Sent update 515: last-failure-VRNSIDU_intranet-VirtualDomain=1307525836
Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke_callback: Invoking the PE: query=1628, ref=pe_calc-dc-1307525836-1642, seq=40, quorate=1
Jun 8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=n501.havc, magic=NA, cib=0.33.24) : Transient attribute: update
Jun 8 11:37:16 n501 crmd: [3235]: info: abort_transition_graph: te_update_diff:146 - Triggered transition abort (complete=1, tag=transient_attributes, id=n501.havc, magic=NA, cib=0.33.25) : Transient attribute: update
Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1629: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 11:37:16 n501 crmd: [3235]: info: do_pe_invoke: Query 1630: Requesting the current CIB: S_POLICY_ENGINE
Jun 8 11:37:16 n501 pengine: [3234]: notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 8 11:37:16 n501 pengine: [3234]: info: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 8 11:37:16 n501 pengine: [3234]: info: determine_online_status: Node n501.havc is online
[BUG] Jun 8 11:37:16 n501 pengine: [3234]: WARN: unpack_rsc_op: Processing failed op VRNSIDU_intranet-VirtualDomain_stop_0 on n501.havc: unknown error (1)
[BUG] Jun 8 11:37:16 n501 pengine: [3234]: WARN: pe_fence_node: Node n501.havc will be fenced to recover from resource failure(s)
-------------- next part --------------
--- VirtualDomain.orig 2011-06-08 13:53:21.000000000 +0200
+++ VirtualDomain 2011-06-08 14:21:27.000000000 +0200
@@ -147,9 +147,12 @@
}
VirtualDomain_Status() {
+ local loop
rc=$OCF_ERR_GENERIC
status="no state"
+ loop=0
while [ "$status" = "no state" ]; do
+ loop=$(( $loop+1 ))
status="`virsh $VIRSH_OPTIONS domstate $DOMAIN_NAME`"
case "$status" in
"shut off")
@@ -173,10 +176,11 @@
# whenever virsh can't reliably obtain the domain
# state.
status="no state"
- if [ "$__OCF_ACTION" = "stop" ]; then
+ if [ "$__OCF_ACTION" = "stop" -a $loop -gt 5 ]; then
# During the stop operation, we want to bail out
# quickly, so as to be able to force-stop (destroy)
- # the domain if necessary.
+ # the domain if necessary (but not after making sure
+ # we don't eventually get a valid state).
ocf_log error "Virtual domain $DOMAIN_NAME has no state during stop operation, bailing out."
return $OCF_ERR_GENERIC;
else
More information about the Debian-ha-maintainers
mailing list