[Debian-ha-maintainers] Bug#705546: pacemaker fails to take action on clones and master/slave resources on-fail

Tue Apr 16 14:38:54 UTC 2013

Package: pacemaker
Version: 1.1.7-1
Severity: normal

Using pacemaker from wheezy, I found that on-fail settings are not honored on clones
and master/slave resources. The problem has already been reported upstream and a fix
has been released; I'm asking for the attached fix to be included in Debian.

The attached patch is the upstream patch with minor (cosmetic) differences so that
it applies cleanly to the Debian sources.

thanks!

before patch (the failed 7s monitor, which has on-fail="stop", leads to a stop followed by a start):

# crm resource show msPostgresql
resource msPostgresql is running on: infra02
resource msPostgresql is running on: infra01 Master

# crm configure show msPostgresql
ms msPostgresql pgsql \
meta master-max="1" master-node-max="1" clone-max="2" clone-node-max="1" notify="true" is-managed="true"

# crm configure show pgsql
primitive pgsql ocf:local:pgsql \
params pgctl="/usr/lib/postgresql/9.1/bin/pg_ctl" psql="/usr/bin/psql" pgdata="/var/lib/postgresql/9.1/main" start_opt="-p 5432" rep_mode="sync" node_list="infra01 infra02" restore_command="cp /var/lib/postgresql/9.1/archive/%f %p" primary_conninfo_opt="keepalives_idle=60 keepalives_interval=5 keepalives_count=5" master_ip="192.168.111.12" stop_escalate="0" config="/etc/postgresql/9.1/main/postgresql.conf" tmpdir="/var/lib/postgresql/tmp" pgctldata="/usr/lib/postgresql/9.1/bin/pg_controldata" repuser="repl" \
op start interval="0" timeout="120" on-fail="restart" \
op monitor interval="7" timeout="120" on-fail="stop" \
op monitor interval="2" role="Master" timeout="60" on-fail="restart" \
op promote interval="0" timeout="120" on-fail="restart" \
op demote interval="0" timeout="120" on-fail="stop" \
op stop interval="0" timeout="120" on-fail="block" \
op notify interval="0" timeout="90"
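
To make it easier to see which on-fail value applies to which operation in the configuration above, here is a minimal sketch that tabulates them. The function name onfail_table is hypothetical (it is not part of crmsh), and it assumes the one-"op"-per-line layout printed by crm configure show as seen above.

```shell
# Hypothetical helper: read `crm configure show <rsc>` output on stdin and
# print one line per operation with its role (if any) and on-fail value.
onfail_table() {
    awk '$1 == "op" {
        role = ""; onfail = "(not set)"
        for (i = 3; i <= NF; i++) {
            # strip the key and the surrounding quotes, keep only the value
            if ($i ~ /^role=/)    { role = $i;   gsub(/role=|"/, "", role) }
            if ($i ~ /^on-fail=/) { onfail = $i; gsub(/on-fail=|"/, "", onfail) }
        }
        printf "%s%s on-fail=%s\n", $2, (role == "" ? "" : "[" role "]"), onfail
    }'
}

# Possible usage on a cluster node:
#   crm configure show pgsql | onfail_table
```

For the test below, the relevant entry is the 7s monitor (no role), which is configured with on-fail="stop".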

# kill `cat /var/run/postgresql/9.1-main.pid `

pgsql log
Apr 15 16:12:17 infra02 postgres[39723]: [2-1] 2013-04-15 16:12:17 ART LOG:  received smart shutdown request
Apr 15 16:12:17 infra02 postgres[39769]: [1-1] 2013-04-15 16:12:17 ART LOG:  shutting down
Apr 15 16:12:17 infra02 postgres[39769]: [2-1] 2013-04-15 16:12:17 ART LOG:  database system is shut down

cluster log
Apr 15 16:12:17 infra02 pgsql[41389]: INFO: PostgreSQL is down
Apr 15 16:12:17 infra02 crmd: [1441]: info: process_lrm_event: LRM operation pgsql:0_monitor_7000 (call=84, rc=7, cib-update=89, confirmed=false) not running
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_ais_dispatch: Update relayed from infra01
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-pgsql:0 (13)
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_perform_update: Sent update 270: fail-count-pgsql:0=13
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_ais_dispatch: Update relayed from infra01
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-pgsql:0 (1366053137)
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_perform_update: Sent update 272: last-failure-pgsql:0=1366053137
Apr 15 16:12:17 infra02 lrmd: [1438]: info: rsc:pgsql:0 notify[85] (pid 41435)
Apr 15 16:12:17 infra02 lrmd: [1438]: info: operation notify[85] on pgsql:0 for client 1441: pid 41435 exited with return code 0
Apr 15 16:12:17 infra02 crmd: [1441]: info: process_lrm_event: LRM operation pgsql:0_notify_0 (call=85, rc=0, cib-update=0, confirmed=true) ok
Apr 15 16:12:17 infra02 lrmd: [1438]: info: cancel_op: operation monitor[84] on pgsql:0 for client 1441, its parameters: pgctl=[/usr/lib/postgresql/9.1/bin/pg_ctl] CRM_meta_clone=[0] config=[/etc/postgresql/9.1/main/postgresql.conf] CRM_meta_clone_max=[2] CRM_meta_globally_unique=[false] CRM_meta_notify_master_uname=[infra01 ] CRM_meta_notify_promote_uname=[ ] tmpdir=[/var/lib/postgresql/tmp] CRM_meta_notify_active_uname=[ ] start_opt=[-p 5432] CRM_meta_notify_stop_resource=[ ] CRM_meta_name=[monitor] CRM_meta_interval=[7000] CRM_meta_clone_node_max=[1] crm_fe cancelled
Apr 15 16:12:17 infra02 lrmd: [1438]: info: rsc:pgsql:0 stop[86] (pid 41471)
Apr 15 16:12:17 infra02 crmd: [1441]: info: process_lrm_event: LRM operation pgsql:0_monitor_7000 (call=84, status=1, cib-update=0, confirmed=true) Cancelled
Apr 15 16:12:17 infra02 pgsql[41471]: INFO: PostgreSQL is already stopped.
Apr 15 16:12:17 infra02 pgsql[41471]: INFO: Changing pgsql-status on infra02 : HS:alone->STOP.
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_trigger_update: Sending flush op to all hosts for: pgsql-status (STOP)
Apr 15 16:12:17 infra02 attrd: [1439]: notice: attrd_perform_update: Sent update 274: pgsql-status=STOP
Apr 15 16:12:17 infra02 lrmd: [1438]: info: operation stop[86] on pgsql:0 for client 1441: pid 41471 exited with return code 0
Apr 15 16:12:17 infra02 crmd: [1441]: info: process_lrm_event: LRM operation pgsql:0_stop_0 (call=86, rc=0, cib-update=90, confirmed=true) ok
Apr 15 16:12:18 infra02 lrmd: [1438]: info: rsc:pgsql:0 start[87] (pid 41525)
Apr 15 16:12:18 infra02 pgsql[41525]: INFO: Set all nodes into async mode.
Apr 15 16:12:18 infra02 pgsql[41525]: INFO: My Timeline ID and Checkpoint : 7:00000000160000D0
Apr 15 16:12:18 infra02 pgsql[41525]: INFO: infra01 master baseline : 7:0000000017000070
Apr 15 16:12:18 infra02 pgsql[41525]: INFO: server starting
Apr 15 16:12:18 infra02 pgsql[41525]: INFO: PostgreSQL start command sent.
Apr 15 16:12:18 infra02 lrmd: [1438]: info: RA output: (pgsql:0:start:stderr) psql: could not connect to server: No such file or directory#012#011Is the server running locally and accepting#012#011connections on Unix domain socket "/var/run/postgresql/.s.PGSQL.5432"?
Apr 15 16:12:18 infra02 pgsql[41525]: WARNING: PostgreSQL template1 isn't running
Apr 15 16:12:18 infra02 pgsql[41525]: WARNING: Connection error (connection to the server went bad and the session was not interactive) occurred while executing the psql command.
Apr 15 16:12:19 infra02 pgsql[41525]: INFO: PostgreSQL is started.
Apr 15 16:12:19 infra02 pgsql[41525]: INFO: Changing pgsql-status on infra02 : STOP->HS:alone.
Apr 15 16:12:19 infra02 attrd: [1439]: notice: attrd_trigger_update: Sending flush op to all hosts for: pgsql-status (HS:alone)
Apr 15 16:12:19 infra02 attrd: [1439]: notice: attrd_perform_update: Sent update 276: pgsql-status=HS:alone
Apr 15 16:12:19 infra02 lrmd: [1438]: info: operation start[87] on pgsql:0 for client 1441: pid 41525 exited with return code 0
Apr 15 16:12:19 infra02 crmd: [1441]: info: process_lrm_event: LRM operation pgsql:0_start_0 (call=87, rc=0, cib-update=91, confirmed=true) ok
Apr 15 16:12:19 infra02 lrmd: [1438]: info: rsc:pgsql:0 notify[88] (pid 41771)
Apr 15 16:12:19 infra02 lrmd: [1438]: info: operation notify[88] on pgsql:0 for client 1441: pid 41771 exited with return code 0
Apr 15 16:12:19 infra02 crmd: [1441]: info: process_lrm_event: LRM operation pgsql:0_notify_0 (call=88, rc=0, cib-update=0, confirmed=true) ok
Apr 15 16:12:19 infra02 crmd: [1441]: info: process_lrm_event: LRM operation pgsql:0_monitor_7000 (call=89, rc=0, cib-update=92, confirmed=false) ok

after patch:

# kill `cat /var/run/postgresql/9.1-main.pid `

cluster log
Apr 16 11:21:05 infra02 pgsql[100164]: INFO: PostgreSQL is down
Apr 16 11:21:05 infra02 crmd: [97198]: info: process_lrm_event: LRM operation pgsql:0_monitor_7000 (call=15, rc=7, cib-update=24, confirmed=false) not running
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_ais_dispatch: Update relayed from infra01
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-pgsql:0 (1)
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_perform_update: Sent update 47: fail-count-pgsql:0=1
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_ais_dispatch: Update relayed from infra01
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-pgsql:0 (1366122065)
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_perform_update: Sent update 50: last-failure-pgsql:0=1366122065
Apr 16 11:21:05 infra02 lrmd: [97195]: info: rsc:pgsql:0 notify[24] (pid 100206)
Apr 16 11:21:05 infra02 lrmd: [97195]: info: operation notify[24] on pgsql:0 for client 97198: pid 100206 exited with return code 0
Apr 16 11:21:05 infra02 crmd: [97198]: info: process_lrm_event: LRM operation pgsql:0_notify_0 (call=24, rc=0, cib-update=0, confirmed=true) ok
Apr 16 11:21:05 infra02 lrmd: [97195]: info: cancel_op: operation monitor[15] on pgsql:0 for client 97198, its parameters: pgctl=[/usr/lib/postgresql/9.1/bin/pg_ctl] CRM_meta_clone=[0] config=[/etc/postgresql/9.1/main/postgresql.conf] CRM_meta_clone_max=[2] CRM_meta_globally_unique=[false] CRM_meta_notify_master_uname=[ ] CRM_meta_notify_promote_uname=[ ] tmpdir=[/var/lib/postgresql/tmp] CRM_meta_notify_active_uname=[ ] start_opt=[-p 5432] CRM_meta_notify_stop_resource=[ ] CRM_meta_name=[monitor] CRM_meta_interval=[7000] CRM_meta_clone_node_max=[1] crm_feature_ cancelled
Apr 16 11:21:05 infra02 lrmd: [97195]: info: rsc:pgsql:0 stop[25] (pid 100241)
Apr 16 11:21:05 infra02 crmd: [97198]: info: process_lrm_event: LRM operation pgsql:0_monitor_7000 (call=15, status=1, cib-update=0, confirmed=true) Cancelled
Apr 16 11:21:05 infra02 pgsql[100241]: INFO: PostgreSQL is already stopped.
Apr 16 11:21:05 infra02 pgsql[100241]: INFO: Changing pgsql-status on infra02 : HS:alone->STOP.
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_trigger_update: Sending flush op to all hosts for: pgsql-status (STOP)
Apr 16 11:21:05 infra02 lrmd: [97195]: info: operation stop[25] on pgsql:0 for client 97198: pid 100241 exited with return code 0
Apr 16 11:21:05 infra02 attrd: [97196]: notice: attrd_perform_update: Sent update 52: pgsql-status=STOP
Apr 16 11:21:05 infra02 crmd: [97198]: info: process_lrm_event: LRM operation pgsql:0_stop_0 (call=25, rc=0, cib-update=25, confirmed=true) ok
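
The difference between the two logs is whether a start operation is logged after the successful stop. As a rough sketch, this can be checked mechanically by scanning the crmd process_lrm_event lines. The function name started_after_stop is hypothetical, and the log path in the usage comment is an assumption (it depends on the local syslog setup).

```shell
# Hypothetical helper: read a cluster log on stdin and exit 0 if a
# completed stop (rc=0) is later followed by a start operation,
# exit 1 otherwise.
started_after_stop() {
    awk '/process_lrm_event: LRM operation .*_stop_0 .*rc=0,/ { stopped = 1 }
         stopped && /process_lrm_event: LRM operation .*_start_0 / { found = 1 }
         END { exit !found }'
}

# Possible usage (log path is an assumption):
#   if started_after_stop < /var/log/syslog; then
#       echo "resource was restarted after the stop"
#   else
#       echo "no start logged after the stop"
#   fi
```

Run against the "before patch" log this should report a restart, and against the "after patch" log it should not, since that excerpt ends with the stop operation.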

-- System Information:
Debian Release: 7.0
  APT prefers testing
  APT policy: (900, 'testing'), (500, 'testing-updates'), (300, 'unstable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 3.2.0-4-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
-------------- next part --------------
A non-text attachment was scrubbed...
Name: cloned-on-fail.patch
Type: text/x-diff
Size: 1337 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/debian-ha-maintainers/attachments/20130416/ab92ca48/attachment.patch>