[Debian-ha-maintainers] Bug#970084: corosync: Corosync becomes unresponsive and disconnects from the rest of the cluster when primary link is lost
Eugen Wick
operations at sipgate.de
Fri Sep 11 10:22:19 BST 2020
Package: corosync
Version: 3.0.1-2+deb10u1
Severity: important
Dear Maintainer,
* What led up to the situation?
** Two-node cluster running Corosync 3.0.1 on Debian Buster.
** Two knet links: ring0 on eth0 (front-facing interface), ring1 on eth1
(back-to-back link).
** Services are running on cluster-node01.
** The cluster is running fine; both nodes are online and see each other.
** crm_mon shows 2 online nodes and the resources running without errors.
* What exactly did you do (or not do) that was effective (or ineffective)?
For failover testing, we disconnected the eth0 interface on the active node
(cluster-node01).
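(Illustrative sketch of the link drop; the exact mechanism should not
matter, anything that takes eth0 down triggers the behaviour, e.g.:)
#########################
root@cluster-node01:~# ip link set eth0 down
#########################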
* What was the outcome of this action?
** Situation on the active node (cluster-node01)
Corosync on the node becomes unresponsive: it no longer answers commands
such as corosync-cfgtool and corosync-quorumtool.
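(An illustrative way to demonstrate the hang; the 10 second bound is
arbitrary. GNU timeout exits with status 124 when it has to kill the
command:)
#########################
root@cluster-node01:~# timeout 10 corosync-cfgtool -s; echo $?
124
#########################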
In crm_mon, however, the cluster status still looks fine: it claims both
nodes are online and the services are healthy. The corosync log, by
contrast, indicates that the cluster is disconnected.
####### corosync.log ####
Sep 11 10:06:45 [1946] cluster-node01 corosync warning [MAIN ] Totem is unable to form a cluster because of an operating system or network fault (reason: totem is continuously in gather state). The most common cause of this message is that the local firewall is configured improperly.
#########################
** Situation on the passive node (cluster-node02)
Corosync still responds to commands such as corosync-cfgtool and shows
cluster-node01 as offline on all links.
#########################
####### corosync.log #######
Sep 11 10:06:09 [1941] cluster-node02 corosync info [KNET ] link: host: 1 link: 0 is down
Sep 11 10:06:09 [1941] cluster-node02 corosync info [KNET ] host: host: 1 has 1 active links
Sep 11 10:06:10 [1941] cluster-node02 corosync notice [TOTEM ] Token has not been received in 2250 ms
Sep 11 10:06:11 [1941] cluster-node02 corosync notice [TOTEM ] A processor failed, forming new configuration.
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [TOTEM ] A new membership (2:16) was formed. Members left: 1
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [TOTEM ] Failed to receive the leave message. failed: 1
Sep 11 10:06:15 [1941] cluster-node02 corosync warning [CPG ] downlist left_list: 1 received
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [QUORUM] Members[1]: 2
Sep 11 10:06:15 [1941] cluster-node02 corosync notice [MAIN ] Completed service synchronization, ready to provide service.
Sep 11 10:06:16 [1941] cluster-node02 corosync info [KNET ] link: host: 1 link: 1 is down
Sep 11 10:06:16 [1941] cluster-node02 corosync info [KNET ] host: host: 1 has 0 active links
Sep 11 10:06:16 [1941] cluster-node02 corosync warning [KNET ] host: host: 1 has no active links
#########################
#########################
## corosync-cfgtool -s ##
root@cluster-node02:~# corosync-cfgtool -s
Printing link status.
Local node ID 2
LINK ID 0
        addr    = ###.###.###.###
        status:
                node 0: link enabled:1 link connected:0
                node 1: link enabled:1 link connected:1
LINK ID 1
        addr    = ###.###.###.###
        status:
                node 0: link enabled:1 link connected:1
                node 1: link enabled:0 link connected:1
#########################
#########################
##### crm_mon -rfA1 #######
root@cluster-node02:~# crm_mon -rfA1
Stack: corosync
Current DC: cluster-node02 (version 2.0.1-9e909a5bdd) - partition with quorum
Last updated: Fri Sep 11 10:45:53 2020
Last change: Fri Sep 11 10:42:26 2020 by root via cibadmin on cluster-node02
2 nodes configured
7 resources configured
Online: [ cluster-node02 ]
OFFLINE: [ cluster-node01 ]
#########################
Pacemaker therefore attempts a failover.
* What outcome did you expect instead?
With our configuration, the cluster should not take any action: both nodes
should still see each other on link 1 (ring1 on eth1).
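(A sketch of what we would expect corosync-cfgtool -s on cluster-node02 to
report after the test, following the output format above: link 0 down,
link 1 still connected towards node 1.)
#########################
LINK ID 0
        status:
                node 1: link enabled:1 link connected:0
LINK ID 1
        status:
                node 1: link enabled:1 link connected:1
#########################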
* Tests with Corosync 3.0.3 from Debian testing.
We installed the packages from Debian testing and satisfied their
dependencies from Debian backports:
#########################
apt install libnozzle1=1.16-2~bpo10+1 libknet1=1.16-2~bpo10+1 libnl-3-200
libnl-route-3-200 libknet-dev=1.16-2~bpo10+1 ./corosync_3.0.3-2_amd64.deb
./libcorosync-common4_3.0.3-2_amd64.deb
#########################
The described problem does not occur with version 3.0.3 from Debian
testing.
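(To double-check which build is actually running after such a mixed
install, corosync's -v flag prints the version:)
#########################
root@cluster-node01:~# corosync -v
#########################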
-- System Information:
Debian Release: 10.5
APT prefers stable
APT policy: (550, 'stable')
Architecture: amd64 (x86_64)
Kernel: Linux 4.19.0-10-amd64 (SMP w/2 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8),
LANGUAGE=en_US.UTF-8 (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)
Versions of packages corosync depends on:
ii adduser 3.118
ii init-system-helpers 1.56+nmu1
ii libc6 2.28-10
ii libcfg7 3.0.1-2+deb10u1
ii libcmap4 3.0.1-2+deb10u1
ii libcorosync-common4 3.0.1-2+deb10u1
ii libcpg4 3.0.1-2+deb10u1
ii libknet1 1.8-2
ii libqb0 1.0.5-1
ii libquorum5 3.0.1-2+deb10u1
ii libstatgrab10 0.91-1+b2
ii libsystemd0 241-7~deb10u4
ii libvotequorum8 3.0.1-2+deb10u1
ii lsb-base 10.2019051400
ii xsltproc 1.1.32-2.2~deb10u1
corosync recommends no packages.
corosync suggests no packages.
-- Configuration Files:
/etc/corosync/corosync.conf changed:
totem {
    version: 2
    cluster_name: debian
    token: 3000
    token_retransmits_before_loss_const: 10
    crypto_model: nss
    crypto_cipher: aes256
    crypto_hash: sha256
    link_mode: active
    keyfile: /etc/corosync/authkey
}
nodelist {
    node {
        nodeid: 1
        name: cluster-node01
        ring0_addr: ###.###.###.142
        ring1_addr: 192.168.14.1
    }
    node {
        nodeid: 2
        name: cluster-node02
        ring0_addr: ###.###.###.143
        ring1_addr: 192.168.14.2
    }
}
logging {
    fileline: off
    to_stderr: no
    to_syslog: no
    to_logfile: yes
    logfile: /var/log/corosync/corosync.log
    debug: off
    logger_subsys {
        subsys: QUORUM
        debug: off
    }
}
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
    wait_for_all: 1
    auto_tie_breaker: 0
}
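(Not part of our configuration, mentioned only as a pointer: corosync 3
also exposes per-link failure detection timing via totem interface
sections, see corosync.conf(5). The values below are illustrative, not a
recommendation.)
#########################
totem {
    interface {
        linknumber: 0
        # milliseconds between knet heartbeats on this link (illustrative)
        knet_ping_interval: 500
        # milliseconds without a reply before the link is marked down (illustrative)
        knet_ping_timeout: 1000
    }
}
#########################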
-- no debconf information