[Debian-ha-maintainers] Bug#962454: Link failures after upgrade to +deb10u1

Mon Jun 8 11:29:35 BST 2020

Source: corosync
Version: 3.0.1-2+deb10u1
Severity: important

Hi,

Some weeks ago I upgraded corosync (3.0.1-2 -> 3.0.1-2+deb10u1) and
started noticing these messages on my nodes (a two-node cluster):
Jun  2 01:10:13 patty corosync[2346]:   [KNET  ] link: host: 2 link: 0 is down
Jun  2 01:10:13 patty corosync[2346]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun  2 01:10:14 patty corosync[2346]:   [KNET  ] rx: host: 2 link: 0 is up
Jun  2 01:10:14 patty corosync[2346]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun  3 03:11:07 patty corosync[2346]:   [KNET  ] link: host: 2 link: 1 is down
Jun  3 03:11:07 patty corosync[2346]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
Jun  3 03:11:08 patty corosync[2346]:   [KNET  ] rx: host: 2 link: 1 is up
Jun  3 03:11:08 patty corosync[2346]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)

Notice that the failure happens on both links.  One of the links is a
cross-over cable. The other uses a bond with two interfaces.

These errors are more common on one of the nodes than on the other.

Sometimes they match (both nodes log the link failure), but most of the
time only one node complains:

Jun  4 01:16:23 selma corosync[52890]:   [KNET  ] link: host: 1 link: 0 is down
Jun  4 01:16:23 selma corosync[52890]:   [KNET  ] host: host: 1 (passive) best link: 1 (pri: 1)
Jun  4 01:16:24 selma corosync[52890]:   [KNET  ] rx: host: 1 link: 0 is up
Jun  4 01:16:24 selma corosync[52890]:   [KNET  ] host: host: 1 (passive) best link: 0 (pri: 1)
Jun  4 01:16:55 patty corosync[2346]:   [KNET  ] link: host: 2 link: 0 is down
Jun  4 01:16:55 patty corosync[2346]:   [KNET  ] host: host: 2 (passive) best link: 1 (pri: 1)
Jun  4 01:16:56 patty corosync[2346]:   [KNET  ] rx: host: 2 link: 0 is up
Jun  4 01:16:56 patty corosync[2346]:   [KNET  ] host: host: 2 (passive) best link: 0 (pri: 1)
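
In case it helps to quantify the "more common on one node" observation,
here is a rough sketch of how the link-down events could be tallied per
reporting node and link. It assumes the syslog line format quoted above;
the function name count_link_down and the SAMPLE text are only
illustrative and are not part of corosync or knet.

```python
import re
from collections import Counter

# Matches only the KNET "link ... is down" lines, in the syslog format
# quoted in this report (assumption: other setups may log differently).
DOWN_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<node>\S+)\s+corosync\[\d+\]:\s+"
    r"\[KNET\s*\]\s+link: host: (?P<peer>\d+) link: (?P<link>\d+) is down"
)


def count_link_down(lines):
    """Return a Counter keyed by (reporting node, peer host id, link number)."""
    counts = Counter()
    for line in lines:
        m = DOWN_RE.match(line.strip())
        if m:
            counts[(m["node"], int(m["peer"]), int(m["link"]))] += 1
    return counts


# The "is down" lines from this report, plus one "is up" line that the
# pattern is expected to ignore.
SAMPLE = """\
Jun  2 01:10:13 patty corosync[2346]:   [KNET  ] link: host: 2 link: 0 is down
Jun  2 01:10:14 patty corosync[2346]:   [KNET  ] rx: host: 2 link: 0 is up
Jun  3 03:11:07 patty corosync[2346]:   [KNET  ] link: host: 2 link: 1 is down
Jun  4 01:16:23 selma corosync[52890]:   [KNET  ] link: host: 1 link: 0 is down
Jun  4 01:16:55 patty corosync[2346]:   [KNET  ] link: host: 2 link: 0 is down
"""

if __name__ == "__main__":
    for (node, peer, link), n in sorted(count_link_down(SAMPLE.splitlines()).items()):
        print(f"{node}: link {link} to host {peer} went down {n} time(s)")
```

Feeding it the full syslog from each node instead of SAMPLE should show
whether one node, or one link, accounts for most of the events.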

Here's my config:
totem {
        version: 2
        cluster_name: web
        crypto_cipher: none
        crypto_hash: none
        interface {
                linknumber: 0
        }
        interface {
                linknumber: 1
        }
}
logging {
        fileline: off
        to_stderr: yes
        to_logfile: yes
        logfile: /var/log/corosync/corosync.log
        to_syslog: yes
        debug: off
        logger_subsys {
                subsys: QUORUM
                debug: off
        }
}
quorum {
        provider: corosync_votequorum
        expected_votes: 2
        two_node: 1
}
nodelist {
        node {
                name: patty
                nodeid: 1
                ring0_addr: 192.168.144.1
                ring1_addr: 10.10.1.5
        }
        node {
                name: selma
                nodeid: 2
                ring0_addr: 192.168.144.2
                ring1_addr: 10.10.1.6
        }
}

Any help is appreciated. Thanks,

Alberto

-- System Information:
Debian Release: bullseye/sid
  APT prefers unstable
  APT policy: (500, 'unstable'), (500, 'testing'), (500, 'stable'), (1, 'experimental')
Architecture: amd64 (x86_64)
Foreign Architectures: i386

Kernel: Linux 5.6.0-1-amd64 (SMP w/4 CPU cores)
Kernel taint flags: TAINT_FIRMWARE_WORKAROUND
Locale: LANG=C.UTF-8, LC_CTYPE=C.UTF-8 (charmap=UTF-8), LANGUAGE= (charmap=UTF-8)
Shell: /bin/sh linked to /bin/dash
Init: systemd (via /run/systemd/system)