[Debian-ha-maintainers] corosync-blackbox qb_rb_chunk_read failed

braun braun at dc.uni-due.de
Thu Jul 23 11:20:14 BST 2015


Hello,

Thank you for the helpful information; it saved me time and nerves and 
cleared up my problem. I will avoid ppa.mmogp.com and uninstall those 
packages and sources, and I won't spend more time trying to resolve this 
blackbox error. The most critical part of my servers is DRBD, and that 
is HA by nature. For now I will go back to my own failover scripts, 
which are simple and worked for years on other squeeze servers before I 
switched to pacemaker. It is a regression for me, but I will keep 
tracking the development of corosync/pacemaker in Debian and hope it 
will be available in good condition before long.
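
To illustrate what I mean by "simple": per resource the scripts boil down
to a few drbdadm and mount calls, along the lines of this sketch (the
resource name r0, device and mount point are placeholders, not our real
setup):

   #!/bin/sh
   # promote this node and bring the service up on top of DRBD
   drbdadm primary r0           # take over the DRBD resource
   mount /dev/drbd0 /srv/data   # mount the replicated volume
   service myservice start     # start the service that uses it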

Best Regards,
Christina

On 23.07.2015 07:37, Richard B Winters wrote:
> On 07/22/2015 08:47 AM, braun wrote:
>> I want to use corosync/pacemaker in production. Is this a serious error?
>>
>>
>> corosync-blackbox ends with error:
>>
>> [debug] shm size:8392704; real_size:8392704; rb->word_size:2098176
>> [debug] read total of: 8392724
>> ERROR: qb_rb_chunk_read failed: Connection timed out
>> [trace] ENTERING qb_rb_close()
>> [debug] Free'ing ringbuffer: /dev/shm/qb-create_from_file-header
>>
>> root at willi:~/errorblackbox# df -kl /dev/shm
>> Filesystem     1K-blocks    Used  Available Use% Mounted on
>> tmpfs           16462300   14448  16447852    1% /dev/shm
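>>
>> (For context: as far as I can tell, corosync-blackbox is just a thin
>> shell wrapper that asks corosync via cmap to dump its flight data and
>> then decodes the dump with qb-blackbox. Roughly like this - the cmap key
>> and the fdata path are my reading of the script, so treat them as
>> assumptions:
>>
>>    # trigger a flight-data dump, then decode the dump file
>>    corosync-cmapctl -s runtime.blackbox.dump_flight_data str "$(date +%s)"
>>    qb-blackbox /var/lib/corosync/fdata
>>
>> so the timeout above happens while qb-blackbox reads the ring buffer
>> back from the dump.)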
>>
>>
>> I followed the wiki on
>> https://wiki.debian.org/Debian-HA/ClustersFromScratch and used the
>> following packages:
>>
>> corosync:
>>    Installed:           2.3.4-1
>>    Candidate:           2.3.4-1
>>    Version table:
>>   *** 2.3.4-1 0
>>          500 http://ppa.mmogp.com/apt/debian/ jessie/main amd64 Packages
>>          100 /var/lib/dpkg/status
>>       1.4.6-1.1 0
>>          500 http://debian.uni-duisburg-essen.de/debian/ jessie/main
>> amd64 Packages
>>          500 http://ftp.de.debian.org/debian/ testing/main amd64 Packages
>> pacemaker:
>>    Installed:           1.1.12-1
>>    Candidate:           1.1.12-1
>>    Version table:
>>   *** 1.1.12-1 0
>>          500 http://ppa.mmogp.com/apt/debian/ jessie/main amd64 Packages
>>          100 /var/lib/dpkg/status
>> fence-agents:
>>    Installed:           4.0.18-1
>>    Candidate:           4.0.18-1
>>    Version table:
>>   *** 4.0.18-1 0
>>          500 http://ftp.de.debian.org/debian/ testing/main amd64 Packages
>>          100 /var/lib/dpkg/status
>>       4.0.17-2 0
>>          500 http://ppa.mmogp.com/apt/debian/ jessie/main amd64 Packages
>>       3.1.5-2 0
>>          500 http://debian.uni-duisburg-essen.de/debian/ jessie/main
>> amd64 Packages
>> libqb0:
>>    Installed:           0.17.1-4
>>    Candidate:           0.17.1-4
>>    Version table:
>>   *** 0.17.1-4 0
>>          500 http://ftp.de.debian.org/debian/ testing/main amd64 Packages
>>          100 /var/lib/dpkg/status
>>       0.17.1-1 0
>>          500 http://ppa.mmogp.com/apt/debian/ jessie/main amd64 Packages
>>       0.11.1-2 0
>>          500 http://debian.uni-duisburg-essen.de/debian/ jessie/main
>> amd64 Packages
>> libqb-dev:
>>    Installed:           0.17.1-4
>>    Candidate:           0.17.1-4
>>    Version table:
>>   *** 0.17.1-4 0
>>          500 http://ftp.de.debian.org/debian/ testing/main amd64 Packages
>>          100 /var/lib/dpkg/status
>>       0.17.1-1 0
>>          500 http://ppa.mmogp.com/apt/debian/ jessie/main amd64 Packages
>>       0.11.1-2 0
>>          500 http://debian.uni-duisburg-essen.de/debian/ jessie/main
>> amd64 Packages
>>
>> corosync.conf (modified from wheezy, where I didn't try corosync-blackbox):
>>
>> totem {
>>      version: 2
>>
>>          ip_version: ipv4
>>
>>      # How long before declaring a token lost (ms)
>>      token: 4000
>>
>>      # How many token retransmits before forming a new configuration
>>      token_retransmits_before_loss_const: 10
>>
>>      # How long to wait for join messages in the membership protocol (ms)
>>      join: 1100
>>
>>      # How long to wait for consensus to be achieved before starting a
>> new round of membership configuration (ms)
>>      consensus: 3600
>>
>>      # How long to wait before checking for a partition when no
>> multicast
>>      merge: 1000
>>
>>      # Turn off the virtual synchrony filter
>>      vsftype: none
>>
>>      # Number of messages that may be sent by one processor on receipt of
>> the token
>>      max_messages: 100
>>
>>      # Limit generated nodeids to 31-bits (positive signed integers)
>>      clear_node_high_bit: yes
>>
>>      crypto_cipher: aes256
>>      crypto_hash: sha1
>>
>> ###    # Disable encryption
>> ###     secauth: off
>>
>>      # How many threads to use for encryption/decryption
>>      ### we have 4 CPU cores C.B.
>>       threads: 4
>>
>>      # Optionally assign a fixed node id (integer)
>>      ### I took this from the IP; we don't actually need it with
>> IPv4 C.B.
>> #    nodeid: 22
>>
>>      # This specifies the mode of redundant ring, which may be none,
>> active, or passive.
>>       rrp_mode: none
>>
>> ### used ucast for the interfaces, since there are only 2 hosts and it is
>> ### simpler than mcast
>> ### but somehow it does not work
>>       interface {
>>          # The following values need to be set based on your environment
>>          # for the disk mirror
>>          ringnumber: 0
>>          bindnetaddr: 192.168.63.0
>>          broadcast: yes
>>          ##mcastaddr: 230.168.63.1
>>          #mcastport: 5405
>>          mcastport: 5405
>>          }
>>          transport: udpu
>> }
>>
>> nodelist {
>>                node {
>>          ring0_addr: 192.168.63.22
>>          nodeid: 22
>>                  }
>>                node {
>>          ring0_addr: 192.168.63.24
>>          nodeid: 24
>>                  }
>> }
>>
>>
>> quorum {
>>          provider: corosync_votequorum
>>          two_node: 1
>>          expected_votes: 2
>> }
>>
>>
>>
>> logging {
>>          fileline: off
>>          to_stderr: yes
>>          to_logfile: yes
>>          to_syslog: yes
>>      logfile: /var/log/corosync.log
>>      syslog_facility: daemon
>>          debug: off
>>          timestamp: on
>>          logger_subsys {
>>                  subsys: QUORUM
>>                  debug: off
>>          }
>> }
>>
>> qb    {
>>      ipc_type: shm
>>      }
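>>
>> After starting corosync with this config I check ring and quorum state
>> with the standard tools (just a sketch, output omitted here):
>>
>>    corosync-cfgtool -s               # ring 0 status on the local node
>>    corosync-quorumtool -s            # votes, quorum state, membership
>>    corosync-cmapctl | grep members   # runtime membership keys from cmap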
>>
>> corosync.log:
>>
>> Jul 22 13:59:32 [1431] willi corosync info    [MAIN  ] Corosync built-in
>> features: dbus testagents rdma watchdog augeas systemd upstart xmlconf
>> qdevices snmp pie relro bindnow
>> Jul 22 13:59:32 [1431] willi corosync notice  [TOTEM ] Initializing
>> transport (UDP/IP Unicast).
>> Jul 22 13:59:32 [1431] willi corosync notice  [TOTEM ] Initializing
>> transmit/receive security (NSS) crypto: aes256 hash: sha1
>> Jul 22 13:59:32 [1431] willi corosync notice  [TOTEM ] The network
>> interface [192.168.63.24] is now up.
>> Jul 22 13:59:32 [1431] willi corosync notice  [SERV  ] Service engine
>> loaded: corosync configuration map access [0]
>> Jul 22 13:59:32 [1431] willi corosync info    [QB    ] server name: cmap
>> Jul 22 13:59:32 [1431] willi corosync notice  [SERV  ] Service engine
>> loaded: corosync configuration service [1]
>> Jul 22 13:59:32 [1431] willi corosync info    [QB    ] server name: cfg
>> Jul 22 13:59:32 [1431] willi corosync notice  [SERV  ] Service engine
>> loaded: corosync cluster closed process group
>> service v1.01 [2]
>> Jul 22 13:59:32 [1431] willi corosync info    [QB    ] server name: cpg
>> Jul 22 13:59:32 [1431] willi corosync notice  [SERV  ] Service engine
>> loaded: corosync profile loading service [4]
>> Jul 22 13:59:32 [1431] willi corosync info    [WD    ] Watchdog is now
>> been tickled by corosync.
>> Jul 22 13:59:32 [1431] willi corosync info    [WD    ] no resources
>> configured.
>> Jul 22 13:59:32 [1431] willi corosync notice  [SERV  ] Service engine
>> loaded: corosync watchdog service [7]
>> Jul 22 13:59:32 [1431] willi corosync notice  [QUORUM] Using quorum
>> provider corosync_votequorum
>> Jul 22 13:59:32 [1431] willi corosync notice  [VOTEQ ] Waiting for all
>> cluster members. Current votes: 1 expected_votes: 2
>> Jul 22 13:59:32 [1431] willi corosync notice  [SERV  ] Service engine
>> loaded: corosync vote quorum service v1.0 [5]
>> Jul 22 13:59:32 [1431] willi corosync info    [QB    ] server name:
>> votequorum
>> Jul 22 13:59:32 [1431] willi corosync notice  [SERV  ] Service engine
>> loaded: corosync cluster quorum service v0.1 [3]
>> Jul 22 13:59:32 [1431] willi corosync info    [QB    ] server name: quorum
>> Jul 22 13:59:32 [1431] willi corosync notice  [TOTEM ] adding new UDPU
>> member {192.168.63.22}
>> Jul 22 13:59:32 [1431] willi corosync notice  [TOTEM ] adding new UDPU
>> member {192.168.63.24}
>> Jul 22 13:59:32 [1431] willi corosync notice  [TOTEM ] A new membership
>> (192.168.63.24:30668) was formed. Members joined: 24
>> Jul 22 13:59:32 [1431] willi corosync notice  [VOTEQ ] Waiting for all
>> cluster members. Current votes: 1 expected_votes: 2
>> Jul 22 13:59:32 [1431] willi corosync notice  [VOTEQ ] Waiting for all
>> cluster members. Current votes: 1 expected_votes: 2
>> Jul 22 13:59:32 [1431] willi corosync notice  [VOTEQ ] Waiting for all
>> cluster members. Current votes: 1 expected_votes: 2
>> Jul 22 13:59:32 [1431] willi corosync notice  [QUORUM] Members[1]: 24
>> Jul 22 13:59:32 [1431] willi corosync notice  [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> Jul 22 13:59:40 [1431] willi corosync notice  [TOTEM ] A new membership
>> (192.168.63.22:30672) was formed. Members joined: 22
>> Jul 22 13:59:40 [1431] willi corosync notice  [QUORUM] This node is
>> within the primary component and will provide service.
>> Jul 22 13:59:40 [1431] willi corosync notice  [QUORUM] Members[2]: 22 24
>> Jul 22 13:59:40 [1431] willi corosync notice  [MAIN  ] Completed service
>> synchronization, ready to provide service.
>> root at willi:~/errorblackbox#
>>
> Hello,
>
> First, please allow me to give you a proper warning regarding
> ppa.mmogp.com: it is _not_ a good candidate for production use; it is a
> freely available PPA, and it is _unofficial_. Please use this repository
> at your own risk.
>
> In Debian Jessie and later, the Pacemaker/Corosync stack is not yet
> ready for production, and for that we apologize - in time it will be
> officially available for Debian, but we are still testing and preparing
> all of the packages and documentation.
>
> I did some searching regarding your error and came across this Red Hat
> issue that seems to be related; perhaps it can be of some help to you:
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1114852
>
> The next time you get this error, please send us not only the full
> blackbox output but also the corosync, pacemaker, libqb, and syslog
> logs - that would help us better identify the issue, and perhaps let us
> offer a solution. A sketch of how to collect them follows below.
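>
> Pacemaker ships crm_report for collecting exactly such a bundle;
> something along these lines should work (the timestamp is only an
> example - set it to shortly before the error occurred):
>
>    # gather cluster logs and state from that time into one archive
>    crm_report -f "2015-07-22 13:00:00" blackbox-report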
>
> If the error is related to the one mentioned in the link above, a good
> point to note is that the reader seems to be attempting to read
> uninitialized memory:
>
>
>> [trace] ENTERING qb_rb_close()
>> [debug] Free'ing ringbuffer: /dev/shm/qb-create_from_file-header
> It should be resolved; it is a serious error.
>
>
> You could also try using a configuration generated by pcs or crmsh,
> start from that basic variant, and build upon it; see the example below.
> There are major differences in the APIs and facilities of the stack
> between Wheezy and Jessie.
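>
> For example, with pcs a minimal two-node setup looks roughly like this
> (0.9.x syntax assumed; node1/node2 are placeholders for your hosts):
>
>    pcs cluster auth node1 node2                    # authenticate pcsd
>    pcs cluster setup --name mycluster node1 node2  # writes corosync.conf
>    pcs cluster start --all
>
> The generated /etc/corosync/corosync.conf then gives you a known-good
> baseline to compare your hand-written one against.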
>
> You can also seek help on OFTC in the #debian-ha channel, as well as
> freenode.net's #clusterlabs channel.
>
>
> I hope that helps to provide you with some direction.
>
>
> Best,
>
> _______________________________________________
> Debian-ha-maintainers mailing list
> Debian-ha-maintainers at lists.alioth.debian.org
> http://lists.alioth.debian.org/cgi-bin/mailman/listinfo/debian-ha-maintainers


-- 
**********************************************
* email: braun at dc.uni-due.de
* Christina Braun
* WiWI/ICB/Informatik
* University DUE
* Schuetzenbahn 70  tel.: + 49    201 183-3929
* D-45127 Essen     fax.: + 49    201 183-2419
**********************************************



