[pkg-bacula-devel] Bug#1012301: bacula: Corruption of File media during concurrent backups
Julien Chiaramello
jchiaramello at quantificare.com
Fri Jun 3 09:10:12 BST 2022
Source: bacula
Severity: important
Tags: upstream
Hello,
Under a specific type of configuration, a Bacula job may occasionally corrupt a
previously written volume, losing all data on it. The following conditions
have been identified:
- Multiple concurrent jobs are started at the same time, all using the same
Schedule and Pool
- That Pool has a Volume Use Duration (VUD) which is longer than the interval
between Jobs in the Schedule (for example, hourly backups with a VUD of 2 hours)
- The Pool uses a Device which uses File media
Once these conditions are met, a job may randomly corrupt a volume,
typically when that volume is marked as "Used". This has the following
consequences:
- The Job status is "OK -- with warnings"
- The Job log includes the following error from the "mount.c" file: "Hey!!!!!
WroteVol non-zero !!!!!"
- One of the previously written volumes is marked as being in Error
- That volume's size on the filesystem drops below 1 kB (effectively erased)
- Attempting to restore files from a volume in Error fails (ending up as a
mismatch)
=== Steps to Reproduce ===
Configure a Bacula cluster that meets the following conditions:
- A Device must use "Media Type = File"
- A Pool must have a certain Volume Use Duration (for example, 2 hours)
- A Schedule must run regular jobs at an interval shorter than the Volume
Use Duration of the Pool (for example, every hour)
- Multiple Jobs must be using this Schedule and Pool
- The jobs must run concurrently
Under these conditions, a job will eventually corrupt a previously written
volume.
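For reference, the conditions above could be expressed roughly as in the
fragment below. This is only a minimal sketch and not our production
configuration: all resource names, the address, the password, the path and the
concurrency limit are hypothetical. Directives unrelated to the bug (Client,
FileSet, Messages and so on) are omitted, so the fragment will not load as-is.

```
# bacula-dir.conf (fragment, hypothetical names and values)
Schedule {
  Name = "HourlyCycle"
  # One run per hour
  Run = Level=Incremental hourly at 0:05
}

Pool {
  Name = "FilePool"
  Pool Type = Backup
  Label Format = "Vol-"
  # Longer than the one-hour interval of the Schedule
  Volume Use Duration = 2 hours
}

Storage {
  Name = "FileStorage"
  Address = sd.example.org
  Password = "changeme"
  Device = "FileDevice"
  Media Type = File
  # Must be greater than 1 so that jobs can run concurrently
  Maximum Concurrent Jobs = 5
}

# Several Job resources like this one share the same Schedule and Pool
Job {
  Name = "BackupClientA"
  Type = Backup
  Schedule = "HourlyCycle"
  Pool = "FilePool"
  Storage = "FileStorage"
}

# bacula-sd.conf (fragment)
Device {
  Name = "FileDevice"
  Media Type = File
  Archive Device = /srv/bacula/volumes
  LabelMedia = yes
  Random Access = yes
  AutomaticMount = yes
  RemovableMedia = no
  AlwaysOpen = no
}
```

For the jobs to actually overlap, Maximum Concurrent Jobs also has to be raised
above 1 in the other resources involved (Director, Client and the Storage
daemon itself).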
=== Additional Information ===
This bug happens on various releases of Bacula from the official Debian packages
(5.2, 7.4 and 9.4 are affected).
This bug happens on multiple separate Bacula clusters (nothing is shared
between them).
In case it matters, the FDs use PKI Signatures and Encryption.
This bug does not happen if the Volume Use Duration is set shorter than the
interval between backups, ensuring a given Volume is never re-used between
"batches" of backups (this is our current workaround).
This bug did not happen before we implemented Concurrent Jobs.
The bug has been reported upstream: https://bugs.bacula.org/view.php?id=2664
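As an illustration of the workaround mentioned above, a Pool along these lines
keeps the Volume Use Duration shorter than an hourly backup interval. The Pool
name and the 30 minute value are only examples chosen for this sketch, not our
actual settings.

```
# bacula-dir.conf (fragment, hypothetical names and values)
Pool {
  Name = "FilePool"
  Pool Type = Backup
  Label Format = "Vol-"
  # Shorter than the one-hour interval between backups, so a Volume
  # written by one batch is already marked "Used" when the next batch starts
  Volume Use Duration = 30 minutes
}
```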
-- System Information:
Debian Release: 10.11
APT prefers oldstable-updates
APT policy: (500, 'oldstable-updates'), (500, 'oldstable')
Architecture: amd64 (x86_64)
Kernel: Linux 4.19.0-18-amd64 (SMP w/32 CPU cores)
Locale: LANG=en_GB.UTF-8, LC_CTYPE=en_GB.UTF-8 (charmap=UTF-8), LANGUAGE=en_GB:en (charmap=UTF-8)
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled