Bug#897266: systemd: journalctl assertion failure

Tue Jun 5 18:37:37 BST 2018

On Sun, May 20, 2018 at 09:58:10PM +0200, Michael Biebl <biebl at debian.org> wrote:
> That seems very strange. The only case where I personally ran into
> journal file corruption is when I had to power cycle the machine.
> But you said that journald ran uninterrupted for 40 days.
> Would it be possible that this is a hardware or file system issue?

I have an update on this: systemd is likely off the hook for the
corruption itself. Clearly it shouldn't crash, but I can reproduce the
corruption now, and it's almost certainly a linux 4.14 bug.

As for background: linux 4.4 was the last kernel which worked on our
servers.  At some point in 4.6, we started getting frequent OOM kills a
few hours after booting, despite many gigabytes of memory "available"
(e.g. used as cache) (you might remember me complaining about missing 4.4
compatibility for this reason - we couldn't switch to 4.9). The first
kernel that kind of worked for us again was 4.14, but only with this
hourly cronjob:

   echo 3 >/proc/sys/vm/drop_caches

Without it, mysql still gets killed once per week or so. The cronjob doesn't
help with debian's 4.9 LTS kernel, which is why we use the 4.14 LTS kernel
from the ubuntu mainline ppa.

And the above command causes corruption of the systemd journal. I have
reproduced this multiple times now, by deleting the journal and restarting
journald, then waiting for a day, and then doing this:

   # journalctl --verify
   [everything fine at this point]
   # echo 3 >/proc/sys/vm/drop_caches
   # journalctl --verify
   [journal now reporting corruption problems]
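
The steps above can be wrapped in a small function so they are easy to
repeat. This is only a sketch: check_drop_caches and DROP_CACHES_FILE are
made-up names, it has to run as root, and it assumes that journalctl --verify
exits non-zero when verification fails.

```shell
#!/bin/sh
# check_drop_caches: verify the journal, drop the caches, verify again.
# Returns 0 if the journal verifies both times, 1 if it only fails after
# the drop, and 2 if it already failed before the drop.
check_drop_caches() {
    # Hypothetical override, only there so the function can be dry-run
    # against an ordinary file instead of the real sysctl.
    drop_file=${DROP_CACHES_FILE:-/proc/sys/vm/drop_caches}

    if ! journalctl --verify >/dev/null 2>&1; then
        echo "journal already fails verification before dropping caches"
        return 2
    fi

    # Same command as in the cronjob: drop page cache, dentries and inodes.
    echo 3 > "$drop_file"

    if ! journalctl --verify >/dev/null 2>&1; then
        echo "journal fails verification after dropping caches"
        return 1
    fi

    echo "journal verifies before and after dropping caches"
    return 0
}
```

Calling check_drop_caches as root after journald has been running for a day
corresponds to the manual sequence shown above.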

We are in the lucky position to have "expected" md5 checksums for practically
all files on the servers this happens on (and debian packages usually have
md5sum files as well) and luckily, neither the fs itself nor any other file
seems corrupted, including some write-heavy mysql databases and over 53TB of
data we verified.
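
For the debian package part, a check along these lines is possible. It is a
sketch, not the procedure actually used on these servers: verify_md5_list is
a made-up helper, and it relies on the package md5sums files under
/var/lib/dpkg/info listing paths relative to the filesystem root.

```shell
#!/bin/sh
# verify_md5_list LIST [ROOT]
# Checks every file named in LIST (md5sum format) against its recorded
# checksum. LIST must be an absolute path, because md5sum -c resolves the
# listed file names relative to the current directory, which is changed
# to ROOT (default /) in a subshell.
verify_md5_list() {
    list=$1
    root=${2:-/}
    # --quiet suppresses the "OK" lines, so only mismatches are printed.
    (cd "$root" && md5sum -c --quiet "$list")
}
```

For example, verify_md5_list /var/lib/dpkg/info/systemd.md5sums / prints
nothing and returns 0 if all files of that package still match.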

Only one other program also suffers from corruption: rtorrent, which doesn't
run on many servers :) which is why I found out about it only by accident.
There, the same pattern happens: downloading a torrent is fine, downloading
a torrent while dropping the caches frequently causes file corruption.

I also have cmp -l output from a corrupted file vs. a correct file, and it
seems the corruption manifests itself as streaks of zero bytes instead of
the real data that should be there. The streaks are not aligned to anything
obvious, such as 512 or 4k boundaries.
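
To summarise such cmp -l output, something like the following can group the
differing bytes into streaks and show their alignment. It is a sketch:
zero_streaks is a made-up helper, and it only sees bytes that differ, so a
streak is split wherever the correct file itself contains a zero byte.

```shell
#!/bin/sh
# zero_streaks GOOD BAD
# Prints one line per run of consecutive differing bytes that are zero in
# BAD: 0-based start offset, length, start mod 512, start mod 4096.
zero_streaks() {
    # cmp -l prints the 1-based byte offset in decimal, followed by the
    # byte value from each file in octal.
    cmp -l "$1" "$2" | awk '
        function emit() {
            if (len > 0)
                printf "%d %d %d %d\n", start, len, start % 512, start % 4096
            len = 0
        }
        {
            off = $1 - 1
            if ($3 == 0 && len > 0 && off == start + len) {
                len++                 # extends the current streak
            } else {
                emit()                # close the previous streak, if any
                if ($3 == 0) { start = off; len = 1 }
            }
        }
        END { emit() }
    '
}
```

Run as zero_streaks correct.file corrupted.file (both file names are
placeholders). The last two columns are 0 for a streak that starts on a
512 byte or 4k boundary.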

I will pursue this with the linux upstream. It's possible that systemd
(like rtorrent is known to) does something to increase the chance of
corruption, as it luckily only seems to affect those two programs, but it's
unlikely to be a bug in systemd itself (other than that it probably
shouldn't crash), as drop_caches is supposed to be safe.

Greetings,

-- 
                The choice of a       Deliantra, the free code+content MORPG
      -----==-     _GNU_              http://www.deliantra.net
      ----==-- _       generation
      ---==---(_)__  __ ____  __      Marc Lehmann
      --==---/ / _ \/ // /\ \/ /      schmorp at schmorp.de
      -=====/_/_//_/\_,_/ /_/\_\