[Pkg-xen-devel] Bug#927071: Bug#927071: xen: More balloon-leak observation

Hans van Kranenburg hans at knorrie.org
Fri Sep 18 21:55:40 BST 2020


Hi again,

On 5/1/19 12:55 AM, Elliott Mitchell wrote:
> On Mon, Apr 22, 2019 at 04:02:28PM +0200, Hans van Kranenburg wrote:
>> On 4/22/19 1:10 AM, Elliott Mitchell wrote:
>>> There is plenty of free memory for creating additional VMs (perhaps too
>>> much, and that confused Xen?), so this is really puzzling that memory is
>>> being ballooned away from Dom0.  At this point I plan after the next
>>> restart to double the allocation for Dom0 and see whether Dom0 is able
>>> to last more than a week.
>>
>> Weird. Can you log memory stats over time, so that you can see when it
>> happens, and correlate it to other events?
> 
> At this point there is only one real pattern I've noticed:  Always
> `smartd` was the process which triggered the kernel OOM-killer.
> 
> Originally I was attributing this to `smartd` doing some large memory
> allocation during its night-time tasks (which I would attribute to
> perhaps `smartd` not being that well written).  Yet now, I never saw
> anything else trigger the OOM-killer and I'm now willing to speculate
> some I/O operation `smartd` was doing triggers a bug in Xen.

At first I replied with "I haven't heard about this symptom before your
report.", but I later realized that I am actually seeing the same kind
of behaviour.

During a debian-xen day in Feb 2020, I had a closer look at this
together with Ian, and we ended up thinking that there's actually some
kind of obscure miscalculation bug happening. If you look closely at
the numbers in xl info and xl list, you'll see that they simply do not
add up.
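
To give an idea of the kind of cross-check I mean, something like the
quick sketch below (assuming the usual xl info fields total_memory and
free_memory, both in MiB, and the default xl list column layout; the
script itself is just an illustration, not what we used back then):

#!/usr/bin/env python3
# Quick sketch: cross-check the memory numbers from `xl info` and `xl list`.
# Assumes the usual field names (total_memory, free_memory, in MiB) and
# the default `xl list` columns (Name, ID, Mem, VCPUs, State, Time).
import subprocess

def xl_info():
    """Return the `xl info` output as a dict of key -> value strings."""
    out = subprocess.run(["xl", "info"], capture_output=True,
                         text=True, check=True).stdout
    info = {}
    for line in out.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info

def xl_list_mem():
    """Return a dict of domain name -> memory (MiB) from `xl list`."""
    out = subprocess.run(["xl", "list"], capture_output=True,
                         text=True, check=True).stdout
    mem = {}
    for line in out.splitlines()[1:]:   # skip the header line
        fields = line.split()
        if len(fields) >= 3:
            mem[fields[0]] = int(fields[2])
    return mem

info = xl_info()
total = int(info["total_memory"])
free = int(info["free_memory"])
assigned = sum(xl_list_mem().values())

print("total_memory: %d MiB, free_memory: %d MiB" % (total, free))
print("sum of domain allocations (xl list): %d MiB" % assigned)
# total - free - assigned should stay roughly constant (Xen's own
# overhead); if it drifts over time, that's the "numbers do not add up"
# effect described above.
print("unaccounted: %d MiB" % (total - free - assigned))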

The dom0 looks like it is being ballooned down, but it's a fake
down-ballooning: the result of an accounting error.

I can't provide more proof right now; I first have to reproduce this in
a simplified environment so that I can provide a walk-through scenario
with all the numbers.

And yes, I have seen the OOM killer kill processes in customer
production environments because of this. O_O

A member of my team has been busy doing storage migrations where we
attach new block devices to domUs and then sync all their data to the
new filesystem (moving from ext4 to btrfs and also to new iSCSI
storage), and later reboot after a final sync, swap block devices, etc.
From the graphs we've been looking at, combined with the timing of the
migration activity, I suspect that the fake dom0 down-ballooning is
related to grant mappings: the dom0 memory does not seem to decrease
when attaching the new disk, but it does once activity on the disk
starts.
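
For the record, a minimal sketch of the kind of periodic logging that
helps correlate this (assuming it runs in dom0 with the xl toolstack
available; the 60 second interval and the output format are arbitrary
choices):

#!/usr/bin/env python3
# Minimal sketch: periodically log dom0's memory as seen by the
# hypervisor (`xl list 0`, third column, MiB) next to the dom0 kernel's
# own view (MemTotal/MemFree from /proc/meminfo), so the drops can be
# correlated with domU disk attach / sync activity.
import subprocess
import time

def dom0_mem_xl():
    """dom0 memory (MiB) as reported by `xl list 0`."""
    out = subprocess.run(["xl", "list", "0"], capture_output=True,
                         text=True, check=True).stdout
    return int(out.splitlines()[1].split()[2])

def meminfo(field):
    """Value (kB) of a field like 'MemTotal' from /proc/meminfo."""
    with open("/proc/meminfo") as f:
        for line in f:
            if line.startswith(field + ":"):
                return int(line.split()[1])
    raise RuntimeError("%s not found in /proc/meminfo" % field)

while True:
    print("%s xl_dom0=%dMiB MemTotal=%dkB MemFree=%dkB" % (
        time.strftime("%Y-%m-%d %H:%M:%S"),
        dom0_mem_xl(), meminfo("MemTotal"), meminfo("MemFree")),
        flush=True)
    time.sleep(60)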

To be continued....

Hans


