[Pkg-utopia-maintainers] Bug#879898: dbus: memory leak in dbus-daemon consumes over 5GB memory

Fri Oct 27 23:22:22 UTC 2017

On Fri, 27 Oct 2017 at 13:06:32 -0400, Nicholas D Steeves wrote:
> Would using memleax help?  I'm 90% done packaging it, in case it looks
> useful.  Packaged because vagrant can't attach to an existing process.

>From the other information you've given I can tell you with reasonable
certainty that (at least a large part of) the "leaked" memory is not
leaked, but is really still in use - the dbus-daemon is genuinely
trying to process millions of messages to kded5. I can't say why,
though.

> Is it possible for dbus-daemon to signal to the calling application
> that a queue is full?

In principle it might have been if it was defined 10 years ago, but
there is no protocol defined for communicating that fact, and it isn't
clear what an application would do about it anyway. Popping up messages
to annoy the user is generally frowned upon if there is nothing that
most users can do about it...

The dbus 1.11.x development branch (from which a 1.12.0 stable version
will be released soon) does log this information to syslog or the
systemd Journal, but 1.10.x didn't.

The system bus (which has finite limits) responds to excessive messages in
an incoming queue by slowing down reading from that sender, and responds
to excessive messages in an outgoing queue by dropping messages on the
floor. It might be a good idea for the session bus to have a rather smaller
limit on its incoming queues than the 1 GB limit that it has now, perhaps
the 127 MiB that the system bus uses (which tbh could probably also be
reduced) - then floods of messages would get slowed down, but eventually
be processed. Individual messages are allowed to exceed that limit by
up to 1 message, so maximally-sized messages could still be processed
anyway.

> :1.7.stats:         variant             uint32 1000000016

That is indeed slightly more than the arbitrary 1 GB limit that the
session bus has. The limit is conceptually only there to stop byte
counts overflowing a signed 32-bit integer, because session bus users
are trusted to not carry out deliberate denial-of-service attacks on each
other; but I think there would probably be value in turning down the
incoming message limit a bit to prevent accidental denial-of-service.
That would throttle reading from very spammy peers, but wouldn't help
when dealing with peers that read messages slower than they should.

Turning down the *outgoing* message limit is dangerous (it results in
messages being discarded unrecoverably) so we have to be very cautious
about reducing it.

> > dbus-send --session --dest=org.freedesktop.DBus --type=method_call \
> >     --print-reply /org/freedesktop/DBus \
> >     org.freedesktop.DBus.Debug.Stats.GetStats
> 
> I tried this just now and dbus-send seems to have errored with exit
> status 130:
>     Must use org.mydomain.Interface.Method notation, no dot in
>     "/org/freedesktop/DBus"

You have an extra argument somewhere, possibly a space after one of the
backslashes.

> Plasma's NM frontend wouldn't reconnect to the
> network, and I before I realised it I had killed plasmashell ; sleep
> 3; kstart plasmashell.  The dbus-daemon and kded5 daemon seem to be
> unaffected, as expected, but maybe this causes problems with dbus?

Terminating random processes shouldn't be a problem in general (other
than to those processes and anything that was relying on them), but
if some component has a bug and has responded to the unexpected
disappearance by going into a busy-loop, it's possible that the
busy-loop could result in it sending a flood of messages.

Or the inability to reconnect to the network might itself have been a
symptom of a flood of messages, rather than part of the cause.

However, I will say this: if you have a serious problem (like a key
desktop component getting stuck), and you respond to it by killing
individual components rather than by terminating the whole session
(e.g. logging out and back in), then you are certainly in uncharted
territory. This is the sort of thing that in principle ought to work,
but it is never going to be systematically tested, so it is not at all
surprising if it has or exposes bugs.

> IIRC, a couple of days ago kded5 consumed 100%
> of one core for quite some time.  Maybe that's when it was making all
> of these dbus connections?

Lots of messages rather than lots of connections, I think, but most
likely yes.

> I'm honestly not sure if kded5 is truly at
> fault, of if it became overwhelmed when an application spammed [one
> of] its dbus interfaces.

It could be either one.

> Can we keep these two bugs unmerged but associate them somehow?

As far as I know there is no machine-readable way to do this. Just
mention each one in a mail to the other.

Looking at the dbus-daemon stats:

>          string "ListMemPoolUsedBytes"
>          variant             uint32 947456112
>          string "ListMemPoolCachedBytes"
>          variant             uint32 3086736
>          string "ListMemPoolAllocatedBytes"
>          variant             uint32 1610612352

The dbus-daemon keeps some memory pools of pre-allocated linked-list links
and messages ready for reuse, in your case 1.6G of them. If your
dbus-daemon is really processing millions of messages then this seems
plausible to be real use rather than a leak: lots of places in dbus-daemon
have to pre-allocate linked-list links in a very pessimistic way so that
they can be used in error-recovery without worrying about whether memory
can be allocated for them, because dbus-daemon has a policy of assuming
that malloc() can legitimately return NULL.

I have often wondered whether these cache is more harmful than helpful
in the general case, but the worst-case answer is clearly: yes it's
more harmful than helpful (because allocated memory doesn't get freed).
I'm not sure whether that peak memory consumption would ever get returned
to the OS after the heap had grown in any case, though. Last time I
experimented with turning off the message cache the answer seemed to be
that it wasn't.

>          string "ActiveConnections"
>          variant             uint32 76
>          string "IncompleteConnections"
>          variant             uint32 0
>          string "BusNames"
>          variant             uint32 171
>          string "PeakBusNames"
>          variant             uint32 191
>          string "PeakBusNamesPerConnection"
>          variant             uint32 7

Those look normal.

>          string "MatchRules"
>          variant             uint32 1168
>          string "PeakMatchRules"
>          variant             uint32 1273
>          string "PeakMatchRulesPerConnection"
>          variant             uint32 132

132 match rules per connection seem a bit on the high side, but I think
QtDBus adds really specific match rules, so this is probably fine.

Looking at connection :1.7:

org.kde.StatusNotifierWatcher, org.kde.plasmanetworkmanagement,
org.kde.keyboard, org.kde.kded5, org.kde.kcookiejar5, org.kde.apperd
are really all the same connection (you can tell by how they all share
UniqueName = :1.7) and they are the worst problem here. Either kded5
has not been reading its dbus messages from its socket for a long time,
or something has sent it far too many messages, or both.

While I recognise the engineering tradeoffs that lead to bundling
several daemons into one process, this might be more robust (or at
least more debuggable) if it was a group of smaller services.

>          string "UniqueName"
>          variant             string ":1.7"
>          string "OutgoingMessages"
>          variant             uint32 13157884
>          string "OutgoingBytes"
>          variant             uint32 1000000016
>          string "PeakOutgoingBytes"
>          variant             uint32 1000000016

That's a lot. The dbus-daemon *currently* has 13 million messages
totalling 1GB queued to be sent to :1.7, and that's the most there
have ever been at a time. I wonder whether kded has stopped reading
messages (causing them to be stuck in the dbus-daemon until it
disconnects) or whether there is just a massively large backlog that
the dbus-daemon is still working through?

>          string "UniqueName"
>          variant             string ":1.33"

Is there a reason you captured stats for :1.33 (korgac)? It seems to
be a really light D-Bus user - hundreds of bytes, ever, so certainly
non-problematic.

>          string "UniqueName"
>          variant             string ":1.334"
>          string "PeakOutgoingBytes"
>          variant             uint32 7842592

:1.334 is org.kde.kwrite-17177. A peak of 7M waiting to be sent to kwrite
is higher than I would usually expect, but certainly not terrible.

>          string "UniqueName"
>          variant             string ":1.115"
>          string "PeakOutgoingBytes"
>          variant             uint32 7763452

This is org.freedesktop.Notifications aka org.kde.plasmashell aka :1.115.
Again, a peak of 7M waiting to be sent to it is a little high but
certainly not excessive.

    smcv