[Pkg-utopia-maintainers] Bug#879898: dbus: memory leak in dbus-daemon consumes over 5GB memory

Sat Oct 28 15:59:18 UTC 2017

On Sat, 28 Oct 2017 at 09:32:36 -0400, Nicholas D Steeves wrote:
> In that case does it
> correspond to this upstream bug?  https://bugs.freedesktop.org/show_bug.cgi?id=33606#c11

Probably, yes. You'll notice we don't know how to solve that bug without
causing data loss...

dbus-broker is claimed to do better by making use of Linux-specific APIs
and having different concepts of how to assign blame for messages,
fast writing, slow reading etc., but I haven't had time to review it,
and it's possible that it achieves this by (accidentally or deliberately)
not providing invariants/guarantees that dbus-daemon does.

> I'm not sure that it does, because I don't understand how 1.6GB
> becomes 6GB without a leak (see below)

I'm not sure either. Something like memleax might tell you. However,
that 1.6GB is only one of several types of memory allocation in
dbus-daemon (1GB of messages is a separate allocation), so this isn't
a definite leak.

> > The dbus 1.11.x development branch (from which a 1.12.0 stable version
> > will be released soon) does log this information to syslog or the
> > systemd Journal, but 1.10.x didn't.
> 
> Shouldn't this have existed since 1.4.10/1.5.2?
> https://bugs.freedesktop.org/show_bug.cgi?id=35358

Hmm, yes, the addition in 1.11.x is logging when we hit any of the other
arbitrary limits.

> org.kde.powerdevil.backlighthelper: QDBusArgument: read from a write-only object

I wouldn't expect this to cause any particular issues.

> org.kde.kuiserver[20319]: QDBusConnection: session D-Bus connection created before QCoreApplication. Application may misbehave.

I don't know what form the misbehaviour would take, but I doubt the answer
is millions of messages.

> Could dbus-daemon --session send a signal to misbehaving peers...eg:
> "you're misbehaving...slow down send or speed up reads" and then
> finally, if necessary "you seem to be stuck, do an internal reinit but
> maintain connect"?  Do you think dbus-broker (
> https://dvdhrm.github.io/rethinking-the-dbus-message-bus/ ) would have
> avoided this bug?

The signal for "slow down sending" is to throttle reading, so that the
messages that are being sent get delayed until the current batch have been
passed through to a recipient. If a sender is sending messages that are
not necessary, then it should not send them ever, regardless of whether
there is memory pressure. If a sender is sending messages that *are*
necessary, then they're necessary, and it can't just not send them...
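
As a rough illustration of that mechanism at the socket level (a sketch
only, not dbus-daemon's actual code; the function name is made up for this
example): when the reading side of a connected socket stops reading, the
kernel's buffers fill up and the writer is eventually refused or stalled.

```python
import socket


def fill_until_blocked(chunk_size=4096):
    """Write to a socket whose peer never reads.

    Returns the number of bytes the kernel accepted before pushing back.
    """
    sender, receiver = socket.socketpair()  # a connected pair of sockets
    sender.setblocking(False)  # raise instead of stalling when buffers fill
    chunk = b"x" * chunk_size
    accepted = 0
    try:
        while True:
            try:
                accepted += sender.send(chunk)
            except BlockingIOError:
                # Buffers are full: a blocking sender would be delayed here
                # until the receiver read something.
                return accepted
    finally:
        sender.close()
        receiver.close()


accepted = fill_until_blocked()
print(f"receiver idle; kernel accepted {accepted} bytes before pushing back")
```

The exact byte count depends on the platform's socket buffer sizes. The
point is only that "not reading" is itself the back-pressure signal; no
extra message is needed.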

There is no signal for "speed up reading". It isn't clear to me that
one would be particularly useful: if an app is not reading fast enough,
it's a sign of either pathological performance issues (a flood of messages
from a sender, or very slow message processing in a recipient),
or a design flaw (blocking the main loop). If there is something useful
for it to do in response to a signal that said "hurry up", it would be
better for it to do that all the time and not pay attention to the signal.
(It also isn't clear to me that sending an extra message is a good way
to deal with a peer that isn't processing its messages fast enough :-)

There is no generic concept of a re-initialization. Higher-level protocols
that are layered over D-Bus generally assume that every message is
delivered and do not cope with messages being discarded, so if there was
any concept of restarting the connection, it would have to be opt-in and
only used by applications whose higher-level D-Bus protocols had been
(re)designed to accommodate it. The only thing we can do with a general
connection is to disconnect it, which is documented to be interpreted as
"end of session, shut down now".

Designing new protocols that require action by library implementations
and/or opt-in by applications is not something that can happen rapidly -
sorry, I only have enough time available for D-Bus work to keep it working
in approximately the same way it does now.

> How does 1.6G of messages become 6GB of memory usage without a memory
> leak?  Do you mean 1.6G of active memory and ~4.4G of cache?

This is 1.6G of linked-list links (overhead of putting things in
lists), not messages. You also have 1G of actual messages queued to
go to kded alone. I could believe there being 3.4G of other stuff
(hash tables? copies of messages? memory fragmentation between/around
messages? etc.).
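
The arithmetic behind those figures can be written out as a short sketch.
All inputs are the rounded numbers quoted in this thread; treating "G" as
decimal gigabytes is an assumption, and the variable names are only for
illustration.

```python
# Rounded figures from this thread; decimal units are an assumption.
GB = 10**9
total = 6.0 * GB            # reported memory use of dbus-daemon
list_links = 1.6 * GB       # linked-list links (list overhead)
queued_to_kded = 1.0 * GB   # messages queued for :1.7
queued_count = 13_000_000   # number of messages in that queue

# Whatever is left over is the "other stuff" that has not been identified.
unaccounted = total - list_links - queued_to_kded
avg_message = queued_to_kded / queued_count

print(f"unaccounted: {unaccounted / GB:.1f} GB")
print(f"average queued message: {avg_message:.0f} bytes")
```

Under those assumptions the queued messages average somewhere around 77
bytes each, so this is a very large number of small messages rather than a
few huge ones.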

> > org.kde.StatusNotifierWatcher, org.kde.plasmanetworkmanagement,
> > org.kde.keyboard, org.kde.kded5, org.kde.kcookiejar5, org.kde.apperd
> > are really all the same connection (you can tell by how they all share
> > UniqueName = :1.7) and they are the worst problem here. Either kded5
> > has not been reading its dbus messages from its socket for a long time,
> > or something has sent it far too many messages, or both.
> > 
> > While I recognise the engineering tradeoffs that lead to bundling
> > several daemons into one process, this might be more robust (or at
> > least more debuggable) if it was a group of smaller services.
> 
> I agree.
> 
> Do you think I should provide full backtraces for all of these as a
> bug against "plasma-desktop" or individual bugs for individual
> components?

The way they're designed at the moment, you cannot separate these
components. The most you could do is to provide a backtrace for the
single process that hosts all these services and is identified on D-Bus
as :1.7. I don't know how useful that backtrace would be: it describes
what is happening right now, whereas the information you need here is
probably what happened some time ago when the root cause occurred.

> > The dbus-daemon *currently* has 13 million messages
> > totalling 1GB queued to be sent to :1.7, and that's the most there
> > have ever been at a time. I wonder whether kded has stopped reading
> > messages
> 
> CPU usage is low and dbus-daemon doesn't seem to be very active.

Then my guess would be that a logic error in kded has caused it to
stop reading from the socket that connects it to the dbus-daemon?

(You can see whether messages are still being sent by running
dbus-monitor. If the steady state is more than a few messages a second
then that's probably bad.)
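
A minimal sketch of measuring that rate, assuming dbus-monitor's default
text output, where each message starts with an unindented header line
naming its type and the arguments follow on indented lines. The helper
names (count_messages, sample_rate) are hypothetical, and the prefix list
is an assumption about that output format.

```python
import subprocess

# Assumption: each message begins with an unindented header line that
# starts with its type; argument lines are indented and so do not match.
MESSAGE_PREFIXES = ("signal ", "method call ", "method return ", "error ")


def count_messages(lines):
    """Count message header lines in dbus-monitor's default text output."""
    return sum(1 for line in lines if line.startswith(MESSAGE_PREFIXES))


def sample_rate(seconds=10.0):
    """Run `dbus-monitor --session` for `seconds`; return messages/second."""
    try:
        done = subprocess.run(
            ["dbus-monitor", "--session"],
            stdout=subprocess.PIPE,
            timeout=seconds,
        )
        output = done.stdout  # dbus-monitor exited by itself before the timeout
    except subprocess.TimeoutExpired as expired:
        output = expired.stdout or b""  # whatever was captured so far
    text = output.decode("utf-8", errors="replace")
    return count_messages(text.splitlines()) / seconds
```

Calling sample_rate(10) on the affected session gives an average over ten
seconds, which can be compared with the "few messages a second" rule of
thumb above. The count may include a couple of messages generated by
dbus-monitor's own connection to the bus.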

    smcv