[Pkg-utopia-maintainers] Bug#879898: dbus: memory leak in dbus-daemon consumes over 5GB memory

Sat Oct 28 13:32:36 UTC 2017

On Sat, Oct 28, 2017 at 12:22:22AM +0100, Simon McVittie wrote:
> On Fri, 27 Oct 2017 at 13:06:32 -0400, Nicholas D Steeves wrote:
> > Would using memleax help?  I'm 90% done packaging it, in case it looks
> > useful.  Packaged because vagrant can't attach to an existing process.
> 
> From the other information you've given I can tell you with reasonable
> certainty that (at least a large part of) the "leaked" memory is not
> leaked, but is really still in use - the dbus-daemon is genuinely
> trying to process millions of messages to kded5. I can't say why,
> though.

Oh!  My apologies for misunderstanding.  In that case does it
correspond to this upstream bug?  https://bugs.freedesktop.org/show_bug.cgi?id=33606#c11

I'm not sure that it does, because I don't understand how 1.6GB
becomes 6GB without a leak (see below)

> > Is it possible for dbus-daemon to signal to the calling application
> > that a queue is full?
> 
> In principle it might have been if it was defined 10 years ago, but
> there is no protocol defined for communicating that fact, and it isn't
> clear what an application would do about it anyway. Popping up messages
> to annoy the user is generally frowned upon if there is nothing that
> most users can do about it...

Better than a swap storm and eventual mysterious crash, imho.  Also if
a user could more easily identify which component is precipitating the
issue than he/she can file a more useful bug report against the
correct package.  As it stands, it's extra triaging work for the
Debian dbus maintainer[s].

> > Plasma's NM frontend wouldn't reconnect to the
> > network, and I before I realised it I had killed plasmashell ; sleep
> > 3; kstart plasmashell.  The dbus-daemon and kded5 daemon seem to be
> > unaffected, as expected, but maybe this causes problems with dbus?
> 
> Terminating random processes shouldn't be a problem in general (other
> than to those processes and anything that was relying on them), but
> if some component has a bug and has responded to the unexpected
> disappearance by going into a busy-loop, it's possible that the
> busy-loop could result in it sending a flood of messages.
> 
> Or the inability to reconnect to the network might itself have been a
> symptom of a flood of messages, rather than part of the cause.
> 
> However, I will say this: if you have a serious problem (like a key
> desktop component getting stuck), and you respond to it by killing
> individual components rather than by terminating the whole session
> (e.g. logging out and back in), then you are certainly in uncharted
> territory. This is the sort of thing that in principle ought to work,
> but it is never going to be systematically tested, so it is not at all
> surprising if it has or exposes bugs.

I've had to resort to killing individual components, because
KDE5/Plasma Desktop session save/restore is broken.  If I logout or
reboot I lose my session (more specifically a seemingly random
collection of applications sometimes opening old files from old
sessions and sometimes opening no file appear on the first virtual
desktop).  Ksmserver is used for session management and (most? the
successful ones?) applications are restarted using kwrapper5...which
"tries to make the program look like it was actually really started
directly and not via kdeinit"
(https://github.com/KDE/kinit/blob/master/docs/wrapper.txt)

> The dbus 1.11.x development branch (from which a 1.12.0 stable version
> will be released soon) does log this information to syslog or the
> systemd Journal, but 1.10.x didn't.

Shouldn't this have existed since 1.4.10/1.5.2?
https://bugs.freedesktop.org/show_bug.cgi?id=35358

That said, grep -i dbus for files in /var/log (as root)

org.kde.powerdevil.backlighthelper: QDBusArgument: read from a write-only object
# repeats 5 to 15 times each time it occurs
org.kde.kuiserver[20319]: QDBusConnection: session D-Bus connection created before QCoreApplication. Application may misbehave.
# repeats twice
# then more org.kde.powerdevil.backlighthelper repeats
org.kde.kcontrol.kcmsddm: QDBusArgument: read from a write-only object
# repeats six times

That seems very minor compared to millions of messages, but could it
be part of the problem?  I've seen "QDBusConnection: session D-Bus
connection created before QCoreApplication. Application may
misbehave." many times over the years, but never memory usage like this.

> The system bus (which has finite limits) responds to excessive messages in
> an incoming queue by slowing down reading from that sender, and responds
> to excessive messages in an outgoing queue by dropping messages on the
> floor. It might be a good idea for the session bus to have a rather smaller
> limit on its incoming queues than the 1 GB limit that it has now, perhaps
> the 127 MiB that the system bus uses (which tbh could probably also be
> reduced) - then floods of messages would get slowed down, but eventually
> be processed. Individual messages are allowed to exceed that limit by
> up to 1 message, so maximally-sized messages could still be processed
> anyway.

I'm happy to hear there is already rate limiting :-)

> > :1.7.stats:         variant             uint32 1000000016
> 
> That is indeed slightly more than the arbitrary 1 GB limit that the
> session bus has. The limit is conceptually only there to stop byte
> counts overflowing a signed 32-bit integer, because session bus users
> are trusted to not carry out deliberate denial-of-service attacks on each
> other; but I think there would probably be value in turning down the
> incoming message limit a bit to prevent accidental denial-of-service.
> That would throttle reading from very spammy peers, but wouldn't help
> when dealing with peers that read messages slower than they should.

Could dbus-daemon --session send a signal to misbehaving peers...eg:
"you're misbehaving...slow down send or speed up reads" and then
finally, if necessary "you seem to be stuck, do an internal reinit but
maintain connect"?  Do you think dbus-broker (
https://dvdhrm.github.io/rethinking-the-dbus-message-bus/ ) would have
avoided this bug?

> Turning down the *outgoing* message limit is dangerous (it results in
> messages being discarded unrecoverably) so we have to be very cautious
> about reducing it.
> 
> > > dbus-send --session --dest=org.freedesktop.DBus --type=method_call \
> > >     --print-reply /org/freedesktop/DBus \
> > >     org.freedesktop.DBus.Debug.Stats.GetStats
> > 
> > I tried this just now and dbus-send seems to have errored with exit
> > status 130:
> >     Must use org.mydomain.Interface.Method notation, no dot in
> >     "/org/freedesktop/DBus"
> 
> You have an extra argument somewhere, possibly a space after one of the
> backslashes.

Ah, I was missing '--dest=org.freedesktop.DBus'.  I've attached the
output to this email.

> > IIRC, a couple of days ago kded5 consumed 100%
> > of one core for quite some time.  Maybe that's when it was making all
> > of these dbus connections?
> 
> Lots of messages rather than lots of connections, I think, but most
> likely yes.
> 
> > I'm honestly not sure if kded5 is truly at
> > fault, of if it became overwhelmed when an application spammed [one
> > of] its dbus interfaces.
> 
> It could be either one.
> 
> > Can we keep these two bugs unmerged but associate them somehow?
> 
> As far as I know there is no machine-readable way to do this. Just
> mention each one in a mail to the other.

Ok, will do.

> Looking at the dbus-daemon stats:
> 
> >          string "ListMemPoolUsedBytes"
> >          variant             uint32 947456112
> >          string "ListMemPoolCachedBytes"
> >          variant             uint32 3086736
> >          string "ListMemPoolAllocatedBytes"
> >          variant             uint32 1610612352
> 
> The dbus-daemon keeps some memory pools of pre-allocated linked-list links
> and messages ready for reuse, in your case 1.6G of them. If your
> dbus-daemon is really processing millions of messages then this seems
> plausible to be real use rather than a leak: lots of places in dbus-daemon
> have to pre-allocate linked-list links in a very pessimistic way so that
> they can be used in error-recovery without worrying about whether memory
> can be allocated for them, because dbus-daemon has a policy of assuming
> that malloc() can legitimately return NULL.
>
> I have often wondered whether these cache is more harmful than helpful
> in the general case, but the worst-case answer is clearly: yes it's
> more harmful than helpful (because allocated memory doesn't get freed).
> I'm not sure whether that peak memory consumption would ever get returned
> to the OS after the heap had grown in any case, though. Last time I
> experimented with turning off the message cache the answer seemed to be
> that it wasn't.

How does 1.6G of messages become 6GB of memory usage without a memory
leak?  Do you mean 1.6G of active memory and ~4.4G of cache?

> org.kde.StatusNotifierWatcher, org.kde.plasmanetworkmanagement,
> org.kde.keyboard, org.kde.kded5, org.kde.kcookiejar5, org.kde.apperd
> are really all the same connection (you can tell by how they all share
> UniqueName = :1.7) and they are the worst problem here. Either kded5
> has not been reading its dbus messages from its socket for a long time,
> or something has sent it far too many messages, or both.
> 
> While I recognise the engineering tradeoffs that lead to bundling
> several daemons into one process, this might be more robust (or at
> least more debuggable) if it was a group of smaller services.

I agree.

Do you think I should provide full backtraces for all of these as a
bug against "plasma-desktop" or individual bugs for individual
components?

> >          string "UniqueName"
> >          variant             string ":1.7"
> >          string "OutgoingMessages"
> >          variant             uint32 13157884
> >          string "OutgoingBytes"
> >          variant             uint32 1000000016
> >          string "PeakOutgoingBytes"
> >          variant             uint32 1000000016
> 
> That's a lot. The dbus-daemon *currently* has 13 million messages
> totalling 1GB queued to be sent to :1.7, and that's the most there
> have ever been at a time. I wonder whether kded has stopped reading
> messages (causing them to be stuck in the dbus-daemon until it
> disconnects) or whether there is just a massively large backlog that
> the dbus-daemon is still working through?

CPU usage is low and dbus-daemon doesn't seem to be very active.

> >          string "UniqueName"
> >          variant             string ":1.33"
> 
> Is there a reason you captured stats for :1.33 (korgac)? It seems to
> be a really light D-Bus user - hundreds of bytes, ever, so certainly
> non-problematic.

IIRC it was part of the problematic grouping, but I might have
included it out of unconscious bias against.

Thank you,
Nicholas
-------------- next part --------------
method return time=1509194756.775280 sender=org.freedesktop.DBus -> destination=:1.744 serial=3 reply_serial=2
   array [
      dict entry(
         string "Serial"
         variant             uint32 3
      )
      dict entry(
         string "ListMemPoolUsedBytes"
         variant             uint32 947454936
      )
      dict entry(
         string "ListMemPoolCachedBytes"
         variant             uint32 3087912
      )
      dict entry(
         string "ListMemPoolAllocatedBytes"
         variant             uint32 1610612352
      )
      dict entry(
         string "ActiveConnections"
         variant             uint32 76
      )
      dict entry(
         string "IncompleteConnections"
         variant             uint32 0
      )
      dict entry(
         string "MatchRules"
         variant             uint32 1143
      )
      dict entry(
         string "PeakMatchRules"
         variant             uint32 1273
      )
      dict entry(
         string "PeakMatchRulesPerConnection"
         variant             uint32 132
      )
      dict entry(
         string "BusNames"
         variant             uint32 173
      )
      dict entry(
         string "PeakBusNames"
         variant             uint32 191
      )
      dict entry(
         string "PeakBusNamesPerConnection"
         variant             uint32 7
      )
   ]
-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 833 bytes
Desc: not available
URL: <http://lists.alioth.debian.org/pipermail/pkg-utopia-maintainers/attachments/20171028/38845e54/attachment.sig>