[Babel-users] Higher CPU usage since 1.9.x leads to instability on slow devices

Sun May 31 20:50:22 BST 2020

Hi,

I have been looking into this as well and produced a flamegraph of 
babeld-1.9.2 running for about 10 minutes. This instance was connected 
to one peer, which is part of the same network Fabian described. The peer 
was restarted once to force some updates.

I uploaded the interactive flamegraph here [1] (you may need to download 
it and open it in a browser to be able to zoom in).

To my eye, conflict_solution() [2] really stands out. From just glancing 
at the code, it looks like every route is compared to every other 
route, which to me sounds like quadratic complexity in the number of 
routes. My guess is that it's even worse, close to cubic complexity, 
since this function is called on route installs in [3].
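
A rough way to see why the cost could approach cubic: if a full pairwise 
scan over the installed routes runs on every install, the comparison count 
is the sum of k² for k = 1..n. The sketch below is only a counting model 
in Python under that assumption. count_comparisons is a hypothetical name, 
and none of this is babeld's actual code.

```python
def count_comparisons(n_routes):
    """Count comparisons when a full pairwise scan runs on every install.

    Assumption (not taken from babeld): each install triggers a scan that
    compares every installed route with every installed route.
    """
    comparisons = 0
    installed = []
    for route in range(n_routes):
        installed.append(route)
        # Hypothetical stand-in for the conflict scan: quadratic per call.
        for _a in installed:
            for _b in installed:
                comparisons += 1
    return comparisons


if __name__ == "__main__":
    for n in (10, 20, 40):
        print(n, count_comparisons(n))
```

Under this model, 10 routes give 385 comparisons and 20 routes give 2870, 
so doubling the table costs about 7.5 times as much.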

git-blame shows that the code is more than 5 years old. So if this is 
indeed our problem, it's not new; we just grew into it.

Cheers,

Johannes

[1] 
https://gist.githubusercontent.com/lemmi/901724d6bc658e0a5a21d8d9fede18d1/raw/f7e1006f4a915a06ab1d88b4295fee7c7e07a28c/babel.svg

[2] 
https://github.com/jech/babeld/blob/d42c1dbdfd6dc836c7b0c8e32657d0be5f0b2b84/disambiguation.c#L170

[3] 
https://github.com/jech/babeld/blob/d42c1dbdfd6dc836c7b0c8e32657d0be5f0b2b84/disambiguation.c#L323

On 31.05.20 13:13, Fabian Bläse wrote:
> Hi,
>
> sorry for the late reply.
>
> On 15.05.20 01:18, Juliusz Chroboczek wrote:
>>> Ever since we have upgraded to babeld 1.9.x, CPU usage is a lot higher
>>> than with 1.8.x. Especially slower devices like embedded MIPS routers
>>> are having trouble keeping up, which leads to route instability due to
>>> late hellos.
>> That sounds pretty bad.
>>
>> In principle, 1.9 should use less CPU than 1.8, since some algorithms have
>> been reworked to be in n·logn instead of n².  On the other hand, 1.9 uses
>> more memory, since there is now a per-neighbour unicast buffer instead of
>> a single buffer for all neighbours; this shouldn't matter much in practice,
>> except if you have hundreds of neighbours.
> I've tried analysing this in more detail using the debug output.
> It looks like the actual route algorithms are not the problem, but the communication with netlink.
>
> If the network gets unstable somewhere upstream, a lot of unreachable routes are sent through the downstream network (one for each route from the upstream network that has now been lost).
> However, the netlink interface seems to be relatively slow, especially on the device we are having a lot of trouble with (TP-Link WDR 3600, Atheros AR9344, 74Kc MIPS 560 MHz).
> Installing all these unreachable routes takes so long that the relatively small socket buffer for Babel messages overflows, because it is not read while route updates are sent to netlink. That leads to lost Babel messages.
>
> That puts babeld into a state it is unlikely to recover from, because changing the state of all the routes in the kernel (reachable, unreachable) always takes so long that new Babel messages are lost.
>
> The issue probably can only be fixed if route updates are not sent to netlink synchronously.
>
> I'm not really sure why this only occurs with babeld >=1.9.0.
> It looks like I got a little confused with version numbers, so I might have tested with versions that still had the IPv4 xroute issue [1].
>
>> Perhaps you could provide us with a CPU profile?
> I don't really know what you mean.
>
> Regards,
> Fabian
>
> [1] https://github.com/jech/babeld/issues/46
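
Fabian's description of the receive buffer overflowing while routes are 
pushed to the kernel synchronously can be illustrated with a toy model. 
The Python sketch below is an assumption-laden simulation, not babeld 
code: simulate, its parameters, and the numbers used are all hypothetical. 
One message arrives per tick, and the buffer is only read when the route 
loop yields.

```python
def simulate(n_routes, ticks_per_route, buffer_size, drain_every=None):
    """Return how many messages are dropped while n_routes are updated.

    Toy model: one message arrives per tick and each route update blocks
    for ticks_per_route ticks. With drain_every=None the buffer is never
    read during the loop (fully synchronous updates); otherwise it is
    emptied after every drain_every route updates.
    """
    buffered = 0
    dropped = 0
    for i in range(1, n_routes + 1):
        for _ in range(ticks_per_route):
            if buffered < buffer_size:
                buffered += 1   # message queued in the socket buffer
            else:
                dropped += 1    # buffer full: message is lost
        if drain_every is not None and i % drain_every == 0:
            buffered = 0        # the daemon reads everything queued so far
    return dropped


if __name__ == "__main__":
    print("synchronous:", simulate(1000, 2, 100))
    print("interleaved:", simulate(1000, 2, 100, drain_every=25))
```

With these made-up numbers, the fully synchronous run drops 1900 of 2000 
messages, while reading the buffer after every 25 route updates drops none. 
Real behaviour depends on the actual netlink latency and buffer sizes.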