[Babel-users] babeld slashes kernel route manipulation performance by 17000%

dxld at darkboxed.org dxld at darkboxed.org
Wed Apr 13 21:40:36 BST 2022


Hi Toke,

So this is definitely a kernel bug. I've managed to reproduce it using only
iproute2 commands. The problem seems to be dumping the whole FIB while lots
of individual route modifications are taking place.

First we have to generate some ip-route(1) -batch commands to use. You can
use a bgp route dump I've uploaded or create some synthetic prefixes if you
like:

    $ get_prefixes () { curl https://dxld.net/bgp.prefixes; }
    $ get_prefixes | awk '{ print "route add table 1337 unreachable " $1 }' > add-routes
    $ get_prefixes | awk '{ print "route del table 1337 unreachable " $1; print "route add table 1337 unreachable " $1 }' > change-routes
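If you'd rather not download the dump, synthetic prefixes work just as
well. Here's one way to do it (just a sketch; the `gen_prefixes` name and
the 2001:db8::/32 documentation range are my own choices, not what's in
bgp.prefixes):

```shell
# Emit N (at most 65536) distinct synthetic IPv6 /48 prefixes from the
# 2001:db8::/32 documentation range, one per line, same shape as the
# bgp.prefixes dump.
gen_prefixes () {
    n=${1:-65536}
    i=0
    while [ "$i" -lt "$n" ]; do
        # %x keeps the third hextet compact (0, 1, ... a, b, ...)
        printf '2001:db8:%x::/48\n' "$i"
        i=$((i + 1))
    done
}
gen_prefixes 3
```

Use it in place of `get_prefixes` in the awk pipelines above, e.g.
`gen_prefixes 65536 | awk '{ print "route add table 1337 unreachable " $1 }' > add-routes`.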

To reproduce this I first insert a bunch of routes from that route dump
using ip -batch:

    # ip -batch ./add-routes

Then to simulate what bird is doing I use a version of this dump where
every route is removed and re-added in a loop:

    # while sleep 0.1; do ip -batch ./change-routes; done

While this is going on, monitor route insertion performance using:

    # while sleep 0.1; do { timeout 1 ip -6 monitor route; } | wc -l; done
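To get an average instead of eyeballing the per-second counts, you can pipe
the loop's output through a small awk helper (the `avg` name is mine, just
a sketch):

```shell
# Average the per-second counts printed by the monitor loop above;
# reads one number per line and prints the mean as routes/s.
avg () { awk '{ sum += $1; n++ } END { if (n) printf "%.0f routes/s\n", sum / n }'; }

# Example with hand-picked numbers in the ballpark of my measurements:
printf '9800\n10200\n10000\n' | avg
```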

On my system this shows ~10k routes/s. If we now dump the table while
change-routes is running the performance drops to ~500 routes/s on my
system:

    # while sleep 0.1; do ip -6 route show table 1337 >/dev/null; done

FYI: peeking at `perf top` shows fib6_walk_continue and mutex_spin_on_owner
as the main offenders, and almost all of the CPU time during this test is
spent in the kernel.

--Daniel

PS: To clean up use `ip -6 route flush table 1337`.

On Fri, Apr 08, 2022 at 02:38:43PM +0200, dxld at darkboxed.org wrote:
> On Fri, Apr 08, 2022 at 01:57:01PM +0200, Toke Høiland-Jørgensen wrote:
> > Daniel Gröber <dxld at darkboxed.org> writes:
> > > I'll probably try that tomorrow then.
> > 
> > Alright, let me know how it goes; I can go poking at the kernel, but
> > having a reproducer makes that a lot easier :)
> 
> So I tried ip -batch but it seems it's, um, batching the sendmsg calls too
> much :)
> 
> Bird does a separate sendto call for each route, but iproute2 batches them
> into only ~1k calls for 100k routes, so I can't reproduce the problem
> with that, unfortunately.
> 
> I did do some stracing against babeld with `strace -e raw=all | ts -i '%.s'`
> just to see what the timing of recvmsg calls is and how they vary. It seems
> to me the problem only happens when babeld is exclusively calling recvmsg
> (I assume during kernel_dump()), when it's in a steady state and starts
> calling select() between the recvmsg() calls performance is fine.
> 
> From skimming the code it seems babeld occasionally schedules a full dump
> though, so that might be why the reproducibility is so sporadic.
> 
> Babeld startup seems to be the best chance for a repro. For some reason
> bird also pretty reliably starts churning soon after I restart babeld;
> not sure why, but it makes testing easier so I'll debug that later :)
> 
> I also tried tweaking the iov_len size for recvmsg() in babeld to match
> that of bird, which is quite large, without much change. Lowering the size
> just gave me "message truncated" errors; not sure what's up with that.
> 
> If you want to play along, `while sleep 0.1; do { timeout 1 ip -6 monitor
> route; } | wc -l; done` is what I'm using to monitor the route insertion
> performance now. The {} is load-bearing (for some reason), and it errors
> with "No buffer space available" when lots of churn is going on, but it
> works anyway.
> 
> --Daniel
