[Babel-users] babeld slashes kernel route manipulation performance by 17000%

Toke Høiland-Jørgensen toke at toke.dk
Thu Apr 14 23:48:05 BST 2022


Daniel Gröber <dxld at darkboxed.org> writes:

> On Thu, Apr 14, 2022 at 08:58:53PM +0200, Toke Høiland-Jørgensen wrote:
>> Yeah, I do. He's also one of the maintainers of the routing code, so
>> definitely the right person to Cc on this (explicitly Cc'ing maintainers
>> makes sure they see your email as not everyone follows netdev
>> rigorously).
>
> Kk :)

Poked a bit more into the kernel fib code; the more I'm looking at it,
the more I'm convinced that the contention between add and dump is a
fundamental feature of the way the routing table is implemented, so I'm
not so sure it's simply a "bug" that can be "fixed" :(

So I suspect the answer from the kernel side may end up being a somewhat
unsatisfying "well, don't do that, then"... So we may have to resort to
fixing userspace; more below:

>> Ah, okay, that's interesting. Playing around with your examples, on my
>> laptop the performance goes from ~90k/s to ~1k/s when doing just a
>> single 'ip -6 route show table 1337'. The dump itself takes between 5-10
>> seconds, so with the 30-sec interval in babeld I guess the periodic dump
>> can coincide with the update at random.
>> 
>> Side note: why is bird replacing all the routes in the first place? :)
>
> That is a good question, bird should really be able to see that the route
> is already installed and just don't bother. I see this del/add behaviour
> even when bgp is otherwise nice and converged though so I assumed bird is
> just like this.

Hmm, that's odd. What's your "background radiation" (i.e. route updates
per second when Bird is running normally and babeld isn't started? I
just checked my own router (which also imports a full v6 table), and
that churns less than one route per second. So if you're seeing a lot of
churn, maybe it's something in your config that could be fixed?

Alternatively, an option could be to improve Bird's performance when
replacing routes; for one thing, there's this comment in bird's
netlink.c:

  /*
   * We use NL_OP_REPLACE for IPv4, it has an issue with not checking for
   * matching rtm_protocol, but that is OK when dedicated priority is used.
   *
   * We do not use NL_OP_REPLACE for IPv6, as it has broken semantics for ECMP
   * and with some kernel versions ECMP replace crashes kernel. Would need more
   * testing and checks for kernel versions.
   *
   * For IPv6, we use NL_OP_DELETE and then NL_OP_ADD. We also do not trust the
   * old route value, so we do not try to optimize IPv6 ECMP reconfigurations.
   */

So the fact that it's doing del/add for each route is obviously not
optimal when there's a lot of churn, and I doubt that comment is
accurate anymore. There's also the possibility of adding support for
kernel nexthop objects to Bird:

https://lwn.net/Articles/763950/

This should significantly improve route insert/update performance, and
has some nice properties as well when routes are updated (for instance,
in some cases updating the nexthop object instead of each individual
route is possible).

I've been meaning to look into adding nexthop support to Bird anyway, so
this could be a nice occasion to bump that up my list. Don't take that
as a promise, though... :P

> As I said before it always triggers when I (re)start babeld but I can't see
> anything obvious in the log even with debug on as to why. Particularily I
> don't see any bgp state events so the sessions should be fine but for some
> reason it decides to churn everything anyway.

Well, the trigger when starting babeld would be the initial route dump,
I suppose: If you have lots of route churn happening in the background,
the drop in insert performance caused by the dump would be your trigger,
no?

One thing I noticed when playing around with your reproducer example,
which may be something we could apply to the babeld case: If I run 'ip
-6 route show table 1337' I get the slowdown, but if I just run a
regular 'ip -6 route show', I do not. This seems to be because iproute2
is adding the table to the route dump request, which will make the
kernel dump only the requested table. And since the lock that's being
contended is per table, that should nicely get rid of the contention. A
patch to do this is included below (only compile-tested, so no idea if
it'll actually work :)).

>> > I'm currently working on babel ECMP support in bird though maybe I'll
>> > have a stab at RTT after that.
>> 
>> On the subject of ECMP and Babel, you may want to read this thread:
>> https://mailarchive.ietf.org/arch/msg/babel/i4tqsRIL3DS9e22GJ0QuoMef-P0/ 
>> 
>> I.e., it's not just a matter of writing the code we'll also need to
>> define the semantics in the spec. Just so you know what you're getting
>> yourself into ;)
>
> Interesting thread, thanks.
>
> I think for my use-case the loop avoidance point is moot though since I'm
> mainly interested in using this on endpoints, not routers. So perhaps
> calling this ECMP is not the right nomenclature?

Not sure; what are you trying to do, exactly?

-Toke


diff --git a/kernel_netlink.c b/kernel_netlink.c
index efe1243c3b07..36aae29124a5 100644
--- a/kernel_netlink.c
+++ b/kernel_netlink.c
@@ -236,7 +236,7 @@ static int nl_setup = 0;
 static int
 netlink_socket(struct netlink *nl, uint32_t groups)
 {
-    int rc;
+    int rc, strict;
     int rcvsize = 512 * 1024;
 
     nl->sock = socket(PF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
@@ -271,6 +271,9 @@ netlink_socket(struct netlink *nl, uint32_t groups)
             perror("setsockopt(SO_RCVBUF)");
         }
     }
+    rc = setsockopt(nl->sock, SOL_NETLINK, NETLINK_GET_STRICT_CHK, &strict, sizeof(strict));
+    if(rc < 0)
+        goto fail;
 
     rc = bind(nl->sock, (struct sockaddr *)&nl->sockaddr, nl->socklen);
     if(rc < 0)
@@ -1263,9 +1266,9 @@ filter_kernel_routes(struct nlmsghdr *nh, struct kernel_route *route)
 int
 kernel_dump(int operation, struct kernel_filter *filter)
 {
-    int i, rc;
+    int i, j, rc;
     int families[2] = { AF_INET6, AF_INET };
-    struct rtgenmsg g;
+    struct rtmsg rtm;
 
     if(!nl_setup) {
         fprintf(stderr,"kernel_dump: netlink not initialized.\n");
@@ -1284,22 +1287,26 @@ kernel_dump(int operation, struct kernel_filter *filter)
     }
 
     for(i = 0; i < 2; i++) {
-        memset(&g, 0, sizeof(g));
-        g.rtgen_family = families[i];
+        memset(&rtm, 0, sizeof(rtm));
+        rtm.rtm_family = families[i];
         if(operation & CHANGE_ROUTE) {
-            rc = netlink_send_dump(RTM_GETROUTE, &g, sizeof(g));
-            if(rc < 0)
-                return -1;
+            for (j = 0; j < import_table_count; j++) {
+                rtm.rtm_table = import_tables[j];
 
-            rc = netlink_read(&nl_command, NULL, 1, filter);
-            if(rc < 0)
-                return -1;
+                rc = netlink_send_dump(RTM_GETROUTE, &rtm, sizeof(rtm));
+                if(rc < 0)
+                    return -1;
+
+                rc = netlink_read(&nl_command, NULL, 1, filter);
+                if(rc < 0)
+                    return -1;
+            }
         }
 
-        memset(&g, 0, sizeof(g));
-        g.rtgen_family = families[i];
+        memset(&rtm, 0, sizeof(rtm));
+        rtm.rtm_family = families[i];
         if(operation & CHANGE_RULE) {
-            rc = netlink_send_dump(RTM_GETRULE, &g, sizeof(g));
+            rc = netlink_send_dump(RTM_GETRULE, &rtm, sizeof(rtm));
             if(rc < 0)
                 return -1;
 
@@ -1310,9 +1317,9 @@ kernel_dump(int operation, struct kernel_filter *filter)
     }
 
     if(operation & CHANGE_ADDR) {
-        memset(&g, 0, sizeof(g));
-        g.rtgen_family = AF_UNSPEC;
-        rc = netlink_send_dump(RTM_GETADDR, &g, sizeof(g));
+        memset(&rtm, 0, sizeof(rtm));
+        rtm.rtm_family = AF_UNSPEC;
+        rc = netlink_send_dump(RTM_GETADDR, &rtm, sizeof(rtm));
         if(rc < 0)
             return -1;
 



More information about the Babel-users mailing list