[Babel-users] [BUG] Route "deadlocks" under load due to non-atomic kernel route updates

Fri Jun 10 10:22:54 UTC 2016

Hello Babel world.

First of all, let me please introduce myself. My name is Kirill. I'm one of
the guys behind lab.nexedi.com team. As Babel community probably already knows,
Nexedi is doing overlay networks. The site structure is backend -- frontends;
frontends are located around the world. Backend -- frontend links are IPv6
(Re6st[1] network routed with babeld). Frontends talk to users via plain IPv4.

Most of backend traffic is thus IPv6 / Re6st, and in Re6st routes are changed
regularly because network topology is adjusted regularly once in every while.

We frequently hit the situation, where kernel routes installed by babeld are
all ok, but some /128 kernel route-cache entry, which should be unicast and reachable,
appears there to be unreachable and can persist in such state for an hour or
more. When this happens, communication with corresponding host becomes
impossible.  Frequently "corresponding host" is one of frontends (for backend),
or backend (for a frontend), so we get frequent site outages.

With Julien we traced the problem down to be

    babeld doing kernel route updates in non-atomic way.

With this email I just want to make sure people around Babel know there is a
real problem related to non-atomicity in kernel route updates, and it can be
not only short-microsecond-time packets lost period, but permanent failure.
We are not proposing a patch (yet ?).

Please find details below:

Thanks,
Kirill

---- 8< ----

When Babeld wants to modify installed route, eventually kernel_route() will be
called with operation=ROUTE_MODIFY. In current codebase this results in two
subsequent calls to kernel - first to remove the route and second to add a new one:

https://github.com/jech/babeld/blob/babeld-1.7.1-64-g75de8a4/kernel_netlink.c#L965

Let's see how this looks like in ip monitor on a kernel from latest Debian stable (3.16 series):
(I've patched[2] iproute's monitor to show all route info, including showing cached routes)

 (1) [22:30:00.447894] [RTCAH]- 2001:67c:1254:e:10::1 via fe80::a8b6:97ff:fe99:8ca8 dev re6stnet8  src 2001:67c:1254:e:63::1  metric 0 \    cache  users 2 used 1048 age 0sec
 (2) [22:30:00.447895] [ROUTE]- 2001:67c:1254:e:10::/80 via fe80::a8b6:97ff:fe99:8ca8 dev re6stnet8  proto babel  src 2001:67c:1254:e:63::1  metric 1024
 (3) [22:30:00.447897] [ROUTE]+ 2001:67c:1254:e:10::/80 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp  proto babel  src 2001:67c:1254:e:63::1  metric 1024
 (4) [22:30:00.449873] [RTCAH]+ 2001:67c:1254:e:10::1 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp  src 2001:67c:1254:e:63::1  metric 0 \    cache  users 1

Here a route for 2001:67c:1254:e:10::/80 is changed. There was a traffic
(probably established active TCP connection) to 2001:67c:1254:e:10::1, so
initially there is a cached entry for 2001:67c:1254:e:10::1, which is first
pruned from FIB (1) by fib6_del(2001:67c:1254:e:10::/80) (2). Then new route is
installed (3) and on next need to send packet to 2001:67c:1254:e:10::1
corresponding route is looked up and /128 entry is installed into cache (4).

This is a good scenario.

Now let me show you how it can go wrong:

 (1) [22:32:32.532745] [RTCAH]- 2001:67c:1254:e:10::1 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp  src 2001:67c:1254:e:63::1  metric 0 \    cache  users 3 used 2372
 (2) [22:32:32.532755] [ROUTE]- 2001:67c:1254:e:10::/80 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp  proto babel  src 2001:67c:1254:e:63::1  metric 1024
 (3) [22:32:32.532757] [ROUTE]+ 2001:67c:1254:e:10::/80 via fe80::a8b6:97ff:fe99:8ca8 dev re6stnet8  proto babel  src 2001:67c:1254:e:63::1  metric 1024
 (4) [22:32:32.532758] [RTCAH]+ unreachable 2001:67c:1254:e:10::1 dev lo  metric 0 \    cache  error -101 users 1

 ... a lot but neither 2001:67c:1254:e:10::/80 nor 'unreachable 2001:67c:1254:e:10::1' are changed

 (5) [22:35:27.441827] [RTCAH]- unreachable 2001:67c:1254:e:10::1 dev lo  metric 0 \    cache  error -101 used 281 age 2sec
 (6) [22:35:27.441834] [ROUTE]- 2001:67c:1254:e:10::/80 via fe80::a8b6:97ff:fe99:8ca8 dev re6stnet8  proto babel  src 2001:67c:1254:e:63::1  metric 1024
 (7) [22:35:27.441835] [ROUTE]+ 2001:67c:1254:e:10::/80 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp  proto babel  src 2001:67c:1254:e:63::1  metric 1024
 (8) [22:35:28.613731] [RTCAH]+ 2001:67c:1254:e:10::1 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp  src 2001:67c:1254:e:63::1  metric 0 \    cache  users 1

Here a route for 2001:67c:1254:e:10::/80 is changed again. (1) and (2) are the
same as in good scenario - cache and real route entries are removed. (3) is
also the same - new real route is installed. But on next need to send a packet
to 2001:67c:1254:e:10::1, somehow `unreachable 2001:67c:1254:e:10::1` cache
entry is born, despite the fact that just installed for 2001:67c:1254:e:10::/80
route is reachable (3), and just removed route to 2001:67c:1254:e:10::/80 was
reachable too (2).

As comments and (5) shows the unreachable cached entry for
2001:67c:1254:e:10::1 lived there for some time - 3 minutes in this example.
And `used 281 age 2sec` in (5) shows that cache entry was used all the time.
The unreachable cache entry gets removed only at time where babeld wants to
change route to 2001:67c:1254:e:10::/80 again (5,6,7,8)

So here we have outage for 3 minutes - pretty "short" period, as in practice it
can be 1 hour and more - all until the time babeld wants to reinstall the route
next time.

So what happens here? To answer, let's first see how in-kernel code for route
lookup/add/del looks like (with simplified logic, without source-specific
routing aspects, etc):

  // route lookup:
  // https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/ipv6/route.c?h=v3.16.35-0-ge3c5b88#n916
  ip_pol_route(table, daddr) {
    relookup:
      // lookup FIB
      read_lock(table)
      rt = fib6_lookup(table, daddr)
      if (rt==notfound || rt->flags & RTF_CACHE)	// cached route
          goto out
      read_unlock(table)

      // for non-cached case: clone rt to a /128 route not _yet_ inserted into table
      nrt = rt6_alloc_clone(rt, daddr)
      rt = nrt

      // put cloned route into FIB
      write_lock(table)
      err = fib6_add(table, rt)
      write_unlock(table)
      if (!err)
         goto out2      // if put ok, i.e. no entry for rt->daddr already exists there - we are done

      goto relookup

    out:
      read_unlock(table)
    out2:
      return rt
  }

( notice the table is locked twice and a clone route is created "in-flight" in
  between with table lock not held. That's why there is a check after putting
  clone into fib - was it ok or not. )

  // route add (tails to __ip6_route_ins):
  // https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/ipv6/route.c?h=v3.16.35-0-ge3c5b88#n1464
  // https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/ipv6/route.c?h=v3.16.35-0-ge3c5b88#n854
  ip6_route_add(table, rt) {
      write_lock(table)
      fib6_add(table, rt)     // also prunes FIB from cloned routes which would match rt, if rt itself is not a cached entry
      write_unlock(table)
  }

  // route_del (tails to __ip6_del_rt):
  // https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/ipv6/route.c?h=v3.16.35-0-ge3c5b88#n1708
  // https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/ipv6/route.c?h=v3.16.35-0-ge3c5b88#n1679
  __ip6_route_del(table, rt) {
      write_lock(table)
      fib6_del(table, rt)     // also prunes FIB from clones of rt, if rt itself is not a cached entry
      write_unlock(table)
  }

now imagine the following scenario:

we have the following kernel routes installed:

    unreachable 2001:67c:1254::/48 dev lo  metric 1024  error -113 pref medium
    unicast 2001:67c:1254:e:10::/80 via ... dev ...
    # probably also something in cache

    # ( re6stnet installs unreachable network-wide prefix (2001:67c:1254::/48 above) e.g.
    #   so that packets for in-re6st network addresses, that do not have a route, do not go
    #   to e.g. default route which might be connected to different network. )

        T1                                                                  T2

                                                                    // babeld starts route update
                                                                    // for 2001:67c:1254:e:10::/80
                                                                    -> ip6_route_del(table, 2001:67c:1254:e:10::/80)
                                                                    write_lock(table)
                                                                    fib6_del(table, 2001:67c:1254:e:10::/80)
                                                                    write_unlock(table)

                                now there is no 2001:67c:1254:e:10::/80 in route entries
                                nor there is any /128 clone that would match it in route cache
                                in particular there now is no 2001:67c:1254:e:10::1 in cache

  // for some reason - e.g. service
  // tries to continue TCP session
  // route lookup is started
  -> ip_pol_route(table, 2001:67c:1254:e:10::1)
  read_lock(table)
  rt = fib6_lookup(table, 2001:67c:1254:e:10::1)
  // route for 2001:67c:1254:e:10::/80 is not installed
  // but `unreachable 2001:67c:1254::/48` is installed
  // -> rt = unreachable 2001:67c:1254::/48 (non RTF_CACHE) (*)
  read_unlock(table)

  // nrt = unreachable 2001:67c:1254:e:10::1 cache
  // cloned from unreachable 2001:67c:1254::/48
  // nrt not _yet_ in FIB
  nrt = rt6_alloc_clone(rt, daddr)
  rt = nrt

                                                                    // babeld continues route update
                                                                    // for 2001:67c:1254:e:10::/80
                                                                    -> ip6_route_add(table, 2001:67c:1254:e:10::/80)
                                                                    write_lock(table)

                                NOTE fib6_add also prunes FIB from cloned routes matching 2001:67c:1254:e:10::/80
                                BUT there is no clones matching 2001:67c:1254:e:10::/80 in FIB YET --
                                -- a clone will be put into FIB by T1 only after T2 completes fib6_add and unlocks table

                                                                    fib6_add(table, 2001:67c:1254:e:10::/80)
                                                                    write_unlock(table)

  // put cloned route into FIB
  write_lock(table)
  // rt->flags has RTF_CACHE - no clones are pruned by fib6_add
  fib6_add(table, rt)
  write_unlock(table)

now we have kernel routes:

    unreachable 2001:67c:1254::/48 dev lo  metric 1024  error -113 pref medium
    unicast 2001:67c:1254:e:10::/80 via ... dev ...         # via and dev changed as babeld told kernel
    unreachable 2001:67c:1254:e:10::1 cache                 # <-- !!!

so for communications with 2001:67c:1254:e:10::1 unreachable route will be
returned (and thus communication will fail) even though there is valid reachable
kernel route installed for that subnet (unicast 2001:67c:1254:e:10::/80 via ...)

~~~~

If babeld would install routes atomically (probably with NLM_F_REPLACE), there
won't be a period of time when table does not contain reachable entry for
2001:67c:1254:e:10::/80, and thus lookup would never go to `unreachable
2001:67c:1254::/48` route. Thus there won't be a possibility to create
above-shown incorrect unreachable sticky route-cache entries.

NOTE Starting from 4.2 Linux kernel, in addition to other IPv6 route changes,
     route lookup procedure was reworked, and now there is almost no /128 cache
     entries created and route table is not locked twice on lookup:

     https://git.kernel.org/linus/45e4fd26
     https://code.facebook.com/posts/1123882380960538/linux-ipv6-improvement-routing-cache-on-demand/

     However, in contrast to IPv4, for IPv6 route cache is still there and
     might show itself in various cases. It feels more confident to just do the
     route updates in atomic way on babeld side.

     Also a lot of people are not on latest kernels, like e.g. Debian stable
     case shows - even though latest 3.16.y is 3.16.35 which was released 1 month
     ago, it does not, and probably won't contain Facebook IPv6 patches, as
     they are intrusive and do not fix regressions.

Thanks again,
Kirill

(*) I've patched[3] the kernel to add traces for born/removed cached routes to verify
    whether it is actually the case and to try to better understand the problem with more info.

    It is relatively easy for me to try kernel patches on my notebook, but
    unfortunately trying kernel patches or upgrading kernel itself on production
    system is relatively hard.

[1] http://re6st.net/
[2] https://lab.nexedi.com/kirr/iproute2/commits/x/watchrtcache ,
    https://lab.nexedi.com/kirr/iproute2/compare/net-next...69d29071
[3] https://lab.nexedi.com/kirr/linux/commit/a023f92b