[Babel-users] [BUG] Route "deadlocks" under load due to non-atomic kernel route updates
Kirill Smelkov
kirr at nexedi.com
Fri Jun 10 10:34:55 UTC 2016
( resending after subscribing to mailing list )
Hello Babel world.
First of all, let me introduce myself. My name is Kirill, and I'm one of
the people behind the lab.nexedi.com team. As the Babel community probably already knows,
Nexedi is doing overlay networks. The site structure is backend -- frontends;
frontends are located around the world. Backend -- frontend links are IPv6
(Re6st[1] network routed with babeld). Frontends talk to users via plain IPv4.
Most backend traffic is thus IPv6 / Re6st, and in Re6st routes change
regularly because the network topology is readjusted every so often.
We frequently hit a situation where the kernel routes installed by babeld are
all ok, but some /128 kernel route-cache entry, which should be unicast and reachable,
shows up as unreachable and can persist in that state for an hour or
more. When this happens, communication with the corresponding host becomes
impossible. Often the "corresponding host" is one of the frontends (for the backend),
or the backend (for a frontend), so we get frequent site outages.
Together with Julien we traced the problem down to
babeld doing kernel route updates in a non-atomic way.
With this email I just want to make sure people around Babel know there is a
real problem related to non-atomicity in kernel route updates, and that it can cause
not only a short, microsecond-scale period of packet loss, but also a failure
that persists until the route is next updated.
We are not proposing a patch (yet ?).
Please find details below:
Thanks,
Kirill
---- 8< ----
When babeld wants to modify an installed route, kernel_route() is eventually
called with operation=ROUTE_MODIFY. In the current codebase this results in two
consecutive calls to the kernel - the first to remove the route and the second to add a new one:
https://github.com/jech/babeld/blob/babeld-1.7.1-64-g75de8a4/kernel_netlink.c#L965
Let's see what this looks like in ip monitor on a kernel from the latest Debian stable (3.16 series):
(I've patched[2] iproute's monitor to show all route info, including cached routes)
(1) [22:30:00.447894] [RTCAH]- 2001:67c:1254:e:10::1 via fe80::a8b6:97ff:fe99:8ca8 dev re6stnet8 src 2001:67c:1254:e:63::1 metric 0 \ cache users 2 used 1048 age 0sec
(2) [22:30:00.447895] [ROUTE]- 2001:67c:1254:e:10::/80 via fe80::a8b6:97ff:fe99:8ca8 dev re6stnet8 proto babel src 2001:67c:1254:e:63::1 metric 1024
(3) [22:30:00.447897] [ROUTE]+ 2001:67c:1254:e:10::/80 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp proto babel src 2001:67c:1254:e:63::1 metric 1024
(4) [22:30:00.449873] [RTCAH]+ 2001:67c:1254:e:10::1 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp src 2001:67c:1254:e:63::1 metric 0 \ cache users 1
Here the route for 2001:67c:1254:e:10::/80 is changed. There was traffic
(probably an established, active TCP connection) to 2001:67c:1254:e:10::1, so
initially there is a cached entry for 2001:67c:1254:e:10::1, which is first
pruned from the FIB (1) by fib6_del(2001:67c:1254:e:10::/80) (2). Then the new route is
installed (3), and the next time a packet has to be sent to 2001:67c:1254:e:10::1
the corresponding route is looked up and a /128 entry is installed into the cache (4).
This is the good scenario.
Now let me show you how it can go wrong:
(1) [22:32:32.532745] [RTCAH]- 2001:67c:1254:e:10::1 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp src 2001:67c:1254:e:63::1 metric 0 \ cache users 3 used 2372
(2) [22:32:32.532755] [ROUTE]- 2001:67c:1254:e:10::/80 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp proto babel src 2001:67c:1254:e:63::1 metric 1024
(3) [22:32:32.532757] [ROUTE]+ 2001:67c:1254:e:10::/80 via fe80::a8b6:97ff:fe99:8ca8 dev re6stnet8 proto babel src 2001:67c:1254:e:63::1 metric 1024
(4) [22:32:32.532758] [RTCAH]+ unreachable 2001:67c:1254:e:10::1 dev lo metric 0 \ cache error -101 users 1
... a lot of events follow, but neither 2001:67c:1254:e:10::/80 nor 'unreachable 2001:67c:1254:e:10::1' is changed
(5) [22:35:27.441827] [RTCAH]- unreachable 2001:67c:1254:e:10::1 dev lo metric 0 \ cache error -101 used 281 age 2sec
(6) [22:35:27.441834] [ROUTE]- 2001:67c:1254:e:10::/80 via fe80::a8b6:97ff:fe99:8ca8 dev re6stnet8 proto babel src 2001:67c:1254:e:63::1 metric 1024
(7) [22:35:27.441835] [ROUTE]+ 2001:67c:1254:e:10::/80 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp proto babel src 2001:67c:1254:e:63::1 metric 1024
(8) [22:35:28.613731] [RTCAH]+ 2001:67c:1254:e:10::1 via fe80::f485:c1ff:feb4:f32a dev re6stnet-tcp src 2001:67c:1254:e:63::1 metric 0 \ cache users 1
Here the route for 2001:67c:1254:e:10::/80 is changed again. (1) and (2) are the
same as in the good scenario - the cache and real route entries are removed. (3) is
also the same - the new real route is installed. But the next time a packet has to be sent
to 2001:67c:1254:e:10::1, somehow an `unreachable 2001:67c:1254:e:10::1` cache
entry is born, despite the fact that the route just installed for 2001:67c:1254:e:10::/80
is reachable (3), and the route to 2001:67c:1254:e:10::/80 that was just removed was
reachable too (2).
As the comment line and (5) show, the unreachable cached entry for
2001:67c:1254:e:10::1 lived there for some time - 3 minutes in this example.
And `used 281 age 2sec` in (5) shows that the cache entry was in use all that time.
The unreachable cache entry gets removed only when babeld wants to
change the route to 2001:67c:1254:e:10::/80 again (5,6,7,8).
So here we have an outage of 3 minutes - a pretty "short" period, as in practice it
can be 1 hour or more - lasting until the next time babeld wants to reinstall the
route.
So what happens here? To answer, let's first see what the in-kernel code for route
lookup/add/del looks like (with simplified logic, without source-specific
routing aspects, etc):
// route lookup:
// https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/ipv6/route.c?h=v3.16.35-0-ge3c5b88#n916
ip6_pol_route(table, daddr) {
relookup:
    // lookup FIB
    read_lock(table)
    rt = fib6_lookup(table, daddr)
    if (rt==notfound || rt->flags & RTF_CACHE)  // nothing found, or already a cached route
        goto out
    read_unlock(table)

    // for non-cached case: clone rt to a /128 route not _yet_ inserted into table
    nrt = rt6_alloc_clone(rt, daddr)
    rt = nrt

    // put cloned route into FIB
    write_lock(table)
    err = fib6_add(table, rt)
    write_unlock(table)
    if (!err)
        goto out2   // if put ok, i.e. no entry for rt->daddr already exists there - we are done
    goto relookup
out:
    read_unlock(table)
out2:
    return rt
}
( Notice that the table is locked twice, and the cloned route is created "in-flight" in
between, with the table lock not held. That's why there is a check after putting the
clone into the FIB - whether it went ok or not. )
// route add (tails to __ip6_ins_rt):
// https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/ipv6/route.c?h=v3.16.35-0-ge3c5b88#n1464
// https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/ipv6/route.c?h=v3.16.35-0-ge3c5b88#n854
ip6_route_add(table, rt) {
    write_lock(table)
    fib6_add(table, rt)     // also prunes FIB from cloned routes which would match rt, if rt itself is not a cached entry
    write_unlock(table)
}
// route del (tails to __ip6_del_rt):
// https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/ipv6/route.c?h=v3.16.35-0-ge3c5b88#n1708
// https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/net/ipv6/route.c?h=v3.16.35-0-ge3c5b88#n1679
ip6_route_del(table, rt) {
    write_lock(table)
    fib6_del(table, rt)     // also prunes FIB from clones of rt, if rt itself is not a cached entry
    write_unlock(table)
}
Now imagine the following scenario.
We have the following kernel routes installed:
unreachable 2001:67c:1254::/48 dev lo metric 1024 error -113 pref medium
unicast 2001:67c:1254:e:10::/80 via ... dev ...
# probably also something in cache
# ( re6stnet installs an unreachable route for the network-wide prefix (2001:67c:1254::/48 above)
# so that packets for in-re6st network addresses that do not have a route do not go
# to e.g. the default route, which might be connected to a different network. )
T1 (route lookup)                       T2 (babeld)

                                        // babeld starts route update
                                        // for 2001:67c:1254:e:10::/80
                                        -> ip6_route_del(table, 2001:67c:1254:e:10::/80)
                                            write_lock(table)
                                            fib6_del(table, 2001:67c:1254:e:10::/80)
                                            write_unlock(table)

                                        now there is no 2001:67c:1254:e:10::/80 in route entries,
                                        nor is there any /128 clone that would match it in route cache;
                                        in particular there now is no 2001:67c:1254:e:10::1 in cache

// for some reason - e.g. service
// tries to continue TCP session
// route lookup is started
-> ip6_pol_route(table, 2001:67c:1254:e:10::1)
    read_lock(table)
    rt = fib6_lookup(table, 2001:67c:1254:e:10::1)
    // route for 2001:67c:1254:e:10::/80 is not installed
    // but `unreachable 2001:67c:1254::/48` is installed
    // -> rt = unreachable 2001:67c:1254::/48 (non RTF_CACHE) (*)
    read_unlock(table)

    // nrt = unreachable 2001:67c:1254:e:10::1 cache
    // cloned from unreachable 2001:67c:1254::/48
    // nrt not _yet_ in FIB
    nrt = rt6_alloc_clone(rt, daddr)
    rt = nrt

                                        // babeld continues route update
                                        // for 2001:67c:1254:e:10::/80
                                        -> ip6_route_add(table, 2001:67c:1254:e:10::/80)
                                            write_lock(table)
                                            NOTE fib6_add also prunes FIB from cloned routes matching 2001:67c:1254:e:10::/80
                                            BUT there are no clones matching 2001:67c:1254:e:10::/80 in FIB YET --
                                            -- a clone will be put into FIB by T1 only after T2 completes fib6_add and unlocks table
                                            fib6_add(table, 2001:67c:1254:e:10::/80)
                                            write_unlock(table)

    // put cloned route into FIB
    write_lock(table)
    // rt->flags has RTF_CACHE - no clones are pruned by fib6_add
    fib6_add(table, rt)
    write_unlock(table)
Now we have the kernel routes:
unreachable 2001:67c:1254::/48 dev lo metric 1024 error -113 pref medium
unicast 2001:67c:1254:e:10::/80 via ... dev ... # via and dev changed as babeld told kernel
unreachable 2001:67c:1254:e:10::1 cache # <-- !!!
So for communication with 2001:67c:1254:e:10::1 the unreachable route will be
returned (and thus communication will fail), even though there is a valid reachable
kernel route installed for that subnet (unicast 2001:67c:1254:e:10::/80 via ...).
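The interleaving can be replayed deterministically in a toy model. The sketch below is not kernel code: it assumes 32-bit addresses in place of IPv6 (with /16, /24 and /32 standing in for /48, /80 and /128), a fixed-size table, and hypothetical helper names. It only mirrors the lookup/clone/insert and delete/add steps described above, run single-threaded in the problematic order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy model only: 32-bit addresses stand in for IPv6, and /16, /24, /32
 * stand in for the /48, /80 and /128 prefixes from the trace. */
enum { MAX_ROUTES = 8 };

struct route {
    uint32_t prefix;
    int plen;        /* prefix length, 0..32 */
    bool reachable;  /* false models an "unreachable" route */
    bool cached;     /* models RTF_CACHE */
    bool in_use;     /* table slot is occupied */
};

struct fib { struct route r[MAX_ROUTES]; };

static uint32_t mask(int plen) {
    return plen == 0 ? 0 : (uint32_t)(UINT32_MAX << (32 - plen));
}

/* Longest-prefix match over all occupied slots. */
static struct route *fib_lookup(struct fib *f, uint32_t addr) {
    struct route *best = NULL;
    for (size_t i = 0; i < MAX_ROUTES; i++) {
        struct route *rt = &f->r[i];
        if (rt->in_use && (addr & mask(rt->plen)) == rt->prefix &&
            (best == NULL || rt->plen > best->plen))
            best = rt;
    }
    return best;
}

/* Drop cached clones that fall inside prefix/plen. */
static void prune_clones(struct fib *f, uint32_t prefix, int plen) {
    for (size_t i = 0; i < MAX_ROUTES; i++)
        if (f->r[i].in_use && f->r[i].cached &&
            (f->r[i].prefix & mask(plen)) == prefix)
            f->r[i].in_use = false;
}

/* As in the pseudocode: adding a non-cached route prunes matching clones. */
static int fib_add(struct fib *f, struct route rt) {
    if (!rt.cached)
        prune_clones(f, rt.prefix, rt.plen);
    for (size_t i = 0; i < MAX_ROUTES; i++)
        if (f->r[i].in_use && f->r[i].prefix == rt.prefix &&
            f->r[i].plen == rt.plen)
            return -1;  /* entry already exists */
    for (size_t i = 0; i < MAX_ROUTES; i++)
        if (!f->r[i].in_use) {
            f->r[i] = rt;
            f->r[i].in_use = true;
            return 0;
        }
    return -1;  /* table full */
}

static void fib_del(struct fib *f, uint32_t prefix, int plen) {
    for (size_t i = 0; i < MAX_ROUTES; i++)
        if (f->r[i].in_use && !f->r[i].cached &&
            f->r[i].prefix == prefix && f->r[i].plen == plen)
            f->r[i].in_use = false;
    prune_clones(f, prefix, plen);
}

/* First half of the lookup: find a route, then build a host clone that is
 * not yet in the table (the table lock would be dropped at this point). */
static struct route lookup_and_clone(struct fib *f, uint32_t daddr) {
    struct route *rt = fib_lookup(f, daddr);
    assert(rt != NULL);  /* the toy table always has a covering route */
    struct route clone = *rt;
    if (!rt->cached) {
        clone.prefix = daddr;
        clone.plen = 32;
        clone.cached = true;
    }
    return clone;
}

/* Replays the interleaving; returns whether the host ends up reachable. */
bool host_reachable_after_update(bool atomic_replace) {
    const uint32_t net = 0x20010000, sub = 0x20011000, host = 0x20011001;
    struct route unreach = { net, 16, false, false, true };
    struct route unicast = { sub, 24, true, false, true };
    struct fib f;
    memset(&f, 0, sizeof f);
    fib_add(&f, unreach);
    fib_add(&f, unicast);

    fib_del(&f, sub, 24);              /* T2: remove the old route */
    if (atomic_replace)
        fib_add(&f, unicast);          /* T2: add within the same step */
    struct route clone = lookup_and_clone(&f, host);  /* T1: lookup + clone */
    if (!atomic_replace)
        fib_add(&f, unicast);          /* T2: add the new route, too late */
    fib_add(&f, clone);                /* T1: insert the clone */
    return fib_lookup(&f, host)->reachable;
}
```

Calling host_reachable_after_update(false) leaves an unreachable host clone shadowing the reachable subnet route, while host_reachable_after_update(true), where the delete and add are not separated by a lookup, leaves a reachable clone.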
~~~~
If babeld installed routes atomically (probably with NLM_F_REPLACE), there
would be no period of time when the table does not contain a reachable entry for
2001:67c:1254:e:10::/80, and thus the lookup would never fall through to the `unreachable
2001:67c:1254::/48` route. Thus there would be no possibility of creating the
incorrect sticky unreachable route-cache entries shown above.
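As a rough illustration of what such a request might look like at the netlink level, here is a minimal sketch that fills in only the message header and the rtmsg part of a single RTM_NEWROUTE request. It is not babeld's code: `struct route_req` and `init_route_req` are hypothetical names, the route attributes (RTA_DST, RTA_GATEWAY, RTA_OIF, RTA_PRIORITY) and the actual send over a NETLINK_ROUTE socket are left out, and whether replace covers every case babeld needs on a 3.16 kernel is not verified here.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

#ifndef RTPROT_BABEL
#define RTPROT_BABEL 42  /* fallback for older kernel headers */
#endif

/* Hypothetical request layout: header, rtmsg, then room for attributes. */
struct route_req {
    struct nlmsghdr nh;
    struct rtmsg rt;
    char attrs[256];
};

/* Prepare an IPv6 RTM_NEWROUTE request for a prefix of length dst_len.
 * With replace != 0 the kernel is asked to create the route or replace an
 * existing one in a single request (the flags `ip route replace` uses);
 * otherwise the request fails if the route already exists. */
void init_route_req(struct route_req *req, int replace, unsigned char dst_len) {
    assert(dst_len <= 128);
    memset(req, 0, sizeof(*req));
    req->nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
    req->nh.nlmsg_type = RTM_NEWROUTE;
    req->nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK | NLM_F_CREATE |
                          (replace ? NLM_F_REPLACE : NLM_F_EXCL);
    req->rt.rtm_family = AF_INET6;
    req->rt.rtm_dst_len = dst_len;
    req->rt.rtm_table = RT_TABLE_MAIN;
    req->rt.rtm_protocol = RTPROT_BABEL;
    req->rt.rtm_scope = RT_SCOPE_UNIVERSE;
    req->rt.rtm_type = RTN_UNICAST;
    /* Attributes for destination, gateway, interface and metric would be
     * appended after the rtmsg before sending; they are omitted here. */
}
```

The point of the sketch is only the flag combination: one request carrying NLM_F_CREATE | NLM_F_REPLACE instead of an RTM_DELROUTE followed by an RTM_NEWROUTE.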
NOTE Starting with Linux kernel 4.2, in addition to other IPv6 route changes,
the route lookup procedure was reworked: now almost no /128 cache
entries are created, and the route table is not locked twice on lookup:
https://git.kernel.org/linus/45e4fd26
https://code.facebook.com/posts/1123882380960538/linux-ipv6-improvement-routing-cache-on-demand/
However, in contrast to IPv4, the IPv6 route cache is still there and
might show itself in various cases. It feels safer to just do the
route updates atomically on the babeld side.
Also, a lot of people are not on the latest kernels, as e.g. the Debian stable
case shows - even though the latest 3.16.y is 3.16.35, which was released 1 month
ago, it does not, and probably won't, contain the Facebook IPv6 patches, as
they are intrusive and do not fix regressions.
Thanks again,
Kirill
(*) I've patched[3] the kernel to add traces for born/removed cached routes, to verify
whether this is actually the case and to better understand the problem with more info.
It is relatively easy for me to try kernel patches on my notebook, but
unfortunately trying kernel patches or upgrading the kernel itself on a production
system is relatively hard.
[1] http://re6st.net/
[2] https://lab.nexedi.com/kirr/iproute2/commits/x/watchrtcache ,
https://lab.nexedi.com/kirr/iproute2/compare/net-next...69d29071
[3] https://lab.nexedi.com/kirr/linux/commit/a023f92b