[Babel-users] [BUG] Route "deadlocks" under load due to non-atomic kernel route updates

Kirill Smelkov kirr at nexedi.com
Thu Jun 16 11:17:02 UTC 2016


On Wed, Jun 15, 2016 at 12:56:34PM +0200, Juliusz Chroboczek wrote:
> >> If I read you correctly, this looks like a kernel bug: incorrect
> >> invalidation of the route cache.
> 
> [...]
> 
> > What we have here is of another kind - it is inherent race condition
> > inside kernel
> 
> Perhaps I'm confused, but it still looks like a kernel bug to me.

Yes, it is a kernel bug. But it is so old and so widespread that it has
to be worked around in userspace: with atomic route updates we do not
hit it.

Also: atomic route updates are needed not only to avoid this bug.
Another reason: with a routedel & routeadd pair, even if the state of
the cache is correct after routeadd, a packet destined to that route
that reaches the node in the time between del & add hits the
'unreachable' route case.

For usual packets it is only "packet lost" and TCP probably retransmits.
But for SYN packets, i.e. when a connection is being established, an
ICMP error is returned, which results in a "host unreachable" error on
the originator side.
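
To illustrate the difference, here is a minimal sketch of the two ways to
move a route. The helper names and the overridable $IP variable are
hypothetical, only for illustration; the atomic variant relies on
`ip route replace`, which sends a single RTM_NEWROUTE request with
NLM_F_CREATE|NLM_F_REPLACE. I have not checked how IPv6 replace behaves
on every old kernel, so treat it as a sketch, not as what babeld should
do verbatim:

```shell
# IP can be overridden (e.g. with a function that only prints) for a dry run.
IP=${IP:-ip}

# Non-atomic: between del and add there is no /64 route, so lookups fall
# through to the covering unreachable /48 and a SYN gets an ICMP error.
# usage: route_update_nonatomic PREFIX OLD_DEV NEW_DEV
route_update_nonatomic() {
    $IP -6 route del "$1" dev "$2"
    $IP -6 route add "$1" dev "$3"
}

# Atomic: one netlink request, so there is no moment without the route.
# usage: route_update_atomic PREFIX NEW_DEV
route_update_atomic() {
    $IP -6 route replace "$1" dev "$2"
}
```

With the test setup from this thread that would be e.g.
`route_update_atomic 2222:3333:4444:5555::/64 dum0`.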

> Perhaps it would make sense to speak to netdev about that?

Yes, makes sense. Though as this particular case is not present on 4.2+
kernels, people on netdev will probably have less interest in looking
into it.

I will see what can be done.

> > Quagga, at least, switched to atomic updates some time ago, I think.
> > 
> > http://patchwork.quagga.net/patch/1234/
> 
> I see.  I'm busy right now, but I'll be grateful for a patch.

I will see about this. Thanks for the feedback.


On Wed, Jun 15, 2016 at 07:35:05PM -0700, Dave Taht wrote:
> >     https://lab.nexedi.com/kirr/iproute2/blob/bd480e66/t/rtcache-torture
> >     (also attached to this email)
> >
> > which reproduces the problem in several minutes just on one computer and
> > retested it locally: I can reliably reproduce the issue on pristine
> > Debian 3.16.7-ckt25-2 (on both Atom and Core2 notebooks) and on pristine
> > 3.16.35 on Atom (compiled by me, since Debian kernel team has not yet
> > uploaded 3.16.35 to Jessie).
> 
> I have been running this script on four different machines for hours
> now without reproducing your bug on the 4.4 or later kernels. It does
> trigger on a 3.14 kernel. (it helps to do a killall fping6 before
> exiting!)
> 
> It does not seem to be happening on 4.4 or later. At one level, I'm
> relieved - one last babel bug to worry about in openwrt (now 4.4
> based), although one of the platforms I work on is still stuck at
> 3.18, as is the 3.14 c2 (for now).
> 
> At another level I still really, really, really wanted atomic updates
> in babel, and was clearing the decks to make a run at the right
> netlink stuff when I'd decided to confirm your bug existed or not in
> my kernels. :(. Weirdly demotivating.
> 
> 
> d@dancer:~/bin$ ssh root@pi3 uname -a
> Linux pi3 4.4.12-v7+ #892 SMP Thu Jun 2 15:41:19 BST 2016 armv7l GNU/Linux
> d@dancer:~/bin$ ssh root@pi2 uname -a
> Linux pi2 4.4.12-v7+ #892 SMP Thu Jun 2 15:41:19 BST 2016 armv7l GNU/Linux
> d@dancer:~/bin$ uname -a
> Linux dancer 4.5.0-rc7-fqfi #1 SMP PREEMPT Mon Mar 7 16:04:17 PST 2016
> x86_64 x86_64 x86_64 GNU/Linux
> 
> ...
> 
> The odroid C2 has the bug.
> 
> d@dancer:~/bin$ ssh root@c2 uname -a
> Linux c2 3.14.29-56 #1 SMP PREEMPT Wed Apr 20 12:15:54 BRT 2016
> aarch64 aarch64 aarch64 GNU/Linux
> 
> BUG: Got unexpected unreachable route for 2226:3333:4444:5555::1: #
> I'd changed the number
> unreachable 2226:3333:4444:5555::1 from :: dev lo  src fd99::2  metric
> 0 \    cache  error -101
> 
> route table for root 2226:3333:4444::/48
> ---- 8< ----
> unicast 2226:3333:4444:5555::/64 dev dum0  proto boot  scope global  metric 1024
> unreachable 2226:3333:4444::/48 dev lo  proto boot  scope global
> metric 1024  error -101
> ---- 8< ----
> 
> route for 2226:3333:4444:5555::1 (once again)
> unreachable 2226:3333:4444:5555::1 from :: dev lo  src fd99::2  metric
> 0 \    cache  error -101 users 1 used 3

Dave, thanks for confirming and for feedback about this.

Yes, 4.2+ kernels should not have this _particular_ bug, because
https://git.kernel.org/linus/45e4fd26 reworks ip6_pol_route() for the
case tested above so that it does not lock the route table twice and
does not create /128 cache entries on lookup when there is a gateway.
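
For scripts that want to know on which side of that change a machine is,
a rough sketch of a version check follows. `kernel_at_least` is a
hypothetical helper, and comparing `uname -r` is only a heuristic, since
distribution kernels may carry backports:

```shell
# usage: kernel_at_least MAJOR MINOR [RELEASE]
# RELEASE defaults to `uname -r`; only "MAJOR.MINOR" at its start is used.
kernel_at_least() {
    want_major=$1
    want_minor=$2
    rel=${3:-$(uname -r)}
    major=${rel%%.*}            # "3.16.35-mini64" -> "3"
    rest=${rel#*.}              # "3.16.35-mini64" -> "16.35-mini64"
    minor=${rest%%[!0-9]*}      # "16.35-mini64"   -> "16"
    [ "$major" -gt "$want_major" ] ||
        { [ "$major" -eq "$want_major" ] && [ "$minor" -ge "$want_minor" ]; }
}
```

E.g. `kernel_at_least 4 2 || echo "old IPv6 rt cache behaviour likely"`.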

BUT

Route cache for IPv6 is still there in new kernels, and sometimes cache
entries are created. E.g. this happens on PMTU exception, but also for
lookups without gateway when the associated flow has
FLOWI_FLAG_KNOWN_NH set (I don't know yet what it is, but still):

https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/tree/net/ipv6/route.c?id=v4.7-rc3-55-gd325ea8#n1089

etc.

So _related_ problems should still be there. They are probably just
harder to reproduce and happen less often. I have not looked into
further details though...
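
If someone wants to look for such leftovers on a live system, a small
sketch of a filter is below. `unreachable_cache_entries` is a
hypothetical helper, and it assumes the one-line (-o) output format that
appears in the reports in this thread; other iproute2 or kernel versions
may print cache entries differently:

```shell
# Reads `ip -6 -o route show cache` style output on stdin and prints only
# the cached entries that resolve to "unreachable".
unreachable_cache_entries() {
    grep -E '^unreachable .*[[:space:]]cache([[:space:]]|$)' || :
}

# Intended use (not executed here):
#   ip -6 -o route show cache | unreachable_cache_entries
```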

And also: as shown above, it is better to have atomic route updates even
without the cache issues, so that SYNs are not occasionally rejected
during a route update.

So Dave, please keep up your motivation for fixing this if you were
going to eventually do so.

Thanks,
Kirill

P.S.

> (it helps to do a killall fping6 before exiting!)

There is

    trap 'kill $(jobs -p)' EXIT

Does it not work for you?
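
One possible explanation (an assumption, I have not verified it on your
setup): some /bin/sh implementations do not run the EXIT trap when the
script is killed by a signal, and `jobs -p` inside $(...) may come back
empty in a non-interactive shell. A sketch that tracks the PIDs
explicitly instead; `spawn` and `cleanup` are hypothetical helper names,
not part of rtcache-torture:

```shell
bg_pids=""

# Start a command in the background and remember its PID.
spawn() {
    "$@" &
    bg_pids="$bg_pids $!"
}

# Kill everything started via spawn; already-dead PIDs are ignored.
cleanup() {
    for p in $bg_pids; do
        kill "$p" 2>/dev/null || :
    done
    bg_pids=""
}

# In the script one would then write (not executed here):
#   trap cleanup EXIT
#   trap 'exit 1' INT TERM      # turn a signal into a normal exit
#   spawn ping6 -n -f "$taddr"
```

Turning INT/TERM into `exit 1` is what makes the EXIT trap run on Ctrl-C
in shells that otherwise skip it.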


> > It is always the same: the issue reproduces reliably in several minutes.
> > And it looks like e.g.
> >
> >      ----- 8< ----
> >      root@mini:/home/kirr/src/tools/net/iproute2/t# time ./rtcache-torture
> >      PING 2222:3333:4444:5555::1(2222:3333:4444:5555::1) 56 data bytes
> >      E.E.E.....E......E..E............E...E..
> >      <more output from ping>
> >
> >      BUG: Linux mini 3.16.35-mini64 #14 SMP PREEMPT Sun Jun 12 19:41:09 MSK 2016 x86_64 GNU/Linux
> >      BUG: Got unexpected unreachable route for 2222:3333:4444:5555::1:
> >      unreachable 2222:3333:4444:5555::1 from :: dev lo  src 2001:67c:1254:20::1  metric 0 \    cache  error -101
> >
> >      route table for root 2222:3333:4444::/48
> >      ---- 8< ----
> >      unicast 2222:3333:4444:5555::/64 dev dum0  proto boot  scope global  metric 1024
> >      unreachable 2222:3333:4444::/48 dev lo  proto boot  scope global  metric 1024  error -101
> >      ---- 8< ----
> >
> >      route for 2222:3333:4444:5555::1 (once again)
> >      unreachable 2222:3333:4444:5555::1 from :: dev lo  src 2001:67c:1254:20::1  metric 0 \    cache  error -101 users 1 used 4
> >
> >      real    0m49.938s
> >      user    0m4.488s
> >      sys     0m5.872s
> >      ---- 8< ----
> >
> > The issue should not show itself with kernels >= 4.2, because there the
> > lookup procedure does not take table lock twice, and /128 cache entries
> > are not routinely created (they are created only upon PMTU exception).
> >
> > I'm running Debian testing on my development machine. Currently it has
> > 4.5.5-1 (2016-05-29). I can confirm that /128 route cache entries are
> > not created there just because a route was looked up.
> >
> > Kirill
> >
> >
> > ---- 8< ---- (rtcache-torture)
> > #!/bin/sh -e
> > # torture for IPv6 RT cache, trying to hit the race between lookup,cache-add & route add
> > # http://lists.alioth.debian.org/pipermail/babel-users/2016-June/002547.html
> >
> >
> > tprefix=2222:3333:4444      # "whole-network" prefix for tests  /48
> > tsubnet=$tprefix:5555       # subnetwork for which "to" route will be changed   /64
> > taddr=$tsubnet::1           # test address on $tsubnet
> >
> > # setup for tests:
> >
> > # dum0 dummy device
> > ip link del dev dum0 2>/dev/null || :
> > ip link add dum0 type dummy
> > ip link set up dev dum0
> >
> > # clean route table for tprefix with only unreachable whole-network route
> > ip -6 route flush root $tprefix::/48
> > ip -6 route add unreachable $tprefix::/48
> > ip -6 route flush cache
> >
> > ip -6 route add $tsubnet::/64 dev dum0
> >
> >
> > # put a lot of requests to rt/rtcache getting route to $taddr
> > trap 'kill $(jobs -p)' EXIT
> > rtgetter() {
> >     # NOTE we cannot do this with `ip route get ...` in a loop, as `ip route
> >     # get` first takes the RTNL lock, and thus would be completely
> >     # serialized with e.g. route add and del.
> >     #
> >     # Ping, like other usual connect/tx activity, works without RTNL held.
> >     exec ping6 -n -f $taddr
> > }
> > rtgetter &
> >
> > # do route del/route add in busyloop;
> > # after route add: check route get $addr is not unreachable
> > while true; do
> >     ip -6 route del $tsubnet::/64 dev dum0
> >     ip -6 route add $tsubnet::/64 dev dum0
> >     r=`ip -6 -d -o route get $taddr`
> >     if echo "$r" | grep -q unreachable ; then
> >         echo
> >         echo
> >         echo BUG: `uname -a`
> >         echo BUG: Got unexpected unreachable route for $taddr:
> >         echo "$r"
> >         echo
> >         echo "route table for root $tprefix::/48"
> >         echo "---- 8< ----"
> >         ip -6 -d -o route show root $tprefix::/48
> >         echo "---- 8< ----"
> >         echo
> >         echo "route for $taddr (once again)"
> >         ip -6 -d -o -s -s -s route get $taddr
> >         exit 1
> >     fi
> > done
> 
> 
> 
> -- 
> Dave Täht
> Let's go make home routers and wifi faster! With better software!
> http://blog.cerowrt.org


