[Babel-users] atomic updates, finally in babeld

Dave Taht dave.taht at gmail.com
Tue Oct 9 13:35:26 BST 2018


So, after getting 2 days per year for 6 years to work on that, I finally got 4.

https://github.com/dtaht/babeld/commits/atomic

I (temporarily) ripped out the netlink code and replaced it with
system("ip route whatever") to finally get the state machine right without
having to fiddle with netlink directly.

* can't change the kernel metric
* ip route replace rules are weird in Linux:
    reach -> unreach: requires a del/add
    unreach -> reach: can use a replace
    reach -> reach elsewhere: replace (have not tested interface flipping yet)
    unreach -> unreach: replace (shouldn't happen)

* the semantics of add_route were weird; I changed it to use the
position of the new parameters, rather than gate, so as to unconfuse
myself.
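
As a rough illustration of the transition rules listed above, here is a
minimal sketch in Python (not babeld's C code). The names route_spec and
plan_commands are hypothetical, the command layout simply mirrors the
"ip route" lines logged later in this message, and the sketch assumes a
route for the prefix is already installed. It only builds command
strings; it does not run them.

```python
from typing import List, Optional, Tuple

# (next-hop address, device); None stands for an unreachable route.
Hop = Tuple[str, str]


def route_spec(prefix: str, src: str, hop: Optional[Hop] = None,
               table: int = 254, metric: int = 0, proto: int = 42) -> str:
    """Format the argument tail of an `ip route` command (hypothetical helper)."""
    base = f"{prefix} from {src} table {table} metric {metric}"
    if hop is None:
        return f"unreachable {base} proto {proto}"
    nexthop, dev = hop
    return f"{base} dev {dev} via {nexthop} proto {proto}"


def plan_commands(prefix: str, src: str,
                  old: Optional[Hop], new: Optional[Hop]) -> List[str]:
    """Return the commands for one state change of an installed route."""
    if old is not None and new is None:
        # reach -> unreach: requires a del followed by an add
        return [f"ip route del {route_spec(prefix, src, old)}",
                f"ip route add {route_spec(prefix, src, new)}"]
    # unreach -> reach, reach -> reach elsewhere, unreach -> unreach
    return [f"ip route replace {route_spec(prefix, src, new)}"]


if __name__ == "__main__":
    hop = ("fe80::225:90ff:fec1:6252", "eno1")
    for cmd in plan_commands("fcd8:8fca:2dc0:11::/64", "::/0", hop, None):
        print(cmd)
```

Keeping the decision in one small function makes it easy to check the
four cases in isolation before any netlink code is put back.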

There was at least one other bug.

So, this code was originally done the prior way to avoid conflicting with
"stuck multicast routes", and I don't have (and have never had) a test
case that showed that.

my topology today was:

comcast gw ------------------------ couch gw
              |         |       |
         spaceheater  ceres   dancer

couch gw (mips) filters all this stuff out but still chokes somewhere
spaceheater and ceres are two 12-core boxes running babeld and rtod
dancer is the box with the new "atomic" code
comcast gw is an ARM A15 box (1.8.3 stock right now)

* In a prior attempt 2 years back, I did discover that attempts to
insert/retrieve lots of routes could return EAGAIN, ENOSPC, and a few
other things that may not be checked for in the current code. I haven't
gone back there. But while doing myself in I did get a couple of these:

dancer: ip route add unreachable 172.22.0.172/32 from 0.0.0.0/0 table
254 metric 0 proto 42
netlink_read: recvmsg(): No buffer space available
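
For illustration only, here is a minimal Python sketch of treating such
errors as transient and retrying with a short backoff. The helper name
retry_transient and its parameters are hypothetical, and this is not
what babeld does today. Note that the message text above, "No buffer
space available", is the strerror string for ENOBUFS on Linux, so the
sketch checks for that alongside the EAGAIN and ENOSPC mentioned above.
Also note that ENOBUFS on a netlink socket generally means the receive
buffer overflowed and messages may have been lost, so a real daemon may
need to resynchronize with the kernel rather than just retry.

```python
import errno
import time

# Errors treated as transient in this sketch (an assumption, not babeld policy).
TRANSIENT_ERRNOS = {errno.EAGAIN, errno.ENOBUFS, errno.ENOSPC}


def retry_transient(op, attempts=5, delay=0.01, sleep=time.sleep):
    """Call op() and retry with exponential backoff on transient OSErrors.

    Any other error, or a transient error on the last attempt, is re-raised.
    """
    for attempt in range(attempts):
        try:
            return op()
        except OSError as exc:
            last_try = attempt == attempts - 1
            if exc.errno not in TRANSIENT_ERRNOS or last_try:
                raise
            sleep(delay * (2 ** attempt))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky():
        # Fails twice with ENOBUFS, then succeeds.
        calls["n"] += 1
        if calls["n"] < 3:
            raise OSError(errno.ENOBUFS, "No buffer space available")
        return "ok"

    print(retry_transient(flaky), "after", calls["n"], "calls")
```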


* Another "interesting" thing I notice with the new code: if I inject 1024
identical routes on ceres and spaceheater via "rtod -r 1024 -H test"
at roughly the same time, then elsewhere on the net I end up with
something similar to ECMP.

root@dancer:~/git/babeld-atomic# ip -6 route | grep fe80::225:90ff:fec1:6252 | wc -l
453
root@dancer:~/git/babeld-atomic# ip -6 route | grep fe80::225:90ff:fec2:2aa3 | wc -l
575
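
The per-next-hop counts above can also be tallied in one pass. Below is a
small Python sketch that counts route lines by the address following
"via". The function name count_by_nexthop is hypothetical, and the
prefixes in the sample input are made up for illustration, not captured
output. Unlike grep, it matches only the "via" field rather than the
address anywhere on the line.

```python
from collections import Counter


def count_by_nexthop(route_dump: str) -> Counter:
    """Count route lines per next hop, keyed by the token after 'via'."""
    counts = Counter()
    for line in route_dump.splitlines():
        fields = line.split()
        if "via" in fields:
            idx = fields.index("via")
            if idx + 1 < len(fields):
                counts[fields[idx + 1]] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample in the general shape of `ip -6 route` output.
    sample = "\n".join([
        "fcd8:8fca:2dc0:11::/64 via fe80::225:90ff:fec1:6252 dev eno1 metric 1024",
        "fcd8:8fca:2dc0:12::/64 via fe80::225:90ff:fec2:2aa3 dev eno1 metric 1024",
        "fcd8:8fca:2dc0:13::/64 via fe80::225:90ff:fec1:6252 dev eno1 metric 1024",
        "unreachable fcd8:8fca:2dc0:14::/64 dev lo metric 1024",
    ])
    for nexthop, n in count_by_nexthop(sample).most_common():
        print(nexthop, n)
```

To run it against a live box, the text of "ip -6 route" can be passed in
as the route_dump argument.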

Which is cool... But (?) in the process it also adds the routes, then
later on replaces some of them with ones pointing elsewhere, when I had
figured it would keep the first one it got, as the basic metrics are equal.

Another oddity is that it will batch up dels before unreachables.
This sort of behavior strikes me as having existed before but having
been impossible to see, and it is perhaps the cause of some issues. Or I
missed a state in the state machine, but there's no way (at this low
level) that can happen, I think.

...
ip route del fcd8:8fca:2dc0:3ff::/64 from ::/0 table 254 metric 0 dev
eno1 via fe80::225:90ff:fec1:6252 proto 42
ip route del fcd8:8fca:2dc0::/48 from ::/0 table 254 metric 0 dev eno1
via fe80::225:90ff:fec1:6252 proto 42

ip route add unreachable fcd8:8fca:2dc0:11::/64 from ::/0 table 254
metric 0 proto 42
ip route add unreachable fcd8:8fca:2dc0:12::/64 from ::/0 table 254
metric 0 proto 42

* Injecting and removing this many routes causes a burp in the ARM- and
MIPS-based daemons, and they go unreachable. (Note that I'm doing the
injection on a pair of 12-core machines that do end up spinning a core
in the xroute add code, but what I'm trying to exercise is the route
code on all the other boxes.)

ANYWAY. The most productive 4 days I've had on babel in years. I was
not running bird during this exercise; I will try that. I do wish I could
get more folk blowing things up with rtod. I need to add IPv4 injection
tests to it next.

-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740


