[Babel-users] 64k routes, bird, babel rfc, rtod, etc

Dave Taht dave.taht at gmail.com
Thu Nov 8 17:46:55 GMT 2018


I did a ton of testing on my boat network (4 machines) and the lab
(I've kind of lost track - today it was 7 machines, tomorrow 12)....

* quick summary:

The nlogn branch has a bug: redistribute local deny does not
work. I haven't been able to find the cause.

It also needs a correct patch to message.c (attached).

I think bird is misparsing something.

... but ...

Staging 32k routes each from two boxes, with the nlogn branch plus my
uthash stuff for resend.c, eventually gets to carrying 64k routes (if
you stage 2k injections from rtod over time).
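
The staging idea amounts to feeding routes in batches rather than all
at once. A rough sketch, assuming 2k routes per batch and a fixed pause
between batches; `stage_routes` and the `inject` callback are
placeholders made up for illustration, not rtod's actual interface:

```python
import time

def stage_routes(routes, inject, batch_size=2000, pause=0.0, sleep=time.sleep):
    """Inject `routes` in batches of `batch_size`, pausing between batches.

    `inject` is a hypothetical callback that installs one batch of routes.
    Returns the number of batches sent.
    """
    batches = 0
    for start in range(0, len(routes), batch_size):
        inject(routes[start:start + batch_size])
        batches += 1
        # Only pause if there is another batch still to come.
        if start + batch_size < len(routes):
            sleep(pause)
    return batches
```

A longer `pause` gives the daemons on the receiving boxes more time to
digest each batch before the next one lands.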

I rather like the uthash thing so far. Much better than rbtrees.

* This is an i3 nuc running ubuntu 16....

d@dancer:~/git/rtod$ ip -6 route | grep unreach | wc -l
1
d@dancer:~/git/rtod$ ip -6 route | wc -l
65685

NetworkManager (ubuntu 16) goes nuts and stays nuts until these routes
are finally retracted (all the cpu fans in the room run high, which
is helpful as winter approaches). The daemon falls behind when
churn happens, so it does things like send hellos late and lose the
default route to the main gw.

* an edgerouterx struggles.

odhcpd and dnsmasq (also listening on the kernel netlink socket) eat
25% of cpu each while babel drops packets

root@edgerouterx:~# ip -6 route | wc -l
16561

* My Arm dual core a15 did considerably better

but with odhcpd spinning away it stopped serving dhcp renews...

* Systemd (ubuntu 18) based boxes don't have any other daemon go nuts.

(however those are 12 core machines, I'll have to go try a weaker box).

Also ubuntu 18 (systemd?) uses a default metric of 100 for its dhcp
default route. Historically I've always had to modify
/etc/dhcp/dhcpd.conf to not request a default route.... yea....

* Using:

in ip fc00::/8 ge 8 deny
in src-ip fc00::/8

on my core (mips) routers makes them ignore these rtod routes, and
they motor happily along.
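
To make the first rule concrete: assuming the usual reading of the
babeld filter syntax, `ip fc00::/8` matches any route whose prefix lies
inside fc00::/8, and `ge 8` additionally requires a prefix length of at
least 8. A minimal Python sketch of that match (`matches_rule` and the
sample prefixes are made up for illustration; this is not babeld code):

```python
import ipaddress

def matches_rule(route, rule_prefix="fc00::/8", ge=8):
    """Return True if `route` lies inside `rule_prefix` with prefix length >= ge."""
    net = ipaddress.ip_network(route)
    rule = ipaddress.ip_network(rule_prefix)
    # An IPv4 route can never match an IPv6 rule (and vice versa).
    if net.version != rule.version:
        return False
    return net.subnet_of(rule) and net.prefixlen >= ge

for prefix in ("fc00:1234::/64", "fd00::/8", "2001:db8::/32"):
    print(prefix, "denied" if matches_rule(prefix) else "not matched")
```

Note that fd00::/8 does not match: the rule covers fc00::/8 only, not
the whole fc00::/7 block.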

* When you get to having this many routes you hit other bugs.

root@ceres:~/git/rtod# ip -6 route | wc -l
73528
root@ceres:~/git/rtod# ip -6 route flush proto 50
Failed to send flush request: No such process
Flush terminated
root@ceres:~/git/rtod# ip -6 route flush proto 50
Failed to send flush request: No such process
root@ceres:~/git/rtod# ip -6 route | wc -l
39069
root@ceres:~/git/rtod# ip -6 route flush proto 50
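
A possible workaround is simply to keep re-running the flush until the
table stops shrinking. A hedged sketch of that loop; the helper names
are invented here, and it assumes `ip` is on the path and that proto 50
is the protocol number used for these routes on this box:

```python
import subprocess

def count_routes():
    """Count lines of `ip -6 route` output."""
    out = subprocess.run(["ip", "-6", "route"],
                         capture_output=True, text=True).stdout
    return len(out.splitlines())

def run_flush(proto="50"):
    """Run one flush; the exit status is returned but not trusted."""
    return subprocess.run(["ip", "-6", "route", "flush", "proto", proto],
                          capture_output=True, text=True).returncode

def flush_until_stable(flush=run_flush, count=count_routes, max_attempts=10):
    """Re-run the flush until the route count stops dropping.

    Returns (attempts_used, final_route_count).
    """
    before = count()
    for attempt in range(1, max_attempts + 1):
        flush()
        after = count()
        if after >= before:
            # No progress on this pass: either done, or stuck.
            return attempt, after
        before = after
    return max_attempts, before
```

The last pass is always a no-op that confirms nothing more was removed,
so a clean run takes one more attempt than strictly needed.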

* As for bird

which is running on a weak atom box... peaked at about 32k routes. Limit?

If I fire it up during all this carnage, weird things happen. Whether
that's bird doing something weird, messing with metrics, or the cost
of a route dump, or what, dunno...

I have about 1000 seconds left before the lab network is usable again,
which I'm using to write this email.  :)

** gotta fix this:

Received prefix with no router id.
Couldn't parse packet (8, 12) from fe80::230:18ff:fec9:de9c on eno1.

** the carnage of retracting this many routes was "interesting"

** Probably most importantly... I get "permanent" weirdness
(metrics? misparsing something?). For example, on my 172.22.0.0
network I end up with a route announced and stuck to the bird box
here:

172.22.0.2 via 172.22.0.85 dev eno1 proto babel onlink

and I should probably filter out any announcements in 172.22.0.0/24
ge 24 universally, but!

.2 has redistribute local deny and is not running the nlogn branch
with that bug in it, and this does not happen when babeld is the only
daemon on the network. So I think this is a genuine bird bug.

* 20k routes over mcast over 2.4ghz wifi 1mbit

is *really* disabling to the link. With unicast... I barely noticed it.

* Anyway, in conclusion...

0) Both unicast and mcast in the babel rfc branch work. I have tons of
packet captures; the only weird thing I saw with unicast was over
openwrt wifi, where I periodically and too often (IMHO) get an icmpv6
unreachable message when I should have seen an RA solicit happen
somewhere around it.

1) I'd like it if babeld/bird made absolutely sure hellos went out on
time, no matter how much compute was being used. (it's not just
network manager/odhcpd/dnsmasq but anything else that eats cpu like
(for example) my nas server doing a backup over scp...) It might also
be nice to try to *get* hellos faster. Is it possible to have both a
mcast and a unicast socket on the babel port open at the same time?

2) Only a crazy person should try to make babeld carry more than 8k
routes in its current incarnation on cheapo mips hardware. :) I can
certainly see "staging" and "pacing" and stretching hellos and route
announcement intervals, under load, as a stabler way to get to 64k+
routes, now that it appears we are no longer cpu bound within babel
itself, on slightly higher end hardware, to get that far.
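
One way to picture "stretching ... under load" is an interval that
scales with the size of the route table. This is purely a hypothetical
sketch, not anything babeld does today; the function name, the 8k
comfort threshold (borrowed from the figure above) and the cap are all
assumptions:

```python
def stretched_interval(base_interval, route_count,
                       comfortable_routes=8000, max_factor=8.0):
    """Scale an announcement interval up once the table outgrows a threshold.

    Below `comfortable_routes` the base interval is used unchanged; above
    it the interval grows linearly, capped at `max_factor` times the base.
    """
    factor = max(1.0, route_count / comfortable_routes)
    return base_interval * min(factor, max_factor)
```

With a base of 4 seconds this stays at 4 seconds up to 8k routes and
reaches 32 seconds at 64k routes, where the cap takes over.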

For the record, the bgp route table is about 38k? routes nowadays


* Next steps for me...

go test the src-pref stuff
gotta go climb a tree
build some virtual topologies again
it would be good to have a coherent test suite rather than me
flailing, and simple testing trying to get a usable, observable result
(with sane numbers of routes).

hmac? :puppy dog eyes:

-- 

Dave Täht
CTO, TekLibre, LLC
http://www.teklibre.com
Tel: 1-831-205-9740
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0001-Re-re-re-fix-message.c-ifup-test.patch
Type: text/x-patch
Size: 824 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/babel-users/attachments/20181108/f202b743/attachment.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0003-Increase-maxmaxroutes-to-an-unreasonable-value.patch
Type: text/x-patch
Size: 725 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/babel-users/attachments/20181108/f202b743/attachment-0001.bin>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 0002-Log-late-hellos.patch
Type: text/x-patch
Size: 1042 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/babel-users/attachments/20181108/f202b743/attachment-0002.bin>