[Babel-users] Anybody else seeing disruption when restarting babeld?

Sat Feb 25 14:52:43 GMT 2023

I have a fantasy about how all routing daemons behave. I hope it is
true of bird, but it isn't true of babeld.

It concerns this pattern:

while(true) {
    do_somework();
}

where, if do_somework() exceeds a deadline (about 4s in the babel
case), bad things start to happen and cascade.

Modern Linux and Windows, at least, have the concept of an interval
timer (timerfd on Linux, waitable timers on Windows), which makes it
easy to see when you are exceeding deadlines and to find a way to
shed_somework(). Babel has, within the protocol, the ability to start
announcing routes at a larger interval, which would be one way to shed
some work. Always ensuring that at least a default route makes it out,
or scrambling the order of the announcements somewhat so that other
limits are not hit, might also help.
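
To make the shed_somework() idea concrete, here is a minimal sketch in
Python (chosen for brevity, not because any daemon is written in it).
It times each pass of the loop with a monotonic clock and doubles the
announcement interval whenever a pass overruns its deadline, then
halves it again once passes fit the budget. The 4s deadline comes from
the figure above; the class name, the 4s base interval, the 64s cap
and the doubling/halving policy are all assumptions for illustration,
not what babeld does.

```python
import time


class AdaptiveInterval:
    """Widen the announcement interval when a loop pass overruns."""

    def __init__(self, base_interval=4.0, deadline=4.0, max_interval=64.0):
        self.base_interval = base_interval  # assumed normal interval, seconds
        self.deadline = deadline            # time budget for one pass, seconds
        self.max_interval = max_interval    # assumed cap on backing off
        self.interval = base_interval

    def record(self, elapsed):
        """Update the interval from the measured duration of one pass."""
        if elapsed > self.deadline:
            # Overran the deadline: shed work by announcing half as often.
            self.interval = min(self.interval * 2, self.max_interval)
        elif self.interval > self.base_interval:
            # Back under budget: recover gradually toward the base interval.
            self.interval = max(self.interval / 2, self.base_interval)
        return self.interval


def run_loop(do_somework, passes, clock=time.monotonic):
    """Run do_somework(interval) `passes` times, timing each pass."""
    pacing = AdaptiveInterval()
    for _ in range(passes):
        start = clock()
        do_somework(pacing.interval)
        pacing.record(clock() - start)
    return pacing.interval
```

The clock is passed in as a parameter so the policy can be exercised
without actually waiting; a real daemon would more likely arm an
interval timer and check it from its event loop.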

Some real-world examples of applications doing this right are the
top utility, and how Netflix probes for more bandwidth (if a given
10-second segment doesn't load in under 4 seconds, they scale back the
resolution).

Recently we hit this problem hard whilst trying to scale LibreQoS down
to sub-10ms sampling intervals.

The second thing that might help some is good ole-fashioned random
exponential backoff
(https://www3.cs.stonybrook.edu/~bender/newpub/2016-BenderFiGi-SODA-energy-backoff.pdf),
not just in packet access but also in kernel access, where under some
loads I see netlink returning ENOBUFS.
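
A rough sketch of what random exponential backoff around an ENOBUFS
failure could look like, again in Python and again under assumptions:
retry_with_backoff, its parameters and the delay values are
hypothetical, and op stands in for whatever call talks to the kernel.
The ceiling on the wait doubles after each failure, and the actual
wait is a random fraction of that ceiling so that retries from
different callers do not line up.

```python
import errno
import random
import time


def retry_with_backoff(op, max_attempts=6, base_delay=0.01, max_delay=1.0,
                       sleep=time.sleep, rng=random.random):
    """Call op(); on ENOBUFS, wait a random, growing time and retry."""
    for attempt in range(max_attempts):
        try:
            return op()
        except OSError as exc:
            # Only ENOBUFS is retried; give up after the last attempt.
            if exc.errno != errno.ENOBUFS or attempt == max_attempts - 1:
                raise
            # The ceiling doubles on each failure, up to max_delay.
            ceiling = min(base_delay * (2 ** attempt), max_delay)
            # Sleep a uniformly random fraction of the ceiling.
            sleep(rng() * ceiling)
```

Any other error is re-raised immediately, since backing off only makes
sense for a transient "try again later" condition.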

I am very happy to see crates for these concepts appearing in Rust,
and do wish more folk fiddling with routing daemons fiddled with my
RTOD tool. It would be a more robust, more smoothly degrading world.