[Babel-users] battling with babel and route changes

Dave Taht dave.taht at gmail.com
Wed Jun 22 14:35:53 UTC 2011


Some background on the thread that Juliusz is responding to...

As part of the bufferbloat effort [3] I have been working on analyzing
the behavior of a string of wireless-n routers and clients under load as part of
the 'cerowrt' project [4] - reducing latencies under load being the core
problem we're trying to beat. I have several pieces of fallout from
these analyses. The first, specific issue that Juliusz is responding to
is the one below, from a thread on the bloat-devel list. [5]

The "bufferbloat" problem has an entry on Wikipedia, and there have
also been several good presentations on it by Jim Gettys, the most
recent of which, given at Google, is captured on video [6]. The shortest
description of the problem I've been able to come up with is that
immensely deep, *unmanaged* buffering at all levels of the network
stack is destroying network latency in general, and wireless latency
in particular.

The latest generation of wireless (n) devices is shipping with
enormous buffers in the drivers themselves, and making heroic efforts
to get *all* the packets through with excessive retries, thus leading
to delays often measured in seconds, which can destroy TCP's
congestion control algorithms.

Worse, the default level of buffering in the Linux stack, above the
driver, is a txqueuelen of 1000 packets... and there is no QoS
prioritization by default (not that it would help, with enormous
buffers lying in the device drivers and devices).

Anyway, moving on to this thread...
------snip-----

While I am totally in love with babel as a routing protocol, at the
traffic volumes I'm regularly generating now, it's trivial for
a route change to stop a tcp stream in its tracks. [1] I'm simply not
familiar enough with netlink to fix it, or even know if it can be
fixed...

0) babel keeps all the routing information in its head. It does not
use the kernel metrics. In particular:

1) babel installs ipv4 routes with a metric of 0, ipv6 routes with a
metric of 1024

2) doing a route change via netlink seems to require a different
   metric (or at least it did, several years ago, when the code was
   written). As Juliusz points out earlier on this thread, kernel
   metrics are a misnomer; they are more of a priority than a metric.

The relevant bit of code in babel is in kernel_netlink.c, where
kernel_route() calls itself recursively to handle a route change (as
an add plus a flush), instead of issuing a single route change.

    if(operation == ROUTE_MODIFY) {
        if(newmetric == metric && memcmp(newgate, gate, 16) == 0 &&
           newifindex == ifindex)
            return 0;
        /* It is better to add the new route before removing the old
           one, to avoid losing packets.  However, this only appears
           to work if the metrics are different. */
        if(newmetric != metric) {
            rc = kernel_route(ROUTE_ADD, dest, plen,
                              newgate, newifindex, newmetric,
                              NULL, 0, 0);
            if(rc < 0 && errno != EEXIST)
                return rc;
            rc = kernel_route(ROUTE_FLUSH, dest, plen,
                              gate, ifindex, metric,
                              NULL, 0, 0);
            if(rc < 0 && (errno == ENOENT || errno == ESRCH))
                rc = 1;
        } else {
            rc = kernel_route(ROUTE_FLUSH, dest, plen,
                              gate, ifindex, metric,
                              NULL, 0, 0);
            rc = kernel_route(ROUTE_ADD, dest, plen,
                              newgate, newifindex, newmetric,
                              NULL, 0, 0);
            if(rc < 0 && errno == EEXIST)
                rc = 1;
        }

The problem lies in the else, above.

To disrupt a flow less, babel could:

A) install the new route with a higher metric (i.e. a lower priority)
B) remove the old route
C) install the new route again with the right metric (priority)
D) remove the new route with the higher metric

Instead, right now, when the metrics are equal (which they usually
are), it removes the route entirely, leaving a window where packets
drop (particularly under load), then adds it again going to the right
place.
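
A toy model of the two sequences, for illustration only. This is not
babeld code and does not talk to the kernel: the table, the function
names, and the rule that an add fails when a route with the same metric
already exists (standing in for EEXIST) are all assumptions made for
the sketch. Each function returns how many steps left the destination
with no route at all.

```c
#include <assert.h>
#include <string.h>

#define MAX_ROUTES 8

/* One entry for a single destination: next hop plus kernel "metric".
   Gateways are small non-negative ints in this model. */
struct route { int in_use; int gate; int metric; };
struct table { struct route r[MAX_ROUTES]; };

void table_init(struct table *t) { memset(t, 0, sizeof(*t)); }

/* Add a route.  Assumption: fails if that metric is already in use,
   which stands in for the kernel returning EEXIST. */
int table_add(struct table *t, int gate, int metric)
{
    int i, free_slot = -1;
    for(i = 0; i < MAX_ROUTES; i++) {
        if(t->r[i].in_use && t->r[i].metric == metric)
            return -1;
        if(!t->r[i].in_use && free_slot < 0)
            free_slot = i;
    }
    if(free_slot < 0)
        return -1;
    t->r[free_slot].in_use = 1;
    t->r[free_slot].gate = gate;
    t->r[free_slot].metric = metric;
    return 0;
}

/* Remove the route matching both gateway and metric. */
int table_del(struct table *t, int gate, int metric)
{
    int i;
    for(i = 0; i < MAX_ROUTES; i++) {
        if(t->r[i].in_use && t->r[i].gate == gate &&
           t->r[i].metric == metric) {
            t->r[i].in_use = 0;
            return 0;
        }
    }
    return -1;
}

/* Return the gateway of the lowest-metric route, or -1 if none. */
int table_lookup(const struct table *t)
{
    int i, best = -1;
    assert(t != NULL);
    for(i = 0; i < MAX_ROUTES; i++)
        if(t->r[i].in_use &&
           (best < 0 || t->r[i].metric < t->r[best].metric))
            best = i;
    return best < 0 ? -1 : t->r[best].gate;
}

static int gap(const struct table *t) { return table_lookup(t) < 0; }

/* What the else branch does today: flush, then add. */
int change_flush_then_add(struct table *t, int old_gate, int new_gate,
                          int metric)
{
    int gaps = 0;
    table_del(t, old_gate, metric);     gaps += gap(t);
    table_add(t, new_gate, metric);     gaps += gap(t);
    return gaps;
}

/* Steps A to D, using metric + 1 as the temporary metric. */
int change_make_before_break(struct table *t, int old_gate, int new_gate,
                             int metric)
{
    int gaps = 0;
    table_add(t, new_gate, metric + 1); gaps += gap(t);  /* A */
    table_del(t, old_gate, metric);     gaps += gap(t);  /* B */
    table_add(t, new_gate, metric);     gaps += gap(t);  /* C */
    table_del(t, new_gate, metric + 1); gaps += gap(t);  /* D */
    return gaps;
}
```

Starting from a table that holds one route via gateway 1 at metric 0,
change_flush_then_add(&t, 1, 2, 0) reports one step with no route,
while change_make_before_break(&t, 1, 2, 0) reports none. Both end
with the lookup pointing at gateway 2.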

It SHOULD be easier than this, and perhaps babeld is working around an
old bug in the kernel here, but I don't think so. The above is kind of
crufty, and I hope it's just possible to combine C and D into one
operation.
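
On doing this in one operation: Linux rtnetlink defines an
NLM_F_REPLACE flag for RTM_NEWROUTE requests, and `ip route replace`
sets it together with NLM_F_CREATE. The sketch below only fills in the
message header with those flags. It does not append the RTA_DST,
RTA_GATEWAY or RTA_OIF attributes, does not open a socket, and is not
taken from babeld; struct route_req and build_replace_header() are
names made up here, and RTPROT_STATIC is a placeholder. Whether a
replace of this kind closes the drop window in babeld's equal-metric
case is untested.

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

/* Hypothetical request layout: netlink header followed by the route
   message.  Route attributes would follow in a real request. */
struct route_req {
    struct nlmsghdr nh;
    struct rtmsg rt;
};

/* Fill in the header of a "replace this route" request. */
void build_replace_header(struct route_req *req, unsigned char family,
                          unsigned char plen)
{
    assert(req != NULL);
    memset(req, 0, sizeof(*req));

    req->nh.nlmsg_len = NLMSG_LENGTH(sizeof(struct rtmsg));
    req->nh.nlmsg_type = RTM_NEWROUTE;
    /* CREATE | REPLACE: create the route if missing, else replace it. */
    req->nh.nlmsg_flags = NLM_F_REQUEST | NLM_F_ACK |
                          NLM_F_CREATE | NLM_F_REPLACE;

    req->rt.rtm_family = family;          /* AF_INET or AF_INET6 */
    req->rt.rtm_dst_len = plen;           /* prefix length */
    req->rt.rtm_table = RT_TABLE_MAIN;
    req->rt.rtm_protocol = RTPROT_STATIC; /* placeholder protocol id */
    req->rt.rtm_scope = RT_SCOPE_UNIVERSE;
    req->rt.rtm_type = RTN_UNICAST;
}
```

A caller would run build_replace_header(&req, AF_INET, 24), append the
destination, gateway and interface attributes, bump nlmsg_len
accordingly, and send the result on a NETLINK_ROUTE socket.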

This is not a bug in the protocol but in the implementation.



1: lest you think this is not relevant to bufferbloat... I am
generating saturating loads across two test networks [2], while trying
to get real work done - saving the mice packets - you know, typing,
listening to streaming radio - and fixing bloat wherever it appears -
and I keep stumbling across problems like this one (assuming it's
real) that aren't related to bloat but affect the tests.

2: Test network #1 - 8 wndr3700s and a pc at georgia tech
   Test network #2 - Linux laptop, 2 wndr3700s, 3 nanostation M5s,
   1 android, 1 ipad, 1 iphone, 1 nokia 770, 1 windows XP box,
   1 vista box, and soon - 3 OLPCs.

3: http://www.bufferbloat.net

4: http://www.bufferbloat.net/projects/uberwrt

5: https://lists.bufferbloat.net/pipermail/bloat-devel/
   https://lists.bufferbloat.net/pipermail/bloat/

(Please consider these last three footnotes a blatant plug. There is a
very active #bufferbloat irc channel on freenode as well)

6: Google Video presentation by JG:
http://www.youtube.com/watch?v=qbIozKVz73g&hd=1


-- 
Dave Täht
SKYPE: davetaht
US Tel: 1-239-829-5608
http://the-edge.blogspot.com


