[Babel-users] ECMP on endpoints [was: babeld slashes...]

Sat Apr 16 12:18:58 BST 2022

Hi Toke,

On Fri, Apr 15, 2022 at 10:03:11PM +0200, Toke Høiland-Jørgensen wrote:
> I basically do this for my laptop, sans the MAC authentication (but I
> really ought to get that rolled out as well). Works pretty seamlessly:
> When I plug my laptop into the dock traffic shifts to the wired
> interface, and when it's anywhere else it goes over wireguard. I don't
> bother with Babel on the WiFi network, the wg tunnels go to the same
> router in the building anyway, so there's no noticeable difference...

In my case the tunnels terminate in hosted VMs doing the BGP announcements
so it's better not to route via those. I also like the slightly better
performance due to the larger MTU and, who knows, maybe I'll deploy jumbo
frames in my physical network some time.

But the main reason I want to go via the physical network is so I can tell
when it's broken but my device's tunnels still work. Dogfooding 'ya know :)

> > All I have to do is run one wg tunnel per edge router to my clients
> > (which I already do) and then have babel install a default
> > route/nexthop for each tunnel (the bit I'm working on). Together with
> > RTT metrics and CECMP this could even kick out edge routers where the
> > underlay network path is performing too poorly fully automatically :)
> 
> How do you define "too poorly"? I guess that's the crux of the issue:
> you could just install all feasible routes as ECMP paths, but that would
> potentially give you wildly varying performance for each flow, which I
> would imagine would be a pretty lousy user experience. So what would you
> do instead?

Well, I already have varying performance, but right now it's sticky, so
per-flow variation is strictly an improvement :)

One of the problems I'm trying to solve (using the RTT metric bit) is
the underlay network path having high latency. One of my hosters had pesky
problems with their path getting congested during prime time and the
carrier AS responsible for the link (DTAG) refusing to do anything about
it. However, cost-wise this is the cheapest host per unit of traffic, so it's
still useful to have around most of the time.

In that case this router would just not get selected as a nexthop, or would
get kicked out of the "close enough cost" nexthop set, while it's having
congestion/latency issues.
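
To make the "close enough cost" idea a bit more concrete, here is a rough
Python sketch. It assumes each candidate nexthop has a single metric that
already includes the RTT penalty, and that "close enough" means within a
fixed factor of the best metric. The names (Nexthop, select_nexthops,
tolerance) and the numbers are made up for illustration; this is not
something babeld implements today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Nexthop:
    name: str    # hypothetical tunnel label, e.g. one wg tunnel per edge router
    metric: int  # route metric, assumed to already include any RTT penalty


def select_nexthops(candidates, tolerance=1.25):
    """Return the nexthops whose metric is within `tolerance` times the best one."""
    if not candidates:
        return []
    best = min(nh.metric for nh in candidates)
    limit = best * tolerance
    return [nh for nh in candidates if nh.metric <= limit]


# Illustrative numbers only: wg-c stands in for the hoster whose path
# gets congested during prime time.
quiet = [Nexthop("wg-a", 100), Nexthop("wg-b", 110), Nexthop("wg-c", 120)]
busy = [Nexthop("wg-a", 100), Nexthop("wg-b", 110), Nexthop("wg-c", 400)]

print([nh.name for nh in select_nexthops(quiet)])
print([nh.name for nh in select_nexthops(busy)])
```

With these numbers all three tunnels are in the set while things are quiet,
and wg-c drops out once its metric goes up. A real implementation would
probably also want some hysteresis so a tunnel near the threshold doesn't
flap in and out of the set.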

The other problem is shaky connections from the edge routers to the rest of
the v6 internet but that's a separate and much more difficult problem that
babel can't really help with much.

I'm looking into how to solve that in my case, though. Probably some kind of
flow monitoring system that triggers automatic ping/traceroute via all
upstreams, reroutes locally based on that info and does AS prepending too
for the ingress side.

Haven't quite figured that one out yet :)

--Daniel