[Babel-users] MTU based routing for tunnel based babel networks?

Mon Jul 17 01:26:35 BST 2023

On Mon, Jul 17, 2023 at 12:47:30AM +0200, Juliusz Chroboczek wrote:
> >> IP does not support variable MTU links.
> > 
> > Excuse me, but that's plain false. IP was designed in an environment where
> > (non-ethernet) networks with various MTU standards were commonplace
> 
> Sorry, I wasn't clear.  IP requires every link to have a well-defined
> MTU: all the nodes connected to a link must agree on the link's MTU.

I don't think that can be true either. PMTU can vary and paths can be
asymmetric, so two nodes could very well see different MTUs across the
internet. There just aren't many ASes that run with less than 1500 MTU :)

Do you have a reference for this "MTU well-definedness" criterion? I don't
think I've ever heard of it.

> > There is a way: My routing protocol just has to stop picking links that are
> > obviously going to cause a problem.
> 
> Could you please describe the problem in detail?  Because I'm probably
> missing something.

Let me try to give some more context:

My mesh network deploys two wg tunnels per node: one wg-over-v6 and one
wg-over-v4 tunnel, to support dualstack, v4-only and v6-only underlay
networks.

Nodes run babel over all wg interfaces and will receive a default route
covering the wg-over-v6 tunnel endpoint addresses. Some nodes are served by
IPv6 routers that are themselves part of the wg mesh network and only have
v6 connectivity via wg-over-v4.

This can cause wg-over-v6 tunnels on such nodes to want to cross a
wg-over-v4 tunnel.

All wg interfaces have MTU 1420 configured, which is the worst case for
wg-over-v6 or v4 (with underlay MTU 1500). In the wg-over-wg-over-v4 case
this results in packets that are too big for the v4 underlay network
(1420 inner + 80 wg-over-v6 overhead + 60 wg-over-v4 overhead = 1560).
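
The arithmetic above can be spelled out with a minimal Python sketch. The
overhead constants assume no IPv4 options or IPv6 extension headers (20 or
40 bytes of IP header, 8 bytes of UDP, 32 bytes of WireGuard data-message
framing), and the function names are made up for illustration only.

```python
# Per-packet overhead of one WireGuard encapsulation layer, in bytes.
# Assumes no IPv4 options and no IPv6 extension headers.
UDP_HEADER = 8
WG_DATA_OVERHEAD = 32  # 16-byte data message header + 16-byte auth tag
IP_HEADER = {"v4": 20, "v6": 40}


def wg_overhead(underlay: str) -> int:
    """Bytes added by one wg tunnel running over the given IP version."""
    return IP_HEADER[underlay] + UDP_HEADER + WG_DATA_OVERHEAD


def outer_packet_size(inner_mtu: int, layers: list[str]) -> int:
    """Size of a full-MTU inner packet after nested encapsulation.

    `layers` lists the underlay IP version of each tunnel, innermost first.
    """
    return inner_mtu + sum(wg_overhead(layer) for layer in layers)


def fits(inner_mtu: int, layers: list[str], underlay_mtu: int = 1500) -> bool:
    """True if a full-sized inner packet fits the underlay unfragmented."""
    return outer_packet_size(inner_mtu, layers) <= underlay_mtu


if __name__ == "__main__":
    # wg-over-v6 carried inside wg-over-v4, as in the problem case.
    print(outer_packet_size(1420, ["v6", "v4"]))
    print(fits(1420, ["v6"]), fits(1420, ["v6", "v4"]))
```

Under the same assumptions, `1500 - wg_overhead("v4")` gives 1440, the
largest inner MTU a single wg-over-v4 layer can carry over a 1500-byte
underlay.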

Wireguard drops packets when they exceed the underlay network's MTU. When
this happens no PTB ICMP errors are generated by wireguard inside the
tunnel, packets are simply dropped and TCP applications running on the
overlay IPv6 network break badly as no ICMP errors reach the sender.

This can be avoided by simply ignoring the wg-over-v6 tunnel, which only
exists for deployment consistency: a wg-over-v4 tunnel with (actual) 1440
MTU is available too and can reach the entire network.

Worth mentioning: the reason I have to run two wg tunnels per node to begin
with is that wireguard's strategy for dual-stack support is that it doesn't
have one. It supports only one endpoint address per tunnel (well, per wg-peer
really) and if you pick wrong because, say, IPv6 addresses are available
but don't work, the tunnel simply blackholes everything. Yay, joy is me.

> If Wireguard implements RFC 4459 Section 3.2, then pushing a too large
> packet over the tunnel, then Wireguard should synthesise an ICMP "packet
> too large", which will cause the sender to retry with a smaller packet.
> Is that not the case?

Yeah, having wg forward PTB errors from the underlay to inside the tunnel
was something I considered for fixing this, but I believe that would be
called "insecure" by the wg project since the ICMP errors aren't signed like
normal wireguard packets. So what happens when an attacker sends a spoofed
PTB with MTU=0 etc.? ;)

Furthermore, on IPv4, which unfortunately is the underlay in my network more
often than not, ICMP blackholes are very common, so breakage could
ensue again.

This really is just putting lipstick on a pig. It would "work" I suppose
but I don't want my network to use these paths because the double
encapsulation is just plain inefficient!

Prune thy inefficient paths I say :]

> I'm not opposed to your probing idea, but I'd really prefer to fully
> understand the problem first.

Sure thing, I'm not opposed to working the problem. I've just been dealing
with this problem (and duct-tape "solutions" surrounding it) for a while now
and I just want to get this squared away so I can go back to my (mostly)
IPv6-only bliss :D

I think RFC4459 simply didn't consider L3 routing protocol based
solutions. Probably since the usual network vendor suspects would never
implement something uncouth like this, but we need not be constrained by
the inefficiencies of the commercial world in the free software community,
now do we :)

Speaking of which, I'm working on a babeld patch to see if my idea
works. Just have to dig through the kernel code first to figure out which
one of the amazingly (badly) named IP_PMTUDISC_* options I want to use to
force the babel socket to neither fragment nor attempt PMTU discovery.
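
For illustration, here is a rough sketch of what setting such an option on a
UDP socket looks like, written in Python rather than babeld's C so it is
easy to try. It assumes Linux: the numeric constants are copied from
<linux/in.h> and <linux/in6.h> because Python's socket module may not export
them. The helper name is made up, and picking the PROBE mode (described in
ip(7) as "set DF but ignore Path MTU") is an assumption about which option
fits, not a tested conclusion.

```python
import socket

# Linux-only values from <linux/in.h> and <linux/in6.h>, hardcoded here
# because not every Python build exports them from the socket module.
IP_MTU_DISCOVER = 10
IPV6_MTU_DISCOVER = 23
PMTUDISC_PROBE = 3  # IP_PMTUDISC_PROBE and IPV6_PMTUDISC_PROBE are both 3


def set_pmtudisc_probe(sock: socket.socket) -> None:
    """Hypothetical helper: select the *_PMTUDISC_PROBE mode on a socket.

    Assumption: PROBE is the mode that neither fragments nor consults the
    cached path MTU; whether it is the right choice for babeld is untested.
    """
    if sock.family == socket.AF_INET6:
        sock.setsockopt(socket.IPPROTO_IPV6, IPV6_MTU_DISCOVER, PMTUDISC_PROBE)
    elif sock.family == socket.AF_INET:
        sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, PMTUDISC_PROBE)
    else:
        raise ValueError(f"unsupported address family: {sock.family!r}")


if __name__ == "__main__":
    # Babel speaks UDP over IPv6, so try it on an IPv6 datagram socket.
    with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as s:
        set_pmtudisc_probe(s)
        print(s.getsockopt(socket.IPPROTO_IPV6, IPV6_MTU_DISCOVER))
```

With a mode like this, an oversized send is expected to fail locally (or be
dropped on the path) instead of being fragmented, which is what a probe
needs in order to detect that a link cannot carry full-sized packets.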

Thanks,
--Daniel