[Babel-users] MTU based routing for tunnel based babel networks?

Daniel Gröber dxld at darkboxed.org
Mon Jul 17 21:01:11 BST 2023


On Mon, Jul 17, 2023 at 11:41:01AM +0200, Juliusz Chroboczek wrote:
> >> Sorry, I wasn't clear.  IP requires every link to have a well-defined
> >> MTU: all the nodes connected to a link must agree on the link's MTU.
> 
> > I don't think that can be true either. PMTU can vary and paths can be
> > asymmetric so two nodes could very well see different MTUs across the
> > internet. There's just not many ASen that run with less than 1500 MTU :)
> 
> I'm not speaking about PMTU.  I'm speaking about link MTU.

Yeah, I got that confused. That's what happens when you write technical
emails at 2am ;)

> > Do you have a reference for this "MTU well-definedness" criterion? I
> > don't think I've ever heard of it.
> 
> RFC 2460: "link MTU - the maximum transmission unit, i.e., maximum packet
>            size in octets, that can be conveyed over a link."

I read this as "link MTU" being the maximum packet size that you could ever
hope to be able to send, but the link technology could very well not allow
the maximum at times. Unfortunately they didn't use the usual RFC 2119
requirement-level terminology here, so who knows :)

> RFC 4861: "All nodes on a link must use the same MTU (or Maximum Receive
>            Unit) in order for multicast to work properly."

I mean that only applies when you want to run NDP over the link, so it's
hardly relevant for L3 tunnel interfaces or internet backbone links in
general.

I'm still not sold on your argument, but it hardly matters. Tunnels on top
of the internet exist so we kind of just have to deal with it.

> > Wireguard drops packets when they exceed the underlay network's MTU.
> > When this happens no PTB ICMP errors are generated by wireguard inside
> > the tunnel,
> 
> If true, that's very surprising, and looks to me like a bug in Wireguard.
> 
> But yeah, I'll add an option to probe for MTU on each Hello.

I've been looking at how to implement this probing. The IPV6_MTU_DISCOVER
sockopt used to configure the kernel behaviour unfortunately conflates
multiple behaviours (oh joy). (This list is for IPv6; on v4 the DF bit also
comes into play, but thankfully babel only uses a v6 socket.)

- whether EMSGSIZE is returned to send() when a UDP packet is too big (or
  the packet is simply dropped)
- whether the interface MTU or PMTU result controls the above error
  condition when enabled
- whether UDP send() calls with too large a size are automatically
  fragmented locally or return the error
- whether ICMP PTB messages are interpreted at all (apparently a
  DNS-over-UDP security feature)

That got me wondering: is babeld currently relying on the kernel to
fragment large Update packets? From my reading of the code it doesn't look
like it. If my reading is right, `(struct buffered).size` determines the
maximum UDP payload size, and it is initialized from the interface MTU.

This means we can probably just set IPV6_MTU_DISCOVER to the undocumented
IP_PMTUDISC_INTERFACE[1] to disable PMTU behaviour as completely as
possible. This option 1) prevents local fragmentation of any sort (against
either the interface MTU or the PMTU), 2) disables updating the PMTU cache
from ICMP-PTB messages for this socket, which we don't need anyway, and
3) causes too-big send() calls to fail with EMSGSIZE (if my reading of the
kernel code is right).

[1]: Introduced around 2013, see kernel commits 482fc6094a 93b36cf342
1b34657635 0b95227a7b for the full story.

In principle we could also use the older IP_PMTUDISC_DONT, since we don't
technically have to turn off ICMP-PTB interpretation, but I feel like it's
neater if we disable that too.

--Daniel


