[Babel-users] MTU based routing for tunnel based babel networks?

Daniel Gröber dxld at darkboxed.org
Wed Jul 19 15:41:35 BST 2023


Hi,

After some more testing and tcpdumping I have to revise my theory of what's
going on. It turns out Juliusz was right all along (as usual :)

On Mon, Jul 17, 2023 at 02:26:40AM +0200, Daniel Gröber wrote:
> Let me try to give some more context:
> 
> My mesh network deploys two wg tunnels per node. One wg-over-v6 and one
> wg-over-v4 tunnel to support dualstack, v4-only and v6-only underlay
> networks.
> 
> Nodes run babel over all wg interfaces and will receive a default route
> covering the wg-over-v6 tunnel endpoint addresses. Some nodes are served by
> IPv6 routers that are themselves part of the wg mesh network and only have
> v6 connectivity via wg-over-v4.
> 
> This can cause wg-over-v6 tunnels on such nodes to want to cross a
> wg-over-v4 tunnel.
> 
> All wg interfaces have MTU 1420 configured which is the worst case for
> wg-over-v6 or v4 (with MTU 1500). In the wg-over-wg-over-v4 case this
> results in packets that are too big for the v4 underlay network
> (1420+80+60=1560).
> 
> Wireguard drops packets when they exceed the underlay network's MTU.

Not true, wireguard will fragment its UDP packets based on PMTU results if
available in the route cache (ip -6 route show cache). It does this by
setting skb->ignore_df=1.

> When this happens no PTB ICMP errors are generated by wireguard inside
> the tunnel

This is still true, wireguard does not forward ICMP PTB errors from the
endpoint to inside the tunnel, but it doesn't need to since fragmentation
happens on its UDP packets.
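
For reference, the size arithmetic from the quoted paragraph can be
reproduced with a small sketch. The header sizes are the standard ones
(IPv4 20, IPv6 40, UDP 8, plus 32 bytes of wireguard data message header and
auth tag). The function names are made up for illustration and padding is
ignored.

```python
# Per-layer header sizes in bytes.
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
WG_DATA_OVERHEAD = 32  # 16 byte data message header + 16 byte auth tag


def wg_overhead(underlay_family):
    """Bytes added by one wireguard encapsulation over IPv4 (4) or IPv6 (6)."""
    ip_header = IPV6_HEADER if underlay_family == 6 else IPV4_HEADER
    return ip_header + UDP_HEADER + WG_DATA_OVERHEAD


def stacked_size(inner_mtu, underlay_families):
    """Size on the wire of a full-sized inner packet after each encapsulation."""
    size = inner_mtu
    for family in underlay_families:
        size += wg_overhead(family)
    return size


# wg-over-v6 carried inside wg-over-v4, both interfaces at MTU 1420.
size = stacked_size(1420, [6, 4])
print(size, size > 1500)
```

With the numbers above this gives 1420+80+60=1560, which does not fit a
1500 byte underlay without fragmentation.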

Now one would expect wg tunnel stacking to just work, even though
fragmentation of the encapsulated packets is inefficient and still
undesirable. However, it turns out two of my tunnel stacking mitigation
attempts taken together were conspiring against me!

My first approach to fixing the tunnel stacking was to force wireguard
output packets to be sent over the (ethernet) upstream interface only,
using policy routing. This turns out to be ineffective, see below.

On top of this I applied the following nftables rule to prevent wg output
from ever going over interfaces with MTU less than 1500. This was
originally conceived to accommodate workstations rather than "core" routers
but was rolled out on the routers too.

    meta mark 0x1000  meta protocol ip6  rt mtu < 1440 \
        counter reject with icmpx type admin-prohibited \
        comment "wg endpoint loopback prevention"

Note the `rt mtu` match is misnamed and is actually in terms of TCP MSS, so
for IPv6 the threshold corresponds to an MTU of 1440+60=1500 (40 byte IPv6
header plus 20 byte TCP header; the offset depends on the underlying IP
protocol). Fwmark 0x1000 is what the wg tunnels tag their encapsulated
packets with.
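
To make the threshold concrete, here is a rough sketch of how the match
behaves. It assumes `rt mtu` is simply the route MTU minus the minimal IP
and TCP headers, as described above. The helper names are hypothetical and
this only models the comparison, not nftables itself.

```python
TCP_HEADER = 20
IP_HEADER = {4: 20, 6: 40}  # minimal header size per IP version


def route_mss(route_mtu, family):
    """TCP MSS implied by a route MTU (assumed semantics of `rt mtu`)."""
    return route_mtu - IP_HEADER[family] - TCP_HEADER


def rule_matches(route_mtu, family=6, threshold=1440):
    """True if `rt mtu < threshold` would match, i.e. the packet gets rejected."""
    return route_mss(route_mtu, family) < threshold


print(rule_matches(1500))  # plain ethernet route
print(rule_matches(1420))  # route via another wg interface
```

So a marked packet routed out a 1500 byte ethernet link passes, while one
routed into a 1420 byte wg interface is rejected.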

The fatal problem here is that the first mitigation causes the upstream
router to just hairpin the wg packets back at us, since we're (usually) also
announcing the endpoint's prefix via BGP. The fwmark obviously gets stripped
along the way, so the otherwise effective nftables loopback prevention rule
was being bypassed. Doh!

After removing the policy routing bit, stacked tunnels now seem to get
pruned as they should.

Thanks,
--Daniel



