[Babel-users] Babel MAC auth fails due to packet reordering

Wed May 4 15:28:07 BST 2022

Juliusz Chroboczek <jch at irif.fr> writes:

>> > 1. Are you running with the "unicast" option set in your config file?
>> 
>> Aaah turns out I was, because I had it set in `default` for my wireguard
>> links.
>> 
>> Adding `interface enpxx unicast false` magically fixes this. According to
>> the docs this skips sending a duplicate hello to neighbours which would
>> explain why it works.
>
> Not quite.  Babel has three kinds of TLVs:
>
>   - Discovery Hellos, which are always sent over multicast;
>   - requests, which are always sent over unicast;
>   - the bulk of the protocol, which may be sent either over multicast or
>     unicast.
>
> The unicast option controls whether the bulk of the protocol is sent over
> multicast (unicast off) or sent to each peer over unicast (unicast on).
> In your case, Babel was sending
>
>   Hello multicast
>   IHU unicast
>
> The Hello and the IHU were getting reordered, so the Hello was getting
> dropped due to an incorrect packet counter value.  With "unicast false",
> Babel is sending a single aggregated Hello+IHU over multicast, so no
> reordering can happen.
>
> We still need to understand why you're getting systematic packet
> reordering.  If it's something that cannot be avoided, then we will need
> to update the HMAC implementation (and spec!) to maintain a replay window,
> in the style of
>
>   https://datatracker.ietf.org/doc/html/rfc4303#section-3.4.3
>
>> Where are you getting the 200ms number from exactly?
>
> Here:
>
>   10:24:31.056310 IP6 fe80::1.6696 > fe80::c23c:59ff:fe4a:ce46.6696
>   Router Id 02:0d:b9:ff:fe:4e:90:54
>   Update/prefix...
>   PC value 57567 index len 8
>
>   10:24:31.257774 IP6 fe80::1.6696 > ff02::1:6.6696
>   Hello seqno 36696 interval 4.00s
>   PC value 57566 index len 87
>
> It's suspiciously close to 200ms (201.5ms exactly).  Toke, you're the
> world specialist of the Linux WiFi stack -- do you see a hardwired 200ms
> delay somewhere?
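
To make the failure mode in the capture above concrete, here is a minimal
sketch in Python. It assumes the receiver accepts a packet only when its PC
is strictly greater than the last accepted one, and compares that with a
sliding replay window in the style of RFC 4303 section 3.4.3. The class
names and the window size of 64 are illustrative assumptions, not taken
from babeld; index changes and PC wraparound are ignored.

```python
class StrictCounter:
    """Accept a packet only if its PC exceeds the last accepted PC."""

    def __init__(self):
        self.last = None

    def accept(self, pc):
        if self.last is not None and pc <= self.last:
            return False          # late or replayed packet: dropped
        self.last = pc
        return True


class ReplayWindow:
    """Sliding-window check in the style of RFC 4303, section 3.4.3.

    Assumption: a real receiver would verify the MAC first and only
    then update this state.
    """

    def __init__(self, size=64):
        self.size = size
        self.top = None           # highest PC accepted so far
        self.bitmap = 0           # bit i set => PC (top - i) already seen

    def accept(self, pc):
        if self.top is None or pc > self.top:
            # New highest PC: slide the window forward and mark it seen.
            shift = 0 if self.top is None else pc - self.top
            mask = (1 << self.size) - 1
            self.bitmap = ((self.bitmap << shift) | 1) & mask
            self.top = pc
            return True
        offset = self.top - pc
        if offset >= self.size:
            return False          # too old, fell out of the window
        if self.bitmap & (1 << offset):
            return False          # already seen: a replay
        self.bitmap |= 1 << offset
        return True               # late but new: accepted


# Arrival order from the capture: the Update (PC 57567) arrives
# before the Hello (PC 57566).
arrivals = [57567, 57566]
strict = StrictCounter()
window = ReplayWindow()
strict_result = [strict.accept(pc) for pc in arrivals]
window_result = [window.accept(pc) for pc in arrivals]
replay_result = window.accept(57566)   # same PC presented a second time
```

Under these assumptions the strict check drops the late Hello, while the
window accepts it once and still rejects a second copy with the same PC.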

Hmm, well, as Dave mentions, the default beacon interval is 100 ms, and
the default DTIM interval is 2, which would mean that a station that is
sleeping would only wake up every 200ms. So the gap you measured *could*
correspond to that wake-up interval; however, 200ms is more of an upper
bound on the buffering delay, so I think it's mostly a coincidence that
the gap is this close to 200ms in this case.
Indeed, in the pcap file there's another reordering (packets 17 and 18)
with only 30ms in-between.
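
For reference, the arithmetic behind these numbers can be sketched in
Python as follows. The timestamps are the two from the capture quoted
above; the 100 ms beacon interval and DTIM interval of 2 are the defaults
mentioned in this thread, taken at face value rather than measured.

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"

# Timestamps of the two reordered packets in the quoted capture.
update_ts = datetime.strptime("10:24:31.056310", FMT)   # PC 57567
hello_ts = datetime.strptime("10:24:31.257774", FMT)    # PC 57566

# Observed gap between the Update and the late Hello, in milliseconds.
gap_ms = (hello_ts - update_ts).total_seconds() * 1000

# Upper bound on how long a frame waits for a sleeping station,
# assuming the defaults quoted in this thread.
BEACON_INTERVAL_MS = 100
DTIM_PERIOD = 2
max_wait_ms = BEACON_INTERVAL_MS * DTIM_PERIOD

print(f"observed gap: {gap_ms:.3f} ms, DTIM upper bound: {max_wait_ms} ms")
```

The observed gap sits right at the bound, whereas the 30ms reordering
sits well inside it, which is what you would expect if the delay depends
on where in the DTIM cycle the frame was queued.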

Power-save buffering could very well be the culprit, though. It should
only happen if there are stations in power-save mode, so one way to check
would be to disable power-save on the WiFi client (and make sure no
other clients are connected with power-save enabled) and see if the
problem goes away...

-Toke