[Babel-users] Babel in Dave's network [was: babels bug with uninitialized data...]

Fri Jun 27 21:43:03 UTC 2014

>>> I turn it off (it has a noisy fan) and for sane values of "boom", the
>>> whole network switches over to going through the wan or adhoc ports.

>> You appear to be running with a very high hello interval,

> It usually seems much faster than that... but I'll go measure.

> I am using the default hello intervals.

Sorry, I got confused.  I've looked at the dump again, and you are running
with the default interval (4s) -- boom should be around 10s in your case.

> Should I tighten that in this case?

The Babel protocol is able to deal with hello intervals as low as 10ms.
However, we've never tested the actual implementation with sub-second
hellos, so you'll likely run into some bugs (and I'm looking forward to
your reports).

Expect reconvergence to happen after 2 hellos in the absence of packet
loss.  (That means 2.5 hello intervals on average, since we start counting
at a random point in the cycle.)
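
As a rough illustration of the arithmetic above, here is a minimal Python sketch. It assumes the delay is spread uniformly between 2 and 3 hello intervals, which is one reading of "2.5 on average"; the function name and the sample intervals are made up for the example and are not taken from babeld.

```python
def reconvergence_window(hello_interval_s, hellos=2):
    """Return (earliest, mean, latest) reconvergence delay in seconds.

    Assumption: the failure happens at a uniformly random point in the
    hello cycle, so the delay lies between `hellos` and `hellos + 1`
    hello intervals, with no packet loss.
    """
    earliest = hellos * hello_interval_s
    latest = (hellos + 1) * hello_interval_s
    mean = (earliest + latest) / 2
    return earliest, mean, latest


if __name__ == "__main__":
    # 4.0 s is the default interval mentioned above; the others are
    # arbitrary smaller values for comparison.
    for interval in (4.0, 1.0, 0.5):
        lo, mean, hi = reconvergence_window(interval)
        print(f"hello={interval}s: {lo}s to {hi}s, about {mean}s on average")
```

With the default 4 s interval this gives a window of 8 s to 12 s, which is where the "around 10s" figure comes from.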

> multicast throughout the network is set to 9mbits/sec.

[...]

> And: if I'm saturating the network, or using an artificial bottleneck
> or wifi sometimes it does very briefly (far less than 4 sec) lose a
> route through that during a rrul test.

Babel doesn't like to lose three updates in a row.  When it does, it sends
a (unicast) request for one extra update.  If that update is lost too, it
drops the route.  Oscar Wilde famously quipped that losing one packet is
a tragedy, while losing both is carelessness -- but unfortunately WiFi
multicasts are extremely unreliable, especially if you raise the multicast
rate above the default value.  Losing all four updates at 9 Mbit/s
is not at all unlikely.
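
To get a feel for the numbers, the following Python sketch estimates the chance of losing three multicast updates plus the one unicast retry. It assumes losses are independent, which real WiFi losses are not (they come in bursts), so it probably understates the problem. The loss rates are invented for illustration, not measured.

```python
def route_drop_probability(multicast_loss, unicast_loss, multicast_updates=3):
    """Probability that every update in one sequence is lost.

    Assumption: each loss is independent. `multicast_loss` and
    `unicast_loss` are per-packet loss probabilities in [0, 1].
    """
    for p in (multicast_loss, unicast_loss):
        if not 0.0 <= p <= 1.0:
            raise ValueError("loss probabilities must be within [0, 1]")
    # Three multicast updates in a row, then the single unicast request's
    # reply, must all go missing before the route is dropped.
    return (multicast_loss ** multicast_updates) * unicast_loss


if __name__ == "__main__":
    # Hypothetical loss rates, chosen only to show how fast the
    # probability grows with the multicast loss rate.
    for mcast in (0.1, 0.3, 0.5, 0.8):
        p = route_drop_probability(mcast, unicast_loss=0.5)
        print(f"multicast loss {mcast:.0%}: drop probability {p:.4f}")
```

Because the multicast loss rate enters cubed, a link that loses most of its multicasts during a saturating test drops routes far more often than one that only loses a few.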

Babel is able to recover reasonably fast by switching to an alternate
route -- the delay you're seeing is probably the time needed, when the
alternate route is unfeasible, to make an end-to-end exchange that
ensures it is loop-free.

> I have long been tempted to change the current delete/add logic for
> add/delete for route updates now that I have some trust in modern
> kernels.

That will only help in the case where the alternate route was already
feasible, and this case is already fast enough (on the order of
milliseconds).  It's the unfeasible case that's bothering you.
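
For readers unfamiliar with the feasible/unfeasible distinction, here is a minimal Python sketch of the feasibility test as I understand it from RFC 6126. The function names and the tuple layout are hypothetical and do not mirror babeld's source.

```python
INFINITY = 0xFFFF  # metric value that marks a retraction


def seqno_newer(a, b):
    """True if seqno `a` is strictly newer than `b`, modulo 2**16."""
    return 0 < ((a - b) % 65536) < 32768


def is_feasible(seqno, metric, feasibility_distance):
    """Decide whether an update can be accepted without risking a loop.

    `feasibility_distance` is a (seqno, metric) pair remembered for the
    source, or None if nothing has been recorded yet (hypothetical layout).
    """
    if metric >= INFINITY:
        return True  # retractions are always feasible
    if feasibility_distance is None:
        return True  # nothing to compare against
    fd_seqno, fd_metric = feasibility_distance
    if seqno_newer(seqno, fd_seqno):
        return True
    # Same seqno: the metric must be strictly better than the one recorded.
    return seqno == fd_seqno and metric < fd_metric


if __name__ == "__main__":
    fd = (10, 200)
    print(is_feasible(10, 100, fd))  # strictly better metric
    print(is_feasible(10, 300, fd))  # worse metric, same seqno
    print(is_feasible(11, 300, fd))  # newer seqno
```

An alternate route that fails this test cannot be used straight away, which is why the unfeasible case has to wait for an end-to-end exchange.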

Let's consider the actual issue, rather than trying to work around the
symptoms.  I can see three ways to avoid losing four updates in a row when
running over WiFi:

  - don't raise the multicast rate;
  - implement reliable updates (supported by the protocol, but not by the
    implementation);
  - implement unicast updates (supported by the protocol, but not by the
    implementation).

Note that raising the multicast rate is needed in order to get the link
quality estimator to work.  As you well know, I'm planning a better link
quality estimator for some future version, so the first solution above
might become practical at some point.

-- Juliusz