[Babel-users] Anybody else seeing disruption when restarting babeld?

Daniel Gröber dxld at darkboxed.org
Fri Feb 24 20:48:40 GMT 2023


On Fri, Feb 24, 2023 at 07:41:03PM +0100, Juliusz Chroboczek wrote:
> > I think I figured out what's going on: babeld immediately flushes the kernel
> > routes it installed when shutting down, without waiting for neighbours to
> > switch to a different path.
> 
> Right.  How long is the disruption?

It's not so much how long it is as it is that it's there at all. I don't
want my network to drop packets on the floor unnecessarily.

> > I figure this has to be a configurable option since full propagation of the
> > retractions depends on the network diameter and there's no way in the
> > protocol we can get acknowledgments from the entire network (AFAIK?), not
> > just our immediate neighbours.
> 
> We only signal the neighbours: this is distance-vector, so the neighbours
> will start searching for a different route without any need for end-to-end
> signalling. 

You're right, of course: as soon as we signal a neighbour they will divert
traffic somewhere else, but only if they have a feasible route at hand. I'm
also concerned about blackholing in the "no feasible route" case.
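
To make the two cases concrete, here is a rough sketch of how a neighbour
might pick a fallback once we retract. It is not babeld's code: the Route
type, the neighbour labels and the fallback() helper are made up for
illustration. The feasibility test follows the usual Babel rule (a newer
seqno, or the same seqno with a strictly smaller metric than the
feasibility distance).

```python
from dataclasses import dataclass
from typing import Iterable, Optional, Tuple

INFINITY = 0xFFFF  # Babel's "infinite" metric, as carried by a retraction


@dataclass(frozen=True)
class Route:
    neighbour: str  # hypothetical label for the advertising neighbour
    seqno: int
    metric: int


def seqno_newer(a: int, b: int) -> bool:
    """True if seqno a is strictly newer than b, compared modulo 2**16."""
    return a != b and ((a - b) & 0xFFFF) < 0x8000


def is_feasible(route: Route, fd: Optional[Tuple[int, int]]) -> bool:
    """fd is the (seqno, metric) feasibility distance, or None if unknown."""
    if fd is None:
        return True
    fd_seqno, fd_metric = fd
    return seqno_newer(route.seqno, fd_seqno) or (
        route.seqno == fd_seqno and route.metric < fd_metric
    )


def fallback(routes: Iterable[Route], retracting: str,
             fd: Optional[Tuple[int, int]]) -> Optional[Route]:
    """Best feasible route that avoids the retracting neighbour, or None."""
    candidates = [
        r for r in routes
        if r.neighbour != retracting
        and r.metric < INFINITY
        and is_feasible(r, fd)
    ]
    return min(candidates, key=lambda r: r.metric, default=None)
```

When fallback() returns None the neighbour has nowhere to send traffic
until a fresh seqno arrives, which is the blackholing case I'm worried
about.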

> Of course, if there are no feasible routes to a given destination, then
> the neighbours will perform an end-to-end search for a loop-free route,
> but that's the neighbours' problem, not ours.

I can't say I agree with the "their problem" mentality. The way I see it,
during graceful shutdown we're still responsible for in-flight traffic
anyway. We're also in a reasonable position to avoid dropping traffic that
neighbours keep sending us on the assumption that we're still alive and
routing, at least until our retraction propagates. Why shouldn't we take
advantage of that?

In my mind it doesn't matter if babeld takes 500ms or 15sec to shut down if
that buys me a rock solid network. So my thinking is I'd like to know when
everything has converged. Since that isn't really a thing in DV, as you
note, an ad-hoc delay is the next best thing I could think of.

The note about the ACKs was simply supposed to explain why I went with an
ad-hoc delay rather than having neighbours ACK the retractions.
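
As a back-of-the-envelope model of what the delay buys, the sketch below
counts packets that neighbours still send our way after we have flushed
our kernel routes. Everything in it is an assumption for illustration:
the function name, the fixed packet interval, and the per-neighbour switch
times (measured from the moment the retraction goes out at t = 0). It
says nothing about how babeld itself implements the delay.

```python
from typing import Iterable


def dropped_packets(flush_delay_ms: int,
                    switch_times_ms: Iterable[int],
                    packet_interval_ms: int = 10) -> int:
    """Count packets arriving after we flushed our routes.

    flush_delay_ms:     how long we keep forwarding after retracting.
    switch_times_ms:    per neighbour, when it stops sending traffic to us.
    packet_interval_ms: assumed constant spacing between packets.
    """
    dropped = 0
    for switch_ms in switch_times_ms:
        # The neighbour sends at t = 0, interval, 2*interval, ... until it
        # has switched to another route.
        for t in range(0, switch_ms, packet_interval_ms):
            if t >= flush_delay_ms:
                dropped += 1  # we no longer have a route installed
    return dropped
```

With one neighbour switching after 50ms and another after 300ms, flushing
immediately (a delay of 0) loses every packet sent before the switch,
while a delay at least as long as the slowest switch loses none in this
model.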

> >     https://github.com/jech/babeld/pull/102
> 
> Looks good to me.  Just two comments:
> 
>   - should the granularity be lower?  A second for local signalling is
>     a lot, I'd expect 300ms to be enough in most cases;

I have no problem changing it to millisecond granularity if that suits you?

>   - why a goto rather than a loop?

Oh you know how it goes: you make a decision then another, change that one
and then unbeknownst to you the first one doesn't really make sense
anymore ;) Will fix.

--Daniel


