[Babel-users] Convergence problem

Juliusz Chroboczek Juliusz.Chroboczek at pps.jussieu.fr
Thu Jul 3 15:13:09 UTC 2008


> I am testing Babel on ppp links, over USB/serial, and I have a
> problem: if I unplug the cable for a few seconds, so that the ppp link
> is still up, then I have problems; output logs from both sides are in
> the debug.* files.  But if I wait a little longer, so that the ppp link
> goes down, and then bring it (ppp) back up, I don't have problems.

Okay, I think I found the issue.  Your configuration is weird, and
there's a case I didn't envision in Babel.

The issue is that your implementation of ppp switches link-local
addresses on every up/down event.  IPv6 link-local addresses are
supposed to remain stable, which is why this case confuses Babel.
I'll implement a workaround, but you really should try to ensure that
your link-local addresses remain stable if possible.

Who's adding the link-local addresses here?  Is it a bug in your
ppp.up scripts, or one in the kernel?

FWIW, Babel will recover from this situation, but it will take a long
time -- roughly 16 times the hello interval plus 30 seconds, which in
your case amounts to over a minute.
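
To make the arithmetic concrete, here is a small sketch of the estimate
above.  The formula (16 hello intervals plus 30 seconds) is the one given
in this message; the function name and the sample hello interval of 4
seconds are assumptions made purely for illustration, not values taken
from your configuration.

```python
def recovery_time(hello_interval_s, history_len=16, extra_s=30.0):
    """Rough recovery estimate: history_len hello intervals plus extra_s.

    The factor 16 and the 30 seconds come from the estimate in this
    message; the argument names are hypothetical.
    """
    return history_len * hello_interval_s + extra_s


if __name__ == "__main__":
    # Hypothetical hello interval, chosen only as an example.
    print(recovery_time(4.0))
    # Any hello interval above 1.875 s pushes the estimate past a minute.
    print(recovery_time(1.875))
```

With this formula, any hello interval longer than 1.875 seconds gives an
estimate of more than a minute.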

Here's what's going on on the SuSE side:

> Neighbour 2001:1470:fffe:60:20d:b9ff:fe14:c524 at fe80::e0f7:19b6:d257:4ac1 dev ppp0 reach ff00 rxcost 96 txcost 96.

The alix board, with router id c524 (look at the last 4 digits),
appears at the link-local address 4ac1.

> Neighbour 2001:1470:fffe:60:20d:b9ff:fe14:c524 at fe80::e0f7:19b6:d257:4ac1 dev ppp0 reach 3fc0 rxcost 65535 txcost 96.

The alix board is still there, but it's now been detected as
unreachable (rxcost infinity).

> Neighbour 2001:1470:fffe:60:20d:b9ff:fe14:c524 at fe80::24ef:89ae:1360:698 dev ppp0 reach c000 rxcost 96 txcost 65535.
> Neighbour 2001:1470:fffe:60:20d:b9ff:fe14:c524 at fe80::e0f7:19b6:d257:4ac1 dev ppp0 reach 1fe0 rxcost 65535 txcost 96.

The alix board appears again, but with a different link-local address,
698.  Babel fails to notice that it's the same board.
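
If it helps when reading the logs, the reach values can be expanded into
bits.  The sketch below assumes that reach is a 16-bit history of
received hellos with the most recent hello in the most significant bit;
that layout is my reading of the values above, and the helper names are
made up for this example.

```python
def reach_bits(reach_hex):
    """Expand a reach value such as 'ff00' into a 16-character bit string."""
    return format(int(reach_hex, 16), "016b")


def recent_misses(reach_hex):
    """Count the leading zero bits, i.e. the most recent consecutive misses.

    Assumes the most significant bit corresponds to the latest hello.
    """
    bits = reach_bits(reach_hex)
    return len(bits) - len(bits.lstrip("0"))


if __name__ == "__main__":
    # Reach values copied from the Neighbour lines quoted above.
    for value in ("ff00", "3fc0", "1fe0", "c000"):
        print(value, reach_bits(value), recent_misses(value))
```

Under that assumption, the old address goes from no recent misses (ff00)
to two (3fc0) and then three (1fe0), while the new address (c000) has
only its two most recent hellos recorded.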

On the Alix side:

> Received ihu 96 for 2001:1470:fffe:60:20d:b9ff:fe14:c524 from 2001:1470:fffe:1:219:b9ff:fe6d:4ea1 (fe80::1065:9660:282c:d985) 900.

Fine, the SuSE machine is saying that it can hear the board again...

> Received ihu 65535 for 2001:1470:fffe:60:20d:b9ff:fe14:c524 from 2001:1470:fffe:1:219:b9ff:fe6d:4ea1 (fe80::1065:9660:282c:d985) 900.

...and then immediately complains that it cannot hear it.

In fact, the SuSE machine is sending two distinct IHU messages -- one
with metric 96 to the second link-local address, and one with metric
infinity to the former one.  Unfortunately, as it stands right now,
the protocol cannot distinguish between the two.
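
A toy model of the ambiguity, in case it makes the problem clearer.
This is not Babel's actual code: the table layout and the function names
are invented for the example, and it simply assumes that neighbours are
tracked per link-local address while an IHU names its target by router
id only, as the log lines above suggest.

```python
from collections import defaultdict

INFINITY = 65535
ALIX_ID = "2001:1470:fffe:60:20d:b9ff:fe14:c524"

# Hypothetical view of the SuSE side: one entry per link-local address,
# with the rxcost values from the Neighbour lines quoted earlier.
neighbours = {
    "fe80::e0f7:19b6:d257:4ac1": {"router_id": ALIX_ID, "rxcost": INFINITY},
    "fe80::24ef:89ae:1360:698": {"router_id": ALIX_ID, "rxcost": 96},
}


def build_ihus(table):
    """One IHU per neighbour entry, carrying only (router id, rxcost)."""
    return [(n["router_id"], n["rxcost"]) for n in table.values()]


def conflicting_ihus(ihus):
    """Router ids for which the receiver sees more than one distinct cost."""
    costs = defaultdict(set)
    for router_id, cost in ihus:
        costs[router_id].add(cost)
    return {rid: sorted(c) for rid, c in costs.items() if len(c) > 1}


if __name__ == "__main__":
    print(conflicting_ihus(build_ihus(neighbours)))
```

In this model the board receives both 96 and 65535 for its own router
id, with nothing in the message to say which link-local address each one
refers to.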

                                        Juliusz
