[Babel-users] Convergence problem
Juliusz Chroboczek
Juliusz.Chroboczek at pps.jussieu.fr
Thu Jul 3 15:13:09 UTC 2008
> I am testing Babel on ppp links, over USB/serial, and I have a
> problem: if I unplug the cable for a few seconds, so that the ppp
> link is still up, things go wrong. Output logs from both sides are
> in the files debug.*. But if I wait a little longer, so that the ppp
> link goes down, and then bring ppp up again, I don't have problems.
Okay, I think I found the issue. Your configuration is weird, and
there's a case I didn't envision in Babel.
The issue is that your implementation of ppp switches link-local
addresses on every up/down event. IPv6 link-local addresses are
supposed to remain stable, which is why Babel gets confused in this
case. I'll implement a workaround, but you really should try to
ensure that your link-local addresses remain stable if possible.
Who's adding the link-local addresses here? Is it a bug in your
ppp.up scripts, or one in the kernel?
FWIW, Babel will recover from this situation, but it will take a long
time -- roughly 16 times the hello interval plus 30 seconds, which in
your case amounts to over a minute.
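
For a rough feel of that delay, here is a minimal sketch of the
arithmetic. The 4-second hello interval is an assumption made purely
for illustration (the real value is whatever your configuration
sets), and the function name is made up.

```python
def recovery_time_s(hello_interval_s: float) -> float:
    """Approximate recovery delay: 16 hello intervals plus 30 seconds."""
    return 16 * hello_interval_s + 30.0


# Hypothetical hello interval of 4 seconds.
print(recovery_time_s(4.0))  # 94.0
```

With this formula, any hello interval above 1.875 seconds already
pushes the delay past one minute.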
Here's what's going on on the SuSE side:
> Neighbour 2001:1470:fffe:60:20d:b9ff:fe14:c524 at fe80::e0f7:19b6:d257:4ac1 dev ppp0 reach ff00 rxcost 96 txcost 96.
The alix board, with router id c524 (look at the last 4 digits),
appears at the link-local address ending in 4ac1.
> Neighbour 2001:1470:fffe:60:20d:b9ff:fe14:c524 at fe80::e0f7:19b6:d257:4ac1 dev ppp0 reach 3fc0 rxcost 65535 txcost 96.
The alix board is still there, but it's now been detected as
unreachable (rxcost infinity).
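
The reach values in these logs (ff00, 3fc0, and later 1fe0 and c000)
are consistent with a 16-bit history of hellos, newest in the top
bit, shifted right once per hello interval. The sketch below models
only that reading of the field; the function names are made up, and
it does not attempt to reproduce how the cost is derived from the
history.

```python
MASK = 0xFFFF  # reach is treated as a 16-bit history


def hello_received(reach: int) -> int:
    """Shift the history and record a received hello in the top bit."""
    return ((reach >> 1) | 0x8000) & MASK


def hello_missed(reach: int, count: int = 1) -> int:
    """Shift the history by `count` intervals with no hello received."""
    return (reach >> count) & MASK


# Two missed hellos take ff00 to 3fc0, a third takes it to 1fe0.
print(format(hello_missed(0xFF00, 2), "04x"))  # 3fc0
print(format(hello_missed(0x3FC0, 1), "04x"))  # 1fe0

# A brand new entry that has heard two hellos shows c000.
print(format(hello_received(hello_received(0)), "04x"))  # c000
```

Under this model, an entry that stops hearing hellos needs 16 missed
intervals before its history is all zeros.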
> Neighbour 2001:1470:fffe:60:20d:b9ff:fe14:c524 at fe80::24ef:89ae:1360:698 dev ppp0 reach c000 rxcost 96 txcost 65535.
> Neighbour 2001:1470:fffe:60:20d:b9ff:fe14:c524 at fe80::e0f7:19b6:d257:4ac1 dev ppp0 reach 1fe0 rxcost 65535 txcost 96.
The alix board appears again, but with a different link-local address,
ending in 698. Babel fails to notice it's the same board.
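
To illustrate why this produces two entries, here is a minimal sketch
of a neighbour table indexed by interface and link-local address.
This is an assumption about the shape of the lookup, not the actual
Babel code; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Neighbour:
    router_id: str
    address: str
    ifname: str


def find_or_add(table: dict, router_id: str, address: str, ifname: str) -> Neighbour:
    """Look a neighbour up by (interface, link-local address) only."""
    key = (ifname, address)
    if key not in table:
        table[key] = Neighbour(router_id, address, ifname)
    return table[key]


table = {}
alix = "2001:1470:fffe:60:20d:b9ff:fe14:c524"

# Same router id, before and after the link-local address changed.
find_or_add(table, alix, "fe80::e0f7:19b6:d257:4ac1", "ppp0")
find_or_add(table, alix, "fe80::24ef:89ae:1360:698", "ppp0")

same_router = [n for n in table.values() if n.router_id == alix]
print(len(same_router))  # 2
```

Because the router id is not part of the key, the second address
creates a fresh entry while the stale one lingers until it times out.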
On the Alix side:
> Received ihu 96 for 2001:1470:fffe:60:20d:b9ff:fe14:c524 from 2001:1470:fffe:1:219:b9ff:fe6d:4ea1 (fe80::1065:9660:282c:d985) 900.
Fine, the SuSE machine is saying that it can hear the board again...
> Received ihu 65535 for 2001:1470:fffe:60:20d:b9ff:fe14:c524 from 2001:1470:fffe:1:219:b9ff:fe6d:4ea1 (fe80::1065:9660:282c:d985) 900.
...and then immediately complains that it cannot hear it.
In fact, the SuSE machine is sending two distinct IHU messages -- one
with metric 96 to the second link-local address, and one with metric
infinity to the former one. Unfortunately, as it stands right now,
the protocol cannot distinguish between the two.
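
As the Alix log shows, each IHU names the router id it is meant for
rather than the link-local address it was computed for. The sketch
below is a simplified model of the receiving side under that
assumption, not the actual Babel code; the function and variable
names are hypothetical.

```python
INFINITY = 65535


def apply_ihu(txcost_by_sender: dict, my_id: str, sender: str,
              target_id: str, cost: int) -> None:
    """Record the cost a sender reports for us, keyed by sender only."""
    if target_id == my_id:
        txcost_by_sender[sender] = cost


alix = "2001:1470:fffe:60:20d:b9ff:fe14:c524"
suse = "fe80::1065:9660:282c:d985"

txcost = {}
# One IHU per neighbour entry on the SuSE side, both naming the same id.
apply_ihu(txcost, alix, suse, alix, 96)        # for the new address
apply_ihu(txcost, alix, suse, alix, INFINITY)  # for the stale address

print(txcost[suse])  # 65535
```

Whichever IHU arrives last wins, so in this model the board's view of
the link depends on arrival order until the stale entry expires on
the SuSE side.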
Juliusz