[Babel-users] Restarting MeshPoint – seeking advice on routing for crisis/disaster scenarios
Valent@MeshPoint
valent at meshpointone.com
Fri Dec 19 17:17:49 GMT 2025
Hi everyone,
I'm working on a fair, reproducible benchmark methodology for comparing
mesh routing protocols (Babel, BATMAN-adv, Yggdrasil, and others).
Before running the full benchmark, I'd like to get feedback from the
Babel community on the methodology.
BACKGROUND
----------
We're using meshnet-lab (https://github.com/mwarning/meshnet-lab) for
testing, which creates virtual mesh networks using Linux network
namespaces on a single host. This approach has limitations that we've
documented, and I'd appreciate input on whether our methodology
properly accounts for them.
TEST ENVIRONMENT
----------------
Hardware: ThinkPad T14 laptop (12 cores, 16GB RAM)
Software: meshnet-lab with network namespaces
Protocols: babeld 1.13.x, batctl/batman-adv, yggdrasil 0.5.x
INFRASTRUCTURE LIMITATIONS DISCOVERED
-------------------------------------
During development, we found significant limitations when testing larger
networks:
1. Supernode/Hub Bottleneck
When testing real Freifunk topologies (e.g., Bielefeld with 246 nodes),
we discovered that star topologies cause test infrastructure failures,
not protocol failures.
The issue: if a topology has a supernode (hub) connected to 200+ other
nodes, the meshnet-lab bridge for that hub receives ~60 hello
packets/second from all neighbors. This causes:
- UDP packet loss at the bridge level
- Apparent "connectivity failures" that are actually infrastructure
  artifacts
- False negatives that make protocols look broken when they're not
Our solution: cap the maximum node degree at 20 and avoid pure star
topologies.
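The hub numbers are easy to sanity-check: assuming babeld's default
4-second hello interval, ~240 neighbors produce ~60 hellos/second at the
hub's bridge. Below is a minimal sketch of that estimate plus a greedy
degree-capping pass; the function names and the edge-dropping strategy
are ours, not something meshnet-lab provides:

```python
import random

def hub_hello_rate(neighbors: int, hello_interval_s: float = 4.0) -> float:
    """Hello packets per second arriving at a hub's bridge, assuming
    each neighbor sends one hello per hello interval."""
    return neighbors / hello_interval_s

def cap_degree(edges, max_degree=20, seed=0):
    """Greedily drop edges until no node exceeds max_degree.
    edges: iterable of (a, b) node-id pairs, one per undirected link.
    Every edge of an over-cap node is visited, so the result is
    guaranteed to respect the cap."""
    rng = random.Random(seed)
    degree = {}
    for a, b in edges:
        degree[a] = degree.get(a, 0) + 1
        degree[b] = degree.get(b, 0) + 1
    shuffled = list(edges)
    rng.shuffle(shuffled)  # randomize which excess edges get dropped
    kept = []
    for a, b in shuffled:
        if degree[a] > max_degree or degree[b] > max_degree:
            degree[a] -= 1  # drop this edge
            degree[b] -= 1
        else:
            kept.append((a, b))
    return kept
```

For a pure 200-leaf star, this keeps exactly 20 spokes, which is the
degenerate case the cap is meant to prevent.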
2. Scale Limitations
We've validated that 100 nodes is a safe limit where:
- CPU stays under 80%
- Memory is not a bottleneck
- Results are reproducible (variance < 10%)
For networks larger than ~250 nodes, single-host simulation becomes
unreliable regardless of available RAM. The bottleneck is CPU context
switching between namespaces and multicast flooding overhead.
3. 1000+ Node Networks
We cannot reliably test 1000+ node networks with this methodology.
Any attempt would produce infrastructure artifacts, not protocol
measurements. For such scales, distributed testing across multiple
physical hosts would be needed.
PROPOSED TEST SUITE
-------------------
We've documented a methodology with:
6 Topologies:
T1: Grid 10x10 (100 nodes, max degree 4)
T2: Random mesh (100 nodes, max degree ~10)
T3: Clustered/federated (100 nodes, 4 clusters)
T4: Linear chain (50 nodes, diameter 49)
T5: Small-world Watts-Strogatz (100 nodes)
T6: Sampled real Freifunk (80 nodes, degree capped)
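For reproducibility, the topologies can be generated without external
graph libraries. As one example, here is a minimal pure-Python sketch of
T5 (Watts-Strogatz small-world); the parameter names and rewiring
details are our assumptions, not taken from meshnet-lab:

```python
import random

def watts_strogatz(n=100, k=4, beta=0.1, seed=42):
    """Ring lattice of n nodes, each linked to its k nearest
    neighbors (k even), with each edge rewired to a random endpoint
    with probability beta. Returns a set of (a, b) pairs, a < b."""
    rng = random.Random(seed)
    lattice = set()
    for i in range(n):
        for j in range(1, k // 2 + 1):
            lattice.add((i, (i + j) % n))  # forward neighbors only
    result = set()
    for a, b in lattice:
        if rng.random() < beta:
            # rewire b to a random node, avoiding self-loops and
            # duplicates of edges we already produced
            c = rng.randrange(n)
            while c == a or c == b or (min(a, c), max(a, c)) in result:
                c = rng.randrange(n)
            result.add((min(a, c), max(a, c)))
        else:
            result.add((min(a, b), max(a, b)))
    return result
```

With beta=0 this is a plain ring lattice (n*k/2 edges); small beta gives
the short-diameter, high-clustering regime T5 is meant to exercise.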
5 Validation Tests (before benchmarks):
V1: 3-node sanity check
V2: Scaling ladder (find breaking point)
V3: Consistency check (reproducibility)
V4: Resource monitoring
V5: Bridge port audit
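V3's reproducibility gate (the "variance < 10%" criterion mentioned
above) can be checked mechanically. A sketch, assuming the threshold
refers to the coefficient of variation (relative standard deviation)
across repeated runs of the same scenario:

```python
from statistics import mean, stdev

def is_reproducible(samples, max_cv=0.10):
    """V3-style check: the coefficient of variation (stdev / mean)
    of repeated measurements must stay below max_cv (10% default).
    samples: per-run values of one metric, e.g. convergence seconds."""
    return stdev(samples) / mean(samples) < max_cv
```

So three convergence runs of 14.0s, 13.5s, 14.5s pass (CV ~ 3.6%),
while widely scattered runs would flag the scenario as unreliable.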
8 Benchmark Scenarios:
S1: Steady-state convergence
S2: Node failure recovery
S3: Lossy link handling (tc netem)
S4: Mobility/roaming simulation
S5: Network partition and merge
S6: High churn (10% nodes cycling)
S7: Traffic under load (iperf3)
S8: Administrative complexity (subjective)
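Several of these scenarios (S1, S2, S5) reduce to the question "when
does pairwise reachability return to 100%?". A sketch of that
measurement loop, where `can_reach` stands in for whatever probe the
harness actually uses (e.g. a ping run inside the source node's
namespace); the function names are ours:

```python
import itertools
import time

def connectivity_ratio(nodes, can_reach):
    """Fraction of ordered node pairs (a, b), a != b, for which
    can_reach(a, b) succeeds. 1.0 means full connectivity."""
    pairs = list(itertools.permutations(nodes, 2))
    ok = sum(1 for a, b in pairs if can_reach(a, b))
    return ok / len(pairs)

def time_to_converge(nodes, can_reach, poll_s=1.0, timeout_s=120.0):
    """Poll until connectivity_ratio reaches 1.0; return elapsed
    seconds, or None if the timeout expires first."""
    start = time.monotonic()
    while time.monotonic() - start < timeout_s:
        if connectivity_ratio(nodes, can_reach) == 1.0:
            return time.monotonic() - start
        time.sleep(poll_s)
    return None
```

Note the polling interval bounds the resolution of any reported
convergence time, so it should be stated alongside the results.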
QUESTIONS FOR THE COMMUNITY
---------------------------
1. Missing tests?
Are there scenarios important for Babel that we should add?
2. Unrealistic tests?
Should we skip any tests that don't make sense for real-world
evaluation?
3. Babel-specific considerations?
Any configuration parameters or behaviors we should specifically
measure?
4. Large-scale alternatives?
Does anyone have experience with distributed mesh testing across
multiple hosts? How do you handle the coordination and measurement?
5. Known limitations?
Are there known Babel behaviors at scale that we should document
upfront?
INITIAL RESULTS
---------------
Our initial tests with babeld show:
Grid 100 nodes: 100% connectivity, ~14s convergence
Chain 50 nodes: 100% connectivity, ~5s convergence
Small-world 100 nodes: 100% connectivity, ~12s convergence
These results validate that the test infrastructure works correctly
for Babel at this scale.
FULL METHODOLOGY DOCUMENT
-------------------------
The complete methodology document is attached.
I'd appreciate any feedback, suggestions, or concerns before we proceed
with the full benchmark.
Thanks,
Valent.
------ Original Message ------
From "Juliusz Chroboczek" <jch at irif.fr>
To "Linus Lüssing" <linus.luessing at c0d3.blue>
Cc "Valent Turkovic" <valent at meshpointone.com>;
babel-users at alioth-lists.debian.net
Date 19.12.2025. 12:45:16
Subject Re: [Babel-users] Restarting MeshPoint – seeking advice on
routing for crisis/disaster scenarios
>> There's also l3roamd, predating sroamd:
>>
>> https://github.com/freifunk-gluon/l3roamd
>
>That's right, I should have mentioned it. I'll be sure to give proper
>credit if I ever come back to sroamd.
>
>For the record, sroamd is based on a combination of the ideas in l3roamd
>and in the PMIPv6 protocol, plus a fair dose of IS-IS.
>
>-- Juliusz
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 84_-_Multi-Protocol_Mesh_Benchmark_Methodology_(Public).md
Type: application/octet-stream
Size: 20061 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/babel-users/attachments/20251219/8182cfca/attachment-0001.obj>