[Babel-users] Restarting MeshPoint – seeking advice on routing for crisis/disaster scenarios

Valent@MeshPoint valent at meshpointone.com
Fri Dec 19 17:17:49 GMT 2025


Hi everyone,
I'm working on a fair, reproducible benchmark methodology for comparing
mesh routing protocols (Babel, BATMAN-adv, Yggdrasil, and others).
Before running the full benchmark, I'd like to get feedback from the
Babel community on the methodology.
BACKGROUND
----------
We're using meshnet-lab (https://github.com/mwarning/meshnet-lab) for
testing, which creates virtual mesh networks using Linux network
namespaces on a single host. This approach has limitations that we've
documented, and I'd appreciate input on whether our methodology properly
accounts for them.
TEST ENVIRONMENT
----------------
   Hardware: ThinkPad T14 laptop (12 cores, 16GB RAM)
   Software: meshnet-lab with network namespaces
   Protocols: babeld 1.13.x, batctl/batman-adv, yggdrasil 0.5.x
INFRASTRUCTURE LIMITATIONS DISCOVERED
-------------------------------------
During development, we found significant limitations when testing larger
networks:
1. Supernode/Hub Bottleneck
When testing real Freifunk topologies (e.g., Bielefeld with 246 nodes),
we discovered that star topologies cause test infrastructure failures,
not protocol failures.
The issue: If a topology has a supernode (hub) connected to 200+ other
nodes, the meshnet-lab bridge for that hub receives ~60 hello
packets/second from all neighbors. This causes:
   - UDP packet loss at the bridge level
   - Apparent "connectivity failures" that are actually infrastructure
     artifacts
   - False negatives that make protocols look broken when they're not
Our solution: Cap maximum node degree at 20 and avoid pure star
topologies.
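
The hub load is simple arithmetic: assuming babeld's default hello
interval of 4 seconds, a hub with ~240 neighbors sees ~60 hellos/second.
A minimal sketch of that estimate and of the degree cap we apply (the
function names and the greedy edge-dropping strategy are ours, not
meshnet-lab's):

```python
# Sketch: estimate hello load at a hub and cap node degree in an edge
# list. Assumes babeld's default 4 s hello interval; helper names are
# hypothetical, not part of meshnet-lab.
from collections import defaultdict

def hub_hello_rate(degree, hello_interval_s=4.0):
    """Hello packets/second arriving at a node from its neighbors."""
    return degree / hello_interval_s

def cap_degree(edges, max_degree=20):
    """Greedily drop edges so that no node exceeds max_degree.

    edges: list of (a, b) node-id pairs; returns the kept edges."""
    deg = defaultdict(int)
    kept = []
    for a, b in edges:
        if deg[a] < max_degree and deg[b] < max_degree:
            kept.append((a, b))
            deg[a] += 1
            deg[b] += 1
    return kept

# A 240-neighbor hub at the default interval: 60 hellos/second.
print(hub_hello_rate(240))  # 60.0
```

Applied to a pure star of 240 spokes, this keeps only 20 edges at the
hub, which is exactly the cap described above.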
2. Scale Limitations
We've validated that 100 nodes is a safe limit where:
   - CPU stays under 80%
   - Memory is not a bottleneck
   - Results are reproducible (variance < 10%)
For networks larger than ~250 nodes, single-host simulation becomes
unreliable regardless of available RAM. The bottleneck is CPU context
switching between namespaces and multicast flooding overhead.
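
The reproducibility criterion above ("variance < 10%") can be checked
mechanically across repeated runs; here is a minimal sketch, reading the
criterion as a coefficient of variation below 0.10 (the function name
and that reading are our own):

```python
# Sketch: reproducibility check across repeated benchmark runs.
# "Variance < 10%" is interpreted here as stdev/mean < 0.10.
from statistics import mean, stdev

def is_reproducible(samples, threshold=0.10):
    """True if the coefficient of variation stays below threshold."""
    if len(samples) < 2:
        raise ValueError("need at least two runs")
    return stdev(samples) / mean(samples) < threshold

# Three convergence-time runs within a few percent of each other:
print(is_reproducible([14.1, 13.8, 14.5]))  # True
```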
3. 1000+ Node Networks
We cannot reliably test 1000+ node networks with this methodology.
Any attempt would produce infrastructure artifacts, not protocol
measurements. For such scales, distributed testing across multiple
physical hosts would be needed.
PROPOSED TEST SUITE
-------------------
We've documented a methodology with:
6 Topologies:
   T1: Grid 10x10 (100 nodes, max degree 4)
   T2: Random mesh (100 nodes, max degree ~10)
   T3: Clustered/federated (100 nodes, 4 clusters)
   T4: Linear chain (50 nodes, diameter 49)
   T5: Small-world Watts-Strogatz (100 nodes)
   T6: Sampled real Freifunk (80 nodes, degree capped)
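
Two of these topologies are easy to generate and sanity-check without
extra libraries. A pure-Python sketch (meshnet-lab has its own topology
tooling; this only illustrates the stated degree and diameter figures):

```python
# Sketch: generate T1 (10x10 grid) and T4 (50-node chain) as edge lists
# and verify the stated max degree (4) and diameter (49).
from collections import deque, defaultdict

def grid_edges(w, h):
    """4-connected grid; node id = y * w + x."""
    edges = []
    for y in range(h):
        for x in range(w):
            if x + 1 < w:
                edges.append((y * w + x, y * w + x + 1))
            if y + 1 < h:
                edges.append((y * w + x, (y + 1) * w + x))
    return edges

def chain_edges(n):
    return [(i, i + 1) for i in range(n - 1)]

def max_degree(edges):
    deg = defaultdict(int)
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return max(deg.values())

def diameter(edges, n):
    """Longest shortest path, via BFS from every node."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    best = 0
    for src in range(n):
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    q.append(v)
        best = max(best, max(dist.values()))
    return best

print(max_degree(grid_edges(10, 10)))  # 4
print(diameter(chain_edges(50), 50))   # 49
```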
5 Validation Tests (before benchmarks):
   V1: 3-node sanity check
   V2: Scaling ladder (find breaking point)
   V3: Consistency check (reproducibility)
   V4: Resource monitoring
   V5: Bridge port audit
8 Benchmark Scenarios:
   S1: Steady-state convergence
   S2: Node failure recovery
   S3: Lossy link handling (tc netem)
   S4: Mobility/roaming simulation
   S5: Network partition and merge
   S6: High churn (10% nodes cycling)
   S7: Traffic under load (iperf3)
   S8: Administrative complexity (subjective)
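
For scenarios like S1 and S2 we read convergence time off periodic
all-pairs reachability probes. A minimal sketch of just the detection
logic (the probing itself, i.e. ping between namespaces, is not shown,
and the function name is ours):

```python
# Sketch: find convergence time from (t_seconds, fraction_reachable)
# samples, sorted by time. "Converged" here means full reachability is
# reached and never lost again within the observation window; this is
# one possible definition, not necessarily meshnet-lab's.

def convergence_time(samples):
    """Return the first time after which reachability stays at 1.0,
    or None if the network never fully converges."""
    t_conv = None
    for t, frac in samples:
        if frac >= 1.0:
            if t_conv is None:
                t_conv = t
        else:
            t_conv = None  # a later dip resets the candidate
    return t_conv

samples = [(0, 0.0), (5, 0.6), (10, 0.95), (14, 1.0), (20, 1.0)]
print(convergence_time(samples))  # 14
```

The reset on any dip matters for S2 and S5: a network that briefly
reaches full reachability and then partitions should not count as
converged at the earlier timestamp.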
QUESTIONS FOR THE COMMUNITY
---------------------------
1. Missing tests?
    Are there scenarios important for Babel that we should add?
2. Unrealistic tests?
    Should we skip any tests that don't make sense for real-world
    evaluation?
3. Babel-specific considerations?
    Any configuration parameters or behaviors we should specifically
    measure?
4. Large-scale alternatives?
    Does anyone have experience with distributed mesh testing across
    multiple hosts? How do you handle the coordination and measurement?
5. Known limitations?
    Are there known Babel behaviors at scale that we should document
    upfront?
INITIAL RESULTS
---------------
Our initial tests with babeld show:
   Grid 100 nodes:       100% connectivity, ~14s convergence
   Chain 50 nodes:       100% connectivity, ~5s convergence
   Small-world 100 nodes: 100% connectivity, ~12s convergence
These results validate that the test infrastructure works correctly
for Babel at this scale.
FULL METHODOLOGY DOCUMENT
-------------------------
The complete methodology document is attached.
I'd appreciate any feedback, suggestions, or concerns before we proceed
with the full benchmark.
Thanks,
Valent.


------ Original Message ------
>From "Juliusz Chroboczek" <jch at irif.fr>
To "Linus Lüssing" <linus.luessing at c0d3.blue>
Cc "Valent Turkovic" <valent at meshpointone.com>; 
babel-users at alioth-lists.debian.net
Date 19.12.2025. 12:45:16
Subject Re: [Babel-users] Restarting MeshPoint – seeking advice on routing for crisis/disaster scenarios

>>  There's also l3roamd, predating sroamd:
>>
>>  https://github.com/freifunk-gluon/l3roamd
>
>That's right, I should have mentioned it.  I'll be sure to give proper
>credit if I ever come back to sroamd.
>
>For the record, sroamd is based on a combination of the ideas in l3roamd
>and in the PMIPv6 protocol, plus a fair dose of IS-IS.
>
>-- Juliusz
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 84_-_Multi-Protocol_Mesh_Benchmark_Methodology_(Public).md
Type: application/octet-stream
Size: 20061 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/babel-users/attachments/20251219/8182cfca/attachment-0001.obj>

