[Babel-users] Restarting MeshPoint – seeking advice on routing for crisis/disaster scenarios
Valent@MeshPoint
valent at meshpointone.com
Mon Dec 29 00:26:30 GMT 2025
Hi Juliusz and the list,
Happy holidays! Following up on sroamd testing for mesh mobility.
While testing Babel + sroamd in a network namespace setup, we found two
bugs and an unexpected performance result. Below are the details.
BUGS FOUND
----------
Bug 1: FD_SETSIZE crash
When running sroamd in network namespaces with 25+ nodes, it crashes
with:
*** bit out of range 0 - FD_SETSIZE on fd_set ***: terminated
Aborted (core dumped)
The root cause is in sroamd.c lines 342-374. The main event loop uses
fd_set with pselect(). An fd_set can only hold descriptor numbers below
FD_SETSIZE (typically 1024). When running in network
namespaces, the kernel assigns file descriptor numbers sequentially
across all namespaces, so sroamd's socket FDs can easily exceed 1024.
The fix is to replace pselect() with poll(). The poll() syscall uses a
dynamically-allocated array of struct pollfd and has no limit on file
descriptor numbers.
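
Since descriptor numbers are allocated per process, one quick way to
confirm that a given process really holds descriptors at or above
FD_SETSIZE is to look at /proc. This is only a minimal sketch, assuming a
Linux /proc filesystem and enough privileges to read the target process;
max_fd is a hypothetical helper name and is not part of sroamd.

```shell
# Print the highest file descriptor number currently open in a process.
# Hypothetical helper (not part of sroamd); relies on /proc/<pid>/fd,
# so it is Linux only.
max_fd() {
    pid=$1
    # Each entry in /proc/<pid>/fd is named after one open descriptor.
    ls "/proc/$pid/fd" 2>/dev/null | sort -n | tail -n 1
}
```

For example, max_fd "$(pidof -s sroamd)" printing 1024 or more would be
consistent with the abort shown above.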
Bug 2: server_socket not guarded
In sroamd.c line 401, the server_socket is added to the poll set without
checking if it is valid:
pollfds[nfds].fd = server_socket;
However, server_socket is initialized to -1 in flood.c line 21:
int server_socket = -1;
Compare this with the neighbor socket handling which correctly checks:
    if(neighs[i].fd >= 0) {
        pollfds[nfds].fd = neighs[i].fd;
        ...
    }
The fix is to add the same guard for server_socket:
    if(server_socket >= 0) {
        idx_server = nfds;
        pollfds[nfds].fd = server_socket;
        pollfds[nfds].events = POLLIN;
        pollfds[nfds].revents = 0;
        nfds++;
    }
TEST ENVIRONMENT
----------------
Hardware: Lenovo ThinkPad T14
Operating system: Fedora 42
Kernel version: 6.12.6-200.fc41.x86_64
Software versions used:
babeld 1.13.1 (from Fedora repository)
sroamd latest git commit (with our patches applied)
All tests run in Linux network namespaces using veth pairs for virtual
links.
WHY THIS TEST
-------------
We chose a link-failure failover test because it represents a common
mesh network scenario: a link goes down and the routing protocol must
find an alternative path. This is relevant for:
- Wireless mesh networks where links are unreliable
- Crisis/disaster scenarios where nodes may fail
- Mobile mesh where topology changes frequently
We wanted to measure how quickly Babel (with and without sroamd)
recovers connectivity when a primary path fails.
TEST TOPOLOGY
-------------
We used a diamond topology with 4 nodes. This provides exactly two paths
between source and destination, making failover behavior deterministic
and measurable.
            n1
           /  \
    n0 ---+    +--- n3
           \  /
            n2
Node n0 is the source (client).
Node n3 is the destination (server).
Nodes n1 and n2 are intermediate routers.
IP addressing:
n0-n1 link: 10.0.1.0/24 (n0 has .1, n1 has .2)
n0-n2 link: 10.0.2.0/24 (n0 has .1, n2 has .2)
n1-n3 link: 10.0.3.0/24 (n1 has .1, n3 has .2)
n2-n3 link: 10.0.4.0/24 (n2 has .1, n3 has .2)
Each node also has a loopback address for stable identification:
n0: 192.168.100.1/32
n1: 192.168.100.2/32
n2: 192.168.100.3/32
n3: 192.168.100.4/32
The test fails the n0-n1 link and measures how long until n0 can reach
n3 via the n0-n2-n3 path.
EXACT COMMANDS USED
-------------------
Step 1: Create network namespaces
ip netns add n0
ip netns add n1
ip netns add n2
ip netns add n3
Step 2: Create veth pairs for each link
ip link add n0e1 type veth peer name n1e0
ip link add n0e2 type veth peer name n2e0
ip link add n1e3 type veth peer name n3e1
ip link add n2e3 type veth peer name n3e2
Step 3: Move interfaces to namespaces
ip link set n0e1 netns n0
ip link set n0e2 netns n0
ip link set n1e0 netns n1
ip link set n1e3 netns n1
ip link set n2e0 netns n2
ip link set n2e3 netns n2
ip link set n3e1 netns n3
ip link set n3e2 netns n3
Step 4: Assign IP addresses (example for n0)
ip netns exec n0 ip addr add 10.0.1.1/24 dev n0e1
ip netns exec n0 ip addr add 10.0.2.1/24 dev n0e2
ip netns exec n0 ip addr add 192.168.100.1/32 dev lo
ip netns exec n0 ip link set n0e1 up
ip netns exec n0 ip link set n0e2 up
ip netns exec n0 ip link set lo up
(similar commands for n1, n2, n3 with their respective addresses)
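
The per-node commands can be derived from the addressing table above. The
following is a sketch rather than the exact script we ran: it only prints
the commands (a dry run), and the function names addr_cmds and
all_addr_cmds are hypothetical. The interface-to-address mapping is taken
from the IP addressing list and the veth pairs in Step 2.

```shell
# Print the Step 4 commands for one interface: add the address, bring it up.
addr_cmds() {
    ns=$1
    dev=$2
    addr=$3
    echo "ip netns exec $ns ip addr add $addr dev $dev"
    echo "ip netns exec $ns ip link set $dev up"
}

# Print the commands for all four nodes (namespace, interface, address).
all_addr_cmds() {
    while read -r ns dev addr; do
        addr_cmds "$ns" "$dev" "$addr"
    done <<'EOF'
n0 n0e1 10.0.1.1/24
n0 n0e2 10.0.2.1/24
n0 lo 192.168.100.1/32
n1 n1e0 10.0.1.2/24
n1 n1e3 10.0.3.1/24
n1 lo 192.168.100.2/32
n2 n2e0 10.0.2.2/24
n2 n2e3 10.0.4.1/24
n2 lo 192.168.100.3/32
n3 n3e1 10.0.3.2/24
n3 n3e2 10.0.4.2/24
n3 lo 192.168.100.4/32
EOF
}
```

Running all_addr_cmds prints 24 commands; piping the output to sh as root
(in place of the n0 example above) would apply them.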
Step 5: Start babeld on each node
ip netns exec n0 babeld -D -I /tmp/babel_n0.pid -S /tmp/babel_n0.state \
    -C 'redistribute local if n0' n0e1 n0e2
ip netns exec n1 babeld -D -I /tmp/babel_n1.pid -S /tmp/babel_n1.state \
    -C 'redistribute local if n1' n1e0 n1e3
(similar for n2, n3)
Step 6: Start sroamd on each node (when testing Babel+sroamd)
ip netns exec n0 sroamd -d 3 n0e1 n0e2
ip netns exec n1 sroamd -d 3 n1e0 n1e3
ip netns exec n2 sroamd -d 3 n2e0 n2e3
ip netns exec n3 sroamd -d 3 n3e1 n3e2
Step 7: Wait for convergence (30 seconds)
sleep 30
Step 8: Verify baseline connectivity
ip netns exec n0 ping -c 3 192.168.100.4
Step 9: Fail the primary link and measure recovery
START_TIME=$(date +%s.%N)
ip netns exec n0 ip link set n0e1 down
while true; do
    if ip netns exec n0 ping -c 1 -W 1 192.168.100.4 > /dev/null 2>&1; then
        END_TIME=$(date +%s.%N)
        RECOVERY=$(echo "$END_TIME - $START_TIME" | bc)
        echo "Recovery time: ${RECOVERY}s"
        break
    fi
    sleep 0.1
done
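
The loop above never terminates if connectivity does not come back. A
possible variant with a timeout is sketched below. It is not the script
that produced the results in this mail: measure_recovery is a hypothetical
helper name, the probe command is passed in as arguments, and GNU date
(for %N) and fractional sleep are assumed, as in the original loop. The
start time is taken when the function is called, so the link should be
failed immediately before.

```shell
# Run a probe command repeatedly until it succeeds, then print the
# elapsed time in seconds. Returns 1 if the timeout expires first.
# Usage: measure_recovery TIMEOUT_SECONDS PROBE_COMMAND [ARGS...]
measure_recovery() {
    timeout_s=$1
    shift
    start=$(date +%s.%N)
    while :; do
        if "$@" > /dev/null 2>&1; then
            end=$(date +%s.%N)
            echo "$end $start" | awk '{ printf "%.2f\n", $1 - $2 }'
            return 0
        fi
        now=$(date +%s.%N)
        # Give up once the elapsed time reaches the timeout.
        if awk -v n="$now" -v s="$start" -v t="$timeout_s" \
            'BEGIN { exit !((n - s) >= t) }'; then
            return 1
        fi
        sleep 0.1
    done
}
```

With the topology above, a run would look like
"ip netns exec n0 ip link set n0e1 down" followed by
"measure_recovery 60 ip netns exec n0 ping -c 1 -W 1 192.168.100.4".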
Step 10: Cleanup
ip netns delete n0
ip netns delete n1
ip netns delete n2
ip netns delete n3
TEST RESULTS
------------
Babel alone (5 runs):
Run 1: 4.82 seconds
Run 2: 5.31 seconds
Run 3: 5.18 seconds
Run 4: 5.42 seconds
Run 5: 5.17 seconds
Average: 5.18 seconds
All runs successful (5/5)
Babel with sroamd (5 runs):
Run 1: 10.24 seconds
Run 2: 11.58 seconds
Run 3: 12.03 seconds
Run 4: 10.89 seconds
Run 5: 10.61 seconds
Average: 11.07 seconds
All runs successful (5/5)
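
The averages can be rechecked from the per-run values. A small sketch
using awk; mean is a hypothetical helper name, not part of our test
scripts:

```shell
# Print the arithmetic mean of the arguments, rounded to two decimals.
mean() {
    [ "$#" -gt 0 ] || return 1
    printf '%s\n' "$@" |
        awk '{ sum += $1 } END { printf "%.2f\n", sum / NR }'
}
```

mean 4.82 5.31 5.18 5.42 5.17 prints 5.18, and
mean 10.24 11.58 12.03 10.89 10.61 prints 11.07.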
OBSERVATION
-----------
Adding sroamd roughly doubles the average recovery time, from 5.18s to
11.07s (an increase of about 5.9 seconds). This was unexpected: we assumed
sroamd would improve failover time, or at least not affect it.
QUESTION FOR THE LIST
---------------------
Is this expected behavior?
Our test simulates backbone link failure (a mesh link goes down and the
routing protocol must reconverge). Perhaps sroamd is designed for a
different use case - specifically WiFi client associations where a
station joins or leaves an access point, rather than mesh backbone link
changes?
If so, what would be the recommended way to test sroamd's intended
functionality? We would like to properly evaluate it for our mesh
mobility use case.
PATCHED FILE
------------
The full patched sroamd.c is available at:
https://gist.github.com/valentt/bfd77aa170e189edf9b22e3933a69def
The patch replaces select/pselect with poll throughout the main event
loop, and adds the server_socket guard. We are happy to submit properly
formatted git patches if that would be useful.
Best regards,
Valent Turkovic
MeshPoint
https://www.meshpointone.com/
------ Original Message ------
From "Juliusz Chroboczek" <jch at irif.fr>
To "Valent Turkovic" <valent at meshpointone.com>
Cc babel-users at alioth-lists.debian.net
Date 18.12.2025. 1:04:43
Subject Re: [Babel-users] Restarting MeshPoint – seeking advice on
routing for crisis/disaster scenarios
>Hello, Valent, good to hear from you again.
>
>> Between 2015 and 2018 I ran the MeshPoint project – a simple, rugged
>> Wi-Fi hotspot designed to work in the toughest conditions.
>
>I remember :-)
>
>> Unfortunately, financial issues forced me to pause the project after 2018
>
>In addition to the issues you mention, the big change since the early
>2000s is the wide availability of cheap cellular connectivity. Hence, the
>demand for mesh networks has changed quite a bit.
>
>> I know that in active conflict zones Wi-Fi can be jammed
>
>The nice thing about having a layer 3 routing protocol is that you can
>combine technologies: Babel is designed to handle a network that has both
>wired and wireless links, and that uses multiple wireless technologies at
>the same time (WiFi at various frequencies, UWB, infrared laser, etc.).
>In such a network, Babel should be able to find a path consisting of
>whichever links are not jammed at a given time.
>
>Of course, this assumes that the opponent is not able to jam all links
>simultaneously.
>
>> - BATMAN-adv-style seamless mobility
>
>I started working on sroamd[1], which implements seamless mobility at
>layer 3, but then Covid happened, and I got interested in
>videoconferencing. I guess we could revive it if there's interest.
>
>[1]: https://github.com/jech/sroamd
>
>> - Better large-scale behaviour for hundreds-to-thousands of nodes in
>> sparse or battery-constrained setups
>
>Could you please clarify?
>
>-- Juliusz
-------------- next part --------------
A non-text attachment was scrubbed...
Name: 162_-_sroamd_Bug_Report_for_Babel_Mailing_List.md
Type: application/octet-stream
Size: 9586 bytes
Desc: not available
URL: <http://alioth-lists.debian.net/pipermail/babel-users/attachments/20251229/bf1ae1ed/attachment.obj>