[Babel-users] Restarting MeshPoint – seeking advice on routing for crisis/disaster scenarios

Valent@MeshPoint valent at meshpointone.com
Mon Dec 29 00:26:30 GMT 2025


Hi Juliusz and the list,

Happy holidays! Following up on sroamd testing for mesh mobility.

While testing Babel + sroamd in a network namespace setup, we found two 
bugs and an unexpected performance result. Below are the details.


BUGS FOUND
----------

Bug 1: FD_SETSIZE crash

When running sroamd in network namespaces with 25+ nodes, it crashes 
with:

   *** bit out of range 0 - FD_SETSIZE on fd_set ***: terminated
   Aborted (core dumped)

The root cause is in sroamd.c lines 342-374. The main event loop uses 
fd_set with pselect(). An fd_set can only represent file descriptors 
numbered below FD_SETSIZE (typically 1024); glibc's fortify check 
aborts the process as soon as a larger descriptor is passed to 
FD_SET(), which is the error shown above. In our 25+ node namespace 
setup, the socket FDs sroamd ends up with can easily exceed that limit.

The fix is to replace pselect() with poll(). poll() takes a 
caller-supplied array of struct pollfd, which can be sized as needed, 
and places no limit on the numeric value of a file descriptor.
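
For illustration, here is a minimal sketch of what a poll()-based loop 
can look like (simplified: the function and variable names below are 
placeholders, not the exact code in our patch):

   /* Sketch of a poll()-based event loop.  Unlike fd_set, the array of
      struct pollfd can hold descriptors with arbitrarily large values. */
   #include <poll.h>
   #include <stdlib.h>

   int
   wait_for_events(int server_socket, const int *neigh_fds, int numneighs,
                   int timeout_ms)
   {
       struct pollfd *pollfds;
       int nfds = 0, rc, i;

       /* One slot per neighbour plus one for the server socket. */
       pollfds = calloc(numneighs + 1, sizeof(struct pollfd));
       if(pollfds == NULL)
           return -1;

       for(i = 0; i < numneighs; i++) {
           if(neigh_fds[i] >= 0) {
               pollfds[nfds].fd = neigh_fds[i];
               pollfds[nfds].events = POLLIN;
               nfds++;
           }
       }

       /* Only add the server socket if it has actually been opened
          (see Bug 2 below). */
       if(server_socket >= 0) {
           pollfds[nfds].fd = server_socket;
           pollfds[nfds].events = POLLIN;
           nfds++;
       }

       rc = poll(pollfds, nfds, timeout_ms);
       /* ... the caller dispatches on pollfds[i].revents here ... */

       free(pollfds);
       return rc;
   }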


Bug 2: server_socket not guarded

In sroamd.c line 401, the server_socket is added to the poll set without 
checking if it is valid:

   pollfds[nfds].fd = server_socket;

However, server_socket is initialized to -1 in flood.c line 21:

   int server_socket = -1;

Compare this with the neighbor socket handling, which correctly checks:

   if(neighs[i].fd >= 0) {
       pollfds[nfds].fd = neighs[i].fd;
       ...
   }

The fix is to add the same guard for server_socket:

   if(server_socket >= 0) {
       idx_server = nfds;
       pollfds[nfds].fd = server_socket;
       pollfds[nfds].events = POLLIN;
       pollfds[nfds].revents = 0;
       nfds++;
   }


TEST ENVIRONMENT
----------------

Hardware: Lenovo ThinkPad T14
Operating system: Fedora 42
Kernel version: 6.12.6-200.fc41.x86_64

Software versions used:
   babeld 1.13.1 (from Fedora repository)
   sroamd latest git commit (with our patches applied)

All tests run in Linux network namespaces using veth pairs for virtual 
links.


WHY THIS TEST
-------------

We chose a link-failure failover test because it represents a common 
mesh network scenario: a link goes down and the routing protocol must 
find an alternative path. This is relevant for:

- Wireless mesh networks where links are unreliable
- Crisis/disaster scenarios where nodes may fail
- Mobile mesh where topology changes frequently

We wanted to measure how quickly Babel (with and without sroamd) 
recovers connectivity when a primary path fails.


TEST TOPOLOGY
-------------

We used a diamond topology with 4 nodes. This provides exactly two paths 
between source and destination, making failover behavior deterministic 
and measurable.

        n1
       /  \
n0 ---+    +--- n3
       \  /
        n2

Node n0 is the source (client).
Node n3 is the destination (server).
Nodes n1 and n2 are intermediate routers.

IP addressing:

   n0-n1 link: 10.0.1.0/24 (n0 has .1, n1 has .2)
   n0-n2 link: 10.0.2.0/24 (n0 has .1, n2 has .2)
   n1-n3 link: 10.0.3.0/24 (n1 has .1, n3 has .2)
   n2-n3 link: 10.0.4.0/24 (n2 has .1, n3 has .2)

Each node also has a loopback address for stable identification:

   n0: 192.168.100.1/32
   n1: 192.168.100.2/32
   n2: 192.168.100.3/32
   n3: 192.168.100.4/32

The test fails the n0-n1 link and measures how long until n0 can reach 
n3 via the n0-n2-n3 path.


EXACT COMMANDS USED
-------------------

Step 1: Create network namespaces

   ip netns add n0
   ip netns add n1
   ip netns add n2
   ip netns add n3

Step 2: Create veth pairs for each link

   ip link add n0e1 type veth peer name n1e0
   ip link add n0e2 type veth peer name n2e0
   ip link add n1e3 type veth peer name n3e1
   ip link add n2e3 type veth peer name n3e2

Step 3: Move interfaces to namespaces

   ip link set n0e1 netns n0
   ip link set n0e2 netns n0
   ip link set n1e0 netns n1
   ip link set n1e3 netns n1
   ip link set n2e0 netns n2
   ip link set n2e3 netns n2
   ip link set n3e1 netns n3
   ip link set n3e2 netns n3

Step 4: Assign IP addresses (example for n0)

   ip netns exec n0 ip addr add 10.0.1.1/24 dev n0e1
   ip netns exec n0 ip addr add 10.0.2.1/24 dev n0e2
   ip netns exec n0 ip addr add 192.168.100.1/32 dev lo
   ip netns exec n0 ip link set n0e1 up
   ip netns exec n0 ip link set n0e2 up
   ip netns exec n0 ip link set lo up

   (similar commands for n1, n2, n3 with their respective addresses)
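
For example, following the addressing plan above, the corresponding 
commands for n1 are:

   ip netns exec n1 ip addr add 10.0.1.2/24 dev n1e0
   ip netns exec n1 ip addr add 10.0.3.1/24 dev n1e3
   ip netns exec n1 ip addr add 192.168.100.2/32 dev lo
   ip netns exec n1 ip link set n1e0 up
   ip netns exec n1 ip link set n1e3 up
   ip netns exec n1 ip link set lo up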

Step 5: Start babeld on each node

   ip netns exec n0 babeld -D -I /tmp/babel_n0.pid -S /tmp/babel_n0.state \
     -C 'redistribute local if n0' n0e1 n0e2

   ip netns exec n1 babeld -D -I /tmp/babel_n1.pid -S /tmp/babel_n1.state \
     -C 'redistribute local if n1' n1e0 n1e3

   (similar for n2, n3)

Step 6: Start sroamd on each node (when testing Babel+sroamd)

   ip netns exec n0 sroamd -d 3 n0e1 n0e2
   ip netns exec n1 sroamd -d 3 n1e0 n1e3
   ip netns exec n2 sroamd -d 3 n2e0 n2e3
   ip netns exec n3 sroamd -d 3 n3e1 n3e2

Step 7: Wait for convergence (30 seconds)

   sleep 30

Step 8: Verify baseline connectivity

   ip netns exec n0 ping -c 3 192.168.100.4

Step 9: Fail the primary link and measure recovery

   START_TIME=$(date +%s.%N)
   ip netns exec n0 ip link set n0e1 down

   while true; do
     if ip netns exec n0 ping -c 1 -W 1 192.168.100.4 > /dev/null 2>&1; then
       END_TIME=$(date +%s.%N)
       RECOVERY=$(echo "$END_TIME - $START_TIME" | bc)
       echo "Recovery time: ${RECOVERY}s"
       break
     fi
     sleep 0.1
   done

Step 10: Cleanup

   ip netns delete n0
   ip netns delete n1
   ip netns delete n2
   ip netns delete n3


TEST RESULTS
------------

Babel alone (5 runs):

   Run 1: 4.82 seconds
   Run 2: 5.31 seconds
   Run 3: 5.18 seconds
   Run 4: 5.42 seconds
   Run 5: 5.17 seconds

   Average: 5.18 seconds
   All runs successful (5/5)


Babel with sroamd (5 runs):

   Run 1: 10.24 seconds
   Run 2: 11.58 seconds
   Run 3: 12.03 seconds
   Run 4: 10.89 seconds
   Run 5: 10.61 seconds

   Average: 11.07 seconds
   All runs successful (5/5)


OBSERVATION
-----------

Adding sroamd increases recovery time by approximately 6 seconds (from 
5.18s to 11.07s on average). This was unexpected: we assumed sroamd 
would improve failover time, or at least not affect it.


QUESTION FOR THE LIST
---------------------

Is this expected behavior?

Our test simulates a backbone link failure: a mesh link goes down and 
the routing protocol must reconverge. Perhaps sroamd is designed for a 
different use case, namely WiFi client associations (a station joining 
or leaving an access point), rather than mesh backbone link changes?

If so, what would be the recommended way to test sroamd's intended 
functionality? We would like to properly evaluate it for our mesh 
mobility use case.


PATCHED FILE
------------

The full patched sroamd.c is available at:

   https://gist.github.com/valentt/bfd77aa170e189edf9b22e3933a69def

The patch replaces select/pselect with poll throughout the main event 
loop, and adds the server_socket guard. We are happy to submit properly 
formatted git patches if that would be useful.


Best regards,
Valent Turkovic
MeshPoint
https://www.meshpointone.com/

------ Original Message ------
>From "Juliusz Chroboczek" <jch at irif.fr>
To "Valent Turkovic" <valent at meshpointone.com>
Cc babel-users at alioth-lists.debian.net
Date 18.12.2025. 1:04:43
Subject Re: [Babel-users] Restarting MeshPoint – seeking advice on 
routing for crisis/disaster scenarios

>Hello, Valent, good to hear from you again.
>
>>  Between 2015 and 2018 I ran the MeshPoint project – a simple, rugged
>>  Wi-Fi hotspot designed to work in the toughest conditions.
>
>I remember :-)
>
>>  Unfortunately, financial issues forced me to pause the project after 2018
>
>In addition to the issues you mention, the big change since the early
>2000s is the wide availability of cheap cellular connectivity.  Hence, the
>demand for mesh networks has changed quite a bit.
>
>>  I know that in active conflict zones Wi-Fi can be jammed
>
>The nice thing about having a layer 3 routing protocol is that you can
>combine technologies: Babel is designed to handle a network that has both
>wired and wireless links, and that uses multiple wireless technologies at
>the same time (WiFi at various frequencies, UWB, infrared laser, etc.).
>In such a network, Babel should be able to find a path consisting of
>whichever links are not jammed at a given time.
>
>Of course, this assumes that the opponent is not able to jam all links
>simultaneously.
>
>>  - BATMAN-adv-style seamless mobility
>
>I started working on sroamd[1], which implements seamless mobility at
>layer 3, but then Covid happened, and I got interested in
>videoconferencing.  I guess we could revive it if there's interest.
>
>[1]: https://github.com/jech/sroamd
>
>>  - Better large-scale behaviour for hundreds-to-thousands of nodes in
>>    sparse or battery-constrained setups
>
>Could you please clarify?
>
>-- Juliusz

