Bug#934584: IPMasquerade=yes uses iptables (not nftables)
Trent W. Buck
trentbuck at gmail.com
Mon Aug 12 09:52:27 BST 2019
Package: systemd
Version: 241-5
Severity: normal
File: /lib/systemd/network/80-container-ve.network
Debian 10 defaults to nftables:
https://www.debian.org/releases/stable/amd64/release-notes/ch-whats-new.en.html#nftables
...but systemd doesn't for IPMasquerade=, see below.
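For reference, the backend that the iptables command itself is wired to can be read from "iptables -V": iptables 1.8.x prints it in parentheses. A minimal sketch follows. The function name iptables_backend is made up for illustration, and the sample version string is an assumption modelled on the v1.8.3 that appears in the dump further down.

```shell
# Hypothetical helper: classify an "iptables -V" line by backend.
# iptables 1.8.x prints e.g. "iptables v1.8.3 (nf_tables)"
# or "iptables v1.8.3 (legacy)".
iptables_backend() {
    case "$1" in
        *"(nf_tables)"*) echo nft ;;
        *"(legacy)"*)    echo legacy ;;
        *)               echo unknown ;;
    esac
}

# On a real host: iptables_backend "$(iptables -V)"
# Here, with an assumed sample string:
iptables_backend "iptables v1.8.3 (nf_tables)"
```

Note that this only says what the iptables wrapper does; it says nothing about programs that bypass the wrapper and talk to the kernel themselves.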
AFAICT the default behaviour of "machinectl start my-new-container" is
to create a veth interface between the host and the container, with a
private /28 IPv4 address range shared between them, masquerading
(i.e. source NAT), and no IPv6 RA(?!). This is governed by
/lib/systemd/network/80-container-ve.network
AFAICT this is set up using *legacy* iptables, not using nftables.
Here is a system with sshguard installed (uses native nftables), and two systemd containers running:
bash5# iptables-save
# Table `sshguard' is incompatible, use 'nft' tool.
# Warning: iptables-legacy tables present, use iptables-legacy-save to see them
bash5# iptables-legacy-save
# Generated by iptables-save v1.8.3 on Mon Aug 12 18:01:37 2019
*nat
:PREROUTING ACCEPT [7781:686652]
:INPUT ACCEPT [7749:685024]
:OUTPUT ACCEPT [2108:206384]
:POSTROUTING ACCEPT [2108:206384]
-A POSTROUTING -s 10.0.0.16/28 -j MASQUERADE
-A POSTROUTING -s 10.0.0.0/28 -j MASQUERADE
COMMIT
# Completed on Mon Aug 12 18:01:37 2019
bash5# nft list ruleset
table ip sshguard {
    set attackers {
        type ipv4_addr
        flags interval
    }
    chain blacklist {
        type filter hook input priority filter - 10; policy accept;
        ip saddr @attackers drop
    }
}
table ip6 sshguard {
    set attackers {
        type ipv6_addr
        flags interval
    }
    chain blacklist {
        type filter hook input priority filter - 10; policy accept;
        ip6 saddr @attackers drop
    }
}
I *think* the nftables people said mixing legacy (xtables) and
nftables on the same system is Unsupported™ and Bad Things™ will happen,
but I can't find a citation right now.
I can do "busybox ping example.com" within the container, which
implies systemd's MASQUERADE rules *are* working.
This test nftables ruleset also appears to be working
(while systemd's legacy rules are present):
bash5# nft 'table a { chain b { type nat hook postrouting priority srcnat; }; chain c { type nat hook prerouting priority dstnat; tcp dport 80 counter log; }; }'
bash5# nft list ruleset
[...]
table ip a {
    chain b {
        type nat hook postrouting priority srcnat; policy accept;
    }
    chain c {
        type nat hook prerouting priority dstnat; policy accept;
        tcp dport 80 counter packets 1 bytes 60 log
    }
}
...so, I may be freaking out about nothing.
At a minimum, the legacy rules created by systemd are "invisible" to
an admin looking directly at "nft list ruleset", which is the only
place they will look if they expect the system to be nftables native.
That violates the principle of least surprise.
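In the meantime, an admin on an nftables-native system could at least check for leftover legacy rules. A minimal sketch, assuming iptables-legacy-save is installed as in the transcript above; the function name count_rules is made up for illustration, and the sample input is a cut-down copy of the dump shown earlier.

```shell
# Hypothetical helper: count rule lines ("-A ...") in iptables-save
# style output read from stdin. grep -c exits non-zero when the count
# is 0, so "|| true" keeps the helper usable under "set -e".
count_rules() {
    grep -c -e '^-A ' || true
}

# On a real host (as root): iptables-legacy-save | count_rules
# Here, fed with the two MASQUERADE rules from the dump above:
count_rules <<'EOF'
*nat
:POSTROUTING ACCEPT [2108:206384]
-A POSTROUTING -s 10.0.0.16/28 -j MASQUERADE
-A POSTROUTING -s 10.0.0.0/28 -j MASQUERADE
COMMIT
EOF
```

Any non-zero count means there are rules that "nft list ruleset" will not show.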
Is it possible to make systemd use nftables instead of iptables when the system is so configured?
I think this would have Just Happened automatically if systemd were actually running iptables(8) or iptables-restore(8),
but I think it is instead talking directly to the kernel in src/shared/firewall-util.c:fw_add_masquerade()?
PS: I don't know how to do it directly in C, but from nft(8), your MASQUERADE rules would look like this (comments optional):
#!/usr/sbin/nft --file
table ip systemd-container-blah-blah {
    chain postrouting {
        type nat hook postrouting priority srcnat
        policy accept
        ip saddr 10.0.0.0/28 masquerade comment "for systemd-nspawn@my-new-container"
        ip saddr 10.0.0.16/28 masquerade comment "for systemd-nspawn@my-other-container"
    }
    chain prerouting {
        type nat hook prerouting priority dstnat
        policy accept
        continue comment "Apparently the postrouting chain won't work unless this prerouting chain also exists"
    }
}
The "modern" way to do it would be to put all the IP ranges into a
named set, so that you just add/remove from the set, not from the
rules. This is analogous to "iptables -m set --help" and ipset(8),
which you could already have used for efficiency when you have
hundreds of containers:
#!/usr/sbin/nft --file
table ip systemd-container-blah-blah {
    set systemd-container-masquerade-ranges { type ipv4_addr; flags interval; }
    chain postrouting {
        type nat hook postrouting priority srcnat
        policy accept
        ip saddr @systemd-container-masquerade-ranges masquerade comment "for systemd-nspawn@"
    }
    chain prerouting {
        type nat hook prerouting priority dstnat
        policy accept
        continue comment "Apparently the postrouting chain won't work unless this prerouting chain also exists"
    }
}
Then when a container comes up, do
nft 'add element ip systemd-container-blah-blah systemd-container-masquerade-ranges { 10.0.0.0/24 }'
In fact, this is exactly what sshguard is doing for its filter blacklist.
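The matching teardown is a "delete element" on the same set. A minimal sketch of both directions follows. The table and set names are the placeholders from the ruleset above (not anything systemd actually creates), the function name masq_cmd is made up for illustration, and the commands are only printed rather than executed so the sketch can be tried without root.

```shell
# Placeholder names copied from the example ruleset above.
TABLE=systemd-container-blah-blah
SET=systemd-container-masquerade-ranges

# Hypothetical helper: print the nft command for one container range.
# $1 is "add" or "delete", $2 is a CIDR range such as 10.0.0.16/28.
masq_cmd() {
    printf "nft '%s element ip %s %s { %s }'\n" "$1" "$TABLE" "$SET" "$2"
}

masq_cmd add 10.0.0.16/28      # when the container comes up
masq_cmd delete 10.0.0.16/28   # when the container goes away
```

The rules themselves never change; only the set membership does.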