Fast routing and NAT on Linux

With IPv4 addresses exhausted, many telecom operators face the need to provide their clients with access to the network using address translation. In this article I will tell you how to get Carrier Grade NAT level performance on commodity servers.

A bit of history


The topic of IPv4 address space exhaustion is no longer new. At some point waiting lists appeared at RIPE, then marketplaces emerged where address blocks were traded and leasing deals were concluded. Gradually, telecom operators began to provide Internet access through address and port translation. Some did not manage to get enough addresses to give a “white” address to every subscriber, while others started saving money by refusing to buy addresses on the secondary market. Network equipment vendors supported this idea, since this functionality usually requires additional expansion modules or licenses. For example, in Juniper's MX router lineup (except for the recent MX104 and MX204), NAPT is performed on a separate MS-MIC service card; Cisco ASR1k requires a CGN license; Cisco ASR9k needs a separate A9K-ISM-100 module plus an A9K-CGN-LIC license for it. In general, this pleasure costs a lot of money.

Iptables


The task of performing NAT does not require specialized computing resources; it can be handled by the general-purpose processors found, for example, in any home router. At carrier scale, the task can be solved using commodity servers running FreeBSD (ipfw/pf) or GNU/Linux (iptables). We will not consider FreeBSD, since I abandoned that OS a long time ago, so let's focus on GNU/Linux.

Enabling address translation is not difficult at all. First you need to add a rule to the nat table in iptables:

iptables -t nat -A POSTROUTING -s 100.64.0.0/10 -j SNAT --to <pool_start_addr>-<pool_end_addr> --persistent

The operating system will load the nf_conntrack module, which will track all active connections and perform the necessary translations. There are a few subtleties here. First, since we are talking about NAT at carrier scale, the timeouts need to be tightened, because with the default values the size of the translation table will quickly grow to catastrophic values. Below is an example of the settings I used on my servers:

net.ipv4.ip_forward = 1
net.ipv4.ip_local_port_range = 8192 65535

net.netfilter.nf_conntrack_generic_timeout = 300
net.netfilter.nf_conntrack_tcp_timeout_syn_sent = 60
net.netfilter.nf_conntrack_tcp_timeout_syn_recv = 60
net.netfilter.nf_conntrack_tcp_timeout_established = 600
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 45
net.netfilter.nf_conntrack_tcp_timeout_last_ack = 30
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close = 10
net.netfilter.nf_conntrack_tcp_timeout_max_retrans = 300
net.netfilter.nf_conntrack_tcp_timeout_unacknowledged = 300
net.netfilter.nf_conntrack_udp_timeout = 30
net.netfilter.nf_conntrack_udp_timeout_stream = 60
net.netfilter.nf_conntrack_icmpv6_timeout = 30
net.netfilter.nf_conntrack_icmp_timeout = 30
net.netfilter.nf_conntrack_events_retry_timeout = 15
net.netfilter.nf_conntrack_checksum = 0
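
If it helps, here is one way to apply these settings without a reboot, assuming they were saved to a file under /etc/sysctl.d/ (the file name below is just an example). Keep in mind that the net.netfilter.* keys only appear after the nf_conntrack module has been loaded:

# Reload all sysctl configuration files, including /etc/sysctl.d/90-cgnat.conf
sysctl --system

# Spot-check a single value after applying
sysctl net.netfilter.nf_conntrack_tcp_timeout_established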

Second, since the default size of the translation table is not designed for the conditions of a telecom operator, it needs to be increased:

net.netfilter.nf_conntrack_max = 3145728
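
To keep an eye on how full the table actually gets, the current number of entries can be compared with this limit; a couple of ways to do this (conntrack is part of the conntrack-tools package):

# Current number of tracked connections and the configured ceiling
cat /proc/sys/net/netfilter/nf_conntrack_count
cat /proc/sys/net/netfilter/nf_conntrack_max

# The same counter via conntrack-tools, if installed
conntrack -C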

It is also necessary to increase the number of buckets in the hash table that stores all the translations (this is an option of the nf_conntrack module):

options nf_conntrack hashsize=1572864
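
This line is meant for a file under /etc/modprobe.d/ (for example /etc/modprobe.d/nf_conntrack.conf, the name is arbitrary), so that it is picked up when the module is loaded; the value here is nf_conntrack_max divided by two. On a running system the bucket count can also be checked or adjusted through sysfs:

# Current number of hash buckets
cat /sys/module/nf_conntrack/parameters/hashsize

# The parameter is writable, so it can be changed without reloading the module
echo 1572864 > /sys/module/nf_conntrack/parameters/hashsize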

After these simple manipulations you get a perfectly working setup that can translate a large number of client addresses into an external pool. However, the performance of this solution leaves much to be desired. In my first attempts to use GNU/Linux for NAT (around 2013), I was able to get about 7 Gbit/s at 0.8 Mpps per server (Xeon E5-1650v2). Since then, many optimizations have been made to the GNU/Linux kernel network stack, and the performance of a single server on the same hardware has grown to almost 18-19 Gbit/s at 1.8-1.9 Mpps (these were the limit values), but the demand for traffic volume processed by a single server grew much faster. As a result, schemes for balancing the load across different servers were developed, but all of this increased the complexity of setting up and maintaining the service and of preserving the quality of the services provided.

Nftables


Nowadays, using DPDK and XDP for software packet forwarding is a fashionable trend. Many articles have been written and many presentations given on the subject, and commercial products are appearing (for example, SKAT from VasExperts). But given the limited programmer resources of telecom operators, building some kind of in-house solution on top of these frameworks is rather problematic. Operating such a solution later will be much harder; in particular, diagnostic tools will have to be developed. For example, a regular tcpdump simply does not work with DPDK, and it will not “see” packets sent back to the wire using XDP. Amid all the talk about new technologies for moving packet forwarding into user space, the reports and articles by Pablo Neira Ayuso, the iptables maintainer, on the development of flow offloading in nftables went largely unnoticed. Let's take a closer look at this mechanism.

The main idea is this: if the router has passed packets of a session in both directions of the flow (the TCP session has entered the ESTABLISHED state), then there is no need to push subsequent packets of that session through all the firewall rules, because all those checks will still end with the packet being handed over to routing. In fact, route selection does not need to be performed either: we already know to which interface and to which host the packets of this session must be forwarded. All that remains is to store this information and use it at an early stage of packet processing. When performing NAT, it is additionally necessary to store the address and port translations made by the nf_conntrack module. Yes, of course, in this case various policers and other informational and statistical rules in iptables stop working, but for a standalone NAT box or, say, a border router, this is not so important, because services are distributed across devices.

Configuration


To use this function we need:

  • Use a fresh kernel. Although the functionality itself appeared in kernel 4.16, for quite a while it was very "raw" and regularly caused kernel panics. Everything stabilized around December 2019, when the LTS kernels 4.19.90 and 5.4.5 were released.
  • Rewrite the iptables rules in nftables format using a fairly recent version of nftables. It works fine with version 0.9.0.

While the first point is basically clear (the main thing is not to forget to enable the module when building the kernel: CONFIG_NFT_FLOW_OFFLOAD=m), the second one requires some explanation. nftables rules are described quite differently than iptables rules. The documentation covers almost all the details, and there are also special converters from iptables rules to nftables. Therefore, I will only give an example of configuring NAT and flow offload. A small legend for the example: <i_if>, <o_if> are the network interfaces that traffic passes through; in reality there can be more than two of them. <pool_addr_start>, <pool_addr_end> are the start and end addresses of the range of "white" addresses.
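
Before rewriting the rules, it is worth checking that the running kernel and the nft binary actually meet these requirements. A quick check might look like this (the config path assumes a distribution kernel that ships its config in /boot):

# Kernel and nftables versions
uname -r
nft --version

# Flow offload support must be compiled in (=y or =m)
grep NFT_FLOW_OFFLOAD /boot/config-$(uname -r)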

NAT configuration is very simple:

#! /usr/sbin/nft -f

table nat {
        chain postrouting {
                type nat hook postrouting priority 100;
                oif <o_if> snat to <pool_addr_start>-<pool_addr_end> persistent
        }
}
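
Thanks to the shebang, such a file can be executed directly or loaded with nft, and the result can be verified with nft itself; for example (the path is just an illustration):

# Load the ruleset and check that the table was created
nft -f /etc/nftables/nat.nft
nft list table nat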

Flow offload is a bit more complicated, but understandable:

#! /usr/sbin/nft -f

table inet filter {
        flowtable fastnat {
                hook ingress priority 0
                devices = { <i_if>, <o_if> }
        }

        chain forward {
                type filter hook forward priority 0; policy accept;
                ip protocol { tcp , udp } flow offload @fastnat;
        }
}

That is, in fact, the whole setup. Now all TCP/UDP traffic will be picked up by the fastnat flowtable and processed much faster.
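
To make sure the offload actually kicks in, one can look both at the flowtable and at the conntrack entries: connections handled by the flowtable are marked with the [OFFLOAD] flag (conntrack is part of conntrack-tools):

# The flowtable and the rule that references it
nft list flowtables
nft list chain inet filter forward

# Offloaded connections carry the [OFFLOAD] flag
conntrack -L | grep OFFLOAD | head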

Results


To make it clear how much "much faster" is, I will attach a screenshot of the load on two real servers with the same hardware (Xeon E5-1650v2), configured identically and running the same Linux kernel, but performing NAT in iptables (NAT4) and in nftables (NAT5).



There is no packets-per-second graph in the screenshot, but in the load profile of these servers the average packet size is around 800 bytes, so the values reach up to 1.5 Mpps. As you can see, the performance headroom of the server with nftables is huge. Currently, this server handles up to 30 Gbit/s at 3 Mpps and is clearly capable of hitting the physical limit of the 40 Gbit/s network while still having spare CPU resources.

I hope this material will be useful to network engineers trying to improve the performance of their servers.
