How to receive a million packets per second
Last week during a casual conversation I overheard a colleague saying: “The Linux network stack is slow! You can’t expect it to do more than 50 thousand packets per second per core!”
That got me thinking. While I agree that 50kpps per core is probably the limit for any practical application, what is the Linux networking stack capable of? Let’s rephrase that to make it more fun:
On Linux, how hard is it to write a program that receives 1 million UDP packets per second?
Hopefully, answering this question will be a good lesson about the design of a modern networking stack.
First, let us assume:
- Measuring packets per second (pps) is much more interesting than measuring bytes per second (Bps). You can achieve high Bps by better pipelining and sending longer packets. Improving pps is much harder.
- Since we’re interested in pps, our experiments will use short UDP messages. To be precise: 32 bytes of UDP payload. That means 74 bytes on the Ethernet layer (32 bytes of payload + 8 bytes of UDP header + 20 bytes of IP header + 14 bytes of Ethernet header).
- For the experiments we will use two physical servers: “receiver” and “sender”.
- They both have two six-core 2GHz Xeon processors. With hyperthreading (HT) enabled that comes to 24 logical processors on each box. The boxes have a multi-queue 10G network card by Solarflare, with 11 receive queues configured. More on that later.
- The source code of the test programs is available here: udpsender, udpreceiver.
Prerequisites
Let’s use port 4321 for our UDP packets. Before we start, we must make sure the traffic won’t be interfered with by the iptables rules:
    receiver$ iptables -I INPUT 1 -p udp --dport 4321 -j ACCEPT
    receiver$ iptables -t raw -I PREROUTING 1 -p udp --dport 4321 -j NOTRACK
A couple of explicitly defined IP addresses will come in handy later:
    receiver$ for i in `seq 1 20`; do ip addr add 192.168.254.$i/24 dev eth2; done
    sender$ ip addr add 192.168.254.30/24 dev eth3
1. The naive approach
To start let’s do the simplest experiment. How many packets will be delivered for a naive send and receive?
The sender pseudo code:
    fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    fd.bind(("0.0.0.0", 65400))  # select source port to reduce nondeterminism
    fd.connect(("192.168.254.1", 4321))
    while True:
        fd.sendmmsg(["\x00" * 32] * 1024)
While we could have used the usual send syscall, it wouldn’t be efficient. Context switches to the kernel have a cost and it is better to avoid them. Fortunately, a handy syscall was recently added to Linux: sendmmsg. It allows us to send many packets in one go. Let’s send 1,024 packets at once.
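The batching is easiest to see in C, which is the layer sendmmsg actually lives at. Below is a minimal sketch of such a sender loop; the addresses, ports and batch size mirror the pseudo code above, but the structure is illustrative and not necessarily how udpsender is written:

    /* Minimal sendmmsg sketch: hand up to 1024 32-byte UDP messages to the
     * kernel per syscall. Illustrative only, not the actual udpsender. */
    #define _GNU_SOURCE
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH   1024
    #define PAYLOAD 32

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        /* Bind the source port explicitly to reduce nondeterminism. */
        struct sockaddr_in src = {0};
        src.sin_family = AF_INET;
        src.sin_port = htons(65400);
        src.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(fd, (struct sockaddr *)&src, sizeof(src));

        struct sockaddr_in dst = {0};
        dst.sin_family = AF_INET;
        dst.sin_port = htons(4321);
        inet_pton(AF_INET, "192.168.254.1", &dst.sin_addr);
        connect(fd, (struct sockaddr *)&dst, sizeof(dst));

        char payload[PAYLOAD] = {0};
        struct iovec iov[BATCH];
        struct mmsghdr msgs[BATCH];
        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iov[i].iov_base = payload;
            iov[i].iov_len  = sizeof(payload);
            msgs[i].msg_hdr.msg_iov    = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        for (;;) {
            /* One context switch pushes up to BATCH packets into the kernel. */
            sendmmsg(fd, msgs, BATCH, 0);
        }
    }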
The receiver pseudo code:
    fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    fd.bind(("0.0.0.0", 4321))
    while True:
        packets = [None] * 1024
        fd.recvmmsg(packets, MSG_WAITFORONE)
Similarly, recvmmsg is a more efficient version of the common recv syscall.
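The receive side is symmetric. A rough C sketch of the recvmmsg loop might look like the following; again this only illustrates the call, while the real udpreceiver additionally keeps the pps and throughput counters printed below:

    /* Minimal recvmmsg sketch: drain up to 1024 datagrams per syscall.
     * Illustrative only, not the actual udpreceiver. */
    #define _GNU_SOURCE
    #include <netinet/in.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/uio.h>

    #define BATCH   1024
    #define PAYLOAD 32

    int main(void) {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);

        struct sockaddr_in addr = {0};
        addr.sin_family = AF_INET;
        addr.sin_port = htons(4321);
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        char bufs[BATCH][PAYLOAD];
        struct iovec iov[BATCH];
        struct mmsghdr msgs[BATCH];
        memset(msgs, 0, sizeof(msgs));
        for (int i = 0; i < BATCH; i++) {
            iov[i].iov_base = bufs[i];
            iov[i].iov_len  = PAYLOAD;
            msgs[i].msg_hdr.msg_iov    = &iov[i];
            msgs[i].msg_hdr.msg_iovlen = 1;
        }

        for (;;) {
            /* MSG_WAITFORONE: block until at least one packet arrives, then
             * return with however many are already queued (up to BATCH). */
            int n = recvmmsg(fd, msgs, BATCH, MSG_WAITFORONE, NULL);
            (void)n; /* a real receiver would update its counters here */
        }
    }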
Let’s try it out:
    sender$ ./udpsender 192.168.254.1:4321
    receiver$ ./udpreceiver1 0.0.0.0:4321
      0.352M pps  10.730MiB /  90.010Mb
      0.284M pps   8.655MiB /  72.603Mb
      0.262M pps   7.991MiB /  67.033Mb
      0.199M pps   6.081MiB /  51.013Mb
      0.195M pps   5.956MiB /  49.966Mb
      0.199M pps   6.060MiB /  50.836Mb
      0.200M pps   6.097MiB /  51.147Mb
      0.197M pps   6.021MiB /  50.509Mb
With the naive approach we can do between 197k and 350k pps. Not too bad. Unfortunately there is quite a bit of variability. It is caused by the kernel shuffling our programs between cores. Pinning the processes to CPUs will help:
    sender$ taskset -c 1 ./udpsender 192.168.254.1:4321
    receiver$ taskset -c 1 ./udpreceiver1 0.0.0.0:4321
      0.362M pps  11.058MiB /  92.760Mb
      0.374M pps  11.411MiB /  95.723Mb
      0.369M pps  11.252MiB /  94.389Mb
      0.370M pps  11.289MiB /  94.696Mb
      0.365M pps  11.152MiB /  93.552Mb
      0.360M pps  10.971MiB /  92.033Mb
Now, the kernel scheduler keeps the processes on the defined CPUs. This improves processor cache locality and makes the numbers more consistent, just what we wanted.
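taskset sets the CPU affinity from outside the program; the same effect can be achieved from inside it with sched_setaffinity. A small sketch of that approach (the test programs themselves are simply pinned externally with taskset):

    /* Pin the calling process to a given CPU, roughly what `taskset -c 1` does.
     * Sketch only; not part of udpsender/udpreceiver. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int pin_to_cpu(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        /* pid 0 means "the calling process" */
        if (sched_setaffinity(0, sizeof(set), &set) != 0) {
            perror("sched_setaffinity");
            return -1;
        }
        return 0;
    }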