Articles taggués ‘receive million packets’

How to receive a million packets per second

19/04/2016 Comments off

receive million packetsLast week during a casual conversation I overheard a colleague saying: « The Linux network stack is slow! You can’t expect it to do more than 50 thousand packets per second per core! »

That got me thinking. While I agree that 50kpps per core is probably the limit for any practical application, what is the Linux networking stack capable of? Let’s rephrase that to make it more fun:

On Linux, how hard is it to write a program that receives 1 million UDP packets per second?

Hopefully, answering this question will be a good lesson about the design of a modern networking stack.

First, let us assume:

  • Measuring packets per second (pps) is much more interesting than measuring bytes per second (Bps). You can achieve high Bps by better pipelining and sending longer packets. Improving pps is much harder.
  • Since we’re interested in pps, our experiments will use short UDP messages. To be precise: 32 bytes of UDP payload. That means 74 bytes on the Ethernet layer.
  • For the experiments we will use two physical servers: « receiver » and « sender ».
  • They both have two six core 2GHz Xeon processors. With hyperthreading (HT) enabled that counts to 24 processors on each box. The boxes have a multi-queue 10G network card by Solarflare, with 11 receive queues configured. More on that later.
  • The source code of the test programs is available here: udpsender, udpreceiver.


Let’s use port 4321 for our UDP packets. Before we start we must ensure the traffic won’t be interfered with by the iptables:

receiver$ iptables -I INPUT 1 -p udp --dport 4321 -j ACCEPT  
receiver$ iptables -t raw -I PREROUTING 1 -p udp --dport 4321 -j NOTRACK  

A couple of explicitly defined IP addresses will later become handy:

receiver$ for i in `seq 1 20`; do   
              ip addr add 192.168.254.$i/24 dev eth2; 
sender$ ip addr add dev eth3  

1. The naive approach

To start let’s do the simplest experiment. How many packets will be delivered for a naive send and receive?

The sender pseudo code:

fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  
fd.bind(("", 65400)) # select source port to reduce nondeterminism  
fd.connect(("", 4321))  
while True:  
    fd.sendmmsg(["x00" * 32] * 1024)

While we could have used the usual send syscall, it wouldn’t be efficient. Context switches to the kernel have a cost and it is be better to avoid it. Fortunately a handy syscall was recently added to Linux: sendmmsg. It allows us to send many packets in one go. Let’s do 1,024 packets at once.

The receiver pseudo code:

fd = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)  
fd.bind(("", 4321))  
while True:  
    packets = [None] * 1024
    fd.recvmmsg(packets, MSG_WAITFORONE)

Similarly, recvmmsg is a more efficient version of the common recv syscall.

Let’s try it out:

sender$ ./udpsender  
receiver$ ./udpreceiver1  
  0.352M pps  10.730MiB /  90.010Mb
  0.284M pps   8.655MiB /  72.603Mb
  0.262M pps   7.991MiB /  67.033Mb
  0.199M pps   6.081MiB /  51.013Mb
  0.195M pps   5.956MiB /  49.966Mb
  0.199M pps   6.060MiB /  50.836Mb
  0.200M pps   6.097MiB /  51.147Mb
  0.197M pps   6.021MiB /  50.509Mb

With the naive approach we can do between 197k and 350k pps. Not too bad. Unfortunately there is quite a bit of variability. It is caused by the kernel shuffling our programs between cores. Pinning the processes to CPUs will help:

sender$ taskset -c 1 ./udpsender  
receiver$ taskset -c 1 ./udpreceiver1  
  0.362M pps  11.058MiB /  92.760Mb
  0.374M pps  11.411MiB /  95.723Mb
  0.369M pps  11.252MiB /  94.389Mb
  0.370M pps  11.289MiB /  94.696Mb
  0.365M pps  11.152MiB /  93.552Mb
  0.360M pps  10.971MiB /  92.033Mb

Now, the kernel scheduler keeps the processes on the defined CPUs. This improves processor cache locality and makes the numbers more consistent, just what we wanted.

Lire la suite…