What kind of load do network mechanisms place on servers?

When analyzing a server's network subsystem, attention is usually paid to metrics such as latency, throughput, and the number of packets that can be processed per second (PPS, Packets Per Second). These metrics help us understand the maximum load under which the machine under study can operate. And although they are important and can often say a lot about a system, they do not show what impact packet processing has on the programs running on that server. This article looks at the load that network mechanisms create on servers. In particular, we will talk about how much processor time packet processing can “steal” from the various processes running on Linux systems.





Network packet processing on Linux


Linux processes a significant number of packets in the context of whatever process happens to be running on the CPU at the moment the corresponding IRQ is handled. The system accounting engine charges the CPU cycles spent on this to the process that is currently executing, even if that process has nothing to do with packet processing. For example, the top command may show a process apparently using more than 99% of the CPU, while in fact 60% of that CPU time is being spent processing packets. This means that the process itself, doing its own work, is actually using only 40% of the CPU.

The incoming-packet handler net_rx_action usually runs very quickly, for example in less than 25 μs. (This figure comes from measurements made with eBPF; if you are interested in the details, look at net_rx_action.) The handler can process up to 64 packets per NAPI instance (NIC or RPS) before deferring the work to another SoftIRQ cycle. Up to 10 SoftIRQ cycles can follow one after another without a break, which takes about 2 ms (you can find out more about this by reading about __do_softirq). If, after the maximum number of cycles or the time limit has been exhausted, the SoftIRQ vector still has work pending, that work is deferred to the ksoftirqd thread of that specific CPU. When this happens, the system becomes somewhat more transparent in terms of seeing the CPU load created by network operations (although such an analysis assumes that the SoftIRQs being studied are the ones related to packet processing, and not something else).
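
For reference, here is a minimal bpftrace sketch for measuring how long net_rx_action runs, assuming bpftrace is installed and the net_rx_action symbol is visible to kprobes on your kernel; this is an illustration, not the exact tooling used for the figures above:

sudo bpftrace -e '
kprobe:net_rx_action { @start[tid] = nsecs; }
kretprobe:net_rx_action /@start[tid]/ {
    // histogram of net_rx_action duration, in microseconds
    @usecs = hist((nsecs - @start[tid]) / 1000);
    delete(@start[tid]);
}'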

One way to obtain this kind of data is to use perf:

sudo perf record -a \
        -e irq:irq_handler_entry,irq:irq_handler_exit \
        -e irq:softirq_entry --filter="vec == 3" \
        -e irq:softirq_exit --filter="vec == 3"  \
        -e napi:napi_poll \
        -- sleep 1

sudo perf script

Here is the result:

swapper     0 [005] 176146.491879: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0
swapper     0 [005] 176146.491880:  irq:irq_handler_exit: irq=152 ret=handled
swapper     0 [005] 176146.491880:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491942:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 64 budget 64
swapper     0 [005] 176146.491943:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491943:     irq:softirq_entry: vec=3 [action=NET_RX]
swapper     0 [005] 176146.491971:        napi:napi_poll: napi poll on napi struct 0xffff9d3d53863e88 for device eth0 work 27 budget 64
swapper     0 [005] 176146.491971:      irq:softirq_exit: vec=3 [action=NET_RX]
swapper     0 [005] 176146.492200: irq:irq_handler_entry: irq=152 name=mlx5_comp2@pci:0000:d8:00.0

In this case the CPU is idle (hence the swapper entries for the process), the IRQ for the Rx queue fires on CPU 5, and SoftIRQ processing runs twice, handling 64 packets the first time and 27 the second. The next IRQ arrives after 229 μs and the cycle starts again.

This data was obtained on an idle system. But any task could be running on that CPU. In that case the sequence of events above still occurs, interrupting that task to do the IRQ/SoftIRQ work, while system accounting charges the resulting CPU load to the interrupted process. As a result, packet-processing work is usually hidden from conventional CPU-load monitoring tools: it runs in the context of some arbitrarily chosen process, the “victim process”. This leads us to some questions. How do we estimate how long a process is interrupted to process packets? How do we compare two different network solutions to understand which of them has less impact on the other work the machine is doing?

When the RSS, RPS, or RFS mechanisms are in use, packet processing is usually distributed across the CPU cores, so the packet-processing sequence described above applies to each individual CPU. As the packet arrival rate grows (say, to 100,000 packets per second and above), each CPU has to process thousands or tens of thousands of packets per second. Processing that many packets will inevitably affect the other tasks running on the server.
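
A quick way to see which CPUs are actually handling NET_RX SoftIRQs is to watch the per-CPU counters in /proc/softirqs (a simple sketch, not part of the original measurements):

# highlight per-CPU NET_RX counter changes once a second
watch -d -n 1 'grep -E "CPU|NET_RX" /proc/softirqs'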

Consider one way to evaluate this effect.

Disabling Distributed Packet Processing


To begin, let's stop distributed packet processing by disabling RPS and adding flow steering rules so that all packets destined for a specific MAC address are processed on a single, known CPU. My system has two NICs aggregated in an 802.3ad configuration, and the network load is handled by a single virtual machine running on the host.

RPS on network adapters is disabled as follows:

for d in eth0 eth1; do
    find /sys/class/net/${d}/queues -name rps_cpus |
    while read f; do
            echo 0 | sudo tee ${f}
    done
done
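
To verify that RPS is indeed off, the rps_cpus masks can be read back afterwards; they should all be zero (a quick sanity check, not from the original write-up):

# print each rps_cpus file along with its current mask
grep . /sys/class/net/eth0/queues/rx-*/rps_cpus
grep . /sys/class/net/eth1/queues/rx-*/rps_cpus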

Next, we add flow steering rules so that packets destined for the test virtual machine land on a single CPU:

DMAC=12:34:de:ad:ca:fe
sudo ethtool -N eth0 flow-type ether dst ${DMAC} action 2
sudo ethtool -N eth1 flow-type ether dst ${DMAC} action 2

Disabling RPS and adding the flow steering rules ensures that all packets destined for our virtual machine are processed on the same CPU. To make sure packets actually land in the intended queue, you can use a tool such as ethq. Then you can find out which CPU that queue belongs to by looking at /proc/interrupts. In my case, queue 2 is handled by CPU 5.
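
For example, the queue-to-CPU mapping can be checked roughly like this (the interrupt name mlx5_comp2 matches the completion queue seen in the perf output above; adjust it for your NIC and queue):

# find the IRQ number for Rx queue 2 and see which CPU services it
grep mlx5_comp2 /proc/interrupts
IRQ=152   # taken from the output above; illustrative
cat /proc/irq/${IRQ}/smp_affinity_list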

Openssl speed command


I could use perf or bpf tools to analyze the runtime of the SoftIRQs responsible for processing incoming traffic, but that approach is rather involved, and the act of observing would itself affect the results. A much simpler and more understandable approach is to measure the load that network operations put on the system by running a task that creates a known load of its own, for example the openssl speed command, which benchmarks OpenSSL performance. This lets you find out how much CPU time the program actually gets and compare it with how much it should have received (which in turn shows how much went to network work).

The openssl speed command runs almost 100% in user space. If you pin it to a particular CPU, it uses all of that CPU's available time while the tests run. The command works by setting a timer for the specified interval (here, to keep the arithmetic simple, 10 seconds), running the test, and then, when the timer fires, calling times() to find out how much CPU time the program actually received. From the syscall point of view it looks like this:

alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726601344
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 2782545353
times({tms_utime=1000, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726602344

In other words, very few system calls are made between calling alarm() and checking the results. If the program was not interrupted, or was interrupted very rarely, tms_utime will match the test duration (in this case, 10 seconds).
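
A trace like the one above can be captured with strace while the benchmark is pinned to the CPU under study (a sketch; the syscall filter simply limits the output to the calls shown above):

taskset -c 5 strace -e trace=alarm,times,rt_sigaction,rt_sigreturn \
    openssl speed -seconds 10 aes-256-cbc >/dev/null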

Since this test runs exclusively in user space, any system time showing up in times() indicates some extra load on the system. In other words, although openssl is the process on the CPU, the CPU may actually be busy with something else, for example processing network packets:

alarm(10)                               = 0
times({tms_utime=0, tms_stime=0, tms_cutime=0, tms_cstime=0}) = 1726617896
--- SIGALRM {si_signo=SIGALRM, si_code=SI_KERNEL} ---
rt_sigaction(SIGALRM, ...) = 0
rt_sigreturn({mask=[]}) = 4079301579
times({tms_utime=178, tms_stime=571, tms_cutime=0, tms_cstime=0}) = 1726618896

Here you can see that openssl got to run on the CPU for 7.49 seconds (178 + 571, in units of 0.01 s), but 5.71 s of that is system time. Since openssl has no work to do in kernel space, those 5.71 s are the result of some additional load on the system; that is, time "stolen" from the process to meet the system's needs.
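
The arithmetic, made explicit (the times() values are in clock ticks, typically 100 per second; check getconf CLK_TCK on your system):

awk 'BEGIN {
    hz = 100; utime = 178; stime = 571; wall = 10   # values from the trace above
    printf "on-CPU total: %.2f s\n", (utime + stime) / hz
    printf "useful user time: %.2f s\n", utime / hz
    printf "spent on other (network) work in this context: %.2f s\n", stime / hz
}'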

Using the openssl speed command to detect system load caused by network mechanisms


Now that we've figured out how the openssl speed command works, let's look at the results it produces on a practically idle server:

$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc >/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 66675623 aes-256 cbc's in 9.99s
Doing aes-256 cbc for 10s on 64 size blocks: 18096647 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 256 size blocks: 4607752 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 1024 size blocks: 1162429 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 8192 size blocks: 145251 aes-256 cbc's in 10.00s
Doing aes-256 cbc for 10s on 16384 size blocks: 72831 aes-256 cbc's in 10.00s

As you can see, the program reports spending between 9.99 and 10 seconds on each block size, which confirms that no system mechanisms are taking CPU time away from it. Now let's load the server with packets arriving from two sources, using netperf.
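
For reference, a load of this kind might be generated from each traffic source with something along these lines (the address and duration are illustrative; any tool that keeps packets flowing toward the target CPU will do):

# run on each traffic generator; 10.0.0.1 stands in for the server under test
netperf -H 10.0.0.1 -l 120 -t TCP_STREAM

With that load in place, we run the openssl test again: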

$ taskset -c 5 openssl speed -seconds 10 aes-256-cbc >/dev/null
Doing aes-256 cbc for 10s on 16 size blocks: 12061658 aes-256 cbc's in 1.96s
Doing aes-256 cbc for 10s on 64 size blocks: 3457491 aes-256 cbc's in 2.10s
Doing aes-256 cbc for 10s on 256 size blocks: 893939 aes-256 cbc's in 2.01s
Doing aes-256 cbc for 10s on 1024 size blocks: 201756 aes-256 cbc's in 1.86s
Doing aes-256 cbc for 10s on 8192 size blocks: 25117 aes-256 cbc's in 1.78s
Doing aes-256 cbc for 10s on 16384 size blocks: 13859 aes-256 cbc's in 1.89s

The results are very different from those obtained on the idle server. Each test is expected to run for 10 seconds, but times() reports that the real execution time was between 1.78 and 2.1 seconds. The remaining time, from 7.9 to 8.22 seconds, was spent processing packets, either in the context of the openssl process or in ksoftirqd.

Let's take a look at what top shows for the openssl speed run just described.

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P 
 8180 libvirt+  20   0 33.269g 1.649g 1.565g S 279.9  0.9  18:57.81 qemu-system-x86     75
 8374 root      20   0       0      0      0 R  99.4  0.0   2:57.97 vhost-8180          89
 1684 dahern    20   0   17112   4400   3892 R  73.6  0.0   0:09.91 openssl              5    
   38 root      20   0       0      0      0 R  26.2  0.0   0:31.86 ksoftirqd/5          5

Here you might think that openssl is using about 73% of CPU 5 and that ksoftirqd gets the rest. In reality, though, so many packets are processed in the context of openssl that the program itself gets only 18-21% of the CPU time for its own work.
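
One way to see the user/system split for the openssl process without waiting for the times() output is pidstat, which reports %usr and %system separately per task (a sketch; the pgrep lookup is illustrative):

# sample the user vs. system CPU split of the openssl process every second
pidstat -u -p "$(pgrep -x openssl)" 1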

If the network load is reduced to a single stream, top gives the impression that openssl is using 99% of the CPU:

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND              P
 8180 libvirt+  20   0 33.269g 1.722g 1.637g S 325.1  0.9 166:38.12 qemu-system-x86     29
44218 dahern    20   0   17112   4488   3996 R  99.2  0.0   0:28.55 openssl              5
 8374 root      20   0       0      0      0 R  64.7  0.0  60:40.50 vhost-8180          55
   38 root      20   0       0      0      0 S   1.0  0.0   4:51.98 ksoftirqd/5          5

But in reality the program, running in user space, gets only about 4 of the expected 10 seconds:

Doing aes-256 cbc for 10s on 16 size blocks: 26596388 aes-256 cbc's in 4.01s
Doing aes-256 cbc for 10s on 64 size blocks: 7137481 aes-256 cbc's in 4.14s
Doing aes-256 cbc for 10s on 256 size blocks: 1844565 aes-256 cbc's in 4.31s
Doing aes-256 cbc for 10s on 1024 size blocks: 472687 aes-256 cbc's in 4.28s
Doing aes-256 cbc for 10s on 8192 size blocks: 59001 aes-256 cbc's in 4.46s
Doing aes-256 cbc for 10s on 16384 size blocks: 28569 aes-256 cbc's in 4.16s

Conventional process monitoring tools report that the program is using almost all of the CPU, while in reality 55-80% of the CPU is being spent on processing network packets. The system's throughput looks great at the same time (more than 22 Gbit/s on a 25 Gbit/s link), but that throughput comes at a huge cost to the processes running on the system.

Summary


Here we looked at an example of how packet-processing mechanisms "steal" CPU cycles from a simple and not particularly important benchmark. On a real server, the processes affected in the same way can be anything: virtual CPUs, emulator threads, vhost threads of virtual machines, or various system processes, and the effect on each of them can ripple through to the performance of the whole system in different ways.

When you analyze your servers, do you take into account how much the load created by network operations affects their actual performance?

