Once upon a time, everything was simple. The network card was slow and had only one queue. When packets arrived, the network card copied them into memory through DMA and raised an interrupt, and the Linux kernel harvested those packets and completed the interrupt processing. As network cards became faster, the interrupt-based model could cause an IRQ storm under massive incoming traffic, consuming most of the CPU power and freezing the system. To solve this problem, NAPI (interrupt + polling) was proposed: when the kernel receives an interrupt from the network card, it starts to poll the device and harvests the packets in the queue as fast as possible. NAPI works nicely with the 1 Gbps network cards that are common nowadays. However, when it comes to 10 Gbps, 20 Gbps, or even 40 Gbps network cards, NAPI may not be sufficient: those cards would demand a much faster CPU if we still used one CPU and one queue to receive packets. Fortunately, multi-core CPUs are popular now, so why not process packets in parallel?
RSS: Receive Side Scaling
Receive Side Scaling (RSS) is the mechanism that processes packets with multiple hardware RX queues. When a network card with RSS receives packets, it applies a filter to each packet and distributes the packets among the RX queues. The filter is usually a hash function and can be configured with "ethtool -X". For example, to spread flows evenly among the first 3 queues:
# ethtool -X eth0 equal 3
Or, if you find a magic hash key that is particularly useful:
# ethtool -X eth0 hkey <magic hash key>
For low-latency networking, besides the filter, the CPU affinity is also important. The optimal setting is to dedicate one CPU to each queue. First, find out the IRQ number of each queue by checking /proc/interrupts, and then write the CPU bitmask to /proc/irq/<IRQ_NUMBER>/smp_affinity to allocate the dedicated CPU. To avoid the setting being overwritten, the irqbalance daemon has to be disabled. Please note that, according to the kernel documentation, hyperthreading has shown no benefit for interrupt handling, so it's better to match the number of queues with the number of physical CPU cores.
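For example, assuming the interrupt of queue rx-0 shows up as IRQ 30 in /proc/interrupts (a made-up number for illustration), the following would pin it to CPU 0:
# echo 1 > /proc/irq/30/smp_affinity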
Fig. 1 | RSS
RPS: Receive Packet Steering
While RSS provides the hardware queues, a software-queue mechanism called Receive Packet Steering (RPS) is implemented in the Linux kernel.
When the driver receives a packet, it wraps the packet in a socket buffer (sk_buff), which contains a u32 hash value for the packet. This is the so-called Layer 4 hash (L4 hash): it is based on the source IP, the source port, the destination IP, and the destination port, and it is calculated by either the network card or __skb_set_sw_hash(). Since every packet of the same TCP/UDP connection (flow) shares the same hash, it's reasonable to process them all on the same CPU.
The basic idea of RPS is to send the packets of the same flow to specific CPUs according to the per-queue rps_map. Here is the struct of rps_map:
struct rps_map {
unsigned int len;
struct rcu_head rcu;
u16 cpus[0];
};
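To make the mapping concrete, here is a minimal sketch (not the exact kernel code) of how get_rps_cpu() picks a CPU from rps_map; reciprocal_scale(hash, len) scales a u32 hash into the range [0, len):
/* Minimal sketch: map a packet hash to one of the CPUs in rps_map.
 * reciprocal_scale(hash, len) is roughly ((u64)hash * len) >> 32. */
static u16 rps_pick_cpu(const struct rps_map *map, u32 hash)
{
        return map->cpus[reciprocal_scale(hash, map->len)];
}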
The map is configured through the CPU bitmask written to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus. For example, to make the queue use the first 3 CPUs in an 8-CPU system, construct the bitmask 00000111, i.e. 0x7, and:
# echo 7 > /sys/class/net/eth0/queues/rx-0/rps_cpus
This guarantees that the packets received from queue 0 of eth0 go to CPU 0, 1, or 2.
After the driver wraps a packet in an sk_buff, the packet reaches either netif_rx_internal() or netif_receive_skb_internal(), and then get_rps_cpu() is invoked to map the hash to an entry in rps_map, i.e. to a CPU id. After getting the CPU id, enqueue_to_backlog() puts the sk_buff into that CPU's queue for further processing. The queues for each CPU are allocated in the per-CPU variable softnet_data.
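The hand-off looks roughly like this (condensed from netif_receive_skb_internal(); RCU locking and the RPS-disabled path are omitted):
/* Condensed RPS hand-off: pick a CPU for the skb, then enqueue it
 * to that CPU's backlog queue in softnet_data. */
int cpu = get_rps_cpu(skb->dev, skb, &rflow);
if (cpu >= 0)
        ret = enqueue_to_backlog(skb, cpu, &rflow->last_qtail);
else
        ret = __netif_receive_skb(skb);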
Fig. 2 | RPS
The benefit of RPS is to share the load of packet processing among the CPUs. It may be unnecessary if RSS is available, since the network card already sorts the packets into one queue per CPU. However, RPS can still be useful when there are more CPUs than queues: each queue can then be associated with more than one CPU, and its packets distributed among them.
RFS: Receive Flow Steering
Although RPS distributes packets based on flows, it doesn't take the userspace applications into consideration. The application may run on CPU A while the kernel puts the packets into the queue of CPU B. Since CPU A can only use its own cache, the packets cached on CPU B are useless to it. Receive Flow Steering (RFS) extends RPS to address this.
Instead of the per-queue hash-to-CPU map, RFS maintains a global flow-to-CPU table, rps_sock_flow_table:
struct rps_sock_flow_table {
u32 mask;
u32 ents[0];
};
The mask is used to map the hash value to an index in the table. Since the table size is rounded up to a power of 2, the mask is set to table_size - 1, and the index for an sk_buff is simply hash & sock_flow_table->mask. Each entry is partitioned into a flow id and a CPU id by rps_cpu_mask: the low bits hold the CPU id, while the high bits hold the flow id. When the application operates on the socket (inet_recvmsg(), inet_sendmsg(), inet_sendpage(), tcp_splice_read()), sock_rps_record_flow() is called to update the sock flow table.
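A simplified version of that recording logic (cf. rps_record_sock_flow(); the real code also checks that the table exists) looks like:
/* Record "this flow was last handled on this CPU": keep the high bits
 * of the hash as the flow id, store the current CPU in the low bits. */
u32 index = hash & sock_flow_table->mask;
u32 val = (hash & ~rps_cpu_mask) | raw_smp_processor_id();
if (sock_flow_table->ents[index] != val)
        sock_flow_table->ents[index] = val;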
When a packet arrives, get_rps_cpu() is called to decide which CPU queue to use. Here is how it decides the CPU for the packet:
ident = sock_flow_table->ents[hash & sock_flow_table->mask];
if ((ident ^ hash) & ~rps_cpu_mask)
        goto try_rps;   /* high bits differ: not this flow, fall back to RPS */
next_cpu = ident & rps_cpu_mask;   /* low bits: CPU recorded for this flow */
get_rps_cpu() finds the index of the entry with the flow table mask and checks whether the high bits of the hash match the entry. If they do, it retrieves the CPU id from the entry and assigns that CPU to the packet. If the hash doesn't match the entry, it falls back to the RPS map.
The size of the sock flow table can be adjusted through /proc/sys/net/core/rps_sock_flow_entries. For example, to set the table size to 32768:
# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
Although the sock flow table improves application locality, it also raises a problem. When the scheduler migrates the application to a new CPU, the packets remaining in the old CPU's queue become outstanding, and the application may receive packets out of order. To solve this problem, RFS uses the per-queue rps_dev_flow_table to track the outstanding packets.
Here are the structs of rps_dev_flow and rps_dev_flow_table:
struct rps_dev_flow {
u16 cpu;
u16 filter; /* For aRFS */
unsigned int last_qtail;
};
struct rps_dev_flow_table {
unsigned int mask;
struct rcu_head rcu;
struct rps_dev_flow flows[0];
};
Similar to the sock flow table, rps_dev_flow_table also uses table_size - 1 as the mask, and the table size likewise has to be rounded up to a power of 2. When a packet of a flow is enqueued, last_qtail is updated to the tail of that CPU's backlog queue. If the application migrates to a new CPU, the sock flow table will reflect the change, and get_rps_cpu() will set the new CPU for the flow. Before switching, get_rps_cpu() checks whether the current head of the old CPU's queue has already passed last_qtail. If so, there are no outstanding packets left in the queue, and it's safe to change the CPU; otherwise, get_rps_cpu() keeps using the old CPU recorded in rps_dev_flow->cpu.
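The check itself is short. Condensed from get_rps_cpu() in net/core/dev.c, where tcpu is the CPU recorded in rps_dev_flow and next_cpu is the one from the sock flow table:
/* Switch to next_cpu only if the old CPU is invalid/offline, or if its
 * queue head has already passed last_qtail (no outstanding packets). */
if (tcpu != next_cpu &&
    (tcpu >= nr_cpu_ids || !cpu_online(tcpu) ||
     ((int)(per_cpu(softnet_data, tcpu).input_queue_head -
            rflow->last_qtail)) >= 0)) {
        tcpu = next_cpu;
        rflow = set_rps_cpu(dev, skb, rflow, next_cpu);
}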
The size of the per-queue flow table (rps_dev_flow_table) can be configured through the sysfs interface /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt. It's recommended to set rps_flow_cnt to rps_sock_flow_entries / N, where N is the number of RX queues (assuming the flows are distributed evenly among the queues). For example, with rps_sock_flow_entries = 32768 and 8 RX queues, each rps_flow_cnt would be 4096.
Fig. 3 | RFS
Fig. 3 illustrates how RFS works. The kernel receives a blue packet that belongs to the blue flow. The per-queue flow table directs the blue packet to CPU 2 (the old CPU), while the socket has updated the sock flow table to use CPU 1 (the new CPU). get_rps_cpu() checks both tables and detects the CPU migration, so it updates the per-queue flow table (assuming there are no outstanding packets on CPU 2) and assigns CPU 1 to the blue packet.
aRFS: Accelerated Receive Flow Steering
Accelerated Receive Flow Steering (aRFS) extends RFS further, down to the hardware filter for the RX queues. It requires a network card with a programmable ntuple filter and driver support. To enable the ntuple filter:
# ethtool -K eth0 ntuple on
For a driver to support aRFS, it has to implement ndo_rx_flow_steer() so that set_rps_cpu() can configure the hardware filter. When get_rps_cpu() decides to assign a new CPU to a flow, it calls set_rps_cpu(). set_rps_cpu() first checks whether the network card supports the ntuple filter. If so, it queries rx_cpu_rmap to find a proper RX queue for the flow. rx_cpu_rmap is a special map maintained by the driver: it looks up which RX queue is suitable for a given CPU, either the queue directly associated with that CPU or a queue whose processing CPU is closest in cache locality. After getting the RX queue index, set_rps_cpu() invokes ndo_rx_flow_steer() to ask the driver to create a new filter for the flow. ndo_rx_flow_steer() returns a filter id, which is stored in the per-queue flow table.
Besides implementing ndo_rx_flow_steer(), the driver has to call rps_may_expire_flow() periodically to check whether its filters are still valid, and remove the expired ones.
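The core of that path, roughly following set_rps_cpu() (simplified; the real code also verifies ntuple support, compares against the packet's current queue, and handles failures):
/* Simplified aRFS steering inside set_rps_cpu(): find the RX queue
 * suited to next_cpu, then ask the driver to install a hardware filter. */
rxq_index = cpu_rmap_lookup_index(dev->rx_cpu_rmap, next_cpu);
flow_id = skb_get_hash(skb) & flow_table->mask;
rc = dev->netdev_ops->ndo_rx_flow_steer(dev, skb, rxq_index, flow_id);
if (rc >= 0)
        rflow->filter = rc;   /* filter id kept in the per-queue flow table */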
Fig. 4 | aRFS
Conclusion
RSS, RPS, RFS, and aRFS were all introduced before Linux 3.0, so most distributions already include and enable them. Still, it's good to understand them at a deeper level so that we can find the best configuration for our systems.
By the way, this article only covers the receive side of the kernel document; Transmit Packet Steering (XPS), the only multi-queue TX mechanism, was omitted on purpose. I would like to cover it in another article dedicated to packet transmission (probably including qdisc in tc).
References
Scaling in the Linux Networking Stack
Monitoring and Tuning the Linux Networking Stack: Receiving Data