# Linux High-Speed Network Packet Configuration


## Packet Transmit Path

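It starts in userspace: a process builds the data and passes it in through a socket descriptor. The system call then places it on the socket send queue inside the kernel.

Next, the kernel network stack processes the packet (for example netfilter and segmentation) and enqueues it on the qdisc queue.

It is then handed to the device layer, where the driver places it on the NIC TX ring (the third queue). The NIC fetches the packet from the hardware TX ring via DMA and sends it out.

Once the NIC has sent what was in the ring, it raises an interrupt to report completion, telling the driver that it can queue more packets.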

The three queues:

1. socket queue: one queue per socket

   ```sh
   adl@Twinkle:~$ cat /proc/net/sockstat

    sockets: used 315
    TCP: inuse 39 orphan 0 tw 0 alloc 40 mem 4
    UDP: inuse 3 mem 3
    UDPLITE: inuse 0
    RAW: inuse 0
    FRAG: inuse 0 memory 0
    ```

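2. qdisc queue: modern network interface cards (NICs) have multiple hardware transmit (TX) queues. The Linux kernel uses the `mq` (multiqueue) qdisc, which attaches a separate child qdisc to each hardware queue. The number of queues depends on the CPU core count or the hardware design.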

    ```sh
    adl@Twinkle:~$ tc -s qdisc show dev enp6s0f1
    qdisc mq 0: root
    Sent 32930743567 bytes 609828465 pkt (dropped 3, overlimits 0 requeues 7616)
    backlog 0b 0p requeues 7616
    qdisc fq_codel 0: parent :14 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 59033718 bytes 1093217 pkt (dropped 0, overlimits 0 requeues 86)
    backlog 0b 0p requeues 86
    maxpacket 54 drop_overlimit 0 new_flow_count 560 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :13 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 57459726 bytes 1064069 pkt (dropped 0, overlimits 0 requeues 64)
    backlog 0b 0p requeues 64
    maxpacket 54 drop_overlimit 0 new_flow_count 268 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :12 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 57390610 bytes 1062787 pkt (dropped 0, overlimits 0 requeues 53)
    backlog 0b 0p requeues 53
    maxpacket 54 drop_overlimit 0 new_flow_count 413 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :11 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 3034216446 bytes 56189195 pkt (dropped 0, overlimits 0 requeues 545)
    backlog 0b 0p requeues 545
    maxpacket 54 drop_overlimit 0 new_flow_count 5620 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :10 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 344980080 bytes 6388520 pkt (dropped 0, overlimits 0 requeues 169)
    backlog 0b 0p requeues 169
    maxpacket 54 drop_overlimit 0 new_flow_count 1143 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :f limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 200657544 bytes 3715875 pkt (dropped 0, overlimits 0 requeues 200)
    backlog 0b 0p requeues 200
    maxpacket 54 drop_overlimit 0 new_flow_count 1884 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :e limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 149811324 bytes 2774284 pkt (dropped 0, overlimits 0 requeues 156)
    backlog 0b 0p requeues 156
    maxpacket 54 drop_overlimit 0 new_flow_count 1191 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :d limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 195745492 bytes 3624905 pkt (dropped 0, overlimits 0 requeues 267)
    backlog 0b 0p requeues 267
    maxpacket 54 drop_overlimit 0 new_flow_count 2332 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :c limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 177561618 bytes 3288179 pkt (dropped 0, overlimits 0 requeues 711)
    backlog 0b 0p requeues 711
    maxpacket 54 drop_overlimit 0 new_flow_count 4602 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :b limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 11404401960 bytes 211192606 pkt (dropped 0, overlimits 0 requeues 1516)
    backlog 0b 0p requeues 1516
    maxpacket 54 drop_overlimit 0 new_flow_count 8062 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :a limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 164616798 bytes 3048457 pkt (dropped 3, overlimits 0 requeues 483)
    backlog 0b 0p requeues 483
    maxpacket 54 drop_overlimit 0 new_flow_count 5024 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :9 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 15427025954 bytes 285685599 pkt (dropped 0, overlimits 0 requeues 1739)
    backlog 0b 0p requeues 1739
    maxpacket 54 drop_overlimit 0 new_flow_count 18733 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :8 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 146930166 bytes 2720929 pkt (dropped 0, overlimits 0 requeues 86)
    backlog 0b 0p requeues 86
    maxpacket 54 drop_overlimit 0 new_flow_count 1189 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :7 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 104468416 bytes 1934596 pkt (dropped 0, overlimits 0 requeues 213)
    backlog 0b 0p requeues 213
    maxpacket 54 drop_overlimit 0 new_flow_count 5862 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :6 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 650631474 bytes 12048731 pkt (dropped 0, overlimits 0 requeues 205)
    backlog 0b 0p requeues 205
    maxpacket 54 drop_overlimit 0 new_flow_count 1579 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :5 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 174091626 bytes 3223919 pkt (dropped 0, overlimits 0 requeues 316)
    backlog 0b 0p requeues 316
    maxpacket 54 drop_overlimit 0 new_flow_count 2513 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :4 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 131839056 bytes 2441464 pkt (dropped 0, overlimits 0 requeues 103)
    backlog 0b 0p requeues 103
    maxpacket 54 drop_overlimit 0 new_flow_count 954 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :3 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 188725344 bytes 3494914 pkt (dropped 0, overlimits 0 requeues 270)
    backlog 0b 0p requeues 270
    maxpacket 54 drop_overlimit 0 new_flow_count 2479 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 134800200 bytes 2496300 pkt (dropped 0, overlimits 0 requeues 147)
    backlog 0b 0p requeues 147
    maxpacket 54 drop_overlimit 0 new_flow_count 1392 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :1 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 126356015 bytes 2339919 pkt (dropped 0, overlimits 0 requeues 287)
    backlog 0b 0p requeues 287
    maxpacket 54 drop_overlimit 0 new_flow_
    ```

3. NIC TX ring: the hardware ring that the driver fills and the NIC reads via DMA
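
The queue layout of a device can be checked from the shell. The helper below is a minimal sketch, assuming the interface name `enp6s0f1` used in these notes and the usual sysfs layout `/sys/class/net/<dev>/queues`. The function name `count_queues` and its optional root argument are hypothetical, added only so the logic can be tried against a fake directory tree.

```sh
#!/bin/sh
# count_queues DEV KIND [SYSFS_ROOT]
# Prints how many tx-* or rx-* queue directories the device exposes.
# KIND is "tx" or "rx"; SYSFS_ROOT defaults to /sys/class/net.
count_queues() {
    dev=$1 kind=$2 root=${3:-/sys/class/net}
    n=0
    for q in "$root/$dev/queues/$kind"-*; do
        # An unmatched glob stays literal, so only count real directories.
        [ -d "$q" ] && n=$((n + 1))
    done
    echo "$n"
}

# Example on a real machine (the device must exist):
#   count_queues enp6s0f1 tx    # number of TX queues
#   ethtool -g enp6s0f1         # current and maximum TX/RX ring sizes
```

On the machine shown above, the TX count should match the number of child qdiscs listed under `mq`.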

## Sender-Side Settings

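XPS (Transmit Packet Steering) was the earlier answer to keeping every CPU from hitting the same TX queue. Nowadays the qdisc layer also has fq, which gives the kernel flow-based TX scheduling.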

However, if you send a large volume of traffic with the same 5-tuple, only one of the NIC's TX queues will be used, even with fq or XPS.
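
One way to check whether traffic is collapsing onto a single TX queue is to compare the per-queue `Sent` counters from `tc -s qdisc show`. The following is a minimal sketch, assuming the output format shown earlier (a `qdisc ... parent :N` line followed by a `Sent <bytes> bytes ...` line). The function name `tx_queue_bytes` is hypothetical.

```sh
#!/bin/sh
# tx_queue_bytes: read `tc -s qdisc show dev DEV` output on stdin and
# print "<parent handle> <bytes sent>" for each child qdisc under mq.
tx_queue_bytes() {
    awk '
        # A "qdisc" line starts a new block; remember its parent handle, if any.
        $1 == "qdisc" {
            parent = ""
            for (i = 2; i < NF; i++)
                if ($i == "parent") parent = $(i + 1)
        }
        # The root mq qdisc has no parent, so its aggregate line is skipped.
        $1 == "Sent" && parent != "" { print parent, $2 }
    '
}

# Example on a real machine, busiest queues first:
#   tc -s qdisc show dev enp6s0f1 | tx_queue_bytes | sort -k2,2nr | head
```

In the output captured above, for instance, the queues with parents `:9` and `:b` carry most of the bytes, which is the kind of imbalance this listing makes visible.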

XPS can be configured as follows, assuming you have 20 cores:

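```sh
# Run as root. Option 1: allow all 20 CPUs (mask fffff) on every TX queue.
for i in /sys/class/net/enp6s0f1/queues/tx-*; do
  echo fffff > "$i/xps_cpus"
done

# A better approach is 1 queue ↔ 1 CPU (each mask has a single CPU bit set).
cd /sys/class/net/enp6s0f1/queues
echo 00001 > tx-0/xps_cpus   # CPU 0
echo 00002 > tx-1/xps_cpus   # CPU 1
echo 00004 > tx-2/xps_cpus   # CPU 2
echo 00008 > tx-3/xps_cpus   # CPU 3
```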

XPS is only worth enabling with a multi-queue NIC, multiple cores, and a high packet rate (PPS). It maps the send flows of different CPUs to different TX queues.
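
The one-queue-per-CPU mapping can be generated instead of typed by hand. This is a minimal sketch, assuming there are no more TX queues than CPUs and fewer than 32 CPUs (larger CPU numbers need comma-separated 32-bit mask groups, which this does not handle). The function names `cpu_mask` and `set_xps_one_to_one` are hypothetical.

```sh
#!/bin/sh
# cpu_mask N: print the hex bitmask with only CPU N set (valid for N < 32).
cpu_mask() {
    printf '%x\n' $((1 << $1))
}

# set_xps_one_to_one QUEUES_DIR: write the mask for CPU i into tx-i/xps_cpus.
# QUEUES_DIR is normally /sys/class/net/<dev>/queues (requires root).
set_xps_one_to_one() {
    dir=$1
    for q in "$dir"/tx-*; do
        [ -d "$q" ] || continue
        idx=${q##*/tx-}                  # queue index from the directory name
        cpu_mask "$idx" > "$q/xps_cpus"
    done
}

# Example (as root):
#   set_xps_one_to_one /sys/class/net/enp6s0f1/queues
```

Taking the directory as an argument keeps the function easy to try on a scratch directory before pointing it at sysfs.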

## Receiver-Side Settings

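* RSS (Receive Side Scaling): a hardware feature, so it depends on whether the NIC supports it. The NIC itself distributes packets across different RX queues using a hash (of the 5-tuple), and each queue gets its own MSI-X interrupt.
* RPS (Receive Packet Steering): the NIC has already put the packet into an RX queue, but the CPU can redistribute it in software.
* RFS (Receive Flow Steering): keeps the same TCP flow on the same CPU.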

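RPS and RFS are used together, mainly on the receive path (RX). Their design goal is this: after the NIC has already delivered a packet to some RX queue and some CPU, move the subsequent processing to another CPU.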

[However, if you have RSS you do not need RPS + RFS, because they are meant for a single queue.](https://blog.csdn.net/the_dog_tail_grass/article/details/122177977)

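* RFS setting: `echo 4096 > /proc/sys/net/core/rps_sock_flow_entries` (RPS itself is enabled per RX queue by writing a CPU mask to `/sys/class/net/<dev>/queues/rx-<n>/rps_cpus`, and RFS also needs a per-queue `rps_flow_cnt`)
* `irqbalance`: automatically adjusts how NIC interrupts are distributed across CPUs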
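
For the case where RPS and RFS are still wanted (a NIC with a single RX queue or only a few), the settings can be applied together. This is a minimal sketch, assuming the standard sysfs files `rps_cpus` and `rps_flow_cnt` under each `rx-<n>` directory and an even split of the global flow table across queues. The function name `enable_rps_rfs` and its optional fourth argument are hypothetical, added so the logic can be tried on a scratch directory.

```sh
#!/bin/sh
# enable_rps_rfs QUEUES_DIR CPU_MASK FLOW_ENTRIES [FLOW_ENTRIES_FILE]
# Writes CPU_MASK to every rx-*/rps_cpus (RPS) and splits FLOW_ENTRIES
# evenly across the RX queues via rx-*/rps_flow_cnt (RFS).
enable_rps_rfs() {
    dir=$1 mask=$2 entries=$3
    global=${4:-/proc/sys/net/core/rps_sock_flow_entries}

    # Count the RX queues first so the per-queue share can be computed.
    nq=0
    for q in "$dir"/rx-*; do
        [ -d "$q" ] && nq=$((nq + 1))
    done
    [ "$nq" -gt 0 ] || { echo "no RX queues under $dir" >&2; return 1; }

    echo "$entries" > "$global"                    # global RFS flow table size
    for q in "$dir"/rx-*; do
        [ -d "$q" ] || continue
        echo "$mask" > "$q/rps_cpus"               # CPUs allowed to process this queue
        echo $((entries / nq)) > "$q/rps_flow_cnt" # this queue's share of the table
    done
}

# Example (as root), 20 CPUs and 4096 flow entries:
#   enable_rps_rfs /sys/class/net/enp6s0f1/queues fffff 4096
```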