Linux High-Speed Network Packet Configuration

Packet transmit path

It starts in user space: a process builds the packet data and passes it in through a socket descriptor, and a system call places it in the kernel's socket send queue.

Next it enters the qdisc queue, with the kernel doing some packet processing along the way (such as netfilter and segmentation).

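It is then handed to the device layer, where the driver places it on the NIC TX ring (the third queue). The NIC fetches packets from the hardware TX ring via DMA and sends them out.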

Once the NIC has sent what was on the ring, it raises an interrupt to say it is done, telling the driver it can queue more packets.

The three queues:

  1. socket queue: one queue per socket

    adl@Twinkle:~$ cat /proc/net/sockstat
    
     sockets: used 315
     TCP: inuse 39 orphan 0 tw 0 alloc 40 mem 4
     UDP: inuse 3 mem 3
     UDPLITE: inuse 0
     RAW: inuse 0
     FRAG: inuse 0 memory 0
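  2. qdisc queue: modern network interface cards (NICs) have multiple hardware transmit (TX) queues. The Linux kernel uses the "mq" (multiqueue) framework, which attaches a separate qdisc to each hardware queue; the number of queues scales with the CPU core count or the hardware design.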

    adl@Twinkle:~$ tc -s qdisc show dev enp6s0f1
    qdisc mq 0: root
    Sent 32930743567 bytes 609828465 pkt (dropped 3, overlimits 0 requeues 7616)
    backlog 0b 0p requeues 7616
    qdisc fq_codel 0: parent :14 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 59033718 bytes 1093217 pkt (dropped 0, overlimits 0 requeues 86)
    backlog 0b 0p requeues 86
    maxpacket 54 drop_overlimit 0 new_flow_count 560 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :13 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 57459726 bytes 1064069 pkt (dropped 0, overlimits 0 requeues 64)
    backlog 0b 0p requeues 64
    maxpacket 54 drop_overlimit 0 new_flow_count 268 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :12 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 57390610 bytes 1062787 pkt (dropped 0, overlimits 0 requeues 53)
    backlog 0b 0p requeues 53
    maxpacket 54 drop_overlimit 0 new_flow_count 413 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :11 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 3034216446 bytes 56189195 pkt (dropped 0, overlimits 0 requeues 545)
    backlog 0b 0p requeues 545
    maxpacket 54 drop_overlimit 0 new_flow_count 5620 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :10 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 344980080 bytes 6388520 pkt (dropped 0, overlimits 0 requeues 169)
    backlog 0b 0p requeues 169
    maxpacket 54 drop_overlimit 0 new_flow_count 1143 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :f limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 200657544 bytes 3715875 pkt (dropped 0, overlimits 0 requeues 200)
    backlog 0b 0p requeues 200
    maxpacket 54 drop_overlimit 0 new_flow_count 1884 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :e limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 149811324 bytes 2774284 pkt (dropped 0, overlimits 0 requeues 156)
    backlog 0b 0p requeues 156
    maxpacket 54 drop_overlimit 0 new_flow_count 1191 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :d limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 195745492 bytes 3624905 pkt (dropped 0, overlimits 0 requeues 267)
    backlog 0b 0p requeues 267
    maxpacket 54 drop_overlimit 0 new_flow_count 2332 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :c limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 177561618 bytes 3288179 pkt (dropped 0, overlimits 0 requeues 711)
    backlog 0b 0p requeues 711
    maxpacket 54 drop_overlimit 0 new_flow_count 4602 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :b limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 11404401960 bytes 211192606 pkt (dropped 0, overlimits 0 requeues 1516)
    backlog 0b 0p requeues 1516
    maxpacket 54 drop_overlimit 0 new_flow_count 8062 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :a limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 164616798 bytes 3048457 pkt (dropped 3, overlimits 0 requeues 483)
    backlog 0b 0p requeues 483
    maxpacket 54 drop_overlimit 0 new_flow_count 5024 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :9 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 15427025954 bytes 285685599 pkt (dropped 0, overlimits 0 requeues 1739)
    backlog 0b 0p requeues 1739
    maxpacket 54 drop_overlimit 0 new_flow_count 18733 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :8 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 146930166 bytes 2720929 pkt (dropped 0, overlimits 0 requeues 86)
    backlog 0b 0p requeues 86
    maxpacket 54 drop_overlimit 0 new_flow_count 1189 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :7 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 104468416 bytes 1934596 pkt (dropped 0, overlimits 0 requeues 213)
    backlog 0b 0p requeues 213
    maxpacket 54 drop_overlimit 0 new_flow_count 5862 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :6 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 650631474 bytes 12048731 pkt (dropped 0, overlimits 0 requeues 205)
    backlog 0b 0p requeues 205
    maxpacket 54 drop_overlimit 0 new_flow_count 1579 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :5 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 174091626 bytes 3223919 pkt (dropped 0, overlimits 0 requeues 316)
    backlog 0b 0p requeues 316
    maxpacket 54 drop_overlimit 0 new_flow_count 2513 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :4 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 131839056 bytes 2441464 pkt (dropped 0, overlimits 0 requeues 103)
    backlog 0b 0p requeues 103
    maxpacket 54 drop_overlimit 0 new_flow_count 954 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :3 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 188725344 bytes 3494914 pkt (dropped 0, overlimits 0 requeues 270)
    backlog 0b 0p requeues 270
    maxpacket 54 drop_overlimit 0 new_flow_count 2479 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :2 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 134800200 bytes 2496300 pkt (dropped 0, overlimits 0 requeues 147)
    backlog 0b 0p requeues 147
    maxpacket 54 drop_overlimit 0 new_flow_count 1392 ecn_mark 0
    new_flows_len 0 old_flows_len 0
    qdisc fq_codel 0: parent :1 limit 10240p flows 1024 quantum 1514 target 5.0ms interval 100.0ms memory_limit 32Mb ecn
    Sent 126356015 bytes 2339919 pkt (dropped 0, overlimits 0 requeues 287)
    backlog 0b 0p requeues 287
    maxpacket 54 drop_overlimit 0 new_flow_
  3. NIC TX ring
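
The per-queue counters in the `tc -s qdisc` output above are easier to compare once condensed to one line per qdisc. The following is a minimal sketch that assumes the output format shown above; `qdisc_summary` is a hypothetical helper name, not part of tc.

```shell
# Condense `tc -s qdisc` output to: parent kind packets dropped requeues
# (hypothetical helper; assumes the "Sent ... (dropped N, overlimits N requeues N)" layout)
qdisc_summary() {
  awk '
    # Header line, e.g. "qdisc fq_codel 0: parent :14 ..." or "qdisc mq 0: root"
    $1 == "qdisc" { kind = $2; parent = ($4 == "parent") ? $5 : "root" }
    # Counter line, e.g. "Sent 59033718 bytes 1093217 pkt (dropped 0, overlimits 0 requeues 86)"
    $1 == "Sent" {
      dropped = $7;   sub(/,$/, "", dropped)
      requeues = $NF; sub(/\)$/, "", requeues)
      print parent, kind, $4, dropped, requeues
    }
  '
}
```

It could be used as `tc -s qdisc show dev enp6s0f1 | qdisc_summary`; a child qdisc whose dropped or requeues column stands out is the TX queue worth looking at first.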

Sender-side configuration

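XPS (Transmit Packet Steering) was the traditional way to keep all CPUs from hitting the same TX queue. Nowadays the qdisc layer also has fq, the kernel's flow-based TX scheduling.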

However, if you send a large volume of traffic with the same 5-tuple, only one of the NIC's TX queues is used, even with fq or XPS.
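
To see why a single 5-tuple cannot spread across queues, consider a toy model of hash-based queue selection. This is only an illustration: the kernel uses its own flow hash, not `cksum`, and `pick_queue` is a hypothetical name. The point is that any deterministic hash sends identical tuples to the same queue every time.

```shell
# Toy model: map a flow's 5-tuple string to one of N queues.
# NOT the kernel's real hash; cksum is used only because it is deterministic.
pick_queue() {  # usage: pick_queue "<proto src sport dst dport>" <num_queues>
  hash=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( hash % $2 ))
}
```

Calling `pick_queue "tcp 10.0.0.1 40000 10.0.0.2 443" 20` repeatedly always prints the same queue index, so one heavy flow lands on one queue however many queues exist. Different tuples may land on different queues.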

With 20 cores, for example, XPS can be configured as follows:

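# Allow all 20 CPUs (mask fffff) on every TX queue
for i in /sys/class/net/enp6s0f1/queues/tx-*; do
  echo fffff > "$i/xps_cpus"
done

# A better approach is 1 queue ↔ 1 CPU
cd /sys/class/net/enp6s0f1/queues
echo 00001 > tx-0/xps_cpus   # CPU 0
echo 00002 > tx-1/xps_cpus   # CPU 1
echo 00004 > tx-2/xps_cpus   # CPU 2
echo 00008 > tx-3/xps_cpus   # CPU 3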

XPS is only worth enabling with a multi-queue NIC, multiple cores, and a high packet rate (PPS), so that flows sent from different CPUs map to different TX queues.
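
Writing the 1 queue ↔ 1 CPU masks by hand gets tedious beyond a few queues. The sketch below generates them. `xps_mask` and `xps_plan` are hypothetical helper names. It assumes fewer than 32 CPUs (larger masks need comma-separated 32-bit groups) and that TX queue i should map to CPU i. It only prints the commands, so the plan can be reviewed before anything is written to sysfs.

```shell
# Hex CPU mask with only bit <cpu> set, e.g. CPU 3 -> 8 (assumes cpu < 32).
xps_mask() {
  printf '%x\n' $(( 1 << $1 ))
}

# Print (do not run) the commands mapping TX queue i to CPU i.
# usage: xps_plan <dev> <num_tx_queues>   (both arguments are illustrative)
xps_plan() {
  dev=$1 nqueues=$2 i=0
  while [ "$i" -lt "$nqueues" ]; do
    echo "echo $(xps_mask "$i") > /sys/class/net/$dev/queues/tx-$i/xps_cpus"
    i=$(( i + 1 ))
  done
}
```

For example, `xps_plan enp6s0f1 4` prints the four lines of the manual mapping above (without leading zeros, which sysfs does not require), and the output can be piped to a root shell once it looks right.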

Receiver-side configuration

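  • RSS (Receive Side Scaling): a hardware feature, so it depends on whether the NIC supports it. The NIC itself distributes packets across RX queues using a hash of the 5-tuple, with one MSI-X interrupt per queue.
  • RPS (Receive Packet Steering): the NIC has already put the packet into an RX queue, but the CPU can redistribute it in software.
  • RFS (Receive Flow Steering): keeps a given TCP flow on the same CPU.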

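RPS and RFS are used together and apply to the receive path (RX). Their design goal is this: after the NIC has delivered a packet to some RX queue and some CPU, move the rest of the processing to another CPU.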

If you have RSS, however, you do not need RPS + RFS, because they were designed for the single-queue case.
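
A quick way to tell which case applies is to count the RX queues the NIC exposes in sysfs. The sketch below is a rough heuristic, not a definitive test: it assumes that a single `rx-*` directory means RSS is not spreading the load. `rx_queue_count` and `needs_rps` are hypothetical helper names.

```shell
# Count rx-* entries under a queues directory such as
# /sys/class/net/<dev>/queues.
rx_queue_count() {  # usage: rx_queue_count <queues-dir>
  n=0
  for q in "$1"/rx-*; do
    if [ -d "$q" ]; then
      n=$(( n + 1 ))
    fi
  done
  echo "$n"
}

# Heuristic: with at most one RX queue, RPS + RFS is worth considering.
needs_rps() {
  [ "$(rx_queue_count "$1")" -le 1 ]
}
```

For example, `needs_rps /sys/class/net/enp6s0f1/queues && echo "consider RPS + RFS"`.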

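  • RFS setting: echo 4096 > /proc/sys/net/core/rps_sock_flow_entries (this sizes the global flow table; RPS itself is enabled per RX queue by writing a CPU mask to /sys/class/net/<dev>/queues/rx-<n>/rps_cpus)
  • irqbalance: automatically distributes NIC interrupts across CPUs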
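
For the single-queue case, the RPS and RFS knobs can be put together as in the sketch below. `rfs_plan` is a hypothetical helper name and its arguments are illustrative. It assumes the per-queue `rps_flow_cnt` is set to the global entry count divided by the number of RX queues, and it only prints the commands rather than writing to /proc or /sys.

```shell
# Print (do not run) RPS + RFS settings for a device.
# usage: rfs_plan <dev> <num_rx_queues> <flow_entries> <cpu_mask_hex>
rfs_plan() {
  dev=$1 nqueues=$2 entries=$3 mask=$4 i=0
  # Global RFS flow table size
  echo "echo $entries > /proc/sys/net/core/rps_sock_flow_entries"
  while [ "$i" -lt "$nqueues" ]; do
    q=/sys/class/net/$dev/queues/rx-$i
    # RPS: CPUs allowed to process packets from this RX queue
    echo "echo $mask > $q/rps_cpus"
    # RFS: this queue's share of the flow table
    echo "echo $(( entries / nqueues )) > $q/rps_flow_cnt"
    i=$(( i + 1 ))
  done
}
```

For a single-queue NIC on a 4-core machine, `rfs_plan eth0 1 4096 f` prints three commands: the global table size, `rps_cpus` set to `f`, and `rps_flow_cnt` set to 4096.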