Message ID | 20230704082916.2135501-4-wei.fang@nxp.com (mailing list archive) |
---|---|
State | Superseded |
Delegated to: | Netdev Maintainers |
Series | net: fec: fix some issues of ndo_xdp_xmit() |
Context | Check | Description |
---|---|---|
netdev/series_format | success | Posting correctly formatted |
netdev/tree_selection | success | Clearly marked for net |
netdev/fixes_present | success | Fixes tag present in non-next series |
netdev/header_inline | success | No static functions without inline keyword in header files |
netdev/build_32bit | success | Errors and warnings before: 8 this patch: 8 |
netdev/cc_maintainers | success | CCed 10 of 10 maintainers |
netdev/build_clang | fail | Errors and warnings before: 18 this patch: 18 |
netdev/verify_signedoff | success | Signed-off-by tag matches author and committer |
netdev/deprecated_api | success | None detected |
netdev/check_selftest | success | No net selftest shell script |
netdev/verify_fixes | success | Fixes tag looks correct |
netdev/build_allmodconfig_warn | success | Errors and warnings before: 8 this patch: 8 |
netdev/checkpatch | success | total: 0 errors, 0 warnings, 0 checks, 27 lines checked |
netdev/kdoc | success | Errors and warnings before: 0 this patch: 0 |
netdev/source_inline | success | Was 0 now: 0 |
> After trying various methods, I think that increasing the size of the
> tx BD ring is simple and effective. Maybe the best resolution is to
> allocate a NAPI instance for each queue to improve the efficiency of
> the NAPI callback, but this change is a bit big and I didn't try it.
> Perhaps this method will be implemented in a future patch.

How does this affect platforms like Vybrid with its fast Ethernet?
Does the burst latency go up?

> In addition, this patch also updates the tx_stop_threshold and the
> tx_wake_threshold of the tx ring. In the previous logic, the value of
> tx_stop_threshold is 217; however, the value of tx_wake_threshold is
> 147. It does not make sense that tx_wake_threshold is less than
> tx_stop_threshold.

What do these actually mean? I could imagine that as the ring fills you
don't want to stop until it is 217/512 full. There is then some
hysteresis, such that it has to drop below 147/512 before more can be
added?

> Besides, both the XDP path and the 'slow path' share the tx BD rings.
> So if tx_stop_threshold is 217, in the case of heavy XDP traffic, the
> slow path is easily stopped, and this will have a serious impact on
> the slow path.

Please post your iperf results for various platforms, so we can see the
effects of this. We generally don't accept tuning patches without
benchmarks which prove the improvements, and also show there is no
regression. And given the wide variety of SoCs using the FEC, I expect
testing on a number of SoCs, both Fast and 1G.

Andrew
> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: July 5, 2023 8:25
> To: Wei Fang <wei.fang@nxp.com>
> Cc: davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; ast@kernel.org; daniel@iogearbox.net;
> hawk@kernel.org; john.fastabend@gmail.com; Shenwei Wang
> <shenwei.wang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> netdev@vger.kernel.org; dl-linux-imx <linux-imx@nxp.com>;
> linux-kernel@vger.kernel.org; bpf@vger.kernel.org
> Subject: Re: [PATCH net 3/3] net: fec: increase the size of tx ring
> and update thresholds of tx ring
>
> > After trying various methods, I think that increasing the size of
> > the tx BD ring is simple and effective. Maybe the best resolution is
> > to allocate a NAPI instance for each queue to improve the efficiency
> > of the NAPI callback, but this change is a bit big and I didn't try
> > it. Perhaps this method will be implemented in a future patch.
>
> How does this affect platforms like Vybrid with its fast Ethernet?

Sorry, I don't have the Vybrid platform, but I don't think it has much
impact; at most it just takes up some more memory.

> Does the burst latency go up?

No. For the FEC, when a packet is attached to the BDs, the software
immediately triggers the hardware to send the packet. In addition, I
think it may improve the latency, because the size of the tx ring
becomes larger and more packets can be attached to the BD ring for
burst traffic.

> > In addition, this patch also updates the tx_stop_threshold and the
> > tx_wake_threshold of the tx ring. In the previous logic, the value
> > of tx_stop_threshold is 217; however, the value of tx_wake_threshold
> > is 147. It does not make sense that tx_wake_threshold is less than
> > tx_stop_threshold.
>
> What do these actually mean? I could imagine that as the ring fills
> you don't want to stop until it is 217/512 full. There is then some
> hysteresis, such that it has to drop below 147/512 before more can be
> added?
You must have misunderstood; let me explain more clearly. The queue is
stopped when the available BDs are less than tx_stop_threshold (217
BDs), and the queue is woken when the available BDs are greater than
tx_wake_threshold (147 BDs). So in most cases, the available BDs are
greater than tx_wake_threshold when the queue is stopped, and the only
effect is to delay packet sending. In my opinion, tx_wake_threshold
should be greater than tx_stop_threshold: we should stop the queue when
the available BDs are not enough for an skb to be attached, and wake
the queue when the available BDs are sufficient for an skb.

> > Besides, both the XDP path and the 'slow path' share the tx BD
> > rings. So if tx_stop_threshold is 217, in the case of heavy XDP
> > traffic, the slow path is easily stopped, and this will have a
> > serious impact on the slow path.
>
> Please post your iperf results for various platforms, so we can see
> the effects of this. We generally don't accept tuning patches without
> benchmarks which prove the improvements, and also show there is no
> regression. And given the wide variety of SoCs using the FEC, I expect
> testing on a number of SoCs, both Fast and 1G.

Below are the results on the i.MX6UL/8MM/8MP/8ULP/93 platforms; the
i.MX6UL and i.MX8ULP only support Fast Ethernet, the others support 1G.
1.1 i.MX6UL with tx ring size 512

root@imx6ul7d:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 47148 connected to 192.168.3.6 port 5201
[ ID] Interval          Transfer     Bitrate         Retr  Cwnd
[  5]  0.00-1.00   sec  10.1 MBytes  84.9 Mbits/sec    0   103 KBytes
[  5]  1.00-2.00   sec  9.88 MBytes  82.6 Mbits/sec    0   103 KBytes
[  5]  2.00-3.00   sec  9.82 MBytes  82.7 Mbits/sec    0   103 KBytes
[  5]  3.00-4.00   sec  9.82 MBytes  82.4 Mbits/sec    0   103 KBytes
[  5]  4.00-5.00   sec  9.88 MBytes  82.9 Mbits/sec    0   103 KBytes
[  5]  5.00-6.00   sec  9.94 MBytes  83.4 Mbits/sec    0   103 KBytes
[  5]  6.00-7.00   sec  10.1 MBytes  84.3 Mbits/sec    0   103 KBytes
[  5]  7.00-8.00   sec  9.82 MBytes  82.4 Mbits/sec    0   103 KBytes
[  5]  8.00-9.00   sec  9.82 MBytes  82.4 Mbits/sec    0   103 KBytes
[  5]  9.00-10.00  sec  9.88 MBytes  82.9 Mbits/sec    0   103 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bitrate         Retr
[  5]  0.00-10.00  sec  99.0 MBytes  83.1 Mbits/sec    0   sender
[  5]  0.00-10.01  sec  98.8 MBytes  82.8 Mbits/sec        receiver

1.2 i.MX6UL with tx ring size 1024

root@imx6ul7d:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 55670 connected to 192.168.3.6 port 5201
[ ID] Interval          Transfer     Bitrate         Retr  Cwnd
[  5]  0.00-1.00   sec  10.2 MBytes  85.4 Mbits/sec    0   236 KBytes
[  5]  1.00-2.00   sec  10.1 MBytes  84.6 Mbits/sec    0   236 KBytes
[  5]  2.00-3.00   sec  10.2 MBytes  85.5 Mbits/sec    0   249 KBytes
[  5]  3.00-4.00   sec  10.1 MBytes  85.1 Mbits/sec    0   249 KBytes
[  5]  4.00-5.00   sec  10.1 MBytes  84.7 Mbits/sec    0   249 KBytes
[  5]  5.00-6.00   sec  10.0 MBytes  84.1 Mbits/sec    0   249 KBytes
[  5]  6.00-7.00   sec  10.1 MBytes  85.1 Mbits/sec    0   249 KBytes
[  5]  7.00-8.00   sec  10.1 MBytes  84.9 Mbits/sec    0   249 KBytes
[  5]  8.00-9.00   sec  10.3 MBytes  85.9 Mbits/sec    0   249 KBytes
[  5]  9.00-10.00  sec  10.2 MBytes  85.6 Mbits/sec    0   249 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bitrate         Retr
[  5]  0.00-10.00  sec  101 MBytes   85.1 Mbits/sec    0   sender
[  5]  0.00-10.01  sec  101 MBytes   84.5 Mbits/sec        receiver

2.1 i.MX8ULP with tx ring size 512

root@imx8ulpevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 54342 connected to 192.168.3.6 port 5201
[ ID] Interval          Transfer     Bitrate         Retr  Cwnd
[  5]  0.00-1.00   sec  10.8 MBytes  90.3 Mbits/sec    0   99.0 KBytes
[  5]  1.00-2.00   sec  9.94 MBytes  83.4 Mbits/sec    0   99.0 KBytes
[  5]  2.00-3.00   sec  10.2 MBytes  85.5 Mbits/sec    0   99.0 KBytes
[  5]  3.00-4.00   sec  9.94 MBytes  83.4 Mbits/sec    0   99.0 KBytes
[  5]  4.00-5.00   sec  10.2 MBytes  85.5 Mbits/sec    0   99.0 KBytes
[  5]  5.00-6.00   sec  9.94 MBytes  83.4 Mbits/sec    0   99.0 KBytes
[  5]  6.00-7.00   sec  9.69 MBytes  81.3 Mbits/sec    0   99.0 KBytes
[  5]  7.00-8.00   sec  9.94 MBytes  83.4 Mbits/sec    0   99.0 KBytes
[  5]  8.00-9.00   sec  9.69 MBytes  81.3 Mbits/sec    0   99.0 KBytes
[  5]  9.00-10.00  sec  10.2 MBytes  85.5 Mbits/sec    0   99.0 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bitrate         Retr
[  5]  0.00-10.00  sec  100 MBytes   84.3 Mbits/sec    0   sender
[  5]  0.00-9.90   sec  100 MBytes   84.7 Mbits/sec        receiver

2.2 i.MX8ULP with tx ring size 1024

root@imx8ulpevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 44770 connected to 192.168.3.6 port 5201
[ ID] Interval          Transfer     Bitrate         Retr  Cwnd
[  5]  0.00-1.00   sec  10.7 MBytes  90.1 Mbits/sec    0   93.3 KBytes
[  5]  1.00-2.00   sec  9.94 MBytes  83.4 Mbits/sec    0   93.3 KBytes
[  5]  2.00-3.00   sec  10.2 MBytes  85.5 Mbits/sec    0   93.3 KBytes
[  5]  3.00-4.00   sec  10.1 MBytes  85.0 Mbits/sec    0   93.3 KBytes
[  5]  4.00-5.00   sec  9.94 MBytes  83.4 Mbits/sec    0   100 KBytes
[  5]  5.00-6.00   sec  10.2 MBytes  85.5 Mbits/sec    0   100 KBytes
[  5]  6.00-7.00   sec  9.69 MBytes  81.3 Mbits/sec    0   100 KBytes
[  5]  7.00-8.00   sec  9.94 MBytes  83.4 Mbits/sec    0   100 KBytes
[  5]  8.00-9.00   sec  10.2 MBytes  85.5 Mbits/sec    0   100 KBytes
[  5]  9.00-10.00  sec  9.69 MBytes  81.3 Mbits/sec    0   100 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bitrate         Retr
[  5]  0.00-10.00  sec  101 MBytes   84.4 Mbits/sec    0   sender
[  5]  0.00-9.92   sec  100 MBytes   84.8 Mbits/sec        receiver

3.1 i.MX8MM with tx ring size 512

root@imx8mmevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 55734 connected to 192.168.3.6 port 5201
[ ID] Interval          Transfer     Bitrate         Retr  Cwnd
[  5]  0.00-1.00   sec  111 MBytes   934 Mbits/sec    0    577 KBytes
[  5]  1.00-2.00   sec  112 MBytes   937 Mbits/sec    0    577 KBytes
[  5]  2.00-3.00   sec  112 MBytes   942 Mbits/sec    0    609 KBytes
[  5]  3.00-4.00   sec  113 MBytes   945 Mbits/sec    0    638 KBytes
[  5]  4.00-5.00   sec  112 MBytes   941 Mbits/sec    0    638 KBytes
[  5]  5.00-6.00   sec  112 MBytes   942 Mbits/sec    0    638 KBytes
[  5]  6.00-7.00   sec  112 MBytes   942 Mbits/sec    0    638 KBytes
[  5]  7.00-8.00   sec  112 MBytes   943 Mbits/sec    0    638 KBytes
[  5]  8.00-9.00   sec  112 MBytes   943 Mbits/sec    0    638 KBytes
[  5]  9.00-10.00  sec  112 MBytes   942 Mbits/sec    0    638 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bitrate         Retr
[  5]  0.00-10.00  sec  1.10 GBytes  941 Mbits/sec    0    sender
[  5]  0.00-10.00  sec  1.09 GBytes  938 Mbits/sec         receiver

3.2 i.MX8MM with tx ring size 1024

root@imx8mmevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 53350 connected to 192.168.3.6 port 5201
[ ID] Interval          Transfer     Bitrate         Retr  Cwnd
[  5]  0.00-1.00   sec  114 MBytes   952 Mbits/sec    0    585 KBytes
[  5]  1.00-2.00   sec  112 MBytes   942 Mbits/sec    0    585 KBytes
[  5]  2.00-3.00   sec  113 MBytes   947 Mbits/sec    0    585 KBytes
[  5]  3.00-4.00   sec  112 MBytes   940 Mbits/sec    0    648 KBytes
[  5]  4.00-5.00   sec  112 MBytes   944 Mbits/sec    0    648 KBytes
[  5]  5.00-6.00   sec  112 MBytes   944 Mbits/sec    0    648 KBytes
[  5]  6.00-7.00   sec  111 MBytes   933 Mbits/sec    0    648 KBytes
[  5]  7.00-8.00   sec  112 MBytes   944 Mbits/sec    0    648 KBytes
[  5]  8.00-9.00   sec  112 MBytes   944 Mbits/sec    0    648 KBytes
[  5]  9.00-10.00  sec  112 MBytes   944 Mbits/sec    0    648 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bitrate         Retr
[  5]  0.00-10.00  sec  1.10 GBytes  943 Mbits/sec    0    sender
[  5]  0.00-10.00  sec  1.09 GBytes  940 Mbits/sec         receiver

4.1 i.MX8MP with tx ring size 512

root@imx8mpevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 51892 connected to 192.168.3.6 port 5201
[ ID] Interval          Transfer     Bitrate         Retr  Cwnd
[  5]  0.00-1.00   sec  114 MBytes   959 Mbits/sec    0    594 KBytes
[  5]  1.00-2.00   sec  112 MBytes   940 Mbits/sec    0    626 KBytes
[  5]  2.00-3.00   sec  113 MBytes   946 Mbits/sec    0    626 KBytes
[  5]  3.00-4.00   sec  112 MBytes   937 Mbits/sec    0    626 KBytes
[  5]  4.00-5.00   sec  112 MBytes   940 Mbits/sec    0    626 KBytes
[  5]  5.00-6.00   sec  112 MBytes   940 Mbits/sec    0    626 KBytes
[  5]  6.00-7.00   sec  113 MBytes   946 Mbits/sec    0    626 KBytes
[  5]  7.00-8.00   sec  112 MBytes   939 Mbits/sec    0    626 KBytes
[  5]  8.00-9.00   sec  111 MBytes   935 Mbits/sec    0    626 KBytes
[  5]  9.00-10.00  sec  112 MBytes   943 Mbits/sec    0    626 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bitrate         Retr
[  5]  0.00-10.00  sec  1.10 GBytes  943 Mbits/sec    0    sender
[  5]  0.00-10.00  sec  1.09 GBytes  940 Mbits/sec         receiver

4.2 i.MX8MP with tx ring size 1024

root@imx8mpevk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 37922 connected to 192.168.3.6 port 5201
[ ID] Interval          Transfer     Bitrate         Retr  Cwnd
[  5]  0.00-1.00   sec  113 MBytes   951 Mbits/sec    0    608 KBytes
[  5]  1.00-2.00   sec  112 MBytes   937 Mbits/sec    0    608 KBytes
[  5]  2.00-3.00   sec  113 MBytes   947 Mbits/sec    0    608 KBytes
[  5]  3.00-4.00   sec  111 MBytes   934 Mbits/sec    0    608 KBytes
[  5]  4.00-5.00   sec  112 MBytes   942 Mbits/sec    0    608 KBytes
[  5]  5.00-6.00   sec  112 MBytes   939 Mbits/sec    0    608 KBytes
[  5]  6.00-7.00   sec  113 MBytes   949 Mbits/sec    0    608 KBytes
[  5]  7.00-8.00   sec  112 MBytes   942 Mbits/sec    0    608 KBytes
[  5]  8.00-9.00   sec  112 MBytes   936 Mbits/sec    0    608 KBytes
[  5]  9.00-10.00  sec  112 MBytes   942 Mbits/sec    0    608 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bitrate         Retr
[  5]  0.00-10.00  sec  1.10 GBytes  942 Mbits/sec    0    sender
[  5]  0.00-10.00  sec  1.09 GBytes  939 Mbits/sec         receiver

5.1 i.MX93 with tx ring size 512

root@imx93evk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 44216 connected to 192.168.3.6 port 5201
[ ID] Interval          Transfer     Bitrate         Retr  Cwnd
[  5]  0.00-1.00   sec  115 MBytes   965 Mbits/sec    0    656 KBytes
[  5]  1.00-2.00   sec  111 MBytes   934 Mbits/sec    0    656 KBytes
[  5]  2.00-3.00   sec  112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]  3.00-4.00   sec  112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]  4.00-5.00   sec  112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]  5.00-6.00   sec  111 MBytes   933 Mbits/sec    0    656 KBytes
[  5]  6.00-7.00   sec  112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]  7.00-8.00   sec  112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]  8.00-9.00   sec  112 MBytes   944 Mbits/sec    0    656 KBytes
[  5]  9.00-10.00  sec  112 MBytes   944 Mbits/sec    0    656 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bitrate         Retr
[  5]  0.00-10.00  sec  1.10 GBytes  944 Mbits/sec    0    sender
[  5]  0.00-10.00  sec  1.10 GBytes  941 Mbits/sec         receiver

5.2 i.MX93 with tx ring size 1024

root@imx93evk:~# iperf3 -c 192.168.3.6
Connecting to host 192.168.3.6, port 5201
[  5] local 192.168.3.9 port 51058 connected to 192.168.3.6 port 5201
[ ID] Interval          Transfer     Bitrate         Retr  Cwnd
[  5]  0.00-1.00   sec  114 MBytes   959 Mbits/sec    0    588 KBytes
[  5]  1.00-2.00   sec  112 MBytes   935 Mbits/sec    0    649 KBytes
[  5]  2.00-3.00   sec  112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]  3.00-4.00   sec  112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]  4.00-5.00   sec  112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]  5.00-6.00   sec  112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]  6.00-7.00   sec  111 MBytes   933 Mbits/sec    0    649 KBytes
[  5]  7.00-8.00   sec  112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]  8.00-9.00   sec  112 MBytes   944 Mbits/sec    0    649 KBytes
[  5]  9.00-10.00  sec  112 MBytes   944 Mbits/sec    0    649 KBytes
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval          Transfer     Bitrate         Retr
[  5]  0.00-10.00  sec  1.10 GBytes  943 Mbits/sec    0    sender
[  5]  0.00-10.00  sec  1.10 GBytes  940 Mbits/sec         receiver
On Wed, 5 Jul 2023 06:20:26 +0000 Wei Fang wrote:
> > > In addition, this patch also updates the tx_stop_threshold and the
> > > tx_wake_threshold of the tx ring. In the previous logic, the value
> > > of tx_stop_threshold is 217; however, the value of
> > > tx_wake_threshold is 147. It does not make sense that
> > > tx_wake_threshold is less than tx_stop_threshold.
> >
> > What do these actually mean? I could imagine that as the ring fills
> > you don't want to stop until it is 217/512 full. There is then some
> > hysteresis, such that it has to drop below 147/512 before more can
> > be added?
>
> You must have misunderstood; let me explain more clearly. The queue is
> stopped when the available BDs are less than tx_stop_threshold (217
> BDs), and the queue is woken when the available BDs are greater than
> tx_wake_threshold (147 BDs). So in most cases, the available BDs are
> greater than tx_wake_threshold when the queue is stopped, and the only
> effect is to delay packet sending. In my opinion, tx_wake_threshold
> should be greater than tx_stop_threshold: we should stop the queue
> when the available BDs are not enough for an skb to be attached, and
> wake the queue when the available BDs are sufficient for an skb.

But you shouldn't restart the queue for a single packet either.
Restarting for a single packet wastes CPU cycles, as there will be many
more stop / start operations. Two large packets seem like the absolute
minimum reasonable wake threshold.

Setting tx_stop_threshold to MAX_SKB_FRAGS doesn't seem right either,
as you won't be able to accept a full TSO frame.

Please split the change: the netdev_err_once() should be one patch, and
the change to the wake threshold a separate one.
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: July 6, 2023 2:11
> To: Wei Fang <wei.fang@nxp.com>
> Cc: Andrew Lunn <andrew@lunn.ch>; davem@davemloft.net;
> edumazet@google.com; pabeni@redhat.com; ast@kernel.org;
> daniel@iogearbox.net; hawk@kernel.org; john.fastabend@gmail.com;
> Shenwei Wang <shenwei.wang@nxp.com>; Clark Wang
> <xiaoning.wang@nxp.com>; netdev@vger.kernel.org; dl-linux-imx
> <linux-imx@nxp.com>; linux-kernel@vger.kernel.org; bpf@vger.kernel.org
> Subject: Re: [PATCH net 3/3] net: fec: increase the size of tx ring
> and update thresholds of tx ring
>
> On Wed, 5 Jul 2023 06:20:26 +0000 Wei Fang wrote:
> > > > In addition, this patch also updates the tx_stop_threshold and
> > > > the tx_wake_threshold of the tx ring. In the previous logic, the
> > > > value of tx_stop_threshold is 217; however, the value of
> > > > tx_wake_threshold is 147. It does not make sense that
> > > > tx_wake_threshold is less than tx_stop_threshold.
> > >
> > > What do these actually mean? I could imagine that as the ring
> > > fills you don't want to stop until it is 217/512 full. There is
> > > then some hysteresis, such that it has to drop below 147/512
> > > before more can be added?
> >
> > You must have misunderstood; let me explain more clearly. The queue
> > is stopped when the available BDs are less than tx_stop_threshold
> > (217 BDs), and the queue is woken when the available BDs are greater
> > than tx_wake_threshold (147 BDs). So in most cases, the available
> > BDs are greater than tx_wake_threshold when the queue is stopped,
> > and the only effect is to delay packet sending. In my opinion,
> > tx_wake_threshold should be greater than tx_stop_threshold: we
> > should stop the queue when the available BDs are not enough for an
> > skb to be attached, and wake the queue when the available BDs are
> > sufficient for an skb.
>
> But you shouldn't restart the queue for a single packet either.
> Restarting for a single packet wastes CPU cycles, as there will be
> many more stop / start operations. Two large packets seem like the
> absolute minimum reasonable wake threshold.
>
> Setting tx_stop_threshold to MAX_SKB_FRAGS doesn't seem right either,
> as you won't be able to accept a full TSO frame.

Maybe I should keep tx_stop_threshold unchanged, so that the queue is
stopped if the available BDs are not enough for a full TSO frame to be
attached, and then just change tx_wake_threshold to
tx_stop_threshold + 1, which I think is more reasonable.

> Please split the change: the netdev_err_once() should be one patch,
> and the change to the wake threshold a separate one.

Okay, thanks!
On Thu, 6 Jul 2023 01:44:49 +0000 Wei Fang wrote:
> > But you shouldn't restart the queue for a single packet either.
> > Restarting for a single packet wastes CPU cycles, as there will be
> > many more stop / start operations. Two large packets seem like the
> > absolute minimum reasonable wake threshold.
> >
> > Setting tx_stop_threshold to MAX_SKB_FRAGS doesn't seem right
> > either, as you won't be able to accept a full TSO frame.
>
> Maybe I should keep tx_stop_threshold unchanged, so that the queue is
> stopped if the available BDs are not enough for a full TSO frame to be
> attached, and then just change tx_wake_threshold to
> tx_stop_threshold + 1, which I think is more reasonable.

How about at least tx_stop_threshold + 2 * MAX_SKB_FRAGS? If a queue of
hundreds of entries is overflowing, we should be able to apply a
hysteresis of a few tens of entries. Do you see a difference in drops?
The packets from the stack should preferably stay in the qdiscs instead
of the driver queue, where AQM and scheduling can be applied.
> -----Original Message-----
> From: Jakub Kicinski <kuba@kernel.org>
> Sent: July 6, 2023 11:12
> To: Wei Fang <wei.fang@nxp.com>
> Cc: Andrew Lunn <andrew@lunn.ch>; davem@davemloft.net;
> edumazet@google.com; pabeni@redhat.com; ast@kernel.org;
> daniel@iogearbox.net; hawk@kernel.org; john.fastabend@gmail.com;
> Shenwei Wang <shenwei.wang@nxp.com>; Clark Wang
> <xiaoning.wang@nxp.com>; netdev@vger.kernel.org; dl-linux-imx
> <linux-imx@nxp.com>; linux-kernel@vger.kernel.org; bpf@vger.kernel.org
> Subject: Re: [PATCH net 3/3] net: fec: increase the size of tx ring
> and update thresholds of tx ring
>
> On Thu, 6 Jul 2023 01:44:49 +0000 Wei Fang wrote:
> > > But you shouldn't restart the queue for a single packet either.
> > > Restarting for a single packet wastes CPU cycles, as there will be
> > > many more stop / start operations. Two large packets seem like the
> > > absolute minimum reasonable wake threshold.
> > >
> > > Setting tx_stop_threshold to MAX_SKB_FRAGS doesn't seem right
> > > either, as you won't be able to accept a full TSO frame.
> >
> > Maybe I should keep tx_stop_threshold unchanged, so that the queue
> > is stopped if the available BDs are not enough for a full TSO frame
> > to be attached, and then just change tx_wake_threshold to
> > tx_stop_threshold + 1, which I think is more reasonable.
>
> How about at least tx_stop_threshold + 2 * MAX_SKB_FRAGS?

That's okay; the iperf performance is as good as before.

> If a queue of hundreds of entries is overflowing, we should be able to
> apply a hysteresis of a few tens of entries. Do you see a difference
> in drops?

I didn't see any packet loss.

> The packets from the stack should preferably stay in the qdiscs
> instead of the driver queue, where AQM and scheduling can be applied.
> > How does this affect platforms like Vybrid with its fast Ethernet?
> Sorry, I don't have the Vybrid platform, but I don't think it has much
> impact; at most it just takes up some more memory.

It has been 6 months since the page pool patches were posted and I
asked about benchmark results for other platforms like Vybrid. Is it so
hard to get hold of reference platforms? Fugang Duan used to have a
test farm of all sorts of boards and reported to me the regressions I
introduced with MDIO changes and PM changes. As somebody who seems to
be an NXP FEC maintainer, I would expect you to have access to a range
of hardware, especially since XDP and eBPF are a bit of a niche for the
embedded processors which NXP produces. You want to be sure your
changes don't regress the main use cases, which I guess are plain
networking.

> > Does the burst latency go up?
> No. For the FEC, when a packet is attached to the BDs, the software
> immediately triggers the hardware to send the packet. In addition, I
> think it may improve the latency, because the size of the tx ring
> becomes larger and more packets can be attached to the BD ring for
> burst traffic.

And a bigger burst means more latency. Read about bufferbloat.

While you have iperf running saturating the link, try a ping as well.
How does ping latency change with more TX buffers?

Ideally you want enough transmit buffers to keep the link full, but not
more. If the driver is using BQL, the network stack will help with
this.

> Below are the results on the i.MX6UL/8MM/8MP/8ULP/93 platforms; the
> i.MX6UL and i.MX8ULP only support Fast Ethernet, the others support
> 1G.

Thanks for the benchmark numbers. Please get into the habit of
including them. We like to see justification for any sort of
performance tweaks.

Andrew
> -----Original Message-----
> From: Andrew Lunn <andrew@lunn.ch>
> Sent: July 9, 2023 3:04
> To: Wei Fang <wei.fang@nxp.com>
> Cc: davem@davemloft.net; edumazet@google.com; kuba@kernel.org;
> pabeni@redhat.com; ast@kernel.org; daniel@iogearbox.net;
> hawk@kernel.org; john.fastabend@gmail.com; Shenwei Wang
> <shenwei.wang@nxp.com>; Clark Wang <xiaoning.wang@nxp.com>;
> netdev@vger.kernel.org; dl-linux-imx <linux-imx@nxp.com>;
> linux-kernel@vger.kernel.org; bpf@vger.kernel.org
> Subject: Re: [PATCH net 3/3] net: fec: increase the size of tx ring
> and update thresholds of tx ring
>
> > > How does this affect platforms like Vybrid with its fast Ethernet?
> > Sorry, I don't have the Vybrid platform, but I don't think it has
> > much impact; at most it just takes up some more memory.
>
> It has been 6 months since the page pool patches were posted and I
> asked about benchmark results for other platforms like Vybrid. Is it
> so hard to get hold of reference platforms? Fugang Duan used to have a
> test farm of all sorts of boards and reported to me the regressions I
> introduced with MDIO changes and PM changes. As somebody who seems to
> be an NXP FEC maintainer, I would expect you to have access to a range
> of hardware, especially since XDP and eBPF are a bit of a niche for
> the embedded processors which NXP produces. You want to be sure your
> changes don't regress the main use cases, which I guess are plain
> networking.

Sorry, the Vybrid platform is not currently maintained by us, and
Vybrid is also not included in our usual Yocto SDK RC version. Even if
I find a Vybrid board, it probably can't run the latest kernel image,
because the latest kernel image does not match the old Yocto SDK, and
the new Yocto SDK does not support the Vybrid platform. I also asked my
colleague in the test team who is in charge of Ethernet testing, and
she hadn't even heard of the Vybrid platform.

> > > Does the burst latency go up?
> > No, for fec, when a packet is attached to the BDs, the software will > > immediately trigger the hardware to send the packet. In addition, I > > think it may improve the latency, because the size of the tx ring > > becomes larger, and more packets can be attached to the BD ring for burst > traffic. > > And a bigger burst means more latency. Read about Buffer bloat. > Perhaps, but not necessarily. For example, in some cases that some burst packets maybe stay in Qdisc instead of hardware queue if ring size is small. > While you have iperf running saturating the link, try a ping as well. How does > ping latency change with more TX buffers? > Per your suggestions, I tested on i.MX8ULP, i.MX8MM and i.MX93 platforms. The results do not appear to be very different. i.MX93 with ring size 1024: fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.114 -c 16 PING 10.193.102.114 (10.193.102.114) 56(84) bytes of data. 64 bytes from 10.193.102.114: icmp_seq=1 ttl=63 time=3.78 ms 64 bytes from 10.193.102.114: icmp_seq=2 ttl=63 time=2.16 ms 64 bytes from 10.193.102.114: icmp_seq=3 ttl=63 time=3.31 ms 64 bytes from 10.193.102.114: icmp_seq=4 ttl=63 time=2.11 ms 64 bytes from 10.193.102.114: icmp_seq=5 ttl=63 time=3.43 ms 64 bytes from 10.193.102.114: icmp_seq=6 ttl=63 time=3.20 ms 64 bytes from 10.193.102.114: icmp_seq=7 ttl=63 time=3.20 ms 64 bytes from 10.193.102.114: icmp_seq=8 ttl=63 time=3.75 ms 64 bytes from 10.193.102.114: icmp_seq=9 ttl=63 time=3.21 ms 64 bytes from 10.193.102.114: icmp_seq=10 ttl=63 time=3.76 ms 64 bytes from 10.193.102.114: icmp_seq=11 ttl=63 time=2.16 ms 64 bytes from 10.193.102.114: icmp_seq=12 ttl=63 time=2.67 ms 64 bytes from 10.193.102.114: icmp_seq=13 ttl=63 time=3.59 ms 64 bytes from 10.193.102.114: icmp_seq=14 ttl=63 time=2.55 ms 64 bytes from 10.193.102.114: icmp_seq=15 ttl=63 time=2.54 ms 64 bytes from 10.193.102.114: icmp_seq=16 ttl=63 time=3.88 ms --- 10.193.102.114 ping statistics --- 16 packets transmitted, 16 received, 0% packet loss, time 
15027ms rtt min/avg/max/mdev = 2.112/3.082/3.875/0.606 ms i.MX93 with ring size 512: fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.114 -c 16 PING 10.193.102.114 (10.193.102.114) 56(84) bytes of data. 64 bytes from 10.193.102.114: icmp_seq=1 ttl=63 time=2.74 ms 64 bytes from 10.193.102.114: icmp_seq=2 ttl=63 time=3.32 ms 64 bytes from 10.193.102.114: icmp_seq=3 ttl=63 time=2.72 ms 64 bytes from 10.193.102.114: icmp_seq=4 ttl=63 time=3.36 ms 64 bytes from 10.193.102.114: icmp_seq=5 ttl=63 time=3.41 ms 64 bytes from 10.193.102.114: icmp_seq=6 ttl=63 time=2.67 ms 64 bytes from 10.193.102.114: icmp_seq=7 ttl=63 time=2.77 ms 64 bytes from 10.193.102.114: icmp_seq=8 ttl=63 time=3.38 ms 64 bytes from 10.193.102.114: icmp_seq=9 ttl=63 time=2.54 ms 64 bytes from 10.193.102.114: icmp_seq=10 ttl=63 time=3.36 ms 64 bytes from 10.193.102.114: icmp_seq=11 ttl=63 time=3.44 ms 64 bytes from 10.193.102.114: icmp_seq=12 ttl=63 time=2.80 ms 64 bytes from 10.193.102.114: icmp_seq=13 ttl=63 time=2.86 ms 64 bytes from 10.193.102.114: icmp_seq=14 ttl=63 time=3.90 ms 64 bytes from 10.193.102.114: icmp_seq=15 ttl=63 time=2.50 ms 64 bytes from 10.193.102.114: icmp_seq=16 ttl=63 time=2.89 ms --- 10.193.102.114 ping statistics --- 16 packets transmitted, 16 received, 0% packet loss, time 15028ms rtt min/avg/max/mdev = 2.496/3.040/3.898/0.394 ms i.MX8MM with ring size 1024: fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.126 -c 16 PING 10.193.102.126 (10.193.102.126) 56(84) bytes of data. 
64 bytes from 10.193.102.126: icmp_seq=1 ttl=127 time=1.34 ms 64 bytes from 10.193.102.126: icmp_seq=2 ttl=127 time=2.07 ms 64 bytes from 10.193.102.126: icmp_seq=3 ttl=127 time=2.40 ms 64 bytes from 10.193.102.126: icmp_seq=4 ttl=127 time=1.48 ms 64 bytes from 10.193.102.126: icmp_seq=5 ttl=127 time=1.69 ms 64 bytes from 10.193.102.126: icmp_seq=6 ttl=127 time=1.54 ms 64 bytes from 10.193.102.126: icmp_seq=7 ttl=127 time=2.30 ms 64 bytes from 10.193.102.126: icmp_seq=8 ttl=127 time=1.94 ms 64 bytes from 10.193.102.126: icmp_seq=9 ttl=127 time=4.25 ms 64 bytes from 10.193.102.126: icmp_seq=10 ttl=127 time=1.75 ms 64 bytes from 10.193.102.126: icmp_seq=11 ttl=127 time=1.25 ms 64 bytes from 10.193.102.126: icmp_seq=12 ttl=127 time=2.04 ms 64 bytes from 10.193.102.126: icmp_seq=13 ttl=127 time=2.31 ms 64 bytes from 10.193.102.126: icmp_seq=14 ttl=127 time=2.18 ms 64 bytes from 10.193.102.126: icmp_seq=15 ttl=127 time=2.25 ms 64 bytes from 10.193.102.126: icmp_seq=16 ttl=127 time=1.37 ms --- 10.193.102.126 ping statistics --- 16 packets transmitted, 16 received, 0% packet loss, time 15026ms rtt min/avg/max/mdev = 1.248/2.011/4.250/0.686 ms i.MX8MM with ring size 512: fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.126 -c 16 PING 10.193.102.126 (10.193.102.126) 56(84) bytes of data. 
64 bytes from 10.193.102.126: icmp_seq=1 ttl=63 time=4.82 ms 64 bytes from 10.193.102.126: icmp_seq=2 ttl=63 time=4.67 ms 64 bytes from 10.193.102.126: icmp_seq=3 ttl=63 time=3.74 ms 64 bytes from 10.193.102.126: icmp_seq=4 ttl=63 time=3.87 ms 64 bytes from 10.193.102.126: icmp_seq=5 ttl=63 time=3.30 ms 64 bytes from 10.193.102.126: icmp_seq=6 ttl=63 time=3.79 ms 64 bytes from 10.193.102.126: icmp_seq=7 ttl=127 time=2.12 ms 64 bytes from 10.193.102.126: icmp_seq=8 ttl=127 time=1.99 ms 64 bytes from 10.193.102.126: icmp_seq=9 ttl=127 time=2.15 ms 64 bytes from 10.193.102.126: icmp_seq=10 ttl=127 time=1.82 ms 64 bytes from 10.193.102.126: icmp_seq=11 ttl=127 time=1.92 ms 64 bytes from 10.193.102.126: icmp_seq=12 ttl=127 time=1.23 ms 64 bytes from 10.193.102.126: icmp_seq=13 ttl=127 time=2.00 ms 64 bytes from 10.193.102.126: icmp_seq=14 ttl=127 time=1.66 ms 64 bytes from 10.193.102.126: icmp_seq=15 ttl=127 time=1.75 ms 64 bytes from 10.193.102.126: icmp_seq=16 ttl=127 time=2.24 ms --- 10.193.102.126 ping statistics --- 16 packets transmitted, 16 received, 0% packet loss, time 15026ms rtt min/avg/max/mdev = 1.226/2.691/4.820/1.111 ms i.MX8ULP with ring size 1024: fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.216 -c 16 PING 10.193.102.216 (10.193.102.216) 56(84) bytes of data. 
64 bytes from 10.193.102.216: icmp_seq=1 ttl=63 time=3.40 ms
64 bytes from 10.193.102.216: icmp_seq=2 ttl=63 time=5.46 ms
64 bytes from 10.193.102.216: icmp_seq=3 ttl=63 time=5.55 ms
64 bytes from 10.193.102.216: icmp_seq=4 ttl=63 time=5.97 ms
64 bytes from 10.193.102.216: icmp_seq=5 ttl=63 time=6.26 ms
64 bytes from 10.193.102.216: icmp_seq=6 ttl=63 time=0.963 ms
64 bytes from 10.193.102.216: icmp_seq=7 ttl=63 time=4.10 ms
64 bytes from 10.193.102.216: icmp_seq=8 ttl=63 time=4.55 ms
64 bytes from 10.193.102.216: icmp_seq=9 ttl=63 time=1.24 ms
64 bytes from 10.193.102.216: icmp_seq=10 ttl=63 time=6.96 ms
64 bytes from 10.193.102.216: icmp_seq=11 ttl=63 time=3.27 ms
64 bytes from 10.193.102.216: icmp_seq=12 ttl=63 time=6.57 ms
64 bytes from 10.193.102.216: icmp_seq=13 ttl=63 time=2.99 ms
64 bytes from 10.193.102.216: icmp_seq=14 ttl=63 time=1.70 ms
64 bytes from 10.193.102.216: icmp_seq=15 ttl=63 time=1.79 ms
64 bytes from 10.193.102.216: icmp_seq=16 ttl=63 time=1.42 ms

--- 10.193.102.216 ping statistics ---
16 packets transmitted, 16 received, 0% packet loss, time 15026ms
rtt min/avg/max/mdev = 0.963/3.886/6.955/2.009 ms

i.MX8ULP with ring size 512:
fangwei@fangwei-OptiPlex-7050:~$ ping 10.193.102.216 -c 16
PING 10.193.102.216 (10.193.102.216) 56(84) bytes of data.
64 bytes from 10.193.102.216: icmp_seq=1 ttl=63 time=5.70 ms
64 bytes from 10.193.102.216: icmp_seq=2 ttl=63 time=5.89 ms
64 bytes from 10.193.102.216: icmp_seq=3 ttl=63 time=3.37 ms
64 bytes from 10.193.102.216: icmp_seq=4 ttl=63 time=5.07 ms
64 bytes from 10.193.102.216: icmp_seq=5 ttl=63 time=1.47 ms
64 bytes from 10.193.102.216: icmp_seq=6 ttl=63 time=3.45 ms
64 bytes from 10.193.102.216: icmp_seq=7 ttl=63 time=1.35 ms
64 bytes from 10.193.102.216: icmp_seq=8 ttl=63 time=6.62 ms
64 bytes from 10.193.102.216: icmp_seq=9 ttl=63 time=1.41 ms
64 bytes from 10.193.102.216: icmp_seq=10 ttl=63 time=6.43 ms
64 bytes from 10.193.102.216: icmp_seq=11 ttl=63 time=1.41 ms
64 bytes from 10.193.102.216: icmp_seq=12 ttl=63 time=6.75 ms
64 bytes from 10.193.102.216: icmp_seq=13 ttl=63 time=4.76 ms
64 bytes from 10.193.102.216: icmp_seq=14 ttl=63 time=3.85 ms
64 bytes from 10.193.102.216: icmp_seq=15 ttl=63 time=3.50 ms
64 bytes from 10.193.102.216: icmp_seq=16 ttl=63 time=1.37 ms

--- 10.193.102.216 ping statistics ---
16 packets transmitted, 16 received, 0% packet loss, time 15027ms
rtt min/avg/max/mdev = 1.349/3.900/6.749/1.985 ms

> Ideally you want enough transmit buffers to keep the link full, but not more.
> If the driver is using BQL, the network stack will help with this.
>
> > Below are the results on i.MX6UL/8MM/8MP/8ULP/93 platforms, i.MX6UL
> > and 8ULP only support Fast Ethernet. Others support 1G.
>
> Thanks for the benchmark numbers. Please get into the habit of including
> them. We like to see justification for any sort of performance tweaks.
>
> Andrew
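[Editorial note: to make the BQL suggestion quoted above concrete, here is a
minimal stand-alone model of the idea — the stack bounds the *bytes* handed to
the NIC rather than the descriptor count. This is my own fixed-limit sketch,
not the kernel's implementation (lib/dynamic_queue_limits.c, driven by
netdev_tx_sent_queue()/netdev_tx_completed_queue(), which also adapts the
limit at completion time); all names and the 3000-byte limit are illustrative.]

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Toy fixed-limit model of Byte Queue Limits: stop feeding the driver
 * once 'limit' bytes are in flight, no matter how many descriptors the
 * ring still has free. */
struct bql_model {
	unsigned int limit;     /* max bytes allowed in flight */
	unsigned int inflight;  /* bytes handed to the NIC, not yet completed */
	bool stopped;
};

static bool bql_sent(struct bql_model *q, unsigned int bytes)
{
	if (q->stopped)
		return false;
	q->inflight += bytes;          /* analogue of netdev_tx_sent_queue() */
	if (q->inflight >= q->limit)
		q->stopped = true;
	return true;
}

static void bql_completed(struct bql_model *q, unsigned int bytes)
{
	q->inflight -= bytes;          /* analogue of netdev_tx_completed_queue() */
	if (q->inflight < q->limit)
		q->stopped = false;
}

int main(void)
{
	struct bql_model q = { .limit = 3000 };  /* roughly two full-size frames */
	int queued = 0;

	/* With 1514-byte frames only two fit under the byte limit, even
	 * though a 1024-entry descriptor ring could hold far more: the
	 * burst (and hence the latency a later packet sees) stays small. */
	while (bql_sent(&q, 1514))
		queued++;
	assert(queued == 2 && q.stopped);

	bql_completed(&q, 1514);       /* one completion reopens the queue */
	assert(!q.stopped);
	printf("ok: inflight=%u limit=%u\n", q.inflight, q.limit);
	return 0;
}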
> Sorry, the Vybrid platform is not currently maintained by us, and Vybrid is
> also not included in our usual Yocto SDK RC version.

Your Yocto SDK version does not matter. We are talking about mainline here...
You are maintaining the mainline driver, and submitting patches to extend the
mainline driver.

> Even if I find a Vybrid board, I think it probably can't run with the latest
> kernel image, because the latest kernel image does not match the old Yocto
> SDK, and the new Yocto SDK does not support the Vybrid platform. I also asked
> my colleague in the test team who is in charge of Ethernet testing; she hadn't
> even heard of the Vybrid platform.

Well, 6.5-rc1 does run on Vybrid, I have a number of boards with it. You will
also find that many in-flight entertainment systems have a box under each row
of seats in an aeroplane with a Vybrid controlling a Marvell switch. Take a
look at arch/arm/boot/dts/vf610-zii-*

You will also find a number of DTS files for imx51, imx6 and imx7 used for ZII
products which are back-of-seat displays, and make heavy use of networking with
the FEC. So I have an interest in the FEC for all these platforms.

> > And a bigger burst means more latency. Read about Buffer bloat.
>
> Perhaps, but not necessarily. For example, in some cases some burst packets
> may stay in the Qdisc instead of the hardware queue if the ring size is small.

Which is what you want, so high-priority packets can jump to the head of the
queue.

> > While you have iperf running saturating the link, try a ping as well. How
> > does ping latency change with more TX buffers?
>
> Per your suggestions, I tested on the i.MX8ULP, i.MX8MM and i.MX93 platforms.
> The results do not appear to be very different.

Thanks for the benchmark numbers. They look O.K.

Andrew
diff --git a/drivers/net/ethernet/freescale/fec.h b/drivers/net/ethernet/freescale/fec.h
index 8c0226d061fe..63a053dea819 100644
--- a/drivers/net/ethernet/freescale/fec.h
+++ b/drivers/net/ethernet/freescale/fec.h
@@ -355,7 +355,7 @@ struct bufdesc_ex {
 #define RX_RING_SIZE	(FEC_ENET_RX_FRPPG * FEC_ENET_RX_PAGES)
 #define FEC_ENET_TX_FRSIZE	2048
 #define FEC_ENET_TX_FRPPG	(PAGE_SIZE / FEC_ENET_TX_FRSIZE)
-#define TX_RING_SIZE		512	/* Must be power of two */
+#define TX_RING_SIZE		1024	/* Must be power of two */
 #define TX_RING_MOD_MASK	511	/*   for this to work */
 
 #define BD_ENET_RX_INT		0x00800000
diff --git a/drivers/net/ethernet/freescale/fec_main.c b/drivers/net/ethernet/freescale/fec_main.c
index 940d3afe1d24..6c255c0d6f41 100644
--- a/drivers/net/ethernet/freescale/fec_main.c
+++ b/drivers/net/ethernet/freescale/fec_main.c
@@ -3348,9 +3348,8 @@ static int fec_enet_alloc_queue(struct net_device *ndev)
 		txq->bd.ring_size = TX_RING_SIZE;
 		fep->total_tx_ring_size += fep->tx_queue[i]->bd.ring_size;
 
-		txq->tx_stop_threshold = FEC_MAX_SKB_DESCS;
-		txq->tx_wake_threshold =
-			(txq->bd.ring_size - txq->tx_stop_threshold) / 2;
+		txq->tx_stop_threshold = MAX_SKB_FRAGS;
+		txq->tx_wake_threshold = MAX_SKB_FRAGS + 1;
 
 		txq->tso_hdrs = dma_alloc_coherent(&fep->pdev->dev,
 					txq->bd.ring_size * TSO_HEADER_SIZE,
@@ -3837,7 +3836,7 @@ static int fec_enet_txq_xmit_frame(struct fec_enet_private *fep,
 
 	entries_free = fec_enet_get_free_txdesc_num(txq);
 	if (entries_free < MAX_SKB_FRAGS + 1) {
-		netdev_err(fep->netdev, "NOT enough BD for SG!\n");
+		netdev_err_once(fep->netdev, "NOT enough BD for SG!\n");
 		return -EBUSY;
 	}
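[Editorial note: to illustrate the threshold change in the patch above, here is
a stand-alone sketch of the stop/wake hysteresis it sets up — a userspace model
of my own, not driver code. It assumes MAX_SKB_FRAGS == 17 (the common kernel
default) and the 1024-entry ring from the patch; the struct and function names
are hypothetical. The queue stops once fewer than tx_stop_threshold descriptors
are free, and wakes only when tx_wake_threshold are free again, so wake must be
at least stop for the queue to ever restart.]

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define RING_SIZE      1024   /* TX_RING_SIZE from the patch */
#define MAX_SKB_FRAGS  17     /* assumed kernel default */

struct txq_model {
	int used;          /* descriptors currently in flight */
	bool stopped;      /* models netif_tx_queue_stopped() */
};

static int free_descs(const struct txq_model *q)
{
	return RING_SIZE - q->used;
}

/* Transmit path: refuse if stopped or short of descriptors, and stop
 * the queue once fewer than tx_stop_threshold descriptors remain free. */
static bool model_xmit(struct txq_model *q, int descs)
{
	if (q->stopped || free_descs(q) < descs)
		return false;                      /* would be NETDEV_TX_BUSY */
	q->used += descs;
	if (free_descs(q) < MAX_SKB_FRAGS)         /* tx_stop_threshold */
		q->stopped = true;
	return true;
}

/* Completion path: wake the queue once at least tx_wake_threshold
 * descriptors are free again. */
static void model_tx_complete(struct txq_model *q, int descs)
{
	q->used -= descs;
	if (q->stopped && free_descs(q) >= MAX_SKB_FRAGS + 1) /* tx_wake_threshold */
		q->stopped = false;
}

int main(void)
{
	struct txq_model q = { 0, false };

	/* Fill the ring with single-descriptor frames until it stops. */
	while (model_xmit(&q, 1))
		;
	assert(q.stopped);
	assert(free_descs(&q) == MAX_SKB_FRAGS - 1);

	/* One completed descriptor is not enough to wake; two are. */
	model_tx_complete(&q, 1);
	assert(q.stopped);
	model_tx_complete(&q, 1);
	assert(!q.stopped);

	printf("ok: stopped at %d free, woke at %d free\n",
	       MAX_SKB_FRAGS - 1, MAX_SKB_FRAGS + 1);
	return 0;
}

With these values the queue only ever stops when a worst-case scatter-gather
frame could no longer fit, which is what lets the slow path keep using the
ring under heavy XDP traffic.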