Message ID: 20220413153015.453864-1-maciej.fijalkowski@intel.com (mailing list archive)
Series: xsk: stop NAPI Rx processing on full XSK RQ
Hello:

This series was applied to bpf/bpf-next.git (master)
by Daniel Borkmann <daniel@iogearbox.net>:

On Wed, 13 Apr 2022 17:30:01 +0200 you wrote:
> v2:
> - add likely for internal redirect return codes in ice and ixgbe
>   (Jesper)
> - do not drop the buffer that head pointed to at full XSK RQ (Maxim)
> - terminate Rx loop only when need_wakeup feature is enabled (Maxim)
> - reword from 'stop softirq processing' to 'stop NAPI Rx processing'
> - s/ENXIO/EINVAL in mlx5 and stmmac's ndo_xsk_wakeup to keep it
>   consitent with Intel's drivers (Maxim)
> - include Jesper's Acks
>
> [...]

Here is the summary with links:
  - [v2,bpf-next,01/14] xsk: improve xdp_do_redirect() error codes
    https://git.kernel.org/bpf/bpf-next/c/c6c1f11b691e
  - [v2,bpf-next,02/14] xsk: diversify return codes in xsk_rcv_check()
    https://git.kernel.org/bpf/bpf-next/c/2be4a677ccb2
  - [v2,bpf-next,03/14] ice: xsk: decorate ICE_XDP_REDIR with likely()
    https://git.kernel.org/bpf/bpf-next/c/0bd5ab511e30
  - [v2,bpf-next,04/14] ixgbe: xsk: decorate IXGBE_XDP_REDIR with likely()
    https://git.kernel.org/bpf/bpf-next/c/d090c885860f
  - [v2,bpf-next,05/14] ice: xsk: terminate Rx side of NAPI when XSK Rx queue gets full
    https://git.kernel.org/bpf/bpf-next/c/50ae06648073
  - [v2,bpf-next,06/14] i40e: xsk: terminate Rx side of NAPI when XSK Rx queue gets full
    https://git.kernel.org/bpf/bpf-next/c/b8aef650e549
  - [v2,bpf-next,07/14] ixgbe: xsk: terminate Rx side of NAPI when XSK Rx queue gets full
    https://git.kernel.org/bpf/bpf-next/c/c7dd09fd4628
  - [v2,bpf-next,08/14] ice: xsk: diversify return values from xsk_wakeup call paths
    https://git.kernel.org/bpf/bpf-next/c/ed8a6bc60f9e
  - [v2,bpf-next,09/14] i40e: xsk: diversify return values from xsk_wakeup call paths
    https://git.kernel.org/bpf/bpf-next/c/ed7ae2d62217
  - [v2,bpf-next,10/14] ixgbe: xsk: diversify return values from xsk_wakeup call paths
    https://git.kernel.org/bpf/bpf-next/c/0f8bf018899e
  - [v2,bpf-next,11/14] mlx5: xsk: diversify return values from xsk_wakeup call paths
    https://git.kernel.org/bpf/bpf-next/c/7b7f2f273d87
  - [v2,bpf-next,12/14] stmmac: xsk: diversify return values from xsk_wakeup call paths
    https://git.kernel.org/bpf/bpf-next/c/a817ead4154d
  - [v2,bpf-next,13/14] ice: xsk: avoid refilling single Rx descriptors
    https://git.kernel.org/bpf/bpf-next/c/4efad196163f
  - [v2,bpf-next,14/14] xsk: drop ternary operator from xskq_cons_has_entries
    https://git.kernel.org/bpf/bpf-next/c/0fb53aabc5fc

You are awesome, thank you!
Hi Maciej,

On Wed, 2022-04-13 at 17:30 +0200, Maciej Fijalkowski wrote:
> v2:
> - add likely for internal redirect return codes in ice and ixgbe
>   (Jesper)
> - do not drop the buffer that head pointed to at full XSK RQ (Maxim)

I found an issue with this approach. If you don't drop that packet,
you'll need to run the XDP program for the same packet again. If the
XDP program is anything more complex than "redirect-everything-to-XSK",
it will get confused, for example, if it tracks any state or counts
anything.

We can't check whether there is enough space in the XSK RX ring before
running the XDP program, as we don't know in advance which XSK socket
will be selected.

We can't store bpf_redirect_info across NAPI calls to avoid running the
XDP program a second time, as bpf_redirect_info is protected by RCU and
the assumption that the whole XDP_REDIRECT operation happens within one
NAPI cycle.

I see the following options:

1. Drop the packet when an overflow happens. The problem is that it
will happen systematically, but it's still better than the old behavior
(drop multiple packets when an overflow happens and hog the CPU).

2. Make this feature opt-in. If the application opts in, it guarantees
to take measures to handle duplicate packets in XDP properly.

3. Require the XDP program to indicate it supports being called
multiple times for the same packet and pass a flag to it if it's a
repeated run. Drop the packet in the driver if the XDP program doesn't
indicate this support. The indication can be done similarly to how it's
implemented for XDP multi buffer.

Thoughts? Other options?

Thanks,
Max

> - terminate Rx loop only when need_wakeup feature is enabled (Maxim)
> - reword from 'stop softirq processing' to 'stop NAPI Rx processing'
> - s/ENXIO/EINVAL in mlx5 and stmmac's ndo_xsk_wakeup to keep it
>   consitent with Intel's drivers (Maxim)
> - include Jesper's Acks
>
> Hi!
>
> This is a revival of Bjorn's idea [0] to break NAPI loop when XSK Rx
> queue gets full which in turn makes it impossible to keep on
> successfully producing descriptors to XSK Rx ring. By breaking out of
> the driver side immediately we will give the user space opportunity for
> consuming descriptors from XSK Rx ring and therefore provide room in the
> ring so that HW Rx -> XSK Rx redirection can be done.
>
> Maxim asked and Jesper agreed on simplifying Bjorn's original API used
> for detecting the event of interest, so let's just simply check for
> -ENOBUFS within Intel's ZC drivers after an attempt to redirect a buffer
> to XSK Rx. No real need for redirect API extension.
>
> One might ask why it is still relevant even after having proper busy
> poll support in place - here is the justification.
>
> For xdpsock that was:
> - run for l2fwd scenario,
> - app/driver processing took place on the same core in busy poll
>   with 2048 budget,
> - HW ring sizes Tx 256, Rx 2048,
>
> this work improved throughput by 78% and reduced Rx queue full statistic
> bump by 99%.
>
> For testing ice, make sure that you have [1] present on your side.
>
> This set, besides the work described above, carries also improvements
> around return codes in various XSK paths and lastly a minor optimization
> for xskq_cons_has_entries(), a helper that might be used when XSK Rx
> batching would make it to the kernel.
>
> Link to v1 and discussion around it is at [2].
>
> Thanks!
> MF
>
> [0]: https://lore.kernel.org/bpf/20200904135332.60259-1-bjorn.topel@gmail.com/
> [1]: https://lore.kernel.org/netdev/20220317175727.340251-1-maciej.fijalkowski@intel.com/
> [2]: https://lore.kernel.org/bpf/5864171b-1e08-1b51-026e-5f09b8945076@nvidia.com/T/
>
> Björn Töpel (1):
>   xsk: improve xdp_do_redirect() error codes
>
> Maciej Fijalkowski (13):
>   xsk: diversify return codes in xsk_rcv_check()
>   ice: xsk: decorate ICE_XDP_REDIR with likely()
>   ixgbe: xsk: decorate IXGBE_XDP_REDIR with likely()
>   ice: xsk: terminate Rx side of NAPI when XSK Rx queue gets full
>   i40e: xsk: terminate Rx side of NAPI when XSK Rx queue gets full
>   ixgbe: xsk: terminate Rx side of NAPI when XSK Rx queue gets full
>   ice: xsk: diversify return values from xsk_wakeup call paths
>   i40e: xsk: diversify return values from xsk_wakeup call paths
>   ixgbe: xsk: diversify return values from xsk_wakeup call paths
>   mlx5: xsk: diversify return values from xsk_wakeup call paths
>   stmmac: xsk: diversify return values from xsk_wakeup call paths
>   ice: xsk: avoid refilling single Rx descriptors
>   xsk: drop ternary operator from xskq_cons_has_entries
>
>  .../ethernet/intel/i40e/i40e_txrx_common.h    |  1 +
>  drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 38 ++++++++-----
>  drivers/net/ethernet/intel/ice/ice_txrx.h     |  1 +
>  drivers/net/ethernet/intel/ice/ice_xsk.c      | 53 ++++++++++++-------
>  .../ethernet/intel/ixgbe/ixgbe_txrx_common.h  |  1 +
>  drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 52 ++++++++++--------
>  .../ethernet/mellanox/mlx5/core/en/xsk/tx.c   |  2 +-
>  .../net/ethernet/stmicro/stmmac/stmmac_main.c |  4 +-
>  net/xdp/xsk.c                                 |  4 +-
>  net/xdp/xsk_queue.h                           |  4 +-
>  10 files changed, 99 insertions(+), 61 deletions(-)
Anyone from Intel? Maciej, Björn, Magnus?

Does anyone else have anything to say? IMO, calling XDP for the same
packet multiple times is a bug, we should agree on some sane solution.

On Thu, 2022-08-18 at 14:26 +0300, Maxim Mikityanskiy wrote:
> Hi Maciej,
>
> [...]
On Tue, Aug 23, 2022 at 09:49:43AM +0000, Maxim Mikityanskiy wrote:
> Anyone from Intel? Maciej, Björn, Magnus?

Hey Maxim,

how about keeping it simple and going with option 1? This behavior was
even proposed in the v1 submission of the patch set we're talking about.

> Does anyone else have anything to say? IMO, calling XDP for the same
> packet multiple times is a bug, we should agree on some sane solution.
>
> [...]
On Tue, 2022-08-23 at 13:24 +0200, Maciej Fijalkowski wrote:
> On Tue, Aug 23, 2022 at 09:49:43AM +0000, Maxim Mikityanskiy wrote:
> > Anyone from Intel? Maciej, Björn, Magnus?
>
> Hey Maxim,
>
> how about keeping it simple and going with option 1? This behavior was
> even proposed in the v1 submission of the patch set we're talking about.

Yeah, I know it was the behavior in v1. It was me who suggested not
dropping that packet, and I didn't realize back then that it had this
undesired side effect - sorry for that!

Option 1 sounds good to me as the first remedy, we can start with that.

However, it's not perfect: when NAPI and the application are pinned to
the same core, if the fill ring is bigger than the RX ring (which makes
sense in case of multiple sockets on the same UMEM), the driver will
constantly get into this condition, drop one packet, yield to
userspace, the application will of course clean up the RX ring, but
then the process will repeat.

That means we'll always have a small percentage of packets dropped,
which may trigger the congestion control algorithms on the other side,
slowing down the TX to unacceptable speeds (because packet drops won't
disappear after slowing down just a little).

Given the above, we may need a more complex solution for the long term.
What do you think?

Also, if the application uses poll(), this whole logic (either v1 or
v2) seems not needed, because poll() returns to the application when
something becomes available in the RX ring, but I guess the reason for
adding it was that fantastic 78% performance improvement mentioned in
the cover letter?

> [...]
Maxim Mikityanskiy wrote:
> On Tue, 2022-08-23 at 13:24 +0200, Maciej Fijalkowski wrote:
> > On Tue, Aug 23, 2022 at 09:49:43AM +0000, Maxim Mikityanskiy wrote:
> > > Anyone from Intel? Maciej, Björn, Magnus?
> >
> > Hey Maxim,
> >
> > how about keeping it simple and going with option 1? This behavior was
> > even proposed in the v1 submission of the patch set we're talking about.
>
> Yeah, I know it was the behavior in v1. It was me who suggested not
> dropping that packet, and I didn't realize back then that it had this
> undesired side effect - sorry for that!

Just want to reiterate what was said originally, you'll definitely
confuse our XDP programs if they ever saw the same pkt twice. It would
confuse metrics and any "tap" and so on.

> Option 1 sounds good to me as the first remedy, we can start with that.
>
> However, it's not perfect: when NAPI and the application are pinned to
> the same core, if the fill ring is bigger than the RX ring (which makes
> sense in case of multiple sockets on the same UMEM), the driver will
> constantly get into this condition, drop one packet, yield to
> userspace, the application will of course clean up the RX ring, but
> then the process will repeat.

Maybe dumb question, haven't followed the entire thread or at least
don't recall it. Could you yield when you hit a high water mark at
some point before pkt drop?

> [...]
On Tue, 2022-08-23 at 22:18 -0700, John Fastabend wrote:
> Maxim Mikityanskiy wrote:
> > [...]
>
> Maybe dumb question haven't followed the entire thread or at least
> don't recall it. Could you yield when you hit a high water mark at
> some point before pkt drop?

That would be an ideal solution, but it doesn't sound feasible to me.
There may be multiple XSK sockets on the same RX queue, and each socket
has its own RX ring, which can be full or have some space. We don't
know in advance which RX ring will be chosen by the XDP program (if any
at all; the XDP program can still drop, pass or retransmit the packet),
so we can't check the watermark on a specific ring before running XDP.

It may be possible to iterate over all sockets and stop the processing
if any socket's RX ring is full, but that would be a heavy thing to do
per packet. It should be possible to optimize it to run once per NAPI
cycle, checking that each RX ring has enough space to fit the whole
NAPI budget, but I'm still not sure of performance implications, and it
sounds overly protective.

> > That means we'll always have a small percentage of packets dropped,
> > which may trigger the congestion control algorithms on the other side,
> > slowing down the TX to unacceptable speeds (because packet drops won't
> > disappear after slowing down just a little).
> >
> > Given the above, we may need a more complex solution for the long term.
> > What do you think?
> >
> > [...]