[net-next,v3,0/4] Add support to do threaded napi busy poll


Message ID

20250205001052.2590140-1-skhawaja@google.com (mailing list archive)

Headers

Date: Wed,  5 Feb 2025 00:10:48 +0000
Precedence: bulk
Mime-Version: 1.0
Message-ID: <20250205001052.2590140-1-skhawaja@google.com>
Subject: [PATCH net-next v3 0/4] Add support to do threaded napi busy poll
From: Samiullah Khawaja <skhawaja@google.com>
To: Jakub Kicinski <kuba@kernel.org>,
 "David S . Miller " <davem@davemloft.net>,
	Eric Dumazet <edumazet@google.com>, Paolo Abeni <pabeni@redhat.com>,
 almasrymina@google.com
Cc: netdev@vger.kernel.org, skhawaja@google.com
Content-Type: text/plain; charset="UTF-8"

Series

Add support to do threaded napi busy poll

Message

Samiullah Khawaja Feb. 5, 2025, 12:10 a.m. UTC

Extend the already existing support of threaded napi poll to do continuous
busy polling.

This is used for doing continuous polling of napi to fetch descriptors
from backing RX/TX queues for low latency applications. Allow enabling
of threaded busypoll using netlink so this can be enabled on a set of
dedicated napis for low latency applications.

It allows enabling NAPI busy poll for any userspace application,
independent of the userspace API being used for packet and event processing
(epoll, io_uring, raw socket APIs). Once enabled, the user can fetch the PID
of the kthread doing NAPI polling and set affinity, priority and
scheduler for it depending on the low-latency requirements.
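
As a rough illustration of that last step (not part of this series), here is
a minimal Python sketch that pins an already-known kthread PID to one CPU and
optionally switches it to SCHED_FIFO. The helper name `pin_napi_kthread` and
its parameters are hypothetical; only the `os.sched_*` calls are real, and how
the PID is looked up is left to the user.

```python
import os
from typing import Optional, Set


def pin_napi_kthread(pid: int, cpu: int,
                     fifo_priority: Optional[int] = None) -> Set[int]:
    """Pin thread `pid` to `cpu`; optionally move it to SCHED_FIFO.

    Changing another task's affinity or scheduler normally needs root
    (CAP_SYS_NICE). `pid` is assumed to be the NAPI kthread PID that the
    user has already looked up.
    """
    # Restrict the thread to a single dedicated CPU.
    os.sched_setaffinity(pid, {cpu})
    if fifo_priority is not None:
        # Real-time FIFO scheduling at the requested priority.
        os.sched_setscheduler(pid, os.SCHED_FIFO,
                              os.sched_param(fifo_priority))
    # Report the affinity mask that is now in effect.
    return os.sched_getaffinity(pid)
```

For example, `pin_napi_kthread(napi_pid, cpu=2, fifo_priority=50)`, where CPU 2
and priority 50 are arbitrary illustrative values.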

Currently threaded napi can only be enabled at the device level using sysfs. Add
support to enable/disable threaded mode for an individual napi. This
can be done using the netlink interface. Extend the `napi-set` op in the netlink
spec to allow setting the `threaded` attribute of a napi.

Extend the threaded attribute in the napi struct to add an option to enable
continuous busy polling. Extend the netlink and sysfs interfaces to allow
enabling/disabling threaded busy polling at the device or individual napi
level.

We use this for our AF_XDP based hard low-latency usecase using onload
stack (https://github.com/Xilinx-CNS/onload) that runs in userspace. Our
usecase is a fixed frequency RPC style traffic with fixed
request/response size. We simulated this using neper by only starting
next transaction when last one has completed. The experiment results are
listed below,

Setup:

- Running on Google C3 VMs with idpf driver with following configurations.
- IRQ affinity and coalescing are common for both experiments.
- There is only 1 RX/TX queue configured.
- First experiment enables busy poll using sysctl for both epoll and
  socket APIs.
- Second experiment enables NAPI threaded busy poll for the full device
  using sysfs.

Non threaded NAPI busy poll enabled using sysctl.
```
echo 400 | sudo tee /proc/sys/net/core/busy_poll
echo 400 | sudo tee /proc/sys/net/core/busy_read
echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
echo 15000  | sudo tee /sys/class/net/eth0/gro_flush_timeout
```

Results using the following command:
```
sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
		--profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
		-p 50,90,99,999 -H <IP> -l 10

...
...

num_transactions=2835
latency_min=0.000018976
latency_max=0.049642100
latency_mean=0.003243618
latency_stddev=0.010636847
latency_p50=0.000025270
latency_p90=0.005406710
latency_p99=0.049807350
latency_p99.9=0.049807350
```

Results with napi threaded busy poll using the following command:
```
sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
                --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
                -p 50,90,99,999 -H <IP> -l 10

...
...

num_transactions=460163
latency_min=0.000015707
latency_max=0.200182942
latency_mean=0.000019453
latency_stddev=0.000720727
latency_p50=0.000016950
latency_p90=0.000017270
latency_p99=0.000018710
latency_p99.9=0.000020150
```

Here, with NAPI threaded busy poll running on a separate core, we are able to
poll the NAPI continuously and keep latency to a minimum. We are also
able to do this without any major changes to the onload stack and
threading model.

v3:
 - Fixed calls to dev_set_threaded in drivers

v2:
 - Add documentation in napi.rst.
 - Provide experiment data and usecase details.
 - Update busy_poller selftest to include napi threaded poll testcase.
 - Define threaded mode enum in netlink interface.
 - Included NAPI threaded state in napi config to save/restore.

Samiullah Khawaja (4):
  Add support to set napi threaded for individual napi
  net: Create separate gro_flush helper function
  Extend napi threaded polling to allow kthread based busy polling
  selftests: Add napi threaded busy poll test in `busy_poller`

 Documentation/ABI/testing/sysfs-class-net     |   3 +-
 Documentation/netlink/specs/netdev.yaml       |  14 ++
 Documentation/networking/napi.rst             |  80 ++++++++++-
 .../net/ethernet/atheros/atl1c/atl1c_main.c   |   2 +-
 drivers/net/ethernet/mellanox/mlxsw/pci.c     |   2 +-
 drivers/net/ethernet/renesas/ravb_main.c      |   2 +-
 drivers/net/wireless/ath/ath10k/snoc.c        |   2 +-
 include/linux/netdevice.h                     |  24 +++-
 include/uapi/linux/netdev.h                   |   7 +
 net/core/dev.c                                | 127 ++++++++++++++----
 net/core/net-sysfs.c                          |   2 +-
 net/core/netdev-genl-gen.c                    |   5 +-
 net/core/netdev-genl.c                        |   9 ++
 tools/include/uapi/linux/netdev.h             |   7 +
 tools/testing/selftests/net/busy_poll_test.sh |  25 +++-
 tools/testing/selftests/net/busy_poller.c     |  14 +-
 16 files changed, 285 insertions(+), 40 deletions(-)

Comments

Samiullah Khawaja Feb. 5, 2025, 12:14 a.m. UTC | #1

On Tue, Feb 4, 2025 at 4:10 PM Samiullah Khawaja <skhawaja@google.com> wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.

[snip]

Adding Joe and Martin as they requested to be CC'd in the next
revision. It seems I missed them when sending this out :(.

Martin Karsten Feb. 5, 2025, 1:32 a.m. UTC | #2

On 2025-02-04 19:10, Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.

[snip]

> Setup:
> 
> - Running on Google C3 VMs with idpf driver with following configurations.
> - IRQ affinity and coalescing is common for both experiments.
> - There is only 1 RX/TX queue configured.
> - First experiment enables busy poll using sysctl for both epoll and
>    socket APIs.
> - Second experiment enables NAPI threaded busy poll for the full device
>    using sysctl.
> 
> Non threaded NAPI busy poll enabled using sysctl.
> ```
> echo 400 | sudo tee /proc/sys/net/core/busy_poll
> echo 400 | sudo tee /proc/sys/net/core/busy_read
> echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000  | sudo tee /sys/class/net/eth0/gro_flush_timeout
> ```
> 
> Results using following command,
> ```
> sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
> 		--profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
> 		-p 50,90,99,999 -H <IP> -l 10
> 
> ...
> ...
> 
> num_transactions=2835
> latency_min=0.000018976
> latency_max=0.049642100
> latency_mean=0.003243618
> latency_stddev=0.010636847
> latency_p50=0.000025270
> latency_p90=0.005406710
> latency_p99=0.049807350
> latency_p99.9=0.049807350
> ```
> 
> Results with napi threaded busy poll using following command,
> ```
> sudo EF_NO_FAIL=0 EF_POLL_USEC=100000 taskset -c 3-10 onload -v \
>                  --profile=latency ./neper/tcp_rr -Q 200 -R 400 -T 1 -F 50 \
>                  -p 50,90,99,999 -H <IP> -l 10
> 
> ...
> ...
> 
> num_transactions=460163
> latency_min=0.000015707
> latency_max=0.200182942
> latency_mean=0.000019453
> latency_stddev=0.000720727
> latency_p50=0.000016950
> latency_p90=0.000017270
> latency_p99=0.000018710
> latency_p99.9=0.000020150
> ```
> 
> Here with NAPI threaded busy poll in a separate core, we are able to
> consistently poll the NAPI to keep latency to absolute minimum. And also
> we are able to do this without any major changes to the onload stack and
> threading model.

As far as I'm concerned, this is still not sufficient information to 
fully assess the experiment. The experiment shows a 162-fold decrease 
in latency and a corresponding increase in throughput for this 
closed-loop workload (which, btw, is different from your open-loop fixed 
rate use case). This would be an extraordinary improvement and that 
alone warrants some scrutiny. 162X means either the base case has a lot 
of idle time or wastes an enormous amount of cpu cycles. How can that be 
explained? It would be good to get some instruction/cycle counts to 
drill down further.
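
For reference, the ratios can be recomputed from the quoted neper output with
a short Python sketch. The `parse_neper` helper is a hypothetical name for
illustration; the numbers are copied from the two result blocks above and
nothing else is assumed.

```python
from typing import Dict


def parse_neper(text: str) -> Dict[str, float]:
    """Parse `key=value` lines of neper output into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            stats[key.strip()] = float(value)
    return stats


# Values copied from the cover letter (base case, then threaded busy poll).
BASE = """num_transactions=2835
latency_mean=0.003243618
latency_p50=0.000025270"""
THREADED = """num_transactions=460163
latency_mean=0.000019453
latency_p50=0.000016950"""

base, threaded = parse_neper(BASE), parse_neper(THREADED)
throughput_ratio = threaded["num_transactions"] / base["num_transactions"]
mean_latency_ratio = base["latency_mean"] / threaded["latency_mean"]
p50_latency_ratio = base["latency_p50"] / threaded["latency_p50"]
print(f"throughput x{throughput_ratio:.1f}, "
      f"mean latency x{mean_latency_ratio:.1f}, "
      f"p50 latency x{p50_latency_ratio:.2f}")
```

The transaction count and mean latency both change by a factor of roughly 160 
to 170, while the median changes by only about 1.5. That suggests the base case 
is dominated by its tail rather than by the typical transaction.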

The server process invocation and the actual irq routing are not 
provided. Just stating it's common for both experiments is not 
sufficient. Without further information, I still cannot rule out that:

- In the base case, application and napi processing execute on the same 
core and trample on each other. I don't know how onload implements 
epoll_wait, but I worry that it cannot align application processing 
(busy-looping?) and napi processing (also busy-looping?).

- In the threaded busy-loop case, napi processing ends up on one core, 
while the application executes on another one. This uses two cores 
instead of one.

Based on the above, I think at least the following additional scenarios 
need to be investigated:

a) Run the server application in proper fullbusy mode, i.e., cleanly 
alternating between application processing and napi processing. As a 
second step, spread the incoming traffic across two cores to compare 
apples to apples.

b) Run application and napi processing on separate cores, but simply by 
way of thread pinning and interrupt routing. How close does that get to 
the current results? Then selectively add threaded napi and then busy 
polling.

c) Run the whole thing without onload for comparison. The benefits 
should show without onload as well and it's easier to reproduce. Also, I 
suspect onload hurts in the base case and that explains the atrociously 
high latency and low throughput of it.

Or, even better, simply provide a complete specification / script for 
the experiment that makes it easy to reproduce.

Note that I don't dismiss the approach out of hand. I just think it's 
important to properly understand the purported performance improvements. 
At the same time, I don't think it's good for the planet to burn cores 
with busy-looping without good reason.

Thanks,
Martin

Joe Damato Feb. 5, 2025, 3:18 a.m. UTC | #3

On Wed, Feb 05, 2025 at 12:10:48AM +0000, Samiullah Khawaja wrote:
> Extend the already existing support of threaded napi poll to do continuous
> busy polling.

[...]

Overall, +1 to everything Martin said in his response. I think I'd
like to try to reproduce this myself to better understand the stated
numbers below.

IMHO: the cover letter needs more details.

> 
> Setup:
> 
> - Running on Google C3 VMs with idpf driver with following configurations.
> - IRQ affinity and coalescing is common for both experiments.

As Martin suggested, a lot more detail here would be helpful.

> - There is only 1 RX/TX queue configured.
> - First experiment enables busy poll using sysctl for both epoll and
>   socket APIs.
> - Second experiment enables NAPI threaded busy poll for the full device
>   using sysctl.
> 
> Non threaded NAPI busy poll enabled using sysctl.
> ```
> echo 400 | sudo tee /proc/sys/net/core/busy_poll
> echo 400 | sudo tee /proc/sys/net/core/busy_read

I'm not sure why busy_read is enabled here?

Maybe more details on how exactly the internals of onload+neper work
would explain it, but I presume it's an epoll_wait loop with
non-blocking reads so busy_read wouldn't do anything?

> echo 2 | sudo tee /sys/class/net/eth0/napi_defer_hard_irqs
> echo 15000  | sudo tee /sys/class/net/eth0/gro_flush_timeout
> ```

The deferral amounts above are relatively small, which makes me
wonder if you are seeing IRQ and softIRQ interference in the base
case?
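
A small sketch of the arithmetic behind "relatively small", assuming the hard
IRQ is re-enabled after `napi_defer_hard_irqs` consecutive empty polls spaced
`gro_flush_timeout` nanoseconds apart. The function name is made up for
illustration, and this simple model ignores the time spent in each poll.

```python
NSEC_PER_USEC = 1_000


def idle_deferral_window_usec(defer_hard_irqs: int,
                              gro_flush_timeout_ns: int) -> float:
    """Approximate idle time (in usec) before hard IRQs are re-enabled."""
    return defer_hard_irqs * gro_flush_timeout_ns / NSEC_PER_USEC


# Values from the quoted configuration: 2 deferrals, 15000 ns timeout.
window_usec = idle_deferral_window_usec(2, 15_000)
busy_poll_usec = 400  # /proc/sys/net/core/busy_poll, in microseconds
print(f"idle deferral window ~{window_usec:.0f} us, "
      f"busy_poll budget {busy_poll_usec} us")
```

Under that assumption the window is much shorter than the 400 us busy poll
budget, so IRQs may be re-armed while the application is still polling.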

I ask because it seems like in the test case (if I read the patch
correctly) the processing of packets happens when BH is disabled.

Did I get that right?

If so, then:
  - In the base case, IRQs can be generated and softirq can interfere
    with packet processing.

  - In the test case, packet processing happens but BH is disabled,
    reducing interference.

If I got that right, it sounds like IRQ suspension would show good
results in this case, too, and it's probably worth comparing IRQ
suspension in the onload+neper setup.

It seems like it shouldn't be too difficult to get onload+neper
using it and the data would be very enlightening.