[v1,00/15] io_uring zero copy rx

Message ID: 20241007221603.1703699-1-dw@davidwei.uk

Message

David Wei Oct. 7, 2024, 10:15 p.m. UTC
This patchset adds support for zero copy rx into userspace pages using
io_uring, eliminating a kernel to user copy.

We configure a page pool that a driver uses to fill a hw rx queue to
hand out user pages instead of kernel pages. Any data that ends up
hitting this hw rx queue will thus be dma'd into userspace memory
directly, without needing to be bounced through kernel memory. 'Reading'
data out of a socket instead becomes a _notification_ mechanism, where
the kernel tells userspace where the data is. The overall approach is
similar to the devmem TCP proposal.

This relies on hw header/data split, flow steering and RSS to ensure
packet headers remain in kernel memory and only desired flows hit a hw
rx queue configured for zero copy. Configuring this is outside of the
scope of this patchset.
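
As a concrete (and purely illustrative) example of what such a setup can
look like with ethtool -- exact option names, queue numbers and port
numbers depend on the NIC, driver and ethtool version, and none of this
is part of the patchset itself:

  # enable hw header/data split and ntuple filtering
  ethtool -G eth0 tcp-data-split on
  ethtool -K eth0 ntuple on

  # restrict RSS to queues 0-11, keeping queue 12 free for zero copy
  ethtool -X eth0 equal 12

  # steer the flow of interest into the zero copy queue
  ethtool -N eth0 flow-type tcp6 dst-port 9999 action 12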

We share netdev core infra with devmem TCP. The main difference is that
io_uring is used for the uAPI and the lifetimes of all objects are bound
to an io_uring instance. Data is 'read' using a new io_uring request
type. When done, data is returned via a new shared refill queue. A zero
copy page pool refills a hw rx queue from this refill queue directly. Of
course, the lifetimes of these data buffers are managed by io_uring
rather than the networking stack, with different refcounting rules.
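
To make the buffer flow concrete, here is a rough sketch of the
userspace receive loop implied by the above. All struct and field names
are made up for illustration and are not the actual uAPI from this
series; ring synchronisation (barriers, head/tail publishing) is
omitted.

  #include <linux/types.h>

  /* hypothetical payload descriptor carried in a zcrx completion */
  struct zcrx_cqe { __u64 off; __u32 len; __u32 flags; };
  /* hypothetical entry in the shared refill queue */
  struct zcrx_rqe { __u64 off; __u32 len; __u32 __pad; };

  void process(const void *buf, unsigned int len);	/* app-specific */

  static void recv_loop(void *area, struct zcrx_cqe *cq,
  			struct zcrx_rqe *rq, unsigned int *cq_head,
  			unsigned int *rq_tail, unsigned int ring_mask)
  {
  	for (;;) {
  		/* the kernel tells us where the data is: an (offset, len)
  		 * into the registered user memory area
  		 */
  		struct zcrx_cqe *cqe = &cq[(*cq_head)++ & ring_mask];

  		process((char *)area + cqe->off, cqe->len);

  		/* once consumed, return the buffer via the refill queue;
  		 * the zero copy page pool refills the hw rx queue from
  		 * these entries directly
  		 */
  		struct zcrx_rqe *rqe = &rq[(*rq_tail)++ & ring_mask];
  		rqe->off = cqe->off;
  		rqe->len = cqe->len;
  	}
  }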

This patchset is the first step adding basic zero copy support. We will
extend this iteratively with new features e.g. dynamically allocated
zero copy areas, THP support, dmabuf support, improved copy fallback,
general optimisations and more.

In terms of netdev support, we're first targeting Broadcom bnxt. Patches
aren't included since Taehee Yoo has already sent a more comprehensive
patchset adding support in [1]. Google gve should already support this,
and Mellanox mlx5 support is WIP pending driver changes.

===========
Performance
===========

Test setup:
* AMD EPYC 9454
* Broadcom BCM957508 200G
* Kernel v6.11 base [2]
* liburing fork [3]
* kperf fork [4]
* 4K MTU
* Single TCP flow

With application thread + net rx softirq pinned to _different_ cores:

epoll
82.2 Gbps

io_uring
116.2 Gbps (+41%)

Pinned to _same_ core:

epoll
62.6 Gbps

io_uring
80.9 Gbps (+29%)

==============
Patch overview
==============

Networking folks would be mostly interested in patches 1-8, 11 and 14.
Patches 1-2 clean up net_iov and devmem, then patches 3-8 make changes
to netdev to suit our needs.

Patch 11 implements struct memory_provider_ops, and Patch 14 passes it
all to netdev via the queue API.
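
The provider hooks themselves are deliberately small. Schematically they
look something like the following; member names and signatures here are
only an approximation for illustration, not the real definitions from
the series:

  /* rough shape only, see the page pool hook patches for the real thing */
  struct memory_provider_ops {
  	int		(*init)(struct page_pool *pool);
  	void		(*destroy)(struct page_pool *pool);
  	netmem_ref	(*alloc_netmem)(struct page_pool *pool, gfp_t gfp);
  	bool		(*release_netmem)(struct page_pool *pool,
  					  netmem_ref netmem);
  	void		(*scrub)(struct page_pool *pool); /* added here */
  };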

io_uring folks would be mostly interested in patches 9-15:

* Initial registration that sets up a hw rx queue.
* Shared ringbuf for userspace to return buffers.
* New request type for doing zero copy rx reads.
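
To tie those three pieces together, the control path boils down to
something like the sketch below. The zcrx_*() helpers are placeholders,
not the actual uAPI or liburing calls (see the liburing fork [3] for the
real API); io_uring_queue_init(), io_uring_get_sqe() and
io_uring_submit() are the usual liburing calls.

  #include <liburing.h>
  #include <sys/mman.h>

  #define AREA_SIZE	(64UL << 20)	/* user memory backing rx buffers */

  /* placeholder prototypes for the registration / prep steps */
  void zcrx_register_ifq(struct io_uring *ring, const char *ifname,
  		       int rx_queue, void *area, size_t len);
  void zcrx_prep_recv(struct io_uring_sqe *sqe, int sockfd);

  static int setup_zcrx(int sockfd, const char *ifname, int rx_queue)
  {
  	struct io_uring ring;
  	struct io_uring_sqe *sqe;
  	void *area;

  	/* big (32 byte) CQEs: ordinary CQE + zero copy buffer descriptor */
  	io_uring_queue_init(64, &ring, IORING_SETUP_CQE32);

  	area = mmap(NULL, AREA_SIZE, PROT_READ | PROT_WRITE,
  		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

  	/* 1. registration: bind the hw rx queue and the memory area to
  	 *    this io_uring and map the shared refill queue
  	 */
  	zcrx_register_ifq(&ring, ifname, rx_queue, area, AREA_SIZE);

  	/* 2. issue a zero copy receive on a socket that flow steering
  	 *    already directs at that hw rx queue
  	 */
  	sqe = io_uring_get_sqe(&ring);
  	zcrx_prep_recv(sqe, sockfd);
  	return io_uring_submit(&ring);
  }

Completions then carry an (offset, length) into the registered area, and
consumed buffers go back through the refill queue as in the loop
sketched earlier.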

=====
Links
=====

Broadcom bnxt support:
[1]: https://lore.kernel.org/netdev/20241003160620.1521626-8-ap420073@gmail.com/

Linux kernel branch:
[2]: https://github.com/isilence/linux.git zcrx/v5

liburing for testing:
[3]: https://github.com/spikeh/liburing/tree/zcrx/next

kperf for testing:
[4]: https://github.com/spikeh/kperf/tree/zcrx/next

Changes in v1:
--------------
* Rebase on top of merged net_iov + netmem infra.
* Decouple net_iov from devmem TCP.
* Use netdev queue API to allocate an rx queue.
* Minor uAPI enhancements for future extensibility.
* QoS improvements with request throttling.

Changes in RFC v4:
------------------
* Rebased on top of Mina Almasry's TCP devmem patchset and latest
  net-next, now sharing common infra e.g.:
    * netmem_t and net_iovs
    * Page pool memory provider
* The registered buffer (rbuf) completion queue where completions from
  io_recvzc requests are posted is removed. Now these post into the main
  completion queue, using big (32-byte) CQEs. The first 16 bytes are an
  ordinary CQE, while the latter 16 bytes contain the io_uring_rbuf_cqe
  as before. This vastly simplifies the uAPI and removes a level of
  indirection in userspace when looking for payloads.
  * The rbuf refill queue is still needed for userspace to return
    buffers to kernel.
* Simplified code and uAPI on the io_uring side, particularly
  io_recvzc() and io_zc_rx_recv(). Many unnecessary lines were removed
  e.g. extra msg flags, readlen, etc.

Changes in RFC v3:
------------------
* Rebased on top of Jakub Kicinski's memory provider API RFC. The ZC
  pool added is now a backend for memory provider.
* We're also reusing ppiov infrastructure. The refcounting rules stay
  the same but it's shifted into ppiov->refcount. That lets us
  flexibly manage buffer lifetimes without adding any extra code to the
  common networking paths. It'd also make it easier to support dmabufs
  and device memory in the future.
  * io_uring also knows about pages, so ppiovs might unnecessarily
    break tools inspecting data; that can easily be solved later.

Many patches are not for upstream as they depend on work in progress,
namely from Mina:

* struct netmem_t
* Driver ndo commands for Rx queue configs
* struct page_pool_iov and shared pp infra

Changes in RFC v2:
------------------
* Added copy fallback support if userspace memory allocated for ZC Rx
  runs out, or if header splitting or flow steering fails.
* Added veth support for ZC Rx, for testing and demonstration. We will
  need to figure out what driver would be best for such testing
  functionality in the future. Perhaps netdevsim?
* Added socket registration API to io_uring to associate specific
  sockets with ifqs/Rx queues for ZC.
* Added multi-socket support, such that multiple connections can be
  steered into the same hardware Rx queue.
* Added Netbench server/client support.

David Wei (5):
  net: page pool: add helper creating area from pages
  io_uring/zcrx: add interface queue and refill queue
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: set pp memory provider for an rx queue

Jakub Kicinski (1):
  net: page_pool: create hooks for custom page providers

Pavel Begunkov (9):
  net: devmem: pull struct definitions out of ifdef
  net: prefix devmem specific helpers
  net: generalise net_iov chunk owners
  net: prepare for non devmem TCP memory providers
  net: page_pool: add ->scrub mem provider callback
  net: add helper executing custom callback from napi
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: add copy fallback
  io_uring/zcrx: throttle receive requests

 include/linux/io_uring/net.h   |   5 +
 include/linux/io_uring_types.h |   3 +
 include/net/busy_poll.h        |   6 +
 include/net/netmem.h           |  21 +-
 include/net/page_pool/types.h  |  27 ++
 include/uapi/linux/io_uring.h  |  54 +++
 io_uring/Makefile              |   1 +
 io_uring/io_uring.c            |   7 +
 io_uring/io_uring.h            |  10 +
 io_uring/memmap.c              |   8 +
 io_uring/net.c                 |  81 ++++
 io_uring/opdef.c               |  16 +
 io_uring/register.c            |   7 +
 io_uring/rsrc.c                |   2 +-
 io_uring/rsrc.h                |   1 +
 io_uring/zcrx.c                | 847 +++++++++++++++++++++++++++++++++
 io_uring/zcrx.h                |  74 +++
 net/core/dev.c                 |  53 +++
 net/core/devmem.c              |  44 +-
 net/core/devmem.h              |  71 ++-
 net/core/page_pool.c           |  81 +++-
 net/core/page_pool_user.c      |  15 +-
 net/ipv4/tcp.c                 |   8 +-
 23 files changed, 1364 insertions(+), 78 deletions(-)
 create mode 100644 io_uring/zcrx.c
 create mode 100644 io_uring/zcrx.h

Comments

David Wei Oct. 7, 2024, 10:20 p.m. UTC | #1
On 2024-10-07 15:15, David Wei wrote:
> This patchset adds support for zero copy rx into userspace pages using
> io_uring, eliminating a kernel to user copy.

Sorry, I didn't know that versioning does not get reset when going from
RFC -> non-RFC. This patchset should read v5. I'll fix this in the next
version.
Joe Damato Oct. 8, 2024, 11:10 p.m. UTC | #2
On Mon, Oct 07, 2024 at 03:15:48PM -0700, David Wei wrote:
> This patchset adds support for zero copy rx into userspace pages using
> io_uring, eliminating a kernel to user copy.
> 
> We configure a page pool that a driver uses to fill a hw rx queue to
> hand out user pages instead of kernel pages. Any data that ends up
> hitting this hw rx queue will thus be dma'd into userspace memory
> directly, without needing to be bounced through kernel memory. 'Reading'
> data out of a socket instead becomes a _notification_ mechanism, where
> the kernel tells userspace where the data is. The overall approach is
> similar to the devmem TCP proposal.
> 
> This relies on hw header/data split, flow steering and RSS to ensure
> packet headers remain in kernel memory and only desired flows hit a hw
> rx queue configured for zero copy. Configuring this is outside of the
> scope of this patchset.

This looks super cool and very useful, thanks for doing this work.

Is there any possibility of some notes or sample pseudo code on how
userland can use this being added to Documentation/networking/ ?
Pavel Begunkov Oct. 9, 2024, 3:07 p.m. UTC | #3
On 10/9/24 00:10, Joe Damato wrote:
> On Mon, Oct 07, 2024 at 03:15:48PM -0700, David Wei wrote:
>> This patchset adds support for zero copy rx into userspace pages using
>> io_uring, eliminating a kernel to user copy.
>>
>> We configure a page pool that a driver uses to fill a hw rx queue to
>> hand out user pages instead of kernel pages. Any data that ends up
>> hitting this hw rx queue will thus be dma'd into userspace memory
>> directly, without needing to be bounced through kernel memory. 'Reading'
>> data out of a socket instead becomes a _notification_ mechanism, where
>> the kernel tells userspace where the data is. The overall approach is
>> similar to the devmem TCP proposal.
>>
>> This relies on hw header/data split, flow steering and RSS to ensure
>> packet headers remain in kernel memory and only desired flows hit a hw
>> rx queue configured for zero copy. Configuring this is outside of the
>> scope of this patchset.
> 
> This looks super cool and very useful, thanks for doing this work.
> 
> Is there any possibility of some notes or sample pseudo code on how
> userland can use this being added to Documentation/networking/ ?

io_uring man pages would need to be updated with it, there are tests
in liburing and it would be a good idea to add back a simple example
to liburing/example/*. I think that should cover it.
Jens Axboe Oct. 9, 2024, 3:27 p.m. UTC | #4
On 10/7/24 4:15 PM, David Wei wrote:
> ===========
> Performance
> ===========
> 
> Test setup:
> * AMD EPYC 9454
> * Broadcom BCM957508 200G
> * Kernel v6.11 base [2]
> * liburing fork [3]
> * kperf fork [4]
> * 4K MTU
> * Single TCP flow
> 
> With application thread + net rx softirq pinned to _different_ cores:
> 
> epoll
> 82.2 Gbps
> 
> io_uring
> 116.2 Gbps (+41%)
> 
> Pinned to _same_ core:
> 
> epoll
> 62.6 Gbps
> 
> io_uring
> 80.9 Gbps (+29%)

I'll review the io_uring bits in detail, but I did take a quick look and
overall it looks really nice.

I decided to give this a spin, as I noticed that Broadcom now has a
230.x firmware release out that supports this. Hence no dependencies on
that anymore, outside of some pain getting the fw updated. Here are my
test setup details:

Receiver:
AMD EPYC 9754 (receiver)
Broadcom P2100G
-git + this series + the bnxt series referenced

Sender:
Intel(R) Xeon(R) Platinum 8458P
Broadcom P2100G
-git

Test:
kperf with David's patches to support io_uring zc. Eg single flow TCP,
just testing bandwidth. A single cpu/thread being used on both the
receiver and sender side.

non-zc
60.9 Gbps

io_uring + zc
97.1 Gbps

or +59% faster. There's quite a bit of IRQ side work, I'm guessing I
might need to tune it a bit. But it Works For Me, and the results look
really nice.

I did run into an issue with the bnxt driver defaulting to shared tx/rx
queues, and it not working for me in that configuration. Once I disabled
that, it worked fine. This may or may not be an issue with the flow rule
to direct the traffic, the driver queue start, or something else. Don't
know for sure, will need to check with the driver folks. Once sorted, I
didn't see any issues with the code in the patchset.
David Ahern Oct. 9, 2024, 3:38 p.m. UTC | #5
On 10/9/24 9:27 AM, Jens Axboe wrote:
> On 10/7/24 4:15 PM, David Wei wrote:
>> ===========
>> Performance
>> ===========
>>
>> Test setup:
>> * AMD EPYC 9454
>> * Broadcom BCM957508 200G
>> * Kernel v6.11 base [2]
>> * liburing fork [3]
>> * kperf fork [4]
>> * 4K MTU
>> * Single TCP flow
>>
>> With application thread + net rx softirq pinned to _different_ cores:
>>
>> epoll
>> 82.2 Gbps
>>
>> io_uring
>> 116.2 Gbps (+41%)
>>
>> Pinned to _same_ core:
>>
>> epoll
>> 62.6 Gbps
>>
>> io_uring
>> 80.9 Gbps (+29%)
> 
> I'll review the io_uring bits in detail, but I did take a quick look and
> overall it looks really nice.
> 
> I decided to give this a spin, as I noticed that Broadcom now has a
> 230.x firmware release out that supports this. Hence no dependencies on
> that anymore, outside of some pain getting the fw updated. Here are my
> test setup details:
> 
> Receiver:
> AMD EPYC 9754 (recei
> Broadcom P2100G
> -git + this series + the bnxt series referenced
> 
> Sender:
> Intel(R) Xeon(R) Platinum 8458P
> Broadcom P2100G
> -git
> 
> Test:
> kperf with David's patches to support io_uring zc. Eg single flow TCP,
> just testing bandwidth. A single cpu/thread being used on both the
> receiver and sender side.
> 
> non-zc
> 60.9 Gbps
> 
> io_uring + zc
> 97.1 Gbps

so line rate? Did you look at whether there is cpu to spare? meaning it
will report higher speeds with a 200G setup?
Jens Axboe Oct. 9, 2024, 3:43 p.m. UTC | #6
On 10/9/24 9:38 AM, David Ahern wrote:
> On 10/9/24 9:27 AM, Jens Axboe wrote:
>> On 10/7/24 4:15 PM, David Wei wrote:
>>> ===========
>>> Performance
>>> ===========
>>>
>>> Test setup:
>>> * AMD EPYC 9454
>>> * Broadcom BCM957508 200G
>>> * Kernel v6.11 base [2]
>>> * liburing fork [3]
>>> * kperf fork [4]
>>> * 4K MTU
>>> * Single TCP flow
>>>
>>> With application thread + net rx softirq pinned to _different_ cores:
>>>
>>> epoll
>>> 82.2 Gbps
>>>
>>> io_uring
>>> 116.2 Gbps (+41%)
>>>
>>> Pinned to _same_ core:
>>>
>>> epoll
>>> 62.6 Gbps
>>>
>>> io_uring
>>> 80.9 Gbps (+29%)
>>
>> I'll review the io_uring bits in detail, but I did take a quick look and
>> overall it looks really nice.
>>
>> I decided to give this a spin, as I noticed that Broadcom now has a
>> 230.x firmware release out that supports this. Hence no dependencies on
>> that anymore, outside of some pain getting the fw updated. Here are my
>> test setup details:
>>
>> Receiver:
>> AMD EPYC 9754 (recei
>> Broadcom P2100G
>> -git + this series + the bnxt series referenced
>>
>> Sender:
>> Intel(R) Xeon(R) Platinum 8458P
>> Broadcom P2100G
>> -git
>>
>> Test:
>> kperf with David's patches to support io_uring zc. Eg single flow TCP,
>> just testing bandwidth. A single cpu/thread being used on both the
>> receiver and sender side.
>>
>> non-zc
>> 60.9 Gbps
>>
>> io_uring + zc
>> 97.1 Gbps
> 
> so line rate? Did you look at whether there is cpu to spare? meaning it
> will report higher speeds with a 200G setup?

Yep basically line rate, I get 97-98Gbps. I originally used a slower box
as the sender, but then you're capped on the non-zc sender being too
slow. The intel box does better, but it's still basically maxing out the
sender at this point. So yeah, with a faster (or more efficient sender),
I have no doubts this will go much higher per thread, if the link bw was
there. When I looked at CPU usage for the receiver, the thread itself is
using ~30% CPU. And then there's some softirq/irq time outside of that,
but that should amortize with higher bps rates too I'd expect.

My nic does have 2 100G ports, so might warrant a bit more testing...
Pavel Begunkov Oct. 9, 2024, 3:49 p.m. UTC | #7
On 10/9/24 16:43, Jens Axboe wrote:
> On 10/9/24 9:38 AM, David Ahern wrote:
>> On 10/9/24 9:27 AM, Jens Axboe wrote:
>>> On 10/7/24 4:15 PM, David Wei wrote:
>>>> ===========
>>>> Performance
>>>> ===========
>>>>
>>>> Test setup:
>>>> * AMD EPYC 9454
>>>> * Broadcom BCM957508 200G
>>>> * Kernel v6.11 base [2]
>>>> * liburing fork [3]
>>>> * kperf fork [4]
>>>> * 4K MTU
>>>> * Single TCP flow
>>>>
>>>> With application thread + net rx softirq pinned to _different_ cores:
>>>>
>>>> epoll
>>>> 82.2 Gbps
>>>>
>>>> io_uring
>>>> 116.2 Gbps (+41%)
>>>>
>>>> Pinned to _same_ core:
>>>>
>>>> epoll
>>>> 62.6 Gbps
>>>>
>>>> io_uring
>>>> 80.9 Gbps (+29%)
>>>
>>> I'll review the io_uring bits in detail, but I did take a quick look and
>>> overall it looks really nice.
>>>
>>> I decided to give this a spin, as I noticed that Broadcom now has a
>>> 230.x firmware release out that supports this. Hence no dependencies on
>>> that anymore, outside of some pain getting the fw updated. Here are my
>>> test setup details:
>>>
>>> Receiver:
>>> AMD EPYC 9754 (recei
>>> Broadcom P2100G
>>> -git + this series + the bnxt series referenced
>>>
>>> Sender:
>>> Intel(R) Xeon(R) Platinum 8458P
>>> Broadcom P2100G
>>> -git
>>>
>>> Test:
>>> kperf with David's patches to support io_uring zc. Eg single flow TCP,
>>> just testing bandwidth. A single cpu/thread being used on both the
>>> receiver and sender side.
>>>
>>> non-zc
>>> 60.9 Gbps
>>>
>>> io_uring + zc
>>> 97.1 Gbps
>>
>> so line rate? Did you look at whether there is cpu to spare? meaning it
>> will report higher speeds with a 200G setup?
> 
> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
> as the sender, but then you're capped on the non-zc sender being too
> slow. The intel box does better, but it's still basically maxing out the
> sender at this point. So yeah, with a faster (or more efficient sender),
> I have no doubts this will go much higher per thread, if the link bw was
> there. When I looked at CPU usage for the receiver, the thread itself is
> using ~30% CPU. And then there's some softirq/irq time outside of that,
> but that should ammortize with higher bps rates too I'd expect.
> 
> My nic does have 2 100G ports, so might warrant a bit more testing...

If you haven't done it already, I'd also pin softirq processing to
the same CPU as the app so we measure the full stack. kperf has an
option IIRC.
Jens Axboe Oct. 9, 2024, 3:50 p.m. UTC | #8
On 10/9/24 9:49 AM, Pavel Begunkov wrote:
> On 10/9/24 16:43, Jens Axboe wrote:
>> On 10/9/24 9:38 AM, David Ahern wrote:
>>> On 10/9/24 9:27 AM, Jens Axboe wrote:
>>>> On 10/7/24 4:15 PM, David Wei wrote:
>>>>> ===========
>>>>> Performance
>>>>> ===========
>>>>>
>>>>> Test setup:
>>>>> * AMD EPYC 9454
>>>>> * Broadcom BCM957508 200G
>>>>> * Kernel v6.11 base [2]
>>>>> * liburing fork [3]
>>>>> * kperf fork [4]
>>>>> * 4K MTU
>>>>> * Single TCP flow
>>>>>
>>>>> With application thread + net rx softirq pinned to _different_ cores:
>>>>>
>>>>> epoll
>>>>> 82.2 Gbps
>>>>>
>>>>> io_uring
>>>>> 116.2 Gbps (+41%)
>>>>>
>>>>> Pinned to _same_ core:
>>>>>
>>>>> epoll
>>>>> 62.6 Gbps
>>>>>
>>>>> io_uring
>>>>> 80.9 Gbps (+29%)
>>>>
>>>> I'll review the io_uring bits in detail, but I did take a quick look and
>>>> overall it looks really nice.
>>>>
>>>> I decided to give this a spin, as I noticed that Broadcom now has a
>>>> 230.x firmware release out that supports this. Hence no dependencies on
>>>> that anymore, outside of some pain getting the fw updated. Here are my
>>>> test setup details:
>>>>
>>>> Receiver:
>>>> AMD EPYC 9754 (recei
>>>> Broadcom P2100G
>>>> -git + this series + the bnxt series referenced
>>>>
>>>> Sender:
>>>> Intel(R) Xeon(R) Platinum 8458P
>>>> Broadcom P2100G
>>>> -git
>>>>
>>>> Test:
>>>> kperf with David's patches to support io_uring zc. Eg single flow TCP,
>>>> just testing bandwidth. A single cpu/thread being used on both the
>>>> receiver and sender side.
>>>>
>>>> non-zc
>>>> 60.9 Gbps
>>>>
>>>> io_uring + zc
>>>> 97.1 Gbps
>>>
>>> so line rate? Did you look at whether there is cpu to spare? meaning it
>>> will report higher speeds with a 200G setup?
>>
>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>> as the sender, but then you're capped on the non-zc sender being too
>> slow. The intel box does better, but it's still basically maxing out the
>> sender at this point. So yeah, with a faster (or more efficient sender),
>> I have no doubts this will go much higher per thread, if the link bw was
>> there. When I looked at CPU usage for the receiver, the thread itself is
>> using ~30% CPU. And then there's some softirq/irq time outside of that,
>> but that should ammortize with higher bps rates too I'd expect.
>>
>> My nic does have 2 100G ports, so might warrant a bit more testing...
> If you haven't done it already, I'd also pin softirq processing to
> the same CPU as the app so we measure the full stack. kperf has an
> option IIRC.

I thought that was the default if you didn't give it a cpu-off option?
I'll check...
Joe Damato Oct. 9, 2024, 4:10 p.m. UTC | #9
On Wed, Oct 09, 2024 at 04:07:01PM +0100, Pavel Begunkov wrote:
> On 10/9/24 00:10, Joe Damato wrote:
> > On Mon, Oct 07, 2024 at 03:15:48PM -0700, David Wei wrote:
> > > This patchset adds support for zero copy rx into userspace pages using
> > > io_uring, eliminating a kernel to user copy.
> > > 
> > > We configure a page pool that a driver uses to fill a hw rx queue to
> > > hand out user pages instead of kernel pages. Any data that ends up
> > > hitting this hw rx queue will thus be dma'd into userspace memory
> > > directly, without needing to be bounced through kernel memory. 'Reading'
> > > data out of a socket instead becomes a _notification_ mechanism, where
> > > the kernel tells userspace where the data is. The overall approach is
> > > similar to the devmem TCP proposal.
> > > 
> > > This relies on hw header/data split, flow steering and RSS to ensure
> > > packet headers remain in kernel memory and only desired flows hit a hw
> > > rx queue configured for zero copy. Configuring this is outside of the
> > > scope of this patchset.
> > 
> > This looks super cool and very useful, thanks for doing this work.
> > 
> > Is there any possibility of some notes or sample pseudo code on how
> > userland can use this being added to Documentation/networking/ ?
> 
> io_uring man pages would need to be updated with it, there are tests
> in liburing and would be a good idea to add back a simple exapmle
> to liburing/example/*. I think it should cover it

Ah, that sounds amazing to me!

I thought that suggesting that might be too much work ;) which is
why I had suggested Documentation/, but man page updates would be
excellent!
Jens Axboe Oct. 9, 2024, 4:12 p.m. UTC | #10
On 10/9/24 9:07 AM, Pavel Begunkov wrote:
> On 10/9/24 00:10, Joe Damato wrote:
>> On Mon, Oct 07, 2024 at 03:15:48PM -0700, David Wei wrote:
>>> This patchset adds support for zero copy rx into userspace pages using
>>> io_uring, eliminating a kernel to user copy.
>>>
>>> We configure a page pool that a driver uses to fill a hw rx queue to
>>> hand out user pages instead of kernel pages. Any data that ends up
>>> hitting this hw rx queue will thus be dma'd into userspace memory
>>> directly, without needing to be bounced through kernel memory. 'Reading'
>>> data out of a socket instead becomes a _notification_ mechanism, where
>>> the kernel tells userspace where the data is. The overall approach is
>>> similar to the devmem TCP proposal.
>>>
>>> This relies on hw header/data split, flow steering and RSS to ensure
>>> packet headers remain in kernel memory and only desired flows hit a hw
>>> rx queue configured for zero copy. Configuring this is outside of the
>>> scope of this patchset.
>>
>> This looks super cool and very useful, thanks for doing this work.
>>
>> Is there any possibility of some notes or sample pseudo code on how
>> userland can use this being added to Documentation/networking/ ?
> 
> io_uring man pages would need to be updated with it, there are tests
> in liburing and would be a good idea to add back a simple exapmle
> to liburing/example/*. I think it should cover it

man pages for sure, but +1 to the example too. Just a basic thing would
get the point across, I think.
David Ahern Oct. 9, 2024, 4:35 p.m. UTC | #11
On 10/9/24 9:43 AM, Jens Axboe wrote:
> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
> as the sender, but then you're capped on the non-zc sender being too
> slow. The intel box does better, but it's still basically maxing out the
> sender at this point. So yeah, with a faster (or more efficient sender),

I am surprised by this comment. You should not see a Tx limited test
(including CPU bound sender). Tx with ZC has been the easy option for a
while now.

> I have no doubts this will go much higher per thread, if the link bw was
> there. When I looked at CPU usage for the receiver, the thread itself is
> using ~30% CPU. And then there's some softirq/irq time outside of that,
> but that should ammortize with higher bps rates too I'd expect.
> 
> My nic does have 2 100G ports, so might warrant a bit more testing...
> 

It would be good to see what the next bottleneck is for io_uring with ZC
Rx path. My expectation is that a 200G link is a means to show you (i.e.,
you will not hit 200G so cpu monitoring, perf-top, etc will show the
limiter).
Jens Axboe Oct. 9, 2024, 4:50 p.m. UTC | #12
On 10/9/24 10:35 AM, David Ahern wrote:
> On 10/9/24 9:43 AM, Jens Axboe wrote:
>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>> as the sender, but then you're capped on the non-zc sender being too
>> slow. The intel box does better, but it's still basically maxing out the
>> sender at this point. So yeah, with a faster (or more efficient sender),
> 
> I am surprised by this comment. You should not see a Tx limited test
> (including CPU bound sender). Tx with ZC has been the easy option for a
> while now.

I just set this up to test yesterday and just used default! I'm sure
there is a zc option, just not the default and hence it wasn't used.
I'll give it a spin, will be useful for 200G testing.

>> I have no doubts this will go much higher per thread, if the link bw was
>> there. When I looked at CPU usage for the receiver, the thread itself is
>> using ~30% CPU. And then there's some softirq/irq time outside of that,
>> but that should ammortize with higher bps rates too I'd expect.
>>
>> My nic does have 2 100G ports, so might warrant a bit more testing...
>>
> 
> It would be good to see what the next bottleneck is for io_uring with ZC
> Rx path. My expectation is that a 200G link is a means to show you (ie.,
> you will not hit 200G so cpu monitoring, perf-top, etc will show the
> limiter).

I'm pretty familiar with profiling ;-)

I'll see if I can get the 200G test setup and then I'll report back what
I get.
Jens Axboe Oct. 9, 2024, 4:53 p.m. UTC | #13
On 10/9/24 10:50 AM, Jens Axboe wrote:
> On 10/9/24 10:35 AM, David Ahern wrote:
>> On 10/9/24 9:43 AM, Jens Axboe wrote:
>>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>>> as the sender, but then you're capped on the non-zc sender being too
>>> slow. The intel box does better, but it's still basically maxing out the
>>> sender at this point. So yeah, with a faster (or more efficient sender),
>>
>> I am surprised by this comment. You should not see a Tx limited test
>> (including CPU bound sender). Tx with ZC has been the easy option for a
>> while now.
> 
> I just set this up to test yesterday and just used default! I'm sure
> there is a zc option, just not the default and hence it wasn't used.
> I'll give it a spin, will be useful for 200G testing.

I think we're talking past each other. Yes send with zerocopy is
available for a while now, both with io_uring and just sendmsg(), but
I'm using kperf for testing and it does not look like it supports it.
Might have to add it... We'll see how far I can get without it.
Mina Almasry Oct. 9, 2024, 4:55 p.m. UTC | #14
On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>
> This patchset adds support for zero copy rx into userspace pages using
> io_uring, eliminating a kernel to user copy.
>
> We configure a page pool that a driver uses to fill a hw rx queue to
> hand out user pages instead of kernel pages. Any data that ends up
> hitting this hw rx queue will thus be dma'd into userspace memory
> directly, without needing to be bounced through kernel memory. 'Reading'
> data out of a socket instead becomes a _notification_ mechanism, where
> the kernel tells userspace where the data is. The overall approach is
> similar to the devmem TCP proposal.
>
> This relies on hw header/data split, flow steering and RSS to ensure
> packet headers remain in kernel memory and only desired flows hit a hw
> rx queue configured for zero copy. Configuring this is outside of the
> scope of this patchset.
>
> We share netdev core infra with devmem TCP. The main difference is that
> io_uring is used for the uAPI and the lifetime of all objects are bound
> to an io_uring instance.

I've been thinking about this a bit, and I hope this feedback isn't
too late, but I think your work may be useful for users not using
io_uring. I.e. zero copy to host memory that is not dependent on page
aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.

If we refactor things around a bit we should be able to have the
memory tied to the RX queue similar to what AF_XDP does, and then we
should be able to zero copy to the memory via regular sockets and via
io_uring. This will be useful for us and other applications that would
like to ZC similar to what you're doing here but not necessarily
through io_uring.

> Data is 'read' using a new io_uring request
> type. When done, data is returned via a new shared refill queue. A zero
> copy page pool refills a hw rx queue from this refill queue directly. Of
> course, the lifetime of these data buffers are managed by io_uring
> rather than the networking stack, with different refcounting rules.
>
> This patchset is the first step adding basic zero copy support. We will
> extend this iteratively with new features e.g. dynamically allocated
> zero copy areas, THP support, dmabuf support, improved copy fallback,
> general optimisations and more.
>
> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
> aren't included since Taehee Yoo has already sent a more comprehensive
> patchset adding support in [1]. Google gve should already support this,

This is an aside, but GVE supports this via the out-of-tree patches
I've been carrying on github. Upstream we're working on adding the
prerequisite page_pool support.

> and Mellanox mlx5 support is WIP pending driver changes.
>
> ===========
> Performance
> ===========
>
> Test setup:
> * AMD EPYC 9454
> * Broadcom BCM957508 200G
> * Kernel v6.11 base [2]
> * liburing fork [3]
> * kperf fork [4]
> * 4K MTU
> * Single TCP flow
>
> With application thread + net rx softirq pinned to _different_ cores:
>
> epoll
> 82.2 Gbps
>
> io_uring
> 116.2 Gbps (+41%)
>
> Pinned to _same_ core:
>
> epoll
> 62.6 Gbps
>
> io_uring
> 80.9 Gbps (+29%)
>

Are the 'epoll' results here and the 'io_uring' results using TCP RX
zerocopy [1] and io_uring zerocopy, respectively?

If not, I would like to see a comparison between TCP RX zerocopy and
this new io-uring zerocopy. For Google for example we use the TCP RX
zerocopy, I would like to see perf numbers possibly motivating us to
move to this new thing.

[1] https://lwn.net/Articles/752046/
Jens Axboe Oct. 9, 2024, 4:57 p.m. UTC | #15
On 10/9/24 10:55 AM, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>>
>> This patchset adds support for zero copy rx into userspace pages using
>> io_uring, eliminating a kernel to user copy.
>>
>> We configure a page pool that a driver uses to fill a hw rx queue to
>> hand out user pages instead of kernel pages. Any data that ends up
>> hitting this hw rx queue will thus be dma'd into userspace memory
>> directly, without needing to be bounced through kernel memory. 'Reading'
>> data out of a socket instead becomes a _notification_ mechanism, where
>> the kernel tells userspace where the data is. The overall approach is
>> similar to the devmem TCP proposal.
>>
>> This relies on hw header/data split, flow steering and RSS to ensure
>> packet headers remain in kernel memory and only desired flows hit a hw
>> rx queue configured for zero copy. Configuring this is outside of the
>> scope of this patchset.
>>
>> We share netdev core infra with devmem TCP. The main difference is that
>> io_uring is used for the uAPI and the lifetime of all objects are bound
>> to an io_uring instance.
> 
> I've been thinking about this a bit, and I hope this feedback isn't
> too late, but I think your work may be useful for users not using
> io_uring. I.e. zero copy to host memory that is not dependent on page
> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.

Not David, but come on, let's please get this moving forward. It's been
stuck behind dependencies for seemingly forever, which are finally
resolved. I don't think this is a reasonable ask at all for this
patchset. If you want to work on that after the fact, then that's
certainly an option. But gating this now on new requirements for
something that isn't even a goal of this patchset, that's getting pretty
silly imho.
Jens Axboe Oct. 9, 2024, 5:12 p.m. UTC | #16
On 10/9/24 10:53 AM, Jens Axboe wrote:
> On 10/9/24 10:50 AM, Jens Axboe wrote:
>> On 10/9/24 10:35 AM, David Ahern wrote:
>>> On 10/9/24 9:43 AM, Jens Axboe wrote:
>>>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>>>> as the sender, but then you're capped on the non-zc sender being too
>>>> slow. The intel box does better, but it's still basically maxing out the
>>>> sender at this point. So yeah, with a faster (or more efficient sender),
>>>
>>> I am surprised by this comment. You should not see a Tx limited test
>>> (including CPU bound sender). Tx with ZC has been the easy option for a
>>> while now.
>>
>> I just set this up to test yesterday and just used default! I'm sure
>> there is a zc option, just not the default and hence it wasn't used.
>> I'll give it a spin, will be useful for 200G testing.
> 
> I think we're talking past each other. Yes send with zerocopy is
> available for a while now, both with io_uring and just sendmsg(), but
> I'm using kperf for testing and it does not look like it supports it.
> Might have to add it... We'll see how far I can get without it.

Stanislav pointed me at:

https://github.com/facebookexperimental/kperf/pull/2

which adds zc send. I ran a quick test, and it does reduce cpu
utilization on the sender from 100% to 95%. I'll keep poking...
David Ahern Oct. 9, 2024, 5:19 p.m. UTC | #17
On 10/9/24 10:55 AM, Mina Almasry wrote:
> 
> I've been thinking about this a bit, and I hope this feedback isn't
> too late, but I think your work may be useful for users not using
> io_uring. I.e. zero copy to host memory that is not dependent on page
> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.

I disagree with this request; AF_XDP by definition is bypassing the
kernel stack.
Pedro Tammela Oct. 9, 2024, 6:21 p.m. UTC | #18
On 09/10/2024 13:55, Mina Almasry wrote:
> [...]
> 
> If not, I would like to see a comparison between TCP RX zerocopy and
> this new io-uring zerocopy. For Google for example we use the TCP RX
> zerocopy, I would like to see perf numbers possibly motivating us to
> move to this new thing.
> 
> [1] https://lwn.net/Articles/752046/
> 

Hi!

From my own testing, the TCP RX Zerocopy is quite heavy on the page
unmapping side. Since the io_uring implementation is expected to be
lighter (see patch 11), I would expect a simple comparison to show
better numbers for io_uring.

To be fair to the existing implementation, it would then need to be
paired with some 'real' computation, but that varies a lot. As we
presented at netdevconf this year, HW-GRO eventually was the best option
for us (no app changes, etc...) but still a case by case decision.
Mina Almasry Oct. 9, 2024, 7:32 p.m. UTC | #19
On Wed, Oct 9, 2024 at 9:57 AM Jens Axboe <axboe@kernel.dk> wrote:
>
> On 10/9/24 10:55 AM, Mina Almasry wrote:
> > On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
> >>
> >> This patchset adds support for zero copy rx into userspace pages using
> >> io_uring, eliminating a kernel to user copy.
> >>
> >> We configure a page pool that a driver uses to fill a hw rx queue to
> >> hand out user pages instead of kernel pages. Any data that ends up
> >> hitting this hw rx queue will thus be dma'd into userspace memory
> >> directly, without needing to be bounced through kernel memory. 'Reading'
> >> data out of a socket instead becomes a _notification_ mechanism, where
> >> the kernel tells userspace where the data is. The overall approach is
> >> similar to the devmem TCP proposal.
> >>
> >> This relies on hw header/data split, flow steering and RSS to ensure
> >> packet headers remain in kernel memory and only desired flows hit a hw
> >> rx queue configured for zero copy. Configuring this is outside of the
> >> scope of this patchset.
> >>
> >> We share netdev core infra with devmem TCP. The main difference is that
> >> io_uring is used for the uAPI and the lifetime of all objects are bound
> >> to an io_uring instance.
> >
> > I've been thinking about this a bit, and I hope this feedback isn't
> > too late, but I think your work may be useful for users not using
> > io_uring. I.e. zero copy to host memory that is not dependent on page
> > aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
>
> Not David, but come on, let's please get this moving forward. It's been
> stuck behind dependencies for seemingly forever, which are finally
> resolved.

Part of the reason this has been stuck behind dependencies for so long
is because the dependency took the time to implement things very
generically (memory providers, net_iovs) and provided you with the
primitives that enable your work. And dealt with nacks in this area
you now don't have to deal with.

> I don't think this is a reasonable ask at all for this
> patchset. If you want to work on that after the fact, then that's
> certainly an option.

I think this work is extensible to sockets and the implementation need
not be heavily tied to io_uring; yes at least leaving things open for
a socket extension to be done easier in the future would be good, IMO.
I'll look at the series more closely to see if I actually have any
concrete feedback along these lines. I hope you're open to some of it
:-)

--
Thanks,
Mina
Pavel Begunkov Oct. 9, 2024, 7:43 p.m. UTC | #20
On 10/9/24 20:32, Mina Almasry wrote:
> On Wed, Oct 9, 2024 at 9:57 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 10/9/24 10:55 AM, Mina Almasry wrote:
>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>>>>
>>>> This patchset adds support for zero copy rx into userspace pages using
>>>> io_uring, eliminating a kernel to user copy.
>>>>
>>>> We configure a page pool that a driver uses to fill a hw rx queue to
>>>> hand out user pages instead of kernel pages. Any data that ends up
>>>> hitting this hw rx queue will thus be dma'd into userspace memory
>>>> directly, without needing to be bounced through kernel memory. 'Reading'
>>>> data out of a socket instead becomes a _notification_ mechanism, where
>>>> the kernel tells userspace where the data is. The overall approach is
>>>> similar to the devmem TCP proposal.
>>>>
>>>> This relies on hw header/data split, flow steering and RSS to ensure
>>>> packet headers remain in kernel memory and only desired flows hit a hw
>>>> rx queue configured for zero copy. Configuring this is outside of the
>>>> scope of this patchset.
>>>>
>>>> We share netdev core infra with devmem TCP. The main difference is that
>>>> io_uring is used for the uAPI and the lifetime of all objects are bound
>>>> to an io_uring instance.
>>>
>>> I've been thinking about this a bit, and I hope this feedback isn't
>>> too late, but I think your work may be useful for users not using
>>> io_uring. I.e. zero copy to host memory that is not dependent on page
>>> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
>>
>> Not David, but come on, let's please get this moving forward. It's been
>> stuck behind dependencies for seemingly forever, which are finally
>> resolved.
> 
> Part of the reason this has been stuck behind dependencies for so long
> is because the dependency took the time to implement things very
> generically (memory providers, net_iovs) and provided you with the
> primitives that enable your work. And dealt with nacks in this area
> you now don't have to deal with.

And that's well appreciated, but I completely share Jens' sentiment.
Is there anything, like uapi concerns, that prevents it from being
implemented after / separately? I'd say that for io_uring users
it's nice to have the API done the io_uring way regardless of the
socket API option, so at the very least it would fork on the completion
format and that thing would need to have a different ring/etc.

>> I don't think this is a reasonable ask at all for this
>> patchset. If you want to work on that after the fact, then that's
>> certainly an option.
> 
> I think this work is extensible to sockets and the implementation need
> not be heavily tied to io_uring; yes at least leaving things open for
> a socket extension to be done easier in the future would be good, IMO

And as far as I can tell there is already a socket API allowing
all that, called devmem TCP :) Might need slight improvement on
the registration side unless dmabuf wrapped user pages are good
enough.

> I'll look at the series more closely to see if I actually have any
> concrete feedback along these lines. I hope you're open to some of it
> :-)
Jens Axboe Oct. 9, 2024, 7:47 p.m. UTC | #21
On 10/9/24 1:32 PM, Mina Almasry wrote:
> On Wed, Oct 9, 2024 at 9:57 AM Jens Axboe <axboe@kernel.dk> wrote:
>>
>> On 10/9/24 10:55 AM, Mina Almasry wrote:
>>> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>>>>
>>>> This patchset adds support for zero copy rx into userspace pages using
>>>> io_uring, eliminating a kernel to user copy.
>>>>
>>>> We configure a page pool that a driver uses to fill a hw rx queue to
>>>> hand out user pages instead of kernel pages. Any data that ends up
>>>> hitting this hw rx queue will thus be dma'd into userspace memory
>>>> directly, without needing to be bounced through kernel memory. 'Reading'
>>>> data out of a socket instead becomes a _notification_ mechanism, where
>>>> the kernel tells userspace where the data is. The overall approach is
>>>> similar to the devmem TCP proposal.
>>>>
>>>> This relies on hw header/data split, flow steering and RSS to ensure
>>>> packet headers remain in kernel memory and only desired flows hit a hw
>>>> rx queue configured for zero copy. Configuring this is outside of the
>>>> scope of this patchset.
>>>>
>>>> We share netdev core infra with devmem TCP. The main difference is that
>>>> io_uring is used for the uAPI and the lifetime of all objects are bound
>>>> to an io_uring instance.
>>>
>>> I've been thinking about this a bit, and I hope this feedback isn't
>>> too late, but I think your work may be useful for users not using
>>> io_uring. I.e. zero copy to host memory that is not dependent on page
>>> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
>>
>> Not David, but come on, let's please get this moving forward. It's been
>> stuck behind dependencies for seemingly forever, which are finally
>> resolved.
> 
> Part of the reason this has been stuck behind dependencies for so long
> is because the dependency took the time to implement things very
> generically (memory providers, net_iovs) and provided you with the
> primitives that enable your work. And dealt with nacks in this area
> you now don't have to deal with.

For sure, not trying to put blame on anyone here, just saying it's been
a long winding road.

>> I don't think this is a reasonable ask at all for this
>> patchset. If you want to work on that after the fact, then that's
>> certainly an option.
> 
> I think this work is extensible to sockets and the implementation need
> not be heavily tied to io_uring; yes at least leaving things open for
> a socket extension to be done easier in the future would be good, IMO.
> I'll look at the series more closely to see if I actually have any
> concrete feedback along these lines. I hope you're open to some of it
> :-)

I'm really not, if someone wants to tackle that, then they are welcome
to do so after the fact. I don't want to create Yet Another dependency
that would need resolving with another patch set behind it, particularly
when no such dependency exists in the first place.

There's zero reason why anyone interested in pursuing this path can't
just do it on top.
Pavel Begunkov Oct. 10, 2024, 1:19 p.m. UTC | #22
On 10/9/24 19:21, Pedro Tammela wrote:
> On 09/10/2024 13:55, Mina Almasry wrote:
>> [...]
>>
>> If not, I would like to see a comparison between TCP RX zerocopy and
>> this new io-uring zerocopy. For Google for example we use the TCP RX
>> zerocopy, I would like to see perf numbers possibly motivating us to
>> move to this new thing.
>>
>> [1] https://lwn.net/Articles/752046/
>>
> 
> Hi!
> 
>  From my own testing, the TCP RX Zerocopy is quite heavy on the page unmapping side. Since the io_uring implementation is expected to be lighter (see patch 11), I would expect a simple comparison to show better numbers for io_uring.

Let's see if kperf supports it or whether it can be easily added, but
since page flipping requires heavy mmap amortisation, it looks like the
interfaces even cover different sets of users; in this sense, comparing
to copy IMHO could be more interesting.

> To be fair to the existing implementation, it would then be needed to be paired with some 'real' computation, but that varies a lot. As we presented in netdevconf this year, HW-GRO eventually was the best option for us (no app changes, etc...) but still a case by case decision.
Jens Axboe Oct. 10, 2024, 2:21 p.m. UTC | #23
On 10/9/24 11:12 AM, Jens Axboe wrote:
> On 10/9/24 10:53 AM, Jens Axboe wrote:
>> On 10/9/24 10:50 AM, Jens Axboe wrote:
>>> On 10/9/24 10:35 AM, David Ahern wrote:
>>>> On 10/9/24 9:43 AM, Jens Axboe wrote:
>>>>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>>>>> as the sender, but then you're capped on the non-zc sender being too
>>>>> slow. The intel box does better, but it's still basically maxing out the
>>>>> sender at this point. So yeah, with a faster (or more efficient sender),
>>>>
>>>> I am surprised by this comment. You should not see a Tx limited test
>>>> (including CPU bound sender). Tx with ZC has been the easy option for a
>>>> while now.
>>>
>>> I just set this up to test yesterday and just used default! I'm sure
>>> there is a zc option, just not the default and hence it wasn't used.
>>> I'll give it a spin, will be useful for 200G testing.
>>
>> I think we're talking past each other. Yes send with zerocopy is
>> available for a while now, both with io_uring and just sendmsg(), but
>> I'm using kperf for testing and it does not look like it supports it.
>> Might have to add it... We'll see how far I can get without it.
> 
> Stanislav pointed me at:
> 
> https://github.com/facebookexperimental/kperf/pull/2
> 
> which adds zc send. I ran a quick test, and it does reduce cpu
> utilization on the sender from 100% to 95%. I'll keep poking...

Update on this - did more testing and the 100 -> 95 was a bit of a
fluke, it's still maxed. So I added io_uring send and sendzc support to
kperf, and I still saw the sendzc being maxed out sending at 100G rates
with 100% cpu usage.

Poked a bit, and the reason is that it's all memcpy() off
skb_orphan_frags_rx() -> skb_copy_ubufs(). At this point I asked Pavel
as that made no sense to me, and turns out the kernel thinks there's a
tap on the device. Maybe there is, haven't looked at that yet, but I
just killed the orphaning and tested again.

This looks better, now I can get 100G line rate from a single thread
using io_uring sendzc using only 30% of the single cpu/thread (including
irq time). That is good news, as it unlocks being able to test > 100G as
the sender is no longer the bottleneck.

Tap side still a mystery, but it unblocked testing. I'll figure that
part out separately.
David Ahern Oct. 10, 2024, 3:03 p.m. UTC | #24
On 10/10/24 8:21 AM, Jens Axboe wrote:
>> which adds zc send. I ran a quick test, and it does reduce cpu
>> utilization on the sender from 100% to 95%. I'll keep poking...
> 
> Update on this - did more testing and the 100 -> 95 was a bit of a
> fluke, it's still maxed. So I added io_uring send and sendzc support to
> kperf, and I still saw the sendzc being maxed out sending at 100G rates
> with 100% cpu usage.
> 
> Poked a bit, and the reason is that it's all memcpy() off
> skb_orphan_frags_rx() -> skb_copy_ubufs(). At this point I asked Pavel
> as that made no sense to me, and turns out the kernel thinks there's a
> tap on the device. Maybe there is, haven't looked at that yet, but I
> just killed the orphaning and tested again.
> 
> This looks better, now I can get 100G line rate from a single thread
> using io_uring sendzc using only 30% of the single cpu/thread (including
> irq time). That is good news, as it unlocks being able to test > 100G as
> the sender is no longer the bottleneck.
> 
> Tap side still a mystery, but it unblocked testing. I'll figure that
> part out separately.
> 

Thanks for the update. 30% cpu is more inline with my testing.

For the "tap" you need to make sure no packet socket applications are
running -- e.g., lldpd is a typical one I have seen in tests. Check
/proc/net/packet
Jens Axboe Oct. 10, 2024, 3:15 p.m. UTC | #25
On 10/10/24 9:03 AM, David Ahern wrote:
> On 10/10/24 8:21 AM, Jens Axboe wrote:
>>> which adds zc send. I ran a quick test, and it does reduce cpu
>>> utilization on the sender from 100% to 95%. I'll keep poking...
>>
>> Update on this - did more testing and the 100 -> 95 was a bit of a
>> fluke, it's still maxed. So I added io_uring send and sendzc support to
>> kperf, and I still saw the sendzc being maxed out sending at 100G rates
>> with 100% cpu usage.
>>
>> Poked a bit, and the reason is that it's all memcpy() off
>> skb_orphan_frags_rx() -> skb_copy_ubufs(). At this point I asked Pavel
>> as that made no sense to me, and turns out the kernel thinks there's a
>> tap on the device. Maybe there is, haven't looked at that yet, but I
>> just killed the orphaning and tested again.
>>
>> This looks better, now I can get 100G line rate from a single thread
>> using io_uring sendzc using only 30% of the single cpu/thread (including
>> irq time). That is good news, as it unlocks being able to test > 100G as
>> the sender is no longer the bottleneck.
>>
>> Tap side still a mystery, but it unblocked testing. I'll figure that
>> part out separately.
>>
> 
> Thanks for the update. 30% cpu is more inline with my testing.
> 
> For the "tap" you need to make sure no packet socket applications are
> running -- e.g., lldpd is a typical open I have a seen in tests. Check
> /proc/net/packet

Here's what I see:

sk               RefCnt Type Proto  Iface R Rmem   User   Inode
0000000078c66cbc 3      3    0003   2     1 0      0      112645
00000000558db352 3      3    0003   2     1 0      0      109578
00000000486837f4 3      3    0003   4     1 0      0      109580
00000000f7c6edd6 3      3    0003   4     1 0      0      22563 
000000006ec0363c 3      3    0003   2     1 0      0      22565 
0000000095e63bff 3      3    0003   5     1 0      0      22567 

was just now poking at what could be causing this. This is a server box,
nothing really is running on it... The nic in question is ifindex 2.
Jens Axboe Oct. 10, 2024, 6:11 p.m. UTC | #26
On 10/10/24 8:21 AM, Jens Axboe wrote:
> On 10/9/24 11:12 AM, Jens Axboe wrote:
>> On 10/9/24 10:53 AM, Jens Axboe wrote:
>>> On 10/9/24 10:50 AM, Jens Axboe wrote:
>>>> On 10/9/24 10:35 AM, David Ahern wrote:
>>>>> On 10/9/24 9:43 AM, Jens Axboe wrote:
>>>>>> Yep basically line rate, I get 97-98Gbps. I originally used a slower box
>>>>>> as the sender, but then you're capped on the non-zc sender being too
>>>>>> slow. The intel box does better, but it's still basically maxing out the
>>>>>> sender at this point. So yeah, with a faster (or more efficient sender),
>>>>>
>>>>> I am surprised by this comment. You should not see a Tx limited test
>>>>> (including CPU bound sender). Tx with ZC has been the easy option for a
>>>>> while now.
>>>>
>>>> I just set this up to test yesterday and just used default! I'm sure
>>>> there is a zc option, just not the default and hence it wasn't used.
>>>> I'll give it a spin, will be useful for 200G testing.
>>>
>>> I think we're talking past each other. Yes send with zerocopy is
>>> available for a while now, both with io_uring and just sendmsg(), but
>>> I'm using kperf for testing and it does not look like it supports it.
>>> Might have to add it... We'll see how far I can get without it.
>>
>> Stanislav pointed me at:
>>
>> https://github.com/facebookexperimental/kperf/pull/2
>>
>> which adds zc send. I ran a quick test, and it does reduce cpu
>> utilization on the sender from 100% to 95%. I'll keep poking...
> 
> Update on this - did more testing and the 100 -> 95 was a bit of a
> fluke, it's still maxed. So I added io_uring send and sendzc support to
> kperf, and I still saw the sendzc being maxed out sending at 100G rates
> with 100% cpu usage.
> 
> Poked a bit, and the reason is that it's all memcpy() off
> skb_orphan_frags_rx() -> skb_copy_ubufs(). At this point I asked Pavel
> as that made no sense to me, and turns out the kernel thinks there's a
> tap on the device. Maybe there is, haven't looked at that yet, but I
> just killed the orphaning and tested again.
> 
> This looks better, now I can get 100G line rate from a single thread
> using io_uring sendzc using only 30% of the single cpu/thread (including
> irq time). That is good news, as it unlocks being able to test > 100G as
> the sender is no longer the bottleneck.
> 
> Tap side still a mystery, but it unblocked testing. I'll figure that
> part out separately.

Further update - the above mystery was dhclient, thanks a lot to David
for being able to figure that out very quickly.

But the more interesting update - I got both links up on the receiving
side, providing 200G of bandwidth. I re-ran the test, with proper zero
copy running on the sending side, and io_uring zcrx on the receiver. The
receiver is two threads, BUT targeting the same queue on the two nics.
Both receiver threads bound to the same core (453 in this case). In
other words, single cpu thread is running all of both rx threads, napi
included.

Basic thread usage from top here:

10816 root      20   0  396640 393224      0 R  49.0   0.0   0:01.77 server
10818 root      20   0  396640 389128      0 R  49.0   0.0   0:01.76 server      

and I get 98.4Gbps and 98.6Gbps on the receiver side, which is basically
the combined link bw again. So 200G not enough to saturate a single cpu
thread.
David Wei Oct. 11, 2024, 12:29 a.m. UTC | #27
On 2024-10-09 09:55, Mina Almasry wrote:
> On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
>>
>> This patchset adds support for zero copy rx into userspace pages using
>> io_uring, eliminating a kernel to user copy.
>>
>> We configure a page pool that a driver uses to fill a hw rx queue to
>> hand out user pages instead of kernel pages. Any data that ends up
>> hitting this hw rx queue will thus be dma'd into userspace memory
>> directly, without needing to be bounced through kernel memory. 'Reading'
>> data out of a socket instead becomes a _notification_ mechanism, where
>> the kernel tells userspace where the data is. The overall approach is
>> similar to the devmem TCP proposal.
>>
>> This relies on hw header/data split, flow steering and RSS to ensure
>> packet headers remain in kernel memory and only desired flows hit a hw
>> rx queue configured for zero copy. Configuring this is outside of the
>> scope of this patchset.
>>
>> We share netdev core infra with devmem TCP. The main difference is that
>> io_uring is used for the uAPI and the lifetime of all objects are bound
>> to an io_uring instance.
> 
> I've been thinking about this a bit, and I hope this feedback isn't
> too late, but I think your work may be useful for users not using
> io_uring. I.e. zero copy to host memory that is not dependent on page
> aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
> 
> If we refactor things around a bit we should be able to have the
> memory tied to the RX queue similar to what AF_XDP does, and then we
> should be able to zero copy to the memory via regular sockets and via
> io_uring. This will be useful for us and other applications that would
> like to ZC similar to what you're doing here but not necessarily
> through io_uring.

Using io_uring and trying to move away from a socket based interface is
an explicit longer term goal. I see your proposal of adding a
traditional socket based API as orthogonal to what we're trying to do.
If someone is motivated enough to see this exist then they can build it
themselves.

> 
>> Data is 'read' using a new io_uring request
>> type. When done, data is returned via a new shared refill queue. A zero
>> copy page pool refills a hw rx queue from this refill queue directly. Of
>> course, the lifetime of these data buffers are managed by io_uring
>> rather than the networking stack, with different refcounting rules.
>>
>> This patchset is the first step adding basic zero copy support. We will
>> extend this iteratively with new features e.g. dynamically allocated
>> zero copy areas, THP support, dmabuf support, improved copy fallback,
>> general optimisations and more.
>>
>> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
>> aren't included since Taehee Yoo has already sent a more comprehensive
>> patchset adding support in [1]. Google gve should already support this,
> 
> This is an aside, but GVE supports this via the out-of-tree patches
> I've been carrying on github. Upstream we're working on adding the
> prerequisite page_pool support.
> 
>> and Mellanox mlx5 support is WIP pending driver changes.
>>
>> ===========
>> Performance
>> ===========
>>
>> Test setup:
>> * AMD EPYC 9454
>> * Broadcom BCM957508 200G
>> * Kernel v6.11 base [2]
>> * liburing fork [3]
>> * kperf fork [4]
>> * 4K MTU
>> * Single TCP flow
>>
>> With application thread + net rx softirq pinned to _different_ cores:
>>
>> epoll
>> 82.2 Gbps
>>
>> io_uring
>> 116.2 Gbps (+41%)
>>
>> Pinned to _same_ core:
>>
>> epoll
>> 62.6 Gbps
>>
>> io_uring
>> 80.9 Gbps (+29%)
>>
> 
> Are the 'epoll' results here and the 'io_uring' results using TCP RX
> zerocopy [1] and io_uring zerocopy respectively?
> 
> If not, I would like to see a comparison between TCP RX zerocopy and
> this new io-uring zerocopy. For Google for example we use the TCP RX
> zerocopy, I would like to see perf numbers possibly motivating us to
> move to this new thing.

No, it is comparing epoll without zero copy vs io_uring zero copy. Yes,
that's a fair request. I will add epoll with TCP_ZEROCOPY_RECEIVE to
kperf and compare.
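
Roughly, the receive path would follow the kernel's
tools/testing/selftests/net/tcp_mmap.c selftest - a sketch of the idea
(the eventual kperf code may differ):

#include <linux/tcp.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/types.h>

#define RX_WINDOW (1 << 20)

/* Map a read-only window backed by the socket itself. */
static void *map_rx_window(int fd)
{
	void *p = mmap(NULL, RX_WINDOW, PROT_READ, MAP_SHARED, fd, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* One TCP_ZEROCOPY_RECEIVE pass: the kernel maps received pages into
 * [map, map + zc.length). Bytes it could not map (zc.recv_skip_hint)
 * must be drained with a regular recv().
 */
static ssize_t recv_zc_once(int fd, void *map)
{
	struct tcp_zerocopy_receive zc;
	socklen_t zc_len = sizeof(zc);

	memset(&zc, 0, sizeof(zc));
	zc.address = (__u64)(unsigned long)map;
	zc.length = RX_WINDOW;

	if (getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, &zc, &zc_len))
		return -1;

	/* ... consume zc.length bytes at 'map' here ... */

	if (zc.length)	/* release pages so the kernel can reuse them */
		madvise(map, zc.length, MADV_DONTNEED);
	return zc.length;
}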

> 
> [1] https://lwn.net/Articles/752046/
> 
>
David Wei Oct. 11, 2024, 12:35 a.m. UTC | #28
On 2024-10-09 11:21, Pedro Tammela wrote:
> On 09/10/2024 13:55, Mina Almasry wrote:
>> [...]
>>
>> If not, I would like to see a comparison between TCP RX zerocopy and
>> this new io-uring zerocopy. For Google for example we use the TCP RX
>> zerocopy, I would like to see perf numbers possibly motivating us to
>> move to this new thing.
>>
>> [1] https://lwn.net/Articles/752046/
>>
> 
> Hi!
> 
> From my own testing, the TCP RX Zerocopy is quite heavy on the page unmapping side. Since the io_uring implementation is expected to be lighter (see patch 11), I would expect a simple comparison to show better numbers for io_uring.

Hi Pedro, I will add TCP_ZEROCOPY_RECEIVE to kperf and compare in the
next patchset.

> 
> To be fair to the existing implementation, it would then need to be paired with some 'real' computation, but that varies a lot. As we presented at netdevconf this year, HW-GRO eventually was the best option for us (no app changes, etc...) but it is still a case-by-case decision.

Why is there a need to add some computation to the benchmarks? A
benchmark is meant to be just that - a simple comparison that just looks
at the overheads of the stack. Real workloads are complex; I don't see
this feature as a universal win in all cases, but as very workload and
userspace architecture dependent.

As for HW-GRO, whynotboth.jpg?
David Wei Oct. 11, 2024, 6:15 a.m. UTC | #29
On 2024-10-09 09:12, Jens Axboe wrote:
> On 10/9/24 9:07 AM, Pavel Begunkov wrote:
>> On 10/9/24 00:10, Joe Damato wrote:
>>> On Mon, Oct 07, 2024 at 03:15:48PM -0700, David Wei wrote:
>>>> This patchset adds support for zero copy rx into userspace pages using
>>>> io_uring, eliminating a kernel to user copy.
>>>>
>>>> We configure a page pool that a driver uses to fill a hw rx queue to
>>>> hand out user pages instead of kernel pages. Any data that ends up
>>>> hitting this hw rx queue will thus be dma'd into userspace memory
>>>> directly, without needing to be bounced through kernel memory. 'Reading'
>>>> data out of a socket instead becomes a _notification_ mechanism, where
>>>> the kernel tells userspace where the data is. The overall approach is
>>>> similar to the devmem TCP proposal.
>>>>
>>>> This relies on hw header/data split, flow steering and RSS to ensure
>>>> packet headers remain in kernel memory and only desired flows hit a hw
>>>> rx queue configured for zero copy. Configuring this is outside of the
>>>> scope of this patchset.
>>>
>>> This looks super cool and very useful, thanks for doing this work.
>>>
>>> Is there any possibility of some notes or sample pseudo code on how
>>> userland can use this being added to Documentation/networking/ ?
>>
>> io_uring man pages would need to be updated with it; there are tests
>> in liburing, and it would be a good idea to add back a simple example
>> to liburing/example/*. I think that should cover it
> 
> man pages for sure, but +1 to the example too. Just a basic thing would
> get the point across, I think.
> 

Yeah, there's the liburing side with helpers and all that which will get
manpages. We'll also put back a simple example demonstrating the uAPI.
The liburing changes will be sent as a separate patchset to the io-uring
list.
Pedro Tammela Oct. 11, 2024, 2:28 p.m. UTC | #30
On 10/10/2024 21:35, David Wei wrote:
> On 2024-10-09 11:21, Pedro Tammela wrote:
>> On 09/10/2024 13:55, Mina Almasry wrote:
>>> [...]
>>>
>>> If not, I would like to see a comparison between TCP RX zerocopy and
>>> this new io-uring zerocopy. For Google for example we use the TCP RX
>>> zerocopy, I would like to see perf numbers possibly motivating us to
>>> move to this new thing.
>>>
>>> [1] https://lwn.net/Articles/752046/
>>>
>>
>> Hi!
>>
>>  From my own testing, the TCP RX Zerocopy is quite heavy on the page unmapping side. Since the io_uring implementation is expected to be lighter (see patch 11), I would expect a simple comparison to show better numbers for io_uring.
> 
> Hi Pedro, I will add TCP_ZEROCOPY_RECEIVE to kperf and compare in the
> next patchset.
> 
>>
>> To be fair to the existing implementation, it would then need to be paired with some 'real' computation, but that varies a lot. As we presented at netdevconf this year, HW-GRO eventually was the best option for us (no app changes, etc...) but it is still a case-by-case decision.
> 
> Why is there a need to add some computation to the benchmarks? A
> benchmark is meant to be just that - a simple comparison that just looks
> at the overheads of the stack.

For the use case we saw, streaming lots of data with zc, the RX pages
would linger in processing for a reasonable time, so the unmap cost was
amortized in the hotpath. That was not captured in our simple benchmark.

So for Mina's case, I guess the only way to know for sure whether it's
worth it is to implement the io_uring approach and compare.

> Real workloads are complex; I don't see this feature as a universal win
> in all cases, but as very workload and userspace architecture dependent.

100% agree here, that's our experience so far as well.
Just wanted to share this sentiment in my previous email.

I personally believe the io_uring approach will encompass more use cases 
than the existing implementation.

> 
> As for HW-GRO, whynotboth.jpg?

For us the cost of changing the apps/services to accommodate rx zc was
prohibitive for now, which led us to stick with HW-GRO.

IIRC, you mentioned at netdevconf that Meta uses a library for RPC, but
we don't have this luxury :/
Mina Almasry Oct. 11, 2024, 7:43 p.m. UTC | #31
On Thu, Oct 10, 2024 at 5:29 PM David Wei <dw@davidwei.uk> wrote:
>
> On 2024-10-09 09:55, Mina Almasry wrote:
> > On Mon, Oct 7, 2024 at 3:16 PM David Wei <dw@davidwei.uk> wrote:
> >>
> >> This patchset adds support for zero copy rx into userspace pages using
> >> io_uring, eliminating a kernel to user copy.
> >>
> >> We configure a page pool that a driver uses to fill a hw rx queue to
> >> hand out user pages instead of kernel pages. Any data that ends up
> >> hitting this hw rx queue will thus be dma'd into userspace memory
> >> directly, without needing to be bounced through kernel memory. 'Reading'
> >> data out of a socket instead becomes a _notification_ mechanism, where
> >> the kernel tells userspace where the data is. The overall approach is
> >> similar to the devmem TCP proposal.
> >>
> >> This relies on hw header/data split, flow steering and RSS to ensure
> >> packet headers remain in kernel memory and only desired flows hit a hw
> >> rx queue configured for zero copy. Configuring this is outside of the
> >> scope of this patchset.
> >>
> >> We share netdev core infra with devmem TCP. The main difference is that
> >> io_uring is used for the uAPI and the lifetime of all objects are bound
> >> to an io_uring instance.
> >
> > I've been thinking about this a bit, and I hope this feedback isn't
> > too late, but I think your work may be useful for users not using
> > io_uring. I.e. zero copy to host memory that is not dependent on page
> > aligned MSS sizing. I.e. AF_XDP zerocopy but using the TCP stack.
> >
> > If we refactor things around a bit we should be able to have the
> > memory tied to the RX queue similar to what AF_XDP does, and then we
> > should be able to zero copy to the memory via regular sockets and via
> > io_uring. This will be useful for us and other applications that would
> > like to ZC similar to what you're doing here but not necessarily
> > through io_uring.
>
> Using io_uring and trying to move away from a socket based interface is
> an explicit longer term goal. I see your proposal of adding a
> traditional socket based API as orthogonal to what we're trying to do.
> If someone is motivated enough to see this exist then they can build it
> themselves.
>

Yes, that was what I was suggesting. I (or whoever is interested) would
build it ourselves. Just calling out that your bits to bind umem to an
rx-queue and/or the memory provider could be reused if they are re-usable
(or can be made so). From a quick look it seems fine; nothing is
requested here from this series. Sorry I made it seem like I was asking
you to implement a sockets extension :-)

> >
> >> Data is 'read' using a new io_uring request
> >> type. When done, data is returned via a new shared refill queue. A zero
> >> copy page pool refills a hw rx queue from this refill queue directly. Of
> >> course, the lifetime of these data buffers are managed by io_uring
> >> rather than the networking stack, with different refcounting rules.
> >>
> >> This patchset is the first step adding basic zero copy support. We will
> >> extend this iteratively with new features e.g. dynamically allocated
> >> zero copy areas, THP support, dmabuf support, improved copy fallback,
> >> general optimisations and more.
> >>
> >> In terms of netdev support, we're first targeting Broadcom bnxt. Patches
> >> aren't included since Taehee Yoo has already sent a more comprehensive
> >> patchset adding support in [1]. Google gve should already support this,
> >
> > This is an aside, but GVE supports this via the out-of-tree patches
> > I've been carrying on github. Upstream we're working on adding the
> > prerequisite page_pool support.
> >
> >> and Mellanox mlx5 support is WIP pending driver changes.
> >>
> >> ===========
> >> Performance
> >> ===========
> >>
> >> Test setup:
> >> * AMD EPYC 9454
> >> * Broadcom BCM957508 200G
> >> * Kernel v6.11 base [2]
> >> * liburing fork [3]
> >> * kperf fork [4]
> >> * 4K MTU
> >> * Single TCP flow
> >>
> >> With application thread + net rx softirq pinned to _different_ cores:
> >>
> >> epoll
> >> 82.2 Gbps
> >>
> >> io_uring
> >> 116.2 Gbps (+41%)
> >>
> >> Pinned to _same_ core:
> >>
> >> epoll
> >> 62.6 Gbps
> >>
> >> io_uring
> >> 80.9 Gbps (+29%)
> >>
> >
> > Are the 'epoll' results here and the 'io_uring' results using TCP RX
> > zerocopy [1] and io_uring zerocopy respectively?
> >
> > If not, I would like to see a comparison between TCP RX zerocopy and
> > this new io-uring zerocopy. For Google for example we use the TCP RX
> > zerocopy, I would like to see perf numbers possibly motivating us to
> > move to this new thing.
>
> No, it is comparing epoll without zero copy vs io_uring zero copy. Yes,
> that's a fair request. I will add epoll with TCP_ZEROCOPY_RECEIVE to
> kperf and compare.
>

Awesome to hear. For us, we do use TCP_ZEROCOPY_RECEIVE (with
sockets), so I'm unsure how much benefit we'll see if we use this.
Comparing against TCP_ZEROCOPY_RECEIVE will be more of an
apples-to-apples comparison and may also motivate folks using the old
thing to switch to the new thing.
David Laight Oct. 14, 2024, 8:42 a.m. UTC | #32
...
> > Tap side still a mystery, but it unblocked testing. I'll figure that
> > part out separately.
> 
> Further update - the above mystery was dhclient, thanks a lot to David
> for being able to figure that out very quickly.

I've seen that before - on the rx side.
Is there any way to defer the copy until the packet passes a filter?
Or, better, teach dhcp to use a normal UDP socket??

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)