
[v1,13/15] io_uring/zcrx: add copy fallback

Message ID 20241007221603.1703699-14-dw@davidwei.uk (mailing list archive)
State New
Series io_uring zero copy rx

Commit Message

David Wei Oct. 7, 2024, 10:16 p.m. UTC
From: Pavel Begunkov <asml.silence@gmail.com>

There are scenarios in which the zerocopy path might get a normal
in-kernel buffer: it could be a mis-steered packet or simply the linear
part of an skb. Another use case is to allow the driver to allocate
kernel pages when it's out of zc buffers, which makes it more resilient
to spikes in load and lets the user choose the balance between the
amount of memory provided and performance.

At the moment we fail such requests. Instead, grab a buffer from the
page pool, copy the data there, and return it to the user in the usual
way. Because the refill ring is private to the napi our page pool is
running from, this is done by stopping the napi via the napi_execute()
helper. It grabs only one buffer at a time, which is inefficient;
improving that is left for follow-up patches.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
---
 io_uring/zcrx.c | 125 +++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 118 insertions(+), 7 deletions(-)
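The core of the fallback is a loop that carves the source buffer into page-sized chunks, copies each chunk into a freshly allocated provider buffer, and posts one CQE per chunk. The chunking arithmetic can be illustrated with a small userspace sketch; the function name, `CHUNK_SIZE` constant, and the flat destination buffer here are illustrative stand-ins, not the kernel API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHUNK_SIZE 4096 /* stands in for PAGE_SIZE */

/*
 * Copy len bytes starting at src + offset into CHUNK_SIZE-sized pieces,
 * one destination buffer per iteration, mirroring the loop shape of
 * io_zcrx_copy_chunk(). Returns the number of chunks (i.e. CQEs that
 * the kernel path would post).
 */
static int copy_chunks(const char *src, size_t offset, size_t len,
		       char *dst, size_t dst_cap)
{
	size_t copied = 0;
	int chunks = 0;

	while (len) {
		size_t copy_size = len < CHUNK_SIZE ? len : CHUNK_SIZE;

		if (copied + copy_size > dst_cap)
			break; /* out of buffers: the kernel path errors out here */
		memcpy(dst + copied, src + offset, copy_size);
		offset += copy_size;
		copied += copy_size;
		len -= copy_size;
		chunks++;
	}
	return chunks;
}
```

Note that a single large linear skb area therefore fans out into several CQEs, one per page-sized buffer, which is what makes per-copy CQE signalling noisy (see the discussion below).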

Comments

Stanislav Fomichev Oct. 8, 2024, 3:58 p.m. UTC | #1
On 10/07, David Wei wrote:
> From: Pavel Begunkov <asml.silence@gmail.com>
> 
> There are scenarios in which the zerocopy path might get a normal
> in-kernel buffer, it could be a mis-steered packet or simply the linear
> part of an skb. Another use case is to allow the driver to allocate
> kernel pages when it's out of zc buffers, which makes it more resilient
> to spikes in load and allow the user to choose the balance between the
> amount of memory provided and performance.

Tangential: should there be some clear way for the users to discover that
(some counter of some entry on cq about copy fallback)?

Or the expectation is that somebody will run bpftrace to diagnose
(supposedly) poor ZC performance when it falls back to copy?
Pavel Begunkov Oct. 8, 2024, 4:39 p.m. UTC | #2
On 10/8/24 16:58, Stanislav Fomichev wrote:
> On 10/07, David Wei wrote:
>> From: Pavel Begunkov <asml.silence@gmail.com>
>>
>> There are scenarios in which the zerocopy path might get a normal
>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>> part of an skb. Another use case is to allow the driver to allocate
>> kernel pages when it's out of zc buffers, which makes it more resilient
>> to spikes in load and allow the user to choose the balance between the
>> amount of memory provided and performance.
> 
> Tangential: should there be some clear way for the users to discover that
> (some counter of some entry on cq about copy fallback)?
> 
> Or the expectation is that somebody will run bpftrace to diagnose
> (supposedly) poor ZC performance when it falls back to copy?

We had a notification for testing before, but it was left out of this
series for follow-up patches to keep things simple. We can post a
special CQE from time to time to notify the user, which should be
just fine as it's a slow path.
David Wei Oct. 8, 2024, 4:40 p.m. UTC | #3
On 2024-10-08 08:58, Stanislav Fomichev wrote:
> On 10/07, David Wei wrote:
>> From: Pavel Begunkov <asml.silence@gmail.com>
>>
>> There are scenarios in which the zerocopy path might get a normal
>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>> part of an skb. Another use case is to allow the driver to allocate
>> kernel pages when it's out of zc buffers, which makes it more resilient
>> to spikes in load and allow the user to choose the balance between the
>> amount of memory provided and performance.
> 
> Tangential: should there be some clear way for the users to discover that
> (some counter of some entry on cq about copy fallback)?
> 
> Or the expectation is that somebody will run bpftrace to diagnose
> (supposedly) poor ZC performance when it falls back to copy?

Yeah there definitely needs to be a way to notify the user that copy
fallback happened. Right now I'm relying on bpftrace hooking into
io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
too much. I can think of two other options:

1. Send a final cqe at the end of a number of frag cqes with a count of
   the number of copies.
2. Register a secondary area just for handling copies.

Other suggestions are also very welcome.
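Option 1 could be sketched as below. This is a hypothetical userspace-side model for illustration only; the `ZCRX_CQE_*` kinds and the `fake_cqe` layout are invented here and are NOT a proposed uapi. The idea is that the per-frag CQEs are posted as usual, and one trailing summary CQE carries the count of frags that hit the copy fallback:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical CQE flavours, for illustration only. */
enum { ZCRX_CQE_FRAG = 0, ZCRX_CQE_COPY_SUMMARY = 1 };

struct fake_cqe {
	uint32_t kind;  /* ZCRX_CQE_* */
	uint32_t count; /* for COPY_SUMMARY: number of copied frags */
};

/*
 * Build a CQE batch for nr_frags frags, nr_copied of which went through
 * the copy fallback; append a single summary CQE only when copies
 * actually happened, so the common all-zerocopy case stays unchanged.
 * Returns the number of CQEs written.
 */
static int emit_batch(struct fake_cqe *cq, int nr_frags, int nr_copied)
{
	int n = 0;

	for (int i = 0; i < nr_frags; i++)
		cq[n++] = (struct fake_cqe){ .kind = ZCRX_CQE_FRAG };
	if (nr_copied)
		cq[n++] = (struct fake_cqe){ .kind = ZCRX_CQE_COPY_SUMMARY,
					     .count = (uint32_t)nr_copied };
	return n;
}
```

The appeal of this shape is that the consumer only pays for the extra CQE when a fallback occurred, and a reader that doesn't care can skip the summary entry.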
Stanislav Fomichev Oct. 9, 2024, 4:30 p.m. UTC | #4
On 10/08, David Wei wrote:
> On 2024-10-08 08:58, Stanislav Fomichev wrote:
> > On 10/07, David Wei wrote:
> >> From: Pavel Begunkov <asml.silence@gmail.com>
> >>
> >> There are scenarios in which the zerocopy path might get a normal
> >> in-kernel buffer, it could be a mis-steered packet or simply the linear
> >> part of an skb. Another use case is to allow the driver to allocate
> >> kernel pages when it's out of zc buffers, which makes it more resilient
> >> to spikes in load and allow the user to choose the balance between the
> >> amount of memory provided and performance.
> > 
> > Tangential: should there be some clear way for the users to discover that
> > (some counter of some entry on cq about copy fallback)?
> > 
> > Or the expectation is that somebody will run bpftrace to diagnose
> > (supposedly) poor ZC performance when it falls back to copy?
> 
> Yeah there definitely needs to be a way to notify the user that copy
> fallback happened. Right now I'm relying on bpftrace hooking into
> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
> too much. I can think of two other options:
> 
> 1. Send a final cqe at the end of a number of frag cqes with a count of
>    the number of copies.
> 2. Register a secondary area just for handling copies.
> 
> Other suggestions are also very welcome.

SG, thanks. Up to you and Pavel on the mechanism and whether to follow
up separately. Maybe even move this fallback (this patch) into that separate
series as well? Will be easier to review/accept the rest.
Jens Axboe Oct. 9, 2024, 6:38 p.m. UTC | #5
On 10/7/24 4:16 PM, David Wei wrote:
> From: Pavel Begunkov <asml.silence@gmail.com>
> 
> There are scenarios in which the zerocopy path might get a normal
> in-kernel buffer, it could be a mis-steered packet or simply the linear
> part of an skb. Another use case is to allow the driver to allocate
> kernel pages when it's out of zc buffers, which makes it more resilient
> to spikes in load and allow the user to choose the balance between the
> amount of memory provided and performance.
> 
> At the moment we fail such requests. Instead, grab a buffer from the
> page pool, copy data there, and return back to user in the usual way.
> Because the refill ring is private to the napi our page pool is running
> from, it's done by stopping the napi via napi_execute() helper. It grabs
> only one buffer, which is inefficient, and improving it is left for
> follow up patches.

This also looks fine to me. Agree with the sentiment that it'd be nice
to propagate back if copies are happening, and to what extent. But also
agree that this can wait.
Pavel Begunkov Oct. 9, 2024, 11:05 p.m. UTC | #6
On 10/9/24 17:30, Stanislav Fomichev wrote:
> On 10/08, David Wei wrote:
>> On 2024-10-08 08:58, Stanislav Fomichev wrote:
>>> On 10/07, David Wei wrote:
>>>> From: Pavel Begunkov <asml.silence@gmail.com>
>>>>
>>>> There are scenarios in which the zerocopy path might get a normal
>>>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>>>> part of an skb. Another use case is to allow the driver to allocate
>>>> kernel pages when it's out of zc buffers, which makes it more resilient
>>>> to spikes in load and allow the user to choose the balance between the
>>>> amount of memory provided and performance.
>>>
>>> Tangential: should there be some clear way for the users to discover that
>>> (some counter of some entry on cq about copy fallback)?
>>>
>>> Or the expectation is that somebody will run bpftrace to diagnose
>>> (supposedly) poor ZC performance when it falls back to copy?
>>
>> Yeah there definitely needs to be a way to notify the user that copy
>> fallback happened. Right now I'm relying on bpftrace hooking into
>> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
>> too much. I can think of two other options:
>>
>> 1. Send a final cqe at the end of a number of frag cqes with a count of
>>     the number of copies.
>> 2. Register a secondary area just for handling copies.
>>
>> Other suggestions are also very welcome.
> 
> SG, thanks. Up to you and Pavel on the mechanism and whether to follow
> up separately. Maybe even move this fallback (this patch) into that separate
> series as well? Will be easier to review/accept the rest.

I think it's fine to leave it? It shouldn't be particularly
interesting for the net folks to review, and without it any skb
with a linear part would break it, though perhaps that's not such
a concern for bnxt.
David Wei Oct. 11, 2024, 6:22 a.m. UTC | #7
On 2024-10-09 16:05, Pavel Begunkov wrote:
> On 10/9/24 17:30, Stanislav Fomichev wrote:
>> On 10/08, David Wei wrote:
>>> On 2024-10-08 08:58, Stanislav Fomichev wrote:
>>>> On 10/07, David Wei wrote:
>>>>> From: Pavel Begunkov <asml.silence@gmail.com>
>>>>>
>>>>> There are scenarios in which the zerocopy path might get a normal
>>>>> in-kernel buffer, it could be a mis-steered packet or simply the linear
>>>>> part of an skb. Another use case is to allow the driver to allocate
>>>>> kernel pages when it's out of zc buffers, which makes it more resilient
>>>>> to spikes in load and allow the user to choose the balance between the
>>>>> amount of memory provided and performance.
>>>>
>>>> Tangential: should there be some clear way for the users to discover that
>>>> (some counter of some entry on cq about copy fallback)?
>>>>
>>>> Or the expectation is that somebody will run bpftrace to diagnose
>>>> (supposedly) poor ZC performance when it falls back to copy?
>>>
>>> Yeah there definitely needs to be a way to notify the user that copy
>>> fallback happened. Right now I'm relying on bpftrace hooking into
>>> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
>>> too much. I can think of two other options:
>>>
>>> 1. Send a final cqe at the end of a number of frag cqes with a count of
>>>     the number of copies.
>>> 2. Register a secondary area just for handling copies.
>>>
>>> Other suggestions are also very welcome.
>>
>> SG, thanks. Up to you and Pavel on the mechanism and whether to follow
>> up separately. Maybe even move this fallback (this patch) into that separate
>> series as well? Will be easier to review/accept the rest.
> 
> I think it's fine to leave it? It shouldn't be particularly
> interesting to the net folks to review, and without it any skb
> with the linear part would break it, but perhaps it's not such
> a concern for bnxt.
> 

My preference is to leave it. Actually from real workloads, fully
linearized skbs are not uncommon due to the minimum size for HDS to kick
in for bnxt. Taking this out would imo make this patchset functionally
broken. Since we're all in agreement here, let's defer the improvements
as a follow up.
Stanislav Fomichev Oct. 11, 2024, 2:43 p.m. UTC | #8
On 10/10, David Wei wrote:
> On 2024-10-09 16:05, Pavel Begunkov wrote:
> > On 10/9/24 17:30, Stanislav Fomichev wrote:
> >> On 10/08, David Wei wrote:
> >>> On 2024-10-08 08:58, Stanislav Fomichev wrote:
> >>>> On 10/07, David Wei wrote:
> >>>>> From: Pavel Begunkov <asml.silence@gmail.com>
> >>>>>
> >>>>> There are scenarios in which the zerocopy path might get a normal
> >>>>> in-kernel buffer, it could be a mis-steered packet or simply the linear
> >>>>> part of an skb. Another use case is to allow the driver to allocate
> >>>>> kernel pages when it's out of zc buffers, which makes it more resilient
> >>>>> to spikes in load and allow the user to choose the balance between the
> >>>>> amount of memory provided and performance.
> >>>>
> >>>> Tangential: should there be some clear way for the users to discover that
> >>>> (some counter of some entry on cq about copy fallback)?
> >>>>
> >>>> Or the expectation is that somebody will run bpftrace to diagnose
> >>>> (supposedly) poor ZC performance when it falls back to copy?
> >>>
> >>> Yeah there definitely needs to be a way to notify the user that copy
> >>> fallback happened. Right now I'm relying on bpftrace hooking into
> >>> io_zcrx_copy_chunk(). Doing it per cqe (which is emitted per frag) is
> >>> too much. I can think of two other options:
> >>>
> >>> 1. Send a final cqe at the end of a number of frag cqes with a count of
> >>>     the number of copies.
> >>> 2. Register a secondary area just for handling copies.
> >>>
> >>> Other suggestions are also very welcome.
> >>
> >> SG, thanks. Up to you and Pavel on the mechanism and whether to follow
> >> up separately. Maybe even move this fallback (this patch) into that separate
> >> series as well? Will be easier to review/accept the rest.
> > 
> > I think it's fine to leave it? It shouldn't be particularly
> > interesting to the net folks to review, and without it any skb
> > with the linear part would break it, but perhaps it's not such
> > a concern for bnxt.
> > 
> 
> My preference is to leave it. Actually from real workloads, fully
> linearized skbs are not uncommon due to the minimum size for HDS to kick
> in for bnxt. Taking this out would imo make this patchset functionally
> broken. Since we're all in agreement here, let's defer the improvements
> as a follow up.

Sounds good!

Patch

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 8166d8a2656e..d21e7017deb3 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -5,6 +5,8 @@ 
 #include <linux/nospec.h>
 #include <linux/netdevice.h>
 #include <linux/io_uring.h>
+#include <linux/skbuff_ref.h>
+#include <net/busy_poll.h>
 #include <net/page_pool/helpers.h>
 #include <trace/events/page_pool.h>
 #include <net/tcp.h>
@@ -28,6 +30,11 @@  struct io_zcrx_args {
 	struct socket		*sock;
 };
 
+struct io_zc_refill_data {
+	struct io_zcrx_ifq *ifq;
+	struct net_iov *niov;
+};
+
 static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
 {
 	struct net_iov_area *owner = net_iov_owner(niov);
@@ -35,6 +42,13 @@  static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *nio
 	return container_of(owner, struct io_zcrx_area, nia);
 }
 
+static inline struct page *io_zcrx_iov_page(const struct net_iov *niov)
+{
+	struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+	return area->pages[net_iov_idx(niov)];
+}
+
 static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
 				 struct io_uring_zcrx_ifq_reg *reg)
 {
@@ -475,6 +489,34 @@  const struct memory_provider_ops io_uring_pp_zc_ops = {
 	.scrub			= io_pp_zc_scrub,
 };
 
+static void io_napi_refill(void *data)
+{
+	struct io_zc_refill_data *rd = data;
+	struct io_zcrx_ifq *ifq = rd->ifq;
+	netmem_ref netmem;
+
+	if (WARN_ON_ONCE(!ifq->pp))
+		return;
+
+	netmem = page_pool_alloc_netmem(ifq->pp, GFP_ATOMIC | __GFP_NOWARN);
+	if (!netmem)
+		return;
+	if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+		return;
+
+	rd->niov = netmem_to_net_iov(netmem);
+}
+
+static struct net_iov *io_zc_get_buf_task_safe(struct io_zcrx_ifq *ifq)
+{
+	struct io_zc_refill_data rd = {
+		.ifq = ifq,
+	};
+
+	napi_execute(ifq->napi_id, io_napi_refill, &rd);
+	return rd.niov;
+}
+
 static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
 			      struct io_zcrx_ifq *ifq, int off, int len)
 {
@@ -498,6 +540,45 @@  static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov,
 	return true;
 }
 
+static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
+				  void *data, unsigned int offset, size_t len)
+{
+	size_t copy_size, copied = 0;
+	int ret = 0, off = 0;
+	struct page *page;
+	u8 *vaddr;
+
+	do {
+		struct net_iov *niov;
+
+		niov = io_zc_get_buf_task_safe(ifq);
+		if (!niov) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		page = io_zcrx_iov_page(niov);
+		vaddr = kmap_local_page(page);
+		copy_size = min_t(size_t, PAGE_SIZE, len);
+		memcpy(vaddr, data + offset, copy_size);
+		kunmap_local(vaddr);
+
+		if (!io_zcrx_queue_cqe(req, niov, ifq, off, copy_size)) {
+			napi_pp_put_page(net_iov_to_netmem(niov));
+			return -ENOSPC;
+		}
+
+		io_zcrx_get_buf_uref(niov);
+		napi_pp_put_page(net_iov_to_netmem(niov));
+
+		offset += copy_size;
+		len -= copy_size;
+		copied += copy_size;
+	} while (offset < len);
+
+	return copied ? copied : ret;
+}
+
 static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 			     const skb_frag_t *frag, int off, int len)
 {
@@ -505,8 +586,24 @@  static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq,
 
 	off += skb_frag_off(frag);
 
-	if (unlikely(!skb_frag_is_net_iov(frag)))
-		return -EOPNOTSUPP;
+	if (unlikely(!skb_frag_is_net_iov(frag))) {
+		struct page *page = skb_frag_page(frag);
+		u32 p_off, p_len, t, copied = 0;
+		u8 *vaddr;
+		int ret = 0;
+
+		skb_frag_foreach_page(frag, off, len,
+				      page, p_off, p_len, t) {
+			vaddr = kmap_local_page(page);
+			ret = io_zcrx_copy_chunk(req, ifq, vaddr, p_off, p_len);
+			kunmap_local(vaddr);
+
+			if (ret < 0)
+				return copied ? copied : ret;
+			copied += ret;
+		}
+		return copied;
+	}
 
 	niov = netmem_to_net_iov(frag->netmem);
 	if (niov->pp->mp_ops != &io_uring_pp_zc_ops ||
@@ -527,15 +624,29 @@  io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb,
 	struct io_zcrx_ifq *ifq = args->ifq;
 	struct io_kiocb *req = args->req;
 	struct sk_buff *frag_iter;
-	unsigned start, start_off;
+	unsigned start, start_off = offset;
 	int i, copy, end, off;
 	int ret = 0;
 
-	start = skb_headlen(skb);
-	start_off = offset;
+	if (unlikely(offset < skb_headlen(skb))) {
+		ssize_t copied;
+		size_t to_copy;
 
-	if (offset < start)
-		return -EOPNOTSUPP;
+		to_copy = min_t(size_t, skb_headlen(skb) - offset, len);
+		copied = io_zcrx_copy_chunk(req, ifq, skb->data, offset, to_copy);
+		if (copied < 0) {
+			ret = copied;
+			goto out;
+		}
+		offset += copied;
+		len -= copied;
+		if (!len)
+			goto out;
+		if (offset != skb_headlen(skb))
+			goto out;
+	}
+
+	start = skb_headlen(skb);
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		const skb_frag_t *frag;