| Message ID | 3e2ef5f6d39c4631f5bae86b503a5397d6707563.1712923998.git.asml.silence@gmail.com |
|---|---|
| State | Superseded |
| Series | implement io_uring notification (ubuf_info) stacking |
| Context | Check | Description |
|---|---|---|
| netdev/tree_selection | success | Guessing tree name failed - patch did not apply |
Pavel Begunkov wrote:
> The network stack allows only one ubuf_info per skb, and unlike
> MSG_ZEROCOPY, each io_uring zerocopy send will carry a separate
> ubuf_info. That means that send requests can't reuse a previously
> allocated skb and need to get one or more new ones. That's fine
> for large sends, but otherwise it would spam the stack with lots of skbs
> carrying just a little data each.

Can you give a little context why each send request has to be a
separate ubuf_info?

This patch series aims to make that model more efficient. Would it be
possible to just change the model instead? I assume you tried that and
it proved unworkable, but is it easy to explain what the fundamental
blocker is?

MSG_ZEROCOPY uses uarg->len to identify multiple consecutive send
operations that can be notified at once.
On 4/14/24 18:10, Willem de Bruijn wrote:
> Pavel Begunkov wrote:
>> The network stack allows only one ubuf_info per skb, and unlike
>> MSG_ZEROCOPY, each io_uring zerocopy send will carry a separate
>> ubuf_info. That means that send requests can't reuse a previously
>> allocated skb and need to get one or more new ones. That's fine
>> for large sends, but otherwise it would spam the stack with lots of skbs
>> carrying just a little data each.
>
> Can you give a little context why each send request has to be a
> separate ubuf_info?
>
> This patch series aims to make that model more efficient. Would it be
> possible to just change the model instead? I assume you tried that and
> it proved unworkable, but is it easy to explain what the fundamental
> blocker is?

The uapi is such that you get a buffer completion (analogous to what you
get with recv(MSG_ERRQUEUE)) for each send request. With that, for an skb
to serve multiple send requests it'd need to store a list of completions
in some way. One could try to track sockets, have one "active" ubuf_info
per socket which all sends would use, and then eventually flush the
active ubuf so it can post completions and create a new one. But io_uring
wouldn't know when it needs to "flush", whereas in the net stack it
happens naturally when it pushes skbs from the queue. Not to mention
that socket tracking has its own complications.

As for the uapi, in early versions of io_uring's SEND_ZC, ubuf_info and
requests weren't entangled; roughly speaking, the user could choose
that this request should use this ubuf_info (I can elaborate if
interesting). It wasn't too complex, but all feedback was pointing out
that it's much easier to use how it is now, and honestly it does buy
us simplicity.

I'm not sure what a different model would give. We wouldn't win
in efficiency compared to this patch; I can go into details of
how there are no extra atomics/locks/kmalloc/etc. The only bit
is waking up waiting tasks, but that would still need to happen.
I can even optimise / amortise ubuf refcounting if that would
matter.

> MSG_ZEROCOPY uses uarg->len to identify multiple consecutive send
> operations that can be notified at once.
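To make the uapi being discussed concrete, here is a minimal liburing sketch (illustrative only, not from this series; it assumes a connected TCP sockfd and a kernel/liburing with SEND_ZC support). Each zerocopy send produces two CQEs: the send result first (flagged IORING_CQE_F_MORE), then a separate notification (flagged IORING_CQE_F_NOTIF) once the kernel drops its last reference to the buffer:

```c
#include <liburing.h>
#include <stdio.h>

/* Issue one zerocopy send and reap both of its completions. */
static int send_zc_once(struct io_uring *ring, int sockfd,
			const void *buf, size_t len)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
	struct io_uring_cqe *cqe;
	int seen = 0;

	io_uring_prep_send_zc(sqe, sockfd, buf, len, 0, 0);
	io_uring_submit(ring);

	/* two CQEs come back for this single request */
	while (seen < 2) {
		if (io_uring_wait_cqe(ring, &cqe))
			return -1;
		if (cqe->flags & IORING_CQE_F_NOTIF)
			printf("notif: buffer can be reused\n");
		else
			printf("send result: %d\n", cqe->res);
		seen++;
		io_uring_cqe_seen(ring, cqe);
	}
	return 0;
}
```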
Pavel Begunkov wrote:
> On 4/14/24 18:10, Willem de Bruijn wrote:
> > Pavel Begunkov wrote:
> >> The network stack allows only one ubuf_info per skb, and unlike
> >> MSG_ZEROCOPY, each io_uring zerocopy send will carry a separate
> >> ubuf_info. That means that send requests can't reuse a previously
> >> allocated skb and need to get one or more new ones. That's fine
> >> for large sends, but otherwise it would spam the stack with lots of skbs
> >> carrying just a little data each.
> >
> > Can you give a little context why each send request has to be a
> > separate ubuf_info?
> >
> > This patch series aims to make that model more efficient. Would it be
> > possible to just change the model instead? I assume you tried that and
> > it proved unworkable, but is it easy to explain what the fundamental
> > blocker is?
>
> The uapi is such that you get a buffer completion (analogous to what you
> get with recv(MSG_ERRQUEUE)) for each send request. With that, for an skb
> to serve multiple send requests it'd need to store a list of completions
> in some way.

I probably don't know the io_uring implementation well enough yet, so
take this with a huge grain of salt.

MSG_ZEROCOPY can generate completions for multiple send calls from a
single uarg, by virtue of completions being incrementing IDs.

Is there a fundamental reason why io_uring needs a 1:1 mapping between
request slots in the API and uarg in the datapath? Or differently, is
there no trivial way to associate a range of completions with a single
uarg?

> One could try to track sockets, have one "active" ubuf_info
> per socket which all sends would use, and then eventually flush the
> active ubuf so it can post completions and create a new one.

This is basically what MSG_ZEROCOPY does for TCP. It signals POLLERR
as soon as one completion arrives. Then when a process gets around to
calling MSG_ERRQUEUE, it returns the range of completions that have
arrived in the meantime. A process can thus decide to postpone
completion handling to increase batching.

> But io_uring
> wouldn't know when it needs to "flush", whereas in the net stack it
> happens naturally when it pushes skbs from the queue. Not to mention
> that socket tracking has its own complications.
>
> As for the uapi, in early versions of io_uring's SEND_ZC, ubuf_info and
> requests weren't entangled; roughly speaking, the user could choose
> that this request should use this ubuf_info (I can elaborate if
> interesting). It wasn't too complex, but all feedback was pointing out
> that it's much easier to use how it is now, and honestly it does buy
> us simplicity.

I see. I suppose that answers the 1:1 mapping ABI question I
asked above. I should reread that patch.

> I'm not sure what a different model would give. We wouldn't win
> in efficiency compared to this patch; I can go into details of
> how there are no extra atomics/locks/kmalloc/etc. The only bit
> is waking up waiting tasks, but that would still need to happen.
> I can even optimise / amortise ubuf refcounting if that would
> matter.

Slight aside: we know that MSG_ZEROCOPY is quite inefficient for
small sends. Very rough rule of thumb is you need around 16KB or
larger sends for it to outperform regular copy. Part of that is the
memory pinning. The other part is the notification handling.
MSG_ERRQUEUE is expensive. I hope that io_uring can not just match, but
improve on MSG_ZEROCOPY, especially for smaller packets.

> > MSG_ZEROCOPY uses uarg->len to identify multiple consecutive send
> > operations that can be notified at once.
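For comparison, a sketch of the MSG_ZEROCOPY completion model Willem describes, following Documentation/networking/msg_zerocopy.rst (read_zerocopy_completions is an illustrative helper name, not a library function). Send IDs increment by one per MSG_ZEROCOPY send, and a single error-queue message acknowledges the inclusive ID range ee_info..ee_data:

```c
#include <linux/errqueue.h>
#include <stdio.h>
#include <sys/socket.h>

/* Drain pending zerocopy completions from the socket error queue. */
static void read_zerocopy_completions(int fd)
{
	char control[128];
	struct msghdr msg = {
		.msg_control = control,
		.msg_controllen = sizeof(control),
	};
	struct cmsghdr *cm;
	struct sock_extended_err *serr;

	if (recvmsg(fd, &msg, MSG_ERRQUEUE) < 0)
		return;	/* e.g. EAGAIN: nothing pending yet */

	for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
		serr = (struct sock_extended_err *)CMSG_DATA(cm);
		if (serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY)
			continue;
		/* one message covers send IDs ee_info..ee_data inclusive */
		printf("sends %u..%u done%s\n",
		       serr->ee_info, serr->ee_data,
		       (serr->ee_code & SO_EE_CODE_ZEROCOPY_COPIED) ?
				" (fell back to copy)" : "");
	}
}
```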
On 4/15/24 16:15, Willem de Bruijn wrote:
> Pavel Begunkov wrote:
>> On 4/14/24 18:10, Willem de Bruijn wrote:
>>> Pavel Begunkov wrote:
>>>> The network stack allows only one ubuf_info per skb, and unlike
>>>> MSG_ZEROCOPY, each io_uring zerocopy send will carry a separate
>>>> ubuf_info. That means that send requests can't reuse a previously
>>>> allocated skb and need to get one or more new ones. That's fine
>>>> for large sends, but otherwise it would spam the stack with lots of skbs
>>>> carrying just a little data each.
>>>
>>> Can you give a little context why each send request has to be a
>>> separate ubuf_info?
>>>
>>> This patch series aims to make that model more efficient. Would it be
>>> possible to just change the model instead? I assume you tried that and
>>> it proved unworkable, but is it easy to explain what the fundamental
>>> blocker is?
>>
>> The uapi is such that you get a buffer completion (analogous to what you
>> get with recv(MSG_ERRQUEUE)) for each send request. With that, for an skb
>> to serve multiple send requests it'd need to store a list of completions
>> in some way.
>
> I probably don't know the io_uring implementation well enough yet, so
> take this with a huge grain of salt.
>
> MSG_ZEROCOPY can generate completions for multiple send calls from a
> single uarg, by virtue of completions being incrementing IDs.
>
> Is there a fundamental reason why io_uring needs a 1:1 mapping between
> request slots in the API and uarg in the datapath?

That's an ABI difference. Where MSG_ZEROCOPY returns a range of bytes
for the user to look up which buffers can now be reused, io_uring posts
one completion per send request, and by request I mean an io_uring way
of doing sendmsg(2). Hence the 1:1 mapping of uargs (which post that
completion) to send zc requests. IOW, while MSG_ZEROCOPY's uarg tracks
the byte range it covers, io_uring needs to know all requests associated
with it, which is currently just one request because of the 1:1 mapping.

> Or differently, is
> there no trivial way to associate a range of completions with a single
> uarg?

Quite non-trivial without changing the ABI, I'd say. And an ABI change
wouldn't be small and without pitfalls.

>> One could try to track sockets, have one "active" ubuf_info
>> per socket which all sends would use, and then eventually flush the
>> active ubuf so it can post completions and create a new one.
>
> This is basically what MSG_ZEROCOPY does for TCP. It signals POLLERR
> as soon as one completion arrives. Then when a process gets around to
> calling MSG_ERRQUEUE, it returns the range of completions that have
> arrived in the meantime. A process can thus decide to postpone
> completion handling to increase batching.

Yes, there is that on the completion side, but on the submission side
you also need to decide when to let the current uarg go and allocate
a new one. Not an issue if the uarg is owned by the TCP stack: you
don't have to additionally reference it, you know when you empty the
queue, and all that. Not that great if io_uring needs to talk to the
socket to understand when the uarg is better dropped.

>> But io_uring
>> wouldn't know when it needs to "flush", whereas in the net stack it
>> happens naturally when it pushes skbs from the queue. Not to mention
>> that socket tracking has its own complications.
>>
>> As for the uapi, in early versions of io_uring's SEND_ZC, ubuf_info and
>> requests weren't entangled; roughly speaking, the user could choose
>> that this request should use this ubuf_info (I can elaborate if
>> interesting).
>> It wasn't too complex, but all feedback was pointing out
>> that it's much easier to use how it is now, and honestly it does
>> buy us simplicity.
>
> I see. I suppose that answers the 1:1 mapping ABI question I
> asked above. I should reread that patch.
>
>> I'm not sure what a different model would give. We wouldn't win
>> in efficiency compared to this patch; I can go into details of
>> how there are no extra atomics/locks/kmalloc/etc. The only bit
>> is waking up waiting tasks, but that would still need to happen.
>> I can even optimise / amortise ubuf refcounting if that would
>> matter.
>
> Slight aside: we know that MSG_ZEROCOPY is quite inefficient for
> small sends. Very rough rule of thumb is you need around 16KB or
> larger sends for it to outperform regular copy. Part of that is the
> memory pinning. The other part is the notification handling.
> MSG_ERRQUEUE is expensive. I hope that io_uring can not just match, but
> improve on MSG_ZEROCOPY, especially for smaller packets.

I have some numbers left over from benchmarking this patchset. Not too
well suited to answer your question, but they still give an idea.
Just a benchmark, single buffer, 100G Broadcom NIC IIRC. All is
io_uring based; -z<bool> switches copy vs zerocopy. Zero copy
uses registered buffers, so no page pinning and page table
traversal at runtime. 10s per run is not ideal, but was matching
longer runs.

# 1200 bytes
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s1200 -z0
packets=15004160 (MB=17170), rps=1470996 (MB/s=1683)
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s1200 -z1
packets=10440224 (MB=11947), rps=1023551 (MB/s=1171)

# 4000 bytes
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s4000 -z0
packets=11742688 (MB=44794), rps=1151243 (MB/s=4391)
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s4000 -z1
packets=14144048 (MB=53955), rps=1386671 (MB/s=5289)

# 8000 bytes
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s8000 -z0
packets=6868976 (MB=52406), rps=673429 (MB/s=5137)
./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s8000 -z1
packets=10800784 (MB=82403), rps=1058900 (MB/s=8078)
> > Slight aside: we know that MSG_ZEROCOPY is quite inefficient for
> > small sends. Very rough rule of thumb is you need around 16KB or
> > larger sends for it to outperform regular copy. Part of that is the
> > memory pinning. The other part is the notification handling.
> > MSG_ERRQUEUE is expensive. I hope that io_uring can not just match, but
> > improve on MSG_ZEROCOPY, especially for smaller packets.
>
> I have some numbers left over from benchmarking this patchset. Not too
> well suited to answer your question, but they still give an idea.
> Just a benchmark, single buffer, 100G Broadcom NIC IIRC. All is
> io_uring based; -z<bool> switches copy vs zerocopy. Zero copy
> uses registered buffers, so no page pinning and page table
> traversal at runtime. 10s per run is not ideal, but was matching
> longer runs.
>
> # 1200 bytes
> ./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s1200 -z0
> packets=15004160 (MB=17170), rps=1470996 (MB/s=1683)
> ./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s1200 -z1
> packets=10440224 (MB=11947), rps=1023551 (MB/s=1171)
>
> # 4000 bytes
> ./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s4000 -z0
> packets=11742688 (MB=44794), rps=1151243 (MB/s=4391)
> ./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s4000 -z1
> packets=14144048 (MB=53955), rps=1386671 (MB/s=5289)
>
> # 8000 bytes
> ./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s8000 -z0
> packets=6868976 (MB=52406), rps=673429 (MB/s=5137)
> ./send-zerocopy -4 tcp -D <ip> -t 10 -n 1 -l0 -b1 -d -s8000 -z1
> packets=10800784 (MB=82403), rps=1058900 (MB/s=8078)

Parity around 4K. That is very encouraging :)
```diff
diff --git a/io_uring/notif.c b/io_uring/notif.c
index 26680176335f..d58cdc01e691 100644
--- a/io_uring/notif.c
+++ b/io_uring/notif.c
@@ -9,18 +9,28 @@
 #include "notif.h"
 #include "rsrc.h"
 
+static const struct ubuf_info_ops io_ubuf_ops;
+
 static void io_notif_tw_complete(struct io_kiocb *notif, struct io_tw_state *ts)
 {
 	struct io_notif_data *nd = io_notif_to_data(notif);
 
-	if (unlikely(nd->zc_report) && (nd->zc_copied || !nd->zc_used))
-		notif->cqe.res |= IORING_NOTIF_USAGE_ZC_COPIED;
+	do {
+		notif = cmd_to_io_kiocb(nd);
 
-	if (nd->account_pages && notif->ctx->user) {
-		__io_unaccount_mem(notif->ctx->user, nd->account_pages);
-		nd->account_pages = 0;
-	}
-	io_req_task_complete(notif, ts);
+		lockdep_assert(refcount_read(&nd->uarg.refcnt) == 0);
+
+		if (unlikely(nd->zc_report) && (nd->zc_copied || !nd->zc_used))
+			notif->cqe.res |= IORING_NOTIF_USAGE_ZC_COPIED;
+
+		if (nd->account_pages && notif->ctx->user) {
+			__io_unaccount_mem(notif->ctx->user, nd->account_pages);
+			nd->account_pages = 0;
+		}
+
+		nd = nd->next;
+		io_req_task_complete(notif, ts);
+	} while (nd);
 }
 
 void io_tx_ubuf_complete(struct sk_buff *skb, struct ubuf_info *uarg,
@@ -39,12 +49,56 @@ void io_tx_ubuf_complete(struct sk_buff *skb, struct ubuf_info *uarg,
 	if (!refcount_dec_and_test(&uarg->refcnt))
 		return;
 
+	if (nd->head != nd) {
+		io_tx_ubuf_complete(skb, &nd->head->uarg, success);
+		return;
+	}
 	notif->io_task_work.func = io_notif_tw_complete;
 	__io_req_task_work_add(notif, IOU_F_TWQ_LAZY_WAKE);
 }
 
+static int io_link_skb(struct sk_buff *skb, struct ubuf_info *uarg)
+{
+	struct io_notif_data *nd, *prev_nd;
+	struct io_kiocb *prev_notif, *notif;
+	struct ubuf_info *prev_uarg = skb_zcopy(skb);
+
+	nd = container_of(uarg, struct io_notif_data, uarg);
+	notif = cmd_to_io_kiocb(nd);
+
+	if (!prev_uarg) {
+		net_zcopy_get(&nd->uarg);
+		skb_zcopy_init(skb, &nd->uarg);
+		return 0;
+	}
+	/* handle it separately as we can't link a notif to itself */
+	if (unlikely(prev_uarg == &nd->uarg))
+		return 0;
+	/* we can't join two links together, just request a fresh skb */
+	if (unlikely(nd->head != nd || nd->next))
+		return -EEXIST;
+	/* don't mix zc providers */
+	if (unlikely(prev_uarg->ops != &io_ubuf_ops))
+		return -EEXIST;
+
+	prev_nd = container_of(prev_uarg, struct io_notif_data, uarg);
+	prev_notif = cmd_to_io_kiocb(prev_nd);
+
+	/* make sure all notifications can be finished in the same task_work */
+	if (unlikely(notif->ctx != prev_notif->ctx ||
+		     notif->task != prev_notif->task))
+		return -EEXIST;
+
+	nd->head = prev_nd->head;
+	nd->next = prev_nd->next;
+	prev_nd->next = nd;
+	net_zcopy_get(&nd->head->uarg);
+	return 0;
+}
+
 static const struct ubuf_info_ops io_ubuf_ops = {
 	.complete = io_tx_ubuf_complete,
+	.link_skb = io_link_skb,
 };
 
 struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
@@ -65,6 +119,9 @@ struct io_kiocb *io_alloc_notif(struct io_ring_ctx *ctx)
 	nd = io_notif_to_data(notif);
 	nd->zc_report = false;
 	nd->account_pages = 0;
+	nd->next = NULL;
+	nd->head = nd;
+
 	nd->uarg.flags = IO_NOTIF_UBUF_FLAGS;
 	nd->uarg.ops = &io_ubuf_ops;
 	refcount_set(&nd->uarg.refcnt, 1);
diff --git a/io_uring/notif.h b/io_uring/notif.h
index 394e1d33daa6..6d2e8b674b43 100644
--- a/io_uring/notif.h
+++ b/io_uring/notif.h
@@ -14,6 +14,10 @@ struct io_notif_data {
 	struct file *file;
 	struct ubuf_info uarg;
 	unsigned long account_pages;
+
+	struct io_notif_data *next;
+	struct io_notif_data *head;
+
 	bool zc_report;
 	bool zc_used;
 	bool zc_copied;
```
The network stack allows only one ubuf_info per skb, and unlike
MSG_ZEROCOPY, each io_uring zerocopy send will carry a separate
ubuf_info. That means that send requests can't reuse a previously
allocated skb and need to get one or more new ones. That's fine
for large sends, but otherwise it would spam the stack with lots of skbs
carrying just a little data each.

To help with that, implement linking notifications (i.e. an io_uring
wrapper around ubuf_info) into a list. Each is refcounted by skbs and
the stack as usual. Additionally, all non-head entries keep a reference
to the head, which they put down when their refcount hits 0. When the
head has no more users, it'll efficiently put all notifications in a
batch.

As mentioned previously about ->io_link_skb, the callback implementation
always allows binding to an skb without a ubuf_info.

Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
---
 io_uring/notif.c | 71 +++++++++++++++++++++++++++++++++++++++++++-----
 io_uring/notif.h |  4 +++
 2 files changed, 68 insertions(+), 7 deletions(-)
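For context on that last paragraph, a rough sketch of where the ->link_skb callback sits on the stack side. This is an assumed shape only: skb_zcopy_link is a hypothetical helper name, and the actual hook is introduced by an earlier patch in this series, not this one:

```c
/* Hypothetical helper showing where ->link_skb slots in: instead of
 * rejecting an skb that already carries a different ubuf_info, the
 * stack lets the provider decide, and io_link_skb() above may chain
 * the new notification onto the one already attached. */
static int skb_zcopy_link(struct sk_buff *skb, struct ubuf_info *uarg)
{
	struct ubuf_info *orig_uarg = skb_zcopy(skb);

	if (uarg->ops->link_skb)
		return uarg->ops->link_skb(skb, uarg);

	/* legacy behaviour: only one ubuf_info per skb */
	if (orig_uarg && orig_uarg != uarg)
		return -EEXIST;
	skb_zcopy_set(skb, uarg, NULL);
	return 0;
}
```

When linking succeeds, a small send can ride on the previous request's skb instead of forcing a fresh allocation, which is exactly the spam-the-stack problem the cover text describes.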