From patchwork Sat Aug 26 01:19:44 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13366462 Received: from localhost (fwdproxy-prn-119.fbsv.net. 
[2a03:2880:ff:77::face:b00c]) by smtp.gmail.com with ESMTPSA id w5-20020a170902a70500b001ae0152d280sm2387110plq.193.2023.08.25.18.21.28 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 25 Aug 2023 18:21:28 -0700 (PDT) From: David Wei To: Jens Axboe , Pavel Begunkov Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, Mina Almasry , Jakub Kicinski Subject: [PATCH 01/11] io_uring: add interface queue Date: Fri, 25 Aug 2023 18:19:44 -0700 Message-Id: <20230826011954.1801099-2-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20230826011954.1801099-1-dw@davidwei.uk> References: <20230826011954.1801099-1-dw@davidwei.uk> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org From: David Wei This patch introduces a new object in io_uring called an interface queue (ifq) which contains: * A pool region allocated by userspace and registered w/ io_uring where RX data is written to. * A net device and one specific RX queue in it that will be configured for ZC RX. * A pair of shared ringbuffers w/ userspace, dubbed registered buf (rbuf) rings. Each entry contains a pool region id and an offset + len within that region. The kernel writes entries into the completion ring to tell userspace where RX data is relative to the start of a region. Userspace writes entries into the refill ring to tell the kernel when it is done with the data. For now, each io_uring instance has a single ifq, and each ifq has a single pool region associated with one RX queue. Add a new opcode to io_uring_register that sets up an ifq. Size and offsets of shared ringbuffers are returned to userspace for it to mmap. The implementation will be added in a later patch. Signed-off-by: David Wei --- include/linux/io_uring_types.h | 6 ++++ include/uapi/linux/io_uring.h | 50 ++++++++++++++++++++++++++ io_uring/Makefile | 3 +- io_uring/io_uring.c | 7 ++++ io_uring/kbuf.c | 30 ++++++++++++++++ io_uring/kbuf.h | 5 +++ io_uring/zc_rx.c | 65 ++++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 21 +++++++++++ 8 files changed, 186 insertions(+), 1 deletion(-) create mode 100644 io_uring/zc_rx.c create mode 100644 io_uring/zc_rx.h diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h index 1b2a20a42413..8f6068da185c 100644 --- a/include/linux/io_uring_types.h +++ b/include/linux/io_uring_types.h @@ -151,6 +151,10 @@ struct io_rings { struct io_uring_cqe cqes[] ____cacheline_aligned_in_smp; }; +struct io_rbuf_ring { + struct io_uring rq, cq; +}; + struct io_restriction { DECLARE_BITMAP(register_op, IORING_REGISTER_LAST); DECLARE_BITMAP(sqe_op, IORING_OP_LAST); @@ -330,6 +334,8 @@ struct io_ring_ctx { struct io_rsrc_data *file_data; struct io_rsrc_data *buf_data; + struct io_zc_rx_ifq *ifq; + /* protected by ->uring_lock */ struct list_head rsrc_ref_list; struct io_alloc_cache rsrc_node_cache; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 0716cb17e436..8f2a1061629b 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -523,6 +523,9 @@ enum { /* register a range of fixed file slots for automatic slot allocation */ IORING_REGISTER_FILE_ALLOC_RANGE = 25, + /* register a network interface queue for zerocopy */ + IORING_REGISTER_ZC_RX_IFQ = 26, + /* this goes last */ IORING_REGISTER_LAST, @@ -703,6 +706,53 @@ struct io_uring_recvmsg_out { __u32 flags; }; +struct io_uring_rbuf_rqe { + __u32 off; + __u32 len; + __u16 region; + __u8 __pad[6]; +}; + +struct io_uring_rbuf_cqe { + __u32 off; + __u32 len; + __u16 
region; + __u8 flags; + __u8 __pad[3]; +}; + +struct io_rbuf_rqring_offsets { + __u32 head; + __u32 tail; + __u32 rqes; + __u8 __pad[4]; +}; + +struct io_rbuf_cqring_offsets { + __u32 head; + __u32 tail; + __u32 cqes; + __u8 __pad[4]; +}; + +/* + * Argument for IORING_REGISTER_ZC_RX_IFQ + */ +struct io_uring_zc_rx_ifq_reg { + __u32 if_idx; + /* hw rx descriptor ring id */ + __u32 if_rxq_id; + __u32 region_id; + __u32 rq_entries; + __u32 cq_entries; + __u32 flags; + __u16 cpu; + + __u32 mmap_sz; + struct io_rbuf_rqring_offsets rq_off; + struct io_rbuf_cqring_offsets cq_off; +}; + #ifdef __cplusplus } #endif diff --git a/io_uring/Makefile b/io_uring/Makefile index 8cc8e5387a75..7818b015a1f2 100644 --- a/io_uring/Makefile +++ b/io_uring/Makefile @@ -7,5 +7,6 @@ obj-$(CONFIG_IO_URING) += io_uring.o xattr.o nop.o fs.o splice.o \ openclose.o uring_cmd.o epoll.o \ statx.o net.o msg_ring.o timeout.o \ sqpoll.o fdinfo.o tctx.o poll.o \ - cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o + cancel.o kbuf.o rsrc.o rw.o opdef.o \ + notif.o zc_rx.o obj-$(CONFIG_IO_WQ) += io-wq.o diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 3bca7a79efda..7705d18dceff 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -92,6 +92,7 @@ #include "cancel.h" #include "net.h" #include "notif.h" +#include "zc_rx.h" #include "timeout.h" #include "poll.h" @@ -4418,6 +4419,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_register_file_alloc_range(ctx, arg); break; + case IORING_REGISTER_ZC_RX_IFQ: + ret = -EINVAL; + if (!arg || nr_args != 1) + break; + ret = io_register_zc_rx_ifq(ctx, arg); + break; default: ret = -EINVAL; break; diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c index 2f0181521c98..d7499e7b34bd 100644 --- a/io_uring/kbuf.c +++ b/io_uring/kbuf.c @@ -634,3 +634,33 @@ void *io_pbuf_get_address(struct io_ring_ctx *ctx, unsigned long bgid) return bl->buf_ring; } + +int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq, + struct io_uring_zc_rx_ifq_reg *reg) +{ + gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO | __GFP_NOWARN | __GFP_COMP; + size_t off, size, rq_size, cq_size; + void *ptr; + + off = sizeof(struct io_rbuf_ring); + rq_size = reg->rq_entries * sizeof(struct io_uring_rbuf_rqe); + cq_size = reg->cq_entries * sizeof(struct io_uring_rbuf_cqe); + size = off + rq_size + cq_size; + ptr = (void *) __get_free_pages(gfp, get_order(size)); + if (!ptr) + return -ENOMEM; + ifq->ring = (struct io_rbuf_ring *)ptr; + ifq->rqes = (struct io_uring_rbuf_rqe *)((char *)ptr + off); + ifq->cqes = (struct io_uring_rbuf_cqe *)((char *)ifq->rqes + rq_size); + + return 0; +} + +void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq) +{ + struct page *page; + + page = virt_to_head_page(ifq->ring); + if (put_page_testzero(page)) + free_compound_page(page); +} diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h index d14345ef61fc..6c8afda93646 100644 --- a/io_uring/kbuf.h +++ b/io_uring/kbuf.h @@ -4,6 +4,8 @@ #include +#include "zc_rx.h" + struct io_buffer_list { /* * If ->buf_nr_pages is set, then buf_pages/buf_ring are used. 
If not, @@ -57,6 +59,9 @@ void io_kbuf_recycle_legacy(struct io_kiocb *req, unsigned issue_flags); void *io_pbuf_get_address(struct io_ring_ctx *ctx, unsigned long bgid); +int io_allocate_rbuf_ring(struct io_zc_rx_ifq *ifq, struct io_uring_zc_rx_ifq_reg *reg); +void io_free_rbuf_ring(struct io_zc_rx_ifq *ifq); + static inline void io_kbuf_recycle_ring(struct io_kiocb *req) { /* diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c new file mode 100644 index 000000000000..63bc6cd7d205 --- /dev/null +++ b/io_uring/zc_rx.c @@ -0,0 +1,65 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include + +#include + +#include "io_uring.h" +#include "kbuf.h" +#include "zc_rx.h" + +static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) +{ + struct io_zc_rx_ifq *ifq; + + ifq = kzalloc(sizeof(*ifq), GFP_KERNEL); + if (!ifq) + return NULL; + + ifq->ctx = ctx; + + return ifq; +} + +static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) +{ + io_free_rbuf_ring(ifq); + kfree(ifq); +} + +int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_ifq_reg __user *arg) +{ + struct io_uring_zc_rx_ifq_reg reg; + struct io_zc_rx_ifq *ifq; + int ret; + + if (copy_from_user(®, arg, sizeof(reg))) + return -EFAULT; + if (ctx->ifq) + return -EBUSY; + + ifq = io_zc_rx_ifq_alloc(ctx); + if (!ifq) + return -ENOMEM; + + /* TODO: initialise network interface */ + + ret = io_allocate_rbuf_ring(ifq, ®); + if (ret) + goto err; + + /* TODO: map zc region and initialise zc pool */ + + ifq->rq_entries = reg.rq_entries; + ifq->cq_entries = reg.cq_entries; + ifq->if_rxq_id = reg.if_rxq_id; + ctx->ifq = ifq; + + return 0; +err: + io_zc_rx_ifq_free(ifq); + return ret; +} diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h new file mode 100644 index 000000000000..4363734f3d98 --- /dev/null +++ b/io_uring/zc_rx.h @@ -0,0 +1,21 @@ +// SPDX-License-Identifier: GPL-2.0 +#ifndef IOU_ZC_RX_H +#define IOU_ZC_RX_H + +struct io_zc_rx_ifq { + struct io_ring_ctx *ctx; + struct net_device *dev; + struct io_rbuf_ring *ring; + struct io_uring_rbuf_rqe *rqes; + struct io_uring_rbuf_cqe *cqes; + u32 rq_entries, cq_entries; + void *pool; + + /* hw rx descriptor ring id */ + u32 if_rxq_id; +}; + +int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, + struct io_uring_zc_rx_ifq_reg __user *arg); + +#endif From patchwork Sat Aug 26 01:19:45 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13366460 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 00CA2C83F15 for ; Sat, 26 Aug 2023 01:22:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231621AbjHZBWD (ORCPT ); Fri, 25 Aug 2023 21:22:03 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34856 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231618AbjHZBVd (ORCPT ); Fri, 25 Aug 2023 21:21:33 -0400 Received: from mail-pl1-x62d.google.com (mail-pl1-x62d.google.com [IPv6:2607:f8b0:4864:20::62d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9B384212B for ; Fri, 25 Aug 2023 18:21:30 -0700 (PDT) Received: by mail-pl1-x62d.google.com with SMTP id d9443c01a7336-1c09673b006so10618825ad.1 for ; Fri, 25 Aug 2023 18:21:30 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; 
d=davidwei-uk.20221208.gappssmtp.com; s=20221208; t=1693012890; x=1693617690; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=fQmFAOYPyQYmMwkH3n9RRbmoobUo7gWIqTt8UijUgLA=; b=J11ogROBx0CkEAmY2aTSLQLiSs+xEIt8EgRKXW1j5os3+EAb2o5O7wWEDow0f0UgNx aj+HljEkG8JdblcQd7mJNLVm74azGqKhFN+utnP9roj3pkFbAk9u8+0PJbsXS10YfKXK i9XEesjAelqR+IcKSeJAyDaH1IR0X1zryh9TFl2WtLTMXQNBcVSkM4qbJ7MEYQiKUraV DSeCCfbWRhJdu4bjMWv5vU6/NebRtxBYq3zSqSTIxxnX49i+MEhWn5M8kXXk9DybdrmI j0IT3mV7Ipk1YpJ3cduIhEZHxVPpCWZIn0a/zsSFnN6Yqar/VLRU+EJq1MhkxeKGxQmU Q4JA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1693012890; x=1693617690; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=fQmFAOYPyQYmMwkH3n9RRbmoobUo7gWIqTt8UijUgLA=; b=AGoRmM3AgwQ22zlaYbHxI5WTWvvJ0SXxOXn493YwSsTeaOnlW+KYx2Frkpv/2luXSp MWGcfHtl4xd8ORvtuwiWOrBD+ww/2F8GmCfDoh3rwyeUgfZoQ4+lNqd3QBshi0if8fO8 rSXRb5gi/GB2Oo2E3nkL79s51HQi2e0zY6VryQy0a3PA5eQ0ssa5ffzBl6Xq/1hT6pul bRt7goJY39v3fhmE8uqXCy2PHq7WIVcdqndfv6gA85d4PSSOyjn62Omfv4W3hpAvYFjG 95DLe/hifuYameng5Agv2PmdSosCKQMkfFh6Kt6ES9noc1aq+EVMkg6hui74e/RB2aOH 5umQ== X-Gm-Message-State: AOJu0Yzm6MuG4CyMnwU1B+4peCYiIouQASQRbw8ntmPYKGGgp/tt1DgY uDerkqSu4OcltRP0ddPjLI9KH58eXb8y5DiYTWlfdg== X-Google-Smtp-Source: AGHT+IGIo+3adK4mhuENDaoYdhwpNFabeFpYFMgK5SCDkwzDovceVadwG1Af3T5u1BmEPm/ggncWiA== X-Received: by 2002:a17:902:6acc:b0:1c0:98fe:3677 with SMTP id i12-20020a1709026acc00b001c098fe3677mr10580663plt.56.1693012890112; Fri, 25 Aug 2023 18:21:30 -0700 (PDT) Received: from localhost (fwdproxy-prn-016.fbsv.net. [2a03:2880:ff:10::face:b00c]) by smtp.gmail.com with ESMTPSA id b18-20020a170902d31200b001b53c8659fesm2417216plc.30.2023.08.25.18.21.29 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 25 Aug 2023 18:21:29 -0700 (PDT) From: David Wei To: Jens Axboe , Pavel Begunkov Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, Mina Almasry , Jakub Kicinski Subject: [PATCH 02/11] io_uring: add mmap support for shared ifq ringbuffers Date: Fri, 25 Aug 2023 18:19:45 -0700 Message-Id: <20230826011954.1801099-3-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20230826011954.1801099-1-dw@davidwei.uk> References: <20230826011954.1801099-1-dw@davidwei.uk> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org From: David Wei This patch adds mmap support for ifq rbuf rings. There are two rings and a struct io_rbuf_ring that contains the head and tail ptrs into each ring. Just like the io_uring SQ/CQ rings, userspace issues a single mmap call using the io_uring fd w/ magic offset IORING_OFF_RBUF_RING. An opaque ptr is returned to userspace, which is then expected to use the offsets returned in the registration struct to get access to the head/tail and rings. 
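
For reference, a minimal userspace sketch of how this registration-plus-mmap flow is meant to be driven. This is not part of the series: it assumes the patched uapi header, uses the raw io_uring_register(2) syscall (liburing has no wrapper for this RFC opcode), and all sizes and helper names are illustrative.

/* Hedged sketch: register a ZC RX ifq, then mmap its rbuf rings. */
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <liburing.h>

static int setup_zc_rx(struct io_uring *ring, __u32 if_idx, __u32 rxq_id,
                       __u32 region_id)
{
        struct io_uring_zc_rx_ifq_reg reg = {
                .if_idx     = if_idx,    /* netdev ifindex */
                .if_rxq_id  = rxq_id,    /* hw RX queue to use for ZC */
                .region_id  = region_id, /* fixed-buffer index, see patch 5 */
                .rq_entries = 256,       /* power of two, as the ring masks imply */
                .cq_entries = 256,
        };
        void *ptr;

        /* nr_args must be 1; mmap_sz/rq_off/cq_off are written back into reg */
        if (syscall(__NR_io_uring_register, ring->ring_fd,
                    IORING_REGISTER_ZC_RX_IFQ, &reg, 1) < 0)
                return -1;

        /* a single mmap call at the magic offset, as with the SQ/CQ rings */
        ptr = mmap(NULL, reg.mmap_sz, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_POPULATE, ring->ring_fd,
                   IORING_OFF_RBUF_RING);
        if (ptr == MAP_FAILED)
                return -1;

        /* head/tail and the rqe/cqe arrays are reached via the offsets */
        unsigned *rq_tail = (unsigned *)((char *)ptr + reg.rq_off.tail);
        unsigned *cq_head = (unsigned *)((char *)ptr + reg.cq_off.head);
        struct io_uring_rbuf_rqe *rqes =
                (void *)((char *)ptr + reg.rq_off.rqes);
        struct io_uring_rbuf_cqe *cqes =
                (void *)((char *)ptr + reg.cq_off.cqes);
        (void)rq_tail; (void)cq_head; (void)rqes; (void)cqes;
        return 0;
}
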
Signed-off-by: David Wei --- include/uapi/linux/io_uring.h | 2 ++ io_uring/io_uring.c | 5 +++++ io_uring/zc_rx.c | 17 +++++++++++++++++ 3 files changed, 24 insertions(+) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 8f2a1061629b..28154abfe6f4 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -393,6 +393,8 @@ enum { #define IORING_OFF_PBUF_RING 0x80000000ULL #define IORING_OFF_PBUF_SHIFT 16 #define IORING_OFF_MMAP_MASK 0xf8000000ULL +#define IORING_OFF_RBUF_RING 0x20000000ULL +#define IORING_OFF_RBUF_SHIFT 16 /* * Filled with the offset for mmap(2) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 7705d18dceff..0b6c5508b1ca 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -3368,6 +3368,11 @@ static void *io_uring_validate_mmap_request(struct file *file, return ERR_PTR(-EINVAL); break; } + case IORING_OFF_RBUF_RING: + if (!ctx->ifq) + return ERR_PTR(-EINVAL); + ptr = ctx->ifq->ring; + break; default: return ERR_PTR(-EINVAL); } diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 63bc6cd7d205..6c57c9b06e05 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -34,6 +34,7 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, { struct io_uring_zc_rx_ifq_reg reg; struct io_zc_rx_ifq *ifq; + size_t ring_sz, rqes_sz, cqes_sz; int ret; if (copy_from_user(®, arg, sizeof(reg))) @@ -58,6 +59,22 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, ifq->if_rxq_id = reg.if_rxq_id; ctx->ifq = ifq; + ring_sz = sizeof(struct io_rbuf_ring); + rqes_sz = sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries; + cqes_sz = sizeof(struct io_uring_rbuf_cqe) * ifq->cq_entries; + reg.mmap_sz = ring_sz + rqes_sz + cqes_sz; + reg.rq_off.rqes = ring_sz; + reg.cq_off.cqes = ring_sz + rqes_sz; + reg.rq_off.head = offsetof(struct io_rbuf_ring, rq.head); + reg.rq_off.tail = offsetof(struct io_rbuf_ring, rq.tail); + reg.cq_off.head = offsetof(struct io_rbuf_ring, cq.head); + reg.cq_off.tail = offsetof(struct io_rbuf_ring, cq.tail); + + if (copy_to_user(arg, ®, sizeof(reg))) { + ret = -EFAULT; + goto err; + } + return 0; err: io_zc_rx_ifq_free(ifq); From patchwork Sat Aug 26 01:19:46 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13366459 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E3E21C83F12 for ; Sat, 26 Aug 2023 01:22:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231618AbjHZBWE (ORCPT ); Fri, 25 Aug 2023 21:22:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34868 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231620AbjHZBVd (ORCPT ); Fri, 25 Aug 2023 21:21:33 -0400 Received: from mail-pg1-x52f.google.com (mail-pg1-x52f.google.com [IPv6:2607:f8b0:4864:20::52f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 902032680 for ; Fri, 25 Aug 2023 18:21:31 -0700 (PDT) Received: by mail-pg1-x52f.google.com with SMTP id 41be03b00d2f7-56c2e882416so844381a12.3 for ; Fri, 25 Aug 2023 18:21:31 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20221208.gappssmtp.com; s=20221208; t=1693012891; x=1693617691; h=content-transfer-encoding:mime-version:references:in-reply-to 
:message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=VNgO5ZIv14KVZmuHWpj5cSQRK/umULxc7iSIeMaS/eA=; b=xNwKRdqsMh+qCjPp4OBeyPIsa34xOMM0cUzK6Yg3OyHro97dhOGmYbOl6u1MnHSawg emgcdbv9xTbN4SU4UljeHyRZhsc4ACF485ui6vFOpna6Bjpl3VhZjDCVImSqZphpuCp9 kOc58NE3nU45sTbG2clqpPwtCTMrCclDzRqOaZJmYUpbnoL2nChtDxnftBIngLyhFPn/ WXmtLo32T4q5AZeNwYPrcoHGqWTmXIe7BoMQcy2FJMznQgooT4o4Nj825wGKtcacxT/K 9AQ0tqCwIghD6xF/bQ0ugmO8RelpH7B9h62pdHYzx9Mr/u7pYZ1WwML10UKRltJpWV9o 3Evw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1693012891; x=1693617691; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=VNgO5ZIv14KVZmuHWpj5cSQRK/umULxc7iSIeMaS/eA=; b=bpgieCsGNw8niQufKF/QXr9RUnapjnW67DBriFcqff1UvwuxCnRGyBSJPzKfltUW94 AvBYEY2KoNB3t43KsHwQhVRYkiIZr9oua2/Mkqa3VVepNMZ7nIixzK3A/43iN/IS5uk6 8dXut5WUrFVwaVW+uu5ZHMP3EvRci3xiJzTKIunXpbHb2ga5EbwJN01dERLr1fDQ6Ump QO8qqsTfK0oQT7J1PmSr3u75CX/haOMDxqDeDZcEGP/XsTFr6SfXV1OeGC8kKX9EecP7 BfhYzt/zA98Qzo/75sDuOXVJKMHLTmxFTE1EDEjRSWqBUKvIhbirvx9eSEOvSuWkObau KB2g== X-Gm-Message-State: AOJu0YwG1TjJGnBxCstZbRjkW3f76hUz5lkrpa8SubPwXkBETxVD7K6e XVo6wPSsciFNpsqu2rJEa14/BQ== X-Google-Smtp-Source: AGHT+IFzWc4+ijhlirnZLcO2bqVcRhIBIgdFNaal1xrptHcSU36WHJC7JgXlaeCAvzW0niN/CeFJvQ== X-Received: by 2002:a05:6a21:3390:b0:14b:f78e:d05c with SMTP id yy16-20020a056a21339000b0014bf78ed05cmr6858254pzb.15.1693012891044; Fri, 25 Aug 2023 18:21:31 -0700 (PDT) Received: from localhost (fwdproxy-prn-120.fbsv.net. [2a03:2880:ff:78::face:b00c]) by smtp.gmail.com with ESMTPSA id u10-20020a17090341ca00b001b9de4fb749sm2420621ple.20.2023.08.25.18.21.30 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 25 Aug 2023 18:21:30 -0700 (PDT) From: David Wei To: Jens Axboe , Pavel Begunkov Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, Mina Almasry , Jakub Kicinski Subject: [PATCH 03/11] netdev: add XDP_SETUP_ZC_RX command Date: Fri, 25 Aug 2023 18:19:46 -0700 Message-Id: <20230826011954.1801099-4-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20230826011954.1801099-1-dw@davidwei.uk> References: <20230826011954.1801099-1-dw@davidwei.uk> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org From: David Wei This patch adds a new XDP_SETUP_ZC_RX command that will be used in a later patch to enable or disable ZC RX for a specific RX queue. 
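
On the driver side the new command is expected to arrive through the existing .ndo_bpf callback, next to XDP_SETUP_PROG and XDP_SETUP_XSK_POOL. A hedged sketch of a handler follows; only the netdev_bpf fields come from this patch, the mydrv_* helpers are hypothetical driver internals, and a non-NULL ifq means enable while NULL means disable (see the next patch).

/* Illustrative .ndo_bpf dispatch for the new command. */
static int mydrv_bpf(struct net_device *dev, struct netdev_bpf *bpf)
{
        struct mydrv_priv *priv = netdev_priv(dev);

        switch (bpf->command) {
        case XDP_SETUP_PROG:
                return mydrv_xdp_setup(priv, bpf->prog);
        case XDP_SETUP_ZC_RX:
                if (bpf->zc_rx.queue_id >= dev->real_num_rx_queues)
                        return -EINVAL;
                if (bpf->zc_rx.ifq)
                        return mydrv_zc_rx_enable(priv, bpf->zc_rx.queue_id,
                                                  bpf->zc_rx.ifq);
                return mydrv_zc_rx_disable(priv, bpf->zc_rx.queue_id);
        default:
                return -EINVAL;
        }
}
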
Signed-off-by: David Wei Co-developed-by: Jonathan Lemon --- include/linux/netdevice.h | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index 08fbd4622ccf..a20a5c847916 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1000,6 +1000,7 @@ enum bpf_netdev_command { BPF_OFFLOAD_MAP_ALLOC, BPF_OFFLOAD_MAP_FREE, XDP_SETUP_XSK_POOL, + XDP_SETUP_ZC_RX, }; struct bpf_prog_offload_ops; @@ -1038,6 +1039,11 @@ struct netdev_bpf { struct xsk_buff_pool *pool; u16 queue_id; } xsk; + /* XDP_SETUP_ZC_RX */ + struct { + struct io_zc_rx_ifq *ifq; + u16 queue_id; + } zc_rx; }; }; From patchwork Sat Aug 26 01:19:47 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13366458 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 436C2C83F0C for ; Sat, 26 Aug 2023 01:22:31 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231617AbjHZBWB (ORCPT ); Fri, 25 Aug 2023 21:22:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34882 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231621AbjHZBVe (ORCPT ); Fri, 25 Aug 2023 21:21:34 -0400 Received: from mail-pl1-x631.google.com (mail-pl1-x631.google.com [IPv6:2607:f8b0:4864:20::631]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9078C1FF7 for ; Fri, 25 Aug 2023 18:21:32 -0700 (PDT) Received: by mail-pl1-x631.google.com with SMTP id d9443c01a7336-1bdbbede5d4so12562505ad.2 for ; Fri, 25 Aug 2023 18:21:32 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20221208.gappssmtp.com; s=20221208; t=1693012892; x=1693617692; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=FqXYs9JKDR7EAnDxjSwatlgPrNqTmHker8ZbKfpU3gE=; b=UZk+0Z3kfokcBJDT7Us5aYoJyw3bmhBrpP3D1m68k8z0cEpWjObh+WFmr51vihzjPP bqufzVhsbrXenQCvIWgY2qDhnhue+gMNiTI/o1sVl9bke6zeVKmAneedI70HmO4FxLz1 iZsjCXllqi20cqKoKJ/KtvEpJdGk3zCBJSxEkoUzpzX9b+FSKwkfsKu7Mvm+H4n7jp+Q MnG173XXOoICRnBiyPlx/xDVoiIPt/cCCOmGU7xdS5PE9XDcAFHLiBA8IIugyaEJHxer KWfhU+Q8WFz8VAcg1yLWGcLEW4wp3+h//9AGP9dqhuuUDpTe+XO1C3nSoS+GNKtqTNy8 PnMA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1693012892; x=1693617692; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=FqXYs9JKDR7EAnDxjSwatlgPrNqTmHker8ZbKfpU3gE=; b=OAnhX/28DKgCQmVHbtNxayN6Agb5Y3RbvG0fgZpozdWv6Jzh/+Ht5f4H82L2m0ZLRG JiY+bzt2sD5MZxsttXybVFupIuKEVcINr3noRWEaiBZ2OqKJEQ1mlVbZPRyw1AQmv9ys 0H8kgqpAxYli94eZoCnXhTpKA5Pg3qIgTCz+cX6ZO2AajrdC9/uxDeBpqsZBRwIk1HXT wzAxtcGnGF2pgUeg5gD6SGz+r4K05N/4N5FsnMRmk1htxifT24vmrrdLt4W9l8A1fjjS /WrmOkeGSeZf39t8OGpu/K7G2Rl2kiCTkALfj9bjfQHQXiu8+ls4Kt3zEkYz9Rm9q/Lz vzUw== X-Gm-Message-State: AOJu0YyK6I8swKI0mrv/AHyup38iMEEKG3sOsZ+vOo70NVXvlHZ6ogYd QsjzsL6PkVu3decNChXZF2xHlw== X-Google-Smtp-Source: AGHT+IGCyPX2jM9xBria+78I08Xewm97ynjbJS1JDqK3/K/tRjHnCUZiK/uOI937WtjPOIVHyHjc5w== X-Received: by 2002:a17:903:248:b0:1bc:8748:8bc0 with SMTP id j8-20020a170903024800b001bc87488bc0mr22750701plh.33.1693012892026; Fri, 25 Aug 2023 18:21:32 -0700 (PDT) 
Received: from localhost (fwdproxy-prn-116.fbsv.net. [2a03:2880:ff:74::face:b00c]) by smtp.gmail.com with ESMTPSA id 6-20020a170902c10600b001bbb598b8bbsm2425852pli.41.2023.08.25.18.21.31 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 25 Aug 2023 18:21:31 -0700 (PDT) From: David Wei To: Jens Axboe , Pavel Begunkov Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, Mina Almasry , Jakub Kicinski Subject: [PATCH 04/11] io_uring: setup ZC for an RX queue when registering an ifq Date: Fri, 25 Aug 2023 18:19:47 -0700 Message-Id: <20230826011954.1801099-5-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20230826011954.1801099-1-dw@davidwei.uk> References: <20230826011954.1801099-1-dw@davidwei.uk> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org From: David Wei This patch sets up ZC for an RX queue in a net device when an ifq is registered with io_uring. The RX queue is specified in the registration struct. The XDP command added in the previous patch is used to enable or disable ZC RX. For now since there is only one ifq, its destruction is implicit during io_uring cleanup. Signed-off-by: David Wei Co-developed-by: Jonathan Lemon --- io_uring/io_uring.c | 1 + io_uring/zc_rx.c | 62 +++++++++++++++++++++++++++++++++++++++++++-- io_uring/zc_rx.h | 1 + 3 files changed, 62 insertions(+), 2 deletions(-) diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index 0b6c5508b1ca..f2ec0a454307 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -3098,6 +3098,7 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) percpu_ref_kill(&ctx->refs); xa_for_each(&ctx->personalities, index, creds) io_unregister_personality(ctx, index); + io_unregister_zc_rx_ifq(ctx); if (ctx->rings) io_poll_remove_all(ctx, NULL, true); mutex_unlock(&ctx->uring_lock); diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 6c57c9b06e05..8cc66731af5b 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -3,6 +3,7 @@ #include #include #include +#include #include @@ -10,6 +11,35 @@ #include "kbuf.h" #include "zc_rx.h" +typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); + +static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, + u16 queue_id) +{ + struct netdev_bpf cmd; + bpf_op_t ndo_bpf; + + ndo_bpf = dev->netdev_ops->ndo_bpf; + if (!ndo_bpf) + return -EINVAL; + + cmd.command = XDP_SETUP_ZC_RX; + cmd.zc_rx.ifq = ifq; + cmd.zc_rx.queue_id = queue_id; + + return ndo_bpf(dev, &cmd); +} + +static int io_open_zc_rxq(struct io_zc_rx_ifq *ifq) +{ + return __io_queue_mgmt(ifq->dev, ifq, ifq->if_rxq_id); +} + +static int io_close_zc_rxq(struct io_zc_rx_ifq *ifq) +{ + return __io_queue_mgmt(ifq->dev, NULL, ifq->if_rxq_id); +} + static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) { struct io_zc_rx_ifq *ifq; @@ -19,12 +49,17 @@ static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) return NULL; ifq->ctx = ctx; + ifq->if_rxq_id = -1; return ifq; } static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) { + if (ifq->if_rxq_id != -1) + io_close_zc_rxq(ifq); + if (ifq->dev) + dev_put(ifq->dev); io_free_rbuf_ring(ifq); kfree(ifq); } @@ -41,17 +76,22 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, return -EFAULT; if (ctx->ifq) return -EBUSY; + if (reg.if_rxq_id == -1) + return -EINVAL; ifq = io_zc_rx_ifq_alloc(ctx); if (!ifq) return -ENOMEM; - /* TODO: initialise network interface */ - ret = io_allocate_rbuf_ring(ifq, ®); if (ret) goto err; + ret = -ENODEV; + ifq->dev = 
dev_get_by_index(&init_net, reg.if_idx); + if (!ifq->dev) + goto err; + /* TODO: map zc region and initialise zc pool */ ifq->rq_entries = reg.rq_entries; @@ -59,6 +99,10 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, ifq->if_rxq_id = reg.if_rxq_id; ctx->ifq = ifq; + ret = io_open_zc_rxq(ifq); + if (ret) + goto err; + ring_sz = sizeof(struct io_rbuf_ring); rqes_sz = sizeof(struct io_uring_rbuf_rqe) * ifq->rq_entries; cqes_sz = sizeof(struct io_uring_rbuf_cqe) * ifq->cq_entries; @@ -80,3 +124,17 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, io_zc_rx_ifq_free(ifq); return ret; } + +int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx) +{ + struct io_zc_rx_ifq *ifq; + + ifq = ctx->ifq; + if (!ifq) + return -EINVAL; + + ctx->ifq = NULL; + io_zc_rx_ifq_free(ifq); + + return 0; +} diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index 4363734f3d98..340ececa9f9c 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -17,5 +17,6 @@ struct io_zc_rx_ifq { int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, struct io_uring_zc_rx_ifq_reg __user *arg); +int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx); #endif From patchwork Sat Aug 26 01:19:48 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13366463 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 3D5A1C83F17 for ; Sat, 26 Aug 2023 01:22:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231623AbjHZBWH (ORCPT ); Fri, 25 Aug 2023 21:22:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34890 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231625AbjHZBVk (ORCPT ); Fri, 25 Aug 2023 21:21:40 -0400 Received: from mail-pl1-x636.google.com (mail-pl1-x636.google.com [IPv6:2607:f8b0:4864:20::636]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 6B62E2683 for ; Fri, 25 Aug 2023 18:21:33 -0700 (PDT) Received: by mail-pl1-x636.google.com with SMTP id d9443c01a7336-1bdca7cc28dso12650035ad.1 for ; Fri, 25 Aug 2023 18:21:33 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20221208.gappssmtp.com; s=20221208; t=1693012893; x=1693617693; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=2ynVwsxmMrwAG9Nkn/12Q+RcMuGojnsZ7syWTa5mUqI=; b=sScyon73ZfPCd5COdaruZOMvd2yb5cFxfqTJEizlywXPasyRc/ri4S1DQPkJuu0D8T GD962rQa5CRpcWX9jw9WHhaG6cqUyYUB03zL7b0+tFne0fRR5x9EdIwB2VSN4YK19H2J fz6oH+aikAZgxH5f98/zi+7OHCdMMxUrEttcLgm9qm/EHwTh+azC0TFpvKH7/eXcIS8Q FYVrhRk6mw0RVycbkbeDtV+hmxMy//DPPfBfWKldWaD/Fd/5OwQQXjlx3GQBjAsoUXbV E8s1qAK8WJsR1dKjVMh/cYzPqtjFgcNDLfu11t+KllXtjmXce/0xiHehQZiGVAJfpa9F 1sfQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1693012893; x=1693617693; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=2ynVwsxmMrwAG9Nkn/12Q+RcMuGojnsZ7syWTa5mUqI=; b=Xvf46iJYerYo3q8dW4c/Rr2s7yYDl0iGeb65YTrBw8/PnxFXF9QDkf4cZcZJbwKRYM 67zpy36x1A0/IDfl7DxxmJRjaL8uyoTWJ3YilDLwUP3r8+vCR5j0ilBQIwimIHc4GY1T vxq92+M7b5E8QJAUwd/MrZB8Vk0GtVVtKH4GblFXlObG2TonJuDNWm2fNMQFIcRqEXC+ 
ggoJyv23pixd2ODFx9KdeEVKiWBEPMmJ2GUP+UoWzRcs8WmDq8q5hWjphvqXg5o0Gp9t M1fGRIc6AEjs1r30//CoIab+xKnEYdubNEMuNd0eQHrvH67J9NkI0O6D71EIhvJnKqKv PX4Q== X-Gm-Message-State: AOJu0YycohUjXKIj1SX7rArLEkbT24FW2blkYUoO8psX/s/uRCpIGt7J k4JTgQqFzJzMRX6g072PLp2IjRzbuNVBtVfLC5gPEg== X-Google-Smtp-Source: AGHT+IHrObKFTZupRASNWAVoKD8zWxjg7OzV1ODlDgotGdGYZbjP47UkUmFk/MbtBg+KLZptfmBD5g== X-Received: by 2002:a17:902:e5cd:b0:1b0:f8:9b2d with SMTP id u13-20020a170902e5cd00b001b000f89b2dmr23494952plf.29.1693012892911; Fri, 25 Aug 2023 18:21:32 -0700 (PDT) Received: from localhost (fwdproxy-prn-004.fbsv.net. [2a03:2880:ff:4::face:b00c]) by smtp.gmail.com with ESMTPSA id e4-20020a170902744400b001b03b7f8adfsm2410764plt.246.2023.08.25.18.21.32 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 25 Aug 2023 18:21:32 -0700 (PDT) From: David Wei To: Jens Axboe , Pavel Begunkov Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, Mina Almasry , Jakub Kicinski Subject: [PATCH 05/11] io_uring: add ZC buf and pool Date: Fri, 25 Aug 2023 18:19:48 -0700 Message-Id: <20230826011954.1801099-6-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20230826011954.1801099-1-dw@davidwei.uk> References: <20230826011954.1801099-1-dw@davidwei.uk> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org From: David Wei This patch adds two objects: * Zero copy buffer representation, holding a page, its mapped dma_addr, and a refcount for lifetime management. * Zero copy pool, spiritually similar to page pool, that holds ZC bufs and hands them out to net devices. The ZC pool is tiered with currently two tiers: a fast lockless cache that should only be accessed from the NAPI context of a single RX queue, and a freelist. When a ZC pool region is first mapped, it is added to the freelist. During normal operation, bufs are moved from the freelist into the cache in POOL_CACHE_SIZE blocks before being given out. Pool regions are registered w/ io_uring using the registered buffer API, with a 1:1 mapping between region and nr_iovec in io_uring_register_buffers. This does the heavy lifting of pinning and chunking into bvecs into a struct io_mapped_ubuf for us. For now as there is only one pool region per ifq, there is no separate API for adding/removing regions yet and it is mapped implicitly during ifq registration. 
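
In other words, userspace first registers the pool memory through the normal fixed-buffer interface and then points the ifq registration at it via region_id; the region must be page-aligned at both start and end or pool creation fails. A rough sketch, assuming liburing and an arbitrary 64 MiB region (names and sizes illustrative, not part of the series):

/* Hedged sketch: set up a page-aligned region and register it as
 * fixed buffer 0, which then becomes region_id 0 for the ifq registration.
 */
#include <sys/mman.h>
#include <liburing.h>

#define POOL_SIZE (64UL * 1024 * 1024)

static int register_pool_region(struct io_uring *ring, void **regionp)
{
        struct iovec iov;
        void *region;

        /* anonymous mappings are page-aligned, satisfying the checks
         * on imu->ubuf and imu->ubuf_end */
        region = mmap(NULL, POOL_SIZE, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (region == MAP_FAILED)
                return -1;

        iov.iov_base = region;
        iov.iov_len  = POOL_SIZE;
        *regionp = region;

        /* one iovec == one pool region; its index is the region_id */
        return io_uring_register_buffers(ring, &iov, 1);
}
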
Signed-off-by: David Wei Co-developed-by: Jonathan Lemon --- include/linux/io_uring.h | 6 ++ io_uring/zc_rx.c | 173 ++++++++++++++++++++++++++++++++++++++- io_uring/zc_rx.h | 1 + 3 files changed, 179 insertions(+), 1 deletion(-) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 7fe31b2cd02f..cf1993befa6a 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -41,6 +41,12 @@ static inline const void *io_uring_sqe_cmd(const struct io_uring_sqe *sqe) return sqe->cmd; } +struct io_zc_rx_buf { + dma_addr_t dma; + struct page *page; + atomic_t refcount; +}; + #if defined(CONFIG_IO_URING) int io_uring_cmd_import_fixed(u64 ubuf, unsigned long len, int rw, struct iov_iter *iter, void *ioucmd); diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 8cc66731af5b..317127d0d4e7 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -4,13 +4,43 @@ #include #include #include +#include #include #include "io_uring.h" #include "kbuf.h" +#include "rsrc.h" #include "zc_rx.h" +#define POOL_CACHE_SIZE 128 + +struct io_zc_rx_pool { + struct io_zc_rx_ifq *ifq; + struct io_zc_rx_buf *bufs; + u16 pool_id; + u32 nr_pages; + + /* fast cache */ + u32 cache_count; + u32 cache[POOL_CACHE_SIZE]; + + /* freelist */ + spinlock_t freelist_lock; + u32 free_count; + u32 freelist[]; +}; + +static struct device *netdev2dev(struct net_device *dev) +{ + return dev->dev.parent; +} + +static u64 mk_page_info(u16 pool_id, u32 pgid) +{ + return (u64)0xface << 48 | (u64)pool_id << 32 | (u64)pgid; +} + typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, @@ -40,6 +70,143 @@ static int io_close_zc_rxq(struct io_zc_rx_ifq *ifq) return __io_queue_mgmt(ifq->dev, NULL, ifq->if_rxq_id); } +static int io_zc_rx_map_buf(struct device *dev, struct page *page, u16 pool_id, + u32 pgid, struct io_zc_rx_buf *buf) +{ + dma_addr_t addr; + + SetPagePrivate(page); + set_page_private(page, mk_page_info(pool_id, pgid)); + + addr = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, + DMA_BIDIRECTIONAL, + DMA_ATTR_SKIP_CPU_SYNC); + if (dma_mapping_error(dev, addr)) { + set_page_private(page, 0); + ClearPagePrivate(page); + return -ENOMEM; + } + + buf->dma = addr; + buf->page = page; + atomic_set(&buf->refcount, 0); + get_page(page); + + return 0; +} + +static void io_zc_rx_unmap_buf(struct device *dev, struct io_zc_rx_buf *buf) +{ + struct page *page; + + page = buf->page; + set_page_private(page, 0); + ClearPagePrivate(page); + dma_unmap_page_attrs(dev, buf->dma, PAGE_SIZE, + DMA_BIDIRECTIONAL, + DMA_ATTR_SKIP_CPU_SYNC); + put_page(page); +} + +static int io_zc_rx_map_pool(struct io_zc_rx_pool *pool, + struct io_mapped_ubuf *imu, + struct device *dev) +{ + struct io_zc_rx_buf *buf; + struct page *page; + int i, ret; + + for (i = 0; i < imu->nr_bvecs; i++) { + page = imu->bvec[i].bv_page; + if (PagePrivate(page)) { + ret = -EEXIST; + goto err; + } + + buf = &pool->bufs[i]; + ret = io_zc_rx_map_buf(dev, page, pool->pool_id, i, buf); + if (ret) + goto err; + + pool->freelist[i] = i; + } + + return 0; +err: + while (i--) { + buf = &pool->bufs[i]; + io_zc_rx_unmap_buf(dev, buf); + } + + return ret; +} + +int io_zc_rx_create_pool(struct io_ring_ctx *ctx, + struct io_zc_rx_ifq *ifq, + u16 id) +{ + struct device *dev = netdev2dev(ifq->dev); + struct io_mapped_ubuf *imu; + struct io_zc_rx_pool *pool; + int nr_pages; + int ret; + + if (ifq->pool) + return -EFAULT; + + if (unlikely(id >= ctx->nr_user_bufs)) + return -EFAULT; + id = 
array_index_nospec(id, ctx->nr_user_bufs); + imu = ctx->user_bufs[id]; + if (imu->ubuf & ~PAGE_MASK || imu->ubuf_end & ~PAGE_MASK) + return -EFAULT; + + ret = -ENOMEM; + nr_pages = imu->nr_bvecs; + pool = kvmalloc(struct_size(pool, freelist, nr_pages), GFP_KERNEL); + if (!pool) + goto err; + + pool->bufs = kvmalloc_array(nr_pages, sizeof(*pool->bufs), GFP_KERNEL); + if (!pool->bufs) + goto err_buf; + + ret = io_zc_rx_map_pool(pool, imu, dev); + if (ret) + goto err_map; + + pool->ifq = ifq; + pool->pool_id = id; + pool->nr_pages = nr_pages; + pool->cache_count = 0; + spin_lock_init(&pool->freelist_lock); + pool->free_count = nr_pages; + ifq->pool = pool; + + return 0; + +err_map: + kvfree(pool->bufs); +err_buf: + kvfree(pool); +err: + return ret; +} + +static void io_zc_rx_destroy_pool(struct io_zc_rx_pool *pool) +{ + struct device *dev = netdev2dev(pool->ifq->dev); + struct io_zc_rx_buf *buf; + + for (int i = 0; i < pool->nr_pages; i++) { + buf = &pool->bufs[i]; + + io_zc_rx_unmap_buf(dev, buf); + } + kvfree(pool->bufs); + kvfree(pool); +} + static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) { struct io_zc_rx_ifq *ifq; @@ -58,6 +225,8 @@ static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) { if (ifq->if_rxq_id != -1) io_close_zc_rxq(ifq); + if (ifq->pool) + io_zc_rx_destroy_pool(ifq->pool); if (ifq->dev) dev_put(ifq->dev); io_free_rbuf_ring(ifq); @@ -92,7 +261,9 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, if (!ifq->dev) goto err; - /* TODO: map zc region and initialise zc pool */ + ret = io_zc_rx_create_pool(ctx, ifq, reg.region_id); + if (ret) + goto err; ifq->rq_entries = reg.rq_entries; ifq->cq_entries = reg.cq_entries; diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index 340ececa9f9c..3cd0e730115d 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -18,5 +18,6 @@ struct io_zc_rx_ifq { int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, struct io_uring_zc_rx_ifq_reg __user *arg); int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx); +int io_zc_rx_pool_create(struct io_zc_rx_ifq *ifq, u16 id); #endif From patchwork Sat Aug 26 01:19:49 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13366464 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4D0E7C83F16 for ; Sat, 26 Aug 2023 01:22:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231627AbjHZBWG (ORCPT ); Fri, 25 Aug 2023 21:22:06 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34896 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231623AbjHZBVk (ORCPT ); Fri, 25 Aug 2023 21:21:40 -0400 Received: from mail-ot1-x32c.google.com (mail-ot1-x32c.google.com [IPv6:2607:f8b0:4864:20::32c]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id A194A1FF7 for ; Fri, 25 Aug 2023 18:21:34 -0700 (PDT) Received: by mail-ot1-x32c.google.com with SMTP id 46e09a7af769-6b9a2416b1cso1134110a34.2 for ; Fri, 25 Aug 2023 18:21:34 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20221208.gappssmtp.com; s=20221208; t=1693012894; x=1693617694; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; 
bh=CkPCZybv2QToUQGiLBLfHjAvQGjKHNqxFqhs5mZLvj0=; b=g60xH4uuHwp/zHzUcBKlgnbNUm5LMwNlmczo93z0rvRylNFFHxf7Tn5F7D9BBrW/kX JMf1h+oPTMILsSXvXIFOCT5CJwQNUujhqdYS6dGG161XHpwey9W3mxEYFclvfPLvhy+r ZzkXgVDKrKhnELB9KeaG0sXzgH7HM657r6tLoIu65UCPZAuFik9Jr/wXQVUrKgiL+uqG lbS4bNYKsdxen3429LRM9mzT645sfw2mDqOF8MaYYfh8pC8NSu7R6QGlAqtb/RuGyqr9 +YkpoqoB8b2D4IX9I5KDHII6phFJt2VWMp+sUbmLT+CrTUIh/iW7TmFvZPdntq1HtHX/ Kb3w== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1693012894; x=1693617694; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=CkPCZybv2QToUQGiLBLfHjAvQGjKHNqxFqhs5mZLvj0=; b=jseziqyluFthTW+isVaSXnd6ZXzXPjDFv4jHZL+dD7lIBQXBTLFjq9C/ucSejCL6yw oQ/1rFK2Rk1dekX0EsXQWurUzvkjr7NYSN80A7erDmV4M/JDrN6Hu6boU1Zipya5gYP1 +/Qvo0DdZQtme90kEknAgBW5nbHJ5wuowjdQ2c5GOYHho27cE4mU0OuPAxxTSZpnlvT+ Pdycp3iSrW74d28q721FKEtVdBziAvIYfE+P7l55XHk1aBhSrtDFu4GW7g8jogNQPZ9o C4L+0LhJh0B67ajxxXia/9iFJNPuG5B0APSiHUCmlfVW2EOjyS+b2lJxcZfcr1xURrqf OYaQ== X-Gm-Message-State: AOJu0YzbaVSsgAQQcCxLFARrWNRwkknADRHjJSvbT2HdPdcsCBvwhS1Q nY/+L+bff2YmLJTkyH0fHnuaWrvEEg9L0vwWAALleA== X-Google-Smtp-Source: AGHT+IFFeLOrWoee3AhQEEGp2q6xYvsabd0OkckOV3EsN9DbDmlj5tHN86cRjsJzxIZyeDM9w3ioNA== X-Received: by 2002:a9d:6314:0:b0:6bd:be5:daa2 with SMTP id q20-20020a9d6314000000b006bd0be5daa2mr6428327otk.33.1693012893949; Fri, 25 Aug 2023 18:21:33 -0700 (PDT) Received: from localhost (fwdproxy-prn-000.fbsv.net. [2a03:2880:ff::face:b00c]) by smtp.gmail.com with ESMTPSA id i4-20020a63b304000000b0054fe7736ac1sm2113783pgf.76.2023.08.25.18.21.33 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 25 Aug 2023 18:21:33 -0700 (PDT) From: David Wei To: Jens Axboe , Pavel Begunkov Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, Mina Almasry , Jakub Kicinski Subject: [PATCH 06/11] io_uring: add ZC pool API Date: Fri, 25 Aug 2023 18:19:49 -0700 Message-Id: <20230826011954.1801099-7-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20230826011954.1801099-1-dw@davidwei.uk> References: <20230826011954.1801099-1-dw@davidwei.uk> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org From: David Wei This patch adds an API to get/put bufs from a ZC pool added in the previous patch. Recall that there is an rbuf refill ring in an ifq that is shared w/ userspace, which puts bufs it is done with back into it. A new tier is added to the ZC pool that drains entries from the refill ring to put into the cache. So when the cache is empty, it is refilled from the refill ring first, then the freelist. ZC bufs are refcounted, with both a kref and a uref. Userspace is given an off + len into the entire ZC pool region, not individual pages from ZC bufs. A net device may pack multiple packets into the same page it gets from a ZC buf, so it is possible for the same ZC buf to be handed out to userspace multiple times. This means it is possible to drain the entire refill ring, and have no usable free bufs. Suggestions for dealing w/ this are very welcome! Only up to POOL_REFILL_COUNT entries are refilled from the refill ring. Given the above, we may want to limit the amount of work being done since refilling happens inside the NAPI softirq context. 
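
A hedged sketch of how a driver's buffer-posting path might consume this API; only the io_zc_rx_* calls come from this patch, while the mydrv_* names and descriptor layout are placeholders.

/* Illustrative RX refill for a hypothetical driver ring: take a buf
 * from the pool, hand its DMA address to the hardware, and return the
 * buf to the pool if posting fails.
 */
static int mydrv_post_zc_rx_buf(struct mydrv_rx_ring *rxr,
                                struct io_zc_rx_ifq *ifq)
{
        struct io_zc_rx_buf *buf;

        buf = io_zc_rx_get_buf(ifq);    /* cache -> refill ring -> freelist */
        if (!buf)
                return -ENOMEM;         /* pool exhausted, retry later */

        if (mydrv_post_rx_desc(rxr, io_zc_rx_buf_dma(buf), PAGE_SIZE)) {
                io_zc_rx_put_buf(ifq, buf);     /* drop the ref, recycle */
                return -EBUSY;
        }
        return 0;
}
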
Signed-off-by: David Wei Co-developed-by: Jonathan Lemon --- include/linux/io_uring.h | 18 ++++++++ io_uring/zc_rx.c | 98 ++++++++++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 13 ++++++ 3 files changed, 129 insertions(+) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index cf1993befa6a..61eae25a8f1d 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -60,6 +60,17 @@ void __io_uring_free(struct task_struct *tsk); void io_uring_unreg_ringfd(void); const char *io_uring_get_opcode(u8 opcode); +struct io_zc_rx_ifq; +struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq); +void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf); +static inline dma_addr_t io_zc_rx_buf_dma(struct io_zc_rx_buf *buf) +{ + return buf->dma; +} +static inline struct page *io_zc_rx_buf_page(struct io_zc_rx_buf *buf) +{ + return buf->page; +} static inline void io_uring_files_cancel(void) { if (current->io_uring) { @@ -108,6 +119,13 @@ static inline const char *io_uring_get_opcode(u8 opcode) { return ""; } +static inline struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq) +{ + return NULL; +} +void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf) +{ +} #endif #endif diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 317127d0d4e7..14bc063f1c6c 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -14,6 +14,9 @@ #include "zc_rx.h" #define POOL_CACHE_SIZE 128 +#define POOL_REFILL_COUNT 64 +#define IO_ZC_RX_UREF 0x10000 +#define IO_ZC_RX_KREF_MASK (IO_ZC_RX_UREF - 1) struct io_zc_rx_pool { struct io_zc_rx_ifq *ifq; @@ -267,6 +270,8 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, ifq->rq_entries = reg.rq_entries; ifq->cq_entries = reg.cq_entries; + ifq->cached_rq_head = 0; + ifq->cached_cq_tail = 0; ifq->if_rxq_id = reg.if_rxq_id; ctx->ifq = ifq; @@ -309,3 +314,96 @@ int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx) return 0; } + +static bool io_zc_rx_put_buf_uref(struct io_zc_rx_buf *buf) +{ + if (atomic_read(&buf->refcount) < IO_ZC_RX_UREF) + return false; + + return atomic_sub_and_test(IO_ZC_RX_UREF, &buf->refcount); +} + +static void io_zc_rx_refill_cache(struct io_zc_rx_ifq *ifq, int count) +{ + unsigned int entries = io_zc_rx_rqring_entries(ifq); + unsigned int mask = ifq->rq_entries - 1; + struct io_zc_rx_pool *pool = ifq->pool; + struct io_uring_rbuf_rqe *rqe; + struct io_zc_rx_buf *buf; + int i, filled; + + if (!entries) + return; + + for (i = 0, filled = 0; i < entries && filled < count; i++) { + unsigned int rq_idx = ifq->cached_rq_head++ & mask; + u32 pgid; + + rqe = &ifq->rqes[rq_idx]; + pgid = rqe->off / PAGE_SIZE; + buf = &pool->bufs[pgid]; + if (!io_zc_rx_put_buf_uref(buf)) + continue; + pool->cache[filled++] = pgid; + } + + smp_store_release(&ifq->ring->rq.head, ifq->cached_rq_head); + pool->cache_count += filled; +} + +struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq) +{ + struct io_zc_rx_pool *pool; + struct io_zc_rx_buf *buf; + int count; + u16 pgid; + + pool = ifq->pool; + if (pool->cache_count) + goto out; + + io_zc_rx_refill_cache(ifq, POOL_REFILL_COUNT); + if (pool->cache_count) + goto out; + + spin_lock(&pool->freelist_lock); + + count = min_t(u32, pool->free_count, POOL_CACHE_SIZE); + pool->free_count -= count; + pool->cache_count += count; + memcpy(pool->cache, &pool->freelist[pool->free_count], + count * sizeof(u32)); + + spin_unlock(&pool->freelist_lock); + + if (pool->cache_count) + goto out; + + return NULL; +out: + pgid = pool->cache[--pool->cache_count]; + buf = 
&pool->bufs[pgid]; + atomic_set(&buf->refcount, 1); + + return buf; +} +EXPORT_SYMBOL(io_zc_rx_get_buf); + +static void io_zc_rx_recycle_buf(struct io_zc_rx_pool *pool, + struct io_zc_rx_buf *buf) +{ + spin_lock(&pool->freelist_lock); + pool->freelist[pool->free_count++] = buf - pool->bufs; + spin_unlock(&pool->freelist_lock); +} + +void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf) +{ + struct io_zc_rx_pool *pool = ifq->pool; + + if (!atomic_dec_and_test(&buf->refcount)) + return; + + io_zc_rx_recycle_buf(pool, buf); +} +EXPORT_SYMBOL(io_zc_rx_put_buf); diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index 3cd0e730115d..b063a3c81ccb 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -2,6 +2,8 @@ #ifndef IOU_ZC_RX_H #define IOU_ZC_RX_H +#include + struct io_zc_rx_ifq { struct io_ring_ctx *ctx; struct net_device *dev; @@ -9,12 +11,23 @@ struct io_zc_rx_ifq { struct io_uring_rbuf_rqe *rqes; struct io_uring_rbuf_cqe *cqes; u32 rq_entries, cq_entries; + u32 cached_rq_head; + u32 cached_cq_tail; void *pool; /* hw rx descriptor ring id */ u32 if_rxq_id; }; +static inline u32 io_zc_rx_rqring_entries(struct io_zc_rx_ifq *ifq) +{ + struct io_rbuf_ring *ring = ifq->ring; + u32 entries; + + entries = smp_load_acquire(&ring->rq.tail) - ifq->cached_rq_head; + return min(entries, ifq->rq_entries); +} + int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, struct io_uring_zc_rx_ifq_reg __user *arg); int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx); From patchwork Sat Aug 26 01:19:50 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13366466 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 03676C83F1A for ; Sat, 26 Aug 2023 01:22:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231624AbjHZBWH (ORCPT ); Fri, 25 Aug 2023 21:22:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34908 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231622AbjHZBVk (ORCPT ); Fri, 25 Aug 2023 21:21:40 -0400 Received: from mail-pg1-x530.google.com (mail-pg1-x530.google.com [IPv6:2607:f8b0:4864:20::530]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 79E852680 for ; Fri, 25 Aug 2023 18:21:35 -0700 (PDT) Received: by mail-pg1-x530.google.com with SMTP id 41be03b00d2f7-53482b44007so810252a12.2 for ; Fri, 25 Aug 2023 18:21:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20221208.gappssmtp.com; s=20221208; t=1693012895; x=1693617695; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=AlVAPC67Ckz18v0seyoQ+Y1mhb5JS+9AeH2Kfw2VkZk=; b=nGW64Om9jJ67AXiVQNre278IB66wMDS46pX6N1g2dm/mi2u8gx7VmhYvjBUiV1K5r6 IBZ0/Fk1h4RbCvebWah92SNYShA20fR+8XkoXXJfb4dP2N8lXOqxqbyhkmaA3THBPvP1 BLZm4HLkL0ck9wRZ9L1IvJ2NnNRsBNZDd7sn3e4e421Wi0Hg6OPdXyl7spPENIAyABIg DTr7tqbLjjm9z9pA1/HMXGBVs/tdcZbfj+ZiL4Wvqid+DO8Bdfi4Yeg1/JfE00IhC6W2 JbFhtMmwwV8A2CABxIroQrRcRoJA39WdhyfKYcuJGmxuTDzmK/PXZKF69hoEaboom50g 9wkA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1693012895; x=1693617695; h=content-transfer-encoding:mime-version:references:in-reply-to 
:message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=AlVAPC67Ckz18v0seyoQ+Y1mhb5JS+9AeH2Kfw2VkZk=; b=CpOVoeGspAscvcngjmEGWSYgbIzubkpZolacsoMYm8crQFDFvKFtDOJ3CTQ0fVbm1m ++N/Id/78dIcRfjhzhGSe0wHWpEhD8ht03Lq2BHiTZz2uZExvnjaNtmPDFwCs5mGrwTo DDaUn7alY7BfKxqoqWUzhBe9IBnjMdz8gqddP7T5X0uiNor42xFmwHnCjKQzobBc96tK TjrygFGYJied+hc0Y5rvlej9Pd7TEGAwDG0Ati4z07rUwgQWeKR7um0a1is2tudcpVWx GqYT0Fzu05U6dHNp9BZqM4s8+k34MRWOMupDpJXFWVFQ+nijzVQAnNR4ueWqYGElDz4A fpvA== X-Gm-Message-State: AOJu0YwTR0fO0MTw1xIesaEkgkAcPxszg7Eq+ntURnrDJ0pfYDE/L5IC NVfqCpDaMbSfpwkrJOx0PHAbdTH6LwGHLUPnrUkx8Q== X-Google-Smtp-Source: AGHT+IF+A3AHOxKASsfzkvmfhUkhDeFLttevBVsPTVTM4K1LnkeW4qOJxIZ1ujeO+Cca4jiofNkGzw== X-Received: by 2002:a17:90a:df07:b0:268:fb85:3b2 with SMTP id gp7-20020a17090adf0700b00268fb8503b2mr16125338pjb.7.1693012894990; Fri, 25 Aug 2023 18:21:34 -0700 (PDT) Received: from localhost (fwdproxy-prn-010.fbsv.net. [2a03:2880:ff:a::face:b00c]) by smtp.gmail.com with ESMTPSA id a6-20020a170902ecc600b001a5fccab02dsm2410936plh.177.2023.08.25.18.21.34 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 25 Aug 2023 18:21:34 -0700 (PDT) From: David Wei To: Jens Axboe , Pavel Begunkov Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, Mina Almasry , Jakub Kicinski Subject: [PATCH 07/11] skbuff: add SKBFL_FIXED_FRAG and skb_fixed() Date: Fri, 25 Aug 2023 18:19:50 -0700 Message-Id: <20230826011954.1801099-8-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20230826011954.1801099-1-dw@davidwei.uk> References: <20230826011954.1801099-1-dw@davidwei.uk> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org From: David Wei When an skb that is marked as zero copy goes up the network stack during RX, skb_orphan_frags_rx is called which then calls skb_copy_ubufs and defeat the purpose of ZC. This is because currently zero copy is only for TX and this behaviour is designed to prevent TX zero copy data being redirected up the network stack rather than new zero copy RX data coming from the driver. This patch adds a new flag SKBFL_FIXED_FRAG and checks for this in skb_orphan_frags, not calling skb_copy_ubufs if it is set. 
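
The flag is intended to be applied by whoever builds the ZC RX skb. Below is a hedged sketch of that marking step, assuming the skb frags already point into the pinned ZC pool and using the series' static uarg from a later patch; the helper name is illustrative.

/* Illustrative only: mark a ZC RX skb so skb_orphan_frags_rx() leaves
 * its frags in place instead of falling back to skb_copy_ubufs().
 */
static void mydrv_mark_zc_rx_skb(struct sk_buff *skb, struct ubuf_info *uarg)
{
        /* attach the uarg; skb_zcopy_init() does not take a reference */
        skb_zcopy_init(skb, uarg);
        /* frags are pinned pool pages: never moved, never copied */
        skb_shinfo(skb)->flags |= SKBFL_FIXED_FRAG;
}
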
Signed-off-by: David Wei Co-developed-by: Jonathan Lemon --- include/linux/skbuff.h | 10 +++++++++- 1 file changed, 9 insertions(+), 1 deletion(-) diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h index 8cff3d817131..d7ef998df4a5 100644 --- a/include/linux/skbuff.h +++ b/include/linux/skbuff.h @@ -518,6 +518,9 @@ enum { * use frags only up until ubuf_info is released */ SKBFL_MANAGED_FRAG_REFS = BIT(4), + + /* don't move or copy the fragment */ + SKBFL_FIXED_FRAG = BIT(5), }; #define SKBFL_ZEROCOPY_FRAG (SKBFL_ZEROCOPY_ENABLE | SKBFL_SHARED_FRAG) @@ -1674,6 +1677,11 @@ static inline bool skb_zcopy_managed(const struct sk_buff *skb) return skb_shinfo(skb)->flags & SKBFL_MANAGED_FRAG_REFS; } +static inline bool skb_fixed(const struct sk_buff *skb) +{ + return skb_shinfo(skb)->flags & SKBFL_FIXED_FRAG; +} + static inline bool skb_pure_zcopy_same(const struct sk_buff *skb1, const struct sk_buff *skb2) { @@ -3135,7 +3143,7 @@ static inline int skb_orphan_frags(struct sk_buff *skb, gfp_t gfp_mask) /* Frags must be orphaned, even if refcounted, if skb might loop to rx path */ static inline int skb_orphan_frags_rx(struct sk_buff *skb, gfp_t gfp_mask) { - if (likely(!skb_zcopy(skb))) + if (likely(!skb_zcopy(skb) || skb_fixed(skb))) return 0; return skb_copy_ubufs(skb, gfp_mask); } From patchwork Sat Aug 26 01:19:51 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13366465 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E5640C83F18 for ; Sat, 26 Aug 2023 01:22:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231625AbjHZBWH (ORCPT ); Fri, 25 Aug 2023 21:22:07 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34916 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231624AbjHZBVk (ORCPT ); Fri, 25 Aug 2023 21:21:40 -0400 Received: from mail-pf1-x434.google.com (mail-pf1-x434.google.com [IPv6:2607:f8b0:4864:20::434]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 554DE2689 for ; Fri, 25 Aug 2023 18:21:36 -0700 (PDT) Received: by mail-pf1-x434.google.com with SMTP id d2e1a72fcca58-68bed286169so1351556b3a.1 for ; Fri, 25 Aug 2023 18:21:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20221208.gappssmtp.com; s=20221208; t=1693012896; x=1693617696; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=tylFMBBlja4pnu4G0FtEa8rKWMH9hDK59r2ccjasHbc=; b=afkvQM3bl4EqltZqv1ipUMeSZgPncQVdl7gvVzkPGKTKcF+97lryLour3GboeyIlKZ RvXBLr5eWTNiOjfpGIO89m2DP2BANuGZpq7AKuWU5oi8z2++oBZVjkBB0QyqPeMOoFJA 0YTaiRhHJzTH4/TqduI9UBzmQv6rzOF4G6JWNJYvcH4XxypuZDyeWOvzOxmMBFpE6Xqh GSD+xu1DEZ0Agt+QxlTLmi+C55k+I/+My7EsKIUUXuplQTWvukBbIb+fmvjOqQQ2IQiK KZihVUJDAMn14axGEQ0jJMdIzbHHIFVpjTpXF+HlMrzBNAz4ufcWzFXjX2syvWFeU4Pt h9wg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1693012896; x=1693617696; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=tylFMBBlja4pnu4G0FtEa8rKWMH9hDK59r2ccjasHbc=; b=gQ6r6GGjeWV4Lojy4OoH9AwF+SDJ2fPYk2a3wkO7CYUEVsTGSSNUlkBXayVHhLsDCs 
AZhiXGNIszTcOFsZlSqwcMTprCnZo9h7pBHzIhmTCiJbLALQDspHEUQUl/857D1cPAxK FIHAqCurBfGbaOEDeXtOIA9mAVgPUnNtqnRXcj4KMtt4QFoUPYtJmZVfffwVM/wgVsof AFme79uoA9Nc9Whipkb62mvrTM0+m9Rn3pNfza6yOEBrx0BTCn95NqGmTcHMD9C4TC9h y20e6UF9uM+/8MqLOBGC32riuzF6Gnvgk5kl+uCxdgPaiPBnRauoMKFIWNyh5w0SQljK 9BIw== X-Gm-Message-State: AOJu0YztnUMySGtxM5EqCS7Nvu0Ad9NKtlWBJ60csL+1tk9Es/rk0Sgc MA0vHFUpCCzZL/MlsQLBia8zlg== X-Google-Smtp-Source: AGHT+IEjiFq2VLUS7JjHkDnkb3/Kh/FZb/jcg+0xtPb+IbUjM75iF8xhj/xI0H7/6H49TV+4q5Si5A== X-Received: by 2002:a05:6a21:7785:b0:14c:d494:77c5 with SMTP id bd5-20020a056a21778500b0014cd49477c5mr975470pzc.13.1693012895860; Fri, 25 Aug 2023 18:21:35 -0700 (PDT) Received: from localhost (fwdproxy-prn-016.fbsv.net. [2a03:2880:ff:10::face:b00c]) by smtp.gmail.com with ESMTPSA id jw21-20020a170903279500b001b06c106844sm2402215plb.151.2023.08.25.18.21.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 25 Aug 2023 18:21:35 -0700 (PDT) From: David Wei To: Jens Axboe , Pavel Begunkov Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, Mina Almasry , Jakub Kicinski Subject: [PATCH 08/11] io_uring: allocate a uarg for freeing zero copy skbs Date: Fri, 25 Aug 2023 18:19:51 -0700 Message-Id: <20230826011954.1801099-9-dw@davidwei.uk> X-Mailer: git-send-email 2.39.3 In-Reply-To: <20230826011954.1801099-1-dw@davidwei.uk> References: <20230826011954.1801099-1-dw@davidwei.uk> MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org From: David Wei As ZC skbs are marked as zero copy, they will bypass the default skb frag destructor. This patch adds a static uarg that is attached to ZC bufs and a callback that returns them to the freelist of a ZC pool. Signed-off-by: David Wei Co-developed-by: Jonathan Lemon --- include/linux/io_uring.h | 7 +++++++ include/linux/netdevice.h | 1 + io_uring/zc_rx.c | 44 +++++++++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 2 ++ 4 files changed, 54 insertions(+) diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h index 61eae25a8f1d..e2a4f92df814 100644 --- a/include/linux/io_uring.h +++ b/include/linux/io_uring.h @@ -62,6 +62,8 @@ const char *io_uring_get_opcode(u8 opcode); struct io_zc_rx_ifq; struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq); +struct io_zc_rx_buf *io_zc_rx_buf_from_page(struct io_zc_rx_ifq *ifq, + struct page *page); void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf); static inline dma_addr_t io_zc_rx_buf_dma(struct io_zc_rx_buf *buf) { @@ -123,6 +125,11 @@ static inline struct io_zc_rx_buf *io_zc_rx_get_buf(struct io_zc_rx_ifq *ifq) { return NULL; } +static inline struct io_zc_rx_buf *io_zc_rx_buf_from_page(struct io_zc_rx_ifq *ifq, + struct page *page) +{ + return NULL; +} void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf) { } diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index a20a5c847916..bf133cbee721 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -1043,6 +1043,7 @@ struct netdev_bpf { struct { struct io_zc_rx_ifq *ifq; u16 queue_id; + struct ubuf_info *uarg; } zc_rx; }; }; diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 14bc063f1c6c..b8dd699e2777 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -44,6 +44,11 @@ static u64 mk_page_info(u16 pool_id, u32 pgid) return (u64)0xface << 48 | (u64)pool_id << 32 | (u64)pgid; } +static bool is_zc_rx_page(struct page *page) +{ + return PagePrivate(page) && ((page_private(page) >> 48) == 0xface); +} + typedef int (*bpf_op_t)(struct net_device 
*dev, struct netdev_bpf *bpf); static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, @@ -59,6 +64,7 @@ static int __io_queue_mgmt(struct net_device *dev, struct io_zc_rx_ifq *ifq, cmd.command = XDP_SETUP_ZC_RX; cmd.zc_rx.ifq = ifq; cmd.zc_rx.queue_id = queue_id; + cmd.zc_rx.uarg = &ifq->uarg; return ndo_bpf(dev, &cmd); } @@ -73,6 +79,26 @@ static int io_close_zc_rxq(struct io_zc_rx_ifq *ifq) return __io_queue_mgmt(ifq->dev, NULL, ifq->if_rxq_id); } +static void io_zc_rx_skb_free(struct sk_buff *skb, struct ubuf_info *uarg, + bool success) +{ + struct skb_shared_info *shinfo = skb_shinfo(skb); + struct io_zc_rx_ifq *ifq; + struct io_zc_rx_buf *buf; + struct page *page; + int i; + + ifq = container_of(uarg, struct io_zc_rx_ifq, uarg); + for (i = 0; i < shinfo->nr_frags; i++) { + page = skb_frag_page(&shinfo->frags[i]); + buf = io_zc_rx_buf_from_page(ifq, page); + if (likely(buf)) + io_zc_rx_put_buf(ifq, buf); + else + __skb_frag_unref(&shinfo->frags[i], skb->pp_recycle); + } +} + static int io_zc_rx_map_buf(struct device *dev, struct page *page, u16 pool_id, u32 pgid, struct io_zc_rx_buf *buf) { @@ -268,6 +294,8 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, if (ret) goto err; + ifq->uarg.callback = io_zc_rx_skb_free; + ifq->uarg.flags = SKBFL_ALL_ZEROCOPY | SKBFL_FIXED_FRAG; ifq->rq_entries = reg.rq_entries; ifq->cq_entries = reg.cq_entries; ifq->cached_rq_head = 0; @@ -407,3 +435,19 @@ void io_zc_rx_put_buf(struct io_zc_rx_ifq *ifq, struct io_zc_rx_buf *buf) io_zc_rx_recycle_buf(pool, buf); } EXPORT_SYMBOL(io_zc_rx_put_buf); + +struct io_zc_rx_buf *io_zc_rx_buf_from_page(struct io_zc_rx_ifq *ifq, + struct page *page) +{ + struct io_zc_rx_pool *pool; + int pgid; + + if (!is_zc_rx_page(page)) + return NULL; + + pool = ifq->pool; + pgid = page_private(page) & 0xffffffff; + + return &pool->bufs[pgid]; +} +EXPORT_SYMBOL(io_zc_rx_buf_from_page); diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index b063a3c81ccb..ee7e36295f91 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -3,6 +3,7 @@ #define IOU_ZC_RX_H #include +#include struct io_zc_rx_ifq { struct io_ring_ctx *ctx; @@ -10,6 +11,7 @@ struct io_zc_rx_ifq { struct io_rbuf_ring *ring; struct io_uring_rbuf_rqe *rqes; struct io_uring_rbuf_cqe *cqes; + struct ubuf_info uarg; u32 rq_entries, cq_entries; u32 cached_rq_head; u32 cached_cq_tail; From patchwork Sat Aug 26 01:19:52 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13366461 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2BC02C83F14 for ; Sat, 26 Aug 2023 01:22:32 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231620AbjHZBWF (ORCPT ); Fri, 25 Aug 2023 21:22:05 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34924 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231627AbjHZBVl (ORCPT ); Fri, 25 Aug 2023 21:21:41 -0400 Received: from mail-pl1-x62b.google.com (mail-pl1-x62b.google.com [IPv6:2607:f8b0:4864:20::62b]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2B4802684 for ; Fri, 25 Aug 2023 18:21:37 -0700 (PDT) Received: by mail-pl1-x62b.google.com with SMTP id d9443c01a7336-1bf57366ccdso17657495ad.1 for ; Fri, 25 Aug 2023 18:21:37 -0700 (PDT) DKIM-Signature: v=1; 
From: David Wei
To: Jens Axboe, Pavel Begunkov
Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, Mina Almasry, Jakub Kicinski
Subject: [PATCH 09/11] io_uring: delay ZC pool destruction
Date: Fri, 25 Aug 2023 18:19:52 -0700
Message-Id: <20230826011954.1801099-10-dw@davidwei.uk>
In-Reply-To: <20230826011954.1801099-1-dw@davidwei.uk>
References: <20230826011954.1801099-1-dw@davidwei.uk>
X-Mailing-List: io-uring@vger.kernel.org

From: David Wei

At a point in time, a ZC buf may be in:

* RX queue
* Socket
* One of the ifq ringbufs
* Userspace

The ZC pool region and the pool itself cannot be destroyed until all bufs have been returned. This patch changes the ZC pool destruction to be delayed work, waiting for up to 10 seconds for bufs to be returned before unconditionally destroying the pool.
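The destruction path follows the standard delayed_work polling pattern: re-arm the work once a second while buffers are still referenced, and free unconditionally once the deadline passes. A compressed sketch of that pattern with illustrative names (the real code operates on struct io_zc_rx_pool and its per-buf refcounts, see the diff below):

#include <linux/workqueue.h>
#include <linux/jiffies.h>

struct zc_pool_demo {
	unsigned long		delay_end;	/* jiffies deadline */
	struct delayed_work	destroy_work;
};

/* placeholder: the real check walks every buf refcount in the pool */
static bool zc_pool_demo_busy(struct zc_pool_demo *pool)
{
	return false;
}

static void zc_pool_demo_destroy_work(struct work_struct *work)
{
	struct zc_pool_demo *pool = container_of(to_delayed_work(work),
						 struct zc_pool_demo, destroy_work);

	if (zc_pool_demo_busy(pool) && time_before(jiffies, pool->delay_end)) {
		/* bufs still outstanding and deadline not reached: retry in 1s */
		schedule_delayed_work(&pool->destroy_work, HZ);
		return;
	}
	/* all bufs returned or deadline hit: free pool resources here */
}

static void zc_pool_demo_destroy(struct zc_pool_demo *pool)
{
	pool->delay_end = jiffies + 10 * HZ;
	INIT_DELAYED_WORK(&pool->destroy_work, zc_pool_demo_destroy_work);
	schedule_delayed_work(&pool->destroy_work, 0);
}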
Signed-off-by: David Wei Co-developed-by: Jonathan Lemon --- io_uring/zc_rx.c | 50 ++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 44 insertions(+), 6 deletions(-) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index b8dd699e2777..70e39f851e47 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -28,6 +28,10 @@ struct io_zc_rx_pool { u32 cache_count; u32 cache[POOL_CACHE_SIZE]; + /* delayed destruction */ + unsigned long delay_end; + struct delayed_work destroy_work; + /* freelist */ spinlock_t freelist_lock; u32 free_count; @@ -222,20 +226,56 @@ int io_zc_rx_create_pool(struct io_ring_ctx *ctx, return ret; } -static void io_zc_rx_destroy_pool(struct io_zc_rx_pool *pool) +static void io_zc_rx_destroy_ifq(struct io_zc_rx_ifq *ifq) +{ + if (ifq->dev) + dev_put(ifq->dev); + io_free_rbuf_ring(ifq); + kfree(ifq); +} + +static void io_zc_rx_destroy_pool_work(struct work_struct *work) { + struct io_zc_rx_pool *pool = container_of( + to_delayed_work(work), struct io_zc_rx_pool, destroy_work); struct device *dev = netdev2dev(pool->ifq->dev); struct io_zc_rx_buf *buf; + int i, refc, count; - for (int i = 0; i < pool->nr_pages; i++) { + for (i = 0; i < pool->nr_pages; i++) { buf = &pool->bufs[i]; + refc = atomic_read(&buf->refcount) & IO_ZC_RX_KREF_MASK; + if (refc) { + if (time_before(jiffies, pool->delay_end)) { + schedule_delayed_work(&pool->destroy_work, HZ); + return; + } + count++; + } + } + + if (count) + pr_debug("freeing pool with %d/%d outstanding pages\n", + count, pool->nr_pages); + + for (i = 0; i < pool->nr_pages; i++) { + buf = &pool->bufs[i]; io_zc_rx_unmap_buf(dev, buf); } + + io_zc_rx_destroy_ifq(pool->ifq); kvfree(pool->bufs); kvfree(pool); } +static void io_zc_rx_destroy_pool(struct io_zc_rx_pool *pool) +{ + pool->delay_end = jiffies + HZ * 10; + INIT_DELAYED_WORK(&pool->destroy_work, io_zc_rx_destroy_pool_work); + schedule_delayed_work(&pool->destroy_work, 0); +} + static struct io_zc_rx_ifq *io_zc_rx_ifq_alloc(struct io_ring_ctx *ctx) { struct io_zc_rx_ifq *ifq; @@ -256,10 +296,8 @@ static void io_zc_rx_ifq_free(struct io_zc_rx_ifq *ifq) io_close_zc_rxq(ifq); if (ifq->pool) io_zc_rx_destroy_pool(ifq->pool); - if (ifq->dev) - dev_put(ifq->dev); - io_free_rbuf_ring(ifq); - kfree(ifq); + else + io_zc_rx_destroy_ifq(ifq); } int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, From patchwork Sat Aug 26 01:19:53 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13366468 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 32C2FC83F19 for ; Sat, 26 Aug 2023 01:22:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231636AbjHZBWJ (ORCPT ); Fri, 25 Aug 2023 21:22:09 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:34940 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231628AbjHZBVl (ORCPT ); Fri, 25 Aug 2023 21:21:41 -0400 Received: from mail-pf1-x430.google.com (mail-pf1-x430.google.com [IPv6:2607:f8b0:4864:20::430]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 2D038212B for ; Fri, 25 Aug 2023 18:21:38 -0700 (PDT) Received: by mail-pf1-x430.google.com with SMTP id d2e1a72fcca58-68a3082c771so1072612b3a.0 for ; Fri, 25 Aug 2023 18:21:38 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; 
From: David Wei
To: Jens Axboe, Pavel Begunkov
Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, Mina Almasry, Jakub Kicinski
Subject: [PATCH 10/11] netdev/bnxt: add data pool and use it in BNXT driver
Date: Fri, 25 Aug 2023 18:19:53 -0700
Message-Id: <20230826011954.1801099-11-dw@davidwei.uk>
In-Reply-To: <20230826011954.1801099-1-dw@davidwei.uk>
References: <20230826011954.1801099-1-dw@davidwei.uk>
X-Mailing-List: io-uring@vger.kernel.org

From: David Wei

This patch adds a thin wrapper called the data pool that wraps the existing page pool and the newly added ZC pool. There is one struct netdev_rx_queue per logical RX queue; this patch adds the ZC ifq, the page pool, and the uarg (set during skb construction) to netdev_rx_queue. The data pool wrapper uses the ZC pool if an ifq is present, and falls back to the page pool otherwise.

The BNXT driver is modified to use the data pool in order to support ZC RX. A setup function bnxt_zc_rx() is added; it is called on the XDP_SETUP_ZC_RX XDP command and sets the relevant fields in netdev_rx_queue. Calls that get/put bufs from the page pool are replaced with calls into the data pool.
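At the call sites the wrapper is meant to be invisible: the driver's RX refill path asks the data pool for a page and a DMA address, and the same code serves both the regular page pool and a ZC ifq. A rough sketch of the intended call pattern using the helpers added in this patch (the surrounding function is hypothetical and heavily simplified compared to the bnxt changes below):

#include <net/data_pool.h>

static int demo_rx_refill_one(struct netdev_rx_queue *rxq,
			      struct page **pagep, dma_addr_t *mapping)
{
	struct page *page;

	page = data_pool_alloc_page(rxq);
	if (!page)
		return -ENOMEM;

	/* ZC bufs are pre-mapped by io_uring and page pool pages are mapped
	 * by the pool (PP_FLAG_DMA_MAP), so no dma_map_page_attrs() here */
	*mapping = data_pool_get_dma_addr(rxq, page);
	*pagep = page;
	return 0;
}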
Signed-off-by: David Wei --- drivers/net/ethernet/broadcom/bnxt/bnxt.c | 59 ++++++++---- drivers/net/ethernet/broadcom/bnxt/bnxt.h | 4 + drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c | 3 + include/linux/netdevice.h | 4 + include/net/data_pool.h | 96 +++++++++++++++++++ 5 files changed, 149 insertions(+), 17 deletions(-) create mode 100644 include/net/data_pool.h diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.c b/drivers/net/ethernet/broadcom/bnxt/bnxt.c index 48f584f78561..5c1dabaf07f9 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.c @@ -57,6 +57,7 @@ #include #include #include +#include #include "bnxt_hsi.h" #include "bnxt.h" @@ -724,7 +725,7 @@ static void bnxt_tx_int(struct bnxt *bp, struct bnxt_napi *bnapi, int nr_pkts) __netif_txq_completed_wake(txq, nr_pkts, tx_bytes, bnxt_tx_avail(bp, txr), bp->tx_wake_thresh, - READ_ONCE(txr->dev_state) != BNXT_DEV_STATE_CLOSING); + READ_ONCE(txr->dev_state) == BNXT_DEV_STATE_CLOSING); } static struct page *__bnxt_alloc_rx_64k_page(struct bnxt *bp, dma_addr_t *mapping, @@ -738,13 +739,7 @@ static struct page *__bnxt_alloc_rx_64k_page(struct bnxt *bp, dma_addr_t *mappin if (!page) return NULL; - *mapping = dma_map_page_attrs(&bp->pdev->dev, page, offset, - BNXT_RX_PAGE_SIZE, DMA_FROM_DEVICE, - DMA_ATTR_WEAK_ORDERING); - if (dma_mapping_error(&bp->pdev->dev, *mapping)) { - page_pool_recycle_direct(rxr->page_pool, page); - return NULL; - } + *mapping = page_pool_get_dma_addr(page); if (page_offset) *page_offset = offset; @@ -757,19 +752,14 @@ static struct page *__bnxt_alloc_rx_page(struct bnxt *bp, dma_addr_t *mapping, struct bnxt_rx_ring_info *rxr, gfp_t gfp) { - struct device *dev = &bp->pdev->dev; struct page *page; - page = page_pool_dev_alloc_pages(rxr->page_pool); + page = data_pool_alloc_page(rxr->rx_queue); if (!page) return NULL; - *mapping = dma_map_page_attrs(dev, page, 0, PAGE_SIZE, bp->rx_dir, - DMA_ATTR_WEAK_ORDERING); - if (dma_mapping_error(dev, *mapping)) { - page_pool_recycle_direct(rxr->page_pool, page); - return NULL; - } + *mapping = data_pool_get_dma_addr(rxr->rx_queue, page); + return page; } @@ -1787,6 +1777,8 @@ static void bnxt_deliver_skb(struct bnxt *bp, struct bnxt_napi *bnapi, return; } skb_record_rx_queue(skb, bnapi->index); + if (bnapi->rx_ring->rx_queue->zc_ifq) + skb_zcopy_init(skb, bnapi->rx_ring->rx_queue->zc_uarg); skb_mark_for_recycle(skb); napi_gro_receive(&bnapi->napi, skb); } @@ -3016,7 +3008,7 @@ static void bnxt_free_one_rx_ring_skbs(struct bnxt *bp, int ring_nr) rx_agg_buf->page = NULL; __clear_bit(i, rxr->rx_agg_bmap); - page_pool_recycle_direct(rxr->page_pool, page); + data_pool_put_page(rxr->rx_queue, page); } skip_rx_agg_free: @@ -3225,6 +3217,8 @@ static void bnxt_free_rx_rings(struct bnxt *bp) page_pool_destroy(rxr->page_pool); rxr->page_pool = NULL; + rxr->rx_queue->page_pool = NULL; + rxr->rx_queue->zc_ifq = NULL; kfree(rxr->rx_agg_bmap); rxr->rx_agg_bmap = NULL; @@ -3251,12 +3245,16 @@ static int bnxt_alloc_rx_page_pool(struct bnxt *bp, pp.dma_dir = DMA_BIDIRECTIONAL; if (PAGE_SIZE > BNXT_RX_PAGE_SIZE) pp.flags |= PP_FLAG_PAGE_FRAG; + pp.flags |= PP_FLAG_DMA_MAP | PP_FLAG_DMA_SYNC_DEV; + pp.max_len = PAGE_SIZE; rxr->page_pool = page_pool_create(&pp); + data_pool_set_page_pool(bp->dev, rxr->bnapi->index, rxr->page_pool); if (IS_ERR(rxr->page_pool)) { int err = PTR_ERR(rxr->page_pool); rxr->page_pool = NULL; + data_pool_set_page_pool(bp->dev, rxr->bnapi->index, NULL); return err; } return 0; @@ -4620,6 +4618,8 @@ static int 
bnxt_alloc_mem(struct bnxt *bp, bool irq_re_init) rxr->rx_agg_ring_struct.ring_mem.flags = BNXT_RMEM_RING_PTE_FLAG; } + + rxr->rx_queue = data_pool_get_rx_queue(bp->dev, bp->bnapi[i]->index); rxr->bnapi = bp->bnapi[i]; bp->bnapi[i]->rx_ring = &bp->rx_ring[i]; } @@ -13904,6 +13904,31 @@ void bnxt_print_device_info(struct bnxt *bp) pcie_print_link_status(bp->pdev); } +int bnxt_zc_rx(struct bnxt *bp, struct netdev_bpf *xdp) +{ + if (xdp->zc_rx.queue_id >= bp->rx_nr_rings) + return -EINVAL; + + bnxt_rtnl_lock_sp(bp); + if (netif_running(bp->dev)) { + struct netdev_rx_queue *rxq; + int rc; + + bnxt_ulp_stop(bp); + bnxt_close_nic(bp, true, false); + + rxq = data_pool_get_rx_queue(bp->dev, xdp->zc_rx.queue_id); + rxq->queue_id = xdp->zc_rx.queue_id; + rxq->zc_ifq = xdp->zc_rx.ifq; + rxq->zc_uarg = xdp->zc_rx.uarg; + + rc = bnxt_open_nic(bp, true, false); + bnxt_ulp_start(bp, rc); + } + bnxt_rtnl_unlock_sp(bp); + return 0; +} + static int bnxt_init_one(struct pci_dev *pdev, const struct pci_device_id *ent) { struct net_device *dev; diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt.h b/drivers/net/ethernet/broadcom/bnxt/bnxt.h index 9fd85ebd8ae8..554c0abc0d44 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt.h +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt.h @@ -949,6 +949,7 @@ struct bnxt_rx_ring_info { struct bnxt_ring_struct rx_agg_ring_struct; struct xdp_rxq_info xdp_rxq; struct page_pool *page_pool; + struct netdev_rx_queue *rx_queue; }; struct bnxt_rx_sw_stats { @@ -2454,4 +2455,7 @@ int bnxt_get_port_parent_id(struct net_device *dev, void bnxt_dim_work(struct work_struct *work); int bnxt_hwrm_set_ring_coal(struct bnxt *bp, struct bnxt_napi *bnapi); void bnxt_print_device_info(struct bnxt *bp); + +int bnxt_zc_rx(struct bnxt *bp, struct netdev_bpf *xdp); + #endif diff --git a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c index 4efa5fe6972b..1387f0e1fff5 100644 --- a/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c +++ b/drivers/net/ethernet/broadcom/bnxt/bnxt_xdp.c @@ -455,6 +455,9 @@ int bnxt_xdp(struct net_device *dev, struct netdev_bpf *xdp) case XDP_SETUP_PROG: rc = bnxt_xdp_set(bp, xdp->prog); break; + case XDP_SETUP_ZC_RX: + return bnxt_zc_rx(bp, xdp); + break; default: rc = -EINVAL; break; diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h index bf133cbee721..994237e92cbc 100644 --- a/include/linux/netdevice.h +++ b/include/linux/netdevice.h @@ -789,6 +789,10 @@ struct netdev_rx_queue { struct kobject kobj; struct net_device *dev; netdevice_tracker dev_tracker; + unsigned int queue_id; + struct page_pool *page_pool; + struct io_zc_rx_ifq *zc_ifq; + struct ubuf_info *zc_uarg; #ifdef CONFIG_XDP_SOCKETS struct xsk_buff_pool *pool; diff --git a/include/net/data_pool.h b/include/net/data_pool.h new file mode 100644 index 000000000000..84c96aa1c542 --- /dev/null +++ b/include/net/data_pool.h @@ -0,0 +1,96 @@ +#ifndef _DATA_POOL_H +#define _DATA_POOL_H + +#include +#include +#include +#include + +static inline struct netdev_rx_queue * +data_pool_get_rx_queue(struct net_device *dev, unsigned int q_idx) +{ + if (q_idx >= dev->num_rx_queues) + return NULL; + return __netif_get_rx_queue(dev, q_idx); +} + +static inline int data_pool_set_page_pool(struct net_device *dev, + unsigned int q_idx, + struct page_pool *pool) +{ + struct netdev_rx_queue *rxq; + + rxq = data_pool_get_rx_queue(dev, q_idx); + if (!rxq) + return -EINVAL; + + rxq->page_pool = pool; + return 0; +} + +static inline int data_pool_set_zc_ifq(struct net_device *dev, + 
unsigned int q_idx, + struct io_zc_rx_ifq *ifq) +{ + struct netdev_rx_queue *rxq; + + rxq = data_pool_get_rx_queue(dev, q_idx); + if (!rxq) + return -EINVAL; + + rxq->zc_ifq = ifq; + return 0; +} + +static inline struct page *data_pool_alloc_page(struct netdev_rx_queue *rxq) +{ + if (rxq->zc_ifq) { + struct io_zc_rx_buf *buf; + buf = io_zc_rx_get_buf(rxq->zc_ifq); + if (!buf) + return NULL; + return buf->page; + } else { + return page_pool_dev_alloc_pages(rxq->page_pool); + } +} + +static inline void data_pool_fragment_page(struct netdev_rx_queue *rxq, + struct page *page, + unsigned long bias) +{ + if (rxq->zc_ifq) { + struct io_zc_rx_buf *buf; + buf = io_zc_rx_buf_from_page(rxq->zc_ifq, page); + atomic_set(&buf->refcount, bias); + } else { + page_pool_fragment_page(page, bias); + } +} + +static inline void data_pool_put_page(struct netdev_rx_queue *rxq, + struct page *page) +{ + if (rxq->zc_ifq) { + struct io_zc_rx_buf *buf; + buf = io_zc_rx_buf_from_page(rxq->zc_ifq, page); + io_zc_rx_put_buf(rxq->zc_ifq, buf); + } else { + WARN_ON_ONCE(page->pp_magic != PP_SIGNATURE); + page_pool_recycle_direct(rxq->page_pool, page); + } +} + +static inline dma_addr_t data_pool_get_dma_addr(struct netdev_rx_queue *rxq, + struct page *page) +{ + if (rxq->zc_ifq) { + struct io_zc_rx_buf *buf; + buf = io_zc_rx_buf_from_page(rxq->zc_ifq, page); + return io_zc_rx_buf_dma(buf); + } else { + return page_pool_get_dma_addr(page); + } +} + +#endif From patchwork Sat Aug 26 01:19:54 2023 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: David Wei X-Patchwork-Id: 13366467 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2359BC83F1B for ; Sat, 26 Aug 2023 01:22:33 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S231622AbjHZBWI (ORCPT ); Fri, 25 Aug 2023 21:22:08 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36788 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S231629AbjHZBVm (ORCPT ); Fri, 25 Aug 2023 21:21:42 -0400 Received: from mail-pl1-x62e.google.com (mail-pl1-x62e.google.com [IPv6:2607:f8b0:4864:20::62e]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 4689726BF for ; Fri, 25 Aug 2023 18:21:39 -0700 (PDT) Received: by mail-pl1-x62e.google.com with SMTP id d9443c01a7336-1bf6ea270b2so12069235ad.0 for ; Fri, 25 Aug 2023 18:21:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=davidwei-uk.20221208.gappssmtp.com; s=20221208; t=1693012898; x=1693617698; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=ZyOGiEo7fxbB4EEi1N2igx6I6zFU58A0GhJQnNqswuk=; b=QwcYQjP2r+jWYCZ5jqdkfOAgZl8abmTIJe5cqvPid/5MYUQDWpiGWuogb7YUMODxAo ZIUcTWgZZe5c5A1Ne6IXm6ijeocIP/KF+Iv/8qSqgMSDU2EDJ/B8XHDHOJ0Ik5reaemR 8I7NZlouQRPH6HuURzhjNnunkrC23P+Z+ut7NjBuOCKRwBBsAVE1Dvxy3VaLjGxsf/xg JC/7P4ljDpRzDWyPK7Gdop1Zx0o/bHG071s9n4f+FFpdWRrDQTHoo/5OZTCeJ27Bg/5r WREddbf6ubinP4lm0KjxTzgTWJrA89hFhermIpjKeqdUkcb3DPem1xdo57gwi73Wp8n+ hkFw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1693012898; x=1693617698; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc 
From: David Wei
To: Jens Axboe, Pavel Begunkov
Cc: io-uring@vger.kernel.org, netdev@vger.kernel.org, Mina Almasry, Jakub Kicinski
Subject: [PATCH 11/11] io_uring: add io_recvzc request
Date: Fri, 25 Aug 2023 18:19:54 -0700
Message-Id: <20230826011954.1801099-12-dw@davidwei.uk>
In-Reply-To: <20230826011954.1801099-1-dw@davidwei.uk>
References: <20230826011954.1801099-1-dw@davidwei.uk>
X-Mailing-List: io-uring@vger.kernel.org

From: David Wei

This patch adds an io_uring opcode, IORING_OP_RECV_ZC, for doing ZC reads from a socket that is set up for ZC RX. The request reads skbs from a socket whose page frags are tagged with a magic cookie in their page private field. For each frag, an entry is written into the ifq rbuf completion ring, and the total number of bytes read is returned to userspace as an io_uring completion event.

Multishot requests work. There is no need to specify provided buffers, as data is returned in the ifq rbuf completion ring. Userspace is expected to look into the ifq rbuf completion ring when it receives an io_uring completion event.

The addr3 field is used to encode params in the following format:

  addr3 = (readlen << 32) | ifq_id;

readlen is the max amount of data to read from the socket. ifq_id is the interface queue id; currently only 0 is supported.
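From userspace the request looks like a normal (optionally multishot) receive, except that no provided buffers are given and addr3 carries the packed parameters. A hedged liburing-style sketch (IORING_OP_RECV_ZC comes from this series and liburing has no prep helper for it, so the SQE is filled in by hand):

#include <liburing.h>
#include <stdint.h>

static void prep_recv_zc(struct io_uring *ring, int sockfd,
			 uint32_t readlen, uint16_t ifq_id)
{
	struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

	io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
	sqe->ioprio = IORING_RECV_MULTISHOT;		  /* optional */
	sqe->addr3 = ((uint64_t)readlen << 32) | ifq_id;  /* ifq_id must be 0 */
}

After io_uring_submit(), each completion's res field reports the number of bytes consumed; the data locations themselves are read from the ifq rbuf completion ring.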
Signed-off-by: David Wei Co-developed-by: Jonathan Lemon --- include/uapi/linux/io_uring.h | 1 + io_uring/net.c | 83 +++++++++++- io_uring/opdef.c | 16 +++ io_uring/zc_rx.c | 232 ++++++++++++++++++++++++++++++++++ io_uring/zc_rx.h | 4 + 5 files changed, 335 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 28154abfe6f4..c43e5cc7de0a 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -223,6 +223,7 @@ enum io_uring_op { IORING_OP_URING_CMD, IORING_OP_SEND_ZC, IORING_OP_SENDMSG_ZC, + IORING_OP_RECV_ZC, /* this goes last, obviously */ IORING_OP_LAST, diff --git a/io_uring/net.c b/io_uring/net.c index 89e839013837..9a0a008418ec 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -69,6 +69,12 @@ struct io_sr_msg { struct io_kiocb *notif; }; +struct io_recvzc { + struct io_sr_msg sr; + u32 datalen; + u16 ifq_id; +}; + static inline bool io_check_multishot(struct io_kiocb *req, unsigned int issue_flags) { @@ -574,7 +580,8 @@ int io_recvmsg_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) if (sr->msg_flags & MSG_ERRQUEUE) req->flags |= REQ_F_CLEAR_POLLIN; if (sr->flags & IORING_RECV_MULTISHOT) { - if (!(req->flags & REQ_F_BUFFER_SELECT)) + if (!(req->flags & REQ_F_BUFFER_SELECT) + && req->opcode != IORING_OP_RECV_ZC) return -EINVAL; if (sr->msg_flags & MSG_WAITALL) return -EINVAL; @@ -931,6 +938,80 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags) return ret; } +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + u64 recvzc_cmd; + + recvzc_cmd = READ_ONCE(sqe->addr3); + zc->datalen = recvzc_cmd >> 32; + zc->ifq_id = recvzc_cmd & 0xffff; + if (zc->ifq_id != 0) + return -EINVAL; + + return io_recvmsg_prep(req, sqe); +} + +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + struct socket *sock; + unsigned flags; + int ret, min_ret = 0; + bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + + if (!(req->flags & REQ_F_POLLED) && + (zc->sr.flags & IORING_RECVSEND_POLL_FIRST)) + return -EAGAIN; + + sock = sock_from_file(req->file); + if (unlikely(!sock)) + return -ENOTSOCK; + +retry_multishot: + flags = zc->sr.msg_flags; + if (force_nonblock) + flags |= MSG_DONTWAIT; + if (flags & MSG_WAITALL) + min_ret = zc->sr.len; + + ret = io_zc_rx_recv(sock, zc->datalen, flags); + if (ret < min_ret) { + if (ret == -EAGAIN && force_nonblock) { + if (issue_flags & IO_URING_F_MULTISHOT) { + io_kbuf_recycle(req, issue_flags); + return IOU_ISSUE_SKIP_COMPLETE; + } + + return -EAGAIN; + } + if (ret > 0 && io_net_retry(sock, flags)) { + zc->sr.len -= ret; + zc->sr.buf += ret; + zc->sr.done_io += ret; + req->flags |= REQ_F_PARTIAL_IO; + return -EAGAIN; + } + if (ret == -ERESTARTSYS) + ret = -EINTR; + req_set_fail(req); + } else if ((flags & MSG_WAITALL) && (flags & (MSG_TRUNC | MSG_CTRUNC))) { + req_set_fail(req); + } + + if (ret > 0) + ret += zc->sr.done_io; + else if (zc->sr.done_io) + ret = zc->sr.done_io; + else + io_kbuf_recycle(req, issue_flags); + + if (!io_recv_finish(req, &ret, 0, ret <= 0, issue_flags)) + goto retry_multishot; + + return ret; +} + void io_send_zc_cleanup(struct io_kiocb *req) { struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg); diff --git a/io_uring/opdef.c b/io_uring/opdef.c index 3b9c6489b8b6..4dee7f83222f 100644 --- a/io_uring/opdef.c +++ b/io_uring/opdef.c @@ -33,6 +33,7 @@ #include "poll.h" 
#include "cancel.h" #include "rw.h" +#include "zc_rx.h" static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags) { @@ -426,6 +427,18 @@ const struct io_issue_def io_issue_defs[] = { .issue = io_sendmsg_zc, #else .prep = io_eopnotsupp_prep, +#endif + }, + [IORING_OP_RECV_ZC] = { + .needs_file = 1, + .unbound_nonreg_file = 1, + .pollin = 1, + .ioprio = 1, +#if defined(CONFIG_NET) + .prep = io_recvzc_prep, + .issue = io_recvzc, +#else + .prep = io_eopnotsupp_prep, #endif }, }; @@ -648,6 +661,9 @@ const struct io_cold_def io_cold_defs[] = { .fail = io_sendrecv_fail, #endif }, + [IORING_OP_RECV_ZC] = { + .name = "RECV_ZC", + }, }; const char *io_uring_get_opcode(u8 opcode) diff --git a/io_uring/zc_rx.c b/io_uring/zc_rx.c index 70e39f851e47..a861be50fd61 100644 --- a/io_uring/zc_rx.c +++ b/io_uring/zc_rx.c @@ -5,6 +5,7 @@ #include #include #include +#include #include @@ -38,6 +39,13 @@ struct io_zc_rx_pool { u32 freelist[]; }; +static inline u32 io_zc_rx_cqring_entries(struct io_zc_rx_ifq *ifq) +{ + struct io_rbuf_ring *ring = ifq->ring; + + return ifq->cached_cq_tail - READ_ONCE(ring->cq.head); +} + static struct device *netdev2dev(struct net_device *dev) { return dev->dev.parent; @@ -381,6 +389,14 @@ int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx) return 0; } +static void io_zc_rx_get_buf_uref(struct io_zc_rx_pool *pool, u16 pgid) +{ + if (WARN_ON(pgid >= pool->nr_pages)) + return; + + atomic_add(IO_ZC_RX_UREF, &pool->bufs[pgid].refcount); +} + static bool io_zc_rx_put_buf_uref(struct io_zc_rx_buf *buf) { if (atomic_read(&buf->refcount) < IO_ZC_RX_UREF) @@ -489,3 +505,219 @@ struct io_zc_rx_buf *io_zc_rx_buf_from_page(struct io_zc_rx_ifq *ifq, return &pool->bufs[pgid]; } EXPORT_SYMBOL(io_zc_rx_buf_from_page); + +static struct io_zc_rx_ifq *io_zc_rx_ifq_skb(struct sk_buff *skb) +{ + struct ubuf_info *uarg = skb_zcopy(skb); + + if (uarg && uarg->callback == io_zc_rx_skb_free) + return container_of(uarg, struct io_zc_rx_ifq, uarg); + return NULL; +} + +static int zc_rx_recv_frag(struct io_zc_rx_ifq *ifq, const skb_frag_t *frag, + int off, int len) +{ + struct io_uring_rbuf_cqe *cqe; + unsigned int pgid, cq_idx, queued, free, entries; + struct page *page; + unsigned int mask; + + page = skb_frag_page(frag); + off += skb_frag_off(frag); + + if (likely(ifq && is_zc_rx_page(page))) { + mask = ifq->cq_entries - 1; + pgid = page_private(page) & 0xffffffff; + io_zc_rx_get_buf_uref(ifq->pool, pgid); + cq_idx = ifq->cached_cq_tail & mask; + smp_rmb(); + queued = min(io_zc_rx_cqring_entries(ifq), ifq->cq_entries); + free = ifq->cq_entries - queued; + entries = min(free, ifq->cq_entries - cq_idx); + if (!entries) + return -ENOBUFS; + cqe = &ifq->cqes[cq_idx]; + ifq->cached_cq_tail++; + cqe->region = 0; + cqe->off = pgid * PAGE_SIZE + off; + cqe->len = len; + cqe->flags = 0; + } else { + /* TODO: copy frags that aren't backed by zc pages */ + WARN_ON_ONCE(1); + return -ENOMEM; + } + + return len; +} + +static int +zc_rx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, + unsigned int offset, size_t len) +{ + struct io_zc_rx_ifq *ifq; + struct sk_buff *frag_iter; + unsigned start, start_off; + int i, copy, end, off; + int ret = 0; + + ifq = io_zc_rx_ifq_skb(skb); + start = skb_headlen(skb); + start_off = offset; + + // TODO: copy payload in skb linear data */ + WARN_ON_ONCE(offset < start); + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + const skb_frag_t *frag; + + WARN_ON(start > offset + len); + + frag = &skb_shinfo(skb)->frags[i]; + end = start + skb_frag_size(frag); + + if 
(offset < end) { + copy = end - offset; + if (copy > len) + copy = len; + + off = offset - start; + ret = zc_rx_recv_frag(ifq, frag, off, copy); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + + skb_walk_frags(skb, frag_iter) { + WARN_ON(start > offset + len); + + end = start + frag_iter->len; + if (offset < end) { + copy = end - offset; + if (copy > len) + copy = len; + + off = offset - start; + ret = zc_rx_recv_skb(desc, frag_iter, off, copy); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + +out: + smp_store_release(&ifq->ring->cq.tail, ifq->cached_cq_tail); + if (offset == start_off) + return ret; + return offset - start_off; +} + +static int io_zc_rx_tcp_read(struct sock *sk) +{ + read_descriptor_t rd_desc = { + .count = 1, + }; + + return tcp_read_sock(sk, &rd_desc, zc_rx_recv_skb); +} + +static int io_zc_rx_tcp_recvmsg(struct sock *sk, unsigned int recv_limit, + int flags, int *addr_len) +{ + size_t used; + long timeo; + int ret; + + ret = used = 0; + + lock_sock(sk); + + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + while (recv_limit) { + ret = io_zc_rx_tcp_read(sk); + if (ret < 0) + break; + if (!ret) { + if (used) + break; + if (sock_flag(sk, SOCK_DONE)) + break; + if (sk->sk_err) { + ret = sock_error(sk); + break; + } + if (sk->sk_shutdown & RCV_SHUTDOWN) + break; + if (sk->sk_state == TCP_CLOSE) { + ret = -ENOTCONN; + break; + } + if (!timeo) { + ret = -EAGAIN; + break; + } + if (!skb_queue_empty(&sk->sk_receive_queue)) + break; + sk_wait_data(sk, &timeo, NULL); + if (signal_pending(current)) { + ret = sock_intr_errno(timeo); + break; + } + continue; + } + recv_limit -= ret; + used += ret; + + if (!timeo) + break; + release_sock(sk); + lock_sock(sk); + + if (sk->sk_err || sk->sk_state == TCP_CLOSE || + (sk->sk_shutdown & RCV_SHUTDOWN) || + signal_pending(current)) + break; + } + + release_sock(sk); + + /* TODO: handle timestamping */ + + if (used) + return used; + + return ret; +} + +int io_zc_rx_recv(struct socket *sock, unsigned int limit, unsigned int flags) +{ + struct sock *sk = sock->sk; + const struct proto *prot; + int addr_len = 0; + int ret; + + if (flags & MSG_ERRQUEUE) + return -EOPNOTSUPP; + + prot = READ_ONCE(sk->sk_prot); + if (prot->recvmsg != tcp_recvmsg) + return -EPROTONOSUPPORT; + + sock_rps_record_flow(sk); + + ret = io_zc_rx_tcp_recvmsg(sk, limit, flags, &addr_len); + + return ret; +} diff --git a/io_uring/zc_rx.h b/io_uring/zc_rx.h index ee7e36295f91..63ff81a7900a 100644 --- a/io_uring/zc_rx.h +++ b/io_uring/zc_rx.h @@ -35,4 +35,8 @@ int io_register_zc_rx_ifq(struct io_ring_ctx *ctx, int io_unregister_zc_rx_ifq(struct io_ring_ctx *ctx); int io_zc_rx_pool_create(struct io_zc_rx_ifq *ifq, u16 id); +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags); +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); +int io_zc_rx_recv(struct socket *sock, unsigned int limit, unsigned int flags); + #endif