From patchwork Wed Feb 12 18:57:51 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13972310
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
 "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern,
 Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela
Subject: [PATCH net-next v13 01/11] io_uring/zcrx: add interface queue and refill queue
Date: Wed, 12 Feb 2025 10:57:51 -0800
Message-ID: <20250212185859.3509616-2-dw@davidwei.uk>
In-Reply-To: <20250212185859.3509616-1-dw@davidwei.uk>
References: <20250212185859.3509616-1-dw@davidwei.uk>

Add a new object called an interface queue (ifq) that represents a net
rx queue that has been configured for zero copy. Each ifq is registered
using a new registration opcode IORING_REGISTER_ZCRX_IFQ.

The refill queue is allocated by the kernel and mapped by userspace
using a new offset IORING_OFF_RQ_RING, in a similar fashion to the main
SQ/CQ. It is used by userspace to return buffers that it is done with,
which the netdev will then reuse.

The main CQ ring is used to notify userspace of received data by using
the upper 16 bytes of a big CQE as a new struct io_uring_zcrx_cqe. Each
entry contains the offset and len of the data.

For now, each io_uring instance only has a single ifq.
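As a point of reference, a minimal userspace sketch of driving this
registration, assuming the uapi above plus the pre-existing struct
io_uring_region_desc. The helper name, sizing, and error handling are
illustrative rather than part of any library, and it assumes the
kernel-allocated region variant, where the kernel reports the ring's
mmap offset back through rd.mmap_offset:

	#include <linux/io_uring.h>
	#include <stdint.h>
	#include <sys/mman.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	struct zcrx_refill_ring {
		__u32 *head;			/* written by the kernel */
		__u32 *tail;			/* written by userspace */
		struct io_uring_zcrx_rqe *rqes;
		__u32 mask;
	};

	static int register_zcrx(int ring_fd, __u32 if_idx, __u32 if_rxq,
				 void *area_reg, /* descriptor added later in the series */
				 struct zcrx_refill_ring *out)
	{
		__u32 entries = 4096;		/* <= IO_RQ_MAX_ENTRIES (32768) */
		struct io_uring_region_desc rd = {
			/* generous: one page for the header plus the rqe array */
			.size = (4096 + entries * sizeof(struct io_uring_zcrx_rqe) +
				 4095) & ~4095ULL,
		};
		struct io_uring_zcrx_ifq_reg reg = {
			.if_idx = if_idx,
			.if_rxq = if_rxq,
			.rq_entries = entries,
			.area_ptr = (__u64)(uintptr_t)area_reg,
			.region_ptr = (__u64)(uintptr_t)&rd,
		};
		void *ring;

		if (syscall(__NR_io_uring_register, ring_fd,
			    IORING_REGISTER_ZCRX_IFQ, &reg, 1))
			return -1;

		/* the kernel wrote back the clamped/rounded rq_entries, the
		 * ring offsets, and the region's mmap offset */
		ring = mmap(NULL, rd.size, PROT_READ | PROT_WRITE,
			    MAP_SHARED | MAP_POPULATE, ring_fd,
			    (off_t)rd.mmap_offset);
		if (ring == MAP_FAILED)
			return -1;

		out->head = (__u32 *)((char *)ring + reg.offsets.head);
		out->tail = (__u32 *)((char *)ring + reg.offsets.tail);
		out->rqes = (struct io_uring_zcrx_rqe *)((char *)ring + reg.offsets.rqes);
		out->mask = reg.rq_entries - 1;	/* rq_entries is a power of two */
		return 0;
	}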
Reviewed-by: Jens Axboe
Signed-off-by: David Wei
---
 Kconfig                        |   2 +
 include/linux/io_uring_types.h |   6 ++
 include/uapi/linux/io_uring.h  |  43 +++++++++-
 io_uring/KConfig               |  10 +++
 io_uring/Makefile              |   1 +
 io_uring/io_uring.c            |   7 ++
 io_uring/memmap.h              |   1 +
 io_uring/register.c            |   7 ++
 io_uring/zcrx.c                | 149 +++++++++++++++++++++++++++++++++
 io_uring/zcrx.h                |  35 ++++++++
 10 files changed, 260 insertions(+), 1 deletion(-)
 create mode 100644 io_uring/KConfig
 create mode 100644 io_uring/zcrx.c
 create mode 100644 io_uring/zcrx.h

diff --git a/Kconfig b/Kconfig
index 745bc773f567..529ea7694ba9 100644
--- a/Kconfig
+++ b/Kconfig
@@ -30,3 +30,5 @@ source "lib/Kconfig"
 source "lib/Kconfig.debug"
 
 source "Documentation/Kconfig"
+
+source "io_uring/KConfig"
diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index e2fef264ff8b..7bf478727a31 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -40,6 +40,8 @@ enum io_uring_cmd_flags {
 	IO_URING_F_TASK_DEAD		= (1 << 13),
 };
 
+struct io_zcrx_ifq;
+
 struct io_wq_work_node {
 	struct io_wq_work_node *next;
 };
@@ -382,6 +384,8 @@ struct io_ring_ctx {
 	struct wait_queue_head		poll_wq;
 	struct io_restriction		restrictions;
 
+	struct io_zcrx_ifq		*ifq;
+
 	u32				pers_next;
 	struct xarray			personalities;
 
@@ -434,6 +438,8 @@ struct io_ring_ctx {
 	struct io_mapped_region		ring_region;
 	/* used for optimised request parameter and wait argument passing  */
 	struct io_mapped_region		param_region;
+	/* just one zcrx per ring for now, will move to io_zcrx_ifq eventually */
+	struct io_mapped_region		zcrx_region;
 };
 
 struct io_tw_state {
diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index e11c82638527..6a1632d0fba1 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -639,7 +639,8 @@ enum io_uring_register_op {
 	/* send MSG_RING without having a ring */
 	IORING_REGISTER_SEND_MSG_RING		= 31,
 
-	/* 32 reserved for zc rx */
+	/* register a netdev hw rx queue for zerocopy */
+	IORING_REGISTER_ZCRX_IFQ		= 32,
 
 	/* resize CQ ring */
 	IORING_REGISTER_RESIZE_RINGS		= 33,
@@ -956,6 +957,46 @@ enum io_uring_socket_op {
 	SOCKET_URING_OP_SETSOCKOPT,
 };
 
+/* Zero copy receive refill queue entry */
+struct io_uring_zcrx_rqe {
+	__u64	off;
+	__u32	len;
+	__u32	__pad;
+};
+
+struct io_uring_zcrx_cqe {
+	__u64	off;
+	__u64	__pad;
+};
+
+/* The bit from which area id is encoded into offsets */
+#define IORING_ZCRX_AREA_SHIFT	48
+#define IORING_ZCRX_AREA_MASK	(~(((__u64)1 << IORING_ZCRX_AREA_SHIFT) - 1))
+
+struct io_uring_zcrx_offsets {
+	__u32	head;
+	__u32	tail;
+	__u32	rqes;
+	__u32	__resv2;
+	__u64	__resv[2];
+};
+
+/*
+ * Argument for IORING_REGISTER_ZCRX_IFQ
+ */
+struct io_uring_zcrx_ifq_reg {
+	__u32	if_idx;
+	__u32	if_rxq;
+	__u32	rq_entries;
+	__u32	flags;
+
+	__u64	area_ptr; /* pointer to struct io_uring_zcrx_area_reg */
+	__u64	region_ptr; /* struct io_uring_region_desc * */
+
+	struct io_uring_zcrx_offsets offsets;
+	__u64	__resv[4];
+};
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/io_uring/KConfig b/io_uring/KConfig
new file mode 100644
index 000000000000..9e2a4beba1ef
--- /dev/null
+++ b/io_uring/KConfig
@@ -0,0 +1,10 @@
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# io_uring configuration
+#
+
+config IO_URING_ZCRX
+	def_bool y
+	depends on PAGE_POOL
+	depends on INET
+	depends on NET_RX_BUSY_POLL
diff --git a/io_uring/Makefile b/io_uring/Makefile
index d695b60dba4f..98e48339d84d 100644
--- a/io_uring/Makefile
+++ b/io_uring/Makefile
@@ -14,6 +14,7 @@ obj-$(CONFIG_IO_URING)		+= io_uring.o opdef.o kbuf.o \
 					rsrc.o notif.o \
 					epoll.o statx.o timeout.o fdinfo.o \
 					cancel.o waitid.o register.o \
 					truncate.o memmap.o alloc_cache.o
+obj-$(CONFIG_IO_URING_ZCRX)	+= zcrx.o
 obj-$(CONFIG_IO_WQ)		+= io-wq.o
 obj-$(CONFIG_FUTEX)		+= futex.o
 obj-$(CONFIG_NET_RX_BUSY_POLL) += napi.o
diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index ec98a0ec6f34..7c34b5459e2d 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -97,6 +97,7 @@
 #include "uring_cmd.h"
 #include "msg_ring.h"
 #include "memmap.h"
+#include "zcrx.h"
 
 #include "timeout.h"
 #include "poll.h"
@@ -2700,6 +2701,7 @@ static __cold void io_ring_ctx_free(struct io_ring_ctx *ctx)
 	mutex_lock(&ctx->uring_lock);
 	io_sqe_buffers_unregister(ctx);
 	io_sqe_files_unregister(ctx);
+	io_unregister_zcrx_ifqs(ctx);
 	io_cqring_overflow_kill(ctx);
 	io_eventfd_unregister(ctx);
 	io_free_alloc_caches(ctx);
@@ -2859,6 +2861,11 @@ static __cold void io_ring_exit_work(struct work_struct *work)
 			io_cqring_overflow_kill(ctx);
 			mutex_unlock(&ctx->uring_lock);
 		}
+		if (ctx->ifq) {
+			mutex_lock(&ctx->uring_lock);
+			io_shutdown_zcrx_ifqs(ctx);
+			mutex_unlock(&ctx->uring_lock);
+		}
 
 		if (ctx->flags & IORING_SETUP_DEFER_TASKRUN)
 			io_move_task_work_from_local(ctx);
diff --git a/io_uring/memmap.h b/io_uring/memmap.h
index c898dcba2b4e..dad0aa5b1b45 100644
--- a/io_uring/memmap.h
+++ b/io_uring/memmap.h
@@ -2,6 +2,7 @@
 #define IO_URING_MEMMAP_H
 
 #define IORING_MAP_OFF_PARAM_REGION	0x20000000ULL
+#define IORING_MAP_OFF_ZCRX_REGION	0x30000000ULL
 
 struct page **io_pin_pages(unsigned long ubuf, unsigned long len, int *npages);
diff --git a/io_uring/register.c b/io_uring/register.c
index 9a4d2fbce4ae..cc23a4c205cd 100644
--- a/io_uring/register.c
+++ b/io_uring/register.c
@@ -30,6 +30,7 @@
 #include "eventfd.h"
 #include "msg_ring.h"
 #include "memmap.h"
+#include "zcrx.h"
 
 #define IORING_MAX_RESTRICTIONS  (IORING_RESTRICTION_LAST + \
 				 IORING_REGISTER_LAST + IORING_OP_LAST)
@@ -813,6 +814,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode,
 			break;
 		ret = io_register_clone_buffers(ctx, arg);
 		break;
+	case IORING_REGISTER_ZCRX_IFQ:
+		ret = -EINVAL;
+		if (!arg || nr_args != 1)
+			break;
+		ret = io_register_zcrx_ifq(ctx, arg);
+		break;
 	case IORING_REGISTER_RESIZE_RINGS:
 		ret = -EINVAL;
 		if (!arg || nr_args != 1)
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
new file mode 100644
index 000000000000..f3ace7e8264d
--- /dev/null
+++ b/io_uring/zcrx.c
@@ -0,0 +1,149 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/mm.h>
+#include <linux/io_uring.h>
+
+#include <uapi/linux/io_uring.h>
+
+#include "io_uring.h"
+#include "kbuf.h"
+#include "memmap.h"
+#include "zcrx.h"
+
+#define IO_RQ_MAX_ENTRIES		32768
+
+static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
+				 struct io_uring_zcrx_ifq_reg *reg,
+				 struct io_uring_region_desc *rd)
+{
+	size_t off, size;
+	void *ptr;
+	int ret;
+
+	off = sizeof(struct io_uring);
+	size = off + sizeof(struct io_uring_zcrx_rqe) * reg->rq_entries;
+	if (size > rd->size)
+		return -EINVAL;
+
+	ret = io_create_region_mmap_safe(ifq->ctx, &ifq->ctx->zcrx_region, rd,
+					 IORING_MAP_OFF_ZCRX_REGION);
+	if (ret < 0)
+		return ret;
+
+	ptr = io_region_get_ptr(&ifq->ctx->zcrx_region);
+	ifq->rq_ring = (struct io_uring *)ptr;
+	ifq->rqes = (struct io_uring_zcrx_rqe *)(ptr + off);
+	return 0;
+}
+
+static void io_free_rbuf_ring(struct io_zcrx_ifq *ifq)
+{
+	io_free_region(ifq->ctx, &ifq->ctx->zcrx_region);
+	ifq->rq_ring = NULL;
+	ifq->rqes = NULL;
+}
+
+static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
+{
+	struct io_zcrx_ifq *ifq;
+
+	ifq = kzalloc(sizeof(*ifq), GFP_KERNEL);
+	if (!ifq)
+		return NULL;
+
+	ifq->if_rxq = -1;
+	ifq->ctx = ctx;
+	return ifq;
+}
+
+static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
+{
+	io_free_rbuf_ring(ifq);
+	kfree(ifq);
+}
+
+int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+			 struct io_uring_zcrx_ifq_reg __user *arg)
+{
+	struct io_uring_zcrx_ifq_reg reg;
+	struct io_uring_region_desc rd;
+	struct io_zcrx_ifq *ifq;
+	int ret;
+
+	/*
+	 * 1. Interface queue allocation.
+	 * 2. It can observe data destined for sockets of other tasks.
+	 */
+	if (!capable(CAP_NET_ADMIN))
+		return -EPERM;
+
+	/* mandatory io_uring features for zc rx */
+	if (!(ctx->flags & IORING_SETUP_DEFER_TASKRUN &&
+	      ctx->flags & IORING_SETUP_CQE32))
+		return -EINVAL;
+	if (ctx->ifq)
+		return -EBUSY;
+	if (copy_from_user(&reg, arg, sizeof(reg)))
+		return -EFAULT;
+	if (copy_from_user(&rd, u64_to_user_ptr(reg.region_ptr), sizeof(rd)))
+		return -EFAULT;
+	if (memchr_inv(&reg.__resv, 0, sizeof(reg.__resv)))
+		return -EINVAL;
+	if (reg.if_rxq == -1 || !reg.rq_entries || reg.flags)
+		return -EINVAL;
+	if (reg.rq_entries > IO_RQ_MAX_ENTRIES) {
+		if (!(ctx->flags & IORING_SETUP_CLAMP))
+			return -EINVAL;
+		reg.rq_entries = IO_RQ_MAX_ENTRIES;
+	}
+	reg.rq_entries = roundup_pow_of_two(reg.rq_entries);
+
+	if (!reg.area_ptr)
+		return -EFAULT;
+
+	ifq = io_zcrx_ifq_alloc(ctx);
+	if (!ifq)
+		return -ENOMEM;
+
+	ret = io_allocate_rbuf_ring(ifq, &reg, &rd);
+	if (ret)
+		goto err;
+
+	ifq->rq_entries = reg.rq_entries;
+	ifq->if_rxq = reg.if_rxq;
+
+	reg.offsets.rqes = sizeof(struct io_uring);
+	reg.offsets.head = offsetof(struct io_uring, head);
+	reg.offsets.tail = offsetof(struct io_uring, tail);
+
+	if (copy_to_user(arg, &reg, sizeof(reg)) ||
+	    copy_to_user(u64_to_user_ptr(reg.region_ptr), &rd, sizeof(rd))) {
+		ret = -EFAULT;
+		goto err;
+	}
+
+	ctx->ifq = ifq;
+	return 0;
+err:
+	io_zcrx_ifq_free(ifq);
+	return ret;
+}
+
+void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+	struct io_zcrx_ifq *ifq = ctx->ifq;
+
+	lockdep_assert_held(&ctx->uring_lock);
+
+	if (!ifq)
+		return;
+
+	ctx->ifq = NULL;
+	io_zcrx_ifq_free(ifq);
+}
+
+void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+	lockdep_assert_held(&ctx->uring_lock);
+}
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
new file mode 100644
index 000000000000..58e4ab6c6083
--- /dev/null
+++ b/io_uring/zcrx.h
@@ -0,0 +1,35 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_ZC_RX_H
+#define IOU_ZC_RX_H
+
+#include <linux/io_uring_types.h>
+
+struct io_zcrx_ifq {
+	struct io_ring_ctx		*ctx;
+	struct io_uring			*rq_ring;
+	struct io_uring_zcrx_rqe	*rqes;
+	u32				rq_entries;
+
+	u32				if_rxq;
+};
+
+#if defined(CONFIG_IO_URING_ZCRX)
+int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+			 struct io_uring_zcrx_ifq_reg __user *arg);
+void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx);
+void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx);
+#else
+static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
+					struct io_uring_zcrx_ifq_reg __user *arg)
+{
+	return -EOPNOTSUPP;
+}
+static inline void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+static inline void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
+{
+}
+#endif
+
+#endif
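Note the hard prerequisites enforced in io_register_zcrx_ifq() above:
the ring must have been created with IORING_SETUP_DEFER_TASKRUN and
IORING_SETUP_CQE32, and IORING_SETUP_CLAMP opts into clamping an
oversized rq_entries instead of failing with -EINVAL. A compatible ring
setup might look like the following sketch (DEFER_TASKRUN itself
requires IORING_SETUP_SINGLE_ISSUER); illustrative only:

	struct io_uring_params p = {
		.flags = IORING_SETUP_SINGLE_ISSUER |
			 IORING_SETUP_DEFER_TASKRUN |
			 IORING_SETUP_CQE32 |	/* zcrx completions are big CQEs */
			 IORING_SETUP_CLAMP,
	};
	int ring_fd = syscall(__NR_io_uring_setup, 64, &p);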
From patchwork Wed Feb 12 18:57:52 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13972312
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
 "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern,
 Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela
Subject: [PATCH net-next v13 02/11] io_uring/zcrx: add io_zcrx_area
Date: Wed, 12 Feb 2025 10:57:52 -0800
Message-ID: <20250212185859.3509616-3-dw@davidwei.uk>
In-Reply-To: <20250212185859.3509616-1-dw@davidwei.uk>
References: <20250212185859.3509616-1-dw@davidwei.uk>

Add io_zcrx_area that represents a region of userspace memory that is
used for zero copy. During ifq registration, userspace passes in the
uaddr and len of userspace memory, which is then pinned by the kernel.
Each net_iov is mapped to one of these pages.

The freelist is a spinlock-protected list that keeps track of all the
net_iovs/pages that aren't used.

For now, there is only one area per ifq and area registration happens
implicitly as part of ifq registration. There is no API for
adding/removing areas yet. The struct for area registration is there
for future extensibility once we support multiple areas and TCP devmem.
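For illustration, a sketch of how userspace might fill the new struct:
the backing buffer has to be page aligned with a page-multiple length,
as io_zcrx_create_area() below enforces, and the reserved fields must
stay zero. The names and sizes here are arbitrary:

	size_t area_len = 64UL * 1024 * 1024;	/* 64MB of rx buffer space */
	void *buf = mmap(NULL, area_len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	struct io_uring_zcrx_area_reg area = {
		.addr = (__u64)(uintptr_t)buf,
		.len = area_len,
		/* flags, rq_area_token and __resv* must be zero; the
		 * kernel returns the area's token in rq_area_token */
	};

The address of this struct is what goes into
io_uring_zcrx_ifq_reg::area_ptr before IORING_REGISTER_ZCRX_IFQ.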
Reviewed-by: Jens Axboe
Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
---
 include/uapi/linux/io_uring.h |  9 ++++
 io_uring/rsrc.c               |  2 +-
 io_uring/rsrc.h               |  1 +
 io_uring/zcrx.c               | 89 ++++++++++++++++++++++++++++++++++-
 io_uring/zcrx.h               | 16 +++++++
 5 files changed, 114 insertions(+), 3 deletions(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 6a1632d0fba1..44844707d327 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -981,6 +981,15 @@ struct io_uring_zcrx_offsets {
 	__u64	__resv[2];
 };
 
+struct io_uring_zcrx_area_reg {
+	__u64	addr;
+	__u64	len;
+	__u64	rq_area_token;
+	__u32	flags;
+	__u32	__resv1;
+	__u64	__resv2[2];
+};
+
 /*
  * Argument for IORING_REGISTER_ZCRX_IFQ
  */
diff --git a/io_uring/rsrc.c b/io_uring/rsrc.c
index af39b69eb4fd..20b884c84e55 100644
--- a/io_uring/rsrc.c
+++ b/io_uring/rsrc.c
@@ -77,7 +77,7 @@ static int io_account_mem(struct io_ring_ctx *ctx, unsigned long nr_pages)
 	return 0;
 }
 
-static int io_buffer_validate(struct iovec *iov)
+int io_buffer_validate(struct iovec *iov)
 {
 	unsigned long tmp, acct_len = iov->iov_len + (PAGE_SIZE - 1);
 
diff --git a/io_uring/rsrc.h b/io_uring/rsrc.h
index 190f7ee45de9..0f8c20246a74 100644
--- a/io_uring/rsrc.h
+++ b/io_uring/rsrc.h
@@ -68,6 +68,7 @@ int io_register_rsrc_update(struct io_ring_ctx *ctx, void __user *arg,
 			    unsigned size, unsigned type);
 int io_register_rsrc(struct io_ring_ctx *ctx, void __user *arg,
 		     unsigned int size, unsigned int type);
+int io_buffer_validate(struct iovec *iov);
 
 bool io_check_coalesce_buffer(struct page **page_array, int nr_pages,
 			      struct io_imu_folio_data *data);
diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index f3ace7e8264d..04883a3ae80c 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -10,6 +10,7 @@
 #include "kbuf.h"
 #include "memmap.h"
 #include "zcrx.h"
+#include "rsrc.h"
 
 #define IO_RQ_MAX_ENTRIES		32768
 
@@ -44,6 +45,79 @@ static void io_free_rbuf_ring(struct io_zcrx_ifq *ifq)
 	ifq->rqes = NULL;
 }
 
+static void io_zcrx_free_area(struct io_zcrx_area *area)
+{
+	kvfree(area->freelist);
+	kvfree(area->nia.niovs);
+	if (area->pages) {
+		unpin_user_pages(area->pages, area->nia.num_niovs);
+		kvfree(area->pages);
+	}
+	kfree(area);
+}
+
+static int io_zcrx_create_area(struct io_zcrx_ifq *ifq,
+			       struct io_zcrx_area **res,
+			       struct io_uring_zcrx_area_reg *area_reg)
+{
+	struct io_zcrx_area *area;
+	int i, ret, nr_pages;
+	struct iovec iov;
+
+	if (area_reg->flags || area_reg->rq_area_token)
+		return -EINVAL;
+	if (area_reg->__resv1 || area_reg->__resv2[0] || area_reg->__resv2[1])
+		return -EINVAL;
+	if (area_reg->addr & ~PAGE_MASK || area_reg->len & ~PAGE_MASK)
+		return -EINVAL;
+
+	iov.iov_base = u64_to_user_ptr(area_reg->addr);
+	iov.iov_len = area_reg->len;
+	ret = io_buffer_validate(&iov);
+	if (ret)
+		return ret;
+
+	ret = -ENOMEM;
+	area = kzalloc(sizeof(*area), GFP_KERNEL);
+	if (!area)
+		goto err;
+
+	area->pages = io_pin_pages((unsigned long)area_reg->addr, area_reg->len,
+				   &nr_pages);
+	if (IS_ERR(area->pages)) {
+		ret = PTR_ERR(area->pages);
+		area->pages = NULL;
+		goto err;
+	}
+	area->nia.num_niovs = nr_pages;
+
+	area->nia.niovs = kvmalloc_array(nr_pages, sizeof(area->nia.niovs[0]),
+					 GFP_KERNEL | __GFP_ZERO);
+	if (!area->nia.niovs)
+		goto err;
+
+	area->freelist = kvmalloc_array(nr_pages, sizeof(area->freelist[0]),
+					GFP_KERNEL | __GFP_ZERO);
+	if (!area->freelist)
+		goto err;
+
+	for (i = 0; i < nr_pages; i++)
+		area->freelist[i] = i;
+
+	area->free_count = nr_pages;
+	area->ifq = ifq;
+	/* we're only supporting one area per ifq for now */
+	area->area_id = 0;
+	area_reg->rq_area_token = (u64)area->area_id << IORING_ZCRX_AREA_SHIFT;
+	spin_lock_init(&area->freelist_lock);
+	*res = area;
+	return 0;
+err:
+	if (area)
+		io_zcrx_free_area(area);
+	return ret;
+}
+
 static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
 {
 	struct io_zcrx_ifq *ifq;
@@ -59,6 +133,9 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
 
 static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
 {
+	if (ifq->area)
+		io_zcrx_free_area(ifq->area);
+
 	io_free_rbuf_ring(ifq);
 	kfree(ifq);
 }
@@ -66,6 +143,7 @@ static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
 int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 			 struct io_uring_zcrx_ifq_reg __user *arg)
 {
+	struct io_uring_zcrx_area_reg area;
 	struct io_uring_zcrx_ifq_reg reg;
 	struct io_uring_region_desc rd;
 	struct io_zcrx_ifq *ifq;
@@ -99,7 +177,7 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 	}
 	reg.rq_entries = roundup_pow_of_two(reg.rq_entries);
 
-	if (!reg.area_ptr)
+	if (copy_from_user(&area, u64_to_user_ptr(reg.area_ptr), sizeof(area)))
 		return -EFAULT;
 
 	ifq = io_zcrx_ifq_alloc(ctx);
@@ -110,6 +188,10 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 	if (ret)
 		goto err;
 
+	ret = io_zcrx_create_area(ifq, &ifq->area, &area);
+	if (ret)
+		goto err;
+
 	ifq->rq_entries = reg.rq_entries;
 	ifq->if_rxq = reg.if_rxq;
 
@@ -122,7 +204,10 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 		ret = -EFAULT;
 		goto err;
 	}
-
+	if (copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) {
+		ret = -EFAULT;
+		goto err;
+	}
 	ctx->ifq = ifq;
 	return 0;
 err:
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 58e4ab6c6083..53fd94b65b38 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -3,9 +3,25 @@
 #define IOU_ZC_RX_H
 
 #include <linux/io_uring_types.h>
+#include <net/netmem.h>
+
+struct io_zcrx_area {
+	struct net_iov_area	nia;
+	struct io_zcrx_ifq	*ifq;
+
+	u16			area_id;
+	struct page		**pages;
+
+	/* freelist */
+	spinlock_t		freelist_lock ____cacheline_aligned_in_smp;
+	u32			free_count;
+	u32			*freelist;
+};
 
 struct io_zcrx_ifq {
 	struct io_ring_ctx		*ctx;
+	struct io_zcrx_area		*area;
+
 	struct io_uring			*rq_ring;
 	struct io_uring_zcrx_rqe	*rqes;
 	u32				rq_entries;
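The rq_area_token returned here establishes the offset encoding used
throughout the series: the area id sits in the top bits, the byte
offset into the area in the low IORING_ZCRX_AREA_SHIFT bits. A few
illustrative helpers (not uapi) make the arithmetic explicit:

	static inline __u64 zcrx_off_to_area_id(__u64 off)
	{
		return off >> IORING_ZCRX_AREA_SHIFT;
	}

	static inline __u64 zcrx_off_to_byte_off(__u64 off)
	{
		/* ~IORING_ZCRX_AREA_MASK keeps the low 48 bits */
		return off & ~IORING_ZCRX_AREA_MASK;
	}

	static inline __u64 zcrx_make_off(__u64 rq_area_token, __u64 byte_off)
	{
		return rq_area_token | byte_off;
	}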
From patchwork Wed Feb 12 18:57:53 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13972313
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
 "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern,
 Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela
Subject: [PATCH net-next v13 03/11] io_uring/zcrx: grab a net device
Date: Wed, 12 Feb 2025 10:57:53 -0800
Message-ID: <20250212185859.3509616-4-dw@davidwei.uk>
In-Reply-To: <20250212185859.3509616-1-dw@davidwei.uk>
References: <20250212185859.3509616-1-dw@davidwei.uk>

From: Pavel Begunkov

Zerocopy receive needs a net device to bind to its rx queue and DMA map
buffers. In preparation for the following patches, resolve a net device
from the if_idx parameter, with no functional changes otherwise.

Reviewed-by: Jens Axboe
Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
---
 io_uring/zcrx.c | 28 ++++++++++++++++++++++++++++
 io_uring/zcrx.h |  5 +++++
 2 files changed, 33 insertions(+)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 04883a3ae80c..435cd634f91c 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -3,6 +3,8 @@
 #include <linux/errno.h>
 #include <linux/mm.h>
 #include <linux/io_uring.h>
+#include <linux/netdevice.h>
+#include <linux/rtnetlink.h>
 
 #include <uapi/linux/io_uring.h>
 
@@ -128,13 +130,28 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
 
 	ifq->if_rxq = -1;
 	ifq->ctx = ctx;
+	spin_lock_init(&ifq->lock);
 	return ifq;
 }
 
+static void io_zcrx_drop_netdev(struct io_zcrx_ifq *ifq)
+{
+	spin_lock(&ifq->lock);
+	if (ifq->netdev) {
+		netdev_put(ifq->netdev, &ifq->netdev_tracker);
+		ifq->netdev = NULL;
+	}
+	spin_unlock(&ifq->lock);
+}
+
 static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq)
 {
+	io_zcrx_drop_netdev(ifq);
+
 	if (ifq->area)
 		io_zcrx_free_area(ifq->area);
 
+	if (ifq->dev)
+		put_device(ifq->dev);
+
 	io_free_rbuf_ring(ifq);
 	kfree(ifq);
@@ -195,6 +212,17 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx,
 	ifq->rq_entries = reg.rq_entries;
 	ifq->if_rxq = reg.if_rxq;
 
+	ret = -ENODEV;
+	ifq->netdev = netdev_get_by_index(current->nsproxy->net_ns, reg.if_idx,
+					  &ifq->netdev_tracker, GFP_KERNEL);
+	if (!ifq->netdev)
+		goto err;
+
+	ifq->dev = ifq->netdev->dev.parent;
+	if (!ifq->dev)
+		return -EOPNOTSUPP;
+	get_device(ifq->dev);
+
 	reg.offsets.rqes = sizeof(struct io_uring);
 	reg.offsets.head = offsetof(struct io_uring, head);
 	reg.offsets.tail = offsetof(struct io_uring, tail);
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 53fd94b65b38..595bca0001d2 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -4,6 +4,7 @@
 
 #include <linux/io_uring_types.h>
 #include <net/netmem.h>
+#include <net/net_trackers.h>
 
 struct io_zcrx_area {
 	struct net_iov_area	nia;
@@ -27,6 +28,10 @@ struct io_zcrx_ifq {
 	u32				rq_entries;
 
 	u32				if_rxq;
+	struct device			*dev;
+	struct net_device		*netdev;
+	netdevice_tracker		netdev_tracker;
+	spinlock_t			lock;
 };
 
 #if defined(CONFIG_IO_URING_ZCRX)
From patchwork Wed Feb 12 18:57:54 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13972314
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
 "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern,
 Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela
Subject: [PATCH net-next v13 04/11] io_uring/zcrx: implement zerocopy receive pp memory provider
Date: Wed, 12 Feb 2025 10:57:54 -0800
Message-ID: <20250212185859.3509616-5-dw@davidwei.uk>
In-Reply-To: <20250212185859.3509616-1-dw@davidwei.uk>
References: <20250212185859.3509616-1-dw@davidwei.uk>

From: Pavel Begunkov

Implement a page pool memory provider for io_uring to receive in a zero
copy fashion. For that, the provider allocates user pages wrapped into
struct net_iovs, which are stored in a previously registered struct
net_iov_area.

Unlike traditional receive, which frees pages and returns them to the
page pool right after data has been copied to the user, e.g. inside
recv(2), we extend the lifetime until userspace confirms that it's done
processing the data. That's done by taking a net_iov reference. When
the user is done with the buffer, it must be returned to the kernel by
posting an entry into the refill ring, which is usually polled off the
io_uring memory provider callback in the page pool's netmem allocation
path.

There is also a separate set of per net_iov "user" references
accounting whether a buffer is currently given to the user (including
possible fragmentation).
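The userspace half of that refill protocol can be sketched as follows,
against the zcrx_refill_ring mapping from patch 1's example. The
acquire load of head and the release store of tail mirror the kernel's
smp_load_acquire()/smp_store_release() pairing in the refill path
below; the helper name is illustrative:

	static int zcrx_refill(struct zcrx_refill_ring *rq, __u64 off, __u32 len)
	{
		__u32 tail = *rq->tail;	/* only userspace writes the tail */
		__u32 head = __atomic_load_n(rq->head, __ATOMIC_ACQUIRE);
		struct io_uring_zcrx_rqe *rqe;

		if (tail - head >= rq->mask + 1)
			return -1;	/* refill ring full, retry later */

		rqe = &rq->rqes[tail & rq->mask];
		rqe->off = off;		/* area token | byte offset */
		rqe->len = len;
		rqe->__pad = 0;		/* the kernel skips entries with
					 * non-zero padding */

		__atomic_store_n(rq->tail, tail + 1, __ATOMIC_RELEASE);
		return 0;
	}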
Reviewed-by: Jens Axboe
Signed-off-by: Pavel Begunkov
Signed-off-by: David Wei
Reviewed-by: Mina Almasry
---
 io_uring/zcrx.c | 272 ++++++++++++++++++++++++++++++++++++++++++++++++
 io_uring/zcrx.h |   3 +
 2 files changed, 275 insertions(+)

diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c
index 435cd634f91c..9d5c0479a285 100644
--- a/io_uring/zcrx.c
+++ b/io_uring/zcrx.c
@@ -2,10 +2,16 @@
 #include <linux/kernel.h>
 #include <linux/errno.h>
 #include <linux/mm.h>
+#include <linux/nospec.h>
 #include <linux/io_uring.h>
 #include <linux/netdevice.h>
 #include <linux/rtnetlink.h>
 
+#include <net/page_pool/helpers.h>
+#include <net/page_pool/memory_provider.h>
+#include <net/netdev_rx_queue.h>
+#include <net/netlink.h>
+
 #include <uapi/linux/io_uring.h>
 
 #include "io_uring.h"
@@ -16,6 +22,33 @@
 
 #define IO_RQ_MAX_ENTRIES		32768
 
+__maybe_unused
+static const struct memory_provider_ops io_uring_pp_zc_ops;
+
+static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov)
+{
+	struct net_iov_area *owner = net_iov_owner(niov);
+
+	return container_of(owner, struct io_zcrx_area, nia);
+}
+
+static inline atomic_t *io_get_user_counter(struct net_iov *niov)
+{
+	struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+	return &area->user_refs[net_iov_idx(niov)];
+}
+
+static bool io_zcrx_put_niov_uref(struct net_iov *niov)
+{
+	atomic_t *uref = io_get_user_counter(niov);
+
+	if (unlikely(!atomic_read(uref)))
+		return false;
+	atomic_dec(uref);
+	return true;
+}
+
 static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq,
 				 struct io_uring_zcrx_ifq_reg *reg,
 				 struct io_uring_region_desc *rd)
@@ -51,6 +84,7 @@ static void io_zcrx_free_area(struct io_zcrx_area *area)
 {
 	kvfree(area->freelist);
 	kvfree(area->nia.niovs);
+	kvfree(area->user_refs);
 	if (area->pages) {
 		unpin_user_pages(area->pages, area->nia.num_niovs);
 		kvfree(area->pages);
@@ -106,6 +140,19 @@ static int io_zcrx_create_area(struct io_zcrx_ifq *ifq,
 	for (i = 0; i < nr_pages; i++)
 		area->freelist[i] = i;
 
+	area->user_refs = kvmalloc_array(nr_pages, sizeof(area->user_refs[0]),
+					 GFP_KERNEL | __GFP_ZERO);
+	if (!area->user_refs)
+		goto err;
+
+	for (i = 0; i < nr_pages; i++) {
+		struct net_iov *niov = &area->nia.niovs[i];
+
+		niov->owner = &area->nia;
+		area->freelist[i] = i;
+		atomic_set(&area->user_refs[i], 0);
+	}
+
 	area->free_count = nr_pages;
 	area->ifq = ifq;
 	/* we're only supporting one area per ifq for now */
@@ -131,6 +178,7 @@ static struct io_zcrx_ifq *io_zcrx_ifq_alloc(struct io_ring_ctx *ctx)
 	ifq->if_rxq = -1;
 	ifq->ctx = ctx;
 	spin_lock_init(&ifq->lock);
+	spin_lock_init(&ifq->rq_lock);
 	return ifq;
 }
 
@@ -251,12 +299,236 @@ void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx)
 	if (!ifq)
 		return;
 
+	if (WARN_ON_ONCE(ifq->area &&
+			 ifq->area->free_count != ifq->area->nia.num_niovs))
+
 	ctx->ifq = NULL;
 	io_zcrx_ifq_free(ifq);
 }
 
+static struct net_iov *__io_zcrx_get_free_niov(struct io_zcrx_area *area)
+{
+	unsigned niov_idx;
+
+	lockdep_assert_held(&area->freelist_lock);
+
+	niov_idx = area->freelist[--area->free_count];
+	return &area->nia.niovs[niov_idx];
+}
+
+static void io_zcrx_return_niov_freelist(struct net_iov *niov)
+{
+	struct io_zcrx_area *area = io_zcrx_iov_to_area(niov);
+
+	spin_lock_bh(&area->freelist_lock);
+	area->freelist[area->free_count++] = net_iov_idx(niov);
+	spin_unlock_bh(&area->freelist_lock);
+}
+
+static void io_zcrx_return_niov(struct net_iov *niov)
+{
+	netmem_ref netmem = net_iov_to_netmem(niov);
+
+	page_pool_put_unrefed_netmem(niov->pp, netmem, -1, false);
+}
+
+static void io_zcrx_scrub(struct io_zcrx_ifq *ifq)
+{
+	struct io_zcrx_area *area = ifq->area;
+	int i;
+
+	if (!area)
+		return;
+
+	/* Reclaim back all buffers given to the user space. */
+	for (i = 0; i < area->nia.num_niovs; i++) {
+		struct net_iov *niov = &area->nia.niovs[i];
+		int nr;
+
+		if (!atomic_read(io_get_user_counter(niov)))
+			continue;
+		nr = atomic_xchg(io_get_user_counter(niov), 0);
+		if (nr && !page_pool_unref_netmem(net_iov_to_netmem(niov), nr))
+			io_zcrx_return_niov(niov);
+	}
+}
+
 void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx)
 {
 	lockdep_assert_held(&ctx->uring_lock);
+
+	if (ctx->ifq)
+		io_zcrx_scrub(ctx->ifq);
+}
+
+static inline u32 io_zcrx_rqring_entries(struct io_zcrx_ifq *ifq)
+{
+	u32 entries;
+
+	entries = smp_load_acquire(&ifq->rq_ring->tail) - ifq->cached_rq_head;
+	return min(entries, ifq->rq_entries);
 }
+
+static struct io_uring_zcrx_rqe *io_zcrx_get_rqe(struct io_zcrx_ifq *ifq,
+						 unsigned mask)
+{
+	unsigned int idx = ifq->cached_rq_head++ & mask;
+
+	return &ifq->rqes[idx];
+}
+
+static void io_zcrx_ring_refill(struct page_pool *pp,
+				struct io_zcrx_ifq *ifq)
+{
+	unsigned int mask = ifq->rq_entries - 1;
+	unsigned int entries;
+	netmem_ref netmem;
+
+	spin_lock_bh(&ifq->rq_lock);
+
+	entries = io_zcrx_rqring_entries(ifq);
+	entries = min_t(unsigned, entries, PP_ALLOC_CACHE_REFILL - pp->alloc.count);
+	if (unlikely(!entries)) {
+		spin_unlock_bh(&ifq->rq_lock);
+		return;
+	}
+
+	do {
+		struct io_uring_zcrx_rqe *rqe = io_zcrx_get_rqe(ifq, mask);
+		struct io_zcrx_area *area;
+		struct net_iov *niov;
+		unsigned niov_idx, area_idx;
+
+		area_idx = rqe->off >> IORING_ZCRX_AREA_SHIFT;
+		niov_idx = (rqe->off & ~IORING_ZCRX_AREA_MASK) >> PAGE_SHIFT;
+
+		if (unlikely(rqe->__pad || area_idx))
+			continue;
+		area = ifq->area;
+
+		if (unlikely(niov_idx >= area->nia.num_niovs))
+			continue;
+		niov_idx = array_index_nospec(niov_idx, area->nia.num_niovs);
+
+		niov = &area->nia.niovs[niov_idx];
+		if (!io_zcrx_put_niov_uref(niov))
+			continue;
+
+		netmem = net_iov_to_netmem(niov);
+		if (page_pool_unref_netmem(netmem, 1) != 0)
+			continue;
+
+		if (unlikely(niov->pp != pp)) {
+			io_zcrx_return_niov(niov);
+			continue;
+		}
+
+		net_mp_netmem_place_in_cache(pp, netmem);
+	} while (--entries);
+
+	smp_store_release(&ifq->rq_ring->head, ifq->cached_rq_head);
+	spin_unlock_bh(&ifq->rq_lock);
+}
+
+static void io_zcrx_refill_slow(struct page_pool *pp, struct io_zcrx_ifq *ifq)
+{
+	struct io_zcrx_area *area = ifq->area;
+
+	spin_lock_bh(&area->freelist_lock);
+	while (area->free_count && pp->alloc.count < PP_ALLOC_CACHE_REFILL) {
+		struct net_iov *niov = __io_zcrx_get_free_niov(area);
+		netmem_ref netmem = net_iov_to_netmem(niov);
+
+		net_mp_niov_set_page_pool(pp, niov);
+		net_mp_netmem_place_in_cache(pp, netmem);
+	}
+	spin_unlock_bh(&area->freelist_lock);
+}
+
+static netmem_ref io_pp_zc_alloc_netmems(struct page_pool *pp, gfp_t gfp)
+{
+	struct io_zcrx_ifq *ifq = pp->mp_priv;
+
+	/* pp should already be ensuring that */
+	if (unlikely(pp->alloc.count))
+		goto out_return;
+
+	io_zcrx_ring_refill(pp, ifq);
+	if (likely(pp->alloc.count))
+		goto out_return;
+
+	io_zcrx_refill_slow(pp, ifq);
+	if (!pp->alloc.count)
+		return 0;
+out_return:
+	return pp->alloc.cache[--pp->alloc.count];
+}
+
+static bool io_pp_zc_release_netmem(struct page_pool *pp, netmem_ref netmem)
+{
+	struct net_iov *niov;
+
+	if (WARN_ON_ONCE(!netmem_is_net_iov(netmem)))
+		return false;
+
+	niov = netmem_to_net_iov(netmem);
+	net_mp_niov_clear_page_pool(niov);
+	io_zcrx_return_niov_freelist(niov);
+	return false;
+}
+
+static int io_pp_zc_init(struct page_pool *pp)
+{
+	struct io_zcrx_ifq *ifq = pp->mp_priv;
+
+	if (WARN_ON_ONCE(!ifq))
+		return -EINVAL;
+	if (pp->dma_map)
+		return -EOPNOTSUPP;
+	if (pp->p.order != 0)
+		return -EOPNOTSUPP;
+
+	percpu_ref_get(&ifq->ctx->refs);
+	return 0;
+}
+
+static void io_pp_zc_destroy(struct page_pool *pp)
+{
+	struct io_zcrx_ifq *ifq = pp->mp_priv;
+
+	percpu_ref_put(&ifq->ctx->refs);
+}
+
+static int io_pp_nl_fill(void *mp_priv, struct sk_buff *rsp,
+			 struct netdev_rx_queue *rxq)
+{
+	struct nlattr *nest;
+	int type;
+
+	type = rxq ? NETDEV_A_QUEUE_IO_URING : NETDEV_A_PAGE_POOL_IO_URING;
+	nest = nla_nest_start(rsp, type);
+	if (!nest)
+		return -EMSGSIZE;
+	nla_nest_end(rsp, nest);
+
+	return 0;
+}
+
+static void io_pp_uninstall(void *mp_priv, struct netdev_rx_queue *rxq)
+{
+	struct pp_memory_provider_params *p = &rxq->mp_params;
+	struct io_zcrx_ifq *ifq = mp_priv;
+
+	io_zcrx_drop_netdev(ifq);
+	p->mp_ops = NULL;
+	p->mp_priv = NULL;
+}
+
+static const struct memory_provider_ops io_uring_pp_zc_ops = {
+	.alloc_netmems		= io_pp_zc_alloc_netmems,
+	.release_netmem		= io_pp_zc_release_netmem,
+	.init			= io_pp_zc_init,
+	.destroy		= io_pp_zc_destroy,
+	.nl_fill		= io_pp_nl_fill,
+	.uninstall		= io_pp_uninstall,
+};
diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h
index 595bca0001d2..6c808240ac91 100644
--- a/io_uring/zcrx.h
+++ b/io_uring/zcrx.h
@@ -9,6 +9,7 @@
 struct io_zcrx_area {
 	struct net_iov_area	nia;
 	struct io_zcrx_ifq	*ifq;
+	atomic_t		*user_refs;
 
 	u16			area_id;
 	struct page		**pages;
@@ -26,6 +27,8 @@ struct io_zcrx_ifq {
 	struct io_uring			*rq_ring;
 	struct io_uring_zcrx_rqe	*rqes;
 	u32				rq_entries;
+	u32				cached_rq_head;
+	spinlock_t			rq_lock;
 
 	u32				if_rxq;
 	struct device			*dev;
From patchwork Wed Feb 12 18:57:55 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13972315
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
 "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern,
 Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela
Subject: [PATCH net-next v13 05/11] io_uring/zcrx: dma-map area for the device
Date: Wed, 12 Feb 2025 10:57:55 -0800
Message-ID: <20250212185859.3509616-6-dw@davidwei.uk>
In-Reply-To: <20250212185859.3509616-1-dw@davidwei.uk>
References: <20250212185859.3509616-1-dw@davidwei.uk>

From: Pavel Begunkov

Set up DMA mappings for the area into which we intend to receive data
later on. We know the device we want to attach to even before we get a
page pool and can pre-map in advance. All net_iovs are synchronised for
the device when allocated, see net_mp_netmem_place_in_cache().
Reviewed-by: Jens Axboe Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- io_uring/zcrx.c | 82 ++++++++++++++++++++++++++++++++++++++++++++++++- io_uring/zcrx.h | 1 + 2 files changed, 82 insertions(+), 1 deletion(-) diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index 9d5c0479a285..4f7767980000 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -1,6 +1,7 @@ // SPDX-License-Identifier: GPL-2.0 #include #include +#include #include #include #include @@ -20,6 +21,73 @@ #include "zcrx.h" #include "rsrc.h" +#define IO_DMA_ATTR (DMA_ATTR_SKIP_CPU_SYNC | DMA_ATTR_WEAK_ORDERING) + +static void __io_zcrx_unmap_area(struct io_zcrx_ifq *ifq, + struct io_zcrx_area *area, int nr_mapped) +{ + int i; + + for (i = 0; i < nr_mapped; i++) { + struct net_iov *niov = &area->nia.niovs[i]; + dma_addr_t dma; + + dma = page_pool_get_dma_addr_netmem(net_iov_to_netmem(niov)); + dma_unmap_page_attrs(ifq->dev, dma, PAGE_SIZE, + DMA_FROM_DEVICE, IO_DMA_ATTR); + net_mp_niov_set_dma_addr(niov, 0); + } +} + +static void io_zcrx_unmap_area(struct io_zcrx_ifq *ifq, struct io_zcrx_area *area) +{ + if (area->is_mapped) + __io_zcrx_unmap_area(ifq, area, area->nia.num_niovs); +} + +static int io_zcrx_map_area(struct io_zcrx_ifq *ifq, struct io_zcrx_area *area) +{ + int i; + + for (i = 0; i < area->nia.num_niovs; i++) { + struct net_iov *niov = &area->nia.niovs[i]; + dma_addr_t dma; + + dma = dma_map_page_attrs(ifq->dev, area->pages[i], 0, PAGE_SIZE, + DMA_FROM_DEVICE, IO_DMA_ATTR); + if (dma_mapping_error(ifq->dev, dma)) + break; + if (net_mp_niov_set_dma_addr(niov, dma)) { + dma_unmap_page_attrs(ifq->dev, dma, PAGE_SIZE, + DMA_FROM_DEVICE, IO_DMA_ATTR); + break; + } + } + + if (i != area->nia.num_niovs) { + __io_zcrx_unmap_area(ifq, area, i); + return -EINVAL; + } + + area->is_mapped = true; + return 0; +} + +static void io_zcrx_sync_for_device(const struct page_pool *pool, + struct net_iov *niov) +{ +#if defined(CONFIG_HAS_DMA) && defined(CONFIG_DMA_NEED_SYNC) + dma_addr_t dma_addr; + + if (!dma_dev_need_sync(pool->p.dev)) + return; + + dma_addr = page_pool_get_dma_addr_netmem(net_iov_to_netmem(niov)); + __dma_sync_single_for_device(pool->p.dev, dma_addr + pool->p.offset, + PAGE_SIZE, pool->p.dma_dir); +#endif +} + #define IO_RQ_MAX_ENTRIES 32768 __maybe_unused @@ -82,6 +150,8 @@ static void io_free_rbuf_ring(struct io_zcrx_ifq *ifq) static void io_zcrx_free_area(struct io_zcrx_area *area) { + io_zcrx_unmap_area(area->ifq, area); + kvfree(area->freelist); kvfree(area->nia.niovs); kvfree(area->user_refs); @@ -271,6 +341,10 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx, return -EOPNOTSUPP; get_device(ifq->dev); + ret = io_zcrx_map_area(ifq, ifq->area); + if (ret) + goto err; + reg.offsets.rqes = sizeof(struct io_uring); reg.offsets.head = offsetof(struct io_uring, head); reg.offsets.tail = offsetof(struct io_uring, tail); @@ -423,6 +497,7 @@ static void io_zcrx_ring_refill(struct page_pool *pp, continue; } + io_zcrx_sync_for_device(pp, niov); net_mp_netmem_place_in_cache(pp, netmem); } while (--entries); @@ -440,6 +515,7 @@ static void io_zcrx_refill_slow(struct page_pool *pp, struct io_zcrx_ifq *ifq) netmem_ref netmem = net_iov_to_netmem(niov); net_mp_niov_set_page_pool(pp, niov); + io_zcrx_sync_for_device(pp, niov); net_mp_netmem_place_in_cache(pp, netmem); } spin_unlock_bh(&area->freelist_lock); @@ -483,10 +559,14 @@ static int io_pp_zc_init(struct page_pool *pp) if (WARN_ON_ONCE(!ifq)) return -EINVAL; - if (pp->dma_map) + if (WARN_ON_ONCE(ifq->dev != pp->p.dev)) + return -EINVAL; + if 
(WARN_ON_ONCE(!pp->dma_map)) return -EOPNOTSUPP; if (pp->p.order != 0) return -EOPNOTSUPP; + if (pp->p.dma_dir != DMA_FROM_DEVICE) + return -EOPNOTSUPP; percpu_ref_get(&ifq->ctx->refs); return 0; diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h index 6c808240ac91..1b6363591f72 100644 --- a/io_uring/zcrx.h +++ b/io_uring/zcrx.h @@ -11,6 +11,7 @@ struct io_zcrx_area { struct io_zcrx_ifq *ifq; atomic_t *user_refs; + bool is_mapped; u16 area_id; struct page **pages;

From patchwork Wed Feb 12 18:57:56 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13972316
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni, "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela
Subject: [PATCH net-next v13 06/11] io_uring/zcrx: add io_recvzc request
Date: Wed, 12 Feb 2025 10:57:56 -0800
Message-ID: <20250212185859.3509616-7-dw@davidwei.uk>
In-Reply-To: <20250212185859.3509616-1-dw@davidwei.uk>
References: <20250212185859.3509616-1-dw@davidwei.uk>

Add io_uring opcode OP_RECV_ZC for doing zero copy reads out of a socket. Only the connection should land on the specific rx queue set up for zero copy, and the socket must be handled by the io_uring instance that the rx queue was registered with. That's because net_iovs / buffers from our queue cannot be read by outside applications, and zero copy is not possible if traffic for the zero copy connection goes to another queue. This coordination is outside the scope of this patch series. Also, any traffic directed to the zero copy enabled queue is immediately visible to the application, which is why CAP_NET_ADMIN is required at the registration step. Of course, no data is actually read out of the socket; it has already been copied by the netdev into userspace memory via DMA. OP_RECV_ZC reads skbs out of the socket and checks that their frags are indeed net_iovs that belong to io_uring. A CQE is queued for each one of these frags. Recall that each CQE is a big CQE, with the top half being an io_uring_zcrx_cqe. The CQE res field contains the len or an error. The lower IORING_ZCRX_AREA_SHIFT bits of struct io_uring_zcrx_cqe::off contain the offset relative to the start of the zero copy area; the upper part of the off field is trivially zero for now, and will be used to carry the area id. For now, there is no limit on how much work each OP_RECV_ZC request does: it will attempt to drain a socket of all available data.
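As a concrete illustration of that layout, here is a minimal C sketch of decoding one such big CQE in userspace. It assumes area_ptr is the base of the registered zero copy area, and that the struct and constant from this series' uAPI headers are available:

	#include <stdint.h>
	#include <linux/io_uring.h>

	/* Decode a 32-byte CQE produced by OP_RECV_ZC: the io_uring_zcrx_cqe
	 * sits in the upper half; its off field packs area id and offset. */
	static void *zcrx_data(const struct io_uring_cqe *cqe, void *area_ptr,
			       int32_t *len)
	{
		const struct io_uring_zcrx_cqe *rcqe =
			(const struct io_uring_zcrx_cqe *)(cqe + 1);
		uint64_t mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;

		*len = cqe->res;	/* length, or a negative error */
		/* the area id lives above IORING_ZCRX_AREA_SHIFT; with a
		 * single area it is zero, so only the low bits matter */
		return (char *)area_ptr + (rcqe->off & mask);
	}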
This request always operates in multishot mode. Reviewed-by: Jens Axboe Signed-off-by: David Wei --- include/uapi/linux/io_uring.h | 2 + io_uring/io_uring.h | 10 ++ io_uring/net.c | 72 +++++++++++++ io_uring/opdef.c | 16 +++ io_uring/zcrx.c | 190 +++++++++++++++++++++++++++++++++- io_uring/zcrx.h | 13 +++ 6 files changed, 302 insertions(+), 1 deletion(-) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 44844707d327..05d6255b0f6a 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -87,6 +87,7 @@ struct io_uring_sqe { union { __s32 splice_fd_in; __u32 file_index; + __u32 zcrx_ifq_idx; __u32 optlen; struct { __u16 addr_len; @@ -278,6 +279,7 @@ enum io_uring_op { IORING_OP_FTRUNCATE, IORING_OP_BIND, IORING_OP_LISTEN, + IORING_OP_RECV_ZC, /* this goes last, obviously */ IORING_OP_LAST, diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h index 85bc8f76ca19..fd0dbe7b0c9a 100644 --- a/io_uring/io_uring.h +++ b/io_uring/io_uring.h @@ -185,6 +185,16 @@ static inline bool io_get_cqe(struct io_ring_ctx *ctx, struct io_uring_cqe **ret return io_get_cqe_overflow(ctx, ret, false); } +static inline bool io_defer_get_uncommited_cqe(struct io_ring_ctx *ctx, + struct io_uring_cqe **cqe_ret) +{ + io_lockdep_assert_cq_locked(ctx); + + ctx->cq_extra++; + ctx->submit_state.cq_flush = true; + return io_get_cqe(ctx, cqe_ret); +} + static __always_inline bool io_fill_cqe_req(struct io_ring_ctx *ctx, struct io_kiocb *req) { diff --git a/io_uring/net.c b/io_uring/net.c index 10344b3a6d89..260eb73a5854 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -16,6 +16,7 @@ #include "net.h" #include "notif.h" #include "rsrc.h" +#include "zcrx.h" #if defined(CONFIG_NET) struct io_shutdown { @@ -89,6 +90,13 @@ struct io_sr_msg { */ #define MULTISHOT_MAX_RETRY 32 +struct io_recvzc { + struct file *file; + unsigned msg_flags; + u16 flags; + struct io_zcrx_ifq *ifq; +}; + int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) { struct io_shutdown *shutdown = io_kiocb_to_cmd(req, struct io_shutdown); @@ -1227,6 +1235,70 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags) return ret; } +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + unsigned ifq_idx; + + if (unlikely(sqe->file_index || sqe->addr2 || sqe->addr || + sqe->len || sqe->addr3)) + return -EINVAL; + + ifq_idx = READ_ONCE(sqe->zcrx_ifq_idx); + if (ifq_idx != 0) + return -EINVAL; + zc->ifq = req->ctx->ifq; + if (!zc->ifq) + return -EINVAL; + + zc->flags = READ_ONCE(sqe->ioprio); + zc->msg_flags = READ_ONCE(sqe->msg_flags); + if (zc->msg_flags) + return -EINVAL; + if (zc->flags & ~(IORING_RECVSEND_POLL_FIRST | IORING_RECV_MULTISHOT)) + return -EINVAL; + /* multishot required */ + if (!(zc->flags & IORING_RECV_MULTISHOT)) + return -EINVAL; + /* All data completions are posted as aux CQEs. 
*/ + req->flags |= REQ_F_APOLL_MULTISHOT; + + return 0; +} + +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + struct socket *sock; + int ret; + + if (!(req->flags & REQ_F_POLLED) && + (zc->flags & IORING_RECVSEND_POLL_FIRST)) + return -EAGAIN; + + sock = sock_from_file(req->file); + if (unlikely(!sock)) + return -ENOTSOCK; + + ret = io_zcrx_recv(req, zc->ifq, sock, zc->msg_flags | MSG_DONTWAIT, + issue_flags); + if (unlikely(ret <= 0) && ret != -EAGAIN) { + if (ret == -ERESTARTSYS) + ret = -EINTR; + + req_set_fail(req); + io_req_set_res(req, ret, 0); + + if (issue_flags & IO_URING_F_MULTISHOT) + return IOU_STOP_MULTISHOT; + return IOU_OK; + } + + if (issue_flags & IO_URING_F_MULTISHOT) + return IOU_ISSUE_SKIP_COMPLETE; + return -EAGAIN; +} + void io_send_zc_cleanup(struct io_kiocb *req) { struct io_sr_msg *zc = io_kiocb_to_cmd(req, struct io_sr_msg); diff --git a/io_uring/opdef.c b/io_uring/opdef.c index e8baef4e5146..89f50ecadeaf 100644 --- a/io_uring/opdef.c +++ b/io_uring/opdef.c @@ -37,6 +37,7 @@ #include "waitid.h" #include "futex.h" #include "truncate.h" +#include "zcrx.h" static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags) { @@ -514,6 +515,18 @@ const struct io_issue_def io_issue_defs[] = { .async_size = sizeof(struct io_async_msghdr), #else .prep = io_eopnotsupp_prep, +#endif + }, + [IORING_OP_RECV_ZC] = { + .needs_file = 1, + .unbound_nonreg_file = 1, + .pollin = 1, + .ioprio = 1, +#if defined(CONFIG_NET) + .prep = io_recvzc_prep, + .issue = io_recvzc, +#else + .prep = io_eopnotsupp_prep, #endif }, }; @@ -745,6 +758,9 @@ const struct io_cold_def io_cold_defs[] = { [IORING_OP_LISTEN] = { .name = "LISTEN", }, + [IORING_OP_RECV_ZC] = { + .name = "RECV_ZC", + }, }; const char *io_uring_get_opcode(u8 opcode) diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index 4f7767980000..a487f5149641 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -12,6 +12,8 @@ #include #include #include +#include +#include #include @@ -90,7 +92,12 @@ static void io_zcrx_sync_for_device(const struct page_pool *pool, #define IO_RQ_MAX_ENTRIES 32768 -__maybe_unused +struct io_zcrx_args { + struct io_kiocb *req; + struct io_zcrx_ifq *ifq; + struct socket *sock; +}; + static const struct memory_provider_ops io_uring_pp_zc_ops; static inline struct io_zcrx_area *io_zcrx_iov_to_area(const struct net_iov *niov) @@ -117,6 +124,11 @@ static bool io_zcrx_put_niov_uref(struct net_iov *niov) return true; } +static void io_zcrx_get_niov_uref(struct net_iov *niov) +{ + atomic_inc(io_get_user_counter(niov)); +} + static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq, struct io_uring_zcrx_ifq_reg *reg, struct io_uring_region_desc *rd) @@ -612,3 +624,179 @@ static const struct memory_provider_ops io_uring_pp_zc_ops = { .nl_fill = io_pp_nl_fill, .uninstall = io_pp_uninstall, }; + +static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov, + struct io_zcrx_ifq *ifq, int off, int len) +{ + struct io_uring_zcrx_cqe *rcqe; + struct io_zcrx_area *area; + struct io_uring_cqe *cqe; + u64 offset; + + if (!io_defer_get_uncommited_cqe(req->ctx, &cqe)) + return false; + + cqe->user_data = req->cqe.user_data; + cqe->res = len; + cqe->flags = IORING_CQE_F_MORE; + + area = io_zcrx_iov_to_area(niov); + offset = off + (net_iov_idx(niov) << PAGE_SHIFT); + rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1); + rcqe->off = offset + ((u64)area->area_id << IORING_ZCRX_AREA_SHIFT); + rcqe->__pad = 0; + return true; +} + +static int 
io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq, + const skb_frag_t *frag, int off, int len) +{ + struct net_iov *niov; + + if (unlikely(!skb_frag_is_net_iov(frag))) + return -EOPNOTSUPP; + + niov = netmem_to_net_iov(frag->netmem); + if (niov->pp->mp_ops != &io_uring_pp_zc_ops || + niov->pp->mp_priv != ifq) + return -EFAULT; + + if (!io_zcrx_queue_cqe(req, niov, ifq, off + skb_frag_off(frag), len)) + return -ENOSPC; + + /* + * Prevent it from being recycled while user is accessing it. + * It has to be done before grabbing a user reference. + */ + page_pool_ref_netmem(net_iov_to_netmem(niov)); + io_zcrx_get_niov_uref(niov); + return len; +} + +static int +io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, + unsigned int offset, size_t len) +{ + struct io_zcrx_args *args = desc->arg.data; + struct io_zcrx_ifq *ifq = args->ifq; + struct io_kiocb *req = args->req; + struct sk_buff *frag_iter; + unsigned start, start_off; + int i, copy, end, off; + int ret = 0; + + start = skb_headlen(skb); + start_off = offset; + + if (offset < start) + return -EOPNOTSUPP; + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + const skb_frag_t *frag; + + if (WARN_ON(start > offset + len)) + return -EFAULT; + + frag = &skb_shinfo(skb)->frags[i]; + end = start + skb_frag_size(frag); + + if (offset < end) { + copy = end - offset; + if (copy > len) + copy = len; + + off = offset - start; + ret = io_zcrx_recv_frag(req, ifq, frag, off, copy); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + + skb_walk_frags(skb, frag_iter) { + if (WARN_ON(start > offset + len)) + return -EFAULT; + + end = start + frag_iter->len; + if (offset < end) { + copy = end - offset; + if (copy > len) + copy = len; + + off = offset - start; + ret = io_zcrx_recv_skb(desc, frag_iter, off, copy); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + +out: + if (offset == start_off) + return ret; + return offset - start_off; +} + +static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq, + struct sock *sk, int flags, + unsigned issue_flags) +{ + struct io_zcrx_args args = { + .req = req, + .ifq = ifq, + .sock = sk->sk_socket, + }; + read_descriptor_t rd_desc = { + .count = 1, + .arg.data = &args, + }; + int ret; + + lock_sock(sk); + ret = tcp_read_sock(sk, &rd_desc, io_zcrx_recv_skb); + if (ret <= 0) { + if (ret < 0 || sock_flag(sk, SOCK_DONE)) + goto out; + if (sk->sk_err) + ret = sock_error(sk); + else if (sk->sk_shutdown & RCV_SHUTDOWN) + goto out; + else if (sk->sk_state == TCP_CLOSE) + ret = -ENOTCONN; + else + ret = -EAGAIN; + } else if (sock_flag(sk, SOCK_DONE)) { + /* Make it to retry until it finally gets 0. 
*/ + if (issue_flags & IO_URING_F_MULTISHOT) + ret = IOU_REQUEUE; + else + ret = -EAGAIN; + } +out: + release_sock(sk); + return ret; +} + +int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq, + struct socket *sock, unsigned int flags, + unsigned issue_flags) +{ + struct sock *sk = sock->sk; + const struct proto *prot = READ_ONCE(sk->sk_prot); + + if (prot->recvmsg != tcp_recvmsg) + return -EPROTONOSUPPORT; + + sock_rps_record_flow(sk); + return io_zcrx_tcp_recvmsg(req, ifq, sk, flags, issue_flags); +} diff --git a/io_uring/zcrx.h b/io_uring/zcrx.h index 1b6363591f72..a16bdd921f03 100644 --- a/io_uring/zcrx.h +++ b/io_uring/zcrx.h @@ -3,6 +3,7 @@ #define IOU_ZC_RX_H #include +#include #include #include @@ -43,6 +44,9 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx, struct io_uring_zcrx_ifq_reg __user *arg); void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx); void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx); +int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq, + struct socket *sock, unsigned int flags, + unsigned issue_flags); #else static inline int io_register_zcrx_ifq(struct io_ring_ctx *ctx, struct io_uring_zcrx_ifq_reg __user *arg) @@ -55,6 +59,15 @@ static inline void io_unregister_zcrx_ifqs(struct io_ring_ctx *ctx) static inline void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx) { } +static inline int io_zcrx_recv(struct io_kiocb *req, struct io_zcrx_ifq *ifq, + struct socket *sock, unsigned int flags, + unsigned issue_flags) +{ + return -EOPNOTSUPP; +} #endif +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags); +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); + #endif
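For reference, arming the opcode added above from userspace looks roughly like this; the flags go in sqe->ioprio and multishot is mandatory, exactly as io_recvzc_prep() expects. A hedged sketch using the liburing raw-sqe helpers (the prep helper name is ours, not part of the API):

	#include <liburing.h>

	/* Arm a multishot zero copy receive on sockfd; ifq index 0 is the
	 * only one supported for now, so zcrx_ifq_idx stays zero. */
	static void prep_recv_zc(struct io_uring *ring, int sockfd)
	{
		struct io_uring_sqe *sqe = io_uring_get_sqe(ring);

		io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
		sqe->ioprio |= IORING_RECV_MULTISHOT; /* prep rejects it otherwise */
	}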
From patchwork Wed Feb 12 18:57:57 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13972317
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni, "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela
Subject: [PATCH net-next v13 07/11] io_uring/zcrx: set pp memory provider for an rx queue
Date: Wed, 12 Feb 2025 10:57:57 -0800
Message-ID: <20250212185859.3509616-8-dw@davidwei.uk>
In-Reply-To: <20250212185859.3509616-1-dw@davidwei.uk>
References: <20250212185859.3509616-1-dw@davidwei.uk>

Set the page pool memory provider for the rx queue configured for zero copy to io_uring. Then the rx queue is reset using netdev_rx_queue_restart(), and the netdev core + page pool take care of filling it from the io_uring zero copy memory provider.
For now, there is only one ifq so its destruction happens implicitly during io_uring cleanup. Reviewed-by: Jens Axboe Signed-off-by: David Wei --- io_uring/zcrx.c | 44 ++++++++++++++++++++++++++++++++++++++------ 1 file changed, 38 insertions(+), 6 deletions(-) diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index a487f5149641..af357400aeb8 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -274,8 +274,34 @@ static void io_zcrx_drop_netdev(struct io_zcrx_ifq *ifq) spin_unlock(&ifq->lock); } +static void io_close_queue(struct io_zcrx_ifq *ifq) +{ + struct net_device *netdev; + netdevice_tracker netdev_tracker; + struct pp_memory_provider_params p = { + .mp_ops = &io_uring_pp_zc_ops, + .mp_priv = ifq, + }; + + if (ifq->if_rxq == -1) + return; + + spin_lock(&ifq->lock); + netdev = ifq->netdev; + netdev_tracker = ifq->netdev_tracker; + ifq->netdev = NULL; + spin_unlock(&ifq->lock); + + if (netdev) { + net_mp_close_rxq(netdev, ifq->if_rxq, &p); + netdev_put(netdev, &netdev_tracker); + } + ifq->if_rxq = -1; +} + static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq) { + io_close_queue(ifq); io_zcrx_drop_netdev(ifq); if (ifq->area) @@ -290,6 +316,7 @@ static void io_zcrx_ifq_free(struct io_zcrx_ifq *ifq) int io_register_zcrx_ifq(struct io_ring_ctx *ctx, struct io_uring_zcrx_ifq_reg __user *arg) { + struct pp_memory_provider_params mp_param = {}; struct io_uring_zcrx_area_reg area; struct io_uring_zcrx_ifq_reg reg; struct io_uring_region_desc rd; @@ -340,7 +367,6 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx, goto err; ifq->rq_entries = reg.rq_entries; - ifq->if_rxq = reg.if_rxq; ret = -ENODEV; ifq->netdev = netdev_get_by_index(current->nsproxy->net_ns, reg.if_idx, @@ -357,16 +383,20 @@ int io_register_zcrx_ifq(struct io_ring_ctx *ctx, if (ret) goto err; + mp_param.mp_ops = &io_uring_pp_zc_ops; + mp_param.mp_priv = ifq; + ret = net_mp_open_rxq(ifq->netdev, reg.if_rxq, &mp_param); + if (ret) + goto err; + ifq->if_rxq = reg.if_rxq; + reg.offsets.rqes = sizeof(struct io_uring); reg.offsets.head = offsetof(struct io_uring, head); reg.offsets.tail = offsetof(struct io_uring, tail); if (copy_to_user(arg, &reg, sizeof(reg)) || - copy_to_user(u64_to_user_ptr(reg.region_ptr), &rd, sizeof(rd))) { - ret = -EFAULT; - goto err; - } - if (copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) { + copy_to_user(u64_to_user_ptr(reg.region_ptr), &rd, sizeof(rd)) || + copy_to_user(u64_to_user_ptr(reg.area_ptr), &area, sizeof(area))) { ret = -EFAULT; goto err; } @@ -445,6 +475,8 @@ void io_shutdown_zcrx_ifqs(struct io_ring_ctx *ctx) if (ctx->ifq) io_zcrx_scrub(ctx->ifq); + + io_close_queue(ctx->ifq); } static inline u32 io_zcrx_rqring_entries(struct io_zcrx_ifq *ifq)
From patchwork Wed Feb 12 18:57:58 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13972318
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni, "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela
Subject: [PATCH net-next v13 08/11] io_uring/zcrx: throttle receive requests
Date: Wed, 12 Feb 2025 10:57:58 -0800
Message-ID: <20250212185859.3509616-9-dw@davidwei.uk>
In-Reply-To: <20250212185859.3509616-1-dw@davidwei.uk>
References: <20250212185859.3509616-1-dw@davidwei.uk>

From: Pavel Begunkov

io_zcrx_tcp_recvmsg() continues until it fails or there is nothing to receive. If the other side sends fast enough, we might get stuck in io_zcrx_tcp_recvmsg() producing more and more CQEs but never letting the user handle them, leading to unbounded latencies. Break out of it based on an arbitrarily chosen limit; the upper layer will either return to userspace or requeue the request.

Reviewed-by: Jens Axboe Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- io_uring/net.c | 2 ++ io_uring/zcrx.c | 9 +++++++++ 2 files changed, 11 insertions(+) diff --git a/io_uring/net.c b/io_uring/net.c index 260eb73a5854..000dc70d08d0 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -1285,6 +1285,8 @@ int io_recvzc(struct io_kiocb *req, unsigned int issue_flags) if (unlikely(ret <= 0) && ret != -EAGAIN) { if (ret == -ERESTARTSYS) ret = -EINTR; + if (ret == IOU_REQUEUE) + return IOU_REQUEUE; req_set_fail(req); io_req_set_res(req, ret, 0); diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index af357400aeb8..8f8a71f5d0a4 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -92,10 +92,13 @@ static void io_zcrx_sync_for_device(const struct page_pool *pool, #define IO_RQ_MAX_ENTRIES 32768 +#define IO_SKBS_PER_CALL_LIMIT 20 + struct io_zcrx_args { struct io_kiocb *req; struct io_zcrx_ifq *ifq; struct socket *sock; + unsigned nr_skbs; }; static const struct memory_provider_ops io_uring_pp_zc_ops; @@ -717,6 +720,9 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, int i, copy, end, off; int ret = 0; + if (unlikely(args->nr_skbs++ > IO_SKBS_PER_CALL_LIMIT)) + return -EAGAIN; + start = skb_headlen(skb); start_off = offset; @@ -807,6 +813,9 @@ static int io_zcrx_tcp_recvmsg(struct io_kiocb *req, struct io_zcrx_ifq *ifq, ret = -ENOTCONN; else ret = -EAGAIN; + } else if (unlikely(args.nr_skbs > IO_SKBS_PER_CALL_LIMIT) && + (issue_flags & IO_URING_F_MULTISHOT)) { + ret = IOU_REQUEUE; } else if (sock_flag(sk, SOCK_DONE)) { /* Make it to retry until it finally gets 0.
*/ + if (issue_flags & IO_URING_F_MULTISHOT)
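From userspace, this throttling is invisible: on IOU_REQUEUE the kernel re-arms the multishot request itself. What the application does have to handle is the multishot terminating (error or EOF), which is signalled by a CQE without IORING_CQE_F_MORE. A hedged sketch of that standard loop, reusing the hypothetical prep_recv_zc() helper from the earlier sketch:

	#include <liburing.h>

	void prep_recv_zc(struct io_uring *ring, int sockfd); /* earlier sketch */

	/* Drain CQEs; if the multishot recvzc stopped, re-arm it. */
	static void reap(struct io_uring *ring, int sockfd)
	{
		struct io_uring_cqe *cqe;
		unsigned head, seen = 0;
		int rearm = 0;

		io_uring_for_each_cqe(ring, head, cqe) {
			if (!(cqe->flags & IORING_CQE_F_MORE))
				rearm = 1;	/* terminated: error or EOF */
			/* ... consume the data described by the zcrx cqe ... */
			seen++;
		}
		io_uring_cq_advance(ring, seen);

		if (rearm) {
			prep_recv_zc(ring, sockfd);
			io_uring_submit(ring);
		}
	}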
From patchwork Wed Feb 12 18:57:59 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13972319
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni, "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela
Subject: [PATCH net-next v13 09/11] io_uring/zcrx: add copy fallback
Date: Wed, 12 Feb 2025 10:57:59 -0800
Message-ID: <20250212185859.3509616-10-dw@davidwei.uk>
In-Reply-To: <20250212185859.3509616-1-dw@davidwei.uk>
References: <20250212185859.3509616-1-dw@davidwei.uk>

From: Pavel Begunkov

There are scenarios in which the zerocopy path can get a kernel buffer instead of a net_iov and needs to copy it to the user, whether because of mis-steering or simply because an skb carries data in its linear part. In this case, grab a net_iov, copy into it and return it to the user as normal. At the moment the user doesn't get any indication of whether there was a copy or not, which is left for follow-up work.
Reviewed-by: Jens Axboe Signed-off-by: Pavel Begunkov Signed-off-by: David Wei --- io_uring/zcrx.c | 120 +++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 114 insertions(+), 6 deletions(-) diff --git a/io_uring/zcrx.c b/io_uring/zcrx.c index 8f8a71f5d0a4..7359e0810104 100644 --- a/io_uring/zcrx.c +++ b/io_uring/zcrx.c @@ -7,6 +7,7 @@ #include #include #include +#include #include #include @@ -132,6 +133,13 @@ static void io_zcrx_get_niov_uref(struct net_iov *niov) atomic_inc(io_get_user_counter(niov)); } +static inline struct page *io_zcrx_iov_page(const struct net_iov *niov) +{ + struct io_zcrx_area *area = io_zcrx_iov_to_area(niov); + + return area->pages[net_iov_idx(niov)]; +} + static int io_allocate_rbuf_ring(struct io_zcrx_ifq *ifq, struct io_uring_zcrx_ifq_reg *reg, struct io_uring_region_desc *rd) @@ -448,6 +456,11 @@ static void io_zcrx_return_niov(struct net_iov *niov) { netmem_ref netmem = net_iov_to_netmem(niov); + if (!niov->pp) { + /* copy fallback allocated niovs */ + io_zcrx_return_niov_freelist(niov); + return; + } page_pool_put_unrefed_netmem(niov->pp, netmem, -1, false); } @@ -683,13 +696,93 @@ static bool io_zcrx_queue_cqe(struct io_kiocb *req, struct net_iov *niov, return true; } +static struct net_iov *io_zcrx_alloc_fallback(struct io_zcrx_area *area) +{ + struct net_iov *niov = NULL; + + spin_lock_bh(&area->freelist_lock); + if (area->free_count) + niov = __io_zcrx_get_free_niov(area); + spin_unlock_bh(&area->freelist_lock); + + if (niov) + page_pool_fragment_netmem(net_iov_to_netmem(niov), 1); + return niov; +} + +static ssize_t io_zcrx_copy_chunk(struct io_kiocb *req, struct io_zcrx_ifq *ifq, + void *src_base, struct page *src_page, + unsigned int src_offset, size_t len) +{ + struct io_zcrx_area *area = ifq->area; + size_t copied = 0; + int ret = 0; + + while (len) { + size_t copy_size = min_t(size_t, PAGE_SIZE, len); + const int dst_off = 0; + struct net_iov *niov; + struct page *dst_page; + void *dst_addr; + + niov = io_zcrx_alloc_fallback(area); + if (!niov) { + ret = -ENOMEM; + break; + } + + dst_page = io_zcrx_iov_page(niov); + dst_addr = kmap_local_page(dst_page); + if (src_page) + src_base = kmap_local_page(src_page); + + memcpy(dst_addr, src_base + src_offset, copy_size); + + if (src_page) + kunmap_local(src_base); + kunmap_local(dst_addr); + + if (!io_zcrx_queue_cqe(req, niov, ifq, dst_off, copy_size)) { + io_zcrx_return_niov(niov); + ret = -ENOSPC; + break; + } + + io_zcrx_get_niov_uref(niov); + src_offset += copy_size; + len -= copy_size; + copied += copy_size; + } + + return copied ? copied : ret; +} + +static int io_zcrx_copy_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq, + const skb_frag_t *frag, int off, int len) +{ + struct page *page = skb_frag_page(frag); + u32 p_off, p_len, t, copied = 0; + int ret = 0; + + off += skb_frag_off(frag); + + skb_frag_foreach_page(frag, off, len, + page, p_off, p_len, t) { + ret = io_zcrx_copy_chunk(req, ifq, NULL, page, p_off, p_len); + if (ret < 0) + return copied ? 
copied : ret; + copied += ret; + } + return copied; +} + static int io_zcrx_recv_frag(struct io_kiocb *req, struct io_zcrx_ifq *ifq, const skb_frag_t *frag, int off, int len) { struct net_iov *niov; if (unlikely(!skb_frag_is_net_iov(frag))) - return -EOPNOTSUPP; + return io_zcrx_copy_frag(req, ifq, frag, off, len); niov = netmem_to_net_iov(frag->netmem); if (niov->pp->mp_ops != &io_uring_pp_zc_ops || @@ -716,18 +809,33 @@ io_zcrx_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, struct io_zcrx_ifq *ifq = args->ifq; struct io_kiocb *req = args->req; struct sk_buff *frag_iter; - unsigned start, start_off; + unsigned start, start_off = offset; int i, copy, end, off; int ret = 0; if (unlikely(args->nr_skbs++ > IO_SKBS_PER_CALL_LIMIT)) return -EAGAIN; - start = skb_headlen(skb); - start_off = offset; + if (unlikely(offset < skb_headlen(skb))) { + ssize_t copied; + size_t to_copy; - if (offset < start) - return -EOPNOTSUPP; + to_copy = min_t(size_t, skb_headlen(skb) - offset, len); + copied = io_zcrx_copy_chunk(req, ifq, skb->data, NULL, + offset, to_copy); + if (copied < 0) { + ret = copied; + goto out; + } + offset += copied; + len -= copied; + if (!len) + goto out; + if (offset != skb_headlen(skb)) + goto out; + } + + start = skb_headlen(skb); for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { const skb_frag_t *frag;
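The fallback path above bounces data through a freelist niov one page-sized chunk at a time, reporting a short count if the allocator runs dry. A minimal userspace-style C sketch of that bounded chunk-copy idiom, with plain memcpy standing in for the kmap_local_page() pairs and get_chunk() a hypothetical allocator:

	#include <string.h>
	#include <sys/types.h>

	#define CHUNK 4096	/* PAGE_SIZE stand-in */

	/* Copy len bytes from src into fixed-size chunks from get_chunk();
	 * on allocation failure, report how much was copied so far, or the
	 * error if nothing was (the patch returns -ENOMEM there). */
	static ssize_t copy_chunks(void *(*get_chunk)(void),
				   const char *src, size_t len)
	{
		size_t copied = 0;

		while (len) {
			size_t n = len < CHUNK ? len : CHUNK;
			void *dst = get_chunk();

			if (!dst)
				return copied ? (ssize_t)copied : -1;
			memcpy(dst, src + copied, n);
			copied += n;
			len -= n;
		}
		return copied;
	}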
From patchwork Wed Feb 12 18:58:00 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13972321
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni, "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern, Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela
Subject: [PATCH net-next v13 10/11] net: add documentation for io_uring zcrx
Date: Wed, 12 Feb 2025 10:58:00 -0800
Message-ID: <20250212185859.3509616-11-dw@davidwei.uk>
In-Reply-To: <20250212185859.3509616-1-dw@davidwei.uk>
References: <20250212185859.3509616-1-dw@davidwei.uk>

Add documentation for io_uring zero copy Rx that explains requirements and the user API.
Signed-off-by: David Wei --- Documentation/networking/index.rst | 1 + Documentation/networking/iou-zcrx.rst | 202 ++++++++++++++++++++++++++ 2 files changed, 203 insertions(+) create mode 100644 Documentation/networking/iou-zcrx.rst

diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 058193ed2eeb..c64133d309bf 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -63,6 +63,7 @@ Contents:
   gtp
   ila
   ioam6-sysctl
+  iou-zcrx
   ip_dynaddr
   ipsec
   ip-sysctl
diff --git a/Documentation/networking/iou-zcrx.rst b/Documentation/networking/iou-zcrx.rst
new file mode 100644
index 000000000000..0127319b30bb
--- /dev/null
+++ b/Documentation/networking/iou-zcrx.rst
@@ -0,0 +1,202 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+=====================
+io_uring zero copy Rx
+=====================
+
+Introduction
+============
+
+io_uring zero copy Rx (ZC Rx) is a feature that removes kernel-to-user copy on
+the network receive path, allowing packet data to be received directly into
+userspace memory. This feature differs from TCP_ZEROCOPY_RECEIVE in that
+there are no strict alignment requirements and no need to mmap()/munmap().
+Compared to kernel bypass solutions such as DPDK, the packet headers are
+processed by the kernel TCP stack as normal.
+
+NIC HW Requirements
+===================
+
+Several NIC HW features are required for io_uring ZC Rx to work. For now the
+kernel API does not configure the NIC and it must be done by the user.
+
+Header/data split
+-----------------
+
+Required to split packets at the L4 boundary into a header and a payload.
+Headers are received into kernel memory as normal and processed by the TCP
+stack as usual. Payloads are received into userspace memory directly.
+
+Flow steering
+-------------
+
+Specific HW Rx queues are configured for this feature, but modern NICs
+typically distribute flows across all HW Rx queues. Flow steering is required
+to ensure that only desired flows are directed towards HW queues that are
+configured for io_uring ZC Rx.
+
+RSS
+---
+
+In addition to flow steering above, RSS is required to steer all other non-zero
+copy flows away from queues that are configured for io_uring ZC Rx.
+
+Usage
+=====
+
+Setup NIC
+---------
+
+Must be done out of band for now.
+
+Ensure there are at least two queues::
+
+  ethtool -L eth0 combined 2
+
+Enable header/data split::
+
+  ethtool -G eth0 tcp-data-split on
+
+Carve out half of the HW Rx queues for zero copy using RSS::
+
+  ethtool -X eth0 equal 1
+
+Set up flow steering, bearing in mind that queues are 0-indexed::
+
+  ethtool -N eth0 flow-type tcp6 ... action 1
+
+Setup io_uring
+--------------
+
+This section describes the low level io_uring kernel API. Please refer to
+the liburing documentation for how to use the higher level API.
+
+Create an io_uring instance with the following required setup flags::
+
+  IORING_SETUP_SINGLE_ISSUER
+  IORING_SETUP_DEFER_TASKRUN
+  IORING_SETUP_CQE32
+
+Create memory area
+------------------
+
+Allocate userspace memory area for receiving zero copy data::
+
+  void *area_ptr = mmap(NULL, area_size,
+                        PROT_READ | PROT_WRITE,
+                        MAP_ANONYMOUS | MAP_PRIVATE,
+                        0, 0);
+
+Create refill ring
+------------------
+
+Allocate memory for a shared ringbuf used for returning consumed buffers::
+
+  void *ring_ptr = mmap(NULL, ring_size,
+                        PROT_READ | PROT_WRITE,
+                        MAP_ANONYMOUS | MAP_PRIVATE,
+                        0, 0);
+
+This refill ring consists of some space for the header, followed by an array of
+``struct io_uring_zcrx_rqe``::
+
+  size_t rq_entries = 4096;
+  size_t ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe) + PAGE_SIZE;
+  /* align to page size */
+  ring_size = (ring_size + (PAGE_SIZE - 1)) & ~(PAGE_SIZE - 1);
+
+Register ZC Rx
+--------------
+
+Fill in registration structs::
+
+  struct io_uring_zcrx_area_reg area_reg = {
+    .addr = (__u64)(unsigned long)area_ptr,
+    .len = area_size,
+    .flags = 0,
+  };
+
+  struct io_uring_region_desc region_reg = {
+    .user_addr = (__u64)(unsigned long)ring_ptr,
+    .size = ring_size,
+    .flags = IORING_MEM_REGION_TYPE_USER,
+  };
+
+  struct io_uring_zcrx_ifq_reg reg = {
+    .if_idx = if_nametoindex("eth0"),
+    /* this is the HW queue with desired flow steered into it */
+    .if_rxq = 1,
+    .rq_entries = rq_entries,
+    .area_ptr = (__u64)(unsigned long)&area_reg,
+    .region_ptr = (__u64)(unsigned long)&region_reg,
+  };
+
+Register with kernel::
+
+  io_uring_register_ifq(ring, &reg);
+
+Map refill ring
+---------------
+
+The kernel fills in fields for the refill ring in the registration ``struct
+io_uring_zcrx_ifq_reg``. Map it into userspace::
+
+  struct io_uring_zcrx_rq refill_ring;
+
+  refill_ring.khead = (unsigned *)((char *)ring_ptr + reg.offsets.head);
+  refill_ring.ktail = (unsigned *)((char *)ring_ptr + reg.offsets.tail);
+  refill_ring.rqes =
+    (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
+  refill_ring.rq_tail = 0;
+  refill_ring.ring_entries = rq_entries;
+  refill_ring.ring_ptr = ring_ptr;
+
+Receiving data
+--------------
+
+Prepare a zero copy recv request::
+
+  struct io_uring_sqe *sqe;
+
+  sqe = io_uring_get_sqe(ring);
+  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, fd, NULL, 0, 0);
+  sqe->ioprio |= IORING_RECV_MULTISHOT;
+
+Now, submit and wait::
+
+  io_uring_submit_and_wait(ring, 1);
+
+Finally, process completions::
+
+  struct io_uring_cqe *cqe;
+  unsigned int count = 0;
+  unsigned int head;
+
+  io_uring_for_each_cqe(ring, head, cqe) {
+    struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+
+    unsigned long mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
+    unsigned char *data = area_ptr + (rcqe->off & mask);
+    /* do something with the data */
+
+    count++;
+  }
+  io_uring_cq_advance(ring, count);
+
+Recycling buffers
+-----------------
+
+Return buffers back to the kernel to be used again::
+
+  struct io_uring_zcrx_rqe *rqe;
+  unsigned mask = refill_ring.ring_entries - 1;
+  rqe = &refill_ring.rqes[refill_ring.rq_tail & mask];
+
+  unsigned long area_offset = rcqe->off & ~IORING_ZCRX_AREA_MASK;
+  rqe->off = area_offset | area_reg.rq_area_token;
+  rqe->len = cqe->res;
+  IO_URING_WRITE_ONCE(*refill_ring.ktail, ++refill_ring.rq_tail);
+
+Testing
+=======
+
+See ``tools/testing/selftests/drivers/net/hw/iou-zcrx.c``
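Tying the documented snippets together, here is a hedged end-to-end sketch of a completion loop that both consumes data and recycles each buffer. It reuses refill_ring, area_ptr and area_reg as set up above; consume() is an application-defined placeholder:

	#include <liburing.h>

	void consume(const void *data, unsigned len);	/* app-defined */

	static void process_and_recycle(struct io_uring *ring)
	{
		struct io_uring_cqe *cqe;
		unsigned head, count = 0;
		unsigned mask = refill_ring.ring_entries - 1;

		io_uring_for_each_cqe(ring, head, cqe) {
			struct io_uring_zcrx_cqe *rcqe =
				(struct io_uring_zcrx_cqe *)(cqe + 1);

			count++;
			if (cqe->res <= 0)
				continue;	/* error or EOF: nothing to recycle */

			unsigned long off = rcqe->off & ~IORING_ZCRX_AREA_MASK;
			consume((char *)area_ptr + off, cqe->res);

			/* hand the buffer straight back via the refill ring */
			struct io_uring_zcrx_rqe *rqe =
				&refill_ring.rqes[refill_ring.rq_tail++ & mask];
			rqe->off = off | area_reg.rq_area_token;
			rqe->len = cqe->res;
		}
		io_uring_cq_advance(ring, count);
		IO_URING_WRITE_ONCE(*refill_ring.ktail, refill_ring.rq_tail);
	}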
From patchwork Wed Feb 12 18:58:01 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13972320
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
 "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern,
 Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela
Subject: [PATCH net-next v13 11/11] io_uring/zcrx: add selftest
Date: Wed, 12 Feb 2025 10:58:01 -0800
Message-ID: <20250212185859.3509616-12-dw@davidwei.uk>
In-Reply-To: <20250212185859.3509616-1-dw@davidwei.uk>
References: <20250212185859.3509616-1-dw@davidwei.uk>

Add a selftest for io_uring zero copy Rx. The test cannot run locally: it
requires a remote host to be configured in net.config, and the remote host
must have hardware support for zero copy Rx as listed in the documentation
page. The test restores the NIC configuration to its pre-test state and is
idempotent.

liburing is required both to compile the test and on the remote host that
runs it.
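For illustration only, a net.config for the drivers/net selftest harness
could look roughly like the following; the variable names are the ones the
NetDrvEpEnv environment reads (see tools/testing/selftests/drivers/net), and
the interface name and addresses are placeholders for a real setup:

  NETIF=eth0                      # local NIC under test (placeholder)
  LOCAL_V6=2001:db8::1            # the test requires IPv6 (cfg.require_v6())
  REMOTE_V6=2001:db8::2
  REMOTE_TYPE=ssh                 # how the harness reaches the remote host
  REMOTE_ARGS="root@2001:db8::2"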
Signed-off-by: David Wei
---
 .../selftests/drivers/net/hw/.gitignore       |   2 +
 .../testing/selftests/drivers/net/hw/Makefile |   5 +
 .../selftests/drivers/net/hw/iou-zcrx.c       | 426 ++++++++++++++++++
 .../selftests/drivers/net/hw/iou-zcrx.py      |  64 +++
 4 files changed, 497 insertions(+)
 create mode 100644 tools/testing/selftests/drivers/net/hw/iou-zcrx.c
 create mode 100755 tools/testing/selftests/drivers/net/hw/iou-zcrx.py

diff --git a/tools/testing/selftests/drivers/net/hw/.gitignore b/tools/testing/selftests/drivers/net/hw/.gitignore
index e9fe6ede681a..6942bf575497 100644
--- a/tools/testing/selftests/drivers/net/hw/.gitignore
+++ b/tools/testing/selftests/drivers/net/hw/.gitignore
@@ -1 +1,3 @@
+# SPDX-License-Identifier: GPL-2.0-only
+iou-zcrx
 ncdevmem
diff --git a/tools/testing/selftests/drivers/net/hw/Makefile b/tools/testing/selftests/drivers/net/hw/Makefile
index 21ba64ce1e34..7efc47c89463 100644
--- a/tools/testing/selftests/drivers/net/hw/Makefile
+++ b/tools/testing/selftests/drivers/net/hw/Makefile
@@ -1,5 +1,7 @@
 # SPDX-License-Identifier: GPL-2.0+ OR MIT
+TEST_GEN_FILES = iou-zcrx
+
 TEST_PROGS = \
 	csum.py \
 	devlink_port_split.py \
@@ -10,6 +12,7 @@ TEST_PROGS = \
 	ethtool_rmon.sh \
 	hw_stats_l3.sh \
 	hw_stats_l3_gre.sh \
+	iou-zcrx.py \
 	loopback.sh \
 	nic_link_layer.py \
 	nic_performance.py \
@@ -38,3 +41,5 @@ include ../../../lib.mk
 # YNL build
 YNL_GENS := ethtool netdev
 include ../../../net/ynl.mk
+
+$(OUTPUT)/iou-zcrx: LDLIBS += -luring
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.c b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
new file mode 100644
index 000000000000..010c261d2132
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.c
@@ -0,0 +1,426 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <errno.h>
+#include <error.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+#include <arpa/inet.h>
+#include <linux/ipv6.h>
+#include <net/if.h>
+#include <netinet/in.h>
+#include <netinet/tcp.h>
+#include <sys/mman.h>
+#include <sys/socket.h>
+#include <sys/time.h>
+#include <sys/types.h>
+
+#include <liburing.h>
+
+#define PAGE_SIZE (4096)
+#define AREA_SIZE (8192 * PAGE_SIZE)
+#define SEND_SIZE (512 * 4096)
+#define min(a, b)			\
+	({				\
+		typeof(a) _a = (a);	\
+		typeof(b) _b = (b);	\
+		_a < _b ? _a : _b;	\
+	})
+#define min_t(t, a, b)			\
+	({				\
+		t _ta = (a);		\
+		t _tb = (b);		\
+		min(_ta, _tb);		\
+	})
+
+#define ALIGN_UP(v, align) (((v) + (align) - 1) & ~((align) - 1))
+
+static int cfg_server;
+static int cfg_client;
+static int cfg_port = 8000;
+static int cfg_payload_len;
+static const char *cfg_ifname;
+static int cfg_queue_id = -1;
+static struct sockaddr_in6 cfg_addr;
+
+static char payload[SEND_SIZE] __attribute__((aligned(PAGE_SIZE)));
+static void *area_ptr;
+static void *ring_ptr;
+static size_t ring_size;
+static struct io_uring_zcrx_rq rq_ring;
+static unsigned long area_token;
+static int connfd;
+static bool stop;
+static size_t received;
+
+static unsigned long gettimeofday_ms(void)
+{
+	struct timeval tv;
+
+	gettimeofday(&tv, NULL);
+	return (tv.tv_sec * 1000) + (tv.tv_usec / 1000);
+}
+
+static int parse_address(const char *str, int port, struct sockaddr_in6 *sin6)
+{
+	int ret;
+
+	sin6->sin6_family = AF_INET6;
+	sin6->sin6_port = htons(port);
+
+	ret = inet_pton(sin6->sin6_family, str, &sin6->sin6_addr);
+	if (ret != 1) {
+		/* fallback to plain IPv4 */
+		ret = inet_pton(AF_INET, str, &sin6->sin6_addr.s6_addr32[3]);
+		if (ret != 1)
+			return -1;
+
+		/* add ::ffff prefix */
+		sin6->sin6_addr.s6_addr32[0] = 0;
+		sin6->sin6_addr.s6_addr32[1] = 0;
+		sin6->sin6_addr.s6_addr16[4] = 0;
+		sin6->sin6_addr.s6_addr16[5] = 0xffff;
+	}
+
+	return 0;
+}
+
+static inline size_t get_refill_ring_size(unsigned int rq_entries)
+{
+	ring_size = rq_entries * sizeof(struct io_uring_zcrx_rqe);
+	/* add space for the header (head/tail/etc.) */
+	ring_size += PAGE_SIZE;
+	return ALIGN_UP(ring_size, 4096);
+}
+
+static void setup_zcrx(struct io_uring *ring)
+{
+	unsigned int ifindex;
+	unsigned int rq_entries = 4096;
+	int ret;
+
+	ifindex = if_nametoindex(cfg_ifname);
+	if (!ifindex)
+		error(1, 0, "bad interface name: %s", cfg_ifname);
+
+	area_ptr = mmap(NULL,
+			AREA_SIZE,
+			PROT_READ | PROT_WRITE,
+			MAP_ANONYMOUS | MAP_PRIVATE,
+			0,
+			0);
+	if (area_ptr == MAP_FAILED)
+		error(1, 0, "mmap(): zero copy area");
+
+	ring_size = get_refill_ring_size(rq_entries);
+	ring_ptr = mmap(NULL,
+			ring_size,
+			PROT_READ | PROT_WRITE,
+			MAP_ANONYMOUS | MAP_PRIVATE,
+			0,
+			0);
+	if (ring_ptr == MAP_FAILED)
+		error(1, 0, "mmap(): refill ring");
+
+	struct io_uring_region_desc region_reg = {
+		.size = ring_size,
+		.user_addr = (__u64)(unsigned long)ring_ptr,
+		.flags = IORING_MEM_REGION_TYPE_USER,
+	};
+
+	struct io_uring_zcrx_area_reg area_reg = {
+		.addr = (__u64)(unsigned long)area_ptr,
+		.len = AREA_SIZE,
+		.flags = 0,
+	};
+
+	struct io_uring_zcrx_ifq_reg reg = {
+		.if_idx = ifindex,
+		.if_rxq = cfg_queue_id,
+		.rq_entries = rq_entries,
+		.area_ptr = (__u64)(unsigned long)&area_reg,
+		.region_ptr = (__u64)(unsigned long)&region_reg,
+	};
+
+	ret = io_uring_register_ifq(ring, &reg);
+	if (ret)
+		error(1, 0, "io_uring_register_ifq(): %d", ret);
+
+	rq_ring.khead = (unsigned int *)((char *)ring_ptr + reg.offsets.head);
+	rq_ring.ktail = (unsigned int *)((char *)ring_ptr + reg.offsets.tail);
+	rq_ring.rqes = (struct io_uring_zcrx_rqe *)((char *)ring_ptr + reg.offsets.rqes);
+	rq_ring.rq_tail = 0;
+	rq_ring.ring_entries = reg.rq_entries;
+
+	area_token = area_reg.rq_area_token;
+}
+
+static void add_accept(struct io_uring *ring, int sockfd)
+{
+	struct io_uring_sqe *sqe;
+
+	sqe = io_uring_get_sqe(ring);
+
+	io_uring_prep_accept(sqe, sockfd, NULL, NULL, 0);
+	sqe->user_data = 1;
+}
+
+static void add_recvzc(struct io_uring *ring, int sockfd)
+{
+	struct io_uring_sqe *sqe;
+
+	sqe = io_uring_get_sqe(ring);
+
+	io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sockfd, NULL, 0, 0);
+	sqe->ioprio |= IORING_RECV_MULTISHOT;
+	sqe->user_data = 2;
+}
+
+static void process_accept(struct io_uring *ring, struct io_uring_cqe *cqe)
+{
+	if (cqe->res < 0)
+		error(1, 0, "accept()");
+	if (connfd)
+		error(1, 0, "Unexpected second connection");
+
+	connfd = cqe->res;
+	add_recvzc(ring, connfd);
+}
+
+static void process_recvzc(struct io_uring *ring, struct io_uring_cqe *cqe)
+{
+	unsigned rq_mask = rq_ring.ring_entries - 1;
+	struct io_uring_zcrx_cqe *rcqe;
+	struct io_uring_zcrx_rqe *rqe;
+	uint64_t mask;
+	char *data;
+	ssize_t n;
+	int i;
+
+	if (cqe->res == 0 && cqe->flags == 0) {
+		stop = true;
+		return;
+	}
+
+	if (cqe->res < 0)
+		error(1, 0, "recvzc(): %d", cqe->res);
+
+	if (!(cqe->flags & IORING_CQE_F_MORE))
+		add_recvzc(ring, connfd);
+
+	/* the zcrx completion metadata lives in the big-CQE tail */
+	rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
+
+	n = cqe->res;
+	mask = (1ULL << IORING_ZCRX_AREA_SHIFT) - 1;
+	data = (char *)area_ptr + (rcqe->off & mask);
+
+	for (i = 0; i < n; i++) {
+		if (*(data + i) != payload[(received + i)])
+			error(1, 0, "payload mismatch");
+	}
+	received += n;
+
+	/* return the buffer to the kernel via the refill ring */
+	rqe = &rq_ring.rqes[(rq_ring.rq_tail & rq_mask)];
+	rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) | area_token;
+	rqe->len = cqe->res;
+	io_uring_smp_store_release(rq_ring.ktail, ++rq_ring.rq_tail);
+}
+
+static void server_loop(struct io_uring *ring)
+{
+	struct io_uring_cqe *cqe;
+	unsigned int count = 0;
+	unsigned int head;
+
+	io_uring_submit_and_wait(ring, 1);
+
+	io_uring_for_each_cqe(ring, head, cqe) {
+		if (cqe->user_data == 1)
+			process_accept(ring, cqe);
+		else if (cqe->user_data == 2)
+			process_recvzc(ring, cqe);
+		else
+			error(1, 0, "unknown cqe");
+		count++;
+	}
+	io_uring_cq_advance(ring, count);
+}
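+
+/*
+ * Test drivers. run_server() owns the io_uring instance and the zero copy
+ * area; run_client() is a plain socket sender. The client streams SEND_SIZE
+ * bytes of the repeating pattern that main() fills into payload[], which is
+ * what lets process_recvzc() validate received data byte-for-byte by offset.
+ */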
+static void run_server(void)
+{
+	unsigned int flags = 0;
+	struct io_uring ring;
+	int fd, enable, ret;
+	uint64_t tstop;
+
+	fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (fd == -1)
+		error(1, 0, "socket()");
+
+	enable = 1;
+	ret = setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &enable, sizeof(int));
+	if (ret < 0)
+		error(1, 0, "setsockopt(SO_REUSEADDR)");
+
+	ret = bind(fd, (struct sockaddr *)&cfg_addr, sizeof(cfg_addr));
+	if (ret < 0)
+		error(1, 0, "bind()");
+
+	if (listen(fd, 1024) < 0)
+		error(1, 0, "listen()");
+
+	flags |= IORING_SETUP_COOP_TASKRUN;
+	flags |= IORING_SETUP_SINGLE_ISSUER;
+	flags |= IORING_SETUP_DEFER_TASKRUN;
+	flags |= IORING_SETUP_SUBMIT_ALL;
+	flags |= IORING_SETUP_CQE32;
+
+	io_uring_queue_init(512, &ring, flags);
+
+	setup_zcrx(&ring);
+
+	add_accept(&ring, fd);
+
+	tstop = gettimeofday_ms() + 5000;
+	while (!stop && gettimeofday_ms() < tstop)
+		server_loop(&ring);
+
+	if (!stop)
+		error(1, 0, "test failed");
+}
+
+static void run_client(void)
+{
+	ssize_t to_send = SEND_SIZE;
+	ssize_t sent = 0;
+	ssize_t chunk, res;
+	int fd;
+
+	fd = socket(AF_INET6, SOCK_STREAM, 0);
+	if (fd == -1)
+		error(1, 0, "socket()");
+
+	if (connect(fd, (struct sockaddr *)&cfg_addr, sizeof(cfg_addr)))
+		error(1, 0, "connect()");
+
+	while (to_send) {
+		void *src = &payload[sent];
+
+		chunk = min_t(ssize_t, cfg_payload_len, to_send);
+		res = send(fd, src, chunk, 0);
+		if (res < 0)
+			error(1, 0, "send(): %zd", sent);
+		sent += res;
+		to_send -= res;
+	}
+
+	close(fd);
+}
+
+static void usage(const char *filepath)
+{
+	error(1, 0, "Usage: %s (-4|-6) (-s|-c) -h<server_ip> -p<port> "
+		    "-l<payload_size> -i<ifname> -q<rxq_id>", filepath);
+}
+
+static void parse_opts(int argc, char **argv)
+{
+	const int max_payload_len = sizeof(payload) -
+				    sizeof(struct ipv6hdr) -
+				    sizeof(struct tcphdr) -
+				    40 /* max tcp options */;
+	struct sockaddr_in6 *addr6 = (void *)&cfg_addr;
+	char *addr = NULL;
+	int ret;
+	int c;
+
+	if (argc <= 1)
+		usage(argv[0]);
+	cfg_payload_len = max_payload_len;
+
+	while ((c = getopt(argc, argv, "46sch:p:l:i:q:")) != -1) {
+		switch (c) {
+		case 's':
+			if (cfg_client)
+				error(1, 0, "Pass one of -s or -c");
+			cfg_server = 1;
+			break;
+		case 'c':
+			if (cfg_server)
+				error(1, 0, "Pass one of -s or -c");
+			cfg_client = 1;
+			break;
+		case 'h':
+			addr = optarg;
+			break;
+		case 'p':
+			cfg_port = strtoul(optarg, NULL, 0);
+			break;
+		case 'l':
+			cfg_payload_len = strtoul(optarg, NULL, 0);
+			break;
+		case 'i':
+			cfg_ifname = optarg;
+			break;
+		case 'q':
+			cfg_queue_id = strtoul(optarg, NULL, 0);
+			break;
+		}
+	}
+
+	if (cfg_server && addr)
+		error(1, 0, "Receiver cannot have -h specified");
+
+	memset(addr6, 0, sizeof(*addr6));
+	addr6->sin6_family = AF_INET6;
+	addr6->sin6_port = htons(cfg_port);
+	addr6->sin6_addr = in6addr_any;
+	if (addr) {
+		ret = parse_address(addr, cfg_port, addr6);
+		if (ret)
+			error(1, 0, "receiver address parse error: %s", addr);
+	}
+
+	if (cfg_payload_len > max_payload_len)
+		error(1, 0, "-l: payload exceeds max (%d)", max_payload_len);
+}
+
+int main(int argc, char **argv)
+{
+	int i;
+
+	parse_opts(argc, argv);
+
+	/* fill the send buffer with a known repeating pattern */
+	for (i = 0; i < SEND_SIZE; i++)
+		payload[i] = 'a' + (i % 26);
+
+	if (cfg_server)
+		run_server();
+	else if (cfg_client)
+		run_client();
+
+	return 0;
+}
diff --git a/tools/testing/selftests/drivers/net/hw/iou-zcrx.py b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
new file mode 100755
index 000000000000..ea0a346c3eff
--- /dev/null
+++ b/tools/testing/selftests/drivers/net/hw/iou-zcrx.py
@@ -0,0 +1,64 @@
+#!/usr/bin/env python3
+# SPDX-License-Identifier: GPL-2.0
+
+import re
+from os import path
+from lib.py import ksft_run, ksft_exit, KsftSkipEx
+from lib.py import NetDrvEpEnv
+from lib.py import bkg, cmd, ethtool, wait_port_listen
+
+
+def _get_rx_ring_entries(cfg):
+    output = ethtool(f"-g {cfg.ifname}", host=cfg.remote).stdout
+    values = re.findall(r'RX:\s+(\d+)', output)
+    # the second match is the current hardware setting
+    return int(values[1])
+
+
+def _get_combined_channels(cfg):
+    output = ethtool(f"-l {cfg.ifname}", host=cfg.remote).stdout
+    values = re.findall(r'Combined:\s+(\d+)', output)
+    return int(values[1])
+
+
+def _set_flow_rule(cfg, chan):
+    output = ethtool(f"-N {cfg.ifname} flow-type tcp6 dst-port 9999 action {chan}", host=cfg.remote).stdout
+    values = re.search(r'ID (\d+)', output).group(1)
+    return int(values)
+
+
+def test_zcrx(cfg) -> None:
+    cfg.require_v6()
+
+    combined_chans = _get_combined_channels(cfg)
+    if combined_chans < 2:
+        raise KsftSkipEx('at least 2 combined channels required')
+    rx_ring = _get_rx_ring_entries(cfg)
+    flow_rule_id = None
+
+    rx_cmd = f"{cfg.bin_remote} -s -p 9999 -i {cfg.ifname} -q {combined_chans - 1}"
+    tx_cmd = f"{cfg.bin_local} -c -h {cfg.remote_v6} -p 9999 -l 12840"
+
+    try:
+        ethtool(f"-G {cfg.ifname} rx 64", host=cfg.remote)
+        ethtool(f"-X {cfg.ifname} equal {combined_chans - 1}", host=cfg.remote)
+        flow_rule_id = _set_flow_rule(cfg, combined_chans - 1)
+
+        with bkg(rx_cmd, host=cfg.remote, exit_wait=True):
+            wait_port_listen(9999, proto="tcp", host=cfg.remote)
+            cmd(tx_cmd)
+    finally:
+        if flow_rule_id is not None:
+            ethtool(f"-N {cfg.ifname} delete {flow_rule_id}", host=cfg.remote)
+        ethtool(f"-X {cfg.ifname} default", host=cfg.remote)
+        ethtool(f"-G {cfg.ifname} rx {rx_ring}", host=cfg.remote)
+
+
+def main() -> None:
+    with NetDrvEpEnv(__file__) as cfg:
+        cfg.bin_local = path.abspath(path.dirname(__file__) + "/../../../drivers/net/hw/iou-zcrx")
+        cfg.bin_remote = cfg.remote.deploy(cfg.bin_local)
+
+        ksft_run(globs=globals(), case_pfx={"test_"}, args=(cfg, ))
+        ksft_exit()
+
+
+if __name__ == "__main__":
+    main()
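For a manual run outside the kselftest harness, an invocation consistent with
parse_opts() and the Python wrapper above might look like the following; the
interface name, queue id, and address are placeholders, and the flow steering
rule must already direct the test flow to the chosen queue on the receiving
NIC:

  # receiving host: steer the flow first, e.g.
  #   ethtool -N eth0 flow-type tcp6 dst-port 9999 action 1
  ./iou-zcrx -s -p 9999 -i eth0 -q 1

  # sending host
  ./iou-zcrx -c -h 2001:db8::2 -p 9999 -l 12840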