From patchwork Sat Feb 15 00:09:35 2025
X-Patchwork-Submitter: David Wei
X-Patchwork-Id: 13975776
From: David Wei
To: io-uring@vger.kernel.org, netdev@vger.kernel.org
Cc: Jens Axboe, Pavel Begunkov, Jakub Kicinski, Paolo Abeni,
    "David S. Miller", Eric Dumazet, Jesper Dangaard Brouer, David Ahern,
    Mina Almasry, Stanislav Fomichev, Joe Damato, Pedro Tammela, lizetao
Subject: [PATCH v14 00/11] io_uring zero copy rx
Date: Fri, 14 Feb 2025 16:09:35 -0800
Message-ID: <20250215000947.789731-1-dw@davidwei.uk>

This patchset contains io_uring patches needed by a new io_uring request
implementing zero copy rx into userspace pages, eliminating a kernel to user
copy.

We configure a page pool that a driver uses to fill a hw rx queue to hand out
user pages instead of kernel pages. Any data that ends up hitting this hw rx
queue will thus be dma'd into userspace memory directly, without needing to
be bounced through kernel memory. 'Reading' data out of a socket instead
becomes a _notification_ mechanism, where the kernel tells userspace where
the data is. The overall approach is similar to the devmem TCP proposal.

This relies on hw header/data split, flow steering and RSS to ensure packet
headers remain in kernel memory and only desired flows hit a hw rx queue
configured for zero copy. Configuring this is outside of the scope of this
patchset.

We share netdev core infra with devmem TCP. The main difference is that
io_uring is used for the uAPI and the lifetimes of all objects are bound to
an io_uring instance. Data is 'read' using a new io_uring request type. When
done, data is returned via a new shared refill queue. A zero copy page pool
refills a hw rx queue from this refill queue directly. Of course, the
lifetimes of these data buffers are managed by io_uring rather than the
networking stack, with different refcounting rules.

This patchset is the first step adding basic zero copy support. We will
extend this iteratively with new features, e.g. dynamically allocated zero
copy areas, THP support, dmabuf support, improved copy fallback, general
optimisations and more.

In terms of netdev support, we're first targeting Broadcom bnxt. Patches
aren't included since Taehee Yoo has already sent a more comprehensive
patchset adding support in [1]. Google gve should already support this, and
Mellanox mlx5 support is WIP pending driver changes.
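For a feel of the resulting uAPI, below is a condensed sketch of the
userspace flow, loosely based on the documentation and selftest added in this
series. It is illustrative only: placeholder variables (area_base, rq_mem,
sock_fd, the completed cqe, etc.) are assumed to be set up elsewhere, the
ring is assumed to use big (32-byte) CQEs, and names may differ slightly from
the patches and the liburing fork in [3].

  /*
   * Illustrative sketch, not runnable as-is. Error handling, socket setup
   * and NIC configuration (header/data split, flow steering) are omitted.
   */

  /* 1) Register a zero copy area and an interface queue (ifq). */
  struct io_uring_zcrx_area_reg area_reg = {
          .addr  = (__u64)(unsigned long)area_base, /* user rx buffer memory */
          .len   = area_size,
          .flags = 0,
  };
  struct io_uring_region_desc region_desc = {
          .user_addr = (__u64)(unsigned long)rq_mem, /* backs the refill ring */
          .size      = rq_mem_size,
          .flags     = IORING_MEM_REGION_TYPE_USER,
  };
  struct io_uring_zcrx_ifq_reg reg = {
          .if_idx     = if_nametoindex("eth0"),
          .if_rxq     = rx_queue_id,          /* hw rx queue dedicated to zcrx */
          .rq_entries = rq_entries,
          .area_ptr   = (__u64)(unsigned long)&area_reg,
          .region_ptr = (__u64)(unsigned long)&region_desc,
  };
  io_uring_register_ifq(&ring, &reg);         /* helper from the liburing fork */

  /* The kernel fills reg.offsets with the refill ring layout. */
  __u32 *rq_tail_ptr = (__u32 *)((char *)rq_mem + reg.offsets.tail);
  struct io_uring_zcrx_rqe *rq_ring =
          (struct io_uring_zcrx_rqe *)((char *)rq_mem + reg.offsets.rqes);
  unsigned rq_tail = 0;                       /* userspace cached tail */

  /* 2) Issue a multishot zero copy receive against a connected socket. */
  struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
  io_uring_prep_rw(IORING_OP_RECV_ZC, sqe, sock_fd, NULL, 0, 0);
  sqe->ioprio |= IORING_RECV_MULTISHOT;
  io_uring_submit(&ring);

  /* 3) Completions point into the area rather than carrying copied data;
   *    the second half of the big CQE holds struct io_uring_zcrx_cqe. */
  struct io_uring_zcrx_cqe *rcqe = (struct io_uring_zcrx_cqe *)(cqe + 1);
  char *data = (char *)area_base + (rcqe->off & ~IORING_ZCRX_AREA_MASK);
  /* ... consume cqe->res bytes at 'data' ... */

  /* 4) Return the buffer to the kernel through the shared refill ring. */
  struct io_uring_zcrx_rqe *rqe = &rq_ring[rq_tail & (rq_entries - 1)];
  rqe->off = (rcqe->off & ~IORING_ZCRX_AREA_MASK) | area_reg.rq_area_token;
  rqe->len = cqe->res;
  io_uring_smp_store_release(rq_tail_ptr, ++rq_tail);

The refill ring and the area are the only structures shared with the kernel;
buffers flow from the hw rx queue to CQEs and back through the refill ring
into the zero copy page pool without any copies in between.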
===========
Performance
===========

Note: Comparison with epoll + TCP_ZEROCOPY_RECEIVE isn't done yet.

Test setup:
* AMD EPYC 9454
* Broadcom BCM957508 200G
* Kernel v6.11 base [2]
* liburing fork [3]
* kperf fork [4]
* 4K MTU
* Single TCP flow

With application thread + net rx softirq pinned to _different_ cores:

+-------------------------------+
| epoll     | io_uring          |
|-----------|-------------------|
| 82.2 Gbps | 116.2 Gbps (+41%) |
+-------------------------------+

Pinned to _same_ core:

+-------------------------------+
| epoll     | io_uring          |
|-----------|-------------------|
| 62.6 Gbps | 80.9 Gbps (+29%)  |
+-------------------------------+

=====
Links
=====

Broadcom bnxt support:
[1]: https://lore.kernel.org/netdev/20241003160620.1521626-8-ap420073@gmail.com/

Linux kernel branch including io_uring bits:
[2]: https://github.com/isilence/linux.git zcrx/v13

liburing for testing:
[3]: https://github.com/isilence/liburing.git zcrx/next

kperf for testing:
[4]: https://git.kernel.dk/kperf.git

Changes in v14:
---------------
* Remove net-next from patch subjects
* Cast to struct sockaddr in selftest for bind() and connect()

Changes in v13:
---------------
io_uring:
* Add missing ~IORING_ZCRX_AREA_MASK
* Update documentation
* Selftest changes:
  * Remove ipv4 support
  * Use ethtool() instead of cmd()
net:
* Fix race between io_uring closing and netdev unregister
* Regenerate Netlink YAML

Changes in v12:
---------------
* Check nla_nest_start() errors
* Don't leak a netdev, add missing netdev_put()
* Warn on failed queue restart during close

Changes in v11:
---------------
* Add a shim provider helper for page_pool_set_dma_addr_netmem()
* Drop netdev in ->uninstall, pin struct device instead
* Add net_mp_open_rxq() and net_mp_close_rxq()
* Remove unneeded CFLAGS += -I/usr/include/ in selftest Makefile

Changes in v10:
---------------
* Fix !CONFIG_PAGE_POOL build
* Use acquire/release for RQ in examples
* Fix page_pool_ref_netmem for net_iov
* Move provider helpers / definitions into a new file
* Don't export page_pool_{set,clear}_pp_info, introduce
  net_mp_niov_{set,clear}_page_pool() instead
* Remove devmem.h from net/core/page_pool_user.c
* Add Netdev yaml for io-uring attribute
* Add memory provider ops for filling in Netlink info

Changes in v9:
--------------
* Fail proof against multiple page pools running the same memory provider
* Lock the consumer side of the refill queue
* Move scrub into io_uring exit
* Kill napi_execute
* Kill area init api and export finer grained net helpers, as partial init
  now needs to happen in ->alloc_netmems()
* Separate user refcounting
* Fix copy fallback path math
* Add rodata check to page_pool_init()
* Fix incorrect path in documentation

Changes in v8:
--------------
* Add documentation and selftest
* Use io_uring regions for the refill ring

Changes in v7:
--------------
net:
* Use NAPI_F_PREFER_BUSY_POLL for napi_execute + stylistic changes

Changes in v6:
--------------
Please note: Comparison with TCP_ZEROCOPY_RECEIVE isn't done yet.

net:
* Drop a devmem.h clean up patch
* Migrate to netdev_get_by_index from deprecated API
* Fix !CONFIG_NET_DEVMEM build
* Don't return into the page pool cache directly, use a new helper
* Refactor napi_execute

io_uring:
* Require IORING_RECV_MULTISHOT flag set
* Add unselectable CONFIG_IO_URING_ZCRX
* Pull in latest io_uring changes
* Unexport io_uring_pp_zc_ops

Changes in v5:
--------------
* Rebase on top of merged net_iov + netmem infra
* Decouple net_iov from devmem TCP
* Use netdev queue API to allocate an rx queue
* Minor uAPI enhancements for future extensibility
* QoS improvements with request throttling
Changes in RFC v4:
------------------
* Rebased on top of Mina Almasry's TCP devmem patchset and latest net-next,
  now sharing common infra, e.g.:
  * netmem_t and net_iovs
  * Page pool memory provider
* The registered buffer (rbuf) completion queue, where completions from
  io_recvzc requests were posted, is removed. These now post into the main
  completion queue, using big (32-byte) CQEs. The first 16 bytes are an
  ordinary CQE, while the latter 16 bytes contain the io_uring_rbuf_cqe as
  before. This vastly simplifies the uAPI and removes a level of indirection
  in userspace when looking for payloads.
  * The rbuf refill queue is still needed for userspace to return buffers to
    the kernel.
* Simplified code and uAPI on the io_uring side, particularly io_recvzc() and
  io_zc_rx_recv(). Many unnecessary lines were removed, e.g. extra msg flags,
  readlen, etc.

Changes in RFC v3:
------------------
* Rebased on top of Jakub Kicinski's memory provider API RFC. The ZC pool
  added is now a backend for the memory provider.
* We're also reusing ppiov infrastructure. The refcounting rules stay the
  same, but the refcount is shifted into ppiov->refcount. That lets us
  flexibly manage buffer lifetimes without adding any extra code to the
  common networking paths. It'd also make it easier to support dmabufs and
  device memory in the future.
* io_uring also knows about pages, so ppiovs might unnecessarily break tools
  inspecting data; that can easily be solved later.

Many patches are not for upstream as they depend on work in progress, namely
from Mina:
* struct netmem_t
* Driver ndo commands for Rx queue configs
* struct page_pool_iov and shared pp infra

Changes in RFC v2:
------------------
* Added copy fallback support if userspace memory allocated for ZC Rx runs
  out, or if header splitting or flow steering fails.
* Added veth support for ZC Rx, for testing and demonstration. We will need
  to figure out what driver would be best for such testing functionality in
  the future. Perhaps netdevsim?
* Added socket registration API to io_uring to associate specific sockets
  with ifqs/Rx queues for ZC.
* Added multi-socket support, such that multiple connections can be steered
  into the same hardware Rx queue.
* Added Netbench server/client support.
David Wei (6):
  io_uring/zcrx: add interface queue and refill queue
  io_uring/zcrx: add io_zcrx_area
  io_uring/zcrx: add io_recvzc request
  io_uring/zcrx: set pp memory provider for an rx queue
  net: add documentation for io_uring zcrx
  io_uring/zcrx: add selftest

Pavel Begunkov (5):
  io_uring/zcrx: grab a net device
  io_uring/zcrx: implement zerocopy receive pp memory provider
  io_uring/zcrx: dma-map area for the device
  io_uring/zcrx: throttle receive requests
  io_uring/zcrx: add copy fallback

 Documentation/networking/index.rst            |   1 +
 Documentation/networking/iou-zcrx.rst         | 202 ++++
 Kconfig                                       |   2 +
 include/linux/io_uring_types.h                |   6 +
 include/uapi/linux/io_uring.h                 |  54 +-
 io_uring/KConfig                              |  10 +
 io_uring/Makefile                             |   1 +
 io_uring/io_uring.c                           |   7 +
 io_uring/io_uring.h                           |  10 +
 io_uring/memmap.h                             |   1 +
 io_uring/net.c                                |  74 ++
 io_uring/opdef.c                              |  16 +
 io_uring/register.c                           |   7 +
 io_uring/rsrc.c                               |   2 +-
 io_uring/rsrc.h                               |   1 +
 io_uring/zcrx.c                               | 954 ++++++++++++++++++
 io_uring/zcrx.h                               |  73 ++
 .../selftests/drivers/net/hw/.gitignore       |   2 +
 .../testing/selftests/drivers/net/hw/Makefile |   5 +
 .../selftests/drivers/net/hw/iou-zcrx.c       | 426 ++++++++
 .../selftests/drivers/net/hw/iou-zcrx.py      |  64 ++
 21 files changed, 1916 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/networking/iou-zcrx.rst
 create mode 100644 io_uring/KConfig
 create mode 100644 io_uring/zcrx.c
 create mode 100644 io_uring/zcrx.h
 create mode 100644 tools/testing/selftests/drivers/net/hw/iou-zcrx.c
 create mode 100755 tools/testing/selftests/drivers/net/hw/iou-zcrx.py

Acked-by: Jakub Kicinski