From patchwork Fri Oct 7 21:17:05 2022
From: Jonathan Lemon <jonathan.lemon@gmail.com>
To: io-uring@vger.kernel.org
Subject: [RFC v1 1/9] io_uring: add zctap ifq definition
Date: Fri, 7 Oct 2022 14:17:05 -0700
Message-ID: <20221007211713.170714-2-jonathan.lemon@gmail.com>

Add a structure definition for io_zctap_ifq, for use by lower-level
networking hooks.
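For reference, later patches in this series resolve the u16 id back to
the structure through the xarray added below; a minimal sketch of that
lookup (io_zctap_ifq_find is a hypothetical helper name):

    static struct io_zctap_ifq *io_zctap_ifq_find(struct io_ring_ctx *ctx,
                                                  u16 ifq_id)
    {
            /* zctap_ifq_xa maps the per-ring ifq id to its io_zctap_ifq */
            return xa_load(&ctx->zctap_ifq_xa, ifq_id);
    }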
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/io_uring_types.h | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 677a25d44d7f..680fbf1f34e7 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -323,6 +323,7 @@ struct io_ring_ctx {
 		struct io_mapped_ubuf	*dummy_ubuf;
 		struct io_rsrc_data	*file_data;
 		struct io_rsrc_data	*buf_data;
+		struct xarray		zctap_ifq_xa;
 
 		struct delayed_work	rsrc_put_work;
 		struct llist_head	rsrc_put_llist;
@@ -578,4 +579,12 @@ struct io_overflow_cqe {
 	struct io_uring_cqe cqe;
 };
 
+struct io_zctap_ifq {
+	struct net_device	*dev;
+	struct io_ring_ctx	*ctx;
+	u16			queue_id;
+	u16			id;
+	u16			fill_bgid;
+};
+
 #endif

From patchwork Fri Oct 7 21:17:06 2022
From: Jonathan Lemon <jonathan.lemon@gmail.com>
To: io-uring@vger.kernel.org
Subject: [RFC v1 2/9] netdevice: add SETUP_ZCTAP to the netdev_bpf structure
Date: Fri, 7 Oct 2022 14:17:06 -0700
Message-ID: <20221007211713.170714-3-jonathan.lemon@gmail.com>

This command requests that the networking device set up or tear down
a new interface queue, backed by a region of user-supplied memory.
The queue will be managed by io_uring.
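For illustration, a driver consuming this command might dispatch it
from its ndo_bpf handler roughly as in the sketch below;
mydrv_zctap_enable/disable are hypothetical driver functions, and the
NULL-ifq teardown convention follows the io_uring side added in the
next patch:

    static int mydrv_bpf(struct net_device *dev, struct netdev_bpf *bpf)
    {
            switch (bpf->command) {
            case XDP_SETUP_ZCTAP:
                    /* ifq == NULL requests teardown of the queue */
                    if (!bpf->zct.ifq)
                            return mydrv_zctap_disable(dev, bpf->zct.queue_id);
                    /* queue_id is in/out; the driver may choose the queue */
                    return mydrv_zctap_enable(dev, bpf->zct.ifq,
                                              &bpf->zct.queue_id);
            default:
                    return -EINVAL;
            }
    }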
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/netdevice.h | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 9f42fc871c3b..49ecfc276411 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -979,6 +979,7 @@ enum bpf_netdev_command {
 	BPF_OFFLOAD_MAP_ALLOC,
 	BPF_OFFLOAD_MAP_FREE,
 	XDP_SETUP_XSK_POOL,
+	XDP_SETUP_ZCTAP,
 };
 
 struct bpf_prog_offload_ops;
@@ -1017,6 +1018,11 @@ struct netdev_bpf {
 			struct xsk_buff_pool *pool;
 			u16 queue_id;
 		} xsk;
+		/* XDP_SETUP_ZCTAP */
+		struct {
+			struct io_zctap_ifq *ifq;
+			u16 queue_id;
+		} zct;
 	};
 };

From patchwork Fri Oct 7 21:17:07 2022
From: Jonathan Lemon <jonathan.lemon@gmail.com>
To: io-uring@vger.kernel.org
Subject: [RFC v1 3/9] io_uring: add register ifq opcode
Date: Fri, 7 Oct 2022 14:17:07 -0700
Message-ID: <20221007211713.170714-4-jonathan.lemon@gmail.com>

Add initial support for hooking zero-copy interface queues in to
io_uring. This command requests a user-managed queue from the
specified network device. This only includes the register opcode;
unregistration is currently done implicitly when the ring is removed.
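From userspace, registration might look like the following sketch (a
raw syscall, since liburing has no wrapper for this RFC;
IORING_REGISTER_IFQ and struct io_uring_ifq_req come from the uapi
additions below):

    #include <net/if.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <linux/io_uring.h>     /* patched uapi header */

    static int register_ifq(int ring_fd, const char *ifname,
                            __u16 fill_bgid, __u16 *ifq_id)
    {
            struct io_uring_ifq_req req = {
                    .ifindex   = if_nametoindex(ifname),
                    .queue_id  = 0,         /* driver may rewrite this */
                    .fill_bgid = fill_bgid, /* group backing the fill queue */
            };
            int ret;

            /* nr_args must be 1, per the registration check below */
            ret = syscall(__NR_io_uring_register, ring_fd,
                          IORING_REGISTER_IFQ, &req, 1);
            if (ret == 0)
                    *ifq_id = req.ifq_id;   /* kernel-assigned ifq id */
            return ret;
    }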
Signed-off-by: Jonathan Lemon --- include/uapi/linux/io_uring.h | 14 ++++ io_uring/Makefile | 3 +- io_uring/io_uring.c | 10 +++ io_uring/zctap.c | 146 ++++++++++++++++++++++++++++++++++ io_uring/zctap.h | 11 +++ 5 files changed, 183 insertions(+), 1 deletion(-) create mode 100644 io_uring/zctap.c create mode 100644 io_uring/zctap.h diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 6b83177fd41d..bc5108d65c0a 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -473,6 +473,9 @@ enum { /* register a range of fixed file slots for automatic slot allocation */ IORING_REGISTER_FILE_ALLOC_RANGE = 25, + /* register a network ifq for zerocopy RX */ + IORING_REGISTER_IFQ = 26, + /* this goes last */ IORING_REGISTER_LAST }; @@ -649,6 +652,17 @@ struct io_uring_recvmsg_out { __u32 flags; }; +/* + * Argument for IORING_REGISTER_IFQ + */ +struct io_uring_ifq_req { + __u32 ifindex; + __u16 queue_id; + __u16 ifq_id; + __u16 fill_bgid; + __u16 __pad[3]; +}; + #ifdef __cplusplus } #endif diff --git a/io_uring/Makefile b/io_uring/Makefile index 8cc8e5387a75..9d87e2e45ef9 100644 --- a/io_uring/Makefile +++ b/io_uring/Makefile @@ -7,5 +7,6 @@ obj-$(CONFIG_IO_URING) += io_uring.o xattr.o nop.o fs.o splice.o \ openclose.o uring_cmd.o epoll.o \ statx.o net.o msg_ring.o timeout.o \ sqpoll.o fdinfo.o tctx.o poll.o \ - cancel.o kbuf.o rsrc.o rw.o opdef.o notif.o + cancel.o kbuf.o rsrc.o rw.o opdef.o \ + notif.o zctap.o obj-$(CONFIG_IO_WQ) += io-wq.o diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c index b9640ad5069f..8dd988b33af0 100644 --- a/io_uring/io_uring.c +++ b/io_uring/io_uring.c @@ -91,6 +91,7 @@ #include "cancel.h" #include "net.h" #include "notif.h" +#include "zctap.h" #include "timeout.h" #include "poll.h" @@ -321,6 +322,7 @@ static __cold struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p) INIT_WQ_LIST(&ctx->locked_free_list); INIT_DELAYED_WORK(&ctx->fallback_work, io_fallback_req_func); INIT_WQ_LIST(&ctx->submit_state.compl_reqs); + xa_init(&ctx->zctap_ifq_xa); return ctx; err: kfree(ctx->dummy_ubuf); @@ -2639,6 +2641,8 @@ static __cold void io_ring_ctx_wait_and_kill(struct io_ring_ctx *ctx) __io_cqring_overflow_flush(ctx, true); xa_for_each(&ctx->personalities, index, creds) io_unregister_personality(ctx, index); + xa_for_each(&ctx->zctap_ifq_xa, index, creds) + io_unregister_zctap_ifq(ctx, index); if (ctx->rings) io_poll_remove_all(ctx, NULL, true); mutex_unlock(&ctx->uring_lock); @@ -3839,6 +3843,12 @@ static int __io_uring_register(struct io_ring_ctx *ctx, unsigned opcode, break; ret = io_register_file_alloc_range(ctx, arg); break; + case IORING_REGISTER_IFQ: + ret = -EINVAL; + if (!arg || nr_args != 1) + break; + ret = io_register_ifq(ctx, arg); + break; default: ret = -EINVAL; break; diff --git a/io_uring/zctap.c b/io_uring/zctap.c new file mode 100644 index 000000000000..41feb76b7a35 --- /dev/null +++ b/io_uring/zctap.c @@ -0,0 +1,146 @@ +// SPDX-License-Identifier: GPL-2.0 +#include +#include +#include +#include +#include +#include +#include + +#include + +#include "io_uring.h" +#include "zctap.h" + +static DEFINE_XARRAY_ALLOC1(io_zctap_ifq_xa); + +typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf); + +static int __io_queue_mgmt(struct net_device *dev, struct io_zctap_ifq *ifq, + u16 *queue_id) +{ + struct netdev_bpf cmd; + bpf_op_t ndo_bpf; + int err; + + ndo_bpf = dev->netdev_ops->ndo_bpf; + if (!ndo_bpf) + return -EINVAL; + + cmd.command = XDP_SETUP_ZCTAP; + cmd.zct.ifq = ifq; + 
cmd.zct.queue_id = *queue_id;
+
+	err = ndo_bpf(dev, &cmd);
+	if (!err)
+		*queue_id = cmd.zct.queue_id;
+
+	return err;
+}
+
+static int io_open_zctap_ifq(struct io_zctap_ifq *ifq, u16 *queue_id)
+{
+	return __io_queue_mgmt(ifq->dev, ifq, queue_id);
+}
+
+static int io_close_zctap_ifq(struct io_zctap_ifq *ifq, u16 queue_id)
+{
+	return __io_queue_mgmt(ifq->dev, NULL, &queue_id);
+}
+
+static struct io_zctap_ifq *io_zctap_ifq_alloc(void)
+{
+	struct io_zctap_ifq *ifq;
+
+	ifq = kzalloc(sizeof(*ifq), GFP_KERNEL);
+	if (!ifq)
+		return NULL;
+
+	ifq->queue_id = -1;
+	return ifq;
+}
+
+static void io_zctap_ifq_free(struct io_zctap_ifq *ifq)
+{
+	if (ifq->queue_id != -1)
+		io_close_zctap_ifq(ifq, ifq->queue_id);
+	if (ifq->dev)
+		dev_put(ifq->dev);
+	if (ifq->id)
+		xa_erase(&io_zctap_ifq_xa, ifq->id);
+	kfree(ifq);
+}
+
+int io_register_ifq(struct io_ring_ctx *ctx,
+		    struct io_uring_ifq_req __user *arg)
+{
+	struct io_uring_ifq_req req;
+	struct io_zctap_ifq *ifq;
+	int id, err;
+
+	if (copy_from_user(&req, arg, sizeof(req)))
+		return -EFAULT;
+
+	ifq = io_zctap_ifq_alloc();
+	if (!ifq)
+		return -ENOMEM;
+	ifq->ctx = ctx;
+	ifq->fill_bgid = req.fill_bgid;
+
+	err = -ENODEV;
+	ifq->dev = dev_get_by_index(&init_net, req.ifindex);
+	if (!ifq->dev)
+		goto out;
+
+	err = io_open_zctap_ifq(ifq, &req.queue_id);
+	if (err)
+		goto out;
+	ifq->queue_id = req.queue_id;
+
+	/* aka idr */
+	err = xa_alloc(&io_zctap_ifq_xa, &id, ifq,
+		       XA_LIMIT(1, PAGE_SIZE - 1), GFP_KERNEL);
+	if (err)
+		goto out;
+	ifq->id = id;
+	req.ifq_id = id;
+
+	err = xa_err(xa_store(&ctx->zctap_ifq_xa, id, ifq, GFP_KERNEL));
+	if (err)
+		goto out;
+
+	if (copy_to_user(arg, &req, sizeof(req))) {
+		xa_erase(&ctx->zctap_ifq_xa, id);
+		err = -EFAULT;
+		goto out;
+	}
+
+	return 0;
+
+out:
+	io_zctap_ifq_free(ifq);
+	return err;
+}
+
+int io_unregister_zctap_ifq(struct io_ring_ctx *ctx, unsigned long index)
+{
+	struct io_zctap_ifq *ifq;
+
+	ifq = xa_erase(&ctx->zctap_ifq_xa, index);
+	if (!ifq)
+		return -EINVAL;
+
+	io_zctap_ifq_free(ifq);
+	return 0;
+}
+
+int io_unregister_ifq(struct io_ring_ctx *ctx,
+		      struct io_uring_ifq_req __user *arg)
+{
+	struct io_uring_ifq_req req;
+
+	if (copy_from_user(&req, arg, sizeof(req)))
+		return -EFAULT;
+
+	return io_unregister_zctap_ifq(ctx, req.ifq_id);
+}
diff --git a/io_uring/zctap.h b/io_uring/zctap.h
new file mode 100644
index 000000000000..bda15d218fe3
--- /dev/null
+++ b/io_uring/zctap.h
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef IOU_ZCTAP_H
+#define IOU_ZCTAP_H
+
+int io_register_ifq(struct io_ring_ctx *ctx,
+		    struct io_uring_ifq_req __user *arg);
+int io_unregister_ifq(struct io_ring_ctx *ctx,
+		      struct io_uring_ifq_req __user *arg);
+int io_unregister_zctap_ifq(struct io_ring_ctx *ctx, unsigned long index);
+
+#endif
From patchwork Fri Oct 7 21:17:08 2022
From: Jonathan Lemon <jonathan.lemon@gmail.com>
To: io-uring@vger.kernel.org
Subject: [RFC v1 4/9] io_uring: add provide_ifq_region opcode
Date: Fri, 7 Oct 2022 14:17:08 -0700
Message-ID: <20221007211713.170714-5-jonathan.lemon@gmail.com>

This opcode takes part or all of a memory region that was previously
registered with io_uring, and assigns it as the backing store for the
specified ifq. The entire region is registered instead of providing
individual buffers, as this allows the hardware to select the optimal
buffer size for incoming packets.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/io_uring_types.h |  1 +
 include/uapi/linux/io_uring.h  |  1 +
 io_uring/opdef.c               |  9 ++++
 io_uring/zctap.c               | 96 ++++++++++++++++++++++++++++++++++
 io_uring/zctap.h               |  4 ++
 5 files changed, 111 insertions(+)

diff --git a/include/linux/io_uring_types.h b/include/linux/io_uring_types.h
index 680fbf1f34e7..56257e8afd0a 100644
--- a/include/linux/io_uring_types.h
+++ b/include/linux/io_uring_types.h
@@ -582,6 +582,7 @@ struct io_overflow_cqe {
 struct io_zctap_ifq {
 	struct net_device	*dev;
 	struct io_ring_ctx	*ctx;
+	void			*region;	/* XXX relocate?
*/ u16 queue_id; u16 id; u16 fill_bgid; diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index bc5108d65c0a..3b392f8270dc 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -206,6 +206,7 @@ enum io_uring_op { IORING_OP_SOCKET, IORING_OP_URING_CMD, IORING_OP_SEND_ZC, + IORING_OP_PROVIDE_IFQ_REGION, /* this goes last, obviously */ IORING_OP_LAST, diff --git a/io_uring/opdef.c b/io_uring/opdef.c index c4dddd0fd709..bf28c43117c3 100644 --- a/io_uring/opdef.c +++ b/io_uring/opdef.c @@ -33,6 +33,7 @@ #include "poll.h" #include "cancel.h" #include "rw.h" +#include "zctap.h" static int io_no_issue(struct io_kiocb *req, unsigned int issue_flags) { @@ -488,6 +489,14 @@ const struct io_op_def io_op_defs[] = { .prep = io_eopnotsupp_prep, #endif }, + [IORING_OP_PROVIDE_IFQ_REGION] = { + .audit_skip = 1, + .iopoll = 1, + .buffer_select = 1, + .name = "PROVIDE_IFQ_REGION", + .prep = io_provide_ifq_region_prep, + .issue = io_provide_ifq_region, + }, }; const char *io_uring_get_opcode(u8 opcode) diff --git a/io_uring/zctap.c b/io_uring/zctap.c index 41feb76b7a35..728f7c938b7b 100644 --- a/io_uring/zctap.c +++ b/io_uring/zctap.c @@ -6,11 +6,14 @@ #include #include #include +#include #include #include "io_uring.h" #include "zctap.h" +#include "rsrc.h" +#include "kbuf.h" static DEFINE_XARRAY_ALLOC1(io_zctap_ifq_xa); @@ -144,3 +147,96 @@ int io_unregister_ifq(struct io_ring_ctx *ctx, return io_unregister_zctap_ifq(ctx, req.ifq_id); } + +struct io_ifq_region { + struct file *file; + struct io_zctap_ifq *ifq; + __u64 addr; + __u32 len; + __u32 bgid; +}; + +struct ifq_region { + struct io_mapped_ubuf *imu; + u64 start; + u64 end; + int count; + int imu_idx; + int nr_pages; + struct page *page[]; +}; + +int io_provide_ifq_region_prep(struct io_kiocb *req, + const struct io_uring_sqe *sqe) +{ + struct io_ifq_region *r = io_kiocb_to_cmd(req, struct io_ifq_region); + struct io_ring_ctx *ctx = req->ctx; + struct io_mapped_ubuf *imu; + u32 index; + + if (!(req->flags & REQ_F_BUFFER_SELECT)) + return -EINVAL; + + r->addr = READ_ONCE(sqe->addr); + r->len = READ_ONCE(sqe->len); + index = READ_ONCE(sqe->fd); + + if (!r->addr || r->addr & ~PAGE_MASK) + return -EFAULT; + + if (!r->len || r->len & ~PAGE_MASK) + return -EFAULT; + + r->ifq = xa_load(&ctx->zctap_ifq_xa, index); + if (!r->ifq) + return -EFAULT; + + /* XXX for now, only allow one region per ifq. 
*/
+	if (r->ifq->region)
+		return -EFAULT;
+
+	if (unlikely(req->buf_index >= ctx->nr_user_bufs))
+		return -EFAULT;
+	index = array_index_nospec(req->buf_index, ctx->nr_user_bufs);
+	imu = ctx->user_bufs[index];
+
+	if (r->addr < imu->ubuf || r->addr + r->len > imu->ubuf_end)
+		return -EFAULT;
+	req->imu = imu;
+
+	io_req_set_rsrc_node(req, ctx, 0);
+
+	return 0;
+}
+
+int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
+{
+	struct io_ifq_region *r = io_kiocb_to_cmd(req, struct io_ifq_region);
+	struct ifq_region *ifr;
+	int i, idx, nr_pages;
+	struct page *page;
+
+	nr_pages = r->len >> PAGE_SHIFT;
+	idx = (r->addr - req->imu->ubuf) >> PAGE_SHIFT;
+
+	ifr = kvmalloc(struct_size(ifr, page, nr_pages), GFP_KERNEL);
+	if (!ifr)
+		return -ENOMEM;
+
+	ifr->nr_pages = nr_pages;
+	ifr->imu_idx = idx;
+	ifr->count = nr_pages;
+	ifr->imu = req->imu;
+	ifr->start = r->addr;
+	ifr->end = r->addr + r->len;
+
+	for (i = 0; i < nr_pages; i++, idx++) {
+		page = req->imu->bvec[idx].bv_page;
+		ifr->page[i] = page;
+	}
+
+	WRITE_ONCE(r->ifq->region, ifr);
+
+	return 0;
+}
diff --git a/io_uring/zctap.h b/io_uring/zctap.h
index bda15d218fe3..709c803220f4 100644
--- a/io_uring/zctap.h
+++ b/io_uring/zctap.h
@@ -8,4 +8,8 @@ int io_unregister_ifq(struct io_ring_ctx *ctx,
 		      struct io_uring_ifq_req __user *arg);
 int io_unregister_zctap_ifq(struct io_ring_ctx *ctx, unsigned long index);
 
+int io_provide_ifq_region_prep(struct io_kiocb *req,
+			       const struct io_uring_sqe *sqe);
+int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags);
+
 #endif
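On the userspace side, an SQE for this opcode might be prepared as in
the sketch below; the buf_group field carrying the fixed-buffer index
is a best-effort reading of the prep code above (the opcode piggybacks
on the buffer-select machinery), not a settled ABI:

    #include <string.h>
    #include <linux/io_uring.h>     /* patched uapi header */

    static void prep_provide_region(struct io_uring_sqe *sqe, __u16 ifq_id,
                                    __u64 addr, __u32 len, __u16 buf_index)
    {
            memset(sqe, 0, sizeof(*sqe));
            sqe->opcode    = IORING_OP_PROVIDE_IFQ_REGION;
            sqe->addr      = addr;          /* page aligned */
            sqe->len       = len;           /* multiple of PAGE_SIZE */
            sqe->fd        = ifq_id;        /* ifq to attach the region to */
            sqe->buf_group = buf_index;     /* registered buffer to carve from */
            sqe->flags     = IOSQE_BUFFER_SELECT;
    }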
From patchwork Fri Oct 7 21:17:09 2022
From: Jonathan Lemon <jonathan.lemon@gmail.com>
To: io-uring@vger.kernel.org
Subject: [RFC v1 5/9] io_uring: Add io_uring zctap iov structure and helpers
Date: Fri, 7 Oct 2022 14:17:09 -0700
Message-ID: <20221007211713.170714-6-jonathan.lemon@gmail.com>

With networking zero-copy receive, the incoming data is placed directly
into user-supplied buffers. Instead of returning the buffer address,
return the buffer group id and buffer id, and let the application
figure out the base address.

Add helpers for storing and retrieving the encoding, which is stored
in the page_private field. This will be used in the zerocopy RX
routine, when handling pages from skb fragments.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/uapi/linux/io_uring.h | 10 +++++++++
 io_uring/zctap.c              | 39 ++++++++++++++++++++++++++++++++++-
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 3b392f8270dc..145d55280919 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -664,6 +664,16 @@ struct io_uring_ifq_req {
 	__u16	__pad[3];
 };
 
+struct io_uring_zctap_iov {
+	__u32	off;
+	__u32	len;
+	__u16	bgid;
+	__u16	bid;
+	__u16	ifq_id;
+	__u16	resv;
+};
+
+
 #ifdef __cplusplus
 }
 #endif
diff --git a/io_uring/zctap.c b/io_uring/zctap.c
index 728f7c938b7b..58b4c5417650 100644
--- a/io_uring/zctap.c
+++ b/io_uring/zctap.c
@@ -19,6 +19,26 @@ static DEFINE_XARRAY_ALLOC1(io_zctap_ifq_xa);
 
 typedef int (*bpf_op_t)(struct net_device *dev, struct netdev_bpf *bpf);
 
+static u64 zctap_page_info(u16 region_id, u16 pgid, u16 ifq_id)
+{
+	return (u64)region_id << 32 | (u64)pgid << 16 | ifq_id;
+}
+
+static u16 zctap_page_region_id(const struct page *page)
+{
+	return (page_private(page) >> 32) & 0xffff;
+}
+
+static u16 zctap_page_id(const struct page *page)
+{
+	return (page_private(page) >> 16) & 0xffff;
+}
+
+static u16 zctap_page_ifq_id(const struct page *page)
+{
+	return page_private(page) & 0xffff;
+}
+
 static int __io_queue_mgmt(struct net_device *dev, struct io_zctap_ifq *ifq,
 			   u16 *queue_id)
 {
@@ -213,8 +233,9 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
 {
 	struct io_ifq_region *r = io_kiocb_to_cmd(req, struct io_ifq_region);
 	struct ifq_region *ifr;
-	int i, idx, nr_pages;
+	int i, id, idx, nr_pages;
 	struct page *page;
+	u64 info;
 
 	nr_pages = r->len >> PAGE_SHIFT;
 	idx = (r->addr - req->imu->ubuf) >> PAGE_SHIFT;
@@ -231,12 +252,28 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
 	ifr->start = r->addr;
 	ifr->end = r->addr + r->len;
 
+	id = r->ifq->id;
 	for (i = 0; i < nr_pages; i++, idx++) {
 		page = req->imu->bvec[idx].bv_page;
+		if (PagePrivate(page))
+			goto out;
+		SetPagePrivate(page);
+		info = zctap_page_info(r->bgid, idx + i, id);
+		set_page_private(page, info);
 		ifr->page[i] = page;
 	}
 
 	WRITE_ONCE(r->ifq->region, ifr);
 
 	return 0;
+out:
+	while (i--) {
+		page = req->imu->bvec[idx + i].bv_page;
+		ClearPagePrivate(page);
+		set_page_private(page, 0);
+	}
+
+	kvfree(ifr);
+
+	return -EEXIST;
+}
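The encoding above can be checked in isolation; a standalone userspace
illustration of the bit layout (bits 63:48 unused, 47:32 region/bgid,
31:16 page id, 15:0 ifq id):

    #include <assert.h>
    #include <stdint.h>

    /* userspace mirror of zctap_page_info() and the decode helpers */
    static uint64_t pack(uint16_t region_id, uint16_t pgid, uint16_t ifq_id)
    {
            return (uint64_t)region_id << 32 | (uint64_t)pgid << 16 | ifq_id;
    }

    int main(void)
    {
            uint64_t info = pack(3, 42, 1);

            assert(((info >> 32) & 0xffff) == 3);   /* region id */
            assert(((info >> 16) & 0xffff) == 42);  /* page id */
            assert((info & 0xffff) == 1);           /* ifq id */
            return 0;
    }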
From patchwork Fri Oct 7 21:17:10 2022
From: Jonathan Lemon <jonathan.lemon@gmail.com>
To: io-uring@vger.kernel.org
Subject: [RFC v1 6/9] io_uring: introduce reference tracking for user pages.
Date: Fri, 7 Oct 2022 14:17:10 -0700
Message-ID: <20221007211713.170714-7-jonathan.lemon@gmail.com>

This is currently a WIP.

If only part of a page is used by a skb fragment, and that fragment is
then provided to the user, the page should not be reused by the kernel
until all of its sub-page fragments are no longer in use. If only full
pages are used (and not sub-page fragments), then this code shouldn't
be needed.
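To make the intent concrete, the expected flow for a 4KB page carved
into two 2KB fragments would be roughly (hypothetical call sites, using
the helpers this patch adds):

    io_add_page_uref(ifr, pgid);        /* fragment A handed to user */
    io_add_page_uref(ifr, pgid);        /* fragment B handed to user */
    ...
    io_put_page_last_uref(ifr, addr);   /* A returned -> false */
    io_put_page_last_uref(ifr, addr);   /* B returned -> true: reusable */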
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 io_uring/zctap.c | 34 ++++++++++++++++++++++++++++++++++
 1 file changed, 34 insertions(+)

diff --git a/io_uring/zctap.c b/io_uring/zctap.c
index 58b4c5417650..9db3421fb9fa 100644
--- a/io_uring/zctap.c
+++ b/io_uring/zctap.c
@@ -183,9 +183,36 @@ struct ifq_region {
 	int			count;
 	int			imu_idx;
 	int			nr_pages;
+	u8			*page_uref;
 	struct page		*page[];
 };
 
+static void io_add_page_uref(struct ifq_region *ifr, u16 pgid)
+{
+	if (WARN_ON(!ifr))
+		return;
+
+	if (WARN_ON(pgid < ifr->imu_idx))
+		return;
+
+	ifr->page_uref[pgid - ifr->imu_idx]++;
+}
+
+static bool io_put_page_last_uref(struct ifq_region *ifr, u64 addr)
+{
+	int idx;
+
+	if (WARN_ON(addr < ifr->start || addr > ifr->end))
+		return false;
+
+	idx = (addr - ifr->start) >> PAGE_SHIFT;
+
+	if (WARN_ON(!ifr->page_uref[idx]))
+		return false;
+
+	return --ifr->page_uref[idx] == 0;
+}
+
 int io_provide_ifq_region_prep(struct io_kiocb *req,
 			       const struct io_uring_sqe *sqe)
 {
@@ -244,6 +271,11 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
 	if (!ifr)
 		return -ENOMEM;
 
+	ifr->page_uref = kvmalloc_array(nr_pages, sizeof(u8), GFP_KERNEL);
+	if (!ifr->page_uref) {
+		kvfree(ifr);
+		return -ENOMEM;
+	}
 
 	ifr->nr_pages = nr_pages;
 	ifr->imu_idx = idx;
@@ -261,6 +293,7 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
 		info = zctap_page_info(r->bgid, idx + i, id);
 		set_page_private(page, info);
 		ifr->page[i] = page;
+		ifr->page_uref[i] = 0;
 	}
 
 	WRITE_ONCE(r->ifq->region, ifr);
@@ -273,6 +306,7 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
 		set_page_private(page, 0);
 	}
 
+	kvfree(ifr->page_uref);
 	kvfree(ifr);
 
 	return -EEXIST;
From patchwork Fri Oct 7 21:17:11 2022
From: Jonathan Lemon <jonathan.lemon@gmail.com>
To: io-uring@vger.kernel.org
Subject: [RFC v1 7/9] page_pool: add page allocation and free hooks.
Date: Fri, 7 Oct 2022 14:17:11 -0700
Message-ID: <20221007211713.170714-8-jonathan.lemon@gmail.com>

In order to allow for user-allocated page backing, add hooks to the
page pool so pages can be obtained from, and released back to, a
user-supplied provider instead of the system page allocator.

skbs are marked with skb_mark_for_recycle() if they contain pages
belonging to a page pool, and put_page() will deliver the pages back
to the pool instead of freeing them to the system page allocator.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/net/page_pool.h |  6 ++++++
 net/core/page_pool.c    | 41 ++++++++++++++++++++++++++++++++++-------
 2 files changed, 40 insertions(+), 7 deletions(-)

diff --git a/include/net/page_pool.h b/include/net/page_pool.h
index 813c93499f20..85c8423f9a7e 100644
--- a/include/net/page_pool.h
+++ b/include/net/page_pool.h
@@ -82,6 +82,12 @@ struct page_pool_params {
 	unsigned int	offset;  /* DMA addr offset */
 	void (*init_callback)(struct page *page, void *arg);
 	void *init_arg;
+	struct page *(*alloc_pages)(void *arg, int nid, gfp_t gfp,
+				    unsigned int order);
+	unsigned long (*alloc_bulk)(void *arg, gfp_t gfp, int nid,
+				    unsigned long nr_pages,
+				    struct page **page_array);
+	void (*put_page)(void *arg, struct page *page);
 };
 
 #ifdef CONFIG_PAGE_POOL_STATS
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 9b203d8660e4..21c6ee97bc7f 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -342,19 +342,47 @@ static void page_pool_clear_pp_info(struct page *page)
 	page->pp = NULL;
 }
 
+/* hooks to either page provider or system page allocator */
+static void page_pool_mm_put_page(struct page_pool *pool, struct page *page)
+{
+	if (pool->p.put_page)
+		return pool->p.put_page(pool->p.init_arg, page);
+	put_page(page);
+}
+
+static unsigned long page_pool_mm_alloc_bulk(struct page_pool *pool,
+					     gfp_t gfp,
+					     unsigned long nr_pages)
+{
+	if (pool->p.alloc_bulk)
+		return pool->p.alloc_bulk(pool->p.init_arg, gfp,
+					  pool->p.nid, nr_pages,
+					  pool->alloc.cache);
+	return alloc_pages_bulk_array_node(gfp, pool->p.nid,
+					   nr_pages, pool->alloc.cache);
+}
+
+static struct page *page_pool_mm_alloc(struct page_pool *pool, gfp_t gfp)
+{
+	if (pool->p.alloc_pages)
+		return pool->p.alloc_pages(pool->p.init_arg, pool->p.nid,
+					   gfp, pool->p.order);
+	return alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+}
+
 static struct page *__page_pool_alloc_page_order(struct page_pool *pool,
 						 gfp_t gfp)
 {
 	struct page *page;
 
 	gfp |= __GFP_COMP;
-	page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
+	page = page_pool_mm_alloc(pool, gfp);
 	if (unlikely(!page))
 		return NULL;
 
 	if ((pool->p.flags & PP_FLAG_DMA_MAP) &&
 	    unlikely(!page_pool_dma_map(pool, page))) {
-		put_page(page);
+		page_pool_mm_put_page(pool, page);
 		return NULL;
 	}
 
@@ -389,8 +417,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 
 	/* Mark empty alloc.cache slots "empty" for alloc_pages_bulk_array */
 	memset(&pool->alloc.cache, 0, sizeof(void *) * bulk);
-	nr_pages = alloc_pages_bulk_array_node(gfp, pool->p.nid, bulk,
-					       pool->alloc.cache);
+	nr_pages = page_pool_mm_alloc_bulk(pool, gfp, bulk);
 	if (unlikely(!nr_pages))
 		return NULL;
 
@@ -401,7 +428,7 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
 		page = pool->alloc.cache[i];
 		if ((pp_flags & PP_FLAG_DMA_MAP) &&
 		    unlikely(!page_pool_dma_map(pool, page))) {
-			put_page(page);
+			page_pool_mm_put_page(pool, page);
 			continue;
 		}
 
@@ -501,7 +528,7 @@ static void page_pool_return_page(struct page_pool *pool, struct page *page)
 {
 	page_pool_release_page(pool, page);
 
-	put_page(page);
+	page_pool_mm_put_page(pool, page);
 	/* An optimization would be to call __free_pages(page, pool->p.order)
 	 * knowing page is not part of page-cache (thus avoiding a
 	 * __page_cache_release() call).
@@ -593,7 +620,7 @@ __page_pool_put_page(struct page_pool *pool, struct page *page,
 		recycle_stat_inc(pool, released_refcnt);
 		/* Do not replace this with page_pool_return_page() */
 		page_pool_release_page(pool, page);
-		put_page(page);
+		page_pool_mm_put_page(pool, page);
 
 		return NULL;
 	}
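As an illustration of how a driver might wire a provider into these
hooks (the io_zctap_* functions are the ones the next patch adds; the
other page_pool_params fields are elided and driver-specific):

    static struct page *zctap_pp_alloc(void *arg, int nid, gfp_t gfp,
                                       unsigned int order)
    {
            return io_zctap_ifq_get_page(arg, order);
    }

    static unsigned long zctap_pp_alloc_bulk(void *arg, gfp_t gfp, int nid,
                                             unsigned long nr_pages,
                                             struct page **page_array)
    {
            return io_zctap_ifq_get_bulk(arg, nr_pages, page_array);
    }

    static void zctap_pp_put_page(void *arg, struct page *page)
    {
            io_zctap_ifq_put_page(arg, page);
    }

    /* in driver queue setup: */
    struct page_pool_params pp_params = {
            /* ... usual fields: order, pool_size, nid, dev, dma_dir ... */
            .init_arg    = ifq,     /* also passed to the hooks as 'arg' */
            .alloc_pages = zctap_pp_alloc,
            .alloc_bulk  = zctap_pp_alloc_bulk,
            .put_page    = zctap_pp_put_page,
    };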
From patchwork Fri Oct 7 21:17:12 2022
From: Jonathan Lemon <jonathan.lemon@gmail.com>
To: io-uring@vger.kernel.org
Subject: [RFC v1 8/9] io_uring: provide functions for the page_pool.
Date: Fri, 7 Oct 2022 14:17:12 -0700
Message-ID: <20221007211713.170714-9-jonathan.lemon@gmail.com>

These functions are called by the page_pool in order to refill the
pool with user-supplied pages, or to return excess pages from the
pool. If no pages are present in the region cache, then an attempt
is made to obtain more pages from the interface fill queue.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
---
 include/linux/io_uring.h | 24 ++++++++++++
 io_uring/kbuf.c          | 13 +++++++
 io_uring/kbuf.h          |  2 +
 io_uring/zctap.c         | 82 ++++++++++++++++++++++++++++++++++++++++
 4 files changed, 121 insertions(+)

diff --git a/include/linux/io_uring.h b/include/linux/io_uring.h
index 4a2f6cc5a492..b92e65e0a469 100644
--- a/include/linux/io_uring.h
+++ b/include/linux/io_uring.h
@@ -37,6 +37,14 @@ void __io_uring_free(struct task_struct *tsk);
 void io_uring_unreg_ringfd(void);
 const char *io_uring_get_opcode(u8 opcode);
 
+struct io_zctap_ifq;
+struct page *io_zctap_ifq_get_page(struct io_zctap_ifq *ifq,
+				   unsigned int order);
+unsigned long io_zctap_ifq_get_bulk(struct io_zctap_ifq *ifq,
+				    unsigned long nr_pages,
+				    struct page **page_array);
+bool io_zctap_ifq_put_page(struct io_zctap_ifq *ifq, struct page *page);
+
 static inline void io_uring_files_cancel(void)
 {
 	if (current->io_uring) {
@@ -80,6 +88,22 @@ static inline const char *io_uring_get_opcode(u8 opcode)
 {
 	return "";
 }
+static inline struct page *io_zctap_ifq_get_page(struct io_zctap_ifq *ifq,
+						 unsigned int order)
+{
+	return NULL;
+}
+static inline unsigned long io_zctap_ifq_get_bulk(struct io_zctap_ifq *ifq,
+						  unsigned long nr_pages,
+						  struct page **page_array)
+{
+	return 0;
+}
+static inline bool io_zctap_ifq_put_page(struct io_zctap_ifq *ifq,
+					 struct page *page)
+{
+	return false;
+}
+
 #endif
 
 #endif
diff --git a/io_uring/kbuf.c b/io_uring/kbuf.c
index 25cd724ade18..caae2755e3d5 100644
--- a/io_uring/kbuf.c
+++ b/io_uring/kbuf.c
@@ -188,6 +188,19 @@ void __user *io_buffer_select(struct io_kiocb *req, size_t *len,
 	return ret;
 }
 
+/* XXX May be called from the driver, in napi context.
 */
+u64 io_zctap_buffer(struct io_kiocb *req, size_t *len)
+{
+	struct io_ring_ctx *ctx = req->ctx;
+	struct io_buffer_list *bl;
+	void __user *ret = NULL;
+
+	bl = io_buffer_get_list(ctx, req->buf_index);
+	if (likely(bl))
+		ret = io_ring_buffer_select(req, len, bl, IO_URING_F_UNLOCKED);
+	return (u64)ret;
+}
+
 static __cold int io_init_bl_list(struct io_ring_ctx *ctx)
 {
 	int i;
diff --git a/io_uring/kbuf.h b/io_uring/kbuf.h
index 746fbf31a703..1379e0e9f870 100644
--- a/io_uring/kbuf.h
+++ b/io_uring/kbuf.h
@@ -50,6 +50,8 @@ unsigned int __io_put_kbuf(struct io_kiocb *req, unsigned issue_flags);
 
 void io_kbuf_recycle_legacy(struct io_kiocb *req, unsigned issue_flags);
 
+u64 io_zctap_buffer(struct io_kiocb *req, size_t *len);
+
 static inline void io_kbuf_recycle_ring(struct io_kiocb *req)
 {
 	/*
diff --git a/io_uring/zctap.c b/io_uring/zctap.c
index 9db3421fb9fa..8bebe7c36c82 100644
--- a/io_uring/zctap.c
+++ b/io_uring/zctap.c
@@ -311,3 +311,85 @@ int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags)
 
 	return -EEXIST;
 }
+
+/* gets a user-supplied buffer from the fill queue */
+static struct page *io_zctap_get_buffer(struct io_zctap_ifq *ifq)
+{
+	struct io_kiocb req = {
+		.ctx = ifq->ctx,
+		.buf_index = ifq->fill_bgid,
+	};
+	struct io_mapped_ubuf *imu;
+	struct ifq_region *ifr;
+	size_t len;
+	u64 addr;
+	int idx;
+
+	len = 0;
+	ifr = ifq->region;
+	imu = ifr->imu;
+
+	addr = io_zctap_buffer(&req, &len);
+	if (!addr)
+		goto fail;
+
+	/* XXX poor man's implementation of io_import_fixed */
+
+	if (addr < ifr->start || addr + len > ifr->end)
+		goto fail;
+
+	idx = (addr - ifr->start) >> PAGE_SHIFT;
+
+	return imu->bvec[ifr->imu_idx + idx].bv_page;
+
+fail:
+	/* warn and just drop buffer */
+	WARN_RATELIMIT(1, "buffer addr %llx invalid", addr);
+	return NULL;
+}
+
+struct page *io_zctap_ifq_get_page(struct io_zctap_ifq *ifq,
+				   unsigned int order)
+{
+	struct ifq_region *ifr = ifq->region;
+
+	if (WARN_RATELIMIT(order != 1, "order %d", order))
+		return NULL;
+
+	if (ifr->count)
+		return ifr->page[--ifr->count];
+
+	return io_zctap_get_buffer(ifq);
+}
+
+unsigned long io_zctap_ifq_get_bulk(struct io_zctap_ifq *ifq,
+				    unsigned long nr_pages,
+				    struct page **page_array)
+{
+	struct ifq_region *ifr = ifq->region;
+	int count;
+
+	count = min_t(unsigned long, nr_pages, ifr->count);
+	if (count) {
+		ifr->count -= count;
+		memcpy(page_array, &ifr->page[ifr->count],
+		       count * sizeof(struct page *));
+	}
+
+	return count;
+}
+
+bool io_zctap_ifq_put_page(struct io_zctap_ifq *ifq, struct page *page)
+{
+	struct ifq_region *ifr = ifq->region;
+
+	/* if page is not usermapped, then throw an error */
+
+	/* sanity check - leak pages here if hit */
+	if (WARN_RATELIMIT(ifr->count >= ifr->nr_pages, "page overflow"))
+		return true;
+
+	ifr->page[ifr->count++] = page;
+
+	return true;
+}
From patchwork Fri Oct 7 21:17:13 2022
From: Jonathan Lemon <jonathan.lemon@gmail.com>
To: io-uring@vger.kernel.org
Subject: [RFC v1 9/9] io_uring: add OP_RECV_ZC command.
Date: Fri, 7 Oct 2022 14:17:13 -0700
Message-ID: <20221007211713.170714-10-jonathan.lemon@gmail.com>

This is still a WIP. The current code (temporarily) uses addr3 as a
hack in order to leverage code in io_recvmsg_prep.

The recvzc opcode uses a metadata buffer either supplied directly with
buf/len, or indirectly from the buffer group. The expectation is that
this buffer is then filled with an array of io_uring_zctap_iov
structures, which point to the data in user memory.

    addr3 = (readlen << 32) | (copy_bgid << 16) | ctx->ifq_id;

The amount of returned data is limited by the number of iovs that the
metadata area can hold, and also by the readlen parameter.

As a fallback (and for testing purposes), if the skb data is not
present in user memory (perhaps due to system misconfiguration), then
a separate buffer is obtained from the copy_bgid and the data is
copied into user memory.
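A userspace sketch of the addr3 encoding described above
(io_uring_prep_recv() is the ordinary liburing recv prep; overriding
the opcode afterwards mirrors the temporary hack, not a final ABI):

    static inline __u64 recvzc_cmd(__u32 readlen, __u16 copy_bgid,
                                   __u16 ifq_id)
    {
            return (__u64)readlen << 32 | (__u64)copy_bgid << 16 | ifq_id;
    }

    /* metadata buffer receives the io_uring_zctap_iov array */
    io_uring_prep_recv(sqe, sockfd, metadata, metadata_len, 0);
    sqe->opcode = IORING_OP_RECV_ZC;
    sqe->addr3  = recvzc_cmd(readlen, copy_bgid, ifq_id);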
Signed-off-by: Jonathan Lemon --- include/uapi/linux/io_uring.h | 1 + io_uring/net.c | 123 ++++++++++++ io_uring/opdef.c | 14 ++ io_uring/zctap.c | 354 ++++++++++++++++++++++++++++++++++ io_uring/zctap.h | 5 + 5 files changed, 497 insertions(+) diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h index 145d55280919..3c31a966687e 100644 --- a/include/uapi/linux/io_uring.h +++ b/include/uapi/linux/io_uring.h @@ -207,6 +207,7 @@ enum io_uring_op { IORING_OP_URING_CMD, IORING_OP_SEND_ZC, IORING_OP_PROVIDE_IFQ_REGION, + IORING_OP_RECV_ZC, /* this goes last, obviously */ IORING_OP_LAST, diff --git a/io_uring/net.c b/io_uring/net.c index 60e392f7f2dc..89c57ad83a79 100644 --- a/io_uring/net.c +++ b/io_uring/net.c @@ -16,6 +16,7 @@ #include "net.h" #include "notif.h" #include "rsrc.h" +#include "zctap.h" #if defined(CONFIG_NET) struct io_shutdown { @@ -73,6 +74,14 @@ struct io_sendzc { struct io_kiocb *notif; }; +struct io_recvzc { + struct io_sr_msg sr; + struct io_zctap_ifq *ifq; + u32 datalen; + u16 ifq_id; + u16 copy_bgid; +}; + #define IO_APOLL_MULTI_POLLED (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED) int io_shutdown_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) @@ -879,6 +888,120 @@ int io_recv(struct io_kiocb *req, unsigned int issue_flags) return ret; } +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + u64 recvzc_cmd; + u16 ifq_id; + + /* XXX hack so we can temporarily use io_recvmsg_prep */ + recvzc_cmd = READ_ONCE(sqe->addr3); + + ifq_id = recvzc_cmd & 0xffff; + zc->copy_bgid = (recvzc_cmd >> 16) & 0xffff; + zc->datalen = recvzc_cmd >> 32; + + zc->ifq = xa_load(&req->ctx->zctap_ifq_xa, ifq_id); + if (!zc->ifq) + return -EINVAL; + if (zc->ifq->ctx != req->ctx) + return -EINVAL; + + return io_recvmsg_prep(req, sqe); +} + +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags) +{ + struct io_recvzc *zc = io_kiocb_to_cmd(req, struct io_recvzc); + struct msghdr msg; + struct socket *sock; + struct iovec iov; + unsigned int cflags; + unsigned flags; + int ret, min_ret = 0; + bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK; + size_t len = zc->sr.len; + + if (!(req->flags & REQ_F_POLLED) && + (zc->sr.flags & IORING_RECVSEND_POLL_FIRST)) + return -EAGAIN; + + sock = sock_from_file(req->file); + if (unlikely(!sock)) + return -ENOTSOCK; + +retry_multishot: + if (io_do_buffer_select(req)) { + void __user *buf; + + buf = io_buffer_select(req, &len, issue_flags); + if (!buf) + return -ENOBUFS; + zc->sr.buf = buf; + } + + ret = import_single_range(READ, zc->sr.buf, len, &iov, &msg.msg_iter); + if (unlikely(ret)) + goto out_free; + + msg.msg_name = NULL; + msg.msg_namelen = 0; + msg.msg_control = NULL; + msg.msg_get_inq = 1; + msg.msg_flags = 0; + msg.msg_controllen = 0; + msg.msg_iocb = NULL; + msg.msg_ubuf = NULL; + + flags = zc->sr.msg_flags; + if (force_nonblock) + flags |= MSG_DONTWAIT; + if (flags & MSG_WAITALL) + min_ret = iov_iter_count(&msg.msg_iter); + + ret = io_zctap_recv(zc->ifq, sock, &msg, flags, zc->datalen, + zc->copy_bgid); + if (ret < min_ret) { + if (ret == -EAGAIN && force_nonblock) { + if ((req->flags & IO_APOLL_MULTI_POLLED) == IO_APOLL_MULTI_POLLED) { + io_kbuf_recycle(req, issue_flags); + return IOU_ISSUE_SKIP_COMPLETE; + } + + return -EAGAIN; + } + if (ret == -ERESTARTSYS) + ret = -EINTR; + if (ret > 0 && io_net_retry(sock, flags)) { + zc->sr.len -= ret; + zc->sr.buf += ret; + zc->sr.done_io += ret; + req->flags |= REQ_F_PARTIAL_IO; + return 
-EAGAIN; + } + req_set_fail(req); + } else if ((flags & MSG_WAITALL) && (msg.msg_flags & (MSG_TRUNC | MSG_CTRUNC))) { +out_free: + req_set_fail(req); + } + + if (ret > 0) + ret += zc->sr.done_io; + else if (zc->sr.done_io) + ret = zc->sr.done_io; + else + io_kbuf_recycle(req, issue_flags); + + cflags = io_put_kbuf(req, issue_flags); + if (msg.msg_inq) + cflags |= IORING_CQE_F_SOCK_NONEMPTY; + + if (!io_recv_finish(req, &ret, cflags, ret <= 0)) + goto retry_multishot; + + return ret; +} + void io_sendzc_cleanup(struct io_kiocb *req) { struct io_sendzc *zc = io_kiocb_to_cmd(req, struct io_sendzc); diff --git a/io_uring/opdef.c b/io_uring/opdef.c index bf28c43117c3..f3782e7b707b 100644 --- a/io_uring/opdef.c +++ b/io_uring/opdef.c @@ -497,6 +497,20 @@ const struct io_op_def io_op_defs[] = { .prep = io_provide_ifq_region_prep, .issue = io_provide_ifq_region, }, + [IORING_OP_RECV_ZC] = { + .name = "RECV_ZC", + .needs_file = 1, + .unbound_nonreg_file = 1, + .pollin = 1, + .buffer_select = 1, + .ioprio = 1, +#if defined(CONFIG_NET) + .prep = io_recvzc_prep, + .issue = io_recvzc, +#else + .prep = io_eopnotsupp_prep, +#endif + }, }; const char *io_uring_get_opcode(u8 opcode) diff --git a/io_uring/zctap.c b/io_uring/zctap.c index 8bebe7c36c82..1d334ac55c0b 100644 --- a/io_uring/zctap.c +++ b/io_uring/zctap.c @@ -7,6 +7,7 @@ #include #include #include +#include #include @@ -393,3 +394,356 @@ bool io_zctap_ifq_put_page(struct io_zctap_ifq *ifq, struct page *page) return true; } + +static inline bool +zctap_skb_ours(struct sk_buff *skb) +{ + return skb->pp_recycle; +} + +struct zctap_read_desc { + struct iov_iter *iter; + struct ifq_region *ifr; + u32 iov_space; + u32 iov_limit; + u32 recv_limit; + + struct io_kiocb req; + u8 *buf; + size_t offset; + size_t buflen; + + struct io_zctap_ifq *ifq; + u16 ifq_id; + u16 copy_bgid; /* XXX move to register ifq? */ +}; + +static int __zctap_get_user_buffer(struct zctap_read_desc *ztr, int len) +{ + if (!ztr->buflen) { + ztr->req = (struct io_kiocb) { + .ctx = ztr->ifq->ctx, + .buf_index = ztr->copy_bgid, + }; + + ztr->buf = (u8 *)io_zctap_buffer(&ztr->req, &ztr->buflen); + ztr->offset = 0; + } + return len > ztr->buflen ? 
ztr->buflen : len; +} + +static int zctap_copy_data(struct zctap_read_desc *ztr, int len, u8 *kaddr) +{ + struct io_uring_zctap_iov zov; + u32 space; + int err; + + space = ztr->iov_space + sizeof(zov); + if (space > ztr->iov_limit) + return 0; + + len = __zctap_get_user_buffer(ztr, len); + if (!len) + return -ENOBUFS; + + err = copy_to_user(ztr->buf + ztr->offset, kaddr, len); + if (err) + return -EFAULT; + + zov = (struct io_uring_zctap_iov) { + .off = ztr->offset, + .len = len, + .bgid = ztr->copy_bgid, + .bid = ztr->req.buf_index, + .ifq_id = ztr->ifq_id, + }; + + if (copy_to_iter(&zov, sizeof(zov), ztr->iter) != sizeof(zov)) + return -EFAULT; + + ztr->offset += len; + ztr->buflen -= len; + + ztr->iov_space = space; + + return len; +} + +static int zctap_copy_frag(struct zctap_read_desc *ztr, struct page *page, + int off, int len, int id, + struct io_uring_zctap_iov *zov) +{ + u8 *kaddr; + int err; + + len = __zctap_get_user_buffer(ztr, len); + if (!len) + return -ENOBUFS; + + if (id == 0) { + kaddr = kmap(page) + off; + err = copy_to_user(ztr->buf + ztr->offset, kaddr, len); + kunmap(page); + } else { + kaddr = page_address(page) + off; + err = copy_to_user(ztr->buf + ztr->offset, kaddr, len); + } + + if (err) + return -EFAULT; + + *zov = (struct io_uring_zctap_iov) { + .off = ztr->offset, + .len = len, + .bgid = ztr->copy_bgid, + .bid = ztr->req.buf_index, + .ifq_id = ztr->ifq_id, + }; + + ztr->offset += len; + ztr->buflen -= len; + + return len; +} + +static int zctap_recv_frag(struct zctap_read_desc *ztr, + const skb_frag_t *frag, int off, int len) +{ + struct io_uring_zctap_iov zov; + struct page *page; + int id, pgid; + u32 space; + + space = ztr->iov_space + sizeof(zov); + if (space > ztr->iov_limit) + return 0; + + page = skb_frag_page(frag); + id = zctap_page_ifq_id(page); + off += skb_frag_off(frag); + + if (likely(id == ztr->ifq_id)) { + pgid = zctap_page_id(page); + io_add_page_uref(ztr->ifr, pgid); + zov = (struct io_uring_zctap_iov) { + .off = off, + .len = len, + .bgid = zctap_page_region_id(page), + .bid = pgid, + .ifq_id = id, + }; + } else { + len = zctap_copy_frag(ztr, page, off, len, id, &zov); + if (len <= 0) + return len; + } + + if (copy_to_iter(&zov, sizeof(zov), ztr->iter) != sizeof(zov)) + return -EFAULT; + + ztr->iov_space = space; + + return len; +} + +/* Our version of __skb_datagram_iter -- should work for UDP also. 
*/ +static int +zctap_recv_skb(read_descriptor_t *desc, struct sk_buff *skb, + unsigned int offset, size_t len) +{ + struct zctap_read_desc *ztr = desc->arg.data; + unsigned start, start_off; + struct sk_buff *frag_iter; + int i, copy, end, ret = 0; + + if (ztr->iov_space >= ztr->iov_limit) { + desc->count = 0; + return 0; + } + if (len > ztr->recv_limit) + len = ztr->recv_limit; + + start = skb_headlen(skb); + start_off = offset; + + if (offset < start) { + copy = start - offset; + if (copy > len) + copy = len; + + /* copy out linear data */ + ret = zctap_copy_data(ztr, copy, skb->data + offset); + if (ret < 0) + goto out; + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + + for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) { + const skb_frag_t *frag; + + WARN_ON(start > offset + len); + + frag = &skb_shinfo(skb)->frags[i]; + end = start + skb_frag_size(frag); + + if (offset < end) { + copy = end - offset; + if (copy > len) + copy = len; + + ret = zctap_recv_frag(ztr, frag, offset - start, copy); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + + skb_walk_frags(skb, frag_iter) { + WARN_ON(start > offset + len); + + end = start + frag_iter->len; + if (offset < end) { + int off; + + copy = end - offset; + if (copy > len) + copy = len; + + off = offset - start; + ret = zctap_recv_skb(desc, frag_iter, off, copy); + if (ret < 0) + goto out; + + offset += ret; + len -= ret; + if (len == 0 || ret != copy) + goto out; + } + start = end; + } + +out: + if (offset == start_off) + return ret; + return offset - start_off; +} + +static int __io_zctap_tcp_read(struct sock *sk, struct zctap_read_desc *zrd) +{ + read_descriptor_t rd_desc = { + .arg.data = zrd, + .count = 1, + }; + + return tcp_read_sock(sk, &rd_desc, zctap_recv_skb); +} + +static int io_zctap_tcp_recvmsg(struct sock *sk, struct zctap_read_desc *zrd, + int flags, int *addr_len) +{ + size_t used; + long timeo; + int ret; + + ret = used = 0; + + lock_sock(sk); + + timeo = sock_rcvtimeo(sk, flags & MSG_DONTWAIT); + while (zrd->recv_limit) { + ret = __io_zctap_tcp_read(sk, zrd); + if (ret < 0) + break; + if (!ret) { + if (used) + break; + if (sock_flag(sk, SOCK_DONE)) + break; + if (sk->sk_err) { + ret = sock_error(sk); + break; + } + if (sk->sk_shutdown & RCV_SHUTDOWN) + break; + if (sk->sk_state == TCP_CLOSE) { + ret = -ENOTCONN; + break; + } + if (!timeo) { + ret = -EAGAIN; + break; + } + if (!skb_queue_empty(&sk->sk_receive_queue)) + break; + sk_wait_data(sk, &timeo, NULL); + if (signal_pending(current)) { + ret = sock_intr_errno(timeo); + break; + } + continue; + } + zrd->recv_limit -= ret; + used += ret; + + if (!timeo) + break; + release_sock(sk); + lock_sock(sk); + + if (sk->sk_err || sk->sk_state == TCP_CLOSE || + (sk->sk_shutdown & RCV_SHUTDOWN) || + signal_pending(current)) + break; + } + + release_sock(sk); + + /* XXX, handle timestamping */ + + if (used) + return used; + + return ret; +} + +int io_zctap_recv(struct io_zctap_ifq *ifq, struct socket *sock, + struct msghdr *msg, int flags, u32 datalen, u16 copy_bgid) +{ + struct sock *sk = sock->sk; + struct zctap_read_desc zrd = { + .iov_limit = msg_data_left(msg), + .recv_limit = datalen, + .iter = &msg->msg_iter, + .ifq = ifq, + .ifq_id = ifq->id, + .copy_bgid = copy_bgid, + .ifr = ifq->region, + }; + const struct proto *prot; + int addr_len = 0; + int ret; + + if (flags & MSG_ERRQUEUE) + return -EOPNOTSUPP; + + prot = READ_ONCE(sk->sk_prot); + if (prot->recvmsg != 
tcp_recvmsg) + return -EPROTONOSUPPORT; + + sock_rps_record_flow(sk); + + ret = io_zctap_tcp_recvmsg(sk, &zrd, flags, &addr_len); + if (ret >= 0) { + msg->msg_namelen = addr_len; + ret = zrd.iov_space; + } + return ret; +} diff --git a/io_uring/zctap.h b/io_uring/zctap.h index 709c803220f4..2c3e23a6a07a 100644 --- a/io_uring/zctap.h +++ b/io_uring/zctap.h @@ -12,4 +12,9 @@ int io_provide_ifq_region_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); int io_provide_ifq_region(struct io_kiocb *req, unsigned int issue_flags); +int io_recvzc(struct io_kiocb *req, unsigned int issue_flags); +int io_recvzc_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe); +int io_zctap_recv(struct io_zctap_ifq *ifq, struct socket *sock, + struct msghdr *msg, int flags, u32 datalen, u16 copy_bgid); + #endif
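For completeness, a sketch of how an application might walk the
metadata buffer after a RECV_ZC completion; cqe->res here is the number
of metadata bytes written (iov_space above), and resolve_buf() is a
hypothetical application helper that maps a (bgid, bid) pair back to
the base address of that buffer:

    static void process_recvzc(struct io_uring_cqe *cqe, void *metadata)
    {
            struct io_uring_zctap_iov *zov = metadata;
            unsigned int i, n = cqe->res / sizeof(*zov);

            for (i = 0; i < n; i++) {
                    void *data = resolve_buf(zov[i].bgid, zov[i].bid)
                                 + zov[i].off;

                    consume(data, zov[i].len);  /* application-defined */
            }
    }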