From patchwork Fri Feb 10 15:32:09 2023
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 13135932
From: Ming Lei
To: Jens Axboe, io-uring@vger.kernel.org, linux-block@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Alexander Viro
Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
 Christoph Hellwig, Ziyang Zhang, Ming Lei
Subject: [PATCH 1/4] fs/splice: enhance direct pipe & splice for moving pages in kernel
Date: Fri, 10 Feb 2023 23:32:09 +0800
Message-Id: <20230210153212.733006-2-ming.lei@redhat.com>
In-Reply-To: <20230210153212.733006-1-ming.lei@redhat.com>
References: <20230210153212.733006-1-ming.lei@redhat.com>

The per-task direct pipe can transfer pages between two files, or between
one file and another kernel component, typically via
splice_direct_to_actor() and __splice_from_pipe(). This is helpful for
fuse/ublk to implement zero copy by transferring pages from a device to a
file or socket. However, when the device's ->splice_read() produces pages,
the kernel consumer may read from or write to these pages, and from the
device's viewpoint such reads or writes can be unexpected.
Restrict such access with the following approach:

1) add the kernel splice flags SPLICE_F_KERN_FOR_[READ|WRITE], which are
   passed to the device's ->splice_read(); together with the information
   from ppos & len, the device can then check whether this READ or WRITE
   is expected on the pages it fills into the pipe

2) add the kernel splice flag SPLICE_F_KERN_NEED_CONFIRM, which is passed
   to the device's ->splice_read() to ask the device to confirm that it
   really supports feeding pages this way. If the device does,
   pipe->ack_page_consuming is set. This avoids misuse.

Signed-off-by: Ming Lei
---
 fs/splice.c               | 15 +++++++++++++++
 include/linux/pipe_fs_i.h | 10 ++++++++++
 include/linux/splice.h    | 22 ++++++++++++++++++++++
 3 files changed, 47 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 87d9b19349de..c4770e1644cc 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -792,6 +792,14 @@ static long do_splice_to(struct file *in, loff_t *ppos,
 	return in->f_op->splice_read(in, ppos, pipe, len, flags);
 }
 
+static inline bool splice_read_acked(const struct pipe_inode_info *pipe,
+				     int flags)
+{
+	if (flags & SPLICE_F_KERN_NEED_CONFIRM)
+		return pipe->ack_page_consuming;
+	return true;
+}
+
 /**
  * splice_direct_to_actor - splices data directly between two non-pipes
  * @in:		file to splice from
@@ -861,10 +869,17 @@ ssize_t splice_direct_to_actor(struct file *in, struct splice_desc *sd,
 		size_t read_len;
 		loff_t pos = sd->pos, prev_pos = pos;
 
+		pipe->ack_page_consuming = false;
 		ret = do_splice_to(in, &pos, pipe, len, flags);
 		if (unlikely(ret <= 0))
 			goto out_release;
 
+		if (!splice_read_acked(pipe, flags)) {
+			bytes = 0;
+			ret = -EACCES;
+			goto out_release;
+		}
+
 		read_len = ret;
 		sd->total_len = read_len;
diff --git a/include/linux/pipe_fs_i.h b/include/linux/pipe_fs_i.h
index 6cb65df3e3ba..09ee1a9380ec 100644
--- a/include/linux/pipe_fs_i.h
+++ b/include/linux/pipe_fs_i.h
@@ -72,6 +72,7 @@ struct pipe_inode_info {
 	unsigned int r_counter;
 	unsigned int w_counter;
 	bool poll_usage;
+	bool ack_page_consuming;	/* only for direct pipe */
 	struct page *tmp_page;
 	struct fasync_struct *fasync_readers;
 	struct fasync_struct *fasync_writers;
@@ -218,6 +219,15 @@ static inline void pipe_discard_from(struct pipe_inode_info *pipe,
 		pipe_buf_release(pipe, &pipe->bufs[--pipe->head & mask]);
 }
 
+/*
+ * Called in ->splice_read() to confirm that READ/WRITE on the pages is allowed
+ */
+static inline void pipe_ack_page_consume(struct pipe_inode_info *pipe)
+{
+	if (!WARN_ON_ONCE(current->splice_pipe != pipe))
+		pipe->ack_page_consuming = true;
+}
+
 /* Differs from PIPE_BUF in that PIPE_SIZE is the length of the actual
    memory allocation, whereas PIPE_BUF makes atomicity guarantees.  */
 #define PIPE_SIZE		PAGE_SIZE
diff --git a/include/linux/splice.h b/include/linux/splice.h
index a55179fd60fc..98c471fd918d 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -23,6 +23,28 @@
 
 #define SPLICE_F_ALL (SPLICE_F_MOVE|SPLICE_F_NONBLOCK|SPLICE_F_MORE|SPLICE_F_GIFT)
 
+/*
+ * Flags used for kernel-internal page moves from ->splice_read() via the
+ * internal direct pipe; user pipes can't touch these flags.
+ *
+ * Pages filled from ->splice_read() are usually moved/copied to
+ * ->splice_write(). Here we address fuse/ublk zero copy by transferring
+ * pages from a device to a file/socket for either READ or WRITE, so
+ * ->splice_read() needs to confirm that this READ/WRITE is allowed on
+ * the pages it fills.
+ */
+/* The page consumer is for READ from pages moved from the direct pipe */
+#define SPLICE_F_KERN_FOR_READ		(0x100)
+/* The page consumer is for WRITE to pages moved from the direct pipe */
+#define SPLICE_F_KERN_FOR_WRITE	(0x200)
+/*
+ * ->splice_read() has to confirm whether the consumer's READ/WRITE on the
+ * pages is allowed. If yes, ->splice_read() has to set
+ * pipe->ack_page_consuming; otherwise pipe->ack_page_consuming has to be
+ * cleared.
+ */
+#define SPLICE_F_KERN_NEED_CONFIRM	(0x400)
+
 /*
  * Passed to the actors
  */
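To make the handshake concrete, here is a minimal sketch of the provider
side, assuming a hypothetical my_dev_splice_read(); only
pipe_ack_page_consume(), current->splice_pipe and the SPLICE_F_KERN_*
flags come from this patch, the rest is illustrative (patch 4's
ublk_splice_read() is the real in-tree user):

	static ssize_t my_dev_splice_read(struct file *in, loff_t *ppos,
					  struct pipe_inode_info *pipe,
					  size_t len, unsigned int flags)
	{
		/* only serve the kernel-private direct pipe, and only
		 * when the consumer asks for confirmation */
		if (pipe != current->splice_pipe ||
		    !(flags & SPLICE_F_KERN_NEED_CONFIRM))
			return -EACCES;

		/* the consumer must declare its direction so the device
		 * can verify the access is expected for (*ppos, len) */
		if (!(flags & (SPLICE_F_KERN_FOR_READ |
			       SPLICE_F_KERN_FOR_WRITE)))
			return -EACCES;

		/* ... look up the pages backing (*ppos, len) and feed
		 * them to the pipe with add_to_pipe() ... */

		/* confirm that READ/WRITE on the filled pages is allowed */
		pipe_ack_page_consume(pipe);
		return len;
	}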
From patchwork Fri Feb 10 15:32:10 2023
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 13135934
From: Ming Lei
To: Jens Axboe, io-uring@vger.kernel.org, linux-block@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Alexander Viro
Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
 Christoph Hellwig, Ziyang Zhang, Ming Lei
Subject: [PATCH 2/4] fs/splice: allow to ignore signal in __splice_from_pipe
Date: Fri, 10 Feb 2023 23:32:10 +0800
Message-Id: <20230210153212.733006-3-ming.lei@redhat.com>
In-Reply-To: <20230210153212.733006-1-ming.lei@redhat.com>
References: <20230210153212.733006-1-ming.lei@redhat.com>

__splice_from_pipe() is used for splicing data from a pipe, and the actor
may simply be grabbing page references, so if the caller can guarantee
that the actor won't block, there is no need to return -ERESTARTSYS on a
pending signal.
Signed-off-by: Ming Lei
---
 fs/splice.c            | 4 ++--
 include/linux/splice.h | 1 +
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/splice.c b/fs/splice.c
index c4770e1644cc..a8dc46db1045 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -471,7 +471,7 @@ static int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_desc *sd)
 	 * Check for signal early to make process killable when there are
 	 * always buffers available
 	 */
-	if (signal_pending(current))
+	if (signal_pending(current) && !sd->ignore_sig)
 		return -ERESTARTSYS;
 
 repeat:
@@ -485,7 +485,7 @@ static int splice_from_pipe_next(struct pipe_inode_info *pipe, struct splice_desc *sd)
 		if (sd->flags & SPLICE_F_NONBLOCK)
 			return -EAGAIN;
 
-		if (signal_pending(current))
+		if (signal_pending(current) && !sd->ignore_sig)
 			return -ERESTARTSYS;
 
 		if (sd->need_wakeup) {
diff --git a/include/linux/splice.h b/include/linux/splice.h
index 98c471fd918d..89e0a0f8b471 100644
--- a/include/linux/splice.h
+++ b/include/linux/splice.h
@@ -64,6 +64,7 @@ struct splice_desc {
 	loff_t *opos;		/* sendfile: output position */
 	size_t num_spliced;	/* number of bytes already spliced */
 	bool need_wakeup;	/* need to wake up writer */
+	bool ignore_sig;	/* don't abort on pending signal */
 };
 
 struct partial_page {
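For reference, a caller that may legitimately set the new field looks
roughly like the sketch below; splice_grab_pages() and
grab_pages_direct_actor() are made-up names (patch 3's
__io_prep_rw_splice_buf() is the real in-tree user of ignore_sig):

	static long splice_grab_pages(struct file *in, loff_t pos,
				      size_t len, void *priv)
	{
		struct splice_desc sd = {
			.total_len  = len,
			.flags      = SPLICE_F_NONBLOCK,
			.pos        = pos,
			.u.data     = priv,
			/* the actor only grabs page references and never
			 * blocks, so a pending signal must not abort the
			 * splice halfway */
			.ignore_sig = true,
		};

		return splice_direct_to_actor(in, &sd,
					      grab_pages_direct_actor);
	}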
From patchwork Fri Feb 10 15:32:11 2023
X-Patchwork-Submitter: Ming Lei
X-Patchwork-Id: 13135933
From: Ming Lei
To: Jens Axboe, io-uring@vger.kernel.org, linux-block@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, Alexander Viro
Cc: Stefan Hajnoczi, Miklos Szeredi, Bernd Schubert, Nitesh Shetty,
 Christoph Hellwig, Ziyang Zhang, Ming Lei
Subject: [PATCH 3/4] io_uring: add IORING_OP_READ[WRITE]_SPLICE_BUF
Date: Fri, 10 Feb 2023 23:32:11 +0800
Message-Id: <20230210153212.733006-4-ming.lei@redhat.com>
In-Reply-To: <20230210153212.733006-1-ming.lei@redhat.com>
References: <20230210153212.733006-1-ming.lei@redhat.com>

IORING_OP_READ_SPLICE_BUF: read to a buffer which is built from the
->splice_read() of the specified fd, so the user provides
(splice_fd, offset, len) to build the buffer.

IORING_OP_WRITE_SPLICE_BUF: write from a buffer which is built from the
->splice_read() of the specified fd, with the same
(splice_fd, offset, len) triple describing the buffer.

The typical use case is supporting ublk/fuse io_uring zero copy: the
READ/WRITE op retrieves the ublk/fuse request buffer via the direct pipe
from the device's ->splice_read(), and the READ/WRITE can then be done
directly to/from this buffer.

Signed-off-by: Ming Lei
---
 include/uapi/linux/io_uring.h |   2 +
 io_uring/opdef.c              |  37 ++++++++
 io_uring/rw.c                 | 174 +++++++++++++++++++++++++++++++++-
 io_uring/rw.h                 |   1 +
 4 files changed, 213 insertions(+), 1 deletion(-)

diff --git a/include/uapi/linux/io_uring.h b/include/uapi/linux/io_uring.h
index 636a4c2c1294..bada0c91a350 100644
--- a/include/uapi/linux/io_uring.h
+++ b/include/uapi/linux/io_uring.h
@@ -223,6 +223,8 @@ enum io_uring_op {
 	IORING_OP_URING_CMD,
 	IORING_OP_SEND_ZC,
 	IORING_OP_SENDMSG_ZC,
+	IORING_OP_READ_SPLICE_BUF,
+	IORING_OP_WRITE_SPLICE_BUF,
 
 	/* this goes last, obviously */
 	IORING_OP_LAST,
diff --git a/io_uring/opdef.c b/io_uring/opdef.c
index 5238ecd7af6a..91e8d8f96134 100644
--- a/io_uring/opdef.c
+++ b/io_uring/opdef.c
@@ -427,6 +427,31 @@ const struct io_issue_def io_issue_defs[] = {
 		.prep			= io_eopnotsupp_prep,
 #endif
 	},
+	[IORING_OP_READ_SPLICE_BUF] = {
+		.needs_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.pollin			= 1,
+		.plug			= 1,
+		.audit_skip		= 1,
+		.ioprio			= 1,
+		.iopoll			= 1,
+		.iopoll_queue		= 1,
+		.prep			= io_prep_rw,
+		.issue			= io_read,
+	},
+	[IORING_OP_WRITE_SPLICE_BUF] = {
+		.needs_file		= 1,
+		.hash_reg_file		= 1,
+		.unbound_nonreg_file	= 1,
+		.pollout		= 1,
+		.plug			= 1,
+		.audit_skip		= 1,
+		.ioprio			= 1,
+		.iopoll			= 1,
+		.iopoll_queue		= 1,
+		.prep			= io_prep_rw,
+		.issue			= io_write,
+	},
 };
 
@@ -647,6 +672,18 @@ const struct io_cold_def io_cold_defs[] = {
 		.fail			= io_sendrecv_fail,
 #endif
 	},
+	[IORING_OP_READ_SPLICE_BUF] = {
+		.async_size		= sizeof(struct io_async_rw),
+		.name			= "READ_TO_SPLICE_BUF",
+		.cleanup		= io_read_write_cleanup,
+		.fail			= io_rw_fail,
+	},
+	[IORING_OP_WRITE_SPLICE_BUF] = {
+		.async_size		= sizeof(struct io_async_rw),
+		.name			= "WRITE_FROM_SPLICE_BUF",
+		.cleanup		= io_read_write_cleanup,
+		.fail			= io_rw_fail,
+	},
 };
 
 const char *io_uring_get_opcode(u8 opcode)
diff --git a/io_uring/rw.c b/io_uring/rw.c
index efe6bfda9ca9..381514fd1bc5 100644
--- a/io_uring/rw.c
+++ b/io_uring/rw.c
@@ -73,6 +73,175 @@ static int io_iov_buffer_select_prep(struct io_kiocb *req)
 	return 0;
 }
 
+struct io_rw_splice_buf_data {
+	unsigned long total;
+	unsigned int max_bvecs;
+	struct io_mapped_ubuf **imu;
+};
+
+/* the max size of the whole 'io_mapped_ubuf' allocation is one page */
+static inline unsigned int io_rw_max_splice_buf_bvecs(void)
+{
+	return (PAGE_SIZE - sizeof(struct io_mapped_ubuf)) /
+			sizeof(struct bio_vec);
+}
+
+static inline unsigned int io_rw_splice_buf_nr_bvecs(unsigned long len)
+{
+	return min_t(unsigned int, (len + PAGE_SIZE - 1) >> PAGE_SHIFT,
+			io_rw_max_splice_buf_bvecs());
+}
+
+static inline bool io_rw_splice_buf(struct io_kiocb *req)
+{
+	return req->opcode == IORING_OP_READ_SPLICE_BUF ||
+		req->opcode == IORING_OP_WRITE_SPLICE_BUF;
+}
+
+static void io_rw_cleanup_splice_buf(struct io_kiocb *req)
+{
+	struct io_mapped_ubuf *imu = req->imu;
+	int i;
+
+	if (!imu)
+		return;
+
+	for (i = 0; i < imu->nr_bvecs; i++)
+		put_page(imu->bvec[i].bv_page);
+
+	req->imu = NULL;
+	kfree(imu);
+}
+
+static int io_splice_buf_actor(struct pipe_inode_info *pipe,
+			       struct pipe_buffer *buf,
+			       struct splice_desc *sd)
+{
+	struct io_rw_splice_buf_data *data = sd->u.data;
+	struct io_mapped_ubuf *imu = *data->imu;
+	struct bio_vec *bvec;
+
+	if (imu->nr_bvecs >= data->max_bvecs) {
+		/*
+		 * Double the bvec allocation, given we don't know how
+		 * many buffers remain
+		 */
+		unsigned nr_bvecs = min(data->max_bvecs * 2,
+				io_rw_max_splice_buf_bvecs());
+		struct io_mapped_ubuf *new_imu;
+
+		/* can't grow, give up */
+		if (nr_bvecs <= data->max_bvecs)
+			return 0;
+
+		new_imu = krealloc(imu, struct_size(imu, bvec, nr_bvecs),
+				GFP_KERNEL);
+		if (!new_imu)
+			return -ENOMEM;
+		imu = new_imu;
+		data->max_bvecs = nr_bvecs;
+		*data->imu = imu;
+	}
+
+	if (!try_get_page(buf->page))
+		return -EINVAL;
+
+	bvec = &imu->bvec[imu->nr_bvecs];
+	bvec->bv_page	= buf->page;
+	bvec->bv_offset	= buf->offset;
+	bvec->bv_len	= buf->len;
+	imu->nr_bvecs++;
+	data->total += buf->len;
+
+	return buf->len;
+}
+
+static int io_splice_buf_direct_actor(struct pipe_inode_info *pipe,
+				      struct splice_desc *sd)
+{
+	return __splice_from_pipe(pipe, sd, io_splice_buf_actor);
+}
+
+static int __io_prep_rw_splice_buf(struct io_kiocb *req,
+				   struct io_rw_splice_buf_data *data,
+				   struct file *splice_f,
+				   size_t len,
+				   loff_t splice_off)
+{
+	unsigned flags = req->opcode == IORING_OP_READ_SPLICE_BUF ?
+			SPLICE_F_KERN_FOR_READ : SPLICE_F_KERN_FOR_WRITE;
+	struct splice_desc sd = {
+		.total_len = len,
+		.flags = flags | SPLICE_F_NONBLOCK | SPLICE_F_KERN_NEED_CONFIRM,
+		.pos = splice_off,
+		.u.data = data,
+		.ignore_sig = true,
+	};
+
+	return splice_direct_to_actor(splice_f, &sd,
+			io_splice_buf_direct_actor);
+}
+
+static int io_prep_rw_splice_buf(struct io_kiocb *req,
+				 const struct io_uring_sqe *sqe)
+{
+	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
+	unsigned nr_pages = io_rw_splice_buf_nr_bvecs(rw->len);
+	loff_t splice_off = READ_ONCE(sqe->splice_off_in);
+	struct io_rw_splice_buf_data data;
+	struct io_mapped_ubuf *imu;
+	struct fd splice_fd;
+	int ret;
+
+	splice_fd = fdget(READ_ONCE(sqe->splice_fd_in));
+	if (!splice_fd.file)
+		return -EBADF;
+
+	ret = -EBADF;
+	if (!(splice_fd.file->f_mode & FMODE_READ))
+		goto out_put_fd;
+
+	ret = -ENOMEM;
+	imu = kmalloc(struct_size(imu, bvec, nr_pages), GFP_KERNEL);
+	if (!imu)
+		goto out_put_fd;
+
+	/* the splice buffer has no virtual address */
+	imu->nr_bvecs = 0;
+
+	data.max_bvecs = nr_pages;
+	data.total = 0;
+	data.imu = &imu;
+
+	rw->addr = 0;
+	req->flags |= REQ_F_NEED_CLEANUP;
+
+	ret = __io_prep_rw_splice_buf(req, &data, splice_fd.file, rw->len,
+			splice_off);
+	imu = *data.imu;
+	imu->acct_pages = 0;
+	imu->ubuf = 0;
+	imu->ubuf_end = data.total;
+	rw->len = data.total;
+	req->imu = imu;
+	if (!data.total) {
+		io_rw_cleanup_splice_buf(req);
+	} else {
+		ret = 0;
+	}
+out_put_fd:
+	if (splice_fd.file)
+		fdput(splice_fd);
+
+	return ret;
+}
+
+void io_read_write_cleanup(struct io_kiocb *req)
+{
+	if (io_rw_splice_buf(req))
+		io_rw_cleanup_splice_buf(req);
+}
+
 int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_rw *rw = io_kiocb_to_cmd(req, struct io_rw);
@@ -117,6 +286,8 @@ int io_prep_rw(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 		ret = io_iov_buffer_select_prep(req);
 		if (ret)
 			return ret;
+	} else if (io_rw_splice_buf(req)) {
+		return io_prep_rw_splice_buf(req, sqe);
 	}
 
 	return 0;
@@ -371,7 +542,8 @@ static struct iovec *__io_import_iovec(int ddir, struct io_kiocb *req,
 	size_t sqe_len;
 	ssize_t ret;
 
-	if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED) {
+	if (opcode == IORING_OP_READ_FIXED || opcode == IORING_OP_WRITE_FIXED ||
+	    io_rw_splice_buf(req)) {
 		ret = io_import_fixed(ddir, iter, req->imu, rw->addr, rw->len);
 		if (ret)
 			return ERR_PTR(ret);
diff --git a/io_uring/rw.h b/io_uring/rw.h
index 3b733f4b610a..b37d6f6ecb6a 100644
--- a/io_uring/rw.h
+++ b/io_uring/rw.h
@@ -21,4 +21,5 @@ int io_readv_prep_async(struct io_kiocb *req);
 int io_write(struct io_kiocb *req, unsigned int issue_flags);
 int io_writev_prep_async(struct io_kiocb *req);
 void io_readv_writev_cleanup(struct io_kiocb *req);
+void io_read_write_cleanup(struct io_kiocb *req);
 void io_rw_fail(struct io_kiocb *req);
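Userspace consumes the new opcodes by filling the raw SQE; no liburing
helper exists for them at this point, so the hedged sketch below sets
exactly the fields read by io_prep_rw() and io_prep_rw_splice_buf()
(prep_read_splice_buf() is an illustrative name, not part of this series):

	static void prep_read_splice_buf(struct io_uring_sqe *sqe,
					 int file_fd, __u64 file_off,
					 __u32 len, int splice_fd,
					 __u64 splice_off)
	{
		memset(sqe, 0, sizeof(*sqe));
		sqe->opcode = IORING_OP_READ_SPLICE_BUF;
		sqe->fd = file_fd;	/* file to read from */
		sqe->off = file_off;	/* offset in that file */
		sqe->len = len;		/* bytes to transfer */
		/* fd whose ->splice_read() builds the buffer */
		sqe->splice_fd_in = splice_fd;
		/* position handed to that ->splice_read() */
		sqe->splice_off_in = splice_off;
	}

IORING_OP_WRITE_SPLICE_BUF is prepared the same way, just with the write
opcode; sqe->addr stays 0 because the splice buffer has no virtual
address.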
lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232793AbjBJPd5 (ORCPT ); Fri, 10 Feb 2023 10:33:57 -0500 Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.129.124]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 693ED5BA5F for ; Fri, 10 Feb 2023 07:33:06 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1676043185; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=aCWDoQu05GM/FfiJ8Y58UcEN/SbMA9nhBOhrfHvxOXw=; b=f3fhuaK/uGAIFlMibtPBG10Is/lRT1SiEl6w6WxLBn+Nh6m130nqRSTtI6R/GPPBOM2gaA zWc60hrMbdJAjxbeIjtgZSZIJ1UERCDxT9+J1nLjxN8oSbp5x/pIZjbwE42/0bJsidN9M0 e+4PjjPC3fjRZNjxvc647BDJVxeQ/Ok= Received: from mimecast-mx02.redhat.com (mimecast-mx02.redhat.com [66.187.233.88]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-669-kTcJ1TSmNEaEyceEKSLqGQ-1; Fri, 10 Feb 2023 10:33:04 -0500 X-MC-Unique: kTcJ1TSmNEaEyceEKSLqGQ-1 Received: from smtp.corp.redhat.com (int-mx10.intmail.prod.int.rdu2.redhat.com [10.11.54.10]) (using TLSv1.2 with cipher AECDH-AES256-SHA (256/256 bits)) (No client certificate requested) by mimecast-mx02.redhat.com (Postfix) with ESMTPS id E781B858F09; Fri, 10 Feb 2023 15:33:03 +0000 (UTC) Received: from localhost (ovpn-8-17.pek2.redhat.com [10.72.8.17]) by smtp.corp.redhat.com (Postfix) with ESMTP id ADD9A492B00; Fri, 10 Feb 2023 15:33:02 +0000 (UTC) From: Ming Lei To: Jens Axboe , io-uring@vger.kernel.org, linux-block@vger.kernel.org, linux-fsdevel@vger.kernel.org, Alexander Viro Cc: Stefan Hajnoczi , Miklos Szeredi , Bernd Schubert , Nitesh Shetty , Christoph Hellwig , Ziyang Zhang , Ming Lei Subject: [PATCH 4/4] ublk_drv: support splice based read/write zero copy Date: Fri, 10 Feb 2023 23:32:12 +0800 Message-Id: <20230210153212.733006-5-ming.lei@redhat.com> In-Reply-To: <20230210153212.733006-1-ming.lei@redhat.com> References: <20230210153212.733006-1-ming.lei@redhat.com> MIME-Version: 1.0 X-Scanned-By: MIMEDefang 3.1 on 10.11.54.10 Precedence: bulk List-ID: X-Mailing-List: io-uring@vger.kernel.org The initial idea of using splice for zero copy is from Miklos and Stefan. Now io_uring supports IORING_OP_READ[WRITE]_SPLICE_BUF, and ublk can pass request pages via direct/kernel private pipe to backend io_uring read/write handling code. This zero copy implementation improves sequential IO performance obviously. 
Signed-off-by: Ming Lei
---
 drivers/block/ublk_drv.c      | 169 ++++++++++++++++++++++++++++++++--
 include/uapi/linux/ublk_cmd.h |  31 ++++++-
 2 files changed, 193 insertions(+), 7 deletions(-)

diff --git a/drivers/block/ublk_drv.c b/drivers/block/ublk_drv.c
index e6eceee44366..5ef5f2ccb0d5 100644
--- a/drivers/block/ublk_drv.c
+++ b/drivers/block/ublk_drv.c
@@ -43,6 +43,8 @@
 #include <linux/task_work.h>
 #include <linux/namei.h>
 #include <linux/kref.h>
+#include <linux/pipe_fs_i.h>
+#include <linux/splice.h>
 #include <uapi/linux/ublk_cmd.h>
 
 #define UBLK_MINORS		(1U << MINORBITS)
@@ -154,6 +156,8 @@ struct ublk_device {
 	unsigned long		state;
 	int			ub_number;
 
+	struct srcu_struct	srcu;
+
 	struct mutex		mutex;
 
 	spinlock_t		mm_lock;
@@ -537,6 +541,9 @@ static int ublk_map_io(const struct ublk_queue *ubq, const struct request *req,
 	if (req_op(req) != REQ_OP_WRITE && req_op(req) != REQ_OP_FLUSH)
 		return rq_bytes;
 
+	if (ubq->flags & UBLK_F_SUPPORT_ZERO_COPY)
+		return rq_bytes;
+
 	if (ublk_rq_has_data(req)) {
 		struct ublk_map_data data = {
 			.ubq	=	ubq,
@@ -558,6 +565,9 @@ static int ublk_unmap_io(const struct ublk_queue *ubq,
 {
 	const unsigned int rq_bytes = blk_rq_bytes(req);
 
+	if (ubq->flags & UBLK_F_SUPPORT_ZERO_COPY)
+		return rq_bytes;
+
 	if (req_op(req) == REQ_OP_READ && ublk_rq_has_data(req)) {
 		struct ublk_map_data data = {
 			.ubq	=	ubq,
@@ -1221,6 +1231,7 @@ static void ublk_stop_dev(struct ublk_device *ub)
 	del_gendisk(ub->ub_disk);
 	ub->dev_info.state = UBLK_S_DEV_DEAD;
 	ub->dev_info.ublksrv_pid = -1;
+	synchronize_srcu(&ub->srcu);
 	put_disk(ub->ub_disk);
 	ub->ub_disk = NULL;
 unlock:
@@ -1355,13 +1366,155 @@ static int ublk_ch_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
 	return -EIOCBQUEUED;
 }
 
+static void ublk_pipe_buf_release(struct pipe_inode_info *pipe,
+		struct pipe_buffer *buf)
+{
+}
+
+static const struct pipe_buf_operations ublk_pipe_buf_ops = {
+	.release	= ublk_pipe_buf_release,
+};
+
+static inline bool ublk_check_splice_rw(const struct request *req,
+		unsigned int flags)
+{
+	flags &= (SPLICE_F_KERN_FOR_READ | SPLICE_F_KERN_FOR_WRITE);
+
+	if (req_op(req) == REQ_OP_READ && flags == SPLICE_F_KERN_FOR_READ)
+		return true;
+
+	if (req_op(req) == REQ_OP_WRITE && flags == SPLICE_F_KERN_FOR_WRITE)
+		return true;
+
+	return false;
+}
+
+static ssize_t ublk_splice_read(struct file *in, loff_t *ppos,
+		struct pipe_inode_info *pipe,
+		size_t len, unsigned int flags)
+{
+	struct ublk_device *ub = in->private_data;
+	struct req_iterator rq_iter;
+	struct bio_vec bv;
+	struct request *req;
+	struct ublk_queue *ubq;
+	u16 tag, q_id;
+	unsigned int done;
+	int ret, buf_offset, srcu_idx;
+
+	if (!ub)
+		return -EPERM;
+
+	/* only support the direct pipe, and we do need the confirmation */
+	if (pipe != current->splice_pipe ||
+	    !(flags & SPLICE_F_KERN_NEED_CONFIRM))
+		return -EACCES;
+
+	ret = -EINVAL;
+
+	/* protect against request queue & disk removal */
+	srcu_idx = srcu_read_lock(&ub->srcu);
+
+	if (ub->dev_info.state == UBLK_S_DEV_DEAD)
+		goto exit;
+
+	tag = ublk_pos_to_tag(*ppos);
+	q_id = ublk_pos_to_hwq(*ppos);
+	buf_offset = ublk_pos_to_buf_offset(*ppos);
+
+	if (q_id >= ub->dev_info.nr_hw_queues)
+		goto exit;
+
+	ubq = ublk_get_queue(ub, q_id);
+	if (!ubq)
+		goto exit;
+
+	if (!(ubq->flags & UBLK_F_SUPPORT_ZERO_COPY))
+		goto exit;
+
+	/*
+	 * So far we only support splicing the request buffer from the ubq
+	 * daemon context, because the request may be gone in ->splice_read()
+	 * if the splice is called from another context.
+	 *
+	 * TODO: add request protection and relax this limit.
+	 */
+	if (ubq->ubq_daemon != current)
+		goto exit;
+
+	if (tag >= ubq->q_depth)
+		goto exit;
+
+	req = blk_mq_tag_to_rq(ub->tag_set.tags[q_id], tag);
+	if (!req || !blk_mq_request_started(req))
+		goto exit;
+
+	pr_devel("%s: qid %d tag %u offset %x, request bytes %u, len %llu flags %x\n",
+			__func__, q_id, tag, buf_offset, blk_rq_bytes(req),
+			(unsigned long long)len, flags);
+
+	if (!ublk_check_splice_rw(req, flags))
+		goto exit;
+
+	if (!ublk_rq_has_data(req) || !len)
+		goto exit;
+
+	if (buf_offset + len > blk_rq_bytes(req))
+		goto exit;
+
+	done = ret = 0;
+	rq_for_each_bvec(bv, req, rq_iter) {
+		struct pipe_buffer buf = {
+			.ops	= &ublk_pipe_buf_ops,
+			.flags	= 0,
+			.page	= bv.bv_page,
+			.offset	= bv.bv_offset,
+			.len	= bv.bv_len,
+		};
+
+		if (buf_offset > 0) {
+			if (buf_offset >= bv.bv_len) {
+				buf_offset -= bv.bv_len;
+				continue;
+			} else {
+				buf.offset += buf_offset;
+				buf.len -= buf_offset;
+				buf_offset = 0;
+			}
+		}
+
+		if (done + buf.len > len)
+			buf.len = len - done;
+		done += buf.len;
+
+		ret = add_to_pipe(pipe, &buf);
+		if (unlikely(ret < 0)) {
+			done -= buf.len;
+			break;
+		}
+		if (done >= len)
+			break;
+	}
+
+	if (done) {
+		*ppos += done;
+		ret = done;
+
+		pipe_ack_page_consume(pipe);
+	}
+exit:
+	srcu_read_unlock(&ub->srcu, srcu_idx);
+	return ret;
+}
+
 static const struct file_operations ublk_ch_fops = {
 	.owner = THIS_MODULE,
 	.open = ublk_ch_open,
 	.release = ublk_ch_release,
-	.llseek = no_llseek,
+	.llseek = noop_llseek,
 	.uring_cmd = ublk_ch_uring_cmd,
 	.mmap = ublk_ch_mmap,
+	.splice_read = ublk_splice_read,
 };
 
 static void ublk_deinit_queue(struct ublk_device *ub, int q_id)
@@ -1472,6 +1625,7 @@ static void ublk_cdev_rel(struct device *dev)
 	ublk_deinit_queues(ub);
 	ublk_free_dev_number(ub);
 	mutex_destroy(&ub->mutex);
+	cleanup_srcu_struct(&ub->srcu);
 	kfree(ub);
 }
 
@@ -1600,17 +1754,18 @@ static int ublk_ctrl_start_dev(struct ublk_device *ub, struct io_uring_cmd *cmd)
 	set_bit(GD_SUPPRESS_PART_SCAN, &disk->state);
 
 	get_device(&ub->cdev_dev);
+	ub->dev_info.state = UBLK_S_DEV_LIVE;
 	ret = add_disk(disk);
 	if (ret) {
 		/*
 		 * Has to drop the reference since ->free_disk won't be
 		 * called in case of add_disk failure.
		 */
+		ub->dev_info.state = UBLK_S_DEV_DEAD;
 		ublk_put_device(ub);
 		goto out_put_disk;
 	}
 	set_bit(UB_STATE_USED, &ub->state);
-	ub->dev_info.state = UBLK_S_DEV_LIVE;
 out_put_disk:
 	if (ret)
 		put_disk(disk);
@@ -1718,6 +1873,9 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 	ub = kzalloc(sizeof(*ub), GFP_KERNEL);
 	if (!ub)
 		goto out_unlock;
+	ret = init_srcu_struct(&ub->srcu);
+	if (ret)
+		goto out_free_ub;
 	mutex_init(&ub->mutex);
 	spin_lock_init(&ub->mm_lock);
 	INIT_WORK(&ub->quiesce_work, ublk_quiesce_work_fn);
@@ -1726,7 +1884,7 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 
 	ret = ublk_alloc_dev_number(ub, header->dev_id);
 	if (ret < 0)
-		goto out_free_ub;
+		goto out_clean_srcu;
 
 	memcpy(&ub->dev_info, &info, sizeof(info));
@@ -1744,9 +1902,6 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 	if (!IS_BUILTIN(CONFIG_BLK_DEV_UBLK))
 		ub->dev_info.flags |= UBLK_F_URING_CMD_COMP_IN_TASK;
 
-	/* We are not ready to support zero copy */
-	ub->dev_info.flags &= ~UBLK_F_SUPPORT_ZERO_COPY;
-
 	ub->dev_info.nr_hw_queues = min_t(unsigned int,
 			ub->dev_info.nr_hw_queues, nr_cpu_ids);
 	ublk_align_max_io_size(ub);
@@ -1776,6 +1931,8 @@ static int ublk_ctrl_add_dev(struct io_uring_cmd *cmd)
 	ublk_deinit_queues(ub);
 out_free_dev_number:
 	ublk_free_dev_number(ub);
+out_clean_srcu:
+	cleanup_srcu_struct(&ub->srcu);
 out_free_ub:
 	mutex_destroy(&ub->mutex);
 	kfree(ub);
diff --git a/include/uapi/linux/ublk_cmd.h b/include/uapi/linux/ublk_cmd.h
index f6238ccc7800..a2f6748ee4ca 100644
--- a/include/uapi/linux/ublk_cmd.h
+++ b/include/uapi/linux/ublk_cmd.h
@@ -54,7 +54,36 @@
 #define UBLKSRV_IO_BUF_OFFSET	0x80000000
 
 /* tag bit is 12bit, so at most 4096 IOs for each queue */
-#define UBLK_MAX_QUEUE_DEPTH	4096
+#define UBLK_TAG_BITS		12
+#define UBLK_MAX_QUEUE_DEPTH	(1U << UBLK_TAG_BITS)
+
+/* used in ->splice_read for supporting zero-copy */
+#define UBLK_BUFS_SIZE_BITS	42
+#define UBLK_BUFS_SIZE_MASK	((1ULL << UBLK_BUFS_SIZE_BITS) - 1)
+#define UBLK_BUF_SIZE_BITS	(UBLK_BUFS_SIZE_BITS - UBLK_TAG_BITS)
+#define UBLK_BUF_MAX_SIZE	(1ULL << UBLK_BUF_SIZE_BITS)
+
+static inline __u16 ublk_pos_to_hwq(__u64 pos)
+{
+	return pos >> UBLK_BUFS_SIZE_BITS;
+}
+
+static inline __u32 ublk_pos_to_buf_offset(__u64 pos)
+{
+	return (pos & UBLK_BUFS_SIZE_MASK) & (UBLK_BUF_MAX_SIZE - 1);
+}
+
+static inline __u16 ublk_pos_to_tag(__u64 pos)
+{
+	return (pos & UBLK_BUFS_SIZE_MASK) >> UBLK_BUF_SIZE_BITS;
+}
+
+/* offset of a single buffer, which has to be < UBLK_BUF_MAX_SIZE */
+static inline __u64 ublk_pos(__u16 q_id, __u16 tag, __u32 offset)
+{
+	return (((__u64)q_id) << UBLK_BUFS_SIZE_BITS) |
+		((((__u64)tag) << UBLK_BUF_SIZE_BITS) + offset);
+}
 
 /*
  * zero copy requires 4k block size, and can remap ublk driver's io