From patchwork Wed Sep 11 16:34:41 2024
X-Patchwork-Submitter: Pavel Begunkov
X-Patchwork-Id: 13800881
From: Pavel Begunkov
To: io-uring@vger.kernel.org
Cc: Jens Axboe, asml.silence@gmail.com, linux-block@vger.kernel.org, linux-mm@kvack.org, Christoph Hellwig, Conrad Meyer
Subject: [PATCH v5 5/8] block: implement async io_uring discard cmd
Date: Wed, 11 Sep 2024 17:34:41 +0100
Message-ID: <2b5210443e4fa0257934f73dfafcc18a77cd0e09.1726072086.git.asml.silence@gmail.com>
X-Mailer: git-send-email 2.45.2
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit

io_uring allows implementing custom file specific asynchronous
operations via the fops->uring_cmd callback, a.k.a. IORING_OP_URING_CMD
requests or just io_uring commands. Use it to add support for async
discards.

Normally, we first try to queue up bios in a non-blocking context, and
if that fails, we retry from a blocking context by returning -EAGAIN to
the io_uring core. We always get the result from bios asynchronously by
setting a custom bi_end_io callback, at which point we drag the request
into the task context to either reissue or complete it and post a
completion to the user.

Unlike ioctl(BLKDISCARD) with stronger guarantees against races, we only
do a best-effort attempt to invalidate the page cache, and it can race
with any writes and reads and leave the page cache stale. These are the
same kinds of races we allow for direct writes.

Also, apart from cases where discarding is not allowed at all, e.g.
discards are not supported or the file/device is read-only, the user
should assume that the sector range on disk is not valid anymore, even
when an error was returned to the user.

Suggested-by: Conrad Meyer
Signed-off-by: Pavel Begunkov
---
 block/blk.h                 |   1 +
 block/fops.c                |   2 +
 block/ioctl.c               | 112 ++++++++++++++++++++++++++++++++++++
 include/uapi/linux/blkdev.h |  14 +++++
 4 files changed, 129 insertions(+)
 create mode 100644 include/uapi/linux/blkdev.h

diff --git a/block/blk.h b/block/blk.h
index 32f4e9f630a3..1a1a18d118f7 100644
--- a/block/blk.h
+++ b/block/blk.h
@@ -605,6 +605,7 @@ blk_mode_t file_to_blk_mode(struct file *file);
 int truncate_bdev_range(struct block_device *bdev, blk_mode_t mode,
 		loff_t lstart, loff_t lend);
 long blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
+int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags);
 long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg);
 
 extern const struct address_space_operations def_blk_aops;
diff --git a/block/fops.c b/block/fops.c
index 9825c1713a49..8154b10b5abf 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -17,6 +17,7 @@
 #include
 #include
 #include
+#include
 #include "blk.h"
 
 static inline struct inode *bdev_file_inode(struct file *file)
@@ -873,6 +874,7 @@ const struct file_operations def_blk_fops = {
 	.splice_read	= filemap_splice_read,
 	.splice_write	= iter_file_splice_write,
 	.fallocate	= blkdev_fallocate,
+	.uring_cmd	= blkdev_uring_cmd,
 	.fop_flags	= FOP_BUFFER_RASYNC,
 };
 
diff --git a/block/ioctl.c b/block/ioctl.c
index 6d663d6ae036..007f6399de66 100644
--- a/block/ioctl.c
+++ b/block/ioctl.c
@@ -11,6 +11,9 @@
 #include
 #include
 #include
+#include
+#include
+#include
 #include "blk.h"
 
 static int blkpg_do_ioctl(struct block_device *bdev,
@@ -747,3 +750,112 @@ long compat_blkdev_ioctl(struct file *file, unsigned cmd, unsigned long arg)
 	return ret;
 }
 #endif
+
+struct blk_iou_cmd {
+	int res;
+	bool nowait;
+};
+
+static void blk_cmd_complete(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	struct blk_iou_cmd *bic = io_uring_cmd_to_pdu(cmd, struct blk_iou_cmd);
+
+	if (bic->res == -EAGAIN && bic->nowait)
+		io_uring_cmd_issue_blocking(cmd);
+	else
+		io_uring_cmd_done(cmd, bic->res, 0, issue_flags);
+}
+
+static void bio_cmd_bio_end_io(struct bio *bio)
+{
+	struct io_uring_cmd *cmd = bio->bi_private;
+	struct blk_iou_cmd *bic = io_uring_cmd_to_pdu(cmd, struct blk_iou_cmd);
+
+	if (unlikely(bio->bi_status) && !bic->res)
+		bic->res = blk_status_to_errno(bio->bi_status);
+
+	io_uring_cmd_do_in_task_lazy(cmd, blk_cmd_complete);
+	bio_put(bio);
+}
+
+static int blkdev_cmd_discard(struct io_uring_cmd *cmd,
+			      struct block_device *bdev,
+			      uint64_t start, uint64_t len, bool nowait)
+{
+	struct blk_iou_cmd *bic = io_uring_cmd_to_pdu(cmd, struct blk_iou_cmd);
+	gfp_t gfp = nowait ? GFP_NOWAIT : GFP_KERNEL;
+	sector_t sector = start >> SECTOR_SHIFT;
+	sector_t nr_sects = len >> SECTOR_SHIFT;
+	struct bio *prev = NULL, *bio;
+	int err;
+
+	if (!bdev_max_discard_sectors(bdev))
+		return -EOPNOTSUPP;
+	if (!(file_to_blk_mode(cmd->file) & BLK_OPEN_WRITE))
+		return -EBADF;
+	if (bdev_read_only(bdev))
+		return -EPERM;
+	err = blk_validate_byte_range(bdev, start, len);
+	if (err)
+		return err;
+
+	err = filemap_invalidate_pages(bdev->bd_mapping, start,
+				       start + len - 1, nowait);
+	if (err)
+		return err;
+
+	while (true) {
+		bio = blk_alloc_discard_bio(bdev, &sector, &nr_sects, gfp);
+		if (!bio)
+			break;
+		if (nowait) {
+			/*
+			 * Don't allow multi-bio non-blocking submissions as
+			 * subsequent bios may fail but we won't get a direct
+			 * indication of that. Normally, the caller should
+			 * retry from a blocking context.
+			 */
+			if (unlikely(nr_sects)) {
+				bio_put(bio);
+				return -EAGAIN;
+			}
+			bio->bi_opf |= REQ_NOWAIT;
+		}
+
+		prev = bio_chain_and_submit(prev, bio);
+	}
+	if (unlikely(!prev))
+		return -EAGAIN;
+	if (unlikely(nr_sects))
+		bic->res = -EAGAIN;
+
+	prev->bi_private = cmd;
+	prev->bi_end_io = bio_cmd_bio_end_io;
+	submit_bio(prev);
+	return -EIOCBQUEUED;
+}
+
+int blkdev_uring_cmd(struct io_uring_cmd *cmd, unsigned int issue_flags)
+{
+	struct block_device *bdev = I_BDEV(cmd->file->f_mapping->host);
+	struct blk_iou_cmd *bic = io_uring_cmd_to_pdu(cmd, struct blk_iou_cmd);
+	const struct io_uring_sqe *sqe = cmd->sqe;
+	u32 cmd_op = cmd->cmd_op;
+	uint64_t start, len;
+
+	if (unlikely(sqe->ioprio || sqe->__pad1 || sqe->len ||
+		     sqe->rw_flags || sqe->file_index))
+		return -EINVAL;
+
+	bic->res = 0;
+	bic->nowait = issue_flags & IO_URING_F_NONBLOCK;
+
+	start = READ_ONCE(sqe->addr);
+	len = READ_ONCE(sqe->addr3);
+
+	switch (cmd_op) {
+	case BLOCK_URING_CMD_DISCARD:
+		return blkdev_cmd_discard(cmd, bdev, start, len, bic->nowait);
+	}
+	return -EINVAL;
+}
diff --git a/include/uapi/linux/blkdev.h b/include/uapi/linux/blkdev.h
new file mode 100644
index 000000000000..66373cd1a83a
--- /dev/null
+++ b/include/uapi/linux/blkdev.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
+#ifndef _UAPI_LINUX_BLKDEV_H
+#define _UAPI_LINUX_BLKDEV_H
+
+#include
+#include
+
+/*
+ * io_uring block file commands, see IORING_OP_URING_CMD.
+ * It's a different number space from ioctl(), reuse the block's code 0x12.
+ */
+#define BLOCK_URING_CMD_DISCARD			_IO(0x12, 0)
+
+#endif
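
For readers following the non-blocking/blocking retry logic, here is a rough userspace model of the decisions made in blkdev_cmd_discard() and blk_cmd_complete(). It is only a sketch and not kernel code: model_discard_submit(), model_needs_blocking_reissue(), MODEL_QUEUED and the max_per_bio parameter are hypothetical names invented for illustration. In particular, max_per_bio stands in for whatever split blk_alloc_discard_bio() actually chooses, and allocation failures are not modelled.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for -EIOCBQUEUED, which is kernel-internal. */
#define MODEL_QUEUED 1

/*
 * Model of the submission loop: nr_sects is the requested length in
 * sectors, max_per_bio an assumed per-bio limit. Returns MODEL_QUEUED if
 * bios were submitted, -EAGAIN if the caller should retry from a blocking
 * context, -EOPNOTSUPP if discards are unsupported (limit of zero).
 */
static int model_discard_submit(uint64_t nr_sects, uint64_t max_per_bio,
				bool nowait, unsigned *bios_submitted)
{
	unsigned submitted = 0;

	*bios_submitted = 0;
	if (!max_per_bio)
		return -EOPNOTSUPP;

	while (nr_sects) {
		uint64_t chunk = nr_sects < max_per_bio ? nr_sects : max_per_bio;

		nr_sects -= chunk;
		/* A non-blocking attempt must fit into a single bio. */
		if (nowait && nr_sects)
			return -EAGAIN;
		submitted++;
	}
	/* Nothing was queued, mirroring the !prev check in the patch. */
	if (!submitted)
		return -EAGAIN;

	*bios_submitted = submitted;
	return MODEL_QUEUED;
}

/* Model of the branch in blk_cmd_complete(): reissue or complete? */
static bool model_needs_blocking_reissue(int res, bool nowait)
{
	return res == -EAGAIN && nowait;
}
```

Under these assumptions, a 32-sector request with a 16-sector limit gets -EAGAIN on the non-blocking attempt and is then submitted as two bios from the blocking context, while an 8-sector request is queued as a single bio on the first try.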