From patchwork Wed Mar 22 19:11:48 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Boris Burkov
X-Patchwork-Id: 13184520
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v5 1/5] btrfs: add function to create and return an ordered extent
Date: Wed, 22 Mar 2023 12:11:48 -0700
Message-Id: <3fac8b7cb05dabbb11205aa9076c889ca2894eb3.1679512207.git.boris@bur.io>
X-Mailer: git-send-email 2.38.1
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org

Currently, btrfs_add_ordered_extent allocates a new ordered extent, adds
it to the rb_tree, but doesn't return a referenced pointer to the caller.
There are cases where it is useful for the creator of a new
ordered_extent to hang on to such a pointer, so add a new function
btrfs_alloc_ordered_extent which is the same as
btrfs_add_ordered_extent, except it takes an additional reference count
and returns a pointer to the ordered_extent. Implement
btrfs_add_ordered_extent as btrfs_alloc_ordered_extent followed by
dropping the new reference and handling the IS_ERR case.

The type of flags in btrfs_alloc_ordered_extent and
btrfs_add_ordered_extent is changed from unsigned int to unsigned long
so it's unified with the other ordered extent functions.

Reviewed-by: Filipe Manana
Reviewed-by: Christoph Hellwig
Signed-off-by: Boris Burkov
Signed-off-by: David Sterba
---
 fs/btrfs/ordered-data.c | 46 +++++++++++++++++++++++++++++++++--------
 fs/btrfs/ordered-data.h |  7 ++++++-
 2 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 6c24b69e2d0a..1848d0d1a9c4 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -160,14 +160,16 @@ static inline struct rb_node *tree_search(struct btrfs_ordered_inode_tree *tree,
  * @compress_type: Compression algorithm used for data.
  *
  * Most of these parameters correspond to &struct btrfs_file_extent_item. The
- * tree is given a single reference on the ordered extent that was inserted.
+ * tree is given a single reference on the ordered extent that was inserted, and
+ * the returned pointer is given a second reference.
  *
- * Return: 0 or -ENOMEM.
+ * Return: the new ordered extent or ERR_PTR(-ENOMEM).
  */
-int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
-			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-			     u64 disk_num_bytes, u64 offset, unsigned flags,
-			     int compress_type)
+struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
+			struct btrfs_inode *inode, u64 file_offset,
+			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			u64 disk_num_bytes, u64 offset, unsigned long flags,
+			int compress_type)
 {
 	struct btrfs_root *root = inode->root;
 	struct btrfs_fs_info *fs_info = root->fs_info;
@@ -181,7 +183,7 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 		/* For nocow write, we can release the qgroup rsv right now */
 		ret = btrfs_qgroup_free_data(inode, NULL, file_offset, num_bytes);
 		if (ret < 0)
-			return ret;
+			return ERR_PTR(ret);
 		ret = 0;
 	} else {
 		/*
@@ -190,11 +192,11 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 		 */
 		ret = btrfs_qgroup_release_data(inode, file_offset, num_bytes);
 		if (ret < 0)
-			return ret;
+			return ERR_PTR(ret);
 	}
 	entry = kmem_cache_zalloc(btrfs_ordered_extent_cache, GFP_NOFS);
 	if (!entry)
-		return -ENOMEM;
+		return ERR_PTR(-ENOMEM);
 
 	entry->file_offset = file_offset;
 	entry->num_bytes = num_bytes;
@@ -256,6 +258,32 @@ int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 	btrfs_mod_outstanding_extents(inode, 1);
 	spin_unlock(&inode->lock);
 
+	/* One ref for the returned entry to match semantics of lookup. */
+	refcount_inc(&entry->refs);
+
+	return entry;
+}
+
+/*
+ * Add a new btrfs_ordered_extent for the range, but drop the reference instead
+ * of returning it to the caller.
+ */
+int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
+			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			     u64 disk_num_bytes, u64 offset, unsigned long flags,
+			     int compress_type)
+{
+	struct btrfs_ordered_extent *ordered;
+
+	ordered = btrfs_alloc_ordered_extent(inode, file_offset, num_bytes,
+					     ram_bytes, disk_bytenr,
+					     disk_num_bytes, offset, flags,
+					     compress_type);
+
+	if (IS_ERR(ordered))
+		return PTR_ERR(ordered);
+	btrfs_put_ordered_extent(ordered);
+
 	return 0;
 }
 
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index eb40cb39f842..18007f9c00ad 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -178,9 +178,14 @@ void btrfs_mark_ordered_io_finished(struct btrfs_inode *inode,
 bool btrfs_dec_test_ordered_pending(struct btrfs_inode *inode,
 				    struct btrfs_ordered_extent **cached,
 				    u64 file_offset, u64 io_size);
+struct btrfs_ordered_extent *btrfs_alloc_ordered_extent(
+			struct btrfs_inode *inode, u64 file_offset,
+			u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
+			u64 disk_num_bytes, u64 offset, unsigned long flags,
+			int compress_type);
 int btrfs_add_ordered_extent(struct btrfs_inode *inode, u64 file_offset,
 			     u64 num_bytes, u64 ram_bytes, u64 disk_bytenr,
-			     u64 disk_num_bytes, u64 offset, unsigned flags,
+			     u64 disk_num_bytes, u64 offset, unsigned long flags,
 			     int compress_type);
 void btrfs_add_ordered_sum(struct btrfs_ordered_extent *entry,
 			   struct btrfs_ordered_sum *sum);

From patchwork Wed Mar 22 19:11:49 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Boris Burkov
X-Patchwork-Id: 13184521
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v5 2/5] btrfs: stash ordered extent in dio_data during iomap dio
Date: Wed, 22 Mar 2023 12:11:49 -0700
X-Mailer: git-send-email 2.38.1
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org

While it is not feasible for an ordered extent to survive across the
calls btrfs_direct_write makes into __iomap_dio_rw, it is still helpful
to stash it on the dio_data in between creating it in iomap_begin and
finishing it in either end_io or iomap_end.

The specific use I have in mind is that we can check if a particular bio
is partial in submit_io without unconditionally looking up the ordered
extent. This is a preparatory patch for a later patch which does just
that.
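
The reference handling described above can be sketched in plain user-space C.
This is only an illustrative model, not kernel code: every model_* name is
hypothetical, the refcount is a bare int rather than refcount_t, and the
"tree" reference is simply assumed to be held elsewhere. It shows the
allocation handing out two references, iomap_begin stashing the caller's
reference on the dio data, and iomap_end dropping it and clearing the stash.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct btrfs_ordered_extent: only a refcount. */
struct model_ordered {
	int refs;
};

/* Hypothetical stand-in for struct btrfs_dio_data. */
struct model_dio_data {
	struct model_ordered *ordered;
};

/* Models the alloc variant: one reference for the tree, one for the caller. */
static struct model_ordered *model_alloc_ordered(void)
{
	struct model_ordered *oe = calloc(1, sizeof(*oe));

	if (!oe)
		return NULL;
	oe->refs = 2;
	return oe;
}

/* Drop one reference; free the object when the last one goes away. */
static void model_put_ordered(struct model_ordered *oe)
{
	if (oe && --oe->refs == 0)
		free(oe);
}

/* Models the iomap_begin side: create the extent and stash the caller's ref. */
static int model_iomap_begin(struct model_dio_data *dio)
{
	struct model_ordered *oe = model_alloc_ordered();

	if (!oe)
		return -1;
	assert(!dio->ordered);	/* nothing may be stashed already */
	dio->ordered = oe;
	return 0;
}

/* Models the iomap_end side: drop the stashed ref and clear the pointer. */
static void model_iomap_end(struct model_dio_data *dio)
{
	model_put_ordered(dio->ordered);
	dio->ordered = NULL;
}
```

Zero-initializing the dio data before the first call matters in this model for
the same reason the patch switches to "= { 0 }": the stash is asserted to be
empty before it is filled.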
Signed-off-by: Boris Burkov
---
 fs/btrfs/inode.c | 37 ++++++++++++++++++++++++-------------
 1 file changed, 24 insertions(+), 13 deletions(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 76d93b9e94a9..5ab486f448eb 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -81,6 +81,7 @@ struct btrfs_dio_data {
 	struct extent_changeset *data_reserved;
 	bool data_space_reserved;
 	bool nocow_done;
+	struct btrfs_ordered_extent *ordered;
 };
 
 struct btrfs_dio_private {
@@ -6968,6 +6969,7 @@ struct extent_map *btrfs_get_extent(struct btrfs_inode *inode,
 }
 
 static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
+						  struct btrfs_dio_data *dio_data,
 						  const u64 start,
 						  const u64 len,
 						  const u64 orig_start,
@@ -6978,7 +6980,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 						  const int type)
 {
 	struct extent_map *em = NULL;
-	int ret;
+	struct btrfs_ordered_extent *ordered;
 
 	if (type != BTRFS_ORDERED_NOCOW) {
 		em = create_io_em(inode, start, len, orig_start, block_start,
@@ -6988,18 +6990,21 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 		if (IS_ERR(em))
 			goto out;
 	}
-	ret = btrfs_add_ordered_extent(inode, start, len, len, block_start,
-				       block_len, 0,
-				       (1 << type) |
-				       (1 << BTRFS_ORDERED_DIRECT),
-				       BTRFS_COMPRESS_NONE);
-	if (ret) {
+	ordered = btrfs_alloc_ordered_extent(inode, start, len, len,
+					     block_start, block_len, 0,
+					     (1 << type) |
+					     (1 << BTRFS_ORDERED_DIRECT),
+					     BTRFS_COMPRESS_NONE);
+	if (IS_ERR(ordered)) {
 		if (em) {
 			free_extent_map(em);
 			btrfs_drop_extent_map_range(inode, start,
 						    start + len - 1, false);
 		}
-		em = ERR_PTR(ret);
+		em = ERR_PTR(PTR_ERR(ordered));
+	} else {
+		ASSERT(!dio_data->ordered);
+		dio_data->ordered = ordered;
 	}
  out:
 
@@ -7007,6 +7012,7 @@ static struct extent_map *btrfs_create_dio_extent(struct btrfs_inode *inode,
 }
 
 static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
+						  struct btrfs_dio_data *dio_data,
 						  u64 start, u64 len)
 {
 	struct btrfs_root *root = inode->root;
@@ -7022,7 +7028,8 @@ static struct extent_map *btrfs_new_extent_direct(struct btrfs_inode *inode,
 	if (ret)
 		return ERR_PTR(ret);
 
-	em = btrfs_create_dio_extent(inode, start, ins.offset, start,
+	em = btrfs_create_dio_extent(inode, dio_data,
+				     start, ins.offset, start,
 				     ins.objectid, ins.offset, ins.offset,
 				     ins.offset, BTRFS_ORDERED_REGULAR);
 	btrfs_dec_block_group_reservations(fs_info, ins.objectid);
@@ -7367,7 +7374,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 		}
 		space_reserved = true;
 
-		em2 = btrfs_create_dio_extent(BTRFS_I(inode), start, len,
+		em2 = btrfs_create_dio_extent(BTRFS_I(inode), dio_data, start, len,
 					      orig_start, block_start,
 					      len, orig_block_len,
 					      ram_bytes, type);
@@ -7409,7 +7416,7 @@ static int btrfs_get_blocks_direct_write(struct extent_map **map,
 			goto out;
 		space_reserved = true;
 
-		em = btrfs_new_extent_direct(BTRFS_I(inode), start, len);
+		em = btrfs_new_extent_direct(BTRFS_I(inode), dio_data, start, len);
 		if (IS_ERR(em)) {
 			ret = PTR_ERR(em);
 			goto out;
@@ -7715,6 +7722,10 @@ static int btrfs_dio_iomap_end(struct inode *inode, loff_t pos, loff_t length,
 				      pos + length - 1, NULL);
 		ret = -ENOTBLK;
 	}
+	if (write) {
+		btrfs_put_ordered_extent(dio_data->ordered);
+		dio_data->ordered = NULL;
+	}
 
 	if (write)
 		extent_changeset_free(dio_data->data_reserved);
@@ -7776,7 +7787,7 @@ static const struct iomap_dio_ops btrfs_dio_ops = {
 
 ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_before)
 {
-	struct btrfs_dio_data data;
+	struct btrfs_dio_data data = { 0 };
 
 	return iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops, &btrfs_dio_ops,
 			    IOMAP_DIO_PARTIAL, &data, done_before);
@@ -7785,7 +7796,7 @@ ssize_t btrfs_dio_read(struct kiocb *iocb, struct iov_iter *iter, size_t done_be
 struct iomap_dio *btrfs_dio_write(struct kiocb *iocb, struct iov_iter *iter,
 				  size_t done_before)
 {
-	struct btrfs_dio_data data;
+	struct btrfs_dio_data data = { 0 };
 
 	return __iomap_dio_rw(iocb, iter, &btrfs_dio_iomap_ops,
 			      &btrfs_dio_ops, IOMAP_DIO_PARTIAL, &data, done_before);

From patchwork Wed Mar 22 19:11:50 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Boris Burkov
X-Patchwork-Id: 13184522
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v5 3/5] btrfs: return ordered_extent splits from bio extraction
Date: Wed, 22 Mar 2023 12:11:50 -0700
X-Mailer: git-send-email 2.38.1
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org

When extracting a bio from its ordered extent for dio partial writes, we
need the "remainder" ordered extent.
It would be possible to look it up in that case, but since we can grab
the ordered_extent from the new allocation function, we might as well
wire it up to be returned to the caller via out parameter and save that
lookup.

Refactor the clone ordered extent function to return the new ordered
extent, then refactor the split and extract functions to pass back the
new pre and post split ordered extents via output parameter.

Signed-off-by: Boris Burkov
---
 fs/btrfs/bio.c          |  2 +-
 fs/btrfs/btrfs_inode.h  |  5 ++++-
 fs/btrfs/inode.c        | 36 +++++++++++++++++++++++++++---------
 fs/btrfs/ordered-data.c | 36 +++++++++++++++++++++++-------------
 fs/btrfs/ordered-data.h |  6 ++++--
 5 files changed, 59 insertions(+), 26 deletions(-)

diff --git a/fs/btrfs/bio.c b/fs/btrfs/bio.c
index cf09c6271edb..b849ced40d37 100644
--- a/fs/btrfs/bio.c
+++ b/fs/btrfs/bio.c
@@ -653,7 +653,7 @@ static bool btrfs_submit_chunk(struct btrfs_bio *bbio, int mirror_num)
 		if (use_append) {
 			bio->bi_opf &= ~REQ_OP_WRITE;
 			bio->bi_opf |= REQ_OP_ZONE_APPEND;
-			ret = btrfs_extract_ordered_extent(bbio);
+			ret = btrfs_extract_ordered_extent_bio(bbio, NULL, NULL, NULL);
 			if (ret)
 				goto fail_put_bio;
 		}
diff --git a/fs/btrfs/btrfs_inode.h b/fs/btrfs/btrfs_inode.h
index 9dc21622806e..e92a09559058 100644
--- a/fs/btrfs/btrfs_inode.h
+++ b/fs/btrfs/btrfs_inode.h
@@ -407,7 +407,10 @@ static inline void btrfs_inode_split_flags(u64 inode_item_flags,
 
 int btrfs_check_sector_csum(struct btrfs_fs_info *fs_info, struct page *page,
 			    u32 pgoff, u8 *csum, const u8 * const csum_expected);
-blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio);
+blk_status_t btrfs_extract_ordered_extent_bio(struct btrfs_bio *bbio,
+					      struct btrfs_ordered_extent *ordered,
+					      struct btrfs_ordered_extent **ret_pre,
+					      struct btrfs_ordered_extent **ret_post);
 bool btrfs_data_csum_ok(struct btrfs_bio *bbio, struct btrfs_device *dev,
 			u32 bio_offset, struct bio_vec *bv);
 noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len,
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 5ab486f448eb..e30390051f15 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -2514,10 +2514,14 @@ void btrfs_clear_delalloc_extent(struct btrfs_inode *inode,
 /*
  * Split an extent_map at [start, start + len]
  *
- * This function is intended to be used only for extract_ordered_extent().
+ * This function is intended to be used only for
+ * btrfs_extract_ordered_extent_bio().
+ *
+ * It makes assumptions about the extent map that are only valid in the narrow
+ * situations in which we are extracting a bio from a containing ordered extent,
+ * that are specific to zoned filesystems or partial dio writes.
  */
-static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
-			  u64 pre, u64 post)
+static int split_em(struct btrfs_inode *inode, u64 start, u64 len, u64 pre, u64 post)
 {
 	struct extent_map_tree *em_tree = &inode->extent_tree;
 	struct extent_map *em;
@@ -2626,22 +2630,36 @@ static int split_zoned_em(struct btrfs_inode *inode, u64 start, u64 len,
 	return ret;
 }
 
-blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio)
+/*
+ * Extract a bio from an ordered extent to enforce an invariant where the bio
+ * fully matches a single ordered extent.
+ *
+ * @bbio: the bio to extract.
+ * @ordered: the ordered extent the bio is in, will be shrunk to fit. If NULL we
+ *	     will look it up.
+ * @ret_pre: out parameter to return the new oe in front of the bio, if needed.
+ * @ret_post: out parameter to return the new oe past the bio, if needed.
+ */ +blk_status_t btrfs_extract_ordered_extent_bio(struct btrfs_bio *bbio, + struct btrfs_ordered_extent *ordered, + struct btrfs_ordered_extent **ret_pre, + struct btrfs_ordered_extent **ret_post) { u64 start = (u64)bbio->bio.bi_iter.bi_sector << SECTOR_SHIFT; u64 len = bbio->bio.bi_iter.bi_size; struct btrfs_inode *inode = bbio->inode; - struct btrfs_ordered_extent *ordered; u64 file_len; u64 end = start + len; u64 ordered_end; u64 pre, post; int ret = 0; - ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset); + if (!ordered) + ordered = btrfs_lookup_ordered_extent(inode, bbio->file_offset); if (WARN_ON_ONCE(!ordered)) return BLK_STS_IOERR; + ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes; /* No need to split */ if (ordered->disk_num_bytes == len) goto out; @@ -2658,7 +2676,6 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) goto out; } - ordered_end = ordered->disk_bytenr + ordered->disk_num_bytes; /* bio must be in one ordered extent */ if (WARN_ON_ONCE(start < ordered->disk_bytenr || end > ordered_end)) { ret = -EINVAL; @@ -2675,10 +2692,11 @@ blk_status_t btrfs_extract_ordered_extent(struct btrfs_bio *bbio) pre = start - ordered->disk_bytenr; post = ordered_end - end; - ret = btrfs_split_ordered_extent(ordered, pre, post); + ret = btrfs_split_ordered_extent(ordered, pre, post, ret_pre, ret_post); if (ret) goto out; - ret = split_zoned_em(inode, bbio->file_offset, file_len, pre, post); + if (!test_bit(BTRFS_ORDERED_NOCOW, &ordered->flags)) + ret = split_em(inode, bbio->file_offset, file_len, pre, post); out: btrfs_put_ordered_extent(ordered); diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c index 1848d0d1a9c4..4bebebb9b434 100644 --- a/fs/btrfs/ordered-data.c +++ b/fs/btrfs/ordered-data.c @@ -1117,8 +1117,8 @@ bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end, } -static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos, - u64 len) +static struct 
btrfs_ordered_extent *clone_ordered_extent(struct btrfs_ordered_extent *ordered, + u64 pos, u64 len) { struct inode *inode = ordered->inode; struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; @@ -1133,18 +1133,22 @@ static int clone_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pos, percpu_counter_add_batch(&fs_info->ordered_bytes, -len, fs_info->delalloc_batch); WARN_ON_ONCE(flags & (1 << BTRFS_ORDERED_COMPRESSED)); - return btrfs_add_ordered_extent(BTRFS_I(inode), file_offset, len, len, - disk_bytenr, len, 0, flags, - ordered->compress_type); + return btrfs_alloc_ordered_extent(BTRFS_I(inode), file_offset, len, len, + disk_bytenr, len, 0, flags, + ordered->compress_type); } -int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, - u64 post) +int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, + u64 pre, u64 post, + struct btrfs_ordered_extent **ret_pre, + struct btrfs_ordered_extent **ret_post) + { struct inode *inode = ordered->inode; struct btrfs_ordered_inode_tree *tree = &BTRFS_I(inode)->ordered_tree; struct rb_node *node; struct btrfs_fs_info *fs_info = btrfs_sb(inode->i_sb); + struct btrfs_ordered_extent *oe; int ret = 0; trace_btrfs_ordered_extent_split(BTRFS_I(inode), ordered); @@ -1172,12 +1176,18 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre, spin_unlock_irq(&tree->lock); - if (pre) - ret = clone_ordered_extent(ordered, 0, pre); - if (ret == 0 && post) - ret = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, - post); - + if (pre) { + oe = clone_ordered_extent(ordered, 0, pre); + ret = IS_ERR(oe) ? PTR_ERR(oe) : 0; + if (!ret && ret_pre) + *ret_pre = oe; + } + if (!ret && post) { + oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post); + ret = IS_ERR(oe) ? 
PTR_ERR(oe) : 0;
+		if (!ret && ret_post)
+			*ret_post = oe;
+	}
 	return ret;
 }
diff --git a/fs/btrfs/ordered-data.h b/fs/btrfs/ordered-data.h
index 18007f9c00ad..933f6f0d8c10 100644
--- a/fs/btrfs/ordered-data.h
+++ b/fs/btrfs/ordered-data.h
@@ -212,8 +212,10 @@ void btrfs_lock_and_flush_ordered_range(struct btrfs_inode *inode, u64 start,
 					struct extent_state **cached_state);
 bool btrfs_try_lock_ordered_range(struct btrfs_inode *inode, u64 start, u64 end,
 				  struct extent_state **cached_state);
-int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered, u64 pre,
-			       u64 post);
+int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
+			       u64 pre, u64 post,
+			       struct btrfs_ordered_extent **ret_pre,
+			       struct btrfs_ordered_extent **ret_post);
 int __init ordered_data_init(void);
 void __cold ordered_data_exit(void);
 

From patchwork Wed Mar 22 19:11:51 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Boris Burkov
X-Patchwork-Id: 13184523
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v5 4/5] btrfs: fix crash with non-zero pre in btrfs_split_ordered_extent
Date: Wed, 22 Mar 2023 12:11:51 -0700
X-Mailer: git-send-email 2.38.1
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org

If pre != 0 in btrfs_split_ordered_extent, then we do the following:
1. remove ordered (at file_offset) from the rb tree
2. modify file_offset+=pre
3. re-insert ordered
4. clone an ordered extent at offset 0 length pre from ordered.
5. clone an ordered extent for the post range, if necessary.

Step 4 is not correct, as at this point, the start of ordered is already
the end of the desired new pre extent. Further, this causes a panic when
btrfs_alloc_ordered_extent sees that the node (from the modified and
re-inserted ordered) is already present at file_offset + 0 = file_offset.

We can fix this by either using a negative offset, or by moving the
clone of the pre extent to after we remove the original one, but before
we modify and re-insert it. The former feels quite kludgy, as we are
"cloning" from outside the range of the ordered extent, so opt for the
latter, which does have some locking annoyances.
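The ordering problem in step 4 can be sketched in userspace C. This is
only a toy model under stated assumptions: struct toy_extent, struct
toy_tree and the toy_* helpers are hypothetical stand-ins, not btrfs
code, and the toy insert returns -EEXIST where the kernel would panic on
the duplicate key.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an ordered extent: just a byte range in the file. */
struct toy_extent {
	uint64_t file_offset;
	uint64_t num_bytes;
};

/* Toy stand-in for the per-inode tree, keyed by file_offset. */
#define TOY_MAX 8
struct toy_tree {
	struct toy_extent slot[TOY_MAX];
	size_t count;
};

/* Insert a range; refuse a duplicate start offset (assumed behaviour). */
static int toy_insert(struct toy_tree *t, uint64_t off, uint64_t len)
{
	for (size_t i = 0; i < t->count; i++)
		if (t->slot[i].file_offset == off)
			return -EEXIST;
	if (t->count == TOY_MAX)
		return -ENOSPC;
	t->slot[t->count++] = (struct toy_extent){ off, len };
	return 0;
}

/* Drop the range that starts at off, if there is one. */
static void toy_remove(struct toy_tree *t, uint64_t off)
{
	for (size_t i = 0; i < t->count; i++) {
		if (t->slot[i].file_offset == off) {
			t->slot[i] = t->slot[--t->count];
			return;
		}
	}
}

/* Old ordering: trim and re-insert first, then clone "offset 0, length pre". */
static int toy_split_pre_old(struct toy_tree *t, struct toy_extent *oe,
			     uint64_t pre)
{
	assert(pre > 0 && pre < oe->num_bytes);
	toy_remove(t, oe->file_offset);
	oe->file_offset += pre;
	oe->num_bytes -= pre;
	int ret = toy_insert(t, oe->file_offset, oe->num_bytes);
	if (ret)
		return ret;
	/* file_offset + 0 is now the key of the trimmed extent itself. */
	return toy_insert(t, oe->file_offset + 0, pre);
}

/* New ordering: clone the pre range while oe still starts at the old offset. */
static int toy_split_pre_new(struct toy_tree *t, struct toy_extent *oe,
			     uint64_t pre)
{
	assert(pre > 0 && pre < oe->num_bytes);
	toy_remove(t, oe->file_offset);
	int ret = toy_insert(t, oe->file_offset + 0, pre);
	if (ret)
		return ret;
	oe->file_offset += pre;
	oe->num_bytes -= pre;
	return toy_insert(t, oe->file_offset, oe->num_bytes);
}
```

For an extent at offset 4096 with pre = 4096, the old ordering tries to
insert two ranges that both start at 8192 and fails, while the new
ordering leaves one range starting at 4096 and one starting at 8192.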
Signed-off-by: Boris Burkov
---
 fs/btrfs/ordered-data.c | 20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/btrfs/ordered-data.c b/fs/btrfs/ordered-data.c
index 4bebebb9b434..d14a3fe1a113 100644
--- a/fs/btrfs/ordered-data.c
+++ b/fs/btrfs/ordered-data.c
@@ -1161,6 +1161,17 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
 	if (tree->last == node)
 		tree->last = NULL;
 
+	if (pre) {
+		spin_unlock_irq(&tree->lock);
+		oe = clone_ordered_extent(ordered, 0, pre);
+		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
+		if (!ret && ret_pre)
+			*ret_pre = oe;
+		if (ret)
+			goto out;
+		spin_lock_irq(&tree->lock);
+	}
+
 	ordered->file_offset += pre;
 	ordered->disk_bytenr += pre;
 	ordered->num_bytes -= (pre + post);
@@ -1176,18 +1187,13 @@ int btrfs_split_ordered_extent(struct btrfs_ordered_extent *ordered,
 
 	spin_unlock_irq(&tree->lock);
 
-	if (pre) {
-		oe = clone_ordered_extent(ordered, 0, pre);
-		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
-		if (!ret && ret_pre)
-			*ret_pre = oe;
-	}
-	if (!ret && post) {
+	if (post) {
 		oe = clone_ordered_extent(ordered, pre + ordered->disk_num_bytes, post);
 		ret = IS_ERR(oe) ? PTR_ERR(oe) : 0;
 		if (!ret && ret_post)
 			*ret_post = oe;
 	}
+out:
 	return ret;
 }

From patchwork Wed Mar 22 19:11:52 2023
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Boris Burkov
X-Patchwork-Id: 13184524
From: Boris Burkov
To: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH v5 5/5] btrfs: split partial dio bios before submit
Date: Wed, 22 Mar 2023 12:11:52 -0700
X-Mailer: git-send-email 2.38.1
Precedence: bulk
X-Mailing-List: linux-btrfs@vger.kernel.org

If an application is doing direct io to a btrfs file and
experiences a page fault reading from the write buffer, iomap will issue
a partial bio, and allow the fs to keep going. However, there was a
subtle bug in this codepath in the btrfs dio iomap implementation that
caused the partial write to end up as a gap in the file's extents and to
be read back as zeros.

The sequence of events in a partial write, lightly summarized and
trimmed down for brevity, is as follows:

====WRITING TASK====
btrfs_direct_write
__iomap_dio_rw
iomap_iter
  btrfs_dio_iomap_begin # create full ordered extent
iomap_dio_bio_iter
  bio_iov_iter_get_pages # page fault; partial read
  submit_bio # partial bio
iomap_iter
  btrfs_dio_iomap_end
    btrfs_mark_ordered_io_finished # sets BTRFS_ORDERED_IOERR;
                                   # submit to finish_ordered_fn wq
fault_in_iov_iter_readable # btrfs_direct_write detects partial write
__iomap_dio_rw
iomap_iter
  btrfs_dio_iomap_begin # create second partial ordered extent
iomap_dio_bio_iter
  bio_iov_iter_get_pages # read all of remainder
  submit_bio # partial bio with all of remainder
iomap_iter
  btrfs_dio_iomap_end # nothing exciting to do with ordered io

====DIO ENDIO====
==FIRST PARTIAL BIO==
btrfs_dio_end_io
  btrfs_mark_ordered_io_finished # bytes_left > 0
                                 # don't submit to finish_ordered_fn wq
==SECOND PARTIAL BIO==
btrfs_dio_end_io
  btrfs_mark_ordered_io_finished # bytes_left == 0
                                 # submit to finish_ordered_fn wq

====BTRFS FINISH ORDERED WQ====
==FIRST PARTIAL BIO==
btrfs_finish_ordered_io # called by btrfs_dio_iomap_end, sees
                        # BTRFS_ORDERED_IOERR, just drops the
                        # ordered_extent
==SECOND PARTIAL BIO==
btrfs_finish_ordered_io # called by btrfs_dio_end_io, writes out file
                        # extents, csums, etc...

The essence of the problem is that while btrfs_direct_write and iomap
properly interact to submit all the correct bios, there is insufficient
logic in the btrfs dio functions (btrfs_dio_iomap_begin,
btrfs_dio_submit_io, btrfs_dio_end_io, and btrfs_dio_iomap_end) to
ensure that every bio is at least a part of a completed ordered_extent.
And it is completing an ordered_extent that triggers crucial work such
as writing out a file extent for the range.

More specifically, btrfs_dio_end_io treats the ordered extent as
unfinished but btrfs_dio_iomap_end sets BTRFS_ORDERED_IOERR on it. Thus,
the finish io work doesn't result in file extents, csums, etc... In the
aftermath, such a file behaves as though it has a hole in it, instead of
the purportedly written data.

We considered a few options for fixing the bug (apologies for any
incorrect summary of a proposal which I didn't implement or fully
understand):
1. treat the partial bio as if we had truncated the file, which would
   result in properly finishing it.
2. split the ordered extent when submitting a partial bio.
3. cache the ordered extent across calls to __iomap_dio_rw in
   iter->private, so that we could reuse it and correctly apply several
   bios to it.

I had trouble with 1, and it felt the most like a hack, so I tried 2
and 3. Since 3 has the benefit of also not creating an extra file
extent, and avoids an ordered extent lookup during bio submission, it
felt like the best option. However, that turned out to re-introduce a
deadlock which this code discarding the ordered_extent between faults
was meant to fix in the first place. (Link to an explanation of the
deadlock below)

Therefore, go with fix #2, which requires a bit more setup work but
fixes the corruption without introducing the deadlock, which is
fundamentally caused by the ordered extent existing when we attempt to
fault in a range that overlaps with it.

Put succinctly, what this patch does is: when we submit a dio bio, check
if it is partial against the ordered extent stored in dio_data, and if
it is, extract the ordered_extent that matches the bio exactly out of
the larger ordered_extent. Keep the remaining ordered_extent around in
dio_data for cancellation in iomap_end.

Thanks to Josef, Christoph, and Filipe for their help figuring out the
bug and the fix.
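The effect of splitting on completion accounting can be sketched in
userspace C. This is a minimal illustration, not the btrfs
implementation: struct toy_ordered and the toy_* helpers are
hypothetical, and the sketch assumes the split happens at submit time,
before any IO has been accounted against the extent.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for an ordered extent; not the btrfs structure. */
struct toy_ordered {
	uint64_t file_offset;	/* start of the range in the file */
	uint64_t num_bytes;	/* length of the range */
	uint64_t bytes_left;	/* bytes whose IO has not completed yet */
};

/*
 * Account a finished bio against the extent. Returns true only when the
 * whole extent is covered, which is when the finish work would be queued.
 */
static bool toy_bio_done(struct toy_ordered *oe, uint64_t bio_len)
{
	assert(bio_len <= oe->bytes_left);
	oe->bytes_left -= bio_len;
	return oe->bytes_left == 0;
}

/*
 * If the bio is shorter than the extent, trim *oe to exactly the bio and
 * hand the uncovered tail back in *rest (pre == 0, post == tail length).
 * The caller would keep *rest around so it can be cancelled later.
 */
static int toy_split_for_bio(struct toy_ordered *oe, uint64_t bio_len,
			     struct toy_ordered *rest, bool *has_rest)
{
	*has_rest = false;
	if (bio_len == 0 || bio_len > oe->num_bytes)
		return -EINVAL;
	if (bio_len < oe->num_bytes) {
		rest->file_offset = oe->file_offset + bio_len;
		rest->num_bytes = oe->num_bytes - bio_len;
		rest->bytes_left = rest->num_bytes;
		oe->num_bytes = bio_len;
		oe->bytes_left = bio_len;
		*has_rest = true;
	}
	return 0;
}
```

Without the split, a 4096 byte bio against a 16384 byte extent leaves
bytes_left above zero, so the extent never completes through the endio
path. After the split, the same bio covers its trimmed extent exactly,
and the tail is a separate extent that can be cancelled on its own.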
Fixes: 51bd9563b678 ("btrfs: fix deadlock due to page faults during direct IO reads and writes")
Link: https://bugzilla.redhat.com/show_bug.cgi?id=2169947
Link: https://lore.kernel.org/linux-btrfs/aa1fb69e-b613-47aa-a99e-a0a2c9ed273f@app.fastmail.com/
Link: https://pastebin.com/3SDaH8C6
Link: https://lore.kernel.org/linux-btrfs/20230315195231.GW10580@twin.jikos.cz/T/#t
Signed-off-by: Boris Burkov
---
 fs/btrfs/inode.c | 21 ++++++++++++++++++++-
 1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index e30390051f15..08d132071bd3 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -7782,6 +7782,7 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 	struct btrfs_dio_private *dip =
 		container_of(bbio, struct btrfs_dio_private, bbio);
 	struct btrfs_dio_data *dio_data = iter->private;
+	int err = 0;
 
 	btrfs_bio_init(bbio, BTRFS_I(iter->inode), btrfs_dio_end_io, bio->bi_private);
 	bbio->file_offset = file_offset;
@@ -7790,7 +7791,25 @@ static void btrfs_dio_submit_io(const struct iomap_iter *iter, struct bio *bio,
 	dip->bytes = bio->bi_iter.bi_size;
 
 	dio_data->submitted += bio->bi_iter.bi_size;
-	btrfs_submit_bio(bbio, 0);
+	/*
+	 * Check if we are doing a partial write. If we are, we need to split
+	 * the ordered extent to match the submitted bio. Hang on to the
+	 * remaining unfinishable ordered_extent in dio_data so that it can be
+	 * cancelled in iomap_end to avoid a deadlock wherein faulting the
+	 * remaining pages is blocked on the outstanding ordered extent.
+	 */
+	if (iter->flags & IOMAP_WRITE) {
+		struct btrfs_ordered_extent *ordered = dio_data->ordered;
+
+		ASSERT(ordered);
+		if (bio->bi_iter.bi_size < ordered->num_bytes)
+			err = btrfs_extract_ordered_extent_bio(bbio, ordered, NULL,
+							       &dio_data->ordered);
+	}
+	if (err)
+		btrfs_bio_end_io(bbio, err);
+	else
+		btrfs_submit_bio(bbio, 0);
 }
 
 static const struct iomap_ops btrfs_dio_iomap_ops = {