From patchwork Mon Aug 22 10:51:30 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950479 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 2F26EC28D13 for ; Mon, 22 Aug 2022 10:51:54 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233660AbiHVKvv (ORCPT ); Mon, 22 Aug 2022 06:51:51 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43400 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233679AbiHVKvu (ORCPT ); Mon, 22 Aug 2022 06:51:50 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 0F4A72F659 for ; Mon, 22 Aug 2022 03:51:50 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id A76B860FDE for ; Mon, 22 Aug 2022 10:51:49 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 91F6CC433C1 for ; Mon, 22 Aug 2022 10:51:48 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165509; bh=j/ODHUZcA8OtCTa/SH4CvYiotd90a7Eerw3O0tw+FbI=; h=From:To:Subject:Date:In-Reply-To:References:From; b=cz2+dd8bzPxBh5PcKE8q4kPLq0rKSJ+hZFGKSATSxKJdp/yc4qj8AZNaSve/inFzw +W8hbOZY9iZyXYlWBFI5B3QM1Xemx5NR1EW7yXxskXuNazmps0pz1py7xCGrH3oTEB QeIM8BBkL/5luXjENCCvNXcuk0Oxt+4xGk0on7NvMaJzIXOt6E+9NdokBE0ATUZnH0 s5GJ6Pt5MiI0E60himN5TC+Or4sl5p8iQzF2D4RZGMUYM8Io145NjOuTpwtahxMy0Z cz015E6aP5fISXzvGr1RjdotIV+0R5JdWaepkcFkFFCUZ3T+qdLmPFrIa1UiYhwQUF rZaANIwOIKkoA== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 01/15] btrfs: don't drop dir index range items when logging a directory Date: Mon, 22 Aug 2022 11:51:30 +0100 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When logging a directory that was previously logged in the current transaction, we drop all the range items (BTRFS_DIR_LOG_INDEX_KEY key type). This is because we will process all leaves in the subvolume's tree that were changed in the current transaction and then add range items for covering new dir index items and deleted dir index items, which could cover now a larger range than before. We used to fail if we tried to insert a range item key that already exists, so we dropped all range items to avoid failing. However nowadays, since commit 750ee454908e90 ("btrfs: fix assertion failure when logging directory key range item"), we simply update any range item that already exists, increasing its range's last dir index if needed. Since the range covered by a range item can never decrease, due to the fact that dir index values come from a monotonically increasing counter and are never reused, we can stop dropping all range items before we start logging a directory. By not dropping the items we can avoid having occasional tree rebalance operations. This will also be needed for an incoming change where we start logging delayed items directly, without flushing them first. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 6 +----- 1 file changed, 1 insertion(+), 5 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 4d2d19fea112..cffd15e23614 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5716,14 +5716,10 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, * copies of everything. */ if (S_ISDIR(inode->vfs_inode.i_mode)) { - int max_key_type = BTRFS_DIR_LOG_INDEX_KEY; - clear_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags); - if (inode_only == LOG_INODE_EXISTS) - max_key_type = BTRFS_XATTR_ITEM_KEY; if (ctx->logged_before) ret = drop_inode_items(trans, log, path, inode, - max_key_type); + BTRFS_XATTR_ITEM_KEY); } else { if (inode_only == LOG_INODE_EXISTS && ctx->logged_before) { /* From patchwork Mon Aug 22 10:51:31 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950481 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5C432C32789 for ; Mon, 22 Aug 2022 10:51:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233774AbiHVKv5 (ORCPT ); Mon, 22 Aug 2022 06:51:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43426 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233747AbiHVKvx (ORCPT ); Mon, 22 Aug 2022 06:51:53 -0400 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 811C13121A for ; Mon, 22 Aug 2022 03:51:52 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id 3DF6CB81015 for ; Mon, 22 Aug 2022 10:51:51 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 8077DC433D7 for ; Mon, 22 Aug 2022 10:51:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165510; bh=xMMo1GFYJ7m6Lbnm9+ridGF3gHkC/kuAQUkyT8oNIWs=; h=From:To:Subject:Date:In-Reply-To:References:From; b=Yf8o6M6OjjnVv8apD7QGmjBlc83MsX4Nbp9ymYu8jaLjrkqeOmVvHEODrcqjbgTMl 2A7TAq78pdeTfWPxGBfkevZJsOip++TDlEw3hSgR3rNCqEne61YlSYy2EyiU9MCM39 DXPXw/lILv9kkCOTVaVJMgqwkFon/QUexC5DBJy1hIh8MAmrS7G9cko2z8I/vnmsbT +lEv1km/DX+pd87wE0iZeutgJVL4JjjfI4y8CHYrJP8jqQnwkiB5AYuIIeQQUAZ9hp gAsWwd93MiSp968I8sg9eaGUpZIqDAe18WFQjsJIDo2AhtDBy9Wt55rqdJRJgttPjX U/euWVV1ymIJg== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 02/15] btrfs: remove the root argument from log_new_dir_dentries() Date: Mon, 22 Aug 2022 11:51:31 +0100 Message-Id: <3696e99984ac2a21e57bfc2cfe75b8bf7c282b09.1661165149.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana There's no point in passing a root argument to log_new_dir_dentries() because it always corresponds to the root of the given inode. So remove it and extract the root from the given inode. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index cffd15e23614..56fbd3b9f642 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5969,10 +5969,10 @@ struct btrfs_dir_list { * do_overwrite_item()). */ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, - struct btrfs_root *root, struct btrfs_inode *start_inode, struct btrfs_log_ctx *ctx) { + struct btrfs_root *root = start_inode->root; struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_path *path; LIST_HEAD(dir_list); @@ -6199,7 +6199,7 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, ret = btrfs_log_inode(trans, BTRFS_I(dir_inode), LOG_INODE_ALL, ctx); if (!ret && ctx->log_new_dentries) - ret = log_new_dir_dentries(trans, root, + ret = log_new_dir_dentries(trans, BTRFS_I(dir_inode), ctx); btrfs_add_delayed_iput(dir_inode); if (ret) @@ -6514,7 +6514,7 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, goto end_trans; if (log_dentries) - ret = log_new_dir_dentries(trans, root, inode, ctx); + ret = log_new_dir_dentries(trans, inode, ctx); else ret = 0; end_trans: From patchwork Mon Aug 22 10:51:32 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950480 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 61CBCC28D13 for ; Mon, 22 Aug 2022 10:51:58 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233773AbiHVKv5 (ORCPT ); Mon, 22 Aug 2022 06:51:57 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43410 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233679AbiHVKvw (ORCPT ); Mon, 22 Aug 2022 06:51:52 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id EFD8C2F671 for ; Mon, 22 Aug 2022 03:51:51 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 8A1B960FDE for ; Mon, 22 Aug 2022 10:51:51 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 6ED73C433D6 for ; Mon, 22 Aug 2022 10:51:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165510; bh=mbx+WQp/WDRHPgNuv/DepEccXVahMV+YyevwAR9Pw6k=; h=From:To:Subject:Date:In-Reply-To:References:From; b=RFiYsQSnheOZRtbFui8FprLExFCyFcjFktFKIvKjgGAesEYkBewa9ly0vAvUcd7V0 hi/kPDdjAbOd7bzyb86VEdrKu+OK7ZU+s+4RNMByz1brVeZDQ//sB455lzZ21mZzqI m2G+zoAjD1CMMiwYhe6Gr3WtrOcdTTpLyI0Y9bL6Zyn+VSab0Kg/PxX9Nf47+SQAY4 8A1n9nOfq4/Mz8RUOu3b05hDqlaY8KrlIdbDh1y0khhD18kXTBNFB0n35H9iB/+UrG 6zJHYdeel/sbFgNpoDcGyQFdmEcrk3IcFr3zL/0b3AWjNZsNglTJEcgCEQEsqFap2/ Q14+OW054KNnQ== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 03/15] btrfs: update stale comment for log_new_dir_dentries() Date: Mon, 22 Aug 2022 11:51:32 +0100 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana The comment refers to the function log_dir_items() in order to check why the inodes of new directory entries need to be logged, but the relevant comments are no longer at log_dir_items(), they were moved to the function process_dir_items_leaf() in commit eb10d85ee77f09 ("btrfs: factor out the copying loop of dir items from log_dir_items()"). So update it with the current function name. Also remove references with i_mutex to "VFS lock", since the inode lock is no longer a mutex since 2016 (it's now a rw semaphore). Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 56fbd3b9f642..9625707bfa8a 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5933,12 +5933,12 @@ struct btrfs_dir_list { }; /* - * Log the inodes of the new dentries of a directory. See log_dir_items() for - * details about the why it is needed. + * Log the inodes of the new dentries of a directory. + * See process_dir_items_leaf() for details about why it is needed. * This is a recursive operation - if an existing dentry corresponds to a * directory, that directory's new entries are logged too (same behaviour as * ext3/4, xfs, f2fs, reiserfs, nilfs2). Note that when logging the inodes - * the dentries point to we do not lock their i_mutex, otherwise lockdep + * the dentries point to we do not acquire their VFS lock, otherwise lockdep * complains about the following circular lock dependency / possible deadlock: * * CPU0 CPU1 @@ -5950,7 +5950,7 @@ struct btrfs_dir_list { * * Where sb_internal is the lock (a counter that works as a lock) acquired by * sb_start_intwrite() in btrfs_start_transaction(). - * Not locking i_mutex of the inodes is still safe because: + * Not acquiring the VFS lock of the inodes is still safe because: * * 1) For regular files we log with a mode of LOG_INODE_EXISTS. It's possible * that while logging the inode new references (names) are added or removed From patchwork Mon Aug 22 10:51:33 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950482 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C785FC3F6B0 for ; Mon, 22 Aug 2022 10:51:59 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233794AbiHVKv6 (ORCPT ); Mon, 22 Aug 2022 06:51:58 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43428 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233748AbiHVKvx (ORCPT ); Mon, 22 Aug 2022 06:51:53 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id DA9013123B for ; Mon, 22 Aug 2022 03:51:52 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 70E5660FFF for ; Mon, 22 Aug 2022 10:51:52 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 5E7C3C433B5 for ; Mon, 22 Aug 2022 10:51:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165511; bh=6EcyOPKEZ8ByNUXXAzW34NNHJMETgjPq0nFeRVwdIbs=; h=From:To:Subject:Date:In-Reply-To:References:From; b=R/P2/OzA0Z9YF3f8MTBTUJZtGhplxczKCDHxEvMeGG5i/p9Vgzc4vB9oS04PzA8QL mR5iXUZwKnVo4NIJ3+Nt7b6CFXsn/L1ocjwM0Ahpgd2ZVhFloI2Td3MMQqBJf3AGvm iJ5XGWivTz5lsWCWK8Ia9Z6dkR8aYKcqQqpLREsKJrBjWdw7hOkgZCjtjNShQizxFD tDHZr6I/ZPtCwo60OOdMaUV8tRnB4dKjDl2ju9Plrd3ca05CXEaActZv3wnX5xZ2EN D8vQgKTebWi3xqjiU9h+iGURin3qK4/uIwC9CO/rUbpJnxBMKfzwuBk9gCd85Tf0Dd KZ/CwJDSpqvIg== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 04/15] btrfs: free list element sooner at log_new_dir_dentries() Date: Mon, 22 Aug 2022 11:51:33 +0100 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana At log_new_dir_dentries(), there's no need to keep the current list element allocated while processing the leaves with directory items for the current directory, and while logging other inodes. Plus in case we find a subdirectory, we also end up allocating a new list element while the current one is still allocated, temporarily using more memory than necessary. So free the current list element early on, before processing leaves. Also make the removal and release of all list elements in case of an error more simple by eliminating the label and goto, adding an explicit loop to release all list elements in case an error happens. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 52 ++++++++++++++++++++++++++------------------- 1 file changed, 30 insertions(+), 22 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 9625707bfa8a..bd50509e9839 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -6002,25 +6002,28 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, while (!list_empty(&dir_list)) { struct extent_buffer *leaf; struct btrfs_key min_key; + u64 ino; + bool continue_curr_inode = true; int nritems; int i; dir_elem = list_first_entry(&dir_list, struct btrfs_dir_list, list); - if (ret) - goto next_dir_inode; + ino = dir_elem->ino; + list_del(&dir_elem->list); + kfree(dir_elem); - min_key.objectid = dir_elem->ino; + min_key.objectid = ino; min_key.type = BTRFS_DIR_INDEX_KEY; min_key.offset = 0; again: btrfs_release_path(path); ret = btrfs_search_forward(root, &min_key, path, trans->transid); if (ret < 0) { - goto next_dir_inode; + break; } else if (ret > 0) { ret = 0; - goto next_dir_inode; + continue; } leaf = path->nodes[0]; @@ -6029,14 +6032,15 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, struct btrfs_dir_item *di; struct btrfs_key di_key; struct inode *di_inode; - struct btrfs_dir_list *new_dir_elem; int log_mode = LOG_INODE_EXISTS; int type; btrfs_item_key_to_cpu(leaf, &min_key, i); - if (min_key.objectid != dir_elem->ino || - min_key.type != BTRFS_DIR_INDEX_KEY) - goto next_dir_inode; + if (min_key.objectid != ino || + min_key.type != BTRFS_DIR_INDEX_KEY) { + continue_curr_inode = false; + break; + } di = btrfs_item_ptr(leaf, i, struct btrfs_dir_item); type = btrfs_dir_type(leaf, di); @@ -6050,7 +6054,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, di_inode = btrfs_iget(fs_info->sb, di_key.objectid, root); if (IS_ERR(di_inode)) { ret = PTR_ERR(di_inode); - goto next_dir_inode; + goto out; } if (!need_log_inode(trans, BTRFS_I(di_inode))) { @@ -6065,29 +6069,33 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, log_mode, ctx); btrfs_add_delayed_iput(di_inode); if (ret) - goto next_dir_inode; + goto out; if (ctx->log_new_dentries) { - new_dir_elem = kmalloc(sizeof(*new_dir_elem), - GFP_NOFS); - if (!new_dir_elem) { + dir_elem = kmalloc(sizeof(*dir_elem), GFP_NOFS); + if (!dir_elem) { ret = -ENOMEM; - goto next_dir_inode; + goto out; } - new_dir_elem->ino = di_key.objectid; - list_add_tail(&new_dir_elem->list, &dir_list); + dir_elem->ino = di_key.objectid; + list_add_tail(&dir_elem->list, &dir_list); } break; } - if (min_key.offset < (u64)-1) { + + if (continue_curr_inode && min_key.offset < (u64)-1) { min_key.offset++; goto again; } -next_dir_inode: - list_del(&dir_elem->list); - kfree(dir_elem); } - +out: btrfs_free_path(path); + if (ret) { + struct btrfs_dir_list *next; + + list_for_each_entry_safe(dir_elem, next, &dir_list, list) + kfree(dir_elem); + } + return ret; } From patchwork Mon Aug 22 10:51:34 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950483 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A334AC28D13 for ; Mon, 22 Aug 2022 10:52:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233807AbiHVKv7 (ORCPT ); Mon, 22 Aug 2022 06:51:59 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43440 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233753AbiHVKvy (ORCPT ); Mon, 22 Aug 2022 06:51:54 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id C972B3123C for ; Mon, 22 Aug 2022 03:51:53 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 6458961000 for ; Mon, 22 Aug 2022 10:51:53 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 4DC96C433C1 for ; Mon, 22 Aug 2022 10:51:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165512; bh=/1WURF2Y9G9fQ69S2W9mZ3IrBKFe0uDBOg3UOwjD/2A=; h=From:To:Subject:Date:In-Reply-To:References:From; b=a5SIFXq2E9OpqG/cNRf2wN5eScPzGnfvVddcwLkVHfqwzkTv0ytc4qpvrsFVsfswl jfGMK3w+0pw1iEs4ORAuviaX/xDj7HfANNEBlbzWsR/fcmNcXD9TWjHfmgV30bZcuJ Om/Vv20+tdK2CuApJx5+9ZoJPFHsGC24c1YLzxZOtMHeQjGAgUJZ3teSqk2agLKAxs VDePHhNjFPzatuFgJv+Z2nY30R6tp6gtus8ZefLwwTVRjNIMLybwRTf/SlhkHAwYoZ Dwuh2HLlKlsHXl4leciUVgFOXy6AiETZNUyje0nAki1ezY9Dt2gfgk5V3LkMU9tC13 KPGlcjdnxV6Rw== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 05/15] btrfs: avoid memory allocation at log_new_dir_dentries() for common case Date: Mon, 22 Aug 2022 11:51:34 +0100 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana At log_new_dir_dentries() we always start by allocating a list element for the starting inode and then do a while loop with the condition being a list emptiness check. This however is not needed, we can avoid allocating this initial list element and then just check for the list emptiness at the end of the loop's body. So just do that to save one memory allocation from the kmalloc-32 slab. This allows for not doing any memory allocation when we don't have any subdirectory to log, which is a very common case. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 29 ++++++++++++----------------- 1 file changed, 12 insertions(+), 17 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index bd50509e9839..94026098bb68 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5977,6 +5977,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, struct btrfs_path *path; LIST_HEAD(dir_list); struct btrfs_dir_list *dir_elem; + u64 ino = btrfs_ino(start_inode); int ret = 0; /* @@ -5991,28 +5992,13 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, if (!path) return -ENOMEM; - dir_elem = kmalloc(sizeof(*dir_elem), GFP_NOFS); - if (!dir_elem) { - btrfs_free_path(path); - return -ENOMEM; - } - dir_elem->ino = btrfs_ino(start_inode); - list_add_tail(&dir_elem->list, &dir_list); - - while (!list_empty(&dir_list)) { + while (true) { struct extent_buffer *leaf; struct btrfs_key min_key; - u64 ino; bool continue_curr_inode = true; int nritems; int i; - dir_elem = list_first_entry(&dir_list, struct btrfs_dir_list, - list); - ino = dir_elem->ino; - list_del(&dir_elem->list); - kfree(dir_elem); - min_key.objectid = ino; min_key.type = BTRFS_DIR_INDEX_KEY; min_key.offset = 0; @@ -6023,7 +6009,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, break; } else if (ret > 0) { ret = 0; - continue; + goto next; } leaf = path->nodes[0]; @@ -6086,6 +6072,15 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, min_key.offset++; goto again; } + +next: + if (list_empty(&dir_list)) + break; + + dir_elem = list_first_entry(&dir_list, struct btrfs_dir_list, list); + ino = dir_elem->ino; + list_del(&dir_elem->list); + kfree(dir_elem); } out: btrfs_free_path(path); From patchwork Mon Aug 22 10:51:35 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950484 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id AFB85C32772 for ; Mon, 22 Aug 2022 10:52:01 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233819AbiHVKwA (ORCPT ); Mon, 22 Aug 2022 06:52:00 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43448 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233758AbiHVKvz (ORCPT ); Mon, 22 Aug 2022 06:51:55 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id BA9042F659 for ; Mon, 22 Aug 2022 03:51:54 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 535E160FFD for ; Mon, 22 Aug 2022 10:51:54 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 3CAC8C433D7 for ; Mon, 22 Aug 2022 10:51:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165513; bh=9ls91ugzqt+2lsQ6hmYR/x0CTNdK0bDW2ELEIvW/PD8=; h=From:To:Subject:Date:In-Reply-To:References:From; b=K03LMJf32wK3rKU5UWXHTlPLoExLQMskdW0NxOGQzBN/iXjOrGKKGjTqjOOFX0fJm FVEzfdVitCOFfU855jq/ShASDWYZV6aMHTz3T5AeZNMGcOwGKreM7JLyjhR/CD7W6T smwd+ssP/CwIe4nTZ7xMcS8kBrUkr4LMpwAl8f/cbjGBgeZX+agnXaQV0j57wJdavf W9SicCSpUtBHGuU4oKmD01x8Z1Obk6aWy7huQDpJsihAh4KkRchv2HhPc1iAggjBm4 lsyZT+iGFugAUbsu2+s3B+6YApxtfN2ulc+k3lomY+KCihtyVWIo9QBMzUDNyEmSZa rWI1hvns/mX9A== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 06/15] btrfs: remove root argument from btrfs_delayed_item_reserve_metadata() Date: Mon, 22 Aug 2022 11:51:35 +0100 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana The root argument of btrfs_delayed_item_reserve_metadata() is used only to get the fs_info object, but we already have a transaction handle, which we can use to get the fs_info. So remove the root argument. Signed-off-by: Filipe Manana --- fs/btrfs/delayed-inode.c | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index e7f34871a132..a080e08bbb4d 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -520,12 +520,11 @@ static struct btrfs_delayed_item *__btrfs_next_delayed_item( } static int btrfs_delayed_item_reserve_metadata(struct btrfs_trans_handle *trans, - struct btrfs_root *root, struct btrfs_delayed_item *item) { struct btrfs_block_rsv *src_rsv; struct btrfs_block_rsv *dst_rsv; - struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_fs_info *fs_info = trans->fs_info; u64 num_bytes; int ret; @@ -1490,8 +1489,7 @@ int btrfs_insert_delayed_dir_index(struct btrfs_trans_handle *trans, } if (reserve_leaf_space) { - ret = btrfs_delayed_item_reserve_metadata(trans, dir->root, - delayed_item); + ret = btrfs_delayed_item_reserve_metadata(trans, delayed_item); /* * Space was reserved for a dir index item insertion when we * started the transaction, so getting a failure here should be @@ -1614,7 +1612,7 @@ int btrfs_delete_delayed_dir_index(struct btrfs_trans_handle *trans, item->key = item_key; item->ins_or_del = BTRFS_DELAYED_DELETION_ITEM; - ret = btrfs_delayed_item_reserve_metadata(trans, dir->root, item); + ret = btrfs_delayed_item_reserve_metadata(trans, item); /* * we have reserved enough space when we start a new transaction, * so reserving metadata failure is impossible. From patchwork Mon Aug 22 10:51:36 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950486 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 98690C32772 for ; Mon, 22 Aug 2022 10:52:03 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233747AbiHVKwC (ORCPT ); Mon, 22 Aug 2022 06:52:02 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43462 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233778AbiHVKv6 (ORCPT ); Mon, 22 Aug 2022 06:51:58 -0400 Received: from ams.source.kernel.org (ams.source.kernel.org [IPv6:2604:1380:4601:e00::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3737A2F671 for ; Mon, 22 Aug 2022 03:51:57 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id D500FB81015 for ; Mon, 22 Aug 2022 10:51:55 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 2C202C433D6 for ; Mon, 22 Aug 2022 10:51:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165514; bh=jaLcy/FHpbfW0y3+KV5OJn6l5mbHgWi88hJPW1MMJso=; h=From:To:Subject:Date:In-Reply-To:References:From; b=XnZJ5HtYsaTqkkG1GtPJbmik/zIE+ikxKfD0m96bsh8RXXLrsj6V5TfM0GbNittu+ H5gJpB/HmT0+GUvTqbczQ3/2K9BSnK+jEIOOZV0/XVsWWZyv13otXrYtMD7yOhshwl B2u6fbP175TW/pVREHIG4+/cjj6PpdCBnLSxydZW4O4c0W8piQBR4VO9S0PO7d4rMB QPNiGuW654Y4+lzlB0+EyMndKpNb2JabIe7afsGeXlF9ngn8SaYziw+xsD0mAc+uOW CxERmyPaBvZUYpbUSyQEDU87RmAu3ylP2u2/6LLCMyxN/WfZT99tJpkHNmjXXp9qTr jykQs3ggXMggw== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 07/15] btrfs: store index number instead of key in struct btrfs_delayed_item Date: Mon, 22 Aug 2022 11:51:36 +0100 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana All delayed items are for dir index keys, so there's really no point of having an embedded struct btrfs_key in struct btrfs_delayed_item, which makes the structure use more space than necessary (and adds a hole of 7 bytes). So replace the key field with an index number (u64), which reduces the size of struct btrfs_delayed_item from 112 bytes down to 96 bytes. Some upcoming work will increase the structure size by 16 bytes, so this change compensates for that future size increase. Signed-off-by: Filipe Manana --- fs/btrfs/delayed-inode.c | 106 ++++++++++++++++++++------------------- fs/btrfs/delayed-inode.h | 3 +- 2 files changed, 56 insertions(+), 53 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index a080e08bbb4d..cd2f3a8c4dfd 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -302,7 +302,8 @@ static inline void btrfs_release_prepared_delayed_node( __btrfs_release_delayed_node(node, 1); } -static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u32 data_len) +static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u32 data_len, + struct btrfs_delayed_node *node) { struct btrfs_delayed_item *item; item = kmalloc(sizeof(*item) + data_len, GFP_NOFS); @@ -310,7 +311,8 @@ static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u32 data_len) item->data_len = data_len; item->ins_or_del = 0; item->bytes_reserved = 0; - item->delayed_node = NULL; + item->delayed_node = node; + RB_CLEAR_NODE(&item->rb_node); refcount_set(&item->refs, 1); } return item; @@ -319,7 +321,7 @@ static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u32 data_len) /* * __btrfs_lookup_delayed_item - look up the delayed item by key * @delayed_node: pointer to the delayed node - * @key: the key to look up + * @index: the dir index value to lookup (offset of a dir index key) * @prev: used to store the prev item if the right item isn't found * @next: used to store the next item if the right item isn't found * @@ -328,7 +330,7 @@ static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u32 data_len) */ static struct btrfs_delayed_item *__btrfs_lookup_delayed_item( struct rb_root *root, - struct btrfs_key *key, + u64 index, struct btrfs_delayed_item **prev, struct btrfs_delayed_item **next) { @@ -342,10 +344,9 @@ static struct btrfs_delayed_item *__btrfs_lookup_delayed_item( delayed_item = rb_entry(node, struct btrfs_delayed_item, rb_node); prev_node = node; - ret = btrfs_comp_cpu_keys(&delayed_item->key, key); - if (ret < 0) + if (delayed_item->index < index) node = node->rb_right; - else if (ret > 0) + else if (delayed_item->index > index) node = node->rb_left; else return delayed_item; @@ -379,9 +380,9 @@ static struct btrfs_delayed_item *__btrfs_lookup_delayed_item( static struct btrfs_delayed_item *__btrfs_lookup_delayed_insertion_item( struct btrfs_delayed_node *delayed_node, - struct btrfs_key *key) + u64 index) { - return __btrfs_lookup_delayed_item(&delayed_node->ins_root.rb_root, key, + return __btrfs_lookup_delayed_item(&delayed_node->ins_root.rb_root, index, NULL, NULL); } @@ -392,7 +393,6 @@ static int __btrfs_add_delayed_item(struct btrfs_delayed_node *delayed_node, struct rb_node *parent_node = NULL; struct rb_root_cached *root; struct btrfs_delayed_item *item; - int cmp; bool leftmost = true; if (ins->ins_or_del == BTRFS_DELAYED_INSERTION_ITEM) @@ -409,11 +409,10 @@ static int __btrfs_add_delayed_item(struct btrfs_delayed_node *delayed_node, item = rb_entry(parent_node, struct btrfs_delayed_item, rb_node); - cmp = btrfs_comp_cpu_keys(&item->key, &ins->key); - if (cmp < 0) { + if (item->index < ins->index) { p = &(*p)->rb_right; leftmost = false; - } else if (cmp > 0) { + } else if (item->index > ins->index) { p = &(*p)->rb_left; } else { return -EEXIST; @@ -422,14 +421,10 @@ static int __btrfs_add_delayed_item(struct btrfs_delayed_node *delayed_node, rb_link_node(node, parent_node, p); rb_insert_color_cached(node, root, leftmost); - ins->delayed_node = delayed_node; - - /* Delayed items are always for dir index items. */ - ASSERT(ins->key.type == BTRFS_DIR_INDEX_KEY); if (ins->ins_or_del == BTRFS_DELAYED_INSERTION_ITEM && - ins->key.offset >= delayed_node->index_cnt) - delayed_node->index_cnt = ins->key.offset + 1; + ins->index >= delayed_node->index_cnt) + delayed_node->index_cnt = ins->index + 1; delayed_node->count++; atomic_inc(&delayed_node->root->fs_info->delayed_root->items); @@ -451,9 +446,10 @@ static void __btrfs_remove_delayed_item(struct btrfs_delayed_item *delayed_item) struct rb_root_cached *root; struct btrfs_delayed_root *delayed_root; - /* Not associated with any delayed_node */ - if (!delayed_item->delayed_node) + /* Not inserted, ignore it. */ + if (RB_EMPTY_NODE(&delayed_item->rb_node)) return; + delayed_root = delayed_item->delayed_node->root->fs_info->delayed_root; BUG_ON(!delayed_root); @@ -466,6 +462,7 @@ static void __btrfs_remove_delayed_item(struct btrfs_delayed_item *delayed_item) root = &delayed_item->delayed_node->del_root; rb_erase_cached(&delayed_item->rb_node, root); + RB_CLEAR_NODE(&delayed_item->rb_node); delayed_item->delayed_node->count--; finish_one_item(delayed_root); @@ -544,7 +541,7 @@ static int btrfs_delayed_item_reserve_metadata(struct btrfs_trans_handle *trans, ret = btrfs_block_rsv_migrate(src_rsv, dst_rsv, num_bytes, true); if (!ret) { trace_btrfs_space_reservation(fs_info, "delayed_item", - item->key.objectid, + item->delayed_node->inode_id, num_bytes, 1); /* * For insertions we track reserved metadata space by accounting @@ -573,8 +570,8 @@ static void btrfs_delayed_item_release_metadata(struct btrfs_root *root, * to release/reserve qgroup space. */ trace_btrfs_space_reservation(fs_info, "delayed_item", - item->key.objectid, item->bytes_reserved, - 0); + item->delayed_node->inode_id, + item->bytes_reserved, 0); btrfs_block_rsv_release(fs_info, rsv, item->bytes_reserved, NULL); } @@ -687,6 +684,7 @@ static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans, struct btrfs_delayed_item *next; const int max_size = BTRFS_LEAF_DATA_SIZE(fs_info); struct btrfs_item_batch batch; + struct btrfs_key first_key; int total_size; char *ins_data = NULL; int ret; @@ -731,8 +729,7 @@ static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans, * We cannot allow gaps in the key space if we're doing log * replay. */ - if (continuous_keys_only && - (next->key.offset != curr->key.offset + 1)) + if (continuous_keys_only && (next->index != curr->index + 1)) break; ASSERT(next->bytes_reserved == 0); @@ -749,7 +746,10 @@ static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans, } if (batch.nr == 1) { - batch.keys = &first_item->key; + first_key.objectid = node->inode_id; + first_key.type = BTRFS_DIR_INDEX_KEY; + first_key.offset = first_item->index; + batch.keys = &first_key; batch.data_sizes = &first_item->data_len; } else { struct btrfs_key *ins_keys; @@ -767,7 +767,9 @@ static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans, batch.keys = ins_keys; batch.data_sizes = ins_sizes; list_for_each_entry(curr, &item_list, tree_list) { - ins_keys[i] = curr->key; + ins_keys[i].objectid = node->inode_id; + ins_keys[i].type = BTRFS_DIR_INDEX_KEY; + ins_keys[i].offset = curr->index; ins_sizes[i] = curr->data_len; i++; } @@ -863,6 +865,7 @@ static int btrfs_batch_delete_items(struct btrfs_trans_handle *trans, struct btrfs_path *path, struct btrfs_delayed_item *item) { + const u64 ino = item->delayed_node->inode_id; struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_delayed_item *curr, *next; struct extent_buffer *leaf = path->nodes[0]; @@ -901,7 +904,9 @@ static int btrfs_batch_delete_items(struct btrfs_trans_handle *trans, slot++; btrfs_item_key_to_cpu(leaf, &key, slot); - if (btrfs_comp_cpu_keys(&next->key, &key) != 0) + if (key.objectid != ino || + key.type != BTRFS_DIR_INDEX_KEY || + key.offset != next->index) break; nitems++; curr = next; @@ -919,9 +924,8 @@ static int btrfs_batch_delete_items(struct btrfs_trans_handle *trans, * Check btrfs_delayed_item_reserve_metadata() to see why we * don't need to release/reserve qgroup space. */ - trace_btrfs_space_reservation(fs_info, "delayed_item", - item->key.objectid, total_reserved_size, - 0); + trace_btrfs_space_reservation(fs_info, "delayed_item", ino, + total_reserved_size, 0); btrfs_block_rsv_release(fs_info, &fs_info->delayed_block_rsv, total_reserved_size, NULL); } @@ -939,8 +943,12 @@ static int btrfs_delete_delayed_items(struct btrfs_trans_handle *trans, struct btrfs_root *root, struct btrfs_delayed_node *node) { + struct btrfs_key key; int ret = 0; + key.objectid = node->inode_id; + key.type = BTRFS_DIR_INDEX_KEY; + while (ret == 0) { struct btrfs_delayed_item *item; @@ -951,7 +959,8 @@ static int btrfs_delete_delayed_items(struct btrfs_trans_handle *trans, break; } - ret = btrfs_search_slot(trans, root, &item->key, path, -1, 1); + key.offset = item->index; + ret = btrfs_search_slot(trans, root, &key, path, -1, 1); if (ret > 0) { /* * There's no matching item in the leaf. This means we @@ -1456,15 +1465,14 @@ int btrfs_insert_delayed_dir_index(struct btrfs_trans_handle *trans, if (IS_ERR(delayed_node)) return PTR_ERR(delayed_node); - delayed_item = btrfs_alloc_delayed_item(sizeof(*dir_item) + name_len); + delayed_item = btrfs_alloc_delayed_item(sizeof(*dir_item) + name_len, + delayed_node); if (!delayed_item) { ret = -ENOMEM; goto release_node; } - delayed_item->key.objectid = btrfs_ino(dir); - delayed_item->key.type = BTRFS_DIR_INDEX_KEY; - delayed_item->key.offset = index; + delayed_item->index = index; delayed_item->ins_or_del = BTRFS_DELAYED_INSERTION_ITEM; dir_item = (struct btrfs_dir_item *)delayed_item->data; @@ -1536,12 +1544,12 @@ int btrfs_insert_delayed_dir_index(struct btrfs_trans_handle *trans, static int btrfs_delete_delayed_insertion_item(struct btrfs_fs_info *fs_info, struct btrfs_delayed_node *node, - struct btrfs_key *key) + u64 index) { struct btrfs_delayed_item *item; mutex_lock(&node->mutex); - item = __btrfs_lookup_delayed_insertion_item(node, key); + item = __btrfs_lookup_delayed_insertion_item(node, index); if (!item) { mutex_unlock(&node->mutex); return 1; @@ -1587,29 +1595,23 @@ int btrfs_delete_delayed_dir_index(struct btrfs_trans_handle *trans, { struct btrfs_delayed_node *node; struct btrfs_delayed_item *item; - struct btrfs_key item_key; int ret; node = btrfs_get_or_create_delayed_node(dir); if (IS_ERR(node)) return PTR_ERR(node); - item_key.objectid = btrfs_ino(dir); - item_key.type = BTRFS_DIR_INDEX_KEY; - item_key.offset = index; - - ret = btrfs_delete_delayed_insertion_item(trans->fs_info, node, - &item_key); + ret = btrfs_delete_delayed_insertion_item(trans->fs_info, node, index); if (!ret) goto end; - item = btrfs_alloc_delayed_item(0); + item = btrfs_alloc_delayed_item(0, node); if (!item) { ret = -ENOMEM; goto end; } - item->key = item_key; + item->index = index; item->ins_or_del = BTRFS_DELAYED_DELETION_ITEM; ret = btrfs_delayed_item_reserve_metadata(trans, item); @@ -1741,9 +1743,9 @@ int btrfs_should_delete_dir_index(struct list_head *del_list, int ret = 0; list_for_each_entry(curr, del_list, readdir_list) { - if (curr->key.offset > index) + if (curr->index > index) break; - if (curr->key.offset == index) { + if (curr->index == index) { ret = 1; break; } @@ -1777,13 +1779,13 @@ int btrfs_readdir_delayed_dir_index(struct dir_context *ctx, list_for_each_entry_safe(curr, next, ins_list, readdir_list) { list_del(&curr->readdir_list); - if (curr->key.offset < ctx->pos) { + if (curr->index < ctx->pos) { if (refcount_dec_and_test(&curr->refs)) kfree(curr); continue; } - ctx->pos = curr->key.offset; + ctx->pos = curr->index; di = (struct btrfs_dir_item *)curr->data; name = (char *)(di + 1); diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h index 9795dc295a18..fd6fe785f748 100644 --- a/fs/btrfs/delayed-inode.h +++ b/fs/btrfs/delayed-inode.h @@ -73,7 +73,8 @@ struct btrfs_delayed_node { struct btrfs_delayed_item { struct rb_node rb_node; - struct btrfs_key key; + /* Offset value of the corresponding dir index key. */ + u64 index; struct list_head tree_list; /* used for batch insert/delete items */ struct list_head readdir_list; /* used for readdir items */ u64 bytes_reserved; From patchwork Mon Aug 22 10:51:37 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950487 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id B1273C32789 for ; Mon, 22 Aug 2022 10:52:04 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233867AbiHVKwD (ORCPT ); Mon, 22 Aug 2022 06:52:03 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43476 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233748AbiHVKv7 (ORCPT ); Mon, 22 Aug 2022 06:51:59 -0400 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 561C43121A for ; Mon, 22 Aug 2022 03:51:58 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id C83FFB81032 for ; Mon, 22 Aug 2022 10:51:56 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 1BE17C433B5 for ; Mon, 22 Aug 2022 10:51:54 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165515; bh=pNgAAd3xi+WHdWX50dK+xFFzmSMeosw8pDhhZlLwE7k=; h=From:To:Subject:Date:In-Reply-To:References:From; b=OR1lDeW58A/sVmFcfKy24hYTy+u9wC2MA2NwzH5ZvhtfDPPvOOEr/k8uNuL397HOv S/mgmT9LLiJo6YmJKs05cYwgjsABqhvb4W9FZ8jF7l3pxw+3Ww+IzCD+Mh3aAIT1gt G9EfACkg92gkqPouLHNG1xYS1L2UwLe+kbCe6Y4aM0LA/0PnduvoBrkCl9/856JSIv XCgmsnbLC8H8y0aEW+UbvSBC5H00K68IamXdWS0KeiIycZ1m2GGNisI3AzJ7CNe1Uj 51csgdSua/G9NFtXLK+0HAJN0Nuwth77JIn9VGdDXbKr190KRyr7jqgLQFr4NGfVwR TCOwBzt6++J+Q== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 08/15] btrfs: remove unused logic when looking up delayed items Date: Mon, 22 Aug 2022 11:51:37 +0100 Message-Id: <870a360c38df7f4331bb583545efae4c9f756b3a.1661165149.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana All callers pass NULL to the 'prev' and 'next' arguments of the function __btrfs_lookup_delayed_item(), so remove these arguments. Also, remove the unnecessary wrapper __btrfs_lookup_delayed_insertion_item(), making btrfs_delete_delayed_insertion_item() directly call __btrfs_lookup_delayed_item(). Signed-off-by: Filipe Manana --- fs/btrfs/delayed-inode.c | 45 +++------------------------------------- 1 file changed, 3 insertions(+), 42 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index cd2f3a8c4dfd..a8947ac00681 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -322,28 +322,20 @@ static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u32 data_len, * __btrfs_lookup_delayed_item - look up the delayed item by key * @delayed_node: pointer to the delayed node * @index: the dir index value to lookup (offset of a dir index key) - * @prev: used to store the prev item if the right item isn't found - * @next: used to store the next item if the right item isn't found * * Note: if we don't find the right item, we will return the prev item and * the next item. */ static struct btrfs_delayed_item *__btrfs_lookup_delayed_item( struct rb_root *root, - u64 index, - struct btrfs_delayed_item **prev, - struct btrfs_delayed_item **next) + u64 index) { - struct rb_node *node, *prev_node = NULL; + struct rb_node *node = root->rb_node; struct btrfs_delayed_item *delayed_item = NULL; - int ret = 0; - - node = root->rb_node; while (node) { delayed_item = rb_entry(node, struct btrfs_delayed_item, rb_node); - prev_node = node; if (delayed_item->index < index) node = node->rb_right; else if (delayed_item->index > index) @@ -352,40 +344,9 @@ static struct btrfs_delayed_item *__btrfs_lookup_delayed_item( return delayed_item; } - if (prev) { - if (!prev_node) - *prev = NULL; - else if (ret < 0) - *prev = delayed_item; - else if ((node = rb_prev(prev_node)) != NULL) { - *prev = rb_entry(node, struct btrfs_delayed_item, - rb_node); - } else - *prev = NULL; - } - - if (next) { - if (!prev_node) - *next = NULL; - else if (ret > 0) - *next = delayed_item; - else if ((node = rb_next(prev_node)) != NULL) { - *next = rb_entry(node, struct btrfs_delayed_item, - rb_node); - } else - *next = NULL; - } return NULL; } -static struct btrfs_delayed_item *__btrfs_lookup_delayed_insertion_item( - struct btrfs_delayed_node *delayed_node, - u64 index) -{ - return __btrfs_lookup_delayed_item(&delayed_node->ins_root.rb_root, index, - NULL, NULL); -} - static int __btrfs_add_delayed_item(struct btrfs_delayed_node *delayed_node, struct btrfs_delayed_item *ins) { @@ -1549,7 +1510,7 @@ static int btrfs_delete_delayed_insertion_item(struct btrfs_fs_info *fs_info, struct btrfs_delayed_item *item; mutex_lock(&node->mutex); - item = __btrfs_lookup_delayed_insertion_item(node, index); + item = __btrfs_lookup_delayed_item(&node->ins_root.rb_root, index); if (!item) { mutex_unlock(&node->mutex); return 1; From patchwork Mon Aug 22 10:51:38 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950485 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A103AC28D13 for ; Mon, 22 Aug 2022 10:52:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233833AbiHVKwB (ORCPT ); Mon, 22 Aug 2022 06:52:01 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43464 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233747AbiHVKv6 (ORCPT ); Mon, 22 Aug 2022 06:51:58 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 95E1E2F659 for ; Mon, 22 Aug 2022 03:51:57 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 2314F60FFD for ; Mon, 22 Aug 2022 10:51:57 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id 0CD2DC4347C for ; Mon, 22 Aug 2022 10:51:55 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165516; bh=XAjRUc5ZNSsZzCrt6EIX3dg5fzXqfiHqjN3lYLoAecU=; h=From:To:Subject:Date:In-Reply-To:References:From; b=aoSSwcFfgcQllVvhEYjfFjpOjU+h/aQ7b0/RsMM+x2ai4RoAtaZuOnaiiD4UpdQlZ jNMwQMUa/L5OpP8kdz1gPR2zSVviNvI/hPKeVyE25Q+j19EaSH648PjBgjLYhsS+d/ FTF3xX/JCjBxu2TqS8Fzwes+9wrJYQWEcxGU4afd8Hkyo7YsSBoKchYVrNqMe4I8Jr HARLchTV2DnIJ/ftIFmSDlLYQpx3adr7b7dBqoKfy3SkPhkUY6mS4sz7CLF0HImJIP Jwiu60OMHbiqtKfS9vjlxqynVS14JAy9pPOjKfmkUz4BdTdUgMGt7Wa6fMpHrqnD9L xG+l8IWE93yUg== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 09/15] btrfs: shrink the size of struct btrfs_delayed_item Date: Mon, 22 Aug 2022 11:51:38 +0100 Message-Id: <705c6c01297eab4523e7ecafafc396b7443e5938.1661165149.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana Currently struct btrfs_delayed_item has a base size of 96 bytes, but its size can be decreased by doing the following 2 tweaks: 1) Change data_len from u32 to u16. Our maximum possible leaf size is 64K, so the data_len can never be larger than that, and in fact it is always much smaller than that. The max length for a dentry's name is ensured at the VFS level (PATH_MAX, 4096 bytes) and in struct btrfs_inode_ref and btrfs_dir_item we use a u16 to store the name's length; 2) Change 'ins_or_del' to a 1 bit enum, which is all we need since it can only have 2 values. After this there's also no longer the need to BUG_ON() before using 'ins_or_del' in several places. Also rename the field from 'ins_or_del' to 'type', which is more clear. These two tweaks decrease the size of struct btrfs_delayed_item from 96 bytes down to 88 bytes. A previous patch already reduced the size of this structure by 16 bytes, but an upcoming change will increase its size by 16 bytes (adding a struct list_head element). Signed-off-by: Filipe Manana --- fs/btrfs/delayed-inode.c | 37 ++++++++++++++++++------------------- fs/btrfs/delayed-inode.h | 12 +++++++----- 2 files changed, 25 insertions(+), 24 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index a8947ac00681..b35eddcac846 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -302,14 +302,16 @@ static inline void btrfs_release_prepared_delayed_node( __btrfs_release_delayed_node(node, 1); } -static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u32 data_len, - struct btrfs_delayed_node *node) +static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u16 data_len, + struct btrfs_delayed_node *node, + enum btrfs_delayed_item_type type) { struct btrfs_delayed_item *item; + item = kmalloc(sizeof(*item) + data_len, GFP_NOFS); if (item) { item->data_len = data_len; - item->ins_or_del = 0; + item->type = type; item->bytes_reserved = 0; item->delayed_node = node; RB_CLEAR_NODE(&item->rb_node); @@ -356,12 +358,11 @@ static int __btrfs_add_delayed_item(struct btrfs_delayed_node *delayed_node, struct btrfs_delayed_item *item; bool leftmost = true; - if (ins->ins_or_del == BTRFS_DELAYED_INSERTION_ITEM) + if (ins->type == BTRFS_DELAYED_INSERTION_ITEM) root = &delayed_node->ins_root; - else if (ins->ins_or_del == BTRFS_DELAYED_DELETION_ITEM) - root = &delayed_node->del_root; else - BUG(); + root = &delayed_node->del_root; + p = &root->rb_root.rb_node; node = &ins->rb_node; @@ -383,7 +384,7 @@ static int __btrfs_add_delayed_item(struct btrfs_delayed_node *delayed_node, rb_link_node(node, parent_node, p); rb_insert_color_cached(node, root, leftmost); - if (ins->ins_or_del == BTRFS_DELAYED_INSERTION_ITEM && + if (ins->type == BTRFS_DELAYED_INSERTION_ITEM && ins->index >= delayed_node->index_cnt) delayed_node->index_cnt = ins->index + 1; @@ -414,10 +415,8 @@ static void __btrfs_remove_delayed_item(struct btrfs_delayed_item *delayed_item) delayed_root = delayed_item->delayed_node->root->fs_info->delayed_root; BUG_ON(!delayed_root); - BUG_ON(delayed_item->ins_or_del != BTRFS_DELAYED_DELETION_ITEM && - delayed_item->ins_or_del != BTRFS_DELAYED_INSERTION_ITEM); - if (delayed_item->ins_or_del == BTRFS_DELAYED_INSERTION_ITEM) + if (delayed_item->type == BTRFS_DELAYED_INSERTION_ITEM) root = &delayed_item->delayed_node->ins_root; else root = &delayed_item->delayed_node->del_root; @@ -509,7 +508,7 @@ static int btrfs_delayed_item_reserve_metadata(struct btrfs_trans_handle *trans, * for the number of leaves that will be used, based on the delayed * node's index_items_size field. */ - if (item->ins_or_del == BTRFS_DELAYED_DELETION_ITEM) + if (item->type == BTRFS_DELAYED_DELETION_ITEM) item->bytes_reserved = num_bytes; } @@ -646,6 +645,7 @@ static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans, const int max_size = BTRFS_LEAF_DATA_SIZE(fs_info); struct btrfs_item_batch batch; struct btrfs_key first_key; + const u32 first_data_size = first_item->data_len; int total_size; char *ins_data = NULL; int ret; @@ -674,9 +674,9 @@ static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans, ASSERT(first_item->bytes_reserved == 0); list_add_tail(&first_item->tree_list, &item_list); - batch.total_data_size = first_item->data_len; + batch.total_data_size = first_data_size; batch.nr = 1; - total_size = first_item->data_len + sizeof(struct btrfs_item); + total_size = first_data_size + sizeof(struct btrfs_item); curr = first_item; while (true) { @@ -711,7 +711,7 @@ static int btrfs_insert_delayed_item(struct btrfs_trans_handle *trans, first_key.type = BTRFS_DIR_INDEX_KEY; first_key.offset = first_item->index; batch.keys = &first_key; - batch.data_sizes = &first_item->data_len; + batch.data_sizes = &first_data_size; } else { struct btrfs_key *ins_keys; u32 *ins_sizes; @@ -1427,14 +1427,14 @@ int btrfs_insert_delayed_dir_index(struct btrfs_trans_handle *trans, return PTR_ERR(delayed_node); delayed_item = btrfs_alloc_delayed_item(sizeof(*dir_item) + name_len, - delayed_node); + delayed_node, + BTRFS_DELAYED_INSERTION_ITEM); if (!delayed_item) { ret = -ENOMEM; goto release_node; } delayed_item->index = index; - delayed_item->ins_or_del = BTRFS_DELAYED_INSERTION_ITEM; dir_item = (struct btrfs_dir_item *)delayed_item->data; dir_item->location = *disk_key; @@ -1566,14 +1566,13 @@ int btrfs_delete_delayed_dir_index(struct btrfs_trans_handle *trans, if (!ret) goto end; - item = btrfs_alloc_delayed_item(0, node); + item = btrfs_alloc_delayed_item(0, node, BTRFS_DELAYED_DELETION_ITEM); if (!item) { ret = -ENOMEM; goto end; } item->index = index; - item->ins_or_del = BTRFS_DELAYED_DELETION_ITEM; ret = btrfs_delayed_item_reserve_metadata(trans, item); /* diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h index fd6fe785f748..729d352ca8a1 100644 --- a/fs/btrfs/delayed-inode.h +++ b/fs/btrfs/delayed-inode.h @@ -16,9 +16,10 @@ #include #include "ctree.h" -/* types of the delayed item */ -#define BTRFS_DELAYED_INSERTION_ITEM 1 -#define BTRFS_DELAYED_DELETION_ITEM 2 +enum btrfs_delayed_item_type { + BTRFS_DELAYED_INSERTION_ITEM, + BTRFS_DELAYED_DELETION_ITEM +}; struct btrfs_delayed_root { spinlock_t lock; @@ -80,8 +81,9 @@ struct btrfs_delayed_item { u64 bytes_reserved; struct btrfs_delayed_node *delayed_node; refcount_t refs; - int ins_or_del; - u32 data_len; + enum btrfs_delayed_item_type type:1; + /* The maximum leaf size is 64K, so u16 is more than enough. */ + u16 data_len; char data[]; }; From patchwork Mon Aug 22 10:51:39 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950488 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 21127C28D13 for ; Mon, 22 Aug 2022 10:52:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233871AbiHVKwE (ORCPT ); Mon, 22 Aug 2022 06:52:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43478 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233798AbiHVKv7 (ORCPT ); Mon, 22 Aug 2022 06:51:59 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 733B63123B for ; Mon, 22 Aug 2022 03:51:58 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 0D6B960FDE for ; Mon, 22 Aug 2022 10:51:58 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id F0009C433C1 for ; Mon, 22 Aug 2022 10:51:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165517; bh=uWFjkSweqpukcNVlBHQgWovsAmfMSnyZa1doTzB9Tno=; h=From:To:Subject:Date:In-Reply-To:References:From; b=FokF+2o2UFsIe3liJH/21rD4we90A3QjqR0To9Y3UK+DUaq+BmkPYewuYRn9qmLcH y1Ek8J7S6fU60J8mAiD3ZOcZPlWR4bXCF7WAjIVZSGed8Ju5x/n+xR0sRaO7+e5rn0 py3PKCVRVO7cApmC6BkqsiciuVejWfWKIT3A2Q1iHLRUfhu2ud5bC30TPQuifuW5K8 kOntQ2/I0CLdeufjRXt9o/kXvxSkyVvpGS22CXjfYKwiRp7QcVyojlpmc2PiAP4/rP yoSb+HSyCKbfVHxN8sLg1P1Uxzvw4ga4VxUYNQntUITWb70TgqEwlQGfCJx4f9Or6n XWANdd3yxzYlQ== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 10/15] btrfs: search for last logged dir index if it's not cached in the inode Date: Mon, 22 Aug 2022 11:51:39 +0100 Message-Id: <2476d5a5d1f34d2646c93f9b765f7e4df520ba9a.1661165149.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana The key offset of the last dir index item that was logged is stored in the inode's last_dir_index_offset field. However that field is not persisted in the inode item or elsewhere, so if the inode gets evicted and reloaded, it gets a value of (u64)-1, so that when we are logging dir index items we check if they were logged before, to avoid attempts to insert duplicated keys and fallback to a transaction commit. Improve on this by searching for the last dir index that was logged when we start logging a directory if the inode's last_dir_index_offset is not set (has a value of (u64)-1) and it was logged before. This avoids checking if each dir index item we find was already logged before, and simplifies the logging of dir index items (process_dir_items_leaf()). This will also be needed for an incoming change where we start logging delayed items directly, without flushing them first. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 119 +++++++++++++++++++++++++++----------------- 1 file changed, 74 insertions(+), 45 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 94026098bb68..1cfbd4503822 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3735,6 +3735,11 @@ static int process_dir_items_leaf(struct btrfs_trans_handle *trans, *last_old_dentry_offset = key.offset; continue; } + + /* If we logged this dir index item before, we can skip it. */ + if (key.offset <= inode->last_dir_index_offset) + continue; + /* * We must make sure that when we log a directory entry, the * corresponding inode, after log replay, has a matching link @@ -3765,51 +3770,6 @@ static int process_dir_items_leaf(struct btrfs_trans_handle *trans, ctx->log_new_dentries = true; } - if (!ctx->logged_before) - goto add_to_batch; - - /* - * If we were logged before and have logged dir items, we can skip - * checking if any item with a key offset larger than the last one - * we logged is in the log tree, saving time and avoiding adding - * contention on the log tree. We can only rely on the value of - * last_dir_index_offset when we know for sure that the inode was - * previously logged in the current transaction. - */ - if (key.offset > inode->last_dir_index_offset) - goto add_to_batch; - /* - * Check if the key was already logged before. If not we can add - * it to a batch for bulk insertion. - */ - ret = btrfs_search_slot(NULL, log, &key, dst_path, 0, 0); - if (ret < 0) { - return ret; - } else if (ret > 0) { - btrfs_release_path(dst_path); - goto add_to_batch; - } - - /* - * Item exists in the log. Overwrite the item in the log if it - * has different content or do nothing if it has exactly the same - * content. And then flush the current batch if any - do it after - * overwriting the current item, or we would deadlock otherwise, - * since we are holding a path for the existing item. - */ - ret = do_overwrite_item(trans, log, dst_path, src, i, &key); - if (ret < 0) - return ret; - - if (batch_size > 0) { - ret = flush_dir_items_batch(trans, log, src, dst_path, - batch_start, batch_size); - if (ret < 0) - return ret; - batch_size = 0; - } - continue; -add_to_batch: if (batch_size == 0) batch_start = i; batch_size++; @@ -3995,6 +3955,71 @@ static noinline int log_dir_items(struct btrfs_trans_handle *trans, return err; } +/* + * If the inode was logged before and it was evicted, then its + * last_dir_index_offset is (u64)-1, so we don't the value of the last index + * key offset. If that's the case, search for it and update the inode. This + * is to avoid lookups in the log tree every time we try to insert a dir index + * key from a leaf changed in the current transaction, and to allow us to always + * do batch insertions of dir index keys. + */ +static int update_last_dir_index_offset(struct btrfs_inode *inode, + struct btrfs_path *path, + const struct btrfs_log_ctx *ctx) +{ + const u64 ino = btrfs_ino(inode); + struct btrfs_key key; + int ret; + + lockdep_assert_held(&inode->log_mutex); + + if (inode->last_dir_index_offset != (u64)-1) + return 0; + + if (!ctx->logged_before) { + inode->last_dir_index_offset = BTRFS_DIR_START_INDEX - 1; + return 0; + } + + key.objectid = ino; + key.type = BTRFS_DIR_INDEX_KEY; + key.offset = (u64)-1; + + ret = btrfs_search_slot(NULL, inode->root->log_root, &key, path, 0, 0); + /* + * An error happened or we actually have an index key with an offset + * value of (u64)-1. Bail out, we're done. + */ + if (ret <= 0) + goto out; + + ret = 0; + inode->last_dir_index_offset = BTRFS_DIR_START_INDEX - 1; + + /* + * No dir index items, bail out and leave last_dir_index_offset with + * the value right before the first valid index value. + */ + if (path->slots[0] == 0) + goto out; + + /* + * btrfs_search_slot() left us at one slot beyond the slot with the last + * index key, or beyond the last key of the directory that is not an + * index key. If we have an index key before, set last_dir_index_offset + * to its offset value, otherwise leave it with a value right before the + * first valid index value, as it means we have an empty directory. + */ + btrfs_item_key_to_cpu(path->nodes[0], &key, path->slots[0] - 1); + if (key.objectid == ino && key.type == BTRFS_DIR_INDEX_KEY) + inode->last_dir_index_offset = key.offset; + +out: + btrfs_release_path(path); + + return ret; +} + /* * logging directories is very similar to logging inodes, We find all the items * from the current transaction and write them to the log. @@ -4017,6 +4042,10 @@ static noinline int log_directory_changes(struct btrfs_trans_handle *trans, u64 max_key; int ret; + ret = update_last_dir_index_offset(inode, path, ctx); + if (ret) + return ret; + min_key = BTRFS_DIR_START_INDEX; max_key = 0; ctx->last_dir_item_offset = inode->last_dir_index_offset; From patchwork Mon Aug 22 10:51:40 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950489 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id E82CEC32796 for ; Mon, 22 Aug 2022 10:52:05 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233883AbiHVKwE (ORCPT ); Mon, 22 Aug 2022 06:52:04 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43494 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233820AbiHVKwA (ORCPT ); Mon, 22 Aug 2022 06:52:00 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 629452F671 for ; Mon, 22 Aug 2022 03:51:59 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id F147C60FFD for ; Mon, 22 Aug 2022 10:51:58 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id DE70BC433B5 for ; Mon, 22 Aug 2022 10:51:57 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165518; bh=LBHv+MnMIFLoOR88t0KvNU1apoe8uRI+e6OrrRtgk9g=; h=From:To:Subject:Date:In-Reply-To:References:From; b=YIBgumOiWMla9i0vmxLlLtZ9um5xgplq4doaJr9yvO3XcwifBPo6u75oGM0YWXBsS tc6oRy+F1TVC1kAPxqKgyldJsmDKw8VdRU3PU0FyEazmdBDBb0EoTFGDkFvmYizqOQ 2gYNp1NDVcuXqEVJE5c2Osq0Vko/OeQwCuZfCCdoHZnQXbDtnmKKEs1Jr8zbVjk6QP auYzEHUCko0wZOYqCpaAZvPdvSrLDMzrfjQS+6MdSiDVB6K9WQRlWMyNVWTFeVE9GB ljpUh5fl4bqflRx5EmMMUJkZ6V1XqMZgH/jSxFxnvMPGPgFAcBjLu6Wx7uM0moykmU gL/qIldVOOGVw== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 11/15] btrfs: move need_log_inode() to above log_conflicting_inodes() Date: Mon, 22 Aug 2022 11:51:40 +0100 Message-Id: <017ffcdd7d3943f2c6d34674f06676c617dd8a6e.1661165149.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana The static function need_log_inode() is defined below btrfs_log_inode() and log_conflicting_inodes(), but in the next patches in the series we will need to call need_log_inode() in a couple new functions that will be used by btrfs_log_inode(). So move its definition to a location above log_conflicting_inodes(). Also make its arguments 'const', since they are not supposed to be modified. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 70 ++++++++++++++++++++++----------------------- 1 file changed, 35 insertions(+), 35 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 1cfbd4503822..15a4a150cb27 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5285,6 +5285,41 @@ static int btrfs_check_ref_name_override(struct extent_buffer *eb, return ret; } +/* + * Check if we need to log an inode. This is used in contexts where while + * logging an inode we need to log another inode (either that it exists or in + * full mode). This is used instead of btrfs_inode_in_log() because the later + * requires the inode to be in the log and have the log transaction committed, + * while here we do not care if the log transaction was already committed - our + * caller will commit the log later - and we want to avoid logging an inode + * multiple times when multiple tasks have joined the same log transaction. + */ +static bool need_log_inode(const struct btrfs_trans_handle *trans, + const struct btrfs_inode *inode) +{ + /* + * If a directory was not modified, no dentries added or removed, we can + * and should avoid logging it. + */ + if (S_ISDIR(inode->vfs_inode.i_mode) && inode->last_trans < trans->transid) + return false; + + /* + * If this inode does not have new/updated/deleted xattrs since the last + * time it was logged and is flagged as logged in the current transaction, + * we can skip logging it. As for new/deleted names, those are updated in + * the log by link/unlink/rename operations. + * In case the inode was logged and then evicted and reloaded, its + * logged_trans will be 0, in which case we have to fully log it since + * logged_trans is a transient field, not persisted. + */ + if (inode->logged_trans == trans->transid && + !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags)) + return false; + + return true; +} + struct btrfs_ino_list { u64 ino; u64 parent; @@ -5921,41 +5956,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, return ret; } -/* - * Check if we need to log an inode. This is used in contexts where while - * logging an inode we need to log another inode (either that it exists or in - * full mode). This is used instead of btrfs_inode_in_log() because the later - * requires the inode to be in the log and have the log transaction committed, - * while here we do not care if the log transaction was already committed - our - * caller will commit the log later - and we want to avoid logging an inode - * multiple times when multiple tasks have joined the same log transaction. - */ -static bool need_log_inode(struct btrfs_trans_handle *trans, - struct btrfs_inode *inode) -{ - /* - * If a directory was not modified, no dentries added or removed, we can - * and should avoid logging it. - */ - if (S_ISDIR(inode->vfs_inode.i_mode) && inode->last_trans < trans->transid) - return false; - - /* - * If this inode does not have new/updated/deleted xattrs since the last - * time it was logged and is flagged as logged in the current transaction, - * we can skip logging it. As for new/deleted names, those are updated in - * the log by link/unlink/rename operations. - * In case the inode was logged and then evicted and reloaded, its - * logged_trans will be 0, in which case we have to fully log it since - * logged_trans is a transient field, not persisted. - */ - if (inode->logged_trans == trans->transid && - !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags)) - return false; - - return true; -} - struct btrfs_dir_list { u64 ino; struct list_head list; From patchwork Mon Aug 22 10:51:41 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950490 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id A1094C32772 for ; Mon, 22 Aug 2022 10:52:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233902AbiHVKwF (ORCPT ); Mon, 22 Aug 2022 06:52:05 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43508 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233848AbiHVKwB (ORCPT ); Mon, 22 Aug 2022 06:52:01 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5774B3123B for ; Mon, 22 Aug 2022 03:52:00 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id E5F2D60FFF for ; Mon, 22 Aug 2022 10:51:59 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id CD500C433C1 for ; Mon, 22 Aug 2022 10:51:58 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165519; bh=Cv4mD7eTiy+rJLxKCXv+EiJjimhvYm2tREgsqdecDY4=; h=From:To:Subject:Date:In-Reply-To:References:From; b=gXJfPVuZD58b9IH6NCTm1iHpJ6ZcNZxqos7nxXfgnG8oMtl4En1fcTyl2z8Ajq4I5 /ib70z9cXVlbX71El63WyuShFTfXyJ4azNIVgnB4WP2HZoCeTwYWeDdmPilG4vDasa 0gEii+ZNPx9ZMVSwLtqi/K3WdfNwq1gC+znzpX2sSbQM4x9MNRwHEGbFf36VLDnfYW 1W3e1vXX05yAxP/YcW8NkrPUJE8kX2zr+EMDNtb6a2FEry037d3dWEUIM3aZHMAfJJ RuiBx4P7eqwzKbZ5SSwaIfW2h3orLqWzXps9bEtVewmCd7oG+6QBVG2EQU6YR51ft4 Vdti6l7EH8fTg== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 12/15] btrfs: move log_new_dir_dentries() above btrfs_log_inode() Date: Mon, 22 Aug 2022 11:51:41 +0100 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana The static function log_new_dir_dentries() is currently defined below btrfs_log_inode(), but in an upcoming patch a new function is introduced that is called by btrfs_log_inode() and this new function needs to call log_new_dir_dentries(). So move log_new_dir_dentries() to a location between btrfs_log_inode() and need_log_inode() (the later is called by log_new_dir_dentries()). Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 334 ++++++++++++++++++++++---------------------- 1 file changed, 167 insertions(+), 167 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 15a4a150cb27..94ff4a4598dc 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5320,6 +5320,173 @@ static bool need_log_inode(const struct btrfs_trans_handle *trans, return true; } +struct btrfs_dir_list { + u64 ino; + struct list_head list; +}; + +/* + * Log the inodes of the new dentries of a directory. + * See process_dir_items_leaf() for details about why it is needed. + * This is a recursive operation - if an existing dentry corresponds to a + * directory, that directory's new entries are logged too (same behaviour as + * ext3/4, xfs, f2fs, reiserfs, nilfs2). Note that when logging the inodes + * the dentries point to we do not acquire their VFS lock, otherwise lockdep + * complains about the following circular lock dependency / possible deadlock: + * + * CPU0 CPU1 + * ---- ---- + * lock(&type->i_mutex_dir_key#3/2); + * lock(sb_internal#2); + * lock(&type->i_mutex_dir_key#3/2); + * lock(&sb->s_type->i_mutex_key#14); + * + * Where sb_internal is the lock (a counter that works as a lock) acquired by + * sb_start_intwrite() in btrfs_start_transaction(). + * Not acquiring the VFS lock of the inodes is still safe because: + * + * 1) For regular files we log with a mode of LOG_INODE_EXISTS. It's possible + * that while logging the inode new references (names) are added or removed + * from the inode, leaving the logged inode item with a link count that does + * not match the number of logged inode reference items. This is fine because + * at log replay time we compute the real number of links and correct the + * link count in the inode item (see replay_one_buffer() and + * link_to_fixup_dir()); + * + * 2) For directories we log with a mode of LOG_INODE_ALL. It's possible that + * while logging the inode's items new index items (key type + * BTRFS_DIR_INDEX_KEY) are added to fs/subvol tree and the logged inode item + * has a size that doesn't match the sum of the lengths of all the logged + * names - this is ok, not a problem, because at log replay time we set the + * directory's i_size to the correct value (see replay_one_name() and + * do_overwrite_item()). + */ +static int log_new_dir_dentries(struct btrfs_trans_handle *trans, + struct btrfs_inode *start_inode, + struct btrfs_log_ctx *ctx) +{ + struct btrfs_root *root = start_inode->root; + struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_path *path; + LIST_HEAD(dir_list); + struct btrfs_dir_list *dir_elem; + u64 ino = btrfs_ino(start_inode); + int ret = 0; + + /* + * If we are logging a new name, as part of a link or rename operation, + * don't bother logging new dentries, as we just want to log the names + * of an inode and that any new parents exist. + */ + if (ctx->logging_new_name) + return 0; + + path = btrfs_alloc_path(); + if (!path) + return -ENOMEM; + + while (true) { + struct extent_buffer *leaf; + struct btrfs_key min_key; + bool continue_curr_inode = true; + int nritems; + int i; + + min_key.objectid = ino; + min_key.type = BTRFS_DIR_INDEX_KEY; + min_key.offset = 0; +again: + btrfs_release_path(path); + ret = btrfs_search_forward(root, &min_key, path, trans->transid); + if (ret < 0) { + break; + } else if (ret > 0) { + ret = 0; + goto next; + } + + leaf = path->nodes[0]; + nritems = btrfs_header_nritems(leaf); + for (i = path->slots[0]; i < nritems; i++) { + struct btrfs_dir_item *di; + struct btrfs_key di_key; + struct inode *di_inode; + int log_mode = LOG_INODE_EXISTS; + int type; + + btrfs_item_key_to_cpu(leaf, &min_key, i); + if (min_key.objectid != ino || + min_key.type != BTRFS_DIR_INDEX_KEY) { + continue_curr_inode = false; + break; + } + + di = btrfs_item_ptr(leaf, i, struct btrfs_dir_item); + type = btrfs_dir_type(leaf, di); + if (btrfs_dir_transid(leaf, di) < trans->transid) + continue; + btrfs_dir_item_key_to_cpu(leaf, di, &di_key); + if (di_key.type == BTRFS_ROOT_ITEM_KEY) + continue; + + btrfs_release_path(path); + di_inode = btrfs_iget(fs_info->sb, di_key.objectid, root); + if (IS_ERR(di_inode)) { + ret = PTR_ERR(di_inode); + goto out; + } + + if (!need_log_inode(trans, BTRFS_I(di_inode))) { + btrfs_add_delayed_iput(di_inode); + break; + } + + ctx->log_new_dentries = false; + if (type == BTRFS_FT_DIR) + log_mode = LOG_INODE_ALL; + ret = btrfs_log_inode(trans, BTRFS_I(di_inode), + log_mode, ctx); + btrfs_add_delayed_iput(di_inode); + if (ret) + goto out; + if (ctx->log_new_dentries) { + dir_elem = kmalloc(sizeof(*dir_elem), GFP_NOFS); + if (!dir_elem) { + ret = -ENOMEM; + goto out; + } + dir_elem->ino = di_key.objectid; + list_add_tail(&dir_elem->list, &dir_list); + } + break; + } + + if (continue_curr_inode && min_key.offset < (u64)-1) { + min_key.offset++; + goto again; + } + +next: + if (list_empty(&dir_list)) + break; + + dir_elem = list_first_entry(&dir_list, struct btrfs_dir_list, list); + ino = dir_elem->ino; + list_del(&dir_elem->list); + kfree(dir_elem); + } +out: + btrfs_free_path(path); + if (ret) { + struct btrfs_dir_list *next; + + list_for_each_entry_safe(dir_elem, next, &dir_list, list) + kfree(dir_elem); + } + + return ret; +} + struct btrfs_ino_list { u64 ino; u64 parent; @@ -5956,173 +6123,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, return ret; } -struct btrfs_dir_list { - u64 ino; - struct list_head list; -}; - -/* - * Log the inodes of the new dentries of a directory. - * See process_dir_items_leaf() for details about why it is needed. - * This is a recursive operation - if an existing dentry corresponds to a - * directory, that directory's new entries are logged too (same behaviour as - * ext3/4, xfs, f2fs, reiserfs, nilfs2). Note that when logging the inodes - * the dentries point to we do not acquire their VFS lock, otherwise lockdep - * complains about the following circular lock dependency / possible deadlock: - * - * CPU0 CPU1 - * ---- ---- - * lock(&type->i_mutex_dir_key#3/2); - * lock(sb_internal#2); - * lock(&type->i_mutex_dir_key#3/2); - * lock(&sb->s_type->i_mutex_key#14); - * - * Where sb_internal is the lock (a counter that works as a lock) acquired by - * sb_start_intwrite() in btrfs_start_transaction(). - * Not acquiring the VFS lock of the inodes is still safe because: - * - * 1) For regular files we log with a mode of LOG_INODE_EXISTS. It's possible - * that while logging the inode new references (names) are added or removed - * from the inode, leaving the logged inode item with a link count that does - * not match the number of logged inode reference items. This is fine because - * at log replay time we compute the real number of links and correct the - * link count in the inode item (see replay_one_buffer() and - * link_to_fixup_dir()); - * - * 2) For directories we log with a mode of LOG_INODE_ALL. It's possible that - * while logging the inode's items new index items (key type - * BTRFS_DIR_INDEX_KEY) are added to fs/subvol tree and the logged inode item - * has a size that doesn't match the sum of the lengths of all the logged - * names - this is ok, not a problem, because at log replay time we set the - * directory's i_size to the correct value (see replay_one_name() and - * do_overwrite_item()). - */ -static int log_new_dir_dentries(struct btrfs_trans_handle *trans, - struct btrfs_inode *start_inode, - struct btrfs_log_ctx *ctx) -{ - struct btrfs_root *root = start_inode->root; - struct btrfs_fs_info *fs_info = root->fs_info; - struct btrfs_path *path; - LIST_HEAD(dir_list); - struct btrfs_dir_list *dir_elem; - u64 ino = btrfs_ino(start_inode); - int ret = 0; - - /* - * If we are logging a new name, as part of a link or rename operation, - * don't bother logging new dentries, as we just want to log the names - * of an inode and that any new parents exist. - */ - if (ctx->logging_new_name) - return 0; - - path = btrfs_alloc_path(); - if (!path) - return -ENOMEM; - - while (true) { - struct extent_buffer *leaf; - struct btrfs_key min_key; - bool continue_curr_inode = true; - int nritems; - int i; - - min_key.objectid = ino; - min_key.type = BTRFS_DIR_INDEX_KEY; - min_key.offset = 0; -again: - btrfs_release_path(path); - ret = btrfs_search_forward(root, &min_key, path, trans->transid); - if (ret < 0) { - break; - } else if (ret > 0) { - ret = 0; - goto next; - } - - leaf = path->nodes[0]; - nritems = btrfs_header_nritems(leaf); - for (i = path->slots[0]; i < nritems; i++) { - struct btrfs_dir_item *di; - struct btrfs_key di_key; - struct inode *di_inode; - int log_mode = LOG_INODE_EXISTS; - int type; - - btrfs_item_key_to_cpu(leaf, &min_key, i); - if (min_key.objectid != ino || - min_key.type != BTRFS_DIR_INDEX_KEY) { - continue_curr_inode = false; - break; - } - - di = btrfs_item_ptr(leaf, i, struct btrfs_dir_item); - type = btrfs_dir_type(leaf, di); - if (btrfs_dir_transid(leaf, di) < trans->transid) - continue; - btrfs_dir_item_key_to_cpu(leaf, di, &di_key); - if (di_key.type == BTRFS_ROOT_ITEM_KEY) - continue; - - btrfs_release_path(path); - di_inode = btrfs_iget(fs_info->sb, di_key.objectid, root); - if (IS_ERR(di_inode)) { - ret = PTR_ERR(di_inode); - goto out; - } - - if (!need_log_inode(trans, BTRFS_I(di_inode))) { - btrfs_add_delayed_iput(di_inode); - break; - } - - ctx->log_new_dentries = false; - if (type == BTRFS_FT_DIR) - log_mode = LOG_INODE_ALL; - ret = btrfs_log_inode(trans, BTRFS_I(di_inode), - log_mode, ctx); - btrfs_add_delayed_iput(di_inode); - if (ret) - goto out; - if (ctx->log_new_dentries) { - dir_elem = kmalloc(sizeof(*dir_elem), GFP_NOFS); - if (!dir_elem) { - ret = -ENOMEM; - goto out; - } - dir_elem->ino = di_key.objectid; - list_add_tail(&dir_elem->list, &dir_list); - } - break; - } - - if (continue_curr_inode && min_key.offset < (u64)-1) { - min_key.offset++; - goto again; - } - -next: - if (list_empty(&dir_list)) - break; - - dir_elem = list_first_entry(&dir_list, struct btrfs_dir_list, list); - ino = dir_elem->ino; - list_del(&dir_elem->list); - kfree(dir_elem); - } -out: - btrfs_free_path(path); - if (ret) { - struct btrfs_dir_list *next; - - list_for_each_entry_safe(dir_elem, next, &dir_list, list) - kfree(dir_elem); - } - - return ret; -} - static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_log_ctx *ctx) From patchwork Mon Aug 22 10:51:42 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950493 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 5F931C32793 for ; Mon, 22 Aug 2022 10:52:11 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233929AbiHVKwJ (ORCPT ); Mon, 22 Aug 2022 06:52:09 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43554 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233905AbiHVKwF (ORCPT ); Mon, 22 Aug 2022 06:52:05 -0400 Received: from ams.source.kernel.org (ams.source.kernel.org [145.40.68.75]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 545CD2F659 for ; Mon, 22 Aug 2022 03:52:03 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by ams.source.kernel.org (Postfix) with ESMTPS id C057DB81032 for ; Mon, 22 Aug 2022 10:52:01 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id BD339C433D6 for ; Mon, 22 Aug 2022 10:51:59 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165520; bh=kN0AUxGZV4caTuK9Myv1uhh0Ec+kY5LoJvnsQuRQiXU=; h=From:To:Subject:Date:In-Reply-To:References:From; b=et27sOzrEr7OJo8/qG7OQD4s7ts3K9Uhll+pTkyNamc6dv4flp2E1+Dk7L9Y4au23 C2YNrZB3tw12tk62Cf6xRnb+EF5zaDuzGR2HDRXiBHiaU45WjSi71OByQRI5nBYQI1 +DWWNdPWgoxjvrQpB3UeJzGrTZHbEWOdxIuayULyV14xfHOW5e63w1d/GO9aR6WaA+ XLgsP8D1n6p33hSTUd8VFBZcolpSFKezWmgF8R4fjMdNeWC9YMyE5/FsLmb0AujT+7 KS3vg/nuNqMOCCYeOh4bXKziIyP4YwPxDqXDgSmZriRBgzJM6emIQiVdz4x1PWOBTv k4SMSdVokBylg== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 13/15] btrfs: log conflicting inodes without holding log mutex of the initial inode Date: Mon, 22 Aug 2022 11:51:42 +0100 Message-Id: <6703074a297ec70d7231b0f778dd2fac6045f03a.1661165149.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When logging an inode, if we detect the inode has a reference that conflicts with some other inode that got renamed, we log that other inode while holding the log mutex of the current inode. We then find out if there are other inodes that conflict with the first conflicting inode, and log them while under the log mutex of the original inode. This is fine because the recursion can only happen once. For the upcoming work where we directly log delayed items without flushing them first to the subvolume tree, this recursion adds a lot of complexity and it's hard to keep lockdep happy about it. So collect a list of conflicting inodes and then log the inodes after unlocking the log mutex of the inode we started with. Also limit the maximum number of conflict inodes we log to 10, to avoid spending too much time logging (and maybe allocating too many list elements too), as typically we don't have more than 1 or 2 conflicting inodes - if we go over the limit, simply fallback to a transaction commit. It is possible to have a very long list of conflicting inodes to be intentionally created by a user if he/she creates a very long succession of renames like this: (...) rename E to F rename D to E rename C to D rename B to C rename A to B touch A (create a new file named A) fsync A If that happened for a sequence of hundreds or thousands of renames, it could massively slow down the logging and cause other secondary effects like for example blocking other fsync operations and transaction commits for a very long time (assuming it wouldn't run into -ENOSPC or -ENOMEM first). However such cases are very uncommon to happen in practice, nevertheless it's better to be prepared for them and avoid chaos. Such long sequence of conflicting inodes could be created before this change. Signed-off-by: Filipe Manana --- fs/btrfs/file.c | 1 + fs/btrfs/tree-log.c | 341 ++++++++++++++++++++++++-------------------- fs/btrfs/tree-log.h | 6 + 3 files changed, 196 insertions(+), 152 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index 14466b99863b..9b693d780431 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2397,6 +2397,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) ret = btrfs_commit_transaction(trans); out: ASSERT(list_empty(&ctx.list)); + ASSERT(list_empty(&ctx.conflict_inodes)); err = file_check_and_advance_wb_err(file); if (!ret) ret = err; diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 94ff4a4598dc..72e7c9e559cc 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -22,6 +22,8 @@ #include "zoned.h" #include "inode-item.h" +#define MAX_CONFLICT_INODES 10 + /* magic values for the inode_only field in btrfs_log_inode: * * LOG_INODE_ALL means to log everything @@ -31,8 +33,6 @@ enum { LOG_INODE_ALL, LOG_INODE_EXISTS, - LOG_OTHER_INODE, - LOG_OTHER_INODE_ALL, }; /* @@ -5493,105 +5493,201 @@ struct btrfs_ino_list { struct list_head list; }; -static int log_conflicting_inodes(struct btrfs_trans_handle *trans, - struct btrfs_root *root, - struct btrfs_path *path, - struct btrfs_log_ctx *ctx, - u64 ino, u64 parent) +static void free_conflicting_inodes(struct btrfs_log_ctx *ctx) +{ + struct btrfs_ino_list *curr; + struct btrfs_ino_list *next; + + list_for_each_entry_safe(curr, next, &ctx->conflict_inodes, list) { + list_del(&curr->list); + kfree(curr); + } +} + +static int add_conflicting_inode(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + u64 ino, u64 parent, + struct btrfs_log_ctx *ctx) { struct btrfs_ino_list *ino_elem; - LIST_HEAD(inode_list); - int ret = 0; + struct inode *inode; + + /* + * It's rare to have a lot of conflicting inodes, in practice it is not + * common to have more than 1 or 2. We don't want to collect too many, + * as we could end up logging too many inodes (even if only in + * LOG_INODE_EXISTS mode) and slow down other fsyncs or transaction + * commits. + */ + if (ctx->num_conflict_inodes >= MAX_CONFLICT_INODES) + return BTRFS_LOG_FORCE_COMMIT; + + inode = btrfs_iget(root->fs_info->sb, ino, root); + /* + * If the other inode that had a conflicting dir entry was deleted in + * the current transaction, we need to log its parent directory. + * We can't simply ignore it and let the current inode's reference cause + * an unlink of the conflicting inode when replaying the log - because + * the conflicting inode may be a deleted subvolume/snapshot or it may + * be a directory that had subvolumes/snapshots inside it (or had one + * or more subdirectories with subvolumes/snapshots, etc). If that's the + * case, then when logging the parent directory we will fallback to a + * transaction commit because the parent directory will have a + * last_unlink_trans that matches the current transaction. + */ + if (IS_ERR(inode)) { + int ret = PTR_ERR(inode); + + if (ret != -ENOENT) + return ret; + + ino_elem = kmalloc(sizeof(*ino_elem), GFP_NOFS); + if (!ino_elem) + return -ENOMEM; + ino_elem->ino = ino; + ino_elem->parent = parent; + list_add_tail(&ino_elem->list, &ctx->conflict_inodes); + ctx->num_conflict_inodes++; + + return 0; + } + + /* + * If the inode was already logged skip it - otherwise we can hit an + * infinite loop. Example: + * + * From the commit root (previous transaction) we have the following + * inodes: + * + * inode 257 a directory + * inode 258 with references "zz" and "zz_link" on inode 257 + * inode 259 with reference "a" on inode 257 + * + * And in the current (uncommitted) transaction we have: + * + * inode 257 a directory, unchanged + * inode 258 with references "a" and "a2" on inode 257 + * inode 259 with reference "zz_link" on inode 257 + * inode 261 with reference "zz" on inode 257 + * + * When logging inode 261 the following infinite loop could + * happen if we don't skip already logged inodes: + * + * - we detect inode 258 as a conflicting inode, with inode 261 + * on reference "zz", and log it; + * + * - we detect inode 259 as a conflicting inode, with inode 258 + * on reference "a", and log it; + * + * - we detect inode 258 as a conflicting inode, with inode 259 + * on reference "zz_link", and log it - again! After this we + * repeat the above steps forever. + * + * Here we can use need_log_inode() because we only need to log the + * inode in LOG_INODE_EXISTS mode and rename operations update the log, + * so that the log ends up with the new name and without the old name. + */ + if (!need_log_inode(trans, BTRFS_I(inode))) { + btrfs_add_delayed_iput(inode); + return 0; + } + + btrfs_add_delayed_iput(inode); ino_elem = kmalloc(sizeof(*ino_elem), GFP_NOFS); if (!ino_elem) return -ENOMEM; ino_elem->ino = ino; ino_elem->parent = parent; - list_add_tail(&ino_elem->list, &inode_list); + list_add_tail(&ino_elem->list, &ctx->conflict_inodes); + ctx->num_conflict_inodes++; - while (!list_empty(&inode_list)) { - struct btrfs_fs_info *fs_info = root->fs_info; - struct btrfs_key key; - struct inode *inode; + return 0; +} - ino_elem = list_first_entry(&inode_list, struct btrfs_ino_list, - list); - ino = ino_elem->ino; - parent = ino_elem->parent; - list_del(&ino_elem->list); - kfree(ino_elem); - if (ret) - continue; +static int log_conflicting_inodes(struct btrfs_trans_handle *trans, + struct btrfs_root *root, + struct btrfs_log_ctx *ctx) +{ + struct btrfs_fs_info *fs_info = root->fs_info; + int ret = 0; - btrfs_release_path(path); + /* + * Conflicting inodes are logged by the first call to btrfs_log_inode(), + * otherwise we could have unbounded recursion of btrfs_log_inode() + * calls. This check guarantees we can have only 1 level of recursion. + */ + if (ctx->logging_conflict_inodes) + return 0; + + ctx->logging_conflict_inodes = true; + + /* + * New conflicting inodes may be found and added to the list while we + * are logging a conflicting inode, so keep iterating while the list is + * not empty. + */ + while (!list_empty(&ctx->conflict_inodes)) { + struct btrfs_ino_list *curr; + struct inode *inode; + u64 ino; + u64 parent; + + curr = list_first_entry(&ctx->conflict_inodes, + struct btrfs_ino_list, list); + ino = curr->ino; + parent = curr->parent; + list_del(&curr->list); + kfree(curr); inode = btrfs_iget(fs_info->sb, ino, root); /* * If the other inode that had a conflicting dir entry was * deleted in the current transaction, we need to log its parent - * directory. + * directory. See the comment at add_conflicting_inode(). */ if (IS_ERR(inode)) { ret = PTR_ERR(inode); - if (ret == -ENOENT) { - inode = btrfs_iget(fs_info->sb, parent, root); - if (IS_ERR(inode)) { - ret = PTR_ERR(inode); - } else { - ret = btrfs_log_inode(trans, - BTRFS_I(inode), - LOG_OTHER_INODE_ALL, - ctx); - btrfs_add_delayed_iput(inode); - } + if (ret != -ENOENT) + break; + + inode = btrfs_iget(fs_info->sb, parent, root); + if (IS_ERR(inode)) { + ret = PTR_ERR(inode); + break; } + + /* + * Always log the directory, we cannot make this + * conditional on need_log_inode() because the directory + * might have been logged in LOG_INODE_EXISTS mode or + * the dir index of the conflicting inode is not in a + * dir index key range logged for the directory. So we + * must make sure the deletion is recorded. + */ + ret = btrfs_log_inode(trans, BTRFS_I(inode), + LOG_INODE_ALL, ctx); + btrfs_add_delayed_iput(inode); + if (ret) + break; continue; } + /* - * If the inode was already logged skip it - otherwise we can - * hit an infinite loop. Example: - * - * From the commit root (previous transaction) we have the - * following inodes: - * - * inode 257 a directory - * inode 258 with references "zz" and "zz_link" on inode 257 - * inode 259 with reference "a" on inode 257 - * - * And in the current (uncommitted) transaction we have: - * - * inode 257 a directory, unchanged - * inode 258 with references "a" and "a2" on inode 257 - * inode 259 with reference "zz_link" on inode 257 - * inode 261 with reference "zz" on inode 257 - * - * When logging inode 261 the following infinite loop could - * happen if we don't skip already logged inodes: - * - * - we detect inode 258 as a conflicting inode, with inode 261 - * on reference "zz", and log it; - * - * - we detect inode 259 as a conflicting inode, with inode 258 - * on reference "a", and log it; + * Here we can use need_log_inode() because we only need to log + * the inode in LOG_INODE_EXISTS mode and rename operations + * update the log, so that the log ends up with the new name and + * without the old name. * - * - we detect inode 258 as a conflicting inode, with inode 259 - * on reference "zz_link", and log it - again! After this we - * repeat the above steps forever. + * We did this check at add_conflicting_inode(), but here we do + * it again because if some other task logged the inode after + * that, we can avoid doing it again. */ - spin_lock(&BTRFS_I(inode)->lock); - /* - * Check the inode's logged_trans only instead of - * btrfs_inode_in_log(). This is because the last_log_commit of - * the inode is not updated when we only log that it exists (see - * btrfs_log_inode()). - */ - if (BTRFS_I(inode)->logged_trans == trans->transid) { - spin_unlock(&BTRFS_I(inode)->lock); + if (!need_log_inode(trans, BTRFS_I(inode))) { btrfs_add_delayed_iput(inode); continue; } - spin_unlock(&BTRFS_I(inode)->lock); + /* * We are safe logging the other inode without acquiring its * lock as long as we log with the LOG_INODE_EXISTS mode. We @@ -5599,67 +5695,16 @@ static int log_conflicting_inodes(struct btrfs_trans_handle *trans, * well because during a rename we pin the log and update the * log with the new name before we unpin it. */ - ret = btrfs_log_inode(trans, BTRFS_I(inode), LOG_OTHER_INODE, ctx); - if (ret) { - btrfs_add_delayed_iput(inode); - continue; - } - - key.objectid = ino; - key.type = BTRFS_INODE_REF_KEY; - key.offset = 0; - ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); - if (ret < 0) { - btrfs_add_delayed_iput(inode); - continue; - } - - while (true) { - struct extent_buffer *leaf = path->nodes[0]; - int slot = path->slots[0]; - u64 other_ino = 0; - u64 other_parent = 0; - - if (slot >= btrfs_header_nritems(leaf)) { - ret = btrfs_next_leaf(root, path); - if (ret < 0) { - break; - } else if (ret > 0) { - ret = 0; - break; - } - continue; - } - - btrfs_item_key_to_cpu(leaf, &key, slot); - if (key.objectid != ino || - (key.type != BTRFS_INODE_REF_KEY && - key.type != BTRFS_INODE_EXTREF_KEY)) { - ret = 0; - break; - } - - ret = btrfs_check_ref_name_override(leaf, slot, &key, - BTRFS_I(inode), &other_ino, - &other_parent); - if (ret < 0) - break; - if (ret > 0) { - ino_elem = kmalloc(sizeof(*ino_elem), GFP_NOFS); - if (!ino_elem) { - ret = -ENOMEM; - break; - } - ino_elem->ino = other_ino; - ino_elem->parent = other_parent; - list_add_tail(&ino_elem->list, &inode_list); - ret = 0; - } - path->slots[0]++; - } + ret = btrfs_log_inode(trans, BTRFS_I(inode), LOG_INODE_EXISTS, ctx); btrfs_add_delayed_iput(inode); + if (ret) + break; } + ctx->logging_conflict_inodes = false; + if (ret) + free_conflicting_inodes(ctx); + return ret; } @@ -5670,7 +5715,6 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans, struct btrfs_path *path, struct btrfs_path *dst_path, const u64 logged_isize, - const bool recursive_logging, const int inode_only, struct btrfs_log_ctx *ctx, bool *need_log_inode_item) @@ -5709,8 +5753,8 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans, break; } else if ((min_key->type == BTRFS_INODE_REF_KEY || min_key->type == BTRFS_INODE_EXTREF_KEY) && - inode->generation == trans->transid && - !recursive_logging) { + (inode->generation == trans->transid || + ctx->logging_conflict_inodes)) { u64 other_ino = 0; u64 other_parent = 0; @@ -5734,11 +5778,12 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans, return ret; ins_nr = 0; - ret = log_conflicting_inodes(trans, root, path, - ctx, other_ino, other_parent); + btrfs_release_path(path); + ret = add_conflicting_inode(trans, root, + other_ino, + other_parent, ctx); if (ret) return ret; - btrfs_release_path(path); goto next_key; } } else if (min_key->type == BTRFS_XATTR_ITEM_KEY) { @@ -5852,9 +5897,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, u64 logged_isize = 0; bool need_log_inode_item = true; bool xattrs_logged = false; - bool recursive_logging = false; bool inode_item_dropped = true; - const bool orig_logged_before = ctx->logged_before; path = btrfs_alloc_path(); if (!path) @@ -5893,16 +5936,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, goto out; } - if (inode_only == LOG_OTHER_INODE || inode_only == LOG_OTHER_INODE_ALL) { - recursive_logging = true; - if (inode_only == LOG_OTHER_INODE) - inode_only = LOG_INODE_EXISTS; - else - inode_only = LOG_INODE_ALL; - mutex_lock_nested(&inode->log_mutex, SINGLE_DEPTH_NESTING); - } else { - mutex_lock(&inode->log_mutex); - } + mutex_lock(&inode->log_mutex); /* * For symlinks, we must always log their content, which is stored in an @@ -6008,7 +6042,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, ret = copy_inode_items_to_log(trans, inode, &min_key, &max_key, path, dst_path, logged_isize, - recursive_logging, inode_only, ctx, + inode_only, ctx, &need_log_inode_item); if (ret) goto out_unlock; @@ -6117,8 +6151,10 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, btrfs_free_path(path); btrfs_free_path(dst_path); - if (recursive_logging) - ctx->logged_before = orig_logged_before; + if (ret) + free_conflicting_inodes(ctx); + else + ret = log_conflicting_inodes(trans, inode->root, ctx); return ret; } @@ -6973,6 +7009,7 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans, * inconsistent state after a rename operation. */ btrfs_log_inode_parent(trans, inode, parent, LOG_INODE_EXISTS, &ctx); + ASSERT(list_empty(&ctx.conflict_inodes)); out: /* * If an error happened mark the log for a full commit because it's not diff --git a/fs/btrfs/tree-log.h b/fs/btrfs/tree-log.h index 57ab5f3b8dc7..4e34fe4b7762 100644 --- a/fs/btrfs/tree-log.h +++ b/fs/btrfs/tree-log.h @@ -28,6 +28,9 @@ struct btrfs_log_ctx { struct list_head list; /* Only used for fast fsyncs. */ struct list_head ordered_extents; + struct list_head conflict_inodes; + int num_conflict_inodes; + bool logging_conflict_inodes; }; static inline void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, @@ -41,6 +44,9 @@ static inline void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, ctx->inode = inode; INIT_LIST_HEAD(&ctx->list); INIT_LIST_HEAD(&ctx->ordered_extents); + INIT_LIST_HEAD(&ctx->conflict_inodes); + ctx->num_conflict_inodes = 0; + ctx->logging_conflict_inodes = false; } static inline void btrfs_release_log_ctx_extents(struct btrfs_log_ctx *ctx) From patchwork Mon Aug 22 10:51:43 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950492 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0A300C28D13 for ; Mon, 22 Aug 2022 10:52:09 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233908AbiHVKwG (ORCPT ); Mon, 22 Aug 2022 06:52:06 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43524 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233861AbiHVKwD (ORCPT ); Mon, 22 Aug 2022 06:52:03 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 63B4D31235 for ; Mon, 22 Aug 2022 03:52:02 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id F19D460FFD for ; Mon, 22 Aug 2022 10:52:01 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id D7619C433C1 for ; Mon, 22 Aug 2022 10:52:00 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165521; bh=rVW89q5X2SO9ZTLTOPhpkD204k/lCf1NNo/8aIU8Z9E=; h=From:To:Subject:Date:In-Reply-To:References:From; b=nNsf31WCCjN9C5LDRw1SkfRkIRRTPSUVxm40h3NiY6IEIOHUACuRucOA8Fc2kssXF FT8mCaFsBp8du5FXZmMnfBc85JYXOcC56RI9DC2M8+ebAIXAQ83779EfuFxvAB8+FI PzoShsjBq6YnHRsO130O0und6aV/9PJ9/X+TeFeW+qkZ/cOpzTnmN08lK9NAZ0n0GW 8p4YbMy+fZkaf9ew7m4E5CnNk5abVvyR2Ap2pzZF+ZFdBCIfP4ln0TWvFLae6NPLzi 7GgVIA3bVyZ1J+DQIkCc9r9/m7XCOUZSb9I/nj036DVv5NGpy7QHNUTh61mNG1ozp4 w1LwfBhbF6QYA== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 14/15] btrfs: skip logging parent dir when conflicting inode is not a dir Date: Mon, 22 Aug 2022 11:51:43 +0100 Message-Id: <9c7aee375e478950b23238a26a2a6a719c4a3db7.1661165149.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When we find a conflicting inode (an inode that had the same name and parent directory as the inode we are logging now) that was deleted in the current transaction, we always end up logging its parent directory. This is to deal with the case where the conflicting inode corresponds to a deleted subvolume/snapshot or a directory that had subvolumes/snapshots (or some subdirectory inside it had subvolumes/snapshots, etc), because we can't deal with dropping subvolumes/snapshots during log replay. So if we log the parent directory, and if we are dealing with these special cases, then we fallback to a transaction commit when logging the parent, because its last_unlink_trans will match the current transaction (which gets set and propagated when a subvolume/snapshot is deleted). This change skips the logging of the parent directory when the conflicting inode is not a directory (or a subvolume/snapshot). This is ok because in this case logging the current inode is enough to trigger an unlink of the conflicting inode during log replay. So for a case like this: $ mkdir /mnt/dir $ echo -n "first foo data" > /mnt/dir/foo $ sync $ rm -f /mnt/dir/foo $ echo -n "second foo data" > /mnt/dir/foo $ xfs_io -c "fsync" /mnt/dir/foo We avoid logging parent directory "dir" when logging the new file "foo". In other cases it avoids falling back to a transaction commit, when the parent directory has a last_unlink_trans value that matches the current transaction, due to moving a file from it to some other directory. This is a case that happens frequently with dbench for example, where a new file that has the name/parent of another file that was deleted in the current transaction, is fsynced. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 72 ++++++++++++++++++++++++++++++++++++++------- 1 file changed, 62 insertions(+), 10 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 72e7c9e559cc..1d4c285bb26c 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5504,8 +5504,46 @@ static void free_conflicting_inodes(struct btrfs_log_ctx *ctx) } } +static int conflicting_inode_is_dir(struct btrfs_root *root, u64 ino, + struct btrfs_path *path) +{ + struct btrfs_key key; + int ret; + + key.objectid = ino; + key.type = BTRFS_INODE_ITEM_KEY; + key.offset = 0; + + path->search_commit_root = 1; + path->skip_locking = 1; + + ret = btrfs_search_slot(NULL, root, &key, path, 0, 0); + if (WARN_ON_ONCE(ret > 0)) { + /* + * We have previously found the inode through the commit root + * so this should not happen. If it does, just error out and + * fallback to a transaction commit. + */ + ret = -ENOENT; + } else if (ret == 0) { + struct btrfs_inode_item *item; + + item = btrfs_item_ptr(path->nodes[0], path->slots[0], + struct btrfs_inode_item); + if (S_ISDIR(btrfs_inode_mode(path->nodes[0], item))) + ret = 1; + } + + btrfs_release_path(path); + path->search_commit_root = 0; + path->skip_locking = 0; + + return ret; +} + static int add_conflicting_inode(struct btrfs_trans_handle *trans, struct btrfs_root *root, + struct btrfs_path *path, u64 ino, u64 parent, struct btrfs_log_ctx *ctx) { @@ -5525,15 +5563,23 @@ static int add_conflicting_inode(struct btrfs_trans_handle *trans, inode = btrfs_iget(root->fs_info->sb, ino, root); /* * If the other inode that had a conflicting dir entry was deleted in - * the current transaction, we need to log its parent directory. - * We can't simply ignore it and let the current inode's reference cause - * an unlink of the conflicting inode when replaying the log - because - * the conflicting inode may be a deleted subvolume/snapshot or it may - * be a directory that had subvolumes/snapshots inside it (or had one - * or more subdirectories with subvolumes/snapshots, etc). If that's the - * case, then when logging the parent directory we will fallback to a - * transaction commit because the parent directory will have a - * last_unlink_trans that matches the current transaction. + * the current transaction then we either: + * + * 1) Log the parent directory (later after adding it to the list) if + * the inode is a directory. This is because it may be a deleted + * subvolume/snapshot or it may be a regular directory that had + * deleted subvolumes/snapshots (or subdirectories that had them), + * and at the moment we can't deal with dropping subvolumes/snapshots + * during log replay. So we just log the parent, which will result in + * a fallback to a transaction commit if we are dealing with those + * cases (last_unlink_trans will match the current transaction); + * + * 2) Do nothing if it's not a directory. During log replay we simply + * unlink the conflicting dentry from the parent directory and then + * add the dentry for our inode. Like this we can avoid logging the + * parent directory (and maybe fallback to a transaction commit in + * case it has a last_unlink_trans == trans->transid, due to moving + * some inode from it to some other directory). */ if (IS_ERR(inode)) { int ret = PTR_ERR(inode); @@ -5541,6 +5587,12 @@ static int add_conflicting_inode(struct btrfs_trans_handle *trans, if (ret != -ENOENT) return ret; + ret = conflicting_inode_is_dir(root, ino, path); + /* Not a directory or we got an error. */ + if (ret <= 0) + return ret; + + /* Conflicting inode is a directory, so we'll log its parent. */ ino_elem = kmalloc(sizeof(*ino_elem), GFP_NOFS); if (!ino_elem) return -ENOMEM; @@ -5779,7 +5831,7 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans, ins_nr = 0; btrfs_release_path(path); - ret = add_conflicting_inode(trans, root, + ret = add_conflicting_inode(trans, root, path, other_ino, other_parent, ctx); if (ret) From patchwork Mon Aug 22 10:51:44 2022 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12950491 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id 393B8C32772 for ; Mon, 22 Aug 2022 10:52:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233924AbiHVKwI (ORCPT ); Mon, 22 Aug 2022 06:52:08 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43546 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233820AbiHVKwF (ORCPT ); Mon, 22 Aug 2022 06:52:05 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 5B5673123F for ; Mon, 22 Aug 2022 03:52:03 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id DCF1160FDE for ; Mon, 22 Aug 2022 10:52:02 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id C6868C433D7 for ; Mon, 22 Aug 2022 10:52:01 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1661165522; bh=H0ZdcpupSrmDYFdZf9zEJuGHq+bmtfqdwNnc914bOrw=; h=From:To:Subject:Date:In-Reply-To:References:From; b=LcUWqyYcSVMD1r9qWbUJfqLbgufh9r908bNR+s+bis/i84sXW6V4iYqU+kYkX56WL nxmMAJ4Afaqf/jH/y3NK3slCSI3QMmbAoUFQzoI41wb8KlAH4J1zXnZzzP4/IGZeTt vljBhgn/+0CltIrBi9ihmAVSI3APh7+HP3yXvbUDVoxG70F4uiDDS8onwnc2TF4Yzd V/FarQJZxROLYtHvyG9LMHF8v/J2r5mD9uxSgaE4bQEW8Rna+0upGUchdfdqr6PKii cUVWGepvXIkdoT6ExkUqxy/Yw6eGsDZIH7kjWAbF0eNnW/6NM+WZUI3FBc3uvQHjUM wjQzILHzUP29Q== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH v2 15/15] btrfs: use delayed items when logging a directory Date: Mon, 22 Aug 2022 11:51:44 +0100 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When logging a directory we start by flushing all its delayed items. That results in adding dir index items to the subvolume btree, for new dentries, and removing dir index items from the subvolume btree for any dentries that were deleted. This makes it straightforward to log a directory simply by iterating over all the modified subvolume btree leaves, especially when we used to log both dir index keys and dir item keys (before commit 339d035424849c ("btrfs: only copy dir index keys when logging a directory") and when we used to copy old dir index entries for leaves modified in the current transaction (before commit 732d591a5d6c12 ("btrfs: stop copying old dir items when logging a directory")). From an efficiency point of view this has a couple of drawbacks: 1) Adds extra latency, due to copying delayed items to the subvolume btree and deleting dir index items from the btree. Further if there are other tasks accessing the btree, which is common (syscalls like creat, mkdir, rename, link, unlink, truncate, reflinks, etc, finishing an ordered extent, etc), lock contention can cause further delays, both to the task logging a directory and to the other tasks accessing the btree; 2) More time spent overall flushing delayed items, if after logging the directory further changes are done to the directory in the same transaction. For example, if we add 10 dentries to a directory, fsync it, add more 10 dentries, fsync it again, then add more 10 dentries and fsync it again, then we end up inserting 3 batches of 10 items to the subvolume btree. With the changes from this patch, we flush all the delayed items to the btree only once - a single batch of 30 items, and outside the logging code (transaction commit or when delayed items are flushed asynchronously). This change simply skips the flushing of delayed items everytime we log a directory. Instead we copy the delayed insertion items directly to the log tree and delete delayed deletion items directly from the log tree. Therefore avoiding changing first the subvolume btree and then scanning it for new items to copy from it to the log tree and detecting deletions by observing gaps in consecutive dir index keys in subvolume btree leaves. Running the following tests on a non-debug kernel (Debian's default kernel config), on a box with a nvme device, a 12 cores Intel cpu and 64G of ram, produced the results below. The results compare a branch without this patch and all the other patches it depends on versus the same branch with the patchset applied. The patchset is comprised of the following patches: btrfs: don't drop dir index range items when logging a directory btrfs: remove the root argument from log_new_dir_dentries() btrfs: update stale comment for log_new_dir_dentries() btrfs: free list element sooner at log_new_dir_dentries() btrfs: avoid memory allocation at log_new_dir_dentries() for common case btrfs: remove root argument from btrfs_delayed_item_reserve_metadata() btrfs: store index number instead of key in struct btrfs_delayed_item btrfs: remove unused logic when looking up delayed items btrfs: shrink the size of struct btrfs_delayed_item btrfs: search for last logged dir index if it's not cached in the inode btrfs: move need_log_inode() to above log_conflicting_inodes() btrfs: move log_new_dir_dentries() above btrfs_log_inode() btrfs: log conflicting inodes without holding log mutex of the initial inode btrfs: skip logging parent dir when conflicting inode is not a dir btrfs: use delayed items when logging a directory Custom test script for testing time spent at btrfs_log_inode(): #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 # Total number of files to create in the test directory. NUM_FILES=10000 # Fsync after creating or renaming N files. FSYNC_AFTER=100 umount $DEV &> /dev/null mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT TEST_DIR=$MNT/testdir mkdir $TEST_DIR echo "Creating files..." for ((i = 1; i <= $NUM_FILES; i++)); do echo -n > $TEST_DIR/file_$i if (( ($i % $FSYNC_AFTER) == 0 )); then xfs_io -c "fsync" $TEST_DIR fi done sync echo "Renaming files..." for ((i = 1; i <= $NUM_FILES; i++)); do mv $TEST_DIR/file_$i $TEST_DIR/file_$i.renamed if (( ($i % $FSYNC_AFTER) == 0 )); then xfs_io -c "fsync" $TEST_DIR fi done umount $MNT And using the following bpftrace script to capture the total time that is spent at btrfs_log_inode(): #!/usr/bin/bpftrace k:btrfs_log_inode { @start_log_inode[tid] = nsecs; } kr:btrfs_log_inode /@start_log_inode[tid]/ { $dur = (nsecs - @start_log_inode[tid]) / 1000; @btrfs_log_inode_total_time = sum($dur); delete(@start_log_inode[tid]); } END { clear(@start_log_inode); } Result before applying patchset: @btrfs_log_inode_total_time: 622642 Result after applying patchset: @btrfs_log_inode_total_time: 354134 (-43.1% time spent) The following dbench script was also used for testing: #!/bin/bash NUM_JOBS=$(nproc --all) DEV=/dev/nvme0n1 MNT=/mnt/nvme0n1 MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-O no-holes -R free-space-tree" echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT --skip-cleanup -t 120 -S $NUM_JOBS umount $MNT Before patchset: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 3322265 0.034 21.032 Close 2440562 0.002 0.994 Rename 140664 1.150 269.633 Unlink 670796 1.093 269.678 Deltree 96 5.481 15.510 Mkdir 48 0.004 0.052 Qpathinfo 3010924 0.014 8.127 Qfileinfo 528055 0.001 0.518 Qfsinfo 552113 0.003 0.372 Sfileinfo 270575 0.005 0.688 Find 1164176 0.052 13.931 WriteX 1658537 0.019 5.918 ReadX 5207412 0.003 1.034 LockX 10818 0.003 0.079 UnlockX 10818 0.002 0.313 Flush 232811 1.027 269.735 Throughput 869.867 MB/sec (sync dirs) 12 clients 12 procs max_latency=269.741 ms After patchset: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 4152738 0.029 20.863 Close 3050770 0.002 1.119 Rename 175829 0.871 211.741 Unlink 838447 0.845 211.724 Deltree 120 4.798 14.162 Mkdir 60 0.003 0.005 Qpathinfo 3763807 0.011 4.673 Qfileinfo 660111 0.001 0.400 Qfsinfo 690141 0.003 0.429 Sfileinfo 338260 0.005 0.725 Find 1455273 0.046 6.787 WriteX 2073307 0.017 5.690 ReadX 6509193 0.003 1.171 LockX 13522 0.003 0.077 UnlockX 13522 0.002 0.125 Flush 291044 0.811 211.631 Throughput 1089.27 MB/sec (sync dirs) 12 clients 12 procs max_latency=211.750 ms (+25.2% throughput, -21.5% max latency) Signed-off-by: Filipe Manana --- fs/btrfs/delayed-inode.c | 112 ++++++++++ fs/btrfs/delayed-inode.h | 19 ++ fs/btrfs/tree-log.c | 437 ++++++++++++++++++++++++++++++++++++++- fs/btrfs/tree-log.h | 2 + 4 files changed, 562 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c index b35eddcac846..cac5169eaf8d 100644 --- a/fs/btrfs/delayed-inode.c +++ b/fs/btrfs/delayed-inode.c @@ -315,6 +315,8 @@ static struct btrfs_delayed_item *btrfs_alloc_delayed_item(u16 data_len, item->bytes_reserved = 0; item->delayed_node = node; RB_CLEAR_NODE(&item->rb_node); + INIT_LIST_HEAD(&item->log_list); + item->logged = false; refcount_set(&item->refs, 1); } return item; @@ -2045,3 +2047,113 @@ void btrfs_destroy_delayed_inodes(struct btrfs_fs_info *fs_info) } } +void btrfs_log_get_delayed_items(struct btrfs_inode *inode, + struct list_head *ins_list, + struct list_head *del_list) +{ + struct btrfs_delayed_node *node; + struct btrfs_delayed_item *item; + + node = btrfs_get_delayed_node(inode); + if (!node) + return; + + mutex_lock(&node->mutex); + item = __btrfs_first_delayed_insertion_item(node); + while (item) { + /* + * It's possible that the item is already in a log list. This + * can happen in case two tasks are trying to log the same + * directory. For example if we have tasks A and task B: + * + * Task A collected the delayed items into a log list while + * under the inode's log_mutex (at btrfs_log_inode()), but it + * only releases the items after logging the inodes they point + * to (if they are new inodes), which happens after unlocking + * the log mutex; + * + * Task B enters btrfs_log_inode() and acquires the log_mutex + * of the same directory inode, before task B releases the + * delayed items. This can happen for example when logging some + * inode we need to trigger logging of its parent directory, so + * logging two files that have the same parent directory can + * lead to this. + * + * If this happens, just ignore delayed items already in a log + * list. All the tasks logging the directory are under a log + * transaction and whichever finishes first can not sync the log + * before the other completes and leaves the log transaction. + */ + if (!item->logged && list_empty(&item->log_list)) { + refcount_inc(&item->refs); + list_add_tail(&item->log_list, ins_list); + } + item = __btrfs_next_delayed_item(item); + } + + item = __btrfs_first_delayed_deletion_item(node); + while (item) { + /* It may be non-empty, for the same reason mentioned above. */ + if (!item->logged && list_empty(&item->log_list)) { + refcount_inc(&item->refs); + list_add_tail(&item->log_list, del_list); + } + item = __btrfs_next_delayed_item(item); + } + mutex_unlock(&node->mutex); + + /* + * We are called during inode logging, which means the inode is in use + * and can not be evicted before we finish logging the inode. So we never + * have the last reference on the delayed inode. + * Also, we don't use btrfs_release_delayed_node() because that would + * requeue the delayed inode (change its order in the list of prepared + * nodes) and we don't want to do such change because we don't create or + * delete delayed items. + */ + ASSERT(refcount_read(&node->refs) > 1); + refcount_dec(&node->refs); +} + +void btrfs_log_put_delayed_items(struct btrfs_inode *inode, + struct list_head *ins_list, + struct list_head *del_list) +{ + struct btrfs_delayed_node *node; + struct btrfs_delayed_item *item; + struct btrfs_delayed_item *next; + + node = btrfs_get_delayed_node(inode); + if (!node) + return; + + mutex_lock(&node->mutex); + + list_for_each_entry_safe(item, next, ins_list, log_list) { + item->logged = true; + list_del_init(&item->log_list); + if (refcount_dec_and_test(&item->refs)) + kfree(item); + } + + list_for_each_entry_safe(item, next, del_list, log_list) { + item->logged = true; + list_del_init(&item->log_list); + if (refcount_dec_and_test(&item->refs)) + kfree(item); + } + + mutex_unlock(&node->mutex); + + /* + * We are called during inode logging, which means the inode is in use + * and can not be evicted before we finish logging the inode. So we never + * have the last reference on the delayed inode. + * Also, we don't use btrfs_release_delayed_node() because that would + * requeue the delayed inode (change its order in the list of prepared + * nodes) and we don't want to do such change because we don't create or + * delete delayed items. + */ + ASSERT(refcount_read(&node->refs) > 1); + refcount_dec(&node->refs); +} diff --git a/fs/btrfs/delayed-inode.h b/fs/btrfs/delayed-inode.h index 729d352ca8a1..e0699e360f49 100644 --- a/fs/btrfs/delayed-inode.h +++ b/fs/btrfs/delayed-inode.h @@ -78,10 +78,21 @@ struct btrfs_delayed_item { u64 index; struct list_head tree_list; /* used for batch insert/delete items */ struct list_head readdir_list; /* used for readdir items */ + /* + * Used when logging a directory. + * Insertions and deletions to this list are protected by the parent + * delayed node's mutex. + */ + struct list_head log_list; u64 bytes_reserved; struct btrfs_delayed_node *delayed_node; refcount_t refs; enum btrfs_delayed_item_type type:1; + /* + * Track if this delayed item was already logged. + * Protected by the mutex of the parent delayed inode. + */ + bool logged; /* The maximum leaf size is 64K, so u16 is more than enough. */ u16 data_len; char data[]; @@ -147,6 +158,14 @@ int btrfs_should_delete_dir_index(struct list_head *del_list, int btrfs_readdir_delayed_dir_index(struct dir_context *ctx, struct list_head *ins_list); +/* Used during directory logging. */ +void btrfs_log_get_delayed_items(struct btrfs_inode *inode, + struct list_head *ins_list, + struct list_head *del_list); +void btrfs_log_put_delayed_items(struct btrfs_inode *inode, + struct list_head *ins_list, + struct list_head *del_list); + /* for init */ int __init btrfs_delayed_inode_init(void); void __cold btrfs_delayed_inode_exit(void); diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 1d4c285bb26c..098d6cf01475 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5918,6 +5918,371 @@ static int copy_inode_items_to_log(struct btrfs_trans_handle *trans, return ret; } +static int insert_delayed_items_batch(struct btrfs_trans_handle *trans, + struct btrfs_root *log, + struct btrfs_path *path, + const struct btrfs_item_batch *batch, + const struct btrfs_delayed_item *first_item) +{ + const struct btrfs_delayed_item *curr = first_item; + int ret; + + ret = btrfs_insert_empty_items(trans, log, path, batch); + if (ret) + return ret; + + for (int i = 0; i < batch->nr; i++) { + char *data_ptr; + + data_ptr = btrfs_item_ptr(path->nodes[0], path->slots[0], char); + write_extent_buffer(path->nodes[0], &curr->data, + (unsigned long)data_ptr, curr->data_len); + curr = list_next_entry(curr, log_list); + path->slots[0]++; + } + + btrfs_release_path(path); + + return 0; +} + +static int log_delayed_insertion_items(struct btrfs_trans_handle *trans, + struct btrfs_inode *inode, + struct btrfs_path *path, + const struct list_head *delayed_ins_list, + struct btrfs_log_ctx *ctx) +{ + /* 195 (4095 bytes of keys and sizes) fits in a single 4K page. */ + const int max_batch_size = 195; + const int leaf_data_size = BTRFS_LEAF_DATA_SIZE(trans->fs_info); + const u64 ino = btrfs_ino(inode); + struct btrfs_root *log = inode->root->log_root; + struct btrfs_item_batch batch = { + .nr = 0, + .total_data_size = 0, + }; + const struct btrfs_delayed_item *first = NULL; + const struct btrfs_delayed_item *curr; + char *ins_data; + struct btrfs_key *ins_keys; + u32 *ins_sizes; + u64 curr_batch_size = 0; + int batch_idx = 0; + int ret; + + /* We are adding dir index items to the log tree. */ + lockdep_assert_held(&inode->log_mutex); + + /* + * We collect delayed items before copying index keys from the subvolume + * to the log tree. However just after we collected them, they may have + * been flushed (all of them or just some of them), and therefore we + * could have copied them from the subvolume tree to the log tree. + * So find the first delayed item that was not yet logged (they are + * sorted by index number). + */ + list_for_each_entry(curr, delayed_ins_list, log_list) { + if (curr->index > inode->last_dir_index_offset) { + first = curr; + break; + } + } + + /* Empty list or all delayed items were already logged. */ + if (!first) + return 0; + + ins_data = kmalloc(max_batch_size * sizeof(u32) + + max_batch_size * sizeof(struct btrfs_key), GFP_NOFS); + if (!ins_data) + return -ENOMEM; + ins_sizes = (u32 *)ins_data; + batch.data_sizes = ins_sizes; + ins_keys = (struct btrfs_key *)(ins_data + max_batch_size * sizeof(u32)); + batch.keys = ins_keys; + + curr = first; + while (!list_entry_is_head(curr, delayed_ins_list, log_list)) { + const u32 curr_size = curr->data_len + sizeof(struct btrfs_item); + + if (curr_batch_size + curr_size > leaf_data_size || + batch.nr == max_batch_size) { + ret = insert_delayed_items_batch(trans, log, path, + &batch, first); + if (ret) + goto out; + batch_idx = 0; + batch.nr = 0; + batch.total_data_size = 0; + curr_batch_size = 0; + first = curr; + } + + ins_sizes[batch_idx] = curr->data_len; + ins_keys[batch_idx].objectid = ino; + ins_keys[batch_idx].type = BTRFS_DIR_INDEX_KEY; + ins_keys[batch_idx].offset = curr->index; + curr_batch_size += curr_size; + batch.total_data_size += curr->data_len; + batch.nr++; + batch_idx++; + curr = list_next_entry(curr, log_list); + } + + ASSERT(batch.nr >= 1); + ret = insert_delayed_items_batch(trans, log, path, &batch, first); + + curr = list_last_entry(delayed_ins_list, struct btrfs_delayed_item, + log_list); + inode->last_dir_index_offset = curr->index; +out: + kfree(ins_data); + + return ret; +} + +static int log_delayed_deletions_full(struct btrfs_trans_handle *trans, + struct btrfs_inode *inode, + struct btrfs_path *path, + const struct list_head *delayed_del_list, + struct btrfs_log_ctx *ctx) +{ + const u64 ino = btrfs_ino(inode); + const struct btrfs_delayed_item *curr; + + curr = list_first_entry(delayed_del_list, struct btrfs_delayed_item, + log_list); + + while (!list_entry_is_head(curr, delayed_del_list, log_list)) { + u64 first_dir_index = curr->index; + u64 last_dir_index; + const struct btrfs_delayed_item *next; + int ret; + + /* + * Find a range of consecutive dir index items to delete. Like + * this we log a single dir range item spanning several contiguous + * dir items instead of logging one range item per dir index item. + */ + next = list_next_entry(curr, log_list); + while (!list_entry_is_head(next, delayed_del_list, log_list)) { + if (next->index != curr->index + 1) + break; + curr = next; + next = list_next_entry(next, log_list); + } + + last_dir_index = curr->index; + ASSERT(last_dir_index >= first_dir_index); + + ret = insert_dir_log_key(trans, inode->root->log_root, path, + ino, first_dir_index, last_dir_index); + if (ret) + return ret; + curr = list_next_entry(curr, log_list); + } + + return 0; +} + +static int batch_delete_dir_index_items(struct btrfs_trans_handle *trans, + struct btrfs_inode *inode, + struct btrfs_path *path, + struct btrfs_log_ctx *ctx, + const struct list_head *delayed_del_list, + const struct btrfs_delayed_item *first, + const struct btrfs_delayed_item **last_ret) +{ + const struct btrfs_delayed_item *next; + struct extent_buffer *leaf = path->nodes[0]; + const int last_slot = btrfs_header_nritems(leaf) - 1; + int slot = path->slots[0] + 1; + const u64 ino = btrfs_ino(inode); + + next = list_next_entry(first, log_list); + + while (slot < last_slot && + !list_entry_is_head(next, delayed_del_list, log_list)) { + struct btrfs_key key; + + btrfs_item_key_to_cpu(leaf, &key, slot); + if (key.objectid != ino || + key.type != BTRFS_DIR_INDEX_KEY || + key.offset != next->index) + break; + + slot++; + *last_ret = next; + next = list_next_entry(next, log_list); + } + + return btrfs_del_items(trans, inode->root->log_root, path, + path->slots[0], slot - path->slots[0]); +} + +static int log_delayed_deletions_incremental(struct btrfs_trans_handle *trans, + struct btrfs_inode *inode, + struct btrfs_path *path, + const struct list_head *delayed_del_list, + struct btrfs_log_ctx *ctx) +{ + struct btrfs_root *log = inode->root->log_root; + const struct btrfs_delayed_item *curr; + u64 last_range_start; + u64 last_range_end = 0; + struct btrfs_key key; + + key.objectid = btrfs_ino(inode); + key.type = BTRFS_DIR_INDEX_KEY; + curr = list_first_entry(delayed_del_list, struct btrfs_delayed_item, + log_list); + + while (!list_entry_is_head(curr, delayed_del_list, log_list)) { + const struct btrfs_delayed_item *last = curr; + u64 first_dir_index = curr->index; + u64 last_dir_index; + bool deleted_items = false; + int ret; + + key.offset = curr->index; + ret = btrfs_search_slot(trans, log, &key, path, -1, 1); + if (ret < 0) { + return ret; + } else if (ret == 0) { + ret = batch_delete_dir_index_items(trans, inode, path, ctx, + delayed_del_list, curr, + &last); + if (ret) + return ret; + deleted_items = true; + } + + btrfs_release_path(path); + + /* + * If we deleted items from the leaf, it means we have a range + * item logging their range, so no need to add one or update an + * existing one. Otherwise we have to log a dir range item. + */ + if (deleted_items) + goto next_batch; + + last_dir_index = last->index; + ASSERT(last_dir_index >= first_dir_index); + /* + * If this range starts right after where the previous one ends, + * then we want to reuse the previous range item and change its + * end offset to the end of this range. This is just to minimize + * leaf space usage, by avoiding adding a new range item. + */ + if (last_range_end != 0 && first_dir_index == last_range_end + 1) + first_dir_index = last_range_start; + + ret = insert_dir_log_key(trans, log, path, key.objectid, + first_dir_index, last_dir_index); + if (ret) + return ret; + + last_range_start = first_dir_index; + last_range_end = last_dir_index; +next_batch: + curr = list_next_entry(last, log_list); + } + + return 0; +} + +static int log_delayed_deletion_items(struct btrfs_trans_handle *trans, + struct btrfs_inode *inode, + struct btrfs_path *path, + const struct list_head *delayed_del_list, + struct btrfs_log_ctx *ctx) +{ + /* + * We are deleting dir index items from the log tree or adding range + * items to it. + */ + lockdep_assert_held(&inode->log_mutex); + + if (list_empty(delayed_del_list)) + return 0; + + if (ctx->logged_before) + return log_delayed_deletions_incremental(trans, inode, path, + delayed_del_list, ctx); + + return log_delayed_deletions_full(trans, inode, path, delayed_del_list, + ctx); +} + +/* + * Similar logic as for log_new_dir_dentries(), but it iterates over the delayed + * items instead of the subvolume tree. + */ +static int log_new_delayed_dentries(struct btrfs_trans_handle *trans, + struct btrfs_inode *inode, + const struct list_head *delayed_ins_list, + struct btrfs_log_ctx *ctx) +{ + const bool orig_log_new_dentries = ctx->log_new_dentries; + struct btrfs_fs_info *fs_info = trans->fs_info; + struct btrfs_delayed_item *item; + int ret = 0; + + /* + * No need for the log mutex, plus to avoid potential deadlocks or + * lockdep annotations due to nesting of delayed inode mutexes and log + * mutexes. + */ + lockdep_assert_not_held(&inode->log_mutex); + + ASSERT(!ctx->logging_new_delayed_dentries); + ctx->logging_new_delayed_dentries = true; + + list_for_each_entry(item, delayed_ins_list, log_list) { + struct btrfs_dir_item *dir_item; + struct inode *di_inode; + struct btrfs_key key; + int log_mode = LOG_INODE_EXISTS; + + dir_item = (struct btrfs_dir_item *)item->data; + btrfs_disk_key_to_cpu(&key, &dir_item->location); + + if (key.type == BTRFS_ROOT_ITEM_KEY) + continue; + + di_inode = btrfs_iget(fs_info->sb, key.objectid, inode->root); + if (IS_ERR(di_inode)) { + ret = PTR_ERR(di_inode); + break; + } + + if (!need_log_inode(trans, BTRFS_I(di_inode))) { + btrfs_add_delayed_iput(di_inode); + continue; + } + + if (btrfs_stack_dir_type(dir_item) == BTRFS_FT_DIR) + log_mode = LOG_INODE_ALL; + + ctx->log_new_dentries = false; + ret = btrfs_log_inode(trans, BTRFS_I(di_inode), log_mode, ctx); + + if (!ret && ctx->log_new_dentries) + ret = log_new_dir_dentries(trans, BTRFS_I(di_inode), ctx); + + btrfs_add_delayed_iput(di_inode); + + if (ret) + break; + } + + ctx->log_new_dentries = orig_log_new_dentries; + ctx->logging_new_delayed_dentries = false; + + return ret; +} + /* log a single inode in the tree log. * At least one parent directory for this inode must exist in the tree * or be logged already. @@ -5950,6 +6315,9 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, bool need_log_inode_item = true; bool xattrs_logged = false; bool inode_item_dropped = true; + bool full_dir_logging = false; + LIST_HEAD(delayed_ins_list); + LIST_HEAD(delayed_del_list); path = btrfs_alloc_path(); if (!path) @@ -5977,12 +6345,40 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, max_key.type = (u8)-1; max_key.offset = (u64)-1; + if (S_ISDIR(inode->vfs_inode.i_mode) && inode_only == LOG_INODE_ALL) + full_dir_logging = true; + /* - * Only run delayed items if we are a directory. We want to make sure - * all directory indexes hit the fs/subvolume tree so we can find them - * and figure out which index ranges have to be logged. + * If we are logging a directory while we are logging dentries of the + * delayed items of some other inode, then we need to flush the delayed + * items of this directory and not log the delayed items directly. This + * is to prevent more than one level of recursion into btrfs_log_inode() + * by having something like this: + * + * $ mkdir -p a/b/c/d/e/f/g/h/... + * $ xfs_io -c "fsync" a + * + * Where all directories in the path did not exist before and are + * created in the current transaction. + * So in such a case we directly log the delayed items of the main + * directory ("a") without flushing them first, while for each of its + * subdirectories we flush their delayed items before logging them. + * This prevents a potential unbounded recursion like this: + * + * btrfs_log_inode() + * log_new_delayed_dentries() + * btrfs_log_inode() + * log_new_delayed_dentries() + * btrfs_log_inode() + * log_new_delayed_dentries() + * (...) + * + * We have thresholds for the maximum number of delayed items to have in + * memory, and once they are hit, the items are flushed asynchronously. + * However the limit is quite high, so lets prevent deep levels of + * recursion to happen by limitting the maximum depth to be 1. */ - if (S_ISDIR(inode->vfs_inode.i_mode)) { + if (full_dir_logging && ctx->logging_new_delayed_dentries) { ret = btrfs_commit_inode_delayed_items(trans, inode); if (ret) goto out; @@ -6020,9 +6416,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, * to known the file was moved from A to B, so logging just A would * result in losing the file after a log replay. */ - if (S_ISDIR(inode->vfs_inode.i_mode) && - inode_only == LOG_INODE_ALL && - inode->last_unlink_trans >= trans->transid) { + if (full_dir_logging && inode->last_unlink_trans >= trans->transid) { btrfs_set_log_full_commit(trans); ret = BTRFS_LOG_FORCE_COMMIT; goto out_unlock; @@ -6092,6 +6486,16 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, if (ret) goto out_unlock; + /* + * If we are logging a directory in full mode, collect the delayed items + * before iterating the subvolume tree, so that we don't miss any new + * dir index items in case they get flushed while or right after we are + * iterating the subvolume tree. + */ + if (full_dir_logging && !ctx->logging_new_delayed_dentries) + btrfs_log_get_delayed_items(inode, &delayed_ins_list, + &delayed_del_list); + ret = copy_inode_items_to_log(trans, inode, &min_key, &max_key, path, dst_path, logged_isize, inode_only, ctx, @@ -6147,10 +6551,18 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, write_unlock(&em_tree->lock); } - if (inode_only == LOG_INODE_ALL && S_ISDIR(inode->vfs_inode.i_mode)) { + if (full_dir_logging) { ret = log_directory_changes(trans, inode, path, dst_path, ctx); if (ret) goto out_unlock; + ret = log_delayed_insertion_items(trans, inode, path, + &delayed_ins_list, ctx); + if (ret) + goto out_unlock; + ret = log_delayed_deletion_items(trans, inode, path, + &delayed_del_list, ctx); + if (ret) + goto out_unlock; } spin_lock(&inode->lock); @@ -6208,6 +6620,15 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, else ret = log_conflicting_inodes(trans, inode->root, ctx); + if (full_dir_logging && !ctx->logging_new_delayed_dentries) { + if (!ret) + ret = log_new_delayed_dentries(trans, inode, + &delayed_ins_list, ctx); + + btrfs_log_put_delayed_items(inode, &delayed_ins_list, + &delayed_del_list); + } + return ret; } diff --git a/fs/btrfs/tree-log.h b/fs/btrfs/tree-log.h index 4e34fe4b7762..aed1e05e9879 100644 --- a/fs/btrfs/tree-log.h +++ b/fs/btrfs/tree-log.h @@ -20,6 +20,7 @@ struct btrfs_log_ctx { int log_transid; bool log_new_dentries; bool logging_new_name; + bool logging_new_delayed_dentries; /* Indicate if the inode being logged was logged before. */ bool logged_before; /* Tracks the last logged dir item/index key offset. */ @@ -40,6 +41,7 @@ static inline void btrfs_init_log_ctx(struct btrfs_log_ctx *ctx, ctx->log_transid = 0; ctx->log_new_dentries = false; ctx->logging_new_name = false; + ctx->logging_new_delayed_dentries = false; ctx->logged_before = false; ctx->inode = inode; INIT_LIST_HEAD(&ctx->list);