From patchwork Wed Nov 25 12:19:23 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 11931281 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 819F8C63777 for ; Wed, 25 Nov 2020 12:19:50 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 266E3206F9 for ; Wed, 25 Nov 2020 12:19:50 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=kernel.org header.i=@kernel.org header.b="i/jSmyqu" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729047AbgKYMTe (ORCPT ); Wed, 25 Nov 2020 07:19:34 -0500 Received: from mail.kernel.org ([198.145.29.99]:44940 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725616AbgKYMTe (ORCPT ); Wed, 25 Nov 2020 07:19:34 -0500 Received: from localhost.localdomain (bl8-197-74.dsl.telepac.pt [85.241.197.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 598D5206F7 for ; Wed, 25 Nov 2020 12:19:32 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1606306773; bh=N7CHilKL/jIBO8LegweHnlPvy3fbA6YSmsH1TbI4h3s=; h=From:To:Subject:Date:In-Reply-To:References:In-Reply-To: References:From; b=i/jSmyquSx3RVflZCFE0DvDCrS6OQs9yR6CRjQFvT9aylLX3Iv+eB/akIsfd/sdL8 n09cnWDbsxTpOKetsSAzIklGN88ouwsJGOzMZkbbcWBx7hQqPN8OTYhCCs3OBbJ7W7 sQZwR2lNdnqqpKnokd8EPqR2vZHNLn8iEag+NmFY= From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 1/6] btrfs: fix race causing unnecessary inode logging during link and rename Date: Wed, 25 Nov 2020 12:19:23 +0000 Message-Id: <1c2b2e9144cd96308d66369175993bdf8419b5a0.1606305501.git.fdmanana@suse.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When we are doing a rename or a link operation for an inode that was logged in the previous transaction and that transaction is still committing, we have a time window where we incorrectly consider that the inode was logged previously in the current transaction and therefore decide to log it to update it in the log. The following steps give an example on how this happens during a link operation: 1) Inode X is logged in transaction 1000, so its logged_trans field is set to 1000; 2) Task A starts to commit transaction 1000; 3) The state of transaction 1000 is changed to TRANS_STATE_UNBLOCKED; 4) Task B starts a link operation for inode X, and as a consequence it starts transaction 1001; 5) Task A is still committing transaction 1000, therefore the value stored at fs_info->last_trans_committed is still 999; 6) Task B calls btrfs_log_new_name(), it reads a value of 999 from fs_info->last_trans_committed and because the logged_trans field of inode X has a value of 1000, the function does not return immediately, instead it proceeds to logging the inode, which should not happen because the inode was logged in the previous transaction (1000) and not in the current one (1001). This is not a functional problem, just wasted time and space logging an inode that does not need to be logged, contributing to higher latency for link and rename operations. So fix this by comparing the inodes' logged_trans field with the generation of the current transaction instead of comparing with the value stored in fs_info->last_trans_committed. This case is often hit when running dbench for a long enough duration, as it does lots of rename operations. This patch belongs to a patch set that is comprised of the following patches: btrfs: fix race causing unnecessary inode logging during link and rename btrfs: fix race that results in logging old extents during a fast fsync btrfs: fix race that causes unnecessary logging of ancestor inodes btrfs: fix race that makes inode logging fallback to transaction commit btrfs: fix race leading to unnecessary transaction commit when logging inode btrfs: do not block inode logging for so long during transaction commit Performance results are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 80cf496226c2..b63f5c2b982a 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -6443,7 +6443,6 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct btrfs_inode *old_dir, struct dentry *parent) { - struct btrfs_fs_info *fs_info = trans->fs_info; struct btrfs_log_ctx ctx; /* @@ -6457,8 +6456,8 @@ void btrfs_log_new_name(struct btrfs_trans_handle *trans, * if this inode hasn't been logged and directory we're renaming it * from hasn't been logged, we don't need to log it */ - if (inode->logged_trans <= fs_info->last_trans_committed && - (!old_dir || old_dir->logged_trans <= fs_info->last_trans_committed)) + if (inode->logged_trans < trans->transid && + (!old_dir || old_dir->logged_trans < trans->transid)) return; btrfs_init_log_ctx(&ctx, &inode->vfs_inode); From patchwork Wed Nov 25 12:19:24 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 11931277 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9DAE2C6379D for ; Wed, 25 Nov 2020 12:19:50 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 62EEC206E5 for ; Wed, 25 Nov 2020 12:19:50 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=kernel.org header.i=@kernel.org header.b="tPiew0Oa" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729088AbgKYMTf (ORCPT ); Wed, 25 Nov 2020 07:19:35 -0500 Received: from mail.kernel.org ([198.145.29.99]:44970 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727349AbgKYMTf (ORCPT ); Wed, 25 Nov 2020 07:19:35 -0500 Received: from localhost.localdomain (bl8-197-74.dsl.telepac.pt [85.241.197.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 99A5E20715 for ; Wed, 25 Nov 2020 12:19:33 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1606306774; bh=KUV3qsOKfbZTb3T39rI1D4IJznnHDl3/r05Afj7Vwqs=; h=From:To:Subject:Date:In-Reply-To:References:In-Reply-To: References:From; b=tPiew0OaHaw1xwfR7/mvBGcOo61vae/rDih/3qUfYVVjklU7oN5gpoNiMagJhUyxj 4Y2iYPJnoQUEnqrGBCLmpzgAOPvYl93XZfGZarO39dgXb7jnOkpL5paf/yn67T+Bf4 pu/NYjhtGa2KFzAyS0YQblFrVojaqyCocyS7ii9g= From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 2/6] btrfs: fix race that results in logging old extents during a fast fsync Date: Wed, 25 Nov 2020 12:19:24 +0000 Message-Id: <9f9b13e0edb79adea30d7e92dccfea8daf9cac88.1606305501.git.fdmanana@suse.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When logging the extents of an inode during a fast fsync, we have a time window where we can log extents that are from the previous transaction and already persisted. This only makes us waste time unnecessarily. The following sequence of steps shows how this can happen: 1) We are at transaction 1000; 2) An ordered extent E from inode I completes, that is it has gone through btrfs_finish_ordered_io(), and it set the extent maps' generation to 1000 when we unpin the extent, which is the generation of the current transaction; 3) The commit for transaction 1000 starts by task A; 4) The task committing transaction 1000 sets the transaction state to unblocked, writes the dirty extent buffers and the super blocks, then unlocks tree_log_mutex; 5) Some change is made to inode I, resulting in creation of a new transaction with a generation of 1001; 6) The transaction 1000 commit starts unpinning extents. At this point fs_info->last_trans_committed still has a value of 999; 7) Task B starts an fsync on inode I, and when it gets to btrfs_log_changed_extents() sees the extent map for extent E in the list of modified extents. It sees the extent map has a generation of 1000 and fs_info->last_trans_committed has a value of 999, so it proceeds to logging the respective file extent item and all the checksums covering its range. So we end up wasting time since the extent was already persisted and is reachable through the trees pointed to by the super block committed by transaction 1000. So just fix this by comparing the extent maps generation against the generation of the transaction handle - if it is smaller then the id in the handle, we know the extent was already persisted and we do not need to log it. This patch belongs to a patch set that is comprised of the following patches: btrfs: fix race causing unnecessary inode logging during link and rename btrfs: fix race that results in logging old extents during a fast fsync btrfs: fix race that causes unnecessary logging of ancestor inodes btrfs: fix race that makes inode logging fallback to transaction commit btrfs: fix race leading to unnecessary transaction commit when logging inode btrfs: do not block inode logging for so long during transaction commit Performance results are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index b63f5c2b982a..596d72d239e9 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -4416,14 +4416,12 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans, struct extent_map *em, *n; struct list_head extents; struct extent_map_tree *tree = &inode->extent_tree; - u64 test_gen; int ret = 0; int num = 0; INIT_LIST_HEAD(&extents); write_lock(&tree->lock); - test_gen = root->fs_info->last_trans_committed; list_for_each_entry_safe(em, n, &tree->modified_extents, list) { list_del_init(&em->list); @@ -4439,7 +4437,7 @@ static int btrfs_log_changed_extents(struct btrfs_trans_handle *trans, goto process; } - if (em->generation <= test_gen) + if (em->generation < trans->transid) continue; /* We log prealloc extents beyond eof later. */ From patchwork Wed Nov 25 12:19:25 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 11931279 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 057D8C64E75 for ; Wed, 25 Nov 2020 12:19:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 97E3B206F7 for ; Wed, 25 Nov 2020 12:19:50 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=kernel.org header.i=@kernel.org header.b="KOYdCCM7" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729101AbgKYMTg (ORCPT ); Wed, 25 Nov 2020 07:19:36 -0500 Received: from mail.kernel.org ([198.145.29.99]:44972 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725616AbgKYMTg (ORCPT ); Wed, 25 Nov 2020 07:19:36 -0500 Received: from localhost.localdomain (bl8-197-74.dsl.telepac.pt [85.241.197.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 7D5D7206F7 for ; Wed, 25 Nov 2020 12:19:34 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1606306775; bh=rHOEdBiLWHRDt/0k09QjMJDLFW0fVyqbVxzAX5qlY44=; h=From:To:Subject:Date:In-Reply-To:References:In-Reply-To: References:From; b=KOYdCCM7I7K7Mrhbpiblq4zn/aC/tDuDFyGU/IqLEHv8BVbZAIGC+zrifhFLR+BLM 3GKb6YMT9dGr0ZK4t71jaVYdm9De4Z0gkFD+2u1tvgSaQ0k2hfWtxEHGAGtwSFNws8 ufp3Djamv8xhQB48UV2K1/IPBFUUxok2knQx1OhM= From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 3/6] btrfs: fix race that causes unnecessary logging of ancestor inodes Date: Wed, 25 Nov 2020 12:19:25 +0000 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When logging an inode and we are checking if we need to log ancestors that are new, if the previous transaction is still committing we have a time window where we can unncessarily log ancestor inodes that were created in the previous transaction. The race is described by the following steps: 1) We are at transaction 1000; 2) Directory inode X is created, its generation is set to 1000; 3) The commit for transaction 1000 is started by task A; 4) The task committing transaction 1000 sets the transaction state to unblocked, writes the dirty extent buffers and the super blocks, then unlocks tree_log_mutex; 5) Inode Y, a regular file, is created under directory inode X, this results in starting a new transaction with a generation of 1001; 6) The transaction 1000 commit is unpinning extents. At this point fs_info->last_trans_committed still has a value of 999; 7) Task B calls fsync on inode Y and gets a handle for transaction 1001; 8) Task B ends up at log_all_new_ancestors() and then because inode Y has only one hard link, ends up at log_new_ancestors_fast(). There it reads a value of 999 from fs_info->last_trans_committed, and sees that the parent inode X has a generation of 1000, so we end up logging inode X: if (inode->generation > fs_info->last_trans_committed) { ret = btrfs_log_inode(trans, root, inode, LOG_INODE_EXISTS, ctx); (...) which is not necessary since it was created in the past transaction, with a generation of 1000, and that transaction has already committed its super blocks - it's still unpinning extents so it has not yet updated fs_info->last_trans_committed from 999 to 1000. So this just causes us to spend more time logging and allocating and writing more tree blocks for the log tree. So fix this by comparing an inode's generation with the generation of the transaction our transaction handle refers to - if the inode's generation matches the generation of the current transaction than we know it is a new inode we need to log, otherwise don't log it. This case is often hit when running dbench for a long enough duration. This patch belongs to a patch set that is comprised of the following patches: btrfs: fix race causing unnecessary inode logging during link and rename btrfs: fix race that results in logging old extents during a fast fsync btrfs: fix race that causes unnecessary logging of ancestor inodes btrfs: fix race that makes inode logging fallback to transaction commit btrfs: fix race leading to unnecessary transaction commit when logging inode btrfs: do not block inode logging for so long during transaction commit Performance results are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 596d72d239e9..6313c50f5e33 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5835,7 +5835,6 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans, while (true) { struct btrfs_fs_info *fs_info = root->fs_info; - const u64 last_committed = fs_info->last_trans_committed; struct extent_buffer *leaf = path->nodes[0]; int slot = path->slots[0]; struct btrfs_key search_key; @@ -5854,7 +5853,7 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans, if (IS_ERR(inode)) return PTR_ERR(inode); - if (BTRFS_I(inode)->generation > last_committed) + if (BTRFS_I(inode)->generation >= trans->transid) ret = btrfs_log_inode(trans, root, BTRFS_I(inode), LOG_INODE_EXISTS, ctx); btrfs_add_delayed_iput(inode); @@ -5895,7 +5894,6 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans, struct btrfs_log_ctx *ctx) { struct btrfs_root *root = inode->root; - struct btrfs_fs_info *fs_info = root->fs_info; struct dentry *old_parent = NULL; struct super_block *sb = inode->vfs_inode.i_sb; int ret = 0; @@ -5909,7 +5907,7 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans, if (root != inode->root) break; - if (inode->generation > fs_info->last_trans_committed) { + if (inode->generation >= trans->transid) { ret = btrfs_log_inode(trans, root, inode, LOG_INODE_EXISTS, ctx); if (ret) From patchwork Wed Nov 25 12:19:26 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 11931287 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 78B1BC64E7B for ; Wed, 25 Nov 2020 12:19:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 23BFB206F7 for ; Wed, 25 Nov 2020 12:19:51 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=kernel.org header.i=@kernel.org header.b="VLipUE0o" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729111AbgKYMTh (ORCPT ); Wed, 25 Nov 2020 07:19:37 -0500 Received: from mail.kernel.org ([198.145.29.99]:44974 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727349AbgKYMTh (ORCPT ); Wed, 25 Nov 2020 07:19:37 -0500 Received: from localhost.localdomain (bl8-197-74.dsl.telepac.pt [85.241.197.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 6210C206E5 for ; Wed, 25 Nov 2020 12:19:35 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1606306775; bh=7tX5kVbdtQWxW/gqvQpsaI7tBZlcMMawPcQJV9JQDj8=; h=From:To:Subject:Date:In-Reply-To:References:In-Reply-To: References:From; b=VLipUE0o2XfDXmhRFT07CiZvpJITGWbvh9uwspgmRypuxMiInKZVIOrEX251wwW5f RD4FIjOGcN3hsUmLxpHAPz0cPIM/+Q3cfInQbW0rVq5vgurW96hielNNCN6Yw9cFUm qcIWIBA9lZMGUk5HJRXzp1ra90d1RosfbRKdkqMM= From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 4/6] btrfs: fix race that makes inode logging fallback to transaction commit Date: Wed, 25 Nov 2020 12:19:26 +0000 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When logging an inode and the previous transaction is still committing, we have a time window where we can end up incorrectly think an inode has its last_unlink_trans field with a value greater than the last transaction committed, which results in the logging to fallback to a full transaction commit, which is usually much more expensive than doing a log commit. The race is described by the following steps: 1) We are at transaction 1000; 2) We modify an inode X (a directory) using transaction 1000 and set its last_unlink_trans field to 1000, because for example we removed one of its subdirectories; 3) We create a new inode Y with a dentry in inode X using transaction 1000, so its generation field is set to 1000; 4) The commit for transaction 1000 is started by task A; 5) The task committing transaction 1000 sets the transaction state to unblocked, writes the dirty extent buffers and the super blocks, then unlocks tree_log_mutex; 6) Some task starts a new transaction with a generation of 1001; 7) We do some modification to inode Y (using transaction 1001); 8) The transaction 1000 commit starts unpinning extents. At this point fs_info->last_trans_committed still has a value of 999; 9) Task B starts an fsync on inode Y, and gets a handle for transaction 1001. When it gets to check_parent_dirs_for_sync() it does the checking of the ancestor dentries because the following check does not evaluate to true: if (S_ISREG(inode->vfs_inode.i_mode) && inode->generation <= last_committed && inode->last_unlink_trans <= last_committed) goto out; The generation value for inode Y is 1000 and last_committed, which has the value read from fs_info->last_trans_committed, has a value of 999, so that check evaluates to false and we proceed to check the ancestor inodes. Once we get to the first ancestor, inode X, we call btrfs_must_commit_transaction() on it, which evaluates to true: static bool btrfs_must_commit_transaction(...) { struct btrfs_fs_info *fs_info = inode->root->fs_info; bool ret = false; mutex_lock(&inode->log_mutex); if (inode->last_unlink_trans > fs_info->last_trans_committed) { /* * Make sure any commits to the log are forced to be full * commits. */ btrfs_set_log_full_commit(trans); ret = true; } (...) because inode's X last_unlink_trans has a value of 1000 and fs_info->last_trans_committed still has a value of 999, it returns true to check_parent_dirs_for_sync(), making it return 1 which is propagated up to btrfs_sync_file(), causing it to fallback to a full transaction commit of transaction 1001. We should have not falled back to commit transaction 1001, since inode X had last_unlink_trans set to 1000 and the super blocks for transaction 1000 were already written. So while not resulting in a functional problem, it leads to a lot more work and higher latencies for a fsync since committing a transaction is usually more expensive than committing a log (if other filesystem changes happened under that transaction). Similar problem happens when logging directories, for the same reason as btrfs_must_commit_transaction() returns true on an inode with its last_unlink_trans having the generation of the previous transaction and that transaction is still committing, unpinning its freed extents. So fix this by comparing last_unlink_trans with the id of the current transaction instead of fs_info->last_trans_committed. This case is often hit when running dbench for a long enough duration, as it does lots of rename and rmdir operations (both update the field last_unkink_trans of an inode) and fsyncs of files and directories. This patch belongs to a patch set that is comprised of the following patches: btrfs: fix race causing unnecessary inode logging during link and rename btrfs: fix race that results in logging old extents during a fast fsync btrfs: fix race that causes unnecessary logging of ancestor inodes btrfs: fix race that makes inode logging fallback to transaction commit btrfs: fix race leading to unnecessary transaction commit when logging inode btrfs: do not block inode logging for so long during transaction commit Performance results are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 20 ++++++++------------ 1 file changed, 8 insertions(+), 12 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 6313c50f5e33..592a50d9697f 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5449,11 +5449,10 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans, struct btrfs_inode *inode) { - struct btrfs_fs_info *fs_info = inode->root->fs_info; bool ret = false; mutex_lock(&inode->log_mutex); - if (inode->last_unlink_trans > fs_info->last_trans_committed) { + if (inode->last_unlink_trans >= trans->transid) { /* * Make sure any commits to the log are forced to be full * commits. @@ -5475,8 +5474,7 @@ static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans, static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans, struct btrfs_inode *inode, struct dentry *parent, - struct super_block *sb, - u64 last_committed) + struct super_block *sb) { int ret = 0; struct dentry *old_parent = NULL; @@ -5488,8 +5486,8 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans, * and other fun in this file. */ if (S_ISREG(inode->vfs_inode.i_mode) && - inode->generation <= last_committed && - inode->last_unlink_trans <= last_committed) + inode->generation < trans->transid && + inode->last_unlink_trans < trans->transid) goto out; if (!S_ISDIR(inode->vfs_inode.i_mode)) { @@ -6024,7 +6022,6 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, struct btrfs_fs_info *fs_info = root->fs_info; struct super_block *sb; int ret = 0; - u64 last_committed = fs_info->last_trans_committed; bool log_dentries = false; sb = inode->vfs_inode.i_sb; @@ -6049,8 +6046,7 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, goto end_no_trans; } - ret = check_parent_dirs_for_sync(trans, inode, parent, sb, - last_committed); + ret = check_parent_dirs_for_sync(trans, inode, parent, sb); if (ret) goto end_no_trans; @@ -6080,8 +6076,8 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, * and other fun in this file. */ if (S_ISREG(inode->vfs_inode.i_mode) && - inode->generation <= last_committed && - inode->last_unlink_trans <= last_committed) { + inode->generation < trans->transid && + inode->last_unlink_trans < trans->transid) { ret = 0; goto end_trans; } @@ -6130,7 +6126,7 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, * but the file inode does not have a matching BTRFS_INODE_REF_KEY item * and has a link count of 2. */ - if (inode->last_unlink_trans > last_committed) { + if (inode->last_unlink_trans >= trans->transid) { ret = btrfs_log_all_parents(trans, inode, ctx); if (ret) goto end_trans; From patchwork Wed Nov 25 12:19:27 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 11931283 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 32CC1C64E7A for ; Wed, 25 Nov 2020 12:19:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id D89A9206E5 for ; Wed, 25 Nov 2020 12:19:50 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=kernel.org header.i=@kernel.org header.b="iu1l6Rkv" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729121AbgKYMTh (ORCPT ); Wed, 25 Nov 2020 07:19:37 -0500 Received: from mail.kernel.org ([198.145.29.99]:44972 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725616AbgKYMTh (ORCPT ); Wed, 25 Nov 2020 07:19:37 -0500 Received: from localhost.localdomain (bl8-197-74.dsl.telepac.pt [85.241.197.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 4329C206F9 for ; Wed, 25 Nov 2020 12:19:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1606306776; bh=9x+n3sq6dnlPWFSJOyBEQEIlVkJZTeP3Afd1Ru3ddu0=; h=From:To:Subject:Date:In-Reply-To:References:In-Reply-To: References:From; b=iu1l6RkvCcZkyfrnE2BDYfic8Hi6p37Ch918zcyoKcRnikm8VexK7dNNpLEzisys9 dQOYYZ+APZrewoD8Y4bmhmOqifmh9QPZG7kaIh0gnjpG/Gjbo8fT0hmeb6tMArkBIN licidtMxGIE9KtF/kwVa0DRfVnooElq6VBUygsEI= From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 5/6] btrfs: fix race leading to unnecessary transaction commit when logging inode Date: Wed, 25 Nov 2020 12:19:27 +0000 Message-Id: <854ce34eb81dff111952dd5937bc4afdaa41e25e.1606305501.git.fdmanana@suse.com> X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When logging an inode we may often have to fallback to a full transaction commit, either because a new block group was allocated, there is some case we can not deal with wihout a transaction commit or some error like an -ENOMEM happened. However after we fallback to a transaction commit, we have a time window where we can make the next attempt to log any inode commit the next transaction unnecessarily, adding additional overhead and increasing latency. A sequence of steps that leads to this issue is the following: 1) The current open transaction has a generation of 1000; 2) A new block group is allocated, and as a consequence we must make sure any attempts to commit a log fallback to a transaction commit, so btrfs_set_log_full_commit() is called from btrfs_make_block_group(). This sets fs_info->last_trans_log_full_commit to 1000; 3) Task A is holding a handle on transaction 1000 and tries to log inode X. Once it gets to start_log_trans(), it calls btrfs_need_log_full_commit() which returns true, since fs_info->last_trans_log_full_commit has a value of 1000. So we end up returning -EAGAIN and propagating it up to btrfs_sync_file(), where we commit transaction 1000; 4) The transaction commit task (task A) sets the tansaction state to unblocked (TRANS_STATE_UNBLOCKED); 5) Some other task, task B, starts a new transaction with a generation of 1001; 6) Some stuff is done with transaction 1001, some btree blocks COWed, etc; 7) Transaction 1000 has not fully committed yet, we are still writing all the extent buffers it created; 8) Some new task, task C, starts an fsync of inode Y, gets a handle for transaction 1001, and it gets to btrfs_log_inode_parent() which does the following check: if (fs_info->last_trans_log_full_commit > last_committed) { ret = 1; goto end_no_trans; } At that point last_trans_log_full_commit has a value of 1000 and last_committed (value of fs_info->last_trans_committed) has a value of 999, since transaction 1000 has not yet committed - it is either still writing out dirty extent buffers, its super blocks or unpinning extents. As a consequence we return 1, which gets propagated up to btrfs_sync_file(), which will then call btrfs_commit_transaction() for transaction 1001. As a consequence we have an unnecessary second transaction commit, we previously committed transaction 1000 and now commit transaction 1001 as well, resulting in more overhead and increased latency. So fix this double transaction commit issue simply by removing that check, because all we need to do is wait for the previous transaction to finish its commit, which we already do later when starting the log transaction at start_log_trans(), because there we acquire the tree_log_mutex lock, which is held by a transaction commit and only released after the transaction commits its super blocks. Another issue that check has is that it reads last_trans_log_full_commit without using READ_ONCE(), which is incorrect since that member of struct btrfs_fs_info is always updated with WRITE_ONCE() through the helper btrfs_set_log_full_commit(). This double transaction commit issue can actually be triggered quite often in long runs of dbench, since besides the creation of new block groups that force inode logging to fallback to a transaction commit, there are cases where dbench asks to fsync a directory which had files in it that were previously renamed or subdirectories that were removed, resulting in the inode logging to fallback to a full transaction commit. This patch belongs to a patch set that is comprised of the following patches: btrfs: fix race causing unnecessary inode logging during link and rename btrfs: fix race that results in logging old extents during a fast fsync btrfs: fix race that causes unnecessary logging of ancestor inodes btrfs: fix race that makes inode logging fallback to transaction commit btrfs: fix race leading to unnecessary transaction commit when logging inode btrfs: do not block inode logging for so long during transaction commit Performance results are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 10 ---------- 1 file changed, 10 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 592a50d9697f..bc5b652f4f64 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -6031,16 +6031,6 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, goto end_no_trans; } - /* - * The prev transaction commit doesn't complete, we need do - * full commit by ourselves. - */ - if (fs_info->last_trans_log_full_commit > - fs_info->last_trans_committed) { - ret = 1; - goto end_no_trans; - } - if (btrfs_root_refs(&root->root_item) == 0) { ret = 1; goto end_no_trans; From patchwork Wed Nov 25 12:19:28 2020 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 11931285 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.0 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id AC7C3C64E7C for ; Wed, 25 Nov 2020 12:19:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 583EE206E5 for ; Wed, 25 Nov 2020 12:19:51 +0000 (UTC) Authentication-Results: mail.kernel.org; dkim=pass (1024-bit key) header.d=kernel.org header.i=@kernel.org header.b="qJyb7ZHb" Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729141AbgKYMTk (ORCPT ); Wed, 25 Nov 2020 07:19:40 -0500 Received: from mail.kernel.org ([198.145.29.99]:44976 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1725616AbgKYMTi (ORCPT ); Wed, 25 Nov 2020 07:19:38 -0500 Received: from localhost.localdomain (bl8-197-74.dsl.telepac.pt [85.241.197.74]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 24C5E206F7 for ; Wed, 25 Nov 2020 12:19:36 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1606306777; bh=kpxwKEe9KHT5yFVY1GSvXGky3bO4pZ3AMMlMfRq6wko=; h=From:To:Subject:Date:In-Reply-To:References:In-Reply-To: References:From; b=qJyb7ZHb+5HNsTzGnotmSbHiZhdaBIP6ki7YL3gJBwsEruP23zWiKCLmInF/f4G63 AH3w0gNBGUrPdu/488FiodLNg++2ESLUYfqSG18VNSjphdb4MmIaZIq0+QF3cQjmve ieYrshZfBzv14kc8YQWSta5xK56616HqYHrW3zp4= From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 6/6] btrfs: do not block inode logging for so long during transaction commit Date: Wed, 25 Nov 2020 12:19:28 +0000 Message-Id: X-Mailer: git-send-email 2.17.1 In-Reply-To: References: In-Reply-To: References: Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana Early on during a transaction commit we acquire the tree_log_mutex and hold it until after we write the super blocks. But before writing the extent buffers dirtied by the transaction and the super blocks we unblock the transaction by setting its state to TRANS_STATE_UNBLOCKED and setting fs_info->running_transaction to NULL. This means that after that and before writing the super blocks, new transactions can start. However if any transaction wants to log an inode, it will block waiting for the transaction commit to write its dirty extent buffers and the super blocks because the tree_log_mutex is only released after those operations are complete, and starting a new log transaction blocks on that mutex (at start_log_trans()). Writing the dirty extent buffers and the super blocks can take a very significant amount of time to complete, but we could allow the tasks wanting to log an inode to proceed with most of their steps: 1) create the log trees 2) log metadata in the trees 3) write their dirty extent buffers They only need to wait for the previous transaction commit to complete (write its super blocks) before they attempt to write their super blocks, otherwise we could end up with a corrupt filesystem after a crash So change start_log_trans() to use the root tree's log_mutex to serialize for the creation of the log root tree instead of using the tree_log_mutex, and make btrfs_sync_log() acquire the tree_log_mutex before writing the super blocks. This allows for inode logging to wait much less time when there is a previous transaction that is still committing, often not having to wait at all, as by the time when we try to sync the log the previous transaction already wrote its super blocks. This patch belongs to a patch set that is comprised of the following patches: btrfs: fix race causing unnecessary inode logging during link and rename btrfs: fix race that results in logging old extents during a fast fsync btrfs: fix race that causes unnecessary logging of ancestor inodes btrfs: fix race that makes inode logging fallback to transaction commit btrfs: fix race leading to unnecessary transaction commit when logging inode btrfs: do not block inode logging for so long during transaction commit The following script that uses dbench was used to measure the impact of the whole patchset: $ cat test-dbench.sh #!/bin/bash DEV=/dev/nvme0n1 MNT=/mnt/btrfs MOUNT_OPTIONS="-o ssd" echo "performance" | \ tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor mkfs.btrfs -f -m single -d single $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT -t 300 64 umount $MNT The test was run on a machine with 12 cores, 64G of ram, using a NVMe device and a non-debug kernel configuration (Debian's default). Before patch set: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 11277211 0.250 85.340 Close 8283172 0.002 6.479 Rename 477515 1.935 86.026 Unlink 2277936 0.770 87.071 Deltree 256 15.732 81.379 Mkdir 128 0.003 0.009 Qpathinfo 10221180 0.056 44.404 Qfileinfo 1789967 0.002 4.066 Qfsinfo 1874399 0.003 9.176 Sfileinfo 918589 0.061 10.247 Find 3951758 0.341 54.040 WriteX 5616547 0.047 85.079 ReadX 17676028 0.005 9.704 LockX 36704 0.003 1.800 UnlockX 36704 0.002 0.687 Flush 790541 14.115 676.236 Throughput 1179.19 MB/sec 64 clients 64 procs max_latency=676.240 ms After patch set: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 12687926 0.171 86.526 Close 9320780 0.002 8.063 Rename 537253 1.444 78.576 Unlink 2561827 0.559 87.228 Deltree 374 11.499 73.549 Mkdir 187 0.003 0.005 Qpathinfo 11500300 0.061 36.801 Qfileinfo 2017118 0.002 7.189 Qfsinfo 2108641 0.003 4.825 Sfileinfo 1033574 0.008 8.065 Find 4446553 0.408 47.835 WriteX 6335667 0.045 84.388 ReadX 19887312 0.003 9.215 LockX 41312 0.003 1.394 UnlockX 41312 0.002 1.425 Flush 889233 13.014 623.259 Throughput 1339.32 MB/sec 64 clients 64 procs max_latency=623.265 ms +12.7% throughput, -8.2% max latency Signed-off-by: Filipe Manana --- fs/btrfs/ctree.h | 2 +- fs/btrfs/tree-log.c | 56 +++++++++++++++++++++++++++++++-------------- 2 files changed, 40 insertions(+), 18 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index c0c6e79c43f9..7185384f475a 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -1026,7 +1026,7 @@ enum { BTRFS_ROOT_DEAD_RELOC_TREE, /* Mark dead root stored on device whose cleanup needs to be resumed */ BTRFS_ROOT_DEAD_TREE, - /* The root has a log tree. Used only for subvolume roots. */ + /* The root has a log tree. Used for subvolume roots and the tree root. */ BTRFS_ROOT_HAS_LOG_TREE, /* Qgroup flushing is in progress */ BTRFS_ROOT_QGROUP_FLUSHING, diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index bc5b652f4f64..e8b84543d565 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -139,8 +139,25 @@ static int start_log_trans(struct btrfs_trans_handle *trans, struct btrfs_log_ctx *ctx) { struct btrfs_fs_info *fs_info = root->fs_info; + struct btrfs_root *tree_root = fs_info->tree_root; int ret = 0; + /* + * First check if the log root tree was already created. If not, create + * it before locking the root's log_mutex, just to keep lockdep happy. + */ + if (!test_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state)) { + mutex_lock(&tree_root->log_mutex); + if (!fs_info->log_root_tree) { + ret = btrfs_init_log_root_tree(trans, fs_info); + if (!ret) + set_bit(BTRFS_ROOT_HAS_LOG_TREE, &tree_root->state); + } + mutex_unlock(&tree_root->log_mutex); + if (ret) + return ret; + } + mutex_lock(&root->log_mutex); if (root->log_root) { @@ -156,13 +173,6 @@ static int start_log_trans(struct btrfs_trans_handle *trans, set_bit(BTRFS_ROOT_MULTI_LOG_TASKS, &root->state); } } else { - mutex_lock(&fs_info->tree_log_mutex); - if (!fs_info->log_root_tree) - ret = btrfs_init_log_root_tree(trans, fs_info); - mutex_unlock(&fs_info->tree_log_mutex); - if (ret) - goto out; - ret = btrfs_add_log_tree(trans, root); if (ret) goto out; @@ -3022,6 +3032,8 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, int log_transid = 0; struct btrfs_log_ctx root_log_ctx; struct blk_plug plug; + u64 log_root_start; + u64 log_root_level; mutex_lock(&root->log_mutex); log_transid = ctx->log_transid; @@ -3199,22 +3211,31 @@ int btrfs_sync_log(struct btrfs_trans_handle *trans, goto out_wake_log_root; } - btrfs_set_super_log_root(fs_info->super_for_commit, - log_root_tree->node->start); - btrfs_set_super_log_root_level(fs_info->super_for_commit, - btrfs_header_level(log_root_tree->node)); - + log_root_start = log_root_tree->node->start; + log_root_level = btrfs_header_level(log_root_tree->node); log_root_tree->log_transid++; mutex_unlock(&log_root_tree->log_mutex); /* - * Nobody else is going to jump in and write the ctree - * super here because the log_commit atomic below is protecting - * us. We must be called with a transaction handle pinning - * the running transaction open, so a full commit can't hop - * in and cause problems either. + * Here we are guaranteed that nobody is going to write the superblock + * for the current transaction before us and that neither we do write + * our superblock before the previous transaction finishes its commit + * and writes its superblock, because: + * + * 1) We are holding a handle on the current transaction, so no body + * can commit it until we release the handle; + * + * 2) Before writing our superblock we acquire the tree_log_mutex, so + * if the previous transaction is still committing, and hasn't yet + * written its superblock, we wait for it to do it, because a + * transaction commit acquires the tree_log_mutex when the commit + * begins and releases it only after writing its superblock. */ + mutex_lock(&fs_info->tree_log_mutex); + btrfs_set_super_log_root(fs_info->super_for_commit, log_root_start); + btrfs_set_super_log_root_level(fs_info->super_for_commit, log_root_level); ret = write_all_supers(fs_info, 1); + mutex_unlock(&fs_info->tree_log_mutex); if (ret) { btrfs_set_log_full_commit(trans); btrfs_abort_transaction(trans, ret); @@ -3299,6 +3320,7 @@ int btrfs_free_log_root_tree(struct btrfs_trans_handle *trans, if (fs_info->log_root_tree) { free_log_tree(trans, fs_info->log_root_tree); fs_info->log_root_tree = NULL; + clear_bit(BTRFS_ROOT_HAS_LOG_TREE, &fs_info->tree_root->state); } return 0; }