From patchwork Wed Jan 27 10:34:54 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12049663 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.3 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 4D3EFC433E0 for ; Wed, 27 Jan 2021 10:39:07 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 01CD82070E for ; Wed, 27 Jan 2021 10:39:06 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236004AbhA0KiJ (ORCPT ); Wed, 27 Jan 2021 05:38:09 -0500 Received: from mail.kernel.org ([198.145.29.99]:52526 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235757AbhA0Kfp (ORCPT ); Wed, 27 Jan 2021 05:35:45 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 3E15920731 for ; Wed, 27 Jan 2021 10:35:04 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1611743704; bh=MzfNKbKofZLjvsMlU8rraxLE9WJAm1eSa48pkb07htM=; h=From:To:Subject:Date:In-Reply-To:References:From; b=mtf+DjzjCQwQqbipMKC6fKyqFfkV01m3f22idE43An1+6Er0cN+dFfq8eakJyR4yb +bul/ZLcGoXl/oYecCzYhHwczLJo9htnwihf1hMy6hg7TiXkJnCAKBoWyx/s60AMMo NnjTsnd0bsZjzLDkI9/LhIZm/i3r/TWw4Vt1Ihxmy9fogPFAxhcvnC/bBydRF/1t0L PdazD0D+0tnrKdlri/57AMcz4bKMlYxNvlo3dMMcn+3n9E0+lAwnPvg09zZn+pIFmp fSLWCoDM5wFGfKYcSzFnlaN4YiRGut0O0AkeV3ipM2qkgWNlO3HAy5o88fTmd2ZYPx ZBOu1ENHL+ZMQ== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 1/7] btrfs: remove unnecessary directory inode item update when deleting dir entry Date: Wed, 27 Jan 2021 10:34:54 +0000 Message-Id: <1d84cf958b2d37fdaac8583dff6ee68ec49251b3.1611742865.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When we remove a directory entry, as part of an unlink operation, if the directory was logged before we must remove the directory index items from the log. We are also updating the inode item of the directory to update its i_size, but that is not necessary because during log replay we do not need it and we correctly adjust the i_size in the inode item of the subvolume as we process directory index items and replay deletes. This is not needed since commit d555438b6e1dad ("Btrfs: drop dir i_size when adding new names on replay"), where we explicitly ignore the i_size of directory inode items on log replay. Before that we used it but it was buggy as mentioned in that commit's change log (i_size got a larger value then it should have). So stop updating the i_size of the directory inode item in the log, as that is a waste of time, adds more log contention to the log tree and often results in COWing more extent buffers for the log tree. This code path is triggered often during dbench workloads for example. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 39 ++++----------------------------------- 1 file changed, 4 insertions(+), 35 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 8ee0700a980f..5d87afc6058a 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3379,7 +3379,6 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans, struct btrfs_path *path; int ret; int err = 0; - int bytes_del = 0; u64 dir_ino = btrfs_ino(dir); if (!inode_logged(trans, dir)) @@ -3406,7 +3405,6 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans, } if (di) { ret = btrfs_delete_one_dir_name(trans, log, path, di); - bytes_del += name_len; if (ret) { err = ret; goto fail; @@ -3421,46 +3419,17 @@ int btrfs_del_dir_entries_in_log(struct btrfs_trans_handle *trans, } if (di) { ret = btrfs_delete_one_dir_name(trans, log, path, di); - bytes_del += name_len; if (ret) { err = ret; goto fail; } } - /* update the directory size in the log to reflect the names - * we have removed + /* + * We do not need to update the size field of the directory's inode item + * because on log replay we update the field to reflect all existing + * entries in the directory (see overwrite_item()). */ - if (bytes_del) { - struct btrfs_key key; - - key.objectid = dir_ino; - key.offset = 0; - key.type = BTRFS_INODE_ITEM_KEY; - btrfs_release_path(path); - - ret = btrfs_search_slot(trans, log, &key, path, 0, 1); - if (ret < 0) { - err = ret; - goto fail; - } - if (ret == 0) { - struct btrfs_inode_item *item; - u64 i_size; - - item = btrfs_item_ptr(path->nodes[0], path->slots[0], - struct btrfs_inode_item); - i_size = btrfs_inode_size(path->nodes[0], item); - if (i_size > bytes_del) - i_size -= bytes_del; - else - i_size = 0; - btrfs_set_inode_size(path->nodes[0], item, i_size); - btrfs_mark_buffer_dirty(path->nodes[0]); - } else - ret = 0; - btrfs_release_path(path); - } fail: btrfs_free_path(path); out_unlock: From patchwork Wed Jan 27 10:34:55 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12049667 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.3 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9EC01C433DB for ; Wed, 27 Jan 2021 10:39:07 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 54A7220789 for ; Wed, 27 Jan 2021 10:39:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235998AbhA0KiG (ORCPT ); Wed, 27 Jan 2021 05:38:06 -0500 Received: from mail.kernel.org ([198.145.29.99]:52528 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235819AbhA0Kfr (ORCPT ); Wed, 27 Jan 2021 05:35:47 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 1F2F12076D for ; Wed, 27 Jan 2021 10:35:04 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1611743705; bh=EBREEiWaubx8WBQ4inHpm9ndeX/pRHzU4Hn4Va7JbSo=; h=From:To:Subject:Date:In-Reply-To:References:From; b=UfnCaQSa1U8ZvVL9JViu9NhKolOEwkSP2iAS/9ME3fIjhTv5Fx9ywdSwB68ttfyON tf7Yhs1FhG5JVRgrUCsHfOSPd93QusLQKZ+6epcjJfbOpyqJGNrM0PU0cG2UtpJsFn Jr/8P2APb/vslXQcNRXdiIesOSqg/QnwcUq0meb90ZdreKOCYCrBbt8p1+4RqPzpRF Ao0ERNKo9xU0wpL64z812hkGMIMESWL7gZveal/gRQsdrhMVS5cSJNlk5l4/5+Cctv eZpo+Hwngc22RFn8bZp6xl4Y9gk+vBbDmRPbaxBZQVPkreqLYQLgdqWEuFFR5R9VA5 nqIZF3/BXzcMQ== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 2/7] btrfs: stop setting nbytes when filling inode item for logging Date: Wed, 27 Jan 2021 10:34:55 +0000 Message-Id: <6410eae15fc6e5e4d2e8a1e308412c63acd421f6.1611742865.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When we fill an inode item for logging we are setting its nbytes field with the value returned by inode_get_bytes() (a VFS API), however we do not need it because it is not used during log replay. In fact, for fast fsyncs, when we call inode_get_bytes() we may even get an outdated value for nbytes because the nbytes field of the inode is only updated when ordered extents complete, and a fast fsync only waits for writeback to complete, it does not wait for ordered extent completion. So just remove the setup of nbytes and add an explicit comment mentioning why we do not set it. This also avoids adding contention on the inode's i_lock (VFS) with concurrent stat() calls, since that spinlock is used by inode_get_bytes() which is also called by our stat callback (btrfs_getattr()). This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 5d87afc6058a..be62759f0aac 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -3858,7 +3858,14 @@ static void fill_inode_item(struct btrfs_trans_handle *trans, btrfs_set_token_timespec_nsec(&token, &item->ctime, inode->i_ctime.tv_nsec); - btrfs_set_token_inode_nbytes(&token, item, inode_get_bytes(inode)); + /* + * We do not need to set the nbytes field, in fact during a fast fsync + * its value may not even be correct, since a fast fsync does not wait + * for ordered extent completion, which is where we update nbytes, it + * only waits for writeback to complete. During log replay as we find + * file extent items and replay them, we adjust the nbytes field of the + * inode item in subvolume tree as needed (see overwrite_item()). + */ btrfs_set_token_inode_sequence(&token, item, inode_peek_iversion(inode)); btrfs_set_token_inode_transid(&token, item, trans->transid); From patchwork Wed Jan 27 10:34:56 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12049665 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.3 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6CF47C433E6 for ; Wed, 27 Jan 2021 10:39:07 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 2BED02076E for ; Wed, 27 Jan 2021 10:39:07 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236003AbhA0KiI (ORCPT ); Wed, 27 Jan 2021 05:38:08 -0500 Received: from mail.kernel.org ([198.145.29.99]:52530 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235846AbhA0Kfr (ORCPT ); Wed, 27 Jan 2021 05:35:47 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id EF93D2076E for ; Wed, 27 Jan 2021 10:35:05 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1611743706; bh=bufW6tDsoD66RjDmrDgsTqL4yOOx4GgrF16MkUAG6yk=; h=From:To:Subject:Date:In-Reply-To:References:From; b=WlETU9iljMatmoWNge8xRTJ4dnY6vGaYp1V8bgCMA2oUEVjd59cs8TpPaOIIOH1os FsLagK4fMDnxyUs0YjKCiFbjLQi3PncU90fLYW2VgNBF6Lx+8CJEuNsRvUdD9ntxE4 EAG0bb7B2yw1zlj3xMSqPdEatZ/+BzULInjVrnU2uipD7sTCKWbSLsYvFiVIJvsCAV VQUmo8y5TVYbroIs3HfBvstY7UoVmTMI1Mx/lOTIFcSBI+vea0r7WJN8swpeLFEqcM /NfZ827Fg0irXVfd9TSUYrHJk/RcwS0TuO4x3WIHwVIsf2XiRR59MWi66J7d5QSLyv uig1vzSrz4U9w== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 3/7] btrfs: avoid logging new ancestor inodes when logging new inode Date: Wed, 27 Jan 2021 10:34:56 +0000 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When we fsync a new file, created in the current transaction, we check all its ancestor inodes and always log them if they were created in the current transaction - even if we have already logged them before, which is a waste of time. So avoid logging new ancestor inodes if they were already logged before and have no xattrs added/updated/removed since they were last logged. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 35 +++++++++++++++++++++++++++++++++-- 1 file changed, 33 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index be62759f0aac..105cf316ee27 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5272,6 +5272,7 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, if (S_ISDIR(inode->vfs_inode.i_mode)) { int max_key_type = BTRFS_DIR_LOG_INDEX_KEY; + clear_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags); if (inode_only == LOG_INODE_EXISTS) max_key_type = BTRFS_XATTR_ITEM_KEY; ret = drop_objectid_items(trans, log, path, ino, max_key_type); @@ -5520,6 +5521,34 @@ static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans, return ret; } +/* + * Check if we need to log an inode. This is used in contexts where while + * logging an inode we need to log another inode (either that it exists or in + * full mode). This is used instead of btrfs_inode_in_log() because the later + * requires the inode to be in the log and have the log transaction committed, + * while here we do not care if the log transaction was already committed - our + * caller will commit the log later - and we want to avoid logging an inode + * multiple times when multiple tasks have joined the same log transaction. + */ +static bool need_log_inode(struct btrfs_trans_handle *trans, + struct btrfs_inode *inode) +{ + /* + * If this inode does not have new/updated/deleted xattrs since the last + * time it was logged and is flagged as logged in the current transaction, + * we can skip logging it. As for new/deleted names, those are updated in + * the log by link/unlink/rename operations. + * In case the inode was logged and then evicted and reloaded, its + * logged_trans will be 0, in which case we have to fully log it since + * logged_trans is a transient field, not persisted. + */ + if (inode->logged_trans == trans->transid && + !test_bit(BTRFS_INODE_COPY_EVERYTHING, &inode->runtime_flags)) + return false; + + return true; +} + struct btrfs_dir_list { u64 ino; struct list_head list; @@ -5848,7 +5877,8 @@ static int log_new_ancestors(struct btrfs_trans_handle *trans, if (IS_ERR(inode)) return PTR_ERR(inode); - if (BTRFS_I(inode)->generation >= trans->transid) + if (BTRFS_I(inode)->generation >= trans->transid && + need_log_inode(trans, BTRFS_I(inode))) ret = btrfs_log_inode(trans, root, BTRFS_I(inode), LOG_INODE_EXISTS, ctx); btrfs_add_delayed_iput(inode); @@ -5902,7 +5932,8 @@ static int log_new_ancestors_fast(struct btrfs_trans_handle *trans, if (root != inode->root) break; - if (inode->generation >= trans->transid) { + if (inode->generation >= trans->transid && + need_log_inode(trans, inode)) { ret = btrfs_log_inode(trans, root, inode, LOG_INODE_EXISTS, ctx); if (ret) From patchwork Wed Jan 27 10:34:57 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12049661 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.3 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 8777CC433DB for ; Wed, 27 Jan 2021 10:38:43 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3D4782070E for ; Wed, 27 Jan 2021 10:38:43 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235996AbhA0KiF (ORCPT ); Wed, 27 Jan 2021 05:38:05 -0500 Received: from mail.kernel.org ([198.145.29.99]:52534 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S235880AbhA0Kfr (ORCPT ); Wed, 27 Jan 2021 05:35:47 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id CF78920771 for ; Wed, 27 Jan 2021 10:35:06 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1611743707; bh=L1CIGZFWMrAsAG2iZW8rcHV6hjXKPtXEyWE6N3RdetM=; h=From:To:Subject:Date:In-Reply-To:References:From; b=uudeliF/fdhYe2REqCbRmF0nHPqHJa+1fTTXyqk8aizlgTGyPDhOoffpxlhuhTemv oUpRVlqzOzHOwf4f9nOfACLi7wJqJ/dbg1jbWszctsrKLIqREQJS8FANH/6LRDR+/I RGfSDvLBktqd3v0xBUtELmpPuwEuTxCcnRkv/AvaaK/ASklE+ssLOo1oDtYZNYp7yE 7Co/qR+0ZXActnVYIBJIPiP+Ex1cflNyofTIt49OU1PpPvsZ2G47uxkYmKFJpsDK2u BDiVw3B4LCyvqcD9rYgFyIkHTUotn6+GUZZq4qYqQ/Q1t1krAfR2CdEjq4mJyalMAU glQcQreA8z4BQ== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 4/7] btrfs: skip logging directories already logged when logging all parents Date: Wed, 27 Jan 2021 10:34:57 +0000 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana Some times when we fsync an inode we need to do a full log of all its ancestors (due to unlink, link or rename operations), which can be an expensive operation, specially if the directories are large. However if we find an ancestor directory inode that is already logged in the current transaction, and has no inserted/updated/deleted xattrs since it was last logged, we can skip logging the directory again. We are safe to skip that since we know that for logged directories, any link, unlink or rename operations that implicate the directory will update the log as necessary. So use the helper need_log_dir(), introduced in a previous commit, to detect already logged directories that can be skipped. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 5 +++++ 1 file changed, 5 insertions(+) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 105cf316ee27..c0dce99c2c14 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5826,6 +5826,11 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, goto out; } + if (!need_log_inode(trans, BTRFS_I(dir_inode))) { + btrfs_add_delayed_iput(dir_inode); + continue; + } + if (ctx) ctx->log_new_dentries = false; ret = btrfs_log_inode(trans, root, BTRFS_I(dir_inode), From patchwork Wed Jan 27 10:34:58 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12049671 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.3 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 020C6C433DB for ; Wed, 27 Jan 2021 10:41:11 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 9ABE92070E for ; Wed, 27 Jan 2021 10:41:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S235098AbhA0Kks (ORCPT ); Wed, 27 Jan 2021 05:40:48 -0500 Received: from mail.kernel.org ([198.145.29.99]:53076 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236008AbhA0KiZ (ORCPT ); Wed, 27 Jan 2021 05:38:25 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id AE15920773 for ; Wed, 27 Jan 2021 10:35:07 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1611743708; bh=IgAegIguo12oSqOVBxDaVBbCr4bKsxSGPcOcMxGYwBs=; h=From:To:Subject:Date:In-Reply-To:References:From; b=fQCRVbFSmRnawGoIDkX+JjI2EblwH2scq2JlrkVGUM0jI4ZznJadLPHchsZW4Sikl znfB+hR9gR6xMzGVRrDLc8lFp2la3THSJfB+JdPgfYL1f7u8/uTQtiELztj3bxz2tv ml/e/7ysMJksbRzNl4hQOxm98IUORoH48QB0sEO2C9fXPOfizwePDxzU7y/wexHXiO bpmuX2qvitmBBexr6eX5FAmNQK5nok4O7gxB9P6sh4PbGzhoHKmis8Hht4k3v7KG23 zxYpXeOHMgvx/7LoqjVwKnbci22YXWPN3n0mPwyuBO1Q9dI8hHQ+ZgWZ/56us9fIH7 +3afD8pnVxiUA== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 5/7] btrfs: skip logging inodes already logged when logging new entries Date: Wed, 27 Jan 2021 10:34:58 +0000 Message-Id: X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When logging new directory entries of a directory, we log the inodes of new dentries and the inodes of dentries pointing to directories that may have been created in past transactions. For the case of directories we log in full mode, which can be particularly expensive for large directories. We do use btrfs_inode_in_log() to skip already logged inodes, however for that helper to return true, it requires that the log transaction used to log the inode to be already committed. This means that when we have more than one task using the same log transaction we can end up logging an inode multiple times, which is a waste of time and not necessary since the log will be committed by one of the tasks and the others will wait for the log transaction to be committed before returning to user space. So simply replace the use of btrfs_inode_in_log() with the new helper function need_log_inode(), introduced in a previous commit. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index c0dce99c2c14..6dc376a16cf2 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5676,7 +5676,7 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, goto next_dir_inode; } - if (btrfs_inode_in_log(BTRFS_I(di_inode), trans->transid)) { + if (!need_log_inode(trans, BTRFS_I(di_inode))) { btrfs_add_delayed_iput(di_inode); break; } From patchwork Wed Jan 27 10:34:59 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12049683 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.3 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 674F8C433E0 for ; Wed, 27 Jan 2021 10:42:01 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 09EB6205F4 for ; Wed, 27 Jan 2021 10:42:00 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236006AbhA0Kkf (ORCPT ); Wed, 27 Jan 2021 05:40:35 -0500 Received: from mail.kernel.org ([198.145.29.99]:53074 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236007AbhA0KiZ (ORCPT ); Wed, 27 Jan 2021 05:38:25 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 8D8B820780 for ; Wed, 27 Jan 2021 10:35:08 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1611743709; bh=Jtxz1Gt1cVJlfkFGfVSaBVXgzfE2MXW6E/QWWSMVkC8=; h=From:To:Subject:Date:In-Reply-To:References:From; b=JxN1DIWQiEUWA/M/AzCuXGusYOJiQVWvXCappWRzOFrcEBBoIuDonY4Zr90DUF0Ow mU1Q3S26CHq4lBm9Uwwb6ct82BoVLT1FhfWGC+PEzUc7yqDfjLG36rKuLwSOR2WcNE NznxTeuXJw0+W5sCgynmldM5sMW1BASGGZgxyhpbHwgmx0Kud1QfgWxNGrHkoaMh2F f9abxjVsKYhcw4UwC1xvxJZ7m3q9RRb8DkV+TKdlsBc68Mxwzu+po1iZqKJV9ptQ+s L+zCH1xM7GC+Hhmh6O/vnvwAhxZROSMYDZK29gvWQTMYEitRULDq792PdUxGJUkX6h 1aEOpBiOLKF+w== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 6/7] btrfs: remove unnecessary check_parent_dirs_for_sync() Date: Wed, 27 Jan 2021 10:34:59 +0000 Message-Id: <77b21c64a5aed56e5602c59558c1b09254f3b494.1611742865.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana Whenever we fsync an inode, if it is a directory, a regular file that was created in the current transaction or has last_unlink_trans set to the generation of the current transaction, we check if any of its ancestor inodes (and the inode itself if it is a directory) can not be logged and need a fallback to a full transaction commit - if so, we return with a value of 1 in order to fallback to a transaction commit. However we often do not need to fallback to a transaction commit because: 1) The ancestor inode is not an immediate parent, and therefore there is not an explicit request to log it and it is not needed neither to guarantee the consistency of the inode originally asked to be logged (fsynced) nor its immediate parent; 2) The ancestor inode was already logged before, in which case any link, unlink or rename operation updates the log as needed. So for these two cases we can avoid an unnecessary transaction commit. Therefore remove check_parent_dirs_for_sync() and add a check at the top of btrfs_log_inode() to make us fallback immediately to a transaction commit when we are logging a directory inode that can not be logged and needs a full transaction commit. All we need to protect is the case where after renaming a file someone fsyncs only the old directory, which would result is losing the renamed file after a log replay. This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Performance results, after applying all patches, are mentioned in the change log of the last patch. Signed-off-by: Filipe Manana --- fs/btrfs/tree-log.c | 121 ++++++-------------------------------------- 1 file changed, 15 insertions(+), 106 deletions(-) diff --git a/fs/btrfs/tree-log.c b/fs/btrfs/tree-log.c index 6dc376a16cf2..4c7b283ed2b2 100644 --- a/fs/btrfs/tree-log.c +++ b/fs/btrfs/tree-log.c @@ -5265,6 +5265,21 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, mutex_lock(&inode->log_mutex); } + /* + * This is for cases where logging a directory could result in losing a + * a file after replaying the log. For example, if we move a file from a + * directory A to a directory B, then fsync directory A, we have no way + * to known the file was moved from A to B, so logging just A would + * result in losing the file after a log replay. + */ + if (S_ISDIR(inode->vfs_inode.i_mode) && + inode_only == LOG_INODE_ALL && + inode->last_unlink_trans >= trans->transid) { + btrfs_set_log_full_commit(trans); + err = 1; + goto out_unlock; + } + /* * a brute force approach to making sure we get the most uptodate * copies of everything. @@ -5428,99 +5443,6 @@ static int btrfs_log_inode(struct btrfs_trans_handle *trans, return err; } -/* - * Check if we must fallback to a transaction commit when logging an inode. - * This must be called after logging the inode and is used only in the context - * when fsyncing an inode requires the need to log some other inode - in which - * case we can't lock the i_mutex of each other inode we need to log as that - * can lead to deadlocks with concurrent fsync against other inodes (as we can - * log inodes up or down in the hierarchy) or rename operations for example. So - * we take the log_mutex of the inode after we have logged it and then check for - * its last_unlink_trans value - this is safe because any task setting - * last_unlink_trans must take the log_mutex and it must do this before it does - * the actual unlink operation, so if we do this check before a concurrent task - * sets last_unlink_trans it means we've logged a consistent version/state of - * all the inode items, otherwise we are not sure and must do a transaction - * commit (the concurrent task might have only updated last_unlink_trans before - * we logged the inode or it might have also done the unlink). - */ -static bool btrfs_must_commit_transaction(struct btrfs_trans_handle *trans, - struct btrfs_inode *inode) -{ - bool ret = false; - - mutex_lock(&inode->log_mutex); - if (inode->last_unlink_trans >= trans->transid) { - /* - * Make sure any commits to the log are forced to be full - * commits. - */ - btrfs_set_log_full_commit(trans); - ret = true; - } - mutex_unlock(&inode->log_mutex); - - return ret; -} - -/* - * follow the dentry parent pointers up the chain and see if any - * of the directories in it require a full commit before they can - * be logged. Returns zero if nothing special needs to be done or 1 if - * a full commit is required. - */ -static noinline int check_parent_dirs_for_sync(struct btrfs_trans_handle *trans, - struct btrfs_inode *inode, - struct dentry *parent, - struct super_block *sb) -{ - int ret = 0; - struct dentry *old_parent = NULL; - - /* - * for regular files, if its inode is already on disk, we don't - * have to worry about the parents at all. This is because - * we can use the last_unlink_trans field to record renames - * and other fun in this file. - */ - if (S_ISREG(inode->vfs_inode.i_mode) && - inode->generation < trans->transid && - inode->last_unlink_trans < trans->transid) - goto out; - - if (!S_ISDIR(inode->vfs_inode.i_mode)) { - if (!parent || d_really_is_negative(parent) || sb != parent->d_sb) - goto out; - inode = BTRFS_I(d_inode(parent)); - } - - while (1) { - if (btrfs_must_commit_transaction(trans, inode)) { - ret = 1; - break; - } - - if (!parent || d_really_is_negative(parent) || sb != parent->d_sb) - break; - - if (IS_ROOT(parent)) { - inode = BTRFS_I(d_inode(parent)); - if (btrfs_must_commit_transaction(trans, inode)) - ret = 1; - break; - } - - parent = dget_parent(parent); - dput(old_parent); - old_parent = parent; - inode = BTRFS_I(d_inode(parent)); - - } - dput(old_parent); -out: - return ret; -} - /* * Check if we need to log an inode. This is used in contexts where while * logging an inode we need to log another inode (either that it exists or in @@ -5686,9 +5608,6 @@ static int log_new_dir_dentries(struct btrfs_trans_handle *trans, log_mode = LOG_INODE_ALL; ret = btrfs_log_inode(trans, root, BTRFS_I(di_inode), log_mode, ctx); - if (!ret && - btrfs_must_commit_transaction(trans, BTRFS_I(di_inode))) - ret = 1; btrfs_add_delayed_iput(di_inode); if (ret) goto next_dir_inode; @@ -5835,9 +5754,6 @@ static int btrfs_log_all_parents(struct btrfs_trans_handle *trans, ctx->log_new_dentries = false; ret = btrfs_log_inode(trans, root, BTRFS_I(dir_inode), LOG_INODE_ALL, ctx); - if (!ret && - btrfs_must_commit_transaction(trans, BTRFS_I(dir_inode))) - ret = 1; if (!ret && ctx && ctx->log_new_dentries) ret = log_new_dir_dentries(trans, root, BTRFS_I(dir_inode), ctx); @@ -6053,12 +5969,9 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, { struct btrfs_root *root = inode->root; struct btrfs_fs_info *fs_info = root->fs_info; - struct super_block *sb; int ret = 0; bool log_dentries = false; - sb = inode->vfs_inode.i_sb; - if (btrfs_test_opt(fs_info, NOTREELOG)) { ret = 1; goto end_no_trans; @@ -6069,10 +5982,6 @@ static int btrfs_log_inode_parent(struct btrfs_trans_handle *trans, goto end_no_trans; } - ret = check_parent_dirs_for_sync(trans, inode, parent, sb); - if (ret) - goto end_no_trans; - /* * Skip already logged inodes or inodes corresponding to tmpfiles * (since logging them is pointless, a link count of 0 means they From patchwork Wed Jan 27 10:35:00 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12049673 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.3 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id E902BC433DB for ; Wed, 27 Jan 2021 10:41:23 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 96D502070E for ; Wed, 27 Jan 2021 10:41:23 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S236016AbhA0KlH (ORCPT ); Wed, 27 Jan 2021 05:41:07 -0500 Received: from mail.kernel.org ([198.145.29.99]:53080 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S236010AbhA0Ki1 (ORCPT ); Wed, 27 Jan 2021 05:38:27 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 6AC2620781 for ; Wed, 27 Jan 2021 10:35:09 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1611743709; bh=ZAFPaVcpqJlPci1Idchoh39eTJxfXRuaY85UtemkQrA=; h=From:To:Subject:Date:In-Reply-To:References:From; b=J5x/+oxHRcQtqaqq5hOehCW0CaisO6JWRsmbE6qjRRJyD4ilv4pNiEBq7dZBzhn/X RYzVLr9fXvtbeNF/fkIjJxhg5AKIHl4v66XtGMkxxo/1c1Qi3OifyvtFbVTlayMbbU NndvNGm0V1Hh8hAv0lPH6rb874Hi5zn+zb1xMF/S+JTyii1AlHBYr1q9ePJ1BPgdTU MLw6wuSGUkgyEOgV810Fm6vGOiKirUunRt1aBfi57x+noxBybW7i95RhVZENESpwP3 tKlN3Rh3MK72Rd8aOLEr+fZSL3VyIsRQ8mjHcCI1HGyppKkWnivuvMh8wkM/fP7Ouj Re4Znzh7HNP0Q== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 7/7] btrfs: make concurrent fsyncs wait less when waiting for a transaction commit Date: Wed, 27 Jan 2021 10:35:00 +0000 Message-Id: <8c391b7c0a7a6a7e827644d424cd4c343f39588e.1611742865.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana Often an fsync needs to fallback to a transaction commit for several reasons (to ensure consistency after a power failure, a new block group was allocated or a temporary error such as ENOMEM or ENOSPC happened). In that case the log is marked as needing a full commit and any concurrent tasks attempting to log inodes or commit the log will also fallback to the transaction commit. When this happens they all wait for the task that first started the transaction commit to finish the transaction commit - however they wait until the full transaction commit happens, which is not needed, as they only need to wait for the superblocks to be persisted and not for unpinning all the extents pinned during the transaction's lifetime, which even for short lived transactions can be a few thousand and take some significant amount of time to complete - for dbench workloads I have observed up to 4~5 milliseconds of time spent unpinning extents in the worst cases, and the number of pinned extents was between 2 to 3 thousand. So allow fsync tasks to skip waiting for the unpinning of extents when they call btrfs_commit_transaction() and they were not the task that started the transaction commit (that one has to do it, the alternative would be to offload the transaction commit to another task so that it could avoid waiting for the extent unpinning or offload the extent unpinning to another task). This patch is part of a patchset comprised of the following patches: btrfs: remove unnecessary directory inode item update when deleting dir entry btrfs: stop setting nbytes when filling inode item for logging btrfs: avoid logging new ancestor inodes when logging new inode btrfs: skip logging directories already logged when logging all parents btrfs: skip logging inodes already logged when logging new entries btrfs: remove unnecessary check_parent_dirs_for_sync() btrfs: make concurrent fsyncs wait less when waiting for a transaction commit After applying the entire patchset, dbench shows improvements in respect to throughput and latency. The script used to measure it is the following: $ cat dbench-test.sh #!/bin/bash DEV=/dev/sdk MNT=/mnt/sdk MOUNT_OPTIONS="-o ssd" MKFS_OPTIONS="-m single -d single" echo "performance" | tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor umount $DEV &> /dev/null mkfs.btrfs -f $MKFS_OPTIONS $DEV mount $MOUNT_OPTIONS $DEV $MNT dbench -D $MNT -t 300 64 umount $MNT The test was run on a physical machine with 12 cores (Intel corei7), 64G of ram, using a NVMe device and a non-debug kernel configuration (Debian's default configuration). Before applying patchset, 32 clients: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 9627107 0.153 61.938 Close 7072076 0.001 3.175 Rename 407633 1.222 44.439 Unlink 1943895 0.658 44.440 Deltree 256 17.339 110.891 Mkdir 128 0.003 0.009 Qpathinfo 8725406 0.064 17.850 Qfileinfo 1529516 0.001 2.188 Qfsinfo 1599884 0.002 1.457 Sfileinfo 784200 0.005 3.562 Find 3373513 0.411 30.312 WriteX 4802132 0.053 29.054 ReadX 15089959 0.002 5.801 LockX 31344 0.002 0.425 UnlockX 31344 0.001 0.173 Flush 674724 5.952 341.830 Throughput 1008.02 MB/sec 32 clients 32 procs max_latency=341.833 ms After applying patchset, 32 clients: After patchset, with 32 clients: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 9931568 0.111 25.597 Close 7295730 0.001 2.171 Rename 420549 0.982 49.714 Unlink 2005366 0.497 39.015 Deltree 256 11.149 89.242 Mkdir 128 0.002 0.014 Qpathinfo 9001863 0.049 20.761 Qfileinfo 1577730 0.001 2.546 Qfsinfo 1650508 0.002 3.531 Sfileinfo 809031 0.005 5.846 Find 3480259 0.309 23.977 WriteX 4952505 0.043 41.283 ReadX 15568127 0.002 5.476 LockX 32338 0.002 0.978 UnlockX 32338 0.001 2.032 Flush 696017 7.485 228.835 Throughput 1049.91 MB/sec 32 clients 32 procs max_latency=228.847 ms --> +4.1% throughput, -39.6% max latency Before applying patchset, 64 clients: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 8956748 0.342 108.312 Close 6579660 0.001 3.823 Rename 379209 2.396 81.897 Unlink 1808625 1.108 131.148 Deltree 256 25.632 172.176 Mkdir 128 0.003 0.018 Qpathinfo 8117615 0.131 55.916 Qfileinfo 1423495 0.001 2.635 Qfsinfo 1488496 0.002 5.412 Sfileinfo 729472 0.007 8.643 Find 3138598 0.855 78.321 WriteX 4470783 0.102 79.442 ReadX 14038139 0.002 7.578 LockX 29158 0.002 0.844 UnlockX 29158 0.001 0.567 Flush 627746 14.168 506.151 Throughput 924.738 MB/sec 64 clients 64 procs max_latency=506.154 ms After applying patchset, 64 clients: Operation Count AvgLat MaxLat ---------------------------------------- NTCreateX 9069003 0.303 43.193 Close 6662328 0.001 3.888 Rename 383976 2.194 46.418 Unlink 1831080 1.022 43.873 Deltree 256 24.037 155.763 Mkdir 128 0.002 0.005 Qpathinfo 8219173 0.137 30.233 Qfileinfo 1441203 0.001 3.204 Qfsinfo 1507092 0.002 4.055 Sfileinfo 738775 0.006 5.431 Find 3177874 0.936 38.170 WriteX 4526152 0.084 39.518 ReadX 14213562 0.002 24.760 LockX 29522 0.002 1.221 UnlockX 29522 0.001 0.694 Flush 635652 14.358 422.039 Throughput 990.13 MB/sec 64 clients 64 procs max_latency=422.043 ms --> +6.8% throughput, -18.1% max latency Signed-off-by: Filipe Manana Reviewed-by: Josef Bacik --- fs/btrfs/file.c | 1 + fs/btrfs/transaction.c | 39 +++++++++++++++++++++++++++++++-------- fs/btrfs/transaction.h | 2 ++ 3 files changed, 34 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c index d81ae1f518f2..be5350f5bedf 100644 --- a/fs/btrfs/file.c +++ b/fs/btrfs/file.c @@ -2238,6 +2238,7 @@ int btrfs_sync_file(struct file *file, loff_t start, loff_t end, int datasync) ret = PTR_ERR(trans); goto out_release_extents; } + trans->in_fsync = true; ret = btrfs_log_dentry_safe(trans, dentry, &ctx); btrfs_release_log_ctx_extents(&ctx); diff --git a/fs/btrfs/transaction.c b/fs/btrfs/transaction.c index 3bcb5444536e..ff8efa6f8986 100644 --- a/fs/btrfs/transaction.c +++ b/fs/btrfs/transaction.c @@ -107,6 +107,11 @@ static const unsigned int btrfs_blocked_trans_types[TRANS_STATE_MAX] = { __TRANS_JOIN | __TRANS_JOIN_NOLOCK | __TRANS_JOIN_NOSTART), + [TRANS_STATE_SUPER_COMMITTED] = (__TRANS_START | + __TRANS_ATTACH | + __TRANS_JOIN | + __TRANS_JOIN_NOLOCK | + __TRANS_JOIN_NOSTART), [TRANS_STATE_COMPLETED] = (__TRANS_START | __TRANS_ATTACH | __TRANS_JOIN | @@ -826,10 +831,11 @@ btrfs_attach_transaction_barrier(struct btrfs_root *root) return trans; } -/* wait for a transaction commit to be fully complete */ -static noinline void wait_for_commit(struct btrfs_transaction *commit) +/* Wait for a transaction commit to reach at least the given state. */ +static noinline void wait_for_commit(struct btrfs_transaction *commit, + const enum btrfs_trans_state min_state) { - wait_event(commit->commit_wait, commit->state == TRANS_STATE_COMPLETED); + wait_event(commit->commit_wait, commit->state >= min_state); } int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid) @@ -884,7 +890,7 @@ int btrfs_wait_for_commit(struct btrfs_fs_info *fs_info, u64 transid) goto out; /* nothing committing|committed */ } - wait_for_commit(cur_trans); + wait_for_commit(cur_trans, TRANS_STATE_COMPLETED); btrfs_put_transaction(cur_trans); out: return ret; @@ -2102,11 +2108,15 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) spin_lock(&fs_info->trans_lock); if (cur_trans->state >= TRANS_STATE_COMMIT_START) { + enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED; + spin_unlock(&fs_info->trans_lock); refcount_inc(&cur_trans->use_count); - ret = btrfs_end_transaction(trans); - wait_for_commit(cur_trans); + if (trans->in_fsync) + want_state = TRANS_STATE_SUPER_COMMITTED; + ret = btrfs_end_transaction(trans); + wait_for_commit(cur_trans, want_state); if (TRANS_ABORTED(cur_trans)) ret = cur_trans->aborted; @@ -2120,13 +2130,19 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) wake_up(&fs_info->transaction_blocked_wait); if (cur_trans->list.prev != &fs_info->trans_list) { + enum btrfs_trans_state want_state = TRANS_STATE_COMPLETED; + + if (trans->in_fsync) + want_state = TRANS_STATE_SUPER_COMMITTED; + prev_trans = list_entry(cur_trans->list.prev, struct btrfs_transaction, list); - if (prev_trans->state != TRANS_STATE_COMPLETED) { + if (prev_trans->state < want_state) { refcount_inc(&prev_trans->use_count); spin_unlock(&fs_info->trans_lock); - wait_for_commit(prev_trans); + wait_for_commit(prev_trans, want_state); + ret = READ_ONCE(prev_trans->aborted); btrfs_put_transaction(prev_trans); @@ -2345,6 +2361,13 @@ int btrfs_commit_transaction(struct btrfs_trans_handle *trans) if (ret) goto scrub_continue; + /* + * We needn't acquire the lock here because there is no other task + * which can change it. + */ + cur_trans->state = TRANS_STATE_SUPER_COMMITTED; + wake_up(&cur_trans->commit_wait); + btrfs_finish_extent_commit(trans); if (test_bit(BTRFS_TRANS_HAVE_FREE_BGS, &cur_trans->flags)) diff --git a/fs/btrfs/transaction.h b/fs/btrfs/transaction.h index 31ca81bad822..935bd6958a8a 100644 --- a/fs/btrfs/transaction.h +++ b/fs/btrfs/transaction.h @@ -16,6 +16,7 @@ enum btrfs_trans_state { TRANS_STATE_COMMIT_START, TRANS_STATE_COMMIT_DOING, TRANS_STATE_UNBLOCKED, + TRANS_STATE_SUPER_COMMITTED, TRANS_STATE_COMPLETED, TRANS_STATE_MAX, }; @@ -133,6 +134,7 @@ struct btrfs_trans_handle { bool can_flush_pending_bgs; bool reloc_reserved; bool dirty; + bool in_fsync; struct btrfs_root *root; struct btrfs_fs_info *fs_info; struct list_head new_bgs;