From patchwork Wed Feb 3 11:17:44 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12064005 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.5 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id F3631C433E0 for ; Wed, 3 Feb 2021 11:18:38 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 9B7E064F48 for ; Wed, 3 Feb 2021 11:18:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234100AbhBCLSh (ORCPT ); Wed, 3 Feb 2021 06:18:37 -0500 Received: from mail.kernel.org ([198.145.29.99]:56536 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233958AbhBCLSc (ORCPT ); Wed, 3 Feb 2021 06:18:32 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 550D264F67 for ; Wed, 3 Feb 2021 11:17:51 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1612351071; bh=0wduYG+Xvmcpt6w6eT+Eum7UefBBL3KUaDv/Ie+jm5Y=; h=From:To:Subject:Date:In-Reply-To:References:From; b=LvMeNZDxr3qTsb8nQBNBSynF+Z1wyP3YVpULuobgytkNwFziUrDHVL3pY9cROtptk aUUPjE9RUAJ1PHRXpLeiJCFeRAibaOGFjZfLiBb3FApIIHB5Ze8S3ih4DI8SX9y8Bn l7Dtoj3/RW2Rf665cirfHNlhprpBS8/CWtHmaOX7L6I9HwGTTEhsUPxf/nYgkMatsr lLaXgSPppCQVOQ0ND3p0vIU//8pGoUBRna6TxP2DF2hf42IanfwWerdnDHfOUHDZx3 eJ7iMZrb8wjiaHC1JgiX6RODlaqTp/3jzpKDQ1i7jLWIuRhmI8nKVJ0wGvKI/jnJXn 5IGDOA3IHmX1g== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 1/4] btrfs: avoid checking for RO block group twice during nocow writeback Date: Wed, 3 Feb 2021 11:17:44 +0000 Message-Id: <9d42ab56ffa6b454998453764dbb1c899d10bc40.1612350698.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana During the nocow writeback path, we currently iterate the rbtree of block groups twice: once for checking if the target block group is RO with the call to btrfs_extent_readonly()), and once again for getting a nocow reference on the block group with a call to btrfs_inc_nocow_writers(). Since btrfs_inc_nocow_writers() already returns false when the target block group is RO, remove the call to btrfs_extent_readonly(). Not only we avoid searching the blocks group rbtree twice, it also helps reduce contention on the lock that protects it (specially since it is a spin lock and not a read-write lock). That may make a noticeable difference on very large filesystems, with thousands of allocated block groups. Signed-off-by: Filipe Manana Reviewed-by: Anand Jain --- fs/btrfs/inode.c | 4 +--- 1 file changed, 1 insertion(+), 3 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 589030cefd90..b10fc42f9e9a 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -1657,9 +1657,6 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, */ btrfs_release_path(path); - /* If extent is RO, we must COW it */ - if (btrfs_extent_readonly(fs_info, disk_bytenr)) - goto out_check; ret = btrfs_cross_ref_exist(root, ino, found_key.offset - extent_offset, disk_bytenr, false); @@ -1706,6 +1703,7 @@ static noinline int run_delalloc_nocow(struct btrfs_inode *inode, WARN_ON_ONCE(freespace_inode); goto out_check; } + /* If the extent's block group is RO, we must COW. */ if (!btrfs_inc_nocow_writers(fs_info, disk_bytenr)) goto out_check; nocow = true; From patchwork Wed Feb 3 11:17:45 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12064009 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.5 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A41F0C433E0 for ; Wed, 3 Feb 2021 11:19:10 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 5697F64F61 for ; Wed, 3 Feb 2021 11:19:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234128AbhBCLSj (ORCPT ); Wed, 3 Feb 2021 06:18:39 -0500 Received: from mail.kernel.org ([198.145.29.99]:56564 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233972AbhBCLSe (ORCPT ); Wed, 3 Feb 2021 06:18:34 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 3563F64F48 for ; Wed, 3 Feb 2021 11:17:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1612351072; bh=LtjoSQCPGPAkC95J5GEi5Fr0JWkbMmuhZSd9Lp3Z6Vs=; h=From:To:Subject:Date:In-Reply-To:References:From; b=eWgb/LmsWjGcklj2NFpgzGS9TgbSMuvNxxUvKhwYHMQfJiXhwxygV2rYsNgr7kjby kDIjcT5DyaYOQvSObVqKsBIL+ULq5Z85L8HS7hL95U5iJOieg6b39YTmXXJsMfsMzm D6bNo1ma6Nt4/CjpA+CCjuP53ZT2j/4twjij965I4POtpWZJiRjQbhm77EeVrJRUD6 PybHRtPFrzKUcjWlV74mZYtS17VaTFGcHAek1Tx7p18hUFsf4QNMUr8s7qzklAWM70 WknzMU/VqnkNh93itSbKmYIMmFS3+am7h2yvAVopI5sxSai1Q5ALcFkpRNtiphKFoK nuzwQJgEIN7qA== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 2/4] btrfs: fix race between writes to swap files and scrub Date: Wed, 3 Feb 2021 11:17:45 +0000 Message-Id: <0da379a02fdabaf9ca295a34f7de287b5d5465f7.1612350698.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When we active a swap file, at btrfs_swap_activate(), we acquire the exclusive operation lock to prevent the physical location of the swap file extents to be changed by operations such as balance and device replace/resize/remove. We also call there can_nocow_extent() which, among other things, checks if the block group of a swap file extent is currently RO, and if it is we can not use the extent, since a write into it would result in COWing the extent. However we have no protection against a scrub operation running after we activate the swap file, which can result in the swap file extents to be COWed while the scrub is running and operating on the respective block group, because scrub turns a block group into RO before it processes it and then back again to RW mode after processing it. That means an attempt to write into a swap file extent while scrub is processing the respective block group, will result in COWing the extent, changing its physical location on disk. Fix this by making sure that block groups that have extents that are used by active swap files can not be turned into RO mode, therefore making it not possible for a scrub to turn them into RO mode. When a scrub finds a block group that can not be turned to RO due to the existence of extents used by swap files, it proceeds to the next block group and logs a warning message that mentions the block group was skipped due to active swap files - this is the same approach we currently use for balance. This ends up removing the need to call btrfs_extent_readonly() from can_nocow_extent(), as btrfs_swap_activate() now checks if a block group is RO through the new function btrfs_inc_block_group_swap_extents(). The only other caller of can_nocow_extent() is the direct IO write path, btrfs_get_blocks_direct_write(), but that already checks if a block group is RO through the call to btrfs_inc_nocow_writers(). In fact, after this change we end up optimizing the direct IO write path, since we no longer iterate the block groups rbtree twice, once with btrfs_extent_readonly(), through can_nocow_extent(), and once again with btrfs_inc_nocow_writers(). This can save time and reduce contention on the lock that protects the rbtree (specially because it is a spinlock and not a read/write lock) on very large filesystems, with several thousands of allocated block groups. Fixes: ed46ff3d42378 ("Btrfs: support swap files") Signed-off-by: Filipe Manana --- fs/btrfs/block-group.c | 33 ++++++++++++++++++++++++++++++++- fs/btrfs/block-group.h | 9 +++++++++ fs/btrfs/ctree.h | 5 +++++ fs/btrfs/inode.c | 22 ++++++++++++++++++---- fs/btrfs/scrub.c | 9 ++++++++- 5 files changed, 72 insertions(+), 6 deletions(-) diff --git a/fs/btrfs/block-group.c b/fs/btrfs/block-group.c index 5fa6b3d540f4..c0a8ddf92ef8 100644 --- a/fs/btrfs/block-group.c +++ b/fs/btrfs/block-group.c @@ -1150,6 +1150,11 @@ static int inc_block_group_ro(struct btrfs_block_group *cache, int force) spin_lock(&sinfo->lock); spin_lock(&cache->lock); + if (cache->swap_extents) { + ret = -ETXTBSY; + goto out; + } + if (cache->ro) { cache->ro++; ret = 0; @@ -2260,7 +2265,7 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache, } ret = inc_block_group_ro(cache, 0); - if (!do_chunk_alloc) + if (!do_chunk_alloc || ret == -ETXTBSY) goto unlock_out; if (!ret) goto out; @@ -2269,6 +2274,8 @@ int btrfs_inc_block_group_ro(struct btrfs_block_group *cache, if (ret < 0) goto out; ret = inc_block_group_ro(cache, 0); + if (ret == -ETXTBSY) + goto unlock_out; out: if (cache->flags & BTRFS_BLOCK_GROUP_SYSTEM) { alloc_flags = btrfs_get_alloc_profile(fs_info, cache->flags); @@ -3352,6 +3359,7 @@ int btrfs_free_block_groups(struct btrfs_fs_info *info) ASSERT(list_empty(&block_group->io_list)); ASSERT(list_empty(&block_group->bg_list)); ASSERT(refcount_read(&block_group->refs) == 1); + ASSERT(block_group->swap_extents == 0); btrfs_put_block_group(block_group); spin_lock(&info->block_group_cache_lock); @@ -3418,3 +3426,26 @@ void btrfs_unfreeze_block_group(struct btrfs_block_group *block_group) __btrfs_remove_free_space_cache(block_group->free_space_ctl); } } + +bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg) +{ + bool ret = true; + + spin_lock(&bg->lock); + if (bg->ro) + ret = false; + else + bg->swap_extents++; + spin_unlock(&bg->lock); + + return ret; +} + +void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount) +{ + spin_lock(&bg->lock); + ASSERT(!bg->ro); + ASSERT(bg->swap_extents >= amount); + bg->swap_extents -= amount; + spin_unlock(&bg->lock); +} diff --git a/fs/btrfs/block-group.h b/fs/btrfs/block-group.h index 8f74a96074f7..105094bd1821 100644 --- a/fs/btrfs/block-group.h +++ b/fs/btrfs/block-group.h @@ -181,6 +181,12 @@ struct btrfs_block_group { */ int needs_free_space; + /* + * Number of extents in this block group used for swap files. + * All accesses protected by the spinlock 'lock'. + */ + int swap_extents; + /* Record locked full stripes for RAID5/6 block group */ struct btrfs_full_stripe_locks_tree full_stripe_locks_root; }; @@ -296,6 +302,9 @@ static inline int btrfs_block_group_done(struct btrfs_block_group *cache) void btrfs_freeze_block_group(struct btrfs_block_group *cache); void btrfs_unfreeze_block_group(struct btrfs_block_group *cache); +bool btrfs_inc_block_group_swap_extents(struct btrfs_block_group *bg); +void btrfs_dec_block_group_swap_extents(struct btrfs_block_group *bg, int amount); + #ifdef CONFIG_BTRFS_FS_RUN_SANITY_TESTS int btrfs_rmap_block(struct btrfs_fs_info *fs_info, u64 chunk_start, u64 physical, u64 **logical, int *naddrs, int *stripe_len); diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index ed6bb46a2572..5269777a4fb4 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -523,6 +523,11 @@ struct btrfs_swapfile_pin { * points to a struct btrfs_device. */ bool is_block_group; + /* + * Only used when 'is_block_group' is true and it is the number of + * extents used by a swapfile for this block group ('ptr' field). + */ + int bg_extent_count; }; bool btrfs_pinned_by_swapfile(struct btrfs_fs_info *fs_info, void *ptr); diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index b10fc42f9e9a..464c289c402d 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -7204,9 +7204,6 @@ noinline int can_nocow_extent(struct inode *inode, u64 offset, u64 *len, *ram_bytes = btrfs_file_extent_ram_bytes(leaf, fi); } - if (btrfs_extent_readonly(fs_info, disk_bytenr)) - goto out; - num_bytes = min(offset + *len, extent_end) - offset; if (!nocow && found_type == BTRFS_FILE_EXTENT_PREALLOC) { u64 range_end; @@ -9990,6 +9987,7 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr, sp->ptr = ptr; sp->inode = inode; sp->is_block_group = is_block_group; + sp->bg_extent_count = 1; spin_lock(&fs_info->swapfile_pins_lock); p = &fs_info->swapfile_pins.rb_node; @@ -10003,6 +10001,8 @@ static int btrfs_add_swapfile_pin(struct inode *inode, void *ptr, (sp->ptr == entry->ptr && sp->inode > entry->inode)) { p = &(*p)->rb_right; } else { + if (is_block_group) + entry->bg_extent_count++; spin_unlock(&fs_info->swapfile_pins_lock); kfree(sp); return 1; @@ -10028,8 +10028,11 @@ static void btrfs_free_swapfile_pins(struct inode *inode) sp = rb_entry(node, struct btrfs_swapfile_pin, node); if (sp->inode == inode) { rb_erase(&sp->node, &fs_info->swapfile_pins); - if (sp->is_block_group) + if (sp->is_block_group) { + btrfs_dec_block_group_swap_extents(sp->ptr, + sp->bg_extent_count); btrfs_put_block_group(sp->ptr); + } kfree(sp); } node = next; @@ -10244,6 +10247,17 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file, goto out; } + if (!btrfs_inc_block_group_swap_extents(bg)) { + btrfs_warn(fs_info, + "block group for swapfile at %llu is read-only%s", + bg->start, + atomic_read(&fs_info->scrubs_running) ? + " (scrub running)" : ""); + btrfs_put_block_group(bg); + ret = -EINVAL; + goto out; + } + ret = btrfs_add_swapfile_pin(inode, bg, true); if (ret) { btrfs_put_block_group(bg); diff --git a/fs/btrfs/scrub.c b/fs/btrfs/scrub.c index 5f4f88a4d2c8..c09a494be8c6 100644 --- a/fs/btrfs/scrub.c +++ b/fs/btrfs/scrub.c @@ -3630,6 +3630,13 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, * commit_transactions. */ ro_set = 0; + } else if (ret == -ETXTBSY) { + btrfs_warn(fs_info, + "skipping scrub of block group %llu due to active swapfile", + cache->start); + scrub_pause_off(fs_info); + ret = 0; + goto skip_unfreeze; } else { btrfs_warn(fs_info, "failed setting block group ro: %d", ret); @@ -3719,7 +3726,7 @@ int scrub_enumerate_chunks(struct scrub_ctx *sctx, } else { spin_unlock(&cache->lock); } - +skip_unfreeze: btrfs_unfreeze_block_group(cache); btrfs_put_block_group(cache); if (ret) From patchwork Wed Feb 3 11:17:46 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12064003 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.5 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 55FF3C433E6 for ; Wed, 3 Feb 2021 11:18:38 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 0FD8464F48 for ; Wed, 3 Feb 2021 11:18:38 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234083AbhBCLSh (ORCPT ); Wed, 3 Feb 2021 06:18:37 -0500 Received: from mail.kernel.org ([198.145.29.99]:56586 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233983AbhBCLSe (ORCPT ); Wed, 3 Feb 2021 06:18:34 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id 17A5864F74 for ; Wed, 3 Feb 2021 11:17:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1612351073; bh=eqLKGlahv72gTBWCi8BJ/rDOzf8XUgsXJ3XVqRYA03o=; h=From:To:Subject:Date:In-Reply-To:References:From; b=AvDAm7v+KJuWaUX0Cgf8nVLLnsyJvHfUZVZn83njRCtrh51+T7pn2LJM13VxrC8gk Tyd8qJStTRmMgdsYfoWQZ6sgG8k+O3a2a3W411SCk0OmYKOl+phC0JWaGxI4lttsur s/wuOprOPAQAqFYNFaO2RSJOKMqbQDsBelEv9K9kLkN4pU/euGvq+xF2lkZnhiZfos O09UX/E/N6PM6tEVbDxapJ9SKmwdKCJvJhs9XaZuIlsl7pI0CXeHGZ4LHsqT365E3+ w8ebmqJk8HPo22fgBu5pg2kNCZFRbc+ZT15t+pa9EgC9L+mX0/HBsy9Wjd8p0u+1Pd sVJgebOeAI3OQ== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 3/4] btrfs: remove no longer used function btrfs_extent_readonly() Date: Wed, 3 Feb 2021 11:17:46 +0000 Message-Id: <62b700e55d7b9cec8cf4d19e1adfcaa450d9cfed.1612350698.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana After the two previous patches: btrfs: avoid checking for RO block group twice during nocow writeback btrfs: fix race between writes to swap files and scrub it is no longer used, so just remove it. Signed-off-by: Filipe Manana --- fs/btrfs/ctree.h | 1 - fs/btrfs/extent-tree.c | 13 ------------- 2 files changed, 14 deletions(-) diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h index 5269777a4fb4..65d158082b86 100644 --- a/fs/btrfs/ctree.h +++ b/fs/btrfs/ctree.h @@ -2686,7 +2686,6 @@ int btrfs_finish_extent_commit(struct btrfs_trans_handle *trans); int btrfs_inc_extent_ref(struct btrfs_trans_handle *trans, struct btrfs_ref *generic_ref); -int btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr); void btrfs_clear_space_info_full(struct btrfs_fs_info *info); /* diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c index 5476ab84e544..11ef4259cacb 100644 --- a/fs/btrfs/extent-tree.c +++ b/fs/btrfs/extent-tree.c @@ -2455,19 +2455,6 @@ int btrfs_dec_ref(struct btrfs_trans_handle *trans, struct btrfs_root *root, return __btrfs_mod_ref(trans, root, buf, full_backref, 0); } -int btrfs_extent_readonly(struct btrfs_fs_info *fs_info, u64 bytenr) -{ - struct btrfs_block_group *block_group; - int readonly = 0; - - block_group = btrfs_lookup_block_group(fs_info, bytenr); - if (!block_group || block_group->ro) - readonly = 1; - if (block_group) - btrfs_put_block_group(block_group); - return readonly; -} - static u64 get_alloc_profile_by_root(struct btrfs_root *root, int data) { struct btrfs_fs_info *fs_info = root->fs_info; From patchwork Wed Feb 3 11:17:47 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 12064007 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-19.5 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,URIBL_BLOCKED,USER_AGENT_GIT autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id EA480C433E6 for ; Wed, 3 Feb 2021 11:18:40 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id A124464F48 for ; Wed, 3 Feb 2021 11:18:40 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234138AbhBCLSj (ORCPT ); Wed, 3 Feb 2021 06:18:39 -0500 Received: from mail.kernel.org ([198.145.29.99]:56604 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234021AbhBCLSg (ORCPT ); Wed, 3 Feb 2021 06:18:36 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id EF4D664F6A for ; Wed, 3 Feb 2021 11:17:53 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1612351074; bh=yq8wnwlglHlDuix1iQoTRsp8q8A7iSJniFO1ZuPkI9w=; h=From:To:Subject:Date:In-Reply-To:References:From; b=k/QkkHweLypvNkjvt9QUSKrU2Qwok++I1JKvq/+Tu5AeLa0RbxnySlqxh6Dcdx3RD XG0K+W17qA4BAn6QPGPkSv4Zwdvl0ImAnaUZJLwccz9NlfAmzu7T5C8ukz+Tco1zzW s1fKD5+LAzlIX9yZrYv1m2i46Cit1mokL5BYOQwzZfMJhzXPINENwphlHOA8UUB86d 08K9dyEn+LsTROS3w5ReU6Bg7snWX+NtI7RGNaQ8vtVjb8D7PmudIMSNV88abmBUnE I9pmTxDum2dS7b5Z06/R4YOHijhQx3dhr7vPTnvsq7JBpEh9fS6VSxN1LhkLNmrBhQ xFmA9ReLunK9A== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 4/4] btrfs: fix race between swap file activation and snapshot creation Date: Wed, 3 Feb 2021 11:17:47 +0000 Message-Id: <84a80739d04185bc50a4f911d2284d3b9f80cf4a.1612350698.git.fdmanana@suse.com> X-Mailer: git-send-email 2.25.1 In-Reply-To: References: MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-btrfs@vger.kernel.org From: Filipe Manana When creating a snapshot we check if the current number of swap files, in the root, is non-zero, and if it is, we error out and warn that we can not create the snapshot because there are active swap files. However this is racy because when a task started activation of a swap file, another task might have started already snapshot creation and might have seen the counter for the number of swap files as zero. This means that after the swap file is activated we may end up with a snapshot of the same root successfully created, and therefore when the first write to the swap file happens it has to fall back into COW mode, which should never happen for active swap files. Basically what can happen is: 1) Task A starts snapshot creation and enters ioctl.c:create_snapshot(). There it sees that root->nr_swapfiles has a value of 0 so it continues; 2) Task B enters btrfs_swap_activate(). It is not aware that another task started snapshot creation but it did not finish yet. It increments root->nr_swapfiles from 0 to 1; 3) Task B checks that the file meets all requirements to be an active swap file - it has NOCOW set, there are no snapshots for the inode's root at the moment, no file holes, no reflinked extents, etc; 4) Task B returns success and now the file is an active swap file; 5) Task A commits the transaction to create the snapshot and finishes. The swap file's extents are now shared between the original root and the snapshot; 6) A write into an extent of the swap file is attempted - there is a snapshot of the file's root, so we fall back to COW mode and therefore the physical location of the extent changes on disk. So fix this by taking the snapshot lock during swap file activation before locking the extent range, as that is the order in which we lock these during buffered writes. Fixes: ed46ff3d42378 ("Btrfs: support swap files") Signed-off-by: Filipe Manana --- fs/btrfs/inode.c | 21 +++++++++++++++++++-- 1 file changed, 19 insertions(+), 2 deletions(-) diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c index 464c289c402d..ca89f3166dd7 100644 --- a/fs/btrfs/inode.c +++ b/fs/btrfs/inode.c @@ -10093,7 +10093,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file, sector_t *span) { struct inode *inode = file_inode(file); - struct btrfs_fs_info *fs_info = BTRFS_I(inode)->root->fs_info; + struct btrfs_root *root = BTRFS_I(inode)->root; + struct btrfs_fs_info *fs_info = root->fs_info; struct extent_io_tree *io_tree = &BTRFS_I(inode)->io_tree; struct extent_state *cached_state = NULL; struct extent_map *em = NULL; @@ -10144,13 +10145,27 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file, "cannot activate swapfile while exclusive operation is running"); return -EBUSY; } + + /* + * Prevent snapshot creation while we are activating the swap file. + * We do not want to race with snapshot creation. If snapshot creation + * already started before we bumped nr_swapfiles from 0 to 1 and + * completes before the first write into the swap file after it is + * activated, than that write would fallback to COW. + */ + if (!btrfs_drew_try_write_lock(&root->snapshot_lock)) { + btrfs_exclop_finish(fs_info); + btrfs_warn(fs_info, + "cannot activate swapfile because snapshot creation is in progress"); + return -EINVAL; + } /* * Snapshots can create extents which require COW even if NODATACOW is * set. We use this counter to prevent snapshots. We must increment it * before walking the extents because we don't want a concurrent * snapshot to run after we've already checked the extents. */ - atomic_inc(&BTRFS_I(inode)->root->nr_swapfiles); + atomic_inc(&root->nr_swapfiles); isize = ALIGN_DOWN(inode->i_size, fs_info->sectorsize); @@ -10296,6 +10311,8 @@ static int btrfs_swap_activate(struct swap_info_struct *sis, struct file *file, if (ret) btrfs_swap_deactivate(file); + btrfs_drew_write_unlock(&root->snapshot_lock); + btrfs_exclop_finish(fs_info); if (ret)