From patchwork Mon Jul 8 14:42:43 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 13726696 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 21D451C06 for ; Mon, 8 Jul 2024 14:42:49 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720449770; cv=none; b=cVfO5ly6crNtm4vt5H3Np1ujz0Y4HKiXIpv424bp5xyUFZvYrvVc62/DLjhPDsBN+dF38irSWeaLvjSiZVsbYgoG158/HxFaoDhnHKlytiz/9O48ZCcqxJ8GsjllsSv14vAFONjrtrH19+/miPuAkbcNynBeBEd5Mx+7V0z4m4M= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720449770; c=relaxed/simple; bh=k3YmQzhAw89QLnveLsVJtAEbrgsd/BD+jzp6gNxPP+0=; h=From:To:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=AzGE5wK8Z7dKYGmlC4pc0U4Fn0qx9wtvc80zx87vaOBczfxpjOefflZ5IIVcRlWRhe+wxXAs66Ury+9ZnaZB4ilxGtxEWwOg8dtcQG5vS5RIndLRQ4pV2nXk8xRuvpa2t7D7FaIwkr5JGUV1x0gTdozqkcw+veQ84kYmyTvkfm0= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=VYmUIkOT; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="VYmUIkOT" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 33E52C4AF0A for ; Mon, 8 Jul 2024 14:42:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1720449769; bh=k3YmQzhAw89QLnveLsVJtAEbrgsd/BD+jzp6gNxPP+0=; h=From:To:Subject:Date:In-Reply-To:References:From; b=VYmUIkOTCFNfdux4Fl0KF+VWOT1AvGYvg3PA2ES+iF8k6WsuleWx6UiZUL/egwHcw u9Dbd2sH+lROvZnW3W2RA/is+g/RNVtJ3lN2i+FEEtlJirwnzrGY/cJS/Q3LAfZxno sjQWpRH8CDZn+TZGCf8NyDQZS/63hA3rEzZ1lVGoet3eRoDYaYrAVmren3roAWZ3zK 1PDzACt8QBK+LHHY/UtLnQuKvU/CU4638R7cdqH6hoJ/lhx6reM3I0Zoxmav82Aovj msps7MdsHJQq8NrdOHyXatNkm8udVaD32vO516T42T/oemxQ3Rm1VyIzG9QEKF38/i 15huWrhvJ28VA== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 1/3] btrfs: use delayed iput during extent map shrinking Date: Mon, 8 Jul 2024 15:42:43 +0100 Message-Id: <7b1a85693b56d215ec4c759fb2a29017f94661e4.1720448664.git.fdmanana@suse.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Filipe Manana When putting an inode during extent map shrinking we're doing a standard iput() but that may take a long time in case the inode is dirty and we are doing the final iput that triggers eviction - the VFS will have to wait for writeback before calling the btrfs evict callback (see fs/inode.c:evict()). This slows down the task running the shrinker which may have been triggered while updating some tree for example, meaning locks are held as well as an open transaction handle. Also if the iput() ends up triggering eviction and the inode has no links anymore, then we trigger item truncation which requires flushing delayed items, space reservation to start a transaction and that may trigger the space reclaim task and wait for it, resulting in deadlocks in case the reclaim task needs for example to commit a transaction and the shrinker is being triggered from a path holding a transaction handle. Syzbot reported such a case with the following stack traces: ====================================================== WARNING: possible circular locking dependency detected 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Not tainted ------------------------------------------------------ kswapd0/111 is trying to acquire lock: ffff88801eae4610 (sb_internal#3){.+.+}-{0:0}, at: btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 but task is already holding lock: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (fs_reclaim){+.+.}-{0:0}: __fs_reclaim_acquire mm/page_alloc.c:3783 [inline] fs_reclaim_acquire+0x102/0x160 mm/page_alloc.c:3797 might_alloc include/linux/sched/mm.h:334 [inline] slab_pre_alloc_hook mm/slub.c:3890 [inline] slab_alloc_node mm/slub.c:3980 [inline] kmem_cache_alloc_lru_noprof+0x58/0x2f0 mm/slub.c:4019 btrfs_alloc_inode+0x118/0xb20 fs/btrfs/inode.c:8411 alloc_inode+0x5d/0x230 fs/inode.c:261 iget5_locked fs/inode.c:1235 [inline] iget5_locked+0x1c9/0x2c0 fs/inode.c:1228 btrfs_iget_locked fs/btrfs/inode.c:5590 [inline] btrfs_iget_path fs/btrfs/inode.c:5607 [inline] btrfs_iget+0xfb/0x230 fs/btrfs/inode.c:5636 create_reloc_inode+0x403/0x820 fs/btrfs/relocation.c:3911 btrfs_relocate_block_group+0x471/0xe60 fs/btrfs/relocation.c:4114 btrfs_relocate_chunk+0x143/0x450 fs/btrfs/volumes.c:3373 __btrfs_balance fs/btrfs/volumes.c:4157 [inline] btrfs_balance+0x211a/0x3f00 fs/btrfs/volumes.c:4534 btrfs_ioctl_balance fs/btrfs/ioctl.c:3675 [inline] btrfs_ioctl+0x12ed/0x8290 fs/btrfs/ioctl.c:4742 __do_compat_sys_ioctl+0x2c3/0x330 fs/ioctl.c:1007 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #2 (btrfs_trans_num_extwriters){++++}-{0:0}: join_transaction+0x164/0xf40 fs/btrfs/transaction.c:315 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #1 (btrfs_trans_num_writers){++++}-{0:0}: join_transaction+0x148/0xf40 fs/btrfs/transaction.c:314 start_transaction+0x427/0x1a70 fs/btrfs/transaction.c:700 btrfs_rebuild_free_space_tree+0xaa/0x480 fs/btrfs/free-space-tree.c:1323 btrfs_start_pre_rw_mount+0x218/0xf60 fs/btrfs/disk-io.c:2999 open_ctree+0x41ab/0x52e0 fs/btrfs/disk-io.c:3554 btrfs_fill_super fs/btrfs/super.c:946 [inline] btrfs_get_tree_super fs/btrfs/super.c:1863 [inline] btrfs_get_tree+0x11e9/0x1b90 fs/btrfs/super.c:2089 vfs_get_tree+0x8f/0x380 fs/super.c:1780 fc_mount+0x16/0xc0 fs/namespace.c:1125 btrfs_get_tree_subvol fs/btrfs/super.c:2052 [inline] btrfs_get_tree+0xa53/0x1b90 fs/btrfs/super.c:2090 vfs_get_tree+0x8f/0x380 fs/super.c:1780 do_new_mount fs/namespace.c:3352 [inline] path_mount+0x6e1/0x1f10 fs/namespace.c:3679 do_mount fs/namespace.c:3692 [inline] __do_sys_mount fs/namespace.c:3898 [inline] __se_sys_mount fs/namespace.c:3875 [inline] __ia32_sys_mount+0x295/0x320 fs/namespace.c:3875 do_syscall_32_irqs_on arch/x86/entry/common.c:165 [inline] __do_fast_syscall_32+0x73/0x120 arch/x86/entry/common.c:386 do_fast_syscall_32+0x32/0x80 arch/x86/entry/common.c:411 entry_SYSENTER_compat_after_hwframe+0x84/0x8e -> #0 (sb_internal#3){.+.+}-{0:0}: check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 other info that might help us debug this: Chain exists of: sb_internal#3 --> btrfs_trans_num_extwriters --> fs_reclaim Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(fs_reclaim); lock(btrfs_trans_num_extwriters); lock(fs_reclaim); rlock(sb_internal#3); *** DEADLOCK *** 2 locks held by kswapd0/111: #0: ffffffff8dd3a9a0 (fs_reclaim){+.+.}-{0:0}, at: balance_pgdat+0xa88/0x1970 mm/vmscan.c:6924 #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_trylock_shared fs/super.c:562 [inline] #1: ffff88801eae40e0 (&type->s_umount_key#62){++++}-{3:3}, at: super_cache_scan+0x96/0x550 fs/super.c:196 stack backtrace: CPU: 0 PID: 111 Comm: kswapd0 Not tainted 6.10.0-rc2-syzkaller-00010-g2ab795141095 #0 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.2-debian-1.16.2-1 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x116/0x1f0 lib/dump_stack.c:114 check_noncircular+0x31a/0x400 kernel/locking/lockdep.c:2187 check_prev_add kernel/locking/lockdep.c:3134 [inline] check_prevs_add kernel/locking/lockdep.c:3253 [inline] validate_chain kernel/locking/lockdep.c:3869 [inline] __lock_acquire+0x2478/0x3b30 kernel/locking/lockdep.c:5137 lock_acquire kernel/locking/lockdep.c:5754 [inline] lock_acquire+0x1b1/0x560 kernel/locking/lockdep.c:5719 percpu_down_read include/linux/percpu-rwsem.h:51 [inline] __sb_start_write include/linux/fs.h:1655 [inline] sb_start_intwrite include/linux/fs.h:1838 [inline] start_transaction+0xbc1/0x1a70 fs/btrfs/transaction.c:694 btrfs_commit_inode_delayed_inode+0x110/0x330 fs/btrfs/delayed-inode.c:1275 btrfs_evict_inode+0x960/0xe80 fs/btrfs/inode.c:5291 evict+0x2ed/0x6c0 fs/inode.c:667 iput_final fs/inode.c:1741 [inline] iput.part.0+0x5a8/0x7f0 fs/inode.c:1767 iput+0x5c/0x80 fs/inode.c:1757 btrfs_scan_root fs/btrfs/extent_map.c:1118 [inline] btrfs_free_extent_maps+0xbd3/0x1320 fs/btrfs/extent_map.c:1189 super_cache_scan+0x409/0x550 fs/super.c:227 do_shrink_slab+0x44f/0x11c0 mm/shrinker.c:435 shrink_slab+0x18a/0x1310 mm/shrinker.c:662 shrink_one+0x493/0x7c0 mm/vmscan.c:4790 shrink_many mm/vmscan.c:4851 [inline] lru_gen_shrink_node+0x89f/0x1750 mm/vmscan.c:4951 shrink_node mm/vmscan.c:5910 [inline] kswapd_shrink_node mm/vmscan.c:6720 [inline] balance_pgdat+0x1105/0x1970 mm/vmscan.c:6911 kswapd+0x5ea/0xbf0 mm/vmscan.c:7180 kthread+0x2c1/0x3a0 kernel/kthread.c:389 ret_from_fork+0x45/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 So fix this by using btrfs_add_delayed_iput() so that the final iput is delegated to the cleaner kthread. Link: https://lore.kernel.org/linux-btrfs/000000000000892280061a344581@google.com/ Reported-by: syzbot+3dad89b3993a4b275e72@syzkaller.appspotmail.com Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") Reviewed-by: David Sterba Signed-off-by: Filipe Manana --- fs/btrfs/extent_map.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 744e8952abb0..cb74d382a24f 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -1115,7 +1115,7 @@ static long btrfs_scan_root(struct btrfs_root *root, long *scanned, long nr_to_s min_ino = btrfs_ino(inode) + 1; fs_info->extent_map_shrinker_last_ino = btrfs_ino(inode); - iput(&inode->vfs_inode); + btrfs_add_delayed_iput(inode); if (*scanned >= nr_to_scan) break; From patchwork Mon Jul 8 14:42:44 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 13726697 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id BA3B613DBBF for ; Mon, 8 Jul 2024 14:42:50 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720449770; cv=none; b=gs+W12rssAKhCPXyhRnM0ZsgHnd2XCV4BwsV4WOc4VSJUGF2mNjhuvxFdhzmSMXN+dPq9zA20UQVr8Z4qa4nYuKFgH8LtezSTvMIadM8R1GbJY5BHPwq1yj8rHFe3cxRryiLeHYqCG8MU6O0/dw1w5x0GaDT566ptwBngOkMfrk= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720449770; c=relaxed/simple; bh=e5N6kTSYo077DXWnkSjOCy8Kw+bbBDDCGwyZKzDVoW0=; h=From:To:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=mxLHHdw7wW+mtElCTbPpFzF5TQQCg7wszf/8kUPieyVetoWVNXUuAagK8I05p3R9nYOayb7VejCoc7vMUIKywTDqkxEVOoKQlxh3+4CMgl+a7MOEnW5k/fKGnYl0ugeBZzktb7evepxGQMmVzUW3DshYDruEfD36j6TSwY4k6cg= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=TPGAtDfr; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="TPGAtDfr" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 259B9C116B1 for ; Mon, 8 Jul 2024 14:42:49 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1720449770; bh=e5N6kTSYo077DXWnkSjOCy8Kw+bbBDDCGwyZKzDVoW0=; h=From:To:Subject:Date:In-Reply-To:References:From; b=TPGAtDfrAwlaea6uaVvW6p0QiIt+MOacsdrIdSQrSXE6qrM8DFqtkIFI3copk2Ix5 cn5/wH1R82NRx2r4BthWI2muy6Sq3hWECQV62GPiaijRVyxAuHdbJr2VyR1EkS0D3B DlVEdtp7OPRcEjRr6xNAlg++drvavXOJvdq5XVvGYBvLe00CXRb+yl+U/Yr8x7TIBB 6KXhy4LApxQDgxDGCf95l5rGfjUCts+uNjOo5F4FBahKoukaq3JWpqlaU5aFhvIP/5 4muERxtAm3rt07txAEIbCswzSfZwxFIDeD9VGzSXzNZ+ad9RGSZiTpe+JLWY3PYuxs cBUllyncvk3zg== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 2/3] btrfs: stop extent map shrinker if reschedule is needed Date: Mon, 8 Jul 2024 15:42:44 +0100 Message-Id: <2808b9945a3795d7d8119eb6c1966a3c27767d0e.1720448664.git.fdmanana@suse.com> X-Mailer: git-send-email 2.34.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Filipe Manana The extent map shrinker can be called in a variety of contextes where we are under memory pressure, and of them is when a task is trying to allocate memory. For this reason the shrinker is typically called with a value of struct shrink_control::nr_to_scan that is much smaller than what we return in the nr_cached_objects callback of struct super_operations (fs/btrfs/super.c:btrfs_nr_cached_objects()), so that the shrinker does not take a long time and cause high latencies. However we can still take a lot of time in the shrinker even for a limited amount of nr_to_scan: 1) When traversing the red black tree that tracks open inodes in a root, as for example with millions of open inodes we get a deep tree which takes time searching for an inode; 2) Iterating over the extent map tree, which is a red black tree, of an inode when doing the rb_next() calls and when removing an extent map from the tree, since often that requires rebalancing the red black tree; 3) When trying to write lock an inode's extent map tree we may wait for a significant amount of time, because there's either another task about to do IO and searching for an extent map in the tree or inserting an extent map in the tree, and we can have thousands or even millions of extent maps for an inode. Furthermore, there can be concurrent calls to the shrinker so the lock might be busy simply because there is already another task shrinking extent maps for the same inode; 4) We often reschedule if we need to, which further increases latency. So improve on this by stopping the extent map shrinking code whenever we need to reschedule and make it skip an inode if we can't immediately lock its extent map tree. Reported-by: Mikhail Gavrilov Reported-by: Andrea Gelmini Link: https://lore.kernel.org/linux-btrfs/CABXGCsMmmb36ym8hVNGTiU8yfUS_cGvoUmGCcBrGWq9OxTrs+A@mail.gmail.com/ Signed-off-by: Filipe Manana --- fs/btrfs/extent_map.c | 39 +++++++++++++++++++++++++++++++-------- 1 file changed, 31 insertions(+), 8 deletions(-) diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index cb74d382a24f..887a0e5dc145 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -1057,7 +1057,18 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_t if (!down_read_trylock(&inode->i_mmap_lock)) return 0; - write_lock(&tree->lock); + /* + * We want to be fast because we can be called from any path trying to + * allocate memory, so if the lock is busy we don't want to spend time + * waiting for it - either some task is about to do IO for the inode or + * we may have another task shrinking extent maps, here in this code, so + * skip this inode. + */ + if (!write_trylock(&tree->lock)) { + up_read(&inode->i_mmap_lock); + return 0; + } + node = rb_first_cached(&tree->map); while (node) { struct extent_map *em; @@ -1089,12 +1100,14 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_t break; /* - * Restart if we had to reschedule, and any extent maps that were - * pinned before may have become unpinned after we released the - * lock and took it again. + * Stop if we need to reschedule or there's contention on the + * lock. This is to avoid slowing other tasks trying to take the + * lock and because the shrinker might be called during a memory + * allocation path and we want to avoid taking a very long time + * and slowing down all sorts of tasks. */ - if (cond_resched_rwlock_write(&tree->lock)) - node = rb_first_cached(&tree->map); + if (need_resched() || rwlock_needbreak(&tree->lock)) + break; } write_unlock(&tree->lock); up_read(&inode->i_mmap_lock); @@ -1120,7 +1133,13 @@ static long btrfs_scan_root(struct btrfs_root *root, long *scanned, long nr_to_s if (*scanned >= nr_to_scan) break; - cond_resched(); + /* + * We may be called from memory allocation paths, so we don't + * want to take too much time and slowdown tasks. + */ + if (need_resched()) + break; + inode = btrfs_find_first_inode(root, min_ino); } @@ -1159,7 +1178,11 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan) trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr_to_scan, nr); } - while (scanned < nr_to_scan) { + /* + * We may be called from memory allocation paths, so we don't want to + * take too much time and slowdown tasks, so stop if we need reschedule. + */ + while (scanned < nr_to_scan && !need_resched()) { struct btrfs_root *root; unsigned long count; From patchwork Mon Jul 8 14:42:45 2024 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Filipe Manana X-Patchwork-Id: 13726698 Received: from smtp.kernel.org (aws-us-west-2-korg-mail-1.web.codeaurora.org [10.30.226.201]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 062FB13D272 for ; Mon, 8 Jul 2024 14:42:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=10.30.226.201 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720449772; cv=none; b=E6oeI59W9NzWtugXEc+0cO+g2bpwnEhUgw/EkCga4mtFZ51r6NBup4t8+4KfWXVQChfLBCgRE81E/N2iM0C5EjzDAvCVhCQ9am2nD0FSCbECMvMx0uX/6zgWVPl50jEWc/7rm5WURpaAfWPPstwB0JJ/IPvkeUVNW0kfSTZu/8U= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1720449772; c=relaxed/simple; bh=zWISfhbJO+T/OC+nd3ouA4rHRVIJ0sGnJSIRwHOrwH0=; h=From:To:Subject:Date:Message-Id:In-Reply-To:References: MIME-Version; b=JYpNf1QXi7lGDLaQ+B87FAA5HOJNVSReMwli5EVPTr6bY+s4S57/hBVN8ykASVqYUrFlG+6YI2z9J7WCKxWXMWw70yf9WyLY77t0iq7HqV+4443CcZ/mZcFjSSNl3EG8nrCHl/wVsh/DW3HobJYRoPDf3H9znLOm7YCJ12Q1T30= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b=TdLcZH4Y; arc=none smtp.client-ip=10.30.226.201 Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=kernel.org header.i=@kernel.org header.b="TdLcZH4Y" Received: by smtp.kernel.org (Postfix) with ESMTPSA id 15721C4AF0A for ; Mon, 8 Jul 2024 14:42:50 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1720449771; bh=zWISfhbJO+T/OC+nd3ouA4rHRVIJ0sGnJSIRwHOrwH0=; h=From:To:Subject:Date:In-Reply-To:References:From; b=TdLcZH4YVlFilqMbfsw+hE+5vu4+WtNRZ8u6pnWtqvz8OLsYQ/YMCZ/IJ8nJ6EsCn aJzsvjs4SfSmCBHwueMflIqqVsjU2x8gWsClaZ+kyBQ8m+Avp5myy2oq4TwBwxAMHU WAFH030Lz1W83cZZcM4k6dx3GF8lnh3V23Zb8rkYr68EW+V4MVL0sSv6QERzqgdU/3 2nYnDMxyNtlB13K7y8jZEPk01jcYWgUAtTUdqTxwD1+9cyk3vtUFytsHFaKTFQH+1F LsvYpCoVxlsXSjjeJEH+dzLgeW8LmzC4WIsp3YdBI1j+okkAJH9e/0432ulCSMt14f F0OTV19pD8OEA== From: fdmanana@kernel.org To: linux-btrfs@vger.kernel.org Subject: [PATCH 3/3] btrfs: avoid races when tracking progress for extent map shrinking Date: Mon, 8 Jul 2024 15:42:45 +0100 Message-Id: X-Mailer: git-send-email 2.34.1 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-btrfs@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 From: Filipe Manana We store the progress (root and inode numbers) of the extent map shrinker in fs_info without any synchronization but we can have multiple tasks calling into the shrinker during memory allocations when there's enough memory pressure for example. This can result in a task A reading fs_info->extent_map_shrinker_last_ino after another task B updates it, and task A reading fs_info->extent_map_shrinker_last_root before task B updates it, making task A see an odd state that isn't necessarily harmful but may make it skip certain inode ranges or do more work than necessary by going over the same inodes again. These unprotected accesses would also trigger warnings from tools like KCSAN. So add a lock to protect access to these progress fields. Signed-off-by: Filipe Manana --- fs/btrfs/disk-io.c | 2 + fs/btrfs/extent_map.c | 84 +++++++++++++++++++++++++++--------- fs/btrfs/fs.h | 1 + include/trace/events/btrfs.h | 18 ++++---- 4 files changed, 76 insertions(+), 29 deletions(-) diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c index 38cdb8875e8e..cabb558dbdaa 100644 --- a/fs/btrfs/disk-io.c +++ b/fs/btrfs/disk-io.c @@ -2856,6 +2856,8 @@ static int init_mount_fs_info(struct btrfs_fs_info *fs_info, struct super_block if (ret) return ret; + spin_lock_init(&fs_info->extent_map_shrinker_lock); + ret = percpu_counter_init(&fs_info->dirty_metadata_bytes, 0, GFP_KERNEL); if (ret) return ret; diff --git a/fs/btrfs/extent_map.c b/fs/btrfs/extent_map.c index 887a0e5dc145..b4c9a6aa118c 100644 --- a/fs/btrfs/extent_map.c +++ b/fs/btrfs/extent_map.c @@ -1028,7 +1028,14 @@ int split_extent_map(struct btrfs_inode *inode, u64 start, u64 len, u64 pre, return ret; } -static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_to_scan) +struct btrfs_em_shrink_ctx { + long nr_to_scan; + long scanned; + u64 last_ino; + u64 last_root; +}; + +static long btrfs_scan_inode(struct btrfs_inode *inode, struct btrfs_em_shrink_ctx *ctx) { const u64 cur_fs_gen = btrfs_get_fs_generation(inode->root->fs_info); struct extent_map_tree *tree = &inode->extent_tree; @@ -1075,7 +1082,7 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_t em = rb_entry(node, struct extent_map, rb_node); node = rb_next(node); - (*scanned)++; + ctx->scanned++; if (em->flags & EXTENT_FLAG_PINNED) goto next; @@ -1096,7 +1103,7 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_t free_extent_map(em); nr_dropped++; next: - if (*scanned >= nr_to_scan) + if (ctx->scanned >= ctx->nr_to_scan) break; /* @@ -1115,22 +1122,21 @@ static long btrfs_scan_inode(struct btrfs_inode *inode, long *scanned, long nr_t return nr_dropped; } -static long btrfs_scan_root(struct btrfs_root *root, long *scanned, long nr_to_scan) +static long btrfs_scan_root(struct btrfs_root *root, struct btrfs_em_shrink_ctx *ctx) { - struct btrfs_fs_info *fs_info = root->fs_info; struct btrfs_inode *inode; long nr_dropped = 0; - u64 min_ino = fs_info->extent_map_shrinker_last_ino + 1; + u64 min_ino = ctx->last_ino + 1; inode = btrfs_find_first_inode(root, min_ino); while (inode) { - nr_dropped += btrfs_scan_inode(inode, scanned, nr_to_scan); + nr_dropped += btrfs_scan_inode(inode, ctx); min_ino = btrfs_ino(inode) + 1; - fs_info->extent_map_shrinker_last_ino = btrfs_ino(inode); + ctx->last_ino = btrfs_ino(inode); btrfs_add_delayed_iput(inode); - if (*scanned >= nr_to_scan) + if (ctx->scanned >= ctx->nr_to_scan) break; /* @@ -1151,14 +1157,14 @@ static long btrfs_scan_root(struct btrfs_root *root, long *scanned, long nr_to_s * inode if there is one or we will find out this was the last * one and move to the next root. */ - fs_info->extent_map_shrinker_last_root = btrfs_root_id(root); + ctx->last_root = btrfs_root_id(root); } else { /* * No more inodes in this root, set extent_map_shrinker_last_ino to 0 so * that when processing the next root we start from its first inode. */ - fs_info->extent_map_shrinker_last_ino = 0; - fs_info->extent_map_shrinker_last_root = btrfs_root_id(root) + 1; + ctx->last_ino = 0; + ctx->last_root = btrfs_root_id(root) + 1; } return nr_dropped; @@ -1166,23 +1172,41 @@ static long btrfs_scan_root(struct btrfs_root *root, long *scanned, long nr_to_s long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan) { - const u64 start_root_id = fs_info->extent_map_shrinker_last_root; - u64 next_root_id = start_root_id; + struct btrfs_em_shrink_ctx ctx; + u64 start_root_id; + u64 next_root_id; bool cycled = false; long nr_dropped = 0; - long scanned = 0; + + ctx.scanned = 0; + ctx.nr_to_scan = nr_to_scan; + + /* + * In case we have multiple tasks running this shrinker, make the next + * one start from the next inode in case it starts before we finish. + */ + spin_lock(&fs_info->extent_map_shrinker_lock); + ctx.last_ino = fs_info->extent_map_shrinker_last_ino; + fs_info->extent_map_shrinker_last_ino++; + ctx.last_root = fs_info->extent_map_shrinker_last_root; + spin_unlock(&fs_info->extent_map_shrinker_lock); + + start_root_id = ctx.last_root; + next_root_id = ctx.last_root; if (trace_btrfs_extent_map_shrinker_scan_enter_enabled()) { s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps); - trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr_to_scan, nr); + trace_btrfs_extent_map_shrinker_scan_enter(fs_info, nr_to_scan, + nr, ctx.last_root, + ctx.last_ino); } /* * We may be called from memory allocation paths, so we don't want to * take too much time and slowdown tasks, so stop if we need reschedule. */ - while (scanned < nr_to_scan && !need_resched()) { + while (ctx.scanned < ctx.nr_to_scan && !need_resched()) { struct btrfs_root *root; unsigned long count; @@ -1194,8 +1218,8 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan) spin_unlock(&fs_info->fs_roots_radix_lock); if (start_root_id > 0 && !cycled) { next_root_id = 0; - fs_info->extent_map_shrinker_last_root = 0; - fs_info->extent_map_shrinker_last_ino = 0; + ctx.last_root = 0; + ctx.last_ino = 0; cycled = true; continue; } @@ -1209,15 +1233,33 @@ long btrfs_free_extent_maps(struct btrfs_fs_info *fs_info, long nr_to_scan) continue; if (is_fstree(btrfs_root_id(root))) - nr_dropped += btrfs_scan_root(root, &scanned, nr_to_scan); + nr_dropped += btrfs_scan_root(root, &ctx); btrfs_put_root(root); } + /* + * In case of multiple tasks running this extent map shrinking code this + * isn't perfect but it's simple and silences things like KCSAN. It's + * not possible to know which task made more progress because we can + * cycle back to the first root and first inode if it's not the first + * time the shrinker ran, see the above logic. Also a task that started + * later may finish ealier than another task and made less progress. So + * make this simple and update to the progress of the last task that + * finished, with the occasional possiblity of having two consecutive + * runs of the shrinker process the same inodes. + */ + spin_lock(&fs_info->extent_map_shrinker_lock); + fs_info->extent_map_shrinker_last_ino = ctx.last_ino; + fs_info->extent_map_shrinker_last_root = ctx.last_root; + spin_unlock(&fs_info->extent_map_shrinker_lock); + if (trace_btrfs_extent_map_shrinker_scan_exit_enabled()) { s64 nr = percpu_counter_sum_positive(&fs_info->evictable_extent_maps); - trace_btrfs_extent_map_shrinker_scan_exit(fs_info, nr_dropped, nr); + trace_btrfs_extent_map_shrinker_scan_exit(fs_info, nr_dropped, + nr, ctx.last_root, + ctx.last_ino); } return nr_dropped; diff --git a/fs/btrfs/fs.h b/fs/btrfs/fs.h index 89f0650631cd..833dc3fe0a38 100644 --- a/fs/btrfs/fs.h +++ b/fs/btrfs/fs.h @@ -630,6 +630,7 @@ struct btrfs_fs_info { s32 delalloc_batch; struct percpu_counter evictable_extent_maps; + spinlock_t extent_map_shrinker_lock; u64 extent_map_shrinker_last_root; u64 extent_map_shrinker_last_ino; diff --git a/include/trace/events/btrfs.h b/include/trace/events/btrfs.h index fadf406b5260..c978fa2893a5 100644 --- a/include/trace/events/btrfs.h +++ b/include/trace/events/btrfs.h @@ -2556,9 +2556,10 @@ TRACE_EVENT(btrfs_extent_map_shrinker_count, TRACE_EVENT(btrfs_extent_map_shrinker_scan_enter, - TP_PROTO(const struct btrfs_fs_info *fs_info, long nr_to_scan, long nr), + TP_PROTO(const struct btrfs_fs_info *fs_info, long nr_to_scan, long nr, + u64 last_root_id, u64 last_ino), - TP_ARGS(fs_info, nr_to_scan, nr), + TP_ARGS(fs_info, nr_to_scan, nr, last_root_id, last_ino), TP_STRUCT__entry_btrfs( __field( long, nr_to_scan ) @@ -2570,8 +2571,8 @@ TRACE_EVENT(btrfs_extent_map_shrinker_scan_enter, TP_fast_assign_btrfs(fs_info, __entry->nr_to_scan = nr_to_scan; __entry->nr = nr; - __entry->last_root_id = fs_info->extent_map_shrinker_last_root; - __entry->last_ino = fs_info->extent_map_shrinker_last_ino; + __entry->last_root_id = last_root_id; + __entry->last_ino = last_ino; ), TP_printk_btrfs("nr_to_scan=%ld nr=%ld last_root=%llu(%s) last_ino=%llu", @@ -2581,9 +2582,10 @@ TRACE_EVENT(btrfs_extent_map_shrinker_scan_enter, TRACE_EVENT(btrfs_extent_map_shrinker_scan_exit, - TP_PROTO(const struct btrfs_fs_info *fs_info, long nr_dropped, long nr), + TP_PROTO(const struct btrfs_fs_info *fs_info, long nr_dropped, long nr, + u64 last_root_id, u64 last_ino), - TP_ARGS(fs_info, nr_dropped, nr), + TP_ARGS(fs_info, nr_dropped, nr, last_root_id, last_ino), TP_STRUCT__entry_btrfs( __field( long, nr_dropped ) @@ -2595,8 +2597,8 @@ TRACE_EVENT(btrfs_extent_map_shrinker_scan_exit, TP_fast_assign_btrfs(fs_info, __entry->nr_dropped = nr_dropped; __entry->nr = nr; - __entry->last_root_id = fs_info->extent_map_shrinker_last_root; - __entry->last_ino = fs_info->extent_map_shrinker_last_ino; + __entry->last_root_id = last_root_id; + __entry->last_ino = last_ino; ), TP_printk_btrfs("nr_dropped=%ld nr=%ld last_root=%llu(%s) last_ino=%llu",