mbox series

[0/8] btrfs: some speedups for io tree operations and cleanups

Message ID cover.1695333278.git.fdmanana@suse.com (mailing list archive)
Headers show
Series btrfs: some speedups for io tree operations and cleanups | expand

Message

Filipe Manana Sept. 22, 2023, 10:39 a.m. UTC
From: Filipe Manana <fdmanana@suse.com>

These are some changes to make some io tree operations more efficient and
some cleanups. More details on the changelogs.

Filipe Manana (8):
  btrfs: make extent state merges more efficient during insertions
  btrfs: update stale comment at extent_io_tree_release()
  btrfs: make wait_extent_bit() static
  btrfs: remove pointless memory barrier from extent_io_tree_release()
  btrfs: collapse wait_on_state() to its caller wait_extent_bit()
  btrfs: make extent_io_tree_release() more efficient
  btrfs: use extent_io_tree_release() to empty dirty log pages
  btrfs: make sure we cache next state in find_first_extent_bit()

 fs/btrfs/extent-io-tree.c | 201 ++++++++++++++++++++++++--------------
 fs/btrfs/extent-io-tree.h |   2 -
 fs/btrfs/tree-log.c       |   3 +-
 3 files changed, 126 insertions(+), 80 deletions(-)

Comments

David Sterba Sept. 22, 2023, 6:18 p.m. UTC | #1
On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> These are some changes to make some io tree operations more efficient and
> some cleanups. More details on the changelogs.
> 
> Filipe Manana (8):
>   btrfs: make extent state merges more efficient during insertions
>   btrfs: update stale comment at extent_io_tree_release()
>   btrfs: make wait_extent_bit() static
>   btrfs: remove pointless memory barrier from extent_io_tree_release()
>   btrfs: collapse wait_on_state() to its caller wait_extent_bit()
>   btrfs: make extent_io_tree_release() more efficient
>   btrfs: use extent_io_tree_release() to empty dirty log pages
>   btrfs: make sure we cache next state in find_first_extent_bit()

I see a lot of reports like:

btrfs/004        [16:14:23][  468.732077] run fstests btrfs/004 at 2023-09-22 16:14:24
[  470.921989] BTRFS: device fsid f7d57de2-899a-4b33-b77a-084058ac36e9 devid 1 transid 11 /dev/vda scanned by mount (31993)
[  470.926438] BTRFS info (device vda): using crc32c (crc32c-generic) checksum algorithm
[  470.928013] BTRFS info (device vda): using free space tree
[  470.952723] BTRFS info (device vda): auto enabling async discard
[  472.385556] BTRFS: device fsid 097a012d-8e9b-4bd8-960d-87577821cbbe devid 1 transid 6 /dev/vdb scanned by mount (32061)
[  472.395192] BTRFS info (device vdb): using crc32c (crc32c-generic) checksum algorithm
[  472.398895] BTRFS info (device vdb): using free space tree
[  472.438755] BTRFS info (device vdb): auto enabling async discard
[  472.440900] BTRFS info (device vdb): checking UUID tree
[  472.534254] BUG: sleeping function called from invalid context at fs/btrfs/extent-io-tree.c:132
[  472.539305] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32079, name: fsstress
[  472.542685] preempt_count: 1, expected: 0
[  472.543641] RCU nest depth: 0, expected: 0
[  472.544778] 6 locks held by fsstress/32079:
[  472.546916]  #0: ffff888017ad4648 (sb_internal#2){.+.+}-{0:0}, at: btrfs_sync_file+0x594/0xa20 [btrfs]
[  472.551474]  #1: ffff888014c72790 (btrfs_trans_completed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
[  472.556334]  #2: ffff888014c72760 (btrfs_trans_super_committed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
[  472.561372]  #3: ffff888014c72730 (btrfs_trans_unblocked){++++}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
[  472.565793]  #4: ffff888014c70de0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_commit_transaction+0x8ed/0x1580 [btrfs]
[  472.569931]  #5: ffff88802ec1c248 (&tree->lock#2){+.+.}-{2:2}, at: extent_io_tree_release+0x1c/0x120 [btrfs]
[  472.573099] Preemption disabled at:
[  472.573110] [<0000000000000000>] 0x0
[  472.575200] CPU: 0 PID: 32079 Comm: fsstress Not tainted 6.6.0-rc2-default+ #2197
[  472.577440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
[  472.580902] Call Trace:
[  472.581738]  <TASK>
[  472.582480]  dump_stack_lvl+0x60/0x70
[  472.583621]  __might_resched+0x224/0x360
[  472.584591]  extent_io_tree_release+0xa5/0x120 [btrfs]
[  472.585864]  free_log_tree+0xec/0x250 [btrfs]
[  472.587007]  ? walk_log_tree+0x4a0/0x4a0 [btrfs]
[  472.588116]  ? reacquire_held_locks+0x280/0x280
[  472.589104]  ? btrfs_log_holes+0x430/0x430 [btrfs]
[  472.590296]  ? node_tag_clear+0xf4/0x160
[  472.591292]  btrfs_free_log+0x2c/0x60 [btrfs]
[  472.592468]  commit_fs_roots+0x1e0/0x440 [btrfs]
[  472.593569]  ? __lock_release.isra.0+0x14e/0x510
[  472.594470]  ? percpu_up_read+0xe0/0xe0 [btrfs]
[  472.595496]  ? btrfs_run_delayed_refs+0xf6/0x180 [btrfs]
[  472.596758]  ? btrfs_assert_delayed_root_empty+0x2d/0xd0 [btrfs]
[  472.598077]  ? btrfs_commit_transaction+0x932/0x1580 [btrfs]
[  472.599437]  btrfs_commit_transaction+0x94e/0x1580 [btrfs]
[  472.600845]  ? cleanup_transaction+0x650/0x650 [btrfs]
[  472.602164]  ? preempt_count_sub+0x18/0xc0
[  472.603111]  ? __rcu_read_unlock+0x67/0x90
[  472.604011]  ? preempt_count_add+0x71/0xd0
[  472.604840]  ? preempt_count_sub+0x18/0xc0
[  472.605664]  ? preempt_count_add+0x71/0xd0
[  472.606454]  ? preempt_count_sub+0x18/0xc0
[  472.607251]  ? __up_write+0x125/0x300
[  472.608059]  btrfs_sync_file+0x794/0xa20 [btrfs]
[  472.609105]  ? start_ordered_ops.constprop.0+0xd0/0xd0 [btrfs]
[  472.610392]  ? mark_held_locks+0x1a/0x80
[  472.611179]  __x64_sys_fdatasync+0x70/0xb0
[  472.614054]  do_syscall_64+0x3d/0x90
[  472.614584]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[  472.615522] RIP: 0033:0x7f9bbbe983d4

 115 void extent_io_tree_release(struct extent_io_tree *tree)
 116 {
 117         struct extent_state *state;
 118         struct extent_state *tmp;
 119
 120         spin_lock(&tree->lock);
 121         rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
 122                 /* Clear node to keep free_extent_state() happy. */
 123                 RB_CLEAR_NODE(&state->rb_node);
 124                 ASSERT(!(state->state & EXTENT_LOCKED));
 125                 /*
 126                  * No need for a memory barrier here, as we are holding the tree
 127                  * lock and we only change the waitqueue while holding that lock
 128                  * (see wait_extent_bit()).
 129                  */
 130                 ASSERT(!waitqueue_active(&state->wq));
 131                 free_extent_state(state);
 132                 cond_resched();
 ^^^

should be cond_resched_lock() as it's under spinlock but then I'm not
sure if relocking is still safe in the middle of the tree traversal.

 133         }
 134         tree->state = RB_ROOT;
 135         spin_unlock(&tree->lock);
 136 }
Filipe Manana Sept. 22, 2023, 6:43 p.m. UTC | #2
On Fri, Sep 22, 2023 at 7:24 PM David Sterba <dsterba@suse.cz> wrote:
>
> On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> > From: Filipe Manana <fdmanana@suse.com>
> >
> > These are some changes to make some io tree operations more efficient and
> > some cleanups. More details on the changelogs.
> >
> > Filipe Manana (8):
> >   btrfs: make extent state merges more efficient during insertions
> >   btrfs: update stale comment at extent_io_tree_release()
> >   btrfs: make wait_extent_bit() static
> >   btrfs: remove pointless memory barrier from extent_io_tree_release()
> >   btrfs: collapse wait_on_state() to its caller wait_extent_bit()
> >   btrfs: make extent_io_tree_release() more efficient
> >   btrfs: use extent_io_tree_release() to empty dirty log pages
> >   btrfs: make sure we cache next state in find_first_extent_bit()
>
> I see a lot of reports like:
>
> btrfs/004        [16:14:23][  468.732077] run fstests btrfs/004 at 2023-09-22 16:14:24
> [  470.921989] BTRFS: device fsid f7d57de2-899a-4b33-b77a-084058ac36e9 devid 1 transid 11 /dev/vda scanned by mount (31993)
> [  470.926438] BTRFS info (device vda): using crc32c (crc32c-generic) checksum algorithm
> [  470.928013] BTRFS info (device vda): using free space tree
> [  470.952723] BTRFS info (device vda): auto enabling async discard
> [  472.385556] BTRFS: device fsid 097a012d-8e9b-4bd8-960d-87577821cbbe devid 1 transid 6 /dev/vdb scanned by mount (32061)
> [  472.395192] BTRFS info (device vdb): using crc32c (crc32c-generic) checksum algorithm
> [  472.398895] BTRFS info (device vdb): using free space tree
> [  472.438755] BTRFS info (device vdb): auto enabling async discard
> [  472.440900] BTRFS info (device vdb): checking UUID tree
> [  472.534254] BUG: sleeping function called from invalid context at fs/btrfs/extent-io-tree.c:132
> [  472.539305] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32079, name: fsstress
> [  472.542685] preempt_count: 1, expected: 0
> [  472.543641] RCU nest depth: 0, expected: 0
> [  472.544778] 6 locks held by fsstress/32079:
> [  472.546916]  #0: ffff888017ad4648 (sb_internal#2){.+.+}-{0:0}, at: btrfs_sync_file+0x594/0xa20 [btrfs]
> [  472.551474]  #1: ffff888014c72790 (btrfs_trans_completed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> [  472.556334]  #2: ffff888014c72760 (btrfs_trans_super_committed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> [  472.561372]  #3: ffff888014c72730 (btrfs_trans_unblocked){++++}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> [  472.565793]  #4: ffff888014c70de0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_commit_transaction+0x8ed/0x1580 [btrfs]
> [  472.569931]  #5: ffff88802ec1c248 (&tree->lock#2){+.+.}-{2:2}, at: extent_io_tree_release+0x1c/0x120 [btrfs]
> [  472.573099] Preemption disabled at:
> [  472.573110] [<0000000000000000>] 0x0
> [  472.575200] CPU: 0 PID: 32079 Comm: fsstress Not tainted 6.6.0-rc2-default+ #2197
> [  472.577440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> [  472.580902] Call Trace:
> [  472.581738]  <TASK>
> [  472.582480]  dump_stack_lvl+0x60/0x70
> [  472.583621]  __might_resched+0x224/0x360
> [  472.584591]  extent_io_tree_release+0xa5/0x120 [btrfs]
> [  472.585864]  free_log_tree+0xec/0x250 [btrfs]
> [  472.587007]  ? walk_log_tree+0x4a0/0x4a0 [btrfs]
> [  472.588116]  ? reacquire_held_locks+0x280/0x280
> [  472.589104]  ? btrfs_log_holes+0x430/0x430 [btrfs]
> [  472.590296]  ? node_tag_clear+0xf4/0x160
> [  472.591292]  btrfs_free_log+0x2c/0x60 [btrfs]
> [  472.592468]  commit_fs_roots+0x1e0/0x440 [btrfs]
> [  472.593569]  ? __lock_release.isra.0+0x14e/0x510
> [  472.594470]  ? percpu_up_read+0xe0/0xe0 [btrfs]
> [  472.595496]  ? btrfs_run_delayed_refs+0xf6/0x180 [btrfs]
> [  472.596758]  ? btrfs_assert_delayed_root_empty+0x2d/0xd0 [btrfs]
> [  472.598077]  ? btrfs_commit_transaction+0x932/0x1580 [btrfs]
> [  472.599437]  btrfs_commit_transaction+0x94e/0x1580 [btrfs]
> [  472.600845]  ? cleanup_transaction+0x650/0x650 [btrfs]
> [  472.602164]  ? preempt_count_sub+0x18/0xc0
> [  472.603111]  ? __rcu_read_unlock+0x67/0x90
> [  472.604011]  ? preempt_count_add+0x71/0xd0
> [  472.604840]  ? preempt_count_sub+0x18/0xc0
> [  472.605664]  ? preempt_count_add+0x71/0xd0
> [  472.606454]  ? preempt_count_sub+0x18/0xc0
> [  472.607251]  ? __up_write+0x125/0x300
> [  472.608059]  btrfs_sync_file+0x794/0xa20 [btrfs]
> [  472.609105]  ? start_ordered_ops.constprop.0+0xd0/0xd0 [btrfs]
> [  472.610392]  ? mark_held_locks+0x1a/0x80
> [  472.611179]  __x64_sys_fdatasync+0x70/0xb0
> [  472.614054]  do_syscall_64+0x3d/0x90
> [  472.614584]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> [  472.615522] RIP: 0033:0x7f9bbbe983d4
>
>  115 void extent_io_tree_release(struct extent_io_tree *tree)
>  116 {
>  117         struct extent_state *state;
>  118         struct extent_state *tmp;
>  119
>  120         spin_lock(&tree->lock);
>  121         rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
>  122                 /* Clear node to keep free_extent_state() happy. */
>  123                 RB_CLEAR_NODE(&state->rb_node);
>  124                 ASSERT(!(state->state & EXTENT_LOCKED));
>  125                 /*
>  126                  * No need for a memory barrier here, as we are holding the tree
>  127                  * lock and we only change the waitqueue while holding that lock
>  128                  * (see wait_extent_bit()).
>  129                  */
>  130                 ASSERT(!waitqueue_active(&state->wq));
>  131                 free_extent_state(state);
>  132                 cond_resched();
>  ^^^
>
> should be cond_resched_lock() as it's under spinlock but then I'm not
> sure if relocking is still safe in the middle of the tree traversal.

cond_resched_lock() works here under the assumption that at this point no other
tasks access the tree (as documented in an earlier patch) - if the
assumption is broken in
the future, then another task can access a node that was freed before
the cond_resched_lock()
call and result in a use-after-free or other weird issues.

But I think it's best to remove the cond_resched() call. I removed it
and then added it again
but didn't think properly and ran on a vm without several debugging
configs turned on, likely
why I didn't get that splat.

Do you want me to send a v2 with the cond_resched() line deleted?

Thanks.

>
>  133         }
>  134         tree->state = RB_ROOT;
>  135         spin_unlock(&tree->lock);
>  136 }
David Sterba Sept. 22, 2023, 7:08 p.m. UTC | #3
On Fri, Sep 22, 2023 at 07:43:24PM +0100, Filipe Manana wrote:
> On Fri, Sep 22, 2023 at 7:24 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > These are some changes to make some io tree operations more efficient and
> > > some cleanups. More details on the changelogs.
> > >
> > > Filipe Manana (8):
> > >   btrfs: make extent state merges more efficient during insertions
> > >   btrfs: update stale comment at extent_io_tree_release()
> > >   btrfs: make wait_extent_bit() static
> > >   btrfs: remove pointless memory barrier from extent_io_tree_release()
> > >   btrfs: collapse wait_on_state() to its caller wait_extent_bit()
> > >   btrfs: make extent_io_tree_release() more efficient
> > >   btrfs: use extent_io_tree_release() to empty dirty log pages
> > >   btrfs: make sure we cache next state in find_first_extent_bit()
> >
> > I see a lot of reports like:
> >
> > btrfs/004        [16:14:23][  468.732077] run fstests btrfs/004 at 2023-09-22 16:14:24
> > [  470.921989] BTRFS: device fsid f7d57de2-899a-4b33-b77a-084058ac36e9 devid 1 transid 11 /dev/vda scanned by mount (31993)
> > [  470.926438] BTRFS info (device vda): using crc32c (crc32c-generic) checksum algorithm
> > [  470.928013] BTRFS info (device vda): using free space tree
> > [  470.952723] BTRFS info (device vda): auto enabling async discard
> > [  472.385556] BTRFS: device fsid 097a012d-8e9b-4bd8-960d-87577821cbbe devid 1 transid 6 /dev/vdb scanned by mount (32061)
> > [  472.395192] BTRFS info (device vdb): using crc32c (crc32c-generic) checksum algorithm
> > [  472.398895] BTRFS info (device vdb): using free space tree
> > [  472.438755] BTRFS info (device vdb): auto enabling async discard
> > [  472.440900] BTRFS info (device vdb): checking UUID tree
> > [  472.534254] BUG: sleeping function called from invalid context at fs/btrfs/extent-io-tree.c:132
> > [  472.539305] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32079, name: fsstress
> > [  472.542685] preempt_count: 1, expected: 0
> > [  472.543641] RCU nest depth: 0, expected: 0
> > [  472.544778] 6 locks held by fsstress/32079:
> > [  472.546916]  #0: ffff888017ad4648 (sb_internal#2){.+.+}-{0:0}, at: btrfs_sync_file+0x594/0xa20 [btrfs]
> > [  472.551474]  #1: ffff888014c72790 (btrfs_trans_completed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [  472.556334]  #2: ffff888014c72760 (btrfs_trans_super_committed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [  472.561372]  #3: ffff888014c72730 (btrfs_trans_unblocked){++++}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [  472.565793]  #4: ffff888014c70de0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_commit_transaction+0x8ed/0x1580 [btrfs]
> > [  472.569931]  #5: ffff88802ec1c248 (&tree->lock#2){+.+.}-{2:2}, at: extent_io_tree_release+0x1c/0x120 [btrfs]
> > [  472.573099] Preemption disabled at:
> > [  472.573110] [<0000000000000000>] 0x0
> > [  472.575200] CPU: 0 PID: 32079 Comm: fsstress Not tainted 6.6.0-rc2-default+ #2197
> > [  472.577440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> > [  472.580902] Call Trace:
> > [  472.581738]  <TASK>
> > [  472.582480]  dump_stack_lvl+0x60/0x70
> > [  472.583621]  __might_resched+0x224/0x360
> > [  472.584591]  extent_io_tree_release+0xa5/0x120 [btrfs]
> > [  472.585864]  free_log_tree+0xec/0x250 [btrfs]
> > [  472.587007]  ? walk_log_tree+0x4a0/0x4a0 [btrfs]
> > [  472.588116]  ? reacquire_held_locks+0x280/0x280
> > [  472.589104]  ? btrfs_log_holes+0x430/0x430 [btrfs]
> > [  472.590296]  ? node_tag_clear+0xf4/0x160
> > [  472.591292]  btrfs_free_log+0x2c/0x60 [btrfs]
> > [  472.592468]  commit_fs_roots+0x1e0/0x440 [btrfs]
> > [  472.593569]  ? __lock_release.isra.0+0x14e/0x510
> > [  472.594470]  ? percpu_up_read+0xe0/0xe0 [btrfs]
> > [  472.595496]  ? btrfs_run_delayed_refs+0xf6/0x180 [btrfs]
> > [  472.596758]  ? btrfs_assert_delayed_root_empty+0x2d/0xd0 [btrfs]
> > [  472.598077]  ? btrfs_commit_transaction+0x932/0x1580 [btrfs]
> > [  472.599437]  btrfs_commit_transaction+0x94e/0x1580 [btrfs]
> > [  472.600845]  ? cleanup_transaction+0x650/0x650 [btrfs]
> > [  472.602164]  ? preempt_count_sub+0x18/0xc0
> > [  472.603111]  ? __rcu_read_unlock+0x67/0x90
> > [  472.604011]  ? preempt_count_add+0x71/0xd0
> > [  472.604840]  ? preempt_count_sub+0x18/0xc0
> > [  472.605664]  ? preempt_count_add+0x71/0xd0
> > [  472.606454]  ? preempt_count_sub+0x18/0xc0
> > [  472.607251]  ? __up_write+0x125/0x300
> > [  472.608059]  btrfs_sync_file+0x794/0xa20 [btrfs]
> > [  472.609105]  ? start_ordered_ops.constprop.0+0xd0/0xd0 [btrfs]
> > [  472.610392]  ? mark_held_locks+0x1a/0x80
> > [  472.611179]  __x64_sys_fdatasync+0x70/0xb0
> > [  472.614054]  do_syscall_64+0x3d/0x90
> > [  472.614584]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > [  472.615522] RIP: 0033:0x7f9bbbe983d4
> >
> >  115 void extent_io_tree_release(struct extent_io_tree *tree)
> >  116 {
> >  117         struct extent_state *state;
> >  118         struct extent_state *tmp;
> >  119
> >  120         spin_lock(&tree->lock);
> >  121         rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
> >  122                 /* Clear node to keep free_extent_state() happy. */
> >  123                 RB_CLEAR_NODE(&state->rb_node);
> >  124                 ASSERT(!(state->state & EXTENT_LOCKED));
> >  125                 /*
> >  126                  * No need for a memory barrier here, as we are holding the tree
> >  127                  * lock and we only change the waitqueue while holding that lock
> >  128                  * (see wait_extent_bit()).
> >  129                  */
> >  130                 ASSERT(!waitqueue_active(&state->wq));
> >  131                 free_extent_state(state);
> >  132                 cond_resched();
> >  ^^^
> >
> > should be cond_resched_lock() as it's under spinlock but then I'm not
> > sure if relocking is still safe in the middle of the tree traversal.
> 
> cond_resched_lock() works here under the assumption that at this point no other
> tasks access the tree (as documented in an earlier patch) - if the
> assumption is broken in
> the future, then another task can access a node that was freed before
> the cond_resched_lock()
> call and result in a use-after-free or other weird issues.
> 
> But I think it's best to remove the cond_resched() call. I removed it
> and then added it again
> but didn't think properly and ran on a vm without several debugging
> configs turned on, likely
> why I didn't get that splat.
> 
> Do you want me to send a v2 with the cond_resched() line deleted?

No need to resend for that.
Filipe Manana Sept. 22, 2023, 7:13 p.m. UTC | #4
On Fri, Sep 22, 2023 at 07:43:24PM +0100, Filipe Manana wrote:
> On Fri, Sep 22, 2023 at 7:24 PM David Sterba <dsterba@suse.cz> wrote:
> >
> > On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> > > From: Filipe Manana <fdmanana@suse.com>
> > >
> > > These are some changes to make some io tree operations more efficient and
> > > some cleanups. More details on the changelogs.
> > >
> > > Filipe Manana (8):
> > >   btrfs: make extent state merges more efficient during insertions
> > >   btrfs: update stale comment at extent_io_tree_release()
> > >   btrfs: make wait_extent_bit() static
> > >   btrfs: remove pointless memory barrier from extent_io_tree_release()
> > >   btrfs: collapse wait_on_state() to its caller wait_extent_bit()
> > >   btrfs: make extent_io_tree_release() more efficient
> > >   btrfs: use extent_io_tree_release() to empty dirty log pages
> > >   btrfs: make sure we cache next state in find_first_extent_bit()
> >
> > I see a lot of reports like:
> >
> > btrfs/004        [16:14:23][  468.732077] run fstests btrfs/004 at 2023-09-22 16:14:24
> > [  470.921989] BTRFS: device fsid f7d57de2-899a-4b33-b77a-084058ac36e9 devid 1 transid 11 /dev/vda scanned by mount (31993)
> > [  470.926438] BTRFS info (device vda): using crc32c (crc32c-generic) checksum algorithm
> > [  470.928013] BTRFS info (device vda): using free space tree
> > [  470.952723] BTRFS info (device vda): auto enabling async discard
> > [  472.385556] BTRFS: device fsid 097a012d-8e9b-4bd8-960d-87577821cbbe devid 1 transid 6 /dev/vdb scanned by mount (32061)
> > [  472.395192] BTRFS info (device vdb): using crc32c (crc32c-generic) checksum algorithm
> > [  472.398895] BTRFS info (device vdb): using free space tree
> > [  472.438755] BTRFS info (device vdb): auto enabling async discard
> > [  472.440900] BTRFS info (device vdb): checking UUID tree
> > [  472.534254] BUG: sleeping function called from invalid context at fs/btrfs/extent-io-tree.c:132
> > [  472.539305] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32079, name: fsstress
> > [  472.542685] preempt_count: 1, expected: 0
> > [  472.543641] RCU nest depth: 0, expected: 0
> > [  472.544778] 6 locks held by fsstress/32079:
> > [  472.546916]  #0: ffff888017ad4648 (sb_internal#2){.+.+}-{0:0}, at: btrfs_sync_file+0x594/0xa20 [btrfs]
> > [  472.551474]  #1: ffff888014c72790 (btrfs_trans_completed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [  472.556334]  #2: ffff888014c72760 (btrfs_trans_super_committed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [  472.561372]  #3: ffff888014c72730 (btrfs_trans_unblocked){++++}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > [  472.565793]  #4: ffff888014c70de0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_commit_transaction+0x8ed/0x1580 [btrfs]
> > [  472.569931]  #5: ffff88802ec1c248 (&tree->lock#2){+.+.}-{2:2}, at: extent_io_tree_release+0x1c/0x120 [btrfs]
> > [  472.573099] Preemption disabled at:
> > [  472.573110] [<0000000000000000>] 0x0
> > [  472.575200] CPU: 0 PID: 32079 Comm: fsstress Not tainted 6.6.0-rc2-default+ #2197
> > [  472.577440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> > [  472.580902] Call Trace:
> > [  472.581738]  <TASK>
> > [  472.582480]  dump_stack_lvl+0x60/0x70
> > [  472.583621]  __might_resched+0x224/0x360
> > [  472.584591]  extent_io_tree_release+0xa5/0x120 [btrfs]
> > [  472.585864]  free_log_tree+0xec/0x250 [btrfs]
> > [  472.587007]  ? walk_log_tree+0x4a0/0x4a0 [btrfs]
> > [  472.588116]  ? reacquire_held_locks+0x280/0x280
> > [  472.589104]  ? btrfs_log_holes+0x430/0x430 [btrfs]
> > [  472.590296]  ? node_tag_clear+0xf4/0x160
> > [  472.591292]  btrfs_free_log+0x2c/0x60 [btrfs]
> > [  472.592468]  commit_fs_roots+0x1e0/0x440 [btrfs]
> > [  472.593569]  ? __lock_release.isra.0+0x14e/0x510
> > [  472.594470]  ? percpu_up_read+0xe0/0xe0 [btrfs]
> > [  472.595496]  ? btrfs_run_delayed_refs+0xf6/0x180 [btrfs]
> > [  472.596758]  ? btrfs_assert_delayed_root_empty+0x2d/0xd0 [btrfs]
> > [  472.598077]  ? btrfs_commit_transaction+0x932/0x1580 [btrfs]
> > [  472.599437]  btrfs_commit_transaction+0x94e/0x1580 [btrfs]
> > [  472.600845]  ? cleanup_transaction+0x650/0x650 [btrfs]
> > [  472.602164]  ? preempt_count_sub+0x18/0xc0
> > [  472.603111]  ? __rcu_read_unlock+0x67/0x90
> > [  472.604011]  ? preempt_count_add+0x71/0xd0
> > [  472.604840]  ? preempt_count_sub+0x18/0xc0
> > [  472.605664]  ? preempt_count_add+0x71/0xd0
> > [  472.606454]  ? preempt_count_sub+0x18/0xc0
> > [  472.607251]  ? __up_write+0x125/0x300
> > [  472.608059]  btrfs_sync_file+0x794/0xa20 [btrfs]
> > [  472.609105]  ? start_ordered_ops.constprop.0+0xd0/0xd0 [btrfs]
> > [  472.610392]  ? mark_held_locks+0x1a/0x80
> > [  472.611179]  __x64_sys_fdatasync+0x70/0xb0
> > [  472.614054]  do_syscall_64+0x3d/0x90
> > [  472.614584]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > [  472.615522] RIP: 0033:0x7f9bbbe983d4
> >
> >  115 void extent_io_tree_release(struct extent_io_tree *tree)
> >  116 {
> >  117         struct extent_state *state;
> >  118         struct extent_state *tmp;
> >  119
> >  120         spin_lock(&tree->lock);
> >  121         rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
> >  122                 /* Clear node to keep free_extent_state() happy. */
> >  123                 RB_CLEAR_NODE(&state->rb_node);
> >  124                 ASSERT(!(state->state & EXTENT_LOCKED));
> >  125                 /*
> >  126                  * No need for a memory barrier here, as we are holding the tree
> >  127                  * lock and we only change the waitqueue while holding that lock
> >  128                  * (see wait_extent_bit()).
> >  129                  */
> >  130                 ASSERT(!waitqueue_active(&state->wq));
> >  131                 free_extent_state(state);
> >  132                 cond_resched();
> >  ^^^
> >
> > should be cond_resched_lock() as it's under spinlock but then I'm not
> > sure if relocking is still safe in the middle of the tree traversal.
> 
> cond_resched_lock() works here under the assumption that at this point no other
> tasks access the tree (as documented in an earlier patch) - if the
> assumption is broken in
> the future, then another task can access a node that was freed before
> the cond_resched_lock()
> call and result in a use-after-free or other weird issues.
> 
> But I think it's best to remove the cond_resched() call. I removed it
> and then added it again
> but didn't think properly and ran on a vm without several debugging
> configs turned on, likely
> why I didn't get that splat.
> 
> Do you want me to send a v2 with the cond_resched() line deleted?

Better, we can keep the cond_resched_lock() safely with this incremental on
top of the patch:

diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
index 6e3645afaa38..32788fb7837e 100644
--- a/fs/btrfs/extent-io-tree.c
+++ b/fs/btrfs/extent-io-tree.c
@@ -114,11 +114,14 @@ void extent_io_tree_init(struct btrfs_fs_info *fs_info,
  */
 void extent_io_tree_release(struct extent_io_tree *tree)
 {
+       struct rb_root root;
        struct extent_state *state;
        struct extent_state *tmp;
 
        spin_lock(&tree->lock);
-       rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
+       root = tree->state;
+       tree->state = RB_ROOT;
+       rbtree_postorder_for_each_entry_safe(state, tmp, &root, rb_node) {
                /* Clear node to keep free_extent_state() happy. */
                RB_CLEAR_NODE(&state->rb_node);
                ASSERT(!(state->state & EXTENT_LOCKED));
@@ -129,9 +132,13 @@ void extent_io_tree_release(struct extent_io_tree *tree)
                 */
                ASSERT(!waitqueue_active(&state->wq));
                free_extent_state(state);
-               cond_resched();
+               cond_resched_lock(&tree->lock);
        }
-       tree->state = RB_ROOT;
+       /*
+        * Should still be empty even after a reschedule, no other task should
+        * be accessing the tree anymore.
+        */
+       ASSERT(RB_EMPTY_ROOT(&tree->state));
        spin_unlock(&tree->lock);
 }
 
Also pasted at: https://pastebin.com/raw/uVMP2e5b

Let me know if you prefer I send a v2 or squash this patch.

Thanks.

> 
> Thanks.
> 
> >
> >  133         }
> >  134         tree->state = RB_ROOT;
> >  135         spin_unlock(&tree->lock);
> >  136 }
David Sterba Sept. 22, 2023, 7:15 p.m. UTC | #5
On Fri, Sep 22, 2023 at 08:13:50PM +0100, Filipe Manana wrote:
> On Fri, Sep 22, 2023 at 07:43:24PM +0100, Filipe Manana wrote:
> > On Fri, Sep 22, 2023 at 7:24 PM David Sterba <dsterba@suse.cz> wrote:
> > >
> > > On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> > > > From: Filipe Manana <fdmanana@suse.com>
> > > >
> > > > These are some changes to make some io tree operations more efficient and
> > > > some cleanups. More details on the changelogs.
> > > >
> > > > Filipe Manana (8):
> > > >   btrfs: make extent state merges more efficient during insertions
> > > >   btrfs: update stale comment at extent_io_tree_release()
> > > >   btrfs: make wait_extent_bit() static
> > > >   btrfs: remove pointless memory barrier from extent_io_tree_release()
> > > >   btrfs: collapse wait_on_state() to its caller wait_extent_bit()
> > > >   btrfs: make extent_io_tree_release() more efficient
> > > >   btrfs: use extent_io_tree_release() to empty dirty log pages
> > > >   btrfs: make sure we cache next state in find_first_extent_bit()
> > >
> > > I see a lot of reports like:
> > >
> > > btrfs/004        [16:14:23][  468.732077] run fstests btrfs/004 at 2023-09-22 16:14:24
> > > [  470.921989] BTRFS: device fsid f7d57de2-899a-4b33-b77a-084058ac36e9 devid 1 transid 11 /dev/vda scanned by mount (31993)
> > > [  470.926438] BTRFS info (device vda): using crc32c (crc32c-generic) checksum algorithm
> > > [  470.928013] BTRFS info (device vda): using free space tree
> > > [  470.952723] BTRFS info (device vda): auto enabling async discard
> > > [  472.385556] BTRFS: device fsid 097a012d-8e9b-4bd8-960d-87577821cbbe devid 1 transid 6 /dev/vdb scanned by mount (32061)
> > > [  472.395192] BTRFS info (device vdb): using crc32c (crc32c-generic) checksum algorithm
> > > [  472.398895] BTRFS info (device vdb): using free space tree
> > > [  472.438755] BTRFS info (device vdb): auto enabling async discard
> > > [  472.440900] BTRFS info (device vdb): checking UUID tree
> > > [  472.534254] BUG: sleeping function called from invalid context at fs/btrfs/extent-io-tree.c:132
> > > [  472.539305] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 32079, name: fsstress
> > > [  472.542685] preempt_count: 1, expected: 0
> > > [  472.543641] RCU nest depth: 0, expected: 0
> > > [  472.544778] 6 locks held by fsstress/32079:
> > > [  472.546916]  #0: ffff888017ad4648 (sb_internal#2){.+.+}-{0:0}, at: btrfs_sync_file+0x594/0xa20 [btrfs]
> > > [  472.551474]  #1: ffff888014c72790 (btrfs_trans_completed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > > [  472.556334]  #2: ffff888014c72760 (btrfs_trans_super_committed){.+.+}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > > [  472.561372]  #3: ffff888014c72730 (btrfs_trans_unblocked){++++}-{0:0}, at: btrfs_commit_transaction+0x840/0x1580 [btrfs]
> > > [  472.565793]  #4: ffff888014c70de0 (&fs_info->reloc_mutex){+.+.}-{3:3}, at: btrfs_commit_transaction+0x8ed/0x1580 [btrfs]
> > > [  472.569931]  #5: ffff88802ec1c248 (&tree->lock#2){+.+.}-{2:2}, at: extent_io_tree_release+0x1c/0x120 [btrfs]
> > > [  472.573099] Preemption disabled at:
> > > [  472.573110] [<0000000000000000>] 0x0
> > > [  472.575200] CPU: 0 PID: 32079 Comm: fsstress Not tainted 6.6.0-rc2-default+ #2197
> > > [  472.577440] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-3-gd478f380-rebuilt.opensuse.org 04/01/2014
> > > [  472.580902] Call Trace:
> > > [  472.581738]  <TASK>
> > > [  472.582480]  dump_stack_lvl+0x60/0x70
> > > [  472.583621]  __might_resched+0x224/0x360
> > > [  472.584591]  extent_io_tree_release+0xa5/0x120 [btrfs]
> > > [  472.585864]  free_log_tree+0xec/0x250 [btrfs]
> > > [  472.587007]  ? walk_log_tree+0x4a0/0x4a0 [btrfs]
> > > [  472.588116]  ? reacquire_held_locks+0x280/0x280
> > > [  472.589104]  ? btrfs_log_holes+0x430/0x430 [btrfs]
> > > [  472.590296]  ? node_tag_clear+0xf4/0x160
> > > [  472.591292]  btrfs_free_log+0x2c/0x60 [btrfs]
> > > [  472.592468]  commit_fs_roots+0x1e0/0x440 [btrfs]
> > > [  472.593569]  ? __lock_release.isra.0+0x14e/0x510
> > > [  472.594470]  ? percpu_up_read+0xe0/0xe0 [btrfs]
> > > [  472.595496]  ? btrfs_run_delayed_refs+0xf6/0x180 [btrfs]
> > > [  472.596758]  ? btrfs_assert_delayed_root_empty+0x2d/0xd0 [btrfs]
> > > [  472.598077]  ? btrfs_commit_transaction+0x932/0x1580 [btrfs]
> > > [  472.599437]  btrfs_commit_transaction+0x94e/0x1580 [btrfs]
> > > [  472.600845]  ? cleanup_transaction+0x650/0x650 [btrfs]
> > > [  472.602164]  ? preempt_count_sub+0x18/0xc0
> > > [  472.603111]  ? __rcu_read_unlock+0x67/0x90
> > > [  472.604011]  ? preempt_count_add+0x71/0xd0
> > > [  472.604840]  ? preempt_count_sub+0x18/0xc0
> > > [  472.605664]  ? preempt_count_add+0x71/0xd0
> > > [  472.606454]  ? preempt_count_sub+0x18/0xc0
> > > [  472.607251]  ? __up_write+0x125/0x300
> > > [  472.608059]  btrfs_sync_file+0x794/0xa20 [btrfs]
> > > [  472.609105]  ? start_ordered_ops.constprop.0+0xd0/0xd0 [btrfs]
> > > [  472.610392]  ? mark_held_locks+0x1a/0x80
> > > [  472.611179]  __x64_sys_fdatasync+0x70/0xb0
> > > [  472.614054]  do_syscall_64+0x3d/0x90
> > > [  472.614584]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> > > [  472.615522] RIP: 0033:0x7f9bbbe983d4
> > >
> > >  115 void extent_io_tree_release(struct extent_io_tree *tree)
> > >  116 {
> > >  117         struct extent_state *state;
> > >  118         struct extent_state *tmp;
> > >  119
> > >  120         spin_lock(&tree->lock);
> > >  121         rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
> > >  122                 /* Clear node to keep free_extent_state() happy. */
> > >  123                 RB_CLEAR_NODE(&state->rb_node);
> > >  124                 ASSERT(!(state->state & EXTENT_LOCKED));
> > >  125                 /*
> > >  126                  * No need for a memory barrier here, as we are holding the tree
> > >  127                  * lock and we only change the waitqueue while holding that lock
> > >  128                  * (see wait_extent_bit()).
> > >  129                  */
> > >  130                 ASSERT(!waitqueue_active(&state->wq));
> > >  131                 free_extent_state(state);
> > >  132                 cond_resched();
> > >  ^^^
> > >
> > > should be cond_resched_lock() as it's under spinlock but then I'm not
> > > sure if relocking is still safe in the middle of the tree traversal.
> > 
> > cond_resched_lock() works here under the assumption that at this point no other
> > tasks access the tree (as documented in an earlier patch) - if the
> > assumption is broken in
> > the future, then another task can access a node that was freed before
> > the cond_resched_lock()
> > call and result in a use-after-free or other weird issues.
> > 
> > But I think it's best to remove the cond_resched() call. I removed it
> > and then added it again
> > but didn't think properly and ran on a vm without several debugging
> > configs turned on, likely
> > why I didn't get that splat.
> > 
> > Do you want me to send a v2 with the cond_resched() line deleted?
> 
> Better, we can keep the cond_resched_lock() safely with this incremental on
> top of the patch:
> 
> diff --git a/fs/btrfs/extent-io-tree.c b/fs/btrfs/extent-io-tree.c
> index 6e3645afaa38..32788fb7837e 100644
> --- a/fs/btrfs/extent-io-tree.c
> +++ b/fs/btrfs/extent-io-tree.c
> @@ -114,11 +114,14 @@ void extent_io_tree_init(struct btrfs_fs_info *fs_info,
>   */
>  void extent_io_tree_release(struct extent_io_tree *tree)
>  {
> +       struct rb_root root;
>         struct extent_state *state;
>         struct extent_state *tmp;
>  
>         spin_lock(&tree->lock);
> -       rbtree_postorder_for_each_entry_safe(state, tmp, &tree->state, rb_node) {
> +       root = tree->state;
> +       tree->state = RB_ROOT;
> +       rbtree_postorder_for_each_entry_safe(state, tmp, &root, rb_node) {
>                 /* Clear node to keep free_extent_state() happy. */
>                 RB_CLEAR_NODE(&state->rb_node);
>                 ASSERT(!(state->state & EXTENT_LOCKED));
> @@ -129,9 +132,13 @@ void extent_io_tree_release(struct extent_io_tree *tree)
>                  */
>                 ASSERT(!waitqueue_active(&state->wq));
>                 free_extent_state(state);
> -               cond_resched();
> +               cond_resched_lock(&tree->lock);
>         }
> -       tree->state = RB_ROOT;
> +       /*
> +        * Should still be empty even after a reschedule, no other task should
> +        * be accessing the tree anymore.
> +        */
> +       ASSERT(RB_EMPTY_ROOT(&tree->state));
>         spin_unlock(&tree->lock);
>  }
>  
> Also pasted at: https://pastebin.com/raw/uVMP2e5b
> 
> Let me know if you prefer I send a v2 or squash this patch.

No problem with the fixups, this is easier for me. Good that we can keep
the cond_resched, thanks.
David Sterba Sept. 29, 2023, 4:47 p.m. UTC | #6
On Fri, Sep 22, 2023 at 11:39:01AM +0100, fdmanana@kernel.org wrote:
> From: Filipe Manana <fdmanana@suse.com>
> 
> These are some changes to make some io tree operations more efficient and
> some cleanups. More details on the changelogs.
> 
> Filipe Manana (8):
>   btrfs: make extent state merges more efficient during insertions
>   btrfs: update stale comment at extent_io_tree_release()
>   btrfs: make wait_extent_bit() static
>   btrfs: remove pointless memory barrier from extent_io_tree_release()
>   btrfs: collapse wait_on_state() to its caller wait_extent_bit()
>   btrfs: make extent_io_tree_release() more efficient
>   btrfs: use extent_io_tree_release() to empty dirty log pages
>   btrfs: make sure we cache next state in find_first_extent_bit()

Added to misc-next, thanks. The commit message for patch 4 was changed.