btrfs: relocation: Fix KASAN reports caused by extended reloc tree lifespan

Message ID 20200104135602.34601-1-wqu@suse.com
State New
Series
  • btrfs: relocation: Fix KASAN reports caused by extended reloc tree lifespan

Commit Message

Qu Wenruo Jan. 4, 2020, 1:56 p.m. UTC
[BUG]
There are several different KASAN reports for balance + snapshot
workloads.
Involved call paths include:

   should_ignore_root+0x54/0xb0 [btrfs]
   build_backref_tree+0x11af/0x2280 [btrfs]
   relocate_tree_blocks+0x391/0xb80 [btrfs]
   relocate_block_group+0x3e5/0xa00 [btrfs]
   btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
   btrfs_relocate_chunk+0x53/0xf0 [btrfs]
   btrfs_balance+0xc91/0x1840 [btrfs]
   btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
   btrfs_ioctl+0x8af/0x3e60 [btrfs]
   do_vfs_ioctl+0x831/0xb10
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x43/0x50
   do_syscall_64+0x79/0xe0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

   create_reloc_root+0x9f/0x460 [btrfs]
   btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
   create_pending_snapshot+0xa9b/0x15f0 [btrfs]
   create_pending_snapshots+0x111/0x140 [btrfs]
   btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
   btrfs_mksubvol+0x915/0x960 [btrfs]
   btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
   btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
   btrfs_ioctl+0x241b/0x3e60 [btrfs]
   do_vfs_ioctl+0x831/0xb10
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x43/0x50
   do_syscall_64+0x79/0xe0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

   btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
   create_pending_snapshot+0x209/0x15f0 [btrfs]
   create_pending_snapshots+0x111/0x140 [btrfs]
   btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
   btrfs_mksubvol+0x915/0x960 [btrfs]
   btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
   btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
   btrfs_ioctl+0x241b/0x3e60 [btrfs]
   do_vfs_ioctl+0x831/0xb10
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x43/0x50
   do_syscall_64+0x79/0xe0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

[CAUSE]
All these call sites rely only on root->reloc_root, which can be dropped
by btrfs_drop_snapshot() at any time. Since we have no real refcount-based
protection for reloc roots, we can end up accessing an already freed reloc
root, triggering KASAN.

[FIX]
To avoid such access to an unstable root->reloc_root, check the
BTRFS_ROOT_DEAD_RELOC_TREE bit first.

This patch introduces a new wrapper, have_reloc_root(), to do the proper
check for most callers, who don't distinguish between a merged reloc tree
and no reloc tree.

The only exception is should_ignore_root(): a merged reloc tree can be
ignored, while a missing reloc tree can't.

Also, since set_bit()/clear_bit()/test_bit() don't imply a memory barrier,
and BTRFS_ROOT_DEAD_RELOC_TREE is the only indicator, add extra memory
barriers around accesses to that bit.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
Signed-off-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Difference between this and David's diff:
- Use proper smp_mb__after_atomic() for clear_bit()
- Use a test_bit()-only check for should_ignore_root()
  That call site is an exception and can't use the regular have_reloc_root() check
- Add extra comment for have_reloc_root()
---
 fs/btrfs/relocation.c | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)
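For illustration, the distinction the commit message draws between a dead
(merged) reloc tree and no reloc tree at all can be modeled in plain
userspace C. This is only a sketch: the struct, field names, and helper
names below are hypothetical stand-ins for the kernel code, and the
memory-barrier aspect is deliberately left out here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct btrfs_root; names are illustrative only. */
struct model_root {
	bool dead_reloc_tree;	/* models the BTRFS_ROOT_DEAD_RELOC_TREE bit */
	void *reloc_root;	/* models root->reloc_root */
};

/* Models have_reloc_root(): a dead (merged) reloc tree counts the same as
 * having no reloc tree at all, which is what most callers want. */
static bool model_have_reloc_root(const struct model_root *root)
{
	if (root->dead_reloc_tree)
		return false;
	return root->reloc_root != NULL;
}

/* Models should_ignore_root(): here the two cases differ.  A merged (dead)
 * reloc tree means the root can be ignored (return 1), while a root with
 * no reloc tree must not be ignored (return 0). */
static int model_should_ignore_root(const struct model_root *root)
{
	if (root->dead_reloc_tree)
		return 1;
	if (!root->reloc_root)
		return 0;
	return 0;	/* the real function does more checks on the live tree */
}
```

This is why should_ignore_root() cannot simply call have_reloc_root(): the
helper collapses exactly the two cases that this one call site must keep
apart.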

Comments

Nikolay Borisov Jan. 5, 2020, 2:49 p.m. UTC | #1
On 4.01.20 г. 15:56 ч., Qu Wenruo wrote:
<snip>

> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index d897a8e5e430..586f045bb6dc 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -517,6 +517,23 @@ static int update_backref_cache(struct btrfs_trans_handle *trans,
>  	return 1;
>  }
>  
> +/*
> + * Check if this subvolume tree has valid reloc(*) tree.
> + *
> + * *: Reloc tree after swap is considered dead, thus not considered as valid.
> + *    This is enough for most callers, as they don't distinguish dead reloc
> + *    root from no reloc root.
> + *    But should_ignore_root() below is a special case.
> + */
> +static bool have_reloc_root(struct btrfs_root *root)
> +{
> +	smp_mb__before_atomic();
> +	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
> +		return false;
> +	if (!root->reloc_root)
> +		return false;
> +	return true;
> +}
>  
>  static int should_ignore_root(struct btrfs_root *root)
>  {
> @@ -525,6 +542,11 @@ static int should_ignore_root(struct btrfs_root *root)
>  	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
>  		return 0;
>  
> +	/* This root has been merged with its reloc tree, we can ignore it */
> +	smp_mb__before_atomic();

Haven't analyzed the patch deeply but if you add memory barriers you
*must* add comments explaining the ordering guarantees those barriers
provide.

<snip>
Josef Bacik Jan. 6, 2020, 4:33 p.m. UTC | #2
On 1/4/20 8:56 AM, Qu Wenruo wrote:
<snip>

This took me a minute to figure out, but from what I can tell you are doing
the mb's around the BTRFS_ROOT_DEAD_RELOC_TREE flag so that in
clean_dirty_subvols(), where we clear the bit and then set root->reloc_root
= NULL, we are sure to either see the bit set or see reloc_root == NULL.

That's fine, but man, all these random memory barriers around the bit
manipulation make zero sense and confuse the issue; what we really want is
the smp_mb__after_atomic() in clean_dirty_subvols() and the
smp_mb__before_atomic() in have_reloc_root().

But instead since we really want to know the right answer for root->reloc_root, 
and we clear that _before_ we clear the BTRFS_ROOT_DEAD_RELOC_TREE let's just do 
READ_ONCE/WRITE_ONCE everywhere we access the reloc_root.  In fact you could just do

static struct btrfs_root *get_reloc_root(struct btrfs_root *root)
{
	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
		return NULL;
	return READ_ONCE(root->reloc_root);
}

then instead of the pattern of

if (!have_reloc_root(root))
	return;
reloc_root = root->reloc_root;

do

reloc_root = get_reloc_root(root);
if (!reloc_root)
	return;

Then you only have the READ_ONCE/WRITE_ONCE in one place.  Thanks,

Josef
David Sterba Jan. 6, 2020, 6:04 p.m. UTC | #3
On Mon, Jan 06, 2020 at 11:33:51AM -0500, Josef Bacik wrote:
> This took me a minute to figure out, but from what I can tell you are doing the 
> mb's around the BTRFS_ROOT_DEAD_RELOC_TREE flag so that in clean_dirty_subvols() 
> where we clear the bit and then set root->reloc_root = NULL we are sure to 
> either see the bit or that reloc_root == NULL.
> 
> That's fine, but man all these random memory barriers around the bit messing 
> make 0 sense and confuse the issue, what we really want is the 
> smp_mb__after_atomic() in clean_dirty_subvols() and the smp_mb__before_atomic() 
> in have_reloc_root().

The barriers around test_bit are required; test_bit could be reordered
as it's not a RMW operation. I suggest reading docs/atomic_t.rst on that
topic.

> But instead since we really want to know the right answer for root->reloc_root, 
> and we clear that _before_ we clear the BTRFS_ROOT_DEAD_RELOC_TREE let's just do 
> READ_ONCE/WRITE_ONCE everywhere we access the reloc_root.  In fact you could just do

But READ/WRITE_ONCE don't guarantee CPU ordering, only that the compiler
will not reload the variable in case it's used more than once.

> static struct btrfs_root get_reloc_root(struct btrfs_root *root)
> {
> 	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
> 		return NULL;
> 	return READ_ONCE(root->reloc_root);

Use of READ_ONCE has no effect here and produces the same buggy code as
we have now.

I sent the code to Qu in the previous discussion as work in progress,
with uncommented barriers, expecting that they would be documented in the
final version. So don't blame him, I should not have left the barrier
reasoning only on him. I'll comment under the patch.
David Sterba Jan. 6, 2020, 6:15 p.m. UTC | #4
On Sat, Jan 04, 2020 at 09:56:02PM +0800, Qu Wenruo wrote:
>  }
>  
> +/*
> + * Check if this subvolume tree has valid reloc(*) tree.
> + *
> + * *: Reloc tree after swap is considered dead, thus not considered as valid.
> + *    This is enough for most callers, as they don't distinguish dead reloc
> + *    root from no reloc root.
> + *    But should_ignore_root() below is a special case.
> + */
> +static bool have_reloc_root(struct btrfs_root *root)
> +{
> +	smp_mb__before_atomic();

That one should be the easiest: to get an up-to-date value of the bit,
sync before reading it. Similar to smp_rmb.

> +	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
> +		return false;
> +	if (!root->reloc_root)
> +		return false;
> +	return true;
> +}
>  
>  static int should_ignore_root(struct btrfs_root *root)
>  {
> @@ -525,6 +542,11 @@ static int should_ignore_root(struct btrfs_root *root)
>  	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
>  		return 0;
>  
> +	/* This root has been merged with its reloc tree, we can ignore it */
> +	smp_mb__before_atomic();

This could be replaced by have_reloc_root but then reloc_root has to be
checked twice in that function. Here it was slightly optimized as it
partially opencodes have_reloc_root. For clarity and fewer standalone
barriers, using the helper might be better.

> +	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
> +		return 1;
> +
>  	reloc_root = root->reloc_root;
>  	if (!reloc_root)
>  		return 0;
> @@ -1439,6 +1461,7 @@ int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
>  	 * The subvolume has reloc tree but the swap is finished, no need to
>  	 * create/update the dead reloc tree
>  	 */
> +	smp_mb__before_atomic();

Another partial have_reloc_root, could be used here as well with an
additional reloc_root check.

>  	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
>  		return 0;
>  
> @@ -1478,8 +1501,7 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
>  	struct btrfs_root_item *root_item;
>  	int ret;
>  
> -	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state) ||
> -	    !root->reloc_root)
> +	if (!have_reloc_root(root))
>  		goto out;
>  
>  	reloc_root = root->reloc_root;
> @@ -1489,6 +1511,7 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
>  	if (fs_info->reloc_ctl->merge_reloc_tree &&
>  	    btrfs_root_refs(root_item) == 0) {
>  		set_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);

First set the bit, so anybody who properly uses barriers before checking
the bit will see it set

> +		smp_mb__after_atomic();

since the reloc_root pointer is not safe to be accessed since this point

>  		__del_reloc_root(reloc_root);
>  	}
>  
> @@ -2202,6 +2225,7 @@ static int clean_dirty_subvols(struct reloc_control *rc)
>  					ret = ret2;
>  			}
>  			clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);

This one looks misplaced and reverse, root->reloc_root is set to NULL a
few lines before and the barrier must be between this and clear_bit.
This was not in my proposed version, why did you change that?



> +			smp_mb__after_atomic();
>  			btrfs_put_fs_root(root);
>  		} else {
>  			/* Orphan reloc tree, just clean it up */
> @@ -4717,7 +4741,7 @@ void btrfs_reloc_pre_snapshot(struct btrfs_pending_snapshot *pending,
>  	struct btrfs_root *root = pending->root;
>  	struct reloc_control *rc = root->fs_info->reloc_ctl;
>  
> -	if (!root->reloc_root || !rc)
> +	if (!rc || !have_reloc_root(root))
>  		return;
>  
>  	if (!rc->merge_reloc_tree)
> @@ -4751,7 +4775,7 @@ int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
>  	struct reloc_control *rc = root->fs_info->reloc_ctl;
>  	int ret;
>  
> -	if (!root->reloc_root || !rc)
> +	if (!rc || !have_reloc_root(root))
>  		return 0;
>  
>  	rc = root->fs_info->reloc_ctl;
> -- 
> 2.24.1
Josef Bacik Jan. 6, 2020, 7:26 p.m. UTC | #5
On 1/6/20 1:04 PM, David Sterba wrote:
> On Mon, Jan 06, 2020 at 11:33:51AM -0500, Josef Bacik wrote:
>> This took me a minute to figure out, but from what I can tell you are doing the
>> mb's around the BTRFS_ROOT_DEAD_RELOC_TREE flag so that in clean_dirty_subvols()
>> where we clear the bit and then set root->reloc_root = NULL we are sure to
>> either see the bit or that reloc_root == NULL.
>>
>> That's fine, but man all these random memory barriers around the bit messing
>> make 0 sense and confuse the issue, what we really want is the
>> smp_mb__after_atomic() in clean_dirty_subvols() and the smp_mb__before_atomic()
>> in have_reloc_root().
> 
> The barriers around test_bit are required, test_bit could be reordered
> as it's not a RMW operation. I suggest reding docs/atomic_t.rst on that
> topic.
> 
>> But instead since we really want to know the right answer for root->reloc_root,
>> and we clear that _before_ we clear the BTRFS_ROOT_DEAD_RELOC_TREE let's just do
>> READ_ONCE/WRITE_ONCE everywhere we access the reloc_root.  In fact you could just do
> 
> But READ/WRITE_ONCE don't guarantee CPU-ordering, only that compiler
> will not reload the variable in case it's used more than once.
> 
>> static struct btrfs_root get_reloc_root(struct btrfs_root *root)
>> {
>> 	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
>> 		return NULL;
>> 	return READ_ONCE(root->reloc_root);
> 
> Use of READ_ONCE has no effect here and produces the same buggy code as
> we have now.
> 

Hmm, I didn't follow smp_read_barrier_depends() all the way down; I assumed
it at least protected from re-ordering. Looks like it only does something
on Alpha.

> I sent the code to Qu in the previous discussion as work in progress,
> with uncommented barriers, expecting that they will be documented in the
> final version. So don't blame him, I should have not let barriers
> reasoning left only on him. I'll comment under the patch.
> 

There are still just too many of them. Like I said before, we're only
worried about either BTRFS_ROOT_DEAD_RELOC_TREE or !root->reloc_root. So I
guess instead do something like

static struct btrfs_root *get_reloc_root(struct btrfs_root *root)
{
	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
		return NULL;
	smp_mb__after_atomic();
	return root->reloc_root;
}

And then in clean_dirty_subvols() do the smp_mb__before_atomic() before the 
clear_bit.  There's no reason for the random mb's around the other test_bit's. 
Thanks,

Josef
Qu Wenruo Jan. 7, 2020, 2:30 a.m. UTC | #6
On 2020/1/7 上午2:15, David Sterba wrote:
> On Sat, Jan 04, 2020 at 09:56:02PM +0800, Qu Wenruo wrote:
>>  }
>>  
>> +/*
>> + * Check if this subvolume tree has valid reloc(*) tree.
>> + *
>> + * *: Reloc tree after swap is considered dead, thus not considered as valid.
>> + *    This is enough for most callers, as they don't distinguish dead reloc
>> + *    root from no reloc root.
>> + *    But should_ignore_root() below is a special case.
>> + */
>> +static bool have_reloc_root(struct btrfs_root *root)
>> +{
>> +	smp_mb__before_atomic();
> 
> That one should be the easiest, to get an up to date value of the bit,
> sync before reading it. Similar to smp_rmb.

Yep, if we just went with plain rmb/wmb everything would be much easier to
understand.

But since a full rmb/wmb is expensive, and in this case we're only
addressing certain arches which don't follow Total Store Order, we still
need to use that variant.


One more question.

Why not use before_atomic() and after_atomic() to surround the
set_bit()/test_bit()?

If before and after act as rmb/wmb, then we don't really need this
before_atomic call around test_bit().

> 
>> +	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
>> +		return false;
>> +	if (!root->reloc_root)
>> +		return false;
>> +	return true;
>> +}
>>  
>>  static int should_ignore_root(struct btrfs_root *root)
>>  {
>> @@ -525,6 +542,11 @@ static int should_ignore_root(struct btrfs_root *root)
>>  	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
>>  		return 0;
>>  
>> +	/* This root has been merged with its reloc tree, we can ignore it */
>> +	smp_mb__before_atomic();
> 
> This could be replaced by have_reloc_root but the reloc_root has to be
> check twice in that function. Here it was slightly optimized as it
> partially opencodes have_reloc_root. For clarity and fewer standalone
> barriers using the helper might be better.

The problem is, have_reloc_root() returns false if either:
- DEAD_RELOC_TREE is set
- there is no reloc_root

What we really want is: if the bit is set, return 1, but if there is no
reloc root, return 0 instead.

So we can't use that helper at all.

> 
>> +	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
>> +		return 1;
>> +
>>  	reloc_root = root->reloc_root;
>>  	if (!reloc_root)
>>  		return 0;
>> @@ -1439,6 +1461,7 @@ int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
>>  	 * The subvolume has reloc tree but the swap is finished, no need to
>>  	 * create/update the dead reloc tree
>>  	 */
>> +	smp_mb__before_atomic();
> 
> Another partial have_reloc_root, could be used here as well with
> additional reloc_tree check.

Same problem as should_ignore_root().
Or we need to do an extra reloc_root check outside the helper.

> 
>>  	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
>>  		return 0;
>>  
>> @@ -1478,8 +1501,7 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
>>  	struct btrfs_root_item *root_item;
>>  	int ret;
>>  
>> -	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state) ||
>> -	    !root->reloc_root)
>> +	if (!have_reloc_root(root))
>>  		goto out;
>>  
>>  	reloc_root = root->reloc_root;
>> @@ -1489,6 +1511,7 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
>>  	if (fs_info->reloc_ctl->merge_reloc_tree &&
>>  	    btrfs_root_refs(root_item) == 0) {
>>  		set_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
> 
> First set the bit, so anybody who properly uses barriers before checking
> the bit will see it set

Still the same question: why not use the before and after versions around
set_bit()/clear_bit(), so test_bit() doesn't need an extra before_atomic
call?
> 
>> +		smp_mb__after_atomic();
> 
> since the reloc_root pointer is not safe to be accessed since this point
> 
>>  		__del_reloc_root(reloc_root);
>>  	}
>>  
>> @@ -2202,6 +2225,7 @@ static int clean_dirty_subvols(struct reloc_control *rc)
>>  					ret = ret2;
>>  			}
>>  			clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
> 
> This one looks misplaced and reverse, root->reloc_root is set to NULL a
> few lines before and the barrier must be between this and clear_bit.

Got the point.

> This was not in my proposed version, why did you change that?

I thought the clear_bit() must be visible to all later operations, thus
the after_atomic() is needed.
But I forgot that reloc_root is set to NULL first.

So in that case, I guess we need both barriers.

Thanks,
Qu

> 
> 
> 
>> +			smp_mb__after_atomic();
>>  			btrfs_put_fs_root(root);
>>  		} else {
>>  			/* Orphan reloc tree, just clean it up */
>> @@ -4717,7 +4741,7 @@ void btrfs_reloc_pre_snapshot(struct btrfs_pending_snapshot *pending,
>>  	struct btrfs_root *root = pending->root;
>>  	struct reloc_control *rc = root->fs_info->reloc_ctl;
>>  
>> -	if (!root->reloc_root || !rc)
>> +	if (!rc || !have_reloc_root(root))
>>  		return;
>>  
>>  	if (!rc->merge_reloc_tree)
>> @@ -4751,7 +4775,7 @@ int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
>>  	struct reloc_control *rc = root->fs_info->reloc_ctl;
>>  	int ret;
>>  
>> -	if (!root->reloc_root || !rc)
>> +	if (!rc || !have_reloc_root(root))
>>  		return 0;
>>  
>>  	rc = root->fs_info->reloc_ctl;
>> -- 
>> 2.24.1
Qu Wenruo Jan. 7, 2020, 2:35 a.m. UTC | #7
On 2020/1/7 上午10:30, Qu Wenruo wrote:
> 
> 
> On 2020/1/7 上午2:15, David Sterba wrote:
<snip>
> One more question.
> 
> Why not use before_atomic() and after_atomic() to surround the
> set_bit()/test_bit()?

Typo, it's set_bit()/clear_bit().

<snip>

Patch

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d897a8e5e430..586f045bb6dc 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -517,6 +517,23 @@  static int update_backref_cache(struct btrfs_trans_handle *trans,
 	return 1;
 }
 
+/*
+ * Check if this subvolume tree has valid reloc(*) tree.
+ *
+ * *: Reloc tree after swap is considered dead, thus not considered as valid.
+ *    This is enough for most callers, as they don't distinguish dead reloc
+ *    root from no reloc root.
+ *    But should_ignore_root() below is a special case.
+ */
+static bool have_reloc_root(struct btrfs_root *root)
+{
+	smp_mb__before_atomic();
+	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
+		return false;
+	if (!root->reloc_root)
+		return false;
+	return true;
+}
 
 static int should_ignore_root(struct btrfs_root *root)
 {
@@ -525,6 +542,11 @@  static int should_ignore_root(struct btrfs_root *root)
 	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
 		return 0;
 
+	/* This root has been merged with its reloc tree, we can ignore it */
+	smp_mb__before_atomic();
+	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
+		return 1;
+
 	reloc_root = root->reloc_root;
 	if (!reloc_root)
 		return 0;
@@ -1439,6 +1461,7 @@  int btrfs_init_reloc_root(struct btrfs_trans_handle *trans,
 	 * The subvolume has reloc tree but the swap is finished, no need to
 	 * create/update the dead reloc tree
 	 */
+	smp_mb__before_atomic();
 	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
 		return 0;
 
@@ -1478,8 +1501,7 @@  int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
 	struct btrfs_root_item *root_item;
 	int ret;
 
-	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state) ||
-	    !root->reloc_root)
+	if (!have_reloc_root(root))
 		goto out;
 
 	reloc_root = root->reloc_root;
@@ -1489,6 +1511,7 @@  int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
 	if (fs_info->reloc_ctl->merge_reloc_tree &&
 	    btrfs_root_refs(root_item) == 0) {
 		set_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
+		smp_mb__after_atomic();
 		__del_reloc_root(reloc_root);
 	}
 
@@ -2202,6 +2225,7 @@  static int clean_dirty_subvols(struct reloc_control *rc)
 					ret = ret2;
 			}
 			clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
+			smp_mb__after_atomic();
 			btrfs_put_fs_root(root);
 		} else {
 			/* Orphan reloc tree, just clean it up */
@@ -4717,7 +4741,7 @@  void btrfs_reloc_pre_snapshot(struct btrfs_pending_snapshot *pending,
 	struct btrfs_root *root = pending->root;
 	struct reloc_control *rc = root->fs_info->reloc_ctl;
 
-	if (!root->reloc_root || !rc)
+	if (!rc || !have_reloc_root(root))
 		return;
 
 	if (!rc->merge_reloc_tree)
@@ -4751,7 +4775,7 @@  int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
 	struct reloc_control *rc = root->fs_info->reloc_ctl;
 	int ret;
 
-	if (!root->reloc_root || !rc)
+	if (!rc || !have_reloc_root(root))
 		return 0;
 
 	rc = root->fs_info->reloc_ctl;
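The pairing David describes for clean_dirty_subvols() (clear
root->reloc_root first, then the bit, with the barrier between the two
stores, while readers check the bit before the pointer) can be exercised in
userspace with C11 atomics. This is a sketch under stated assumptions, not
the kernel code: a release store on the bit stands in for the barrier
before clear_bit(), an acquire load stands in for the one before
test_bit(), and all names are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

static int stale_root;				/* stands in for the old reloc root */
static _Atomic(int *) reloc_root = &stale_root;	/* models root->reloc_root */
static atomic_int dead_bit = 1;			/* models BTRFS_ROOT_DEAD_RELOC_TREE (set) */

/* Writer side, modeling clean_dirty_subvols(): clear the pointer first,
 * then the bit.  The release ordering on the bit store plays the role of
 * the memory barrier between the two stores discussed above; in the kernel
 * this runs concurrently with readers. */
static void model_clean_dirty_subvol(void)
{
	atomic_store_explicit(&reloc_root, NULL, memory_order_relaxed);
	atomic_store_explicit(&dead_bit, 0, memory_order_release);
}

/* Reader side, modeling have_reloc_root(): check the bit before the
 * pointer.  If the bit is observed clear, the acquire load pairs with the
 * release store above, so the pointer read below is guaranteed to see NULL
 * rather than the stale root, even on weakly ordered CPUs. */
static int *model_read_reloc_root(void)
{
	if (atomic_load_explicit(&dead_bit, memory_order_acquire))
		return NULL;	/* dead reloc tree: treat as no valid root */
	return atomic_load_explicit(&reloc_root, memory_order_relaxed);
}
```

Without the release/acquire pair, a reader on a weakly ordered CPU could in
principle see the bit clear yet still read the stale pointer, which is
exactly the window the thread is debating.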