[v2] btrfs: relocation: Fix KASAN reports caused by extended reloc tree lifespan
diff mbox series

Message ID 20200108051200.8909-1-wqu@suse.com
State New
Headers show
Series
  • [v2] btrfs: relocation: Fix KASAN reports caused by extended reloc tree lifespan
Related show

Commit Message

Qu Wenruo Jan. 8, 2020, 5:12 a.m. UTC
[BUG]
There are several different KASAN reports for balance + snapshot
workloads.
Involved call paths include:

   should_ignore_root+0x54/0xb0 [btrfs]
   build_backref_tree+0x11af/0x2280 [btrfs]
   relocate_tree_blocks+0x391/0xb80 [btrfs]
   relocate_block_group+0x3e5/0xa00 [btrfs]
   btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
   btrfs_relocate_chunk+0x53/0xf0 [btrfs]
   btrfs_balance+0xc91/0x1840 [btrfs]
   btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
   btrfs_ioctl+0x8af/0x3e60 [btrfs]
   do_vfs_ioctl+0x831/0xb10
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x43/0x50
   do_syscall_64+0x79/0xe0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

   create_reloc_root+0x9f/0x460 [btrfs]
   btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
   create_pending_snapshot+0xa9b/0x15f0 [btrfs]
   create_pending_snapshots+0x111/0x140 [btrfs]
   btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
   btrfs_mksubvol+0x915/0x960 [btrfs]
   btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
   btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
   btrfs_ioctl+0x241b/0x3e60 [btrfs]
   do_vfs_ioctl+0x831/0xb10
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x43/0x50
   do_syscall_64+0x79/0xe0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

   btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
   create_pending_snapshot+0x209/0x15f0 [btrfs]
   create_pending_snapshots+0x111/0x140 [btrfs]
   btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
   btrfs_mksubvol+0x915/0x960 [btrfs]
   btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
   btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
   btrfs_ioctl+0x241b/0x3e60 [btrfs]
   do_vfs_ioctl+0x831/0xb10
   ksys_ioctl+0x67/0x90
   __x64_sys_ioctl+0x43/0x50
   do_syscall_64+0x79/0xe0
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

[CAUSE]
All these call sites are only relying on root->reloc_root, which can
undergo btrfs_drop_snapshot(), and since we don't have real refcount
based protection to reloc roots, we can reach already dropped reloc
root, triggering KASAN.

[FIX]
To avoid such access to unstable root->reloc_root, we should check
BTRFS_ROOT_DEAD_RELOC_TREE bit first.

This patch introduces a new wrapper, have_reloc_root(), to do the proper
check for most callers who don't distinguish merged reloc tree and no
reloc tree.

The only exception is should_ignore_root(), as merged reloc tree can be
ignored, while no reloc tree shouldn't.

[CRITICAL SECTION ANALYSE]
Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
DEAD_RELOC_TREE bit has extra help from transaction as a higher level
barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:

	NULL: reloc_root is NULL	PTR: reloc_root is not NULL
	0: DEAD_RELOC_ROOT bit not set	DEAD: DEAD_RELOC_ROOT bit set

	(NULL, 0)    Initial state		 __
	  |					 /\ Section A
        btrfs_init_reloc_root()			 \/
	  |				 	 __
	(PTR, 0)     reloc_root initialized      /\
          |					 |
	btrfs_update_reloc_root()		 |  Section B
          |					 |
	(PTR, DEAD)  reloc_root has been merged  \/
          |					 __
	=== btrfs_commit_transaction() ====================
	  |					 /\
	clean_dirty_subvols()			 |
	  |					 |  Section C
	(NULL, DEAD) reloc_root cleanup starts   \/
          |					 __
	btrfs_drop_snapshot()			 /\
	  |					 |  Section D
	(NULL, 0)    Back to initial state	 \/

Very have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller has hold a
transaction handler, so none of such caller can cross transaction
boundary.

In Section A, every caller just found no DEAD bit, and grab reloc_root.

In the cross section A-B, caller may get no DEAD bit, but since
reloc_root is still completely valid thus accessing reloc_root is
completely safe.

No test_bit() caller can cross the boundary of Section B and Section C.

In Section C, every caller found the DEAD bit, so no one will access
reloc_root.

In the cross section C-D, either caller gets the DEAD bit set, avoiding
access reloc_root no matter if it's safe or not.
Or caller get the DEAD bit cleared, then access reloc_root, which is
already NULL, nothing will be wrong.

Here we need extra memory barrier in cross section C-D, to ensure
proper memory order between reloc_root and clear_bit().

In Section D, since no DEAD bit and no reloc_root, it's back to initial
state.

With this lifespan, it should be clear only one memory barrier is
needed, between setting reloc_root to NULL and clearing DEAD_RELOC_ROOT
bit.

Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
Suggested-by: David Sterba <dsterba@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
---
Changelog:
v2:
- Add the [CRITICAL SECTION ANALYSE] part
  This gets me into the rabbit hole of memory ordering, but thanks for
  the help from David (initially mentioning the mb hell) and Nikolay
  (for the proper doc), finally I could explain clearly why only
  one mb is needed.
- Add comment for the only needed memory barrier.
---
 fs/btrfs/relocation.c | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

Comments

Nikolay Borisov Jan. 8, 2020, 12:28 p.m. UTC | #1
On 8.01.20 г. 7:12 ч., Qu Wenruo wrote:
> [BUG]
> There are several different KASAN reports for balance + snapshot
> workloads.
> Involved call paths include:
> 
>    should_ignore_root+0x54/0xb0 [btrfs]
>    build_backref_tree+0x11af/0x2280 [btrfs]
>    relocate_tree_blocks+0x391/0xb80 [btrfs]
>    relocate_block_group+0x3e5/0xa00 [btrfs]
>    btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
>    btrfs_relocate_chunk+0x53/0xf0 [btrfs]
>    btrfs_balance+0xc91/0x1840 [btrfs]
>    btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
>    btrfs_ioctl+0x8af/0x3e60 [btrfs]
>    do_vfs_ioctl+0x831/0xb10
>    ksys_ioctl+0x67/0x90
>    __x64_sys_ioctl+0x43/0x50
>    do_syscall_64+0x79/0xe0
>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>    create_reloc_root+0x9f/0x460 [btrfs]
>    btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
>    create_pending_snapshot+0xa9b/0x15f0 [btrfs]
>    create_pending_snapshots+0x111/0x140 [btrfs]
>    btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
>    btrfs_mksubvol+0x915/0x960 [btrfs]
>    btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
>    btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
>    btrfs_ioctl+0x241b/0x3e60 [btrfs]
>    do_vfs_ioctl+0x831/0xb10
>    ksys_ioctl+0x67/0x90
>    __x64_sys_ioctl+0x43/0x50
>    do_syscall_64+0x79/0xe0
>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>    btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
>    create_pending_snapshot+0x209/0x15f0 [btrfs]
>    create_pending_snapshots+0x111/0x140 [btrfs]
>    btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
>    btrfs_mksubvol+0x915/0x960 [btrfs]
>    btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
>    btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
>    btrfs_ioctl+0x241b/0x3e60 [btrfs]
>    do_vfs_ioctl+0x831/0xb10
>    ksys_ioctl+0x67/0x90
>    __x64_sys_ioctl+0x43/0x50
>    do_syscall_64+0x79/0xe0
>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> [CAUSE]
> All these call sites are only relying on root->reloc_root, which can
> undergo btrfs_drop_snapshot(), and since we don't have real refcount

what do you mean by "root->reloc_root can undergo btrfs_drop_snapshot" ?

> based protection to reloc roots, we can reach already dropped reloc
> root, triggering KASAN.
what's the relationship between not having a refcount protection and
reaching reloc roots, perhaps you could expand the explanation?

> 
> [FIX]
> To avoid such access to unstable root->reloc_root, we should check
> BTRFS_ROOT_DEAD_RELOC_TREE bit first.
> 
> This patch introduces a new wrapper, have_reloc_root(), to do the proper
> check for most callers who don't distinguish merged reloc tree and no
> reloc tree.
> 
> The only exception is should_ignore_root(), as merged reloc tree can be
> ignored, while no reloc tree shouldn't.
> 
> [CRITICAL SECTION ANALYSE]
> Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
> DEAD_RELOC_TREE bit has extra help from transaction as a higher level
> barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:
> 
> 	NULL: reloc_root is NULL	PTR: reloc_root is not NULL
> 	0: DEAD_RELOC_ROOT bit not set	DEAD: DEAD_RELOC_ROOT bit set
> 
> 	(NULL, 0)    Initial state		 __
> 	  |					 /\ Section A
>         btrfs_init_reloc_root()			 \/
> 	  |				 	 __
> 	(PTR, 0)     reloc_root initialized      /\
>           |					 |
> 	btrfs_update_reloc_root()		 |  Section B
>           |					 |
> 	(PTR, DEAD)  reloc_root has been merged  \/
>           |					 __
> 	=== btrfs_commit_transaction() ====================
> 	  |					 /\
> 	clean_dirty_subvols()			 |
> 	  |					 |  Section C
> 	(NULL, DEAD) reloc_root cleanup starts   \/
>           |					 __
> 	btrfs_drop_snapshot()			 /\
> 	  |					 |  Section D
> 	(NULL, 0)    Back to initial state	 \/
> 
> Very have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller has hold a

 ^^ Perhaps you meant: Every caller of have_reloc_root or
test_bit(DED_RELOC_ROOT) holds a transaction handle which ensures
modifications in those function are limited to a single transaction?

> transaction handler, so none of such caller can cross transaction
> boundary.
> 
> In Section A, every caller just found no DEAD bit, and grab reloc_root.
> 
> In the cross section A-B, caller may get no DEAD bit, but since
> reloc_root is still completely valid thus accessing reloc_root is
> completely safe.
> 
> No test_bit() caller can cross the boundary of Section B and Section C.
> 
> In Section C, every caller found the DEAD bit, so no one will access
> reloc_root.
> 
> In the cross section C-D, either caller gets the DEAD bit set, avoiding
> access reloc_root no matter if it's safe or not.
> Or caller get the DEAD bit cleared, then access reloc_root, which is
> already NULL, nothing will be wrong.
> 
> Here we need extra memory barrier in cross section C-D, to ensure
> proper memory order between reloc_root and clear_bit().
> 
> In Section D, since no DEAD bit and no reloc_root, it's back to initial
> state.
> 
> With this lifespan, it should be clear only one memory barrier is
> needed, between setting reloc_root to NULL and clearing DEAD_RELOC_ROOT
> bit.
> 
> Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
> Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
> Suggested-by: David Sterba <dsterba@suse.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Add the [CRITICAL SECTION ANALYSE] part
>   This gets me into the rabbit hole of memory ordering, but thanks for
>   the help from David (initially mentioning the mb hell) and Nikolay
>   (for the proper doc), finally I could explain clearly why only
>   one mb is needed.
> - Add comment for the only needed memory barrier.
> ---
>  fs/btrfs/relocation.c | 32 ++++++++++++++++++++++++++++----
>  1 file changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index d897a8e5e430..17a2484f76a5 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -517,6 +517,22 @@ static int update_backref_cache(struct btrfs_trans_handle *trans,
>  	return 1;
>  }
>  
> +/*
> + * Check if this subvolume tree has valid reloc(*) tree.
> + *
> + * *: Reloc tree after swap is considered dead, thus not considered as valid.
> + *    This is enough for most callers, as they don't distinguish dead reloc
> + *    root from no reloc root.
> + *    But should_ignore_root() below is a special case.
> + */
> +static bool have_reloc_root(struct btrfs_root *root)
> +{
> +	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
> +		return false;
> +	if (!root->reloc_root)
> +		return false;
> +	return true;
> +}
>  
>  static int should_ignore_root(struct btrfs_root *root)
>  {
> @@ -525,6 +541,10 @@ static int should_ignore_root(struct btrfs_root *root)
>  	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
>  		return 0;
>  
> +	/* This root has been merged with its reloc tree, we can ignore it */
> +	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
> +		return 1;
> +
>  	reloc_root = root->reloc_root;
>  	if (!reloc_root)
>  		return 0;
> @@ -1478,8 +1498,7 @@ int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
>  	struct btrfs_root_item *root_item;
>  	int ret;
>  
> -	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state) ||
> -	    !root->reloc_root)
> +	if (!have_reloc_root(root))
>  		goto out;
>  
>  	reloc_root = root->reloc_root;
> @@ -2201,6 +2220,11 @@ static int clean_dirty_subvols(struct reloc_control *rc)
>  				if (ret2 < 0 && !ret)
>  					ret = ret2;
>  			}
> +			/*
> +			 * Need barrier to ensure clear_bit() only happens after
> +			 * root->reloc_root = NULL.
> +			 */
> +			smp_mb__before_atomic();
>  			clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
>  			btrfs_put_fs_root(root);
>  		} else {
> @@ -4717,7 +4741,7 @@ void btrfs_reloc_pre_snapshot(struct btrfs_pending_snapshot *pending,
>  	struct btrfs_root *root = pending->root;
>  	struct reloc_control *rc = root->fs_info->reloc_ctl;
>  
> -	if (!root->reloc_root || !rc)
> +	if (!rc || !have_reloc_root(root))
>  		return;
>  
>  	if (!rc->merge_reloc_tree)
> @@ -4751,7 +4775,7 @@ int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
>  	struct reloc_control *rc = root->fs_info->reloc_ctl;
>  	int ret;
>  
> -	if (!root->reloc_root || !rc)
> +	if (!rc || !have_reloc_root(root))
>  		return 0;
>  
>  	rc = root->fs_info->reloc_ctl;
>
Qu Wenruo Jan. 8, 2020, 12:36 p.m. UTC | #2
On 2020/1/8 下午8:28, Nikolay Borisov wrote:
> 
> 
> On 8.01.20 г. 7:12 ч., Qu Wenruo wrote:
>> [BUG]
>> There are several different KASAN reports for balance + snapshot
>> workloads.
>> Involved call paths include:
>>
>>    should_ignore_root+0x54/0xb0 [btrfs]
>>    build_backref_tree+0x11af/0x2280 [btrfs]
>>    relocate_tree_blocks+0x391/0xb80 [btrfs]
>>    relocate_block_group+0x3e5/0xa00 [btrfs]
>>    btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
>>    btrfs_relocate_chunk+0x53/0xf0 [btrfs]
>>    btrfs_balance+0xc91/0x1840 [btrfs]
>>    btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
>>    btrfs_ioctl+0x8af/0x3e60 [btrfs]
>>    do_vfs_ioctl+0x831/0xb10
>>    ksys_ioctl+0x67/0x90
>>    __x64_sys_ioctl+0x43/0x50
>>    do_syscall_64+0x79/0xe0
>>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>>    create_reloc_root+0x9f/0x460 [btrfs]
>>    btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
>>    create_pending_snapshot+0xa9b/0x15f0 [btrfs]
>>    create_pending_snapshots+0x111/0x140 [btrfs]
>>    btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
>>    btrfs_mksubvol+0x915/0x960 [btrfs]
>>    btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
>>    btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
>>    btrfs_ioctl+0x241b/0x3e60 [btrfs]
>>    do_vfs_ioctl+0x831/0xb10
>>    ksys_ioctl+0x67/0x90
>>    __x64_sys_ioctl+0x43/0x50
>>    do_syscall_64+0x79/0xe0
>>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>>    btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
>>    create_pending_snapshot+0x209/0x15f0 [btrfs]
>>    create_pending_snapshots+0x111/0x140 [btrfs]
>>    btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
>>    btrfs_mksubvol+0x915/0x960 [btrfs]
>>    btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
>>    btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
>>    btrfs_ioctl+0x241b/0x3e60 [btrfs]
>>    do_vfs_ioctl+0x831/0xb10
>>    ksys_ioctl+0x67/0x90
>>    __x64_sys_ioctl+0x43/0x50
>>    do_syscall_64+0x79/0xe0
>>    entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> [CAUSE]
>> All these call sites are only relying on root->reloc_root, which can
>> undergo btrfs_drop_snapshot(), and since we don't have real refcount
> 
> what do you mean by "root->reloc_root can undergo btrfs_drop_snapshot" ?

I mean some caller got root->reloc_root and use it, while
root->reloc_root soon get dropped by btrfs_drop_snapshot().

> 
>> based protection to reloc roots, we can reach already dropped reloc
>> root, triggering KASAN.
> what's the relationship between not having a refcount protection and
> reaching reloc roots, perhaps you could expand the explanation?

If we had a proper refcount protection, we could wait until we're the
last holder of reloc_root before calling btrfs_drop_snapshot().

And to me, that should be the correct solution, while this patch is just
a quick and maybe dirty fix mostly for backport.

> 
>>
>> [FIX]
>> To avoid such access to unstable root->reloc_root, we should check
>> BTRFS_ROOT_DEAD_RELOC_TREE bit first.
>>
>> This patch introduces a new wrapper, have_reloc_root(), to do the proper
>> check for most callers who don't distinguish merged reloc tree and no
>> reloc tree.
>>
>> The only exception is should_ignore_root(), as merged reloc tree can be
>> ignored, while no reloc tree shouldn't.
>>
>> [CRITICAL SECTION ANALYSE]
>> Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
>> DEAD_RELOC_TREE bit has extra help from transaction as a higher level
>> barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:
>>
>> 	NULL: reloc_root is NULL	PTR: reloc_root is not NULL
>> 	0: DEAD_RELOC_ROOT bit not set	DEAD: DEAD_RELOC_ROOT bit set
>>
>> 	(NULL, 0)    Initial state		 __
>> 	  |					 /\ Section A
>>         btrfs_init_reloc_root()			 \/
>> 	  |				 	 __
>> 	(PTR, 0)     reloc_root initialized      /\
>>           |					 |
>> 	btrfs_update_reloc_root()		 |  Section B
>>           |					 |
>> 	(PTR, DEAD)  reloc_root has been merged  \/
>>           |					 __
>> 	=== btrfs_commit_transaction() ====================
>> 	  |					 /\
>> 	clean_dirty_subvols()			 |
>> 	  |					 |  Section C
>> 	(NULL, DEAD) reloc_root cleanup starts   \/
>>           |					 __
>> 	btrfs_drop_snapshot()			 /\
>> 	  |					 |  Section D
>> 	(NULL, 0)    Back to initial state	 \/
>>
>> Very have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller has hold a
> 
>  ^^ Perhaps you meant: Every caller of have_reloc_root or
> test_bit(DED_RELOC_ROOT) holds a transaction handle which ensures
> modifications in those function are limited to a single transaction?

Yep, I mean *E*very.
It looks I need to replace the switch of my 'e' key...

Thanks,
Qu
Josef Bacik Jan. 8, 2020, 2:55 p.m. UTC | #3
On 1/8/20 12:12 AM, Qu Wenruo wrote:
> [BUG]
> There are several different KASAN reports for balance + snapshot
> workloads.
> Involved call paths include:
> 
>     should_ignore_root+0x54/0xb0 [btrfs]
>     build_backref_tree+0x11af/0x2280 [btrfs]
>     relocate_tree_blocks+0x391/0xb80 [btrfs]
>     relocate_block_group+0x3e5/0xa00 [btrfs]
>     btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
>     btrfs_relocate_chunk+0x53/0xf0 [btrfs]
>     btrfs_balance+0xc91/0x1840 [btrfs]
>     btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
>     btrfs_ioctl+0x8af/0x3e60 [btrfs]
>     do_vfs_ioctl+0x831/0xb10
>     ksys_ioctl+0x67/0x90
>     __x64_sys_ioctl+0x43/0x50
>     do_syscall_64+0x79/0xe0
>     entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>     create_reloc_root+0x9f/0x460 [btrfs]
>     btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
>     create_pending_snapshot+0xa9b/0x15f0 [btrfs]
>     create_pending_snapshots+0x111/0x140 [btrfs]
>     btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
>     btrfs_mksubvol+0x915/0x960 [btrfs]
>     btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
>     btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
>     btrfs_ioctl+0x241b/0x3e60 [btrfs]
>     do_vfs_ioctl+0x831/0xb10
>     ksys_ioctl+0x67/0x90
>     __x64_sys_ioctl+0x43/0x50
>     do_syscall_64+0x79/0xe0
>     entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
>     btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
>     create_pending_snapshot+0x209/0x15f0 [btrfs]
>     create_pending_snapshots+0x111/0x140 [btrfs]
>     btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
>     btrfs_mksubvol+0x915/0x960 [btrfs]
>     btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
>     btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
>     btrfs_ioctl+0x241b/0x3e60 [btrfs]
>     do_vfs_ioctl+0x831/0xb10
>     ksys_ioctl+0x67/0x90
>     __x64_sys_ioctl+0x43/0x50
>     do_syscall_64+0x79/0xe0
>     entry_SYSCALL_64_after_hwframe+0x49/0xbe
> 
> [CAUSE]
> All these call sites are only relying on root->reloc_root, which can
> undergo btrfs_drop_snapshot(), and since we don't have real refcount
> based protection to reloc roots, we can reach already dropped reloc
> root, triggering KASAN.
> 
> [FIX]
> To avoid such access to unstable root->reloc_root, we should check
> BTRFS_ROOT_DEAD_RELOC_TREE bit first.
> 
> This patch introduces a new wrapper, have_reloc_root(), to do the proper
> check for most callers who don't distinguish merged reloc tree and no
> reloc tree.
> 
> The only exception is should_ignore_root(), as merged reloc tree can be
> ignored, while no reloc tree shouldn't.
> 
> [CRITICAL SECTION ANALYSE]
> Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
> DEAD_RELOC_TREE bit has extra help from transaction as a higher level
> barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:
> 
> 	NULL: reloc_root is NULL	PTR: reloc_root is not NULL
> 	0: DEAD_RELOC_ROOT bit not set	DEAD: DEAD_RELOC_ROOT bit set
> 
> 	(NULL, 0)    Initial state		 __
> 	  |					 /\ Section A
>          btrfs_init_reloc_root()			 \/
> 	  |				 	 __
> 	(PTR, 0)     reloc_root initialized      /\
>            |					 |
> 	btrfs_update_reloc_root()		 |  Section B
>            |					 |
> 	(PTR, DEAD)  reloc_root has been merged  \/
>            |					 __
> 	=== btrfs_commit_transaction() ====================
> 	  |					 /\
> 	clean_dirty_subvols()			 |
> 	  |					 |  Section C
> 	(NULL, DEAD) reloc_root cleanup starts   \/
>            |					 __
> 	btrfs_drop_snapshot()			 /\
> 	  |					 |  Section D
> 	(NULL, 0)    Back to initial state	 \/
> 
> Very have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller has hold a
> transaction handler, so none of such caller can cross transaction
> boundary.
> 
> In Section A, every caller just found no DEAD bit, and grab reloc_root.
> 
> In the cross section A-B, caller may get no DEAD bit, but since
> reloc_root is still completely valid thus accessing reloc_root is
> completely safe.
> 
> No test_bit() caller can cross the boundary of Section B and Section C.
> 
> In Section C, every caller found the DEAD bit, so no one will access
> reloc_root.
> 
> In the cross section C-D, either caller gets the DEAD bit set, avoiding
> access reloc_root no matter if it's safe or not.
> Or caller get the DEAD bit cleared, then access reloc_root, which is
> already NULL, nothing will be wrong.
> 
> Here we need extra memory barrier in cross section C-D, to ensure
> proper memory order between reloc_root and clear_bit().
> 
> In Section D, since no DEAD bit and no reloc_root, it's back to initial
> state.
> 
> With this lifespan, it should be clear only one memory barrier is
> needed, between setting reloc_root to NULL and clearing DEAD_RELOC_ROOT
> bit.
> 
> Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
> Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion after merge_reloc_roots")
> Suggested-by: David Sterba <dsterba@suse.com>
> Signed-off-by: Qu Wenruo <wqu@suse.com>
> ---
> Changelog:
> v2:
> - Add the [CRITICAL SECTION ANALYSE] part
>    This gets me into the rabbit hole of memory ordering, but thanks for
>    the help from David (initially mentioning the mb hell) and Nikolay
>    (for the proper doc), finally I could explain clearly why only
>    one mb is needed.
> - Add comment for the only needed memory barrier.
> ---
>   fs/btrfs/relocation.c | 32 ++++++++++++++++++++++++++++----
>   1 file changed, 28 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> index d897a8e5e430..17a2484f76a5 100644
> --- a/fs/btrfs/relocation.c
> +++ b/fs/btrfs/relocation.c
> @@ -517,6 +517,22 @@ static int update_backref_cache(struct btrfs_trans_handle *trans,
>   	return 1;
>   }
>   
> +/*
> + * Check if this subvolume tree has valid reloc(*) tree.
> + *
> + * *: Reloc tree after swap is considered dead, thus not considered as valid.
> + *    This is enough for most callers, as they don't distinguish dead reloc
> + *    root from no reloc root.
> + *    But should_ignore_root() below is a special case.
> + */
> +static bool have_reloc_root(struct btrfs_root *root)
> +{
> +	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
> +		return false;

You still need a smp_mb__after_atomic() here, test_bit is unordered.  Thanks,

Josef
Nikolay Borisov Jan. 8, 2020, 3:03 p.m. UTC | #4
On 8.01.20 г. 16:55 ч., Josef Bacik wrote:
> On 1/8/20 12:12 AM, Qu Wenruo wrote:
>> [BUG]
>> There are several different KASAN reports for balance + snapshot
>> workloads.
>> Involved call paths include:
>>
>>     should_ignore_root+0x54/0xb0 [btrfs]
>>     build_backref_tree+0x11af/0x2280 [btrfs]
>>     relocate_tree_blocks+0x391/0xb80 [btrfs]
>>     relocate_block_group+0x3e5/0xa00 [btrfs]
>>     btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
>>     btrfs_relocate_chunk+0x53/0xf0 [btrfs]
>>     btrfs_balance+0xc91/0x1840 [btrfs]
>>     btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
>>     btrfs_ioctl+0x8af/0x3e60 [btrfs]
>>     do_vfs_ioctl+0x831/0xb10
>>     ksys_ioctl+0x67/0x90
>>     __x64_sys_ioctl+0x43/0x50
>>     do_syscall_64+0x79/0xe0
>>     entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>>     create_reloc_root+0x9f/0x460 [btrfs]
>>     btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
>>     create_pending_snapshot+0xa9b/0x15f0 [btrfs]
>>     create_pending_snapshots+0x111/0x140 [btrfs]
>>     btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
>>     btrfs_mksubvol+0x915/0x960 [btrfs]
>>     btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
>>     btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
>>     btrfs_ioctl+0x241b/0x3e60 [btrfs]
>>     do_vfs_ioctl+0x831/0xb10
>>     ksys_ioctl+0x67/0x90
>>     __x64_sys_ioctl+0x43/0x50
>>     do_syscall_64+0x79/0xe0
>>     entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>>     btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
>>     create_pending_snapshot+0x209/0x15f0 [btrfs]
>>     create_pending_snapshots+0x111/0x140 [btrfs]
>>     btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
>>     btrfs_mksubvol+0x915/0x960 [btrfs]
>>     btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
>>     btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
>>     btrfs_ioctl+0x241b/0x3e60 [btrfs]
>>     do_vfs_ioctl+0x831/0xb10
>>     ksys_ioctl+0x67/0x90
>>     __x64_sys_ioctl+0x43/0x50
>>     do_syscall_64+0x79/0xe0
>>     entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>
>> [CAUSE]
>> All these call sites are only relying on root->reloc_root, which can
>> undergo btrfs_drop_snapshot(), and since we don't have real refcount
>> based protection to reloc roots, we can reach already dropped reloc
>> root, triggering KASAN.
>>
>> [FIX]
>> To avoid such access to unstable root->reloc_root, we should check
>> BTRFS_ROOT_DEAD_RELOC_TREE bit first.
>>
>> This patch introduces a new wrapper, have_reloc_root(), to do the proper
>> check for most callers who don't distinguish merged reloc tree and no
>> reloc tree.
>>
>> The only exception is should_ignore_root(), as merged reloc tree can be
>> ignored, while no reloc tree shouldn't.
>>
>> [CRITICAL SECTION ANALYSE]
>> Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
>> DEAD_RELOC_TREE bit has extra help from transaction as a higher level
>> barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:
>>
>>     NULL: reloc_root is NULL    PTR: reloc_root is not NULL
>>     0: DEAD_RELOC_ROOT bit not set    DEAD: DEAD_RELOC_ROOT bit set
>>
>>     (NULL, 0)    Initial state         __
>>       |                     /\ Section A
>>          btrfs_init_reloc_root()             \/
>>       |                      __
>>     (PTR, 0)     reloc_root initialized      /\
>>            |                     |
>>     btrfs_update_reloc_root()         |  Section B
>>            |                     |
>>     (PTR, DEAD)  reloc_root has been merged  \/
>>            |                     __
>>     === btrfs_commit_transaction() ====================
>>       |                     /\
>>     clean_dirty_subvols()             |
>>       |                     |  Section C
>>     (NULL, DEAD) reloc_root cleanup starts   \/
>>            |                     __
>>     btrfs_drop_snapshot()             /\
>>       |                     |  Section D
>>     (NULL, 0)    Back to initial state     \/
>>
>> Very have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller has hold a
>> transaction handler, so none of such caller can cross transaction
>> boundary.
>>
>> In Section A, every caller just found no DEAD bit, and grab reloc_root.
>>
>> In the cross section A-B, caller may get no DEAD bit, but since
>> reloc_root is still completely valid thus accessing reloc_root is
>> completely safe.
>>
>> No test_bit() caller can cross the boundary of Section B and Section C.
>>
>> In Section C, every caller found the DEAD bit, so no one will access
>> reloc_root.
>>
>> In the cross section C-D, either caller gets the DEAD bit set, avoiding
>> access reloc_root no matter if it's safe or not.
>> Or caller get the DEAD bit cleared, then access reloc_root, which is
>> already NULL, nothing will be wrong.
>>
>> Here we need extra memory barrier in cross section C-D, to ensure
>> proper memory order between reloc_root and clear_bit().
>>
>> In Section D, since no DEAD bit and no reloc_root, it's back to initial
>> state.
>>
>> With this lifespan, it should be clear only one memory barrier is
>> needed, between setting reloc_root to NULL and clearing DEAD_RELOC_ROOT
>> bit.
>>
>> Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
>> Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion
>> after merge_reloc_roots")
>> Suggested-by: David Sterba <dsterba@suse.com>
>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>> ---
>> Changelog:
>> v2:
>> - Add the [CRITICAL SECTION ANALYSE] part
>>    This gets me into the rabbit hole of memory ordering, but thanks for
>>    the help from David (initially mentioning the mb hell) and Nikolay
>>    (for the proper doc), finally I could explain clearly why only
>>    one mb is needed.
>> - Add comment for the only needed memory barrier.
>> ---
>>   fs/btrfs/relocation.c | 32 ++++++++++++++++++++++++++++----
>>   1 file changed, 28 insertions(+), 4 deletions(-)
>>
>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>> index d897a8e5e430..17a2484f76a5 100644
>> --- a/fs/btrfs/relocation.c
>> +++ b/fs/btrfs/relocation.c
>> @@ -517,6 +517,22 @@ static int update_backref_cache(struct
>> btrfs_trans_handle *trans,
>>       return 1;
>>   }
>>   +/*
>> + * Check if this subvolume tree has valid reloc(*) tree.
>> + *
>> + * *: Reloc tree after swap is considered dead, thus not considered
>> as valid.
>> + *    This is enough for most callers, as they don't distinguish dead
>> reloc
>> + *    root from no reloc root.
>> + *    But should_ignore_root() below is a special case.
>> + */
>> +static bool have_reloc_root(struct btrfs_root *root)
>> +{
>> +    if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
>> +        return false;
> 
> You still need a smp_mb__after_atomic() here, test_bit is unordered. 

Nope, that won't do anything, since smp_mb__(After|before)_atomic only
orders RMW operations and test_bit is not an RMW operation. From
atomic_bitops.txt:


Non-RMW ops:



  test_bit()

Furthermore from atomic_t.txt:

The barriers:



  smp_mb__{before,after}_atomic()



only apply to the RMW atomic ops and can be used to augment/upgrade the

ordering inherent to the op.

> Thanks,
> 
> Josef
David Sterba Jan. 8, 2020, 3:08 p.m. UTC | #5
On Wed, Jan 08, 2020 at 05:03:35PM +0200, Nikolay Borisov wrote:
> 
> 
> On 8.01.20 г. 16:55 ч., Josef Bacik wrote:
> > On 1/8/20 12:12 AM, Qu Wenruo wrote:
> >> [BUG]
> >> There are several different KASAN reports for balance + snapshot
> >> workloads.
> >> Involved call paths include:
> >>
> >>     should_ignore_root+0x54/0xb0 [btrfs]
> >>     build_backref_tree+0x11af/0x2280 [btrfs]
> >>     relocate_tree_blocks+0x391/0xb80 [btrfs]
> >>     relocate_block_group+0x3e5/0xa00 [btrfs]
> >>     btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
> >>     btrfs_relocate_chunk+0x53/0xf0 [btrfs]
> >>     btrfs_balance+0xc91/0x1840 [btrfs]
> >>     btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
> >>     btrfs_ioctl+0x8af/0x3e60 [btrfs]
> >>     do_vfs_ioctl+0x831/0xb10
> >>     ksys_ioctl+0x67/0x90
> >>     __x64_sys_ioctl+0x43/0x50
> >>     do_syscall_64+0x79/0xe0
> >>     entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>
> >>     create_reloc_root+0x9f/0x460 [btrfs]
> >>     btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
> >>     create_pending_snapshot+0xa9b/0x15f0 [btrfs]
> >>     create_pending_snapshots+0x111/0x140 [btrfs]
> >>     btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
> >>     btrfs_mksubvol+0x915/0x960 [btrfs]
> >>     btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
> >>     btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
> >>     btrfs_ioctl+0x241b/0x3e60 [btrfs]
> >>     do_vfs_ioctl+0x831/0xb10
> >>     ksys_ioctl+0x67/0x90
> >>     __x64_sys_ioctl+0x43/0x50
> >>     do_syscall_64+0x79/0xe0
> >>     entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>
> >>     btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
> >>     create_pending_snapshot+0x209/0x15f0 [btrfs]
> >>     create_pending_snapshots+0x111/0x140 [btrfs]
> >>     btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
> >>     btrfs_mksubvol+0x915/0x960 [btrfs]
> >>     btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
> >>     btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
> >>     btrfs_ioctl+0x241b/0x3e60 [btrfs]
> >>     do_vfs_ioctl+0x831/0xb10
> >>     ksys_ioctl+0x67/0x90
> >>     __x64_sys_ioctl+0x43/0x50
> >>     do_syscall_64+0x79/0xe0
> >>     entry_SYSCALL_64_after_hwframe+0x49/0xbe
> >>
> >> [CAUSE]
> >> All these call sites are only relying on root->reloc_root, which can
> >> undergo btrfs_drop_snapshot(), and since we don't have real refcount
> >> based protection to reloc roots, we can reach already dropped reloc
> >> root, triggering KASAN.
> >>
> >> [FIX]
> >> To avoid such access to unstable root->reloc_root, we should check
> >> BTRFS_ROOT_DEAD_RELOC_TREE bit first.
> >>
> >> This patch introduces a new wrapper, have_reloc_root(), to do the proper
> >> check for most callers who don't distinguish merged reloc tree and no
> >> reloc tree.
> >>
> >> The only exception is should_ignore_root(), as merged reloc tree can be
> >> ignored, while no reloc tree shouldn't.
> >>
> >> [CRITICAL SECTION ANALYSE]
> >> Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
> >> DEAD_RELOC_TREE bit has extra help from transaction as a higher level
> >> barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:
> >>
> >>     NULL: reloc_root is NULL    PTR: reloc_root is not NULL
> >>     0: DEAD_RELOC_ROOT bit not set    DEAD: DEAD_RELOC_ROOT bit set
> >>
> >>     (NULL, 0)    Initial state         __
> >>       |                     /\ Section A
> >>          btrfs_init_reloc_root()             \/
> >>       |                      __
> >>     (PTR, 0)     reloc_root initialized      /\
> >>            |                     |
> >>     btrfs_update_reloc_root()         |  Section B
> >>            |                     |
> >>     (PTR, DEAD)  reloc_root has been merged  \/
> >>            |                     __
> >>     === btrfs_commit_transaction() ====================
> >>       |                     /\
> >>     clean_dirty_subvols()             |
> >>       |                     |  Section C
> >>     (NULL, DEAD) reloc_root cleanup starts   \/
> >>            |                     __
> >>     btrfs_drop_snapshot()             /\
> >>       |                     |  Section D
> >>     (NULL, 0)    Back to initial state     \/
> >>
> >> Very have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller has hold a
> >> transaction handler, so none of such caller can cross transaction
> >> boundary.
> >>
> >> In Section A, every caller just found no DEAD bit, and grab reloc_root.
> >>
> >> In the cross section A-B, caller may get no DEAD bit, but since
> >> reloc_root is still completely valid thus accessing reloc_root is
> >> completely safe.
> >>
> >> No test_bit() caller can cross the boundary of Section B and Section C.
> >>
> >> In Section C, every caller found the DEAD bit, so no one will access
> >> reloc_root.
> >>
> >> In the cross section C-D, either caller gets the DEAD bit set, avoiding
> >> access reloc_root no matter if it's safe or not.
> >> Or caller get the DEAD bit cleared, then access reloc_root, which is
> >> already NULL, nothing will be wrong.
> >>
> >> Here we need extra memory barrier in cross section C-D, to ensure
> >> proper memory order between reloc_root and clear_bit().
> >>
> >> In Section D, since no DEAD bit and no reloc_root, it's back to initial
> >> state.
> >>
> >> With this lifespan, it should be clear only one memory barrier is
> >> needed, between setting reloc_root to NULL and clearing DEAD_RELOC_ROOT
> >> bit.
> >>
> >> Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
> >> Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion
> >> after merge_reloc_roots")
> >> Suggested-by: David Sterba <dsterba@suse.com>
> >> Signed-off-by: Qu Wenruo <wqu@suse.com>
> >> ---
> >> Changelog:
> >> v2:
> >> - Add the [CRITICAL SECTION ANALYSE] part
> >>    This gets me into the rabbit hole of memory ordering, but thanks for
> >>    the help from David (initially mentioning the mb hell) and Nikolay
> >>    (for the proper doc), finally I could explain clearly why only
> >>    one mb is needed.
> >> - Add comment for the only needed memory barrier.
> >> ---
> >>   fs/btrfs/relocation.c | 32 ++++++++++++++++++++++++++++----
> >>   1 file changed, 28 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
> >> index d897a8e5e430..17a2484f76a5 100644
> >> --- a/fs/btrfs/relocation.c
> >> +++ b/fs/btrfs/relocation.c
> >> @@ -517,6 +517,22 @@ static int update_backref_cache(struct
> >> btrfs_trans_handle *trans,
> >>       return 1;
> >>   }
> >>   +/*
> >> + * Check if this subvolume tree has valid reloc(*) tree.
> >> + *
> >> + * *: Reloc tree after swap is considered dead, thus not considered
> >> as valid.
> >> + *    This is enough for most callers, as they don't distinguish dead
> >> reloc
> >> + *    root from no reloc root.
> >> + *    But should_ignore_root() below is a special case.
> >> + */
> >> +static bool have_reloc_root(struct btrfs_root *root)
> >> +{
> >> +    if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
> >> +        return false;
> > 
> > You still need a smp_mb__after_atomic() here, test_bit is unordered. 
> 
> Nope, that won't do anything, since smp_mb__(After|before)_atomic only
> orders RMW operations and test_bit is not an RMW operation. From
> atomic_bitops.txt:
> 
> 
> Non-RMW ops:
> 
> 
> 
>   test_bit()
> 
> Furthermore from atomic_t.txt:
> 
> The barriers:
> 
> 
> 
>   smp_mb__{before,after}_atomic()
> 
> 
> 
> only apply to the RMW atomic ops and can be used to augment/upgrade the
> 
> ordering inherent to the op.

The way I read it is more like smp_rmb/smp_wmb, but for bits in this
case, so the smp_mb__before/after_atomic was only a syntactic sugar to
match that it's atomic bitops. I realize this could have caused some
confusion, however I still think that some sort of barrier is needed.
David Sterba Jan. 8, 2020, 3:11 p.m. UTC | #6
On Wed, Jan 08, 2020 at 04:08:41PM +0100, David Sterba wrote:
> > >> +static bool have_reloc_root(struct btrfs_root *root)
> > >> +{
> > >> +    if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
> > >> +        return false;
> > > 
> > > You still need a smp_mb__after_atomic() here, test_bit is unordered. 
> > 
> > Nope, that won't do anything, since smp_mb__(After|before)_atomic only
> > orders RMW operations and test_bit is not an RMW operation. From
> > atomic_bitops.txt:
> > 
> > 
> > Non-RMW ops:
> > 
> > 
> > 
> >   test_bit()
> > 
> > Furthermore from atomic_t.txt:
> > 
> > The barriers:
> > 
> > 
> > 
> >   smp_mb__{before,after}_atomic()
> > 
> > 
> > 
> > only apply to the RMW atomic ops and can be used to augment/upgrade the
> > 
> > ordering inherent to the op.
> 
> The way I read it is more like smp_rmb/smp_wmb, but for bits in this
> case, so the smp_mb__before/after_atomic was only a syntactic sugar to
> match that it's atomic bitops. I realize this could have caused some
> confusion, however I still think that some sort of barrier is needed.

There's an existing pattern used for serializing set/clear of
BTRFS_ROOT_IN_TRANS_SETUP (record_root_in_trans,
btrfs_record_root_in_trans).

Once upon a time there were barriers like smp_mb__before_clear_bit but
they got unified to just smp_mb__before_atomic for all set/clear
operations, so I believe I was not all wrong with using them.
Josef Bacik Jan. 8, 2020, 3:19 p.m. UTC | #7
On 1/8/20 10:03 AM, Nikolay Borisov wrote:
> 
> 
> On 8.01.20 г. 16:55 ч., Josef Bacik wrote:
>> On 1/8/20 12:12 AM, Qu Wenruo wrote:
>>> [BUG]
>>> There are several different KASAN reports for balance + snapshot
>>> workloads.
>>> Involved call paths include:
>>>
>>>      should_ignore_root+0x54/0xb0 [btrfs]
>>>      build_backref_tree+0x11af/0x2280 [btrfs]
>>>      relocate_tree_blocks+0x391/0xb80 [btrfs]
>>>      relocate_block_group+0x3e5/0xa00 [btrfs]
>>>      btrfs_relocate_block_group+0x240/0x4d0 [btrfs]
>>>      btrfs_relocate_chunk+0x53/0xf0 [btrfs]
>>>      btrfs_balance+0xc91/0x1840 [btrfs]
>>>      btrfs_ioctl_balance+0x416/0x4e0 [btrfs]
>>>      btrfs_ioctl+0x8af/0x3e60 [btrfs]
>>>      do_vfs_ioctl+0x831/0xb10
>>>      ksys_ioctl+0x67/0x90
>>>      __x64_sys_ioctl+0x43/0x50
>>>      do_syscall_64+0x79/0xe0
>>>      entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>
>>>      create_reloc_root+0x9f/0x460 [btrfs]
>>>      btrfs_reloc_post_snapshot+0xff/0x6c0 [btrfs]
>>>      create_pending_snapshot+0xa9b/0x15f0 [btrfs]
>>>      create_pending_snapshots+0x111/0x140 [btrfs]
>>>      btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
>>>      btrfs_mksubvol+0x915/0x960 [btrfs]
>>>      btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
>>>      btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
>>>      btrfs_ioctl+0x241b/0x3e60 [btrfs]
>>>      do_vfs_ioctl+0x831/0xb10
>>>      ksys_ioctl+0x67/0x90
>>>      __x64_sys_ioctl+0x43/0x50
>>>      do_syscall_64+0x79/0xe0
>>>      entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>
>>>      btrfs_reloc_pre_snapshot+0x85/0xc0 [btrfs]
>>>      create_pending_snapshot+0x209/0x15f0 [btrfs]
>>>      create_pending_snapshots+0x111/0x140 [btrfs]
>>>      btrfs_commit_transaction+0x7a6/0x1360 [btrfs]
>>>      btrfs_mksubvol+0x915/0x960 [btrfs]
>>>      btrfs_ioctl_snap_create_transid+0x1d5/0x1e0 [btrfs]
>>>      btrfs_ioctl_snap_create_v2+0x1d3/0x270 [btrfs]
>>>      btrfs_ioctl+0x241b/0x3e60 [btrfs]
>>>      do_vfs_ioctl+0x831/0xb10
>>>      ksys_ioctl+0x67/0x90
>>>      __x64_sys_ioctl+0x43/0x50
>>>      do_syscall_64+0x79/0xe0
>>>      entry_SYSCALL_64_after_hwframe+0x49/0xbe
>>>
>>> [CAUSE]
>>> All these call sites are only relying on root->reloc_root, which can
>>> undergo btrfs_drop_snapshot(), and since we don't have real refcount
>>> based protection to reloc roots, we can reach already dropped reloc
>>> root, triggering KASAN.
>>>
>>> [FIX]
>>> To avoid such access to unstable root->reloc_root, we should check
>>> BTRFS_ROOT_DEAD_RELOC_TREE bit first.
>>>
>>> This patch introduces a new wrapper, have_reloc_root(), to do the proper
>>> check for most callers who don't distinguish merged reloc tree and no
>>> reloc tree.
>>>
>>> The only exception is should_ignore_root(), as merged reloc tree can be
>>> ignored, while no reloc tree shouldn't.
>>>
>>> [CRITICAL SECTION ANALYSE]
>>> Although test_bit()/set_bit()/clear_bit() doesn't imply a barrier, the
>>> DEAD_RELOC_TREE bit has extra help from transaction as a higher level
>>> barrier, the lifespan of root::reloc_root and DEAD_RELOC_TREE bit are:
>>>
>>>      NULL: reloc_root is NULL    PTR: reloc_root is not NULL
>>>      0: DEAD_RELOC_ROOT bit not set    DEAD: DEAD_RELOC_ROOT bit set
>>>
>>>      (NULL, 0)    Initial state         __
>>>        |                     /\ Section A
>>>           btrfs_init_reloc_root()             \/
>>>        |                      __
>>>      (PTR, 0)     reloc_root initialized      /\
>>>             |                     |
>>>      btrfs_update_reloc_root()         |  Section B
>>>             |                     |
>>>      (PTR, DEAD)  reloc_root has been merged  \/
>>>             |                     __
>>>      === btrfs_commit_transaction() ====================
>>>        |                     /\
>>>      clean_dirty_subvols()             |
>>>        |                     |  Section C
>>>      (NULL, DEAD) reloc_root cleanup starts   \/
>>>             |                     __
>>>      btrfs_drop_snapshot()             /\
>>>        |                     |  Section D
>>>      (NULL, 0)    Back to initial state     \/
>>>
>>> Very have_reloc_root() or test_bit(DEAD_RELOC_ROOT) caller has hold a
>>> transaction handler, so none of such caller can cross transaction
>>> boundary.
>>>
>>> In Section A, every caller just found no DEAD bit, and grab reloc_root.
>>>
>>> In the cross section A-B, caller may get no DEAD bit, but since
>>> reloc_root is still completely valid thus accessing reloc_root is
>>> completely safe.
>>>
>>> No test_bit() caller can cross the boundary of Section B and Section C.
>>>
>>> In Section C, every caller found the DEAD bit, so no one will access
>>> reloc_root.
>>>
>>> In the cross section C-D, either caller gets the DEAD bit set, avoiding
>>> access reloc_root no matter if it's safe or not.
>>> Or caller get the DEAD bit cleared, then access reloc_root, which is
>>> already NULL, nothing will be wrong.
>>>
>>> Here we need extra memory barrier in cross section C-D, to ensure
>>> proper memory order between reloc_root and clear_bit().
>>>
>>> In Section D, since no DEAD bit and no reloc_root, it's back to initial
>>> state.
>>>
>>> With this lifespan, it should be clear only one memory barrier is
>>> needed, between setting reloc_root to NULL and clearing DEAD_RELOC_ROOT
>>> bit.
>>>
>>> Reported-by: Zygo Blaxell <ce3g8jdj@umail.furryterror.org>
>>> Fixes: d2311e698578 ("btrfs: relocation: Delay reloc tree deletion
>>> after merge_reloc_roots")
>>> Suggested-by: David Sterba <dsterba@suse.com>
>>> Signed-off-by: Qu Wenruo <wqu@suse.com>
>>> ---
>>> Changelog:
>>> v2:
>>> - Add the [CRITICAL SECTION ANALYSE] part
>>>     This gets me into the rabbit hole of memory ordering, but thanks for
>>>     the help from David (initially mentioning the mb hell) and Nikolay
>>>     (for the proper doc), finally I could explain clearly why only
>>>     one mb is needed.
>>> - Add comment for the only needed memory barrier.
>>> ---
>>>    fs/btrfs/relocation.c | 32 ++++++++++++++++++++++++++++----
>>>    1 file changed, 28 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
>>> index d897a8e5e430..17a2484f76a5 100644
>>> --- a/fs/btrfs/relocation.c
>>> +++ b/fs/btrfs/relocation.c
>>> @@ -517,6 +517,22 @@ static int update_backref_cache(struct
>>> btrfs_trans_handle *trans,
>>>        return 1;
>>>    }
>>>    +/*
>>> + * Check if this subvolume tree has valid reloc(*) tree.
>>> + *
>>> + * *: Reloc tree after swap is considered dead, thus not considered
>>> as valid.
>>> + *    This is enough for most callers, as they don't distinguish dead
>>> reloc
>>> + *    root from no reloc root.
>>> + *    But should_ignore_root() below is a special case.
>>> + */
>>> +static bool have_reloc_root(struct btrfs_root *root)
>>> +{
>>> +    if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
>>> +        return false;
>>
>> You still need a smp_mb__after_atomic() here, test_bit is unordered.
> 
> Nope, that won't do anything, since smp_mb__(After|before)_atomic only
> orders RMW operations and test_bit is not an RMW operation. From
> atomic_bitops.txt:
> 
> 
> Non-RMW ops:
> 
> 
> 
>    test_bit()
> 
> Furthermore from atomic_t.txt:
> 
> The barriers:
> 
> 
> 
>    smp_mb__{before,after}_atomic()
> 
> 
> 
> only apply to the RMW atomic ops and can be used to augment/upgrade the
> 
> ordering inherent to the op.

Right but the document says it's unordered.  The problem we're trying to address 
here is making sure _either_ we see BTRFS_ROOT_DEAD_RELOC_TREE or we see 
!root->reloc_root.  Which means we don't want the test_bit being re-ordered WRT 
the clear_bit on the other side.  So the other side does

reloc_root = NULL;
smp_mb__before_atomic();
clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE);

now say on the other side we get re-ordered and we see reloc_root != NULL and we 
also see !test_bit(BTRFS_ROOT_DEAD_RELOC_TREE), and now we're screwed.

The smp_mb__after_atomic() guarantees that the re-ordering doesn't happen, correct?

I realize that this is mostly a moot point on real architectures, but I'm 
looking at things like ARM where test_bit() uses the generic asm helper, which 
could definitely be re-ordered as it's nothing special.  Thanks,

Josef
Qu Wenruo Jan. 9, 2020, 12:11 a.m. UTC | #8
On 2020/1/8 下午11:19, Josef Bacik wrote:
> On 1/8/20 10:03 AM, Nikolay Borisov wrote:
[...]
>>>> + */
>>>> +static bool have_reloc_root(struct btrfs_root *root)
>>>> +{
>>>> +    if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
>>>> +        return false;
>>>
>>> You still need a smp_mb__after_atomic() here, test_bit is unordered.
>>
>> Nope, that won't do anything, since smp_mb__(After|before)_atomic only
>> orders RMW operations and test_bit is not an RMW operation. From
>> atomic_bitops.txt:
>>
>>
>> Non-RMW ops:
>>
>>
>>
>>    test_bit()
>>
>> Furthermore from atomic_t.txt:
>>
>> The barriers:
>>
>>
>>
>>    smp_mb__{before,after}_atomic()
>>
>>
>>
>> only apply to the RMW atomic ops and can be used to augment/upgrade the
>>
>> ordering inherent to the op.
>
> Right but the document says it's unordered.  The problem we're trying to
> address here is making sure _either_ we see BTRFS_ROOT_DEAD_RELOC_TREE
> or we see !root->reloc_root.  Which means we don't want the test_bit
> being re-ordered WRT the clear_bit on the other side.  So the other side
> does
>
> reloc_root = NULL;
> smp_mb__before_atomic();
> clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE);

Yes, that's correct.

>
> now say on the other side we get re-ordered and we see reloc_root !=
> NULL and we also see !test_bit(BTRFS_ROOT_DEAD_RELOC_TREE), and now
> we're screwed.

That's not possible. With above mb, there are only several possible
results on the reader side:
A: Reloc_root == PTR, and DEAD_RELOC_TREE : Before NULL assignment
B: Reloc_root == NULL, and DEAD_RELOC_TREE: after NULL assignment
C: Reloc_root == NULL, no DEAD_RELOC_TREE: after clear_bit().

That's what mb() is doing, killing the unwanted case:
D: Reloc_root == PTR, no DEAD_RELOC_TREE.

On the reader side, even with the mb, the test_bit() can happen whatever
they want, mb makes no sense for *single* memory access.

>
> The smp_mb__after_atomic() guarantees that the re-ordering doesn't
> happen, correct?

That smp_mb() has no effect, as it's not defining any extra order, since
there is no extra memory access to start with.

And definitely has nothing to do with reader side, as reader can really
happen whenever they like.


From what I learnt recently, mb is only needed between at least *two*
memory access where the order is really important (to say, kill certain
re-order possibility).

If there are no two critical accesses at both side, then a mb makes no
sense.

Hopes this would help.

Thanks,
Qu

>
> I realize that this is mostly a moot point on real architectures, but
> I'm looking at things like ARM where test_bit() uses the generic asm
> helper, which could definitely be re-ordered as it's nothing special. 
> Thanks,
>
> Josef
>
Qu Wenruo Jan. 9, 2020, 5:54 a.m. UTC | #9
On 2020/1/8 下午11:11, David Sterba wrote:
> On Wed, Jan 08, 2020 at 04:08:41PM +0100, David Sterba wrote:
>>>>> +static bool have_reloc_root(struct btrfs_root *root)
>>>>> +{
>>>>> +    if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
>>>>> +        return false;
>>>>
>>>> You still need a smp_mb__after_atomic() here, test_bit is unordered.
>>>
>>> Nope, that won't do anything, since smp_mb__(After|before)_atomic only
>>> orders RMW operations and test_bit is not an RMW operation. From
>>> atomic_bitops.txt:
>>>
>>>
>>> Non-RMW ops:
>>>
>>>
>>>
>>>   test_bit()
>>>
>>> Furthermore from atomic_t.txt:
>>>
>>> The barriers:
>>>
>>>
>>>
>>>   smp_mb__{before,after}_atomic()
>>>
>>>
>>>
>>> only apply to the RMW atomic ops and can be used to augment/upgrade the
>>>
>>> ordering inherent to the op.
>>
>> The way I read it is more like smp_rmb/smp_wmb, but for bits in this
>> case, so the smp_mb__before/after_atomic was only a syntactic sugar to
>> match that it's atomic bitops. I realize this could have caused some
>> confusion, however I still think that some sort of barrier is needed.
>
> There's an existing pattern used for serializing set/clear of
> BTRFS_ROOT_IN_TRANS_SETUP (record_root_in_trans,
> btrfs_record_root_in_trans).
>
> Once upon a time there were barriers like smp_mb__before_clear_bit but
> they got unified to just smp_mb__before_atomic for all set/clear
> operations, so I believe I was not all wrong with using them.
>
I used to believe the same fairy tail, that mb() works like a flush(),
but we're not living in a fairy tail.

What mb really does is keep certain ordering from happening.
And ordering means, we have at least *2* different memory accesses.


It's not to ensure every reader get the same result, as there is no way
to do that. Read can happen whenever they want.


So before we talk about mb, we first need to know which 2 memory
accesses we're talking about.
If there is even no 2 memory access, then there is no need for mb().

E.g. for the mb implied by spinlock(), it's not to ensure the spinlock
counter reader, but to ensure the memory ordering between the spinlock
counter itself and the memory it's protecting.

So for memory barrier around test_bit(), as long as the compiler is not
doing reordering, we don't need extra mb.

And if the compiler is really doing re-ordering for the
have_reloc_root(), then the compiler is doing something wrong, as that
would makes the test_bit() meaningless.

Thanks,
Qu
David Sterba Jan. 9, 2020, 2:37 p.m. UTC | #10
On Thu, Jan 09, 2020 at 01:54:34PM +0800, Qu Wenruo wrote:
> >> The way I read it is more like smp_rmb/smp_wmb, but for bits in this
> >> case, so the smp_mb__before/after_atomic was only a syntactic sugar to
> >> match that it's atomic bitops. I realize this could have caused some
> >> confusion, however I still think that some sort of barrier is needed.
> >
> > There's an existing pattern used for serializing set/clear of
> > BTRFS_ROOT_IN_TRANS_SETUP (record_root_in_trans,
> > btrfs_record_root_in_trans).
> >
> > Once upon a time there were barriers like smp_mb__before_clear_bit but
> > they got unified to just smp_mb__before_atomic for all set/clear
> > operations, so I believe I was not all wrong with using them.
> >
> I used to believe the same fairy tail, that mb() works like a flush(),
> but we're not living in a fairy tail.

This sounds strange, by flush I was refering to internal CPU mechanism
that makes all temporary changes visible to other cpus, so you're saying
that this does not work as everybody expects?

> What mb really does is keep certain ordering from happening.
> And ordering means, we have at least *2* different memory accesses.

Sorry but I think you're missing some pieces here. There are 2 memory
accesses: set_bit/clear_bit (writer) and test_bit (reader).

The same could be achieved by a plain variable, that it's a bit brings
more confusion about what barrier should be really used.

The pattern we want to use here is pretty standard. Read barrier before
read and write barrier that separates the data change from the indicator
of the change. If you line up the barriers, there's a clear line between
the data changes and the indicator value.

reloc_root = NULL
smp_wmb                           smp_rmb()
                                  test_bit()
                                  ... here
set_bit(DEAD)
                                  ... or here, it'll be always
				  reloc_root == NULL, but it still could
				  be before or after set_bit

You should also distinguish between mb() and smp_mb() (and the rmb/wmb).
mb is a unconditional barrier, used to synchronize access to hardware io
ports, suitable in drivers.

We use smp_mb() because this serializes memory among multipe CPUs, when
one changes memory but stores it to some temporary structures, while
other CPUs don't see the effects. I'm sure you've read about that in the
memory barrier docs.

> It's not to ensure every reader get the same result, as there is no way
> to do that. Read can happen whenever they want.

That is true about the 'whenever', but what we need to guarantee here is
that when the read happens the following condition will have view of the
changes implied by the pairing barrier.

> So before we talk about mb, we first need to know which 2 memory
> accesses we're talking about.
> If there is even no 2 memory access, then there is no need for mb().
> 
> E.g. for the mb implied by spinlock(), it's not to ensure the spinlock
> counter reader, but to ensure the memory ordering between the spinlock
> counter itself and the memory it's protecting.
> 
> So for memory barrier around test_bit(), as long as the compiler is not
> doing reordering, we don't need extra mb.

This is not about compiler.

> And if the compiler is really doing re-ordering for the
> have_reloc_root(), then the compiler is doing something wrong, as that
> would makes the test_bit() meaningless.
Qu Wenruo Jan. 10, 2020, 12:21 a.m. UTC | #11
On 2020/1/9 下午10:37, David Sterba wrote:
> On Thu, Jan 09, 2020 at 01:54:34PM +0800, Qu Wenruo wrote:
>>>> The way I read it is more like smp_rmb/smp_wmb, but for bits in this
>>>> case, so the smp_mb__before/after_atomic was only a syntactic sugar to
>>>> match that it's atomic bitops. I realize this could have caused some
>>>> confusion, however I still think that some sort of barrier is needed.
>>>
>>> There's an existing pattern used for serializing set/clear of
>>> BTRFS_ROOT_IN_TRANS_SETUP (record_root_in_trans,
>>> btrfs_record_root_in_trans).
>>>
>>> Once upon a time there were barriers like smp_mb__before_clear_bit but
>>> they got unified to just smp_mb__before_atomic for all set/clear
>>> operations, so I believe I was not all wrong with using them.
>>>
>> I used to believe the same fairy tail, that mb() works like a flush(),
>> but we're not living in a fairy tail.
>
> This sounds strange, by flush I was refering to internal CPU mechanism
> that makes all temporary changes visible to other cpus, so you're saying
> that this does not work as everybody expects?

Because no matter whether mb ensures that or not, the result is the same.

What would happen if the reader get schedule just before the mb()
command? Reader can still get older value.
That makes no difference if we had mb() or not.

>
>> What mb really does is keep certain ordering from happening.
>> And ordering means, we have at least *2* different memory accesses.
>
> Sorry but I think you're missing some pieces here. There are 2 memory
> accesses: set_bit/clear_bit (writer) and test_bit (reader).
>
> The same could be achieved by a plain variable, that it's a bit brings
> more confusion about what barrier should be really used.
>
> The pattern we want to use here is pretty standard. Read barrier before
> read and write barrier that separates the data change from the indicator
> of the change. If you line up the barriers, there's a clear line between
> the data changes and the indicator value.

Again, test_bit() can happen whenever they like, and smp_rmb() before
test_bit() makes no sense.

test_bit() can happen before reloc_root = NULL assign. after reloc_root
= NULL assign but before set_bit(), or after set_bit().

That smp_wmb() makes sure set_bit() won't happen before reloc_root, than
that's enough.
BTW, smp_mb__before_atomic() should be full mb, not just wmb or rmb.

>
> reloc_root = NULL
> smp_wmb                           smp_rmb()
>                                   test_bit()
>                                   ... here
> set_bit(DEAD)
>                                   ... or here, it'll be always
> 				  reloc_root == NULL, but it still could
> 				  be before or after set_bit
>
> You should also distinguish between mb() and smp_mb() (and the rmb/wmb).
> mb is a unconditional barrier, used to synchronize access to hardware io
> ports, suitable in drivers.
>
> We use smp_mb() because this serializes memory among multipe CPUs, when
> one changes memory but stores it to some temporary structures, while
> other CPUs don't see the effects. I'm sure you've read about that in the
> memory barrier docs.

Yes, but schedule can put that smp_rmb(); test_bit() line where ever
they want. So smp_rmb(); test_bit() can still get all the 3 difference
timing. It's that smp_mb__before_atomic() killing the 4th case, not the
smp_rmb().

>
>> It's not to ensure every reader get the same result, as there is no way
>> to do that. Read can happen whenever they want.
>
> That is true about the 'whenever', but what we need to guarantee here is
> that when the read happens the following condition will have view of the
> changes implied by the pairing barrier.

Still, if reader still gets the temporary value, it's the same as random
schedule timing.
What we need to ensure is the order, which is solely ensured by that
smb_mb__before_atomic().

Thanks,
Qu

>
>> So before we talk about mb, we first need to know which 2 memory
>> accesses we're talking about.
>> If there is even no 2 memory access, then there is no need for mb().
>>
>> E.g. for the mb implied by spinlock(), it's not to ensure the spinlock
>> counter reader, but to ensure the memory ordering between the spinlock
>> counter itself and the memory it's protecting.
>>
>> So for memory barrier around test_bit(), as long as the compiler is not
>> doing reordering, we don't need extra mb.
>
> This is not about compiler.
>
>> And if the compiler is really doing re-ordering for the
>> have_reloc_root(), then the compiler is doing something wrong, as that
>> would makes the test_bit() meaningless.
Qu Wenruo Jan. 10, 2020, 12:58 a.m. UTC | #12
On 2020/1/10 上午8:21, Qu Wenruo wrote:
>
>
> On 2020/1/9 下午10:37, David Sterba wrote:
>> On Thu, Jan 09, 2020 at 01:54:34PM +0800, Qu Wenruo wrote:
>>>>> The way I read it is more like smp_rmb/smp_wmb, but for bits in this
>>>>> case, so the smp_mb__before/after_atomic was only a syntactic sugar to
>>>>> match that it's atomic bitops. I realize this could have caused some
>>>>> confusion, however I still think that some sort of barrier is needed.
>>>>
>>>> There's an existing pattern used for serializing set/clear of
>>>> BTRFS_ROOT_IN_TRANS_SETUP (record_root_in_trans,
>>>> btrfs_record_root_in_trans).
>>>>
>>>> Once upon a time there were barriers like smp_mb__before_clear_bit but
>>>> they got unified to just smp_mb__before_atomic for all set/clear
>>>> operations, so I believe I was not all wrong with using them.
>>>>
>>> I used to believe the same fairy tail, that mb() works like a flush(),
>>> but we're not living in a fairy tail.
>>
>> This sounds strange, by flush I was refering to internal CPU mechanism
>> that makes all temporary changes visible to other cpus, so you're saying
>> that this does not work as everybody expects?
>
> Because no matter whether mb ensures that or not, the result is the same.
>
> What would happen if the reader get schedule just before the mb()
> command? Reader can still get older value.
> That makes no difference if we had mb() or not.
>
>>
>>> What mb really does is keep certain ordering from happening.
>>> And ordering means, we have at least *2* different memory accesses.
>>
>> Sorry but I think you're missing some pieces here. There are 2 memory
>> accesses: set_bit/clear_bit (writer) and test_bit (reader).
>>
>> The same could be achieved by a plain variable, that it's a bit brings
>> more confusion about what barrier should be really used.
>>
>> The pattern we want to use here is pretty standard. Read barrier before
>> read and write barrier that separates the data change from the indicator
>> of the change. If you line up the barriers, there's a clear line between
>> the data changes and the indicator value.
>
> Again, test_bit() can happen whenever they like, and smp_rmb() before
> test_bit() makes no sense.
>
> test_bit() can happen before reloc_root = NULL assign. after reloc_root
> = NULL assign but before set_bit(), or after set_bit().
>
> That smp_wmb() makes sure set_bit() won't happen before reloc_root, than
> that's enough.
> BTW, smp_mb__before_atomic() should be full mb, not just wmb or rmb.
>
>>
>> reloc_root = NULL
>> smp_wmb                           smp_rmb()
>>                                   test_bit()
>>                                   ... here
>> set_bit(DEAD)
>>                                   ... or here, it'll be always
>> 				  reloc_root == NULL, but it still could
>> 				  be before or after set_bit
>>
>> You should also distinguish between mb() and smp_mb() (and the rmb/wmb).
>> mb is a unconditional barrier, used to synchronize access to hardware io
>> ports, suitable in drivers.
>>
>> We use smp_mb() because this serializes memory among multipe CPUs, when
>> one changes memory but stores it to some temporary structures, while
>> other CPUs don't see the effects. I'm sure you've read about that in the
>> memory barrier docs.

I guess the main difference between us is the effect of "per-cpu
viewable temporary value".

It looks like your point is, without rmb() we can't see consistent
values the writer sees.

But my point is, even we can only see a temporary value, the
__before_atomic() mb at the writer side, ensures only 3 possible
temporary values combination can be seen.
(PTR, DEAD), (NULL, DEAD), (NULL, 0).

The killed (PTR, 0) combination is killed by that writer side mb.
Thus no need for the reader side mb before test_bit().

That's why I insist on the "test_bit() can happen whenever they like"
point, as that has the same effect as schedule.

Thanks,
Qu

>
> Yes, but schedule can put that smp_rmb(); test_bit() line where ever
> they want. So smp_rmb(); test_bit() can still get all the 3 difference
> timing. It's that smp_mb__before_atomic() killing the 4th case, not the
> smp_rmb().
>
>>
>>> It's not to ensure every reader get the same result, as there is no way
>>> to do that. Read can happen whenever they want.
>>
>> That is true about the 'whenever', but what we need to guarantee here is
>> that when the read happens the following condition will have view of the
>> changes implied by the pairing barrier.
>
> Still, if reader still gets the temporary value, it's the same as random
> schedule timing.
> What we need to ensure is the order, which is solely ensured by that
> smb_mb__before_atomic().
>
> Thanks,
> Qu
>
>>
>>> So before we talk about mb, we first need to know which 2 memory
>>> accesses we're talking about.
>>> If there is even no 2 memory access, then there is no need for mb().
>>>
>>> E.g. for the mb implied by spinlock(), it's not to ensure the spinlock
>>> counter reader, but to ensure the memory ordering between the spinlock
>>> counter itself and the memory it's protecting.
>>>
>>> So for memory barrier around test_bit(), as long as the compiler is not
>>> doing reordering, we don't need extra mb.
>>
>> This is not about compiler.
>>
>>> And if the compiler is really doing re-ordering for the
>>> have_reloc_root(), then the compiler is doing something wrong, as that
>>> would makes the test_bit() meaningless.
Qu Wenruo Jan. 13, 2020, 4:41 a.m. UTC | #13
On 2020/1/10 上午8:58, Qu Wenruo wrote:
>
>
> On 2020/1/10 上午8:21, Qu Wenruo wrote:
>>
>>
>> On 2020/1/9 下午10:37, David Sterba wrote:
>>> On Thu, Jan 09, 2020 at 01:54:34PM +0800, Qu Wenruo wrote:
>>>>>> The way I read it is more like smp_rmb/smp_wmb, but for bits in this
>>>>>> case, so the smp_mb__before/after_atomic was only a syntactic sugar to
>>>>>> match that it's atomic bitops. I realize this could have caused some
>>>>>> confusion, however I still think that some sort of barrier is needed.
>>>>>
>>>>> There's an existing pattern used for serializing set/clear of
>>>>> BTRFS_ROOT_IN_TRANS_SETUP (record_root_in_trans,
>>>>> btrfs_record_root_in_trans).
>>>>>
>>>>> Once upon a time there were barriers like smp_mb__before_clear_bit but
>>>>> they got unified to just smp_mb__before_atomic for all set/clear
>>>>> operations, so I believe I was not all wrong with using them.
>>>>>
>>>> I used to believe the same fairy tail, that mb() works like a flush(),
>>>> but we're not living in a fairy tail.
>>>
>>> This sounds strange, by flush I was refering to internal CPU mechanism
>>> that makes all temporary changes visible to other cpus, so you're saying
>>> that this does not work as everybody expects?
>>
>> Because no matter whether mb ensures that or not, the result is the same.
>>
>> What would happen if the reader get schedule just before the mb()
>> command? Reader can still get older value.
>> That makes no difference if we had mb() or not.
>>
>>>
>>>> What mb really does is keep certain ordering from happening.
>>>> And ordering means, we have at least *2* different memory accesses.
>>>
>>> Sorry but I think you're missing some pieces here. There are 2 memory
>>> accesses: set_bit/clear_bit (writer) and test_bit (reader).
>>>
>>> The same could be achieved by a plain variable, that it's a bit brings
>>> more confusion about what barrier should be really used.
>>>
>>> The pattern we want to use here is pretty standard. Read barrier before
>>> read and write barrier that separates the data change from the indicator
>>> of the change. If you line up the barriers, there's a clear line between
>>> the data changes and the indicator value.
>>
>> Again, test_bit() can happen whenever they like, and smp_rmb() before
>> test_bit() makes no sense.
>>
>> test_bit() can happen before reloc_root = NULL assign. after reloc_root
>> = NULL assign but before set_bit(), or after set_bit().
>>
>> That smp_wmb() makes sure set_bit() won't happen before reloc_root, than
>> that's enough.
>> BTW, smp_mb__before_atomic() should be full mb, not just wmb or rmb.
>>
>>>
>>> reloc_root = NULL
>>> smp_wmb                           smp_rmb()
>>>                                   test_bit()
>>>                                   ... here
>>> set_bit(DEAD)
>>>                                   ... or here, it'll be always
>>> 				  reloc_root == NULL, but it still could
>>> 				  be before or after set_bit
>>>
>>> You should also distinguish between mb() and smp_mb() (and the rmb/wmb).
>>> mb is a unconditional barrier, used to synchronize access to hardware io
>>> ports, suitable in drivers.
>>>
>>> We use smp_mb() because this serializes memory among multipe CPUs, when
>>> one changes memory but stores it to some temporary structures, while
>>> other CPUs don't see the effects. I'm sure you've read about that in the
>>> memory barrier docs.
>
> I guess the main difference between us is the effect of "per-cpu
> viewable temporary value".
>
> It looks like your point is, without rmb() we can't see consistent
> values the writer sees.
>
> But my point is, even we can only see a temporary value, the
> __before_atomic() mb at the writer side, ensures only 3 possible
> temporary values combination can be seen.
> (PTR, DEAD), (NULL, DEAD), (NULL, 0).
>
> The killed (PTR, 0) combination is killed by that writer side mb.
> Thus no need for the reader side mb before test_bit().
>
> That's why I insist on the "test_bit() can happen whenever they like"
> point, as that has the same effect as schedule.
>
> Thanks,
> Qu

Can we push the fix to upstream? I hope it to be fixed in late rc of v5.5.

Thanks,
Qu

>
>>
>> Yes, but schedule can put that smp_rmb(); test_bit() line where ever
>> they want. So smp_rmb(); test_bit() can still get all the 3 difference
>> timing. It's that smp_mb__before_atomic() killing the 4th case, not the
>> smp_rmb().
>>
>>>
>>>> It's not to ensure every reader get the same result, as there is no way
>>>> to do that. Read can happen whenever they want.
>>>
>>> That is true about the 'whenever', but what we need to guarantee here is
>>> that when the read happens the following condition will have view of the
>>> changes implied by the pairing barrier.
>>
>> Still, if reader still gets the temporary value, it's the same as random
>> schedule timing.
>> What we need to ensure is the order, which is solely ensured by that
>> smb_mb__before_atomic().
>>
>> Thanks,
>> Qu
>>
>>>
>>>> So before we talk about mb, we first need to know which 2 memory
>>>> accesses we're talking about.
>>>> If there is even no 2 memory access, then there is no need for mb().
>>>>
>>>> E.g. for the mb implied by spinlock(), it's not to ensure the spinlock
>>>> counter reader, but to ensure the memory ordering between the spinlock
>>>> counter itself and the memory it's protecting.
>>>>
>>>> So for memory barrier around test_bit(), as long as the compiler is not
>>>> doing reordering, we don't need extra mb.
>>>
>>> This is not about compiler.
>>>
>>>> And if the compiler is really doing re-ordering for the
>>>> have_reloc_root(), then the compiler is doing something wrong, as that
>>>> would makes the test_bit() meaningless.
David Sterba Jan. 13, 2020, 5:19 p.m. UTC | #14
On Mon, Jan 13, 2020 at 12:41:45PM +0800, Qu Wenruo wrote:
> On 2020/1/10 上午8:58, Qu Wenruo wrote:
> > On 2020/1/10 上午8:21, Qu Wenruo wrote:
> >> On 2020/1/9 下午10:37, David Sterba wrote:
> >>> On Thu, Jan 09, 2020 at 01:54:34PM +0800, Qu Wenruo wrote:
> >>> We use smp_mb() because this serializes memory among multipe CPUs, when
> >>> one changes memory but stores it to some temporary structures, while
> >>> other CPUs don't see the effects. I'm sure you've read about that in the
> >>> memory barrier docs.
> >
> > I guess the main difference between us is the effect of "per-cpu
> > viewable temporary value".
> >
> > It looks like your point is, without rmb() we can't see consistent
> > values the writer sees.
> >
> > But my point is, even we can only see a temporary value, the
> > __before_atomic() mb at the writer side, ensures only 3 possible
> > temporary values combination can be seen.
> > (PTR, DEAD), (NULL, DEAD), (NULL, 0).
> >
> > The killed (PTR, 0) combination is killed by that writer side mb.
> > Thus no need for the reader side mb before test_bit().
> >
> > That's why I insist on the "test_bit() can happen whenever they like"
> > point, as that has the same effect as schedule.
> 
> Can we push the fix to upstream? I hope it to be fixed in late rc of v5.5.

Yes the plan is to push it to 5.5-rc so we can get the stable backports.

About the barriers, we seem to have a conclusion to use smp_rmb/smp_wmb
and not the smp_mb__before/after_atomic. Zygo also tested the patch and
reported it's ok so I don't want to hold it back.

Understanding the memory barriers takes time to digest (which basically
means to develop a cpu simulator in ones head with speculative writes
and execution and then keep sanity when reasoning about them).
Nikolay Borisov Jan. 13, 2020, 7:15 p.m. UTC | #15
On 13.01.20 г. 19:19 ч., David Sterba wrote:
> On Mon, Jan 13, 2020 at 12:41:45PM +0800, Qu Wenruo wrote:
>> On 2020/1/10 上午8:58, Qu Wenruo wrote:
>>> On 2020/1/10 上午8:21, Qu Wenruo wrote:
>>>> On 2020/1/9 下午10:37, David Sterba wrote:
>>>>> On Thu, Jan 09, 2020 at 01:54:34PM +0800, Qu Wenruo wrote:
>>>>> We use smp_mb() because this serializes memory among multipe CPUs, when
>>>>> one changes memory but stores it to some temporary structures, while
>>>>> other CPUs don't see the effects. I'm sure you've read about that in the
>>>>> memory barrier docs.
>>>
>>> I guess the main difference between us is the effect of "per-cpu
>>> viewable temporary value".
>>>
>>> It looks like your point is, without rmb() we can't see consistent
>>> values the writer sees.
>>>
>>> But my point is, even we can only see a temporary value, the
>>> __before_atomic() mb at the writer side, ensures only 3 possible
>>> temporary values combination can be seen.
>>> (PTR, DEAD), (NULL, DEAD), (NULL, 0).
>>>
>>> The killed (PTR, 0) combination is killed by that writer side mb.
>>> Thus no need for the reader side mb before test_bit().
>>>
>>> That's why I insist on the "test_bit() can happen whenever they like"
>>> point, as that has the same effect as schedule.
>>
>> Can we push the fix to upstream? I hope it to be fixed in late rc of v5.5.
> 
> Yes the plan is to push it to 5.5-rc so we can get the stable backports.
> 
> About the barriers, we seem to have a conclusion to use smp_rmb/smp_wmb
> and not the smp_mb__before/after_atomic. Zygo also tested the patch and
> reported it's ok so I don't want to hold it back.
> 
> Understanding the memory barriers takes time to digest (which basically
> means to develop a cpu simulator in ones head with speculative writes
> and execution and then keep sanity when reasoning about them).

Or simply using the memory model tool and just write a "simple" litmus
test to see what's possible and what not in the given situation. (And
no, I don't think it's that trivial to do that either :) )

>

Patch
diff mbox series

diff --git a/fs/btrfs/relocation.c b/fs/btrfs/relocation.c
index d897a8e5e430..17a2484f76a5 100644
--- a/fs/btrfs/relocation.c
+++ b/fs/btrfs/relocation.c
@@ -517,6 +517,22 @@  static int update_backref_cache(struct btrfs_trans_handle *trans,
 	return 1;
 }
 
+/*
+ * Check if this subvolume tree has valid reloc(*) tree.
+ *
+ * *: Reloc tree after swap is considered dead, thus not considered as valid.
+ *    This is enough for most callers, as they don't distinguish dead reloc
+ *    root from no reloc root.
+ *    But should_ignore_root() below is a special case.
+ */
+static bool have_reloc_root(struct btrfs_root *root)
+{
+	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
+		return false;
+	if (!root->reloc_root)
+		return false;
+	return true;
+}
 
 static int should_ignore_root(struct btrfs_root *root)
 {
@@ -525,6 +541,10 @@  static int should_ignore_root(struct btrfs_root *root)
 	if (!test_bit(BTRFS_ROOT_REF_COWS, &root->state))
 		return 0;
 
+	/* This root has been merged with its reloc tree, we can ignore it */
+	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state))
+		return 1;
+
 	reloc_root = root->reloc_root;
 	if (!reloc_root)
 		return 0;
@@ -1478,8 +1498,7 @@  int btrfs_update_reloc_root(struct btrfs_trans_handle *trans,
 	struct btrfs_root_item *root_item;
 	int ret;
 
-	if (test_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state) ||
-	    !root->reloc_root)
+	if (!have_reloc_root(root))
 		goto out;
 
 	reloc_root = root->reloc_root;
@@ -2201,6 +2220,11 @@  static int clean_dirty_subvols(struct reloc_control *rc)
 				if (ret2 < 0 && !ret)
 					ret = ret2;
 			}
+			/*
+			 * Need barrier to ensure clear_bit() only happens after
+			 * root->reloc_root = NULL.
+			 */
+			smp_mb__before_atomic();
 			clear_bit(BTRFS_ROOT_DEAD_RELOC_TREE, &root->state);
 			btrfs_put_fs_root(root);
 		} else {
@@ -4717,7 +4741,7 @@  void btrfs_reloc_pre_snapshot(struct btrfs_pending_snapshot *pending,
 	struct btrfs_root *root = pending->root;
 	struct reloc_control *rc = root->fs_info->reloc_ctl;
 
-	if (!root->reloc_root || !rc)
+	if (!rc || !have_reloc_root(root))
 		return;
 
 	if (!rc->merge_reloc_tree)
@@ -4751,7 +4775,7 @@  int btrfs_reloc_post_snapshot(struct btrfs_trans_handle *trans,
 	struct reloc_control *rc = root->fs_info->reloc_ctl;
 	int ret;
 
-	if (!root->reloc_root || !rc)
+	if (!rc || !have_reloc_root(root))
 		return 0;
 
 	rc = root->fs_info->reloc_ctl;