[v2] btrfs: Turn delayed_nodes_tree into an XArray

Message ID 20220412123546.30478-1-gniebler@suse.com (mailing list archive)
State New, archived

Commit Message

Gabriel Niebler April 12, 2022, 12:35 p.m. UTC
… in the btrfs_root struct. Also adjust all usages of this object to use
the XArray API.

Signed-off-by: Gabriel Niebler <gniebler@suse.com>
---

Notes:
    XArrays offer a somewhat nicer API than radix trees and were implemented
    specifically to replace the latter. They utilize the exact same underlying
    data structure, but their API is notionally easier to use, as it provides
    array semantics to the user of radix trees. The higher level API also
    takes care of locking, adding even more ease of use.
    
    The btrfs code uses radix trees in several places. This patch only
    converts the `delayed_nodes_tree` member of the btrfs_root struct.
    
    It survived a complete fstests run.

 fs/btrfs/ctree.h         |  4 ++--
 fs/btrfs/delayed-inode.c | 48 ++++++++++++++++++----------------------
 fs/btrfs/disk-io.c       |  2 +-
 fs/btrfs/inode.c         |  2 +-
 4 files changed, 26 insertions(+), 30 deletions(-)

Comments

Nikolay Borisov April 13, 2022, 1:34 p.m. UTC | #1
On 12.04.22 at 15:35, Gabriel Niebler wrote:
> … in the btrfs_root struct. Also adjust all usages of this object to use
> the XArray API.
> 
> Signed-off-by: Gabriel Niebler <gniebler@suse.com>
> ---
> 
> Notes:
>      XArrays offer a somewhat nicer API than radix trees and were implemented
>      specifically to replace the latter. They utilize the exact same underlying
>      data structure, but their API is notionally easier to use, as it provides
>      array semantics to the user of radix trees. The higher level API also
>      takes care of locking, adding even more ease of use.
>      
>      The btrfs code uses radix trees in several places. This patch only
>      converts the `delayed_nodes_tree` member of the btrfs_root struct.
>      
>      It survived a complete fstests run.
> 
>   fs/btrfs/ctree.h         |  4 ++--
>   fs/btrfs/delayed-inode.c | 48 ++++++++++++++++++----------------------
>   fs/btrfs/disk-io.c       |  2 +-
>   fs/btrfs/inode.c         |  2 +-
>   4 files changed, 26 insertions(+), 30 deletions(-)
> 
> diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
> index b7631b88426e..9377dded9679 100644
> --- a/fs/btrfs/ctree.h
> +++ b/fs/btrfs/ctree.h
> @@ -1224,10 +1224,10 @@ struct btrfs_root {
>   	struct rb_root inode_tree;
>   
>   	/*
> -	 * radix tree that keeps track of delayed nodes of every inode,
> +	 * XArray that keeps track of delayed nodes of every inode,
>   	 * protected by inode_lock
>   	 */
> -	struct radix_tree_root delayed_nodes_tree;
> +	struct xarray delayed_nodes;
>   	/*
>   	 * right now this just gets used so that a root has its own devid
>   	 * for stat.  It may be used for more later
> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
> index 748bf6b0d860..89e1c39c2aef 100644
> --- a/fs/btrfs/delayed-inode.c
> +++ b/fs/btrfs/delayed-inode.c
> @@ -78,7 +78,7 @@ static struct btrfs_delayed_node *btrfs_get_delayed_node(
>   	}
>   
>   	spin_lock(&root->inode_lock);
> -	node = radix_tree_lookup(&root->delayed_nodes_tree, ino);
> +	node = xa_load(&root->delayed_nodes, (unsigned long)ino);

Isn't this problematic? You'd be truncating the ino whenever the number
is larger than unsigned long can hold. On architectures which use long
as their 64-bit type this cast is a no-op, since u64 is defined as
unsigned long (via typedef unsigned long __u64; typedef __u64 u64). But
on architectures which use long long for their 64-bit type the cast
would truncate the inode number, so I think the cast is redundant at
best. Either way the code is broken for u64 on architectures which use
long long, and as the author of the XArray put it just now:

<willy> yup.  It's not a great data structure for storing u64s in

However, looking at the existing radix_tree_insert, the index there is
also an unsigned long. TL;DR: remove the cast.

>   
>   	if (node) {
>   		if (btrfs_inode->delayed_node) {

<snip>


>   	return node;
>   }

<snip>

> @@ -1870,29 +1863,32 @@ void btrfs_kill_delayed_inode_items(struct btrfs_inode *inode)
>   
>   void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
>   {
> -	u64 inode_id = 0;
> +	unsigned long index;
> +	struct btrfs_delayed_node *delayed_node;
>   	struct btrfs_delayed_node *delayed_nodes[8];
>   	int i, n;
>   
>   	while (1) {
>   		spin_lock(&root->inode_lock);
> -		n = radix_tree_gang_lookup(&root->delayed_nodes_tree,
> -					   (void **)delayed_nodes, inode_id,
> -					   ARRAY_SIZE(delayed_nodes));
> -		if (!n) {
> +		if (xa_empty(&root->delayed_nodes)) {
>   			spin_unlock(&root->inode_lock);
>   			break;
>   		}
>   
> -		inode_id = delayed_nodes[n - 1]->inode_id + 1;
> -		for (i = 0; i < n; i++) {
> +		n = 0;
> +		xa_for_each_start(&root->delayed_nodes, index,
> +				  delayed_node, index) {

Here index is used not only as the index of the current entry but also
as the "start from this index" argument. You never initialize it,
though, so index holds garbage on the first iteration. If you want to
start from index 0, initialize it where it is defined.
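A minimal sketch of the suggested fix (untested, just illustrating the
initialization against the hunk quoted above):

```diff
-	unsigned long index;
+	unsigned long index = 0;
```

With the cursor starting at 0, the first pass of the while loop iterates
from the beginning of the XArray, and subsequent passes resume from
wherever the previous batch left off.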

>   			/*
>   			 * Don't increase refs in case the node is dead and
>   			 * about to be removed from the tree in the loop below
>   			 */
> -			if (!refcount_inc_not_zero(&delayed_nodes[i]->refs))
> -				delayed_nodes[i] = NULL;
> +			if (!refcount_inc_not_zero(&delayed_node->refs))
> +				delayed_nodes[n] = NULL;

This is broken. The original code queried up to 8 delayed nodes at a
time and then, for every node written into the delayed_nodes array,
incremented its refcount so that the nodes could later be destroyed
outside of spin_unlock. Your code never writes the node into the local
delayed_nodes array, so the for() loop below never frees anything.

> +			n++;
> +			if (n >= ARRAY_SIZE(delayed_nodes))
> +				break;
>   		}
> +		index++;
>   		spin_unlock(&root->inode_lock);
>   
>   		for (i = 0; i < n; i++) {

<snip>
David Sterba April 13, 2022, 2:14 p.m. UTC | #2
On Tue, Apr 12, 2022 at 02:35:46PM +0200, Gabriel Niebler wrote:
> … in the btrfs_root struct. Also adjust all usages of this object to use
> the XArray API.
> 
> Signed-off-by: Gabriel Niebler <gniebler@suse.com>
> ---
> 
> Notes:
>     XArrays offer a somewhat nicer API than radix trees and were implemented
>     specifically to replace the latter. They utilize the exact same underlying
>     data structure, but their API is notionally easier to use, as it provides
>     array semantics to the user of radix trees. The higher level API also
>     takes care of locking, adding even more ease of use.
>     
>     The btrfs code uses radix trees in several places. This patch only
>     converts the `delayed_nodes_tree` member of the btrfs_root struct.

You've put this under the --- marker, which means it's not supposed to
be in the changelog (as we do for various patch revision comments), but
then there would be basically no useful changelog left. That the
conversion is done is clear; maybe there are some useful notes or
comments on what changed and how, e.g. the locking.
Gabriel Niebler April 14, 2022, 9:26 a.m. UTC | #3
On 13.04.22 at 15:34, Nikolay Borisov wrote:
> On 12.04.22 at 15:35, Gabriel Niebler wrote:
>> […]
>> diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
>> index 748bf6b0d860..89e1c39c2aef 100644
>> --- a/fs/btrfs/delayed-inode.c
>> +++ b/fs/btrfs/delayed-inode.c
>> @@ -78,7 +78,7 @@ static struct btrfs_delayed_node 
>> *btrfs_get_delayed_node(
>>       }
>>       spin_lock(&root->inode_lock);
>> -    node = radix_tree_lookup(&root->delayed_nodes_tree, ino);
>> +    node = xa_load(&root->delayed_nodes, (unsigned long)ino);
> 
> Isn't this problematic? You'd be truncating the ino whenever the number
> is larger than unsigned long can hold. On architectures which use long
> as their 64-bit type this cast is a no-op, since u64 is defined as
> unsigned long (via typedef unsigned long __u64; typedef __u64 u64). But
> on architectures which use long long for their 64-bit type the cast
> would truncate the inode number, so I think the cast is redundant at
> best. Either way the code is broken for u64 on architectures which use
> long long, and as the author of the XArray put it just now:
> 
> <willy> yup.  It's not a great data structure for storing u64s in
> 
> However, looking at the existing radix_tree_insert, the index there is
> also an unsigned long. TL;DR: remove the cast.

Right. This is something I wasn't too sure about myself, and I lacked
the deeper understanding to make a better-informed decision here, so
thank you for your insight. I will remove the cast.

In particular, I had missed the fact that radix trees already have this
problem (even though it's kind of clear that they must, as it's the same
data structure under the hood; I just didn't look closely enough at the
functions to notice that they take an unsigned long in the same places
as the XArray equivalents).

This will also play a role in upcoming patches, where I often did the
same cast; now I know that I don't need it. If it worked before, it
will work after.

>> @@ -1870,29 +1863,32 @@ void btrfs_kill_delayed_inode_items(struct 
>> btrfs_inode *inode)
>>   void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
>>   {
>> -    u64 inode_id = 0;
>> +    unsigned long index;
>> +    struct btrfs_delayed_node *delayed_node;
>>       struct btrfs_delayed_node *delayed_nodes[8];
>>       int i, n;
>>       while (1) {
>>           spin_lock(&root->inode_lock);
>> -        n = radix_tree_gang_lookup(&root->delayed_nodes_tree,
>> -                       (void **)delayed_nodes, inode_id,
>> -                       ARRAY_SIZE(delayed_nodes));
>> -        if (!n) {
>> +        if (xa_empty(&root->delayed_nodes)) {
>>               spin_unlock(&root->inode_lock);
>>               break;
>>           }
>> -        inode_id = delayed_nodes[n - 1]->inode_id + 1;
>> -        for (i = 0; i < n; i++) {
>> +        n = 0;
>> +        xa_for_each_start(&root->delayed_nodes, index,
>> +                  delayed_node, index) {
> 
> Here index is used not only as the index of the current entry but also
> as the "start from this index" argument. You never initialize it,
> though, so index holds garbage on the first iteration. If you want to
> start from index 0, initialize it where it is defined.

Whoops, what an oversight. Good catch! Thanks for pointing this out! 
Will fix.

>>               /*
>>                * Don't increase refs in case the node is dead and
>>                * about to be removed from the tree in the loop below
>>                */
>> -            if (!refcount_inc_not_zero(&delayed_nodes[i]->refs))
>> -                delayed_nodes[i] = NULL;
>> +            if (!refcount_inc_not_zero(&delayed_node->refs))
>> +                delayed_nodes[n] = NULL;
> 
> This is broken. The original code queried up to 8 delayed nodes at a
> time and then, for every node written into the delayed_nodes array,
> incremented its refcount so that the nodes could later be destroyed
> outside of spin_unlock. Your code never writes the node into the local
> delayed_nodes array, so the for() loop below never frees anything.

You're right, of course. All that's missing is a
             delayed_nodes[n] = delayed_node;
before the `if`, but without it the loop doesn't work. I don't know how
I missed that one. Again, thanks for noticing! Will fix.
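As a sketch, the fix described above would land in
btrfs_kill_all_delayed_nodes() like this (untested, shown against the
hunk quoted earlier in the thread):

```diff
 		xa_for_each_start(&root->delayed_nodes, index,
 				  delayed_node, index) {
+			delayed_nodes[n] = delayed_node;
 			/*
 			 * Don't increase refs in case the node is dead and
 			 * about to be removed from the tree in the loop below
 			 */
 			if (!refcount_inc_not_zero(&delayed_node->refs))
 				delayed_nodes[n] = NULL;
```

Storing the entry before the refcount check preserves the original
semantics: live nodes get a reference and a slot in the batch array,
dead ones are overwritten with NULL and skipped by the loop that runs
after spin_unlock.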
Gabriel Niebler April 14, 2022, 9:39 a.m. UTC | #4
On 13.04.22 at 16:14, David Sterba wrote:
> On Tue, Apr 12, 2022 at 02:35:46PM +0200, Gabriel Niebler wrote:
>> [...]
>> ---
>>
>> Notes:
>>      XArrays offer a somewhat nicer API than radix trees and were implemented
>>      specifically to replace the latter. They utilize the exact same underlying
>>      data structure, but their API is notionally easier to use, as it provides
>>      array semantics to the user of radix trees. The higher level API also
>>      takes care of locking, adding even more ease of use.
>>      
>>      The btrfs code uses radix trees in several places. This patch only
>>      converts the `delayed_nodes_tree` member of the btrfs_root struct.
> 
> You've put this under the --- marker which means this is not supposed to
> be in the changelog (as we do for various patch revision comments) [...]

I did, because I thought that my general waffling about XArrays is not
something we need to have in the kernel changelog forever, however...

> [...] but
> then there would be basically no useful changelog left. [...]

That's a good point. I kind of forgot that a good commit message is 
supposed to say *why* something was done and not just *what* was done, 
as that's obvious from the diff.

> [...] That the
> conversion is done is clear, maybe there are some useful notes or
> comments what changed and how, eg. the locking.

I'll try to come up with something more satisfying, yet not too verbose.

Patch

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index b7631b88426e..9377dded9679 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -1224,10 +1224,10 @@  struct btrfs_root {
 	struct rb_root inode_tree;
 
 	/*
-	 * radix tree that keeps track of delayed nodes of every inode,
+	 * XArray that keeps track of delayed nodes of every inode,
 	 * protected by inode_lock
 	 */
-	struct radix_tree_root delayed_nodes_tree;
+	struct xarray delayed_nodes;
 	/*
 	 * right now this just gets used so that a root has its own devid
 	 * for stat.  It may be used for more later
diff --git a/fs/btrfs/delayed-inode.c b/fs/btrfs/delayed-inode.c
index 748bf6b0d860..89e1c39c2aef 100644
--- a/fs/btrfs/delayed-inode.c
+++ b/fs/btrfs/delayed-inode.c
@@ -78,7 +78,7 @@  static struct btrfs_delayed_node *btrfs_get_delayed_node(
 	}
 
 	spin_lock(&root->inode_lock);
-	node = radix_tree_lookup(&root->delayed_nodes_tree, ino);
+	node = xa_load(&root->delayed_nodes, (unsigned long)ino);
 
 	if (node) {
 		if (btrfs_inode->delayed_node) {
@@ -90,9 +90,9 @@  static struct btrfs_delayed_node *btrfs_get_delayed_node(
 
 		/*
 		 * It's possible that we're racing into the middle of removing
-		 * this node from the radix tree.  In this case, the refcount
+		 * this node from the XArray.  In this case, the refcount
 		 * was zero and it should never go back to one.  Just return
-		 * NULL like it was never in the radix at all; our release
+		 * NULL like it was never in the XArray at all; our release
 		 * function is in the process of removing it.
 		 *
 		 * Some implementations of refcount_inc refuse to bump the
@@ -100,7 +100,7 @@  static struct btrfs_delayed_node *btrfs_get_delayed_node(
 		 * here, refcount_inc() may decide to just WARN_ONCE() instead
 		 * of actually bumping the refcount.
 		 *
-		 * If this node is properly in the radix, we want to bump the
+		 * If this node is properly in the XArray, we want to bump the
 		 * refcount twice, once for the inode and once for this get
 		 * operation.
 		 */
@@ -141,23 +141,17 @@  static struct btrfs_delayed_node *btrfs_get_or_create_delayed_node(
 	/* cached in the btrfs inode and can be accessed */
 	refcount_set(&node->refs, 2);
 
-	ret = radix_tree_preload(GFP_NOFS);
-	if (ret) {
-		kmem_cache_free(delayed_node_cache, node);
-		return ERR_PTR(ret);
-	}
-
 	spin_lock(&root->inode_lock);
-	ret = radix_tree_insert(&root->delayed_nodes_tree, ino, node);
-	if (ret == -EEXIST) {
+	ret = xa_insert(&root->delayed_nodes, ino, node, GFP_NOFS);
+	if (ret) {
 		spin_unlock(&root->inode_lock);
 		kmem_cache_free(delayed_node_cache, node);
-		radix_tree_preload_end();
-		goto again;
+		if (ret == -EBUSY)
+			goto again;
+		return ERR_PTR(ret);
 	}
 	btrfs_inode->delayed_node = node;
 	spin_unlock(&root->inode_lock);
-	radix_tree_preload_end();
 
 	return node;
 }
@@ -276,8 +270,7 @@  static void __btrfs_release_delayed_node(
 		 * back up.  We can delete it now.
 		 */
 		ASSERT(refcount_read(&delayed_node->refs) == 0);
-		radix_tree_delete(&root->delayed_nodes_tree,
-				  delayed_node->inode_id);
+		xa_erase(&root->delayed_nodes, delayed_node->inode_id);
 		spin_unlock(&root->inode_lock);
 		kmem_cache_free(delayed_node_cache, delayed_node);
 	}
@@ -1870,29 +1863,32 @@  void btrfs_kill_delayed_inode_items(struct btrfs_inode *inode)
 
 void btrfs_kill_all_delayed_nodes(struct btrfs_root *root)
 {
-	u64 inode_id = 0;
+	unsigned long index;
+	struct btrfs_delayed_node *delayed_node;
 	struct btrfs_delayed_node *delayed_nodes[8];
 	int i, n;
 
 	while (1) {
 		spin_lock(&root->inode_lock);
-		n = radix_tree_gang_lookup(&root->delayed_nodes_tree,
-					   (void **)delayed_nodes, inode_id,
-					   ARRAY_SIZE(delayed_nodes));
-		if (!n) {
+		if (xa_empty(&root->delayed_nodes)) {
 			spin_unlock(&root->inode_lock);
 			break;
 		}
 
-		inode_id = delayed_nodes[n - 1]->inode_id + 1;
-		for (i = 0; i < n; i++) {
+		n = 0;
+		xa_for_each_start(&root->delayed_nodes, index,
+				  delayed_node, index) {
 			/*
 			 * Don't increase refs in case the node is dead and
 			 * about to be removed from the tree in the loop below
 			 */
-			if (!refcount_inc_not_zero(&delayed_nodes[i]->refs))
-				delayed_nodes[i] = NULL;
+			if (!refcount_inc_not_zero(&delayed_node->refs))
+				delayed_nodes[n] = NULL;
+			n++;
+			if (n >= ARRAY_SIZE(delayed_nodes))
+				break;
 		}
+		index++;
 		spin_unlock(&root->inode_lock);
 
 		for (i = 0; i < n; i++) {
diff --git a/fs/btrfs/disk-io.c b/fs/btrfs/disk-io.c
index b30309f187cf..2a49c772d175 100644
--- a/fs/btrfs/disk-io.c
+++ b/fs/btrfs/disk-io.c
@@ -1164,7 +1164,7 @@  static void __setup_root(struct btrfs_root *root, struct btrfs_fs_info *fs_info,
 	root->nr_delalloc_inodes = 0;
 	root->nr_ordered_extents = 0;
 	root->inode_tree = RB_ROOT;
-	INIT_RADIX_TREE(&root->delayed_nodes_tree, GFP_ATOMIC);
+	xa_init_flags(&root->delayed_nodes, GFP_ATOMIC);
 
 	btrfs_init_root_block_rsv(root);
 
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 17d5557f98ec..5b5355fdfa62 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -3827,7 +3827,7 @@  static int btrfs_read_locked_inode(struct inode *inode,
 	 * cache.
 	 *
 	 * This is required for both inode re-read from disk and delayed inode
-	 * in delayed_nodes_tree.
+	 * in the delayed_nodes XArray.
 	 */
 	if (BTRFS_I(inode)->last_trans == fs_info->generation)
 		set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,