diff mbox series

[v2,2/6] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()

Message ID 20230830125654.21257-3-zhangpeng.00@bytedance.com (mailing list archive)
State New
Headers show
Series Introduce __mt_dup() to improve the performance of fork() | expand

Commit Message

Peng Zhang Aug. 30, 2023, 12:56 p.m. UTC
Introduce interfaces __mt_dup() and mtree_dup(), which are used to
duplicate a maple tree. Compared with traversing the source tree and
reinserting entry by entry in the new tree, it has better performance.
The difference between __mt_dup() and mtree_dup() is that mtree_dup()
handles locks internally.

Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
---
 include/linux/maple_tree.h |   3 +
 lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
 2 files changed, 268 insertions(+)

Comments

Liam R. Howlett Sept. 7, 2023, 8:13 p.m. UTC | #1
* Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
> Introduce interfaces __mt_dup() and mtree_dup(), which are used to
> duplicate a maple tree. Compared with traversing the source tree and
> reinserting entry by entry in the new tree, it has better performance.
> The difference between __mt_dup() and mtree_dup() is that mtree_dup()
> handles locks internally.

__mt_dup() should be called mas_dup() to indicate the advanced interface
which requires users to handle their own locks.

> 
> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
>  include/linux/maple_tree.h |   3 +
>  lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
>  2 files changed, 268 insertions(+)
> 
> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> index e41c70ac7744..44fe8a57ecbd 100644
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
>  		void *entry, gfp_t gfp);
>  void *mtree_erase(struct maple_tree *mt, unsigned long index);
>  
> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> +
>  void mtree_destroy(struct maple_tree *mt);
>  void __mt_destroy(struct maple_tree *mt);
>  
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index ef234cf02e3e..8f841682269c 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
>  }
>  EXPORT_SYMBOL(mtree_erase);
>  
> +/*
> + * mas_dup_free() - Free a half-constructed tree.

Maybe "Free an incomplete duplication of a tree" ?

> + * @mas: Points to the last node of the half-constructed tree.

Your use of "Points to" seems to indicate someone knows you are talking
about a "maple state that has a node pointing to".  Can this be made
more clear?
@mas: The maple state of a incomplete tree.

Then add a note that @mas->node points to the last successfully
allocated node?

Or something along those lines.

> + *
> + * This function frees all nodes starting from @mas->node in the reverse order
> + * of mas_dup_build(). There is no need to hold the source tree lock at this
> + * time.
> + */
> +static void mas_dup_free(struct ma_state *mas)
> +{
> +	struct maple_node *node;
> +	enum maple_type type;
> +	void __rcu **slots;
> +	unsigned char count, i;
> +
> +	/* Maybe the first node allocation failed. */
> +	if (!mas->node)
> +		return;
> +
> +	while (!mte_is_root(mas->node)) {
> +		mas_ascend(mas);
> +
> +		if (mas->offset) {
> +			mas->offset--;
> +			do {
> +				mas_descend(mas);
> +				mas->offset = mas_data_end(mas);
> +			} while (!mte_is_leaf(mas->node));

Can you blindly descend and check !mte_is_leaf()?  What happens when the
tree duplication fails at random internal nodes?  Maybe I missed how
this cannot happen?

> +
> +			mas_ascend(mas);
> +		}
> +
> +		node = mte_to_node(mas->node);
> +		type = mte_node_type(mas->node);
> +		slots = (void **)ma_slots(node, type);
> +		count = mas_data_end(mas) + 1;
> +		for (i = 0; i < count; i++)
> +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
> +
> +		mt_free_bulk(count, slots);
> +	}


> +
> +	node = mte_to_node(mas->node);
> +	mt_free_one(node);
> +}
> +
> +/*
> + * mas_copy_node() - Copy a maple node and allocate child nodes.

if required. "..and allocate child nodes if required."

> + * @mas: Points to the source node.
> + * @new_mas: Points to the new node.
> + * @parent: The parent node of the new node.
> + * @gfp: The GFP_FLAGS to use for allocations.
> + *
> + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
> + * @new_mas->node and allocate new child nodes for @new_mas->node.
> + * If memory allocation fails, @mas is set to -ENOMEM.
> + */
> +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
> +		struct maple_node *parent, gfp_t gfp)
> +{
> +	struct maple_node *node = mte_to_node(mas->node);
> +	struct maple_node *new_node = mte_to_node(new_mas->node);
> +	enum maple_type type;
> +	unsigned long val;
> +	unsigned char request, count, i;
> +	void __rcu **slots;
> +	void __rcu **new_slots;
> +
> +	/* Copy the node completely. */
> +	memcpy(new_node, node, sizeof(struct maple_node));
> +
> +	/* Update the parent node pointer. */
> +	if (unlikely(ma_is_root(node)))
> +		val = MA_ROOT_PARENT;
> +	else
> +		val = (unsigned long)node->parent & MAPLE_NODE_MASK;

If you treat the root as special and outside the loop, then you can
avoid the check for root for every non-root node.  For root, you just
need to copy and do this special parent thing before the main loop in
mas_dup_build().  This will avoid an extra branch for each VMA over 14,
so that would add up to a lot of instructions.

> +
> +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
> +
> +	if (mte_is_leaf(mas->node))
> +		return;

You are checking here and in mas_dup_build() for the leaf, splitting the
function into parent assignment and allocate would allow you to check
once. Copy could be moved to the main loop or with the parent setting,
depending on how you handle the root suggestion above.

> +
> +	/* Allocate memory for child nodes. */
> +	type = mte_node_type(mas->node);
> +	new_slots = ma_slots(new_node, type);
> +	request = mas_data_end(mas) + 1;
> +	count = mt_alloc_bulk(gfp, request, new_slots);
> +	if (unlikely(count < request)) {
> +		if (count)
> +			mt_free_bulk(count, new_slots);

The new_slots will still contain the addresses of the freed nodes.
Don't you need to clear it here to avoid a double free?  Is there a
test case for this in your testing?  Again, I may have missed how this
is not possible..

> +		mas_set_err(mas, -ENOMEM);
> +		return;
> +	}
> +
> +	/* Restore node type information in slots. */
> +	slots = ma_slots(node, type);
> +	for (i = 0; i < count; i++)
> +		((unsigned long *)new_slots)[i] |=
> +			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
> +			MAPLE_NODE_MASK);

Can you expand this to multiple lines to make it more clear what is
going on?

> +}
> +
> +/*
> + * mas_dup_build() - Build a new maple tree from a source tree
> + * @mas: The maple state of source tree.
> + * @new_mas: The maple state of new tree.
> + * @gfp: The GFP_FLAGS to use for allocations.
> + *
> + * This function builds a new tree in DFS preorder. If the memory allocation
> + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
> + * last node. mas_dup_free() will free the half-constructed tree.
> + *
> + * Note that the attributes of the two trees must be exactly the same, and the
> + * new tree must be empty, otherwise -EINVAL will be returned.
> + */
> +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
> +		gfp_t gfp)
> +{
> +	struct maple_node *node, *parent;

Could parent be struct maple_pnode?

> +	struct maple_enode *root;
> +	enum maple_type type;
> +
> +	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
> +	    unlikely(!mtree_empty(new_mas->tree))) {
> +		mas_set_err(mas, -EINVAL);
> +		return;
> +	}
> +
> +	mas_start(mas);
> +	if (mas_is_ptr(mas) || mas_is_none(mas)) {
> +		/*
> +		 * The attributes of the two trees must be the same before this.
> +		 * The following assignment makes them the same height.
> +		 */
> +		new_mas->tree->ma_flags = mas->tree->ma_flags;
> +		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
> +		return;
> +	}
> +
> +	node = mt_alloc_one(gfp);
> +	if (!node) {
> +		new_mas->node = NULL;

We don't have checks around for node == NULL, MAS_NONE would be a safer
choice.  It is unlikely that someone would dup the tree and fail then
call something else, but I avoid setting node to NULL.

> +		mas_set_err(mas, -ENOMEM);
> +		return;
> +	}
> +
> +	type = mte_node_type(mas->node);
> +	root = mt_mk_node(node, type);
> +	new_mas->node = root;
> +	new_mas->min = 0;
> +	new_mas->max = ULONG_MAX;
> +	parent = ma_mnode_ptr(new_mas->tree);
> +
> +	while (1) {
> +		mas_copy_node(mas, new_mas, parent, gfp);
> +
> +		if (unlikely(mas_is_err(mas)))
> +			return;
> +
> +		/* Once we reach a leaf, we need to ascend, or end the loop. */
> +		if (mte_is_leaf(mas->node)) {
> +			if (mas->max == ULONG_MAX) {
> +				new_mas->tree->ma_flags = mas->tree->ma_flags;
> +				rcu_assign_pointer(new_mas->tree->ma_root,
> +						   mte_mk_root(root));
> +				break;

If you move this to the end of the function, you can replace the same
block above with a goto.  That will avoid breaking the line up.

> +			}
> +
> +			do {
> +				/*
> +				 * Must not at the root node, because we've
> +				 * already end the loop when we reach the last
> +				 * leaf.
> +				 */

I'm not sure what the comment above is trying to say.  Do you mean "This
won't reach the root node because the loop will break when the last leaf
is hit"?  I don't think that is accurate.. it will hit the root node but
not the end of the root node, right?  Anyways, the comment isn't clear
so please have a look.

> +				mas_ascend(mas);
> +				mas_ascend(new_mas);
> +			} while (mas->offset == mas_data_end(mas));
> +
> +			mas->offset++;
> +			new_mas->offset++;
> +		}
> +
> +		mas_descend(mas);
> +		parent = mte_to_node(new_mas->node);
> +		mas_descend(new_mas);
> +		mas->offset = 0;
> +		new_mas->offset = 0;
> +	}
> +}
> +
> +/**
> + * __mt_dup(): Duplicate a maple tree
> + * @mt: The source maple tree
> + * @new: The new maple tree
> + * @gfp: The GFP_FLAGS to use for allocations
> + *
> + * This function duplicates a maple tree using a faster method than traversing
> + * the source tree and inserting entries into the new tree one by one.

Can you make this comment more about what your code does instead of the
"one by one" description?

> + * The user needs to ensure that the attributes of the source tree and the new
> + * tree are the same, and the new tree needs to be an empty tree, otherwise
> + * -EINVAL will be returned.
> + * Note that the user needs to manually lock the source tree and the new tree.
> + *
> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> + * the attributes of the two trees are different or the new tree is not an empty
> + * tree.
> + */
> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> +{
> +	int ret = 0;
> +	MA_STATE(mas, mt, 0, 0);
> +	MA_STATE(new_mas, new, 0, 0);
> +
> +	mas_dup_build(&mas, &new_mas, gfp);
> +
> +	if (unlikely(mas_is_err(&mas))) {
> +		ret = xa_err(mas.node);
> +		if (ret == -ENOMEM)
> +			mas_dup_free(&new_mas);
> +	}
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(__mt_dup);
> +
> +/**
> + * mtree_dup(): Duplicate a maple tree
> + * @mt: The source maple tree
> + * @new: The new maple tree
> + * @gfp: The GFP_FLAGS to use for allocations
> + *
> + * This function duplicates a maple tree using a faster method than traversing
> + * the source tree and inserting entries into the new tree one by one.

Again, it's more interesting to state it uses the DFS preorder copy.

It is also worth mentioning the superior allocation behaviour since that
is a desirable trait for many.  In fact, you should add the allocation
behaviour in your cover letter.

> + * The user needs to ensure that the attributes of the source tree and the new
> + * tree are the same, and the new tree needs to be an empty tree, otherwise
> + * -EINVAL will be returned.
> + *
> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> + * the attributes of the two trees are different or the new tree is not an empty
> + * tree.
> + */
> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> +{
> +	int ret = 0;
> +	MA_STATE(mas, mt, 0, 0);
> +	MA_STATE(new_mas, new, 0, 0);
> +
> +	mas_lock(&new_mas);
> +	mas_lock(&mas);
> +
> +	mas_dup_build(&mas, &new_mas, gfp);
> +	mas_unlock(&mas);
> +
> +	if (unlikely(mas_is_err(&mas))) {
> +		ret = xa_err(mas.node);
> +		if (ret == -ENOMEM)
> +			mas_dup_free(&new_mas);
> +	}
> +
> +	mas_unlock(&new_mas);
> +
> +	return ret;
> +}
> +EXPORT_SYMBOL(mtree_dup);
> +
>  /**
>   * __mt_destroy() - Walk and free all nodes of a locked maple tree.
>   * @mt: The maple tree
> -- 
> 2.20.1
>
Peng Zhang Sept. 8, 2023, 9:26 a.m. UTC | #2
在 2023/9/8 04:13, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
>> Introduce interfaces __mt_dup() and mtree_dup(), which are used to
>> duplicate a maple tree. Compared with traversing the source tree and
>> reinserting entry by entry in the new tree, it has better performance.
>> The difference between __mt_dup() and mtree_dup() is that mtree_dup()
>> handles locks internally.
> 
> __mt_dup() should be called mas_dup() to indicate the advanced interface
> which requires users to handle their own locks.
Ok, I'll change __mt_dup() to mas_dup().
> 
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>>   include/linux/maple_tree.h |   3 +
>>   lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
>>   2 files changed, 268 insertions(+)
>>
>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>> index e41c70ac7744..44fe8a57ecbd 100644
>> --- a/include/linux/maple_tree.h
>> +++ b/include/linux/maple_tree.h
>> @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
>>   		void *entry, gfp_t gfp);
>>   void *mtree_erase(struct maple_tree *mt, unsigned long index);
>>   
>> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>> +
>>   void mtree_destroy(struct maple_tree *mt);
>>   void __mt_destroy(struct maple_tree *mt);
>>   
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index ef234cf02e3e..8f841682269c 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
>>   }
>>   EXPORT_SYMBOL(mtree_erase);
>>   
>> +/*
>> + * mas_dup_free() - Free a half-constructed tree.
> 
> Maybe "Free an incomplete duplication of a tree" ?
> 
>> + * @mas: Points to the last node of the half-constructed tree.
> 
> Your use of "Points to" seems to indicate someone knows you are talking
> about a "maple state that has a node pointing to".  Can this be made
> more clear?
> @mas: The maple state of a incomplete tree.
> 
> Then add a note that @mas->node points to the last successfully
> allocated node?
> 
> Or something along those lines.
Ok, I'll revise the comment.
> 
>> + *
>> + * This function frees all nodes starting from @mas->node in the reverse order
>> + * of mas_dup_build(). There is no need to hold the source tree lock at this
>> + * time.
>> + */
>> +static void mas_dup_free(struct ma_state *mas)
>> +{
>> +	struct maple_node *node;
>> +	enum maple_type type;
>> +	void __rcu **slots;
>> +	unsigned char count, i;
>> +
>> +	/* Maybe the first node allocation failed. */
>> +	if (!mas->node)
>> +		return;
>> +
>> +	while (!mte_is_root(mas->node)) {
>> +		mas_ascend(mas);
>> +
>> +		if (mas->offset) {
>> +			mas->offset--;
>> +			do {
>> +				mas_descend(mas);
>> +				mas->offset = mas_data_end(mas);
>> +			} while (!mte_is_leaf(mas->node));
> 
> Can you blindly descend and check !mte_is_leaf()?  What happens when the
> tree duplication fails at random internal nodes?  Maybe I missed how
> this cannot happen?
This cannot happen. Note the mas_ascend(mas) at the beginning of the
outermost loop.

> 
>> +
>> +			mas_ascend(mas);
>> +		}
>> +
>> +		node = mte_to_node(mas->node);
>> +		type = mte_node_type(mas->node);
>> +		slots = (void **)ma_slots(node, type);
>> +		count = mas_data_end(mas) + 1;
>> +		for (i = 0; i < count; i++)
>> +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
>> +
>> +		mt_free_bulk(count, slots);
>> +	}
> 
> 
>> +
>> +	node = mte_to_node(mas->node);
>> +	mt_free_one(node);
>> +}
>> +
>> +/*
>> + * mas_copy_node() - Copy a maple node and allocate child nodes.
> 
> if required. "..and allocate child nodes if required."
> 
>> + * @mas: Points to the source node.
>> + * @new_mas: Points to the new node.
>> + * @parent: The parent node of the new node.
>> + * @gfp: The GFP_FLAGS to use for allocations.
>> + *
>> + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
>> + * @new_mas->node and allocate new child nodes for @new_mas->node.
>> + * If memory allocation fails, @mas is set to -ENOMEM.
>> + */
>> +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
>> +		struct maple_node *parent, gfp_t gfp)
>> +{
>> +	struct maple_node *node = mte_to_node(mas->node);
>> +	struct maple_node *new_node = mte_to_node(new_mas->node);
>> +	enum maple_type type;
>> +	unsigned long val;
>> +	unsigned char request, count, i;
>> +	void __rcu **slots;
>> +	void __rcu **new_slots;
>> +
>> +	/* Copy the node completely. */
>> +	memcpy(new_node, node, sizeof(struct maple_node));
>> +
>> +	/* Update the parent node pointer. */
>> +	if (unlikely(ma_is_root(node)))
>> +		val = MA_ROOT_PARENT;
>> +	else
>> +		val = (unsigned long)node->parent & MAPLE_NODE_MASK;
> 
> If you treat the root as special and outside the loop, then you can
> avoid the check for root for every non-root node.  For root, you just
> need to copy and do this special parent thing before the main loop in
> mas_dup_build().  This will avoid an extra branch for each VMA over 14,
> so that would add up to a lot of instructions.
I'll handle the root node outside.
However, do you think it makes sense to have the parent of the root node
point to the struct maple_tree? I don't see it used anywhere.

> 
>> +
>> +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
>> +
>> +	if (mte_is_leaf(mas->node))
>> +		return;
> 
> You are checking here and in mas_dup_build() for the leaf, splitting the
> function into parent assignment and allocate would allow you to check
> once. Copy could be moved to the main loop or with the parent setting,
> depending on how you handle the root suggestion above.
I'll try to reduce some checks.
> 
>> +
>> +	/* Allocate memory for child nodes. */
>> +	type = mte_node_type(mas->node);
>> +	new_slots = ma_slots(new_node, type);
>> +	request = mas_data_end(mas) + 1;
>> +	count = mt_alloc_bulk(gfp, request, new_slots);
>> +	if (unlikely(count < request)) {
>> +		if (count)
>> +			mt_free_bulk(count, new_slots);
> 
> The new_slots will still contain the addresses of the freed nodes.
> Don't you need to clear it here to avoid a double free?  Is there a
> test case for this in your testing?  Again, I may have missed how this
> is not possible..
It's impossible, because in mt_free_bulk(), the first thing to do with
the incoming node is to go up. We free all child nodes at the parent
node.

We guarantee that the node passed to mas_dup_free() is "clean".
mas_dup_free() also follows this so will not free children of this node.

The child nodes of this node cannot be freed in mt_free_bulk() because
the node is not completely constructed and data_end cannot be obtained.
data_end cannot be set on this node because the number of successfully
allocated child nodes can be 0.
> 
>> +		mas_set_err(mas, -ENOMEM);
>> +		return;
>> +	}
>> +
>> +	/* Restore node type information in slots. */
>> +	slots = ma_slots(node, type);
>> +	for (i = 0; i < count; i++)
>> +		((unsigned long *)new_slots)[i] |=
>> +			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
>> +			MAPLE_NODE_MASK);
> 
> Can you expand this to multiple lines to make it more clear what is
> going on?
I will try to do that.

> 
>> +}
>> +
>> +/*
>> + * mas_dup_build() - Build a new maple tree from a source tree
>> + * @mas: The maple state of source tree.
>> + * @new_mas: The maple state of new tree.
>> + * @gfp: The GFP_FLAGS to use for allocations.
>> + *
>> + * This function builds a new tree in DFS preorder. If the memory allocation
>> + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
>> + * last node. mas_dup_free() will free the half-constructed tree.
>> + *
>> + * Note that the attributes of the two trees must be exactly the same, and the
>> + * new tree must be empty, otherwise -EINVAL will be returned.
>> + */
>> +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
>> +		gfp_t gfp)
>> +{
>> +	struct maple_node *node, *parent;
> 
> Could parent be struct maple_pnode?
I'll rename it.

> 
>> +	struct maple_enode *root;
>> +	enum maple_type type;
>> +
>> +	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
>> +	    unlikely(!mtree_empty(new_mas->tree))) {
>> +		mas_set_err(mas, -EINVAL);
>> +		return;
>> +	}
>> +
>> +	mas_start(mas);
>> +	if (mas_is_ptr(mas) || mas_is_none(mas)) {
>> +		/*
>> +		 * The attributes of the two trees must be the same before this.
>> +		 * The following assignment makes them the same height.
>> +		 */
>> +		new_mas->tree->ma_flags = mas->tree->ma_flags;
>> +		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
>> +		return;
>> +	}
>> +
>> +	node = mt_alloc_one(gfp);
>> +	if (!node) {
>> +		new_mas->node = NULL;
> 
> We don't have checks around for node == NULL, MAS_NONE would be a safer
> choice.  It is unlikely that someone would dup the tree and fail then
> call something else, but I avoid setting node to NULL.
I will set it to MAS_NONE in the next version.

> 
>> +		mas_set_err(mas, -ENOMEM);
>> +		return;
>> +	}
>> +
>> +	type = mte_node_type(mas->node);
>> +	root = mt_mk_node(node, type);
>> +	new_mas->node = root;
>> +	new_mas->min = 0;
>> +	new_mas->max = ULONG_MAX;
>> +	parent = ma_mnode_ptr(new_mas->tree);
>> +
>> +	while (1) {
>> +		mas_copy_node(mas, new_mas, parent, gfp);
>> +
>> +		if (unlikely(mas_is_err(mas)))
>> +			return;
>> +
>> +		/* Once we reach a leaf, we need to ascend, or end the loop. */
>> +		if (mte_is_leaf(mas->node)) {
>> +			if (mas->max == ULONG_MAX) {
>> +				new_mas->tree->ma_flags = mas->tree->ma_flags;
>> +				rcu_assign_pointer(new_mas->tree->ma_root,
>> +						   mte_mk_root(root));
>> +				break;
> 
> If you move this to the end of the function, you can replace the same
> block above with a goto.  That will avoid breaking the line up.
I can do this, but it doesn't seem to make a difference.
> 
>> +			}
>> +
>> +			do {
>> +				/*
>> +				 * Must not at the root node, because we've
>> +				 * already end the loop when we reach the last
>> +				 * leaf.
>> +				 */
> 
> I'm not sure what the comment above is trying to say.  Do you mean "This
> won't reach the root node because the loop will break when the last leaf
> is hit"?  I don't think that is accurate.. it will hit the root node but
> not the end of the root node, right?  Anyways, the comment isn't clear
> so please have a look.
Yes, it will hit the root node but not the end of the root node. I'll
fix this comment. Thanks.

> 
>> +				mas_ascend(mas);
>> +				mas_ascend(new_mas);
>> +			} while (mas->offset == mas_data_end(mas));
>> +
>> +			mas->offset++;
>> +			new_mas->offset++;
>> +		}
>> +
>> +		mas_descend(mas);
>> +		parent = mte_to_node(new_mas->node);
>> +		mas_descend(new_mas);
>> +		mas->offset = 0;
>> +		new_mas->offset = 0;
>> +	}
>> +}
>> +
>> +/**
>> + * __mt_dup(): Duplicate a maple tree
>> + * @mt: The source maple tree
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + *
>> + * This function duplicates a maple tree using a faster method than traversing
>> + * the source tree and inserting entries into the new tree one by one.
> 
> Can you make this comment more about what your code does instead of the
> "one by one" description?
> 
>> + * The user needs to ensure that the attributes of the source tree and the new
>> + * tree are the same, and the new tree needs to be an empty tree, otherwise
>> + * -EINVAL will be returned.
>> + * Note that the user needs to manually lock the source tree and the new tree.
>> + *
>> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
>> + * the attributes of the two trees are different or the new tree is not an empty
>> + * tree.
>> + */
>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>> +{
>> +	int ret = 0;
>> +	MA_STATE(mas, mt, 0, 0);
>> +	MA_STATE(new_mas, new, 0, 0);
>> +
>> +	mas_dup_build(&mas, &new_mas, gfp);
>> +
>> +	if (unlikely(mas_is_err(&mas))) {
>> +		ret = xa_err(mas.node);
>> +		if (ret == -ENOMEM)
>> +			mas_dup_free(&new_mas);
>> +	}
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(__mt_dup);
>> +
>> +/**
>> + * mtree_dup(): Duplicate a maple tree
>> + * @mt: The source maple tree
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + *
>> + * This function duplicates a maple tree using a faster method than traversing
>> + * the source tree and inserting entries into the new tree one by one.
> 
> Again, it's more interesting to state it uses the DFS preorder copy.
> 
> It is also worth mentioning the superior allocation behaviour since that
> is a desirable trait for many.  In fact, you should add the allocation
> behaviour in your cover letter.
Okay, I will describe more in the next version.

> 
>> + * The user needs to ensure that the attributes of the source tree and the new
>> + * tree are the same, and the new tree needs to be an empty tree, otherwise
>> + * -EINVAL will be returned.
>> + *
>> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
>> + * the attributes of the two trees are different or the new tree is not an empty
>> + * tree.
>> + */
>> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>> +{
>> +	int ret = 0;
>> +	MA_STATE(mas, mt, 0, 0);
>> +	MA_STATE(new_mas, new, 0, 0);
>> +
>> +	mas_lock(&new_mas);
>> +	mas_lock(&mas);
>> +
>> +	mas_dup_build(&mas, &new_mas, gfp);
>> +	mas_unlock(&mas);
>> +
>> +	if (unlikely(mas_is_err(&mas))) {
>> +		ret = xa_err(mas.node);
>> +		if (ret == -ENOMEM)
>> +			mas_dup_free(&new_mas);
>> +	}
>> +
>> +	mas_unlock(&new_mas);
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(mtree_dup);
>> +
>>   /**
>>    * __mt_destroy() - Walk and free all nodes of a locked maple tree.
>>    * @mt: The maple tree
>> -- 
>> 2.20.1
>>
Liam R. Howlett Sept. 8, 2023, 4:05 p.m. UTC | #3
* Peng Zhang <zhangpeng.00@bytedance.com> [230908 05:26]:
> 
> 
> 在 2023/9/8 04:13, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
> > > Introduce interfaces __mt_dup() and mtree_dup(), which are used to
> > > duplicate a maple tree. Compared with traversing the source tree and
> > > reinserting entry by entry in the new tree, it has better performance.
> > > The difference between __mt_dup() and mtree_dup() is that mtree_dup()
> > > handles locks internally.
> > 
> > __mt_dup() should be called mas_dup() to indicate the advanced interface
> > which requires users to handle their own locks.
> Ok, I'll change __mt_dup() to mas_dup().
> > 
> > > 
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > >   include/linux/maple_tree.h |   3 +
> > >   lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
> > >   2 files changed, 268 insertions(+)
> > > 
> > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > > index e41c70ac7744..44fe8a57ecbd 100644
> > > --- a/include/linux/maple_tree.h
> > > +++ b/include/linux/maple_tree.h
> > > @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
> > >   		void *entry, gfp_t gfp);
> > >   void *mtree_erase(struct maple_tree *mt, unsigned long index);
> > > +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +
> > >   void mtree_destroy(struct maple_tree *mt);
> > >   void __mt_destroy(struct maple_tree *mt);
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index ef234cf02e3e..8f841682269c 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
> > >   }
> > >   EXPORT_SYMBOL(mtree_erase);
> > > +/*
> > > + * mas_dup_free() - Free a half-constructed tree.
> > 
> > Maybe "Free an incomplete duplication of a tree" ?
> > 
> > > + * @mas: Points to the last node of the half-constructed tree.
> > 
> > Your use of "Points to" seems to indicate someone knows you are talking
> > about a "maple state that has a node pointing to".  Can this be made
> > more clear?
> > @mas: The maple state of a incomplete tree.
> > 
> > Then add a note that @mas->node points to the last successfully
> > allocated node?
> > 
> > Or something along those lines.
> Ok, I'll revise the comment.
> > 
> > > + *
> > > + * This function frees all nodes starting from @mas->node in the reverse order
> > > + * of mas_dup_build(). There is no need to hold the source tree lock at this
> > > + * time.
> > > + */
> > > +static void mas_dup_free(struct ma_state *mas)
> > > +{
> > > +	struct maple_node *node;
> > > +	enum maple_type type;
> > > +	void __rcu **slots;
> > > +	unsigned char count, i;
> > > +
> > > +	/* Maybe the first node allocation failed. */
> > > +	if (!mas->node)
> > > +		return;
> > > +
> > > +	while (!mte_is_root(mas->node)) {
> > > +		mas_ascend(mas);
> > > +
> > > +		if (mas->offset) {
> > > +			mas->offset--;
> > > +			do {
> > > +				mas_descend(mas);
> > > +				mas->offset = mas_data_end(mas);
> > > +			} while (!mte_is_leaf(mas->node));
> > 
> > Can you blindly descend and check !mte_is_leaf()?  What happens when the
> > tree duplication fails at random internal nodes?  Maybe I missed how
> > this cannot happen?
> This cannot happen. Note the mas_ascend(mas) at the beginning of the
> outermost loop.
> 
> > 
> > > +
> > > +			mas_ascend(mas);
> > > +		}
> > > +
> > > +		node = mte_to_node(mas->node);
> > > +		type = mte_node_type(mas->node);
> > > +		slots = (void **)ma_slots(node, type);
> > > +		count = mas_data_end(mas) + 1;
> > > +		for (i = 0; i < count; i++)
> > > +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
> > > +
> > > +		mt_free_bulk(count, slots);
> > > +	}
> > 
> > 
> > > +
> > > +	node = mte_to_node(mas->node);
> > > +	mt_free_one(node);
> > > +}
> > > +
> > > +/*
> > > + * mas_copy_node() - Copy a maple node and allocate child nodes.
> > 
> > if required. "..and allocate child nodes if required."
> > 
> > > + * @mas: Points to the source node.
> > > + * @new_mas: Points to the new node.
> > > + * @parent: The parent node of the new node.
> > > + * @gfp: The GFP_FLAGS to use for allocations.
> > > + *
> > > + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
> > > + * @new_mas->node and allocate new child nodes for @new_mas->node.
> > > + * If memory allocation fails, @mas is set to -ENOMEM.
> > > + */
> > > +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
> > > +		struct maple_node *parent, gfp_t gfp)
> > > +{
> > > +	struct maple_node *node = mte_to_node(mas->node);
> > > +	struct maple_node *new_node = mte_to_node(new_mas->node);
> > > +	enum maple_type type;
> > > +	unsigned long val;
> > > +	unsigned char request, count, i;
> > > +	void __rcu **slots;
> > > +	void __rcu **new_slots;
> > > +
> > > +	/* Copy the node completely. */
> > > +	memcpy(new_node, node, sizeof(struct maple_node));
> > > +
> > > +	/* Update the parent node pointer. */
> > > +	if (unlikely(ma_is_root(node)))
> > > +		val = MA_ROOT_PARENT;
> > > +	else
> > > +		val = (unsigned long)node->parent & MAPLE_NODE_MASK;
> > 
> > If you treat the root as special and outside the loop, then you can
> > avoid the check for root for every non-root node.  For root, you just
> > need to copy and do this special parent thing before the main loop in
> > mas_dup_build().  This will avoid an extra branch for each VMA over 14,
> > so that would add up to a lot of instructions.
> I'll handle the root node outside.
> However, do you think it makes sense to have the parent of the root node
> point to the struct maple_tree? I don't see it used anywhere.

I'm not sure.  It needs to not point to itself (indicating it is dead),
and we need to tell it's the root node, but I'm not entirely sure it is
necessary to point to the maple_tree.. although it is useful in dumps
sometimes.

> 
> > 
> > > +
> > > +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
> > > +
> > > +	if (mte_is_leaf(mas->node))
> > > +		return;
> > 
> > You are checking here and in mas_dup_build() for the leaf, splitting the
> > function into parent assignment and allocate would allow you to check
> > once. Copy could be moved to the main loop or with the parent setting,
> > depending on how you handle the root suggestion above.
> I'll try to reduce some checks.
> > 
> > > +
> > > +	/* Allocate memory for child nodes. */
> > > +	type = mte_node_type(mas->node);
> > > +	new_slots = ma_slots(new_node, type);
> > > +	request = mas_data_end(mas) + 1;
> > > +	count = mt_alloc_bulk(gfp, request, new_slots);
> > > +	if (unlikely(count < request)) {
> > > +		if (count)
> > > +			mt_free_bulk(count, new_slots);
> > 
> > The new_slots will still contain the addresses of the freed nodes.
> > Don't you need to clear it here to avoid a double free?  Is there a
> > test case for this in your testing?  Again, I may have missed how this
> > is not possible..
> It's impossible, because in mt_free_bulk(), the first thing to do with
> the incoming node is to go up. We free all child nodes at the parent
> node.
> 
> We guarantee that the node passed to mas_dup_free() is "clean".

You mean there are no allocations below?

> mas_dup_free() also follows this so will not free children of this node.
> 
> The child nodes of this node cannot be freed in mt_free_bulk() because
> the node is not completely constructed and data_end cannot be obtained.
> data_end cannot be set on this node because the number of successfully
> allocated child nodes can be 0.

It still seems unwise to keep pointers pointing to unallocated memory
here, can we just clear the slots?

It's a bit odd because our choice is to leave it with pointers to nodes
in another tree or potential unallocated memory that isn't anywhere.
Both are a bit unnerving passing into another function that cleans
things up.  Since it's the error path, we won't have a performance
penalty in wiping the slots and it doesn't really matter that the node
isn't valid.  It seems more likely we would catch the error in a more
identifiable place if we set the slots to NULL.

> > 
> > > +		mas_set_err(mas, -ENOMEM);
> > > +		return;
> > > +	}
> > > +
> > > +	/* Restore node type information in slots. */
> > > +	slots = ma_slots(node, type);
> > > +	for (i = 0; i < count; i++)
> > > +		((unsigned long *)new_slots)[i] |=
> > > +			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
> > > +			MAPLE_NODE_MASK);
> > 
> > Can you expand this to multiple lines to make it more clear what is
> > going on?
> I will try to do that.
> 
> > 
> > > +}
> > > +
> > > +/*
> > > + * mas_dup_build() - Build a new maple tree from a source tree
> > > + * @mas: The maple state of source tree.
> > > + * @new_mas: The maple state of new tree.
> > > + * @gfp: The GFP_FLAGS to use for allocations.
> > > + *
> > > + * This function builds a new tree in DFS preorder. If the memory allocation
> > > + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
> > > + * last node. mas_dup_free() will free the half-constructed tree.
> > > + *
> > > + * Note that the attributes of the two trees must be exactly the same, and the
> > > + * new tree must be empty, otherwise -EINVAL will be returned.
> > > + */
> > > +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
> > > +		gfp_t gfp)
> > > +{
> > > +	struct maple_node *node, *parent;
> > 
> > Could parent be struct maple_pnode?
> I'll rename it.
> 
> > 
> > > +	struct maple_enode *root;
> > > +	enum maple_type type;
> > > +
> > > +	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
> > > +	    unlikely(!mtree_empty(new_mas->tree))) {
> > > +		mas_set_err(mas, -EINVAL);
> > > +		return;
> > > +	}
> > > +
> > > +	mas_start(mas);
> > > +	if (mas_is_ptr(mas) || mas_is_none(mas)) {
> > > +		/*
> > > +		 * The attributes of the two trees must be the same before this.
> > > +		 * The following assignment makes them the same height.
> > > +		 */
> > > +		new_mas->tree->ma_flags = mas->tree->ma_flags;
> > > +		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
> > > +		return;
> > > +	}
> > > +
> > > +	node = mt_alloc_one(gfp);
> > > +	if (!node) {
> > > +		new_mas->node = NULL;
> > 
> > We don't have checks around for node == NULL, MAS_NONE would be a safer
> > choice.  It is unlikely that someone would dup the tree and fail then
> > call something else, but I avoid setting node to NULL.
> I will set it to MAS_NONE in the next version.
> 
> > 
> > > +		mas_set_err(mas, -ENOMEM);
> > > +		return;
> > > +	}
> > > +
> > > +	type = mte_node_type(mas->node);
> > > +	root = mt_mk_node(node, type);
> > > +	new_mas->node = root;
> > > +	new_mas->min = 0;
> > > +	new_mas->max = ULONG_MAX;
> > > +	parent = ma_mnode_ptr(new_mas->tree);
> > > +
> > > +	while (1) {
> > > +		mas_copy_node(mas, new_mas, parent, gfp);
> > > +
> > > +		if (unlikely(mas_is_err(mas)))
> > > +			return;
> > > +
> > > +		/* Once we reach a leaf, we need to ascend, or end the loop. */
> > > +		if (mte_is_leaf(mas->node)) {
> > > +			if (mas->max == ULONG_MAX) {
> > > +				new_mas->tree->ma_flags = mas->tree->ma_flags;
> > > +				rcu_assign_pointer(new_mas->tree->ma_root,
> > > +						   mte_mk_root(root));
> > > +				break;
> > 
> > If you move this to the end of the function, you can replace the same
> > block above with a goto.  That will avoid breaking the line up.
> I can do this, but it doesn't seem to make a difference.

Thanks, Just for clarity of keeping it all on one line and sine there's
going to be a respin of the set anyways..

> > 
> > > +			}
> > > +
> > > +			do {
> > > +				/*
> > > +				 * Must not at the root node, because we've
> > > +				 * already end the loop when we reach the last
> > > +				 * leaf.
> > > +				 */
> > 
> > I'm not sure what the comment above is trying to say.  Do you mean "This
> > won't reach the root node because the loop will break when the last leaf
> > is hit"?  I don't think that is accurate.. it will hit the root node but
> > not the end of the root node, right?  Anyways, the comment isn't clear
> > so please have a look.
> Yes, it will hit the root node but not the end of the root node. I'll
> fix this comment. Thanks.
> 
> > 
> > > +				mas_ascend(mas);
> > > +				mas_ascend(new_mas);
> > > +			} while (mas->offset == mas_data_end(mas));
> > > +
> > > +			mas->offset++;
> > > +			new_mas->offset++;
> > > +		}
> > > +
> > > +		mas_descend(mas);
> > > +		parent = mte_to_node(new_mas->node);
> > > +		mas_descend(new_mas);
> > > +		mas->offset = 0;
> > > +		new_mas->offset = 0;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * __mt_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one.
> > 
> > Can you make this comment more about what your code does instead of the
> > "one by one" description?
> > 
> > > + * The user needs to ensure that the attributes of the source tree and the new
> > > + * tree are the same, and the new tree needs to be an empty tree, otherwise
> > > + * -EINVAL will be returned.
> > > + * Note that the user needs to manually lock the source tree and the new tree.
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> > > + * the attributes of the two trees are different or the new tree is not an empty
> > > + * tree.
> > > + */
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > +{
> > > +	int ret = 0;
> > > +	MA_STATE(mas, mt, 0, 0);
> > > +	MA_STATE(new_mas, new, 0, 0);
> > > +
> > > +	mas_dup_build(&mas, &new_mas, gfp);
> > > +
> > > +	if (unlikely(mas_is_err(&mas))) {
> > > +		ret = xa_err(mas.node);
> > > +		if (ret == -ENOMEM)
> > > +			mas_dup_free(&new_mas);
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(__mt_dup);
> > > +
> > > +/**
> > > + * mtree_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one.
> > 
> > Again, it's more interesting to state it uses the DFS preorder copy.
> > 
> > It is also worth mentioning the superior allocation behaviour since that
> > is a desirable trait for many.  In fact, you should add the allocation
> > behaviour in your cover letter.
> Okay, I will describe more in the next version.
> 
> > 
> > > + * The user needs to ensure that the attributes of the source tree and the new
> > > + * tree are the same, and the new tree needs to be an empty tree, otherwise
> > > + * -EINVAL will be returned.
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> > > + * the attributes of the two trees are different or the new tree is not an empty
> > > + * tree.
> > > + */
> > > +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > +{
> > > +	int ret = 0;
> > > +	MA_STATE(mas, mt, 0, 0);
> > > +	MA_STATE(new_mas, new, 0, 0);
> > > +
> > > +	mas_lock(&new_mas);
> > > +	mas_lock(&mas);
> > > +
> > > +	mas_dup_build(&mas, &new_mas, gfp);
> > > +	mas_unlock(&mas);
> > > +
> > > +	if (unlikely(mas_is_err(&mas))) {
> > > +		ret = xa_err(mas.node);
> > > +		if (ret == -ENOMEM)
> > > +			mas_dup_free(&new_mas);
> > > +	}
> > > +
> > > +	mas_unlock(&new_mas);
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(mtree_dup);
> > > +
> > >   /**
> > >    * __mt_destroy() - Walk and free all nodes of a locked maple tree.
> > >    * @mt: The maple tree
> > > -- 
> > > 2.20.1
> > >
Peng Zhang Sept. 11, 2023, 12:59 p.m. UTC | #4
在 2023/9/8 04:13, Liam R. Howlett 写道:
> * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
>> Introduce interfaces __mt_dup() and mtree_dup(), which are used to
>> duplicate a maple tree. Compared with traversing the source tree and
>> reinserting entry by entry in the new tree, it has better performance.
>> The difference between __mt_dup() and mtree_dup() is that mtree_dup()
>> handles locks internally.
> 
> __mt_dup() should be called mas_dup() to indicate the advanced interface
> which requires users to handle their own locks.
Changing to the mas_dup() interface may look like this:
mas_dup(mas_old, mas_new)

This still encounters the problem we discussed before. You expect both
mas_old and mas_new to point to the first element after the function
returns, but for_each_vma(vmi, mpnt) in dup_mmap() does not support
this, and will skip the first element.

Unless we have an iterator similar to "do {} while()", we have to reset 
mas_new. There is still additional overhead in making both mas_old and
mas_new point to the first element, because mas will point to the last
node after dfs order traversal.

In fact, I think mtree_dup() and __mt_dup() are enough. They seem to
match mtree_destroy() and __mt_destroy() very well. Underlines indicate
that users need to handle the lock themselves.
> 
>>
>> Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
>> ---
>>   include/linux/maple_tree.h |   3 +
>>   lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
>>   2 files changed, 268 insertions(+)
>>
>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>> index e41c70ac7744..44fe8a57ecbd 100644
>> --- a/include/linux/maple_tree.h
>> +++ b/include/linux/maple_tree.h
>> @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
>>   		void *entry, gfp_t gfp);
>>   void *mtree_erase(struct maple_tree *mt, unsigned long index);
>>   
>> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
>> +
>>   void mtree_destroy(struct maple_tree *mt);
>>   void __mt_destroy(struct maple_tree *mt);
>>   
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index ef234cf02e3e..8f841682269c 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
>>   }
>>   EXPORT_SYMBOL(mtree_erase);
>>   
>> +/*
>> + * mas_dup_free() - Free a half-constructed tree.
> 
> Maybe "Free an incomplete duplication of a tree" ?
> 
>> + * @mas: Points to the last node of the half-constructed tree.
> 
> Your use of "Points to" seems to indicate someone knows you are talking
> about a "maple state that has a node pointing to".  Can this be made
> more clear?
> @mas: The maple state of a incomplete tree.
> 
> Then add a note that @mas->node points to the last successfully
> allocated node?
> 
> Or something along those lines.
> 
>> + *
>> + * This function frees all nodes starting from @mas->node in the reverse order
>> + * of mas_dup_build(). There is no need to hold the source tree lock at this
>> + * time.
>> + */
>> +static void mas_dup_free(struct ma_state *mas)
>> +{
>> +	struct maple_node *node;
>> +	enum maple_type type;
>> +	void __rcu **slots;
>> +	unsigned char count, i;
>> +
>> +	/* Maybe the first node allocation failed. */
>> +	if (!mas->node)
>> +		return;
>> +
>> +	while (!mte_is_root(mas->node)) {
>> +		mas_ascend(mas);
>> +
>> +		if (mas->offset) {
>> +			mas->offset--;
>> +			do {
>> +				mas_descend(mas);
>> +				mas->offset = mas_data_end(mas);
>> +			} while (!mte_is_leaf(mas->node));
> 
> Can you blindly descend and check !mte_is_leaf()?  What happens when the
> tree duplication fails at random internal nodes?  Maybe I missed how
> this cannot happen?
> 
>> +
>> +			mas_ascend(mas);
>> +		}
>> +
>> +		node = mte_to_node(mas->node);
>> +		type = mte_node_type(mas->node);
>> +		slots = (void **)ma_slots(node, type);
>> +		count = mas_data_end(mas) + 1;
>> +		for (i = 0; i < count; i++)
>> +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
>> +
>> +		mt_free_bulk(count, slots);
>> +	}
> 
> 
>> +
>> +	node = mte_to_node(mas->node);
>> +	mt_free_one(node);
>> +}
>> +
>> +/*
>> + * mas_copy_node() - Copy a maple node and allocate child nodes.
> 
> if required. "..and allocate child nodes if required."
> 
>> + * @mas: Points to the source node.
>> + * @new_mas: Points to the new node.
>> + * @parent: The parent node of the new node.
>> + * @gfp: The GFP_FLAGS to use for allocations.
>> + *
>> + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
>> + * @new_mas->node and allocate new child nodes for @new_mas->node.
>> + * If memory allocation fails, @mas is set to -ENOMEM.
>> + */
>> +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
>> +		struct maple_node *parent, gfp_t gfp)
>> +{
>> +	struct maple_node *node = mte_to_node(mas->node);
>> +	struct maple_node *new_node = mte_to_node(new_mas->node);
>> +	enum maple_type type;
>> +	unsigned long val;
>> +	unsigned char request, count, i;
>> +	void __rcu **slots;
>> +	void __rcu **new_slots;
>> +
>> +	/* Copy the node completely. */
>> +	memcpy(new_node, node, sizeof(struct maple_node));
>> +
>> +	/* Update the parent node pointer. */
>> +	if (unlikely(ma_is_root(node)))
>> +		val = MA_ROOT_PARENT;
>> +	else
>> +		val = (unsigned long)node->parent & MAPLE_NODE_MASK;
> 
> If you treat the root as special and outside the loop, then you can
> avoid the check for root for every non-root node.  For root, you just
> need to copy and do this special parent thing before the main loop in
> mas_dup_build().  This will avoid an extra branch for each VMA over 14,
> so that would add up to a lot of instructions.
> 
>> +
>> +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
>> +
>> +	if (mte_is_leaf(mas->node))
>> +		return;
> 
> You are checking here and in mas_dup_build() for the leaf, splitting the
> function into parent assignment and allocate would allow you to check
> once. Copy could be moved to the main loop or with the parent setting,
> depending on how you handle the root suggestion above.
> 
>> +
>> +	/* Allocate memory for child nodes. */
>> +	type = mte_node_type(mas->node);
>> +	new_slots = ma_slots(new_node, type);
>> +	request = mas_data_end(mas) + 1;
>> +	count = mt_alloc_bulk(gfp, request, new_slots);
>> +	if (unlikely(count < request)) {
>> +		if (count)
>> +			mt_free_bulk(count, new_slots);
> 
> The new_slots will still contain the addresses of the freed nodes.
> Don't you need to clear it here to avoid a double free?  Is there a
> test case for this in your testing?  Again, I may have missed how this
> is not possible..
> 
>> +		mas_set_err(mas, -ENOMEM);
>> +		return;
>> +	}
>> +
>> +	/* Restore node type information in slots. */
>> +	slots = ma_slots(node, type);
>> +	for (i = 0; i < count; i++)
>> +		((unsigned long *)new_slots)[i] |=
>> +			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
>> +			MAPLE_NODE_MASK);
> 
> Can you expand this to multiple lines to make it more clear what is
> going on?
> 
>> +}
>> +
>> +/*
>> + * mas_dup_build() - Build a new maple tree from a source tree
>> + * @mas: The maple state of source tree.
>> + * @new_mas: The maple state of new tree.
>> + * @gfp: The GFP_FLAGS to use for allocations.
>> + *
>> + * This function builds a new tree in DFS preorder. If the memory allocation
>> + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
>> + * last node. mas_dup_free() will free the half-constructed tree.
>> + *
>> + * Note that the attributes of the two trees must be exactly the same, and the
>> + * new tree must be empty, otherwise -EINVAL will be returned.
>> + */
>> +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
>> +		gfp_t gfp)
>> +{
>> +	struct maple_node *node, *parent;
> 
> Could parent be struct maple_pnode?
> 
>> +	struct maple_enode *root;
>> +	enum maple_type type;
>> +
>> +	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
>> +	    unlikely(!mtree_empty(new_mas->tree))) {
>> +		mas_set_err(mas, -EINVAL);
>> +		return;
>> +	}
>> +
>> +	mas_start(mas);
>> +	if (mas_is_ptr(mas) || mas_is_none(mas)) {
>> +		/*
>> +		 * The attributes of the two trees must be the same before this.
>> +		 * The following assignment makes them the same height.
>> +		 */
>> +		new_mas->tree->ma_flags = mas->tree->ma_flags;
>> +		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
>> +		return;
>> +	}
>> +
>> +	node = mt_alloc_one(gfp);
>> +	if (!node) {
>> +		new_mas->node = NULL;
> 
> We don't have checks around for node == NULL, MAS_NONE would be a safer
> choice.  It is unlikely that someone would dup the tree and fail then
> call something else, but I avoid setting node to NULL.
> 
>> +		mas_set_err(mas, -ENOMEM);
>> +		return;
>> +	}
>> +
>> +	type = mte_node_type(mas->node);
>> +	root = mt_mk_node(node, type);
>> +	new_mas->node = root;
>> +	new_mas->min = 0;
>> +	new_mas->max = ULONG_MAX;
>> +	parent = ma_mnode_ptr(new_mas->tree);
>> +
>> +	while (1) {
>> +		mas_copy_node(mas, new_mas, parent, gfp);
>> +
>> +		if (unlikely(mas_is_err(mas)))
>> +			return;
>> +
>> +		/* Once we reach a leaf, we need to ascend, or end the loop. */
>> +		if (mte_is_leaf(mas->node)) {
>> +			if (mas->max == ULONG_MAX) {
>> +				new_mas->tree->ma_flags = mas->tree->ma_flags;
>> +				rcu_assign_pointer(new_mas->tree->ma_root,
>> +						   mte_mk_root(root));
>> +				break;
> 
> If you move this to the end of the function, you can replace the same
> block above with a goto.  That will avoid breaking the line up.
> 
>> +			}
>> +
>> +			do {
>> +				/*
>> +				 * Must not at the root node, because we've
>> +				 * already end the loop when we reach the last
>> +				 * leaf.
>> +				 */
> 
> I'm not sure what the comment above is trying to say.  Do you mean "This
> won't reach the root node because the loop will break when the last leaf
> is hit"?  I don't think that is accurate.. it will hit the root node but
> not the end of the root node, right?  Anyways, the comment isn't clear
> so please have a look.
> 
>> +				mas_ascend(mas);
>> +				mas_ascend(new_mas);
>> +			} while (mas->offset == mas_data_end(mas));
>> +
>> +			mas->offset++;
>> +			new_mas->offset++;
>> +		}
>> +
>> +		mas_descend(mas);
>> +		parent = mte_to_node(new_mas->node);
>> +		mas_descend(new_mas);
>> +		mas->offset = 0;
>> +		new_mas->offset = 0;
>> +	}
>> +}
>> +
>> +/**
>> + * __mt_dup(): Duplicate a maple tree
>> + * @mt: The source maple tree
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + *
>> + * This function duplicates a maple tree using a faster method than traversing
>> + * the source tree and inserting entries into the new tree one by one.
> 
> Can you make this comment more about what your code does instead of the
> "one by one" description?
> 
>> + * The user needs to ensure that the attributes of the source tree and the new
>> + * tree are the same, and the new tree needs to be an empty tree, otherwise
>> + * -EINVAL will be returned.
>> + * Note that the user needs to manually lock the source tree and the new tree.
>> + *
>> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
>> + * the attributes of the two trees are different or the new tree is not an empty
>> + * tree.
>> + */
>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>> +{
>> +	int ret = 0;
>> +	MA_STATE(mas, mt, 0, 0);
>> +	MA_STATE(new_mas, new, 0, 0);
>> +
>> +	mas_dup_build(&mas, &new_mas, gfp);
>> +
>> +	if (unlikely(mas_is_err(&mas))) {
>> +		ret = xa_err(mas.node);
>> +		if (ret == -ENOMEM)
>> +			mas_dup_free(&new_mas);
>> +	}
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(__mt_dup);
>> +
>> +/**
>> + * mtree_dup(): Duplicate a maple tree
>> + * @mt: The source maple tree
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + *
>> + * This function duplicates a maple tree using a faster method than traversing
>> + * the source tree and inserting entries into the new tree one by one.
> 
> Again, it's more interesting to state it uses the DFS preorder copy.
> 
> It is also worth mentioning the superior allocation behaviour since that
> is a desirable trait for many.  In fact, you should add the allocation
> behaviour in your cover letter.
> 
>> + * The user needs to ensure that the attributes of the source tree and the new
>> + * tree are the same, and the new tree needs to be an empty tree, otherwise
>> + * -EINVAL will be returned.
>> + *
>> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
>> + * the attributes of the two trees are different or the new tree is not an empty
>> + * tree.
>> + */
>> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>> +{
>> +	int ret = 0;
>> +	MA_STATE(mas, mt, 0, 0);
>> +	MA_STATE(new_mas, new, 0, 0);
>> +
>> +	mas_lock(&new_mas);
>> +	mas_lock(&mas);
>> +
>> +	mas_dup_build(&mas, &new_mas, gfp);
>> +	mas_unlock(&mas);
>> +
>> +	if (unlikely(mas_is_err(&mas))) {
>> +		ret = xa_err(mas.node);
>> +		if (ret == -ENOMEM)
>> +			mas_dup_free(&new_mas);
>> +	}
>> +
>> +	mas_unlock(&new_mas);
>> +
>> +	return ret;
>> +}
>> +EXPORT_SYMBOL(mtree_dup);
>> +
>>   /**
>>    * __mt_destroy() - Walk and free all nodes of a locked maple tree.
>>    * @mt: The maple tree
>> -- 
>> 2.20.1
>>
Liam R. Howlett Sept. 11, 2023, 1:36 p.m. UTC | #5
* Peng Zhang <zhangpeng.00@bytedance.com> [230911 08:59]:
> 
> 
> 在 2023/9/8 04:13, Liam R. Howlett 写道:
> > * Peng Zhang <zhangpeng.00@bytedance.com> [230830 08:57]:
> > > Introduce interfaces __mt_dup() and mtree_dup(), which are used to
> > > duplicate a maple tree. Compared with traversing the source tree and
> > > reinserting entry by entry in the new tree, it has better performance.
> > > The difference between __mt_dup() and mtree_dup() is that mtree_dup()
> > > handles locks internally.
> > 
> > __mt_dup() should be called mas_dup() to indicate the advanced interface
> > which requires users to handle their own locks.
> Changing to the mas_dup() interface may look like this:
> mas_dup(mas_old, mas_new)
> 
> This still encounters the problem we discussed before. You expect both
> mas_old and mas_new to point to the first element after the function
> returns, but for_each_vma(vmi, mpnt) in dup_mmap() does not support
> this, and will skip the first element.
> 
> Unless we have an iterator similar to "do {} while()", we have to reset
> mas_new. There is still additional overhead in making both mas_old and
> mas_new point to the first element, because mas will point to the last
> node after dfs order traversal.

I was only looking for the name change.  Although, I think we could have
written in a way to avoid skipping the first element.

> 
> In fact, I think mtree_dup() and __mt_dup() are enough. They seem to
> match mtree_destroy() and __mt_destroy() very well. Underlines indicate
> that users need to handle the lock themselves.

I think you are correct, __mt_dup() doesn't take a maple state.  Thanks
for pointing that out.  Please leave it the way you have it.

> > 
> > > 
> > > Signed-off-by: Peng Zhang <zhangpeng.00@bytedance.com>
> > > ---
> > >   include/linux/maple_tree.h |   3 +
> > >   lib/maple_tree.c           | 265 +++++++++++++++++++++++++++++++++++++
> > >   2 files changed, 268 insertions(+)
> > > 
> > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > > index e41c70ac7744..44fe8a57ecbd 100644
> > > --- a/include/linux/maple_tree.h
> > > +++ b/include/linux/maple_tree.h
> > > @@ -327,6 +327,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
> > >   		void *entry, gfp_t gfp);
> > >   void *mtree_erase(struct maple_tree *mt, unsigned long index);
> > > +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> > > +
> > >   void mtree_destroy(struct maple_tree *mt);
> > >   void __mt_destroy(struct maple_tree *mt);
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index ef234cf02e3e..8f841682269c 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -6370,6 +6370,271 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
> > >   }
> > >   EXPORT_SYMBOL(mtree_erase);
> > > +/*
> > > + * mas_dup_free() - Free a half-constructed tree.
> > 
> > Maybe "Free an incomplete duplication of a tree" ?
> > 
> > > + * @mas: Points to the last node of the half-constructed tree.
> > 
> > Your use of "Points to" seems to indicate someone knows you are talking
> > about a "maple state that has a node pointing to".  Can this be made
> > more clear?
> > @mas: The maple state of a incomplete tree.
> > 
> > Then add a note that @mas->node points to the last successfully
> > allocated node?
> > 
> > Or something along those lines.
> > 
> > > + *
> > > + * This function frees all nodes starting from @mas->node in the reverse order
> > > + * of mas_dup_build(). There is no need to hold the source tree lock at this
> > > + * time.
> > > + */
> > > +static void mas_dup_free(struct ma_state *mas)
> > > +{
> > > +	struct maple_node *node;
> > > +	enum maple_type type;
> > > +	void __rcu **slots;
> > > +	unsigned char count, i;
> > > +
> > > +	/* Maybe the first node allocation failed. */
> > > +	if (!mas->node)
> > > +		return;
> > > +
> > > +	while (!mte_is_root(mas->node)) {
> > > +		mas_ascend(mas);
> > > +
> > > +		if (mas->offset) {
> > > +			mas->offset--;
> > > +			do {
> > > +				mas_descend(mas);
> > > +				mas->offset = mas_data_end(mas);
> > > +			} while (!mte_is_leaf(mas->node));
> > 
> > Can you blindly descend and check !mte_is_leaf()?  What happens when the
> > tree duplication fails at random internal nodes?  Maybe I missed how
> > this cannot happen?
> > 
> > > +
> > > +			mas_ascend(mas);
> > > +		}
> > > +
> > > +		node = mte_to_node(mas->node);
> > > +		type = mte_node_type(mas->node);
> > > +		slots = (void **)ma_slots(node, type);
> > > +		count = mas_data_end(mas) + 1;
> > > +		for (i = 0; i < count; i++)
> > > +			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
> > > +
> > > +		mt_free_bulk(count, slots);
> > > +	}
> > 
> > 
> > > +
> > > +	node = mte_to_node(mas->node);
> > > +	mt_free_one(node);
> > > +}
> > > +
> > > +/*
> > > + * mas_copy_node() - Copy a maple node and allocate child nodes.
> > 
> > if required. "..and allocate child nodes if required."
> > 
> > > + * @mas: Points to the source node.
> > > + * @new_mas: Points to the new node.
> > > + * @parent: The parent node of the new node.
> > > + * @gfp: The GFP_FLAGS to use for allocations.
> > > + *
> > > + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
> > > + * @new_mas->node and allocate new child nodes for @new_mas->node.
> > > + * If memory allocation fails, @mas is set to -ENOMEM.
> > > + */
> > > +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
> > > +		struct maple_node *parent, gfp_t gfp)
> > > +{
> > > +	struct maple_node *node = mte_to_node(mas->node);
> > > +	struct maple_node *new_node = mte_to_node(new_mas->node);
> > > +	enum maple_type type;
> > > +	unsigned long val;
> > > +	unsigned char request, count, i;
> > > +	void __rcu **slots;
> > > +	void __rcu **new_slots;
> > > +
> > > +	/* Copy the node completely. */
> > > +	memcpy(new_node, node, sizeof(struct maple_node));
> > > +
> > > +	/* Update the parent node pointer. */
> > > +	if (unlikely(ma_is_root(node)))
> > > +		val = MA_ROOT_PARENT;
> > > +	else
> > > +		val = (unsigned long)node->parent & MAPLE_NODE_MASK;
> > 
> > If you treat the root as special and outside the loop, then you can
> > avoid the check for root for every non-root node.  For root, you just
> > need to copy and do this special parent thing before the main loop in
> > mas_dup_build().  This will avoid an extra branch for each VMA over 14,
> > so that would add up to a lot of instructions.
> > 
> > > +
> > > +	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
> > > +
> > > +	if (mte_is_leaf(mas->node))
> > > +		return;
> > 
> > You are checking here and in mas_dup_build() for the leaf, splitting the
> > function into parent assignment and allocate would allow you to check
> > once. Copy could be moved to the main loop or with the parent setting,
> > depending on how you handle the root suggestion above.
> > 
> > > +
> > > +	/* Allocate memory for child nodes. */
> > > +	type = mte_node_type(mas->node);
> > > +	new_slots = ma_slots(new_node, type);
> > > +	request = mas_data_end(mas) + 1;
> > > +	count = mt_alloc_bulk(gfp, request, new_slots);
> > > +	if (unlikely(count < request)) {
> > > +		if (count)
> > > +			mt_free_bulk(count, new_slots);
> > 
> > The new_slots will still contain the addresses of the freed nodes.
> > Don't you need to clear it here to avoid a double free?  Is there a
> > test case for this in your testing?  Again, I may have missed how this
> > is not possible..
> > 
> > > +		mas_set_err(mas, -ENOMEM);
> > > +		return;
> > > +	}
> > > +
> > > +	/* Restore node type information in slots. */
> > > +	slots = ma_slots(node, type);
> > > +	for (i = 0; i < count; i++)
> > > +		((unsigned long *)new_slots)[i] |=
> > > +			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
> > > +			MAPLE_NODE_MASK);
> > 
> > Can you expand this to multiple lines to make it more clear what is
> > going on?
> > 
> > > +}
> > > +
> > > +/*
> > > + * mas_dup_build() - Build a new maple tree from a source tree
> > > + * @mas: The maple state of source tree.
> > > + * @new_mas: The maple state of new tree.
> > > + * @gfp: The GFP_FLAGS to use for allocations.
> > > + *
> > > + * This function builds a new tree in DFS preorder. If the memory allocation
> > > + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
> > > + * last node. mas_dup_free() will free the half-constructed tree.
> > > + *
> > > + * Note that the attributes of the two trees must be exactly the same, and the
> > > + * new tree must be empty, otherwise -EINVAL will be returned.
> > > + */
> > > +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
> > > +		gfp_t gfp)
> > > +{
> > > +	struct maple_node *node, *parent;
> > 
> > Could parent be struct maple_pnode?
> > 
> > > +	struct maple_enode *root;
> > > +	enum maple_type type;
> > > +
> > > +	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
> > > +	    unlikely(!mtree_empty(new_mas->tree))) {
> > > +		mas_set_err(mas, -EINVAL);
> > > +		return;
> > > +	}
> > > +
> > > +	mas_start(mas);
> > > +	if (mas_is_ptr(mas) || mas_is_none(mas)) {
> > > +		/*
> > > +		 * The attributes of the two trees must be the same before this.
> > > +		 * The following assignment makes them the same height.
> > > +		 */
> > > +		new_mas->tree->ma_flags = mas->tree->ma_flags;
> > > +		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
> > > +		return;
> > > +	}
> > > +
> > > +	node = mt_alloc_one(gfp);
> > > +	if (!node) {
> > > +		new_mas->node = NULL;
> > 
> > We don't have checks around for node == NULL, MAS_NONE would be a safer
> > choice.  It is unlikely that someone would dup the tree and fail then
> > call something else, but I avoid setting node to NULL.
> > 
> > > +		mas_set_err(mas, -ENOMEM);
> > > +		return;
> > > +	}
> > > +
> > > +	type = mte_node_type(mas->node);
> > > +	root = mt_mk_node(node, type);
> > > +	new_mas->node = root;
> > > +	new_mas->min = 0;
> > > +	new_mas->max = ULONG_MAX;
> > > +	parent = ma_mnode_ptr(new_mas->tree);
> > > +
> > > +	while (1) {
> > > +		mas_copy_node(mas, new_mas, parent, gfp);
> > > +
> > > +		if (unlikely(mas_is_err(mas)))
> > > +			return;
> > > +
> > > +		/* Once we reach a leaf, we need to ascend, or end the loop. */
> > > +		if (mte_is_leaf(mas->node)) {
> > > +			if (mas->max == ULONG_MAX) {
> > > +				new_mas->tree->ma_flags = mas->tree->ma_flags;
> > > +				rcu_assign_pointer(new_mas->tree->ma_root,
> > > +						   mte_mk_root(root));
> > > +				break;
> > 
> > If you move this to the end of the function, you can replace the same
> > block above with a goto.  That will avoid breaking the line up.
> > 
> > > +			}
> > > +
> > > +			do {
> > > +				/*
> > > +				 * Must not at the root node, because we've
> > > +				 * already end the loop when we reach the last
> > > +				 * leaf.
> > > +				 */
> > 
> > I'm not sure what the comment above is trying to say.  Do you mean "This
> > won't reach the root node because the loop will break when the last leaf
> > is hit"?  I don't think that is accurate.. it will hit the root node but
> > not the end of the root node, right?  Anyways, the comment isn't clear
> > so please have a look.
> > 
> > > +				mas_ascend(mas);
> > > +				mas_ascend(new_mas);
> > > +			} while (mas->offset == mas_data_end(mas));
> > > +
> > > +			mas->offset++;
> > > +			new_mas->offset++;
> > > +		}
> > > +
> > > +		mas_descend(mas);
> > > +		parent = mte_to_node(new_mas->node);
> > > +		mas_descend(new_mas);
> > > +		mas->offset = 0;
> > > +		new_mas->offset = 0;
> > > +	}
> > > +}
> > > +
> > > +/**
> > > + * __mt_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one.
> > 
> > Can you make this comment more about what your code does instead of the
> > "one by one" description?
> > 
> > > + * The user needs to ensure that the attributes of the source tree and the new
> > > + * tree are the same, and the new tree needs to be an empty tree, otherwise
> > > + * -EINVAL will be returned.
> > > + * Note that the user needs to manually lock the source tree and the new tree.
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> > > + * the attributes of the two trees are different or the new tree is not an empty
> > > + * tree.
> > > + */
> > > +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > +{
> > > +	int ret = 0;
> > > +	MA_STATE(mas, mt, 0, 0);
> > > +	MA_STATE(new_mas, new, 0, 0);
> > > +
> > > +	mas_dup_build(&mas, &new_mas, gfp);
> > > +
> > > +	if (unlikely(mas_is_err(&mas))) {
> > > +		ret = xa_err(mas.node);
> > > +		if (ret == -ENOMEM)
> > > +			mas_dup_free(&new_mas);
> > > +	}
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(__mt_dup);
> > > +
> > > +/**
> > > + * mtree_dup(): Duplicate a maple tree
> > > + * @mt: The source maple tree
> > > + * @new: The new maple tree
> > > + * @gfp: The GFP_FLAGS to use for allocations
> > > + *
> > > + * This function duplicates a maple tree using a faster method than traversing
> > > + * the source tree and inserting entries into the new tree one by one.
> > 
> > Again, it's more interesting to state it uses the DFS preorder copy.
> > 
> > It is also worth mentioning the superior allocation behaviour since that
> > is a desirable trait for many.  In fact, you should add the allocation
> > behaviour in your cover letter.
> > 
> > > + * The user needs to ensure that the attributes of the source tree and the new
> > > + * tree are the same, and the new tree needs to be an empty tree, otherwise
> > > + * -EINVAL will be returned.
> > > + *
> > > + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
> > > + * the attributes of the two trees are different or the new tree is not an empty
> > > + * tree.
> > > + */
> > > +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> > > +{
> > > +	int ret = 0;
> > > +	MA_STATE(mas, mt, 0, 0);
> > > +	MA_STATE(new_mas, new, 0, 0);
> > > +
> > > +	mas_lock(&new_mas);
> > > +	mas_lock(&mas);
> > > +
> > > +	mas_dup_build(&mas, &new_mas, gfp);
> > > +	mas_unlock(&mas);
> > > +
> > > +	if (unlikely(mas_is_err(&mas))) {
> > > +		ret = xa_err(mas.node);
> > > +		if (ret == -ENOMEM)
> > > +			mas_dup_free(&new_mas);
> > > +	}
> > > +
> > > +	mas_unlock(&new_mas);
> > > +
> > > +	return ret;
> > > +}
> > > +EXPORT_SYMBOL(mtree_dup);
> > > +
> > >   /**
> > >    * __mt_destroy() - Walk and free all nodes of a locked maple tree.
> > >    * @mt: The maple tree
> > > -- 
> > > 2.20.1
> > >
diff mbox series

Patch

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index e41c70ac7744..44fe8a57ecbd 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -327,6 +327,9 @@  int mtree_store(struct maple_tree *mt, unsigned long index,
 		void *entry, gfp_t gfp);
 void *mtree_erase(struct maple_tree *mt, unsigned long index);
 
+int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
+int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
+
 void mtree_destroy(struct maple_tree *mt);
 void __mt_destroy(struct maple_tree *mt);
 
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index ef234cf02e3e..8f841682269c 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -6370,6 +6370,271 @@  void *mtree_erase(struct maple_tree *mt, unsigned long index)
 }
 EXPORT_SYMBOL(mtree_erase);
 
+/*
+ * mas_dup_free() - Free a half-constructed tree.
+ * @mas: Points to the last node of the half-constructed tree.
+ *
+ * This function frees all nodes starting from @mas->node in the reverse order
+ * of mas_dup_build(). There is no need to hold the source tree lock at this
+ * time.
+ */
+static void mas_dup_free(struct ma_state *mas)
+{
+	struct maple_node *node;
+	enum maple_type type;
+	void __rcu **slots;
+	unsigned char count, i;
+
+	/* Maybe the first node allocation failed. */
+	if (!mas->node)
+		return;
+
+	while (!mte_is_root(mas->node)) {
+		mas_ascend(mas);
+
+		if (mas->offset) {
+			mas->offset--;
+			do {
+				mas_descend(mas);
+				mas->offset = mas_data_end(mas);
+			} while (!mte_is_leaf(mas->node));
+
+			mas_ascend(mas);
+		}
+
+		node = mte_to_node(mas->node);
+		type = mte_node_type(mas->node);
+		slots = (void **)ma_slots(node, type);
+		count = mas_data_end(mas) + 1;
+		for (i = 0; i < count; i++)
+			((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
+
+		mt_free_bulk(count, slots);
+	}
+
+	node = mte_to_node(mas->node);
+	mt_free_one(node);
+}
+
+/*
+ * mas_copy_node() - Copy a maple node and allocate child nodes.
+ * @mas: Points to the source node.
+ * @new_mas: Points to the new node.
+ * @parent: The parent node of the new node.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * Copy @mas->node to @new_mas->node, set @parent to be the parent of
+ * @new_mas->node and allocate new child nodes for @new_mas->node.
+ * If memory allocation fails, @mas is set to -ENOMEM.
+ */
+static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
+		struct maple_node *parent, gfp_t gfp)
+{
+	struct maple_node *node = mte_to_node(mas->node);
+	struct maple_node *new_node = mte_to_node(new_mas->node);
+	enum maple_type type;
+	unsigned long val;
+	unsigned char request, count, i;
+	void __rcu **slots;
+	void __rcu **new_slots;
+
+	/* Copy the node completely. */
+	memcpy(new_node, node, sizeof(struct maple_node));
+
+	/* Update the parent node pointer. */
+	if (unlikely(ma_is_root(node)))
+		val = MA_ROOT_PARENT;
+	else
+		val = (unsigned long)node->parent & MAPLE_NODE_MASK;
+
+	new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
+
+	if (mte_is_leaf(mas->node))
+		return;
+
+	/* Allocate memory for child nodes. */
+	type = mte_node_type(mas->node);
+	new_slots = ma_slots(new_node, type);
+	request = mas_data_end(mas) + 1;
+	count = mt_alloc_bulk(gfp, request, new_slots);
+	if (unlikely(count < request)) {
+		if (count)
+			mt_free_bulk(count, new_slots);
+		mas_set_err(mas, -ENOMEM);
+		return;
+	}
+
+	/* Restore node type information in slots. */
+	slots = ma_slots(node, type);
+	for (i = 0; i < count; i++)
+		((unsigned long *)new_slots)[i] |=
+			((unsigned long)mt_slot_locked(mas->tree, slots, i) &
+			MAPLE_NODE_MASK);
+}
+
+/*
+ * mas_dup_build() - Build a new maple tree from a source tree
+ * @mas: The maple state of source tree.
+ * @new_mas: The maple state of new tree.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * This function builds a new tree in DFS preorder. If the memory allocation
+ * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
+ * last node. mas_dup_free() will free the half-constructed tree.
+ *
+ * Note that the attributes of the two trees must be exactly the same, and the
+ * new tree must be empty, otherwise -EINVAL will be returned.
+ */
+static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
+		gfp_t gfp)
+{
+	struct maple_node *node, *parent;
+	struct maple_enode *root;
+	enum maple_type type;
+
+	if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
+	    unlikely(!mtree_empty(new_mas->tree))) {
+		mas_set_err(mas, -EINVAL);
+		return;
+	}
+
+	mas_start(mas);
+	if (mas_is_ptr(mas) || mas_is_none(mas)) {
+		/*
+		 * The attributes of the two trees must be the same before this.
+		 * The following assignment makes them the same height.
+		 */
+		new_mas->tree->ma_flags = mas->tree->ma_flags;
+		rcu_assign_pointer(new_mas->tree->ma_root, mas->tree->ma_root);
+		return;
+	}
+
+	node = mt_alloc_one(gfp);
+	if (!node) {
+		new_mas->node = NULL;
+		mas_set_err(mas, -ENOMEM);
+		return;
+	}
+
+	type = mte_node_type(mas->node);
+	root = mt_mk_node(node, type);
+	new_mas->node = root;
+	new_mas->min = 0;
+	new_mas->max = ULONG_MAX;
+	parent = ma_mnode_ptr(new_mas->tree);
+
+	while (1) {
+		mas_copy_node(mas, new_mas, parent, gfp);
+
+		if (unlikely(mas_is_err(mas)))
+			return;
+
+		/* Once we reach a leaf, we need to ascend, or end the loop. */
+		if (mte_is_leaf(mas->node)) {
+			if (mas->max == ULONG_MAX) {
+				new_mas->tree->ma_flags = mas->tree->ma_flags;
+				rcu_assign_pointer(new_mas->tree->ma_root,
+						   mte_mk_root(root));
+				break;
+			}
+
+			do {
+				/*
+				 * Must not at the root node, because we've
+				 * already end the loop when we reach the last
+				 * leaf.
+				 */
+				mas_ascend(mas);
+				mas_ascend(new_mas);
+			} while (mas->offset == mas_data_end(mas));
+
+			mas->offset++;
+			new_mas->offset++;
+		}
+
+		mas_descend(mas);
+		parent = mte_to_node(new_mas->node);
+		mas_descend(new_mas);
+		mas->offset = 0;
+		new_mas->offset = 0;
+	}
+}
+
+/**
+ * __mt_dup(): Duplicate a maple tree
+ * @mt: The source maple tree
+ * @new: The new maple tree
+ * @gfp: The GFP_FLAGS to use for allocations
+ *
+ * This function duplicates a maple tree using a faster method than traversing
+ * the source tree and inserting entries into the new tree one by one.
+ * The user needs to ensure that the attributes of the source tree and the new
+ * tree are the same, and the new tree needs to be an empty tree, otherwise
+ * -EINVAL will be returned.
+ * Note that the user needs to manually lock the source tree and the new tree.
+ *
+ * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
+ * the attributes of the two trees are different or the new tree is not an empty
+ * tree.
+ */
+int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
+{
+	int ret = 0;
+	MA_STATE(mas, mt, 0, 0);
+	MA_STATE(new_mas, new, 0, 0);
+
+	mas_dup_build(&mas, &new_mas, gfp);
+
+	if (unlikely(mas_is_err(&mas))) {
+		ret = xa_err(mas.node);
+		if (ret == -ENOMEM)
+			mas_dup_free(&new_mas);
+	}
+
+	return ret;
+}
+EXPORT_SYMBOL(__mt_dup);
+
+/**
+ * mtree_dup(): Duplicate a maple tree
+ * @mt: The source maple tree
+ * @new: The new maple tree
+ * @gfp: The GFP_FLAGS to use for allocations
+ *
+ * This function duplicates a maple tree using a faster method than traversing
+ * the source tree and inserting entries into the new tree one by one.
+ * The user needs to ensure that the attributes of the source tree and the new
+ * tree are the same, and the new tree needs to be an empty tree, otherwise
+ * -EINVAL will be returned.
+ *
+ * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL If
+ * the attributes of the two trees are different or the new tree is not an empty
+ * tree.
+ */
+int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
+{
+	int ret = 0;
+	MA_STATE(mas, mt, 0, 0);
+	MA_STATE(new_mas, new, 0, 0);
+
+	mas_lock(&new_mas);
+	mas_lock(&mas);
+
+	mas_dup_build(&mas, &new_mas, gfp);
+	mas_unlock(&mas);
+
+	if (unlikely(mas_is_err(&mas))) {
+		ret = xa_err(mas.node);
+		if (ret == -ENOMEM)
+			mas_dup_free(&new_mas);
+	}
+
+	mas_unlock(&new_mas);
+
+	return ret;
+}
+EXPORT_SYMBOL(mtree_dup);
+
 /**
  * __mt_destroy() - Walk and free all nodes of a locked maple tree.
  * @mt: The maple tree