diff mbox series

[v2,hotfix,6.12,1/2] maple_tree: correct tree corruption on spanning store

Message ID 964860c315780cf3a5b8929a34b68d8208333c16.1728223996.git.lorenzo.stoakes@oracle.com (mailing list archive)
State New
Headers show
Series maple_tree: correct tree corruption on spanning store | expand

Commit Message

Lorenzo Stoakes Oct. 6, 2024, 2:31 p.m. UTC
There has been a subtle bug present in the maple tree implementation from
its inception.

This arises from how stores are performed - when a store occurs, it will
overwrite overlapping ranges and adjust the tree as necessary to
accommodate this.

A range may always ultimately span two leaf nodes. In this instance we walk
the two leaf nodes, determine which elements are not overwritten to the
left and to the right of the start and end of the ranges respectively and
then rebalance the tree to contain these entries and the newly inserted
one.

This kind of store is dubbed a 'spanning store' and is implemented by
mas_wr_spanning_store().

In order to reach this stage, mas_store_gfp() invokes mas_wr_preallocate(),
mas_wr_store_type() and mas_wr_walk() in turn to walk the tree and update
the object (mas) to traverse to the location where the write should be
performed, determining its store type.

When a spanning store is required, this function returns false stopping at
the parent node which contains the target range, and mas_wr_store_type()
marks the mas->store_type as wr_spanning_store to denote this fact.

When we go to perform the store in mas_wr_spanning_store(), we first
determine the elements AFTER the END of the range we wish to store (that
is, to the right of the entry to be inserted) - we do this by walking to
the NEXT pivot in the tree (i.e. r_mas.last + 1), starting at the node we
have just determined contains the range over which we intend to write.

We then turn our attention to the entries to the left of the entry we are
inserting, whose state is represented by l_mas, and copy these into a 'big
node', which is a special node which contains enough slots to contain two
leaf node's worth of data.

We then copy the entry we wish to store immediately after this - the copy
and the insertion of the new entry is performed by mas_store_b_node().

After this we copy the elements to the right of the end of the range which
we are inserting, if we have not exceeded the length of the node
(i.e. r_mas.offset <= r_mas.end).

Herein lies the bug - under very specific circumstances, this logic can
break and corrupt the maple tree.

Consider the following tree:

Height
  0                             Root Node
                                 /      \
                 pivot = 0xffff /        \ pivot = ULONG_MAX
                               /          \
  1                       A [-----]       ...
                             /   \
             pivot = 0x4fff /     \ pivot = 0xffff
                           /       \
  2 (LEAVES)          B [-----]  [-----] C
                                      ^--- Last pivot 0xffff.

Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note
that all ranges expressed in maple tree code are inclusive):

1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then
   determines that this is a spanning store across nodes B and C. The mas
   state is set such that the current node from which we traverse further
   is node A.

2. In mas_wr_spanning_store() we try to find elements to the right of pivot
   0xffff by searching for an index of 0x10000:

    - mas_wr_walk_index() invokes mas_wr_walk_descend() and
      mas_wr_node_walk() in turn.

        - mas_wr_node_walk() loops over entries in node A until EITHER it
          finds an entry whose pivot equals or exceeds 0x10000 OR it
          reaches the final entry.

        - Since no entry has a pivot equal to or exceeding 0x10000, pivot
          0xffff is selected, leading to node C.

    - mas_wr_walk_traverse() resets the mas state to traverse node C. We
      loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk()
      in turn once again.

         - Again, we reach the last entry in node C, which has a pivot of
           0xffff.

3. We then copy the elements to the left of 0x4000 in node B to the big
   node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry
   too.

4. We determine whether we have any entries to copy from the right of the
   end of the range via - and with r_mas set up at the entry at pivot
   0xffff, r_mas.offset <= r_mas.end, and then we DUPLICATE the entry at
   pivot 0xffff.

5. BUG! The maple tree is corrupted with a duplicate entry.

This requires a very specific set of circumstances - we must be spanning
the last element in a leaf node, which is the last element in the parent
node.

spanning store across two leaf nodes with a range that ends at that shared
pivot.

A potential solution to this problem would simply be to reset the walk each
time we traverse r_mas, however given the rarity of this situation it seems
that would be rather inefficient.

Instead, this patch detects if the right hand node is populated, i.e. has
anything we need to copy. We can do this easily in mas_wr_walk_index() by
detecting if the pivot is either 0 (shorthand for the end of the range) or
the required index is less than or equal to the last encountered pivot.

This change is made in mas_wr_walk_index() which is only used by the
spanning store so it has minimal impact.

The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree
in mmap_region()") seems to have made the probability of this event much
more likely, which is the point at which reports started to be submitted
concerning this bug.

The motivation for this change arose from Bert Karwatzki's report of
encountering mm instability after the release of kernel v6.12-rc1 which,
after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
options, was identified as maple tree corruption.

After Bert very generously provided his time and ability to reproduce this
event consistently, I was able to finally identify that the issue discussed
in this commit message was occurring for him.

Reported-and-tested-by: Bert Karwatzki <spasswolf@web.de>
Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/
Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
---
 lib/maple_tree.c | 20 ++++++++++++++++----
 1 file changed, 16 insertions(+), 4 deletions(-)

--
2.46.2

Comments

Vlastimil Babka Oct. 7, 2024, 12:14 p.m. UTC | #1
On 10/6/24 4:31 PM, Lorenzo Stoakes wrote:
> There has been a subtle bug present in the maple tree implementation from
> its inception.
> 
> This arises from how stores are performed - when a store occurs, it will
> overwrite overlapping ranges and adjust the tree as necessary to
> accommodate this.
> 
> A range may always ultimately span two leaf nodes. In this instance we walk
> the two leaf nodes, determine which elements are not overwritten to the
> left and to the right of the start and end of the ranges respectively and
> then rebalance the tree to contain these entries and the newly inserted
> one.
> 
> This kind of store is dubbed a 'spanning store' and is implemented by
> mas_wr_spanning_store().
> 
> In order to reach this stage, mas_store_gfp() invokes mas_wr_preallocate(),
> mas_wr_store_type() and mas_wr_walk() in turn to walk the tree and update
> the object (mas) to traverse to the location where the write should be
> performed, determining its store type.
> 
> When a spanning store is required, this function returns false stopping at
> the parent node which contains the target range, and mas_wr_store_type()
> marks the mas->store_type as wr_spanning_store to denote this fact.
> 
> When we go to perform the store in mas_wr_spanning_store(), we first
> determine the elements AFTER the END of the range we wish to store (that
> is, to the right of the entry to be inserted) - we do this by walking to
> the NEXT pivot in the tree (i.e. r_mas.last + 1), starting at the node we
> have just determined contains the range over which we intend to write.
> 
> We then turn our attention to the entries to the left of the entry we are
> inserting, whose state is represented by l_mas, and copy these into a 'big
> node', which is a special node which contains enough slots to contain two
> leaf node's worth of data.
> 
> We then copy the entry we wish to store immediately after this - the copy
> and the insertion of the new entry is performed by mas_store_b_node().
> 
> After this we copy the elements to the right of the end of the range which
> we are inserting, if we have not exceeded the length of the node
> (i.e. r_mas.offset <= r_mas.end).
> 
> Herein lies the bug - under very specific circumstances, this logic can
> break and corrupt the maple tree.
> 
> Consider the following tree:
> 
> Height
>   0                             Root Node
>                                  /      \
>                  pivot = 0xffff /        \ pivot = ULONG_MAX
>                                /          \
>   1                       A [-----]       ...
>                              /   \
>              pivot = 0x4fff /     \ pivot = 0xffff
>                            /       \
>   2 (LEAVES)          B [-----]  [-----] C
>                                       ^--- Last pivot 0xffff.
> 
> Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note
> that all ranges expressed in maple tree code are inclusive):
> 
> 1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then
>    determines that this is a spanning store across nodes B and C. The mas
>    state is set such that the current node from which we traverse further
>    is node A.
> 
> 2. In mas_wr_spanning_store() we try to find elements to the right of pivot
>    0xffff by searching for an index of 0x10000:
> 
>     - mas_wr_walk_index() invokes mas_wr_walk_descend() and
>       mas_wr_node_walk() in turn.
> 
>         - mas_wr_node_walk() loops over entries in node A until EITHER it
>           finds an entry whose pivot equals or exceeds 0x10000 OR it
>           reaches the final entry.
> 
>         - Since no entry has a pivot equal to or exceeding 0x10000, pivot
>           0xffff is selected, leading to node C.
> 
>     - mas_wr_walk_traverse() resets the mas state to traverse node C. We
>       loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk()
>       in turn once again.
> 
>          - Again, we reach the last entry in node C, which has a pivot of
>            0xffff.
> 
> 3. We then copy the elements to the left of 0x4000 in node B to the big
>    node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry
>    too.
> 
> 4. We determine whether we have any entries to copy from the right of the
>    end of the range via - and with r_mas set up at the entry at pivot
>    0xffff, r_mas.offset <= r_mas.end, and then we DUPLICATE the entry at
>    pivot 0xffff.
> 
> 5. BUG! The maple tree is corrupted with a duplicate entry.
> 
> This requires a very specific set of circumstances - we must be spanning
> the last element in a leaf node, which is the last element in the parent
> node.
> 
> spanning store across two leaf nodes with a range that ends at that shared
> pivot.

This looks like missing something to form a complete sentence?

> A potential solution to this problem would simply be to reset the walk each
> time we traverse r_mas, however given the rarity of this situation it seems
> that would be rather inefficient.
> 
> Instead, this patch detects if the right hand node is populated, i.e. has
> anything we need to copy. We can do this easily in mas_wr_walk_index() by
> detecting if the pivot is either 0 (shorthand for the end of the range) or
> the required index is less than or equal to the last encountered pivot.
> 
> This change is made in mas_wr_walk_index() which is only used by the
> spanning store so it has minimal impact.
> 
> The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree
> in mmap_region()") seems to have made the probability of this event much
> more likely, which is the point at which reports started to be submitted
> concerning this bug.
> 
> The motivation for this change arose from Bert Karwatzki's report of
> encountering mm instability after the release of kernel v6.12-rc1 which,
> after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
> options, was identified as maple tree corruption.
> 
> After Bert very generously provided his time and ability to reproduce this
> event consistently, I was able to finally identify that the issue discussed
> in this commit message was occurring for him.
> 
> Reported-and-tested-by: Bert Karwatzki <spasswolf@web.de>
> Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/
> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/
> Fixes: 54a611b60590 ("Maple Tree: add new data structure")
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

> ---
>  lib/maple_tree.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 20990ecba2dd..f72e1a5a4dfa 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -2196,6 +2196,8 @@ static inline void mas_node_or_none(struct ma_state *mas,
> 
>  /*
>   * mas_wr_node_walk() - Find the correct offset for the index in the @mas.
> + *                      If @mas->index cannot be found within the containing
> + *                      node, we traverse to the last entry in the node.
>   * @wr_mas: The maple write state
>   *
>   * Uses mas_slot_locked() and does not need to worry about dead nodes.
> @@ -3532,6 +3534,12 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas)
>  	return true;
>  }
> 
> +/*
> + * Traverse the maple tree until the offset of mas->index is reached.
> + *
> + * Return: Is this node actually populated with entries possessing pivots equal
> + *         to or greater than mas->index?
> + */
>  static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
>  {
>  	struct ma_state *mas = wr_mas->mas;
> @@ -3540,8 +3548,11 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
>  		mas_wr_walk_descend(wr_mas);
>  		wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
>  						  mas->offset);
> -		if (ma_is_leaf(wr_mas->type))
> -			return true;
> +		if (ma_is_leaf(wr_mas->type)) {
> +			unsigned long pivot = wr_mas->pivots[mas->offset];
> +
> +			return pivot == 0 || mas->index <= pivot;
> +		}
>  		mas_wr_walk_traverse(wr_mas);
> 
>  	}
> @@ -3701,6 +3712,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
>  	struct maple_big_node b_node;
>  	struct ma_state *mas;
>  	unsigned char height;
> +	bool r_populated;
> 
>  	/* Left and Right side of spanning store */
>  	MA_STATE(l_mas, NULL, 0, 0);
> @@ -3742,7 +3754,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
>  		r_mas.last++;
> 
>  	r_mas.index = r_mas.last;
> -	mas_wr_walk_index(&r_wr_mas);
> +	r_populated = mas_wr_walk_index(&r_wr_mas);
>  	r_mas.last = r_mas.index = mas->last;
> 
>  	/* Set up left side. */
> @@ -3766,7 +3778,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
>  	/* Copy l_mas and store the value in b_node. */
>  	mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
>  	/* Copy r_mas into b_node. */
> -	if (r_mas.offset <= r_mas.end)
> +	if (r_populated && r_mas.offset <= r_mas.end)
>  		mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
>  			   &b_node, b_node.b_end + 1);
>  	else
> --
> 2.46.2
Liam R. Howlett Oct. 7, 2024, 2:47 p.m. UTC | #2
* Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241006 10:31]:
> There has been a subtle bug present in the maple tree implementation from
> its inception.
> 
> This arises from how stores are performed - when a store occurs, it will
> overwrite overlapping ranges and adjust the tree as necessary to
> accommodate this.
> 
> A range may always ultimately span two leaf nodes. In this instance we walk
> the two leaf nodes, determine which elements are not overwritten to the
> left and to the right of the start and end of the ranges respectively and
> then rebalance the tree to contain these entries and the newly inserted
> one.
> 
> This kind of store is dubbed a 'spanning store' and is implemented by
> mas_wr_spanning_store().
> 
> In order to reach this stage, mas_store_gfp() invokes mas_wr_preallocate(),
> mas_wr_store_type() and mas_wr_walk() in turn to walk the tree and update
> the object (mas) to traverse to the location where the write should be
> performed, determining its store type.
> 
> When a spanning store is required, this function returns false stopping at
> the parent node which contains the target range, and mas_wr_store_type()
> marks the mas->store_type as wr_spanning_store to denote this fact.
> 
> When we go to perform the store in mas_wr_spanning_store(), we first
> determine the elements AFTER the END of the range we wish to store (that
> is, to the right of the entry to be inserted) - we do this by walking to
> the NEXT pivot in the tree (i.e. r_mas.last + 1), starting at the node we
> have just determined contains the range over which we intend to write.
> 
> We then turn our attention to the entries to the left of the entry we are
> inserting, whose state is represented by l_mas, and copy these into a 'big
> node', which is a special node which contains enough slots to contain two
> leaf node's worth of data.
> 
> We then copy the entry we wish to store immediately after this - the copy
> and the insertion of the new entry is performed by mas_store_b_node().
> 
> After this we copy the elements to the right of the end of the range which
> we are inserting, if we have not exceeded the length of the node
> (i.e. r_mas.offset <= r_mas.end).
> 
> Herein lies the bug - under very specific circumstances, this logic can
> break and corrupt the maple tree.
> 
> Consider the following tree:
> 
> Height
>   0                             Root Node
>                                  /      \
>                  pivot = 0xffff /        \ pivot = ULONG_MAX
>                                /          \
>   1                       A [-----]       ...
>                              /   \
>              pivot = 0x4fff /     \ pivot = 0xffff
>                            /       \
>   2 (LEAVES)          B [-----]  [-----] C
>                                       ^--- Last pivot 0xffff.
> 
> Now imagine we wish to store an entry in the range [0x4000, 0xffff] (note
> that all ranges expressed in maple tree code are inclusive):
> 
> 1. mas_store_gfp() descends the tree, finds node A at <=0xffff, then
>    determines that this is a spanning store across nodes B and C. The mas
>    state is set such that the current node from which we traverse further
>    is node A.
> 
> 2. In mas_wr_spanning_store() we try to find elements to the right of pivot
>    0xffff by searching for an index of 0x10000:
> 
>     - mas_wr_walk_index() invokes mas_wr_walk_descend() and
>       mas_wr_node_walk() in turn.
> 
>         - mas_wr_node_walk() loops over entries in node A until EITHER it
>           finds an entry whose pivot equals or exceeds 0x10000 OR it
>           reaches the final entry.
> 
>         - Since no entry has a pivot equal to or exceeding 0x10000, pivot
>           0xffff is selected, leading to node C.
> 
>     - mas_wr_walk_traverse() resets the mas state to traverse node C. We
>       loop around and invoke mas_wr_walk_descend() and mas_wr_node_walk()
>       in turn once again.
> 
>          - Again, we reach the last entry in node C, which has a pivot of
>            0xffff.
> 
> 3. We then copy the elements to the left of 0x4000 in node B to the big
>    node via mas_store_b_node(), and insert the new [0x4000, 0xffff] entry
>    too.
> 
> 4. We determine whether we have any entries to copy from the right of the
>    end of the range via - and with r_mas set up at the entry at pivot
>    0xffff, r_mas.offset <= r_mas.end, and then we DUPLICATE the entry at
>    pivot 0xffff.
> 
> 5. BUG! The maple tree is corrupted with a duplicate entry.
> 
> This requires a very specific set of circumstances - we must be spanning
> the last element in a leaf node, which is the last element in the parent
> node.
> 
> spanning store across two leaf nodes with a range that ends at that shared
> pivot.
> 
> A potential solution to this problem would simply be to reset the walk each
> time we traverse r_mas, however given the rarity of this situation it seems
> that would be rather inefficient.
> 
> Instead, this patch detects if the right hand node is populated, i.e. has
> anything we need to copy. We can do this easily in mas_wr_walk_index() by
> detecting if the pivot is either 0 (shorthand for the end of the range) or
> the required index is less than or equal to the last encountered pivot.
> 
> This change is made in mas_wr_walk_index() which is only used by the
> spanning store so it has minimal impact.
> 
> The work performed in commit f8d112a4e657 ("mm/mmap: avoid zeroing vma tree
> in mmap_region()") seems to have made the probability of this event much
> more likely, which is the point at which reports started to be submitted
> concerning this bug.
> 
> The motivation for this change arose from Bert Karwatzki's report of
> encountering mm instability after the release of kernel v6.12-rc1 which,
> after the use of CONFIG_DEBUG_VM_MAPLE_TREE and similar configuration
> options, was identified as maple tree corruption.
> 
> After Bert very generously provided his time and ability to reproduce this
> event consistently, I was able to finally identify that the issue discussed
> in this commit message was occurring for him.
> 
> Reported-and-tested-by: Bert Karwatzki <spasswolf@web.de>
> Closes: https://lore.kernel.org/all/20241001023402.3374-1-spasswolf@web.de/
> Reported-by: Mikhail Gavrilov <mikhail.v.gavrilov@gmail.com>
> Closes: https://lore.kernel.org/all/CABXGCsOPwuoNOqSMmAvWO2Fz4TEmPnjFj-b7iF+XFRu1h7-+Dg@mail.gmail.com/
> Fixes: 54a611b60590 ("Maple Tree: add new data structure")
> Signed-off-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
> ---
>  lib/maple_tree.c | 20 ++++++++++++++++----
>  1 file changed, 16 insertions(+), 4 deletions(-)
> 
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 20990ecba2dd..f72e1a5a4dfa 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -2196,6 +2196,8 @@ static inline void mas_node_or_none(struct ma_state *mas,
> 
>  /*
>   * mas_wr_node_walk() - Find the correct offset for the index in the @mas.
> + *                      If @mas->index cannot be found within the containing
> + *                      node, we traverse to the last entry in the node.
>   * @wr_mas: The maple write state
>   *
>   * Uses mas_slot_locked() and does not need to worry about dead nodes.
> @@ -3532,6 +3534,12 @@ static bool mas_wr_walk(struct ma_wr_state *wr_mas)
>  	return true;
>  }
> 
> +/*
> + * Traverse the maple tree until the offset of mas->index is reached.
> + *
> + * Return: Is this node actually populated with entries possessing pivots equal
> + *         to or greater than mas->index?
> + */
>  static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)

Oh good, I'm returning a bool that was never used.

>  {
>  	struct ma_state *mas = wr_mas->mas;
> @@ -3540,8 +3548,11 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
>  		mas_wr_walk_descend(wr_mas);
>  		wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
>  						  mas->offset);
> -		if (ma_is_leaf(wr_mas->type))
> -			return true;
> +		if (ma_is_leaf(wr_mas->type)) {
> +			unsigned long pivot = wr_mas->pivots[mas->offset];

If mas->offset points to the last slot, then this will be outside the
pivot array.  That is, there is an implied max pivot from the parent
which may not have a pivot entry.

> +
> +			return pivot == 0 || mas->index <= pivot;

What is the pivot == 0 portion of this?  The pivot should always have a
value, unless it's the first pivot in the tree of range 0-0, but then
there will always be more content to copy.

> +		}
>  		mas_wr_walk_traverse(wr_mas);
> 
>  	}
> @@ -3701,6 +3712,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
>  	struct maple_big_node b_node;
>  	struct ma_state *mas;
>  	unsigned char height;
> +	bool r_populated;
> 
>  	/* Left and Right side of spanning store */
>  	MA_STATE(l_mas, NULL, 0, 0);
> @@ -3742,7 +3754,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
>  		r_mas.last++;
> 
>  	r_mas.index = r_mas.last;
> -	mas_wr_walk_index(&r_wr_mas);
> +	r_populated = mas_wr_walk_index(&r_wr_mas);
>  	r_mas.last = r_mas.index = mas->last;
> 
>  	/* Set up left side. */
> @@ -3766,7 +3778,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
>  	/* Copy l_mas and store the value in b_node. */
>  	mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
>  	/* Copy r_mas into b_node. */
> -	if (r_mas.offset <= r_mas.end)
> +	if (r_populated && r_mas.offset <= r_mas.end)
>  		mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
>  			   &b_node, b_node.b_end + 1);

We may be able to leverage the information contained in r_mas and
r_wr_mas to determine when contents needs to be copied.

Perhaps r_mas.max > r_mas.last instead?  Where r_mas.max is the node
max and r_mas.last is the end of the range being written.

>  	else
> --
> 2.46.2
Lorenzo Stoakes Oct. 7, 2024, 3:04 p.m. UTC | #3
On Mon, Oct 07, 2024 at 10:47:04AM -0400, Liam R. Howlett wrote:
> * Lorenzo Stoakes <lorenzo.stoakes@oracle.com> [241006 10:31]:
> > There has been a subtle bug present in the maple tree implementation from
> > its inception.
> >

[snip]

> > +/*
> > + * Traverse the maple tree until the offset of mas->index is reached.
> > + *
> > + * Return: Is this node actually populated with entries possessing pivots equal
> > + *         to or greater than mas->index?
> > + */
> >  static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
>
> Oh good, I'm returning a bool that was never used.

Yeah...

>
> >  {
> >  	struct ma_state *mas = wr_mas->mas;
> > @@ -3540,8 +3548,11 @@ static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
> >  		mas_wr_walk_descend(wr_mas);
> >  		wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
> >  						  mas->offset);
> > -		if (ma_is_leaf(wr_mas->type))
> > -			return true;
> > +		if (ma_is_leaf(wr_mas->type)) {
> > +			unsigned long pivot = wr_mas->pivots[mas->offset];
>
> If mas->offset points to the last slot, then this will be outside the
> pivot array.  That is, there is an implied max pivot from the parent
> which may not have a pivot entry.

Yikes, you're right! Would say 'will fix' but you suggest a much better
solution overall that vastly simplifies this patch below...

>
> > +
> > +			return pivot == 0 || mas->index <= pivot;
>
> What is the pivot == 0 portion of this?  The pivot should always have a
> value, unless it's the first pivot in the tree of range 0-0, but then
> there will always be more content to copy.

As discussed in IRC, I suspect one of the tests is using data from an old
version of the maple tree where this had semantic meaning. Without this
tests fail.

However this is irrelevant now because...

>
> > +		}
> >  		mas_wr_walk_traverse(wr_mas);
> >
> >  	}
> > @@ -3701,6 +3712,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
> >  	struct maple_big_node b_node;
> >  	struct ma_state *mas;
> >  	unsigned char height;
> > +	bool r_populated;
> >
> >  	/* Left and Right side of spanning store */
> >  	MA_STATE(l_mas, NULL, 0, 0);
> > @@ -3742,7 +3754,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
> >  		r_mas.last++;
> >
> >  	r_mas.index = r_mas.last;
> > -	mas_wr_walk_index(&r_wr_mas);
> > +	r_populated = mas_wr_walk_index(&r_wr_mas);
> >  	r_mas.last = r_mas.index = mas->last;
> >
> >  	/* Set up left side. */
> > @@ -3766,7 +3778,7 @@ static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
> >  	/* Copy l_mas and store the value in b_node. */
> >  	mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
> >  	/* Copy r_mas into b_node. */
> > -	if (r_mas.offset <= r_mas.end)
> > +	if (r_populated && r_mas.offset <= r_mas.end)
> >  		mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
> >  			   &b_node, b_node.b_end + 1);
>
> We may be able to leverage the information contained in r_mas and
> r_wr_mas to determine when contents needs to be copied.
>
> Perhaps r_mas.max > r_mas.last instead?  Where r_mas.max is the node
> max and r_mas.last is the end of the range being written.

...yes I tested this and it works :) so we can drop everything else and
turn this into a one liner.

Will respin now. Thanks!

>
> >  	else
> > --
> > 2.46.2
diff mbox series

Patch

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 20990ecba2dd..f72e1a5a4dfa 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -2196,6 +2196,8 @@  static inline void mas_node_or_none(struct ma_state *mas,

 /*
  * mas_wr_node_walk() - Find the correct offset for the index in the @mas.
+ *                      If @mas->index cannot be found within the containing
+ *                      node, we traverse to the last entry in the node.
  * @wr_mas: The maple write state
  *
  * Uses mas_slot_locked() and does not need to worry about dead nodes.
@@ -3532,6 +3534,12 @@  static bool mas_wr_walk(struct ma_wr_state *wr_mas)
 	return true;
 }

+/*
+ * Traverse the maple tree until the offset of mas->index is reached.
+ *
+ * Return: Is this node actually populated with entries possessing pivots equal
+ *         to or greater than mas->index?
+ */
 static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
 {
 	struct ma_state *mas = wr_mas->mas;
@@ -3540,8 +3548,11 @@  static bool mas_wr_walk_index(struct ma_wr_state *wr_mas)
 		mas_wr_walk_descend(wr_mas);
 		wr_mas->content = mas_slot_locked(mas, wr_mas->slots,
 						  mas->offset);
-		if (ma_is_leaf(wr_mas->type))
-			return true;
+		if (ma_is_leaf(wr_mas->type)) {
+			unsigned long pivot = wr_mas->pivots[mas->offset];
+
+			return pivot == 0 || mas->index <= pivot;
+		}
 		mas_wr_walk_traverse(wr_mas);

 	}
@@ -3701,6 +3712,7 @@  static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
 	struct maple_big_node b_node;
 	struct ma_state *mas;
 	unsigned char height;
+	bool r_populated;

 	/* Left and Right side of spanning store */
 	MA_STATE(l_mas, NULL, 0, 0);
@@ -3742,7 +3754,7 @@  static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
 		r_mas.last++;

 	r_mas.index = r_mas.last;
-	mas_wr_walk_index(&r_wr_mas);
+	r_populated = mas_wr_walk_index(&r_wr_mas);
 	r_mas.last = r_mas.index = mas->last;

 	/* Set up left side. */
@@ -3766,7 +3778,7 @@  static noinline void mas_wr_spanning_store(struct ma_wr_state *wr_mas)
 	/* Copy l_mas and store the value in b_node. */
 	mas_store_b_node(&l_wr_mas, &b_node, l_mas.end);
 	/* Copy r_mas into b_node. */
-	if (r_mas.offset <= r_mas.end)
+	if (r_populated && r_mas.offset <= r_mas.end)
 		mas_mab_cp(&r_mas, r_mas.offset, r_mas.end,
 			   &b_node, b_node.b_end + 1);
 	else