diff mbox series

[1/2] maple_tree: Fix mas_skip_node() end slot detection

Message ID 20230303021540.1056603-1-Liam.Howlett@oracle.com (mailing list archive)
State New
Headers show
Series [1/2] maple_tree: Fix mas_skip_node() end slot detection | expand

Commit Message

Liam R. Howlett March 3, 2023, 2:15 a.m. UTC
mas_skip_node() is used to move the maple state to the node with a
higher limit.  It does this by walking up the tree and increasing the
slot count.  Since slot count may not be able to be increased, it may
need to walk up multiple times to find room to walk right to a higher
limit node.  The limit of slots that was being used was the node limit
and not the last location of data in the node.  This would cause the
maple state to be shifted outside actual data and enter an error state,
thus returning -EBUSY.

The result of the incorrect error state means that mas_awalk() would
return an error instead of finding the allocation space.

The fix is to use mas_data_end() in mas_skip_node() to detect the nodes
data end point and continue walking the tree up until it is safe to move
to a node with a higher limit.

mas_skip_node() may also be passed a maple state in an error state from
mas_anode_descend() when no allocations are available.  Return on such
an error state immediately.

Reported-by: Snild Dolkow <snild@sony.com>
Link: https://lore.kernel.org/linux-mm/cb8dc31a-fef2-1d09-f133-e9f7b9f9e77a@sony.com/
Cc: <Stable@vger.kernel.org>
Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
---
 lib/maple_tree.c | 25 ++++++++++---------------
 1 file changed, 10 insertions(+), 15 deletions(-)

Comments

Snild Dolkow March 3, 2023, 8:51 a.m. UTC | #1
Thanks, this is a significant improvement. Applying it on top of v6.1.12 
allows my reproducer to pass most of the time (running as init in qemu).

Unfortunately, it's still failing around 10% of the time:
> $ for x in $(seq 100); do qemu-system-x86_64 -nographic -no-reboot -append 'console=ttyS0 panic=-1' -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz; done | tee qemu.log
> [...] > $ egrep -o 'Failed|Success' qemu.log | sort | uniq -c
>      11 Failed
>      89 Success

The failures now happen later, around 25 MiB:
> $ grep Failed qemu.log
> Failed. m=0xffffffffffffffff size=8192 (1<<13) i=1050  errno=12 total_leaks=29081600 (27 MiB)
> Failed. m=0xffffffffffffffff size=8192 (1<<13) i=332  errno=12 total_leaks=23199744 (22 MiB)
> Failed. m=0xffffffffffffffff size=8192 (1<<13) i=838  errno=12 total_leaks=27344896 (26 MiB)
> Failed. m=0xffffffffffffffff size=8192 (1<<13) i=282  errno=12 total_leaks=22790144 (21 MiB)
> Failed. m=0xffffffffffffffff size=8192 (1<<13) i=695  errno=12 total_leaks=26173440 (24 MiB)
> Failed. m=0xffffffffffffffff size=8192 (1<<13) i=1064  errno=12 total_leaks=29196288 (27 MiB)
> Failed. m=0xffffffffffffffff size=8192 (1<<13) i=608  errno=12 total_leaks=25460736 (24 MiB)
> Failed. m=0xffffffffffffffff size=8192 (1<<13) i=443  errno=12 total_leaks=24109056 (22 MiB)
> Failed. m=0xffffffffffffffff size=8192 (1<<13) i=549  errno=12 total_leaks=24977408 (23 MiB)
> Failed. m=0xffffffffffffffff size=8192 (1<<13) i=630  errno=12 total_leaks=25640960 (24 MiB)
> Failed. m=0xffffffffffffffff size=8192 (1<<13) i=820  errno=12 total_leaks=27197440 (25 MiB)


Just to make sure, I went back to e15e06a8 and ran the same loop.
> $ egrep -o 'Failed|Success' qemu.log | sort | uniq -c
>     100 Success

And with the patches applied on top of master (ee3f96b1):
> $ egrep -o 'Failed|Success' qemu.log | sort | uniq -c
>      10 Failed
>      90 Success

//Snild
Peng Zhang March 7, 2023, 1:05 p.m. UTC | #2
Hi, Liam,
> mas_skip_node() is used to move the maple state to the node with a
> higher limit.  It does this by walking up the tree and increasing the
> slot count.  Since slot count may not be able to be increased, it may
> need to walk up multiple times to find room to walk right to a higher
> limit node.  The limit of slots that was being used was the node limit
> and not the last location of data in the node.  This would cause the
> maple state to be shifted outside actual data and enter an error state,
> thus returning -EBUSY.
> 
> The result of the incorrect error state means that mas_awalk() would
> return an error instead of finding the allocation space.
> 
> The fix is to use mas_data_end() in mas_skip_node() to detect the nodes
> data end point and continue walking the tree up until it is safe to move
> to a node with a higher limit.
> 
> mas_skip_node() may also be passed a maple state in an error state from
> mas_anode_descend() when no allocations are available.  Return on such
> an error state immediately.
> 
> Reported-by: Snild Dolkow <snild@sony.com>
> Link: https://lore.kernel.org/linux-mm/cb8dc31a-fef2-1d09-f133-e9f7b9f9e77a@sony.com/
> Cc: <Stable@vger.kernel.org>
> Fixes: 54a611b60590 ("Maple Tree: add new data structure")
> Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Peng Zhang <zhangpeng.00@bytedance.com>
> ---
>   lib/maple_tree.c | 25 ++++++++++---------------
>   1 file changed, 10 insertions(+), 15 deletions(-)
> 
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 2be86368237d..2efe854946d6 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5188,34 +5188,29 @@ static inline bool mas_rewind_node(struct ma_state *mas)
>    */
>   static inline bool mas_skip_node(struct ma_state *mas)
>   {
> -	unsigned char slot, slot_count;
>   	unsigned long *pivots;
>   	enum maple_type mt;
>   
> -	mt = mte_node_type(mas->node);
> -	slot_count = mt_slots[mt] - 1;
> +	if (mas_is_err(mas))
> +		return false;
> +
>   	do {
>   		if (mte_is_root(mas->node)) {
> -			slot = mas->offset;
> -			if (slot > slot_count) {
> +			if (mas->offset >= mas_data_end(mas)) {
>   				mas_set_err(mas, -EBUSY);
>   				return false;
>   			}
>   		} else {
>   			mas_ascend(mas);
> -			slot = mas->offset;
> -			mt = mte_node_type(mas->node);
> -			slot_count = mt_slots[mt] - 1;
>   		}
> -	} while (slot > slot_count);
> +	} while (mas->offset >= mas_data_end(mas));
>   
> -	mas->offset = ++slot;
> +	mt = mte_node_type(mas->node);
>   	pivots = ma_pivots(mas_mn(mas), mt);
> -	if (slot > 0)
> -		mas->min = pivots[slot - 1] + 1;
> -
> -	if (slot <= slot_count)
> -		mas->max = pivots[slot];
> +	mas->min = pivots[mas->offset] + 1;
> +	mas->offset++;
> +	if (mas->offset < mt_slots[mt])
> +		mas->max = pivots[mas->offset];
There is a bug here, the assignment of mas->min and mas->max is wrong.
The assignment will make them represent the range of a child node, but 
it should represent the range of the current node. After mas_ascend() 
returns, mas-min and mas->max already represent the range of the current 
node, so we should delete these assignments of mas->min and mas->max.
>   
>   	return true;
>   }

Sincerely yours,
Peng.
Snild Dolkow March 7, 2023, 2:30 p.m. UTC | #3
On 2023-03-07 14:05, Peng Zhang wrote:
> Hi, Liam,
>> -    } while (slot > slot_count);
>> +    } while (mas->offset >= mas_data_end(mas));
>> -    mas->offset = ++slot;
>> +    mt = mte_node_type(mas->node);
>>       pivots = ma_pivots(mas_mn(mas), mt);
>> -    if (slot > 0)
>> -        mas->min = pivots[slot - 1] + 1;
>> -
>> -    if (slot <= slot_count)
>> -        mas->max = pivots[slot];
>> +    mas->min = pivots[mas->offset] + 1;
>> +    mas->offset++;
>> +    if (mas->offset < mt_slots[mt])
>> +        mas->max = pivots[mas->offset];
> There is a bug here, the assignment of mas->min and mas->max is wrong.
> The assignment will make them represent the range of a child node, but 
> it should represent the range of the current node. After mas_ascend() 
> returns, mas-min and mas->max already represent the range of the current 
> node, so we should delete these assignments of mas->min and mas->max.


Thanks for your suggestion, Peng. Applying it literally by removing only 
the min/max assignments:

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 6fc1ad42b409..9b6e581cf83f 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5118,10 +5118,7 @@ static inline bool mas_skip_node

         mt = mte_node_type(mas->node);
         pivots = ma_pivots(mas_mn(mas), mt);
-       mas->min = pivots[mas->offset] + 1;
         mas->offset++;
-       if (mas->offset < mt_slots[mt])
-               mas->max = pivots[mas->offset];

         return true;
  }


This allowed my test to pass 100/100 runs. Still in qemu with the test 
as init, so not really stressed in any way except that specific usecase.

//Snild
Liam R. Howlett March 7, 2023, 4:10 p.m. UTC | #4
* Snild Dolkow <snild@sony.com> [230307 09:31]:
> On 2023-03-07 14:05, Peng Zhang wrote:
> > Hi, Liam,
> > > -    } while (slot > slot_count);
> > > +    } while (mas->offset >= mas_data_end(mas));
> > > -    mas->offset = ++slot;
> > > +    mt = mte_node_type(mas->node);
> > >       pivots = ma_pivots(mas_mn(mas), mt);
> > > -    if (slot > 0)
> > > -        mas->min = pivots[slot - 1] + 1;
> > > -
> > > -    if (slot <= slot_count)
> > > -        mas->max = pivots[slot];
> > > +    mas->min = pivots[mas->offset] + 1;
> > > +    mas->offset++;
> > > +    if (mas->offset < mt_slots[mt])
> > > +        mas->max = pivots[mas->offset];
> > There is a bug here, the assignment of mas->min and mas->max is wrong.
> > The assignment will make them represent the range of a child node, but
> > it should represent the range of the current node. After mas_ascend()
> > returns, mas-min and mas->max already represent the range of the current
> > node, so we should delete these assignments of mas->min and mas->max.
> 
> 
> Thanks for your suggestion, Peng. Applying it literally by removing only the
> min/max assignments:
> 
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 6fc1ad42b409..9b6e581cf83f 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5118,10 +5118,7 @@ static inline bool mas_skip_node
> 
>         mt = mte_node_type(mas->node);
>         pivots = ma_pivots(mas_mn(mas), mt);
> -       mas->min = pivots[mas->offset] + 1;
>         mas->offset++;
> -       if (mas->offset < mt_slots[mt])
> -               mas->max = pivots[mas->offset];
> 
>         return true;
>  }
> 
> 
> This allowed my test to pass 100/100 runs. Still in qemu with the test as
> init, so not really stressed in any way except that specific usecase.
> 


Yes, I have a v2 that removes these lines and some extra unused lines.
I've been working on a recreation to add to the testing before I send v2
out.  I had 100% success running my testing over the weekend, but I
wanted to have a testcase before sending the change.

Thanks,
Liam
Peng Zhang March 7, 2023, 4:16 p.m. UTC | #5
Hi, Snild,

在 2023/3/7 22:30, Snild Dolkow 写道:
> On 2023-03-07 14:05, Peng Zhang wrote:
>> Hi, Liam,
>>> -    } while (slot > slot_count);
>>> +    } while (mas->offset >= mas_data_end(mas));
>>> -    mas->offset = ++slot;
>>> +    mt = mte_node_type(mas->node);
>>>       pivots = ma_pivots(mas_mn(mas), mt);
>>> -    if (slot > 0)
>>> -        mas->min = pivots[slot - 1] + 1;
>>> -
>>> -    if (slot <= slot_count)
>>> -        mas->max = pivots[slot];
>>> +    mas->min = pivots[mas->offset] + 1;
>>> +    mas->offset++;
>>> +    if (mas->offset < mt_slots[mt])
>>> +        mas->max = pivots[mas->offset];
>> There is a bug here, the assignment of mas->min and mas->max is wrong.
>> The assignment will make them represent the range of a child node, 
>> but it should represent the range of the current node. After 
>> mas_ascend() returns, mas-min and mas->max already represent the 
>> range of the current node, so we should delete these assignments of 
>> mas->min and mas->max.
>
>
> Thanks for your suggestion, Peng. Applying it literally by removing 
> only the min/max assignments:
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 6fc1ad42b409..9b6e581cf83f 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5118,10 +5118,7 @@ static inline bool mas_skip_node
>
>         mt = mte_node_type(mas->node);
>         pivots = ma_pivots(mas_mn(mas), mt);
> -       mas->min = pivots[mas->offset] + 1;
>         mas->offset++;
> -       if (mas->offset < mt_slots[mt])
> -               mas->max = pivots[mas->offset];
>
>         return true;
>  }
>
>
> This allowed my test to pass 100/100 runs. Still in qemu with the test 
> as init, so not really stressed in any way except that specific usecase.
>
> //Snild

Thanks for the test, I'm happy if it happens to fix your problem. So a 
patch was made.
This patch needs to be applied after Liam's patch.

Sincerely yours,
Peng.
Peng Zhang March 7, 2023, 4:21 p.m. UTC | #6
Sorry I forgot to post the link, here it is:
https://lore.kernel.org/lkml/20230307160340.57074-1-zhangpeng.00@bytedance.com/
Liam R. Howlett March 7, 2023, 4:29 p.m. UTC | #7
* Peng Zhang <zhangpeng.00@bytedance.com> [230307 11:21]:
> Sorry I forgot to post the link, here it is:
> https://lore.kernel.org/lkml/20230307160340.57074-1-zhangpeng.00@bytedance.com/


I will roll this into V2 of my patch set.  It is exactly what I have,
but I need to come up with a testcase for this first.
diff mbox series

Patch

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 2be86368237d..2efe854946d6 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5188,34 +5188,29 @@  static inline bool mas_rewind_node(struct ma_state *mas)
  */
 static inline bool mas_skip_node(struct ma_state *mas)
 {
-	unsigned char slot, slot_count;
 	unsigned long *pivots;
 	enum maple_type mt;
 
-	mt = mte_node_type(mas->node);
-	slot_count = mt_slots[mt] - 1;
+	if (mas_is_err(mas))
+		return false;
+
 	do {
 		if (mte_is_root(mas->node)) {
-			slot = mas->offset;
-			if (slot > slot_count) {
+			if (mas->offset >= mas_data_end(mas)) {
 				mas_set_err(mas, -EBUSY);
 				return false;
 			}
 		} else {
 			mas_ascend(mas);
-			slot = mas->offset;
-			mt = mte_node_type(mas->node);
-			slot_count = mt_slots[mt] - 1;
 		}
-	} while (slot > slot_count);
+	} while (mas->offset >= mas_data_end(mas));
 
-	mas->offset = ++slot;
+	mt = mte_node_type(mas->node);
 	pivots = ma_pivots(mas_mn(mas), mt);
-	if (slot > 0)
-		mas->min = pivots[slot - 1] + 1;
-
-	if (slot <= slot_count)
-		mas->max = pivots[slot];
+	mas->min = pivots[mas->offset] + 1;
+	mas->offset++;
+	if (mas->offset < mt_slots[mt])
+		mas->max = pivots[mas->offset];
 
 	return true;
 }