Message ID | 20230303021540.1056603-1-Liam.Howlett@oracle.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [1/2] maple_tree: Fix mas_skip_node() end slot detection | expand |
Thanks, this is a significant improvement. Applying it on top of v6.1.12 allows my reproducer to pass most of the time (running as init in qemu). Unfortunately, it's still failing around 10% of the time: > $ for x in $(seq 100); do qemu-system-x86_64 -nographic -no-reboot -append 'console=ttyS0 panic=-1' -kernel arch/x86/boot/bzImage -initrd initrd/initrd.gz; done | tee qemu.log > [...] > $ egrep -o 'Failed|Success' qemu.log | sort | uniq -c > 11 Failed > 89 Success The failures now happen later, around 25 MiB: > $ grep Failed qemu.log > Failed. m=0xffffffffffffffff size=8192 (1<<13) i=1050 errno=12 total_leaks=29081600 (27 MiB) > Failed. m=0xffffffffffffffff size=8192 (1<<13) i=332 errno=12 total_leaks=23199744 (22 MiB) > Failed. m=0xffffffffffffffff size=8192 (1<<13) i=838 errno=12 total_leaks=27344896 (26 MiB) > Failed. m=0xffffffffffffffff size=8192 (1<<13) i=282 errno=12 total_leaks=22790144 (21 MiB) > Failed. m=0xffffffffffffffff size=8192 (1<<13) i=695 errno=12 total_leaks=26173440 (24 MiB) > Failed. m=0xffffffffffffffff size=8192 (1<<13) i=1064 errno=12 total_leaks=29196288 (27 MiB) > Failed. m=0xffffffffffffffff size=8192 (1<<13) i=608 errno=12 total_leaks=25460736 (24 MiB) > Failed. m=0xffffffffffffffff size=8192 (1<<13) i=443 errno=12 total_leaks=24109056 (22 MiB) > Failed. m=0xffffffffffffffff size=8192 (1<<13) i=549 errno=12 total_leaks=24977408 (23 MiB) > Failed. m=0xffffffffffffffff size=8192 (1<<13) i=630 errno=12 total_leaks=25640960 (24 MiB) > Failed. m=0xffffffffffffffff size=8192 (1<<13) i=820 errno=12 total_leaks=27197440 (25 MiB) Just to make sure, I went back to e15e06a8 and ran the same loop. > $ egrep -o 'Failed|Success' qemu.log | sort | uniq -c > 100 Success And with the patches applied on top of master (ee3f96b1): > $ egrep -o 'Failed|Success' qemu.log | sort | uniq -c > 10 Failed > 90 Success //Snild
Hi, Liam, > mas_skip_node() is used to move the maple state to the node with a > higher limit. It does this by walking up the tree and increasing the > slot count. Since slot count may not be able to be increased, it may > need to walk up multiple times to find room to walk right to a higher > limit node. The limit of slots that was being used was the node limit > and not the last location of data in the node. This would cause the > maple state to be shifted outside actual data and enter an error state, > thus returning -EBUSY. > > The result of the incorrect error state means that mas_awalk() would > return an error instead of finding the allocation space. > > The fix is to use mas_data_end() in mas_skip_node() to detect the nodes > data end point and continue walking the tree up until it is safe to move > to a node with a higher limit. > > mas_skip_node() may also be passed a maple state in an error state from > mas_anode_descend() when no allocations are available. Return on such > an error state immediately. > > Reported-by: Snild Dolkow <snild@sony.com> > Link: https://lore.kernel.org/linux-mm/cb8dc31a-fef2-1d09-f133-e9f7b9f9e77a@sony.com/ > Cc: <Stable@vger.kernel.org> > Fixes: 54a611b60590 ("Maple Tree: add new data structure") > Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> Reviewed-by: Peng Zhang <zhangpeng.00@bytedance.com> > --- > lib/maple_tree.c | 25 ++++++++++--------------- > 1 file changed, 10 insertions(+), 15 deletions(-) > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c > index 2be86368237d..2efe854946d6 100644 > --- a/lib/maple_tree.c > +++ b/lib/maple_tree.c > @@ -5188,34 +5188,29 @@ static inline bool mas_rewind_node(struct ma_state *mas) > */ > static inline bool mas_skip_node(struct ma_state *mas) > { > - unsigned char slot, slot_count; > unsigned long *pivots; > enum maple_type mt; > > - mt = mte_node_type(mas->node); > - slot_count = mt_slots[mt] - 1; > + if (mas_is_err(mas)) > + return false; > + > do { > if (mte_is_root(mas->node)) { > - slot = mas->offset; > - if (slot > slot_count) { > + if (mas->offset >= mas_data_end(mas)) { > mas_set_err(mas, -EBUSY); > return false; > } > } else { > mas_ascend(mas); > - slot = mas->offset; > - mt = mte_node_type(mas->node); > - slot_count = mt_slots[mt] - 1; > } > - } while (slot > slot_count); > + } while (mas->offset >= mas_data_end(mas)); > > - mas->offset = ++slot; > + mt = mte_node_type(mas->node); > pivots = ma_pivots(mas_mn(mas), mt); > - if (slot > 0) > - mas->min = pivots[slot - 1] + 1; > - > - if (slot <= slot_count) > - mas->max = pivots[slot]; > + mas->min = pivots[mas->offset] + 1; > + mas->offset++; > + if (mas->offset < mt_slots[mt]) > + mas->max = pivots[mas->offset]; There is a bug here, the assignment of mas->min and mas->max is wrong. The assignment will make them represent the range of a child node, but it should represent the range of the current node. After mas_ascend() returns, mas-min and mas->max already represent the range of the current node, so we should delete these assignments of mas->min and mas->max. > > return true; > } Sincerely yours, Peng.
On 2023-03-07 14:05, Peng Zhang wrote: > Hi, Liam, >> - } while (slot > slot_count); >> + } while (mas->offset >= mas_data_end(mas)); >> - mas->offset = ++slot; >> + mt = mte_node_type(mas->node); >> pivots = ma_pivots(mas_mn(mas), mt); >> - if (slot > 0) >> - mas->min = pivots[slot - 1] + 1; >> - >> - if (slot <= slot_count) >> - mas->max = pivots[slot]; >> + mas->min = pivots[mas->offset] + 1; >> + mas->offset++; >> + if (mas->offset < mt_slots[mt]) >> + mas->max = pivots[mas->offset]; > There is a bug here, the assignment of mas->min and mas->max is wrong. > The assignment will make them represent the range of a child node, but > it should represent the range of the current node. After mas_ascend() > returns, mas-min and mas->max already represent the range of the current > node, so we should delete these assignments of mas->min and mas->max. Thanks for your suggestion, Peng. Applying it literally by removing only the min/max assignments: diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 6fc1ad42b409..9b6e581cf83f 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -5118,10 +5118,7 @@ static inline bool mas_skip_node mt = mte_node_type(mas->node); pivots = ma_pivots(mas_mn(mas), mt); - mas->min = pivots[mas->offset] + 1; mas->offset++; - if (mas->offset < mt_slots[mt]) - mas->max = pivots[mas->offset]; return true; } This allowed my test to pass 100/100 runs. Still in qemu with the test as init, so not really stressed in any way except that specific usecase. //Snild
* Snild Dolkow <snild@sony.com> [230307 09:31]: > On 2023-03-07 14:05, Peng Zhang wrote: > > Hi, Liam, > > > - } while (slot > slot_count); > > > + } while (mas->offset >= mas_data_end(mas)); > > > - mas->offset = ++slot; > > > + mt = mte_node_type(mas->node); > > > pivots = ma_pivots(mas_mn(mas), mt); > > > - if (slot > 0) > > > - mas->min = pivots[slot - 1] + 1; > > > - > > > - if (slot <= slot_count) > > > - mas->max = pivots[slot]; > > > + mas->min = pivots[mas->offset] + 1; > > > + mas->offset++; > > > + if (mas->offset < mt_slots[mt]) > > > + mas->max = pivots[mas->offset]; > > There is a bug here, the assignment of mas->min and mas->max is wrong. > > The assignment will make them represent the range of a child node, but > > it should represent the range of the current node. After mas_ascend() > > returns, mas-min and mas->max already represent the range of the current > > node, so we should delete these assignments of mas->min and mas->max. > > > Thanks for your suggestion, Peng. Applying it literally by removing only the > min/max assignments: > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c > index 6fc1ad42b409..9b6e581cf83f 100644 > --- a/lib/maple_tree.c > +++ b/lib/maple_tree.c > @@ -5118,10 +5118,7 @@ static inline bool mas_skip_node > > mt = mte_node_type(mas->node); > pivots = ma_pivots(mas_mn(mas), mt); > - mas->min = pivots[mas->offset] + 1; > mas->offset++; > - if (mas->offset < mt_slots[mt]) > - mas->max = pivots[mas->offset]; > > return true; > } > > > This allowed my test to pass 100/100 runs. Still in qemu with the test as > init, so not really stressed in any way except that specific usecase. > Yes, I have a v2 that removes these lines and some extra unused lines. I've been working on a recreation to add to the testing before I send v2 out. I had 100% success running my testing over the weekend, but I wanted to have a testcase before sending the change. Thanks, Liam
Hi, Snild, 在 2023/3/7 22:30, Snild Dolkow 写道: > On 2023-03-07 14:05, Peng Zhang wrote: >> Hi, Liam, >>> - } while (slot > slot_count); >>> + } while (mas->offset >= mas_data_end(mas)); >>> - mas->offset = ++slot; >>> + mt = mte_node_type(mas->node); >>> pivots = ma_pivots(mas_mn(mas), mt); >>> - if (slot > 0) >>> - mas->min = pivots[slot - 1] + 1; >>> - >>> - if (slot <= slot_count) >>> - mas->max = pivots[slot]; >>> + mas->min = pivots[mas->offset] + 1; >>> + mas->offset++; >>> + if (mas->offset < mt_slots[mt]) >>> + mas->max = pivots[mas->offset]; >> There is a bug here, the assignment of mas->min and mas->max is wrong. >> The assignment will make them represent the range of a child node, >> but it should represent the range of the current node. After >> mas_ascend() returns, mas-min and mas->max already represent the >> range of the current node, so we should delete these assignments of >> mas->min and mas->max. > > > Thanks for your suggestion, Peng. Applying it literally by removing > only the min/max assignments: > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c > index 6fc1ad42b409..9b6e581cf83f 100644 > --- a/lib/maple_tree.c > +++ b/lib/maple_tree.c > @@ -5118,10 +5118,7 @@ static inline bool mas_skip_node > > mt = mte_node_type(mas->node); > pivots = ma_pivots(mas_mn(mas), mt); > - mas->min = pivots[mas->offset] + 1; > mas->offset++; > - if (mas->offset < mt_slots[mt]) > - mas->max = pivots[mas->offset]; > > return true; > } > > > This allowed my test to pass 100/100 runs. Still in qemu with the test > as init, so not really stressed in any way except that specific usecase. > > //Snild Thanks for the test, I'm happy if it happens to fix your problem. So a patch was made. This patch needs to be applied after Liam's patch. Sincerely yours, Peng.
Sorry I forgot to post the link, here it is: https://lore.kernel.org/lkml/20230307160340.57074-1-zhangpeng.00@bytedance.com/
* Peng Zhang <zhangpeng.00@bytedance.com> [230307 11:21]: > Sorry I forgot to post the link, here it is: > https://lore.kernel.org/lkml/20230307160340.57074-1-zhangpeng.00@bytedance.com/ I will roll this into V2 of my patch set. It is exactly what I have, but I need to come up with a testcase for this first.
diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 2be86368237d..2efe854946d6 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -5188,34 +5188,29 @@ static inline bool mas_rewind_node(struct ma_state *mas) */ static inline bool mas_skip_node(struct ma_state *mas) { - unsigned char slot, slot_count; unsigned long *pivots; enum maple_type mt; - mt = mte_node_type(mas->node); - slot_count = mt_slots[mt] - 1; + if (mas_is_err(mas)) + return false; + do { if (mte_is_root(mas->node)) { - slot = mas->offset; - if (slot > slot_count) { + if (mas->offset >= mas_data_end(mas)) { mas_set_err(mas, -EBUSY); return false; } } else { mas_ascend(mas); - slot = mas->offset; - mt = mte_node_type(mas->node); - slot_count = mt_slots[mt] - 1; } - } while (slot > slot_count); + } while (mas->offset >= mas_data_end(mas)); - mas->offset = ++slot; + mt = mte_node_type(mas->node); pivots = ma_pivots(mas_mn(mas), mt); - if (slot > 0) - mas->min = pivots[slot - 1] + 1; - - if (slot <= slot_count) - mas->max = pivots[slot]; + mas->min = pivots[mas->offset] + 1; + mas->offset++; + if (mas->offset < mt_slots[mt]) + mas->max = pivots[mas->offset]; return true; }
mas_skip_node() is used to move the maple state to the node with a higher limit. It does this by walking up the tree and increasing the slot count. Since slot count may not be able to be increased, it may need to walk up multiple times to find room to walk right to a higher limit node. The limit of slots that was being used was the node limit and not the last location of data in the node. This would cause the maple state to be shifted outside actual data and enter an error state, thus returning -EBUSY. The result of the incorrect error state means that mas_awalk() would return an error instead of finding the allocation space. The fix is to use mas_data_end() in mas_skip_node() to detect the nodes data end point and continue walking the tree up until it is safe to move to a node with a higher limit. mas_skip_node() may also be passed a maple state in an error state from mas_anode_descend() when no allocations are available. Return on such an error state immediately. Reported-by: Snild Dolkow <snild@sony.com> Link: https://lore.kernel.org/linux-mm/cb8dc31a-fef2-1d09-f133-e9f7b9f9e77a@sony.com/ Cc: <Stable@vger.kernel.org> Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com> --- lib/maple_tree.c | 25 ++++++++++--------------- 1 file changed, 10 insertions(+), 15 deletions(-)