[v3,5/6] maple_tree: add sufficient height

Message ID 20250227204823.758784-6-sidhartha.kumar@oracle.com (mailing list archive)
State New
Series Track node vacancy to reduce worst case allocation counts

Commit Message

Sidhartha Kumar Feb. 27, 2025, 8:48 p.m. UTC
In order to support rebalancing and spanning stores using less than the
worst case number of nodes, we need to track more than just the vacant
height. Using only vacant height to reduce the worst case maple node
allocation count can lead to a shortage of nodes in the following
scenarios.

For rebalancing writes, when a leaf node becomes insufficient, it may be
combined with a sibling into a single node. This means that the parent node
which has entries for these children will lose one entry. If this parent node
was just meeting the minimum entries, losing one entry will now cause this
parent node to be insufficient. This leads to a cascading operation of
rebalancing at different levels and can require more node allocations than
a count based on vacant height alone provides.
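
The cascade described above can be sketched with a toy model. This is not
kernel code: cascade_levels(), entries[] and min_entries are hypothetical
names, and the model assumes every insufficient node is merged with a
sibling rather than borrowing entries from it.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model, not kernel code. entries[0] is the parent of the merged
 * leaf and entries[n - 1] is the highest ancestor considered. Returns
 * how many ancestor levels become insufficient when level 0 loses one
 * entry, assuming each insufficient node is merged with a sibling and
 * therefore removes one entry from its own parent.
 */
size_t cascade_levels(const unsigned int *entries, size_t n,
		      unsigned int min_entries)
{
	size_t i;

	assert(entries != NULL || n == 0);

	for (i = 0; i < n; i++) {
		/* More than the minimum: this level absorbs the loss. */
		if (entries[i] > min_entries)
			break;
	}

	return i;
}
```

For example, with a minimum of 5 entries and ancestors holding {5, 5, 9}
entries, the first two levels would be rebalanced and the third absorbs
the loss.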

For spanning writes, a similar situation occurs. At the location at
which a spanning write is detected, the ancestor nodes may similarly
need to be rebalanced into a smaller number of nodes and the same
cascading situation could occur.

To use less than the full height of the tree for the number of
allocations, we also need to track the height at which a non-leaf node
cannot become insufficient. This means even if a rebalance occurs to a
child of this node, it currently has enough entries that it can lose one
without any further action. This field is stored in the maple write state
as sufficient height. In mas_prealloc_calc() when figuring out how many
nodes to allocate, we check if the vacant node is lower in the tree
than a sufficient node (has a larger value). If it is, we cannot use the
vacant height and must use the difference between the tree height and the
sufficient height as the basis for the number of nodes needed.
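
The check in mas_prealloc_calc() can be modeled outside the kernel as
follows. This is a sketch under assumptions: struct height_info and
nodes_needed() are hypothetical names, and delta is assumed to be the tree
height minus the vacant height, since its definition is not part of this
patch.

```c
#include <assert.h>

/* Hypothetical stand-in for the values the calculation reads. */
struct height_info {
	unsigned char height;		 /* height of the tree */
	unsigned char vacant_height;	 /* depth of lowest node with free space */
	unsigned char sufficient_height; /* depth of lowest node that can lose an entry */
};

/*
 * nodes_per_level is 2 for a rebalance and 3 for a spanning store,
 * mirroring the multipliers used in the patch.
 */
int nodes_needed(const struct height_info *h, int nodes_per_level)
{
	unsigned char basis;

	assert(h->vacant_height <= h->height);
	assert(h->sufficient_height <= h->height);

	if (h->sufficient_height < h->vacant_height)
		/* The vacant node sits below the sufficient node. */
		basis = h->height - h->sufficient_height;
	else
		/* "delta" in the patch, assumed to be height - vacant_height */
		basis = h->height - h->vacant_height;

	return basis * nodes_per_level + 1;
}
```

With height 4, vacant height 3 and sufficient height 1, a rebalance would
be sized at (4 - 1) * 2 + 1 = 7 nodes. With vacant height 2 and sufficient
height 3, the model falls back to delta and gives 2 * 2 + 1 = 5.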

Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
---
 include/linux/maple_tree.h       |  4 +++-
 lib/maple_tree.c                 | 17 +++++++++++++++--
 tools/testing/radix-tree/maple.c | 28 ++++++++++++++++++++++++++++
 3 files changed, 46 insertions(+), 3 deletions(-)

Comments

Vasily Gorbik March 10, 2025, 6:13 p.m. UTC | #1
On Thu, Feb 27, 2025 at 08:48:22PM +0000, Sidhartha Kumar wrote:
> In order to support rebalancing and spanning stores using less than the
> worst case number of nodes, we need to track more than just the vacant
> height. Using only vacant height to reduce the worst case maple node
> allocation count can lead to a shortage of nodes in the following
> scenarios.
...
> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
> ---
>  include/linux/maple_tree.h       |  4 +++-
>  lib/maple_tree.c                 | 17 +++++++++++++++--
>  tools/testing/radix-tree/maple.c | 28 ++++++++++++++++++++++++++++
>  3 files changed, 46 insertions(+), 3 deletions(-)

Hi Sidhartha,

Starting from this commit, the LTP test "linkat02" consistently triggers
a kernel WARNING followed by a crash, at least on s390 (and probably on
other big-endian architectures as well). The maple tree selftest passes
successfully.

[  233.489583] LTP: starting linkat02
linkat02    0  TINFO  :  Using /tmp/ltp-8P2ZJL0mgN/LTP_lin3flG7N as tmpdir (tmpfs filesystem)
linkat02    0  TINFO  :  Found free device 0 '/dev/loop0'
[  234.187957] loop0: detected capacity change from 0 to 614400
linkat02    0  TINFO  :  Formatting /dev/loop0 with ext2 opts='' extra opts=''
mke2fs 1.47.1 (20-May-2024)
[  234.571157] operation not supported error, dev loop0, sector 614272 op 0x9:(WRITE_ZEROES) flags 0x10000800 phys_seg 0 prio class 0
linkat02    0  TINFO  :  Mounting /dev/loop0 to /tmp/ltp-8P2ZJL0mgN/LTP_lin3flG7N/mntpoint fstyp=ext2 flags=0
[  234.690816] EXT4-fs (loop0): mounting ext2 file system using the ext4 subsystem
[  234.696090] EXT4-fs (loop0): mounted filesystem 29120d07-e10b-43b8-bfb0-6156683a2769 r/w without journal. Quota mode: none.
linkat02    0  TINFO  :  Failed reach the hardlinks limit
[  239.616047] ------------[ cut here ]------------
[  239.616231] WARNING: CPU: 0 PID: 669 at lib/maple_tree.c:1156 mas_pop_node+0x220/0x290
[  239.616252] Modules linked in:
[  239.616292] CPU: 0 UID: 0 PID: 669 Comm: linkat02 Not tainted 6.14.0-rc5-next-20250307 #29
[  239.616305] Hardware name: IBM 3931 A01 704 (KVM/Linux)
[  239.616315] Krnl PSW : 0704c00180000000 00007fffe2b6c314 (mas_pop_node+0x224/0x290)
[  239.616334]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[  239.616349] Krnl GPRS: 0000000000000005 001c0feffc355f67 00007f7fe1aafb38 001c000000000000
[  239.616360]            001c000000000000 001c0fef00007f05 0000000000000000 ffffffffffffffff
[  239.616371]            00007f7fe1aaf3e0 00007f7fe1aafb08 001c000000000000 0000000000000000
[  239.616381]            0000000001026838 0000000000000005 00007f7fe1aaf020 00007f7fe1aaefc8
[  239.616399] Krnl Code: 00007fffe2b6c306: e370a0000024        stg     %r7,0(%r10)
[  239.616399]            00007fffe2b6c30c: a7f4ff83            brc     15,00007fffe2b6c212
[  239.616399]           #00007fffe2b6c310: af000000            mc      0,0
[  239.616399]           >00007fffe2b6c314: a7b90000            lghi    %r11,0
[  239.616399]            00007fffe2b6c318: a7f4ff89            brc     15,00007fffe2b6c22a
[  239.616399]            00007fffe2b6c31c: c0e5fefc1f8a        brasl   %r14,00007fffe0af0230
[  239.616399]            00007fffe2b6c322: a7f4ff4b            brc     15,00007fffe2b6c1b8
[  239.616399]            00007fffe2b6c326: c0e5fefc1fa5        brasl   %r14,00007fffe0af0270
[  239.616454] Call Trace:
[  239.616463]  [<00007fffe2b6c314>] mas_pop_node+0x224/0x290
[  239.616475]  [<00007fffe2b85ab6>] mas_spanning_rebalance+0x3006/0x4e90
[  239.616487]  [<00007fffe2b87e7a>] mas_rebalance+0x53a/0x9c0
[  239.616627]  [<00007fffe2b8c10a>] mas_wr_bnode+0x14a/0x1a0
[  239.616639]  [<00007fffe2b9a87c>] mas_erase+0xd9c/0x1120
[  239.616650]  [<00007fffe2b9acbe>] mtree_erase+0xbe/0xf0
[  239.616661]  [<00007fffe0c3b4d2>] simple_offset_remove+0x52/0x90
[  239.616674]  [<00007fffe093dc16>] shmem_unlink+0xb6/0x320
[  239.616686]  [<00007fffe0bc0830>] vfs_unlink+0x270/0x760
[  239.616698]  [<00007fffe0bd473a>] do_unlinkat+0x40a/0x5c0
[  239.616709]  [<00007fffe0bd4a48>] __s390x_sys_unlink+0x58/0x70
[  239.616720]  [<00007fffe0155356>] do_syscall+0x2f6/0x430
[  239.616733]  [<00007fffe2bd3668>] __do_syscall+0xc8/0x1c0
[  239.616746]  [<00007fffe2bf70d4>] system_call+0x74/0x98
[  239.616758] 4 locks held by linkat02/669:
[  239.616769]  #0: 0000780097e89450 (sb_writers#8){.+.+}-{0:0}, at: mnt_want_write+0x4c/0xc0
[  239.616799]  #1: 00007800a7de6cd0 (&type->i_mutex_dir_key#5/1){+.+.}-{3:3}, at: do_unlinkat+0x1f8/0x5c0
[  239.616831]  #2: 00007800a7de7ac0 (&sb->s_type->i_mutex_key#12){+.+.}-{3:3}, at: vfs_unlink+0xc6/0x760
[  239.616860]  #3: 00007800a7de6a58 (&simple_offset_lock_class){+.+.}-{2:2}, at: mtree_erase+0xb4/0xf0
[  239.616886] Last Breaking-Event-Address:
[  239.616895]  [<00007fffe2b6c12a>] mas_pop_node+0x3a/0x290
[  239.616909] irq event stamp: 5205821
[  239.616918] hardirqs last  enabled at (5205831): [<00007fffe03d2be8>] __up_console_sem+0xe8/0x130
[  239.616931] hardirqs last disabled at (5205840): [<00007fffe03d2bc6>] __up_console_sem+0xc6/0x130
[  239.616943] softirqs last  enabled at (5200824): [<00007fffe0246b6c>] handle_softirqs+0x6dc/0xe30
[  239.616955] softirqs last disabled at (5200687): [<00007fffe024508a>] __irq_exit_rcu+0x34a/0x3f0
[  239.616994] ---[ end trace 0000000000000000 ]---
[  239.617009] Unable to handle kernel pointer dereference in virtual kernel address space
[  239.617019] Failing address: 0000000000000000 TEID: 0000000000000483
[  239.617029] Fault in home space mode while using kernel ASCE.
[  239.617049] AS:0000000005dac00b R2:00000001ffffc00b R3:00000001ffff8007 S:00000001ffff7801 P:000000000000013d
[  239.617150] Oops: 0004 ilc:3 [#1] PREEMPT SMP
[  239.617162] Modules linked in:
[  239.617170] CPU: 0 UID: 0 PID: 669 Comm: linkat02 Tainted: G        W           6.14.0-rc5-next-20250307 #29
[  239.617243] Tainted: [W]=WARN
[  239.617248] Hardware name: IBM 3931 A01 704 (KVM/Linux)
[  239.617253] Krnl PSW : 0704c00180000000 00007fffe2b6a988 (mab_mas_cp+0x168/0x640)
[  239.617264]            R:0 T:1 IO:1 EX:1 Key:0 M:1 W:0 P:0 AS:3 CC:0 PM:0 RI:0 EA:3
[  239.617272] Krnl GPRS: 0000000000000008 0000000000000000 00007fff00000008 00007f7f00000009
[  239.617279]            0000000000000008 001c000000000000 0000000000000008 0000000000000048
[  239.617285]            001c0ffffc638e09 001c000000000009 0000000000000098 001c000000000000
[  239.617292]            0000000001026838 00007f7fe1aaf608 001c000000000013 00007f7fe1aaef68
[  239.617302] Krnl Code: 00007fffe2b6a97c: b90800e5            agr     %r14,%r5
[  239.617302]            00007fffe2b6a980: 9500e000            cli     0(%r14),0
[  239.617302]           #00007fffe2b6a984: a7740262            brc     7,00007fffe2b6ae48
[  239.617302]           >00007fffe2b6a988: e548a0000000        mvghi   0(%r10),0
[  239.617302]            00007fffe2b6a98e: e3a0f0c00004        lg      %r10,192(%r15)
[  239.617302]            00007fffe2b6a994: a7b80000            lhi     %r11,0
[  239.617302]            00007fffe2b6a998: eb2a0003000d        sllg    %r2,%r10,3
[  239.617302]            00007fffe2b6a99e: e320f0f00024        stg     %r2,240(%r15)
[  239.617350] Call Trace:
[  239.617354]  [<00007fffe2b6a988>] mab_mas_cp+0x168/0x640
[  239.617362]  [<00007fffe2b85bcc>] mas_spanning_rebalance+0x311c/0x4e90
[  239.617369]  [<00007fffe2b87e7a>] mas_rebalance+0x53a/0x9c0
[  239.617376]  [<00007fffe2b8c10a>] mas_wr_bnode+0x14a/0x1a0
[  239.617383]  [<00007fffe2b9a87c>] mas_erase+0xd9c/0x1120
[  239.617389]  [<00007fffe2b9acbe>] mtree_erase+0xbe/0xf0
[  239.617396]  [<00007fffe0c3b4d2>] simple_offset_remove+0x52/0x90
[  239.617403]  [<00007fffe093dc16>] shmem_unlink+0xb6/0x320
[  239.617410]  [<00007fffe0bc0830>] vfs_unlink+0x270/0x760
[  239.617416]  [<00007fffe0bd473a>] do_unlinkat+0x40a/0x5c0
[  239.617422]  [<00007fffe0bd4a48>] __s390x_sys_unlink+0x58/0x70
[  239.617429]  [<00007fffe0155356>] do_syscall+0x2f6/0x430
[  239.617436]  [<00007fffe2bd3668>] __do_syscall+0xc8/0x1c0
[  239.617443]  [<00007fffe2bf70d4>] system_call+0x74/0x98
[  239.617450] INFO: lockdep is turned off.
[  239.617454] Last Breaking-Event-Address:
[  239.617458]  [<00007fffe2b6a8f4>] mab_mas_cp+0xd4/0x640
[  239.617468] Kernel panic - not syncing: Fatal exception: panic_on_oops
Sidhartha Kumar March 10, 2025, 9:01 p.m. UTC | #2
On 3/10/25 2:13 PM, Vasily Gorbik wrote:
> On Thu, Feb 27, 2025 at 08:48:22PM +0000, Sidhartha Kumar wrote:
>> In order to support rebalancing and spanning stores using less than the
>> worst case number of nodes, we need to track more than just the vacant
>> height. Using only vacant height to reduce the worst case maple node
>> allocation count can lead to a shortage of nodes in the following
>> scenarios.
> ...
>> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
>> ---
>>   include/linux/maple_tree.h       |  4 +++-
>>   lib/maple_tree.c                 | 17 +++++++++++++++--
>>   tools/testing/radix-tree/maple.c | 28 ++++++++++++++++++++++++++++
>>   3 files changed, 46 insertions(+), 3 deletions(-)
> 
> Hi Sidhartha,
> 
> Starting from this commit, the LTP test "linkat02" consistently triggers
> a kernel WARNING followed by a crash, at least on s390 (and probably on
> other big-endian architectures as well). The maple tree selftest passes
> successfully.
> 

Hi,

Thanks for reporting this. It doesn't appear to reproduce on x86, so
I'll try to reproduce it on a virtualized s390 system. Andrew, could you
revert this series from mm-unstable as I work on fixing this?

Thanks,
Sid




Andrew Morton March 10, 2025, 10:27 p.m. UTC | #3
On Mon, 10 Mar 2025 17:01:28 -0400 Sidhartha Kumar <sidhartha.kumar@oracle.com> wrote:

> Andrew, could you revert this 
> series from mm-unstable as I work on fixing this?

I have done so, thanks.
Sidhartha Kumar March 11, 2025, 2:40 p.m. UTC | #4
On 3/10/25 2:13 PM, Vasily Gorbik wrote:
> On Thu, Feb 27, 2025 at 08:48:22PM +0000, Sidhartha Kumar wrote:
>> In order to support rebalancing and spanning stores using less than the
>> worst case number of nodes, we need to track more than just the vacant
>> height. Using only vacant height to reduce the worst case maple node
>> allocation count can lead to a shortage of nodes in the following
>> scenarios.
> ...
>> Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
>> ---
>>   include/linux/maple_tree.h       |  4 +++-
>>   lib/maple_tree.c                 | 17 +++++++++++++++--
>>   tools/testing/radix-tree/maple.c | 28 ++++++++++++++++++++++++++++
>>   3 files changed, 46 insertions(+), 3 deletions(-)
> 
> Hi Sidhartha,
> 
> Starting from this commit, the LTP test "linkat02" consistently triggers
> a kernel WARNING followed by a crash, at least on s390 (and probably on
> other big-endian architectures as well). The maple tree selftest passes
> successfully.
> 

Would you be able to rerun this test with the following diff applied and
CONFIG_DEBUG_MAPLE_TREE=y so I can check the state of the tree and more
easily reproduce the error?

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index ea1f0acac118..74a4b1924a55 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1153,8 +1153,11 @@ static inline struct maple_node *mas_pop_node(struct ma_state *mas)
         unsigned int req = mas_alloc_req(mas);

         /* nothing or a request pending. */
-       if (WARN_ON(!total))
+       if (WARN_ON(!total)) {
+               mas_dump(mas);
+               mt_dump(mas->tree, mt_dump_hex);
                 return NULL;
+       }

         if (total == 1) {
                 /* single allocation in this ma_state */



Patch

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 7d777aa2d9ed..37dc9525dff6 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -464,6 +464,7 @@  struct ma_wr_state {
 	void *entry;			/* The entry to write */
 	void *content;			/* The existing entry that is being overwritten */
 	unsigned char vacant_height;	/* Depth of lowest node with free space */
+	unsigned char sufficient_height;/* Depth of lowest node with min sufficiency + 1 nodes */
 };
 
 #define mas_lock(mas)           spin_lock(&((mas)->tree->ma_lock))
@@ -499,7 +500,8 @@  struct ma_wr_state {
 		.mas = ma_state,					\
 		.content = NULL,					\
 		.entry = wr_entry,					\
-		.vacant_height = 0					\
+		.vacant_height = 0,					\
+		.sufficient_height = 0					\
 	}
 
 #define MA_TOPIARY(name, tree)						\
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index c859c7253d69..d3aa5241166b 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -3560,6 +3560,13 @@  static bool mas_wr_walk(struct ma_wr_state *wr_mas)
 		if (mas->end < mt_slots[wr_mas->type] - 1)
 			wr_mas->vacant_height = mas->depth + 1;
 
+		if (ma_is_root(mas_mn(mas))) {
+			/* root needs more than 2 entries to be sufficient + 1 */
+			if (mas->end > 2)
+				wr_mas->sufficient_height = 1;
+		} else if (mas->end > mt_min_slots[wr_mas->type] + 1)
+			wr_mas->sufficient_height = mas->depth + 1;
+
 		mas_wr_walk_traverse(wr_mas);
 	}
 
@@ -4195,13 +4202,19 @@  static inline int mas_prealloc_calc(struct ma_wr_state *wr_mas, void *entry)
 			ret = 0;
 		break;
 	case wr_spanning_store:
-		WARN_ON_ONCE(ret != height * 3 + 1);
+		if (wr_mas->sufficient_height < wr_mas->vacant_height)
+			ret = (height - wr_mas->sufficient_height) * 3 + 1;
+		else
+			ret = delta * 3 + 1;
 		break;
 	case wr_split_store:
 		ret = delta * 2 + 1;
 		break;
 	case wr_rebalance:
-		ret = height * 2 + 1;
+		if (wr_mas->sufficient_height < wr_mas->vacant_height)
+			ret = (height - wr_mas->sufficient_height) * 2 + 1;
+		else
+			ret = delta * 2 + 1;
 		break;
 	case wr_node_store:
 		ret = mt_in_rcu(mas->tree) ? 1 : 0;
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 5950a7c9b27f..dc8e40b94c3f 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -36311,6 +36311,30 @@  static noinline void __init check_mtree_dup(struct maple_tree *mt)
 
 extern void test_kmem_cache_bulk(void);
 
+/*
+ * Test to check the path of a spanning rebalance which results in
+ * a collapse where the rebalancing of the child node leads to
+ * insufficiency in the parent node.
+ */
+static void check_collapsing_rebalance(struct maple_tree *mt)
+{
+	int i = 0;
+	MA_STATE(mas, mt, ULONG_MAX, ULONG_MAX);
+
+	/* create a height 4 tree */
+	while (mt_height(mt) < 4) {
+		mtree_store_range(mt, i, i + 10, xa_mk_value(i), GFP_KERNEL);
+		i += 9;
+	}
+
+	/* delete all entries one at a time, starting from the right */
+	do {
+		mas_erase(&mas);
+	} while (mas_prev(&mas, 0) != NULL);
+
+	mtree_unlock(mt);
+}
+
 /* callback function used for check_nomem_writer_race() */
 static void writer2(void *maple_tree)
 {
@@ -36477,6 +36501,10 @@  void farmer_tests(void)
 	check_spanning_write(&tree);
 	mtree_destroy(&tree);
 
+	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+	check_collapsing_rebalance(&tree);
+	mtree_destroy(&tree);
+
 	mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
 	check_null_expand(&tree);
 	mtree_destroy(&tree);