Message ID: 20220215143728.3810954-1-Liam.Howlett@oracle.com (mailing list archive)
Series: Introducing the Maple Tree
On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:

> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently. There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface. The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct. The long term goal is to reduce
> or remove the mmap_sem contention.

Has a path been chosen which gets us from here to significant reduction
in mmap_lock overhead?

If so, what's the plan and what must be done?

Thanks.
On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > [... cover letter snipped ...]
>
> Has a path been chosen which gets us from here to significant reduction
> in mmap_lock overhead?
>
> If so, what's the plan and what must be done?

I would say there are still competing ideas for how that is to be done.

First, the Maple Tree is independent (to a certain extent) of all the
approaches to reducing mmap_lock overhead. It's a better data structure
for VMA lookup than the rbtree. Set against that, it has higher overhead
for modifications. That means that benchmarks which primarily measure
modification overhead see worse performance. We believe this is not
representative of real workloads, and indeed we see ~parity on workloads
like git and make -jN.

The advantage to this patch set is that we get rid of the vmacache
entirely, we get rid of the doubly-linked list chaining VMAs together,
and we get rid of the rbtree usage. And we add a new generic data
structure that hopefully I can point people towards instead of using the
XArray inappropriately.

Both of the approaches to avoiding taking the mmap_lock in the fault
path will benefit from using the maple tree for VMAs. Fewer cache misses
and RCU safety are things we all want. The maple tree will also benefit
if/when either approach is merged; because modifications to the maple
tree are more heavyweight than the rbtree, they block readers for
longer.

The SPF approach is further along. It has been tested in the past with
the maple tree, but not in the most recent iteration of either. It
doesn't rely on the maple tree, but it does benefit.

The RCU-freeing-of-VMAs approach that I favour has not received much
attention. I've been busy with other things ;-) It offers much more
opportunity for performance improvement, but will take longer to arrive.
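A quick illustration of the "simple interface" mentioned above, for
readers who have not looked at the series: the sketch below uses the
basic mtree_* API the patches introduce, simplified (the advanced
mas_* state-based API is omitted; treat this as an orientation aid,
not authoritative documentation of the series).

	#include <linux/maple_tree.h>
	#include <linux/gfp.h>
	#include <linux/printk.h>

	static DEFINE_MTREE(mt);	/* empty tree; lookups are RCU-safe */
	static int payload = 42;	/* any kernel pointer can be an entry */

	static int maple_tree_example(void)
	{
		void *entry;
		unsigned long index = 0;
		int ret;

		/* Store one entry covering the whole range [0x1000, 0x1fff]. */
		ret = mtree_store_range(&mt, 0x1000, 0x1fff, &payload, GFP_KERNEL);
		if (ret)
			return ret;

		/* Any index inside the range returns the entry. */
		entry = mtree_load(&mt, 0x1234);	/* == &payload */

		/* Walk all entries with indices up to 0xffff. */
		mt_for_each(&mt, entry, index, 0xffff)
			pr_info("found entry %p\n", entry);

		mtree_destroy(&mt);
		return 0;
	}

Because one entry covers a whole range of indices, looking up the VMA
containing an address becomes a single descent of a shallow, wide tree,
which is where the cache-miss argument above comes from.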
On Wed, Feb 16, 2022 at 08:24:34PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> > [...]
> > Has a path been chosen which gets us from here to significant reduction
> > in mmap_lock overhead?
> >
> > If so, what's the plan and what must be done?
>
> I would say there are still competing ideas for how that is to be done.
>
> First, the Maple Tree is independent (to a certain extent) of all the
> approaches to reducing mmap_lock overhead. It's a better data structure
> for VMA lookup than the rbtree. Set against that, it has higher overhead
> for modifications. That means that benchmarks which primarily measure
> modification overhead see worse performance. We believe this is not
> representative of real workloads, and indeed we see ~parity on workloads
> like git and make -jN.

I'm way behind and only got around to looking at SPF properly today.
Maple is working its way up my list and I need to gather new data, but I
did have old data queued from a time when I thought I would get around
to maple tree soon. The big result that stood out was brk performance
from will-it-scale, specifically for processes:

wis-malloc
                                    5.17.0-rc3             5.17.0-rc3
                                       vanilla       mm-maplevma-v5r1
Hmean  brk1-processes-2      5265361.84 (   0.00%)   3489097.63 * -33.73%*
Hmean  brk1-processes-5     12270382.36 (   0.00%)   8136542.13 * -33.69%*
Hmean  brk1-processes-8     19560585.95 (   0.00%)  12941094.22 * -33.84%*
Hmean  brk1-processes-12    27328267.18 (   0.00%)  18148145.64 * -33.59%*
Hmean  brk1-processes-21    39599378.04 (   0.00%)  26188063.96 * -33.87%*
Hmean  brk1-processes-30    59880571.20 (   0.00%)  39653658.00 * -33.78%*
Hmean  brk1-processes-48    72900335.59 (   0.00%)  49044880.84 * -32.72%*
Hmean  brk1-processes-79    72388042.42 (   0.00%)  48416196.93 * -33.12%*
Hmean  brk1-processes-110   72333206.18 (   0.00%)  48437222.12 * -33.04%*
Hmean  brk1-processes-141   72443841.68 (   0.00%)  48490158.48 * -33.07%*
Hmean  brk1-processes-172   72187451.54 (   0.00%)  48480696.77 * -32.84%*
Hmean  brk1-processes-203   72575945.95 (   0.00%)  48503896.03 * -33.17%*
Hmean  brk1-processes-234   67509812.97 (   0.00%)  48466347.95 * -28.21%*
Hmean  brk1-processes-265   72379447.48 (   0.00%)  48449330.06 * -33.06%*
Hmean  brk1-processes-296   72241861.59 (   0.00%)  48303854.85 * -33.14%*
Hmean  brk1-processes-320   72183320.07 (   0.00%)  48253708.21 * -33.15%*
Hmean  brk1-threads-2        1352297.67 (   0.00%)   2963406.42 * 119.14%*
Hmean  brk1-threads-5         814358.55 (   0.00%)   1233604.84 *  51.48%*
Hmean  brk1-threads-8         849900.52 (   0.00%)   1241170.76 *  46.04%*
Hmean  brk1-threads-12        804143.24 (   0.00%)   1232451.60 *  53.26%*
Hmean  brk1-threads-21        672680.28 (   0.00%)   1001257.38 *  48.85%*
Hmean  brk1-threads-30        443221.73 (   0.00%)    713592.66 *  61.00%*
Hmean  brk1-threads-48        397283.95 (   0.00%)    591349.83 *  48.85%*
Hmean  brk1-threads-79        167479.04 (   0.00%)    548681.03 * 227.61%*
Hmean  brk1-threads-110       154108.98 (   0.00%)    207349.67 *  34.55%*
Hmean  brk1-threads-141       166712.91 (   0.00%)    217314.50 *  30.35%*
Hmean  brk1-threads-172       155023.61 (   0.00%)    210318.59 *  35.67%*
Hmean  brk1-threads-203       166943.60 (   0.00%)    227892.26 *  36.51%*

The mmtests configuration this was based on no longer exists but the
equivalent should be

configs/config-workload-will-it-scale-pf-processes
configs/config-workload-will-it-scale-pf-threads

It shouldn't be too hard to pull out the brk1 test on its own and see
what it's doing in profiles. brk_test from aim9 also suffers, so it may
be brk-specific. Other workloads looked ok; kernel building took a small
-1.5% performance hit on some machines but not enough that I would
scream. Others looked mostly neutral, but it was only a provisional
sniff test before I even considered looking at 70 patches :O
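Pulling brk1 out on its own is indeed straightforward; a rough recipe in
the same spirit as the commands later in this thread (the repository is
will-it-scale's upstream; treat the exact binary name and flags as
assumptions and check the test's usage output):

$ git clone https://github.com/antonblanchard/will-it-scale
$ cd will-it-scale && make
$ perf record -g ./brk1_processes -t 8    # assumed invocation
$ perf report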
On Wed, Feb 23, 2022 at 04:35:09PM +0000, Mel Gorman wrote:
> On Wed, Feb 16, 2022 at 08:24:34PM +0000, Matthew Wilcox wrote:
> > [...]
>
> I'm way behind and only got around to looking at SPF properly today.
> Maple is working its way up my list and I need to gather new data, but
> I did have old data queued from a time when I thought I would get
> around to maple tree soon. The big result that stood out was brk
> performance from will-it-scale, specifically for processes

Yup, we know about the brk1 regression. It's a really unrealistic
benchmark, which is why we added brk2. To quote the commit message:

Linux has this horrendously complicated anon_vma structure that you
don't care about, but the upshot is that after calling fork(), each
process that calls brk() gets a _new_ VMA created. That is, after
calling brk() the first time, the process address space looks like this:

557777fab000-557777ff0000 rw-p 00000000 00:00 0    [heap]
557777ff0000-557777ff1000 rw-p 00000000 00:00 0    [heap]

so what brk1 is actually testing is how long it takes to create &
destroy a new VMA. This does not match what most programs do -- most
will call exec() which resets the anon_vma structures and starts each
program off with its own heap. And if you do have a multi-process
program which uses brk(), chances are it doesn't just oscillate between
zero and one extra pages of heap compared to its parent.

A better test starts out by allocating one page on the heap and then
throbs between one and two pages instead of throbbing between zero and
one page. That means we're actually testing expanding and contracting
the heap instead of creating and destroying a new heap.

For realism, I wanted to add actually accessing the memory in the new
heap, but that doesn't work for the threaded case -- another thread
might remove the memory you just allocated while you're allocating it.
Threaded programs give each thread its own heap anyway, so this is kind
of a pointless syscall to ask about its threaded scalability.

Anyway, here's brk2.c. It is not very different from brk1.c, but the
performance results are quite different (actually worse by about
10-15%).
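The brk2.c source itself is not reproduced in this archive. As an
orientation aid only, here is a hypothetical reconstruction of the loop
described above, in will-it-scale's testcase style (the testcase() and
testcase_description shape follows will-it-scale's harness; everything
else is inferred from the commit message, not taken from the actual
brk2.c):

	#include <unistd.h>
	#include <assert.h>

	char *testcase_description = "brk throb between one and two extra heap pages";

	void testcase(unsigned long long *iterations, unsigned long nr)
	{
		unsigned long page_size = getpagesize();
		/*
		 * Start with one page already on the heap, so the loop below
		 * expands and contracts an existing [heap] VMA instead of
		 * repeatedly creating and destroying one (the brk1 behaviour).
		 */
		char *heap = (char *)sbrk(0) + page_size;

		assert(brk(heap) == 0);

		while (1) {
			assert(brk(heap + page_size) == 0);	/* one -> two pages */
			assert(brk(heap) == 0);			/* two -> one page */
			(*iterations) += 2;
		}
	}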
On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently. There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface. The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct. The long term goal is to reduce
> or remove the mmap_sem contention.
>
> The tree has a branching factor of 10 for non-leaf nodes and 16 for
> leaf nodes. With the increased branching factor, it is significantly
> shorter than the rbtree so it has fewer cache misses. The removal of
> the linked list between subsequent entries also reduces the cache
> misses and the need to pull in the previous and next VMA during many
> tree alterations.
>
> This patch is based on v5.17-rc4
>
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214

Just a heads-up. I noticed a use-after-free in today's linux-next below.
I am running out of time to fully triage this, but I noticed this is the
only series (not sure which revision) that heavily touched mm/mmap.c
recently there.

BUG: KASAN: use-after-free in move_vma.isra.0
Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280

CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242
Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
Call trace:
 dump_backtrace
 show_stack
 dump_stack_lvl
 print_address_description.constprop.0
 kasan_report
 __asan_report_load8_noabort
 move_vma.isra.0
  move_vma at /usr/src/linux-next/mm/mremap.c:714
 __do_sys_mremap
 __arm64_sys_mremap
 invoke_syscall
 el0_svc_common.constprop.0
 do_el0_svc
 el0_svc
 el0t_64_sync_handler
 el0t_64_sync

Allocated by task 1280:
 kasan_save_stack
 __kasan_slab_alloc
 slab_post_alloc_hook
 kmem_cache_alloc
 vm_area_alloc
  vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455
 mmap_region
  mmap_region at /usr/src/linux-next/mm/mmap.c:2585
 do_mmap
 vm_mmap_pgoff
 ksys_mmap_pgoff
 __arm64_sys_mmap
 invoke_syscall
 el0_svc_common.constprop.0
 do_el0_svc
 el0_svc
 el0t_64_sync_handler
 el0t_64_sync

Freed by task 1280:
 kasan_save_stack
 kasan_set_track
 kasan_set_free_info
 __kasan_slab_free
 kmem_cache_free
 vm_area_free
 remove_vma
  remove_vma at /usr/src/linux-next/mm/mmap.c:187
 do_mas_align_munmap.isra.0
  remove_mt at /usr/src/linux-next/mm/mmap.c:2176
  (inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437
 do_mas_munmap
  do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483
 do_munmap
 move_vma.isra.0
 __do_sys_mremap
 __arm64_sys_mremap
 invoke_syscall
 el0_svc_common.constprop.0
 do_el0_svc
 el0_svc
 el0t_64_sync_handler
 el0t_64_sync

The buggy address belongs to the object at ffff0009ce752aa8
 which belongs to the cache vm_area_struct of size 144
The buggy address is located 32 bytes inside of
 144-byte region [ffff0009ce752aa8, ffff0009ce752b38)
The buggy address belongs to the page:
page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750
head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0
memcg:ffff0009ce456e01
flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380
raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb
                                  ^
 ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
 ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
* Qian Cai <quic_qiancai@quicinc.com> [220224 22:49]:
> On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> > [... cover letter snipped ...]
>
> Just a heads-up. I noticed a use-after-free in today's linux-next
> below. I am running out of time to fully triage this, but I noticed
> this is the only series (not sure which revision) that heavily touched
> mm/mmap.c recently there.
>
> [... KASAN use-after-free report snipped ...]

Thank you for the report. I will try to reproduce it with a kvm here.

Cheers,
Liam
* Liam Howlett <liam.howlett@oracle.com> [220225 14:09]:
> * Qian Cai <quic_qiancai@quicinc.com> [220224 22:49]:
> > Just a heads-up. I noticed a use-after-free in today's linux-next
> > below. [... KASAN report snipped ...]
>
> Thank you for the report. I will try to reproduce it with a kvm here.

I just booted an arm64 VM with my build and kasan enabled with no issue.
Could you please send me your config file for the build?

Thanks,
Liam
On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> I just booted an arm64 VM with my build and kasan enabled with no issue.
> Could you please send me your config file for the build?

On linux-next, I just do:

$ make arch=arm64 defconfig debug.config [1]

Then, I just generate some memory pressure (enough to reach swapping or
the OOM killer) to trigger it.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config
On Fri, Feb 25, 2022 at 03:46:52PM -0500, Qian Cai wrote:
> On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> > I just booted an arm64 VM with my build and kasan enabled with no issue.
> > Could you please send me your config file for the build?
>
> On linux-next, I just do:
>
> $ make arch=arm64 defconfig debug.config [1]
>
> Then, I just generate some memory pressure (enough to reach swapping or
> the OOM killer) to trigger it.
>
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config

Is the stacktrace [1] related to the conflict that Mark encountered [2]
while merging the maple and folio trees? Booting a next-20220223 kernel
on my Raspberry Pi 3 and 4 shows constant NULL pointer dereferences
(just ARCH=arm and ARCH=arm64 defconfigs) and reverting the folio and
maple tree merges makes everything work properly again.

[1]: https://lore.kernel.org/r/YhhRrBpXTFolUAKi@qian/
[2]: https://lore.kernel.org/r/20220224011653.1380557-1-broonie@kernel.org/

Cheers,
Nathan
* Nathan Chancellor <nathan@kernel.org> [220225 18:00]:
> Is the stacktrace [1] related to the conflict that Mark encountered [2]
> while merging the maple and folio trees? Booting a next-20220223 kernel
> on my Raspberry Pi 3 and 4 shows constant NULL pointer dereferences
> (just ARCH=arm and ARCH=arm64 defconfigs) and reverting the folio and
> maple tree merges makes everything work properly again.
>
> [1]: https://lore.kernel.org/r/YhhRrBpXTFolUAKi@qian/
> [2]: https://lore.kernel.org/r/20220224011653.1380557-1-broonie@kernel.org/

Maybe? I'm trying to figure out why it's having issues. I've not been
able to reproduce it with just my maple tree branch. Steven Rostedt
found a bad commit that has been fixed in next-20220224, I believe [1].
It might be best to try next-20220225 and see if you have better luck,
if that's an option.

[1]: https://lore.kernel.org/linux-fsdevel/f6fb6fd4-dcf2-4326-d25e-9a4a9dad5020@fb.com/T/#t

Thanks,
Liam
On Sat, 26 Feb 2022, Liam Howlett wrote:
> * Nathan Chancellor <nathan@kernel.org> [220225 18:00]:
> > [...]
>
> Maybe? I'm trying to figure out why it's having issues. I've not been
> able to reproduce it with just my maple tree branch. Steven Rostedt
> found a bad commit that has been fixed in next-20220224, I believe [1].
> It might be best to try next-20220225 and see if you have better luck,
> if that's an option.
>
> [1]:
> https://lore.kernel.org/linux-fsdevel/f6fb6fd4-dcf2-4326-d25e-9a4a9dad5020@fb.com/T/#t

Hi Liam, I think I have the beginnings of an answer for Qian's issue,
details below; but first off, I'd better make my own opinion clear.

I think this series, however good it may be so far, is undertested and
not approaching ready for 5.18, and should not have been pushed via git
tree into 5.17-rc-late-stage linux-next, subverting Andrew's mmotm.

I believe Stephen and Mark should drop it from linux-next for now, while
linux-next stabilizes for 5.18, while you work on testing and fixing,
and resubmit rebased to 5.18-rc1 or -rc2 when you're ready.

Qian's issue: BUG: KASAN: use-after-free in move_vma.isra.0, with
stacktrace implicating mremap. I don't have KASAN on, but my config on
this laptop does have CONFIG_SLAB=y CONFIG_DEBUG_SLAB=y, and reported
"Slab corruption" during bootup of mmotm 2022-02-24-22-38 (plus Steven
Rostedt's fix to vfs_statx() which you mention in [1], missed from mmotm
- not related to your series of course, but essential for booting here).
Previous mmotm 2022-02-23-21-20 (plus vfs_statx() fix) was okay, but did
not contain the Maple Tree series.

The "vm_area" "Slab corruption" "Single bit error detected" 6b->7b
report is enough to indicate that the VM_ACCOUNT bit gets set in a freed
vma's vm_flags. And looking through "|= VM_ACCOUNT"s, my suspicion fell
on mm/mremap.c's move_vma(). Which I now see is implicated in Qian's
report too.

mremap move's VM_ACCOUNT accounting is difficult: not losing what's
already charged in case the move fails, accounting the extra without
overcharging, in the face of vmas being split and merged: difficult.
And there's an assumption documented in (now) do_mas_align_munmap():
"Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
unmapped vm_area_struct will remain in use".

My suspicion (not verified) is that the maple tree changes are now
violating that; and I doubt that fixing it will be easy (I'm not going
to try) - doable, sure, but needs careful thought. (The way move_vma()
masks out VM_ACCOUNT in a live vma, then adds it back later:
implications for vma merging; all under mmap lock of course, but still
awkward.) Though I did partially verify it, by commenting out the
VM_ACCOUNT adjustments in move_vma(), and booting several times without
any "Slab corruption". And did also kind-of verify it by booting with
#define VM_ACCOUNT 0: I was interested to see if that bit corruption
could account for all the other bugs I get, but sadly no.

Initially I was building without CONFIG_DEBUG_VM_MAPLE_TREE=y and
CONFIG_DEBUG_MAPLE_TREE=y, but have now switched them on. Hit bugs
without them and with them, but now they're on, validate_mm() often
catches something (whether it is correct to complain, I haven't
investigated, but I assume so since the debug option is there, and
problems are seen without it).

I say "often": it's very erratic. Once, a machine booted with mem=1G ran
kernel builds successfully swapping for 4.5 hours before faulting in
__nr_to_section in virt_to_folio ... while doing a __vma_adjust() -
you'll ask me for a full stacktrace, and I'll answer sorry, too many,
please try for yourself. Another time, for 1.5 hours before hitting the
BUG_ON(is_migration_entry(entry) && !PageLocked(p)) in
pfn_swap_entry_to_page() - suggesting anon_vma locking had not been
right while doing page migration (I was exercising THPs a lot). But now,
can I even get it to complete the boot sequence?

(I happened to be sampling /proc/meminfo during those successful runs:
looking afterwards at those samples, I see Committed_AS growing
steadily, whereas samples saved from pre-maple runs were not: that would
correspond to VM_ACCOUNT vm_enough_memory() charges leaking.)

You're having difficulty reproducing: I suggest trying with different
mem= on the boot command line (some of my loads I run with mem=700M,
some mem=1G, some with the workstation mem 8G) - I get the impression
that different mem= jumbles up memory allocations differently, so what's
harmless in one case is quickly harmful in another. Or try changing
between SLAB and SLUB.

One other thing: I haven't studied the source much, but didn't like that
rwsem_acquire(&mm->mmap_lock.dep_map, 0, 0, _THIS_IP_) hack in
exit_mmap(): sneaked into "mm: Start tracking VMAs with maple tree", it
reverts Suren's 64591e8605d6 "mm: protect free_pgtables with mmap_lock
write lock in exit_mmap", without explaining how it becomes safe with
maple tree (I imagine it's not). That would have to be a separate,
justified patch if it goes forward. (The nearby conflict resolutions in
mmotm and next are not quite right there, some stuff my mm/munlock
series deleted has resurfaced: but it's harmless, and not worth worrying
about if maple tree is dropped from linux-next.)

Hugh
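For readers following the VM_ACCOUNT discussion: the masking pattern
Hugh describes looks roughly like this, paraphrased from mainline
mm/mremap.c around v5.17, i.e. before this series (simplified and
abridged; the function name and parameter list here are illustrative,
not the real move_vma() signature):

	#include <linux/mm.h>

	static void vm_account_dance(struct mm_struct *mm,
				     struct vm_area_struct *vma,
				     unsigned long old_addr,
				     unsigned long old_len,
				     struct list_head *uf_unmap)
	{
		unsigned long excess = 0;
		int split = 0;

		if (vma->vm_flags & VM_ACCOUNT) {
			/*
			 * Mask VM_ACCOUNT in the live vma so the munmap of
			 * the old range below does not unaccount memory
			 * that should stay charged.
			 */
			vma->vm_flags &= ~VM_ACCOUNT;
			excess = vma->vm_end - vma->vm_start - old_len;
			if (old_addr > vma->vm_start &&
			    old_addr + old_len < vma->vm_end)
				split = 1;	/* munmap will cut vma in two */
		}

		/* ... page tables are moved and the new vma is set up ... */

		do_munmap(mm, old_addr, old_len, uf_unmap);

		/*
		 * Restore VM_ACCOUNT on whatever is left of the old vma.
		 * This is the "partially unmapped vm_area_struct will
		 * remain in use" assumption: if the munmap above freed
		 * @vma instead of trimming it, both writes below touch
		 * freed memory - consistent with the KASAN use-after-free
		 * reports in this thread.
		 */
		if (excess) {
			vma->vm_flags |= VM_ACCOUNT;
			if (split)
				vma->vm_next->vm_flags |= VM_ACCOUNT;
		}
	}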
On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> [... cover letter snipped ...]
>
> This patch is based on v5.17-rc4
>
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214

Hi Liam,

this patch series completely breaks s390. Besides the endianness issue
in maple trees reported here:

https://lore.kernel.org/all/your-ad-here.call-01645924312-ext-0398@work.hours/

and with this endianness issue fixed (fixup from here ^^^) we still get
numerous KASAN reports starting with

[PATCH v6 10/71] mm: Start tracking VMAs with maple tree

mostly looking like this:

BUG: KASAN: use-after-free in mas_descend_adopt+0x113a/0x1408
Read of size 8 at addr 00000000b1d4e900 by task cat/610

CPU: 34 PID: 610 Comm: cat Tainted: G W 5.17.0-rc4-91313-g592c3b299aad #1
Hardware name: IBM 3906 M04 704 (LPAR)
Call Trace:
 [<000000000234647a>] dump_stack_lvl+0xfa/0x150
 [<0000000002328154>] print_address_description.constprop.0+0x64/0x340
 [<00000000008f5ba6>] kasan_report+0x13e/0x1a8
 [<00000000017f0c12>] mas_descend_adopt+0x113a/0x1408
 [<00000000018076ec>] mas_spanning_rebalance.isra.0+0x5164/0x6a20
 [<000000000180a71e>] mas_wr_spanning_store.isra.0+0x476/0xbc8
 [<00000000018189bc>] mas_store_gfp+0xd4/0x188
 [<0000000000827bae>] vma_mt_szero+0x146/0x368
 [<000000000082e990>] __do_munmap+0x340/0xe20
 [<000000000082f580>] __vm_munmap+0x110/0x1e8
 [<000000000082f76e>] __s390x_sys_munmap+0x6e/0x90
 [<000000000010dc6c>] do_syscall+0x22c/0x328
 [<000000000234d3d2>] __do_syscall+0x9a/0xf8
 [<0000000002374ed2>] system_call+0x82/0xb0
INFO: lockdep is turned off.

Allocated by task 610:
 kasan_save_stack+0x34/0x58
 __kasan_slab_alloc+0x84/0xa8
 kmem_cache_alloc+0x20c/0x520
 mas_alloc_nodes+0x26a/0x4c8
 mas_split.isra.0+0x2aa/0x1418
 mas_wr_modify+0x3fa/0xd28
 mas_store_gfp+0xd4/0x188
 vma_store+0x17a/0x3d8
 vma_link+0xac/0x798
 mmap_region+0xa5a/0x10b8
 do_mmap+0x7c2/0xa90
 vm_mmap_pgoff+0x186/0x250
 ksys_mmap_pgoff+0x334/0x400
 __s390x_sys_old_mmap+0xf4/0x130
 do_syscall+0x22c/0x328
 __do_syscall+0x9a/0xf8
 system_call+0x82/0xb0

Freed by task 610:
 kasan_save_stack+0x34/0x58
 kasan_set_track+0x36/0x48
 kasan_set_free_info+0x34/0x58
 ____kasan_slab_free+0x11c/0x188
 __kasan_slab_free+0x24/0x30
 kmem_cache_free_bulk.part.0+0xec/0x538
 mas_destroy+0x2e4/0x710
 mas_store_gfp+0xf4/0x188
 vma_mt_szero+0x146/0x368
 __vma_adjust+0x155a/0x2188
 __split_vma+0x228/0x450
 mprotect_fixup+0x4f2/0x618
 do_mprotect_pkey.constprop.0+0x328/0x600
 __s390x_sys_mprotect+0x8e/0xb8
 do_syscall+0x22c/0x328
 __do_syscall+0x9a/0xf8
 system_call+0x82/0xb0

The buggy address belongs to the object at 00000000b1d4e900
 which belongs to the cache maple_node of size 256
The buggy address is located 0 bytes inside of
 256-byte region [00000000b1d4e900, 00000000b1d4ea00)
The buggy address belongs to the page:
page:0000400002c75200 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xb1d48
head:0000400002c75200 order:3 compound_mapcount:0 compound_pincount:0
flags: 0x3ffff00000010200(slab|head|node=0|zone=1|lastcpupid=0x1ffff)
raw: 3ffff00000010200 0000400002c91e08 000040000263b608 000000008009ed00
raw: 0000000000000000 0020004000000000 ffffffff00000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 00000000b1d4e800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 00000000b1d4e880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>00000000b1d4e900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                   ^
 00000000b1d4e980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 00000000b1d4ea00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

with an eventual crash.

How come none of the s390 architecture maintainers nor the s390 mailing
list were on CC for changes which affect s390 to that magnitude? None of
this has apparently been tested on s390 or any other big endian system.
And a shoulder tap to give it a try would be helpful. Now we are just
starting to look at the problems. And until the issues are resolved this
patch series has to be dropped from linux-next.
On Sat, 26 Feb 2022, Hugh Dickins wrote:
>
> I say "often": it's very erratic. Once, a machine booted with mem=1G
> ran kernel builds successfully swapping for 4.5 hours before faulting
> in __nr_to_section in virt_to_folio ... while doing a __vma_adjust() -
> you'll ask me for a full stacktrace, and I'll answer sorry, too many,
> please try for yourself. Another time, for 1.5 hours before hitting
> the BUG_ON(is_migration_entry(entry) && !PageLocked(p)) in
> pfn_swap_entry_to_page() - suggesting anon_vma locking had not been
> right while doing page migration (I was exercising THPs a lot). But
> now, can I even get it to complete the boot sequence?

Sorry, I never mentioned the architecture: x86_64 throughout for me.
On Sat, Feb 26, 2022 at 01:58:09AM +0000, Liam Howlett wrote:
> Maybe? I'm trying to figure out why it's having issues. I've not been
> able to reproduce it with just my maple tree branch. Steven Rostedt
> found a bad commit that has been fixed in next-20220224, I believe [1].
> It might be best to try next-20220225 and see if you have better luck,
> if that's an option.

Same thing for next-20220225. I just need to boot it and it will crash
shortly after.

BUG: KASAN: use-after-free in move_vma.isra.0
Read of size 8 at addr ffff00088f74c7e8 by task apt-check/2331

CPU: 2 PID: 2331 Comm: apt-check Tainted: G W 5.17.0-rc5-next-20220225 #245
Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
Call trace:
 dump_backtrace
 show_stack
 dump_stack_lvl
 print_address_description.constprop.0
 kasan_report
 __asan_report_load8_noabort
 move_vma.isra.0
 __do_sys_mremap
 __arm64_sys_mremap
 invoke_syscall
 el0_svc_common.constprop.0
 do_el0_svc
 el0_svc
 el0t_64_sync_handler
 el0t_64_sync

Allocated by task 2331:
 kasan_save_stack
 __kasan_slab_alloc
 slab_post_alloc_hook
 kmem_cache_alloc
 vm_area_dup
 __split_vma
 do_mas_align_munmap.isra.0
 do_mas_munmap
 __vm_munmap
 __arm64_sys_munmap
 invoke_syscall
 el0_svc_common.constprop.0
 do_el0_svc
 el0_svc
 el0t_64_sync_handler
 el0t_64_sync

Freed by task 2331:
 kasan_save_stack
 kasan_set_track
 kasan_set_free_info
 __kasan_slab_free
 kmem_cache_free
 vm_area_free
 remove_vma
 do_mas_align_munmap.isra.0
 do_mas_munmap
 do_munmap
 move_vma.isra.0
 __do_sys_mremap
 __arm64_sys_mremap
 invoke_syscall
 el0_svc_common.constprop.0
 do_el0_svc
 el0_svc
 el0t_64_sync_handler
 el0t_64_sync

The buggy address belongs to the object at ffff00088f74c7c8
 which belongs to the cache vm_area_struct of size 144
The buggy address is located 32 bytes inside of
 144-byte region [ffff00088f74c7c8, ffff00088f74c858)
The buggy address belongs to the physical page:
page:fffffc00223dd300 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x90f74c
head:fffffc00223dd300 order:2 compound_mapcount:0 compound_pincount:0
memcg:ffff000884a75001
flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
raw: 0bfffc0000010200 fffffc00223dd008 fffffc0021e6bc08 ffff000800f5b380
raw: 0000000000000000 0000000000210021 00000001ffffffff ffff000884a75001
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff00088f74c680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff00088f74c700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff00088f74c780: fc fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
                                              ^
 ffff00088f74c800: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
 ffff00088f74c880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
==================================================================
Disabling lock debugging due to kernel taint
Unable to handle kernel paging request at virtual address dfff800000000004
KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
Mem abort info:
 ESR = 0x96000004
 EC = 0x25: DABT (current EL), IL = 32 bits
 SET = 0, FnV = 0
 EA = 0, S1PTW = 0
 FSC = 0x04: level 0 translation fault
Data abort info:
 ISV = 0, ISS = 0x00000004
 CM = 0, WnR = 0
[dfff800000000004] address between user and kernel address ranges
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in: ipmi_devintf ipmi_msghandler cppc_cpufreq drm ip_tables x_tables ipv6 btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq zstd_compress dm_mod crct10dif_ce nvme mlx5_core mpt3sas nvme_core raid_class
CPU: 2 PID: 2331 Comm: apt-check Tainted: G B W 5.17.0-rc5-next-20220225 #245
Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : move_vma.isra.0
lr : move_vma.isra.0
sp : ffff800016df7ad0
x29: ffff800016df7ad0 x28: ffff00088c7ba6c8 x27: 0000000000000001
x26: 0000ffffad180000 x25: ffff00088bea1c40 x24: ffff00088bea1dc8
x23: ffff00088c988040 x22: 0000000001800000 x21: 0000ffffab82d000
x20: 0000000000000000 x19: 1ffff00002dbef6c x18: 0000000000000066
x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
x14: 0000000000000000 x13: 0000000000001490 x12: 0000000000000001
x11: 0000000000000009 x10: 0000000000000000 x9 : ffff800008445544
x8 : 000000003fffffff x7 : ffff80000928fbc4 x6 : 00000000f1f1f1f1
x5 : dfff800000000000 x4 : ffff700002dbef18 x3 : 0000000041b58ab3
x2 : 0000000000000004 x1 : dfff800000000000 x0 : 0000000000000020
Call trace:
 move_vma.isra.0
 __do_sys_mremap
 __arm64_sys_mremap
 invoke_syscall
 el0_svc_common.constprop.0
 do_el0_svc
 el0_svc
 el0t_64_sync_handler
 el0t_64_sync
Code: 91008000 d2d00001 f2fbffe1 d343fc02 (38e16841)
---[ end trace 0000000000000000 ]---
Kernel panic - not syncing: Oops: Fatal exception
SMP: stopping secondary CPUs
Kernel Offset: 0x160000 from 0xffff800008000000
PHYS_OFFSET: 0x80000000
CPU features: 0x00,00000942,40000846
Memory Limit: none
---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
* Hugh Dickins <hughd@google.com> [220226 18:20]:
> [... earlier discussion snipped ...]
>
> Hi Liam, I think I have the beginnings of an answer for Qian's issue,
> details below; but first off, I'd better make my own opinion clear.
>
> I think this series, however good it may be so far, is undertested and
> not approaching ready for 5.18, and should not have been pushed via git
> tree into 5.17-rc-late-stage linux-next, subverting Andrew's mmotm.
>
> I believe Stephen and Mark should drop it from linux-next for now,
> while linux-next stabilizes for 5.18, while you work on testing and
> fixing, and resubmit rebased to 5.18-rc1 or -rc2 when you're ready.

Hugh,

Thank you for your testing, review, and thoughts on the maple tree. I
agree with your findings and see the error, although I still have
difficulty reproducing it on amd64 and arm64 VMs.

Stephen,

Please drop the maple tree from linux-next.

Thank you,
Liam
* Vasily Gorbik <gor@linux.ibm.com> [220226 21:23]:
> On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> > [... cover letter snipped ...]
>
> Hi Liam,
>
> this patch series completely breaks s390. Besides the endianness issue
> in maple trees reported here:
>
> https://lore.kernel.org/all/your-ad-here.call-01645924312-ext-0398@work.hours/
>
> and with this endianness issue fixed (fixup from here ^^^) we still get
> numerous KASAN reports starting with
>
> [PATCH v6 10/71] mm: Start tracking VMAs with maple tree

Thanks for looking at this. That commit had two issues which I was
debugging on parisc for most of last week. The first is that I was not
actually writing the stack expansion to the maple tree in that commit -
it was added in a later commit. The second issue, which was most likely
causing your crashes, was the calculation for the gap in
unmapped_area_topdown().

With the maple tree (3dea5db7fbb5), I am able to boot parisc and s390
virtual machines, although the issue which Hugh discusses in detail
still exists [1]. The latest maple tree can be found on github [2]. I am
still investigating why the heap changed locations on parisc, to ensure
it's not an issue.

1: https://lore.kernel.org/all/5f8f4f-ad63-eb-fd73-d48748af8a76@google.com/
2: https://github.com/oracle/linux-uek/tree/maple/mainline

Regards,
Liam