Message ID: 20220426150616.3937571-1-Liam.Howlett@oracle.com
Series: Introducing the Maple Tree
On Tue, 26 Apr 2022 15:06:19 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:

> Please replace the patches in your mglru-maple branch with this set. It should
> be a drop in replacement for my patch range with the fixes into these
> patches. Adding the preallocation to work around the fs-reclaim LOCKDEP
> issue caused enough changes to the patches to warrant a respin.

OK, thanks. I'll give these a bit of testing locally with a view to
having a run in -next soon.

It's all not looking very reviewed. Any thoughts on who we could
attempt to persuade to help out here?

> The last patch on the branch is still needed to fix vmscan after mglru
> is applied. ee4b1fc24f30 "mm/vmscan: Use VMA_ITERATOR in
> get_next_vma()"

OK, someone please send that at me when the next rev of mglru surfaces?
On Tue, 26 Apr 2022 15:06:19 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:

> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently. There are a number of places in the kernel

I think it would be helpful to expand on "a number of places".
Specifically which places?

> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface. The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct. The long term goal is to reduce
> or remove the mmap_sem contention.

"mmap_lock" ;)

> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes. With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses. The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.

Do we have any quantitative testing results?

What's the plan on utilizing this to further reduce mmap_lock contention?
On Tue, Apr 26, 2022 at 01:08:57PM -0700, Andrew Morton wrote:
> On Tue, 26 Apr 2022 15:06:19 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
>
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently. There are a number of places in the kernel
>
> I think it would be helpful to expand on "a number of places".
> Specifically which places?

The page cache would be a good place to use it once it has a few more
features to bring it up to par with the radix tree. I can go into more
detail if you want. In general, anywhere that's using the IDR/Radix
Tree/XArray would benefit. The radix tree has some pretty poor space
consumption properties, particularly for anyone using the cyclic
variants. Many users of the rbtree would have better performance if
converted to the maple tree.

Ultimately, it's going to be incumbent on people who know their own
chunk of the kernel to say "Oh, yes, I can use that data structure",
rather than on Liam to go around converting everybody's code for them.
* Andrew Morton <akpm@linux-foundation.org> [220426 16:09]:
> On Tue, 26 Apr 2022 15:06:19 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
>
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently. There are a number of places in the kernel
>
> I think it would be helpful to expand on "a number of places".
> Specifically which places?

Matthew answered this, but if you look for users of the interval tree
you will come across many users that do not have overlapping ranges.
The interval tree is being (ab)used because it is easier than the other
options even though it is not necessarily the best choice for the data
being stored. I don't want to be negative about the other options; they
are all really valuable and have their uses. I think where the maple
tree excels is the ease of use and the cache efficiency.

> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface. The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > or remove the mmap_sem contention.
>
> "mmap_lock" ;)

Ah, right. Thanks.

> > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > nodes. With the increased branching factor, it is significantly shorter than
> > the rbtree so it has fewer cache misses. The removal of the linked list
> > between subsequent entries also reduces the cache misses and the need to pull
> > in the previous and next VMA during many tree alterations.
>
> Do we have any quantitative testing results?

I was waiting for the mmtests sweep to complete before sending them but
I didn't want to hold up Yu Zhao's testing of the combined tree as it
has proved useful already. Please don't include the results in the
first commit as it wouldn't make much sense. If you intend to put them
in a commit message, please put them in the patch introducing the maple
tree.

The benchmarks are around the same as they have always been.

kernbench
                         rb5.18-rc2             mt5.18-rc2
Amean  user-2      862.24 (  0.00%)     861.45 *  0.09%*
Amean  syst-2      136.65 (  0.00%)     141.58 * -3.61%*
Amean  elsp-2      505.38 (  0.00%)     507.99 * -0.52%*
Amean  user-4      890.24 (  0.00%)     888.34 *  0.21%*
Amean  syst-4      140.64 (  0.00%)     145.32 * -3.33%*
Amean  elsp-4      264.34 (  0.00%)     265.76 * -0.54%*
Amean  user-8      952.30 (  0.00%)     947.57 *  0.50%*
Amean  syst-8      145.52 (  0.00%)     147.79 * -1.56%*
Amean  elsp-8      145.02 (  0.00%)     145.38 * -0.24%*
Amean  user-16     920.83 (  0.00%)     918.95 *  0.20%*
Amean  syst-16     135.37 (  0.00%)     138.99 * -2.67%*
Amean  elsp-16      75.03 (  0.00%)      75.25 * -0.29%*
Amean  user-32     970.98 (  0.00%)     969.01 *  0.20%*
Amean  syst-32     144.75 (  0.00%)     148.58 * -2.64%*
Amean  elsp-32      44.10 (  0.00%)      44.64 * -1.24%*
Amean  user-64    1062.19 (  0.00%)    1060.30 *  0.18%*
Amean  syst-64     154.24 (  0.00%)     157.58 * -2.17%*
Amean  elsp-64      28.88 (  0.00%)      29.10 * -0.76%*
Amean  user-128   1609.09 (  0.00%)    1612.19 * -0.19%*
Amean  syst-128    210.05 (  0.00%)     215.09 * -2.40%*
Amean  elsp-128     25.22 (  0.00%)      25.45 * -0.94%*
Amean  user-256   1767.37 (  0.00%)    1766.86 *  0.03%*
Amean  syst-256    215.99 (  0.00%)     221.56 * -2.58%*
Amean  elsp-256     25.20 (  0.00%)      25.33 * -0.54%*
Amean  user-288   1772.73 (  0.00%)    1772.35 *  0.02%*
Amean  syst-288    216.08 (  0.00%)     221.39 * -2.45%*
Amean  elsp-288     25.16 (  0.00%)      25.44 * -1.13%*

Increase in performance in the following micro-benchmarks in Hmean:
- context_switch1-processes +14.74% to 2.22%

Mixed results in the following micro-benchmarks in Hmean:
- malloc1-threads +34.95% to -30.06%
- malloc1-processes +2.73% to -23.65%
- page_fault3-threads +19.84% to -11.55%
- pthread_mutex1-threads +42.50% to -11.76%

Decrease in performance in the following micro-benchmarks in Hmean:
- brk1-processes -35.35% to -42.69%

The brk1-processes decrease is due to the test itself. Since the VMA
cannot be expanded, the test is actually inserting a new VMA.

> What's the plan on utilizing this to further reduce mmap_lock contention?

The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct. I can go into more details if you want.
On Tue, Apr 26, 2022 at 03:06:19PM +0000, Liam Howlett wrote:
> Andrew,
>
> Please replace the patches in your mglru-maple branch with this set. It should
> be a drop in replacement for my patch range with the fixes into these
> patches. Adding the preallocation to work around the fs-reclaim LOCKDEP
> issue caused enough changes to the patches to warrant a respin.
>
> The last patch on the branch is still needed to fix vmscan after mglru
> is applied. ee4b1fc24f30 "mm/vmscan: Use VMA_ITERATOR in
> get_next_vma()"
>
> Here is the pretty cover letter you requested last time.
>
> ------------------------------------
>
> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently. There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface. The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct. The long term goal is to reduce
> or remove the mmap_sem contention.
>
> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes. With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses. The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.
>
> This patch set is based on v5.18-rc2
>
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220426
>
> v8 changes:
> - Added preallocations before any potential edits to the tree when holding the
>   i_mmap_lock to avoid fs-reclaim issues on extreme memory pressure.
> - Fixed issue in mempolicy mas_for_each() loop.
> - Moved static definitions inside ifdef for DEBUG_MAPLE
> - Fixed compile warnings reported by build bots
> - Moved mas_dfs_preorder() to testing code
> - Changed __vma_adjust() to record the highest vma in case 6 instead of
>   finding it twice.
> - Fixed locking issue in exit_mmap()
> - Fixed up from/s-o-b ordering

Running some syscall fuzzer would trigger a crash.

 BUG: KASAN: use-after-free in mas_find
  ma_dead_node at lib/maple_tree.c:532
  (inlined by) mas_next_entry at lib/maple_tree.c:4637
  (inlined by) mas_find at lib/maple_tree.c:5869
 Read of size 8 at addr ffff88811c5e9c00 by task trinity-c0/1351

 CPU: 5 PID: 1351 Comm: trinity-c0 Not tainted 5.18.0-rc4-next-20220427 #3
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-5.fc35 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl
  print_address_description.constprop.0.cold
  print_report.cold
  kasan_report
  mas_find
  apply_mlockall_flags
  __do_sys_munlockall
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
 RIP: 0033:0x7f3105611a3d
 Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 738
 RSP: 002b:00007ffeefae7c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000098
 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f3105611a3d
 RDX: 00000000ffff0000 RSI: ffffffffffffafff RDI: 0000009357075a39
 RBP: 00007f3103f94000 R08: 00000000fffffff7 R09: 000000000000006b
 R10: fffffffffffffffd R11: 0000000000000246 R12: 0000000000000098
 R13: 00007f31054f06c0 R14: 00007f3103f94058 R15: 00007f3103f94000
  </TASK>

 Allocated by task 1351:
  kasan_save_stack
  __kasan_slab_alloc
  kmem_cache_alloc_bulk
  mas_alloc_nodes
  mas_preallocate
  __vma_adjust
  __split_vma
  do_mas_align_munmap.constprop.0
  __vm_munmap
  __x64_sys_munmap
  do_syscall_64
  entry_SYSCALL_64_after_hwframe

 Freed by task 1351:
  kasan_save_stack
  kasan_set_track
  kasan_set_free_info
  ____kasan_slab_free
  slab_free_freelist_hook
  kmem_cache_free_bulk.part.0
  mas_destroy
  mas_store_prealloc
  __vma_adjust
  vma_merge
  mlock_fixup
  apply_mlockall_flags
  __do_sys_munlockall
  do_syscall_64
  entry_SYSCALL_64_after_hwframe

 The buggy address belongs to the object at ffff88811c5e9c00
  which belongs to the cache maple_node of size 256
 The buggy address is located 0 bytes inside of
  256-byte region [ffff88811c5e9c00, ffff88811c5e9d00)

 The buggy address belongs to the physical page:
 page:ffffea0004717a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11c5e8
 head:ffffea0004717a00 order:2 compound_mapcount:0 compound_pincount:0
 flags: 0x200000000010200(slab|head|node=0|zone=2)
 raw: 0200000000010200 ffffea0004f8da08 ffffea0004676a08 ffff888100050940
 raw: 0000000000000000 0000000000150015 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff88811c5e9b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff88811c5e9b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >ffff88811c5e9c00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                    ^
  ffff88811c5e9c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff88811c5e9d00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ==================================================================
 Disabling lock debugging due to kernel taint
 general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI
 KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
 CPU: 5 PID: 1351 Comm: trinity-c0 Tainted: G    B             5.18.0-rc4-next-20220427 #3
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-5.fc35 04/01/2014
 RIP: 0010:mas_ascend
 Code: cb 18 41 89 d3 48 83 cb 04 41 83 f3 01 40 84 ed 40 0f 95 c7 41 20 fb 74 26 83 ee 01 48 63 f6 48 8d 14 f1 48 89 d6 48 c1 ee 03 <42> 80 3c 3e 00 0f 854
 RSP: 0018:ffffc9000279fc78 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff888106dc3504 RCX: 0000000000000000
 RDX: 0000000000000008 RSI: 0000000000000001 RDI: ffff88811d42e201
 RBP: 0000000000000002 R08: 0000000000000000 R09: ffff88810f723300
 R10: ffffffffffffffff R11: 0000000000000001 R12: ffffc9000279fe78
 R13: 0000000000000000 R14: ffff888106dc3500 R15: dffffc0000000000
 FS:  00007f31054f0740(0000) GS:ffff8882d2e80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f3104db847c CR3: 0000000127e80002 CR4: 0000000000370ee0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  mas_next_node
  mas_find
  apply_mlockall_flags
  __do_sys_munlockall
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
 RIP: 0033:0x7f3105611a3d
 Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 738
 RSP: 002b:00007ffeefae7c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000098
 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f3105611a3d
 RDX: 00000000ffff0000 RSI: ffffffffffffafff RDI: 0000009357075a39
 RBP: 00007f3103f94000 R08: 00000000fffffff7 R09: 000000000000006b
 R10: fffffffffffffffd R11: 0000000000000246 R12: 0000000000000098
 R13: 00007f31054f06c0 R14: 00007f3103f94058 R15: 00007f3103f94000
  </TASK>
 Modules linked in:
 ---[ end trace 0000000000000000 ]---
 RIP: 0010:mas_ascend
 Code: cb 18 41 89 d3 48 83 cb 04 41 83 f3 01 40 84 ed 40 0f 95 c7 41 20 fb 74 26 83 ee 01 48 63 f6 48 8d 14 f1 48 89 d6 48 c1 ee 03 <42> 80 3c 3e 00 0f 854
 RSP: 0018:ffffc9000279fc78 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff888106dc3504 RCX: 0000000000000000
 RDX: 0000000000000008 RSI: 0000000000000001 RDI: ffff88811d42e201
 RBP: 0000000000000002 R08: 0000000000000000 R09: ffff88810f723300
 R10: ffffffffffffffff R11: 0000000000000001 R12: ffffc9000279fe78
 R13: 0000000000000000 R14: ffff888106dc3500 R15: dffffc0000000000
 FS:  00007f31054f0740(0000) GS:ffff8882d2e80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f3104db847c CR3: 0000000127e80002 CR4: 0000000000370ee0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Kernel panic - not syncing: Fatal exception
 Kernel Offset: 0x15400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
* Qian Cai <quic_qiancai@quicinc.com> [220427 12:10]:
> On Tue, Apr 26, 2022 at 03:06:19PM +0000, Liam Howlett wrote:
> > Andrew,
> >
> > Please replace the patches in your mglru-maple branch with this set. It should
> > be a drop in replacement for my patch range with the fixes into these
> > patches. Adding the preallocation to work around the fs-reclaim LOCKDEP
> > issue caused enough changes to the patches to warrant a respin.
> >
> > The last patch on the branch is still needed to fix vmscan after mglru
> > is applied. ee4b1fc24f30 "mm/vmscan: Use VMA_ITERATOR in
> > get_next_vma()"
> >
> > Here is the pretty cover letter you requested last time.
> >
> > ------------------------------------
> >
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently. There are a number of places in the kernel
> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface. The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > or remove the mmap_sem contention.
> >
> > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > nodes. With the increased branching factor, it is significantly shorter than
> > the rbtree so it has fewer cache misses. The removal of the linked list
> > between subsequent entries also reduces the cache misses and the need to pull
> > in the previous and next VMA during many tree alterations.
> >
> > This patch set is based on v5.18-rc2
> >
> > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220426
> >
> > v8 changes:
> > - Added preallocations before any potential edits to the tree when holding the
> >   i_mmap_lock to avoid fs-reclaim issues on extreme memory pressure.
> > - Fixed issue in mempolicy mas_for_each() loop.
> > - Moved static definitions inside ifdef for DEBUG_MAPLE
> > - Fixed compile warnings reported by build bots
> > - Moved mas_dfs_preorder() to testing code
> > - Changed __vma_adjust() to record the highest vma in case 6 instead of
> >   finding it twice.
> > - Fixed locking issue in exit_mmap()
> > - Fixed up from/s-o-b ordering
>
> Running some syscall fuzzer would trigger a crash.
>
> BUG: KASAN: use-after-free in mas_find
>  ma_dead_node at lib/maple_tree.c:532
>  (inlined by) mas_next_entry at lib/maple_tree.c:4637
>  (inlined by) mas_find at lib/maple_tree.c:5869
> Read of size 8 at addr ffff88811c5e9c00 by task trinity-c0/1351
>
> CPU: 5 PID: 1351 Comm: trinity-c0 Not tainted 5.18.0-rc4-next-20220427 #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-5.fc35 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl
> print_address_description.constprop.0.cold
> print_report.cold
> kasan_report
> mas_find
> apply_mlockall_flags

Thanks. This is indeed an issue with 0d43186b36c1 (mm/mlock: use vma
iterator and instead of vma linked list)

Andrew, please include this patch as a fix.

Thanks,
Liam
On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:

> > > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > > nodes. With the increased branching factor, it is significantly shorter than
> > > the rbtree so it has fewer cache misses. The removal of the linked list
> > > between subsequent entries also reduces the cache misses and the need to pull
> > > in the previous and next VMA during many tree alterations.
> >
> > Do we have any quantitative testing results?
>
> I was waiting for the mmtests sweep to complete before sending them but
> I didn't want to hold up Yu Zhao's testing of the combined tree as it
> has proved useful already. Please don't include the results in the first
> commit as it wouldn't make much sense. If you intend to put them in a
> commit message, please put them in the patch introducing the maple tree.

I did that.

> The benchmarks are around the same as they have always been.

So it's presently a wash.

That makes "the plan" (below) really critical, otherwise there seems
little point in merging this code at this time?

Please send me many very soothing words about how confident we should
be that the plan will be implemented and that it shall be good?

> > > What's the plan on utilizing this to further reduce mmap_lock contention?
> >
> > The plan is to get to the point where we use the maple tree in RCU mode.
> > Readers will not block for writers. A single write operation will be
> > allowed at a time. A reader re-walks if stale data is encountered. VMAs
> > would be RCU enabled and this mode would be entered once multiple tasks
> > are using the mm_struct. I can go into more details if you want.

Uh, that's very important information. It's really the whole point of
the patchset! I added this to the [0/n] changelog.
On Wed, Apr 27, 2022 at 10:33:31AM -0700, Andrew Morton wrote:
> On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > The benchmarks are around the same as they have always been.
>
> So it's presently a wash.
>
> That makes "the plan" (below) really critical, otherwise there seems
> little point in merging this code at this time?
>
> Please send me many very soothing words about how confident we should
> be that the plan will be implemented and that it shall be good?

Yes, performance-wise it's a wash. However, Davidlohr was very
impressed that it was a wash because we're actually getting rid of
three data structures here; the linked list, the rbtree and the
vmacache. His opinion was that we should push the maple tree in now,
in advance of the future RCU uses.

We also have other users waiting in the wings. Dave Howells has
something he's working on that uses the maple tree directly. I have a
couple of XArray users that are using it inappropriately that I want
to convert ... I just didn't want to do that work before all this
lands.

The current LSFMM schedule has very many words about the Maple tree
scheduled for 13:30-15:00 on Monday. Hopefully we'll have a better
idea after that how confident we are that RCU VMA walking is going to
work.
On Wed, Apr 27, 2022 at 04:51:50PM +0000, Liam Howlett wrote:
> Thanks. This is indeed an issue with 0d43186b36c1 (mm/mlock: use vma
> iterator and instead of vma linked list)
>
> Andrew, Please include this patch as a fix.

Even with the patch applied, there are still thousands of memory leak
reports from kmemleak after booting.

unreferenced object 0xffff400259bd6d00 (size 256):
  comm "multipathd", pid 2577, jiffies 4294915929 (age 2370.384s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
     slab_post_alloc_hook
     kmem_cache_alloc_bulk
     mas_alloc_nodes
     mt_alloc_bulk at lib/maple_tree.c:151
     (inlined by) mas_alloc_nodes at lib/maple_tree.c:1244
     mas_preallocate
     __vma_adjust
     shift_arg_pages
     setup_arg_pages
     load_elf_binary
     search_binary_handler
     exec_binprm
     bprm_execve
     do_execveat_common.isra.0
     __arm64_sys_execve
     invoke_syscall
     el0_svc_common.constprop.0
     do_el0_svc
* Qian Cai <quic_qiancai@quicinc.com> [220427 16:22]:
> On Wed, Apr 27, 2022 at 04:51:50PM +0000, Liam Howlett wrote:
> > Thanks. This is indeed an issue with 0d43186b36c1 (mm/mlock: use vma
> > iterator and instead of vma linked list)
> >
> > Andrew, Please include this patch as a fix.
>
> Even with the patch applied, there are still thousands of memory leaks
> reports from kmemleak after booting.

Thank you for finding this.

> unreferenced object 0xffff400259bd6d00 (size 256):
>   comm "multipathd", pid 2577, jiffies 4294915929 (age 2370.384s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>      slab_post_alloc_hook
>      kmem_cache_alloc_bulk
>      mas_alloc_nodes
>      mt_alloc_bulk at lib/maple_tree.c:151
>      (inlined by) mas_alloc_nodes at lib/maple_tree.c:1244
>      mas_preallocate
>      __vma_adjust
>      shift_arg_pages
>      setup_arg_pages
>      load_elf_binary
>      search_binary_handler
>      exec_binprm
>      bprm_execve
>      do_execveat_common.isra.0
>      __arm64_sys_execve
>      invoke_syscall
>      el0_svc_common.constprop.0
>      do_el0_svc

__vma_adjust is way too complicated. This patch should fix the leak.

Andrew, please add this patch to "mm: start tracking VMAs with maple tree"

Thanks,
Liam
* Andrew Morton <akpm@linux-foundation.org> [220427 13:33]:
> On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
>
> > > > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > > > nodes. With the increased branching factor, it is significantly shorter than
> > > > the rbtree so it has fewer cache misses. The removal of the linked list
> > > > between subsequent entries also reduces the cache misses and the need to pull
> > > > in the previous and next VMA during many tree alterations.
> > >
> > > Do we have any quantitative testing results?
> >
> > I was waiting for the mmtests sweep to complete before sending them but
> > I didn't want to hold up Yu Zhao's testing of the combined tree as it
> > has proved useful already. Please don't include the results in the first
> > commit as it wouldn't make much sense. If you intend to put them in a
> > commit message, please put them in the patch introducing the maple tree.
>
> I did that.
>
> > The benchmarks are around the same as they have always been.
>
> So it's presently a wash.
>
> That makes "the plan" (below) really critical, otherwise there seems
> little point in merging this code at this time?

There are a number of reasons to merge at this time. First, as Matthew
said, we have more than one user so having the tree upstream would
obviously help them out. Second, we can make sure this much (maple
tree + vma tracking) works for everyone before we enable VMA RCU and
play even more with the locking scenarios. The third reason is to get
more community input into "the plan" to ensure the maple tree is
beneficial in the most common execution paths and to capture corner
cases.

> Please send me many very soothing words about how confident we should
> be that the plan will be implemented and that it shall be good?

Tea, warmth, fresh bread. As we know, VMAs are rarely written and
mostly read. This plan removes the majority of the slowdown on
readers. The only slowdown is when a reader conflicts with a writer.
We are actually slower on writes - we allocate nodes for the tree vs
having it embedded in the VMA itself and we do more work in
calculating the gaps - but we are actually fast enough to hide that on
the read side. Once the writers stop blocking the readers it shall be
good. We need this step to get to using the maple tree in RCU mode to
make this happen.

> > > What's the plan on utilizing this to further reduce mmap_lock contention?
> >
> > The plan is to get to the point where we use the maple tree in RCU mode.
> > Readers will not block for writers. A single write operation will be
> > allowed at a time. A reader re-walks if stale data is encountered. VMAs
> > would be RCU enabled and this mode would be entered once multiple tasks
> > are using the mm_struct. I can go into more details if you want.
>
> Uh, that's very important information. It's really the whole point
> of the patchset! I added this to the [0/n] changelog.

Yes, but that's not what the patch set does today. I didn't want to
include The Plan until it exists. I'd like to think that what we have
here is worthwhile on its own and a good start - but a start to
something big. It's hard enough to get people to look at 70 patches,
doubly so for an RFC of 70. The interest from others in using a tree
with an easy interface seems like a win as well. I also have some side
goals such as the removal of __vma_adjust() for code readability.

Thanks,
Liam
On Wed, 27 Apr 2022, Matthew Wilcox wrote:

> On Wed, Apr 27, 2022 at 10:33:31AM -0700, Andrew Morton wrote:
> > On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > > The benchmarks are around the same as they have always been.
> >
> > So it's presently a wash.
> >
> > That makes "the plan" (below) really critical, otherwise there seems
> > little point in merging this code at this time?
> >
> > Please send me many very soothing words about how confident we should
> > be that the plan will be implemented and that it shall be good?
>
> Yes, performance-wise it's a wash. However, Davidlohr was very
> impressed that it was a wash because we're actually getting rid of three
> data structures here; the linked list, the rbtree and the vmacache.
> His opinion was that we should push the maple tree in now, in advance
> of the future RCU uses.

Yes, I like the maple tree, and at this stage I don't think we can ask
for more from this series wrt the MM - albeit there seem to still be
some folks reporting breakage. Fundamentally, I see Liam's work to
(re)move complexity out of the MM (not to say that the actual maple
tree is not complex) by consolidating the three complementary data
structures as very much worth it, considering performance does not
take a hit. This was very much a turn off with the range locking
approach, which in the worst case scenario incurred prohibitive
overhead. Also, as Liam and Matthew have mentioned, RCU opens up a lot
of nice performance opportunities, and in addition academia[1] has
shown outstanding scalability of address spaces with the foundation of
replacing the locked rbtree with RCU aware trees.

[1] https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf

Thanks,
Davidlohr
On Sun, 1 May 2022 13:26:34 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:

> On Wed, 27 Apr 2022, Matthew Wilcox wrote:
>
> > On Wed, Apr 27, 2022 at 10:33:31AM -0700, Andrew Morton wrote:
> > > On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > > > The benchmarks are around the same as they have always been.
> > >
> > > So it's presently a wash.
> > >
> > > That makes "the plan" (below) really critical, otherwise there seems
> > > little point in merging this code at this time?
> > >
> > > Please send me many very soothing words about how confident we should
> > > be that the plan will be implemented and that it shall be good?
> >
> > Yes, performance-wise it's a wash. However, Davidlohr was very
> > impressed that it was a wash because we're actually getting rid of three
> > data structures here; the linked list, the rbtree and the vmacache.
> > His opinion was that we should push the maple tree in now, in advance
> > of the future RCU uses.
>
> Yes I like the maple tree, and at this stage I don't think we can ask
> for more from this series wrt the MM - albeit there seems to still be
> some folks reporting breakage. Fundamentally I see Liam's work to (re)move
> complexity out of the MM (not to say that the actual maple tree is not
> complex) by consolidating the three complementary data structures very
> much worth it considering performance does not take a hit. This was
> very much a turn off with the range locking approach, which worst case
> scenario incurred prohibitive overhead. Also as Liam and Matthew
> have mentioned, RCU opens up a lot of nice performance opportunities,
> and in addition academia[1] has shown outstanding scalability of address
> spaces with the foundation of replacing the locked rbtree with RCU
> aware trees.

Thanks. That sounded like a wordy acked-by to me? :)

Liam, I think the above is useful background for the [0/N].

> [1] https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf

As is that. The paper seems shockingly relevant. Do we know the
authors or is it a cosmic coincidence?
* Andrew Morton <akpm@linux-foundation.org> [220501 19:56]:
> On Sun, 1 May 2022 13:26:34 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> > On Wed, 27 Apr 2022, Matthew Wilcox wrote:
> >
> > > On Wed, Apr 27, 2022 at 10:33:31AM -0700, Andrew Morton wrote:
> > > > On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > > > > The benchmarks are around the same as they have always been.
> > > >
> > > > So it's presently a wash.
> > > >
> > > > That makes "the plan" (below) really critical, otherwise there seems
> > > > little point in merging this code at this time?
> > > >
> > > > Please send me many very soothing words about how confident we should
> > > > be that the plan will be implemented and that it shall be good?
> > >
> > > Yes, performance-wise it's a wash. However, Davidlohr was very
> > > impressed that it was a wash because we're actually getting rid of three
> > > data structures here; the linked list, the rbtree and the vmacache.
> > > His opinion was that we should push the maple tree in now, in advance
> > > of the future RCU uses.
> >
> > Yes I like the maple tree, and at this stage I don't think we can ask
> > for more from this series wrt the MM - albeit there seems to still be
> > some folks reporting breakage. Fundamentally I see Liam's work to (re)move
> > complexity out of the MM (not to say that the actual maple tree is not
> > complex) by consolidating the three complementary data structures very
> > much worth it considering performance does not take a hit. This was
> > very much a turn off with the range locking approach, which worst case
> > scenario incurred prohibitive overhead. Also as Liam and Matthew
> > have mentioned, RCU opens up a lot of nice performance opportunities,
> > and in addition academia[1] has shown outstanding scalability of address
> > spaces with the foundation of replacing the locked rbtree with RCU
> > aware trees.
>
> Thanks. That sounded like a wordy acked-by to me? :)
>
> Liam, I think the above is useful background for the [0/N].
>
> > [1] https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
>
> As is that. The paper seems shockingly relevant. Do we know the
> authors or is it a cosmic coincidence?

Cosmic coincidence. We designed our tree with the intention of solving
the hardest problem first. Upon settling on a b-tree variant and a
rough outline, we researched range-based b-trees and RCU b-trees and
did find that article. So it was nice to find reassurance that we were
on the right path, but our design choice of using ranges made that
paper unusable for us.

Thanks,
Liam