[v12,08/69] mm: start tracking VMAs with maple tree

Message ID 20220720021727.17018-9-Liam.Howlett@oracle.com
State New
Series Introducing the Maple Tree

Commit Message

Liam R. Howlett July 20, 2022, 2:17 a.m. UTC
From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>

Start tracking the VMAs with the new maple tree structure in parallel with
the rb_tree.  Add debug and trace events for maple tree operations and
duplicate the rb_tree that is created on forks into the maple tree.

Add the maple tree to the mm_struct, including the mm_init struct; add
support in the required mm/mmap functions; add tracking in kernel/fork
for process forking; and use the tree to find the unmapped_area, checking
the result against what the rbtree finds.

This also moves the mmap_lock() in exit_mmap() since the oom reaper call
does walk the VMAs.  Otherwise lockdep will be unhappy if oom happens.

When splitting a VMA fails because the maple tree nodes cannot be
allocated, the error path in __split_vma() calls new->vm_ops->close(new).
The page accounting for hugetlb is done in the close() operation, so it
accounts for the removal of half of the VMA, which was not adjusted.  This
results in a negative exit value.  To avoid the negative charge, set
vm_start = vm_end and vm_pgoff = 0.

There is also a potential accounting issue in special mappings from
insert_vm_struct() failing to allocate, so reverse the charge there in
the failure scenario.

Link: https://lkml.kernel.org/r/20220504010716.661115-10-Liam.Howlett@oracle.com
Link: https://lkml.kernel.org/r/20220621204632.3370049-9-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: David Howells <dhowells@redhat.com>
Cc: SeongJae Park <sj@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Will Deacon <will@kernel.org>
Cc: Davidlohr Bueso <dave@stgolabs.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 arch/x86/kernel/tboot.c     |   1 +
 drivers/firmware/efi/efi.c  |   1 +
 include/linux/mm.h          |   5 +
 include/linux/mm_types.h    |   3 +
 include/trace/events/mmap.h |  73 ++++++++
 kernel/fork.c               |  20 +-
 mm/init-mm.c                |   2 +
 mm/mmap.c                   | 353 ++++++++++++++++++++++++++++++++----
 mm/nommu.c                  |  13 ++
 9 files changed, 435 insertions(+), 36 deletions(-)

Comments

Nathan Chancellor July 27, 2022, 12:28 a.m. UTC | #1
Hi Liam,

Apologies if this has been reported already; I searched the mailing
lists but did not find anything.

I bisected my arm64 test system failing to boot to this change as commit
fdfbd22f37db ("mm: start tracking VMAs with maple tree") in
next-20220726 (bisect log at the end).

[    4.295886] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000000
[    4.306595] Mem abort info:
[    4.309381]   ESR = 0x0000000096000044
[    4.313118]   EC = 0x25: DABT (current EL), IL = 32 bits
[    4.318422]   SET = 0, FnV = 0
[    4.321464]   EA = 0, S1PTW = 0
[    4.324592]   FSC = 0x04: level 0 translation fault
[    4.329461] Data abort info:
[    4.332329]   ISV = 0, ISS = 0x00000044
[    4.336152]   CM = 0, WnR = 1
[    4.339110] user pgtable: 4k pages, 48-bit VAs, pgdp=00000020a9712000
[    4.345539] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[    4.352321] Internal error: Oops: 96000044 [#1] SMP
[    4.357188] Modules linked in:
[    4.360232] CPU: 6 PID: 264 Comm: dracut-rootfs-g Not tainted 5.19.0-rc4-00288-gfdfbd22f37db #1
[    4.368918] Hardware name: SolidRun Ltd. SolidRun CEX7 Platform, BIOS EDK II Jun 21 2022
[    4.376994] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    4.383943] pc : mas_split.isra.0+0x50c/0x784
[    4.388295] lr : mas_split.isra.0+0x204/0x784
[    4.392640] sp : ffff8000094a3510
[    4.395942] x29: ffff8000094a3510 x28: ffff08dd66c0c000 x27: ffff8000094a3610
[    4.403067] x26: ffff8000094a35d0 x25: ffff8000094a3578 x24: ffffd823cb5448b8
[    4.410192] x23: ffff8000094a3650 x22: ffff8000094a3690 x21: ffff8000094a3738
[    4.417316] x20: 0000000000000002 x19: ffff8000094a3af0 x18: 0000000000000002
[    4.424441] x17: 0000000000000000 x16: ffff08dd66c45450 x15: 0000000000000000
[    4.431565] x14: ffff08dd66c459c8 x13: ffff8000094a3748 x12: 0000000000000001
[    4.438689] x11: ffff8000094a3610 x10: 0000000000000003 x9 : ffff08dd66c47300
[    4.445813] x8 : 000000000000001c x7 : 0000000000000003 x6 : 0000000000000006
[    4.452937] x5 : ffff08dd68a44409 x4 : 0000000000000001 x3 : ffff8000094a35d0
[    4.460061] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8000094a3738
[    4.467186] Call trace:
[    4.469620]  mas_split.isra.0+0x50c/0x784
[    4.473618]  mas_commit_b_node.isra.0+0x1e0/0x274
[    4.478311]  mas_wr_modify+0x10c/0x28c
[    4.482048]  mas_wr_store_entry.isra.0+0x10c/0x4a0
[    4.486827]  mas_store+0x48/0x110
[    4.490131]  dup_mmap+0x268/0x514
[    4.493436]  dup_mm+0x68/0xfc
[    4.496391]  copy_process+0x864/0x10b4
[    4.500129]  kernel_clone+0x88/0x494
[    4.503692]  __do_sys_clone+0x60/0x80
[    4.507342]  __arm64_sys_clone+0x2c/0x40
[    4.511254]  invoke_syscall+0x78/0x100
[    4.514991]  el0_svc_common.constprop.0+0x4c/0xf4
[    4.519683]  do_el0_svc+0x38/0x4c
[    4.522985]  el0_svc+0x34/0x100
[    4.526115]  el0t_64_sync_handler+0x11c/0x150
[    4.530460]  el0t_64_sync+0x190/0x194
[    4.534112] Code: f9000125 f9400e65 9278dca5 f94000a5 (f9000045)
[    4.540193] ---[ end trace 0000000000000000 ]---

I was also able to reproduce the same crash in a Fedora virtual machine
using QEMU with Fedora's rawhide configuration [1]:

[    5.913992] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000000
[    5.914510] Mem abort info:
[    5.914581]   ESR = 0x0000000096000044
[    5.914705]   EC = 0x25: DABT (current EL), IL = 32 bits
[    5.914858]   SET = 0, FnV = 0
[    5.914951]   EA = 0, S1PTW = 0
[    5.915065]   FSC = 0x04: level 0 translation fault
[    5.915215] Data abort info:
[    5.915321]   ISV = 0, ISS = 0x00000044
[    5.915465]   CM = 0, WnR = 1
[    5.915624] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000103051000
[    5.915799] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[    5.916196] Internal error: Oops: 96000044 [#1] SMP
[    5.916504] Modules linked in:
[    5.916771] CPU: 2 PID: 202 Comm: dracut-rootfs-g Not tainted 5.19.0-rc4+ #1
[    5.917003] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
[    5.917339] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    5.917584] pc : mas_split.isra.0+0x50c/0x784
[    5.917794] lr : mas_split.isra.0+0x204/0x784
[    5.917949] sp : ffff8000086334a0
[    5.918038] x29: ffff8000086334a0 x28: ffff5b65c095a258 x27: ffff8000086335a0
[    5.918289] x26: ffff800008633560 x25: ffff800008633508 x24: ffffdb30c80d9778
[    5.918844] x23: ffff8000086335e0 x22: ffff800008633620 x21: ffff8000086336c8
[    5.919277] x20: 0000000000000002 x19: ffff800008633a80 x18: 0000000000000002
[    5.919533] x17: 0000000000000000 x16: ffff5b65c095a4b0 x15: 0000000000000000
[    5.919747] x14: ffff5b65c095a898 x13: ffff8000086336d8 x12: 0000000000000001
[    5.919971] x11: ffff8000086335a0 x10: 0000000000000003 x9 : ffff5b66f42e2a00
[    5.920214] x8 : 000000000000001c x7 : 0000000000000003 x6 : 0000000000000006
[    5.920493] x5 : ffff5b65c3077309 x4 : 0000000000000001 x3 : ffff800008633560
[    5.920739] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8000086336c8
[    5.921051] Call trace:
[    5.921152]  mas_split.isra.0+0x50c/0x784
[    5.921303]  mas_commit_b_node.isra.0+0x1e0/0x274
[    5.921459]  mas_wr_modify+0x10c/0x28c
[    5.921565]  mas_wr_store_entry.isra.0+0x10c/0x4a0
[    5.921725]  mas_store+0x48/0x110
[    5.921864]  dup_mmap+0x268/0x514
[    5.921993]  dup_mm+0x68/0xfc
[    5.922074]  copy_process+0x864/0x10b4
[    5.922213]  kernel_clone+0x88/0x494
[    5.922315]  __do_sys_clone+0x60/0x80
[    5.922444]  __arm64_sys_clone+0x2c/0x40
[    5.922576]  invoke_syscall+0x78/0x100
[    5.922686]  el0_svc_common.constprop.0+0x4c/0xf4
[    5.922847]  do_el0_svc+0x38/0x4c
[    5.922947]  el0_svc+0x34/0x100
[    5.923056]  el0t_64_sync_handler+0x11c/0x150
[    5.923179]  el0t_64_sync+0x190/0x194
[    5.923365] Code: f9000125 f9400e65 9278dca5 f94000a5 (f9000045)
[    5.923833] ---[ end trace 0000000000000000 ]---

If there is any additional information I can provide or patches I can
test, please let me know!

Cheers,
Nathan

[1]: https://src.fedoraproject.org/rpms/kernel/raw/rawhide/f/kernel-aarch64-fedora.config

# bad: [058affafc65a74cf54499fb578b66ad0b18f939b] Add linux-next specific files for 20220726
# good: [e0dccc3b76fb35bb257b4118367a883073d7390e] Linux 5.19-rc8
git bisect start '058affafc65a74cf54499fb578b66ad0b18f939b' 'e0dccc3b76fb35bb257b4118367a883073d7390e'
# good: [e9173a7b08211b52862d61e7cdc8899fc5e6a44d] Merge branch 'drm-next' of git://git.freedesktop.org/git/drm/drm.git
git bisect good e9173a7b08211b52862d61e7cdc8899fc5e6a44d
# good: [45dfa9ecc6a971ab9217e41c1e2ea3ee98fd0f70] Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm.git
git bisect good 45dfa9ecc6a971ab9217e41c1e2ea3ee98fd0f70
# good: [1991de2cb33a921c5a422e749eaba9067b9e8a29] Merge branch 'staging-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging.git
git bisect good 1991de2cb33a921c5a422e749eaba9067b9e8a29
# good: [21a47601220fc0b93b7ab254381b2a3ef1f6d3fe] Merge branch 'for-next' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock.git
git bisect good 21a47601220fc0b93b7ab254381b2a3ef1f6d3fe
# bad: [2a210fe818f13dfe3342eb117a4bfeb36aad8215] mm: khugepaged: don't carry huge page to the next loop for !CONFIG_NUMA
git bisect bad 2a210fe818f13dfe3342eb117a4bfeb36aad8215
# good: [43957b5d11037a651d162f65c682ec3c76777fc8] mm/mmap: define DECLARE_VM_GET_PAGE_PROT
git bisect good 43957b5d11037a651d162f65c682ec3c76777fc8
# good: [e3e449def7ea1d17e890408ea01013592e65298b] radix tree test suite: add pr_err define
git bisect good e3e449def7ea1d17e890408ea01013592e65298b
# bad: [c1870dd3ebf1f8a1337f10f2b2ef97e0c1d7e03a] ipc/shm: use VMA iterator instead of linked list
git bisect bad c1870dd3ebf1f8a1337f10f2b2ef97e0c1d7e03a
# bad: [5f9f7cac1a89ff1c7d111dfc7edbcb1b0987000a] mm: use maple tree operations for find_vma_intersection()
git bisect bad 5f9f7cac1a89ff1c7d111dfc7edbcb1b0987000a
# bad: [3b6a687016b08feb5c1b8d9fb78b31dcb314674d] mm/mmap: use maple tree for unmapped_area{_topdown}
git bisect bad 3b6a687016b08feb5c1b8d9fb78b31dcb314674d
# good: [264f03ef6aaca0d56e1c6efed11c93680b8156ac] lib/test_maple_tree: add testing for maple tree
git bisect good 264f03ef6aaca0d56e1c6efed11c93680b8156ac
# bad: [bea49723f45480acf67f46f6fd76bc5cde941e5d] mmap: use the VMA iterator in count_vma_pages_range()
git bisect bad bea49723f45480acf67f46f6fd76bc5cde941e5d
# bad: [423dbb83d4e1b9e894a2309a0035284eb20d9f2b] mm: add VMA iterator
git bisect bad 423dbb83d4e1b9e894a2309a0035284eb20d9f2b
# bad: [fdfbd22f37db37d2db32411d7f48c57bc810366b] mm: start tracking VMAs with maple tree
git bisect bad fdfbd22f37db37d2db32411d7f48c57bc810366b
# first bad commit: [fdfbd22f37db37d2db32411d7f48c57bc810366b] mm: start tracking VMAs with maple tree

Liam R. Howlett July 28, 2022, 12:34 a.m. UTC | #2
* Nathan Chancellor <nathan@kernel.org> [220726 20:28]:
> If there is any additional information I can provide or patches I can
> test, please let me know!
> 

Hello Nathan,

Thanks for testing this and for your report.  Yours is the first and only
report of this failure, so I very much appreciate it.

I run a number of tests on arm64 so I will have to try your kernel
config.  Thanks for including the link.

Regards,
Liam

Liam R. Howlett July 29, 2022, 3:41 p.m. UTC | #3
* Liam R. Howlett <Liam.Howlett@Oracle.com> [220727 20:34]:
> Hello Nathan,
> 
> Thanks for testing this and for your report.  Yours is the first and only
> report of this failure, so I very much appreciate it.
> 
> I run a number of tests on arm64 so I will have to try your kernel
> config.  Thanks for including the link.

Nathan,

I am having a hard time reproducing this bug.  I had to modify the
config you pointed me towards by adding virtio block device support.
I tried the next tag you had the issue with, along with my most recent
patches, and neither produced the crash.  Although I was not able to
reproduce the crash, I suspect it has to do with an insufficient number
of allocated nodes at fork time.  I've been running stress-ng with fork
& clone in qemu but so far no luck reproducing it.

Can you decode the line number of mas_split.isra.0+0x50c/0x784?

Could you test git tag howlett/maple/20220728 from
http://git.infradead.org/users/jedix/linux-maple.git and see if this
issue still triggers?

Thanks,
Liam

Nathan Chancellor July 29, 2022, 5:02 p.m. UTC | #4
On Fri, Jul 29, 2022 at 03:41:44PM +0000, Liam Howlett wrote:
> * Liam R. Howlett <Liam.Howlett@Oracle.com> [220727 20:34]:
> > * Nathan Chancellor <nathan@kernel.org> [220726 20:28]:
> > > Hi Liam,
> > > 
> > > On Wed, Jul 20, 2022 at 02:17:45AM +0000, Liam Howlett wrote:
> > > > From: "Liam R. Howlett" <Liam.Howlett@Oracle.com>
> > > > 
> > > > Start tracking the VMAs with the new maple tree structure in parallel with
> > > > the rb_tree.  Add debug and trace events for maple tree operations and
> > > > duplicate the rb_tree that is created on forks into the maple tree.
> > > > 
> > > > The maple tree is added to the mm_struct including the mm_init struct,
> > > > added support in required mm/mmap functions, added tracking in kernel/fork
> > > > for process forking, and used to find the unmapped_area and checked
> > > > against what the rbtree finds.
> > > > 
> > > > This also moves the mmap_lock() in exit_mmap() since the oom reaper call
> > > > does walk the VMAs.  Otherwise lockdep will be unhappy if oom happens.
> > > > 
> > > > When splitting a vma fails due to allocations of the maple tree nodes,
> > > > the error path in __split_vma() calls new->vm_ops->close(new).  The page
> > > > accounting for hugetlb is actually in the close() operation,  so it
> > > > accounts for the removal of 1/2 of the VMA which was not adjusted.  This
> > > > results in a negative exit value.  To avoid the negative charge, set
> > > > vm_start = vm_end and vm_pgoff = 0.
> > > > 
> > > > There is also a potential accounting issue in special mappings from
> > > > insert_vm_struct() failing to allocate, so reverse the charge there in
> > > > the failure scenario.
> > > > 
> > > > Link: https://lkml.kernel.org/r/20220504010716.661115-10-Liam.Howlett@oracle.com
> > > > Link: https://lkml.kernel.org/r/20220621204632.3370049-9-Liam.Howlett@oracle.com
> > > > Signed-off-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
> > > > Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
> > > > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > > > Cc: David Howells <dhowells@redhat.com>
> > > > Cc: SeongJae Park <sj@kernel.org>
> > > > Cc: Vlastimil Babka <vbabka@suse.cz>
> > > > Cc: Will Deacon <will@kernel.org>
> > > > Cc: Davidlohr Bueso <dave@stgolabs.net>
> > > > Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
> > > 
> > > Apologies if this has been reported already, I tried searching the
> > > mailing lists but I did not really find anything.
> > > 
> > > I bisected my arm64 test system failing to boot to this change as commit
> > > fdfbd22f37db ("mm: start tracking VMAs with maple tree") in
> > > next-20220726 (bisect log at the end).
> > > 
> > > [    4.295886] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000000
> > > [    4.306595] Mem abort info:
> > > [    4.309381]   ESR = 0x0000000096000044
> > > [    4.313118]   EC = 0x25: DABT (current EL), IL = 32 bits
> > > [    4.318422]   SET = 0, FnV = 0
> > > [    4.321464]   EA = 0, S1PTW = 0
> > > [    4.324592]   FSC = 0x04: level 0 translation fault
> > > [    4.329461] Data abort info:
> > > [    4.332329]   ISV = 0, ISS = 0x00000044
> > > [    4.336152]   CM = 0, WnR = 1
> > > [    4.339110] user pgtable: 4k pages, 48-bit VAs, pgdp=00000020a9712000
> > > [    4.345539] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
> > > [    4.352321] Internal error: Oops: 96000044 [#1] SMP
> > > [    4.357188] Modules linked in:
> > > [    4.360232] CPU: 6 PID: 264 Comm: dracut-rootfs-g Not tainted 5.19.0-rc4-00288-gfdfbd22f37db #1
> > > [    4.368918] Hardware name: SolidRun Ltd. SolidRun CEX7 Platform, BIOS EDK II Jun 21 2022
> > > [    4.376994] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [    4.383943] pc : mas_split.isra.0+0x50c/0x784
> > > [    4.388295] lr : mas_split.isra.0+0x204/0x784
> > > [    4.392640] sp : ffff8000094a3510
> > > [    4.395942] x29: ffff8000094a3510 x28: ffff08dd66c0c000 x27: ffff8000094a3610
> > > [    4.403067] x26: ffff8000094a35d0 x25: ffff8000094a3578 x24: ffffd823cb5448b8
> > > [    4.410192] x23: ffff8000094a3650 x22: ffff8000094a3690 x21: ffff8000094a3738
> > > [    4.417316] x20: 0000000000000002 x19: ffff8000094a3af0 x18: 0000000000000002
> > > [    4.424441] x17: 0000000000000000 x16: ffff08dd66c45450 x15: 0000000000000000
> > > [    4.431565] x14: ffff08dd66c459c8 x13: ffff8000094a3748 x12: 0000000000000001
> > > [    4.438689] x11: ffff8000094a3610 x10: 0000000000000003 x9 : ffff08dd66c47300
> > > [    4.445813] x8 : 000000000000001c x7 : 0000000000000003 x6 : 0000000000000006
> > > [    4.452937] x5 : ffff08dd68a44409 x4 : 0000000000000001 x3 : ffff8000094a35d0
> > > [    4.460061] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8000094a3738
> > > [    4.467186] Call trace:
> > > [    4.469620]  mas_split.isra.0+0x50c/0x784
> > > [    4.473618]  mas_commit_b_node.isra.0+0x1e0/0x274
> > > [    4.478311]  mas_wr_modify+0x10c/0x28c
> > > [    4.482048]  mas_wr_store_entry.isra.0+0x10c/0x4a0
> > > [    4.486827]  mas_store+0x48/0x110
> > > [    4.490131]  dup_mmap+0x268/0x514
> > > [    4.493436]  dup_mm+0x68/0xfc
> > > [    4.496391]  copy_process+0x864/0x10b4
> > > [    4.500129]  kernel_clone+0x88/0x494
> > > [    4.503692]  __do_sys_clone+0x60/0x80
> > > [    4.507342]  __arm64_sys_clone+0x2c/0x40
> > > [    4.511254]  invoke_syscall+0x78/0x100
> > > [    4.514991]  el0_svc_common.constprop.0+0x4c/0xf4
> > > [    4.519683]  do_el0_svc+0x38/0x4c
> > > [    4.522985]  el0_svc+0x34/0x100
> > > [    4.526115]  el0t_64_sync_handler+0x11c/0x150
> > > [    4.530460]  el0t_64_sync+0x190/0x194
> > > [    4.534112] Code: f9000125 f9400e65 9278dca5 f94000a5 (f9000045)
> > > [    4.540193] ---[ end trace 0000000000000000 ]---
> > > 
> > > I was also able to reproduce the same crash in a Fedora virtual machine
> > > using QEMU with Fedora's rawhide configuration [1]:
> > > 
> > > [    5.913992] Unable to handle kernel access to user memory outside uaccess routines at virtual address 0000000000000000
> > > [    5.914510] Mem abort info:
> > > [    5.914581]   ESR = 0x0000000096000044
> > > [    5.914705]   EC = 0x25: DABT (current EL), IL = 32 bits
> > > [    5.914858]   SET = 0, FnV = 0
> > > [    5.914951]   EA = 0, S1PTW = 0
> > > [    5.915065]   FSC = 0x04: level 0 translation fault
> > > [    5.915215] Data abort info:
> > > [    5.915321]   ISV = 0, ISS = 0x00000044
> > > [    5.915465]   CM = 0, WnR = 1
> > > [    5.915624] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000103051000
> > > [    5.915799] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
> > > [    5.916196] Internal error: Oops: 96000044 [#1] SMP
> > > [    5.916504] Modules linked in:
> > > [    5.916771] CPU: 2 PID: 202 Comm: dracut-rootfs-g Not tainted 5.19.0-rc4+ #1
> > > [    5.917003] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
> > > [    5.917339] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
> > > [    5.917584] pc : mas_split.isra.0+0x50c/0x784
> > > [    5.917794] lr : mas_split.isra.0+0x204/0x784
> > > [    5.917949] sp : ffff8000086334a0
> > > [    5.918038] x29: ffff8000086334a0 x28: ffff5b65c095a258 x27: ffff8000086335a0
> > > [    5.918289] x26: ffff800008633560 x25: ffff800008633508 x24: ffffdb30c80d9778
> > > [    5.918844] x23: ffff8000086335e0 x22: ffff800008633620 x21: ffff8000086336c8
> > > [    5.919277] x20: 0000000000000002 x19: ffff800008633a80 x18: 0000000000000002
> > > [    5.919533] x17: 0000000000000000 x16: ffff5b65c095a4b0 x15: 0000000000000000
> > > [    5.919747] x14: ffff5b65c095a898 x13: ffff8000086336d8 x12: 0000000000000001
> > > [    5.919971] x11: ffff8000086335a0 x10: 0000000000000003 x9 : ffff5b66f42e2a00
> > > [    5.920214] x8 : 000000000000001c x7 : 0000000000000003 x6 : 0000000000000006
> > > [    5.920493] x5 : ffff5b65c3077309 x4 : 0000000000000001 x3 : ffff800008633560
> > > [    5.920739] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8000086336c8
> > > [    5.921051] Call trace:
> > > [    5.921152]  mas_split.isra.0+0x50c/0x784
> > > [    5.921303]  mas_commit_b_node.isra.0+0x1e0/0x274
> > > [    5.921459]  mas_wr_modify+0x10c/0x28c
> > > [    5.921565]  mas_wr_store_entry.isra.0+0x10c/0x4a0
> > > [    5.921725]  mas_store+0x48/0x110
> > > [    5.921864]  dup_mmap+0x268/0x514
> > > [    5.921993]  dup_mm+0x68/0xfc
> > > [    5.922074]  copy_process+0x864/0x10b4
> > > [    5.922213]  kernel_clone+0x88/0x494
> > > [    5.922315]  __do_sys_clone+0x60/0x80
> > > [    5.922444]  __arm64_sys_clone+0x2c/0x40
> > > [    5.922576]  invoke_syscall+0x78/0x100
> > > [    5.922686]  el0_svc_common.constprop.0+0x4c/0xf4
> > > [    5.922847]  do_el0_svc+0x38/0x4c
> > > [    5.922947]  el0_svc+0x34/0x100
> > > [    5.923056]  el0t_64_sync_handler+0x11c/0x150
> > > [    5.923179]  el0t_64_sync+0x190/0x194
> > > [    5.923365] Code: f9000125 f9400e65 9278dca5 f94000a5 (f9000045)
> > > [    5.923833] ---[ end trace 0000000000000000 ]---
> > > 
> > > If there is any additional information I can provide or patches I can
> > > test, please let me know!
> > > 
> > 
> > Hello Nathan,
> > 
> > Thanks for testing this and your report.  You are the first and only
> > report of this failure so I very much appreciate it.
> > 
> > I run a number of tests on arm64 so I will have to try your kernel
> > config.  Thanks for including the link.
> 
> Nathan,
> 
> I am having a hard time reproducing this bug.  I had to modify the
> config you pointed me towards with the addition of virtio block device
> support.  I tried the next tag you had the issue with along with my most
> recent patches and neither produced the crash.  Although I was not able
> to reproduce the crash, I suspect it was to do with insufficient number
> of allocated nodes at fork time.  I've been running stress-ng with fork
> & clone in qemu but so far no luck reproducing it.

Sorry about that :( That is odd since my VM appears to be using virtio
block devices?

$ lsblk
NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
zram0  252:0    0  7.7G  0 disk [SWAP]
vda    253:0    0   50G  0 disk
├─vda1 253:1    0  600M  0 part /boot/efi
├─vda2 253:2    0    1G  0 part /boot
└─vda3 253:3    0 48.4G  0 part /

> Can you decode the line number of mas_split.isra.0+0x50c/0x784 ?

Sure thing! Here is the entire stacktrace passed through
scripts/decode_stacktrace.sh; this was done at commit fdfbd22f37db ("mm:
start tracking VMAs with maple tree"):

[    7.473069] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[    7.473377] Mem abort info:
[    7.473460]   ESR = 0x0000000096000044
[    7.473595]   EC = 0x25: DABT (current EL), IL = 32 bits
[    7.473765]   SET = 0, FnV = 0
[    7.473867]   EA = 0, S1PTW = 0
[    7.473963]   FSC = 0x04: level 0 translation fault
[    7.474238] Data abort info:
[    7.474394]   ISV = 0, ISS = 0x00000044
[    7.474574]   CM = 0, WnR = 1
[    7.474737] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000234857000
[    7.474895] [0000000000000000] pgd=0000000000000000, p4d=0000000000000000
[    7.475717] Internal error: Oops: 96000044 [#1] SMP
[    7.476094] Modules linked in:
[    7.476450] CPU: 0 PID: 206 Comm: dracut-rootfs-g Not tainted 5.19.0-rc4+ #1
[    7.476788] Hardware name: QEMU QEMU Virtual Machine, BIOS 0.0.0 02/06/2015
[    7.477085] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[    7.477379] pc : mas_split.isra.0 (lib/maple_tree.c:3303 lib/maple_tree.c:3512) 
[    7.477941] lr : mas_split.isra.0 (lib/maple_tree.c:3543) 
[    7.478136] sp : ffff8000089ab420
[    7.478269] x29: ffff8000089ab420 x28: ffff0001f4868c80 x27: ffff8000089ab520
[    7.478620] x26: ffff8000089ab4e0 x25: ffff8000089ab488 x24: ffffd28653819778
[    7.478884] x23: ffff8000089ab560 x22: ffff8000089ab5a0 x21: ffff8000089ab648
[    7.479122] x20: 0000000000000002 x19: ffff8000089aba00 x18: 0000000000000002
[    7.479354] x17: 0000000000000000 x16: ffff0001f485a258 x15: 0000000000000000
[    7.479662] x14: ffff0001f485aed8 x13: ffff8000089ab658 x12: 0000000000000001
[    7.479926] x11: ffff8000089ab520 x10: 0000000000000003 x9 : ffff0001f4bb5d00
[    7.480240] x8 : 000000000000001c x7 : 0000000000000003 x6 : 0000000000000006
[    7.480638] x5 : ffff0001f4a7e789 x4 : 0000000000000001 x3 : ffff8000089ab4e0
[    7.480911] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff8000089ab648
[    7.481254] Call trace:
[    7.481364] mas_split.isra.0 (lib/maple_tree.c:3303 lib/maple_tree.c:3512) 
[    7.481580] mas_commit_b_node.isra.0 (lib/maple_tree.c:3618) 
[    7.481734] mas_wr_modify (lib/maple_tree.c:4356) 
[    7.481886] mas_wr_store_entry.isra.0 (lib/maple_tree.c:4396) 
[    7.482096] mas_store (lib/maple_tree.c:5651) 
[    7.482265] dup_mmap (kernel/fork.c:707) 
[    7.482410] dup_mm (kernel/fork.c:1539) 
[    7.482554] copy_process (kernel/fork.c:1591 kernel/fork.c:2254) 
[    7.482718] kernel_clone (kernel/fork.c:2669) 
[    7.482878] __do_sys_clone (kernel/fork.c:2804) 
[    7.483072] __arm64_sys_clone (kernel/fork.c:2771) 
[    7.483201] invoke_syscall (./arch/arm64/include/asm/current.h:19 arch/arm64/kernel/syscall.c:57) 
[    7.483361] el0_svc_common.constprop.0 (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/syscall.c:150) 
[    7.483518] do_el0_svc (arch/arm64/kernel/syscall.c:207) 
[    7.483651] el0_svc (./arch/arm64/include/asm/daifflags.h:28 arch/arm64/kernel/entry-common.c:133 arch/arm64/kernel/entry-common.c:142 arch/arm64/kernel/entry-common.c:625) 
[    7.483767] el0t_64_sync_handler (arch/arm64/kernel/entry-common.c:643) 
[    7.483918] el0t_64_sync (arch/arm64/kernel/entry.S:581) 
[ 7.484294] Code: f9000125 f9400e65 9278dca5 f94000a5 (f9000045)
All code
========
   0:	f9000125 	.word	0xf9000125
   4:	f9400e65 	.word	0xf9400e65
   8:	9278dca5 	.word	0x9278dca5
   c:	f94000a5 	.word	0xf94000a5
  10:*	f9000045 	.word	0xf9000045		<-- trapping instruction

Code starting with the faulting instruction
===========================================
   0:	f9000045 	.word	0xf9000045
[    7.484865] ---[ end trace 0000000000000000 ]---

> Could you test git tag howlett/maple/20220728 from
> http://git.infradead.org/users/jedix/linux-maple.git and see if this
> issue still triggers?

That tag appears to be okay, so this bug was one you already fixed that
manifested in a different way?

Cheers,
Nathan
Liam R. Howlett July 29, 2022, 8:13 p.m. UTC | #5
* Nathan Chancellor <nathan@kernel.org> [220729 13:02]:
> On Fri, Jul 29, 2022 at 03:41:44PM +0000, Liam Howlett wrote:
> > Nathan,
> > 
> > I am having a hard time reproducing this bug.  I had to modify the
> > config you pointed me towards with the addition of virtio block device
> > support.  I tried the next tag you had the issue with along with my most
> > recent patches and neither produced the crash.  Although I was not able
> > to reproduce the crash, I suspect it was to do with insufficient number
> > of allocated nodes at fork time.  I've been running stress-ng with fork
> > & clone in qemu but so far no luck reproducing it.
> 
> Sorry about that :( That is odd since my VM appears to be using virtio
> block devices?

I'm booting without an initrd as I am not set up to build one for the
archs I'm testing (s390, arm, arm64, i386, x86_64, sparc64, etc).  This
is quite likely why I wasn't able to hit your bug.  It could also be the
glibc version, since that affects the VMA layouts and thus the
operations/ordering of splits of the tree.

> 
> $ lsblk
> NAME   MAJ:MIN RM  SIZE RO TYPE MOUNTPOINTS
> zram0  252:0    0  7.7G  0 disk [SWAP]
> vda    253:0    0   50G  0 disk
> ├─vda1 253:1    0  600M  0 part /boot/efi
> ├─vda2 253:2    0    1G  0 part /boot
> └─vda3 253:3    0 48.4G  0 part /
> 
> > Can you decode the line number of mas_split.isra.0+0x50c/0x784 ?
> 
> > Could you test git tag howlett/maple/20220728 from
> > http://git.infradead.org/users/jedix/linux-maple.git and see if this
> > issue still triggers?
> 
> That tag appears to be okay, so this bug was one you already fixed that
> manifested in a different way?

Yes.  It ran out of nodes.  I added a warning if there are zero nodes
when an allocation request happens, but in your case you had nodes -
just not enough.  The idea here is that the maple tree enters bulk
allocation mode when mas_expected_entries() is called (it is also
entered when mas_preallocate() is called), so I expect there to be
enough nodes for the desired operation.  I just forgot a few things that
needed two extra nodes.
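
To illustrate the failure mode, here is a minimal userspace sketch, not
the kernel code: a fixed reserve of nodes is set aside up front, and an
operation that needs more than the reserve comes up short.  All demo_*
names are hypothetical and exist only for this illustration.

```c
#include <stddef.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct demo_node {
	struct demo_node *parent;
};

struct demo_pool {
	struct demo_node nodes[8];
	size_t reserved;	/* how many nodes were set aside up front */
	size_t used;		/* how many have been handed out so far */
};

/* Reserve `count` nodes up front, capped at the pool capacity. */
static void demo_pool_reserve(struct demo_pool *pool, size_t count)
{
	size_t cap = sizeof(pool->nodes) / sizeof(pool->nodes[0]);

	pool->reserved = count < cap ? count : cap;
	pool->used = 0;
}

/* Hand out the next reserved node, or NULL once the reserve is gone. */
static struct demo_node *demo_pool_pop(struct demo_pool *pool)
{
	if (pool->used >= pool->reserved)
		return NULL;
	return &pool->nodes[pool->used++];
}

/* An operation needing `needed` nodes: 0 on success, -1 if the reserve is short. */
static int demo_split(struct demo_pool *pool, size_t needed)
{
	for (size_t i = 0; i < needed; i++) {
		struct demo_node *node = demo_pool_pop(pool);

		if (!node)
			return -1;	/* the sketch checks; a missing check means a NULL write */
		node->parent = NULL;
	}
	return 0;
}
```

Reserving 3 nodes and then asking demo_split() for 5 returns -1 on the
fourth node, which mirrors having nodes but not enough of them.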

[    7.481364] mas_split.isra.0 (lib/maple_tree.c:3303 lib/maple_tree.c:3512) 

line 3303: mte_to_node(ancestor)->parent = mas_mn(mas)->parent;

ancestor is (close to) null; it ends up being null once the last 8
bits are masked out.  See include/linux/maple_tree.h ~line 86 for more
node encoding information.
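
As a rough userspace illustration of that masking (a sketch under
assumptions, not the code in lib/maple_tree.c): metadata lives in the low
8 bits of an encoded node pointer, so a value that holds only metadata
decodes to NULL.  The demo_* names and the 0xff mask are assumptions
based on the "last 8 bits" description above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names; the mask width is assumed from the description above. */
#define DEMO_NODE_MASK ((uintptr_t)0xff)

struct demo_node {
	struct demo_node *parent;
};

/* Pack metadata into the low 8 bits; the node must be 256-byte aligned. */
static uintptr_t demo_encode(struct demo_node *node, unsigned int meta)
{
	return (uintptr_t)node | (meta & DEMO_NODE_MASK);
}

/* Strip the low 8 bits to recover the node pointer. */
static struct demo_node *demo_to_node(uintptr_t encoded)
{
	return (struct demo_node *)(encoded & ~DEMO_NODE_MASK);
}
```

An encoded value such as 0x09, which carries metadata but no address,
decodes to NULL here, so a later store through the decoded pointer would
fault at address 0.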

The last commit on the tag adds more thorough testing of
mas_expected_entries() to ensure that there are enough nodes in the new
calculation of the maximum nodes needed to duplicate a tree.

Thanks,
Liam

Patch

diff --git a/arch/x86/kernel/tboot.c b/arch/x86/kernel/tboot.c
index 0c1154a1c403..71c54ad3868a 100644
--- a/arch/x86/kernel/tboot.c
+++ b/arch/x86/kernel/tboot.c
@@ -97,6 +97,7 @@  void __init tboot_probe(void)
 static pgd_t *tboot_pg_dir;
 static struct mm_struct tboot_mm = {
 	.mm_rb          = RB_ROOT,
+	.mm_mt          = MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, tboot_mm.mmap_lock),
 	.pgd            = swapper_pg_dir,
 	.mm_users       = ATOMIC_INIT(2),
 	.mm_count       = ATOMIC_INIT(1),
diff --git a/drivers/firmware/efi/efi.c b/drivers/firmware/efi/efi.c
index 860534bcfdac..1eddef189d68 100644
--- a/drivers/firmware/efi/efi.c
+++ b/drivers/firmware/efi/efi.c
@@ -58,6 +58,7 @@  static unsigned long __initdata rt_prop = EFI_INVALID_TABLE_ADDR;
 
 struct mm_struct efi_mm = {
 	.mm_rb			= RB_ROOT,
+	.mm_mt			= MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, efi_mm.mmap_lock),
 	.mm_users		= ATOMIC_INIT(2),
 	.mm_count		= ATOMIC_INIT(1),
 	.write_protect_seq      = SEQCNT_ZERO(efi_mm.write_protect_seq),
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cf3d0d673f6b..adc963765d95 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2581,6 +2581,8 @@  extern bool arch_has_descending_max_zone_pfns(void);
 /* nommu.c */
 extern atomic_long_t mmap_pages_allocated;
 extern int nommu_shrink_inode_mappings(struct inode *, size_t, size_t);
+/* mmap.c */
+void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas);
 
 /* interval_tree.c */
 void vma_interval_tree_insert(struct vm_area_struct *node,
@@ -2644,6 +2646,9 @@  extern struct vm_area_struct *copy_vma(struct vm_area_struct **,
 	bool *need_rmap_locks);
 extern void exit_mmap(struct mm_struct *);
 
+void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas);
+void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas);
+
 static inline int check_data_rlimit(unsigned long rlim,
 				    unsigned long new,
 				    unsigned long start,
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b961a29bf26..e810aaca6c04 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -9,6 +9,7 @@ 
 #include <linux/list.h>
 #include <linux/spinlock.h>
 #include <linux/rbtree.h>
+#include <linux/maple_tree.h>
 #include <linux/rwsem.h>
 #include <linux/completion.h>
 #include <linux/cpumask.h>
@@ -481,6 +482,7 @@  struct kioctx_table;
 struct mm_struct {
 	struct {
 		struct vm_area_struct *mmap;		/* list of VMAs */
+		struct maple_tree mm_mt;
 		struct rb_root mm_rb;
 		u64 vmacache_seqnum;                   /* per-thread vmacache */
 #ifdef CONFIG_MMU
@@ -676,6 +678,7 @@  struct mm_struct {
 	unsigned long cpu_bitmap[];
 };
 
+#define MM_MT_FLAGS	(MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN)
 extern struct mm_struct init_mm;
 
 /* Pointer magic because the dynamic array size confuses some compilers. */
diff --git a/include/trace/events/mmap.h b/include/trace/events/mmap.h
index 4661f7ba07c0..216de5f03621 100644
--- a/include/trace/events/mmap.h
+++ b/include/trace/events/mmap.h
@@ -42,6 +42,79 @@  TRACE_EVENT(vm_unmapped_area,
 		__entry->low_limit, __entry->high_limit, __entry->align_mask,
 		__entry->align_offset)
 );
+
+TRACE_EVENT(vma_mas_szero,
+	TP_PROTO(struct maple_tree *mt, unsigned long start,
+		 unsigned long end),
+
+	TP_ARGS(mt, start, end),
+
+	TP_STRUCT__entry(
+			__field(struct maple_tree *, mt)
+			__field(unsigned long, start)
+			__field(unsigned long, end)
+	),
+
+	TP_fast_assign(
+			__entry->mt		= mt;
+			__entry->start		= start;
+			__entry->end		= end;
+	),
+
+	TP_printk("mt_mod %p, (NULL), SNULL, %lu, %lu,",
+		  __entry->mt,
+		  (unsigned long) __entry->start,
+		  (unsigned long) __entry->end
+	)
+);
+
+TRACE_EVENT(vma_store,
+	TP_PROTO(struct maple_tree *mt, struct vm_area_struct *vma),
+
+	TP_ARGS(mt, vma),
+
+	TP_STRUCT__entry(
+			__field(struct maple_tree *, mt)
+			__field(struct vm_area_struct *, vma)
+			__field(unsigned long, vm_start)
+			__field(unsigned long, vm_end)
+	),
+
+	TP_fast_assign(
+			__entry->mt		= mt;
+			__entry->vma		= vma;
+			__entry->vm_start	= vma->vm_start;
+			__entry->vm_end		= vma->vm_end - 1;
+	),
+
+	TP_printk("mt_mod %p, (%p), STORE, %lu, %lu,",
+		  __entry->mt, __entry->vma,
+		  (unsigned long) __entry->vm_start,
+		  (unsigned long) __entry->vm_end
+	)
+);
+
+
+TRACE_EVENT(exit_mmap,
+	TP_PROTO(struct mm_struct *mm),
+
+	TP_ARGS(mm),
+
+	TP_STRUCT__entry(
+			__field(struct mm_struct *, mm)
+			__field(struct maple_tree *, mt)
+	),
+
+	TP_fast_assign(
+		       __entry->mm		= mm;
+		       __entry->mt		= &mm->mm_mt;
+	),
+
+	TP_printk("mt_mod %p, DESTROY\n",
+		  __entry->mt
+	)
+);
+
 #endif
 
 /* This part must be outside protection */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9d44f2d46c69..1840da0732f6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -585,6 +585,7 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 	int retval;
 	unsigned long charge;
 	LIST_HEAD(uf);
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	uprobe_start_dup_mmap();
 	if (mmap_write_lock_killable(oldmm)) {
@@ -614,6 +615,10 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		goto out;
 	khugepaged_fork(mm, oldmm);
 
+	retval = mas_expected_entries(&mas, oldmm->map_count);
+	if (retval)
+		goto out;
+
 	prev = NULL;
 	for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
 		struct file *file;
@@ -629,7 +634,7 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		 */
 		if (fatal_signal_pending(current)) {
 			retval = -EINTR;
-			goto out;
+			goto loop_out;
 		}
 		if (mpnt->vm_flags & VM_ACCOUNT) {
 			unsigned long len = vma_pages(mpnt);
@@ -694,6 +699,11 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 		rb_link = &tmp->vm_rb.rb_right;
 		rb_parent = &tmp->vm_rb;
 
+		/* Link the vma into the MT */
+		mas.index = tmp->vm_start;
+		mas.last = tmp->vm_end - 1;
+		mas_store(&mas, tmp);
+
 		mm->map_count++;
 		if (!(tmp->vm_flags & VM_WIPEONFORK))
 			retval = copy_page_range(tmp, mpnt);
@@ -702,10 +712,12 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 			tmp->vm_ops->open(tmp);
 
 		if (retval)
-			goto out;
+			goto loop_out;
 	}
 	/* a new mm has just been created */
 	retval = arch_dup_mmap(oldmm, mm);
+loop_out:
+	mas_destroy(&mas);
 out:
 	mmap_write_unlock(mm);
 	flush_tlb_mm(oldmm);
@@ -721,7 +733,7 @@  static __latent_entropy int dup_mmap(struct mm_struct *mm,
 fail_nomem:
 	retval = -ENOMEM;
 	vm_unacct_memory(charge);
-	goto out;
+	goto loop_out;
 }
 
 static inline int mm_alloc_pgd(struct mm_struct *mm)
@@ -1111,6 +1123,8 @@  static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
 {
 	mm->mmap = NULL;
 	mm->mm_rb = RB_ROOT;
+	mt_init_flags(&mm->mm_mt, MM_MT_FLAGS);
+	mt_set_external_lock(&mm->mm_mt, &mm->mmap_lock);
 	mm->vmacache_seqnum = 0;
 	atomic_set(&mm->mm_users, 1);
 	atomic_set(&mm->mm_count, 1);
diff --git a/mm/init-mm.c b/mm/init-mm.c
index fbe7844d0912..b912b0f2eced 100644
--- a/mm/init-mm.c
+++ b/mm/init-mm.c
@@ -1,6 +1,7 @@ 
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/mm_types.h>
 #include <linux/rbtree.h>
+#include <linux/maple_tree.h>
 #include <linux/rwsem.h>
 #include <linux/spinlock.h>
 #include <linux/list.h>
@@ -29,6 +30,7 @@ 
  */
 struct mm_struct init_mm = {
 	.mm_rb		= RB_ROOT,
+	.mm_mt		= MTREE_INIT_EXT(mm_mt, MM_MT_FLAGS, init_mm.mmap_lock),
 	.pgd		= swapper_pg_dir,
 	.mm_users	= ATOMIC_INIT(2),
 	.mm_count	= ATOMIC_INIT(1),
diff --git a/mm/mmap.c b/mm/mmap.c
index 61e6135c54ef..0e202b16caf3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -381,7 +381,71 @@  static int browse_rb(struct mm_struct *mm)
 	}
 	return bug ? -1 : i;
 }
+#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
+extern void mt_validate(struct maple_tree *mt);
+extern void mt_dump(const struct maple_tree *mt);
 
+/* Validate the maple tree */
+static void validate_mm_mt(struct mm_struct *mm)
+{
+	struct maple_tree *mt = &mm->mm_mt;
+	struct vm_area_struct *vma_mt, *vma = mm->mmap;
+
+	MA_STATE(mas, mt, 0, 0);
+
+	mt_validate(&mm->mm_mt);
+	mas_for_each(&mas, vma_mt, ULONG_MAX) {
+		if (xa_is_zero(vma_mt))
+			continue;
+
+		if (!vma)
+			break;
+
+		if ((vma != vma_mt) ||
+		    (vma->vm_start != vma_mt->vm_start) ||
+		    (vma->vm_end != vma_mt->vm_end) ||
+		    (vma->vm_start != mas.index) ||
+		    (vma->vm_end - 1 != mas.last)) {
+			pr_emerg("issue in %s\n", current->comm);
+			dump_stack();
+#ifdef CONFIG_DEBUG_VM
+			dump_vma(vma_mt);
+			pr_emerg("and next in rb\n");
+			dump_vma(vma->vm_next);
+#endif
+			pr_emerg("mt piv: %p %lu - %lu\n", vma_mt,
+				 mas.index, mas.last);
+			pr_emerg("mt vma: %p %lu - %lu\n", vma_mt,
+				 vma_mt->vm_start, vma_mt->vm_end);
+			pr_emerg("rb vma: %p %lu - %lu\n", vma,
+				 vma->vm_start, vma->vm_end);
+			pr_emerg("rb->next = %p %lu - %lu\n", vma->vm_next,
+					vma->vm_next->vm_start, vma->vm_next->vm_end);
+
+			mt_dump(mas.tree);
+			if (vma_mt->vm_end != mas.last + 1) {
+				pr_err("vma: %p vma_mt %lu-%lu\tmt %lu-%lu\n",
+						mm, vma_mt->vm_start, vma_mt->vm_end,
+						mas.index, mas.last);
+				mt_dump(mas.tree);
+			}
+			VM_BUG_ON_MM(vma_mt->vm_end != mas.last + 1, mm);
+			if (vma_mt->vm_start != mas.index) {
+				pr_err("vma: %p vma_mt %p %lu - %lu doesn't match\n",
+						mm, vma_mt, vma_mt->vm_start, vma_mt->vm_end);
+				mt_dump(mas.tree);
+			}
+			VM_BUG_ON_MM(vma_mt->vm_start != mas.index, mm);
+		}
+		VM_BUG_ON(vma != vma_mt);
+		vma = vma->vm_next;
+
+	}
+	VM_BUG_ON(vma);
+}
+#else
+#define validate_mm_mt(root) do { } while (0)
+#endif
 static void validate_mm_rb(struct rb_root *root, struct vm_area_struct *ignore)
 {
 	struct rb_node *nd;
@@ -436,6 +500,7 @@  static void validate_mm(struct mm_struct *mm)
 }
 #else
 #define validate_mm_rb(root, ignore) do { } while (0)
+#define validate_mm_mt(root) do { } while (0)
 #define validate_mm(mm) do { } while (0)
 #endif
 
@@ -680,6 +745,56 @@  static void __vma_link_file(struct vm_area_struct *vma)
 	}
 }
 
+/*
+ * vma_mas_store() - Store a VMA in the maple tree.
+ * @vma: The vm_area_struct
+ * @mas: The maple state
+ *
+ * Efficient way to store a VMA in the maple tree when the @mas has already
+ * walked to the correct location.
+ *
+ * Note: the end address is inclusive in the maple tree.
+ */
+void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas)
+{
+	trace_vma_store(mas->tree, vma);
+	mas_set_range(mas, vma->vm_start, vma->vm_end - 1);
+	mas_store_prealloc(mas, vma);
+}
+
+/*
+ * vma_mas_remove() - Remove a VMA from the maple tree.
+ * @vma: The vm_area_struct
+ * @mas: The maple state
+ *
+ * Efficient way to remove a VMA from the maple tree when the @mas has already
+ * been established and points to the correct location.
+ * Note: the end address is inclusive in the maple tree.
+ */
+void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas)
+{
+	trace_vma_mas_szero(mas->tree, vma->vm_start, vma->vm_end - 1);
+	mas->index = vma->vm_start;
+	mas->last = vma->vm_end - 1;
+	mas_store_prealloc(mas, NULL);
+}
+
+/*
+ * vma_mas_szero() - Set a given range to zero.  Used when modifying a
+ * vm_area_struct start or end.
+ *
+ * @mas: The maple state
+ * @start: The start address to zero
+ * @end: The end address to zero.
+ */
+static inline void vma_mas_szero(struct ma_state *mas, unsigned long start,
+				unsigned long end)
+{
+	trace_vma_mas_szero(mas->tree, start, end - 1);
+	mas_set_range(mas, start, end - 1);
+	mas_store_prealloc(mas, NULL);
+}
+
 static void
 __vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 	struct vm_area_struct *prev, struct rb_node **rb_link,
@@ -689,17 +804,22 @@  __vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 	__vma_link_rb(mm, vma, rb_link, rb_parent);
 }
 
-static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
+static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 			struct vm_area_struct *prev, struct rb_node **rb_link,
 			struct rb_node *rb_parent)
 {
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 	struct address_space *mapping = NULL;
 
+	if (mas_preallocate(&mas, vma, GFP_KERNEL))
+		return -ENOMEM;
+
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
 		i_mmap_lock_write(mapping);
 	}
 
+	vma_mas_store(vma, &mas);
 	__vma_link(mm, vma, prev, rb_link, rb_parent);
 	__vma_link_file(vma);
 
@@ -708,13 +828,15 @@  static void vma_link(struct mm_struct *mm, struct vm_area_struct *vma,
 
 	mm->map_count++;
 	validate_mm(mm);
+	return 0;
 }
 
 /*
  * Helper for vma_adjust() in the split_vma insert case: insert a vma into the
  * mm's list and rbtree.  It has already been inserted into the interval tree.
  */
-static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
+static void __insert_vm_struct(struct mm_struct *mm, struct ma_state *mas,
+			       struct vm_area_struct *vma)
 {
 	struct vm_area_struct *prev;
 	struct rb_node **rb_link, *rb_parent;
@@ -722,7 +844,10 @@  static void __insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 	if (find_vma_links(mm, vma->vm_start, vma->vm_end,
 			   &prev, &rb_link, &rb_parent))
 		BUG();
-	__vma_link(mm, vma, prev, rb_link, rb_parent);
+
+	vma_mas_store(vma, mas);
+	__vma_link_list(mm, vma, prev);
+	__vma_link_rb(mm, vma, rb_link, rb_parent);
 	mm->map_count++;
 }
 
@@ -749,6 +874,7 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 {
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *next = vma->vm_next, *orig_vma = vma;
+	struct vm_area_struct *next_next;
 	struct address_space *mapping = NULL;
 	struct rb_root_cached *root = NULL;
 	struct anon_vma *anon_vma = NULL;
@@ -756,10 +882,13 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	bool start_changed = false, end_changed = false;
 	long adjust_next = 0;
 	int remove_next = 0;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
+	struct vm_area_struct *exporter = NULL, *importer = NULL;
 
-	if (next && !insert) {
-		struct vm_area_struct *exporter = NULL, *importer = NULL;
+	validate_mm(mm);
+	validate_mm_mt(mm);
 
+	if (next && !insert) {
 		if (end >= next->vm_end) {
 			/*
 			 * vma expands, overlapping all the next, and
@@ -788,10 +917,11 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 				 * remove_next == 1 is case 1 or 7.
 				 */
 				remove_next = 1 + (end > next->vm_end);
+				if (remove_next == 2)
+					next_next = find_vma(mm, next->vm_end);
+
 				VM_WARN_ON(remove_next == 2 &&
 					   end != next->vm_next->vm_end);
-				/* trim end to next, for case 6 first pass */
-				end = next->vm_end;
 			}
 
 			exporter = next;
@@ -839,9 +969,11 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 				return error;
 		}
 	}
-again:
-	vma_adjust_trans_huge(orig_vma, start, end, adjust_next);
 
+	if (mas_preallocate(&mas, vma, GFP_KERNEL))
+		return -ENOMEM;
+
+	vma_adjust_trans_huge(orig_vma, start, end, adjust_next);
 	if (file) {
 		mapping = file->f_mapping;
 		root = &mapping->i_mmap;
@@ -882,17 +1014,28 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (start != vma->vm_start) {
+		unsigned long old_start = vma->vm_start;
 		vma->vm_start = start;
+		if (old_start < start)
+			vma_mas_szero(&mas, old_start, start);
 		start_changed = true;
 	}
 	if (end != vma->vm_end) {
+		unsigned long old_end = vma->vm_end;
 		vma->vm_end = end;
+		if (old_end > end)
+			vma_mas_szero(&mas, end, old_end);
 		end_changed = true;
 	}
+
+	if (end_changed || start_changed)
+		vma_mas_store(vma, &mas);
+
 	vma->vm_pgoff = pgoff;
 	if (adjust_next) {
 		next->vm_start += adjust_next;
 		next->vm_pgoff += adjust_next >> PAGE_SHIFT;
+		vma_mas_store(next, &mas);
 	}
 
 	if (file) {
@@ -906,10 +1049,14 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		/*
 		 * vma_merge has merged next into vma, and needs
 		 * us to remove next before dropping the locks.
+		 * Since we have expanded over this vma, storing the new value
+		 * will already have overwritten it in the maple tree.
 		 */
-		if (remove_next != 3)
+		if (remove_next != 3) {
 			__vma_unlink(mm, next, next);
-		else
+			if (remove_next == 2)
+				__vma_unlink(mm, next_next, next_next);
+		} else {
 			/*
 			 * vma is not before next if they've been
 			 * swapped.
@@ -920,15 +1067,19 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * "vma").
 			 */
 			__vma_unlink(mm, next, vma);
-		if (file)
+		}
+		if (file) {
 			__remove_shared_vm_struct(next, file, mapping);
+			if (remove_next == 2)
+				__remove_shared_vm_struct(next_next, file, mapping);
+		}
 	} else if (insert) {
 		/*
 		 * split_vma has split insert from vma, and needs
 		 * us to insert it before dropping the locks
 		 * (it may either follow vma or precede it).
 		 */
-		__insert_vm_struct(mm, insert);
+		__insert_vm_struct(mm, &mas, insert);
 	} else {
 		if (start_changed)
 			vma_gap_update(vma);
@@ -956,6 +1107,7 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 	}
 
 	if (remove_next) {
+again:
 		if (file) {
 			uprobe_munmap(next, next->vm_start, next->vm_end);
 			fput(file);
@@ -977,7 +1129,7 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 			 * "next->vm_prev->vm_end" changed and the
 			 * "vma->vm_next" gap must be updated.
 			 */
-			next = vma->vm_next;
+			next = next_next;
 		} else {
 			/*
 			 * For the scope of the comment "next" and
@@ -993,7 +1145,6 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		}
 		if (remove_next == 2) {
 			remove_next = 1;
-			end = next->vm_end;
 			goto again;
 		}
 		else if (next)
@@ -1025,6 +1176,7 @@  int __vma_adjust(struct vm_area_struct *vma, unsigned long start,
 		uprobe_mmap(insert);
 
 	validate_mm(mm);
+	validate_mm_mt(mm);
 
 	return 0;
 }
@@ -1178,6 +1330,7 @@  struct vm_area_struct *vma_merge(struct mm_struct *mm,
 	struct vm_area_struct *area, *next;
 	int err;
 
+	validate_mm_mt(mm);
 	/*
 	 * We later require that vma->vm_flags == vm_flags,
 	 * so this tests vma->vm_flags & VM_SPECIAL, too.
@@ -1253,6 +1406,7 @@  struct vm_area_struct *vma_merge(struct mm_struct *mm,
 		khugepaged_enter_vma(area, vm_flags);
 		return area;
 	}
+	validate_mm_mt(mm);
 
 	return NULL;
 }
@@ -1732,6 +1886,7 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 	struct rb_node **rb_link, *rb_parent;
 	unsigned long charged = 0;
 
+	validate_mm_mt(mm);
 	/* Check against address space limit. */
 	if (!may_expand_vm(mm, vm_flags, len >> PAGE_SHIFT)) {
 		unsigned long nr_pages;
@@ -1846,7 +2001,13 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 			goto free_vma;
 	}
 
-	vma_link(mm, vma, prev, rb_link, rb_parent);
+	if (vma_link(mm, vma, prev, rb_link, rb_parent)) {
+		error = -ENOMEM;
+		if (file)
+			goto unmap_and_free_vma;
+		else
+			goto free_vma;
+	}
 
 	/*
 	 * vma_merge() calls khugepaged_enter_vma() either, the below
@@ -1886,6 +2047,7 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 
 	vma_set_page_prot(vma);
 
+	validate_mm_mt(mm);
 	return addr;
 
 unmap_and_free_vma:
@@ -1902,6 +2064,7 @@  unsigned long mmap_region(struct file *file, unsigned long addr,
 unacct_error:
 	if (charged)
 		vm_unacct_memory(charged);
+	validate_mm_mt(mm);
 	return error;
 }
 
@@ -1918,12 +2081,19 @@  static unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 	struct mm_struct *mm = current->mm;
 	struct vm_area_struct *vma;
 	unsigned long length, low_limit, high_limit, gap_start, gap_end;
+	unsigned long gap;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	/* Adjust search length to account for worst case alignment overhead */
 	length = info->length + info->align_mask;
 	if (length < info->length)
 		return -ENOMEM;
 
+	mas_empty_area(&mas, info->low_limit, info->high_limit - 1,
+			   length);
+	gap = mas.index;
+	gap += (info->align_offset - gap) & info->align_mask;
+
 	/* Adjust search limits by the desired length */
 	if (info->high_limit < length)
 		return -ENOMEM;
@@ -2005,20 +2175,31 @@  static unsigned long unmapped_area(struct vm_unmapped_area_info *info)
 
 	VM_BUG_ON(gap_start + info->length > info->high_limit);
 	VM_BUG_ON(gap_start + info->length > gap_end);
+
+	VM_BUG_ON(gap != gap_start);
 	return gap_start;
 }
 
 static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
 {
 	struct mm_struct *mm = current->mm;
-	struct vm_area_struct *vma;
+	struct vm_area_struct *vma = NULL;
 	unsigned long length, low_limit, high_limit, gap_start, gap_end;
+	unsigned long gap;
+
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
+	validate_mm_mt(mm);
 
 	/* Adjust search length to account for worst case alignment overhead */
 	length = info->length + info->align_mask;
 	if (length < info->length)
 		return -ENOMEM;
 
+	mas_empty_area_rev(&mas, info->low_limit, info->high_limit - 1,
+		   length);
+	gap = mas.last + 1 - info->length;
+	gap -= (gap - info->align_offset) & info->align_mask;
+
 	/*
 	 * Adjust search limits by the desired length.
 	 * See implementation comment at top of unmapped_area().
@@ -2104,6 +2285,32 @@  static unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
 
 	VM_BUG_ON(gap_end < info->low_limit);
 	VM_BUG_ON(gap_end < gap_start);
+
+	if (gap != gap_end) {
+		pr_err("%s: %p Gap was found: mt %lu gap_end %lu\n", __func__,
+		       mm, gap, gap_end);
+		pr_err("window was %lu - %lu size %lu\n", info->high_limit,
+		       info->low_limit, length);
+		pr_err("mas.min %lu max %lu mas.last %lu\n", mas.min, mas.max,
+		       mas.last);
+		pr_err("mas.index %lu align mask %lu offset %lu\n", mas.index,
+		       info->align_mask, info->align_offset);
+		pr_err("rb_find_vma find on %lu => %p (%p)\n", mas.index,
+		       find_vma(mm, mas.index), vma);
+#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
+		mt_dump(&mm->mm_mt);
+#endif
+		{
+			struct vm_area_struct *dv = mm->mmap;
+
+			while (dv) {
+				pr_err("vma %p %lu-%lu\n", dv, dv->vm_start, dv->vm_end);
+				dv = dv->vm_next;
+			}
+		}
+		VM_BUG_ON(gap != gap_end);
+	}
+
 	return gap_end;
 }
 
@@ -2326,7 +2533,6 @@  struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 		vmacache_update(addr, vma);
 	return vma;
 }
-
 EXPORT_SYMBOL(find_vma);
 
 /*
@@ -2399,7 +2605,9 @@  int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	struct vm_area_struct *next;
 	unsigned long gap_addr;
 	int error = 0;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
+	validate_mm_mt(mm);
 	if (!(vma->vm_flags & VM_GROWSUP))
 		return -EFAULT;
 
@@ -2423,9 +2631,14 @@  int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		/* Check that both stack segments have the same anon_vma? */
 	}
 
+	if (mas_preallocate(&mas, vma, GFP_KERNEL))
+		return -ENOMEM;
+
 	/* We must make sure the anon_vma is allocated. */
-	if (unlikely(anon_vma_prepare(vma)))
+	if (unlikely(anon_vma_prepare(vma))) {
+		mas_destroy(&mas);
 		return -ENOMEM;
+	}
 
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
@@ -2462,6 +2675,8 @@  int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 				vm_stat_account(mm, vma->vm_flags, grow);
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_end = address;
+				/* Overwrite old entry in mtree. */
+				vma_mas_store(vma, &mas);
 				anon_vma_interval_tree_post_update_vma(vma);
 				if (vma->vm_next)
 					vma_gap_update(vma->vm_next);
@@ -2476,6 +2691,8 @@  int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	anon_vma_unlock_write(vma->anon_vma);
 	khugepaged_enter_vma(vma, vma->vm_flags);
 	validate_mm(mm);
+	validate_mm_mt(mm);
+	mas_destroy(&mas);
 	return error;
 }
 #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
@@ -2489,7 +2706,9 @@  int expand_downwards(struct vm_area_struct *vma,
 	struct mm_struct *mm = vma->vm_mm;
 	struct vm_area_struct *prev;
 	int error = 0;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
+	validate_mm(mm);
 	address &= PAGE_MASK;
 	if (address < mmap_min_addr)
 		return -EPERM;
@@ -2503,9 +2722,14 @@  int expand_downwards(struct vm_area_struct *vma,
 			return -ENOMEM;
 	}
 
+	if (mas_preallocate(&mas, vma, GFP_KERNEL))
+		return -ENOMEM;
+
 	/* We must make sure the anon_vma is allocated. */
-	if (unlikely(anon_vma_prepare(vma)))
+	if (unlikely(anon_vma_prepare(vma))) {
+		mas_destroy(&mas);
 		return -ENOMEM;
+	}
 
 	/*
 	 * vma->vm_start/vm_end cannot change under us because the caller
@@ -2543,6 +2767,8 @@  int expand_downwards(struct vm_area_struct *vma,
 				anon_vma_interval_tree_pre_update_vma(vma);
 				vma->vm_start = address;
 				vma->vm_pgoff -= grow;
+				/* Overwrite old entry in mtree. */
+				vma_mas_store(vma, &mas);
 				anon_vma_interval_tree_post_update_vma(vma);
 				vma_gap_update(vma);
 				spin_unlock(&mm->page_table_lock);
@@ -2554,6 +2780,7 @@  int expand_downwards(struct vm_area_struct *vma,
 	anon_vma_unlock_write(vma->anon_vma);
 	khugepaged_enter_vma(vma, vma->vm_flags);
 	validate_mm(mm);
+	mas_destroy(&mas);
 	return error;
 }
 
@@ -2676,14 +2903,17 @@  static void unmap_region(struct mm_struct *mm,
  * vma list as we go..
  */
 static bool
-detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
-	struct vm_area_struct *prev, unsigned long end)
+detach_vmas_to_be_unmapped(struct mm_struct *mm, struct ma_state *mas,
+	struct vm_area_struct *vma, struct vm_area_struct *prev,
+	unsigned long end)
 {
 	struct vm_area_struct **insertion_point;
 	struct vm_area_struct *tail_vma = NULL;
 
 	insertion_point = (prev ? &prev->vm_next : &mm->mmap);
 	vma->vm_prev = NULL;
+	mas_set_range(mas, vma->vm_start, end - 1);
+	mas_store_prealloc(mas, NULL);
 	do {
 		vma_rb_erase(vma, &mm->mm_rb);
 		if (vma->vm_flags & VM_LOCKED)
@@ -2724,6 +2954,7 @@  int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct vm_area_struct *new;
 	int err;
+	validate_mm_mt(mm);
 
 	if (vma->vm_ops && vma->vm_ops->may_split) {
 		err = vma->vm_ops->may_split(vma, addr);
@@ -2766,6 +2997,9 @@  int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	if (!err)
 		return 0;
 
+	/* Avoid vm accounting in close() operation */
+	new->vm_start = new->vm_end;
+	new->vm_pgoff = 0;
 	/* Clean everything up if vma_adjust failed. */
 	if (new->vm_ops && new->vm_ops->close)
 		new->vm_ops->close(new);
@@ -2776,6 +3010,7 @@  int __split_vma(struct mm_struct *mm, struct vm_area_struct *vma,
 	mpol_put(vma_policy(new));
  out_free_vma:
 	vm_area_free(new);
+	validate_mm_mt(mm);
 	return err;
 }
 
@@ -2802,6 +3037,8 @@  int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 {
 	unsigned long end;
 	struct vm_area_struct *vma, *prev, *last;
+	int error = -ENOMEM;
+	MA_STATE(mas, &mm->mm_mt, 0, 0);
 
 	if ((offset_in_page(start)) || start > TASK_SIZE || len > TASK_SIZE-start)
 		return -EINVAL;
@@ -2822,6 +3059,9 @@  int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	vma = find_vma_intersection(mm, start, end);
 	if (!vma)
 		return 0;
+
+	if (mas_preallocate(&mas, vma, GFP_KERNEL))
+		return -ENOMEM;
 	prev = vma->vm_prev;
 
 	/*
@@ -2832,7 +3072,6 @@  int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	 * places tmp vma above, and higher split_vma places tmp vma below.
 	 */
 	if (start > vma->vm_start) {
-		int error;
 
 		/*
 		 * Make sure that map_count on return from munmap() will
@@ -2840,20 +3079,20 @@  int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 		 * its limit temporarily, to help free resources as expected.
 		 */
 		if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
-			return -ENOMEM;
+			goto map_count_exceeded;
 
 		error = __split_vma(mm, vma, start, 0);
 		if (error)
-			return error;
+			goto split_failed;
 		prev = vma;
 	}
 
 	/* Does it split the last one? */
 	last = find_vma(mm, end);
 	if (last && end > last->vm_start) {
-		int error = __split_vma(mm, last, end, 1);
+		error = __split_vma(mm, last, end, 1);
 		if (error)
-			return error;
+			goto split_failed;
 	}
 	vma = vma_next(mm, prev);
 
@@ -2867,13 +3106,13 @@  int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 		 * split, despite we could. This is unlikely enough
 		 * failure that it's not worth optimizing it for.
 		 */
-		int error = userfaultfd_unmap_prep(vma, start, end, uf);
+		error = userfaultfd_unmap_prep(vma, start, end, uf);
 		if (error)
-			return error;
+			goto userfaultfd_error;
 	}
 
 	/* Detach vmas from rbtree */
-	if (!detach_vmas_to_be_unmapped(mm, vma, prev, end))
+	if (!detach_vmas_to_be_unmapped(mm, &mas, vma, prev, end))
 		downgrade = false;
 
 	if (downgrade)
@@ -2885,6 +3124,12 @@  int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
 	remove_vma_list(mm, vma);
 
 	return downgrade ? 1 : 0;
+
+map_count_exceeded:
+split_failed:
+userfaultfd_error:
+	mas_destroy(&mas);
+	return error;
 }
 
 int do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
@@ -3024,6 +3269,7 @@  static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
 	pgoff_t pgoff = addr >> PAGE_SHIFT;
 	int error;
 	unsigned long mapped_addr;
+	validate_mm_mt(mm);
 
 	/* Until we need other flags, refuse anything except VM_EXEC. */
 	if ((flags & (~VM_EXEC)) != 0)
@@ -3073,7 +3319,9 @@  static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
 	vma->vm_pgoff = pgoff;
 	vma->vm_flags = flags;
 	vma->vm_page_prot = vm_get_page_prot(flags);
-	vma_link(mm, vma, prev, rb_link, rb_parent);
+	if (vma_link(mm, vma, prev, rb_link, rb_parent))
+		goto no_vma_link;
+
 out:
 	perf_event_mmap(vma);
 	mm->total_vm += len >> PAGE_SHIFT;
@@ -3081,7 +3329,12 @@  static int do_brk_flags(unsigned long addr, unsigned long len, unsigned long fla
 	if (flags & VM_LOCKED)
 		mm->locked_vm += (len >> PAGE_SHIFT);
 	vma->vm_flags |= VM_SOFTDIRTY;
+	validate_mm_mt(mm);
 	return 0;
+
+no_vma_link:
+	vm_area_free(vma);
+	return -ENOMEM;
 }
 
 int vm_brk_flags(unsigned long addr, unsigned long request, unsigned long flags)
@@ -3170,6 +3423,9 @@  void exit_mmap(struct mm_struct *mm)
 		vma = remove_vma(vma);
 		cond_resched();
 	}
+
+	trace_exit_mmap(mm);
+	__mt_destroy(&mm->mm_mt);
 	mm->mmap = NULL;
 	mmap_write_unlock(mm);
 	vm_unacct_memory(nr_accounted);
@@ -3183,12 +3439,30 @@  int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 {
 	struct vm_area_struct *prev;
 	struct rb_node **rb_link, *rb_parent;
+	unsigned long start = vma->vm_start;
+	struct vm_area_struct *overlap = NULL;
+	unsigned long charged = vma_pages(vma);
 
 	if (find_vma_links(mm, vma->vm_start, vma->vm_end,
 			   &prev, &rb_link, &rb_parent))
+
+	if (find_vma_intersection(mm, vma->vm_start, vma->vm_end))
 		return -ENOMEM;
+
+	overlap = mt_find(&mm->mm_mt, &start, vma->vm_end - 1);
+	if (overlap) {
+
+		pr_err("Found vma ending at %lu\n", start - 1);
+		pr_err("vma : %lu => %lu-%lu\n", (unsigned long)overlap,
+				overlap->vm_start, overlap->vm_end - 1);
+#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
+		mt_dump(&mm->mm_mt);
+#endif
+		BUG();
+	}
+
 	if ((vma->vm_flags & VM_ACCOUNT) &&
-	     security_vm_enough_memory_mm(mm, vma_pages(vma)))
+	     security_vm_enough_memory_mm(mm, charged))
 		return -ENOMEM;
 
 	/*
@@ -3208,7 +3482,11 @@  int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
 		vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
 	}
 
-	vma_link(mm, vma, prev, rb_link, rb_parent);
+	if (vma_link(mm, vma, prev, rb_link, rb_parent)) {
+		vm_unacct_memory(charged);
+		return -ENOMEM;
+	}
+
 	return 0;
 }
 
@@ -3226,7 +3504,9 @@  struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 	struct vm_area_struct *new_vma, *prev;
 	struct rb_node **rb_link, *rb_parent;
 	bool faulted_in_anon_vma = true;
+	unsigned long index = addr;
 
+	validate_mm_mt(mm);
 	/*
 	 * If anonymous vma has not yet been faulted, update new pgoff
 	 * to match new location, to increase its chance of merging.
@@ -3238,6 +3518,8 @@  struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 
 	if (find_vma_links(mm, addr, addr + len, &prev, &rb_link, &rb_parent))
 		return NULL;	/* should never get here */
+	if (mt_find(&mm->mm_mt, &index, addr+len - 1))
+		BUG();
 	new_vma = vma_merge(mm, prev, addr, addr + len, vma->vm_flags,
 			    vma->anon_vma, vma->vm_file, pgoff, vma_policy(vma),
 			    vma->vm_userfaultfd_ctx, anon_vma_name(vma));
@@ -3281,6 +3563,7 @@  struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 		vma_link(mm, new_vma, prev, rb_link, rb_parent);
 		*need_rmap_locks = false;
 	}
+	validate_mm_mt(mm);
 	return new_vma;
 
 out_free_mempol:
@@ -3288,6 +3571,7 @@  struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 out_free_vma:
 	vm_area_free(new_vma);
 out:
+	validate_mm_mt(mm);
 	return NULL;
 }
 
@@ -3424,6 +3708,7 @@  static struct vm_area_struct *__install_special_mapping(
 	int ret;
 	struct vm_area_struct *vma;
 
+	validate_mm_mt(mm);
 	vma = vm_area_alloc(mm);
 	if (unlikely(vma == NULL))
 		return ERR_PTR(-ENOMEM);
@@ -3446,10 +3731,12 @@  static struct vm_area_struct *__install_special_mapping(
 
 	perf_event_mmap(vma);
 
+	validate_mm_mt(mm);
 	return vma;
 
 out:
 	vm_area_free(vma);
+	validate_mm_mt(mm);
 	return ERR_PTR(ret);
 }
 
diff --git a/mm/nommu.c b/mm/nommu.c
index 9d7afc2d959e..5af0b050eba8 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -545,6 +545,19 @@  static void put_nommu_region(struct vm_region *region)
 	__put_nommu_region(region);
 }
 
+void vma_mas_store(struct vm_area_struct *vma, struct ma_state *mas)
+{
+	mas_set_range(mas, vma->vm_start, vma->vm_end - 1);
+	mas_store_prealloc(mas, vma);
+}
+
+void vma_mas_remove(struct vm_area_struct *vma, struct ma_state *mas)
+{
+	mas->index = vma->vm_start;
+	mas->last = vma->vm_end - 1;
+	mas_store_prealloc(mas, NULL);
+}
+
 /*
  * add a VMA into a process's mm_struct in the appropriate place in the list
  * and tree and add to the address space's page tree also if not an anonymous