mbox series

[v6,00/71] Introducing the Maple Tree

Message ID 20220215143728.3810954-1-Liam.Howlett@oracle.com (mailing list archive)
Headers show
Series Introducing the Maple Tree | expand

Message

Liam R. Howlett Feb. 15, 2022, 2:37 p.m. UTC
The maple tree is an RCU-safe range based B-tree designed to use modern
processor cache efficiently.  There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface.  The first user that is covered in this
patch set is the vm_area_struct, where three data structures are
replaced by the maple tree: the augmented rbtree, the vma cache, and the
linked list of VMAs in the mm_struct.  The long term goal is to reduce
or remove the mmap_sem contention.

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses.  The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.

This patch is based on v5.17-rc4

git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214

v6 changes:
- Added patch for xarray testcode which should be dropped for upstream
  fix. - Thanks Matthew Wilcox
- Fixed issue with maple state index/last setting when the tree is just
  a pointer - Thanks David Howells
- Changed internal RCU handling to not check flags more than once
- Fixed mas_prev() underflow issue - Thanks David Howells
- Fixed returns on mas_prev()/mas_next() when there is only a value at
  0 index - Thanks David Howells
- Fixed mas_find_rev() running past minimum value - Thanks David Howells
- Fixed testing code function rename - Thanks Mark Hemment
- Reworked brk() and vm_brk_flags() as suggested - Thanks Vlastimil
  Babka
- Documentation fixes - Thanks Mike Rapoport
- Separated test code from tree code - Thanks Vlastimil Babka
- Moved maple_tree_init() call into the maple tree patch for other users
  - Thanks David Howells
- Fixed copyright date - Thanks Mike Rapoport
- Whitespace fixes in comments, reduced changes in other locations.
- Added missing kdocs - Thanks Mike Rapoport
- Fixed exit_mmap comment - Thanks Mark Hemment
- Removed RCU tracking as there is an issue with atomic increments of
  mmget() - Thanks Mark Hemment for initial issue report that allowed me
  to discover this.  RCU was not being used, so it is disabled for VMA
  tracking for now.
- Removed unnecessary assignment in mtree_range_walk() - Thanks JaeJoon
  Jung
- Dropped ma_xa_benchmark from testing Makefile - Thanks JaeJoon Jung
- Fixed leaf gap calculation in rare case of underflow
- Fixed mmap_region() bug on merging of prev
- Fixed mmap_region() bug on khugepaged_enter_vma_merge()
- Added test cases for mas_find_rev(), mas_prev(), mas_next, and
  mas_root_expand() to test suite.
- Updated config options and ifdefs to allow other maple tree users to
  debug maple tree without debugging the VM maple tree.

v5: https://lore.kernel.org/linux-mm/20220203172051.i2jnhnkudzssdsxg@revolver/T/
v4: https://lore.kernel.org/linux-mm/20211201142918.921493-30-Liam.Howlett@oracle.com/t/
v3: https://lore.kernel.org/linux-mm/20211005012959.1110504-1-Liam.Howlett@oracle.com/
v2: https://lore.kernel.org/linux-mm/20210817154651.1570984-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20210428153542.2814175-1-Liam.Howlett@Oracle.com/


Performance on a 144 core x86:

It is important to note that the code is still using the mmap_sem, the
performance seems fairly similar on real-world workloads, while there
are variations in micro-benchmarks.

kernbench, increased system time, less user time:
Amean     user-2        885.34 (   0.00%)      886.07 *  -0.08%*
Amean     syst-2        161.95 (   0.00%)      168.19 *  -3.85%*
Amean     elsp-2        530.06 (   0.00%)      532.96 *  -0.55%*
Amean     user-4        908.58 (   0.00%)      908.88 *  -0.03%*
Amean     syst-4        167.56 (   0.00%)      173.72 *  -3.68%*
Amean     elsp-4        277.21 (   0.00%)      277.38 *  -0.06%*
Amean     user-8        961.84 (   0.00%)      962.45 *  -0.06%*
Amean     syst-8        176.40 (   0.00%)      183.43 *  -3.99%*
Amean     elsp-8        150.59 (   0.00%)      151.21 *  -0.41%*
Amean     user-16      1040.15 (   0.00%)     1039.89 *   0.02%*
Amean     syst-16       188.19 (   0.00%)      193.81 *  -2.99%*
Amean     elsp-16        86.85 (   0.00%)       86.32 *   0.61%*
Amean     user-32      1240.46 (   0.00%)     1233.93 *   0.53%*
Amean     syst-32       217.15 (   0.00%)      222.99 *  -2.69%*
Amean     elsp-32        55.16 (   0.00%)       55.09 *   0.12%*
Amean     user-64      1241.17 (   0.00%)     1234.26 *   0.56%*
Amean     syst-64       215.11 (   0.00%)      220.76 *  -2.63%*
Amean     elsp-64        32.88 (   0.00%)       33.72 *  -2.57%*
Amean     user-128     1613.09 (   0.00%)     1609.63 *   0.21%*
Amean     syst-128      267.10 (   0.00%)      276.72 *  -3.60%*
Amean     elsp-128       25.80 (   0.00%)       26.09 *  -1.10%*

Mixed Hmean results:
- freqmine-medium -3.50% to +12.82%
- malloc1-processes: -14.50% to +6.53%
- signal1-processes -1.87% to +14.02%
- page_fault3-threads -6.46% to +26.15%
- pthread_mutex1-threads -16.81% to +28.63%

Decrease in performance in the following micro-benchmarks in Hmean:
- brk1-processes -37.42% to -44.16%
- malloc1-threads -18.27% to -23.08%

Modifications are more expensive so the micro-benchmarks that write but
do not use the data will be negatively affected.

Patch organization:
Patch 1 is to add a missing lock to avoid an assert issue when using a vma iterator.

Patch 2 is an xarray fix due to bitmap header changes which will be
dropped for a pending upstream fix.

Patches 3 to 7 are radix tree test suite additions for maple tree
support.

Patch 8 adds the maple tree.
Patch 9 adds the maple tree test code.

Patches 10-19 are the removal of the rbtree from the mm_struct.  This
now includes the introduction of the vma iterator.

Patch 20 optimizes __vma_adjust() for the maple tree.

Patches 21-27 are the removal of the vmacache from the kernel.

Patches 28-31 are internal mm changes for efficiencies.

Patches 32-69 are the removal of the linked list

Patches 70 and 71 are a small cleanup from the removal of the vma linked list.

Liam R. Howlett (61):
  xarray: Fix bitmap breakage
  radix tree test suite: Add pr_err define
  radix tree test suite: Add kmem_cache_set_non_kernel()
  radix tree test suite: Add allocation counts and size to kmem_cache
  radix tree test suite: Add support for slab bulk APIs
  radix tree test suite: Add lockdep_is_held to header
  Maple Tree: Add new data structure
  lib/test_maple_tree: Add testing for maple tree
  mm: Start tracking VMAs with maple tree
  mm/mmap: Use the maple tree in find_vma() instead of the rbtree.
  mm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree
  mm/mmap: Use maple tree for unmapped_area{_topdown}
  kernel/fork: Use maple tree for dup_mmap() during forking
  mm: Remove rb tree.
  mmap: Change zeroing of maple tree in __vma_adjust()
  xen: Use vma_lookup() in privcmd_ioctl_mmap()
  mm: Optimize find_exact_vma() to use vma_lookup()
  mm/khugepaged: Optimize collapse_pte_mapped_thp() by using
    vma_lookup()
  mm/mmap: Change do_brk_flags() to expand existing VMA and add
    do_brk_munmap()
  mm: Use maple tree operations for find_vma_intersection()
  mm/mmap: Use advanced maple tree API for mmap_region()
  mm: Remove vmacache
  mm/mmap: Move mmap_region() below do_munmap()
  mm/mmap: Reorganize munmap to use maple states
  mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
  arm64: Remove mmap linked list from vdso
  parisc: Remove mmap linked list from cache handling
  powerpc: Remove mmap linked list walks
  s390: Remove vma linked list walks
  x86: Remove vma linked list walks
  xtensa: Remove vma linked list walks
  cxl: Remove vma linked list walk
  optee: Remove vma linked list walk
  um: Remove vma linked list walk
  binfmt_elf: Remove vma linked list walk
  exec: Use VMA iterator instead of linked list
  fs/proc/base: Use maple tree iterators in place of linked list
  userfaultfd: Use maple tree iterator to iterate VMAs
  ipc/shm: Use VMA iterator instead of linked list
  acct: Use VMA iterator instead of linked list
  perf: Use VMA iterator
  sched: Use maple tree iterator to walk VMAs
  fork: Use VMA iterator
  bpf: Remove VMA linked list
  mm/gup: Use maple tree navigation instead of linked list
  mm/khugepaged: Stop using vma linked list
  mm/ksm: Use vma iterators instead of vma linked list
  mm/madvise: Use vma_find() instead of vma linked list
  mm/memcontrol: Stop using mm->highest_vm_end
  mm/mempolicy: Use vma iterator & maple state instead of vma linked
    list
  mm/mlock: Use vma iterator and instead of vma linked list
  mm/mprotect: Use maple tree navigation instead of vma linked list
  mm/mremap: Use vma_find_intersection() instead of vma linked list
  mm/msync: Use vma_find() instead of vma linked list
  mm/oom_kill: Use maple tree iterators instead of vma linked list
  mm/pagewalk: Use vma_find() instead of vma linked list
  mm/swapfile: Use vma iterator instead of vma linked list
  riscv: Use vma iterator for vdso
  mm: Remove the vma linked list
  mm/mmap: Drop range_has_overlap() function
  mm/mmap.c: Pass in mapping to __vma_link_file()

Matthew Wilcox (Oracle) (10):
  binfmt_elf: Take the mmap lock when walking the VMA list
  mm: Add VMA iterator
  mmap: Use the VMA iterator in count_vma_pages_range()
  damon: Convert __damon_va_three_regions to use the VMA iterator
  proc: Remove VMA rbtree use from nommu
  mm: Convert vma_lookup() to use mtree_load()
  coredump: Remove vma linked list walk
  fs/proc/task_mmu: Stop using linked list and highest_vm_end
  i915: Use the VMA iterator
  nommu: Remove uses of VMA linked list

 Documentation/core-api/index.rst              |     1 +
 Documentation/core-api/maple_tree.rst         |   218 +
 MAINTAINERS                                   |    12 +
 arch/arm64/kernel/vdso.c                      |     3 +-
 arch/parisc/kernel/cache.c                    |     9 +-
 arch/powerpc/kernel/vdso.c                    |     6 +-
 arch/powerpc/mm/book3s32/tlb.c                |    11 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |    13 +-
 arch/riscv/kernel/vdso.c                      |     3 +-
 arch/s390/kernel/vdso.c                       |     3 +-
 arch/s390/mm/gmap.c                           |     6 +-
 arch/um/kernel/tlb.c                          |    14 +-
 arch/x86/entry/vdso/vma.c                     |     9 +-
 arch/x86/kernel/tboot.c                       |     2 +-
 arch/xtensa/kernel/syscall.c                  |    18 +-
 drivers/firmware/efi/efi.c                    |     2 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |    14 +-
 drivers/misc/cxl/fault.c                      |    45 +-
 drivers/tee/optee/call.c                      |    18 +-
 drivers/xen/privcmd.c                         |     2 +-
 fs/binfmt_elf.c                               |     6 +-
 fs/coredump.c                                 |    33 +-
 fs/exec.c                                     |    12 +-
 fs/proc/base.c                                |     5 +-
 fs/proc/internal.h                            |     2 +-
 fs/proc/task_mmu.c                            |    74 +-
 fs/proc/task_nommu.c                          |    45 +-
 fs/userfaultfd.c                              |    49 +-
 include/linux/maple_tree.h                    |   683 +
 include/linux/mm.h                            |    77 +-
 include/linux/mm_types.h                      |    43 +-
 include/linux/mm_types_task.h                 |    12 -
 include/linux/sched.h                         |     1 -
 include/linux/userfaultfd_k.h                 |     7 +-
 include/linux/vm_event_item.h                 |     4 -
 include/linux/vmacache.h                      |    28 -
 include/linux/vmstat.h                        |     6 -
 include/linux/xarray.h                        |     1 +
 include/trace/events/maple_tree.h             |   123 +
 include/trace/events/mmap.h                   |    71 +
 init/main.c                                   |     2 +
 ipc/shm.c                                     |    21 +-
 kernel/acct.c                                 |    11 +-
 kernel/bpf/task_iter.c                        |    10 +-
 kernel/debug/debug_core.c                     |    12 -
 kernel/events/core.c                          |     3 +-
 kernel/events/uprobes.c                       |     9 +-
 kernel/fork.c                                 |    58 +-
 kernel/sched/fair.c                           |    10 +-
 lib/Kconfig.debug                             |    18 +-
 lib/Makefile                                  |     3 +-
 lib/maple_tree.c                              |  6967 +++
 lib/test_maple_tree.c                         | 37398 ++++++++++++++++
 mm/Makefile                                   |     2 +-
 mm/damon/vaddr-test.h                         |    37 +-
 mm/damon/vaddr.c                              |    53 +-
 mm/debug.c                                    |    14 +-
 mm/gup.c                                      |     9 +-
 mm/huge_memory.c                              |     4 +-
 mm/init-mm.c                                  |     4 +-
 mm/internal.h                                 |    78 +-
 mm/khugepaged.c                               |    13 +-
 mm/ksm.c                                      |    18 +-
 mm/madvise.c                                  |     2 +-
 mm/memcontrol.c                               |     6 +-
 mm/memory.c                                   |    33 +-
 mm/mempolicy.c                                |    58 +-
 mm/mlock.c                                    |    34 +-
 mm/mmap.c                                     |  2086 +-
 mm/mprotect.c                                 |     7 +-
 mm/mremap.c                                   |    22 +-
 mm/msync.c                                    |     2 +-
 mm/nommu.c                                    |   127 +-
 mm/oom_kill.c                                 |     3 +-
 mm/pagewalk.c                                 |     2 +-
 mm/swapfile.c                                 |     4 +-
 mm/util.c                                     |    32 -
 mm/vmacache.c                                 |   117 -
 mm/vmstat.c                                   |     4 -
 tools/testing/radix-tree/.gitignore           |     2 +
 tools/testing/radix-tree/Makefile             |     9 +-
 tools/testing/radix-tree/generated/autoconf.h |     1 +
 tools/testing/radix-tree/linux.c              |   160 +-
 tools/testing/radix-tree/linux/kernel.h       |     1 +
 tools/testing/radix-tree/linux/lockdep.h      |     2 +
 tools/testing/radix-tree/linux/maple_tree.h   |     7 +
 tools/testing/radix-tree/linux/slab.h         |     4 +
 tools/testing/radix-tree/maple.c              |    59 +
 .../radix-tree/trace/events/maple_tree.h      |     3 +
 89 files changed, 47390 insertions(+), 1842 deletions(-)
 create mode 100644 Documentation/core-api/maple_tree.rst
 create mode 100644 include/linux/maple_tree.h
 delete mode 100644 include/linux/vmacache.h
 create mode 100644 include/trace/events/maple_tree.h
 create mode 100644 lib/maple_tree.c
 create mode 100644 lib/test_maple_tree.c
 delete mode 100644 mm/vmacache.c
 create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
 create mode 100644 tools/testing/radix-tree/maple.c
 create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h

Comments

Andrew Morton Feb. 16, 2022, 7:47 p.m. UTC | #1
On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:

> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently.  There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface.  The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct.  The long term goal is to reduce
> or remove the mmap_sem contention.

Has a path been chosen which gets us from here to significant reduction
in mmap_lock overhead?

If so, what's the plan and what must be done?

Thanks.
Matthew Wilcox Feb. 16, 2022, 8:24 p.m. UTC | #2
On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> 
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently.  There are a number of places in the kernel
> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface.  The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > or remove the mmap_sem contention.
> 
> Has a path been chosen which gets us from here to significant reduction
> in mmap_lock overhead?
> 
> If so, what's the plan and what must be done?

I would say there are still competing ideas for how that is to be done.

First, the Maple Tree is independent (to a certain extent) of all the
approaches to reducing mmap_lock overhead.  It's a better data structure
for VMA lookup than the rbtree.  Set against that, it has higher overhead
for modifications.  That means that benchmarks which primarily measure
modification overhead see worse performance.  We believe this is not
representative of real workloads, and indeed we see ~parity on workloads
like git and make -jN.

The advantage to this patch set is that we get rid of the vmacache
entirely, we get rid of the doubly-linked list chaining VMAs together
and we get rid of the rbtree usage.  And we add a new generic data
structure that hopefully I can point people towards instead of using
the XArray inappropriately.

Both of the approaches to avoiding taking the mmap_lock in the fault path
will benefit from using the maple tree for VMAs.  Fewer cache misses and
RCU safety are things we all want.  The maple tree will also benefit
if/when either approach is merged; because modifications to the maple
tree are more heavyweight than the rbtree, they block readers for longer.


The SPF approach is further along.  It has been tested in the past with
the maple tree, but not in the most recent iteration of either.  It
doesn't rely on the maple tree, but it does benefit.

The RCU-freeing-of-VMAs approach that I favour has not received much
attention.  I've been busy with other things ;-)  It offers much more
opportunity for performance improvement, but will take longer to arrive.
Mel Gorman Feb. 23, 2022, 4:35 p.m. UTC | #3
On Wed, Feb 16, 2022 at 08:24:34PM +0000, Matthew Wilcox wrote:
> On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> > On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > 
> > > The maple tree is an RCU-safe range based B-tree designed to use modern
> > > processor cache efficiently.  There are a number of places in the kernel
> > > that a non-overlapping range-based tree would be beneficial, especially
> > > one with a simple interface.  The first user that is covered in this
> > > patch set is the vm_area_struct, where three data structures are
> > > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > > or remove the mmap_sem contention.
> > 
> > Has a path been chosen which gets us from here to significant reduction
> > in mmap_lock overhead?
> > 
> > If so, what's the plan and what must be done?
> 
> I would say there are still competing ideas for how that is to be done.
> 
> First, the Maple Tree is independent (to a certain extent) of all the
> approaches to reducing mmap_lock overhead.  It's a better data structure
> for VMA lookup than the rbtree.  Set against that, it has higher overhead
> for modifications.  That means that benchmarks which primarily measure
> modification overhead see worse performance.  We believe this is not
> representative of real workloads, and indeed we see ~parity on workloads
> like git and make -jN.
> 

I'm way behind and only got around to looking at SPF properly today. Maple
is working its way up my list and I need to gather new data but I did
have old data queued from a time when I thought I would get around to
maple tree soon.  The big result that stood was was brk performance from
will-it-scale but for processes specifically

wis-malloc
                                                5.17.0-rc3                 5.17.0-rc3
                                                   vanilla           mm-maplevma-v5r1
Hmean     brk1-processes-2           5265361.84 (   0.00%)      3489097.63 * -33.73%*
Hmean     brk1-processes-5          12270382.36 (   0.00%)      8136542.13 * -33.69%*
Hmean     brk1-processes-8          19560585.95 (   0.00%)     12941094.22 * -33.84%*
Hmean     brk1-processes-12         27328267.18 (   0.00%)     18148145.64 * -33.59%*
Hmean     brk1-processes-21         39599378.04 (   0.00%)     26188063.96 * -33.87%*
Hmean     brk1-processes-30         59880571.20 (   0.00%)     39653658.00 * -33.78%*
Hmean     brk1-processes-48         72900335.59 (   0.00%)     49044880.84 * -32.72%*
Hmean     brk1-processes-79         72388042.42 (   0.00%)     48416196.93 * -33.12%*
Hmean     brk1-processes-110        72333206.18 (   0.00%)     48437222.12 * -33.04%*
Hmean     brk1-processes-141        72443841.68 (   0.00%)     48490158.48 * -33.07%*
Hmean     brk1-processes-172        72187451.54 (   0.00%)     48480696.77 * -32.84%*
Hmean     brk1-processes-203        72575945.95 (   0.00%)     48503896.03 * -33.17%*
Hmean     brk1-processes-234        67509812.97 (   0.00%)     48466347.95 * -28.21%*
Hmean     brk1-processes-265        72379447.48 (   0.00%)     48449330.06 * -33.06%*
Hmean     brk1-processes-296        72241861.59 (   0.00%)     48303854.85 * -33.14%*
Hmean     brk1-processes-320        72183320.07 (   0.00%)     48253708.21 * -33.15%*
Hmean     brk1-threads-2             1352297.67 (   0.00%)      2963406.42 * 119.14%*
Hmean     brk1-threads-5              814358.55 (   0.00%)      1233604.84 *  51.48%*
Hmean     brk1-threads-8              849900.52 (   0.00%)      1241170.76 *  46.04%*
Hmean     brk1-threads-12             804143.24 (   0.00%)      1232451.60 *  53.26%*
Hmean     brk1-threads-21             672680.28 (   0.00%)      1001257.38 *  48.85%*
Hmean     brk1-threads-30             443221.73 (   0.00%)       713592.66 *  61.00%*
Hmean     brk1-threads-48             397283.95 (   0.00%)       591349.83 *  48.85%*
Hmean     brk1-threads-79             167479.04 (   0.00%)       548681.03 * 227.61%*
Hmean     brk1-threads-110            154108.98 (   0.00%)       207349.67 *  34.55%*
Hmean     brk1-threads-141            166712.91 (   0.00%)       217314.50 *  30.35%*
Hmean     brk1-threads-172            155023.61 (   0.00%)       210318.59 *  35.67%*
Hmean     brk1-threads-203            166943.60 (   0.00%)       227892.26 *  36.51%*

The mmtests configuration this was based on no longer exists but the
equivalent should be

configs/config-workload-will-it-scale-pf-processes
configs/config-workload-will-it-scale-pf-threads

It shouldn't be too hard to pull out the brk1 test on its own and see
what it's doing in profiles.

brk_test from aim9 also suffers so may be brk-specific.

Other workloads looked ok, kernel building took a small -1.5%
performance hit on some machines but not enough that I would scream.
Others looked mostly neutral but it was only a provisional sniff test
before I even considered looking at 70 patches :O
Matthew Wilcox Feb. 23, 2022, 4:45 p.m. UTC | #4
On Wed, Feb 23, 2022 at 04:35:09PM +0000, Mel Gorman wrote:
> On Wed, Feb 16, 2022 at 08:24:34PM +0000, Matthew Wilcox wrote:
> > On Wed, Feb 16, 2022 at 11:47:00AM -0800, Andrew Morton wrote:
> > > On Tue, 15 Feb 2022 14:37:44 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > > 
> > > > The maple tree is an RCU-safe range based B-tree designed to use modern
> > > > processor cache efficiently.  There are a number of places in the kernel
> > > > that a non-overlapping range-based tree would be beneficial, especially
> > > > one with a simple interface.  The first user that is covered in this
> > > > patch set is the vm_area_struct, where three data structures are
> > > > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > > > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > > > or remove the mmap_sem contention.
> > > 
> > > Has a path been chosen which gets us from here to significant reduction
> > > in mmap_lock overhead?
> > > 
> > > If so, what's the plan and what must be done?
> > 
> > I would say there are still competing ideas for how that is to be done.
> > 
> > First, the Maple Tree is independent (to a certain extent) of all the
> > approaches to reducing mmap_lock overhead.  It's a better data structure
> > for VMA lookup than the rbtree.  Set against that, it has higher overhead
> > for modifications.  That means that benchmarks which primarily measure
> > modification overhead see worse performance.  We believe this is not
> > representative of real workloads, and indeed we see ~parity on workloads
> > like git and make -jN.
> 
> I'm way behind and only got around to looking at SPF properly today. Maple
> is working its way up my list and I need to gather new data but I did
> have old data queued from a time when I thought I would get around to
> maple tree soon.  The big result that stood was was brk performance from
> will-it-scale but for processes specifically

Yup, we know about the brk1 regression.  It's a really unrealistic
benchmark, which is why we added brk2.  To quote the commit message:

    Linux has this horrendously complicated anon_vma structure that you don't
    care about, but the upshot is that after calling fork(), each process
    that calls brk() gets a _new_ VMA created.  That is, after calling brk()
    the first time, the process address space looks like this:

    557777fab000-557777ff0000 rw-p 00000000 00:00 0                          [heap]
    557777ff0000-557777ff1000 rw-p 00000000 00:00 0                          [heap]

    so what brk1 is actually testing is how long it takes to create & destroy
    a new VMA.  This does not match what most programs do -- most will call
    exec() which resets the anon_vma structures and starts each program off
    with its own heap.  And if you do have a multi-process program which
    uses brk(), chances are it doesn't just oscillate betwee zero and one
    extra pages of heap compared to its parent.

    A better test starts out by allocating one page on the heap and then
    throbs between one and two pages instead of throbbing between zero and
    one page.  That means we're actually testing expanding and contracting
    the heap instead of creating and destroying a new heap.

    For realism, I wanted to add actually accessing the memory in the new
    heap, but that doesn't work for the threaded case -- another thread
    might remove the memory you just allocated while you're allocating it.
    Threaded programs give each thread its own heap anyway, so this is
    kind of a pointless syscall to ask about its threaded scalability.
    
    Anyway, here's brk2.c.  It is not very different from brk1.c, but the
    performance results are quite different (actually worse by about 10-15%).
Qian Cai Feb. 25, 2022, 3:49 a.m. UTC | #5
On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently.  There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface.  The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct.  The long term goal is to reduce
> or remove the mmap_sem contention.
> 
> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes.  With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses.  The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.
> 
> This patch is based on v5.17-rc4
> 
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214

Just a heads-up. I noticed an use-after-free in today's linux-next below. I
am running out of time to fully triage this, but I noticed this is the only
series (not sure which revision) heavily touched mm/mmap.c recently there.

 BUG: KASAN: use-after-free in move_vma.isra.0
 Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280

 CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242
 Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
 Call trace:
  dump_backtrace
  show_stack
  dump_stack_lvl
  print_address_description.constprop.0
  kasan_report
  __asan_report_load8_noabort
  move_vma.isra.0
  move_vma at /usr/src/linux-next/mm/mremap.c:714
  __do_sys_mremap
  __arm64_sys_mremap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 Allocated by task 1280:
  kasan_save_stack
  __kasan_slab_alloc
  slab_post_alloc_hook
  kmem_cache_alloc
  vm_area_alloc
  vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455
  mmap_region
  mmap_region at /usr/src/linux-next/mm/mmap.c:2585
  do_mmap
  vm_mmap_pgoff
  ksys_mmap_pgoff
  __arm64_sys_mmap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 Freed by task 1280:
  kasan_save_stack
  kasan_set_track
  kasan_set_free_info
  __kasan_slab_free
  kmem_cache_free
  vm_area_free
  remove_vma
  remove_vma at /usr/src/linux-next/mm/mmap.c:187
  do_mas_align_munmap.isra.0
  remove_mt at /usr/src/linux-next/mm/mmap.c:2176
  (inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437
  do_mas_munmap
  do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483
  do_munmap
  move_vma.isra.0
  __do_sys_mremap
  __arm64_sys_mremap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 The buggy address belongs to the object at ffff0009ce752aa8
  which belongs to the cache vm_area_struct of size 144
 The buggy address is located 32 bytes inside of
  144-byte region [ffff0009ce752aa8, ffff0009ce752b38)
 The buggy address belongs to the page:
 page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750
 head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0
 memcg:ffff0009ce456e01
 flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
 raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380
 raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb
                                               ^
  ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
  ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
Liam R. Howlett Feb. 25, 2022, 7:08 p.m. UTC | #6
* Qian Cai <quic_qiancai@quicinc.com> [220224 22:49]:
> On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently.  There are a number of places in the kernel
> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface.  The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > or remove the mmap_sem contention.
> > 
> > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > nodes.  With the increased branching factor, it is significantly shorter than
> > the rbtree so it has fewer cache misses.  The removal of the linked list
> > between subsequent entries also reduces the cache misses and the need to pull
> > in the previous and next VMA during many tree alterations.
> > 
> > This patch is based on v5.17-rc4
> > 
> > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
> 
> Just a heads-up. I noticed an use-after-free in today's linux-next below. I
> am running out of time to fully triage this, but I noticed this is the only
> series (not sure which revision) heavily touched mm/mmap.c recently there.
> 
>  BUG: KASAN: use-after-free in move_vma.isra.0
>  Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280
> 
>  CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242
>  Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
>  Call trace:
>   dump_backtrace
>   show_stack
>   dump_stack_lvl
>   print_address_description.constprop.0
>   kasan_report
>   __asan_report_load8_noabort
>   move_vma.isra.0
>   move_vma at /usr/src/linux-next/mm/mremap.c:714
>   __do_sys_mremap
>   __arm64_sys_mremap
>   invoke_syscall
>   el0_svc_common.constprop.0
>   do_el0_svc
>   el0_svc
>   el0t_64_sync_handler
>   el0t_64_sync
> 
>  Allocated by task 1280:
>   kasan_save_stack
>   __kasan_slab_alloc
>   slab_post_alloc_hook
>   kmem_cache_alloc
>   vm_area_alloc
>   vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455
>   mmap_region
>   mmap_region at /usr/src/linux-next/mm/mmap.c:2585
>   do_mmap
>   vm_mmap_pgoff
>   ksys_mmap_pgoff
>   __arm64_sys_mmap
>   invoke_syscall
>   el0_svc_common.constprop.0
>   do_el0_svc
>   el0_svc
>   el0t_64_sync_handler
>   el0t_64_sync
> 
>  Freed by task 1280:
>   kasan_save_stack
>   kasan_set_track
>   kasan_set_free_info
>   __kasan_slab_free
>   kmem_cache_free
>   vm_area_free
>   remove_vma
>   remove_vma at /usr/src/linux-next/mm/mmap.c:187
>   do_mas_align_munmap.isra.0
>   remove_mt at /usr/src/linux-next/mm/mmap.c:2176
>   (inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437
>   do_mas_munmap
>   do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483
>   do_munmap
>   move_vma.isra.0
>   __do_sys_mremap
>   __arm64_sys_mremap
>   invoke_syscall
>   el0_svc_common.constprop.0
>   do_el0_svc
>   el0_svc
>   el0t_64_sync_handler
>   el0t_64_sync
> 
>  The buggy address belongs to the object at ffff0009ce752aa8
>   which belongs to the cache vm_area_struct of size 144
>  The buggy address is located 32 bytes inside of
>   144-byte region [ffff0009ce752aa8, ffff0009ce752b38)
>  The buggy address belongs to the page:
>  page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750
>  head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0
>  memcg:ffff0009ce456e01
>  flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
>  raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380
>  raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01
>  page dumped because: kasan: bad access detected
> 
>  Memory state around the buggy address:
>   ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>   ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>  >ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb
>                                                ^
>   ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
>   ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

Thank you for the report.  I will try to reproduce it with a kvm here.

Cheers,
Liam
Liam R. Howlett Feb. 25, 2022, 8:23 p.m. UTC | #7
* Liam Howlett <liam.howlett@oracle.com> [220225 14:09]:
> * Qian Cai <quic_qiancai@quicinc.com> [220224 22:49]:
> > On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> > > The maple tree is an RCU-safe range based B-tree designed to use modern
> > > processor cache efficiently.  There are a number of places in the kernel
> > > that a non-overlapping range-based tree would be beneficial, especially
> > > one with a simple interface.  The first user that is covered in this
> > > patch set is the vm_area_struct, where three data structures are
> > > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > > or remove the mmap_sem contention.
> > > 
> > > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > > nodes.  With the increased branching factor, it is significantly shorter than
> > > the rbtree so it has fewer cache misses.  The removal of the linked list
> > > between subsequent entries also reduces the cache misses and the need to pull
> > > in the previous and next VMA during many tree alterations.
> > > 
> > > This patch is based on v5.17-rc4
> > > 
> > > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
> > 
> > Just a heads-up. I noticed an use-after-free in today's linux-next below. I
> > am running out of time to fully triage this, but I noticed this is the only
> > series (not sure which revision) heavily touched mm/mmap.c recently there.
> > 
> >  BUG: KASAN: use-after-free in move_vma.isra.0
> >  Read of size 8 at addr ffff0009ce752ac8 by task systemd-logind/1280
> > 
> >  CPU: 21 PID: 1280 Comm: systemd-logind Not tainted 5.17.0-rc5-next-20220223-dirty #242
> >  Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
> >  Call trace:
> >   dump_backtrace
> >   show_stack
> >   dump_stack_lvl
> >   print_address_description.constprop.0
> >   kasan_report
> >   __asan_report_load8_noabort
> >   move_vma.isra.0
> >   move_vma at /usr/src/linux-next/mm/mremap.c:714
> >   __do_sys_mremap
> >   __arm64_sys_mremap
> >   invoke_syscall
> >   el0_svc_common.constprop.0
> >   do_el0_svc
> >   el0_svc
> >   el0t_64_sync_handler
> >   el0t_64_sync
> > 
> >  Allocated by task 1280:
> >   kasan_save_stack
> >   __kasan_slab_alloc
> >   slab_post_alloc_hook
> >   kmem_cache_alloc
> >   vm_area_alloc
> >   vm_area_alloc at /usr/src/linux-next/kernel/fork.c:455
> >   mmap_region
> >   mmap_region at /usr/src/linux-next/mm/mmap.c:2585
> >   do_mmap
> >   vm_mmap_pgoff
> >   ksys_mmap_pgoff
> >   __arm64_sys_mmap
> >   invoke_syscall
> >   el0_svc_common.constprop.0
> >   do_el0_svc
> >   el0_svc
> >   el0t_64_sync_handler
> >   el0t_64_sync
> > 
> >  Freed by task 1280:
> >   kasan_save_stack
> >   kasan_set_track
> >   kasan_set_free_info
> >   __kasan_slab_free
> >   kmem_cache_free
> >   vm_area_free
> >   remove_vma
> >   remove_vma at /usr/src/linux-next/mm/mmap.c:187
> >   do_mas_align_munmap.isra.0
> >   remove_mt at /usr/src/linux-next/mm/mmap.c:2176
> >   (inlined by) do_mas_align_munmap at /usr/src/linux-next/mm/mmap.c:2437
> >   do_mas_munmap
> >   do_mas_munmap at /usr/src/linux-next/mm/mmap.c:2483
> >   do_munmap
> >   move_vma.isra.0
> >   __do_sys_mremap
> >   __arm64_sys_mremap
> >   invoke_syscall
> >   el0_svc_common.constprop.0
> >   do_el0_svc
> >   el0_svc
> >   el0t_64_sync_handler
> >   el0t_64_sync
> > 
> >  The buggy address belongs to the object at ffff0009ce752aa8
> >   which belongs to the cache vm_area_struct of size 144
> >  The buggy address is located 32 bytes inside of
> >   144-byte region [ffff0009ce752aa8, ffff0009ce752b38)
> >  The buggy address belongs to the page:
> >  page:fffffc002739d400 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xa4e750
> >  head:fffffc002739d400 order:2 compound_mapcount:0 compound_pincount:0
> >  memcg:ffff0009ce456e01
> >  flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
> >  raw: 0bfffc0000010200 fffffc002739f108 fffffc00273a3b08 ffff000800f53380
> >  raw: 0000000000000000 0000000000210021 00000001ffffffff ffff0009ce456e01
> >  page dumped because: kasan: bad access detected
> > 
> >  Memory state around the buggy address:
> >   ffff0009ce752980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> >   ffff0009ce752a00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> >  >ffff0009ce752a80: fc fc fc fc fc fa fb fb fb fb fb fb fb fb fb fb
> >                                                ^
> >   ffff0009ce752b00: fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc
> >   ffff0009ce752b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
> 
> Thank you for the report.  I will try to reproduce it with a kvm here.
> 

I just booted an arm64 VM with my build and kasan enabled with no issue.
Could you please send me your config file for the build?

Thanks,
Liam
Qian Cai Feb. 25, 2022, 8:46 p.m. UTC | #8
On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> I just booted an arm64 VM with my build and kasan enabled with no issue.
> Could you please send me your config file for the build?

On linux-next, I just do:

$ make arch=arm64 defconfig debug.config [1]

Then, I just generate some memory pressume into swapping/OOM Killer to
trigger it.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config
Nathan Chancellor Feb. 25, 2022, 11 p.m. UTC | #9
On Fri, Feb 25, 2022 at 03:46:52PM -0500, Qian Cai wrote:
> On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> > I just booted an arm64 VM with my build and kasan enabled with no issue.
> > Could you please send me your config file for the build?
> 
> On linux-next, I just do:
> 
> $ make arch=arm64 defconfig debug.config [1]
> 
> Then, I just generate some memory pressume into swapping/OOM Killer to
> trigger it.
> 
> [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config

Is the stacktrace [1] related to the conflict that Mark encountered [2]
while merging the maple and folio trees? Booting a next-20220223 kernel
on my Raspberry Pi 3 and 4 shows constant NULL pointer dereferences
(just ARCH=arm and ARCH=arm64 defconfigs) and reverting the folio and
maple tree merges makes everything work properly again.

[1]: https://lore.kernel.org/r/YhhRrBpXTFolUAKi@qian/
[2]: https://lore.kernel.org/r/20220224011653.1380557-1-broonie@kernel.org/

Cheers,
Nathan
Liam R. Howlett Feb. 26, 2022, 1:58 a.m. UTC | #10
* Nathan Chancellor <nathan@kernel.org> [220225 18:00]:
> On Fri, Feb 25, 2022 at 03:46:52PM -0500, Qian Cai wrote:
> > On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> > > I just booted an arm64 VM with my build and kasan enabled with no issue.
> > > Could you please send me your config file for the build?
> > 
> > On linux-next, I just do:
> > 
> > $ make arch=arm64 defconfig debug.config [1]
> > 
> > Then, I just generate some memory pressume into swapping/OOM Killer to
> > trigger it.
> > 
> > [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config
> 
> Is the stacktrace [1] related to the conflict that Mark encountered [2]
> while merging the maple and folio trees? Booting a next-20220223 kernel
> on my Raspberry Pi 3 and 4 shows constant NULL pointer dereferences
> (just ARCH=arm and ARCH=arm64 defconfigs) and reverting the folio and
> maple tree merges makes everything work properly again.
> 
> [1]: https://lore.kernel.org/r/YhhRrBpXTFolUAKi@qian/
> [2]: https://lore.kernel.org/r/20220224011653.1380557-1-broonie@kernel.org/

Maybe?  I'm trying to figure out why it's having issues.. I've not been
able to reproduce it with just my maple tree branch.  Steven Rostedt
found a bad commit that has been fixed in either 20220224, I believe
[1].  It might be best to try next-20220225 and see if you have better
luck if that's an option.

[1]:
https://lore.kernel.org/linux-fsdevel/f6fb6fd4-dcf2-4326-d25e-9a4a9dad5020@fb.com/T/#t


Thanks,
Liam
Hugh Dickins Feb. 26, 2022, 11:19 p.m. UTC | #11
On Sat, 26 Feb 2022, Liam Howlett wrote:
> * Nathan Chancellor <nathan@kernel.org> [220225 18:00]:
> > On Fri, Feb 25, 2022 at 03:46:52PM -0500, Qian Cai wrote:
> > > On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> > > > I just booted an arm64 VM with my build and kasan enabled with no issue.
> > > > Could you please send me your config file for the build?
> > > 
> > > On linux-next, I just do:
> > > 
> > > $ make arch=arm64 defconfig debug.config [1]
> > > 
> > > Then, I just generate some memory pressume into swapping/OOM Killer to
> > > trigger it.
> > > 
> > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config
> > 
> > Is the stacktrace [1] related to the conflict that Mark encountered [2]
> > while merging the maple and folio trees? Booting a next-20220223 kernel
> > on my Raspberry Pi 3 and 4 shows constant NULL pointer dereferences
> > (just ARCH=arm and ARCH=arm64 defconfigs) and reverting the folio and
> > maple tree merges makes everything work properly again.
> > 
> > [1]: https://lore.kernel.org/r/YhhRrBpXTFolUAKi@qian/
> > [2]: https://lore.kernel.org/r/20220224011653.1380557-1-broonie@kernel.org/
> 
> Maybe?  I'm trying to figure out why it's having issues.. I've not been
> able to reproduce it with just my maple tree branch.  Steven Rostedt
> found a bad commit that has been fixed in either 20220224, I believe
> [1].  It might be best to try next-20220225 and see if you have better
> luck if that's an option.
> 
> [1]:
> https://lore.kernel.org/linux-fsdevel/f6fb6fd4-dcf2-4326-d25e-9a4a9dad5020@fb.com/T/#t

Hi Liam, I think I have the beginnings of an answer for Qian's issue,
details below; but first off, I'd better make my own opinion clear.

I think this series, however good it may be so far, is undertested and
not approaching ready for 5.18, and should not have been pushed via
git tree into 5.17-rc-late-stage linux-next, subverting Andrew's mmotm.

I believe Stephen and Mark should drop it from linux-next for now,
while linux-next stabilizes for 5.18, while you work on testing and
fixing, and resubmit rebased to 5.18-rc1 or -rc2 when you're ready.

Qian's issue, BUG: KASAN: use-after-free in move_vma.isra.0,
with stacktrace implicating mremap.

I don't have KASAN on, but my config on this laptop does have
CONFIG_SLAB=y CONFIG_DEBUG_SLAB=y, and reported "Slab corruption"
during bootup of mmotm 2022-02-24-22-38 (plus Steven Rostedt's fix
to vfs_statx() which you mention in [1], missed from mmotm - not
related to your series of course, but essential for booting here).
Previous mmotm 2022-02-23-21-20 (plus vfs_statx() fix) was okay,
but did not contain the Maple Tree series.

The "vm_area" "Slab corruption" "Single bit error detected" 6b->7b
report is enough to indicate that the VM_ACCOUNT bit gets set in a
freed vma's vm_flags.  And looking through "|= VM_ACCOUNT"s, my
suspicion fell on mm/mremap.c's move_vma().  Which I now see is
implicated in Qian's report too.

mremap move's VM_ACCOUNT accounting is difficult: not losing what's
already charged in case the move fails, accounting the extra without
overcharging, in the face of vmas being split and merged: difficult.

And there's an assumption documented in (now) do_mas_align_munmap()
"Note: mremap's move_vma VM_ACCOUNT handling assumes a partially
unmapped vm_area_struct will remain in use".  My suspicion (not
verified) is that the maple tree changes are now violating that;
and I doubt that fixing it will be easy (I'm not going to try) -
doable, sure, but needs careful thought.  (The way move_vma() masks
out VM_ACCOUNT in a live vma, then adds it back later: implications
for vma merging; all under mmap lock of course, but still awkward.)

Though I did partially verify it, by commenting out the VM_ACCOUNT
adjustments in move_vma(), and booting several times without any
"Slab corruption".  And did also kind-of verify it by booting with
#define VM_ACCOUNT 0: I was interested to see if that bit corruption
could account for all the other bugs I get, but sadly no.

Initially I was building without CONFIG_DEBUG_VM_MAPLE_TREE=y and
CONFIG_DEBUG_MAPLE_TREE=y, but have now switched them on.  Hit bugs
without them and with them, but now they're on, validate_mm() often
catches something (whether it is correct to complain, I haven't
investigated, but I assume so since the debug option is there,
and problems are seen without it).

I say "often": it's very erratic.  Once, a machine booted with mem=1G
ran kernel builds successfully swapping for 4.5 hours before faulting
in __nr_to_section in virt_to_folio ... while doing a __vma_adjust() -
you'll ask me for a full stacktrace, and I'll answer sorry, too many,
please try for yourself.  Another time, for 1.5 hours before hitting
the BUG_ON(is_migration_entry(entry) && !PageLocked(p)) in
pfn_swap_entry_to_page() - suggesting anon_vma locking had not been
right while doing page migration (I was exercising THPs a lot).
But now, can I even get it to complete the boot sequence?

(I happened to be sampling /proc/meminfo during those successful
runs: looking afterwards at those samples, I see Committed_AS growing
steadily; whereas samples saved from pre-maple runs were not: that
would correspond to VM_ACCOUNT vm_enough_memory() charges leaking.)

You're having difficulty reproducing: I suggest trying with different
mem= on the boot command line (some of my loads I run with mem=700M,
some mem=1G, some with the workstation mem 8G) - I get the impression
that different mem= jumbles up memory allocations differently, so
what's harmless in one case is quickly harmful in another.
Or try changing between SLAB and SLUB.

One other thing: I haven't studied the source much, but didn't like
that rwsem_acquire(&mm->mmap_lock.dep_map, 0, 0, _THIS_IP_) hack in
exit_mmap(): sneaked into "mm: Start tracking VMAs with maple tree",
it reverts Suren's 64591e8605d6 "mm: protect free_pgtables with
mmap_lock write lock in exit_mmap", without explanating how it becomes
safe with maple tree (I imagine it's not).  That would have to be a
separate, justified patch if it goes forward.  (The nearby conflict
resolutions in mmotm and next are not quite right there, some stuff
my mm/munlock series deleted has resurfaced: but it's harmless, and
not worth worrying about if maple tree is dropped from linux-next.)

Hugh
Vasily Gorbik Feb. 27, 2022, 2:22 a.m. UTC | #12
On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently.  There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface.  The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct.  The long term goal is to reduce
> or remove the mmap_sem contention.
> 
> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes.  With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses.  The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.
> 
> This patch is based on v5.17-rc4
> 
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214

Hi Liam,

this patch series completely breaks s390. Besides endianess issue
in maple trees reported here:

https://lore.kernel.org/all/your-ad-here.call-01645924312-ext-0398@work.hours/

and with this endianess issue fixed (fixup from here ^^^) we still get numerous
KASAN reports starting with

[PATCH v6 10/71] mm: Start tracking VMAs with maple tree

mostly looking like this:

 BUG: KASAN: use-after-free in mas_descend_adopt+0x113a/0x1408
 Read of size 8 at addr 00000000b1d4e900 by task cat/610

 CPU: 34 PID: 610 Comm: cat Tainted: G        W         5.17.0-rc4-91313-g592c3b299aad #1
 Hardware name: IBM 3906 M04 704 (LPAR)
 Call Trace:
  [<000000000234647a>] dump_stack_lvl+0xfa/0x150
  [<0000000002328154>] print_address_description.constprop.0+0x64/0x340
  [<00000000008f5ba6>] kasan_report+0x13e/0x1a8
  [<00000000017f0c12>] mas_descend_adopt+0x113a/0x1408
  [<00000000018076ec>] mas_spanning_rebalance.isra.0+0x5164/0x6a20
  [<000000000180a71e>] mas_wr_spanning_store.isra.0+0x476/0xbc8
  [<00000000018189bc>] mas_store_gfp+0xd4/0x188
  [<0000000000827bae>] vma_mt_szero+0x146/0x368
  [<000000000082e990>] __do_munmap+0x340/0xe20
  [<000000000082f580>] __vm_munmap+0x110/0x1e8
  [<000000000082f76e>] __s390x_sys_munmap+0x6e/0x90
  [<000000000010dc6c>] do_syscall+0x22c/0x328
  [<000000000234d3d2>] __do_syscall+0x9a/0xf8
  [<0000000002374ed2>] system_call+0x82/0xb0
 INFO: lockdep is turned off.

 Allocated by task 610:
  kasan_save_stack+0x34/0x58
  __kasan_slab_alloc+0x84/0xa8
  kmem_cache_alloc+0x20c/0x520
  mas_alloc_nodes+0x26a/0x4c8
  mas_split.isra.0+0x2aa/0x1418
  mas_wr_modify+0x3fa/0xd28
  mas_store_gfp+0xd4/0x188
  vma_store+0x17a/0x3d8
  vma_link+0xac/0x798
  mmap_region+0xa5a/0x10b8
  do_mmap+0x7c2/0xa90
  vm_mmap_pgoff+0x186/0x250
  ksys_mmap_pgoff+0x334/0x400
  __s390x_sys_old_mmap+0xf4/0x130
  do_syscall+0x22c/0x328
  __do_syscall+0x9a/0xf8
  system_call+0x82/0xb0
 Freed by task 610:
  kasan_save_stack+0x34/0x58
  kasan_set_track+0x36/0x48
  kasan_set_free_info+0x34/0x58
  ____kasan_slab_free+0x11c/0x188
  __kasan_slab_free+0x24/0x30
  kmem_cache_free_bulk.part.0+0xec/0x538
  mas_destroy+0x2e4/0x710
  mas_store_gfp+0xf4/0x188
  vma_mt_szero+0x146/0x368
  __vma_adjust+0x155a/0x2188
  __split_vma+0x228/0x450
  mprotect_fixup+0x4f2/0x618
  do_mprotect_pkey.constprop.0+0x328/0x600
  __s390x_sys_mprotect+0x8e/0xb8
  do_syscall+0x22c/0x328
  __do_syscall+0x9a/0xf8
  system_call+0x82/0xb0

 The buggy address belongs to the object at 00000000b1d4e900
  which belongs to the cache maple_node of size 256
 The buggy address is located 0 bytes inside of
  256-byte region [00000000b1d4e900, 00000000b1d4ea00)
 The buggy address belongs to the page:
 page:0000400002c75200 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xb1d48
 head:0000400002c75200 order:3 compound_mapcount:0 compound_pincount:0
 flags: 0x3ffff00000010200(slab|head|node=0|zone=1|lastcpupid=0x1ffff)
 raw: 3ffff00000010200 0000400002c91e08 000040000263b608 000000008009ed00
 raw: 0000000000000000 0020004000000000 ffffffff00000001 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  00000000b1d4e800: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  00000000b1d4e880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >00000000b1d4e900: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                    ^
  00000000b1d4e980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  00000000b1d4ea00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

with eventual crash.

How come none of s390 architecture maintainers nor s390 mailing list were
on CC for the changes which affect s390 to that magnitude? None of this
has apparently been tested on s390 or any other big endian system. And
a shoulder tap to give it try would be helpful.

Now we are just starting looking at the problems. And until issues
are resolved this patch series has to be dropped from linux-next.
Hugh Dickins Feb. 27, 2022, 6:32 p.m. UTC | #13
On Sat, 26 Feb 2022, Hugh Dickins wrote:
> 
> I say "often": it's very erratic.  Once, a machine booted with mem=1G
> ran kernel builds successfully swapping for 4.5 hours before faulting
> in __nr_to_section in virt_to_folio ... while doing a __vma_adjust() -
> you'll ask me for a full stacktrace, and I'll answer sorry, too many,
> please try for yourself.  Another time, for 1.5 hours before hitting
> the BUG_ON(is_migration_entry(entry) && !PageLocked(p)) in
> pfn_swap_entry_to_page() - suggesting anon_vma locking had not been
> right while doing page migration (I was exercising THPs a lot).
> But now, can I even get it to complete the boot sequence?

Sorry, I never mentioned the architecture: x86_64 throughout for me.
Qian Cai Feb. 28, 2022, 11:56 a.m. UTC | #14
On Sat, Feb 26, 2022 at 01:58:09AM +0000, Liam Howlett wrote:
> Maybe?  I'm trying to figure out why it's having issues.. I've not been
> able to reproduce it with just my maple tree branch.  Steven Rostedt
> found a bad commit that has been fixed in either 20220224, I believe
> [1].  It might be best to try next-20220225 and see if you have better
> luck if that's an option.

Same thing for next-20220225. I just need to boot it and it will crash
shortly after.

BUG: KASAN: use-after-free in move_vma.isra.0
 Read of size 8 at addr ffff00088f74c7e8 by task apt-check/2331

 CPU: 2 PID: 2331 Comm: apt-check Tainted: G        W         5.17.0-rc5-next-20220225 #245
 Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
 Call trace:
  dump_backtrace
  show_stack
  dump_stack_lvl
  print_address_description.constprop.0
  kasan_report
  __asan_report_load8_noabort
  move_vma.isra.0
  __do_sys_mremap
  __arm64_sys_mremap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 Allocated by task 2331:
  kasan_save_stack
  __kasan_slab_alloc
  slab_post_alloc_hook
  kmem_cache_alloc
  vm_area_dup
  __split_vma
  do_mas_align_munmap.isra.0
  do_mas_munmap
  __vm_munmap
  __arm64_sys_munmap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 Freed by task 2331:
  kasan_save_stack
  kasan_set_track
  kasan_set_free_info
  __kasan_slab_free
  kmem_cache_free
  vm_area_free
  remove_vma
  do_mas_align_munmap.isra.0
  do_mas_munmap
  do_munmap
  move_vma.isra.0
  __do_sys_mremap
  __arm64_sys_mremap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync

 The buggy address belongs to the object at ffff00088f74c7c8
  which belongs to the cache vm_area_struct of size 144
 The buggy address is located 32 bytes inside of
  144-byte region [ffff00088f74c7c8, ffff00088f74c858)
 The buggy address belongs to the physical page:
 page:fffffc00223dd300 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x90f74c
 head:fffffc00223dd300 order:2 compound_mapcount:0 compound_pincount:0
 memcg:ffff000884a75001
 flags: 0xbfffc0000010200(slab|head|node=0|zone=2|lastcpupid=0xffff)
 raw: 0bfffc0000010200 fffffc00223dd008 fffffc0021e6bc08 ffff000800f5b380
 raw: 0000000000000000 0000000000210021 00000001ffffffff ffff000884a75001
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff00088f74c680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff00088f74c700: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >ffff00088f74c780: fc fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
                                                           ^
  ffff00088f74c800: fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc fc
  ffff00088f74c880: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ==================================================================
 Disabling lock debugging due to kernel taint
 Unable to handle kernel paging request at virtual address dfff800000000004
 KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
 Mem abort info:
   ESR = 0x96000004
   EC = 0x25: DABT (current EL), IL = 32 bits
   SET = 0, FnV = 0
   EA = 0, S1PTW = 0
   FSC = 0x04: level 0 translation fault
 Data abort info:
   ISV = 0, ISS = 0x00000004
   CM = 0, WnR = 0
 [dfff800000000004] address between user and kernel address ranges
 Internal error: Oops: 96000004 [#1] PREEMPT SMP
 Modules linked in: ipmi_devintf ipmi_msghandler cppc_cpufreq drm ip_tables x_tables ipv6 btrfs blake2b_generic libcrc32c xor xor_neon raid6_pq zstd_compress dm_mod crct10dif_ce nvme mlx5_core mpt3sas nvme_core raid_class
 CPU: 2 PID: 2331 Comm: apt-check Tainted: G    B   W         5.17.0-rc5-next-20220225 #245
 Hardware name: MiTAC RAPTOR EV-883832-X3-0001/RAPTOR, BIOS 1.6 06/28/2020
 pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
 pc : move_vma.isra.0
 lr : move_vma.isra.0
 sp : ffff800016df7ad0
 x29: ffff800016df7ad0 x28: ffff00088c7ba6c8 x27: 0000000000000001
 x26: 0000ffffad180000 x25: ffff00088bea1c40 x24: ffff00088bea1dc8
 x23: ffff00088c988040 x22: 0000000001800000 x21: 0000ffffab82d000
 x20: 0000000000000000 x19: 1ffff00002dbef6c x18: 0000000000000066
 x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
 x14: 0000000000000000 x13: 0000000000001490 x12: 0000000000000001
 x11: 0000000000000009 x10: 0000000000000000 x9 : ffff800008445544
 x8 : 000000003fffffff x7 : ffff80000928fbc4 x6 : 00000000f1f1f1f1
 x5 : dfff800000000000 x4 : ffff700002dbef18 x3 : 0000000041b58ab3
 x2 : 0000000000000004 x1 : dfff800000000000 x0 : 0000000000000020
 Call trace:
  move_vma.isra.0
  __do_sys_mremap
  __arm64_sys_mremap
  invoke_syscall
  el0_svc_common.constprop.0
  do_el0_svc
  el0_svc
  el0t_64_sync_handler
  el0t_64_sync
 Code: 91008000 d2d00001 f2fbffe1 d343fc02 (38e16841)
 ---[ end trace 0000000000000000 ]---
 Kernel panic - not syncing: Oops: Fatal exception
 SMP: stopping secondary CPUs
 Kernel Offset: 0x160000 from 0xffff800008000000
 PHYS_OFFSET: 0x80000000
 CPU features: 0x00,00000942,40000846
 Memory Limit: none
 ---[ end Kernel panic - not syncing: Oops: Fatal exception ]---
Liam R. Howlett Feb. 28, 2022, 2:26 p.m. UTC | #15
* Hugh Dickins <hughd@google.com> [220226 18:20]:
> On Sat, 26 Feb 2022, Liam Howlett wrote:
> > * Nathan Chancellor <nathan@kernel.org> [220225 18:00]:
> > > On Fri, Feb 25, 2022 at 03:46:52PM -0500, Qian Cai wrote:
> > > > On Fri, Feb 25, 2022 at 08:23:41PM +0000, Liam Howlett wrote:
> > > > > I just booted an arm64 VM with my build and kasan enabled with no issue.
> > > > > Could you please send me your config file for the build?
> > > > 
> > > > On linux-next, I just do:
> > > > 
> > > > $ make arch=arm64 defconfig debug.config [1]
> > > > 
> > > > Then, I just generate some memory pressume into swapping/OOM Killer to
> > > > trigger it.
> > > > 
> > > > [1] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/kernel/configs/debug.config
> > > 
> > > Is the stacktrace [1] related to the conflict that Mark encountered [2]
> > > while merging the maple and folio trees? Booting a next-20220223 kernel
> > > on my Raspberry Pi 3 and 4 shows constant NULL pointer dereferences
> > > (just ARCH=arm and ARCH=arm64 defconfigs) and reverting the folio and
> > > maple tree merges makes everything work properly again.
> > > 
> > > [1]: https://lore.kernel.org/r/YhhRrBpXTFolUAKi@qian/
> > > [2]: https://lore.kernel.org/r/20220224011653.1380557-1-broonie@kernel.org/
> > 
> > Maybe?  I'm trying to figure out why it's having issues.. I've not been
> > able to reproduce it with just my maple tree branch.  Steven Rostedt
> > found a bad commit that has been fixed in either 20220224, I believe
> > [1].  It might be best to try next-20220225 and see if you have better
> > luck if that's an option.
> > 
> > [1]:
> > https://lore.kernel.org/linux-fsdevel/f6fb6fd4-dcf2-4326-d25e-9a4a9dad5020@fb.com/T/#t
> 
> Hi Liam, I think I have the beginnings of an answer for Qian's issue,
> details below; but first off, I'd better make my own opinion clear.
> 
> I think this series, however good it may be so far, is undertested and
> not approaching ready for 5.18, and should not have been pushed via
> git tree into 5.17-rc-late-stage linux-next, subverting Andrew's mmotm.
> 
> I believe Stephen and Mark should drop it from linux-next for now,
> while linux-next stabilizes for 5.18, while you work on testing and
> fixing, and resubmit rebased to 5.18-rc1 or -rc2 when you're ready.

Hugh,

Thank you for your testing, review, and thoughts on the maple tree.  I
agree with your findings and see the error, although still have
difficulty reproducing on amd64 and arm64 VMs.

Stephen, Please drop the maple tree from linux-next.


Thank you,
Liam
Liam R. Howlett Feb. 28, 2022, 2:56 p.m. UTC | #16
* Vasily Gorbik <gor@linux.ibm.com> [220226 21:23]:
> On Tue, Feb 15, 2022 at 02:37:44PM +0000, Liam Howlett wrote:
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently.  There are a number of places in the kernel
> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface.  The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct.  The long term goal is to reduce
> > or remove the mmap_sem contention.
> > 
> > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > nodes.  With the increased branching factor, it is significantly shorter than
> > the rbtree so it has fewer cache misses.  The removal of the linked list
> > between subsequent entries also reduces the cache misses and the need to pull
> > in the previous and next VMA during many tree alterations.
> > 
> > This patch is based on v5.17-rc4
> > 
> > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220214
> 
> Hi Liam,
> 
> this patch series completely breaks s390. Besides endianess issue
> in maple trees reported here:
> 
> https://lore.kernel.org/all/your-ad-here.call-01645924312-ext-0398@work.hours/
> 
> and with this endianess issue fixed (fixup from here ^^^) we still get numerous
> KASAN reports starting with
> 
> [PATCH v6 10/71] mm: Start tracking VMAs with maple tree

Thanks for looking at this.  That commit had two issues which I was
debugging on parisc for most of last week.  The first is that I was not
actually writing the stack expansion to the maple tree in that commit -
it was added in a later commit.  The second issue which was most likely
causing your crashes was the calculation for the gap in
unmapped_area_topdown().

With the maple tree (3dea5db7fbb5), I am able to boot parisc and s390
virtual machines.  Although, the issue which Hugh discusses in detail
still exists [1].  The latest maple tree can be found on github [2].  I
am still investigating why the heap changed locations on parisc to
ensure it's not of issue.

1: https://lore.kernel.org/all/5f8f4f-ad63-eb-fd73-d48748af8a76@google.com/
2: https://github.com/oracle/linux-uek/tree/maple/mainline

Regards,
Liam