[v5,00/70] Introducing the Maple Tree

Message ID 20220202024137.2516438-1-Liam.Howlett@oracle.com (mailing list archive)

Liam R. Howlett Feb. 2, 2022, 2:41 a.m. UTC
The maple tree is an RCU-safe range-based B-tree designed to use modern
processor caches efficiently.  There are a number of places in the kernel
where a non-overlapping range-based tree would be beneficial, especially
one with a simple interface.  The first user that is covered in this
patch set is the vm_area_struct, where three data structures are
replaced by the maple tree: the augmented rbtree, the vma cache, and the
linked list of VMAs in the mm_struct.  The long term goal is to reduce
or remove the mmap_sem contention.
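
To illustrate what "non-overlapping range-based" means for a lookup, here
is a minimal userspace sketch.  It is not the maple tree and not kernel
code: struct range_entry and range_lookup() are hypothetical names, and a
sorted array stands in for the tree.  It only shows the semantics, where
every index inside a stored range resolves to the same entry and an index
in a gap resolves to NULL, much like looking up a VMA by address.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a VMA: covers [first, last] inclusive. */
struct range_entry {
	unsigned long first;
	unsigned long last;
	const char *name;
};

/*
 * Return the entry whose range contains index, or NULL if index falls
 * in a gap.  The table must be sorted by first and the ranges must not
 * overlap, which is what makes a plain binary search sufficient.
 */
static const struct range_entry *
range_lookup(const struct range_entry *tbl, size_t n, unsigned long index)
{
	size_t lo = 0, hi = n;

	assert(tbl != NULL || n == 0);

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (index < tbl[mid].first)
			hi = mid;		/* index is left of this range */
		else if (index > tbl[mid].last)
			lo = mid + 1;		/* index is right of this range */
		else
			return &tbl[mid];	/* first <= index <= last */
	}
	return NULL;
}
```

Because the ranges never overlap, a single key comparison per step decides
the direction, and no augmented per-node data is needed to find the entry.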

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses.  The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.
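
As a rough worked example of the height difference, the sketch below
computes the best-case number of levels for a tree with 16 entries per
leaf and 10 children per non-leaf node, next to the minimum number of
levels of a perfectly balanced binary tree.  The function names are
hypothetical, and the assumption of completely full nodes is optimistic:
real maple tree nodes are not always full, and an rbtree may be up to
about twice as tall as the balanced minimum.

```c
#include <assert.h>

/*
 * Best-case levels for n entries when every leaf holds leaf_slots
 * entries and every non-leaf node has fanout children.
 */
static unsigned int wide_tree_levels(unsigned long n,
				     unsigned long leaf_slots,
				     unsigned long fanout)
{
	unsigned long nodes;
	unsigned int levels = 1;

	assert(n > 0 && leaf_slots > 0 && fanout > 1);

	/* Number of leaves, rounded up. */
	nodes = (n - 1) / leaf_slots + 1;

	/* Add parent levels until a single root remains. */
	while (nodes > 1) {
		nodes = (nodes - 1) / fanout + 1;
		levels++;
	}
	return levels;
}

/* Minimum levels of a binary tree with n nodes: floor(log2(n)) + 1. */
static unsigned int binary_tree_levels(unsigned long n)
{
	unsigned int levels = 0;

	while (n) {
		n >>= 1;
		levels++;
	}
	return levels;
}
```

For 1000 entries, wide_tree_levels(1000, 16, 10) gives 3 levels, while
binary_tree_levels(1000) gives 10, so a lookup touches far fewer nodes.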

This patch set is based on v5.17-rc2.

git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220131

v5 changes:
- Fixed race condition in mas_prev() on RCU read of a dead node
- Fixed potential double NULL entries into the tree in rare cases
- Separated testing from tree commit for easier review
- Added an extra patch to clean up __vma_link_file() by passing mapping
- Extra patch for riscv linked list
- Expanded mm/mempolicy patch to remove new linked list use
- Changed comment in include/linux/mm.h to reference mmap.c for vma_store() -
  Thanks Vlastimil Babka
- Added comment to vma iterator code to explain the use of vma_find() for
  vma_next() implementation - Thanks Vlastimil Babka
- Fixed missing and incorrectly ordered s-o-b - Thanks Vlastimil Babka
- Move duplicate documentation from rst and header to .c - Thanks Vlastimil Babka
- Fixed whitespace in header - Thanks Vlastimil Babka
- Changed mm/mmap.c remove_mt() to use mas_for_each() since it is safe now with
  the locking changes
- Changed khugepaged_scan_mm_slot() to use vma iterator
- Changed munlock() to use vma iterator
- Changed vma_store() to use vma_mas_store() instead of its own implementation
- Removed RCU lock from early unmapped_area{_topdown} patch
- Added comment to vma_find_prev() to explain why the RCU lock is not needed.
- Renamed vma_lookup() change to specify mtree_load() for clarity - Thanks Vlastimil Babka
- Move vma_lookup() change to after vmacache removal to avoid subtle change in
  vmacache behaviour. - Thanks Vlastimil Babka
- Drop extra vmacache info from header - Thanks Vlastimil Babka
- Fixed htmldocs issues - Thanks Vlastimil Babka
- Fixed formatting in mm/util.c when removing rbtree
- Add missing arguments to doc of vma_expand & add VM_BUG_ONs - Thanks Vlastimil Babka
- Fixed error in mmap_region() check to see if the vma can be expanded over next
- Remove spurious edits to mremap - Thanks Vlastimil Babka
- Consolidated fix to do_brk_munmap() into introduction of the function - Thanks Vlastimil Babka
- Change xtensa loop to avoid missing address exhaustion - Thanks Vlastimil Babka
- Fix up cxl driver - Thanks Vlastimil Babka
- Change second loop in um tlb to use vma iterator - Thanks Vlastimil Babka
- Added comments to fs/userfaultfd.c maple state manipulations - Thanks Vlastimil Babka
- Fixed fs/userfaultfd cases where maple state was invalidated by vma_merge()/split_vma()
- Fixed do_brk_munmap() vm_flags usage - Thanks Vlastimil Babka
- Reverted bpf linked list changes to v3 - Thanks Vlastimil Babka
- Fixed mm/huge_memory comment - Thanks Vlastimil Babka
- Changed mm/khugepaged loop to a vma iterator - Thanks Vlastimil Babka
- Change mm/ksm.c to vma iterator instead of mas_for_each - Thanks Vlastimil Babka
- Changed walk_page_range() to use ULONG_MAX instead of -1 - Thanks Vlastimil Babka
- Updated mm/mempolicy to use vma iterator - Thanks Vlastimil Babka
- Restored VM_BUG_ON(!vma) to mbind_range() - Thanks Vlastimil Babka
- Use vma iterator in mlock - Thanks Vlastimil Babka
- Fix overflow handling in count_mm_mlocked_page_nr() - Thanks Vlastimil Babka
- Use find_vma_intersection() in vma_expandable() - Thanks Vlastimil Babka
- Use vma iterator in oom_kill - Thanks Vlastimil Babka
- Convert swapfile to vma iterator and remove mas_pause() - Thanks Vlastimil Babka
- Fixed comment in i915 patch - Thanks Vlastimil Babka
- Added comment to nommu exit_mmap() about locking - Thanks Vlastimil Babka
- Updated documentation from mtree_init() to mt_init() - Thanks David Howells
- Renamed mas_entry_count() to mas_expected_entries() - Thanks David Howells

v4: https://lore.kernel.org/linux-mm/20211201142918.921493-30-Liam.Howlett@oracle.com/t/
v3: https://lore.kernel.org/linux-mm/20211005012959.1110504-1-Liam.Howlett@oracle.com/
v2: https://lore.kernel.org/linux-mm/20210817154651.1570984-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20210428153542.2814175-1-Liam.Howlett@Oracle.com/


Performance on a 144 core x86:

It is important to note that the code is still using the mmap_sem.  The
performance seems fairly similar on real-world workloads, while there
are variations in micro-benchmarks.

kernbench, increased system time, less user time:
Amean     user-2        887.04 (   0.00%)      887.58 *  -0.06%*
Amean     syst-2        165.58 (   0.00%)      171.42 *  -3.53%*
Amean     elsp-2        532.29 (   0.00%)      535.27 *  -0.56%*
Amean     user-4        909.41 (   0.00%)      911.48 *  -0.23%*
Amean     syst-4        168.91 (   0.00%)      176.91 *  -4.74%*
Amean     elsp-4        276.44 (   0.00%)      280.24 *  -1.38%*
Amean     user-8        962.08 (   0.00%)      964.41 *  -0.24%*
Amean     syst-8        178.70 (   0.00%)      185.35 *  -3.72%*
Amean     elsp-8        151.55 (   0.00%)      152.77 *  -0.81%*
Amean     user-16      1042.51 (   0.00%)     1037.92 *   0.44%*
Amean     syst-16       190.52 (   0.00%)      195.19 *  -2.45%*
Amean     elsp-16        86.03 (   0.00%)       86.64 *  -0.70%*
Amean     user-32      1225.85 (   0.00%)     1208.83 *   1.39%*
Amean     syst-32       215.97 (   0.00%)      220.13 *  -1.92%*
Amean     elsp-32        55.03 (   0.00%)       55.02 *   0.01%*
Amean     user-64      1235.68 (   0.00%)     1232.82 *   0.23%*
Amean     syst-64       215.98 (   0.00%)      222.28 *  -2.92%*
Amean     elsp-64        33.12 (   0.00%)       33.24 *  -0.36%*
Amean     user-128     1610.29 (   0.00%)     1609.63 *   0.04%*
Amean     syst-128      268.52 (   0.00%)      277.45 *  -3.33%*
Amean     elsp-128       25.98 (   0.00%)       26.04 *  -0.23%*
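
The percentages in the table are consistent with the relative change
(baseline - patched) / baseline * 100, so a negative value means the
second column took more time.  The sketch below reproduces that
arithmetic.  It assumes the first column is the baseline kernel and the
second is the kernel with this series applied, and gain_percent() is a
hypothetical helper name.

```c
#include <assert.h>

/*
 * Relative change in percent for a "lower is better" metric such as
 * time.  Positive means the patched kernel was faster than baseline.
 */
static double gain_percent(double baseline, double patched)
{
	assert(baseline > 0.0);

	return (baseline - patched) / baseline * 100.0;
}
```

For the syst-2 row, gain_percent(165.58, 171.42) is about -3.53, and for
the user-16 row, gain_percent(1042.51, 1037.92) is about 0.44, matching
the values shown.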

Increase in performance in the following micro-benchmarks in Hmean:
- signal1-processes +1.81% to +11.12%

Decrease in performance in the following micro-benchmarks in Hmean:
- brk1-processes -38.07% to -43.20%

Mixed Hmean results:
- pft timing system -4.73% to +10.75%
- pft timing elapsed -4.31% to +22.47%
- pft faults/cpu -4.10% to +11.50%
- pft faults/sec -4.12% to +28.96%
- freqmine-medium -6.86% to +11.06%
- malloc1-processes -24.52% to +4.61%
- malloc1-threads -5.18% to -48.83%
- page_fault3-threads -13.92% to +11.37%


Patch organization:
Patch 1 is to add a missing lock to avoid an assert issue when using a vma iterator.

Patches 2 to 6 are radix tree test suite additions for maple tree
support.

Patch 7 adds the maple tree.
Patch 8 adds the maple tree test code.

Patches 9-18 are the removal of the rbtree from the mm_struct.  This now
includes the introduction of the vma iterator.

Patch 19 optimizes __vma_adjust() for the maple tree.

Patches 20-26 are the removal of the vmacache from the kernel.

Patches 27-30 are internal mm changes for efficiencies.

Patches 31-68 are the removal of the linked list.

Patches 69 and 70 are a small cleanup from the removal of the vma linked list.


Liam R. Howlett (60):
  radix tree test suite: Add pr_err define
  radix tree test suite: Add kmem_cache_set_non_kernel()
  radix tree test suite: Add allocation counts and size to kmem_cache
  radix tree test suite: Add support for slab bulk APIs
  radix tree test suite: Add lockdep_is_held to header
  Maple Tree: Add new data structure
  lib/test_maple_tree: Add testing for maple tree
  mm: Start tracking VMAs with maple tree
  mm/mmap: Use the maple tree in find_vma() instead of the rbtree.
  mm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree
  mm/mmap: Use maple tree for unmapped_area{_topdown}
  kernel/fork: Use maple tree for dup_mmap() during forking
  mm: Remove rb tree.
  mmap: Change zeroing of maple tree in __vma_adjust()
  xen: Use vma_lookup() in privcmd_ioctl_mmap()
  mm: Optimize find_exact_vma() to use vma_lookup()
  mm/khugepaged: Optimize collapse_pte_mapped_thp() by using
    vma_lookup()
  mm/mmap: Change do_brk_flags() to expand existing VMA and add
    do_brk_munmap()
  mm: Use maple tree operations for find_vma_intersection()
  mm/mmap: Use advanced maple tree API for mmap_region()
  mm: Remove vmacache
  mm/mmap: Move mmap_region() below do_munmap()
  mm/mmap: Reorganize munmap to use maple states
  mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
  arm64: Remove mmap linked list from vdso
  parisc: Remove mmap linked list from cache handling
  powerpc: Remove mmap linked list walks
  s390: Remove vma linked list walks
  x86: Remove vma linked list walks
  xtensa: Remove vma linked list walks
  cxl: Remove vma linked list walk
  optee: Remove vma linked list walk
  um: Remove vma linked list walk
  binfmt_elf: Remove vma linked list walk
  exec: Use VMA iterator instead of linked list
  fs/proc/base: Use maple tree iterators in place of linked list
  userfaultfd: Use maple tree iterator to iterate VMAs
  ipc/shm: Use VMA iterator instead of linked list
  acct: Use VMA iterator instead of linked list
  perf: Use VMA iterator
  sched: Use maple tree iterator to walk VMAs
  fork: Use VMA iterator
  bpf: Remove VMA linked list
  mm/gup: Use maple tree navigation instead of linked list
  mm/khugepaged: Stop using vma linked list
  mm/ksm: Use vma iterators instead of vma linked list
  mm/madvise: Use vma_find() instead of vma linked list
  mm/memcontrol: Stop using mm->highest_vm_end
  mm/mempolicy: Use vma iterator & maple state instead of vma linked
    list
  mm/mlock: Use vma iterator and instead of vma linked list
  mm/mprotect: Use maple tree navigation instead of vma linked list
  mm/mremap: Use vma_find_intersection() instead of vma linked list
  mm/msync: Use vma_find() instead of vma linked list
  mm/oom_kill: Use maple tree iterators instead of vma linked list
  mm/pagewalk: Use vma_find() instead of vma linked list
  mm/swapfile: Use vma iterator instead of vma linked list
  riscv: Use vma iterator for vdso
  mm: Remove the vma linked list
  mm/mmap: Drop range_has_overlap() function
  mm/mmap.c: Pass in mapping to __vma_link_file()

Matthew Wilcox (Oracle) (10):
  binfmt_elf: Take the mmap lock when walking the VMA list
  mm: Add VMA iterator
  mmap: Use the VMA iterator in count_vma_pages_range()
  damon: Convert __damon_va_three_regions to use the VMA iterator
  proc: Remove VMA rbtree use from nommu
  mm: Convert vma_lookup() to use mtree_load()
  coredump: Remove vma linked list walk
  fs/proc/task_mmu: Stop using linked list and highest_vm_end
  i915: Use the VMA iterator
  nommu: Remove uses of VMA linked list

 Documentation/core-api/index.rst              |     1 +
 Documentation/core-api/maple_tree.rst         |   196 +
 MAINTAINERS                                   |    12 +
 arch/arm64/kernel/vdso.c                      |     3 +-
 arch/parisc/kernel/cache.c                    |     9 +-
 arch/powerpc/kernel/vdso.c                    |     6 +-
 arch/powerpc/mm/book3s32/tlb.c                |    11 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |    13 +-
 arch/riscv/kernel/vdso.c                      |     3 +-
 arch/s390/kernel/vdso.c                       |     3 +-
 arch/s390/mm/gmap.c                           |     6 +-
 arch/um/kernel/tlb.c                          |    14 +-
 arch/x86/entry/vdso/vma.c                     |     9 +-
 arch/x86/kernel/tboot.c                       |     2 +-
 arch/xtensa/kernel/syscall.c                  |    18 +-
 drivers/firmware/efi/efi.c                    |     2 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |    14 +-
 drivers/misc/cxl/fault.c                      |    45 +-
 drivers/tee/optee/call.c                      |    18 +-
 drivers/xen/privcmd.c                         |     2 +-
 fs/binfmt_elf.c                               |     6 +-
 fs/coredump.c                                 |    33 +-
 fs/exec.c                                     |    12 +-
 fs/proc/base.c                                |     5 +-
 fs/proc/internal.h                            |     2 +-
 fs/proc/task_mmu.c                            |    74 +-
 fs/proc/task_nommu.c                          |    45 +-
 fs/userfaultfd.c                              |    49 +-
 include/linux/maple_tree.h                    |   673 +
 include/linux/mm.h                            |    77 +-
 include/linux/mm_types.h                      |    43 +-
 include/linux/mm_types_task.h                 |    12 -
 include/linux/sched.h                         |     1 -
 include/linux/sched/mm.h                      |    13 +
 include/linux/userfaultfd_k.h                 |     7 +-
 include/linux/vm_event_item.h                 |     4 -
 include/linux/vmacache.h                      |    28 -
 include/linux/vmstat.h                        |     6 -
 include/trace/events/maple_tree.h             |   123 +
 include/trace/events/mmap.h                   |    71 +
 init/main.c                                   |     2 +
 ipc/shm.c                                     |    21 +-
 kernel/acct.c                                 |    11 +-
 kernel/bpf/task_iter.c                        |    10 +-
 kernel/debug/debug_core.c                     |    12 -
 kernel/events/core.c                          |     3 +-
 kernel/events/uprobes.c                       |     9 +-
 kernel/fork.c                                 |    69 +-
 kernel/sched/fair.c                           |    10 +-
 lib/Kconfig.debug                             |    15 +-
 lib/Makefile                                  |     3 +-
 lib/maple_tree.c                              |  6943 +++
 lib/test_maple_tree.c                         | 37204 ++++++++++++++++
 mm/Makefile                                   |     2 +-
 mm/damon/vaddr-test.h                         |    37 +-
 mm/damon/vaddr.c                              |    53 +-
 mm/debug.c                                    |    14 +-
 mm/gup.c                                      |     9 +-
 mm/huge_memory.c                              |     4 +-
 mm/init-mm.c                                  |     4 +-
 mm/internal.h                                 |    78 +-
 mm/khugepaged.c                               |    13 +-
 mm/ksm.c                                      |    18 +-
 mm/madvise.c                                  |     2 +-
 mm/memcontrol.c                               |     6 +-
 mm/memory.c                                   |    33 +-
 mm/mempolicy.c                                |    58 +-
 mm/mlock.c                                    |    34 +-
 mm/mmap.c                                     |  2106 +-
 mm/mprotect.c                                 |     7 +-
 mm/mremap.c                                   |    22 +-
 mm/msync.c                                    |     2 +-
 mm/nommu.c                                    |   127 +-
 mm/oom_kill.c                                 |     3 +-
 mm/pagewalk.c                                 |     2 +-
 mm/swapfile.c                                 |     4 +-
 mm/util.c                                     |    32 -
 mm/vmacache.c                                 |   117 -
 mm/vmstat.c                                   |     4 -
 tools/testing/radix-tree/.gitignore           |     2 +
 tools/testing/radix-tree/Makefile             |    13 +-
 tools/testing/radix-tree/generated/autoconf.h |     1 +
 tools/testing/radix-tree/linux.c              |   160 +-
 tools/testing/radix-tree/linux/kernel.h       |     1 +
 tools/testing/radix-tree/linux/lockdep.h      |     2 +
 tools/testing/radix-tree/linux/maple_tree.h   |     7 +
 tools/testing/radix-tree/linux/slab.h         |     4 +
 tools/testing/radix-tree/maple.c              |    59 +
 .../radix-tree/trace/events/maple_tree.h      |     3 +
 89 files changed, 47179 insertions(+), 1847 deletions(-)
 create mode 100644 Documentation/core-api/maple_tree.rst
 create mode 100644 include/linux/maple_tree.h
 delete mode 100644 include/linux/vmacache.h
 create mode 100644 include/trace/events/maple_tree.h
 create mode 100644 lib/maple_tree.c
 create mode 100644 lib/test_maple_tree.c
 delete mode 100644 mm/vmacache.c
 create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
 create mode 100644 tools/testing/radix-tree/maple.c
 create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h