[v3,00/66] Introducing the Maple Tree

Message ID 20211005012959.1110504-1-Liam.Howlett@oracle.com (mailing list archive)

Liam R. Howlett Oct. 5, 2021, 1:30 a.m. UTC
The maple tree is an RCU-safe, range-based B-tree designed to use modern
processor caches efficiently.  There are a number of places in the kernel
where a non-overlapping range-based tree would be beneficial, especially
one with a simple interface.  The first user covered in this patch set is
the vm_area_struct, where three data structures are replaced by the maple
tree: the augmented rbtree, the vma cache, and the linked list of VMAs in
the mm_struct.  The long-term goal is to reduce or remove the mmap_sem
contention.
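
To make "non-overlapping range-based" lookup concrete, here is a minimal toy
sketch.  It is not the maple tree and not code from this series: it models
the ranges as a sorted array and returns the first range that ends above an
address, which is the lookup semantic find_vma() provides.  The names
toy_range and toy_find are hypothetical.

```c
#include <stddef.h>

/* Hypothetical stand-in for a VMA: a half-open address range [start, end). */
struct toy_range {
	unsigned long start;
	unsigned long end;
};

/*
 * Return the first range whose end is above addr, or NULL if there is none.
 * Assumes the array is sorted by start and the ranges do not overlap, so
 * the end values are sorted as well and a binary search is valid.
 */
static const struct toy_range *toy_find(const struct toy_range *r, size_t n,
					unsigned long addr)
{
	size_t lo = 0, hi = n;

	while (lo < hi) {
		size_t mid = lo + (hi - lo) / 2;

		if (r[mid].end > addr)
			hi = mid;	/* candidate; look for an earlier one */
		else
			lo = mid + 1;	/* range lies entirely at or below addr */
	}
	return lo < n ? &r[lo] : NULL;
}
```

Note that an address falling in a gap between two ranges returns the next
range above it, so the caller still has to check start against the address
if it needs an exact hit.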

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the increased branching factor, it is significantly shorter than
the rbtree so it has fewer cache misses.  The removal of the linked list
between subsequent entries also reduces the cache misses and the need to pull
in the previous and next VMA during many tree alterations.
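
As a rough illustration of the height difference, the sketch below counts
the levels needed for a given number of entries.  It assumes every node is
completely full, which real trees will not be, and uses a perfectly balanced
binary tree as a best case for the rbtree.  The function names are
hypothetical and are not part of this patch set.

```c
/*
 * Levels needed to hold n entries when each leaf holds leaf_fanout entries
 * and each non-leaf node has node_fanout children, assuming full nodes.
 */
static unsigned int maple_levels(unsigned long n, unsigned long leaf_fanout,
				 unsigned long node_fanout)
{
	unsigned int levels = 1;
	unsigned long capacity = leaf_fanout;

	while (capacity < n) {
		capacity *= node_fanout;	/* one more level of non-leaf nodes */
		levels++;
	}
	return levels;
}

/* Levels of a perfectly balanced binary tree holding n entries. */
static unsigned int binary_levels(unsigned long n)
{
	unsigned int levels = 0;

	while (n) {
		n >>= 1;
		levels++;
	}
	return levels;
}
```

For 1000 entries, maple_levels(1000, 16, 10) gives 3 levels while
binary_levels(1000) gives 10.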

This patch set is based on v5.15-rc3.

git: https://github.com/oracle/linux-uek/tree/howlett/maple/20211004

v3 changes:
- Reverted unnecessary change to mm headers - Thanks David Hildenbrand
- Fixed 8 typos in the documentation - Thanks Douglas Gilbert
- Reduced the number of walks in do_mas_align_munmap()
- Removed unnecessary resets of the maple state
- Reworked mas_node_store() and mas_slot_store() so _mas_store() has a
  shallower code path
- Added support for metadata node end markers to leaves and non-alloc
  nodes
- Fixed lockdep complaint on detaching VMAs, mainly cosmetic
- Added maple iterator for powerpc, damon, i915, and s390 VDSO
- Changed zeroing of maple tree in __vma_adjust(); the area is now just
  overwritten.

v2: https://lore.kernel.org/linux-mm/20210817154651.1570984-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20210428153542.2814175-1-Liam.Howlett@Oracle.com/

Performance on a 144 core x86:

It is important to note that the code is still using the mmap_sem.
Performance seems fairly similar on real-world workloads, while there
are variations in micro-benchmarks.

kernbench, increased system time, less user time:

Amean     user-2        881.41 (   0.00%)      879.51 *   0.22%*
Amean     syst-2        147.52 (   0.00%)      151.16 *  -2.47%*
Amean     elsp-2        520.65 (   0.00%)      521.30 *  -0.13%*
Amean     user-4        904.92 (   0.00%)      905.51 *  -0.07%*
Amean     syst-4        154.76 (   0.00%)      159.14 *  -2.83%*
Amean     elsp-4        271.53 (   0.00%)      272.52 *  -0.36%*
Amean     user-8        957.37 (   0.00%)      957.04 *   0.03%*
Amean     syst-8        162.09 (   0.00%)      168.37 *  -3.88%*
Amean     elsp-8        148.01 (   0.00%)      149.64 *  -1.10%*
Amean     user-16      1037.05 (   0.00%)     1034.01 *   0.29%*
Amean     syst-16       171.22 (   0.00%)      178.27 *  -4.12%*
Amean     elsp-16        85.17 (   0.00%)       84.98 *   0.23%*
Amean     user-32      1202.17 (   0.00%)     1164.72 *   3.11%*
Amean     syst-32       189.54 (   0.00%)      195.18 *  -2.98%*
Amean     elsp-32        52.93 (   0.00%)       52.41 *   0.98%*
Amean     user-64      1236.16 (   0.00%)     1244.45 *  -0.67%*
Amean     syst-64       192.87 (   0.00%)      203.73 *  -5.63%*
Amean     elsp-64        32.53 (   0.00%)       32.81 *  -0.84%*
Amean     user-128     1609.20 (   0.00%)     1608.73 *   0.03%*
Amean     syst-128      237.86 (   0.00%)      250.98 *  -5.52%*
Amean     elsp-128       25.55 (   0.00%)       25.89 *  -1.30%*
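
For reading the table, the percentages appear to follow the convention
sketched below.  This assumes the first column is the baseline kernel and
the second is the maple tree kernel; the function name pct_gain is
hypothetical.  Since these are times, a positive value means the second
column is faster.

```c
/*
 * Percentage change of patched relative to baseline.  Positive means the
 * patched value is lower, which for a time measurement means faster.
 */
static double pct_gain(double baseline, double patched)
{
	return (baseline - patched) / baseline * 100.0;
}
```

For the user-2 row, pct_gain(881.41, 879.51) is about 0.22, and for the
syst-2 row, pct_gain(147.52, 151.16) is about -2.47, matching the table.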

Increase in performance in the following micro-benchmarks in Hmean:
- wis brk1-threads: Disregard, this is useless.
- wis malloc1-threads: Increase of 13% to 992%
- wis page_fault1-threads: Increase of up to 15%
- wis signal1-processes: 0% to +10%

Decrease in performance in the following micro-benchmarks in Hmean:
- wis brk1-processes: -42% to -49% due to RCU required allocations
- wis signal1-threads: 0% to -5%

Mixed:
- wis malloc1-processes: +4% to -19%
- wis pthread_mutex1-threads: +21% to -8%
- wis page_fault3-threads: +10% to -28%


Patch organization:

Patches 1 to 4 are radix tree test suite additions for maple tree
support.

Patch 5 adds the maple tree, the bulk of which is test code.

Patches 6-11 are the removal of the rbtree from the mm_struct.

Patch 12 optimizes __vma_adjust() for the maple tree.

Patches 13-19 are the removal of the vmacache from the kernel.

Patches 20-24 make necessary changes for the removal of the vma linked
list.

Patches 25-65 are the removal of the vma linked list from the mm_struct.

Patch 66 is a small cleanup from the removal of the vma linked list.

Liam R. Howlett (66):
  radix tree test suite: Add pr_err define
  radix tree test suite: Add kmem_cache_set_non_kernel()
  radix tree test suite: Add allocation counts and size to kmem_cache
  radix tree test suite: Add support for slab bulk APIs
  Maple Tree: Add new data structure
  mm: Start tracking VMAs with maple tree
  mm/mmap: Use the maple tree in find_vma() instead of the rbtree.
  mm/mmap: Use the maple tree for find_vma_prev() instead of the rbtree
  mm/mmap: Use maple tree for unmapped_area{_topdown}
  kernel/fork: Use maple tree for dup_mmap() during forking
  mm: Remove rb tree.
  mmap: Change zeroing of maple tree in __vma_adjust
  xen/privcmd: Optimized privcmd_ioctl_mmap() by using vma_lookup()
  mm: Optimize find_exact_vma() to use vma_lookup()
  mm/khugepaged: Optimize collapse_pte_mapped_thp() by using
    vma_lookup()
  mm/mmap: Change do_brk_flags() to expand existing VMA and add
    do_brk_munmap()
  mm: Use maple tree operations for find_vma_intersection() and
    find_vma()
  mm/mmap: Use advanced maple tree API for mmap_region()
  mm: Remove vmacache
  mm/mmap: Move mmap_region() below do_munmap()
  mm/mmap: Convert count_vma_pages_range() to use ma_state
  mm/mmap: Reorganize munmap to use maple states
  mm/mmap: Change do_brk_munmap() to use do_mas_align_munmap()
  mm: Introduce vma_next() and vma_prev()
  arch/arm64: Remove mmap linked list from vdso.
  arch/parisc: Remove mmap linked list from kernel/cache
  arch/powerpc: Remove mmap linked list from mm/book3s32/tlb
  arch/powerpc: Remove mmap linked list from mm/book3s64/subpage_prot
  arch/s390: Use maple tree iterators instead of linked list.
  arch/x86: Use maple tree iterators for vdso/vma
  arch/xtensa: Use maple tree iterators for unmapped area
  drivers/misc/cxl: Use maple tree iterators for cxl_prefault_vma()
  drivers/tee/optee: Use maple tree iterators for __check_mem_type()
  fs/binfmt_elf: Use maple tree iterators for fill_files_note()
  fs/coredump: Use maple tree iterators in place of linked list
  fs/exec: Use vma_next() instead of linked list
  fs/proc/base: Use maple tree iterators in place of linked list
  fs/proc/task_mmu: Stop using linked list and highest_vm_end
  fs/userfaultfd: Stop using vma linked list.
  ipc/shm: Stop using the vma linked list
  kernel/acct: Use maple tree iterators instead of linked list
  kernel/events/core: Use maple tree iterators instead of linked list
  kernel/events/uprobes: Use maple tree iterators instead of linked list
  kernel/sched/fair: Use maple tree iterators instead of linked list
  kernel/fork: Use maple tree iterators instead of linked list
  arch/um/kernel/tlb: Stop using linked list
  bpf: Remove VMA linked list
  mm/gup: Use maple tree navigation instead of linked list
  mm/khugepaged: Use maple tree iterators instead of vma linked list
  mm/ksm: Use maple tree iterators instead of vma linked list
  mm/madvise: Use vma_next instead of vma linked list
  mm/memcontrol: Stop using mm->highest_vm_end
  mm/mempolicy: Use maple tree iterators instead of vma linked list
  mm/mlock: Use maple tree iterators instead of vma linked list
  mm/mprotect: Use maple tree navigation instead of vma linked list
  mm/mremap: Use vma_next() instead of vma linked list
  mm/msync: Use vma_next() instead of vma linked list
  mm/oom_kill: Use maple tree iterators instead of vma linked list
  mm/pagewalk: Use vma_next() instead of vma linked list
  mm/swapfile: Use maple tree iterator instead of vma linked list
  damon: Change vma iterator to mas_for_each
  powerpc: Use maple tree iterator for vdso.
  s390: Use the maple tree iterator for vdso
  i915: Use the maple tree iterator for vdso
  mm: Remove the vma linked list
  mm/mmap: Drop range_has_overlap() function

 Documentation/core-api/index.rst              |     1 +
 Documentation/core-api/maple-tree.rst         |   508 +
 MAINTAINERS                                   |    12 +
 arch/arm64/kernel/vdso.c                      |     5 +-
 arch/parisc/kernel/cache.c                    |    15 +-
 arch/powerpc/kernel/vdso.c                    |     4 +-
 arch/powerpc/mm/book3s32/tlb.c                |     5 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |    15 +-
 arch/s390/configs/debug_defconfig             |     1 -
 arch/s390/kernel/vdso.c                       |     3 +-
 arch/s390/mm/gmap.c                           |     8 +-
 arch/um/kernel/tlb.c                          |    16 +-
 arch/x86/entry/vdso/vma.c                     |    12 +-
 arch/x86/kernel/tboot.c                       |     2 +-
 arch/xtensa/kernel/syscall.c                  |     4 +-
 drivers/firmware/efi/efi.c                    |     2 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |     3 +-
 drivers/misc/cxl/fault.c                      |     6 +-
 drivers/tee/optee/call.c                      |    15 +-
 drivers/xen/privcmd.c                         |     2 +-
 fs/binfmt_elf.c                               |     5 +-
 fs/coredump.c                                 |    13 +-
 fs/exec.c                                     |     9 +-
 fs/proc/base.c                                |     7 +-
 fs/proc/task_mmu.c                            |    48 +-
 fs/proc/task_nommu.c                          |    55 +-
 fs/userfaultfd.c                              |    34 +-
 include/linux/maple_tree.h                    |   485 +
 include/linux/mm.h                            |    54 +-
 include/linux/mm_types.h                      |    28 +-
 include/linux/mm_types_task.h                 |     5 -
 include/linux/sched.h                         |     1 -
 include/linux/sched/mm.h                      |     9 +
 include/linux/vm_event_item.h                 |     4 -
 include/linux/vmacache.h                      |    28 -
 include/linux/vmstat.h                        |     6 -
 include/trace/events/maple_tree.h             |   123 +
 include/trace/events/mmap.h                   |    71 +
 init/main.c                                   |     2 +
 ipc/shm.c                                     |    13 +-
 kernel/acct.c                                 |     8 +-
 kernel/bpf/task_iter.c                        |    10 +-
 kernel/debug/debug_core.c                     |    12 -
 kernel/events/core.c                          |     7 +-
 kernel/events/uprobes.c                       |    25 +-
 kernel/fork.c                                 |    70 +-
 kernel/sched/fair.c                           |    14 +-
 lib/Kconfig.debug                             |    15 +-
 lib/Makefile                                  |     3 +-
 lib/maple_tree.c                              |  6811 +++
 lib/test_maple_tree.c                         | 36996 ++++++++++++++++
 mm/Makefile                                   |     2 +-
 mm/damon/vaddr.c                              |     3 +-
 mm/debug.c                                    |    14 +-
 mm/gup.c                                      |     7 +-
 mm/huge_memory.c                              |     4 +-
 mm/init-mm.c                                  |     4 +-
 mm/internal.h                                 |    81 +-
 mm/khugepaged.c                               |    11 +-
 mm/ksm.c                                      |    26 +-
 mm/madvise.c                                  |     2 +-
 mm/memcontrol.c                               |     6 +-
 mm/memory.c                                   |    33 +-
 mm/mempolicy.c                                |    41 +-
 mm/mlock.c                                    |    21 +-
 mm/mmap.c                                     |  2094 +-
 mm/mprotect.c                                 |    13 +-
 mm/mremap.c                                   |    27 +-
 mm/msync.c                                    |     2 +-
 mm/nommu.c                                    |   120 +-
 mm/oom_kill.c                                 |     5 +-
 mm/pagewalk.c                                 |     2 +-
 mm/swapfile.c                                 |     9 +-
 mm/util.c                                     |    32 -
 mm/vmacache.c                                 |   117 -
 mm/vmstat.c                                   |     4 -
 tools/testing/radix-tree/.gitignore           |     2 +
 tools/testing/radix-tree/Makefile             |    13 +-
 tools/testing/radix-tree/generated/autoconf.h |     1 +
 tools/testing/radix-tree/linux.c              |   160 +-
 tools/testing/radix-tree/linux/kernel.h       |     1 +
 tools/testing/radix-tree/linux/maple_tree.h   |     7 +
 tools/testing/radix-tree/linux/slab.h         |     4 +
 tools/testing/radix-tree/maple.c              |    59 +
 .../radix-tree/trace/events/maple_tree.h      |     3 +
 85 files changed, 46908 insertions(+), 1632 deletions(-)
 create mode 100644 Documentation/core-api/maple-tree.rst
 create mode 100644 include/linux/maple_tree.h
 delete mode 100644 include/linux/vmacache.h
 create mode 100644 include/trace/events/maple_tree.h
 create mode 100644 lib/maple_tree.c
 create mode 100644 lib/test_maple_tree.c
 delete mode 100644 mm/vmacache.c
 create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
 create mode 100644 tools/testing/radix-tree/maple.c
 create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h