Message ID: 20220426150616.3937571-1-Liam.Howlett@oracle.com
Series: Introducing the Maple Tree
On Tue, 26 Apr 2022 15:06:19 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:

> Please replace the patches in your mglru-maple branch with this set. It should
> be a drop in replacement for my patch range with the fixes into these
> patches. Adding the preallocation to work around the fs-reclaim LOCKDEP
> issue caused enough changes to the patches to warrant a respin.

OK, thanks. I'll give these a bit of testing locally with a view to
having a run in -next soon.

It's all not looking very reviewed. Any thoughts on who we could
attempt to persuade to help out here?

> The last patch on the branch is still needed to fix vmscan after mglru
> is applied. ee4b1fc24f30 "mm/vmscan: Use VMA_ITERATOR in
> get_next_vma()"

OK, someone please send that at me when the next rev of mglru surfaces?
On Tue, 26 Apr 2022 15:06:19 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:

> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently. There are a number of places in the kernel

I think it would be helpful to expand on "a number of places".
Specifically which places?

> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface. The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct. The long term goal is to reduce
> or remove the mmap_sem contention.

"mmap_lock" ;)

> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes. With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses. The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.

Do we have any quantitative testing results?

What's the plan on utilizing this to further reduce mmap_lock contention?
On Tue, Apr 26, 2022 at 01:08:57PM -0700, Andrew Morton wrote:
> On Tue, 26 Apr 2022 15:06:19 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
>
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently. There are a number of places in the kernel
>
> I think it would be helpful to expand on "a number of places".
> Specifically which places?

The page cache would be a good place to use it once it has a few more
features to bring it up to par with the radix tree. I can go into more
detail if you want. In general, anywhere that's using the IDR/Radix
Tree/XArray would benefit. The radix tree has some pretty poor space
consumption properties, particularly for anyone using the cyclic
variants. Many users of the rbtree would have better performance if
converted to the maple tree.

Ultimately, it's going to be incumbent on people who know their own
chunk of the kernel to say "Oh, yes, I can use that data structure",
rather than on Liam to go around converting everybody's code for them.
* Andrew Morton <akpm@linux-foundation.org> [220426 16:09]:
> On Tue, 26 Apr 2022 15:06:19 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
>
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently. There are a number of places in the kernel
>
> I think it would be helpful to expand on "a number of places".
> Specifically which places?

Matthew answered this, but if you look for users of the interval tree
you will come across many users that do not have overlapping ranges.
The interval tree is being (ab)used because it is easier than the other
options even though it is not necessarily the best choice for the data
being stored. I don't want to be negative about the other options; they
are all really valuable and have their uses. I think where the maple
tree excels is the ease of use and the cache efficiency.

> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface. The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > or remove the mmap_sem contention.
>
> "mmap_lock" ;)

Ah, right. Thanks.

> > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > nodes. With the increased branching factor, it is significantly shorter than
> > the rbtree so it has fewer cache misses. The removal of the linked list
> > between subsequent entries also reduces the cache misses and the need to pull
> > in the previous and next VMA during many tree alterations.
>
> Do we have any quantitative testing results?

I was waiting for the mmtests sweep to complete before sending them but
I didn't want to hold up Yu Zhao's testing of the combined tree as it
has proved useful already. Please don't include the results in the
first commit as it wouldn't make much sense. If you intend to put them
in a commit message, please put them in the patch introducing the maple
tree.

The benchmarks are around the same as they have always been.

kernbench
                         rb5.18-rc2             mt5.18-rc2
Amean  user-2      862.24 (  0.00%)     861.45 *  0.09%*
Amean  syst-2      136.65 (  0.00%)     141.58 * -3.61%*
Amean  elsp-2      505.38 (  0.00%)     507.99 * -0.52%*
Amean  user-4      890.24 (  0.00%)     888.34 *  0.21%*
Amean  syst-4      140.64 (  0.00%)     145.32 * -3.33%*
Amean  elsp-4      264.34 (  0.00%)     265.76 * -0.54%*
Amean  user-8      952.30 (  0.00%)     947.57 *  0.50%*
Amean  syst-8      145.52 (  0.00%)     147.79 * -1.56%*
Amean  elsp-8      145.02 (  0.00%)     145.38 * -0.24%*
Amean  user-16     920.83 (  0.00%)     918.95 *  0.20%*
Amean  syst-16     135.37 (  0.00%)     138.99 * -2.67%*
Amean  elsp-16      75.03 (  0.00%)      75.25 * -0.29%*
Amean  user-32     970.98 (  0.00%)     969.01 *  0.20%*
Amean  syst-32     144.75 (  0.00%)     148.58 * -2.64%*
Amean  elsp-32      44.10 (  0.00%)      44.64 * -1.24%*
Amean  user-64    1062.19 (  0.00%)    1060.30 *  0.18%*
Amean  syst-64     154.24 (  0.00%)     157.58 * -2.17%*
Amean  elsp-64      28.88 (  0.00%)      29.10 * -0.76%*
Amean  user-128   1609.09 (  0.00%)    1612.19 * -0.19%*
Amean  syst-128    210.05 (  0.00%)     215.09 * -2.40%*
Amean  elsp-128     25.22 (  0.00%)      25.45 * -0.94%*
Amean  user-256   1767.37 (  0.00%)    1766.86 *  0.03%*
Amean  syst-256    215.99 (  0.00%)     221.56 * -2.58%*
Amean  elsp-256     25.20 (  0.00%)      25.33 * -0.54%*
Amean  user-288   1772.73 (  0.00%)    1772.35 *  0.02%*
Amean  syst-288    216.08 (  0.00%)     221.39 * -2.45%*
Amean  elsp-288     25.16 (  0.00%)      25.44 * -1.13%*

Increase in performance in the following micro-benchmarks in Hmean:
- context_switch1-processes +14.74% to 2.22%

Mixed results in the following micro-benchmarks in Hmean:
- malloc1-threads +34.95% to -30.06%
- malloc1-processes +2.73% to -23.65%
- page_fault3-threads +19.84% to -11.55%
- pthread_mutex1-threads +42.50% to -11.76%

Decrease in performance in the following micro-benchmarks in Hmean:
- brk1-processes -35.35% to -42.69%

The brk1-processes decrease is due to the test itself. Since the VMA
cannot be expanded, the test is actually inserting a new VMA.

> What's the plan on utilizing this to further reduce mmap_lock contention?

The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers. A single write operation will be
allowed at a time. A reader re-walks if stale data is encountered. VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct. I can go into more details if you want.
On Tue, Apr 26, 2022 at 03:06:19PM +0000, Liam Howlett wrote:
> Andrew,
>
> Please replace the patches in your mglru-maple branch with this set. It should
> be a drop in replacement for my patch range with the fixes into these
> patches. Adding the preallocation to work around the fs-reclaim LOCKDEP
> issue caused enough changes to the patches to warrant a respin.
>
> The last patch on the branch is still needed to fix vmscan after mglru
> is applied. ee4b1fc24f30 "mm/vmscan: Use VMA_ITERATOR in
> get_next_vma()"
>
> Here is the pretty cover letter you requested last time.
>
> ------------------------------------
>
> The maple tree is an RCU-safe range based B-tree designed to use modern
> processor cache efficiently. There are a number of places in the kernel
> that a non-overlapping range-based tree would be beneficial, especially
> one with a simple interface. The first user that is covered in this
> patch set is the vm_area_struct, where three data structures are
> replaced by the maple tree: the augmented rbtree, the vma cache, and the
> linked list of VMAs in the mm_struct. The long term goal is to reduce
> or remove the mmap_sem contention.
>
> The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> nodes. With the increased branching factor, it is significantly shorter than
> the rbtree so it has fewer cache misses. The removal of the linked list
> between subsequent entries also reduces the cache misses and the need to pull
> in the previous and next VMA during many tree alterations.
>
> This patch set is based on v5.18-rc2
>
> git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220426
>
> v8 changes:
> - Added preallocations before any potential edits to the tree when holding the
>   i_mmap_lock to avoid fs-reclaim issues on extreme memory pressure.
> - Fixed issue in mempolicy mas_for_each() loop.
> - Moved static definitions inside ifdef for DEBUG_MAPLE
> - Fixed compile warnings reported by build bots
> - Moved mas_dfs_preorder() to testing code
> - Changed __vma_adjust() to record the highest vma in case 6 instead of
>   finding it twice.
> - Fixed locking issue in exit_mmap()
> - Fixed up from/s-o-b ordering

Running some syscall fuzzer would trigger a crash.

 BUG: KASAN: use-after-free in mas_find
  ma_dead_node at lib/maple_tree.c:532
  (inlined by) mas_next_entry at lib/maple_tree.c:4637
  (inlined by) mas_find at lib/maple_tree.c:5869
 Read of size 8 at addr ffff88811c5e9c00 by task trinity-c0/1351

 CPU: 5 PID: 1351 Comm: trinity-c0 Not tainted 5.18.0-rc4-next-20220427 #3
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-5.fc35 04/01/2014
 Call Trace:
  <TASK>
  dump_stack_lvl
  print_address_description.constprop.0.cold
  print_report.cold
  kasan_report
  mas_find
  apply_mlockall_flags
  __do_sys_munlockall
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
 RIP: 0033:0x7f3105611a3d
 Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 738
 RSP: 002b:00007ffeefae7c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000098
 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f3105611a3d
 RDX: 00000000ffff0000 RSI: ffffffffffffafff RDI: 0000009357075a39
 RBP: 00007f3103f94000 R08: 00000000fffffff7 R09: 000000000000006b
 R10: fffffffffffffffd R11: 0000000000000246 R12: 0000000000000098
 R13: 00007f31054f06c0 R14: 00007f3103f94058 R15: 00007f3103f94000
  </TASK>

 Allocated by task 1351:
  kasan_save_stack
  __kasan_slab_alloc
  kmem_cache_alloc_bulk
  mas_alloc_nodes
  mas_preallocate
  __vma_adjust
  __split_vma
  do_mas_align_munmap.constprop.0
  __vm_munmap
  __x64_sys_munmap
  do_syscall_64
  entry_SYSCALL_64_after_hwframe

 Freed by task 1351:
  kasan_save_stack
  kasan_set_track
  kasan_set_free_info
  ____kasan_slab_free
  slab_free_freelist_hook
  kmem_cache_free_bulk.part.0
  mas_destroy
  mas_store_prealloc
  __vma_adjust
  vma_merge
  mlock_fixup
  apply_mlockall_flags
  __do_sys_munlockall
  do_syscall_64
  entry_SYSCALL_64_after_hwframe

 The buggy address belongs to the object at ffff88811c5e9c00
  which belongs to the cache maple_node of size 256
 The buggy address is located 0 bytes inside of
  256-byte region [ffff88811c5e9c00, ffff88811c5e9d00)

 The buggy address belongs to the physical page:
 page:ffffea0004717a00 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x11c5e8
 head:ffffea0004717a00 order:2 compound_mapcount:0 compound_pincount:0
 flags: 0x200000000010200(slab|head|node=0|zone=2)
 raw: 0200000000010200 ffffea0004f8da08 ffffea0004676a08 ffff888100050940
 raw: 0000000000000000 0000000000150015 00000001ffffffff 0000000000000000
 page dumped because: kasan: bad access detected

 Memory state around the buggy address:
  ffff88811c5e9b00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff88811c5e9b80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 >ffff88811c5e9c00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                    ^
  ffff88811c5e9c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff88811c5e9d00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ==================================================================
 Disabling lock debugging due to kernel taint
 general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN PTI
 KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
 CPU: 5 PID: 1351 Comm: trinity-c0 Tainted: G    B             5.18.0-rc4-next-20220427 #3
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-5.fc35 04/01/2014
 RIP: 0010:mas_ascend
 Code: cb 18 41 89 d3 48 83 cb 04 41 83 f3 01 40 84 ed 40 0f 95 c7 41 20 fb 74 26 83 ee 01 48 63 f6 48 8d 14 f1 48 89 d6 48 c1 ee 03 <42> 80 3c 3e 00 0f 854
 RSP: 0018:ffffc9000279fc78 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff888106dc3504 RCX: 0000000000000000
 RDX: 0000000000000008 RSI: 0000000000000001 RDI: ffff88811d42e201
 RBP: 0000000000000002 R08: 0000000000000000 R09: ffff88810f723300
 R10: ffffffffffffffff R11: 0000000000000001 R12: ffffc9000279fe78
 R13: 0000000000000000 R14: ffff888106dc3500 R15: dffffc0000000000
 FS:  00007f31054f0740(0000) GS:ffff8882d2e80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f3104db847c CR3: 0000000127e80002 CR4: 0000000000370ee0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Call Trace:
  <TASK>
  mas_next_node
  mas_find
  apply_mlockall_flags
  __do_sys_munlockall
  do_syscall_64
  entry_SYSCALL_64_after_hwframe
 RIP: 0033:0x7f3105611a3d
 Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 738
 RSP: 002b:00007ffeefae7c68 EFLAGS: 00000246 ORIG_RAX: 0000000000000098
 RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f3105611a3d
 RDX: 00000000ffff0000 RSI: ffffffffffffafff RDI: 0000009357075a39
 RBP: 00007f3103f94000 R08: 00000000fffffff7 R09: 000000000000006b
 R10: fffffffffffffffd R11: 0000000000000246 R12: 0000000000000098
 R13: 00007f31054f06c0 R14: 00007f3103f94058 R15: 00007f3103f94000
  </TASK>
 Modules linked in:
 ---[ end trace 0000000000000000 ]---
 RIP: 0010:mas_ascend
 Code: cb 18 41 89 d3 48 83 cb 04 41 83 f3 01 40 84 ed 40 0f 95 c7 41 20 fb 74 26 83 ee 01 48 63 f6 48 8d 14 f1 48 89 d6 48 c1 ee 03 <42> 80 3c 3e 00 0f 854
 RSP: 0018:ffffc9000279fc78 EFLAGS: 00010202
 RAX: 0000000000000000 RBX: ffff888106dc3504 RCX: 0000000000000000
 RDX: 0000000000000008 RSI: 0000000000000001 RDI: ffff88811d42e201
 RBP: 0000000000000002 R08: 0000000000000000 R09: ffff88810f723300
 R10: ffffffffffffffff R11: 0000000000000001 R12: ffffc9000279fe78
 R13: 0000000000000000 R14: ffff888106dc3500 R15: dffffc0000000000
 FS:  00007f31054f0740(0000) GS:ffff8882d2e80000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00007f3104db847c CR3: 0000000127e80002 CR4: 0000000000370ee0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 Kernel panic - not syncing: Fatal exception
 Kernel Offset: 0x15400000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
* Qian Cai <quic_qiancai@quicinc.com> [220427 12:10]:
> On Tue, Apr 26, 2022 at 03:06:19PM +0000, Liam Howlett wrote:
> > Andrew,
> >
> > Please replace the patches in your mglru-maple branch with this set. It should
> > be a drop in replacement for my patch range with the fixes into these
> > patches. Adding the preallocation to work around the fs-reclaim LOCKDEP
> > issue caused enough changes to the patches to warrant a respin.
> >
> > The last patch on the branch is still needed to fix vmscan after mglru
> > is applied. ee4b1fc24f30 "mm/vmscan: Use VMA_ITERATOR in
> > get_next_vma()"
> >
> > Here is the pretty cover letter you requested last time.
> >
> > ------------------------------------
> >
> > The maple tree is an RCU-safe range based B-tree designed to use modern
> > processor cache efficiently. There are a number of places in the kernel
> > that a non-overlapping range-based tree would be beneficial, especially
> > one with a simple interface. The first user that is covered in this
> > patch set is the vm_area_struct, where three data structures are
> > replaced by the maple tree: the augmented rbtree, the vma cache, and the
> > linked list of VMAs in the mm_struct. The long term goal is to reduce
> > or remove the mmap_sem contention.
> >
> > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > nodes. With the increased branching factor, it is significantly shorter than
> > the rbtree so it has fewer cache misses. The removal of the linked list
> > between subsequent entries also reduces the cache misses and the need to pull
> > in the previous and next VMA during many tree alterations.
> >
> > This patch set is based on v5.18-rc2
> >
> > git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220426
> >
> > v8 changes:
> > - Added preallocations before any potential edits to the tree when holding the
> >   i_mmap_lock to avoid fs-reclaim issues on extreme memory pressure.
> > - Fixed issue in mempolicy mas_for_each() loop.
> > - Moved static definitions inside ifdef for DEBUG_MAPLE
> > - Fixed compile warnings reported by build bots
> > - Moved mas_dfs_preorder() to testing code
> > - Changed __vma_adjust() to record the highest vma in case 6 instead of
> >   finding it twice.
> > - Fixed locking issue in exit_mmap()
> > - Fixed up from/s-o-b ordering
>
> Running some syscall fuzzer would trigger a crash.
>
> BUG: KASAN: use-after-free in mas_find
>  ma_dead_node at lib/maple_tree.c:532
>  (inlined by) mas_next_entry at lib/maple_tree.c:4637
>  (inlined by) mas_find at lib/maple_tree.c:5869
> Read of size 8 at addr ffff88811c5e9c00 by task trinity-c0/1351
>
> CPU: 5 PID: 1351 Comm: trinity-c0 Not tainted 5.18.0-rc4-next-20220427 #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-5.fc35 04/01/2014
> Call Trace:
> <TASK>
> dump_stack_lvl
> print_address_description.constprop.0.cold
> print_report.cold
> kasan_report
> mas_find
> apply_mlockall_flags

Thanks. This is indeed an issue with 0d43186b36c1 (mm/mlock: use vma
iterator and instead of vma linked list)

Andrew, please include this patch as a fix.

Thanks,
Liam
On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:

> > > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > > nodes. With the increased branching factor, it is significantly shorter than
> > > the rbtree so it has fewer cache misses. The removal of the linked list
> > > between subsequent entries also reduces the cache misses and the need to pull
> > > in the previous and next VMA during many tree alterations.
> >
> > Do we have any quantitative testing results?
>
> I was waiting for the mmtests sweep to complete before sending them but
> I didn't want to hold up Yu Zhao's testing of the combined tree as it
> has proved useful already. Please don't include the results in the first
> commit as it wouldn't make much sense. If you intend to put them in a
> commit message, please put them in the patch introducing the maple tree.

I did that.

> The benchmarks are around the same as they have always been.

So it's presently a wash.

That makes "the plan" (below) really critical, otherwise there seems
little point in merging this code at this time?

Please send me many very soothing words about how confident we should
be that the plan will be implemented and that it shall be good?

> > > What's the plan on utilizing this to further reduce mmap_lock contention?
> >
> > The plan is to get to the point where we use the maple tree in RCU mode.
> > Readers will not block for writers. A single write operation will be
> > allowed at a time. A reader re-walks if stale data is encountered. VMAs
> > would be RCU enabled and this mode would be entered once multiple tasks
> > are using the mm_struct. I can go into more details if you want.

Uh, that's very important information. It's really the whole point of
the patchset! I added this to the [0/n] changelog.
On Wed, Apr 27, 2022 at 10:33:31AM -0700, Andrew Morton wrote:
> On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > The benchmarks are around the same as they have always been.
>
> So it's presently a wash.
>
> That makes "the plan" (below) really critical, otherwise there seems
> little point in merging this code at this time?
>
> Please send me many very soothing words about how confident we should
> be that the plan will be implemented and that it shall be good?

Yes, performance-wise it's a wash. However, Davidlohr was very
impressed that it was a wash because we're actually getting rid of
three data structures here; the linked list, the rbtree and the
vmacache. His opinion was that we should push the maple tree in now,
in advance of the future RCU uses.

We also have other users waiting in the wings. Dave Howells has
something he's working on that uses the maple tree directly. I have a
couple of XArray users that are using it inappropriately that I want
to convert ... I just didn't want to do that work before all this
lands.

The current LSFMM schedule has very many words about the Maple tree
scheduled for 13:30-15:00 on Monday. Hopefully we'll have a better
idea after that how confident we are that RCU VMA walking is going to
work.
On Wed, Apr 27, 2022 at 04:51:50PM +0000, Liam Howlett wrote:
> Thanks. This is indeed an issue with 0d43186b36c1 (mm/mlock: use vma
> iterator and instead of vma linked list)
>
> Andrew, Please include this patch as a fix.

Even with the patch applied, there are still thousands of memory leak
reports from kmemleak after booting.

unreferenced object 0xffff400259bd6d00 (size 256):
  comm "multipathd", pid 2577, jiffies 4294915929 (age 2370.384s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
     slab_post_alloc_hook
     kmem_cache_alloc_bulk
     mas_alloc_nodes
     mt_alloc_bulk at lib/maple_tree.c:151
     (inlined by) mas_alloc_nodes at lib/maple_tree.c:1244
     mas_preallocate
     __vma_adjust
     shift_arg_pages
     setup_arg_pages
     load_elf_binary
     search_binary_handler
     exec_binprm
     bprm_execve
     do_execveat_common.isra.0
     __arm64_sys_execve
     invoke_syscall
     el0_svc_common.constprop.0
     do_el0_svc
* Qian Cai <quic_qiancai@quicinc.com> [220427 16:22]:
> On Wed, Apr 27, 2022 at 04:51:50PM +0000, Liam Howlett wrote:
> > Thanks. This is indeed an issue with 0d43186b36c1 (mm/mlock: use vma
> > iterator and instead of vma linked list)
> >
> > Andrew, Please include this patch as a fix.
>
> Even with the patch applied, there are still thousands of memory leaks
> reports from kmemleak after booting.

Thank you for finding this.

> unreferenced object 0xffff400259bd6d00 (size 256):
>   comm "multipathd", pid 2577, jiffies 4294915929 (age 2370.384s)
>   hex dump (first 32 bytes):
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>     00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
>   backtrace:
>      slab_post_alloc_hook
>      kmem_cache_alloc_bulk
>      mas_alloc_nodes
>      mt_alloc_bulk at lib/maple_tree.c:151
>      (inlined by) mas_alloc_nodes at lib/maple_tree.c:1244
>      mas_preallocate
>      __vma_adjust
>      shift_arg_pages
>      setup_arg_pages
>      load_elf_binary
>      search_binary_handler
>      exec_binprm
>      bprm_execve
>      do_execveat_common.isra.0
>      __arm64_sys_execve
>      invoke_syscall
>      el0_svc_common.constprop.0
>      do_el0_svc

__vma_adjust is way too complicated. This patch should fix the leak.

Andrew, please add this patch to "mm: start tracking VMAs with maple tree"

Thanks,
Liam
* Andrew Morton <akpm@linux-foundation.org> [220427 13:33]:
> On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
>
> > > > The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
> > > > nodes. With the increased branching factor, it is significantly shorter than
> > > > the rbtree so it has fewer cache misses. The removal of the linked list
> > > > between subsequent entries also reduces the cache misses and the need to pull
> > > > in the previous and next VMA during many tree alterations.
> > >
> > > Do we have any quantitative testing results?
> >
> > I was waiting for the mmtests sweep to complete before sending them but
> > I didn't want to hold up Yu Zhao's testing of the combined tree as it
> > has proved useful already. Please don't include the results in the first
> > commit as it wouldn't make much sense. If you intend to put them in a
> > commit message, please put them in the patch introducing the maple tree.
>
> I did that.
>
> > The benchmarks are around the same as they have always been.
>
> So it's presently a wash.
>
> That makes "the plan" (below) really critical, otherwise there seems
> little point in merging this code at this time?

There are a number of reasons to merge at this time. First, as Matthew
said, we have more than one user so having the tree upstream would
obviously help them out. Second, we can make sure this much (maple
tree + vma tracking) works for everyone before we enable VMA RCU and
play even more with the locking scenarios. The third reason is to get
more community input into "the plan" to ensure the maple tree is
beneficial in the most common execution paths and to capture corner
cases.

> Please send me many very soothing words about how confident we should
> be that the plan will be implemented and that it shall be good?

Tea, warmth, fresh bread. As we know, VMAs are rarely written and
mostly read. This plan removes the majority of the slowdown on
readers. The only slowdown is when a reader conflicts with a writer.
We are actually slower on writes - we allocate nodes for the tree vs
having it embedded in the VMA itself and we do more work in
calculating the gaps - but we are actually fast enough to hide that on
the read side. Once the writers stop blocking the readers it shall be
good. We need this step to get to using the maple tree in RCU mode to
make this happen.

> > > What's the plan on utilizing this to further reduce mmap_lock contention?
> >
> > The plan is to get to the point where we use the maple tree in RCU mode.
> > Readers will not block for writers. A single write operation will be
> > allowed at a time. A reader re-walks if stale data is encountered. VMAs
> > would be RCU enabled and this mode would be entered once multiple tasks
> > are using the mm_struct. I can go into more details if you want.
>
> Uh, that's very important information. It's really the whole point
> of the patchset! I added this to the [0/n] changelog.

Yes, but that's not what the patch set does today. I didn't want to
include The Plan until it exists. I'd like to think that what we have
here is worthwhile on its own and a good start - but a start to
something big. It's hard enough to get people to look at 70 patches,
doubly so for an RFC of 70. The interest from others in using a tree
with an easy interface seems like a win as well. I also have some side
goals such as the removal of __vma_adjust() for code readability.

Thanks,
Liam
On Wed, 27 Apr 2022, Matthew Wilcox wrote:

> On Wed, Apr 27, 2022 at 10:33:31AM -0700, Andrew Morton wrote:
> > On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > > The benchmarks are around the same as they have always been.
> >
> > So it's presently a wash.
> >
> > That makes "the plan" (below) really critical, otherwise there seems
> > little point in merging this code at this time?
> >
> > Please send me many very soothing words about how confident we should
> > be that the plan will be implemented and that it shall be good?
>
> Yes, performance-wise it's a wash. However, Davidlohr was very
> impressed that it was a wash because we're actually getting rid of three
> data structures here; the linked list, the rbtree and the vmacache.
> His opinion was that we should push the maple tree in now, in advance
> of the future RCU uses.

Yes, I like the maple tree, and at this stage I don't think we can ask
for more from this series wrt the MM - albeit there seem to still be
some folks reporting breakage. Fundamentally, I see Liam's work to
(re)move complexity out of the MM (not to say that the actual maple
tree is not complex) by consolidating the three complementary data
structures as very much worth it, considering performance does not
take a hit. This was very much a turn off with the range locking
approach, which in the worst case scenario incurred prohibitive
overhead. Also, as Liam and Matthew have mentioned, RCU opens up a lot
of nice performance opportunities, and in addition academia[1] has
shown outstanding scalability of address spaces with the foundation of
replacing the locked rbtree with RCU aware trees.

[1] https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf

Thanks,
Davidlohr
On Sun, 1 May 2022 13:26:34 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:

> On Wed, 27 Apr 2022, Matthew Wilcox wrote:
>
> > On Wed, Apr 27, 2022 at 10:33:31AM -0700, Andrew Morton wrote:
> > > On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > > > The benchmarks are around the same as they have always been.
> > >
> > > So it's presently a wash.
> > >
> > > That makes "the plan" (below) really critical, otherwise there seems
> > > little point in merging this code at this time?
> > >
> > > Please send me many very soothing words about how confident we should
> > > be that the plan will be implemented and that it shall be good?
> >
> > Yes, performance-wise it's a wash. However, Davidlohr was very
> > impressed that it was a wash because we're actually getting rid of three
> > data structures here; the linked list, the rbtree and the vmacache.
> > His opinion was that we should push the maple tree in now, in advance
> > of the future RCU uses.
>
> Yes I like the maple tree, and at this stage I don't think we can ask
> for more from this series wrt the MM - albeit there seems to still be
> some folks reporting breakage. Fundamentally I see Liam's work to (re)move
> complexity out of the MM (not to say that the actual maple tree is not
> complex) by consolidating the three complementary data structures very
> much worth it considering performance does not take a hit. This was
> very much a turn off with the range locking approach, which worst case
> scenario incurred prohibitive overhead. Also as Liam and Matthew
> have mentioned, RCU opens up a lot of nice performance opportunities,
> and in addition academia[1] has shown outstanding scalability of address
> spaces with the foundation of replacing the locked rbtree with RCU
> aware trees.

Thanks. That sounded like a wordy acked-by to me? :)

Liam, I think the above is useful background for the [0/N].

> [1] https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf

As is that. The paper seems shockingly relevant. Do we know the
authors or is it a cosmic coincidence?
* Andrew Morton <akpm@linux-foundation.org> [220501 19:56]:
> On Sun, 1 May 2022 13:26:34 -0700 Davidlohr Bueso <dave@stgolabs.net> wrote:
>
> > On Wed, 27 Apr 2022, Matthew Wilcox wrote:
> >
> > > On Wed, Apr 27, 2022 at 10:33:31AM -0700, Andrew Morton wrote:
> > > > On Wed, 27 Apr 2022 14:08:39 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > > > > The benchmarks are around the same as they have always been.
> > > >
> > > > So it's presently a wash.
> > > >
> > > > That makes "the plan" (below) really critical, otherwise there seems
> > > > little point in merging this code at this time?
> > > >
> > > > Please send me many very soothing words about how confident we should
> > > > be that the plan will be implemented and that it shall be good?
> > >
> > > Yes, performance-wise it's a wash. However, Davidlohr was very
> > > impressed that it was a wash because we're actually getting rid of three
> > > data structures here; the linked list, the rbtree and the vmacache.
> > > His opinion was that we should push the maple tree in now, in advance
> > > of the future RCU uses.
> >
> > Yes I like the maple tree, and at this stage I don't think we can ask
> > for more from this series wrt the MM - albeit there seems to still be
> > some folks reporting breakage. Fundamentally I see Liam's work to (re)move
> > complexity out of the MM (not to say that the actual maple tree is not
> > complex) by consolidating the three complementary data structures very
> > much worth it considering performance does not take a hit. This was
> > very much a turn off with the range locking approach, which worst case
> > scenario incurred prohibitive overhead. Also as Liam and Matthew
> > have mentioned, RCU opens up a lot of nice performance opportunities,
> > and in addition academia[1] has shown outstanding scalability of address
> > spaces with the foundation of replacing the locked rbtree with RCU
> > aware trees.
>
> Thanks. That sounded like a wordy acked-by to me? :)
>
> Liam, I think the above is useful background for the [0/N].
>
> > [1] https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf
>
> As is that. The paper seems shockingly relevant. Do we know the
> authors or is it a cosmic coincidence?

Cosmic coincidence. We designed our tree with the intention of solving
the hardest problem first. Upon settling on a b-tree variant and a
rough outline, we researched range-based b-trees and RCU b-trees and
did find that article. So it was nice to find reassurance that we were
on the right path, but our design choice of using ranges made that
paper unusable for us.

Thanks,
Liam