
[v11,00/69] Introducing the Maple Tree

Message ID 20220717024615.2106835-1-Liam.Howlett@oracle.com (mailing list archive)

Message

Liam R. Howlett July 17, 2022, 2:46 a.m. UTC
Hello,

This is v10 plus fixes and a few cleanups.  I used the mm-unstable
branch and patched in place.

git: https://github.com/oracle/linux-uek/tree/howlett/maple/20220716

Patch series "Introducing the Maple Tree".

The maple tree is an RCU-safe range-based B-tree designed to use modern
processor cache efficiently.  There are a number of places in the kernel
that a non-overlapping range-based tree would be beneficial, especially
one with a simple interface.  If you use an rbtree with other data
structures to improve performance or an interval tree to track
non-overlapping ranges, then this is for you.

The tree has a branching factor of 10 for non-leaf nodes and 16 for leaf
nodes.  With the increased branching factor, it is significantly shorter
than the rbtree so it has fewer cache misses.  The removal of the linked
list between subsequent entries also reduces the cache misses and the need
to pull in the previous and next VMA during many tree alterations.

The first user that is covered in this patch set is the vm_area_struct,
where three data structures are replaced by the maple tree: the augmented
rbtree, the vma cache, and the linked list of VMAs in the mm_struct.  The
long term goal is to reduce or remove the mmap_lock contention.

The plan is to get to the point where we use the maple tree in RCU mode.
Readers will not block for writers.  A single write operation will be
allowed at a time.  A reader re-walks if stale data is encountered.  VMAs
would be RCU enabled and this mode would be entered once multiple tasks
are using the mm_struct.

Davidlohr said:

: Yes I like the maple tree, and at this stage I don't think we can ask for
: more from this series wrt the MM - albeit there seems to still be some
: folks reporting breakage.  Fundamentally I see Liam's work to (re)move
: complexity out of the MM (not to say that the actual maple tree is not
: complex) by consolidating the three complementary data structures very
: much worth it considering performance does not take a hit.  This was very
: much a turn off with the range locking approach, which worst case scenario
: incurred prohibitive overhead.  Also as Liam and Matthew have
: mentioned, RCU opens up a lot of nice performance opportunities, and in
: addition academia[1] has shown outstanding scalability of address spaces
: with the foundation of replacing the locked rbtree with RCU aware trees.

Similar work was discovered in the academic press:

	https://pdos.csail.mit.edu/papers/rcuvm:asplos12.pdf

Sheer coincidence.  We designed our tree with the intention of solving the
hardest problem first.  Upon settling on a B-tree variant and a rough
outline, we researched range-based B-trees and RCU B-trees and did find
that article.  It was nice to find reassurance that we were on the
right path, but our design choice of using ranges made that paper unusable
for us.

Changes:
 - maple_tree: Fix underflow in mas_spanning_rebalance() - Thanks Yu
   Zhao
 - maple_tree: Fix mas_spanning_rebalance() corner case - Thanks Yu Zhao
 - mmap: Reorder validate_mm_mt() checks
 - VMA iterator: Add underscores and parentheses - Thanks David
   Hildenbrand
 - maple_tree: Add underscores and parentheses - Thanks David
   Hildenbrand
 - maple_tree: Fix sparse reported issues
 - mmap: Fix locking issues in vma_expand - Thanks Hugh Dickins
 - mmap: Fix return on maple tree expand fail in brk() - Thanks Hugh
   Dickins
 - maple_tree: Fix mas_empty_area_rev() search exhaustion at root node -
   Thanks Alexander Gordeev
 - maple_tree: Fix out of bounds access on mas_wr_node_walk() - Thanks
   Yu Zhao
 - mmap: Fix __vma_adjust() issue on memory failure
 - maple_tree: Fix stale data copy in mas_wr_node_store()
 - mmap: Change comments to be more specific - Thanks David Hildenbrand
 - mmap: Use PHYS_PFN() where necessary - Thanks David Hildenbrand
 - Removed %px use in patches that only existed in the series
 - Fixed a few checkpatch reports.
 - Added necessary tests to cover all bugs found.

v10: https://lore.kernel.org/linux-mm/20220621204632.3370049-1-Liam.Howlett@oracle.com/
v9: https://lore.kernel.org/lkml/20220504010716.661115-1-Liam.Howlett@oracle.com/
...and
https://lore.kernel.org/lkml/20220504011215.661968-1-Liam.Howlett@oracle.com/

v8: https://lore.kernel.org/lkml/20220426150616.3937571-1-Liam.Howlett@oracle.com/
v7: https://lore.kernel.org/linux-mm/20220404143501.2016403-8-Liam.Howlett@oracle.com/
v6: https://lore.kernel.org/linux-mm/20220215143728.3810954-1-Liam.Howlett@oracle.com/
v5: https://lore.kernel.org/linux-mm/20220202024137.2516438-1-Liam.Howlett@oracle.com/
v4: https://lore.kernel.org/linux-mm/20211201142918.921493-1-Liam.Howlett@oracle.com/
v3: https://lore.kernel.org/linux-mm/20211005012959.1110504-1-Liam.Howlett@oracle.com/
v2: https://lore.kernel.org/linux-mm/20210817154651.1570984-1-Liam.Howlett@oracle.com/
v1: https://lore.kernel.org/linux-mm/20210428153542.2814175-1-Liam.Howlett@Oracle.com/


Liam R. Howlett (44):
  Maple Tree: add new data structure
  radix tree test suite: add pr_err define
  radix tree test suite: add kmem_cache_set_non_kernel()
  radix tree test suite: add allocation counts and size to kmem_cache
  radix tree test suite: add support for slab bulk APIs
  radix tree test suite: add lockdep_is_held to header
  lib/test_maple_tree: add testing for maple tree
  mm: start tracking VMAs with maple tree
  mm/mmap: use the maple tree in find_vma() instead of the rbtree.
  mm/mmap: use the maple tree for find_vma_prev() instead of the rbtree
  mm/mmap: use maple tree for unmapped_area{_topdown}
  kernel/fork: use maple tree for dup_mmap() during forking
  damon: convert __damon_va_three_regions to use the VMA iterator
  mm: remove rb tree.
  mmap: change zeroing of maple tree in __vma_adjust()
  xen: use vma_lookup() in privcmd_ioctl_mmap()
  mm: optimize find_exact_vma() to use vma_lookup()
  mm/khugepaged: optimize collapse_pte_mapped_thp() by using
    vma_lookup()
  mm/mmap: change do_brk_flags() to expand existing VMA and add
    do_brk_munmap()
  mm: use maple tree operations for find_vma_intersection()
  mm/mmap: use advanced maple tree API for mmap_region()
  mm: remove vmacache
  mm: convert vma_lookup() to use mtree_load()
  mm/mmap: move mmap_region() below do_munmap()
  mm/mmap: reorganize munmap to use maple states
  mm/mmap: change do_brk_munmap() to use do_mas_align_munmap()
  arm64: Change elfcore for_each_mte_vma() to use VMA iterator
  fs/proc/base: use maple tree iterators in place of linked list
  userfaultfd: use maple tree iterator to iterate VMAs
  ipc/shm: use VMA iterator instead of linked list
  bpf: remove VMA linked list
  mm/gup: use maple tree navigation instead of linked list
  mm/madvise: use vma_find() instead of vma linked list
  mm/memcontrol: stop using mm->highest_vm_end
  mm/mempolicy: use vma iterator & maple state instead of vma linked
    list
  mm/mprotect: use maple tree navigation instead of vma linked list
  mm/mremap: use vma_find_intersection() instead of vma linked list
  mm/msync: use vma_find() instead of vma linked list
  mm/oom_kill: use maple tree iterators instead of vma linked list
  mm/swapfile: use vma iterator instead of vma linked list
  riscv: use vma iterator for vdso
  mm: remove the vma linked list
  mm/mmap: drop range_has_overlap() function
  mm/mmap.c: pass in mapping to __vma_link_file()

Matthew Wilcox (Oracle) (25):
  mm: add VMA iterator
  mmap: use the VMA iterator in count_vma_pages_range()
  proc: remove VMA rbtree use from nommu
  arm64: remove mmap linked list from vdso
  parisc: remove mmap linked list from cache handling
  powerpc: remove mmap linked list walks
  s390: remove vma linked list walks
  x86: remove vma linked list walks
  xtensa: remove vma linked list walks
  cxl: remove vma linked list walk
  optee: remove vma linked list walk
  um: remove vma linked list walk
  coredump: remove vma linked list walk
  exec: use VMA iterator instead of linked list
  fs/proc/task_mmu: stop using linked list and highest_vm_end
  acct: use VMA iterator instead of linked list
  perf: use VMA iterator
  sched: use maple tree iterator to walk VMAs
  fork: use VMA iterator
  mm/khugepaged: stop using vma linked list
  mm/ksm: use vma iterators instead of vma linked list
  mm/mlock: use vma iterator and maple state instead of vma linked list
  mm/pagewalk: use vma_find() instead of vma linked list
  i915: use the VMA iterator
  nommu: remove uses of VMA linked list

 Documentation/core-api/index.rst              |     1 +
 Documentation/core-api/maple_tree.rst         |   217 +
 MAINTAINERS                                   |    12 +
 arch/arm64/kernel/elfcore.c                   |    16 +-
 arch/arm64/kernel/vdso.c                      |     3 +-
 arch/parisc/kernel/cache.c                    |     9 +-
 arch/powerpc/kernel/vdso.c                    |     6 +-
 arch/powerpc/mm/book3s32/tlb.c                |    11 +-
 arch/powerpc/mm/book3s64/subpage_prot.c       |    13 +-
 arch/riscv/kernel/vdso.c                      |     3 +-
 arch/s390/kernel/vdso.c                       |     3 +-
 arch/s390/mm/gmap.c                           |     6 +-
 arch/um/kernel/tlb.c                          |    14 +-
 arch/x86/entry/vdso/vma.c                     |     9 +-
 arch/x86/kernel/tboot.c                       |     2 +-
 arch/xtensa/kernel/syscall.c                  |    18 +-
 drivers/firmware/efi/efi.c                    |     2 +-
 drivers/gpu/drm/i915/gem/i915_gem_userptr.c   |    14 +-
 drivers/misc/cxl/fault.c                      |    45 +-
 drivers/tee/optee/call.c                      |    18 +-
 drivers/xen/privcmd.c                         |     2 +-
 fs/coredump.c                                 |    34 +-
 fs/exec.c                                     |    12 +-
 fs/proc/base.c                                |     5 +-
 fs/proc/internal.h                            |     2 +-
 fs/proc/task_mmu.c                            |    74 +-
 fs/proc/task_nommu.c                          |    45 +-
 fs/userfaultfd.c                              |    62 +-
 include/linux/maple_tree.h                    |   686 +
 include/linux/mm.h                            |    78 +-
 include/linux/mm_types.h                      |    43 +-
 include/linux/mm_types_task.h                 |    12 -
 include/linux/sched.h                         |     1 -
 include/linux/userfaultfd_k.h                 |     7 +-
 include/linux/vm_event_item.h                 |     4 -
 include/linux/vmacache.h                      |    28 -
 include/linux/vmstat.h                        |     6 -
 include/trace/events/maple_tree.h             |   123 +
 include/trace/events/mmap.h                   |    73 +
 init/main.c                                   |     2 +
 ipc/shm.c                                     |    21 +-
 kernel/acct.c                                 |    11 +-
 kernel/bpf/task_iter.c                        |    10 +-
 kernel/debug/debug_core.c                     |    12 -
 kernel/events/core.c                          |     3 +-
 kernel/events/uprobes.c                       |     9 +-
 kernel/fork.c                                 |    57 +-
 kernel/sched/fair.c                           |    10 +-
 lib/Kconfig.debug                             |    17 +-
 lib/Makefile                                  |     3 +-
 lib/maple_tree.c                              |  7102 +++
 lib/test_maple_tree.c                         | 38206 ++++++++++++++++
 mm/Makefile                                   |     2 +-
 mm/damon/vaddr-test.h                         |    36 +-
 mm/damon/vaddr.c                              |    53 +-
 mm/debug.c                                    |    14 +-
 mm/gup.c                                      |     7 +-
 mm/huge_memory.c                              |     4 +-
 mm/init-mm.c                                  |     4 +-
 mm/internal.h                                 |     8 +-
 mm/khugepaged.c                               |    13 +-
 mm/ksm.c                                      |    18 +-
 mm/madvise.c                                  |     2 +-
 mm/memcontrol.c                               |     6 +-
 mm/memory.c                                   |    33 +-
 mm/mempolicy.c                                |    56 +-
 mm/mlock.c                                    |    35 +-
 mm/mmap.c                                     |  2206 +-
 mm/mprotect.c                                 |     7 +-
 mm/mremap.c                                   |    22 +-
 mm/msync.c                                    |     2 +-
 mm/nommu.c                                    |   249 +-
 mm/oom_kill.c                                 |     3 +-
 mm/pagewalk.c                                 |     2 +-
 mm/swapfile.c                                 |     4 +-
 mm/util.c                                     |    32 -
 mm/vmacache.c                                 |   117 -
 mm/vmstat.c                                   |     4 -
 tools/include/linux/slab.h                    |     4 +
 tools/testing/radix-tree/.gitignore           |     2 +
 tools/testing/radix-tree/Makefile             |     9 +-
 tools/testing/radix-tree/generated/autoconf.h |     1 +
 tools/testing/radix-tree/linux.c              |   160 +-
 tools/testing/radix-tree/linux/kernel.h       |     1 +
 tools/testing/radix-tree/linux/lockdep.h      |     2 +
 tools/testing/radix-tree/linux/maple_tree.h   |     7 +
 tools/testing/radix-tree/maple.c              |    59 +
 .../radix-tree/trace/events/maple_tree.h      |     5 +
 88 files changed, 48494 insertions(+), 1877 deletions(-)
 create mode 100644 Documentation/core-api/maple_tree.rst
 create mode 100644 include/linux/maple_tree.h
 delete mode 100644 include/linux/vmacache.h
 create mode 100644 include/trace/events/maple_tree.h
 create mode 100644 lib/maple_tree.c
 create mode 100644 lib/test_maple_tree.c
 delete mode 100644 mm/vmacache.c
 create mode 100644 tools/testing/radix-tree/linux/maple_tree.h
 create mode 100644 tools/testing/radix-tree/maple.c
 create mode 100644 tools/testing/radix-tree/trace/events/maple_tree.h

Comments

Andrew Morton July 17, 2022, 4:20 a.m. UTC | #1
On Sun, 17 Jul 2022 02:46:32 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:

> This is the v10 + fixes and a few clean ups.

I'm seeing quite a lot of differences between this and what we had in
mm-unstable.  A surprising amount.

Please check it all?


--- include/linux/maple_tree.h	2022-07-16 21:05:04.697041964 -0700
+++ ../25-pre-maple/include/linux/maple_tree.h	2022-07-16 21:05:34.509477799 -0700
@@ -225,9 +225,9 @@
  * @flags: The maple tree flags
  *
  */
-#define MTREE_INIT(name, __flags) {					\
-	.ma_lock = __SPIN_LOCK_UNLOCKED((name).ma_lock),		\
-	.ma_flags = __flags,						\
+#define MTREE_INIT(name, flags) {					\
+	.ma_lock = __SPIN_LOCK_UNLOCKED(name.ma_lock),			\
+	.ma_flags = flags,						\
 	.ma_root = NULL,						\
 }
 
@@ -238,13 +238,13 @@
  * @lock: The external lock
  */
 #ifdef CONFIG_LOCKDEP
-#define MTREE_INIT_EXT(name, __flags, __lock) {				\
-	.ma_external_lock = &(__lock).dep_map,				\
-	.ma_flags = (__flags),						\
+#define MTREE_INIT_EXT(name, flags, lock) {				\
+	.ma_external_lock = &(lock).dep_map,				\
+	.ma_flags = flags,						\
 	.ma_root = NULL,						\
 }
 #else
-#define MTREE_INIT_EXT(name, __flags, __lock)	MTREE_INIT(name, __flags)
+#define MTREE_INIT_EXT(name, flags, lock)	MTREE_INIT(name, flags)
 #endif
 
 #define DEFINE_MTREE(name)						\
@@ -520,8 +520,8 @@
  * Note: may return the zero entry.
  *
  */
-#define mas_for_each(__mas, __entry, __max) \
-	while (((__entry) = mas_find((__mas), (__max))) != NULL)
+#define mas_for_each(mas, entry, max) \
+	while (((entry) = mas_find((mas), (max))) != NULL)
 
 
 /**
@@ -654,9 +654,9 @@
  *
  * Note: Will not return the zero entry.
  */
-#define mt_for_each(__tree, __entry, __index, __max) \
-	for (__entry = mt_find(__tree, &(__index), __max); \
-		__entry; __entry = mt_find_after(__tree, &(__index), __max))
+#define mt_for_each(tree, entry, index, max) \
+	for (entry = mt_find(tree, &(index), max); \
+		entry; entry = mt_find_after(tree, &(index), max))
 
 
 #ifdef CONFIG_DEBUG_MAPLE_TREE
@@ -665,12 +665,12 @@
 
 void mt_dump(const struct maple_tree *mt);
 void mt_validate(struct maple_tree *mt);
-#define MT_BUG_ON(__tree, __x) do {					\
+#define MT_BUG_ON(tree, x) do {						\
 	atomic_inc(&maple_tree_tests_run);				\
-	if (__x) {							\
+	if (x) {							\
 		pr_info("BUG at %s:%d (%u)\n",				\
-		__func__, __LINE__, __x);				\
-		mt_dump(__tree);					\
+		__func__, __LINE__, x);					\
+		mt_dump(tree);						\
 		pr_info("Pass: %u Run:%u\n",				\
 			atomic_read(&maple_tree_tests_passed),		\
 			atomic_read(&maple_tree_tests_run));		\
@@ -680,7 +680,7 @@
 	}								\
 } while (0)
 #else
-#define MT_BUG_ON(__tree, __x) BUG_ON(__x)
+#define MT_BUG_ON(tree, x) BUG_ON(x)
 #endif /* CONFIG_DEBUG_MAPLE_TREE */
 
 #endif /*_LINUX_MAPLE_TREE_H */
--- include/linux/mm.h	2022-07-16 21:05:07.625084770 -0700
+++ ../25-pre-maple/include/linux/mm.h	2022-07-16 21:05:37.847526599 -0700
@@ -683,12 +683,11 @@
 	return vmi->mas.index;
 }
 
-#define for_each_vma(__vmi, __vma)					\
-	while (((__vma) = vma_next(&(__vmi))) != NULL)
+#define for_each_vma(vmi, vma)		while ((vma = vma_next(&(vmi))) != NULL)
 
 /* The MM code likes to work with exclusive end addresses */
-#define for_each_vma_range(__vmi, __vma, __end)				\
-	while (((__vma) = vma_find(&(__vmi), (__end) - 1)) != NULL)
+#define for_each_vma_range(vmi, vma, end)				\
+	while ((vma = vma_find(&(vmi), (end) - 1)) != NULL)
 
 #ifdef CONFIG_SHMEM
 /*
--- include/linux/mm_types.h	2022-07-16 21:05:07.625084770 -0700
+++ ../25-pre-maple/include/linux/mm_types.h	2022-07-16 21:05:37.848526614 -0700
@@ -681,11 +681,11 @@
 	struct ma_state mas;
 };
 
-#define VMA_ITERATOR(name, __mm, __addr)				\
+#define VMA_ITERATOR(name, mm, addr) 					\
 	struct vma_iterator name = {					\
 		.mas = {						\
-			.tree = &(__mm)->mm_mt,				\
-			.index = __addr,				\
+			.tree = &mm->mm_mt,				\
+			.index = addr,					\
 			.node = MAS_START,				\
 		},							\
 	}




--- mm/memory.c	2022-07-16 21:05:07.627084799 -0700
+++ ../25-pre-maple/mm/memory.c	2022-07-16 21:05:37.980528543 -0700
@@ -1699,6 +1699,7 @@
 /**
  * unmap_vmas - unmap a range of memory covered by a list of vma's
  * @tlb: address of the caller's struct mmu_gather
+ * @mt: The maple tree pointer for the VMAs
  * @vma: the starting vma
  * @start_addr: virtual address at which to start unmapping
  * @end_addr: virtual address at which to end unmapping
--- mm/mmap.c	2022-07-16 21:05:07.716086100 -0700
+++ ../25-pre-maple/mm/mmap.c	2022-07-16 21:05:38.104530356 -0700
@@ -341,27 +341,28 @@
 	MA_STATE(mas, mt, 0, 0);
 
 	mt_validate(&mm->mm_mt);
+
 	mas_for_each(&mas, vma_mt, ULONG_MAX) {
 		if ((vma_mt->vm_start != mas.index) ||
 		    (vma_mt->vm_end - 1 != mas.last)) {
 			pr_emerg("issue in %s\n", current->comm);
 			dump_stack();
 			dump_vma(vma_mt);
-			pr_emerg("mt piv: %p %lu - %lu\n", vma_mt,
+			pr_emerg("mt piv: %px %lu - %lu\n", vma_mt,
 				 mas.index, mas.last);
-			pr_emerg("mt vma: %p %lu - %lu\n", vma_mt,
+			pr_emerg("mt vma: %px %lu - %lu\n", vma_mt,
 				 vma_mt->vm_start, vma_mt->vm_end);
 
 			mt_dump(mas.tree);
 			if (vma_mt->vm_end != mas.last + 1) {
-				pr_err("vma: %p vma_mt %lu-%lu\tmt %lu-%lu\n",
+				pr_err("vma: %px vma_mt %lu-%lu\tmt %lu-%lu\n",
 						mm, vma_mt->vm_start, vma_mt->vm_end,
 						mas.index, mas.last);
 				mt_dump(mas.tree);
 			}
 			VM_BUG_ON_MM(vma_mt->vm_end != mas.last + 1, mm);
 			if (vma_mt->vm_start != mas.index) {
-				pr_err("vma: %p vma_mt %p %lu - %lu doesn't match\n",
+				pr_err("vma: %px vma_mt %px %lu - %lu doesn't match\n",
 						mm, vma_mt, vma_mt->vm_start, vma_mt->vm_end);
 				mt_dump(mas.tree);
 			}
@@ -448,7 +449,7 @@
 		unsigned long vm_start = max(addr, vma->vm_start);
 		unsigned long vm_end = min(end, vma->vm_end);
 
-		nr_pages += PHYS_PFN(vm_end - vm_start);
+		nr_pages += (vm_end - vm_start) / PAGE_SIZE;
 	}
 
 	return nr_pages;
@@ -710,11 +711,11 @@
 				 * remove_next == 1 is case 1 or 7.
 				 */
 				remove_next = 1 + (end > next->vm_end);
-				if (remove_next == 2)
-					next_next = find_vma(mm, next->vm_end);
-
+				next_next = find_vma(mm, next->vm_end);
 				VM_WARN_ON(remove_next == 2 &&
 					   end != next_next->vm_end);
+				/* trim end to next, for case 6 first pass */
+				end = next->vm_end;
 			}
 
 			exporter = next;
@@ -725,7 +726,7 @@
 			 * next, if the vma overlaps with it.
 			 */
 			if (remove_next == 2 && !next->anon_vma)
-				exporter = next_next;
+				exporter = find_vma(mm, next->vm_end);
 
 		} else if (end > next->vm_start) {
 			/*
@@ -762,6 +763,7 @@
 				return error;
 		}
 	}
+again:
 	vma_adjust_trans_huge(orig_vma, start, end, adjust_next);
 
 	if (mas_preallocate(&mas, vma, GFP_KERNEL)) {
@@ -852,8 +854,6 @@
 
 	if (remove_next && file) {
 		__remove_shared_vm_struct(next, file, mapping);
-		if (remove_next == 2)
-			__remove_shared_vm_struct(next_next, file, mapping);
 	} else if (insert) {
 		/*
 		 * split_vma has split insert from vma, and needs
@@ -881,7 +881,6 @@
 	}
 
 	if (remove_next) {
-again:
 		if (file) {
 			uprobe_munmap(next, next->vm_start, next->vm_end);
 			fput(file);
@@ -890,22 +889,45 @@
 			anon_vma_merge(vma, next);
 		mm->map_count--;
 		mpol_put(vma_policy(next));
-		if (remove_next != 2)
-			BUG_ON(vma->vm_end < next->vm_end);
+		BUG_ON(vma->vm_end < next->vm_end);
 		vm_area_free(next);
 
 		/*
 		 * In mprotect's case 6 (see comments on vma_merge),
-		 * we must remove next_next too.
+		 * we must remove another next too. It would clutter
+		 * up the code too much to do both in one go.
 		 */
+		if (remove_next != 3) {
+			/*
+			 * If "next" was removed and vma->vm_end was
+			 * expanded (up) over it, in turn
+			 * "next->prev->vm_end" changed and the
+			 * "vma->next" gap must be updated.
+			 */
+			next = next_next;
+		} else {
+			/*
+			 * For the scope of the comment "next" and
+			 * "vma" considered pre-swap(): if "vma" was
+			 * removed, next->vm_start was expanded (down)
+			 * over it and the "next" gap must be updated.
+			 * Because of the swap() the post-swap() "vma"
+			 * actually points to pre-swap() "next"
+			 * (post-swap() "next" as opposed is now a
+			 * dangling pointer).
+			 */
+			next = vma;
+		}
 		if (remove_next == 2) {
+			mas_reset(&mas);
 			remove_next = 1;
-			next = next_next;
+			end = next->vm_end;
 			goto again;
 		}
 	}
-	if (insert && file)
+	if (insert && file) {
 		uprobe_mmap(insert);
+	}
 
 	mas_destroy(&mas);
 	validate_mm(mm);
@@ -1613,8 +1635,8 @@
 	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }
 
-/**
- * unmapped_area() - Find an area between the low_limit and the high_limit with
+/*
+ * unmapped_area() Find an area between the low_limit and the high_limit with
  * the correct alignment and offset, all from @info. Note: current->mm is used
  * for the search.
  *
@@ -1640,15 +1662,12 @@
 
 	gap = mas.index;
 	gap += (info->align_offset - gap) & info->align_mask;
-	VM_BUG_ON(gap + info->length > info->high_limit);
-	VM_BUG_ON(gap + info->length > mas.last);
 	return gap;
 }
 
-/**
- * unmapped_area_topdown() - Find an area between the low_limit and the
+/* unmapped_area_topdown() Find an area between the low_limit and the
  * high_limit with * the correct alignment and offset at the highest available
- * address, all from @info. Note: current->mm is used for the search.
+ * address, all from * @info. Note: current->mm is used for the search.
  *
  * @info: The unmapped area information including the range (low_limit -
  * hight_limit), the alignment offset and mask.
@@ -1671,8 +1690,6 @@
 
 	gap = mas.last + 1 - info->length;
 	gap -= (gap - info->align_offset) & info->align_mask;
-	VM_BUG_ON(gap < info->low_limit);
-	VM_BUG_ON(gap < mas.index);
 	return gap;
 }
 
@@ -1884,12 +1901,12 @@
 EXPORT_SYMBOL(find_vma_intersection);
 
 /**
- * find_vma() - Find the VMA for a given address, or the next VMA.
+ * find_vma() - Find the VMA for a given address, or the next vma.
  * @mm: The mm_struct to check
  * @addr: The address
  *
- * Returns: The VMA associated with addr, or the next VMA.
- * May return %NULL in the case of no VMA at addr or above.
+ * Returns: The VMA associated with addr, or the next vma.
+ * May return %NULL in the case of no vma at addr or above.
  */
 struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
 {
@@ -2642,7 +2659,7 @@
 	    (vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file,
 				       pgoff, vma->vm_userfaultfd_ctx, NULL) :
 		   can_vma_merge_after(prev, vm_flags, NULL, file, pgoff,
-				       NULL_VM_UFFD_CTX, NULL))) {
+				       NULL_VM_UFFD_CTX , NULL))) {
 		merge_start = prev->vm_start;
 		vma = prev;
 		vm_pgoff = prev->vm_pgoff;
@@ -3036,7 +3053,6 @@
 		unsigned long addr, unsigned long len, unsigned long flags)
 {
 	struct mm_struct *mm = current->mm;
-
 	validate_mm_mt(mm);
 
 	/*
@@ -3092,7 +3108,7 @@
 	vma->vm_flags = flags;
 	vma->vm_page_prot = vm_get_page_prot(flags);
 	mas_set_range(mas, vma->vm_start, addr + len - 1);
-	if (mas_store_gfp(mas, vma, GFP_KERNEL))
+	if ( mas_store_gfp(mas, vma, GFP_KERNEL))
 		goto mas_store_fail;
 
 	mm->map_count++;
@@ -3155,7 +3171,7 @@
 	vma = mas_prev(&mas, 0);
 	if (!vma || vma->vm_end != addr || vma_policy(vma) ||
 	    !can_vma_merge_after(vma, flags, NULL, NULL,
-				 addr >> PAGE_SHIFT, NULL_VM_UFFD_CTX, NULL))
+				 addr >> PAGE_SHIFT,NULL_VM_UFFD_CTX, NULL))
 		vma = NULL;
 
 	ret = do_brk_flags(&mas, vma, addr, len, flags);
Yu Zhao July 17, 2022, 5:57 a.m. UTC | #2
On Sat, Jul 16, 2022 at 10:20 PM Andrew Morton
<akpm@linux-foundation.org> wrote:
>
> On Sun, 17 Jul 2022 02:46:32 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
...
>         gap = mas.index;
>         gap += (info->align_offset - gap) & info->align_mask;
> -       VM_BUG_ON(gap + info->length > info->high_limit);
> -       VM_BUG_ON(gap + info->length > mas.last);

These VM_BUG_ONs are new, and I hit the second one quickly:

  kernel BUG at mm/mmap.c:1631!
  RIP: 0010:vm_unmapped_area+0xdb/0x1c0
  Call Trace:
   <TASK>
   arch_get_unmapped_area+0x1ee/0x220
   arch_get_unmapped_area_topdown+0x25a/0x290
   get_unmapped_area+0x92/0x100
   do_mmap+0x13f/0x560
   vm_mmap_pgoff+0xcd/0x170
Liam R. Howlett July 17, 2022, 12:42 p.m. UTC | #3
* Yu Zhao <yuzhao@google.com> [220717 01:58]:
> On Sat, Jul 16, 2022 at 10:20 PM Andrew Morton
> <akpm@linux-foundation.org> wrote:
> >
> > On Sun, 17 Jul 2022 02:46:32 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> ...
> >         gap = mas.index;
> >         gap += (info->align_offset - gap) & info->align_mask;
> > -       VM_BUG_ON(gap + info->length > info->high_limit);
> > -       VM_BUG_ON(gap + info->length > mas.last);

What arch did you hit these on?  David asked me to restore some of
the BUG_ONs in these functions and I guess I got it wrong.

> 
> These VM_BUG_ONs are new, and I hit the second one quickly:
> 
>   kernel BUG at mm/mmap.c:1631!
>   RIP: 0010:vm_unmapped_area+0xdb/0x1c0
>   Call Trace:
>    <TASK>
>    arch_get_unmapped_area+0x1ee/0x220
>    arch_get_unmapped_area_topdown+0x25a/0x290
>    get_unmapped_area+0x92/0x100
>    do_mmap+0x13f/0x560
>    vm_mmap_pgoff+0xcd/0x170
Liam R. Howlett July 17, 2022, 1:01 p.m. UTC | #4
* Andrew Morton <akpm@linux-foundation.org> [220717 00:21]:
> On Sun, 17 Jul 2022 02:46:32 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> 
> > This is the v10 + fixes and a few clean ups.
> 
> I'm seeing quite a lot of differences between this and what we had in
> mm-unstable.  A surprising amount.
> 
> Please check it all?

Sorry, I tried to make checkpatch happy and applied some requested cleanups.

> 
> 
> --- include/linux/maple_tree.h	2022-07-16 21:05:04.697041964 -0700
> +++ ../25-pre-maple/include/linux/maple_tree.h	2022-07-16 21:05:34.509477799 -0700
> @@ -225,9 +225,9 @@
>   * @flags: The maple tree flags
>   *
>   */
> -#define MTREE_INIT(name, __flags) {					\
> -	.ma_lock = __SPIN_LOCK_UNLOCKED((name).ma_lock),		\
> -	.ma_flags = __flags,						\
> +#define MTREE_INIT(name, flags) {					\
> +	.ma_lock = __SPIN_LOCK_UNLOCKED(name.ma_lock),			\
> +	.ma_flags = flags,						\
>  	.ma_root = NULL,						\
>  }
>  
> @@ -238,13 +238,13 @@
>   * @lock: The external lock
>   */
>  #ifdef CONFIG_LOCKDEP
> -#define MTREE_INIT_EXT(name, __flags, __lock) {				\
> -	.ma_external_lock = &(__lock).dep_map,				\
> -	.ma_flags = (__flags),						\
> +#define MTREE_INIT_EXT(name, flags, lock) {				\
> +	.ma_external_lock = &(lock).dep_map,				\
> +	.ma_flags = flags,						\
>  	.ma_root = NULL,						\
>  }
>  #else
> -#define MTREE_INIT_EXT(name, __flags, __lock)	MTREE_INIT(name, __flags)
> +#define MTREE_INIT_EXT(name, flags, lock)	MTREE_INIT(name, flags)
>  #endif
>  
>  #define DEFINE_MTREE(name)						\
> @@ -520,8 +520,8 @@
>   * Note: may return the zero entry.
>   *
>   */
> -#define mas_for_each(__mas, __entry, __max) \
> -	while (((__entry) = mas_find((__mas), (__max))) != NULL)
> +#define mas_for_each(mas, entry, max) \
> +	while (((entry) = mas_find((mas), (max))) != NULL)
>  
>  
>  /**
> @@ -654,9 +654,9 @@
>   *
>   * Note: Will not return the zero entry.
>   */
> -#define mt_for_each(__tree, __entry, __index, __max) \
> -	for (__entry = mt_find(__tree, &(__index), __max); \
> -		__entry; __entry = mt_find_after(__tree, &(__index), __max))
> +#define mt_for_each(tree, entry, index, max) \
> +	for (entry = mt_find(tree, &(index), max); \
> +		entry; entry = mt_find_after(tree, &(index), max))
>  
>  
>  #ifdef CONFIG_DEBUG_MAPLE_TREE
> @@ -665,12 +665,12 @@
>  
>  void mt_dump(const struct maple_tree *mt);
>  void mt_validate(struct maple_tree *mt);
> -#define MT_BUG_ON(__tree, __x) do {					\
> +#define MT_BUG_ON(tree, x) do {						\
>  	atomic_inc(&maple_tree_tests_run);				\
> -	if (__x) {							\
> +	if (x) {							\
>  		pr_info("BUG at %s:%d (%u)\n",				\
> -		__func__, __LINE__, __x);				\
> -		mt_dump(__tree);					\
> +		__func__, __LINE__, x);					\
> +		mt_dump(tree);						\
>  		pr_info("Pass: %u Run:%u\n",				\
>  			atomic_read(&maple_tree_tests_passed),		\
>  			atomic_read(&maple_tree_tests_run));		\
> @@ -680,7 +680,7 @@
>  	}								\
>  } while (0)
>  #else
> -#define MT_BUG_ON(__tree, __x) BUG_ON(__x)
> +#define MT_BUG_ON(tree, x) BUG_ON(x)
>  #endif /* CONFIG_DEBUG_MAPLE_TREE */
>  
>  #endif /*_LINUX_MAPLE_TREE_H */
> --- include/linux/mm.h	2022-07-16 21:05:07.625084770 -0700
> +++ ../25-pre-maple/include/linux/mm.h	2022-07-16 21:05:37.847526599 -0700
> @@ -683,12 +683,11 @@
>  	return vmi->mas.index;
>  }
>  
> -#define for_each_vma(__vmi, __vma)					\
> -	while (((__vma) = vma_next(&(__vmi))) != NULL)
> +#define for_each_vma(vmi, vma)		while ((vma = vma_next(&(vmi))) != NULL)
>  
>  /* The MM code likes to work with exclusive end addresses */
> -#define for_each_vma_range(__vmi, __vma, __end)				\
> -	while (((__vma) = vma_find(&(__vmi), (__end) - 1)) != NULL)
> +#define for_each_vma_range(vmi, vma, end)				\
> +	while ((vma = vma_find(&(vmi), (end) - 1)) != NULL)
>  
>  #ifdef CONFIG_SHMEM
>  /*
> --- include/linux/mm_types.h	2022-07-16 21:05:07.625084770 -0700
> +++ ../25-pre-maple/include/linux/mm_types.h	2022-07-16 21:05:37.848526614 -0700
> @@ -681,11 +681,11 @@
>  	struct ma_state mas;
>  };
>  
> -#define VMA_ITERATOR(name, __mm, __addr)				\
> +#define VMA_ITERATOR(name, mm, addr) 					\
>  	struct vma_iterator name = {					\
>  		.mas = {						\
> -			.tree = &(__mm)->mm_mt,				\
> -			.index = __addr,				\
> +			.tree = &mm->mm_mt,				\
> +			.index = addr,					\
>  			.node = MAS_START,				\
>  		},							\
>  	}
> 
> 
> 

All of the above changes are attempting to avoid issues with the macros
using parameter names that may collide with variables at the call site,
and potential misbinding issues when unparenthesized arguments are
dereferenced.

> 
> --- mm/memory.c	2022-07-16 21:05:07.627084799 -0700
> +++ ../25-pre-maple/mm/memory.c	2022-07-16 21:05:37.980528543 -0700
> @@ -1699,6 +1699,7 @@
>  /**
>   * unmap_vmas - unmap a range of memory covered by a list of vma's
>   * @tlb: address of the caller's struct mmu_gather
> + * @mt: The maple tree pointer for the VMAs
>   * @vma: the starting vma
>   * @start_addr: virtual address at which to start unmapping
>   * @end_addr: virtual address at which to end unmapping
> --- mm/mmap.c	2022-07-16 21:05:07.716086100 -0700
> +++ ../25-pre-maple/mm/mmap.c	2022-07-16 21:05:38.104530356 -0700
> @@ -341,27 +341,28 @@
>  	MA_STATE(mas, mt, 0, 0);
>  
>  	mt_validate(&mm->mm_mt);
> +
>  	mas_for_each(&mas, vma_mt, ULONG_MAX) {
>  		if ((vma_mt->vm_start != mas.index) ||
>  		    (vma_mt->vm_end - 1 != mas.last)) {
>  			pr_emerg("issue in %s\n", current->comm);
>  			dump_stack();
>  			dump_vma(vma_mt);
> -			pr_emerg("mt piv: %p %lu - %lu\n", vma_mt,
> +			pr_emerg("mt piv: %px %lu - %lu\n", vma_mt,
>  				 mas.index, mas.last);
> -			pr_emerg("mt vma: %p %lu - %lu\n", vma_mt,
> +			pr_emerg("mt vma: %px %lu - %lu\n", vma_mt,
>  				 vma_mt->vm_start, vma_mt->vm_end);
>  
>  			mt_dump(mas.tree);
>  			if (vma_mt->vm_end != mas.last + 1) {
> -				pr_err("vma: %p vma_mt %lu-%lu\tmt %lu-%lu\n",
> +				pr_err("vma: %px vma_mt %lu-%lu\tmt %lu-%lu\n",
>  						mm, vma_mt->vm_start, vma_mt->vm_end,
>  						mas.index, mas.last);
>  				mt_dump(mas.tree);
>  			}
>  			VM_BUG_ON_MM(vma_mt->vm_end != mas.last + 1, mm);
>  			if (vma_mt->vm_start != mas.index) {
> -				pr_err("vma: %p vma_mt %p %lu - %lu doesn't match\n",
> +				pr_err("vma: %px vma_mt %px %lu - %lu doesn't match\n",
>  						mm, vma_mt, vma_mt->vm_start, vma_mt->vm_end);
>  				mt_dump(mas.tree);
>  			}

I removed %px here to avoid leaking kernel pointers - although this is
debug code, so maybe it's not that critical.

> @@ -448,7 +449,7 @@
>  		unsigned long vm_start = max(addr, vma->vm_start);
>  		unsigned long vm_end = min(end, vma->vm_end);
>  
> -		nr_pages += PHYS_PFN(vm_end - vm_start);
> +		nr_pages += (vm_end - vm_start) / PAGE_SIZE;
>  	}

This was suggested by David; I should have sent it out as a patch before
v11.

>  
>  	return nr_pages;
> @@ -710,11 +711,11 @@
>  				 * remove_next == 1 is case 1 or 7.
>  				 */
>  				remove_next = 1 + (end > next->vm_end);
> -				if (remove_next == 2)
> -					next_next = find_vma(mm, next->vm_end);
> -
> +				next_next = find_vma(mm, next->vm_end);
>  				VM_WARN_ON(remove_next == 2 &&
>  					   end != next_next->vm_end);
> +				/* trim end to next, for case 6 first pass */
> +				end = next->vm_end;
>  			}
>  
>  			exporter = next;
> @@ -725,7 +726,7 @@
>  			 * next, if the vma overlaps with it.
>  			 */
>  			if (remove_next == 2 && !next->anon_vma)
> -				exporter = next_next;
> +				exporter = find_vma(mm, next->vm_end);
>  
>  		} else if (end > next->vm_start) {
>  			/*
> @@ -762,6 +763,7 @@
>  				return error;
>  		}
>  	}
> +again:
>  	vma_adjust_trans_huge(orig_vma, start, end, adjust_next);
>  
>  	if (mas_preallocate(&mas, vma, GFP_KERNEL)) {
> @@ -852,8 +854,6 @@
>  
>  	if (remove_next && file) {
>  		__remove_shared_vm_struct(next, file, mapping);
> -		if (remove_next == 2)
> -			__remove_shared_vm_struct(next_next, file, mapping);
>  	} else if (insert) {
>  		/*
>  		 * split_vma has split insert from vma, and needs
> @@ -881,7 +881,6 @@
>  	}
>  
>  	if (remove_next) {
> -again:
>  		if (file) {
>  			uprobe_munmap(next, next->vm_start, next->vm_end);
>  			fput(file);
> @@ -890,22 +889,45 @@
>  			anon_vma_merge(vma, next);
>  		mm->map_count--;
>  		mpol_put(vma_policy(next));
> -		if (remove_next != 2)
> -			BUG_ON(vma->vm_end < next->vm_end);
> +		BUG_ON(vma->vm_end < next->vm_end);
>  		vm_area_free(next);
>  
>  		/*
>  		 * In mprotect's case 6 (see comments on vma_merge),
> -		 * we must remove next_next too.
> +		 * we must remove another next too. It would clutter
> +		 * up the code too much to do both in one go.
>  		 */
> +		if (remove_next != 3) {
> +			/*
> +			 * If "next" was removed and vma->vm_end was
> +			 * expanded (up) over it, in turn
> +			 * "next->prev->vm_end" changed and the
> +			 * "vma->next" gap must be updated.
> +			 */
> +			next = next_next;
> +		} else {
> +			/*
> +			 * For the scope of the comment "next" and
> +			 * "vma" considered pre-swap(): if "vma" was
> +			 * removed, next->vm_start was expanded (down)
> +			 * over it and the "next" gap must be updated.
> +			 * Because of the swap() the post-swap() "vma"
> +			 * actually points to pre-swap() "next"
> +			 * (post-swap() "next" as opposed is now a
> +			 * dangling pointer).
> +			 */
> +			next = vma;
> +		}
>  		if (remove_next == 2) {
> +			mas_reset(&mas);
>  			remove_next = 1;
> -			next = next_next;
> +			end = next->vm_end;
>  			goto again;
>  		}
>  	}
> -	if (insert && file)
> +	if (insert && file) {
>  		uprobe_mmap(insert);
> +	}
>  
>  	mas_destroy(&mas);
>  	validate_mm(mm);
> @@ -1613,8 +1635,8 @@
>  	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
>  }

This is fallout from the syzbot bug fix in the patch I sent Hugh.  This
is the expected difference that had to propagate through a few patches
in the series.  I should note that it's slightly different from Hugh's
copy: this version searches for the required VMA only when needed.  I
actually really like how this turned out, since it simplified
__vma_adjust().

>  
> -/**
> - * unmapped_area() - Find an area between the low_limit and the high_limit with
> +/*
> + * unmapped_area() Find an area between the low_limit and the high_limit with
>   * the correct alignment and offset, all from @info. Note: current->mm is used
>   * for the search.
>   *
> @@ -1640,15 +1662,12 @@
>  
>  	gap = mas.index;
>  	gap += (info->align_offset - gap) & info->align_mask;
> -	VM_BUG_ON(gap + info->length > info->high_limit);
> -	VM_BUG_ON(gap + info->length > mas.last);
>  	return gap;
>  }
>  
> -/**
> - * unmapped_area_topdown() - Find an area between the low_limit and the
> +/* unmapped_area_topdown() Find an area between the low_limit and the
>   * high_limit with * the correct alignment and offset at the highest available
> - * address, all from @info. Note: current->mm is used for the search.
> + * address, all from * @info. Note: current->mm is used for the search.
>   *
>   * @info: The unmapped area information including the range (low_limit -
>   * hight_limit), the alignment offset and mask.
> @@ -1671,8 +1690,6 @@
>  
>  	gap = mas.last + 1 - info->length;
>  	gap -= (gap - info->align_offset) & info->align_mask;
> -	VM_BUG_ON(gap < info->low_limit);
> -	VM_BUG_ON(gap < mas.index);
>  	return gap;
>  }

These were added in an attempt to restore the previous VM_BUG_ON()s.  I
must have errors in them; they should be dropped until better tested.

>  
> @@ -1884,12 +1901,12 @@
>  EXPORT_SYMBOL(find_vma_intersection);
>  
>  /**
> - * find_vma() - Find the VMA for a given address, or the next VMA.
> + * find_vma() - Find the VMA for a given address, or the next vma.
>   * @mm: The mm_struct to check
>   * @addr: The address
>   *
> - * Returns: The VMA associated with addr, or the next VMA.
> - * May return %NULL in the case of no VMA at addr or above.
> + * Returns: The VMA associated with addr, or the next vma.
> + * May return %NULL in the case of no vma at addr or above.
>   */
>  struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
>  {
> @@ -2642,7 +2659,7 @@
>  	    (vma ? can_vma_merge_after(prev, vm_flags, vma->anon_vma, file,
>  				       pgoff, vma->vm_userfaultfd_ctx, NULL) :
>  		   can_vma_merge_after(prev, vm_flags, NULL, file, pgoff,
> -				       NULL_VM_UFFD_CTX, NULL))) {
> +				       NULL_VM_UFFD_CTX , NULL))) {
>  		merge_start = prev->vm_start;
>  		vma = prev;
>  		vm_pgoff = prev->vm_pgoff;
> @@ -3036,7 +3053,6 @@
>  		unsigned long addr, unsigned long len, unsigned long flags)
>  {
>  	struct mm_struct *mm = current->mm;
> -
>  	validate_mm_mt(mm);
>  
>  	/*
> @@ -3092,7 +3108,7 @@
>  	vma->vm_flags = flags;
>  	vma->vm_page_prot = vm_get_page_prot(flags);
>  	mas_set_range(mas, vma->vm_start, addr + len - 1);
> -	if (mas_store_gfp(mas, vma, GFP_KERNEL))
> +	if ( mas_store_gfp(mas, vma, GFP_KERNEL))
>  		goto mas_store_fail;
>  
>  	mm->map_count++;
> @@ -3155,7 +3171,7 @@
>  	vma = mas_prev(&mas, 0);
>  	if (!vma || vma->vm_end != addr || vma_policy(vma) ||
>  	    !can_vma_merge_after(vma, flags, NULL, NULL,
> -				 addr >> PAGE_SHIFT, NULL_VM_UFFD_CTX, NULL))
> +				 addr >> PAGE_SHIFT,NULL_VM_UFFD_CTX, NULL))
>  		vma = NULL;
>  
>  	ret = do_brk_flags(&mas, vma, addr, len, flags);


These are all cosmetic fixes for checkpatch.  I now regret making these
changes, as I increased confusion for very little gain - plus the
second-from-last is incorrect.

I should have prioritized stability over cleaner code this late in the
cycle.  Let me know if you'd like me to respin with just the addition of
the one fix sent to Hugh.


Thanks,
Liam
Liam R. Howlett July 17, 2022, 1:06 p.m. UTC | #5
* Liam R. Howlett <Liam.Howlett@Oracle.com> [220717 08:42]:
> * Yu Zhao <yuzhao@google.com> [220717 01:58]:
> > On Sat, Jul 16, 2022 at 10:20 PM Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> > >
> > > On Sun, 17 Jul 2022 02:46:32 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > ...
> > >         gap = mas.index;
> > >         gap += (info->align_offset - gap) & info->align_mask;
> > > -       VM_BUG_ON(gap + info->length > info->high_limit);
> > > -       VM_BUG_ON(gap + info->length > mas.last);
> 
> What arch did you hit these on?  David asked for me to restore some of
> the BUG_ONs in these functions and I guess I got it wrong.

I think these should be removed until I better test them.

> 
> > 
> > These VM_BUG_ONs are new, and I hit the second one quickly:
> > 
> >   kernel BUG at mm/mmap.c:1631!
> >   RIP: 0010:vm_unmapped_area+0xdb/0x1c0
> >   Call Trace:
> >    <TASK>
> >    arch_get_unmapped_area+0x1ee/0x220
> >    arch_get_unmapped_area_topdown+0x25a/0x290
> >    get_unmapped_area+0x92/0x100
> >    do_mmap+0x13f/0x560
> >    vm_mmap_pgoff+0xcd/0x170
Yu Zhao July 17, 2022, 6:55 p.m. UTC | #6
On Sun, Jul 17, 2022 at 6:42 AM Liam Howlett <liam.howlett@oracle.com> wrote:
>
> * Yu Zhao <yuzhao@google.com> [220717 01:58]:
> > On Sat, Jul 16, 2022 at 10:20 PM Andrew Morton
> > <akpm@linux-foundation.org> wrote:
> > >
> > > On Sun, 17 Jul 2022 02:46:32 +0000 Liam Howlett <liam.howlett@oracle.com> wrote:
> > ...
> > >         gap = mas.index;
> > >         gap += (info->align_offset - gap) & info->align_mask;
> > > -       VM_BUG_ON(gap + info->length > info->high_limit);
> > > -       VM_BUG_ON(gap + info->length > mas.last);
>
> What arch did you hit these on?

This is x86_64. Full trace:

  kernel BUG at mm/mmap.c:1584!
  invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
  CPU: 108 PID: 107064 Comm: stress-ng Tainted: G S      W  O
5.19.0-dbg-DEV #1
  RIP: 0010:vm_unmapped_area+0xdb/0x1c0
  Code: b0 4c 8b 73 28 49 29 c6 4c 23 73 20 49 01 c6 48 8b 43 08 4c 01
f0 48 3b 43 18 0f 87 da 00 00 00 48 3b 45 b8 0f 86 a5 00 00 00 <0f> 0b
48 c7 45 e0 00 00 00 00 65 48 8b 04 25 40 ff 01 00 48 8b 80
  RSP: 0018:ffff968562137c00 EFLAGS: 00010212
  RAX: 0000000040da1000 RBX: ffff968562137c88 RCX: 00000000025a8000
  RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff968562137c00
  RBP: ffff968562137c58 R08: ffff968562137c28 R09: ffff8f3a8d958f80
  R10: 0000000000000001 R11: 000000000000000f R12: 0000000000249000
  R13: 0000000000000000 R14: 0000000040b58000 R15: ffff8f3aafb741c0
  FS:  0000000002586340(0000) GS:ffff8f989fd00000(0000) knlGS:0000000000000000
  CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  CR2: 000000000064aff0 CR3: 00000061d15ba006 CR4: 00000000003706e0
  DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  Call Trace:
   <TASK>
   arch_get_unmapped_area+0x1ee/0x220
   arch_get_unmapped_area_topdown+0x25a/0x290
   get_unmapped_area+0x92/0x100
   do_mmap+0x13f/0x560
   vm_mmap_pgoff+0xcd/0x170
   ksys_mmap_pgoff+0xd8/0x200
   __x64_sys_mmap+0x3c/0x40
   do_syscall_64+0x44/0xa0
   entry_SYSCALL_64_after_hwframe+0x46/0xb0