mbox series

[RFC,0/2] mm: introduce anon_vma flags, reduce kernel allocs

Message ID cover.1741185865.git.lorenzo.stoakes@oracle.com (mailing list archive)
Headers show
Series mm: introduce anon_vma flags, reduce kernel allocs | expand

Message

Lorenzo Stoakes March 5, 2025, 2:55 p.m. UTC
VMA resources are scarce. This is a data structure whose weight we wish to
reduce (certainly as slab allocations are unreclaimable and - for now -
unmigratable).

So adding additional fields is generally unviable, and VMA flags are
equally as contended, and prevent VMA merge, further impacting overhead.

We can however make use of the time-honoured kernel tradition of grabbing
bits where we can.

Since we can rely upon anon_vma allocations being at least system
word-aligned, we have a handful of bits in the vma->anon_vma available to
use as flags.

In this series we establish doing so, and immediately use this to solve a
problem encountered as part of the guard region feature
(MADV_GUARD_INSTALL, MADV_GUARD_REMOVE).

We absolutely must preserve guard regions over fork, however it turns out
the only reasonable means of doing so is to establish an anon_vma even if
the VMA is unfaulted.

This creates unnecessary overhead, a problem extenuated by the extension of
this functionality to file-backed regions, where such-allocated memory may
never be utilised or freed until the end of the VMA's lifetime.

We can avoid this if we have a means of indicating to fork that we wish to
copy page tables without having to have this overhead.

Having flags available in vma->anon_vma allows us to do so - we can
therefore introduce a flag, ANON_VMA_UNFAULTED, which indicates that this
is the case.

We introduce wrapper functions to mask off these bits, and nearly every
part of the kernel behaves precisely the same as a result, with only the
desired change in behaviour in the forking logic.

On fault, or any operation that actually requires an established anon_vma,
the ANON_VMA_UNFAULTED flag is cleared and replaced by an actual anon_vma.

An additional advantage of having this mechanism is that we can also remove
this flag, should no 'real' anon_vma be established, and the user is
executing MADV_GUARD_REMOVE on the whole VMA, meaning we can prevent future
unneeded page table operations.

A benefit of this change, aside from saving kernel memory allocations, is
that THP page collapse is no longer impacted if we apply guard regions then
remove them in their entirety from a VMA, as otherwise the immediate
collapse of aligned page tables in retract_page_tables() cannot proceed.

Lorenzo Stoakes (2):
  mm: introduce anon_vma flags and use wrapper functions
  mm/madvise: utilise anon_vma unfaulted flag on guard region install

 fs/coredump.c                    |  2 +-
 include/linux/mm_types.h         | 67 ++++++++++++++++++++-
 include/linux/rmap.h             |  4 +-
 kernel/fork.c                    |  4 +-
 mm/debug.c                       |  6 +-
 mm/huge_memory.c                 |  4 +-
 mm/khugepaged.c                  | 12 ++--
 mm/ksm.c                         | 16 +++---
 mm/madvise.c                     | 49 ++++++++++------
 mm/memory.c                      |  6 +-
 mm/mmap.c                        |  2 +-
 mm/mprotect.c                    |  2 +-
 mm/mremap.c                      |  8 +--
 mm/rmap.c                        | 42 +++++++-------
 mm/swapfile.c                    |  2 +-
 mm/userfaultfd.c                 |  2 +-
 mm/vma.c                         | 99 +++++++++++++++++++++++++-------
 mm/vma.h                         |  6 +-
 security/selinux/hooks.c         |  2 +-
 tools/testing/vma/vma.c          | 95 +++++++++++++++---------------
 tools/testing/vma/vma_internal.h | 78 ++++++++++++++++++++++---
 21 files changed, 358 insertions(+), 150 deletions(-)

--
2.48.1

Comments

Matthew Wilcox March 5, 2025, 3:59 p.m. UTC | #1
On Wed, Mar 05, 2025 at 02:55:06PM +0000, Lorenzo Stoakes wrote:
> So adding additional fields is generally unviable, and VMA flags are
> equally as contended, and prevent VMA merge, further impacting overhead.
> 
> We can however make use of the time-honoured kernel tradition of grabbing
> bits where we can.
> 
> Since we can rely upon anon_vma allocations being at least system
> word-aligned, we have a handful of bits in the vma->anon_vma available to
> use as flags.

I'm not a huge fan when there's a much better solution.  It's an
unsigned long, but we can only use the first 32 bits because of 32-bit
compatibility?  This is a noose we've made for our own neck.

(there are many more places to fix up; this is illustrative):

diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
index 0660a03d37d9..c6ea81ff4afe 100644
--- a/include/linux/hugetlb_inline.h
+++ b/include/linux/hugetlb_inline.h
@@ -8,7 +8,7 @@
 
 static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
 {
-	return !!(vma->vm_flags & VM_HUGETLB);
+	return test_bit(VM_HUGETLB, vma->vm_flags);
 }
 
 #else
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 0ca9feec67b8..763210ba70b6 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -571,7 +571,8 @@ static inline void *folio_get_private(struct folio *folio)
 	return folio->private;
 }
 
-typedef unsigned long vm_flags_t;
+#define VM_FLAGS_COUNT	(8 / sizeof(unsigned long))
+typedef unsigned long vm_flags_t[VM_FLAGS_COUNT];
 
 /*
  * A region containing a mapping of a non-memory backed file under NOMMU
Lorenzo Stoakes March 5, 2025, 4:27 p.m. UTC | #2
On Wed, Mar 05, 2025 at 03:59:28PM +0000, Matthew Wilcox wrote:
> On Wed, Mar 05, 2025 at 02:55:06PM +0000, Lorenzo Stoakes wrote:
> > So adding additional fields is generally unviable, and VMA flags are
> > equally as contended, and prevent VMA merge, further impacting overhead.
> >
> > We can however make use of the time-honoured kernel tradition of grabbing
> > bits where we can.
> >
> > Since we can rely upon anon_vma allocations being at least system
> > word-aligned, we have a handful of bits in the vma->anon_vma available to
> > use as flags.
>
> I'm not a huge fan when there's a much better solution.  It's an
> unsigned long, but we can only use the first 32 bits because of 32-bit
> compatibility?  This is a noose we've made for our own neck.

Sure, as discussed off-list this is something I'm going to look at, it's
not either/or at all :)

I have in the back of my mind _other uses_ for these flags which isn't
stated here (perhaps should be), perhaps we can have other kinds of data
type we reference here.

But perhaps it's a bit early for that...

Other than that was a morning's work to see if this _could_ work as a
_immediate_ 'what can we do to address this issue?' solution.

The vm_flags solution is viable, as we can modify merging behaviour for a
'sticky' VMA flag that doesn't impact merging, i.e. mask out for merge
compatibility testing, but ensure propagated on split/merge (same behaviour
as what the anon_vma flags achieve).

>
> (there are many more places to fix up; this is illustrative):
>
> diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
> index 0660a03d37d9..c6ea81ff4afe 100644
> --- a/include/linux/hugetlb_inline.h
> +++ b/include/linux/hugetlb_inline.h
> @@ -8,7 +8,7 @@
>
>  static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
>  {
> -	return !!(vma->vm_flags & VM_HUGETLB);
> +	return test_bit(VM_HUGETLB, vma->vm_flags);
>  }
>
>  #else
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 0ca9feec67b8..763210ba70b6 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -571,7 +571,8 @@ static inline void *folio_get_private(struct folio *folio)
>  	return folio->private;
>  }
>
> -typedef unsigned long vm_flags_t;
> +#define VM_FLAGS_COUNT	(8 / sizeof(unsigned long))
> +typedef unsigned long vm_flags_t[VM_FLAGS_COUNT];
>
>  /*
>   * A region containing a mapping of a non-memory backed file under NOMMU

Ohhhh does test_bit() automagically figure things out for sizeof(addr)
including on 32-bit?

Maybe I can quickly slap together something for that quicker than I thought
then? I was _very concerned_ about tearing and that being a total PITA
hence deferring a bit.

But... if you're saying I can churn this to death and it'll 'just work'
then say no more... ;)

Let's put this on hold then until/when I have actual reasons to have
different anon_vma types and I'll look at the 'make vm_flags 64-bit
everywhere' thing instead.