
[v2,1/1] mm/vmscan: avoid split PMD-mapped THP during shrink_folio_list()

Message ID 20240422055213.60231-1-ioworker0@gmail.com (mailing list archive)
State New

Commit Message

Lance Yang April 22, 2024, 5:52 a.m. UTC
When the user no longer requires the pages, they would use
madvise(MADV_FREE) to mark the pages as lazy free. Typically, they would
not write to the given range again.

At present, PMD-mapped THPs that are marked as lazyfree during
shrink_folio_list() are unconditionally split, which may be unnecessary.
If the THP and its PMD are both marked as clean, and there are no
unexpected references, then we can attempt to remove the PMD mapping
from it. This change will improve the efficiency of memory reclamation
in this case.
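
For context, a user-space caller marks memory lazy-free roughly like this (a minimal sketch; the helper name and the 2 MiB size are illustrative, chosen so the range can be PMD-mapped when THP is enabled):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_FREE
#define MADV_FREE 8	/* available since Linux 4.5; fallback for old headers */
#endif

#define LEN (2UL * 1024 * 1024)	/* one PMD-sized (2 MiB on x86-64) range */

/* Map an anonymous region, populate it, then mark it lazy-free.
 * Returns the madvise() return value (0 on success). */
int demo_madv_free(void)
{
	void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	memset(p, 0xaa, LEN);	/* touch the pages so they are backed */

	/* The pages become discardable without swap-out; writing to them
	 * again cancels the hint for the written pages. */
	int ret = madvise(p, LEN, MADV_FREE);

	munmap(p, LEN);
	return ret;
}
```

MADV_FREE (Linux 4.5+) is what creates the clean, swap-less lazyfree folios that shrink_folio_list() later encounters.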

On an Intel i5 CPU, reclaiming 1GiB of PMD-mapped THPs using
mem_cgroup_force_empty() results in the following runtimes in seconds
(shorter is better):

--------------------------------------------
|     Old       |      New       |  Change  |
--------------------------------------------
|   0.683426    |    0.049197    |  -92.80% |
--------------------------------------------

Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
v1 -> v2:
 - Update the changelog
 - Follow the exact same logic as in try_to_unmap_one() (per David Hildenbrand)
 - Remove the extra code from rmap.c (per Matthew Wilcox)
 - https://lore.kernel.org/linux-mm/20240417141111.77855-1-ioworker0@gmail.com

 include/linux/huge_mm.h |  2 +
 include/linux/rmap.h    |  2 +
 mm/huge_memory.c        | 88 +++++++++++++++++++++++++++++++++++++++++
 mm/rmap.c               |  6 +++
 mm/vmscan.c             |  7 ++++
 5 files changed, 105 insertions(+)

Comments

Matthew Wilcox April 24, 2024, 4:15 a.m. UTC | #1
On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote:
> When the user no longer requires the pages, they would use
> madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not
> typically rewrite to the given range.
> 
> At present, PMD-mapped THPs that are marked as lazyfree during
> shrink_folio_list() are unconditionally split, which may be unnecessary.
> If the THP is clean, its PMD is also clean, and there are no unexpected
> references, then we can attempt to remove the PMD mapping from it. This
> change will improve the efficiency of memory reclamation in this case.

Does this happen outside of benchmarks?  I'm really struggling to see
how we end up in this situation.  We have a clean THP without swap
backing, so it's full of zeroes, but for some reason we haven't used the
shared huge zero page?  What is going on?
David Hildenbrand April 24, 2024, 7:17 a.m. UTC | #2
On 24.04.24 06:15, Matthew Wilcox wrote:
> On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote:
>> When the user no longer requires the pages, they would use
>> madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not
>> typically rewrite to the given range.
>>
>> At present, PMD-mapped THPs that are marked as lazyfree during
>> shrink_folio_list() are unconditionally split, which may be unnecessary.
>> If the THP is clean, its PMD is also clean, and there are no unexpected
>> references, then we can attempt to remove the PMD mapping from it. This
>> change will improve the efficiency of memory reclamation in this case.
> 
> Does this happen outside of benchmarks?  I'm really struggling to see
> how we end up in this situation.  We have a clean THP without swap
> backing, so it's full of zeroes, but for some reason we haven't used the
> shared huge zero page?  What is going on?

It's not full of zeroes.

User space called MADV_FREE on a PMD-mapped THP.

During MADV_FREE, we mark the PTEs as clean, and the folio as clean and 
"lazyfree" (no swap backend). If, during memory reclaim, we detect that 
(a) the folio is still clean, (b) the PTEs are still clean, and (c) there 
are no unexpected references (GUP), then user space didn't write to that 
memory again, so we can just discard the memory "lazily".
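
The decision above can be modeled as a small check (a toy user-space model with invented field and function names, not kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a lazy-freed PMD-mapped folio's state at reclaim time. */
struct folio_state {
	bool folio_dirty;	/* (a) folio redirtied since MADV_FREE? */
	bool pmd_dirty;		/* (b) PMD entry redirtied? */
	int  ref_count;		/* total references to the folio */
	int  map_count;		/* references coming from page tables */
};

/* Discard only when folio and PMD are still clean and the only refs are
 * one from isolation plus the rmap(s), i.e. no unexpected references
 * such as GUP pins — mirroring conditions (a), (b), and (c) above. */
static bool can_discard_lazyfree(const struct folio_state *s)
{
	if (s->folio_dirty || s->pmd_dirty)
		return false;
	return s->ref_count == s->map_count + 1;
}
```

If any condition fails, reclaim must fall back to keeping (or remapping) the folio rather than discarding it.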
Lance Yang April 24, 2024, 3:46 p.m. UTC | #3
Hey Matthew,

Thanks for taking the time to review.

On Wed, Apr 24, 2024 at 12:15 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote:
> > When the user no longer requires the pages, they would use
> > madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not
> > typically rewrite to the given range.
> >
> > At present, PMD-mapped THPs that are marked as lazyfree during
> > shrink_folio_list() are unconditionally split, which may be unnecessary.
> > If the THP is clean, its PMD is also clean, and there are no unexpected

"If the THP is clean, its PMD is also clean" can be confusing - sorry. It should
be modified to "If the THP and its PMD are both marked as clean".

Thanks,
Lance

> > references, then we can attempt to remove the PMD mapping from it. This
> > change will improve the efficiency of memory reclamation in this case.
>
> Does this happen outside of benchmarks?  I'm really struggling to see
> how we end up in this situation.  We have a clean THP without swap
> backing, so it's full of zeroes, but for some reason we haven't used the
> shared huge zero page?  What is going on?
>
Zi Yan April 24, 2024, 3:57 p.m. UTC | #4
On 24 Apr 2024, at 3:17, David Hildenbrand wrote:

> On 24.04.24 06:15, Matthew Wilcox wrote:
>> On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote:
>>> When the user no longer requires the pages, they would use
>>> madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not
>>> typically rewrite to the given range.
>>>
>>> At present, PMD-mapped THPs that are marked as lazyfree during
>>> shrink_folio_list() are unconditionally split, which may be unnecessary.
>>> If the THP is clean, its PMD is also clean, and there are no unexpected
>>> references, then we can attempt to remove the PMD mapping from it. This
>>> change will improve the efficiency of memory reclamation in this case.
>>
>> Does this happen outside of benchmarks?  I'm really struggling to see
>> how we end up in this situation.  We have a clean THP without swap
>> backing, so it's full of zeroes, but for some reason we haven't used the
>> shared huge zero page?  What is going on?
>
> It's not full of zeroes.
>
> User space called MADV_FREE on a PMD-mapped THP.
>
> During MADV_FREE, we mark the PTEs as clean, the folio as clean and sd "lazyfree" (no swap backend). If, during memory reclaim, we detect that (a) the folio is still clean (b) the PTEs are still clean and (c) there are no unexpected references (GUP), user space didn't re-write to that memory again, so we can just discard the memory "lazily".

It seems that try_to_unmap_one() does not support unmapping PMD-mapped folios.
Maybe add that support instead of this special-case handling?

--
Best Regards,
Yan, Zi
David Hildenbrand April 24, 2024, 3:58 p.m. UTC | #5
On 24.04.24 17:57, Zi Yan wrote:
> On 24 Apr 2024, at 3:17, David Hildenbrand wrote:
> 
>> On 24.04.24 06:15, Matthew Wilcox wrote:
>>> On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote:
>>>> When the user no longer requires the pages, they would use
>>>> madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not
>>>> typically rewrite to the given range.
>>>>
>>>> At present, PMD-mapped THPs that are marked as lazyfree during
>>>> shrink_folio_list() are unconditionally split, which may be unnecessary.
>>>> If the THP is clean, its PMD is also clean, and there are no unexpected
>>>> references, then we can attempt to remove the PMD mapping from it. This
>>>> change will improve the efficiency of memory reclamation in this case.
>>>
>>> Does this happen outside of benchmarks?  I'm really struggling to see
>>> how we end up in this situation.  We have a clean THP without swap
>>> backing, so it's full of zeroes, but for some reason we haven't used the
>>> shared huge zero page?  What is going on?
>>
>> It's not full of zeroes.
>>
>> User space called MADV_FREE on a PMD-mapped THP.
>>
>> During MADV_FREE, we mark the PTEs as clean, the folio as clean and sd "lazyfree" (no swap backend). If, during memory reclaim, we detect that (a) the folio is still clean (b) the PTEs are still clean and (c) there are no unexpected references (GUP), user space didn't re-write to that memory again, so we can just discard the memory "lazily".
> 
> It seems that try_to_unmap_one() does not support unmapping PMD-mapped folios.
> Maybe adding that support instead of a special case handling?

I was thinking the same, and finding a way to avoid TTU_LAZYFREE_THP.
Andrew Morton April 24, 2024, 9:20 p.m. UTC | #6
On Wed, 24 Apr 2024 23:46:59 +0800 Lance Yang <ioworker0@gmail.com> wrote:

> On Wed, Apr 24, 2024 at 12:15 PM Matthew Wilcox <willy@infradead.org> wrote:
> >
> > On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote:
> > > When the user no longer requires the pages, they would use
> > > madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not
> > > typically rewrite to the given range.
> > >
> > > At present, PMD-mapped THPs that are marked as lazyfree during
> > > shrink_folio_list() are unconditionally split, which may be unnecessary.
> > > If the THP is clean, its PMD is also clean, and there are no unexpected
> 
> "If the THP is clean, its PMD is also clean" can be confusing - sorry. It should
> be modified to "If the THP and its PMD are both marked as clean".

I made that changelog edit.
Lance Yang April 25, 2024, 4:17 a.m. UTC | #7
Hey Zi, David,

Thanks for taking the time to review!

On Wed, Apr 24, 2024 at 11:58 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 24.04.24 17:57, Zi Yan wrote:
> > On 24 Apr 2024, at 3:17, David Hildenbrand wrote:
> >
> >> On 24.04.24 06:15, Matthew Wilcox wrote:
> >>> On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote:
> >>>> When the user no longer requires the pages, they would use
> >>>> madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not
> >>>> typically rewrite to the given range.
> >>>>
> >>>> At present, PMD-mapped THPs that are marked as lazyfree during
> >>>> shrink_folio_list() are unconditionally split, which may be unnecessary.
> >>>> If the THP is clean, its PMD is also clean, and there are no unexpected
> >>>> references, then we can attempt to remove the PMD mapping from it. This
> >>>> change will improve the efficiency of memory reclamation in this case.
> >>>
> >>> Does this happen outside of benchmarks?  I'm really struggling to see
> >>> how we end up in this situation.  We have a clean THP without swap
> >>> backing, so it's full of zeroes, but for some reason we haven't used the
> >>> shared huge zero page?  What is going on?
> >>
> >> It's not full of zeroes.
> >>
> >> User space called MADV_FREE on a PMD-mapped THP.
> >>
> >> During MADV_FREE, we mark the PTEs as clean, the folio as clean and sd "lazyfree" (no swap backend). If, during memory reclaim, we detect that (a) the folio is still clean (b) the PTEs are still clean and (c) there are no unexpected references (GUP), user space didn't re-write to that memory again, so we can just discard the memory "lazily".
> >
> > It seems that try_to_unmap_one() does not support unmapping PMD-mapped folios.
> > Maybe adding that support instead of a special case handling?
>
> I was thinking the same, and finding a way to avoid TTU_LAZYFREE_THP.

Thanks for the suggestions!

Yep, I completely agree. Adding support for unmapping PMD-mapped folios to
try_to_unmap_one() would make it more future-proof.

Thanks again for the review!
Lance

>
> --
> Cheers,
>
> David / dhildenb
>
Lance Yang April 25, 2024, 4:19 a.m. UTC | #8
On Thu, Apr 25, 2024 at 5:20 AM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Wed, 24 Apr 2024 23:46:59 +0800 Lance Yang <ioworker0@gmail.com> wrote:
>
> > On Wed, Apr 24, 2024 at 12:15 PM Matthew Wilcox <willy@infradead.org> wrote:
> > >
> > > On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote:
> > > > When the user no longer requires the pages, they would use
> > > > madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not
> > > > typically rewrite to the given range.
> > > >
> > > > At present, PMD-mapped THPs that are marked as lazyfree during
> > > > shrink_folio_list() are unconditionally split, which may be unnecessary.
> > > > If the THP is clean, its PMD is also clean, and there are no unexpected
> >
> > "If the THP is clean, its PMD is also clean" can be confusing - sorry. It should
> > be modified to "If the THP and its PMD are both marked as clean".
>
> I made that changelog edit.

Thanks for updating, Andrew!
Lance
Lance Yang April 25, 2024, 8:50 a.m. UTC | #9
Hey Zi, David,

How about this change (diff against mm-unstable) as follows?

I'd like to add __try_to_unmap_huge_pmd() as a new internal function
specifically for unmapping PMD-mapped folios. If, for any reason, we cannot
unmap the folio, then we'll still split it as previously done.

Currently, __try_to_unmap_huge_pmd() only handles lazyfree THPs, but it
can be extended to support other large folios that are PMD-mapped in the
future if needed.

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 670218f762c8..0f906dc6d280 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -100,8 +100,6 @@ enum ttu_flags {
 					 * do a final flush if necessary */
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
-	TTU_LAZYFREE_THP	= 0x100, /* avoid splitting PMD-mapped THPs
-					  * that are marked as lazyfree. */
 };
 
 #ifdef CONFIG_MMU
diff --git a/mm/rmap.c b/mm/rmap.c
index a7913a454028..879c8923abfc 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1606,6 +1606,19 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
 #endif
 }
 
+static bool __try_to_unmap_huge_pmd(struct vm_area_struct *vma,
+				    unsigned long addr, struct folio *folio)
+{
+	VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
+		return discard_trans_pmd(vma, addr, folio);
+#endif
+
+	return false;
+}
+
 /*
  * @arg: enum ttu_flags will be passed to this argument
  */
@@ -1631,14 +1644,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	if (flags & TTU_SYNC)
 		pvmw.flags = PVMW_SYNC;
 
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (flags & TTU_LAZYFREE_THP)
-		if (discard_trans_pmd(vma, address, folio))
+	if (flags & TTU_SPLIT_HUGE_PMD) {
+		if (__try_to_unmap_huge_pmd(vma, address, folio))
 			return true;
-#endif
-
-	if (flags & TTU_SPLIT_HUGE_PMD)
 		split_huge_pmd_address(vma, address, false, folio);
+	}
 
 	/*
 	 * For THP, we have to assume the worse case ie pmd for invalidation.
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e2686cc0c037..49bd94423961 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1277,13 +1277,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
-
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
-			if (folio_test_anon(folio) && !was_swapbacked &&
-			    (flags & TTU_SPLIT_HUGE_PMD))
-				flags |= TTU_LAZYFREE_THP;
-#endif
-
 			/*
 			 * Without TTU_SYNC, try_to_unmap will only begin to
 			 * hold PTL from the first present PTE within a large
David Hildenbrand April 25, 2024, 9:01 a.m. UTC | #10
On 25.04.24 10:50, Lance Yang wrote:
> Hey Zi, David,

Hi,

> 
> How about this change(diff against mm-unstable) as follows?

This goes in the right direction. Please resend the whole thing; that will 
make it easier to review.

> 
> I'd like to add __try_to_unmap_huge_pmd() as a new internal function
> specifically for unmapping PMD-mapped folios. If, for any reason, we cannot
> unmap the folio, then we'll still split it as previously done.
> 
> Currently, __try_to_unmap_huge_pmd() only handles lazyfree THPs, but it
> can be extended to support other large folios that are PMD-mapped in the
> future if needed.
> 
> diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> index 670218f762c8..0f906dc6d280 100644
> --- a/include/linux/rmap.h
> +++ b/include/linux/rmap.h
> @@ -100,8 +100,6 @@ enum ttu_flags {
>   					 * do a final flush if necessary */
>   	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
>   					 * caller holds it */
> -	TTU_LAZYFREE_THP	= 0x100, /* avoid splitting PMD-mapped THPs
> -					  * that are marked as lazyfree. */
>   };
>   
>   #ifdef CONFIG_MMU
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a7913a454028..879c8923abfc 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1606,6 +1606,19 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
>   #endif
>   }
>   
> +static bool __try_to_unmap_huge_pmd(struct vm_area_struct *vma,
> +				    unsigned long addr, struct folio *folio)
> +{
> +	VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
> +
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> +		return discard_trans_pmd(vma, addr, folio);
> +#endif
> +
> +	return false;
> +}
> +
>   /*
>    * @arg: enum ttu_flags will be passed to this argument
>    */
> @@ -1631,14 +1644,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
>   	if (flags & TTU_SYNC)
>   		pvmw.flags = PVMW_SYNC;
>   
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	if (flags & TTU_LAZYFREE_THP)
> -		if (discard_trans_pmd(vma, address, folio))
> +	if (flags & TTU_SPLIT_HUGE_PMD) {
> +		if (__try_to_unmap_huge_pmd(vma, address, folio))
>   			return true;
> -#endif
> -
> -	if (flags & TTU_SPLIT_HUGE_PMD)
>   		split_huge_pmd_address(vma, address, false, folio);
> +	}
>   

I was wondering if we can better integrate that into the pagewalk below.

That is, don't do the TTU_SPLIT_HUGE_PMD immediately. Start the pagewalk 
first. If we walk a PMD, try to unmap it. Only if that fails, split it.

Less working on "vma + address" and instead directly on PMDs.
Lance Yang April 25, 2024, 9:21 a.m. UTC | #11
On Thu, Apr 25, 2024 at 5:01 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 25.04.24 10:50, Lance Yang wrote:
> > Hey Zi, David,
>
> Hi,
>
> >
> > How about this change(diff against mm-unstable) as follows?
>
> goes into the right direction, please resent the whole thing, that will
> make it easier to review.

Got it. I'll keep that in mind, thanks!

>
> >
> > I'd like to add __try_to_unmap_huge_pmd() as a new internal function
> > specifically for unmapping PMD-mapped folios. If, for any reason, we cannot
> > unmap the folio, then we'll still split it as previously done.
> >
> > Currently, __try_to_unmap_huge_pmd() only handles lazyfree THPs, but it
> > can be extended to support other large folios that are PMD-mapped in the
> > future if needed.
> >
> > diff --git a/include/linux/rmap.h b/include/linux/rmap.h
> > index 670218f762c8..0f906dc6d280 100644
> > --- a/include/linux/rmap.h
> > +++ b/include/linux/rmap.h
> > @@ -100,8 +100,6 @@ enum ttu_flags {
> >                                        * do a final flush if necessary */
> >       TTU_RMAP_LOCKED         = 0x80, /* do not grab rmap lock:
> >                                        * caller holds it */
> > -     TTU_LAZYFREE_THP        = 0x100, /* avoid splitting PMD-mapped THPs
> > -                                       * that are marked as lazyfree. */
> >   };
> >
> >   #ifdef CONFIG_MMU
> > diff --git a/mm/rmap.c b/mm/rmap.c
> > index a7913a454028..879c8923abfc 100644
> > --- a/mm/rmap.c
> > +++ b/mm/rmap.c
> > @@ -1606,6 +1606,19 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page,
> >   #endif
> >   }
> >
> > +static bool __try_to_unmap_huge_pmd(struct vm_area_struct *vma,
> > +                                 unsigned long addr, struct folio *folio)
> > +{
> > +     VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
> > +
> > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +     if (folio_test_anon(folio) && !folio_test_swapbacked(folio))
> > +             return discard_trans_pmd(vma, addr, folio);
> > +#endif
> > +
> > +     return false;
> > +}
> > +
> >   /*
> >    * @arg: enum ttu_flags will be passed to this argument
> >    */
> > @@ -1631,14 +1644,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
> >       if (flags & TTU_SYNC)
> >               pvmw.flags = PVMW_SYNC;
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > -     if (flags & TTU_LAZYFREE_THP)
> > -             if (discard_trans_pmd(vma, address, folio))
> > +     if (flags & TTU_SPLIT_HUGE_PMD) {
> > +             if (__try_to_unmap_huge_pmd(vma, address, folio))
> >                       return true;
> > -#endif
> > -
> > -     if (flags & TTU_SPLIT_HUGE_PMD)
> >               split_huge_pmd_address(vma, address, false, folio);
> > +     }
> >
>
> I was wondering if we can better integrate that into the pagewalk below.
>
> That is, don't do the TTU_SPLIT_HUGE_PMD immediately. Start the pagewalk
> first. If we walk a PMD, try to unmap it. Only if that fails, split it.

Nice. Thanks for the suggestion!
I'll work on integrating it into the pagewalk as you suggested.

>
> Less working on "vma + address" and instead directly on PMDs.

Yes, some of the work on "vma + address" can be avoided :)

Thanks again for the review!
Lance

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand April 25, 2024, 9:25 a.m. UTC | #12
>> I was wondering if we can better integrate that into the pagewalk below.
>>
>> That is, don't do the TTU_SPLIT_HUGE_PMD immediately. Start the pagewalk
>> first. If we walk a PMD, try to unmap it. Only if that fails, split it.
> 
> Nice. Thanks for the suggestion!
> I'll work on integrating it into the pagewalk as you suggested.
> 
>>
>> Less working on "vma + address" and instead directly on PMDs.
> 
> Yes, some of the work on "vma + address" can be avoided :)

Doing the conditional split while in the pagewalk will be the 
interesting bit to sort out (we temporarily have to drop the PTL and 
start once again from that now-PTE-mapped page table). But it should be 
a reasonable thing to have.

Please let us know if you run into bigger issues with that!

See walk_pmd_range() as an inspiration where we call split_huge_pmd().
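
The suggested flow — attempt the PMD-level unmap during the walk and split only on failure — could be sketched as (a toy user-space model; the names and the all-or-nothing success condition are simplifications, not the kernel pagewalk API):

```c
#include <assert.h>
#include <stdbool.h>

#define PTES_PER_PMD 512	/* entries after splitting one PMD (x86-64) */

/* Toy state: whether the whole-PMD unmap can succeed. */
struct pmd_entry {
	bool clean;		/* folio and PMD still clean */
	bool extra_refs;	/* unexpected references (e.g. GUP) */
};

/* Model of the proposed walk step: try the PMD-level unmap first; only
 * when that fails, split the PMD and fall back to unmapping the
 * now-PTE-mapped entries one by one. Returns how many per-entry unmap
 * operations were needed. */
static int walk_one_pmd(const struct pmd_entry *e)
{
	if (e->clean && !e->extra_refs)
		return 1;		/* discarded at PMD level */

	/* split_huge_pmd(): drop to PTEs and restart the walk from there */
	return PTES_PER_PMD;
}
```

The real version additionally has to drop and retake the PTL around the split, which is the tricky part David mentions above.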
Lance Yang April 25, 2024, noon UTC | #13
On Thu, Apr 25, 2024 at 5:25 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> I was wondering if we can better integrate that into the pagewalk below.
> >>
> >> That is, don't do the TTU_SPLIT_HUGE_PMD immediately. Start the pagewalk
> >> first. If we walk a PMD, try to unmap it. Only if that fails, split it.
> >
> > Nice. Thanks for the suggestion!
> > I'll work on integrating it into the pagewalk as you suggested.
> >
> >>
> >> Less working on "vma + address" and instead directly on PMDs.
> >
> > Yes, some of the work on "vma + address" can be avoided :)
>
> Doing the conditional split while in the pagewalk will be the
> interesting bit to sort out (we temporarily have to drop the PTL and
> start once again from that now-PTE-mapped page table). But it should be
> a reasonable thing to have.

Yep, it's an interesting challenge for me, but definitely worth tackling ;)

>
> Please let us know if you run into bigger issues with that!

Thanks, I'll do my best to sort it out and reach out for help if needed :p

>
> See walk_pmd_range() as an inspiration where we call split_huge_pmd().

Thanks for your suggestion! I'll take a closer look at it.

Thanks again for clarifying!
Lance

>
> --
> Cheers,
>
> David / dhildenb
>

Patch

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 7cd07b83a3d0..56c7ea73090b 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -36,6 +36,8 @@  bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
 int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		    pmd_t *pmd, unsigned long addr, pgprot_t newprot,
 		    unsigned long cp_flags);
+bool discard_trans_pmd(struct vm_area_struct *vma, unsigned long addr,
+		       struct folio *folio);
 
 vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write);
 vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write);
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0f906dc6d280..670218f762c8 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -100,6 +100,8 @@  enum ttu_flags {
 					 * do a final flush if necessary */
 	TTU_RMAP_LOCKED		= 0x80,	/* do not grab rmap lock:
 					 * caller holds it */
+	TTU_LAZYFREE_THP	= 0x100, /* avoid splitting PMD-mapped THPs
+					  * that are marked as lazyfree. */
 };
 
 #ifdef CONFIG_MMU
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 824eff9211db..63de1445feab 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1810,6 +1810,94 @@  static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd)
 	mm_dec_nr_ptes(mm);
 }
 
+bool discard_trans_pmd(struct vm_area_struct *vma, unsigned long addr,
+		       struct folio *folio)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct mmu_notifier_range range;
+	int ref_count, map_count;
+	struct mmu_gather tlb;
+	pmd_t *pmdp, orig_pmd;
+	struct page *page;
+	bool ret = false;
+	spinlock_t *ptl;
+
+	VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio);
+	VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio);
+	VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio);
+	VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio);
+
+	/* Perform best-effort early checks before acquiring the PMD lock */
+	if (folio_ref_count(folio) != folio_mapcount(folio) + 1 ||
+	    folio_test_dirty(folio))
+		return false;
+
+	pmdp = mm_find_pmd(mm, addr);
+	if (unlikely(!pmdp))
+		return false;
+	if (pmd_dirty(*pmdp))
+		return false;
+
+	tlb_gather_mmu(&tlb, mm);
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm,
+				addr & HPAGE_PMD_MASK,
+				(addr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	ptl = pmd_lock(mm, pmdp);
+	orig_pmd = *pmdp;
+	if (unlikely(!pmd_present(orig_pmd) || !pmd_trans_huge(orig_pmd)))
+		goto out;
+
+	page = pmd_page(orig_pmd);
+	if (unlikely(page_folio(page) != folio))
+		goto out;
+
+	orig_pmd = pmdp_huge_get_and_clear(mm, addr, pmdp);
+	tlb_remove_pmd_tlb_entry(&tlb, pmdp, addr);
+
+	/*
+	 * Syncing against concurrent GUP-fast:
+	 * - clear PMD; barrier; read refcount
+	 * - inc refcount; barrier; read PMD
+	 */
+	smp_mb();
+
+	ref_count = folio_ref_count(folio);
+	map_count = folio_mapcount(folio);
+
+	/*
+	 * Order reads for folio refcount and dirty flag
+	 * (see comments in __remove_mapping()).
+	 */
+	smp_rmb();
+
+	/*
+	 * If the PMD or folio is redirtied at this point, or if there are
+	 * unexpected references, we will give up to discard this folio
+	 * and remap it.
+	 *
+	 * The only folio refs must be one from isolation plus the rmap(s).
+	 */
+	if (ref_count != map_count + 1 || folio_test_dirty(folio) ||
+	    pmd_dirty(orig_pmd)) {
+		set_pmd_at(mm, addr, pmdp, orig_pmd);
+		goto out;
+	}
+
+	folio_remove_rmap_pmd(folio, page, vma);
+	zap_deposited_table(mm, pmdp);
+	add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR);
+	folio_put(folio);
+	ret = true;
+
+out:
+	spin_unlock(ptl);
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
 int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
 		 pmd_t *pmd, unsigned long addr)
 {
diff --git a/mm/rmap.c b/mm/rmap.c
index 2608c40dffad..a7913a454028 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1631,6 +1631,12 @@  static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
 	if (flags & TTU_SYNC)
 		pvmw.flags = PVMW_SYNC;
 
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+	if (flags & TTU_LAZYFREE_THP)
+		if (discard_trans_pmd(vma, address, folio))
+			return true;
+#endif
+
 	if (flags & TTU_SPLIT_HUGE_PMD)
 		split_huge_pmd_address(vma, address, false, folio);
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 49bd94423961..e2686cc0c037 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1277,6 +1277,13 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+			if (folio_test_anon(folio) && !was_swapbacked &&
+			    (flags & TTU_SPLIT_HUGE_PMD))
+				flags |= TTU_LAZYFREE_THP;
+#endif
+
 			/*
 			 * Without TTU_SYNC, try_to_unmap will only begin to
 			 * hold PTL from the first present PTE within a large