[0/3] mm: batched unmap lazyfree large folios during reclamation

Message ID 20250106031711.82855-1-21cnbao@gmail.com

Message

Barry Song Jan. 6, 2025, 3:17 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

Commit 735ecdfaf4e80 ("mm/vmscan: avoid splitting lazyfree THP during 
shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in madvise.c. 
However, those folios are still added to the deferred_split list in 
try_to_unmap_one() because we are unmapping PTEs and removing rmap entries 
one by one. This approach is not only slow but also increases the risk of a 
race condition where lazyfree folios are incorrectly set back to swapbacked, 
as a speculative folio_get may occur in the shrinker's callback.
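
For reference, the existing per-PTE handling of lazyfree folios in
try_to_unmap_one() looks roughly like the following (simplified, not the
literal mm/rmap.c code):

if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) {
	int ref_count, map_count;

	smp_mb();
	ref_count = folio_ref_count(folio);
	map_count = folio_mapcount(folio);

	/*
	 * Only discard a clean lazyfree folio whose single extra
	 * reference is the one taken for isolation.
	 */
	if (ref_count == map_count + 1 && !folio_test_dirty(folio))
		goto discard;

	/*
	 * Everything else, including a transient extra reference from
	 * a speculative folio_get() in the deferred-split shrinker,
	 * currently re-marks the folio swapbacked, so a clean lazyfree
	 * folio can be wrongly retained and later swapped out.
	 */
	set_pte_at(mm, address, pvmw.pte, pteval);
	folio_set_swapbacked(folio);
	ret = false;
}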

This patchset addresses the issue by only marking truly dirty folios as
swapbacked, as suggested by David, and by shifting to batched unmapping of
the entire folio in try_to_unmap_one(). As a result, we've observed
deferred_split dropping to zero and significant performance improvements
in memory reclamation.
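
A rough sketch of the direction the series takes is below; the helper name
and exact arguments are made up for illustration, while the real change is
open-coded inside try_to_unmap_one():

/* Illustrative only: reclaim a fully PTE-mapped lazyfree mTHP in one batch. */
static bool discard_lazyfree_folio_batched(struct mm_struct *mm,
					   struct vm_area_struct *vma,
					   unsigned long addr, pte_t *ptep,
					   struct folio *folio)
{
	int nr = folio_nr_pages(folio);
	pte_t pteval;

	/* Clear all PTEs mapping the folio at once instead of one by one. */
	pteval = get_and_clear_full_ptes(mm, addr, ptep, nr, 0);

	/* Redirtied via any PTE or the folio flag: truly dirty, keep it. */
	if (pte_dirty(pteval) || folio_test_dirty(folio)) {
		folio_set_swapbacked(folio);
		/* (the real code also restores the cleared PTEs here) */
		return false;
	}

	/*
	 * Drop all rmap references in one step, so the folio never looks
	 * partially mapped and is never queued for deferred split.
	 */
	folio_remove_rmap_ptes(folio, folio_page(folio, 0), nr, vma);
	add_mm_counter(mm, MM_ANONPAGES, -nr);
	return true;
}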

Barry Song (3):
  mm: set folio swapbacked iff folios are dirty in try_to_unmap_one
  mm: Support tlbbatch flush for a range of PTEs
  mm: Support batched unmap for lazyfree large folios during reclamation

 arch/arm64/include/asm/tlbflush.h |  26 ++++----
 arch/arm64/mm/contpte.c           |   2 +-
 arch/x86/include/asm/tlbflush.h   |   3 +-
 mm/rmap.c                         | 103 ++++++++++++++++++++----------
 4 files changed, 85 insertions(+), 49 deletions(-)

Comments

Lorenzo Stoakes Jan. 6, 2025, 5:28 p.m. UTC | #1
On Mon, Jan 06, 2025 at 04:17:08PM +1300, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
>
> Commit 735ecdfaf4e80 ("mm/vmscan: avoid splitting lazyfree THP during
> shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in madvise.c.
> However, those folios are still added to the deferred_split list in
> try_to_unmap_one() because we are unmapping PTEs and removing rmap entries
> one by one. This approach is not only slow but also increases the risk of a
> race condition where lazyfree folios are incorrectly set back to swapbacked,
> as a speculative folio_get may occur in the shrinker's callback.
>
> This patchset addresses the issue by only marking truly dirty folios as
> swapbacked as suggested by David and shifting to batched unmapping of the
> entire folio in try_to_unmap_one(). As a result, we've observed
> deferred_split dropping to zero and significant performance improvements
> in memory reclamation.

You've not provided any numbers? What performance improvements? Under what
workloads?

You're adding a bunch of complexity here, so I feel like we need to see
some numbers, background, etc.?

Thanks!

>
> Barry Song (3):
>   mm: set folio swapbacked iff folios are dirty in try_to_unmap_one
>   mm: Support tlbbatch flush for a range of PTEs
>   mm: Support batched unmap for lazyfree large folios during reclamation
>
>  arch/arm64/include/asm/tlbflush.h |  26 ++++----
>  arch/arm64/mm/contpte.c           |   2 +-
>  arch/x86/include/asm/tlbflush.h   |   3 +-
>  mm/rmap.c                         | 103 ++++++++++++++++++++----------
>  4 files changed, 85 insertions(+), 49 deletions(-)
>
> --
> 2.39.3 (Apple Git-146)
>
Barry Song Jan. 6, 2025, 7:15 p.m. UTC | #2
On Tue, Jan 7, 2025 at 6:28 AM Lorenzo Stoakes
<lorenzo.stoakes@oracle.com> wrote:
>
> On Mon, Jan 06, 2025 at 04:17:08PM +1300, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > Commit 735ecdfaf4e80 ("mm/vmscan: avoid splitting lazyfree THP during
> > shrink_folio_list()") prevents the splitting of MADV_FREE'd THP in madvise.c.
> > However, those folios are still added to the deferred_split list in
> > try_to_unmap_one() because we are unmapping PTEs and removing rmap entries
> > one by one. This approach is not only slow but also increases the risk of a
> > race condition where lazyfree folios are incorrectly set back to swapbacked,
> > as a speculative folio_get may occur in the shrinker's callback.
> >
> > This patchset addresses the issue by only marking truly dirty folios as
> > swapbacked as suggested by David and shifting to batched unmapping of the
> > entire folio in try_to_unmap_one(). As a result, we've observed
> > deferred_split dropping to zero and significant performance improvements
> > in memory reclamation.
>
> You've not provided any numbers? What performance improvements? Under what
> workloads?

The numbers can be found in patch 3/3 at the following link:
https://lore.kernel.org/linux-mm/20250106031711.82855-4-21cnbao@gmail.com/

Reclaiming lazyfree mTHP will now be significantly faster. Additionally,
this series fixes the misleading split_deferred counter: it was intended
to track operations such as unaligned unmap/madvise, but in practice the
majority of split_deferred events came from memory reclamation of aligned
lazyfree mTHP, which made the counter highly misleading.
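
To make that concrete, the old per-PTE path behaves roughly as below for
a 16-page lazyfree mTHP (illustrative only, not the literal rmap loop):

	int i;

	for (i = 0; i < 16; i++) {
		ptep_get_and_clear(mm, addr + i * PAGE_SIZE, ptep + i);
		folio_remove_rmap_pte(folio, folio_page(folio, i), vma);
		/*
		 * After the first iteration the folio is partially
		 * mapped, so the rmap code queues it for deferred split
		 * and bumps split_deferred, even though every mapping of
		 * this folio is about to go away anyway.
		 */
	}

With the batched unmap, the folio goes from fully mapped to unmapped in
one step and is never queued, so split_deferred again reflects genuine
partial unmaps.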

>
> You're adding a bunch of complexity here, so I feel like we need to see
> some numbers, background, etc.?

Agreed, I'll provide more details in v2. In the meantime, you can
find additional background information here:

https://lore.kernel.org/linux-mm/CAGsJ_4wOL6TLa3FKQASdrGfuqqu=14EuxAtpKmnebiGLm0dnfA@mail.gmail.com/

>
> Thanks!
>
> >
> > Barry Song (3):
> >   mm: set folio swapbacked iff folios are dirty in try_to_unmap_one
> >   mm: Support tlbbatch flush for a range of PTEs
> >   mm: Support batched unmap for lazyfree large folios during reclamation
> >
> >  arch/arm64/include/asm/tlbflush.h |  26 ++++----
> >  arch/arm64/mm/contpte.c           |   2 +-
> >  arch/x86/include/asm/tlbflush.h   |   3 +-
> >  mm/rmap.c                         | 103 ++++++++++++++++++++----------
> >  4 files changed, 85 insertions(+), 49 deletions(-)
> >
> > --
> > 2.39.3 (Apple Git-146)

Thanks
Barry