Message ID | 20240422055213.60231-1-ioworker0@gmail.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | [v2,1/1] mm/vmscan: avoid split PMD-mapped THP during shrink_folio_list() | expand |
On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote: > When the user no longer requires the pages, they would use > madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not > typically rewrite to the given range. > > At present, PMD-mapped THPs that are marked as lazyfree during > shrink_folio_list() are unconditionally split, which may be unnecessary. > If the THP is clean, its PMD is also clean, and there are no unexpected > references, then we can attempt to remove the PMD mapping from it. This > change will improve the efficiency of memory reclamation in this case. Does this happen outside of benchmarks? I'm really struggling to see how we end up in this situation. We have a clean THP without swap backing, so it's full of zeroes, but for some reason we haven't used the shared huge zero page? What is going on?
On 24.04.24 06:15, Matthew Wilcox wrote: > On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote: >> When the user no longer requires the pages, they would use >> madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not >> typically rewrite to the given range. >> >> At present, PMD-mapped THPs that are marked as lazyfree during >> shrink_folio_list() are unconditionally split, which may be unnecessary. >> If the THP is clean, its PMD is also clean, and there are no unexpected >> references, then we can attempt to remove the PMD mapping from it. This >> change will improve the efficiency of memory reclamation in this case. > > Does this happen outside of benchmarks? I'm really struggling to see > how we end up in this situation. We have a clean THP without swap > backing, so it's full of zeroes, but for some reason we haven't used the > shared huge zero page? What is going on? It's not full of zeroes. User space called MADV_FREE on a PMD-mapped THP. During MADV_FREE, we mark the PTEs as clean, the folio as clean and sd "lazyfree" (no swap backend). If, during memory reclaim, we detect that (a) the folio is still clean (b) the PTEs are still clean and (c) there are no unexpected references (GUP), user space didn't re-write to that memory again, so we can just discard the memory "lazily".
Hey Matthew, Thanks for taking time to review. On Wed, Apr 24, 2024 at 12:15 PM Matthew Wilcox <willy@infradead.org> wrote: > > On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote: > > When the user no longer requires the pages, they would use > > madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not > > typically rewrite to the given range. > > > > At present, PMD-mapped THPs that are marked as lazyfree during > > shrink_folio_list() are unconditionally split, which may be unnecessary. > > If the THP is clean, its PMD is also clean, and there are no unexpected "If the THP is clean, its PMD is also clean" can be confusing - sorry. It should be modified to "If the THP and its PMD are both marked as clean". Thanks, Lance > > references, then we can attempt to remove the PMD mapping from it. This > > change will improve the efficiency of memory reclamation in this case. > > Does this happen outside of benchmarks? I'm really struggling to see > how we end up in this situation. We have a clean THP without swap > backing, so it's full of zeroes, but for some reason we haven't used the > shared huge zero page? What is going on? >
On 24 Apr 2024, at 3:17, David Hildenbrand wrote: > On 24.04.24 06:15, Matthew Wilcox wrote: >> On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote: >>> When the user no longer requires the pages, they would use >>> madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not >>> typically rewrite to the given range. >>> >>> At present, PMD-mapped THPs that are marked as lazyfree during >>> shrink_folio_list() are unconditionally split, which may be unnecessary. >>> If the THP is clean, its PMD is also clean, and there are no unexpected >>> references, then we can attempt to remove the PMD mapping from it. This >>> change will improve the efficiency of memory reclamation in this case. >> >> Does this happen outside of benchmarks? I'm really struggling to see >> how we end up in this situation. We have a clean THP without swap >> backing, so it's full of zeroes, but for some reason we haven't used the >> shared huge zero page? What is going on? > > It's not full of zeroes. > > User space called MADV_FREE on a PMD-mapped THP. > > During MADV_FREE, we mark the PTEs as clean, the folio as clean and sd "lazyfree" (no swap backend). If, during memory reclaim, we detect that (a) the folio is still clean (b) the PTEs are still clean and (c) there are no unexpected references (GUP), user space didn't re-write to that memory again, so we can just discard the memory "lazily". It seems that try_to_unmap_one() does not support unmapping PMD-mapped folios. Maybe adding that support instead of a special case handling? -- Best Regards, Yan, Zi
On 24.04.24 17:57, Zi Yan wrote: > On 24 Apr 2024, at 3:17, David Hildenbrand wrote: > >> On 24.04.24 06:15, Matthew Wilcox wrote: >>> On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote: >>>> When the user no longer requires the pages, they would use >>>> madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not >>>> typically rewrite to the given range. >>>> >>>> At present, PMD-mapped THPs that are marked as lazyfree during >>>> shrink_folio_list() are unconditionally split, which may be unnecessary. >>>> If the THP is clean, its PMD is also clean, and there are no unexpected >>>> references, then we can attempt to remove the PMD mapping from it. This >>>> change will improve the efficiency of memory reclamation in this case. >>> >>> Does this happen outside of benchmarks? I'm really struggling to see >>> how we end up in this situation. We have a clean THP without swap >>> backing, so it's full of zeroes, but for some reason we haven't used the >>> shared huge zero page? What is going on? >> >> It's not full of zeroes. >> >> User space called MADV_FREE on a PMD-mapped THP. >> >> During MADV_FREE, we mark the PTEs as clean, the folio as clean and sd "lazyfree" (no swap backend). If, during memory reclaim, we detect that (a) the folio is still clean (b) the PTEs are still clean and (c) there are no unexpected references (GUP), user space didn't re-write to that memory again, so we can just discard the memory "lazily". > > It seems that try_to_unmap_one() does not support unmapping PMD-mapped folios. > Maybe adding that support instead of a special case handling? I was thinking the same, and finding a way to avoid TTU_LAZYFREE_THP.
On Wed, 24 Apr 2024 23:46:59 +0800 Lance Yang <ioworker0@gmail.com> wrote: > On Wed, Apr 24, 2024 at 12:15 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote: > > > When the user no longer requires the pages, they would use > > > madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not > > > typically rewrite to the given range. > > > > > > At present, PMD-mapped THPs that are marked as lazyfree during > > > shrink_folio_list() are unconditionally split, which may be unnecessary. > > > If the THP is clean, its PMD is also clean, and there are no unexpected > > "If the THP is clean, its PMD is also clean" can be confusing - sorry. It should > be modified to "If the THP and its PMD are both marked as clean". I made that changelog edit.
Hey Zi, David, Thanks for taking time to review! On Wed, Apr 24, 2024 at 11:58 PM David Hildenbrand <david@redhat.com> wrote: > > On 24.04.24 17:57, Zi Yan wrote: > > On 24 Apr 2024, at 3:17, David Hildenbrand wrote: > > > >> On 24.04.24 06:15, Matthew Wilcox wrote: > >>> On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote: > >>>> When the user no longer requires the pages, they would use > >>>> madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not > >>>> typically rewrite to the given range. > >>>> > >>>> At present, PMD-mapped THPs that are marked as lazyfree during > >>>> shrink_folio_list() are unconditionally split, which may be unnecessary. > >>>> If the THP is clean, its PMD is also clean, and there are no unexpected > >>>> references, then we can attempt to remove the PMD mapping from it. This > >>>> change will improve the efficiency of memory reclamation in this case. > >>> > >>> Does this happen outside of benchmarks? I'm really struggling to see > >>> how we end up in this situation. We have a clean THP without swap > >>> backing, so it's full of zeroes, but for some reason we haven't used the > >>> shared huge zero page? What is going on? > >> > >> It's not full of zeroes. > >> > >> User space called MADV_FREE on a PMD-mapped THP. > >> > >> During MADV_FREE, we mark the PTEs as clean, the folio as clean and sd "lazyfree" (no swap backend). If, during memory reclaim, we detect that (a) the folio is still clean (b) the PTEs are still clean and (c) there are no unexpected references (GUP), user space didn't re-write to that memory again, so we can just discard the memory "lazily". > > > > It seems that try_to_unmap_one() does not support unmapping PMD-mapped folios. > > Maybe adding that support instead of a special case handling? > > I was thinking the same, and finding a way to avoid TTU_LAZYFREE_THP. Thanks for the suggestions! Yep, I completely agreed. Adding support for unmapping PMD-mapped folios to try_to_unmap_one() would make it more future-proof. Thanks again for the review! Lance > > -- > Cheers, > > David / dhildenb >
On Thu, Apr 25, 2024 at 5:20 AM Andrew Morton <akpm@linux-foundation.org> wrote: > > On Wed, 24 Apr 2024 23:46:59 +0800 Lance Yang <ioworker0@gmail.com> wrote: > > > On Wed, Apr 24, 2024 at 12:15 PM Matthew Wilcox <willy@infradead.org> wrote: > > > > > > On Mon, Apr 22, 2024 at 01:52:13PM +0800, Lance Yang wrote: > > > > When the user no longer requires the pages, they would use > > > > madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not > > > > typically rewrite to the given range. > > > > > > > > At present, PMD-mapped THPs that are marked as lazyfree during > > > > shrink_folio_list() are unconditionally split, which may be unnecessary. > > > > If the THP is clean, its PMD is also clean, and there are no unexpected > > > > "If the THP is clean, its PMD is also clean" can be confusing - sorry. It should > > be modified to "If the THP and its PMD are both marked as clean". > > I made that changelog edit. Thanks for updating, Andrew! Lance
Hey Zi, David, How about this change(diff against mm-unstable) as follows? I'd like to add __try_to_unmap_huge_pmd() as a new internal function specifically for unmapping PMD-mapped folios. If, for any reason, we cannot unmap the folio, then we'll still split it as previously done. Currently, __try_to_unmap_huge_pmd() only handles lazyfree THPs, but it can be extended to support other large folios that are PMD-mapped in the future if needed. diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 670218f762c8..0f906dc6d280 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -100,8 +100,6 @@ enum ttu_flags { * do a final flush if necessary */ TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock: * caller holds it */ - TTU_LAZYFREE_THP = 0x100, /* avoid splitting PMD-mapped THPs - * that are marked as lazyfree. */ }; #ifdef CONFIG_MMU diff --git a/mm/rmap.c b/mm/rmap.c index a7913a454028..879c8923abfc 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1606,6 +1606,19 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page, #endif } +static bool __try_to_unmap_huge_pmd(struct vm_area_struct *vma, + unsigned long addr, struct folio *folio) +{ + VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio); + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) + return discard_trans_pmd(vma, addr, folio); +#endif + + return false; +} + /* * @arg: enum ttu_flags will be passed to this argument */ @@ -1631,14 +1644,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, if (flags & TTU_SYNC) pvmw.flags = PVMW_SYNC; -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (flags & TTU_LAZYFREE_THP) - if (discard_trans_pmd(vma, address, folio)) + if (flags & TTU_SPLIT_HUGE_PMD) { + if (__try_to_unmap_huge_pmd(vma, address, folio)) return true; -#endif - - if (flags & TTU_SPLIT_HUGE_PMD) split_huge_pmd_address(vma, address, false, folio); + } /* * For THP, we have to assume the worse case ie pmd for invalidation. diff --git a/mm/vmscan.c b/mm/vmscan.c index e2686cc0c037..49bd94423961 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1277,13 +1277,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, if (folio_test_pmd_mappable(folio)) flags |= TTU_SPLIT_HUGE_PMD; - -#ifdef CONFIG_TRANSPARENT_HUGEPAGE - if (folio_test_anon(folio) && !was_swapbacked && - (flags & TTU_SPLIT_HUGE_PMD)) - flags |= TTU_LAZYFREE_THP; -#endif - /* * Without TTU_SYNC, try_to_unmap will only begin to * hold PTL from the first present PTE within a large
On 25.04.24 10:50, Lance Yang wrote: > Hey Zi, David, Hi, > > How about this change(diff against mm-unstable) as follows? goes into the right direction, please resent the whole thing, that will make it easier to review. > > I'd like to add __try_to_unmap_huge_pmd() as a new internal function > specifically for unmapping PMD-mapped folios. If, for any reason, we cannot > unmap the folio, then we'll still split it as previously done. > > Currently, __try_to_unmap_huge_pmd() only handles lazyfree THPs, but it > can be extended to support other large folios that are PMD-mapped in the > future if needed. > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > index 670218f762c8..0f906dc6d280 100644 > --- a/include/linux/rmap.h > +++ b/include/linux/rmap.h > @@ -100,8 +100,6 @@ enum ttu_flags { > * do a final flush if necessary */ > TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock: > * caller holds it */ > - TTU_LAZYFREE_THP = 0x100, /* avoid splitting PMD-mapped THPs > - * that are marked as lazyfree. */ > }; > > #ifdef CONFIG_MMU > diff --git a/mm/rmap.c b/mm/rmap.c > index a7913a454028..879c8923abfc 100644 > --- a/mm/rmap.c > +++ b/mm/rmap.c > @@ -1606,6 +1606,19 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page, > #endif > } > > +static bool __try_to_unmap_huge_pmd(struct vm_area_struct *vma, > + unsigned long addr, struct folio *folio) > +{ > + VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio); > + > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > + if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) > + return discard_trans_pmd(vma, addr, folio); > +#endif > + > + return false; > +} > + > /* > * @arg: enum ttu_flags will be passed to this argument > */ > @@ -1631,14 +1644,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, > if (flags & TTU_SYNC) > pvmw.flags = PVMW_SYNC; > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE > - if (flags & TTU_LAZYFREE_THP) > - if (discard_trans_pmd(vma, address, folio)) > + if (flags & TTU_SPLIT_HUGE_PMD) { > + if (__try_to_unmap_huge_pmd(vma, address, folio)) > return true; > -#endif > - > - if (flags & TTU_SPLIT_HUGE_PMD) > split_huge_pmd_address(vma, address, false, folio); > + } > I was wondering if we can better integrate that into the pagewalk below. That is, don't do the TTU_SPLIT_HUGE_PMD immediately. Start the pagewalk first. If we walk a PMD, try to unmap it. Only if that fails, split it. Less working on "vma + address" and instead directly on PMDs.
On Thu, Apr 25, 2024 at 5:01 PM David Hildenbrand <david@redhat.com> wrote: > > On 25.04.24 10:50, Lance Yang wrote: > > Hey Zi, David, > > Hi, > > > > > How about this change(diff against mm-unstable) as follows? > > goes into the right direction, please resent the whole thing, that will > make it easier to review. Got it. I‘ll keep that in mind, thanks! > > > > > I'd like to add __try_to_unmap_huge_pmd() as a new internal function > > specifically for unmapping PMD-mapped folios. If, for any reason, we cannot > > unmap the folio, then we'll still split it as previously done. > > > > Currently, __try_to_unmap_huge_pmd() only handles lazyfree THPs, but it > > can be extended to support other large folios that are PMD-mapped in the > > future if needed. > > > > diff --git a/include/linux/rmap.h b/include/linux/rmap.h > > index 670218f762c8..0f906dc6d280 100644 > > --- a/include/linux/rmap.h > > +++ b/include/linux/rmap.h > > @@ -100,8 +100,6 @@ enum ttu_flags { > > * do a final flush if necessary */ > > TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock: > > * caller holds it */ > > - TTU_LAZYFREE_THP = 0x100, /* avoid splitting PMD-mapped THPs > > - * that are marked as lazyfree. */ > > }; > > > > #ifdef CONFIG_MMU > > diff --git a/mm/rmap.c b/mm/rmap.c > > index a7913a454028..879c8923abfc 100644 > > --- a/mm/rmap.c > > +++ b/mm/rmap.c > > @@ -1606,6 +1606,19 @@ void folio_remove_rmap_pmd(struct folio *folio, struct page *page, > > #endif > > } > > > > +static bool __try_to_unmap_huge_pmd(struct vm_area_struct *vma, > > + unsigned long addr, struct folio *folio) > > +{ > > + VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio); > > + > > +#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > + if (folio_test_anon(folio) && !folio_test_swapbacked(folio)) > > + return discard_trans_pmd(vma, addr, folio); > > +#endif > > + > > + return false; > > +} > > + > > /* > > * @arg: enum ttu_flags will be passed to this argument > > */ > > @@ -1631,14 +1644,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, > > if (flags & TTU_SYNC) > > pvmw.flags = PVMW_SYNC; > > > > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE > > - if (flags & TTU_LAZYFREE_THP) > > - if (discard_trans_pmd(vma, address, folio)) > > + if (flags & TTU_SPLIT_HUGE_PMD) { > > + if (__try_to_unmap_huge_pmd(vma, address, folio)) > > return true; > > -#endif > > - > > - if (flags & TTU_SPLIT_HUGE_PMD) > > split_huge_pmd_address(vma, address, false, folio); > > + } > > > > I was wondering if we can better integrate that into the pagewalk below. > > That is, don't do the TTU_SPLIT_HUGE_PMD immediately. Start the pagewalk > first. If we walk a PMD, try to unmap it. Only if that fails, split it. Nice. Thanks for the suggestion! I'll work on integrating it into the pagewalk as you suggested. > > Less working on "vma + address" and instead directly on PMDs. Yes, some of the work on "vma + address" can be avoided :) Thanks again for the review! Lance > > -- > Cheers, > > David / dhildenb >
>> I was wondering if we can better integrate that into the pagewalk below. >> >> That is, don't do the TTU_SPLIT_HUGE_PMD immediately. Start the pagewalk >> first. If we walk a PMD, try to unmap it. Only if that fails, split it. > > Nice. Thanks for the suggestion! > I'll work on integrating it into the pagewalk as you suggested. > >> >> Less working on "vma + address" and instead directly on PMDs. > > Yes, some of the work on "vma + address" can be avoided :) Doing the conditional split while in the pagewalk will be the interesting bit to sort out (we temporarily have to drop the PTL and start once again from that now-PTE-mapped page table). But it should be a reasonable thing to have. Please let us know if you run into bigger issues with that! See walk_pmd_range() as an inspiration where we call split_huge_pmd().
On Thu, Apr 25, 2024 at 5:25 PM David Hildenbrand <david@redhat.com> wrote: > > >> I was wondering if we can better integrate that into the pagewalk below. > >> > >> That is, don't do the TTU_SPLIT_HUGE_PMD immediately. Start the pagewalk > >> first. If we walk a PMD, try to unmap it. Only if that fails, split it. > > > > Nice. Thanks for the suggestion! > > I'll work on integrating it into the pagewalk as you suggested. > > > >> > >> Less working on "vma + address" and instead directly on PMDs. > > > > Yes, some of the work on "vma + address" can be avoided :) > > Doing the conditional split while in the pagewalk will be the > interesting bit to sort out (we temporarily have to drop the PTL and > start once again from that now-PTE-mapped page table). But it should be > a reasonable thing to have. Yep, it's an interesting challenge for me, but definitely worth tackling ;) > > Please let us know if you run into bigger issues with that! Thanks, I'll do my best to sort it out and reach out for help if needed :p > > See walk_pmd_range() as an inspiration where we call split_huge_pmd(). Thanks for your suggestion! I'll take a closer look at it. Thanks again for clarifying! Lance > > -- > Cheers, > > David / dhildenb >
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h index 7cd07b83a3d0..56c7ea73090b 100644 --- a/include/linux/huge_mm.h +++ b/include/linux/huge_mm.h @@ -36,6 +36,8 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr, int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr, pgprot_t newprot, unsigned long cp_flags); +bool discard_trans_pmd(struct vm_area_struct *vma, unsigned long addr, + struct folio *folio); vm_fault_t vmf_insert_pfn_pmd(struct vm_fault *vmf, pfn_t pfn, bool write); vm_fault_t vmf_insert_pfn_pud(struct vm_fault *vmf, pfn_t pfn, bool write); diff --git a/include/linux/rmap.h b/include/linux/rmap.h index 0f906dc6d280..670218f762c8 100644 --- a/include/linux/rmap.h +++ b/include/linux/rmap.h @@ -100,6 +100,8 @@ enum ttu_flags { * do a final flush if necessary */ TTU_RMAP_LOCKED = 0x80, /* do not grab rmap lock: * caller holds it */ + TTU_LAZYFREE_THP = 0x100, /* avoid splitting PMD-mapped THPs + * that are marked as lazyfree. */ }; #ifdef CONFIG_MMU diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 824eff9211db..63de1445feab 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1810,6 +1810,94 @@ static inline void zap_deposited_table(struct mm_struct *mm, pmd_t *pmd) mm_dec_nr_ptes(mm); } +bool discard_trans_pmd(struct vm_area_struct *vma, unsigned long addr, + struct folio *folio) +{ + struct mm_struct *mm = vma->vm_mm; + struct mmu_notifier_range range; + int ref_count, map_count; + struct mmu_gather tlb; + pmd_t *pmdp, orig_pmd; + struct page *page; + bool ret = false; + spinlock_t *ptl; + + VM_WARN_ON_FOLIO(!folio_test_anon(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_locked(folio), folio); + VM_WARN_ON_FOLIO(folio_test_swapbacked(folio), folio); + VM_WARN_ON_FOLIO(!folio_test_pmd_mappable(folio), folio); + + /* Perform best-effort early checks before acquiring the PMD lock */ + if (folio_ref_count(folio) != folio_mapcount(folio) + 1 || + folio_test_dirty(folio)) + return false; + + pmdp = mm_find_pmd(mm, addr); + if (unlikely(!pmdp)) + return false; + if (pmd_dirty(*pmdp)) + return false; + + tlb_gather_mmu(&tlb, mm); + mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, + addr & HPAGE_PMD_MASK, + (addr & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE); + mmu_notifier_invalidate_range_start(&range); + + ptl = pmd_lock(mm, pmdp); + orig_pmd = *pmdp; + if (unlikely(!pmd_present(orig_pmd) || !pmd_trans_huge(orig_pmd))) + goto out; + + page = pmd_page(orig_pmd); + if (unlikely(page_folio(page) != folio)) + goto out; + + orig_pmd = pmdp_huge_get_and_clear(mm, addr, pmdp); + tlb_remove_pmd_tlb_entry(&tlb, pmdp, addr); + + /* + * Syncing against concurrent GUP-fast: + * - clear PMD; barrier; read refcount + * - inc refcount; barrier; read PMD + */ + smp_mb(); + + ref_count = folio_ref_count(folio); + map_count = folio_mapcount(folio); + + /* + * Order reads for folio refcount and dirty flag + * (see comments in __remove_mapping()). + */ + smp_rmb(); + + /* + * If the PMD or folio is redirtied at this point, or if there are + * unexpected references, we will give up to discard this folio + * and remap it. + * + * The only folio refs must be one from isolation plus the rmap(s). + */ + if (ref_count != map_count + 1 || folio_test_dirty(folio) || + pmd_dirty(orig_pmd)) { + set_pmd_at(mm, addr, pmdp, orig_pmd); + goto out; + } + + folio_remove_rmap_pmd(folio, page, vma); + zap_deposited_table(mm, pmdp); + add_mm_counter(mm, MM_ANONPAGES, -HPAGE_PMD_NR); + folio_put(folio); + ret = true; + +out: + spin_unlock(ptl); + mmu_notifier_invalidate_range_end(&range); + + return ret; +} + int zap_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr) { diff --git a/mm/rmap.c b/mm/rmap.c index 2608c40dffad..a7913a454028 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1631,6 +1631,12 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma, if (flags & TTU_SYNC) pvmw.flags = PVMW_SYNC; +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (flags & TTU_LAZYFREE_THP) + if (discard_trans_pmd(vma, address, folio)) + return true; +#endif + if (flags & TTU_SPLIT_HUGE_PMD) split_huge_pmd_address(vma, address, false, folio); diff --git a/mm/vmscan.c b/mm/vmscan.c index 49bd94423961..e2686cc0c037 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -1277,6 +1277,13 @@ static unsigned int shrink_folio_list(struct list_head *folio_list, if (folio_test_pmd_mappable(folio)) flags |= TTU_SPLIT_HUGE_PMD; + +#ifdef CONFIG_TRANSPARENT_HUGEPAGE + if (folio_test_anon(folio) && !was_swapbacked && + (flags & TTU_SPLIT_HUGE_PMD)) + flags |= TTU_LAZYFREE_THP; +#endif + /* * Without TTU_SYNC, try_to_unmap will only begin to * hold PTL from the first present PTE within a large
When the user no longer requires the pages, they would use madvise(MADV_FREE) to mark the pages as lazy free. IMO, they would not typically rewrite to the given range. At present, PMD-mapped THPs that are marked as lazyfree during shrink_folio_list() are unconditionally split, which may be unnecessary. If the THP is clean, its PMD is also clean, and there are no unexpected references, then we can attempt to remove the PMD mapping from it. This change will improve the efficiency of memory reclamation in this case. On an Intel i5 CPU, reclaiming 1GiB of PMD-mapped THPs using mem_cgroup_force_empty() results in the following runtimes in seconds (shorter is better): -------------------------------------------- | Old | New | Change | -------------------------------------------- | 0.683426 | 0.049197 | -92.80% | -------------------------------------------- Signed-off-by: Lance Yang <ioworker0@gmail.com> --- v1 -> v2: - Update the changelog - Follow the exact same logic as in try_to_unmap_one() (per David Hildenbrand) - Remove the extra code from rmap.c (per Matthew Wilcox) - https://lore.kernel.org/linux-mm/20240417141111.77855-1-ioworker0@gmail.com include/linux/huge_mm.h | 2 + include/linux/rmap.h | 2 + mm/huge_memory.c | 88 +++++++++++++++++++++++++++++++++++++++++ mm/rmap.c | 6 +++ mm/vmscan.c | 7 ++++ 5 files changed, 105 insertions(+)