
[v4,6/6] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD

Message ID 20240311150058.1122862-7-ryan.roberts@arm.com (mailing list archive)
State New
Series Swap-out mTHP without splitting

Commit Message

Ryan Roberts March 11, 2024, 3 p.m. UTC
Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
folio that is fully and contiguously mapped in the pageout/cold vm
range. This change means that large folios will be maintained all the
way to swap storage. This both improves performance during swap-out, by
eliding the cost of splitting the folio, and sets us up nicely for
maintaining the large folio when it is swapped back in (to be covered in
a separate series).

Folios that are not fully mapped in the target range are still split,
but note that the behavior is changed so that if the split fails for any
reason (folio locked, shared, etc.) we now leave it as is and move to the
next pte in the range, continuing work on the remaining folios.
Previously any failure of this sort would cause the entire operation to
give up, and no folios mapped at higher addresses were paged out or made
cold. Given that large folios are becoming more common, the old behavior
would likely have led to wasted opportunities.

While we are at it, change the code that clears young from the ptes to
use ptep_test_and_clear_young(), which is more efficient than
get_and_clear/modify/set, especially for contpte mappings on arm64,
where the old approach would require unfolding/refolding and the new
approach can be done in place.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
 1 file changed, 51 insertions(+), 38 deletions(-)

Comments

Barry Song March 13, 2024, 7:19 a.m. UTC | #1
On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> folio that is fully and contiguously mapped in the pageout/cold vm
> range. This change means that large folios will be maintained all the
> way to swap storage. This both improves performance during swap-out, by
> eliding the cost of splitting the folio, and sets us up nicely for
> maintaining the large folio when it is swapped back in (to be covered in
> a separate series).
>
> Folios that are not fully mapped in the target range are still split,
> but note that behavior is changed so that if the split fails for any
> reason (folio locked, shared, etc) we now leave it as is and move to the
> next pte in the range and continue work on the proceeding folios.
> Previously any failure of this sort would cause the entire operation to
> give up and no folios mapped at higher addresses were paged out or made
> cold. Given large folios are becoming more common, this old behavior
> would have likely lead to wasted opportunities.
>
> While we are at it, change the code that clears young from the ptes to
> use ptep_test_and_clear_young(), which is more efficent than
> get_and_clear/modify/set, especially for contpte mappings on arm64,
> where the old approach would require unfolding/refolding and the new
> approach can be done in place.
>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>

This looks so much better than our initial RFC.
Thank you for your excellent work!

> ---
>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>  1 file changed, 51 insertions(+), 38 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 547dcd1f7a39..56c7ba7bd558 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>         LIST_HEAD(folio_list);
>         bool pageout_anon_only_filter;
>         unsigned int batch_count = 0;
> +       int nr;
>
>         if (fatal_signal_pending(current))
>                 return -EINTR;
> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>                 return 0;
>         flush_tlb_batched_pending(mm);
>         arch_enter_lazy_mmu_mode();
> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> +               nr = 1;
>                 ptent = ptep_get(pte);
>
>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>                         continue;
>
>                 /*
> -                * Creating a THP page is expensive so split it only if we
> -                * are sure it's worth. Split it if we are only owner.
> +                * If we encounter a large folio, only split it if it is not
> +                * fully mapped within the range we are operating on. Otherwise
> +                * leave it as is so that it can be swapped out whole. If we
> +                * fail to split a folio, leave it in place and advance to the
> +                * next pte in the range.
>                  */
>                 if (folio_test_large(folio)) {
> -                       int err;
> -
> -                       if (folio_estimated_sharers(folio) > 1)
> -                               break;
> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> -                               break;
> -                       if (!folio_trylock(folio))
> -                               break;
> -                       folio_get(folio);
> -                       arch_leave_lazy_mmu_mode();
> -                       pte_unmap_unlock(start_pte, ptl);
> -                       start_pte = NULL;
> -                       err = split_folio(folio);
> -                       folio_unlock(folio);
> -                       folio_put(folio);
> -                       if (err)
> -                               break;
> -                       start_pte = pte =
> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> -                       if (!start_pte)
> -                               break;
> -                       arch_enter_lazy_mmu_mode();
> -                       pte--;
> -                       addr -= PAGE_SIZE;
> -                       continue;
> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> +                                               FPB_IGNORE_SOFT_DIRTY;
> +                       int max_nr = (end - addr) / PAGE_SIZE;
> +
> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> +                                            fpb_flags, NULL);

I wonder if we have a quick way to avoid folio_pte_batch() if users
are doing madvise() on a portion of a large folio.

> +
> +                       if (nr < folio_nr_pages(folio)) {
> +                               int err;
> +
> +                               if (folio_estimated_sharers(folio) > 1)
> +                                       continue;
> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> +                                       continue;
> +                               if (!folio_trylock(folio))
> +                                       continue;
> +                               folio_get(folio);
> +                               arch_leave_lazy_mmu_mode();
> +                               pte_unmap_unlock(start_pte, ptl);
> +                               start_pte = NULL;
> +                               err = split_folio(folio);
> +                               folio_unlock(folio);
> +                               folio_put(folio);
> +                               if (err)
> +                                       continue;
> +                               start_pte = pte =
> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> +                               if (!start_pte)
> +                                       break;
> +                               arch_enter_lazy_mmu_mode();
> +                               nr = 0;
> +                               continue;
> +                       }
>                 }
>
>                 /*
>                  * Do not interfere with other mappings of this folio and
> -                * non-LRU folio.
> +                * non-LRU folio. If we have a large folio at this point, we
> +                * know it is fully mapped so if its mapcount is the same as its
> +                * number of pages, it must be exclusive.
>                  */
> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> +               if (!folio_test_lru(folio) ||
> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>                         continue;

This looks so perfect and is exactly what I wanted to achieve.

>
>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>                         continue;
>
> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> -
> -               if (!pageout && pte_young(ptent)) {
> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> -                                                       tlb->fullmm);
> -                       ptent = pte_mkold(ptent);
> -                       set_pte_at(mm, addr, pte, ptent);
> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> +               if (!pageout) {
> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> +                       }

This looks so smart. If it is not pageout, we have already advanced pte
and addr here, so nr is 0 and we don't need to advance again in
for (; addr < end; pte += nr, addr += nr * PAGE_SIZE).

Otherwise, nr won't be 0, so we will advance addr and pte by nr.
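
To double-check my understanding, here is a tiny userspace sketch of
just this advance-by-nr pattern (illustrative only, not kernel code; the
batch size of 4 is made up):

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
	unsigned long addr = 0, end = 8 * PAGE_SIZE;
	int nr;

	for (; addr < end; addr += nr * PAGE_SIZE) {
		nr = 4;	/* pretend folio_pte_batch() matched 4 entries */

		if (addr == 0) {
			/* "!pageout" case: the inner loop consumes the batch */
			for (; nr != 0; nr--, addr += PAGE_SIZE)
				printf("clear young at %#lx\n", addr);
			/* nr is 0 here, so the outer increment is a no-op */
			continue;
		}

		/* "pageout" case: nr stays 4; the outer loop skips the batch */
		printf("isolate batch at %#lx, nr = %d\n", addr, nr);
	}
	return 0;
}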


>                 }
>
>                 /*
> --
> 2.25.1
>

Overall, LGTM,

Reviewed-by: Barry Song <v-songbaohua@oppo.com>
Ryan Roberts March 13, 2024, 9:03 a.m. UTC | #2
On 13/03/2024 07:19, Barry Song wrote:
> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>> folio that is fully and contiguously mapped in the pageout/cold vm
>> range. This change means that large folios will be maintained all the
>> way to swap storage. This both improves performance during swap-out, by
>> eliding the cost of splitting the folio, and sets us up nicely for
>> maintaining the large folio when it is swapped back in (to be covered in
>> a separate series).
>>
>> Folios that are not fully mapped in the target range are still split,
>> but note that behavior is changed so that if the split fails for any
>> reason (folio locked, shared, etc) we now leave it as is and move to the
>> next pte in the range and continue work on the proceeding folios.
>> Previously any failure of this sort would cause the entire operation to
>> give up and no folios mapped at higher addresses were paged out or made
>> cold. Given large folios are becoming more common, this old behavior
>> would have likely lead to wasted opportunities.
>>
>> While we are at it, change the code that clears young from the ptes to
>> use ptep_test_and_clear_young(), which is more efficent than
>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>> where the old approach would require unfolding/refolding and the new
>> approach can be done in place.
>>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> 
> This looks so much better than our initial RFC.
> Thank you for your excellent work!

Thanks - it's a team effort - I had your PoC and David's previous batching
work to use as a template.

> 
>> ---
>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>  1 file changed, 51 insertions(+), 38 deletions(-)
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index 547dcd1f7a39..56c7ba7bd558 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>         LIST_HEAD(folio_list);
>>         bool pageout_anon_only_filter;
>>         unsigned int batch_count = 0;
>> +       int nr;
>>
>>         if (fatal_signal_pending(current))
>>                 return -EINTR;
>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>                 return 0;
>>         flush_tlb_batched_pending(mm);
>>         arch_enter_lazy_mmu_mode();
>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>> +               nr = 1;
>>                 ptent = ptep_get(pte);
>>
>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>                         continue;
>>
>>                 /*
>> -                * Creating a THP page is expensive so split it only if we
>> -                * are sure it's worth. Split it if we are only owner.
>> +                * If we encounter a large folio, only split it if it is not
>> +                * fully mapped within the range we are operating on. Otherwise
>> +                * leave it as is so that it can be swapped out whole. If we
>> +                * fail to split a folio, leave it in place and advance to the
>> +                * next pte in the range.
>>                  */
>>                 if (folio_test_large(folio)) {
>> -                       int err;
>> -
>> -                       if (folio_estimated_sharers(folio) > 1)
>> -                               break;
>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>> -                               break;
>> -                       if (!folio_trylock(folio))
>> -                               break;
>> -                       folio_get(folio);
>> -                       arch_leave_lazy_mmu_mode();
>> -                       pte_unmap_unlock(start_pte, ptl);
>> -                       start_pte = NULL;
>> -                       err = split_folio(folio);
>> -                       folio_unlock(folio);
>> -                       folio_put(folio);
>> -                       if (err)
>> -                               break;
>> -                       start_pte = pte =
>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>> -                       if (!start_pte)
>> -                               break;
>> -                       arch_enter_lazy_mmu_mode();
>> -                       pte--;
>> -                       addr -= PAGE_SIZE;
>> -                       continue;
>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>> +                                               FPB_IGNORE_SOFT_DIRTY;
>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>> +
>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>> +                                            fpb_flags, NULL);
> 
> I wonder if we have a quick way to avoid folio_pte_batch() if users
> are doing madvise() on a portion of a large folio.

Good idea. Something like this?:

	if (pte_pfn(pte) == folio_pfn(folio))
		nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
				     fpb_flags, NULL);

If we are not mapping the first page of the folio, then it can't be a full
mapping, so no need to call folio_pte_batch(). Just split it.

> 
>> +
>> +                       if (nr < folio_nr_pages(folio)) {
>> +                               int err;
>> +
>> +                               if (folio_estimated_sharers(folio) > 1)
>> +                                       continue;
>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>> +                                       continue;
>> +                               if (!folio_trylock(folio))
>> +                                       continue;
>> +                               folio_get(folio);
>> +                               arch_leave_lazy_mmu_mode();
>> +                               pte_unmap_unlock(start_pte, ptl);
>> +                               start_pte = NULL;
>> +                               err = split_folio(folio);
>> +                               folio_unlock(folio);
>> +                               folio_put(folio);
>> +                               if (err)
>> +                                       continue;
>> +                               start_pte = pte =
>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>> +                               if (!start_pte)
>> +                                       break;
>> +                               arch_enter_lazy_mmu_mode();
>> +                               nr = 0;
>> +                               continue;
>> +                       }
>>                 }
>>
>>                 /*
>>                  * Do not interfere with other mappings of this folio and
>> -                * non-LRU folio.
>> +                * non-LRU folio. If we have a large folio at this point, we
>> +                * know it is fully mapped so if its mapcount is the same as its
>> +                * number of pages, it must be exclusive.
>>                  */
>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>> +               if (!folio_test_lru(folio) ||
>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>                         continue;
> 
> This looks so perfect and is exactly what I wanted to achieve.
> 
>>
>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>                         continue;
>>
>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>> -
>> -               if (!pageout && pte_young(ptent)) {
>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>> -                                                       tlb->fullmm);
>> -                       ptent = pte_mkold(ptent);
>> -                       set_pte_at(mm, addr, pte, ptent);
>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>> +               if (!pageout) {
>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
>> +                       }
> 
> This looks so smart. if it is not pageout, we have increased pte
> and addr here; so nr is 0 and we don't need to increase again in
> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> 
> otherwise, nr won't be 0. so we will increase addr and
> pte by nr.

Indeed. I'm hoping that Lance is able to follow a similar pattern for
madvise_free_pte_range().


> 
> 
>>                 }
>>
>>                 /*
>> --
>> 2.25.1
>>
> 
> Overall, LGTM,
> 
> Reviewed-by: Barry Song <v-songbaohua@oppo.com>

Thanks!
Barry Song March 13, 2024, 9:16 a.m. UTC | #3
On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 13/03/2024 07:19, Barry Song wrote:
> > On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >> folio that is fully and contiguously mapped in the pageout/cold vm
> >> range. This change means that large folios will be maintained all the
> >> way to swap storage. This both improves performance during swap-out, by
> >> eliding the cost of splitting the folio, and sets us up nicely for
> >> maintaining the large folio when it is swapped back in (to be covered in
> >> a separate series).
> >>
> >> Folios that are not fully mapped in the target range are still split,
> >> but note that behavior is changed so that if the split fails for any
> >> reason (folio locked, shared, etc) we now leave it as is and move to the
> >> next pte in the range and continue work on the proceeding folios.
> >> Previously any failure of this sort would cause the entire operation to
> >> give up and no folios mapped at higher addresses were paged out or made
> >> cold. Given large folios are becoming more common, this old behavior
> >> would have likely lead to wasted opportunities.
> >>
> >> While we are at it, change the code that clears young from the ptes to
> >> use ptep_test_and_clear_young(), which is more efficent than
> >> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >> where the old approach would require unfolding/refolding and the new
> >> approach can be done in place.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >
> > This looks so much better than our initial RFC.
> > Thank you for your excellent work!
>
> Thanks - its a team effort - I had your PoC and David's previous batching work
> to use as a template.
>
> >
> >> ---
> >>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index 547dcd1f7a39..56c7ba7bd558 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>         LIST_HEAD(folio_list);
> >>         bool pageout_anon_only_filter;
> >>         unsigned int batch_count = 0;
> >> +       int nr;
> >>
> >>         if (fatal_signal_pending(current))
> >>                 return -EINTR;
> >> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>                 return 0;
> >>         flush_tlb_batched_pending(mm);
> >>         arch_enter_lazy_mmu_mode();
> >> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >> +               nr = 1;
> >>                 ptent = ptep_get(pte);
> >>
> >>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>                         continue;
> >>
> >>                 /*
> >> -                * Creating a THP page is expensive so split it only if we
> >> -                * are sure it's worth. Split it if we are only owner.
> >> +                * If we encounter a large folio, only split it if it is not
> >> +                * fully mapped within the range we are operating on. Otherwise
> >> +                * leave it as is so that it can be swapped out whole. If we
> >> +                * fail to split a folio, leave it in place and advance to the
> >> +                * next pte in the range.
> >>                  */
> >>                 if (folio_test_large(folio)) {
> >> -                       int err;
> >> -
> >> -                       if (folio_estimated_sharers(folio) > 1)
> >> -                               break;
> >> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> -                               break;
> >> -                       if (!folio_trylock(folio))
> >> -                               break;
> >> -                       folio_get(folio);
> >> -                       arch_leave_lazy_mmu_mode();
> >> -                       pte_unmap_unlock(start_pte, ptl);
> >> -                       start_pte = NULL;
> >> -                       err = split_folio(folio);
> >> -                       folio_unlock(folio);
> >> -                       folio_put(folio);
> >> -                       if (err)
> >> -                               break;
> >> -                       start_pte = pte =
> >> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> -                       if (!start_pte)
> >> -                               break;
> >> -                       arch_enter_lazy_mmu_mode();
> >> -                       pte--;
> >> -                       addr -= PAGE_SIZE;
> >> -                       continue;
> >> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >> +                                               FPB_IGNORE_SOFT_DIRTY;
> >> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >> +
> >> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >> +                                            fpb_flags, NULL);
> >
> > I wonder if we have a quick way to avoid folio_pte_batch() if users
> > are doing madvise() on a portion of a large folio.
>
> Good idea. Something like this?:
>
>         if (pte_pfn(pte) == folio_pfn(folio)

What about

"if (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"

just to account for cases where the user's end address falls within
the middle of a large folio?
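
Just to make the idea concrete, something like this is what I have in
mind (a sketch only, untested; nr_pages here stands for
folio_nr_pages(folio), and ptent is passed because pte is the pte_t
pointer in this loop):

		int nr_pages = folio_nr_pages(folio);

		if (pte_pfn(ptent) == folio_pfn(folio) && max_nr >= nr_pages)
			nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
					     fpb_flags, NULL);

If the condition is false, nr stays at 1, so the existing
"nr < folio_nr_pages(folio)" test still routes the folio to the split
path.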


BTW, another minor issue is here:

                if (++batch_count == SWAP_CLUSTER_MAX) {
                        batch_count = 0;
                        if (need_resched()) {
                                arch_leave_lazy_mmu_mode();
                                pte_unmap_unlock(start_pte, ptl);
                                cond_resched();
                                goto restart;
                        }
                }

We are increasing batch_count by 1 for nr ptes, so we are holding the PTL
longer than in the small folio case? We used to increase it by 1 for each
PTE. Does it matter?

>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>                                      fpb_flags, NULL);
>
> If we are not mapping the first page of the folio, then it can't be a full
> mapping, so no need to call folio_pte_batch(). Just split it.
>
> >
> >> +
> >> +                       if (nr < folio_nr_pages(folio)) {
> >> +                               int err;
> >> +
> >> +                               if (folio_estimated_sharers(folio) > 1)
> >> +                                       continue;
> >> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> +                                       continue;
> >> +                               if (!folio_trylock(folio))
> >> +                                       continue;
> >> +                               folio_get(folio);
> >> +                               arch_leave_lazy_mmu_mode();
> >> +                               pte_unmap_unlock(start_pte, ptl);
> >> +                               start_pte = NULL;
> >> +                               err = split_folio(folio);
> >> +                               folio_unlock(folio);
> >> +                               folio_put(folio);
> >> +                               if (err)
> >> +                                       continue;
> >> +                               start_pte = pte =
> >> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> +                               if (!start_pte)
> >> +                                       break;
> >> +                               arch_enter_lazy_mmu_mode();
> >> +                               nr = 0;
> >> +                               continue;
> >> +                       }
> >>                 }
> >>
> >>                 /*
> >>                  * Do not interfere with other mappings of this folio and
> >> -                * non-LRU folio.
> >> +                * non-LRU folio. If we have a large folio at this point, we
> >> +                * know it is fully mapped so if its mapcount is the same as its
> >> +                * number of pages, it must be exclusive.
> >>                  */
> >> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >> +               if (!folio_test_lru(folio) ||
> >> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>                         continue;
> >
> > This looks so perfect and is exactly what I wanted to achieve.
> >
> >>
> >>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>                         continue;
> >>
> >> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >> -
> >> -               if (!pageout && pte_young(ptent)) {
> >> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >> -                                                       tlb->fullmm);
> >> -                       ptent = pte_mkold(ptent);
> >> -                       set_pte_at(mm, addr, pte, ptent);
> >> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >> +               if (!pageout) {
> >> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >> +                       }
> >
> > This looks so smart. if it is not pageout, we have increased pte
> > and addr here; so nr is 0 and we don't need to increase again in
> > for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >
> > otherwise, nr won't be 0. so we will increase addr and
> > pte by nr.
>
> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> madvise_free_pte_range().
>
>
> >
> >
> >>                 }
> >>
> >>                 /*
> >> --
> >> 2.25.1
> >>
> >
> > Overall, LGTM,
> >
> > Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>
> Thanks!
>
>
Lance Yang March 13, 2024, 9:19 a.m. UTC | #4
On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 13/03/2024 07:19, Barry Song wrote:
> > On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >> folio that is fully and contiguously mapped in the pageout/cold vm
> >> range. This change means that large folios will be maintained all the
> >> way to swap storage. This both improves performance during swap-out, by
> >> eliding the cost of splitting the folio, and sets us up nicely for
> >> maintaining the large folio when it is swapped back in (to be covered in
> >> a separate series).
> >>
> >> Folios that are not fully mapped in the target range are still split,
> >> but note that behavior is changed so that if the split fails for any
> >> reason (folio locked, shared, etc) we now leave it as is and move to the
> >> next pte in the range and continue work on the proceeding folios.
> >> Previously any failure of this sort would cause the entire operation to
> >> give up and no folios mapped at higher addresses were paged out or made
> >> cold. Given large folios are becoming more common, this old behavior
> >> would have likely lead to wasted opportunities.
> >>
> >> While we are at it, change the code that clears young from the ptes to
> >> use ptep_test_and_clear_young(), which is more efficent than
> >> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >> where the old approach would require unfolding/refolding and the new
> >> approach can be done in place.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >
> > This looks so much better than our initial RFC.
> > Thank you for your excellent work!
>
> Thanks - its a team effort - I had your PoC and David's previous batching work
> to use as a template.
>
> >
> >> ---
> >>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index 547dcd1f7a39..56c7ba7bd558 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>         LIST_HEAD(folio_list);
> >>         bool pageout_anon_only_filter;
> >>         unsigned int batch_count = 0;
> >> +       int nr;
> >>
> >>         if (fatal_signal_pending(current))
> >>                 return -EINTR;
> >> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>                 return 0;
> >>         flush_tlb_batched_pending(mm);
> >>         arch_enter_lazy_mmu_mode();
> >> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >> +               nr = 1;
> >>                 ptent = ptep_get(pte);
> >>
> >>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>                         continue;
> >>
> >>                 /*
> >> -                * Creating a THP page is expensive so split it only if we
> >> -                * are sure it's worth. Split it if we are only owner.
> >> +                * If we encounter a large folio, only split it if it is not
> >> +                * fully mapped within the range we are operating on. Otherwise
> >> +                * leave it as is so that it can be swapped out whole. If we
> >> +                * fail to split a folio, leave it in place and advance to the
> >> +                * next pte in the range.
> >>                  */
> >>                 if (folio_test_large(folio)) {
> >> -                       int err;
> >> -
> >> -                       if (folio_estimated_sharers(folio) > 1)
> >> -                               break;
> >> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> -                               break;
> >> -                       if (!folio_trylock(folio))
> >> -                               break;
> >> -                       folio_get(folio);
> >> -                       arch_leave_lazy_mmu_mode();
> >> -                       pte_unmap_unlock(start_pte, ptl);
> >> -                       start_pte = NULL;
> >> -                       err = split_folio(folio);
> >> -                       folio_unlock(folio);
> >> -                       folio_put(folio);
> >> -                       if (err)
> >> -                               break;
> >> -                       start_pte = pte =
> >> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> -                       if (!start_pte)
> >> -                               break;
> >> -                       arch_enter_lazy_mmu_mode();
> >> -                       pte--;
> >> -                       addr -= PAGE_SIZE;
> >> -                       continue;
> >> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >> +                                               FPB_IGNORE_SOFT_DIRTY;
> >> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >> +
> >> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >> +                                            fpb_flags, NULL);
> >
> > I wonder if we have a quick way to avoid folio_pte_batch() if users
> > are doing madvise() on a portion of a large folio.
>
> Good idea. Something like this?:
>
>         if (pte_pfn(pte) == folio_pfn(folio)
>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>                                      fpb_flags, NULL);
>
> If we are not mapping the first page of the folio, then it can't be a full
> mapping, so no need to call folio_pte_batch(). Just split it.
>
> >
> >> +
> >> +                       if (nr < folio_nr_pages(folio)) {
> >> +                               int err;
> >> +
> >> +                               if (folio_estimated_sharers(folio) > 1)
> >> +                                       continue;
> >> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> +                                       continue;
> >> +                               if (!folio_trylock(folio))
> >> +                                       continue;
> >> +                               folio_get(folio);
> >> +                               arch_leave_lazy_mmu_mode();
> >> +                               pte_unmap_unlock(start_pte, ptl);
> >> +                               start_pte = NULL;
> >> +                               err = split_folio(folio);
> >> +                               folio_unlock(folio);
> >> +                               folio_put(folio);
> >> +                               if (err)
> >> +                                       continue;
> >> +                               start_pte = pte =
> >> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> +                               if (!start_pte)
> >> +                                       break;
> >> +                               arch_enter_lazy_mmu_mode();
> >> +                               nr = 0;
> >> +                               continue;
> >> +                       }
> >>                 }
> >>
> >>                 /*
> >>                  * Do not interfere with other mappings of this folio and
> >> -                * non-LRU folio.
> >> +                * non-LRU folio. If we have a large folio at this point, we
> >> +                * know it is fully mapped so if its mapcount is the same as its
> >> +                * number of pages, it must be exclusive.
> >>                  */
> >> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >> +               if (!folio_test_lru(folio) ||
> >> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>                         continue;
> >
> > This looks so perfect and is exactly what I wanted to achieve.
> >
> >>
> >>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>                         continue;
> >>
> >> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >> -
> >> -               if (!pageout && pte_young(ptent)) {
> >> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >> -                                                       tlb->fullmm);
> >> -                       ptent = pte_mkold(ptent);
> >> -                       set_pte_at(mm, addr, pte, ptent);
> >> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >> +               if (!pageout) {
> >> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >> +                       }
> >
> > This looks so smart. if it is not pageout, we have increased pte
> > and addr here; so nr is 0 and we don't need to increase again in
> > for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >
> > otherwise, nr won't be 0. so we will increase addr and
> > pte by nr.
>
> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> madvise_free_pte_range().

Thanks a lot, Ryan!
I'll make sure to follow a similar pattern for madvise_free_pte_range().

Best,
Lance

>
>
> >
> >
> >>                 }
> >>
> >>                 /*
> >> --
> >> 2.25.1
> >>
> >
> > Overall, LGTM,
> >
> > Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>
> Thanks!
>
>
Ryan Roberts March 13, 2024, 9:36 a.m. UTC | #5
On 13/03/2024 09:16, Barry Song wrote:
> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 13/03/2024 07:19, Barry Song wrote:
>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>> range. This change means that large folios will be maintained all the
>>>> way to swap storage. This both improves performance during swap-out, by
>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>> a separate series).
>>>>
>>>> Folios that are not fully mapped in the target range are still split,
>>>> but note that behavior is changed so that if the split fails for any
>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>> next pte in the range and continue work on the proceeding folios.
>>>> Previously any failure of this sort would cause the entire operation to
>>>> give up and no folios mapped at higher addresses were paged out or made
>>>> cold. Given large folios are becoming more common, this old behavior
>>>> would have likely lead to wasted opportunities.
>>>>
>>>> While we are at it, change the code that clears young from the ptes to
>>>> use ptep_test_and_clear_young(), which is more efficent than
>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>> where the old approach would require unfolding/refolding and the new
>>>> approach can be done in place.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>
>>> This looks so much better than our initial RFC.
>>> Thank you for your excellent work!
>>
>> Thanks - its a team effort - I had your PoC and David's previous batching work
>> to use as a template.
>>
>>>
>>>> ---
>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>         LIST_HEAD(folio_list);
>>>>         bool pageout_anon_only_filter;
>>>>         unsigned int batch_count = 0;
>>>> +       int nr;
>>>>
>>>>         if (fatal_signal_pending(current))
>>>>                 return -EINTR;
>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                 return 0;
>>>>         flush_tlb_batched_pending(mm);
>>>>         arch_enter_lazy_mmu_mode();
>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>> +               nr = 1;
>>>>                 ptent = ptep_get(pte);
>>>>
>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                         continue;
>>>>
>>>>                 /*
>>>> -                * Creating a THP page is expensive so split it only if we
>>>> -                * are sure it's worth. Split it if we are only owner.
>>>> +                * If we encounter a large folio, only split it if it is not
>>>> +                * fully mapped within the range we are operating on. Otherwise
>>>> +                * leave it as is so that it can be swapped out whole. If we
>>>> +                * fail to split a folio, leave it in place and advance to the
>>>> +                * next pte in the range.
>>>>                  */
>>>>                 if (folio_test_large(folio)) {
>>>> -                       int err;
>>>> -
>>>> -                       if (folio_estimated_sharers(folio) > 1)
>>>> -                               break;
>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> -                               break;
>>>> -                       if (!folio_trylock(folio))
>>>> -                               break;
>>>> -                       folio_get(folio);
>>>> -                       arch_leave_lazy_mmu_mode();
>>>> -                       pte_unmap_unlock(start_pte, ptl);
>>>> -                       start_pte = NULL;
>>>> -                       err = split_folio(folio);
>>>> -                       folio_unlock(folio);
>>>> -                       folio_put(folio);
>>>> -                       if (err)
>>>> -                               break;
>>>> -                       start_pte = pte =
>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> -                       if (!start_pte)
>>>> -                               break;
>>>> -                       arch_enter_lazy_mmu_mode();
>>>> -                       pte--;
>>>> -                       addr -= PAGE_SIZE;
>>>> -                       continue;
>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>> +                                            fpb_flags, NULL);
>>>
>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>> are doing madvise() on a portion of a large folio.
>>
>> Good idea. Something like this?:
>>
>>         if (pte_pfn(pte) == folio_pfn(folio)
> 
> what about
> 
> "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
> 
>  just to account for cases where the user's end address falls within
> the middle of a large folio?

Yes, even better. I'll add this for the next version.

> 
> 
> BTW, another minor issue is here:
> 
>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>                         batch_count = 0;
>                         if (need_resched()) {
>                                 arch_leave_lazy_mmu_mode();
>                                 pte_unmap_unlock(start_pte, ptl);
>                                 cond_resched();
>                                 goto restart;
>                         }
>                 }
> 
> We are increasing 1 for nr ptes, thus, we are holding PTL longer
> than small folios case? we used to increase 1 for each PTE.
> Does it matter?

I thought about that, but the vast majority of the work is per-folio, not
per-pte. So I concluded it would be best to continue to increment per-folio.

> 
>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>                                      fpb_flags, NULL);
>>
>> If we are not mapping the first page of the folio, then it can't be a full
>> mapping, so no need to call folio_pte_batch(). Just split it.
>>
>>>
>>>> +
>>>> +                       if (nr < folio_nr_pages(folio)) {
>>>> +                               int err;
>>>> +
>>>> +                               if (folio_estimated_sharers(folio) > 1)
>>>> +                                       continue;
>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> +                                       continue;
>>>> +                               if (!folio_trylock(folio))
>>>> +                                       continue;
>>>> +                               folio_get(folio);
>>>> +                               arch_leave_lazy_mmu_mode();
>>>> +                               pte_unmap_unlock(start_pte, ptl);
>>>> +                               start_pte = NULL;
>>>> +                               err = split_folio(folio);
>>>> +                               folio_unlock(folio);
>>>> +                               folio_put(folio);
>>>> +                               if (err)
>>>> +                                       continue;
>>>> +                               start_pte = pte =
>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> +                               if (!start_pte)
>>>> +                                       break;
>>>> +                               arch_enter_lazy_mmu_mode();
>>>> +                               nr = 0;
>>>> +                               continue;
>>>> +                       }
>>>>                 }
>>>>
>>>>                 /*
>>>>                  * Do not interfere with other mappings of this folio and
>>>> -                * non-LRU folio.
>>>> +                * non-LRU folio. If we have a large folio at this point, we
>>>> +                * know it is fully mapped so if its mapcount is the same as its
>>>> +                * number of pages, it must be exclusive.
>>>>                  */
>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>> +               if (!folio_test_lru(folio) ||
>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>>>                         continue;
>>>
>>> This looks so perfect and is exactly what I wanted to achieve.
>>>
>>>>
>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>                         continue;
>>>>
>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>> -
>>>> -               if (!pageout && pte_young(ptent)) {
>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>> -                                                       tlb->fullmm);
>>>> -                       ptent = pte_mkold(ptent);
>>>> -                       set_pte_at(mm, addr, pte, ptent);
>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>> +               if (!pageout) {
>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>> +                       }
>>>
>>> This looks so smart. if it is not pageout, we have increased pte
>>> and addr here; so nr is 0 and we don't need to increase again in
>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>
>>> otherwise, nr won't be 0. so we will increase addr and
>>> pte by nr.
>>
>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>> madvise_free_pte_range().
>>
>>
>>>
>>>
>>>>                 }
>>>>
>>>>                 /*
>>>> --
>>>> 2.25.1
>>>>
>>>
>>> Overall, LGTM,
>>>
>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>
>> Thanks!
>>
>>
Barry Song March 13, 2024, 10:37 a.m. UTC | #6
On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 13/03/2024 09:16, Barry Song wrote:
> > On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 13/03/2024 07:19, Barry Song wrote:
> >>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>> range. This change means that large folios will be maintained all the
> >>>> way to swap storage. This both improves performance during swap-out, by
> >>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>> a separate series).
> >>>>
> >>>> Folios that are not fully mapped in the target range are still split,
> >>>> but note that behavior is changed so that if the split fails for any
> >>>> reason (folio locked, shared, etc) we now leave it as is and move to the
> >>>> next pte in the range and continue work on the proceeding folios.
> >>>> Previously any failure of this sort would cause the entire operation to
> >>>> give up and no folios mapped at higher addresses were paged out or made
> >>>> cold. Given large folios are becoming more common, this old behavior
> >>>> would have likely lead to wasted opportunities.
> >>>>
> >>>> While we are at it, change the code that clears young from the ptes to
> >>>> use ptep_test_and_clear_young(), which is more efficent than
> >>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>> where the old approach would require unfolding/refolding and the new
> >>>> approach can be done in place.
> >>>>
> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>
> >>> This looks so much better than our initial RFC.
> >>> Thank you for your excellent work!
> >>
> >> Thanks - its a team effort - I had your PoC and David's previous batching work
> >> to use as a template.
> >>
> >>>
> >>>> ---
> >>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>
> >>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>         LIST_HEAD(folio_list);
> >>>>         bool pageout_anon_only_filter;
> >>>>         unsigned int batch_count = 0;
> >>>> +       int nr;
> >>>>
> >>>>         if (fatal_signal_pending(current))
> >>>>                 return -EINTR;
> >>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                 return 0;
> >>>>         flush_tlb_batched_pending(mm);
> >>>>         arch_enter_lazy_mmu_mode();
> >>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>> +               nr = 1;
> >>>>                 ptent = ptep_get(pte);
> >>>>
> >>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                         continue;
> >>>>
> >>>>                 /*
> >>>> -                * Creating a THP page is expensive so split it only if we
> >>>> -                * are sure it's worth. Split it if we are only owner.
> >>>> +                * If we encounter a large folio, only split it if it is not
> >>>> +                * fully mapped within the range we are operating on. Otherwise
> >>>> +                * leave it as is so that it can be swapped out whole. If we
> >>>> +                * fail to split a folio, leave it in place and advance to the
> >>>> +                * next pte in the range.
> >>>>                  */
> >>>>                 if (folio_test_large(folio)) {
> >>>> -                       int err;
> >>>> -
> >>>> -                       if (folio_estimated_sharers(folio) > 1)
> >>>> -                               break;
> >>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> -                               break;
> >>>> -                       if (!folio_trylock(folio))
> >>>> -                               break;
> >>>> -                       folio_get(folio);
> >>>> -                       arch_leave_lazy_mmu_mode();
> >>>> -                       pte_unmap_unlock(start_pte, ptl);
> >>>> -                       start_pte = NULL;
> >>>> -                       err = split_folio(folio);
> >>>> -                       folio_unlock(folio);
> >>>> -                       folio_put(folio);
> >>>> -                       if (err)
> >>>> -                               break;
> >>>> -                       start_pte = pte =
> >>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> -                       if (!start_pte)
> >>>> -                               break;
> >>>> -                       arch_enter_lazy_mmu_mode();
> >>>> -                       pte--;
> >>>> -                       addr -= PAGE_SIZE;
> >>>> -                       continue;
> >>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>> +                                               FPB_IGNORE_SOFT_DIRTY;
> >>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >>>> +
> >>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>> +                                            fpb_flags, NULL);
> >>>
> >>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>> are doing madvise() on a portion of a large folio.
> >>
> >> Good idea. Something like this?:
> >>
> >>         if (pte_pfn(pte) == folio_pfn(folio)
> >
> > what about
> >
> > "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
> >
> >  just to account for cases where the user's end address falls within
> > the middle of a large folio?
>
> yes, even better. I'll add this for the next version.
>
> >
> >
> > BTW, another minor issue is here:
> >
> >                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >                         batch_count = 0;
> >                         if (need_resched()) {
> >                                 arch_leave_lazy_mmu_mode();
> >                                 pte_unmap_unlock(start_pte, ptl);
> >                                 cond_resched();
> >                                 goto restart;
> >                         }
> >                 }
> >
> > We are increasing 1 for nr ptes, thus, we are holding PTL longer
> > than small folios case? we used to increase 1 for each PTE.
> > Does it matter?
>
> I thought about that, but the vast majority of the work is per-folio, not
> per-pte. So I concluded it would be best to continue to increment per-folio.

Okay. The original patch commit b2f557a21bc8 ("mm/madvise: add
cond_resched() in madvise_cold_or_pageout_pte_range()")
primarily addressed the real-time wake-up latency issue. MADV_PAGEOUT
and MADV_COLD are much less critical compared
to other scenarios where operations like do_anonymous_page() or
do_swap_page() necessarily need the PTL to progress. Therefore, adopting
an approach that relatively aggressively releases the PTL seems to
neither harm MADV_PAGEOUT/COLD nor disadvantage
others.

We are slightly increasing the duration for which the PTL is held,
since the iteration in folio_pte_batch() can take longer than the
small-folio case, which does not need it. However, compared to
operations like folio_isolate_lru() and folio_deactivate(), this
increase seems negligible. Recently, we have actually removed
ptep_test_and_clear_young() for MADV_PAGEOUT, which should also
benefit real-time scenarios. Nonetheless, there is a small risk with
large folios, such as a 1 MiB mTHP, where folio_pte_batch() may need
to loop over 256 ptes (1 MiB / 4 KiB).

I would vote for increasing batch_count by 'nr', or maybe by
max(log2(nr), 1), rather than by 1, for two reasons:

1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are
improving them by reducing the time taken to put the same number of
pages onto the reclaim list.

2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
genuinely require the PTL to progress. Moreover, the majority of the
time spent in PAGEOUT is actually reclaim_pages().
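
As a rough sketch (not something posted in this series), that
alternative accounting could look like the below, assuming the resched
check moved to the point in the loop where nr is known, and glossing
over the fact that the loop's continue paths would need to funnel
through it:

                /* after nr has been determined for this iteration */
                batch_count += nr;  /* or: max_t(unsigned int, ilog2(nr), 1) */
                if (batch_count >= SWAP_CLUSTER_MAX) {
                        batch_count = 0;
                        if (need_resched()) {
                                arch_leave_lazy_mmu_mode();
                                pte_unmap_unlock(start_pte, ptl);
                                cond_resched();
                                goto restart;
                        }
                }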

> >
> >>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>                                      fpb_flags, NULL);
> >>
> >> If we are not mapping the first page of the folio, then it can't be a full
> >> mapping, so no need to call folio_pte_batch(). Just split it.
> >>
> >>>
> >>>> +
> >>>> +                       if (nr < folio_nr_pages(folio)) {
> >>>> +                               int err;
> >>>> +
> >>>> +                               if (folio_estimated_sharers(folio) > 1)
> >>>> +                                       continue;
> >>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> +                                       continue;
> >>>> +                               if (!folio_trylock(folio))
> >>>> +                                       continue;
> >>>> +                               folio_get(folio);
> >>>> +                               arch_leave_lazy_mmu_mode();
> >>>> +                               pte_unmap_unlock(start_pte, ptl);
> >>>> +                               start_pte = NULL;
> >>>> +                               err = split_folio(folio);
> >>>> +                               folio_unlock(folio);
> >>>> +                               folio_put(folio);
> >>>> +                               if (err)
> >>>> +                                       continue;
> >>>> +                               start_pte = pte =
> >>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> +                               if (!start_pte)
> >>>> +                                       break;
> >>>> +                               arch_enter_lazy_mmu_mode();
> >>>> +                               nr = 0;
> >>>> +                               continue;
> >>>> +                       }
> >>>>                 }
> >>>>
> >>>>                 /*
> >>>>                  * Do not interfere with other mappings of this folio and
> >>>> -                * non-LRU folio.
> >>>> +                * non-LRU folio. If we have a large folio at this point, we
> >>>> +                * know it is fully mapped so if its mapcount is the same as its
> >>>> +                * number of pages, it must be exclusive.
> >>>>                  */
> >>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>> +               if (!folio_test_lru(folio) ||
> >>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>                         continue;
> >>>
> >>> This looks so perfect and is exactly what I wanted to achieve.
> >>>
> >>>>
> >>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>                         continue;
> >>>>
> >>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>> -
> >>>> -               if (!pageout && pte_young(ptent)) {
> >>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>> -                                                       tlb->fullmm);
> >>>> -                       ptent = pte_mkold(ptent);
> >>>> -                       set_pte_at(mm, addr, pte, ptent);
> >>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> +               if (!pageout) {
> >>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> +                       }
> >>>
> >>> This looks so smart. if it is not pageout, we have increased pte
> >>> and addr here; so nr is 0 and we don't need to increase again in
> >>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>
> >>> otherwise, nr won't be 0. so we will increase addr and
> >>> pte by nr.
> >>
> >> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >> madvise_free_pte_range().
> >>
> >>
> >>>
> >>>
> >>>>                 }
> >>>>
> >>>>                 /*
> >>>> --
> >>>> 2.25.1
> >>>>
> >>>
> >>> Overall, LGTM,
> >>>
> >>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> >>
> >> Thanks!
> >>

Thanks
Barry
Ryan Roberts March 13, 2024, 11:08 a.m. UTC | #7
On 13/03/2024 10:37, Barry Song wrote:
> On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 13/03/2024 09:16, Barry Song wrote:
>>> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 13/03/2024 07:19, Barry Song wrote:
>>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>>>> range. This change means that large folios will be maintained all the
>>>>>> way to swap storage. This both improves performance during swap-out, by
>>>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>>>> a separate series).
>>>>>>
>>>>>> Folios that are not fully mapped in the target range are still split,
>>>>>> but note that behavior is changed so that if the split fails for any
>>>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>>>> next pte in the range and continue work on the proceeding folios.
>>>>>> Previously any failure of this sort would cause the entire operation to
>>>>>> give up and no folios mapped at higher addresses were paged out or made
>>>>>> cold. Given large folios are becoming more common, this old behavior
>>>>>> would have likely lead to wasted opportunities.
>>>>>>
>>>>>> While we are at it, change the code that clears young from the ptes to
>>>>>> use ptep_test_and_clear_young(), which is more efficent than
>>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>>>> where the old approach would require unfolding/refolding and the new
>>>>>> approach can be done in place.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>
>>>>> This looks so much better than our initial RFC.
>>>>> Thank you for your excellent work!
>>>>
>>>> Thanks - its a team effort - I had your PoC and David's previous batching work
>>>> to use as a template.
>>>>
>>>>>
>>>>>> ---
>>>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>>>> --- a/mm/madvise.c
>>>>>> +++ b/mm/madvise.c
>>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>         LIST_HEAD(folio_list);
>>>>>>         bool pageout_anon_only_filter;
>>>>>>         unsigned int batch_count = 0;
>>>>>> +       int nr;
>>>>>>
>>>>>>         if (fatal_signal_pending(current))
>>>>>>                 return -EINTR;
>>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>                 return 0;
>>>>>>         flush_tlb_batched_pending(mm);
>>>>>>         arch_enter_lazy_mmu_mode();
>>>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>>>> +               nr = 1;
>>>>>>                 ptent = ptep_get(pte);
>>>>>>
>>>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>                         continue;
>>>>>>
>>>>>>                 /*
>>>>>> -                * Creating a THP page is expensive so split it only if we
>>>>>> -                * are sure it's worth. Split it if we are only owner.
>>>>>> +                * If we encounter a large folio, only split it if it is not
>>>>>> +                * fully mapped within the range we are operating on. Otherwise
>>>>>> +                * leave it as is so that it can be swapped out whole. If we
>>>>>> +                * fail to split a folio, leave it in place and advance to the
>>>>>> +                * next pte in the range.
>>>>>>                  */
>>>>>>                 if (folio_test_large(folio)) {
>>>>>> -                       int err;
>>>>>> -
>>>>>> -                       if (folio_estimated_sharers(folio) > 1)
>>>>>> -                               break;
>>>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>> -                               break;
>>>>>> -                       if (!folio_trylock(folio))
>>>>>> -                               break;
>>>>>> -                       folio_get(folio);
>>>>>> -                       arch_leave_lazy_mmu_mode();
>>>>>> -                       pte_unmap_unlock(start_pte, ptl);
>>>>>> -                       start_pte = NULL;
>>>>>> -                       err = split_folio(folio);
>>>>>> -                       folio_unlock(folio);
>>>>>> -                       folio_put(folio);
>>>>>> -                       if (err)
>>>>>> -                               break;
>>>>>> -                       start_pte = pte =
>>>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>> -                       if (!start_pte)
>>>>>> -                               break;
>>>>>> -                       arch_enter_lazy_mmu_mode();
>>>>>> -                       pte--;
>>>>>> -                       addr -= PAGE_SIZE;
>>>>>> -                       continue;
>>>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
>>>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>>>>>> +
>>>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>>>> +                                            fpb_flags, NULL);
>>>>>
>>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>>>> are doing madvise() on a portion of a large folio.
>>>>
>>>> Good idea. Something like this?:
>>>>
>>>>         if (pte_pfn(pte) == folio_pfn(folio)
>>>
>>> what about
>>>
>>> "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
>>>
>>>  just to account for cases where the user's end address falls within
>>> the middle of a large folio?
>>
>> yes, even better. I'll add this for the next version.
>>
>>>
>>>
>>> BTW, another minor issue is here:
>>>
>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>>>                         batch_count = 0;
>>>                         if (need_resched()) {
>>>                                 arch_leave_lazy_mmu_mode();
>>>                                 pte_unmap_unlock(start_pte, ptl);
>>>                                 cond_resched();
>>>                                 goto restart;
>>>                         }
>>>                 }
>>>
>>> We are increasing 1 for nr ptes, thus, we are holding PTL longer
>>> than small folios case? we used to increase 1 for each PTE.
>>> Does it matter?
>>
>> I thought about that, but the vast majority of the work is per-folio, not
>> per-pte. So I concluded it would be best to continue to increment per-folio.
> 
> Okay. The original patch commit b2f557a21bc8 ("mm/madvise: add
> cond_resched() in madvise_cold_or_pageout_pte_range()")
> primarily addressed the real-time wake-up latency issue. MADV_PAGEOUT
> and MADV_COLD are much less critical compared
> to other scenarios where operations like do_anon_page or do_swap_page
> necessarily need PTL to progress. Therefore, adopting
> an approach that relatively aggressively releases the PTL seems to
> neither harm MADV_PAGEOUT/COLD nor disadvantage
> others.
> 
> We are slightly increasing the duration of holding the PTL due to the
> iteration of folio_pte_batch() potentially taking longer than
> the case of small folios, which do not require it. 

If we can't scan all the PTEs in a page table without dropping the PTL
intermittently we have bigger problems. This all works perfectly fine in all the
other PTE iterators; see zap_pte_range() for example.

> However, compared
> to operations like folio_isolate_lru() and folio_deactivate(),
> this increase seems negligible. Recently, we have actually removed
> ptep_test_and_clear_young() for MADV_PAGEOUT,
> which should also benefit real-time scenarios. Nonetheless, there is a
> small risk with large folios, such as 1 MiB mTHP, where
> we may need to loop 256 times in folio_pte_batch().

As I understand it, RT and THP are mutually exclusive. RT can't handle the extra
latencies THPs can cause in the allocation path, etc. So I don't think you will
see a problem here.

> 
> I would vote for increasing 'nr' or maybe max(log2(nr), 1) rather than
> 1 for two reasons:
> 
> 1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are
> improving them by reducing the time taken to put the same
> number of pages into the reclaim list.
> 
> 2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
> genuinely require the PTL to progress. Moreover,
> the majority of time spent on PAGEOUT is actually reclaim_pages().

I understand your logic. But I'd rather optimize for fewer lock acquisitions for
the !RT+THP case, since RT+THP is not supported.

> 
>>>
>>>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>>                                      fpb_flags, NULL);
>>>>
>>>> If we are not mapping the first page of the folio, then it can't be a full
>>>> mapping, so no need to call folio_pte_batch(). Just split it.
>>>>
>>>>>
>>>>>> +
>>>>>> +                       if (nr < folio_nr_pages(folio)) {
>>>>>> +                               int err;
>>>>>> +
>>>>>> +                               if (folio_estimated_sharers(folio) > 1)
>>>>>> +                                       continue;
>>>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>> +                                       continue;
>>>>>> +                               if (!folio_trylock(folio))
>>>>>> +                                       continue;
>>>>>> +                               folio_get(folio);
>>>>>> +                               arch_leave_lazy_mmu_mode();
>>>>>> +                               pte_unmap_unlock(start_pte, ptl);
>>>>>> +                               start_pte = NULL;
>>>>>> +                               err = split_folio(folio);
>>>>>> +                               folio_unlock(folio);
>>>>>> +                               folio_put(folio);
>>>>>> +                               if (err)
>>>>>> +                                       continue;
>>>>>> +                               start_pte = pte =
>>>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>> +                               if (!start_pte)
>>>>>> +                                       break;
>>>>>> +                               arch_enter_lazy_mmu_mode();
>>>>>> +                               nr = 0;
>>>>>> +                               continue;
>>>>>> +                       }
>>>>>>                 }
>>>>>>
>>>>>>                 /*
>>>>>>                  * Do not interfere with other mappings of this folio and
>>>>>> -                * non-LRU folio.
>>>>>> +                * non-LRU folio. If we have a large folio at this point, we
>>>>>> +                * know it is fully mapped so if its mapcount is the same as its
>>>>>> +                * number of pages, it must be exclusive.
>>>>>>                  */
>>>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>>>> +               if (!folio_test_lru(folio) ||
>>>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>>>>>                         continue;
>>>>>
>>>>> This looks so perfect and is exactly what I wanted to achieve.
>>>>>
>>>>>>
>>>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>>                         continue;
>>>>>>
>>>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>> -
>>>>>> -               if (!pageout && pte_young(ptent)) {
>>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>> -                                                       tlb->fullmm);
>>>>>> -                       ptent = pte_mkold(ptent);
>>>>>> -                       set_pte_at(mm, addr, pte, ptent);
>>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>> +               if (!pageout) {
>>>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
>>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>> +                       }
>>>>>
>>>>> This looks so smart. if it is not pageout, we have increased pte
>>>>> and addr here; so nr is 0 and we don't need to increase again in
>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>>>
>>>>> otherwise, nr won't be 0. so we will increase addr and
>>>>> pte by nr.
>>>>
>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>>>> madvise_free_pte_range().
>>>>
>>>>
>>>>>
>>>>>
>>>>>>                 }
>>>>>>
>>>>>>                 /*
>>>>>> --
>>>>>> 2.25.1
>>>>>>
>>>>>
>>>>> Overall, LGTM,
>>>>>
>>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> Thanks!
>>>>
> 
> Thanks
> Barry
Barry Song March 13, 2024, 11:37 a.m. UTC | #8
On Wed, Mar 13, 2024 at 7:08 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 13/03/2024 10:37, Barry Song wrote:
> > On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 13/03/2024 09:16, Barry Song wrote:
> >>> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 13/03/2024 07:19, Barry Song wrote:
> >>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>>>> range. This change means that large folios will be maintained all the
> >>>>>> way to swap storage. This both improves performance during swap-out, by
> >>>>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>>>> a separate series).
> >>>>>>
> >>>>>> Folios that are not fully mapped in the target range are still split,
> >>>>>> but note that behavior is changed so that if the split fails for any
> >>>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
> >>>>>> next pte in the range and continue work on the proceeding folios.
> >>>>>> Previously any failure of this sort would cause the entire operation to
> >>>>>> give up and no folios mapped at higher addresses were paged out or made
> >>>>>> cold. Given large folios are becoming more common, this old behavior
> >>>>>> would have likely lead to wasted opportunities.
> >>>>>>
> >>>>>> While we are at it, change the code that clears young from the ptes to
> >>>>>> use ptep_test_and_clear_young(), which is more efficent than
> >>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>>>> where the old approach would require unfolding/refolding and the new
> >>>>>> approach can be done in place.
> >>>>>>
> >>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>>>
> >>>>> This looks so much better than our initial RFC.
> >>>>> Thank you for your excellent work!
> >>>>
> >>>> Thanks - its a team effort - I had your PoC and David's previous batching work
> >>>> to use as a template.
> >>>>
> >>>>>
> >>>>>> ---
> >>>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>>>> --- a/mm/madvise.c
> >>>>>> +++ b/mm/madvise.c
> >>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>         LIST_HEAD(folio_list);
> >>>>>>         bool pageout_anon_only_filter;
> >>>>>>         unsigned int batch_count = 0;
> >>>>>> +       int nr;
> >>>>>>
> >>>>>>         if (fatal_signal_pending(current))
> >>>>>>                 return -EINTR;
> >>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>                 return 0;
> >>>>>>         flush_tlb_batched_pending(mm);
> >>>>>>         arch_enter_lazy_mmu_mode();
> >>>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>>>> +               nr = 1;
> >>>>>>                 ptent = ptep_get(pte);
> >>>>>>
> >>>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>                         continue;
> >>>>>>
> >>>>>>                 /*
> >>>>>> -                * Creating a THP page is expensive so split it only if we
> >>>>>> -                * are sure it's worth. Split it if we are only owner.
> >>>>>> +                * If we encounter a large folio, only split it if it is not
> >>>>>> +                * fully mapped within the range we are operating on. Otherwise
> >>>>>> +                * leave it as is so that it can be swapped out whole. If we
> >>>>>> +                * fail to split a folio, leave it in place and advance to the
> >>>>>> +                * next pte in the range.
> >>>>>>                  */
> >>>>>>                 if (folio_test_large(folio)) {
> >>>>>> -                       int err;
> >>>>>> -
> >>>>>> -                       if (folio_estimated_sharers(folio) > 1)
> >>>>>> -                               break;
> >>>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> -                               break;
> >>>>>> -                       if (!folio_trylock(folio))
> >>>>>> -                               break;
> >>>>>> -                       folio_get(folio);
> >>>>>> -                       arch_leave_lazy_mmu_mode();
> >>>>>> -                       pte_unmap_unlock(start_pte, ptl);
> >>>>>> -                       start_pte = NULL;
> >>>>>> -                       err = split_folio(folio);
> >>>>>> -                       folio_unlock(folio);
> >>>>>> -                       folio_put(folio);
> >>>>>> -                       if (err)
> >>>>>> -                               break;
> >>>>>> -                       start_pte = pte =
> >>>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> -                       if (!start_pte)
> >>>>>> -                               break;
> >>>>>> -                       arch_enter_lazy_mmu_mode();
> >>>>>> -                       pte--;
> >>>>>> -                       addr -= PAGE_SIZE;
> >>>>>> -                       continue;
> >>>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
> >>>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >>>>>> +
> >>>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>>>> +                                            fpb_flags, NULL);
> >>>>>
> >>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>>>> are doing madvise() on a portion of a large folio.
> >>>>
> >>>> Good idea. Something like this?:
> >>>>
> >>>>         if (pte_pfn(pte) == folio_pfn(folio)
> >>>
> >>> what about
> >>>
> >>> "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
> >>>
> >>>  just to account for cases where the user's end address falls within
> >>> the middle of a large folio?
> >>
> >> yes, even better. I'll add this for the next version.
> >>
> >>>
> >>>
> >>> BTW, another minor issue is here:
> >>>
> >>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>                         batch_count = 0;
> >>>                         if (need_resched()) {
> >>>                                 arch_leave_lazy_mmu_mode();
> >>>                                 pte_unmap_unlock(start_pte, ptl);
> >>>                                 cond_resched();
> >>>                                 goto restart;
> >>>                         }
> >>>                 }
> >>>
> >>> We are increasing 1 for nr ptes, thus, we are holding PTL longer
> >>> than small folios case? we used to increase 1 for each PTE.
> >>> Does it matter?
> >>
> >> I thought about that, but the vast majority of the work is per-folio, not
> >> per-pte. So I concluded it would be best to continue to increment per-folio.
> >
> > Okay. The original patch commit b2f557a21bc8 ("mm/madvise: add
> > cond_resched() in madvise_cold_or_pageout_pte_range()")
> > primarily addressed the real-time wake-up latency issue. MADV_PAGEOUT
> > and MADV_COLD are much less critical compared
> > to other scenarios where operations like do_anon_page or do_swap_page
> > necessarily need PTL to progress. Therefore, adopting
> > an approach that relatively aggressively releases the PTL seems to
> > neither harm MADV_PAGEOUT/COLD nor disadvantage
> > others.
> >
> > We are slightly increasing the duration of holding the PTL due to the
> > iteration of folio_pte_batch() potentially taking longer than
> > the case of small folios, which do not require it.
>
> If we can't scan all the PTEs in a page table without dropping the PTL
> intermittently we have bigger problems. This all works perfectly fine in all the
> other PTE iterators; see zap_pte_range() for example.

I've no doubt about folio_pte_batch(); I was just talking about the
original RT issue it might affect.

>
> > However, compared
> > to operations like folio_isolate_lru() and folio_deactivate(),
> > this increase seems negligible. Recently, we have actually removed
> > ptep_test_and_clear_young() for MADV_PAGEOUT,
> > which should also benefit real-time scenarios. Nonetheless, there is a
> > small risk with large folios, such as 1 MiB mTHP, where
> > we may need to loop 256 times in folio_pte_batch().
>
> As I understand it, RT and THP are mutually exclusive. RT can't handle the extra
> latencies THPs can cause in allocation path, etc. So I don't think you will see
> a problem here.

I was actually taking a different approach on phones, since we obviously
have some UX (user-experience)/UI/audio-related tasks which cannot
tolerate allocation latency. With a TAO-like optimization (we did it with
an ext_migratetype for some pageblocks), we don't push the buddy
allocator to do compaction or reclamation to form a 64KiB folio. We
immediately fall back to small folios if a zero-latency 64KiB allocation
can't be obtained from those dedicated pools - the ext_migratetype
pageblocks.

>
> >
> > I would vote for increasing 'nr' or maybe max(log2(nr), 1) rather than
> > 1 for two reasons:
> >
> > 1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are
> > improving them by reducing the time taken to put the same
> > number of pages into the reclaim list.
> >
> > 2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
> > genuinely require the PTL to progress. Moreover,
> > the majority of time spent on PAGEOUT is actually reclaim_pages().
>
> I understand your logic. But I'd rather optimize for fewer lock acquisitions for
> the !RT+THP case, since RT+THP is not supported.

Fair enough. I agree we can postpone this until RT plus THP becomes a
viable option. For now, keeping this patch simpler seems better.

>
> >
> >>>
> >>>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>>                                      fpb_flags, NULL);
> >>>>
> >>>> If we are not mapping the first page of the folio, then it can't be a full
> >>>> mapping, so no need to call folio_pte_batch(). Just split it.
> >>>>
> >>>>>
> >>>>>> +
> >>>>>> +                       if (nr < folio_nr_pages(folio)) {
> >>>>>> +                               int err;
> >>>>>> +
> >>>>>> +                               if (folio_estimated_sharers(folio) > 1)
> >>>>>> +                                       continue;
> >>>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> +                                       continue;
> >>>>>> +                               if (!folio_trylock(folio))
> >>>>>> +                                       continue;
> >>>>>> +                               folio_get(folio);
> >>>>>> +                               arch_leave_lazy_mmu_mode();
> >>>>>> +                               pte_unmap_unlock(start_pte, ptl);
> >>>>>> +                               start_pte = NULL;
> >>>>>> +                               err = split_folio(folio);
> >>>>>> +                               folio_unlock(folio);
> >>>>>> +                               folio_put(folio);
> >>>>>> +                               if (err)
> >>>>>> +                                       continue;
> >>>>>> +                               start_pte = pte =
> >>>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> +                               if (!start_pte)
> >>>>>> +                                       break;
> >>>>>> +                               arch_enter_lazy_mmu_mode();
> >>>>>> +                               nr = 0;
> >>>>>> +                               continue;
> >>>>>> +                       }
> >>>>>>                 }
> >>>>>>
> >>>>>>                 /*
> >>>>>>                  * Do not interfere with other mappings of this folio and
> >>>>>> -                * non-LRU folio.
> >>>>>> +                * non-LRU folio. If we have a large folio at this point, we
> >>>>>> +                * know it is fully mapped so if its mapcount is the same as its
> >>>>>> +                * number of pages, it must be exclusive.
> >>>>>>                  */
> >>>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>>>> +               if (!folio_test_lru(folio) ||
> >>>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>>>                         continue;
> >>>>>
> >>>>> This looks so perfect and is exactly what I wanted to achieve.
> >>>>>
> >>>>>>
> >>>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>>                         continue;
> >>>>>>
> >>>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>>> -
> >>>>>> -               if (!pageout && pte_young(ptent)) {
> >>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>>>> -                                                       tlb->fullmm);
> >>>>>> -                       ptent = pte_mkold(ptent);
> >>>>>> -                       set_pte_at(mm, addr, pte, ptent);
> >>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>> +               if (!pageout) {
> >>>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>> +                       }
> >>>>>
> >>>>> This looks so smart. if it is not pageout, we have increased pte
> >>>>> and addr here; so nr is 0 and we don't need to increase again in
> >>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>>>
> >>>>> otherwise, nr won't be 0. so we will increase addr and
> >>>>> pte by nr.
> >>>>
> >>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >>>> madvise_free_pte_range().
> >>>>
> >>>>
> >>>>>
> >>>>>
> >>>>>>                 }
> >>>>>>
> >>>>>>                 /*
> >>>>>> --
> >>>>>> 2.25.1
> >>>>>>
> >>>>>
> >>>>> Overall, LGTM,
> >>>>>
> >>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> >>>>
> >>>> Thanks!
> >>>>
> >
> > Thanks
> > Barry
>
Ryan Roberts March 13, 2024, 12:02 p.m. UTC | #9
On 13/03/2024 11:37, Barry Song wrote:
> On Wed, Mar 13, 2024 at 7:08 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 13/03/2024 10:37, Barry Song wrote:
>>> On Wed, Mar 13, 2024 at 10:36 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 13/03/2024 09:16, Barry Song wrote:
>>>>> On Wed, Mar 13, 2024 at 10:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 13/03/2024 07:19, Barry Song wrote:
>>>>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>>>
>>>>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>>>>>> range. This change means that large folios will be maintained all the
>>>>>>>> way to swap storage. This both improves performance during swap-out, by
>>>>>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>>>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>>>>>> a separate series).
>>>>>>>>
>>>>>>>> Folios that are not fully mapped in the target range are still split,
>>>>>>>> but note that behavior is changed so that if the split fails for any
>>>>>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>>>>>> next pte in the range and continue work on the proceeding folios.
>>>>>>>> Previously any failure of this sort would cause the entire operation to
>>>>>>>> give up and no folios mapped at higher addresses were paged out or made
>>>>>>>> cold. Given large folios are becoming more common, this old behavior
>>>>>>>> would have likely lead to wasted opportunities.
>>>>>>>>
>>>>>>>> While we are at it, change the code that clears young from the ptes to
>>>>>>>> use ptep_test_and_clear_young(), which is more efficent than
>>>>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>>>>>> where the old approach would require unfolding/refolding and the new
>>>>>>>> approach can be done in place.
>>>>>>>>
>>>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>>>
>>>>>>> This looks so much better than our initial RFC.
>>>>>>> Thank you for your excellent work!
>>>>>>
>>>>>> Thanks - its a team effort - I had your PoC and David's previous batching work
>>>>>> to use as a template.
>>>>>>
>>>>>>>
>>>>>>>> ---
>>>>>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>>>>>> --- a/mm/madvise.c
>>>>>>>> +++ b/mm/madvise.c
>>>>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>>>         LIST_HEAD(folio_list);
>>>>>>>>         bool pageout_anon_only_filter;
>>>>>>>>         unsigned int batch_count = 0;
>>>>>>>> +       int nr;
>>>>>>>>
>>>>>>>>         if (fatal_signal_pending(current))
>>>>>>>>                 return -EINTR;
>>>>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>>>                 return 0;
>>>>>>>>         flush_tlb_batched_pending(mm);
>>>>>>>>         arch_enter_lazy_mmu_mode();
>>>>>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>>>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>>>>>> +               nr = 1;
>>>>>>>>                 ptent = ptep_get(pte);
>>>>>>>>
>>>>>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>>>>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>>>                         continue;
>>>>>>>>
>>>>>>>>                 /*
>>>>>>>> -                * Creating a THP page is expensive so split it only if we
>>>>>>>> -                * are sure it's worth. Split it if we are only owner.
>>>>>>>> +                * If we encounter a large folio, only split it if it is not
>>>>>>>> +                * fully mapped within the range we are operating on. Otherwise
>>>>>>>> +                * leave it as is so that it can be swapped out whole. If we
>>>>>>>> +                * fail to split a folio, leave it in place and advance to the
>>>>>>>> +                * next pte in the range.
>>>>>>>>                  */
>>>>>>>>                 if (folio_test_large(folio)) {
>>>>>>>> -                       int err;
>>>>>>>> -
>>>>>>>> -                       if (folio_estimated_sharers(folio) > 1)
>>>>>>>> -                               break;
>>>>>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>>>> -                               break;
>>>>>>>> -                       if (!folio_trylock(folio))
>>>>>>>> -                               break;
>>>>>>>> -                       folio_get(folio);
>>>>>>>> -                       arch_leave_lazy_mmu_mode();
>>>>>>>> -                       pte_unmap_unlock(start_pte, ptl);
>>>>>>>> -                       start_pte = NULL;
>>>>>>>> -                       err = split_folio(folio);
>>>>>>>> -                       folio_unlock(folio);
>>>>>>>> -                       folio_put(folio);
>>>>>>>> -                       if (err)
>>>>>>>> -                               break;
>>>>>>>> -                       start_pte = pte =
>>>>>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>>>> -                       if (!start_pte)
>>>>>>>> -                               break;
>>>>>>>> -                       arch_enter_lazy_mmu_mode();
>>>>>>>> -                       pte--;
>>>>>>>> -                       addr -= PAGE_SIZE;
>>>>>>>> -                       continue;
>>>>>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>>>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
>>>>>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>>>>>>>> +
>>>>>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>>>>>> +                                            fpb_flags, NULL);
>>>>>>>
>>>>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>>>>>> are doing madvise() on a portion of a large folio.
>>>>>>
>>>>>> Good idea. Something like this?:
>>>>>>
>>>>>>         if (pte_pfn(pte) == folio_pfn(folio)
>>>>>
>>>>> what about
>>>>>
>>>>> "If (pte_pfn(pte) == folio_pfn(folio) && max_nr >= nr_pages)"
>>>>>
>>>>>  just to account for cases where the user's end address falls within
>>>>> the middle of a large folio?
>>>>
>>>> yes, even better. I'll add this for the next version.
>>>>
>>>>>
>>>>>
>>>>> BTW, another minor issue is here:
>>>>>
>>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>>>>>                         batch_count = 0;
>>>>>                         if (need_resched()) {
>>>>>                                 arch_leave_lazy_mmu_mode();
>>>>>                                 pte_unmap_unlock(start_pte, ptl);
>>>>>                                 cond_resched();
>>>>>                                 goto restart;
>>>>>                         }
>>>>>                 }
>>>>>
>>>>> We are increasing 1 for nr ptes, thus, we are holding PTL longer
>>>>> than small folios case? we used to increase 1 for each PTE.
>>>>> Does it matter?
>>>>
>>>> I thought about that, but the vast majority of the work is per-folio, not
>>>> per-pte. So I concluded it would be best to continue to increment per-folio.
>>>
>>> Okay. The original patch commit b2f557a21bc8 ("mm/madvise: add
>>> cond_resched() in madvise_cold_or_pageout_pte_range()")
>>> primarily addressed the real-time wake-up latency issue. MADV_PAGEOUT
>>> and MADV_COLD are much less critical compared
>>> to other scenarios where operations like do_anon_page or do_swap_page
>>> necessarily need PTL to progress. Therefore, adopting
>>> an approach that relatively aggressively releases the PTL seems to
>>> neither harm MADV_PAGEOUT/COLD nor disadvantage
>>> others.
>>>
>>> We are slightly increasing the duration of holding the PTL due to the
>>> iteration of folio_pte_batch() potentially taking longer than
>>> the case of small folios, which do not require it.
>>
>> If we can't scan all the PTEs in a page table without dropping the PTL
>> intermittently we have bigger problems. This all works perfectly fine in all the
>> other PTE iterators; see zap_pte_range() for example.
> 
> I've no doubt about folio_pte_batch(). was just talking about the
> original rt issue
> it might affect.
> 
>>
>>> However, compared
>>> to operations like folio_isolate_lru() and folio_deactivate(),
>>> this increase seems negligible. Recently, we have actually removed
>>> ptep_test_and_clear_young() for MADV_PAGEOUT,
>>> which should also benefit real-time scenarios. Nonetheless, there is a
>>> small risk with large folios, such as 1 MiB mTHP, where
>>> we may need to loop 256 times in folio_pte_batch().
>>
>> As I understand it, RT and THP are mutually exclusive. RT can't handle the extra
>> latencies THPs can cause in allocation path, etc. So I don't think you will see
>> a problem here.
> 
> I was actually taking a different approach on the phones as obviously
> we have some
> UX(user-experience)/UI/audio related tasks which cannot tolerate
> allocation latency. with
> a TAO-similar optimization(we did it by ext_migratetype for some pageblocks), we
> actually don't push buddy to do compaction or reclamation for forming
> 64KiB folio.
> We immediately fallback to small folios if a zero-latency 64KiB
> allocation can't be
> obtained from some kinds of pools - ext_migratetype pageblocks.

You can opt in to avoiding latency due to compaction, etc. with the various
settings in /sys/kernel/mm/transparent_hugepage/defrag. That applies to mTHP
as well. See Documentation/admin-guide/mm/transhuge.rst. Obviously this is not
as useful as the TAO approach because it does nothing to avoid fragmentation
in the first place.

The other source of latency for THP allocation, which I believe RT doesn't like,
is the cost of zeroing the huge page, IIRC.

> 
>>
>>>
>>> I would vote for increasing 'nr' or maybe max(log2(nr), 1) rather than
>>> 1 for two reasons:
>>>
>>> 1. We are not making MADV_PAGEOUT/COLD worse; in fact, we are
>>> improving them by reducing the time taken to put the same
>>> number of pages into the reclaim list.
>>>
>>> 2. MADV_PAGEOUT/COLD scenarios are not urgent compared to others that
>>> genuinely require the PTL to progress. Moreover,
>>> the majority of time spent on PAGEOUT is actually reclaim_pages().
>>
>> I understand your logic. But I'd rather optimize for fewer lock acquisitions for
>> the !RT+THP case, since RT+THP is not supported.
> 
> Fair enough. I agree we can postpone this until RT and THP become an
> available option.
> For now, keeping this patch simpler seems to be better.

OK thanks.

> 
>>
>>>
>>>>>
>>>>>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>>>>                                      fpb_flags, NULL);
>>>>>>
>>>>>> If we are not mapping the first page of the folio, then it can't be a full
>>>>>> mapping, so no need to call folio_pte_batch(). Just split it.
>>>>>>
>>>>>>>
>>>>>>>> +
>>>>>>>> +                       if (nr < folio_nr_pages(folio)) {
>>>>>>>> +                               int err;
>>>>>>>> +
>>>>>>>> +                               if (folio_estimated_sharers(folio) > 1)
>>>>>>>> +                                       continue;
>>>>>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>>>> +                                       continue;
>>>>>>>> +                               if (!folio_trylock(folio))
>>>>>>>> +                                       continue;
>>>>>>>> +                               folio_get(folio);
>>>>>>>> +                               arch_leave_lazy_mmu_mode();
>>>>>>>> +                               pte_unmap_unlock(start_pte, ptl);
>>>>>>>> +                               start_pte = NULL;
>>>>>>>> +                               err = split_folio(folio);
>>>>>>>> +                               folio_unlock(folio);
>>>>>>>> +                               folio_put(folio);
>>>>>>>> +                               if (err)
>>>>>>>> +                                       continue;
>>>>>>>> +                               start_pte = pte =
>>>>>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>>>> +                               if (!start_pte)
>>>>>>>> +                                       break;
>>>>>>>> +                               arch_enter_lazy_mmu_mode();
>>>>>>>> +                               nr = 0;
>>>>>>>> +                               continue;
>>>>>>>> +                       }
>>>>>>>>                 }
>>>>>>>>
>>>>>>>>                 /*
>>>>>>>>                  * Do not interfere with other mappings of this folio and
>>>>>>>> -                * non-LRU folio.
>>>>>>>> +                * non-LRU folio. If we have a large folio at this point, we
>>>>>>>> +                * know it is fully mapped so if its mapcount is the same as its
>>>>>>>> +                * number of pages, it must be exclusive.
>>>>>>>>                  */
>>>>>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>>>>>> +               if (!folio_test_lru(folio) ||
>>>>>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>>>>>>>                         continue;
>>>>>>>
>>>>>>> This looks so perfect and is exactly what I wanted to achieve.
>>>>>>>
>>>>>>>>
>>>>>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>>>>                         continue;
>>>>>>>>
>>>>>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>>> -
>>>>>>>> -               if (!pageout && pte_young(ptent)) {
>>>>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>>>> -                                                       tlb->fullmm);
>>>>>>>> -                       ptent = pte_mkold(ptent);
>>>>>>>> -                       set_pte_at(mm, addr, pte, ptent);
>>>>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>> +               if (!pageout) {
>>>>>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>>>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
>>>>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>> +                       }
>>>>>>>
>>>>>>> This looks so smart. if it is not pageout, we have increased pte
>>>>>>> and addr here; so nr is 0 and we don't need to increase again in
>>>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>>>>>
>>>>>>> otherwise, nr won't be 0. so we will increase addr and
>>>>>>> pte by nr.
>>>>>>
>>>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>>>>>> madvise_free_pte_range().
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>                 }
>>>>>>>>
>>>>>>>>                 /*
>>>>>>>> --
>>>>>>>> 2.25.1
>>>>>>>>
>>>>>>>
>>>>>>> Overall, LGTM,
>>>>>>>
>>>>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>
>>> Thanks
>>> Barry
>>
Lance Yang March 13, 2024, 2:02 p.m. UTC | #10
On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 13/03/2024 07:19, Barry Song wrote:
> > On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >> folio that is fully and contiguously mapped in the pageout/cold vm
> >> range. This change means that large folios will be maintained all the
> >> way to swap storage. This both improves performance during swap-out, by
> >> eliding the cost of splitting the folio, and sets us up nicely for
> >> maintaining the large folio when it is swapped back in (to be covered in
> >> a separate series).
> >>
> >> Folios that are not fully mapped in the target range are still split,
> >> but note that behavior is changed so that if the split fails for any
> >> reason (folio locked, shared, etc) we now leave it as is and move to the
> >> next pte in the range and continue work on the proceeding folios.
> >> Previously any failure of this sort would cause the entire operation to
> >> give up and no folios mapped at higher addresses were paged out or made
> >> cold. Given large folios are becoming more common, this old behavior
> >> would have likely lead to wasted opportunities.
> >>
> >> While we are at it, change the code that clears young from the ptes to
> >> use ptep_test_and_clear_young(), which is more efficent than
> >> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >> where the old approach would require unfolding/refolding and the new
> >> approach can be done in place.
> >>
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >
> > This looks so much better than our initial RFC.
> > Thank you for your excellent work!
>
> Thanks - its a team effort - I had your PoC and David's previous batching work
> to use as a template.
>
> >
> >> ---
> >>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index 547dcd1f7a39..56c7ba7bd558 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>         LIST_HEAD(folio_list);
> >>         bool pageout_anon_only_filter;
> >>         unsigned int batch_count = 0;
> >> +       int nr;
> >>
> >>         if (fatal_signal_pending(current))
> >>                 return -EINTR;
> >> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>                 return 0;
> >>         flush_tlb_batched_pending(mm);
> >>         arch_enter_lazy_mmu_mode();
> >> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >> +               nr = 1;
> >>                 ptent = ptep_get(pte);
> >>
> >>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>                         continue;
> >>
> >>                 /*
> >> -                * Creating a THP page is expensive so split it only if we
> >> -                * are sure it's worth. Split it if we are only owner.
> >> +                * If we encounter a large folio, only split it if it is not
> >> +                * fully mapped within the range we are operating on. Otherwise
> >> +                * leave it as is so that it can be swapped out whole. If we
> >> +                * fail to split a folio, leave it in place and advance to the
> >> +                * next pte in the range.
> >>                  */
> >>                 if (folio_test_large(folio)) {
> >> -                       int err;
> >> -
> >> -                       if (folio_estimated_sharers(folio) > 1)
> >> -                               break;
> >> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> -                               break;
> >> -                       if (!folio_trylock(folio))
> >> -                               break;
> >> -                       folio_get(folio);
> >> -                       arch_leave_lazy_mmu_mode();
> >> -                       pte_unmap_unlock(start_pte, ptl);
> >> -                       start_pte = NULL;
> >> -                       err = split_folio(folio);
> >> -                       folio_unlock(folio);
> >> -                       folio_put(folio);
> >> -                       if (err)
> >> -                               break;
> >> -                       start_pte = pte =
> >> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> -                       if (!start_pte)
> >> -                               break;
> >> -                       arch_enter_lazy_mmu_mode();
> >> -                       pte--;
> >> -                       addr -= PAGE_SIZE;
> >> -                       continue;
> >> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >> +                                               FPB_IGNORE_SOFT_DIRTY;
> >> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >> +
> >> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >> +                                            fpb_flags, NULL);
> >
> > I wonder if we have a quick way to avoid folio_pte_batch() if users
> > are doing madvise() on a portion of a large folio.
>
> Good idea. Something like this?:
>
>         if (pte_pfn(pte) == folio_pfn(folio)
>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>                                      fpb_flags, NULL);
>
> If we are not mapping the first page of the folio, then it can't be a full
> mapping, so no need to call folio_pte_batch(). Just split it.

                 if (folio_test_large(folio)) {
[...]
                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
                                            fpb_flags, NULL);
+                       if (folio_estimated_sharers(folio) > 1)
+                               continue;

Could we use folio_estimated_sharers as an early exit point here?

                       if (nr < folio_nr_pages(folio)) {
                               int err;

-                               if (folio_estimated_sharers(folio) > 1)
-                                       continue;
[...]
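
For clarity, a rough sketch of the reordering being suggested here (a
sketch of the proposal only, not code from the series):

                if (folio_test_large(folio)) {
                        const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
                                                FPB_IGNORE_SOFT_DIRTY;
                        int max_nr = (end - addr) / PAGE_SIZE;

                        nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
                                             fpb_flags, NULL);

                        /* Early exit for any large folio estimated shared. */
                        if (folio_estimated_sharers(folio) > 1)
                                continue;

                        if (nr < folio_nr_pages(folio)) {
                                /*
                                 * Not fully mapped: trylock, split_folio()
                                 * and remap, as in the hunk above.
                                 */
                        }
                }

Either way, a fully mapped but shared folio should still end up skipped
by the later folio_mapcount() != folio_nr_pages() check; hoisting the
test just rejects it before that point.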

>
> >
> >> +
> >> +                       if (nr < folio_nr_pages(folio)) {
> >> +                               int err;
> >> +
> >> +                               if (folio_estimated_sharers(folio) > 1)
> >> +                                       continue;
> >> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >> +                                       continue;
> >> +                               if (!folio_trylock(folio))
> >> +                                       continue;
> >> +                               folio_get(folio);
> >> +                               arch_leave_lazy_mmu_mode();
> >> +                               pte_unmap_unlock(start_pte, ptl);
> >> +                               start_pte = NULL;
> >> +                               err = split_folio(folio);
> >> +                               folio_unlock(folio);
> >> +                               folio_put(folio);
> >> +                               if (err)
> >> +                                       continue;
> >> +                               start_pte = pte =
> >> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >> +                               if (!start_pte)
> >> +                                       break;
> >> +                               arch_enter_lazy_mmu_mode();
> >> +                               nr = 0;
> >> +                               continue;
> >> +                       }
> >>                 }
> >>
> >>                 /*
> >>                  * Do not interfere with other mappings of this folio and
> >> -                * non-LRU folio.
> >> +                * non-LRU folio. If we have a large folio at this point, we
> >> +                * know it is fully mapped so if its mapcount is the same as its
> >> +                * number of pages, it must be exclusive.
> >>                  */
> >> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >> +               if (!folio_test_lru(folio) ||
> >> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>                         continue;
> >
> > This looks so perfect and is exactly what I wanted to achieve.
> >
> >>
> >>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>                         continue;
> >>
> >> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >> -
> >> -               if (!pageout && pte_young(ptent)) {
> >> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >> -                                                       tlb->fullmm);
> >> -                       ptent = pte_mkold(ptent);
> >> -                       set_pte_at(mm, addr, pte, ptent);
> >> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >> +               if (!pageout) {
> >> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >> +                                       tlb_remove_tlb_entry(tlb, pte, addr);

IIRC, some architectures (e.g., PPC) don't update the TLB with set_pte_at() and
tlb_remove_tlb_entry(). So, didn't we consider remapping the PTE as old after
clearing it?

Thanks,
Lance



> >> +                       }
> >
> > This looks so smart. if it is not pageout, we have increased pte
> > and addr here; so nr is 0 and we don't need to increase again in
> > for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >
> > otherwise, nr won't be 0. so we will increase addr and
> > pte by nr.
>
> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> madvise_free_pte_range().
>
>
> >
> >
> >>                 }
> >>
> >>                 /*
> >> --
> >> 2.25.1
> >>
> >
> > Overall, LGTM,
> >
> > Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>
> Thanks!
>
>
Ryan Roberts March 15, 2024, 10:55 a.m. UTC | #11
On 15/03/2024 10:35, David Hildenbrand wrote:
>> -        if (!pageout && pte_young(ptent)) {
>> -            ptent = ptep_get_and_clear_full(mm, addr, pte,
>> -                            tlb->fullmm);
>> -            ptent = pte_mkold(ptent);
>> -            set_pte_at(mm, addr, pte, ptent);
>> -            tlb_remove_tlb_entry(tlb, pte, addr);
>> +        if (!pageout) {
>> +            for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>> +                if (ptep_test_and_clear_young(vma, addr, pte))
>> +                    tlb_remove_tlb_entry(tlb, pte, addr);
>> +            }
>>           }
> 
> 
> The following might turn out a bit nicer: Make folio_pte_batch() collect
> "any_young", then doing something like we do with "any_writable" in the fork()
> case:
> 
> ...
>     nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>                  fpb_flags, NULL, any_young);
>     if (any_young)
>         pte_mkyoung(ptent)
> ...
> 
> if (!pageout && pte_young(ptent)) {
>     mkold_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>     tlb_remove_tlb_entries(tlb, pte, nr, addr);
> }
> 

I thought about that but decided that it would be better to only TLBI the actual
entries that were young. Although looking at tlb_remove_tlb_entry() I see that
it just maintains a range between the lowest and highest address, so this won't
actually make any difference.

So, yes, this will be a nice improvement, and also prevent the O(n^2) pte reads
for the contpte case. I'll change in the next version.
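
As an aside for readers following the thread, here is a minimal userspace sketch
of the batching pattern being discussed: scan the folio's ptes once, record
whether any of them was young, then age the whole run in a single pass so only
one TLB range needs to be recorded. The names any_young, folio_pte_batch() and
mkold_ptes() come from the discussion above; the toy_* types and helpers are
purely illustrative stand-ins, not kernel code.

#include <stdbool.h>
#include <stdio.h>

struct toy_pte { bool young; };

/*
 * Stand-in for folio_pte_batch(..., &any_young): report how many consecutive
 * ptes map this folio and whether any of them was young. Here the whole run
 * is assumed to map the folio.
 */
static int toy_pte_batch(const struct toy_pte *pte, int max_nr, bool *any_young)
{
	*any_young = false;
	for (int i = 0; i < max_nr; i++)
		*any_young |= pte[i].young;
	return max_nr;
}

/* Stand-in for mkold_ptes(): clear young across the whole batch. */
static void toy_mkold_ptes(struct toy_pte *pte, int nr)
{
	for (int i = 0; i < nr; i++)
		pte[i].young = false;
}

int main(void)
{
	struct toy_pte ptes[8] = { [2] = { .young = true }, [5] = { .young = true } };
	bool any_young;
	int nr = toy_pte_batch(ptes, 8, &any_young);

	if (any_young)				/* one decision for the whole folio... */
		toy_mkold_ptes(ptes, nr);	/* ...one aging pass, one TLB range */

	printf("nr=%d any_young=%d\n", nr, any_young);
	return 0;
}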
David Hildenbrand March 15, 2024, 11:13 a.m. UTC | #12
On 15.03.24 11:55, Ryan Roberts wrote:
> On 15/03/2024 10:35, David Hildenbrand wrote:
>>> -        if (!pageout && pte_young(ptent)) {
>>> -            ptent = ptep_get_and_clear_full(mm, addr, pte,
>>> -                            tlb->fullmm);
>>> -            ptent = pte_mkold(ptent);
>>> -            set_pte_at(mm, addr, pte, ptent);
>>> -            tlb_remove_tlb_entry(tlb, pte, addr);
>>> +        if (!pageout) {
>>> +            for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>> +                if (ptep_test_and_clear_young(vma, addr, pte))
>>> +                    tlb_remove_tlb_entry(tlb, pte, addr);
>>> +            }
>>>            }
>>
>>
>> The following might turn out a bit nicer: Make folio_pte_batch() collect
>> "any_young", then doing something like we do with "any_writable" in the fork()
>> case:
>>
>> ...
>>      nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>                   fpb_flags, NULL, any_young);
>>      if (any_young)
>>          pte_mkyoung(ptent)
>> ...
>>
>> if (!pageout && pte_young(ptent)) {
>>      mkold_full_ptes(mm, addr, pte, nr, tlb->fullmm);
>>      tlb_remove_tlb_entries(tlb, pte, nr, addr);
>> }
>>
> 
> I thought about that but decided that it would be better to only TLBI the actual
> entries that were young. Although looking at tlb_remove_tlb_entry() I see that
> it just maintains a range between the lowest and highest address, so this won't
> actually make any difference.

Yes, tlb_remove_tlb_entries() looks scarier than it actually is :)
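
To illustrate the range-tracking point made above, here is a toy model
(assumptions mine, not the real mmu_gather code) of why recording only the
young entries does not shrink the eventual flush: each recorded entry merely
widens a pending [start, end) range, so everything in between is flushed anyway.

#include <stdio.h>

#define PAGE_SIZE 4096UL

/* Toy gather structure: only a lowest/highest address range is kept. */
struct toy_gather { unsigned long start, end; };

static void toy_remove_tlb_entry(struct toy_gather *tlb, unsigned long addr)
{
	if (addr < tlb->start)
		tlb->start = addr;
	if (addr + PAGE_SIZE > tlb->end)
		tlb->end = addr + PAGE_SIZE;
}

int main(void)
{
	struct toy_gather tlb = { .start = ~0UL, .end = 0 };

	/* Record only two "young" entries... */
	toy_remove_tlb_entry(&tlb, 0x1000);
	toy_remove_tlb_entry(&tlb, 0x5000);

	/* ...the pending flush still covers everything in between. */
	printf("pending flush: [%#lx, %#lx)\n", tlb.start, tlb.end);
	return 0;
}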
Ryan Roberts March 20, 2024, 1:49 p.m. UTC | #13
Hi Lance, Barry,

Sorry - I totally missed this when you originally sent it!


On 13/03/2024 14:02, Lance Yang wrote:
> On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 13/03/2024 07:19, Barry Song wrote:
>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>> range. This change means that large folios will be maintained all the
>>>> way to swap storage. This both improves performance during swap-out, by
>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>> a separate series).
>>>>
>>>> Folios that are not fully mapped in the target range are still split,
>>>> but note that behavior is changed so that if the split fails for any
>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>> next pte in the range and continue work on the proceeding folios.
>>>> Previously any failure of this sort would cause the entire operation to
>>>> give up and no folios mapped at higher addresses were paged out or made
>>>> cold. Given large folios are becoming more common, this old behavior
>>>> would have likely lead to wasted opportunities.
>>>>
>>>> While we are at it, change the code that clears young from the ptes to
>>>> use ptep_test_and_clear_young(), which is more efficent than
>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>> where the old approach would require unfolding/refolding and the new
>>>> approach can be done in place.
>>>>
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>
>>> This looks so much better than our initial RFC.
>>> Thank you for your excellent work!
>>
>> Thanks - its a team effort - I had your PoC and David's previous batching work
>> to use as a template.
>>
>>>
>>>> ---
>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
>>>>
>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>         LIST_HEAD(folio_list);
>>>>         bool pageout_anon_only_filter;
>>>>         unsigned int batch_count = 0;
>>>> +       int nr;
>>>>
>>>>         if (fatal_signal_pending(current))
>>>>                 return -EINTR;
>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                 return 0;
>>>>         flush_tlb_batched_pending(mm);
>>>>         arch_enter_lazy_mmu_mode();
>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>> +               nr = 1;
>>>>                 ptent = ptep_get(pte);
>>>>
>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>                         continue;
>>>>
>>>>                 /*
>>>> -                * Creating a THP page is expensive so split it only if we
>>>> -                * are sure it's worth. Split it if we are only owner.
>>>> +                * If we encounter a large folio, only split it if it is not
>>>> +                * fully mapped within the range we are operating on. Otherwise
>>>> +                * leave it as is so that it can be swapped out whole. If we
>>>> +                * fail to split a folio, leave it in place and advance to the
>>>> +                * next pte in the range.
>>>>                  */
>>>>                 if (folio_test_large(folio)) {
>>>> -                       int err;
>>>> -
>>>> -                       if (folio_estimated_sharers(folio) > 1)
>>>> -                               break;
>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> -                               break;
>>>> -                       if (!folio_trylock(folio))
>>>> -                               break;
>>>> -                       folio_get(folio);
>>>> -                       arch_leave_lazy_mmu_mode();
>>>> -                       pte_unmap_unlock(start_pte, ptl);
>>>> -                       start_pte = NULL;
>>>> -                       err = split_folio(folio);
>>>> -                       folio_unlock(folio);
>>>> -                       folio_put(folio);
>>>> -                       if (err)
>>>> -                               break;
>>>> -                       start_pte = pte =
>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> -                       if (!start_pte)
>>>> -                               break;
>>>> -                       arch_enter_lazy_mmu_mode();
>>>> -                       pte--;
>>>> -                       addr -= PAGE_SIZE;
>>>> -                       continue;
>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>>>> +
>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>> +                                            fpb_flags, NULL);
>>>
>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>> are doing madvise() on a portion of a large folio.
>>
>> Good idea. Something like this?:
>>
>>         if (pte_pfn(pte) == folio_pfn(folio))
>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>                                      fpb_flags, NULL);
>>
>> If we are not mapping the first page of the folio, then it can't be a full
>> mapping, so no need to call folio_pte_batch(). Just split it.
> 
>                  if (folio_test_large(folio)) {
> [...]
>                        nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>                                             fpb_flags, NULL);
> +                       if (folio_estimated_sharers(folio) > 1)
> +                               continue;
> 
> Could we use folio_estimated_sharers as an early exit point here?

I'm not sure what this is saving where you have it? Did you mean to put it
before folio_pte_batch()? Currently it is just saving a single conditional.

But now that I think about it a bit more, I remember why I was originally
unconditionally calling folio_pte_batch(). Given it's a large folio, if the split
fails, we can move the cursor to the pte where the next folio begins, so we don't
have to iterate through one pte at a time, which would cause us to keep calling
folio_estimated_sharers(), folio_test_anon(), etc on the same folio until we get
to the next boundary.

Of course the common case at this point will be for the split to succeed, but
then we are going to iterate over every single PTE anyway - one way or another
they are all fetched into cache. So I feel like it's neater not to add the
conditionals for calling folio_pte_batch(), and just leave this as I have it here.
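
To make the cursor arithmetic concrete, here is a toy model of the outer-loop
pattern from the patch: nr defaults to 1, a large folio advances the cursor by
the whole batch in one step (whether or not the split succeeded), and nr = 0 is
used when the same address has to be re-read after the lock was dropped. The
addresses and the "retried" flag below are made up purely so the sketch
terminates; it models the control flow only, not the kernel code.

#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
	unsigned long addr = 0, end = 16 * PAGE_SIZE;
	int retried = 0;
	int nr;

	for (; addr < end; addr += nr * PAGE_SIZE) {
		nr = 1;					/* default: step one pte */

		if (addr == 4 * PAGE_SIZE) {		/* pretend a 4-page folio starts here */
			nr = 4;				/* skip the whole batch in one go */
			printf("batched folio: %#lx..%#lx\n",
			       addr, addr + nr * PAGE_SIZE);
			continue;
		}

		if (addr == 10 * PAGE_SIZE && !retried) {
			retried = 1;			/* pretend we dropped and re-took the PTL */
			printf("lock dropped at %#lx, re-reading the same pte\n", addr);
			nr = 0;				/* next iteration revisits this address */
			continue;
		}

		printf("single pte at %#lx\n", addr);
	}
	return 0;
}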

> 
>                        if (nr < folio_nr_pages(folio)) {
>                                int err;
> 
> -                               if (folio_estimated_sharers(folio) > 1)
> -                                       continue;
> [...]
> 
>>
>>>
>>>> +
>>>> +                       if (nr < folio_nr_pages(folio)) {
>>>> +                               int err;
>>>> +
>>>> +                               if (folio_estimated_sharers(folio) > 1)
>>>> +                                       continue;
>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>> +                                       continue;
>>>> +                               if (!folio_trylock(folio))
>>>> +                                       continue;
>>>> +                               folio_get(folio);
>>>> +                               arch_leave_lazy_mmu_mode();
>>>> +                               pte_unmap_unlock(start_pte, ptl);
>>>> +                               start_pte = NULL;
>>>> +                               err = split_folio(folio);
>>>> +                               folio_unlock(folio);
>>>> +                               folio_put(folio);
>>>> +                               if (err)
>>>> +                                       continue;
>>>> +                               start_pte = pte =
>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>> +                               if (!start_pte)
>>>> +                                       break;
>>>> +                               arch_enter_lazy_mmu_mode();
>>>> +                               nr = 0;
>>>> +                               continue;
>>>> +                       }
>>>>                 }
>>>>
>>>>                 /*
>>>>                  * Do not interfere with other mappings of this folio and
>>>> -                * non-LRU folio.
>>>> +                * non-LRU folio. If we have a large folio at this point, we
>>>> +                * know it is fully mapped so if its mapcount is the same as its
>>>> +                * number of pages, it must be exclusive.
>>>>                  */
>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>> +               if (!folio_test_lru(folio) ||
>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>>>                         continue;
>>>
>>> This looks so perfect and is exactly what I wanted to achieve.
>>>
>>>>
>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>                         continue;
>>>>
>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>> -
>>>> -               if (!pageout && pte_young(ptent)) {
>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>> -                                                       tlb->fullmm);
>>>> -                       ptent = pte_mkold(ptent);
>>>> -                       set_pte_at(mm, addr, pte, ptent);
>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>> +               if (!pageout) {
>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> 
> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
> pte clearing?

Sorry Lance, I don't understand this question, can you rephrase? Are you saying
there is a good reason to do the original clear-mkold-set for some arches?

> 
> Thanks,
> Lance
> 
> 
> 
>>>> +                       }
>>>
>>> This looks so smart. if it is not pageout, we have increased pte
>>> and addr here; so nr is 0 and we don't need to increase again in
>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>
>>> otherwise, nr won't be 0. so we will increase addr and
>>> pte by nr.
>>
>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>> madvise_free_pte_range().
>>
>>
>>>
>>>
>>>>                 }
>>>>
>>>>                 /*
>>>> --
>>>> 2.25.1
>>>>
>>>
>>> Overall, LGTM,
>>>
>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>
>> Thanks!
>>
>>
Ryan Roberts March 20, 2024, 1:57 p.m. UTC | #14
On 15/03/2024 10:35, David Hildenbrand wrote:
>> -        if (!pageout && pte_young(ptent)) {
>> -            ptent = ptep_get_and_clear_full(mm, addr, pte,
>> -                            tlb->fullmm);
>> -            ptent = pte_mkold(ptent);
>> -            set_pte_at(mm, addr, pte, ptent);
>> -            tlb_remove_tlb_entry(tlb, pte, addr);
>> +        if (!pageout) {
>> +            for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>> +                if (ptep_test_and_clear_young(vma, addr, pte))
>> +                    tlb_remove_tlb_entry(tlb, pte, addr);
>> +            }
>>           }
> 
> 
> The following might turn out a bit nicer: Make folio_pte_batch() collect
> "any_young", then doing something like we do with "any_writable" in the fork()
> case:
> 
> ...
>     nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>                  fpb_flags, NULL, any_young);
>     if (any_young)
>         pte_mkyoung(ptent)
> ...
> 
> if (!pageout && pte_young(ptent)) {
>     mkold_full_ptes(mm, addr, pte, nr, tlb->fullmm);

I don't think tlb->fullmm makes sense here because we are not clearing the pte,
so there is no chance of optimization? So I'm planning to call this mkold_ptes()
and remove that param. Have I missed something?

>     tlb_remove_tlb_entries(tlb, pte, nr, addr);
> }
>
David Hildenbrand March 20, 2024, 2:09 p.m. UTC | #15
On 20.03.24 14:57, Ryan Roberts wrote:
> On 15/03/2024 10:35, David Hildenbrand wrote:
>>> -        if (!pageout && pte_young(ptent)) {
>>> -            ptent = ptep_get_and_clear_full(mm, addr, pte,
>>> -                            tlb->fullmm);
>>> -            ptent = pte_mkold(ptent);
>>> -            set_pte_at(mm, addr, pte, ptent);
>>> -            tlb_remove_tlb_entry(tlb, pte, addr);
>>> +        if (!pageout) {
>>> +            for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>> +                if (ptep_test_and_clear_young(vma, addr, pte))
>>> +                    tlb_remove_tlb_entry(tlb, pte, addr);
>>> +            }
>>>            }
>>
>>
>> The following might turn out a bit nicer: Make folio_pte_batch() collect
>> "any_young", then doing something like we do with "any_writable" in the fork()
>> case:
>>
>> ...
>>      nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>                   fpb_flags, NULL, any_young);
>>      if (any_young)
>>          pte_mkyoung(ptent)
>> ...
>>
>> if (!pageout && pte_young(ptent)) {
>>      mkold_full_ptes(mm, addr, pte, nr, tlb->fullmm);
> 
> I don't think tlb->fullmm makes sense here because we are not clearing the pte,
> so there is no chance of optimization? So planning to call this mkold_ptes() and
> remove that param. Have I missed something?

Agreed.
Lance Yang March 20, 2024, 2:35 p.m. UTC | #16
On Wed, Mar 20, 2024 at 9:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Lance, Barry,
>
> Sorry - I totally missed this when you originally sent it!

No worries at all :)

>
>
> On 13/03/2024 14:02, Lance Yang wrote:
> > On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 13/03/2024 07:19, Barry Song wrote:
> >>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>> range. This change means that large folios will be maintained all the
> >>>> way to swap storage. This both improves performance during swap-out, by
> >>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>> a separate series).
> >>>>
> >>>> Folios that are not fully mapped in the target range are still split,
> >>>> but note that behavior is changed so that if the split fails for any
> >>>> reason (folio locked, shared, etc) we now leave it as is and move to the
> >>>> next pte in the range and continue work on the proceeding folios.
> >>>> Previously any failure of this sort would cause the entire operation to
> >>>> give up and no folios mapped at higher addresses were paged out or made
> >>>> cold. Given large folios are becoming more common, this old behavior
> >>>> would have likely lead to wasted opportunities.
> >>>>
> >>>> While we are at it, change the code that clears young from the ptes to
> >>>> use ptep_test_and_clear_young(), which is more efficent than
> >>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>> where the old approach would require unfolding/refolding and the new
> >>>> approach can be done in place.
> >>>>
> >>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>
> >>> This looks so much better than our initial RFC.
> >>> Thank you for your excellent work!
> >>
> >> Thanks - its a team effort - I had your PoC and David's previous batching work
> >> to use as a template.
> >>
> >>>
> >>>> ---
> >>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>
> >>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>         LIST_HEAD(folio_list);
> >>>>         bool pageout_anon_only_filter;
> >>>>         unsigned int batch_count = 0;
> >>>> +       int nr;
> >>>>
> >>>>         if (fatal_signal_pending(current))
> >>>>                 return -EINTR;
> >>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                 return 0;
> >>>>         flush_tlb_batched_pending(mm);
> >>>>         arch_enter_lazy_mmu_mode();
> >>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>> +               nr = 1;
> >>>>                 ptent = ptep_get(pte);
> >>>>
> >>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>                         continue;
> >>>>
> >>>>                 /*
> >>>> -                * Creating a THP page is expensive so split it only if we
> >>>> -                * are sure it's worth. Split it if we are only owner.
> >>>> +                * If we encounter a large folio, only split it if it is not
> >>>> +                * fully mapped within the range we are operating on. Otherwise
> >>>> +                * leave it as is so that it can be swapped out whole. If we
> >>>> +                * fail to split a folio, leave it in place and advance to the
> >>>> +                * next pte in the range.
> >>>>                  */
> >>>>                 if (folio_test_large(folio)) {
> >>>> -                       int err;
> >>>> -
> >>>> -                       if (folio_estimated_sharers(folio) > 1)
> >>>> -                               break;
> >>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> -                               break;
> >>>> -                       if (!folio_trylock(folio))
> >>>> -                               break;
> >>>> -                       folio_get(folio);
> >>>> -                       arch_leave_lazy_mmu_mode();
> >>>> -                       pte_unmap_unlock(start_pte, ptl);
> >>>> -                       start_pte = NULL;
> >>>> -                       err = split_folio(folio);
> >>>> -                       folio_unlock(folio);
> >>>> -                       folio_put(folio);
> >>>> -                       if (err)
> >>>> -                               break;
> >>>> -                       start_pte = pte =
> >>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> -                       if (!start_pte)
> >>>> -                               break;
> >>>> -                       arch_enter_lazy_mmu_mode();
> >>>> -                       pte--;
> >>>> -                       addr -= PAGE_SIZE;
> >>>> -                       continue;
> >>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>> +                                               FPB_IGNORE_SOFT_DIRTY;
> >>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >>>> +
> >>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>> +                                            fpb_flags, NULL);
> >>>
> >>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>> are doing madvise() on a portion of a large folio.
> >>
> >> Good idea. Something like this?:
> >>
> >>         if (pte_pfn(pte) == folio_pfn(folio))
> >>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>                                      fpb_flags, NULL);
> >>
> >> If we are not mapping the first page of the folio, then it can't be a full
> >> mapping, so no need to call folio_pte_batch(). Just split it.
> >
> >                  if (folio_test_large(folio)) {
> > [...]
> >                        nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >                                             fpb_flags, NULL);
> > +                       if (folio_estimated_sharers(folio) > 1)
> > +                               continue;
> >
> > Could we use folio_estimated_sharers as an early exit point here?
>
> I'm not sure what this is saving where you have it? Did you mean to put it
> before folio_pte_batch()? Currently it is just saving a single conditional.

Apologies for the confusion. I made a diff to provide clarity.

diff --git a/mm/madvise.c b/mm/madvise.c
index 56c7ba7bd558..c3458fdea82a 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -462,12 +462,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,

                        nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
                                             fpb_flags, NULL);
-
// Could we use folio_estimated_sharers as an early exit point here?
+                       if (folio_estimated_sharers(folio) > 1)
+                               continue;
                        if (nr < folio_nr_pages(folio)) {
                                int err;

-                               if (folio_estimated_sharers(folio) > 1)
-                                       continue;
                                if (pageout_anon_only_filter && !folio_test_anon(folio))
                                        continue;
                                if (!folio_trylock(folio))

>
> But now that I think about it a bit more, I remember why I was originally
> unconditionally calling folio_pte_batch(). Given its a large folio, if the split
> fails, we can move the cursor to the pte where the next folio begins so we don't
> have to iterate through one pte at a time which would cause us to keep calling
> folio_estimated_sharers(), folio_test_anon(), etc on the same folio until we get
> to the next boundary.
>
> Of course the common case at this point will be for the split to succeed, but
> then we are going to iterate over ever single PTE anyway - one way or another
> they are all fetched into cache. So I feel like its neater not to add the
> conditionals for calling folio_pte_batch(), and just leave this as I have it here.
>
> >
> >                        if (nr < folio_nr_pages(folio)) {
> >                                int err;
> >
> > -                               if (folio_estimated_sharers(folio) > 1)
> > -                                       continue;
> > [...]
> >
> >>
> >>>
> >>>> +
> >>>> +                       if (nr < folio_nr_pages(folio)) {
> >>>> +                               int err;
> >>>> +
> >>>> +                               if (folio_estimated_sharers(folio) > 1)
> >>>> +                                       continue;
> >>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>> +                                       continue;
> >>>> +                               if (!folio_trylock(folio))
> >>>> +                                       continue;
> >>>> +                               folio_get(folio);
> >>>> +                               arch_leave_lazy_mmu_mode();
> >>>> +                               pte_unmap_unlock(start_pte, ptl);
> >>>> +                               start_pte = NULL;
> >>>> +                               err = split_folio(folio);
> >>>> +                               folio_unlock(folio);
> >>>> +                               folio_put(folio);
> >>>> +                               if (err)
> >>>> +                                       continue;
> >>>> +                               start_pte = pte =
> >>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>> +                               if (!start_pte)
> >>>> +                                       break;
> >>>> +                               arch_enter_lazy_mmu_mode();
> >>>> +                               nr = 0;
> >>>> +                               continue;
> >>>> +                       }
> >>>>                 }
> >>>>
> >>>>                 /*
> >>>>                  * Do not interfere with other mappings of this folio and
> >>>> -                * non-LRU folio.
> >>>> +                * non-LRU folio. If we have a large folio at this point, we
> >>>> +                * know it is fully mapped so if its mapcount is the same as its
> >>>> +                * number of pages, it must be exclusive.
> >>>>                  */
> >>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>> +               if (!folio_test_lru(folio) ||
> >>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>                         continue;
> >>>
> >>> This looks so perfect and is exactly what I wanted to achieve.
> >>>
> >>>>
> >>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>                         continue;
> >>>>
> >>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>> -
> >>>> -               if (!pageout && pte_young(ptent)) {
> >>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>> -                                                       tlb->fullmm);
> >>>> -                       ptent = pte_mkold(ptent);
> >>>> -                       set_pte_at(mm, addr, pte, ptent);
> >>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>> +               if (!pageout) {
> >>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >
> > IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
> > tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
> > pte clearing?
>
> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
> there is a good reason to do the original clear-mkold-set for some arches?

IIRC, some architectures (e.g., PPC) don't update the TLB with
ptep_test_and_clear_young() and tlb_remove_tlb_entry().

In my new patch[1], I use refresh_full_ptes() and tlb_remove_tlb_entries()
to batch-update the access and dirty bits.

[1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@gmail.com

Thanks,
Lance

>
> >
> > Thanks,
> > Lance
> >
> >
> >
> >>>> +                       }
> >>>
> >>> This looks so smart. if it is not pageout, we have increased pte
> >>> and addr here; so nr is 0 and we don't need to increase again in
> >>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>
> >>> otherwise, nr won't be 0. so we will increase addr and
> >>> pte by nr.
> >>
> >> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >> madvise_free_pte_range().
> >>
> >>
> >>>
> >>>
> >>>>                 }
> >>>>
> >>>>                 /*
> >>>> --
> >>>> 2.25.1
> >>>>
> >>>
> >>> Overall, LGTM,
> >>>
> >>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> >>
> >> Thanks!
> >>
> >>
>
Ryan Roberts March 20, 2024, 5:38 p.m. UTC | #17
On 20/03/2024 14:35, Lance Yang wrote:
> On Wed, Mar 20, 2024 at 9:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi Lance, Barry,
>>
>> Sorry - I totally missed this when you originally sent it!
> 
> No worries at all :)
> 
>>
>>
>> On 13/03/2024 14:02, Lance Yang wrote:
>>> On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 13/03/2024 07:19, Barry Song wrote:
>>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
>>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
>>>>>> range. This change means that large folios will be maintained all the
>>>>>> way to swap storage. This both improves performance during swap-out, by
>>>>>> eliding the cost of splitting the folio, and sets us up nicely for
>>>>>> maintaining the large folio when it is swapped back in (to be covered in
>>>>>> a separate series).
>>>>>>
>>>>>> Folios that are not fully mapped in the target range are still split,
>>>>>> but note that behavior is changed so that if the split fails for any
>>>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
>>>>>> next pte in the range and continue work on the proceeding folios.
>>>>>> Previously any failure of this sort would cause the entire operation to
>>>>>> give up and no folios mapped at higher addresses were paged out or made
>>>>>> cold. Given large folios are becoming more common, this old behavior
>>>>>> would have likely lead to wasted opportunities.
>>>>>>
>>>>>> While we are at it, change the code that clears young from the ptes to
>>>>>> use ptep_test_and_clear_young(), which is more efficent than
>>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
>>>>>> where the old approach would require unfolding/refolding and the new
>>>>>> approach can be done in place.
>>>>>>
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>
>>>>> This looks so much better than our initial RFC.
>>>>> Thank you for your excellent work!
>>>>
>>>> Thanks - its a team effort - I had your PoC and David's previous batching work
>>>> to use as a template.
>>>>
>>>>>
>>>>>> ---
>>>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
>>>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
>>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
>>>>>> --- a/mm/madvise.c
>>>>>> +++ b/mm/madvise.c
>>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>         LIST_HEAD(folio_list);
>>>>>>         bool pageout_anon_only_filter;
>>>>>>         unsigned int batch_count = 0;
>>>>>> +       int nr;
>>>>>>
>>>>>>         if (fatal_signal_pending(current))
>>>>>>                 return -EINTR;
>>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>                 return 0;
>>>>>>         flush_tlb_batched_pending(mm);
>>>>>>         arch_enter_lazy_mmu_mode();
>>>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
>>>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
>>>>>> +               nr = 1;
>>>>>>                 ptent = ptep_get(pte);
>>>>>>
>>>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
>>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
>>>>>>                         continue;
>>>>>>
>>>>>>                 /*
>>>>>> -                * Creating a THP page is expensive so split it only if we
>>>>>> -                * are sure it's worth. Split it if we are only owner.
>>>>>> +                * If we encounter a large folio, only split it if it is not
>>>>>> +                * fully mapped within the range we are operating on. Otherwise
>>>>>> +                * leave it as is so that it can be swapped out whole. If we
>>>>>> +                * fail to split a folio, leave it in place and advance to the
>>>>>> +                * next pte in the range.
>>>>>>                  */
>>>>>>                 if (folio_test_large(folio)) {
>>>>>> -                       int err;
>>>>>> -
>>>>>> -                       if (folio_estimated_sharers(folio) > 1)
>>>>>> -                               break;
>>>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>> -                               break;
>>>>>> -                       if (!folio_trylock(folio))
>>>>>> -                               break;
>>>>>> -                       folio_get(folio);
>>>>>> -                       arch_leave_lazy_mmu_mode();
>>>>>> -                       pte_unmap_unlock(start_pte, ptl);
>>>>>> -                       start_pte = NULL;
>>>>>> -                       err = split_folio(folio);
>>>>>> -                       folio_unlock(folio);
>>>>>> -                       folio_put(folio);
>>>>>> -                       if (err)
>>>>>> -                               break;
>>>>>> -                       start_pte = pte =
>>>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>> -                       if (!start_pte)
>>>>>> -                               break;
>>>>>> -                       arch_enter_lazy_mmu_mode();
>>>>>> -                       pte--;
>>>>>> -                       addr -= PAGE_SIZE;
>>>>>> -                       continue;
>>>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
>>>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
>>>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
>>>>>> +
>>>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>>>> +                                            fpb_flags, NULL);
>>>>>
>>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
>>>>> are doing madvise() on a portion of a large folio.
>>>>
>>>> Good idea. Something like this?:
>>>>
>>>>         if (pte_pfn(pte) == folio_pfn(folio))
>>>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>>                                      fpb_flags, NULL);
>>>>
>>>> If we are not mapping the first page of the folio, then it can't be a full
>>>> mapping, so no need to call folio_pte_batch(). Just split it.
>>>
>>>                  if (folio_test_large(folio)) {
>>> [...]
>>>                        nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>>>                                             fpb_flags, NULL);
>>> +                       if (folio_estimated_sharers(folio) > 1)
>>> +                               continue;
>>>
>>> Could we use folio_estimated_sharers as an early exit point here?
>>
>> I'm not sure what this is saving where you have it? Did you mean to put it
>> before folio_pte_batch()? Currently it is just saving a single conditional.
> 
> Apologies for the confusion. I made a diff to provide clarity.
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 56c7ba7bd558..c3458fdea82a 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -462,12 +462,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> 
>                         nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
>                                              fpb_flags, NULL);
> -
> // Could we use folio_estimated_sharers as an early exit point here?
> +                       if (folio_estimated_sharers(folio) > 1)
> +                               continue;
>                         if (nr < folio_nr_pages(folio)) {
>                                 int err;
> 
> -                               if (folio_estimated_sharers(folio) > 1)
> -                                       continue;
>                                 if (pageout_anon_only_filter &&
> !folio_test_anon(folio))
>                                         continue;
>                                 if (!folio_trylock(folio))


I'm still not really getting it; with my code, if nr < the folio size, we will
try to split and if we estimate that the folio is not exclusive we will avoid
locking the folio, etc. If nr == folio size, we will proceed to the precise
exclusivity check (which is cheap once we know the folio is fully mapped by this
process).

With your change, we will always do the estimated exclusive check then proceed
to the precise check; seems like duplication to me?

> 
>>
>> But now that I think about it a bit more, I remember why I was originally
>> unconditionally calling folio_pte_batch(). Given its a large folio, if the split
>> fails, we can move the cursor to the pte where the next folio begins so we don't
>> have to iterate through one pte at a time which would cause us to keep calling
>> folio_estimated_sharers(), folio_test_anon(), etc on the same folio until we get
>> to the next boundary.
>>
>> Of course the common case at this point will be for the split to succeed, but
>> then we are going to iterate over ever single PTE anyway - one way or another
>> they are all fetched into cache. So I feel like its neater not to add the
>> conditionals for calling folio_pte_batch(), and just leave this as I have it here.
>>
>>>
>>>                        if (nr < folio_nr_pages(folio)) {
>>>                                int err;
>>>
>>> -                               if (folio_estimated_sharers(folio) > 1)
>>> -                                       continue;
>>> [...]
>>>
>>>>
>>>>>
>>>>>> +
>>>>>> +                       if (nr < folio_nr_pages(folio)) {
>>>>>> +                               int err;
>>>>>> +
>>>>>> +                               if (folio_estimated_sharers(folio) > 1)
>>>>>> +                                       continue;
>>>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>> +                                       continue;
>>>>>> +                               if (!folio_trylock(folio))
>>>>>> +                                       continue;
>>>>>> +                               folio_get(folio);
>>>>>> +                               arch_leave_lazy_mmu_mode();
>>>>>> +                               pte_unmap_unlock(start_pte, ptl);
>>>>>> +                               start_pte = NULL;
>>>>>> +                               err = split_folio(folio);
>>>>>> +                               folio_unlock(folio);
>>>>>> +                               folio_put(folio);
>>>>>> +                               if (err)
>>>>>> +                                       continue;
>>>>>> +                               start_pte = pte =
>>>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
>>>>>> +                               if (!start_pte)
>>>>>> +                                       break;
>>>>>> +                               arch_enter_lazy_mmu_mode();
>>>>>> +                               nr = 0;
>>>>>> +                               continue;
>>>>>> +                       }
>>>>>>                 }
>>>>>>
>>>>>>                 /*
>>>>>>                  * Do not interfere with other mappings of this folio and
>>>>>> -                * non-LRU folio.
>>>>>> +                * non-LRU folio. If we have a large folio at this point, we
>>>>>> +                * know it is fully mapped so if its mapcount is the same as its
>>>>>> +                * number of pages, it must be exclusive.
>>>>>>                  */
>>>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
>>>>>> +               if (!folio_test_lru(folio) ||
>>>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
>>>>>>                         continue;
>>>>>
>>>>> This looks so perfect and is exactly what I wanted to achieve.
>>>>>
>>>>>>
>>>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
>>>>>>                         continue;
>>>>>>
>>>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>> -
>>>>>> -               if (!pageout && pte_young(ptent)) {
>>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>> -                                                       tlb->fullmm);
>>>>>> -                       ptent = pte_mkold(ptent);
>>>>>> -                       set_pte_at(mm, addr, pte, ptent);
>>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>> +               if (!pageout) {
>>>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
>>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>
>>> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
>>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
>>> pte clearing?
>>
>> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
>> there is a good reason to do the original clear-mkold-set for some arches?
> 
> IIRC, some of the architecture(ex, PPC)  don't update TLB with
> ptep_test_and_clear_young()
> and tlb_remove_tlb_entry().

Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
address please" - albeit its deferred and batched. I'll look into this.

> 
> In my new patch[1], I use refresh_full_ptes() and
> tlb_remove_tlb_entries() to batch-update the
> access and dirty bits.

I want to avoid the per-pte clear-modify-set approach, because this doesn't
perform well on arm64 when using contpte mappings; it will cause the contpe
mapping to be unfolded by the first clear that touches the contpte block, then
refolded by the last set to touch the block. That's expensive.
ptep_test_and_clear_young() doesn't suffer that problem.
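
For readers unfamiliar with contpte, here is a rough userspace model of the cost
difference being described (my simplification of the behaviour outlined above,
not the arm64 implementation): a clear/modify/set sequence transiently clears an
entry and so breaks the block's "every entry present and uniform" property,
forcing an unfold (and a later refold), whereas clearing only the young bit can
be done in place.

#include <stdbool.h>
#include <stdio.h>

#define CONT_PTES 16

struct toy_pte { bool present, young, cont; };

/* Folding requires every entry in the block to stay present and uniform,
 * so any transient clear forces the whole block to be rewritten. */
static void toy_unfold(struct toy_pte *block)
{
	for (int i = 0; i < CONT_PTES; i++)
		block[i].cont = false;
	printf("unfold: rewrite %d entries + TLB maintenance\n", CONT_PTES);
}

/* clear/modify/set on one entry: the transient non-present entry breaks
 * the block's uniformity, so it has to be unfolded first. */
static void toy_clear_modify_set(struct toy_pte *block, int i)
{
	if (block[i].cont)
		toy_unfold(block);
	block[i].present = false;	/* ptep_get_and_clear_full() */
	block[i].young = false;		/* pte_mkold() */
	block[i].present = true;	/* set_pte_at() */
}

/* In this model the young bit is tolerated per entry, so it can be
 * cleared in place and the block stays folded. */
static bool toy_test_and_clear_young(struct toy_pte *block, int i)
{
	bool was_young = block[i].young;

	block[i].young = false;
	return was_young;
}

int main(void)
{
	struct toy_pte a[CONT_PTES], b[CONT_PTES];

	for (int i = 0; i < CONT_PTES; i++)
		a[i] = b[i] = (struct toy_pte){ .present = true, .young = true, .cont = true };

	toy_clear_modify_set(a, 0);		/* unfolds the whole block */
	toy_test_and_clear_young(b, 0);		/* block stays folded */
	printf("block a still folded? %d, block b still folded? %d\n",
	       a[0].cont, b[0].cont);
	return 0;
}
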

> 
> [1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@gmail.com
> 
> Thanks,
> Lance
> 
>>
>>>
>>> Thanks,
>>> Lance
>>>
>>>
>>>
>>>>>> +                       }
>>>>>
>>>>> This looks so smart. if it is not pageout, we have increased pte
>>>>> and addr here; so nr is 0 and we don't need to increase again in
>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>>>
>>>>> otherwise, nr won't be 0. so we will increase addr and
>>>>> pte by nr.
>>>>
>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>>>> madvise_free_pte_range().
>>>>
>>>>
>>>>>
>>>>>
>>>>>>                 }
>>>>>>
>>>>>>                 /*
>>>>>> --
>>>>>> 2.25.1
>>>>>>
>>>>>
>>>>> Overall, LGTM,
>>>>>
>>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> Thanks!
>>>>
>>>>
>>
Lance Yang March 21, 2024, 1:38 a.m. UTC | #18
On Thu, Mar 21, 2024 at 1:38 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 20/03/2024 14:35, Lance Yang wrote:
> > On Wed, Mar 20, 2024 at 9:49 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Hi Lance, Barry,
> >>
> >> Sorry - I totally missed this when you originally sent it!
> >
> > No worries at all :)
> >
> >>
> >>
> >> On 13/03/2024 14:02, Lance Yang wrote:
> >>> On Wed, Mar 13, 2024 at 5:03 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 13/03/2024 07:19, Barry Song wrote:
> >>>>> On Tue, Mar 12, 2024 at 4:01 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>>>
> >>>>>> Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large
> >>>>>> folio that is fully and contiguously mapped in the pageout/cold vm
> >>>>>> range. This change means that large folios will be maintained all the
> >>>>>> way to swap storage. This both improves performance during swap-out, by
> >>>>>> eliding the cost of splitting the folio, and sets us up nicely for
> >>>>>> maintaining the large folio when it is swapped back in (to be covered in
> >>>>>> a separate series).
> >>>>>>
> >>>>>> Folios that are not fully mapped in the target range are still split,
> >>>>>> but note that behavior is changed so that if the split fails for any
> >>>>>> reason (folio locked, shared, etc) we now leave it as is and move to the
> >>>>>> next pte in the range and continue work on the proceeding folios.
> >>>>>> Previously any failure of this sort would cause the entire operation to
> >>>>>> give up and no folios mapped at higher addresses were paged out or made
> >>>>>> cold. Given large folios are becoming more common, this old behavior
> >>>>>> would have likely lead to wasted opportunities.
> >>>>>>
> >>>>>> While we are at it, change the code that clears young from the ptes to
> >>>>>> use ptep_test_and_clear_young(), which is more efficent than
> >>>>>> get_and_clear/modify/set, especially for contpte mappings on arm64,
> >>>>>> where the old approach would require unfolding/refolding and the new
> >>>>>> approach can be done in place.
> >>>>>>
> >>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >>>>>
> >>>>> This looks so much better than our initial RFC.
> >>>>> Thank you for your excellent work!
> >>>>
> >>>> Thanks - its a team effort - I had your PoC and David's previous batching work
> >>>> to use as a template.
> >>>>
> >>>>>
> >>>>>> ---
> >>>>>>  mm/madvise.c | 89 ++++++++++++++++++++++++++++++----------------------
> >>>>>>  1 file changed, 51 insertions(+), 38 deletions(-)
> >>>>>>
> >>>>>> diff --git a/mm/madvise.c b/mm/madvise.c
> >>>>>> index 547dcd1f7a39..56c7ba7bd558 100644
> >>>>>> --- a/mm/madvise.c
> >>>>>> +++ b/mm/madvise.c
> >>>>>> @@ -336,6 +336,7 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>         LIST_HEAD(folio_list);
> >>>>>>         bool pageout_anon_only_filter;
> >>>>>>         unsigned int batch_count = 0;
> >>>>>> +       int nr;
> >>>>>>
> >>>>>>         if (fatal_signal_pending(current))
> >>>>>>                 return -EINTR;
> >>>>>> @@ -423,7 +424,8 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>                 return 0;
> >>>>>>         flush_tlb_batched_pending(mm);
> >>>>>>         arch_enter_lazy_mmu_mode();
> >>>>>> -       for (; addr < end; pte++, addr += PAGE_SIZE) {
> >>>>>> +       for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
> >>>>>> +               nr = 1;
> >>>>>>                 ptent = ptep_get(pte);
> >>>>>>
> >>>>>>                 if (++batch_count == SWAP_CLUSTER_MAX) {
> >>>>>> @@ -447,55 +449,66 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >>>>>>                         continue;
> >>>>>>
> >>>>>>                 /*
> >>>>>> -                * Creating a THP page is expensive so split it only if we
> >>>>>> -                * are sure it's worth. Split it if we are only owner.
> >>>>>> +                * If we encounter a large folio, only split it if it is not
> >>>>>> +                * fully mapped within the range we are operating on. Otherwise
> >>>>>> +                * leave it as is so that it can be swapped out whole. If we
> >>>>>> +                * fail to split a folio, leave it in place and advance to the
> >>>>>> +                * next pte in the range.
> >>>>>>                  */
> >>>>>>                 if (folio_test_large(folio)) {
> >>>>>> -                       int err;
> >>>>>> -
> >>>>>> -                       if (folio_estimated_sharers(folio) > 1)
> >>>>>> -                               break;
> >>>>>> -                       if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> -                               break;
> >>>>>> -                       if (!folio_trylock(folio))
> >>>>>> -                               break;
> >>>>>> -                       folio_get(folio);
> >>>>>> -                       arch_leave_lazy_mmu_mode();
> >>>>>> -                       pte_unmap_unlock(start_pte, ptl);
> >>>>>> -                       start_pte = NULL;
> >>>>>> -                       err = split_folio(folio);
> >>>>>> -                       folio_unlock(folio);
> >>>>>> -                       folio_put(folio);
> >>>>>> -                       if (err)
> >>>>>> -                               break;
> >>>>>> -                       start_pte = pte =
> >>>>>> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> -                       if (!start_pte)
> >>>>>> -                               break;
> >>>>>> -                       arch_enter_lazy_mmu_mode();
> >>>>>> -                       pte--;
> >>>>>> -                       addr -= PAGE_SIZE;
> >>>>>> -                       continue;
> >>>>>> +                       const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
> >>>>>> +                                               FPB_IGNORE_SOFT_DIRTY;
> >>>>>> +                       int max_nr = (end - addr) / PAGE_SIZE;
> >>>>>> +
> >>>>>> +                       nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>>>> +                                            fpb_flags, NULL);
> >>>>>
> >>>>> I wonder if we have a quick way to avoid folio_pte_batch() if users
> >>>>> are doing madvise() on a portion of a large folio.
> >>>>
> >>>> Good idea. Something like this?:
> >>>>
> >>>>         if (pte_pfn(pte) == folio_pfn(folio))
> >>>>                 nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>>                                      fpb_flags, NULL);
> >>>>
> >>>> If we are not mapping the first page of the folio, then it can't be a full
> >>>> mapping, so no need to call folio_pte_batch(). Just split it.
> >>>
> >>>                  if (folio_test_large(folio)) {
> >>> [...]
> >>>                        nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >>>                                             fpb_flags, NULL);
> >>> +                       if (folio_estimated_sharers(folio) > 1)
> >>> +                               continue;
> >>>
> >>> Could we use folio_estimated_sharers as an early exit point here?
> >>
> >> I'm not sure what this is saving where you have it? Did you mean to put it
> >> before folio_pte_batch()? Currently it is just saving a single conditional.
> >
> > Apologies for the confusion. I made a diff to provide clarity.
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 56c7ba7bd558..c3458fdea82a 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -462,12 +462,11 @@ static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
> >
> >                         nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
> >                                              fpb_flags, NULL);
> > -
> > // Could we use folio_estimated_sharers as an early exit point here?
> > +                       if (folio_estimated_sharers(folio) > 1)
> > +                               continue;
> >                         if (nr < folio_nr_pages(folio)) {
> >                                 int err;
> >
> > -                               if (folio_estimated_sharers(folio) > 1)
> > -                                       continue;
> >                                 if (pageout_anon_only_filter &&
> > !folio_test_anon(folio))
> >                                         continue;
> >                                 if (!folio_trylock(folio))
>
>
> I'm still not really getting it; with my code, if nr < the folio size, we will
> try to split and if we estimate that the folio is not exclusive we will avoid
> locking the folio, etc. If nr == folio size, we will proceed to the precise
> exclusivity check (which is cheap once we know the folio is fully mapped by this
> process).
>
> With your change, we will always do the estimated exclusive check then proceed
> to the precise check; seems like duplication to me?

Agreed. The estimated exclusive check is indeed redundant with my change.

>
> >
> >>
> >> But now that I think about it a bit more, I remember why I was originally
> >> unconditionally calling folio_pte_batch(). Given its a large folio, if the split
> >> fails, we can move the cursor to the pte where the next folio begins so we don't
> >> have to iterate through one pte at a time which would cause us to keep calling
> >> folio_estimated_sharers(), folio_test_anon(), etc on the same folio until we get
> >> to the next boundary.
> >>
> >> Of course the common case at this point will be for the split to succeed, but
> >> then we are going to iterate over every single PTE anyway - one way or another
> >> they are all fetched into cache. So I feel like it's neater not to add the
> >> conditionals for calling folio_pte_batch(), and just leave this as I have it here.
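
To spell the skip-ahead out, the loop ends up shaped roughly like below -- a
sketch condensed from the patch at the bottom of this page rather than the
verbatim hunk, with the split path reduced to a comment:

	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
		nr = 1;
		ptent = ptep_get(pte);
		/* ...existing filters... */
		if (folio_test_large(folio)) {
			const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
						FPB_IGNORE_SOFT_DIRTY;
			int max_nr = (end - addr) / PAGE_SIZE;

			nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
					     fpb_flags, NULL);
			if (nr < folio_nr_pages(folio)) {
				/*
				 * Partially mapped: try to split. The early
				 * exits (shared, !anon, trylock failure) and
				 * a failed split all "continue", so the outer
				 * loop advances by nr, i.e. past the ptes of
				 * this folio that are mapped here. Only a
				 * successful split sets nr = 0, so the newly
				 * split folios are rescanned pte by pte.
				 */
			}
		}
		/* ...process the nr-pte batch... */
	}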
> >>
> >>>
> >>>                        if (nr < folio_nr_pages(folio)) {
> >>>                                int err;
> >>>
> >>> -                               if (folio_estimated_sharers(folio) > 1)
> >>> -                                       continue;
> >>> [...]
> >>>
> >>>>
> >>>>>
> >>>>>> +
> >>>>>> +                       if (nr < folio_nr_pages(folio)) {
> >>>>>> +                               int err;
> >>>>>> +
> >>>>>> +                               if (folio_estimated_sharers(folio) > 1)
> >>>>>> +                                       continue;
> >>>>>> +                               if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>> +                                       continue;
> >>>>>> +                               if (!folio_trylock(folio))
> >>>>>> +                                       continue;
> >>>>>> +                               folio_get(folio);
> >>>>>> +                               arch_leave_lazy_mmu_mode();
> >>>>>> +                               pte_unmap_unlock(start_pte, ptl);
> >>>>>> +                               start_pte = NULL;
> >>>>>> +                               err = split_folio(folio);
> >>>>>> +                               folio_unlock(folio);
> >>>>>> +                               folio_put(folio);
> >>>>>> +                               if (err)
> >>>>>> +                                       continue;
> >>>>>> +                               start_pte = pte =
> >>>>>> +                                       pte_offset_map_lock(mm, pmd, addr, &ptl);
> >>>>>> +                               if (!start_pte)
> >>>>>> +                                       break;
> >>>>>> +                               arch_enter_lazy_mmu_mode();
> >>>>>> +                               nr = 0;
> >>>>>> +                               continue;
> >>>>>> +                       }
> >>>>>>                 }
> >>>>>>
> >>>>>>                 /*
> >>>>>>                  * Do not interfere with other mappings of this folio and
> >>>>>> -                * non-LRU folio.
> >>>>>> +                * non-LRU folio. If we have a large folio at this point, we
> >>>>>> +                * know it is fully mapped so if its mapcount is the same as its
> >>>>>> +                * number of pages, it must be exclusive.
> >>>>>>                  */
> >>>>>> -               if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
> >>>>>> +               if (!folio_test_lru(folio) ||
> >>>>>> +                   folio_mapcount(folio) != folio_nr_pages(folio))
> >>>>>>                         continue;
> >>>>>
> >>>>> This looks so perfect and is exactly what I wanted to achieve.
> >>>>>
> >>>>>>
> >>>>>>                 if (pageout_anon_only_filter && !folio_test_anon(folio))
> >>>>>>                         continue;
> >>>>>>
> >>>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>>> -
> >>>>>> -               if (!pageout && pte_young(ptent)) {
> >>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>>>> -                                                       tlb->fullmm);
> >>>>>> -                       ptent = pte_mkold(ptent);
> >>>>>> -                       set_pte_at(mm, addr, pte, ptent);
> >>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>> +               if (!pageout) {
> >>>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>
> >>> IIRC, some architectures (e.g., PPC) don't update the TLB with set_pte_at() and
> >>> tlb_remove_tlb_entry(). So, didn't we consider remapping the PTE with old after
> >>> clearing it?
> >>
> >> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
> >> there is a good reason to do the original clear-mkold-set for some arches?
> >
> > IIRC, some architectures (e.g., PPC) don't update the TLB with
> > ptep_test_and_clear_young()
> > and tlb_remove_tlb_entry().
>
> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
> address please" - albeit it's deferred and batched. I'll look into this.
>
> >
> > In my new patch[1], I use refresh_full_ptes() and
> > tlb_remove_tlb_entries() to batch-update the
> > access and dirty bits.
>
> I want to avoid the per-pte clear-modify-set approach, because this doesn't
> perform well on arm64 when using contpte mappings; it will cause the contpte
> mapping to be unfolded by the first clear that touches the contpte block, then
> refolded by the last set to touch the block. That's expensive.
> ptep_test_and_clear_young() doesn't suffer that problem.

Thanks for explaining. I got it.

I think that other architectures will benefit from the per-pte clear-modify-set
approach. IMO, refresh_full_ptes() can be overridden by arm64.

Thanks,
Lance
>
> >
> > [1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@gmail.com
> >
> > Thanks,
> > Lance
> >
> >>
> >>>
> >>> Thanks,
> >>> Lance
> >>>
> >>>
> >>>
> >>>>>> +                       }
> >>>>>
> >>>>> This looks so smart. If it is not pageout, we have increased pte
> >>>>> and addr here; so nr is 0 and we don't need to increase again in
> >>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>>>
> >>>>> Otherwise, nr won't be 0, so we will increase addr and
> >>>>> pte by nr.
> >>>>
> >>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >>>> madvise_free_pte_range().
> >>>>
> >>>>
> >>>>>
> >>>>>
> >>>>>>                 }
> >>>>>>
> >>>>>>                 /*
> >>>>>> --
> >>>>>> 2.25.1
> >>>>>>
> >>>>>
> >>>>> Overall, LGTM,
> >>>>>
> >>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> >>>>
> >>>> Thanks!
> >>>>
> >>>>
> >>
>
Ryan Roberts March 21, 2024, 1:38 p.m. UTC | #19
>>>>>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>>> -
>>>>>>>> -               if (!pageout && pte_young(ptent)) {
>>>>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>>>> -                                                       tlb->fullmm);
>>>>>>>> -                       ptent = pte_mkold(ptent);
>>>>>>>> -                       set_pte_at(mm, addr, pte, ptent);
>>>>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>> +               if (!pageout) {
>>>>>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>>>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
>>>>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>
>>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
>>>>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
>>>>> pte clearing?
>>>>
>>>> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
>>>> there is a good reason to do the original clear-mkold-set for some arches?
>>>
>>> IIRC, some of the architecture(ex, PPC)  don't update TLB with
>>> ptep_test_and_clear_young()
>>> and tlb_remove_tlb_entry().

Afraid I'm still struggling with this comment. Do you mean to say that powerpc
invalidates the TLB entry as part of the call to ptep_test_and_clear_young()? So
tlb_remove_tlb_entry() would be redundant here, and likely cause performance
degradation on that architecture?

IMHO, ptep_test_and_clear_young() really shouldn't be invalidating the TLB
entry; that's what ptep_clear_flush_young() is for.

But I do see that for some cases of the 32-bit ppc, there appears to be a flush:

#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
					      unsigned long addr, pte_t *ptep)
{
	unsigned long old;
	old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0);
	if (old & _PAGE_HASHPTE)
		flush_hash_entry(mm, ptep, addr);   <<<<<<<<

	return (old & _PAGE_ACCESSED) != 0;
}
#define ptep_test_and_clear_young(__vma, __addr, __ptep) \
	__ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep)

Is that what you are describing? Does anyone know why flush_hash_entry() is
called? I'd say that's a bug in ppc and not a reason not to use
ptep_test_and_clear_young() in the common code!
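
For comparison -- reproduced from memory, so treat it as approximate rather
than authoritative -- the generic fallbacks make the intended split of
responsibilities clear: ptep_test_and_clear_young() just clears the access
bit, and ptep_clear_flush_young() is the variant that also does the TLB
maintenance:

/* include/linux/pgtable.h */
#ifndef __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
					    unsigned long address,
					    pte_t *ptep)
{
	pte_t pte = ptep_get(ptep);
	int r = 1;
	if (!pte_young(pte))
		r = 0;
	else
		set_pte_at(vma->vm_mm, address, ptep, pte_mkold(pte));
	return r;
}
#endif

/* mm/pgtable-generic.c */
int ptep_clear_flush_young(struct vm_area_struct *vma,
			   unsigned long address, pte_t *ptep)
{
	int young;

	young = ptep_test_and_clear_young(vma, address, ptep);
	if (young)
		flush_tlb_page(vma, address);

	return young;
}

On that reading, the deferred tlb_remove_tlb_entry() after a successful
ptep_test_and_clear_young() is doing the flush work, not duplicating it.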

Thanks,
Ryan


>>
>> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
>> address please" - albeit its deferred and batched. I'll look into this.
>>
>>>
>>> In my new patch[1], I use refresh_full_ptes() and
>>> tlb_remove_tlb_entries() to batch-update the
>>> access and dirty bits.
>>
>> I want to avoid the per-pte clear-modify-set approach, because this doesn't
>> perform well on arm64 when using contpte mappings; it will cause the contpe
>> mapping to be unfolded by the first clear that touches the contpte block, then
>> refolded by the last set to touch the block. That's expensive.
>> ptep_test_and_clear_young() doesn't suffer that problem.
> 
> Thanks for explaining. I got it.
> 
> I think that other architectures will benefit from the per-pte clear-modify-set
> approach. IMO, refresh_full_ptes() can be overridden by arm64.
> 
> Thanks,
> Lance
>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@gmail.com
>>>
>>> Thanks,
>>> Lance
>>>
>>>>
>>>>>
>>>>> Thanks,
>>>>> Lance
>>>>>
>>>>>
>>>>>
>>>>>>>> +                       }
>>>>>>>
>>>>>>> This looks so smart. if it is not pageout, we have increased pte
>>>>>>> and addr here; so nr is 0 and we don't need to increase again in
>>>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>>>>>
>>>>>>> otherwise, nr won't be 0. so we will increase addr and
>>>>>>> pte by nr.
>>>>>>
>>>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>>>>>> madvise_free_pte_range().
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>                 }
>>>>>>>>
>>>>>>>>                 /*
>>>>>>>> --
>>>>>>>> 2.25.1
>>>>>>>>
>>>>>>>
>>>>>>> Overall, LGTM,
>>>>>>>
>>>>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>
>>
Lance Yang March 21, 2024, 2:55 p.m. UTC | #20
On Thu, Mar 21, 2024 at 9:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> >>>>>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>>>>> -
> >>>>>>>> -               if (!pageout && pte_young(ptent)) {
> >>>>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>>>>>> -                                                       tlb->fullmm);
> >>>>>>>> -                       ptent = pte_mkold(ptent);
> >>>>>>>> -                       set_pte_at(mm, addr, pte, ptent);
> >>>>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>>>> +               if (!pageout) {
> >>>>>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>>>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >>>>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>
> >>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
> >>>>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
> >>>>> pte clearing?
> >>>>
> >>>> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
> >>>> there is a good reason to do the original clear-mkold-set for some arches?
> >>>
> >>> IIRC, some of the architecture(ex, PPC)  don't update TLB with
> >>> ptep_test_and_clear_young()
> >>> and tlb_remove_tlb_entry().
>
> Afraid I'm still struggling with this comment. Do you mean to say that powerpc
> invalidates the TLB entry as part of the call to ptep_test_and_clear_young()? So
> tlb_remove_tlb_entry() would be redundant here, and likely cause performance
> degradation on that architecture?

I just thought that using ptep_test_and_clear_young() instead of
ptep_get_and_clear_full() + pte_mkold() might not be correct.
However, it's most likely that I was mistaken :(

I also have a question: why wasn't madvise_cold_or_pageout_pte_range() using
ptep_test_and_clear_young() previously, instead of
ptep_get_and_clear_full() + pte_mkold()?

/*
* Some of architecture(ex, PPC) don't update TLB
* with set_pte_at and tlb_remove_tlb_entry so for
* the portability, remap the pte with old|clean
* after pte clearing.
*/

According to this comment from madvise_free_pte_range(), IIUC, we need to
call ptep_get_and_clear_full() to clear the PTE, and then remap the
PTE with old|clean.

Thanks,
Lance

>
> IMHO, ptep_test_and_clear_young() really shouldn't be invalidating the TLB
> entry, that's what ptep_clear_flush_young() is for.
>
> But I do see that for some cases of the 32-bit ppc, there appears to be a flush:
>
> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
>                                               unsigned long addr, pte_t *ptep)
> {
>         unsigned long old;
>         old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0);
>         if (old & _PAGE_HASHPTE)
>                 flush_hash_entry(mm, ptep, addr);   <<<<<<<<
>
>         return (old & _PAGE_ACCESSED) != 0;
> }
> #define ptep_test_and_clear_young(__vma, __addr, __ptep) \
>         __ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep)
>
> Is that what you are describing? Does any anyone know why flush_hash_entry() is
> called? I'd say that's a bug in ppc and not a reason not to use
> ptep_test_and_clear_young() in the common code!
>
> Thanks,
> Ryan
>
>
> >>
> >> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
> >> address please" - albeit its deferred and batched. I'll look into this.
> >>
> >>>
> >>> In my new patch[1], I use refresh_full_ptes() and
> >>> tlb_remove_tlb_entries() to batch-update the
> >>> access and dirty bits.
> >>
> >> I want to avoid the per-pte clear-modify-set approach, because this doesn't
> >> perform well on arm64 when using contpte mappings; it will cause the contpe
> >> mapping to be unfolded by the first clear that touches the contpte block, then
> >> refolded by the last set to touch the block. That's expensive.
> >> ptep_test_and_clear_young() doesn't suffer that problem.
> >
> > Thanks for explaining. I got it.
> >
> > I think that other architectures will benefit from the per-pte clear-modify-set
> > approach. IMO, refresh_full_ptes() can be overridden by arm64.
> >
> > Thanks,
> > Lance
> >>
> >>>
> >>> [1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@gmail.com
> >>>
> >>> Thanks,
> >>> Lance
> >>>
> >>>>
> >>>>>
> >>>>> Thanks,
> >>>>> Lance
> >>>>>
> >>>>>
> >>>>>
> >>>>>>>> +                       }
> >>>>>>>
> >>>>>>> This looks so smart. if it is not pageout, we have increased pte
> >>>>>>> and addr here; so nr is 0 and we don't need to increase again in
> >>>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>>>>>
> >>>>>>> otherwise, nr won't be 0. so we will increase addr and
> >>>>>>> pte by nr.
> >>>>>>
> >>>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >>>>>> madvise_free_pte_range().
> >>>>>>
> >>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>                 }
> >>>>>>>>
> >>>>>>>>                 /*
> >>>>>>>> --
> >>>>>>>> 2.25.1
> >>>>>>>>
> >>>>>>>
> >>>>>>> Overall, LGTM,
> >>>>>>>
> >>>>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>>
> >>>>>> Thanks!
> >>>>>>
> >>>>>>
> >>>>
> >>
>
Ryan Roberts March 21, 2024, 3:24 p.m. UTC | #21
On 21/03/2024 14:55, Lance Yang wrote:
> On Thu, Mar 21, 2024 at 9:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>>>>>>>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
>>>>>>>>>> -
>>>>>>>>>> -               if (!pageout && pte_young(ptent)) {
>>>>>>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
>>>>>>>>>> -                                                       tlb->fullmm);
>>>>>>>>>> -                       ptent = pte_mkold(ptent);
>>>>>>>>>> -                       set_pte_at(mm, addr, pte, ptent);
>>>>>>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>>>> +               if (!pageout) {
>>>>>>>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
>>>>>>>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
>>>>>>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
>>>>>>>
>>>>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
>>>>>>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
>>>>>>> pte clearing?
>>>>>>
>>>>>> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
>>>>>> there is a good reason to do the original clear-mkold-set for some arches?
>>>>>
>>>>> IIRC, some of the architecture(ex, PPC)  don't update TLB with
>>>>> ptep_test_and_clear_young()
>>>>> and tlb_remove_tlb_entry().
>>
>> Afraid I'm still struggling with this comment. Do you mean to say that powerpc
>> invalidates the TLB entry as part of the call to ptep_test_and_clear_young()? So
>> tlb_remove_tlb_entry() would be redundant here, and likely cause performance
>> degradation on that architecture?
> 
> I just thought that using ptep_test_and_clear_young() instead of
> ptep_get_and_clear_full() + pte_mkold() might not be correct.
> However, it's most likely that I was mistaken :(

OK, I'm pretty confident that my usage is correct.

> 
> I also have a question. Why aren't we using ptep_test_and_clear_young() in
> madvise_cold_or_pageout_pte_range(), but instead
> ptep_get_and_clear_full() + pte_mkold() as we did previously.
> 
> /*
> * Some of architecture(ex, PPC) don't update TLB
> * with set_pte_at and tlb_remove_tlb_entry so for
> * the portability, remap the pte with old|clean
> * after pte clearing.
> */

Ahh, I see; this is a comment from madvise_free_pte_range(); I don't quite
understand it. I suspect it might be out of date, or saying that doing
set_pte_at(pte_mkold(ptep_get(ptent))) is not correct because it is not atomic
and the HW could set the dirty bit between the get and the set. Doing the atomic
ptep_get_and_clear_full() means you go via a pte_none() state, so if the TLB is
racing it will see the entry isn't valid and fault.

Note that madvise_free_pte_range() is trying to clear both the access and dirty
bits, whereas madvise_cold_or_pageout_pte_range() is only trying to clear the
access bit. There is a special helper to clear the access bit atomically -
ptep_test_and_clear_young() - but there is no helper to clear the access *and*
dirty bit, I don't believe. There is ptep_set_access_flags(), but that sets
flags to a "more permissive setting" (i.e. allows setting the flags, not
clearing them). Perhaps this constraint can be relaxed given we will follow up
with an explicit TLBI - it would require auditing all the implementations.
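
For reference, this is roughly what madvise_free_pte_range() does today (from
memory, so double-check the details) -- it is the pattern any combined
young+dirty helper would have to preserve, i.e. going via a cleared pte rather
than a non-atomic read-modify-write:

	/* atomic clear: the pte transiently becomes none */
	ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
	/* drop access and dirty on the saved value... */
	ptent = pte_mkold(ptent);
	ptent = pte_mkclean(ptent);
	/* ...then rewrite it; the TLB flush is deferred via the mmu_gather */
	set_pte_at(mm, addr, pte, ptent);
	tlb_remove_tlb_entry(tlb, pte, addr);
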

> 
> According to this comment from madvise_free_pte_range. IIUC, we need to
> call ptep_get_and_clear_full() to clear the PTE, and then remap the
> PTE with old|clean.
> 
> Thanks,
> Lance
> 
>>
>> IMHO, ptep_test_and_clear_young() really shouldn't be invalidating the TLB
>> entry, that's what ptep_clear_flush_young() is for.
>>
>> But I do see that for some cases of the 32-bit ppc, there appears to be a flush:
>>
>> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
>>                                               unsigned long addr, pte_t *ptep)
>> {
>>         unsigned long old;
>>         old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0);
>>         if (old & _PAGE_HASHPTE)
>>                 flush_hash_entry(mm, ptep, addr);   <<<<<<<<
>>
>>         return (old & _PAGE_ACCESSED) != 0;
>> }
>> #define ptep_test_and_clear_young(__vma, __addr, __ptep) \
>>         __ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep)
>>
>> Is that what you are describing? Does any anyone know why flush_hash_entry() is
>> called? I'd say that's a bug in ppc and not a reason not to use
>> ptep_test_and_clear_young() in the common code!
>>
>> Thanks,
>> Ryan
>>
>>
>>>>
>>>> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
>>>> address please" - albeit its deferred and batched. I'll look into this.
>>>>
>>>>>
>>>>> In my new patch[1], I use refresh_full_ptes() and
>>>>> tlb_remove_tlb_entries() to batch-update the
>>>>> access and dirty bits.
>>>>
>>>> I want to avoid the per-pte clear-modify-set approach, because this doesn't
>>>> perform well on arm64 when using contpte mappings; it will cause the contpe
>>>> mapping to be unfolded by the first clear that touches the contpte block, then
>>>> refolded by the last set to touch the block. That's expensive.
>>>> ptep_test_and_clear_young() doesn't suffer that problem.
>>>
>>> Thanks for explaining. I got it.
>>>
>>> I think that other architectures will benefit from the per-pte clear-modify-set
>>> approach. IMO, refresh_full_ptes() can be overridden by arm64.
>>>
>>> Thanks,
>>> Lance
>>>>
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@gmail.com
>>>>>
>>>>> Thanks,
>>>>> Lance
>>>>>
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Lance
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>>> +                       }
>>>>>>>>>
>>>>>>>>> This looks so smart. if it is not pageout, we have increased pte
>>>>>>>>> and addr here; so nr is 0 and we don't need to increase again in
>>>>>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
>>>>>>>>>
>>>>>>>>> otherwise, nr won't be 0. so we will increase addr and
>>>>>>>>> pte by nr.
>>>>>>>>
>>>>>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
>>>>>>>> madvise_free_pte_range().
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>                 }
>>>>>>>>>>
>>>>>>>>>>                 /*
>>>>>>>>>> --
>>>>>>>>>> 2.25.1
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Overall, LGTM,
>>>>>>>>>
>>>>>>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>
Lance Yang March 22, 2024, 12:56 a.m. UTC | #22
On Thu, Mar 21, 2024 at 11:24 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 21/03/2024 14:55, Lance Yang wrote:
> > On Thu, Mar 21, 2024 at 9:38 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >>>>>>>>>> -               VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
> >>>>>>>>>> -
> >>>>>>>>>> -               if (!pageout && pte_young(ptent)) {
> >>>>>>>>>> -                       ptent = ptep_get_and_clear_full(mm, addr, pte,
> >>>>>>>>>> -                                                       tlb->fullmm);
> >>>>>>>>>> -                       ptent = pte_mkold(ptent);
> >>>>>>>>>> -                       set_pte_at(mm, addr, pte, ptent);
> >>>>>>>>>> -                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>>>>>> +               if (!pageout) {
> >>>>>>>>>> +                       for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
> >>>>>>>>>> +                               if (ptep_test_and_clear_young(vma, addr, pte))
> >>>>>>>>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> >>>>>>>
> >>>>>>> IIRC, some of the architecture(ex, PPC) don't update TLB with set_pte_at and
> >>>>>>> tlb_remove_tlb_entry. So, didn't we consider remapping the PTE with old after
> >>>>>>> pte clearing?
> >>>>>>
> >>>>>> Sorry Lance, I don't understand this question, can you rephrase? Are you saying
> >>>>>> there is a good reason to do the original clear-mkold-set for some arches?
> >>>>>
> >>>>> IIRC, some of the architecture(ex, PPC)  don't update TLB with
> >>>>> ptep_test_and_clear_young()
> >>>>> and tlb_remove_tlb_entry().
> >>
> >> Afraid I'm still struggling with this comment. Do you mean to say that powerpc
> >> invalidates the TLB entry as part of the call to ptep_test_and_clear_young()? So
> >> tlb_remove_tlb_entry() would be redundant here, and likely cause performance
> >> degradation on that architecture?
> >
> > I just thought that using ptep_test_and_clear_young() instead of
> > ptep_get_and_clear_full() + pte_mkold() might not be correct.
> > However, it's most likely that I was mistaken :(
>
> OK, I'm pretty confident that my usage is correct.
>
> >
> > I also have a question. Why aren't we using ptep_test_and_clear_young() in
> > madvise_cold_or_pageout_pte_range(), but instead
> > ptep_get_and_clear_full() + pte_mkold() as we did previously.
> >
> > /*
> > * Some of architecture(ex, PPC) don't update TLB
> > * with set_pte_at and tlb_remove_tlb_entry so for
> > * the portability, remap the pte with old|clean
> > * after pte clearing.
> > */
>
> Ahh, I see; this is a comment from madvise_free_pte_range() I don't quite
> understand that comment. I suspect it might be out of date, or saying that doing
> set_pte_at(pte_mkold(ptep_get(ptent))) is not correct because it is not atomic
> and the HW could set the dirty bit between the get and the set. Doing the atomic
> ptep_get_and_clear_full() means you go via a pte_none() state, so if the TLB is
> racing it will see the entry isn't valid and fault.

Thanks for your analysis and explanations!

>
> Note that madvise_free_pte_range() is trying to clear both the access and dirty
> bits, whereas madvise_cold_or_pageout_pte_range() is only trying to clear the
> access bit. There is a special helper to clear the access bit atomically -
> ptep_test_and_clear_young() - but there is no helper to clear the access *and*
> dirty bit, I don't believe. There is ptep_set_access_flags(), but that sets
> flags to a "more permissive setting" (i.e. allows setting the flags, not
> clearing them). Perhaps this constraint can be relaxed given we will follow up
> with an explicit TLBI - it would require auditing all the implementations.

Thanks for bringing this! I'll take a closer look at it.

Thanks again for your time!
Lance

>
> >
> > According to this comment from madvise_free_pte_range. IIUC, we need to
> > call ptep_get_and_clear_full() to clear the PTE, and then remap the
> > PTE with old|clean.
> >
> > Thanks,
> > Lance
> >
> >>
> >> IMHO, ptep_test_and_clear_young() really shouldn't be invalidating the TLB
> >> entry, that's what ptep_clear_flush_young() is for.
> >>
> >> But I do see that for some cases of the 32-bit ppc, there appears to be a flush:
> >>
> >> #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> >> static inline int __ptep_test_and_clear_young(struct mm_struct *mm,
> >>                                               unsigned long addr, pte_t *ptep)
> >> {
> >>         unsigned long old;
> >>         old = pte_update(mm, addr, ptep, _PAGE_ACCESSED, 0, 0);
> >>         if (old & _PAGE_HASHPTE)
> >>                 flush_hash_entry(mm, ptep, addr);   <<<<<<<<
> >>
> >>         return (old & _PAGE_ACCESSED) != 0;
> >> }
> >> #define ptep_test_and_clear_young(__vma, __addr, __ptep) \
> >>         __ptep_test_and_clear_young((__vma)->vm_mm, __addr, __ptep)
> >>
> >> Is that what you are describing? Does any anyone know why flush_hash_entry() is
> >> called? I'd say that's a bug in ppc and not a reason not to use
> >> ptep_test_and_clear_young() in the common code!
> >>
> >> Thanks,
> >> Ryan
> >>
> >>
> >>>>
> >>>> Err, I assumed tlb_remove_tlb_entry() meant "invalidate the TLB entry for this
> >>>> address please" - albeit its deferred and batched. I'll look into this.
> >>>>
> >>>>>
> >>>>> In my new patch[1], I use refresh_full_ptes() and
> >>>>> tlb_remove_tlb_entries() to batch-update the
> >>>>> access and dirty bits.
> >>>>
> >>>> I want to avoid the per-pte clear-modify-set approach, because this doesn't
> >>>> perform well on arm64 when using contpte mappings; it will cause the contpe
> >>>> mapping to be unfolded by the first clear that touches the contpte block, then
> >>>> refolded by the last set to touch the block. That's expensive.
> >>>> ptep_test_and_clear_young() doesn't suffer that problem.
> >>>
> >>> Thanks for explaining. I got it.
> >>>
> >>> I think that other architectures will benefit from the per-pte clear-modify-set
> >>> approach. IMO, refresh_full_ptes() can be overridden by arm64.
> >>>
> >>> Thanks,
> >>> Lance
> >>>>
> >>>>>
> >>>>> [1] https://lore.kernel.org/linux-mm/20240316102952.39233-1-ioworker0@gmail.com
> >>>>>
> >>>>> Thanks,
> >>>>> Lance
> >>>>>
> >>>>>>
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Lance
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>>>> +                       }
> >>>>>>>>>
> >>>>>>>>> This looks so smart. if it is not pageout, we have increased pte
> >>>>>>>>> and addr here; so nr is 0 and we don't need to increase again in
> >>>>>>>>> for (; addr < end; pte += nr, addr += nr * PAGE_SIZE)
> >>>>>>>>>
> >>>>>>>>> otherwise, nr won't be 0. so we will increase addr and
> >>>>>>>>> pte by nr.
> >>>>>>>>
> >>>>>>>> Indeed. I'm hoping that Lance is able to follow a similar pattern for
> >>>>>>>> madvise_free_pte_range().
> >>>>>>>>
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>>                 }
> >>>>>>>>>>
> >>>>>>>>>>                 /*
> >>>>>>>>>> --
> >>>>>>>>>> 2.25.1
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Overall, LGTM,
> >>>>>>>>>
> >>>>>>>>> Reviewed-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>>>>
> >>>>>>>> Thanks!
> >>>>>>>>
> >>>>>>>>
> >>>>>>
> >>>>
> >>
>
diff mbox series

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index 547dcd1f7a39..56c7ba7bd558 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -336,6 +336,7 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 	LIST_HEAD(folio_list);
 	bool pageout_anon_only_filter;
 	unsigned int batch_count = 0;
+	int nr;
 
 	if (fatal_signal_pending(current))
 		return -EINTR;
@@ -423,7 +424,8 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 		return 0;
 	flush_tlb_batched_pending(mm);
 	arch_enter_lazy_mmu_mode();
-	for (; addr < end; pte++, addr += PAGE_SIZE) {
+	for (; addr < end; pte += nr, addr += nr * PAGE_SIZE) {
+		nr = 1;
 		ptent = ptep_get(pte);
 
 		if (++batch_count == SWAP_CLUSTER_MAX) {
@@ -447,55 +449,66 @@  static int madvise_cold_or_pageout_pte_range(pmd_t *pmd,
 			continue;
 
 		/*
-		 * Creating a THP page is expensive so split it only if we
-		 * are sure it's worth. Split it if we are only owner.
+		 * If we encounter a large folio, only split it if it is not
+		 * fully mapped within the range we are operating on. Otherwise
+		 * leave it as is so that it can be swapped out whole. If we
+		 * fail to split a folio, leave it in place and advance to the
+		 * next pte in the range.
 		 */
 		if (folio_test_large(folio)) {
-			int err;
-
-			if (folio_estimated_sharers(folio) > 1)
-				break;
-			if (pageout_anon_only_filter && !folio_test_anon(folio))
-				break;
-			if (!folio_trylock(folio))
-				break;
-			folio_get(folio);
-			arch_leave_lazy_mmu_mode();
-			pte_unmap_unlock(start_pte, ptl);
-			start_pte = NULL;
-			err = split_folio(folio);
-			folio_unlock(folio);
-			folio_put(folio);
-			if (err)
-				break;
-			start_pte = pte =
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
-			if (!start_pte)
-				break;
-			arch_enter_lazy_mmu_mode();
-			pte--;
-			addr -= PAGE_SIZE;
-			continue;
+			const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
+						FPB_IGNORE_SOFT_DIRTY;
+			int max_nr = (end - addr) / PAGE_SIZE;
+
+			nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
+					     fpb_flags, NULL);
+
+			if (nr < folio_nr_pages(folio)) {
+				int err;
+
+				if (folio_estimated_sharers(folio) > 1)
+					continue;
+				if (pageout_anon_only_filter && !folio_test_anon(folio))
+					continue;
+				if (!folio_trylock(folio))
+					continue;
+				folio_get(folio);
+				arch_leave_lazy_mmu_mode();
+				pte_unmap_unlock(start_pte, ptl);
+				start_pte = NULL;
+				err = split_folio(folio);
+				folio_unlock(folio);
+				folio_put(folio);
+				if (err)
+					continue;
+				start_pte = pte =
+					pte_offset_map_lock(mm, pmd, addr, &ptl);
+				if (!start_pte)
+					break;
+				arch_enter_lazy_mmu_mode();
+				nr = 0;
+				continue;
+			}
 		}
 
 		/*
 		 * Do not interfere with other mappings of this folio and
-		 * non-LRU folio.
+		 * non-LRU folio. If we have a large folio at this point, we
+		 * know it is fully mapped so if its mapcount is the same as its
+		 * number of pages, it must be exclusive.
 		 */
-		if (!folio_test_lru(folio) || folio_mapcount(folio) != 1)
+		if (!folio_test_lru(folio) ||
+		    folio_mapcount(folio) != folio_nr_pages(folio))
 			continue;
 
 		if (pageout_anon_only_filter && !folio_test_anon(folio))
 			continue;
 
-		VM_BUG_ON_FOLIO(folio_test_large(folio), folio);
-
-		if (!pageout && pte_young(ptent)) {
-			ptent = ptep_get_and_clear_full(mm, addr, pte,
-							tlb->fullmm);
-			ptent = pte_mkold(ptent);
-			set_pte_at(mm, addr, pte, ptent);
-			tlb_remove_tlb_entry(tlb, pte, addr);
+		if (!pageout) {
+			for (; nr != 0; nr--, pte++, addr += PAGE_SIZE) {
+				if (ptep_test_and_clear_young(vma, addr, pte))
+					tlb_remove_tlb_entry(tlb, pte, addr);
+			}
 		}
 
 		/*