diff mbox series

[1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free

Message ID 20240225123215.86503-1-ioworker0@gmail.com (mailing list archive)
State New
Headers show
Series [1/1] mm/madvise: enhance lazyfreeing with mTHP in madvise_free | expand

Commit Message

Lance Yang Feb. 25, 2024, 12:32 p.m. UTC
This patch improves madvise_free_pte_range() to correctly
handle large folios that are smaller than PMD-size
(for example, 16KiB to 1024KiB[1]). It is likely part of
the preparation to support anonymous multi-size THP.

Additionally, when consecutive PTEs map consecutive pages
of the same large folio (mTHP), if the folio is locked
before madvise(MADV_FREE) or cannot be split, all
subsequent PTEs within the same PMD are currently skipped,
even though they should have been MADV_FREEed.

Moreover, this patch also optimizes lazyfreeing with
PTE-mapped mTHP (Inspired by David Hildenbrand[2]). We
aim to avoid unnecessary folio splitting if the large
folio is entirely within the given range.

On an Intel i5 CPU, lazyfreeing a 1GiB VMA backed by
PTE-mapped folios of the same size results in the following
runtimes for madvise(MADV_FREE) in seconds (shorter is better):

Folio Size  |    Old     |    New     |  Change
----------------------------------------------
      4KiB  |  0.590251  |  0.590264  |     0%
     16KiB  |  2.990447  |  0.182167  |   -94%
     32KiB  |  2.547831  |  0.101622  |   -96%
     64KiB  |  2.457796  |  0.049726  |   -98%
    128KiB  |  2.281034  |  0.030109  |   -99%
    256KiB  |  2.230387  |  0.015838  |   -99%
    512KiB  |  2.189106  |  0.009149  |   -99%
   1024KiB  |  2.183949  |  0.006620  |   -99%
   2048KiB  |  0.002799  |  0.002795  |     0%

[1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
[2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/

Signed-off-by: Lance Yang <ioworker0@gmail.com>
---
 mm/madvise.c | 69 +++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 58 insertions(+), 11 deletions(-)
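
For reference, the following is a minimal userspace sketch of the kind of
measurement described in the commit message (timing madvise(MADV_FREE) on a
1GiB anonymous VMA). It is not the exact harness used for the table above,
and it assumes the folio size has already been selected out of band (for
example via the anon mTHP sysfs knobs):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
	const size_t size = 1UL << 30;	/* 1GiB VMA, as in the table above */
	char *buf = mmap(NULL, size, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	struct timespec t0, t1;

	if (buf == MAP_FAILED)
		return 1;

	/* Fault the range in so it is backed by PTE-mapped folios. */
	memset(buf, 1, size);

	clock_gettime(CLOCK_MONOTONIC, &t0);
	madvise(buf, size, MADV_FREE);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("MADV_FREE: %.6f s\n", (t1.tv_sec - t0.tv_sec) +
	       (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}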

Comments

Yin Fengwei Feb. 26, 2024, 2:38 a.m. UTC | #1
On 2/25/24, Lance Yang wrote:
> diff --git a/mm/madvise.c b/mm/madvise.c
> index cfa5e7288261..bcbf56595a2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		 */
>  		if (folio_test_large(folio)) {
>  			int err;
> +			unsigned long next_addr, align;
>  
> -			if (folio_estimated_sharers(folio) != 1)
> -				break;
> -			if (!folio_trylock(folio))
> -				break;
> +			if (folio_estimated_sharers(folio) != 1 ||
> +			    !folio_trylock(folio))
> +				goto skip_large_folio;
> +
> +			align = folio_nr_pages(folio) * PAGE_SIZE;
> +			next_addr = ALIGN_DOWN(addr + align, align);
There is a possible corner case:
If there is a cow folio associated with this folio and the cow folio
has a smaller size than this folio for whatever reason, this change
can't handle it correctly.


Regards
Yin, Fengwei
Barry Song Feb. 26, 2024, 4 a.m. UTC | #2
Hi Lance,


On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote:
>
> This patch improves madvise_free_pte_range() to correctly
> handle large folio that is smaller than PMD-size
> (for example, 16KiB to 1024KiB[1]). It’s probably part of
> the preparation to support anonymous multi-size THP.
>
> Additionally, when the consecutive PTEs are mapped to
> consecutive pages of the same large folio (mTHP), if the
> folio is locked before madvise(MADV_FREE) or cannot be
> split, then all subsequent PTEs within the same PMD will
> be skipped. However, they should have been MADV_FREEed.
>
> Moreover, this patch also optimizes lazyfreeing with
> PTE-mapped mTHP (Inspired by David Hildenbrand[2]). We
> aim to avoid unnecessary folio splitting if the large
> folio is entirely within the given range.
>

We did something similar on MADV_PAGEOUT[1]

[1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/


> On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by
> PTE-mapped folios of the same size results in the following
> runtimes for madvise(MADV_FREE) in seconds (shorter is better):
>
> Folio Size  |    Old     |    New     |  Change
> ----------------------------------------------
>       4KiB  |  0.590251  |  0.590264  |     0%
>      16KiB  |  2.990447  |  0.182167  |   -94%
>      32KiB  |  2.547831  |  0.101622  |   -96%
>      64KiB  |  2.457796  |  0.049726  |   -98%
>     128KiB  |  2.281034  |  0.030109  |   -99%
>     256KiB  |  2.230387  |  0.015838  |   -99%
>     512KiB  |  2.189106  |  0.009149  |   -99%
>    1024KiB  |  2.183949  |  0.006620  |   -99%
>    2048KiB  |  0.002799  |  0.002795  |     0%
>
> [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
> [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
>
> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> ---
>  mm/madvise.c | 69 +++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 58 insertions(+), 11 deletions(-)
>
> diff --git a/mm/madvise.c b/mm/madvise.c
> index cfa5e7288261..bcbf56595a2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                  */
>                 if (folio_test_large(folio)) {
>                         int err;
> +                       unsigned long next_addr, align;
>
> -                       if (folio_estimated_sharers(folio) != 1)
> -                               break;
> -                       if (!folio_trylock(folio))
> -                               break;
> +                       if (folio_estimated_sharers(folio) != 1 ||
> +                           !folio_trylock(folio))
> +                               goto skip_large_folio;
> +
> +                       align = folio_nr_pages(folio) * PAGE_SIZE;
> +                       next_addr = ALIGN_DOWN(addr + align, align);
> +
> +                       /*
> +                        * If we mark only the subpages as lazyfree,
> +                        * split the large folio.
> +                        */
> +                       if (next_addr > end || next_addr - addr != align)
> +                               goto split_large_folio;
> +
> +                       /*
> +                        * Avoid unnecessary folio splitting if the large
> +                        * folio is entirely within the given range.
> +                        */
> +                       folio_test_clear_dirty(folio);
> +                       folio_unlock(folio);
> +                       for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> +                               ptent = ptep_get(pte);
> +                               if (pte_young(ptent) || pte_dirty(ptent)) {
> +                                       ptent = ptep_get_and_clear_full(
> +                                               mm, addr, pte, tlb->fullmm);
> +                                       ptent = pte_mkold(ptent);
> +                                       ptent = pte_mkclean(ptent);
> +                                       set_pte_at(mm, addr, pte, ptent);
> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> +                               }

The code works under the assumption that the large folio is entirely
mapped by all PTEs in the range. This is not always true.

This won't work in some cases, as some PTEs might map into the
large folio while others might have been unmapped or mapped
to different folios.

So in MADV_PAGEOUT, we have a function to check whether the folio is
really entirely mapped:

+static inline bool pte_range_cont_mapped(unsigned long start_pfn,
+ pte_t *start_pte, unsigned long start_addr, int nr)
+{
+              int i;
+              pte_t pte_val;
+
+              for (i = 0; i < nr; i++) {
+                           pte_val = ptep_get(start_pte + i);
+
+                           if (pte_none(pte_val))
+                                        return false;
+
+                           if (pte_pfn(pte_val) != (start_pfn + i))
+                                        return false;
+              }
+
+              return true;
+}

> +                       }
> +                       folio_mark_lazyfree(folio);
> +                       goto next_folio;
> +
> +split_large_folio:
>                         folio_get(folio);
>                         arch_leave_lazy_mmu_mode();
>                         pte_unmap_unlock(start_pte, ptl);
> @@ -688,13 +720,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>                         err = split_folio(folio);
>                         folio_unlock(folio);
>                         folio_put(folio);
> -                       if (err)
> -                               break;
> -                       start_pte = pte =
> -                               pte_offset_map_lock(mm, pmd, addr, &ptl);
> -                       if (!start_pte)
> -                               break;
> -                       arch_enter_lazy_mmu_mode();
> +
> +                       /*
> +                        * If the large folio is locked before madvise(MADV_FREE)
> +                        * or cannot be split, we just skip it.
> +                        */
> +                       if (err) {
> +skip_large_folio:
> +                               if (next_addr >= end)
> +                                       break;
> +                               pte += (next_addr - addr) / PAGE_SIZE;
> +                               addr = next_addr;
> +                       }
> +
> +                       if (!start_pte) {
> +                               start_pte = pte = pte_offset_map_lock(
> +                                       mm, pmd, addr, &ptl);
> +                               if (!start_pte)
> +                                       break;
> +                               arch_enter_lazy_mmu_mode();
> +                       }
> +
> +next_folio:
>                         pte--;
>                         addr -= PAGE_SIZE;
>                         continue;
> --
> 2.33.1
>
>

Thanks
Barry
Lance Yang Feb. 26, 2024, 8:35 a.m. UTC | #3
Hey Fengwei,

Thanks for taking time to review!

> On Mon, Feb 26, 2024 at 10:38 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
> > On Sun, Feb 25, 2024 at 8:32 PM Lance Yang <ioworker0@gmail.com> wrote:
[...]
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >                */
> >               if (folio_test_large(folio)) {
> >                       int err;
> > +                     unsigned long next_addr, align;
> >
> > -                     if (folio_estimated_sharers(folio) != 1)
> > -                             break;
> > -                     if (!folio_trylock(folio))
> > -                             break;
> > +                     if (folio_estimated_sharers(folio) != 1 ||
> > +                         !folio_trylock(folio))
> > +                             goto skip_large_folio;
> > +
> > +                     align = folio_nr_pages(folio) * PAGE_SIZE;
> > +                     next_addr = ALIGN_DOWN(addr + align, align);
> There is a possible corner case:
> If there is a cow folio associated with this folio and the cow folio
> has smaller size than this folio for whatever reason, this change can't
> handle it correctly.

Thanks for pointing that out; it's very helpful to me!
I made some changes. Could you please check if this corner case is now resolved?

As a diff against this patch.

diff --git a/mm/madvise.c b/mm/madvise.c
index bcbf56595a2e..c7aacc9f9536 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -686,10 +686,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			next_addr = ALIGN_DOWN(addr + align, align);
 
 			/*
-			 * If we mark only the subpages as lazyfree,
-			 * split the large folio.
+			 * If we mark only the subpages as lazyfree, or
+			 * if there is a cow folio associated with this folio,
+			 * then split the large folio.
 			 */
-			if (next_addr > end || next_addr - addr != align)
+			if (next_addr > end || next_addr - addr != align ||
+			    folio_total_mapcount(folio) != folio_nr_pages(folio))
 				goto split_large_folio;
 
 			/*
---

Thanks again for your time!

Best,
Lance
Lance Yang Feb. 26, 2024, 8:37 a.m. UTC | #4
Hey Barry,

Thanks for taking time to review!

On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote:
[...]
> On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote:
[...]
> We did something similar on MADV_PAGEOUT[1]
> [1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/

Thanks for providing the link above.

[...]
> > +                        * Avoid unnecessary folio splitting if the large
> > +                        * folio is entirely within the given range.
> > +                        */
> > +                       folio_test_clear_dirty(folio);
> > +                       folio_unlock(folio);
> > +                       for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> > +                               ptent = ptep_get(pte);
> > +                               if (pte_young(ptent) || pte_dirty(ptent)) {
> > +                                       ptent = ptep_get_and_clear_full(
> > +                                               mm, addr, pte, tlb->fullmm);
> > +                                       ptent = pte_mkold(ptent);
> > +                                       ptent = pte_mkclean(ptent);
> > +                                       set_pte_at(mm, addr, pte, ptent);
> > +                                       tlb_remove_tlb_entry(tlb, pte, addr);
> > +                               }
>
> The code works under the assumption the large folio is entirely mapped
> in all PTEs in the range. This is not always true.
>
> This won't work in some cases as some PTEs might be mapping to the
> large folios. some others might have been unmapped or mapped
> to different folios.
>
> so in MADV_PAGEOUT, we have a function to check the folio is
> really entirely mapped:
>
> +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
> + pte_t *start_pte, unsigned long start_addr, int nr)
> +{
> +              int i;
> +              pte_t pte_val;
> +
> +              for (i = 0; i < nr; i++) {
> +                           pte_val = ptep_get(start_pte + i);
> +
> +                           if (pte_none(pte_val))
> +                                        return false;
> +
> +                           if (pte_pfn(pte_val) != (start_pfn + i))
> +                                        return false;
> +              }
> +
> +              return true;
> +}

Thanks for providing the information; it's very helpful to me!
I made some changes. Would you mind taking another look, please?

As a diff against this patch.

diff --git a/mm/madvise.c b/mm/madvise.c
index bcbf56595a2e..255d2f329be4 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -616,6 +616,18 @@ static long madvise_pageout(struct vm_area_struct *vma,
 	return 0;
 }
 
+static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr)
+{
+	pte_t pte_val;
+	unsigned long pfn = pte_pfn(pte);
+	for (int i = 0; i < nr; i++) {
+		pte_val = ptep_get(pte + i);
+		if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i))
+			return false;
+	}
+	return true;
+}
+
 static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 				unsigned long end, struct mm_walk *walk)
 
@@ -676,20 +688,25 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		 */
 		if (folio_test_large(folio)) {
 			int err;
-			unsigned long next_addr, align;
+			unsigned long nr, next_addr, align;
 
 			if (folio_estimated_sharers(folio) != 1 ||
 			    !folio_trylock(folio))
 				goto skip_large_folio;
 
-			align = folio_nr_pages(folio) * PAGE_SIZE;
+			nr = folio_nr_pages(folio);
+			align = nr * PAGE_SIZE;
 			next_addr = ALIGN_DOWN(addr + align, align);
 
 			/*
-			 * If we mark only the subpages as lazyfree,
-			 * split the large folio.
+			 * If we mark only the subpages as lazyfree, or
+			 * if there is a cow folio associated with this folio,
+			 * or if this folio is not really entirely mapped,
+			 * then split the large folio.
 			 */
-			if (next_addr > end || next_addr - addr != align)
+			if (next_addr > end || next_addr - addr != align ||
+			    folio_total_mapcount(folio) != nr ||
+			    pte_range_cont_mapped(pte, nr))
 				goto split_large_folio;
 
 			/*
---

Thanks again for your time!

Best,
Lance
David Hildenbrand Feb. 26, 2024, 8:41 a.m. UTC | #5
On 26.02.24 09:37, Lance Yang wrote:
> Hey Barry,
> 
> Thanks for taking time to review!
> 
> On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote:
> [...]
>> On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote:
> [...]
>> We did something similar on MADV_PAGEOUT[1]
>> [1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/
> 
> Thanks for providing the link above.
> 
> [...]
>>> +                        * Avoid unnecessary folio splitting if the large
>>> +                        * folio is entirely within the given range.
>>> +                        */
>>> +                       folio_test_clear_dirty(folio);
>>> +                       folio_unlock(folio);
>>> +                       for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
>>> +                               ptent = ptep_get(pte);
>>> +                               if (pte_young(ptent) || pte_dirty(ptent)) {
>>> +                                       ptent = ptep_get_and_clear_full(
>>> +                                               mm, addr, pte, tlb->fullmm);
>>> +                                       ptent = pte_mkold(ptent);
>>> +                                       ptent = pte_mkclean(ptent);
>>> +                                       set_pte_at(mm, addr, pte, ptent);
>>> +                                       tlb_remove_tlb_entry(tlb, pte, addr);
>>> +                               }
>>
>> The code works under the assumption the large folio is entirely mapped
>> in all PTEs in the range. This is not always true.
>>
>> This won't work in some cases as some PTEs might be mapping to the
>> large folios. some others might have been unmapped or mapped
>> to different folios.
>>
>> so in MADV_PAGEOUT, we have a function to check the folio is
>> really entirely mapped:
>>
>> +static inline bool pte_range_cont_mapped(unsigned long start_pfn,
>> + pte_t *start_pte, unsigned long start_addr, int nr)
>> +{
>> +              int i;
>> +              pte_t pte_val;
>> +
>> +              for (i = 0; i < nr; i++) {
>> +                           pte_val = ptep_get(start_pte + i);
>> +
>> +                           if (pte_none(pte_val))
>> +                                        return false;
>> +
>> +                           if (pte_pfn(pte_val) != (start_pfn + i))
>> +                                        return false;
>> +              }
>> +
>> +              return true;
>> +}
> 
> Thanks for providing the information; it's very helpful to me!
> I made some changes. Would you mind taking another look, please?
> 
> As a diff against this patch.
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index bcbf56595a2e..255d2f329be4 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -616,6 +616,18 @@ static long madvise_pageout(struct vm_area_struct *vma,
>   	return 0;
>   }
>   
> +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr)
> +{
> +	pte_t pte_val;
> +	unsigned long pfn = pte_pfn(pte);
> +	for (int i = 0; i < nr; i++) {
> +		pte_val = ptep_get(pte + i);
> +		if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i))
> +			return false;
> +	}
> +	return true;
> +}

I dislike the "cont mapped" terminology.

Maybe folio_pte_batch() does what you want?
Lance Yang Feb. 26, 2024, 8:55 a.m. UTC | #6
Hey David,

Thanks for your suggestion!

On Mon, Feb 26, 2024 at 4:41 PM David Hildenbrand <david@redhat.com> wrote:
>
[...]
> > On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote:
> > [...]
> >> On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote:
> > [...]
[...]
> > +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr)
> > +{
> > +     pte_t pte_val;
> > +     unsigned long pfn = pte_pfn(pte);
> > +     for (int i = 0; i < nr; i++) {
> > +             pte_val = ptep_get(pte + i);
> > +             if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i))
> > +                     return false;
> > +     }
> > +     return true;
> > +}
>
> I dislike the "cont mapped" terminology.
>
> Maybe folio_pte_batch() does what you want?

folio_pte_batch() is a good choice. Appreciate it!

Best,
Lance

>
> --
> Cheers,
>
> David / dhildenb
>
Ryan Roberts Feb. 26, 2024, 12:57 p.m. UTC | #7
On 26/02/2024 08:35, Lance Yang wrote:
> Hey Fengwei,
> 
> Thanks for taking time to review!
> 
>> On Mon, Feb 26, 2024 at 10:38 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>> On Sun, Feb 25, 2024 at 8:32 PM Lance Yang <ioworker0@gmail.com> wrote:
> [...]
>>> --- a/mm/madvise.c
>>> +++ b/mm/madvise.c
>>> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>>                */
>>>               if (folio_test_large(folio)) {
>>>                       int err;
>>> +                     unsigned long next_addr, align;
>>>
>>> -                     if (folio_estimated_sharers(folio) != 1)
>>> -                             break;
>>> -                     if (!folio_trylock(folio))
>>> -                             break;
>>> +                     if (folio_estimated_sharers(folio) != 1 ||
>>> +                         !folio_trylock(folio))
>>> +                             goto skip_large_folio;
>>> +
>>> +                     align = folio_nr_pages(folio) * PAGE_SIZE;
>>> +                     next_addr = ALIGN_DOWN(addr + align, align);
>> There is a possible corner case:
>> If there is a cow folio associated with this folio and the cow folio
>> has smaller size than this folio for whatever reason, this change can't
>> handle it correctly.
> 
> Thanks for pointing that out; it's very helpful to me!
> I made some changes. Could you please check if this corner case is now resolved?
> 
> As a diff against this patch.
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index bcbf56595a2e..c7aacc9f9536 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -686,10 +686,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			next_addr = ALIGN_DOWN(addr + align, align);
>  
>  			/*
> -			 * If we mark only the subpages as lazyfree,
> -			 * split the large folio.
> +			 * If we mark only the subpages as lazyfree, or
> +			 * if there is a cow folio associated with this folio,
> +			 * then split the large folio.
>  			 */
> -			if (next_addr > end || next_addr - addr != align)
> +			if (next_addr > end || next_addr - addr != align ||
> +			    folio_total_mapcount(folio) != folio_nr_pages(folio))

I still don't think this is correct. I think you were previously assuming that
if you see a page from a large folio then the whole large folio should be
contiguously mapped? This new check doesn't validate that assumption reliably;
you need to iterate through every pte to generate a batch, like David does in
folio_pte_batch() for this to be safe.

An example of when this check is insufficient: let's say you have a 4-page anon
folio mapped contiguously in a process (total_mapcount=4). The process is forked
(total_mapcount=8). Then each process munmaps the second 2 pages
(total_mapcount=4). In place of the munmapped 2 pages, 2 new pages are mapped.
Then call madvise. It's probably even easier to trigger for file-backed memory
(I think this code path is used for both file and anon?)

Thanks,
Ryan




>  				goto split_large_folio;
>  
>  			/*
> ---
> 
> Thanks again for your time!
> 
> Best,
> Lance
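
To make the scenario Ryan describes above concrete, here is a hedged userspace
sketch. It assumes 16KiB mTHP is enabled so the first mapping is backed by a
single 4-page anon folio; the crude sleep()-based ordering and all names are
purely illustrative:

#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
	size_t pg = (size_t)sysconf(_SC_PAGESIZE);
	size_t len = 4 * pg;
	/* Assume this range ends up backed by a single 4-page anon folio. */
	char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	pid_t pid;

	memset(buf, 1, len);			/* total_mapcount = 4 */

	pid = fork();				/* total_mapcount -> 8 */
	if (pid == 0) {
		munmap(buf + 2 * pg, 2 * pg);	/* child drops 2 pages: -> 6 */
		pause();
		_exit(0);
	}

	sleep(1);				/* crude: let the child unmap */
	munmap(buf + 2 * pg, 2 * pg);		/* parent drops 2 pages: -> 4 */
	/* Map two unrelated pages where the folio's tail pages used to be. */
	mmap(buf + 2 * pg, 2 * pg, PROT_READ | PROT_WRITE,
	     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	/*
	 * folio_total_mapcount() now equals folio_nr_pages() again, yet this
	 * VMA only maps the first half of the folio, so the mapcount check
	 * alone cannot prove the folio is fully mapped in this range.
	 */
	madvise(buf, len, MADV_FREE);

	kill(pid, SIGTERM);
	waitpid(pid, NULL, 0);
	return 0;
}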
Ryan Roberts Feb. 26, 2024, 1 p.m. UTC | #8
Hi Lance,

Thanks for working on this!


On 25/02/2024 12:32, Lance Yang wrote:
> This patch improves madvise_free_pte_range() to correctly
> handle large folio that is smaller than PMD-size

When you say "correctly handle" are you implying there is a bug with the current
implementation or are you just saying you can optimize this to improve
performance? I'm not convinced there is a bug, but I agree there is certainly
room for performance improvement.

Thanks,
Ryan

> (for example, 16KiB to 1024KiB[1]). It’s probably part of
> the preparation to support anonymous multi-size THP.
> 
> Additionally, when the consecutive PTEs are mapped to
> consecutive pages of the same large folio (mTHP), if the
> folio is locked before madvise(MADV_FREE) or cannot be
> split, then all subsequent PTEs within the same PMD will
> be skipped. However, they should have been MADV_FREEed.
> 
> Moreover, this patch also optimizes lazyfreeing with
> PTE-mapped mTHP (Inspired by David Hildenbrand[2]). We
> aim to avoid unnecessary folio splitting if the large
> folio is entirely within the given range.
> 
> On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by
> PTE-mapped folios of the same size results in the following
> runtimes for madvise(MADV_FREE) in seconds (shorter is better):
> 
> Folio Size  |    Old     |    New     |  Change
> ----------------------------------------------
>       4KiB  |  0.590251  |  0.590264  |     0%
>      16KiB  |  2.990447  |  0.182167  |   -94%
>      32KiB  |  2.547831  |  0.101622  |   -96%
>      64KiB  |  2.457796  |  0.049726  |   -98%
>     128KiB  |  2.281034  |  0.030109  |   -99%
>     256KiB  |  2.230387  |  0.015838  |   -99%
>     512KiB  |  2.189106  |  0.009149  |   -99%
>    1024KiB  |  2.183949  |  0.006620  |   -99%
>    2048KiB  |  0.002799  |  0.002795  |     0%
> 
> [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
> [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
> 
> Signed-off-by: Lance Yang <ioworker0@gmail.com>
> ---
>  mm/madvise.c | 69 +++++++++++++++++++++++++++++++++++++++++++---------
>  1 file changed, 58 insertions(+), 11 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index cfa5e7288261..bcbf56595a2e 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  		 */
>  		if (folio_test_large(folio)) {
>  			int err;
> +			unsigned long next_addr, align;
>  
> -			if (folio_estimated_sharers(folio) != 1)
> -				break;
> -			if (!folio_trylock(folio))
> -				break;
> +			if (folio_estimated_sharers(folio) != 1 ||
> +			    !folio_trylock(folio))
> +				goto skip_large_folio;
> +
> +			align = folio_nr_pages(folio) * PAGE_SIZE;
> +			next_addr = ALIGN_DOWN(addr + align, align);
> +
> +			/*
> +			 * If we mark only the subpages as lazyfree,
> +			 * split the large folio.
> +			 */
> +			if (next_addr > end || next_addr - addr != align)
> +				goto split_large_folio;
> +
> +			/*
> +			 * Avoid unnecessary folio splitting if the large
> +			 * folio is entirely within the given range.
> +			 */
> +			folio_test_clear_dirty(folio);
> +			folio_unlock(folio);
> +			for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> +				ptent = ptep_get(pte);
> +				if (pte_young(ptent) || pte_dirty(ptent)) {
> +					ptent = ptep_get_and_clear_full(
> +						mm, addr, pte, tlb->fullmm);
> +					ptent = pte_mkold(ptent);
> +					ptent = pte_mkclean(ptent);
> +					set_pte_at(mm, addr, pte, ptent);
> +					tlb_remove_tlb_entry(tlb, pte, addr);
> +				}
> +			}
> +			folio_mark_lazyfree(folio);
> +			goto next_folio;
> +
> +split_large_folio:
>  			folio_get(folio);
>  			arch_leave_lazy_mmu_mode();
>  			pte_unmap_unlock(start_pte, ptl);
> @@ -688,13 +720,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>  			err = split_folio(folio);
>  			folio_unlock(folio);
>  			folio_put(folio);
> -			if (err)
> -				break;
> -			start_pte = pte =
> -				pte_offset_map_lock(mm, pmd, addr, &ptl);
> -			if (!start_pte)
> -				break;
> -			arch_enter_lazy_mmu_mode();
> +
> +			/*
> +			 * If the large folio is locked before madvise(MADV_FREE)
> +			 * or cannot be split, we just skip it.
> +			 */
> +			if (err) {
> +skip_large_folio:
> +				if (next_addr >= end)
> +					break;
> +				pte += (next_addr - addr) / PAGE_SIZE;
> +				addr = next_addr;
> +			}
> +
> +			if (!start_pte) {
> +				start_pte = pte = pte_offset_map_lock(
> +					mm, pmd, addr, &ptl);
> +				if (!start_pte)
> +					break;
> +				arch_enter_lazy_mmu_mode();
> +			}
> +
> +next_folio:
>  			pte--;
>  			addr -= PAGE_SIZE;
>  			continue;
David Hildenbrand Feb. 26, 2024, 1:03 p.m. UTC | #9
On 26.02.24 13:57, Ryan Roberts wrote:
> On 26/02/2024 08:35, Lance Yang wrote:
>> Hey Fengwei,
>>
>> Thanks for taking time to review!
>>
>>> On Mon, Feb 26, 2024 at 10:38 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>> On Sun, Feb 25, 2024 at 8:32 PM Lance Yang <ioworker0@gmail.com> wrote:
>> [...]
>>>> --- a/mm/madvise.c
>>>> +++ b/mm/madvise.c
>>>> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>>>                 */
>>>>                if (folio_test_large(folio)) {
>>>>                        int err;
>>>> +                     unsigned long next_addr, align;
>>>>
>>>> -                     if (folio_estimated_sharers(folio) != 1)
>>>> -                             break;
>>>> -                     if (!folio_trylock(folio))
>>>> -                             break;
>>>> +                     if (folio_estimated_sharers(folio) != 1 ||
>>>> +                         !folio_trylock(folio))
>>>> +                             goto skip_large_folio;
>>>> +
>>>> +                     align = folio_nr_pages(folio) * PAGE_SIZE;
>>>> +                     next_addr = ALIGN_DOWN(addr + align, align);
>>> There is a possible corner case:
>>> If there is a cow folio associated with this folio and the cow folio
>>> has smaller size than this folio for whatever reason, this change can't
>>> handle it correctly.
>>
>> Thanks for pointing that out; it's very helpful to me!
>> I made some changes. Could you please check if this corner case is now resolved?
>>
>> As a diff against this patch.
>>
>> diff --git a/mm/madvise.c b/mm/madvise.c
>> index bcbf56595a2e..c7aacc9f9536 100644
>> --- a/mm/madvise.c
>> +++ b/mm/madvise.c
>> @@ -686,10 +686,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
>>   			next_addr = ALIGN_DOWN(addr + align, align);
>>   
>>   			/*
>> -			 * If we mark only the subpages as lazyfree,
>> -			 * split the large folio.
>> +			 * If we mark only the subpages as lazyfree, or
>> +			 * if there is a cow folio associated with this folio,
>> +			 * then split the large folio.
>>   			 */
>> -			if (next_addr > end || next_addr - addr != align)
>> +			if (next_addr > end || next_addr - addr != align ||
>> +			    folio_total_mapcount(folio) != folio_nr_pages(folio))
> 
> I still don't think this is correct. I think you were previously assuming that
> if you see a page from a large folio then the whole large folio should be
> contiguously mapped? This new check doesn't validate that assumption reliably;
> you need to iterate through every pte to generate a batch, like David does in
> folio_pte_batch() for this to be safe.
> 
> An example of when this check is insufficient; let's say you have a 4 page anon
> folio mapped contiguously in a process (total_mapcount=4). The process is forked
> (total_mapcount=8). Then each process munmaps the second 2 pages
> (total_mapcount=4). In place of the munmapped 2 pages, 2 new pages are mapped.
> Then call madvise. It's probably even easier to trigger for file-backed memory
> (I think this code path is used for both file and anon?)

What would work here is using folio_pte_batch() to get how many PTEs are 
mapped *here*, then comparing the batch size to folio_nr_pages(). If 
both match, we are mapping all subpages.
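
A hedged sketch of what that check could look like inside
madvise_free_pte_range(), assuming folio_pte_batch() is exported with the
signature shown later in this thread (variable names follow the posted patch;
this is not the actual v2):

			/*
			 * Sketch only: batch the PTEs that map this folio
			 * starting at the current PTE and compare against the
			 * folio size. Assumes folio_pte_batch() and the FPB_*
			 * flags are visible here (see the export patch below).
			 */
			const fpb_t fpb_flags = FPB_IGNORE_DIRTY |
						FPB_IGNORE_SOFT_DIRTY;
			int max_nr = (end - addr) / PAGE_SIZE;
			int nr = folio_pte_batch(folio, addr, pte, ptent,
						 max_nr, fpb_flags, NULL);

			/* Only skip the split when every subpage is mapped here. */
			if (nr < folio_nr_pages(folio))
				goto split_large_folio;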
Ryan Roberts Feb. 26, 2024, 1:04 p.m. UTC | #10
On 26/02/2024 08:55, Lance Yang wrote:
> Hey David,
> 
> Thanks for your suggestion!
> 
> On Mon, Feb 26, 2024 at 4:41 PM David Hildenbrand <david@redhat.com> wrote:
>>
> [...]
>>> On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote:
>>> [...]
>>>> On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote:
>>> [...]
> [...]
>>> +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr)
>>> +{
>>> +     pte_t pte_val;
>>> +     unsigned long pfn = pte_pfn(pte);
>>> +     for (int i = 0; i < nr; i++) {
>>> +             pte_val = ptep_get(pte + i);
>>> +             if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i))
>>> +                     return false;
>>> +     }
>>> +     return true;
>>> +}
>>
>> I dislike the "cont mapped" terminology.
>>
>> Maybe folio_pte_batch() does what you want?
> 
> folio_pte_batch() is a good choice. Appreciate it!

Agreed, folio_pte_batch() is likely to be widely useful for this change and
others, so I suggest exporting it from memory.c and reusing it as-is if possible.

> 
> Best,
> Lance
> 
>>
>> --
>> Cheers,
>>
>> David / dhildenb
>>
Lance Yang Feb. 26, 2024, 1:47 p.m. UTC | #11
On Mon, Feb 26, 2024 at 9:04 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 26.02.24 13:57, Ryan Roberts wrote:
> > On 26/02/2024 08:35, Lance Yang wrote:
> >> Hey Fengwei,
> >>
> >> Thanks for taking time to review!
> >>
> >>> On Mon, Feb 26, 2024 at 10:38 AM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>>> On Sun, Feb 25, 2024 at 8:32 PM Lance Yang <ioworker0@gmail.com> wrote:
> >> [...]
> >>>> --- a/mm/madvise.c
> >>>> +++ b/mm/madvise.c
> >>>> @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >>>>                 */
> >>>>                if (folio_test_large(folio)) {
> >>>>                        int err;
> >>>> +                     unsigned long next_addr, align;
> >>>>
> >>>> -                     if (folio_estimated_sharers(folio) != 1)
> >>>> -                             break;
> >>>> -                     if (!folio_trylock(folio))
> >>>> -                             break;
> >>>> +                     if (folio_estimated_sharers(folio) != 1 ||
> >>>> +                         !folio_trylock(folio))
> >>>> +                             goto skip_large_folio;
> >>>> +
> >>>> +                     align = folio_nr_pages(folio) * PAGE_SIZE;
> >>>> +                     next_addr = ALIGN_DOWN(addr + align, align);
> >>> There is a possible corner case:
> >>> If there is a cow folio associated with this folio and the cow folio
> >>> has smaller size than this folio for whatever reason, this change can't
> >>> handle it correctly.
> >>
> >> Thanks for pointing that out; it's very helpful to me!
> >> I made some changes. Could you please check if this corner case is now resolved?
> >>
> >> As a diff against this patch.
> >>
> >> diff --git a/mm/madvise.c b/mm/madvise.c
> >> index bcbf56595a2e..c7aacc9f9536 100644
> >> --- a/mm/madvise.c
> >> +++ b/mm/madvise.c
> >> @@ -686,10 +686,12 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >>                      next_addr = ALIGN_DOWN(addr + align, align);
> >>
> >>                      /*
> >> -                     * If we mark only the subpages as lazyfree,
> >> -                     * split the large folio.
> >> +                     * If we mark only the subpages as lazyfree, or
> >> +                     * if there is a cow folio associated with this folio,
> >> +                     * then split the large folio.
> >>                       */
> >> -                    if (next_addr > end || next_addr - addr != align)
> >> +                    if (next_addr > end || next_addr - addr != align ||
> >> +                        folio_total_mapcount(folio) != folio_nr_pages(folio))
> >
> > I still don't think this is correct. I think you were previously assuming that
> > if you see a page from a large folio then the whole large folio should be
> > contiguously mapped? This new check doesn't validate that assumption reliably;
> > you need to iterate through every pte to generate a batch, like David does in
> > folio_pte_batch() for this to be safe.
> >
> > An example of when this check is insufficient; let's say you have a 4 page anon
> > folio mapped contiguously in a process (total_mapcount=4). The process is forked
> > (total_mapcount=8). Then each process munmaps the second 2 pages
> > (total_mapcount=4). In place of the munmapped 2 pages, 2 new pages are mapped.
> > Then call madvise. It's probably even easier to trigger for file-backed memory
> > (I think this code path is used for both file and anon?)
>
> What would work here is using folio_pte_batch() to get how many PTEs are
> mapped *here*, then comparing the batch size to folio_nr_pages(). If
> both match, we are mapping all subpages.

Thanks! I'll use folio_pte_batch() here in v2.

Best,
Lance

>
> --
> Cheers,
>
> David / dhildenb
>
Lance Yang Feb. 26, 2024, 1:50 p.m. UTC | #12
On Mon, Feb 26, 2024 at 9:04 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 26/02/2024 08:55, Lance Yang wrote:
> > Hey David,
> >
> > Thanks for your suggestion!
> >
> > On Mon, Feb 26, 2024 at 4:41 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> > [...]
> >>> On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote:
> >>> [...]
> >>>> On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote:
> >>> [...]
> > [...]
> >>> +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr)
> >>> +{
> >>> +     pte_t pte_val;
> >>> +     unsigned long pfn = pte_pfn(pte);
> >>> +     for (int i = 0; i < nr; i++) {
> >>> +             pte_val = ptep_get(pte + i);
> >>> +             if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i))
> >>> +                     return false;
> >>> +     }
> >>> +     return true;
> >>> +}
> >>
> >> I dislike the "cont mapped" terminology.
> >>
> >> Maybe folio_pte_batch() does what you want?
> >
> > folio_pte_batch() is a good choice. Appreciate it!
>
> Agreed, folio_pte_batch() is likely to be widely useful for this change and
> others, so suggest exporting it from memory.c and reusing as is if possible.

Thanks for your suggestion. I'll use folio_pte_batch() in v2.

Best,
Lance

>
> >
> > Best,
> > Lance
> >
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >>
>
Lance Yang Feb. 26, 2024, 1:54 p.m. UTC | #13
Hey Ryan,

Thanks for taking time to review!

On Mon, Feb 26, 2024 at 9:00 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Lance,
>
> Thanks for working on this!
>
>
> On 25/02/2024 12:32, Lance Yang wrote:
> > This patch improves madvise_free_pte_range() to correctly
> > handle large folio that is smaller than PMD-size
>
> When you say "correctly handle" are you implying there is a bug with the current
> implementation or are you just saying you can optimize this to improve
> performance? I'm not convinced there is a bug, but I agree there is certainly
> room for performance improvement.

I agree with your point, and will update the changelog in v2.

Thanks again for your time!

Best,
Lance

>
> Thanks,
> Ryan
>
> > (for example, 16KiB to 1024KiB[1]). It’s probably part of
> > the preparation to support anonymous multi-size THP.
> >
> > Additionally, when the consecutive PTEs are mapped to
> > consecutive pages of the same large folio (mTHP), if the
> > folio is locked before madvise(MADV_FREE) or cannot be
> > split, then all subsequent PTEs within the same PMD will
> > be skipped. However, they should have been MADV_FREEed.
> >
> > Moreover, this patch also optimizes lazyfreeing with
> > PTE-mapped mTHP (Inspired by David Hildenbrand[2]). We
> > aim to avoid unnecessary folio splitting if the large
> > folio is entirely within the given range.
> >
> > On an Intel I5 CPU, lazyfreeing a 1GiB VMA backed by
> > PTE-mapped folios of the same size results in the following
> > runtimes for madvise(MADV_FREE) in seconds (shorter is better):
> >
> > Folio Size  |    Old     |    New     |  Change
> > ----------------------------------------------
> >       4KiB  |  0.590251  |  0.590264  |     0%
> >      16KiB  |  2.990447  |  0.182167  |   -94%
> >      32KiB  |  2.547831  |  0.101622  |   -96%
> >      64KiB  |  2.457796  |  0.049726  |   -98%
> >     128KiB  |  2.281034  |  0.030109  |   -99%
> >     256KiB  |  2.230387  |  0.015838  |   -99%
> >     512KiB  |  2.189106  |  0.009149  |   -99%
> >    1024KiB  |  2.183949  |  0.006620  |   -99%
> >    2048KiB  |  0.002799  |  0.002795  |     0%
> >
> > [1] https://lkml.kernel.org/r/20231207161211.2374093-5-ryan.roberts@arm.com
> > [2] https://lore.kernel.org/linux-mm/20240214204435.167852-1-david@redhat.com/
> >
> > Signed-off-by: Lance Yang <ioworker0@gmail.com>
> > ---
> >  mm/madvise.c | 69 +++++++++++++++++++++++++++++++++++++++++++---------
> >  1 file changed, 58 insertions(+), 11 deletions(-)
> >
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index cfa5e7288261..bcbf56595a2e 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -676,11 +676,43 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >                */
> >               if (folio_test_large(folio)) {
> >                       int err;
> > +                     unsigned long next_addr, align;
> >
> > -                     if (folio_estimated_sharers(folio) != 1)
> > -                             break;
> > -                     if (!folio_trylock(folio))
> > -                             break;
> > +                     if (folio_estimated_sharers(folio) != 1 ||
> > +                         !folio_trylock(folio))
> > +                             goto skip_large_folio;
> > +
> > +                     align = folio_nr_pages(folio) * PAGE_SIZE;
> > +                     next_addr = ALIGN_DOWN(addr + align, align);
> > +
> > +                     /*
> > +                      * If we mark only the subpages as lazyfree,
> > +                      * split the large folio.
> > +                      */
> > +                     if (next_addr > end || next_addr - addr != align)
> > +                             goto split_large_folio;
> > +
> > +                     /*
> > +                      * Avoid unnecessary folio splitting if the large
> > +                      * folio is entirely within the given range.
> > +                      */
> > +                     folio_test_clear_dirty(folio);
> > +                     folio_unlock(folio);
> > +                     for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
> > +                             ptent = ptep_get(pte);
> > +                             if (pte_young(ptent) || pte_dirty(ptent)) {
> > +                                     ptent = ptep_get_and_clear_full(
> > +                                             mm, addr, pte, tlb->fullmm);
> > +                                     ptent = pte_mkold(ptent);
> > +                                     ptent = pte_mkclean(ptent);
> > +                                     set_pte_at(mm, addr, pte, ptent);
> > +                                     tlb_remove_tlb_entry(tlb, pte, addr);
> > +                             }
> > +                     }
> > +                     folio_mark_lazyfree(folio);
> > +                     goto next_folio;
> > +
> > +split_large_folio:
> >                       folio_get(folio);
> >                       arch_leave_lazy_mmu_mode();
> >                       pte_unmap_unlock(start_pte, ptl);
> > @@ -688,13 +720,28 @@ static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
> >                       err = split_folio(folio);
> >                       folio_unlock(folio);
> >                       folio_put(folio);
> > -                     if (err)
> > -                             break;
> > -                     start_pte = pte =
> > -                             pte_offset_map_lock(mm, pmd, addr, &ptl);
> > -                     if (!start_pte)
> > -                             break;
> > -                     arch_enter_lazy_mmu_mode();
> > +
> > +                     /*
> > +                      * If the large folio is locked before madvise(MADV_FREE)
> > +                      * or cannot be split, we just skip it.
> > +                      */
> > +                     if (err) {
> > +skip_large_folio:
> > +                             if (next_addr >= end)
> > +                                     break;
> > +                             pte += (next_addr - addr) / PAGE_SIZE;
> > +                             addr = next_addr;
> > +                     }
> > +
> > +                     if (!start_pte) {
> > +                             start_pte = pte = pte_offset_map_lock(
> > +                                     mm, pmd, addr, &ptl);
> > +                             if (!start_pte)
> > +                                     break;
> > +                             arch_enter_lazy_mmu_mode();
> > +                     }
> > +
> > +next_folio:
> >                       pte--;
> >                       addr -= PAGE_SIZE;
> >                       continue;
>
Barry Song Feb. 26, 2024, 8:49 p.m. UTC | #14
On Tue, Feb 27, 2024 at 2:04 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 26/02/2024 08:55, Lance Yang wrote:
> > Hey David,
> >
> > Thanks for your suggestion!
> >
> > On Mon, Feb 26, 2024 at 4:41 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> > [...]
> >>> On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote:
> >>> [...]
> >>>> On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote:
> >>> [...]
> > [...]
> >>> +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr)
> >>> +{
> >>> +     pte_t pte_val;
> >>> +     unsigned long pfn = pte_pfn(pte);
> >>> +     for (int i = 0; i < nr; i++) {
> >>> +             pte_val = ptep_get(pte + i);
> >>> +             if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i))
> >>> +                     return false;
> >>> +     }
> >>> +     return true;
> >>> +}
> >>
> >> I dislike the "cont mapped" terminology.
> >>
> >> Maybe folio_pte_batch() does what you want?
> >
> > folio_pte_batch() is a good choice. Appreciate it!
>
> Agreed, folio_pte_batch() is likely to be widely useful for this change and
> others, so suggest exporting it from memory.c and reusing as is if possible.

I actually missed folio_pte_batch() in the cont-pte series and re-invented a
function to check if a large folio is entirely mapped in MADV_PAGEOUT[1].
Exporting folio_pte_batch() will also benefit that case; the problem space is
the same.

[1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/

>
> >
> > Best,
> > Lance
> >
> >>
> >> --
> >> Cheers,
> >>
> >> David / dhildenb

Thanks
Barry
Barry Song Feb. 27, 2024, 1:21 a.m. UTC | #15
> Thanks for your suggestion. I'll use folio_pte_batch() in v2.

Hi Lance,
Obviously, we both need this. While making large folio swap-in
v2, I am exporting folio_pte_batch() as below,

From: Barry Song <v-songbaohua@oppo.com>
Date: Tue, 27 Feb 2024 14:05:43 +1300
Subject: [PATCH] mm: export folio_pte_batch as a couple of modules need it

MADV_FREE, MADV_PAGEOUT and some other modules might need folio_pte_batch
to check if a range of PTEs are completely mapped to a large folio with
contiguous physical offset.

Cc: Lance Yang <ioworker0@gmail.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: David Hildenbrand <david@redhat.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/internal.h | 13 +++++++++++++
 mm/memory.c   |  2 +-
 2 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/mm/internal.h b/mm/internal.h
index 36c11ea41f47..7e11aea3eda9 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
 	return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
 }
 
+/* Flags for folio_pte_batch(). */
+typedef int __bitwise fpb_t;
+
+/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
+#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
+
+/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
+#define FPB_IGNORE_SOFT_DIRTY		((__force fpb_t)BIT(1))
+
+extern int folio_pte_batch(struct folio *folio, unsigned long addr,
+		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
+		bool *any_writable);
+
 void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
 						int nr_throttled);
 static inline void acct_reclaim_writeback(struct folio *folio)
diff --git a/mm/memory.c b/mm/memory.c
index 6378f6bc22c5..dd9bd67f037a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -989,7 +989,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
  * If "any_writable" is set, it will indicate if any other PTE besides the
  * first (given) PTE is writable.
  */
-static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
+int folio_pte_batch(struct folio *folio, unsigned long addr,
 		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
 		bool *any_writable)
 {
Lance Yang Feb. 27, 2024, 1:48 a.m. UTC | #16
On Tue, Feb 27, 2024 at 9:21 AM Barry Song <21cnbao@gmail.com> wrote:
>
> > Thanks for your suggestion. I'll use folio_pte_batch() in v2.
>
> Hi Lance,
> Obviously, we both need this. While making large folio swap-in
> v2, I am exporting folio_pte_batch() as below,

Thanks, Barry.

Could you separate the export of folio_pte_batch() from the large folio
swap-in v2? Pushing this specific change first would help me develop v2
of this patch on top of it.

Best,
Lance

>
> From: Barry Song <v-songbaohua@oppo.com>
> Date: Tue, 27 Feb 2024 14:05:43 +1300
> Subject: [PATCH] mm: export folio_pte_batch as a couple of modules need it
>
> MADV_FREE, MADV_PAGEOUT and some other modules might need folio_pte_batch
> to check if a range of PTEs are completely mapped to a large folio with
> contiguous physcial offset.
>
> Cc: Lance Yang <ioworker0@gmail.com>
> Cc: Ryan Roberts <ryan.roberts@arm.com>
> Cc: David Hildenbrand <david@redhat.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/internal.h | 13 +++++++++++++
>  mm/memory.c   |  2 +-
>  2 files changed, 14 insertions(+), 1 deletion(-)
>
> diff --git a/mm/internal.h b/mm/internal.h
> index 36c11ea41f47..7e11aea3eda9 100644
> --- a/mm/internal.h
> +++ b/mm/internal.h
> @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
>         return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
>  }
>
> +/* Flags for folio_pte_batch(). */
> +typedef int __bitwise fpb_t;
> +
> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> +#define FPB_IGNORE_DIRTY               ((__force fpb_t)BIT(0))
> +
> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> +#define FPB_IGNORE_SOFT_DIRTY          ((__force fpb_t)BIT(1))
> +
> +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> +               pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> +               bool *any_writable);
> +
>  void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
>                                                 int nr_throttled);
>  static inline void acct_reclaim_writeback(struct folio *folio)
> diff --git a/mm/memory.c b/mm/memory.c
> index 6378f6bc22c5..dd9bd67f037a 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -989,7 +989,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
>   * If "any_writable" is set, it will indicate if any other PTE besides the
>   * first (given) PTE is writable.
>   */
> -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> +int folio_pte_batch(struct folio *folio, unsigned long addr,
>                 pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
>                 bool *any_writable)
>  {
> --
> 2.34.1
>
> > Best,
> > Lance
>
> Thanks
> Barry
>
Yin Fengwei Feb. 27, 2024, 1:51 a.m. UTC | #17
On 2/27/24 04:49, Barry Song wrote:
> On Tue, Feb 27, 2024 at 2:04 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 26/02/2024 08:55, Lance Yang wrote:
>>> Hey David,
>>>
>>> Thanks for your suggestion!
>>>
>>> On Mon, Feb 26, 2024 at 4:41 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>> [...]
>>>>> On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>> [...]
>>>>>> On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote:
>>>>> [...]
>>> [...]
>>>>> +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr)
>>>>> +{
>>>>> +     pte_t pte_val;
>>>>> +     unsigned long pfn = pte_pfn(pte);
>>>>> +     for (int i = 0; i < nr; i++) {
>>>>> +             pte_val = ptep_get(pte + i);
>>>>> +             if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i))
>>>>> +                     return false;
>>>>> +     }
>>>>> +     return true;
>>>>> +}
>>>>
>>>> I dislike the "cont mapped" terminology.
>>>>
>>>> Maybe folio_pte_batch() does what you want?
>>>
>>> folio_pte_batch() is a good choice. Appreciate it!
>>
>> Agreed, folio_pte_batch() is likely to be widely useful for this change and
>> others, so suggest exporting it from memory.c and reusing as is if possible.
> 
> I actually missed folio_pte_batch() in cont-pte series and re-invented
> a function
> to check if a large folio is entirely mapped in MADV_PAGEOUT[1]. exporting
> folio_pte_batch() will also benefit that case. The problem space is same.
> 
> [1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/
I am wondering whether we can delay the large folio split till the page reclaim
phase for the madvise cases.

Like if we hit a folio which is only partially mapped to the range, don't split
it but just unmap the mapped part from the range. Let page reclaim decide
whether to split the large folio or not (if it's not mapped to any other range,
it will be freed as a whole large folio; if part of it is still mapped to
another range, page reclaim can decide whether to split it or ignore it for the
current reclaim cycle).

Splitting does work here, but it just drops all the benefits of the large folio.


Regards
Yin, Fengwei

> 
>>
>>>
>>> Best,
>>> Lance
>>>
>>>>
>>>> --
>>>> Cheers,
>>>>
>>>> David / dhildenb
> 
> Thanks
> Barry
Barry Song Feb. 27, 2024, 2:12 a.m. UTC | #18
On Tue, Feb 27, 2024 at 2:48 PM Lance Yang <ioworker0@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 9:21 AM Barry Song <21cnbao@gmail.com> wrote:
> >
> > > Thanks for your suggestion. I'll use folio_pte_batch() in v2.
> >
> > Hi Lance,
> > Obviously, we both need this. While making large folio swap-in
> > v2, I am exporting folio_pte_batch() as below,
>
> Thanks, Barry.
>
> Could you separate the export of folio_pte_batch() from the large folio
> swap-in v2? Prioritizing the push for this specific change would aid in
> the development of the v2 based on it.

I agree we should get this one pulled in by Andrew early to avoid potential
dependencies and conflicts between the two jobs.

>
> Best,
> Lance
>
> >
> > From: Barry Song <v-songbaohua@oppo.com>
> > Date: Tue, 27 Feb 2024 14:05:43 +1300
> > Subject: [PATCH] mm: export folio_pte_batch as a couple of modules need it
> >
> > MADV_FREE, MADV_PAGEOUT and some other modules might need folio_pte_batch
> > to check if a range of PTEs are completely mapped to a large folio with
> > contiguous physcial offset.
> >
> > Cc: Lance Yang <ioworker0@gmail.com>
> > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > Cc: David Hildenbrand <david@redhat.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/internal.h | 13 +++++++++++++
> >  mm/memory.c   |  2 +-
> >  2 files changed, 14 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/internal.h b/mm/internal.h
> > index 36c11ea41f47..7e11aea3eda9 100644
> > --- a/mm/internal.h
> > +++ b/mm/internal.h
> > @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
> >         return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
> >  }
> >
> > +/* Flags for folio_pte_batch(). */
> > +typedef int __bitwise fpb_t;
> > +
> > +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> > +#define FPB_IGNORE_DIRTY               ((__force fpb_t)BIT(0))
> > +
> > +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> > +#define FPB_IGNORE_SOFT_DIRTY          ((__force fpb_t)BIT(1))
> > +
> > +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> > +               pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> > +               bool *any_writable);
> > +
> >  void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> >                                                 int nr_throttled);
> >  static inline void acct_reclaim_writeback(struct folio *folio)
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6378f6bc22c5..dd9bd67f037a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -989,7 +989,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> >   * If "any_writable" is set, it will indicate if any other PTE besides the
> >   * first (given) PTE is writable.
> >   */
> > -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> > +int folio_pte_batch(struct folio *folio, unsigned long addr,
> >                 pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> >                 bool *any_writable)
> >  {
> > --
> > 2.34.1
> >
> > > Best,
> > > Lance
> >
Thanks
Barry
> >
Lance Yang Feb. 27, 2024, 2:15 a.m. UTC | #19
Thanks, Barry!

On Tue, Feb 27, 2024 at 10:12 AM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 2:48 PM Lance Yang <ioworker0@gmail.com> wrote:
> >
> > On Tue, Feb 27, 2024 at 9:21 AM Barry Song <21cnbao@gmail.com> wrote:
> > >
> > > > Thanks for your suggestion. I'll use folio_pte_batch() in v2.
> > >
> > > Hi Lance,
> > > Obviously, we both need this. While making large folio swap-in
> > > v2, I am exporting folio_pte_batch() as below,
> >
> > Thanks, Barry.
> >
> > Could you separate the export of folio_pte_batch() from the large folio
> > swap-in v2? Prioritizing the push for this specific change would aid in
> > the development of the v2 based on it.
>
> I agree we should make this one pulled in by Andrew early to avoid potential
> dependencies and conflicts in two jobs.
>
> >
> > Best,
> > Lance
> >
> > >
> > > From: Barry Song <v-songbaohua@oppo.com>
> > > Date: Tue, 27 Feb 2024 14:05:43 +1300
> > > Subject: [PATCH] mm: export folio_pte_batch as a couple of modules need it
> > >
> > > MADV_FREE, MADV_PAGEOUT and some other modules might need folio_pte_batch
> > > to check if a range of PTEs are completely mapped to a large folio with
> > > contiguous physical offset.
> > >
> > > Cc: Lance Yang <ioworker0@gmail.com>
> > > Cc: Ryan Roberts <ryan.roberts@arm.com>
> > > Cc: David Hildenbrand <david@redhat.com>
> > > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > > ---
> > >  mm/internal.h | 13 +++++++++++++
> > >  mm/memory.c   |  2 +-
> > >  2 files changed, 14 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/internal.h b/mm/internal.h
> > > index 36c11ea41f47..7e11aea3eda9 100644
> > > --- a/mm/internal.h
> > > +++ b/mm/internal.h
> > > @@ -83,6 +83,19 @@ static inline void *folio_raw_mapping(struct folio *folio)
> > >         return (void *)(mapping & ~PAGE_MAPPING_FLAGS);
> > >  }
> > >
> > > +/* Flags for folio_pte_batch(). */
> > > +typedef int __bitwise fpb_t;
> > > +
> > > +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> > > +#define FPB_IGNORE_DIRTY               ((__force fpb_t)BIT(0))
> > > +
> > > +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> > > +#define FPB_IGNORE_SOFT_DIRTY          ((__force fpb_t)BIT(1))
> > > +
> > > +extern int folio_pte_batch(struct folio *folio, unsigned long addr,
> > > +               pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> > > +               bool *any_writable);
> > > +
> > >  void __acct_reclaim_writeback(pg_data_t *pgdat, struct folio *folio,
> > >                                                 int nr_throttled);
> > >  static inline void acct_reclaim_writeback(struct folio *folio)
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6378f6bc22c5..dd9bd67f037a 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -989,7 +989,7 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> > >   * If "any_writable" is set, it will indicate if any other PTE besides the
> > >   * first (given) PTE is writable.
> > >   */
> > > -static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> > > +int folio_pte_batch(struct folio *folio, unsigned long addr,
> > >                 pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags,
> > >                 bool *any_writable)
> > >  {
> > > --
> > > 2.34.1
> > >
> > > > Best,
> > > > Lance
> > >
> Thanks
> Barry
> > >
Barry Song Feb. 27, 2024, 2:17 a.m. UTC | #20
On Tue, Feb 27, 2024 at 2:51 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 2/27/24 04:49, Barry Song wrote:
> > On Tue, Feb 27, 2024 at 2:04 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 26/02/2024 08:55, Lance Yang wrote:
> >>> Hey David,
> >>>
> >>> Thanks for your suggestion!
> >>>
> >>> On Mon, Feb 26, 2024 at 4:41 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>> [...]
> >>>>> On Mon, Feb 26, 2024 at 12:00 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>> [...]
> >>>>>> On Mon, Feb 26, 2024 at 1:33 AM Lance Yang <ioworker0@gmail.com> wrote:
> >>>>> [...]
> >>> [...]
> >>>>> +static inline bool pte_range_cont_mapped(pte_t *pte, unsigned long nr)
> >>>>> +{
> >>>>> +     pte_t pte_val;
> >>>>> +     unsigned long pfn = pte_pfn(pte);
> >>>>> +     for (int i = 0; i < nr; i++) {
> >>>>> +             pte_val = ptep_get(pte + i);
> >>>>> +             if (pte_none(pte_val) || pte_pfn(pte_val) != (pfn + i))
> >>>>> +                     return false;
> >>>>> +     }
> >>>>> +     return true;
> >>>>> +}
> >>>>
> >>>> I dislike the "cont mapped" terminology.
> >>>>
> >>>> Maybe folio_pte_batch() does what you want?
> >>>
> >>> folio_pte_batch() is a good choice. Appreciate it!
> >>
> >> Agreed, folio_pte_batch() is likely to be widely useful for this change and
> >> others, so suggest exporting it from memory.c and reusing as is if possible.
> >
> > I actually missed folio_pte_batch() in cont-pte series and re-invented
> > a function
> > to check if a large folio is entirely mapped in MADV_PAGEOUT[1]. exporting
> > folio_pte_batch() will also benefit that case. The problem space is same.
> >
> > [1] https://lore.kernel.org/linux-mm/20240118111036.72641-7-21cnbao@gmail.com/
> I am wondering whether we can delay large folio split till page reclaim phase
> for madvise cases.
>
> Like if we hit folio which is partially mapped to the range, don't split it but
> just unmap the mapping part from the range. Let page reclaim decide whether
> split the large folio or not (If it's not mapped to any other range,it will be
> freed as whole large folio. If part of it still mapped to other range,page reclaim
> can decide whether to split it or ignore it for current reclaim cycle).

Yes, we can, but we still have to play the PTE-checking game to avoid adding
folios to the reclaim list multiple times.

I don't see too much difference between splitting in madvise and splitting
in vmscan, as our real purpose is to avoid splitting entirely mapped
large folios. For partially mapped large folios, if we split in madvise,
we don't need to play the game of skipping folios while iterating PTEs.
If we don't split in madvise, we have to make sure the large folio is only
added to the reclaim list once, by checking whether the PTEs belong to the
previously added folio.
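A very rough, untested sketch of that extra bookkeeping (vma, folio_list
and the loop bounds are placeholders; folio isolation and refcount
handling are omitted), just to show the state we would have to carry if
we don't split:

        struct folio *last_folio = NULL;

        for (; addr < end; pte++, addr += PAGE_SIZE) {
                pte_t ptent = ptep_get(pte);

                if (!pte_present(ptent))
                        continue;

                folio = vm_normal_folio(vma, addr, ptent);
                if (!folio)
                        continue;

                /* skip PTEs that map the folio we already queued */
                if (folio == last_folio)
                        continue;

                list_add(&folio->lru, &folio_list);
                last_folio = folio;
        }

A single last_folio is not enough once mappings of different folios
interleave (see the examples below), so the real check ends up more
involved than this.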

>
> Splitting does work here. But it just drops all the benefits of large folio.
>
>
> Regards
> Yin, Fengwei
>
> >
> >>
> >>>
> >>> Best,
> >>> Lance
> >>>
> >>>>
> >>>> --
> >>>> Cheers,
> >>>>
> >>>> David / dhildenb
> >
Thanks
Barry
Yin Fengwei Feb. 27, 2024, 6:14 a.m. UTC | #21
On 2/27/24 10:17, Barry Song wrote:
>> Like if we hit folio which is partially mapped to the range, don't split it but
>> just unmap the mapping part from the range. Let page reclaim decide whether
>> split the large folio or not (If it's not mapped to any other range,it will be
>> freed as whole large folio. If part of it still mapped to other range,page reclaim
>> can decide whether to split it or ignore it for current reclaim cycle).
> Yes, we can. but we still have to play the ptes check game to avoid adding
> folios multiple times to reclaim the list.
> 
> I don't see too much difference between splitting in madvise and splitting
> in vmscan.  as our real purpose is avoiding splitting entirely mapped
> large folios. for partial mapped large folios, if we split in madvise, then
> we don't need to play the game of skipping folios while iterating PTEs.
> if we don't split in madvise, we have to make sure the large folio is only
> added in reclaimed list one time by checking if PTEs belong to the
> previous added folio.

If the partially mapped large folio is unmapped from the range, the related
PTEs become none. How could the folio be added to the reclaim list multiple times?


Regards
Yin, Fengwei
Barry Song Feb. 27, 2024, 6:40 a.m. UTC | #22
On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 2/27/24 10:17, Barry Song wrote:
> >> Like if we hit folio which is partially mapped to the range, don't split it but
> >> just unmap the mapping part from the range. Let page reclaim decide whether
> >> split the large folio or not (If it's not mapped to any other range,it will be
> >> freed as whole large folio. If part of it still mapped to other range,page reclaim
> >> can decide whether to split it or ignore it for current reclaim cycle).
> > Yes, we can. but we still have to play the ptes check game to avoid adding
> > folios multiple times to reclaim the list.
> >
> > I don't see too much difference between splitting in madvise and splitting
> > in vmscan.  as our real purpose is avoiding splitting entirely mapped
> > large folios. for partial mapped large folios, if we split in madvise, then
> > we don't need to play the game of skipping folios while iterating PTEs.
> > if we don't split in madvise, we have to make sure the large folio is only
> > added in reclaimed list one time by checking if PTEs belong to the
> > previous added folio.
>
> If the partial mapped large folio is unmapped from the range, the related PTE
> become none. How could the folio be added to reclaimed list multiple times?

Say we have 16 PTEs in a large folio:
PTE0 present
PTE1 present
PTE2 present
PTE3 none
PTE4 present
PTE5 none
PTE6 present
....
The current code scans the PTEs one by one.
While scanning PTE0, we have already added the folio; then come PTE1, PTE2, PTE4, PTE6...

There are all kinds of possibilities for which parts get unmapped.

So what we can do is record that we have added the folio while scanning PTE0,
then skip this folio for all the other PTEs.

Otherwise, we can split it while scanning PTE0; then we will meet different
folios afterwards.
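As a toy, userspace-runnable model of that "add each folio only once"
game (folio IDs per PTE slot, 0 meaning pte_none; purely illustrative,
not kernel code):

        #include <stdio.h>

        int main(void)
        {
                /* folio id mapped by each of 8 PTE slots, 0 = none */
                int ptes[8] = { 1, 1, 1, 0, 2, 0, 1, 3 };
                int queued[16] = { 0 };

                for (int i = 0; i < 8; i++) {
                        int folio = ptes[i];

                        if (!folio || queued[folio])
                                continue;  /* hole, or already queued */
                        queued[folio] = 1;
                        printf("queue folio %d (first seen at PTE%d)\n",
                               folio, i);
                }
                return 0;
        }

With splitting, every present PTE simply maps its own folio and no such
bookkeeping is needed.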

>
>
> Regards
> Yin, Fengwei

Thanks
Barry
Barry Song Feb. 27, 2024, 6:42 a.m. UTC | #23
On Tue, Feb 27, 2024 at 7:40 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >
> >
> >
> > On 2/27/24 10:17, Barry Song wrote:
> > >> Like if we hit folio which is partially mapped to the range, don't split it but
> > >> just unmap the mapping part from the range. Let page reclaim decide whether
> > >> split the large folio or not (If it's not mapped to any other range,it will be
> > >> freed as whole large folio. If part of it still mapped to other range,page reclaim
> > >> can decide whether to split it or ignore it for current reclaim cycle).
> > > Yes, we can. but we still have to play the ptes check game to avoid adding
> > > folios multiple times to reclaim the list.
> > >
> > > I don't see too much difference between splitting in madvise and splitting
> > > in vmscan.  as our real purpose is avoiding splitting entirely mapped
> > > large folios. for partial mapped large folios, if we split in madvise, then
> > > we don't need to play the game of skipping folios while iterating PTEs.
> > > if we don't split in madvise, we have to make sure the large folio is only
> > > added in reclaimed list one time by checking if PTEs belong to the
> > > previous added folio.
> >
> > If the partial mapped large folio is unmapped from the range, the related PTE
> > become none. How could the folio be added to reclaimed list multiple times?
>
> in case we have 16 PTEs in a large folio.
> PTE0 present
> PTE1 present
> PTE2 present
> PTE3  none
> PTE4 present
> PTE5 none
> PTE6 present
> ....
> the current code is scanning PTE one by one.
> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
>
> there are all kinds of possibilities for unmapping.
>

Not to mention that we have all kinds of possibilities like:

PTE0 present for large folio1
PTE1 present for large folio1
PTE2 present for another folio2
PTE3 present for another folio3
PTE4 present for large folio1
...

> so what we can do is recording we have added the folio while scanning PTE0,
> then skipping this folios for all other PTEs.
>
> otherwise, we can split it while scanning PTE0, then we will meet
> different folios
> afterwards.
>
> >
> >
> > Regards
> > Yin, Fengwei
>
> Thanks
> Barry
Yin Fengwei Feb. 27, 2024, 7:02 a.m. UTC | #24
On 2/27/24 14:40, Barry Song wrote:
> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 2/27/24 10:17, Barry Song wrote:
>>>> Like if we hit folio which is partially mapped to the range, don't split it but
>>>> just unmap the mapping part from the range. Let page reclaim decide whether
>>>> split the large folio or not (If it's not mapped to any other range,it will be
>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
>>>> can decide whether to split it or ignore it for current reclaim cycle).
>>> Yes, we can. but we still have to play the ptes check game to avoid adding
>>> folios multiple times to reclaim the list.
>>>
>>> I don't see too much difference between splitting in madvise and splitting
>>> in vmscan.  as our real purpose is avoiding splitting entirely mapped
>>> large folios. for partial mapped large folios, if we split in madvise, then
>>> we don't need to play the game of skipping folios while iterating PTEs.
>>> if we don't split in madvise, we have to make sure the large folio is only
>>> added in reclaimed list one time by checking if PTEs belong to the
>>> previous added folio.
>>
>> If the partial mapped large folio is unmapped from the range, the related PTE
>> become none. How could the folio be added to reclaimed list multiple times?
> 
> in case we have 16 PTEs in a large folio.
> PTE0 present
> PTE1 present
> PTE2 present
> PTE3  none
> PTE4 present
> PTE5 none
> PTE6 present
> ....
> the current code is scanning PTE one by one.
> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
No. Before we detect that the folio is fully mapped to the range, we can't
add the folio to the reclaim list, because a partially mapped folio shouldn't
be added. Only after scanning PTE15 do we know it's fully mapped.

So, when scanning PTE0, we will not add the folio. Then, when we hit PTE3, we
know this is a partially mapped large folio. We will unmap it. Then all 16
PTEs become none.

If the large folio is fully mapped, the folio will be added to the reclaim
list after scanning PTE15 and confirming it's fully mapped.

Regards
Yin, Fengwei

> 
> there are all kinds of possibilities for unmapping.
> 
> so what we can do is recording we have added the folio while scanning PTE0,
> then skipping this folios for all other PTEs.
> 
> otherwise, we can split it while scanning PTE0, then we will meet
> different folios
> afterwards.
> 
>>
>>
>> Regards
>> Yin, Fengwei
> 
> Thanks
> Barry
Barry Song Feb. 27, 2024, 7:11 a.m. UTC | #25
On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 2/27/24 14:40, Barry Song wrote:
> > On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >>
> >>
> >> On 2/27/24 10:17, Barry Song wrote:
> >>>> Like if we hit folio which is partially mapped to the range, don't split it but
> >>>> just unmap the mapping part from the range. Let page reclaim decide whether
> >>>> split the large folio or not (If it's not mapped to any other range,it will be
> >>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
> >>>> can decide whether to split it or ignore it for current reclaim cycle).
> >>> Yes, we can. but we still have to play the ptes check game to avoid adding
> >>> folios multiple times to reclaim the list.
> >>>
> >>> I don't see too much difference between splitting in madvise and splitting
> >>> in vmscan.  as our real purpose is avoiding splitting entirely mapped
> >>> large folios. for partial mapped large folios, if we split in madvise, then
> >>> we don't need to play the game of skipping folios while iterating PTEs.
> >>> if we don't split in madvise, we have to make sure the large folio is only
> >>> added in reclaimed list one time by checking if PTEs belong to the
> >>> previous added folio.
> >>
> >> If the partial mapped large folio is unmapped from the range, the related PTE
> >> become none. How could the folio be added to reclaimed list multiple times?
> >
> > in case we have 16 PTEs in a large folio.
> > PTE0 present
> > PTE1 present
> > PTE2 present
> > PTE3  none
> > PTE4 present
> > PTE5 none
> > PTE6 present
> > ....
> > the current code is scanning PTE one by one.
> > while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
> No. Before detect the folio is fully mapped to the range, we can't add folio
> to reclaim list because the partial mapped folio shouldn't be added. We can
> only scan PTE15 and know it's fully mapped.

You never know that PTE15 is the last one mapping the large folio; PTE15 can
be mapping a completely different folio than PTE0.

>
> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
> become none.

I don't understand why all 16 PTEs would become none, since we don't set
PTEs to none here; we only set the PTEs to swap entries when
try_to_unmap_one() is called by vmscan.

>
> If the large folio is fully mapped, the folio will be added to reclaim list
> after scan PTE15 and know it's fully mapped.

Our approach is to call folio_pte_batch() when we hit the first PTE; if it
returns 16, we add this folio to the reclaim_list and skip the remaining
15 PTEs.
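Roughly like the below (untested sketch; it reuses the folio_pte_batch()
signature and FPB_* flags from the export patch quoted above, folio_list
stands in for the local reclaim list, and isolation/refcount handling is
omitted):

        if (folio_test_large(folio)) {
                int max_nr = (end - addr) / PAGE_SIZE;
                int nr = folio_pte_batch(folio, addr, pte, ptent, max_nr,
                                         FPB_IGNORE_DIRTY |
                                         FPB_IGNORE_SOFT_DIRTY, NULL);

                if (nr == folio_nr_pages(folio)) {
                        /* entirely mapped: queue the whole folio once */
                        list_add(&folio->lru, &folio_list);
                        pte += nr - 1;
                        addr += (nr - 1) * PAGE_SIZE;
                        continue;
                }
                /* partially mapped: split or skip, as discussed */
        }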

>
> Regards
> Yin, Fengwei
>
> >
> > there are all kinds of possibilities for unmapping.
> >
> > so what we can do is recording we have added the folio while scanning PTE0,
> > then skipping this folios for all other PTEs.
> >
> > otherwise, we can split it while scanning PTE0, then we will meet
> > different folios
> > afterwards.
> >
> >>
> >>
> >> Regards
> >> Yin, Fengwei
> >

Thanks
Barry
Barry Song Feb. 27, 2024, 7:21 a.m. UTC | #26
On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >
> >
> >
> > On 2/27/24 14:40, Barry Song wrote:
> > > On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> > >>
> > >>
> > >>
> > >> On 2/27/24 10:17, Barry Song wrote:
> > >>>> Like if we hit folio which is partially mapped to the range, don't split it but
> > >>>> just unmap the mapping part from the range. Let page reclaim decide whether
> > >>>> split the large folio or not (If it's not mapped to any other range,it will be
> > >>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
> > >>>> can decide whether to split it or ignore it for current reclaim cycle).
> > >>> Yes, we can. but we still have to play the ptes check game to avoid adding
> > >>> folios multiple times to reclaim the list.
> > >>>
> > >>> I don't see too much difference between splitting in madvise and splitting
> > >>> in vmscan.  as our real purpose is avoiding splitting entirely mapped
> > >>> large folios. for partial mapped large folios, if we split in madvise, then
> > >>> we don't need to play the game of skipping folios while iterating PTEs.
> > >>> if we don't split in madvise, we have to make sure the large folio is only
> > >>> added in reclaimed list one time by checking if PTEs belong to the
> > >>> previous added folio.
> > >>
> > >> If the partial mapped large folio is unmapped from the range, the related PTE
> > >> become none. How could the folio be added to reclaimed list multiple times?
> > >
> > > in case we have 16 PTEs in a large folio.
> > > PTE0 present
> > > PTE1 present
> > > PTE2 present
> > > PTE3  none
> > > PTE4 present
> > > PTE5 none
> > > PTE6 present
> > > ....
> > > the current code is scanning PTE one by one.
> > > while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
> > No. Before detect the folio is fully mapped to the range, we can't add folio
> > to reclaim list because the partial mapped folio shouldn't be added. We can
> > only scan PTE15 and know it's fully mapped.
>
> you never know PTE15 is the last one mapping to the large folio, PTE15 can
> be mapping to a completely different folio with PTE0.
>
> >
> > So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
> > this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
> > become none.
>
> I don't understand why all 16PTEs become none as we set PTEs to none.
> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
>
> >
> > If the large folio is fully mapped, the folio will be added to reclaim list
> > after scan PTE15 and know it's fully mapped.
>
> our approach is calling pte_batch_pte while meeting the first pte, if
> pte_batch_pte = 16,
> then we add this folio to reclaim_list and skip the left 15 PTEs.

Let's compare two different implementations for a partially mapped large
folio with 8 PTEs as below:

PTE0 present for large folio1
PTE1 present for large folio1
PTE2 present for another folio2
PTE3 present for another folio3
PTE4 present for large folio1
PTE5 present for large folio1
PTE6 present for another folio4
PTE7 present for another folio5

If we don't split in madvise (depending on vmscan to split after adding
folio1), we will have to make sure folio1, folio2, folio3, folio4 and folio5
are all added to the reclaim_list by playing a complex game while scanning
these 8 PTEs.

If we split in madvise, they become:

PTE0 present for large folioA - split from folio1
PTE1 present for large folioB - split from folio1
PTE2 present for another folio2
PTE3 present for another folio3
PTE4 present for large folioC - split from folio1
PTE5 present for large folioD - split from folio1
PTE6 present for another folio4
PTE7 present for another folio5

We simply add the above 8 folios to the reclaim_list one by one.

I would vote for splitting partially mapped large folios in madvise.

Thanks
Barry
Yin Fengwei Feb. 27, 2024, 7:42 a.m. UTC | #27
On 2/27/24 15:21, Barry Song wrote:
> On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@gmail.com> wrote:
>>
>> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>
>>>
>>>
>>> On 2/27/24 14:40, Barry Song wrote:
>>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2/27/24 10:17, Barry Song wrote:
>>>>>>> Like if we hit folio which is partially mapped to the range, don't split it but
>>>>>>> just unmap the mapping part from the range. Let page reclaim decide whether
>>>>>>> split the large folio or not (If it's not mapped to any other range,it will be
>>>>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
>>>>>>> can decide whether to split it or ignore it for current reclaim cycle).
>>>>>> Yes, we can. but we still have to play the ptes check game to avoid adding
>>>>>> folios multiple times to reclaim the list.
>>>>>>
>>>>>> I don't see too much difference between splitting in madvise and splitting
>>>>>> in vmscan.  as our real purpose is avoiding splitting entirely mapped
>>>>>> large folios. for partial mapped large folios, if we split in madvise, then
>>>>>> we don't need to play the game of skipping folios while iterating PTEs.
>>>>>> if we don't split in madvise, we have to make sure the large folio is only
>>>>>> added in reclaimed list one time by checking if PTEs belong to the
>>>>>> previous added folio.
>>>>>
>>>>> If the partial mapped large folio is unmapped from the range, the related PTE
>>>>> become none. How could the folio be added to reclaimed list multiple times?
>>>>
>>>> in case we have 16 PTEs in a large folio.
>>>> PTE0 present
>>>> PTE1 present
>>>> PTE2 present
>>>> PTE3  none
>>>> PTE4 present
>>>> PTE5 none
>>>> PTE6 present
>>>> ....
>>>> the current code is scanning PTE one by one.
>>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
>>> No. Before detect the folio is fully mapped to the range, we can't add folio
>>> to reclaim list because the partial mapped folio shouldn't be added. We can
>>> only scan PTE15 and know it's fully mapped.
>>
>> you never know PTE15 is the last one mapping to the large folio, PTE15 can
>> be mapping to a completely different folio with PTE0.
>>
>>>
>>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
>>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
>>> become none.
>>
>> I don't understand why all 16PTEs become none as we set PTEs to none.
>> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
>>
>>>
>>> If the large folio is fully mapped, the folio will be added to reclaim list
>>> after scan PTE15 and know it's fully mapped.
>>
>> our approach is calling pte_batch_pte while meeting the first pte, if
>> pte_batch_pte = 16,
>> then we add this folio to reclaim_list and skip the left 15 PTEs.
> 
> Let's compare two different implementation, for partial mapped large folio
> with 8 PTEs as below,
> 
> PTE0 present for large folio1
> PTE1 present for large folio1
> PTE2 present for another folio2
> PTE3 present for another folio3
> PTE4 present for large folio1
> PTE5 present for large folio1
> PTE6 present for another folio4
> PTE7 present for another folio5
> 
> If we don't split in madvise(depend on vmscan to split after adding
> folio1), we will have
Let me clarify something here:

I prefer that we don't split the large folio here. Instead, we unmap the
large folio from this VMA range (I think you missed the unmap operation
I mentioned).

The intention is to try our best to avoid splitting the large folio. If
the folio is only partially mapped to this VMA range, it's likely it
will be reclaimed as a whole large folio, which reduces LRU and zone lock
contention compared to splitting the large folio.

The things I am not sure about are that unmapping from a specific VMA range
is not available today and whether it's worth adding.

> to make sure folio1, folio2, folio3, folio4, folio5 are added to
> reclaim_list by doing a complex
> game while scanning these 8 PTEs.
> 
> if we split in madvise, they become:
> 
> PTE0 present for large folioA  - splitted from folio 1
> PTE1 present for large folioB - splitted from folio 1
> PTE2 present for another folio2
> PTE3 present for another folio3
> PTE4 present for large folioC - splitted from folio 1
> PTE5 present for large folioD - splitted from folio 1
> PTE6 present for another folio4
> PTE7 present for another folio5
> 
> we simply add the above 8 folios into reclaim_list one by one.
> 
> I would vote for splitting for partial mapped large folio in madvise.
> 
> Thanks
> Barry
Barry Song Feb. 27, 2024, 7:54 a.m. UTC | #28
On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 2/27/24 15:21, Barry Song wrote:
> > On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@gmail.com> wrote:
> >>
> >> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>>
> >>>
> >>>
> >>> On 2/27/24 14:40, Barry Song wrote:
> >>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/27/24 10:17, Barry Song wrote:
> >>>>>>> Like if we hit folio which is partially mapped to the range, don't split it but
> >>>>>>> just unmap the mapping part from the range. Let page reclaim decide whether
> >>>>>>> split the large folio or not (If it's not mapped to any other range,it will be
> >>>>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
> >>>>>>> can decide whether to split it or ignore it for current reclaim cycle).
> >>>>>> Yes, we can. but we still have to play the ptes check game to avoid adding
> >>>>>> folios multiple times to reclaim the list.
> >>>>>>
> >>>>>> I don't see too much difference between splitting in madvise and splitting
> >>>>>> in vmscan.  as our real purpose is avoiding splitting entirely mapped
> >>>>>> large folios. for partial mapped large folios, if we split in madvise, then
> >>>>>> we don't need to play the game of skipping folios while iterating PTEs.
> >>>>>> if we don't split in madvise, we have to make sure the large folio is only
> >>>>>> added in reclaimed list one time by checking if PTEs belong to the
> >>>>>> previous added folio.
> >>>>>
> >>>>> If the partial mapped large folio is unmapped from the range, the related PTE
> >>>>> become none. How could the folio be added to reclaimed list multiple times?
> >>>>
> >>>> in case we have 16 PTEs in a large folio.
> >>>> PTE0 present
> >>>> PTE1 present
> >>>> PTE2 present
> >>>> PTE3  none
> >>>> PTE4 present
> >>>> PTE5 none
> >>>> PTE6 present
> >>>> ....
> >>>> the current code is scanning PTE one by one.
> >>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
> >>> No. Before detect the folio is fully mapped to the range, we can't add folio
> >>> to reclaim list because the partial mapped folio shouldn't be added. We can
> >>> only scan PTE15 and know it's fully mapped.
> >>
> >> you never know PTE15 is the last one mapping to the large folio, PTE15 can
> >> be mapping to a completely different folio with PTE0.
> >>
> >>>
> >>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
> >>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
> >>> become none.
> >>
> >> I don't understand why all 16PTEs become none as we set PTEs to none.
> >> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
> >>
> >>>
> >>> If the large folio is fully mapped, the folio will be added to reclaim list
> >>> after scan PTE15 and know it's fully mapped.
> >>
> >> our approach is calling pte_batch_pte while meeting the first pte, if
> >> pte_batch_pte = 16,
> >> then we add this folio to reclaim_list and skip the left 15 PTEs.
> >
> > Let's compare two different implementation, for partial mapped large folio
> > with 8 PTEs as below,
> >
> > PTE0 present for large folio1
> > PTE1 present for large folio1
> > PTE2 present for another folio2
> > PTE3 present for another folio3
> > PTE4 present for large folio1
> > PTE5 present for large folio1
> > PTE6 present for another folio4
> > PTE7 present for another folio5
> >
> > If we don't split in madvise(depend on vmscan to split after adding
> > folio1), we will have
> Let me clarify something here:
>
> I prefer that we don't split large folio here. Instead, we unmap the
> large folio from this VMA range (I think you missed the unmap operation
> I mentioned).

I don't understand why we would unmap, as this is MADV_PAGEOUT, not
an unmap; unmapping totally changes the semantics. Would you like
to show pseudo code?

For MADV_PAGEOUT on swap-out, the last step is writing swap entries
to replace the PTEs which are present. I don't understand how an unmap
can be involved in this process.

>
> The intention is trying best to avoid splitting the large folio. If
> the folio is only partially mapped to this VMA range, it's likely it
> will be reclaimed as whole large folio. Which brings benefit for lru
> and zone lock contention comparing to splitting large folio.

Which also brings negative side effects such as redundant I/O.
For example, if you have only one subpage left mapped in a large folio,
pageout will still write nr_pages subpages to swap and then immediately
free them in swap.
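(For example, with a 16-page (64KiB) folio that has only one 4KiB subpage
still wanted, that is 16 x 4KiB = 64KiB of swap I/O for 4KiB of useful data.)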

>
> The thing I am not sure is unmapping from specific VMA range is not
> available and whether it's worthy to add it.

I think it might be possible to write some complex code to add folio1,
folio2, folio3, folio4 and folio5 in the above example to the reclaim_list
while avoiding splitting folio1, but I really don't understand how unmap
would work.

>
> > to make sure folio1, folio2, folio3, folio4, folio5 are added to
> > reclaim_list by doing a complex
> > game while scanning these 8 PTEs.
> >
> > if we split in madvise, they become:
> >
> > PTE0 present for large folioA  - splitted from folio 1
> > PTE1 present for large folioB - splitted from folio 1
> > PTE2 present for another folio2
> > PTE3 present for another folio3
> > PTE4 present for large folioC - splitted from folio 1
> > PTE5 present for large folioD - splitted from folio 1
> > PTE6 present for another folio4
> > PTE7 present for another folio5
> >
> > we simply add the above 8 folios into reclaim_list one by one.
> >
> > I would vote for splitting for partial mapped large folio in madvise.
> >

Thanks
Barry
Yin Fengwei Feb. 27, 2024, 8:33 a.m. UTC | #29
On 2/27/24 15:54, Barry Song wrote:
> On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>
>>
>>
>> On 2/27/24 15:21, Barry Song wrote:
>>> On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@gmail.com> wrote:
>>>>
>>>> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>>
>>>>>
>>>>>
>>>>> On 2/27/24 14:40, Barry Song wrote:
>>>>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 2/27/24 10:17, Barry Song wrote:
>>>>>>>>> Like if we hit folio which is partially mapped to the range, don't split it but
>>>>>>>>> just unmap the mapping part from the range. Let page reclaim decide whether
>>>>>>>>> split the large folio or not (If it's not mapped to any other range,it will be
>>>>>>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
>>>>>>>>> can decide whether to split it or ignore it for current reclaim cycle).
>>>>>>>> Yes, we can. but we still have to play the ptes check game to avoid adding
>>>>>>>> folios multiple times to reclaim the list.
>>>>>>>>
>>>>>>>> I don't see too much difference between splitting in madvise and splitting
>>>>>>>> in vmscan.  as our real purpose is avoiding splitting entirely mapped
>>>>>>>> large folios. for partial mapped large folios, if we split in madvise, then
>>>>>>>> we don't need to play the game of skipping folios while iterating PTEs.
>>>>>>>> if we don't split in madvise, we have to make sure the large folio is only
>>>>>>>> added in reclaimed list one time by checking if PTEs belong to the
>>>>>>>> previous added folio.
>>>>>>>
>>>>>>> If the partial mapped large folio is unmapped from the range, the related PTE
>>>>>>> become none. How could the folio be added to reclaimed list multiple times?
>>>>>>
>>>>>> in case we have 16 PTEs in a large folio.
>>>>>> PTE0 present
>>>>>> PTE1 present
>>>>>> PTE2 present
>>>>>> PTE3  none
>>>>>> PTE4 present
>>>>>> PTE5 none
>>>>>> PTE6 present
>>>>>> ....
>>>>>> the current code is scanning PTE one by one.
>>>>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
>>>>> No. Before detect the folio is fully mapped to the range, we can't add folio
>>>>> to reclaim list because the partial mapped folio shouldn't be added. We can
>>>>> only scan PTE15 and know it's fully mapped.
>>>>
>>>> you never know PTE15 is the last one mapping to the large folio, PTE15 can
>>>> be mapping to a completely different folio with PTE0.
>>>>
>>>>>
>>>>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
>>>>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
>>>>> become none.
>>>>
>>>> I don't understand why all 16PTEs become none as we set PTEs to none.
>>>> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
>>>>
>>>>>
>>>>> If the large folio is fully mapped, the folio will be added to reclaim list
>>>>> after scan PTE15 and know it's fully mapped.
>>>>
>>>> our approach is calling pte_batch_pte while meeting the first pte, if
>>>> pte_batch_pte = 16,
>>>> then we add this folio to reclaim_list and skip the left 15 PTEs.
>>>
>>> Let's compare two different implementation, for partial mapped large folio
>>> with 8 PTEs as below,
>>>
>>> PTE0 present for large folio1
>>> PTE1 present for large folio1
>>> PTE2 present for another folio2
>>> PTE3 present for another folio3
>>> PTE4 present for large folio1
>>> PTE5 present for large folio1
>>> PTE6 present for another folio4
>>> PTE7 present for another folio5
>>>
>>> If we don't split in madvise(depend on vmscan to split after adding
>>> folio1), we will have
>> Let me clarify something here:
>>
>> I prefer that we don't split large folio here. Instead, we unmap the
>> large folio from this VMA range (I think you missed the unmap operation
>> I mentioned).
> 
> I don't understand why we unmap as this is a MADV_PAGEOUT not
> an unmap. unmapping totally changes the semantics. Would you like
> to show pseudo code?
Oh. Yes. MADV_PAGEOUT is not suitable.

What about MADV_FREE?

> 
> for MADV_PAGEOUT on swap-out, the last step is writing swap entries
> to replace PTEs which are present. I don't understand how an unmap
> can be involved in this process.
> 
>>
>> The intention is trying best to avoid splitting the large folio. If
>> the folio is only partially mapped to this VMA range, it's likely it
>> will be reclaimed as whole large folio. Which brings benefit for lru
>> and zone lock contention comparing to splitting large folio.
> 
> which also brings negative side effects such as redundant I/O.
> For example, if you have only one subpage left in a large folio,
> pageout will still write nr_pages subpages into swap, then immediately
> free them in swap.
> 
>>
>> The thing I am not sure is unmapping from specific VMA range is not
>> available and whether it's worthy to add it.
> 
> I think we might have the possibility to have some complex code to
> add folio1, folio2, folio3, folio4 and folio5 in the above example into
> reclaim_list while avoiding splitting folio1. but i really don't understand
> how unmap will work.
> 
>>
>>> to make sure folio1, folio2, folio3, folio4, folio5 are added to
>>> reclaim_list by doing a complex
>>> game while scanning these 8 PTEs.
>>>
>>> if we split in madvise, they become:
>>>
>>> PTE0 present for large folioA  - splitted from folio 1
>>> PTE1 present for large folioB - splitted from folio 1
>>> PTE2 present for another folio2
>>> PTE3 present for another folio3
>>> PTE4 present for large folioC - splitted from folio 1
>>> PTE5 present for large folioD - splitted from folio 1
>>> PTE6 present for another folio4
>>> PTE7 present for another folio5
>>>
>>> we simply add the above 8 folios into reclaim_list one by one.
>>>
>>> I would vote for splitting for partial mapped large folio in madvise.
>>>
> 
> Thanks
> Barry
Barry Song Feb. 27, 2024, 9:01 a.m. UTC | #30
On Tue, Feb 27, 2024 at 9:33 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
>
>
>
> On 2/27/24 15:54, Barry Song wrote:
> > On Tue, Feb 27, 2024 at 8:42 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>
> >>
> >>
> >> On 2/27/24 15:21, Barry Song wrote:
> >>> On Tue, Feb 27, 2024 at 8:11 PM Barry Song <21cnbao@gmail.com> wrote:
> >>>>
> >>>> On Tue, Feb 27, 2024 at 8:02 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>>>>
> >>>>>
> >>>>>
> >>>>> On 2/27/24 14:40, Barry Song wrote:
> >>>>>> On Tue, Feb 27, 2024 at 7:14 PM Yin Fengwei <fengwei.yin@intel.com> wrote:
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On 2/27/24 10:17, Barry Song wrote:
> >>>>>>>>> Like if we hit folio which is partially mapped to the range, don't split it but
> >>>>>>>>> just unmap the mapping part from the range. Let page reclaim decide whether
> >>>>>>>>> split the large folio or not (If it's not mapped to any other range,it will be
> >>>>>>>>> freed as whole large folio. If part of it still mapped to other range,page reclaim
> >>>>>>>>> can decide whether to split it or ignore it for current reclaim cycle).
> >>>>>>>> Yes, we can. but we still have to play the ptes check game to avoid adding
> >>>>>>>> folios multiple times to reclaim the list.
> >>>>>>>>
> >>>>>>>> I don't see too much difference between splitting in madvise and splitting
> >>>>>>>> in vmscan.  as our real purpose is avoiding splitting entirely mapped
> >>>>>>>> large folios. for partial mapped large folios, if we split in madvise, then
> >>>>>>>> we don't need to play the game of skipping folios while iterating PTEs.
> >>>>>>>> if we don't split in madvise, we have to make sure the large folio is only
> >>>>>>>> added in reclaimed list one time by checking if PTEs belong to the
> >>>>>>>> previous added folio.
> >>>>>>>
> >>>>>>> If the partial mapped large folio is unmapped from the range, the related PTE
> >>>>>>> become none. How could the folio be added to reclaimed list multiple times?
> >>>>>>
> >>>>>> in case we have 16 PTEs in a large folio.
> >>>>>> PTE0 present
> >>>>>> PTE1 present
> >>>>>> PTE2 present
> >>>>>> PTE3  none
> >>>>>> PTE4 present
> >>>>>> PTE5 none
> >>>>>> PTE6 present
> >>>>>> ....
> >>>>>> the current code is scanning PTE one by one.
> >>>>>> while scanning PTE0, we have added the folio. then PTE1, PTE2, PTE4, PTE6...
> >>>>> No. Before detect the folio is fully mapped to the range, we can't add folio
> >>>>> to reclaim list because the partial mapped folio shouldn't be added. We can
> >>>>> only scan PTE15 and know it's fully mapped.
> >>>>
> >>>> you never know PTE15 is the last one mapping to the large folio, PTE15 can
> >>>> be mapping to a completely different folio with PTE0.
> >>>>
> >>>>>
> >>>>> So, when scanning PTE0, we will not add folio. Then when hit PTE3, we know
> >>>>> this is a partial mapped large folio. We will unmap it. Then all 16 PTEs
> >>>>> become none.
> >>>>
> >>>> I don't understand why all 16PTEs become none as we set PTEs to none.
> >>>> we set PTEs to swap entries till try_to_unmap_one called by vmscan.
> >>>>
> >>>>>
> >>>>> If the large folio is fully mapped, the folio will be added to reclaim list
> >>>>> after scan PTE15 and know it's fully mapped.
> >>>>
> >>>> our approach is calling pte_batch_pte while meeting the first pte, if
> >>>> pte_batch_pte = 16,
> >>>> then we add this folio to reclaim_list and skip the left 15 PTEs.
> >>>
> >>> Let's compare two different implementation, for partial mapped large folio
> >>> with 8 PTEs as below,
> >>>
> >>> PTE0 present for large folio1
> >>> PTE1 present for large folio1
> >>> PTE2 present for another folio2
> >>> PTE3 present for another folio3
> >>> PTE4 present for large folio1
> >>> PTE5 present for large folio1
> >>> PTE6 present for another folio4
> >>> PTE7 present for another folio5
> >>>
> >>> If we don't split in madvise(depend on vmscan to split after adding
> >>> folio1), we will have
> >> Let me clarify something here:
> >>
> >> I prefer that we don't split large folio here. Instead, we unmap the
> >> large folio from this VMA range (I think you missed the unmap operation
> >> I mentioned).
> >
> > I don't understand why we unmap as this is a MADV_PAGEOUT not
> > an unmap. unmapping totally changes the semantics. Would you like
> > to show pseudo code?
> Oh. Yes. MADV_PAGEOUT is not suitable.
>
> What about MADV_FREE?

We can't unmap either, as MADV_FREE applies to anon VMAs. When a
folio is marked lazyfree, we move the anon folio to the file LRU. If somebody
writes the folio afterwards, we take the folio back; if nobody writes it
before vmscan gets to it in the file LRU, we can reclaim it by setting the
PTEs to none. We can't immediately unmap a large folio at the time
MADV_FREE is called; immediate unmap is the behavior of MADV_DONTNEED,
not MADV_FREE.
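A tiny userspace illustration of those semantics (needs a kernel with
MADV_FREE, i.e. v4.5+; the mapping size here is arbitrary):

        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t len = 64 * 1024;
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                if (p == MAP_FAILED)
                        return 1;

                memset(p, 0xaa, len);          /* populate anon pages */
                madvise(p, len, MADV_FREE);    /* lazyfree, still mapped */

                /*
                 * Writing after MADV_FREE takes the page back: it is
                 * dirty again and won't be discarded.  Pages left clean
                 * may be dropped by reclaim and read back as zeroes.
                 */
                p[0] = 0x55;

                printf("0x%02x\n", (unsigned char)p[0]);
                munmap(p, len);
                return 0;
        }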

>
> >
> > for MADV_PAGEOUT on swap-out, the last step is writing swap entries
> > to replace PTEs which are present. I don't understand how an unmap
> > can be involved in this process.
> >
> >>
> >> The intention is trying best to avoid splitting the large folio. If
> >> the folio is only partially mapped to this VMA range, it's likely it
> >> will be reclaimed as whole large folio. Which brings benefit for lru
> >> and zone lock contention comparing to splitting large folio.
> >
> > which also brings negative side effects such as redundant I/O.
> > For example, if you have only one subpage left in a large folio,
> > pageout will still write nr_pages subpages into swap, then immediately
> > free them in swap.
> >
> >>
> >> The thing I am not sure is unmapping from specific VMA range is not
> >> available and whether it's worthy to add it.
> >
> > I think we might have the possibility to have some complex code to
> > add folio1, folio2, folio3, folio4 and folio5 in the above example into
> > reclaim_list while avoiding splitting folio1. but i really don't understand
> > how unmap will work.
> >
> >>
> >>> to make sure folio1, folio2, folio3, folio4, folio5 are added to
> >>> reclaim_list by doing a complex
> >>> game while scanning these 8 PTEs.
> >>>
> >>> if we split in madvise, they become:
> >>>
> >>> PTE0 present for large folioA  - splitted from folio 1
> >>> PTE1 present for large folioB - splitted from folio 1
> >>> PTE2 present for another folio2
> >>> PTE3 present for another folio3
> >>> PTE4 present for large folioC - splitted from folio 1
> >>> PTE5 present for large folioD - splitted from folio 1
> >>> PTE6 present for another folio4
> >>> PTE7 present for another folio5
> >>>
> >>> we simply add the above 8 folios into reclaim_list one by one.
> >>>
> >>> I would vote for splitting for partial mapped large folio in madvise.
> >>>
> >

Thanks
Barry
diff mbox series

Patch

diff --git a/mm/madvise.c b/mm/madvise.c
index cfa5e7288261..bcbf56595a2e 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -676,11 +676,43 @@  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 		 */
 		if (folio_test_large(folio)) {
 			int err;
+			unsigned long next_addr, align;
 
-			if (folio_estimated_sharers(folio) != 1)
-				break;
-			if (!folio_trylock(folio))
-				break;
+			if (folio_estimated_sharers(folio) != 1 ||
+			    !folio_trylock(folio))
+				goto skip_large_folio;
+
+			align = folio_nr_pages(folio) * PAGE_SIZE;
+			next_addr = ALIGN_DOWN(addr + align, align);
+
+			/*
+			 * If we mark only the subpages as lazyfree,
+			 * split the large folio.
+			 */
+			if (next_addr > end || next_addr - addr != align)
+				goto split_large_folio;
+
+			/*
+			 * Avoid unnecessary folio splitting if the large
+			 * folio is entirely within the given range.
+			 */
+			folio_test_clear_dirty(folio);
+			folio_unlock(folio);
+			for (; addr != next_addr; pte++, addr += PAGE_SIZE) {
+				ptent = ptep_get(pte);
+				if (pte_young(ptent) || pte_dirty(ptent)) {
+					ptent = ptep_get_and_clear_full(
+						mm, addr, pte, tlb->fullmm);
+					ptent = pte_mkold(ptent);
+					ptent = pte_mkclean(ptent);
+					set_pte_at(mm, addr, pte, ptent);
+					tlb_remove_tlb_entry(tlb, pte, addr);
+				}
+			}
+			folio_mark_lazyfree(folio);
+			goto next_folio;
+
+split_large_folio:
 			folio_get(folio);
 			arch_leave_lazy_mmu_mode();
 			pte_unmap_unlock(start_pte, ptl);
@@ -688,13 +720,28 @@  static int madvise_free_pte_range(pmd_t *pmd, unsigned long addr,
 			err = split_folio(folio);
 			folio_unlock(folio);
 			folio_put(folio);
-			if (err)
-				break;
-			start_pte = pte =
-				pte_offset_map_lock(mm, pmd, addr, &ptl);
-			if (!start_pte)
-				break;
-			arch_enter_lazy_mmu_mode();
+
+			/*
+			 * If the large folio is locked before madvise(MADV_FREE)
+			 * or cannot be split, we just skip it.
+			 */
+			if (err) {
+skip_large_folio:
+				if (next_addr >= end)
+					break;
+				pte += (next_addr - addr) / PAGE_SIZE;
+				addr = next_addr;
+			}
+
+			if (!start_pte) {
+				start_pte = pte = pte_offset_map_lock(
+					mm, pmd, addr, &ptl);
+				if (!start_pte)
+					break;
+				arch_enter_lazy_mmu_mode();
+			}
+
+next_folio:
 			pte--;
 			addr -= PAGE_SIZE;
 			continue;