
[RFC] mm: entirely reuse the whole anon mTHP in do_wp_page

Message ID 20240831092339.66085-1-21cnbao@gmail.com (mailing list archive)
State New
Series [RFC] mm: entirely reuse the whole anon mTHP in do_wp_page

Commit Message

Barry Song Aug. 31, 2024, 9:23 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

On a physical phone, it's sometimes observed that deferred_split
mTHPs account for over 15% of the total mTHPs. Profiling by Chuanhua
indicates that the majority of these originate from the typical fork
scenario.
When the child process either execs or exits, the parent process should
ideally be able to reuse the entire mTHP. However, the current kernel
lacks this capability and instead places the mTHP into split_deferred,
performing a CoW (Copy-on-Write) on just a single subpage of the mTHP.

 main()
 {
 #define SIZE (1024 * 1024UL)
 	void *p = malloc(SIZE);

 	memset(p, 0x11, SIZE);
 	if (fork() == 0)
 		exec(....);
 	/*
 	 * this write triggers CoW of one subpage of the
 	 * mTHP and puts the mTHP on the split_deferred
 	 * list
 	 */
 	*(int *)(p + 10) = 10;
 	printf("done\n");
 	while (1)
 		;
 }

This leads to two significant issues:

* Memory Waste: Before the mTHP is fully split by the shrinker,
it wastes memory. In extreme cases, such as with a 64KB mTHP,
memory usage can reach 64KB + 60KB (the still-intact 64KB folio
plus fifteen CoW'd 4KB subpages) until the last subpage is
written, at which point the mTHP is freed.

* Fragmentation and Performance Loss: It destroys large folios
(negating the performance benefits of CONT-PTE) and fragments memory.

To address this, we should aim to reuse the entire mTHP in such cases.

Hi David,

I’ve renamed wp_page_reuse() to wp_folio_reuse() and added an
entirely_reuse argument because I’m not sure if there are still cases
where we reuse a subpage within an mTHP. For now, I’m setting
entirely_reuse to true only for the newly supported case, while all
other cases still get false. Please let me know if this is incorrect—if
we don’t reuse subpages at all, we could remove the argument.

Hi Ryan,

Ideally, I’d like to see ptep_set_access_flags_nr() support setting
write-permission for the entire mTHP. Since we don’t currently have
this capability, I’m doing it in a rather inefficient way—setting
permissions one by one, which involves redundant unfolding and
folding of CONTPTE. I wonder if we could collaborate on providing
a batched ptep_set_access_flags_nr().
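
For illustration, a generic fallback for such a batched helper could look
roughly like the untested sketch below. ptep_set_access_flags_nr() is only
the name proposed above and does not exist upstream; the whole point of an
arm64 implementation would be to unfold/fold CONTPTE once per batch rather
than per PTE as this naive loop does:

static inline int ptep_set_access_flags_nr(struct vm_area_struct *vma,
		unsigned long address, pte_t *ptep, unsigned int nr,
		int dirty)
{
	int changed = 0;

	/* Naive generic version: update young/dirty/write one PTE at a time. */
	for (; nr--; ptep++, address += PAGE_SIZE) {
		pte_t entry = ptep_get(ptep);

		entry = pte_mkyoung(entry);
		if (dirty)
			entry = pte_mkdirty(entry);
		entry = maybe_mkwrite(entry, vma);
		changed |= ptep_set_access_flags(vma, address, ptep, entry, dirty);
	}

	return changed;
}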

Cc: Chuanhua Han <hanchuanhua@oppo.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Zi Yan <ziy@nvidia.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Chris Li <chrisl@kernel.org>
Cc: Kairui Song <kasong@tencent.com>
Cc: Kalesh Singh <kaleshsingh@google.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/memory.c | 91 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 66 insertions(+), 25 deletions(-)

Comments

David Hildenbrand Aug. 31, 2024, 9:44 a.m. UTC | #1
On 31.08.24 11:23, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> On a physical phone, it's sometimes observed that deferred_split
> mTHPs account for over 15% of the total mTHPs. Profiling by Chuanhua
> indicates that the majority of these originate from the typical fork
> scenario.
> When the child process either execs or exits, the parent process should
> ideally be able to reuse the entire mTHP. However, the current kernel
> lacks this capability and instead places the mTHP into split_deferred,
> performing a CoW (Copy-on-Write) on just a single subpage of the mTHP.
> 
>   main()
>   {
>   #define SIZE 1024 * 1024UL
>           void *p = malloc(SIZE);
>           memset(p, 0x11, SIZE);
>           if (fork() == 0)
>                   exec(....);
>          /*
>   	 * this will trigger cow one subpage from
>   	 * mTHP and put mTHP into split_deferred
>   	 * list
>   	 */
>   	*(int *)(p + 10) = 10;
>   	printf("done\n");
>   	while(1);
>   }
> 
> This leads to two significant issues:
> 
> * Memory Waste: Before the mTHP is fully split by the shrinker,
> it wastes memory. In extreme cases, such as with a 64KB mTHP,
> the memory usage could be 64KB + 60KB until the last subpage
> is written, at which point the mTHP is freed.
> 
> * Fragmentation and Performance Loss: It destroys large folios
> (negating the performance benefits of CONT-PTE) and fragments memory.
> 
> To address this, we should aim to reuse the entire mTHP in such cases.
> 
> Hi David,
> 
> I’ve renamed wp_page_reuse() to wp_folio_reuse() and added an
> entirely_reuse argument because I’m not sure if there are still cases
> where we reuse a subpage within an mTHP. For now, I’m setting
> entirely_reuse to true only for the newly supported case, while all
> other cases still get false. Please let me know if this is incorrect—if
> we don’t reuse subpages at all, we could remove the argument.

See [1], which I sent out this week; it is able to reuse even without
scanning page tables. If we find that the folio is exclusive, we could try
processing the surrounding PTEs that map the same folio.

[1] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com
Barry Song Aug. 31, 2024, 9:55 a.m. UTC | #2
On Sat, Aug 31, 2024 at 9:44 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 31.08.24 11:23, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > On a physical phone, it's sometimes observed that deferred_split
> > mTHPs account for over 15% of the total mTHPs. Profiling by Chuanhua
> > indicates that the majority of these originate from the typical fork
> > scenario.
> > When the child process either execs or exits, the parent process should
> > ideally be able to reuse the entire mTHP. However, the current kernel
> > lacks this capability and instead places the mTHP into split_deferred,
> > performing a CoW (Copy-on-Write) on just a single subpage of the mTHP.
> >
> >   main()
> >   {
> >   #define SIZE 1024 * 1024UL
> >           void *p = malloc(SIZE);
> >           memset(p, 0x11, SIZE);
> >           if (fork() == 0)
> >                   exec(....);
> >          /*
> >        * this will trigger cow one subpage from
> >        * mTHP and put mTHP into split_deferred
> >        * list
> >        */
> >       *(int *)(p + 10) = 10;
> >       printf("done\n");
> >       while(1);
> >   }
> >
> > This leads to two significant issues:
> >
> > * Memory Waste: Before the mTHP is fully split by the shrinker,
> > it wastes memory. In extreme cases, such as with a 64KB mTHP,
> > the memory usage could be 64KB + 60KB until the last subpage
> > is written, at which point the mTHP is freed.
> >
> > * Fragmentation and Performance Loss: It destroys large folios
> > (negating the performance benefits of CONT-PTE) and fragments memory.
> >
> > To address this, we should aim to reuse the entire mTHP in such cases.
> >
> > Hi David,
> >
> > I’ve renamed wp_page_reuse() to wp_folio_reuse() and added an
> > entirely_reuse argument because I’m not sure if there are still cases
> > where we reuse a subpage within an mTHP. For now, I’m setting
> > entirely_reuse to true only for the newly supported case, while all
> > other cases still get false. Please let me know if this is incorrect—if
> > we don’t reuse subpages at all, we could remove the argument.
>
> See [1] I sent out this week, that is able to reuse even without
> scanning page tables. If we find the the folio is exclusive we could try
> processing surrounding PTEs that map the same folio.
>
> [1] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com

Great! It looks like I missed your patch again. Since you've implemented this
in a better way, I’d prefer to use your patchset.

I’m curious about how you're handling ptep_set_access_flags_nr() or similar
things because I couldn’t find the related code in your patch 10/17:

[PATCH v1 10/17] mm: COW reuse support for PTE-mapped THP with CONFIG_MM_ID

Am I missing something?

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
David Hildenbrand Aug. 31, 2024, 9:59 a.m. UTC | #3
> +		idx = folio_page_idx(folio, vmf->page);
> +		folio_start = address - idx * PAGE_SIZE;
> +		folio_end = folio_start + nr * PAGE_SIZE;
> +
> +		if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
> +			return false;
> +		if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
> +			return false;
> +		folio_ptep = vmf->pte - idx;
> +		folio_pte = ptep_get(folio_ptep);
> +		if (!pte_present(folio_pte) || pte_pfn(folio_pte) != folio_pfn(folio))
> +			return false;
> +		if (folio_pte_batch(folio, folio_start, folio_ptep, folio_pte, nr, 0,
> +				NULL, NULL, NULL) != nr)
> +			return false;
> +		if (folio_mapcount(folio) != nr)
> +			return false;

BTW, you're not checking against the refcount (and it's all a bit racy 
on concurrent unmapping!). So you're re-introducing the vmsplice 
child->parent attack.
David Hildenbrand Aug. 31, 2024, 10:07 a.m. UTC | #4
On 31.08.24 11:55, Barry Song wrote:
> On Sat, Aug 31, 2024 at 9:44 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 31.08.24 11:23, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> On a physical phone, it's sometimes observed that deferred_split
>>> mTHPs account for over 15% of the total mTHPs. Profiling by Chuanhua
>>> indicates that the majority of these originate from the typical fork
>>> scenario.
>>> When the child process either execs or exits, the parent process should
>>> ideally be able to reuse the entire mTHP. However, the current kernel
>>> lacks this capability and instead places the mTHP into split_deferred,
>>> performing a CoW (Copy-on-Write) on just a single subpage of the mTHP.
>>>
>>>    main()
>>>    {
>>>    #define SIZE 1024 * 1024UL
>>>            void *p = malloc(SIZE);
>>>            memset(p, 0x11, SIZE);
>>>            if (fork() == 0)
>>>                    exec(....);
>>>           /*
>>>         * this will trigger cow one subpage from
>>>         * mTHP and put mTHP into split_deferred
>>>         * list
>>>         */
>>>        *(int *)(p + 10) = 10;
>>>        printf("done\n");
>>>        while(1);
>>>    }
>>>
>>> This leads to two significant issues:
>>>
>>> * Memory Waste: Before the mTHP is fully split by the shrinker,
>>> it wastes memory. In extreme cases, such as with a 64KB mTHP,
>>> the memory usage could be 64KB + 60KB until the last subpage
>>> is written, at which point the mTHP is freed.
>>>
>>> * Fragmentation and Performance Loss: It destroys large folios
>>> (negating the performance benefits of CONT-PTE) and fragments memory.
>>>
>>> To address this, we should aim to reuse the entire mTHP in such cases.
>>>
>>> Hi David,
>>>
>>> I’ve renamed wp_page_reuse() to wp_folio_reuse() and added an
>>> entirely_reuse argument because I’m not sure if there are still cases
>>> where we reuse a subpage within an mTHP. For now, I’m setting
>>> entirely_reuse to true only for the newly supported case, while all
>>> other cases still get false. Please let me know if this is incorrect—if
>>> we don’t reuse subpages at all, we could remove the argument.
>>
>> See [1] I sent out this week, that is able to reuse even without
>> scanning page tables. If we find the the folio is exclusive we could try
>> processing surrounding PTEs that map the same folio.
>>
>> [1] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com
> 
> Great! It looks like I missed your patch again. Since you've implemented this
> in a better way, I’d prefer to use your patchset.

I wouldn't say better, just more universal, and it takes care of
properly sync'ing the mapcount vs. refcount :P

> 
> I’m curious about how you're handling ptep_set_access_flags_nr() or similar
> things because I couldn’t find the related code in your patch 10/17:
> 
> [PATCH v1 10/17] mm: COW reuse support for PTE-mapped THP with CONFIG_MM_ID
> 
> Am I missing something?

The idea is to keep individual write faults as fast as possible. So the
patch set keeps it simple and only reuses a single PTE at a time,
setting PAE (PageAnonExclusive) on that one page and mapping it writable.

As the patch states, it might be reasonable to optimize some cases,
maybe also only on some architectures, for example to fault around and
map the other PTEs writable as well. It might not always be desirable
though, especially not for larger folios.
Barry Song Aug. 31, 2024, 10:09 a.m. UTC | #5
On Sat, Aug 31, 2024 at 9:59 PM David Hildenbrand <david@redhat.com> wrote:
>
>
> > +             idx = folio_page_idx(folio, vmf->page);
> > +             folio_start = address - idx * PAGE_SIZE;
> > +             folio_end = folio_start + nr * PAGE_SIZE;
> > +
> > +             if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
> > +                     return false;
> > +             if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
> > +                     return false;
> > +             folio_ptep = vmf->pte - idx;
> > +             folio_pte = ptep_get(folio_ptep);
> > +             if (!pte_present(folio_pte) || pte_pfn(folio_pte) != folio_pfn(folio))
> > +                     return false;
> > +             if (folio_pte_batch(folio, folio_start, folio_ptep, folio_pte, nr, 0,
> > +                             NULL, NULL, NULL) != nr)
> > +                     return false;
> > +             if (folio_mapcount(folio) != nr)
> > +                     return false;
>
> BTW, you're not checking against the refcount (and it's all a bit racy
> on concurrent unmapping!). So you're re-introducing the vmsplice
> child->parent attak.

I don't quite understand this; do you mean the change below is not enough?

-       if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
+       if (folio_test_ksm(folio) || folio_ref_count(folio) > 2 + nr)
                return false;
        if (!folio_test_lru(folio))
                /*
@@ -3591,13 +3627,13 @@ static bool wp_can_reuse_anon_folio(struct folio *folio,
                 * remote LRU caches or references to LRU folios.
                 */
                lru_add_drain();
-       if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
+       if (folio_ref_count(folio) > nr + folio_test_swapcache(folio))
                return false;
        if (!folio_trylock(folio))
                return false;
        if (folio_test_swapcache(folio))
                folio_free_swap(folio);
-       if (folio_test_ksm(folio) || folio_ref_count(folio) != 1) {
+       if (folio_test_ksm(folio) || folio_ref_count(folio) != nr) {
                folio_unlock(folio);
                return false;

>
> --
> Cheers,
>
> David / dhildenb
>
David Hildenbrand Aug. 31, 2024, 10:14 a.m. UTC | #6
On 31.08.24 12:09, Barry Song wrote:
> On Sat, Aug 31, 2024 at 9:59 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>
>>> +             idx = folio_page_idx(folio, vmf->page);
>>> +             folio_start = address - idx * PAGE_SIZE;
>>> +             folio_end = folio_start + nr * PAGE_SIZE;
>>> +
>>> +             if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
>>> +                     return false;
>>> +             if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
>>> +                     return false;
>>> +             folio_ptep = vmf->pte - idx;
>>> +             folio_pte = ptep_get(folio_ptep);
>>> +             if (!pte_present(folio_pte) || pte_pfn(folio_pte) != folio_pfn(folio))
>>> +                     return false;
>>> +             if (folio_pte_batch(folio, folio_start, folio_ptep, folio_pte, nr, 0,
>>> +                             NULL, NULL, NULL) != nr)
>>> +                     return false;
>>> +             if (folio_mapcount(folio) != nr)
>>> +                     return false;
>>
>> BTW, you're not checking against the refcount (and it's all a bit racy
>> on concurrent unmapping!). So you're re-introducing the vmsplice
>> child->parent attak.
> 
> i don't quite understand this, you mean the below is not enough?
> 

Ah! You use the fallthrough, sorry I missed that!

You're not handling the swapcache references "correctly" (there would be
one reference per page), but the final check would be correct.

Yes, that should not allow for false positives here.
Barry Song Aug. 31, 2024, 10:21 a.m. UTC | #7
On Sat, Aug 31, 2024 at 10:07 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 31.08.24 11:55, Barry Song wrote:
> > On Sat, Aug 31, 2024 at 9:44 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 31.08.24 11:23, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> On a physical phone, it's sometimes observed that deferred_split
> >>> mTHPs account for over 15% of the total mTHPs. Profiling by Chuanhua
> >>> indicates that the majority of these originate from the typical fork
> >>> scenario.
> >>> When the child process either execs or exits, the parent process should
> >>> ideally be able to reuse the entire mTHP. However, the current kernel
> >>> lacks this capability and instead places the mTHP into split_deferred,
> >>> performing a CoW (Copy-on-Write) on just a single subpage of the mTHP.
> >>>
> >>>    main()
> >>>    {
> >>>    #define SIZE 1024 * 1024UL
> >>>            void *p = malloc(SIZE);
> >>>            memset(p, 0x11, SIZE);
> >>>            if (fork() == 0)
> >>>                    exec(....);
> >>>           /*
> >>>         * this will trigger cow one subpage from
> >>>         * mTHP and put mTHP into split_deferred
> >>>         * list
> >>>         */
> >>>        *(int *)(p + 10) = 10;
> >>>        printf("done\n");
> >>>        while(1);
> >>>    }
> >>>
> >>> This leads to two significant issues:
> >>>
> >>> * Memory Waste: Before the mTHP is fully split by the shrinker,
> >>> it wastes memory. In extreme cases, such as with a 64KB mTHP,
> >>> the memory usage could be 64KB + 60KB until the last subpage
> >>> is written, at which point the mTHP is freed.
> >>>
> >>> * Fragmentation and Performance Loss: It destroys large folios
> >>> (negating the performance benefits of CONT-PTE) and fragments memory.
> >>>
> >>> To address this, we should aim to reuse the entire mTHP in such cases.
> >>>
> >>> Hi David,
> >>>
> >>> I’ve renamed wp_page_reuse() to wp_folio_reuse() and added an
> >>> entirely_reuse argument because I’m not sure if there are still cases
> >>> where we reuse a subpage within an mTHP. For now, I’m setting
> >>> entirely_reuse to true only for the newly supported case, while all
> >>> other cases still get false. Please let me know if this is incorrect—if
> >>> we don’t reuse subpages at all, we could remove the argument.
> >>
> >> See [1] I sent out this week, that is able to reuse even without
> >> scanning page tables. If we find the the folio is exclusive we could try
> >> processing surrounding PTEs that map the same folio.
> >>
> >> [1] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com
> >
> > Great! It looks like I missed your patch again. Since you've implemented this
> > in a better way, I’d prefer to use your patchset.
>
> I wouldn't say better, just more universally. And while taking care of
> properly sync'ing the mapcount vs. refcount :P
>
> >
> > I’m curious about how you're handling ptep_set_access_flags_nr() or similar
> > things because I couldn’t find the related code in your patch 10/17:
> >
> > [PATCH v1 10/17] mm: COW reuse support for PTE-mapped THP with CONFIG_MM_ID
> >
> > Am I missing something?
>
> The idea is to keep individual write faults as fast as possible. So the
> patch set keeps it simple and only reuses a single PTE at a time,
> setting that one PAE and mapping it writable.

I get your point, thanks! Since the mTHP is already exclusive, the
following nr - 1 minor page faults will set their particular PTEs
writable one by one.

>
> As the patch states, it might be reasonable to optimize some cases,
> maybe also only on some architectures. For example to fault-around and
> map the other ones writable as well. It might not always be desirable
> though, especially not for larger folios.

Since the mTHP is already entirely exclusive, setting all PTEs
directly to writable should avoid nr - 1 minor page faults and
ideally reduce CONTPTE unfolding and folding?

What is the downside of doing that? I also don't think mapping them
all writable together will waste memory?

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
David Hildenbrand Aug. 31, 2024, 10:29 a.m. UTC | #8
On 31.08.24 12:21, Barry Song wrote:
> On Sat, Aug 31, 2024 at 10:07 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 31.08.24 11:55, Barry Song wrote:
>>> On Sat, Aug 31, 2024 at 9:44 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 31.08.24 11:23, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> On a physical phone, it's sometimes observed that deferred_split
>>>>> mTHPs account for over 15% of the total mTHPs. Profiling by Chuanhua
>>>>> indicates that the majority of these originate from the typical fork
>>>>> scenario.
>>>>> When the child process either execs or exits, the parent process should
>>>>> ideally be able to reuse the entire mTHP. However, the current kernel
>>>>> lacks this capability and instead places the mTHP into split_deferred,
>>>>> performing a CoW (Copy-on-Write) on just a single subpage of the mTHP.
>>>>>
>>>>>     main()
>>>>>     {
>>>>>     #define SIZE 1024 * 1024UL
>>>>>             void *p = malloc(SIZE);
>>>>>             memset(p, 0x11, SIZE);
>>>>>             if (fork() == 0)
>>>>>                     exec(....);
>>>>>            /*
>>>>>          * this will trigger cow one subpage from
>>>>>          * mTHP and put mTHP into split_deferred
>>>>>          * list
>>>>>          */
>>>>>         *(int *)(p + 10) = 10;
>>>>>         printf("done\n");
>>>>>         while(1);
>>>>>     }
>>>>>
>>>>> This leads to two significant issues:
>>>>>
>>>>> * Memory Waste: Before the mTHP is fully split by the shrinker,
>>>>> it wastes memory. In extreme cases, such as with a 64KB mTHP,
>>>>> the memory usage could be 64KB + 60KB until the last subpage
>>>>> is written, at which point the mTHP is freed.
>>>>>
>>>>> * Fragmentation and Performance Loss: It destroys large folios
>>>>> (negating the performance benefits of CONT-PTE) and fragments memory.
>>>>>
>>>>> To address this, we should aim to reuse the entire mTHP in such cases.
>>>>>
>>>>> Hi David,
>>>>>
>>>>> I’ve renamed wp_page_reuse() to wp_folio_reuse() and added an
>>>>> entirely_reuse argument because I’m not sure if there are still cases
>>>>> where we reuse a subpage within an mTHP. For now, I’m setting
>>>>> entirely_reuse to true only for the newly supported case, while all
>>>>> other cases still get false. Please let me know if this is incorrect—if
>>>>> we don’t reuse subpages at all, we could remove the argument.
>>>>
>>>> See [1] I sent out this week, that is able to reuse even without
>>>> scanning page tables. If we find the the folio is exclusive we could try
>>>> processing surrounding PTEs that map the same folio.
>>>>
>>>> [1] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com
>>>
>>> Great! It looks like I missed your patch again. Since you've implemented this
>>> in a better way, I’d prefer to use your patchset.
>>
>> I wouldn't say better, just more universally. And while taking care of
>> properly sync'ing the mapcount vs. refcount :P
>>
>>>
>>> I’m curious about how you're handling ptep_set_access_flags_nr() or similar
>>> things because I couldn’t find the related code in your patch 10/17:
>>>
>>> [PATCH v1 10/17] mm: COW reuse support for PTE-mapped THP with CONFIG_MM_ID
>>>
>>> Am I missing something?
>>
>> The idea is to keep individual write faults as fast as possible. So the
>> patch set keeps it simple and only reuses a single PTE at a time,
>> setting that one PAE and mapping it writable.
> 
> I got your point, thanks! as anyway the mTHP has been exclusive,
> so the following nr-1 minor page faults will set their particular PTE
> to writable one by one.

Yes, assuming you would get these page faults, and assuming you would 
get them in the near future.

> 
>>
>> As the patch states, it might be reasonable to optimize some cases,
>> maybe also only on some architectures. For example to fault-around and
>> map the other ones writable as well. It might not always be desirable
>> though, especially not for larger folios.
> 
> as anyway, the mTHP has been entirely exclusive, setting all PTEs
> directly to writable should help reduce nr - 1 minor page faults and
> ideally help reduce CONTPTE unfold and fold?

Yes, doing that on CONTPTE granularity would very likely make sense. For 
anything bigger than that, I am not sure.

Assume we have a 1M folio mapped by PTEs: trying to fault around at
aligned CONTPTE granularity likely makes sense; bigger than that, I am
not convinced.

> 
> What is the downside to doing that? I also don't think mapping them
> all together will waste memory?

No, it's all about increasing the latency of individual write faults.
Barry Song Aug. 31, 2024, 10:49 a.m. UTC | #9
On Sat, Aug 31, 2024 at 10:29 PM David Hildenbrand <david@redhat.com> wrote:
>
> On 31.08.24 12:21, Barry Song wrote:
> > On Sat, Aug 31, 2024 at 10:07 PM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 31.08.24 11:55, Barry Song wrote:
> >>> On Sat, Aug 31, 2024 at 9:44 PM David Hildenbrand <david@redhat.com> wrote:
> >>>>
> >>>> On 31.08.24 11:23, Barry Song wrote:
> >>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>
> >>>>> On a physical phone, it's sometimes observed that deferred_split
> >>>>> mTHPs account for over 15% of the total mTHPs. Profiling by Chuanhua
> >>>>> indicates that the majority of these originate from the typical fork
> >>>>> scenario.
> >>>>> When the child process either execs or exits, the parent process should
> >>>>> ideally be able to reuse the entire mTHP. However, the current kernel
> >>>>> lacks this capability and instead places the mTHP into split_deferred,
> >>>>> performing a CoW (Copy-on-Write) on just a single subpage of the mTHP.
> >>>>>
> >>>>>     main()
> >>>>>     {
> >>>>>     #define SIZE 1024 * 1024UL
> >>>>>             void *p = malloc(SIZE);
> >>>>>             memset(p, 0x11, SIZE);
> >>>>>             if (fork() == 0)
> >>>>>                     exec(....);
> >>>>>            /*
> >>>>>          * this will trigger cow one subpage from
> >>>>>          * mTHP and put mTHP into split_deferred
> >>>>>          * list
> >>>>>          */
> >>>>>         *(int *)(p + 10) = 10;
> >>>>>         printf("done\n");
> >>>>>         while(1);
> >>>>>     }
> >>>>>
> >>>>> This leads to two significant issues:
> >>>>>
> >>>>> * Memory Waste: Before the mTHP is fully split by the shrinker,
> >>>>> it wastes memory. In extreme cases, such as with a 64KB mTHP,
> >>>>> the memory usage could be 64KB + 60KB until the last subpage
> >>>>> is written, at which point the mTHP is freed.
> >>>>>
> >>>>> * Fragmentation and Performance Loss: It destroys large folios
> >>>>> (negating the performance benefits of CONT-PTE) and fragments memory.
> >>>>>
> >>>>> To address this, we should aim to reuse the entire mTHP in such cases.
> >>>>>
> >>>>> Hi David,
> >>>>>
> >>>>> I’ve renamed wp_page_reuse() to wp_folio_reuse() and added an
> >>>>> entirely_reuse argument because I’m not sure if there are still cases
> >>>>> where we reuse a subpage within an mTHP. For now, I’m setting
> >>>>> entirely_reuse to true only for the newly supported case, while all
> >>>>> other cases still get false. Please let me know if this is incorrect—if
> >>>>> we don’t reuse subpages at all, we could remove the argument.
> >>>>
> >>>> See [1] I sent out this week, that is able to reuse even without
> >>>> scanning page tables. If we find the the folio is exclusive we could try
> >>>> processing surrounding PTEs that map the same folio.
> >>>>
> >>>> [1] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com
> >>>
> >>> Great! It looks like I missed your patch again. Since you've implemented this
> >>> in a better way, I’d prefer to use your patchset.
> >>
> >> I wouldn't say better, just more universally. And while taking care of
> >> properly sync'ing the mapcount vs. refcount :P
> >>
> >>>
> >>> I’m curious about how you're handling ptep_set_access_flags_nr() or similar
> >>> things because I couldn’t find the related code in your patch 10/17:
> >>>
> >>> [PATCH v1 10/17] mm: COW reuse support for PTE-mapped THP with CONFIG_MM_ID
> >>>
> >>> Am I missing something?
> >>
> >> The idea is to keep individual write faults as fast as possible. So the
> >> patch set keeps it simple and only reuses a single PTE at a time,
> >> setting that one PAE and mapping it writable.
> >
> > I got your point, thanks! as anyway the mTHP has been exclusive,
> > so the following nr-1 minor page faults will set their particular PTE
> > to writable one by one.
>
> Yes, assuming you would get these page faults, and assuming you would
> get them in the near future.
>
> >
> >>
> >> As the patch states, it might be reasonable to optimize some cases,
> >> maybe also only on some architectures. For example to fault-around and
> >> map the other ones writable as well. It might not always be desirable
> >> though, especially not for larger folios.
> >
> > as anyway, the mTHP has been entirely exclusive, setting all PTEs
> > directly to writable should help reduce nr - 1 minor page faults and
> > ideally help reduce CONTPTE unfold and fold?
>
> Yes, doing that on CONTPTE granularity would very likely make sense. For
> anything bigger than that, I am not sure.
>
> Assuming we have a 1M folio mapped by PTEs. Trying to fault-around in
> aligned CONTPTE granularity likely makes sense. Bigger than that, I am
> not convinced.
>

I see. Maybe we can have something like:

static bool pte_fault_around_estimate(int nr)
{
       if (nr / arch_batched_ptes_nr() < 16)
             return true;

       return false;
}

if (pte_fault_around_estimate(folio_nr_pages(folio)))
       set all ptes;

For arm64, arch_batched_ptes_nr() == 16; for an arch without CONT-PTE
or similar, arch_batched_ptes_nr() == 1.

Just some rough ideas; all the naming might be quite messy.

At least we won't lose the benefit of reduced TLB misses before all
nr_pages are written on aarch64 :-)

> >
> > What is the downside to doing that? I also don't think mapping them
> > all together will waste memory?
>
> No, it's all about increasing the latency of individual write faults.
>

I see. I assume it won't be worse than the current case where we have to
allocate small folios and copy? And folio allocation can even incur
direct reclamation?

> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
David Hildenbrand Aug. 31, 2024, 11:02 a.m. UTC | #10
On 31.08.24 12:49, Barry Song wrote:
> On Sat, Aug 31, 2024 at 10:29 PM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 31.08.24 12:21, Barry Song wrote:
>>> On Sat, Aug 31, 2024 at 10:07 PM David Hildenbrand <david@redhat.com> wrote:
>>>>
>>>> On 31.08.24 11:55, Barry Song wrote:
>>>>> On Sat, Aug 31, 2024 at 9:44 PM David Hildenbrand <david@redhat.com> wrote:
>>>>>>
>>>>>> On 31.08.24 11:23, Barry Song wrote:
>>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>>
>>>>>>> On a physical phone, it's sometimes observed that deferred_split
>>>>>>> mTHPs account for over 15% of the total mTHPs. Profiling by Chuanhua
>>>>>>> indicates that the majority of these originate from the typical fork
>>>>>>> scenario.
>>>>>>> When the child process either execs or exits, the parent process should
>>>>>>> ideally be able to reuse the entire mTHP. However, the current kernel
>>>>>>> lacks this capability and instead places the mTHP into split_deferred,
>>>>>>> performing a CoW (Copy-on-Write) on just a single subpage of the mTHP.
>>>>>>>
>>>>>>>      main()
>>>>>>>      {
>>>>>>>      #define SIZE 1024 * 1024UL
>>>>>>>              void *p = malloc(SIZE);
>>>>>>>              memset(p, 0x11, SIZE);
>>>>>>>              if (fork() == 0)
>>>>>>>                      exec(....);
>>>>>>>             /*
>>>>>>>           * this will trigger cow one subpage from
>>>>>>>           * mTHP and put mTHP into split_deferred
>>>>>>>           * list
>>>>>>>           */
>>>>>>>          *(int *)(p + 10) = 10;
>>>>>>>          printf("done\n");
>>>>>>>          while(1);
>>>>>>>      }
>>>>>>>
>>>>>>> This leads to two significant issues:
>>>>>>>
>>>>>>> * Memory Waste: Before the mTHP is fully split by the shrinker,
>>>>>>> it wastes memory. In extreme cases, such as with a 64KB mTHP,
>>>>>>> the memory usage could be 64KB + 60KB until the last subpage
>>>>>>> is written, at which point the mTHP is freed.
>>>>>>>
>>>>>>> * Fragmentation and Performance Loss: It destroys large folios
>>>>>>> (negating the performance benefits of CONT-PTE) and fragments memory.
>>>>>>>
>>>>>>> To address this, we should aim to reuse the entire mTHP in such cases.
>>>>>>>
>>>>>>> Hi David,
>>>>>>>
>>>>>>> I’ve renamed wp_page_reuse() to wp_folio_reuse() and added an
>>>>>>> entirely_reuse argument because I’m not sure if there are still cases
>>>>>>> where we reuse a subpage within an mTHP. For now, I’m setting
>>>>>>> entirely_reuse to true only for the newly supported case, while all
>>>>>>> other cases still get false. Please let me know if this is incorrect—if
>>>>>>> we don’t reuse subpages at all, we could remove the argument.
>>>>>>
>>>>>> See [1] I sent out this week, that is able to reuse even without
>>>>>> scanning page tables. If we find the the folio is exclusive we could try
>>>>>> processing surrounding PTEs that map the same folio.
>>>>>>
>>>>>> [1] https://lkml.kernel.org/r/20240829165627.2256514-1-david@redhat.com
>>>>>
>>>>> Great! It looks like I missed your patch again. Since you've implemented this
>>>>> in a better way, I’d prefer to use your patchset.
>>>>
>>>> I wouldn't say better, just more universally. And while taking care of
>>>> properly sync'ing the mapcount vs. refcount :P
>>>>
>>>>>
>>>>> I’m curious about how you're handling ptep_set_access_flags_nr() or similar
>>>>> things because I couldn’t find the related code in your patch 10/17:
>>>>>
>>>>> [PATCH v1 10/17] mm: COW reuse support for PTE-mapped THP with CONFIG_MM_ID
>>>>>
>>>>> Am I missing something?
>>>>
>>>> The idea is to keep individual write faults as fast as possible. So the
>>>> patch set keeps it simple and only reuses a single PTE at a time,
>>>> setting that one PAE and mapping it writable.
>>>
>>> I got your point, thanks! as anyway the mTHP has been exclusive,
>>> so the following nr-1 minor page faults will set their particular PTE
>>> to writable one by one.
>>
>> Yes, assuming you would get these page faults, and assuming you would
>> get them in the near future.
>>
>>>
>>>>
>>>> As the patch states, it might be reasonable to optimize some cases,
>>>> maybe also only on some architectures. For example to fault-around and
>>>> map the other ones writable as well. It might not always be desirable
>>>> though, especially not for larger folios.
>>>
>>> as anyway, the mTHP has been entirely exclusive, setting all PTEs
>>> directly to writable should help reduce nr - 1 minor page faults and
>>> ideally help reduce CONTPTE unfold and fold?
>>
>> Yes, doing that on CONTPTE granularity would very likely make sense. For
>> anything bigger than that, I am not sure.
>>
>> Assuming we have a 1M folio mapped by PTEs. Trying to fault-around in
>> aligned CONTPTE granularity likely makes sense. Bigger than that, I am
>> not convinced.
>>
> 
> I see. maybe we can have something like:
> 
> static bool pte_fault_around_estimate(int nr)
> {
>         if (nr / arch_batched_ptes_nr() < 16)
>               return true;
> 
>         return false;
> }
> 
> if (pte_fault_around_estimate(folio_nr_pages(folio)))
>         set all ptes;
> 
> for arm64, arch_batched_ptes_nr()  == 16. for
> arch without cont-pte or similar things,
> arch_batched_ptes_nr()  == 1.

Yes, something like that would be my take.

After we know that we can reuse the large folio, we'll try scanning
starting from the aligned PTE. If we find that we can batch, we'll batch
that part. Otherwise we'll simply fall back to a single one.

Handling batching across VMAs is a bit harder. We might be able to 
batch, or might not ... We could have the CONT_PTE bit set across VMAs, 
but might not necessarily be able to batch (e.g., some VMAs are read-only).

> 
> Just some rough ideas; all the naming might be quite messy.
> 
> at least, we won't lose the benefit of reduced TLB miss
> before all nr_pages are written for aarch64 :-)
> 
>>>
>>> What is the downside to doing that? I also don't think mapping them
>>> all together will waste memory?
>>
>> No, it's all about increasing the latency of individual write faults.
>>
> 
> i see, i assume it won't be worse than the current case where we have to
> allocate small folios and copy? and folio allocation can even further incur
> direct reclamation?

Yes, it would certainly be better than what we currently have. Almost
everything would likely be better than what we currently have. :)
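
To make the fault-around idea discussed above a bit more concrete, here is a
rough, untested sketch of reusing one aligned block of PTEs at a time, assuming
wp_can_reuse_anon_folio() has already confirmed the folio is exclusive.
NR_FAULT_AROUND_PTES and wp_reuse_ptes() are made-up names for this example
and are not part of the patch below:

#define NR_FAULT_AROUND_PTES	16	/* e.g. CONT_PTES on arm64, 1 otherwise */

static void wp_reuse_ptes(struct vm_fault *vmf, struct folio *folio)
{
	struct vm_area_struct *vma = vmf->vma;
	unsigned long start = ALIGN_DOWN(vmf->address,
					 NR_FAULT_AROUND_PTES * PAGE_SIZE);
	pte_t *ptep = vmf->pte - (vmf->address - start) / PAGE_SIZE;
	unsigned int nr = NR_FAULT_AROUND_PTES;
	pte_t pte = ptep_get(ptep);

	/*
	 * Only batch when the whole aligned block stays inside the VMA and
	 * starts on a page of this (already exclusive) folio; otherwise
	 * fall back to reusing just the faulting PTE.
	 */
	if (start < vma->vm_start || start + nr * PAGE_SIZE > vma->vm_end ||
	    !pte_present(pte) ||
	    pte_pfn(pte) < folio_pfn(folio) ||
	    pte_pfn(pte) >= folio_pfn(folio) + folio_nr_pages(folio) ||
	    folio_pte_batch(folio, start, ptep, pte, nr, 0,
			    NULL, NULL, NULL) != nr) {
		start = vmf->address;
		ptep = vmf->pte;
		nr = 1;
	}

	flush_cache_range(vma, start, start + nr * PAGE_SIZE);
	for (unsigned int i = 0; i < nr; i++) {
		pte_t entry = ptep_get(ptep + i);

		entry = pte_mkyoung(entry);
		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
		if (ptep_set_access_flags(vma, start + i * PAGE_SIZE,
					  ptep + i, entry, 1))
			update_mmu_cache_range(vmf, vma, start + i * PAGE_SIZE,
					       ptep + i, 1);
	}
}

On arm64 this would keep the extra work per write fault bounded to one
CONT-PTE block while still avoiding most of the nr - 1 follow-up minor
faults discussed above.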

Patch

diff --git a/mm/memory.c b/mm/memory.c
index b95fce7d190f..c51980d14e41 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3205,18 +3205,26 @@  static vm_fault_t fault_dirty_shared_page(struct vm_fault *vmf)
 	return 0;
 }
 
-/*
+ /*
  * Handle write page faults for pages that can be reused in the current vma
  *
  * This can happen either due to the mapping being with the VM_SHARED flag,
  * or due to us being the last reference standing to the page. In either
  * case, all we need to do here is to mark the page as writable and update
  * any related book-keeping.
+ * If entirely_reuse is true, we are reusing the whole large folio; otherwise,
+ * we are reusing a subpage even though folio might be large one.
  */
-static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
+static inline void wp_folio_reuse(struct vm_fault *vmf, struct folio *folio,
+				  bool entirely_reuse)
 	__releases(vmf->ptl)
 {
+	unsigned long idx = entirely_reuse ? folio_page_idx(folio, vmf->page) : 0;
+	int nr = entirely_reuse ? folio_nr_pages(folio) : 1;
+	unsigned long start = vmf->address - idx * PAGE_SIZE;
+	unsigned long end = start + nr * PAGE_SIZE;
 	struct vm_area_struct *vma = vmf->vma;
+	pte_t *ptep = vmf->pte - idx;
 	pte_t entry;
 
 	VM_BUG_ON(!(vmf->flags & FAULT_FLAG_WRITE));
@@ -3233,11 +3241,15 @@  static inline void wp_page_reuse(struct vm_fault *vmf, struct folio *folio)
 		folio_xchg_last_cpupid(folio, (1 << LAST_CPUPID_SHIFT) - 1);
 	}
 
-	flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
-	entry = pte_mkyoung(vmf->orig_pte);
-	entry = maybe_mkwrite(pte_mkdirty(entry), vma);
-	if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
-		update_mmu_cache_range(vmf, vma, vmf->address, vmf->pte, 1);
+	flush_cache_range(vma, start, end);
+	for (int i = 0; i < nr; i++) {
+		entry = ptep_get(ptep + i);
+		entry = pte_mkyoung(entry);
+		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
+		if (ptep_set_access_flags(vma, start + i * PAGE_SIZE,
+				ptep + i, entry, 1))
+			update_mmu_cache_range(vmf, vma, start, ptep + i, 1);
+	}
 	pte_unmap_unlock(vmf->pte, vmf->ptl);
 	count_vm_event(PGREUSE);
 }
@@ -3493,7 +3505,7 @@  static vm_fault_t finish_mkwrite_fault(struct vm_fault *vmf, struct folio *folio
 		pte_unmap_unlock(vmf->pte, vmf->ptl);
 		return VM_FAULT_NOPAGE;
 	}
-	wp_page_reuse(vmf, folio);
+	wp_folio_reuse(vmf, folio, false);
 	return 0;
 }
 
@@ -3519,7 +3531,7 @@  static vm_fault_t wp_pfn_shared(struct vm_fault *vmf)
 			return ret;
 		return finish_mkwrite_fault(vmf, NULL);
 	}
-	wp_page_reuse(vmf, NULL);
+	wp_folio_reuse(vmf, NULL, false);
 	return 0;
 }
 
@@ -3554,7 +3566,7 @@  static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
 			return tmp;
 		}
 	} else {
-		wp_page_reuse(vmf, folio);
+		wp_folio_reuse(vmf, folio, false);
 		folio_lock(folio);
 	}
 	ret |= fault_dirty_shared_page(vmf);
@@ -3564,17 +3576,41 @@  static vm_fault_t wp_page_shared(struct vm_fault *vmf, struct folio *folio)
 }
 
 static bool wp_can_reuse_anon_folio(struct folio *folio,
-				    struct vm_area_struct *vma)
+				    struct vm_fault *vmf)
 {
+	struct vm_area_struct *vma = vmf->vma;
+	int nr = folio_nr_pages(folio);
+
 	/*
-	 * We could currently only reuse a subpage of a large folio if no
-	 * other subpages of the large folios are still mapped. However,
-	 * let's just consistently not reuse subpages even if we could
-	 * reuse in that scenario, and give back a large folio a bit
-	 * sooner.
+	 * reuse a large folio while it is entirely mapped and
+	 * exclusive (mapcount == folio_nr_pages)
 	 */
-	if (folio_test_large(folio))
-		return false;
+	if (folio_test_large(folio)) {
+		unsigned long folio_start, folio_end, idx;
+		unsigned long address = vmf->address;
+		pte_t *folio_ptep;
+		pte_t folio_pte;
+		if (folio_likely_mapped_shared(folio))
+			return false;
+
+		idx = folio_page_idx(folio, vmf->page);
+		folio_start = address - idx * PAGE_SIZE;
+		folio_end = folio_start + nr * PAGE_SIZE;
+
+		if (unlikely(folio_start < max(address & PMD_MASK, vma->vm_start)))
+			return false;
+		if (unlikely(folio_end > pmd_addr_end(address, vma->vm_end)))
+			return false;
+		folio_ptep = vmf->pte - idx;
+		folio_pte = ptep_get(folio_ptep);
+		if (!pte_present(folio_pte) || pte_pfn(folio_pte) != folio_pfn(folio))
+			return false;
+		if (folio_pte_batch(folio, folio_start, folio_ptep, folio_pte, nr, 0,
+				NULL, NULL, NULL) != nr)
+			return false;
+		if (folio_mapcount(folio) != nr)
+			return false;
+	}
 
 	/*
 	 * We have to verify under folio lock: these early checks are
@@ -3583,7 +3619,7 @@  static bool wp_can_reuse_anon_folio(struct folio *folio,
 	 *
 	 * KSM doesn't necessarily raise the folio refcount.
 	 */
-	if (folio_test_ksm(folio) || folio_ref_count(folio) > 3)
+	if (folio_test_ksm(folio) || folio_ref_count(folio) > 2 + nr)
 		return false;
 	if (!folio_test_lru(folio))
 		/*
@@ -3591,13 +3627,13 @@  static bool wp_can_reuse_anon_folio(struct folio *folio,
 		 * remote LRU caches or references to LRU folios.
 		 */
 		lru_add_drain();
-	if (folio_ref_count(folio) > 1 + folio_test_swapcache(folio))
+	if (folio_ref_count(folio) > nr + folio_test_swapcache(folio))
 		return false;
 	if (!folio_trylock(folio))
 		return false;
 	if (folio_test_swapcache(folio))
 		folio_free_swap(folio);
-	if (folio_test_ksm(folio) || folio_ref_count(folio) != 1) {
+	if (folio_test_ksm(folio) || folio_ref_count(folio) != nr) {
 		folio_unlock(folio);
 		return false;
 	}
@@ -3639,6 +3675,7 @@  static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	const bool unshare = vmf->flags & FAULT_FLAG_UNSHARE;
 	struct vm_area_struct *vma = vmf->vma;
 	struct folio *folio = NULL;
+	int nr = 1;
 	pte_t pte;
 
 	if (likely(!unshare)) {
@@ -3702,14 +3739,18 @@  static vm_fault_t do_wp_page(struct vm_fault *vmf)
 	 * the page without further checks.
 	 */
 	if (folio && folio_test_anon(folio) &&
-	    (PageAnonExclusive(vmf->page) || wp_can_reuse_anon_folio(folio, vma))) {
-		if (!PageAnonExclusive(vmf->page))
-			SetPageAnonExclusive(vmf->page);
+	    (PageAnonExclusive(vmf->page) || wp_can_reuse_anon_folio(folio, vmf))) {
+		/* this is the case we are going to reuse the entire folio */
+		if (!PageAnonExclusive(vmf->page)) {
+			nr = folio_nr_pages(folio);
+			for (int i = 0; i < nr; i++)
+				SetPageAnonExclusive(folio_page(folio, i));
+		}
 		if (unlikely(unshare)) {
 			pte_unmap_unlock(vmf->pte, vmf->ptl);
 			return 0;
 		}
-		wp_page_reuse(vmf, folio);
+		wp_folio_reuse(vmf, folio, nr > 1);
 		return 0;
 	}
 	/*