
[v2] mm: hold PTL from the first PTE while reclaiming a large folio

Message ID 20240306095219.71086-1-21cnbao@gmail.com (mailing list archive)
State New
Series [v2] mm: hold PTL from the first PTE while reclaiming a large folio

Commit Message

Barry Song March 6, 2024, 9:52 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

Within try_to_unmap_one(), page_vma_mapped_walk() races with PTE
modifications that begin by clearing the PTE. While iterating over the
PTEs of a large folio, it only starts acquiring the PTL from the first
present PTE, and such modifications can leave a PTE temporarily
pte_none. Consequently, the initial PTEs of a large folio might be
skipped by try_to_unmap_one().

For example, for an anon folio, if we skip PTE0, we may end up with
PTE0 still present while PTE1 ~ PTE(nr_pages - 1) have become swap
entries after try_to_unmap_one(). The folio thus remains mapped, fails
to be reclaimed, and is put back on the LRU in this round.

This also breaks PTE optimizations such as CONT-PTE on the large folio
and may lead to an accidental folio_split() afterwards. And since some
of the PTEs are now swap entries, accessing those parts incurs the
overhead of do_swap_page().

Although the kernel can withstand all of the above issues, the
situation is still awkward and worth improving.

The same race also occurs with small folios, but since they have only
one PTE, they cannot be partially unmapped.

This patch holds the PTL from PTE0, allowing us to avoid reading PTE
values that are in the process of being transformed. With stable PTE
values, we can ensure that this large folio is either completely
reclaimed or that all of its PTEs remain untouched in this round.

A corner case is that if we hold the PTL from PTE0 while most of the
initial PTEs have already been unmapped, we may increase the duration
of holding the PTL. Thus we only apply this optimization to folios
that are still entirely mapped (not on the deferred_split list).

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 v2:
  * Refine the commit message and code comment after reading all the
    comments from Ryan and David, thanks!
  * Avoid increasing the PTL hold time by applying the optimization
    only to folios that are not on the deferred_split list, as
    suggested by Ying, thanks!

 mm/vmscan.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)
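
To make the race concrete, here is a minimal userspace sketch of the
lockless-skip pattern described above. It is an illustration only, not
the kernel's page_vma_mapped_walk(): pte_t, pte_none() and the pthread
mutex are stand-ins, and first_pte_to_unmap() is a hypothetical helper.

#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

typedef unsigned long pte_t;	/* stand-in for the kernel's pte_t */

static bool pte_none(pte_t pte)
{
	return pte == 0;
}

/*
 * Without "sync", non-present PTEs are skipped before the lock is
 * taken. A racing writer that clears PTE0 and re-installs it a moment
 * later can make the walk start at PTE1, leaving PTE0 mapped while the
 * rest of the folio is unmapped. With "sync", the lock is taken before
 * any PTE is read, so transient pte_none values are never observed.
 */
static int first_pte_to_unmap(pte_t *ptes, size_t nr,
			      pthread_mutex_t *ptl, bool sync)
{
	size_t i = 0;

	if (!sync) {
		/* lockless skip: this is the race window */
		while (i < nr && pte_none(ptes[i]))
			i++;
	}
	pthread_mutex_lock(ptl);
	return i < nr ? (int)i : -1;
}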

Comments

David Hildenbrand March 26, 2024, 4:10 p.m. UTC | #1
On 06.03.24 10:52, Barry Song wrote:
> [...]
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0b888a2afa58..7106741387e8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1270,6 +1270,18 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>   
>   			if (folio_test_pmd_mappable(folio))
>   				flags |= TTU_SPLIT_HUGE_PMD;
> +			/*
> +			 * Without TTU_SYNC, try_to_unmap will only begin to hold PTL
> +			 * from the first present PTE within a large folio. Some initial
> +			 * PTEs might be skipped due to races with parallel PTE writes
> +			 * in which PTEs can be cleared temporarily before new present
> +			 * values are written. This can leave a large folio still mapped
> +			 * while some of its subpages have been unmapped by try_to_unmap;
> +			 * TTU_SYNC makes try_to_unmap acquire PTL from the first PTE,
> +			 * eliminating the influence of temporary PTE values.
> +			 */
> +			if (folio_test_large(folio) && list_empty(&folio->_deferred_list))
> +				flags |= TTU_SYNC;
>   
>   			try_to_unmap(folio, flags);
>   			if (folio_mapped(folio)) {

Hopefully this won't have unexpected performance "surprises".

I do wonder if we should really care about the "_deferred_list"
optimization here, though; I'd just drop it.

In any case

Acked-by: David Hildenbrand <david@redhat.com>
Ryan Roberts March 26, 2024, 4:17 p.m. UTC | #2
On 26/03/2024 16:10, David Hildenbrand wrote:
> On 06.03.24 10:52, Barry Song wrote:
>> [...]
> 
> Hopefully this won't have unexpected performance "surprises".
> 
> I do wonder if we should really care about the "_deferred_list" optimization
> here, though; I'd just drop it.

I also concluded that we do need the data_race() annotation around list_empty()
if you do wind up keeping it. But I agree with David about dropping it.
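
For reference, a sketch of how the check might look with the annotation
Ryan suggests, if it were kept; data_race() simply tells KCSAN that the
unlocked read of _deferred_list is intentional:

			if (folio_test_large(folio) &&
			    data_race(list_empty(&folio->_deferred_list)))
				flags |= TTU_SYNC;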

> 
> In any case
> 
> Acked-by: David Hildenbrand <david@redhat.com>
>
Matthew Wilcox March 26, 2024, 4:21 p.m. UTC | #3
On Wed, Mar 06, 2024 at 10:52:19PM +1300, Barry Song wrote:
>  			if (folio_test_pmd_mappable(folio))
>  				flags |= TTU_SPLIT_HUGE_PMD;
> +			/*
> +			 * Without TTU_SYNC, try_to_unmap will only begin to hold PTL
> +			 * from the first present PTE within a large folio. Some initial
> +			 * PTEs might be skipped due to races with parallel PTE writes
> +			 * in which PTEs can be cleared temporarily before new present
> +			 * values are written. This can leave a large folio still mapped
> +			 * while some of its subpages have been unmapped by try_to_unmap;
> +			 * TTU_SYNC makes try_to_unmap acquire PTL from the first PTE,
> +			 * eliminating the influence of temporary PTE values.
> +			 */

Please wrap comments at <80 columns.  The easiest way to do this is just
to pipe the block of text through "fmt -p \*" (you need to quote the * to
prevent it from being treated as a glob/regex).
Barry Song March 26, 2024, 10:04 p.m. UTC | #4
On Wed, Mar 27, 2024 at 5:17 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 26/03/2024 16:10, David Hildenbrand wrote:
> > On 06.03.24 10:52, Barry Song wrote:
> >> [...]
> >
> > Hopefully this won't have unexpected performance "surprises".
> >
> > I do wonder if we should really care about the "_deferred_list" optimization
> > here, though; I'd just drop it.

This is for a corner case (pictured below): PTE0 ~ PTE(nr_pages - 2) of
a large folio have already been unmapped, while PTE(nr_pages - 1) is
still mapped. With the patch, we hold the PTL while skipping
PTE0 ~ PTE(nr_pages - 2); without it, we skip those PTEs without the PTL.

But Ryan's swap-out series will split the folio before try_to_unmap
anyway, so I feel we can drop it.
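
To picture the corner case (illustrative only):

  PTE0 ~ PTE(nr_pages - 2):  swap entries (already unmapped)
  PTE(nr_pages - 1):         present (still mapped)

  without TTU_SYNC: the swap entries are skipped locklessly; the PTL is
                    taken only at PTE(nr_pages - 1)
  with TTU_SYNC:    the PTL is held while all nr_pages - 1 swap entries
                    are skipped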

>
> I also concluded that we do need the data_race() annotation around list_empty()
> if you do wind up keeping it. But I agree with David about dropping it.
>
> >
> > In any case
> >
> > Acked-by: David Hildenbrand <david@redhat.com>

Thanks!

> >

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0b888a2afa58..7106741387e8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1270,6 +1270,18 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
+			/*
+			 * Without TTU_SYNC, try_to_unmap will only begin to hold PTL
+			 * from the first present PTE within a large folio. Some initial
+			 * PTEs might be skipped due to races with parallel PTE writes
+			 * in which PTEs can be cleared temporarily before new present
+			 * values are written. This can leave a large folio still mapped
+			 * while some of its subpages have been unmapped by try_to_unmap;
+			 * TTU_SYNC makes try_to_unmap acquire PTL from the first PTE,
+			 * eliminating the influence of temporary PTE values.
+			 */
+			if (folio_test_large(folio) && list_empty(&folio->_deferred_list))
+				flags |= TTU_SYNC;
 
 			try_to_unmap(folio, flags);
 			if (folio_mapped(folio)) {
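
For context, TTU_SYNC takes effect by being translated into PVMW_SYNC
for the page table walk. In try_to_unmap_one() the flag is forwarded
roughly like this (paraphrased from mm/rmap.c, not part of this patch):

	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);

	if (flags & TTU_SYNC)
		pvmw.flags = PVMW_SYNC;

With PVMW_SYNC set, page_vma_mapped_walk() takes the PTE lock before
examining the first PTE instead of skipping pte_none entries locklessly.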