
[RFC] mm: hold PTL from the first PTE while reclaiming a large folio

Message ID 20240304103757.235352-1-21cnbao@gmail.com (mailing list archive)
State New
Series [RFC] mm: hold PTL from the first PTE while reclaiming a large folio

Commit Message

Barry Song March 4, 2024, 10:37 a.m. UTC
From: Barry Song <v-songbaohua@oppo.com>

page_vma_mapped_walk() within try_to_unmap_one() races with other
PTE modifications such as break-before-make: while iterating the PTEs
of a large folio, it only begins to acquire the PTL once it finds a
valid (present) PTE. Break-before-make transiently sets PTEs to
pte_none, so some of a large folio's PTEs might be skipped by
try_to_unmap_one().

For example, for an anon folio, after try_to_unmap_one() we may end up
with PTE0 present while PTE1 ~ PTE(nr_pages - 1) are swap entries. The
folio is then still mapped, so it fails to be reclaimed. What is even
more worrying is that its PTEs are no longer in a unified state, which
might lead to an accidental folio_split() afterwards. And since part of
the PTEs are now swap entries, accessing them will incur page faults -
do_swap_page().

This creates both confusion and extra cost. While we can't prevent
userspace unmaps from breaking up unified PTEs such as CONT-PTE for a
large folio, we can at least keep the kernel itself from breaking them
up as a mere side effect of its code design.

This patch holds the PTL from PTE0, so the folio will either be
entirely reclaimed or entirely kept. The approach doesn't increase PTL
contention: even without the patch, page_vma_mapped_walk() always ends
up taking the PTL after skipping at most one or two PTEs, because the
intermediate break-before-make windows are short, according to tests.
And even without this patch, the vast majority of try_to_unmap_one()
calls already take the PTL from PTE0; this patch simply makes that
100%.

The other option is to give up in try_to_unmap_one() once we find that
PTE0 is not the first entry on which we took the PTL, calling
page_vma_mapped_walk_done() to end the iteration in that case. This
keeps the PTEs unified but leaves the folio unreclaimed. The result is
then quite similar to small folios with a single PTE - either entirely
reclaimed or entirely kept. Reclaiming large folios by holding the PTL
from PTE0 seems the better option compared to giving up after detecting
that the PTL was first taken on a non-PTE0 entry.

Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Barry Song <v-songbaohua@oppo.com>
---
 mm/vmscan.c | 11 +++++++++++
 1 file changed, 11 insertions(+)
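
The full patch body, as quoted in the replies below:

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0b888a2afa58..e4722fbbcd0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
+			/*
+			 * if page table lock is not held from the first PTE of
+			 * a large folio, some PTEs might be skipped because of
+			 * races with break-before-make, for example, PTEs can
+			 * be pte_none intermediately, thus one or more PTEs
+			 * might be skipped in try_to_unmap_one, we might result
+			 * in a large folio is partially mapped and partially
+			 * unmapped after try_to_unmap
+			 */
+			if (folio_test_large(folio))
+				flags |= TTU_SYNC;
 
 			try_to_unmap(folio, flags);
 			if (folio_mapped(folio)) {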

Comments

Ryan Roberts March 4, 2024, 12:20 p.m. UTC | #1
Hi Barry,

On 04/03/2024 10:37, Barry Song wrote:
> From: Barry Song <v-songbaohua@oppo.com>
> 
> page_vma_mapped_walk() within try_to_unmap_one() races with other
> PTEs modification such as break-before-make, while iterating PTEs
> of a large folio, it will only begin to acquire PTL after it gets
> a valid(present) PTE. break-before-make intermediately sets PTEs
> to pte_none. Thus, a large folio's PTEs might be partially skipped
> in try_to_unmap_one().

I just want to check my understanding here - I think the problem occurs for
PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
that I've had a look at the code and have a better understanding, I think that
must be the case? And therefore this problem exists independently of my work to
support swap-out of mTHP? (From your previous report I was under the impression
that it only affected mTHP).

It's just that the problem is becoming more pronounced because with mTHP,
PTE-mapped large folios are much more common?

> For example, for an anon folio, after try_to_unmap_one(), we may
> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> So folio will be still mapped, the folio fails to be reclaimed.
> What’s even more worrying is, its PTEs are no longer in a unified
> state. This might lead to accident folio_split() afterwards. And
> since a part of PTEs are now swap entries, accessing them will
> incur page fault - do_swap_page.
> It creates both anxiety and more expense. While we can't avoid
> userspace's unmap to break up unified PTEs such as CONT-PTE for
> a large folio, we can indeed keep away from kernel's breaking up
> them due to its code design.
> This patch is holding PTL from PTE0, thus, the folio will either
> be entirely reclaimed or entirely kept. On the other hand, this
> approach doesn't increase PTL contention. Even w/o the patch,
> page_vma_mapped_walk() will always get PTL after it sometimes
> skips one or two PTEs because intermediate break-before-makes
> are short, according to test. Of course, even w/o this patch,
> the vast majority of try_to_unmap_one still can get PTL from
> PTE0. This patch makes the number 100%.
> The other option is that we can give up in try_to_unmap_one
> once we find PTE0 is not the first entry we get PTL, we call
> page_vma_mapped_walk_done() to end the iteration at this case.
> This will keep the unified PTEs while the folio isn't reclaimed.
> The result is quite similar with small folios with one PTE -
> either entirely reclaimed or entirely kept.
> Reclaiming large folios by holding PTL from PTE0 seems a better
> option comparing to giving up after detecting PTL begins from
> non-PTE0.
> 
> Cc: Hugh Dickins <hughd@google.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>

Do we need a Fixes tag?

> ---
>  mm/vmscan.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0b888a2afa58..e4722fbbcd0c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  
>  			if (folio_test_pmd_mappable(folio))
>  				flags |= TTU_SPLIT_HUGE_PMD;
> +			/*
> +			 * if page table lock is not held from the first PTE of
> +			 * a large folio, some PTEs might be skipped because of
> +			 * races with break-before-make, for example, PTEs can
> +			 * be pte_none intermediately, thus one or more PTEs
> +			 * might be skipped in try_to_unmap_one, we might result
> +			 * in a large folio is partially mapped and partially
> +			 * unmapped after try_to_unmap
> +			 */
> +			if (folio_test_large(folio))
> +				flags |= TTU_SYNC;

This looks sensible to me after thinking about it for a while. But I also have a
gut feeling that there might be some more subtleties that are going over my
head, since I'm not an expert in this area. So I will leave others to provide R-b :)

Thanks,
Ryan

>  
>  			try_to_unmap(folio, flags);
>  			if (folio_mapped(folio)) {
David Hildenbrand March 4, 2024, 12:41 p.m. UTC | #2
On 04.03.24 13:20, Ryan Roberts wrote:
> Hi Barry,
> 
> On 04/03/2024 10:37, Barry Song wrote:
>> From: Barry Song <v-songbaohua@oppo.com>
>>
>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>> PTEs modification such as break-before-make, while iterating PTEs
>> of a large folio, it will only begin to acquire PTL after it gets
>> a valid(present) PTE. break-before-make intermediately sets PTEs
>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>> in try_to_unmap_one().
> 
> I just want to check my understanding here - I think the problem occurs for
> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> that I've had a look at the code and have a better understanding, I think that
> must be the case? And therefore this problem exists independently of my work to
> support swap-out of mTHP? (From your previous report I was under the impression
> that it only affected mTHP).
> 
> Its just that the problem is becoming more pronounced because with mTHP,
> PTE-mapped large folios are much more common?

That is my understanding.

> 
>> For example, for an anon folio, after try_to_unmap_one(), we may
>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>> So folio will be still mapped, the folio fails to be reclaimed.
>> What’s even more worrying is, its PTEs are no longer in a unified
>> state. This might lead to accident folio_split() afterwards. And
>> since a part of PTEs are now swap entries, accessing them will
>> incur page fault - do_swap_page.
>> It creates both anxiety and more expense. While we can't avoid
>> userspace's unmap to break up unified PTEs such as CONT-PTE for
>> a large folio, we can indeed keep away from kernel's breaking up
>> them due to its code design.
>> This patch is holding PTL from PTE0, thus, the folio will either
>> be entirely reclaimed or entirely kept. On the other hand, this
>> approach doesn't increase PTL contention. Even w/o the patch,
>> page_vma_mapped_walk() will always get PTL after it sometimes
>> skips one or two PTEs because intermediate break-before-makes
>> are short, according to test. Of course, even w/o this patch,
>> the vast majority of try_to_unmap_one still can get PTL from
>> PTE0. This patch makes the number 100%.
>> The other option is that we can give up in try_to_unmap_one
>> once we find PTE0 is not the first entry we get PTL, we call
>> page_vma_mapped_walk_done() to end the iteration at this case.
>> This will keep the unified PTEs while the folio isn't reclaimed.
>> The result is quite similar with small folios with one PTE -
>> either entirely reclaimed or entirely kept.
>> Reclaiming large folios by holding PTL from PTE0 seems a better
>> option comparing to giving up after detecting PTL begins from
>> non-PTE0.
>>

I'm sure that wall of text can be formatted in a better way :). Also, I
think we can drop some of the details.

If you need some inspiration, I can give it a shot.

>> Cc: Hugh Dickins <hughd@google.com>
>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> 
> Do we need a Fixes tag?
> 

What would be the description of the problem we are fixing?

1) failing to unmap?

That can happen with small folios as well IIUC.

2) Putting the large folio on the deferred split queue?

That sounds more reasonable.

>> ---
>>   mm/vmscan.c | 11 +++++++++++
>>   1 file changed, 11 insertions(+)
>>
>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>> index 0b888a2afa58..e4722fbbcd0c 100644
>> --- a/mm/vmscan.c
>> +++ b/mm/vmscan.c
>> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>>   
>>   			if (folio_test_pmd_mappable(folio))
>>   				flags |= TTU_SPLIT_HUGE_PMD;
>> +			/*
>> +			 * if page table lock is not held from the first PTE of
>> +			 * a large folio, some PTEs might be skipped because of
>> +			 * races with break-before-make, for example, PTEs can
>> +			 * be pte_none intermediately, thus one or more PTEs
>> +			 * might be skipped in try_to_unmap_one, we might result
>> +			 * in a large folio is partially mapped and partially
>> +			 * unmapped after try_to_unmap
>> +			 */
>> +			if (folio_test_large(folio))
>> +				flags |= TTU_SYNC;
> 
> This looks sensible to me after thinking about it for a while. But I also have a
> gut feeling that there might be some more subtleties that are going over my
> head, since I'm not expert in this area. So will leave others to provide R-b :)
> 

As we are seeing more such problems with lockless PT walks, maybe we
really want some other special value (a non-swap entry?) to indicate that
a PTE is currently undergoing protection changes. That way we'd avoid the
transient pte_none(), if possible.

Without that, TTU_SYNC feels like the right thing to do.
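
For reference, roughly what TTU_SYNC changes - a simplified sketch based
on mm/rmap.c and mm/page_vma_mapped.c, not a verbatim copy of either:

	/* try_to_unmap_one(): the flag is translated into PVMW_SYNC */
	if (flags & TTU_SYNC)
		pvmw.flags |= PVMW_SYNC;

	/*
	 * map_pte(), called from page_vma_mapped_walk(): with PVMW_SYNC the
	 * walk takes the PTL before it inspects the first PTE, so PTEs that
	 * are transiently pte_none (break-before-make windows) can no longer
	 * be skipped without the lock held.
	 */
	if (pvmw->flags & PVMW_SYNC) {
		/* use the stricter, locked lookup */
		pvmw->pte = pte_offset_map_lock(pvmw->vma->vm_mm, pvmw->pmd,
						pvmw->address, ptlp);
		return !!pvmw->pte;
	}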
Ryan Roberts March 4, 2024, 1:03 p.m. UTC | #3
On 04/03/2024 12:41, David Hildenbrand wrote:
> On 04.03.24 13:20, Ryan Roberts wrote:
>> Hi Barry,
>>
>> On 04/03/2024 10:37, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>> PTEs modification such as break-before-make, while iterating PTEs
>>> of a large folio, it will only begin to acquire PTL after it gets
>>> a valid(present) PTE. break-before-make intermediately sets PTEs
>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>>> in try_to_unmap_one().
>>
>> I just want to check my understanding here - I think the problem occurs for
>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>> that I've had a look at the code and have a better understanding, I think that
>> must be the case? And therefore this problem exists independently of my work to
>> support swap-out of mTHP? (From your previous report I was under the impression
>> that it only affected mTHP).
>>
>> Its just that the problem is becoming more pronounced because with mTHP,
>> PTE-mapped large folios are much more common?
> 
> That is my understanding.
> 
>>
>>> For example, for an anon folio, after try_to_unmap_one(), we may
>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>>> So folio will be still mapped, the folio fails to be reclaimed.
>>> What’s even more worrying is, its PTEs are no longer in a unified
>>> state. This might lead to accident folio_split() afterwards. And
>>> since a part of PTEs are now swap entries, accessing them will
>>> incur page fault - do_swap_page.
>>> It creates both anxiety and more expense. While we can't avoid
>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
>>> a large folio, we can indeed keep away from kernel's breaking up
>>> them due to its code design.
>>> This patch is holding PTL from PTE0, thus, the folio will either
>>> be entirely reclaimed or entirely kept. On the other hand, this
>>> approach doesn't increase PTL contention. Even w/o the patch,
>>> page_vma_mapped_walk() will always get PTL after it sometimes
>>> skips one or two PTEs because intermediate break-before-makes
>>> are short, according to test. Of course, even w/o this patch,
>>> the vast majority of try_to_unmap_one still can get PTL from
>>> PTE0. This patch makes the number 100%.
>>> The other option is that we can give up in try_to_unmap_one
>>> once we find PTE0 is not the first entry we get PTL, we call
>>> page_vma_mapped_walk_done() to end the iteration at this case.
>>> This will keep the unified PTEs while the folio isn't reclaimed.
>>> The result is quite similar with small folios with one PTE -
>>> either entirely reclaimed or entirely kept.
>>> Reclaiming large folios by holding PTL from PTE0 seems a better
>>> option comparing to giving up after detecting PTL begins from
>>> non-PTE0.
>>>
> 
> I'm sure that wall of text can be formatted in a better way :) . Also, I think
> we can drop some of the details,
> 
> If you need some inspiration, I can give it a shot.
> 
>>> Cc: Hugh Dickins <hughd@google.com>
>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>
>> Do we need a Fixes tag?
>>
> 
> What would be the description of the problem we are fixing?
> 
> 1) failing to unmap?
> 
> That can happen with small folios as well IIUC.
> 
> 2) Putting the large folio on the deferred split queue?
> 
> That sounds more reasonable.

Isn't the real problem today that we can end up writing a THP to the swap file
(so 2M more IO and space used) but we can't remove it from memory, so no actual
reclaim happens? Although I guess your (2) is really just another way of saying
that.

> 
>>> ---
>>>   mm/vmscan.c | 11 +++++++++++
>>>   1 file changed, 11 insertions(+)
>>>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index 0b888a2afa58..e4722fbbcd0c 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head
>>> *folio_list,
>>>                 if (folio_test_pmd_mappable(folio))
>>>                   flags |= TTU_SPLIT_HUGE_PMD;
>>> +            /*
>>> +             * if page table lock is not held from the first PTE of
>>> +             * a large folio, some PTEs might be skipped because of
>>> +             * races with break-before-make, for example, PTEs can
>>> +             * be pte_none intermediately, thus one or more PTEs
>>> +             * might be skipped in try_to_unmap_one, we might result
>>> +             * in a large folio is partially mapped and partially
>>> +             * unmapped after try_to_unmap
>>> +             */
>>> +            if (folio_test_large(folio))
>>> +                flags |= TTU_SYNC;
>>
>> This looks sensible to me after thinking about it for a while. But I also have a
>> gut feeling that there might be some more subtleties that are going over my
>> head, since I'm not expert in this area. So will leave others to provide R-b :)
>>
> 
> As we are seeing more such problems with lockless PT walks, maybe we really want
> some other special value (nonswap entry?) to indicate that a PTE this is
> currently ondergoing protection changes. So we'd avoid the pte_none()
> temporarily, if possible.
> 
> Without that, TTU_SYNC feels like the right thing to do.
>
David Hildenbrand March 4, 2024, 2:27 p.m. UTC | #4
On 04.03.24 14:03, Ryan Roberts wrote:
> On 04/03/2024 12:41, David Hildenbrand wrote:
>> On 04.03.24 13:20, Ryan Roberts wrote:
>>> Hi Barry,
>>>
>>> On 04/03/2024 10:37, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>>> PTEs modification such as break-before-make, while iterating PTEs
>>>> of a large folio, it will only begin to acquire PTL after it gets
>>>> a valid(present) PTE. break-before-make intermediately sets PTEs
>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>>>> in try_to_unmap_one().
>>>
>>> I just want to check my understanding here - I think the problem occurs for
>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>>> that I've had a look at the code and have a better understanding, I think that
>>> must be the case? And therefore this problem exists independently of my work to
>>> support swap-out of mTHP? (From your previous report I was under the impression
>>> that it only affected mTHP).
>>>
>>> Its just that the problem is becoming more pronounced because with mTHP,
>>> PTE-mapped large folios are much more common?
>>
>> That is my understanding.
>>
>>>
>>>> For example, for an anon folio, after try_to_unmap_one(), we may
>>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>>>> So folio will be still mapped, the folio fails to be reclaimed.
>>>> What’s even more worrying is, its PTEs are no longer in a unified
>>>> state. This might lead to accident folio_split() afterwards. And
>>>> since a part of PTEs are now swap entries, accessing them will
>>>> incur page fault - do_swap_page.
>>>> It creates both anxiety and more expense. While we can't avoid
>>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
>>>> a large folio, we can indeed keep away from kernel's breaking up
>>>> them due to its code design.
>>>> This patch is holding PTL from PTE0, thus, the folio will either
>>>> be entirely reclaimed or entirely kept. On the other hand, this
>>>> approach doesn't increase PTL contention. Even w/o the patch,
>>>> page_vma_mapped_walk() will always get PTL after it sometimes
>>>> skips one or two PTEs because intermediate break-before-makes
>>>> are short, according to test. Of course, even w/o this patch,
>>>> the vast majority of try_to_unmap_one still can get PTL from
>>>> PTE0. This patch makes the number 100%.
>>>> The other option is that we can give up in try_to_unmap_one
>>>> once we find PTE0 is not the first entry we get PTL, we call
>>>> page_vma_mapped_walk_done() to end the iteration at this case.
>>>> This will keep the unified PTEs while the folio isn't reclaimed.
>>>> The result is quite similar with small folios with one PTE -
>>>> either entirely reclaimed or entirely kept.
>>>> Reclaiming large folios by holding PTL from PTE0 seems a better
>>>> option comparing to giving up after detecting PTL begins from
>>>> non-PTE0.
>>>>
>>
>> I'm sure that wall of text can be formatted in a better way :) . Also, I think
>> we can drop some of the details,
>>
>> If you need some inspiration, I can give it a shot.
>>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>
>>> Do we need a Fixes tag?
>>>
>>
>> What would be the description of the problem we are fixing?
>>
>> 1) failing to unmap?
>>
>> That can happen with small folios as well IIUC.
>>
>> 2) Putting the large folio on the deferred split queue?
>>
>> That sounds more reasonable.
> 
> Isn't the real problem today that we can end up writng a THP to the swap file
> (so 2M more IO and space used) but we can't remove it from memory, so no actual
> reclaim happens? Although I guess your (2) is really just another way of saying
> that.

The same could happen with small folios I believe? We might end up 
running into the

folio_mapped()

after the try_to_unmap().

Note that the actual I/O does not happen during add_to_swap(), but 
during the pageout() call when we find the folio to be dirty.

So there would not actually be more I/O. Only swap space would be
reserved, and it would be used later when not running into the race.
Barry Song March 4, 2024, 8:42 p.m. UTC | #5
On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.03.24 14:03, Ryan Roberts wrote:
> > On 04/03/2024 12:41, David Hildenbrand wrote:
> >> On 04.03.24 13:20, Ryan Roberts wrote:
> >>> Hi Barry,
> >>>
> >>> On 04/03/2024 10:37, Barry Song wrote:
> >>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>
> >>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >>>> PTEs modification such as break-before-make, while iterating PTEs
> >>>> of a large folio, it will only begin to acquire PTL after it gets
> >>>> a valid(present) PTE. break-before-make intermediately sets PTEs
> >>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >>>> in try_to_unmap_one().
> >>>
> >>> I just want to check my understanding here - I think the problem occurs for
> >>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> >>> that I've had a look at the code and have a better understanding, I think that
> >>> must be the case? And therefore this problem exists independently of my work to
> >>> support swap-out of mTHP? (From your previous report I was under the impression
> >>> that it only affected mTHP).
> >>>
> >>> Its just that the problem is becoming more pronounced because with mTHP,
> >>> PTE-mapped large folios are much more common?
> >>
> >> That is my understanding.
> >>
> >>>
> >>>> For example, for an anon folio, after try_to_unmap_one(), we may
> >>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> >>>> So folio will be still mapped, the folio fails to be reclaimed.
> >>>> What’s even more worrying is, its PTEs are no longer in a unified
> >>>> state. This might lead to accident folio_split() afterwards. And
> >>>> since a part of PTEs are now swap entries, accessing them will
> >>>> incur page fault - do_swap_page.
> >>>> It creates both anxiety and more expense. While we can't avoid
> >>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
> >>>> a large folio, we can indeed keep away from kernel's breaking up
> >>>> them due to its code design.
> >>>> This patch is holding PTL from PTE0, thus, the folio will either
> >>>> be entirely reclaimed or entirely kept. On the other hand, this
> >>>> approach doesn't increase PTL contention. Even w/o the patch,
> >>>> page_vma_mapped_walk() will always get PTL after it sometimes
> >>>> skips one or two PTEs because intermediate break-before-makes
> >>>> are short, according to test. Of course, even w/o this patch,
> >>>> the vast majority of try_to_unmap_one still can get PTL from
> >>>> PTE0. This patch makes the number 100%.
> >>>> The other option is that we can give up in try_to_unmap_one
> >>>> once we find PTE0 is not the first entry we get PTL, we call
> >>>> page_vma_mapped_walk_done() to end the iteration at this case.
> >>>> This will keep the unified PTEs while the folio isn't reclaimed.
> >>>> The result is quite similar with small folios with one PTE -
> >>>> either entirely reclaimed or entirely kept.
> >>>> Reclaiming large folios by holding PTL from PTE0 seems a better
> >>>> option comparing to giving up after detecting PTL begins from
> >>>> non-PTE0.
> >>>>
> >>
> >> I'm sure that wall of text can be formatted in a better way :) . Also, I think
> >> we can drop some of the details,
> >>
> >> If you need some inspiration, I can give it a shot.
> >>
> >>>> Cc: Hugh Dickins <hughd@google.com>
> >>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> Do we need a Fixes tag?
> >>>
> >>
> >> What would be the description of the problem we are fixing?
> >>
> >> 1) failing to unmap?
> >>
> >> That can happen with small folios as well IIUC.
> >>
> >> 2) Putting the large folio on the deferred split queue?
> >>
> >> That sounds more reasonable.
> >
> > Isn't the real problem today that we can end up writng a THP to the swap file
> > (so 2M more IO and space used) but we can't remove it from memory, so no actual
> > reclaim happens? Although I guess your (2) is really just another way of saying
> > that.
>
> The same could happen with small folios I believe? We might end up
> running into the
>
> folio_mapped()
>
> after the try_to_unmap().
>
> Note that the actual I/O does not happen during add_to_swap(), but
> during the pageout() call when we find the folio to be dirty.
>
> So there would not actually be more I/O. Only swap space would be
> reserved, that would be used later when not running into the race.

I am not worried about small folios at all, as they have only one PTE,
so the PTE is either completely unmapped or completely mapped.

For large folios it is a different story. For example, with a large
folio of 16 PTEs mapped with CONT-PTE, we can end up with:

1. an unfolded CONT-PTE, e.g. PTE0 present while PTE1-PTE15 are swap entries;

2. page faults on PTE1-PTE15 after try_to_unmap if we access them -
entirely useless page faults that could be avoided had try_to_unmap
done its job properly in the first place;

3. a potential need to split the large folio afterwards; for example,
MADV_PAGEOUT or MADV_FREE might split it after finding it is not
completely mapped.
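
Concretely, the post-race layout for that 16-PTE folio looks roughly
like this:

	PTE0        : present (folio still mapped, CONT-PTE folding lost)
	PTE1..PTE15 : swap entries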

For small folios, we don't have any of the above concerns.

>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
David Hildenbrand March 4, 2024, 9:02 p.m. UTC | #6
On 04.03.24 21:42, Barry Song wrote:
> On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.03.24 14:03, Ryan Roberts wrote:
>>> On 04/03/2024 12:41, David Hildenbrand wrote:
>>>> On 04.03.24 13:20, Ryan Roberts wrote:
>>>>> Hi Barry,
>>>>>
>>>>> On 04/03/2024 10:37, Barry Song wrote:
>>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>>
>>>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>>>>> PTEs modification such as break-before-make, while iterating PTEs
>>>>>> of a large folio, it will only begin to acquire PTL after it gets
>>>>>> a valid(present) PTE. break-before-make intermediately sets PTEs
>>>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>>>>>> in try_to_unmap_one().
>>>>>
>>>>> I just want to check my understanding here - I think the problem occurs for
>>>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>>>>> that I've had a look at the code and have a better understanding, I think that
>>>>> must be the case? And therefore this problem exists independently of my work to
>>>>> support swap-out of mTHP? (From your previous report I was under the impression
>>>>> that it only affected mTHP).
>>>>>
>>>>> Its just that the problem is becoming more pronounced because with mTHP,
>>>>> PTE-mapped large folios are much more common?
>>>>
>>>> That is my understanding.
>>>>
>>>>>
>>>>>> For example, for an anon folio, after try_to_unmap_one(), we may
>>>>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>>>>>> So folio will be still mapped, the folio fails to be reclaimed.
>>>>>> What’s even more worrying is, its PTEs are no longer in a unified
>>>>>> state. This might lead to accident folio_split() afterwards. And
>>>>>> since a part of PTEs are now swap entries, accessing them will
>>>>>> incur page fault - do_swap_page.
>>>>>> It creates both anxiety and more expense. While we can't avoid
>>>>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
>>>>>> a large folio, we can indeed keep away from kernel's breaking up
>>>>>> them due to its code design.
>>>>>> This patch is holding PTL from PTE0, thus, the folio will either
>>>>>> be entirely reclaimed or entirely kept. On the other hand, this
>>>>>> approach doesn't increase PTL contention. Even w/o the patch,
>>>>>> page_vma_mapped_walk() will always get PTL after it sometimes
>>>>>> skips one or two PTEs because intermediate break-before-makes
>>>>>> are short, according to test. Of course, even w/o this patch,
>>>>>> the vast majority of try_to_unmap_one still can get PTL from
>>>>>> PTE0. This patch makes the number 100%.
>>>>>> The other option is that we can give up in try_to_unmap_one
>>>>>> once we find PTE0 is not the first entry we get PTL, we call
>>>>>> page_vma_mapped_walk_done() to end the iteration at this case.
>>>>>> This will keep the unified PTEs while the folio isn't reclaimed.
>>>>>> The result is quite similar with small folios with one PTE -
>>>>>> either entirely reclaimed or entirely kept.
>>>>>> Reclaiming large folios by holding PTL from PTE0 seems a better
>>>>>> option comparing to giving up after detecting PTL begins from
>>>>>> non-PTE0.
>>>>>>
>>>>
>>>> I'm sure that wall of text can be formatted in a better way :) . Also, I think
>>>> we can drop some of the details,
>>>>
>>>> If you need some inspiration, I can give it a shot.
>>>>
>>>>>> Cc: Hugh Dickins <hughd@google.com>
>>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> Do we need a Fixes tag?
>>>>>
>>>>
>>>> What would be the description of the problem we are fixing?
>>>>
>>>> 1) failing to unmap?
>>>>
>>>> That can happen with small folios as well IIUC.
>>>>
>>>> 2) Putting the large folio on the deferred split queue?
>>>>
>>>> That sounds more reasonable.
>>>
>>> Isn't the real problem today that we can end up writng a THP to the swap file
>>> (so 2M more IO and space used) but we can't remove it from memory, so no actual
>>> reclaim happens? Although I guess your (2) is really just another way of saying
>>> that.
>>
>> The same could happen with small folios I believe? We might end up
>> running into the
>>
>> folio_mapped()
>>
>> after the try_to_unmap().
>>
>> Note that the actual I/O does not happen during add_to_swap(), but
>> during the pageout() call when we find the folio to be dirty.
>>
>> So there would not actually be more I/O. Only swap space would be
>> reserved, that would be used later when not running into the race.
> 
> I am not worried about small folios at all as they have only one PTE.
> so the PTE is either completely unmapped or completely mapped.
> 
> In terms of large folios, it is a different story. for example, a large
> folio with 16 PTEs with CONT-PTE, we will have
> 
> 1. unfolded CONT-PTE, eg. PTE0 present, PTE1-PTE15 swap entries
> 
> 2. page faults on PTE1-PTE15 after try_to_unmap if we access them.
> 
> This is totally useless PF and can be avoided if we can try_to_unmap
> properly at the beginning.
> 
> 3. potential need to split a large folio afterwards. for example, MADV_PAGEOUT,
> MADV_FREE might split it after finding it is not completely mapped.
> 
> For small folios, we don't have any concern on the above issues.

Right, but when we talk about "Fixes:", what exactly do we consider
"really broken" above and what is "undesired"?

(a) is there a correctness issue? I don't think so.

(b) is there a real performance issue? I'd like to understand.

After all, we've been living with that ever since we supported THP_SWAP, 
correct? "something does not work ideally in some corner cases" might be 
reasonable to handle here (and I really think we should), but might not 
be worth a "Fixes:".

So if we could clarify that, it would be great.
Barry Song March 4, 2024, 9:04 p.m. UTC | #7
On Tue, Mar 5, 2024 at 1:41 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.03.24 13:20, Ryan Roberts wrote:
> > Hi Barry,
> >
> > On 04/03/2024 10:37, Barry Song wrote:
> >> From: Barry Song <v-songbaohua@oppo.com>
> >>
> >> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >> PTEs modification such as break-before-make, while iterating PTEs
> >> of a large folio, it will only begin to acquire PTL after it gets
> >> a valid(present) PTE. break-before-make intermediately sets PTEs
> >> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >> in try_to_unmap_one().
> >
> > I just want to check my understanding here - I think the problem occurs for
> > PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> > that I've had a look at the code and have a better understanding, I think that
> > must be the case? And therefore this problem exists independently of my work to
> > support swap-out of mTHP? (From your previous report I was under the impression
> > that it only affected mTHP).
> >
> > Its just that the problem is becoming more pronounced because with mTHP,
> > PTE-mapped large folios are much more common?
>
> That is my understanding.
>
> >
> >> For example, for an anon folio, after try_to_unmap_one(), we may
> >> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> >> So folio will be still mapped, the folio fails to be reclaimed.
> >> What’s even more worrying is, its PTEs are no longer in a unified
> >> state. This might lead to accident folio_split() afterwards. And
> >> since a part of PTEs are now swap entries, accessing them will
> >> incur page fault - do_swap_page.
> >> It creates both anxiety and more expense. While we can't avoid
> >> userspace's unmap to break up unified PTEs such as CONT-PTE for
> >> a large folio, we can indeed keep away from kernel's breaking up
> >> them due to its code design.
> >> This patch is holding PTL from PTE0, thus, the folio will either
> >> be entirely reclaimed or entirely kept. On the other hand, this
> >> approach doesn't increase PTL contention. Even w/o the patch,
> >> page_vma_mapped_walk() will always get PTL after it sometimes
> >> skips one or two PTEs because intermediate break-before-makes
> >> are short, according to test. Of course, even w/o this patch,
> >> the vast majority of try_to_unmap_one still can get PTL from
> >> PTE0. This patch makes the number 100%.
> >> The other option is that we can give up in try_to_unmap_one
> >> once we find PTE0 is not the first entry we get PTL, we call
> >> page_vma_mapped_walk_done() to end the iteration at this case.
> >> This will keep the unified PTEs while the folio isn't reclaimed.
> >> The result is quite similar with small folios with one PTE -
> >> either entirely reclaimed or entirely kept.
> >> Reclaiming large folios by holding PTL from PTE0 seems a better
> >> option comparing to giving up after detecting PTL begins from
> >> non-PTE0.
> >>
>
> I'm sure that wall of text can be formatted in a better way :) . Also, I
> think we can drop some of the details,
>
> If you need some inspiration, I can give it a shot.
>
> >> Cc: Hugh Dickins <hughd@google.com>
> >> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >
> > Do we need a Fixes tag?

I am not quite sure which commit a Fixes tag would point to here.
I think it's more of an optimization.

> >
>
> What would be the description of the problem we are fixing?
>
> 1) failing to unmap?
>
> That can happen with small folios as well IIUC.
>
> 2) Putting the large folio on the deferred split queue?
>
> That sounds more reasonable.

I don't feel it is reasonable. Avoiding this kind of accidental
splitting caused by the kernel's own code is the more reasonable
approach, as there is always a price to pay for splitting and for
unfolding PTEs, etc.

While we can't avoid the splitting that comes from userspace's
MADV_DONTNEED, munmap or mprotect, we do have a way to ensure the
kernel itself doesn't accidentally break up a large folio.

In OPPO's phones, we ran into some weird bugs due to skipped PTEs in
try_to_unmap_one. We could hardly fix them at the root cause; with the
various races involved, figuring out their timings was a real pain :-)

But we did "resolve" those bugs by leaving all PTEs entirely untouched
if we found that some PTEs had been skipped in try_to_unmap_one [1].

When we find that we only took the PTL starting from the 2nd or 3rd
PTE rather than the 1st, we give up on try_to_unmap_one entirely and
leave all PTEs untouched:

/* we are not starting from head */
if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
	ret = false;
	atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
	/* restore the original PTE and stop the rmap walk for this folio */
	set_pte_at(mm, address, pvmw.pte, pteval);
	page_vma_mapped_walk_done(&pvmw);
	break;
}
This ensures all PTEs still have a unified state, such as CONT-PTE,
after try_to_unmap fails.
I feel this could produce some false positives, because when racing
with an unmap the 1st PTE might genuinely become pte_none. So
explicitly holding the PTL from the 1st PTE seems the better way.

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/rmap.c#L1730

>
> >> ---
> >>   mm/vmscan.c | 11 +++++++++++
> >>   1 file changed, 11 insertions(+)
> >>
> >> diff --git a/mm/vmscan.c b/mm/vmscan.c
> >> index 0b888a2afa58..e4722fbbcd0c 100644
> >> --- a/mm/vmscan.c
> >> +++ b/mm/vmscan.c
> >> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >>
> >>                      if (folio_test_pmd_mappable(folio))
> >>                              flags |= TTU_SPLIT_HUGE_PMD;
> >> +                    /*
> >> +                     * if page table lock is not held from the first PTE of
> >> +                     * a large folio, some PTEs might be skipped because of
> >> +                     * races with break-before-make, for example, PTEs can
> >> +                     * be pte_none intermediately, thus one or more PTEs
> >> +                     * might be skipped in try_to_unmap_one, we might result
> >> +                     * in a large folio is partially mapped and partially
> >> +                     * unmapped after try_to_unmap
> >> +                     */
> >> +                    if (folio_test_large(folio))
> >> +                            flags |= TTU_SYNC;
> >
> > This looks sensible to me after thinking about it for a while. But I also have a
> > gut feeling that there might be some more subtleties that are going over my
> > head, since I'm not expert in this area. So will leave others to provide R-b :)
> >
>
> As we are seeing more such problems with lockless PT walks, maybe we
> really want some other special value (nonswap entry?) to indicate that a
> PTE this is currently ondergoing protection changes. So we'd avoid the
> pte_none() temporarily, if possible.
>
> Without that, TTU_SYNC feels like the right thing to do.
>
> --
> Cheers,
>
> David / dhildenb
>

Thanks
Barry
David Hildenbrand March 4, 2024, 9:15 p.m. UTC | #8
>>> Do we need a Fixes tag?
> 
> I am not quite sure which commit should be here for a fixes tag.
> I think it's more of an optimization.

Good, that helps!

> 
>>>
>>
>> What would be the description of the problem we are fixing?
>>
>> 1) failing to unmap?
>>
>> That can happen with small folios as well IIUC.
>>
>> 2) Putting the large folio on the deferred split queue?
>>
>> That sounds more reasonable.
> 
> I don't feel it is reasonable. Avoiding this kind of accident splitting
> from the kernel's improper code is a more reasonable approach
> as there is always a price to pay for splitting and unfolding PTEs
> etc.
> 
> While we can't avoid splitting coming from userspace's
> MADV_DONTNEED, munmap, mprotect, we have a way
> to ensure the kernel itself doesn't accidently break up a
> large folio.

Note that on the next vmscan we would retry, find the remaining present 
entries and swapout that thing completely :)

> 
> In OPPO's phones, we ran into some weird bugs due to skipped PTEs
> in try_to_unmap_one. hardly could we fix it from the root cause. with
> various races, figuring out their timings was really a big pain :-)
> 

I can imagine. I assume, though, that it might be related to the way the 
cont-pte bit was handled. Ryan's implementation should be able to cope 
with that.

> But we did "resolve" those bugs by entirely untouching all PTEs if we
> found some PTEs were skipped in try_to_unmap_one [1].
> 
> While we find we only get the PTL from 2nd, 3rd but not
> 1st PTE, we entirely give up on try_to_unmap_one, and leave
> all PTEs untouched.
> 
> /* we are not starting from head */
> if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
>                     ret = false;
>                     atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
>                     set_pte_at(mm, address, pvmw.pte, pteval);
>                     page_vma_mapped_walk_done(&pvmw);
>                     break;
> }
> This will ensure all PTEs still have a unified state such as CONT-PTE
> after try_to_unmap fails.
> I feel this could have some false postive because when racing
> with unmap, 1st PTE might really become pte_none. So explicitly
> holding PTL from 1st PTE seems a better way.

Can we estimate the "cost" of holding the PTL?
Barry Song March 4, 2024, 9:41 p.m. UTC | #9
On Tue, Mar 5, 2024 at 10:02 AM David Hildenbrand <david@redhat.com> wrote:
>
> On 04.03.24 21:42, Barry Song wrote:
> > On Tue, Mar 5, 2024 at 3:27 AM David Hildenbrand <david@redhat.com> wrote:
> >>
> >> On 04.03.24 14:03, Ryan Roberts wrote:
> >>> On 04/03/2024 12:41, David Hildenbrand wrote:
> >>>> On 04.03.24 13:20, Ryan Roberts wrote:
> >>>>> Hi Barry,
> >>>>>
> >>>>> On 04/03/2024 10:37, Barry Song wrote:
> >>>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>>
> >>>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >>>>>> PTEs modification such as break-before-make, while iterating PTEs
> >>>>>> of a large folio, it will only begin to acquire PTL after it gets
> >>>>>> a valid(present) PTE. break-before-make intermediately sets PTEs
> >>>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >>>>>> in try_to_unmap_one().
> >>>>>
> >>>>> I just want to check my understanding here - I think the problem occurs for
> >>>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> >>>>> that I've had a look at the code and have a better understanding, I think that
> >>>>> must be the case? And therefore this problem exists independently of my work to
> >>>>> support swap-out of mTHP? (From your previous report I was under the impression
> >>>>> that it only affected mTHP).
> >>>>>
> >>>>> Its just that the problem is becoming more pronounced because with mTHP,
> >>>>> PTE-mapped large folios are much more common?
> >>>>
> >>>> That is my understanding.
> >>>>
> >>>>>
> >>>>>> For example, for an anon folio, after try_to_unmap_one(), we may
> >>>>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> >>>>>> So folio will be still mapped, the folio fails to be reclaimed.
> >>>>>> What’s even more worrying is, its PTEs are no longer in a unified
> >>>>>> state. This might lead to accident folio_split() afterwards. And
> >>>>>> since a part of PTEs are now swap entries, accessing them will
> >>>>>> incur page fault - do_swap_page.
> >>>>>> It creates both anxiety and more expense. While we can't avoid
> >>>>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
> >>>>>> a large folio, we can indeed keep away from kernel's breaking up
> >>>>>> them due to its code design.
> >>>>>> This patch is holding PTL from PTE0, thus, the folio will either
> >>>>>> be entirely reclaimed or entirely kept. On the other hand, this
> >>>>>> approach doesn't increase PTL contention. Even w/o the patch,
> >>>>>> page_vma_mapped_walk() will always get PTL after it sometimes
> >>>>>> skips one or two PTEs because intermediate break-before-makes
> >>>>>> are short, according to test. Of course, even w/o this patch,
> >>>>>> the vast majority of try_to_unmap_one still can get PTL from
> >>>>>> PTE0. This patch makes the number 100%.
> >>>>>> The other option is that we can give up in try_to_unmap_one
> >>>>>> once we find PTE0 is not the first entry we get PTL, we call
> >>>>>> page_vma_mapped_walk_done() to end the iteration at this case.
> >>>>>> This will keep the unified PTEs while the folio isn't reclaimed.
> >>>>>> The result is quite similar with small folios with one PTE -
> >>>>>> either entirely reclaimed or entirely kept.
> >>>>>> Reclaiming large folios by holding PTL from PTE0 seems a better
> >>>>>> option comparing to giving up after detecting PTL begins from
> >>>>>> non-PTE0.
> >>>>>>
> >>>>
> >>>> I'm sure that wall of text can be formatted in a better way :) . Also, I think
> >>>> we can drop some of the details,
> >>>>
> >>>> If you need some inspiration, I can give it a shot.
> >>>>
> >>>>>> Cc: Hugh Dickins <hughd@google.com>
> >>>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> >>>>>
> >>>>> Do we need a Fixes tag?
> >>>>>
> >>>>
> >>>> What would be the description of the problem we are fixing?
> >>>>
> >>>> 1) failing to unmap?
> >>>>
> >>>> That can happen with small folios as well IIUC.
> >>>>
> >>>> 2) Putting the large folio on the deferred split queue?
> >>>>
> >>>> That sounds more reasonable.
> >>>
> >>> Isn't the real problem today that we can end up writng a THP to the swap file
> >>> (so 2M more IO and space used) but we can't remove it from memory, so no actual
> >>> reclaim happens? Although I guess your (2) is really just another way of saying
> >>> that.
> >>
> >> The same could happen with small folios I believe? We might end up
> >> running into the
> >>
> >> folio_mapped()
> >>
> >> after the try_to_unmap().
> >>
> >> Note that the actual I/O does not happen during add_to_swap(), but
> >> during the pageout() call when we find the folio to be dirty.
> >>
> >> So there would not actually be more I/O. Only swap space would be
> >> reserved, that would be used later when not running into the race.
> >
> > I am not worried about small folios at all as they have only one PTE.
> > so the PTE is either completely unmapped or completely mapped.
> >
> > In terms of large folios, it is a different story. for example, a large
> > folio with 16 PTEs with CONT-PTE, we will have
> >
> > 1. unfolded CONT-PTE, eg. PTE0 present, PTE1-PTE15 swap entries
> >
> > 2. page faults on PTE1-PTE15 after try_to_unmap if we access them.
> >
> > This is totally useless PF and can be avoided if we can try_to_unmap
> > properly at the beginning.
> >
> > 3. potential need to split a large folio afterwards. for example, MADV_PAGEOUT,
> > MADV_FREE might split it after finding it is not completely mapped.
> >
> > For small folios, we don't have any concern on the above issues.
>
> Right, but when we talk about "Fixes:", what exactly are we consider
> "really broken" above and what is "undesired"?
>
> (a) is there a correctness issue? I don't think so.
>
> (b) is there a real performance issue? I'd like to understand.
>
> After all, we've been living with that ever since we supported THP_SWAP,
> correct? "something does not work ideally in some corner cases" might be
> reasonable to handle here (and I really think we should), but might not
> be worth a "Fixes:".
>
> So if we could clarify that, it would be great.

I don't think this needs a Fixes tag; I would consider it an
optimization for corner cases. I also don't expect a noticeable overall
performance improvement, since the race doesn't happen very often. It
does improve the performance of the corner cases themselves, but if
those account for only 0.1% of all try_to_unmap_one calls, no
measurable improvement will be seen.

I feel it is more like swatting flies: flies don't kill people, but they
can be quite annoying at times :-) I ran into another bug in the
large-folio swap-in series due to this problem: 16 PTEs had contiguous
swap entries from Ryan's add_to_swap(), but they were split in
MADV_PAGEOUT because the folio was not completely mapped after
try_to_unmap_one.

Then after some time, some of the 16 pages were in the swapcache while
others were not. In that case I couldn't swap them in together and had
to handle the page faults one by one; I was handling it incorrectly,
trying to swap them in together by reading the swapfile. Had they been
handled atomically in try_to_unmap_one, we could have avoided this.

We may have to cope with this kind of problem from time to time in
future work.

>
> --
> Cheers,
>
> David / dhildenb

Thanks
Barry
Barry Song March 4, 2024, 9:57 p.m. UTC | #10
On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> Hi Barry,
>
> On 04/03/2024 10:37, Barry Song wrote:
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > page_vma_mapped_walk() within try_to_unmap_one() races with other
> > PTEs modification such as break-before-make, while iterating PTEs
> > of a large folio, it will only begin to acquire PTL after it gets
> > a valid(present) PTE. break-before-make intermediately sets PTEs
> > to pte_none. Thus, a large folio's PTEs might be partially skipped
> > in try_to_unmap_one().
>
> I just want to check my understanding here - I think the problem occurs for
> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> that I've had a look at the code and have a better understanding, I think that
> must be the case? And therefore this problem exists independently of my work to
> support swap-out of mTHP? (From your previous report I was under the impression
> that it only affected mTHP).

I think this affects all large folios mapped by more than one PTE. But
hugeTLB is handled as a whole in try_to_unmap_one and its rmap is
removed altogether, so I feel hugeTLB doesn't have this problem.

>
> Its just that the problem is becoming more pronounced because with mTHP,
> PTE-mapped large folios are much more common?

Right, as large folios are now becoming a much more common case, and it
is my case, running on millions of phones.

BTW, I feel we can somehow learn from hugeTLB here. For example, we
could reclaim all of a large folio's PTEs together rather than
iterating them one by one, which would improve performance - say, a
batched set_ptes_to_swap_entries(). Then we would only need to loop
once per large folio, whereas right now we loop nr_pages times.
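
A rough sketch of what such a helper could look like (hypothetical:
set_ptes_to_swap_entries() does not exist upstream, and the name and
signature here are only illustrative):

	/*
	 * Install nr consecutive swap entries for a large folio in one
	 * rmap-walk step, under a single PTL acquisition, instead of having
	 * the walk come back once per PTE.
	 */
	static void set_ptes_to_swap_entries(struct mm_struct *mm,
					     unsigned long addr, pte_t *pte,
					     swp_entry_t entry, int nr)
	{
		int i;

		for (i = 0; i < nr; i++, addr += PAGE_SIZE, pte++) {
			swp_entry_t e = swp_entry(swp_type(entry),
						  swp_offset(entry) + i);

			set_pte_at(mm, addr, pte, swp_entry_to_pte(e));
		}
	}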

>
> > For example, for an anon folio, after try_to_unmap_one(), we may
> > have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> > So folio will be still mapped, the folio fails to be reclaimed.
> > What’s even more worrying is, its PTEs are no longer in a unified
> > state. This might lead to accident folio_split() afterwards. And
> > since a part of PTEs are now swap entries, accessing them will
> > incur page fault - do_swap_page.
> > It creates both anxiety and more expense. While we can't avoid
> > userspace's unmap to break up unified PTEs such as CONT-PTE for
> > a large folio, we can indeed keep away from kernel's breaking up
> > them due to its code design.
> > This patch is holding PTL from PTE0, thus, the folio will either
> > be entirely reclaimed or entirely kept. On the other hand, this
> > approach doesn't increase PTL contention. Even w/o the patch,
> > page_vma_mapped_walk() will always get PTL after it sometimes
> > skips one or two PTEs because intermediate break-before-makes
> > are short, according to test. Of course, even w/o this patch,
> > the vast majority of try_to_unmap_one still can get PTL from
> > PTE0. This patch makes the number 100%.
> > The other option is that we can give up in try_to_unmap_one
> > once we find PTE0 is not the first entry we get PTL, we call
> > page_vma_mapped_walk_done() to end the iteration at this case.
> > This will keep the unified PTEs while the folio isn't reclaimed.
> > The result is quite similar with small folios with one PTE -
> > either entirely reclaimed or entirely kept.
> > Reclaiming large folios by holding PTL from PTE0 seems a better
> > option comparing to giving up after detecting PTL begins from
> > non-PTE0.
> >
> > Cc: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>
> Do we need a Fixes tag?

I don't feel a strong need for it, as this doesn't cause a crash, a
memory leak, or anything similarly serious.

>
> > ---
> >  mm/vmscan.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0b888a2afa58..e4722fbbcd0c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >
> >                       if (folio_test_pmd_mappable(folio))
> >                               flags |= TTU_SPLIT_HUGE_PMD;
> > +                     /*
> > +                      * if page table lock is not held from the first PTE of
> > +                      * a large folio, some PTEs might be skipped because of
> > +                      * races with break-before-make, for example, PTEs can
> > +                      * be pte_none intermediately, thus one or more PTEs
> > +                      * might be skipped in try_to_unmap_one, we might result
> > +                      * in a large folio is partially mapped and partially
> > +                      * unmapped after try_to_unmap
> > +                      */
> > +                     if (folio_test_large(folio))
> > +                             flags |= TTU_SYNC;
>
> This looks sensible to me after thinking about it for a while. But I also have a
> gut feeling that there might be some more subtleties that are going over my
> head, since I'm not expert in this area. So will leave others to provide R-b :)
>

ok, thanks :-)

> Thanks,
> Ryan
>
> >
> >                       try_to_unmap(folio, flags);
> >                       if (folio_mapped(folio)) {
>

Thanks
Barry
Ryan Roberts March 4, 2024, 10:02 p.m. UTC | #11
On 04/03/2024 21:04, Barry Song wrote:
> On Tue, Mar 5, 2024 at 1:41 AM David Hildenbrand <david@redhat.com> wrote:
>>
>> On 04.03.24 13:20, Ryan Roberts wrote:
>>> Hi Barry,
>>>
>>> On 04/03/2024 10:37, Barry Song wrote:
>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>
>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>>> PTEs modification such as break-before-make, while iterating PTEs
>>>> of a large folio, it will only begin to acquire PTL after it gets
>>>> a valid(present) PTE. break-before-make intermediately sets PTEs
>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>>>> in try_to_unmap_one().
>>>
>>> I just want to check my understanding here - I think the problem occurs for
>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>>> that I've had a look at the code and have a better understanding, I think that
>>> must be the case? And therefore this problem exists independently of my work to
>>> support swap-out of mTHP? (From your previous report I was under the impression
>>> that it only affected mTHP).
>>>
>>> Its just that the problem is becoming more pronounced because with mTHP,
>>> PTE-mapped large folios are much more common?
>>
>> That is my understanding.
>>
>>>
>>>> For example, for an anon folio, after try_to_unmap_one(), we may
>>>> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>>>> So folio will be still mapped, the folio fails to be reclaimed.
>>>> What’s even more worrying is, its PTEs are no longer in a unified
>>>> state. This might lead to accident folio_split() afterwards. And
>>>> since a part of PTEs are now swap entries, accessing them will
>>>> incur page fault - do_swap_page.
>>>> It creates both anxiety and more expense. While we can't avoid
>>>> userspace's unmap to break up unified PTEs such as CONT-PTE for
>>>> a large folio, we can indeed keep away from kernel's breaking up
>>>> them due to its code design.
>>>> This patch is holding PTL from PTE0, thus, the folio will either
>>>> be entirely reclaimed or entirely kept. On the other hand, this
>>>> approach doesn't increase PTL contention. Even w/o the patch,
>>>> page_vma_mapped_walk() will always get PTL after it sometimes
>>>> skips one or two PTEs because intermediate break-before-makes
>>>> are short, according to test. Of course, even w/o this patch,
>>>> the vast majority of try_to_unmap_one still can get PTL from
>>>> PTE0. This patch makes the number 100%.
>>>> The other option is that we can give up in try_to_unmap_one
>>>> once we find PTE0 is not the first entry we get PTL, we call
>>>> page_vma_mapped_walk_done() to end the iteration at this case.
>>>> This will keep the unified PTEs while the folio isn't reclaimed.
>>>> The result is quite similar with small folios with one PTE -
>>>> either entirely reclaimed or entirely kept.
>>>> Reclaiming large folios by holding PTL from PTE0 seems a better
>>>> option comparing to giving up after detecting PTL begins from
>>>> non-PTE0.
>>>>
>>
>> I'm sure that wall of text can be formatted in a better way :) . Also, I
>> think we can drop some of the details,
>>
>> If you need some inspiration, I can give it a shot.
>>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>>>
>>> Do we need a Fixes tag?

It seems my original question has snowballed a bit. I was conflating this change
with other reports Barry has made where the kernel was panicking (I think?).
Given we are not seeing any incorrect functional behaviour that this change
fixes, I agree we don't need a Fixes tag here.
Barry Song March 4, 2024, 10:29 p.m. UTC | #12
On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
>
>
> >>> Do we need a Fixes tag?
> >
> > I am not quite sure which commit should be here for a fixes tag.
> > I think it's more of an optimization.
>
> Good, that helps!
>
> >
> >>>
> >>
> >> What would be the description of the problem we are fixing?
> >>
> >> 1) failing to unmap?
> >>
> >> That can happen with small folios as well IIUC.
> >>
> >> 2) Putting the large folio on the deferred split queue?
> >>
> >> That sounds more reasonable.
> >
> > I don't feel it is reasonable. Avoiding this kind of accident splitting
> > from the kernel's improper code is a more reasonable approach
> > as there is always a price to pay for splitting and unfolding PTEs
> > etc.
> >
> > While we can't avoid splitting coming from userspace's
> > MADV_DONTNEED, munmap, mprotect, we have a way
> > to ensure the kernel itself doesn't accidently break up a
> > large folio.
>
> Note that on the next vmscan we would retry, find the remaining present
> entries and swapout that thing completely :)

This is true, but since we can finish the job the first time, a second
retry seems like an unnecessary cost :-)

>
> >
> > In OPPO's phones, we ran into some weird bugs due to skipped PTEs
> > in try_to_unmap_one. hardly could we fix it from the root cause. with
> > various races, figuring out their timings was really a big pain :-)
> >
>
> I can imagine. I assume, though, that it might be related to the way the
> cont-pte bit was handled. Ryan's implementation should be able to cope
> with that.

I guess you are probably right. Ryan's implementation decouples CONT-PTE
handling from the mm core, which is nice to have.

>
> > But we did "resolve" those bugs by entirely untouching all PTEs if we
> > found some PTEs were skipped in try_to_unmap_one [1].
> >
> > While we find we only get the PTL from 2nd, 3rd but not
> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
> > all PTEs untouched.
> >
> > /* we are not starting from head */
> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> >                     ret = false;
> >                     atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> >                     set_pte_at(mm, address, pvmw.pte, pteval);
> >                     page_vma_mapped_walk_done(&pvmw);
> >                     break;
> > }
> > This will ensure all PTEs still have a unified state such as CONT-PTE
> > after try_to_unmap fails.
> > I feel this could have some false postive because when racing
> > with unmap, 1st PTE might really become pte_none. So explicitly
> > holding PTL from 1st PTE seems a better way.
>
> Can we estimate the "cost" of holding the PTL?
>

This just moves PTL acquisition one or two PTEs earlier in those corner
cases. In normal cases, it doesn't affect when the PTL is held.

In normal cases, page_vma_mapped_walk() will find PTE0 present and thus
hold the PTL immediately. In corner cases, page_vma_mapped_walk() races
with break-before-make; after skipping one or two PTEs whose states are
in transition, it will find a present PTE and then acquire the lock.
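
To make the corner case concrete, here is a rough sketch of the
interleaving I mean (simplified: the writer side is roughly what an
mprotect-style update does under the PTL, the walker side is only a
caricature of page_vma_mapped_walk(), and mm/addr/ptep/nr_pages/newprot
are placeholders):

	/* Thread A: modify nr_pages PTEs of the folio, under the PTL */
	spin_lock(ptl);
	for (i = 0; i < nr_pages; i++, ptep++, addr += PAGE_SIZE) {
		/* "break": the PTE is transiently pte_none here */
		pte_t old = ptep_get_and_clear(mm, addr, ptep);

		/* "make": write the updated entry back */
		set_pte_at(mm, addr, ptep, pte_modify(old, newprot));
	}
	spin_unlock(ptl);

	/*
	 * Thread B: try_to_unmap_one() walking the same folio without
	 * the PTL yet:
	 *   PTE0: seen as pte_none()  -> skipped, no lock taken
	 *   PTE1: seen as pte_none()  -> skipped, no lock taken
	 *   PTE2: seen as present     -> PTL taken here, unmap starts
	 * Result: PTE0/PTE1 stay mapped, PTE2..PTE(nr_pages - 1) become
	 * swap entries, so the folio is only partially unmapped.
	 */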

> --
> Cheers,
>
> David / dhildenb

Thanks
Barry
Huang, Ying March 5, 2024, 7:28 a.m. UTC | #13
Barry Song <21cnbao@gmail.com> writes:

> From: Barry Song <v-songbaohua@oppo.com>
>
> page_vma_mapped_walk() within try_to_unmap_one() races with other
> PTEs modification such as break-before-make, while iterating PTEs

Sorry, I don't know what "break-before-make" is, can you elaborate?
IIUC, ptep_modify_prot_start()/ptep_modify_prot_commit() can clear the PTE
temporarily, which may cause a race with page_vma_mapped_walk().  Is that
the issue that you are trying to fix?

--
Best Regards,
Huang, Ying

> of a large folio, it will only begin to acquire PTL after it gets
> a valid(present) PTE. break-before-make intermediately sets PTEs
> to pte_none. Thus, a large folio's PTEs might be partially skipped
> in try_to_unmap_one().
> For example, for an anon folio, after try_to_unmap_one(), we may
> have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> So folio will be still mapped, the folio fails to be reclaimed.
> What’s even more worrying is, its PTEs are no longer in a unified
> state. This might lead to accident folio_split() afterwards. And
> since a part of PTEs are now swap entries, accessing them will
> incur page fault - do_swap_page.
> It creates both anxiety and more expense. While we can't avoid
> userspace's unmap to break up unified PTEs such as CONT-PTE for
> a large folio, we can indeed keep away from kernel's breaking up
> them due to its code design.
> This patch is holding PTL from PTE0, thus, the folio will either
> be entirely reclaimed or entirely kept. On the other hand, this
> approach doesn't increase PTL contention. Even w/o the patch,
> page_vma_mapped_walk() will always get PTL after it sometimes
> skips one or two PTEs because intermediate break-before-makes
> are short, according to test. Of course, even w/o this patch,
> the vast majority of try_to_unmap_one still can get PTL from
> PTE0. This patch makes the number 100%.
> The other option is that we can give up in try_to_unmap_one
> once we find PTE0 is not the first entry we get PTL, we call
> page_vma_mapped_walk_done() to end the iteration at this case.
> This will keep the unified PTEs while the folio isn't reclaimed.
> The result is quite similar with small folios with one PTE -
> either entirely reclaimed or entirely kept.
> Reclaiming large folios by holding PTL from PTE0 seems a better
> option comparing to giving up after detecting PTL begins from
> non-PTE0.
>
> Cc: Hugh Dickins <hughd@google.com>
> Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> ---
>  mm/vmscan.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index 0b888a2afa58..e4722fbbcd0c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>  
>  			if (folio_test_pmd_mappable(folio))
>  				flags |= TTU_SPLIT_HUGE_PMD;
> +			/*
> +			 * if page table lock is not held from the first PTE of
> +			 * a large folio, some PTEs might be skipped because of
> +			 * races with break-before-make, for example, PTEs can
> +			 * be pte_none intermediately, thus one or more PTEs
> +			 * might be skipped in try_to_unmap_one, we might result
> +			 * in a large folio is partially mapped and partially
> +			 * unmapped after try_to_unmap
> +			 */
> +			if (folio_test_large(folio))
> +				flags |= TTU_SYNC;
>  
>  			try_to_unmap(folio, flags);
>  			if (folio_mapped(folio)) {
Huang, Ying March 5, 2024, 7:50 a.m. UTC | #14
David Hildenbrand <david@redhat.com> writes:
>
> As we are seeing more such problems with lockless PT walks, maybe we
> really want some other special value (nonswap entry?) to indicate that
> a PTE that is currently undergoing protection changes. So we'd avoid
> the pte_none() temporarily, if possible.

This sounds like a good idea.  It can solve other issues caused by
temporarily pte_none() PTEs too, like the following:

https://lore.kernel.org/linux-mm/20240229060907.836589-1-zhangpeng362@huawei.com/

--
Best Regards,
Huang, Ying
Huang, Ying March 5, 2024, 7:53 a.m. UTC | #15
Barry Song <21cnbao@gmail.com> writes:

> On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
>> > But we did "resolve" those bugs by entirely untouching all PTEs if we
>> > found some PTEs were skipped in try_to_unmap_one [1].
>> >
>> > While we find we only get the PTL from 2nd, 3rd but not
>> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
>> > all PTEs untouched.
>> >
>> > /* we are not starting from head */
>> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
>> >                     ret = false;
>> >                     atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
>> >                     set_pte_at(mm, address, pvmw.pte, pteval);
>> >                     page_vma_mapped_walk_done(&pvmw);
>> >                     break;
>> > }
>> > This will ensure all PTEs still have a unified state such as CONT-PTE
>> > after try_to_unmap fails.
>> > I feel this could have some false postive because when racing
>> > with unmap, 1st PTE might really become pte_none. So explicitly
>> > holding PTL from 1st PTE seems a better way.
>>
>> Can we estimate the "cost" of holding the PTL?
>>
>
> This is just moving PTL acquisition one or two PTE earlier in those corner
> cases. In normal cases, it doesn't affect when PTL is held.

An mTHP may be mapped at the end of the page table.  In that case, the PTL
will be held longer.  Or am I missing something?

--
Best Regards,
Huang, Ying


> In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold
> PTL immediately. in corner cases, page_vma_mapped_walk races with break-
> before-make, after skipping one or two PTEs whose states are transferring,
> it will find a present pte then acquire lock.
>
>> --
>> Cheers,
>>
>> David / dhildenb
>
> Thanks
> Barry
Ryan Roberts March 5, 2024, 8:54 a.m. UTC | #16
On 04/03/2024 21:57, Barry Song wrote:
> On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> Hi Barry,
>>
>> On 04/03/2024 10:37, Barry Song wrote:
>>> From: Barry Song <v-songbaohua@oppo.com>
>>>
>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>> PTEs modification such as break-before-make, while iterating PTEs
>>> of a large folio, it will only begin to acquire PTL after it gets
>>> a valid(present) PTE. break-before-make intermediately sets PTEs
>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>>> in try_to_unmap_one().
>>
>> I just want to check my understanding here - I think the problem occurs for
>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>> that I've had a look at the code and have a better understanding, I think that
>> must be the case? And therefore this problem exists independently of my work to
>> support swap-out of mTHP? (From your previous report I was under the impression
>> that it only affected mTHP).
> 
> I think this affects all large folios with PTEs entries more than 1. but hugeTLB
> is handled as a whole in try_to_unmap_one and its rmap is removed all
> together, i feel hugeTLB doesn't have this problem.
> 
>>
>> Its just that the problem is becoming more pronounced because with mTHP,
>> PTE-mapped large folios are much more common?
> 
> right. as now large folios become a more common case, and it is my case
> running in millions of phones.
> 
> BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim
> all PTEs all together rather than iterating PTEs one by one. This will improve
> performance. for example, a batched
> set_ptes_to_swap_entries()
> {
> }
> then we only need to loop once for a large folio, right now we are looping
> nr_pages times.

You still need a pte-pte loop somewhere. In hugetlb's case it's in the arch
implementation. HugeTLB ptes are all a fixed size for a given VMA, which makes
things a bit easier too, whereas in the regular mm, they are now a variable size.

David and I introduced folio_pte_batch() to help gather batches of ptes, and it
uses the contpte bit to avoid iterating over intermediate ptes. And I'm adding
swap_pte_batch() which does a similar thing for swap entry batching in v4 of my
swap-out series.
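
For anyone following along, the batching idea is roughly the following.
This is only a simplified illustration of the concept with a made-up
helper name, not the real folio_pte_batch()/swap_pte_batch() signatures;
the real helpers can additionally use the cont-pte hint to avoid touching
every intermediate entry:

	/*
	 * Count how many consecutive PTEs map consecutive pages, so the
	 * caller can process them as one batch instead of one at a time.
	 */
	static int count_contig_ptes(pte_t *ptep, pte_t first, int max_nr)
	{
		unsigned long next_pfn = pte_pfn(first) + 1;
		int nr = 1;

		while (nr < max_nr) {
			pte_t pte = ptep_get(ptep + nr);

			if (!pte_present(pte) || pte_pfn(pte) != next_pfn)
				break;
			nr++;
			next_pfn++;
		}
		return nr;
	}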

For your set_ptes_to_swap_entries() example, I'm not sure what it would do other
than loop over the PTEs setting an incremented swap entry to each one? How is
that more performant?

Thanks,
Ryan
Barry Song March 5, 2024, 8:56 a.m. UTC | #17
On Tue, Mar 5, 2024 at 8:30 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > From: Barry Song <v-songbaohua@oppo.com>
> >
> > page_vma_mapped_walk() within try_to_unmap_one() races with other
> > PTEs modification such as break-before-make, while iterating PTEs
>
> Sorry, I don't know what is "break-before-make", can you elaborate?
> IIUC, ptep_modify_prot_start()/ptep_modify_prot_commit() can clear PTE
> temporarily, which may cause race with page_vma_mapped_walk().  Is that
> the issue that you try to fix?

We are writing the PTE to zero (break) before writing a new value (make).
While this happens under the PTL in another thread, page_vma_mapped_walk()
in the try_to_unmap_one() thread won't take the PTL until it meets a
present PTE. For example, if another thread is modifying nr_pages PTEs
under the PTL, but we don't hold the PTL, we might skip one or two PTEs at
the beginning of a large folio. So for a large folio, after
try_to_unmap_one(), we may end up with PTE0 and PTE1 untouched but
PTE2 ~ PTE(nr_pages - 1) set to swap entries.

By holding the PTL from PTE0 for large folios, we won't observe these
intermediate values. By the time we get the PTL, the other threads are done.

>
> --
> Best Regards,
> Huang, Ying
>
> > of a large folio, it will only begin to acquire PTL after it gets
> > a valid(present) PTE. break-before-make intermediately sets PTEs
> > to pte_none. Thus, a large folio's PTEs might be partially skipped
> > in try_to_unmap_one().
> > For example, for an anon folio, after try_to_unmap_one(), we may
> > have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
> > So folio will be still mapped, the folio fails to be reclaimed.
> > What’s even more worrying is, its PTEs are no longer in a unified
> > state. This might lead to accident folio_split() afterwards. And
> > since a part of PTEs are now swap entries, accessing them will
> > incur page fault - do_swap_page.
> > It creates both anxiety and more expense. While we can't avoid
> > userspace's unmap to break up unified PTEs such as CONT-PTE for
> > a large folio, we can indeed keep away from kernel's breaking up
> > them due to its code design.
> > This patch is holding PTL from PTE0, thus, the folio will either
> > be entirely reclaimed or entirely kept. On the other hand, this
> > approach doesn't increase PTL contention. Even w/o the patch,
> > page_vma_mapped_walk() will always get PTL after it sometimes
> > skips one or two PTEs because intermediate break-before-makes
> > are short, according to test. Of course, even w/o this patch,
> > the vast majority of try_to_unmap_one still can get PTL from
> > PTE0. This patch makes the number 100%.
> > The other option is that we can give up in try_to_unmap_one
> > once we find PTE0 is not the first entry we get PTL, we call
> > page_vma_mapped_walk_done() to end the iteration at this case.
> > This will keep the unified PTEs while the folio isn't reclaimed.
> > The result is quite similar with small folios with one PTE -
> > either entirely reclaimed or entirely kept.
> > Reclaiming large folios by holding PTL from PTE0 seems a better
> > option comparing to giving up after detecting PTL begins from
> > non-PTE0.
> >
> > Cc: Hugh Dickins <hughd@google.com>
> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
> > ---
> >  mm/vmscan.c | 11 +++++++++++
> >  1 file changed, 11 insertions(+)
> >
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index 0b888a2afa58..e4722fbbcd0c 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
> >
> >                       if (folio_test_pmd_mappable(folio))
> >                               flags |= TTU_SPLIT_HUGE_PMD;
> > +                     /*
> > +                      * if page table lock is not held from the first PTE of
> > +                      * a large folio, some PTEs might be skipped because of
> > +                      * races with break-before-make, for example, PTEs can
> > +                      * be pte_none intermediately, thus one or more PTEs
> > +                      * might be skipped in try_to_unmap_one, we might result
> > +                      * in a large folio is partially mapped and partially
> > +                      * unmapped after try_to_unmap
> > +                      */
> > +                     if (folio_test_large(folio))
> > +                             flags |= TTU_SYNC;
> >
> >                       try_to_unmap(folio, flags);
> >                       if (folio_mapped(folio)) {

Thanks
Barry
Barry Song March 5, 2024, 9:02 a.m. UTC | #18
On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
> >> > But we did "resolve" those bugs by entirely untouching all PTEs if we
> >> > found some PTEs were skipped in try_to_unmap_one [1].
> >> >
> >> > While we find we only get the PTL from 2nd, 3rd but not
> >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
> >> > all PTEs untouched.
> >> >
> >> > /* we are not starting from head */
> >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> >> >                     ret = false;
> >> >                     atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> >> >                     set_pte_at(mm, address, pvmw.pte, pteval);
> >> >                     page_vma_mapped_walk_done(&pvmw);
> >> >                     break;
> >> > }
> >> > This will ensure all PTEs still have a unified state such as CONT-PTE
> >> > after try_to_unmap fails.
> >> > I feel this could have some false postive because when racing
> >> > with unmap, 1st PTE might really become pte_none. So explicitly
> >> > holding PTL from 1st PTE seems a better way.
> >>
> >> Can we estimate the "cost" of holding the PTL?
> >>
> >
> > This is just moving PTL acquisition one or two PTE earlier in those corner
> > cases. In normal cases, it doesn't affect when PTL is held.
>
> The mTHP may be mapped at the end of page table.  In that case, the PTL
> will be held longer.  Or am I missing something?

No, this patch doesn't change when we release the PTL, only when we
acquire it.

When the original code iterates the nr_pages PTEs of a large folio, it
skips invalid PTEs; when it meets a valid one, it acquires the PTL. So if
it sees intermediate PTE values that other threads are modifying, it might
skip PTE0, or sometimes PTE0 and PTE1 according to my test. But by the
time it arrives at PTE2, other threads have likely written a new value, so
we begin to hold the PTL and iterate till the end of the large folio.

The proposal is that we take the PTL directly from PTE0, so we don't
observe intermediate values for the head of the nr_pages PTEs. This
ensures a large folio is either completely unmapped or completely kept
mapped, not partially mapped and partially unmapped.
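
In other words, the difference is only in how the first PTE is looked at.
A very rough sketch of the two behaviours (heavily simplified from what
page_vma_mapped_walk() really does, which also handles pmd levels,
non-present entries and range bounds):

	/*
	 * Current (lazy) behaviour: inspect PTEs without the PTL and
	 * only take the lock once a present PTE is found.
	 */
	pte = pte_offset_map(pmd, addr);
	for (i = 0; i < nr_pages; i++, pte++, addr += PAGE_SIZE) {
		if (pte_present(ptep_get(pte)))
			break;		/* transiently-none PTEs get skipped */
	}
	ptl = pte_lockptr(mm, pmd);
	spin_lock(ptl);
	/* ... unmap from here to the end of the folio ... */

	/*
	 * With TTU_SYNC (PVMW_SYNC): take the PTL before inspecting PTE0,
	 * so any concurrent clear-then-write-back sequence has either not
	 * started or has fully finished before we look at anything.
	 */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);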

>
> --
> Best Regards,
> Huang, Ying
>
>
> > In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold
> > PTL immediately. in corner cases, page_vma_mapped_walk races with break-
> > before-make, after skipping one or two PTEs whose states are transferring,
> > it will find a present pte then acquire lock.
> >
> >> --
> >> Cheers,
> >>
> >> David / dhildenb
> >
Thanks
Barry
Huang, Ying March 5, 2024, 9:04 a.m. UTC | #19
Barry Song <21cnbao@gmail.com> writes:

> On Tue, Mar 5, 2024 at 8:30 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > From: Barry Song <v-songbaohua@oppo.com>
>> >
>> > page_vma_mapped_walk() within try_to_unmap_one() races with other
>> > PTEs modification such as break-before-make, while iterating PTEs
>>
>> Sorry, I don't know what is "break-before-make", can you elaborate?
>> IIUC, ptep_modify_prot_start()/ptep_modify_prot_commit() can clear PTE
>> temporarily, which may cause race with page_vma_mapped_walk().  Is that
>> the issue that you try to fix?
>
> we are writing pte to zero(break) before writing a new value(make).

OK.  Is "break-before-make" commonly used terminology in the kernel?  If not,
it's better to explain a little (e.g., ptep_get_and_clear() / modify /
set_pte_at()).

> while
> this behavior is within PTL in another thread,  page_vma_mapped_walk()
> of try_to_unmap_one thread won't take PTL till it meets a present PTE.

IIUC, !pte_none() should be enough?

> for example, if another threads are modifying nr_pages PTEs under PTL,
> but we don't hold PTL, we might skip one or two PTEs at the beginning of
> a large folio.
> For a large folio, after try_to_unmap_one(), we may result in PTE0 and PTE1
> untouched but PTE2~nr_pages-1 are set to swap entries.
>
> by holding PTL from PTE0 for large folios, we won't get these intermediate
> values. At the moment we get PTL, other threads have done.

Got it!  Thanks!

--
Best Regards,
Huang, Ying

>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>> > of a large folio, it will only begin to acquire PTL after it gets
>> > a valid(present) PTE. break-before-make intermediately sets PTEs
>> > to pte_none. Thus, a large folio's PTEs might be partially skipped
>> > in try_to_unmap_one().
>> > For example, for an anon folio, after try_to_unmap_one(), we may
>> > have PTE0 present, while PTE1 ~ PTE(nr_pages - 1) are swap entries.
>> > So folio will be still mapped, the folio fails to be reclaimed.
>> > What’s even more worrying is, its PTEs are no longer in a unified
>> > state. This might lead to accident folio_split() afterwards. And
>> > since a part of PTEs are now swap entries, accessing them will
>> > incur page fault - do_swap_page.
>> > It creates both anxiety and more expense. While we can't avoid
>> > userspace's unmap to break up unified PTEs such as CONT-PTE for
>> > a large folio, we can indeed keep away from kernel's breaking up
>> > them due to its code design.
>> > This patch is holding PTL from PTE0, thus, the folio will either
>> > be entirely reclaimed or entirely kept. On the other hand, this
>> > approach doesn't increase PTL contention. Even w/o the patch,
>> > page_vma_mapped_walk() will always get PTL after it sometimes
>> > skips one or two PTEs because intermediate break-before-makes
>> > are short, according to test. Of course, even w/o this patch,
>> > the vast majority of try_to_unmap_one still can get PTL from
>> > PTE0. This patch makes the number 100%.
>> > The other option is that we can give up in try_to_unmap_one
>> > once we find PTE0 is not the first entry we get PTL, we call
>> > page_vma_mapped_walk_done() to end the iteration at this case.
>> > This will keep the unified PTEs while the folio isn't reclaimed.
>> > The result is quite similar with small folios with one PTE -
>> > either entirely reclaimed or entirely kept.
>> > Reclaiming large folios by holding PTL from PTE0 seems a better
>> > option comparing to giving up after detecting PTL begins from
>> > non-PTE0.
>> >
>> > Cc: Hugh Dickins <hughd@google.com>
>> > Signed-off-by: Barry Song <v-songbaohua@oppo.com>
>> > ---
>> >  mm/vmscan.c | 11 +++++++++++
>> >  1 file changed, 11 insertions(+)
>> >
>> > diff --git a/mm/vmscan.c b/mm/vmscan.c
>> > index 0b888a2afa58..e4722fbbcd0c 100644
>> > --- a/mm/vmscan.c
>> > +++ b/mm/vmscan.c
>> > @@ -1270,6 +1270,17 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
>> >
>> >                       if (folio_test_pmd_mappable(folio))
>> >                               flags |= TTU_SPLIT_HUGE_PMD;
>> > +                     /*
>> > +                      * if page table lock is not held from the first PTE of
>> > +                      * a large folio, some PTEs might be skipped because of
>> > +                      * races with break-before-make, for example, PTEs can
>> > +                      * be pte_none intermediately, thus one or more PTEs
>> > +                      * might be skipped in try_to_unmap_one, we might result
>> > +                      * in a large folio is partially mapped and partially
>> > +                      * unmapped after try_to_unmap
>> > +                      */
>> > +                     if (folio_test_large(folio))
>> > +                             flags |= TTU_SYNC;
>> >
>> >                       try_to_unmap(folio, flags);
>> >                       if (folio_mapped(folio)) {
>
> Thanks
> Barry
Ryan Roberts March 5, 2024, 9:08 a.m. UTC | #20
On 05/03/2024 08:56, Barry Song wrote:
> are writing pte to zero(break) before writing a new value(make). while

As an aside, "break-before-make" as defined in the Arm architecture would also
require a TLBI, which usually isn't done for these
write-0-modify-prots-write-back operations. Arm doesn't require
"break-before-make" in these situations so it's legal (as long as only certain
bits are changed). To my understanding, the purpose of doing this is to avoid
races with HW access/dirty flag updates; if the MMU wants to set either flag
and finds the PTE is 0 (invalid) it will cause an exception which will be
queued waiting for the PTL.

So I don't think you really mean break-before-make here.

> this behavior is within PTL in another thread,  page_vma_mapped_walk()
> of try_to_unmap_one thread won't take PTL till it meets a present PTE.
> for example, if another threads are modifying nr_pages PTEs under PTL,
> but we don't hold PTL, we might skip one or two PTEs at the beginning of
> a large folio.
> For a large folio, after try_to_unmap_one(), we may result in PTE0 and PTE1
> untouched but PTE2~nr_pages-1 are set to swap entries.
> 
> by holding PTL from PTE0 for large folios, we won't get these intermediate
> values. At the moment we get PTL, other threads have done.
Barry Song March 5, 2024, 9:08 a.m. UTC | #21
On Tue, Mar 5, 2024 at 9:54 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 04/03/2024 21:57, Barry Song wrote:
> > On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Hi Barry,
> >>
> >> On 04/03/2024 10:37, Barry Song wrote:
> >>> From: Barry Song <v-songbaohua@oppo.com>
> >>>
> >>> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >>> PTEs modification such as break-before-make, while iterating PTEs
> >>> of a large folio, it will only begin to acquire PTL after it gets
> >>> a valid(present) PTE. break-before-make intermediately sets PTEs
> >>> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >>> in try_to_unmap_one().
> >>
> >> I just want to check my understanding here - I think the problem occurs for
> >> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> >> that I've had a look at the code and have a better understanding, I think that
> >> must be the case? And therefore this problem exists independently of my work to
> >> support swap-out of mTHP? (From your previous report I was under the impression
> >> that it only affected mTHP).
> >
> > I think this affects all large folios with PTEs entries more than 1. but hugeTLB
> > is handled as a whole in try_to_unmap_one and its rmap is removed all
> > together, i feel hugeTLB doesn't have this problem.
> >
> >>
> >> Its just that the problem is becoming more pronounced because with mTHP,
> >> PTE-mapped large folios are much more common?
> >
> > right. as now large folios become a more common case, and it is my case
> > running in millions of phones.
> >
> > BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim
> > all PTEs all together rather than iterating PTEs one by one. This will improve
> > performance. for example, a batched
> > set_ptes_to_swap_entries()
> > {
> > }
> > then we only need to loop once for a large folio, right now we are looping
> > nr_pages times.
>
> You still need a pte-pte loop somewhere. In hugetlb's case it's in the arch
> implementation. HugeTLB ptes are all a fixed size for a given VMA, which makes
> things a bit easier too, whereas in the regular mm, they are now a variable size.
>
> David and I introduced folio_pte_batch() to help gather batches of ptes, and it
> uses the contpte bit to avoid iterating over intermediate ptes. And I'm adding
> swap_pte_batch() which does a similar thing for swap entry batching in v4 of my
> swap-out series.
>
> For your set_ptes_to_swap_entries() example, I'm not sure what it would do other
> than loop over the PTEs setting an incremented swap entry to each one? How is
> that more performant?

Right now, while (page_vma_mapped_walk(&pvmw)) loops nr_pages times, once
per PTE, and for each PTE we do lots of checks within the loop.

By implementing set_ptes_to_swap_entries(), we could iterate once for
page_vma_mapped_walk(); after folio_pte_batch() has confirmed that the
large folio is completely mapped, we set the nr_pages swap entries all
together.

we are replacing

for(i=0;i<nr_pages;i++)     /* page_vma_mapped_walk */
{
        lots of checks;
        clear PTEn
        set PTEn to swap
}

by

if (large folio && folio_pte_batch() == nr_pages)
    set_ptes_to_swap_entries().
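
Just to illustrate the shape I have in mind - a hypothetical sketch of
such a helper, assuming the caller has already done the usual checks once
for the whole batch, cleared the old PTEs and flushed the TLB:

	static void set_ptes_to_swap_entries(struct mm_struct *mm,
			unsigned long addr, pte_t *ptep,
			swp_entry_t entry, int nr)
	{
		int i;

		for (i = 0; i < nr; i++, ptep++, addr += PAGE_SIZE) {
			/* consecutive swap offsets for consecutive pages */
			swp_entry_t e = swp_entry(swp_type(entry),
						  swp_offset(entry) + i);

			set_pte_at(mm, addr, ptep, swp_entry_to_pte(e));
		}
	}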

>

Thanks,
Ryan
Huang, Ying March 5, 2024, 9:10 a.m. UTC | #22
Barry Song <21cnbao@gmail.com> writes:

> On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Barry Song <21cnbao@gmail.com> writes:
>>
>> > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
>> >> > But we did "resolve" those bugs by entirely untouching all PTEs if we
>> >> > found some PTEs were skipped in try_to_unmap_one [1].
>> >> >
>> >> > While we find we only get the PTL from 2nd, 3rd but not
>> >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
>> >> > all PTEs untouched.
>> >> >
>> >> > /* we are not starting from head */
>> >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
>> >> >                     ret = false;
>> >> >                     atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
>> >> >                     set_pte_at(mm, address, pvmw.pte, pteval);
>> >> >                     page_vma_mapped_walk_done(&pvmw);
>> >> >                     break;
>> >> > }
>> >> > This will ensure all PTEs still have a unified state such as CONT-PTE
>> >> > after try_to_unmap fails.
>> >> > I feel this could have some false postive because when racing
>> >> > with unmap, 1st PTE might really become pte_none. So explicitly
>> >> > holding PTL from 1st PTE seems a better way.
>> >>
>> >> Can we estimate the "cost" of holding the PTL?
>> >>
>> >
>> > This is just moving PTL acquisition one or two PTE earlier in those corner
>> > cases. In normal cases, it doesn't affect when PTL is held.
>>
>> The mTHP may be mapped at the end of page table.  In that case, the PTL
>> will be held longer.  Or am I missing something?
>
> no. this patch doesn't change when we release PTL but change when we
> get PTL.
>
> when the original code iterates nr_pages PTEs in a large folio, it will skip
> invalid PTEs, when it meets a valid one, it will acquire PTL. so if it gets
> intermediate PTE values some other threads are modifying, it might
> skip PTE0, or sometimes PTE0 and PTE1 according to my test. but
> arriving at PTE2, likely other threads have written a new value, so we
> will begin to hold PTL and iterate till the end of the large folio.

Is there any guarantee that the mTHP will always be mapped at the
beginning of the page table (PTE0)?  IIUC, mTHP can be mapped at PTE496.
If so, with your patch, PTL will be held from PTE0 instead of PTE496 in
some cases.

--
Best Regards,
Huang, Ying

> The proposal is that we directly get PTL from PTE0, thus we don't get
> intermediate values for the head of nr_pages PTEs. this will ensure
> a large folio is either completely unmapped or completely mapped.
> but not partially mapped and partially unmapped.
>
>>
>> --
>> Best Regards,
>> Huang, Ying
>>
>>
>> > In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold
>> > PTL immediately. in corner cases, page_vma_mapped_walk races with break-
>> > before-make, after skipping one or two PTEs whose states are transferring,
>> > it will find a present pte then acquire lock.
>> >
>> >> --
>> >> Cheers,
>> >>
>> >> David / dhildenb
>> >
> Thanks
> Barry
Ryan Roberts March 5, 2024, 9:11 a.m. UTC | #23
On 05/03/2024 09:08, Barry Song wrote:
> On Tue, Mar 5, 2024 at 9:54 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 04/03/2024 21:57, Barry Song wrote:
>>> On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> Hi Barry,
>>>>
>>>> On 04/03/2024 10:37, Barry Song wrote:
>>>>> From: Barry Song <v-songbaohua@oppo.com>
>>>>>
>>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
>>>>> PTEs modification such as break-before-make, while iterating PTEs
>>>>> of a large folio, it will only begin to acquire PTL after it gets
>>>>> a valid(present) PTE. break-before-make intermediately sets PTEs
>>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
>>>>> in try_to_unmap_one().
>>>>
>>>> I just want to check my understanding here - I think the problem occurs for
>>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
>>>> that I've had a look at the code and have a better understanding, I think that
>>>> must be the case? And therefore this problem exists independently of my work to
>>>> support swap-out of mTHP? (From your previous report I was under the impression
>>>> that it only affected mTHP).
>>>
>>> I think this affects all large folios with PTEs entries more than 1. but hugeTLB
>>> is handled as a whole in try_to_unmap_one and its rmap is removed all
>>> together, i feel hugeTLB doesn't have this problem.
>>>
>>>>
>>>> Its just that the problem is becoming more pronounced because with mTHP,
>>>> PTE-mapped large folios are much more common?
>>>
>>> right. as now large folios become a more common case, and it is my case
>>> running in millions of phones.
>>>
>>> BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim
>>> all PTEs all together rather than iterating PTEs one by one. This will improve
>>> performance. for example, a batched
>>> set_ptes_to_swap_entries()
>>> {
>>> }
>>> then we only need to loop once for a large folio, right now we are looping
>>> nr_pages times.
>>
>> You still need a pte-pte loop somewhere. In hugetlb's case it's in the arch
>> implementation. HugeTLB ptes are all a fixed size for a given VMA, which makes
>> things a bit easier too, whereas in the regular mm, they are now a variable size.
>>
>> David and I introduced folio_pte_batch() to help gather batches of ptes, and it
>> uses the contpte bit to avoid iterating over intermediate ptes. And I'm adding
>> swap_pte_batch() which does a similar thing for swap entry batching in v4 of my
>> swap-out series.
>>
>> For your set_ptes_to_swap_entries() example, I'm not sure what it would do other
>> than loop over the PTEs setting an incremented swap entry to each one? How is
>> that more performant?
> 
> right now, while (page_vma_mapped_walk(&pvmw)) will loop nr_pages for each
> PTE, if each PTE, we do lots of checks within the loop.
> 
> by implementing set_ptes_to_swap_entries(), we can iterate once for
> page_vma_mapped_walk(), after folio_pte_batch() has confirmed
> the large folio is completely mapped, we set nr_pages swap entries
> all together.
> 
> we are replacing
> 
> for(i=0;i<nr_pages;i++)     /* page_vma_mapped_walk */
> {
>         lots of checks;
>         clear PTEn
>         set PTEn to swap
> }

OK so you are effectively hoisting "lots of checks" out of the loop?

> 
> by
> 
> if (large folio && folio_pte_batch() == nr_pages)
>     set_ptes_to_swap_entries().
> 
>>
> 
> Thanks,
> Ryan
Barry Song March 5, 2024, 9:11 a.m. UTC | #24
On Tue, Mar 5, 2024 at 10:08 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/03/2024 08:56, Barry Song wrote:
> > are writing pte to zero(break) before writing a new value(make). while
>
> As an aside, "break-before-make" as defined in the Arm architecture would also
> require a TLBI, which usually isn't done for these
> write-0-modify-prots-write-back operations. Arm doesn't require
> "break-before-make" in these situations so its legal (as long as only certain
> bits are changed). To my understanding purpose of doing this is to avoid races
> with HW access/dirty flag updates; if the MMU wants to set either flag and finds
> the PTE is 0 (invalid) it will cause an exception which will be queued waiting
> for the PTL.
>
> So I don't think you really mean break-before-make here.

I agree I used a stronger term; I will change it to something lighter in v2.

>
> > this behavior is within PTL in another thread,  page_vma_mapped_walk()
> > of try_to_unmap_one thread won't take PTL till it meets a present PTE.
> > for example, if another threads are modifying nr_pages PTEs under PTL,
> > but we don't hold PTL, we might skip one or two PTEs at the beginning of
> > a large folio.
> > For a large folio, after try_to_unmap_one(), we may result in PTE0 and PTE1
> > untouched but PTE2~nr_pages-1 are set to swap entries.
> >
> > by holding PTL from PTE0 for large folios, we won't get these intermediate
> > values. At the moment we get PTL, other threads have done.
>

Thanks
Barry
Barry Song March 5, 2024, 9:15 a.m. UTC | #25
On Tue, Mar 5, 2024 at 10:11 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 05/03/2024 09:08, Barry Song wrote:
> > On Tue, Mar 5, 2024 at 9:54 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 04/03/2024 21:57, Barry Song wrote:
> >>> On Tue, Mar 5, 2024 at 1:21 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> Hi Barry,
> >>>>
> >>>> On 04/03/2024 10:37, Barry Song wrote:
> >>>>> From: Barry Song <v-songbaohua@oppo.com>
> >>>>>
> >>>>> page_vma_mapped_walk() within try_to_unmap_one() races with other
> >>>>> PTEs modification such as break-before-make, while iterating PTEs
> >>>>> of a large folio, it will only begin to acquire PTL after it gets
> >>>>> a valid(present) PTE. break-before-make intermediately sets PTEs
> >>>>> to pte_none. Thus, a large folio's PTEs might be partially skipped
> >>>>> in try_to_unmap_one().
> >>>>
> >>>> I just want to check my understanding here - I think the problem occurs for
> >>>> PTE-mapped, PMD-sized folios as well as smaller-than-PMD-size large folios? Now
> >>>> that I've had a look at the code and have a better understanding, I think that
> >>>> must be the case? And therefore this problem exists independently of my work to
> >>>> support swap-out of mTHP? (From your previous report I was under the impression
> >>>> that it only affected mTHP).
> >>>
> >>> I think this affects all large folios with PTEs entries more than 1. but hugeTLB
> >>> is handled as a whole in try_to_unmap_one and its rmap is removed all
> >>> together, i feel hugeTLB doesn't have this problem.
> >>>
> >>>>
> >>>> Its just that the problem is becoming more pronounced because with mTHP,
> >>>> PTE-mapped large folios are much more common?
> >>>
> >>> right. as now large folios become a more common case, and it is my case
> >>> running in millions of phones.
> >>>
> >>> BTW, I feel we can somehow learn from hugeTLB, for example, we can reclaim
> >>> all PTEs all together rather than iterating PTEs one by one. This will improve
> >>> performance. for example, a batched
> >>> set_ptes_to_swap_entries()
> >>> {
> >>> }
> >>> then we only need to loop once for a large folio, right now we are looping
> >>> nr_pages times.
> >>
> >> You still need a pte-pte loop somewhere. In hugetlb's case it's in the arch
> >> implementation. HugeTLB ptes are all a fixed size for a given VMA, which makes
> >> things a bit easier too, whereas in the regular mm, they are now a variable size.
> >>
> >> David and I introduced folio_pte_batch() to help gather batches of ptes, and it
> >> uses the contpte bit to avoid iterating over intermediate ptes. And I'm adding
> >> swap_pte_batch() which does a similar thing for swap entry batching in v4 of my
> >> swap-out series.
> >>
> >> For your set_ptes_to_swap_entries() example, I'm not sure what it would do other
> >> than loop over the PTEs setting an incremented swap entry to each one? How is
> >> that more performant?
> >
> > right now, while (page_vma_mapped_walk(&pvmw)) will loop nr_pages for each
> > PTE, if each PTE, we do lots of checks within the loop.
> >
> > by implementing set_ptes_to_swap_entries(), we can iterate once for
> > page_vma_mapped_walk(), after folio_pte_batch() has confirmed
> > the large folio is completely mapped, we set nr_pages swap entries
> > all together.
> >
> > we are replacing
> >
> > for(i=0;i<nr_pages;i++)     /* page_vma_mapped_walk */
> > {
> >         lots of checks;
> >         clear PTEn
> >         set PTEn to swap
> > }
>
> OK so you are effectively hoisting "lots of checks" out of the loop?

No. page_vma_mapped_walk() returns nr_pages times, and we do the
same checks each time.  Each time, we do a TLBI and set one PTE.

>
> >
> > by
> >
> > if (large folio && folio_pte_batch() == nr_pages)
> >     set_ptes_to_swap_entries().

For this, we do the checks only once, and we issue far fewer TLBIs.

> >
> >>
> >
> > Thanks,
> > Ryan

Thanks
Barry
Barry Song March 5, 2024, 9:21 a.m. UTC | #26
On Tue, Mar 5, 2024 at 10:12 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Barry Song <21cnbao@gmail.com> writes:
>
> > On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote:
> >>
> >> Barry Song <21cnbao@gmail.com> writes:
> >>
> >> > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
> >> >> > But we did "resolve" those bugs by entirely untouching all PTEs if we
> >> >> > found some PTEs were skipped in try_to_unmap_one [1].
> >> >> >
> >> >> > While we find we only get the PTL from 2nd, 3rd but not
> >> >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
> >> >> > all PTEs untouched.
> >> >> >
> >> >> > /* we are not starting from head */
> >> >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> >> >> >                     ret = false;
> >> >> >                     atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> >> >> >                     set_pte_at(mm, address, pvmw.pte, pteval);
> >> >> >                     page_vma_mapped_walk_done(&pvmw);
> >> >> >                     break;
> >> >> > }
> >> >> > This will ensure all PTEs still have a unified state such as CONT-PTE
> >> >> > after try_to_unmap fails.
> >> >> > I feel this could have some false postive because when racing
> >> >> > with unmap, 1st PTE might really become pte_none. So explicitly
> >> >> > holding PTL from 1st PTE seems a better way.
> >> >>
> >> >> Can we estimate the "cost" of holding the PTL?
> >> >>
> >> >
> >> > This is just moving PTL acquisition one or two PTE earlier in those corner
> >> > cases. In normal cases, it doesn't affect when PTL is held.
> >>
> >> The mTHP may be mapped at the end of page table.  In that case, the PTL
> >> will be held longer.  Or am I missing something?
> >
> > no. this patch doesn't change when we release PTL but change when we
> > get PTL.
> >
> > when the original code iterates nr_pages PTEs in a large folio, it will skip
> > invalid PTEs, when it meets a valid one, it will acquire PTL. so if it gets
> > intermediate PTE values some other threads are modifying, it might
> > skip PTE0, or sometimes PTE0 and PTE1 according to my test. but
> > arriving at PTE2, likely other threads have written a new value, so we
> > will begin to hold PTL and iterate till the end of the large folio.
>
> Is there any guarantee that the mTHP will always be mapped at the
> beginning of the page table (PTE0)?  IIUC, mTHP can be mapped at PTE496.
> If so, with your patch, PTL will be held from PTE0 instead of PTE496 in
> some cases.

I agree. But in another discussion[1], the plan is that if we find a large folio
has been deferred split, we split it before try_to_unmap and pageout; otherwise,
we may end up with lots of redundant I/O, because PTE0-495 will still be
pageout()-ed.

[1] https://lore.kernel.org/linux-mm/a4a9054f-2040-4f70-8d10-a5af4972e5aa@arm.com/

>
> --
> Best Regards,
> Huang, Ying
>
> > The proposal is that we directly get PTL from PTE0, thus we don't get
> > intermediate values for the head of nr_pages PTEs. this will ensure
> > a large folio is either completely unmapped or completely mapped.
> > but not partially mapped and partially unmapped.
> >
> >>
> >> --
> >> Best Regards,
> >> Huang, Ying
> >>
> >>
> >> > In normal cases, page_vma_mapped_walk will find PTE0 is present, thus hold
> >> > PTL immediately. in corner cases, page_vma_mapped_walk races with break-
> >> > before-make, after skipping one or two PTEs whose states are transferring,
> >> > it will find a present pte then acquire lock.
> >> >
> >> >> --
> >> >> Cheers,
> >> >>
> >> >> David / dhildenb
> >> >

Thanks
Barry
Barry Song March 5, 2024, 10:28 a.m. UTC | #27
On Tue, Mar 5, 2024 at 10:21 PM Barry Song <21cnbao@gmail.com> wrote:
>
> On Tue, Mar 5, 2024 at 10:12 PM Huang, Ying <ying.huang@intel.com> wrote:
> >
> > Barry Song <21cnbao@gmail.com> writes:
> >
> > > On Tue, Mar 5, 2024 at 8:55 PM Huang, Ying <ying.huang@intel.com> wrote:
> > >>
> > >> Barry Song <21cnbao@gmail.com> writes:
> > >>
> > >> > On Tue, Mar 5, 2024 at 10:15 AM David Hildenbrand <david@redhat.com> wrote:
> > >> >> > But we did "resolve" those bugs by entirely untouching all PTEs if we
> > >> >> > found some PTEs were skipped in try_to_unmap_one [1].
> > >> >> >
> > >> >> > While we find we only get the PTL from 2nd, 3rd but not
> > >> >> > 1st PTE, we entirely give up on try_to_unmap_one, and leave
> > >> >> > all PTEs untouched.
> > >> >> >
> > >> >> > /* we are not starting from head */
> > >> >> > if (!IS_ALIGNED((unsigned long)pvmw.pte, CONT_PTES * sizeof(*pvmw.pte))) {
> > >> >> >                     ret = false;
> > >> >> >                     atomic64_inc(&perf_stat.mapped_walk_start_from_non_head);
> > >> >> >                     set_pte_at(mm, address, pvmw.pte, pteval);
> > >> >> >                     page_vma_mapped_walk_done(&pvmw);
> > >> >> >                     break;
> > >> >> > }
> > >> >> > This will ensure all PTEs still have a unified state such as CONT-PTE
> > >> >> > after try_to_unmap fails.
> > >> >> > I feel this could have some false postive because when racing
> > >> >> > with unmap, 1st PTE might really become pte_none. So explicitly
> > >> >> > holding PTL from 1st PTE seems a better way.
> > >> >>
> > >> >> Can we estimate the "cost" of holding the PTL?
> > >> >>
> > >> >
> > >> > This is just moving PTL acquisition one or two PTE earlier in those corner
> > >> > cases. In normal cases, it doesn't affect when PTL is held.
> > >>
> > >> The mTHP may be mapped at the end of page table.  In that case, the PTL
> > >> will be held longer.  Or am I missing something?
> > >
> > > no. this patch doesn't change when we release PTL but change when we
> > > get PTL.
> > >
> > > when the original code iterates nr_pages PTEs in a large folio, it will skip
> > > invalid PTEs, when it meets a valid one, it will acquire PTL. so if it gets
> > > intermediate PTE values some other threads are modifying, it might
> > > skip PTE0, or sometimes PTE0 and PTE1 according to my test. but
> > > arriving at PTE2, likely other threads have written a new value, so we
> > > will begin to hold PTL and iterate till the end of the large folio.
> >
> > Is there any guarantee that the mTHP will always be mapped at the
> > beginning of the page table (PTE0)?  IIUC, mTHP can be mapped at PTE496.
> > If so, with your patch, PTL will be held from PTE0 instead of PTE496 in
> > some cases.
>
> I agree. but in another discussion[1], the plan is if we find a large folio has
> been deferred split, we split it before try_to_unmap and pageout. otherwise,
> we may result in lots of redundant I/O, because PTE0-495 will still be
> pageout()-ed.
>
> [1] https://lore.kernel.org/linux-mm/a4a9054f-2040-4f70-8d10-a5af4972e5aa@arm.com/

I thought about this again; it seems we can cope with it even w/o the above plan
by:

+ if (folio_test_large(folio) && list_empty(&folio->_deferred_list))
+              flags |= TTU_SYNC;

If a folio has been deferred split, it makes no sense to apply the optimization
for the corner cases this patch wants to cover. Only while we know the folio is
still entirely mapped do we apply the optimization. This should reduce the
chance of holding the PTL longer to quite small, though a bit still remains.

>
> >
> > --
> > Best Regards,
> > Huang, Ying
> >
> > > The proposal is that we directly get PTL from PTE0, thus we don't get
> > > intermediate values for the head of nr_pages PTEs. this will ensure
> > > a large folio is either completely unmapped or completely mapped.
> > > but not partially mapped and partially unmapped.
> > >
> > >>

Thanks
Barry
diff mbox series

Patch

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0b888a2afa58..e4722fbbcd0c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1270,6 +1270,17 @@  static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 			if (folio_test_pmd_mappable(folio))
 				flags |= TTU_SPLIT_HUGE_PMD;
+			/*
+			 * If the page table lock is not held from the first PTE
+			 * of a large folio, some PTEs might be skipped because
+			 * of races with break-before-make; for example, PTEs can
+			 * be pte_none intermediately, thus one or more PTEs
+			 * might be skipped in try_to_unmap_one() and we might
+			 * end up with a large folio that is partially mapped and
+			 * partially unmapped after try_to_unmap().
+			 */
+			if (folio_test_large(folio))
+				flags |= TTU_SYNC;
 
 			try_to_unmap(folio, flags);
 			if (folio_mapped(folio)) {