
[v2,14/14] arm64/mm: Add ptep_get_and_clear_full() to optimize process teardown

Message ID 20231115163018.1303287-15-ryan.roberts@arm.com (mailing list archive)
State New, archived
Series Transparent Contiguous PTEs for User Mappings

Commit Message

Ryan Roberts Nov. 15, 2023, 4:30 p.m. UTC
ptep_get_and_clear_full() adds a 'full' parameter which is not present
for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
a full address space teardown is in progress. We use this information to
optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
contiguous neighbours to keep their contig bit set, because we know we
are about to clear the rest too.

Before this optimization, the cost of arm64_sys_exit_group() exploded to
32x what it was before PTE_CONT support was wired up, when compiling the
kernel. With this optimization in place, we are back down to the
original cost.

This approach is not perfect though, as for the duration between
returning from the first call to ptep_get_and_clear_full() and making
the final call, the contpte block is in an intermediate state, where some
ptes are cleared and others are still set with the PTE_CONT bit. If any
other APIs are called for the ptes in the contpte block during that
time, we have to be very careful. The core code currently interleaves
calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
must be careful to ignore the cleared entries when accumulating the
access and dirty bits - the same goes for ptep_get_lockless(). The only
other calls we might reasonably expect are to set markers in the
previously cleared ptes. (We shouldn't see valid entries being set until
after the tlbi, at which point we are no longer in the intermediate
state). Since markers are not valid, this is safe; set_ptes() will see
the old, invalid entry and will not attempt to unfold. And the new pte
is also invalid so it won't attempt to fold. We shouldn't see this for
the 'full' case anyway.

The last remaining issue is returning the access/dirty bits. That info
could be present in any of the ptes in the contpte block. ptep_get()
will gather those bits from across the contpte block. We don't bother
doing that here, because we know that the information is used by the
core-mm to mark the underlying folio as accessed/dirty. And since the
same folio must be underpinning the whole block (that was a requirement
for folding in the first place), that information will make it to the
folio eventually once all the ptes have been cleared. This approach
means we don't have to play games with accumulating and storing the
bits. It does mean that any interleaved calls to ptep_get() may lack
correct access/dirty information if we have already cleared the pte that
happened to store it. The core code does not rely on this though.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 18 +++++++++--
 arch/arm64/mm/contpte.c          | 54 ++++++++++++++++++++++++++++++++
 2 files changed, 70 insertions(+), 2 deletions(-)
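
For context, the interleaving that the commit message refers to (and which
Ryan spells out later in the thread) is roughly the following. This is a
simplified, illustrative schematic of the zap_pte_range() loop, not the
actual core-mm code; the caller holds the PTL and the folio bookkeeping is
elided.

/*
 * Simplified schematic only: ptep_get() and ptep_get_and_clear_full() are
 * called alternately for each pte while the PTL is held by the caller.
 */
static void zap_range_schematic(struct mm_struct *mm, pte_t *pte,
				unsigned long addr, unsigned long end,
				int fullmm)
{
	for (; addr < end; pte++, addr += PAGE_SIZE) {
		pte_t ptent = ptep_get(pte);

		if (!pte_present(ptent))
			continue;

		ptent = ptep_get_and_clear_full(mm, addr, pte, fullmm);
		if (pte_dirty(ptent))
			; /* the real code marks the folio dirty here */
		if (pte_young(ptent))
			; /* the real code marks the folio accessed here */
	}
}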

Comments

Alistair Popple Nov. 23, 2023, 5:13 a.m. UTC | #1
Ryan Roberts <ryan.roberts@arm.com> writes:

> ptep_get_and_clear_full() adds a 'full' parameter which is not present
> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
> a full address space teardown is in progress. We use this information to
> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
> contiguous neighbours to keep their contig bit set, because we know we
> are about to clear the rest too.
>
> Before this optimization, the cost of arm64_sys_exit_group() exploded to
> 32x what it was before PTE_CONT support was wired up, when compiling the
> kernel. With this optimization in place, we are back down to the
> original cost.
>
> This approach is not perfect though, as for the duration between
> returning from the first call to ptep_get_and_clear_full() and making
> the final call, the contpte block in an intermediate state, where some
> ptes are cleared and others are still set with the PTE_CONT bit. If any
> other APIs are called for the ptes in the contpte block during that
> time, we have to be very careful. The core code currently interleaves
> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
> must be careful to ignore the cleared entries when accumulating the
> access and dirty bits - the same goes for ptep_get_lockless(). The only
> other calls we might resonably expect are to set markers in the
> previously cleared ptes. (We shouldn't see valid entries being set until
> after the tlbi, at which point we are no longer in the intermediate
> state). Since markers are not valid, this is safe; set_ptes() will see
> the old, invalid entry and will not attempt to unfold. And the new pte
> is also invalid so it won't attempt to fold. We shouldn't see this for
> the 'full' case anyway.
>
> The last remaining issue is returning the access/dirty bits. That info
> could be present in any of the ptes in the contpte block. ptep_get()
> will gather those bits from across the contpte block. We don't bother
> doing that here, because we know that the information is used by the
> core-mm to mark the underlying folio as accessed/dirty. And since the
> same folio must be underpinning the whole block (that was a requirement
> for folding in the first place), that information will make it to the
> folio eventually once all the ptes have been cleared. This approach
> means we don't have to play games with accumulating and storing the
> bits. It does mean that any interleaved calls to ptep_get() may lack
> correct access/dirty information if we have already cleared the pte that
> happened to store it. The core code does not rely on this though.

Does not *currently* rely on this. I can't help but think it is
potentially something that could change in the future, though, which would
lead to some subtle bugs.

Would there be any way of avoiding this? Half-baked thought, but could
you, for example, copy the access/dirty information to the last (or
perhaps first, most likely invalid) PTE?

 - Alistair

> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/include/asm/pgtable.h | 18 +++++++++--
>  arch/arm64/mm/contpte.c          | 54 ++++++++++++++++++++++++++++++++
>  2 files changed, 70 insertions(+), 2 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 9bd2f57a9e11..ea58a9f4e700 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -1145,6 +1145,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>  extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>  extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>  				pte_t *ptep, pte_t pte, unsigned int nr);
> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> @@ -1270,12 +1272,24 @@ static inline void pte_clear(struct mm_struct *mm,
>  	__pte_clear(mm, addr, ptep);
>  }
>  
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep, int full)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (!pte_valid_cont(orig_pte) || !full) {
> +		contpte_try_unfold(mm, addr, ptep, orig_pte);
> +		return __ptep_get_and_clear(mm, addr, ptep);
> +	} else
> +		return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> +}
> +
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> -	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> -	return __ptep_get_and_clear(mm, addr, ptep);
> +	return ptep_get_and_clear_full(mm, addr, ptep, 0);
>  }
>  
>  #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 426be9cd4dea..5d1aaed82d32 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -144,6 +144,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>  	for (i = 0; i < CONT_PTES; i++, ptep++) {
>  		pte = __ptep_get(ptep);
>  
> +		/*
> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
> +		 * where some of the ptes in the range may be cleared but others
> +		 * are still to do. See contpte_ptep_get_and_clear_full().
> +		 */
> +		if (pte_val(pte) == 0)
> +			continue;
> +
>  		if (pte_dirty(pte))
>  			orig_pte = pte_mkdirty(orig_pte);
>  
> @@ -256,6 +264,52 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>  }
>  EXPORT_SYMBOL(contpte_set_ptes);
>  
> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	/*
> +	 * When doing a full address space teardown, we can avoid unfolding the
> +	 * contiguous range, and therefore avoid the associated tlbi. Instead,
> +	 * just get and clear the pte. The caller is promising to call us for
> +	 * every pte, so every pte in the range will be cleared by the time the
> +	 * tlbi is issued.
> +	 *
> +	 * This approach is not perfect though, as for the duration between
> +	 * returning from the first call to ptep_get_and_clear_full() and making
> +	 * the final call, the contpte block in an intermediate state, where
> +	 * some ptes are cleared and others are still set with the PTE_CONT bit.
> +	 * If any other APIs are called for the ptes in the contpte block during
> +	 * that time, we have to be very careful. The core code currently
> +	 * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> +	 * ptep_get() must be careful to ignore the cleared entries when
> +	 * accumulating the access and dirty bits - the same goes for
> +	 * ptep_get_lockless(). The only other calls we might resonably expect
> +	 * are to set markers in the previously cleared ptes. (We shouldn't see
> +	 * valid entries being set until after the tlbi, at which point we are
> +	 * no longer in the intermediate state). Since markers are not valid,
> +	 * this is safe; set_ptes() will see the old, invalid entry and will not
> +	 * attempt to unfold. And the new pte is also invalid so it won't
> +	 * attempt to fold. We shouldn't see this for the 'full' case anyway.
> +	 *
> +	 * The last remaining issue is returning the access/dirty bits. That
> +	 * info could be present in any of the ptes in the contpte block.
> +	 * ptep_get() will gather those bits from across the contpte block. We
> +	 * don't bother doing that here, because we know that the information is
> +	 * used by the core-mm to mark the underlying folio as accessed/dirty.
> +	 * And since the same folio must be underpinning the whole block (that
> +	 * was a requirement for folding in the first place), that information
> +	 * will make it to the folio eventually once all the ptes have been
> +	 * cleared. This approach means we don't have to play games with
> +	 * accumulating and storing the bits. It does mean that any interleaved
> +	 * calls to ptep_get() may lack correct access/dirty information if we
> +	 * have already cleared the pte that happened to store it. The core code
> +	 * does not rely on this though.
> +	 */
> +
> +	return __ptep_get_and_clear(mm, addr, ptep);
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> +
>  int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>  					unsigned long addr, pte_t *ptep)
>  {
Ryan Roberts Nov. 23, 2023, 4:01 p.m. UTC | #2
On 23/11/2023 05:13, Alistair Popple wrote:
> 
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>> a full address space teardown is in progress. We use this information to
>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>> contiguous neighbours to keep their contig bit set, because we know we
>> are about to clear the rest too.
>>
>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>> 32x what it was before PTE_CONT support was wired up, when compiling the
>> kernel. With this optimization in place, we are back down to the
>> original cost.
>>
>> This approach is not perfect though, as for the duration between
>> returning from the first call to ptep_get_and_clear_full() and making
>> the final call, the contpte block in an intermediate state, where some
>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>> other APIs are called for the ptes in the contpte block during that
>> time, we have to be very careful. The core code currently interleaves
>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>> must be careful to ignore the cleared entries when accumulating the
>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>> other calls we might resonably expect are to set markers in the
>> previously cleared ptes. (We shouldn't see valid entries being set until
>> after the tlbi, at which point we are no longer in the intermediate
>> state). Since markers are not valid, this is safe; set_ptes() will see
>> the old, invalid entry and will not attempt to unfold. And the new pte
>> is also invalid so it won't attempt to fold. We shouldn't see this for
>> the 'full' case anyway.
>>
>> The last remaining issue is returning the access/dirty bits. That info
>> could be present in any of the ptes in the contpte block. ptep_get()
>> will gather those bits from across the contpte block. We don't bother
>> doing that here, because we know that the information is used by the
>> core-mm to mark the underlying folio as accessed/dirty. And since the
>> same folio must be underpinning the whole block (that was a requirement
>> for folding in the first place), that information will make it to the
>> folio eventually once all the ptes have been cleared. This approach
>> means we don't have to play games with accumulating and storing the
>> bits. It does mean that any interleaved calls to ptep_get() may lack
>> correct access/dirty information if we have already cleared the pte that
>> happened to store it. The core code does not rely on this though.
> 
> Does not *currently* rely on this. I can't help but think it is
> potentially something that could change in the future though which would
> lead to some subtle bugs.

Yes, there is a risk, although IMHO it's very small.

> 
> Would there be any may of avoiding this? Half baked thought but could
> you for example copy the access/dirty information to the last (or
> perhaps first, most likely invalid) PTE?

I spent a long time thinking about this and came up with a number of
possibilities, none of them ideal. In the end, I went for the simplest one
(which works but suffers from the problem that it depends on the way it is
called not changing).

1) copy the access/dirty flags into all the remaining uncleared ptes within the
contpte block. This is how I did it in v1, although it was racy. I think this
could be implemented correctly, but it's extremely complex.

2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
can clear 1 or more full contpte blocks in a single call - the ptes are never in
an intermediate state. This is difficult because ptep_get_and_clear_full()
returns the pte that was cleared, so it's difficult to scale that up to multiple ptes.

3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
to be called while ptep_get_and_clear_full() calls are ongoing. Then assert in
the other functions that ptep_get_and_clear_full() is not ongoing when they are
called, so we would get a clear sign that usage patterns have changed. But there
is no easy place to store that state (other than scanning a contpte block
looking for pte_none() amongst pte_valid_cont() entries - a rough sketch of such
a scan follows this list) and it all felt ugly.

4) The simple approach I ended up taking; I thought it would be best to keep it
simple and see if anyone was concerned before doing something more drastic.
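
To make option 3's "scan the block" idea concrete, a rough, hypothetical
sketch of such a check (not code from this series; PTR_ALIGN_DOWN() and the
pte helpers are existing ones) could look like:

/*
 * Hypothetical sketch only: detect the "partially torn down" state that
 * option 3 would need to assert on, by scanning the naturally aligned
 * contpte block for a mix of cleared and still-contiguous entries.
 */
static bool contpte_block_mid_teardown(pte_t *ptep)
{
	pte_t *first = PTR_ALIGN_DOWN(ptep, CONT_PTES * sizeof(pte_t));
	bool seen_none = false, seen_cont = false;
	int i;

	for (i = 0; i < CONT_PTES; i++) {
		pte_t pte = __ptep_get(first + i);

		seen_none |= pte_none(pte);
		seen_cont |= pte_valid_cont(pte);
	}

	return seen_none && seen_cont;
}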

What do you think? If we really need to solve this, then option 1 is my
preferred route, but it would take some time to figure out and reason about a
race-free scheme.

Thanks,
Ryan
Alistair Popple Nov. 24, 2023, 1:35 a.m. UTC | #3
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 23/11/2023 05:13, Alistair Popple wrote:
>> 
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>> a full address space teardown is in progress. We use this information to
>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>> contiguous neighbours to keep their contig bit set, because we know we
>>> are about to clear the rest too.
>>>
>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>> kernel. With this optimization in place, we are back down to the
>>> original cost.
>>>
>>> This approach is not perfect though, as for the duration between
>>> returning from the first call to ptep_get_and_clear_full() and making
>>> the final call, the contpte block in an intermediate state, where some
>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>> other APIs are called for the ptes in the contpte block during that
>>> time, we have to be very careful. The core code currently interleaves
>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>> must be careful to ignore the cleared entries when accumulating the
>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>> other calls we might resonably expect are to set markers in the
>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>> after the tlbi, at which point we are no longer in the intermediate
>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>> the 'full' case anyway.
>>>
>>> The last remaining issue is returning the access/dirty bits. That info
>>> could be present in any of the ptes in the contpte block. ptep_get()
>>> will gather those bits from across the contpte block. We don't bother
>>> doing that here, because we know that the information is used by the
>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>> same folio must be underpinning the whole block (that was a requirement
>>> for folding in the first place), that information will make it to the
>>> folio eventually once all the ptes have been cleared. This approach
>>> means we don't have to play games with accumulating and storing the
>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>> correct access/dirty information if we have already cleared the pte that
>>> happened to store it. The core code does not rely on this though.
>> 
>> Does not *currently* rely on this. I can't help but think it is
>> potentially something that could change in the future though which would
>> lead to some subtle bugs.
>
> Yes, there is a risk, although IMHO, its very small.
>
>> 
>> Would there be any may of avoiding this? Half baked thought but could
>> you for example copy the access/dirty information to the last (or
>> perhaps first, most likely invalid) PTE?
>
> I spent a long time thinking about this and came up with a number of
> possibilities, none of them ideal. In the end, I went for the simplest one
> (which works but suffers from the problem that it depends on the way it is
> called not changing).

Ok, that answers my underlying question of "has someone thought about
this and are there any easy solutions". I suspected that was the case
given the excellent write up though!

> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
> contpte block. This is how I did it in v1; although it was racy. I think this
> could be implemented correctly but its extremely complex.
>
> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
> can clear 1 or more full contpte blocks in a single call - the ptes are never in
> an intermediate state. This is difficult because ptep_get_and_clear_full()
> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>
> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
> the other functions that ptep_get_and_clear_full() is not on-going when they are
> called. So we would get a clear sign that usage patterns have changed. But there
> is no easy place to store that state (other than scanning a contpte block
> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>
> 4) The simple approach I ended up taking; I thought it would be best to keep it
> simple and see if anyone was concerned before doing something more drastic.
>
> What do you think? If we really need to solve this, then option 1 is my
> preferred route, but it would take some time to figure out and reason about a
> race-free scheme.

Well I like simple, and I agree the risk is small. But I can't help but feel
the current situation is too subtle, mainly because it is architecture
specific and the assumptions are not communicated in core-mm code
anywhere. But also, none of the alternatives seem much better.

However, there are only three callers of ptep_get_and_clear_full(), and
all of these hold the PTL. So if I'm not mistaken, that should exclude
just about all users of ptep_get*(), which take the PTL beforehand.

So really that only leaves ptep_get_lockless() that could/should
interleave, right? From a quick glance at those users, none look at the
young/dirty information anyway, so I wonder if we can just assert in the
core-mm that ptep_get_lockless() does not return young/dirty information
and clear it in the helpers? That would make things explicit and
consistent, which would address my concern (although I haven't looked too
closely at the details there).
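
As a purely illustrative sketch of that idea (the wrapper name below is
hypothetical; pte_mkold()/pte_mkclean() and ptep_get_lockless() are the
existing helpers), it might look something like:

/*
 * Hypothetical sketch only: a wrapper that makes "ptep_get_lockless()
 * returns no young/dirty information" explicit by clearing those bits
 * before handing the pte back to core-mm.
 */
static inline pte_t ptep_get_lockless_noad(pte_t *ptep)
{
	pte_t pte = ptep_get_lockless(ptep);

	return pte_mkold(pte_mkclean(pte));
}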

> Thanks,
> Ryan
Ryan Roberts Nov. 24, 2023, 8:54 a.m. UTC | #4
On 24/11/2023 01:35, Alistair Popple wrote:
> 
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>> a full address space teardown is in progress. We use this information to
>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>> are about to clear the rest too.
>>>>
>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>> kernel. With this optimization in place, we are back down to the
>>>> original cost.
>>>>
>>>> This approach is not perfect though, as for the duration between
>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>> the final call, the contpte block in an intermediate state, where some
>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>> other APIs are called for the ptes in the contpte block during that
>>>> time, we have to be very careful. The core code currently interleaves
>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>> must be careful to ignore the cleared entries when accumulating the
>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>> other calls we might resonably expect are to set markers in the
>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>> after the tlbi, at which point we are no longer in the intermediate
>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>> the 'full' case anyway.
>>>>
>>>> The last remaining issue is returning the access/dirty bits. That info
>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>> will gather those bits from across the contpte block. We don't bother
>>>> doing that here, because we know that the information is used by the
>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>> same folio must be underpinning the whole block (that was a requirement
>>>> for folding in the first place), that information will make it to the
>>>> folio eventually once all the ptes have been cleared. This approach
>>>> means we don't have to play games with accumulating and storing the
>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>> correct access/dirty information if we have already cleared the pte that
>>>> happened to store it. The core code does not rely on this though.
>>>
>>> Does not *currently* rely on this. I can't help but think it is
>>> potentially something that could change in the future though which would
>>> lead to some subtle bugs.
>>
>> Yes, there is a risk, although IMHO, its very small.
>>
>>>
>>> Would there be any may of avoiding this? Half baked thought but could
>>> you for example copy the access/dirty information to the last (or
>>> perhaps first, most likely invalid) PTE?
>>
>> I spent a long time thinking about this and came up with a number of
>> possibilities, none of them ideal. In the end, I went for the simplest one
>> (which works but suffers from the problem that it depends on the way it is
>> called not changing).
> 
> Ok, that answers my underlying question of "has someone thought about
> this and are there any easy solutions". I suspected that was the case
> given the excellent write up though!
> 
>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>> contpte block. This is how I did it in v1; although it was racy. I think this
>> could be implemented correctly but its extremely complex.
>>
>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>
>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>> called. So we would get a clear sign that usage patterns have changed. But there
>> is no easy place to store that state (other than scanning a contpte block
>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>
>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>> simple and see if anyone was concerned before doing something more drastic.
>>
>> What do you think? If we really need to solve this, then option 1 is my
>> preferred route, but it would take some time to figure out and reason about a
>> race-free scheme.
> 
> Well I like simple, and I agree the risk is small. But I can't help feel
> the current situation is too subtle, mainly because it is architecture
> specific and the assumptions are not communicated in core-mm code
> anywhere. But also none of the aternatives seem much better.
> 
> However there are only three callers of ptep_get_and_clear_full(), and
> all of these hold the PTL. So if I'm not mistaken that should exclude
> just about all users of ptep_get*() which will take the ptl before hand.

The problem isn't racing threads because, as you say, the PTL is already
serializing all calls except ptep_get_lockless(). And although there are 3
callers of ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
calls it with full=1, as I recall.

The problem is that the caller in zap_pte_range() does this:

ptl = lock_page_table()
for each pte {
	ptent = ptep_get(pte);
	if (pte_present(ptent)) {
		ptent = ptep_get_and_clear_full(mm, addr, pte, 1);
		if (pte_dirty(ptent))
			...
		if (pte_young(ptent))
			...
	}
}
unlock_page_table(ptl)

It deliberately interleaves calls to ptep_get() and ptep_get_and_clear_full()
under the ptl. So if the loop is iterating over a contpte block and the HW
happens to be storing the access/dirty info in the first pte entry, then the
first time through the loop, ptep_get() will return the correct access/dirty
info, as will ptep_get_and_clear_full(). The next time through the loop, though,
the access/dirty info which was in the previous pte is now cleared, so ptep_get()
and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.


> 
> So really that only leaves ptep_get_lockless() that could/should
> interleave right? 

Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
is very careful to ensure that the contpte block is in a consistent state and it
keeps trying until it is. So this will always return the correct consistent
information.

> From a quick glance of those users none look at the
> young/dirty information anyway, so I wonder if we can just assert in the
> core-mm that ptep_get_lockless() does not return young/dirty information
> and clear it in the helpers? That would make things explicit and
> consistent which would address my concern (although I haven't looked too
> closely at the details there).

As per the explanation above, it's not ptep_get_lockless() that is the problem,
so I don't think this helps.

Thanks,
Ryan

> 
>> Thanks,
>> Ryan
>
Alistair Popple Nov. 27, 2023, 7:34 a.m. UTC | #5
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 24/11/2023 01:35, Alistair Popple wrote:
>> 
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>> a full address space teardown is in progress. We use this information to
>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>> are about to clear the rest too.
>>>>>
>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>> kernel. With this optimization in place, we are back down to the
>>>>> original cost.
>>>>>
>>>>> This approach is not perfect though, as for the duration between
>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>> the final call, the contpte block in an intermediate state, where some
>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>> other APIs are called for the ptes in the contpte block during that
>>>>> time, we have to be very careful. The core code currently interleaves
>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>> other calls we might resonably expect are to set markers in the
>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>> the 'full' case anyway.
>>>>>
>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>> will gather those bits from across the contpte block. We don't bother
>>>>> doing that here, because we know that the information is used by the
>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>> for folding in the first place), that information will make it to the
>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>> means we don't have to play games with accumulating and storing the
>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>> correct access/dirty information if we have already cleared the pte that
>>>>> happened to store it. The core code does not rely on this though.
>>>>
>>>> Does not *currently* rely on this. I can't help but think it is
>>>> potentially something that could change in the future though which would
>>>> lead to some subtle bugs.
>>>
>>> Yes, there is a risk, although IMHO, its very small.
>>>
>>>>
>>>> Would there be any may of avoiding this? Half baked thought but could
>>>> you for example copy the access/dirty information to the last (or
>>>> perhaps first, most likely invalid) PTE?
>>>
>>> I spent a long time thinking about this and came up with a number of
>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>> (which works but suffers from the problem that it depends on the way it is
>>> called not changing).
>> 
>> Ok, that answers my underlying question of "has someone thought about
>> this and are there any easy solutions". I suspected that was the case
>> given the excellent write up though!
>> 
>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>> contpte block. This is how I did it in v1; although it was racy. I think this
>>> could be implemented correctly but its extremely complex.
>>>
>>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>>
>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>> called. So we would get a clear sign that usage patterns have changed. But there
>>> is no easy place to store that state (other than scanning a contpte block
>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>
>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>> simple and see if anyone was concerned before doing something more drastic.
>>>
>>> What do you think? If we really need to solve this, then option 1 is my
>>> preferred route, but it would take some time to figure out and reason about a
>>> race-free scheme.
>> 
>> Well I like simple, and I agree the risk is small. But I can't help feel
>> the current situation is too subtle, mainly because it is architecture
>> specific and the assumptions are not communicated in core-mm code
>> anywhere. But also none of the aternatives seem much better.
>> 
>> However there are only three callers of ptep_get_and_clear_full(), and
>> all of these hold the PTL. So if I'm not mistaken that should exclude
>> just about all users of ptep_get*() which will take the ptl before hand.
>
> The problem isn't racing threads because as you say, the PTL is already
> serializing all calls except ptep_get_lockless(). And although there are 3
> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
> calls it with full=1, as I recall.
>
> The problem is that the caller in zap_pte_range() does this:
>
> ptl = lock_page_table()
> for each pte {
> 	ptent = ptep_get(pte);
> 	if (pte_present(ptent) {
> 		ptent = ptep_get_and_clear_full(ptent);
> 		if (pte_dirty(ptent))
> 			...
> 		if (pte_young(ptent))
> 			...
> 	}
> }
> unlock_page_table(ptl)
>
> It deliberately interleves calls to ptep_get() and ptep_get_and_clear_full()
> under the ptl. So if the loop is iterating over a contpte block and the HW
> happens to be storing the access/dirty info in the first pte entry, then the
> first time through the loop, ptep_get() will return the correct access/dirty
> info, as will ptep_get_and_clear_full(). The next time through the loop though,
> the access/dirty info which was in the previous pte is now cleared so ptep_get()
> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.

So if ptep_get_lockless() isn't a concern, what made the option posted in
v1 racy (your option 1 above)? Is there something else reading PTEs or
clearing PTE bits without holding the PTL that I'm missing?

>> 
>> So really that only leaves ptep_get_lockless() that could/should
>> interleave right? 
>
> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
> is very careful to ensure that the contpte block is in a consistent state and it
> keeps trying until it is. So this will always return the correct consistent
> information.
>
>> From a quick glance of those users none look at the
>> young/dirty information anyway, so I wonder if we can just assert in the
>> core-mm that ptep_get_lockless() does not return young/dirty information
>> and clear it in the helpers? That would make things explicit and
>> consistent which would address my concern (although I haven't looked too
>> closely at the details there).
>
> As per explanation above, its not ptep_get_lockless() that is the problem so I
> don't think this helps.
>
> Thanks,
> Ryan
>
>> 
>>> Thanks,
>>> Ryan
>>
Ryan Roberts Nov. 27, 2023, 8:53 a.m. UTC | #6
On 27/11/2023 07:34, Alistair Popple wrote:
> 
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 24/11/2023 01:35, Alistair Popple wrote:
>>>
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>>
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>>> a full address space teardown is in progress. We use this information to
>>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>>> are about to clear the rest too.
>>>>>>
>>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>>> kernel. With this optimization in place, we are back down to the
>>>>>> original cost.
>>>>>>
>>>>>> This approach is not perfect though, as for the duration between
>>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>>> the final call, the contpte block in an intermediate state, where some
>>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>>> other APIs are called for the ptes in the contpte block during that
>>>>>> time, we have to be very careful. The core code currently interleaves
>>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>>> other calls we might resonably expect are to set markers in the
>>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>>> the 'full' case anyway.
>>>>>>
>>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>>> will gather those bits from across the contpte block. We don't bother
>>>>>> doing that here, because we know that the information is used by the
>>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>>> for folding in the first place), that information will make it to the
>>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>>> means we don't have to play games with accumulating and storing the
>>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>>> correct access/dirty information if we have already cleared the pte that
>>>>>> happened to store it. The core code does not rely on this though.
>>>>>
>>>>> Does not *currently* rely on this. I can't help but think it is
>>>>> potentially something that could change in the future though which would
>>>>> lead to some subtle bugs.
>>>>
>>>> Yes, there is a risk, although IMHO, its very small.
>>>>
>>>>>
>>>>> Would there be any may of avoiding this? Half baked thought but could
>>>>> you for example copy the access/dirty information to the last (or
>>>>> perhaps first, most likely invalid) PTE?
>>>>
>>>> I spent a long time thinking about this and came up with a number of
>>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>>> (which works but suffers from the problem that it depends on the way it is
>>>> called not changing).
>>>
>>> Ok, that answers my underlying question of "has someone thought about
>>> this and are there any easy solutions". I suspected that was the case
>>> given the excellent write up though!
>>>
>>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>>> contpte block. This is how I did it in v1; although it was racy. I think this
>>>> could be implemented correctly but its extremely complex.
>>>>
>>>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>>>
>>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>>> called. So we would get a clear sign that usage patterns have changed. But there
>>>> is no easy place to store that state (other than scanning a contpte block
>>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>>
>>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>>> simple and see if anyone was concerned before doing something more drastic.
>>>>
>>>> What do you think? If we really need to solve this, then option 1 is my
>>>> preferred route, but it would take some time to figure out and reason about a
>>>> race-free scheme.
>>>
>>> Well I like simple, and I agree the risk is small. But I can't help feel
>>> the current situation is too subtle, mainly because it is architecture
>>> specific and the assumptions are not communicated in core-mm code
>>> anywhere. But also none of the aternatives seem much better.
>>>
>>> However there are only three callers of ptep_get_and_clear_full(), and
>>> all of these hold the PTL. So if I'm not mistaken that should exclude
>>> just about all users of ptep_get*() which will take the ptl before hand.
>>
>> The problem isn't racing threads because as you say, the PTL is already
>> serializing all calls except ptep_get_lockless(). And although there are 3
>> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
>> calls it with full=1, as I recall.
>>
>> The problem is that the caller in zap_pte_range() does this:
>>
>> ptl = lock_page_table()
>> for each pte {
>> 	ptent = ptep_get(pte);
>> 	if (pte_present(ptent) {
>> 		ptent = ptep_get_and_clear_full(ptent);
>> 		if (pte_dirty(ptent))
>> 			...
>> 		if (pte_young(ptent))
>> 			...
>> 	}
>> }
>> unlock_page_table(ptl)
>>
>> It deliberately interleves calls to ptep_get() and ptep_get_and_clear_full()
>> under the ptl. So if the loop is iterating over a contpte block and the HW
>> happens to be storing the access/dirty info in the first pte entry, then the
>> first time through the loop, ptep_get() will return the correct access/dirty
>> info, as will ptep_get_and_clear_full(). The next time through the loop though,
>> the access/dirty info which was in the previous pte is now cleared so ptep_get()
>> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.
> 
> So if ptep_get_lockless() isn't a concern what made the option posted in
> v1 racy (your option 1 above)? Is there something else reading PTEs or
> clearing PTE bits without holding the PTL that I'm missing?

The HW could be racing to set access and dirty bits. Well, actually, I'm not
completely sure if that's the case here; if full=1 then presumably no other
threads in the process should be running at this point, so perhaps it can be
guaranteed that nothing is causing a concurrent memory access and the HW is
therefore definitely not going to try to write the access/dirty bits
concurrently. But I didn't manage to convince myself that's definitely the case.

So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
buggy, because it iterated through the PTEs, getting and accumulating the bits,
then iterated again, writing that final set of bits to all the PTEs. And the HW
could have modified the bits during those loops. I think it would be possible to
fix the race, but intuition says it would be expensive.
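
To make that race concrete, a two-pass gather-then-write scheme of the kind
described might look roughly like the sketch below (hypothetical code, not
the actual v1 implementation; __ptep_get() is from this series and a raw
__set_pte() setter is assumed):

/*
 * Hypothetical sketch of the racy two-pass approach described above.
 * Assumes ptep points at the first entry of the contpte block.
 */
static void contpte_spread_access_dirty(pte_t *ptep)
{
	bool dirty = false, young = false;
	pte_t pte;
	int i;

	/* Pass 1: gather access/dirty bits from every entry in the block. */
	for (i = 0; i < CONT_PTES; i++) {
		pte = __ptep_get(ptep + i);
		dirty |= pte_dirty(pte);
		young |= pte_young(pte);
	}

	/* Pass 2: write the accumulated bits back to every entry. */
	for (i = 0; i < CONT_PTES; i++) {
		pte = __ptep_get(ptep + i);
		if (dirty)
			pte = pte_mkdirty(pte);
		if (young)
			pte = pte_mkyoung(pte);
		/*
		 * A HW access/dirty update landing between the read above
		 * and this plain write is silently lost - the race in
		 * question.
		 */
		__set_pte(ptep + i, pte);
	}
}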

> 
>>>
>>> So really that only leaves ptep_get_lockless() that could/should
>>> interleave right? 
>>
>> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
>> is very careful to ensure that the contpte block is in a consistent state and it
>> keeps trying until it is. So this will always return the correct consistent
>> information.
>>
>>> From a quick glance of those users none look at the
>>> young/dirty information anyway, so I wonder if we can just assert in the
>>> core-mm that ptep_get_lockless() does not return young/dirty information
>>> and clear it in the helpers? That would make things explicit and
>>> consistent which would address my concern (although I haven't looked too
>>> closely at the details there).
>>
>> As per explanation above, its not ptep_get_lockless() that is the problem so I
>> don't think this helps.
>>
>> Thanks,
>> Ryan
>>
>>>
>>>> Thanks,
>>>> Ryan
>>>
>
Alistair Popple Nov. 28, 2023, 6:54 a.m. UTC | #7
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 27/11/2023 07:34, Alistair Popple wrote:
>> 
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>> On 24/11/2023 01:35, Alistair Popple wrote:
>>>>
>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>
>>>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>>>
>>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>>
>>>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>>>> a full address space teardown is in progress. We use this information to
>>>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>>>> are about to clear the rest too.
>>>>>>>
>>>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>>>> kernel. With this optimization in place, we are back down to the
>>>>>>> original cost.
>>>>>>>
>>>>>>> This approach is not perfect though, as for the duration between
>>>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>>>> the final call, the contpte block in an intermediate state, where some
>>>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>>>> other APIs are called for the ptes in the contpte block during that
>>>>>>> time, we have to be very careful. The core code currently interleaves
>>>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>>>> other calls we might resonably expect are to set markers in the
>>>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>>>> the 'full' case anyway.
>>>>>>>
>>>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>>>> will gather those bits from across the contpte block. We don't bother
>>>>>>> doing that here, because we know that the information is used by the
>>>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>>>> for folding in the first place), that information will make it to the
>>>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>>>> means we don't have to play games with accumulating and storing the
>>>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>>>> correct access/dirty information if we have already cleared the pte that
>>>>>>> happened to store it. The core code does not rely on this though.
>>>>>>
>>>>>> Does not *currently* rely on this. I can't help but think it is
>>>>>> potentially something that could change in the future though which would
>>>>>> lead to some subtle bugs.
>>>>>
>>>>> Yes, there is a risk, although IMHO, its very small.
>>>>>
>>>>>>
>>>>>> Would there be any may of avoiding this? Half baked thought but could
>>>>>> you for example copy the access/dirty information to the last (or
>>>>>> perhaps first, most likely invalid) PTE?
>>>>>
>>>>> I spent a long time thinking about this and came up with a number of
>>>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>>>> (which works but suffers from the problem that it depends on the way it is
>>>>> called not changing).
>>>>
>>>> Ok, that answers my underlying question of "has someone thought about
>>>> this and are there any easy solutions". I suspected that was the case
>>>> given the excellent write up though!
>>>>
>>>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>>>> contpte block. This is how I did it in v1; although it was racy. I think this
>>>>> could be implemented correctly but its extremely complex.
>>>>>
>>>>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>>>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>>>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>>>>
>>>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>>>> called. So we would get a clear sign that usage patterns have changed. But there
>>>>> is no easy place to store that state (other than scanning a contpte block
>>>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>>>
>>>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>>>> simple and see if anyone was concerned before doing something more drastic.
>>>>>
>>>>> What do you think? If we really need to solve this, then option 1 is my
>>>>> preferred route, but it would take some time to figure out and reason about a
>>>>> race-free scheme.
>>>>
>>>> Well I like simple, and I agree the risk is small. But I can't help feel
>>>> the current situation is too subtle, mainly because it is architecture
>>>> specific and the assumptions are not communicated in core-mm code
>>>> anywhere. But also none of the aternatives seem much better.
>>>>
>>>> However there are only three callers of ptep_get_and_clear_full(), and
>>>> all of these hold the PTL. So if I'm not mistaken that should exclude
>>>> just about all users of ptep_get*() which will take the ptl before hand.
>>>
>>> The problem isn't racing threads because as you say, the PTL is already
>>> serializing all calls except ptep_get_lockless(). And although there are 3
>>> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
>>> calls it with full=1, as I recall.
>>>
>>> The problem is that the caller in zap_pte_range() does this:
>>>
>>> ptl = lock_page_table()
>>> for each pte {
>>> 	ptent = ptep_get(pte);
>>> 	if (pte_present(ptent) {
>>> 		ptent = ptep_get_and_clear_full(ptent);
>>> 		if (pte_dirty(ptent))
>>> 			...
>>> 		if (pte_young(ptent))
>>> 			...
>>> 	}
>>> }
>>> unlock_page_table(ptl)
>>>
>>> It deliberately interleves calls to ptep_get() and ptep_get_and_clear_full()
>>> under the ptl. So if the loop is iterating over a contpte block and the HW
>>> happens to be storing the access/dirty info in the first pte entry, then the
>>> first time through the loop, ptep_get() will return the correct access/dirty
>>> info, as will ptep_get_and_clear_full(). The next time through the loop though,
>>> the access/dirty info which was in the previous pte is now cleared so ptep_get()
>>> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.
>> 
>> So if ptep_get_lockless() isn't a concern what made the option posted in
>> v1 racy (your option 1 above)? Is there something else reading PTEs or
>> clearing PTE bits without holding the PTL that I'm missing?
>
> The HW could be racing to set access and dirty bits. Well actually, I'm not
> completely sure if that's the case here; if full=1 then presumably no other
> threads in the process should be running at this point, so perhaps it can be
> guarranteed that nothing is causing a concurrent memory access and the HW is
> therefore definitely not going to try to write the access/dirty bits
> concurrently. But I didn't manage to convince myself that's definitely the case.

I suppose it's possible something attached to an SMMU or some such could
still be running and causing accesses, so I agree it's probably not the
case (although it would be an odd corner case).

> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
> buggy because it iterated through the PTEs, getting and accumulating. Then
> iterated again, writing that final set of bits to all the PTEs. And the HW could
> have modified the bits during those loops. I think it would be possible to fix
> the race, but intuition says it would be expensive.

So the issue as I understand it is that subsequent iterations would see a
clean PTE after the first iteration returned a dirty PTE. In
ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
bit (if set) from the PTE being cleared to an adjacent PTE rather than to
all the PTEs?

That would fix the inconsistency as far as subsequent iterations of
ptep_get_and_clear_full() returning the dirty/accessed bits if a previous
iteration did. Obviously HW could still race and cause a previously
clean iteration to return dirty, but that seems ok.
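
Very roughly, and just to illustrate the idea - a hand-wavy, untested sketch
using the __ptep_* helpers from this series, with a made-up name, and where
the final write would really need to be a cmpxchg to be safe against
concurrent HW access/dirty updates:

static inline pte_t contpte_clear_full_move_ad(struct mm_struct *mm,
					unsigned long addr, pte_t *ptep)
{
	pte_t orig = __ptep_get_and_clear(mm, addr, ptep);
	pte_t next;

	if (!pte_dirty(orig) && !pte_young(orig))
		return orig;

	/* Ignores the case where ptep is the last pte in the block. */
	next = __ptep_get(ptep + 1);
	if (pte_valid(next)) {
		if (pte_dirty(orig))
			next = pte_mkdirty(next);
		if (pte_young(orig))
			next = pte_mkyoung(next);
		/* Plain write: racy against HW updates, see above. */
		__set_ptes(mm, addr + PAGE_SIZE, ptep + 1, next, 1);
	}

	return orig;
}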

However all this has just left me with more questions :-)

Won't clearing bits like this result in inconsistent programming of the
PTE_CONT bit? What happens if HW accesses a page in the contiguous region
while some of the PTEs are invalid? And the same question for programming
them really - I don't think we can atomically set PTE_CONT on multiple
PTEs all at once, so if we assume something can be accessing them
concurrently, how do we do that without having some HW observe an
intermediate state where PTE_CONT is misprogrammed?

Thanks.

>> 
>>>>
>>>> So really that only leaves ptep_get_lockless() that could/should
>>>> interleave right? 
>>>
>>> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
>>> is very careful to ensure that the contpte block is in a consistent state and it
>>> keeps trying until it is. So this will always return the correct consistent
>>> information.
>>>
>>>> From a quick glance of those users none look at the
>>>> young/dirty information anyway, so I wonder if we can just assert in the
>>>> core-mm that ptep_get_lockless() does not return young/dirty information
>>>> and clear it in the helpers? That would make things explicit and
>>>> consistent which would address my concern (although I haven't looked too
>>>> closely at the details there).
>>>
>>> As per the explanation above, it's not ptep_get_lockless() that is the problem, so I
>>> don't think this helps.
>>>
>>> Thanks,
>>> Ryan
>>>
>>>>
>>>>> Thanks,
>>>>> Ryan
>>>>
>>
Barry Song Nov. 28, 2023, 7:32 a.m. UTC | #8
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep, int full)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (!pte_valid_cont(orig_pte) || !full) {
> +		contpte_try_unfold(mm, addr, ptep, orig_pte);
> +		return __ptep_get_and_clear(mm, addr, ptep);
> +	} else
> +		return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> +}
> +

Hi Ryan,

I find the code quite hard to understand. When !pte_valid_cont(orig_pte),
we will call contpte_try_unfold(mm, addr, ptep, orig_pte);

but contpte_try_unfold() only unfolds when pte_valid_cont()
is true:
static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
                                        pte_t *ptep, pte_t pte) 
{
        if (contpte_is_enabled(mm) && pte_valid_cont(pte))
                __contpte_try_unfold(mm, addr, ptep, pte);
}

so do you mean the below?

if (!pte_valid_cont(orig_pte))
	return __ptep_get_and_clear(mm, addr, ptep);

if (!full) {
	contpte_try_unfold(mm, addr, ptep, orig_pte);
	return __ptep_get_and_clear(mm, addr, ptep);	
} else {
	return contpte_ptep_get_and_clear_full(mm, addr, ptep);
}

Thanks
Barry
Barry Song Nov. 28, 2023, 8:17 a.m. UTC | #9
> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	/*
> +	 * When doing a full address space teardown, we can avoid unfolding the
> +	 * contiguous range, and therefore avoid the associated tlbi. Instead,
> +	 * just get and clear the pte. The caller is promising to call us for
> +	 * every pte, so every pte in the range will be cleared by the time the
> +	 * tlbi is issued.
> +	 *
> +	 * This approach is not perfect though, as for the duration between
> +	 * returning from the first call to ptep_get_and_clear_full() and making
> +	 * the final call, the contpte block in an intermediate state, where
> +	 * some ptes are cleared and others are still set with the PTE_CONT bit.
> +	 * If any other APIs are called for the ptes in the contpte block during
> +	 * that time, we have to be very careful. The core code currently
> +	 * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> +	 * ptep_get() must be careful to ignore the cleared entries when
> +	 * accumulating the access and dirty bits - the same goes for
> +	 * ptep_get_lockless(). The only other calls we might resonably expect
> +	 * are to set markers in the previously cleared ptes. (We shouldn't see
> +	 * valid entries being set until after the tlbi, at which point we are
> +	 * no longer in the intermediate state). Since markers are not valid,
> +	 * this is safe; set_ptes() will see the old, invalid entry and will not
> +	 * attempt to unfold. And the new pte is also invalid so it won't
> +	 * attempt to fold. We shouldn't see this for the 'full' case anyway.
> +	 *
> +	 * The last remaining issue is returning the access/dirty bits. That
> +	 * info could be present in any of the ptes in the contpte block.
> +	 * ptep_get() will gather those bits from across the contpte block. We
> +	 * don't bother doing that here, because we know that the information is
> +	 * used by the core-mm to mark the underlying folio as accessed/dirty.
> +	 * And since the same folio must be underpinning the whole block (that
> +	 * was a requirement for folding in the first place), that information
> +	 * will make it to the folio eventually once all the ptes have been
> +	 * cleared. This approach means we don't have to play games with
> +	 * accumulating and storing the bits. It does mean that any interleaved
> +	 * calls to ptep_get() may lack correct access/dirty information if we
> +	 * have already cleared the pte that happened to store it. The core code
> +	 * does not rely on this though.

Even without any other threads running and touching those PTEs, this won't survive
on some hardware: we expose inconsistent CONTPTEs to hardware, and this might result
in crashed firmware even in trustzone - strange & unknown faults to trustzone that we
have seen on Qualcomm, though for MTK it seems fine. When you do a tlbi on a part of
the PTEs with CONT dropped while some other PTEs still have CONT set, we make the
hardware totally confused.

zap_pte_range() has a force_flush when tlbbatch is full:

                        if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
                                force_flush = 1; 
                                addr += PAGE_SIZE;
                                break;
                        }

This means you can expose a partial tlbi/flush directly to hardware while some
other PTEs are still CONT.

On the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
on fullmm; as long as the zap range covers a large folio, we can do the tlbi for
those CONTPTEs all together in your contpte_ptep_get_and_clear_full() rather
than clearing one PTE at a time.

Our approach in [1] is to do a flush for all CONTPTEs and go directly to the end
of the large folio:

#ifdef CONFIG_CONT_PTE_HUGEPAGE
			if (pte_cont(ptent)) {
				unsigned long next = pte_cont_addr_end(addr, end);

				if (next - addr != HPAGE_CONT_PTE_SIZE) {
					__split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
					/*
					 * After splitting cont-pte
					 * we need to process pte again.
					 */
					goto again_pte;
				} else {
					cont_pte_huge_ptep_get_and_clear(mm, addr, pte);

					tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
					if (unlikely(!page))
						continue;

					if (is_huge_zero_page(page)) {
						tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
						goto cont_next;
					}

					rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
					page_remove_rmap(page, true);
					if (unlikely(page_mapcount(page) < 0))
						print_bad_pte(vma, addr, ptent, page);

					tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
				}
cont_next:
				/* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
				pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
				addr = next - PAGE_SIZE;
				continue;
			}
#endif

This is our "full" counterpart, which clears and flushes CONT_PTES pages directly, and
it never requires tlb->fullmm at all.

static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
				       unsigned long addr,
				       pte_t *ptep,
				       bool flush)
{
	pte_t orig_pte = ptep_get(ptep);

	CHP_BUG_ON(!pte_cont(orig_pte));
	CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
	CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));

	return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
}

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539

> +	 */
> +
> +	return __ptep_get_and_clear(mm, addr, ptep);
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> +

Thanks
Barry
Ryan Roberts Nov. 28, 2023, 11:15 a.m. UTC | #10
On 28/11/2023 07:32, Barry Song wrote:
>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep, int full)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (!pte_valid_cont(orig_pte) || !full) {
>> +		contpte_try_unfold(mm, addr, ptep, orig_pte);
>> +		return __ptep_get_and_clear(mm, addr, ptep);
>> +	} else
>> +		return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>> +}
>> +
> 
> Hi Ryan,
> 
> I feel quite hard to understand the code. when !pte_valid_cont(orig_pte),
> we will call contpte_try_unfold(mm, addr, ptep, orig_pte);
> 
> but in contpte_try_unfold(), we call unfold only if pte_valid_cont()
> is true:
> static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>                                         pte_t *ptep, pte_t pte) 
> {
>         if (contpte_is_enabled(mm) && pte_valid_cont(pte))
>                 __contpte_try_unfold(mm, addr, ptep, pte);
> }
> 
> so do you mean the below?
> 
> if (!pte_valid_cont(orig_pte))
> 	return __ptep_get_and_clear(mm, addr, ptep);
> 
> if (!full) {
> 	contpte_try_unfold(mm, addr, ptep, orig_pte);
> 	return __ptep_get_and_clear(mm, addr, ptep);	
> } else {
> 	return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> }

Yes, this is equivalent. In general, I was trying not to spray `if
(pte_valid_cont(orig_pte))` checks everywhere to guard contpte_try_unfold() and
instead put the checks into contpte_try_unfold() (hence the 'try'). I figured
just calling it unconditionally and letting the compiler optimize as it sees fit
was the cleanest approach.

But in this instance I can see this is confusing. I'll modify as you suggest.
Thanks!

> 
> Thanks
> Barry
> 
>
Ryan Roberts Nov. 28, 2023, 11:49 a.m. UTC | #11
On 28/11/2023 08:17, Barry Song wrote:
>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	/*
>> +	 * When doing a full address space teardown, we can avoid unfolding the
>> +	 * contiguous range, and therefore avoid the associated tlbi. Instead,
>> +	 * just get and clear the pte. The caller is promising to call us for
>> +	 * every pte, so every pte in the range will be cleared by the time the
>> +	 * tlbi is issued.
>> +	 *
>> +	 * This approach is not perfect though, as for the duration between
>> +	 * returning from the first call to ptep_get_and_clear_full() and making
>> +	 * the final call, the contpte block in an intermediate state, where
>> +	 * some ptes are cleared and others are still set with the PTE_CONT bit.
>> +	 * If any other APIs are called for the ptes in the contpte block during
>> +	 * that time, we have to be very careful. The core code currently
>> +	 * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
>> +	 * ptep_get() must be careful to ignore the cleared entries when
>> +	 * accumulating the access and dirty bits - the same goes for
>> +	 * ptep_get_lockless(). The only other calls we might resonably expect
>> +	 * are to set markers in the previously cleared ptes. (We shouldn't see
>> +	 * valid entries being set until after the tlbi, at which point we are
>> +	 * no longer in the intermediate state). Since markers are not valid,
>> +	 * this is safe; set_ptes() will see the old, invalid entry and will not
>> +	 * attempt to unfold. And the new pte is also invalid so it won't
>> +	 * attempt to fold. We shouldn't see this for the 'full' case anyway.
>> +	 *
>> +	 * The last remaining issue is returning the access/dirty bits. That
>> +	 * info could be present in any of the ptes in the contpte block.
>> +	 * ptep_get() will gather those bits from across the contpte block. We
>> +	 * don't bother doing that here, because we know that the information is
>> +	 * used by the core-mm to mark the underlying folio as accessed/dirty.
>> +	 * And since the same folio must be underpinning the whole block (that
>> +	 * was a requirement for folding in the first place), that information
>> +	 * will make it to the folio eventually once all the ptes have been
>> +	 * cleared. This approach means we don't have to play games with
>> +	 * accumulating and storing the bits. It does mean that any interleaved
>> +	 * calls to ptep_get() may lack correct access/dirty information if we
>> +	 * have already cleared the pte that happened to store it. The core code
>> +	 * does not rely on this though.
> 
> even without any other threads running and touching those PTEs, this won't survive
> on some hardware. we expose inconsistent CONTPTEs to hardware, this might result

No, that's not the case; if you read the Arm ARM, the page table is only
considered "misprogrammed" when *valid* entries within the same contpte block
have different values for the contiguous bit. We are clearing the ptes to zero
here, which is an *invalid* entry. So if the TLB entry somehow gets invalidated
(either due to an explicit tlbi as you point out below, or due to a concurrent TLB
miss which selects our entry for removal to make space for the new incoming
entry), and it then gets an access request for an address in our partially cleared
contpte block, the address will either be:

A) an address for a pte entry we have already cleared, so it's invalid and it
will fault (and get serialized behind the PTL).

or

B) an address for a pte entry we haven't yet cleared, so it will reform a TLB
entry for the contpte block. But that's ok because the memory still exists
because we haven't yet finished clearing the page table and have not yet issued
the final tlbi.


> in crashed firmware even in trustzone, strange&unknown faults to trustzone we have
> seen on Qualcomm, but for MTK, it seems fine. when you do tlbi on a part of PTEs
> with dropped CONT but still some other PTEs have CONT, we make hardware totally
> confused.

I suspect this is because in your case you are "misprogramming" the contpte
block; there are *valid* pte entries within the block that disagree about the
contiguous bit or about various other fields. In this case some HW TLB designs
can do weird things. I suspect in your case, that's resulting in accessing bad
memory space and causing an SError, which is trapped by EL3, and the FW is
probably just panicking at that point.

> 
> zap_pte_range() has a force_flush when tlbbatch is full:
> 
>                         if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
>                                 force_flush = 1; 
>                                 addr += PAGE_SIZE;
>                                 break;
>                         }
> 
> this means you can expose partial tlbi/flush directly to hardware while some
> other PTEs are still CONT.

Yes, but that's also possible even if we have a tight loop that clears down the
contpte block; there could still be another core that issues a tlbi while you're
halfway through that loop, or the HW could happen to evict due to TLB pressure
at any time. The point is, it's safe if you are clearing the pte to an *invalid*
entry.

> 
> on the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
> on fullmm, as long as zap range covers a large folio, we can flush tlbi for
> those CONTPTEs all together in your contpte_ptep_get_and_clear_full() rather
> than clearing one PTE.
> 
> Our approach in [1] is we do a flush for all CONTPTEs and go directly to the end
> of the large folio:
> 
> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> 			if (pte_cont(ptent)) {
> 				unsigned long next = pte_cont_addr_end(addr, end);
> 
> 				if (next - addr != HPAGE_CONT_PTE_SIZE) {
> 					__split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
> 					/*
> 					 * After splitting cont-pte
> 					 * we need to process pte again.
> 					 */
> 					goto again_pte;
> 				} else {
> 					cont_pte_huge_ptep_get_and_clear(mm, addr, pte);
> 
> 					tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
> 					if (unlikely(!page))
> 						continue;
> 
> 					if (is_huge_zero_page(page)) {
> 						tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> 						goto cont_next;
> 					}
> 
> 					rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
> 					page_remove_rmap(page, true);
> 					if (unlikely(page_mapcount(page) < 0))
> 						print_bad_pte(vma, addr, ptent, page);
> 
> 					tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> 				}
> cont_next:
> 				/* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
> 				pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> 				addr = next - PAGE_SIZE;
> 				continue;
> 			}
> #endif
> 
> this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
> it never requires tlb->fullmm at all.

Yes, but you are benefitting from the fact that contpte is exposed to core-mm
and it is special-casing them at this level. I'm trying to avoid that.

I don't think there is any correctness issue here. But there is a problem with
fragility, as raised by Alistair. I have some ideas on potentially how to solve
that. I'm going to try to work on it this afternoon and will post if I get some
confidence that it is a real solution.

Thanks,
Ryan

> 
> static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
> 				       unsigned long addr,
> 				       pte_t *ptep,
> 				       bool flush)
> {
> 	pte_t orig_pte = ptep_get(ptep);
> 
> 	CHP_BUG_ON(!pte_cont(orig_pte));
> 	CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
> 	CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
> 
> 	return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
> }
> 
> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
> 
>> +	 */
>> +
>> +	return __ptep_get_and_clear(mm, addr, ptep);
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
>> +
> 
> Thanks
> Barry
> 
>
Ryan Roberts Nov. 28, 2023, 12:45 p.m. UTC | #12
On 28/11/2023 06:54, Alistair Popple wrote:
> 
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 27/11/2023 07:34, Alistair Popple wrote:
>>>
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>> On 24/11/2023 01:35, Alistair Popple wrote:
>>>>>
>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>
>>>>>> On 23/11/2023 05:13, Alistair Popple wrote:
>>>>>>>
>>>>>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>>>>>
>>>>>>>> ptep_get_and_clear_full() adds a 'full' parameter which is not present
>>>>>>>> for the fallback ptep_get_and_clear() function. 'full' is set to 1 when
>>>>>>>> a full address space teardown is in progress. We use this information to
>>>>>>>> optimize arm64_sys_exit_group() by avoiding unfolding (and therefore
>>>>>>>> tlbi) contiguous ranges. Instead we just clear the PTE but allow all the
>>>>>>>> contiguous neighbours to keep their contig bit set, because we know we
>>>>>>>> are about to clear the rest too.
>>>>>>>>
>>>>>>>> Before this optimization, the cost of arm64_sys_exit_group() exploded to
>>>>>>>> 32x what it was before PTE_CONT support was wired up, when compiling the
>>>>>>>> kernel. With this optimization in place, we are back down to the
>>>>>>>> original cost.
>>>>>>>>
>>>>>>>> This approach is not perfect though, as for the duration between
>>>>>>>> returning from the first call to ptep_get_and_clear_full() and making
>>>>>>>> the final call, the contpte block in an intermediate state, where some
>>>>>>>> ptes are cleared and others are still set with the PTE_CONT bit. If any
>>>>>>>> other APIs are called for the ptes in the contpte block during that
>>>>>>>> time, we have to be very careful. The core code currently interleaves
>>>>>>>> calls to ptep_get_and_clear_full() with ptep_get() and so ptep_get()
>>>>>>>> must be careful to ignore the cleared entries when accumulating the
>>>>>>>> access and dirty bits - the same goes for ptep_get_lockless(). The only
>>>>>>>> other calls we might resonably expect are to set markers in the
>>>>>>>> previously cleared ptes. (We shouldn't see valid entries being set until
>>>>>>>> after the tlbi, at which point we are no longer in the intermediate
>>>>>>>> state). Since markers are not valid, this is safe; set_ptes() will see
>>>>>>>> the old, invalid entry and will not attempt to unfold. And the new pte
>>>>>>>> is also invalid so it won't attempt to fold. We shouldn't see this for
>>>>>>>> the 'full' case anyway.
>>>>>>>>
>>>>>>>> The last remaining issue is returning the access/dirty bits. That info
>>>>>>>> could be present in any of the ptes in the contpte block. ptep_get()
>>>>>>>> will gather those bits from across the contpte block. We don't bother
>>>>>>>> doing that here, because we know that the information is used by the
>>>>>>>> core-mm to mark the underlying folio as accessed/dirty. And since the
>>>>>>>> same folio must be underpinning the whole block (that was a requirement
>>>>>>>> for folding in the first place), that information will make it to the
>>>>>>>> folio eventually once all the ptes have been cleared. This approach
>>>>>>>> means we don't have to play games with accumulating and storing the
>>>>>>>> bits. It does mean that any interleaved calls to ptep_get() may lack
>>>>>>>> correct access/dirty information if we have already cleared the pte that
>>>>>>>> happened to store it. The core code does not rely on this though.
>>>>>>>
>>>>>>> Does not *currently* rely on this. I can't help but think it is
>>>>>>> potentially something that could change in the future though which would
>>>>>>> lead to some subtle bugs.
>>>>>>
>>>>>> Yes, there is a risk, although IMHO, its very small.
>>>>>>
>>>>>>>
>>>>>>> Would there be any may of avoiding this? Half baked thought but could
>>>>>>> you for example copy the access/dirty information to the last (or
>>>>>>> perhaps first, most likely invalid) PTE?
>>>>>>
>>>>>> I spent a long time thinking about this and came up with a number of
>>>>>> possibilities, none of them ideal. In the end, I went for the simplest one
>>>>>> (which works but suffers from the problem that it depends on the way it is
>>>>>> called not changing).
>>>>>
>>>>> Ok, that answers my underlying question of "has someone thought about
>>>>> this and are there any easy solutions". I suspected that was the case
>>>>> given the excellent write up though!
>>>>>
>>>>>> 1) copy the access/dirty flags into all the remaining uncleared ptes within the
>>>>>> contpte block. This is how I did it in v1; although it was racy. I think this
>>>>>> could be implemented correctly but its extremely complex.
>>>>>>
>>>>>> 2) batch calls from the core-mm (like I did for pte_set_wrprotects()) so that we
>>>>>> can clear 1 or more full contpte blocks in a single call - the ptes are never in
>>>>>> an intermediate state. This is difficult because ptep_get_and_clear_full()
>>>>>> returns the pte that was cleared so its difficult to scale that up to multiple ptes.
>>>>>>
>>>>>> 3) add ptep_get_no_access_dirty() and redefine the interface to only allow that
>>>>>> to be called while ptep_get_and_clear_full() calls are on-going. Then assert in
>>>>>> the other functions that ptep_get_and_clear_full() is not on-going when they are
>>>>>> called. So we would get a clear sign that usage patterns have changed. But there
>>>>>> is no easy place to store that state (other than scanning a contpte block
>>>>>> looking for pte_none() amongst pte_valid_cont() entries) and it all felt ugly.
>>>>>>
>>>>>> 4) The simple approach I ended up taking; I thought it would be best to keep it
>>>>>> simple and see if anyone was concerned before doing something more drastic.
>>>>>>
>>>>>> What do you think? If we really need to solve this, then option 1 is my
>>>>>> preferred route, but it would take some time to figure out and reason about a
>>>>>> race-free scheme.
>>>>>
>>>>> Well I like simple, and I agree the risk is small. But I can't help feel
>>>>> the current situation is too subtle, mainly because it is architecture
>>>>> specific and the assumptions are not communicated in core-mm code
>>>>> anywhere. But also none of the aternatives seem much better.
>>>>>
>>>>> However there are only three callers of ptep_get_and_clear_full(), and
>>>>> all of these hold the PTL. So if I'm not mistaken that should exclude
>>>>> just about all users of ptep_get*() which will take the ptl before hand.
>>>>
>>>> The problem isn't racing threads because as you say, the PTL is already
>>>> serializing all calls except ptep_get_lockless(). And although there are 3
>>>> callers to ptep_get_and_clear_full(), only the caller in zap_pte_range() ever
>>>> calls it with full=1, as I recall.
>>>>
>>>> The problem is that the caller in zap_pte_range() does this:
>>>>
>>>> ptl = lock_page_table()
>>>> for each pte {
>>>> 	ptent = ptep_get(pte);
>>>> 	if (pte_present(ptent) {
>>>> 		ptent = ptep_get_and_clear_full(ptent);
>>>> 		if (pte_dirty(ptent))
>>>> 			...
>>>> 		if (pte_young(ptent))
>>>> 			...
>>>> 	}
>>>> }
>>>> unlock_page_table(ptl)
>>>>
>>>> It deliberately interleves calls to ptep_get() and ptep_get_and_clear_full()
>>>> under the ptl. So if the loop is iterating over a contpte block and the HW
>>>> happens to be storing the access/dirty info in the first pte entry, then the
>>>> first time through the loop, ptep_get() will return the correct access/dirty
>>>> info, as will ptep_get_and_clear_full(). The next time through the loop though,
>>>> the access/dirty info which was in the previous pte is now cleared so ptep_get()
>>>> and ptep_get_and_clear_full() will return old/clean. It all works, but is fragile.
>>>
>>> So if ptep_get_lockless() isn't a concern what made the option posted in
>>> v1 racy (your option 1 above)? Is there something else reading PTEs or
>>> clearing PTE bits without holding the PTL that I'm missing?
>>
>> The HW could be racing to set access and dirty bits. Well actually, I'm not
>> completely sure if that's the case here; if full=1 then presumably no other
>> threads in the process should be running at this point, so perhaps it can be
>> guarranteed that nothing is causing a concurrent memory access and the HW is
>> therefore definitely not going to try to write the access/dirty bits
>> concurrently. But I didn't manage to convince myself that's definitely the case.
> 
> I suppose it's possible something attached to an SMMU or some such could
> still be running and causing accesses so agree it's probably not the
> case (although it would be an odd corner case).

Indeed.

> 
>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>> buggy because it iterated through the PTEs, getting and accumulating. Then
>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>> have modified the bits during those loops. I think it would be possible to fix
>> the race, but intuition says it would be expensive.
> 
> So the issue as I understand it is subsequent iterations would see a
> clean PTE after the first iteration returned a dirty PTE. In
> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
> bit (if set) from the PTE being cleared to an adjacent PTE rather than
> all the PTEs?

The raciness I'm describing is the race between reading access/dirty from one
pte and applying it to another. But yes, I like your suggestion. If we do:

pte = __ptep_get_and_clear_full(ptep)

on the target pte, then we have grabbed access/dirty from it in a race-free
manner. We can then loop from the current pte up towards the top of the block until
we find a valid entry (and I guess wrap at the top to make us robust against
future callers clearing in an arbitrary order). Then atomically accumulate the
access/dirty bits we have just saved into that new entry. I guess that's just a
cmpxchg loop - there are already examples of how to do that correctly when
racing the TLB.

For most entries, we will just be copying up to the next pte. For the last valid pte,
we would end up reading all the ptes and determining that we are the last one.

What do you think?
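
Very roughly, something like this? Completely untested, and the accumulate
step is modelled on the cmpxchg loop in __ptep_set_access_flags(), so it
should be safe against concurrent HW access/dirty updates:

pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
					unsigned long addr, pte_t *ptep)
{
	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
	pte_t *start = contpte_align_down(ptep);
	pte_t *p;

	if (!pte_dirty(orig_pte) && !pte_young(orig_pte))
		return orig_pte;

	/* Walk up from the next pte, wrapping at the top of the block. */
	for (p = ptep + 1; ; p++) {
		pteval_t old_val, val;

		if (p == start + CONT_PTES)
			p = start;
		if (p == ptep)
			break;		/* no other valid pte left in the block */

		val = pte_val(__ptep_get(p));
		if (!pte_valid(__pte(val)))
			continue;

		/* Atomically accumulate access/dirty into this valid pte. */
		do {
			pte_t tmp = __pte(val);

			old_val = val;
			if (pte_dirty(orig_pte))
				tmp = pte_mkdirty(tmp);
			if (pte_young(orig_pte))
				tmp = pte_mkyoung(tmp);
			val = cmpxchg_relaxed(&pte_val(*p), old_val,
						pte_val(tmp));
		} while (val != old_val);

		break;
	}

	return orig_pte;
}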

> 
> That would fix the inconsistency as far as subsequent iterations of
> ptep_get_and_clear_full() returning the dirty/accessed if a previous
> iteration did. Obviously HW could still race and cause a previously
> clean iteration to return dirty, but that seems ok.
> 
> However all this has just left me with more questions :-)
> 
> Won't clearing bits like this result in inconsistent programming of the
> PTE_CONT bit? What happens if HW access a page in the contiguous region
> while some of the PTEs are invalid? And same question for programming
> them really - I don't think we can atomically set PTE_CONT on multiple
> PTEs all at once so if we assume something can be accessing them
> concurrently how do we do that without having some HW observing an
> intermediate state where PTE_CONT is misprogrammed?

I'm fairly confident at this point that it is safe and conformant to the
architecture. See my explanation at [1]. Of course having people look for holes
is very welcome - so thanks!

There is also a clarification we made to the Arm ARM, primarily for patch 13 (see
the commit log), which also clarifies that invalid ptes are not
included when considering whether a contpte block is misprogrammed. See section
D21194 at [2].


[1] https://lore.kernel.org/linux-mm/207de995-6d48-41ea-8373-2f9caad9b9c3@arm.com/
[2] https://developer.arm.com/documentation/102105/latest/

> 
> Thanks.
> 
>>>
>>>>>
>>>>> So really that only leaves ptep_get_lockless() that could/should
>>>>> interleave right? 
>>>>
>>>> Yes, but ptep_get_lockless() is special. Since it is called without the PTL, it
>>>> is very careful to ensure that the contpte block is in a consistent state and it
>>>> keeps trying until it is. So this will always return the correct consistent
>>>> information.
>>>>
>>>>> From a quick glance of those users none look at the
>>>>> young/dirty information anyway, so I wonder if we can just assert in the
>>>>> core-mm that ptep_get_lockless() does not return young/dirty information
>>>>> and clear it in the helpers? That would make things explicit and
>>>>> consistent which would address my concern (although I haven't looked too
>>>>> closely at the details there).
>>>>
>>>> As per explanation above, its not ptep_get_lockless() that is the problem so I
>>>> don't think this helps.
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>>
>>>>>> Thanks,
>>>>>> Ryan
>>>>>
>>>
>
Ryan Roberts Nov. 28, 2023, 4:55 p.m. UTC | #13
>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>> have modified the bits during those loops. I think it would be possible to fix
>>> the race, but intuition says it would be expensive.
>>
>> So the issue as I understand it is subsequent iterations would see a
>> clean PTE after the first iteration returned a dirty PTE. In
>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>> all the PTEs?
> 
> The raciness I'm describing is the race between reading access/dirty from one
> pte and applying it to another. But yes I like your suggestion. if we do:
> 
> pte = __ptep_get_and_clear_full(ptep)
> 
> on the target pte, then we have grabbed access/dirty from it in a race-free
> manner. we can then loop from current pte up towards the top of the block until
> we find a valid entry (and I guess wrap at the top to make us robust against
> future callers clearing an an arbitrary order). Then atomically accumulate the
> access/dirty bits we have just saved into that new entry. I guess that's just a
> cmpxchg loop - there are already examples of how to do that correctly when
> racing the TLB.
> 
> For most entries, we will just be copying up to the next pte. For the last pte,
> we would end up reading all ptes and determine we are the last one.
> 
> What do you think?

OK here is an attempt at something which solves the fragility. I think this is
now robust and will always return the correct access/dirty state from
ptep_get_and_clear_full() and ptep_get().

But I'm not sure about performance; each call to ptep_get_and_clear_full() for
each pte in a contpte block will cause a ptep_get() to gather the access/dirty
bits from across the contpte block - which requires reading each pte in the
contpte block. So it's O(n^2) in that sense. I'll benchmark it and report back.

Was this the type of thing you were thinking of, Alistair?


--8<--
 arch/arm64/include/asm/pgtable.h | 23 ++++++++-
 arch/arm64/mm/contpte.c          | 81 ++++++++++++++++++++++++++++++++
 arch/arm64/mm/fault.c            | 38 +++++++++------
 3 files changed, 125 insertions(+), 17 deletions(-)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9bd2f57a9e11..6c295d277784 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
 	return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
 }

+extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
 extern int __ptep_set_access_flags(struct vm_area_struct *vma,
 				 unsigned long address, pte_t *ptep,
 				 pte_t entry, int dirty);
@@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
 extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
 extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, pte_t pte, unsigned int nr);
+extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
 	__pte_clear(mm, addr, ptep);
 }

+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep, int full)
+{
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (!pte_valid_cont(orig_pte))
+		return __ptep_get_and_clear(mm, addr, ptep);
+
+	if (!full) {
+		contpte_try_unfold(mm, addr, ptep, orig_pte);
+		return __ptep_get_and_clear(mm, addr, ptep);
+	}
+
+	return contpte_ptep_get_and_clear_full(mm, addr, ptep);
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
-	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
-	return __ptep_get_and_clear(mm, addr, ptep);
+	return ptep_get_and_clear_full(mm, addr, ptep, 0);
 }

 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 2a57df16bf58..99b211118d93 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 	for (i = 0; i < CONT_PTES; i++, ptep++) {
 		pte = __ptep_get(ptep);

+		/*
+		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
+		 * where some of the ptes in the range may be cleared but others
+		 * are still to do. See contpte_ptep_get_and_clear_full().
+		 */
+		if (!pte_valid(pte))
+			continue;
+
 		if (pte_dirty(pte))
 			orig_pte = pte_mkdirty(orig_pte);

@@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL(contpte_set_ptes);

+pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+					unsigned long addr, pte_t *ptep)
+{
+	/*
+	 * When doing a full address space teardown, we can avoid unfolding the
+	 * contiguous range, and therefore avoid the associated tlbi. Instead,
+	 * just get and clear the pte. The caller is promising to call us for
+	 * every pte, so every pte in the range will be cleared by the time the
+	 * final tlbi is issued.
+	 *
+	 * This approach requires some complex hoop jumping though, as for the
+	 * duration between returning from the first call to
+	 * ptep_get_and_clear_full() and making the final call, the contpte
+	 * block is in an intermediate state, where some ptes are cleared and
+	 * others are still set with the PTE_CONT bit. If any other APIs are
+	 * called for the ptes in the contpte block during that time, we have to
+	 * be very careful. The core code currently interleaves calls to
+	 * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
+	 * careful to ignore the cleared entries when accumulating the access
+	 * and dirty bits - the same goes for ptep_get_lockless(). The only
+	 * other calls we might reasonably expect are to set markers in the
+	 * previously cleared ptes. (We shouldn't see valid entries being set
+	 * until after the tlbi, at which point we are no longer in the
+	 * intermediate state). Since markers are not valid, this is safe;
+	 * set_ptes() will see the old, invalid entry and will not attempt to
+	 * unfold. And the new pte is also invalid so it won't attempt to fold.
+	 * We shouldn't see pte markers being set for the 'full' case anyway
+	 * since the address space is being torn down.
+	 *
+	 * The last remaining issue is returning the access/dirty bits. That
+	 * info could be present in any of the ptes in the contpte block.
+	 * ptep_get() will gather those bits from across the contpte block (for
+	 * the remaining valid entries). So below, if the pte we are clearing
+	 * has dirty or young set, we need to stash it into a pte that we are
+	 * yet to clear. This allows future calls to return the correct state
+	 * even when the info was stored in a different pte. Since the core-mm
+	 * calls from low to high address, we prefer to stash in the last pte of
+	 * the contpte block - this means we are not "dragging" the bits up
+	 * through all ptes and increases the chances that we can exit early
+	 * because a given pte will have neither dirty nor young set.
+	 */
+
+	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
+	bool dirty = pte_dirty(orig_pte);
+	bool young = pte_young(orig_pte);
+	pte_t *start;
+
+	if (!dirty && !young)
+		return contpte_ptep_get(ptep, orig_pte);
+
+	start = contpte_align_down(ptep);
+	ptep = start + CONT_PTES - 1;
+
+	for (; ptep >= start; ptep--) {
+		pte_t pte = __ptep_get(ptep);
+
+		if (!pte_valid(pte))
+			continue;
+
+		if (dirty)
+			pte = pte_mkdirty(pte);
+
+		if (young)
+			pte = pte_mkyoung(pte);
+
+		__ptep_set_access_flags_notlbi(ptep, pte);
+		return contpte_ptep_get(ptep, orig_pte);
+	}
+
+	return orig_pte;
+}
+EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
+
 int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 					unsigned long addr, pte_t *ptep)
 {
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index d63f3a0a7251..b22216a8153c 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
 	pr_cont("\n");
 }

-/*
- * This function sets the access flags (dirty, accessed), as well as write
- * permission, and only to a more permissive setting.
- *
- * It needs to cope with hardware update of the accessed/dirty state by other
- * agents in the system and can safely skip the __sync_icache_dcache() call as,
- * like __set_ptes(), the PTE is never changed from no-exec to exec here.
- *
- * Returns whether or not the PTE actually changed.
- */
-int __ptep_set_access_flags(struct vm_area_struct *vma,
-			    unsigned long address, pte_t *ptep,
-			    pte_t entry, int dirty)
+int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
 {
 	pteval_t old_pteval, pteval;
 	pte_t pte = __ptep_get(ptep);
@@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
 		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
 	} while (pteval != old_pteval);

+	return 1;
+}
+
+/*
+ * This function sets the access flags (dirty, accessed), as well as write
+ * permission, and only to a more permissive setting.
+ *
+ * It needs to cope with hardware update of the accessed/dirty state by other
+ * agents in the system and can safely skip the __sync_icache_dcache() call as,
+ * like __set_ptes(), the PTE is never changed from no-exec to exec here.
+ *
+ * Returns whether or not the PTE actually changed.
+ */
+int __ptep_set_access_flags(struct vm_area_struct *vma,
+			    unsigned long address, pte_t *ptep,
+			    pte_t entry, int dirty)
+{
+	int changed = __ptep_set_access_flags_notlbi(ptep, entry);
+
 	/* Invalidate a stale read-only entry */
-	if (dirty)
+	if (changed && dirty)
 		flush_tlb_page(vma, address);
-	return 1;
+
+	return changed;
 }

 static bool is_el1_instruction_abort(unsigned long esr)
--8<--
Barry Song Nov. 28, 2023, 8:23 p.m. UTC | #14
On Wed, Nov 29, 2023 at 12:49 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/11/2023 08:17, Barry Song wrote:
> >> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> >> +                                    unsigned long addr, pte_t *ptep)
> >> +{
> >> +    /*
> >> +     * When doing a full address space teardown, we can avoid unfolding the
> >> +     * contiguous range, and therefore avoid the associated tlbi. Instead,
> >> +     * just get and clear the pte. The caller is promising to call us for
> >> +     * every pte, so every pte in the range will be cleared by the time the
> >> +     * tlbi is issued.
> >> +     *
> >> +     * This approach is not perfect though, as for the duration between
> >> +     * returning from the first call to ptep_get_and_clear_full() and making
> >> +     * the final call, the contpte block in an intermediate state, where
> >> +     * some ptes are cleared and others are still set with the PTE_CONT bit.
> >> +     * If any other APIs are called for the ptes in the contpte block during
> >> +     * that time, we have to be very careful. The core code currently
> >> +     * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> >> +     * ptep_get() must be careful to ignore the cleared entries when
> >> +     * accumulating the access and dirty bits - the same goes for
> >> +     * ptep_get_lockless(). The only other calls we might resonably expect
> >> +     * are to set markers in the previously cleared ptes. (We shouldn't see
> >> +     * valid entries being set until after the tlbi, at which point we are
> >> +     * no longer in the intermediate state). Since markers are not valid,
> >> +     * this is safe; set_ptes() will see the old, invalid entry and will not
> >> +     * attempt to unfold. And the new pte is also invalid so it won't
> >> +     * attempt to fold. We shouldn't see this for the 'full' case anyway.
> >> +     *
> >> +     * The last remaining issue is returning the access/dirty bits. That
> >> +     * info could be present in any of the ptes in the contpte block.
> >> +     * ptep_get() will gather those bits from across the contpte block. We
> >> +     * don't bother doing that here, because we know that the information is
> >> +     * used by the core-mm to mark the underlying folio as accessed/dirty.
> >> +     * And since the same folio must be underpinning the whole block (that
> >> +     * was a requirement for folding in the first place), that information
> >> +     * will make it to the folio eventually once all the ptes have been
> >> +     * cleared. This approach means we don't have to play games with
> >> +     * accumulating and storing the bits. It does mean that any interleaved
> >> +     * calls to ptep_get() may lack correct access/dirty information if we
> >> +     * have already cleared the pte that happened to store it. The core code
> >> +     * does not rely on this though.
> >
> > even without any other threads running and touching those PTEs, this won't survive
> > on some hardware. we expose inconsistent CONTPTEs to hardware, this might result
>
> No that's not the case; if you read the Arm ARM, the page table is only
> considered "misgrogrammed" when *valid* entries within the same contpte block
> have different values for the contiguous bit. We are clearing the ptes to zero
> here, which is an *invalid* entry. So if the TLB entry somehow gets invalidated
> (either due to explicit tlbi as you point out below, or due to a concurrent TLB
> miss which selects our entry for removal to make space for the new incomming
> entry), then it gets an access request for an address in our partially cleared
> contpte block the address will either be:
>
> A) an address for a pte entry we have already cleared, so its invalid and it
> will fault (and get serialized behind the PTL).
>
> or
>
> B) an address for a pte entry we haven't yet cleared, so it will reform a TLB
> entry for the contpte block. But that's ok because the memory still exists
> because we haven't yet finished clearing the page table and have not yet issued
> the final tlbi.
>
>
> > in crashed firmware even in trustzone, strange&unknown faults to trustzone we have
> > seen on Qualcomm, but for MTK, it seems fine. when you do tlbi on a part of PTEs
> > with dropped CONT but still some other PTEs have CONT, we make hardware totally
> > confused.
>
> I suspect this is because in your case you are "misprogramming" the contpte
> block; there are *valid* pte entries within the block that disagree about the
> contiguous bit or about various other fields. In this case some HW TLB designs
> can do weird things. I suspect in your case, that's resulting in accessing bad
> memory space and causing an SError, which is trapped by EL3, and the FW is
> probably just panicking at that point.

You are probably right. As we have met the SError, we became very, very
cautious, so any time we flush the tlb for a CONTPTE, we strictly do it by:
1. set all 16 ptes to zero
2. flush the whole 16 ptes

In your case, it can be:
1. set pte0 to zero
2. flush pte0

TBH, I have never tried this, but it might be safe according to your
description.

>
> >
> > zap_pte_range() has a force_flush when tlbbatch is full:
> >
> >                         if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> >                                 force_flush = 1;
> >                                 addr += PAGE_SIZE;
> >                                 break;
> >                         }
> >
> > this means you can expose partial tlbi/flush directly to hardware while some
> > other PTEs are still CONT.
>
> Yes, but that's also possible even if we have a tight loop that clears down the
> contpte block; there could still be another core that issues a tlbi while you're
> halfway through that loop, or the HW could happen to evict due to TLB pressure
> at any time. The point is, it's safe if you are clearing the pte to an *invalid*
> entry.
>
> >
> > on the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
> > on fullmm, as long as zap range covers a large folio, we can flush tlbi for
> > those CONTPTEs all together in your contpte_ptep_get_and_clear_full() rather
> > than clearing one PTE.
> >
> > Our approach in [1] is we do a flush for all CONTPTEs and go directly to the end
> > of the large folio:
> >
> > #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >                       if (pte_cont(ptent)) {
> >                               unsigned long next = pte_cont_addr_end(addr, end);
> >
> >                               if (next - addr != HPAGE_CONT_PTE_SIZE) {
> >                                       __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
> >                                       /*
> >                                        * After splitting cont-pte
> >                                        * we need to process pte again.
> >                                        */
> >                                       goto again_pte;
> >                               } else {
> >                                       cont_pte_huge_ptep_get_and_clear(mm, addr, pte);
> >
> >                                       tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
> >                                       if (unlikely(!page))
> >                                               continue;
> >
> >                                       if (is_huge_zero_page(page)) {
> >                                               tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> >                                               goto cont_next;
> >                                       }
> >
> >                                       rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
> >                                       page_remove_rmap(page, true);
> >                                       if (unlikely(page_mapcount(page) < 0))
> >                                               print_bad_pte(vma, addr, ptent, page);
> >
> >                                       tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> >                               }
> > cont_next:
> >                               /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
> >                               pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> >                               addr = next - PAGE_SIZE;
> >                               continue;
> >                       }
> > #endif
> >
> > this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
> > it never requires tlb->fullmm at all.
>
> Yes, but you are benefitting from the fact that contpte is exposed to core-mm
> and it is special-casing them at this level. I'm trying to avoid that.

I am thinking we can even do this while we don't expose CONTPTE.
If zap_pte_range() meets a large folio and the zap range covers the whole
folio, we can flush all the ptes in this folio and jump to the end of the folio.
I mean something like:

if (folio head && range_end > folio_end) {
         nr = folio_nr_pages(folio);
         full_flush_nr_ptes();
         pte += nr - 1;
         addr += (nr - 1) * PAGE_SIZE;
}

zap_pte_range() is the most frequent behaviour for the userspace libc heap,
as I explained before; libc calls madvise(DONTNEED) very often, so it is
crucial to performance.

And this way can also help drop your 'full' version by moving to flushing
whole large folios, so we don't need to depend on fullmm any more?
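
For arm64 that could boil down to a small helper, something like this rough,
untested sketch (the name is made up; it accumulates access/dirty for the
caller and does a single range flush instead of one tlbi per pte):

static pte_t clear_flush_folio_ptes(struct vm_area_struct *vma,
				    unsigned long addr, pte_t *ptep,
				    unsigned int nr)
{
	struct mm_struct *mm = vma->vm_mm;
	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
	unsigned int i;

	/* Clear every pte mapping the folio, accumulating access/dirty
	 * so the caller can transfer them to the folio. */
	for (i = 1; i < nr; i++) {
		pte_t pte = __ptep_get_and_clear(mm, addr + i * PAGE_SIZE,
						 ptep + i);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	/* One flush for the whole folio range. */
	flush_tlb_range(vma, addr, addr + nr * PAGE_SIZE);

	return orig_pte;
}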

>
> I don't think there is any correctness issue here. But there is a problem with
> fragility, as raised by Alistair. I have some ideas on potentially how to solve
> that. I'm going to try to work on it this afternoon and will post if I get some
> confidence that it is a real solution.
>
> Thanks,
> Ryan
>
> >
> > static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
> >                                      unsigned long addr,
> >                                      pte_t *ptep,
> >                                      bool flush)
> > {
> >       pte_t orig_pte = ptep_get(ptep);
> >
> >       CHP_BUG_ON(!pte_cont(orig_pte));
> >       CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
> >       CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
> >
> >       return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
> > }
> >
> > [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
> >
> >> +     */
> >> +
> >> +    return __ptep_get_and_clear(mm, addr, ptep);
> >> +}
> >> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> >> +
> >

Thanks
Barry

>
Ryan Roberts Nov. 29, 2023, 12:43 p.m. UTC | #15
On 28/11/2023 20:23, Barry Song wrote:
> On Wed, Nov 29, 2023 at 12:49 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 28/11/2023 08:17, Barry Song wrote:
>>>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>>> +                                    unsigned long addr, pte_t *ptep)
>>>> +{
>>>> +    /*
>>>> +     * When doing a full address space teardown, we can avoid unfolding the
>>>> +     * contiguous range, and therefore avoid the associated tlbi. Instead,
>>>> +     * just get and clear the pte. The caller is promising to call us for
>>>> +     * every pte, so every pte in the range will be cleared by the time the
>>>> +     * tlbi is issued.
>>>> +     *
>>>> +     * This approach is not perfect though, as for the duration between
>>>> +     * returning from the first call to ptep_get_and_clear_full() and making
>>>> +     * the final call, the contpte block in an intermediate state, where
>>>> +     * some ptes are cleared and others are still set with the PTE_CONT bit.
>>>> +     * If any other APIs are called for the ptes in the contpte block during
>>>> +     * that time, we have to be very careful. The core code currently
>>>> +     * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
>>>> +     * ptep_get() must be careful to ignore the cleared entries when
>>>> +     * accumulating the access and dirty bits - the same goes for
>>>> +     * ptep_get_lockless(). The only other calls we might resonably expect
>>>> +     * are to set markers in the previously cleared ptes. (We shouldn't see
>>>> +     * valid entries being set until after the tlbi, at which point we are
>>>> +     * no longer in the intermediate state). Since markers are not valid,
>>>> +     * this is safe; set_ptes() will see the old, invalid entry and will not
>>>> +     * attempt to unfold. And the new pte is also invalid so it won't
>>>> +     * attempt to fold. We shouldn't see this for the 'full' case anyway.
>>>> +     *
>>>> +     * The last remaining issue is returning the access/dirty bits. That
>>>> +     * info could be present in any of the ptes in the contpte block.
>>>> +     * ptep_get() will gather those bits from across the contpte block. We
>>>> +     * don't bother doing that here, because we know that the information is
>>>> +     * used by the core-mm to mark the underlying folio as accessed/dirty.
>>>> +     * And since the same folio must be underpinning the whole block (that
>>>> +     * was a requirement for folding in the first place), that information
>>>> +     * will make it to the folio eventually once all the ptes have been
>>>> +     * cleared. This approach means we don't have to play games with
>>>> +     * accumulating and storing the bits. It does mean that any interleaved
>>>> +     * calls to ptep_get() may lack correct access/dirty information if we
>>>> +     * have already cleared the pte that happened to store it. The core code
>>>> +     * does not rely on this though.
>>>
>>> even without any other threads running and touching those PTEs, this won't survive
>>> on some hardware. we expose inconsistent CONTPTEs to hardware, this might result
>>
>> No that's not the case; if you read the Arm ARM, the page table is only
>> considered "misgrogrammed" when *valid* entries within the same contpte block
>> have different values for the contiguous bit. We are clearing the ptes to zero
>> here, which is an *invalid* entry. So if the TLB entry somehow gets invalidated
>> (either due to explicit tlbi as you point out below, or due to a concurrent TLB
>> miss which selects our entry for removal to make space for the new incomming
>> entry), then it gets an access request for an address in our partially cleared
>> contpte block the address will either be:
>>
>> A) an address for a pte entry we have already cleared, so its invalid and it
>> will fault (and get serialized behind the PTL).
>>
>> or
>>
>> B) an address for a pte entry we haven't yet cleared, so it will reform a TLB
>> entry for the contpte block. But that's ok because the memory still exists
>> because we haven't yet finished clearing the page table and have not yet issued
>> the final tlbi.
>>
>>
>>> in crashed firmware even in trustzone, strange&unknown faults to trustzone we have
>>> seen on Qualcomm, but for MTK, it seems fine. when you do tlbi on a part of PTEs
>>> with dropped CONT but still some other PTEs have CONT, we make hardware totally
>>> confused.
>>
>> I suspect this is because in your case you are "misprogramming" the contpte
>> block; there are *valid* pte entries within the block that disagree about the
>> contiguous bit or about various other fields. In this case some HW TLB designs
>> can do weird things. I suspect in your case, that's resulting in accessing bad
>> memory space and causing an SError, which is trapped by EL3, and the FW is
>> probably just panicking at that point.
> 
> you are probably right. as we met the SError, we became very very
> cautious. so anytime
> when we flush tlb for a CONTPTE, we strictly do it by
> 1. set all 16 ptes to zero
> 2. flush the whole 16 ptes

But my point is that this sequence doesn't guarantee that the TLB doesn't read
the page table half way through the SW clearing the 16 entries; a TLB entry can
be ejected for other reasons than just issuing a TLBI. So in that case these 2
flows can be equivalent. It's the fact that we are unsetting the valid bit when
clearing each pte that guarantees this to be safe.

> 
> in your case, it can be:
> 1. set pte0 to zero
> 2. flush pte0
> 
> TBH, I have never tried this, but it might be safe according to your
> description.
> 
>>
>>>
>>> zap_pte_range() has a force_flush when tlbbatch is full:
>>>
>>>                         if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
>>>                                 force_flush = 1;
>>>                                 addr += PAGE_SIZE;
>>>                                 break;
>>>                         }
>>>
>>> this means you can expose partial tlbi/flush directly to hardware while some
>>> other PTEs are still CONT.
>>
>> Yes, but that's also possible even if we have a tight loop that clears down the
>> contpte block; there could still be another core that issues a tlbi while you're
>> halfway through that loop, or the HW could happen to evict due to TLB pressure
>> at any time. The point is, it's safe if you are clearing the pte to an *invalid*
>> entry.
>>
>>>
>>> on the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
>>> on fullmm, as long as zap range covers a large folio, we can flush tlbi for
>>> those CONTPTEs all together in your contpte_ptep_get_and_clear_full() rather
>>> than clearing one PTE.
>>>
>>> Our approach in [1] is we do a flush for all CONTPTEs and go directly to the end
>>> of the large folio:
>>>
>>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
>>>                       if (pte_cont(ptent)) {
>>>                               unsigned long next = pte_cont_addr_end(addr, end);
>>>
>>>                               if (next - addr != HPAGE_CONT_PTE_SIZE) {
>>>                                       __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
>>>                                       /*
>>>                                        * After splitting cont-pte
>>>                                        * we need to process pte again.
>>>                                        */
>>>                                       goto again_pte;
>>>                               } else {
>>>                                       cont_pte_huge_ptep_get_and_clear(mm, addr, pte);
>>>
>>>                                       tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
>>>                                       if (unlikely(!page))
>>>                                               continue;
>>>
>>>                                       if (is_huge_zero_page(page)) {
>>>                                               tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
>>>                                               goto cont_next;
>>>                                       }
>>>
>>>                                       rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
>>>                                       page_remove_rmap(page, true);
>>>                                       if (unlikely(page_mapcount(page) < 0))
>>>                                               print_bad_pte(vma, addr, ptent, page);
>>>
>>>                                       tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
>>>                               }
>>> cont_next:
>>>                               /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
>>>                               pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
>>>                               addr = next - PAGE_SIZE;
>>>                               continue;
>>>                       }
>>> #endif
>>>
>>> this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
>>> it never requires tlb->fullmm at all.
>>
>> Yes, but you are benefitting from the fact that contpte is exposed to core-mm
>> and it is special-casing them at this level. I'm trying to avoid that.
> 
> I am thinking we can even do this while we don't expose CONTPTE.
> if zap_pte_range meets a large folio and the zap_range covers the whole
> folio, we can flush all ptes in this folio and jump to the end of this folio?
> i mean
> 
> if (folio head && range_end > folio_end) {
>          nr = folio_nr_page(folio);
>          full_flush_nr_ptes()
>          pte += nr -1;
>          addr += (nr - 1) * basepage size
> }

Just because you found a pte that maps a page from a large folio, that doesn't
mean that all pages from the folio are mapped, and it doesn't mean they are
mapped contiguously. We have to deal with partial munmap(), partial mremap()
etc. We could split in these cases (and in future it might be sensible to try),
but that can fail (due to GUP). So we still have to handle the corner case.

But I can imagine doing a batched version of ptep_get_and_clear(), like I did
for ptep_set_wrprotects(). And I think this would be an improvement.

The reason I haven't done that so far, is because ptep_get_and_clear() returns
the pte value when it was cleared and that's hard to do if batching due to the
storage requirement. But perhaps you could just return the logical OR of the
dirty and young bits across all ptes in the batch. The caller should be able to
reconstitute the rest if it needs it?

What do you think?
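
For illustration, a minimal sketch of what such a batched variant could look
like (hypothetical name clear_ptes(), built only on the existing
__ptep_get_and_clear(); this is just the shape of the idea, not part of the
series):

/*
 * Hypothetical sketch: clear 'nr' consecutive ptes and return a template
 * pte carrying the logical OR of the dirty/young bits seen across the
 * batch. Assumes all 'nr' ptes map the same folio.
 */
static inline pte_t clear_ptes(struct mm_struct *mm, unsigned long addr,
			       pte_t *ptep, unsigned int nr)
{
	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
	unsigned int i;

	for (i = 1; i < nr; i++) {
		pte_t pte = __ptep_get_and_clear(mm, addr + i * PAGE_SIZE,
						 ptep + i);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}

The caller would then apply the accumulated bits to the folio once, instead
of once per pte.
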

> 
> zap_pte_range is the most frequent behaviour from userspace libc heap
> as i explained
> before. libc can call madvise(DONTNEED) the most often. It is crucial
> to performance.
> 
> and this way can also help drop your full version by moving to full
> flushing the whole
> large folios? and we don't need to depend on fullmm any more?
> 
>>
>> I don't think there is any correctness issue here. But there is a problem with
>> fragility, as raised by Alistair. I have some ideas on potentially how to solve
>> that. I'm going to try to work on it this afternoon and will post if I get some
>> confidence that it is a real solution.
>>
>> Thanks,
>> Ryan
>>
>>>
>>> static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
>>>                                      unsigned long addr,
>>>                                      pte_t *ptep,
>>>                                      bool flush)
>>> {
>>>       pte_t orig_pte = ptep_get(ptep);
>>>
>>>       CHP_BUG_ON(!pte_cont(orig_pte));
>>>       CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
>>>       CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
>>>
>>>       return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
>>> }
>>>
>>> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
>>>
>>>> +     */
>>>> +
>>>> +    return __ptep_get_and_clear(mm, addr, ptep);
>>>> +}
>>>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
>>>> +
>>>
> 
> Thanks
> Barry
> 
>>
Barry Song Nov. 29, 2023, 1 p.m. UTC | #16
On Thu, Nov 30, 2023 at 1:43 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/11/2023 20:23, Barry Song wrote:
> > On Wed, Nov 29, 2023 at 12:49 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 28/11/2023 08:17, Barry Song wrote:
> >>>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> >>>> +                                    unsigned long addr, pte_t *ptep)
> >>>> +{
> >>>> +    /*
> >>>> +     * When doing a full address space teardown, we can avoid unfolding the
> >>>> +     * contiguous range, and therefore avoid the associated tlbi. Instead,
> >>>> +     * just get and clear the pte. The caller is promising to call us for
> >>>> +     * every pte, so every pte in the range will be cleared by the time the
> >>>> +     * tlbi is issued.
> >>>> +     *
> >>>> +     * This approach is not perfect though, as for the duration between
> >>>> +     * returning from the first call to ptep_get_and_clear_full() and making
> >>>> +     * the final call, the contpte block is in an intermediate state, where
> >>>> +     * some ptes are cleared and others are still set with the PTE_CONT bit.
> >>>> +     * If any other APIs are called for the ptes in the contpte block during
> >>>> +     * that time, we have to be very careful. The core code currently
> >>>> +     * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> >>>> +     * ptep_get() must be careful to ignore the cleared entries when
> >>>> +     * accumulating the access and dirty bits - the same goes for
> >>>> +     * ptep_get_lockless(). The only other calls we might reasonably expect
> >>>> +     * are to set markers in the previously cleared ptes. (We shouldn't see
> >>>> +     * valid entries being set until after the tlbi, at which point we are
> >>>> +     * no longer in the intermediate state). Since markers are not valid,
> >>>> +     * this is safe; set_ptes() will see the old, invalid entry and will not
> >>>> +     * attempt to unfold. And the new pte is also invalid so it won't
> >>>> +     * attempt to fold. We shouldn't see this for the 'full' case anyway.
> >>>> +     *
> >>>> +     * The last remaining issue is returning the access/dirty bits. That
> >>>> +     * info could be present in any of the ptes in the contpte block.
> >>>> +     * ptep_get() will gather those bits from across the contpte block. We
> >>>> +     * don't bother doing that here, because we know that the information is
> >>>> +     * used by the core-mm to mark the underlying folio as accessed/dirty.
> >>>> +     * And since the same folio must be underpinning the whole block (that
> >>>> +     * was a requirement for folding in the first place), that information
> >>>> +     * will make it to the folio eventually once all the ptes have been
> >>>> +     * cleared. This approach means we don't have to play games with
> >>>> +     * accumulating and storing the bits. It does mean that any interleaved
> >>>> +     * calls to ptep_get() may lack correct access/dirty information if we
> >>>> +     * have already cleared the pte that happened to store it. The core code
> >>>> +     * does not rely on this though.
> >>>
> >>> even without any other threads running and touching those PTEs, this won't survive
> >>> on some hardware. we expose inconsistent CONTPTEs to hardware, this might result
> >>
> >> No that's not the case; if you read the Arm ARM, the page table is only
> >> considered "misprogrammed" when *valid* entries within the same contpte block
> >> have different values for the contiguous bit. We are clearing the ptes to zero
> >> here, which is an *invalid* entry. So if the TLB entry somehow gets invalidated
> >> (either due to explicit tlbi as you point out below, or due to a concurrent TLB
> >> miss which selects our entry for removal to make space for the new incoming
> >> entry), then when it gets an access request for an address in our partially
> >> cleared contpte block, the address will either be:
> >>
> >> A) an address for a pte entry we have already cleared, so it's invalid and it
> >> will fault (and get serialized behind the PTL).
> >>
> >> or
> >>
> >> B) an address for a pte entry we haven't yet cleared, so it will reform a TLB
> >> entry for the contpte block. But that's ok because the memory still exists
> >> because we haven't yet finished clearing the page table and have not yet issued
> >> the final tlbi.
> >>
> >>
> >>> in crashed firmware even in trustzone, strange&unknown faults to trustzone we have
> >>> seen on Qualcomm, but for MTK, it seems fine. when you do tlbi on a part of PTEs
> >>> with dropped CONT but still some other PTEs have CONT, we make hardware totally
> >>> confused.
> >>
> >> I suspect this is because in your case you are "misprogramming" the contpte
> >> block; there are *valid* pte entries within the block that disagree about the
> >> contiguous bit or about various other fields. In this case some HW TLB designs
> >> can do weird things. I suspect in your case, that's resulting in accessing bad
> >> memory space and causing an SError, which is trapped by EL3, and the FW is
> >> probably just panicking at that point.
> >
> > you are probably right. as we met the SError, we became very very
> > cautious. so anytime
> > when we flush tlb for a CONTPTE, we strictly do it by
> > 1. set all 16 ptes to zero
> > 2. flush the whole 16 ptes
>
> But my point is that this sequence doesn't guarantee that the TLB doesn't read
> the page table half way through the SW clearing the 16 entries; a TLB entry can
> be ejected for other reasons than just issuing a TLBI. So in that case these 2
> flows can be equivalent. It's the fact that we are unsetting the valid bit when
> clearing each pte that guarantees this to be safe.
>
> >
> > in your case, it can be:
> > 1. set pte0 to zero
> > 2. flush pte0
> >
> > TBH, I have never tried this, but it might be safe according to your
> > description.
> >
> >>
> >>>
> >>> zap_pte_range() has a force_flush when tlbbatch is full:
> >>>
> >>>                         if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> >>>                                 force_flush = 1;
> >>>                                 addr += PAGE_SIZE;
> >>>                                 break;
> >>>                         }
> >>>
> >>> this means you can expose partial tlbi/flush directly to hardware while some
> >>> other PTEs are still CONT.
> >>
> >> Yes, but that's also possible even if we have a tight loop that clears down the
> >> contpte block; there could still be another core that issues a tlbi while you're
> >> halfway through that loop, or the HW could happen to evict due to TLB pressure
> >> at any time. The point is, it's safe if you are clearing the pte to an *invalid*
> >> entry.
> >>
> >>>
> >>> on the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
> >>> on fullmm, as long as zap range covers a large folio, we can flush tlbi for
> >>> those CONTPTEs all together in your contpte_ptep_get_and_clear_full() rather
> >>> than clearing one PTE.
> >>>
> >>> Our approach in [1] is we do a flush for all CONTPTEs and go directly to the end
> >>> of the large folio:
> >>>
> >>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >>>                       if (pte_cont(ptent)) {
> >>>                               unsigned long next = pte_cont_addr_end(addr, end);
> >>>
> >>>                               if (next - addr != HPAGE_CONT_PTE_SIZE) {
> >>>                                       __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
> >>>                                       /*
> >>>                                        * After splitting cont-pte
> >>>                                        * we need to process pte again.
> >>>                                        */
> >>>                                       goto again_pte;
> >>>                               } else {
> >>>                                       cont_pte_huge_ptep_get_and_clear(mm, addr, pte);
> >>>
> >>>                                       tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
> >>>                                       if (unlikely(!page))
> >>>                                               continue;
> >>>
> >>>                                       if (is_huge_zero_page(page)) {
> >>>                                               tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> >>>                                               goto cont_next;
> >>>                                       }
> >>>
> >>>                                       rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
> >>>                                       page_remove_rmap(page, true);
> >>>                                       if (unlikely(page_mapcount(page) < 0))
> >>>                                               print_bad_pte(vma, addr, ptent, page);
> >>>
> >>>                                       tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> >>>                               }
> >>> cont_next:
> >>>                               /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
> >>>                               pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> >>>                               addr = next - PAGE_SIZE;
> >>>                               continue;
> >>>                       }
> >>> #endif
> >>>
> >>> this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
> >>> it never requires tlb->fullmm at all.
> >>
> >> Yes, but you are benefitting from the fact that contpte is exposed to core-mm
> >> and it is special-casing them at this level. I'm trying to avoid that.
> >
> > I am thinking we can even do this while we don't expose CONTPTE.
> > if zap_pte_range meets a large folio and the zap_range covers the whole
> > folio, we can flush all ptes in this folio and jump to the end of this folio?
> > i mean
> >
> > if (folio head && range_end > folio_end) {
> >          nr = folio_nr_page(folio);
> >          full_flush_nr_ptes()
> >          pte += nr -1;
> >          addr += (nr - 1) * basepage size
> > }
>
> Just because you found a pte that maps a page from a large folio, that doesn't
> mean that all pages from the folio are mapped, and it doesn't mean they are
> mapped contiguously. We have to deal with partial munmap(), partial mremap()
> etc. We could split in these cases (and in future it might be sensible to try),
> but that can fail (due to GUP). So we still have to handle the corner case.

Maybe we can check the ptes and, if they are all mapped (pte_present),
then do a flush_range. This is actually the most common case; the
majority of madv(DONTNEED) calls will benefit from it. If the condition
is false, we fall back to your current code. zap_pte_range() always
sets present ptes to 0, but swap entries are really quite different.
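
Roughly, that check could be as simple as the sketch below (hypothetical
helper name; it assumes the caller holds the PTL and that the nr ptes
starting at ptep are the ones mapping the folio):

/* Hypothetical sketch: true only if all nr ptes in the batch are present. */
static bool folio_ptes_all_present(pte_t *ptep, unsigned int nr)
{
	unsigned int i;

	for (i = 0; i < nr; i++) {
		if (!pte_present(ptep_get(ptep + i)))
			return false;
	}

	return true;
}

If it returns true, zap_pte_range() could batch-clear and range-flush;
otherwise it falls back to the existing per-pte path.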

>
> But I can imagine doing a batched version of ptep_get_and_clear(), like I did
> for ptep_set_wrprotects(). And I think this would be an improvement.
>
> The reason I haven't done that so far, is because ptep_get_and_clear() returns
> the pte value when it was cleared and that's hard to do if batching due to the
> storage requirement. But perhaps you could just return the logical OR of the
> dirty and young bits across all ptes in the batch. The caller should be able to
> reconstitute the rest if it needs it?
>
> What do you think?
>
> >
> > zap_pte_range is the most frequent behaviour from userspace libc heap
> > as i explained
> > before. libc can call madvise(DONTNEED) the most often. It is crucial
> > to performance.
> >
> > and this way can also help drop your full version by moving to full
> > flushing the whole
> > large folios? and we don't need to depend on fullmm any more?
> >
> >>
> >> I don't think there is any correctness issue here. But there is a problem with
> >> fragility, as raised by Alistair. I have some ideas on potentially how to solve
> >> that. I'm going to try to work on it this afternoon and will post if I get some
> >> confidence that it is a real solution.
> >>
> >> Thanks,
> >> Ryan
> >>
> >>>
> >>> static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
> >>>                                      unsigned long addr,
> >>>                                      pte_t *ptep,
> >>>                                      bool flush)
> >>> {
> >>>       pte_t orig_pte = ptep_get(ptep);
> >>>
> >>>       CHP_BUG_ON(!pte_cont(orig_pte));
> >>>       CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
> >>>       CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
> >>>
> >>>       return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
> >>> }
> >>>
> >>> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
> >>>
> >>>> +     */
> >>>> +
> >>>> +    return __ptep_get_and_clear(mm, addr, ptep);
> >>>> +}
> >>>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> >>>> +
> >>>
> >
Thanks
Barry
Alistair Popple Nov. 30, 2023, 5:07 a.m. UTC | #17
Ryan Roberts <ryan.roberts@arm.com> writes:

>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>> have modified the bits during those loops. I think it would be possible to fix
>>>> the race, but intuition says it would be expensive.
>>>
>>> So the issue as I understand it is subsequent iterations would see a
>>> clean PTE after the first iteration returned a dirty PTE. In
>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>> all the PTEs?
>> 
>> The raciness I'm describing is the race between reading access/dirty from one
>> pte and applying it to another. But yes I like your suggestion. if we do:
>> 
>> pte = __ptep_get_and_clear_full(ptep)
>> 
>> on the target pte, then we have grabbed access/dirty from it in a race-free
>> manner. we can then loop from current pte up towards the top of the block until
>> we find a valid entry (and I guess wrap at the top to make us robust against
>> future callers clearing in an arbitrary order). Then atomically accumulate the
>> access/dirty bits we have just saved into that new entry. I guess that's just a
>> cmpxchg loop - there are already examples of how to do that correctly when
>> racing the TLB.
>> 
>> For most entries, we will just be copying up to the next pte. For the last pte,
>> we would end up reading all ptes and determine we are the last one.
>> 
>> What do you think?
>
> OK here is an attempt at something which solves the fragility. I think this is
> now robust and will always return the correct access/dirty state from
> ptep_get_and_clear_full() and ptep_get().
>
> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
> bits from across the contpte block - which requires reading each pte in the
> contpte block. So it's O(n^2) in that sense. I'll benchmark it and report back.
>
> Was this the type of thing you were thinking of, Alistair?

Yes, that is along the lines of what I was thinking. However I have
added a couple of comments inline. 

> --8<--
>  arch/arm64/include/asm/pgtable.h | 23 ++++++++-
>  arch/arm64/mm/contpte.c          | 81 ++++++++++++++++++++++++++++++++
>  arch/arm64/mm/fault.c            | 38 +++++++++------
>  3 files changed, 125 insertions(+), 17 deletions(-)
>
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 9bd2f57a9e11..6c295d277784 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>  	return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
>  }
>
> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
>  extern int __ptep_set_access_flags(struct vm_area_struct *vma,
>  				 unsigned long address, pte_t *ptep,
>  				 pte_t entry, int dirty);
> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>  extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>  extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>  				pte_t *ptep, pte_t pte, unsigned int nr);
> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>  				unsigned long addr, pte_t *ptep);
>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
>  	__pte_clear(mm, addr, ptep);
>  }
>
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep, int full)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (!pte_valid_cont(orig_pte))
> +		return __ptep_get_and_clear(mm, addr, ptep);
> +
> +	if (!full) {
> +		contpte_try_unfold(mm, addr, ptep, orig_pte);
> +		return __ptep_get_and_clear(mm, addr, ptep);
> +	}
> +
> +	return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> +}
> +
>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>  				unsigned long addr, pte_t *ptep)
>  {
> -	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> -	return __ptep_get_and_clear(mm, addr, ptep);
> +	return ptep_get_and_clear_full(mm, addr, ptep, 0);
>  }
>
>  #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index 2a57df16bf58..99b211118d93 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>  	for (i = 0; i < CONT_PTES; i++, ptep++) {
>  		pte = __ptep_get(ptep);
>
> +		/*
> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
> +		 * where some of the ptes in the range may be cleared but others
> +		 * are still to do. See contpte_ptep_get_and_clear_full().
> +		 */
> +		if (!pte_valid(pte))
> +			continue;
> +
>  		if (pte_dirty(pte))
>  			orig_pte = pte_mkdirty(orig_pte);
>
> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>  }
>  EXPORT_SYMBOL(contpte_set_ptes);
>
> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	/*
> +	 * When doing a full address space teardown, we can avoid unfolding the
> +	 * contiguous range, and therefore avoid the associated tlbi. Instead,
> +	 * just get and clear the pte. The caller is promising to call us for
> +	 * every pte, so every pte in the range will be cleared by the time the
> +	 * final tlbi is issued.
> +	 *
> +	 * This approach requires some complex hoop jumping though, as for the
> +	 * duration between returning from the first call to
> +	 * ptep_get_and_clear_full() and making the final call, the contpte
> +	 * block is in an intermediate state, where some ptes are cleared and
> +	 * others are still set with the PTE_CONT bit. If any other APIs are
> +	 * called for the ptes in the contpte block during that time, we have to
> +	 * be very careful. The core code currently interleaves calls to
> +	 * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
> +	 * careful to ignore the cleared entries when accumulating the access
> +	 * and dirty bits - the same goes for ptep_get_lockless(). The only
> +	 * other calls we might reasonably expect are to set markers in the
> +	 * previously cleared ptes. (We shouldn't see valid entries being set
> +	 * until after the tlbi, at which point we are no longer in the
> +	 * intermediate state). Since markers are not valid, this is safe;
> +	 * set_ptes() will see the old, invalid entry and will not attempt to
> +	 * unfold. And the new pte is also invalid so it won't attempt to fold.
> +	 * We shouldn't see pte markers being set for the 'full' case anyway
> +	 * since the address space is being torn down.
> +	 *
> +	 * The last remaining issue is returning the access/dirty bits. That
> +	 * info could be present in any of the ptes in the contpte block.
> +	 * ptep_get() will gather those bits from across the contpte block (for
> +	 * the remaining valid entries). So below, if the pte we are clearing
> +	 * has dirty or young set, we need to stash it into a pte that we are
> +	 * yet to clear. This allows future calls to return the correct state
> +	 * even when the info was stored in a different pte. Since the core-mm
> +	 * calls from low to high address, we prefer to stash in the last pte of
> +	 * the contpte block - this means we are not "dragging" the bits up
> +	 * through all ptes and increases the chances that we can exit early
> +	 * because a given pte will have neither dirty or young set.
> +	 */
> +
> +	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
> +	bool dirty = pte_dirty(orig_pte);
> +	bool young = pte_young(orig_pte);
> +	pte_t *start;
> +
> +	if (!dirty && !young)
> +		return contpte_ptep_get(ptep, orig_pte);

I don't think we need to do this. If the PTE is !dirty && !young we can
just return it. As you say we have to assume HW can set those flags at
any time anyway so it doesn't get us much. This means in the common case
we should only run through the loop setting the dirty/young flags once
which should allay the performance concerns.

However I am now wondering if we're doing the wrong thing trying to hide
this down in the arch layer anyway. Perhaps it would be better to deal
with this in the core-mm code after all.

So how about having ptep_get_and_clear_full() clearing the PTEs for the
entire cont block? We know by definition all PTEs should be pointing to
the same folio anyway, and it seems at least zap_pte_range() would cope
with this just fine because subsequent iterations would just see
pte_none() and continue the loop. I haven't checked the other call sites
though, but in principle I don't see why we couldn't define
ptep_get_and_clear_full() as being something that clears all PTEs
mapping a given folio (although it might need renaming).

This does assume you don't need to partially unmap a page in
zap_pte_range (ie. end >= folio), but we're already making that
assumption.
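
For illustration only, a rough sketch of that alternative (hypothetical
function name; it reuses contpte_align_down(), CONT_PTES and
__ptep_get_and_clear() from the series, and assumes the whole block maps a
single folio that is being torn down):

static pte_t contpte_clear_full_block(struct mm_struct *mm,
				      unsigned long addr, pte_t *ptep)
{
	pte_t *start = contpte_align_down(ptep);
	unsigned long start_addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
	pte_t orig_pte = __ptep_get(ptep);
	int i;

	/* Clear the whole block; later iterations just see pte_none(). */
	for (i = 0; i < CONT_PTES; i++) {
		pte_t pte = __ptep_get_and_clear(mm, start_addr + i * PAGE_SIZE,
						 start + i);

		/* Fold any HW-set bits into the value handed back. */
		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}

Whether every caller of ptep_get_and_clear_full() copes with the extra ptes
suddenly reading as pte_none() is exactly the open question above.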

> +
> +	start = contpte_align_down(ptep);
> +	ptep = start + CONT_PTES - 1;
> +
> +	for (; ptep >= start; ptep--) {
> +		pte_t pte = __ptep_get(ptep);
> +
> +		if (!pte_valid(pte))
> +			continue;
> +
> +		if (dirty)
> +			pte = pte_mkdirty(pte);
> +
> +		if (young)
> +			pte = pte_mkyoung(pte);
> +
> +		__ptep_set_access_flags_notlbi(ptep, pte);
> +		return contpte_ptep_get(ptep, orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> +
>  int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>  					unsigned long addr, pte_t *ptep)
>  {
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index d63f3a0a7251..b22216a8153c 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
>  	pr_cont("\n");
>  }
>
> -/*
> - * This function sets the access flags (dirty, accessed), as well as write
> - * permission, and only to a more permissive setting.
> - *
> - * It needs to cope with hardware update of the accessed/dirty state by other
> - * agents in the system and can safely skip the __sync_icache_dcache() call as,
> - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> - *
> - * Returns whether or not the PTE actually changed.
> - */
> -int __ptep_set_access_flags(struct vm_area_struct *vma,
> -			    unsigned long address, pte_t *ptep,
> -			    pte_t entry, int dirty)
> +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
>  {
>  	pteval_t old_pteval, pteval;
>  	pte_t pte = __ptep_get(ptep);
> @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
>  		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
>  	} while (pteval != old_pteval);
>
> +	return 1;
> +}
> +
> +/*
> + * This function sets the access flags (dirty, accessed), as well as write
> + * permission, and only to a more permissive setting.
> + *
> + * It needs to cope with hardware update of the accessed/dirty state by other
> + * agents in the system and can safely skip the __sync_icache_dcache() call as,
> + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> + *
> + * Returns whether or not the PTE actually changed.
> + */
> +int __ptep_set_access_flags(struct vm_area_struct *vma,
> +			    unsigned long address, pte_t *ptep,
> +			    pte_t entry, int dirty)
> +{
> +	int changed = __ptep_set_access_flags_notlbi(ptep, entry);
> +
>  	/* Invalidate a stale read-only entry */
> -	if (dirty)
> +	if (changed && dirty)
>  		flush_tlb_page(vma, address);
> -	return 1;
> +
> +	return changed;
>  }
>
>  static bool is_el1_instruction_abort(unsigned long esr)
> --8<--
Barry Song Nov. 30, 2023, 5:35 a.m. UTC | #18
On Wed, Nov 29, 2023 at 8:43 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 28/11/2023 20:23, Barry Song wrote:
> > On Wed, Nov 29, 2023 at 12:49 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> On 28/11/2023 08:17, Barry Song wrote:
> >>>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> >>>> +                                    unsigned long addr, pte_t *ptep)
> >>>> +{
> >>>> +    /*
> >>>> +     * When doing a full address space teardown, we can avoid unfolding the
> >>>> +     * contiguous range, and therefore avoid the associated tlbi. Instead,
> >>>> +     * just get and clear the pte. The caller is promising to call us for
> >>>> +     * every pte, so every pte in the range will be cleared by the time the
> >>>> +     * tlbi is issued.
> >>>> +     *
> >>>> +     * This approach is not perfect though, as for the duration between
> >>>> +     * returning from the first call to ptep_get_and_clear_full() and making
> >>>> +     * the final call, the contpte block is in an intermediate state, where
> >>>> +     * some ptes are cleared and others are still set with the PTE_CONT bit.
> >>>> +     * If any other APIs are called for the ptes in the contpte block during
> >>>> +     * that time, we have to be very careful. The core code currently
> >>>> +     * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
> >>>> +     * ptep_get() must be careful to ignore the cleared entries when
> >>>> +     * accumulating the access and dirty bits - the same goes for
> >>>> +     * ptep_get_lockless(). The only other calls we might reasonably expect
> >>>> +     * are to set markers in the previously cleared ptes. (We shouldn't see
> >>>> +     * valid entries being set until after the tlbi, at which point we are
> >>>> +     * no longer in the intermediate state). Since markers are not valid,
> >>>> +     * this is safe; set_ptes() will see the old, invalid entry and will not
> >>>> +     * attempt to unfold. And the new pte is also invalid so it won't
> >>>> +     * attempt to fold. We shouldn't see this for the 'full' case anyway.
> >>>> +     *
> >>>> +     * The last remaining issue is returning the access/dirty bits. That
> >>>> +     * info could be present in any of the ptes in the contpte block.
> >>>> +     * ptep_get() will gather those bits from across the contpte block. We
> >>>> +     * don't bother doing that here, because we know that the information is
> >>>> +     * used by the core-mm to mark the underlying folio as accessed/dirty.
> >>>> +     * And since the same folio must be underpinning the whole block (that
> >>>> +     * was a requirement for folding in the first place), that information
> >>>> +     * will make it to the folio eventually once all the ptes have been
> >>>> +     * cleared. This approach means we don't have to play games with
> >>>> +     * accumulating and storing the bits. It does mean that any interleaved
> >>>> +     * calls to ptep_get() may lack correct access/dirty information if we
> >>>> +     * have already cleared the pte that happened to store it. The core code
> >>>> +     * does not rely on this though.
> >>>
> >>> even without any other threads running and touching those PTEs, this won't survive
> >>> on some hardware. we expose inconsistent CONTPTEs to hardware, this might result
> >>
> >> No that's not the case; if you read the Arm ARM, the page table is only
> >> considered "misprogrammed" when *valid* entries within the same contpte block
> >> have different values for the contiguous bit. We are clearing the ptes to zero
> >> here, which is an *invalid* entry. So if the TLB entry somehow gets invalidated
> >> (either due to explicit tlbi as you point out below, or due to a concurrent TLB
> >> miss which selects our entry for removal to make space for the new incoming
> >> entry), then when it gets an access request for an address in our partially
> >> cleared contpte block, the address will either be:
> >>
> >> A) an address for a pte entry we have already cleared, so it's invalid and it
> >> will fault (and get serialized behind the PTL).
> >>
> >> or
> >>
> >> B) an address for a pte entry we haven't yet cleared, so it will reform a TLB
> >> entry for the contpte block. But that's ok because the memory still exists
> >> because we haven't yet finished clearing the page table and have not yet issued
> >> the final tlbi.
> >>
> >>
> >>> in crashed firmware even in trustzone, strange&unknown faults to trustzone we have
> >>> seen on Qualcomm, but for MTK, it seems fine. when you do tlbi on a part of PTEs
> >>> with dropped CONT but still some other PTEs have CONT, we make hardware totally
> >>> confused.
> >>
> >> I suspect this is because in your case you are "misprogramming" the contpte
> >> block; there are *valid* pte entries within the block that disagree about the
> >> contiguous bit or about various other fields. In this case some HW TLB designs
> >> can do weird things. I suspect in your case, that's resulting in accessing bad
> >> memory space and causing an SError, which is trapped by EL3, and the FW is
> >> probably just panicking at that point.
> >
> > you are probably right. as we met the SError, we became very very
> > cautious. so anytime
> > when we flush tlb for a CONTPTE, we strictly do it by
> > 1. set all 16 ptes to zero
> > 2. flush the whole 16 ptes
>
> But my point is that this sequence doesn't guarantee that the TLB doesn't read
> the page table half way through the SW clearing the 16 entries; a TLB entry can
> be ejected for other reasons than just issuing a TLBI. So in that case these 2
> flows can be equivalent. It's the fact that we are unsetting the valid bit when
> clearing each pte that guarantees this to be safe.
>
> >
> > in your case, it can be:
> > 1. set pte0 to zero
> > 2. flush pte0
> >
> > TBH, I have never tried this, but it might be safe according to your
> > description.
> >
> >>
> >>>
> >>> zap_pte_range() has a force_flush when tlbbatch is full:
> >>>
> >>>                         if (unlikely(__tlb_remove_page(tlb, page, delay_rmap))) {
> >>>                                 force_flush = 1;
> >>>                                 addr += PAGE_SIZE;
> >>>                                 break;
> >>>                         }
> >>>
> >>> this means you can expose partial tlbi/flush directly to hardware while some
> >>> other PTEs are still CONT.
> >>
> >> Yes, but that's also possible even if we have a tight loop that clears down the
> >> contpte block; there could still be another core that issues a tlbi while you're
> >> halfway through that loop, or the HW could happen to evict due to TLB pressure
> >> at any time. The point is, it's safe if you are clearing the pte to an *invalid*
> >> entry.
> >>
> >>>
> >>> on the other hand, contpte_ptep_get_and_clear_full() doesn't need to depend
> >>> on fullmm, as long as zap range covers a large folio, we can flush tlbi for
> >>> those CONTPTEs all together in your contpte_ptep_get_and_clear_full() rather
> >>> than clearing one PTE.
> >>>
> >>> Our approach in [1] is we do a flush for all CONTPTEs and go directly to the end
> >>> of the large folio:
> >>>
> >>> #ifdef CONFIG_CONT_PTE_HUGEPAGE
> >>>                       if (pte_cont(ptent)) {
> >>>                               unsigned long next = pte_cont_addr_end(addr, end);
> >>>
> >>>                               if (next - addr != HPAGE_CONT_PTE_SIZE) {
> >>>                                       __split_huge_cont_pte(vma, pte, addr, false, NULL, ptl);
> >>>                                       /*
> >>>                                        * After splitting cont-pte
> >>>                                        * we need to process pte again.
> >>>                                        */
> >>>                                       goto again_pte;
> >>>                               } else {
> >>>                                       cont_pte_huge_ptep_get_and_clear(mm, addr, pte);
> >>>
> >>>                                       tlb_remove_cont_pte_tlb_entry(tlb, pte, addr);
> >>>                                       if (unlikely(!page))
> >>>                                               continue;
> >>>
> >>>                                       if (is_huge_zero_page(page)) {
> >>>                                               tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> >>>                                               goto cont_next;
> >>>                                       }
> >>>
> >>>                                       rss[mm_counter(page)] -= HPAGE_CONT_PTE_NR;
> >>>                                       page_remove_rmap(page, true);
> >>>                                       if (unlikely(page_mapcount(page) < 0))
> >>>                                               print_bad_pte(vma, addr, ptent, page);
> >>>
> >>>                                       tlb_remove_page_size(tlb, page, HPAGE_CONT_PTE_SIZE);
> >>>                               }
> >>> cont_next:
> >>>                               /* "do while()" will do "pte++" and "addr + PAGE_SIZE" */
> >>>                               pte += (next - PAGE_SIZE - (addr & PAGE_MASK))/PAGE_SIZE;
> >>>                               addr = next - PAGE_SIZE;
> >>>                               continue;
> >>>                       }
> >>> #endif
> >>>
> >>> this is our "full" counterpart, which clear_flush CONT_PTES pages directly, and
> >>> it never requires tlb->fullmm at all.
> >>
> >> Yes, but you are benefitting from the fact that contpte is exposed to core-mm
> >> and it is special-casing them at this level. I'm trying to avoid that.
> >
> > I am thinking we can even do this while we don't expose CONTPTE.
> > if zap_pte_range meets a large folio and the zap_range covers the whole
> > folio, we can flush all ptes in this folio and jump to the end of this folio?
> > i mean
> >
> > if (folio head && range_end > folio_end) {
> >          nr = folio_nr_page(folio);
> >          full_flush_nr_ptes()
> >          pte += nr -1;
> >          addr += (nr - 1) * basepage size
> > }
>
> Just because you found a pte that maps a page from a large folio, that doesn't
> mean that all pages from the folio are mapped, and it doesn't mean they are
> mapped contiguously. We have to deal with partial munmap(), partial mremap()
> etc. We could split in these cases (and in future it might be sensible to try),
> but that can fail (due to GUP). So we still have to handle the corner case.
>
> But I can imagine doing a batched version of ptep_get_and_clear(), like I did
> for ptep_set_wrprotects(). And I think this would be an improvement.
>
> The reason I haven't done that so far, is because ptep_get_and_clear() returns
> the pte value when it was cleared and that's hard to do if batching due to the
> storage requirement. But perhaps you could just return the logical OR of the
> dirty and young bits across all ptes in the batch. The caller should be able to
> reconstitute the rest if it needs it?
>
> What do you think?

I really don't know why we care about the return value of ptep_get_and_clear(),
as zap_pte_range() doesn't ask for any return value at all. So why not give up
on this kind of complex logical OR of dirty and young, since they are useless
in this case?

Is it possible for us to introduce a new API like the following?

bool clear_folio_ptes(folio, ptep)
{
    if(ptes are contiguous mapped) {
           clear all ptes all together    // this also clears all CONTPTE
           return true;
    }
    return false;
}

in zap_pte_range():

if (large_folio(folio) && clear_folio_ptes(folio, ptep)) {
	addr += nr - 1;
	pte += nr - 1;
} else {
	/* old path */
}


>
> >
> > zap_pte_range is the most frequent behaviour from userspace libc heap
> > as i explained
> > before. libc can call madvise(DONTNEED) the most often. It is crucial
> > to performance.
> >
> > and this way can also help drop your full version by moving to full
> > flushing the whole
> > large folios? and we don't need to depend on fullmm any more?
> >
> >>
> >> I don't think there is any correctness issue here. But there is a problem with
> >> fragility, as raised by Alistair. I have some ideas on potentially how to solve
> >> that. I'm going to try to work on it this afternoon and will post if I get some
> >> confidence that it is a real solution.
> >>
> >> Thanks,
> >> Ryan
> >>
> >>>
> >>> static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
> >>>                                      unsigned long addr,
> >>>                                      pte_t *ptep,
> >>>                                      bool flush)
> >>> {
> >>>       pte_t orig_pte = ptep_get(ptep);
> >>>
> >>>       CHP_BUG_ON(!pte_cont(orig_pte));
> >>>       CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
> >>>       CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
> >>>
> >>>       return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
> >>> }
> >>>
> >>> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
> >>>
> >>>> +     */
> >>>> +
> >>>> +    return __ptep_get_and_clear(mm, addr, ptep);
> >>>> +}
> >>>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> >>>> +
> >>>
> >
 Thanks
 Barry
Barry Song Nov. 30, 2023, 5:57 a.m. UTC | #19
On Thu, Nov 30, 2023 at 1:08 PM Alistair Popple <apopple@nvidia.com> wrote:
>
>
> Ryan Roberts <ryan.roberts@arm.com> writes:
>
> >>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
> >>>> buggy because it iterated through the PTEs, getting and accumulating. Then
> >>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
> >>>> have modified the bits during those loops. I think it would be possible to fix
> >>>> the race, but intuition says it would be expensive.
> >>>
> >>> So the issue as I understand it is subsequent iterations would see a
> >>> clean PTE after the first iteration returned a dirty PTE. In
> >>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
> >>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
> >>> all the PTEs?
> >>
> >> The raciness I'm describing is the race between reading access/dirty from one
> >> pte and applying it to another. But yes I like your suggestion. if we do:
> >>
> >> pte = __ptep_get_and_clear_full(ptep)
> >>
> >> on the target pte, then we have grabbed access/dirty from it in a race-free
> >> manner. we can then loop from current pte up towards the top of the block until
> >> we find a valid entry (and I guess wrap at the top to make us robust against
> >> future callers clearing in an arbitrary order). Then atomically accumulate the
> >> access/dirty bits we have just saved into that new entry. I guess that's just a
> >> cmpxchg loop - there are already examples of how to do that correctly when
> >> racing the TLB.
> >>
> >> For most entries, we will just be copying up to the next pte. For the last pte,
> >> we would end up reading all ptes and determine we are the last one.
> >>
> >> What do you think?
> >
> > OK here is an attempt at something which solves the fragility. I think this is
> > now robust and will always return the correct access/dirty state from
> > ptep_get_and_clear_full() and ptep_get().
> >
> > But I'm not sure about performance; each call to ptep_get_and_clear_full() for
> > each pte in a contpte block will cause a ptep_get() to gather the access/dirty
> > bits from across the contpte block - which requires reading each pte in the
> > contpte block. So it's O(n^2) in that sense. I'll benchmark it and report back.
> >
> > Was this the type of thing you were thinking of, Alistair?
>
> Yes, that is along the lines of what I was thinking. However I have
> added a couple of comments inline.
>
> > --8<--
> >  arch/arm64/include/asm/pgtable.h | 23 ++++++++-
> >  arch/arm64/mm/contpte.c          | 81 ++++++++++++++++++++++++++++++++
> >  arch/arm64/mm/fault.c            | 38 +++++++++------
> >  3 files changed, 125 insertions(+), 17 deletions(-)
> >
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 9bd2f57a9e11..6c295d277784 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
> >       return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
> >  }
> >
> > +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
> >  extern int __ptep_set_access_flags(struct vm_area_struct *vma,
> >                                unsigned long address, pte_t *ptep,
> >                                pte_t entry, int dirty);
> > @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> >  extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> >  extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> >                               pte_t *ptep, pte_t pte, unsigned int nr);
> > +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> > +                             unsigned long addr, pte_t *ptep);
> >  extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> >                               unsigned long addr, pte_t *ptep);
> >  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> > @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
> >       __pte_clear(mm, addr, ptep);
> >  }
> >
> > +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
> > +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
> > +                             unsigned long addr, pte_t *ptep, int full)
> > +{
> > +     pte_t orig_pte = __ptep_get(ptep);
> > +
> > +     if (!pte_valid_cont(orig_pte))
> > +             return __ptep_get_and_clear(mm, addr, ptep);
> > +
> > +     if (!full) {
> > +             contpte_try_unfold(mm, addr, ptep, orig_pte);
> > +             return __ptep_get_and_clear(mm, addr, ptep);
> > +     }
> > +
> > +     return contpte_ptep_get_and_clear_full(mm, addr, ptep);
> > +}
> > +
> >  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> >  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> >                               unsigned long addr, pte_t *ptep)
> >  {
> > -     contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> > -     return __ptep_get_and_clear(mm, addr, ptep);
> > +     return ptep_get_and_clear_full(mm, addr, ptep, 0);
> >  }
> >
> >  #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> > diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> > index 2a57df16bf58..99b211118d93 100644
> > --- a/arch/arm64/mm/contpte.c
> > +++ b/arch/arm64/mm/contpte.c
> > @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> >       for (i = 0; i < CONT_PTES; i++, ptep++) {
> >               pte = __ptep_get(ptep);
> >
> > +             /*
> > +              * Deal with the partial contpte_ptep_get_and_clear_full() case,
> > +              * where some of the ptes in the range may be cleared but others
> > +              * are still to do. See contpte_ptep_get_and_clear_full().
> > +              */
> > +             if (!pte_valid(pte))
> > +                     continue;
> > +
> >               if (pte_dirty(pte))
> >                       orig_pte = pte_mkdirty(orig_pte);
> >
> > @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> >  }
> >  EXPORT_SYMBOL(contpte_set_ptes);
> >
> > +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
> > +                                     unsigned long addr, pte_t *ptep)
> > +{
> > +     /*
> > +      * When doing a full address space teardown, we can avoid unfolding the
> > +      * contiguous range, and therefore avoid the associated tlbi. Instead,
> > +      * just get and clear the pte. The caller is promising to call us for
> > +      * every pte, so every pte in the range will be cleared by the time the
> > +      * final tlbi is issued.
> > +      *
> > +      * This approach requires some complex hoop jumping though, as for the
> > +      * duration between returning from the first call to
> > +      * ptep_get_and_clear_full() and making the final call, the contpte
> > +      * block is in an intermediate state, where some ptes are cleared and
> > +      * others are still set with the PTE_CONT bit. If any other APIs are
> > +      * called for the ptes in the contpte block during that time, we have to
> > +      * be very careful. The core code currently interleaves calls to
> > +      * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
> > +      * careful to ignore the cleared entries when accumulating the access
> > +      * and dirty bits - the same goes for ptep_get_lockless(). The only
> > +      * other calls we might reasonably expect are to set markers in the
> > +      * previously cleared ptes. (We shouldn't see valid entries being set
> > +      * until after the tlbi, at which point we are no longer in the
> > +      * intermediate state). Since markers are not valid, this is safe;
> > +      * set_ptes() will see the old, invalid entry and will not attempt to
> > +      * unfold. And the new pte is also invalid so it won't attempt to fold.
> > +      * We shouldn't see pte markers being set for the 'full' case anyway
> > +      * since the address space is being torn down.
> > +      *
> > +      * The last remaining issue is returning the access/dirty bits. That
> > +      * info could be present in any of the ptes in the contpte block.
> > +      * ptep_get() will gather those bits from across the contpte block (for
> > +      * the remaining valid entries). So below, if the pte we are clearing
> > +      * has dirty or young set, we need to stash it into a pte that we are
> > +      * yet to clear. This allows future calls to return the correct state
> > +      * even when the info was stored in a different pte. Since the core-mm
> > +      * calls from low to high address, we prefer to stash in the last pte of
> > +      * the contpte block - this means we are not "dragging" the bits up
> > +      * through all ptes and increases the chances that we can exit early
> > +      * because a given pte will have neither dirty or young set.
> > +      */
> > +
> > +     pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
> > +     bool dirty = pte_dirty(orig_pte);
> > +     bool young = pte_young(orig_pte);
> > +     pte_t *start;
> > +
> > +     if (!dirty && !young)
> > +             return contpte_ptep_get(ptep, orig_pte);
>
> I don't think we need to do this. If the PTE is !dirty && !young we can
> just return it. As you say we have to assume HW can set those flags at
> any time anyway so it doesn't get us much. This means in the common case
> we should only run through the loop setting the dirty/young flags once
> which should allay the performance concerns.
>
> However I am now wondering if we're doing the wrong thing trying to hide
> this down in the arch layer anyway. Perhaps it would be better to deal
> with this in the core-mm code after all.
>
> So how about having ptep_get_and_clear_full() clearing the PTEs for the
> entire cont block? We know by definition all PTEs should be pointing to

I truly believe we should clear all PTEs for the entire folio block. However,
if the existing API ptep_get_and_clear_full() always handles a single PTE, we
might keep its behaviour as is. On the other hand, clearing the whole block
isn't only required in the fullmm case; it is also a requirement for normal
zap_pte_range() cases coming from madvise(DONTNEED) etc.

I do think we need a folio-level variant. As we now support pte-level large
folios, we need some new API to handle a folio's PTEs entirely, since we
always need to drop the whole folio rather than one pte at a time when the
folio is compound.
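
As a very rough sketch of the shape such a folio-level variant could take
(hypothetical name; the loop body is just the existing per-pte API, so an
arch like arm64 could override it to clear a whole contpte block in one
pass):

/*
 * Hypothetical generic fallback: clear every pte mapping the folio,
 * assuming ptep/addr refer to the folio's first page and the folio is
 * mapped contiguously in this vma.
 */
static inline void folio_ptep_clear_full(struct mm_struct *mm,
					 unsigned long addr, pte_t *ptep,
					 struct folio *folio)
{
	unsigned long i, nr = folio_nr_pages(folio);

	for (i = 0; i < nr; i++)
		ptep_get_and_clear_full(mm, addr + i * PAGE_SIZE, ptep + i, 1);
}

An arm64 override could then batch the clears and defer the tlbi for the
whole block.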

> the same folio anyway, and it seems at least zap_pte_range() would cope
> with this just fine because subsequent iterations would just see
> pte_none() and continue the loop. I haven't checked the other call sites
> though, but in principal I don't see why we couldn't define
> ptep_get_and_clear_full() as being something that clears all PTEs
> mapping a given folio (although it might need renaming).
>
> This does assume you don't need to partially unmap a page in
> zap_pte_range (ie. end >= folio), but we're already making that
> assumption.
>
> > +
> > +     start = contpte_align_down(ptep);
> > +     ptep = start + CONT_PTES - 1;
> > +
> > +     for (; ptep >= start; ptep--) {
> > +             pte_t pte = __ptep_get(ptep);
> > +
> > +             if (!pte_valid(pte))
> > +                     continue;
> > +
> > +             if (dirty)
> > +                     pte = pte_mkdirty(pte);
> > +
> > +             if (young)
> > +                     pte = pte_mkyoung(pte);
> > +
> > +             __ptep_set_access_flags_notlbi(ptep, pte);
> > +             return contpte_ptep_get(ptep, orig_pte);
> > +     }
> > +
> > +     return orig_pte;
> > +}
> > +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> > +
> >  int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> >                                       unsigned long addr, pte_t *ptep)
> >  {
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index d63f3a0a7251..b22216a8153c 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
> >       pr_cont("\n");
> >  }
> >
> > -/*
> > - * This function sets the access flags (dirty, accessed), as well as write
> > - * permission, and only to a more permissive setting.
> > - *
> > - * It needs to cope with hardware update of the accessed/dirty state by other
> > - * agents in the system and can safely skip the __sync_icache_dcache() call as,
> > - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> > - *
> > - * Returns whether or not the PTE actually changed.
> > - */
> > -int __ptep_set_access_flags(struct vm_area_struct *vma,
> > -                         unsigned long address, pte_t *ptep,
> > -                         pte_t entry, int dirty)
> > +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
> >  {
> >       pteval_t old_pteval, pteval;
> >       pte_t pte = __ptep_get(ptep);
> > @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
> >               pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
> >       } while (pteval != old_pteval);
> >
> > +     return 1;
> > +}
> > +
> > +/*
> > + * This function sets the access flags (dirty, accessed), as well as write
> > + * permission, and only to a more permissive setting.
> > + *
> > + * It needs to cope with hardware update of the accessed/dirty state by other
> > + * agents in the system and can safely skip the __sync_icache_dcache() call as,
> > + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
> > + *
> > + * Returns whether or not the PTE actually changed.
> > + */
> > +int __ptep_set_access_flags(struct vm_area_struct *vma,
> > +                         unsigned long address, pte_t *ptep,
> > +                         pte_t entry, int dirty)
> > +{
> > +     int changed = __ptep_set_access_flags_notlbi(ptep, entry);
> > +
> >       /* Invalidate a stale read-only entry */
> > -     if (dirty)
> > +     if (changed && dirty)
> >               flush_tlb_page(vma, address);
> > -     return 1;
> > +
> > +     return changed;
> >  }
> >
> >  static bool is_el1_instruction_abort(unsigned long esr)
> > --8<--
>

Thanks
Barry
Ryan Roberts Nov. 30, 2023, 11:47 a.m. UTC | #20
On 30/11/2023 05:07, Alistair Popple wrote:
> 
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>>> have modified the bits during those loops. I think it would be possible to fix
>>>>> the race, but intuition says it would be expensive.
>>>>
>>>> So the issue as I understand it is subsequent iterations would see a
>>>> clean PTE after the first iteration returned a dirty PTE. In
>>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>>> all the PTEs?
>>>
>>> The raciness I'm describing is the race between reading access/dirty from one
>>> pte and applying it to another. But yes I like your suggestion. if we do:
>>>
>>> pte = __ptep_get_and_clear_full(ptep)
>>>
>>> on the target pte, then we have grabbed access/dirty from it in a race-free
>>> manner. we can then loop from current pte up towards the top of the block until
>>> we find a valid entry (and I guess wrap at the top to make us robust against
>>> future callers clearing an an arbitrary order). Then atomically accumulate the
>>> access/dirty bits we have just saved into that new entry. I guess that's just a
>>> cmpxchg loop - there are already examples of how to do that correctly when
>>> racing the TLB.
>>>
>>> For most entries, we will just be copying up to the next pte. For the last pte,
>>> we would end up reading all ptes and determine we are the last one.
>>>
>>> What do you think?
>>
>> OK here is an attempt at something which solves the fragility. I think this is
>> now robust and will always return the correct access/dirty state from
>> ptep_get_and_clear_full() and ptep_get().
>>
>> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
>> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
>> bits from across the contpte block - which requires reading each pte in the
>> contpte block. So its O(n^2) in that sense. I'll benchmark it and report back.
>>
>> Was this the type of thing you were thinking of, Alistair?
> 
> Yes, that is along the lines of what I was thinking. However I have
> added a couple of comments inline. 
> 
>> --8<--
>>  arch/arm64/include/asm/pgtable.h | 23 ++++++++-
>>  arch/arm64/mm/contpte.c          | 81 ++++++++++++++++++++++++++++++++
>>  arch/arm64/mm/fault.c            | 38 +++++++++------
>>  3 files changed, 125 insertions(+), 17 deletions(-)
>>
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 9bd2f57a9e11..6c295d277784 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>>  	return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
>>  }
>>
>> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
>>  extern int __ptep_set_access_flags(struct vm_area_struct *vma,
>>  				 unsigned long address, pte_t *ptep,
>>  				 pte_t entry, int dirty);
>> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>  extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>  extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>  				pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep);
>>  extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  				unsigned long addr, pte_t *ptep);
>>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
>>  	__pte_clear(mm, addr, ptep);
>>  }
>>
>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep, int full)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (!pte_valid_cont(orig_pte))
>> +		return __ptep_get_and_clear(mm, addr, ptep);
>> +
>> +	if (!full) {
>> +		contpte_try_unfold(mm, addr, ptep, orig_pte);
>> +		return __ptep_get_and_clear(mm, addr, ptep);
>> +	}
>> +
>> +	return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>> +}
>> +
>>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>  				unsigned long addr, pte_t *ptep)
>>  {
>> -	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> -	return __ptep_get_and_clear(mm, addr, ptep);
>> +	return ptep_get_and_clear_full(mm, addr, ptep, 0);
>>  }
>>
>>  #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index 2a57df16bf58..99b211118d93 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>  	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>  		pte = __ptep_get(ptep);
>>
>> +		/*
>> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
>> +		 * where some of the ptes in the range may be cleared but others
>> +		 * are still to do. See contpte_ptep_get_and_clear_full().
>> +		 */
>> +		if (!pte_valid(pte))
>> +			continue;
>> +
>>  		if (pte_dirty(pte))
>>  			orig_pte = pte_mkdirty(orig_pte);
>>
>> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>  }
>>  EXPORT_SYMBOL(contpte_set_ptes);
>>
>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	/*
>> +	 * When doing a full address space teardown, we can avoid unfolding the
>> +	 * contiguous range, and therefore avoid the associated tlbi. Instead,
>> +	 * just get and clear the pte. The caller is promising to call us for
>> +	 * every pte, so every pte in the range will be cleared by the time the
>> +	 * final tlbi is issued.
>> +	 *
>> +	 * This approach requires some complex hoop jumping though, as for the
>> +	 * duration between returning from the first call to
>> +	 * ptep_get_and_clear_full() and making the final call, the contpte
>> +	 * block is in an intermediate state, where some ptes are cleared and
>> +	 * others are still set with the PTE_CONT bit. If any other APIs are
>> +	 * called for the ptes in the contpte block during that time, we have to
>> +	 * be very careful. The core code currently interleaves calls to
>> +	 * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
>> +	 * careful to ignore the cleared entries when accumulating the access
>> +	 * and dirty bits - the same goes for ptep_get_lockless(). The only
>> +	 * other calls we might resonably expect are to set markers in the
>> +	 * previously cleared ptes. (We shouldn't see valid entries being set
>> +	 * until after the tlbi, at which point we are no longer in the
>> +	 * intermediate state). Since markers are not valid, this is safe;
>> +	 * set_ptes() will see the old, invalid entry and will not attempt to
>> +	 * unfold. And the new pte is also invalid so it won't attempt to fold.
>> +	 * We shouldn't see pte markers being set for the 'full' case anyway
>> +	 * since the address space is being torn down.
>> +	 *
>> +	 * The last remaining issue is returning the access/dirty bits. That
>> +	 * info could be present in any of the ptes in the contpte block.
>> +	 * ptep_get() will gather those bits from across the contpte block (for
>> +	 * the remaining valid entries). So below, if the pte we are clearing
>> +	 * has dirty or young set, we need to stash it into a pte that we are
>> +	 * yet to clear. This allows future calls to return the correct state
>> +	 * even when the info was stored in a different pte. Since the core-mm
>> +	 * calls from low to high address, we prefer to stash in the last pte of
>> +	 * the contpte block - this means we are not "dragging" the bits up
>> +	 * through all ptes and increases the chances that we can exit early
>> +	 * because a given pte will have neither dirty or young set.
>> +	 */
>> +
>> +	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
>> +	bool dirty = pte_dirty(orig_pte);
>> +	bool young = pte_young(orig_pte);
>> +	pte_t *start;
>> +
>> +	if (!dirty && !young)
>> +		return contpte_ptep_get(ptep, orig_pte);
> 
> I don't think we need to do this. If the PTE is !dirty && !young we can
> just return it. As you say we have to assume HW can set those flags at
> any time anyway so it doesn't get us much. This means in the common case
> we should only run through the loop setting the dirty/young flags once
> which should alay the performance concerns.

I don't follow your logic. This is precisely the problem I was trying to solve
vs my original (simple) attempt - we want to always report the correct
access/dirty info. If we read one of the PTEs and neither access nor dirty are
set, that doesn't mean it's old and clean, it just means that that info is
definitely not stored in this PTE - we need to check the others. (when the
contiguous bit is set, the HW will only update the access/dirty bits for 1 of
the PTEs in the contpte block).

Also, IIRC, the core-mm sets access when initially setting up the mapping, so
it's not guaranteed that all but one of the PTEs in the contpte block
have (!dirty && !young).

> 
> However I am now wondering if we're doing the wrong thing trying to hide
> this down in the arch layer anyway. Perhaps it would be better to deal
> with this in the core-mm code after all.
> 
> So how about having ptep_get_and_clear_full() clearing the PTEs for the
> entire cont block? We know by definition all PTEs should be pointing to
> the same folio anyway, and it seems at least zap_pte_range() would cope
> with this just fine because subsequent iterations would just see
> pte_none() and continue the loop. I haven't checked the other call sites
> though, but in principal I don't see why we couldn't define
> ptep_get_and_clear_full() as being something that clears all PTEs
> mapping a given folio (although it might need renaming).

Ahha! Yes, I've been working on a solution like this since Barry raised it
yesterday. I have a working version that seems to perform well. I wouldn't want
to just clear all the PTEs in the block inside ptep_get_and_clear_full() because,
although it might work today, it's fragile in the same way that my v2 version is.

Instead, I've defined a new helper, clear_ptes(), which takes a starting pte and
a number of ptes to clear (like set_ptes()). It returns the PTE read from the
*first* slot, but with the access/dirty bits being accumulated from all of the
ptes in the requested batch. Then zap_pte_range() is reworked to find
appropriate batches (similar to how I've reworked for ptep_set_wrprotects()).
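
For clarity, here is a minimal sketch of the shape I have in mind for
clear_ptes() (illustrative only - the exact signature and the contpte-specific
handling, such as skipping the unfold/tlbi for a full teardown, are omitted):

static inline pte_t clear_ptes(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep, unsigned int nr)
{
	/*
	 * Return the value read from the *first* slot, but accumulate the
	 * access/dirty bits from every pte in the batch.
	 */
	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
	unsigned int i;

	for (i = 1; i < nr; i++) {
		pte_t pte = __ptep_get_and_clear(mm, addr + i * PAGE_SIZE,
						 ptep + i);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}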

I was trying to avoid introducing new helpers, but I think this is the most
robust approach, and looks slightly more performant too, on first sight. It also
addresses cases where full=0, which Barry says are important for madvise(DONTNEED).

> 
> This does assume you don't need to partially unmap a page in
> zap_pte_range (ie. end >= folio), but we're already making that
> assumption.

That's fine for full=1. But we can't make that assumption for full=0. If a VMA
gets split for a reason that doesn't require re-setting the PTEs then a contpte
block could straddle 2 VMAs. But the solution I describe above is robust to that.
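
For illustration, sizing the batch at the call site could look roughly like
this (zap_batch_size() is a made-up name, not an existing helper); the batch
is clamped to both the folio and the range being zapped, so it never spans a
VMA boundary even when a contpte block straddles two VMAs:

static unsigned int zap_batch_size(struct page *page, unsigned long addr,
				   unsigned long end)
{
	struct folio *folio = page_folio(page);
	unsigned int nr = 1;

	if (folio_test_large(folio)) {
		/* Remaining pages of the folio, mapped from this pte onwards. */
		nr = folio_nr_pages(folio) - folio_page_idx(folio, page);
		/* Never extend past the end of the range being zapped. */
		nr = min_t(unsigned int, nr, (end - addr) >> PAGE_SHIFT);
	}

	return nr;
}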

I'll finish gathering perf data then post for all 3 approaches: v2 as originally
posted, "robust ptep_get_and_clear_full()", and clear_ptes(). Hopefully later today.

> 
>> +
>> +	start = contpte_align_down(ptep);
>> +	ptep = start + CONT_PTES - 1;
>> +
>> +	for (; ptep >= start; ptep--) {
>> +		pte_t pte = __ptep_get(ptep);
>> +
>> +		if (!pte_valid(pte))
>> +			continue;
>> +
>> +		if (dirty)
>> +			pte = pte_mkdirty(pte);
>> +
>> +		if (young)
>> +			pte = pte_mkyoung(pte);
>> +
>> +		__ptep_set_access_flags_notlbi(ptep, pte);
>> +		return contpte_ptep_get(ptep, orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
>> +
>>  int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>  					unsigned long addr, pte_t *ptep)
>>  {
>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>> index d63f3a0a7251..b22216a8153c 100644
>> --- a/arch/arm64/mm/fault.c
>> +++ b/arch/arm64/mm/fault.c
>> @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
>>  	pr_cont("\n");
>>  }
>>
>> -/*
>> - * This function sets the access flags (dirty, accessed), as well as write
>> - * permission, and only to a more permissive setting.
>> - *
>> - * It needs to cope with hardware update of the accessed/dirty state by other
>> - * agents in the system and can safely skip the __sync_icache_dcache() call as,
>> - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>> - *
>> - * Returns whether or not the PTE actually changed.
>> - */
>> -int __ptep_set_access_flags(struct vm_area_struct *vma,
>> -			    unsigned long address, pte_t *ptep,
>> -			    pte_t entry, int dirty)
>> +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
>>  {
>>  	pteval_t old_pteval, pteval;
>>  	pte_t pte = __ptep_get(ptep);
>> @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
>>  		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
>>  	} while (pteval != old_pteval);
>>
>> +	return 1;
>> +}
>> +
>> +/*
>> + * This function sets the access flags (dirty, accessed), as well as write
>> + * permission, and only to a more permissive setting.
>> + *
>> + * It needs to cope with hardware update of the accessed/dirty state by other
>> + * agents in the system and can safely skip the __sync_icache_dcache() call as,
>> + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>> + *
>> + * Returns whether or not the PTE actually changed.
>> + */
>> +int __ptep_set_access_flags(struct vm_area_struct *vma,
>> +			    unsigned long address, pte_t *ptep,
>> +			    pte_t entry, int dirty)
>> +{
>> +	int changed = __ptep_set_access_flags_notlbi(ptep, entry);
>> +
>>  	/* Invalidate a stale read-only entry */
>> -	if (dirty)
>> +	if (changed && dirty)
>>  		flush_tlb_page(vma, address);
>> -	return 1;
>> +
>> +	return changed;
>>  }
>>
>>  static bool is_el1_instruction_abort(unsigned long esr)
>> --8<--
>
Ryan Roberts Nov. 30, 2023, noon UTC | #21
>> Just because you found a pte that maps a page from a large folio, that doesn't
>> mean that all pages from the folio are mapped, and it doesn't mean they are
>> mapped contiguously. We have to deal with partial munmap(), partial mremap()
>> etc. We could split in these cases (and in future it might be sensible to try),
>> but that can fail (due to GUP). So we still have to handle the corner case.
>>
>> But I can imagine doing a batched version of ptep_get_and_clear(), like I did
>> for ptep_set_wrprotects(). And I think this would be an improvement.
>>
>> The reason I haven't done that so far, is because ptep_get_and_clear() returns
>> the pte value when it was cleared and that's hard to do if batching due to the
>> storage requirement. But perhaps you could just return the logical OR of the
>> dirty and young bits across all ptes in the batch. The caller should be able to
>> reconstitute the rest if it needs it?
>>
>> What do you think?
> 
> I really don't know why we care about the return value of ptep_get_and_clear()
> as zap_pte_range() doesn't ask for any ret value at all. so why not totally give
> up this kind of complex logical OR of dirty and young as they are useless in
> this case?

That's not the case in v6.7-rc1:


static unsigned long zap_pte_range(struct mmu_gather *tlb,
				struct vm_area_struct *vma, pmd_t *pmd,
				unsigned long addr, unsigned long end,
				struct zap_details *details)
{
	...

	do {
		pte_t ptent = ptep_get(pte);

		...

		if (pte_present(ptent)) {
			...

			ptent = ptep_get_and_clear_full(mm, addr, pte,
							tlb->fullmm);
			arch_check_zapped_pte(vma, ptent);
			tlb_remove_tlb_entry(tlb, pte, addr);
			zap_install_uffd_wp_if_needed(vma, addr, pte, details,
						      ptent);
			if (unlikely(!page)) {
				ksm_might_unmap_zero_page(mm, ptent);
				continue;
			}

			delay_rmap = 0;
			if (!PageAnon(page)) {
				if (pte_dirty(ptent)) {
					set_page_dirty(page);
					if (tlb_delay_rmap(tlb)) {
						delay_rmap = 1;
						force_flush = 1;
					}
				}
				if (pte_young(ptent) && likely(vma_has_recency(vma)))
					mark_page_accessed(page);
			}

			...
		}

		...
	} while (pte++, addr += PAGE_SIZE, addr != end);

	...
}

Most importantly, file-backed mappings need the access/dirty bits to propagate that information back to the folio, so it will be written back to disk. x86 is also looking at the dirty bit in arch_check_zapped_pte(), and ksm is using it in ksm_might_unmap_zero_page().

Probably for your use case of anon memory on arm64 on a phone, you don't need the return value. But my solution is also setting contpte for file-backed memory, and there are performance wins to be had there, especially for executable mappings where contpte reduces iTLB pressure. (I have other work which ensures these file-backed mappings are in correctly-sized large folios).

So I don't think we can just do a clear without the get part. But I think I have a solution in the shape of clear_ptes(), as described in the other thread, which gives the characteristics you suggest.
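
To make that concrete, here is a simplified sketch of a zap loop consuming
such a helper. clear_ptes() and zap_batch_size() are assumptions based on the
sketches earlier in the thread, not existing APIs, and mmu_gather/uffd/rmap
details are left out:

static void zap_pte_batch_sketch(struct vm_area_struct *vma, unsigned long addr,
				 unsigned long end, pte_t *pte)
{
	struct mm_struct *mm = vma->vm_mm;

	while (addr < end) {
		pte_t ptent = ptep_get(pte);
		struct page *page;
		unsigned int nr = 1;

		if (!pte_present(ptent))
			goto next;

		page = vm_normal_page(vma, addr, ptent);
		if (page)
			nr = zap_batch_size(page, addr, end);	/* hypothetical */

		/* First pte value, with dirty/young OR-ed across the batch. */
		ptent = clear_ptes(mm, addr, pte, nr);		/* hypothetical */

		if (page && !PageAnon(page)) {
			if (pte_dirty(ptent))
				set_page_dirty(page);
			if (pte_young(ptent) && vma_has_recency(vma))
				mark_page_accessed(page);
		}
next:
		pte += nr;
		addr += nr * PAGE_SIZE;
	}
}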


> 
> Is it possible for us to introduce a new api like?
> 
> bool clear_folio_ptes(folio, ptep)
> {
>     if(ptes are contiguous mapped) {
>            clear all ptes all together    // this also clears all CONTPTE
>            return true;
>     }
>     return false;
> }
> 
> in zap_pte_range():
> 
> if (large_folio(folio) && clear_folio_ptes(folio, ptep)) {
>          addr += nr - 1
>          pte += nr  -1
> } else
>          old path.
> 
> 
>>
>>>
>>> zap_pte_range is the most frequent behaviour from userspace libc heap
>>> as i explained
>>> before. libc can call madvise(DONTNEED) the most often. It is crucial
>>> to performance.
>>>
>>> and this way can also help drop your full version by moving to full
>>> flushing the whole
>>> large folios? and we don't need to depend on fullmm any more?
>>>
>>>>
>>>> I don't think there is any correctness issue here. But there is a problem with
>>>> fragility, as raised by Alistair. I have some ideas on potentially how to solve
>>>> that. I'm going to try to work on it this afternoon and will post if I get some
>>>> confidence that it is a real solution.
>>>>
>>>> Thanks,
>>>> Ryan
>>>>
>>>>>
>>>>> static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
>>>>>                                      unsigned long addr,
>>>>>                                      pte_t *ptep,
>>>>>                                      bool flush)
>>>>> {
>>>>>       pte_t orig_pte = ptep_get(ptep);
>>>>>
>>>>>       CHP_BUG_ON(!pte_cont(orig_pte));
>>>>>       CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
>>>>>       CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
>>>>>
>>>>>       return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
>>>>> }
>>>>>
>>>>> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
>>>>>
>>>>>> +     */
>>>>>> +
>>>>>> +    return __ptep_get_and_clear(mm, addr, ptep);
>>>>>> +}
>>>>>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
>>>>>> +
>>>>>
>>>
>  Thanks
>  Barry
Barry Song Dec. 3, 2023, 9:41 p.m. UTC | #22
On Thu, Nov 30, 2023 at 8:00 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> >> Just because you found a pte that maps a page from a large folio, that doesn't
> >> mean that all pages from the folio are mapped, and it doesn't mean they are
> >> mapped contiguously. We have to deal with partial munmap(), partial mremap()
> >> etc. We could split in these cases (and in future it might be sensible to try),
> >> but that can fail (due to GUP). So we still have to handle the corner case.
> >>
> >> But I can imagine doing a batched version of ptep_get_and_clear(), like I did
> >> for ptep_set_wrprotects(). And I think this would be an improvement.
> >>
> >> The reason I haven't done that so far, is because ptep_get_and_clear() returns
> >> the pte value when it was cleared and that's hard to do if batching due to the
> >> storage requirement. But perhaps you could just return the logical OR of the
> >> dirty and young bits across all ptes in the batch. The caller should be able to
> >> reconstitute the rest if it needs it?
> >>
> >> What do you think?
> >
> > I really don't know why we care about the return value of ptep_get_and_clear()
> > as zap_pte_range() doesn't ask for any ret value at all. so why not totally give
> > up this kind of complex logical OR of dirty and young as they are useless in
> > this case?
>
> That's not the case in v6.7-rc1:
>
>
> static unsigned long zap_pte_range(struct mmu_gather *tlb,
>                                 struct vm_area_struct *vma, pmd_t *pmd,
>                                 unsigned long addr, unsigned long end,
>                                 struct zap_details *details)
> {
>         ...
>
>         do {
>                 pte_t ptent = ptep_get(pte);
>
>                 ...
>
>                 if (pte_present(ptent)) {
>                         ...
>
>                         ptent = ptep_get_and_clear_full(mm, addr, pte,
>                                                         tlb->fullmm);
>                         arch_check_zapped_pte(vma, ptent);
>                         tlb_remove_tlb_entry(tlb, pte, addr);
>                         zap_install_uffd_wp_if_needed(vma, addr, pte, details,
>                                                       ptent);
>                         if (unlikely(!page)) {
>                                 ksm_might_unmap_zero_page(mm, ptent);
>                                 continue;
>                         }
>
>                         delay_rmap = 0;
>                         if (!PageAnon(page)) {
>                                 if (pte_dirty(ptent)) {
>                                         set_page_dirty(page);
>                                         if (tlb_delay_rmap(tlb)) {
>                                                 delay_rmap = 1;
>                                                 force_flush = 1;
>                                         }
>                                 }
>                                 if (pte_young(ptent) && likely(vma_has_recency(vma)))
>                                         mark_page_accessed(page);
>                         }
>
>                         ...
>                 }
>
>                 ...
>         } while (pte++, addr += PAGE_SIZE, addr != end);
>
>         ...
> }
>
> Most importantly, file-backed mappings need the access/dirty bits to propagate that information back to the folio, so it will be written back to disk. x86 is also looking at the dirty bit in arch_check_zapped_pte(), and ksm is using it in ksm_might_unmap_zero_page().
>
> Probably for your use case of anon memory on arm64 on a phone, you don't need the return value. But my solution is also setting cotnpte for file-backed memory, and there are performance wins to be had there, especially for executable mappings where contpte reduces iTLB pressure. (I have other work which ensures these file-backed mappings are in correctly-sized large folios).
>
> So I don't think we can just do a clear without the get part. But I think I have a solution in the shape of clear_ptes(), as described in the other thread, which gives the characteristics you suggest.

Understood. I realized OPPO's code actually also returns the logical OR of the
dirty and access bits while it exposes CONTPTE to mm-core [1]:

static pte_t get_clear_flush(struct mm_struct *mm,
                             unsigned long addr,
                             pte_t *ptep,
                             unsigned long pgsize,
                             unsigned long ncontig,
                             bool flush)
{
        pte_t orig_pte = ptep_get(ptep);
        bool valid = pte_valid(orig_pte);
        unsigned long i, saddr = addr;

        for (i = 0; i < ncontig; i++, addr += pgsize, ptep++) {
                pte_t pte = ptep_get_and_clear(mm, addr, ptep);

                if (pte_dirty(pte))
                        orig_pte = pte_mkdirty(orig_pte);

                if (pte_young(pte))
                        orig_pte = pte_mkyoung(orig_pte);
        }

        if (valid && flush) {
                struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);

                flush_tlb_range(&vma, saddr, addr);
        }
        return orig_pte;
}

static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct
mm_struct *mm,
                                       unsigned long addr,
                                       pte_t *ptep,
                                       bool flush)
{
        pte_t orig_pte = ptep_get(ptep);

        CHP_BUG_ON(!pte_cont(orig_pte));
        CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
        CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));

        return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
}

[1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/cont_pte_hugepage.c#L1421

>
>
> >
> > Is it possible for us to introduce a new api like?
> >
> > bool clear_folio_ptes(folio, ptep)
> > {
> >     if(ptes are contiguous mapped) {
> >            clear all ptes all together    // this also clears all CONTPTE
> >            return true;
> >     }
> >     return false;
> > }
> >
> > in zap_pte_range():
> >
> > if (large_folio(folio) && clear_folio_ptes(folio, ptep)) {
> >          addr += nr - 1
> >          pte += nr  -1
> > } else
> >          old path.
> >
> >
> >>
> >>>
> >>> zap_pte_range is the most frequent behaviour from userspace libc heap
> >>> as i explained
> >>> before. libc can call madvise(DONTNEED) the most often. It is crucial
> >>> to performance.
> >>>
> >>> and this way can also help drop your full version by moving to full
> >>> flushing the whole
> >>> large folios? and we don't need to depend on fullmm any more?
> >>>
> >>>>
> >>>> I don't think there is any correctness issue here. But there is a problem with
> >>>> fragility, as raised by Alistair. I have some ideas on potentially how to solve
> >>>> that. I'm going to try to work on it this afternoon and will post if I get some
> >>>> confidence that it is a real solution.
> >>>>
> >>>> Thanks,
> >>>> Ryan
> >>>>
> >>>>>
> >>>>> static inline pte_t __cont_pte_huge_ptep_get_and_clear_flush(struct mm_struct *mm,
> >>>>>                                      unsigned long addr,
> >>>>>                                      pte_t *ptep,
> >>>>>                                      bool flush)
> >>>>> {
> >>>>>       pte_t orig_pte = ptep_get(ptep);
> >>>>>
> >>>>>       CHP_BUG_ON(!pte_cont(orig_pte));
> >>>>>       CHP_BUG_ON(!IS_ALIGNED(addr, HPAGE_CONT_PTE_SIZE));
> >>>>>       CHP_BUG_ON(!IS_ALIGNED(pte_pfn(orig_pte), HPAGE_CONT_PTE_NR));
> >>>>>
> >>>>>       return get_clear_flush(mm, addr, ptep, PAGE_SIZE, CONT_PTES, flush);
> >>>>> }
> >>>>>
> >>>>> [1] https://github.com/OnePlusOSS/android_kernel_oneplus_sm8550/blob/oneplus/sm8550_u_14.0.0_oneplus11/mm/memory.c#L1539
> >>>>>
> >>>>>> +     */
> >>>>>> +
> >>>>>> +    return __ptep_get_and_clear(mm, addr, ptep);
> >>>>>> +}
> >>>>>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
> >>>>>> +
> >>>>>
> >>>
> >  Thanks
> >  Barry
>
Alistair Popple Dec. 3, 2023, 11:20 p.m. UTC | #23
Ryan Roberts <ryan.roberts@arm.com> writes:

> On 30/11/2023 05:07, Alistair Popple wrote:
>> 
>> Ryan Roberts <ryan.roberts@arm.com> writes:
>> 
>>>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>>>> have modified the bits during those loops. I think it would be possible to fix
>>>>>> the race, but intuition says it would be expensive.
>>>>>
>>>>> So the issue as I understand it is subsequent iterations would see a
>>>>> clean PTE after the first iteration returned a dirty PTE. In
>>>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>>>> all the PTEs?
>>>>
>>>> The raciness I'm describing is the race between reading access/dirty from one
>>>> pte and applying it to another. But yes I like your suggestion. if we do:
>>>>
>>>> pte = __ptep_get_and_clear_full(ptep)
>>>>
>>>> on the target pte, then we have grabbed access/dirty from it in a race-free
>>>> manner. we can then loop from current pte up towards the top of the block until
>>>> we find a valid entry (and I guess wrap at the top to make us robust against
>>>> future callers clearing an an arbitrary order). Then atomically accumulate the
>>>> access/dirty bits we have just saved into that new entry. I guess that's just a
>>>> cmpxchg loop - there are already examples of how to do that correctly when
>>>> racing the TLB.
>>>>
>>>> For most entries, we will just be copying up to the next pte. For the last pte,
>>>> we would end up reading all ptes and determine we are the last one.
>>>>
>>>> What do you think?
>>>
>>> OK here is an attempt at something which solves the fragility. I think this is
>>> now robust and will always return the correct access/dirty state from
>>> ptep_get_and_clear_full() and ptep_get().
>>>
>>> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
>>> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
>>> bits from across the contpte block - which requires reading each pte in the
>>> contpte block. So its O(n^2) in that sense. I'll benchmark it and report back.
>>>
>>> Was this the type of thing you were thinking of, Alistair?
>> 
>> Yes, that is along the lines of what I was thinking. However I have
>> added a couple of comments inline. 
>> 
>>> --8<--
>>>  arch/arm64/include/asm/pgtable.h | 23 ++++++++-
>>>  arch/arm64/mm/contpte.c          | 81 ++++++++++++++++++++++++++++++++
>>>  arch/arm64/mm/fault.c            | 38 +++++++++------
>>>  3 files changed, 125 insertions(+), 17 deletions(-)
>>>
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 9bd2f57a9e11..6c295d277784 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>>>  	return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
>>>  }
>>>
>>> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
>>>  extern int __ptep_set_access_flags(struct vm_area_struct *vma,
>>>  				 unsigned long address, pte_t *ptep,
>>>  				 pte_t entry, int dirty);
>>> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>>  extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>>  extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>  				pte_t *ptep, pte_t pte, unsigned int nr);
>>> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>> +				unsigned long addr, pte_t *ptep);
>>>  extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>  				unsigned long addr, pte_t *ptep);
>>>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
>>>  	__pte_clear(mm, addr, ptep);
>>>  }
>>>
>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>>> +				unsigned long addr, pte_t *ptep, int full)
>>> +{
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	if (!pte_valid_cont(orig_pte))
>>> +		return __ptep_get_and_clear(mm, addr, ptep);
>>> +
>>> +	if (!full) {
>>> +		contpte_try_unfold(mm, addr, ptep, orig_pte);
>>> +		return __ptep_get_and_clear(mm, addr, ptep);
>>> +	}
>>> +
>>> +	return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>>> +}
>>> +
>>>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>  				unsigned long addr, pte_t *ptep)
>>>  {
>>> -	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> -	return __ptep_get_and_clear(mm, addr, ptep);
>>> +	return ptep_get_and_clear_full(mm, addr, ptep, 0);
>>>  }
>>>
>>>  #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> index 2a57df16bf58..99b211118d93 100644
>>> --- a/arch/arm64/mm/contpte.c
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>  	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>  		pte = __ptep_get(ptep);
>>>
>>> +		/*
>>> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
>>> +		 * where some of the ptes in the range may be cleared but others
>>> +		 * are still to do. See contpte_ptep_get_and_clear_full().
>>> +		 */
>>> +		if (!pte_valid(pte))
>>> +			continue;
>>> +
>>>  		if (pte_dirty(pte))
>>>  			orig_pte = pte_mkdirty(orig_pte);
>>>
>>> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>  }
>>>  EXPORT_SYMBOL(contpte_set_ptes);
>>>
>>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>> +					unsigned long addr, pte_t *ptep)
>>> +{
>>> +	/*
>>> +	 * When doing a full address space teardown, we can avoid unfolding the
>>> +	 * contiguous range, and therefore avoid the associated tlbi. Instead,
>>> +	 * just get and clear the pte. The caller is promising to call us for
>>> +	 * every pte, so every pte in the range will be cleared by the time the
>>> +	 * final tlbi is issued.
>>> +	 *
>>> +	 * This approach requires some complex hoop jumping though, as for the
>>> +	 * duration between returning from the first call to
>>> +	 * ptep_get_and_clear_full() and making the final call, the contpte
>>> +	 * block is in an intermediate state, where some ptes are cleared and
>>> +	 * others are still set with the PTE_CONT bit. If any other APIs are
>>> +	 * called for the ptes in the contpte block during that time, we have to
>>> +	 * be very careful. The core code currently interleaves calls to
>>> +	 * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
>>> +	 * careful to ignore the cleared entries when accumulating the access
>>> +	 * and dirty bits - the same goes for ptep_get_lockless(). The only
>>> +	 * other calls we might resonably expect are to set markers in the
>>> +	 * previously cleared ptes. (We shouldn't see valid entries being set
>>> +	 * until after the tlbi, at which point we are no longer in the
>>> +	 * intermediate state). Since markers are not valid, this is safe;
>>> +	 * set_ptes() will see the old, invalid entry and will not attempt to
>>> +	 * unfold. And the new pte is also invalid so it won't attempt to fold.
>>> +	 * We shouldn't see pte markers being set for the 'full' case anyway
>>> +	 * since the address space is being torn down.
>>> +	 *
>>> +	 * The last remaining issue is returning the access/dirty bits. That
>>> +	 * info could be present in any of the ptes in the contpte block.
>>> +	 * ptep_get() will gather those bits from across the contpte block (for
>>> +	 * the remaining valid entries). So below, if the pte we are clearing
>>> +	 * has dirty or young set, we need to stash it into a pte that we are
>>> +	 * yet to clear. This allows future calls to return the correct state
>>> +	 * even when the info was stored in a different pte. Since the core-mm
>>> +	 * calls from low to high address, we prefer to stash in the last pte of
>>> +	 * the contpte block - this means we are not "dragging" the bits up
>>> +	 * through all ptes and increases the chances that we can exit early
>>> +	 * because a given pte will have neither dirty or young set.
>>> +	 */
>>> +
>>> +	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
>>> +	bool dirty = pte_dirty(orig_pte);
>>> +	bool young = pte_young(orig_pte);
>>> +	pte_t *start;
>>> +
>>> +	if (!dirty && !young)
>>> +		return contpte_ptep_get(ptep, orig_pte);
>> 
>> I don't think we need to do this. If the PTE is !dirty && !young we can
>> just return it. As you say we have to assume HW can set those flags at
>> any time anyway so it doesn't get us much. This means in the common case
>> we should only run through the loop setting the dirty/young flags once
>> which should alay the performance concerns.
>
> I don't follow your logic. This is precisely the problem I was trying to solve
> vs my original (simple) attempt - we want to always report the correct
> access/dirty info. If we read one of the PTEs and neither access nor dirty are
> set, that doesn't mean its old and clean, it just means that that info is
> definitely not stored in this PTE - we need to check the others. (when the
> contiguous bit is set, the HW will only update the access/dirty bits for 1 of
> the PTEs in the contpte block).

So my concern wasn't about incorrectly returning a !young && !dirty PTE
when the CONT_PTE block was *previously* clean/old (ie. the first
ptep_get/ptep_get_and_clear_full returned clean/old) because we have to
tolerate that anyway due to HW being able to set those bits. Rather my
concern was ptep_get_and_clear_full() could implicitly clear dirty/young
bits - ie. ptep_get_and_clear_full() could return a dirty/young PTE but
the next call would not.

That's because, regardless of what we do here, it is just a matter of
timing, since we have to assume other HW threads can set these bits at any
time. There is nothing stopping HW from doing that just after we read
them in that loop, so a block can always become dirty/young at any time.
However it shouldn't become !dirty/!young without explicit SW
intervention.

But this is all a bit of a moot point due to the discussion below.

> Also, IIRC correctly, the core-mm sets access when initially setting up the
> mapping so its not guarranteed that all but one of the PTEs in the contpte block
> have (!dirty && !young).
>
>> 
>> However I am now wondering if we're doing the wrong thing trying to hide
>> this down in the arch layer anyway. Perhaps it would be better to deal
>> with this in the core-mm code after all.
>> 
>> So how about having ptep_get_and_clear_full() clearing the PTEs for the
>> entire cont block? We know by definition all PTEs should be pointing to
>> the same folio anyway, and it seems at least zap_pte_range() would cope
>> with this just fine because subsequent iterations would just see
>> pte_none() and continue the loop. I haven't checked the other call sites
>> though, but in principal I don't see why we couldn't define
>> ptep_get_and_clear_full() as being something that clears all PTEs
>> mapping a given folio (although it might need renaming).
>
> Ahha! Yes, I've been working on a solution like this since Barry raised it
> yesterday. I have a working version, that seems to perform well. I wouldn't want
> to just clear all the PTEs in the block inside ptep_get_and_clear_full() because
> although it might work today, its fragile in the same way that my v2 version is.

Yes, agree a new helper would be needed.

> Instead, I've defined a new helper, clear_ptes(), which takes a starting pte and
> a number of ptes to clear (like set_ptes()). It returns the PTE read from the
> *first* slot, but with the access/dirty bits being accumulated from all of the
> ptes in the requested batch. Then zap_pte_range() is reworked to find
> appropriate batches (similar to how I've reworked for ptep_set_wrprotects()).
>
> I was trying to avoid introducing new helpers, but I think this is the most
> robust approach, and looks slightly more performant to, on first sight. It also
> addresses cases where full=0, which Barry says are important for madvise(DONTNEED).

I strongly agree with this approach now, especially if it is equally (or
more!) performant. I get why you didn't want to introduce new helpers,
but I think avoiding them was making things too subtle, so I would like
to see this.

>> 
>> This does assume you don't need to partially unmap a page in
>> zap_pte_range (ie. end >= folio), but we're already making that
>> assumption.
>
> That's fine for full=1. But we can't make that assumption for full=0. If a VMA
> gets split for a reason that doesn't require re-setting the PTEs then a contpte
> block could straddle 2 VMAs. But the solution I describe above is robust to that.
>
> I'll finish gathering perf data then post for all 3 approaches; v2 as originally
> posted, "robust ptep_get_and_clear_full()", and clear_ptes(). Hopefully later today.

Thanks!

>> 
>>> +
>>> +	start = contpte_align_down(ptep);
>>> +	ptep = start + CONT_PTES - 1;
>>> +
>>> +	for (; ptep >= start; ptep--) {
>>> +		pte_t pte = __ptep_get(ptep);
>>> +
>>> +		if (!pte_valid(pte))
>>> +			continue;
>>> +
>>> +		if (dirty)
>>> +			pte = pte_mkdirty(pte);
>>> +
>>> +		if (young)
>>> +			pte = pte_mkyoung(pte);
>>> +
>>> +		__ptep_set_access_flags_notlbi(ptep, pte);
>>> +		return contpte_ptep_get(ptep, orig_pte);
>>> +	}
>>> +
>>> +	return orig_pte;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
>>> +
>>>  int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>  					unsigned long addr, pte_t *ptep)
>>>  {
>>> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
>>> index d63f3a0a7251..b22216a8153c 100644
>>> --- a/arch/arm64/mm/fault.c
>>> +++ b/arch/arm64/mm/fault.c
>>> @@ -199,19 +199,7 @@ static void show_pte(unsigned long addr)
>>>  	pr_cont("\n");
>>>  }
>>>
>>> -/*
>>> - * This function sets the access flags (dirty, accessed), as well as write
>>> - * permission, and only to a more permissive setting.
>>> - *
>>> - * It needs to cope with hardware update of the accessed/dirty state by other
>>> - * agents in the system and can safely skip the __sync_icache_dcache() call as,
>>> - * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>>> - *
>>> - * Returns whether or not the PTE actually changed.
>>> - */
>>> -int __ptep_set_access_flags(struct vm_area_struct *vma,
>>> -			    unsigned long address, pte_t *ptep,
>>> -			    pte_t entry, int dirty)
>>> +int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry)
>>>  {
>>>  	pteval_t old_pteval, pteval;
>>>  	pte_t pte = __ptep_get(ptep);
>>> @@ -238,10 +226,30 @@ int __ptep_set_access_flags(struct vm_area_struct *vma,
>>>  		pteval = cmpxchg_relaxed(&pte_val(*ptep), old_pteval, pteval);
>>>  	} while (pteval != old_pteval);
>>>
>>> +	return 1;
>>> +}
>>> +
>>> +/*
>>> + * This function sets the access flags (dirty, accessed), as well as write
>>> + * permission, and only to a more permissive setting.
>>> + *
>>> + * It needs to cope with hardware update of the accessed/dirty state by other
>>> + * agents in the system and can safely skip the __sync_icache_dcache() call as,
>>> + * like __set_ptes(), the PTE is never changed from no-exec to exec here.
>>> + *
>>> + * Returns whether or not the PTE actually changed.
>>> + */
>>> +int __ptep_set_access_flags(struct vm_area_struct *vma,
>>> +			    unsigned long address, pte_t *ptep,
>>> +			    pte_t entry, int dirty)
>>> +{
>>> +	int changed = __ptep_set_access_flags_notlbi(ptep, entry);
>>> +
>>>  	/* Invalidate a stale read-only entry */
>>> -	if (dirty)
>>> +	if (changed && dirty)
>>>  		flush_tlb_page(vma, address);
>>> -	return 1;
>>> +
>>> +	return changed;
>>>  }
>>>
>>>  static bool is_el1_instruction_abort(unsigned long esr)
>>> --8<--
>>
Ryan Roberts Dec. 4, 2023, 9:39 a.m. UTC | #24
On 03/12/2023 23:20, Alistair Popple wrote:
> 
> Ryan Roberts <ryan.roberts@arm.com> writes:
> 
>> On 30/11/2023 05:07, Alistair Popple wrote:
>>>
>>> Ryan Roberts <ryan.roberts@arm.com> writes:
>>>
>>>>>>> So if we do need to deal with racing HW, I'm pretty sure my v1 implementation is
>>>>>>> buggy because it iterated through the PTEs, getting and accumulating. Then
>>>>>>> iterated again, writing that final set of bits to all the PTEs. And the HW could
>>>>>>> have modified the bits during those loops. I think it would be possible to fix
>>>>>>> the race, but intuition says it would be expensive.
>>>>>>
>>>>>> So the issue as I understand it is subsequent iterations would see a
>>>>>> clean PTE after the first iteration returned a dirty PTE. In
>>>>>> ptep_get_and_clear_full() why couldn't you just copy the dirty/accessed
>>>>>> bit (if set) from the PTE being cleared to an adjacent PTE rather than
>>>>>> all the PTEs?
>>>>>
>>>>> The raciness I'm describing is the race between reading access/dirty from one
>>>>> pte and applying it to another. But yes I like your suggestion. if we do:
>>>>>
>>>>> pte = __ptep_get_and_clear_full(ptep)
>>>>>
>>>>> on the target pte, then we have grabbed access/dirty from it in a race-free
>>>>> manner. we can then loop from current pte up towards the top of the block until
>>>>> we find a valid entry (and I guess wrap at the top to make us robust against
>>>>> future callers clearing an an arbitrary order). Then atomically accumulate the
>>>>> access/dirty bits we have just saved into that new entry. I guess that's just a
>>>>> cmpxchg loop - there are already examples of how to do that correctly when
>>>>> racing the TLB.
>>>>>
>>>>> For most entries, we will just be copying up to the next pte. For the last pte,
>>>>> we would end up reading all ptes and determine we are the last one.
>>>>>
>>>>> What do you think?
>>>>
>>>> OK here is an attempt at something which solves the fragility. I think this is
>>>> now robust and will always return the correct access/dirty state from
>>>> ptep_get_and_clear_full() and ptep_get().
>>>>
>>>> But I'm not sure about performance; each call to ptep_get_and_clear_full() for
>>>> each pte in a contpte block will cause a ptep_get() to gather the access/dirty
>>>> bits from across the contpte block - which requires reading each pte in the
>>>> contpte block. So its O(n^2) in that sense. I'll benchmark it and report back.
>>>>
>>>> Was this the type of thing you were thinking of, Alistair?
>>>
>>> Yes, that is along the lines of what I was thinking. However I have
>>> added a couple of comments inline. 
>>>
>>>> --8<--
>>>>  arch/arm64/include/asm/pgtable.h | 23 ++++++++-
>>>>  arch/arm64/mm/contpte.c          | 81 ++++++++++++++++++++++++++++++++
>>>>  arch/arm64/mm/fault.c            | 38 +++++++++------
>>>>  3 files changed, 125 insertions(+), 17 deletions(-)
>>>>
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 9bd2f57a9e11..6c295d277784 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -851,6 +851,7 @@ static inline pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot)
>>>>  	return pte_pmd(pte_modify(pmd_pte(pmd), newprot));
>>>>  }
>>>>
>>>> +extern int __ptep_set_access_flags_notlbi(pte_t *ptep, pte_t entry);
>>>>  extern int __ptep_set_access_flags(struct vm_area_struct *vma,
>>>>  				 unsigned long address, pte_t *ptep,
>>>>  				 pte_t entry, int dirty);
>>>> @@ -1145,6 +1146,8 @@ extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>>>  extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>>>  extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>>  				pte_t *ptep, pte_t pte, unsigned int nr);
>>>> +extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>>> +				unsigned long addr, pte_t *ptep);
>>>>  extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>>>  				unsigned long addr, pte_t *ptep);
>>>>  extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>>> @@ -1270,12 +1273,28 @@ static inline void pte_clear(struct mm_struct *mm,
>>>>  	__pte_clear(mm, addr, ptep);
>>>>  }
>>>>
>>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
>>>> +static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
>>>> +				unsigned long addr, pte_t *ptep, int full)
>>>> +{
>>>> +	pte_t orig_pte = __ptep_get(ptep);
>>>> +
>>>> +	if (!pte_valid_cont(orig_pte))
>>>> +		return __ptep_get_and_clear(mm, addr, ptep);
>>>> +
>>>> +	if (!full) {
>>>> +		contpte_try_unfold(mm, addr, ptep, orig_pte);
>>>> +		return __ptep_get_and_clear(mm, addr, ptep);
>>>> +	}
>>>> +
>>>> +	return contpte_ptep_get_and_clear_full(mm, addr, ptep);
>>>> +}
>>>> +
>>>>  #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>>>  static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>>>  				unsigned long addr, pte_t *ptep)
>>>>  {
>>>> -	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>>> -	return __ptep_get_and_clear(mm, addr, ptep);
>>>> +	return ptep_get_and_clear_full(mm, addr, ptep, 0);
>>>>  }
>>>>
>>>>  #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>>> index 2a57df16bf58..99b211118d93 100644
>>>> --- a/arch/arm64/mm/contpte.c
>>>> +++ b/arch/arm64/mm/contpte.c
>>>> @@ -145,6 +145,14 @@ pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>>>  	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>>>  		pte = __ptep_get(ptep);
>>>>
>>>> +		/*
>>>> +		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
>>>> +		 * where some of the ptes in the range may be cleared but others
>>>> +		 * are still to do. See contpte_ptep_get_and_clear_full().
>>>> +		 */
>>>> +		if (!pte_valid(pte))
>>>> +			continue;
>>>> +
>>>>  		if (pte_dirty(pte))
>>>>  			orig_pte = pte_mkdirty(orig_pte);
>>>>
>>>> @@ -257,6 +265,79 @@ void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>>>  }
>>>>  EXPORT_SYMBOL(contpte_set_ptes);
>>>>
>>>> +pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
>>>> +					unsigned long addr, pte_t *ptep)
>>>> +{
>>>> +	/*
>>>> +	 * When doing a full address space teardown, we can avoid unfolding the
>>>> +	 * contiguous range, and therefore avoid the associated tlbi. Instead,
>>>> +	 * just get and clear the pte. The caller is promising to call us for
>>>> +	 * every pte, so every pte in the range will be cleared by the time the
>>>> +	 * final tlbi is issued.
>>>> +	 *
>>>> +	 * This approach requires some complex hoop jumping though, as for the
>>>> +	 * duration between returning from the first call to
>>>> +	 * ptep_get_and_clear_full() and making the final call, the contpte
>>>> +	 * block is in an intermediate state, where some ptes are cleared and
>>>> +	 * others are still set with the PTE_CONT bit. If any other APIs are
>>>> +	 * called for the ptes in the contpte block during that time, we have to
>>>> +	 * be very careful. The core code currently interleaves calls to
>>>> +	 * ptep_get_and_clear_full() with ptep_get() and so ptep_get() must be
>>>> +	 * careful to ignore the cleared entries when accumulating the access
>>>> +	 * and dirty bits - the same goes for ptep_get_lockless(). The only
>>>> +	 * other calls we might resonably expect are to set markers in the
>>>> +	 * previously cleared ptes. (We shouldn't see valid entries being set
>>>> +	 * until after the tlbi, at which point we are no longer in the
>>>> +	 * intermediate state). Since markers are not valid, this is safe;
>>>> +	 * set_ptes() will see the old, invalid entry and will not attempt to
>>>> +	 * unfold. And the new pte is also invalid so it won't attempt to fold.
>>>> +	 * We shouldn't see pte markers being set for the 'full' case anyway
>>>> +	 * since the address space is being torn down.
>>>> +	 *
>>>> +	 * The last remaining issue is returning the access/dirty bits. That
>>>> +	 * info could be present in any of the ptes in the contpte block.
>>>> +	 * ptep_get() will gather those bits from across the contpte block (for
>>>> +	 * the remaining valid entries). So below, if the pte we are clearing
>>>> +	 * has dirty or young set, we need to stash it into a pte that we are
>>>> +	 * yet to clear. This allows future calls to return the correct state
>>>> +	 * even when the info was stored in a different pte. Since the core-mm
>>>> +	 * calls from low to high address, we prefer to stash in the last pte of
>>>> +	 * the contpte block - this means we are not "dragging" the bits up
>>>> +	 * through all ptes and increases the chances that we can exit early
>>>> +	 * because a given pte will have neither dirty or young set.
>>>> +	 */
>>>> +
>>>> +	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
>>>> +	bool dirty = pte_dirty(orig_pte);
>>>> +	bool young = pte_young(orig_pte);
>>>> +	pte_t *start;
>>>> +
>>>> +	if (!dirty && !young)
>>>> +		return contpte_ptep_get(ptep, orig_pte);
>>>
>>> I don't think we need to do this. If the PTE is !dirty && !young we can
>>> just return it. As you say we have to assume HW can set those flags at
>>> any time anyway so it doesn't get us much. This means in the common case
>>> we should only run through the loop setting the dirty/young flags once
> which should allay the performance concerns.
>>
>> I don't follow your logic. This is precisely the problem I was trying to solve
>> vs my original (simple) attempt - we want to always report the correct
>> access/dirty info. If we read one of the PTEs and neither access nor dirty are
>> set, that doesn't mean it's old and clean, it just means that that info is
>> definitely not stored in this PTE - we need to check the others. (when the
>> contiguous bit is set, the HW will only update the access/dirty bits for 1 of
>> the PTEs in the contpte block).
> 
> So my concern wasn't about incorrectly returning a !young && !dirty PTE
> when the CONT_PTE block was *previously* clean/old (ie. the first
> ptep_get/ptep_get_and_clear_full returned clean/old) because we have to
> tolerate that anyway due to HW being able to set those bits. Rather my
> concern was ptep_get_and_clear_full() could implicitly clear dirty/young
> bits - ie. ptep_get_and_clear_full() could return a dirty/young PTE but
> the next call would not.
> 
> That's because regardless of what we do here it is just a matter of
> timing, given that we have to assume other HW threads can set these bits at any
> time. There is nothing stopping HW from doing that just after we read
> them in that loop, so a block can always become dirty/young at any time.
> However it shouldn't become !dirty/!young without explicit SW
> intervention.
> 
> But this is all a bit of a moot point due to the discussion below.
> 
>> Also, IIRC, the core-mm sets access when initially setting up the
>> mapping so it's not guaranteed that all but one of the PTEs in the contpte block
>> have (!dirty && !young).
>>
>>>
>>> However I am now wondering if we're doing the wrong thing trying to hide
>>> this down in the arch layer anyway. Perhaps it would be better to deal
>>> with this in the core-mm code after all.
>>>
>>> So how about having ptep_get_and_clear_full() clearing the PTEs for the
>>> entire cont block? We know by definition all PTEs should be pointing to
>>> the same folio anyway, and it seems at least zap_pte_range() would cope
>>> with this just fine because subsequent iterations would just see
>>> pte_none() and continue the loop. I haven't checked the other call sites
>>> though, but in principle I don't see why we couldn't define
>>> ptep_get_and_clear_full() as being something that clears all PTEs
>>> mapping a given folio (although it might need renaming).
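(To make the above concrete: the loop shape being relied on here is roughly
the following - a heavily simplified sketch, not the actual zap_pte_range()
code, with tlb batching, rmap and accounting elided:

	do {
		pte_t ptent = ptep_get(pte);

		/* Entries already cleared by an earlier, wider clear are skipped. */
		if (pte_none(ptent))
			continue;

		ptent = ptep_get_and_clear_full(mm, addr, pte, tlb->fullmm);
		/* ... tlb flush batching, rmap/folio accounting ... */
	} while (pte++, addr += PAGE_SIZE, addr != end);

So a helper that clears beyond the current pte would be tolerated as long as
the extra entries end up pte_none().)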
>>
>> Ahha! Yes, I've been working on a solution like this since Barry raised it
>> yesterday. I have a working version that seems to perform well. I wouldn't want
>> to just clear all the PTEs in the block inside ptep_get_and_clear_full() because
>> although it might work today, it's fragile in the same way that my v2 version is.
> 
> Yes, agree a new helper would be needed.
> 
>> Instead, I've defined a new helper, clear_ptes(), which takes a starting pte and
>> a number of ptes to clear (like set_ptes()). It returns the PTE read from the
>> *first* slot, but with the access/dirty bits being accumulated from all of the
>> ptes in the requested batch. Then zap_pte_range() is reworked to find
>> appropriate batches (similar to how I've reworked for ptep_set_wrprotects()).
>>
>> I was trying to avoid introducing new helpers, but I think this is the most
>> robust approach, and looks slightly more performant too, at first sight. It also
>> addresses cases where full=0, which Barry says are important for madvise(DONTNEED).
> 
> I strongly agree with this approach now, especially if it is equally (or
> more!) performant. I get why you didn't want to introduce new helpers,
> but I think avoiding them was making things too subtle, so I would like to see
> this.
> 
>>>
>>> This does assume you don't need to partially unmap a page in
>>> zap_pte_range (ie. end >= folio), but we're already making that
>>> assumption.
>>
>> That's fine for full=1. But we can't make that assumption for full=0. If a VMA
>> gets split for a reason that doesn't require re-setting the PTEs then a contpte
>> block could straddle 2 VMAs. But the solution I describe above is robust to that.
>>
>> I'll finish gathering perf data then post for all 3 approaches; v2 as originally
>> posted, "robust ptep_get_and_clear_full()", and clear_ptes(). Hopefully later today.
> 
> Thanks!

From the commit log of the new version, which I'll hopefully post later today:

  The following shows the results of running a kernel compilation workload
  and measuring the cost of arm64_sys_exit_group() (which at ~1.5% is a
  very small part of the overall workload).

  Benchmarks were run on Ampere Altra in 2 configs; single numa node and 2
  numa nodes (tlbis are more expensive in 2 node config).

    - baseline: v6.7-rc1 + anonfolio-v7
    - no-opt: contpte series without any attempt to optimize exit()
    - simple-ptep_get_clear_full: simple optimization to exploit full=1.
      ptep_get_clear_full() does not fully conform to its intended semantic
    - robust-ptep_get_clear_full: similar to previous but
      ptep_get_clear_full() fully conforms to its intended semantic
    - clear_ptes: optimization implemented by this patch

  | config                     | numa=1 | numa=2 |
  |----------------------------|--------|--------|
  | baseline                   |     0% |     0% |
  | no-opt                     |   190% |   768% |
  | simple-ptep_get_clear_full |     8% |    29% |
  | robust-ptep_get_clear_full |    21% |    19% |
  | clear_ptes                 |    13% |     9% |

  In all cases, the cost of arm64_sys_exit_group() increases; this is
  anticipated because there is more work to do to tear down the page
  tables. But clear_ptes() gives the smallest increase overall.

Note that "simple-ptep_get_clear_full" is the version I posted with v2.
"robust-ptep_get_clear_full" is the version I tried as part of this
conversation. And "clear_ptes" is the batched version that I think we all now
prefer (and plan to post as part of v3).
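
For reference, the rough shape of clear_ptes() I have in mind is below. This
is just an illustrative sketch - the name, signature and internals may well
change for the v3 posting - but it shows the idea: clear nr consecutive ptes
and return the first pte with the access/dirty bits accumulated from the
whole batch:

static inline pte_t clear_ptes(struct mm_struct *mm, unsigned long addr,
				pte_t *ptep, unsigned int nr)
{
	/* The first pte provides the pfn/prot info we return. */
	pte_t orig_pte = __ptep_get_and_clear(mm, addr, ptep);
	unsigned int i;

	for (i = 1; i < nr; i++) {
		pte_t pte;

		addr += PAGE_SIZE;
		ptep++;

		/* Clear the rest of the batch, folding in access/dirty. */
		pte = __ptep_get_and_clear(mm, addr, ptep);

		if (pte_dirty(pte))
			orig_pte = pte_mkdirty(orig_pte);
		if (pte_young(pte))
			orig_pte = pte_mkyoung(orig_pte);
	}

	return orig_pte;
}

zap_pte_range() would then be reworked to find suitable batches (bounded by
the folio and the end of the range) and call this once per batch, rather than
calling ptep_get_and_clear_full() per pte.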

Thanks,
Ryan

Patch

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 9bd2f57a9e11..ea58a9f4e700 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -1145,6 +1145,8 @@  extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
 extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
 extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 				pte_t *ptep, pte_t pte, unsigned int nr);
+extern pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 				unsigned long addr, pte_t *ptep);
 extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
@@ -1270,12 +1272,24 @@  static inline void pte_clear(struct mm_struct *mm,
 	__pte_clear(mm, addr, ptep);
 }
 
+#define __HAVE_ARCH_PTEP_GET_AND_CLEAR_FULL
+static inline pte_t ptep_get_and_clear_full(struct mm_struct *mm,
+				unsigned long addr, pte_t *ptep, int full)
+{
+	pte_t orig_pte = __ptep_get(ptep);
+
+	if (!pte_valid_cont(orig_pte) || !full) {
+		contpte_try_unfold(mm, addr, ptep, orig_pte);
+		return __ptep_get_and_clear(mm, addr, ptep);
+	} else
+		return contpte_ptep_get_and_clear_full(mm, addr, ptep);
+}
+
 #define __HAVE_ARCH_PTEP_GET_AND_CLEAR
 static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
 				unsigned long addr, pte_t *ptep)
 {
-	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
-	return __ptep_get_and_clear(mm, addr, ptep);
+	return ptep_get_and_clear_full(mm, addr, ptep, 0);
 }
 
 #define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index 426be9cd4dea..5d1aaed82d32 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -144,6 +144,14 @@  pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
 	for (i = 0; i < CONT_PTES; i++, ptep++) {
 		pte = __ptep_get(ptep);
 
+		/*
+		 * Deal with the partial contpte_ptep_get_and_clear_full() case,
+		 * where some of the ptes in the range may be cleared but others
+		 * are still to do. See contpte_ptep_get_and_clear_full().
+		 */
+		if (pte_val(pte) == 0)
+			continue;
+
 		if (pte_dirty(pte))
 			orig_pte = pte_mkdirty(orig_pte);
 
@@ -256,6 +264,52 @@  void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
 }
 EXPORT_SYMBOL(contpte_set_ptes);
 
+pte_t contpte_ptep_get_and_clear_full(struct mm_struct *mm,
+					unsigned long addr, pte_t *ptep)
+{
+	/*
+	 * When doing a full address space teardown, we can avoid unfolding the
+	 * contiguous range, and therefore avoid the associated tlbi. Instead,
+	 * just get and clear the pte. The caller is promising to call us for
+	 * every pte, so every pte in the range will be cleared by the time the
+	 * tlbi is issued.
+	 *
+	 * This approach is not perfect though, as for the duration between
+	 * returning from the first call to ptep_get_and_clear_full() and making
+	 * the final call, the contpte block is in an intermediate state, where
+	 * some ptes are cleared and others are still set with the PTE_CONT bit.
+	 * If any other APIs are called for the ptes in the contpte block during
+	 * that time, we have to be very careful. The core code currently
+	 * interleaves calls to ptep_get_and_clear_full() with ptep_get() and so
+	 * ptep_get() must be careful to ignore the cleared entries when
+	 * accumulating the access and dirty bits - the same goes for
+	 * ptep_get_lockless(). The only other calls we might reasonably expect
+	 * are to set markers in the previously cleared ptes. (We shouldn't see
+	 * valid entries being set until after the tlbi, at which point we are
+	 * no longer in the intermediate state). Since markers are not valid,
+	 * this is safe; set_ptes() will see the old, invalid entry and will not
+	 * attempt to unfold. And the new pte is also invalid so it won't
+	 * attempt to fold. We shouldn't see this for the 'full' case anyway.
+	 *
+	 * The last remaining issue is returning the access/dirty bits. That
+	 * info could be present in any of the ptes in the contpte block.
+	 * ptep_get() will gather those bits from across the contpte block. We
+	 * don't bother doing that here, because we know that the information is
+	 * used by the core-mm to mark the underlying folio as accessed/dirty.
+	 * And since the same folio must be underpinning the whole block (that
+	 * was a requirement for folding in the first place), that information
+	 * will make it to the folio eventually once all the ptes have been
+	 * cleared. This approach means we don't have to play games with
+	 * accumulating and storing the bits. It does mean that any interleaved
+	 * calls to ptep_get() may lack correct access/dirty information if we
+	 * have already cleared the pte that happened to store it. The core code
+	 * does not rely on this though.
+	 */
+
+	return __ptep_get_and_clear(mm, addr, ptep);
+}
+EXPORT_SYMBOL(contpte_ptep_get_and_clear_full);
+
 int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
 					unsigned long addr, pte_t *ptep)
 {