diff mbox series

[2/2] arm64/mm: Improve comment in contpte_ptep_get_lockless()

Message ID 20240226120321.1055731-3-ryan.roberts@arm.com (mailing list archive)
State New, archived
Headers show
Series Address some contpte nits | expand

Commit Message

Ryan Roberts Feb. 26, 2024, 12:03 p.m. UTC
Make clear the atmicity/consistency requirements of the API and how we
achieve them.

Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

Comments

David Hildenbrand Feb. 26, 2024, 12:30 p.m. UTC | #1
On 26.02.24 13:03, Ryan Roberts wrote:
> Make clear the atmicity/consistency requirements of the API and how we
> achieve them.
> 
> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>   1 file changed, 14 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index be0a226c4ff9..1b64b4c3f8bf 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -183,16 +183,20 @@ EXPORT_SYMBOL_GPL(contpte_ptep_get);
>   pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>   {
>   	/*
> -	 * Gather access/dirty bits, which may be populated in any of the ptes
> -	 * of the contig range. We may not be holding the PTL, so any contiguous
> -	 * range may be unfolded/modified/refolded under our feet. Therefore we
> -	 * ensure we read a _consistent_ contpte range by checking that all ptes
> -	 * in the range are valid and have CONT_PTE set, that all pfns are
> -	 * contiguous and that all pgprots are the same (ignoring access/dirty).
> -	 * If we find a pte that is not consistent, then we must be racing with
> -	 * an update so start again. If the target pte does not have CONT_PTE
> -	 * set then that is considered consistent on its own because it is not
> -	 * part of a contpte range.
> +	 * The ptep_get_lockless() API requires us to read and return *orig_ptep
> +	 * so that it is self-consistent, without the PTL held, so we may be
> +	 * racing with other threads modifying the pte. Usually a READ_ONCE()
> +	 * would suffice, but for the contpte case, we also need to gather the
> +	 * access and dirty bits from across all ptes in the contiguous block,
> +	 * and we can't read all of those neighbouring ptes atomically, so any
> +	 * contiguous range may be unfolded/modified/refolded under our feet.
> +	 * Therefore we ensure we read a _consistent_ contpte range by checking
> +	 * that all ptes in the range are valid and have CONT_PTE set, that all
> +	 * pfns are contiguous and that all pgprots are the same (ignoring
> +	 * access/dirty). If we find a pte that is not consistent, then we must
> +	 * be racing with an update so start again. If the target pte does not
> +	 * have CONT_PTE set then that is considered consistent on its own
> +	 * because it is not part of a contpte range.
>   	 */
>   
>   	pgprot_t orig_prot;

Reviewed-by: David Hildenbrand <david@redhat.com>

In an ideal world, we'd really not rely on any accessed/dirty on the 
lockless path and remove contpte_ptep_get_lockless() completely :)
Ryan Roberts Feb. 26, 2024, 12:37 p.m. UTC | #2
On 26/02/2024 12:30, David Hildenbrand wrote:
> On 26.02.24 13:03, Ryan Roberts wrote:
>> Make clear the atmicity/consistency requirements of the API and how we
>> achieve them.
>>
>> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>   arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>>   1 file changed, 14 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index be0a226c4ff9..1b64b4c3f8bf 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -183,16 +183,20 @@ EXPORT_SYMBOL_GPL(contpte_ptep_get);
>>   pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>   {
>>       /*
>> -     * Gather access/dirty bits, which may be populated in any of the ptes
>> -     * of the contig range. We may not be holding the PTL, so any contiguous
>> -     * range may be unfolded/modified/refolded under our feet. Therefore we
>> -     * ensure we read a _consistent_ contpte range by checking that all ptes
>> -     * in the range are valid and have CONT_PTE set, that all pfns are
>> -     * contiguous and that all pgprots are the same (ignoring access/dirty).
>> -     * If we find a pte that is not consistent, then we must be racing with
>> -     * an update so start again. If the target pte does not have CONT_PTE
>> -     * set then that is considered consistent on its own because it is not
>> -     * part of a contpte range.
>> +     * The ptep_get_lockless() API requires us to read and return *orig_ptep
>> +     * so that it is self-consistent, without the PTL held, so we may be
>> +     * racing with other threads modifying the pte. Usually a READ_ONCE()
>> +     * would suffice, but for the contpte case, we also need to gather the
>> +     * access and dirty bits from across all ptes in the contiguous block,
>> +     * and we can't read all of those neighbouring ptes atomically, so any
>> +     * contiguous range may be unfolded/modified/refolded under our feet.
>> +     * Therefore we ensure we read a _consistent_ contpte range by checking
>> +     * that all ptes in the range are valid and have CONT_PTE set, that all
>> +     * pfns are contiguous and that all pgprots are the same (ignoring
>> +     * access/dirty). If we find a pte that is not consistent, then we must
>> +     * be racing with an update so start again. If the target pte does not
>> +     * have CONT_PTE set then that is considered consistent on its own
>> +     * because it is not part of a contpte range.
>>        */
>>         pgprot_t orig_prot;
> 
> Reviewed-by: David Hildenbrand <david@redhat.com>

Thanks!

> 
> In an ideal world, we'd really not rely on any accessed/dirty on the lockless
> path and remove contpte_ptep_get_lockless() completely :)

Not sure if you saw my RFC to do exactly that? (well, it doesn't actually remove
[contpte_]ptep_get_lockless() but it does remove all the callers). If you have
any feedback, we could get this moving...

https://lore.kernel.org/linux-mm/20240215121756.2734131-1-ryan.roberts@arm.com/

Thanks,
Ryan
David Hildenbrand Feb. 26, 2024, 12:40 p.m. UTC | #3
On 26.02.24 13:37, Ryan Roberts wrote:
> On 26/02/2024 12:30, David Hildenbrand wrote:
>> On 26.02.24 13:03, Ryan Roberts wrote:
>>> Make clear the atmicity/consistency requirements of the API and how we
>>> achieve them.
>>>
>>> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>    arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>>>    1 file changed, 14 insertions(+), 10 deletions(-)
>>>
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> index be0a226c4ff9..1b64b4c3f8bf 100644
>>> --- a/arch/arm64/mm/contpte.c
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -183,16 +183,20 @@ EXPORT_SYMBOL_GPL(contpte_ptep_get);
>>>    pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>>    {
>>>        /*
>>> -     * Gather access/dirty bits, which may be populated in any of the ptes
>>> -     * of the contig range. We may not be holding the PTL, so any contiguous
>>> -     * range may be unfolded/modified/refolded under our feet. Therefore we
>>> -     * ensure we read a _consistent_ contpte range by checking that all ptes
>>> -     * in the range are valid and have CONT_PTE set, that all pfns are
>>> -     * contiguous and that all pgprots are the same (ignoring access/dirty).
>>> -     * If we find a pte that is not consistent, then we must be racing with
>>> -     * an update so start again. If the target pte does not have CONT_PTE
>>> -     * set then that is considered consistent on its own because it is not
>>> -     * part of a contpte range.
>>> +     * The ptep_get_lockless() API requires us to read and return *orig_ptep
>>> +     * so that it is self-consistent, without the PTL held, so we may be
>>> +     * racing with other threads modifying the pte. Usually a READ_ONCE()
>>> +     * would suffice, but for the contpte case, we also need to gather the
>>> +     * access and dirty bits from across all ptes in the contiguous block,
>>> +     * and we can't read all of those neighbouring ptes atomically, so any
>>> +     * contiguous range may be unfolded/modified/refolded under our feet.
>>> +     * Therefore we ensure we read a _consistent_ contpte range by checking
>>> +     * that all ptes in the range are valid and have CONT_PTE set, that all
>>> +     * pfns are contiguous and that all pgprots are the same (ignoring
>>> +     * access/dirty). If we find a pte that is not consistent, then we must
>>> +     * be racing with an update so start again. If the target pte does not
>>> +     * have CONT_PTE set then that is considered consistent on its own
>>> +     * because it is not part of a contpte range.
>>>         */
>>>          pgprot_t orig_prot;
>>
>> Reviewed-by: David Hildenbrand <david@redhat.com>
> 
> Thanks!
> 
>>
>> In an ideal world, we'd really not rely on any accessed/dirty on the lockless
>> path and remove contpte_ptep_get_lockless() completely :)
> 
> Not sure if you saw my RFC to do exactly that? (well, it doesn't actually remove
> [contpte_]ptep_get_lockless() but it does remove all the callers). If you have
> any feedback, we could get this moving...

Yes, I saw it. Hoping we can get that in and then maybe remove the 
contpte_get_lockless() once all callers are gone :)

... on my todo list.
John Hubbard Feb. 27, 2024, 11:45 p.m. UTC | #4
On 2/26/24 04:03, Ryan Roberts wrote:

Hi Ryan!

> Make clear the atmicity/consistency requirements of the API and how we

"atomicity"

> achieve them.
> 
> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>   arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>   1 file changed, 14 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index be0a226c4ff9..1b64b4c3f8bf 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -183,16 +183,20 @@ EXPORT_SYMBOL_GPL(contpte_ptep_get);
>   pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>   {
>   	/*
> -	 * Gather access/dirty bits, which may be populated in any of the ptes
> -	 * of the contig range. We may not be holding the PTL, so any contiguous
> -	 * range may be unfolded/modified/refolded under our feet. Therefore we
> -	 * ensure we read a _consistent_ contpte range by checking that all ptes
> -	 * in the range are valid and have CONT_PTE set, that all pfns are
> -	 * contiguous and that all pgprots are the same (ignoring access/dirty).
> -	 * If we find a pte that is not consistent, then we must be racing with
> -	 * an update so start again. If the target pte does not have CONT_PTE
> -	 * set then that is considered consistent on its own because it is not
> -	 * part of a contpte range.
> +	 * The ptep_get_lockless() API requires us to read and return *orig_ptep
> +	 * so that it is self-consistent, without the PTL held, so we may be
> +	 * racing with other threads modifying the pte. Usually a READ_ONCE()
> +	 * would suffice, but for the contpte case, we also need to gather the
> +	 * access and dirty bits from across all ptes in the contiguous block,
> +	 * and we can't read all of those neighbouring ptes atomically, so any

This still leaves a key detail unexplained: how the accessed and dirty bits
are handled. The above raises the *problem*, but then talks about getting a
consistent set of reads. But during those consistent reads, the HW could have
dirtied or read a page. And this code here is only returning a single pte.

So I'm still feeling vague about what we're trying to say about accessed and
dirty bits.


> +	 * contiguous range may be unfolded/modified/refolded under our feet.
> +	 * Therefore we ensure we read a _consistent_ contpte range by checking
> +	 * that all ptes in the range are valid and have CONT_PTE set, that all
> +	 * pfns are contiguous and that all pgprots are the same (ignoring
> +	 * access/dirty). If we find a pte that is not consistent, then we must
> +	 * be racing with an update so start again. If the target pte does not
> +	 * have CONT_PTE set then that is considered consistent on its own
> +	 * because it is not part of a contpte range.
>   	 */
>   
>   	pgprot_t orig_prot;

thanks,
Catalin Marinas March 1, 2024, 6:47 p.m. UTC | #5
On Mon, Feb 26, 2024 at 12:03:21PM +0000, Ryan Roberts wrote:
> Make clear the atmicity/consistency requirements of the API and how we
> achieve them.
> 
> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>  1 file changed, 14 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> index be0a226c4ff9..1b64b4c3f8bf 100644
> --- a/arch/arm64/mm/contpte.c
> +++ b/arch/arm64/mm/contpte.c
> @@ -183,16 +183,20 @@ EXPORT_SYMBOL_GPL(contpte_ptep_get);
>  pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>  {
>  	/*
> -	 * Gather access/dirty bits, which may be populated in any of the ptes
> -	 * of the contig range. We may not be holding the PTL, so any contiguous
> -	 * range may be unfolded/modified/refolded under our feet. Therefore we
> -	 * ensure we read a _consistent_ contpte range by checking that all ptes
> -	 * in the range are valid and have CONT_PTE set, that all pfns are
> -	 * contiguous and that all pgprots are the same (ignoring access/dirty).
> -	 * If we find a pte that is not consistent, then we must be racing with
> -	 * an update so start again. If the target pte does not have CONT_PTE
> -	 * set then that is considered consistent on its own because it is not
> -	 * part of a contpte range.
> +	 * The ptep_get_lockless() API requires us to read and return *orig_ptep
> +	 * so that it is self-consistent, without the PTL held, so we may be
> +	 * racing with other threads modifying the pte. Usually a READ_ONCE()
> +	 * would suffice, but for the contpte case, we also need to gather the
> +	 * access and dirty bits from across all ptes in the contiguous block,
> +	 * and we can't read all of those neighbouring ptes atomically, so any
> +	 * contiguous range may be unfolded/modified/refolded under our feet.
> +	 * Therefore we ensure we read a _consistent_ contpte range by checking
> +	 * that all ptes in the range are valid and have CONT_PTE set, that all
> +	 * pfns are contiguous and that all pgprots are the same (ignoring
> +	 * access/dirty). If we find a pte that is not consistent, then we must
> +	 * be racing with an update so start again. If the target pte does not
> +	 * have CONT_PTE set then that is considered consistent on its own
> +	 * because it is not part of a contpte range.
>  	 */

I haven't had the time to properly think about this function but,
depending on what its semantics are, we might not guarantee that, at the
time of reading a pte, we have the correct dirty state from the other
ptes in the range.

Theoretical: let's say we read the first pte in the contig range and
it's clean but further down there's a dirty one. Another (v)CPU breaks
the contig range, sets the dirty bit everywhere, there's some
pte_mkclean for all of them and they are collapsed into a contig range
again. The function above on the first (v)CPU returns a clean pte when
it should have actually been dirty at the time of read.

Throughout the callers of this function, I couldn't find one where it
matters. So I concluded that they don't need the dirty state. Normally
the dirty state is passed to the page flags, so not lost after the pte
has been cleaned.
Ryan Roberts March 4, 2024, 12:54 p.m. UTC | #6
On 01/03/2024 18:47, Catalin Marinas wrote:
> On Mon, Feb 26, 2024 at 12:03:21PM +0000, Ryan Roberts wrote:
>> Make clear the atmicity/consistency requirements of the API and how we
>> achieve them.
>>
>> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>>  1 file changed, 14 insertions(+), 10 deletions(-)
>>
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> index be0a226c4ff9..1b64b4c3f8bf 100644
>> --- a/arch/arm64/mm/contpte.c
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -183,16 +183,20 @@ EXPORT_SYMBOL_GPL(contpte_ptep_get);
>>  pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>  {
>>  	/*
>> -	 * Gather access/dirty bits, which may be populated in any of the ptes
>> -	 * of the contig range. We may not be holding the PTL, so any contiguous
>> -	 * range may be unfolded/modified/refolded under our feet. Therefore we
>> -	 * ensure we read a _consistent_ contpte range by checking that all ptes
>> -	 * in the range are valid and have CONT_PTE set, that all pfns are
>> -	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>> -	 * If we find a pte that is not consistent, then we must be racing with
>> -	 * an update so start again. If the target pte does not have CONT_PTE
>> -	 * set then that is considered consistent on its own because it is not
>> -	 * part of a contpte range.
>> +	 * The ptep_get_lockless() API requires us to read and return *orig_ptep
>> +	 * so that it is self-consistent, without the PTL held, so we may be
>> +	 * racing with other threads modifying the pte. Usually a READ_ONCE()
>> +	 * would suffice, but for the contpte case, we also need to gather the
>> +	 * access and dirty bits from across all ptes in the contiguous block,
>> +	 * and we can't read all of those neighbouring ptes atomically, so any
>> +	 * contiguous range may be unfolded/modified/refolded under our feet.
>> +	 * Therefore we ensure we read a _consistent_ contpte range by checking
>> +	 * that all ptes in the range are valid and have CONT_PTE set, that all
>> +	 * pfns are contiguous and that all pgprots are the same (ignoring
>> +	 * access/dirty). If we find a pte that is not consistent, then we must
>> +	 * be racing with an update so start again. If the target pte does not
>> +	 * have CONT_PTE set then that is considered consistent on its own
>> +	 * because it is not part of a contpte range.
>>  	 */
> 
> I haven't had the time to properly think about this function but,
> depending on what its semantics are, we might not guarantee that, at the
> time of reading a pte, we have the correct dirty state from the other
> ptes in the range.
> 
> Theoretical: let's say we read the first pte in the contig range and
> it's clean but further down there's a dirty one. Another (v)CPU breaks
> the contig range, sets the dirty bit everywhere, there's some
> pte_mkclean for all of them and they are collapsed into a contig range
> again. The function above on the first (v)CPU returns a clean pte when
> it should have actually been dirty at the time of read.

But I think that still conforms to semantics of the function. If you had the
same situation with a non-contpte mapping, the first thread may read the PTE at
any time; when it's dirty, or after its been cleaned by the other thread. It's
inherrently racy.

All that matters is that what is returned is _consistent_; you don't want to be
in a position where the first read of the block is mapping one folio, then by
the time all the access/dirty bits are read, the mapping is actually to a
different folio - that's an example of being inconsistent.

> 
> Throughout the callers of this function, I couldn't find one where it
> matters. So I concluded that they don't need the dirty state. Normally
> the dirty state is passed to the page flags, so not lost after the pte
> has been cleaned.

I agree we can simplify the semantics. But I think its better done in a separate
series (which I previously linked).

What's the bottom line here? Are you ok with this comment as a short term
solution for now, or do you want something more radical (i.e. push to get the
series that does these simplifications reviewed and in time for v6.9).

I still believe the current ptep_get_lockless() implementation is correct. So
given I have a plan to simplify in the long run, I hope we can still get this
series into v6.9 as planned.

Thanks,
Ryan
Catalin Marinas March 4, 2024, 5:37 p.m. UTC | #7
On Mon, Mar 04, 2024 at 12:54:23PM +0000, Ryan Roberts wrote:
> On 01/03/2024 18:47, Catalin Marinas wrote:
> > On Mon, Feb 26, 2024 at 12:03:21PM +0000, Ryan Roberts wrote:
> >> Make clear the atmicity/consistency requirements of the API and how we
> >> achieve them.
> >>
> >> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
> >> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> >> ---
> >>  arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
> >>  1 file changed, 14 insertions(+), 10 deletions(-)
[...]
> > Throughout the callers of this function, I couldn't find one where it
> > matters. So I concluded that they don't need the dirty state. Normally
> > the dirty state is passed to the page flags, so not lost after the pte
> > has been cleaned.
> 
> I agree we can simplify the semantics. But I think its better done in a separate
> series (which I previously linked).
> 
> What's the bottom line here? Are you ok with this comment as a short term
> solution for now, or do you want something more radical (i.e. push to get the
> series that does these simplifications reviewed and in time for v6.9).
> 
> I still believe the current ptep_get_lockless() implementation is correct. So
> given I have a plan to simplify in the long run, I hope we can still get this
> series into v6.9 as planned.

Yes, I'm fine with this patch. Assuming Andrew picked them up:

Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

I'd like to get the simplification in as well at some point as I think
our ptep_get_lockless() is unnecessarily complex for most use-cases.
Ryan Roberts March 4, 2024, 6:40 p.m. UTC | #8
On 04/03/2024 17:37, Catalin Marinas wrote:
> On Mon, Mar 04, 2024 at 12:54:23PM +0000, Ryan Roberts wrote:
>> On 01/03/2024 18:47, Catalin Marinas wrote:
>>> On Mon, Feb 26, 2024 at 12:03:21PM +0000, Ryan Roberts wrote:
>>>> Make clear the atmicity/consistency requirements of the API and how we
>>>> achieve them.
>>>>
>>>> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>> ---
>>>>  arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>>>>  1 file changed, 14 insertions(+), 10 deletions(-)
> [...]
>>> Throughout the callers of this function, I couldn't find one where it
>>> matters. So I concluded that they don't need the dirty state. Normally
>>> the dirty state is passed to the page flags, so not lost after the pte
>>> has been cleaned.
>>
>> I agree we can simplify the semantics. But I think its better done in a separate
>> series (which I previously linked).
>>
>> What's the bottom line here? Are you ok with this comment as a short term
>> solution for now, or do you want something more radical (i.e. push to get the
>> series that does these simplifications reviewed and in time for v6.9).
>>
>> I still believe the current ptep_get_lockless() implementation is correct. So
>> given I have a plan to simplify in the long run, I hope we can still get this
>> series into v6.9 as planned.
> 
> Yes, I'm fine with this patch. Assuming Andrew picked them up:
> 
> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>

Thanks! Yes, he did - they are in mm-unstable.

> 
> I'd like to get the simplification in as well at some point as I think
> our ptep_get_lockless() is unnecessarily complex for most use-cases.

Yes, I'll keep pushing it. I know DavidH is keen for it.
David Hildenbrand March 4, 2024, 10:04 p.m. UTC | #9
On 04.03.24 19:40, Ryan Roberts wrote:
> On 04/03/2024 17:37, Catalin Marinas wrote:
>> On Mon, Mar 04, 2024 at 12:54:23PM +0000, Ryan Roberts wrote:
>>> On 01/03/2024 18:47, Catalin Marinas wrote:
>>>> On Mon, Feb 26, 2024 at 12:03:21PM +0000, Ryan Roberts wrote:
>>>>> Make clear the atmicity/consistency requirements of the API and how we
>>>>> achieve them.
>>>>>
>>>>> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>> ---
>>>>>   arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>>>>>   1 file changed, 14 insertions(+), 10 deletions(-)
>> [...]
>>>> Throughout the callers of this function, I couldn't find one where it
>>>> matters. So I concluded that they don't need the dirty state. Normally
>>>> the dirty state is passed to the page flags, so not lost after the pte
>>>> has been cleaned.
>>>
>>> I agree we can simplify the semantics. But I think its better done in a separate
>>> series (which I previously linked).
>>>
>>> What's the bottom line here? Are you ok with this comment as a short term
>>> solution for now, or do you want something more radical (i.e. push to get the
>>> series that does these simplifications reviewed and in time for v6.9).
>>>
>>> I still believe the current ptep_get_lockless() implementation is correct. So
>>> given I have a plan to simplify in the long run, I hope we can still get this
>>> series into v6.9 as planned.
>>
>> Yes, I'm fine with this patch. Assuming Andrew picked them up:
>>
>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
> 
> Thanks! Yes, he did - they are in mm-unstable.
> 
>>
>> I'd like to get the simplification in as well at some point as I think
>> our ptep_get_lockless() is unnecessarily complex for most use-cases.
> 
> Yes, I'll keep pushing it. I know DavidH is keen for it.

Maybe just sent a v1 (after the merge window?) if there is no further 
feedback. I still want to look into the details, but it's stuck deep in 
my inbox I'm afraid :)
Ryan Roberts March 5, 2024, 9:13 a.m. UTC | #10
On 04/03/2024 22:04, David Hildenbrand wrote:
> On 04.03.24 19:40, Ryan Roberts wrote:
>> On 04/03/2024 17:37, Catalin Marinas wrote:
>>> On Mon, Mar 04, 2024 at 12:54:23PM +0000, Ryan Roberts wrote:
>>>> On 01/03/2024 18:47, Catalin Marinas wrote:
>>>>> On Mon, Feb 26, 2024 at 12:03:21PM +0000, Ryan Roberts wrote:
>>>>>> Make clear the atmicity/consistency requirements of the API and how we
>>>>>> achieve them.
>>>>>>
>>>>>> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> ---
>>>>>>   arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>>>>>>   1 file changed, 14 insertions(+), 10 deletions(-)
>>> [...]
>>>>> Throughout the callers of this function, I couldn't find one where it
>>>>> matters. So I concluded that they don't need the dirty state. Normally
>>>>> the dirty state is passed to the page flags, so not lost after the pte
>>>>> has been cleaned.
>>>>
>>>> I agree we can simplify the semantics. But I think its better done in a
>>>> separate
>>>> series (which I previously linked).
>>>>
>>>> What's the bottom line here? Are you ok with this comment as a short term
>>>> solution for now, or do you want something more radical (i.e. push to get the
>>>> series that does these simplifications reviewed and in time for v6.9).
>>>>
>>>> I still believe the current ptep_get_lockless() implementation is correct. So
>>>> given I have a plan to simplify in the long run, I hope we can still get this
>>>> series into v6.9 as planned.
>>>
>>> Yes, I'm fine with this patch. Assuming Andrew picked them up:
>>>
>>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>>
>> Thanks! Yes, he did - they are in mm-unstable.
>>
>>>
>>> I'd like to get the simplification in as well at some point as I think
>>> our ptep_get_lockless() is unnecessarily complex for most use-cases.
>>
>> Yes, I'll keep pushing it. I know DavidH is keen for it.
> 
> Maybe just sent a v1 (after the merge window?) if there is no further feedback.
> I still want to look into the details, but it's stuck deep in my inbox I'm
> afraid :)

Yep will do! I will also add the final patch to actually remove
ptep_get_lockless() since it is no longer used by the end of the series.
Ryan Roberts March 5, 2024, 9:14 a.m. UTC | #11
On 04/03/2024 22:04, David Hildenbrand wrote:
> On 04.03.24 19:40, Ryan Roberts wrote:
>> On 04/03/2024 17:37, Catalin Marinas wrote:
>>> On Mon, Mar 04, 2024 at 12:54:23PM +0000, Ryan Roberts wrote:
>>>> On 01/03/2024 18:47, Catalin Marinas wrote:
>>>>> On Mon, Feb 26, 2024 at 12:03:21PM +0000, Ryan Roberts wrote:
>>>>>> Make clear the atmicity/consistency requirements of the API and how we
>>>>>> achieve them.
>>>>>>
>>>>>> Link: https://lore.kernel.org/linux-mm/Zc-Tqqfksho3BHmU@arm.com/
>>>>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>>>>> ---
>>>>>>   arch/arm64/mm/contpte.c | 24 ++++++++++++++----------
>>>>>>   1 file changed, 14 insertions(+), 10 deletions(-)
>>> [...]
>>>>> Throughout the callers of this function, I couldn't find one where it
>>>>> matters. So I concluded that they don't need the dirty state. Normally
>>>>> the dirty state is passed to the page flags, so not lost after the pte
>>>>> has been cleaned.
>>>>
>>>> I agree we can simplify the semantics. But I think its better done in a
>>>> separate
>>>> series (which I previously linked).
>>>>
>>>> What's the bottom line here? Are you ok with this comment as a short term
>>>> solution for now, or do you want something more radical (i.e. push to get the
>>>> series that does these simplifications reviewed and in time for v6.9).
>>>>
>>>> I still believe the current ptep_get_lockless() implementation is correct. So
>>>> given I have a plan to simplify in the long run, I hope we can still get this
>>>> series into v6.9 as planned.
>>>
>>> Yes, I'm fine with this patch. Assuming Andrew picked them up:
>>>
>>> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com>
>>
>> Thanks! Yes, he did - they are in mm-unstable.
>>
>>>
>>> I'd like to get the simplification in as well at some point as I think
>>> our ptep_get_lockless() is unnecessarily complex for most use-cases.
>>
>> Yes, I'll keep pushing it. I know DavidH is keen for it.
> 
> Maybe just sent a v1 (after the merge window?) if there is no further feedback.
> I still want to look into the details, but it's stuck deep in my inbox I'm
> afraid :)

Yep will do! I will also add the final patch to actually remove
ptep_get_lockless() since it is no longer used by the end of the series.
diff mbox series

Patch

diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
index be0a226c4ff9..1b64b4c3f8bf 100644
--- a/arch/arm64/mm/contpte.c
+++ b/arch/arm64/mm/contpte.c
@@ -183,16 +183,20 @@  EXPORT_SYMBOL_GPL(contpte_ptep_get);
 pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
 {
 	/*
-	 * Gather access/dirty bits, which may be populated in any of the ptes
-	 * of the contig range. We may not be holding the PTL, so any contiguous
-	 * range may be unfolded/modified/refolded under our feet. Therefore we
-	 * ensure we read a _consistent_ contpte range by checking that all ptes
-	 * in the range are valid and have CONT_PTE set, that all pfns are
-	 * contiguous and that all pgprots are the same (ignoring access/dirty).
-	 * If we find a pte that is not consistent, then we must be racing with
-	 * an update so start again. If the target pte does not have CONT_PTE
-	 * set then that is considered consistent on its own because it is not
-	 * part of a contpte range.
+	 * The ptep_get_lockless() API requires us to read and return *orig_ptep
+	 * so that it is self-consistent, without the PTL held, so we may be
+	 * racing with other threads modifying the pte. Usually a READ_ONCE()
+	 * would suffice, but for the contpte case, we also need to gather the
+	 * access and dirty bits from across all ptes in the contiguous block,
+	 * and we can't read all of those neighbouring ptes atomically, so any
+	 * contiguous range may be unfolded/modified/refolded under our feet.
+	 * Therefore we ensure we read a _consistent_ contpte range by checking
+	 * that all ptes in the range are valid and have CONT_PTE set, that all
+	 * pfns are contiguous and that all pgprots are the same (ignoring
+	 * access/dirty). If we find a pte that is not consistent, then we must
+	 * be racing with an update so start again. If the target pte does not
+	 * have CONT_PTE set then that is considered consistent on its own
+	 * because it is not part of a contpte range.
 	 */
 
 	pgprot_t orig_prot;