[RFC,v1,0/4] Reduce cost of ptep_get_lockless on arm64

Message ID	20240215121756.2734131-1-ryan.roberts@arm.com (mailing list archive)
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Ryan Roberts <ryan.roberts@arm.com> To: David Hildenbrand <david@redhat.com>, Mark Rutland <mark.rutland@arm.com>, Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Alexander Shishkin <alexander.shishkin@linux.intel.com>, Jiri Olsa <jolsa@kernel.org>, Ian Rogers <irogers@google.com>, Adrian Hunter <adrian.hunter@intel.com>, Andrew Morton <akpm@linux-foundation.org>, Muchun Song <muchun.song@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [RFC PATCH v1 0/4] Reduce cost of ptep_get_lockless on arm64 Date: Thu, 15 Feb 2024 12:17:52 +0000 Message-Id: <20240215121756.2734131-1-ryan.roberts@arm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Reduce cost of ptep_get_lockless on arm64 \| expand [RFC,v1,0/4] Reduce cost of ptep_get_lockless on arm64 [RFC,v1,1/4] mm: Introduce ptep_get_lockless_norecency() [RFC,v1,2/4] mm/gup: Use ptep_get_lockless_norecency() [RFC,v1,3/4] mm/memory: Use ptep_get_lockless_norecency() for orig_pte [RFC,v1,4/4] arm64/mm: Override ptep_get_lockless_norecency()

Ryan Roberts Feb. 15, 2024, 12:17 p.m. UTC

This is an RFC for a series that aims to reduce the cost and complexity of
ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1].
The approach came from discussion with Mark and David [2].

It introduces a new helper, ptep_get_lockless_norecency(), which allows the
access and dirty bits in the returned pte to be incorrect. This relaxation
permits arm64's implementation to just read the single target pte, and avoids
having to iterate over the full contpte block to gather the access and dirty
bits, for the contpte case.

It turns out that none of the call sites using ptep_get_lockless() require
accurate access and dirty bit information, so we can also convert those sites.
Although a couple of places need care (see patches 2 and 3).

Arguably patch 3 is a bit fragile, given the wide accessibility of
vmf->orig_pte. So it might make sense to drop this patch and stick to using
ptep_get_lockless() in the page fault path. I'm keen to hear opinions.

I've chosen the name "recency" because it's shortish and somewhat descriptive,
and is alredy used in a couple of places to mean similar things (see mglru and
damon). I'm open to other names if anyone has better ideas.

If concensus is that this approach is generally acceptable, I intend to create a
series in future to do a similar thing with ptep_get() -> ptep_get_norecency().

---
This series applies on top of [1].

[1] https://lore.kernel.org/linux-mm/20240215103205.2607016-1-ryan.roberts@arm.com/
[2] https://lore.kernel.org/linux-mm/a91cfe1c-289e-4828-8cfc-be34eb69a71b@redhat.com/

Thanks,
Ryan

Ryan Roberts (4):
  mm: Introduce ptep_get_lockless_norecency()
  mm/gup: Use ptep_get_lockless_norecency()
  mm/memory: Use ptep_get_lockless_norecency() for orig_pte
  arm64/mm: Override ptep_get_lockless_norecency()

 arch/arm64/include/asm/pgtable.h |  6 ++++
 include/linux/pgtable.h          | 55 ++++++++++++++++++++++++++++--
 kernel/events/core.c             |  2 +-
 mm/gup.c                         |  7 ++--
 mm/hugetlb.c                     |  2 +-
 mm/khugepaged.c                  |  2 +-
 mm/memory.c                      | 57 ++++++++++++++++++++------------
 mm/swap_state.c                  |  2 +-
 mm/swapfile.c                    |  2 +-
 9 files changed, 102 insertions(+), 33 deletions(-)

--
2.25.1

David Hildenbrand March 26, 2024, 4:17 p.m. UTC | #1

On 15.02.24 13:17, Ryan Roberts wrote:
> This is an RFC for a series that aims to reduce the cost and complexity of
> ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1].
> The approach came from discussion with Mark and David [2].
> 
> It introduces a new helper, ptep_get_lockless_norecency(), which allows the
> access and dirty bits in the returned pte to be incorrect. This relaxation
> permits arm64's implementation to just read the single target pte, and avoids
> having to iterate over the full contpte block to gather the access and dirty
> bits, for the contpte case.
> 
> It turns out that none of the call sites using ptep_get_lockless() require
> accurate access and dirty bit information, so we can also convert those sites.
> Although a couple of places need care (see patches 2 and 3).
> 
> Arguably patch 3 is a bit fragile, given the wide accessibility of
> vmf->orig_pte. So it might make sense to drop this patch and stick to using
> ptep_get_lockless() in the page fault path. I'm keen to hear opinions.

Yes. Especially as we have these pte_same() checks that might just fail 
now because of wrong accessed/dirty bits?

Likely, we just want to read "the real deal" on both sides of the 
pte_same() handling.

> 
> I've chosen the name "recency" because it's shortish and somewhat descriptive,
> and is alredy used in a couple of places to mean similar things (see mglru and
> damon). I'm open to other names if anyone has better ideas.

Not a native speaker; works for me.

> 
> If concensus is that this approach is generally acceptable, I intend to create a
> series in future to do a similar thing with ptep_get() -> ptep_get_norecency().

Yes, sounds good to me.

Ryan Roberts March 26, 2024, 4:31 p.m. UTC | #2

On 26/03/2024 16:17, David Hildenbrand wrote:
> On 15.02.24 13:17, Ryan Roberts wrote:
>> This is an RFC for a series that aims to reduce the cost and complexity of
>> ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1].
>> The approach came from discussion with Mark and David [2].
>>
>> It introduces a new helper, ptep_get_lockless_norecency(), which allows the
>> access and dirty bits in the returned pte to be incorrect. This relaxation
>> permits arm64's implementation to just read the single target pte, and avoids
>> having to iterate over the full contpte block to gather the access and dirty
>> bits, for the contpte case.
>>
>> It turns out that none of the call sites using ptep_get_lockless() require
>> accurate access and dirty bit information, so we can also convert those sites.
>> Although a couple of places need care (see patches 2 and 3).
>>
>> Arguably patch 3 is a bit fragile, given the wide accessibility of
>> vmf->orig_pte. So it might make sense to drop this patch and stick to using
>> ptep_get_lockless() in the page fault path. I'm keen to hear opinions.
> 
> Yes. Especially as we have these pte_same() checks that might just fail now
> because of wrong accessed/dirty bits?

Which pte_same() checks are you referring to? I've changed them all to
pte_same_norecency() which ignores the access/dirty bits when doing the comparison.

> 
> Likely, we just want to read "the real deal" on both sides of the pte_same()
> handling.

Sorry I'm not sure I understand? You mean read the full pte including
access/dirty? That's the same as dropping the patch, right? Of course if we do
that, we still have to keep pte_get_lockless() around for this case. In an ideal
world we would convert everything over to ptep_get_lockless_norecency() and
delete ptep_get_lockless() to remove the ugliness from arm64.

> 
>>
>> I've chosen the name "recency" because it's shortish and somewhat descriptive,
>> and is alredy used in a couple of places to mean similar things (see mglru and
>> damon). I'm open to other names if anyone has better ideas.
> 
> Not a native speaker; works for me.
> 
>>
>> If concensus is that this approach is generally acceptable, I intend to create a
>> series in future to do a similar thing with ptep_get() -> ptep_get_norecency().
> 
> Yes, sounds good to me.
>

Ryan Roberts March 26, 2024, 4:53 p.m. UTC | #3

On 26/03/2024 16:34, David Hildenbrand wrote:
> On 26.03.24 17:31, Ryan Roberts wrote:
>> On 26/03/2024 16:17, David Hildenbrand wrote:
>>> On 15.02.24 13:17, Ryan Roberts wrote:
>>>> This is an RFC for a series that aims to reduce the cost and complexity of
>>>> ptep_get_lockless() for arm64 when supporting transparent contpte mappings [1].
>>>> The approach came from discussion with Mark and David [2].
>>>>
>>>> It introduces a new helper, ptep_get_lockless_norecency(), which allows the
>>>> access and dirty bits in the returned pte to be incorrect. This relaxation
>>>> permits arm64's implementation to just read the single target pte, and avoids
>>>> having to iterate over the full contpte block to gather the access and dirty
>>>> bits, for the contpte case.
>>>>
>>>> It turns out that none of the call sites using ptep_get_lockless() require
>>>> accurate access and dirty bit information, so we can also convert those sites.
>>>> Although a couple of places need care (see patches 2 and 3).
>>>>
>>>> Arguably patch 3 is a bit fragile, given the wide accessibility of
>>>> vmf->orig_pte. So it might make sense to drop this patch and stick to using
>>>> ptep_get_lockless() in the page fault path. I'm keen to hear opinions.
>>>
>>> Yes. Especially as we have these pte_same() checks that might just fail now
>>> because of wrong accessed/dirty bits?
>>
>> Which pte_same() checks are you referring to? I've changed them all to
>> pte_same_norecency() which ignores the access/dirty bits when doing the
>> comparison.
> 
> I'm reading the patches just now. So I stumbled over that just after I wrote
> that, so I was missing that part from the description here.
> 
>>
>>>
>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>> handling.
>>
>> Sorry I'm not sure I understand? You mean read the full pte including
>> access/dirty? That's the same as dropping the patch, right? Of course if we do
>> that, we still have to keep pte_get_lockless() around for this case. In an ideal
>> world we would convert everything over to ptep_get_lockless_norecency() and
>> delete ptep_get_lockless() to remove the ugliness from arm64.
> 
> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any
> architecture.
> 
> I do wonder if pte_same_norecency() should be defined per architecture and the
> default would be pte_same(). So we could avoid the mkold etc on all other
> architectures.

Wouldn't that break it's semantics? The "norecency" of
ptep_get_lockless_norecency() means "recency information in the returned pte may
be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
access and dirty bits when you do the comparison".

I think you could only do the optimization you describe if you required that
pte_same_norecency() would only be given values returned by
ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, its not
quite the same; if a page is accessed between gets one will return true and the
other false.

David Hildenbrand March 26, 2024, 5:04 p.m. UTC | #4

>>>>
>>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>>> handling.
>>>
>>> Sorry I'm not sure I understand? You mean read the full pte including
>>> access/dirty? That's the same as dropping the patch, right? Of course if we do
>>> that, we still have to keep pte_get_lockless() around for this case. In an ideal
>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>
>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any
>> architecture.
>>
>> I do wonder if pte_same_norecency() should be defined per architecture and the
>> default would be pte_same(). So we could avoid the mkold etc on all other
>> architectures.
> 
> Wouldn't that break it's semantics? The "norecency" of
> ptep_get_lockless_norecency() means "recency information in the returned pte may
> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
> access and dirty bits when you do the comparison".

My idea was that ptep_get_lockless_norecency() would return the actual 
result on these architectures. So e.g., on x86, there would be no actual 
change in generated code.

But yes, the documentation of these functions would have to be improved.

Now I wonder if ptep_get_lockless_norecency() should actively clear 
dirty/accessed bits to more easily find any actual issues where the bits 
still matter ...

> 
> I think you could only do the optimization you describe if you required that
> pte_same_norecency() would only be given values returned by
> ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, its not
> quite the same; if a page is accessed between gets one will return true and the
> other false.

Right.

Ryan Roberts March 26, 2024, 5:32 p.m. UTC | #5

On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>
>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>>>> handling.
>>>>
>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>> access/dirty? That's the same as dropping the patch, right? Of course if we do
>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>> ideal
>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>
>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any
>>> architecture.
>>>
>>> I do wonder if pte_same_norecency() should be defined per architecture and the
>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>> architectures.
>>
>> Wouldn't that break it's semantics? The "norecency" of
>> ptep_get_lockless_norecency() means "recency information in the returned pte may
>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>> access and dirty bits when you do the comparison".
> 
> My idea was that ptep_get_lockless_norecency() would return the actual result on
> these architectures. So e.g., on x86, there would be no actual change in
> generated code.

I think this is a bad plan... You'll end up with subtle differences between
architectures.

> 
> But yes, the documentation of these functions would have to be improved.
> 
> Now I wonder if ptep_get_lockless_norecency() should actively clear
> dirty/accessed bits to more easily find any actual issues where the bits still
> matter ...

I did a version that took that approach. Decided it was not as good as this way
though. Now for the life of me, I can't remember my reasoning.

> 
>>
>> I think you could only do the optimization you describe if you required that
>> pte_same_norecency() would only be given values returned by
>> ptep_get_lockless_norecency() (or ptep_get_norecency()). Even then, its not
>> quite the same; if a page is accessed between gets one will return true and the
>> other false.
> 
> Right.
>

David Hildenbrand March 26, 2024, 5:39 p.m. UTC | #6

On 26.03.24 18:32, Ryan Roberts wrote:
> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>
>>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>>>>> handling.
>>>>>
>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>> access/dirty? That's the same as dropping the patch, right? Of course if we do
>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>> ideal
>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>
>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect any
>>>> architecture.
>>>>
>>>> I do wonder if pte_same_norecency() should be defined per architecture and the
>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>> architectures.
>>>
>>> Wouldn't that break it's semantics? The "norecency" of
>>> ptep_get_lockless_norecency() means "recency information in the returned pte may
>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>> access and dirty bits when you do the comparison".
>>
>> My idea was that ptep_get_lockless_norecency() would return the actual result on
>> these architectures. So e.g., on x86, there would be no actual change in
>> generated code.
> 
> I think this is a bad plan... You'll end up with subtle differences between
> architectures.
> 
>>
>> But yes, the documentation of these functions would have to be improved.
>>
>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>> dirty/accessed bits to more easily find any actual issues where the bits still
>> matter ...
> 
> I did a version that took that approach. Decided it was not as good as this way
> though. Now for the life of me, I can't remember my reasoning.

Maybe because there are some code paths that check accessed/dirty 
without "correctness" implications? For example, if the PTE is already 
dirty, no need to set it dirty etc?

Ryan Roberts March 26, 2024, 5:51 p.m. UTC | #7

On 26/03/2024 17:39, David Hildenbrand wrote:
> On 26.03.24 18:32, Ryan Roberts wrote:
>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>
>>>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>>>>>> handling.
>>>>>>
>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>> we do
>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>>> ideal
>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>
>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect
>>>>> any
>>>>> architecture.
>>>>>
>>>>> I do wonder if pte_same_norecency() should be defined per architecture and the
>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>> architectures.
>>>>
>>>> Wouldn't that break it's semantics? The "norecency" of
>>>> ptep_get_lockless_norecency() means "recency information in the returned pte
>>>> may
>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>>> access and dirty bits when you do the comparison".
>>>
>>> My idea was that ptep_get_lockless_norecency() would return the actual result on
>>> these architectures. So e.g., on x86, there would be no actual change in
>>> generated code.
>>
>> I think this is a bad plan... You'll end up with subtle differences between
>> architectures.
>>
>>>
>>> But yes, the documentation of these functions would have to be improved.
>>>
>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>> dirty/accessed bits to more easily find any actual issues where the bits still
>>> matter ...
>>
>> I did a version that took that approach. Decided it was not as good as this way
>> though. Now for the life of me, I can't remember my reasoning.
> 
> Maybe because there are some code paths that check accessed/dirty without
> "correctness" implications? For example, if the PTE is already dirty, no need to
> set it dirty etc?

I think I decided I was penalizing the architectures that don't care because all
their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
that I could feed its result into pte_same().

David Hildenbrand March 27, 2024, 9:34 a.m. UTC | #8

On 26.03.24 18:51, Ryan Roberts wrote:
> On 26/03/2024 17:39, David Hildenbrand wrote:
>> On 26.03.24 18:32, Ryan Roberts wrote:
>>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> Likely, we just want to read "the real deal" on both sides of the pte_same()
>>>>>>>> handling.
>>>>>>>
>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>>> we do
>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>>>> ideal
>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>>
>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect
>>>>>> any
>>>>>> architecture.
>>>>>>
>>>>>> I do wonder if pte_same_norecency() should be defined per architecture and the
>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>>> architectures.
>>>>>
>>>>> Wouldn't that break it's semantics? The "norecency" of
>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte
>>>>> may
>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>>>> access and dirty bits when you do the comparison".
>>>>
>>>> My idea was that ptep_get_lockless_norecency() would return the actual result on
>>>> these architectures. So e.g., on x86, there would be no actual change in
>>>> generated code.
>>>
>>> I think this is a bad plan... You'll end up with subtle differences between
>>> architectures.
>>>
>>>>
>>>> But yes, the documentation of these functions would have to be improved.
>>>>
>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>>> dirty/accessed bits to more easily find any actual issues where the bits still
>>>> matter ...
>>>
>>> I did a version that took that approach. Decided it was not as good as this way
>>> though. Now for the life of me, I can't remember my reasoning.
>>
>> Maybe because there are some code paths that check accessed/dirty without
>> "correctness" implications? For example, if the PTE is already dirty, no need to
>> set it dirty etc?
> 
> I think I decided I was penalizing the architectures that don't care because all
> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
> that I could feed its result into pte_same().

True. With ptep_get_norecency() you're also penalizing other 
architectures. Therefore my original thought about making the behavior 
arch-specific, but the arch has to make sure to get the combination of 
ptep_get_lockless_norecency()+ptep_same_norecency() is right.

So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it 
must make sure to also ignore them in ptep_same_norecency(), and must be 
able to handle access/dirty bit changes differently.

Maybe one could have one variant for "hw-managed access/dirty" vs. "sw 
managed accessed or dirty". Only the former would end up ignoring stuff 
here, the latter would not.

But again, just some random thoughts how this affects other 
architectures and how we could avoid it. The issue I describe in patch 
#3 would be gone if ptep_same_norecency() would just do a ptep_same() 
check on other architectures -- and would make it easier to sell :)

Ryan Roberts March 27, 2024, 10:01 a.m. UTC | #9

On 27/03/2024 09:34, David Hildenbrand wrote:
> On 26.03.24 18:51, Ryan Roberts wrote:
>> On 26/03/2024 17:39, David Hildenbrand wrote:
>>> On 26.03.24 18:32, Ryan Roberts wrote:
>>>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the
>>>>>>>>> pte_same()
>>>>>>>>> handling.
>>>>>>>>
>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>>>> we do
>>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>>>>> ideal
>>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>>>
>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect
>>>>>>> any
>>>>>>> architecture.
>>>>>>>
>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture
>>>>>>> and the
>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>>>> architectures.
>>>>>>
>>>>>> Wouldn't that break it's semantics? The "norecency" of
>>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte
>>>>>> may
>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>>>>> access and dirty bits when you do the comparison".
>>>>>
>>>>> My idea was that ptep_get_lockless_norecency() would return the actual
>>>>> result on
>>>>> these architectures. So e.g., on x86, there would be no actual change in
>>>>> generated code.
>>>>
>>>> I think this is a bad plan... You'll end up with subtle differences between
>>>> architectures.
>>>>
>>>>>
>>>>> But yes, the documentation of these functions would have to be improved.
>>>>>
>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>>>> dirty/accessed bits to more easily find any actual issues where the bits still
>>>>> matter ...
>>>>
>>>> I did a version that took that approach. Decided it was not as good as this way
>>>> though. Now for the life of me, I can't remember my reasoning.
>>>
>>> Maybe because there are some code paths that check accessed/dirty without
>>> "correctness" implications? For example, if the PTE is already dirty, no need to
>>> set it dirty etc?
>>
>> I think I decided I was penalizing the architectures that don't care because all
>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
>> that I could feed its result into pte_same().
> 
> True. With ptep_get_norecency() you're also penalizing other architectures.
> Therefore my original thought about making the behavior arch-specific, but the
> arch has to make sure to get the combination of
> ptep_get_lockless_norecency()+ptep_same_norecency() is right.
> 
> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must
> make sure to also ignore them in ptep_same_norecency(), and must be able to
> handle access/dirty bit changes differently.
> 
> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed
> accessed or dirty". Only the former would end up ignoring stuff here, the latter
> would not.
> 
> But again, just some random thoughts how this affects other architectures and
> how we could avoid it. The issue I describe in patch #3 would be gone if
> ptep_same_norecency() would just do a ptep_same() check on other architectures
> -- and would make it easier to sell :)

Perhaps - let me chew on that for a bit. It doesn't feel as easy as you suggest
to me. But I can't put my finger on why...

Ryan Roberts April 3, 2024, 12:59 p.m. UTC | #10

On 27/03/2024 09:34, David Hildenbrand wrote:
> On 26.03.24 18:51, Ryan Roberts wrote:
>> On 26/03/2024 17:39, David Hildenbrand wrote:
>>> On 26.03.24 18:32, Ryan Roberts wrote:
>>>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the
>>>>>>>>> pte_same()
>>>>>>>>> handling.
>>>>>>>>
>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>>>> we do
>>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>>>>> ideal
>>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>>>
>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect
>>>>>>> any
>>>>>>> architecture.
>>>>>>>
>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture
>>>>>>> and the
>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>>>> architectures.
>>>>>>
>>>>>> Wouldn't that break it's semantics? The "norecency" of
>>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte
>>>>>> may
>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>>>>> access and dirty bits when you do the comparison".
>>>>>
>>>>> My idea was that ptep_get_lockless_norecency() would return the actual
>>>>> result on
>>>>> these architectures. So e.g., on x86, there would be no actual change in
>>>>> generated code.
>>>>
>>>> I think this is a bad plan... You'll end up with subtle differences between
>>>> architectures.
>>>>
>>>>>
>>>>> But yes, the documentation of these functions would have to be improved.
>>>>>
>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>>>> dirty/accessed bits to more easily find any actual issues where the bits still
>>>>> matter ...
>>>>
>>>> I did a version that took that approach. Decided it was not as good as this way
>>>> though. Now for the life of me, I can't remember my reasoning.
>>>
>>> Maybe because there are some code paths that check accessed/dirty without
>>> "correctness" implications? For example, if the PTE is already dirty, no need to
>>> set it dirty etc?
>>
>> I think I decided I was penalizing the architectures that don't care because all
>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
>> that I could feed its result into pte_same().
> 
> True. With ptep_get_norecency() you're also penalizing other architectures.
> Therefore my original thought about making the behavior arch-specific, but the
> arch has to make sure to get the combination of
> ptep_get_lockless_norecency()+ptep_same_norecency() is right.
> 
> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must
> make sure to also ignore them in ptep_same_norecency(), and must be able to
> handle access/dirty bit changes differently.
> 
> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed
> accessed or dirty". Only the former would end up ignoring stuff here, the latter
> would not.
> 
> But again, just some random thoughts how this affects other architectures and
> how we could avoid it. The issue I describe in patch #3 would be gone if
> ptep_same_norecency() would just do a ptep_same() check on other architectures
> -- and would make it easier to sell :)
> 

I've been thinking some more about this. I think your proposal is the following:


// ARM64
ptep_get_lockless_norecency()
{
	- returned access/dirty may be incorrect
	- returned access/dirty may be differently incorrect between 2 calls
}
pte_same_norecency()
{
	- ignore access/dirty when doing comparison
}
ptep_set_access_flags(ptep, pte)
{
	- must not assume access/dirty in pte are "more permissive" than
	  access/dirty in *ptep
	- must only set access/dirty in *ptep, never clear
}


// Other arches: no change to generated code
ptep_get_lockless_norecency()
{
	return ptep_get_lockless();
}
pte_same_norecency()
{
	return pte_same();
}
ptep_set_access_flags(ptep, pte)
{
	- may assume access/dirty in pte are "more permissive" than access/dirty
	  in *ptep
	- if no HW access/dirty updates, "*ptep = pte" always results in "more
	  permissive" change
}

An arch either specializes all 3 or none of them.

This would allow us to get rid of ptep_get_lockless().

And it addresses the bug you found with ptep_set_access_flags().


BUT, I still have a nagging feeling that there are likely to be other similar
problems caused by ignoring access/dirty during pte_same_norecency(). I can't
convince myself that its definitely all safe and robust.

So I'm leaning towards dropping patch 3 and therefore keeping
ptep_get_lockless() around.

Let me know if you have any insight that might help me change my mind :)

Thanks,
Ryan

David Hildenbrand April 8, 2024, 8:36 a.m. UTC | #11

On 03.04.24 14:59, Ryan Roberts wrote:
> On 27/03/2024 09:34, David Hildenbrand wrote:
>> On 26.03.24 18:51, Ryan Roberts wrote:
>>> On 26/03/2024 17:39, David Hildenbrand wrote:
>>>> On 26.03.24 18:32, Ryan Roberts wrote:
>>>>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the
>>>>>>>>>> pte_same()
>>>>>>>>>> handling.
>>>>>>>>>
>>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>>>>> we do
>>>>>>>>> that, we still have to keep pte_get_lockless() around for this case. In an
>>>>>>>>> ideal
>>>>>>>>> world we would convert everything over to ptep_get_lockless_norecency() and
>>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>>>>
>>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really affect
>>>>>>>> any
>>>>>>>> architecture.
>>>>>>>>
>>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture
>>>>>>>> and the
>>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>>>>> architectures.
>>>>>>>
>>>>>>> Wouldn't that break it's semantics? The "norecency" of
>>>>>>> ptep_get_lockless_norecency() means "recency information in the returned pte
>>>>>>> may
>>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore the
>>>>>>> access and dirty bits when you do the comparison".
>>>>>>
>>>>>> My idea was that ptep_get_lockless_norecency() would return the actual
>>>>>> result on
>>>>>> these architectures. So e.g., on x86, there would be no actual change in
>>>>>> generated code.
>>>>>
>>>>> I think this is a bad plan... You'll end up with subtle differences between
>>>>> architectures.
>>>>>
>>>>>>
>>>>>> But yes, the documentation of these functions would have to be improved.
>>>>>>
>>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>>>>> dirty/accessed bits to more easily find any actual issues where the bits still
>>>>>> matter ...
>>>>>
>>>>> I did a version that took that approach. Decided it was not as good as this way
>>>>> though. Now for the life of me, I can't remember my reasoning.
>>>>
>>>> Maybe because there are some code paths that check accessed/dirty without
>>>> "correctness" implications? For example, if the PTE is already dirty, no need to
>>>> set it dirty etc?
>>>
>>> I think I decided I was penalizing the architectures that don't care because all
>>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
>>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
>>> that I could feed its result into pte_same().
>>
>> True. With ptep_get_norecency() you're also penalizing other architectures.
>> Therefore my original thought about making the behavior arch-specific, but the
>> arch has to make sure to get the combination of
>> ptep_get_lockless_norecency()+ptep_same_norecency() is right.
>>
>> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must
>> make sure to also ignore them in ptep_same_norecency(), and must be able to
>> handle access/dirty bit changes differently.
>>
>> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed
>> accessed or dirty". Only the former would end up ignoring stuff here, the latter
>> would not.
>>
>> But again, just some random thoughts how this affects other architectures and
>> how we could avoid it. The issue I describe in patch #3 would be gone if
>> ptep_same_norecency() would just do a ptep_same() check on other architectures
>> -- and would make it easier to sell :)
>>
> 
> I've been thinking some more about this. I think your proposal is the following:
> 
> 
> // ARM64
> ptep_get_lockless_norecency()
> {
> 	- returned access/dirty may be incorrect
> 	- returned access/dirty may be differently incorrect between 2 calls
> }
> pte_same_norecency()
> {
> 	- ignore access/dirty when doing comparison
> }
> ptep_set_access_flags(ptep, pte)
> {
> 	- must not assume access/dirty in pte are "more permissive" than
> 	  access/dirty in *ptep
> 	- must only set access/dirty in *ptep, never clear
> }
> 
> 
> // Other arches: no change to generated code
> ptep_get_lockless_norecency()
> {
> 	return ptep_get_lockless();
> }
> pte_same_norecency()
> {
> 	return pte_same();
> }
> ptep_set_access_flags(ptep, pte)
> {
> 	- may assume access/dirty in pte are "more permissive" than access/dirty
> 	  in *ptep
> 	- if no HW access/dirty updates, "*ptep = pte" always results in "more
> 	  permissive" change
> }
> 
> An arch either specializes all 3 or none of them.
> 
> This would allow us to get rid of ptep_get_lockless().
> 
> And it addresses the bug you found with ptep_set_access_flags().
> 
> 
> BUT, I still have a nagging feeling that there are likely to be other similar
> problems caused by ignoring access/dirty during pte_same_norecency(). I can't
> convince myself that its definitely all safe and robust.

Right, we'd have to identify the other possible cases and document what 
an arch + common code must stick to to make it work.

Some rules would be: if an arch implements ptep_get_lockless_norecency():

(1) Passing the result from ptep_get_lockless_norecency() to pte_same()
     is wrong.
(2) Checking pte_young()/pte_old/pte_dirty()/pte_clean() after
     ptep_get_lockless_norecency() is very likely wrong.


> 
> So I'm leaning towards dropping patch 3 and therefore keeping
> ptep_get_lockless() around.
> 
> Let me know if you have any insight that might help me change my mind :)

I'm wondering if it would help if we could find a better name (or 
concept) for "norecency" here, that expresses that only on some archs 
we'd have that fuzzy handling.

Keeping ptep_get_lockless() around for now sounds like the best alternative.

Ryan Roberts April 9, 2024, 4:35 p.m. UTC | #12

On 08/04/2024 09:36, David Hildenbrand wrote:
> On 03.04.24 14:59, Ryan Roberts wrote:
>> On 27/03/2024 09:34, David Hildenbrand wrote:
>>> On 26.03.24 18:51, Ryan Roberts wrote:
>>>> On 26/03/2024 17:39, David Hildenbrand wrote:
>>>>> On 26.03.24 18:32, Ryan Roberts wrote:
>>>>>> On 26/03/2024 17:04, David Hildenbrand wrote:
>>>>>>>>>>>
>>>>>>>>>>> Likely, we just want to read "the real deal" on both sides of the
>>>>>>>>>>> pte_same()
>>>>>>>>>>> handling.
>>>>>>>>>>
>>>>>>>>>> Sorry I'm not sure I understand? You mean read the full pte including
>>>>>>>>>> access/dirty? That's the same as dropping the patch, right? Of course if
>>>>>>>>>> we do
>>>>>>>>>> that, we still have to keep pte_get_lockless() around for this case.
>>>>>>>>>> In an
>>>>>>>>>> ideal
>>>>>>>>>> world we would convert everything over to
>>>>>>>>>> ptep_get_lockless_norecency() and
>>>>>>>>>> delete ptep_get_lockless() to remove the ugliness from arm64.
>>>>>>>>>
>>>>>>>>> Yes, agreed. Patch #3 does not look too crazy and it wouldn't really
>>>>>>>>> affect
>>>>>>>>> any
>>>>>>>>> architecture.
>>>>>>>>>
>>>>>>>>> I do wonder if pte_same_norecency() should be defined per architecture
>>>>>>>>> and the
>>>>>>>>> default would be pte_same(). So we could avoid the mkold etc on all other
>>>>>>>>> architectures.
>>>>>>>>
>>>>>>>> Wouldn't that break it's semantics? The "norecency" of
>>>>>>>> ptep_get_lockless_norecency() means "recency information in the returned
>>>>>>>> pte
>>>>>>>> may
>>>>>>>> be incorrect". But the "norecency" of pte_same_norecency() means "ignore
>>>>>>>> the
>>>>>>>> access and dirty bits when you do the comparison".
>>>>>>>
>>>>>>> My idea was that ptep_get_lockless_norecency() would return the actual
>>>>>>> result on
>>>>>>> these architectures. So e.g., on x86, there would be no actual change in
>>>>>>> generated code.
>>>>>>
>>>>>> I think this is a bad plan... You'll end up with subtle differences between
>>>>>> architectures.
>>>>>>
>>>>>>>
>>>>>>> But yes, the documentation of these functions would have to be improved.
>>>>>>>
>>>>>>> Now I wonder if ptep_get_lockless_norecency() should actively clear
>>>>>>> dirty/accessed bits to more easily find any actual issues where the bits
>>>>>>> still
>>>>>>> matter ...
>>>>>>
>>>>>> I did a version that took that approach. Decided it was not as good as
>>>>>> this way
>>>>>> though. Now for the life of me, I can't remember my reasoning.
>>>>>
>>>>> Maybe because there are some code paths that check accessed/dirty without
>>>>> "correctness" implications? For example, if the PTE is already dirty, no
>>>>> need to
>>>>> set it dirty etc?
>>>>
>>>> I think I decided I was penalizing the architectures that don't care because
>>>> all
>>>> their ptep_get_norecency() and ptep_get_lockless_norecency() need to explicitly
>>>> clear access/dirty. And I would have needed ptep_get_norecency() from day 1 so
>>>> that I could feed its result into pte_same().
>>>
>>> True. With ptep_get_norecency() you're also penalizing other architectures.
>>> Therefore my original thought about making the behavior arch-specific, but the
>>> arch has to make sure to get the combination of
>>> ptep_get_lockless_norecency()+ptep_same_norecency() is right.
>>>
>>> So if an arch decide to ignore bits in ptep_get_lockless_norecency(), it must
>>> make sure to also ignore them in ptep_same_norecency(), and must be able to
>>> handle access/dirty bit changes differently.
>>>
>>> Maybe one could have one variant for "hw-managed access/dirty" vs. "sw managed
>>> accessed or dirty". Only the former would end up ignoring stuff here, the latter
>>> would not.
>>>
>>> But again, just some random thoughts how this affects other architectures and
>>> how we could avoid it. The issue I describe in patch #3 would be gone if
>>> ptep_same_norecency() would just do a ptep_same() check on other architectures
>>> -- and would make it easier to sell :)
>>>
>>
>> I've been thinking some more about this. I think your proposal is the following:
>>
>>
>> // ARM64
>> ptep_get_lockless_norecency()
>> {
>>     - returned access/dirty may be incorrect
>>     - returned access/dirty may be differently incorrect between 2 calls
>> }
>> pte_same_norecency()
>> {
>>     - ignore access/dirty when doing comparison
>> }
>> ptep_set_access_flags(ptep, pte)
>> {
>>     - must not assume access/dirty in pte are "more permissive" than
>>       access/dirty in *ptep
>>     - must only set access/dirty in *ptep, never clear
>> }
>>
>>
>> // Other arches: no change to generated code
>> ptep_get_lockless_norecency()
>> {
>>     return ptep_get_lockless();
>> }
>> pte_same_norecency()
>> {
>>     return pte_same();
>> }
>> ptep_set_access_flags(ptep, pte)
>> {
>>     - may assume access/dirty in pte are "more permissive" than access/dirty
>>       in *ptep
>>     - if no HW access/dirty updates, "*ptep = pte" always results in "more
>>       permissive" change
>> }
>>
>> An arch either specializes all 3 or none of them.
>>
>> This would allow us to get rid of ptep_get_lockless().
>>
>> And it addresses the bug you found with ptep_set_access_flags().
>>
>>
>> BUT, I still have a nagging feeling that there are likely to be other similar
>> problems caused by ignoring access/dirty during pte_same_norecency(). I can't
>> convince myself that its definitely all safe and robust.
> 
> Right, we'd have to identify the other possible cases and document what an arch
> + common code must stick to to make it work.
> 
> Some rules would be: if an arch implements ptep_get_lockless_norecency():
> 
> (1) Passing the result from ptep_get_lockless_norecency() to pte_same()
>     is wrong.
> (2) Checking pte_young()/pte_old/pte_dirty()/pte_clean() after
>     ptep_get_lockless_norecency() is very likely wrong.
> 
> 
>>
>> So I'm leaning towards dropping patch 3 and therefore keeping
>> ptep_get_lockless() around.
>>
>> Let me know if you have any insight that might help me change my mind :)
> 
> I'm wondering if it would help if we could find a better name (or concept) for
> "norecency" here, that expresses that only on some archs we'd have that fuzzy
> handling.
> 
> Keeping ptep_get_lockless() around for now sounds like the best alternative.

Separately to this I've been playing with an idea to add support for uffd-wp and
soft-dirty SW PTE bits for arm64; it boils down to keeping the SW bits in
separate storage, linked from the ptdesc. And we have some constant HW PTE bits
that we can remove and replace with those SW bits so we can keep the pte_t the
same size and abstract it all with ptep_get() and set_ptes().

It was all looking straightforward until I got to ptep_get_lockless(). Now that
there are 2 separate locations for PTE bits, I can't read it all atomically.

So I've been looking at all this again, and getting myself even more confused.

I believe there are 2 classes of ptep_get_lockless() caller:

1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
2) everyone else

--

(1) doesn't really care if orig_pte is consistent or not. It just wants to read
a value, do some speculative work based on that value, then lock the PTL, and
check the value hasn't changed. If it has changed, it backs out. So we don't
actually need any "lockless" guarrantees here; we could just use ptep_get().

In fact, prior to Hugh's commit 26e1a0c3277d ("mm: use pmdp_get_lockless()
without surplus barrier()"), we had this:

                vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
-               vmf->orig_pte = *vmf->pte;
+               vmf->orig_pte = ptep_get_lockless(vmf->pte);
                vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;

-               /*
-                * some architectures can have larger ptes than wordsize,
-                * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
-                * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
-                * accesses.  The code below just needs a consistent view
-                * for the ifs and we later double check anyway with the
-                * ptl lock held. So here a barrier will do.
-                */
-               barrier();
                if (pte_none(vmf->orig_pte)) {

--

(2) All the other users require that a subset of the pte fields are
self-consistent; specifically they don't care about access, dirty, uffd-wp or
soft-dirty. arm64 can guarrantee that all the other bits are self-consistent
just by calling ptep_get().

--

So, I'm making the bold claim that it was never neccessary to specialize
pte_get_lockless() on arm64, and I *think* we could just delete it so that
ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
aim without needing to introduce "norecency" variants.

Additionally I propose documenting ptep_get_lockless() to describe the set of
fields that are guarranteed to be self-consistent and the remaining fields which
are self-consistent only with best-effort.

Could it be this easy? My head is hurting...

Thanks,
Ryan

David Hildenbrand April 10, 2024, 8:09 p.m. UTC | #13

[...]

Skipping the ptdesc stuff we discussed offline, to not get distracted. :)

This stuff is killing me, sorry for the lengthy reply ...

> 
> So I've been looking at all this again, and getting myself even more confused.
> 
> I believe there are 2 classes of ptep_get_lockless() caller:
> 
> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
> 2) everyone else

Likely only completely lockless page table walkers where we *cannot* 
recheck under PTL is special. Essentially where we disable interrupts 
and rely on TLB flushes to sync against concurrent changes.

Let's take a look where ptep_get_lockless() comes from:

commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
Author: Peter Zijlstra <peterz@infradead.org>
Date:   Fri Nov 13 11:41:40 2020 +0100

     mm/gup: Provide gup_get_pte() more generic

     In order to write another lockless page-table walker, we need
     gup_get_pte() exposed. While doing that, rename it to
     ptep_get_lockless() to match the existing ptep_get() naming.


GUP-fast, when we were still relying on TLB flushes to sync against 
GUP-fast.

"With get_user_pages_fast(), we walk down the pagetables without taking 
any locks.  For this we would like to load the pointers atomically, but 
sometimes that is not possible (e.g. without expensive cmpxchg8b on 
x86_32 PAE).  What we do have is the guarantee that a PTE will only 
either go from not present to present, or present to not present or both 
-- it will not switch to a completely different present page without a 
TLB flush in between; something hat we are blocking by holding 
interrupts off."

Later, we added support for GUP-fast that introduced the !TLB variant:

commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
Author: Steve Capper <steve.capper@linaro.org>
Date:   Thu Oct 9 15:29:14 2014 -0700

     mm: introduce a general RCU get_user_pages_fast()

With the pattern

/*
  * In the line below we are assuming that the pte can be read
  * atomically. If this is not the case for your architecture,
  * please wrap this in a helper function!
  *
  * for an example see gup_get_pte in arch/x86/mm/gup.c
  */
pte_t pte = ACCESS_ONCE(*ptep);
...
if (unlikely(pte_val(pte) != pte_val(*ptep))) {
...


Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte = 
gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read 
was required to be sane, this the lengthy comment above 
ptep_get_lockless() that talks about TLB flushes.

The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is 
still full of details about TLB flushes syncing against GUP-fast. But as 
you note, we use it even in contexts where we don't disable interrupts 
and the TLB flush can't help.

We don't disable interrupts during page faults ... so most of the things 
described in ptep_get_lockless() don't really apply.

That's also the reason why ...
> 
>                  vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
> -               vmf->orig_pte = *vmf->pte;
> +               vmf->orig_pte = ptep_get_lockless(vmf->pte);
>                  vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
> 
> -               /*
> -                * some architectures can have larger ptes than wordsize,
> -                * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
> -                * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
> -                * accesses.  The code below just needs a consistent view
> -                * for the ifs and we later double check anyway with the
> -                * ptl lock held. So here a barrier will do.
> -                */
> -               barrier();
>                  if (pte_none(vmf->orig_pte)) {

... that code was in place. We would possibly read garbage PTEs, but 
would recheck *under PTL* (where the PTE can no longer change) that what 
we read wasn't garbage and didn't change.

> 
> --
> 
> (2) All the other users require that a subset of the pte fields are
> self-consistent; specifically they don't care about access, dirty, uffd-wp or
> soft-dirty. arm64 can guarrantee that all the other bits are self-consistent
> just by calling ptep_get().

Let's focus on access+dirty for now ;)

> 
> --
> 
> So, I'm making the bold claim that it was never neccessary to specialize
> pte_get_lockless() on arm64, and I *think* we could just delete it so that
> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
> aim without needing to introduce "norecency" variants.
> 
> Additionally I propose documenting ptep_get_lockless() to describe the set of
> fields that are guarranteed to be self-consistent and the remaining fields which
> are self-consistent only with best-effort.
> 
> Could it be this easy? My head is hurting...

I think what has to happen is:

(1) pte_get_lockless() must return the same value as ptep_get() as long 
as there are no races. No removal/addition of access/dirty bits etc.

(2) Lockless page table walkers that later verify under the PTL can 
handle serious "garbage PTEs". This is our page fault handler.

(3) Lockless page table walkers that cannot verify under PTL cannot 
handle arbitrary garbage PTEs. This is GUP-fast. Two options:

(3a) pte_get_lockless() can atomically read the PTE: We re-check later 
if the atomically-read PTE is still unchanged (without PTL). No IPI for 
TLB flushes required. This is the common case. HW might concurrently set 
access/dirty bits, so we can race with that. But we don't read garbage.

(3b) pte_get_lockless() cannot atomically read the PTE: We need special 
magic to read somewhat-sane PTEs and need IPIs during TLB flushes to 
sync against serious PTE changes (e.g., present -> present). This is 
weird x86-PAE.


If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.

My head is hurting ...

Ryan Roberts April 11, 2024, 9:45 a.m. UTC | #14

On 10/04/2024 21:09, David Hildenbrand wrote:
> [...]
> 
> Skipping the ptdesc stuff we discussed offline, to not get distracted. :)
> 
> This stuff is killing me, sorry for the lengthy reply ...

No problem - thanks for taking the time to think it through and reply with such
clarity. :)

> 
>>
>> So I've been looking at all this again, and getting myself even more confused.
>>
>> I believe there are 2 classes of ptep_get_lockless() caller:
>>
>> 1) vmf->orig_pte = ptep_get_lockless(vmf->pte); in handle_pte_fault()
>> 2) everyone else
> 
> Likely only completely lockless page table walkers where we *cannot* recheck
> under PTL is special. Essentially where we disable interrupts and rely on TLB
> flushes to sync against concurrent changes.

Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
"lockless walkers that never take the PTL".

Detail: the part about disabling interrupts and TLB flush syncing is
arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
you make that clear further down.

> 
> Let's take a look where ptep_get_lockless() comes from:
> 
> commit 2a4a06da8a4b93dd189171eed7a99fffd38f42f3
> Author: Peter Zijlstra <peterz@infradead.org>
> Date:   Fri Nov 13 11:41:40 2020 +0100
> 
>     mm/gup: Provide gup_get_pte() more generic
> 
>     In order to write another lockless page-table walker, we need
>     gup_get_pte() exposed. While doing that, rename it to
>     ptep_get_lockless() to match the existing ptep_get() naming.
> 
> 
> GUP-fast, when we were still relying on TLB flushes to sync against GUP-fast.
> 
> "With get_user_pages_fast(), we walk down the pagetables without taking any
> locks.  For this we would like to load the pointers atomically, but sometimes
> that is not possible (e.g. without expensive cmpxchg8b on x86_32 PAE).  What we
> do have is the guarantee that a PTE will only either go from not present to
> present, or present to not present or both -- it will not switch to a completely
> different present page without a TLB flush in between; something hat we are
> blocking by holding interrupts off."
> 
> Later, we added support for GUP-fast that introduced the !TLB variant:
> 
> commit 2667f50e8b81457fcb4a3dbe6aff3e81ea009e13
> Author: Steve Capper <steve.capper@linaro.org>
> Date:   Thu Oct 9 15:29:14 2014 -0700
> 
>     mm: introduce a general RCU get_user_pages_fast()
> 
> With the pattern
> 
> /*
>  * In the line below we are assuming that the pte can be read
>  * atomically. If this is not the case for your architecture,
>  * please wrap this in a helper function!
>  *
>  * for an example see gup_get_pte in arch/x86/mm/gup.c
>  */
> pte_t pte = ACCESS_ONCE(*ptep);
> ...
> if (unlikely(pte_val(pte) != pte_val(*ptep))) {
> ...
> 
> 
> Whereby the mentioned arch/x86/mm/gup.c code did a straight pte_t pte =
> gup_get_pte(ptep) without any re-reading of PTEs. The PTE that was read was
> required to be sane, this the lengthy comment above ptep_get_lockless() that
> talks about TLB flushes.
> 
> The comment above ptep_get_lockless() for CONFIG_GUP_GET_PXX_LOW_HIGH is still
> full of details about TLB flushes syncing against GUP-fast. But as you note, we
> use it even in contexts where we don't disable interrupts and the TLB flush
> can't help.
> 
> We don't disable interrupts during page faults ... so most of the things
> described in ptep_get_lockless() don't really apply.
> 
> That's also the reason why ...
>>
>>                  vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>> -               vmf->orig_pte = *vmf->pte;
>> +               vmf->orig_pte = ptep_get_lockless(vmf->pte);
>>                  vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
>>
>> -               /*
>> -                * some architectures can have larger ptes than wordsize,
>> -                * e.g.ppc44x-defconfig has CONFIG_PTE_64BIT=y and
>> -                * CONFIG_32BIT=y, so READ_ONCE cannot guarantee atomic
>> -                * accesses.  The code below just needs a consistent view
>> -                * for the ifs and we later double check anyway with the
>> -                * ptl lock held. So here a barrier will do.
>> -                */
>> -               barrier();
>>                  if (pte_none(vmf->orig_pte)) {
> 
> ... that code was in place. We would possibly read garbage PTEs, but would
> recheck *under PTL* (where the PTE can no longer change) that what we read
> wasn't garbage and didn't change.

Agreed.

> 
>>
>> -- 
>>
>> (2) All the other users require that a subset of the pte fields are
>> self-consistent; specifically they don't care about access, dirty, uffd-wp or
>> soft-dirty. arm64 can guarrantee that all the other bits are self-consistent
>> just by calling ptep_get().
> 
> Let's focus on access+dirty for now ;)
> 
>>
>> -- 
>>
>> So, I'm making the bold claim that it was never neccessary to specialize
>> pte_get_lockless() on arm64, and I *think* we could just delete it so that
>> ptep_get_lockless() resolves to ptep_get() on arm64. That solves the original
>> aim without needing to introduce "norecency" variants.
>>
>> Additionally I propose documenting ptep_get_lockless() to describe the set of
>> fields that are guarranteed to be self-consistent and the remaining fields which
>> are self-consistent only with best-effort.
>>
>> Could it be this easy? My head is hurting...
> 
> I think what has to happen is:
> 
> (1) pte_get_lockless() must return the same value as ptep_get() as long as there
> are no races. No removal/addition of access/dirty bits etc.

Today's arm64 ptep_get() guarantees this.

> 
> (2) Lockless page table walkers that later verify under the PTL can handle
> serious "garbage PTEs". This is our page fault handler.

This isn't really a property of a ptep_get_lockless(); its a statement about a
class of users. I agree with the statement.

> 
> (3) Lockless page table walkers that cannot verify under PTL cannot handle
> arbitrary garbage PTEs. This is GUP-fast. Two options:
> 
> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
> required. This is the common case. HW might concurrently set access/dirty bits,
> so we can race with that. But we don't read garbage.

Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
consistent for contpte ptes. That's the bit that complicates the current
ptep_get_lockless() implementation.

But the point I was trying to make is that GUP-fast does not actually care about
*all* the fields being consistent (e.g. access/dirty). So we could spec
pte_get_lockless() to say that "all fields in the returned pte are guarranteed
to be self-consistent except for access and dirty information, which may be
inconsistent if a racing modification occured".

This could mean that the access/dirty state *does* change for a given page while
GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
that failing to detect this is benign.

Aside: GUP-fast currently rechecks the pte originally obtained with
ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform
to (1), so either it returns the same pte or it returns a different pte or
garbage. But that garbage could just happen to be the same as the originally
obtained pte. So in that case, it would have a false match. I think this needs
to be changed to ptep_get_lockless()?

> 
> (3b) pte_get_lockless() cannot atomically read the PTE: We need special magic to
> read somewhat-sane PTEs and need IPIs during TLB flushes to sync against serious
> PTE changes (e.g., present -> present). This is weird x86-PAE.
> 
> 
> If ptep_get() on arm64 can do (1), (2) and (3a), we might be good.
> 
> My head is hurting ...
>

Ryan Roberts April 15, 2024, 9:28 a.m. UTC | #15

On 12/04/2024 21:16, David Hildenbrand wrote:
>>
>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>> "lockless walkers that never take the PTL".
>>
>> Detail: the part about disabling interrupts and TLB flush syncing is
>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
>> you make that clear further down.
> 
> Yes, but disabling interrupts is also required for RCU-freeing of page tables
> such that they can be walked safely. The TLB flush IPI is arch-specific and
> indeed to sync against PTE invalidation (before generic GUP-fast).
> [...]
> 
>>>>
>>>> Could it be this easy? My head is hurting...
>>>
>>> I think what has to happen is:
>>>
>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as there
>>> are no races. No removal/addition of access/dirty bits etc.
>>
>> Today's arm64 ptep_get() guarantees this.
>>
>>>
>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>> serious "garbage PTEs". This is our page fault handler.
>>
>> This isn't really a property of a ptep_get_lockless(); its a statement about a
>> class of users. I agree with the statement.
> 
> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
> fault handlers. Well, mostly "not GUP".
> 
>>
>>>
>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>
>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
>>> required. This is the common case. HW might concurrently set access/dirty bits,
>>> so we can race with that. But we don't read garbage.
>>
>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>> consistent for contpte ptes. That's the bit that complicates the current
>> ptep_get_lockless() implementation.
>>
>> But the point I was trying to make is that GUP-fast does not actually care about
>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>> pte_get_lockless() to say that "all fields in the returned pte are guarranteed
>> to be self-consistent except for access and dirty information, which may be
>> inconsistent if a racing modification occured".
> 
> We *might* have KVM in the future want to check that a PTE is dirty, such that
> we can only allow dirty PTEs to be writable in a secondary MMU. That's not there
> yet, but one thing I was discussing on the list recently. Burried in:
> 
> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com
> 
> We wouldn't care about racing modifications, as long as MMU notifiers will
> properly notify us when the PTE would lose its dirty bits.
> 
> But getting false-positive dirty bits would be problematic.
> 
>>
>> This could mean that the access/dirty state *does* change for a given page while
>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
>> that failing to detect this is benign.
> 
> I mean, HW could just set the dirty/access bit immediately after the check. So
> if HW concurrently sets the bit and we don't observe that change when we
> recheck, I think that would be perfectly fine.

Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
soft-dirty or uffd-wp).

But if you don't want to change the ptep_get_lockless() spec to explicitly allow
this (because you have the KVM use case where false-positive dirty is
problematic), then I think we are stuck with ptep_get_lockless() as implemented
for arm64 today.

> 
>>
>> Aside: GUP-fast currently rechecks the pte originally obtained with
>> ptep_get_lockless(), using ptep_get(). Is that correct? ptep_get() must conform
>> to (1), so either it returns the same pte or it returns a different pte or
>> garbage. But that garbage could just happen to be the same as the originally
>> obtained pte. So in that case, it would have a false match. I think this needs
>> to be changed to ptep_get_lockless()?
> 
> I *think* it's fine, because the case where it would make a difference (x86-PAE)
> still requires the TLB flush IPI to sync against PTE changes, and that check
> would likely be wrong in one way or the other. So for x86-pae, that check is
> just moot either way.
> 
> That my theory, at least.
> 
> (but this "let's fake-read atomically although we don't, but let's do like we
> could in some specific circumstances" is really hard to get)
> 
> I was wondering a while ago if we are missing a memory barrier before the checl,
> but I think the one from obtaining the page reference gets the job done (at
> least that's what I remember).
>

Ryan Roberts April 15, 2024, 1:30 p.m. UTC | #16

On 15/04/2024 11:57, David Hildenbrand wrote:
> On 15.04.24 11:28, Ryan Roberts wrote:
>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>
>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>> "lockless walkers that never take the PTL".
>>>>
>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
>>>> you make that clear further down.
>>>
>>> Yes, but disabling interrupts is also required for RCU-freeing of page tables
>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>> [...]
>>>
>>>>>>
>>>>>> Could it be this easy? My head is hurting...
>>>>>
>>>>> I think what has to happen is:
>>>>>
>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>> there
>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>
>>>> Today's arm64 ptep_get() guarantees this.
>>>>
>>>>>
>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>
>>>> This isn't really a property of a ptep_get_lockless(); its a statement about a
>>>> class of users. I agree with the statement.
>>>
>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>> fault handlers. Well, mostly "not GUP".
>>>
>>>>
>>>>>
>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>
>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if the
>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>> bits,
>>>>> so we can race with that. But we don't read garbage.
>>>>
>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>> ptep_get_lockless() implementation.
>>>>
>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>> about
>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>> pte_get_lockless() to say that "all fields in the returned pte are guarranteed
>>>> to be self-consistent except for access and dirty information, which may be
>>>> inconsistent if a racing modification occured".
>>>
>>> We *might* have KVM in the future want to check that a PTE is dirty, such that
>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not there
>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>
>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com
>>>
>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>> properly notify us when the PTE would lose its dirty bits.
>>>
>>> But getting false-positive dirty bits would be problematic.
>>>
>>>>
>>>> This could mean that the access/dirty state *does* change for a given page
>>>> while
>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
>>>> that failing to detect this is benign.
>>>
>>> I mean, HW could just set the dirty/access bit immediately after the check. So
>>> if HW concurrently sets the bit and we don't observe that change when we
>>> recheck, I think that would be perfectly fine.
>>
>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>> soft-dirty or uffd-wp).
>>
>> But if you don't want to change the ptep_get_lockless() spec to explicitly allow
>> this (because you have the KVM use case where false-positive dirty is
>> problematic), then I think we are stuck with ptep_get_lockless() as implemented
>> for arm64 today.
> 
> At least regarding the dirty bit, we'd have to guarantee that if
> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
> would be able to catch that.
> 
> Would that be possible?

Hmm maybe. My head hurts. Let me try to work through some examples...


Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is stored in
PTE0 for the contpte block, and it is dirty.

Now let's say there are 2 racing threads:

  - ThreadA is doing a GUP-fast for PTE3
  - ThreadB is remapping order-0 FolioB at PTE0

(ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
example - today's arm64 ptep_get_lockless() can handle the below correctly).

ThreadA					ThreadB
=======					=======

gup_pte_range()
  pte1 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
					mmap(PTE0)
					  clear_pte(PTE0)
					    unfold(PTE0 - PTE3)
					      WRITE_ONCE(PTE0, 0)
					      WRITE_ONCE(PTE1, 0)
					      WRITE_ONCE(PTE2, 0)
    READ_ONCE(PTE0) (for a/d) << CLEAN!!
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  <do speculative work with pte1 content>
  pte2 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
    READ_ONCE(PTE0) (for a/d)
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  true = pte_same(pte1, pte2)
					      WRITE_ONCE(PTE3, 0)
					      TLBI
					      WRITE_ONCE(PTE0, <orig & ~CONT>)
					      WRITE_ONCE(PTE1, <orig & ~CONT>)
					      WRITE_ONCE(PTE2, <orig & ~CONT>)
					      WRITE_ONCE(PTE3, <orig & ~CONT>)
					    WRITE_ONCE(PTE0, 0)
					  set_pte_at(PTE0, <new>)

This example shows how a *false-negative* can be returned for the dirty state,
which isn't detected by the check.

I've been unable to come up with an example where a *false-positive* can be
returned for dirty state without the second ptep_get_lockless() noticing. In
this second example, let's assume everything is the same execpt FolioA is
initially clean:

ThreadA					ThreadB
=======					=======

gup_pte_range()
  pte1 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)
					mmap(PTE0)
					  clear_pte(PTE0)
					    unfold(PTE0 - PTE3)
					      WRITE_ONCE(PTE0, 0)
					      WRITE_ONCE(PTE1, 0)
					      WRITE_ONCE(PTE2, 0)
					      WRITE_ONCE(PTE3, 0)
					      TLBI
					      WRITE_ONCE(PTE0, <orig & ~CONT>)
					      WRITE_ONCE(PTE1, <orig & ~CONT>)
					      WRITE_ONCE(PTE2, <orig & ~CONT>)
					      WRITE_ONCE(PTE3, <orig & ~CONT>)
					    WRITE_ONCE(PTE0, 0)
					  set_pte_at(PTE0, <new>)
					write to FolioB - HW sets PTE0's dirty
    READ_ONCE(PTE0) (for a/d) << DIRTY!!
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  <do speculative work with pte1 content>
  pte2 = ptep_get_lockless(PTE3)
    READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
    READ_ONCE(PTE0) (for a/d)
    READ_ONCE(PTE1) (for a/d)
    READ_ONCE(PTE2) (for a/d)
    READ_ONCE(PTE3) (for a/d)
  false = pte_same(pte1, pte2) << So this fails

The only way I can see false-positive not being caught in the second example is
if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
But these lockless table walkers are already suseptible to that.

I think all the same arguments can be extended to the access bit.


For me this is all getting rather subtle and difficult to reason about and even
harder to spec in a comprehensible way. The best I could come up with is:

"All fields in the returned pte are guarranteed to be self-consistent except for
access and dirty information, which may be inconsistent if a racing modification
occured. Additionally it is guranteed that false-positive access and/or dirty
information is not possible if 2 calls are made and both ptes are the same. Only
false-negative access and/or dirty information is possible in this scenario."

which is starting to sound bonkers. Personally I think we are better off at this
point, just keeping today's arm64 ptep_get_lockless().

Thanks,
Ryan

Ryan Roberts April 15, 2024, 2:34 p.m. UTC | #17

On 15/04/2024 15:23, David Hildenbrand wrote:
> On 15.04.24 15:30, Ryan Roberts wrote:
>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>
>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>>>> "lockless walkers that never take the PTL".
>>>>>>
>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the TLBIs). But
>>>>>> you make that clear further down.
>>>>>
>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page tables
>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>>>> [...]
>>>>>
>>>>>>>>
>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>
>>>>>>> I think what has to happen is:
>>>>>>>
>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>>>> there
>>>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>>>
>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>
>>>>>>>
>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>>>
>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement
>>>>>> about a
>>>>>> class of users. I agree with the statement.
>>>>>
>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>>>> fault handlers. Well, mostly "not GUP".
>>>>>
>>>>>>
>>>>>>>
>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>>
>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if
>>>>>>> the
>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB flushes
>>>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>>>> bits,
>>>>>>> so we can race with that. But we don't read garbage.
>>>>>>
>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>>>> ptep_get_lockless() implementation.
>>>>>>
>>>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>>>> about
>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>>>> pte_get_lockless() to say that "all fields in the returned pte are
>>>>>> guarranteed
>>>>>> to be self-consistent except for access and dirty information, which may be
>>>>>> inconsistent if a racing modification occured".
>>>>>
>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such that
>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not
>>>>> there
>>>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>>>
>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com
>>>>>
>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>>> properly notify us when the PTE would lose its dirty bits.
>>>>>
>>>>> But getting false-positive dirty bits would be problematic.
>>>>>
>>>>>>
>>>>>> This could mean that the access/dirty state *does* change for a given page
>>>>>> while
>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I *think*
>>>>>> that failing to detect this is benign.
>>>>>
>>>>> I mean, HW could just set the dirty/access bit immediately after the check. So
>>>>> if HW concurrently sets the bit and we don't observe that change when we
>>>>> recheck, I think that would be perfectly fine.
>>>>
>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>>> soft-dirty or uffd-wp).
>>>>
>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly
>>>> allow
>>>> this (because you have the KVM use case where false-positive dirty is
>>>> problematic), then I think we are stuck with ptep_get_lockless() as implemented
>>>> for arm64 today.
>>>
>>> At least regarding the dirty bit, we'd have to guarantee that if
>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
>>> would be able to catch that.
>>>
>>> Would that be possible?
>>
>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>
>>
>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is stored in
>> PTE0 for the contpte block, and it is dirty.
>>
>> Now let's say there are 2 racing threads:
>>
>>    - ThreadA is doing a GUP-fast for PTE3
>>    - ThreadB is remapping order-0 FolioB at PTE0
>>
>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
>> example - today's arm64 ptep_get_lockless() can handle the below correctly).
>>
>> ThreadA                    ThreadB
>> =======                    =======
>>
>> gup_pte_range()
>>    pte1 = ptep_get_lockless(PTE3)
>>      READ_ONCE(PTE3)
>>                     mmap(PTE0)
>>                       clear_pte(PTE0)
>>                         unfold(PTE0 - PTE3)
>>                           WRITE_ONCE(PTE0, 0)
>>                           WRITE_ONCE(PTE1, 0)
>>                           WRITE_ONCE(PTE2, 0)
>>      READ_ONCE(PTE0) (for a/d) << CLEAN!!
>>      READ_ONCE(PTE1) (for a/d)
>>      READ_ONCE(PTE2) (for a/d)
>>      READ_ONCE(PTE3) (for a/d)
>>    <do speculative work with pte1 content>
>>    pte2 = ptep_get_lockless(PTE3)
>>      READ_ONCE(PTE3)
>>      READ_ONCE(PTE0) (for a/d)
>>      READ_ONCE(PTE1) (for a/d)
>>      READ_ONCE(PTE2) (for a/d)
>>      READ_ONCE(PTE3) (for a/d)
>>    true = pte_same(pte1, pte2)
>>                           WRITE_ONCE(PTE3, 0)
>>                           TLBI
>>                           WRITE_ONCE(PTE0, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE1, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE2, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE3, <orig & ~CONT>)
>>                         WRITE_ONCE(PTE0, 0)
>>                       set_pte_at(PTE0, <new>)
>>
>> This example shows how a *false-negative* can be returned for the dirty state,
>> which isn't detected by the check.
>>
>> I've been unable to come up with an example where a *false-positive* can be
>> returned for dirty state without the second ptep_get_lockless() noticing. In
>> this second example, let's assume everything is the same execpt FolioA is
>> initially clean:
>>
>> ThreadA                    ThreadB
>> =======                    =======
>>
>> gup_pte_range()
>>    pte1 = ptep_get_lockless(PTE3)
>>      READ_ONCE(PTE3)
>>                     mmap(PTE0)
>>                       clear_pte(PTE0)
>>                         unfold(PTE0 - PTE3)
>>                           WRITE_ONCE(PTE0, 0)
>>                           WRITE_ONCE(PTE1, 0)
>>                           WRITE_ONCE(PTE2, 0)
>>                           WRITE_ONCE(PTE3, 0)
>>                           TLBI
>>                           WRITE_ONCE(PTE0, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE1, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE2, <orig & ~CONT>)
>>                           WRITE_ONCE(PTE3, <orig & ~CONT>)
>>                         WRITE_ONCE(PTE0, 0)
>>                       set_pte_at(PTE0, <new>)
>>                     write to FolioB - HW sets PTE0's dirty
>>      READ_ONCE(PTE0) (for a/d) << DIRTY!!
>>      READ_ONCE(PTE1) (for a/d)
>>      READ_ONCE(PTE2) (for a/d)
>>      READ_ONCE(PTE3) (for a/d)
>>    <do speculative work with pte1 content>
>>    pte2 = ptep_get_lockless(PTE3)
>>      READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
>>      READ_ONCE(PTE0) (for a/d)
>>      READ_ONCE(PTE1) (for a/d)
>>      READ_ONCE(PTE2) (for a/d)
>>      READ_ONCE(PTE3) (for a/d)
>>    false = pte_same(pte1, pte2) << So this fails
>>
>> The only way I can see false-positive not being caught in the second example is
>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
>> But these lockless table walkers are already suseptible to that.
>>
>> I think all the same arguments can be extended to the access bit.
>>
>>
>> For me this is all getting rather subtle and difficult to reason about and even
>> harder to spec in a comprehensible way. The best I could come up with is:
>>
>> "All fields in the returned pte are guarranteed to be self-consistent except for
>> access and dirty information, which may be inconsistent if a racing modification
>> occured. Additionally it is guranteed that false-positive access and/or dirty
>> information is not possible if 2 calls are made and both ptes are the same. Only
>> false-negative access and/or dirty information is possible in this scenario."
>>
>> which is starting to sound bonkers. Personally I think we are better off at this
>> point, just keeping today's arm64 ptep_get_lockless().
> 
> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
> would sync against GUP-fast?

No, the broadcast is in HW. There is no IPI.

Ryan Roberts April 15, 2024, 3:17 p.m. UTC | #18

On 15/04/2024 15:58, David Hildenbrand wrote:
> On 15.04.24 16:34, Ryan Roberts wrote:
>> On 15/04/2024 15:23, David Hildenbrand wrote:
>>> On 15.04.24 15:30, Ryan Roberts wrote:
>>>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>>>>>> "lockless walkers that never take the PTL".
>>>>>>>>
>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the
>>>>>>>> TLBIs). But
>>>>>>>> you make that clear further down.
>>>>>>>
>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page
>>>>>>> tables
>>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>>>
>>>>>>>>> I think what has to happen is:
>>>>>>>>>
>>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>>>>>> there
>>>>>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>>>>>
>>>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>>>>>
>>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement
>>>>>>>> about a
>>>>>>>> class of users. I agree with the statement.
>>>>>>>
>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>>>>>> fault handlers. Well, mostly "not GUP".
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>>>>
>>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if
>>>>>>>>> the
>>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB
>>>>>>>>> flushes
>>>>>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>>>>>> bits,
>>>>>>>>> so we can race with that. But we don't read garbage.
>>>>>>>>
>>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>>>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>>>>>> ptep_get_lockless() implementation.
>>>>>>>>
>>>>>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>>>>>> about
>>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are
>>>>>>>> guarranteed
>>>>>>>> to be self-consistent except for access and dirty information, which may be
>>>>>>>> inconsistent if a racing modification occured".
>>>>>>>
>>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such
>>>>>>> that
>>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not
>>>>>>> there
>>>>>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>>>>>
>>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com
>>>>>>>
>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>>>>> properly notify us when the PTE would lose its dirty bits.
>>>>>>>
>>>>>>> But getting false-positive dirty bits would be problematic.
>>>>>>>
>>>>>>>>
>>>>>>>> This could mean that the access/dirty state *does* change for a given page
>>>>>>>> while
>>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I
>>>>>>>> *think*
>>>>>>>> that failing to detect this is benign.
>>>>>>>
>>>>>>> I mean, HW could just set the dirty/access bit immediately after the
>>>>>>> check. So
>>>>>>> if HW concurrently sets the bit and we don't observe that change when we
>>>>>>> recheck, I think that would be perfectly fine.
>>>>>>
>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>>>>> soft-dirty or uffd-wp).
>>>>>>
>>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly
>>>>>> allow
>>>>>> this (because you have the KVM use case where false-positive dirty is
>>>>>> problematic), then I think we are stuck with ptep_get_lockless() as
>>>>>> implemented
>>>>>> for arm64 today.
>>>>>
>>>>> At least regarding the dirty bit, we'd have to guarantee that if
>>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
>>>>> would be able to catch that.
>>>>>
>>>>> Would that be possible?
>>>>
>>>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>>>
>>>>
>>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
>>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is
>>>> stored in
>>>> PTE0 for the contpte block, and it is dirty.
>>>>
>>>> Now let's say there are 2 racing threads:
>>>>
>>>>     - ThreadA is doing a GUP-fast for PTE3
>>>>     - ThreadB is remapping order-0 FolioB at PTE0
>>>>
>>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
>>>> example - today's arm64 ptep_get_lockless() can handle the below correctly).
>>>>
>>>> ThreadA                    ThreadB
>>>> =======                    =======
>>>>
>>>> gup_pte_range()
>>>>     pte1 = ptep_get_lockless(PTE3)
>>>>       READ_ONCE(PTE3)
>>>>                      mmap(PTE0)
>>>>                        clear_pte(PTE0)
>>>>                          unfold(PTE0 - PTE3)
>>>>                            WRITE_ONCE(PTE0, 0)
>>>>                            WRITE_ONCE(PTE1, 0)
>>>>                            WRITE_ONCE(PTE2, 0)
>>>>       READ_ONCE(PTE0) (for a/d) << CLEAN!!
>>>>       READ_ONCE(PTE1) (for a/d)
>>>>       READ_ONCE(PTE2) (for a/d)
>>>>       READ_ONCE(PTE3) (for a/d)
>>>>     <do speculative work with pte1 content>
>>>>     pte2 = ptep_get_lockless(PTE3)
>>>>       READ_ONCE(PTE3)
>>>>       READ_ONCE(PTE0) (for a/d)
>>>>       READ_ONCE(PTE1) (for a/d)
>>>>       READ_ONCE(PTE2) (for a/d)
>>>>       READ_ONCE(PTE3) (for a/d)
>>>>     true = pte_same(pte1, pte2)
>>>>                            WRITE_ONCE(PTE3, 0)
>>>>                            TLBI
>>>>                            WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>                          WRITE_ONCE(PTE0, 0)
>>>>                        set_pte_at(PTE0, <new>)
>>>>
>>>> This example shows how a *false-negative* can be returned for the dirty state,
>>>> which isn't detected by the check.
>>>>
>>>> I've been unable to come up with an example where a *false-positive* can be
>>>> returned for dirty state without the second ptep_get_lockless() noticing. In
>>>> this second example, let's assume everything is the same execpt FolioA is
>>>> initially clean:
>>>>
>>>> ThreadA                    ThreadB
>>>> =======                    =======
>>>>
>>>> gup_pte_range()
>>>>     pte1 = ptep_get_lockless(PTE3)
>>>>       READ_ONCE(PTE3)
>>>>                      mmap(PTE0)
>>>>                        clear_pte(PTE0)
>>>>                          unfold(PTE0 - PTE3)
>>>>                            WRITE_ONCE(PTE0, 0)
>>>>                            WRITE_ONCE(PTE1, 0)
>>>>                            WRITE_ONCE(PTE2, 0)
>>>>                            WRITE_ONCE(PTE3, 0)
>>>>                            TLBI
>>>>                            WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>                            WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>                          WRITE_ONCE(PTE0, 0)
>>>>                        set_pte_at(PTE0, <new>)
>>>>                      write to FolioB - HW sets PTE0's dirty
>>>>       READ_ONCE(PTE0) (for a/d) << DIRTY!!
>>>>       READ_ONCE(PTE1) (for a/d)
>>>>       READ_ONCE(PTE2) (for a/d)
>>>>       READ_ONCE(PTE3) (for a/d)
>>>>     <do speculative work with pte1 content>
>>>>     pte2 = ptep_get_lockless(PTE3)
>>>>       READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
>>>>       READ_ONCE(PTE0) (for a/d)
>>>>       READ_ONCE(PTE1) (for a/d)
>>>>       READ_ONCE(PTE2) (for a/d)
>>>>       READ_ONCE(PTE3) (for a/d)
>>>>     false = pte_same(pte1, pte2) << So this fails
>>>>
>>>> The only way I can see false-positive not being caught in the second example is
>>>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
>>>> But these lockless table walkers are already suseptible to that.
>>>>
>>>> I think all the same arguments can be extended to the access bit.
>>>>
>>>>
>>>> For me this is all getting rather subtle and difficult to reason about and even
>>>> harder to spec in a comprehensible way. The best I could come up with is:
>>>>
>>>> "All fields in the returned pte are guarranteed to be self-consistent except
>>>> for
>>>> access and dirty information, which may be inconsistent if a racing
>>>> modification
>>>> occured. Additionally it is guranteed that false-positive access and/or dirty
>>>> information is not possible if 2 calls are made and both ptes are the same.
>>>> Only
>>>> false-negative access and/or dirty information is possible in this scenario."
>>>>
>>>> which is starting to sound bonkers. Personally I think we are better off at
>>>> this
>>>> point, just keeping today's arm64 ptep_get_lockless().
>>>
>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
>>> would sync against GUP-fast?
>>
>> No, the broadcast is in HW. There is no IPI.
> 
> Okay ...
> 
> I agree that the semantics are a bit weird, but if we could get rid of
> ptep_get_lockless() on arm64, that would also be nice.
> 
> 
> Something I've been thinking of ... just to share what I've had in mind. The two
> types of users we currently have are:
> 
> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
> 
> (2) ptep_get_lockless() followed by ptep_get() check without PTL.
> 
> What if we had the following instead:
> 
> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
> 
> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without
>     PTL.
> 
> And on arm64 let
> 
> (1) ptep_get_lockless() be ptep_get()
> 
> (2) ptep_get_gup_fast() be __ptep_get().
> 
> That would mean, that (2) would not care if another cont-pte is dirty, because
> we don't collect access+dirty bits. That way, we avoid any races with concurrent
> unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes()
> would have to set all cont-PTEs dirty, even if any of these already is dirty.

I don't think the "problematic" thing is actually a problem; set_ptes() will
always set the dirty bit to the same value for all ptes it covers (and if you do
set_ptes() on a partial contpte block, it will be unfolded first). Although I
suspect I've misunderstood what you meant there...

The potential problem I see with this is that the Arm ARM doesn't specify which
PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
randomly and this could spuriously increase your check failure rate. In reality
I believe most implementations will update the PTE for the address that caused
the TLB to be populated. But in some cases, you could have eviction (due to
pressure or explicit invalidation) followed by re-population due to faulting on
a different page of the contpte block. In this case you would see this type of
problem too.

But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
potentially false-negatives for access and dirty? Just with a much higher chance
of getting a false-negative. How is this helping?

David Hildenbrand April 15, 2024, 3:22 p.m. UTC | #19

On 15.04.24 17:17, Ryan Roberts wrote:
> On 15/04/2024 15:58, David Hildenbrand wrote:
>> On 15.04.24 16:34, Ryan Roberts wrote:
>>> On 15/04/2024 15:23, David Hildenbrand wrote:
>>>> On 15.04.24 15:30, Ryan Roberts wrote:
>>>>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>>>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>>>>
>>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>>>>>>> "lockless walkers that never take the PTL".
>>>>>>>>>
>>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the
>>>>>>>>> TLBIs). But
>>>>>>>>> you make that clear further down.
>>>>>>>>
>>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page
>>>>>>>> tables
>>>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific and
>>>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>>>>
>>>>>>>>>> I think what has to happen is:
>>>>>>>>>>
>>>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as long as
>>>>>>>>>> there
>>>>>>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>>>>>>
>>>>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can handle
>>>>>>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>>>>>>
>>>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement
>>>>>>>>> about a
>>>>>>>>> class of users. I agree with the statement.
>>>>>>>>
>>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as page
>>>>>>>> fault handlers. Well, mostly "not GUP".
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot handle
>>>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>>>>>
>>>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check later if
>>>>>>>>>> the
>>>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB
>>>>>>>>>> flushes
>>>>>>>>>> required. This is the common case. HW might concurrently set access/dirty
>>>>>>>>>> bits,
>>>>>>>>>> so we can race with that. But we don't read garbage.
>>>>>>>>>
>>>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>>>>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>>>>>>> ptep_get_lockless() implementation.
>>>>>>>>>
>>>>>>>>> But the point I was trying to make is that GUP-fast does not actually care
>>>>>>>>> about
>>>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are
>>>>>>>>> guarranteed
>>>>>>>>> to be self-consistent except for access and dirty information, which may be
>>>>>>>>> inconsistent if a racing modification occured".
>>>>>>>>
>>>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such
>>>>>>>> that
>>>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not
>>>>>>>> there
>>>>>>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>>>>>>
>>>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com
>>>>>>>>
>>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>>>>>> properly notify us when the PTE would lose its dirty bits.
>>>>>>>>
>>>>>>>> But getting false-positive dirty bits would be problematic.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> This could mean that the access/dirty state *does* change for a given page
>>>>>>>>> while
>>>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I
>>>>>>>>> *think*
>>>>>>>>> that failing to detect this is benign.
>>>>>>>>
>>>>>>>> I mean, HW could just set the dirty/access bit immediately after the
>>>>>>>> check. So
>>>>>>>> if HW concurrently sets the bit and we don't observe that change when we
>>>>>>>> recheck, I think that would be perfectly fine.
>>>>>>>
>>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>>>>>> soft-dirty or uffd-wp).
>>>>>>>
>>>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly
>>>>>>> allow
>>>>>>> this (because you have the KVM use case where false-positive dirty is
>>>>>>> problematic), then I think we are stuck with ptep_get_lockless() as
>>>>>>> implemented
>>>>>>> for arm64 today.
>>>>>>
>>>>>> At least regarding the dirty bit, we'd have to guarantee that if
>>>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
>>>>>> would be able to catch that.
>>>>>>
>>>>>> Would that be possible?
>>>>>
>>>>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>>>>
>>>>>
>>>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs 0, 1,
>>>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is
>>>>> stored in
>>>>> PTE0 for the contpte block, and it is dirty.
>>>>>
>>>>> Now let's say there are 2 racing threads:
>>>>>
>>>>>      - ThreadA is doing a GUP-fast for PTE3
>>>>>      - ThreadB is remapping order-0 FolioB at PTE0
>>>>>
>>>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
>>>>> example - today's arm64 ptep_get_lockless() can handle the below correctly).
>>>>>
>>>>> ThreadA                    ThreadB
>>>>> =======                    =======
>>>>>
>>>>> gup_pte_range()
>>>>>      pte1 = ptep_get_lockless(PTE3)
>>>>>        READ_ONCE(PTE3)
>>>>>                       mmap(PTE0)
>>>>>                         clear_pte(PTE0)
>>>>>                           unfold(PTE0 - PTE3)
>>>>>                             WRITE_ONCE(PTE0, 0)
>>>>>                             WRITE_ONCE(PTE1, 0)
>>>>>                             WRITE_ONCE(PTE2, 0)
>>>>>        READ_ONCE(PTE0) (for a/d) << CLEAN!!
>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>      <do speculative work with pte1 content>
>>>>>      pte2 = ptep_get_lockless(PTE3)
>>>>>        READ_ONCE(PTE3)
>>>>>        READ_ONCE(PTE0) (for a/d)
>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>      true = pte_same(pte1, pte2)
>>>>>                             WRITE_ONCE(PTE3, 0)
>>>>>                             TLBI
>>>>>                             WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>                             WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>                             WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>                             WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>                           WRITE_ONCE(PTE0, 0)
>>>>>                         set_pte_at(PTE0, <new>)
>>>>>
>>>>> This example shows how a *false-negative* can be returned for the dirty state,
>>>>> which isn't detected by the check.
>>>>>
>>>>> I've been unable to come up with an example where a *false-positive* can be
>>>>> returned for dirty state without the second ptep_get_lockless() noticing. In
>>>>> this second example, let's assume everything is the same execpt FolioA is
>>>>> initially clean:
>>>>>
>>>>> ThreadA                    ThreadB
>>>>> =======                    =======
>>>>>
>>>>> gup_pte_range()
>>>>>      pte1 = ptep_get_lockless(PTE3)
>>>>>        READ_ONCE(PTE3)
>>>>>                       mmap(PTE0)
>>>>>                         clear_pte(PTE0)
>>>>>                           unfold(PTE0 - PTE3)
>>>>>                             WRITE_ONCE(PTE0, 0)
>>>>>                             WRITE_ONCE(PTE1, 0)
>>>>>                             WRITE_ONCE(PTE2, 0)
>>>>>                             WRITE_ONCE(PTE3, 0)
>>>>>                             TLBI
>>>>>                             WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>                             WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>                             WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>                             WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>                           WRITE_ONCE(PTE0, 0)
>>>>>                         set_pte_at(PTE0, <new>)
>>>>>                       write to FolioB - HW sets PTE0's dirty
>>>>>        READ_ONCE(PTE0) (for a/d) << DIRTY!!
>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>      <do speculative work with pte1 content>
>>>>>      pte2 = ptep_get_lockless(PTE3)
>>>>>        READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
>>>>>        READ_ONCE(PTE0) (for a/d)
>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>      false = pte_same(pte1, pte2) << So this fails
>>>>>
>>>>> The only way I can see false-positive not being caught in the second example is
>>>>> if ThreadB subseuently remaps the original folio, so you have an ABA scenario.
>>>>> But these lockless table walkers are already suseptible to that.
>>>>>
>>>>> I think all the same arguments can be extended to the access bit.
>>>>>
>>>>>
>>>>> For me this is all getting rather subtle and difficult to reason about and even
>>>>> harder to spec in a comprehensible way. The best I could come up with is:
>>>>>
>>>>> "All fields in the returned pte are guarranteed to be self-consistent except
>>>>> for
>>>>> access and dirty information, which may be inconsistent if a racing
>>>>> modification
>>>>> occured. Additionally it is guranteed that false-positive access and/or dirty
>>>>> information is not possible if 2 calls are made and both ptes are the same.
>>>>> Only
>>>>> false-negative access and/or dirty information is possible in this scenario."
>>>>>
>>>>> which is starting to sound bonkers. Personally I think we are better off at
>>>>> this
>>>>> point, just keeping today's arm64 ptep_get_lockless().
>>>>
>>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
>>>> would sync against GUP-fast?
>>>
>>> No, the broadcast is in HW. There is no IPI.
>>
>> Okay ...
>>
>> I agree that the semantics are a bit weird, but if we could get rid of
>> ptep_get_lockless() on arm64, that would also be nice.
>>
>>
>> Something I've been thinking of ... just to share what I've had in mind. The two
>> types of users we currently have are:
>>
>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>
>> (2) ptep_get_lockless() followed by ptep_get() check without PTL.
>>
>> What if we had the following instead:
>>
>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>
>> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without
>>      PTL.
>>
>> And on arm64 let
>>
>> (1) ptep_get_lockless() be ptep_get()
>>
>> (2) ptep_get_gup_fast() be __ptep_get().
>>
>> That would mean, that (2) would not care if another cont-pte is dirty, because
>> we don't collect access+dirty bits. That way, we avoid any races with concurrent
>> unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes()
>> would have to set all cont-PTEs dirty, even if any of these already is dirty.
> 
> I don't think the "problematic" thing is actually a problem; set_ptes() will
> always set the dirty bit to the same value for all ptes it covers (and if you do
> set_ptes() on a partial contpte block, it will be unfolded first). Although I
> suspect I've misunderstood what you meant there...

It's more code like that following that I am concerned about.

if (pte_dirty()) {
	/* Great, nothing to do */
} else
	mte_mkdirty();
	set_ptes();
	...
}

> 
> The potential problem I see with this is that the Arm ARM doesn't specify which
> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
> randomly and this could spuriously increase your check failure rate. In reality
> I believe most implementations will update the PTE for the address that caused
> the TLB to be populated. But in some cases, you could have eviction (due to
> pressure or explicit invalidation) followed by re-population due to faulting on
> a different page of the contpte block. In this case you would see this type of
> problem too.
> 
> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
> potentially false-negatives for access and dirty? Just with a much higher chance
> of getting a false-negative. How is this helping?

You are performing an atomic read like GUP-fast wants you to. So there 
are no races to worry about like on other architectures: HW might *set* 
the dirty bit concurrently, but that's just fine.

The whole races you describe with concurrent folding/unfolding/ ... are 
irrelevant.

To me that sounds ... much simpler ;) But again, just something I've 
been thinking about.

The reuse of pte_get_lockless() outside GUP code might not have been the 
wisest choice.

Ryan Roberts April 15, 2024, 3:53 p.m. UTC | #20

On 15/04/2024 16:22, David Hildenbrand wrote:
> On 15.04.24 17:17, Ryan Roberts wrote:
>> On 15/04/2024 15:58, David Hildenbrand wrote:
>>> On 15.04.24 16:34, Ryan Roberts wrote:
>>>> On 15/04/2024 15:23, David Hildenbrand wrote:
>>>>> On 15.04.24 15:30, Ryan Roberts wrote:
>>>>>> On 15/04/2024 11:57, David Hildenbrand wrote:
>>>>>>> On 15.04.24 11:28, Ryan Roberts wrote:
>>>>>>>> On 12/04/2024 21:16, David Hildenbrand wrote:
>>>>>>>>>>
>>>>>>>>>> Yes agreed - 2 types; "lockless walkers that later recheck under PTL" and
>>>>>>>>>> "lockless walkers that never take the PTL".
>>>>>>>>>>
>>>>>>>>>> Detail: the part about disabling interrupts and TLB flush syncing is
>>>>>>>>>> arch-specifc. That's not how arm64 does it (the hw broadcasts the
>>>>>>>>>> TLBIs). But
>>>>>>>>>> you make that clear further down.
>>>>>>>>>
>>>>>>>>> Yes, but disabling interrupts is also required for RCU-freeing of page
>>>>>>>>> tables
>>>>>>>>> such that they can be walked safely. The TLB flush IPI is arch-specific
>>>>>>>>> and
>>>>>>>>> indeed to sync against PTE invalidation (before generic GUP-fast).
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Could it be this easy? My head is hurting...
>>>>>>>>>>>
>>>>>>>>>>> I think what has to happen is:
>>>>>>>>>>>
>>>>>>>>>>> (1) pte_get_lockless() must return the same value as ptep_get() as
>>>>>>>>>>> long as
>>>>>>>>>>> there
>>>>>>>>>>> are no races. No removal/addition of access/dirty bits etc.
>>>>>>>>>>
>>>>>>>>>> Today's arm64 ptep_get() guarantees this.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (2) Lockless page table walkers that later verify under the PTL can
>>>>>>>>>>> handle
>>>>>>>>>>> serious "garbage PTEs". This is our page fault handler.
>>>>>>>>>>
>>>>>>>>>> This isn't really a property of a ptep_get_lockless(); its a statement
>>>>>>>>>> about a
>>>>>>>>>> class of users. I agree with the statement.
>>>>>>>>>
>>>>>>>>> Yes. That's a requirement for the user of ptep_get_lockless(), such as
>>>>>>>>> page
>>>>>>>>> fault handlers. Well, mostly "not GUP".
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> (3) Lockless page table walkers that cannot verify under PTL cannot
>>>>>>>>>>> handle
>>>>>>>>>>> arbitrary garbage PTEs. This is GUP-fast. Two options:
>>>>>>>>>>>
>>>>>>>>>>> (3a) pte_get_lockless() can atomically read the PTE: We re-check
>>>>>>>>>>> later if
>>>>>>>>>>> the
>>>>>>>>>>> atomically-read PTE is still unchanged (without PTL). No IPI for TLB
>>>>>>>>>>> flushes
>>>>>>>>>>> required. This is the common case. HW might concurrently set
>>>>>>>>>>> access/dirty
>>>>>>>>>>> bits,
>>>>>>>>>>> so we can race with that. But we don't read garbage.
>>>>>>>>>>
>>>>>>>>>> Today's arm64 ptep_get() cannot garantee that the access/dirty bits are
>>>>>>>>>> consistent for contpte ptes. That's the bit that complicates the current
>>>>>>>>>> ptep_get_lockless() implementation.
>>>>>>>>>>
>>>>>>>>>> But the point I was trying to make is that GUP-fast does not actually
>>>>>>>>>> care
>>>>>>>>>> about
>>>>>>>>>> *all* the fields being consistent (e.g. access/dirty). So we could spec
>>>>>>>>>> pte_get_lockless() to say that "all fields in the returned pte are
>>>>>>>>>> guarranteed
>>>>>>>>>> to be self-consistent except for access and dirty information, which
>>>>>>>>>> may be
>>>>>>>>>> inconsistent if a racing modification occured".
>>>>>>>>>
>>>>>>>>> We *might* have KVM in the future want to check that a PTE is dirty, such
>>>>>>>>> that
>>>>>>>>> we can only allow dirty PTEs to be writable in a secondary MMU. That's not
>>>>>>>>> there
>>>>>>>>> yet, but one thing I was discussing on the list recently. Burried in:
>>>>>>>>>
>>>>>>>>> https://lkml.kernel.org/r/20240320005024.3216282-1-seanjc@google.com
>>>>>>>>>
>>>>>>>>> We wouldn't care about racing modifications, as long as MMU notifiers will
>>>>>>>>> properly notify us when the PTE would lose its dirty bits.
>>>>>>>>>
>>>>>>>>> But getting false-positive dirty bits would be problematic.
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> This could mean that the access/dirty state *does* change for a given
>>>>>>>>>> page
>>>>>>>>>> while
>>>>>>>>>> GUP-fast is walking it, but GUP-fast *doesn't* detect that change. I
>>>>>>>>>> *think*
>>>>>>>>>> that failing to detect this is benign.
>>>>>>>>>
>>>>>>>>> I mean, HW could just set the dirty/access bit immediately after the
>>>>>>>>> check. So
>>>>>>>>> if HW concurrently sets the bit and we don't observe that change when we
>>>>>>>>> recheck, I think that would be perfectly fine.
>>>>>>>>
>>>>>>>> Yes indeed; that's my point - GUP-fast doesn't care about access/dirty (or
>>>>>>>> soft-dirty or uffd-wp).
>>>>>>>>
>>>>>>>> But if you don't want to change the ptep_get_lockless() spec to explicitly
>>>>>>>> allow
>>>>>>>> this (because you have the KVM use case where false-positive dirty is
>>>>>>>> problematic), then I think we are stuck with ptep_get_lockless() as
>>>>>>>> implemented
>>>>>>>> for arm64 today.
>>>>>>>
>>>>>>> At least regarding the dirty bit, we'd have to guarantee that if
>>>>>>> ptep_get_lockless() returns a false-positive dirty bit, that the PTE recheck
>>>>>>> would be able to catch that.
>>>>>>>
>>>>>>> Would that be possible?
>>>>>>
>>>>>> Hmm maybe. My head hurts. Let me try to work through some examples...
>>>>>>
>>>>>>
>>>>>> Let's imagine for this example, a contpte block is 4 PTEs. Lat's say PTEs
>>>>>> 0, 1,
>>>>>> 2 and 3 initially contpte-map order-2 mTHP, FolioA. The dirty state is
>>>>>> stored in
>>>>>> PTE0 for the contpte block, and it is dirty.
>>>>>>
>>>>>> Now let's say there are 2 racing threads:
>>>>>>
>>>>>>      - ThreadA is doing a GUP-fast for PTE3
>>>>>>      - ThreadB is remapping order-0 FolioB at PTE0
>>>>>>
>>>>>> (ptep_get_lockless() below is actaully arm64's ptep_get() for the sake of the
>>>>>> example - today's arm64 ptep_get_lockless() can handle the below correctly).
>>>>>>
>>>>>> ThreadA                    ThreadB
>>>>>> =======                    =======
>>>>>>
>>>>>> gup_pte_range()
>>>>>>      pte1 = ptep_get_lockless(PTE3)
>>>>>>        READ_ONCE(PTE3)
>>>>>>                       mmap(PTE0)
>>>>>>                         clear_pte(PTE0)
>>>>>>                           unfold(PTE0 - PTE3)
>>>>>>                             WRITE_ONCE(PTE0, 0)
>>>>>>                             WRITE_ONCE(PTE1, 0)
>>>>>>                             WRITE_ONCE(PTE2, 0)
>>>>>>        READ_ONCE(PTE0) (for a/d) << CLEAN!!
>>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>>      <do speculative work with pte1 content>
>>>>>>      pte2 = ptep_get_lockless(PTE3)
>>>>>>        READ_ONCE(PTE3)
>>>>>>        READ_ONCE(PTE0) (for a/d)
>>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>>      true = pte_same(pte1, pte2)
>>>>>>                             WRITE_ONCE(PTE3, 0)
>>>>>>                             TLBI
>>>>>>                             WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>>                           WRITE_ONCE(PTE0, 0)
>>>>>>                         set_pte_at(PTE0, <new>)
>>>>>>
>>>>>> This example shows how a *false-negative* can be returned for the dirty
>>>>>> state,
>>>>>> which isn't detected by the check.
>>>>>>
>>>>>> I've been unable to come up with an example where a *false-positive* can be
>>>>>> returned for dirty state without the second ptep_get_lockless() noticing. In
>>>>>> this second example, let's assume everything is the same execpt FolioA is
>>>>>> initially clean:
>>>>>>
>>>>>> ThreadA                    ThreadB
>>>>>> =======                    =======
>>>>>>
>>>>>> gup_pte_range()
>>>>>>      pte1 = ptep_get_lockless(PTE3)
>>>>>>        READ_ONCE(PTE3)
>>>>>>                       mmap(PTE0)
>>>>>>                         clear_pte(PTE0)
>>>>>>                           unfold(PTE0 - PTE3)
>>>>>>                             WRITE_ONCE(PTE0, 0)
>>>>>>                             WRITE_ONCE(PTE1, 0)
>>>>>>                             WRITE_ONCE(PTE2, 0)
>>>>>>                             WRITE_ONCE(PTE3, 0)
>>>>>>                             TLBI
>>>>>>                             WRITE_ONCE(PTE0, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE1, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE2, <orig & ~CONT>)
>>>>>>                             WRITE_ONCE(PTE3, <orig & ~CONT>)
>>>>>>                           WRITE_ONCE(PTE0, 0)
>>>>>>                         set_pte_at(PTE0, <new>)
>>>>>>                       write to FolioB - HW sets PTE0's dirty
>>>>>>        READ_ONCE(PTE0) (for a/d) << DIRTY!!
>>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>>      <do speculative work with pte1 content>
>>>>>>      pte2 = ptep_get_lockless(PTE3)
>>>>>>        READ_ONCE(PTE3)           << BUT THIS IS FOR FolioB
>>>>>>        READ_ONCE(PTE0) (for a/d)
>>>>>>        READ_ONCE(PTE1) (for a/d)
>>>>>>        READ_ONCE(PTE2) (for a/d)
>>>>>>        READ_ONCE(PTE3) (for a/d)
>>>>>>      false = pte_same(pte1, pte2) << So this fails
>>>>>>
>>>>>> The only way I can see false-positive not being caught in the second
>>>>>> example is
>>>>>> if ThreadB subseuently remaps the original folio, so you have an ABA
>>>>>> scenario.
>>>>>> But these lockless table walkers are already suseptible to that.
>>>>>>
>>>>>> I think all the same arguments can be extended to the access bit.
>>>>>>
>>>>>>
>>>>>> For me this is all getting rather subtle and difficult to reason about and
>>>>>> even
>>>>>> harder to spec in a comprehensible way. The best I could come up with is:
>>>>>>
>>>>>> "All fields in the returned pte are guarranteed to be self-consistent except
>>>>>> for
>>>>>> access and dirty information, which may be inconsistent if a racing
>>>>>> modification
>>>>>> occured. Additionally it is guranteed that false-positive access and/or dirty
>>>>>> information is not possible if 2 calls are made and both ptes are the same.
>>>>>> Only
>>>>>> false-negative access and/or dirty information is possible in this scenario."
>>>>>>
>>>>>> which is starting to sound bonkers. Personally I think we are better off at
>>>>>> this
>>>>>> point, just keeping today's arm64 ptep_get_lockless().
>>>>>
>>>>> Remind me again, does arm64 perform an IPI broadcast during a TLB flush that
>>>>> would sync against GUP-fast?
>>>>
>>>> No, the broadcast is in HW. There is no IPI.
>>>
>>> Okay ...
>>>
>>> I agree that the semantics are a bit weird, but if we could get rid of
>>> ptep_get_lockless() on arm64, that would also be nice.
>>>
>>>
>>> Something I've been thinking of ... just to share what I've had in mind. The two
>>> types of users we currently have are:
>>>
>>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>>
>>> (2) ptep_get_lockless() followed by ptep_get() check without PTL.
>>>
>>> What if we had the following instead:
>>>
>>> (1) ptep_get_lockless() followed by ptep_get() check under PTL.
>>>
>>> (2) ptep_get_gup_fast() followed by ptep_get_gup_fast() check without
>>>      PTL.
>>>
>>> And on arm64 let
>>>
>>> (1) ptep_get_lockless() be ptep_get()
>>>
>>> (2) ptep_get_gup_fast() be __ptep_get().
>>>
>>> That would mean, that (2) would not care if another cont-pte is dirty, because
>>> we don't collect access+dirty bits. That way, we avoid any races with concurrent
>>> unfolding etc. The only "problamtic" thing is that pte_mkdirty() -> set_ptes()
>>> would have to set all cont-PTEs dirty, even if any of these already is dirty.
>>
>> I don't think the "problematic" thing is actually a problem; set_ptes() will
>> always set the dirty bit to the same value for all ptes it covers (and if you do
>> set_ptes() on a partial contpte block, it will be unfolded first). Although I
>> suspect I've misunderstood what you meant there...
> 
> It's more code like that following that I am concerned about.
> 
> if (pte_dirty()) {
>     /* Great, nothing to do */
> } else
>     mte_mkdirty();
>     set_ptes();
>     ...
> }

OK I see, so you're worried about uneccessary unfolding that the false-negative
dirty reporting could cause? I think the best solution there would be for the
core to use the clear_young_dirty_ptes(CYDP_CLEAR_DIRTY) API that Lance adds in
his series at [1]. That would avoid any unfolding and just dirty all contpte
block(s) touched by the range.

[1] https://lore.kernel.org/linux-mm/20240413002219.71246-1-ioworker0@gmail.com/

> 
>>
>> The potential problem I see with this is that the Arm ARM doesn't specify which
>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
>> randomly and this could spuriously increase your check failure rate. In reality
>> I believe most implementations will update the PTE for the address that caused
>> the TLB to be populated. But in some cases, you could have eviction (due to
>> pressure or explicit invalidation) followed by re-population due to faulting on
>> a different page of the contpte block. In this case you would see this type of
>> problem too.
>>
>> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
>> potentially false-negatives for access and dirty? Just with a much higher chance
>> of getting a false-negative. How is this helping?
> 
> You are performing an atomic read like GUP-fast wants you to. So there are no
> races to worry about like on other architectures: HW might *set* the dirty bit
> concurrently, but that's just fine.

But you can still see false-negatives for access and dirty...

> 
> The whole races you describe with concurrent folding/unfolding/ ... are irrelevant.

And I think I convinced myself that you will only see false-negatives with
today's arm64 ptep_get(). But an order or magnitude fewer than with your
proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of those).

> 
> To me that sounds ... much simpler ;) But again, just something I've been
> thinking about.

OK so this approach upgrades my "I'm fairly sure we never see false-positives"
to "we definitely never see false-positives". But it certainly increases the
quantity of false-negatives.

> 
> The reuse of pte_get_lockless() outside GUP code might not have been the wisest
> choice.
> 

If you want to go down the ptep_get_gup_fast() route, you've still got to be
able to spec it, and I think it will land pretty close to my most recent stab at
respec'ing ptep_get_lockless() a couple of replies up on this thread.

Where would your proposal leave the KVM use case? If you call it
ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would
be left with ptep_get()...

Sorry this thread is getting so long. Just to summarise, I think there are
currently 3 solutions on the table:

  - ptep_get_lockless() remains as is
  - ptep_get_lockless() wraps ptep_get()
  - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)

Based on discussion so far, that's also the order of my preference.

Perhaps its useful to enumerate why we dislike the current ptep_get_lockless()?
I think its just the potential for looping in the face of concurrent modifications?

David Hildenbrand April 15, 2024, 4:02 p.m. UTC | #21

>>> The potential problem I see with this is that the Arm ARM doesn't specify which
>>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
>>> randomly and this could spuriously increase your check failure rate. In reality
>>> I believe most implementations will update the PTE for the address that caused
>>> the TLB to be populated. But in some cases, you could have eviction (due to
>>> pressure or explicit invalidation) followed by re-population due to faulting on
>>> a different page of the contpte block. In this case you would see this type of
>>> problem too.
>>>
>>> But ultimately, isn't this basically equivalent to ptep_get_lockless() returning
>>> potentially false-negatives for access and dirty? Just with a much higher chance
>>> of getting a false-negative. How is this helping?
>>
>> You are performing an atomic read like GUP-fast wants you to. So there are no
>> races to worry about like on other architectures: HW might *set* the dirty bit
>> concurrently, but that's just fine.
> 
> But you can still see false-negatives for access and dirty...

Yes.

> 
>>
>> The whole races you describe with concurrent folding/unfolding/ ... are irrelevant.
> 
> And I think I convinced myself that you will only see false-negatives with
> today's arm64 ptep_get(). But an order or magnitude fewer than with your
> proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of those).
> 
>>
>> To me that sounds ... much simpler ;) But again, just something I've been
>> thinking about.
> 
> OK so this approach upgrades my "I'm fairly sure we never see false-positives"
> to "we definitely never see false-positives". But it certainly increases the
> quantity of false-negatives.

Yes.

> 
>>
>> The reuse of pte_get_lockless() outside GUP code might not have been the wisest
>> choice.
>>
> 
> If you want to go down the ptep_get_gup_fast() route, you've still got to be
> able to spec it, and I think it will land pretty close to my most recent stab at
> respec'ing ptep_get_lockless() a couple of replies up on this thread.
> 
> Where would your proposal leave the KVM use case? If you call it
> ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would
> be left with ptep_get()...

It's using GUP-fast.

> 
> Sorry this thread is getting so long. Just to summarise, I think there are
> currently 3 solutions on the table:
> 
>    - ptep_get_lockless() remains as is
>    - ptep_get_lockless() wraps ptep_get()
>    - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)
> 
> Based on discussion so far, that's also the order of my preference.

(1) seems like the easiest thing to do.

> 
> Perhaps its useful to enumerate why we dislike the current ptep_get_lockless()?

Well, you sent that patch series with "that aims to reduce the cost and 
complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve 
that. :)

Ryan Roberts April 23, 2024, 10:15 a.m. UTC | #22

Hi David,

Sorry for the slow reply on this; its was due to a combination of thinking a bit
more about the options here and being out on holiday.


On 15/04/2024 17:02, David Hildenbrand wrote:
>>>> The potential problem I see with this is that the Arm ARM doesn't specify which
>>>> PTE of a contpte block the HW stores a/d in. So the HW _could_ update them
>>>> randomly and this could spuriously increase your check failure rate. In reality
>>>> I believe most implementations will update the PTE for the address that caused
>>>> the TLB to be populated. But in some cases, you could have eviction (due to
>>>> pressure or explicit invalidation) followed by re-population due to faulting on
>>>> a different page of the contpte block. In this case you would see this type of
>>>> problem too.
>>>>
>>>> But ultimately, isn't this basically equivalent to ptep_get_lockless()
>>>> returning
>>>> potentially false-negatives for access and dirty? Just with a much higher
>>>> chance
>>>> of getting a false-negative. How is this helping?
>>>
>>> You are performing an atomic read like GUP-fast wants you to. So there are no
>>> races to worry about like on other architectures: HW might *set* the dirty bit
>>> concurrently, but that's just fine.
>>
>> But you can still see false-negatives for access and dirty...
> 
> Yes.
> 
>>
>>>
>>> The whole races you describe with concurrent folding/unfolding/ ... are
>>> irrelevant.
>>
>> And I think I convinced myself that you will only see false-negatives with
>> today's arm64 ptep_get(). But an order or magnitude fewer than with your
>> proposal (assuming 16 ptes per contpte block, and the a/d bits are in one of
>> those).
>>
>>>
>>> To me that sounds ... much simpler ;) But again, just something I've been
>>> thinking about.
>>
>> OK so this approach upgrades my "I'm fairly sure we never see false-positives"
>> to "we definitely never see false-positives". But it certainly increases the
>> quantity of false-negatives.
> 
> Yes.
> 
>>
>>>
>>> The reuse of pte_get_lockless() outside GUP code might not have been the wisest
>>> choice.
>>>
>>
>> If you want to go down the ptep_get_gup_fast() route, you've still got to be
>> able to spec it, and I think it will land pretty close to my most recent stab at
>> respec'ing ptep_get_lockless() a couple of replies up on this thread.
>>
>> Where would your proposal leave the KVM use case? If you call it
>> ptep_get_gup_fast() presumably you wouldn't want to use it for KVM? So it would
>> be left with ptep_get()...
> 
> It's using GUP-fast.
> 
>>
>> Sorry this thread is getting so long. Just to summarise, I think there are
>> currently 3 solutions on the table:
>>
>>    - ptep_get_lockless() remains as is
>>    - ptep_get_lockless() wraps ptep_get()
>>    - ptep_get_lockless() wraps __ptep_get() (and gets a gup_fast rename)
>>
>> Based on discussion so far, that's also the order of my preference.
> 
> (1) seems like the easiest thing to do.

Yes, I'm very much in favour of easy.

> 
>>
>> Perhaps its useful to enumerate why we dislike the current ptep_get_lockless()?
> 
> Well, you sent that patch series with "that aims to reduce the cost and
> complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve that. :)

Touche! I'd half forgotten that we were having this conversation in the context
of this series!

I guess your ptep_get_gup_fast() approach is very similar to
ptep_get_lockless_norecency()... So we are back to the beginning :)

But ultimately I've come to the conclusion that it is easy to reason about the
current arm64 ptep_get_lockless() implementation and see that its correct. The
other options both have their drawbacks.

Yes, there is a loop in the current implementation that would be nice to get rid
of, but I don't think it is really any worse than the cmpxchg loops we already
have in other helpers.

I'm not planning to persue this any further. Thanks for the useful discussion
(as always).

David Hildenbrand April 23, 2024, 10:18 a.m. UTC | #23

On 23.04.24 12:15, Ryan Roberts wrote:
> Hi David,
> 
> Sorry for the slow reply on this; its was due to a combination of thinking a bit
> more about the options here and being out on holiday.
> 

No worries, there are things more important in life than 
ptep_get_lockless() :D

>> (1) seems like the easiest thing to do.
> 
> Yes, I'm very much in favour of easy.
> 
>>
>>>
>>> Perhaps its useful to enumerate why we dislike the current ptep_get_lockless()?
>>
>> Well, you sent that patch series with "that aims to reduce the cost and
>> complexity of ptep_get_lockless() for arm64". (2) and (3) would achieve that. :)
> 
> Touche! I'd half forgotten that we were having this conversation in the context
> of this series!
> 
> I guess your ptep_get_gup_fast() approach is very similar to
> ptep_get_lockless_norecency()... So we are back to the beginning :)

Except that it would be limited to GUP-fast :)

> 
> But ultimately I've come to the conclusion that it is easy to reason about the
> current arm64 ptep_get_lockless() implementation and see that its correct. The
> other options both have their drawbacks.

Yes.

> 
> Yes, there is a loop in the current implementation that would be nice to get rid
> of, but I don't think it is really any worse than the cmpxchg loops we already
> have in other helpers.
> 
> I'm not planning to persue this any further. Thanks for the useful discussion
> (as always).

Make sense to me. let's leave it as is for the time being. (and also see 
if a GUP-fast user that needs precise dirty/accessed actually gets real)

[RFC,v1,0/4] Reduce cost of ptep_get_lockless on arm64

Message

Comments