diff mbox series

[v1,05/12] mm/memory: detect writability in restore_exclusive_pte() through can_change_pte_writable()

Message ID 20250129115411.2077152-6-david@redhat.com (mailing list archive)
State New
Headers show
Series mm: fixes for device-exclusive entries (hmm) | expand

Commit Message

David Hildenbrand Jan. 29, 2025, 11:54 a.m. UTC
Let's do it just like mprotect write-upgrade or during NUMA-hinting
faults on PROT_NONE PTEs: detect if the PTE can be writable by using
can_change_pte_writable().

Set the PTE only dirty if the folio is dirty: we might not
necessarily have a write access, and setting the PTE writable doesn't
require setting the PTE dirty.

With this change in place, there is no need to have separate
readable and writable device-exclusive entry types, and we'll merge
them next separately.

Note that, during fork(), we first convert the device-exclusive entries
back to ordinary PTEs, and we only ever allow conversion of writable
PTEs to device-exclusive -- only mprotect can currently change them to
readable-device-exclusive. Consequently, we always expect
PageAnonExclusive(page)==true and can_change_pte_writable()==true,
unless we are dealing with soft-dirty tracking or uffd-wp. But reusing
can_change_pte_writable() for now is cleaner.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

Comments

Simona Vetter Jan. 30, 2025, 9:51 a.m. UTC | #1
On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
> Let's do it just like mprotect write-upgrade or during NUMA-hinting
> faults on PROT_NONE PTEs: detect if the PTE can be writable by using
> can_change_pte_writable().
> 
> Set the PTE only dirty if the folio is dirty: we might not
> necessarily have a write access, and setting the PTE writable doesn't
> require setting the PTE dirty.

Not sure whether there's much difference in practice, since a device
exclusive access means a write, so the folio better be dirty (unless we
aborted halfway through). But then I couldn't find the code in nouveau to
do that, so now I'm confused.
-Sima

> With this change in place, there is no need to have separate
> readable and writable device-exclusive entry types, and we'll merge
> them next separately.
> 
> Note that, during fork(), we first convert the device-exclusive entries
> back to ordinary PTEs, and we only ever allow conversion of writable
> PTEs to device-exclusive -- only mprotect can currently change them to
> readable-device-exclusive. Consequently, we always expect
> PageAnonExclusive(page)==true and can_change_pte_writable()==true,
> unless we are dealing with soft-dirty tracking or uffd-wp. But reusing
> can_change_pte_writable() for now is cleaner.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 11 +++++++----
>  1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 03efeeef895a..db38d6ae4e74 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -725,18 +725,21 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
>  	struct folio *folio = page_folio(page);
>  	pte_t orig_pte;
>  	pte_t pte;
> -	swp_entry_t entry;
>  
>  	orig_pte = ptep_get(ptep);
>  	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
>  	if (pte_swp_soft_dirty(orig_pte))
>  		pte = pte_mksoft_dirty(pte);
>  
> -	entry = pte_to_swp_entry(orig_pte);
>  	if (pte_swp_uffd_wp(orig_pte))
>  		pte = pte_mkuffd_wp(pte);
> -	else if (is_writable_device_exclusive_entry(entry))
> -		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> +
> +	if ((vma->vm_flags & VM_WRITE) &&
> +	    can_change_pte_writable(vma, address, pte)) {
> +		if (folio_test_dirty(folio))
> +			pte = pte_mkdirty(pte);
> +		pte = pte_mkwrite(pte, vma);
> +	}
>  
>  	VM_BUG_ON_FOLIO(pte_write(pte) && (!folio_test_anon(folio) &&
>  					   PageAnonExclusive(page)), folio);
> -- 
> 2.48.1
>
David Hildenbrand Jan. 30, 2025, 9:58 a.m. UTC | #2
On 30.01.25 10:51, Simona Vetter wrote:
> On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
>> Let's do it just like mprotect write-upgrade or during NUMA-hinting
>> faults on PROT_NONE PTEs: detect if the PTE can be writable by using
>> can_change_pte_writable().
>>
>> Set the PTE only dirty if the folio is dirty: we might not
>> necessarily have a write access, and setting the PTE writable doesn't
>> require setting the PTE dirty.
> 
> Not sure whether there's much difference in practice, since a device
> exclusive access means a write, so the folio better be dirty (unless we
> aborted halfway through). But then I couldn't find the code in nouveau to
> do that, so now I'm confused.

That confused me as well. Requiring the PTE to be writable does not 
imply that it is dirty.

So something must either set the PTE or the folio dirty.

( In practice, most anonymous folios are dirty most of the time ... )

If we assume that "device-exclusive entries" are always dirty, then it 
doesn't make sense to set the folio dirty when creating device-exclusive 
entries. We'd always have to set the PTE dirty when restoring the 
exclusive pte.
Simona Vetter Jan. 30, 2025, 1:03 p.m. UTC | #3
On Thu, Jan 30, 2025 at 10:58:51AM +0100, David Hildenbrand wrote:
> On 30.01.25 10:51, Simona Vetter wrote:
> > On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
> > > Let's do it just like mprotect write-upgrade or during NUMA-hinting
> > > faults on PROT_NONE PTEs: detect if the PTE can be writable by using
> > > can_change_pte_writable().
> > > 
> > > Set the PTE only dirty if the folio is dirty: we might not
> > > necessarily have a write access, and setting the PTE writable doesn't
> > > require setting the PTE dirty.
> > 
> > Not sure whether there's much difference in practice, since a device
> > exclusive access means a write, so the folio better be dirty (unless we
> > aborted halfway through). But then I couldn't find the code in nouveau to
> > do that, so now I'm confused.
> 
> That confused me as well. Requiring the PTE to be writable does not imply
> that it is dirty.
> 
> So something must either set the PTE or the folio dirty.

Yeah I'm not finding that something.

> ( In practice, most anonymous folios are dirty most of the time ... )

And yup that's why I think it hasn't blown up yet.

> If we assume that "device-exclusive entries" are always dirty, then it
> doesn't make sense to set the folio dirty when creating device-exclusive
> entries. We'd always have to set the PTE dirty when restoring the exclusive
> pte.

I do agree with your change, I think it's correct to put this
responsibility onto drivers. It's just that nouveau seems to not be
entirely correct.

And thinking about this I have vague memories that I've discussed the case
of the missing folio_set_dirty in noveau hmm code before, maybe with
Alistair. But quick search in archives didn't turn up anything.
-Sima
Alistair Popple Jan. 30, 2025, 11:06 p.m. UTC | #4
On Thu, Jan 30, 2025 at 02:03:42PM +0100, Simona Vetter wrote:
> On Thu, Jan 30, 2025 at 10:58:51AM +0100, David Hildenbrand wrote:
> > On 30.01.25 10:51, Simona Vetter wrote:
> > > On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
> > > > Let's do it just like mprotect write-upgrade or during NUMA-hinting
> > > > faults on PROT_NONE PTEs: detect if the PTE can be writable by using
> > > > can_change_pte_writable().
> > > > 
> > > > Set the PTE only dirty if the folio is dirty: we might not
> > > > necessarily have a write access, and setting the PTE writable doesn't
> > > > require setting the PTE dirty.
> > > 
> > > Not sure whether there's much difference in practice, since a device
> > > exclusive access means a write, so the folio better be dirty (unless we
> > > aborted halfway through). But then I couldn't find the code in nouveau to
> > > do that, so now I'm confused.
> > 
> > That confused me as well. Requiring the PTE to be writable does not imply
> > that it is dirty.
> > 
> > So something must either set the PTE or the folio dirty.
> 
> Yeah I'm not finding that something.
> 
> > ( In practice, most anonymous folios are dirty most of the time ... )
> 
> And yup that's why I think it hasn't blown up yet.
> 
> > If we assume that "device-exclusive entries" are always dirty, then it
> > doesn't make sense to set the folio dirty when creating device-exclusive
> > entries. We'd always have to set the PTE dirty when restoring the exclusive
> > pte.
> 
> I do agree with your change, I think it's correct to put this
> responsibility onto drivers. It's just that nouveau seems to not be
> entirely correct.

Yeah, agree it should be a driver responsibility but also can't see how nouveau
is correct there either. I might see if I can get it to blow up...

> And thinking about this I have vague memories that I've discussed the case
> of the missing folio_set_dirty in noveau hmm code before, maybe with
> Alistair. But quick search in archives didn't turn up anything.

I have vague recollections of that, but I could be confusing it with some of the
migrate_vma_*() issues we had dropping dirty bits (see
https://lkml.kernel.org/r/dd48e4882ce859c295c1a77612f66d198b0403f9.1662078528.git-series.apopple@nvidia.com)

> -Sima
> -- 
> Simona Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch
David Hildenbrand Jan. 31, 2025, 10:55 a.m. UTC | #5
On 31.01.25 00:06, Alistair Popple wrote:
> On Thu, Jan 30, 2025 at 02:03:42PM +0100, Simona Vetter wrote:
>> On Thu, Jan 30, 2025 at 10:58:51AM +0100, David Hildenbrand wrote:
>>> On 30.01.25 10:51, Simona Vetter wrote:
>>>> On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
>>>>> Let's do it just like mprotect write-upgrade or during NUMA-hinting
>>>>> faults on PROT_NONE PTEs: detect if the PTE can be writable by using
>>>>> can_change_pte_writable().
>>>>>
>>>>> Set the PTE only dirty if the folio is dirty: we might not
>>>>> necessarily have a write access, and setting the PTE writable doesn't
>>>>> require setting the PTE dirty.
>>>>
>>>> Not sure whether there's much difference in practice, since a device
>>>> exclusive access means a write, so the folio better be dirty (unless we
>>>> aborted halfway through). But then I couldn't find the code in nouveau to
>>>> do that, so now I'm confused.
>>>
>>> That confused me as well. Requiring the PTE to be writable does not imply
>>> that it is dirty.
>>>
>>> So something must either set the PTE or the folio dirty.
>>
>> Yeah I'm not finding that something.
>>
>>> ( In practice, most anonymous folios are dirty most of the time ... )
>>
>> And yup that's why I think it hasn't blown up yet.
>>
>>> If we assume that "device-exclusive entries" are always dirty, then it
>>> doesn't make sense to set the folio dirty when creating device-exclusive
>>> entries. We'd always have to set the PTE dirty when restoring the exclusive
>>> pte.
>>
>> I do agree with your change, I think it's correct to put this
>> responsibility onto drivers. It's just that nouveau seems to not be
>> entirely correct.
> 
> Yeah, agree it should be a driver responsibility but also can't see how nouveau
> is correct there either. I might see if I can get it to blow up...

(in context of the rmap walkers) The question is, how do we consider 
device-exclusive entries:

(1) dirty? Not from a CPU perspective.
(2) referenced? Not from a CPU perspective.

If the answer is always "no" to all questions, then memory notifiers 
must handle it, because we'd be answering the question from the CPU 
point of view.

If the answer is always "yes", there is a problem: we can only make it 
clean/young by converting it to an ordinary PTE first (requiring MMU 
notifiers etc.), which makes it quite nasty.

Mixed answers are not possible, because we don't know just from staring 
at the entry.
Simona Vetter Jan. 31, 2025, 5:05 p.m. UTC | #6
On Fri, Jan 31, 2025 at 11:55:55AM +0100, David Hildenbrand wrote:
> On 31.01.25 00:06, Alistair Popple wrote:
> > On Thu, Jan 30, 2025 at 02:03:42PM +0100, Simona Vetter wrote:
> > > On Thu, Jan 30, 2025 at 10:58:51AM +0100, David Hildenbrand wrote:
> > > > On 30.01.25 10:51, Simona Vetter wrote:
> > > > > On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
> > > > > > Let's do it just like mprotect write-upgrade or during NUMA-hinting
> > > > > > faults on PROT_NONE PTEs: detect if the PTE can be writable by using
> > > > > > can_change_pte_writable().
> > > > > > 
> > > > > > Set the PTE only dirty if the folio is dirty: we might not
> > > > > > necessarily have a write access, and setting the PTE writable doesn't
> > > > > > require setting the PTE dirty.
> > > > > 
> > > > > Not sure whether there's much difference in practice, since a device
> > > > > exclusive access means a write, so the folio better be dirty (unless we
> > > > > aborted halfway through). But then I couldn't find the code in nouveau to
> > > > > do that, so now I'm confused.
> > > > 
> > > > That confused me as well. Requiring the PTE to be writable does not imply
> > > > that it is dirty.
> > > > 
> > > > So something must either set the PTE or the folio dirty.
> > > 
> > > Yeah I'm not finding that something.
> > > 
> > > > ( In practice, most anonymous folios are dirty most of the time ... )
> > > 
> > > And yup that's why I think it hasn't blown up yet.
> > > 
> > > > If we assume that "device-exclusive entries" are always dirty, then it
> > > > doesn't make sense to set the folio dirty when creating device-exclusive
> > > > entries. We'd always have to set the PTE dirty when restoring the exclusive
> > > > pte.
> > > 
> > > I do agree with your change, I think it's correct to put this
> > > responsibility onto drivers. It's just that nouveau seems to not be
> > > entirely correct.
> > 
> > Yeah, agree it should be a driver responsibility but also can't see how nouveau
> > is correct there either. I might see if I can get it to blow up...
> 
> (in context of the rmap walkers) The question is, how do we consider
> device-exclusive entries:
> 
> (1) dirty? Not from a CPU perspective.
> (2) referenced? Not from a CPU perspective.
> 
> If the answer is always "no" to all questions, then memory notifiers must
> handle it, because we'd be answering the question from the CPU point of
> view.
> 
> If the answer is always "yes", there is a problem: we can only make it
> clean/young by converting it to an ordinary PTE first (requiring MMU
> notifiers etc.), which makes it quite nasty.
> 
> Mixed answers are not possible, because we don't know just from staring at
> the entry.

I think it's the gpu's (or whatever is using it) responsibility to update
folio state while it has ptes pointing at memory. Whether that's
device-exclusive system memory or device-private migrated memory. Anything
else doesn't make sense to me conceptually.

And I don't think we can just blindly assume even for device-exclusive
mappings that they will be dirty when we convert them back to a real pte,
because we might have raced trying to set up the gpu mapping and restarted
before we even put the pte into place. Or maybe someone was real quick at
writing it back after the gpu already dropped it's pte.

I guess maybe some clear documentation in all these functions
(make_device_exclusive, hmm_range_fault, migration helpers) that it's the
drivers job to dirty pages correctly would help?

But nouveau definitely does not look very correct here, pretty sure on
that.

Cheers, Sima
David Hildenbrand Feb. 4, 2025, 10:58 a.m. UTC | #7
On 31.01.25 18:05, Simona Vetter wrote:
> On Fri, Jan 31, 2025 at 11:55:55AM +0100, David Hildenbrand wrote:
>> On 31.01.25 00:06, Alistair Popple wrote:
>>> On Thu, Jan 30, 2025 at 02:03:42PM +0100, Simona Vetter wrote:
>>>> On Thu, Jan 30, 2025 at 10:58:51AM +0100, David Hildenbrand wrote:
>>>>> On 30.01.25 10:51, Simona Vetter wrote:
>>>>>> On Wed, Jan 29, 2025 at 12:54:03PM +0100, David Hildenbrand wrote:
>>>>>>> Let's do it just like mprotect write-upgrade or during NUMA-hinting
>>>>>>> faults on PROT_NONE PTEs: detect if the PTE can be writable by using
>>>>>>> can_change_pte_writable().
>>>>>>>
>>>>>>> Set the PTE only dirty if the folio is dirty: we might not
>>>>>>> necessarily have a write access, and setting the PTE writable doesn't
>>>>>>> require setting the PTE dirty.
>>>>>>
>>>>>> Not sure whether there's much difference in practice, since a device
>>>>>> exclusive access means a write, so the folio better be dirty (unless we
>>>>>> aborted halfway through). But then I couldn't find the code in nouveau to
>>>>>> do that, so now I'm confused.
>>>>>
>>>>> That confused me as well. Requiring the PTE to be writable does not imply
>>>>> that it is dirty.
>>>>>
>>>>> So something must either set the PTE or the folio dirty.
>>>>
>>>> Yeah I'm not finding that something.
>>>>
>>>>> ( In practice, most anonymous folios are dirty most of the time ... )
>>>>
>>>> And yup that's why I think it hasn't blown up yet.
>>>>
>>>>> If we assume that "device-exclusive entries" are always dirty, then it
>>>>> doesn't make sense to set the folio dirty when creating device-exclusive
>>>>> entries. We'd always have to set the PTE dirty when restoring the exclusive
>>>>> pte.
>>>>
>>>> I do agree with your change, I think it's correct to put this
>>>> responsibility onto drivers. It's just that nouveau seems to not be
>>>> entirely correct.
>>>
>>> Yeah, agree it should be a driver responsibility but also can't see how nouveau
>>> is correct there either. I might see if I can get it to blow up...
>>
>> (in context of the rmap walkers) The question is, how do we consider
>> device-exclusive entries:
>>
>> (1) dirty? Not from a CPU perspective.
>> (2) referenced? Not from a CPU perspective.
>>
>> If the answer is always "no" to all questions, then memory notifiers must
>> handle it, because we'd be answering the question from the CPU point of
>> view.
>>
>> If the answer is always "yes", there is a problem: we can only make it
>> clean/young by converting it to an ordinary PTE first (requiring MMU
>> notifiers etc.), which makes it quite nasty.
>>
>> Mixed answers are not possible, because we don't know just from staring at
>> the entry.
> 
> I think it's the gpu's (or whatever is using it) responsibility to update
> folio state while it has ptes pointing at memory. Whether that's
> device-exclusive system memory or device-private migrated memory. Anything
> else doesn't make sense to me conceptually.
> 
> And I don't think we can just blindly assume even for device-exclusive
> mappings that they will be dirty when we convert them back to a real pte,
> because we might have raced trying to set up the gpu mapping and restarted
> before we even put the pte into place. Or maybe someone was real quick at
> writing it back after the gpu already dropped it's pte.
> 
> I guess maybe some clear documentation in all these functions
> (make_device_exclusive, hmm_range_fault, migration helpers) that it's the
> drivers job to dirty pages correctly would help?


I'll add a comment to make_device_exclusive(), stating that these 
entries are considered clean+old from a MM perspective, and that the 
driver must update the folio when notified by MMU notifiers.

We should probably document that somewhere else as well as you suggested 
separately.
diff mbox series

Patch

diff --git a/mm/memory.c b/mm/memory.c
index 03efeeef895a..db38d6ae4e74 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -725,18 +725,21 @@  static void restore_exclusive_pte(struct vm_area_struct *vma,
 	struct folio *folio = page_folio(page);
 	pte_t orig_pte;
 	pte_t pte;
-	swp_entry_t entry;
 
 	orig_pte = ptep_get(ptep);
 	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
 	if (pte_swp_soft_dirty(orig_pte))
 		pte = pte_mksoft_dirty(pte);
 
-	entry = pte_to_swp_entry(orig_pte);
 	if (pte_swp_uffd_wp(orig_pte))
 		pte = pte_mkuffd_wp(pte);
-	else if (is_writable_device_exclusive_entry(entry))
-		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+
+	if ((vma->vm_flags & VM_WRITE) &&
+	    can_change_pte_writable(vma, address, pte)) {
+		if (folio_test_dirty(folio))
+			pte = pte_mkdirty(pte);
+		pte = pte_mkwrite(pte, vma);
+	}
 
 	VM_BUG_ON_FOLIO(pte_write(pte) && (!folio_test_anon(folio) &&
 					   PageAnonExclusive(page)), folio);