
[v1,12/12] mm/rmap: keep mapcount untouched for device-exclusive entries

Message ID 20250129115411.2077152-13-david@redhat.com (mailing list archive)
State New
Series mm: fixes for device-exclusive entries (hmm)

Commit Message

David Hildenbrand Jan. 29, 2025, 11:54 a.m. UTC
Now that conversion to device-exclusive no longer performs an rmap walk
and the main page_vma_mapped_walk() users were taught to properly handle
non-swap entries, let's treat device-exclusive entries just as if they
were present, similar to how we already handle device-private entries.

This fixes swapout/migration of folios with device-exclusive entries.

Likely there are still some page_vma_mapped_walk() callers that are not
fully prepared for these entries, and where we simply want to refuse
!pte_present() entries. They have to be fixed independently; the ones in
mm/rmap.c are prepared.

Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 17 +----------------
 mm/rmap.c   |  7 -------
 2 files changed, 1 insertion(+), 23 deletions(-)

Comments

Simona Vetter Jan. 30, 2025, 10:37 a.m. UTC | #1
On Wed, Jan 29, 2025 at 12:54:10PM +0100, David Hildenbrand wrote:
> Now that conversion to device-exclusive no longer performs an rmap walk
> and the main page_vma_mapped_walk() users were taught to properly handle
> non-swap entries, let's treat device-exclusive entries just as if they
> were present, similar to how we already handle device-private entries.

So the reason for handling device-private entries in rmap is so that
drivers can rely on try_to_migrate and related code to invalidate all the
various ptes even for device private memory. Otherwise no one should hit
this path, at least if my understanding is correct.

So I'm very much worried about opening a can of worms here because I think
this adds a genuine new case to all the various callers.

> This fixes swapout/migration of folios with device-exclusive entries.
> 
> Likely there are still some page_vma_mapped_walk() callers that are not
> fully prepared for these entries, and where we simply want to refuse
> !pte_present() entries. They have to be fixed independently; the ones in
> mm/rmap.c are prepared.

The other worry is that maybe breaking migration is a feature, at least in
parts. If thp constantly reassembles a pmd entry because hey all the
memory is contig and userspace allocated a chunk of memory to place
atomics that alternate between cpu and gpu nicely separated by 4k pages,
then we'll thrash around invalidating ptes to no end. So might be more
fallout here.
-Sima

> 
> Fixes: b756a3b5e7ea ("mm: device exclusive memory access")
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 17 +----------------
>  mm/rmap.c   |  7 -------
>  2 files changed, 1 insertion(+), 23 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index db38d6ae4e74..cd689cd8a7c8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -743,20 +743,6 @@ static void restore_exclusive_pte(struct vm_area_struct *vma,
>  
>  	VM_BUG_ON_FOLIO(pte_write(pte) && (!folio_test_anon(folio) &&
>  					   PageAnonExclusive(page)), folio);
> -
> -	/*
> -	 * No need to take a page reference as one was already
> -	 * created when the swap entry was made.
> -	 */
> -	if (folio_test_anon(folio))
> -		folio_add_anon_rmap_pte(folio, page, vma, address, RMAP_NONE);
> -	else
> -		/*
> -		 * Currently device exclusive access only supports anonymous
> -		 * memory so the entry shouldn't point to a filebacked page.
> -		 */
> -		WARN_ON_ONCE(1);
> -
>  	set_pte_at(vma->vm_mm, address, ptep, pte);
>  
>  	/*
> @@ -1628,8 +1614,7 @@ static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
>  		 */
>  		WARN_ON_ONCE(!vma_is_anonymous(vma));
>  		rss[mm_counter(folio)]--;
> -		if (is_device_private_entry(entry))
> -			folio_remove_rmap_pte(folio, page, vma);
> +		folio_remove_rmap_pte(folio, page, vma);
>  		folio_put(folio);
>  	} else if (!non_swap_entry(entry)) {
>  		/* Genuine swap entries, hence a private anon pages */
> diff --git a/mm/rmap.c b/mm/rmap.c
> index 9e2002d97d6f..4acc9f6d743a 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2495,13 +2495,6 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
>  	/* The pte is writable, uffd-wp does not apply. */
>  	set_pte_at(mm, addr, fw.ptep, swp_pte);
>  
> -	/*
> -	 * TODO: The device-exclusive non-swap PTE holds a folio reference but
> -	 * does not count as a mapping (mapcount), which is wrong and must be
> -	 * fixed, otherwise RMAP walks don't behave as expected.
> -	 */
> -	folio_remove_rmap_pte(folio, page, vma);
> -
>  	folio_walk_end(&fw, vma);
>  	*foliop = folio;
>  	return page;
> -- 
> 2.48.1
>
David Hildenbrand Jan. 30, 2025, 11:42 a.m. UTC | #2
On 30.01.25 11:37, Simona Vetter wrote:
> On Wed, Jan 29, 2025 at 12:54:10PM +0100, David Hildenbrand wrote:
>> Now that conversion to device-exclusive no longer performs an rmap walk
>> and the main page_vma_mapped_walk() users were taught to properly handle
>> non-swap entries, let's treat device-exclusive entries just as if they
>> were present, similar to how we already handle device-private entries.
> 
> So the reason for handling device-private entries in rmap is so that
> drivers can rely on try_to_migrate and related code to invalidate all the
> various ptes even for device private memory. Otherwise no one should hit
> this path, at least if my understanding is correct.

Right, device-private entries probably only happen to be seen on the
migration path so far.

> 
> So I'm very much worried about opening a can of worms here because I think
> this adds a genuine new case to all the various callers.

To be clear: it can all already happen.

Assume you have a THP (or any mTHP today). You can easily trigger the 
scenario that folio_mapcount() != 0 with active device-exclusive 
entries, and you start doing rmap walks and stumble over these 
device-exclusive entries and *not* handle them properly. Note that more 
and more systems are configured to just give you THP unless you 
explicitly opted-out using MADV_NOHUGEPAGE early.

Note that b756a3b5e7ea added that hunk that still walks these 
device-exclusive entries in rmap code, but didn't actually update the 
rmap walkers:

@@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)

                 /* Handle un-addressable ZONE_DEVICE memory */
                 entry = pte_to_swp_entry(*pvmw->pte);
-               if (!is_device_private_entry(entry))
+               if (!is_device_private_entry(entry) &&
+                   !is_device_exclusive_entry(entry))
                         return false;

                 pfn = swp_offset(entry);

That was the right thing to do, because they resemble PROT_NONE entries
and not migration entries or anything else that doesn't hold a folio
reference.
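
For illustration only (nothing from this series; the helper name is made
up): a walker that finds such an entry can still get to the page, because
the entry encodes the pfn and itself holds the folio reference:

static struct page *example_entry_to_page(pte_t *ptep)
{
	swp_entry_t entry;

	if (pte_present(ptep_get(ptep)))
		return NULL;		/* ordinary present mapping */

	entry = pte_to_swp_entry(ptep_get(ptep));
	if (!is_device_exclusive_entry(entry))
		return NULL;		/* some other non-present entry */

	/* the entry pins the folio, so the page cannot go away under us */
	return pfn_swap_entry_to_page(entry);
}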

Fortunately, it's only the page_vma_mapped_walk() callers that need care.

mm/rmap.c is handled with this series.

mm/page_vma_mapped.c should work already.

mm/migrate.c: does not apply

mm/page_idle.c: likely should just skip !pte_present().

mm/ksm.c might be fine, but likely we should just reject !pte_present().

kernel/events/uprobes.c likely should reject !pte_present().

mm/damon/paddr.c likely should reject !pte_present().


I briefly thought about a flag to indicate whether a page_vma_mapped_walk() 
caller supports these non-present entries, but likely just fixing them up is 
easier+cleaner.

Now that I looked at them all, I might just write patches for them.
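
For the "likely should just reject !pte_present()" cases above, the fix-up
is essentially a one-liner in the page_vma_mapped_walk() loop. Rough sketch
only -- the surrounding rmap_one-style function is made up:

static bool example_walk_one(struct folio *folio, struct vm_area_struct *vma,
			     unsigned long addr, void *arg)
{
	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, addr, 0);

	while (page_vma_mapped_walk(&pvmw)) {
		/* PTE-mapped device-exclusive entries now show up here */
		if (pvmw.pte && !pte_present(ptep_get(pvmw.pte)))
			continue;	/* non-present: nothing for us to do */

		/* ... operate on the present PTE (or PMD) as before ... */
	}
	return true;
}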

> 
>> This fixes swapout/migration of folios with device-exclusive entries.
>>
>> Likely there are still some page_vma_mapped_walk() callers that are not
>> fully prepared for these entries, and where we simply want to refuse
>> !pte_present() entries. They have to be fixed independently; the ones in
>> mm/rmap.c are prepared.
> 
> The other worry is that maybe breaking migration is a feature, at least in
> parts. 

Maybe breaking swap and migration is a feature in some reality, in this 
reality it's a BUG :)

> If thp constantly reassembles a pmd entry because hey all the
> memory is contig and userspace allocated a chunk of memory to place
> atomics that alternate between cpu and gpu nicely separated by 4k pages,
> then we'll thrash around invalidating ptes to no end. So might be more
> fallout here.

khugepaged will back off once it sees an exclusive entry, so collapsing 
could only happen once everything is non-exclusive. See 
__collapse_huge_page_isolate() as an example.
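
Very rough sketch of that bail-out (not the literal khugepaged code, the
helper name is invented; SCAN_PTE_NON_PRESENT is the scan result it uses):

static int example_scan_range(pte_t *pte, unsigned long addr)
{
	pte_t *_pte;

	for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
	     _pte++, addr += PAGE_SIZE) {
		pte_t pteval = ptep_get(_pte);

		if (pte_none(pteval))
			continue;	/* empty slots tolerated (up to a limit) */
		if (!pte_present(pteval))
			return SCAN_PTE_NON_PRESENT;	/* e.g. device-exclusive */

		/* ... the usual refcount/mapcount/isolation checks ... */
	}
	return SCAN_SUCCEED;
}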

It's really only page_vma_mapped_walk() callers that are affected by 
this change, not any other page table walkers.


It's unfortunate that we now have to fix it all up; that original series 
should never have been merged that way.
Simona Vetter Jan. 30, 2025, 1:19 p.m. UTC | #3
On Thu, Jan 30, 2025 at 12:42:26PM +0100, David Hildenbrand wrote:
> On 30.01.25 11:37, Simona Vetter wrote:
> > On Wed, Jan 29, 2025 at 12:54:10PM +0100, David Hildenbrand wrote:
> > > Now that conversion to device-exclusive no longer performs an rmap
> > > walk and the main page_vma_mapped_walk() users were taught to
> > > properly handle non-swap entries, let's treat device-exclusive entries
> > > just as if they were present, similar to how we already handle
> > > device-private entries.
> > 
> > So the reason for handling device-private entries in rmap is so that
> > drivers can rely on try_to_migrate and related code to invalidate all the
> > various ptes even for device private memory. Otherwise no one should hit
> > this path, at least if my understanding is correct.
> 
> Right, device-private entries probably only happen to be seen on the
> migration path so far.
> 
> > 
> > So I'm very much worried about opening a can of worms here because I think
> > this adds a genuine new case to all the various callers.
> 
> To be clear: it can all already happen.
> 
> Assume you have a THP (or any mTHP today). You can easily trigger the
> scenario that folio_mapcount() != 0 with active device-exclusive entries,
> and you start doing rmap walks and stumble over these device-exclusive
> entries and *not* handle them properly. Note that more and more systems are
> configured to just give you THP unless you explicitly opted-out using
> MADV_NOHUGEPAGE early.
> 
> Note that b756a3b5e7ea added that hunk that still walks these
> device-exclusive entries in rmap code, but didn't actually update the rmap
> walkers:
> 
> @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
> 
>                 /* Handle un-addressable ZONE_DEVICE memory */
>                 entry = pte_to_swp_entry(*pvmw->pte);
> -               if (!is_device_private_entry(entry))
> +               if (!is_device_private_entry(entry) &&
> +                   !is_device_exclusive_entry(entry))
>                         return false;
> 
>                 pfn = swp_offset(entry);
> 
> That was the right thing to do, because they resemble PROT_NONE entries and
> not migration entries or anything else that doesn't hold a folio reference.

Yeah I got that part. What I meant is that doubling down on this needs a
full audit and cannot rely on "we already have device private entries
going through these paths for much longer", which was the impression I
got. I guess it worked, thanks for doing that below :-)

And at least from my very rough understanding of mm, at least around all
this gpu stuff, tracking device exclusive mappings like real cpu mappings
makes sense, they do indeed act like PROT_NONE with some magic to restore
access on fault.

I do wonder a bit though what else is all not properly tracked because
they should be like prot_none except aren't. I guess we'll find those as we
hit them :-/

> Fortunately, it's only the page_vma_mapped_walk() callers that need care.
> 
> mm/rmap.c is handled with this series.
> 
> mm/page_vma_mapped.c should work already.
> 
> mm/migrate.c: does not apply
> 
> mm/page_idle.c: likely should just skip !pte_present().
> 
> mm/ksm.c might be fine, but likely we should just reject !pte_present().
> 
> kernel/events/uprobes.c likely should reject !pte_present().
> 
> mm/damon/paddr.c likely should reject !pte_present().
> 
> 
> I briefly thought about a flag to indicate whether a page_vma_mapped_walk()
> caller supports these non-present entries, but likely just fixing them up is
> easier+cleaner.
> 
> Now that I looked at them all, I might just write patches for them.
> 
> > 
> > > This fixes swapout/migration of folios with device-exclusive entries.
> > > 
> > > Likely there are still some page_vma_mapped_walk() callers that are not
> > > fully prepared for these entries, and where we simply want to refuse
> > > !pte_present() entries. They have to be fixed independently; the ones in
> > > mm/rmap.c are prepared.
> > 
> > The other worry is that maybe breaking migration is a feature, at least in
> > parts.
> 
> Maybe breaking swap and migration is a feature in some reality, in this
> reality it's a BUG :)

Oh yeah I agree on those :-)

> > If thp constantly reassembles a pmd entry because hey all the
> > memory is contig and userspace allocated a chunk of memory to place
> > atomics that alternate between cpu and gpu nicely separated by 4k pages,
> > then we'll thrash around invalidating ptes to no end. So might be more
> > fallout here.
> 
> khugepaged will back off once it sees an exclusive entry, so collapsing
> could only happen once everything is non-exclusive. See
> __collapse_huge_page_isolate() as an example.

Ah ok. I think it might be good to add that to the commit message, so that
people who don't understand mm deeply (like me) aren't worried when they
stumble over this change in the future again when digging around.

> It's really only page_vma_mapped_walk() callers that are affected by this
> change, not any other page table walkers.

I guess my mm understanding is just not up to that, but I couldn't figure
out why looking only at page_vma_mapped_walk() is good enough?

> It's unfortunate that we now have to fix it all up; that original series
> should never have been merged that way.

Yeah looks like a rather big miss.
-Sima
David Hildenbrand Jan. 30, 2025, 3:43 p.m. UTC | #4
>> Assume you have a THP (or any mTHP today). You can easily trigger the
>> scenario that folio_mapcount() != 0 with active device-exclusive entries,
>> and you start doing rmap walks and stumble over these device-exclusive
>> entries and *not* handle them properly. Note that more and more systems are
>> configured to just give you THP unless you explicitly opted-out using
>> MADV_NOHUGEPAGE early.
>>
>> Note that b756a3b5e7ea added that hunk that still walks these
>> device-exclusive entries in rmap code, but didn't actually update the rmap
>> walkers:
>>
>> @@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
>>
>>                  /* Handle un-addressable ZONE_DEVICE memory */
>>                  entry = pte_to_swp_entry(*pvmw->pte);
>> -               if (!is_device_private_entry(entry))
>> +               if (!is_device_private_entry(entry) &&
>> +                   !is_device_exclusive_entry(entry))
>>                          return false;
>>
>>                  pfn = swp_offset(entry);
>>
>> That was the right thing to do, because they resemble PROT_NONE entries and
>> not migration entries or anything else that doesn't hold a folio reference.
> 
> Yeah I got that part. What I meant is that doubling down on this needs a
> full audit and cannot rely on "we already have device private entries
> going through these paths for much longer", which was the impression I
> got. I guess it worked, thanks for doing that below :-)

I know I know, I shouldn't have touched it ... :)

So yeah, I'll spend some extra work on sorting out the other cases.

> 
> And at least from my very rough understanding of mm, at least around all
> this gpu stuff, tracking device exclusive mappings like real cpu mappings
> makes sense, they do indeed act like PROT_NONE with some magic to restore
> access on fault.
> 
> I do wonder a bit though what else is all not properly tracked because
> they should be like prot_none except aren't. I guess we'll find those as we
> hit them :-/

Likely a lot of stuff. But more in an "entry gets ignored -- 
functionality not implemented, move along" way, because all page table 
walkers have to care about !pte_present() already; it's just RMAP code 
that so far never required it.

[...]

> 
>>> If thp constantly reassembles a pmd entry because hey all the
>>> memory is contig and userspace allocated a chunk of memory to place
>>> atomics that alternate between cpu and gpu nicely separated by 4k pages,
>>> then we'll thrash around invalidating ptes to no end. So might be more
>>> fallout here.
>>
>> khugepaged will back off once it sees an exclusive entry, so collapsing
>> could only happen once everything is non-exclusive. See
>> __collapse_huge_page_isolate() as an example.
> 
> Ah ok. I think it might be good to add that to the commit message, so that
> people who don't understand mm deeply (like me) aren't worried when they
> stumble over this change in the future again when digging around.

Will do, thanks for raising that concern!

> 
>> It's really only page_vma_mapped_walk() callers that are affected by this
>> change, not any other page table walkers.
> 
> I guess my mm understanding is just not up to that, but I couldn't figure
> out why looking only at page_vma_mapped_walk() is good enough?

See above: these never had to handle !pte_present() before -- in 
contrast to the other page table walkers.

So nothing bad happens when these page table walkers traverse these 
PTEs, it's just that the functionality will usually not be implemented.

Take MADV_PAGEOUT as an example: madvise_cold_or_pageout_pte_range() 
will simply skip "!pte_present()", because it wouldn't know what to do 
in that case.
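
Roughly like this (simplified sketch, not the exact mm/madvise.c code; the
function name is made up):

static void example_pageout_range(pte_t *pte, unsigned long addr,
				  unsigned long end)
{
	for (; addr < end; pte++, addr += PAGE_SIZE) {
		pte_t ptent = ptep_get(pte);

		if (pte_none(ptent) || !pte_present(ptent))
			continue;	/* device-exclusive entries land here */

		/* ... deactivate / reclaim the mapped folio ... */
	}
}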

Of course, there could be page table walkers that check all cases and 
bail out if they find something unexpected: do_swap_page() cannot make 
forward progress and will inject a VM_FAULT_SIGBUS if it doesn't 
recognize the entry. But these are rather rare.
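
Abridged sketch of that dispatch at the top of do_swap_page(), with most
branches and error handling elided:

	entry = pte_to_swp_entry(vmf->orig_pte);
	if (unlikely(non_swap_entry(entry))) {
		if (is_migration_entry(entry)) {
			migration_entry_wait(vma->vm_mm, vmf->pmd, vmf->address);
		} else if (is_device_exclusive_entry(entry)) {
			vmf->page = pfn_swap_entry_to_page(entry);
			ret = remove_device_exclusive_entry(vmf);
		} else if (is_device_private_entry(entry)) {
			/* ... hand the fault to the device driver ... */
		} else if (is_hwpoison_entry(entry)) {
			ret = VM_FAULT_HWPOISON;
		} else {
			/* unrecognized non-swap entry */
			ret = VM_FAULT_SIGBUS;
		}
		goto out;
	}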

We could enlighten selected page table walkers to handle 
device-exclusive where it really makes sense later.

Patch

diff --git a/mm/memory.c b/mm/memory.c
index db38d6ae4e74..cd689cd8a7c8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -743,20 +743,6 @@  static void restore_exclusive_pte(struct vm_area_struct *vma,
 
 	VM_BUG_ON_FOLIO(pte_write(pte) && (!folio_test_anon(folio) &&
 					   PageAnonExclusive(page)), folio);
-
-	/*
-	 * No need to take a page reference as one was already
-	 * created when the swap entry was made.
-	 */
-	if (folio_test_anon(folio))
-		folio_add_anon_rmap_pte(folio, page, vma, address, RMAP_NONE);
-	else
-		/*
-		 * Currently device exclusive access only supports anonymous
-		 * memory so the entry shouldn't point to a filebacked page.
-		 */
-		WARN_ON_ONCE(1);
-
 	set_pte_at(vma->vm_mm, address, ptep, pte);
 
 	/*
@@ -1628,8 +1614,7 @@  static inline int zap_nonpresent_ptes(struct mmu_gather *tlb,
 		 */
 		WARN_ON_ONCE(!vma_is_anonymous(vma));
 		rss[mm_counter(folio)]--;
-		if (is_device_private_entry(entry))
-			folio_remove_rmap_pte(folio, page, vma);
+		folio_remove_rmap_pte(folio, page, vma);
 		folio_put(folio);
 	} else if (!non_swap_entry(entry)) {
 		/* Genuine swap entries, hence a private anon pages */
diff --git a/mm/rmap.c b/mm/rmap.c
index 9e2002d97d6f..4acc9f6d743a 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2495,13 +2495,6 @@  struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	/* The pte is writable, uffd-wp does not apply. */
 	set_pte_at(mm, addr, fw.ptep, swp_pte);
 
-	/*
-	 * TODO: The device-exclusive non-swap PTE holds a folio reference but
-	 * does not count as a mapping (mapcount), which is wrong and must be
-	 * fixed, otherwise RMAP walks don't behave as expected.
-	 */
-	folio_remove_rmap_pte(folio, page, vma);
-
 	folio_walk_end(&fw, vma);
 	*foliop = folio;
 	return page;