
[v1,04/12] mm/rmap: implement make_device_exclusive() using folio_walk instead of rmap walk

Message ID 20250129115411.2077152-5-david@redhat.com (mailing list archive)
State New
Series mm: fixes for device-exclusive entries (hmm)

Commit Message

David Hildenbrand Jan. 29, 2025, 11:54 a.m. UTC
We require a writable PTE and only support anonymous folios: we can only
have exactly one PTE pointing at that page, which we can just look up
using a folio walk, avoiding the rmap walk and the anon VMA lock.

So let's stop doing an rmap walk and perform a folio walk instead, so we
can easily just modify a single PTE and avoid relying on rmap/mapcounts.

We now effectively work on a single PTE instead of multiple PTEs of
a large folio, allowing for conversion of individual PTEs from
non-exclusive to device-exclusive -- note that the other way always
worked on single PTEs.

We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
it is not required: when GUP does not find a present PTE because there is
already a device-exclusive entry, it has to trigger a fault and ends up in
remove_device_exclusive_entry(), which already takes care of the
MMU_NOTIFY_EXCLUSIVE call. Note that the PTE is always writable, and we
can always create a writable-device-exclusive entry.

With this change, device-exclusive is fully compatible with THPs /
large folios. We still require PMD-sized THPs to get PTE-mapped, and
supporting PMD-mapped THP (without the PTE-remapping) is a different
endeavour that might not be worth it at this point.

This gets rid of the "folio_mapcount()" usage and lets us fix ordinary
rmap walks (migration/swapout) next. Spell out that messing with the
mapcount is wrong and must be fixed.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/rmap.c | 188 ++++++++++++++++--------------------------------------
 1 file changed, 55 insertions(+), 133 deletions(-)
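
For context (not part of the patch), here is a rough sketch of how a
driver-side caller might use make_device_exclusive() after this change;
the wrapper name, the device page-table programming step and the retry
policy are illustrative assumptions, not taken from any existing driver:

static int driver_make_addr_exclusive(struct mm_struct *mm,
				      unsigned long addr, void *owner)
{
	struct folio *folio;
	struct page *page;

	mmap_read_lock(mm);	/* make_device_exclusive() asserts the mmap lock */
	page = make_device_exclusive(mm, addr, owner, &folio);
	if (IS_ERR(page)) {
		mmap_read_unlock(mm);
		return PTR_ERR(page);	/* e.g. -EBUSY; the caller may retry */
	}

	/* ... program the device page table with page_to_pfn(page) ... */

	/* On success, the folio is returned locked and with a reference held. */
	folio_unlock(folio);
	folio_put(folio);
	mmap_read_unlock(mm);
	return 0;
}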

Comments

Alistair Popple Jan. 30, 2025, 6:11 a.m. UTC | #1
On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
> We require a writable PTE and only support anonymous folio: we can only
> have exactly one PTE pointing at that page, which we can just lookup
> using a folio walk, avoiding the rmap walk and the anon VMA lock.
> 
> So let's stop doing an rmap walk and perform a folio walk instead, so we
> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
> 
> We now effectively work on a single PTE instead of multiple PTEs of
> a large folio, allowing for conversion of individual PTEs from
> non-exclusive to device-exclusive -- note that the other way always
> worked on single PTEs.
> 
> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
> that is not required: GUP will already take care of the
> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
> entry) when not finding a present PTE and having to trigger a fault and
> ending up in remove_device_exclusive_entry().

I will have to look at this a bit more closely tomorrow but this doesn't seem
right to me. We may be transitioning from a present PTE (ie. a writable
anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
update their copies of the PTE. So I think the notifier call is needed.

> Note that the PTE is
> always writable, and we can always create a writable-device-exclusive
> entry.
> 
> With this change, device-exclusive is fully compatible with THPs /
> large folios. We still require PMD-sized THPs to get PTE-mapped, and
> supporting PMD-mapped THP (without the PTE-remapping) is a different
> endeavour that might not be worth it at this point.
> 
> This gets rid of the "folio_mapcount()" usage and let's us fix ordinary
> rmap walks (migration/swapout) next. Spell out that messing with the
> mapcount is wrong and must be fixed.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/rmap.c | 188 ++++++++++++++++--------------------------------------
>  1 file changed, 55 insertions(+), 133 deletions(-)
> 
David Hildenbrand Jan. 30, 2025, 9:01 a.m. UTC | #2
On 30.01.25 07:11, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>> We require a writable PTE and only support anonymous folio: we can only
>> have exactly one PTE pointing at that page, which we can just lookup
>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>
>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>
>> We now effectively work on a single PTE instead of multiple PTEs of
>> a large folio, allowing for conversion of individual PTEs from
>> non-exclusive to device-exclusive -- note that the other way always
>> worked on single PTEs.
>>
>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>> that is not required: GUP will already take care of the
>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>> entry) when not finding a present PTE and having to trigger a fault and
>> ending up in remove_device_exclusive_entry().
> 
> I will have to look at this a bit more closely tomorrow but this doesn't seem
> right to me. We may be transitioning from a present PTE (ie. a writable
> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
> update their copies of the PTE. So I think the notifier call is needed.

Then it is all very confusing:

"MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no 
longer have exclusive access to the page."

That's simply not true in the scenario you describe, because nobody had 
exclusive access.

But what you are saying is that we need to inform others (e.g., KVM) 
that we are converting it to a device-exclusive entry, such that they 
stop accessing it.

That makes sense to me (and the cleanup patch in the cleanup series 
would have to go as well to prevent the livelock).

So we would have to update the documentation of MMU_NOTIFY_EXCLUSIVE 
to say that it is also triggered on conversion from non-exclusive to exclusive.

Does that make sense?

Thanks!
David Hildenbrand Jan. 30, 2025, 9:12 a.m. UTC | #3
On 30.01.25 10:01, David Hildenbrand wrote:
> On 30.01.25 07:11, Alistair Popple wrote:
>> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>>> We require a writable PTE and only support anonymous folio: we can only
>>> have exactly one PTE pointing at that page, which we can just lookup
>>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>>
>>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>>
>>> We now effectively work on a single PTE instead of multiple PTEs of
>>> a large folio, allowing for conversion of individual PTEs from
>>> non-exclusive to device-exclusive -- note that the other way always
>>> worked on single PTEs.
>>>
>>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>>> that is not required: GUP will already take care of the
>>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>>> entry) when not finding a present PTE and having to trigger a fault and
>>> ending up in remove_device_exclusive_entry().
>>
>> I will have to look at this a bit more closely tomorrow but this doesn't seem
>> right to me. We may be transitioning from a present PTE (ie. a writable
>> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
>> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
>> update their copies of the PTE. So I think the notifier call is needed.
> 
> Then it is all very confusing:
> 
> "MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> longer have exclusive access to the page."
> 
> That's simply not true in the scenario you describe, because nobody had
> exclusive access.
> 
> But what you are saying is, that we need to inform others (e.g., KVM)
> that we are converting it to a device-exclusive entry, such that they
> stop accessing it.
> 
> That makes sense to me (and the cleanup patch in the cleanup series
> would have to go as well to prevent the livelock).
> 
> So we would have to update the documentation of MMU_NOTIFY_EXCLUSIVE
> that it is also trigger on conversion from non-exclusive to exclusive.
> 
> Does that make sense?

Something like that on top:

diff --git a/mm/rmap.c b/mm/rmap.c
index 49ffac6d27f8..fd6dfe67ce7b 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2405,6 +2405,7 @@ void try_to_migrate(struct folio *folio, enum ttu_flags flags)
  struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
                 void *owner, struct folio **foliop)
  {
+       struct mmu_notifier_range range;
         struct folio *folio, *fw_folio;
         struct vm_area_struct *vma;
         struct folio_walk fw;
@@ -2413,6 +2414,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
         pte_t swp_pte;
  
         mmap_assert_locked(mm);
+       addr = PAGE_ALIGN_DOWN(addr);
  
         /*
          * Fault in the page writable and try to lock it; note that if the
@@ -2440,6 +2442,14 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
                 return ERR_PTR(-EBUSY);
         }
  
+       /*
+        * Inform secondary MMUs that we are going to convert this PTE to
+        * device-exclusive, such that they unmap it now.
+        */
+       mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
+                                     mm, addr, addr + PAGE_SIZE, owner);
+       mmu_notifier_invalidate_range_start(&range);
+
         /*
          * Let's do a second walk and make sure we still find the same page
          * mapped writable. If we don't find what we expect, we will trigger
@@ -2452,6 +2462,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
             fw.level != FW_LEVEL_PTE || !pte_write(fw.pte)) {
                 if (fw_folio)
                         folio_walk_end(&fw, vma);
+               mmu_notifier_invalidate_range_end(&range);
                 folio_unlock(folio);
                 folio_put(folio);
                 return ERR_PTR(-EBUSY);
@@ -2485,6 +2496,7 @@ struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
         folio_remove_rmap_pte(folio, page, vma);
  
         folio_walk_end(&fw, vma);
+       mmu_notifier_invalidate_range_end(&range);
         *foliop = folio;
         return page;
  }
David Hildenbrand Jan. 30, 2025, 9:24 a.m. UTC | #4
On 30.01.25 10:01, David Hildenbrand wrote:
> On 30.01.25 07:11, Alistair Popple wrote:
>> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>>> We require a writable PTE and only support anonymous folio: we can only
>>> have exactly one PTE pointing at that page, which we can just lookup
>>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>>
>>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>>
>>> We now effectively work on a single PTE instead of multiple PTEs of
>>> a large folio, allowing for conversion of individual PTEs from
>>> non-exclusive to device-exclusive -- note that the other way always
>>> worked on single PTEs.
>>>
>>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>>> that is not required: GUP will already take care of the
>>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>>> entry) when not finding a present PTE and having to trigger a fault and
>>> ending up in remove_device_exclusive_entry().
>>
>> I will have to look at this a bit more closely tomorrow but this doesn't seem
>> right to me. We may be transitioning from a present PTE (ie. a writable
>> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
>> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
>> update their copies of the PTE. So I think the notifier call is needed.
> 
> Then it is all very confusing:
> 
> "MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> longer have exclusive access to the page."

So the second sentence actually describes the other condition. Likely we
should make that clearer:

--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -43,10 +43,11 @@ struct mmu_interval_notifier;
   * a device driver to possibly ignore the invalidation if the
   * owner field matches the driver's device private pgmap owner.
   *
- * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
- * longer have exclusive access to the page. When sent during creation of an
- * exclusive range the owner will be initialised to the value provided by the
- * caller of make_device_exclusive(), otherwise the owner will be NULL.
+ * @MMU_NOTIFY_EXCLUSIVE: (1) to signal a device driver that the device will no
+ * longer have exclusive access to the page; and (2) to signal that a page will
+ * be made exclusive to a device. During (1), the owner will be NULL, during
+ * (2), the owner will be initialised to the value provided by the caller of
+ * make_device_exclusive().
   */
  enum mmu_notifier_event {
         MMU_NOTIFY_UNMAP = 0,
Simona Vetter Jan. 30, 2025, 9:40 a.m. UTC | #5
On Thu, Jan 30, 2025 at 05:11:49PM +1100, Alistair Popple wrote:
> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
> > We require a writable PTE and only support anonymous folio: we can only
> > have exactly one PTE pointing at that page, which we can just lookup
> > using a folio walk, avoiding the rmap walk and the anon VMA lock.
> > 
> > So let's stop doing an rmap walk and perform a folio walk instead, so we
> > can easily just modify a single PTE and avoid relying on rmap/mapcounts.
> > 
> > We now effectively work on a single PTE instead of multiple PTEs of
> > a large folio, allowing for conversion of individual PTEs from
> > non-exclusive to device-exclusive -- note that the other way always
> > worked on single PTEs.
> > 
> > We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
> > that is not required: GUP will already take care of the
> > MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
> > entry) when not finding a present PTE and having to trigger a fault and
> > ending up in remove_device_exclusive_entry().
> 
> I will have to look at this a bit more closely tomorrow but this doesn't seem
> right to me. We may be transitioning from a present PTE (ie. a writable
> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
> update their copies of the PTE. So I think the notifier call is needed.

I guess this is a question of what semantics we want: for multiple gpus, do we
require that device-exclusive also excludes other gpus or not? I'm leaning
towards agreeing with you here.

> > Note that the PTE is
> > always writable, and we can always create a writable-device-exclusive
> > entry.
> > 
> > With this change, device-exclusive is fully compatible with THPs /
> > large folios. We still require PMD-sized THPs to get PTE-mapped, and
> > supporting PMD-mapped THP (without the PTE-remapping) is a different
> > endeavour that might not be worth it at this point.

I'm not sure we actually want hugepages for device exclusive, since it has
an impact on what's allowed and what not. If we only ever do 4k entries
then userspace can assume that as long as atomics are separated by a 4k page
there's no issue when both the gpu and cpu hammer on them. If we try to
keep thp entries then suddenly a workload that worked before will result
in endless ping-pong between gpu and cpu because the separate atomic
counters (or whatever) now all sit in the same 2m page.

So going with thp might result in userspace having to spread out atomics
even more, which is just wasting memory and not saving any tlb entries
since often you don't need that many.

tldr; I think not supporting thp entries for device exclusive is a
feature, not a bug.

Cheers, Sima
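
To make the placement contract above concrete, a minimal hypothetical
userspace sketch (structure and names are illustrative, not from this
thread): with 4k device-exclusive entries it is enough to keep CPU- and
GPU-hammered atomics on separate 4 KiB pages, rather than on separate
2 MiB huge pages:

#include <stdlib.h>
#include <stdatomic.h>

#define PAGE_SZ	4096UL

/* Counters that the CPU and the GPU both hammer on with atomics. */
struct shared_counters {
	atomic_long *gpu_counter;	/* candidate for device-exclusive access */
	atomic_long *cpu_counter;	/* stays CPU-side */
};

/*
 * Keep each counter on its own 4 KiB page, so converting one page to a
 * device-exclusive entry never takes away the page the other counter
 * lives on.
 */
static int alloc_counters(struct shared_counters *c)
{
	void *a, *b;

	if (posix_memalign(&a, PAGE_SZ, PAGE_SZ))
		return -1;
	if (posix_memalign(&b, PAGE_SZ, PAGE_SZ)) {
		free(a);
		return -1;
	}
	c->gpu_counter = a;
	c->cpu_counter = b;
	return 0;
}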

David Hildenbrand Jan. 30, 2025, 9:47 a.m. UTC | #6
On 30.01.25 10:40, Simona Vetter wrote:
> On Thu, Jan 30, 2025 at 05:11:49PM +1100, Alistair Popple wrote:
>> On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
>>> We require a writable PTE and only support anonymous folio: we can only
>>> have exactly one PTE pointing at that page, which we can just lookup
>>> using a folio walk, avoiding the rmap walk and the anon VMA lock.
>>>
>>> So let's stop doing an rmap walk and perform a folio walk instead, so we
>>> can easily just modify a single PTE and avoid relying on rmap/mapcounts.
>>>
>>> We now effectively work on a single PTE instead of multiple PTEs of
>>> a large folio, allowing for conversion of individual PTEs from
>>> non-exclusive to device-exclusive -- note that the other way always
>>> worked on single PTEs.
>>>
>>> We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
>>> that is not required: GUP will already take care of the
>>> MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
>>> entry) when not finding a present PTE and having to trigger a fault and
>>> ending up in remove_device_exclusive_entry().
>>
>> I will have to look at this a bit more closely tomorrow but this doesn't seem
>> right to me. We may be transitioning from a present PTE (ie. a writable
>> anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
>> therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
>> update their copies of the PTE. So I think the notifier call is needed.
> 
> I guess this is a question of semantics we want, for multiple gpus do we
> require that device-exclusive also excludes other gpus or not. I'm leaning
> towards agreeing with you here.

See my reply; it's also relevant for non-device users, such as KVM. So it's 
the right thing to do.

> 
>>> Note that the PTE is
>>> always writable, and we can always create a writable-device-exclusive
>>> entry.
>>>
>>> With this change, device-exclusive is fully compatible with THPs /
>>> large folios. We still require PMD-sized THPs to get PTE-mapped, and
>>> supporting PMD-mapped THP (without the PTE-remapping) is a different
>>> endeavour that might not be worth it at this point.
> 
> I'm not sure we actually want hugepages for device exclusive, since it has
> an impact on what's allowed and what not. If we only ever do 4k entries
> then userspace can assume that as long atomics are separated by a 4k page
> there's no issue when both the gpu and cpu hammer on them. If we try to
> keep thp entries then suddenly a workload that worked before will result
> in endless ping-pong between gpu and cpu because the separate atomic
> counters (or whatever) now all sit in the same 2m page.

Agreed. And the conversion + mapping into the device gets trickier.

> 
> So going with thp might result in userspace having to spread out atomics
> even more, which is just wasting memory and not saving any tlb entries
> since often you don't need that many.
> 
> tldr; I think not supporting thp entries for device exclusive is a
> feature, not a bug.

So, you agree with my "different endeavour that might not be worth it" 
statement?
Simona Vetter Jan. 30, 2025, 1 p.m. UTC | #7
On Thu, Jan 30, 2025 at 10:47:29AM +0100, David Hildenbrand wrote:
> On 30.01.25 10:40, Simona Vetter wrote:
> > On Thu, Jan 30, 2025 at 05:11:49PM +1100, Alistair Popple wrote:
> > > On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
> > > > We require a writable PTE and only support anonymous folio: we can only
> > > > have exactly one PTE pointing at that page, which we can just lookup
> > > > using a folio walk, avoiding the rmap walk and the anon VMA lock.
> > > > 
> > > > So let's stop doing an rmap walk and perform a folio walk instead, so we
> > > > can easily just modify a single PTE and avoid relying on rmap/mapcounts.
> > > > 
> > > > We now effectively work on a single PTE instead of multiple PTEs of
> > > > a large folio, allowing for conversion of individual PTEs from
> > > > non-exclusive to device-exclusive -- note that the other way always
> > > > worked on single PTEs.
> > > > 
> > > > We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
> > > > that is not required: GUP will already take care of the
> > > > MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
> > > > entry) when not finding a present PTE and having to trigger a fault and
> > > > ending up in remove_device_exclusive_entry().
> > > 
> > > I will have to look at this a bit more closely tomorrow but this doesn't seem
> > > right to me. We may be transitioning from a present PTE (ie. a writable
> > > anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
> > > therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
> > > update their copies of the PTE. So I think the notifier call is needed.
> > 
> > I guess this is a question of semantics we want, for multiple gpus do we
> > require that device-exclusive also excludes other gpus or not. I'm leaning
> > towards agreeing with you here.
> 
> See my reply, it's also relevant for non-device, such as KVM. So it's the
> right thing to do.

Yeah sounds good.

> > > > Note that the PTE is
> > > > always writable, and we can always create a writable-device-exclusive
> > > > entry.
> > > > 
> > > > With this change, device-exclusive is fully compatible with THPs /
> > > > large folios. We still require PMD-sized THPs to get PTE-mapped, and
> > > > supporting PMD-mapped THP (without the PTE-remapping) is a different
> > > > endeavour that might not be worth it at this point.
> > 
> > I'm not sure we actually want hugepages for device exclusive, since it has
> > an impact on what's allowed and what not. If we only ever do 4k entries
> > then userspace can assume that as long atomics are separated by a 4k page
> > there's no issue when both the gpu and cpu hammer on them. If we try to
> > keep thp entries then suddenly a workload that worked before will result
> > in endless ping-pong between gpu and cpu because the separate atomic
> > counters (or whatever) now all sit in the same 2m page.
> 
> Agreed. And the conversion + mapping into the device gets trickier.
> 
> > 
> > So going with thp might result in userspace having to spread out atomics
> > even more, which is just wasting memory and not saving any tlb entries
> > since often you don't need that many.
> > 
> > tldr; I think not supporting thp entries for device exclusive is a
> > feature, not a bug.
> 
> So, you agree with my "different endeavour that might not be worth it"
> statement?

Yes.

Well, I think we should go further and clearly document that we
intentionally return split pages, because it's part of the uapi contract
with users of all this.

And if someone needs pmd entries for performance or whatever, we need two
things:

a) userspace must mmap that memory as hugepage memory, to clearly signal
the promise that atomics are split up on hugepage sizes and not just page
size

b) we need to extend make_device_exclusive and drivers to handle the
hugetlb folio case

I think thp is simply not going to work here; it's impossible (without
potentially causing fault storms) to figure out what userspace might want.

Cheers, Sima
David Hildenbrand Jan. 30, 2025, 3:59 p.m. UTC | #8
>>>>> Note that the PTE is
>>>>> always writable, and we can always create a writable-device-exclusive
>>>>> entry.
>>>>>
>>>>> With this change, device-exclusive is fully compatible with THPs /
>>>>> large folios. We still require PMD-sized THPs to get PTE-mapped, and
>>>>> supporting PMD-mapped THP (without the PTE-remapping) is a different
>>>>> endeavour that might not be worth it at this point.
>>>
>>> I'm not sure we actually want hugepages for device exclusive, since it has
>>> an impact on what's allowed and what not. If we only ever do 4k entries
>>> then userspace can assume that as long atomics are separated by a 4k page
>>> there's no issue when both the gpu and cpu hammer on them. If we try to
>>> keep thp entries then suddenly a workload that worked before will result
>>> in endless ping-pong between gpu and cpu because the separate atomic
>>> counters (or whatever) now all sit in the same 2m page.
>>
>> Agreed. And the conversion + mapping into the device gets trickier.
>>
>>>
>>> So going with thp might result in userspace having to spread out atomics
>>> even more, which is just wasting memory and not saving any tlb entries
>>> since often you don't need that many.
>>>
>>> tldr; I think not supporting thp entries for device exclusive is a
>>> feature, not a bug.
>>
>> So, you agree with my "different endeavour that might not be worth it"
>> statement?
> 
> Yes.
> 
> Well I think we should go further and clearly document that we
> intentionally return split pages. Because it's part of the uapi contract
> with users of all this.

Yes, see my reply to patch #3/

> 
> And if someone needs pmd entries for performance or whatever, we need two
> things:
> 
> a) userspace must mmap that memory as hugepage memory, to clearly signal
> the promise that atomics are split up on hugepage sizes and not just page
> size
> 
> b) we need to extend make_device_exclusive and drivers to handle the
> hugetlb folio case
> 
> I think thp is simply not going to work here, it's impossible (without
> potentially causing fault storms) to figure out what userspace might want.

Right, I added a link to this discussion in the patch.

Thanks!
Alistair Popple Jan. 30, 2025, 10:31 p.m. UTC | #9
On Thu, Jan 30, 2025 at 10:24:37AM +0100, David Hildenbrand wrote:
> On 30.01.25 10:01, David Hildenbrand wrote:
> > On 30.01.25 07:11, Alistair Popple wrote:
> > > On Wed, Jan 29, 2025 at 12:54:02PM +0100, David Hildenbrand wrote:
> > > > We require a writable PTE and only support anonymous folio: we can only
> > > > have exactly one PTE pointing at that page, which we can just lookup
> > > > using a folio walk, avoiding the rmap walk and the anon VMA lock.
> > > > 
> > > > So let's stop doing an rmap walk and perform a folio walk instead, so we
> > > > can easily just modify a single PTE and avoid relying on rmap/mapcounts.
> > > > 
> > > > We now effectively work on a single PTE instead of multiple PTEs of
> > > > a large folio, allowing for conversion of individual PTEs from
> > > > non-exclusive to device-exclusive -- note that the other way always
> > > > worked on single PTEs.
> > > > 
> > > > We can drop the MMU_NOTIFY_EXCLUSIVE MMU notifier call and document why
> > > > that is not required: GUP will already take care of the
> > > > MMU_NOTIFY_EXCLUSIVE call if required (there is already a device-exclusive
> > > > entry) when not finding a present PTE and having to trigger a fault and
> > > > ending up in remove_device_exclusive_entry().
> > > 
> > > I will have to look at this a bit more closely tomorrow but this doesn't seem
> > > right to me. We may be transitioning from a present PTE (ie. a writable
> > > anonymous mapping) to a non-present PTE (ie. a device-exclusive entry) and
> > > therefore any secondary processors (eg. other GPUs, iommus, etc.) will need to
> > > update their copies of the PTE. So I think the notifier call is needed.
> > 
> > Then it is all very confusing:

Can't argue with that in hindsight :-)

> > "MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> > longer have exclusive access to the page."
> 
> So the second sentence actually describes the other condition. Likely we
> should make that clearer:
>
> --- a/include/linux/mmu_notifier.h
> +++ b/include/linux/mmu_notifier.h
> @@ -43,10 +43,11 @@ struct mmu_interval_notifier;
>   * a device driver to possibly ignore the invalidation if the
>   * owner field matches the driver's device private pgmap owner.
>   *
> - * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
> - * longer have exclusive access to the page. When sent during creation of an
> - * exclusive range the owner will be initialised to the value provided by the
> - * caller of make_device_exclusive(), otherwise the owner will be NULL.
> + * @MMU_NOTIFY_EXCLUSIVE: (1) to signal a device driver that the device will no
> + * longer have exclusive access to the page; and (2) to signal that a page will
> + * be made exclusive to a device. During (1), the owner will be NULL, during
> + * (2), the owner will be initialised to the value provided by the caller of
> + * make_device_exclusive().

Yes, I think that makes things clearer. Logically these are really two different
events though - I guess I didn't want to add another one at the time but I
wonder if we should just make them separate events rather than overloading them?

>   */
>  enum mmu_notifier_event {
>         MMU_NOTIFY_UNMAP = 0,
> 
> 
> -- 
> Cheers,
> 
> David / dhildenb
>

Patch

diff --git a/mm/rmap.c b/mm/rmap.c
index 676df4fba5b0..49ffac6d27f8 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2375,131 +2375,6 @@  void try_to_migrate(struct folio *folio, enum ttu_flags flags)
 }
 
 #ifdef CONFIG_DEVICE_PRIVATE
-struct make_exclusive_args {
-	struct mm_struct *mm;
-	unsigned long address;
-	void *owner;
-	bool valid;
-};
-
-static bool page_make_device_exclusive_one(struct folio *folio,
-		struct vm_area_struct *vma, unsigned long address, void *priv)
-{
-	struct mm_struct *mm = vma->vm_mm;
-	DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address, 0);
-	struct make_exclusive_args *args = priv;
-	pte_t pteval;
-	struct page *subpage;
-	bool ret = true;
-	struct mmu_notifier_range range;
-	swp_entry_t entry;
-	pte_t swp_pte;
-	pte_t ptent;
-
-	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0,
-				      vma->vm_mm, address, min(vma->vm_end,
-				      address + folio_size(folio)),
-				      args->owner);
-	mmu_notifier_invalidate_range_start(&range);
-
-	while (page_vma_mapped_walk(&pvmw)) {
-		/* Unexpected PMD-mapped THP? */
-		VM_BUG_ON_FOLIO(!pvmw.pte, folio);
-
-		ptent = ptep_get(pvmw.pte);
-		if (!pte_present(ptent)) {
-			ret = false;
-			page_vma_mapped_walk_done(&pvmw);
-			break;
-		}
-
-		subpage = folio_page(folio,
-				pte_pfn(ptent) - folio_pfn(folio));
-		address = pvmw.address;
-
-		/* Nuke the page table entry. */
-		flush_cache_page(vma, address, pte_pfn(ptent));
-		pteval = ptep_clear_flush(vma, address, pvmw.pte);
-
-		/* Set the dirty flag on the folio now the pte is gone. */
-		if (pte_dirty(pteval))
-			folio_mark_dirty(folio);
-
-		/*
-		 * Check that our target page is still mapped at the expected
-		 * address.
-		 */
-		if (args->mm == mm && args->address == address &&
-		    pte_write(pteval))
-			args->valid = true;
-
-		/*
-		 * Store the pfn of the page in a special migration
-		 * pte. do_swap_page() will wait until the migration
-		 * pte is removed and then restart fault handling.
-		 */
-		if (pte_write(pteval))
-			entry = make_writable_device_exclusive_entry(
-							page_to_pfn(subpage));
-		else
-			entry = make_readable_device_exclusive_entry(
-							page_to_pfn(subpage));
-		swp_pte = swp_entry_to_pte(entry);
-		if (pte_soft_dirty(pteval))
-			swp_pte = pte_swp_mksoft_dirty(swp_pte);
-		if (pte_uffd_wp(pteval))
-			swp_pte = pte_swp_mkuffd_wp(swp_pte);
-
-		set_pte_at(mm, address, pvmw.pte, swp_pte);
-
-		/*
-		 * There is a reference on the page for the swap entry which has
-		 * been removed, so shouldn't take another.
-		 */
-		folio_remove_rmap_pte(folio, subpage, vma);
-	}
-
-	mmu_notifier_invalidate_range_end(&range);
-
-	return ret;
-}
-
-/**
- * folio_make_device_exclusive - Mark the folio exclusively owned by a device.
- * @folio: The folio to replace page table entries for.
- * @mm: The mm_struct where the folio is expected to be mapped.
- * @address: Address where the folio is expected to be mapped.
- * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
- *
- * Tries to remove all the page table entries which are mapping this
- * folio and replace them with special device exclusive swap entries to
- * grant a device exclusive access to the folio.
- *
- * Context: Caller must hold the folio lock.
- * Return: false if the page is still mapped, or if it could not be unmapped
- * from the expected address. Otherwise returns true (success).
- */
-static bool folio_make_device_exclusive(struct folio *folio,
-		struct mm_struct *mm, unsigned long address, void *owner)
-{
-	struct make_exclusive_args args = {
-		.mm = mm,
-		.address = address,
-		.owner = owner,
-		.valid = false,
-	};
-	struct rmap_walk_control rwc = {
-		.rmap_one = page_make_device_exclusive_one,
-		.done = folio_not_mapped,
-		.anon_lock = folio_lock_anon_vma_read,
-		.arg = &args,
-	};
-
-	rmap_walk(folio, &rwc);
-
-	return args.valid && !folio_mapcount(folio);
-}
-
 /**
  * make_device_exclusive() - Mark an address for exclusive use by a device
  * @mm: mm_struct of associated target process
@@ -2530,9 +2405,12 @@  static bool folio_make_device_exclusive(struct folio *folio,
 struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		void *owner, struct folio **foliop)
 {
-	struct folio *folio;
+	struct folio *folio, *fw_folio;
+	struct vm_area_struct *vma;
+	struct folio_walk fw;
 	struct page *page;
-	long npages;
+	swp_entry_t entry;
+	pte_t swp_pte;
 
 	mmap_assert_locked(mm);
 
@@ -2540,12 +2418,16 @@  struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 	 * Fault in the page writable and try to lock it; note that if the
 	 * address would already be marked for exclusive use by the device,
 	 * the GUP call would undo that first by triggering a fault.
+	 *
+	 * If any other device would already map this page exclusively, the
+	 * fault will trigger a conversion to an ordinary
+	 * (non-device-exclusive) PTE and issue a MMU_NOTIFY_EXCLUSIVE.
 	 */
-	npages = get_user_pages_remote(mm, addr, 1,
-				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
-				       &page, NULL);
-	if (npages != 1)
-		return ERR_PTR(npages);
+	page = get_user_page_vma_remote(mm, addr,
+					FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
+					&vma);
+	if (IS_ERR(page))
+		return page;
 	folio = page_folio(page);
 
 	if (!folio_test_anon(folio) || folio_test_hugetlb(folio)) {
@@ -2558,11 +2440,51 @@  struct page *make_device_exclusive(struct mm_struct *mm, unsigned long addr,
 		return ERR_PTR(-EBUSY);
 	}
 
-	if (!folio_make_device_exclusive(folio, mm, addr, owner)) {
+	/*
+	 * Let's do a second walk and make sure we still find the same page
+	 * mapped writable. If we don't find what we expect, we will trigger
+	 * GUP again to fix it up. Note that a page of an anonymous folio can
+	 * only be mapped writable using exactly one page table mapping
+	 * ("exclusive"), so there cannot be other mappings.
+	 */
+	fw_folio = folio_walk_start(&fw, vma, addr, 0);
+	if (fw_folio != folio || fw.page != page ||
+	    fw.level != FW_LEVEL_PTE || !pte_write(fw.pte)) {
+		if (fw_folio)
+			folio_walk_end(&fw, vma);
 		folio_unlock(folio);
 		folio_put(folio);
 		return ERR_PTR(-EBUSY);
 	}
+
+	/* Nuke the page table entry so we get the uptodate dirty bit. */
+	flush_cache_page(vma, addr, page_to_pfn(page));
+	fw.pte = ptep_clear_flush(vma, addr, fw.ptep);
+
+	/* Set the dirty flag on the folio now the pte is gone. */
+	if (pte_dirty(fw.pte))
+		folio_mark_dirty(folio);
+
+	/*
+	 * Store the pfn of the page in a special device-exclusive non-swap pte.
+	 * do_swap_page() will trigger the conversion back while holding the
+	 * folio lock.
+	 */
+	entry = make_writable_device_exclusive_entry(page_to_pfn(page));
+	swp_pte = swp_entry_to_pte(entry);
+	if (pte_soft_dirty(fw.pte))
+		swp_pte = pte_swp_mksoft_dirty(swp_pte);
+	/* The pte is writable, uffd-wp does not apply. */
+	set_pte_at(mm, addr, fw.ptep, swp_pte);
+
+	/*
+	 * TODO: The device-exclusive non-swap PTE holds a folio reference but
+	 * does not count as a mapping (mapcount), which is wrong and must be
+	 * fixed, otherwise RMAP walks don't behave as expected.
+	 */
+	folio_remove_rmap_pte(folio, page, vma);
+
+	folio_walk_end(&fw, vma);
 	*foliop = folio;
 	return page;
 }