[v7,5/8] mm: Device exclusive memory access

Message ID 20210326000805.2518-6-apopple@nvidia.com (mailing list archive)
State New, archived
Series Add support for SVM atomics in Nouveau

Commit Message

Alistair Popple March 26, 2021, 12:08 a.m. UTC
Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

To do this, introduce a new swap entry type (SWP_DEVICE_EXCLUSIVE). When an
SVM range needs to be marked for exclusive access by a device, all page table
mappings for the particular range are replaced with device exclusive swap
entries. This causes any CPU access to the page to result in a fault.

Faults are resolved by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.
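
As an illustration of the entry encoding described above, here is a minimal
userspace sketch. It is a toy model, not the kernel implementation: the
toy_* names, the type values and the bit layout (type in the top bits, PFN in
the rest) are all assumptions made for the example. The real swp_entry_t
layout is architecture specific.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy layout (assumption): the top 6 bits hold the type, the rest the PFN. */
#define TOY_TYPE_SHIFT  58u
#define TOY_OFFSET_MASK ((UINT64_C(1) << TOY_TYPE_SHIFT) - 1)

/* Hypothetical type numbers; the kernel derives its own from MAX_SWAPFILES. */
enum toy_swp_type {
	TOY_SWP_DEVICE_EXCLUSIVE_WRITE = 30,
	TOY_SWP_DEVICE_EXCLUSIVE_READ  = 31,
};

typedef struct { uint64_t val; } toy_swp_entry_t;

static toy_swp_entry_t toy_swp_entry(unsigned int type, uint64_t offset)
{
	toy_swp_entry_t e = {
		((uint64_t)type << TOY_TYPE_SHIFT) | (offset & TOY_OFFSET_MASK)
	};
	return e;
}

static unsigned int toy_swp_type(toy_swp_entry_t e)
{
	return (unsigned int)(e.val >> TOY_TYPE_SHIFT);
}

static uint64_t toy_swp_offset(toy_swp_entry_t e)
{
	return e.val & TOY_OFFSET_MASK;
}

/* Build an exclusive entry that remembers the PFN and the write permission. */
static toy_swp_entry_t toy_make_exclusive(uint64_t pfn, bool writable)
{
	return toy_swp_entry(writable ? TOY_SWP_DEVICE_EXCLUSIVE_WRITE
				      : TOY_SWP_DEVICE_EXCLUSIVE_READ, pfn);
}

static bool toy_is_device_exclusive(toy_swp_entry_t e)
{
	return toy_swp_type(e) == TOY_SWP_DEVICE_EXCLUSIVE_READ ||
	       toy_swp_type(e) == TOY_SWP_DEVICE_EXCLUSIVE_WRITE;
}

static bool toy_is_writable_device_exclusive(toy_swp_entry_t e)
{
	return toy_swp_type(e) == TOY_SWP_DEVICE_EXCLUSIVE_WRITE;
}

/* Round trip: the PFN survives, and write permission can be downgraded. */
int toy_swp_demo(void)
{
	toy_swp_entry_t w = toy_make_exclusive(0x1234, true);
	toy_swp_entry_t r = toy_make_exclusive(toy_swp_offset(w), false);

	assert(toy_is_device_exclusive(w) && toy_is_device_exclusive(r));
	assert(toy_is_writable_device_exclusive(w));
	assert(!toy_is_writable_device_exclusive(r));
	return toy_swp_offset(r) == 0x1234 ? 0 : 1;
}
```

The point of the sketch is only that the entry keeps enough information (the
PFN and whether the mapping was writable) to rebuild the original mapping on
a fault.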

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

---

v7:
* Added Christoph's Reviewed-by.
* Minor cosmetic cleanups suggested by Christoph.
* Replace mmu_notifier_range_init_migrate/exclusive with
  mmu_notifier_range_init_owner as suggested by Christoph.
* Replaced lock_page() with lock_page_or_retry() when handling faults.
* Restrict to anonymous pages for now.

v6:
* Fixed a bisectability issue due to incorrectly applying the rename of
  migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.

v5:
* Renamed range->migrate_pgmap_owner to range->owner.
* Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
  allows notifiers called as a result of make_device_exclusive_range() to
  be ignored.
* Added a check to try_to_protect_one() to detect if the pages originally
  returned from get_user_pages() have been unmapped or not.
* Removed check_device_exclusive_range() as it is no longer required with
  the other changes.
* Documentation update.

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
 Documentation/vm/hmm.rst              |  19 ++-
 drivers/gpu/drm/nouveau/nouveau_svm.c |   2 +-
 include/linux/mmu_notifier.h          |  26 ++--
 include/linux/rmap.h                  |   4 +
 include/linux/swap.h                  |   4 +-
 include/linux/swapops.h               |  44 +++++-
 lib/test_hmm.c                        |   2 +-
 mm/hmm.c                              |   5 +
 mm/memory.c                           | 108 ++++++++++++-
 mm/migrate.c                          |  10 +-
 mm/mprotect.c                         |   8 +
 mm/page_vma_mapped.c                  |   9 +-
 mm/rmap.c                             | 210 ++++++++++++++++++++++++++
 13 files changed, 426 insertions(+), 25 deletions(-)

Comments

Jason Gunthorpe March 30, 2021, 7:32 p.m. UTC | #1
On Fri, Mar 26, 2021 at 11:08:02AM +1100, Alistair Popple wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index 3a5705cfc891..33d11527ef77 100644
> +++ b/mm/memory.c
> @@ -781,6 +781,27 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
>  				pte = pte_swp_mkuffd_wp(pte);
>  			set_pte_at(src_mm, addr, src_pte, pte);
>  		}
> +	} else if (is_device_exclusive_entry(entry)) {
> +		page = pfn_swap_entry_to_page(entry);
> +
> +		get_page(page);
> +		rss[mm_counter(page)]++;
> +
> +		if (is_writable_device_exclusive_entry(entry) &&
> +		    is_cow_mapping(vm_flags)) {
> +			/*
> +			 * COW mappings require pages in both
> +			 * parent and child to be set to read.
> +			 */
> +			entry = make_readable_device_exclusive_entry(
> +							swp_offset(entry));
> +			pte = swp_entry_to_pte(entry);
> +			if (pte_swp_soft_dirty(*src_pte))
> +				pte = pte_swp_mksoft_dirty(pte);
> +			if (pte_swp_uffd_wp(*src_pte))
> +				pte = pte_swp_mkuffd_wp(pte);
> +			set_pte_at(src_mm, addr, src_pte, pte);
> +		}

This needs to have the same logic as we now have in
copy_present_page(). The page *is* present and we can't copy the PTE
value hidden in a swap entry if we can't copy the PTE normally.

The code should be shared because nobody is going to remember about
this corner case.

Jason
Alistair Popple March 31, 2021, 12:59 p.m. UTC | #2
On Wednesday, 31 March 2021 6:32:34 AM AEDT Jason Gunthorpe wrote:
> On Fri, Mar 26, 2021 at 11:08:02AM +1100, Alistair Popple wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3a5705cfc891..33d11527ef77 100644
> > +++ b/mm/memory.c
> > @@ -781,6 +781,27 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
> >  				pte = pte_swp_mkuffd_wp(pte);
> >  			set_pte_at(src_mm, addr, src_pte, pte);
> >  		}
> > +	} else if (is_device_exclusive_entry(entry)) {
> > +		page = pfn_swap_entry_to_page(entry);
> > +
> > +		get_page(page);
> > +		rss[mm_counter(page)]++;
> > +
> > +		if (is_writable_device_exclusive_entry(entry) &&
> > +		    is_cow_mapping(vm_flags)) {
> > +			/*
> > +			 * COW mappings require pages in both
> > +			 * parent and child to be set to read.
> > +			 */
> > +			entry = make_readable_device_exclusive_entry(
> > +							swp_offset(entry));
> > +			pte = swp_entry_to_pte(entry);
> > +			if (pte_swp_soft_dirty(*src_pte))
> > +				pte = pte_swp_mksoft_dirty(pte);
> > +			if (pte_swp_uffd_wp(*src_pte))
> > +				pte = pte_swp_mkuffd_wp(pte);
> > +			set_pte_at(src_mm, addr, src_pte, pte);
> > +		}
> 
> This needs to have the same logic as we now have in
> copy_present_page(). The page *is* present and we can't copy the PTE
> value hidden in a swap entry if we can't copy the PTE normally.

You're saying we need to use copy_present_page() to make sure the split goes 
the right way for pinned pages? I guess that makes sense as the split could go 
either way at the moment but I should add a check to make sure this isn't used 
with pinned pages anyway.

 - Alistair

> The code should be shared because nobody is going to remember about
> this corner case.
> 
> Jason
>
Jason Gunthorpe March 31, 2021, 1:18 p.m. UTC | #3
On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:

> I guess that makes sense as the split could go either way at the
> moment but I should add a check to make sure this isn't used with
> pinned pages anyway.

Is it possible to have a pinned page under one of these things? If I
pin it before you migrate it then it remains pinned but hidden under
the swap entry?

So the special logic is needed and the pinned page has to be copied
and written as a normal pte, not dropped as a migration entry

Jason
Alistair Popple March 31, 2021, 1:27 p.m. UTC | #4
On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> 
> > I guess that makes sense as the split could go either way at the
> > moment but I should add a check to make sure this isn't used with
> > pinned pages anyway.
> 
> Is it possible to have a pinned page under one of these things? If I
> pin it before you migrate it then it remains pinned but hidden under
> the swap entry?

At the moment yes. But I had planned (and this reminded me) to add a check to 
prevent marking pinned pages for exclusive access. This check was in the 
original migration based implementation as I don't think it makes much sense 
to allow exclusive access to pinned pages given it indicates another device is 
possibly using it. 

> So the special logic is needed and the pinned page has to be copied
> and written as a normal pte, not dropped as a migration entry

Yep, if we end up allowing pinned pages to exist under these then that makes 
sense. Thanks for the clarification.

 - Alistair

> Jason
>
Jason Gunthorpe March 31, 2021, 1:46 p.m. UTC | #5
On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > 
> > > I guess that makes sense as the split could go either way at the
> > > moment but I should add a check to make sure this isn't used with
> > > pinned pages anyway.
> > 
> > Is it possible to have a pinned page under one of these things? If I
> > pin it before you migrate it then it remains pinned but hidden under
> > the swap entry?
> 
> At the moment yes. But I had planned (and this reminded me) to add a check to 
> prevent marking pinned pages for exclusive access. 

How do you even do that without races with GUP fast?

Jason
Alistair Popple April 1, 2021, 12:45 a.m. UTC | #6
On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > 
> > > > I guess that makes sense as the split could go either way at the
> > > > moment but I should add a check to make sure this isn't used with
> > > > pinned pages anyway.
> > > 
> > > Is it possible to have a pinned page under one of these things? If I
> > > pin it before you migrate it then it remains pinned but hidden under
> > > the swap entry?
> > 
> > At the moment yes. But I had planned (and this reminded me) to add a check to
> > prevent marking pinned pages for exclusive access.
> 
> How do you even do that without races with GUP fast?

Unless I've missed something I think I've convinced myself it should be safe 
to do the pin check after make_device_exclusive() has replaced all the PTEs 
with exclusive entries.

GUP fast sequence:
1. Read PTE
2. Pin page
3. Check PTE
4. if PTE changed -> unpin and fallback

If make_device_exclusive() runs after (1) it will either succeed or see the 
pin from (2) and fail (as desired). GUP should always see the PTE change and 
fallback which will revoke the exclusive access.
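
To make the sequence above concrete, here is a small single-threaded model.
It is only a sketch under stated assumptions: the toy_* names and PTE values
are made up, and the "race" is simulated by a hook that runs between steps 2
and 3 rather than by a real second CPU. It does not model fork() or any
other path.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_PTE_PRESENT   UINT64_C(0x1)	/* hypothetical "mapped" value */
#define TOY_PTE_EXCLUSIVE UINT64_C(0x2)	/* hypothetical exclusive entry */

struct toy_page { atomic_int pincount; };
struct toy_pte  { _Atomic uint64_t val; };

/* Called between "pin" and "check" to simulate a concurrent PTE update. */
typedef void (*toy_race_hook)(struct toy_pte *pte);

/* Stand-in for the PTE replacement step of make_device_exclusive(). */
static void toy_replace_with_exclusive(struct toy_pte *pte)
{
	atomic_store(&pte->val, TOY_PTE_EXCLUSIVE);
}

/* Returns true if the pin was kept, false if the caller must fall back. */
static bool toy_gup_fast(struct toy_pte *pte, struct toy_page *page,
			 toy_race_hook hook)
{
	uint64_t old = atomic_load(&pte->val);		/* 1. read PTE */

	if (!(old & TOY_PTE_PRESENT))
		return false;				/* not mapped: slow path */

	atomic_fetch_add(&page->pincount, 1);		/* 2. pin page */

	if (hook)
		hook(pte);				/* simulated race window */

	if (atomic_load(&pte->val) != old) {		/* 3. check PTE */
		atomic_fetch_sub(&page->pincount, 1);	/* 4. unpin, fall back */
		return false;
	}
	return true;
}

/* Runs the racing case: the pin must be dropped again. Returns 0 on success. */
int toy_gup_demo(void)
{
	struct toy_page page;
	struct toy_pte pte;

	atomic_init(&page.pincount, 0);
	atomic_init(&pte.val, TOY_PTE_PRESENT);

	assert(!toy_gup_fast(&pte, &page, toy_replace_with_exclusive));
	return atomic_load(&page.pincount);
}
```

In this model a PTE change after step 1 always makes the fast path drop its
pin, which is the property the argument relies on.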

 - Alistair

> Jason
>
Jason Gunthorpe April 1, 2021, 12:48 a.m. UTC | #7
On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > 
> > > > > I guess that makes sense as the split could go either way at the
> > > > > moment but I should add a check to make sure this isn't used with
> > > > > pinned pages anyway.
> > > > 
> > > > Is it possible to have a pinned page under one of these things? If I
> > > > pin it before you migrate it then it remains pinned but hidden under
> > > > the swap entry?
> > > 
> > > At the moment yes. But I had planned (and this reminded me) to add a check to
> > > prevent marking pinned pages for exclusive access.
> > 
> > How do you even do that without races with GUP fast?
> 
> Unless I've missed something I think I've convinced myself it should be safe 
> to do the pin check after make_device_exclusive() has replaced all the PTEs 
> with exclusive entries.
> 
> GUP fast sequence:
> 1. Read PTE
> 2. Pin page
> 3. Check PTE
> 4. if PTE changed -> unpin and fallback
> 
> If make_device_exclusive() runs after (1) it will either succeed or see the 
> pin from (2) and fail (as desired). GUP should always see the PTE change and 
> fallback which will revoke the exclusive access.

AFAICT the user can trigger fork at that instant and fork will try to
copy the deposited migration entry before it has been checked

Jason
Alistair Popple April 1, 2021, 2:20 a.m. UTC | #8
On Thursday, 1 April 2021 11:48:13 AM AEDT Jason Gunthorpe wrote:
> On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> > On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > > 
> > > > > > I guess that makes sense as the split could go either way at the
> > > > > > moment but I should add a check to make sure this isn't used with
> > > > > > pinned pages anyway.
> > > > > 
> > > > > Is it possible to have a pinned page under one of these things? If I
> > > > > pin it before you migrate it then it remains pinned but hidden under
> > > > > the swap entry?
> > > > 
> > > > At the moment yes. But I had planned (and this reminded me) to add a check to
> > > > prevent marking pinned pages for exclusive access.
> > > 
> > > How do you even do that without races with GUP fast?
> > 
> > Unless I've missed something I think I've convinced myself it should be safe
> > to do the pin check after make_device_exclusive() has replaced all the PTEs
> > with exclusive entries.
> > 
> > GUP fast sequence:
> > 1. Read PTE
> > 2. Pin page
> > 3. Check PTE
> > 4. if PTE changed -> unpin and fallback
> > 
> > If make_device_exclusive() runs after (1) it will either succeed or see the
> > pin from (2) and fail (as desired). GUP should always see the PTE change and
> > fallback which will revoke the exclusive access.
> 
> AFAICT the user can trigger fork at that instant and fork will try to
> copy the deposited migration entry before it has been checked

In that case the child will get a read-only exclusive entry and eventually a 
page copy via do_wp_page() and GUP will fallback (or fail in the case of fast 
only) so the parent's exclusive entry will get removed before the page can be 
pinned and therefore shouldn't split the wrong way.

But that is sounding rather complex, and I am not convinced I haven't missed a 
corner case. It also seems like it shouldn't be necessary to copy exclusive 
entries anyway. I could just remove them and restore the original entry, which 
would be far simpler.

> Jason
>
Jason Gunthorpe April 1, 2021, 11:55 a.m. UTC | #9
On Thu, Apr 01, 2021 at 01:20:05PM +1100, Alistair Popple wrote:
> On Thursday, 1 April 2021 11:48:13 AM AEDT Jason Gunthorpe wrote:
> > On Thu, Apr 01, 2021 at 11:45:57AM +1100, Alistair Popple wrote:
> > > On Thursday, 1 April 2021 12:46:04 AM AEDT Jason Gunthorpe wrote:
> > > > On Thu, Apr 01, 2021 at 12:27:52AM +1100, Alistair Popple wrote:
> > > > > On Thursday, 1 April 2021 12:18:54 AM AEDT Jason Gunthorpe wrote:
> > > > > > On Wed, Mar 31, 2021 at 11:59:28PM +1100, Alistair Popple wrote:
> > > > > > 
> > > > > > > I guess that makes sense as the split could go either way at the
> > > > > > > moment but I should add a check to make sure this isn't used with
> > > > > > > pinned pages anyway.
> > > > > > 
> > > > > > Is it possible to have a pinned page under one of these things? If I
> > > > > > pin it before you migrate it then it remains pinned but hidden under
> > > > > > the swap entry?
> > > > > 
> > > > > At the moment yes. But I had planned (and this reminded me) to add a check to
> > > > > prevent marking pinned pages for exclusive access.
> > > > 
> > > > How do you even do that without races with GUP fast?
> > > 
> > > Unless I've missed something I think I've convinced myself it should be safe
> > > to do the pin check after make_device_exclusive() has replaced all the PTEs
> > > with exclusive entries.
> > > 
> > > GUP fast sequence:
> > > 1. Read PTE
> > > 2. Pin page
> > > 3. Check PTE
> > > 4. if PTE changed -> unpin and fallback
> > > 
> > > If make_device_exclusive() runs after (1) it will either succeed or see the
> > > pin from (2) and fail (as desired). GUP should always see the PTE change and
> > > fallback which will revoke the exclusive access.
> > 
> > AFAICT the user can trigger fork at that instant and fork will try to
> > copy the deposited migration entry before it has been checked
> 
> In that case the child will get a read-only exclusive entry and eventually a 
> page copy via do_wp_page() 

Having do_wp_page() do a copy is a security bug. We closed it with the
at-fork checks.

Jason

Patch

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@  between device driver specific code and shared common code:
    walks to fill in the ``args->src`` array with PFNs to be migrated.
    The ``invalidate_range_start()`` callback is passed a
    ``struct mmu_notifier_range`` with the ``event`` field set to
-   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
    the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
    allows the device driver to skip the invalidation callback and only
    invalidate device private MMU mappings that are actually migrating.
@@ -405,6 +405,23 @@  between device driver specific code and shared common code:
 
    The lock can now be released.
 
+Exclusive access memory
+=======================
+
+Some devices have features such as atomic PTE bits that can be used to implement
+atomic access to system memory. To support atomic operations to a shared virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resolved by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guaranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
 Memory cgroup (memcg) and rss accounting
 ========================================
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index f18bd53da052..94f841026c3b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@  nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
 	 * the invalidation is handled as part of the migration process.
 	 */
 	if (update->event == MMU_NOTIFY_MIGRATE &&
-	    update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev)
+	    update->owner == svmm->vmm->cli->drm->dev)
 		goto out;
 
 	if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b8200782dede..2e6068d3fb9f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -41,7 +41,12 @@  struct mmu_interval_notifier;
  *
  * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
  * a device driver to possibly ignore the invalidation if the
- * migrate_pgmap_owner field matches the driver's device private pgmap owner.
+ * owner field matches the driver's device private pgmap owner.
+ *
+ * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
+ * longer have exclusive access to the page. May ignore the invalidation that's
+ * part of make_device_exclusive_range() if the owner field
+ * matches the value passed to make_device_exclusive_range().
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
@@ -51,6 +56,7 @@  enum mmu_notifier_event {
 	MMU_NOTIFY_SOFT_DIRTY,
 	MMU_NOTIFY_RELEASE,
 	MMU_NOTIFY_MIGRATE,
+	MMU_NOTIFY_EXCLUSIVE,
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
@@ -269,7 +275,7 @@  struct mmu_notifier_range {
 	unsigned long end;
 	unsigned flags;
 	enum mmu_notifier_event event;
-	void *migrate_pgmap_owner;
+	void *owner;
 };
 
 static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -521,14 +527,14 @@  static inline void mmu_notifier_range_init(struct mmu_notifier_range *range,
 	range->flags = flags;
 }
 
-static inline void mmu_notifier_range_init_migrate(
-			struct mmu_notifier_range *range, unsigned int flags,
+static inline void mmu_notifier_range_init_owner(
+			struct mmu_notifier_range *range,
+			enum mmu_notifier_event event, unsigned int flags,
 			struct vm_area_struct *vma, struct mm_struct *mm,
-			unsigned long start, unsigned long end, void *pgmap)
+			unsigned long start, unsigned long end, void *owner)
 {
-	mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
-				start, end);
-	range->migrate_pgmap_owner = pgmap;
+	mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
+	range->owner = owner;
 }
 
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
@@ -655,8 +661,8 @@  static inline void _mmu_notifier_range_init(struct mmu_notifier_range *range,
 
 #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end)  \
 	_mmu_notifier_range_init(range, start, end)
-#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \
-					pgmap) \
+#define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \
+					end, owner) \
 	_mmu_notifier_range_init(range, start, end)
 
 static inline bool
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6062e0cfca2d..b207c138cbff 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -193,6 +193,10 @@  int page_referenced(struct page *, int is_locked,
 bool try_to_migrate(struct page *page, enum ttu_flags flags);
 bool try_to_unmap(struct page *, enum ttu_flags flags);
 
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, struct page **pages,
+				void *arg);
+
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
 /* Look for migarion entries rather than present PTEs */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 516104b9334b..7a3c260146df 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -63,9 +63,11 @@  static inline int current_is_kswapd(void)
  * to a special SWP_DEVICE_* entry.
  */
 #ifdef CONFIG_DEVICE_PRIVATE
-#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_NUM 4
 #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
 #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
+#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
+#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
 #else
 #define SWP_DEVICE_NUM 0
 #endif
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4dfd807ae52a..4129bd2ff9d6 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -120,6 +120,27 @@  static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
+		swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
+}
 #else /* CONFIG_DEVICE_PRIVATE */
 static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
@@ -140,6 +161,26 @@  static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return false;
 }
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+	return false;
+}
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
@@ -219,7 +260,8 @@  static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
  */
 static inline bool is_pfn_swap_entry(swp_entry_t entry)
 {
-	return is_migration_entry(entry) || is_device_private_entry(entry);
+	return is_migration_entry(entry) || is_device_private_entry(entry) ||
+	       is_device_exclusive_entry(entry);
 }
 
 struct page_vma_mapped_walk;
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..5c9f5a020c1d 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -218,7 +218,7 @@  static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
 	 * the invalidation is handled as part of the migration process.
 	 */
 	if (range->event == MMU_NOTIFY_MIGRATE &&
-	    range->migrate_pgmap_owner == dmirror->mdevice)
+	    range->owner == dmirror->mdevice)
 		return true;
 
 	if (mmu_notifier_range_blockable(range))
diff --git a/mm/hmm.c b/mm/hmm.c
index 11df3ca30b82..fad6be2bf072 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -26,6 +26,8 @@ 
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
 
+#include "internal.h"
+
 struct hmm_vma_walk {
 	struct hmm_range	*range;
 	unsigned long		last;
@@ -271,6 +273,9 @@  static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		if (!non_swap_entry(entry))
 			goto fault;
 
+		if (is_device_exclusive_entry(entry))
+			goto fault;
+
 		if (is_migration_entry(entry)) {
 			pte_unmap(ptep);
 			hmm_vma_walk->last = addr;
diff --git a/mm/memory.c b/mm/memory.c
index 3a5705cfc891..33d11527ef77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -781,6 +781,27 @@  copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pte = pte_swp_mkuffd_wp(pte);
 			set_pte_at(src_mm, addr, src_pte, pte);
 		}
+	} else if (is_device_exclusive_entry(entry)) {
+		page = pfn_swap_entry_to_page(entry);
+
+		get_page(page);
+		rss[mm_counter(page)]++;
+
+		if (is_writable_device_exclusive_entry(entry) &&
+		    is_cow_mapping(vm_flags)) {
+			/*
+			 * COW mappings require pages in both
+			 * parent and child to be set to read.
+			 */
+			entry = make_readable_device_exclusive_entry(
+							swp_offset(entry));
+			pte = swp_entry_to_pte(entry);
+			if (pte_swp_soft_dirty(*src_pte))
+				pte = pte_swp_mksoft_dirty(pte);
+			if (pte_swp_uffd_wp(*src_pte))
+				pte = pte_swp_mkuffd_wp(pte);
+			set_pte_at(src_mm, addr, src_pte, pte);
+		}
 	}
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 	return 0;
@@ -1287,7 +1308,8 @@  static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		}
 
 		entry = pte_to_swp_entry(ptent);
-		if (is_device_private_entry(entry)) {
+		if (is_device_private_entry(entry) ||
+		    is_device_exclusive_entry(entry)) {
 			struct page *page = pfn_swap_entry_to_page(entry);
 
 			if (unlikely(details && details->check_mapping)) {
@@ -1303,7 +1325,10 @@  static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+
+			if (is_device_private_entry(entry))
+				page_remove_rmap(page, false);
+
 			put_page(page);
 			continue;
 		}
@@ -3256,6 +3281,82 @@  void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void restore_exclusive_pte(struct vm_area_struct *vma,
+				  struct page *page, unsigned long address,
+				  pte_t *ptep)
+{
+	pte_t pte;
+	swp_entry_t entry;
+
+	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
+	if (pte_swp_soft_dirty(*ptep))
+		pte = pte_mksoft_dirty(pte);
+
+	entry = pte_to_swp_entry(*ptep);
+	if (pte_swp_uffd_wp(*ptep))
+		pte = pte_mkuffd_wp(pte);
+	else if (is_writable_device_exclusive_entry(entry))
+		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+
+	set_pte_at(vma->vm_mm, address, ptep, pte);
+
+	/*
+	 * No need to take a page reference as one was already
+	 * created when the swap entry was made.
+	 */
+	if (PageAnon(page))
+		page_add_anon_rmap(page, vma, address, false);
+	else
+		page_add_file_rmap(page, false);
+
+	if (vma->vm_flags & VM_LOCKED)
+		mlock_vma_page(page);
+
+	/*
+	 * No need to invalidate - it was non-present before. However
+	 * secondary CPUs may have mappings that need invalidating.
+	 */
+	update_mmu_cache(vma, address, ptep);
+}
+
+/*
+ * Restore a potential device exclusive pte to a working pte entry
+ */
+static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
+{
+	struct page *page = vmf->page;
+	struct vm_area_struct *vma = vmf->vma;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = vmf->address,
+		.flags = PVMW_SYNC,
+	};
+	vm_fault_t ret = 0;
+	struct mmu_notifier_range range;
+
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+		return VM_FAULT_RETRY;
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				vmf->address & PAGE_MASK,
+				(vmf->address & PAGE_MASK) + PAGE_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		if (unlikely(!pte_same(*pvmw.pte, vmf->orig_pte))) {
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		restore_exclusive_pte(vma, page, pvmw.address, pvmw.pte);
+	}
+
+	unlock_page(page);
+
+	mmu_notifier_invalidate_range_end(&range);
+	return ret;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3283,6 +3384,9 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
+		} else if (is_device_exclusive_entry(entry)) {
+			vmf->page = pfn_swap_entry_to_page(entry);
+			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
diff --git a/mm/migrate.c b/mm/migrate.c
index cc4612e2a246..9cc9251d4802 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2570,8 +2570,8 @@  static void migrate_vma_collect(struct migrate_vma *migrate)
 	 * that the registered device driver can skip invalidating device
 	 * private page mappings that won't be migrated.
 	 */
-	mmu_notifier_range_init_migrate(&range, 0, migrate->vma,
-		migrate->vma->vm_mm, migrate->start, migrate->end,
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0,
+		migrate->vma, migrate->vma->vm_mm, migrate->start, migrate->end,
 		migrate->pgmap_owner);
 	mmu_notifier_invalidate_range_start(&range);
 
@@ -3074,9 +3074,9 @@  void migrate_vma_pages(struct migrate_vma *migrate)
 			if (!notified) {
 				notified = true;
 
-				mmu_notifier_range_init_migrate(&range, 0,
-					migrate->vma, migrate->vma->vm_mm,
-					addr, migrate->end,
+				mmu_notifier_range_init_owner(&range,
+					MMU_NOTIFY_MIGRATE, 0, migrate->vma,
+					migrate->vma->vm_mm, addr, migrate->end,
 					migrate->pgmap_owner);
 				mmu_notifier_invalidate_range_start(&range);
 			}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f21b760ec809..c6018541ea3d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -165,6 +165,14 @@  static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
+			} else if (is_writable_device_exclusive_entry(entry)) {
+				entry = make_readable_device_exclusive_entry(
+							swp_offset(entry));
+				newpte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(oldpte))
+					newpte = pte_swp_mksoft_dirty(newpte);
+				if (pte_swp_uffd_wp(oldpte))
+					newpte = pte_swp_mkuffd_wp(newpte);
 			} else {
 				newpte = oldpte;
 			}
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index eed988ab2e81..29842f169219 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -41,7 +41,8 @@  static bool map_pte(struct page_vma_mapped_walk *pvmw)
 
 				/* Handle un-addressable ZONE_DEVICE memory */
 				entry = pte_to_swp_entry(*pvmw->pte);
-				if (!is_device_private_entry(entry))
+				if (!is_device_private_entry(entry) &&
+				    !is_device_exclusive_entry(entry))
 					return false;
 			} else if (!pte_present(*pvmw->pte))
 				return false;
@@ -93,7 +94,8 @@  static bool check_pte(struct page_vma_mapped_walk *pvmw)
 			return false;
 		entry = pte_to_swp_entry(*pvmw->pte);
 
-		if (!is_migration_entry(entry))
+		if (!is_migration_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 
 		pfn = swp_offset(entry);
@@ -102,7 +104,8 @@  static bool check_pte(struct page_vma_mapped_walk *pvmw)
 
 		/* Handle un-addressable ZONE_DEVICE memory */
 		entry = pte_to_swp_entry(*pvmw->pte);
-		if (!is_device_private_entry(entry))
+		if (!is_device_private_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 
 		pfn = swp_offset(entry);
diff --git a/mm/rmap.c b/mm/rmap.c
index b540b44e299a..b0ec88a37dab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2005,6 +2005,216 @@  void try_to_munlock(struct page *page)
 	rmap_walk(page, &rwc);
 }
 
+struct ttp_args {
+	struct mm_struct *mm;
+	unsigned long address;
+	void *arg;
+	bool valid;
+};
+
+static bool try_to_protect_one(struct page *page, struct vm_area_struct *vma,
+			unsigned long address, void *arg)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
+	struct ttp_args *ttp = arg;
+	pte_t pteval;
+	struct page *subpage;
+	bool ret = true;
+	struct mmu_notifier_range range;
+	swp_entry_t entry;
+	pte_t swp_pte;
+
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
+				      vma->vm_mm, address,
+				      min(vma->vm_end,
+					  address + page_size(page)),
+				      ttp->arg);
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		adjust_range_if_pmd_sharing_possible(vma, &range.start,
+						     &range.end);
+	}
+	mmu_notifier_invalidate_range_start(&range);
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		/* Unexpected PMD-mapped THP? */
+		VM_BUG_ON_PAGE(!pvmw.pte, page);
+
+		if (!pte_present(*pvmw.pte)) {
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		address = pvmw.address;
+
+		/* Nuke the page table entry. */
+		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		pteval = ptep_clear_flush(vma, address, pvmw.pte);
+
+		/* Move the dirty bit to the page. Now the pte is gone. */
+		if (pte_dirty(pteval))
+			set_page_dirty(page);
+
+		/* Update high watermark before we lower rss */
+		update_hiwater_rss(mm);
+
+		if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+			set_pte_at(mm, address, pvmw.pte, pteval);
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		/*
+		 * Check that our target page is still mapped writable at the
+		 * expected address.
+		 */
+		if (ttp->mm == mm && ttp->address == address &&
+		    pte_write(pteval))
+			ttp->valid = true;
+
+		/*
+		 * Store the pfn of the page in a special device exclusive
+		 * swap pte. On a CPU fault do_swap_page() replaces it with
+		 * the original mapping.
+		 */
+		if (pte_write(pteval))
+			entry = make_writable_device_exclusive_entry(
+							page_to_pfn(subpage));
+		else
+			entry = make_readable_device_exclusive_entry(
+							page_to_pfn(subpage));
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		if (pte_uffd_wp(pteval))
+			swp_pte = pte_swp_mkuffd_wp(swp_pte);
+
+		/* Take a reference for the swap entry */
+		get_page(page);
+		set_pte_at(mm, address, pvmw.pte, swp_pte);
+
+		page_remove_rmap(subpage, PageHuge(page));
+		put_page(page);
+	}
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
+/**
+ * try_to_protect - try to replace all page table mappings with swap entries
+ * @page: the page to replace page table entries for
+ * @mm: the mm_struct where the page is expected to be mapped
+ * @address: address where the page is expected to be mapped
+ * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks to allow
+ *       filtering
+ *
+ * Tries to remove all the page table entries which are mapping this page and
+ * replace them with special swap entries to grant a device exclusive access to
+ * the page. Caller must hold the page lock.
+ *
+ * Returns false if the page is still mapped, or if it could not be unmapped
+ * from the expected address. Otherwise returns true (success).
+ */
+static bool try_to_protect(struct page *page, struct mm_struct *mm,
+			   unsigned long address, void *arg)
+{
+	struct ttp_args ttp = {
+		.mm = mm,
+		.address = address,
+		.arg = arg,
+		.valid = false,
+	};
+	struct rmap_walk_control rwc = {
+		.rmap_one = try_to_protect_one,
+		.done = page_not_mapped,
+		.anon_lock = page_lock_anon_vma_read,
+		.arg = &ttp,
+	};
+
+	/*
+	 * Restrict to anonymous pages for now to avoid potential writeback
+	 * issues.
+	 */
+	if (!PageAnon(page))
+		return false;
+
+	/*
+	 * During exec, a temporary VMA is setup and later moved.
+	 * The VMA is moved under the anon_vma lock but not the
+	 * page tables leading to a race where migration cannot
+	 * find the migration ptes. Rather than increasing the
+	 * locking requirements of exec(), migration skips
+	 * temporary VMAs until after exec() completes.
+	 */
+	if (!PageKsm(page) && PageAnon(page))
+		rwc.invalid_vma = invalid_migration_vma;
+
+	rmap_walk(page, &rwc);
+
+	return ttp.valid && !page_mapcount(page);
+}
+
+/**
+ * make_device_exclusive_range() - Mark a range for exclusive use by a device
+ * @mm: mm_struct of associated target process
+ * @start: start of the region to mark for exclusive device access
+ * @end: end address of region
+ * @pages: returns the pages which were successfully marked for exclusive access
+ * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
+ *
+ * Returns: number of pages found; failed entries in @pages are set to NULL
+ *
+ * This function finds ptes mapping page(s) to the given address range, locks
+ * them and replaces mappings with special swap entries preventing userspace CPU
+ * access. On fault these entries are replaced with the original mapping after
+ * calling MMU notifiers.
+ *
+ * A driver using this to program access from a device must use an mmu notifier
+ * critical section to hold a device specific lock during programming. Once
+ * programming is complete it should drop the page lock and reference, after
+ * which CPU access to the page will revoke the exclusive access.
+ */
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, struct page **pages,
+				void *arg)
+{
+	long npages = (end - start) >> PAGE_SHIFT; /* signed: GUP may fail */
+	long i;
+
+	npages = get_user_pages_remote(mm, start, npages,
+				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
+				       pages, NULL, NULL);
+	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
+		if (!trylock_page(pages[i])) {
+			put_page(pages[i]);
+			pages[i] = NULL;
+			continue;
+		}
+
+		if (!try_to_protect(pages[i], mm, start, arg)) {
+			unlock_page(pages[i]);
+			put_page(pages[i]);
+			pages[i] = NULL;
+		}
+	}
+
+	return npages;
+}
+EXPORT_SYMBOL_GPL(make_device_exclusive_range);
+
 void __put_anon_vma(struct anon_vma *anon_vma)
 {
 	struct anon_vma *root = anon_vma->root;