diff mbox series

[v10,07/10] mm: Device exclusive memory access

Message ID 20210607075855.5084-8-apopple@nvidia.com (mailing list archive)
State New, archived
Headers show
Series Add support for SVM atomics in Nouveau | expand

Commit Message

Alistair Popple June 7, 2021, 7:58 a.m. UTC
Some devices require exclusive write access to shared virtual
memory (SVM) ranges to perform atomic operations on that memory. This
requires CPU page tables to be updated to deny access whilst atomic
operations are occurring.

In order to do this introduce a new swap entry
type (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for
exclusive access by a device all page table mappings for the particular
range are replaced with device exclusive swap entries. This causes any
CPU access to the page to result in a fault.

Faults are resovled by replacing the faulting entry with the original
mapping. This results in MMU notifiers being called which a driver uses
to update access permissions such as revoking atomic access. After
notifiers have been called the device will no longer have exclusive
access to the region.

Walking of the page tables to find the target pages is handled by
get_user_pages() rather than a direct page table walk. A direct page
table walk similar to what migrate_vma_collect()/unmap() does could also
have been utilised. However this resulted in more code similar in
functionality to what get_user_pages() provides as page faulting is
required to make the PTEs present and to break COW.

Signed-off-by: Alistair Popple <apopple@nvidia.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>

---

v10:
* Make device exclusive code conditional on CONFIG_DEVICE_PRIVATE.
* Updates to code comments and more minor code cleanups.

v9:
* Split rename of migrate_pgmap_owner into a separate patch.
* Added comments explaining SWP_DEVICE_EXCLUSIVE_* entries.
* Renamed try_to_protect{_one} to page_make_device_exclusive{_one} based
  somewhat on a suggestion from Peter Xu. I was never particularly happy
  with try_to_protect() as a name so think this is better.
* Removed unneccesary code and reworded some comments based on feedback
  from Peter Xu.
* Removed the VMA walk when restoring PTEs for device-exclusive entries.
* Simplified implementation of copy_pte_range() to fail if the page
  cannot be locked. This might lead to occasional fork() failures but at
  this stage we don't think that will be an issue.

v8:
* Remove device exclusive entries on fork rather than copy them.

v7:
* Added Christoph's Reviewed-by.
* Minor cosmetic cleanups suggested by Christoph.
* Replace mmu_notifier_range_init_migrate/exclusive with
  mmu_notifier_range_init_owner as suggested by Christoph.
* Replaced lock_page() with lock_page_retry() when handling faults.
* Restrict to anonymous pages for now.

v6:
* Fixed a bisectablity issue due to incorrectly applying the rename of
  migrate_pgmap_owner to the wrong patches for Nouveau and hmm_test.

v5:
* Renamed range->migrate_pgmap_owner to range->owner.
* Added MMU_NOTIFY_EXCLUSIVE to allow passing of a driver cookie which
  allows notifiers called as a result of make_device_exclusive_range() to
  be ignored.
* Added a check to try_to_protect_one() to detect if the pages originally
  returned from get_user_pages() have been unmapped or not.
* Removed check_device_exclusive_range() as it is no longer required with
  the other changes.
* Documentation update.

v4:
* Add function to check that mappings are still valid and exclusive.
* s/long/unsigned long/ in make_device_exclusive_entry().
---
 Documentation/vm/hmm.rst     |  17 ++++
 include/linux/mmu_notifier.h |   6 ++
 include/linux/rmap.h         |   4 +
 include/linux/swap.h         |   9 +-
 include/linux/swapops.h      |  44 ++++++++-
 mm/hmm.c                     |   5 +
 mm/memory.c                  | 127 +++++++++++++++++++++++-
 mm/mprotect.c                |   8 ++
 mm/page_vma_mapped.c         |   9 +-
 mm/rmap.c                    | 182 +++++++++++++++++++++++++++++++++++
 10 files changed, 401 insertions(+), 10 deletions(-)

Comments

Peter Xu June 8, 2021, 6:33 p.m. UTC | #1
On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:

[...]

> +static bool page_make_device_exclusive_one(struct page *page,
> +		struct vm_area_struct *vma, unsigned long address, void *priv)
> +{
> +	struct mm_struct *mm = vma->vm_mm;
> +	struct page_vma_mapped_walk pvmw = {
> +		.page = page,
> +		.vma = vma,
> +		.address = address,
> +	};
> +	struct make_exclusive_args *args = priv;
> +	pte_t pteval;
> +	struct page *subpage;
> +	bool ret = true;
> +	struct mmu_notifier_range range;
> +	swp_entry_t entry;
> +	pte_t swp_pte;
> +
> +	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> +				      vma->vm_mm, address, min(vma->vm_end,
> +				      address + page_size(page)), args->owner);
> +	mmu_notifier_invalidate_range_start(&range);
> +
> +	while (page_vma_mapped_walk(&pvmw)) {
> +		/* Unexpected PMD-mapped THP? */
> +		VM_BUG_ON_PAGE(!pvmw.pte, page);

[1]

> +
> +		if (!pte_present(*pvmw.pte)) {
> +			ret = false;
> +			page_vma_mapped_walk_done(&pvmw);
> +			break;
> +		}
> +
> +		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> +		address = pvmw.address;

I raised a question here previously and didn't get an answer...

https://lore.kernel.org/linux-mm/YLDr%2FRyAdUR4q0kk@t490s/

I think I get your point now and it does look possible that the split page can
still be mapped somewhere else as thp, then having some subpage maintainance
looks necessary.  The confusing part is above [1] you've also got that
VM_BUG_ON_PAGE() assuming it must not be a mapped pmd at all..

Then I remembered these code majorly come from the try_to_unmap() so I looked
there.  I _think_ what's missing here is something like:

	if (flags & TTU_SPLIT_HUGE_PMD)
		split_huge_pmd_address(vma, address, false, page);

at the entry of page_make_device_exclusive_one()?

That !pte assertion in try_to_unmap() makes sense to me as long as it has split
the thp page first always.  However seems not the case for FOLL_SPLIT_PMD as
you previously mentioned.

Meanwhile, I also started to wonder whether it's even right to call rmap_walk()
with tail pages...  Please see below.

> +
> +		/* Nuke the page table entry. */
> +		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> +		pteval = ptep_clear_flush(vma, address, pvmw.pte);
> +
> +		/* Move the dirty bit to the page. Now the pte is gone. */
> +		if (pte_dirty(pteval))
> +			set_page_dirty(page);
> +
> +		/*
> +		 * Check that our target page is still mapped at the expected
> +		 * address.
> +		 */
> +		if (args->mm == mm && args->address == address &&
> +		    pte_write(pteval))
> +			args->valid = true;
> +
> +		/*
> +		 * Store the pfn of the page in a special migration
> +		 * pte. do_swap_page() will wait until the migration
> +		 * pte is removed and then restart fault handling.
> +		 */
> +		if (pte_write(pteval))
> +			entry = make_writable_device_exclusive_entry(
> +							page_to_pfn(subpage));
> +		else
> +			entry = make_readable_device_exclusive_entry(
> +							page_to_pfn(subpage));
> +		swp_pte = swp_entry_to_pte(entry);
> +		if (pte_soft_dirty(pteval))
> +			swp_pte = pte_swp_mksoft_dirty(swp_pte);
> +		if (pte_uffd_wp(pteval))
> +			swp_pte = pte_swp_mkuffd_wp(swp_pte);
> +
> +		set_pte_at(mm, address, pvmw.pte, swp_pte);
> +
> +		/*
> +		 * There is a reference on the page for the swap entry which has
> +		 * been removed, so shouldn't take another.
> +		 */
> +		page_remove_rmap(subpage, false);
> +	}
> +
> +	mmu_notifier_invalidate_range_end(&range);
> +
> +	return ret;
> +}
> +
> +/**
> + * page_make_device_exclusive - mark the page exclusively owned by a device
> + * @page: the page to replace page table entries for
> + * @mm: the mm_struct where the page is expected to be mapped
> + * @address: address where the page is expected to be mapped
> + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
> + *
> + * Tries to remove all the page table entries which are mapping this page and
> + * replace them with special device exclusive swap entries to grant a device
> + * exclusive access to the page. Caller must hold the page lock.
> + *
> + * Returns false if the page is still mapped, or if it could not be unmapped
> + * from the expected address. Otherwise returns true (success).
> + */
> +static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm,
> +				unsigned long address, void *owner)
> +{
> +	struct make_exclusive_args args = {
> +		.mm = mm,
> +		.address = address,
> +		.owner = owner,
> +		.valid = false,
> +	};
> +	struct rmap_walk_control rwc = {
> +		.rmap_one = page_make_device_exclusive_one,
> +		.done = page_not_mapped,
> +		.anon_lock = page_lock_anon_vma_read,
> +		.arg = &args,
> +	};
> +
> +	/*
> +	 * Restrict to anonymous pages for now to avoid potential writeback
> +	 * issues.
> +	 */
> +	if (!PageAnon(page))
> +		return false;
> +
> +	rmap_walk(page, &rwc);

Here we call rmap_walk() on each page we've got.  If it was thp then IIUC it'll
become the tail pages to walk as the outcome of FOLL_SPLIT_PMD gup (please
refer to the last reply of mine).  However now I'm uncertain whether we can do
rmap_walk on tail page at all...  As rmap_walk_anon() has thp_nr_pages() which
has:

	VM_BUG_ON_PGFLAGS(PageTail(page), page);

So... for thp mappings, wondering whether we should do normal GUP (without
SPLIT), pass in always normal or head pages into rmap_walk(), but then
unconditionally split_huge_pmd_address() in page_make_device_exclusive_one()?

Please correct me if I made silly mistakes on above, as I am looking at the
code when/during trying to review the patch, so it's possible I missed
something again.  Neither does this code a huge matter since it's not in
general mm path, but still raise this question up.

Thanks,

> +
> +	return args.valid && !page_mapcount(page);
> +}
> +
> +/**
> + * make_device_exclusive_range() - Mark a range for exclusive use by a device
> + * @mm: mm_struct of assoicated target process
> + * @start: start of the region to mark for exclusive device access
> + * @end: end address of region
> + * @pages: returns the pages which were successfully marked for exclusive access
> + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
> + *
> + * Returns: number of pages found in the range by GUP. A page is marked for
> + * exclusive access only if the page pointer is non-NULL.
> + *
> + * This function finds ptes mapping page(s) to the given address range, locks
> + * them and replaces mappings with special swap entries preventing userspace CPU
> + * access. On fault these entries are replaced with the original mapping after
> + * calling MMU notifiers.
> + *
> + * A driver using this to program access from a device must use a mmu notifier
> + * critical section to hold a device specific lock during programming. Once
> + * programming is complete it should drop the page lock and reference after
> + * which point CPU access to the page will revoke the exclusive access.
> + */
> +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> +				unsigned long end, struct page **pages,
> +				void *owner)
> +{
> +	long npages = (end - start) >> PAGE_SHIFT;
> +	unsigned long i;
> +
> +	npages = get_user_pages_remote(mm, start, npages,
> +				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> +				       pages, NULL, NULL);
> +	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> +		if (!trylock_page(pages[i])) {
> +			put_page(pages[i]);
> +			pages[i] = NULL;
> +			continue;
> +		}
> +
> +		if (!page_make_device_exclusive(pages[i], mm, start, owner)) {
> +			unlock_page(pages[i]);
> +			put_page(pages[i]);
> +			pages[i] = NULL;
> +		}
> +	}
> +
> +	return npages;
> +}
> +EXPORT_SYMBOL_GPL(make_device_exclusive_range);
> +#endif
> +
>  void __put_anon_vma(struct anon_vma *anon_vma)
>  {
>  	struct anon_vma *root = anon_vma->root;
> -- 
> 2.20.1
>
Alistair Popple June 9, 2021, 9:38 a.m. UTC | #2
On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:
> 
> [...]
> 
> > +static bool page_make_device_exclusive_one(struct page *page,
> > +             struct vm_area_struct *vma, unsigned long address, void *priv)
> > +{
> > +     struct mm_struct *mm = vma->vm_mm;
> > +     struct page_vma_mapped_walk pvmw = {
> > +             .page = page,
> > +             .vma = vma,
> > +             .address = address,
> > +     };
> > +     struct make_exclusive_args *args = priv;
> > +     pte_t pteval;
> > +     struct page *subpage;
> > +     bool ret = true;
> > +     struct mmu_notifier_range range;
> > +     swp_entry_t entry;
> > +     pte_t swp_pte;
> > +
> > +     mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> > +                                   vma->vm_mm, address, min(vma->vm_end,
> > +                                   address + page_size(page)), args->owner);
> > +     mmu_notifier_invalidate_range_start(&range);
> > +
> > +     while (page_vma_mapped_walk(&pvmw)) {
> > +             /* Unexpected PMD-mapped THP? */
> > +             VM_BUG_ON_PAGE(!pvmw.pte, page);
> 
> [1]
> 
> > +
> > +             if (!pte_present(*pvmw.pte)) {
> > +                     ret = false;
> > +                     page_vma_mapped_walk_done(&pvmw);
> > +                     break;
> > +             }
> > +
> > +             subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > +             address = pvmw.address;
> 
> I raised a question here previously and didn't get an answer...
> 
> https://lore.kernel.org/linux-mm/YLDr%2FRyAdUR4q0kk@t490s/

Sorry, I had overlooked that. Will continue the discussion here.

> I think I get your point now and it does look possible that the split page can
> still be mapped somewhere else as thp, then having some subpage maintainance
> looks necessary.  The confusing part is above [1] you've also got that
> VM_BUG_ON_PAGE() assuming it must not be a mapped pmd at all..

Going back I thought your original question was whether subpage != page is
possible. My main point was it's possible if we get a thp head. In that case we
need to replace all pte's with exclusive entries because I haven't (yet)
defined a pmd version of device exclusive entries and also rmap_walk won't deal
with tail pages (see below).

> Then I remembered these code majorly come from the try_to_unmap() so I looked
> there.  I _think_ what's missing here is something like:
> 
>         if (flags & TTU_SPLIT_HUGE_PMD)
>                 split_huge_pmd_address(vma, address, false, page);
> 
> at the entry of page_make_device_exclusive_one()?
>
> That !pte assertion in try_to_unmap() makes sense to me as long as it has split
> the thp page first always.  However seems not the case for FOLL_SPLIT_PMD as
> you previously mentioned.

At present this is limited to PageAnon pages which have had CoW broken, which I
think means there shouldn't be other mappings so I expect the PMD will always
have been split into small PTEs mapping subpages by GUP which is what that
assertion [1] is checking. I could call split_huge_pmd_address() unconditionally
as suggested but see the discussion below.

> Meanwhile, I also started to wonder whether it's even right to call rmap_walk()
> with tail pages...  Please see below.
> 
> > +
> > +             /* Nuke the page table entry. */
> > +             flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > +             pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > +
> > +             /* Move the dirty bit to the page. Now the pte is gone. */
> > +             if (pte_dirty(pteval))
> > +                     set_page_dirty(page);
> > +
> > +             /*
> > +              * Check that our target page is still mapped at the expected
> > +              * address.
> > +              */
> > +             if (args->mm == mm && args->address == address &&
> > +                 pte_write(pteval))
> > +                     args->valid = true;
> > +
> > +             /*
> > +              * Store the pfn of the page in a special migration
> > +              * pte. do_swap_page() will wait until the migration
> > +              * pte is removed and then restart fault handling.
> > +              */
> > +             if (pte_write(pteval))
> > +                     entry = make_writable_device_exclusive_entry(
> > +                                                     page_to_pfn(subpage));
> > +             else
> > +                     entry = make_readable_device_exclusive_entry(
> > +                                                     page_to_pfn(subpage));
> > +             swp_pte = swp_entry_to_pte(entry);
> > +             if (pte_soft_dirty(pteval))
> > +                     swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > +             if (pte_uffd_wp(pteval))
> > +                     swp_pte = pte_swp_mkuffd_wp(swp_pte);
> > +
> > +             set_pte_at(mm, address, pvmw.pte, swp_pte);
> > +
> > +             /*
> > +              * There is a reference on the page for the swap entry which has
> > +              * been removed, so shouldn't take another.
> > +              */
> > +             page_remove_rmap(subpage, false);
> > +     }
> > +
> > +     mmu_notifier_invalidate_range_end(&range);
> > +
> > +     return ret;
> > +}
> > +
> > +/**
> > + * page_make_device_exclusive - mark the page exclusively owned by a device
> > + * @page: the page to replace page table entries for
> > + * @mm: the mm_struct where the page is expected to be mapped
> > + * @address: address where the page is expected to be mapped
> > + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
> > + *
> > + * Tries to remove all the page table entries which are mapping this page and
> > + * replace them with special device exclusive swap entries to grant a device
> > + * exclusive access to the page. Caller must hold the page lock.
> > + *
> > + * Returns false if the page is still mapped, or if it could not be unmapped
> > + * from the expected address. Otherwise returns true (success).
> > + */
> > +static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm,
> > +                             unsigned long address, void *owner)
> > +{
> > +     struct make_exclusive_args args = {
> > +             .mm = mm,
> > +             .address = address,
> > +             .owner = owner,
> > +             .valid = false,
> > +     };
> > +     struct rmap_walk_control rwc = {
> > +             .rmap_one = page_make_device_exclusive_one,
> > +             .done = page_not_mapped,
> > +             .anon_lock = page_lock_anon_vma_read,
> > +             .arg = &args,
> > +     };
> > +
> > +     /*
> > +      * Restrict to anonymous pages for now to avoid potential writeback
> > +      * issues.
> > +      */
> > +     if (!PageAnon(page))
> > +             return false;
> > +
> > +     rmap_walk(page, &rwc);
> 
> Here we call rmap_walk() on each page we've got.  If it was thp then IIUC it'll
> become the tail pages to walk as the outcome of FOLL_SPLIT_PMD gup (please
> refer to the last reply of mine).  However now I'm uncertain whether we can do
> rmap_walk on tail page at all...  As rmap_walk_anon() has thp_nr_pages() which
> has:
> 
>         VM_BUG_ON_PGFLAGS(PageTail(page), page);

In either case (FOLL_SPLIT_PMD or not) my understanding is GUP will return a
sub/tail page (perhaps I mixed up some terminology in the last thread but I
think we're in agreement here). For thp this means we could end up passing
tail pages to rmap_walk(), however it doesn't actually walk them.

Based on the results of previous testing I had done I assumed rmap_walk()
filtered out tail pages. It does, and I didn't hit the BUG_ON above, but the
filtering was not as deliberate as assumed.

I've gone back and looked at what was happening in my earlier tests and the
tail pages get filtered because the VMA is not getting locked in
page_lock_anon_vma_read() due to failing this check:

	anon_mapping = (unsigned long)READ_ONCE(page->mapping);
	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		goto out;

And now I'm not sure it makes sense to read page->mapping of a tail page. So
it might be best if we explicitly ignore any tail pages returned from GUP, at
least for now (a future series will improve thp support such as adding a pmd
version for exclusive entries).

> So... for thp mappings, wondering whether we should do normal GUP (without
> SPLIT), pass in always normal or head pages into rmap_walk(), but then
> unconditionally split_huge_pmd_address() in page_make_device_exclusive_one()?

That could work (although I think GUP will still return tail pages - see
follow_trans_huge_pmd() which is called from follow_pmd_mask() in gup). The
main problem is split_huge_pmd_address() unconditionally calls a mmu notifier
so I would need to plumb in passing an owner everywhere which could get messy.

I suppose instead we could make that conditional on pmd_trans_huge(*pmd) but
that's just replicating what GUP already does for us. When I try adding support
for file mappings I will probably have to change this, but I am hoping to leave
that for a future series once the basic concept is there for anonymous mappings.

> Please correct me if I made silly mistakes on above, as I am looking at the
> code when/during trying to review the patch, so it's possible I missed
> something again.  Neither does this code a huge matter since it's not in
> general mm path, but still raise this question up.

You're correct that this bit isn't in the general mm path so perhaps doesn't
matter as much, but I still want to get it right so appreciate you taking the
time to comment! Thanks.

> Thanks,
> 
> > +
> > +     return args.valid && !page_mapcount(page);
> > +}
> > +
> > +/**
> > + * make_device_exclusive_range() - Mark a range for exclusive use by a device
> > + * @mm: mm_struct of assoicated target process
> > + * @start: start of the region to mark for exclusive device access
> > + * @end: end address of region
> > + * @pages: returns the pages which were successfully marked for exclusive access
> > + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
> > + *
> > + * Returns: number of pages found in the range by GUP. A page is marked for
> > + * exclusive access only if the page pointer is non-NULL.
> > + *
> > + * This function finds ptes mapping page(s) to the given address range, locks
> > + * them and replaces mappings with special swap entries preventing userspace CPU
> > + * access. On fault these entries are replaced with the original mapping after
> > + * calling MMU notifiers.
> > + *
> > + * A driver using this to program access from a device must use a mmu notifier
> > + * critical section to hold a device specific lock during programming. Once
> > + * programming is complete it should drop the page lock and reference after
> > + * which point CPU access to the page will revoke the exclusive access.
> > + */
> > +int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
> > +                             unsigned long end, struct page **pages,
> > +                             void *owner)
> > +{
> > +     long npages = (end - start) >> PAGE_SHIFT;
> > +     unsigned long i;
> > +
> > +     npages = get_user_pages_remote(mm, start, npages,
> > +                                    FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
> > +                                    pages, NULL, NULL);
> > +     for (i = 0; i < npages; i++, start += PAGE_SIZE) {
> > +             if (!trylock_page(pages[i])) {
> > +                     put_page(pages[i]);
> > +                     pages[i] = NULL;
> > +                     continue;
> > +             }
> > +
> > +             if (!page_make_device_exclusive(pages[i], mm, start, owner)) {
> > +                     unlock_page(pages[i]);
> > +                     put_page(pages[i]);
> > +                     pages[i] = NULL;
> > +             }
> > +     }
> > +
> > +     return npages;
> > +}
> > +EXPORT_SYMBOL_GPL(make_device_exclusive_range);
> > +#endif
> > +
> >  void __put_anon_vma(struct anon_vma *anon_vma)
> >  {
> >       struct anon_vma *root = anon_vma->root;
> > --
> > 2.20.1
> >
> 
> --
> Peter Xu
>
Peter Xu June 9, 2021, 4:05 p.m. UTC | #3
On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:
> > 
> > [...]
> > 
> > > +static bool page_make_device_exclusive_one(struct page *page,
> > > +             struct vm_area_struct *vma, unsigned long address, void *priv)
> > > +{
> > > +     struct mm_struct *mm = vma->vm_mm;
> > > +     struct page_vma_mapped_walk pvmw = {
> > > +             .page = page,
> > > +             .vma = vma,
> > > +             .address = address,
> > > +     };
> > > +     struct make_exclusive_args *args = priv;
> > > +     pte_t pteval;
> > > +     struct page *subpage;
> > > +     bool ret = true;
> > > +     struct mmu_notifier_range range;
> > > +     swp_entry_t entry;
> > > +     pte_t swp_pte;
> > > +
> > > +     mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
> > > +                                   vma->vm_mm, address, min(vma->vm_end,
> > > +                                   address + page_size(page)), args->owner);
> > > +     mmu_notifier_invalidate_range_start(&range);
> > > +
> > > +     while (page_vma_mapped_walk(&pvmw)) {
> > > +             /* Unexpected PMD-mapped THP? */
> > > +             VM_BUG_ON_PAGE(!pvmw.pte, page);
> > 
> > [1]
> > 
> > > +
> > > +             if (!pte_present(*pvmw.pte)) {
> > > +                     ret = false;
> > > +                     page_vma_mapped_walk_done(&pvmw);
> > > +                     break;
> > > +             }
> > > +
> > > +             subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
> > > +             address = pvmw.address;
> > 
> > I raised a question here previously and didn't get an answer...
> > 
> > https://lore.kernel.org/linux-mm/YLDr%2FRyAdUR4q0kk@t490s/
> 
> Sorry, I had overlooked that. Will continue the discussion here.

No problem.  I also didn't really express clearly last time, I'm happy we can
discuss this more thoroughly, even if it may be a corner case only.

> 
> > I think I get your point now and it does look possible that the split page can
> > still be mapped somewhere else as thp, then having some subpage maintainance
> > looks necessary.  The confusing part is above [1] you've also got that
> > VM_BUG_ON_PAGE() assuming it must not be a mapped pmd at all..
> 
> Going back I thought your original question was whether subpage != page is
> possible. My main point was it's possible if we get a thp head. In that case we
> need to replace all pte's with exclusive entries because I haven't (yet)
> defined a pmd version of device exclusive entries and also rmap_walk won't deal
> with tail pages (see below).
> 
> > Then I remembered these code majorly come from the try_to_unmap() so I looked
> > there.  I _think_ what's missing here is something like:
> > 
> >         if (flags & TTU_SPLIT_HUGE_PMD)
> >                 split_huge_pmd_address(vma, address, false, page);
> > 
> > at the entry of page_make_device_exclusive_one()?
> >
> > That !pte assertion in try_to_unmap() makes sense to me as long as it has split
> > the thp page first always.  However seems not the case for FOLL_SPLIT_PMD as
> > you previously mentioned.
> 
> At present this is limited to PageAnon pages which have had CoW broken, which I
> think means there shouldn't be other mappings so I expect the PMD will always
> have been split into small PTEs mapping subpages by GUP which is what that
> assertion [1] is checking. I could call split_huge_pmd_address() unconditionally
> as suggested but see the discussion below.

Yes, I think calling that unconditionally should be enough.

> 
> > Meanwhile, I also started to wonder whether it's even right to call rmap_walk()
> > with tail pages...  Please see below.
> > 
> > > +
> > > +             /* Nuke the page table entry. */
> > > +             flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
> > > +             pteval = ptep_clear_flush(vma, address, pvmw.pte);
> > > +
> > > +             /* Move the dirty bit to the page. Now the pte is gone. */
> > > +             if (pte_dirty(pteval))
> > > +                     set_page_dirty(page);
> > > +
> > > +             /*
> > > +              * Check that our target page is still mapped at the expected
> > > +              * address.
> > > +              */
> > > +             if (args->mm == mm && args->address == address &&
> > > +                 pte_write(pteval))
> > > +                     args->valid = true;
> > > +
> > > +             /*
> > > +              * Store the pfn of the page in a special migration
> > > +              * pte. do_swap_page() will wait until the migration
> > > +              * pte is removed and then restart fault handling.
> > > +              */
> > > +             if (pte_write(pteval))
> > > +                     entry = make_writable_device_exclusive_entry(
> > > +                                                     page_to_pfn(subpage));
> > > +             else
> > > +                     entry = make_readable_device_exclusive_entry(
> > > +                                                     page_to_pfn(subpage));
> > > +             swp_pte = swp_entry_to_pte(entry);
> > > +             if (pte_soft_dirty(pteval))
> > > +                     swp_pte = pte_swp_mksoft_dirty(swp_pte);
> > > +             if (pte_uffd_wp(pteval))
> > > +                     swp_pte = pte_swp_mkuffd_wp(swp_pte);
> > > +
> > > +             set_pte_at(mm, address, pvmw.pte, swp_pte);
> > > +
> > > +             /*
> > > +              * There is a reference on the page for the swap entry which has
> > > +              * been removed, so shouldn't take another.
> > > +              */
> > > +             page_remove_rmap(subpage, false);
> > > +     }
> > > +
> > > +     mmu_notifier_invalidate_range_end(&range);
> > > +
> > > +     return ret;
> > > +}
> > > +
> > > +/**
> > > + * page_make_device_exclusive - mark the page exclusively owned by a device
> > > + * @page: the page to replace page table entries for
> > > + * @mm: the mm_struct where the page is expected to be mapped
> > > + * @address: address where the page is expected to be mapped
> > > + * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
> > > + *
> > > + * Tries to remove all the page table entries which are mapping this page and
> > > + * replace them with special device exclusive swap entries to grant a device
> > > + * exclusive access to the page. Caller must hold the page lock.
> > > + *
> > > + * Returns false if the page is still mapped, or if it could not be unmapped
> > > + * from the expected address. Otherwise returns true (success).
> > > + */
> > > +static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm,
> > > +                             unsigned long address, void *owner)
> > > +{
> > > +     struct make_exclusive_args args = {
> > > +             .mm = mm,
> > > +             .address = address,
> > > +             .owner = owner,
> > > +             .valid = false,
> > > +     };
> > > +     struct rmap_walk_control rwc = {
> > > +             .rmap_one = page_make_device_exclusive_one,
> > > +             .done = page_not_mapped,
> > > +             .anon_lock = page_lock_anon_vma_read,
> > > +             .arg = &args,
> > > +     };
> > > +
> > > +     /*
> > > +      * Restrict to anonymous pages for now to avoid potential writeback
> > > +      * issues.
> > > +      */
> > > +     if (!PageAnon(page))
> > > +             return false;
> > > +
> > > +     rmap_walk(page, &rwc);
> > 
> > Here we call rmap_walk() on each page we've got.  If it was thp then IIUC it'll
> > become the tail pages to walk as the outcome of FOLL_SPLIT_PMD gup (please
> > refer to the last reply of mine).  However now I'm uncertain whether we can do
> > rmap_walk on tail page at all...  As rmap_walk_anon() has thp_nr_pages() which
> > has:
> > 
> >         VM_BUG_ON_PGFLAGS(PageTail(page), page);
> 
> In either case (FOLL_SPLIT_PMD or not) my understanding is GUP will return a
> sub/tail page (perhaps I mixed up some terminology in the last thread but I
> think we're in agreement here).

Aha, I totally missed this when I read last time (of follow_trans_huge_pmd)..

	page += (addr & ~HPAGE_PMD_MASK) >> PAGE_SHIFT;

Now I agree it'll always return subpage, even if thp mapped.  And do
FOLL_SPLIT_PMD makes sense too to do early break on cow pages as you said
before.

> For thp this means we could end up passing
> tail pages to rmap_walk(), however it doesn't actually walk them.
> 
> Based on the results of previous testing I had done I assumed rmap_walk()
> filtered out tail pages. It does, and I didn't hit the BUG_ON above, but the
> filtering was not as deliberate as assumed.
> 
> I've gone back and looked at what was happening in my earlier tests and the
> tail pages get filtered because the VMA is not getting locked in
> page_lock_anon_vma_read() due to failing this check:
> 
> 	anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> 	if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> 		goto out;
> 
> And now I'm not sure it makes sense to read page->mapping of a tail page. So
> it might be best if we explicitly ignore any tail pages returned from GUP, at
> least for now (a future series will improve thp support such as adding a pmd
> version for exclusive entries).

I feel like it's illegal to access page->mapping of tail pages; I looked at
what happens if we call page_anon_vma() on a tail page:

struct anon_vma *page_anon_vma(struct page *page)
{
	unsigned long mapping;

	page = compound_head(page);
	mapping = (unsigned long)page->mapping;
	if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
		return NULL;
	return __page_rmapping(page);
}

It'll just take the head's mapping instead.  It makes sense since the tail page
shouldn't have a different value against the head page, afaiu.

It would be great if thp experts could chim in.  Before that happens, I agree
with you that a safer approach is to explicitly not walk a tail page for its
rmap (and I think the rmap of a tail page will be the same of the head
anyways.. since they seem to share the anon_vma as quoted).

> 
> > So... for thp mappings, wondering whether we should do normal GUP (without
> > SPLIT), pass in always normal or head pages into rmap_walk(), but then
> > unconditionally split_huge_pmd_address() in page_make_device_exclusive_one()?
> 
> That could work (although I think GUP will still return tail pages - see
> follow_trans_huge_pmd() which is called from follow_pmd_mask() in gup).

Agreed.

> The main problem is split_huge_pmd_address() unconditionally calls a mmu
> notifier so I would need to plumb in passing an owner everywhere which could
> get messy.

Could I ask why?  split_huge_pmd_address() will notify with CLEAR, so I'm a bit
confused why we need to pass over the owner.

I thought plumb it right before your EXCLUSIVE notifier init would work?

---8<---
diff --git a/mm/rmap.c b/mm/rmap.c
index a94d9aed9d95..360ce86f3822 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2042,6 +2042,12 @@ static bool page_make_device_exclusive_one(struct page *page,
        swp_entry_t entry;
        pte_t swp_pte;
 
+       /*
+        * Make sure thps split as device exclusive entries only support pte
+        * level for now.
+        */
+       split_huge_pmd_address(vma, address, false, page);
+
        mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
                                      vma->vm_mm, address, min(vma->vm_end,
                                      address + page_size(page)), args->owner);
---8<---

Thanks,
Alistair Popple June 10, 2021, 12:18 a.m. UTC | #4
On Thursday, 10 June 2021 2:05:06 AM AEST Peter Xu wrote:
> On Wed, Jun 09, 2021 at 07:38:04PM +1000, Alistair Popple wrote:
> > On Wednesday, 9 June 2021 4:33:52 AM AEST Peter Xu wrote:
> > > On Mon, Jun 07, 2021 at 05:58:52PM +1000, Alistair Popple wrote:

[...]

> > For thp this means we could end up passing
> > tail pages to rmap_walk(), however it doesn't actually walk them.
> >
> > Based on the results of previous testing I had done I assumed rmap_walk()
> > filtered out tail pages. It does, and I didn't hit the BUG_ON above, but the
> > filtering was not as deliberate as assumed.
> >
> > I've gone back and looked at what was happening in my earlier tests and the
> > tail pages get filtered because the VMA is not getting locked in
> > page_lock_anon_vma_read() due to failing this check:
> >
> >       anon_mapping = (unsigned long)READ_ONCE(page->mapping);
> >       if ((anon_mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
> >               goto out;
> >
> > And now I'm not sure it makes sense to read page->mapping of a tail page. So
> > it might be best if we explicitly ignore any tail pages returned from GUP, at
> > least for now (a future series will improve thp support such as adding a pmd
> > version for exclusive entries).
> 
> I feel like it's illegal to access page->mapping of tail pages; I looked at
> what happens if we call page_anon_vma() on a tail page:
> 
> struct anon_vma *page_anon_vma(struct page *page)
> {
>         unsigned long mapping;
> 
>         page = compound_head(page);
>         mapping = (unsigned long)page->mapping;
>         if ((mapping & PAGE_MAPPING_FLAGS) != PAGE_MAPPING_ANON)
>                 return NULL;
>         return __page_rmapping(page);
> }
> 
> It'll just take the head's mapping instead.  It makes sense since the tail page
> shouldn't have a different value against the head page, afaiu.

Right, it makes no sense to look at ->mapping on a tail page because the field
is used for something else. On the 1st tail page it is ->compound_nr and on the
2nd tail page it is ->deferred_list. See the definitions of compound_nr() and
page_deferred_list() respectively. I suppose on the rest of the pages it could
be anything.

I think in practice it is probably ok - iuc bit 0 won't be set for compound_nr
and certainly not for deferred_list->next (a pointer). But none of that seems
intentional, so it would be better to be explicit and not walk the tail pages.

> It would be great if thp experts could chim in.  Before that happens, I agree
> with you that a safer approach is to explicitly not walk a tail page for its
> rmap (and I think the rmap of a tail page will be the same of the head
> anyways.. since they seem to share the anon_vma as quoted).
> >
> > > So... for thp mappings, wondering whether we should do normal GUP (without
> > > SPLIT), pass in always normal or head pages into rmap_walk(), but then
> > > unconditionally split_huge_pmd_address() in page_make_device_exclusive_one()?
> >
> > That could work (although I think GUP will still return tail pages - see
> > follow_trans_huge_pmd() which is called from follow_pmd_mask() in gup).
> 
> Agreed.
> 
> > The main problem is split_huge_pmd_address() unconditionally calls a mmu
> > notifier so I would need to plumb in passing an owner everywhere which could
> > get messy.
> 
> Could I ask why?  split_huge_pmd_address() will notify with CLEAR, so I'm a bit
> confused why we need to pass over the owner.

Sure, it is the same reason we need to pass it for the exclusive notifier.
Any invalidation during the make exclusive operation will break the mmu read
side critical section forcing a retry of the operation. The owner field is what
is used to filter out invalidations (such as the exclusive invalidation) that
don't need to be retried.
 
> I thought plumb it right before your EXCLUSIVE notifier init would work?

I did try this just to double check and it doesn't work due to the unconditional
notifier.

> ---8<---
> diff --git a/mm/rmap.c b/mm/rmap.c
> index a94d9aed9d95..360ce86f3822 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -2042,6 +2042,12 @@ static bool page_make_device_exclusive_one(struct page *page,
>         swp_entry_t entry;
>         pte_t swp_pte;
> 
> +       /*
> +        * Make sure thps split as device exclusive entries only support pte
> +        * level for now.
> +        */
> +       split_huge_pmd_address(vma, address, false, page);
> +
>         mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
>                                       vma->vm_mm, address, min(vma->vm_end,
>                                       address + page_size(page)), args->owner);
> ---8<---
> 
> Thanks,
> 
> --
> Peter Xu
>
Alistair Popple June 10, 2021, 2:21 p.m. UTC | #5
On Friday, 11 June 2021 4:04:35 AM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Thu, Jun 10, 2021 at 10:18:25AM +1000, Alistair Popple wrote:
> > > > The main problem is split_huge_pmd_address() unconditionally calls a mmu
> > > > notifier so I would need to plumb in passing an owner everywhere which could
> > > > get messy.
> > >
> > > Could I ask why?  split_huge_pmd_address() will notify with CLEAR, so I'm a bit
> > > confused why we need to pass over the owner.
> >
> > Sure, it is the same reason we need to pass it for the exclusive notifier.
> > Any invalidation during the make exclusive operation will break the mmu read
> > side critical section forcing a retry of the operation. The owner field is what
> > is used to filter out invalidations (such as the exclusive invalidation) that
> > don't need to be retried.
> 
> Do you mean the mmu_interval_read_begin|retry() calls?

Yep.

> Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to explicit
> call split_huge_pmd_address(), afaict.  Since both of them use __split_huge_pmd()
> internally which will generate that unwanted CLEAR notify.

Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
which will always CLEAR. However gup only calls split_huge_pmd_address() if it
finds a thp pmd. In follow_pmd_mask() we have:

	if (likely(!pmd_trans_huge(pmdval)))
		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);

So I don't think we have a problem here.

> If that's the case, I think it fails because split_huge_pmd_address() will
> trigger that CLEAR notify unconditionally (even if it's not a thp; not sure
> whether it should be optimized to not notify at all... definitely another
> story), while FOLL_SPLIT_PMD will skip the notify as it calls split_huge_pmd()
> instead, who checks the pmd before calling __split_huge_pmd().
> 
> Does it also mean that if there's a real THP it won't really work?  As then
> FOLL_SPLIT_PMD will start to trigger that CLEAR notify too, I think..
> 
> --
> Peter Xu
>
Peter Xu June 10, 2021, 6:04 p.m. UTC | #6
On Thu, Jun 10, 2021 at 10:18:25AM +1000, Alistair Popple wrote:
> > > The main problem is split_huge_pmd_address() unconditionally calls a mmu
> > > notifier so I would need to plumb in passing an owner everywhere which could
> > > get messy.
> > 
> > Could I ask why?  split_huge_pmd_address() will notify with CLEAR, so I'm a bit
> > confused why we need to pass over the owner.
> 
> Sure, it is the same reason we need to pass it for the exclusive notifier.
> Any invalidation during the make exclusive operation will break the mmu read
> side critical section forcing a retry of the operation. The owner field is what
> is used to filter out invalidations (such as the exclusive invalidation) that
> don't need to be retried.

Do you mean the mmu_interval_read_begin|retry() calls?

Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to explicit
call split_huge_pmd_address(), afaict.  Since both of them use __split_huge_pmd()
internally which will generate that unwanted CLEAR notify.

If that's the case, I think it fails because split_huge_pmd_address() will
trigger that CLEAR notify unconditionally (even if it's not a thp; not sure
whether it should be optimized to not notify at all... definitely another
story), while FOLL_SPLIT_PMD will skip the notify as it calls split_huge_pmd()
instead, who checks the pmd before calling __split_huge_pmd().

Does it also mean that if there's a real THP it won't really work?  As then
FOLL_SPLIT_PMD will start to trigger that CLEAR notify too, I think..
Peter Xu June 10, 2021, 11:04 p.m. UTC | #7
On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to explicit
> > call split_huge_pmd_address(), afaict.  Since both of them use __split_huge_pmd()
> > internally which will generate that unwanted CLEAR notify.
> 
> Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> which will always CLEAR. However gup only calls split_huge_pmd_address() if it
> finds a thp pmd. In follow_pmd_mask() we have:
> 
> 	if (likely(!pmd_trans_huge(pmdval)))
> 		return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> 
> So I don't think we have a problem here.

Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, right?  I
mean, if it's a thp for the current mm, afaict pmd_trans_huge() should return
true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
and do the split, then the CLEAR notify.  Hmm.. Did I miss something?
Alistair Popple June 10, 2021, 11:17 p.m. UTC | #8
On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> External email: Use caution opening links or attachments
> 
> 
> On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to explicit
> > > call split_huge_pmd_address(), afaict.  Since both of them use __split_huge_pmd()
> > > internally which will generate that unwanted CLEAR notify.
> >
> > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > which will always CLEAR. However gup only calls split_huge_pmd_address() if it
> > finds a thp pmd. In follow_pmd_mask() we have:
> >
> >       if (likely(!pmd_trans_huge(pmdval)))
> >               return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> >
> > So I don't think we have a problem here.
> 
> Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, right?  I
> mean, if it's a thp for the current mm, afaict pmd_trans_huge() should return
> true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
> and do the split, then the CLEAR notify.  Hmm.. Did I miss something?

That seems correct - if the thp is not mapped with a pmd we won't split and we
won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that case it
is fine - we will retry, but the retry will won't CLEAR because the pmd has
already been split.

The issue arises with doing it unconditionally in make device exclusive is that
you *always* CLEAR even if there is no thp pmd to split. Or at least that's my
understanding, please let me know if it doesn't make sense.

 - Alistair

> --
> Peter Xu
>
Peter Xu June 11, 2021, 1 a.m. UTC | #9
On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > External email: Use caution opening links or attachments
> > 
> > 
> > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to explicit
> > > > call split_huge_pmd_address(), afaict.  Since both of them use __split_huge_pmd()
> > > > internally which will generate that unwanted CLEAR notify.
> > >
> > > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > > which will always CLEAR. However gup only calls split_huge_pmd_address() if it
> > > finds a thp pmd. In follow_pmd_mask() we have:
> > >
> > >       if (likely(!pmd_trans_huge(pmdval)))
> > >               return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > >
> > > So I don't think we have a problem here.
> > 
> > Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, right?  I
> > mean, if it's a thp for the current mm, afaict pmd_trans_huge() should return
> > true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
> > and do the split, then the CLEAR notify.  Hmm.. Did I miss something?
> 
> That seems correct - if the thp is not mapped with a pmd we won't split and we
> won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that case it
> is fine - we will retry, but the retry will won't CLEAR because the pmd has
> already been split.

Aha!

> 
> The issue arises with doing it unconditionally in make device exclusive is that
> you *always* CLEAR even if there is no thp pmd to split. Or at least that's my
> understanding, please let me know if it doesn't make sense.

Exactly.  But if you see what I meant here, even if it can work like this, it
sounds still fragile, isn't it?  I just feel something is slightly off there..

IMHO split_huge_pmd() checked pmd before calling __split_huge_pmd() for
performance, afaict, because if it's not a thp even without locking, then it
won't be, so further __split_huge_pmd() is not necessary.

IOW, it's very legal if someday we'd like to let split_huge_pmd() call
__split_huge_pmd() directly, then AFAIU device exclusive API will be the 1st
one to be broken with that seems-to-be-irrelevant change I'm afraid..

This lets me goes back a step to think about why do we need this notifier at
all to cover this whole range of make_device_exclusive() procedure..

What I am thinking is, we're afraid some CPU accesses this page so the pte got
quickly restored when device atomic operation is carrying on.  Then with this
notifier we'll be able to cancel it.  Makes perfect sense.

However do we really need to register this notifier so early?  The thing is the
GPU driver still has all the page locks, so even if there's a race to restore
the ptes, they'll block at taking the page lock until the driver releases it.

IOW, I'm wondering whether the "non-fragile" way to do this is not do
mmu_interval_notifier_insert() that early: what if we register that notifier
after make_device_exclusive_range() returns but before page_unlock() somehow?
So before page_unlock(), race is protected fully by the lock itself; after
that, it's done by mmu notifier.  Then maybe we don't need to worry about all
these notifications during marking exclusive (while we shouldn't)?

Sorry in advance if I overlooked anything as I know little on device side (even
less than mm itself).  Also sorry to know that this series got marked
to-be-update in -mm; hopefully it'll still land soon even if it still needs
some rebase to other more important bugfixes - I definitely jumped in too late
even if to mess this all up. :-)
Alistair Popple June 11, 2021, 3:43 a.m. UTC | #10
On Friday, 11 June 2021 11:00:34 AM AEST Peter Xu wrote:
> On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> > On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to explicit
> > > > > call split_huge_pmd_address(), afaict.  Since both of them use __split_huge_pmd()
> > > > > internally which will generate that unwanted CLEAR notify.
> > > >
> > > > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > > > which will always CLEAR. However gup only calls split_huge_pmd_address() if it
> > > > finds a thp pmd. In follow_pmd_mask() we have:
> > > >
> > > >       if (likely(!pmd_trans_huge(pmdval)))
> > > >               return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > > >
> > > > So I don't think we have a problem here.
> > >
> > > Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, right?  I
> > > mean, if it's a thp for the current mm, afaict pmd_trans_huge() should return
> > > true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
> > > and do the split, then the CLEAR notify.  Hmm.. Did I miss something?
> >
> > That seems correct - if the thp is not mapped with a pmd we won't split and we
> > won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that case it
> > is fine - we will retry, but the retry will won't CLEAR because the pmd has
> > already been split.
> 
> Aha!
> 
> >
> > The issue arises with doing it unconditionally in make device exclusive is that
> > you *always* CLEAR even if there is no thp pmd to split. Or at least that's my
> > understanding, please let me know if it doesn't make sense.
> 
> Exactly.  But if you see what I meant here, even if it can work like this, it
> sounds still fragile, isn't it?  I just feel something is slightly off there..
> 
> IMHO split_huge_pmd() checked pmd before calling __split_huge_pmd() for
> performance, afaict, because if it's not a thp even without locking, then it
> won't be, so further __split_huge_pmd() is not necessary.
> 
> IOW, it's very legal if someday we'd like to let split_huge_pmd() call
> __split_huge_pmd() directly, then AFAIU device exclusive API will be the 1st
> one to be broken with that seems-to-be-irrelevant change I'm afraid..

Well I would argue the performance of memory notifiers is becoming increasingly
important, and a change that causes them to be called unnecessarily is
therefore not very legal. Likely the correct fix here is to optimise
__split_huge_pmd() to only call the notifier if it's actually going to split a
pmd. As you said though that's a completely different story which I think would
be best done as a separate series.

> This lets me goes back a step to think about why do we need this notifier at
> all to cover this whole range of make_device_exclusive() procedure..
> 
> What I am thinking is, we're afraid some CPU accesses this page so the pte got
> quickly restored when device atomic operation is carrying on.  Then with this
> notifier we'll be able to cancel it.  Makes perfect sense.
> 
> However do we really need to register this notifier so early?  The thing is the
> GPU driver still has all the page locks, so even if there's a race to restore
> the ptes, they'll block at taking the page lock until the driver releases it.
> 
> IOW, I'm wondering whether the "non-fragile" way to do this is not do
> mmu_interval_notifier_insert() that early: what if we register that notifier
> after make_device_exclusive_range() returns but before page_unlock() somehow?
> So before page_unlock(), race is protected fully by the lock itself; after
> that, it's done by mmu notifier.  Then maybe we don't need to worry about all
> these notifications during marking exclusive (while we shouldn't)?

The notifier is needed to protect against races with pte changes. Once a page
has been marked for exclusive access the driver will update it's page tables to
allow atomic access to the page. However in the meantime the page could become
unmapped entirely or write protected.

As I understand things the page lock won't protect against these kind of pte
changes, hence the need for mmu_interval_read_begin/retry which allows the
driver to hold a mutex protecting against invalidations via blocking the
notifier until the device page tables have been updated.

> Sorry in advance if I overlooked anything as I know little on device side (even
> less than mm itself).  Also sorry to know that this series got marked
> to-be-update in -mm; hopefully it'll still land soon even if it still needs
> some rebase to other more important bugfixes - I definitely jumped in too late
> even if to mess this all up. :-)

I was thinking that was probably coming anyway, but I'm still hoping it will be
just a rebase on Hugh's work which wasn't too bad last time I tried it :-)

> --
> Peter Xu
>
Peter Xu June 11, 2021, 3:01 p.m. UTC | #11
On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:
> On Friday, 11 June 2021 11:00:34 AM AEST Peter Xu wrote:
> > On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> > > On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > > > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > > > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to explicit
> > > > > > call split_huge_pmd_address(), afaict.  Since both of them use __split_huge_pmd()
> > > > > > internally which will generate that unwanted CLEAR notify.
> > > > >
> > > > > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > > > > which will always CLEAR. However gup only calls split_huge_pmd_address() if it
> > > > > finds a thp pmd. In follow_pmd_mask() we have:
> > > > >
> > > > >       if (likely(!pmd_trans_huge(pmdval)))
> > > > >               return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > > > >
> > > > > So I don't think we have a problem here.
> > > >
> > > > Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, right?  I
> > > > mean, if it's a thp for the current mm, afaict pmd_trans_huge() should return
> > > > true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
> > > > and do the split, then the CLEAR notify.  Hmm.. Did I miss something?
> > >
> > > That seems correct - if the thp is not mapped with a pmd we won't split and we
> > > won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that case it
> > > is fine - we will retry, but the retry will won't CLEAR because the pmd has
> > > already been split.
> > 
> > Aha!
> > 
> > >
> > > The issue arises with doing it unconditionally in make device exclusive is that
> > > you *always* CLEAR even if there is no thp pmd to split. Or at least that's my
> > > understanding, please let me know if it doesn't make sense.
> > 
> > Exactly.  But if you see what I meant here, even if it can work like this, it
> > sounds still fragile, isn't it?  I just feel something is slightly off there..
> > 
> > IMHO split_huge_pmd() checked pmd before calling __split_huge_pmd() for
> > performance, afaict, because if it's not a thp even without locking, then it
> > won't be, so further __split_huge_pmd() is not necessary.
> > 
> > IOW, it's very legal if someday we'd like to let split_huge_pmd() call
> > __split_huge_pmd() directly, then AFAIU device exclusive API will be the 1st
> > one to be broken with that seems-to-be-irrelevant change I'm afraid..
> 
> Well I would argue the performance of memory notifiers is becoming increasingly
> important, and a change that causes them to be called unnecessarily is
> therefore not very legal. Likely the correct fix here is to optimise
> __split_huge_pmd() to only call the notifier if it's actually going to split a
> pmd. As you said though that's a completely different story which I think would
> be best done as a separate series.

Right, maybe I can look a bit more into that later; but my whole point was to
express that one functionality shouldn't depend on such a trivial detail of
implementation of other modules (thp split in this case).

> 
> > This lets me goes back a step to think about why do we need this notifier at
> > all to cover this whole range of make_device_exclusive() procedure..
> > 
> > What I am thinking is, we're afraid some CPU accesses this page so the pte got
> > quickly restored when device atomic operation is carrying on.  Then with this
> > notifier we'll be able to cancel it.  Makes perfect sense.
> > 
> > However do we really need to register this notifier so early?  The thing is the
> > GPU driver still has all the page locks, so even if there's a race to restore
> > the ptes, they'll block at taking the page lock until the driver releases it.
> > 
> > IOW, I'm wondering whether the "non-fragile" way to do this is not do
> > mmu_interval_notifier_insert() that early: what if we register that notifier
> > after make_device_exclusive_range() returns but before page_unlock() somehow?
> > So before page_unlock(), race is protected fully by the lock itself; after
> > that, it's done by mmu notifier.  Then maybe we don't need to worry about all
> > these notifications during marking exclusive (while we shouldn't)?
> 
> The notifier is needed to protect against races with pte changes. Once a page
> has been marked for exclusive access the driver will update it's page tables to
> allow atomic access to the page. However in the meantime the page could become
> unmapped entirely or write protected.
> 
> As I understand things the page lock won't protect against these kind of pte
> changes, hence the need for mmu_interval_read_begin/retry which allows the
> driver to hold a mutex protecting against invalidations via blocking the
> notifier until the device page tables have been updated.

Indeed, I suppose you mean change_pte_range() and zap_pte_range()
correspondingly.

Do you think we can restore pte right before wr-protect or zap?  Then all
things serializes with page lock (btw: it's already an insane userspace to
either unmap a page or wr-protect a page if it knows the device is using it!).
If these are the only two cases, it still sounds a cleaner approach to me than
the current approach.

This also reminded me that right now the cpu pgtable recovery is lazy - it
happens either from fork() or a cpu page fault.  Even after device finished
using it, swap ptes keep there.

What if the device tries to do atomic op on the same page twice?  I am not sure
whether it means we may also want to teach both GUP (majorly follow_page_pte()
for now before pmd support) and process of page_make_device_exclusive() with
understanding the device exclusive entries too?  Another option seems to be
restoring pte after device finish using it, as long as the device knows when.
Alistair Popple June 15, 2021, 3:08 a.m. UTC | #12
On Saturday, 12 June 2021 1:01:42 AM AEST Peter Xu wrote:
> On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:
> > On Friday, 11 June 2021 11:00:34 AM AEST Peter Xu wrote:
> > > On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> > > > On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > > > > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > > > > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to explicit
> > > > > > > call split_huge_pmd_address(), afaict.  Since both of them use __split_huge_pmd()
> > > > > > > internally which will generate that unwanted CLEAR notify.
> > > > > >
> > > > > > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > > > > > which will always CLEAR. However gup only calls split_huge_pmd_address() if it
> > > > > > finds a thp pmd. In follow_pmd_mask() we have:
> > > > > >
> > > > > >       if (likely(!pmd_trans_huge(pmdval)))
> > > > > >               return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > > > > >
> > > > > > So I don't think we have a problem here.
> > > > >
> > > > > Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, right?  I
> > > > > mean, if it's a thp for the current mm, afaict pmd_trans_huge() should return
> > > > > true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
> > > > > and do the split, then the CLEAR notify.  Hmm.. Did I miss something?
> > > >
> > > > That seems correct - if the thp is not mapped with a pmd we won't split and we
> > > > won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that case it
> > > > is fine - we will retry, but the retry will won't CLEAR because the pmd has
> > > > already been split.
> > >
> > > Aha!
> > >
> > > >
> > > > The issue arises with doing it unconditionally in make device exclusive is that
> > > > you *always* CLEAR even if there is no thp pmd to split. Or at least that's my
> > > > understanding, please let me know if it doesn't make sense.
> > >
> > > Exactly.  But if you see what I meant here, even if it can work like this, it
> > > sounds still fragile, isn't it?  I just feel something is slightly off there..
> > >
> > > IMHO split_huge_pmd() checked pmd before calling __split_huge_pmd() for
> > > performance, afaict, because if it's not a thp even without locking, then it
> > > won't be, so further __split_huge_pmd() is not necessary.
> > >
> > > IOW, it's very legal if someday we'd like to let split_huge_pmd() call
> > > __split_huge_pmd() directly, then AFAIU device exclusive API will be the 1st
> > > one to be broken with that seems-to-be-irrelevant change I'm afraid..
> >
> > Well I would argue the performance of memory notifiers is becoming increasingly
> > important, and a change that causes them to be called unnecessarily is
> > therefore not very legal. Likely the correct fix here is to optimise
> > __split_huge_pmd() to only call the notifier if it's actually going to split a
> > pmd. As you said though that's a completely different story which I think would
> > be best done as a separate series.
> 
> Right, maybe I can look a bit more into that later; but my whole point was to
> express that one functionality shouldn't depend on such a trivial detail of
> implementation of other modules (thp split in this case).
> 
> >
> > > This lets me goes back a step to think about why do we need this notifier at
> > > all to cover this whole range of make_device_exclusive() procedure..
> > >
> > > What I am thinking is, we're afraid some CPU accesses this page so the pte got
> > > quickly restored when device atomic operation is carrying on.  Then with this
> > > notifier we'll be able to cancel it.  Makes perfect sense.
> > >
> > > However do we really need to register this notifier so early?  The thing is the
> > > GPU driver still has all the page locks, so even if there's a race to restore
> > > the ptes, they'll block at taking the page lock until the driver releases it.
> > >
> > > IOW, I'm wondering whether the "non-fragile" way to do this is not do
> > > mmu_interval_notifier_insert() that early: what if we register that notifier
> > > after make_device_exclusive_range() returns but before page_unlock() somehow?
> > > So before page_unlock(), race is protected fully by the lock itself; after
> > > that, it's done by mmu notifier.  Then maybe we don't need to worry about all
> > > these notifications during marking exclusive (while we shouldn't)?
> >
> > The notifier is needed to protect against races with pte changes. Once a page
> > has been marked for exclusive access the driver will update it's page tables to
> > allow atomic access to the page. However in the meantime the page could become
> > unmapped entirely or write protected.
> >
> > As I understand things the page lock won't protect against these kind of pte
> > changes, hence the need for mmu_interval_read_begin/retry which allows the
> > driver to hold a mutex protecting against invalidations via blocking the
> > notifier until the device page tables have been updated.
> 
> Indeed, I suppose you mean change_pte_range() and zap_pte_range()
> correspondingly.

Right.

> Do you think we can restore pte right before wr-protect or zap?  Then all
> things serializes with page lock (btw: it's already an insane userspace to
> either unmap a page or wr-protect a page if it knows the device is using it!).
> If these are the only two cases, it still sounds a cleaner approach to me than
> the current approach.

Perhaps we could but it would make {zap|change}_pte_range() much more complex as
we can't sleep taking the page lock whilst holding the ptl, so we'd have to
implement a retry scheme similar to copy_pte_range() in both those functions as
well. Given mmu_interval_read_begin/retry was IMHO added to solve this type of
problem (freezing pte's to safely program device pte's) it seems like the
better option rather than adding more complex code to generic mm paths.

It's also worth noting i915 seems to use mmu_interval_read_begin/retry() with
gup to sync mappings so this isn't an entirely new concept. I'm not an expert
in that driver but I imagine changing gup to generate unconditional mmu notifier
invalidates would also cause issues there. So I think overall this is the
cleanest solution as it reduces the amount of code (particularly in generic mm
paths).

> This also reminded me that right now the cpu pgtable recovery is lazy - it
> happens either from fork() or a cpu page fault.  Even after device finished
> using it, swap ptes keep there.
> 
> What if the device tries to do atomic op on the same page twice?  I am not sure
> whether it means we may also want to teach both GUP (majorly follow_page_pte()
> for now before pmd support) and process of page_make_device_exclusive() with
> understanding the device exclusive entries too?  Another option seems to be
> restoring pte after device finish using it, as long as the device knows when.

I don't think we need to complicate follow_page_pte() with knowledge of
exclusive entries. GUP will just restore the original pte via the normal
fault path - follow_page_pte() will return NULL for an exclusive entry,
resulting in handle_mm_path() getting called via faultin_page(). Therefore
a driver calling make_device_exclusive() twice on the same page won't cause an
issue. Also the device shouldn't fault on subsequent accesses if the exclusive
entry is still in place anyway.

We can't restore the pte when the device is finished with it because there is
no way of knowing when a device is done using an exclusive entry - device
pte's work much the same as cpu pte's in that regard.

 - Alistair

> --
> Peter Xu
>
Peter Xu June 15, 2021, 4:25 p.m. UTC | #13
On Tue, Jun 15, 2021 at 01:08:11PM +1000, Alistair Popple wrote:
> On Saturday, 12 June 2021 1:01:42 AM AEST Peter Xu wrote:
> > On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:
> > > On Friday, 11 June 2021 11:00:34 AM AEST Peter Xu wrote:
> > > > On Fri, Jun 11, 2021 at 09:17:14AM +1000, Alistair Popple wrote:
> > > > > On Friday, 11 June 2021 9:04:19 AM AEST Peter Xu wrote:
> > > > > > On Fri, Jun 11, 2021 at 12:21:26AM +1000, Alistair Popple wrote:
> > > > > > > > Hmm, the thing is.. to me FOLL_SPLIT_PMD should have similar effect to explicit
> > > > > > > > call split_huge_pmd_address(), afaict.  Since both of them use __split_huge_pmd()
> > > > > > > > internally which will generate that unwanted CLEAR notify.
> > > > > > >
> > > > > > > Agree that gup calls __split_huge_pmd() via split_huge_pmd_address()
> > > > > > > which will always CLEAR. However gup only calls split_huge_pmd_address() if it
> > > > > > > finds a thp pmd. In follow_pmd_mask() we have:
> > > > > > >
> > > > > > >       if (likely(!pmd_trans_huge(pmdval)))
> > > > > > >               return follow_page_pte(vma, address, pmd, flags, &ctx->pgmap);
> > > > > > >
> > > > > > > So I don't think we have a problem here.
> > > > > >
> > > > > > Sorry I didn't follow here..  We do FOLL_SPLIT_PMD after this check, right?  I
> > > > > > mean, if it's a thp for the current mm, afaict pmd_trans_huge() should return
> > > > > > true above, so we'll skip follow_page_pte(); then we'll check FOLL_SPLIT_PMD
> > > > > > and do the split, then the CLEAR notify.  Hmm.. Did I miss something?
> > > > >
> > > > > That seems correct - if the thp is not mapped with a pmd we won't split and we
> > > > > won't CLEAR. If there is a thp pmd we will split and CLEAR, but in that case it
> > > > > is fine - we will retry, but the retry will won't CLEAR because the pmd has
> > > > > already been split.
> > > >
> > > > Aha!
> > > >
> > > > >
> > > > > The issue arises with doing it unconditionally in make device exclusive is that
> > > > > you *always* CLEAR even if there is no thp pmd to split. Or at least that's my
> > > > > understanding, please let me know if it doesn't make sense.
> > > >
> > > > Exactly.  But if you see what I meant here, even if it can work like this, it
> > > > sounds still fragile, isn't it?  I just feel something is slightly off there..
> > > >
> > > > IMHO split_huge_pmd() checked pmd before calling __split_huge_pmd() for
> > > > performance, afaict, because if it's not a thp even without locking, then it
> > > > won't be, so further __split_huge_pmd() is not necessary.
> > > >
> > > > IOW, it's very legal if someday we'd like to let split_huge_pmd() call
> > > > __split_huge_pmd() directly, then AFAIU device exclusive API will be the 1st
> > > > one to be broken with that seems-to-be-irrelevant change I'm afraid..
> > >
> > > Well I would argue the performance of memory notifiers is becoming increasingly
> > > important, and a change that causes them to be called unnecessarily is
> > > therefore not very legal. Likely the correct fix here is to optimise
> > > __split_huge_pmd() to only call the notifier if it's actually going to split a
> > > pmd. As you said though that's a completely different story which I think would
> > > be best done as a separate series.
> > 
> > Right, maybe I can look a bit more into that later; but my whole point was to
> > express that one functionality shouldn't depend on such a trivial detail of
> > implementation of other modules (thp split in this case).
> > 
> > >
> > > > This lets me goes back a step to think about why do we need this notifier at
> > > > all to cover this whole range of make_device_exclusive() procedure..
> > > >
> > > > What I am thinking is, we're afraid some CPU accesses this page so the pte got
> > > > quickly restored when device atomic operation is carrying on.  Then with this
> > > > notifier we'll be able to cancel it.  Makes perfect sense.
> > > >
> > > > However do we really need to register this notifier so early?  The thing is the
> > > > GPU driver still has all the page locks, so even if there's a race to restore
> > > > the ptes, they'll block at taking the page lock until the driver releases it.
> > > >
> > > > IOW, I'm wondering whether the "non-fragile" way to do this is not do
> > > > mmu_interval_notifier_insert() that early: what if we register that notifier
> > > > after make_device_exclusive_range() returns but before page_unlock() somehow?
> > > > So before page_unlock(), race is protected fully by the lock itself; after
> > > > that, it's done by mmu notifier.  Then maybe we don't need to worry about all
> > > > these notifications during marking exclusive (while we shouldn't)?
> > >
> > > The notifier is needed to protect against races with pte changes. Once a page
> > > has been marked for exclusive access the driver will update it's page tables to
> > > allow atomic access to the page. However in the meantime the page could become
> > > unmapped entirely or write protected.
> > >
> > > As I understand things the page lock won't protect against these kind of pte
> > > changes, hence the need for mmu_interval_read_begin/retry which allows the
> > > driver to hold a mutex protecting against invalidations via blocking the
> > > notifier until the device page tables have been updated.
> > 
> > Indeed, I suppose you mean change_pte_range() and zap_pte_range()
> > correspondingly.
> 
> Right.
> 
> > Do you think we can restore pte right before wr-protect or zap?  Then all
> > things serializes with page lock (btw: it's already an insane userspace to
> > either unmap a page or wr-protect a page if it knows the device is using it!).
> > If these are the only two cases, it still sounds a cleaner approach to me than
> > the current approach.
> 
> Perhaps we could but it would make {zap|change}_pte_range() much more complex as
> we can't sleep taking the page lock whilst holding the ptl, so we'd have to
> implement a retry scheme similar to copy_pte_range() in both those functions as
> well.

Yes, but shouldn't be hard to do so, imho. E.g., see when __tlb_remove_page()
returns true in zap_pte_range(), so we already did something like that.  IMHO
it's not uncommon to have such facilities as we do have requirements to sleep
during a spinlock critical section for a lot of places in mm, so we release
them when needed and retake.

> Given mmu_interval_read_begin/retry was IMHO added to solve this type of
> problem (freezing pte's to safely program device pte's) it seems like the
> better option rather than adding more complex code to generic mm paths.
> 
> It's also worth noting i915 seems to use mmu_interval_read_begin/retry() with
> gup to sync mappings so this isn't an entirely new concept. I'm not an expert
> in that driver but I imagine changing gup to generate unconditional mmu notifier
> invalidates would also cause issues there. So I think overall this is the
> cleanest solution as it reduces the amount of code (particularly in generic mm
> paths).

I could be wrong somewhere, but to me depending on mmu notifiers being
"accurate" in general is fragile..

Take an example of change_pte_range(), which will generate PROTECTION_VMA
notifies.  Let's imaging an userspace calls mprotect() e.g. twice or even more
times with the same PROT_* and upon the same region, we know very possibly the
2nd,3rd,... calls will generate those notifies with totally no change to the
pgtable at all as they're all done on the 1st shot.  However we'll generate mmu
notifies anyways for the 2nd,3rd,... calls.  It means mmu notifiers should
really be tolerant of false positives as it does happen, and such thing can be
triggered even from userspace system calls very easily like this.  That's why I
think any kernel facility that depends on mmu notifiers being accurate is
probably not the right approach..

But yeah as you said I think it's working as is with the series (I think the
follow_pmd_mask() checking pmd_trans_huge before calling split_huge_pmd is a
double safety-net for it, so even if the GUP split_huge_pmd got replaced with
__split_huge_pmd it should still work with the one-retry logic), not sure
whether it matters a lot, as it's not common mm path; I think I'll step back so
Andrew could still pick it up as wish, I'm just still not fully convinced it's
the best solution to have for a long term to depend on that..

> 
> > This also reminded me that right now the cpu pgtable recovery is lazy - it
> > happens either from fork() or a cpu page fault.  Even after device finished
> > using it, swap ptes keep there.
> > 
> > What if the device tries to do atomic op on the same page twice?  I am not sure
> > whether it means we may also want to teach both GUP (majorly follow_page_pte()
> > for now before pmd support) and process of page_make_device_exclusive() with
> > understanding the device exclusive entries too?  Another option seems to be
> > restoring pte after device finish using it, as long as the device knows when.
> 
> I don't think we need to complicate follow_page_pte() with knowledge of
> exclusive entries. GUP will just restore the original pte via the normal
> fault path - follow_page_pte() will return NULL for an exclusive entry,
> resulting in handle_mm_path() getting called via faultin_page(). Therefore
> a driver calling make_device_exclusive() twice on the same page won't cause an
> issue. Also the device shouldn't fault on subsequent accesses if the exclusive
> entry is still in place anyway.

Right, looks good then.

> 
> We can't restore the pte when the device is finished with it because there is
> no way of knowing when a device is done using an exclusive entry - device
> pte's work much the same as cpu pte's in that regard.

I see, I feel like I understand how it works slightly better now, thanks.

One last pure question: I see nouveau_atomic_range_fault() will call the other
nvif_object_ioctl() which seems to do the device pgtable mapping, am I right?
Then I see the notifier is quickly removed before nouveau_atomic_range_fault()
returns.  What happens if CPU access happens after mmu notifier removed?  Or is
it not possible to happen?
Alistair Popple June 16, 2021, 2:47 a.m. UTC | #14
On Wednesday, 16 June 2021 2:25:09 AM AEST Peter Xu wrote:
> On Tue, Jun 15, 2021 at 01:08:11PM +1000, Alistair Popple wrote:
> > On Saturday, 12 June 2021 1:01:42 AM AEST Peter Xu wrote:
> > > On Fri, Jun 11, 2021 at 01:43:20PM +1000, Alistair Popple wrote:

[...]

> > > Do you think we can restore pte right before wr-protect or zap?  Then all
> > > things serializes with page lock (btw: it's already an insane userspace to
> > > either unmap a page or wr-protect a page if it knows the device is using it!).
> > > If these are the only two cases, it still sounds a cleaner approach to me than
> > > the current approach.
> >
> > Perhaps we could but it would make {zap|change}_pte_range() much more complex as
> > we can't sleep taking the page lock whilst holding the ptl, so we'd have to
> > implement a retry scheme similar to copy_pte_range() in both those functions as
> > well.
> 
> Yes, but shouldn't be hard to do so, imho. E.g., see when __tlb_remove_page()
> returns true in zap_pte_range(), so we already did something like that.  IMHO
> it's not uncommon to have such facilities as we do have requirements to sleep
> during a spinlock critical section for a lot of places in mm, so we release
> them when needed and retake.

Agreed, it's not hard to do and it's a common enough pattern. However we decided
that for such a specific application this (trying to take the lock or drop locks
and retry) was too complex for copy_pte_range() so it seems like the same should
apply here.

Admittedly copy_pte_range() already had several other retry paths so perhaps
it was adding yet another that made it relatively more complex. Overall I have
been trying to minimise the impact on core mm code for this feature, and adding
this pattern to zap_pte_range(), etc. would make it more complex for any future
addition that requires locks to be dropped and retried so I guess in that sense
it is no different.

> > Given mmu_interval_read_begin/retry was IMHO added to solve this type of
> > problem (freezing pte's to safely program device pte's) it seems like the
> > better option rather than adding more complex code to generic mm paths.
> >
> > It's also worth noting i915 seems to use mmu_interval_read_begin/retry() with
> > gup to sync mappings so this isn't an entirely new concept. I'm not an expert
> > in that driver but I imagine changing gup to generate unconditional mmu notifier
> > invalidates would also cause issues there. So I think overall this is the
> > cleanest solution as it reduces the amount of code (particularly in generic mm
> > paths).
> 
> I could be wrong somewhere, but to me depending on mmu notifiers being
> "accurate" in general is fragile..
> 
> Take an example of change_pte_range(), which will generate PROTECTION_VMA
> notifies.  Let's imaging an userspace calls mprotect() e.g. twice or even more
> times with the same PROT_* and upon the same region, we know very possibly the
> 2nd,3rd,... calls will generate those notifies with totally no change to the
> pgtable at all as they're all done on the 1st shot.  However we'll generate mmu
> notifies anyways for the 2nd,3rd,... calls.  It means mmu notifiers should
> really be tolerant of false positives as it does happen, and such thing can be
> triggered even from userspace system calls very easily like this.  That's why I
> think any kernel facility that depends on mmu notifiers being accurate is
> probably not the right approach..

Argh, thanks. I was focused on the specifics of this series but I think I
understand your point better now - that as a more general principle we can't
assume notifiers are accurate.

> But yeah as you said I think it's working as is with the series (I think the
> follow_pmd_mask() checking pmd_trans_huge before calling split_huge_pmd is a
> double safety-net for it, so even if the GUP split_huge_pmd got replaced with
> __split_huge_pmd it should still work with the one-retry logic), not sure
> whether it matters a lot, as it's not common mm path; I think I'll step back so
> Andrew could still pick it up as wish, I'm just still not fully convinced it's
> the best solution to have for a long term to depend on that..

Ok, thanks. I guess you have somewhat convinced me - depending on it for the
long term might be a bit fragile. However as you say the current implementation
does work and I am starting to look at support for PMD and file backed pages
which require changes here anyway. So I am hoping Andrew can still take this
(once rebased) as it would be easier for me to do those changes if the basic
support and clean ups were already in place.

> > > This also reminded me that right now the cpu pgtable recovery is lazy - it
> > > happens either from fork() or a cpu page fault.  Even after device finished
> > > using it, swap ptes keep there.
> > >
> > > What if the device tries to do atomic op on the same page twice?  I am not sure
> > > whether it means we may also want to teach both GUP (majorly follow_page_pte()
> > > for now before pmd support) and process of page_make_device_exclusive() with
> > > understanding the device exclusive entries too?  Another option seems to be
> > > restoring pte after device finish using it, as long as the device knows when.
> >
> > I don't think we need to complicate follow_page_pte() with knowledge of
> > exclusive entries. GUP will just restore the original pte via the normal
> > fault path - follow_page_pte() will return NULL for an exclusive entry,
> > resulting in handle_mm_path() getting called via faultin_page(). Therefore
> > a driver calling make_device_exclusive() twice on the same page won't cause an
> > issue. Also the device shouldn't fault on subsequent accesses if the exclusive
> > entry is still in place anyway.
> 
> Right, looks good then.
> 
> >
> > We can't restore the pte when the device is finished with it because there is
> > no way of knowing when a device is done using an exclusive entry - device
> > pte's work much the same as cpu pte's in that regard.
> 
> I see, I feel like I understand how it works slightly better now, thanks.

Feel free to ask if there are any more details you want, but there's nothing too
magical going on here.

> One last pure question: I see nouveau_atomic_range_fault() will call the other
> nvif_object_ioctl() which seems to do the device pgtable mapping, am I right?

Correct - that installs the page table mapping on the GPU.

> Then I see the notifier is quickly removed before nouveau_atomic_range_fault()
> returns.  What happens if CPU access happens after mmu notifier removed?  Or is
> it not possible to happen?

So there are two notifiers registered - this one and a process wide notifier
(see nouveau_mn_ops). In this case the process wide notifier will get called
to invalidate the access when the CPU fault removes the device exclusive
entries.

 - Alistair

> --
> Peter Xu
>
diff mbox series

Patch

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 3df79307a797..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -405,6 +405,23 @@  between device driver specific code and shared common code:
 
    The lock can now be released.
 
+Exclusive access memory
+=======================
+
+Some devices have features such as atomic PTE bits that can be used to implement
+atomic access to system memory. To support atomic operations to a shared virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resovled by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
 Memory cgroup (memcg) and rss accounting
 ========================================
 
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 8e428eb813b8..6692da8d121d 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -42,6 +42,11 @@  struct mmu_interval_notifier;
  * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
  * a device driver to possibly ignore the invalidation if the
  * owner field matches the driver's device private pgmap owner.
+ *
+ * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
+ * longer have exclusive access to the page. When sent during creation of an
+ * exclusive range the owner will be initialised to the value provided by the
+ * caller of make_device_exclusive_range(), otherwise the owner will be NULL.
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
@@ -51,6 +56,7 @@  enum mmu_notifier_event {
 	MMU_NOTIFY_SOFT_DIRTY,
 	MMU_NOTIFY_RELEASE,
 	MMU_NOTIFY_MIGRATE,
+	MMU_NOTIFY_EXCLUSIVE,
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 0e25d829f742..3a1ce4ef9276 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -193,6 +193,10 @@  int page_referenced(struct page *, int is_locked,
 bool try_to_migrate(struct page *page, enum ttu_flags flags);
 bool try_to_unmap(struct page *, enum ttu_flags flags);
 
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, struct page **pages,
+				void *arg);
+
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
 /* Look for migarion entries rather than present PTEs */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index a6d4505ecf73..a002029130d0 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -62,12 +62,17 @@  static inline int current_is_kswapd(void)
  * migrate part of a process memory to device memory.
  *
  * When a page is migrated from CPU to device, we set the CPU page table entry
- * to a special SWP_DEVICE_* entry.
+ * to a special SWP_DEVICE_{READ|WRITE} entry.
+ *
+ * When a page is mapped by the device for exclusive access we set the CPU page
+ * table entries to special SWP_DEVICE_EXCLUSIVE_* entries.
  */
 #ifdef CONFIG_DEVICE_PRIVATE
-#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_NUM 4
 #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
 #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
+#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
+#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
 #else
 #define SWP_DEVICE_NUM 0
 #endif
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4dfd807ae52a..4129bd2ff9d6 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -120,6 +120,27 @@  static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
+		swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
+}
 #else /* CONFIG_DEVICE_PRIVATE */
 static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
@@ -140,6 +161,26 @@  static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return false;
 }
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+	return false;
+}
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
@@ -219,7 +260,8 @@  static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
  */
 static inline bool is_pfn_swap_entry(swp_entry_t entry)
 {
-	return is_migration_entry(entry) || is_device_private_entry(entry);
+	return is_migration_entry(entry) || is_device_private_entry(entry) ||
+	       is_device_exclusive_entry(entry);
 }
 
 struct page_vma_mapped_walk;
diff --git a/mm/hmm.c b/mm/hmm.c
index 11df3ca30b82..fad6be2bf072 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -26,6 +26,8 @@ 
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
 
+#include "internal.h"
+
 struct hmm_vma_walk {
 	struct hmm_range	*range;
 	unsigned long		last;
@@ -271,6 +273,9 @@  static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		if (!non_swap_entry(entry))
 			goto fault;
 
+		if (is_device_exclusive_entry(entry))
+			goto fault;
+
 		if (is_migration_entry(entry)) {
 			pte_unmap(ptep);
 			hmm_vma_walk->last = addr;
diff --git a/mm/memory.c b/mm/memory.c
index 0982cab37ecb..426e05ad4fc6 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -700,6 +700,68 @@  struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
 }
 #endif
 
+static void restore_exclusive_pte(struct vm_area_struct *vma,
+				  struct page *page, unsigned long address,
+				  pte_t *ptep)
+{
+	pte_t pte;
+	swp_entry_t entry;
+
+	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
+	if (pte_swp_soft_dirty(*ptep))
+		pte = pte_mksoft_dirty(pte);
+
+	entry = pte_to_swp_entry(*ptep);
+	if (pte_swp_uffd_wp(*ptep))
+		pte = pte_mkuffd_wp(pte);
+	else if (is_writable_device_exclusive_entry(entry))
+		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+
+	set_pte_at(vma->vm_mm, address, ptep, pte);
+
+	/*
+	 * No need to take a page reference as one was already
+	 * created when the swap entry was made.
+	 */
+	if (PageAnon(page))
+		page_add_anon_rmap(page, vma, address, false);
+	else
+		/*
+		 * Currently device exclusive access only supports anonymous
+		 * memory so the entry shouldn't point to a filebacked page.
+		 */
+		WARN_ON_ONCE(!PageAnon(page));
+
+	if (vma->vm_flags & VM_LOCKED)
+		mlock_vma_page(page);
+
+	/*
+	 * No need to invalidate - it was non-present before. However
+	 * secondary CPUs may have mappings that need invalidating.
+	 */
+	update_mmu_cache(vma, address, ptep);
+}
+
+/*
+ * Tries to restore an exclusive pte if the page lock can be acquired without
+ * sleeping.
+ */
+static int
+try_restore_exclusive_pte(pte_t *src_pte, struct vm_area_struct *vma,
+			unsigned long addr)
+{
+	swp_entry_t entry = pte_to_swp_entry(*src_pte);
+	struct page *page = pfn_swap_entry_to_page(entry);
+
+	if (trylock_page(page)) {
+		restore_exclusive_pte(vma, page, addr, src_pte);
+		unlock_page(page);
+		return 0;
+	}
+
+	return -EBUSY;
+}
+
 /*
  * copy one vm_area from one task to the other. Assumes the page tables
  * already present in the new task to be cleared in the whole range
@@ -781,6 +843,17 @@  copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pte = pte_swp_mkuffd_wp(pte);
 			set_pte_at(src_mm, addr, src_pte, pte);
 		}
+	} else if (is_device_exclusive_entry(entry)) {
+		/*
+		 * Make device exclusive entries present by restoring the
+		 * original entry then copying as for a present pte. Device
+		 * exclusive entries currently only support private writable
+		 * (ie. COW) mappings.
+		 */
+		VM_BUG_ON(!is_cow_mapping(vma->vm_flags));
+		if (try_restore_exclusive_pte(src_pte, vma, addr))
+			return -EBUSY;
+		return -ENOENT;
 	}
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 	return 0;
@@ -980,9 +1053,18 @@  copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			if (ret == -EIO) {
 				entry = pte_to_swp_entry(*src_pte);
 				break;
+			} else if (ret == -EBUSY) {
+				break;
+			} else if (!ret) {
+				progress += 8;
+				continue;
 			}
-			progress += 8;
-			continue;
+
+			/*
+			 * Device exclusive entry restored, continue by copying
+			 * the now present pte.
+			 */
+			WARN_ON_ONCE(ret != -ENOENT);
 		}
 		/* copy_present_pte() will clear `*prealloc' if consumed */
 		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
@@ -1020,6 +1102,8 @@  copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			goto out;
 		}
 		entry.val = 0;
+	} else if (ret == -EBUSY) {
+		goto out;
 	} else if (ret ==  -EAGAIN) {
 		prealloc = page_copy_prealloc(src_mm, src_vma, addr);
 		if (!prealloc)
@@ -1287,7 +1371,8 @@  static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		}
 
 		entry = pte_to_swp_entry(ptent);
-		if (is_device_private_entry(entry)) {
+		if (is_device_private_entry(entry) ||
+		    is_device_exclusive_entry(entry)) {
 			struct page *page = pfn_swap_entry_to_page(entry);
 
 			if (unlikely(details && details->check_mapping)) {
@@ -1303,7 +1388,10 @@  static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+
+			if (is_device_private_entry(entry))
+				page_remove_rmap(page, false);
+
 			put_page(page);
 			continue;
 		}
@@ -3307,6 +3395,34 @@  void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+/*
+ * Restore a potential device exclusive pte to a working pte entry
+ */
+static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
+{
+	struct page *page = vmf->page;
+	struct vm_area_struct *vma = vmf->vma;
+	struct mmu_notifier_range range;
+
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+		return VM_FAULT_RETRY;
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
+				vma->vm_mm, vmf->address & PAGE_MASK,
+				(vmf->address & PAGE_MASK) + PAGE_SIZE, NULL);
+	mmu_notifier_invalidate_range_start(&range);
+
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
+				&vmf->ptl);
+	if (likely(pte_same(*vmf->pte, vmf->orig_pte)))
+		restore_exclusive_pte(vma, page, vmf->address, vmf->pte);
+
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+	unlock_page(page);
+
+	mmu_notifier_invalidate_range_end(&range);
+	return 0;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3334,6 +3450,9 @@  vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
+		} else if (is_device_exclusive_entry(entry)) {
+			vmf->page = pfn_swap_entry_to_page(entry);
+			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ee5961888e70..883e2cc85cad 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -165,6 +165,14 @@  static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
+			} else if (is_writable_device_exclusive_entry(entry)) {
+				entry = make_readable_device_exclusive_entry(
+							swp_offset(entry));
+				newpte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(oldpte))
+					newpte = pte_swp_mksoft_dirty(newpte);
+				if (pte_swp_uffd_wp(oldpte))
+					newpte = pte_swp_mkuffd_wp(newpte);
 			} else {
 				newpte = oldpte;
 			}
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index a6a7febb4d93..f535bcb4950c 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -41,7 +41,8 @@  static bool map_pte(struct page_vma_mapped_walk *pvmw)
 
 				/* Handle un-addressable ZONE_DEVICE memory */
 				entry = pte_to_swp_entry(*pvmw->pte);
-				if (!is_device_private_entry(entry))
+				if (!is_device_private_entry(entry) &&
+				    !is_device_exclusive_entry(entry))
 					return false;
 			} else if (!pte_present(*pvmw->pte))
 				return false;
@@ -93,7 +94,8 @@  static bool check_pte(struct page_vma_mapped_walk *pvmw)
 			return false;
 		entry = pte_to_swp_entry(*pvmw->pte);
 
-		if (!is_migration_entry(entry))
+		if (!is_migration_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 
 		pfn = swp_offset(entry);
@@ -102,7 +104,8 @@  static bool check_pte(struct page_vma_mapped_walk *pvmw)
 
 		/* Handle un-addressable ZONE_DEVICE memory */
 		entry = pte_to_swp_entry(*pvmw->pte);
-		if (!is_device_private_entry(entry))
+		if (!is_device_private_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 
 		pfn = swp_offset(entry);
diff --git a/mm/rmap.c b/mm/rmap.c
index be0450d905cd..0fb8c7389143 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2013,6 +2013,188 @@  void page_mlock(struct page *page)
 	rmap_walk(page, &rwc);
 }
 
+#ifdef CONFIG_DEVICE_PRIVATE
+struct make_exclusive_args {
+	struct mm_struct *mm;
+	unsigned long address;
+	void *owner;
+	bool valid;
+};
+
+static bool page_make_device_exclusive_one(struct page *page,
+		struct vm_area_struct *vma, unsigned long address, void *priv)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
+	struct make_exclusive_args *args = priv;
+	pte_t pteval;
+	struct page *subpage;
+	bool ret = true;
+	struct mmu_notifier_range range;
+	swp_entry_t entry;
+	pte_t swp_pte;
+
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
+				      vma->vm_mm, address, min(vma->vm_end,
+				      address + page_size(page)), args->owner);
+	mmu_notifier_invalidate_range_start(&range);
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		/* Unexpected PMD-mapped THP? */
+		VM_BUG_ON_PAGE(!pvmw.pte, page);
+
+		if (!pte_present(*pvmw.pte)) {
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		address = pvmw.address;
+
+		/* Nuke the page table entry. */
+		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		pteval = ptep_clear_flush(vma, address, pvmw.pte);
+
+		/* Move the dirty bit to the page. Now the pte is gone. */
+		if (pte_dirty(pteval))
+			set_page_dirty(page);
+
+		/*
+		 * Check that our target page is still mapped at the expected
+		 * address.
+		 */
+		if (args->mm == mm && args->address == address &&
+		    pte_write(pteval))
+			args->valid = true;
+
+		/*
+		 * Store the pfn of the page in a special migration
+		 * pte. do_swap_page() will wait until the migration
+		 * pte is removed and then restart fault handling.
+		 */
+		if (pte_write(pteval))
+			entry = make_writable_device_exclusive_entry(
+							page_to_pfn(subpage));
+		else
+			entry = make_readable_device_exclusive_entry(
+							page_to_pfn(subpage));
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		if (pte_uffd_wp(pteval))
+			swp_pte = pte_swp_mkuffd_wp(swp_pte);
+
+		set_pte_at(mm, address, pvmw.pte, swp_pte);
+
+		/*
+		 * There is a reference on the page for the swap entry which has
+		 * been removed, so shouldn't take another.
+		 */
+		page_remove_rmap(subpage, false);
+	}
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
+/**
+ * page_make_device_exclusive - mark the page exclusively owned by a device
+ * @page: the page to replace page table entries for
+ * @mm: the mm_struct where the page is expected to be mapped
+ * @address: address where the page is expected to be mapped
+ * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
+ *
+ * Tries to remove all the page table entries which are mapping this page and
+ * replace them with special device exclusive swap entries to grant a device
+ * exclusive access to the page. Caller must hold the page lock.
+ *
+ * Returns false if the page is still mapped, or if it could not be unmapped
+ * from the expected address. Otherwise returns true (success).
+ */
+static bool page_make_device_exclusive(struct page *page, struct mm_struct *mm,
+				unsigned long address, void *owner)
+{
+	struct make_exclusive_args args = {
+		.mm = mm,
+		.address = address,
+		.owner = owner,
+		.valid = false,
+	};
+	struct rmap_walk_control rwc = {
+		.rmap_one = page_make_device_exclusive_one,
+		.done = page_not_mapped,
+		.anon_lock = page_lock_anon_vma_read,
+		.arg = &args,
+	};
+
+	/*
+	 * Restrict to anonymous pages for now to avoid potential writeback
+	 * issues.
+	 */
+	if (!PageAnon(page))
+		return false;
+
+	rmap_walk(page, &rwc);
+
+	return args.valid && !page_mapcount(page);
+}
+
+/**
+ * make_device_exclusive_range() - Mark a range for exclusive use by a device
+ * @mm: mm_struct of assoicated target process
+ * @start: start of the region to mark for exclusive device access
+ * @end: end address of region
+ * @pages: returns the pages which were successfully marked for exclusive access
+ * @owner: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
+ *
+ * Returns: number of pages found in the range by GUP. A page is marked for
+ * exclusive access only if the page pointer is non-NULL.
+ *
+ * This function finds ptes mapping page(s) to the given address range, locks
+ * them and replaces mappings with special swap entries preventing userspace CPU
+ * access. On fault these entries are replaced with the original mapping after
+ * calling MMU notifiers.
+ *
+ * A driver using this to program access from a device must use a mmu notifier
+ * critical section to hold a device specific lock during programming. Once
+ * programming is complete it should drop the page lock and reference after
+ * which point CPU access to the page will revoke the exclusive access.
+ */
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, struct page **pages,
+				void *owner)
+{
+	long npages = (end - start) >> PAGE_SHIFT;
+	unsigned long i;
+
+	npages = get_user_pages_remote(mm, start, npages,
+				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
+				       pages, NULL, NULL);
+	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
+		if (!trylock_page(pages[i])) {
+			put_page(pages[i]);
+			pages[i] = NULL;
+			continue;
+		}
+
+		if (!page_make_device_exclusive(pages[i], mm, start, owner)) {
+			unlock_page(pages[i]);
+			put_page(pages[i]);
+			pages[i] = NULL;
+		}
+	}
+
+	return npages;
+}
+EXPORT_SYMBOL_GPL(make_device_exclusive_range);
+#endif
+
 void __put_anon_vma(struct anon_vma *anon_vma)
 {
 	struct anon_vma *root = anon_vma->root;