[v7,5/8] mm: Device exclusive memory access

Message ID	20210326000805.2518-6-apopple@nvidia.com (mailing list archive)
State	New, archived
Headers	Return-Path: <SRS0=IDF3=IY=lists.freedesktop.org=dri-devel-bounces@kernel.org>
	DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org 9E7F56198C
	Received-SPF: Pass (protection.outlook.com: domain of nvidia.com designates 216.228.112.34 as permitted sender) receiver=protection.outlook.com; client-ip=216.228.112.34; helo=mail.nvidia.com;
	From: Alistair Popple <apopple@nvidia.com>
	To: <linux-mm@kvack.org>, <nouveau@lists.freedesktop.org>, <bskeggs@redhat.com>, <akpm@linux-foundation.org>
	Subject: [PATCH v7 5/8] mm: Device exclusive memory access
	Date: Fri, 26 Mar 2021 11:08:02 +1100
	Message-ID: <20210326000805.2518-6-apopple@nvidia.com>
	In-Reply-To: <20210326000805.2518-1-apopple@nvidia.com>
	References: <20210326000805.2518-1-apopple@nvidia.com>
	MIME-Version: 1.0
	Precedence: list
	Cc: rcampbell@nvidia.com, willy@infradead.org, linux-doc@vger.kernel.org, jhubbard@nvidia.com, Alistair Popple <apopple@nvidia.com>, linux-kernel@vger.kernel.org, dri-devel@lists.freedesktop.org, hch@infradead.org, jglisse@redhat.com, kvm-ppc@vger.kernel.org, jgg@nvidia.com, Christoph Hellwig <hch@lst.de>
	Content-Type: text/plain; charset="us-ascii"
	Content-Transfer-Encoding: 7bit
	Errors-To: dri-devel-bounces@lists.freedesktop.org
	Sender: "dri-devel" <dri-devel-bounces@lists.freedesktop.org>
Series	Add support for SVM atomics in Nouveau
	[v7,0/8] Add support for SVM atomics in Nouveau
	[v7,1/8] mm: Remove special swap entry functions
	[v7,2/8] mm/swapops: Rework swap entry manipulation code
	[v7,3/8] mm/rmap: Split try_to_munlock from try_to_unmap
	[v7,4/8] mm/rmap: Split migration into its own function
	[v7,5/8] mm: Device exclusive memory access
	[v7,6/8] mm: Selftests for exclusive device memory
	[v7,7/8] nouveau/svm: Refactor nouveau_range_fault
	[v7,8/8] nouveau/svm: Implement atomic SVM access

diff --git a/Documentation/vm/hmm.rst b/Documentation/vm/hmm.rst
index 09e28507f5b2..a14c2938e7af 100644
--- a/Documentation/vm/hmm.rst
+++ b/Documentation/vm/hmm.rst
@@ -332,7 +332,7 @@ between device driver specific code and shared common code:
    walks to fill in the ``args->src`` array with PFNs to be migrated.
    The ``invalidate_range_start()`` callback is passed a
    ``struct mmu_notifier_range`` with the ``event`` field set to
-   ``MMU_NOTIFY_MIGRATE`` and the ``migrate_pgmap_owner`` field set to
+   ``MMU_NOTIFY_MIGRATE`` and the ``owner`` field set to
    the ``args->pgmap_owner`` field passed to migrate_vma_setup(). This is
    allows the device driver to skip the invalidation callback and only
    invalidate device private MMU mappings that are actually migrating.
@@ -405,6 +405,23 @@ between device driver specific code and shared common code:
 
    The lock can now be released.
 
+Exclusive access memory
+=======================
+
+Some devices have features such as atomic PTE bits that can be used to implement
+atomic access to system memory. To support atomic operations to a shared virtual
+memory page such a device needs access to that page which is exclusive of any
+userspace access from the CPU. The ``make_device_exclusive_range()`` function
+can be used to make a memory range inaccessible from userspace.
+
+This replaces all mappings for pages in the given range with special swap
+entries. Any attempt to access the swap entry results in a fault which is
+resolved by replacing the entry with the original mapping. A driver gets
+notified that the mapping has been changed by MMU notifiers, after which point
+it will no longer have exclusive access to the page. Exclusive access is
+guaranteed to last until the driver drops the page lock and page reference, at
+which point any CPU faults on the page may proceed as described.
+
 Memory cgroup (memcg) and rss accounting
 ========================================
 
diff --git a/drivers/gpu/drm/nouveau/nouveau_svm.c b/drivers/gpu/drm/nouveau/nouveau_svm.c
index f18bd53da052..94f841026c3b 100644
--- a/drivers/gpu/drm/nouveau/nouveau_svm.c
+++ b/drivers/gpu/drm/nouveau/nouveau_svm.c
@@ -265,7 +265,7 @@ nouveau_svmm_invalidate_range_start(struct mmu_notifier *mn,
 	 * the invalidation is handled as part of the migration process.
 	 */
 	if (update->event == MMU_NOTIFY_MIGRATE &&
-	    update->migrate_pgmap_owner == svmm->vmm->cli->drm->dev)
+	    update->owner == svmm->vmm->cli->drm->dev)
 		goto out;
 
 	if (limit > svmm->unmanaged.start && start < svmm->unmanaged.limit) {
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b8200782dede..2e6068d3fb9f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -41,7 +41,12 @@ struct mmu_interval_notifier;
  *
  * @MMU_NOTIFY_MIGRATE: used during migrate_vma_collect() invalidate to signal
  * a device driver to possibly ignore the invalidation if the
- * migrate_pgmap_owner field matches the driver's device private pgmap owner.
+ * owner field matches the driver's device private pgmap owner.
+ *
+ * @MMU_NOTIFY_EXCLUSIVE: to signal a device driver that the device will no
+ * longer have exclusive access to the page. May ignore the invalidation that's
+ * part of make_device_exclusive_range() if the owner field
+ * matches the value passed to make_device_exclusive_range().
  */
 enum mmu_notifier_event {
 	MMU_NOTIFY_UNMAP = 0,
@@ -51,6 +56,7 @@ enum mmu_notifier_event {
 	MMU_NOTIFY_SOFT_DIRTY,
 	MMU_NOTIFY_RELEASE,
 	MMU_NOTIFY_MIGRATE,
+	MMU_NOTIFY_EXCLUSIVE,
 };
 
 #define MMU_NOTIFIER_RANGE_BLOCKABLE (1 << 0)
@@ -269,7 +275,7 @@ struct mmu_notifier_range {
 	unsigned long end;
 	unsigned flags;
 	enum mmu_notifier_event event;
-	void *migrate_pgmap_owner;
+	void *owner;
 };
 
 static inline int mm_has_notifiers(struct mm_struct *mm)
@@ -521,14 +527,14 @@ static inline void mmu_notifier_range_init(struct mmu_notifier_range *range,
 	range->flags = flags;
 }
 
-static inline void mmu_notifier_range_init_migrate(
-			struct mmu_notifier_range *range, unsigned int flags,
+static inline void mmu_notifier_range_init_owner(
+			struct mmu_notifier_range *range,
+			enum mmu_notifier_event event, unsigned int flags,
 			struct vm_area_struct *vma, struct mm_struct *mm,
-			unsigned long start, unsigned long end, void *pgmap)
+			unsigned long start, unsigned long end, void *owner)
 {
-	mmu_notifier_range_init(range, MMU_NOTIFY_MIGRATE, flags, vma, mm,
-				start, end);
-	range->migrate_pgmap_owner = pgmap;
+	mmu_notifier_range_init(range, event, flags, vma, mm, start, end);
+	range->owner = owner;
 }
 
 #define ptep_clear_flush_young_notify(__vma, __address, __ptep)		\
@@ -655,8 +661,8 @@ static inline void _mmu_notifier_range_init(struct mmu_notifier_range *range,
 
 #define mmu_notifier_range_init(range,event,flags,vma,mm,start,end)  \
 	_mmu_notifier_range_init(range, start, end)
-#define mmu_notifier_range_init_migrate(range, flags, vma, mm, start, end, \
-			pgmap) \
+#define mmu_notifier_range_init_owner(range, event, flags, vma, mm, start, \
+					end, owner)			    \
 	_mmu_notifier_range_init(range, start, end)
 
 static inline bool
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index 6062e0cfca2d..b207c138cbff 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -193,6 +193,10 @@ int page_referenced(struct page *, int is_locked,
 bool try_to_migrate(struct page *page, enum ttu_flags flags);
 bool try_to_unmap(struct page *, enum ttu_flags flags);
 
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, struct page **pages,
+				void *arg);
+
 /* Avoid racy checks */
 #define PVMW_SYNC		(1 << 0)
 /* Look for migarion entries rather than present PTEs */
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 516104b9334b..7a3c260146df 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -63,9 +63,11 @@ static inline int current_is_kswapd(void)
  * to a special SWP_DEVICE_* entry.
  */
 #ifdef CONFIG_DEVICE_PRIVATE
-#define SWP_DEVICE_NUM 2
+#define SWP_DEVICE_NUM 4
 #define SWP_DEVICE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM)
 #define SWP_DEVICE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+1)
+#define SWP_DEVICE_EXCLUSIVE_WRITE (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+2)
+#define SWP_DEVICE_EXCLUSIVE_READ (MAX_SWAPFILES+SWP_HWPOISON_NUM+SWP_MIGRATION_NUM+3)
 #else
 #define SWP_DEVICE_NUM 0
 #endif
diff --git a/include/linux/swapops.h b/include/linux/swapops.h
index 4dfd807ae52a..4129bd2ff9d6 100644
--- a/include/linux/swapops.h
+++ b/include/linux/swapops.h
@@ -120,6 +120,27 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return unlikely(swp_type(entry) == SWP_DEVICE_WRITE);
 }
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+	return swp_type(entry) == SWP_DEVICE_EXCLUSIVE_READ ||
+		swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+	return unlikely(swp_type(entry) == SWP_DEVICE_EXCLUSIVE_WRITE);
+}
 #else /* CONFIG_DEVICE_PRIVATE */
 static inline swp_entry_t make_readable_device_private_entry(pgoff_t offset)
 {
@@ -140,6 +161,26 @@ static inline bool is_writable_device_private_entry(swp_entry_t entry)
 {
 	return false;
 }
+
+static inline swp_entry_t make_readable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline swp_entry_t make_writable_device_exclusive_entry(pgoff_t offset)
+{
+	return swp_entry(0, 0);
+}
+
+static inline bool is_device_exclusive_entry(swp_entry_t entry)
+{
+	return false;
+}
+
+static inline bool is_writable_device_exclusive_entry(swp_entry_t entry)
+{
+	return false;
+}
 #endif /* CONFIG_DEVICE_PRIVATE */
 
 #ifdef CONFIG_MIGRATION
@@ -219,7 +260,8 @@ static inline struct page *pfn_swap_entry_to_page(swp_entry_t entry)
  */
 static inline bool is_pfn_swap_entry(swp_entry_t entry)
 {
-	return is_migration_entry(entry) || is_device_private_entry(entry);
+	return is_migration_entry(entry) || is_device_private_entry(entry) ||
+	       is_device_exclusive_entry(entry);
 }
 
 struct page_vma_mapped_walk;
diff --git a/lib/test_hmm.c b/lib/test_hmm.c
index 80a78877bd93..5c9f5a020c1d 100644
--- a/lib/test_hmm.c
+++ b/lib/test_hmm.c
@@ -218,7 +218,7 @@ static bool dmirror_interval_invalidate(struct mmu_interval_notifier *mni,
 	 * the invalidation is handled as part of the migration process.
 	 */
 	if (range->event == MMU_NOTIFY_MIGRATE &&
-	    range->migrate_pgmap_owner == dmirror->mdevice)
+	    range->owner == dmirror->mdevice)
 		return true;
 
 	if (mmu_notifier_range_blockable(range))
diff --git a/mm/hmm.c b/mm/hmm.c
index 11df3ca30b82..fad6be2bf072 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -26,6 +26,8 @@
 #include <linux/mmu_notifier.h>
 #include <linux/memory_hotplug.h>
 
+#include "internal.h"
+
 struct hmm_vma_walk {
 	struct hmm_range	*range;
 	unsigned long		last;
@@ -271,6 +273,9 @@ static int hmm_vma_handle_pte(struct mm_walk *walk, unsigned long addr,
 		if (!non_swap_entry(entry))
 			goto fault;
 
+		if (is_device_exclusive_entry(entry))
+			goto fault;
+
 		if (is_migration_entry(entry)) {
 			pte_unmap(ptep);
 			hmm_vma_walk->last = addr;
diff --git a/mm/memory.c b/mm/memory.c
index 3a5705cfc891..33d11527ef77 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -781,6 +781,27 @@ copy_nonpresent_pte(struct mm_struct *dst_mm, struct mm_struct *src_mm,
 				pte = pte_swp_mkuffd_wp(pte);
 			set_pte_at(src_mm, addr, src_pte, pte);
 		}
+	} else if (is_device_exclusive_entry(entry)) {
+		page = pfn_swap_entry_to_page(entry);
+
+		get_page(page);
+		rss[mm_counter(page)]++;
+
+		if (is_writable_device_exclusive_entry(entry) &&
+		    is_cow_mapping(vm_flags)) {
+			/*
+			 * COW mappings require pages in both
+			 * parent and child to be set to read.
+			 */
+			entry = make_readable_device_exclusive_entry(
+							swp_offset(entry));
+			pte = swp_entry_to_pte(entry);
+			if (pte_swp_soft_dirty(*src_pte))
+				pte = pte_swp_mksoft_dirty(pte);
+			if (pte_swp_uffd_wp(*src_pte))
+				pte = pte_swp_mkuffd_wp(pte);
+			set_pte_at(src_mm, addr, src_pte, pte);
+		}
 	}
 	set_pte_at(dst_mm, addr, dst_pte, pte);
 	return 0;
@@ -1287,7 +1308,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 		}
 
 		entry = pte_to_swp_entry(ptent);
-		if (is_device_private_entry(entry)) {
+		if (is_device_private_entry(entry) ||
+		    is_device_exclusive_entry(entry)) {
 			struct page *page = pfn_swap_entry_to_page(entry);
 
 			if (unlikely(details && details->check_mapping)) {
@@ -1303,7 +1325,10 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
 
 			pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
 			rss[mm_counter(page)]--;
-			page_remove_rmap(page, false);
+
+			if (is_device_private_entry(entry))
+				page_remove_rmap(page, false);
+
 			put_page(page);
 			continue;
 		}
@@ -3256,6 +3281,82 @@ void unmap_mapping_range(struct address_space *mapping,
 }
 EXPORT_SYMBOL(unmap_mapping_range);
 
+static void restore_exclusive_pte(struct vm_area_struct *vma,
+				  struct page *page, unsigned long address,
+				  pte_t *ptep)
+{
+	pte_t pte;
+	swp_entry_t entry;
+
+	pte = pte_mkold(mk_pte(page, READ_ONCE(vma->vm_page_prot)));
+	if (pte_swp_soft_dirty(*ptep))
+		pte = pte_mksoft_dirty(pte);
+
+	entry = pte_to_swp_entry(*ptep);
+	if (pte_swp_uffd_wp(*ptep))
+		pte = pte_mkuffd_wp(pte);
+	else if (is_writable_device_exclusive_entry(entry))
+		pte = maybe_mkwrite(pte_mkdirty(pte), vma);
+
+	set_pte_at(vma->vm_mm, address, ptep, pte);
+
+	/*
+	 * No need to take a page reference as one was already
+	 * created when the swap entry was made.
+	 */
+	if (PageAnon(page))
+		page_add_anon_rmap(page, vma, address, false);
+	else
+		page_add_file_rmap(page, false);
+
+	if (vma->vm_flags & VM_LOCKED)
+		mlock_vma_page(page);
+
+	/*
+	 * No need to invalidate - it was non-present before. However
+	 * secondary CPUs may have mappings that need invalidating.
+	 */
+	update_mmu_cache(vma, address, ptep);
+}
+
+/*
+ * Restore a potential device exclusive pte to a working pte entry
+ */
+static vm_fault_t remove_device_exclusive_entry(struct vm_fault *vmf)
+{
+	struct page *page = vmf->page;
+	struct vm_area_struct *vma = vmf->vma;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = vmf->address,
+		.flags = PVMW_SYNC,
+	};
+	vm_fault_t ret = 0;
+	struct mmu_notifier_range range;
+
+	if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags))
+		return VM_FAULT_RETRY;
+	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
+				vmf->address & PAGE_MASK,
+				(vmf->address & PAGE_MASK) + PAGE_SIZE);
+	mmu_notifier_invalidate_range_start(&range);
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		if (unlikely(!pte_same(*pvmw.pte, vmf->orig_pte))) {
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		restore_exclusive_pte(vma, page, pvmw.address, pvmw.pte);
+	}
+
+	unlock_page(page);
+
+	mmu_notifier_invalidate_range_end(&range);
+	return ret;
+}
+
 /*
  * We enter with non-exclusive mmap_lock (to exclude vma changes,
  * but allow concurrent faults), and pte mapped but not yet locked.
@@ -3283,6 +3384,9 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
 		if (is_migration_entry(entry)) {
 			migration_entry_wait(vma->vm_mm, vmf->pmd,
 					     vmf->address);
+		} else if (is_device_exclusive_entry(entry)) {
+			vmf->page = pfn_swap_entry_to_page(entry);
+			ret = remove_device_exclusive_entry(vmf);
 		} else if (is_device_private_entry(entry)) {
 			vmf->page = pfn_swap_entry_to_page(entry);
 			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
diff --git a/mm/migrate.c b/mm/migrate.c
index cc4612e2a246..9cc9251d4802 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2570,8 +2570,8 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
 	 * that the registered device driver can skip invalidating device
 	 * private page mappings that won't be migrated.
 	 */
-	mmu_notifier_range_init_migrate(&range, 0, migrate->vma,
-		migrate->vma->vm_mm, migrate->start, migrate->end,
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_MIGRATE, 0,
+		migrate->vma, migrate->vma->vm_mm, migrate->start, migrate->end,
 		migrate->pgmap_owner);
 	mmu_notifier_invalidate_range_start(&range);
 
@@ -3074,9 +3074,9 @@ void migrate_vma_pages(struct migrate_vma *migrate)
 			if (!notified) {
 				notified = true;
 
-				mmu_notifier_range_init_migrate(&range, 0,
-					migrate->vma, migrate->vma->vm_mm,
-					addr, migrate->end,
+				mmu_notifier_range_init_owner(&range,
+					MMU_NOTIFY_MIGRATE, 0, migrate->vma,
+					migrate->vma->vm_mm, addr, migrate->end,
 					migrate->pgmap_owner);
 				mmu_notifier_invalidate_range_start(&range);
 			}
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f21b760ec809..c6018541ea3d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -165,6 +165,14 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 				newpte = swp_entry_to_pte(entry);
 				if (pte_swp_uffd_wp(oldpte))
 					newpte = pte_swp_mkuffd_wp(newpte);
+			} else if (is_writable_device_exclusive_entry(entry)) {
+				entry = make_readable_device_exclusive_entry(
+							swp_offset(entry));
+				newpte = swp_entry_to_pte(entry);
+				if (pte_swp_soft_dirty(oldpte))
+					newpte = pte_swp_mksoft_dirty(newpte);
+				if (pte_swp_uffd_wp(oldpte))
+					newpte = pte_swp_mkuffd_wp(newpte);
 			} else {
 				newpte = oldpte;
 			}
diff --git a/mm/page_vma_mapped.c b/mm/page_vma_mapped.c
index eed988ab2e81..29842f169219 100644
--- a/mm/page_vma_mapped.c
+++ b/mm/page_vma_mapped.c
@@ -41,7 +41,8 @@ static bool map_pte(struct page_vma_mapped_walk *pvmw)
 
 				/* Handle un-addressable ZONE_DEVICE memory */
 				entry = pte_to_swp_entry(*pvmw->pte);
-				if (!is_device_private_entry(entry))
+				if (!is_device_private_entry(entry) &&
+				    !is_device_exclusive_entry(entry))
 					return false;
 			} else if (!pte_present(*pvmw->pte))
 				return false;
@@ -93,7 +94,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 			return false;
 		entry = pte_to_swp_entry(*pvmw->pte);
 
-		if (!is_migration_entry(entry))
+		if (!is_migration_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 
 		pfn = swp_offset(entry);
@@ -102,7 +104,8 @@ static bool check_pte(struct page_vma_mapped_walk *pvmw)
 
 		/* Handle un-addressable ZONE_DEVICE memory */
 		entry = pte_to_swp_entry(*pvmw->pte);
-		if (!is_device_private_entry(entry))
+		if (!is_device_private_entry(entry) &&
+		    !is_device_exclusive_entry(entry))
 			return false;
 
 		pfn = swp_offset(entry);
diff --git a/mm/rmap.c b/mm/rmap.c
index b540b44e299a..b0ec88a37dab 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -2005,6 +2005,216 @@ void try_to_munlock(struct page *page)
 	rmap_walk(page, &rwc);
 }
 
+struct ttp_args {
+	struct mm_struct *mm;
+	unsigned long address;
+	void *arg;
+	bool valid;
+};
+
+static bool try_to_protect_one(struct page *page, struct vm_area_struct *vma,
+			unsigned long address, void *arg)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	struct page_vma_mapped_walk pvmw = {
+		.page = page,
+		.vma = vma,
+		.address = address,
+	};
+	struct ttp_args *ttp = arg;
+	pte_t pteval;
+	struct page *subpage;
+	bool ret = true;
+	struct mmu_notifier_range range;
+	swp_entry_t entry;
+	pte_t swp_pte;
+
+	mmu_notifier_range_init_owner(&range, MMU_NOTIFY_EXCLUSIVE, 0, vma,
+				      vma->vm_mm, address,
+				      min(vma->vm_end,
+					  address + page_size(page)),
+				      ttp->arg);
+	if (PageHuge(page)) {
+		/*
+		 * If sharing is possible, start and end will be adjusted
+		 * accordingly.
+		 */
+		adjust_range_if_pmd_sharing_possible(vma, &range.start,
+						     &range.end);
+	}
+	mmu_notifier_invalidate_range_start(&range);
+
+	while (page_vma_mapped_walk(&pvmw)) {
+		/* Unexpected PMD-mapped THP? */
+		VM_BUG_ON_PAGE(!pvmw.pte, page);
+
+		if (!pte_present(*pvmw.pte)) {
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		subpage = page - page_to_pfn(page) + pte_pfn(*pvmw.pte);
+		address = pvmw.address;
+
+		/* Nuke the page table entry. */
+		flush_cache_page(vma, address, pte_pfn(*pvmw.pte));
+		pteval = ptep_clear_flush(vma, address, pvmw.pte);
+
+		/* Move the dirty bit to the page. Now the pte is gone. */
+		if (pte_dirty(pteval))
+			set_page_dirty(page);
+
+		/* Update high watermark before we lower rss */
+		update_hiwater_rss(mm);
+
+		if (arch_unmap_one(mm, vma, address, pteval) < 0) {
+			set_pte_at(mm, address, pvmw.pte, pteval);
+			ret = false;
+			page_vma_mapped_walk_done(&pvmw);
+			break;
+		}
+
+		/*
+		 * Check that our target page is still mapped at the expected
+		 * address.
+		 */
+		if (ttp->mm == mm && ttp->address == address &&
+		    pte_write(pteval))
+			ttp->valid = true;
+
+		/*
+		 * Store the pfn of the page in a special migration
+		 * pte. do_swap_page() will wait until the migration
+		 * pte is removed and then restart fault handling.
+		 */
+		if (pte_write(pteval))
+			entry = make_writable_device_exclusive_entry(
+							page_to_pfn(subpage));
+		else
+			entry = make_readable_device_exclusive_entry(
+							page_to_pfn(subpage));
+		swp_pte = swp_entry_to_pte(entry);
+		if (pte_soft_dirty(pteval))
+			swp_pte = pte_swp_mksoft_dirty(swp_pte);
+		if (pte_uffd_wp(pteval))
+			swp_pte = pte_swp_mkuffd_wp(swp_pte);
+
+		/* Take a reference for the swap entry */
+		get_page(page);
+		set_pte_at(mm, address, pvmw.pte, swp_pte);
+
+		page_remove_rmap(subpage, PageHuge(page));
+		put_page(page);
+	}
+
+	mmu_notifier_invalidate_range_end(&range);
+
+	return ret;
+}
+
+/**
+ * try_to_protect - try to replace all page table mappings with swap entries
+ * @page: the page to replace page table entries for
+ * @flags: action and flags
+ * @mm: the mm_struct where the page is expected to be mapped
+ * @address: address where the page is expected to be mapped
+ * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier callbacks
+ *
+ * Tries to remove all the page table entries which are mapping this page and
+ * replace them with special swap entries to grant a device exclusive access to
+ * the page. Caller must hold the page lock.
+ *
+ * Returns false if the page is still mapped, or if it could not be unmapped
+ * from the expected address. Otherwise returns true (success).
+ */
+static bool try_to_protect(struct page *page, struct mm_struct *mm,
+			   unsigned long address, void *arg)
+{
+	struct ttp_args ttp = {
+		.mm = mm,
+		.address = address,
+		.arg = arg,
+		.valid = false,
+	};
+	struct rmap_walk_control rwc = {
+		.rmap_one = try_to_protect_one,
+		.done = page_not_mapped,
+		.anon_lock = page_lock_anon_vma_read,
+		.arg = &ttp,
+	};
+
+	/*
+	 * Restrict to anonymous pages for now to avoid potential writeback
+	 * issues.
+	 */
+	if (!PageAnon(page))
+		return false;
+
+	/*
+	 * During exec, a temporary VMA is setup and later moved.
+	 * The VMA is moved under the anon_vma lock but not the
+	 * page tables leading to a race where migration cannot
+	 * find the migration ptes. Rather than increasing the
+	 * locking requirements of exec(), migration skips
+	 * temporary VMAs until after exec() completes.
+	 */
+	if (!PageKsm(page) && PageAnon(page))
+		rwc.invalid_vma = invalid_migration_vma;
+
+	rmap_walk(page, &rwc);
+
+	return ttp.valid && !page_mapcount(page);
+}
+
+/**
+ * make_device_exclusive_range() - Mark a range for exclusive use by a device
+ * @mm: mm_struct of associated target process
+ * @start: start of the region to mark for exclusive device access
+ * @end: end address of region
+ * @pages: returns the pages which were successfully marked for exclusive access
+ * @arg: passed to MMU_NOTIFY_EXCLUSIVE range notifier to allow filtering
+ *
+ * Returns: number of pages successfully marked for exclusive access
+ *
+ * This function finds ptes mapping page(s) to the given address range, locks
+ * them and replaces mappings with special swap entries preventing userspace CPU
+ * access. On fault these entries are replaced with the original mapping after
+ * calling MMU notifiers.
+ *
+ * A driver using this to program access from a device must use a mmu notifier
+ * critical section to hold a device specific lock during programming. Once
+ * programming is complete it should drop the page lock and reference after
+ * which point CPU access to the page will revoke the exclusive access.
+ */
+int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
+				unsigned long end, struct page **pages,
+				void *arg)
+{
+	unsigned long npages = (end - start) >> PAGE_SHIFT;
+	unsigned long i;
+
+	npages = get_user_pages_remote(mm, start, npages,
+				       FOLL_GET | FOLL_WRITE | FOLL_SPLIT_PMD,
+				       pages, NULL, NULL);
+	for (i = 0; i < npages; i++, start += PAGE_SIZE) {
+		if (!trylock_page(pages[i])) {
+			put_page(pages[i]);
+			pages[i] = NULL;
+			continue;
+		}
+
+		if (!try_to_protect(pages[i], mm, start, arg)) {
+			unlock_page(pages[i]);
+			put_page(pages[i]);
+			pages[i] = NULL;
+		}
+	}
+
+	return npages;
+}
+EXPORT_SYMBOL_GPL(make_device_exclusive_range);
+
 void __put_anon_vma(struct anon_vma *anon_vma)
 {
 	struct anon_vma *root = anon_vma->root;
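
For readers following the swap.h and swapops.h hunks, the sketch below is a userspace model (not kernel code) of how the two new swap types slot in after the existing device private types, and how a writable exclusive entry is downgraded to a readable one while keeping its offset, as the fork and mprotect hunks do. The configuration values (MAX_SWAPFILES_SHIFT of 5, one hwpoison type, two migration types), the 64-bit packing in MODEL_TYPE_SHIFT, and the helper model_write_protect() are assumptions made for illustration only; the real encoding is architecture specific.

```c
/* Userspace model only: mirrors the swap-type arithmetic, not kernel code. */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed configuration values, picked for illustration. */
#define MAX_SWAPFILES_SHIFT	5
#define SWP_HWPOISON_NUM	1
#define SWP_MIGRATION_NUM	2
#define SWP_DEVICE_NUM		4	/* raised from 2 to 4 by the patch */
#define MAX_SWAPFILES \
	((1 << MAX_SWAPFILES_SHIFT) - SWP_DEVICE_NUM - \
	 SWP_MIGRATION_NUM - SWP_HWPOISON_NUM)

#define SWP_DEVICE_BASE	(MAX_SWAPFILES + SWP_HWPOISON_NUM + SWP_MIGRATION_NUM)
#define SWP_DEVICE_WRITE		(SWP_DEVICE_BASE)
#define SWP_DEVICE_READ			(SWP_DEVICE_BASE + 1)
#define SWP_DEVICE_EXCLUSIVE_WRITE	(SWP_DEVICE_BASE + 2)
#define SWP_DEVICE_EXCLUSIVE_READ	(SWP_DEVICE_BASE + 3)

/* Hypothetical packing: type in the top bits, offset (the pfn) below. */
#define MODEL_TYPE_SHIFT	(64 - MAX_SWAPFILES_SHIFT)
#define MODEL_OFFSET_MASK	((UINT64_C(1) << MODEL_TYPE_SHIFT) - 1)

typedef struct { uint64_t val; } swp_entry_t;

static inline swp_entry_t swp_entry(unsigned int type, uint64_t offset)
{
	swp_entry_t e;

	/* Every type must fit in the bits reserved for it. */
	assert(type < (1u << MAX_SWAPFILES_SHIFT));
	e.val = ((uint64_t)type << MODEL_TYPE_SHIFT) |
		(offset & MODEL_OFFSET_MASK);
	return e;
}

static inline unsigned int swp_type(swp_entry_t e)
{
	return (unsigned int)(e.val >> MODEL_TYPE_SHIFT);
}

static inline uint64_t swp_offset(swp_entry_t e)
{
	return e.val & MODEL_OFFSET_MASK;
}

static inline swp_entry_t make_readable_device_exclusive_entry(uint64_t offset)
{
	return swp_entry(SWP_DEVICE_EXCLUSIVE_READ, offset);
}

static inline swp_entry_t make_writable_device_exclusive_entry(uint64_t offset)
{
	return swp_entry(SWP_DEVICE_EXCLUSIVE_WRITE, offset);
}

static inline bool is_device_exclusive_entry(swp_entry_t e)
{
	return swp_type(e) == SWP_DEVICE_EXCLUSIVE_READ ||
	       swp_type(e) == SWP_DEVICE_EXCLUSIVE_WRITE;
}

static inline bool is_writable_device_exclusive_entry(swp_entry_t e)
{
	return swp_type(e) == SWP_DEVICE_EXCLUSIVE_WRITE;
}

/* Hypothetical helper: the downgrade done on fork (COW) and mprotect. */
static inline swp_entry_t model_write_protect(swp_entry_t e)
{
	if (is_writable_device_exclusive_entry(e))
		return make_readable_device_exclusive_entry(swp_offset(e));
	return e;
}
```

Under these assumed values the four device types occupy 28 through 31, which is why SWP_DEVICE_NUM has to grow from 2 to 4 for the new entries to fit inside the type field.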

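A note for driver authors on consuming the result of make_device_exclusive_range(): as posted, the function returns the count from get_user_pages_remote() and sets pages[i] to NULL for any page it could not lock or protect, so the caller has to test each slot rather than rely on the count alone. Pages that remain in the array are still locked and still hold a reference, both of which the driver drops once device programming is complete. The following is a userspace model of that loop; struct model_page, model_unlock_page(), model_put_page() and model_program_and_release() are hypothetical stand-ins, not kernel interfaces. The sketch also treats a negative count as an error, on the assumption that a get_user_pages_remote() failure can surface through the return value.

```c
/* Userspace model only: stand-ins for struct page, unlock_page(), put_page(). */
#include <assert.h>
#include <stddef.h>

struct model_page {
	int locked;	/* models the page lock */
	int refcount;	/* models the reference taken with FOLL_GET */
};

static void model_unlock_page(struct model_page *p)
{
	assert(p->locked);
	p->locked = 0;
}

static void model_put_page(struct model_page *p)
{
	assert(p->refcount > 0);
	p->refcount--;
}

/*
 * Walk pages[] as filled in by a make_device_exclusive_range()-like call,
 * where npages is that call's return value. Returns the number of pages
 * that were actually exclusive, or the negative value unchanged.
 */
static long model_program_and_release(struct model_page **pages, long npages)
{
	long i, exclusive = 0;

	if (npages < 0)
		return npages;

	for (i = 0; i < npages; i++) {
		if (!pages[i])
			continue;	/* could not be locked or protected */

		/*
		 * Device programming would happen here, under the driver's
		 * own lock inside an mmu notifier critical section.
		 */
		exclusive++;

		/* Dropping lock and reference lets CPU faults proceed. */
		model_unlock_page(pages[i]);
		model_put_page(pages[i]);
		pages[i] = NULL;
	}

	return exclusive;
}
```

The count returned by the model can be smaller than npages, which is the case a driver needs to handle when only part of the requested range could be made exclusive.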