
[RFC,v2,19/27] mm: mprotect: Introduce PAGE_FAULT_ON_ACCESS for mprotect(PROT_MTE)

Message ID 20231119165721.9849-20-alexandru.elisei@arm.com (mailing list archive)
State Handled Elsewhere
Series [RFC,v2,01/27] arm64: mte: Rework naming for tag manipulation functions

Commit Message

Alexandru Elisei Nov. 19, 2023, 4:57 p.m. UTC
To enable tagging on a memory range, userspace can use mprotect() with the
PROT_MTE access flag. Pages already mapped in the VMA don't have the
associated tag storage block reserved, so mark the PTEs as
PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
reserve the tag storage on the fault path.

This has several benefits over reserving the tag storage as part of the
mprotect() call handling:

- Tag storage is reserved only for those pages in the VMA that are
  accessed, instead of for all the pages already mapped in the VMA.
- Reduces the latency of the mprotect() call.
- Eliminates races with page migration.

But all of this is at the expense of an extra page fault per page until the
pages being accessed all have their corresponding tag storage reserved.

For arm64, the PAGE_FAULT_ON_ACCESS protection is created by defining a new
page table entry software bit, PTE_TAG_STORAGE_NONE. Linux doesn't set any
of the PBHA bits in entries from the last level of the translation table
and it doesn't use the TCR_ELx.HWUxx bits; also, the first PBHA bit, bit
59, is already being used as a software bit for PMD_PRESENT_INVALID.

This is only implemented for PTE mappings; PMD mappings will follow.
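
For illustration, below is a minimal userspace sketch of the flow described
above. It is not part of this patch and assumes an MTE-capable arm64 system;
PROT_MTE is defined locally because, depending on the libc headers, it may
only be available from the arm64 uapi headers
(arch/arm64/include/uapi/asm/mman.h):

#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#ifndef PROT_MTE
#define PROT_MTE	0x20	/* arm64-specific mmap/mprotect flag */
#endif

int main(void)
{
	size_t len = 16 * 4096;
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return EXIT_FAILURE;

	/* Populate the pages before the range is made tagged. */
	memset(p, 0, len);

	/* The PTEs mapped above are switched to PAGE_FAULT_ON_ACCESS here. */
	if (mprotect(p, len, PROT_READ | PROT_WRITE | PROT_MTE))
		return EXIT_FAILURE;

	/* The first access to each page faults and reserves tag storage. */
	p[0] = 1;

	munmap(p, len);
	return EXIT_SUCCESS;
}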

Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
---
 arch/arm64/Kconfig                       |   1 +
 arch/arm64/include/asm/mte.h             |   4 +-
 arch/arm64/include/asm/mte_tag_storage.h |   2 +
 arch/arm64/include/asm/pgtable-prot.h    |   2 +
 arch/arm64/include/asm/pgtable.h         |  40 ++++++---
 arch/arm64/kernel/mte.c                  |  12 ++-
 arch/arm64/mm/fault.c                    | 101 +++++++++++++++++++++++
 include/linux/pgtable.h                  |  17 ++++
 mm/Kconfig                               |   3 +
 mm/memory.c                              |   3 +
 10 files changed, 170 insertions(+), 15 deletions(-)

Comments

David Hildenbrand Nov. 28, 2023, 5:55 p.m. UTC | #1
On 19.11.23 17:57, Alexandru Elisei wrote:
> To enable tagging on a memory range, userspace can use mprotect() with the
> PROT_MTE access flag. Pages already mapped in the VMA don't have the
> associated tag storage block reserved, so mark the PTEs as
> PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
> reserve the tag storage on the fault path.

That sounds a lot like fake PROT_NONE. Would there be a way to unify that
handling and simply reuse pte_protnone()? For example, could we special
case on VMA flags?

Like, don't do NUMA hinting in these special VMAs. Then, have something 
like:

if (pte_protnone(vmf->orig_pte))
	return handle_pte_protnone(vmf);

In there, special case on the VMA flags.

I *suspect* that handle_page_missing_tag_storage() stole (sorry :P) some 
code from the prot_none handling path. At least the recovery path and
the writability handling look like they would be better shared in
handle_pte_protnone() as well.

That might take some magic out of this patch.
David Hildenbrand Nov. 28, 2023, 6 p.m. UTC | #2
On 28.11.23 18:55, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
>> To enable tagging on a memory range, userspace can use mprotect() with the
>> PROT_MTE access flag. Pages already mapped in the VMA don't have the
>> associated tag storage block reserved, so mark the PTEs as
>> PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
>> reserve the tag storage on the fault path.
> 
> That sounds a lot like fake PROT_NONE. Would there be a way to unify that
> handling and simply reuse pte_protnone()? For example, could we special
> case on VMA flags?
> 
> Like, don't do NUMA hinting in these special VMAs. Then, have something
> like:
> 
> if (pte_protnone(vmf->orig_pte))
> 	return handle_pte_protnone(vmf);
> 

Thinking out loud: maybe there isn't even the need to special-case on the
VMA. Arch code should know if there is something to do. If not, it
surely was triggered by NUMA hinting. So maybe that could be handled in
handle_pte_protnone() quite nicely.
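
A rough sketch of that dispatch (handle_pte_protnone() and the arch hooks
below are illustrative names, not existing functions) could be:

/* mm/memory.c: replaces the current pte_protnone() branch */
static vm_fault_t handle_pte_protnone(struct vm_fault *vmf)
{
	/*
	 * Give the architecture the first chance to claim the fault; on
	 * arm64 that would be an entry marked as missing tag storage
	 * rather than a NUMA hinting entry.
	 */
	if (arch_fault_on_access_pte(vmf->orig_pte))
		return arch_handle_fault_on_access(vmf);

	/* Otherwise this is a regular NUMA hinting fault. */
	return do_numa_page(vmf);
}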
Hyesoo Yu Nov. 29, 2023, 9:27 a.m. UTC | #3
On Sun, Nov 19, 2023 at 04:57:13PM +0000, Alexandru Elisei wrote:
> To enable tagging on a memory range, userspace can use mprotect() with the
> PROT_MTE access flag. Pages already mapped in the VMA don't have the
> associated tag storage block reserved, so mark the PTEs as
> PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
> reserve the tag storage on the fault path.
> 
> This has several benefits over reserving the tag storage as part of the
> mprotect() call handling:
> 
> - Tag storage is reserved only for those pages in the VMA that are
>   accessed, instead of for all the pages already mapped in the VMA.
> - Reduces the latency of the mprotect() call.
> - Eliminates races with page migration.
> 
> But all of this is at the expense of an extra page fault per page until the
> pages being accessed all have their corresponding tag storage reserved.
> 
> For arm64, the PAGE_FAULT_ON_ACCESS protection is created by defining a new
> page table entry software bit, PTE_TAG_STORAGE_NONE. Linux doesn't set any
> of the PBHA bits in entries from the last level of the translation table
> and it doesn't use the TCR_ELx.HWUxx bits; also, the first PBHA bit, bit
> 59, is already being used as a software bit for PMD_PRESENT_INVALID.
> 
> This is only implemented for PTE mappings; PMD mappings will follow.
> 
> Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> ---
>  arch/arm64/Kconfig                       |   1 +
>  arch/arm64/include/asm/mte.h             |   4 +-
>  arch/arm64/include/asm/mte_tag_storage.h |   2 +
>  arch/arm64/include/asm/pgtable-prot.h    |   2 +
>  arch/arm64/include/asm/pgtable.h         |  40 ++++++---
>  arch/arm64/kernel/mte.c                  |  12 ++-
>  arch/arm64/mm/fault.c                    | 101 +++++++++++++++++++++++
>  include/linux/pgtable.h                  |  17 ++++
>  mm/Kconfig                               |   3 +
>  mm/memory.c                              |   3 +
>  10 files changed, 170 insertions(+), 15 deletions(-)
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index efa5b7958169..3b9c435eaafb 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2066,6 +2066,7 @@ if ARM64_MTE
>  config ARM64_MTE_TAG_STORAGE
>  	bool "Dynamic MTE tag storage management"
>  	depends on ARCH_KEEP_MEMBLOCK
> +	select ARCH_HAS_FAULT_ON_ACCESS
>  	select CONFIG_CMA
>  	help
>  	  Adds support for dynamic management of the memory used by the hardware
> diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> index 6457b7899207..70dc2e409070 100644
> --- a/arch/arm64/include/asm/mte.h
> +++ b/arch/arm64/include/asm/mte.h
> @@ -107,7 +107,7 @@ static inline bool try_page_mte_tagging(struct page *page)
>  }
>  
>  void mte_zero_clear_page_tags(void *addr);
> -void mte_sync_tags(pte_t pte, unsigned int nr_pages);
> +void mte_sync_tags(pte_t *pteval, unsigned int nr_pages);
>  void mte_copy_page_tags(void *kto, const void *kfrom);
>  void mte_thread_init_user(void);
>  void mte_thread_switch(struct task_struct *next);
> @@ -139,7 +139,7 @@ static inline bool try_page_mte_tagging(struct page *page)
>  static inline void mte_zero_clear_page_tags(void *addr)
>  {
>  }
> -static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages)
> +static inline void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
>  {
>  }
>  static inline void mte_copy_page_tags(void *kto, const void *kfrom)
> diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> index 6e5d28e607bb..c70ced60a0cd 100644
> --- a/arch/arm64/include/asm/mte_tag_storage.h
> +++ b/arch/arm64/include/asm/mte_tag_storage.h
> @@ -33,6 +33,8 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
>  void free_tag_storage(struct page *page, int order);
>  
>  bool page_tag_storage_reserved(struct page *page);
> +
> +vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
>  #else
>  static inline bool tag_storage_enabled(void)
>  {
> diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> index e9624f6326dd..85ebb3e352ad 100644
> --- a/arch/arm64/include/asm/pgtable-prot.h
> +++ b/arch/arm64/include/asm/pgtable-prot.h
> @@ -19,6 +19,7 @@
>  #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
>  #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
>  #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
> +#define PTE_TAG_STORAGE_NONE	(_AT(pteval_t, 1) << 60) /* only when PTE_PROT_NONE */
>  
>  /*
>   * This bit indicates that the entry is present i.e. pmd_page()
> @@ -94,6 +95,7 @@ extern bool arm64_use_ng_mappings;
>  	 })
>  
>  #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
> +#define PAGE_FAULT_ON_ACCESS	__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
>  /* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
>  #define PAGE_SHARED		__pgprot(_PAGE_SHARED)
>  #define PAGE_SHARED_EXEC	__pgprot(_PAGE_SHARED_EXEC)
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 20e8de853f5d..8cc135f1c112 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -326,10 +326,10 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
>  		     __func__, pte_val(old_pte), pte_val(pte));
>  }
>  
> -static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
> +static inline void __sync_cache_and_tags(pte_t *pteval, unsigned int nr_pages)
>  {
> -	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
> -		__sync_icache_dcache(pte);
> +	if (pte_present(*pteval) && pte_user_exec(*pteval) && !pte_special(*pteval))
> +		__sync_icache_dcache(*pteval);
>  
>  	/*
>  	 * If the PTE would provide user space access to the tags associated
> @@ -337,9 +337,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
>  	 * pte_access_permitted() returns false for exec only mappings, they
>  	 * don't expose tags (instruction fetches don't check tags).
>  	 */
> -	if (system_supports_mte() && pte_access_permitted(pte, false) &&
> -	    !pte_special(pte) && pte_tagged(pte))
> -		mte_sync_tags(pte, nr_pages);
> +	if (system_supports_mte() && pte_access_permitted(*pteval, false) &&
> +	    !pte_special(*pteval) && pte_tagged(*pteval))
> +		mte_sync_tags(pteval, nr_pages);
>  }
>  
>  static inline void set_ptes(struct mm_struct *mm,
> @@ -347,7 +347,7 @@ static inline void set_ptes(struct mm_struct *mm,
>  			    pte_t *ptep, pte_t pte, unsigned int nr)
>  {
>  	page_table_check_ptes_set(mm, ptep, pte, nr);
> -	__sync_cache_and_tags(pte, nr);
> +	__sync_cache_and_tags(&pte, nr);
>  
>  	for (;;) {
>  		__check_safe_pte_update(mm, ptep, pte);
> @@ -459,6 +459,26 @@ static inline int pmd_protnone(pmd_t pmd)
>  }
>  #endif
>  
> +#ifdef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> +static inline bool fault_on_access_pte(pte_t pte)
> +{
> +	return (pte_val(pte) & (PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID)) ==
> +		(PTE_PROT_NONE | PTE_TAG_STORAGE_NONE);
> +}
> +
> +static inline bool fault_on_access_pmd(pmd_t pmd)
> +{
> +	return fault_on_access_pte(pmd_pte(pmd));
> +}
> +
> +static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> +{
> +	if (tag_storage_enabled())
> +		return handle_page_missing_tag_storage(vmf);
> +	return VM_FAULT_SIGBUS;
> +}
> +#endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
> +
>  #define pmd_present_invalid(pmd)     (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
>  
>  static inline int pmd_present(pmd_t pmd)
> @@ -533,7 +553,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
>  				unsigned long __always_unused addr,
>  				pte_t *ptep, pte_t pte, unsigned int nr)
>  {
> -	__sync_cache_and_tags(pte, nr);
> +	__sync_cache_and_tags(&pte, nr);
>  	__check_safe_pte_update(mm, ptep, pte);
>  	set_pte(ptep, pte);
>  }
> @@ -828,8 +848,8 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
>  	 * in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
>  	 */
>  	const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
> -			      PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_GP |
> -			      PTE_ATTRINDX_MASK;
> +			      PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID |
> +			      PTE_WRITE | PTE_GP | PTE_ATTRINDX_MASK;
>  	/* preserve the hardware dirty information */
>  	if (pte_hw_dirty(pte))
>  		pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
> diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> index a41ef3213e1e..5962bab1d549 100644
> --- a/arch/arm64/kernel/mte.c
> +++ b/arch/arm64/kernel/mte.c
> @@ -21,6 +21,7 @@
>  #include <asm/barrier.h>
>  #include <asm/cpufeature.h>
>  #include <asm/mte.h>
> +#include <asm/mte_tag_storage.h>
>  #include <asm/ptrace.h>
>  #include <asm/sysreg.h>
>  
> @@ -35,13 +36,18 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
>  EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
>  #endif
>  
> -void mte_sync_tags(pte_t pte, unsigned int nr_pages)
> +void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
>  {
> -	struct page *page = pte_page(pte);
> +	struct page *page = pte_page(*pteval);
>  	unsigned int i;
>  
> -	/* if PG_mte_tagged is set, tags have already been initialised */
>  	for (i = 0; i < nr_pages; i++, page++) {
> +		if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page))) {
> +			*pteval = pte_modify(*pteval, PAGE_FAULT_ON_ACCESS);
> +			continue;
> +		}
> +
> +		/* if PG_mte_tagged is set, tags have already been initialised */
>  		if (try_page_mte_tagging(page)) {
>  			mte_clear_page_tags(page_address(page));
>  			set_page_mte_tagged(page);
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index acbc7530d2b2..f5fa583acf18 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -19,6 +19,7 @@
>  #include <linux/kprobes.h>
>  #include <linux/uaccess.h>
>  #include <linux/page-flags.h>
> +#include <linux/page-isolation.h>
>  #include <linux/sched/signal.h>
>  #include <linux/sched/debug.h>
>  #include <linux/highmem.h>
> @@ -953,3 +954,103 @@ void tag_clear_highpage(struct page *page)
>  	mte_zero_clear_page_tags(page_address(page));
>  	set_page_mte_tagged(page);
>  }
> +
> +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> +vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct page *page = NULL;
> +	pte_t new_pte, old_pte;
> +	bool writable = false;
> +	vm_fault_t err;
> +	int ret;
> +
> +	spin_lock(vmf->ptl);
> +	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> +		pte_unmap_unlock(vmf->pte, vmf->ptl);
> +		return 0;
> +	}
> +
> +	/* Get the normal PTE  */
> +	old_pte = ptep_get(vmf->pte);
> +	new_pte = pte_modify(old_pte, vma->vm_page_prot);
> +
> +	/*
> +	 * Detect now whether the PTE could be writable; this information
> +	 * is only valid while holding the PT lock.
> +	 */
> +	writable = pte_write(new_pte);
> +	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
> +	    can_change_pte_writable(vma, vmf->address, new_pte))
> +		writable = true;
> +
> +	page = vm_normal_page(vma, vmf->address, new_pte);
> +	if (!page || is_zone_device_page(page))
> +		goto out_map;
> +
> +	/*
> +	 * This should never happen, once a VMA has been marked as tagged, that
> +	 * cannot be changed.
> +	 */
> +	if (!(vma->vm_flags & VM_MTE))
> +		goto out_map;
> +
> +	/* Prevent the page from being unmapped from under us. */
> +	get_page(page);
> +	vma_set_access_pid_bit(vma);
> +
> +	/*
> +	 * Pairs with pte_offset_map_nolock(), which takes the RCU read lock,
> +	 * and spin_lock() above which takes the ptl lock. Both locks should be
> +	 * balanced after this point.
> +	 */
> +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +
> +	/*
> +	 * Probably the page is being isolated for migration, replay the fault
> +	 * to give time for the entry to be replaced by a migration pte.
> +	 */
> +	if (unlikely(is_migrate_isolate_page(page)))
> +		goto out_retry;
> +
> +	ret = reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE);
> +	if (ret)
> +		goto out_retry;
> +
> +	put_page(page);
> +
> +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
> +	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> +		pte_unmap_unlock(vmf->pte, vmf->ptl);
> +		return 0;
> +	}
> +
> +out_map:
> +	/*
> +	 * Make it present again, depending on how arch implements
> +	 * non-accessible ptes, some can allow access by kernel mode.
> +	 */
> +	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
> +	new_pte = pte_modify(old_pte, vma->vm_page_prot);
> +	new_pte = pte_mkyoung(new_pte);
> +	if (writable)
> +		new_pte = pte_mkwrite(new_pte, vma);
> +	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, new_pte);
> +	update_mmu_cache(vma, vmf->address, vmf->pte);
> +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> +
> +	return 0;
> +
> +out_retry:
> +	put_page(page);
> +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> +		vma_end_read(vma);
> +	if (fault_flag_allow_retry_first(vmf->flags)) {
> +		err = VM_FAULT_RETRY;
> +	} else {
> +		/* Replay the fault. */
> +		err = 0;

Hello!

Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur.
I guess that could become a system stability issue (but I'm not familiar with this area, so please let me know if I'm mistaken!).

How about migrating the page when the problem repeats?

Thanks,
Regards.

> +	}
> +	return err;
> +}
> +#endif
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index ffdb9b6bed6c..e2c761dd6c41 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -1458,6 +1458,23 @@ static inline int pmd_protnone(pmd_t pmd)
>  }
>  #endif /* CONFIG_NUMA_BALANCING */
>  
> +#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> +static inline bool fault_on_access_pte(pte_t pte)
> +{
> +	return false;
> +}
> +
> +static inline bool fault_on_access_pmd(pmd_t pmd)
> +{
> +	return false;
> +}
> +
> +static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> +{
> +	return VM_FAULT_SIGBUS;
> +}
> +#endif
> +
>  #endif /* CONFIG_MMU */
>  
>  #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 89971a894b60..a90eefc3ee80 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -1019,6 +1019,9 @@ config IDLE_PAGE_TRACKING
>  config ARCH_HAS_CACHE_LINE_SIZE
>  	bool
>  
> +config ARCH_HAS_FAULT_ON_ACCESS
> +	bool
> +
>  config ARCH_HAS_CURRENT_STACK_POINTER
>  	bool
>  	help
> diff --git a/mm/memory.c b/mm/memory.c
> index e137f7673749..a04a971200b9 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5044,6 +5044,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
>  	if (!pte_present(vmf->orig_pte))
>  		return do_swap_page(vmf);
>  
> +	if (fault_on_access_pte(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> +		return arch_do_page_fault_on_access(vmf);
> +
>  	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
>  		return do_numa_page(vmf);
>  
> -- 
> 2.42.1
> 
>
Alexandru Elisei Nov. 29, 2023, 11:55 a.m. UTC | #4
Hi,

On Tue, Nov 28, 2023 at 06:55:18PM +0100, David Hildenbrand wrote:
> On 19.11.23 17:57, Alexandru Elisei wrote:
> > To enable tagging on a memory range, userspace can use mprotect() with the
> > PROT_MTE access flag. Pages already mapped in the VMA don't have the
> > associated tag storage block reserved, so mark the PTEs as
> > PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
> > reserve the tag storage on the fault path.
> 
> That sounds a lot like fake PROT_NONE. Would there be a way to unify that

Yes, arm64 basically defines PAGE_FAULT_ON_ACCESS as PAGE_NONE |
PTE_TAG_STORAGE_NONE.
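
Spelled out against the definitions added by this patch, that is roughly:

/* illustrative restatement of the pgtable-prot.h hunk */
#define PAGE_FAULT_ON_ACCESS	__pgprot(pgprot_val(PAGE_NONE) | PTE_TAG_STORAGE_NONE)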

> handling and simply reuse pte_protnone()? For example, could we special case
> on VMA flags?
> 
> Like, don't do NUMA hinting in these special VMAs. Then, have something
> like:
> 
> if (pte_protnone(vmf->orig_pte))
> 	return handle_pte_protnone(vmf);
> 
> In there, special case on the VMA flags.

Your suggestion from the follow-up reply that an arch should know if it needs to
do something was spot on, arm64 can use the software bit in the translation
table entry for that.

So what you are proposing is this:

* Rename do_numa_page->handle_pte_protnone
* At some point in the do_numa_page (now renamed to handle_pte_protnone) flow,
  decide if pte_protnone() has been set for an arch specific reason or because
  of automatic NUMA balancing.
* if pte_protnone() has been set by an architecture, then let the architecture
  handle the fault.

If I understood you correctly, that's a good idea, and should be easy to
implement.
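
As a sketch of the arm64 side of that flow (the function names are
illustrative; only PTE_TAG_STORAGE_NONE, tag_storage_enabled() and
handle_page_missing_tag_storage() come from this series):

/* Only the arch sets this bit, never automatic NUMA balancing. */
static inline bool arch_pte_protnone_is_fault_on_access(pte_t pte)
{
	return pte_val(pte) & PTE_TAG_STORAGE_NONE;
}

vm_fault_t arch_handle_pte_protnone(struct vm_fault *vmf)
{
	if (tag_storage_enabled() &&
	    arch_pte_protnone_is_fault_on_access(vmf->orig_pte))
		return handle_page_missing_tag_storage(vmf);

	/*
	 * Not an arch-specific fault; VM_FAULT_FALLBACK is used here only
	 * as an illustrative "let the generic NUMA hinting code handle it"
	 * convention.
	 */
	return VM_FAULT_FALLBACK;
}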

> 
> I *suspect* that handle_page_missing_tag_storage() stole (sorry :P) some

Indeed, most of the code is taken as-is from do_numa_page().

> code from the prot_none handling path. At least the recovery path and
> the writability handling look like they would be better shared in
> handle_pte_protnone() as well.

Yes, I agree.

Thanks,
Alex

> 
> That might take some magic out of this patch.
> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand Nov. 29, 2023, 12:48 p.m. UTC | #5
On 29.11.23 12:55, Alexandru Elisei wrote:
> Hi,
> 
> On Tue, Nov 28, 2023 at 06:55:18PM +0100, David Hildenbrand wrote:
>> On 19.11.23 17:57, Alexandru Elisei wrote:
>>> To enable tagging on a memory range, userspace can use mprotect() with the
>>> PROT_MTE access flag. Pages already mapped in the VMA don't have the
>>> associated tag storage block reserved, so mark the PTEs as
>>> PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
>>> reserve the tag storage on the fault path.
>>
>> That sounds alot like fake PROT_NONE. Would there be a way to unify hat
> 
> Yes, arm64 basically defines PAGE_FAULT_ON_ACCESS as PAGE_NONE |
> PTE_TAG_STORAGE_NONE.
> 
>> handling and simply reuse pte_protnone()? For example, could we special case
>> on VMA flags?
>>
>> Like, don't do NUMA hinting in these special VMAs. Then, have something
>> like:
>>
>> if (pte_protnone(vmf->orig_pte))
>> 	return handle_pte_protnone(vmf);
>>
>> In there, special case on the VMA flags.
> 
> Your suggestion from the follow-up reply that an arch should know if it needs to
> do something was spot on, arm64 can use the software bit in the translation
> table entry for that.
> 
> So what you are proposing is this:
> 
> * Rename do_numa_page->handle_pte_protnone
> * At some point in the do_numa_page (now renamed to handle_pte_protnone) flow,
>    decide if pte_protnone() has been set for an arch specific reason or because
>    of automatic NUMA balancing.
> * if pte_protnone() has been set by an architecture, then let the architecture
>    handle the fault.
> 
> If I understood you correctly, that's a good idea, and should be easy to
> implement.

yes! :)
Alexandru Elisei Nov. 30, 2023, 12:06 p.m. UTC | #6
Hi,

On Wed, Nov 29, 2023 at 06:27:25PM +0900, Hyesoo Yu wrote:
> On Sun, Nov 19, 2023 at 04:57:13PM +0000, Alexandru Elisei wrote:
> > To enable tagging on a memory range, userspace can use mprotect() with the
> > PROT_MTE access flag. Pages already mapped in the VMA don't have the
> > associated tag storage block reserved, so mark the PTEs as
> > PAGE_FAULT_ON_ACCESS to trigger a fault next time they are accessed, and
> > reserve the tag storage on the fault path.
> > 
> > This has several benefits over reserving the tag storage as part of the
> > mprotect() call handling:
> > 
> > - Tag storage is reserved only for those pages in the VMA that are
> >   accessed, instead of for all the pages already mapped in the VMA.
> > - Reduces the latency of the mprotect() call.
> > - Eliminates races with page migration.
> > 
> > But all of this is at the expense of an extra page fault per page until the
> > pages being accessed all have their corresponding tag storage reserved.
> > 
> > For arm64, the PAGE_FAULT_ON_ACCESS protection is created by defining a new
> > page table entry software bit, PTE_TAG_STORAGE_NONE. Linux doesn't set any
> > of the PBHA bits in entries from the last level of the translation table
> > and it doesn't use the TCR_ELx.HWUxx bits; also, the first PBHA bit, bit
> > 59, is already being used as a software bit for PMD_PRESENT_INVALID.
> > 
> > This is only implemented for PTE mappings; PMD mappings will follow.
> > 
> > Signed-off-by: Alexandru Elisei <alexandru.elisei@arm.com>
> > ---
> >  arch/arm64/Kconfig                       |   1 +
> >  arch/arm64/include/asm/mte.h             |   4 +-
> >  arch/arm64/include/asm/mte_tag_storage.h |   2 +
> >  arch/arm64/include/asm/pgtable-prot.h    |   2 +
> >  arch/arm64/include/asm/pgtable.h         |  40 ++++++---
> >  arch/arm64/kernel/mte.c                  |  12 ++-
> >  arch/arm64/mm/fault.c                    | 101 +++++++++++++++++++++++
> >  include/linux/pgtable.h                  |  17 ++++
> >  mm/Kconfig                               |   3 +
> >  mm/memory.c                              |   3 +
> >  10 files changed, 170 insertions(+), 15 deletions(-)
> > 
> > diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> > index efa5b7958169..3b9c435eaafb 100644
> > --- a/arch/arm64/Kconfig
> > +++ b/arch/arm64/Kconfig
> > @@ -2066,6 +2066,7 @@ if ARM64_MTE
> >  config ARM64_MTE_TAG_STORAGE
> >  	bool "Dynamic MTE tag storage management"
> >  	depends on ARCH_KEEP_MEMBLOCK
> > +	select ARCH_HAS_FAULT_ON_ACCESS
> >  	select CONFIG_CMA
> >  	help
> >  	  Adds support for dynamic management of the memory used by the hardware
> > diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
> > index 6457b7899207..70dc2e409070 100644
> > --- a/arch/arm64/include/asm/mte.h
> > +++ b/arch/arm64/include/asm/mte.h
> > @@ -107,7 +107,7 @@ static inline bool try_page_mte_tagging(struct page *page)
> >  }
> >  
> >  void mte_zero_clear_page_tags(void *addr);
> > -void mte_sync_tags(pte_t pte, unsigned int nr_pages);
> > +void mte_sync_tags(pte_t *pteval, unsigned int nr_pages);
> >  void mte_copy_page_tags(void *kto, const void *kfrom);
> >  void mte_thread_init_user(void);
> >  void mte_thread_switch(struct task_struct *next);
> > @@ -139,7 +139,7 @@ static inline bool try_page_mte_tagging(struct page *page)
> >  static inline void mte_zero_clear_page_tags(void *addr)
> >  {
> >  }
> > -static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages)
> > +static inline void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
> >  {
> >  }
> >  static inline void mte_copy_page_tags(void *kto, const void *kfrom)
> > diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
> > index 6e5d28e607bb..c70ced60a0cd 100644
> > --- a/arch/arm64/include/asm/mte_tag_storage.h
> > +++ b/arch/arm64/include/asm/mte_tag_storage.h
> > @@ -33,6 +33,8 @@ int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
> >  void free_tag_storage(struct page *page, int order);
> >  
> >  bool page_tag_storage_reserved(struct page *page);
> > +
> > +vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
> >  #else
> >  static inline bool tag_storage_enabled(void)
> >  {
> > diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
> > index e9624f6326dd..85ebb3e352ad 100644
> > --- a/arch/arm64/include/asm/pgtable-prot.h
> > +++ b/arch/arm64/include/asm/pgtable-prot.h
> > @@ -19,6 +19,7 @@
> >  #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
> >  #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
> >  #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
> > +#define PTE_TAG_STORAGE_NONE	(_AT(pteval_t, 1) << 60) /* only when PTE_PROT_NONE */
> >  
> >  /*
> >   * This bit indicates that the entry is present i.e. pmd_page()
> > @@ -94,6 +95,7 @@ extern bool arm64_use_ng_mappings;
> >  	 })
> >  
> >  #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
> > +#define PAGE_FAULT_ON_ACCESS	__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
> >  /* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
> >  #define PAGE_SHARED		__pgprot(_PAGE_SHARED)
> >  #define PAGE_SHARED_EXEC	__pgprot(_PAGE_SHARED_EXEC)
> > diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> > index 20e8de853f5d..8cc135f1c112 100644
> > --- a/arch/arm64/include/asm/pgtable.h
> > +++ b/arch/arm64/include/asm/pgtable.h
> > @@ -326,10 +326,10 @@ static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
> >  		     __func__, pte_val(old_pte), pte_val(pte));
> >  }
> >  
> > -static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
> > +static inline void __sync_cache_and_tags(pte_t *pteval, unsigned int nr_pages)
> >  {
> > -	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
> > -		__sync_icache_dcache(pte);
> > +	if (pte_present(*pteval) && pte_user_exec(*pteval) && !pte_special(*pteval))
> > +		__sync_icache_dcache(*pteval);
> >  
> >  	/*
> >  	 * If the PTE would provide user space access to the tags associated
> > @@ -337,9 +337,9 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
> >  	 * pte_access_permitted() returns false for exec only mappings, they
> >  	 * don't expose tags (instruction fetches don't check tags).
> >  	 */
> > -	if (system_supports_mte() && pte_access_permitted(pte, false) &&
> > -	    !pte_special(pte) && pte_tagged(pte))
> > -		mte_sync_tags(pte, nr_pages);
> > +	if (system_supports_mte() && pte_access_permitted(*pteval, false) &&
> > +	    !pte_special(*pteval) && pte_tagged(*pteval))
> > +		mte_sync_tags(pteval, nr_pages);
> >  }
> >  
> >  static inline void set_ptes(struct mm_struct *mm,
> > @@ -347,7 +347,7 @@ static inline void set_ptes(struct mm_struct *mm,
> >  			    pte_t *ptep, pte_t pte, unsigned int nr)
> >  {
> >  	page_table_check_ptes_set(mm, ptep, pte, nr);
> > -	__sync_cache_and_tags(pte, nr);
> > +	__sync_cache_and_tags(&pte, nr);
> >  
> >  	for (;;) {
> >  		__check_safe_pte_update(mm, ptep, pte);
> > @@ -459,6 +459,26 @@ static inline int pmd_protnone(pmd_t pmd)
> >  }
> >  #endif
> >  
> > +#ifdef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> > +static inline bool fault_on_access_pte(pte_t pte)
> > +{
> > +	return (pte_val(pte) & (PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID)) ==
> > +		(PTE_PROT_NONE | PTE_TAG_STORAGE_NONE);
> > +}
> > +
> > +static inline bool fault_on_access_pmd(pmd_t pmd)
> > +{
> > +	return fault_on_access_pte(pmd_pte(pmd));
> > +}
> > +
> > +static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> > +{
> > +	if (tag_storage_enabled())
> > +		return handle_page_missing_tag_storage(vmf);
> > +	return VM_FAULT_SIGBUS;
> > +}
> > +#endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
> > +
> >  #define pmd_present_invalid(pmd)     (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
> >  
> >  static inline int pmd_present(pmd_t pmd)
> > @@ -533,7 +553,7 @@ static inline void __set_pte_at(struct mm_struct *mm,
> >  				unsigned long __always_unused addr,
> >  				pte_t *ptep, pte_t pte, unsigned int nr)
> >  {
> > -	__sync_cache_and_tags(pte, nr);
> > +	__sync_cache_and_tags(&pte, nr);
> >  	__check_safe_pte_update(mm, ptep, pte);
> >  	set_pte(ptep, pte);
> >  }
> > @@ -828,8 +848,8 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> >  	 * in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
> >  	 */
> >  	const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
> > -			      PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_GP |
> > -			      PTE_ATTRINDX_MASK;
> > +			      PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID |
> > +			      PTE_WRITE | PTE_GP | PTE_ATTRINDX_MASK;
> >  	/* preserve the hardware dirty information */
> >  	if (pte_hw_dirty(pte))
> >  		pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
> > diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
> > index a41ef3213e1e..5962bab1d549 100644
> > --- a/arch/arm64/kernel/mte.c
> > +++ b/arch/arm64/kernel/mte.c
> > @@ -21,6 +21,7 @@
> >  #include <asm/barrier.h>
> >  #include <asm/cpufeature.h>
> >  #include <asm/mte.h>
> > +#include <asm/mte_tag_storage.h>
> >  #include <asm/ptrace.h>
> >  #include <asm/sysreg.h>
> >  
> > @@ -35,13 +36,18 @@ DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
> >  EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
> >  #endif
> >  
> > -void mte_sync_tags(pte_t pte, unsigned int nr_pages)
> > +void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
> >  {
> > -	struct page *page = pte_page(pte);
> > +	struct page *page = pte_page(*pteval);
> >  	unsigned int i;
> >  
> > -	/* if PG_mte_tagged is set, tags have already been initialised */
> >  	for (i = 0; i < nr_pages; i++, page++) {
> > +		if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page))) {
> > +			*pteval = pte_modify(*pteval, PAGE_FAULT_ON_ACCESS);
> > +			continue;
> > +		}
> > +
> > +		/* if PG_mte_tagged is set, tags have already been initialised */
> >  		if (try_page_mte_tagging(page)) {
> >  			mte_clear_page_tags(page_address(page));
> >  			set_page_mte_tagged(page);
> > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> > index acbc7530d2b2..f5fa583acf18 100644
> > --- a/arch/arm64/mm/fault.c
> > +++ b/arch/arm64/mm/fault.c
> > @@ -19,6 +19,7 @@
> >  #include <linux/kprobes.h>
> >  #include <linux/uaccess.h>
> >  #include <linux/page-flags.h>
> > +#include <linux/page-isolation.h>
> >  #include <linux/sched/signal.h>
> >  #include <linux/sched/debug.h>
> >  #include <linux/highmem.h>
> > @@ -953,3 +954,103 @@ void tag_clear_highpage(struct page *page)
> >  	mte_zero_clear_page_tags(page_address(page));
> >  	set_page_mte_tagged(page);
> >  }
> > +
> > +#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
> > +vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
> > +{
> > +	struct vm_area_struct *vma = vmf->vma;
> > +	struct page *page = NULL;
> > +	pte_t new_pte, old_pte;
> > +	bool writable = false;
> > +	vm_fault_t err;
> > +	int ret;
> > +
> > +	spin_lock(vmf->ptl);
> > +	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> > +		pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +		return 0;
> > +	}
> > +
> > +	/* Get the normal PTE  */
> > +	old_pte = ptep_get(vmf->pte);
> > +	new_pte = pte_modify(old_pte, vma->vm_page_prot);
> > +
> > +	/*
> > +	 * Detect now whether the PTE could be writable; this information
> > +	 * is only valid while holding the PT lock.
> > +	 */
> > +	writable = pte_write(new_pte);
> > +	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
> > +	    can_change_pte_writable(vma, vmf->address, new_pte))
> > +		writable = true;
> > +
> > +	page = vm_normal_page(vma, vmf->address, new_pte);
> > +	if (!page || is_zone_device_page(page))
> > +		goto out_map;
> > +
> > +	/*
> > +	 * This should never happen, once a VMA has been marked as tagged, that
> > +	 * cannot be changed.
> > +	 */
> > +	if (!(vma->vm_flags & VM_MTE))
> > +		goto out_map;
> > +
> > +	/* Prevent the page from being unmapped from under us. */
> > +	get_page(page);
> > +	vma_set_access_pid_bit(vma);
> > +
> > +	/*
> > +	 * Pairs with pte_offset_map_nolock(), which takes the RCU read lock,
> > +	 * and spin_lock() above which takes the ptl lock. Both locks should be
> > +	 * balanced after this point.
> > +	 */
> > +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +
> > +	/*
> > +	 * Probably the page is being isolated for migration, replay the fault
> > +	 * to give time for the entry to be replaced by a migration pte.
> > +	 */
> > +	if (unlikely(is_migrate_isolate_page(page)))
> > +		goto out_retry;
> > +
> > +	ret = reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE);
> > +	if (ret)
> > +		goto out_retry;
> > +
> > +	put_page(page);
> > +
> > +	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
> > +	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
> > +		pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +		return 0;
> > +	}
> > +
> > +out_map:
> > +	/*
> > +	 * Make it present again, depending on how arch implements
> > +	 * non-accessible ptes, some can allow access by kernel mode.
> > +	 */
> > +	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
> > +	new_pte = pte_modify(old_pte, vma->vm_page_prot);
> > +	new_pte = pte_mkyoung(new_pte);
> > +	if (writable)
> > +		new_pte = pte_mkwrite(new_pte, vma);
> > +	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, new_pte);
> > +	update_mmu_cache(vma, vmf->address, vmf->pte);
> > +	pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +
> > +	return 0;
> > +
> > +out_retry:
> > +	put_page(page);
> > +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > +		vma_end_read(vma);
> > +	if (fault_flag_allow_retry_first(vmf->flags)) {
> > +		err = VM_FAULT_RETRY;
> > +	} else {
> > +		/* Replay the fault. */
> > +		err = 0;
> 
> Hello!
> 
> Unfortunately, if the page continues to be pinned, it seems like the fault will continue to occur.
> I guess that could become a system stability issue (but I'm not familiar with this area, so please let me know if I'm mistaken!).
> 
> How about migrating the page when the problem repeats?

Yes, I had the same thought; in the previous iteration of the series, the
page was migrated out of the VMA if tag storage couldn't be reserved.

Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
pin will be released before the fault is replayed. Because of this, and
because it makes the code simpler, I chose not to migrate the page if tag
storage couldn't be reserved.

I'd be happy to revisit this if it turns out that in the real world
replaying the fault happens often enough that migrating the page is faster.

In fact, statistics about how often the fault is replayed and how long that
takes would be very helpful.
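
(Not part of the series: one bare-bones way to collect such a count would be
a debugfs counter bumped on the replay path; the file name below is made up.)

#include <linux/atomic.h>
#include <linux/debugfs.h>
#include <linux/init.h>

static atomic_t tag_storage_fault_replays = ATOMIC_INIT(0);

static int __init tag_storage_stats_init(void)
{
	debugfs_create_atomic_t("mte_tag_storage_fault_replays", 0400,
				NULL, &tag_storage_fault_replays);
	return 0;
}
late_initcall(tag_storage_stats_init);

/* and in handle_page_missing_tag_storage(), on the out_retry path: */
/*	atomic_inc(&tag_storage_fault_replays); */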

Thanks,
Alex

> 
> Thanks,
> Regards.
> 
> > +	}
> > +	return err;
> > +}
> > +#endif
> > diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> > index ffdb9b6bed6c..e2c761dd6c41 100644
> > --- a/include/linux/pgtable.h
> > +++ b/include/linux/pgtable.h
> > @@ -1458,6 +1458,23 @@ static inline int pmd_protnone(pmd_t pmd)
> >  }
> >  #endif /* CONFIG_NUMA_BALANCING */
> >  
> > +#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
> > +static inline bool fault_on_access_pte(pte_t pte)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline bool fault_on_access_pmd(pmd_t pmd)
> > +{
> > +	return false;
> > +}
> > +
> > +static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
> > +{
> > +	return VM_FAULT_SIGBUS;
> > +}
> > +#endif
> > +
> >  #endif /* CONFIG_MMU */
> >  
> >  #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 89971a894b60..a90eefc3ee80 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -1019,6 +1019,9 @@ config IDLE_PAGE_TRACKING
> >  config ARCH_HAS_CACHE_LINE_SIZE
> >  	bool
> >  
> > +config ARCH_HAS_FAULT_ON_ACCESS
> > +	bool
> > +
> >  config ARCH_HAS_CURRENT_STACK_POINTER
> >  	bool
> >  	help
> > diff --git a/mm/memory.c b/mm/memory.c
> > index e137f7673749..a04a971200b9 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5044,6 +5044,9 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> >  	if (!pte_present(vmf->orig_pte))
> >  		return do_swap_page(vmf);
> >  
> > +	if (fault_on_access_pte(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> > +		return arch_do_page_fault_on_access(vmf);
> > +
> >  	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> >  		return do_numa_page(vmf);
> >  
> > -- 
> > 2.42.1
> > 
> >
David Hildenbrand Nov. 30, 2023, 12:49 p.m. UTC | #7
>>> +
>>> +out_retry:
>>> +	put_page(page);
>>> +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
>>> +		vma_end_read(vma);
>>> +	if (fault_flag_allow_retry_first(vmf->flags)) {
>>> +		err = VM_FAULT_RETRY;
>>> +	} else {
>>> +		/* Replay the fault. */
>>> +		err = 0;
>>
>> Hello!
>>
>> Unfortunately, if the page continues to be pinned, it seems like fault will continue to occur.
>> I guess it makes system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!)
>>
>> How about migrating the page when migration problem repeats.
> 
> Yes, I had the same thought; in the previous iteration of the series, the
> page was migrated out of the VMA if tag storage couldn't be reserved.
> 
> Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
> pin will be released before the fault is replayed. Because of this, and
> because it makes the code simpler, I chose not to migrate the page if tag
> storage couldn't be reserved.

There are still some cases that are theoretically problematic: 
vmsplice() can pin pages forever and doesn't use FOLL_LONGTERM yet.

All these things also affect other users that rely on movability (e.g., 
CMA, memory hotunplug).
Alexandru Elisei Nov. 30, 2023, 1:32 p.m. UTC | #8
Hi,

On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
> > > > +
> > > > +out_retry:
> > > > +	put_page(page);
> > > > +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > > > +		vma_end_read(vma);
> > > > +	if (fault_flag_allow_retry_first(vmf->flags)) {
> > > > +		err = VM_FAULT_RETRY;
> > > > +	} else {
> > > > +		/* Replay the fault. */
> > > > +		err = 0;
> > > 
> > > Hello!
> > > 
> > > Unfortunately, if the page continues to be pinned, it seems like fault will continue to occur.
> > > I guess it makes system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!)
> > > 
> > > How about migrating the page when migration problem repeats.
> > 
> > Yes, I had the same though in the previous iteration of the series, the
> > page was migrated out of the VMA if tag storage couldn't be reserved.
> > 
> > Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
> > pin will be released before the fault is replayed. Because of this, and
> > because it makes the code simpler, I chose not to migrate the page if tag
> > storage couldn't be reserved.
> 
> There are still some cases that are theoretically problematic: vmsplice()
> can pin pages forever and doesn't use FOLL_LONGTERM yet.
> 
> All these things also affect other users that rely on movability (e.g., CMA,
> memory hotunplug).

I wasn't aware of that, thank you for the information. Then to ensure that the
process doesn't hang by replaying the fault indefinitely, I'll migrate the page if
tag storage cannot be reserved. Looking over the code again, I think I can reuse
the same function that migrates tag storage pages out of the MTE VMA (added in
patch #21), so no major changes needed.
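
As a rough sketch of that fallback (the migration helper name below is a
placeholder for the patch #21 function, and error handling is elided):

static vm_fault_t reserve_or_migrate(struct vm_fault *vmf, struct page *page)
{
	/* Tag storage reserved: the fault can be mapped in as before. */
	if (!reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE))
		return 0;

	/*
	 * Reservation failed (for example, the tag storage is pinned or
	 * isolated): migrate the faulting page out of the tagged VMA
	 * instead of replaying the fault indefinitely.
	 */
	migrate_page_out_of_mte_vma(vmf->vma, vmf->address, page);
	return VM_FAULT_RETRY;
}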

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 
>
David Hildenbrand Nov. 30, 2023, 1:43 p.m. UTC | #9
On 30.11.23 14:32, Alexandru Elisei wrote:
> Hi,
> 
> On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
>>>>> +
>>>>> +out_retry:
>>>>> +	put_page(page);
>>>>> +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
>>>>> +		vma_end_read(vma);
>>>>> +	if (fault_flag_allow_retry_first(vmf->flags)) {
>>>>> +		err = VM_FAULT_RETRY;
>>>>> +	} else {
>>>>> +		/* Replay the fault. */
>>>>> +		err = 0;
>>>>
>>>> Hello!
>>>>
>>>> Unfortunately, if the page continues to be pinned, it seems like fault will continue to occur.
>>>> I guess it makes system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!)
>>>>
>>>> How about migrating the page when migration problem repeats.
>>>
>>> Yes, I had the same though in the previous iteration of the series, the
>>> page was migrated out of the VMA if tag storage couldn't be reserved.
>>>
>>> Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
>>> pin will be released before the fault is replayed. Because of this, and
>>> because it makes the code simpler, I chose not to migrate the page if tag
>>> storage couldn't be reserved.
>>
>> There are still some cases that are theoretically problematic: vmsplice()
>> can pin pages forever and doesn't use FOLL_LONGTERM yet.
>>
>> All these things also affect other users that rely on movability (e.g., CMA,
>> memory hotunplug).
> 
> I wasn't aware of that, thank you for the information. Then to ensure that the
> process doesn't hang by replaying the fault indefinitely, I'll migrate the page if
> tag storage cannot be reserved. Looking over the code again, I think I can reuse
> the same function that migrates tag storage pages out of the MTE VMA (added in
> patch #21), so no major changes needed.

It's going to be interesting if migrating that page fails because it is 
pinned :/
Alexandru Elisei Nov. 30, 2023, 2:33 p.m. UTC | #10
On Thu, Nov 30, 2023 at 02:43:48PM +0100, David Hildenbrand wrote:
> On 30.11.23 14:32, Alexandru Elisei wrote:
> > Hi,
> > 
> > On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
> > > > > > +
> > > > > > +out_retry:
> > > > > > +	put_page(page);
> > > > > > +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
> > > > > > +		vma_end_read(vma);
> > > > > > +	if (fault_flag_allow_retry_first(vmf->flags)) {
> > > > > > +		err = VM_FAULT_RETRY;
> > > > > > +	} else {
> > > > > > +		/* Replay the fault. */
> > > > > > +		err = 0;
> > > > > 
> > > > > Hello!
> > > > > 
> > > > > Unfortunately, if the page continues to be pinned, it seems like fault will continue to occur.
> > > > > I guess it makes system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!)
> > > > > 
> > > > > How about migrating the page when migration problem repeats.
> > > > 
> > > > Yes, I had the same though in the previous iteration of the series, the
> > > > page was migrated out of the VMA if tag storage couldn't be reserved.
> > > > 
> > > > Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
> > > > pin will be released before the fault is replayed. Because of this, and
> > > > because it makes the code simpler, I chose not to migrate the page if tag
> > > > storage couldn't be reserved.
> > > 
> > > There are still some cases that are theoretically problematic: vmsplice()
> > > can pin pages forever and doesn't use FOLL_LONGTERM yet.
> > > 
> > > All these things also affect other users that rely on movability (e.g., CMA,
> > > memory hotunplug).
> > 
> > I wasn't aware of that, thank you for the information. Then to ensure that the
> > process doesn't hang by replying the loop indefinitely, I'll migrate the page if
> > tag storage cannot be reserved. Looking over the code again, I think I can reuse
> > the same function that migrates tag storage pages out of the MTE VMA (added in
> > patch #21), so no major changes needed.
> 
> It's going to be interesting if migrating that page fails because it is
> pinned :/

I imagine that having both the page **and** its tag storage pinned longterm
without FOLL_LONGTERM is going to be exceedingly rare.

Am I mistaken in believing that the problematic vmsplice() behaviour is
recognized as something that needs to be fixed?

Thanks,
Alex

> 
> -- 
> Cheers,
> 
> David / dhildenb
>
David Hildenbrand Nov. 30, 2023, 2:39 p.m. UTC | #11
On 30.11.23 15:33, Alexandru Elisei wrote:
> On Thu, Nov 30, 2023 at 02:43:48PM +0100, David Hildenbrand wrote:
>> On 30.11.23 14:32, Alexandru Elisei wrote:
>>> Hi,
>>>
>>> On Thu, Nov 30, 2023 at 01:49:34PM +0100, David Hildenbrand wrote:
>>>>>>> +
>>>>>>> +out_retry:
>>>>>>> +	put_page(page);
>>>>>>> +	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
>>>>>>> +		vma_end_read(vma);
>>>>>>> +	if (fault_flag_allow_retry_first(vmf->flags)) {
>>>>>>> +		err = VM_FAULT_RETRY;
>>>>>>> +	} else {
>>>>>>> +		/* Replay the fault. */
>>>>>>> +		err = 0;
>>>>>>
>>>>>> Hello!
>>>>>>
>>>>>> Unfortunately, if the page continues to be pinned, it seems like fault will continue to occur.
>>>>>> I guess it makes system stability issue. (but I'm not familiar with that, so please let me know if I'm mistaken!)
>>>>>>
>>>>>> How about migrating the page when migration problem repeats.
>>>>>
>>>>> Yes, I had the same though in the previous iteration of the series, the
>>>>> page was migrated out of the VMA if tag storage couldn't be reserved.
>>>>>
>>>>> Only short term pins are allowed on MIGRATE_CMA pages, so I expect that the
>>>>> pin will be released before the fault is replayed. Because of this, and
>>>>> because it makes the code simpler, I chose not to migrate the page if tag
>>>>> storage couldn't be reserved.
>>>>
>>>> There are still some cases that are theoretically problematic: vmsplice()
>>>> can pin pages forever and doesn't use FOLL_LONGTERM yet.
>>>>
>>>> All these things also affect other users that rely on movability (e.g., CMA,
>>>> memory hotunplug).
>>>
>>> I wasn't aware of that, thank you for the information. Then to ensure that the
>>> process doesn't hang by replying the loop indefinitely, I'll migrate the page if
>>> tag storage cannot be reserved. Looking over the code again, I think I can reuse
>>> the same function that migrates tag storage pages out of the MTE VMA (added in
>>> patch #21), so no major changes needed.
>>
>> It's going to be interesting if migrating that page fails because it is
>> pinned :/
> 
> I imagine that having both the page **and** its tag storage pinned longterm
> without FOLL_LONGTERM is going to be exceedingly rare.

Yes. I recall that the rule of thumb is that some O_DIRECT I/O can take 
up to 10 seconds, although extremely rare (and maybe not applicable on 
arm64).

> 
> Am I mistaken in believing that the problematic vmsplice() behaviour is
> recognized as something that needs to be fixed?

Yes, I've been hoping for a couple of years that this will actually get
fixed, now that O_DIRECT mostly uses FOLL_PIN instead of FOLL_GET.

Patch

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index efa5b7958169..3b9c435eaafb 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -2066,6 +2066,7 @@  if ARM64_MTE
 config ARM64_MTE_TAG_STORAGE
 	bool "Dynamic MTE tag storage management"
 	depends on ARCH_KEEP_MEMBLOCK
+	select ARCH_HAS_FAULT_ON_ACCESS
 	select CONFIG_CMA
 	help
 	  Adds support for dynamic management of the memory used by the hardware
diff --git a/arch/arm64/include/asm/mte.h b/arch/arm64/include/asm/mte.h
index 6457b7899207..70dc2e409070 100644
--- a/arch/arm64/include/asm/mte.h
+++ b/arch/arm64/include/asm/mte.h
@@ -107,7 +107,7 @@  static inline bool try_page_mte_tagging(struct page *page)
 }
 
 void mte_zero_clear_page_tags(void *addr);
-void mte_sync_tags(pte_t pte, unsigned int nr_pages);
+void mte_sync_tags(pte_t *pteval, unsigned int nr_pages);
 void mte_copy_page_tags(void *kto, const void *kfrom);
 void mte_thread_init_user(void);
 void mte_thread_switch(struct task_struct *next);
@@ -139,7 +139,7 @@  static inline bool try_page_mte_tagging(struct page *page)
 static inline void mte_zero_clear_page_tags(void *addr)
 {
 }
-static inline void mte_sync_tags(pte_t pte, unsigned int nr_pages)
+static inline void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
 {
 }
 static inline void mte_copy_page_tags(void *kto, const void *kfrom)
diff --git a/arch/arm64/include/asm/mte_tag_storage.h b/arch/arm64/include/asm/mte_tag_storage.h
index 6e5d28e607bb..c70ced60a0cd 100644
--- a/arch/arm64/include/asm/mte_tag_storage.h
+++ b/arch/arm64/include/asm/mte_tag_storage.h
@@ -33,6 +33,8 @@  int reserve_tag_storage(struct page *page, int order, gfp_t gfp);
 void free_tag_storage(struct page *page, int order);
 
 bool page_tag_storage_reserved(struct page *page);
+
+vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf);
 #else
 static inline bool tag_storage_enabled(void)
 {
diff --git a/arch/arm64/include/asm/pgtable-prot.h b/arch/arm64/include/asm/pgtable-prot.h
index e9624f6326dd..85ebb3e352ad 100644
--- a/arch/arm64/include/asm/pgtable-prot.h
+++ b/arch/arm64/include/asm/pgtable-prot.h
@@ -19,6 +19,7 @@ 
 #define PTE_SPECIAL		(_AT(pteval_t, 1) << 56)
 #define PTE_DEVMAP		(_AT(pteval_t, 1) << 57)
 #define PTE_PROT_NONE		(_AT(pteval_t, 1) << 58) /* only when !PTE_VALID */
+#define PTE_TAG_STORAGE_NONE	(_AT(pteval_t, 1) << 60) /* only when PTE_PROT_NONE */
 
 /*
  * This bit indicates that the entry is present i.e. pmd_page()
@@ -94,6 +95,7 @@  extern bool arm64_use_ng_mappings;
 	 })
 
 #define PAGE_NONE		__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
+#define PAGE_FAULT_ON_ACCESS	__pgprot(((_PAGE_DEFAULT) & ~PTE_VALID) | PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_RDONLY | PTE_NG | PTE_PXN | PTE_UXN)
 /* shared+writable pages are clean by default, hence PTE_RDONLY|PTE_WRITE */
 #define PAGE_SHARED		__pgprot(_PAGE_SHARED)
 #define PAGE_SHARED_EXEC	__pgprot(_PAGE_SHARED_EXEC)
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 20e8de853f5d..8cc135f1c112 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -326,10 +326,10 @@  static inline void __check_safe_pte_update(struct mm_struct *mm, pte_t *ptep,
 		     __func__, pte_val(old_pte), pte_val(pte));
 }
 
-static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
+static inline void __sync_cache_and_tags(pte_t *pteval, unsigned int nr_pages)
 {
-	if (pte_present(pte) && pte_user_exec(pte) && !pte_special(pte))
-		__sync_icache_dcache(pte);
+	if (pte_present(*pteval) && pte_user_exec(*pteval) && !pte_special(*pteval))
+		__sync_icache_dcache(*pteval);
 
 	/*
 	 * If the PTE would provide user space access to the tags associated
@@ -337,9 +337,9 @@  static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
 	 * pte_access_permitted() returns false for exec only mappings, they
 	 * don't expose tags (instruction fetches don't check tags).
 	 */
-	if (system_supports_mte() && pte_access_permitted(pte, false) &&
-	    !pte_special(pte) && pte_tagged(pte))
-		mte_sync_tags(pte, nr_pages);
+	if (system_supports_mte() && pte_access_permitted(*pteval, false) &&
+	    !pte_special(*pteval) && pte_tagged(*pteval))
+		mte_sync_tags(pteval, nr_pages);
 }
 
 static inline void set_ptes(struct mm_struct *mm,
@@ -347,7 +347,7 @@  static inline void set_ptes(struct mm_struct *mm,
 			    pte_t *ptep, pte_t pte, unsigned int nr)
 {
 	page_table_check_ptes_set(mm, ptep, pte, nr);
-	__sync_cache_and_tags(pte, nr);
+	__sync_cache_and_tags(&pte, nr);
 
 	for (;;) {
 		__check_safe_pte_update(mm, ptep, pte);
@@ -459,6 +459,26 @@  static inline int pmd_protnone(pmd_t pmd)
 }
 #endif
 
+#ifdef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
+static inline bool fault_on_access_pte(pte_t pte)
+{
+	return (pte_val(pte) & (PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID)) ==
+		(PTE_PROT_NONE | PTE_TAG_STORAGE_NONE);
+}
+
+static inline bool fault_on_access_pmd(pmd_t pmd)
+{
+	return fault_on_access_pte(pmd_pte(pmd));
+}
+
+static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
+{
+	if (tag_storage_enabled())
+		return handle_page_missing_tag_storage(vmf);
+	return VM_FAULT_SIGBUS;
+}
+#endif /* CONFIG_ARCH_HAS_FAULT_ON_ACCESS */
+
 #define pmd_present_invalid(pmd)     (!!(pmd_val(pmd) & PMD_PRESENT_INVALID))
 
 static inline int pmd_present(pmd_t pmd)
@@ -533,7 +553,7 @@  static inline void __set_pte_at(struct mm_struct *mm,
 				unsigned long __always_unused addr,
 				pte_t *ptep, pte_t pte, unsigned int nr)
 {
-	__sync_cache_and_tags(pte, nr);
+	__sync_cache_and_tags(&pte, nr);
 	__check_safe_pte_update(mm, ptep, pte);
 	set_pte(ptep, pte);
 }
@@ -828,8 +848,8 @@  static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
 	 * in MAIR_EL1. The mask below has to include PTE_ATTRINDX_MASK.
 	 */
 	const pteval_t mask = PTE_USER | PTE_PXN | PTE_UXN | PTE_RDONLY |
-			      PTE_PROT_NONE | PTE_VALID | PTE_WRITE | PTE_GP |
-			      PTE_ATTRINDX_MASK;
+			      PTE_PROT_NONE | PTE_TAG_STORAGE_NONE | PTE_VALID |
+			      PTE_WRITE | PTE_GP | PTE_ATTRINDX_MASK;
 	/* preserve the hardware dirty information */
 	if (pte_hw_dirty(pte))
 		pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));
diff --git a/arch/arm64/kernel/mte.c b/arch/arm64/kernel/mte.c
index a41ef3213e1e..5962bab1d549 100644
--- a/arch/arm64/kernel/mte.c
+++ b/arch/arm64/kernel/mte.c
@@ -21,6 +21,7 @@ 
 #include <asm/barrier.h>
 #include <asm/cpufeature.h>
 #include <asm/mte.h>
+#include <asm/mte_tag_storage.h>
 #include <asm/ptrace.h>
 #include <asm/sysreg.h>
 
@@ -35,13 +36,18 @@  DEFINE_STATIC_KEY_FALSE(mte_async_or_asymm_mode);
 EXPORT_SYMBOL_GPL(mte_async_or_asymm_mode);
 #endif
 
-void mte_sync_tags(pte_t pte, unsigned int nr_pages)
+void mte_sync_tags(pte_t *pteval, unsigned int nr_pages)
 {
-	struct page *page = pte_page(pte);
+	struct page *page = pte_page(*pteval);
 	unsigned int i;
 
-	/* if PG_mte_tagged is set, tags have already been initialised */
 	for (i = 0; i < nr_pages; i++, page++) {
+		if (tag_storage_enabled() && unlikely(!page_tag_storage_reserved(page))) {
+			*pteval = pte_modify(*pteval, PAGE_FAULT_ON_ACCESS);
+			continue;
+		}
+
+		/* if PG_mte_tagged is set, tags have already been initialised */
 		if (try_page_mte_tagging(page)) {
 			mte_clear_page_tags(page_address(page));
 			set_page_mte_tagged(page);
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index acbc7530d2b2..f5fa583acf18 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -19,6 +19,7 @@ 
 #include <linux/kprobes.h>
 #include <linux/uaccess.h>
 #include <linux/page-flags.h>
+#include <linux/page-isolation.h>
 #include <linux/sched/signal.h>
 #include <linux/sched/debug.h>
 #include <linux/highmem.h>
@@ -953,3 +954,103 @@  void tag_clear_highpage(struct page *page)
 	mte_zero_clear_page_tags(page_address(page));
 	set_page_mte_tagged(page);
 }
+
+#ifdef CONFIG_ARM64_MTE_TAG_STORAGE
+vm_fault_t handle_page_missing_tag_storage(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct page *page = NULL;
+	pte_t new_pte, old_pte;
+	bool writable = false;
+	vm_fault_t err;
+	int ret;
+
+	spin_lock(vmf->ptl);
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return 0;
+	}
+
+	/* Get the normal PTE  */
+	old_pte = ptep_get(vmf->pte);
+	new_pte = pte_modify(old_pte, vma->vm_page_prot);
+
+	/*
+	 * Detect now whether the PTE could be writable; this information
+	 * is only valid while holding the PT lock.
+	 */
+	writable = pte_write(new_pte);
+	if (!writable && vma_wants_manual_pte_write_upgrade(vma) &&
+	    can_change_pte_writable(vma, vmf->address, new_pte))
+		writable = true;
+
+	page = vm_normal_page(vma, vmf->address, new_pte);
+	if (!page || is_zone_device_page(page))
+		goto out_map;
+
+	/*
+	 * This should never happen, once a VMA has been marked as tagged, that
+	 * cannot be changed.
+	 */
+	if (!(vma->vm_flags & VM_MTE))
+		goto out_map;
+
+	/* Prevent the page from being unmapped from under us. */
+	get_page(page);
+	vma_set_access_pid_bit(vma);
+
+	/*
+	 * Pairs with pte_offset_map_nolock(), which takes the RCU read lock,
+	 * and spin_lock() above which takes the ptl lock. Both locks should be
+	 * balanced after this point.
+	 */
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	/*
+	 * Probably the page is being isolated for migration, replay the fault
+	 * to give time for the entry to be replaced by a migration pte.
+	 */
+	if (unlikely(is_migrate_isolate_page(page)))
+		goto out_retry;
+
+	ret = reserve_tag_storage(page, 0, GFP_HIGHUSER_MOVABLE);
+	if (ret)
+		goto out_retry;
+
+	put_page(page);
+
+	vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address, &vmf->ptl);
+	if (unlikely(!pte_same(*vmf->pte, vmf->orig_pte))) {
+		pte_unmap_unlock(vmf->pte, vmf->ptl);
+		return 0;
+	}
+
+out_map:
+	/*
+	 * Make it present again, depending on how arch implements
+	 * non-accessible ptes, some can allow access by kernel mode.
+	 */
+	old_pte = ptep_modify_prot_start(vma, vmf->address, vmf->pte);
+	new_pte = pte_modify(old_pte, vma->vm_page_prot);
+	new_pte = pte_mkyoung(new_pte);
+	if (writable)
+		new_pte = pte_mkwrite(new_pte, vma);
+	ptep_modify_prot_commit(vma, vmf->address, vmf->pte, old_pte, new_pte);
+	update_mmu_cache(vma, vmf->address, vmf->pte);
+	pte_unmap_unlock(vmf->pte, vmf->ptl);
+
+	return 0;
+
+out_retry:
+	put_page(page);
+	if (vmf->flags & FAULT_FLAG_VMA_LOCK)
+		vma_end_read(vma);
+	if (fault_flag_allow_retry_first(vmf->flags)) {
+		err = VM_FAULT_RETRY;
+	} else {
+		/* Replay the fault. */
+		err = 0;
+	}
+	return err;
+}
+#endif
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index ffdb9b6bed6c..e2c761dd6c41 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1458,6 +1458,23 @@  static inline int pmd_protnone(pmd_t pmd)
 }
 #endif /* CONFIG_NUMA_BALANCING */
 
+#ifndef CONFIG_ARCH_HAS_FAULT_ON_ACCESS
+static inline bool fault_on_access_pte(pte_t pte)
+{
+	return false;
+}
+
+static inline bool fault_on_access_pmd(pmd_t pmd)
+{
+	return false;
+}
+
+static inline vm_fault_t arch_do_page_fault_on_access(struct vm_fault *vmf)
+{
+	return VM_FAULT_SIGBUS;
+}
+#endif
+
 #endif /* CONFIG_MMU */
 
 #ifdef CONFIG_HAVE_ARCH_HUGE_VMAP
diff --git a/mm/Kconfig b/mm/Kconfig
index 89971a894b60..a90eefc3ee80 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1019,6 +1019,9 @@  config IDLE_PAGE_TRACKING
 config ARCH_HAS_CACHE_LINE_SIZE
 	bool
 
+config ARCH_HAS_FAULT_ON_ACCESS
+	bool
+
 config ARCH_HAS_CURRENT_STACK_POINTER
 	bool
 	help
diff --git a/mm/memory.c b/mm/memory.c
index e137f7673749..a04a971200b9 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5044,6 +5044,9 @@  static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
 	if (!pte_present(vmf->orig_pte))
 		return do_swap_page(vmf);
 
+	if (fault_on_access_pte(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+		return arch_do_page_fault_on_access(vmf);
+
 	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
 		return do_numa_page(vmf);