[v5,19/25] arm64/mm: Wire up PTE_CONT for user mappings

Message ID	20240202080756.1453939-20-ryan.roberts@arm.com (mailing list archive)
State	New
Headers	show Return-Path: <owner-linux-mm@kvack.org> From: Ryan Roberts <ryan.roberts@arm.com> To: Catalin Marinas <catalin.marinas@arm.com>, Will Deacon <will@kernel.org>, Ard Biesheuvel <ardb@kernel.org>, Marc Zyngier <maz@kernel.org>, James Morse <james.morse@arm.com>, Andrey Ryabinin <ryabinin.a.a@gmail.com>, Andrew Morton <akpm@linux-foundation.org>, Matthew Wilcox <willy@infradead.org>, Mark Rutland <mark.rutland@arm.com>, David Hildenbrand <david@redhat.com>, Kefeng Wang <wangkefeng.wang@huawei.com>, John Hubbard <jhubbard@nvidia.com>, Zi Yan <ziy@nvidia.com>, Barry Song <21cnbao@gmail.com>, Alistair Popple <apopple@nvidia.com>, Yang Shi <shy828301@gmail.com>, Nicholas Piggin <npiggin@gmail.com>, Christophe Leroy <christophe.leroy@csgroup.eu>, "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>, "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>, Thomas Gleixner <tglx@linutronix.de>, Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>, Dave Hansen <dave.hansen@linux.intel.com>, "H. Peter Anvin" <hpa@zytor.com> Cc: Ryan Roberts <ryan.roberts@arm.com>, linux-arm-kernel@lists.infradead.org, x86@kernel.org, linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings Date: Fri, 2 Feb 2024 08:07:50 +0000 Message-Id: <20240202080756.1453939-20-ryan.roberts@arm.com> In-Reply-To: <20240202080756.1453939-1-ryan.roberts@arm.com> References: <20240202080756.1453939-1-ryan.roberts@arm.com> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit Sender: owner-linux-mm@kvack.org Precedence: bulk
Series	Transparent Contiguous PTEs for User Mappings \| expand [v5,00/25] Transparent Contiguous PTEs for User Mappings [v5,01/25] mm: Clarify the spec for set_ptes() [v5,02/25] mm: thp: Batch-collapse PMD with set_ptes() [v5,03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn() [v5,04/25] arm/mm: Convert pte_next_pfn() to pte_advance_pfn() [v5,05/25] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn() [v5,06/25] powerpc/mm: Convert pte_next_pfn() to pte_advance_pfn() [v5,07/25] x86/mm: Convert pte_next_pfn() to pte_advance_pfn() [v5,08/25] mm: Remove pte_next_pfn() and replace with pte_advance_pfn() [v5,09/25] arm64/mm: set_pte(): New layer to manage contig bit [v5,10/25] arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit [v5,11/25] arm64/mm: pte_clear(): New layer to manage contig bit [v5,12/25] arm64/mm: ptep_get_and_clear(): New layer to manage contig bit [v5,13/25] arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit [v5,14/25] arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit [v5,15/25] arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit [v5,16/25] arm64/mm: ptep_set_access_flags(): New layer to manage contig bit [v5,17/25] arm64/mm: ptep_get(): New layer to manage contig bit [v5,18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB [v5,19/25] arm64/mm: Wire up PTE_CONT for user mappings [v5,20/25] arm64/mm: Implement new wrprotect_ptes() batch API [v5,21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs [v5,22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch() [v5,23/25] arm64/mm: Implement pte_batch_hint() [v5,24/25] arm64/mm: __always_inline to improve fork() perf [v5,25/25] arm64/mm: Automatically fold contpte mappings

Ryan Roberts Feb. 2, 2024, 8:07 a.m. UTC

With the ptep API sufficiently refactored, we can now introduce a new
"contpte" API layer, which transparently manages the PTE_CONT bit for
user mappings.

In this initial implementation, only suitable batches of PTEs, set via
set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
modification of individual PTEs will cause an "unfold" operation to
repaint the contpte block as individual PTEs before performing the
requested operation. While, a modification of a single PTE could cause
the block of PTEs to which it belongs to become eligible for "folding"
into a contpte entry, "folding" is not performed in this initial
implementation due to the costs of checking the requirements are met.
Due to this, contpte mappings will degrade back to normal pte mappings
over time if/when protections are changed. This will be solved in a
future patch.

Since a contpte block only has a single access and dirty bit, the
semantic here changes slightly; when getting a pte (e.g. ptep_get())
that is part of a contpte mapping, the access and dirty information are
pulled from the block (so all ptes in the block return the same
access/dirty info). When changing the access/dirty info on a pte (e.g.
ptep_set_access_flags()) that is part of a contpte mapping, this change
will affect the whole contpte block. This is works fine in practice
since we guarantee that only a single folio is mapped by a contpte
block, and the core-mm tracks access/dirty information per folio.

In order for the public functions, which used to be pure inline, to
continue to be callable by modules, export all the contpte_* symbols
that are now called by those public inline functions.

The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
at build time. It defaults to enabled as long as its dependency,
TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
enabled, then there is no chance of meeting the physical contiguity
requirement for contpte mappings.

Tested-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/Kconfig               |   9 +
 arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
 arch/arm64/mm/Makefile           |   1 +
 arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
 4 files changed, 454 insertions(+)
 create mode 100644 arch/arm64/mm/contpte.c

Mark Rutland Feb. 12, 2024, noon UTC | #1

Hi Ryan,

Overall this looks pretty good; I have a bunch of minor comments below, and a
bigger question on the way ptep_get_lockless() works.

On Fri, Feb 02, 2024 at 08:07:50AM +0000, Ryan Roberts wrote:
> With the ptep API sufficiently refactored, we can now introduce a new
> "contpte" API layer, which transparently manages the PTE_CONT bit for
> user mappings.
> 
> In this initial implementation, only suitable batches of PTEs, set via
> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
> modification of individual PTEs will cause an "unfold" operation to
> repaint the contpte block as individual PTEs before performing the
> requested operation. While, a modification of a single PTE could cause
> the block of PTEs to which it belongs to become eligible for "folding"
> into a contpte entry, "folding" is not performed in this initial
> implementation due to the costs of checking the requirements are met.
> Due to this, contpte mappings will degrade back to normal pte mappings
> over time if/when protections are changed. This will be solved in a
> future patch.
> 
> Since a contpte block only has a single access and dirty bit, the
> semantic here changes slightly; when getting a pte (e.g. ptep_get())
> that is part of a contpte mapping, the access and dirty information are
> pulled from the block (so all ptes in the block return the same
> access/dirty info). When changing the access/dirty info on a pte (e.g.
> ptep_set_access_flags()) that is part of a contpte mapping, this change
> will affect the whole contpte block. This is works fine in practice
> since we guarantee that only a single folio is mapped by a contpte
> block, and the core-mm tracks access/dirty information per folio.
> 
> In order for the public functions, which used to be pure inline, to
> continue to be callable by modules, export all the contpte_* symbols
> that are now called by those public inline functions.
> 
> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
> at build time. It defaults to enabled as long as its dependency,
> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
> enabled, then there is no chance of meeting the physical contiguity
> requirement for contpte mappings.
> 
> Tested-by: John Hubbard <jhubbard@nvidia.com>
> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
> ---
>  arch/arm64/Kconfig               |   9 +
>  arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
>  arch/arm64/mm/Makefile           |   1 +
>  arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
>  4 files changed, 454 insertions(+)
>  create mode 100644 arch/arm64/mm/contpte.c
> 
> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
> index d86d7f4758b5..1442e8ed95b6 100644
> --- a/arch/arm64/Kconfig
> +++ b/arch/arm64/Kconfig
> @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>  	select UNWIND_TABLES
>  	select DYNAMIC_SCS
>  
> +config ARM64_CONTPTE
> +	bool "Contiguous PTE mappings for user memory" if EXPERT
> +	depends on TRANSPARENT_HUGEPAGE
> +	default y
> +	help
> +	  When enabled, user mappings are configured using the PTE contiguous
> +	  bit, for any mappings that meet the size and alignment requirements.
> +	  This reduces TLB pressure and improves performance.
> +
>  endmenu # "Kernel Features"
>  
>  menu "Boot options"
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 7dc6b68ee516..34892a95403d 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>   */
>  #define pte_valid_not_user(pte) \
>  	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
> +/*
> + * Returns true if the pte is valid and has the contiguous bit set.
> + */
> +#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
>  /*
>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>   * so that we don't erroneously return false for pages that have been
> @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>  #define vmemmap_update_pte vmemmap_update_pte
>  #endif
>  
> +#ifdef CONFIG_ARM64_CONTPTE
> +
> +/*
> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
> + * a private implementation detail of the public ptep API (see below).
> + */
> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte);
> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte, unsigned int nr);
> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep);
> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep,
> +				pte_t entry, int dirty);
> +
> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte)
> +{
> +	if (unlikely(pte_valid_cont(pte)))
> +		__contpte_try_unfold(mm, addr, ptep, pte);
> +}
> +
> +/*
> + * The below functions constitute the public API that arm64 presents to the
> + * core-mm to manipulate PTE entries within their page tables (or at least this
> + * is the subset of the API that arm64 needs to implement). These public
> + * versions will automatically and transparently apply the contiguous bit where
> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
> + * private versions, which are prefixed with double underscore. All of these
> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
> + * held.
> + */
> +
> +#define ptep_get ptep_get
> +static inline pte_t ptep_get(pte_t *ptep)
> +{
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(pte)))
> +		return pte;
> +
> +	return contpte_ptep_get(ptep, pte);
> +}
> +
> +#define ptep_get_lockless ptep_get_lockless
> +static inline pte_t ptep_get_lockless(pte_t *ptep)
> +{
> +	pte_t pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(pte)))
> +		return pte;
> +
> +	return contpte_ptep_get_lockless(ptep);
> +}
> +
> +static inline void set_pte(pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
> +	 * this should never even be called. Regardless do our best to service
> +	 * any call and emit a warning if there is any attempt to set a pte on
> +	 * top of an existing contig range.
> +	 */
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
> +	__set_pte(ptep, pte_mknoncont(pte));
> +}
> +
> +#define set_ptes set_ptes
> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> +				pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +	pte = pte_mknoncont(pte);

Why do we have to clear the contiguous bit here? Is that for the same reason as
set_pte(), or do we expect callers to legitimately call this with the
contiguous bit set in 'pte'?

I think you explained this to me in-person, and IIRC we don't expect callers to
go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
have to clear it here to defer the decision of whether to set/clear it when
modifying entries. It would be nice if we could have a description of why/when
we need to clear this, e.g. in the 'public API' comment block above.

> +
> +	if (likely(nr == 1)) {
> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +		__set_ptes(mm, addr, ptep, pte, 1);
> +	} else {
> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
> +	}
> +}
> +
> +static inline void pte_clear(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +	__pte_clear(mm, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +	return __ptep_get_and_clear(mm, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(orig_pte)))
> +		return __ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	if (likely(!pte_valid_cont(orig_pte)))
> +		return __ptep_clear_flush_young(vma, addr, ptep);
> +
> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
> +				unsigned long addr, pte_t *ptep)
> +{
> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
> +	__ptep_set_wrprotect(mm, addr, ptep);
> +}
> +
> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
> +				unsigned long addr, pte_t *ptep,
> +				pte_t entry, int dirty)
> +{
> +	pte_t orig_pte = __ptep_get(ptep);
> +
> +	entry = pte_mknoncont(entry);
> +
> +	if (likely(!pte_valid_cont(orig_pte)))
> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +
> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +}
> +
> +#else /* CONFIG_ARM64_CONTPTE */
> +
>  #define ptep_get				__ptep_get
>  #define set_pte					__set_pte
>  #define set_ptes				__set_ptes
> @@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>  #define ptep_set_access_flags			__ptep_set_access_flags
>  
> +#endif /* CONFIG_ARM64_CONTPTE */
> +
>  #endif /* !__ASSEMBLY__ */
>  
>  #endif /* __ASM_PGTABLE_H */
> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
> index dbd1bc95967d..60454256945b 100644
> --- a/arch/arm64/mm/Makefile
> +++ b/arch/arm64/mm/Makefile
> @@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
>  				   cache.o copypage.o flush.o \
>  				   ioremap.o mmap.o pgd.o mmu.o \
>  				   context.o proc.o pageattr.o fixmap.o
> +obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
> new file mode 100644
> index 000000000000..bfb50e6b44c7
> --- /dev/null
> +++ b/arch/arm64/mm/contpte.c
> @@ -0,0 +1,283 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * Copyright (C) 2023 ARM Ltd.
> + */
> +
> +#include <linux/mm.h>
> +#include <linux/export.h>
> +#include <asm/tlbflush.h>
> +
> +static inline bool mm_is_user(struct mm_struct *mm)
> +{
> +	/*
> +	 * Don't attempt to apply the contig bit to kernel mappings, because
> +	 * dynamically adding/removing the contig bit can cause page faults.
> +	 * These racing faults are ok for user space, since they get serialized
> +	 * on the PTL. But kernel mappings can't tolerate faults.
> +	 */
> +	return mm != &init_mm;
> +}

We also have the efi_mm as a non-user mm, though I don't think we manipulate
that while it is live, and I'm not sure if that needs any special handling.

> +static inline pte_t *contpte_align_down(pte_t *ptep)
> +{
> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);

I think this can be:

static inline pte_t *contpte_align_down(pte_t *ptep)
{
	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
}

> +
> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
> +			    pte_t *ptep, pte_t pte)
> +{
> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> +	unsigned long start_addr;
> +	pte_t *start_ptep;
> +	int i;
> +
> +	start_ptep = ptep = contpte_align_down(ptep);
> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> +
> +		if (pte_dirty(ptent))
> +			pte = pte_mkdirty(pte);
> +
> +		if (pte_young(ptent))
> +			pte = pte_mkyoung(pte);
> +	}

Not a big deal either way, but I wonder if it makes more sense to accumulate
the 'ptent' dirty/young values, then modify 'pte' once, i.e.

	bool dirty = false, young = false;

	for (...) {
		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
		dirty |= pte_dirty(ptent);
		young |= pte_young(ptent);
	}

	if (dirty)
		pte_mkdirty(pte);
	if (young)
		pte_mkyoung(pte);

I suspect that might generate slightly better code, but I'm also happy with the
current form if people thnk that's more legible (I have no strong feelings
either way).

> +
> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
> +
> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
> +}
> +
> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
> +			pte_t *ptep, pte_t pte)
> +{
> +	/*
> +	 * We have already checked that the ptes are contiguous in
> +	 * contpte_try_unfold(), so just check that the mm is user space.
> +	 */
> +
> +	if (!mm_is_user(mm))
> +		return;

Nit: normally we don't put a line gap between a comment block and the
associated block of code.

> +
> +	pte = pte_mknoncont(pte);
> +	contpte_convert(mm, addr, ptep, pte);
> +}
> +EXPORT_SYMBOL(__contpte_try_unfold);
> +
> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
> +	 * contiguous range cannot be unfolded or otherwise modified under our
> +	 * feet.
> +	 */

Nit: s/guarranteed/guaranteed/

> +
> +	pte_t pte;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
> +		pte = __ptep_get(ptep);
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get);
> +
> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> +{
> +	/*
> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> +	 * of the contig range. We may not be holding the PTL, so any contiguous
> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
> +	 * in the range are valid and have CONT_PTE set, that all pfns are
> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
> +	 * If we find a pte that is not consistent, then we must be racing with
> +	 * an update so start again. If the target pte does not have CONT_PTE
> +	 * set then that is considered consistent on its own because it is not
> +	 * part of a contpte range.
> +	 */
> +
> +	pgprot_t orig_prot;
> +	unsigned long pfn;
> +	pte_t orig_pte;
> +	pgprot_t prot;
> +	pte_t *ptep;
> +	pte_t pte;
> +	int i;
> +
> +retry:
> +	orig_pte = __ptep_get(orig_ptep);
> +
> +	if (!pte_valid_cont(orig_pte))
> +		return orig_pte;
> +
> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> +	ptep = contpte_align_down(orig_ptep);
> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> +		pte = __ptep_get(ptep);
> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> +
> +		if (!pte_valid_cont(pte) ||
> +		   pte_pfn(pte) != pfn ||
> +		   pgprot_val(prot) != pgprot_val(orig_prot))
> +			goto retry;
> +
> +		if (pte_dirty(pte))
> +			orig_pte = pte_mkdirty(orig_pte);
> +
> +		if (pte_young(pte))
> +			orig_pte = pte_mkyoung(orig_pte);
> +	}
> +
> +	return orig_pte;
> +}
> +EXPORT_SYMBOL(contpte_ptep_get_lockless);

I'm struggling to convince myself that this is safe in general, as it really
depends on how the caller will use this value. Which caller(s) actually care
about the access/dirty bits, given those could change at any time anyway?

I took a quick scan, and AFAICT:

* For perf_get_pgtable_size(), we only care about whether the entry is valid
  and has the contig bit set. We could clean that up with a new interface, e.g.
  something like a new ptep_get_size_lockless().

* For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
  we look at the pte to start with, since we only care where we can logically
  write to the page at that point.

  I see that we later follow up with:

    with pte_val(pte) != pte_val(ptep_get(ptep)))

  ... is that why we need ptep_get_lockless() to accumulate the access/dirty
  bits? So that shape of lockless-try...locked-compare sequence works?

* For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
  so this doesn' seem to matter.

* For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
  which means the pte isn't valid, and we'll return the orig_pte as-is anyway.

* For pte_range_none() the access/dirty bits don't matter.

* For handle_pte_fault() I think we have the same shape of
  lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
  care about the acess/dirty bits before we reach the locked compare step.

* For ptdump_pte_entry() I think it's arguable that we should continue to
  report the access/dirty bits separately for each PTE, as we have done until
  now, to give an accurate representation of the contents of the translation
  tables.

* For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
  swap entry, the access/dirty bits don't matter.

So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
and IIUC that's only so that the locklessly-loaded pte value can be compared
with a subsequently locked-loaded entry (for which the access/dirty bits will
be accumulated). Have I understood that correctly?

If so, I wonder if we could instead do that comparison modulo the access/dirty
bits, and leave ptep_get_lockless() only reading a single entry?

Thanks,
Mark.

> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
> +					pte_t *ptep, pte_t pte, unsigned int nr)
> +{
> +	unsigned long next;
> +	unsigned long end;
> +	unsigned long pfn;
> +	pgprot_t prot;
> +
> +	/*
> +	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
> +	 * all ptes is not-present. Therefore we never need to unfold or
> +	 * otherwise invalidate a range before we set the new ptes.
> +	 * contpte_set_ptes() should never be called for nr < 2.
> +	 */
> +	VM_WARN_ON(nr == 1);
> +
> +	if (!mm_is_user(mm))
> +		return __set_ptes(mm, addr, ptep, pte, nr);
> +
> +	end = addr + (nr << PAGE_SHIFT);
> +	pfn = pte_pfn(pte);
> +	prot = pte_pgprot(pte);
> +
> +	do {
> +		next = pte_cont_addr_end(addr, end);
> +		nr = (next - addr) >> PAGE_SHIFT;
> +		pte = pfn_pte(pfn, prot);
> +
> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
> +			pte = pte_mkcont(pte);
> +		else
> +			pte = pte_mknoncont(pte);
> +
> +		__set_ptes(mm, addr, ptep, pte, nr);
> +
> +		addr = next;
> +		ptep += nr;
> +		pfn += nr;
> +
> +	} while (addr != end);
> +}
> +EXPORT_SYMBOL(contpte_set_ptes);
> +
> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	/*
> +	 * ptep_clear_flush_young() technically requires us to clear the access
> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
> +	 * access/dirty per folio, not per page. And since we only create a
> +	 * contig range when the range is covered by a single folio, we can get
> +	 * away with clearing young for the whole contig range here, so we avoid
> +	 * having to unfold.
> +	 */
> +
> +	int young = 0;
> +	int i;
> +
> +	ptep = contpte_align_down(ptep);
> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
> +
> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep)
> +{
> +	int young;
> +
> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
> +
> +	if (young) {
> +		/*
> +		 * See comment in __ptep_clear_flush_young(); same rationale for
> +		 * eliding the trailing DSB applies here.
> +		 */
> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
> +					 PAGE_SIZE, true, 3);
> +	}
> +
> +	return young;
> +}
> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
> +
> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
> +					unsigned long addr, pte_t *ptep,
> +					pte_t entry, int dirty)
> +{
> +	unsigned long start_addr;
> +	pte_t orig_pte;
> +	int i;
> +
> +	/*
> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
> +	 * changed, its a noop.
> +	 */
> +	orig_pte = pte_mknoncont(ptep_get(ptep));
> +	if (pte_val(orig_pte) == pte_val(entry))
> +		return 0;
> +
> +	/*
> +	 * We can fix up access/dirty bits without having to unfold the contig
> +	 * range. But if the write bit is changing, we must unfold.
> +	 */
> +	if (pte_write(orig_pte) == pte_write(entry)) {
> +		/*
> +		 * For HW access management, we technically only need to update
> +		 * the flag on a single pte in the range. But for SW access
> +		 * management, we need to update all the ptes to prevent extra
> +		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
> +		 * and instead flush the whole range at the end.
> +		 */
> +		ptep = contpte_align_down(ptep);
> +		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> +
> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
> +
> +		if (dirty)
> +			__flush_tlb_range(vma, start_addr, addr,
> +							PAGE_SIZE, true, 3);
> +	} else {
> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
> +		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
> +	}
> +
> +	return 1;
> +}
> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);
> -- 
> 2.25.1
>

Ryan Roberts Feb. 12, 2024, 12:59 p.m. UTC | #2

On 12/02/2024 12:00, Mark Rutland wrote:
> Hi Ryan,
> 
> Overall this looks pretty good; I have a bunch of minor comments below, and a
> bigger question on the way ptep_get_lockless() works.

OK great - thanks for the review. Let's see if I can answer them all...

> 
> On Fri, Feb 02, 2024 at 08:07:50AM +0000, Ryan Roberts wrote:
>> With the ptep API sufficiently refactored, we can now introduce a new
>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>> user mappings.
>>
>> In this initial implementation, only suitable batches of PTEs, set via
>> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
>> modification of individual PTEs will cause an "unfold" operation to
>> repaint the contpte block as individual PTEs before performing the
>> requested operation. While, a modification of a single PTE could cause
>> the block of PTEs to which it belongs to become eligible for "folding"
>> into a contpte entry, "folding" is not performed in this initial
>> implementation due to the costs of checking the requirements are met.
>> Due to this, contpte mappings will degrade back to normal pte mappings
>> over time if/when protections are changed. This will be solved in a
>> future patch.
>>
>> Since a contpte block only has a single access and dirty bit, the
>> semantic here changes slightly; when getting a pte (e.g. ptep_get())
>> that is part of a contpte mapping, the access and dirty information are
>> pulled from the block (so all ptes in the block return the same
>> access/dirty info). When changing the access/dirty info on a pte (e.g.
>> ptep_set_access_flags()) that is part of a contpte mapping, this change
>> will affect the whole contpte block. This is works fine in practice
>> since we guarantee that only a single folio is mapped by a contpte
>> block, and the core-mm tracks access/dirty information per folio.
>>
>> In order for the public functions, which used to be pure inline, to
>> continue to be callable by modules, export all the contpte_* symbols
>> that are now called by those public inline functions.
>>
>> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
>> at build time. It defaults to enabled as long as its dependency,
>> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
>> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
>> enabled, then there is no chance of meeting the physical contiguity
>> requirement for contpte mappings.
>>
>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>> ---
>>  arch/arm64/Kconfig               |   9 +
>>  arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
>>  arch/arm64/mm/Makefile           |   1 +
>>  arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
>>  4 files changed, 454 insertions(+)
>>  create mode 100644 arch/arm64/mm/contpte.c
>>
>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>> index d86d7f4758b5..1442e8ed95b6 100644
>> --- a/arch/arm64/Kconfig
>> +++ b/arch/arm64/Kconfig
>> @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>>  	select UNWIND_TABLES
>>  	select DYNAMIC_SCS
>>  
>> +config ARM64_CONTPTE
>> +	bool "Contiguous PTE mappings for user memory" if EXPERT
>> +	depends on TRANSPARENT_HUGEPAGE
>> +	default y
>> +	help
>> +	  When enabled, user mappings are configured using the PTE contiguous
>> +	  bit, for any mappings that meet the size and alignment requirements.
>> +	  This reduces TLB pressure and improves performance.
>> +
>>  endmenu # "Kernel Features"
>>  
>>  menu "Boot options"
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 7dc6b68ee516..34892a95403d 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>   */
>>  #define pte_valid_not_user(pte) \
>>  	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
>> +/*
>> + * Returns true if the pte is valid and has the contiguous bit set.
>> + */
>> +#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
>>  /*
>>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>>   * so that we don't erroneously return false for pages that have been
>> @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>  #define vmemmap_update_pte vmemmap_update_pte
>>  #endif
>>  
>> +#ifdef CONFIG_ARM64_CONTPTE
>> +
>> +/*
>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>> + * a private implementation detail of the public ptep API (see below).
>> + */
>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte);
>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte, unsigned int nr);
>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep);
>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep,
>> +				pte_t entry, int dirty);
>> +
>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte)
>> +{
>> +	if (unlikely(pte_valid_cont(pte)))
>> +		__contpte_try_unfold(mm, addr, ptep, pte);
>> +}
>> +
>> +/*
>> + * The below functions constitute the public API that arm64 presents to the
>> + * core-mm to manipulate PTE entries within their page tables (or at least this
>> + * is the subset of the API that arm64 needs to implement). These public
>> + * versions will automatically and transparently apply the contiguous bit where
>> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>> + * private versions, which are prefixed with double underscore. All of these
>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>> + * held.
>> + */
>> +
>> +#define ptep_get ptep_get
>> +static inline pte_t ptep_get(pte_t *ptep)
>> +{
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(pte)))
>> +		return pte;
>> +
>> +	return contpte_ptep_get(ptep, pte);
>> +}
>> +
>> +#define ptep_get_lockless ptep_get_lockless
>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>> +{
>> +	pte_t pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(pte)))
>> +		return pte;
>> +
>> +	return contpte_ptep_get_lockless(ptep);
>> +}
>> +
>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
>> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
>> +	 * this should never even be called. Regardless do our best to service
>> +	 * any call and emit a warning if there is any attempt to set a pte on
>> +	 * top of an existing contig range.
>> +	 */
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
>> +	__set_pte(ptep, pte_mknoncont(pte));
>> +}
>> +
>> +#define set_ptes set_ptes
>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>> +				pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +	pte = pte_mknoncont(pte);
> 
> Why do we have to clear the contiguous bit here? Is that for the same reason as
> set_pte(), or do we expect callers to legitimately call this with the
> contiguous bit set in 'pte'?
> 
> I think you explained this to me in-person, and IIRC we don't expect callers to
> go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
> have to clear it here to defer the decision of whether to set/clear it when
> modifying entries. It would be nice if we could have a description of why/when
> we need to clear this, e.g. in the 'public API' comment block above.

Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
private to the architecture code and is never set directly by core code. If the
public API ever receives a pte that happens to have the PTE_CONT bit set, it
would be bad news if we then accidentally set that in the pgtable.

Ideally, we would just uncondidtionally clear the bit before a getter returns
the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
way, the code code is guarranteed never to see a pte with the PTE_CONT bit set
and can therefore never accidentally pass such a pte into a setter function.
However, there is existing functionality that relies on being able to get a pte,
then pass it to pte_leaf_size(), and arch function that checks the PTE_CONT bit
to determine how big the leaf is. This is used in perf_get_pgtable_size().

So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
decided to allow PTE_CONT to leak through the getters and instead
unconditionally clear the bit when a pte is passed to any of the setters.

I'll add a (slightly less verbose) comment as you suggest.

> 
>> +
>> +	if (likely(nr == 1)) {
>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +		__set_ptes(mm, addr, ptep, pte, 1);
>> +	} else {
>> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>> +	}
>> +}
>> +
>> +static inline void pte_clear(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	__pte_clear(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	return __ptep_get_and_clear(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_clear_flush_young(vma, addr, ptep);
>> +
>> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
>> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
>> +				unsigned long addr, pte_t *ptep)
>> +{
>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>> +	__ptep_set_wrprotect(mm, addr, ptep);
>> +}
>> +
>> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>> +				unsigned long addr, pte_t *ptep,
>> +				pte_t entry, int dirty)
>> +{
>> +	pte_t orig_pte = __ptep_get(ptep);
>> +
>> +	entry = pte_mknoncont(entry);
>> +
>> +	if (likely(!pte_valid_cont(orig_pte)))
>> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +
>> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +}
>> +
>> +#else /* CONFIG_ARM64_CONTPTE */
>> +
>>  #define ptep_get				__ptep_get
>>  #define set_pte					__set_pte
>>  #define set_ptes				__set_ptes
>> @@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>  #define ptep_set_access_flags			__ptep_set_access_flags
>>  
>> +#endif /* CONFIG_ARM64_CONTPTE */
>> +
>>  #endif /* !__ASSEMBLY__ */
>>  
>>  #endif /* __ASM_PGTABLE_H */
>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>> index dbd1bc95967d..60454256945b 100644
>> --- a/arch/arm64/mm/Makefile
>> +++ b/arch/arm64/mm/Makefile
>> @@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
>>  				   cache.o copypage.o flush.o \
>>  				   ioremap.o mmap.o pgd.o mmu.o \
>>  				   context.o proc.o pageattr.o fixmap.o
>> +obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
>>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>> new file mode 100644
>> index 000000000000..bfb50e6b44c7
>> --- /dev/null
>> +++ b/arch/arm64/mm/contpte.c
>> @@ -0,0 +1,283 @@
>> +// SPDX-License-Identifier: GPL-2.0-only
>> +/*
>> + * Copyright (C) 2023 ARM Ltd.
>> + */
>> +
>> +#include <linux/mm.h>
>> +#include <linux/export.h>
>> +#include <asm/tlbflush.h>
>> +
>> +static inline bool mm_is_user(struct mm_struct *mm)
>> +{
>> +	/*
>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>> +	 * dynamically adding/removing the contig bit can cause page faults.
>> +	 * These racing faults are ok for user space, since they get serialized
>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>> +	 */
>> +	return mm != &init_mm;
>> +}
> 
> We also have the efi_mm as a non-user mm, though I don't think we manipulate
> that while it is live, and I'm not sure if that needs any special handling.

Well we never need this function in the hot (order-0 folio) path, so I think I
could add a check for efi_mm here with performance implication. It's probably
safest to explicitly exclude it? What do you think?

> 
>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>> +{
>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> 
> I think this can be:
> 
> static inline pte_t *contpte_align_down(pte_t *ptep)
> {
> 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
> }

Yep - that's much less ugly - thanks!

> 
>> +
>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>> +			    pte_t *ptep, pte_t pte)
>> +{
>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>> +	unsigned long start_addr;
>> +	pte_t *start_ptep;
>> +	int i;
>> +
>> +	start_ptep = ptep = contpte_align_down(ptep);
>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> +
>> +		if (pte_dirty(ptent))
>> +			pte = pte_mkdirty(pte);
>> +
>> +		if (pte_young(ptent))
>> +			pte = pte_mkyoung(pte);
>> +	}
> 
> Not a big deal either way, but I wonder if it makes more sense to accumulate
> the 'ptent' dirty/young values, then modify 'pte' once, i.e.
> 
> 	bool dirty = false, young = false;
> 
> 	for (...) {
> 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> 		dirty |= pte_dirty(ptent);
> 		young |= pte_young(ptent);
> 	}
> 
> 	if (dirty)
> 		pte_mkdirty(pte);
> 	if (young)
> 		pte_mkyoung(pte);
> 
> I suspect that might generate slightly better code, but I'm also happy with the
> current form if people thnk that's more legible (I have no strong feelings
> either way).

I kept it this way, because its the same pattern used in arm64's hugetlbpage.c.
We also had the same comment against David's batching patches recently, and he
opted to stick with the former version:

https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/

So I'm inclined to leave it as is, since you're not insisting :)

> 
>> +
>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>> +
>> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>> +}
>> +
>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>> +			pte_t *ptep, pte_t pte)
>> +{
>> +	/*
>> +	 * We have already checked that the ptes are contiguous in
>> +	 * contpte_try_unfold(), so just check that the mm is user space.
>> +	 */
>> +
>> +	if (!mm_is_user(mm))
>> +		return;
> 
> Nit: normally we don't put a line gap between a comment block and the
> associated block of code.

ACK, I'll fix in next version.

> 
>> +
>> +	pte = pte_mknoncont(pte);
>> +	contpte_convert(mm, addr, ptep, pte);
>> +}
>> +EXPORT_SYMBOL(__contpte_try_unfold);
>> +
>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
>> +	 * contiguous range cannot be unfolded or otherwise modified under our
>> +	 * feet.
>> +	 */
> 
> Nit: s/guarranteed/guaranteed/

ACK, I'll fix in next version.

> 
>> +
>> +	pte_t pte;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
>> +		pte = __ptep_get(ptep);
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get);
>> +
>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>> +{
>> +	/*
>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
>> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
>> +	 * in the range are valid and have CONT_PTE set, that all pfns are
>> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>> +	 * If we find a pte that is not consistent, then we must be racing with
>> +	 * an update so start again. If the target pte does not have CONT_PTE
>> +	 * set then that is considered consistent on its own because it is not
>> +	 * part of a contpte range.
>> +	 */
>> +
>> +	pgprot_t orig_prot;
>> +	unsigned long pfn;
>> +	pte_t orig_pte;
>> +	pgprot_t prot;
>> +	pte_t *ptep;
>> +	pte_t pte;
>> +	int i;
>> +
>> +retry:
>> +	orig_pte = __ptep_get(orig_ptep);
>> +
>> +	if (!pte_valid_cont(orig_pte))
>> +		return orig_pte;
>> +
>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>> +	ptep = contpte_align_down(orig_ptep);
>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>> +		pte = __ptep_get(ptep);
>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>> +
>> +		if (!pte_valid_cont(pte) ||
>> +		   pte_pfn(pte) != pfn ||
>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>> +			goto retry;
>> +
>> +		if (pte_dirty(pte))
>> +			orig_pte = pte_mkdirty(orig_pte);
>> +
>> +		if (pte_young(pte))
>> +			orig_pte = pte_mkyoung(orig_pte);
>> +	}
>> +
>> +	return orig_pte;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
> 
> I'm struggling to convince myself that this is safe in general, as it really
> depends on how the caller will use this value. Which caller(s) actually care
> about the access/dirty bits, given those could change at any time anyway?

I think your points below are valid, and agree we should try to make this work
without needing access/dirty if possible. But can you elaborate on why you don't
think it's safe?

> 
> I took a quick scan, and AFAICT:

Thanks for enumerating these; Saves me from having to refresh my memory :)

> 
> * For perf_get_pgtable_size(), we only care about whether the entry is valid
>   and has the contig bit set. We could clean that up with a new interface, e.g.
>   something like a new ptep_get_size_lockless().
> 
> * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
>   we look at the pte to start with, since we only care where we can logically
>   write to the page at that point.
> 
>   I see that we later follow up with:
> 
>     with pte_val(pte) != pte_val(ptep_get(ptep)))
> 
>   ... is that why we need ptep_get_lockless() to accumulate the access/dirty
>   bits? So that shape of lockless-try...locked-compare sequence works?
> 
> * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
>   so this doesn' seem to matter.
> 
> * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
>   which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
> 
> * For pte_range_none() the access/dirty bits don't matter.
> 
> * For handle_pte_fault() I think we have the same shape of
>   lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
>   care about the acess/dirty bits before we reach the locked compare step.
> 
> * For ptdump_pte_entry() I think it's arguable that we should continue to
>   report the access/dirty bits separately for each PTE, as we have done until
>   now, to give an accurate representation of the contents of the translation
>   tables.
> 
> * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
>   swap entry, the access/dirty bits don't matter.
> 
> So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
> and IIUC that's only so that the locklessly-loaded pte value can be compared
> with a subsequently locked-loaded entry (for which the access/dirty bits will
> be accumulated). Have I understood that correctly?

Yes, I agree with what you are saying. My approach was to try to implement the
existing APIs accurately though, the argument being that it reduces the chances
of getting it wrong. But if you think the implementation is unsafe, then I guess
it blows that out of the water...

> 
> If so, I wonder if we could instead do that comparison modulo the access/dirty
> bits, 

I think that would work - but will need to think a bit more on it.

> and leave ptep_get_lockless() only reading a single entry?

I think we will need to do something a bit less fragile. ptep_get() does collect
the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
we will likely want to rename the function and make its documentation explicit
that it does not return those bits.

ptep_get_lockless_noyoungdirty()? yuk... Any ideas?

Of course if I could convince you the current implementation is safe, I might be
able to sidestep this optimization until a later date?

Thanks,
Ryan


> 
> Thanks,
> Mark.
> 
>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>> +					pte_t *ptep, pte_t pte, unsigned int nr)
>> +{
>> +	unsigned long next;
>> +	unsigned long end;
>> +	unsigned long pfn;
>> +	pgprot_t prot;
>> +
>> +	/*
>> +	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
>> +	 * all ptes is not-present. Therefore we never need to unfold or
>> +	 * otherwise invalidate a range before we set the new ptes.
>> +	 * contpte_set_ptes() should never be called for nr < 2.
>> +	 */
>> +	VM_WARN_ON(nr == 1);
>> +
>> +	if (!mm_is_user(mm))
>> +		return __set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +	end = addr + (nr << PAGE_SHIFT);
>> +	pfn = pte_pfn(pte);
>> +	prot = pte_pgprot(pte);
>> +
>> +	do {
>> +		next = pte_cont_addr_end(addr, end);
>> +		nr = (next - addr) >> PAGE_SHIFT;
>> +		pte = pfn_pte(pfn, prot);
>> +
>> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>> +			pte = pte_mkcont(pte);
>> +		else
>> +			pte = pte_mknoncont(pte);
>> +
>> +		__set_ptes(mm, addr, ptep, pte, nr);
>> +
>> +		addr = next;
>> +		ptep += nr;
>> +		pfn += nr;
>> +
>> +	} while (addr != end);
>> +}
>> +EXPORT_SYMBOL(contpte_set_ptes);
>> +
>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	/*
>> +	 * ptep_clear_flush_young() technically requires us to clear the access
>> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
>> +	 * access/dirty per folio, not per page. And since we only create a
>> +	 * contig range when the range is covered by a single folio, we can get
>> +	 * away with clearing young for the whole contig range here, so we avoid
>> +	 * having to unfold.
>> +	 */
>> +
>> +	int young = 0;
>> +	int i;
>> +
>> +	ptep = contpte_align_down(ptep);
>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	return young;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
>> +
>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep)
>> +{
>> +	int young;
>> +
>> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>> +
>> +	if (young) {
>> +		/*
>> +		 * See comment in __ptep_clear_flush_young(); same rationale for
>> +		 * eliding the trailing DSB applies here.
>> +		 */
>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>> +					 PAGE_SIZE, true, 3);
>> +	}
>> +
>> +	return young;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>> +
>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>> +					unsigned long addr, pte_t *ptep,
>> +					pte_t entry, int dirty)
>> +{
>> +	unsigned long start_addr;
>> +	pte_t orig_pte;
>> +	int i;
>> +
>> +	/*
>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>> +	 * changed, its a noop.
>> +	 */
>> +	orig_pte = pte_mknoncont(ptep_get(ptep));
>> +	if (pte_val(orig_pte) == pte_val(entry))
>> +		return 0;
>> +
>> +	/*
>> +	 * We can fix up access/dirty bits without having to unfold the contig
>> +	 * range. But if the write bit is changing, we must unfold.
>> +	 */
>> +	if (pte_write(orig_pte) == pte_write(entry)) {
>> +		/*
>> +		 * For HW access management, we technically only need to update
>> +		 * the flag on a single pte in the range. But for SW access
>> +		 * management, we need to update all the ptes to prevent extra
>> +		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
>> +		 * and instead flush the whole range at the end.
>> +		 */
>> +		ptep = contpte_align_down(ptep);
>> +		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>> +
>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>> +
>> +		if (dirty)
>> +			__flush_tlb_range(vma, start_addr, addr,
>> +							PAGE_SIZE, true, 3);
>> +	} else {
>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>> +		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>> +	}
>> +
>> +	return 1;
>> +}
>> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);
>> -- 
>> 2.25.1
>>

David Hildenbrand Feb. 12, 2024, 1:54 p.m. UTC | #3

>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>> bits,
> 
> I think that would work - but will need to think a bit more on it.
> 
>> and leave ptep_get_lockless() only reading a single entry?
> 
> I think we will need to do something a bit less fragile. ptep_get() does collect
> the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
> we will likely want to rename the function and make its documentation explicit
> that it does not return those bits.
> 
> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
> 
> Of course if I could convince you the current implementation is safe, I might be
> able to sidestep this optimization until a later date?

As discussed (and pointed out abive), there might be quite some 
callsites where we don't really care about uptodate accessed/dirty bits 
-- where ptep_get() is used nowadays.

One way to approach that I had in mind was having an explicit interface:

ptep_get()
ptep_get_uptodate()
ptep_get_lockless()
ptep_get_lockless_uptodate()

Especially the last one might not be needed.

Futher, "uptodate" might not be the best choice because of 
PageUptodate() and friends. But it's better than 
"youngdirty"/"noyoungdirty" IMHO.

Of course, any such changes require care and are better done one step at 
at time separately.

Ryan Roberts Feb. 12, 2024, 2:45 p.m. UTC | #4

On 12/02/2024 13:54, David Hildenbrand wrote:
>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>> bits,
>>
>> I think that would work - but will need to think a bit more on it.
>>
>>> and leave ptep_get_lockless() only reading a single entry?
>>
>> I think we will need to do something a bit less fragile. ptep_get() does collect
>> the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
>> we will likely want to rename the function and make its documentation explicit
>> that it does not return those bits.
>>
>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>
>> Of course if I could convince you the current implementation is safe, I might be
>> able to sidestep this optimization until a later date?
> 
> As discussed (and pointed out abive), there might be quite some callsites where
> we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
> used nowadays.
> 
> One way to approach that I had in mind was having an explicit interface:
> 
> ptep_get()
> ptep_get_uptodate()
> ptep_get_lockless()
> ptep_get_lockless_uptodate()

Yes, I like the direction of this. I guess we anticipate that call sites
requiring the "_uptodate" variant will be the minority so it makes sense to use
the current names for the "_not_uptodate" variants? But to do a slow migration,
it might be better/safer to have the weaker variant use the new name - that
would allow us to downgrade one at a time?

> 
> Especially the last one might not be needed.
I've done a scan through the code and agree with Mark's original conclusions.
Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
access/dirty info. So I think I could migrate everything to the weaker variant
fairly easily.

> 
> Futher, "uptodate" might not be the best choice because of PageUptodate() and
> friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.

Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
"_nosync"?

> 
> Of course, any such changes require care and are better done one step at at time
> separately.
> 

So I propose to introduce ptep_get_lockless_nosync() (name up for discussion)
and migrate all users to it, as part of this series. This will side-step Mark's
correctness concerns. We can add ptep_get_nosync() later and migrate slowly.

Shout if you think this is a bad plan.

Thanks,
Ryan

David Hildenbrand Feb. 12, 2024, 3:26 p.m. UTC | #5

On 12.02.24 15:45, Ryan Roberts wrote:
> On 12/02/2024 13:54, David Hildenbrand wrote:
>>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>>> bits,
>>>
>>> I think that would work - but will need to think a bit more on it.
>>>
>>>> and leave ptep_get_lockless() only reading a single entry?
>>>
>>> I think we will need to do something a bit less fragile. ptep_get() does collect
>>> the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
>>> we will likely want to rename the function and make its documentation explicit
>>> that it does not return those bits.
>>>
>>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>>
>>> Of course if I could convince you the current implementation is safe, I might be
>>> able to sidestep this optimization until a later date?
>>
>> As discussed (and pointed out abive), there might be quite some callsites where
>> we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
>> used nowadays.
>>
>> One way to approach that I had in mind was having an explicit interface:
>>
>> ptep_get()
>> ptep_get_uptodate()
>> ptep_get_lockless()
>> ptep_get_lockless_uptodate()
> 
> Yes, I like the direction of this. I guess we anticipate that call sites
> requiring the "_uptodate" variant will be the minority so it makes sense to use
> the current names for the "_not_uptodate" variants? But to do a slow migration,
> it might be better/safer to have the weaker variant use the new name - that
> would allow us to downgrade one at a time?

Yes, I was primarily struggling with names. Likely it makes sense to 
either have two completely new function names, or use the new name only 
for the "faster but less precise" variant.

> 
>>
>> Especially the last one might not be needed.
> I've done a scan through the code and agree with Mark's original conclusions.
> Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
> access/dirty info. So I think I could migrate everything to the weaker variant
> fairly easily.
> 
>>
>> Futher, "uptodate" might not be the best choice because of PageUptodate() and
>> friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.
> 
> Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
> "_nosync"?

I could live with

ptep_get_sync()
ptep_get_nosync()

with proper documentation :)

I don't think we use "_sync" / "_nosync" in the context of pte 
operations yet.

Well, there seems to be "__arm_v7s_pte_sync" in iommu code, bit at least 
in core code nothing jumped at me.

Ryan Roberts Feb. 12, 2024, 3:30 p.m. UTC | #6

On 12/02/2024 12:59, Ryan Roberts wrote:
> On 12/02/2024 12:00, Mark Rutland wrote:
>> Hi Ryan,
>>
>> Overall this looks pretty good; I have a bunch of minor comments below, and a
>> bigger question on the way ptep_get_lockless() works.
> 
> OK great - thanks for the review. Let's see if I can answer them all...
> 
>>
>> On Fri, Feb 02, 2024 at 08:07:50AM +0000, Ryan Roberts wrote:
>>> With the ptep API sufficiently refactored, we can now introduce a new
>>> "contpte" API layer, which transparently manages the PTE_CONT bit for
>>> user mappings.
>>>
>>> In this initial implementation, only suitable batches of PTEs, set via
>>> set_ptes(), are mapped with the PTE_CONT bit. Any subsequent
>>> modification of individual PTEs will cause an "unfold" operation to
>>> repaint the contpte block as individual PTEs before performing the
>>> requested operation. While, a modification of a single PTE could cause
>>> the block of PTEs to which it belongs to become eligible for "folding"
>>> into a contpte entry, "folding" is not performed in this initial
>>> implementation due to the costs of checking the requirements are met.
>>> Due to this, contpte mappings will degrade back to normal pte mappings
>>> over time if/when protections are changed. This will be solved in a
>>> future patch.
>>>
>>> Since a contpte block only has a single access and dirty bit, the
>>> semantic here changes slightly; when getting a pte (e.g. ptep_get())
>>> that is part of a contpte mapping, the access and dirty information are
>>> pulled from the block (so all ptes in the block return the same
>>> access/dirty info). When changing the access/dirty info on a pte (e.g.
>>> ptep_set_access_flags()) that is part of a contpte mapping, this change
>>> will affect the whole contpte block. This is works fine in practice
>>> since we guarantee that only a single folio is mapped by a contpte
>>> block, and the core-mm tracks access/dirty information per folio.
>>>
>>> In order for the public functions, which used to be pure inline, to
>>> continue to be callable by modules, export all the contpte_* symbols
>>> that are now called by those public inline functions.
>>>
>>> The feature is enabled/disabled with the ARM64_CONTPTE Kconfig parameter
>>> at build time. It defaults to enabled as long as its dependency,
>>> TRANSPARENT_HUGEPAGE is also enabled. The core-mm depends upon
>>> TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if its not
>>> enabled, then there is no chance of meeting the physical contiguity
>>> requirement for contpte mappings.
>>>
>>> Tested-by: John Hubbard <jhubbard@nvidia.com>
>>> Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
>>> ---
>>>  arch/arm64/Kconfig               |   9 +
>>>  arch/arm64/include/asm/pgtable.h | 161 ++++++++++++++++++
>>>  arch/arm64/mm/Makefile           |   1 +
>>>  arch/arm64/mm/contpte.c          | 283 +++++++++++++++++++++++++++++++
>>>  4 files changed, 454 insertions(+)
>>>  create mode 100644 arch/arm64/mm/contpte.c
>>>
>>> diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
>>> index d86d7f4758b5..1442e8ed95b6 100644
>>> --- a/arch/arm64/Kconfig
>>> +++ b/arch/arm64/Kconfig
>>> @@ -2230,6 +2230,15 @@ config UNWIND_PATCH_PAC_INTO_SCS
>>>  	select UNWIND_TABLES
>>>  	select DYNAMIC_SCS
>>>  
>>> +config ARM64_CONTPTE
>>> +	bool "Contiguous PTE mappings for user memory" if EXPERT
>>> +	depends on TRANSPARENT_HUGEPAGE
>>> +	default y
>>> +	help
>>> +	  When enabled, user mappings are configured using the PTE contiguous
>>> +	  bit, for any mappings that meet the size and alignment requirements.
>>> +	  This reduces TLB pressure and improves performance.
>>> +
>>>  endmenu # "Kernel Features"
>>>  
>>>  menu "Boot options"
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 7dc6b68ee516..34892a95403d 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -133,6 +133,10 @@ static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>>>   */
>>>  #define pte_valid_not_user(pte) \
>>>  	((pte_val(pte) & (PTE_VALID | PTE_USER | PTE_UXN)) == (PTE_VALID | PTE_UXN))
>>> +/*
>>> + * Returns true if the pte is valid and has the contiguous bit set.
>>> + */
>>> +#define pte_valid_cont(pte)	(pte_valid(pte) && pte_cont(pte))
>>>  /*
>>>   * Could the pte be present in the TLB? We must check mm_tlb_flush_pending
>>>   * so that we don't erroneously return false for pages that have been
>>> @@ -1135,6 +1139,161 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>>  #define vmemmap_update_pte vmemmap_update_pte
>>>  #endif
>>>  
>>> +#ifdef CONFIG_ARM64_CONTPTE
>>> +
>>> +/*
>>> + * The contpte APIs are used to transparently manage the contiguous bit in ptes
>>> + * where it is possible and makes sense to do so. The PTE_CONT bit is considered
>>> + * a private implementation detail of the public ptep API (see below).
>>> + */
>>> +extern void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +				pte_t *ptep, pte_t pte);
>>> +extern pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte);
>>> +extern pte_t contpte_ptep_get_lockless(pte_t *orig_ptep);
>>> +extern void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +				pte_t *ptep, pte_t pte, unsigned int nr);
>>> +extern int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep);
>>> +extern int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep);
>>> +extern int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep,
>>> +				pte_t entry, int dirty);
>>> +
>>> +static inline void contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +					pte_t *ptep, pte_t pte)
>>> +{
>>> +	if (unlikely(pte_valid_cont(pte)))
>>> +		__contpte_try_unfold(mm, addr, ptep, pte);
>>> +}
>>> +
>>> +/*
>>> + * The below functions constitute the public API that arm64 presents to the
>>> + * core-mm to manipulate PTE entries within their page tables (or at least this
>>> + * is the subset of the API that arm64 needs to implement). These public
>>> + * versions will automatically and transparently apply the contiguous bit where
>>> + * it makes sense to do so. Therefore any users that are contig-aware (e.g.
>>> + * hugetlb, kernel mapper) should NOT use these APIs, but instead use the
>>> + * private versions, which are prefixed with double underscore. All of these
>>> + * APIs except for ptep_get_lockless() are expected to be called with the PTL
>>> + * held.
>>> + */
>>> +
>>> +#define ptep_get ptep_get
>>> +static inline pte_t ptep_get(pte_t *ptep)
>>> +{
>>> +	pte_t pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(pte)))
>>> +		return pte;
>>> +
>>> +	return contpte_ptep_get(ptep, pte);
>>> +}
>>> +
>>> +#define ptep_get_lockless ptep_get_lockless
>>> +static inline pte_t ptep_get_lockless(pte_t *ptep)
>>> +{
>>> +	pte_t pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(pte)))
>>> +		return pte;
>>> +
>>> +	return contpte_ptep_get_lockless(ptep);
>>> +}
>>> +
>>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>>> +{
>>> +	/*
>>> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
>>> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
>>> +	 * this should never even be called. Regardless do our best to service
>>> +	 * any call and emit a warning if there is any attempt to set a pte on
>>> +	 * top of an existing contig range.
>>> +	 */
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
>>> +	__set_pte(ptep, pte_mknoncont(pte));
>>> +}
>>> +
>>> +#define set_ptes set_ptes
>>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +				pte_t *ptep, pte_t pte, unsigned int nr)
>>> +{
>>> +	pte = pte_mknoncont(pte);
>>
>> Why do we have to clear the contiguous bit here? Is that for the same reason as
>> set_pte(), or do we expect callers to legitimately call this with the
>> contiguous bit set in 'pte'?
>>
>> I think you explained this to me in-person, and IIRC we don't expect callers to
>> go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
>> have to clear it here to defer the decision of whether to set/clear it when
>> modifying entries. It would be nice if we could have a description of why/when
>> we need to clear this, e.g. in the 'public API' comment block above.
> 
> Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
> private to the architecture code and is never set directly by core code. If the
> public API ever receives a pte that happens to have the PTE_CONT bit set, it
> would be bad news if we then accidentally set that in the pgtable.
> 
> Ideally, we would just uncondidtionally clear the bit before a getter returns
> the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
> way, the code code is guarranteed never to see a pte with the PTE_CONT bit set
> and can therefore never accidentally pass such a pte into a setter function.
> However, there is existing functionality that relies on being able to get a pte,
> then pass it to pte_leaf_size(), and arch function that checks the PTE_CONT bit
> to determine how big the leaf is. This is used in perf_get_pgtable_size().
> 
> So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
> decided to allow PTE_CONT to leak through the getters and instead
> unconditionally clear the bit when a pte is passed to any of the setters.
> 
> I'll add a (slightly less verbose) comment as you suggest.
> 
>>
>>> +
>>> +	if (likely(nr == 1)) {
>>> +		contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +		__set_ptes(mm, addr, ptep, pte, 1);
>>> +	} else {
>>> +		contpte_set_ptes(mm, addr, ptep, pte, nr);
>>> +	}
>>> +}
>>> +
>>> +static inline void pte_clear(struct mm_struct *mm,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +	__pte_clear(mm, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_GET_AND_CLEAR
>>> +static inline pte_t ptep_get_and_clear(struct mm_struct *mm,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +	return __ptep_get_and_clear(mm, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_TEST_AND_CLEAR_YOUNG
>>> +static inline int ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(orig_pte)))
>>> +		return __ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +	return contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_CLEAR_YOUNG_FLUSH
>>> +static inline int ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	if (likely(!pte_valid_cont(orig_pte)))
>>> +		return __ptep_clear_flush_young(vma, addr, ptep);
>>> +
>>> +	return contpte_ptep_clear_flush_young(vma, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_SET_WRPROTECT
>>> +static inline void ptep_set_wrprotect(struct mm_struct *mm,
>>> +				unsigned long addr, pte_t *ptep)
>>> +{
>>> +	contpte_try_unfold(mm, addr, ptep, __ptep_get(ptep));
>>> +	__ptep_set_wrprotect(mm, addr, ptep);
>>> +}
>>> +
>>> +#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>> +static inline int ptep_set_access_flags(struct vm_area_struct *vma,
>>> +				unsigned long addr, pte_t *ptep,
>>> +				pte_t entry, int dirty)
>>> +{
>>> +	pte_t orig_pte = __ptep_get(ptep);
>>> +
>>> +	entry = pte_mknoncont(entry);
>>> +
>>> +	if (likely(!pte_valid_cont(orig_pte)))
>>> +		return __ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +
>>> +	return contpte_ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +}
>>> +
>>> +#else /* CONFIG_ARM64_CONTPTE */
>>> +
>>>  #define ptep_get				__ptep_get
>>>  #define set_pte					__set_pte
>>>  #define set_ptes				__set_ptes
>>> @@ -1150,6 +1309,8 @@ void vmemmap_update_pte(unsigned long addr, pte_t *ptep, pte_t pte);
>>>  #define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
>>>  #define ptep_set_access_flags			__ptep_set_access_flags
>>>  
>>> +#endif /* CONFIG_ARM64_CONTPTE */
>>> +
>>>  #endif /* !__ASSEMBLY__ */
>>>  
>>>  #endif /* __ASM_PGTABLE_H */
>>> diff --git a/arch/arm64/mm/Makefile b/arch/arm64/mm/Makefile
>>> index dbd1bc95967d..60454256945b 100644
>>> --- a/arch/arm64/mm/Makefile
>>> +++ b/arch/arm64/mm/Makefile
>>> @@ -3,6 +3,7 @@ obj-y				:= dma-mapping.o extable.o fault.o init.o \
>>>  				   cache.o copypage.o flush.o \
>>>  				   ioremap.o mmap.o pgd.o mmu.o \
>>>  				   context.o proc.o pageattr.o fixmap.o
>>> +obj-$(CONFIG_ARM64_CONTPTE)	+= contpte.o
>>>  obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
>>>  obj-$(CONFIG_PTDUMP_CORE)	+= ptdump.o
>>>  obj-$(CONFIG_PTDUMP_DEBUGFS)	+= ptdump_debugfs.o
>>> diff --git a/arch/arm64/mm/contpte.c b/arch/arm64/mm/contpte.c
>>> new file mode 100644
>>> index 000000000000..bfb50e6b44c7
>>> --- /dev/null
>>> +++ b/arch/arm64/mm/contpte.c
>>> @@ -0,0 +1,283 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * Copyright (C) 2023 ARM Ltd.
>>> + */
>>> +
>>> +#include <linux/mm.h>
>>> +#include <linux/export.h>
>>> +#include <asm/tlbflush.h>
>>> +
>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>> +{
>>> +	/*
>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>> +	 * These racing faults are ok for user space, since they get serialized
>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>> +	 */
>>> +	return mm != &init_mm;
>>> +}
>>
>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>> that while it is live, and I'm not sure if that needs any special handling.
> 
> Well we never need this function in the hot (order-0 folio) path, so I think I
> could add a check for efi_mm here with performance implication. It's probably
> safest to explicitly exclude it? What do you think?

Oops: This should have read "I think I could add a check for efi_mm here
*without* performance implication"

> 
>>
>>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>>> +{
>>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>
>> I think this can be:
>>
>> static inline pte_t *contpte_align_down(pte_t *ptep)
>> {
>> 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
>> }
> 
> Yep - that's much less ugly - thanks!
> 
>>
>>> +
>>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>> +			    pte_t *ptep, pte_t pte)
>>> +{
>>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>> +	unsigned long start_addr;
>>> +	pte_t *start_ptep;
>>> +	int i;
>>> +
>>> +	start_ptep = ptep = contpte_align_down(ptep);
>>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>> +
>>> +		if (pte_dirty(ptent))
>>> +			pte = pte_mkdirty(pte);
>>> +
>>> +		if (pte_young(ptent))
>>> +			pte = pte_mkyoung(pte);
>>> +	}
>>
>> Not a big deal either way, but I wonder if it makes more sense to accumulate
>> the 'ptent' dirty/young values, then modify 'pte' once, i.e.
>>
>> 	bool dirty = false, young = false;
>>
>> 	for (...) {
>> 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>> 		dirty |= pte_dirty(ptent);
>> 		young |= pte_young(ptent);
>> 	}
>>
>> 	if (dirty)
>> 		pte_mkdirty(pte);
>> 	if (young)
>> 		pte_mkyoung(pte);
>>
>> I suspect that might generate slightly better code, but I'm also happy with the
>> current form if people thnk that's more legible (I have no strong feelings
>> either way).
> 
> I kept it this way, because its the same pattern used in arm64's hugetlbpage.c.
> We also had the same comment against David's batching patches recently, and he
> opted to stick with the former version:
> 
> https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/
> 
> So I'm inclined to leave it as is, since you're not insisting :)
> 
>>
>>> +
>>> +	__flush_tlb_range(&vma, start_addr, addr, PAGE_SIZE, true, 3);
>>> +
>>> +	__set_ptes(mm, start_addr, start_ptep, pte, CONT_PTES);
>>> +}
>>> +
>>> +void __contpte_try_unfold(struct mm_struct *mm, unsigned long addr,
>>> +			pte_t *ptep, pte_t pte)
>>> +{
>>> +	/*
>>> +	 * We have already checked that the ptes are contiguous in
>>> +	 * contpte_try_unfold(), so just check that the mm is user space.
>>> +	 */
>>> +
>>> +	if (!mm_is_user(mm))
>>> +		return;
>>
>> Nit: normally we don't put a line gap between a comment block and the
>> associated block of code.
> 
> ACK, I'll fix in next version.

Just to clarify this, I've got a few instances in this file where I have a
comment that applies to the function as a whole, and in those cases the comment
is the first thing in the body of the function, and there's a blank line between
the end of the comment and the first statement. This is intended to be one of
those comments, although since the function is pretty small, I can see how it
also could look like it applies to the immediately proceeding statements too.

What is the normal policy for such comments? I'd rather leave this alone since
it aligns with how all the others are done in the file. Or should I just remove
the blank line for all instances?


> 
>>
>>> +
>>> +	pte = pte_mknoncont(pte);
>>> +	contpte_convert(mm, addr, ptep, pte);
>>> +}
>>> +EXPORT_SYMBOL(__contpte_try_unfold);
>>> +
>>> +pte_t contpte_ptep_get(pte_t *ptep, pte_t orig_pte)
>>> +{
>>> +	/*
>>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>>> +	 * of the contig range. We are guarranteed to be holding the PTL, so any
>>> +	 * contiguous range cannot be unfolded or otherwise modified under our
>>> +	 * feet.
>>> +	 */
>>
>> Nit: s/guarranteed/guaranteed/
> 
> ACK, I'll fix in next version.
> 
>>
>>> +
>>> +	pte_t pte;
>>> +	int i;
>>> +
>>> +	ptep = contpte_align_down(ptep);
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++) {
>>> +		pte = __ptep_get(ptep);
>>> +
>>> +		if (pte_dirty(pte))
>>> +			orig_pte = pte_mkdirty(orig_pte);
>>> +
>>> +		if (pte_young(pte))
>>> +			orig_pte = pte_mkyoung(orig_pte);
>>> +	}
>>> +
>>> +	return orig_pte;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_get);
>>> +
>>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>> +{
>>> +	/*
>>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>>> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
>>> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
>>> +	 * in the range are valid and have CONT_PTE set, that all pfns are
>>> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>>> +	 * If we find a pte that is not consistent, then we must be racing with
>>> +	 * an update so start again. If the target pte does not have CONT_PTE
>>> +	 * set then that is considered consistent on its own because it is not
>>> +	 * part of a contpte range.
>>> +	 */
>>> +
>>> +	pgprot_t orig_prot;
>>> +	unsigned long pfn;
>>> +	pte_t orig_pte;
>>> +	pgprot_t prot;
>>> +	pte_t *ptep;
>>> +	pte_t pte;
>>> +	int i;
>>> +
>>> +retry:
>>> +	orig_pte = __ptep_get(orig_ptep);
>>> +
>>> +	if (!pte_valid_cont(orig_pte))
>>> +		return orig_pte;
>>> +
>>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>>> +	ptep = contpte_align_down(orig_ptep);
>>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>> +		pte = __ptep_get(ptep);
>>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>> +
>>> +		if (!pte_valid_cont(pte) ||
>>> +		   pte_pfn(pte) != pfn ||
>>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>>> +			goto retry;
>>> +
>>> +		if (pte_dirty(pte))
>>> +			orig_pte = pte_mkdirty(orig_pte);
>>> +
>>> +		if (pte_young(pte))
>>> +			orig_pte = pte_mkyoung(orig_pte);
>>> +	}
>>> +
>>> +	return orig_pte;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
>>
>> I'm struggling to convince myself that this is safe in general, as it really
>> depends on how the caller will use this value. Which caller(s) actually care
>> about the access/dirty bits, given those could change at any time anyway?
> 
> I think your points below are valid, and agree we should try to make this work
> without needing access/dirty if possible. But can you elaborate on why you don't
> think it's safe?
> 
>>
>> I took a quick scan, and AFAICT:
> 
> Thanks for enumerating these; Saves me from having to refresh my memory :)
> 
>>
>> * For perf_get_pgtable_size(), we only care about whether the entry is valid
>>   and has the contig bit set. We could clean that up with a new interface, e.g.
>>   something like a new ptep_get_size_lockless().
>>
>> * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
>>   we look at the pte to start with, since we only care where we can logically
>>   write to the page at that point.
>>
>>   I see that we later follow up with:
>>
>>     with pte_val(pte) != pte_val(ptep_get(ptep)))
>>
>>   ... is that why we need ptep_get_lockless() to accumulate the access/dirty
>>   bits? So that shape of lockless-try...locked-compare sequence works?
>>
>> * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
>>   so this doesn' seem to matter.
>>
>> * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
>>   which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
>>
>> * For pte_range_none() the access/dirty bits don't matter.
>>
>> * For handle_pte_fault() I think we have the same shape of
>>   lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
>>   care about the acess/dirty bits before we reach the locked compare step.
>>
>> * For ptdump_pte_entry() I think it's arguable that we should continue to
>>   report the access/dirty bits separately for each PTE, as we have done until
>>   now, to give an accurate representation of the contents of the translation
>>   tables.
>>
>> * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
>>   swap entry, the access/dirty bits don't matter.
>>
>> So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
>> and IIUC that's only so that the locklessly-loaded pte value can be compared
>> with a subsequently locked-loaded entry (for which the access/dirty bits will
>> be accumulated). Have I understood that correctly?
> 
> Yes, I agree with what you are saying. My approach was to try to implement the
> existing APIs accurately though, the argument being that it reduces the chances
> of getting it wrong. But if you think the implementation is unsafe, then I guess
> it blows that out of the water...
> 
>>
>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>> bits, 
> 
> I think that would work - but will need to think a bit more on it.
> 
>> and leave ptep_get_lockless() only reading a single entry?
> 
> I think we will need to do something a bit less fragile. ptep_get() does collect
> the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
> we will likely want to rename the function and make its documentation explicit
> that it does not return those bits.
> 
> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
> 
> Of course if I could convince you the current implementation is safe, I might be
> able to sidestep this optimization until a later date?
> 
> Thanks,
> Ryan
> 
> 
>>
>> Thanks,
>> Mark.
>>
>>> +void contpte_set_ptes(struct mm_struct *mm, unsigned long addr,
>>> +					pte_t *ptep, pte_t pte, unsigned int nr)
>>> +{
>>> +	unsigned long next;
>>> +	unsigned long end;
>>> +	unsigned long pfn;
>>> +	pgprot_t prot;
>>> +
>>> +	/*
>>> +	 * The set_ptes() spec guarantees that when nr > 1, the initial state of
>>> +	 * all ptes is not-present. Therefore we never need to unfold or
>>> +	 * otherwise invalidate a range before we set the new ptes.
>>> +	 * contpte_set_ptes() should never be called for nr < 2.
>>> +	 */
>>> +	VM_WARN_ON(nr == 1);
>>> +
>>> +	if (!mm_is_user(mm))
>>> +		return __set_ptes(mm, addr, ptep, pte, nr);
>>> +
>>> +	end = addr + (nr << PAGE_SHIFT);
>>> +	pfn = pte_pfn(pte);
>>> +	prot = pte_pgprot(pte);
>>> +
>>> +	do {
>>> +		next = pte_cont_addr_end(addr, end);
>>> +		nr = (next - addr) >> PAGE_SHIFT;
>>> +		pte = pfn_pte(pfn, prot);
>>> +
>>> +		if (((addr | next | (pfn << PAGE_SHIFT)) & ~CONT_PTE_MASK) == 0)
>>> +			pte = pte_mkcont(pte);
>>> +		else
>>> +			pte = pte_mknoncont(pte);
>>> +
>>> +		__set_ptes(mm, addr, ptep, pte, nr);
>>> +
>>> +		addr = next;
>>> +		ptep += nr;
>>> +		pfn += nr;
>>> +
>>> +	} while (addr != end);
>>> +}
>>> +EXPORT_SYMBOL(contpte_set_ptes);
>>> +
>>> +int contpte_ptep_test_and_clear_young(struct vm_area_struct *vma,
>>> +					unsigned long addr, pte_t *ptep)
>>> +{
>>> +	/*
>>> +	 * ptep_clear_flush_young() technically requires us to clear the access
>>> +	 * flag for a _single_ pte. However, the core-mm code actually tracks
>>> +	 * access/dirty per folio, not per page. And since we only create a
>>> +	 * contig range when the range is covered by a single folio, we can get
>>> +	 * away with clearing young for the whole contig range here, so we avoid
>>> +	 * having to unfold.
>>> +	 */
>>> +
>>> +	int young = 0;
>>> +	int i;
>>> +
>>> +	ptep = contpte_align_down(ptep);
>>> +	addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +
>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> +		young |= __ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +	return young;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_test_and_clear_young);
>>> +
>>> +int contpte_ptep_clear_flush_young(struct vm_area_struct *vma,
>>> +					unsigned long addr, pte_t *ptep)
>>> +{
>>> +	int young;
>>> +
>>> +	young = contpte_ptep_test_and_clear_young(vma, addr, ptep);
>>> +
>>> +	if (young) {
>>> +		/*
>>> +		 * See comment in __ptep_clear_flush_young(); same rationale for
>>> +		 * eliding the trailing DSB applies here.
>>> +		 */
>>> +		addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +		__flush_tlb_range_nosync(vma, addr, addr + CONT_PTE_SIZE,
>>> +					 PAGE_SIZE, true, 3);
>>> +	}
>>> +
>>> +	return young;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_clear_flush_young);
>>> +
>>> +int contpte_ptep_set_access_flags(struct vm_area_struct *vma,
>>> +					unsigned long addr, pte_t *ptep,
>>> +					pte_t entry, int dirty)
>>> +{
>>> +	unsigned long start_addr;
>>> +	pte_t orig_pte;
>>> +	int i;
>>> +
>>> +	/*
>>> +	 * Gather the access/dirty bits for the contiguous range. If nothing has
>>> +	 * changed, its a noop.
>>> +	 */
>>> +	orig_pte = pte_mknoncont(ptep_get(ptep));
>>> +	if (pte_val(orig_pte) == pte_val(entry))
>>> +		return 0;
>>> +
>>> +	/*
>>> +	 * We can fix up access/dirty bits without having to unfold the contig
>>> +	 * range. But if the write bit is changing, we must unfold.
>>> +	 */
>>> +	if (pte_write(orig_pte) == pte_write(entry)) {
>>> +		/*
>>> +		 * For HW access management, we technically only need to update
>>> +		 * the flag on a single pte in the range. But for SW access
>>> +		 * management, we need to update all the ptes to prevent extra
>>> +		 * faults. Avoid per-page tlb flush in __ptep_set_access_flags()
>>> +		 * and instead flush the whole range at the end.
>>> +		 */
>>> +		ptep = contpte_align_down(ptep);
>>> +		start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>> +
>>> +		for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE)
>>> +			__ptep_set_access_flags(vma, addr, ptep, entry, 0);
>>> +
>>> +		if (dirty)
>>> +			__flush_tlb_range(vma, start_addr, addr,
>>> +							PAGE_SIZE, true, 3);
>>> +	} else {
>>> +		__contpte_try_unfold(vma->vm_mm, addr, ptep, orig_pte);
>>> +		__ptep_set_access_flags(vma, addr, ptep, entry, dirty);
>>> +	}
>>> +
>>> +	return 1;
>>> +}
>>> +EXPORT_SYMBOL(contpte_ptep_set_access_flags);
>>> -- 
>>> 2.25.1
>>>
>

Ryan Roberts Feb. 12, 2024, 3:34 p.m. UTC | #7

On 12/02/2024 15:26, David Hildenbrand wrote:
> On 12.02.24 15:45, Ryan Roberts wrote:
>> On 12/02/2024 13:54, David Hildenbrand wrote:
>>>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>>>> bits,
>>>>
>>>> I think that would work - but will need to think a bit more on it.
>>>>
>>>>> and leave ptep_get_lockless() only reading a single entry?
>>>>
>>>> I think we will need to do something a bit less fragile. ptep_get() does
>>>> collect
>>>> the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
>>>> we will likely want to rename the function and make its documentation explicit
>>>> that it does not return those bits.
>>>>
>>>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>>>
>>>> Of course if I could convince you the current implementation is safe, I
>>>> might be
>>>> able to sidestep this optimization until a later date?
>>>
>>> As discussed (and pointed out abive), there might be quite some callsites where
>>> we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
>>> used nowadays.
>>>
>>> One way to approach that I had in mind was having an explicit interface:
>>>
>>> ptep_get()
>>> ptep_get_uptodate()
>>> ptep_get_lockless()
>>> ptep_get_lockless_uptodate()
>>
>> Yes, I like the direction of this. I guess we anticipate that call sites
>> requiring the "_uptodate" variant will be the minority so it makes sense to use
>> the current names for the "_not_uptodate" variants? But to do a slow migration,
>> it might be better/safer to have the weaker variant use the new name - that
>> would allow us to downgrade one at a time?
> 
> Yes, I was primarily struggling with names. Likely it makes sense to either have
> two completely new function names, or use the new name only for the "faster but
> less precise" variant.
> 
>>
>>>
>>> Especially the last one might not be needed.
>> I've done a scan through the code and agree with Mark's original conclusions.
>> Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
>> access/dirty info. So I think I could migrate everything to the weaker variant
>> fairly easily.
>>
>>>
>>> Futher, "uptodate" might not be the best choice because of PageUptodate() and
>>> friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.
>>
>> Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
>> "_nosync"?
> 
> I could live with
> 
> ptep_get_sync()
> ptep_get_nosync()
> 
> with proper documentation :)

but could you live with:

ptep_get()
ptep_get_nosync()
ptep_get_lockless_nosync()

?

So leave the "slower, more precise" version with the existing name.

> 
> I don't think we use "_sync" / "_nosync" in the context of pte operations yet.
> 
> Well, there seems to be "__arm_v7s_pte_sync" in iommu code, bit at least in core
> code nothing jumped at me.
>

David Hildenbrand Feb. 12, 2024, 4:24 p.m. UTC | #8

On 12.02.24 16:34, Ryan Roberts wrote:
> On 12/02/2024 15:26, David Hildenbrand wrote:
>> On 12.02.24 15:45, Ryan Roberts wrote:
>>> On 12/02/2024 13:54, David Hildenbrand wrote:
>>>>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>>>>> bits,
>>>>>
>>>>> I think that would work - but will need to think a bit more on it.
>>>>>
>>>>>> and leave ptep_get_lockless() only reading a single entry?
>>>>>
>>>>> I think we will need to do something a bit less fragile. ptep_get() does
>>>>> collect
>>>>> the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
>>>>> we will likely want to rename the function and make its documentation explicit
>>>>> that it does not return those bits.
>>>>>
>>>>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>>>>
>>>>> Of course if I could convince you the current implementation is safe, I
>>>>> might be
>>>>> able to sidestep this optimization until a later date?
>>>>
>>>> As discussed (and pointed out abive), there might be quite some callsites where
>>>> we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
>>>> used nowadays.
>>>>
>>>> One way to approach that I had in mind was having an explicit interface:
>>>>
>>>> ptep_get()
>>>> ptep_get_uptodate()
>>>> ptep_get_lockless()
>>>> ptep_get_lockless_uptodate()
>>>
>>> Yes, I like the direction of this. I guess we anticipate that call sites
>>> requiring the "_uptodate" variant will be the minority so it makes sense to use
>>> the current names for the "_not_uptodate" variants? But to do a slow migration,
>>> it might be better/safer to have the weaker variant use the new name - that
>>> would allow us to downgrade one at a time?
>>
>> Yes, I was primarily struggling with names. Likely it makes sense to either have
>> two completely new function names, or use the new name only for the "faster but
>> less precise" variant.
>>
>>>
>>>>
>>>> Especially the last one might not be needed.
>>> I've done a scan through the code and agree with Mark's original conclusions.
>>> Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
>>> access/dirty info. So I think I could migrate everything to the weaker variant
>>> fairly easily.
>>>
>>>>
>>>> Futher, "uptodate" might not be the best choice because of PageUptodate() and
>>>> friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.
>>>
>>> Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
>>> "_nosync"?
>>
>> I could live with
>>
>> ptep_get_sync()
>> ptep_get_nosync()
>>
>> with proper documentation :)
> 
> but could you live with:
> 
> ptep_get()
> ptep_get_nosync()
> ptep_get_lockless_nosync()
> 
> ?
> 
> So leave the "slower, more precise" version with the existing name.

Sure.

Ryan Roberts Feb. 12, 2024, 8:38 p.m. UTC | #9

[...]

>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>> +{
>>>> +	/*
>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>> +	 */
>>>> +	return mm != &init_mm;
>>>> +}
>>>
>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>> that while it is live, and I'm not sure if that needs any special handling.
>>
>> Well we never need this function in the hot (order-0 folio) path, so I think I
>> could add a check for efi_mm here with performance implication. It's probably
>> safest to explicitly exclude it? What do you think?
> 
> Oops: This should have read "I think I could add a check for efi_mm here
> *without* performance implication"

It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do this:

return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);

Is that acceptable? This is my preference, but nothing else outside of efi
references this symbol currently.

Or perhaps I can convince myself that its safe to treat efi_mm like userspace.
There are a couple of things that need to be garanteed for it to be safe:

  - The PFNs of present ptes either need to have an associated struct page or
    need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
    pte_mkdevmap())

  - Live mappings must either be static (no changes that could cause fold/unfold
    while live) or the system must be able to tolerate a temporary fault

Mark suggests efi_mm is not manipulated while live, so that meets the latter
requirement, but I'm not sure about the former?

Thanks,
Ryan

David Hildenbrand Feb. 13, 2024, 10:01 a.m. UTC | #10

On 12.02.24 21:38, Ryan Roberts wrote:
> [...]
> 
>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>> +{
>>>>> +	/*
>>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>>> +	 */
>>>>> +	return mm != &init_mm;
>>>>> +}
>>>>
>>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>
>>> Well we never need this function in the hot (order-0 folio) path, so I think I
>>> could add a check for efi_mm here with performance implication. It's probably
>>> safest to explicitly exclude it? What do you think?
>>
>> Oops: This should have read "I think I could add a check for efi_mm here
>> *without* performance implication"
> 
> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do this:
> 
> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);

Please use all the lines you need ;)

if (IS_ENABLED(CONFIG_EFI) && unlikely(mm == &efi_mm))
	return false;
return mm != &init_mm;

> 
> Is that acceptable? This is my preference, but nothing else outside of efi
> references this symbol currently.

We could also mark MMs in some way to be special.

return mm->is_user;

Then it's easy to extend.

Mark Rutland Feb. 13, 2024, 12:02 p.m. UTC | #11

On Mon, Feb 12, 2024 at 12:59:57PM +0000, Ryan Roberts wrote:
> On 12/02/2024 12:00, Mark Rutland wrote:
> > Hi Ryan,

[...]

> >> +static inline void set_pte(pte_t *ptep, pte_t pte)
> >> +{
> >> +	/*
> >> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
> >> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
> >> +	 * this should never even be called. Regardless do our best to service
> >> +	 * any call and emit a warning if there is any attempt to set a pte on
> >> +	 * top of an existing contig range.
> >> +	 */
> >> +	pte_t orig_pte = __ptep_get(ptep);
> >> +
> >> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
> >> +	__set_pte(ptep, pte_mknoncont(pte));
> >> +}
> >> +
> >> +#define set_ptes set_ptes
> >> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
> >> +				pte_t *ptep, pte_t pte, unsigned int nr)
> >> +{
> >> +	pte = pte_mknoncont(pte);
> > 
> > Why do we have to clear the contiguous bit here? Is that for the same reason as
> > set_pte(), or do we expect callers to legitimately call this with the
> > contiguous bit set in 'pte'?
> > 
> > I think you explained this to me in-person, and IIRC we don't expect callers to
> > go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
> > have to clear it here to defer the decision of whether to set/clear it when
> > modifying entries. It would be nice if we could have a description of why/when
> > we need to clear this, e.g. in the 'public API' comment block above.
> 
> Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
> private to the architecture code and is never set directly by core code. If the
> public API ever receives a pte that happens to have the PTE_CONT bit set, it
> would be bad news if we then accidentally set that in the pgtable.
> 
> Ideally, we would just uncondidtionally clear the bit before a getter returns
> the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
> way, the code code is guarranteed never to see a pte with the PTE_CONT bit set
> and can therefore never accidentally pass such a pte into a setter function.
> However, there is existing functionality that relies on being able to get a pte,
> then pass it to pte_leaf_size(), and arch function that checks the PTE_CONT bit
> to determine how big the leaf is. This is used in perf_get_pgtable_size().
> 
> So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
> decided to allow PTE_CONT to leak through the getters and instead
> unconditionally clear the bit when a pte is passed to any of the setters.
> 
> I'll add a (slightly less verbose) comment as you suggest.

Great, thanks!

[...]

> >> +static inline bool mm_is_user(struct mm_struct *mm)
> >> +{
> >> +	/*
> >> +	 * Don't attempt to apply the contig bit to kernel mappings, because
> >> +	 * dynamically adding/removing the contig bit can cause page faults.
> >> +	 * These racing faults are ok for user space, since they get serialized
> >> +	 * on the PTL. But kernel mappings can't tolerate faults.
> >> +	 */
> >> +	return mm != &init_mm;
> >> +}
> > 
> > We also have the efi_mm as a non-user mm, though I don't think we manipulate
> > that while it is live, and I'm not sure if that needs any special handling.
> 
> Well we never need this function in the hot (order-0 folio) path, so I think I
> could add a check for efi_mm here with performance implication. It's probably
> safest to explicitly exclude it? What do you think?

That sounds ok to me.

Otherwise, if we (somehow) know that we avoid calling this at all with an EFI
mm (e.g. because of the way we construct that), I'd be happy with a comment.

Probably best to Cc Ard for whatever we do here.

> >> +static inline pte_t *contpte_align_down(pte_t *ptep)
> >> +{
> >> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
> > 
> > I think this can be:
> > 
> > static inline pte_t *contpte_align_down(pte_t *ptep)
> > {
> > 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
> > }
> 
> Yep - that's much less ugly - thanks!
> 
> > 
> >> +
> >> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
> >> +			    pte_t *ptep, pte_t pte)
> >> +{
> >> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
> >> +	unsigned long start_addr;
> >> +	pte_t *start_ptep;
> >> +	int i;
> >> +
> >> +	start_ptep = ptep = contpte_align_down(ptep);
> >> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
> >> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
> >> +
> >> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
> >> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> >> +
> >> +		if (pte_dirty(ptent))
> >> +			pte = pte_mkdirty(pte);
> >> +
> >> +		if (pte_young(ptent))
> >> +			pte = pte_mkyoung(pte);
> >> +	}
> > 
> > Not a big deal either way, but I wonder if it makes more sense to accumulate
> > the 'ptent' dirty/young values, then modify 'pte' once, i.e.
> > 
> > 	bool dirty = false, young = false;
> > 
> > 	for (...) {
> > 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
> > 		dirty |= pte_dirty(ptent);
> > 		young |= pte_young(ptent);
> > 	}
> > 
> > 	if (dirty)
> > 		pte_mkdirty(pte);
> > 	if (young)
> > 		pte_mkyoung(pte);
> > 
> > I suspect that might generate slightly better code, but I'm also happy with the
> > current form if people thnk that's more legible (I have no strong feelings
> > either way).
> 
> I kept it this way, because its the same pattern used in arm64's hugetlbpage.c.
> We also had the same comment against David's batching patches recently, and he
> opted to stick with the former version:
> 
> https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/
> 
> So I'm inclined to leave it as is, since you're not insisting :)

That rationale is reasonable, and I'm fine with this as-is.

[...]

> >> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
> >> +{
> >> +	/*
> >> +	 * Gather access/dirty bits, which may be populated in any of the ptes
> >> +	 * of the contig range. We may not be holding the PTL, so any contiguous
> >> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
> >> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
> >> +	 * in the range are valid and have CONT_PTE set, that all pfns are
> >> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
> >> +	 * If we find a pte that is not consistent, then we must be racing with
> >> +	 * an update so start again. If the target pte does not have CONT_PTE
> >> +	 * set then that is considered consistent on its own because it is not
> >> +	 * part of a contpte range.
> >> +	 */
> >> +
> >> +	pgprot_t orig_prot;
> >> +	unsigned long pfn;
> >> +	pte_t orig_pte;
> >> +	pgprot_t prot;
> >> +	pte_t *ptep;
> >> +	pte_t pte;
> >> +	int i;
> >> +
> >> +retry:
> >> +	orig_pte = __ptep_get(orig_ptep);
> >> +
> >> +	if (!pte_valid_cont(orig_pte))
> >> +		return orig_pte;
> >> +
> >> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
> >> +	ptep = contpte_align_down(orig_ptep);
> >> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
> >> +
> >> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
> >> +		pte = __ptep_get(ptep);
> >> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
> >> +
> >> +		if (!pte_valid_cont(pte) ||
> >> +		   pte_pfn(pte) != pfn ||
> >> +		   pgprot_val(prot) != pgprot_val(orig_prot))
> >> +			goto retry;
> >> +
> >> +		if (pte_dirty(pte))
> >> +			orig_pte = pte_mkdirty(orig_pte);
> >> +
> >> +		if (pte_young(pte))
> >> +			orig_pte = pte_mkyoung(orig_pte);
> >> +	}
> >> +
> >> +	return orig_pte;
> >> +}
> >> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
> > 
> > I'm struggling to convince myself that this is safe in general, as it really
> > depends on how the caller will use this value. Which caller(s) actually care
> > about the access/dirty bits, given those could change at any time anyway?
> 
> I think your points below are valid, and agree we should try to make this work
> without needing access/dirty if possible. But can you elaborate on why you don't
> think it's safe?

Having mulled this over, I think it is safe as-is, and I was being overly
cautious.

I had a general fear of potential problems stemming from the fact that (a) the
accumulation of access/dirty bits isn't atomic and (b) the loop is unbounded.
From looking at how this is used today, I think (a) is essentially the same as
reading a stale non-contiguous entry, and I'm being overly cautious there. For
(b), I think that's largely a performance concern and the would only retry
indefinitely in the presence of mis-programmed entries or consistent racing
with a writer under heavy contention.

I think it's still desirable to avoid the accumulation in most cases (to avoid
redundant work and to minimize the potential for unbounded retries), but I'm
happy with that being a follow-up improvement.

> > I took a quick scan, and AFAICT:
> 
> Thanks for enumerating these; Saves me from having to refresh my memory :)
> > 
> > * For perf_get_pgtable_size(), we only care about whether the entry is valid
> >   and has the contig bit set. We could clean that up with a new interface, e.g.
> >   something like a new ptep_get_size_lockless().
> > 
> > * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
> >   we look at the pte to start with, since we only care where we can logically
> >   write to the page at that point.
> > 
> >   I see that we later follow up with:
> > 
> >     with pte_val(pte) != pte_val(ptep_get(ptep)))
> > 
> >   ... is that why we need ptep_get_lockless() to accumulate the access/dirty
> >   bits? So that shape of lockless-try...locked-compare sequence works?
> > 
> > * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
> >   so this doesn' seem to matter.
> > 
> > * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
> >   which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
> > 
> > * For pte_range_none() the access/dirty bits don't matter.
> > 
> > * For handle_pte_fault() I think we have the same shape of
> >   lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
> >   care about the acess/dirty bits before we reach the locked compare step.
> > 
> > * For ptdump_pte_entry() I think it's arguable that we should continue to
> >   report the access/dirty bits separately for each PTE, as we have done until
> >   now, to give an accurate representation of the contents of the translation
> >   tables.
> > 
> > * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
> >   swap entry, the access/dirty bits don't matter.
> > 
> > So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
> > and IIUC that's only so that the locklessly-loaded pte value can be compared
> > with a subsequently locked-loaded entry (for which the access/dirty bits will
> > be accumulated). Have I understood that correctly?
> 
> Yes, I agree with what you are saying. My approach was to try to implement the
> existing APIs accurately though, the argument being that it reduces the chances
> of getting it wrong. But if you think the implementation is unsafe, then I guess
> it blows that out of the water...

I think your approach makes sense, and as above I'm happy to defer the API
changes/additions to avoid the accumulation of access/dirty bits.

> > If so, I wonder if we could instead do that comparison modulo the access/dirty
> > bits, 
> 
> I think that would work - but will need to think a bit more on it.
> 
> > and leave ptep_get_lockless() only reading a single entry?
> 
> I think we will need to do something a bit less fragile. ptep_get() does collect
> the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
> we will likely want to rename the function and make its documentation explicit
> that it does not return those bits.
> 
> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
> 
> Of course if I could convince you the current implementation is safe, I might be
> able to sidestep this optimization until a later date?

Yep. :)

Mark.

Ryan Roberts Feb. 13, 2024, 12:06 p.m. UTC | #12

On 12/02/2024 20:38, Ryan Roberts wrote:
> [...]
> 
>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>> +{
>>>>> +	/*
>>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>>> +	 */
>>>>> +	return mm != &init_mm;
>>>>> +}
>>>>
>>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>
>>> Well we never need this function in the hot (order-0 folio) path, so I think I
>>> could add a check for efi_mm here with performance implication. It's probably
>>> safest to explicitly exclude it? What do you think?
>>
>> Oops: This should have read "I think I could add a check for efi_mm here
>> *without* performance implication"
> 
> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do this:
> 
> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
> 
> Is that acceptable? This is my preference, but nothing else outside of efi
> references this symbol currently.
> 
> Or perhaps I can convince myself that its safe to treat efi_mm like userspace.
> There are a couple of things that need to be garanteed for it to be safe:
> 
>   - The PFNs of present ptes either need to have an associated struct page or
>     need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>     pte_mkdevmap())
> 
>   - Live mappings must either be static (no changes that could cause fold/unfold
>     while live) or the system must be able to tolerate a temporary fault
> 
> Mark suggests efi_mm is not manipulated while live, so that meets the latter
> requirement, but I'm not sure about the former?

I've gone through all the efi code, and conclude that, as Mark suggests, the
mappings are indeed static. And additionally, the ptes are populated using only
the _private_ ptep API, so there is no issue here. As just discussed with Mark,
my prefereence is to not make any changes to code, and just add a comment
describing why efi_mm is safe.

Details:

* Registered with ptdump
    * ptep_get_lockless()
* efi_create_mapping -> create_pgd_mapping … -> init_pte:
    * __ptep_get()
    * __set_pte()
* efi_memattr_apply_permissions -> efi_set_mapping_permissions … -> set_permissions
    * __ptep_get()
    * __set_pte()

Thanks,
Ryan

> 
> Thanks,
> Ryan
>

David Hildenbrand Feb. 13, 2024, 12:19 p.m. UTC | #13

On 13.02.24 13:06, Ryan Roberts wrote:
> On 12/02/2024 20:38, Ryan Roberts wrote:
>> [...]
>>
>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>> +{
>>>>>> +	/*
>>>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>>>> +	 */
>>>>>> +	return mm != &init_mm;
>>>>>> +}
>>>>>
>>>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>
>>>> Well we never need this function in the hot (order-0 folio) path, so I think I
>>>> could add a check for efi_mm here with performance implication. It's probably
>>>> safest to explicitly exclude it? What do you think?
>>>
>>> Oops: This should have read "I think I could add a check for efi_mm here
>>> *without* performance implication"
>>
>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do this:
>>
>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>
>> Is that acceptable? This is my preference, but nothing else outside of efi
>> references this symbol currently.
>>
>> Or perhaps I can convince myself that its safe to treat efi_mm like userspace.
>> There are a couple of things that need to be garanteed for it to be safe:
>>
>>    - The PFNs of present ptes either need to have an associated struct page or
>>      need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>      pte_mkdevmap())
>>
>>    - Live mappings must either be static (no changes that could cause fold/unfold
>>      while live) or the system must be able to tolerate a temporary fault
>>
>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>> requirement, but I'm not sure about the former?
> 
> I've gone through all the efi code, and conclude that, as Mark suggests, the
> mappings are indeed static. And additionally, the ptes are populated using only
> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
> my prefereence is to not make any changes to code, and just add a comment
> describing why efi_mm is safe.
> 
> Details:
> 
> * Registered with ptdump
>      * ptep_get_lockless()
> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>      * __ptep_get()
>      * __set_pte()
> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … -> set_permissions
>      * __ptep_get()
>      * __set_pte()

Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via 
the "official" APIs.

Ryan Roberts Feb. 13, 2024, 1:03 p.m. UTC | #14

On 13/02/2024 12:02, Mark Rutland wrote:
> On Mon, Feb 12, 2024 at 12:59:57PM +0000, Ryan Roberts wrote:
>> On 12/02/2024 12:00, Mark Rutland wrote:
>>> Hi Ryan,
> 
> [...]
> 
>>>> +static inline void set_pte(pte_t *ptep, pte_t pte)
>>>> +{
>>>> +	/*
>>>> +	 * We don't have the mm or vaddr so cannot unfold contig entries (since
>>>> +	 * it requires tlb maintenance). set_pte() is not used in core code, so
>>>> +	 * this should never even be called. Regardless do our best to service
>>>> +	 * any call and emit a warning if there is any attempt to set a pte on
>>>> +	 * top of an existing contig range.
>>>> +	 */
>>>> +	pte_t orig_pte = __ptep_get(ptep);
>>>> +
>>>> +	WARN_ON_ONCE(pte_valid_cont(orig_pte));
>>>> +	__set_pte(ptep, pte_mknoncont(pte));
>>>> +}
>>>> +
>>>> +#define set_ptes set_ptes
>>>> +static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>> +				pte_t *ptep, pte_t pte, unsigned int nr)
>>>> +{
>>>> +	pte = pte_mknoncont(pte);
>>>
>>> Why do we have to clear the contiguous bit here? Is that for the same reason as
>>> set_pte(), or do we expect callers to legitimately call this with the
>>> contiguous bit set in 'pte'?
>>>
>>> I think you explained this to me in-person, and IIRC we don't expect callers to
>>> go set the bit themselves, but since it 'leaks' out to them via __ptep_get() we
>>> have to clear it here to defer the decision of whether to set/clear it when
>>> modifying entries. It would be nice if we could have a description of why/when
>>> we need to clear this, e.g. in the 'public API' comment block above.
>>
>> Yes, I think you've got it, but just to ram home the point: The PTE_CONT bit is
>> private to the architecture code and is never set directly by core code. If the
>> public API ever receives a pte that happens to have the PTE_CONT bit set, it
>> would be bad news if we then accidentally set that in the pgtable.
>>
>> Ideally, we would just uncondidtionally clear the bit before a getter returns
>> the pte (e.g. ptep_get(), ptep_get_lockless(), ptep_get_and_clear(), ...). That
>> way, the code code is guarranteed never to see a pte with the PTE_CONT bit set
>> and can therefore never accidentally pass such a pte into a setter function.
>> However, there is existing functionality that relies on being able to get a pte,
>> then pass it to pte_leaf_size(), and arch function that checks the PTE_CONT bit
>> to determine how big the leaf is. This is used in perf_get_pgtable_size().
>>
>> So to allow perf_get_pgtable_size() to continue to see the "real" page size, I
>> decided to allow PTE_CONT to leak through the getters and instead
>> unconditionally clear the bit when a pte is passed to any of the setters.
>>
>> I'll add a (slightly less verbose) comment as you suggest.
> 
> Great, thanks!
> 
> [...]
> 
>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>> +{
>>>> +	/*
>>>> +	 * Don't attempt to apply the contig bit to kernel mappings, because
>>>> +	 * dynamically adding/removing the contig bit can cause page faults.
>>>> +	 * These racing faults are ok for user space, since they get serialized
>>>> +	 * on the PTL. But kernel mappings can't tolerate faults.
>>>> +	 */
>>>> +	return mm != &init_mm;
>>>> +}
>>>
>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>> that while it is live, and I'm not sure if that needs any special handling.
>>
>> Well we never need this function in the hot (order-0 folio) path, so I think I
>> could add a check for efi_mm here with performance implication. It's probably
>> safest to explicitly exclude it? What do you think?
> 
> That sounds ok to me.
> 
> Otherwise, if we (somehow) know that we avoid calling this at all with an EFI
> mm (e.g. because of the way we construct that), I'd be happy with a comment.

We crossed streams - as per my other email, I'm confident that this is safe so
will just add a comment.

> 
> Probably best to Cc Ard for whatever we do here.

Ard is already on CC.

> 
>>>> +static inline pte_t *contpte_align_down(pte_t *ptep)
>>>> +{
>>>> +	return (pte_t *)(ALIGN_DOWN((unsigned long)ptep >> 3, CONT_PTES) << 3);
>>>
>>> I think this can be:
>>>
>>> static inline pte_t *contpte_align_down(pte_t *ptep)
>>> {
>>> 	return PTR_ALIGN_DOWN(ptep, sizeof(*ptep) * CONT_PTES);
>>> }
>>
>> Yep - that's much less ugly - thanks!
>>
>>>
>>>> +
>>>> +static void contpte_convert(struct mm_struct *mm, unsigned long addr,
>>>> +			    pte_t *ptep, pte_t pte)
>>>> +{
>>>> +	struct vm_area_struct vma = TLB_FLUSH_VMA(mm, 0);
>>>> +	unsigned long start_addr;
>>>> +	pte_t *start_ptep;
>>>> +	int i;
>>>> +
>>>> +	start_ptep = ptep = contpte_align_down(ptep);
>>>> +	start_addr = addr = ALIGN_DOWN(addr, CONT_PTE_SIZE);
>>>> +	pte = pfn_pte(ALIGN_DOWN(pte_pfn(pte), CONT_PTES), pte_pgprot(pte));
>>>> +
>>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, addr += PAGE_SIZE) {
>>>> +		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>>> +
>>>> +		if (pte_dirty(ptent))
>>>> +			pte = pte_mkdirty(pte);
>>>> +
>>>> +		if (pte_young(ptent))
>>>> +			pte = pte_mkyoung(pte);
>>>> +	}
>>>
>>> Not a big deal either way, but I wonder if it makes more sense to accumulate
>>> the 'ptent' dirty/young values, then modify 'pte' once, i.e.
>>>
>>> 	bool dirty = false, young = false;
>>>
>>> 	for (...) {
>>> 		pte_t ptent = __ptep_get_and_clear(mm, addr, ptep);
>>> 		dirty |= pte_dirty(ptent);
>>> 		young |= pte_young(ptent);
>>> 	}
>>>
>>> 	if (dirty)
>>> 		pte_mkdirty(pte);
>>> 	if (young)
>>> 		pte_mkyoung(pte);
>>>
>>> I suspect that might generate slightly better code, but I'm also happy with the
>>> current form if people thnk that's more legible (I have no strong feelings
>>> either way).
>>
>> I kept it this way, because its the same pattern used in arm64's hugetlbpage.c.
>> We also had the same comment against David's batching patches recently, and he
>> opted to stick with the former version:
>>
>> https://lore.kernel.org/linux-mm/d83309fa-4daa-430f-ae52-4e72162bca9a@redhat.com/
>>
>> So I'm inclined to leave it as is, since you're not insisting :)
> 
> That rationale is reasonable, and I'm fine with this as-is.
> 
> [...]
> 
>>>> +pte_t contpte_ptep_get_lockless(pte_t *orig_ptep)
>>>> +{
>>>> +	/*
>>>> +	 * Gather access/dirty bits, which may be populated in any of the ptes
>>>> +	 * of the contig range. We may not be holding the PTL, so any contiguous
>>>> +	 * range may be unfolded/modified/refolded under our feet. Therefore we
>>>> +	 * ensure we read a _consistent_ contpte range by checking that all ptes
>>>> +	 * in the range are valid and have CONT_PTE set, that all pfns are
>>>> +	 * contiguous and that all pgprots are the same (ignoring access/dirty).
>>>> +	 * If we find a pte that is not consistent, then we must be racing with
>>>> +	 * an update so start again. If the target pte does not have CONT_PTE
>>>> +	 * set then that is considered consistent on its own because it is not
>>>> +	 * part of a contpte range.
>>>> +	 */
>>>> +
>>>> +	pgprot_t orig_prot;
>>>> +	unsigned long pfn;
>>>> +	pte_t orig_pte;
>>>> +	pgprot_t prot;
>>>> +	pte_t *ptep;
>>>> +	pte_t pte;
>>>> +	int i;
>>>> +
>>>> +retry:
>>>> +	orig_pte = __ptep_get(orig_ptep);
>>>> +
>>>> +	if (!pte_valid_cont(orig_pte))
>>>> +		return orig_pte;
>>>> +
>>>> +	orig_prot = pte_pgprot(pte_mkold(pte_mkclean(orig_pte)));
>>>> +	ptep = contpte_align_down(orig_ptep);
>>>> +	pfn = pte_pfn(orig_pte) - (orig_ptep - ptep);
>>>> +
>>>> +	for (i = 0; i < CONT_PTES; i++, ptep++, pfn++) {
>>>> +		pte = __ptep_get(ptep);
>>>> +		prot = pte_pgprot(pte_mkold(pte_mkclean(pte)));
>>>> +
>>>> +		if (!pte_valid_cont(pte) ||
>>>> +		   pte_pfn(pte) != pfn ||
>>>> +		   pgprot_val(prot) != pgprot_val(orig_prot))
>>>> +			goto retry;
>>>> +
>>>> +		if (pte_dirty(pte))
>>>> +			orig_pte = pte_mkdirty(orig_pte);
>>>> +
>>>> +		if (pte_young(pte))
>>>> +			orig_pte = pte_mkyoung(orig_pte);
>>>> +	}
>>>> +
>>>> +	return orig_pte;
>>>> +}
>>>> +EXPORT_SYMBOL(contpte_ptep_get_lockless);
>>>
>>> I'm struggling to convince myself that this is safe in general, as it really
>>> depends on how the caller will use this value. Which caller(s) actually care
>>> about the access/dirty bits, given those could change at any time anyway?
>>
>> I think your points below are valid, and agree we should try to make this work
>> without needing access/dirty if possible. But can you elaborate on why you don't
>> think it's safe?
> 
> Having mulled this over, I think it is safe as-is, and I was being overly
> cautious.
> 
> I had a general fear of potential problems stemming from the fact that (a) the
> accumulation of access/dirty bits isn't atomic and (b) the loop is unbounded.
> From looking at how this is used today, I think (a) is essentially the same as
> reading a stale non-contiguous entry, and I'm being overly cautious there. For
> (b), I think that's largely a performance concern and the would only retry
> indefinitely in the presence of mis-programmed entries or consistent racing
> with a writer under heavy contention.
> 
> I think it's still desirable to avoid the accumulation in most cases (to avoid
> redundant work and to minimize the potential for unbounded retries), but I'm
> happy with that being a follow-up improvement.

Great! I'll do the conversion to ptep_get_lockless_nosync() as a follow up series.

> 
>>> I took a quick scan, and AFAICT:
>>
>> Thanks for enumerating these; Saves me from having to refresh my memory :)
>>>
>>> * For perf_get_pgtable_size(), we only care about whether the entry is valid
>>>   and has the contig bit set. We could clean that up with a new interface, e.g.
>>>   something like a new ptep_get_size_lockless().
>>>
>>> * For gup_pte_range(), I'm not sure we actually need the access/dirty bits when
>>>   we look at the pte to start with, since we only care where we can logically
>>>   write to the page at that point.
>>>
>>>   I see that we later follow up with:
>>>
>>>     with pte_val(pte) != pte_val(ptep_get(ptep)))
>>>
>>>   ... is that why we need ptep_get_lockless() to accumulate the access/dirty
>>>   bits? So that shape of lockless-try...locked-compare sequence works?
>>>
>>> * For huge_pte_alloc(), arm64 doesn't select CONFIG_ARCH_WANT_GENERAL_HUGETLB,
>>>   so this doesn' seem to matter.
>>>
>>> * For __collapse_huge_page_swapin(), we only care if the pte is a swap pte,
>>>   which means the pte isn't valid, and we'll return the orig_pte as-is anyway.
>>>
>>> * For pte_range_none() the access/dirty bits don't matter.
>>>
>>> * For handle_pte_fault() I think we have the same shape of
>>>   lockless-try...locked-compare sequence as for gup_pte_range(), where we don't
>>>   care about the acess/dirty bits before we reach the locked compare step.
>>>
>>> * For ptdump_pte_entry() I think it's arguable that we should continue to
>>>   report the access/dirty bits separately for each PTE, as we have done until
>>>   now, to give an accurate representation of the contents of the translation
>>>   tables.
>>>
>>> * For swap_vma_readahead() and unuse_pte_range() we only care if the PTE is a
>>>   swap entry, the access/dirty bits don't matter.
>>>
>>> So AFAICT this only really matters for gup_pte_range() and handle_pte_fault(),
>>> and IIUC that's only so that the locklessly-loaded pte value can be compared
>>> with a subsequently locked-loaded entry (for which the access/dirty bits will
>>> be accumulated). Have I understood that correctly?
>>
>> Yes, I agree with what you are saying. My approach was to try to implement the
>> existing APIs accurately though, the argument being that it reduces the chances
>> of getting it wrong. But if you think the implementation is unsafe, then I guess
>> it blows that out of the water...
> 
> I think your approach makes sense, and as above I'm happy to defer the API
> changes/additions to avoid the accumulation of access/dirty bits.
> 
>>> If so, I wonder if we could instead do that comparison modulo the access/dirty
>>> bits, 
>>
>> I think that would work - but will need to think a bit more on it.
>>
>>> and leave ptep_get_lockless() only reading a single entry?
>>
>> I think we will need to do something a bit less fragile. ptep_get() does collect
>> the access/dirty bits so its confusing if ptep_get_lockless() doesn't IMHO. So
>> we will likely want to rename the function and make its documentation explicit
>> that it does not return those bits.
>>
>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>
>> Of course if I could convince you the current implementation is safe, I might be
>> able to sidestep this optimization until a later date?
> 
> Yep. :)
> 
> Mark.

Ryan Roberts Feb. 13, 2024, 1:06 p.m. UTC | #15

On 13/02/2024 12:19, David Hildenbrand wrote:
> On 13.02.24 13:06, Ryan Roberts wrote:
>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>> [...]
>>>
>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>> +{
>>>>>>> +    /*
>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>>>>> +     * These racing faults are ok for user space, since they get serialized
>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>> +     */
>>>>>>> +    return mm != &init_mm;
>>>>>>> +}
>>>>>>
>>>>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>>
>>>>> Well we never need this function in the hot (order-0 folio) path, so I think I
>>>>> could add a check for efi_mm here with performance implication. It's probably
>>>>> safest to explicitly exclude it? What do you think?
>>>>
>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>> *without* performance implication"
>>>
>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
>>> this:
>>>
>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>
>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>> references this symbol currently.
>>>
>>> Or perhaps I can convince myself that its safe to treat efi_mm like userspace.
>>> There are a couple of things that need to be garanteed for it to be safe:
>>>
>>>    - The PFNs of present ptes either need to have an associated struct page or
>>>      need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>      pte_mkdevmap())
>>>
>>>    - Live mappings must either be static (no changes that could cause
>>> fold/unfold
>>>      while live) or the system must be able to tolerate a temporary fault
>>>
>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>> requirement, but I'm not sure about the former?
>>
>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>> mappings are indeed static. And additionally, the ptes are populated using only
>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
>> my prefereence is to not make any changes to code, and just add a comment
>> describing why efi_mm is safe.
>>
>> Details:
>>
>> * Registered with ptdump
>>      * ptep_get_lockless()
>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>      * __ptep_get()
>>      * __set_pte()
>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>> set_permissions
>>      * __ptep_get()
>>      * __set_pte()
> 
> Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
> "official" APIs.

We could, but that would lead to the same linkage issue, which I'm trying to
avoid in the first place:

VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);

This creates new source code dependencies, which I would rather avoid if possible.

David Hildenbrand Feb. 13, 2024, 1:13 p.m. UTC | #16

On 13.02.24 14:06, Ryan Roberts wrote:
> On 13/02/2024 12:19, David Hildenbrand wrote:
>> On 13.02.24 13:06, Ryan Roberts wrote:
>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>> [...]
>>>>
>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>> +{
>>>>>>>> +    /*
>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>>>>>> +     * These racing faults are ok for user space, since they get serialized
>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>> +     */
>>>>>>>> +    return mm != &init_mm;
>>>>>>>> +}
>>>>>>>
>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we manipulate
>>>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>>>
>>>>>> Well we never need this function in the hot (order-0 folio) path, so I think I
>>>>>> could add a check for efi_mm here with performance implication. It's probably
>>>>>> safest to explicitly exclude it? What do you think?
>>>>>
>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>> *without* performance implication"
>>>>
>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
>>>> this:
>>>>
>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>
>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>> references this symbol currently.
>>>>
>>>> Or perhaps I can convince myself that its safe to treat efi_mm like userspace.
>>>> There are a couple of things that need to be garanteed for it to be safe:
>>>>
>>>>     - The PFNs of present ptes either need to have an associated struct page or
>>>>       need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>       pte_mkdevmap())
>>>>
>>>>     - Live mappings must either be static (no changes that could cause
>>>> fold/unfold
>>>>       while live) or the system must be able to tolerate a temporary fault
>>>>
>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>> requirement, but I'm not sure about the former?
>>>
>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>> mappings are indeed static. And additionally, the ptes are populated using only
>>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
>>> my prefereence is to not make any changes to code, and just add a comment
>>> describing why efi_mm is safe.
>>>
>>> Details:
>>>
>>> * Registered with ptdump
>>>       * ptep_get_lockless()
>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>       * __ptep_get()
>>>       * __set_pte()
>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>> set_permissions
>>>       * __ptep_get()
>>>       * __set_pte()
>>
>> Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>> "official" APIs.
> 
> We could, but that would lead to the same linkage issue, which I'm trying to
> avoid in the first place:
> 
> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);
> 
> This creates new source code dependencies, which I would rather avoid if possible.

Just a thought, you could have a is_efi_mm() function that abstracts all that.

diff --git a/include/linux/efi.h b/include/linux/efi.h
index c74f47711f0b..152f5fa66a2a 100644
--- a/include/linux/efi.h
+++ b/include/linux/efi.h
@@ -692,6 +692,15 @@ extern struct efi {
  
  extern struct mm_struct efi_mm;
  
+static inline void is_efi_mm(struct mm_struct *mm)
+{
+#ifdef CONFIG_EFI
+       return mm == &efi_mm;
+#else
+       return false;
+#endif
+}
+
  static inline int
  efi_guidcmp (efi_guid_t left, efi_guid_t right)
  {

Ryan Roberts Feb. 13, 2024, 1:20 p.m. UTC | #17

On 13/02/2024 13:13, David Hildenbrand wrote:
> On 13.02.24 14:06, Ryan Roberts wrote:
>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>> [...]
>>>>>
>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>> +{
>>>>>>>>> +    /*
>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>> serialized
>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>> +     */
>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>> +}
>>>>>>>>
>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>> manipulate
>>>>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>>>>
>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>> think I
>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>> probably
>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>
>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>> *without* performance implication"
>>>>>
>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
>>>>> this:
>>>>>
>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>
>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>> references this symbol currently.
>>>>>
>>>>> Or perhaps I can convince myself that its safe to treat efi_mm like userspace.
>>>>> There are a couple of things that need to be garanteed for it to be safe:
>>>>>
>>>>>     - The PFNs of present ptes either need to have an associated struct
>>>>> page or
>>>>>       need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>       pte_mkdevmap())
>>>>>
>>>>>     - Live mappings must either be static (no changes that could cause
>>>>> fold/unfold
>>>>>       while live) or the system must be able to tolerate a temporary fault
>>>>>
>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>>> requirement, but I'm not sure about the former?
>>>>
>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>> mappings are indeed static. And additionally, the ptes are populated using only
>>>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
>>>> my prefereence is to not make any changes to code, and just add a comment
>>>> describing why efi_mm is safe.
>>>>
>>>> Details:
>>>>
>>>> * Registered with ptdump
>>>>       * ptep_get_lockless()
>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>       * __ptep_get()
>>>>       * __set_pte()
>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>> set_permissions
>>>>       * __ptep_get()
>>>>       * __set_pte()
>>>
>>> Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>> "official" APIs.
>>
>> We could, but that would lead to the same linkage issue, which I'm trying to
>> avoid in the first place:
>>
>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);
>>
>> This creates new source code dependencies, which I would rather avoid if
>> possible.
> 
> Just a thought, you could have a is_efi_mm() function that abstracts all that.
> 
> diff --git a/include/linux/efi.h b/include/linux/efi.h
> index c74f47711f0b..152f5fa66a2a 100644
> --- a/include/linux/efi.h
> +++ b/include/linux/efi.h
> @@ -692,6 +692,15 @@ extern struct efi {
>  
>  extern struct mm_struct efi_mm;
>  
> +static inline void is_efi_mm(struct mm_struct *mm)
> +{
> +#ifdef CONFIG_EFI
> +       return mm == &efi_mm;
> +#else
> +       return false;
> +#endif
> +}
> +
>  static inline int
>  efi_guidcmp (efi_guid_t left, efi_guid_t right)
>  {
> 
> 

That would definitely work, but in that case, I might as well just check for it
in mm_is_user() (and personally I would change the name to mm_is_efi()):


static inline bool mm_is_user(struct mm_struct *mm)
{
	return mm != &init_mm && !mm_is_efi(mm);
}

Any objections?

David Hildenbrand Feb. 13, 2024, 1:22 p.m. UTC | #18

On 13.02.24 14:20, Ryan Roberts wrote:
> On 13/02/2024 13:13, David Hildenbrand wrote:
>> On 13.02.24 14:06, Ryan Roberts wrote:
>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>> [...]
>>>>>>
>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>> +{
>>>>>>>>>> +    /*
>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>> serialized
>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>> +     */
>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>> +}
>>>>>>>>>
>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>> manipulate
>>>>>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>>>>>
>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>> think I
>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>> probably
>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>
>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>> *without* performance implication"
>>>>>>
>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
>>>>>> this:
>>>>>>
>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>
>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>> references this symbol currently.
>>>>>>
>>>>>> Or perhaps I can convince myself that its safe to treat efi_mm like userspace.
>>>>>> There are a couple of things that need to be garanteed for it to be safe:
>>>>>>
>>>>>>      - The PFNs of present ptes either need to have an associated struct
>>>>>> page or
>>>>>>        need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>        pte_mkdevmap())
>>>>>>
>>>>>>      - Live mappings must either be static (no changes that could cause
>>>>>> fold/unfold
>>>>>>        while live) or the system must be able to tolerate a temporary fault
>>>>>>
>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>>>> requirement, but I'm not sure about the former?
>>>>>
>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>> mappings are indeed static. And additionally, the ptes are populated using only
>>>>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
>>>>> my prefereence is to not make any changes to code, and just add a comment
>>>>> describing why efi_mm is safe.
>>>>>
>>>>> Details:
>>>>>
>>>>> * Registered with ptdump
>>>>>        * ptep_get_lockless()
>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>        * __ptep_get()
>>>>>        * __set_pte()
>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>> set_permissions
>>>>>        * __ptep_get()
>>>>>        * __set_pte()
>>>>
>>>> Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>> "official" APIs.
>>>
>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>> avoid in the first place:
>>>
>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);
>>>
>>> This creates new source code dependencies, which I would rather avoid if
>>> possible.
>>
>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
>>
>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>> index c74f47711f0b..152f5fa66a2a 100644
>> --- a/include/linux/efi.h
>> +++ b/include/linux/efi.h
>> @@ -692,6 +692,15 @@ extern struct efi {
>>   
>>   extern struct mm_struct efi_mm;
>>   
>> +static inline void is_efi_mm(struct mm_struct *mm)
>> +{
>> +#ifdef CONFIG_EFI
>> +       return mm == &efi_mm;
>> +#else
>> +       return false;
>> +#endif
>> +}
>> +
>>   static inline int
>>   efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>   {
>>
>>
> 
> That would definitely work, but in that case, I might as well just check for it
> in mm_is_user() (and personally I would change the name to mm_is_efi()):
> 
> 
> static inline bool mm_is_user(struct mm_struct *mm)
> {
> 	return mm != &init_mm && !mm_is_efi(mm);
> }
> 
> Any objections?
> 

Nope :) Maybe slap in an "unlikely()", because efi_mm *is* unlikely to 
show up.

Ryan Roberts Feb. 13, 2024, 1:24 p.m. UTC | #19

On 13/02/2024 13:22, David Hildenbrand wrote:
> On 13.02.24 14:20, Ryan Roberts wrote:
>> On 13/02/2024 13:13, David Hildenbrand wrote:
>>> On 13.02.24 14:06, Ryan Roberts wrote:
>>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>>> +{
>>>>>>>>>>> +    /*
>>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings,
>>>>>>>>>>> because
>>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page
>>>>>>>>>>> faults.
>>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>>> serialized
>>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>>> manipulate
>>>>>>>>>> that while it is live, and I'm not sure if that needs any special
>>>>>>>>>> handling.
>>>>>>>>>
>>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>>> think I
>>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>>> probably
>>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>>
>>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>>> *without* performance implication"
>>>>>>>
>>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I
>>>>>>> can do
>>>>>>> this:
>>>>>>>
>>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>>
>>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>>> references this symbol currently.
>>>>>>>
>>>>>>> Or perhaps I can convince myself that its safe to treat efi_mm like
>>>>>>> userspace.
>>>>>>> There are a couple of things that need to be garanteed for it to be safe:
>>>>>>>
>>>>>>>      - The PFNs of present ptes either need to have an associated struct
>>>>>>> page or
>>>>>>>        need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>>        pte_mkdevmap())
>>>>>>>
>>>>>>>      - Live mappings must either be static (no changes that could cause
>>>>>>> fold/unfold
>>>>>>>        while live) or the system must be able to tolerate a temporary fault
>>>>>>>
>>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>>>>> requirement, but I'm not sure about the former?
>>>>>>
>>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>>> mappings are indeed static. And additionally, the ptes are populated using
>>>>>> only
>>>>>> the _private_ ptep API, so there is no issue here. As just discussed with
>>>>>> Mark,
>>>>>> my prefereence is to not make any changes to code, and just add a comment
>>>>>> describing why efi_mm is safe.
>>>>>>
>>>>>> Details:
>>>>>>
>>>>>> * Registered with ptdump
>>>>>>        * ptep_get_lockless()
>>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>>> set_permissions
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>
>>>>> Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>>> "official" APIs.
>>>>
>>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>>> avoid in the first place:
>>>>
>>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);
>>>>
>>>> This creates new source code dependencies, which I would rather avoid if
>>>> possible.
>>>
>>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
>>>
>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>> index c74f47711f0b..152f5fa66a2a 100644
>>> --- a/include/linux/efi.h
>>> +++ b/include/linux/efi.h
>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>     extern struct mm_struct efi_mm;
>>>   +static inline void is_efi_mm(struct mm_struct *mm)
>>> +{
>>> +#ifdef CONFIG_EFI
>>> +       return mm == &efi_mm;
>>> +#else
>>> +       return false;
>>> +#endif
>>> +}
>>> +
>>>   static inline int
>>>   efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>   {
>>>
>>>
>>
>> That would definitely work, but in that case, I might as well just check for it
>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>
>>
>> static inline bool mm_is_user(struct mm_struct *mm)
>> {
>>     return mm != &init_mm && !mm_is_efi(mm);
>> }
>>
>> Any objections?
>>
> 
> Nope :) Maybe slap in an "unlikely()", because efi_mm *is* unlikely to show up.

Deal

>

Ard Biesheuvel Feb. 13, 2024, 1:33 p.m. UTC | #20

On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 13/02/2024 13:13, David Hildenbrand wrote:
> > On 13.02.24 14:06, Ryan Roberts wrote:
> >> On 13/02/2024 12:19, David Hildenbrand wrote:
> >>> On 13.02.24 13:06, Ryan Roberts wrote:
> >>>> On 12/02/2024 20:38, Ryan Roberts wrote:
> >>>>> [...]
> >>>>>
> >>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
> >>>>>>>>> +{
> >>>>>>>>> +    /*
> >>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
> >>>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
> >>>>>>>>> +     * These racing faults are ok for user space, since they get
> >>>>>>>>> serialized
> >>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
> >>>>>>>>> +     */
> >>>>>>>>> +    return mm != &init_mm;
> >>>>>>>>> +}
> >>>>>>>>
> >>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
> >>>>>>>> manipulate
> >>>>>>>> that while it is live, and I'm not sure if that needs any special handling.
> >>>>>>>
> >>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
> >>>>>>> think I
> >>>>>>> could add a check for efi_mm here with performance implication. It's
> >>>>>>> probably
> >>>>>>> safest to explicitly exclude it? What do you think?
> >>>>>>
> >>>>>> Oops: This should have read "I think I could add a check for efi_mm here
> >>>>>> *without* performance implication"
> >>>>>
> >>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled. I can do
> >>>>> this:
> >>>>>
> >>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
> >>>>>
> >>>>> Is that acceptable? This is my preference, but nothing else outside of efi
> >>>>> references this symbol currently.
> >>>>>
> >>>>> Or perhaps I can convince myself that its safe to treat efi_mm like userspace.
> >>>>> There are a couple of things that need to be garanteed for it to be safe:
> >>>>>
> >>>>>     - The PFNs of present ptes either need to have an associated struct
> >>>>> page or
> >>>>>       need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
> >>>>>       pte_mkdevmap())
> >>>>>
> >>>>>     - Live mappings must either be static (no changes that could cause
> >>>>> fold/unfold
> >>>>>       while live) or the system must be able to tolerate a temporary fault
> >>>>>
> >>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
> >>>>> requirement, but I'm not sure about the former?
> >>>>
> >>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
> >>>> mappings are indeed static. And additionally, the ptes are populated using only
> >>>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
> >>>> my prefereence is to not make any changes to code, and just add a comment
> >>>> describing why efi_mm is safe.
> >>>>
> >>>> Details:
> >>>>
> >>>> * Registered with ptdump
> >>>>       * ptep_get_lockless()
> >>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
> >>>>       * __ptep_get()
> >>>>       * __set_pte()
> >>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
> >>>> set_permissions
> >>>>       * __ptep_get()
> >>>>       * __set_pte()
> >>>
> >>> Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
> >>> "official" APIs.
> >>
> >> We could, but that would lead to the same linkage issue, which I'm trying to
> >> avoid in the first place:
> >>
> >> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);
> >>
> >> This creates new source code dependencies, which I would rather avoid if
> >> possible.
> >
> > Just a thought, you could have a is_efi_mm() function that abstracts all that.
> >
> > diff --git a/include/linux/efi.h b/include/linux/efi.h
> > index c74f47711f0b..152f5fa66a2a 100644
> > --- a/include/linux/efi.h
> > +++ b/include/linux/efi.h
> > @@ -692,6 +692,15 @@ extern struct efi {
> >
> >  extern struct mm_struct efi_mm;
> >
> > +static inline void is_efi_mm(struct mm_struct *mm)
> > +{
> > +#ifdef CONFIG_EFI
> > +       return mm == &efi_mm;
> > +#else
> > +       return false;
> > +#endif
> > +}
> > +
> >  static inline int
> >  efi_guidcmp (efi_guid_t left, efi_guid_t right)
> >  {
> >
> >
>
> That would definitely work, but in that case, I might as well just check for it
> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>
>
> static inline bool mm_is_user(struct mm_struct *mm)
> {
>         return mm != &init_mm && !mm_is_efi(mm);
> }
>
> Any objections?
>

Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
declaration is visible to the compiler, and any references should
disappear before the linker could notice that efi_mm does not exist.

In any case, feel free to add

Acked-by: Ard Biesheuvel <ardb@kernel.org>

when you roll a patch based on the above, with or without IS_ENABLED().

And as you concluded, efi_mm is indeed set in stone once the runtime
regions described by the firmware have been mapped, although this may
happen in two passes depending on how the runtime regions are
described. But by the time user MMs might exist, efi_mm should
effectively be immutable.

David Hildenbrand Feb. 13, 2024, 1:45 p.m. UTC | #21

On 13.02.24 14:33, Ard Biesheuvel wrote:
> On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>
>> On 13/02/2024 13:13, David Hildenbrand wrote:
>>> On 13.02.24 14:06, Ryan Roberts wrote:
>>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>>> [...]
>>>>>>>
>>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>>> +{
>>>>>>>>>>> +    /*
>>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings, because
>>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page faults.
>>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>>> serialized
>>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>>> +     */
>>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>>> +}
>>>>>>>>>>
>>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>>> manipulate
>>>>>>>>>> that while it is live, and I'm not sure if that needs any special handling.
>>>>>>>>>
>>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>>> think I
>>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>>> probably
>>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>>
>>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>>> *without* performance implication"
>>>>>>>
>>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled I can do
>>>>>>> this:
>>>>>>>
>>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>>
>>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>>> references this symbol currently.
>>>>>>>
>>>>>>> Or perhaps I can convince myself that its safe to treat efi_mm like userspace.
>>>>>>> There are a couple of things that need to be garanteed for it to be safe:
>>>>>>>
>>>>>>>      - The PFNs of present ptes either need to have an associated struct
>>>>>>> page or
>>>>>>>        need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>>        pte_mkdevmap())
>>>>>>>
>>>>>>>      - Live mappings must either be static (no changes that could cause
>>>>>>> fold/unfold
>>>>>>>        while live) or the system must be able to tolerate a temporary fault
>>>>>>>
>>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the latter
>>>>>>> requirement, but I'm not sure about the former?
>>>>>>
>>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>>> mappings are indeed static. And additionally, the ptes are populated using only
>>>>>> the _private_ ptep API, so there is no issue here. As just discussed with Mark,
>>>>>> my prefereence is to not make any changes to code, and just add a comment
>>>>>> describing why efi_mm is safe.
>>>>>>
>>>>>> Details:
>>>>>>
>>>>>> * Registered with ptdump
>>>>>>        * ptep_get_lockless()
>>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>>> set_permissions
>>>>>>        * __ptep_get()
>>>>>>        * __set_pte()
>>>>>
>>>>> Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>>> "official" APIs.
>>>>
>>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>>> avoid in the first place:
>>>>
>>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);
>>>>
>>>> This creates new source code dependencies, which I would rather avoid if
>>>> possible.
>>>
>>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
>>>
>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>> index c74f47711f0b..152f5fa66a2a 100644
>>> --- a/include/linux/efi.h
>>> +++ b/include/linux/efi.h
>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>
>>>   extern struct mm_struct efi_mm;
>>>
>>> +static inline void is_efi_mm(struct mm_struct *mm)
>>> +{
>>> +#ifdef CONFIG_EFI
>>> +       return mm == &efi_mm;
>>> +#else
>>> +       return false;
>>> +#endif
>>> +}
>>> +
>>>   static inline int
>>>   efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>   {
>>>
>>>
>>
>> That would definitely work, but in that case, I might as well just check for it
>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>
>>
>> static inline bool mm_is_user(struct mm_struct *mm)
>> {
>>          return mm != &init_mm && !mm_is_efi(mm);
>> }
>>
>> Any objections?
>>
> 
> Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
> declaration is visible to the compiler, and any references should
> disappear before the linker could notice that efi_mm does not exist.
> 

Sure, as long as the linker is happy why not. I'll let Ryan mess with 
that :)

> In any case, feel free to add
> 
> Acked-by: Ard Biesheuvel <ardb@kernel.org>

Thanks for the review.

Ryan Roberts Feb. 13, 2024, 2:02 p.m. UTC | #22

On 13/02/2024 13:45, David Hildenbrand wrote:
> On 13.02.24 14:33, Ard Biesheuvel wrote:
>> On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>
>>> On 13/02/2024 13:13, David Hildenbrand wrote:
>>>> On 13.02.24 14:06, Ryan Roberts wrote:
>>>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>>>> [...]
>>>>>>>>
>>>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>>>> +{
>>>>>>>>>>>> +    /*
>>>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings,
>>>>>>>>>>>> because
>>>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page
>>>>>>>>>>>> faults.
>>>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>>>> serialized
>>>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>>>> +     */
>>>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>>>> +}
>>>>>>>>>>>
>>>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>>>> manipulate
>>>>>>>>>>> that while it is live, and I'm not sure if that needs any special
>>>>>>>>>>> handling.
>>>>>>>>>>
>>>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>>>> think I
>>>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>>>> probably
>>>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>>>
>>>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>>>> *without* performance implication"
>>>>>>>>
>>>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled I
>>>>>>>> can do
>>>>>>>> this:
>>>>>>>>
>>>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>>>
>>>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>>>> references this symbol currently.
>>>>>>>>
>>>>>>>> Or perhaps I can convince myself that its safe to treat efi_mm like
>>>>>>>> userspace.
>>>>>>>> There are a couple of things that need to be garanteed for it to be safe:
>>>>>>>>
>>>>>>>>      - The PFNs of present ptes either need to have an associated struct
>>>>>>>> page or
>>>>>>>>        need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>>>        pte_mkdevmap())
>>>>>>>>
>>>>>>>>      - Live mappings must either be static (no changes that could cause
>>>>>>>> fold/unfold
>>>>>>>>        while live) or the system must be able to tolerate a temporary fault
>>>>>>>>
>>>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the
>>>>>>>> latter
>>>>>>>> requirement, but I'm not sure about the former?
>>>>>>>
>>>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>>>> mappings are indeed static. And additionally, the ptes are populated
>>>>>>> using only
>>>>>>> the _private_ ptep API, so there is no issue here. As just discussed with
>>>>>>> Mark,
>>>>>>> my prefereence is to not make any changes to code, and just add a comment
>>>>>>> describing why efi_mm is safe.
>>>>>>>
>>>>>>> Details:
>>>>>>>
>>>>>>> * Registered with ptdump
>>>>>>>        * ptep_get_lockless()
>>>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>>>        * __ptep_get()
>>>>>>>        * __set_pte()
>>>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>>>> set_permissions
>>>>>>>        * __ptep_get()
>>>>>>>        * __set_pte()
>>>>>>
>>>>>> Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>>>> "official" APIs.
>>>>>
>>>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>>>> avoid in the first place:
>>>>>
>>>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);
>>>>>
>>>>> This creates new source code dependencies, which I would rather avoid if
>>>>> possible.
>>>>
>>>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
>>>>
>>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>>> index c74f47711f0b..152f5fa66a2a 100644
>>>> --- a/include/linux/efi.h
>>>> +++ b/include/linux/efi.h
>>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>>
>>>>   extern struct mm_struct efi_mm;
>>>>
>>>> +static inline void is_efi_mm(struct mm_struct *mm)
>>>> +{
>>>> +#ifdef CONFIG_EFI
>>>> +       return mm == &efi_mm;
>>>> +#else
>>>> +       return false;
>>>> +#endif
>>>> +}
>>>> +
>>>>   static inline int
>>>>   efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>>   {
>>>>
>>>>
>>>
>>> That would definitely work, but in that case, I might as well just check for it
>>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>>
>>>
>>> static inline bool mm_is_user(struct mm_struct *mm)
>>> {
>>>          return mm != &init_mm && !mm_is_efi(mm);
>>> }
>>>
>>> Any objections?
>>>
>>
>> Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
>> declaration is visible to the compiler, and any references should
>> disappear before the linker could notice that efi_mm does not exist.
>>
> 
> Sure, as long as the linker is happy why not. I'll let Ryan mess with that :)

I'm not sure if you are suggesting dropping the mm_is_efi() helper and just use
IS_ENABLED(CONFIG_EFI) in mm_is_user() to guard efi_mm, or if you are suggesting
using IS_ENABLED(CONFIG_EFI) in mm_is_efi() instead of the ifdefery?

The former was what I did initially; It works great, but I didn't like that I
was introducing a new code dependecy between efi and arm64 (nothing else outside
of efi references efi_mm).

So then concluded that it is safe to not worry about efi_mm (thanks for your
confirmation). But then David wanted a VM_WARN check, which reintroduces the
code dependency. So he suggested the mm_is_efi() helper to hide that... This is
all starting to feel circular...

Since I've jsut updated the code to do it David's way, I propose leaving it as
is since nobody has strong feelings.

> 
>> In any case, feel free to add
>>
>> Acked-by: Ard Biesheuvel <ardb@kernel.org>

Great thanks!

> 
> Thanks for the review.
>

David Hildenbrand Feb. 13, 2024, 2:05 p.m. UTC | #23

On 13.02.24 15:02, Ryan Roberts wrote:
> On 13/02/2024 13:45, David Hildenbrand wrote:
>> On 13.02.24 14:33, Ard Biesheuvel wrote:
>>> On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>
>>>> On 13/02/2024 13:13, David Hildenbrand wrote:
>>>>> On 13.02.24 14:06, Ryan Roberts wrote:
>>>>>> On 13/02/2024 12:19, David Hildenbrand wrote:
>>>>>>> On 13.02.24 13:06, Ryan Roberts wrote:
>>>>>>>> On 12/02/2024 20:38, Ryan Roberts wrote:
>>>>>>>>> [...]
>>>>>>>>>
>>>>>>>>>>>>> +static inline bool mm_is_user(struct mm_struct *mm)
>>>>>>>>>>>>> +{
>>>>>>>>>>>>> +    /*
>>>>>>>>>>>>> +     * Don't attempt to apply the contig bit to kernel mappings,
>>>>>>>>>>>>> because
>>>>>>>>>>>>> +     * dynamically adding/removing the contig bit can cause page
>>>>>>>>>>>>> faults.
>>>>>>>>>>>>> +     * These racing faults are ok for user space, since they get
>>>>>>>>>>>>> serialized
>>>>>>>>>>>>> +     * on the PTL. But kernel mappings can't tolerate faults.
>>>>>>>>>>>>> +     */
>>>>>>>>>>>>> +    return mm != &init_mm;
>>>>>>>>>>>>> +}
>>>>>>>>>>>>
>>>>>>>>>>>> We also have the efi_mm as a non-user mm, though I don't think we
>>>>>>>>>>>> manipulate
>>>>>>>>>>>> that while it is live, and I'm not sure if that needs any special
>>>>>>>>>>>> handling.
>>>>>>>>>>>
>>>>>>>>>>> Well we never need this function in the hot (order-0 folio) path, so I
>>>>>>>>>>> think I
>>>>>>>>>>> could add a check for efi_mm here with performance implication. It's
>>>>>>>>>>> probably
>>>>>>>>>>> safest to explicitly exclude it? What do you think?
>>>>>>>>>>
>>>>>>>>>> Oops: This should have read "I think I could add a check for efi_mm here
>>>>>>>>>> *without* performance implication"
>>>>>>>>>
>>>>>>>>> It turns out that efi_mm is only defined when CONFIG_EFI is enabled I
>>>>>>>>> can do
>>>>>>>>> this:
>>>>>>>>>
>>>>>>>>> return mm != &init_mm && (!IS_ENABLED(CONFIG_EFI) || mm != &efi_mm);
>>>>>>>>>
>>>>>>>>> Is that acceptable? This is my preference, but nothing else outside of efi
>>>>>>>>> references this symbol currently.
>>>>>>>>>
>>>>>>>>> Or perhaps I can convince myself that its safe to treat efi_mm like
>>>>>>>>> userspace.
>>>>>>>>> There are a couple of things that need to be garanteed for it to be safe:
>>>>>>>>>
>>>>>>>>>       - The PFNs of present ptes either need to have an associated struct
>>>>>>>>> page or
>>>>>>>>>         need to have the PTE_SPECIAL bit set (either pte_mkspecial() or
>>>>>>>>>         pte_mkdevmap())
>>>>>>>>>
>>>>>>>>>       - Live mappings must either be static (no changes that could cause
>>>>>>>>> fold/unfold
>>>>>>>>>         while live) or the system must be able to tolerate a temporary fault
>>>>>>>>>
>>>>>>>>> Mark suggests efi_mm is not manipulated while live, so that meets the
>>>>>>>>> latter
>>>>>>>>> requirement, but I'm not sure about the former?
>>>>>>>>
>>>>>>>> I've gone through all the efi code, and conclude that, as Mark suggests, the
>>>>>>>> mappings are indeed static. And additionally, the ptes are populated
>>>>>>>> using only
>>>>>>>> the _private_ ptep API, so there is no issue here. As just discussed with
>>>>>>>> Mark,
>>>>>>>> my prefereence is to not make any changes to code, and just add a comment
>>>>>>>> describing why efi_mm is safe.
>>>>>>>>
>>>>>>>> Details:
>>>>>>>>
>>>>>>>> * Registered with ptdump
>>>>>>>>         * ptep_get_lockless()
>>>>>>>> * efi_create_mapping -> create_pgd_mapping … -> init_pte:
>>>>>>>>         * __ptep_get()
>>>>>>>>         * __set_pte()
>>>>>>>> * efi_memattr_apply_permissions -> efi_set_mapping_permissions … ->
>>>>>>>> set_permissions
>>>>>>>>         * __ptep_get()
>>>>>>>>         * __set_pte()
>>>>>>>
>>>>>>> Sound good. We could add some VM_WARN_ON if we ever get the efi_mm via the
>>>>>>> "official" APIs.
>>>>>>
>>>>>> We could, but that would lead to the same linkage issue, which I'm trying to
>>>>>> avoid in the first place:
>>>>>>
>>>>>> VM_WARN_ON(IS_ENABLED(CONFIG_EFI) && mm == efi_mm);
>>>>>>
>>>>>> This creates new source code dependencies, which I would rather avoid if
>>>>>> possible.
>>>>>
>>>>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
>>>>>
>>>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>>>> index c74f47711f0b..152f5fa66a2a 100644
>>>>> --- a/include/linux/efi.h
>>>>> +++ b/include/linux/efi.h
>>>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>>>
>>>>>    extern struct mm_struct efi_mm;
>>>>>
>>>>> +static inline void is_efi_mm(struct mm_struct *mm)
>>>>> +{
>>>>> +#ifdef CONFIG_EFI
>>>>> +       return mm == &efi_mm;
>>>>> +#else
>>>>> +       return false;
>>>>> +#endif
>>>>> +}
>>>>> +
>>>>>    static inline int
>>>>>    efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>>>    {
>>>>>
>>>>>
>>>>
>>>> That would definitely work, but in that case, I might as well just check for it
>>>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>>>
>>>>
>>>> static inline bool mm_is_user(struct mm_struct *mm)
>>>> {
>>>>           return mm != &init_mm && !mm_is_efi(mm);
>>>> }
>>>>
>>>> Any objections?
>>>>
>>>
>>> Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
>>> declaration is visible to the compiler, and any references should
>>> disappear before the linker could notice that efi_mm does not exist.
>>>
>>
>> Sure, as long as the linker is happy why not. I'll let Ryan mess with that :)
> 
> I'm not sure if you are suggesting dropping the mm_is_efi() helper and just use
> IS_ENABLED(CONFIG_EFI) in mm_is_user() to guard efi_mm, or if you are suggesting
> using IS_ENABLED(CONFIG_EFI) in mm_is_efi() instead of the ifdefery?
> 
> The former was what I did initially; It works great, but I didn't like that I
> was introducing a new code dependecy between efi and arm64 (nothing else outside
> of efi references efi_mm).
> 
> So then concluded that it is safe to not worry about efi_mm (thanks for your
> confirmation). But then David wanted a VM_WARN check, which reintroduces the
> code dependency. So he suggested the mm_is_efi() helper to hide that... This is
> all starting to feel circular...

I think Ard meant that inside mm_is_efi(), we could avoid the #ifdef and 
simply use IS_ENABLED().

Ard Biesheuvel Feb. 13, 2024, 2:08 p.m. UTC | #24

On Tue, 13 Feb 2024 at 15:05, David Hildenbrand <david@redhat.com> wrote:
>
> On 13.02.24 15:02, Ryan Roberts wrote:
> > On 13/02/2024 13:45, David Hildenbrand wrote:
> >> On 13.02.24 14:33, Ard Biesheuvel wrote:
> >>> On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>>>
> >>>> On 13/02/2024 13:13, David Hildenbrand wrote:
...
> >>>>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
> >>>>>
> >>>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
> >>>>> index c74f47711f0b..152f5fa66a2a 100644
> >>>>> --- a/include/linux/efi.h
> >>>>> +++ b/include/linux/efi.h
> >>>>> @@ -692,6 +692,15 @@ extern struct efi {
> >>>>>
> >>>>>    extern struct mm_struct efi_mm;
> >>>>>
> >>>>> +static inline void is_efi_mm(struct mm_struct *mm)
> >>>>> +{
> >>>>> +#ifdef CONFIG_EFI
> >>>>> +       return mm == &efi_mm;
> >>>>> +#else
> >>>>> +       return false;
> >>>>> +#endif
> >>>>> +}
> >>>>> +
> >>>>>    static inline int
> >>>>>    efi_guidcmp (efi_guid_t left, efi_guid_t right)
> >>>>>    {
> >>>>>
> >>>>>
> >>>>
> >>>> That would definitely work, but in that case, I might as well just check for it
> >>>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
> >>>>
> >>>>
> >>>> static inline bool mm_is_user(struct mm_struct *mm)
> >>>> {
> >>>>           return mm != &init_mm && !mm_is_efi(mm);
> >>>> }
> >>>>
> >>>> Any objections?
> >>>>
> >>>
> >>> Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
> >>> declaration is visible to the compiler, and any references should
> >>> disappear before the linker could notice that efi_mm does not exist.
> >>>
> >>
> >> Sure, as long as the linker is happy why not. I'll let Ryan mess with that :)
> >
> > I'm not sure if you are suggesting dropping the mm_is_efi() helper and just use
> > IS_ENABLED(CONFIG_EFI) in mm_is_user() to guard efi_mm, or if you are suggesting
> > using IS_ENABLED(CONFIG_EFI) in mm_is_efi() instead of the ifdefery?
> >
> > The former was what I did initially; It works great, but I didn't like that I
> > was introducing a new code dependecy between efi and arm64 (nothing else outside
> > of efi references efi_mm).
> >
> > So then concluded that it is safe to not worry about efi_mm (thanks for your
> > confirmation). But then David wanted a VM_WARN check, which reintroduces the
> > code dependency. So he suggested the mm_is_efi() helper to hide that... This is
> > all starting to feel circular...
>
> I think Ard meant that inside mm_is_efi(), we could avoid the #ifdef and
> simply use IS_ENABLED().
>

Yes.

static inline void mm_is_efi(struct mm_struct *mm)
{
    return IS_ENABLED(CONFIG_EFI) && mm == &efi_mm;
}

Ryan Roberts Feb. 13, 2024, 2:21 p.m. UTC | #25

On 13/02/2024 14:08, Ard Biesheuvel wrote:
> On Tue, 13 Feb 2024 at 15:05, David Hildenbrand <david@redhat.com> wrote:
>>
>> On 13.02.24 15:02, Ryan Roberts wrote:
>>> On 13/02/2024 13:45, David Hildenbrand wrote:
>>>> On 13.02.24 14:33, Ard Biesheuvel wrote:
>>>>> On Tue, 13 Feb 2024 at 14:21, Ryan Roberts <ryan.roberts@arm.com> wrote:
>>>>>>
>>>>>> On 13/02/2024 13:13, David Hildenbrand wrote:
> ...
>>>>>>> Just a thought, you could have a is_efi_mm() function that abstracts all that.
>>>>>>>
>>>>>>> diff --git a/include/linux/efi.h b/include/linux/efi.h
>>>>>>> index c74f47711f0b..152f5fa66a2a 100644
>>>>>>> --- a/include/linux/efi.h
>>>>>>> +++ b/include/linux/efi.h
>>>>>>> @@ -692,6 +692,15 @@ extern struct efi {
>>>>>>>
>>>>>>>    extern struct mm_struct efi_mm;
>>>>>>>
>>>>>>> +static inline void is_efi_mm(struct mm_struct *mm)
>>>>>>> +{
>>>>>>> +#ifdef CONFIG_EFI
>>>>>>> +       return mm == &efi_mm;
>>>>>>> +#else
>>>>>>> +       return false;
>>>>>>> +#endif
>>>>>>> +}
>>>>>>> +
>>>>>>>    static inline int
>>>>>>>    efi_guidcmp (efi_guid_t left, efi_guid_t right)
>>>>>>>    {
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> That would definitely work, but in that case, I might as well just check for it
>>>>>> in mm_is_user() (and personally I would change the name to mm_is_efi()):
>>>>>>
>>>>>>
>>>>>> static inline bool mm_is_user(struct mm_struct *mm)
>>>>>> {
>>>>>>           return mm != &init_mm && !mm_is_efi(mm);
>>>>>> }
>>>>>>
>>>>>> Any objections?
>>>>>>
>>>>>
>>>>> Any reason not to use IS_ENABLED(CONFIG_EFI) in the above? The extern
>>>>> declaration is visible to the compiler, and any references should
>>>>> disappear before the linker could notice that efi_mm does not exist.
>>>>>
>>>>
>>>> Sure, as long as the linker is happy why not. I'll let Ryan mess with that :)
>>>
>>> I'm not sure if you are suggesting dropping the mm_is_efi() helper and just use
>>> IS_ENABLED(CONFIG_EFI) in mm_is_user() to guard efi_mm, or if you are suggesting
>>> using IS_ENABLED(CONFIG_EFI) in mm_is_efi() instead of the ifdefery?
>>>
>>> The former was what I did initially; It works great, but I didn't like that I
>>> was introducing a new code dependecy between efi and arm64 (nothing else outside
>>> of efi references efi_mm).
>>>
>>> So then concluded that it is safe to not worry about efi_mm (thanks for your
>>> confirmation). But then David wanted a VM_WARN check, which reintroduces the
>>> code dependency. So he suggested the mm_is_efi() helper to hide that... This is
>>> all starting to feel circular...
>>
>> I think Ard meant that inside mm_is_efi(), we could avoid the #ifdef and
>> simply use IS_ENABLED().
>>
> 
> Yes.
> 
> static inline void mm_is_efi(struct mm_struct *mm)
> {
>     return IS_ENABLED(CONFIG_EFI) && mm == &efi_mm;
> }

Ahh, got it. Yes, I'll do it like this. Thanks!

Ryan Roberts Feb. 13, 2024, 3:29 p.m. UTC | #26

On 12/02/2024 16:24, David Hildenbrand wrote:
> On 12.02.24 16:34, Ryan Roberts wrote:
>> On 12/02/2024 15:26, David Hildenbrand wrote:
>>> On 12.02.24 15:45, Ryan Roberts wrote:
>>>> On 12/02/2024 13:54, David Hildenbrand wrote:
>>>>>>> If so, I wonder if we could instead do that comparison modulo the
>>>>>>> access/dirty
>>>>>>> bits,
>>>>>>
>>>>>> I think that would work - but will need to think a bit more on it.
>>>>>>
>>>>>>> and leave ptep_get_lockless() only reading a single entry?
>>>>>>
>>>>>> I think we will need to do something a bit less fragile. ptep_get() does
>>>>>> collect
>>>>>> the access/dirty bits so its confusing if ptep_get_lockless() doesn't
>>>>>> IMHO. So
>>>>>> we will likely want to rename the function and make its documentation
>>>>>> explicit
>>>>>> that it does not return those bits.
>>>>>>
>>>>>> ptep_get_lockless_noyoungdirty()? yuk... Any ideas?
>>>>>>
>>>>>> Of course if I could convince you the current implementation is safe, I
>>>>>> might be
>>>>>> able to sidestep this optimization until a later date?
>>>>>
>>>>> As discussed (and pointed out abive), there might be quite some callsites
>>>>> where
>>>>> we don't really care about uptodate accessed/dirty bits -- where ptep_get() is
>>>>> used nowadays.
>>>>>
>>>>> One way to approach that I had in mind was having an explicit interface:
>>>>>
>>>>> ptep_get()
>>>>> ptep_get_uptodate()
>>>>> ptep_get_lockless()
>>>>> ptep_get_lockless_uptodate()
>>>>
>>>> Yes, I like the direction of this. I guess we anticipate that call sites
>>>> requiring the "_uptodate" variant will be the minority so it makes sense to use
>>>> the current names for the "_not_uptodate" variants? But to do a slow migration,
>>>> it might be better/safer to have the weaker variant use the new name - that
>>>> would allow us to downgrade one at a time?
>>>
>>> Yes, I was primarily struggling with names. Likely it makes sense to either have
>>> two completely new function names, or use the new name only for the "faster but
>>> less precise" variant.
>>>
>>>>
>>>>>
>>>>> Especially the last one might not be needed.
>>>> I've done a scan through the code and agree with Mark's original conclusions.
>>>> Additionally, huge_pte_alloc() (which isn't used for arm64) doesn't rely on
>>>> access/dirty info. So I think I could migrate everything to the weaker variant
>>>> fairly easily.
>>>>
>>>>>
>>>>> Futher, "uptodate" might not be the best choice because of PageUptodate() and
>>>>> friends. But it's better than "youngdirty"/"noyoungdirty" IMHO.
>>>>
>>>> Certainly agree with "noyoungdirty" being a horrible name. How about "_sync" /
>>>> "_nosync"?
>>>
>>> I could live with
>>>
>>> ptep_get_sync()
>>> ptep_get_nosync()
>>>
>>> with proper documentation :)
>>
>> but could you live with:
>>
>> ptep_get()
>> ptep_get_nosync()
>> ptep_get_lockless_nosync()
>>
>> ?
>>
>> So leave the "slower, more precise" version with the existing name.
> 
> Sure.
> 

I'm just implementing this (as a separate RFC), and had an alternative idea for
naming/semantics:

ptep_get()
ptep_get_norecency()
ptep_get_lockless()
ptep_get_lockless_norecency()

The "_norecency" versions explicitly clear the access/dirty bits. This is useful
for the "compare to original pte to check we are not racing" pattern:

pte = ptep_get_lockless_norecency(ptep)
...
<lock>
if (!pte_same(pte, ptep_get_norecency(ptep)))
	// RACE!
...
<unlock>

With the "_nosync" semantic, the access/dirty bits may or may not be set, so the
user has to explicitly clear them to do the comparison. (although I considered a
pte_same_nosync() that would clear the bits for you - but that name is pretty naff).

Although the _norecency semantic requires always explicitly clearing the bits,
so may be infinitesimally slower, it gives a very clear expectation that the
access/dirty bits are always clear and I think that's conveyed well in the name too.

Thoughts?

[v5,19/25] arm64/mm: Wire up PTE_CONT for user mappings

Commit Message

Comments

Patch