[v2,07/14] arm64/mm: Avoid barriers for invalid or userspace mappings

Message ID 20250217140809.1702789-8-ryan.roberts@arm.com (mailing list archive)
State New
Series Perf improvements for hugetlb and vmalloc on arm64

Commit Message

Ryan Roberts Feb. 17, 2025, 2:07 p.m. UTC
__set_pte_complete(), set_pmd(), set_pud(), set_p4d() and set_pgd() are
used to write entries into pgtables. And they issue barriers (currently
dsb and isb) to ensure that the written values are observed by the table
walker prior to any program-order-future memory access to the mapped
location.
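
For reference, the existing pattern at pte level looks roughly like the
sketch below (simplified, not the verbatim code):

	static inline void __set_pte(pte_t *ptep, pte_t pte)
	{
		WRITE_ONCE(*ptep, pte);

		/*
		 * Only emit the barriers for valid kernel mappings; invalid
		 * and user mappings rely on TLB maintenance or a spurious
		 * fault to synchronise (see below).
		 */
		if (pte_valid_not_user(pte)) {
			dsb(ishst);
			isb();
		}
	}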

Over the years some of these functions have received optimizations: In
particular, commit 7f0b1bf04511 ("arm64: Fix barriers used for page
table modifications") made it so that the barriers were only emitted for
valid-kernel mappings for set_pte() (now __set_pte_complete()). And
commit 0795edaf3f1f ("arm64: pgtable: Implement p[mu]d_valid() and check
in set_p[mu]d()") made it so that set_pmd()/set_pud() only emitted the
barriers for valid mappings. set_p4d()/set_pgd() continue to emit the
barriers unconditionally.

This is all very confusing to the casual observer; surely the rules
should be invariant to the level? Let's change this so that every level
consistently emits the barriers only when setting valid, non-user
entries (both table and leaf).

It seems obvious that if it is ok to elide barriers for all but valid kernel
mappings at pte level, it must also be ok to do this for leaf entries at
other levels: If setting an entry to invalid, a TLB maintenance
operation must surely follow to synchronise the TLB and this contains
the required barriers. If setting a valid user mapping, the previous
mapping must have been invalid and there must have been a TLB
maintenance operation (complete with barriers) to honour
break-before-make. So the worst that can happen is we take an extra
fault (which will imply the DSB + ISB) and conclude that there is
nothing to do. These are the arguments for doing this optimization at
pte level and they also apply to leaf mappings at other levels.
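
To make the break-before-make argument concrete, the sequence relied upon
is roughly the following sketch (mm, vma, addr and ptep stand in for the
usual context; the exact helpers vary by call site):

	/* Break: clear the old entry. */
	old_pte = ptep_get_and_clear(mm, addr, ptep);
	/* TLBI plus DSB/ISB -- this is where the required barriers come from. */
	flush_tlb_page(vma, addr);
	/* Make: publish the new valid user entry; no further barriers needed. */
	set_pte_at(mm, addr, ptep, new_pte);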

For table entries, the same arguments hold: If unsetting a table entry, a
TLB maintenance operation is required and this will emit the required
barriers. If setting a
table entry, the previous value must have been invalid and the table
walker must already be able to observe that. Additionally the contents
of the pgtable being pointed to in the newly set entry must be visible
before the entry is written and this is enforced via smp_wmb() (dmb) in
the pgtable allocation functions and in __split_huge_pmd_locked(). But
this last part could never have been enforced by the barriers in
set_pXd() because they occur after updating the entry. So ultimately,
the worst that can happen by eliding these barriers for user table
entries is an extra fault.
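
For illustration, the ordering the table-entry case depends on comes from
the allocation path, along these lines (simplified sketch mirroring
__pte_alloc_kernel()/pmd_install(); mm and pmdp stand for the relevant mm
and the slot being populated):

	pte_t *new = pte_alloc_one_kernel(mm);	/* zeroed pgtable page */

	/*
	 * Make the (zeroed) contents of the new table visible to the
	 * walker before the entry pointing at it is published.
	 */
	smp_wmb();
	pmd_populate_kernel(mm, pmdp, new);	/* write the table entry */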

I observe roughly the same number of page faults (107M) with and without
this change when compiling the kernel on Apple M2.

Signed-off-by: Ryan Roberts <ryan.roberts@arm.com>
---
 arch/arm64/include/asm/pgtable.h | 34 ++++++++++++++++++++++++++------
 1 file changed, 28 insertions(+), 6 deletions(-)

Comments

Kevin Brodsky Feb. 20, 2025, 4:54 p.m. UTC | #1
On 17/02/2025 15:07, Ryan Roberts wrote:
> __set_pte_complete(), set_pmd(), set_pud(), set_p4d() and set_pgd() are

Nit: it would be more accurate to say __set_pte() instead of
__set_pte_complete(), as it is the former that actually writes the PTE
(and then issues barriers).

> used to write entries into pgtables. And they issue barriers (currently
> dsb and isb) to ensure that the written values are observed by the table
> walker prior to any program-order-future memory access to the mapped
> location.
>
> Over the years some of these functions have received optimizations: In
> particular, commit 7f0b1bf04511 ("arm64: Fix barriers used for page
> table modifications") made it so that the barriers were only emitted for
> valid-kernel mappings for set_pte() (now __set_pte_complete()). And
> commit 0795edaf3f1f ("arm64: pgtable: Implement p[mu]d_valid() and check
> in set_p[mu]d()") made it so that set_pmd()/set_pud() only emitted the
> barriers for valid mappings. set_p4d()/set_pgd() continue to emit the
> barriers unconditionally.
>
> This is all very confusing to the casual observer; surely the rules
> should be invariant to the level? Let's change this so that every level
> consistently emits the barriers only when setting valid, non-user
> entries (both table and leaf).
>
> It seems obvious that if it is ok to elide barriers all but valid kernel
> mappings at pte level, it must also be ok to do this for leaf entries at
> other levels: If setting an entry to invalid, a tlb maintenance
> operation must surely follow to synchronise the TLB and this contains
> the required barriers. If setting a valid user mapping, the previous
> mapping must have been invalid and there must have been a TLB
> maintenance operation (complete with barriers) to honour
> break-before-make. So the worst that can happen is we take an extra
> fault (which will imply the DSB + ISB) and conclude that there is
> nothing to do. These are the arguments for doing this optimization at
> pte level and they also apply to leaf mappings at other levels.
>
> For table entries, the same arguments hold: If unsetting a table entry,
> a TLB is required and this will emit the required barriers. If setting a

s/TLB/TLB maintenance/

> table entry, the previous value must have been invalid and the table
> walker must already be able to observe that. Additionally the contents
> of the pgtable being pointed to in the newly set entry must be visible
> before the entry is written and this is enforced via smp_wmb() (dmb) in
> the pgtable allocation functions and in __split_huge_pmd_locked(). But
> this last part could never have been enforced by the barriers in
> set_pXd() because they occur after updating the entry. So ultimately,
> the wost that can happen by eliding these barriers for user table

s/wost/worst/

- Kevin

> entries is an extra fault.
>
> [...]
Catalin Marinas Feb. 22, 2025, 1:17 p.m. UTC | #2
On Mon, Feb 17, 2025 at 02:07:59PM +0000, Ryan Roberts wrote:
> __set_pte_complete(), set_pmd(), set_pud(), set_p4d() and set_pgd() are
> used to write entries into pgtables. And they issue barriers (currently
> dsb and isb) to ensure that the written values are observed by the table
> walker prior to any program-order-future memory access to the mapped
> location.
> 
> Over the years some of these functions have received optimizations: In
> particular, commit 7f0b1bf04511 ("arm64: Fix barriers used for page
> table modifications") made it so that the barriers were only emitted for
> valid-kernel mappings for set_pte() (now __set_pte_complete()). And
> commit 0795edaf3f1f ("arm64: pgtable: Implement p[mu]d_valid() and check
> in set_p[mu]d()") made it so that set_pmd()/set_pud() only emitted the
> barriers for valid mappings.

The assumption probably was that set_pmd/pud() are called a lot less
often than set_pte() as they cover larger address ranges (whether table
or leaf).

> set_p4d()/set_pgd() continue to emit the barriers unconditionally.

We probably missed them, should have been the same as set_pmd().

> This is all very confusing to the casual observer; surely the rules
> should be invariant to the level? Let's change this so that every level
> consistently emits the barriers only when setting valid, non-user
> entries (both table and leaf).

Also see commit d0b7a302d58a ("Revert "arm64: Remove unnecessary ISBs
from set_{pte,pmd,pud}"") for why we added back the ISBs to the pmd/pud
accessors and the last paragraph on why we are ok with the spurious
faults for PTEs.

For user mappings, the translation fault is routed through the usual
path that can handle mapping new entries, so I think we are fine. But
it's worth double-checking Will's comment (unless he only referred to
kernel table entries).

> It seems obvious that if it is ok to elide barriers all but valid kernel
> mappings at pte level, it must also be ok to do this for leaf entries at
> other levels: If setting an entry to invalid, a tlb maintenance
> operation must surely follow to synchronise the TLB and this contains
> the required barriers.

Setting to invalid is fine indeed, handled by the TLB flushing code,
hence the pmd_valid() checks.

> If setting a valid user mapping, the previous
> mapping must have been invalid and there must have been a TLB
> maintenance operation (complete with barriers) to honour
> break-before-make.

That's not entirely true for change_protection() for example or the
fork() path when we make the entries read-only from writeable without
BBM. We could improve these cases as well; I haven't looked in detail.
ptep_modify_prot_commit() via change_pte_range() can defer the barriers
to tlb_end_vma(). Something similar on the copy_present_ptes() path.
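
Roughly, the flow in question is (sketch of change_pte_range(); vma, addr,
ptep and newprot stand in for the usual mprotect context):

	oldpte = ptep_modify_prot_start(vma, addr, ptep);	/* clear + fetch */
	ptent = pte_modify(oldpte, newprot);			/* adjust permissions */
	ptep_modify_prot_commit(vma, addr, ptep, oldpte, ptent);
	/*
	 * The DSB/ISB could then be deferred to the batched TLB flush done
	 * via tlb_end_vma()/tlb_finish_mmu() once the range is complete.
	 */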

> So the worst that can happen is we take an extra
> fault (which will imply the DSB + ISB) and conclude that there is
> nothing to do. These are the arguments for doing this optimization at
> pte level and they also apply to leaf mappings at other levels.

It's worth clarifying Will's comment in the commit I mentioned above.

> For table entries, the same arguments hold: If unsetting a table entry,
> a TLB is required and this will emit the required barriers. If setting a
> table entry, the previous value must have been invalid and the table
> walker must already be able to observe that. Additionally the contents
> of the pgtable being pointed to in the newly set entry must be visible
> before the entry is written and this is enforced via smp_wmb() (dmb) in
> the pgtable allocation functions and in __split_huge_pmd_locked(). But
> this last part could never have been enforced by the barriers in
> set_pXd() because they occur after updating the entry. So ultimately,
> the wost that can happen by eliding these barriers for user table
> entries is an extra fault.
> 
> I observe roughly the same number of page faults (107M) with and without
> this change when compiling the kernel on Apple M2.

That's microarch specific, highly dependent on timing, so you may never
see a difference.

> +static inline bool pmd_valid_not_user(pmd_t pmd)
> +{
> +	/*
> +	 * User-space table entries always have (PXN && !UXN). All other
> +	 * combinations indicate it's a table entry for kernel space.
> +	 * Valid-not-user leaf entries follow the same rules as
> +	 * pte_valid_not_user().
> +	 */
> +	if (pmd_table(pmd))
> +		return !((pmd_val(pmd) & (PMD_TABLE_PXN | PMD_TABLE_UXN)) == PMD_TABLE_PXN);
> +	return pte_valid_not_user(pmd_pte(pmd));
> +}

With the 128-bit format I think we lost the PXN/UXNTable bits, though we
have software bits if we need to. I just wonder whether it's worth the
hassle of skipping some barriers for user non-leaf entries. Did you see
any improvement in practice?
Ryan Roberts Feb. 24, 2025, 12:26 p.m. UTC | #3
On 20/02/2025 16:54, Kevin Brodsky wrote:
> On 17/02/2025 15:07, Ryan Roberts wrote:
>> __set_pte_complete(), set_pmd(), set_pud(), set_p4d() and set_pgd() are
> 
> Nit: it would be more accurate to say __set_pte() instead of
> __set_pte_complete(), as it is the former that actually writes the PTE
> (and then issues barriers).

Yeah, fair enough. Will fix in the next version.

> 
>> used to write entries into pgtables. And they issue barriers (currently
>> dsb and isb) to ensure that the written values are observed by the table
>> walker prior to any program-order-future memory access to the mapped
>> location.
>>
>> Over the years some of these functions have received optimizations: In
>> particular, commit 7f0b1bf04511 ("arm64: Fix barriers used for page
>> table modifications") made it so that the barriers were only emitted for
>> valid-kernel mappings for set_pte() (now __set_pte_complete()). And
>> commit 0795edaf3f1f ("arm64: pgtable: Implement p[mu]d_valid() and check
>> in set_p[mu]d()") made it so that set_pmd()/set_pud() only emitted the
>> barriers for valid mappings. set_p4d()/set_pgd() continue to emit the
>> barriers unconditionally.
>>
>> This is all very confusing to the casual observer; surely the rules
>> should be invariant to the level? Let's change this so that every level
>> consistently emits the barriers only when setting valid, non-user
>> entries (both table and leaf).
>>
>> It seems obvious that if it is ok to elide barriers all but valid kernel
>> mappings at pte level, it must also be ok to do this for leaf entries at
>> other levels: If setting an entry to invalid, a tlb maintenance
>> operation must surely follow to synchronise the TLB and this contains
>> the required barriers. If setting a valid user mapping, the previous
>> mapping must have been invalid and there must have been a TLB
>> maintenance operation (complete with barriers) to honour
>> break-before-make. So the worst that can happen is we take an extra
>> fault (which will imply the DSB + ISB) and conclude that there is
>> nothing to do. These are the arguments for doing this optimization at
>> pte level and they also apply to leaf mappings at other levels.
>>
>> For table entries, the same arguments hold: If unsetting a table entry,
>> a TLB is required and this will emit the required barriers. If setting a
> 
> s/TLB/TLB maintenance/
> 
>> table entry, the previous value must have been invalid and the table
>> walker must already be able to observe that. Additionally the contents
>> of the pgtable being pointed to in the newly set entry must be visible
>> before the entry is written and this is enforced via smp_wmb() (dmb) in
>> the pgtable allocation functions and in __split_huge_pmd_locked(). But
>> this last part could never have been enforced by the barriers in
>> set_pXd() because they occur after updating the entry. So ultimately,
>> the wost that can happen by eliding these barriers for user table
> 
> s/wost/worst/
> 
> - Kevin
> 
>> entries is an extra fault.
>>
>> [...]
>
Ryan Roberts Feb. 25, 2025, 4:41 p.m. UTC | #4
On 22/02/2025 13:17, Catalin Marinas wrote:
> On Mon, Feb 17, 2025 at 02:07:59PM +0000, Ryan Roberts wrote:
>> __set_pte_complete(), set_pmd(), set_pud(), set_p4d() and set_pgd() are
>> used to write entries into pgtables. And they issue barriers (currently
>> dsb and isb) to ensure that the written values are observed by the table
>> walker prior to any program-order-future memory access to the mapped
>> location.
>>
>> Over the years some of these functions have received optimizations: In
>> particular, commit 7f0b1bf04511 ("arm64: Fix barriers used for page
>> table modifications") made it so that the barriers were only emitted for
>> valid-kernel mappings for set_pte() (now __set_pte_complete()). And
>> commit 0795edaf3f1f ("arm64: pgtable: Implement p[mu]d_valid() and check
>> in set_p[mu]d()") made it so that set_pmd()/set_pud() only emitted the
>> barriers for valid mappings.
> 
> The assumption probably was that set_pmd/pud() are called a lot less
> often than set_pte() as they cover larger address ranges (whether table
> or leaf).

Yes you're probably right. From an optimization perspective I doubt we are
seeing much improvement from the reduction in barriers for pmd/pud/p4d/pgd
levels. I made the change more for the benefit of uniformity.

> 
>> set_p4d()/set_pgd() continue to emit the barriers unconditionally.
> 
> We probably missed them, should have been the same as set_pmd().

So perhaps the middle ground is for pmd/pud/p4d/pgd to all have the same
implementation (emit barriers for valid mappings only). And then let pte level
continue with the additional optimization of only emitting barriers for valid
*kernel* mappings.

The only problem there is that it gets confusing when you bring set_pmd_at() and
set_pud_at() into the picture, which today end up calling __set_pte(). So
pmds/puds set by that route *already* have the same optimizations applied as ptes
today.

Personally, I'd prefer to have a single pattern for all levels.

> 
>> This is all very confusing to the casual observer; surely the rules
>> should be invariant to the level? Let's change this so that every level
>> consistently emits the barriers only when setting valid, non-user
>> entries (both table and leaf).
> 
> Also see commit d0b7a302d58a ("Revert "arm64: Remove unnecessary ISBs
> from set_{pte,pmd,pud}"") why we added back the ISBs to the pmd/pud
> accessors and the last paragraph on why we are ok with the spurious
> faults for PTEs.

Yes, I was already aware of that behaviour (thanks to MarkR for patiently
explaining it to me!). But my changes should still be compatible with that
requirement; we always issue an ISB after updating the page table with a valid
kernel mapping.

As an aside, my understanding is that if the system supports FEAT_ETS2, then the
CPU promises not to store that "faulting" state in the pipeline, and the ISB
should not be needed; we could patch that out in future with an alternative.
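
As a purely illustrative sketch (ARM64_HAS_ETS2 is a made-up capability
name, not an existing cpucap), that could look something like:

	if (pte_valid_not_user(pte)) {
		dsb(ishst);
		/* FEAT_ETS2: no stale "faulting" state is retained, so skip the ISB. */
		if (!alternative_has_cap_unlikely(ARM64_HAS_ETS2))
			isb();
	}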

> 
> For user mappings, the translation fault is routed through the usual
> path that can handle mapping new entries, so I think we are fine. But
> it's worth double-checking Will's comment (unless he only referred to
> kernel table entries).
> 
>> It seems obvious that if it is ok to elide barriers all but valid kernel
>> mappings at pte level, it must also be ok to do this for leaf entries at
>> other levels: If setting an entry to invalid, a tlb maintenance
>> operation must surely follow to synchronise the TLB and this contains
>> the required barriers.
> 
> Setting to invalid is fine indeed, handled by the TLB flushing code,
> hence the pmd_valid() checks.
> 
>> If setting a valid user mapping, the previous
>> mapping must have been invalid and there must have been a TLB
>> maintenance operation (complete with barriers) to honour
>> break-before-make.
> 
> That's not entirely true for change_protection() for example or the
> fork() path when we make the entries read-only from writeable without

For fork() we use ptep_set_wrprotect() (or these days, wrprotect_ptes(), right?)

> BBM. We could improve these cases as well, I haven't looked in detail.
> ptep_modify_prot_commit() via change_pte_range() can defer the barriers
> to tlb_end_vma(). Something similar on the copy_present_ptes() path.

But yes, I agree my comments are not fully correct for all corner cases. But the
point is that for all of these corner cases, the higher level causes appropriate
TLBI operations to be issued, which include the appropriate barriers.

> 
>> So the worst that can happen is we take an extra
>> fault (which will imply the DSB + ISB) and conclude that there is
>> nothing to do. These are the arguments for doing this optimization at
>> pte level and they also apply to leaf mappings at other levels.
> 
> It's worth clarifying Will's comment in the commit I mentioned above.

By my interpretation of Will's comment, it agrees with what I'm trying to say
here. Perhaps I've missed something?

> 
>> For table entries, the same arguments hold: If unsetting a table entry,
>> a TLB is required and this will emit the required barriers. If setting a
>> table entry, the previous value must have been invalid and the table
>> walker must already be able to observe that. Additionally the contents
>> of the pgtable being pointed to in the newly set entry must be visible
>> before the entry is written and this is enforced via smp_wmb() (dmb) in
>> the pgtable allocation functions and in __split_huge_pmd_locked(). But
>> this last part could never have been enforced by the barriers in
>> set_pXd() because they occur after updating the entry. So ultimately,
>> the wost that can happen by eliding these barriers for user table
>> entries is an extra fault.
>>
>> I observe roughly the same number of page faults (107M) with and without
>> this change when compiling the kernel on Apple M2.
> 
> That's microarch specific, highly dependent on timing, so you may never
> see a difference.

Fair. Are you advocating for keeping the higher levels not conditional on user
vs kernel? If so, then my preference would be to at least simplify to 2 patterns
as I describe above.

> 
>> +static inline bool pmd_valid_not_user(pmd_t pmd)
>> +{
>> +	/*
>> +	 * User-space table entries always have (PXN && !UXN). All other
>> +	 * combinations indicate it's a table entry for kernel space.
>> +	 * Valid-not-user leaf entries follow the same rules as
>> +	 * pte_valid_not_user().
>> +	 */
>> +	if (pmd_table(pmd))
>> +		return !((pmd_val(pmd) & (PMD_TABLE_PXN | PMD_TABLE_UXN)) == PMD_TABLE_PXN);
>> +	return pte_valid_not_user(pmd_pte(pmd));
>> +}
> 
> With the 128-bit format I think we lost the PXN/UXNTable bits, though we
> have software bits if we need to. 

Indeed. I've already fed back to Anshuman that in my opinion, those bits need to
be maintained as SW bits. Otherwise it all gets a bit fragile if a SW agent
tries to check the value of those bits in D128 mode vs D64 mode.

> I just wonder whether it's worth the
> hassle of skipping some barriers for user non-leaf entries. Did you see
> any improvement in practice?

As I said, I doubt there is much of a performance improvement in practice; my
main motivation was unifying around a single pattern to simplify.

Thanks,
Ryan

Patch

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index e4b1946b261f..51128c2956f8 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -767,6 +767,19 @@  static inline bool in_swapper_pgdir(void *addr)
 	        ((unsigned long)swapper_pg_dir & PAGE_MASK);
 }
 
+static inline bool pmd_valid_not_user(pmd_t pmd)
+{
+	/*
+	 * User-space table entries always have (PXN && !UXN). All other
+	 * combinations indicate it's a table entry for kernel space.
+	 * Valid-not-user leaf entries follow the same rules as
+	 * pte_valid_not_user().
+	 */
+	if (pmd_table(pmd))
+		return !((pmd_val(pmd) & (PMD_TABLE_PXN | PMD_TABLE_UXN)) == PMD_TABLE_PXN);
+	return pte_valid_not_user(pmd_pte(pmd));
+}
+
 static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 {
 #ifdef __PAGETABLE_PMD_FOLDED
@@ -778,7 +791,7 @@  static inline void set_pmd(pmd_t *pmdp, pmd_t pmd)
 
 	WRITE_ONCE(*pmdp, pmd);
 
-	if (pmd_valid(pmd)) {
+	if (pmd_valid_not_user(pmd)) {
 		dsb(ishst);
 		isb();
 	}
@@ -833,6 +846,7 @@  static inline unsigned long pmd_page_vaddr(pmd_t pmd)
 #define pud_valid(pud)		pte_valid(pud_pte(pud))
 #define pud_user(pud)		pte_user(pud_pte(pud))
 #define pud_user_exec(pud)	pte_user_exec(pud_pte(pud))
+#define pud_valid_not_user(pud)	pmd_valid_not_user(pte_pmd(pud_pte(pud)))
 
 static inline bool pgtable_l4_enabled(void);
 
@@ -845,7 +859,7 @@  static inline void set_pud(pud_t *pudp, pud_t pud)
 
 	WRITE_ONCE(*pudp, pud);
 
-	if (pud_valid(pud)) {
+	if (pud_valid_not_user(pud)) {
 		dsb(ishst);
 		isb();
 	}
@@ -916,6 +930,7 @@  static inline bool mm_pud_folded(const struct mm_struct *mm)
 #define p4d_none(p4d)		(pgtable_l4_enabled() && !p4d_val(p4d))
 #define p4d_bad(p4d)		(pgtable_l4_enabled() && !(p4d_val(p4d) & P4D_TABLE_BIT))
 #define p4d_present(p4d)	(!p4d_none(p4d))
+#define p4d_valid_not_user(p4d)	pmd_valid_not_user(pte_pmd(p4d_pte(p4d)))
 
 static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
@@ -925,8 +940,11 @@  static inline void set_p4d(p4d_t *p4dp, p4d_t p4d)
 	}
 
 	WRITE_ONCE(*p4dp, p4d);
-	dsb(ishst);
-	isb();
+
+	if (p4d_valid_not_user(p4d)) {
+		dsb(ishst);
+		isb();
+	}
 }
 
 static inline void p4d_clear(p4d_t *p4dp)
@@ -1043,6 +1061,7 @@  static inline bool mm_p4d_folded(const struct mm_struct *mm)
 #define pgd_none(pgd)		(pgtable_l5_enabled() && !pgd_val(pgd))
 #define pgd_bad(pgd)		(pgtable_l5_enabled() && !(pgd_val(pgd) & PGD_TABLE_BIT))
 #define pgd_present(pgd)	(!pgd_none(pgd))
+#define pgd_valid_not_user(pgd)	pmd_valid_not_user(pte_pmd(pgd_pte(pgd)))
 
 static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
@@ -1052,8 +1071,11 @@  static inline void set_pgd(pgd_t *pgdp, pgd_t pgd)
 	}
 
 	WRITE_ONCE(*pgdp, pgd);
-	dsb(ishst);
-	isb();
+
+	if (pgd_valid_not_user(pgd)) {
+		dsb(ishst);
+		isb();
+	}
 }
 
 static inline void pgd_clear(pgd_t *pgdp)