Message ID | 20240207090458.463021-1-gankulkarni@os.amperecomputing.com (mailing list archive) |
---|---|
State | New, archived |
Series | arm64: errata: Minimize tlb flush due to vttbr writes on AmpereOne |
On Wed, Feb 07, 2024 at 01:04:58AM -0800, Ganapatrao Kulkarni wrote:
> The AmpereOne implementation does a TLB flush whenever there is
> a write to vttbr_el2. In the KVM implementation, vttbr_el2 is updated
> with the VM's S2-MMU on return to the VM. This is not necessary when
> there is no VM context switch and KVM just returns to the same guest.
>
> Add a check to avoid the vttbr_el2 write if the same value
> already exists, to prevent a needless TLB flush.

Sorry, zero interest in taking what is really a uarch optimization.
The errata framework exists to allow the kernel to achieve *correctness*
on a variety of hardware and is not a collection of party tricks for
optimizing any given implementation.

Think of the precedent this would establish. What would stop
implementers from, say, changing out our memcpy implementation into
a hundred different uarch-specific routines? That isn't maintainable,
nor is it even testable as most folks don't have access to your
hardware.

Ignoring all of that -- I question the necessity of these patches
altogether. KVM writes to VTTBR at the time of vcpu load as of commit
934bf871f011 ("KVM: arm64: Load the stage-2 MMU context in
kvm_vcpu_load_vhe()"), which should drastically reduce the overhead of
this hardware fix.

--
Thanks,
Oliver

(catching up on emails)

On Wed, Feb 07, 2024 at 09:45:59AM +0000, Oliver Upton wrote:
> On Wed, Feb 07, 2024 at 01:04:58AM -0800, Ganapatrao Kulkarni wrote:
> > The AmpereOne implementation does a TLB flush whenever there is
> > a write to vttbr_el2. In the KVM implementation, vttbr_el2 is updated
> > with the VM's S2-MMU on return to the VM. This is not necessary when
> > there is no VM context switch and KVM just returns to the same guest.
> >
> > Add a check to avoid the vttbr_el2 write if the same value
> > already exists, to prevent a needless TLB flush.
>
> Sorry, zero interest in taking what is really a uarch optimization.
> The errata framework exists to allow the kernel to achieve *correctness*
> on a variety of hardware and is not a collection of party tricks for
> optimizing any given implementation.

Definitely, we should not abuse the errata framework for uarch
optimisations.

> Think of the precedent this would establish. What would stop
> implementers from, say, changing out our memcpy implementation into
> a hundred different uarch-specific routines? That isn't maintainable,
> nor is it even testable as most folks don't have access to your
> hardware.

I agree. FTR, I'm fine with uarch optimisations if (a) they don't
run-time patch the kernel binary, (b) don't affect the existing hardware
and (c) show significant gains on the targeted uarch in some meaningful
benchmarks (definitely not a microbenchmark hammering a certain kernel
path).

We did have uarch optimisations in the past that broke rule (a). We
tried to make them somewhat more justifiable by creating optimisation
classes (well, I think it was only ARM64_HAS_NO_HW_PREFETCH). But such
changes don't scale well for maintainers, so I'd rather not go back
there.

So, if one wants an optimisation, it better benefits the other
implementations or at least it doesn't make them worse. Now, we do have
hardware from mobiles to large enterprise systems, so at some point we
may have to make a call on different kernel behaviours, possibly even at
run-time. We already do this at build-time, e.g. CONFIG_NUMA where it
doesn't make much sense in a mobile (yet). But they should not be seen
as uarch specific tweaks, more like higher-level classes of
optimisations.

On Tue, Feb 27, 2024 at 08:11:22PM +0000, Catalin Marinas wrote:
> On Wed, Feb 07, 2024 at 09:45:59AM +0000, Oliver Upton wrote:

[...]

> > Think of the precedent this would establish. What would stop
> > implementers from, say, changing out our memcpy implementation into
> > a hundred different uarch-specific routines? That isn't maintainable,
> > nor is it even testable as most folks don't have access to your
> > hardware.
>
> I agree. FTR, I'm fine with uarch optimisations if (a) they don't
> run-time patch the kernel binary, (b) don't affect the existing hardware
> and (c) show significant gains on the targeted uarch in some meaningful
> benchmarks (definitely not a microbenchmark hammering a certain kernel
> path).

and (d) they have a minimal, maintainable code footprint :)

> So, if one wants an optimisation, it better benefits the other
> implementations or at least it doesn't make them worse. Now, we do have
> hardware from mobiles to large enterprise systems, so at some point we
> may have to make a call on different kernel behaviours, possibly even at
> run-time. We already do this at build-time, e.g. CONFIG_NUMA where it
> doesn't make much sense in a mobile (yet). But they should not be seen
> as uarch specific tweaks, more like higher-level classes of
> optimisations.

Agreed. I think the way we handled this case is a great example of how
these sorts of things should go -- a general improvement to how the
stage-2 MMU gets loaded on VHE systems, which ought to benefit other
implementations too.

Only if we can't extract a generalization should we even think about
something implementation-specific, IMO.

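The generalization discussed above (load the stage-2 MMU context when the vCPU is loaded, rather than on every return to the guest) can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `sim_hw`, `sim_vcpu`, `hw_write_vttbr` and both counting helpers are hypothetical names invented for illustration, and none of this is KVM code. It only counts how many register writes each scheme would issue while the same vCPU stays loaded.

```c
#include <stdint.h>

/* Hypothetical model of one physical CPU; it only counts VTTBR writes. */
struct sim_hw {
	uint64_t vttbr;
	unsigned long vttbr_writes;
};

/* Hypothetical vCPU carrying the VTTBR value of its VM. */
struct sim_vcpu {
	uint64_t vttbr;
};

static void hw_write_vttbr(struct sim_hw *hw, uint64_t val)
{
	hw->vttbr = val;
	hw->vttbr_writes++;
}

/* Per-entry scheme: the register is rewritten on every return to the guest. */
static unsigned long writes_load_per_entry(const struct sim_vcpu *vcpu,
					   unsigned int entries)
{
	struct sim_hw hw = { 0 };

	for (unsigned int i = 0; i < entries; i++)
		hw_write_vttbr(&hw, vcpu->vttbr);	/* reload, then enter guest */

	return hw.vttbr_writes;
}

/* Load-time scheme: the register is written once, when the vCPU is loaded. */
static unsigned long writes_load_at_vcpu_load(const struct sim_vcpu *vcpu,
					      unsigned int entries)
{
	struct sim_hw hw = { 0 };

	hw_write_vttbr(&hw, vcpu->vttbr);	/* models the vcpu load step */
	(void)entries;				/* later guest entries do not write */

	return hw.vttbr_writes;
}
```

In this model the number of writes under the load-time scheme does not depend on how often the guest is re-entered, which is why it helps any implementation where the write is expensive, not just one.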
diff --git a/Documentation/arch/arm64/silicon-errata.rst b/Documentation/arch/arm64/silicon-errata.rst
index e8c2ce1f9df6..8924e84358c9 100644
--- a/Documentation/arch/arm64/silicon-errata.rst
+++ b/Documentation/arch/arm64/silicon-errata.rst
@@ -54,6 +54,8 @@ stable kernels.
 +----------------+-----------------+-----------------+-----------------------------+
 | Ampere         | AmpereOne       | AC03_CPU_38     | AMPERE_ERRATUM_AC03_CPU_38  |
 +----------------+-----------------+-----------------+-----------------------------+
+| Ampere         | AmpereOne       | N/A             | AMPERE_AC03_REDUCE_TLB_FLUSH|
++----------------+-----------------+-----------------+-----------------------------+
 +----------------+-----------------+-----------------+-----------------------------+
 | ARM            | Cortex-A510     | #2457168        | ARM64_ERRATUM_2457168       |
 +----------------+-----------------+-----------------+-----------------------------+
diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index aa7c1d435139..77485d0322e4 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -436,6 +436,19 @@ config AMPERE_ERRATUM_AC03_CPU_38
 
 	  If unsure, say Y.
 
+config AMPERE_AC03_REDUCE_TLB_FLUSH
+	bool "AmpereOne: Minimize the writes to vttbr_el2 register"
+	default y
+	help
+	  On AmpereOne, any writes to vttbr_el2 results in TLB flush.
+	  It can be avoided to improve the performance when there is no VM
+	  context switches and a just return to same VM from the hypervisor.
+
+	  This option adds a check to avoid rewrite of the same value
+	  to vttbr_el2.
+
+	  If unsure, say Y.
+
 config ARM64_WORKAROUND_CLEAN_CACHE
 	bool
 
diff --git a/arch/arm64/include/asm/kvm_mmu.h b/arch/arm64/include/asm/kvm_mmu.h
index e3e793d0ec30..da39e4749434 100644
--- a/arch/arm64/include/asm/kvm_mmu.h
+++ b/arch/arm64/include/asm/kvm_mmu.h
@@ -317,8 +317,14 @@ static __always_inline u64 kvm_get_vttbr(struct kvm_s2_mmu *mmu)
 static __always_inline void __load_stage2(struct kvm_s2_mmu *mmu,
 					  struct kvm_arch *arch)
 {
+	u64 vttbr;
+
+	vttbr = kvm_get_vttbr(mmu);
 	write_sysreg(mmu->vtcr, vtcr_el2);
-	write_sysreg(kvm_get_vttbr(mmu), vttbr_el2);
+
+	if (!cpus_have_final_cap(ARM64_WORKAROUND_AMPERE_AC03_TLB_FLUSH) ||
+	    read_sysreg(vttbr_el2) != vttbr)
+		write_sysreg(vttbr, vttbr_el2);
 
 	/*
 	 * ARM errata 1165522 and 1530923 require the actual execution of the
diff --git a/arch/arm64/kernel/cpu_errata.c b/arch/arm64/kernel/cpu_errata.c
index 967c7c7a4e7d..f612975e0cb5 100644
--- a/arch/arm64/kernel/cpu_errata.c
+++ b/arch/arm64/kernel/cpu_errata.c
@@ -740,6 +740,13 @@ const struct arm64_cpu_capabilities arm64_errata[] = {
 		.capability = ARM64_WORKAROUND_AMPERE_AC03_CPU_38,
 		ERRATA_MIDR_ALL_VERSIONS(MIDR_AMPERE1),
 	},
+#endif
+#ifdef CONFIG_AMPERE_AC03_REDUCE_TLB_FLUSH
+	{
+		.desc = "AmpereOne, minimize tlb flush due to vttbr write",
+		.capability = ARM64_WORKAROUND_AMPERE_AC03_TLB_FLUSH,
+		ERRATA_MIDR_ALL_VERSIONS(MIDR_AMPERE1),
+	},
 #endif
 	{
 	}
diff --git a/arch/arm64/tools/cpucaps b/arch/arm64/tools/cpucaps
index b912b1409fc0..b4bee37d0527 100644
--- a/arch/arm64/tools/cpucaps
+++ b/arch/arm64/tools/cpucaps
@@ -85,6 +85,7 @@ WORKAROUND_2457168
 WORKAROUND_2645198
 WORKAROUND_2658417
 WORKAROUND_AMPERE_AC03_CPU_38
+WORKAROUND_AMPERE_AC03_TLB_FLUSH
 WORKAROUND_TRBE_OVERWRITE_FILL_MODE
 WORKAROUND_TSB_FLUSH_FAILURE
 WORKAROUND_TRBE_WRITE_OUT_OF_RANGE

The AmpereOne implementation does a TLB flush whenever there is
a write to vttbr_el2. In the KVM implementation, vttbr_el2 is updated
with the VM's S2-MMU on return to the VM. This is not necessary when
there is no VM context switch and KVM just returns to the same guest.

Add a check to avoid the vttbr_el2 write if the same value
already exists, to prevent a needless TLB flush.

Signed-off-by: Ganapatrao Kulkarni <gankulkarni@os.amperecomputing.com>
---
 Documentation/arch/arm64/silicon-errata.rst |  2 ++
 arch/arm64/Kconfig                          | 13 +++++++++++++
 arch/arm64/include/asm/kvm_mmu.h            |  8 +++++++-
 arch/arm64/kernel/cpu_errata.c              |  7 +++++++
 arch/arm64/tools/cpucaps                    |  1 +
 5 files changed, 30 insertions(+), 1 deletion(-)
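
The compare-before-write check described in this commit message can be tried out in user space with a simulated register. The sketch below is not kernel code: `sim_cpu`, `sim_read_vttbr`, `sim_write_vttbr`, `load_stage2_always`, `load_stage2_checked` and `count_flushes` are hypothetical names made up for illustration, and the `flush_on_write` flag merely stands in for the capability check in the patch. It assumes, as the description states, that every write to the register costs one TLB flush.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for vttbr_el2; not real KVM or hardware state. */
struct sim_cpu {
	uint64_t vttbr;		/* simulated register value */
	unsigned long flushes;	/* TLB flushes triggered by register writes */
	bool flush_on_write;	/* models the behaviour described for AmpereOne */
};

static uint64_t sim_read_vttbr(const struct sim_cpu *cpu)
{
	return cpu->vttbr;
}

static void sim_write_vttbr(struct sim_cpu *cpu, uint64_t val)
{
	cpu->vttbr = val;
	if (cpu->flush_on_write)
		cpu->flushes++;
}

/* Unconditional load: every call writes the register. */
static void load_stage2_always(struct sim_cpu *cpu, uint64_t vttbr)
{
	sim_write_vttbr(cpu, vttbr);
}

/* Checked load: skip the write when the register already holds the value. */
static void load_stage2_checked(struct sim_cpu *cpu, uint64_t vttbr)
{
	if (!cpu->flush_on_write || sim_read_vttbr(cpu) != vttbr)
		sim_write_vttbr(cpu, vttbr);
}

/* Re-enter the same guest `entries` times and report the flush count. */
static unsigned long count_flushes(void (*load)(struct sim_cpu *, uint64_t),
				   unsigned int entries)
{
	struct sim_cpu cpu = { .vttbr = 0, .flushes = 0, .flush_on_write = true };

	for (unsigned int i = 0; i < entries; i++)
		load(&cpu, 0x1000);	/* same guest, same VTTBR value each time */

	return cpu.flushes;
}
```

Under these assumptions, the checked variant flushes only on the first load of a given value, while the unconditional variant flushes on every entry. The trade-off is an extra register read on each call.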