[v1,2/2] KVM: arm64: PMU: Ensure to trap PMU access from EL0 to EL2

Message ID 20230329002136.2463442-3-reijiw@google.com (mailing list archive)
State New, archived
Series KVM: arm64: PMU: Correct the handling of PMUSERENR_EL0

Commit Message

Reiji Watanabe March 29, 2023, 12:21 a.m. UTC
Currently, with VHE, KVM sets the ER, CR, SW and EN bits of
PMUSERENR_EL0 to 1 on vcpu_load().  So, if those bits are cleared
after vcpu_load() (as the perf subsystem does when PMU counters are
programmed for the guest), PMU accesses from the guest EL0 might be
trapped directly to the guest EL1, regardless of the vCPU's current
PMUSERENR_EL0 value.

With VHE, fix this by setting those bits of the register on every
guest entry (as is already done with nVHE).  Also, opportunistically
make a similar change for PMSELR_EL0, which is cleared by vcpu_load(),
to ensure it is always set to zero on guest entry (a PMXEVCNTR_EL0
access might UNDEF at EL1 instead of being trapped to EL2, depending
on the value of PMSELR_EL0).  I think that would be more robust,
although I don't find any kernel code that writes PMSELR_EL0.

Fixes: 83a7a4d643d3 ("arm64: perf: Enable PMU counter userspace access for perf event")
Signed-off-by: Reiji Watanabe <reijiw@google.com>
---
 arch/arm64/kvm/hyp/include/hyp/switch.h | 29 +++++++++++++------------
 1 file changed, 15 insertions(+), 14 deletions(-)
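
For context, a condensed sketch (not part of this patch) of the perf-side
writes referred to above; the real functions live in
arch/arm64/kernel/perf_event.c and are quoted in the hunks further down
this thread:

static void armv8pmu_disable_user_access(void)
{
	/*
	 * Clears ER/CR/SW/EN.  If this runs after vcpu_load(), a guest EL0
	 * PMU access traps to the guest's EL1 rather than to the EL2 trap
	 * KVM relies on to emulate the vCPU's PMUSERENR_EL0.
	 */
	write_sysreg(0, pmuserenr_el0);
}

static void armv8pmu_enable_user_access(struct arm_pmu *cpu_pmu)
{
	/* ... unused event counters are zeroed here ... */
	write_sysreg(0, pmuserenr_el0);
	write_sysreg(ARMV8_PMU_USERENR_ER | ARMV8_PMU_USERENR_CR, pmuserenr_el0);
}

armv8pmu_start() picks one of these whenever counters are (re)programmed,
including when perf events are created for the guest (see the comments
below), which is what overwrites the value KVM set at vcpu_load().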

Comments

Marc Zyngier March 29, 2023, 12:03 p.m. UTC | #1
On Wed, 29 Mar 2023 01:21:36 +0100,
Reiji Watanabe <reijiw@google.com> wrote:
> 
> Currently, with VHE, KVM sets ER, CR, SW and EN bits of
> PMUSERENR_EL0 to 1 on vcpu_load().  So, if the value of those bits
> are cleared after vcpu_load() (the perf subsystem would do when PMU
> counters are programmed for the guest), PMU access from the guest EL0
> might be trapped to the guest EL1 directly regardless of the current
> PMUSERENR_EL0 value of the vCPU.

+ RobH.

Is that what is done when the event is created and armv8pmu_start()
called? This is... crap. The EL0 access thing breaks everything, and
nobody tested it with KVM, obviously.

I would be tempted to start mitigating it with the following:

diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index dde06c0f97f3..8063525bf3dd 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -806,17 +806,19 @@ static void armv8pmu_disable_event(struct perf_event *event)
 
 static void armv8pmu_start(struct arm_pmu *cpu_pmu)
 {
-	struct perf_event_context *ctx;
-	int nr_user = 0;
+	if (sysctl_perf_user_access) {
+		struct perf_event_context *ctx;
+		int nr_user = 0;
 
-	ctx = perf_cpu_task_ctx();
-	if (ctx)
-		nr_user = ctx->nr_user;
+		ctx = perf_cpu_task_ctx();
+		if (ctx)
+			nr_user = ctx->nr_user;
 
-	if (sysctl_perf_user_access && nr_user)
-		armv8pmu_enable_user_access(cpu_pmu);
-	else
-		armv8pmu_disable_user_access();
+		if (nr_user)
+			armv8pmu_enable_user_access(cpu_pmu);
+		else
+			armv8pmu_disable_user_access();
+	}
 
 	/* Enable all counters */
 	armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E);

but that's obviously not enough as we want it to work with EL0 access
enabled on the host as well.

What we miss is something that tells the PMU code "we're in a context
where host userspace isn't present", and this would be completely
skipped, relying on KVM to restore the appropriate state on
vcpu_put(). But then the IPI stuff that controls EL0 can always come
in and wreck things. Gahhh...

I'm a bit reluctant to use the "save/restore all the time" hammer,
because it only hides that the EL0 counter infrastructure is a bit
broken.

> With VHE, fix this by setting those bits of the register on every
> guest entry (as with nVHE).  Also, opportunistically make the similar
> change for PMSELR_EL0, which is cleared by vcpu_load(), to ensure it
> is always set to zero on guest entry (PMXEVCNTR_EL0 access might cause
> UNDEF at EL1 instead of being trapped to EL2, depending on the value
> of PMSELR_EL0).  I think that would be more robust, although I don't
> find any kernel code that writes PMSELR_EL0.

This was changed a while ago to avoid using the selection register,
see 0fdf1bb75953 ("arm64: perf: Avoid PMXEV* indirection"), and the
rationale behind the reset of PMSELR_EL0 in 21cbe3cc8a48 ("arm64: KVM:
pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest").
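
Concretely (a sketch, not code from either commit), the change was from
selector-based indirection to direct accessors:

	u64 val;

	/* Before 0fdf1bb75953: indirect access, so PMSELR_EL0 was live host state */
	write_sysreg(3, pmselr_el0);		/* counter 3 picked arbitrarily */
	isb();
	val = read_sysreg(pmxevcntr_el0);

	/* Since then: dedicated PMEVCNTR<n>_EL0 accessors, PMSELR_EL0 is never written */
	val = read_sysreg(pmevcntr3_el0);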

We *could* simply drop this zeroing of PMSELR_EL0 now that there is
nothing else host-side that writes to it. But we need to agree on how
to fix the above first.

Thanks,

	M.
Reiji Watanabe March 30, 2023, 3:55 a.m. UTC | #2
Hi Marc,

On Wed, Mar 29, 2023 at 01:03:18PM +0100, Marc Zyngier wrote:
> On Wed, 29 Mar 2023 01:21:36 +0100,
> Reiji Watanabe <reijiw@google.com> wrote:
> > 
> > Currently, with VHE, KVM sets ER, CR, SW and EN bits of
> > PMUSERENR_EL0 to 1 on vcpu_load().  So, if the value of those bits
> > are cleared after vcpu_load() (the perf subsystem would do when PMU
> > counters are programmed for the guest), PMU access from the guest EL0
> > might be trapped to the guest EL1 directly regardless of the current
> > PMUSERENR_EL0 value of the vCPU.
> 
> + RobH.
> 
> Is that what is done when the event is created and armv8pmu_start()
> called? 

Yes, that is it.

> This is... crap. The EL0 access thing breaks everything, and
> nobody tested it with KVM, obviously.

It was a bit shocking, as we hit those EL0-related issues with
the very first EL0 PMU test we ran...

> 
> I would be tempted to start mitigating it with the following:
> 
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index dde06c0f97f3..8063525bf3dd 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -806,17 +806,19 @@ static void armv8pmu_disable_event(struct perf_event *event)
>  
>  static void armv8pmu_start(struct arm_pmu *cpu_pmu)
>  {
> -	struct perf_event_context *ctx;
> -	int nr_user = 0;
> +	if (sysctl_perf_user_access) {
> +		struct perf_event_context *ctx;
> +		int nr_user = 0;
>  
> -	ctx = perf_cpu_task_ctx();
> -	if (ctx)
> -		nr_user = ctx->nr_user;
> +		ctx = perf_cpu_task_ctx();
> +		if (ctx)
> +			nr_user = ctx->nr_user;
>  
> -	if (sysctl_perf_user_access && nr_user)
> -		armv8pmu_enable_user_access(cpu_pmu);
> -	else
> -		armv8pmu_disable_user_access();
> +		if (nr_user)
> +			armv8pmu_enable_user_access(cpu_pmu);
> +		else
> +			armv8pmu_disable_user_access();
> +	}
>  
>  	/* Enable all counters */
>  	armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E);
> 
> but that's obviously not enough as we want it to work with EL0 access
> enabled on the host as well.

Right.  Also, with the change above, since PMUSERENR_EL0 isn't
explicitly cleared, a perf client (EL0) might still have access to
the counters.  (With the current code, a non-perf client might have
access to the counters, though.)


> What we miss is something that tells the PMU code "we're in a context
> where host userspace isn't present", and this would be completely

Could you please elaborate?
I'm not sure I understand the above correctly.
Since the task does have host userspace, which could itself be using
the PMU, and both the host EL0 and guest EL0 events are associated with
the task context of the perf_cpu_context, I think the "something" we
want to express would be subtle (I assume it is similar to what we
mean by exclude_guest == 0 && exclude_host == 1 in the event attr
for the guest, in terms of events?).


> skipped, relying on KVM to restore the appropriate state on
> vcpu_put(). But then the IPI stuff that controls EL0 can always come
> in and wreck things. Gahhh...
> 
> I'm a bit reluctant to use the "save/restore all the time" hammer,
> because it only hides that the EL0 counter infrastructure is a bit
> broken.

Looking only at the current code, since KVM silently modifies the
PMU register (PMUSERENR_EL0) directly even though KVM is, in general,
a client of the perf subsystem, my original thought was that it made
sense to have KVM restore the register value.


> > With VHE, fix this by setting those bits of the register on every
> > guest entry (as with nVHE).  Also, opportunistically make the similar
> > change for PMSELR_EL0, which is cleared by vcpu_load(), to ensure it
> > is always set to zero on guest entry (PMXEVCNTR_EL0 access might cause
> > UNDEF at EL1 instead of being trapped to EL2, depending on the value
> > of PMSELR_EL0).  I think that would be more robust, although I don't
> > find any kernel code that writes PMSELR_EL0.
> 
> This was changed a while ago to avoid using the selection register,
> see 0fdf1bb75953 ("arm64: perf: Avoid PMXEV* indirection"), and the
> rationale behind the reset of PMSELR_EL0 in 21cbe3cc8a48 ("arm64: KVM:
> pmu: Reset PMSELR_EL0.SEL to a sane value before entering the guest").
> 
> We *could* simply drop this zeroing of PMSELR_EL0 now that there is
> nothing else host-side that writes to it. But we need to agree on how
> to fix the above first.

We don't have to clear PMSELR_EL0 on every guest entry,
but I would think we should still do that at least in vcpu_load(),
since the host EL0 can now have direct access to PMSELR_EL0.
(It should be fine with a sane EL0, though.)

Thank you,
Reiji
Mark Rutland April 4, 2023, 2:25 p.m. UTC | #3
On Wed, Mar 29, 2023 at 01:03:18PM +0100, Marc Zyngier wrote:
> On Wed, 29 Mar 2023 01:21:36 +0100,
> Reiji Watanabe <reijiw@google.com> wrote:
> > 
> > Currently, with VHE, KVM sets ER, CR, SW and EN bits of
> > PMUSERENR_EL0 to 1 on vcpu_load().  So, if the value of those bits
> > are cleared after vcpu_load() (the perf subsystem would do when PMU
> > counters are programmed for the guest), PMU access from the guest EL0
> > might be trapped to the guest EL1 directly regardless of the current
> > PMUSERENR_EL0 value of the vCPU.
> 
> + RobH.
> 
> Is that what is done when the event is created and armv8pmu_start()
> called? This is... crap. The EL0 access thing breaks everything, and
> nobody tested it with KVM, obviously.
> 
> I would be tempted to start mitigating it with the following:
> 
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index dde06c0f97f3..8063525bf3dd 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -806,17 +806,19 @@ static void armv8pmu_disable_event(struct perf_event *event)
>  
>  static void armv8pmu_start(struct arm_pmu *cpu_pmu)
>  {
> -	struct perf_event_context *ctx;
> -	int nr_user = 0;
> +	if (sysctl_perf_user_access) {
> +		struct perf_event_context *ctx;
> +		int nr_user = 0;
>  
> -	ctx = perf_cpu_task_ctx();
> -	if (ctx)
> -		nr_user = ctx->nr_user;
> +		ctx = perf_cpu_task_ctx();
> +		if (ctx)
> +			nr_user = ctx->nr_user;
>  
> -	if (sysctl_perf_user_access && nr_user)
> -		armv8pmu_enable_user_access(cpu_pmu);
> -	else
> -		armv8pmu_disable_user_access();
> +		if (nr_user)
> +			armv8pmu_enable_user_access(cpu_pmu);
> +		else
> +			armv8pmu_disable_user_access();
> +	}
>  
>  	/* Enable all counters */
>  	armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E);
> 
> but that's obviously not enough as we want it to work with EL0 access
> enabled on the host as well.
> 
> What we miss is something that tells the PMU code "we're in a context
> where host userspace isn't present", and this would be completely
> skipped, relying on KVM to restore the appropriate state on
> vcpu_put(). But then the IPI stuff that controls EL0 can always come
> in and wreck things. Gahhh...

AFAICT the perf code only writes to PMUSERENR_EL0 in contexts where IRQs (and
hence preemption) are disabled, so as long as we have a shadow of the host
PMUSERENR value somewhere, I think we can update the perf code with something
like the below?

... unless the KVM code is interruptible before saving the host value, or after
restoring it?

Thanks,
Mark.

---->8----
diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
index dde06c0f97f3e..bdab3d5cbb5e3 100644
--- a/arch/arm64/kernel/perf_event.c
+++ b/arch/arm64/kernel/perf_event.c
@@ -741,11 +741,26 @@ static inline u32 armv8pmu_getreset_flags(void)
        return value;
 }
 
-static void armv8pmu_disable_user_access(void)
+static void update_pmuserenr(u64 val)
 {
+       lockdep_assert_irqs_disabled();
+
+       if (IS_ENABLED(CONFIG_KVM)) {
+               struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
+               if (vcpu) {
+                       vcpu->arch.pmuserenr_host = val;
+                       return;
+               }
+       }
+
        write_sysreg(0, pmuserenr_el0);
 }
 
+static void armv8pmu_disable_user_access(void)
+{
+       update_pmuserenr(0);
+}
+
 static void armv8pmu_enable_user_access(struct arm_pmu *cpu_pmu)
 {
        int i;
@@ -759,8 +774,7 @@ static void armv8pmu_enable_user_access(struct arm_pmu *cpu_pmu)
                        armv8pmu_write_evcntr(i, 0);
        }
 
-       write_sysreg(0, pmuserenr_el0);
-       write_sysreg(ARMV8_PMU_USERENR_ER | ARMV8_PMU_USERENR_CR, pmuserenr_el0);
+       update_pmuserenr(ARMV8_PMU_USERENR_ER | ARMV8_PMU_USERENR_CR);
 }
 
 static void armv8pmu_enable_event(struct perf_event *event)
Reiji Watanabe April 6, 2023, 2:28 a.m. UTC | #4
On Tue, Apr 04, 2023 at 03:25:27PM +0100, Mark Rutland wrote:
> On Wed, Mar 29, 2023 at 01:03:18PM +0100, Marc Zyngier wrote:
> > On Wed, 29 Mar 2023 01:21:36 +0100,
> > Reiji Watanabe <reijiw@google.com> wrote:
> > > 
> > > Currently, with VHE, KVM sets ER, CR, SW and EN bits of
> > > PMUSERENR_EL0 to 1 on vcpu_load().  So, if the value of those bits
> > > are cleared after vcpu_load() (the perf subsystem would do when PMU
> > > counters are programmed for the guest), PMU access from the guest EL0
> > > might be trapped to the guest EL1 directly regardless of the current
> > > PMUSERENR_EL0 value of the vCPU.
> > 
> > + RobH.
> > 
> > Is that what is done when the event is created and armv8pmu_start()
> > called? This is... crap. The EL0 access thing breaks everything, and
> > nobody tested it with KVM, obviously.
> > 
> > I would be tempted to start mitigating it with the following:
> > 
> > diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> > index dde06c0f97f3..8063525bf3dd 100644
> > --- a/arch/arm64/kernel/perf_event.c
> > +++ b/arch/arm64/kernel/perf_event.c
> > @@ -806,17 +806,19 @@ static void armv8pmu_disable_event(struct perf_event *event)
> >  
> >  static void armv8pmu_start(struct arm_pmu *cpu_pmu)
> >  {
> > -	struct perf_event_context *ctx;
> > -	int nr_user = 0;
> > +	if (sysctl_perf_user_access) {
> > +		struct perf_event_context *ctx;
> > +		int nr_user = 0;
> >  
> > -	ctx = perf_cpu_task_ctx();
> > -	if (ctx)
> > -		nr_user = ctx->nr_user;
> > +		ctx = perf_cpu_task_ctx();
> > +		if (ctx)
> > +			nr_user = ctx->nr_user;
> >  
> > -	if (sysctl_perf_user_access && nr_user)
> > -		armv8pmu_enable_user_access(cpu_pmu);
> > -	else
> > -		armv8pmu_disable_user_access();
> > +		if (nr_user)
> > +			armv8pmu_enable_user_access(cpu_pmu);
> > +		else
> > +			armv8pmu_disable_user_access();
> > +	}
> >  
> >  	/* Enable all counters */
> >  	armv8pmu_pmcr_write(armv8pmu_pmcr_read() | ARMV8_PMU_PMCR_E);
> > 
> > but that's obviously not enough as we want it to work with EL0 access
> > enabled on the host as well.
> > 
> > What we miss is something that tells the PMU code "we're in a context
> > where host userspace isn't present", and this would be completely
> > skipped, relying on KVM to restore the appropriate state on
> > vcpu_put(). But then the IPI stuff that controls EL0 can always come
> > in and wreck things. Gahhh...
> 
> AFAICT the perf code only writes to PMUSERENR_EL0 in contexts where IRQs (and
> hence preemption) are disabled, so as long as we have a shadow of the host
> PMUSERENR value somewhere, I think we can update the perf code with something
> like the below?
> 
> ... unless the KVM code is interruptible before saving the host value, or after
> restoring it?

Thank you for the suggestion.
I will update my patch based on it.

Thank you,
Reiji


> 
> Thanks,
> Mark.
> 
> ---->8----
> diff --git a/arch/arm64/kernel/perf_event.c b/arch/arm64/kernel/perf_event.c
> index dde06c0f97f3e..bdab3d5cbb5e3 100644
> --- a/arch/arm64/kernel/perf_event.c
> +++ b/arch/arm64/kernel/perf_event.c
> @@ -741,11 +741,26 @@ static inline u32 armv8pmu_getreset_flags(void)
>         return value;
>  }
>  
> -static void armv8pmu_disable_user_access(void)
> +static void update_pmuserenr(u64 val)
>  {
> +       lockdep_assert_irqs_disabled();
> +
> +       if (IS_ENABLED(CONFIG_KVM)) {
> +               struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> +               if (vcpu) {
> +                       vcpu->arch.pmuserenr_host = val;
> +                       return;
> +               }
> +       }
> +
>         write_sysreg(0, pmuserenr_el0);
>  }
>  
> +static void armv8pmu_disable_user_access(void)
> +{
> +       update_pmuserenr(0);
> +}
> +
>  static void armv8pmu_enable_user_access(struct arm_pmu *cpu_pmu)
>  {
>         int i;
> @@ -759,8 +774,7 @@ static void armv8pmu_enable_user_access(struct arm_pmu *cpu_pmu)
>                         armv8pmu_write_evcntr(i, 0);
>         }
>  
> -       write_sysreg(0, pmuserenr_el0);
> -       write_sysreg(ARMV8_PMU_USERENR_ER | ARMV8_PMU_USERENR_CR, pmuserenr_el0);
> +       update_pmuserenr(ARMV8_PMU_USERENR_ER | ARMV8_PMU_USERENR_CR);
>  }
>  
>  static void armv8pmu_enable_event(struct perf_event *event)
>
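
A minimal sketch of the KVM-side counterpart Mark's suggestion implies
(the pmuserenr_host field is the one from his diff above; the helper
names and call sites here are assumptions, not the final patch):

/*
 * While the vCPU is loaded (kvm_get_running_vcpu() != NULL), the perf
 * code above updates the shadow instead of the hardware register, so
 * KVM restores the shadow (rather than a value snapshotted just before
 * guest entry) when it hands the CPU back to the host.
 */
static void kvm_vcpu_load_pmuserenr(struct kvm_vcpu *vcpu)	/* vcpu_load() path */
{
	if (kvm_arm_support_pmu_v3())
		vcpu->arch.pmuserenr_host = read_sysreg(pmuserenr_el0);
}

static void kvm_vcpu_put_pmuserenr(struct kvm_vcpu *vcpu)	/* vcpu_put() path */
{
	if (kvm_arm_support_pmu_v3())
		write_sysreg(vcpu->arch.pmuserenr_host, pmuserenr_el0);
}

The guest-entry write of ARMV8_PMU_USERENR_MASK in switch.h would remain
as in the patch below.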

Patch

diff --git a/arch/arm64/kvm/hyp/include/hyp/switch.h b/arch/arm64/kvm/hyp/include/hyp/switch.h
index 44b84fbdde0d..7d39882d8a73 100644
--- a/arch/arm64/kvm/hyp/include/hyp/switch.h
+++ b/arch/arm64/kvm/hyp/include/hyp/switch.h
@@ -74,18 +74,6 @@  static inline void __activate_traps_common(struct kvm_vcpu *vcpu)
 	/* Trap on AArch32 cp15 c15 (impdef sysregs) accesses (EL1 or EL0) */
 	write_sysreg(1 << 15, hstr_el2);
 
-	/*
-	 * Make sure we trap PMU access from EL0 to EL2. Also sanitize
-	 * PMSELR_EL0 to make sure it never contains the cycle
-	 * counter, which could make a PMXEVCNTR_EL0 access UNDEF at
-	 * EL1 instead of being trapped to EL2.
-	 */
-	if (kvm_arm_support_pmu_v3()) {
-		write_sysreg(0, pmselr_el0);
-		vcpu->arch.host_pmuserenr_el0 = read_sysreg(pmuserenr_el0);
-		write_sysreg(ARMV8_PMU_USERENR_MASK, pmuserenr_el0);
-	}
-
 	vcpu->arch.mdcr_el2_host = read_sysreg(mdcr_el2);
 	write_sysreg(vcpu->arch.mdcr_el2, mdcr_el2);
 
@@ -106,8 +94,6 @@  static inline void __deactivate_traps_common(struct kvm_vcpu *vcpu)
 	write_sysreg(vcpu->arch.mdcr_el2_host, mdcr_el2);
 
 	write_sysreg(0, hstr_el2);
-	if (kvm_arm_support_pmu_v3())
-		write_sysreg(vcpu->arch.host_pmuserenr_el0, pmuserenr_el0);
 
 	if (cpus_have_final_cap(ARM64_SME)) {
 		sysreg_clear_set_s(SYS_HFGRTR_EL2, 0,
@@ -130,6 +116,18 @@  static inline void ___activate_traps(struct kvm_vcpu *vcpu)
 
 	if (cpus_have_final_cap(ARM64_HAS_RAS_EXTN) && (hcr & HCR_VSE))
 		write_sysreg_s(vcpu->arch.vsesr_el2, SYS_VSESR_EL2);
+
+	/*
+	 * Make sure we trap PMU access from EL0 to EL2. Also sanitize
+	 * PMSELR_EL0 to make sure it never contains the cycle
+	 * counter, which could make a PMXEVCNTR_EL0 access UNDEF at
+	 * EL1 instead of being trapped to EL2.
+	 */
+	if (kvm_arm_support_pmu_v3()) {
+		write_sysreg(0, pmselr_el0);
+		vcpu->arch.host_pmuserenr_el0 = read_sysreg(pmuserenr_el0);
+		write_sysreg(ARMV8_PMU_USERENR_MASK, pmuserenr_el0);
+	}
 }
 
 static inline void ___deactivate_traps(struct kvm_vcpu *vcpu)
@@ -144,6 +142,9 @@  static inline void ___deactivate_traps(struct kvm_vcpu *vcpu)
 		vcpu->arch.hcr_el2 &= ~HCR_VSE;
 		vcpu->arch.hcr_el2 |= read_sysreg(hcr_el2) & HCR_VSE;
 	}
+
+	if (kvm_arm_support_pmu_v3())
+		write_sysreg(vcpu->arch.host_pmuserenr_el0, pmuserenr_el0);
 }
 
 static inline bool __populate_fault_info(struct kvm_vcpu *vcpu)