diff mbox series

[2/5] kvm/arm64: rework guest entry logic

Message ID 20220111153539.2532246-3-mark.rutland@arm.com (mailing list archive)
State New, archived
Series kvm: fix latent guest entry/exit bugs

Commit Message

Mark Rutland Jan. 11, 2022, 3:35 p.m. UTC
In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state
(EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to
exiting the EQS by calling guest_exit(). As the IRQ entry code will not
wake RCU in this case, we may run the core IRQ code and IRQ handler
without RCU watching, leading to various potential problems.

Additionally, we do not inform lockdep or tracing that interrupts will
be enabled during guest execution, which can lead to misleading traces
and warnings that interrupts have been enabled for overly-long periods.

This patch fixes these issues by using the new timing and context
entry/exit helpers to ensure that interrupts are handled during guest
vtime but with RCU watching, with a sequence:

	guest_timing_enter_irqoff();

	exit_to_guest_mode();
	< run the vcpu >
	enter_from_guest_mode();

	< take any pending IRQs >

	guest_timing_exit_irqoff();

Since instrumentation may make use of RCU, we must also ensure that no
instrumented code is run during the EQS. I've split out the critical
section into a new kvm_arm_vcpu_enter_exit() helper which is marked
noinstr.

Fixes: 1b3d546daf85ed2b ("arm/arm64: KVM: Properly account for guest CPU time")
Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Alexandru Elisei <alexandru.elisei@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Frederic Weisbecker <frederic@kernel.org>
Cc: James Morse <james.morse@arm.com>
Cc: Marc Zyngier <maz@kernel.org>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/kvm/arm.c | 51 ++++++++++++++++++++++++++++----------------
 1 file changed, 33 insertions(+), 18 deletions(-)
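
For orientation, the run-loop ordering that results from this patch, condensed
from the diff at the bottom of this page (a sketch with added annotations, not
the literal arch/arm64/kvm/arm.c code):

	guest_timing_enter_irqoff();	/* vtime: guest accounting starts */

	/* noinstr section: kvm_arm_vcpu_enter_exit() */
	exit_to_guest_mode();		/* enter the RCU EQS */
	ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
	enter_from_guest_mode();	/* leave the EQS; RCU is watching again */

	/* take any pending IRQs with RCU watching, still in guest time */
	local_irq_enable();
	isb();				/* ensure pending IRQs are taken */
	local_irq_disable();

	guest_timing_exit_irqoff();	/* vtime: back to host accounting */

	local_irq_enable();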

Comments

Marc Zyngier Jan. 11, 2022, 5:55 p.m. UTC | #1
On Tue, 11 Jan 2022 15:35:36 +0000,
Mark Rutland <mark.rutland@arm.com> wrote:
> 
> In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state
> (EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to
> exiting the EQS by calling guest_exit(). As the IRQ entry code will not
> wake RCU in this case, we may run the core IRQ code and IRQ handler
> without RCU watching, leading to various potential problems.
> 
> Additionally, we do not inform lockdep or tracing that interrupts will
> be enabled during guest execution, which can lead to misleading traces
> and warnings that interrupts have been enabled for overly-long periods.
> 
> This patch fixes these issues by using the new timing and context
> entry/exit helpers to ensure that interrupts are handled during guest
> vtime but with RCU watching, with a sequence:
> 
> 	guest_timing_enter_irqoff();
> 
> 	exit_to_guest_mode();
> 	< run the vcpu >
> 	enter_from_guest_mode();
> 
> 	< take any pending IRQs >
> 
> 	guest_timing_exit_irqoff();
> 
> Since instrumentation may make use of RCU, we must also ensure that no
> instrumented code is run during the EQS. I've split out the critical
> section into a new kvm_arm_vcpu_enter_exit() helper which is marked
> noinstr.
> 
> Fixes: 1b3d546daf85ed2b ("arm/arm64: KVM: Properly account for guest CPU time")
> Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> Cc: Alexandru Elisei <alexandru.elisei@arm.com>
> Cc: Catalin Marinas <catalin.marinas@arm.com>
> Cc: Frederic Weisbecker <frederic@kernel.org>
> Cc: James Morse <james.morse@arm.com>
> Cc: Marc Zyngier <maz@kernel.org>
> Cc: Paolo Bonzini <pbonzini@redhat.com>
> Cc: Paul E. McKenney <paulmck@kernel.org>
> Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> Cc: Will Deacon <will@kernel.org>
> ---
>  arch/arm64/kvm/arm.c | 51 ++++++++++++++++++++++++++++----------------
>  1 file changed, 33 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index e4727dc771bf..1721df2522c8 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -764,6 +764,24 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
>  			xfer_to_guest_mode_work_pending();
>  }
>  
> +/*
> + * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
> + * the vCPU is running.
> + *
> + * This must be noinstr as instrumentation may make use of RCU, and this is not
> + * safe during the EQS.
> + */
> +static int noinstr kvm_arm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
> +{
> +	int ret;
> +
> +	exit_to_guest_mode();
> +	ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
> +	enter_from_guest_mode();
> +
> +	return ret;
> +}
> +
>  /**
>   * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
>   * @vcpu:	The VCPU pointer
> @@ -854,9 +872,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  		 * Enter the guest
>  		 */
>  		trace_kvm_entry(*vcpu_pc(vcpu));
> -		guest_enter_irqoff();
> +		guest_timing_enter_irqoff();
>  
> -		ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
> +		ret = kvm_arm_vcpu_enter_exit(vcpu);
>  
>  		vcpu->mode = OUTSIDE_GUEST_MODE;
>  		vcpu->stat.exits++;
> @@ -891,26 +909,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  		kvm_arch_vcpu_ctxsync_fp(vcpu);
>  
>  		/*
> -		 * We may have taken a host interrupt in HYP mode (ie
> -		 * while executing the guest). This interrupt is still
> -		 * pending, as we haven't serviced it yet!
> +		 * We must ensure that any pending interrupts are taken before
> +		 * we exit guest timing so that timer ticks are accounted as
> +		 * guest time. Transiently unmask interrupts so that any
> +		 * pending interrupts are taken.
>  		 *
> -		 * We're now back in SVC mode, with interrupts
> -		 * disabled.  Enabling the interrupts now will have
> -		 * the effect of taking the interrupt again, in SVC
> -		 * mode this time.
> +		 * Per ARM DDI 0487G.b section D1.13.4, an ISB (or other
> +		 * context synchronization event) is necessary to ensure that
> +		 * pending interrupts are taken.
>  		 */
>  		local_irq_enable();
> +		isb();
> +		local_irq_disable();

Small nit: we may be able to elide this enable/isb/disable dance if a
read of ISR_EL1 returns 0.
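
(For illustration, a minimal sketch of the suggested elision. This is
hypothetical, and as the follow-up below notes, a bare ISR_EL1 check is not
sufficient on its own when GIC priority masking is in use:)

	/* hypothetical: only do the unmask/isb/mask dance when an IRQ is pending at the PE */
	if (read_sysreg(isr_el1)) {
		local_irq_enable();
		isb();
		local_irq_disable();
	}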

> +
> +		guest_timing_exit_irqoff();
> +
> +		local_irq_enable();
>  
> -		/*
> -		 * We do local_irq_enable() before calling guest_exit() so
> -		 * that if a timer interrupt hits while running the guest we
> -		 * account that tick as being spent in the guest.  We enable
> -		 * preemption after calling guest_exit() so that if we get
> -		 * preempted we make sure ticks after that is not counted as
> -		 * guest time.
> -		 */
> -		guest_exit();
>  		trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
>  
>  		/* Exit types that need handling before we can be preempted */

Reviewed-by: Marc Zyngier <maz@kernel.org>

	M.
Mark Rutland Jan. 13, 2022, 11:17 a.m. UTC | #2
On Tue, Jan 11, 2022 at 05:55:20PM +0000, Marc Zyngier wrote:
> On Tue, 11 Jan 2022 15:35:36 +0000,
> Mark Rutland <mark.rutland@arm.com> wrote:
> > 
> > In kvm_arch_vcpu_ioctl_run() we enter an RCU extended quiescent state
> > (EQS) by calling guest_enter_irqoff(), and unmask IRQs prior to
> > exiting the EQS by calling guest_exit(). As the IRQ entry code will not
> > wake RCU in this case, we may run the core IRQ code and IRQ handler
> > without RCU watching, leading to various potential problems.
> > 
> > Additionally, we do not inform lockdep or tracing that interrupts will
> > be enabled during guest execution, which can lead to misleading traces
> > and warnings that interrupts have been enabled for overly-long periods.
> > 
> > This patch fixes these issues by using the new timing and context
> > entry/exit helpers to ensure that interrupts are handled during guest
> > vtime but with RCU watching, with a sequence:
> > 
> > 	guest_timing_enter_irqoff();
> > 
> > 	exit_to_guest_mode();
> > 	< run the vcpu >
> > 	enter_from_guest_mode();
> > 
> > 	< take any pending IRQs >
> > 
> > 	guest_timing_exit_irqoff();
> > 
> > Since instrumentation may make use of RCU, we must also ensure that no
> > instrumented code is run during the EQS. I've split out the critical
> > section into a new kvm_arm_vcpu_enter_exit() helper which is marked
> > noinstr.
> > 
> > Fixes: 1b3d546daf85ed2b ("arm/arm64: KVM: Properly account for guest CPU time")
> > Reported-by: Nicolas Saenz Julienne <nsaenzju@redhat.com>
> > Signed-off-by: Mark Rutland <mark.rutland@arm.com>
> > Cc: Alexandru Elisei <alexandru.elisei@arm.com>
> > Cc: Catalin Marinas <catalin.marinas@arm.com>
> > Cc: Frederic Weisbecker <frederic@kernel.org>
> > Cc: James Morse <james.morse@arm.com>
> > Cc: Marc Zyngier <maz@kernel.org>
> > Cc: Paolo Bonzini <pbonzini@redhat.com>
> > Cc: Paul E. McKenney <paulmck@kernel.org>
> > Cc: Suzuki K Poulose <suzuki.poulose@arm.com>
> > Cc: Will Deacon <will@kernel.org>
> > ---
> >  arch/arm64/kvm/arm.c | 51 ++++++++++++++++++++++++++++----------------
> >  1 file changed, 33 insertions(+), 18 deletions(-)
> > 
> > diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> > index e4727dc771bf..1721df2522c8 100644
> > --- a/arch/arm64/kvm/arm.c
> > +++ b/arch/arm64/kvm/arm.c
> > @@ -764,6 +764,24 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
> >  			xfer_to_guest_mode_work_pending();
> >  }
> >  
> > +/*
> > + * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
> > + * the vCPU is running.
> > + *
> > + * This must be noinstr as instrumentation may make use of RCU, and this is not
> > + * safe during the EQS.
> > + */
> > +static int noinstr kvm_arm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
> > +{
> > +	int ret;
> > +
> > +	exit_to_guest_mode();
> > +	ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
> > +	enter_from_guest_mode();
> > +
> > +	return ret;
> > +}
> > +
> >  /**
> >   * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
> >   * @vcpu:	The VCPU pointer
> > @@ -854,9 +872,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> >  		 * Enter the guest
> >  		 */
> >  		trace_kvm_entry(*vcpu_pc(vcpu));
> > -		guest_enter_irqoff();
> > +		guest_timing_enter_irqoff();
> >  
> > -		ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
> > +		ret = kvm_arm_vcpu_enter_exit(vcpu);
> >  
> >  		vcpu->mode = OUTSIDE_GUEST_MODE;
> >  		vcpu->stat.exits++;
> > @@ -891,26 +909,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> >  		kvm_arch_vcpu_ctxsync_fp(vcpu);
> >  
> >  		/*
> > -		 * We may have taken a host interrupt in HYP mode (ie
> > -		 * while executing the guest). This interrupt is still
> > -		 * pending, as we haven't serviced it yet!
> > +		 * We must ensure that any pending interrupts are taken before
> > +		 * we exit guest timing so that timer ticks are accounted as
> > +		 * guest time. Transiently unmask interrupts so that any
> > +		 * pending interrupts are taken.
> >  		 *
> > -		 * We're now back in SVC mode, with interrupts
> > -		 * disabled.  Enabling the interrupts now will have
> > -		 * the effect of taking the interrupt again, in SVC
> > -		 * mode this time.
> > +		 * Per ARM DDI 0487G.b section D1.13.4, an ISB (or other
> > +		 * context synchronization event) is necessary to ensure that
> > +		 * pending interrupts are taken.
> >  		 */
> >  		local_irq_enable();
> > +		isb();
> > +		local_irq_disable();
> 
> Small nit: we may be able to elide this enable/isb/disable dance if a
> read of ISR_EL1 returns 0.

Wouldn't that be broken when using GIC priority masking, since that can
prevent IRQs being signalled to the PE?

I'm happy to rework this, but I'll need to think a bit harder about it. Would
you be happy if we did that as a follow-up?

I suspect we'll want to split that out into a helper, e.g.

static __always_inline void handle_pending_host_irqs(void)
{
	/*
	 * TODO: explain PMR masking / signalling here
	 */
	if (!system_uses_irq_prio_masking() &&
	    !read_sysreg(isr_el1))
		return;
	
	local_irq_enable();
	isb();
	local_irq_disable();
}

> 
> > +
> > +		guest_timing_exit_irqoff();
> > +
> > +		local_irq_enable();
> >  
> > -		/*
> > -		 * We do local_irq_enable() before calling guest_exit() so
> > -		 * that if a timer interrupt hits while running the guest we
> > -		 * account that tick as being spent in the guest.  We enable
> > -		 * preemption after calling guest_exit() so that if we get
> > -		 * preempted we make sure ticks after that is not counted as
> > -		 * guest time.
> > -		 */
> > -		guest_exit();
> >  		trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
> >  
> >  		/* Exit types that need handling before we can be preempted */
> 
> Reviewed-by: Marc Zyngier <maz@kernel.org>

Thanks!

Mark.
Marc Zyngier Jan. 13, 2022, 11:43 a.m. UTC | #3
On Thu, 13 Jan 2022 11:17:53 +0000,
Mark Rutland <mark.rutland@arm.com> wrote:
> 
> On Tue, Jan 11, 2022 at 05:55:20PM +0000, Marc Zyngier wrote:
> > On Tue, 11 Jan 2022 15:35:36 +0000,
> > Mark Rutland <mark.rutland@arm.com> wrote:

[...]

> > > @@ -891,26 +909,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> > >  		kvm_arch_vcpu_ctxsync_fp(vcpu);
> > >  
> > >  		/*
> > > -		 * We may have taken a host interrupt in HYP mode (ie
> > > -		 * while executing the guest). This interrupt is still
> > > -		 * pending, as we haven't serviced it yet!
> > > +		 * We must ensure that any pending interrupts are taken before
> > > +		 * we exit guest timing so that timer ticks are accounted as
> > > +		 * guest time. Transiently unmask interrupts so that any
> > > +		 * pending interrupts are taken.
> > >  		 *
> > > -		 * We're now back in SVC mode, with interrupts
> > > -		 * disabled.  Enabling the interrupts now will have
> > > -		 * the effect of taking the interrupt again, in SVC
> > > -		 * mode this time.
> > > +		 * Per ARM DDI 0487G.b section D1.13.4, an ISB (or other
> > > +		 * context synchronization event) is necessary to ensure that
> > > +		 * pending interrupts are taken.
> > >  		 */
> > >  		local_irq_enable();
> > > +		isb();
> > > +		local_irq_disable();
> > 
> > Small nit: we may be able to elide this enable/isb/disable dance if a
> > read of ISR_EL1 returns 0.
> 
> Wouldn't that be broken when using GIC priority masking, since that
> can prevent IRQs being signalled to the PE?

You're right. But this can be made even simpler. We already know if
we've exited the guest because of an IRQ (ret tells us that), and
that's true whether we're using priority masking or not. It could be
as simple as:

	if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ) {
		// We exited because of an interrupt. Let's take
		// it now to account timer ticks to the guest.
		local_irq_enable();
		isb();
		local_irq_disable();
	}

and that would avoid accounting the interrupt to the guest if it fired
after the exit took place.

> I'm happy to rework this, but I'll need to think a bit harder about
> it. Would you be happy if we did that as a follow-up?

Oh, absolutely. I want the flow to be correct before we make it
fast(-ish).

Thanks,

	M.
Mark Rutland Jan. 13, 2022, 12:58 p.m. UTC | #4
On Thu, Jan 13, 2022 at 11:43:30AM +0000, Marc Zyngier wrote:
> On Thu, 13 Jan 2022 11:17:53 +0000,
> Mark Rutland <mark.rutland@arm.com> wrote:
> > 
> > On Tue, Jan 11, 2022 at 05:55:20PM +0000, Marc Zyngier wrote:
> > > On Tue, 11 Jan 2022 15:35:36 +0000,
> > > Mark Rutland <mark.rutland@arm.com> wrote:
> 
> [...]
> 
> > > > @@ -891,26 +909,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
> > > >  		kvm_arch_vcpu_ctxsync_fp(vcpu);
> > > >  
> > > >  		/*
> > > > -		 * We may have taken a host interrupt in HYP mode (ie
> > > > -		 * while executing the guest). This interrupt is still
> > > > -		 * pending, as we haven't serviced it yet!
> > > > +		 * We must ensure that any pending interrupts are taken before
> > > > +		 * we exit guest timing so that timer ticks are accounted as
> > > > +		 * guest time. Transiently unmask interrupts so that any
> > > > +		 * pending interrupts are taken.
> > > >  		 *
> > > > -		 * We're now back in SVC mode, with interrupts
> > > > -		 * disabled.  Enabling the interrupts now will have
> > > > -		 * the effect of taking the interrupt again, in SVC
> > > > -		 * mode this time.
> > > > +		 * Per ARM DDI 0487G.b section D1.13.4, an ISB (or other
> > > > +		 * context synchronization event) is necessary to ensure that
> > > > +		 * pending interrupts are taken.
> > > >  		 */
> > > >  		local_irq_enable();
> > > > +		isb();
> > > > +		local_irq_disable();
> > > 
> > > Small nit: we may be able to elide this enable/isb/disable dance if a
> > > read of ISR_EL1 returns 0.
> > 
> > Wouldn't that be broken when using GIC priority masking, since that
> > can prevent IRQs being signalled to the PE?
> 
> You're right. But this can be made even simpler. We already know if
> we've exited the guest because of an IRQ (ret tells us that), and
> that's true whether we're using priority masking or not. It could be
> as simple as:
> 
> 	if (ARM_EXCEPTION_CODE(ret) == ARM_EXCEPTION_IRQ) {
> 		// We exited because of an interrupt. Let's take
> 		// it now to account timer ticks to the guest.
> 		local_irq_enable();
> 		isb();
> 		local_irq_disable();
> 	}
> 
> and that would avoid accounting the interrupt to the guest if it fired
> after the exit took place.
> 
> > I'm happy to rework this, but I'll need to think a bit harder about
> > it. Would you be happy if we did that as a follow-up?
> 
> Oh, absolutely. I want the flow to be correct before we make it
> fast(-ish).

Cool; I'll leave that for now on the assumption we'll address that with a
follow-up patch, though your suggestion above looks "obviously" correct to me.

Thanks,
Mark.

Patch

diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index e4727dc771bf..1721df2522c8 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -764,6 +764,24 @@ static bool kvm_vcpu_exit_request(struct kvm_vcpu *vcpu, int *ret)
 			xfer_to_guest_mode_work_pending();
 }
 
+/*
+ * Actually run the vCPU, entering an RCU extended quiescent state (EQS) while
+ * the vCPU is running.
+ *
+ * This must be noinstr as instrumentation may make use of RCU, and this is not
+ * safe during the EQS.
+ */
+static int noinstr kvm_arm_vcpu_enter_exit(struct kvm_vcpu *vcpu)
+{
+	int ret;
+
+	exit_to_guest_mode();
+	ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
+	enter_from_guest_mode();
+
+	return ret;
+}
+
 /**
  * kvm_arch_vcpu_ioctl_run - the main VCPU run function to execute guest code
  * @vcpu:	The VCPU pointer
@@ -854,9 +872,9 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		 * Enter the guest
 		 */
 		trace_kvm_entry(*vcpu_pc(vcpu));
-		guest_enter_irqoff();
+		guest_timing_enter_irqoff();
 
-		ret = kvm_call_hyp_ret(__kvm_vcpu_run, vcpu);
+		ret = kvm_arm_vcpu_enter_exit(vcpu);
 
 		vcpu->mode = OUTSIDE_GUEST_MODE;
 		vcpu->stat.exits++;
@@ -891,26 +909,23 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		kvm_arch_vcpu_ctxsync_fp(vcpu);
 
 		/*
-		 * We may have taken a host interrupt in HYP mode (ie
-		 * while executing the guest). This interrupt is still
-		 * pending, as we haven't serviced it yet!
+		 * We must ensure that any pending interrupts are taken before
+		 * we exit guest timing so that timer ticks are accounted as
+		 * guest time. Transiently unmask interrupts so that any
+		 * pending interrupts are taken.
 		 *
-		 * We're now back in SVC mode, with interrupts
-		 * disabled.  Enabling the interrupts now will have
-		 * the effect of taking the interrupt again, in SVC
-		 * mode this time.
+		 * Per ARM DDI 0487G.b section D1.13.4, an ISB (or other
+		 * context synchronization event) is necessary to ensure that
+		 * pending interrupts are taken.
 		 */
 		local_irq_enable();
+		isb();
+		local_irq_disable();
+
+		guest_timing_exit_irqoff();
+
+		local_irq_enable();
 
-		/*
-		 * We do local_irq_enable() before calling guest_exit() so
-		 * that if a timer interrupt hits while running the guest we
-		 * account that tick as being spent in the guest.  We enable
-		 * preemption after calling guest_exit() so that if we get
-		 * preempted we make sure ticks after that is not counted as
-		 * guest time.
-		 */
-		guest_exit();
 		trace_kvm_exit(ret, kvm_vcpu_trap_get_class(vcpu), *vcpu_pc(vcpu));
 
 		/* Exit types that need handling before we can be preempted */