
[4/4] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device

Message ID 20210510172818.025080848@redhat.com (mailing list archive)
State New, archived
Series VMX: configure posted interrupt descriptor when assigning device (v3)

Commit Message

Marcelo Tosatti May 10, 2021, 5:26 p.m. UTC
For VMX, when a vCPU enters HLT emulation, pi_post_block will:

1) Add the vCPU to the per-CPU list of blocked vCPUs.

2) Program the posted-interrupt descriptor's "notification vector"
to POSTED_INTR_WAKEUP_VECTOR.

With interrupt remapping, a device interrupt will set the PIR bit for
the vector programmed for the device on the CPU, test-and-set the
ON bit of the posted-interrupt descriptor, and, if the ON bit was
previously clear, generate an interrupt for the notification vector.

This way, the target pCPU wakes up on a device interrupt and in turn
wakes up the target vCPU.

The problem is that pi_post_block only programs the notification
vector if kvm_arch_has_assigned_device() is true. It's possible for
the following to happen:

1) vCPU V HLTs on pCPU P; kvm_arch_has_assigned_device() is false, so
the notification vector is not programmed.
2) A device is assigned to the VM.
3) The device interrupts vCPU V and sets the ON bit (the notification
vector is not programmed, so pCPU P remains idle).
4) vCPU 0 IPIs vCPU V (in the guest), but since the PI descriptor's ON
bit is set, kvm_vcpu_kick() is skipped.
5) vCPU 0 busy-spins on vCPU V's response for several seconds, until
the RCU watchdog NMIs all vCPUs.

To fix this, use the start_assignment kvm_x86_ops callback to kick
vCPUs out of the halt loop, so that the notification vector is
properly reprogrammed to the wakeup vector.

Reported-by: Pei Zhang <pezhang@redhat.com>
Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>

Comments

Paolo Bonzini May 24, 2021, 3:55 p.m. UTC | #1
On 10/05/21 19:26, Marcelo Tosatti wrote:
> +void vmx_pi_start_assignment(struct kvm *kvm)
> +{
> +	struct kvm_vcpu *vcpu;
> +	int i;
> +
> +	if (!irq_remapping_cap(IRQ_POSTING_CAP))
> +		return;
> +
> +	/*
> +	 * Wakeup will cause the vCPU to bail out of kvm_vcpu_block() and
> +	 * go back through vcpu_block().
> +	 */
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		if (!kvm_vcpu_apicv_active(vcpu))
> +			continue;
> +
> +		kvm_vcpu_wake_up(vcpu);

Would you still need the check_block callback, if you also added a 
kvm_make_request(KVM_REQ_EVENT)?

In fact, since this is entirely not a hot path, can you just do 
kvm_make_all_cpus_request(kvm, KVM_REQ_EVENT) instead of this loop?

Thanks,

Paolo

> +	}
> +}
>   
>   /*
>    * pi_update_irte - set IRTE for Posted-Interrupts
> Index: kvm/arch/x86/kvm/vmx/posted_intr.h
> ===================================================================
> --- kvm.orig/arch/x86/kvm/vmx/posted_intr.h
> +++ kvm/arch/x86/kvm/vmx/posted_intr.h
> @@ -95,5 +95,7 @@ void __init pi_init_cpu(int cpu);
>   bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
>   int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,
>   		   bool set);
> +void vmx_pi_start_assignment(struct kvm *kvm);
> +int vmx_vcpu_check_block(struct kvm_vcpu *vcpu);
>   
>   #endif /* __KVM_X86_VMX_POSTED_INTR_H */
> Index: kvm/arch/x86/kvm/vmx/vmx.c
> ===================================================================
> --- kvm.orig/arch/x86/kvm/vmx/vmx.c
> +++ kvm/arch/x86/kvm/vmx/vmx.c
> @@ -7727,13 +7727,13 @@ static struct kvm_x86_ops vmx_x86_ops __
>   
>   	.pre_block = vmx_pre_block,
>   	.post_block = vmx_post_block,
> -	.vcpu_check_block = NULL,
> +	.vcpu_check_block = vmx_vcpu_check_block,
>   
>   	.pmu_ops = &intel_pmu_ops,
>   	.nested_ops = &vmx_nested_ops,
>   
>   	.update_pi_irte = pi_update_irte,
> -	.start_assignment = NULL,
> +	.start_assignment = vmx_pi_start_assignment,
>   
>   #ifdef CONFIG_X86_64
>   	.set_hv_timer = vmx_set_hv_timer,
> 
>
Marcelo Tosatti May 24, 2021, 5:53 p.m. UTC | #2
On Mon, May 24, 2021 at 05:55:18PM +0200, Paolo Bonzini wrote:
> On 10/05/21 19:26, Marcelo Tosatti wrote:
> > +void vmx_pi_start_assignment(struct kvm *kvm)
> > +{
> > +	struct kvm_vcpu *vcpu;
> > +	int i;
> > +
> > +	if (!irq_remapping_cap(IRQ_POSTING_CAP))
> > +		return;
> > +
> > +	/*
> > +	 * Wakeup will cause the vCPU to bail out of kvm_vcpu_block() and
> > +	 * go back through vcpu_block().
> > +	 */
> > +	kvm_for_each_vcpu(i, vcpu, kvm) {
> > +		if (!kvm_vcpu_apicv_active(vcpu))
> > +			continue;
> > +
> > +		kvm_vcpu_wake_up(vcpu);
> 
> Would you still need the check_block callback, if you also added a
> kvm_make_request(KVM_REQ_EVENT)?
> 
> In fact, since this is entirely not a hot path, can you just do
> kvm_make_all_cpus_request(kvm, KVM_REQ_EVENT) instead of this loop?
> 
> Thanks,
> 
> Paolo

Hi Paolo,

Don't think so:

int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
{
        return kvm_vcpu_running(vcpu) || kvm_vcpu_has_events(vcpu);
}

static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
{
        int ret = -EINTR;
        int idx = srcu_read_lock(&vcpu->kvm->srcu);

        if (kvm_arch_vcpu_runnable(vcpu)) {
                kvm_make_request(KVM_REQ_UNHALT, vcpu);  <---- don't want KVM_REQ_UNHALT
                goto out;
        }
        if (kvm_cpu_has_pending_timer(vcpu))
                goto out;
        if (signal_pending(current))
                goto out;

        ret = 0;
out:
        srcu_read_unlock(&vcpu->kvm->srcu, idx);
        return ret;
}

See previous discussion:


Date: Wed, 12 May 2021 14:41:56 +0000
From: Sean Christopherson <seanjc@google.com>
To: Marcelo Tosatti <mtosatti@redhat.com>
Cc: Peter Xu <peterx@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, kvm@vger.kernel.org, Alex Williamson
        <alex.williamson@redhat.com>, Pei Zhang <pezhang@redhat.com>
Subject: Re: [patch 4/4] KVM: VMX: update vcpu posted-interrupt descriptor when assigning device

On Tue, May 11, 2021, Marcelo Tosatti wrote:
> > The KVM_REQ_UNBLOCK patch will resume execution even any such event
>
>                                                 even without any such event
>
> > occuring. So the behaviour would be different from baremetal.

I agree with Marcelo, we don't want to spuriously unhalt the vCPU.  It's legal,
albeit risky, to do something like

       	hlt
       	/* #UD to triple fault if this CPU is awakened. */
       	ud2

when offlining a CPU, in which case the spurious wake event will crash the guest.
Paolo Bonzini May 25, 2021, 11:58 a.m. UTC | #3
On 24/05/21 19:53, Marcelo Tosatti wrote:
> On Mon, May 24, 2021 at 05:55:18PM +0200, Paolo Bonzini wrote:
>> On 10/05/21 19:26, Marcelo Tosatti wrote:
>>> +void vmx_pi_start_assignment(struct kvm *kvm)
>>> +{
>>> +	struct kvm_vcpu *vcpu;
>>> +	int i;
>>> +
>>> +	if (!irq_remapping_cap(IRQ_POSTING_CAP))
>>> +		return;
>>> +
>>> +	/*
>>> +	 * Wakeup will cause the vCPU to bail out of kvm_vcpu_block() and
>>> +	 * go back through vcpu_block().
>>> +	 */
>>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>>> +		if (!kvm_vcpu_apicv_active(vcpu))
>>> +			continue;
>>> +
>>> +		kvm_vcpu_wake_up(vcpu);
>>
>> Would you still need the check_block callback, if you also added a
>> kvm_make_request(KVM_REQ_EVENT)?
>>
>> In fact, since this is entirely not a hot path, can you just do
>> kvm_make_all_cpus_request(kvm, KVM_REQ_EVENT) instead of this loop?
>>
>> Thanks,
>>
>> Paolo
> 
> Hi Paolo,
> 
> Don't think so:
> 
> static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
> {
>          int ret = -EINTR;
>          int idx = srcu_read_lock(&vcpu->kvm->srcu);
> 
>          if (kvm_arch_vcpu_runnable(vcpu)) {
>                  kvm_make_request(KVM_REQ_UNHALT, vcpu);  <---- don't want KVM_REQ_UNHALT

UNHALT is incorrect indeed, but requests don't have to unhalt the vCPU.

This case is somewhat similar to signal_pending(), where the next 
KVM_RUN ioctl resumes the halt.  It's also similar to 
KVM_REQ_PENDING_TIMER.  So you can:

- rename KVM_REQ_PENDING_TIMER to KVM_REQ_UNBLOCK except in 
arch/powerpc, where instead you add KVM_REQ_PENDING_TIMER to 
arch/powerpc/include/asm/kvm_host.h

- here, you add

	if (kvm_check_request(KVM_REQ_UNBLOCK, vcpu))
		goto out;

- then vmx_pi_start_assignment only needs to

	if (!irq_remapping_cap(IRQ_POSTING_CAP))
		return;
	kvm_make_all_cpus_request(kvm, KVM_REQ_UNBLOCK);

kvm_arch_vcpu_runnable() would still return false, so the mp_state would 
not change.

Paolo

Patch

Index: kvm/arch/x86/kvm/vmx/posted_intr.c
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/posted_intr.c
+++ kvm/arch/x86/kvm/vmx/posted_intr.c
@@ -204,6 +204,32 @@  void pi_post_block(struct kvm_vcpu *vcpu
 }
 
 /*
+ * Bail out of the block loop if the VM has an assigned
+ * device, but the blocking vCPU didn't reconfigure the
+ * PI.NV to the wakeup vector, i.e. the assigned device
+ * came along after the initial check in vcpu_block().
+ */
+
+int vmx_vcpu_check_block(struct kvm_vcpu *vcpu)
+{
+	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+
+	if (!irq_remapping_cap(IRQ_POSTING_CAP))
+		return 0;
+
+	if (!kvm_vcpu_apicv_active(vcpu))
+		return 0;
+
+	if (!kvm_arch_has_assigned_device(vcpu->kvm))
+		return 0;
+
+	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR)
+		return 0;
+
+	return 1;
+}
+
+/*
  * Handler for POSTED_INTERRUPT_WAKEUP_VECTOR.
  */
 void pi_wakeup_handler(void)
@@ -236,6 +262,25 @@  bool pi_has_pending_interrupt(struct kvm
 		(pi_test_sn(pi_desc) && !pi_is_pir_empty(pi_desc));
 }
 
+void vmx_pi_start_assignment(struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	int i;
+
+	if (!irq_remapping_cap(IRQ_POSTING_CAP))
+		return;
+
+	/*
+	 * Wakeup will cause the vCPU to bail out of kvm_vcpu_block() and
+	 * go back through vcpu_block().
+	 */
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		if (!kvm_vcpu_apicv_active(vcpu))
+			continue;
+
+		kvm_vcpu_wake_up(vcpu);
+	}
+}
 
 /*
  * pi_update_irte - set IRTE for Posted-Interrupts
Index: kvm/arch/x86/kvm/vmx/posted_intr.h
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/posted_intr.h
+++ kvm/arch/x86/kvm/vmx/posted_intr.h
@@ -95,5 +95,7 @@  void __init pi_init_cpu(int cpu);
 bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,
 		   bool set);
+void vmx_pi_start_assignment(struct kvm *kvm);
+int vmx_vcpu_check_block(struct kvm_vcpu *vcpu);
 
 #endif /* __KVM_X86_VMX_POSTED_INTR_H */
Index: kvm/arch/x86/kvm/vmx/vmx.c
===================================================================
--- kvm.orig/arch/x86/kvm/vmx/vmx.c
+++ kvm/arch/x86/kvm/vmx/vmx.c
@@ -7727,13 +7727,13 @@  static struct kvm_x86_ops vmx_x86_ops __
 
 	.pre_block = vmx_pre_block,
 	.post_block = vmx_post_block,
-	.vcpu_check_block = NULL,
+	.vcpu_check_block = vmx_vcpu_check_block,
 
 	.pmu_ops = &intel_pmu_ops,
 	.nested_ops = &vmx_nested_ops,
 
 	.update_pi_irte = pi_update_irte,
-	.start_assignment = NULL,
+	.start_assignment = vmx_pi_start_assignment,
 
 #ifdef CONFIG_X86_64
 	.set_hv_timer = vmx_set_hv_timer,