diff mbox

[resend,v5,6/6] KVM: nVMX: Enable nested posted interrupt processing

Message ID 1422979097-2203-1-git-send-email-fanwenyi0529@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Wincy Van Feb. 3, 2015, 3:58 p.m. UTC
If vcpu has a interrupt in vmx non-root mode, we will
kick that vcpu to inject interrupt timely. With posted
interrupt processing, the kick intr is not needed, and
interrupts are fully taken care of by hardware.

In nested vmx, this feature avoids much more vmexits
than non-nested vmx.

This patch use L0's POSTED_INTR_NV to avoid unexpected
interrupt if L1's vector is different with L0's. If vcpu
is in hardware's non-root mode, we use a physical ipi to
deliver posted interrupts, otherwise we will accomplish
that posted interrupt in nested vm-entry manually.

Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
---
 arch/x86/kvm/lapic.c |   13 +++-
 arch/x86/kvm/lapic.h |    1 +
 arch/x86/kvm/vmx.c   |  151 ++++++++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 158 insertions(+), 7 deletions(-)

Comments

Yong Wang Feb. 15, 2015, 6:27 a.m. UTC | #1
On Tue, Feb 03, 2015 at 11:58:17PM +0800, Wincy Van wrote:
> If vcpu has a interrupt in vmx non-root mode, we will
> kick that vcpu to inject interrupt timely. With posted
> interrupt processing, the kick intr is not needed, and
> interrupts are fully taken care of by hardware.
> 
> In nested vmx, this feature avoids much more vmexits
> than non-nested vmx.
> 
> This patch use L0's POSTED_INTR_NV to avoid unexpected
> interrupt if L1's vector is different with L0's. If vcpu
> is in hardware's non-root mode, we use a physical ipi to
> deliver posted interrupts, otherwise we will accomplish
> that posted interrupt in nested vm-entry manually.
> 
> Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
> ---
>  arch/x86/kvm/lapic.c |   13 +++-
>  arch/x86/kvm/lapic.h |    1 +
>  arch/x86/kvm/vmx.c   |  151 ++++++++++++++++++++++++++++++++++++++++++++++++-
>  3 files changed, 158 insertions(+), 7 deletions(-)
> 

Wincy, our QA found regressions with this patch that 64bit L2 linux guest
fails to boot up when running nested kvm on kvm.

Environment:
------------
Host OS (ia32/ia32e/IA64):ia32e
Guest OS (ia32/ia32e/IA64):ia32e
Guest OS Type (Linux/Windows):Linux
kvm.git Commit:6557bada461afeaa920a189fae2cff7c8fdce39f
qemu.kvm Commit:5c697ae74170d43928cb185f5ac1a9058adcae0b
Host Kernel Version:3.19.0-rc3
Hardware:Ivytown_EP, Haswell_EP


Bug detailed description:
--------------------------
create 64bit linux guest as L2 guest, the guest boot up fail

note:
1. create a 32bit linux guest as L2 guest, the guest boots up fine.
2. create a 64bit windows guest as L2 guest, the guest boots up fine.
3. this should be a kernel bug:
kvm       + qemu     = result
6557bada  + 5c697ae7 = bad
8fff5e37  + 5c697ae7 = good

Reproduce steps:
----------------
1 create L1 guest:
qemu-system-x86_64 -enable-kvm -m 8G -smp 4 -net nic,macaddr=00:12:31:34:51:31 -net tap,script=/etc/kvm/qemu-ifup nested-kvm.qcow -cpu host

2. create L2 guest
qemu-system-x86_64 -enable-kvm -m 2G -smp 2 -net none rhel6u5.qcow

Current result:
----------------
create 64bit linux guest as L2 guest, the guest boots up fail

Expected result:
----------------
create 64bit linux guest as L2 guest, the guest boots up fine

Please take a look.

Thanks
-Yong

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Wincy Van Feb. 15, 2015, 1:01 p.m. UTC | #2
On Sun, Feb 15, 2015 at 2:27 PM, Yong Wang <yong.y.wang@linux.intel.com> wrote:
> On Tue, Feb 03, 2015 at 11:58:17PM +0800, Wincy Van wrote:
>> If vcpu has a interrupt in vmx non-root mode, we will
>> kick that vcpu to inject interrupt timely. With posted
>> interrupt processing, the kick intr is not needed, and
>> interrupts are fully taken care of by hardware.
>>
>> In nested vmx, this feature avoids much more vmexits
>> than non-nested vmx.
>>
>> This patch use L0's POSTED_INTR_NV to avoid unexpected
>> interrupt if L1's vector is different with L0's. If vcpu
>> is in hardware's non-root mode, we use a physical ipi to
>> deliver posted interrupts, otherwise we will accomplish
>> that posted interrupt in nested vm-entry manually.
>>
>> Signed-off-by: Wincy Van <fanwenyi0529@gmail.com>
>> ---
>>  arch/x86/kvm/lapic.c |   13 +++-
>>  arch/x86/kvm/lapic.h |    1 +
>>  arch/x86/kvm/vmx.c   |  151 ++++++++++++++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 158 insertions(+), 7 deletions(-)
>>
>
> Wincy, our QA found regressions with this patch that 64bit L2 linux guest
> fails to boot up when running nested kvm on kvm.
>
> Environment:
> ------------
> Host OS (ia32/ia32e/IA64):ia32e
> Guest OS (ia32/ia32e/IA64):ia32e
> Guest OS Type (Linux/Windows):Linux
> kvm.git Commit:6557bada461afeaa920a189fae2cff7c8fdce39f
> qemu.kvm Commit:5c697ae74170d43928cb185f5ac1a9058adcae0b
> Host Kernel Version:3.19.0-rc3
> Hardware:Ivytown_EP, Haswell_EP
>
>
> Bug detailed description:
> --------------------------
> create 64bit linux guest as L2 guest, the guest boot up fail
>
> note:
> 1. create a 32bit linux guest as L2 guest, the guest boots up fine.
> 2. create a 64bit windows guest as L2 guest, the guest boots up fine.
> 3. this should be a kernel bug:
> kvm       + qemu     = result
> 6557bada  + 5c697ae7 = bad
> 8fff5e37  + 5c697ae7 = good
>
> Reproduce steps:
> ----------------
> 1 create L1 guest:
> qemu-system-x86_64 -enable-kvm -m 8G -smp 4 -net nic,macaddr=00:12:31:34:51:31 -net tap,script=/etc/kvm/qemu-ifup nested-kvm.qcow -cpu host
>
> 2. create L2 guest
> qemu-system-x86_64 -enable-kvm -m 2G -smp 2 -net none rhel6u5.qcow
>
> Current result:
> ----------------
> create 64bit linux guest as L2 guest, the guest boots up fail
>
> Expected result:
> ----------------
> create 64bit linux guest as L2 guest, the guest boots up fine
>
> Please take a look.
>

Certainly, will do.

Thanks,
Wincy
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
diff mbox

Patch

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 555956c..2f83384 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -325,17 +325,24 @@  static u8 count_vectors(void *bitmap)
 	return count;
 }
 
-void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
+void __kvm_apic_update_irr(u32 *pir, void *regs)
 {
 	u32 i, pir_val;
-	struct kvm_lapic *apic = vcpu->arch.apic;
 
 	for (i = 0; i <= 7; i++) {
 		pir_val = xchg(&pir[i], 0);
 		if (pir_val)
-			*((u32 *)(apic->regs + APIC_IRR + i * 0x10)) |= pir_val;
+			*((u32 *)(regs + APIC_IRR + i * 0x10)) |= pir_val;
 	}
 }
+EXPORT_SYMBOL_GPL(__kvm_apic_update_irr);
+
+void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir)
+{
+	struct kvm_lapic *apic = vcpu->arch.apic;
+
+	__kvm_apic_update_irr(pir, apic->regs);
+}
 EXPORT_SYMBOL_GPL(kvm_apic_update_irr);
 
 static inline void apic_set_irr(int vec, struct kvm_lapic *apic)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index c1ef25c..0bc6c65 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -57,6 +57,7 @@  u64 kvm_lapic_get_base(struct kvm_vcpu *vcpu);
 void kvm_apic_set_version(struct kvm_vcpu *vcpu);
 
 void kvm_apic_update_tmr(struct kvm_vcpu *vcpu, u32 *tmr);
+void __kvm_apic_update_irr(u32 *pir, void *regs);
 void kvm_apic_update_irr(struct kvm_vcpu *vcpu, u32 *pir);
 int kvm_apic_set_irq(struct kvm_vcpu *vcpu, struct kvm_lapic_irq *irq,
 		unsigned long *dest_map);
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e22e159..45c3437 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -218,6 +218,7 @@  struct __packed vmcs12 {
 	u64 tsc_offset;
 	u64 virtual_apic_page_addr;
 	u64 apic_access_addr;
+	u64 posted_intr_desc_addr;
 	u64 ept_pointer;
 	u64 eoi_exit_bitmap0;
 	u64 eoi_exit_bitmap1;
@@ -337,6 +338,7 @@  struct __packed vmcs12 {
 	u32 vmx_preemption_timer_value;
 	u32 padding32[7]; /* room for future expansion */
 	u16 virtual_processor_id;
+	u16 posted_intr_nv;
 	u16 guest_es_selector;
 	u16 guest_cs_selector;
 	u16 guest_ss_selector;
@@ -409,6 +411,10 @@  struct nested_vmx {
 	 */
 	struct page *apic_access_page;
 	struct page *virtual_apic_page;
+	struct page *pi_desc_page;
+	struct pi_desc *pi_desc;
+	bool pi_pending;
+	u16 posted_intr_nv;
 	u64 msr_ia32_feature_control;
 
 	struct hrtimer preemption_timer;
@@ -628,6 +634,7 @@  static int max_shadow_read_write_fields =
 
 static const unsigned short vmcs_field_to_offset_table[] = {
 	FIELD(VIRTUAL_PROCESSOR_ID, virtual_processor_id),
+	FIELD(POSTED_INTR_NV, posted_intr_nv),
 	FIELD(GUEST_ES_SELECTOR, guest_es_selector),
 	FIELD(GUEST_CS_SELECTOR, guest_cs_selector),
 	FIELD(GUEST_SS_SELECTOR, guest_ss_selector),
@@ -653,6 +660,7 @@  static const unsigned short vmcs_field_to_offset_table[] = {
 	FIELD64(TSC_OFFSET, tsc_offset),
 	FIELD64(VIRTUAL_APIC_PAGE_ADDR, virtual_apic_page_addr),
 	FIELD64(APIC_ACCESS_ADDR, apic_access_addr),
+	FIELD64(POSTED_INTR_DESC_ADDR, posted_intr_desc_addr),
 	FIELD64(EPT_POINTER, ept_pointer),
 	FIELD64(EOI_EXIT_BITMAP0, eoi_exit_bitmap0),
 	FIELD64(EOI_EXIT_BITMAP1, eoi_exit_bitmap1),
@@ -805,6 +813,7 @@  static void kvm_cpu_vmxon(u64 addr);
 static void kvm_cpu_vmxoff(void);
 static bool vmx_mpx_supported(void);
 static bool vmx_xsaves_supported(void);
+static int vmx_vm_has_apicv(struct kvm *kvm);
 static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
 static void vmx_set_segment(struct kvm_vcpu *vcpu,
 			    struct kvm_segment *var, int seg);
@@ -1162,6 +1171,11 @@  static inline bool nested_cpu_has_vid(struct vmcs12 *vmcs12)
 	return nested_cpu_has2(vmcs12, SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
 }
 
+static inline bool nested_cpu_has_posted_intr(struct vmcs12 *vmcs12)
+{
+	return vmcs12->pin_based_vm_exec_control & PIN_BASED_POSTED_INTR;
+}
+
 static inline bool is_exception(u32 intr_info)
 {
 	return (intr_info & (INTR_INFO_INTR_TYPE_MASK | INTR_INFO_VALID_MASK))
@@ -2365,6 +2379,9 @@  static void nested_vmx_setup_ctls_msrs(struct vcpu_vmx *vmx)
 	vmx->nested.nested_vmx_pinbased_ctls_high |=
 		PIN_BASED_ALWAYSON_WITHOUT_TRUE_MSR |
 		PIN_BASED_VMX_PREEMPTION_TIMER;
+	if (vmx_vm_has_apicv(vmx->vcpu.kvm))
+		vmx->nested.nested_vmx_pinbased_ctls_high |=
+			PIN_BASED_POSTED_INTR;
 
 	/* exit controls */
 	rdmsr(MSR_IA32_VMX_EXIT_CTLS,
@@ -4317,6 +4334,64 @@  static int vmx_vm_has_apicv(struct kvm *kvm)
 	return enable_apicv && irqchip_in_kernel(kvm);
 }
 
+static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	int max_irr;
+	void *vapic_page;
+	u16 status;
+
+	if (vmx->nested.pi_desc &&
+	    vmx->nested.pi_pending) {
+		vmx->nested.pi_pending = false;
+		if (!pi_test_and_clear_on(vmx->nested.pi_desc))
+			return 0;
+
+		max_irr = find_last_bit(
+			(unsigned long *)vmx->nested.pi_desc->pir, 256);
+
+		if (max_irr == 256)
+			return 0;
+
+		vapic_page = kmap(vmx->nested.virtual_apic_page);
+		if (!vapic_page) {
+			WARN_ON(1);
+			return -ENOMEM;
+		}
+		__kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page);
+		kunmap(vmx->nested.virtual_apic_page);
+
+		status = vmcs_read16(GUEST_INTR_STATUS);
+		if ((u8)max_irr > ((u8)status & 0xff)) {
+			status &= ~0xff;
+			status |= (u8)max_irr;
+			vmcs_write16(GUEST_INTR_STATUS, status);
+		}
+	}
+	return 0;
+}
+
+static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
+						int vector)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (is_guest_mode(vcpu) &&
+	    vector == vmx->nested.posted_intr_nv) {
+		/* the PIR and ON have been set by L1. */
+		if (vcpu->mode == IN_GUEST_MODE)
+			apic->send_IPI_mask(get_cpu_mask(vcpu->cpu),
+				POSTED_INTR_VECTOR);
+		/*
+		 * If a posted intr is not recognized by hardware,
+		 * we will accomplish it in the next vmentry.
+		 */
+		vmx->nested.pi_pending = true;
+		kvm_make_request(KVM_REQ_EVENT, vcpu);
+		return 0;
+	}
+	return -1;
+}
 /*
  * Send interrupt to vcpu via posted interrupt way.
  * 1. If target vcpu is running(non-root mode), send posted interrupt
@@ -4329,6 +4404,10 @@  static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int r;
 
+	r = vmx_deliver_nested_posted_interrupt(vcpu, vector);
+	if (!r)
+		return;
+
 	if (pi_test_and_set_pir(vector, &vmx->pi_desc))
 		return;
 
@@ -6591,6 +6670,7 @@  static inline void nested_release_vmcs12(struct vcpu_vmx *vmx)
 		vmcs_write32(SECONDARY_VM_EXEC_CONTROL, exec_control);
 		vmcs_write64(VMCS_LINK_POINTER, -1ull);
 	}
+	vmx->nested.posted_intr_nv = -1;
 	kunmap(vmx->nested.current_vmcs12_page);
 	nested_release_page(vmx->nested.current_vmcs12_page);
 	vmx->nested.current_vmptr = -1ull;
@@ -6619,6 +6699,12 @@  static void free_nested(struct vcpu_vmx *vmx)
 		nested_release_page(vmx->nested.virtual_apic_page);
 		vmx->nested.virtual_apic_page = NULL;
 	}
+	if (vmx->nested.pi_desc_page) {
+		kunmap(vmx->nested.pi_desc_page);
+		nested_release_page(vmx->nested.pi_desc_page);
+		vmx->nested.pi_desc_page = NULL;
+		vmx->nested.pi_desc = NULL;
+	}
 
 	nested_free_all_saved_vmcss(vmx);
 }
@@ -8333,6 +8419,7 @@  static struct kvm_vcpu *vmx_create_vcpu(struct kvm *kvm, unsigned int id)
 	if (nested)
 		nested_vmx_setup_ctls_msrs(vmx);
 
+	vmx->nested.posted_intr_nv = -1;
 	vmx->nested.current_vmptr = -1ull;
 	vmx->nested.current_vmcs12 = NULL;
 
@@ -8578,6 +8665,31 @@  static bool nested_get_vmcs12_pages(struct kvm_vcpu *vcpu,
 			return false;
 	}
 
+	if (nested_cpu_has_posted_intr(vmcs12)) {
+		if (!IS_ALIGNED(vmcs12->posted_intr_desc_addr, 64))
+			return false;
+
+		if (vmx->nested.pi_desc_page) { /* shouldn't happen */
+			kunmap(vmx->nested.pi_desc_page);
+			nested_release_page(vmx->nested.pi_desc_page);
+		}
+		vmx->nested.pi_desc_page =
+			nested_get_page(vcpu, vmcs12->posted_intr_desc_addr);
+		if (!vmx->nested.pi_desc_page)
+			return false;
+
+		vmx->nested.pi_desc =
+			(struct pi_desc *)kmap(vmx->nested.pi_desc_page);
+		if (!vmx->nested.pi_desc) {
+			nested_release_page_clean(vmx->nested.pi_desc_page);
+			return false;
+		}
+		vmx->nested.pi_desc =
+			(struct pi_desc *)((void *)vmx->nested.pi_desc +
+			(unsigned long)(vmcs12->posted_intr_desc_addr &
+			(PAGE_SIZE - 1)));
+	}
+
 	return true;
 }
 
@@ -8713,7 +8825,8 @@  static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu,
 {
 	if (!nested_cpu_has_virt_x2apic_mode(vmcs12) &&
 	    !nested_cpu_has_apic_reg_virt(vmcs12) &&
-	    !nested_cpu_has_vid(vmcs12))
+	    !nested_cpu_has_vid(vmcs12) &&
+	    !nested_cpu_has_posted_intr(vmcs12))
 		return 0;
 
 	/*
@@ -8732,6 +8845,17 @@  static int nested_vmx_check_apicv_controls(struct kvm_vcpu *vcpu,
 	   !nested_exit_on_intr(vcpu))
 		return -EINVAL;
 
+	/*
+	 * bits 15:8 should be zero in posted_intr_nv,
+	 * the descriptor address has been already checked
+	 * in nested_get_vmcs12_pages.
+	 */
+	if (nested_cpu_has_posted_intr(vmcs12) &&
+	   (!nested_cpu_has_vid(vmcs12) ||
+	    !nested_exit_intr_ack_set(vcpu) ||
+	    vmcs12->posted_intr_nv & 0xff00))
+		return -EINVAL;
+
 	/* tpr shadow is needed by all apicv features. */
 	if (!nested_cpu_has(vmcs12, CPU_BASED_TPR_SHADOW))
 		return -EINVAL;
@@ -8974,8 +9098,20 @@  static void prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 
 	exec_control = vmcs12->pin_based_vm_exec_control;
 	exec_control |= vmcs_config.pin_based_exec_ctrl;
-	exec_control &= ~(PIN_BASED_VMX_PREEMPTION_TIMER |
-                          PIN_BASED_POSTED_INTR);
+	exec_control &= ~PIN_BASED_VMX_PREEMPTION_TIMER;
+
+	if (nested_cpu_has_posted_intr(vmcs12)) {
+		/* Note that we use L0's vector to avoid unexpected intr. */
+		vmx->nested.posted_intr_nv = vmcs12->posted_intr_nv;
+		vmx->nested.pi_pending = false;
+		vmcs_write64(POSTED_INTR_NV, POSTED_INTR_VECTOR);
+		vmcs_write64(POSTED_INTR_DESC_ADDR,
+			page_to_phys(vmx->nested.pi_desc_page) +
+			(unsigned long)(vmcs12->posted_intr_desc_addr &
+			(PAGE_SIZE - 1)));
+	} else
+		exec_control &= ~PIN_BASED_POSTED_INTR;
+
 	vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, exec_control);
 
 	vmx->nested.preemption_timer_expired = false;
@@ -9511,9 +9647,10 @@  static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 		if (vmx->nested.nested_run_pending)
 			return -EBUSY;
 		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT, 0, 0);
+		return 0;
 	}
 
-	return 0;
+	return vmx_complete_nested_posted_interrupt(vcpu);
 }
 
 static u32 vmx_get_preemption_timer_value(struct kvm_vcpu *vcpu)
@@ -9891,6 +10028,12 @@  static void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 exit_reason,
 		nested_release_page(vmx->nested.virtual_apic_page);
 		vmx->nested.virtual_apic_page = NULL;
 	}
+	if (vmx->nested.pi_desc_page) {
+		kunmap(vmx->nested.pi_desc_page);
+		nested_release_page(vmx->nested.pi_desc_page);
+		vmx->nested.pi_desc_page = NULL;
+		vmx->nested.pi_desc = NULL;
+	}
 
 	/*
 	 * We are now running in L2, mmu_notifier will force to reload the