Message ID | 1512461786-6465-5-git-send-email-liran.alon@oracle.com (mailing list archive) |
---|---|
State | New, archived |
Good idea! Using self-IPI is much simpler and more efficient than emulation.

On Tue, Dec 5, 2017 at 4:16 PM, Liran Alon <liran.alon@oracle.com> wrote:
> When L1 wants to send a posted-interrupt to another L1 CPU running L2, it sets the relevant bit in vmx->nested.pi_desc->pir and the ON bit in vmx->nested.pi_desc->control. Then it attempts to send a notification-vector IPI to the dest L1 CPU.
> This attempt to send the IPI will exit to L0, which will reach vmx_deliver_nested_posted_interrupt(), which does the following:
> 1. If the dest L0 CPU is currently running the guest (vcpu->mode == IN_GUEST_MODE), it sends a physical IPI of the PI nested-notification-vector.
> 2. It sets KVM_REQ_EVENT in the dest vCPU. This is done such that if the dest L0 CPU exits from guest to host and therefore doesn't recognize the physical IPI (or it wasn't sent), then KVM_REQ_EVENT will be consumed on the next vmentry, which will call vmx_check_nested_events(), which should (in theory) call vmx_complete_nested_posted_interrupt(). That function should see that the vmx->nested.pi_desc->control ON bit is set and therefore "emulate" posted-interrupt delivery for L1 (sync PIR to IRR in the L1 virtual-apic-page & update vmcs02 RVI).
>
> The above logic regarding nested-posted-interrupts contains multiple issues:
>
> A) Race-condition:
> On step (1) it is possible the sender will see vcpu->mode == IN_GUEST_MODE but, before the physical IPI is sent, the dest CPU will exit to host. Therefore, the physical IPI could be received at the host, whose handler does nothing. In addition, assume that the dest CPU passes the checks for pending kvm requests before the sender sets KVM_REQ_EVENT. Therefore, the dest CPU will resume L2 without evaluating nested-posted-interrupts.
>
> B) vmx_complete_nested_posted_interrupt() is not always called when needed. Even if the dest CPU consumed KVM_REQ_EVENT, there is a bug that vmx_check_nested_events() could exit from L2 to L1 before calling vmx_complete_nested_posted_interrupt(). Therefore, on the next resume of L1 into L2, nested-posted-interrupts won't be evaluated even though L0 resumes L2 (we may resume L2 without having KVM_REQ_EVENT set).
>
> This commit removes nested-posted-interrupts processing from check_nested_events() and instead makes sure to process nested-posted-interrupts on vmentry after interrupts are disabled. Processing of nested-posted-interrupts is delegated to hardware by issuing a self-IPI of the relevant notification-vector, which will be delivered to the CPU when the CPU is in guest mode.
>
> * Bug (A) is solved by the fact that processing of nested-posted-interrupts is not dependent on KVM_REQ_EVENT and happens before every vmentry to L2.
> * Bug (B) is now trivially solved by processing nested-posted-interrupts before each vmentry to L2 guest.
>
> An alternative could have been to just call vmx_complete_nested_posted_interrupt() at this call-site. However, we have decided to go with this approach because:
> 1. It would require modifying vmx_complete_nested_posted_interrupt() to be able to work with interrupts disabled (not trivial).
> 2. We prefer to avoid software emulation of hardware behavior if it is possible.
> > Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing") > > Signed-off-by: Liran Alon <liran.alon@oracle.com> > Co-authored-by: Nikita Leshenko <nikita.leshchenko@oracle.com> > Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> > Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > --- > arch/x86/include/asm/kvm_host.h | 1 + > arch/x86/kvm/vmx.c | 57 ++++++++++------------------------------- > arch/x86/kvm/x86.c | 14 +++++++--- > 3 files changed, 25 insertions(+), 47 deletions(-) > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h > index 1bfb99770c34..dc0affe69903 100644 > --- a/arch/x86/include/asm/kvm_host.h > +++ b/arch/x86/include/asm/kvm_host.h > @@ -980,6 +980,7 @@ struct kvm_x86_ops { > void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set); > void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa); > void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector); > + void (*complete_nested_posted_interrupt)(struct kvm_vcpu *vcpu); > int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu); > int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); > int (*get_tdp_level)(struct kvm_vcpu *vcpu); > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > index bbc023fb9ef1..517822f94158 100644 > --- a/arch/x86/kvm/vmx.c > +++ b/arch/x86/kvm/vmx.c > @@ -534,12 +534,6 @@ static bool pi_test_and_set_on(struct pi_desc *pi_desc) > (unsigned long *)&pi_desc->control); > } > > -static bool pi_test_and_clear_on(struct pi_desc *pi_desc) > -{ > - return test_and_clear_bit(POSTED_INTR_ON, > - (unsigned long *)&pi_desc->control); > -} > - > static int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc) > { > return test_and_set_bit(vector, (unsigned long *)pi_desc->pir); > @@ -5041,37 +5035,6 @@ static void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu) > } > } > > - > -static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) > -{ > - struct vcpu_vmx *vmx = to_vmx(vcpu); > - int max_irr; > - void *vapic_page; > - u16 status; > - > - if (!vmx->nested.pi_desc) > - return; > - > - if (!pi_test_and_clear_on(vmx->nested.pi_desc)) > - return; > - > - max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256); > - if (max_irr != 256) { > - vapic_page = kmap(vmx->nested.virtual_apic_page); > - __kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page); > - kunmap(vmx->nested.virtual_apic_page); > - > - status = vmcs_read16(GUEST_INTR_STATUS); > - if ((u8)max_irr > ((u8)status & 0xff)) { > - status &= ~0xff; > - status |= (u8)max_irr; > - vmcs_write16(GUEST_INTR_STATUS, status); > - } > - } > - > - nested_mark_vmcs12_pages_dirty(vcpu); > -} > - > static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu, > bool nested) > { > @@ -5120,11 +5083,6 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu, > vector == vmx->nested.posted_intr_nv) { > /* the PIR and ON have been set by L1. */ > kvm_vcpu_trigger_posted_interrupt(vcpu, true); > - /* > - * If a posted intr is not recognized by hardware, > - * we will accomplish it in the next vmentry. 
> - */ > - kvm_make_request(KVM_REQ_EVENT, vcpu); > return 0; > } > return -1; > @@ -5156,6 +5114,17 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) > kvm_vcpu_kick(vcpu); > } > > +static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) > +{ > + struct vcpu_vmx *vmx = to_vmx(vcpu); > + > + WARN_ON(!vcpu->arch.apicv_active); > + > + if (is_guest_mode(vcpu) && vmx->nested.pi_desc && > + pi_test_on(vmx->nested.pi_desc)) > + kvm_vcpu_trigger_posted_interrupt(vcpu, true); > +} > + > /* > * Set up the vmcs's constant host-state fields, i.e., host-state fields that > * will not change in the lifetime of the guest. > @@ -6823,6 +6792,7 @@ static __init int hardware_setup(void) > if (!cpu_has_vmx_apicv()) { > enable_apicv = 0; > kvm_x86_ops->sync_pir_to_irr = NULL; > + kvm_x86_ops->complete_nested_posted_interrupt = NULL; > } > > if (cpu_has_vmx_tsc_scaling()) { > @@ -11144,7 +11114,6 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) > return 0; > } > > - vmx_complete_nested_posted_interrupt(vcpu); > return 0; > } > > @@ -12136,6 +12105,8 @@ static int enable_smi_window(struct kvm_vcpu *vcpu) > .hwapic_isr_update = vmx_hwapic_isr_update, > .sync_pir_to_irr = vmx_sync_pir_to_irr, > .deliver_posted_interrupt = vmx_deliver_posted_interrupt, > + .complete_nested_posted_interrupt = > + vmx_complete_nested_posted_interrupt, > > .set_tss_addr = vmx_set_tss_addr, > .get_tdp_level = get_ept_level, > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > index 34c85aa2e2d1..6fb58ab9a85c 100644 > --- a/arch/x86/kvm/x86.c > +++ b/arch/x86/kvm/x86.c > @@ -6962,12 +6962,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > smp_mb__after_srcu_read_unlock(); > > /* > - * This handles the case where a posted interrupt was > - * notified with kvm_vcpu_kick. > + * In case guest got the posted-interrupt notification > + * vector while running in host, we need to make sure > + * it arrives to guest. > + * For L1 posted-interrupts, we manually sync PIR to IRR. > + * For L2 posted-interrupts, we send notification-vector > + * again by self IPI such that it will now be received in guest. > */ > - if (kvm_lapic_enabled(vcpu)) { > - if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active) > + if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) { > + if (kvm_x86_ops->sync_pir_to_irr) > kvm_x86_ops->sync_pir_to_irr(vcpu); > + if (kvm_x86_ops->complete_nested_posted_interrupt) > + kvm_x86_ops->complete_nested_posted_interrupt(vcpu); > } > > if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu) > -- > 1.9.1 >
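For reference, the posted-interrupt descriptor the commit message manipulates (vmx->nested.pi_desc, whose address L1 supplies through vmcs12) has roughly the layout sketched below. This is an illustrative simplification following the Intel SDM, not the exact struct pi_desc from vmx.c, and pi_post_interrupt() is a hypothetical helper showing what "set the relevant bit in pir and the ON bit in control" amounts to.

    #include <linux/types.h>
    #include <linux/bitops.h>

    /* Simplified posted-interrupt descriptor layout (illustrative only). */
    struct pi_desc_sketch {
    	u32 pir[8];			/* 256-bit Posted Interrupt Request bitmap */
    	union {
    		struct {
    			u16	on	: 1,	/* Outstanding Notification */
    				sn	: 1,	/* Suppress Notification */
    				rsvd_1	: 14;
    			u8	nv;		/* Notification Vector */
    			u8	rsvd_2;
    			u32	ndst;		/* Notification Destination */
    		};
    		u64 control;
    	};
    	u32 rsvd[6];
    } __aligned(64);

    /*
    * What the sending side effectively does before firing the
    * notification-vector IPI: mark the vector pending and set ON.
    */
    static inline void pi_post_interrupt(struct pi_desc_sketch *pi, int vector)
    {
    	set_bit(vector, (unsigned long *)pi->pir);
    	set_bit(0, (unsigned long *)&pi->control);	/* ON is bit 0 of control */
    }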
2017-12-05 10:16+0200, Liran Alon: > When L1 wants to send a posted-interrupt to another L1 CPU running L2, > it sets the relevant bit in vmx->nested.pi_desc->pir and ON bit in > vmx->nested.pi_desc->control. Then it attempts to send a > notification-vector IPI to dest L1 CPU. > This attempt to send IPI will exit to L0 which will reach > vmx_deliver_nested_posted_interrupt() which does the following: > 1. If dest L0 CPU is currently running guest (vcpu->mode == > IN_GUEST_MODE), it sends a physical IPI of PI nested-notification-vector. > 2. It sets KVM_REQ_EVENT in dest vCPU. This is done such that if dest > L0 CPU exits from guest to host and therefore doesn't recognize physical > IPI (or it wasn't sent), then KVM_REQ_EVENT will be consumed on next > vmentry which will call vmx_check_nested_events() which should call > (in theory) vmx_complete_nested_posted_interrupt(). That function > should see vmx->nested.pi_desc->control ON bit is set and therefore > "emulate" posted-interrupt delivery for L1 (sync PIR to IRR in L1 > virtual-apic-page & update vmcs02 RVI). > > The above logic regarding nested-posted-interrupts contains multiple > issues: > > A) Race-condition: > On step (1) it is possible sender will see vcpu->mode == IN_GUEST_MODE > but before sending physical IPI, the dest CPU will exit to host. > Therefore, physical IPI could be received at host which it's hanlder > does nothing. In addition, assume that dest CPU passes the checks > for pending kvm requests before sender sets KVM_REQ_EVENT. Therefore, > dest CPU will resume L2 without evaluating nested-posted-interrupts. > > B) vmx_complete_nested_posted_interrupt() is not always called > when needed. Even if dest CPU consumed KVM_REQ_EVENT, there is a bug > that vmx_check_nested_events() could exit from L2 to L1 before calling > vmx_complete_nested_posted_interrupt(). Therefore, on next resume of > L1 into L2, nested-posted-interrupts won't be evaluated even though L0 > resume L2 (We may resume L2 without having KVM_REQ_EVENT set). > > This commit removes nested-posted-interrupts processing from > check_nested_events() and instead makes sure to process > nested-posted-interrupts on vmentry after interrupts > disabled. Processing of nested-posted-interrupts is delegated > to hardware by issueing a self-IPI of relevant notification-vector > which will be delivered to CPU when CPU is in guest. > > * Bug (A) is solved by the fact that processing of > nested-posted-interrupts is not dependent on KVM_REQ_EVENT and > happens before every vmentry to L2. > * Bug (B) is now trivially solved by processing > nested-posted-interrupts before each vmentry to L2 guest. > > An alternative could have been to just call > vmx_complete_nested_posted_interrupt() at this call-site. However, we > have decided to go with this approach because: > 1. It would require modifying vmx_complete_nested_posted_interrupt() > to be able to work with interrupts disabled (not-trivial). > 2. We preffer to avoid software-emulation of hardware behavior if it > is possible. Nice solution! Self-IPI was slower on non-nested, but even if it is slower here, the code saving is definitely worth it. 
> Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing") > > Signed-off-by: Liran Alon <liran.alon@oracle.com> > Co-authored-by: Nikita Leshenko <nikita.leshchenko@oracle.com> > Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> > Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > --- > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > @@ -5156,6 +5114,17 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) > kvm_vcpu_kick(vcpu); > } > > +static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) > +{ > + struct vcpu_vmx *vmx = to_vmx(vcpu); > + > + WARN_ON(!vcpu->arch.apicv_active); > + > + if (is_guest_mode(vcpu) && vmx->nested.pi_desc && > + pi_test_on(vmx->nested.pi_desc)) > + kvm_vcpu_trigger_posted_interrupt(vcpu, true); We're always sending to self, so the function would be better with apic->send_IPI_self() as it might be accelerated by hardware. > +} > + > /* > * Set up the vmcs's constant host-state fields, i.e., host-state fields that > * will not change in the lifetime of the guest. > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > @@ -6962,12 +6962,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > smp_mb__after_srcu_read_unlock(); > > /* > - * This handles the case where a posted interrupt was > - * notified with kvm_vcpu_kick. > + * In case guest got the posted-interrupt notification > + * vector while running in host, we need to make sure > + * it arrives to guest. > + * For L1 posted-interrupts, we manually sync PIR to IRR. > + * For L2 posted-interrupts, we send notification-vector > + * again by self IPI such that it will now be received in guest. > */ > - if (kvm_lapic_enabled(vcpu)) { > - if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active) > + if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) { > + if (kvm_x86_ops->sync_pir_to_irr) > kvm_x86_ops->sync_pir_to_irr(vcpu); > + if (kvm_x86_ops->complete_nested_posted_interrupt) > + kvm_x86_ops->complete_nested_posted_interrupt(vcpu); I think that vcpu->arch.apicv_active is enough to guarantee presence of both sync_pir_to_irr and complete_nested_posted_interrupt, and considering this is a hot path, I'd check is_guest_mode(vcpu) before calling the second function. (Saving the function call with a new x86_op for those two seems viable too, but would need deeper analysis and benchmarking as it is putting more code into hot cache ...) Thanks.
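A minimal sketch of the variant suggested above, assuming the nested notification vector introduced by this series is available under a name such as POSTED_INTR_NESTED_VECTOR (name assumed here for illustration) and that the rest of the function is unchanged:

    static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
    {
    	struct vcpu_vmx *vmx = to_vmx(vcpu);

    	WARN_ON(!vcpu->arch.apicv_active);

    	if (is_guest_mode(vcpu) && vmx->nested.pi_desc &&
    	    pi_test_on(vmx->nested.pi_desc))
    		/*
    		 * Always a self-IPI, so apic->send_IPI_self() can be used;
    		 * it may be hardware-accelerated (e.g. the x2APIC SELF IPI
    		 * MSR), unlike the generic trigger path.
    		 */
    		apic->send_IPI_self(POSTED_INTR_NESTED_VECTOR);
    }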
On 06/12/17 22:45, Radim Krčmář wrote: > 2017-12-05 10:16+0200, Liran Alon: >> When L1 wants to send a posted-interrupt to another L1 CPU running L2, >> it sets the relevant bit in vmx->nested.pi_desc->pir and ON bit in >> vmx->nested.pi_desc->control. Then it attempts to send a >> notification-vector IPI to dest L1 CPU. >> This attempt to send IPI will exit to L0 which will reach >> vmx_deliver_nested_posted_interrupt() which does the following: >> 1. If dest L0 CPU is currently running guest (vcpu->mode == >> IN_GUEST_MODE), it sends a physical IPI of PI nested-notification-vector. >> 2. It sets KVM_REQ_EVENT in dest vCPU. This is done such that if dest >> L0 CPU exits from guest to host and therefore doesn't recognize physical >> IPI (or it wasn't sent), then KVM_REQ_EVENT will be consumed on next >> vmentry which will call vmx_check_nested_events() which should call >> (in theory) vmx_complete_nested_posted_interrupt(). That function >> should see vmx->nested.pi_desc->control ON bit is set and therefore >> "emulate" posted-interrupt delivery for L1 (sync PIR to IRR in L1 >> virtual-apic-page & update vmcs02 RVI). >> >> The above logic regarding nested-posted-interrupts contains multiple >> issues: >> >> A) Race-condition: >> On step (1) it is possible sender will see vcpu->mode == IN_GUEST_MODE >> but before sending physical IPI, the dest CPU will exit to host. >> Therefore, physical IPI could be received at host which it's hanlder >> does nothing. In addition, assume that dest CPU passes the checks >> for pending kvm requests before sender sets KVM_REQ_EVENT. Therefore, >> dest CPU will resume L2 without evaluating nested-posted-interrupts. >> >> B) vmx_complete_nested_posted_interrupt() is not always called >> when needed. Even if dest CPU consumed KVM_REQ_EVENT, there is a bug >> that vmx_check_nested_events() could exit from L2 to L1 before calling >> vmx_complete_nested_posted_interrupt(). Therefore, on next resume of >> L1 into L2, nested-posted-interrupts won't be evaluated even though L0 >> resume L2 (We may resume L2 without having KVM_REQ_EVENT set). >> >> This commit removes nested-posted-interrupts processing from >> check_nested_events() and instead makes sure to process >> nested-posted-interrupts on vmentry after interrupts >> disabled. Processing of nested-posted-interrupts is delegated >> to hardware by issueing a self-IPI of relevant notification-vector >> which will be delivered to CPU when CPU is in guest. >> >> * Bug (A) is solved by the fact that processing of >> nested-posted-interrupts is not dependent on KVM_REQ_EVENT and >> happens before every vmentry to L2. >> * Bug (B) is now trivially solved by processing >> nested-posted-interrupts before each vmentry to L2 guest. >> >> An alternative could have been to just call >> vmx_complete_nested_posted_interrupt() at this call-site. However, we >> have decided to go with this approach because: >> 1. It would require modifying vmx_complete_nested_posted_interrupt() >> to be able to work with interrupts disabled (not-trivial). >> 2. We preffer to avoid software-emulation of hardware behavior if it >> is possible. > > Nice solution! Self-IPI was slower on non-nested, but even if it is > slower here, the code saving is definitely worth it. 
> >> Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing") >> >> Signed-off-by: Liran Alon <liran.alon@oracle.com> >> Co-authored-by: Nikita Leshenko <nikita.leshchenko@oracle.com> >> Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> >> Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> >> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> >> --- >> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c >> @@ -5156,6 +5114,17 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) >> kvm_vcpu_kick(vcpu); >> } >> >> +static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) >> +{ >> + struct vcpu_vmx *vmx = to_vmx(vcpu); >> + >> + WARN_ON(!vcpu->arch.apicv_active); >> + >> + if (is_guest_mode(vcpu) && vmx->nested.pi_desc && >> + pi_test_on(vmx->nested.pi_desc)) >> + kvm_vcpu_trigger_posted_interrupt(vcpu, true); > > We're always sending to self, so the function would be better with > apic->send_IPI_self() as it might be accelerated by hardware. > Agreed. Will change. >> +} >> + >> /* >> * Set up the vmcs's constant host-state fields, i.e., host-state fields that >> * will not change in the lifetime of the guest. >> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c >> @@ -6962,12 +6962,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) >> smp_mb__after_srcu_read_unlock(); >> >> /* >> - * This handles the case where a posted interrupt was >> - * notified with kvm_vcpu_kick. >> + * In case guest got the posted-interrupt notification >> + * vector while running in host, we need to make sure >> + * it arrives to guest. >> + * For L1 posted-interrupts, we manually sync PIR to IRR. >> + * For L2 posted-interrupts, we send notification-vector >> + * again by self IPI such that it will now be received in guest. >> */ >> - if (kvm_lapic_enabled(vcpu)) { >> - if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active) >> + if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) { >> + if (kvm_x86_ops->sync_pir_to_irr) >> kvm_x86_ops->sync_pir_to_irr(vcpu); >> + if (kvm_x86_ops->complete_nested_posted_interrupt) >> + kvm_x86_ops->complete_nested_posted_interrupt(vcpu); > > I think that vcpu->arch.apicv_active is enough to guarantee presence of > both sync_pir_to_irr and complete_nested_posted_interrupt, I am not sure you can remove the check for kvm_lapic_enabled(). apicv_active can be true even when lapic_in_kernel()==false. apicv_active is only dependent on if physical CPU supports APICv related features & enable_apicv module parameter is 1. However, sync_pir_to_irr() assumes that lapic_in_kernel()==true because it reads & updates data in in-kernel LAPIC IRR register. So I think at least it doesn't makes sense to invoke sync_pir_to_irr() in case kvm_lapic_enabled()==false. But I do agree that complete_nested_posted_interrupt() should be invoked if just apicv_active==true. Because the LAPIC that L1 emulate for L2 can be enabled and implemented in-kernel even if the LAPIC that L0 emulate for L1 is not currently enabled. Therefore, L1 can still send posted-interrupts for it's L2 guest in this case. > > and considering this is a hot path, I'd check is_guest_mode(vcpu) before > calling the second function. > > (Saving the function call with a new x86_op for those two seems viable > too, but would need deeper analysis and benchmarking as it is putting > more code into hot cache ...) > OK. Will change. Regards, -Liran > Thanks. >
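To make the sync_pir_to_irr() point concrete, here is a rough sketch of the existing L1 (non-nested) path it refers to, simplified from vmx_sync_pir_to_irr() (details of the real function may differ). The key step is that it copies PIR bits into the in-kernel LAPIC's IRR, hence the dependency on kvm_lapic_enabled():

    static int vmx_sync_pir_to_irr_sketch(struct kvm_vcpu *vcpu)
    {
    	struct vcpu_vmx *vmx = to_vmx(vcpu);
    	int max_irr;

    	WARN_ON(!vcpu->arch.apicv_active);

    	if (pi_test_on(&vmx->pi_desc)) {
    		pi_clear_on(&vmx->pi_desc);
    		/* Writes pending vectors into the in-kernel LAPIC's IRR. */
    		max_irr = kvm_apic_update_irr(vcpu, vmx->pi_desc.pir);
    	} else {
    		max_irr = kvm_lapic_find_highest_irr(vcpu);
    	}

    	vmx_hwapic_irr_update(vcpu, max_irr);
    	return max_irr;
    }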
2017-12-07 13:33+0200, Liran Alon: > > > On 06/12/17 22:45, Radim Krčmář wrote: > > 2017-12-05 10:16+0200, Liran Alon: > > > When L1 wants to send a posted-interrupt to another L1 CPU running L2, > > > it sets the relevant bit in vmx->nested.pi_desc->pir and ON bit in > > > vmx->nested.pi_desc->control. Then it attempts to send a > > > notification-vector IPI to dest L1 CPU. > > > This attempt to send IPI will exit to L0 which will reach > > > vmx_deliver_nested_posted_interrupt() which does the following: > > > 1. If dest L0 CPU is currently running guest (vcpu->mode == > > > IN_GUEST_MODE), it sends a physical IPI of PI nested-notification-vector. > > > 2. It sets KVM_REQ_EVENT in dest vCPU. This is done such that if dest > > > L0 CPU exits from guest to host and therefore doesn't recognize physical > > > IPI (or it wasn't sent), then KVM_REQ_EVENT will be consumed on next > > > vmentry which will call vmx_check_nested_events() which should call > > > (in theory) vmx_complete_nested_posted_interrupt(). That function > > > should see vmx->nested.pi_desc->control ON bit is set and therefore > > > "emulate" posted-interrupt delivery for L1 (sync PIR to IRR in L1 > > > virtual-apic-page & update vmcs02 RVI). > > > > > > The above logic regarding nested-posted-interrupts contains multiple > > > issues: > > > > > > A) Race-condition: > > > On step (1) it is possible sender will see vcpu->mode == IN_GUEST_MODE > > > but before sending physical IPI, the dest CPU will exit to host. > > > Therefore, physical IPI could be received at host which it's hanlder > > > does nothing. In addition, assume that dest CPU passes the checks > > > for pending kvm requests before sender sets KVM_REQ_EVENT. Therefore, > > > dest CPU will resume L2 without evaluating nested-posted-interrupts. > > > > > > B) vmx_complete_nested_posted_interrupt() is not always called > > > when needed. Even if dest CPU consumed KVM_REQ_EVENT, there is a bug > > > that vmx_check_nested_events() could exit from L2 to L1 before calling > > > vmx_complete_nested_posted_interrupt(). Therefore, on next resume of > > > L1 into L2, nested-posted-interrupts won't be evaluated even though L0 > > > resume L2 (We may resume L2 without having KVM_REQ_EVENT set). > > > > > > This commit removes nested-posted-interrupts processing from > > > check_nested_events() and instead makes sure to process > > > nested-posted-interrupts on vmentry after interrupts > > > disabled. Processing of nested-posted-interrupts is delegated > > > to hardware by issueing a self-IPI of relevant notification-vector > > > which will be delivered to CPU when CPU is in guest. > > > > > > * Bug (A) is solved by the fact that processing of > > > nested-posted-interrupts is not dependent on KVM_REQ_EVENT and > > > happens before every vmentry to L2. > > > * Bug (B) is now trivially solved by processing > > > nested-posted-interrupts before each vmentry to L2 guest. > > > > > > An alternative could have been to just call > > > vmx_complete_nested_posted_interrupt() at this call-site. However, we > > > have decided to go with this approach because: > > > 1. It would require modifying vmx_complete_nested_posted_interrupt() > > > to be able to work with interrupts disabled (not-trivial). > > > 2. We preffer to avoid software-emulation of hardware behavior if it > > > is possible. > > > > Nice solution! Self-IPI was slower on non-nested, but even if it is > > slower here, the code saving is definitely worth it. 
> > > > > Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing") > > > > > > Signed-off-by: Liran Alon <liran.alon@oracle.com> > > > Co-authored-by: Nikita Leshenko <nikita.leshchenko@oracle.com> > > > Reviewed-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> > > > Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> > > > Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> > > > --- > > > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c > > > @@ -5156,6 +5114,17 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) > > > kvm_vcpu_kick(vcpu); > > > } > > > > > > +static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) > > > +{ > > > + struct vcpu_vmx *vmx = to_vmx(vcpu); > > > + > > > + WARN_ON(!vcpu->arch.apicv_active); > > > + > > > + if (is_guest_mode(vcpu) && vmx->nested.pi_desc && > > > + pi_test_on(vmx->nested.pi_desc)) > > > + kvm_vcpu_trigger_posted_interrupt(vcpu, true); > > > > We're always sending to self, so the function would be better with > > apic->send_IPI_self() as it might be accelerated by hardware. > > > > Agreed. Will change. > > > > +} > > > + > > > /* > > > * Set up the vmcs's constant host-state fields, i.e., host-state fields that > > > * will not change in the lifetime of the guest. > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c > > > @@ -6962,12 +6962,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) > > > smp_mb__after_srcu_read_unlock(); > > > > > > /* > > > - * This handles the case where a posted interrupt was > > > - * notified with kvm_vcpu_kick. > > > + * In case guest got the posted-interrupt notification > > > + * vector while running in host, we need to make sure > > > + * it arrives to guest. > > > + * For L1 posted-interrupts, we manually sync PIR to IRR. > > > + * For L2 posted-interrupts, we send notification-vector > > > + * again by self IPI such that it will now be received in guest. > > > */ > > > - if (kvm_lapic_enabled(vcpu)) { > > > - if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active) > > > + if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) { > > > + if (kvm_x86_ops->sync_pir_to_irr) > > > kvm_x86_ops->sync_pir_to_irr(vcpu); > > > + if (kvm_x86_ops->complete_nested_posted_interrupt) > > > + kvm_x86_ops->complete_nested_posted_interrupt(vcpu); > > > > I think that vcpu->arch.apicv_active is enough to guarantee presence of > > both sync_pir_to_irr and complete_nested_posted_interrupt, > > I am not sure you can remove the check for kvm_lapic_enabled(). > > apicv_active can be true even when lapic_in_kernel()==false. > apicv_active is only dependent on if physical CPU supports APICv related > features & enable_apicv module parameter is 1. > > However, sync_pir_to_irr() assumes that lapic_in_kernel()==true because > it reads & updates data in in-kernel LAPIC IRR register. So I think at least > it doesn't makes sense to invoke sync_pir_to_irr() in case > kvm_lapic_enabled()==false. I agree. > But I do agree that complete_nested_posted_interrupt() should be invoked if > just apicv_active==true. Because the LAPIC that L1 emulate for L2 can be > enabled and implemented in-kernel even if the LAPIC that L0 emulate for L1 > is not currently enabled. Therefore, L1 can still send posted-interrupts for > it's L2 guest in this case. I think that posted interrupts depend on enabled harware LAPIC (with nested the lapic L0 provides), so if L1 can't send or receive interrupts, the posted interrupts wouldn't work. 
We emulate posted interrupts "incorrectly" by looking at PIR and injecting even if there was no notification vector, but we don't really have to do anything if there can be no notification vector coming to L1. Keeping the 'kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active' for both is fine by me, I just didn't see a reason to check for presence of sync_pir_to_irr and complete_nested_posted_interrupt, because if they are NULL, then vcpu->arch.apicv_active must be false. > > > > and considering this is a hot path, I'd check is_guest_mode(vcpu) before > > calling the second function. > > > > (Saving the function call with a new x86_op for those two seems viable > > too, but would need deeper analysis and benchmarking as it is putting > > more code into hot cache ...) > > > > OK. Will change. Thanks.
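Putting these points together, the hot-path hunk in vcpu_enter_guest() being converged on might look roughly like the sketch below: keep the kvm_lapic_enabled() && apicv_active guard, gate the nested call on is_guest_mode(), and drop the NULL checks. This is only a sketch, not the final patch; as the follow-up below notes, dropping the NULL checks would require dummy callbacks on AMD/SVM.

    	if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) {
    		kvm_x86_ops->sync_pir_to_irr(vcpu);
    		if (is_guest_mode(vcpu))
    			kvm_x86_ops->complete_nested_posted_interrupt(vcpu);
    	}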
On 07/12/2017 17:26, Radim Krčmář wrote:
> Keeping the 'kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active' for
> both is fine by me, I just didn't see a reason to check for presence of
> sync_pir_to_irr and complete_nested_posted_interrupt, because if they
> are NULL, then vcpu->arch.apicv_active must be false.

Actually AMD doesn't have them, does it? But adding dummies would be fine for me.

Thanks,

Paolo
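A hypothetical sketch of what the AMD-side dummy could look like in arch/x86/kvm/svm.c if the NULL checks were dropped (names assumed; SVM has no nested posted-interrupt completion work to do here, and sync_pir_to_irr would need a similar stub):

    static void svm_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
    {
    	/* Nothing to do: nested posted interrupts are VMX-specific. */
    }

    	/* ... and in the svm kvm_x86_ops initializer: */
    	.complete_nested_posted_interrupt = svm_complete_nested_posted_interrupt,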
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 1bfb99770c34..dc0affe69903 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -980,6 +980,7 @@ struct kvm_x86_ops { void (*set_virtual_x2apic_mode)(struct kvm_vcpu *vcpu, bool set); void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu, hpa_t hpa); void (*deliver_posted_interrupt)(struct kvm_vcpu *vcpu, int vector); + void (*complete_nested_posted_interrupt)(struct kvm_vcpu *vcpu); int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu); int (*set_tss_addr)(struct kvm *kvm, unsigned int addr); int (*get_tdp_level)(struct kvm_vcpu *vcpu); diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index bbc023fb9ef1..517822f94158 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -534,12 +534,6 @@ static bool pi_test_and_set_on(struct pi_desc *pi_desc) (unsigned long *)&pi_desc->control); } -static bool pi_test_and_clear_on(struct pi_desc *pi_desc) -{ - return test_and_clear_bit(POSTED_INTR_ON, - (unsigned long *)&pi_desc->control); -} - static int pi_test_and_set_pir(int vector, struct pi_desc *pi_desc) { return test_and_set_bit(vector, (unsigned long *)pi_desc->pir); @@ -5041,37 +5035,6 @@ static void nested_mark_vmcs12_pages_dirty(struct kvm_vcpu *vcpu) } } - -static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) -{ - struct vcpu_vmx *vmx = to_vmx(vcpu); - int max_irr; - void *vapic_page; - u16 status; - - if (!vmx->nested.pi_desc) - return; - - if (!pi_test_and_clear_on(vmx->nested.pi_desc)) - return; - - max_irr = find_last_bit((unsigned long *)vmx->nested.pi_desc->pir, 256); - if (max_irr != 256) { - vapic_page = kmap(vmx->nested.virtual_apic_page); - __kvm_apic_update_irr(vmx->nested.pi_desc->pir, vapic_page); - kunmap(vmx->nested.virtual_apic_page); - - status = vmcs_read16(GUEST_INTR_STATUS); - if ((u8)max_irr > ((u8)status & 0xff)) { - status &= ~0xff; - status |= (u8)max_irr; - vmcs_write16(GUEST_INTR_STATUS, status); - } - } - - nested_mark_vmcs12_pages_dirty(vcpu); -} - static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu, bool nested) { @@ -5120,11 +5083,6 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu, vector == vmx->nested.posted_intr_nv) { /* the PIR and ON have been set by L1. */ kvm_vcpu_trigger_posted_interrupt(vcpu, true); - /* - * If a posted intr is not recognized by hardware, - * we will accomplish it in the next vmentry. - */ - kvm_make_request(KVM_REQ_EVENT, vcpu); return 0; } return -1; @@ -5156,6 +5114,17 @@ static void vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector) kvm_vcpu_kick(vcpu); } +static void vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu) +{ + struct vcpu_vmx *vmx = to_vmx(vcpu); + + WARN_ON(!vcpu->arch.apicv_active); + + if (is_guest_mode(vcpu) && vmx->nested.pi_desc && + pi_test_on(vmx->nested.pi_desc)) + kvm_vcpu_trigger_posted_interrupt(vcpu, true); +} + /* * Set up the vmcs's constant host-state fields, i.e., host-state fields that * will not change in the lifetime of the guest. 
@@ -6823,6 +6792,7 @@ static __init int hardware_setup(void) if (!cpu_has_vmx_apicv()) { enable_apicv = 0; kvm_x86_ops->sync_pir_to_irr = NULL; + kvm_x86_ops->complete_nested_posted_interrupt = NULL; } if (cpu_has_vmx_tsc_scaling()) { @@ -11144,7 +11114,6 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr) return 0; } - vmx_complete_nested_posted_interrupt(vcpu); return 0; } @@ -12136,6 +12105,8 @@ static int enable_smi_window(struct kvm_vcpu *vcpu) .hwapic_isr_update = vmx_hwapic_isr_update, .sync_pir_to_irr = vmx_sync_pir_to_irr, .deliver_posted_interrupt = vmx_deliver_posted_interrupt, + .complete_nested_posted_interrupt = + vmx_complete_nested_posted_interrupt, .set_tss_addr = vmx_set_tss_addr, .get_tdp_level = get_ept_level, diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 34c85aa2e2d1..6fb58ab9a85c 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6962,12 +6962,18 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu) smp_mb__after_srcu_read_unlock(); /* - * This handles the case where a posted interrupt was - * notified with kvm_vcpu_kick. + * In case guest got the posted-interrupt notification + * vector while running in host, we need to make sure + * it arrives to guest. + * For L1 posted-interrupts, we manually sync PIR to IRR. + * For L2 posted-interrupts, we send notification-vector + * again by self IPI such that it will now be received in guest. */ - if (kvm_lapic_enabled(vcpu)) { - if (kvm_x86_ops->sync_pir_to_irr && vcpu->arch.apicv_active) + if (kvm_lapic_enabled(vcpu) && vcpu->arch.apicv_active) { + if (kvm_x86_ops->sync_pir_to_irr) kvm_x86_ops->sync_pir_to_irr(vcpu); + if (kvm_x86_ops->complete_nested_posted_interrupt) + kvm_x86_ops->complete_nested_posted_interrupt(vcpu); } if (vcpu->mode == EXITING_GUEST_MODE || kvm_request_pending(vcpu)