[3/3] KVM: VMX: use preemption timer to force immediate VMExit

Message ID 20180827222112.6640-4-sean.j.christopherson@intel.com
State New
Series
  • KVM: VMX: handle preemption timer value of zero

Commit Message

Sean Christopherson Aug. 27, 2018, 10:21 p.m. UTC
A VMX preemption timer value of '0' is guaranteed to cause a VMExit
prior to the CPU executing any instructions in the guest.  Use the
preemption timer (if it's supported) to trigger immediate VMExit
in place of the current method of sending a self-IPI.  This ensures
that pending VMExit injection to L1 occurs prior to executing any
instructions in the guest (regardless of nesting level).

When deferring VMExit injection, KVM generates an immediate VMExit
from the (possibly nested) guest by sending itself an IPI.  Because
hardware interrupts are blocked prior to VMEnter and are unblocked
(in hardware) after VMEnter, this results in taking a VMExit(INTR)
before any guest instruction is executed.  But, as this approach
relies on the IPI being received before VMEnter executes, it only
works as intended when KVM is running as L0.  Because there are no
architectural guarantees regarding when IPIs are delivered, when
running nested the INTR may "arrive" long after L2 is running, e.g.
because L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.

For the most part, this unintended delay is not an issue since the
events being injected to L1 also do not have architectural guarantees
regarding their timing.  The notable exception is the VMX preemption
timer[1], which is architecturally guaranteed to cause a VMExit prior
to executing any instructions in the guest if the timer value is '0'
at VMEnter.  Specifically, the delay in injecting the VMExit causes
the preemption timer KVM unit test to fail when run in a nested guest.

Note: this approach is viable even on CPUs with a broken preemption
timer, as broken in this context only means the timer counts at the
wrong rate.  There are no known errata affecting a timer value of '0'.

[1] I/O SMIs also have guarantees on when they arrive, but I have
    no idea if/how those are emulated in KVM.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/vmx.c              | 22 ++++++++++++++++++----
 arch/x86/kvm/x86.c              |  5 ++++-
 3 files changed, 23 insertions(+), 5 deletions(-)

Comments

Liran Alon Aug. 27, 2018, 11:38 p.m. UTC | #1
> On 28 Aug 2018, at 1:21, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> [...]

This commit made me wonder why exactly L0 delivers the IPI long after L2 is running.

The IPI is sent by L1 writing to the LAPIC ICR, which will exit to L0 on APIC_WRITE
and eventually reach __apic_accept_irq(). Depending on whether APICv is active, that will either
set the vector in the L1 LAPIC IRR and set KVM_REQ_EVENT,
or set the vector in vmx->pi_desc->pir and set ON in vmx->pi_desc->control.

On the next vcpu_enter_guest(), one of the following will happen:
1. If APICv is disabled, KVM_REQ_EVENT will be consumed, but inject_pending_event() will notice that vmx_interrupt_allowed() returns false.
Thus, vcpu_enter_guest() will call enable_irq_window(), which will set VIRTUAL_INTR_PENDING in vmcs01.
2. If APICv is enabled, vcpu_enter_guest() will call vmx_sync_pir_to_irr(), which will consume the vector from the PIR into the L1 LAPIC IRR and update RVI accordingly.
In this case, the CPU will still keep the vector pending in RVI, as interrupts are disabled in L1.

Later, when L1 VMRESUMEs into L2, L0 will reach nested_vmx_run(), which will switch the active vmcs to vmcs02.
Therefore:
a. If APICv is disabled, KVM_REQ_EVENT is no longer set, and because vmcs01 is not active, the
CPU is not aware of the VIRTUAL_INTR_PENDING that was previously set. Thus, KVM will indeed enter
L2 without exiting to L1 on the pending IPI. *THIS SEEMS LIKE A BUG TO ME*, regardless of this commit.
b. If APICv is enabled, vmx_sync_pir_to_irr() will see that vmx->pi_desc->control does not have ON set and
that is_guest_mode()==true, and therefore vmx_hwapic_irr_update() does nothing as well.
Again, we reach a scenario in which L1 has a pending interrupt but it isn't evaluated.
*THIS SEEMS LIKE A BUG TO ME*, regardless of this commit.

Therefore, even though this commit is correct from an architectural perspective, it works around
some real nVMX event-injection bugs that should be solved.
What do you think? Did I miss something in my analysis?

-Liran
Liran Alon Aug. 27, 2018, 11:52 p.m. UTC | #2
> On 28 Aug 2018, at 2:38, Liran Alon <liran.alon@oracle.com> wrote:
> 
> 
> 
>> On 28 Aug 2018, at 1:21, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>> 
>> [...]
> 
> [...]

I will also add that I think the above bugs can be fixed with a simple patch:
We could just add a kvm_make_request(KVM_REQ_EVENT, vcpu); on successful run of nested_vmx_run().

This will cause vcpu_enter_guest() -> inject_pending_event() -> vmx_check_nested_events() to see that
there is a pending interrupt for L1 and return -EBUSY (because nested_run_pending==true), which will force an
immediate exit that sets KVM_REQ_EVENT again; on the next round of vcpu_enter_guest(),
vmx_check_nested_events() will VMExit from L2 to L1 on EXTERNAL_INTERRUPT as required.

If you agree, I will submit such a patch.

-Liran
Sean Christopherson Aug. 28, 2018, 2:46 p.m. UTC | #3
On Tue, 2018-08-28 at 02:38 +0300, Liran Alon wrote:
> 
> > On 28 Aug 2018, at 1:21, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> > 
> > [...]
> This commit made me wonder why exactly does L0 delivers the IPI long after L2 is running.
> 
> The IPI is sent by L1 writing to LAPIC ICR which will exit to L0 on APIC_WRITE
> which will eventually reach __apic_accept_irq() that will either, based on if APICv is active,
> set vector in L1 LAPIC IRR and set KVM_REQ_EVENT
> or set vector in vmx->pi_desc->pir and set vmx->pi_desc->control with ON.
> 
> On next vcpu_enter_guest() one of the following will happen:
> 1. If APICv is disabled, KVM_REQ_EVENT will be consumed but inject_pending_event() will notice vmx_interrupt_allowed() returns false.
> Thus, vcpu_enter_guest() will enable_irq_window() which will set VIRTUAL_INTR_PENDING in vmcs01.
> 2. vcpu_enter_guest() will call vmx_sync_pir_to_irr() which will consume vector from PIR to L1 LAPIC IRR and update RVI accordingly.
> In this case, CPU will still keep vector pending in RVI as interrupt is disabled on L1.
> 
> Later, when L1 will VMRESUME into L2, L0 will reach nested_vmx_run() which will switch active vmcs to vmcs02.
> Therefore:
> a. If APICv is disabled, KVM_REQ_EVENT is currently not set and as vmcs01 is not active,
> CPU is not aware of the VIRTUAL_INTR_PENDING that was previously set. Thus, KVM will indeed enter
> into L2 without exiting to L1 on the pending IPI. *THIS SEEMS LIKE A BUG TO ME*. Regardless of this commit.
> b. If APICv is enabled, vmx_sync_pir_to_irr() will see that vmx->pi_desc->control is not set with ON and
> is_guest_mode()==true and therefore vmx_hwapic_irr_update() does nothing as-well.
> Again, we reach a scenario that even though L1 has a pending interrupt, it isn’t evaluated.
> *THIS SEEMS LIKE A BUG TO ME*. Regardless of this commit.
> 
> Therefore, even though from architectural perspective, this commit is correct, it workaround
> some real nVMX event-injection bugs that should be solved.
> What do you think? Did I miss something in my analysis?

Hmm, IMO it's the other way around.  Relying on an IPI to trigger an
immediate VMExit is a bug, and setting KVM_REQ_EVENT is a workaround.
There are simply no guarantees regarding the delivery time of IPIs.

For example, setting KVM_REQ_EVENT at the end of nested_vmx_run()
allows the preemption timer unit test to pass in L1, but it continues
to fail in L2 (even if L0, L1 and L2 are all patched).  At the time of
VMEnter to L4 (from L0), L2 has done smp_send_reschedule() to self-IPI
so that it can inject a VMExit to L3.  But, L1 doesn't inject an event
to L2 because its nested_run_pending is set and so there is no pending
event visible to L0.  Setting KVM_REQ_EVENT in L1 doesn't resolve the
issue because L1 can't process that request until it gets control back,
which doesn't happen until L4 takes a VMExit.

On the other hand, using the preemption timer to trigger a VMExit
ensures L0 will see the pending event (for lack of a better term)
at the time of VMEnter to L4 because the event is visible in both
vmcs24 and vmcs14.  Using this approach, the unit test passes in
L1 and L2 if both are patched, even if L0 is unpatched (since the
current self-IPI mechanism works when running on bare metal).

So, setting KVM_REQ_EVENT won't cause problems, but it also doesn't
truly fix the issue.  I can tack on a patch to set KVM_REQ_EVENT in
nested_vmx_run() if the VMX preemption timer isn't supported.  That
would partially work around the issue on older hardware.
Liran Alon Aug. 29, 2018, 9:57 a.m. UTC | #4
> On 28 Aug 2018, at 17:46, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
> 
> On Tue, 2018-08-28 at 02:38 +0300, Liran Alon wrote:
>> 
>>> On 28 Aug 2018, at 1:21, Sean Christopherson <sean.j.christopherson@intel.com> wrote:
>>> 
>>> [...]
>> [...]
> 
> Hmm, IMO it's the other way around.  Relying on an IPI to trigger an
> immediate VMExit is a bug, and setting KVM_REQ_EVENT is a workaround.
> There are simply no guarantees regarding the delivery time of IPIs.
> 
> For example, setting KVM_REQ_EVENT at the end of nested_vmx_run()
> allows the preemption timer unit test to pass in L1, but it continues
> to fail in L2 (even if L0, L1 and L2 are all patched).  At the time of
> VMEnter to L4 (from L0), L2 has done smp_send_reschedule() to self-IPI
> so that it can inject a VMExit to L3.  But, L1 doesn't inject an event
> to L2 because its nested_run_pending is set and so there is no pending
> event visible to L0.  Setting KVM_REQ_EVENT in L1 doesn't resolve the
> issue because L1 can't process that request until it gets control back,
> which doesn't happen until L4 takes a VMExit.

I'm not sure I was able to follow your scenario.
If the vCPU is currently running L2, L1 by definition has nested_run_pending=false, as it is set to true
only when L2 enters L3 or L4 (via vmresume/vmlaunch).
When L2 triggers this self-IPI, it will exit to L0, which will reflect the exit to L1. At that point, even if
L1's nested_run_pending was set to true, it will now be cleared, and L1 will queue a pending interrupt for L2;
but because L2 is now running with interrupts disabled, L1 won't inject an interrupt and instead
will request an irq window on vmcs12.
When L2 later vmresumes back into L4 (after requesting an immediate exit via self-IPI), the vmresume will
exit to L0, which will reflect the exit to L1, which will set KVM_REQ_EVENT (as I suggested in the previous email).
Therefore, before L1 vmresumes with vmcs14, it will see that L2 has a pending interrupt but nested_run_pending=true,
and it will therefore send a self-IPI. This will cause L0 to queue a pending interrupt for L1 and request an irq window in vmcs01.
Then, when L1 vmresumes with vmcs14, L0 will set KVM_REQ_EVENT (as my patch suggests), which will make L0
request an immediate exit after the vmresume with vmcs04.
Therefore, before the CPU executes any instruction of L4, it will immediately exit to L0, which will exit
on EXTERNAL_INTERRUPT to L1, which will now in turn exit on EXTERNAL_INTERRUPT to L2 as required.

> 
> On the other handing, using the preemption timer to trigger VMExit
> ensures L0 will see the pending event (for lack of a better term)
> at the time of VMEnter to L4 because the event is visible in both
> vmcs24 and vmcs14.  Using this approach, the unit test passes in
> L1 and L2 if both are patched, even if L0 is unpatched (since the
> current self-IPI mechanism works when running on bare metal).
> 
> So, setting KVM_REQ_EVENT won't cause problems, but it also doesn't
> truly fix the issue.  I can tack on a patch to set KVM_REQ_EVENT in
> vmx_nested_run() if the VMX preemption timer isn't supported.  That
> would partially workaround the issue on older hardware.

I believe the patch I suggested, setting KVM_REQ_EVENT in nested_vmx_run(), fixes an issue that is
orthogonal to your patch: if there is a pending interrupt for L1 while L1 is running with
interrupts disabled and L1 executes vmlaunch/vmresume, L0 should re-evaluate whether it should exit from L2 to L1
because of the pending L1 interrupt. This is a general issue, and the L1 interrupt shouldn't be delayed until
the next point at which KVM happens to set KVM_REQ_EVENT.

If you agree with the above analysis, I will submit such a patch, unrelated to your patch series.

-Liran
Sean Christopherson Aug. 29, 2018, 6:45 p.m. UTC | #5
On Mon, Aug 27, 2018 at 03:21:12PM -0700, Sean Christopherson wrote:
> A VMX preemption timer value of '0' is guaranteed to cause a VMExit
> prior to the CPU executing any instructions in the guest.  Use the
> preemption timer (if it's supported) to trigger immediate VMExit
> in place of the current method of sending a self-IPI.  This ensures
> that pending VMExit injection to L1 occurs prior to executing any
> instructions in the guest (regardless of nesting level).
> 
> When deferring VMExit injection, KVM generates an immediate VMExit
> from the (possibly nested) guest by sending itself an IPI.  Because
> hardware interrupts are blocked prior to VMEnter and are unblocked
> (in hardware) after VMEnter, this results in taking a VMExit(INTR)
> before any guest instruction is executed.  But, as this approach
> relies on the IPI being received before VMEnter executes, it only
> works as intended when KVM is running as L0.  Because there are no
> architectural guarantees regarding when IPIs are delivered, when
> running nested the INTR may "arrive" long after L2 is running e.g.
> L0 KVM doesn't force an immediate switch to L1 to deliver an INTR.

Circling back to this patch now that we have the nested interrupt
handling sorted out, knock on wood...  I think the basic premise of
this patch is valid, but the above line should be stricken from the
commit message since KVM's behavior is a bug.

> [...]

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 00ddb0c9e612..6ae0d3f18ee3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1055,6 +1055,7 @@  struct kvm_x86_ops {
 	bool (*umip_emulated)(void);
 
 	int (*check_nested_events)(struct kvm_vcpu *vcpu, bool external_intr);
+	void (*request_immediate_exit)(struct kvm_vcpu *vcpu);
 
 	void (*sched_in)(struct kvm_vcpu *kvm, int cpu);
 
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 5ae46af2077d..2f497fbc62dc 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1026,6 +1026,7 @@  struct vcpu_vmx {
 	/* apic deadline value in host tsc */
 	u64 hv_deadline_tsc;
 	bool hv_timer_armed;
+	bool req_immediate_exit;
 
 	u64 current_tsc_ratio;
 
@@ -2865,6 +2866,8 @@  static void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	u16 fs_sel, gs_sel;
 	int i;
 
+	vmx->req_immediate_exit = false;
+
 	if (vmx->loaded_cpu_state)
 		return;
 
@@ -7967,6 +7970,9 @@  static __init int hardware_setup(void)
 		kvm_x86_ops->enable_log_dirty_pt_masked = NULL;
 	}
 
+	if (!cpu_has_vmx_preemption_timer())
+		kvm_x86_ops->request_immediate_exit = NULL;
+
 	if (cpu_has_vmx_preemption_timer() && enable_preemption_timer) {
 		u64 vmx_msr;
 
@@ -9209,7 +9215,8 @@  static int handle_pml_full(struct kvm_vcpu *vcpu)
 
 static int handle_preemption_timer(struct kvm_vcpu *vcpu)
 {
-	kvm_lapic_expired_hv_timer(vcpu);
+	if (!to_vmx(vcpu)->req_immediate_exit)
+		kvm_lapic_expired_hv_timer(vcpu);
 	return 1;
 }
 
@@ -10606,7 +10613,9 @@  static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
 	u64 tscl;
 	u32 delta_tsc;
 
-	if (vmx->hv_deadline_tsc != -1) {
+	if (vmx->req_immediate_exit) {
+		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);
+	} else if (vmx->hv_deadline_tsc != -1) {
 		tscl = rdtsc();
 		if (vmx->hv_deadline_tsc > tscl)
 			/* set_hv_timer ensures the delta fits in 32-bits */
@@ -10614,11 +10623,10 @@  static void vmx_arm_hv_timer(struct kvm_vcpu *vcpu)
 				cpu_preemption_timer_multi);
 		else
 			delta_tsc = 0;
-
 		vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, delta_tsc);
 	}
 
-	arm_timer = (vmx->hv_deadline_tsc != -1);
+	arm_timer = (vmx->hv_deadline_tsc != -1) || vmx->req_immediate_exit;
 	if (arm_timer && !vmx->hv_timer_armed)
 		vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
 			PIN_BASED_VMX_PREEMPTION_TIMER);
@@ -12856,6 +12864,11 @@  static int vmx_check_nested_events(struct kvm_vcpu *vcpu, bool external_intr)
 	return 0;
 }
 
+static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
+{
+	to_vmx(vcpu)->req_immediate_exit = true;
+}
+
 static u32 vmx_get_preemption_timer_value(struct kvm_vcpu *vcpu)
 {
 	ktime_t remaining =
@@ -14115,6 +14128,7 @@  static struct kvm_x86_ops vmx_x86_ops __ro_after_init = {
 	.umip_emulated = vmx_umip_emulated,
 
 	.check_nested_events = vmx_check_nested_events,
+	.request_immediate_exit = vmx_request_immediate_exit,
 
 	.sched_in = vmx_sched_in,
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 14ee9a814888..e2906ec45cb6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7544,7 +7544,10 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 	if (req_immediate_exit) {
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
-		smp_send_reschedule(vcpu->cpu);
+		if (kvm_x86_ops->request_immediate_exit)
+			kvm_x86_ops->request_immediate_exit(vcpu);
+		else
+			smp_send_reschedule(vcpu->cpu);
 	}
 
 	trace_kvm_entry(vcpu->vcpu_id);