KVM: nVMX: clear nested_run_pending when emulating invalid guest state

Message ID 20180305173947.3025-1-sean.j.christopherson@intel.com
State New

Commit Message

Sean Christopherson March 5, 2018, 5:39 p.m. UTC
Clear nested_run_pending in handle_invalid_guest_state() after calling
emulate_instruction(), i.e. after attempting to emulate at least one
instruction.  This fixes an issue where L0 enters an infinite loop if
L2 hits an exception that is intercepted by L1 while L0 is emulating
L2's invalid guest state, effectively causing DoS on L1, e.g. the only
way to break the loop is to kill Qemu in L0.

    1. call handle_invalid_guest_state() for L2
    2. emulate_instruction() pends an exception, e.g. #UD
    3. L1 intercepts the exception, i.e. nested_vmx_check_exception
       returns 1
    4. vmx_check_nested_events() returns -EBUSY because L1 wants to
       intercept the exception and nested_run_pending is true
    5. handle_invalid_guest_state() never makes forward progress for
       L2 due to the pending exception
    6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
       L1 vCPU trying VMLAUNCH effectively hangs

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
---
 arch/x86/kvm/vmx.c | 2 ++
 1 file changed, 2 insertions(+)
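The six-step livelock in the commit message can be sketched as a minimal, self-contained C model. Everything below is an illustrative stand-in (invented struct and helper names, not KVM's real types): injection of the nested VMEXIT(#UD) is refused with -EBUSY as long as nested_run_pending stays set, so the emulation loop never makes progress unless something clears the flag.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_EBUSY 16

/* Stand-in for the relevant vCPU state; not KVM's real structures. */
struct vcpu_model {
    bool nested_run_pending;
    bool exception_pending;  /* #UD pended by emulate_instruction() */
};

/* Models vmx_check_nested_events(): L1 intercepts the exception, but the
 * nested VMEXIT cannot be injected while a nested VM-entry is pending. */
static int check_nested_events(struct vcpu_model *v)
{
    if (v->exception_pending && v->nested_run_pending)
        return -MODEL_EBUSY;
    return 0;
}

/* Models handle_invalid_guest_state()'s emulation loop.  With apply_fix,
 * nested_run_pending is cleared after emulating one instruction, as the
 * patch proposes.  Returns the iteration at which the VMEXIT(#UD) is
 * injected into L1, or -1 if no forward progress is ever made. */
static int emulation_loop(struct vcpu_model *v, bool apply_fix, int max_iters)
{
    for (int i = 1; i <= max_iters; i++) {
        v->exception_pending = true;  /* emulate_instruction() pends #UD */
        if (apply_fix)
            v->nested_run_pending = false;
        if (check_nested_events(v) == 0)
            return i;  /* injection succeeds, loop exits */
    }
    return -1;  /* livelock */
}
```

Without the fix the model spins forever; with it, the very first iteration injects the VMEXIT to L1.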

Comments

Radim Krčmář March 9, 2018, 7:59 p.m. UTC | #1
2018-03-05 09:39-0800, Sean Christopherson:
> Clear nested_run_pending in handle_invalid_guest_state() after calling
> emulate_instruction(), i.e. after attempting to emulate at least one
> instruction.  This fixes an issue where L0 enters an infinite loop if
> L2 hits an exception that is intercepted by L1 while L0 is emulating
> L2's invalid guest state, effectively causing DoS on L1, e.g. the only
> way to break the loop is to kill Qemu in L0.
> 
>     1. call handle_invalid_guest_state() for L2
>     2. emulate_instruction() pends an exception, e.g. #UD
>     3. L1 intercepts the exception, i.e. nested_vmx_check_exception
>        returns 1
>     4. vmx_check_nested_events() returns -EBUSY because L1 wants to
>        intercept the exception and nested_run_pending is true
>     5. handle_invalid_guest_state() never makes forward progress for
>        L2 due to the pending exception
>     6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
>        L1 vCPU trying VMLAUNCH effectively hangs
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> ---

nested_run_pending signals that we have to execute VMRESUME in order to
do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).

If we don't let the hardware do it, we need to transfer the state from
L2's VMCS while doing a nested VM exit for the exception (= behave as if
we entered the guest and exited).

And I think the actual fix here is to evaluate the interrupt before the
first emulate_instruction() in handle_invalid_guest_state().

Do you want to look deeper into this?

Thanks.

> diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> index 591214843046..3073160e6bae 100644
> --- a/arch/x86/kvm/vmx.c
> +++ b/arch/x86/kvm/vmx.c
> @@ -6835,6 +6835,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
>  
>  		err = emulate_instruction(vcpu, 0);
>  
> +		vmx->nested.nested_run_pending = 0;
> +
>  		if (err == EMULATE_USER_EXIT) {
>  			++vcpu->stat.mmio_exits;
>  			ret = 0;
> -- 
> 2.16.2
>
Sean Christopherson March 9, 2018, 11:06 p.m. UTC | #2
On Friday, 2018-03-09, Radim Krčmář wrote:
> 2018-03-05 09:39-0800, Sean Christopherson:
> > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > emulate_instruction(), i.e. after attempting to emulate at least one
> > instruction.  This fixes an issue where L0 enters an infinite loop if
> > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > L2's invalid guest state, effectively causing DoS on L1, e.g. the only
> > way to break the loop is to kill Qemu in L0.
> > 
> >     1. call handle_invalid_guest_state() for L2
> >     2. emulate_instruction() pends an exception, e.g. #UD
> >     3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> >        returns 1
> >     4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> >        intercept the exception and nested_run_pending is true
> >     5. handle_invalid_guest_state() never makes forward progress for
> >        L2 due to the pending exception
> >     6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> >        L1 vCPU trying VMLAUNCH effectively hangs
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > ---
> 
> nested_run_pending signals that we have to execute VMRESUME in order to
> do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> 
> If we don't let the hardware do it, we need to transfer the state from
> L2's VMCS while doing a nested VM exit for the exception (= behave as if
> we entered the guest and exited).

Right, but by virtue of being in handle_invalid_guest_state() we're
already effectively post-VMRESUME.  handle_invalid_guest_state() is
technically called in the VMEXIT flow, but the initial check against
emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
due to invalid state is distinctly different from emulating due to a
specific VMEXIT.  In the latter case, we're emulating an instruction
that has already been executed/attempted in the guest, whereas in the
invalid guest case we're emulating instructions that haven't been seen
before.  So in that sense, clearing nested_run_pending when emulating
invalid guest state is analogous to clearing nested_run_pending after
VMENTER in vmx_vcpu_run().

> And I think the actual fix here is to evaluate the interrupt before the
> first emulate_instruction() in handle_invalid_guest_state().

The issue isn't that we don't evaluate an interrupt/exception, it's that
nested_run_pending is never cleared and so inject_pending_event() always
returns -EBUSY without injecting the VMEXIT(#UD) to L1.

And in the failing case, there is no interrupt to evaluate before the
first emulate_instruction(), the #UD is pended by emulate_instruction().
Theoretically, any number of exceptions that L1 wants to intercept could
be detected while emulating L2.

All that being said, I only encountered this bug due to the vmcs02 sync
bug (https://patchwork.kernel.org/patch/10259389/), e.g. L0 incorrectly
thought L1 was attempting to run L2 with invalid guest state in vmcs12.
Thinking about it more, L0 can only reach handle_invalid_guest_state()
in the context of L2 if L0 has a bug, or if L1 attempts to run L2 with
invalid state, e.g. L1 knowingly ignores the fact that unrestricted guest
is disabled or has a bug of its own.

So, I think the correct fix would be to return 1 from prepare_vmcs02
if emulation_required is true, i.e. signal VMEntry failed due to
EXIT_REASON_INVALID_STATE, and add a BUG either in vmx_vcpu_run() or
handle_invalid_guest_state() that fires if we're emulating L2, i.e.
BUG_ON(vmx->emulation_required && vmx->nested.nested_run_pending);

> Do you want to look deeper into this?
> 
> Thanks.
> 
> > diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
> > index 591214843046..3073160e6bae 100644
> > --- a/arch/x86/kvm/vmx.c
> > +++ b/arch/x86/kvm/vmx.c
> > @@ -6835,6 +6835,8 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
> >  
> >  		err = emulate_instruction(vcpu, 0);
> >  
> > +		vmx->nested.nested_run_pending = 0;
> > +
> >  		if (err == EMULATE_USER_EXIT) {
> >  			++vcpu->stat.mmio_exits;
> >  			ret = 0;
> > -- 
> > 2.16.2
> > 
>
Paolo Bonzini March 12, 2018, 8:42 a.m. UTC | #3
On 10/03/2018 00:06, Christopherson, Sean J wrote:
> All that being said, I only encountered this bug due to the vmcs02 sync
> bug (https://patchwork.kernel.org/patch/10259389/), e.g. L0 incorrectly
> thought L1 was attempting to run L2 with invalid guest state in vmcs12.
> Thinking about it more, L0 can only reach handle_invalid_guest_state()
> in the context of L2 if L0 has a bug, or if L1 attempts to run L2 with
> invalid state, e.g. L1 knowingly ignores the fact that unrestricted guest
> is disabled or has a bug of its own.
> 
> So, I think the correct fix would be to return 1 from prepare_vmcs02
> if emulation_required is true, i.e. signal VMEntry failed due to
> EXIT_REASON_INVALID_STATE, and add a BUG either in vmx_vcpu_run() or
> handle_invalid_guest_state() that fires if we're emulating L2, i.e.
> BUG_ON(vmx->emulation_required && vmx->nested.nested_run_pending);

I agree (though it should be a WARN_ON_ONCE, not a BUG_ON).

Paolo
Radim Krčmář March 13, 2018, 6:42 p.m. UTC | #4
2018-03-09 23:06+0000, Christopherson, Sean J:
> On Friday, 2018-03-09, Radim Krčmář wrote:
> > 2018-03-05 09:39-0800, Sean Christopherson:
> > > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > > emulate_instruction(), i.e. after attempting to emulate at least one
> > > instruction.  This fixes an issue where L0 enters an infinite loop if
> > > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > > L2's invalid guest state, effectively causing DoS on L1, e.g. the only
> > > way to break the loop is to kill Qemu in L0.
> > > 
> > >     1. call handle_invalid_guest_state() for L2
> > >     2. emulate_instruction() pends an exception, e.g. #UD
> > >     3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > >        returns 1
> > >     4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > >        intercept the exception and nested_run_pending is true
> > >     5. handle_invalid_guest_state() never makes forward progress for
> > >        L2 due to the pending exception
> > >     6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> > >        L1 vCPU trying VMLAUNCH effectively hangs
> > > 
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > ---
> > 
> > nested_run_pending signals that we have to execute VMRESUME in order to
> > do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > 
> > If we don't let the hardware do it, we need to transfer the state from
> > L2's VMCS while doing a nested VM exit for the exception (= behave as if
> > we entered the guest and exited).
> 
> Right, but by virtue of being in handle_invalid_guest_state() we're
> already effectively post-VMRESUME.  handle_invalid_guest_state() is
> technically called in the VMEXIT flow, but the initial check against
> emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> due to invalid state is distinctly different from emulating due to a
> specific VMEXIT.  In the latter case, we're emulating an instruction
> that has already been executed/attempted in the guest, whereas in the
> invalid guest case we're emulating instructions that haven't been seen
> before.

On a higher level of abstraction, KVM is emulating CPU just the same.
Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot accelerate
the emulation of the VMLAUNCH instruction nor instructions that come
after VMLAUNCH (L2) due to missing hardware features.

The purpose of nested_run_pending is to finish the interrupt injection
of VMLAUNCH, which should be done before any L2 instruction is executed.
Normal flow uses (L0's) VMRESUME acceleration for that, but that is not
available in our case.
(Maybe it's possible to accelerate by manipulating VMCS into a valid
 state and forcing an immediate VM exit.)

>          So in that sense, clearing nested_run_pending when emulating
> invalid guest state is analogous to clearing nested_run_pending after
> VMENTER in vmx_vcpu_run().

Only if we actually do what VMLAUNCH does.  We should finish VMLAUNCH by
emulating the injection before emulating invalid L2 state.

> > And I think the actual fix here is to evaluate the interrupt before the
> > first emulate_instruction() in handle_invalid_guest_state().
> 
> The issue isn't that we don't evaluate an interrupt/exception, it's that
> nested_run_pending is never cleared and so inject_pending_event() always
> returns -EBUSY without injecting the VMEXIT(#UD) to L1.

nested_run_pending is never cleared because we accelerate injection of
the interrupt in hardware and therefore need a (L0) VMRESUME.

I was proposing to emulate the whole VMLAUNCH before emulating L2 guest
state, which would prevent the -EBUSY as a side effect.

> And in the failing case, there is no interrupt to evaluate before the
> first emulate_instruction(), the #UD is pended by emulate_instruction().
> Theoretically, any number of exceptions that L1 wants to intercept could
> be detected while emulating L2.

A simple fix for this particular case would be to skip setting
nested_run_pending, because we don't defer injection to (L0) VMRESUME.

> All that being said, I only encountered this bug due to the vmcs02 sync
> bug (https://patchwork.kernel.org/patch/10259389/), e.g. L0 incorrectly
> thought L1 was attempting to run L2 with invalid guest state in vmcs12.
> Thinking about it more, L0 can only reach handle_invalid_guest_state()
> in the context of L2 if L0 has a bug, or if L1 attempts to run L2 with
> invalid state, e.g. L1 knowingly ignores the fact that unrestricted guest
> is disabled or has a bug of its own.

Right, it shouldn't be happening on recent hardware.

> So, I think the correct fix would be to return 1 from prepare_vmcs02
> if emulation_required is true, i.e. signal VMEntry failed due to
> EXIT_REASON_INVALID_STATE, and add a BUG either in vmx_vcpu_run() or
> handle_invalid_guest_state() that fires if we're emulating L2, i.e.
> BUG_ON(vmx->emulation_required && vmx->nested.nested_run_pending);

Sounds good (with a WARN_ONCE like Paolo said, because a guest can
configure it and KVM should support it),

thanks.
Sean Christopherson March 13, 2018, 8:22 p.m. UTC | #5
On Tue, 2018-03-13 at 19:42 +0100, Radim Krčmář wrote:
> 2018-03-09 23:06+0000, Christopherson, Sean J:
> > 
> > On Friday, 2018-03-09, Radim Krčmář wrote:
> > > 
> > > 2018-03-05 09:39-0800, Sean Christopherson:
> > > > 
> > > > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > > > emulate_instruction(), i.e. after attempting to emulate at least one
> > > > instruction.  This fixes an issue where L0 enters an infinite loop if
> > > > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > > > L2's invalid guest state, effectively causing DoS on L1, e.g. the only
> > > > way to break the loop is to kill Qemu in L0.
> > > > 
> > > >     1. call handle_invalid_guest_state() for L2
> > > >     2. emulate_instruction() pends an exception, e.g. #UD
> > > >     3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > > >        returns 1
> > > >     4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > > >        intercept the exception and nested_run_pending is true
> > > >     5. handle_invalid_guest_state() never makes forward progress for
> > > >        L2 due to the pending exception
> > > >     6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> > > >        L1 vCPU trying VMLAUNCH effectively hangs
> > > > 
> > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > ---
> > > nested_run_pending signals that we have to execute VMRESUME in order to
> > > do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > > 
> > > If we don't let the hardware do it, we need to transfer the state from
> > > L2's VMCS while doing a nested VM exit for the exception (= behave as if
> > > we entered the guest and exited).
> > Right, but by virtue of being in handle_invalid_guest_state() we're
> > already effectively post-VMRESUME.  handle_invalid_guest_state() is
> > technically called in the VMEXIT flow, but the initial check against
> > emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> > due to invalid state is distinctly different from emulating due to a
> > specific VMEXIT.  In the latter case, we're emulating an instruction
> > that has already been executed/attempted in the guest, whereas in the
> > invalid guest case we're emulating instructions that haven't been seen
> > before.
> On a higher level of abstraction, KVM is emulating CPU just the same.
> Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot accelerate
> the emulation of the VMLAUNCH instruction nor instructions that come
> after VMLAUNCH (L2) due to missing hardware features.
> 
> The purpose of nested_run_pending is to finish the interrupt injection
> of VMLAUNCH, which should be done before any L2 instruction is executed.
> Normal flow uses (L0's) VMRESUME acceleration for that, but that is not
> available in our case.
> (Maybe it's possible to accelerate by manipulating VMCS into a valid
>  state and forcing an immediate VM exit.)

There is nothing to inject at the time of VMLAUNCH.  L1's VMLAUNCH has
completed and we're running/emulating L2 with nothing pending when a
#UD is encountered in L2.  In the unrestricted guest case, where we
accelerated L1's VMLAUNCH, the vcpu_run postamble in the accelerated
VMLAUNCH->VMEXIT path clears nested_run_pending after the VMEXIT(#UD).
In the restricted guest case, we bailed out of vcpu_run very early
without clearing nested_run_pending.

Here's a comparison of how a #UD in L2 with invalid guest state would
be handled with unrestricted guest enabled vs. disabled, and why I
thought it made sense to clear nested_run_pending in
handle_invalid_guest_state().


Unrestricted Guest

H/W  Emu    Action              Result
--------------------------------------
L0   *      vcpu_run vmcs01     switch to L1
L1   *      vcpu_run vmcs12     exit to L0
L0   *      nested_vmx_run      vmx->nested.nested_run_pending = 1
L0   *      vcpu_run vmcs02     switch to L2
L2   *      L2 does stuff...    
L2   *      #UD                 exit to L0
L0   *      vcpu_run postamble  vmx->nested.nested_run_pending = 0
L0   *      inject VMEXIT(#UD)  succeeds
L0   *      vcpu_run vmcs01     switch to L1


Restricted Guest

H/W  Emu    Action              Result
--------------------------------------
L0   *      vcpu_run vmcs01     switch to L1
L1   *      vcpu_run vmcs12     exit to L0
L0   *      nested_vmx_run      vmx->nested.nested_run_pending = 1
L0   *      vcpu_run vmcs02     bails due to emulation_required
L0   L2     emulate_instruction emulate L2, i.e. post-VMLAUNCH vmcs12
L0   L2     L2 does stuff...    
L0   L2     #UD                 exit to L0 (emulated)

<this is where I was suggesting we clear nested_run_pending>

L0   *      inject VMEXIT(#UD)  fails, nested_run_pending is set
L0   L2     infinite loop...

> > 
> >          So in that sense, clearing nested_run_pending when emulating
> > invalid guest state is analogous to clearing nested_run_pending after
> > VMENTER in vmx_vcpu_run().
> Only if we actually do what VMLAUNCH does.  We should finish VMLAUNCH by
> emulating the injection before emulating invalid L2 state.
> 
> > 
> > > 
> > > And I think the actual fix here is to evaluate the interrupt before the
> > > first emulate_instruction() in handle_invalid_guest_state().
> > The issue isn't that we don't evaluate an interrupt/exception, it's that
> > nested_run_pending is never cleared and so inject_pending_event() always
> > returns -EBUSY without injecting the VMEXIT(#UD) to L1.
> nested_run_pending is never cleared because we accelerate injection of
> the interrupt in hardware and therefore need a (L0) VMRESUME.
> 
> I was proposing to emulate the whole VMLAUNCH before emulating L2 guest
> state, which would prevent the -EBUSY as a side effect.
> 
> > 
> > And in the failing case, there is no interrupt to evaluate before the
> > first emulate_instruction(), the #UD is pended by emulate_instruction().
> > Theoretically, any number of exceptions that L1 wants to intercept could
> > be detected while emulating L2.
> A simple fix for this particular case would be to skip setting
> nested_run_pending, because we don't defer injection to (L0) VMRESUME.
> 
> > 
> > All that being said, I only encountered this bug due to the vmcs02 sync
> > bug (https://patchwork.kernel.org/patch/10259389/), e.g. L0 incorrectly
> > thought L1 was attempting to run L2 with invalid guest state in vmcs12.
> > Thinking about it more, L0 can only reach handle_invalid_guest_state()
> > in the context of L2 if L0 has a bug, or if L1 attempts to run L2 with
> > invalid state, e.g. L1 knowingly ignores the fact that unrestricted guest
> > is disabled or has a bug of its own.
> Right, it shouldn't be happening on recent hardware.
> 
> > 
> > So, I think the correct fix would be to return 1 from prepare_vmcs02
> > if emulation_required is true, i.e. signal VMEntry failed due to
> > EXIT_REASON_INVALID_STATE, and add a BUG either in vmx_vcpu_run() or
> > handle_invalid_guest_state() that fires if we're emulating L2, i.e.
> > BUG_ON(vmx->emulation_required && vmx->nested.nested_run_pending);
> Sounds good (with a WARN_ONCE like Paolo said, because a guest can
> configure it and KVM should support it),
> 
> thanks.
Radim Krčmář March 13, 2018, 9:59 p.m. UTC | #6
2018-03-13 13:22-0700, Sean Christopherson:
> On Tue, 2018-03-13 at 19:42 +0100, Radim Krčmář wrote:
> > 2018-03-09 23:06+0000, Christopherson, Sean J:
> > > 
> > > On Friday, 2018-03-09, Radim Krčmář wrote:
> > > > 
> > > > 2018-03-05 09:39-0800, Sean Christopherson:
> > > > > 
> > > > > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > > > > emulate_instruction(), i.e. after attempting to emulate at least one
> > > > > instruction.  This fixes an issue where L0 enters an infinite loop if
> > > > > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > > > > L2's invalid guest state, effectively causing DoS on L1, e.g. the only
> > > > > way to break the loop is to kill Qemu in L0.
> > > > > 
> > > > >     1. call handle_invalid_guest_state() for L2
> > > > >     2. emulate_instruction() pends an exception, e.g. #UD
> > > > >     3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > > > >        returns 1
> > > > >     4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > > > >        intercept the exception and nested_run_pending is true
> > > > >     5. handle_invalid_guest_state() never makes forward progress for
> > > > >        L2 due to the pending exception
> > > > >     6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> > > > >        L1 vCPU trying VMLAUNCH effectively hangs
> > > > > 
> > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > > ---
> > > > nested_run_pending signals that we have to execute VMRESUME in order to
> > > > do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > > > 
> > > > If we don't let the hardware do it, we need to transfer the state from
> > > > L2's VMCS while doing a nested VM exit for the exception (= behave as if
> > > > we entered the guest and exited).
> > > Right, but by virtue of being in handle_invalid_guest_state() we're
> > > already effectively post-VMRESUME.  handle_invalid_guest_state() is
> > > technically called in the VMEXIT flow, but the initial check against
> > > emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> > > due to invalid state is distinctly different from emulating due to a
> > > specific VMEXIT.  In the latter case, we're emulating an instruction
> > > that has already been executed/attempted in the guest, whereas in the
> > > invalid guest case we're emulating instructions that haven't been seen
> > > before.
> > On a higher level of abstraction, KVM is emulating CPU just the same.
> > Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot accelerate
> > the emulation of the VMLAUNCH instruction nor instructions that come
> > after VMLAUNCH (L2) due to missing hardware features.
> > 
> > The purpose of nested_run_pending is to finish the interrupt injection
> > of VMLAUNCH, which should be done before any L2 instruction is executed.
> > Normal flow uses (L0's) VMRESUME acceleration for that, but that is not
> > available in our case.
> > (Maybe it's possible to accelerate by manipulating VMCS into a valid
> >  state and forcing an immediate VM exit.)
> 
> There is nothing to inject at the time of VMLAUNCH.  L1's VMLAUNCH has
> completed

This seems to be the core of disagreement.  The emulated VMLAUNCH
depends on VMX to inject whatever is in VM_ENTRY_INTR_INFO_FIELD and if
we never do a VMRESUME in L0, then the VMLAUNCH is incomplete.

Maybe I'm missing the point of nested_run_pending -- why do we have it
at all if not to signal that there is a pending injection from vmcs12?

>               we're running/emulating L2 with nothing pending when a
> #UD is encountered in L2.  In the unrestricted guest case, where we
> accelerated L1's VMLAUNCH, the vcpu_run postamble in the accelerated
> VMLAUNCH->VMEXIT path clears nested_run_pending after the VMEXIT(#UD).
> In the restricted guest case, we bailed out of vcpu_run very early
> without clearing nested_run_pending.

Yes, we do not need to inject in this case (which we could detect and
just never set nested_run_pending in the first place), but we need to in
the case where L1 sets pending interrupt in VMCS.

> > Here's a comparison of how a #UD in L2 with invalid guest state would
> be handled with unrestricted guest enabled vs. disabled, and why I
> thought it made sense to clear nested_run_pending in
> handle_invalid_guest_state().
> 
> 
> Unrestricted Guest
> 
> H/W  Emu    Action              Result
> --------------------------------------
> L0   *      vcpu_run vmcs01     switch to L1
> L1   *      vcpu_run vmcs12     exit to L0
> L0   *      nested_vmx_run      vmx->nested.nested_run_pending = 1
> L0   *      vcpu_run vmcs02     switch to L2

The hardware injected an interrupt from vmcs12.

> L2   *      L2 does stuff...    
> L2   *      #UD                 exit to L0
> L0   *      vcpu_run postamble  vmx->nested.nested_run_pending = 0
> L0   *      inject VMEXIT(#UD)  succeeds
> L0   *      vcpu_run vmcs01     switch to L1
> 
> 
> Restricted Guest
> 
> H/W  Emu    Action              Result
> --------------------------------------
> L0   *      vcpu_run vmcs01     switch to L1
> L1   *      vcpu_run vmcs12     exit to L0
> L0   *      nested_vmx_run      vmx->nested.nested_run_pending = 1
> L0   *      vcpu_run vmcs02     bails due to emulation_required

<<Here>> is no code to inject an interrupt from vmcs12.

> L0   L2     emulate_instruction emulate L2, i.e. post-VMLAUNCH vmcs12
> L0   L2     L2 does stuff...    
> L0   L2     #UD                 exit to L0 (emulated)
> 
> <this is where I was suggesting we clear nested_run_pending>

But why not earlier, e.g. before emulating instructions in the invalid
state?  I don't understand what nested_run_pending is supposed to do in
this case.  I think the right place to clear nested_run_pending is at
<<Here>>, where we also do what hardware with unrestricted guest does.

Thanks.

> L0   *      inject VMEXIT(#UD)  fails, nested_run_pending is set
> L0   L2     infinite loop...
Sean Christopherson March 14, 2018, 3:22 p.m. UTC | #7
On Tue, 2018-03-13 at 22:59 +0100, Radim Krčmář wrote:
> 2018-03-13 13:22-0700, Sean Christopherson:
> > 
> > On Tue, 2018-03-13 at 19:42 +0100, Radim Krčmář wrote:
> > > 
> > > 2018-03-09 23:06+0000, Christopherson, Sean J:
> > > > 
> > > > On Friday, 2018-03-09, Radim Krčmář wrote:
> > > > > 
> > > > > 2018-03-05 09:39-0800, Sean Christopherson:
> > > > > > 
> > > > > > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > > > > > emulate_instruction(), i.e. after attempting to emulate at least one
> > > > > > instruction.  This fixes an issue where L0 enters an infinite loop if
> > > > > > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > > > > > L2's invalid guest state, effectively causing DoS on L1, e.g. the only
> > > > > > way to break the loop is to kill Qemu in L0.
> > > > > > 
> > > > > >     1. call handle_invalid_guest_state() for L2
> > > > > >     2. emulate_instruction() pends an exception, e.g. #UD
> > > > > >     3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > > > > >        returns 1
> > > > > >     4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > > > > >        intercept the exception and nested_run_pending is true
> > > > > >     5. handle_invalid_guest_state() never makes forward progress for
> > > > > >        L2 due to the pending exception
> > > > > >     6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> > > > > >        L1 vCPU trying VMLAUNCH effectively hangs
> > > > > > 
> > > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > > > ---
> > > > > nested_run_pending signals that we have to execute VMRESUME in order to
> > > > > do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > > > > 
> > > > > If we don't let the hardware do it, we need to transfer the state from
> > > > > L2's VMCS while doing a nested VM exit for the exception (= behave as if
> > > > > we entered the guest and exited).
> > > > Right, but by virtue of being in handle_invalid_guest_state() we're
> > > > already effectively post-VMRESUME.  handle_invalid_guest_state() is
> > > > technically called in the VMEXIT flow, but the initial check against
> > > > emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> > > > due to invalid state is distinctly different from emulating due to a
> > > > specific VMEXIT.  In the latter case, we're emulating an instruction
> > > > that has already been executed/attempted in the guest, whereas in the
> > > > invalid guest case we're emulating instructions that haven't been seen
> > > > before.
> > > On a higher level of abstraction, KVM is emulating CPU just the same.
> > > Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot accelerate
> > > the emulation of the VMLAUNCH instruction nor instructions that come
> > > after VMLAUNCH (L2) due to missing hardware features.
> > > 
> > > The purpose of nested_run_pending is to finish the interrupt injection
> > > of VMLAUNCH, which should be done before any L2 instruction is executed.
> > > Normal flow uses (L0's) VMRESUME acceleration for that, but that is not
> > > available in our case.
> > > (Maybe it's possible to accelerate by manipulating VMCS into a valid
> > >  state and forcing an immediate VM exit.)
> > There is nothing to inject at the time of VMLAUNCH.  L1's VMLAUNCH has
> > completed
> This seems to be the core of disagreement.  The emulated VMLAUNCH
> depends on VMX to inject whatever is in VM_ENTRY_INTR_INFO_FIELD and if
> we never do a VMRESUME in L0, then the VMLAUNCH is incomplete.

Ah, I see where you're coming from.  Digging a little deeper, I think
the underlying issue is that handle_invalid_guest_state() doesn't
properly handle exceptions that are pended by emulate_instruction().
Regardless of whether we're emulating L1 or L2, we'll enter an infinite
loop if emulation_required is true and an exception is pending.  As
you said, we rely on VMRESUME to inject the exception, but we'll never
execute VMRESUME if emulation_required is true (and it would fail due
to invalid guest state if we tried to force the injection).  I verified
this by generating a #GP while emulating Big RM for L1; KVM soft hangs
and never returns to userspace.

I think the fix would be to mirror what handle_emulation_failure()
does if it gets a #UD at CPL0 and return KVM_INTERNAL_ERROR_EMULATION
to userspace if an exception is pending and emulation_required is true.
Note that a #UD in L1 will already do this because of the aforementioned
handle_emulation_failure().
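As a rough sketch of that proposal (simplified stand-in types, not the actual KVM code; the two constants match their uapi/linux/kvm.h values), the check amounts to:

```c
#include <assert.h>
#include <stdbool.h>

#define KVM_EXIT_INTERNAL_ERROR      17
#define KVM_INTERNAL_ERROR_EMULATION  1

/* Stand-in for the relevant vCPU state; not KVM's real structures. */
struct vcpu_model {
    bool emulation_required;
    bool exception_pending;
    int exit_reason;
    int suberror;
};

/* If an exception is pending while guest state still requires
 * emulation, bail out to userspace instead of looping, mirroring what
 * handle_emulation_failure() does for a #UD at CPL0.
 * Returns 0 to exit to userspace, 1 to continue emulating. */
static int check_emulation_progress(struct vcpu_model *v)
{
    if (v->exception_pending && v->emulation_required) {
        v->exit_reason = KVM_EXIT_INTERNAL_ERROR;
        v->suberror = KVM_INTERNAL_ERROR_EMULATION;
        return 0;
    }
    return 1;
}
```

A check of this shape, placed in the emulation loop, would turn the soft hang into a visible KVM_EXIT_INTERNAL_ERROR in userspace.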

> Maybe I'm missing the point of nested_run_pending -- why do we have it
> at all if not to signal that there is a pending injection from vmcs12?
My understanding is that nested_run_pending is quite literal: there is
a nested VM that needs to be run, do that next.  I don't think it's
directly related to a pending injection, but rather is queried to know
what it is or isn't legal, e.g. in this case injecting a VMEXIT into L1
isn't legal because we need to run L2 next.

> > 
> >               we're running/emulating L2 with nothing pending when a
> > #UD is encountered in L2.  In the unrestricted guest case, where we
> > accelerated L1's VMLAUNCH, the vcpu_run postamble in the accelerated
> > VMLAUNCH->VMEXIT path clears nested_run_pending after the VMEXIT(#UD).
> > In the restricted guest case, we bailed out of vcpu_run very early
> > without clearing nested_run_pending.
> Yes, we do not need to inject in this case (which we could detect and
> just never set nested_run_pending in the first place), but we need to in
> the case where L1 sets pending interrupt in VMCS.

> 
> > 
> > Here's a comparison of how a #UD in L2 with invalid guest state would
> > be handled with unrestricted guest enabled vs. disabled, and why I
> > thought it made sense to clear nested_run_pending in
> > handle_invalid_guest_state().
> > 
> > Unrestricted Guest
> > 
> > H/W  Emu    Action              Result
> > --------------------------------------
> > L0   *      vcpu_run vmcs01     switch to L1
> > L1   *      vcpu_run vmcs12     exit to L0
> > L0   *      nested_vmx_run      vmx->nested.nested_run_pending = 1
> > L0   *      vcpu_run vmcs02     switch to L2
> The hardware injected an interrupt from vmcs12.
> 

> > 
> > L2   *      L2 does stuff...
> > L2   *      #UD                 exit to L0
> > L0   *      vcpu_run postamble  vmx->nested.nested_run_pending = 0
> > L0   *      inject VMEXIT(#UD)  succeeds
> > L0   *      vcpu_run vmcs01     switch to L1
> > 
> > Restricted Guest
> > 
> > H/W  Emu    Action              Result
> > --------------------------------------
> > L0   *      vcpu_run vmcs01     switch to L1
> > L1   *      vcpu_run vmcs12     exit to L0
> > L0   *      nested_vmx_run      vmx->nested.nested_run_pending = 1
> > L0   *      vcpu_run vmcs02     bails due to emulation_required
> At <<Here>> there is no code to inject an interrupt from vmcs12.
> 

> > 
> > L0   L2     emulate_instruction emulate L2, i.e. post-VMLAUNCH vmcs12
> > L0   L2     L2 does stuff...
> > L0   L2     #UD                 exit to L0 (emulated)
> > 
> > <this is where I was suggesting we clear nested_run_pending>
> But why not earlier, e.g. before emulating instructions in the invalid
> state?  I don't understand what nested_run_pending is supposed to do in
> this case.  I think the right place to clear nested_run_pending is at
> <<Here>>, where we also do what hardware with unrestricted guest does.


In an alternate universe where we weren't fixing the underlying bugs,
I would have no qualms about clearing nested_run_pending earlier.  I put
it after emulate_instruction because I was trying to be conservative,
i.e. I wanted to put it in a location where we could say without a doubt
that we "ran" L2.

> Thanks.
> 
> > 
> > L0   *      inject VMEXIT(#UD)  fails, nested_run_pending is set
> > L0   L2     infinite loop...
Sean Christopherson March 15, 2018, 9:38 p.m. UTC | #8
On Wed, 2018-03-14 at 08:22 -0700, Christopherson, Sean J wrote:
> On Tue, 2018-03-13 at 22:59 +0100, Radim Krcmár wrote:
> > 
> > 2018-03-13 13:22-0700, Sean Christopherson:
> > > 
> > > On Tue, 2018-03-13 at 19:42 +0100, Radim Krcmár wrote:
> > > > 
> > > > 2018-03-09 23:06+0000, Christopherson, Sean J:
> > > > > 
> > > > > On Friday, 2018-03-09, Radim Krčmář wrote:
> > > > > > 
> > > > > > 2018-03-05 09:39-0800, Sean Christopherson:
> > > > > > > 
> > > > > > > Clear nested_run_pending in handle_invalid_guest_state() after calling
> > > > > > > emulate_instruction(), i.e. after attempting to emulate at least one
> > > > > > > instruction.  This fixes an issue where L0 enters an infinite loop if
> > > > > > > L2 hits an exception that is intercepted by L1 while L0 is emulating
> > > > > > > L2's invalid guest state, effectively causing DoS on L1, e.g. the only
> > > > > > > way to break the loop is to kill Qemu in L0.
> > > > > > > 
> > > > > > >     1. call handle_invalid_guest_state() for L2
> > > > > > >     2. emulate_instruction() pends an exception, e.g. #UD
> > > > > > >     3. L1 intercepts the exception, i.e. nested_vmx_check_exception
> > > > > > >        returns 1
> > > > > > >     4. vmx_check_nested_events() returns -EBUSY because L1 wants to
> > > > > > >        intercept the exception and nested_run_pending is true
> > > > > > >     5. handle_invalid_guest_state() never makes forward progress for
> > > > > > >        L2 due to the pending exception
> > > > > > >     6. L1 retries VMLAUNCH and VMExits to L0 indefinitely, i.e. the
> > > > > > >        L1 vCPU trying VMLAUNCH effectively hangs
> > > > > > > 
> > > > > > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > > > > ---
> > > > > > nested_run_pending signals that we have to execute VMRESUME in order to
> > > > > > do injection from L2's VMCS (at least VM_ENTRY_INTR_INFO_FIELD).
> > > > > > 
> > > > > > If we don't let the hardware do it, we need to transfer the state from
> > > > > > L2's VMCS while doing a nested VM exit for the exception (= behave as if
> > > > > > we entered the guest and exited).
> > > > > Right, but by virtue of being in handle_invalid_guest_state() we're
> > > > > already effectively post-VMRESUME.  handle_invalid_guest_state() is
> > > > > technically called in the VMEXIT flow, but the initial check against
> > > > > emulation_required is done in vmx_vcpu_run().  Conceptually, emulating
> > > > > due to invalid state is distinctly different from emulating due to a
> > > > > specific VMEXIT.  In the latter case, we're emulating an instruction
> > > > > that has already been executed/attempted in the guest, whereas in the
> > > > > invalid guest case we're emulating instructions that haven't been seen
> > > > > before.
> > > > On a higher level of abstraction, KVM is emulating CPU just the same.
> > > > Here, we're starting by emulating L1's VMLAUNCH.  KVM cannot accelerate
> > > > the emulation of the VMLAUNCH instruction nor instructions that come
> > > > after VMLAUNCH (L2) due to missing hardware features.
> > > > 
> > > > The purpose of nested_run_pending is to finish the interrupt injection
> > > > of VMLAUNCH, which should be done before any L2 instruction is executed.
> > > > Normal flow uses (L0's) VMRESUME acceleration for that, but that is not
> > > > available in our case.
> > > > (Maybe it's possible to accelerate by manipulating VMCS into a valid
> > > >  state and forcing an immediate VM exit.)
> > > There is nothing to inject at the time of VMLAUNCH.  L1's VMLAUNCH has
> > > completed
> > This seems to be the core of disagreement.  The emulated VMLAUNCH
> > depends on VMX to inject whatever is in VM_ENTRY_INTR_INFO_FIELD and if
> > we never do a VMRESUME in L0, then the VMLAUNCH is incomplete.

> Ah, I see where you're coming from.  Digging a little deeper, I think
> the underlying issue is that handle_invalid_guest_state() doesn't
> properly handle exceptions that are pended by emulate_instruction().
> Regardless of whether we're emulating L1 or L2, we'll enter an infinite
> loop if emulation_required is true and an exception is pending.  As
> you said, we rely on VMRESUME to inject the exception, but we'll never
> execute VMRESUME if emulation_required is true (and it would fail due
> to invalid guest state if we tried to force the injection).  I verified
> this by generating a #GP while emulating Big RM for L1; KVM soft hangs
> and never returns to userspace.


Argh, my analysis was wrong for Big RM, though it's correct for invalid
state in protected mode.  KVM fully emulates interrupts for Big RM, i.e.
doesn't rely on VMRESUME for injection.  The soft hang I saw was because
pc-bios handled the #GP but didn't skip the faulting instruction.

Invalid PM on the other hand doesn't support emulating interrupts, e.g.
a #GP while emulating invalid guest state in PM will be written into
the VMCS even though emulation_required is true.  I'll take a stab at a
patch to handle the most obvious case of hitting an exception during
handle_invalid_guest_state.

> I think the fix would be to mirror what handle_emulation_failure()
> does if it gets a #UD at CPL0 and return KVM_INTERNAL_ERROR_EMULATION
> to userspace if an exception is pending and emulation_required is true.
> Note that a #UD in L1 will already do this because of the aforementioned
> handle_emulation_failure().
> 
> > 
> > Maybe I'm missing the point of nested_run_pending -- why do we have it
> > at all if not to signal that there is a pending injection from vmcs12?

> My understanding is that nested_run_pending is quite literal: there is
> a nested VM that needs to be run, do that next.  I don't think it's
> directly related to a pending injection, but rather is queried to know
> what is or isn't legal, e.g. in this case injecting a VMEXIT into L1
> isn't legal because we need to run L2 next.
> 

> > 
> > > 
> > >               we're running/emulating L2 with nothing pending when a
> > > #UD is encountered in L2.  In the unrestricted guest case, where we
> > > accelerated L1's VMLAUNCH, the vcpu_run postamble in the accelerated
> > > VMLAUNCH->VMEXIT path clears nested_run_pending after the VMEXIT(#UD).
> > > In the restricted guest case, we bailed out of vcpu_run very early
> > > without clearing nested_run_pending.
> > Yes, we do not need to inject in this case (which we could detect and
> > just never set nested_run_pending in the first place), but we need to in
> > the case where L1 sets pending interrupt in VMCS.
> > 
> > > 
> > > Here's a comparison of how a #UD in L2 with invalid guest state would
> > > be handled with unrestricted guest enabled vs. disabled, and why I
> > > thought it made sense to clear nested_run_pending in
> > > handle_invalid_guest_state().
> > > 
> > > Unrestricted Guest
> > > 
> > > H/W  Emu    Action              Result
> > > --------------------------------------
> > > L0   *      vcpu_run vmcs01     switch to L1
> > > L1   *      vcpu_run vmcs12     exit to L0
> > > L0   *      nested_vmx_run      vmx->nested.nested_run_pending = 1
> > > L0   *      vcpu_run vmcs02     switch to L2
> > The hardware injected an interrupt from vmcs12.
> > 
> > > 
> > > L2   *      L2 does stuff...
> > > L2   *      #UD                 exit to L0
> > > L0   *      vcpu_run postamble  vmx->nested.nested_run_pending = 0
> > > L0   *      inject VMEXIT(#UD)  succeeds
> > > L0   *      vcpu_run vmcs01     switch to L1
> > > 
> > > Restricted Guest
> > > 
> > > H/W  Emu    Action              Result
> > > --------------------------------------
> > > L0   *      vcpu_run vmcs01     switch to L1
> > > L1   *      vcpu_run vmcs12     exit to L0
> > > L0   *      nested_vmx_run      vmx->nested.nested_run_pending = 1
> > > L0   *      vcpu_run vmcs02     bails due to emulation_required
> > At <<Here>> there is no code to inject an interrupt from vmcs12.
> > 
> > > 
> > > L0   L2     emulate_instruction emulate L2, i.e. post-VMLAUNCH vmcs12
> > > L0   L2     L2 does stuff...
> > > L0   L2     #UD                 exit to L0 (emulated)
> > > 
> > > <this is where I was suggesting we clear nested_run_pending>
> > But why not earlier, e.g. before emulating instructions in the invalid
> > state?  I don't understand what nested_run_pending is supposed to do in
> > this case.  I think the right place to clear nested_run_pending is at
> > <<Here>>, where we also do what hardware with unrestricted guest does.
> In an alternate universe where we weren't fixing the underlying bugs,
> I would have no qualms about clearing nested_run_pending earlier.  I put
> it after emulate_instruction because I was trying to be conservative,
> i.e. I wanted to put it in a location where we could say without a doubt
> that we "ran" L2.
> 
> > 
> > Thanks.
> > 
> > > 
> > > L0   *      inject VMEXIT(#UD)  fails, nested_run_pending is set
> > > L0   L2     infinite loop...

Patch

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 591214843046..3073160e6bae 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -6835,6 +6835,8 @@  static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 
 		err = emulate_instruction(vcpu, 0);
 
+		vmx->nested.nested_run_pending = 0;
+
 		if (err == EMULATE_USER_EXIT) {
 			++vcpu->stat.mmio_exits;
 			ret = 0;