
[2/4] KVM: nSVM: do not forward NMI window singlestep VM exits to L1

Message ID: 20170619161740.GA13549@potion (mailing list archive)
State: New, archived

Commit Message

Radim Krčmář June 19, 2017, 4:17 p.m. UTC
2017-06-19 15:05+0200, Ladi Prosek:
> On Mon, Jun 19, 2017 at 2:50 PM, Ladi Prosek <lprosek@redhat.com> wrote:
> > On Fri, Jun 16, 2017 at 3:26 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> >> 2017-06-15 13:20+0200, Ladi Prosek:
> >>> Nested hypervisor should not see singlestep VM exits if singlestepping
> >>> was enabled internally by KVM. Windows is particularly sensitive to this
> >>> and known to bluescreen on unexpected VM exits.
> >>>
> >>> Signed-off-by: Ladi Prosek <lprosek@redhat.com>
> >>> ---
> >>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> >>> @@ -966,9 +967,13 @@ static void svm_disable_lbrv(struct vcpu_svm *svm)
> >>>  static void disable_nmi_singlestep(struct vcpu_svm *svm)
> >>>  {
> >>>       svm->nmi_singlestep = false;
> >>> -     if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP))
> >>> -             svm->vmcb->save.rflags &=
> >>> -                     ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
> >>> +     if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP)) {
> >>> +             /* Clear our flags if they were not set by the guest */
> >>> +             if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF))
> >>> +                     svm->vmcb->save.rflags &= ~X86_EFLAGS_TF;
> >>> +             if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_RF))
> >>> +                     svm->vmcb->save.rflags &= ~X86_EFLAGS_RF;
> >>
> >> IIUC, we intercept/fault on IRET, disable the interception, set TF+RF
> >> and enter again, the CPU executes IRET and then we get a #DB exit.
> >>
> >> IRET pops EFLAGS from before the NMI -- doesn't the CPU properly restore
> >> EFLAGS, so we do not need this part here?
> >
> > My test VM doesn't work without this part, even with the change that
> > Paolo suggested in 0/4:
> >
> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> > index d1efe2c62b3f..15a2f7f8e539 100644
> > --- a/arch/x86/kvm/svm.c
> > +++ b/arch/x86/kvm/svm.c
> > @@ -4622,6 +4622,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
> >         if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_IRET_MASK))
> >             == HF_NMI_MASK)
> >                 return; /* IRET will cause a vm exit */
> > +       if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_GIF_MASK))
> > +           == HF_NMI_MASK)
> > +               return; /* STGI will cause a vm exit */
> >
> >
> > Let's see what we're singlestepping over (Windows Server 2016 with Hyper-V).
> >
> > 1. The very first time we enable NMI singlestep is when running
> > nested. The nested guest's rip is just after 'pause' in a 'pause; cmp;
> > jne' loop. svm_nmi_allowed returns false because nested_svm_nmi
> > returns false - we don't want to inject NMI now because
> > svm->nested.intercept has the INTERCEPT_NMI bit set.
> >
> > 2. Then we find ourselves in L1 with the rip just after 'vmrun'.
> > svm_nmi_allowed returns false because gif_set returns false (hflags ==
> > HF_HIF_MASK | HF_VINTR_MASK). So we, again, enable NMI singlestep
> > (without having disabled it yet).
> >
> > 3. We singlestep over the instruction immediately following 'vmrun'
> > (it's a 'mov rax, [rsp+0x20]' in this trace) and finally disable NMI
> > singlestep on this vcpu. We inject the NMI when the guest executes
> > stgi.
> >
> >
> > I'll see if I can short this out. Setting the trap flag to step over
> > an instruction which has no other significance than following a
> > 'vmrun' is indeed unnecessary.
> 
> Ok, I think I understand it.
> 
> First, Paolo's GIF change should be checking only HF_GIF_MASK. Whether
> we're currently in an NMI or not does not make a difference, NMI is
> not allowed and we have interception in place for when GIF is set so
> we don't need to singlestep.
> 
> Second, enable_nmi_window can also do nothing if
> svm->nested.exit_required. We're not really going to run the guest in
> this case so no need to singlestep.

Makes sense.
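
For reference, a rough sketch of how enable_nmi_window() could end up
looking with both tweaks folded in -- the two new early returns mirror
your description above, while the rest is approximated from memory
rather than copied from an actual tree:

static void enable_nmi_window(struct kvm_vcpu *vcpu)
{
        struct vcpu_svm *svm = to_svm(vcpu);

        if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_IRET_MASK))
            == HF_NMI_MASK)
                return; /* IRET will cause a vm exit */

        if (!gif_set(svm))
                return; /* STGI will cause a vm exit */

        if (svm->nested.exit_required)
                return; /* we are not going to run the guest anyway */

        /*
         * Something prevents NMI from being injected.  Single step over
         * the problem (IRET or exception injection or interrupt shadow).
         */
        svm->nmi_singlestep_guest_rflags = svm->vmcb->save.rflags;
        svm->nmi_singlestep = true;
        svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
}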

> With these two tweaks my test VM does not generate any DB exits. We
> still occasionally set TF if we have an NMI and interrupt pending at
> the same time (see inject_pending_event where we check
> vcpu->arch.interrupt.pending, then later vcpu->arch.nmi_pending) but
> we clear it immediately in the new code added in patch 4.

Right, we only need the single step over IRET and interrupt shadow.

Btw. instead of single-stepping over IRET/interrupt shadow, could we set
INTERRUPT_SHADOW in VMCB, inject the NMI, and let it execute?
This mechanism would explain why AMD didn't provide a trap for IRET ...

APM 15.20 says "Injected events are treated in every way as though they
had occurred normally in the guest", which makes me think that
INTERRUPT_SHADOW blocks them, if it blocks NMIs at all on AMD.

e.g. the diff appended at the end of this page (under "Patch").

(Of course we'd want to refactor it for the final patch. :])

> What a complex state machine! I'll prepare v2.

Yes. :/  Looking forward to v2,

thanks.

Comments

Paolo Bonzini June 19, 2017, 5:17 p.m. UTC | #1
On 19/06/2017 18:17, Radim Krčmář wrote:
> Right, we only need the single step over IRET and interrupt shadow.
> 
> Btw. instead of single-stepping over IRET/interrupt shadow, could we set
> INTERRUPT_SHADOW in VMCB, inject the NMI, and let it execute?
> This mechanism would explain why AMD didn't provide a trap for IRET ...

You mean they didn't provide a trap-like VMEXIT for IRET, only fault-like?

Thanks,

Paolo

> APM 15.20 says "Injected events are treated in every way as though they
> had occurred normally in the guest", which makes me think that
> INTERRUPT_SHADOW blocks them, if it blocks NMIs at all on AMD.

Radim Krčmář June 19, 2017, 5:46 p.m. UTC | #2
2017-06-19 19:17+0200, Paolo Bonzini:
> On 19/06/2017 18:17, Radim Krčmář wrote:
> > Right, we only need the single step over IRET and interrupt shadow.
> > 
> > Btw. instead of single-stepping over IRET/interrupt shadow, could we set
> > INTERRUPT_SHADOW in VMCB, inject the NMI, and let it execute?
> > This mechanism would explain why AMD didn't provide a trap for IRET ...
> 
> You mean they didn't provide a trap-like VMEXIT for IRET, only fault-like?

Yes.  SVM has trap-like VM exits, so I didn't understand why one was not
used for IRET.  Forcing the hypervisor to take two VM exits and do a
clumsy single-step felt out of place when the rest was designed nicely ...

Ladi Prosek June 20, 2017, 7:41 a.m. UTC | #3
On Mon, Jun 19, 2017 at 6:17 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> 2017-06-19 15:05+0200, Ladi Prosek:
>> On Mon, Jun 19, 2017 at 2:50 PM, Ladi Prosek <lprosek@redhat.com> wrote:
>> > On Fri, Jun 16, 2017 at 3:26 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
>> >> 2017-06-15 13:20+0200, Ladi Prosek:
>> >>> Nested hypervisor should not see singlestep VM exits if singlestepping
>> >>> was enabled internally by KVM. Windows is particularly sensitive to this
>> >>> and known to bluescreen on unexpected VM exits.
>> >>>
>> >>> Signed-off-by: Ladi Prosek <lprosek@redhat.com>
>> >>> ---
>> >>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> >>> @@ -966,9 +967,13 @@ static void svm_disable_lbrv(struct vcpu_svm *svm)
>> >>>  static void disable_nmi_singlestep(struct vcpu_svm *svm)
>> >>>  {
>> >>>       svm->nmi_singlestep = false;
>> >>> -     if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP))
>> >>> -             svm->vmcb->save.rflags &=
>> >>> -                     ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
>> >>> +     if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP)) {
>> >>> +             /* Clear our flags if they were not set by the guest */
>> >>> +             if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF))
>> >>> +                     svm->vmcb->save.rflags &= ~X86_EFLAGS_TF;
>> >>> +             if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_RF))
>> >>> +                     svm->vmcb->save.rflags &= ~X86_EFLAGS_RF;
>> >>
>> >> IIUC, we intercept/fault on IRET, disable the interception, set TF+RF
>> >> and enter again, the CPU executes IRET and then we get a #DB exit.
>> >>
>> >> IRET pops EFLAGS from before the NMI -- doesn't the CPU properly restore
>> >> EFLAGS, so we do not need this part here?
>> >
>> > My test VM doesn't work without this part, even with the change that
>> > Paolo suggested in 0/4:
>> >
>> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
>> > index d1efe2c62b3f..15a2f7f8e539 100644
>> > --- a/arch/x86/kvm/svm.c
>> > +++ b/arch/x86/kvm/svm.c
>> > @@ -4622,6 +4622,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
>> >         if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_IRET_MASK))
>> >             == HF_NMI_MASK)
>> >                 return; /* IRET will cause a vm exit */
>> > +       if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_GIF_MASK))
>> > +           == HF_NMI_MASK)
>> > +               return; /* STGI will cause a vm exit */
>> >
>> >
>> > Let's see what we're singlestepping over (Windows Server 2016 with Hyper-V).
>> >
>> > 1. The very first time we enable NMI singlestep is when running
>> > nested. The nested guest's rip is just after 'pause' in a 'pause; cmp;
>> > jne' loop. svm_nmi_allowed returns false because nested_svm_nmi
>> > returns false - we don't want to inject NMI now because
>> > svm->nested.intercept has the INTERCEPT_NMI bit set.
>> >
>> > 2. Then we find ourselves in L1 with the rip just after 'vmrun'.
>> > svm_nmi_allowed returns false because gif_set returns false (hflags ==
>> > HF_HIF_MASK | HF_VINTR_MASK). So we, again, enable NMI singlestep
>> > (without having disabled it yet).
>> >
>> > 3. We singlestep over the instruction immediately following 'vmrun'
>> > (it's a 'mov rax, [rsp+0x20]' in this trace) and finally disable NMI
>> > singlestep on this vcpu. We inject the NMI when the guest executes
>> > stgi.
>> >
>> >
>> > I'll see if I can short this out. Setting the trap flag to step over
>> > an instruction which has no other significance than following a
>> > 'vmrun' is indeed unnecessary.
>>
>> Ok, I think I understand it.
>>
>> First, Paolo's GIF change should be checking only HF_GIF_MASK. Whether
>> we're currently in an NMI or not does not make a difference, NMI is
>> not allowed and we have interception in place for when GIF is set so
>> we don't need to singlestep.
>>
>> Second, enable_nmi_window can also do nothing if
>> svm->nested.exit_required. We're not really going to run the guest in
>> this case so no need to singlestep.
>
> Makes sense.
>
>> With these two tweaks my test VM does not generate any DB exits. We
>> still occasionally set TF if we have an NMI and interrupt pending at
>> the same time (see inject_pending_event where we check
>> vcpu->arch.interrupt.pending, then later vcpu->arch.nmi_pending) but
>> we clear it immediately in the new code added in patch 4.
>
> Right, we only need the single step over IRET and interrupt shadow.
>
> Btw. instead of single-stepping over IRET/interrupt shadow, could we set
> INTERRUPT_SHADOW in VMCB, inject the NMI, and let it execute?
> This mechanism would explain why AMD didn't provide a trap for IRET ...
>
> APM 15.20 says "Injected events are treated in every way as though they
> had occurred normally in the guest", which makes me think that
> INTERRUPT_SHADOW blocks them, if it blocks NMIs at all on AMD.

Unfortunately this does not seem to work; eventinj.flat fails with
your change. There may be more VM exits before rip finally moves
forward, so I have added a bool flag "nmi_int_shadow" and keep setting
SVM_INTERRUPT_SHADOW_MASK on entry for as long as the flag is set. The
flag is reset in svm_complete_interrupts, which conveniently has the
logic for this already. As far as I can tell, the NMI is still injected
before the IRET.
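
Roughly, as a sketch of the idea rather than the exact hunks I am
running (the only new state is a bool "nmi_int_shadow" in struct
vcpu_svm, everything else is paraphrased):

In enable_nmi_window(), on top of your change:

        --vcpu->arch.nmi_pending;
        svm_inject_nmi(vcpu);
        svm->nmi_int_shadow = true;
        svm->vmcb->control.int_state |= SVM_INTERRUPT_SHADOW_MASK;

In svm_vcpu_run(), before entering the guest:

        /* keep forcing the shadow until the NMI is actually delivered */
        if (svm->nmi_int_shadow)
                svm->vmcb->control.int_state |= SVM_INTERRUPT_SHADOW_MASK;

In svm_complete_interrupts(), where the delivered NMI is recorded:

        case SVM_EXITINTINFO_TYPE_NMI:
                svm->vcpu.arch.nmi_injected = true;
                /* the NMI made it in, stop forcing the interrupt shadow */
                svm->nmi_int_shadow = false;
                break;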

eventinj.flat output:

Before NMI IRET test
Sending NMI to self
NMI isr running stack 0x461000
Sending nested NMI to self
After nested NMI to self
Nested NMI isr running rip=40038e (expecting 400390)
...

and my primitive debug output ("Running guest" is printed before
vmrun, "Guest exited" after):

[436685.009323] Running guest at rip 402146 shadow 0
[436685.009325] Guest exited with 7b
[436685.009438] Running guest at rip 40038e shadow 0
[436685.009440] Guest exited with 400
[436685.009452] Running guest at rip 40038e shadow 0
[436685.009454] Guest exited with 400
[436685.009467] Running guest at rip 40038e shadow 0
[436685.009468] Guest exited with 400
[436685.009470] Running guest at rip 40038e shadow 0
[436685.009472] Guest exited with 400
[436685.009474] Running guest at rip 40038e shadow 0
[436685.009476] Guest exited with 74
[436685.009478] Running guest at rip 40038e shadow 1
[436685.009481] Guest exited with 400
[436685.009484] Running guest at rip 40038e shadow 1
[436685.009486] Guest exited with 400
[436685.009494] Running guest at rip 40038e shadow 1
[436685.009496] Guest exited with 400
[436685.009499] Running guest at rip 402e20 shadow 0
[436685.009501] Guest exited with 400


> e.g.
>
> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> index 6e3095d1bad4..b564613b4e65 100644
> --- a/arch/x86/kvm/svm.c
> +++ b/arch/x86/kvm/svm.c
> @@ -4624,14 +4624,16 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
>
>         if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_IRET_MASK))
>             == HF_NMI_MASK)
> -               return; /* IRET will cause a vm exit */
> +               return 0; /* IRET will cause a vm exit */
>
> -       /*
> -        * Something prevents NMI from been injected. Single step over possible
> -        * problem (IRET or exception injection or interrupt shadow)
> -        */
> -       svm->nmi_singlestep = true;
> -       svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
> +       if (svm->vmcb->control.event_inj)
> +               return 1; /* will request VM exit immediately after injection */
> +
> +       --vcpu->arch.nmi_pending;
> +       svm_inject_nmi(vcpu);
> +       svm->vmcb->control.int_state |= SVM_INTERRUPT_SHADOW_MASK;
> +
> +       return 0;
>  }
>
>  static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr)
>
> (Of course we'd want to refactor it for the final patch. :])
>
>> What a complex state machine! I'll prepare v2.
>
> Yes. :/  Looking forward to v2,
>
> thanks.

Radim Krčmář June 20, 2017, 1:01 p.m. UTC | #4
2017-06-20 09:41+0200, Ladi Prosek:
> On Mon, Jun 19, 2017 at 6:17 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> > 2017-06-19 15:05+0200, Ladi Prosek:
> >> On Mon, Jun 19, 2017 at 2:50 PM, Ladi Prosek <lprosek@redhat.com> wrote:
> >> > On Fri, Jun 16, 2017 at 3:26 PM, Radim Krčmář <rkrcmar@redhat.com> wrote:
> >> >> 2017-06-15 13:20+0200, Ladi Prosek:
> >> >>> Nested hypervisor should not see singlestep VM exits if singlestepping
> >> >>> was enabled internally by KVM. Windows is particularly sensitive to this
> >> >>> and known to bluescreen on unexpected VM exits.
> >> >>>
> >> >>> Signed-off-by: Ladi Prosek <lprosek@redhat.com>
> >> >>> ---
> >> >>> diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> >> >>> @@ -966,9 +967,13 @@ static void svm_disable_lbrv(struct vcpu_svm *svm)
> >> >>>  static void disable_nmi_singlestep(struct vcpu_svm *svm)
> >> >>>  {
> >> >>>       svm->nmi_singlestep = false;
> >> >>> -     if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP))
> >> >>> -             svm->vmcb->save.rflags &=
> >> >>> -                     ~(X86_EFLAGS_TF | X86_EFLAGS_RF);
> >> >>> +     if (!(svm->vcpu.guest_debug & KVM_GUESTDBG_SINGLESTEP)) {
> >> >>> +             /* Clear our flags if they were not set by the guest */
> >> >>> +             if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_TF))
> >> >>> +                     svm->vmcb->save.rflags &= ~X86_EFLAGS_TF;
> >> >>> +             if (!(svm->nmi_singlestep_guest_rflags & X86_EFLAGS_RF))
> >> >>> +                     svm->vmcb->save.rflags &= ~X86_EFLAGS_RF;
> >> >>
> >> >> IIUC, we intercept/fault on IRET, disable the interception, set TF+RF
> >> >> and enter again, the CPU executes IRET and then we get a #DB exit.
> >> >>
> >> >> IRET pops EFLAGS from before the NMI -- doesn't the CPU properly restore
> >> >> EFLAGS, so we do not need this part here?
> >> >
> >> > My test VM doesn't work without this part, even with the change that
> >> > Paolo suggested in 0/4:
> >> >
> >> > diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
> >> > index d1efe2c62b3f..15a2f7f8e539 100644
> >> > --- a/arch/x86/kvm/svm.c
> >> > +++ b/arch/x86/kvm/svm.c
> >> > @@ -4622,6 +4622,9 @@ static void enable_nmi_window(struct kvm_vcpu *vcpu)
> >> >         if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_IRET_MASK))
> >> >             == HF_NMI_MASK)
> >> >                 return; /* IRET will cause a vm exit */
> >> > +       if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_GIF_MASK))
> >> > +           == HF_NMI_MASK)
> >> > +               return; /* STGI will cause a vm exit */
> >> >
> >> >
> >> > Let's see what we're singlestepping over (Windows Server 2016 with Hyper-V).
> >> >
> >> > 1. The very first time we enable NMI singlestep is when running
> >> > nested. The nested guest's rip is just after 'pause' in a 'pause; cmp;
> >> > jne' loop. svm_nmi_allowed returns false because nested_svm_nmi
> >> > returns false - we don't want to inject NMI now because
> >> > svm->nested.intercept has the INTERCEPT_NMI bit set.
> >> >
> >> > 2. Then we find ourselves in L1 with the rip just after 'vmrun'.
> >> > svm_nmi_allowed returns false because gif_set returns false (hflags ==
> >> > HF_HIF_MASK | HF_VINTR_MASK). So we, again, enable NMI singlestep
> >> > (without having disabled it yet).
> >> >
> >> > 3. We singlestep over the instruction immediately following 'vmrun'
> >> > (it's a 'mov rax, [rsp+0x20]' in this trace) and finally disable NMI
> >> > singlestep on this vcpu. We inject the NMI when the guest executes
> >> > stgi.
> >> >
> >> >
> >> > I'll see if I can short this out. Setting the trap flag to step over
> >> > an instruction which has no other significance than following a
> >> > 'vmrun' is indeed unnecessary.
> >>
> >> Ok, I think I understand it.
> >>
> >> First, Paolo's GIF change should be checking only HF_GIF_MASK. Whether
> >> we're currently in an NMI or not does not make a difference, NMI is
> >> not allowed and we have interception in place for when GIF is set so
> >> we don't need to singlestep.
> >>
> >> Second, enable_nmi_window can also do nothing if
> >> svm->nested.exit_required. We're not really going to run the guest in
> >> this case so no need to singlestep.
> >
> > Makes sense.
> >
> >> With these two tweaks my test VM does not generate any DB exits. We
> >> still occasionally set TF if we have an NMI and interrupt pending at
> >> the same time (see inject_pending_event where we check
> >> vcpu->arch.interrupt.pending, then later vcpu->arch.nmi_pending) but
> >> we clear it immediately in the new code added in patch 4.
> >
> > Right, we only need the single step over IRET and interrupt shadow.
> >
> > Btw. instead of single-stepping over IRET/interrupt shadow, could we set
> > INTERRUPT_SHADOW in VMCB, inject the NMI, and let it execute?
> > This mechanism would explain why AMD didn't provide a trap for IRET ...
> >
> > APM 15.20 says "Injected events are treated in every way as though they
> > had occurred normally in the guest", which makes me think that
> > INTERRUPT_SHADOW blocks them, if it blocks NMIs at all on AMD.
> 
> Unfortunately this does not seem to work, eventinj.flat fails with
> your change. There may be more vmexits before rip finally moves
> forward so I have added a bool flag "nmi_int_shadow" and keep setting
> SVM_INTERRUPT_SHADOW_MASK on entry as long as the flag is set. The
> flag is reset in svm_complete_interrupts which conveniently has the
> logic for this already. As far as I can tell, NMI is still injected
> before the IRET.
> 
> eventinj.flat output:
> 
> Before NMI IRET test
> Sending NMI to self
> NMI isr running stack 0x461000
> Sending nested NMI to self
> After nested NMI to self
> Nested NMI isr running rip=40038e (expecting 400390)
> ...

At least we know the current code isn't bad for no reason. :)
Thanks for trying it out!

Patch

diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c
index 6e3095d1bad4..b564613b4e65 100644
--- a/arch/x86/kvm/svm.c
+++ b/arch/x86/kvm/svm.c
@@ -4624,14 +4624,16 @@  static void enable_nmi_window(struct kvm_vcpu *vcpu)
 
 	if ((svm->vcpu.arch.hflags & (HF_NMI_MASK | HF_IRET_MASK))
 	    == HF_NMI_MASK)
-		return; /* IRET will cause a vm exit */
+		return 0; /* IRET will cause a vm exit */
 
-	/*
-	 * Something prevents NMI from been injected. Single step over possible
-	 * problem (IRET or exception injection or interrupt shadow)
-	 */
-	svm->nmi_singlestep = true;
-	svm->vmcb->save.rflags |= (X86_EFLAGS_TF | X86_EFLAGS_RF);
+	if (svm->vmcb->control.event_inj)
+		return 1; /* will request VM exit immediately after injection */
+
+	--vcpu->arch.nmi_pending;
+	svm_inject_nmi(vcpu);
+	svm->vmcb->control.int_state |= SVM_INTERRUPT_SHADOW_MASK;
+
+	return 0;
 }
 
 static int svm_set_tss_addr(struct kvm *kvm, unsigned int addr)