
[v2,11/11] KVM: x86: emulator/smm: preserve interrupt shadow in SMRAM

Message ID 20220621150902.46126-12-mlevitsk@redhat.com (mailing list archive)
State New, archived
Series SMM emulation and interrupt shadow fixes

Commit Message

Maxim Levitsky June 21, 2022, 3:09 p.m. UTC
When #SMI is asserted, the CPU can be in interrupt shadow
due to sti or mov ss.

Neither the Intel nor the AMD PRM mandates that #SMI be
blocked during the shadow. On top of that, since neither
SVM nor VMX has true support for an SMI window, waiting
for one instruction would mean single stepping the guest.

Instead, allow #SMI in this case, but both reset the interrupt
shadow and stash its value in SMRAM to restore it on exit
from SMM.

This fixes rare failures seen mostly on Windows guests on VMX,
when #SMI falls on the sti instruction, which manifest as a
VM entry failure due to EFLAGS.IF not being set while STI
blocking is still set in the VMCS.


Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/emulate.c     | 17 ++++++++++++++---
 arch/x86/kvm/kvm_emulate.h | 13 ++++++++++---
 arch/x86/kvm/x86.c         | 12 ++++++++++++
 3 files changed, 36 insertions(+), 6 deletions(-)
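
The approach described in the commit message can be sketched roughly as follows. This is a minimal, self-contained model and not the code from the patch: `struct toy_vcpu`, `struct toy_smram`, `toy_enter_smm()` and `toy_rsm()` are hypothetical names, and the real SMRAM save area and SMM entry sequence touch far more state (for example, real hardware resets more of EFLAGS than just IF).

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_IF (1u << 9)

/* Hypothetical, heavily simplified vCPU state (not KVM's real structures). */
struct toy_vcpu {
	uint32_t eflags;
	uint8_t  int_shadow;	/* nonzero while blocked by STI or MOV SS */
	bool     in_smm;
};

/* Hypothetical save area; the real SMRAM layout is much larger. */
struct toy_smram {
	uint32_t eflags;
	uint8_t  int_shadow;	/* slot used to stash the shadow across SMM */
};

/* SMM entry: stash the shadow, then clear it together with EFLAGS.IF. */
static void toy_enter_smm(struct toy_vcpu *vcpu, struct toy_smram *smram)
{
	smram->eflags = vcpu->eflags;
	smram->int_shadow = vcpu->int_shadow;

	vcpu->int_shadow = 0;
	vcpu->eflags &= ~X86_EFLAGS_IF;
	vcpu->in_smm = true;
}

/* RSM: restore EFLAGS and the stashed shadow from the save area. */
static void toy_rsm(struct toy_vcpu *vcpu, const struct toy_smram *smram)
{
	vcpu->eflags = smram->eflags;
	vcpu->int_shadow = smram->int_shadow;
	vcpu->in_smm = false;
}
```

The point of the model is only the ordering: the shadow is cleared at the same time as EFLAGS.IF on entry, so the two never disagree while in SMM, and the pre-SMI value comes back on exit.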

Comments

Jim Mattson June 29, 2022, 4:31 p.m. UTC | #1
On Tue, Jun 21, 2022 at 8:09 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> When #SMI is asserted, the CPU can be in interrupt shadow
> due to sti or mov ss.
>
> It is not mandatory in  Intel/AMD prm to have the #SMI
> blocked during the shadow, and on top of
> that, since neither SVM nor VMX has true support for SMI
> window, waiting for one instruction would mean single stepping
> the guest.
>
> Instead, allow #SMI in this case, but both reset the interrupt
> window and stash its value in SMRAM to restore it on exit
> from SMM.
>
> This fixes rare failures seen mostly on windows guests on VMX,
> when #SMI falls on the sti instruction which mainfest in
> VM entry failure due to EFLAGS.IF not being set, but STI interrupt
> window still being set in the VMCS.

I think you're just making stuff up! See Note #5 at
https://sandpile.org/x86/inter.htm.

Can you reference the vendors' documentation that supports this change?
Maxim Levitsky June 30, 2022, 6 a.m. UTC | #2
On Wed, 2022-06-29 at 09:31 -0700, Jim Mattson wrote:
> On Tue, Jun 21, 2022 at 8:09 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > When #SMI is asserted, the CPU can be in interrupt shadow
> > due to sti or mov ss.
> > 
> > It is not mandatory in  Intel/AMD prm to have the #SMI
> > blocked during the shadow, and on top of
> > that, since neither SVM nor VMX has true support for SMI
> > window, waiting for one instruction would mean single stepping
> > the guest.
> > 
> > Instead, allow #SMI in this case, but both reset the interrupt
> > window and stash its value in SMRAM to restore it on exit
> > from SMM.
> > 
> > This fixes rare failures seen mostly on windows guests on VMX,
> > when #SMI falls on the sti instruction which mainfest in
> > VM entry failure due to EFLAGS.IF not being set, but STI interrupt
> > window still being set in the VMCS.
> 
> I think you're just making stuff up! See Note #5 at
> https://sandpile.org/x86/inter.htm.
> 
> Can you reference the vendors' documentation that supports this change?
> 

First of all, just to note that the actual issue here was that
we don't clear the shadow bits in the guest interruptibility field
in the VMCS on SMM entry, which triggered a consistency check because
we do clear EFLAGS.IF.
Preserving the interrupt shadow is just nice to have.
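
The consistency check mentioned above can be illustrated with a small sketch. It assumes the two VM-entry rules on the guest interruptibility state as I understand them from the SDM: "blocking by STI" (bit 0) must be clear when RFLAGS.IF is clear, and bits 0 and 1 must not both be set. The function name `toy_intr_state_valid()` is hypothetical, and the real checks cover more bits than shown here.

```c
#include <stdbool.h>
#include <stdint.h>

#define X86_EFLAGS_IF           (1ull << 9)
#define GUEST_INTR_STATE_STI    (1u << 0)	/* blocking by STI */
#define GUEST_INTR_STATE_MOV_SS (1u << 1)	/* blocking by MOV SS */

/*
 * Hypothetical model of two VM-entry checks on guest interruptibility
 * state. Returns false where a VM entry would be expected to fail.
 */
static bool toy_intr_state_valid(uint64_t rflags, uint32_t intr_state)
{
	bool sti = intr_state & GUEST_INTR_STATE_STI;
	bool mov_ss = intr_state & GUEST_INTR_STATE_MOV_SS;

	/* STI blocking makes no sense if interrupts are disabled. */
	if (sti && !(rflags & X86_EFLAGS_IF))
		return false;

	/* The two blocking reasons are mutually exclusive. */
	if (sti && mov_ss)
		return false;

	return true;
}
```

In this model, entering SMM with STI blocking still set while IF has been cleared corresponds to `toy_intr_state_valid(0x2, GUEST_INTR_STATE_STI)`, which returns false.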


This is what Intel's spec says for 'STI':

"The IF flag and the STI and CLI instructions do not prohibit the generation of exceptions and nonmaskable
interrupts (NMIs). However, NMIs (and system-management interrupts) may be inhibited on the instruction boundary
following an execution of STI that begins with IF = 0."

Thus it is likely that #SMIs are just blocked when in shadow, but it is easier to implement
it this way (it avoids single stepping the guest), and without any user visible difference.
As I noted in the patch description, there are two ways to solve this,
and preserving the int shadow in SMRAM is just the simpler way.


As for CPUs that neither block SMI nor preserve the int shadow, in theory they can exist, but that would
break things, as noted in this mail:

https://lore.kernel.org/lkml/1284913699-14986-1-git-send-email-avi@redhat.com/

It is possible, though, that a real CPU supports the HLT restart flag, which makes this a non-issue.
Still, I can't rule out that a real CPU doesn't preserve the interrupt shadow on SMI, but
I don't see why we can't do this to make things more robust.

Best regards,
	Maxim Levitsky
Jim Mattson June 30, 2022, 4 p.m. UTC | #3
On Wed, Jun 29, 2022 at 11:00 PM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Wed, 2022-06-29 at 09:31 -0700, Jim Mattson wrote:
> > On Tue, Jun 21, 2022 at 8:09 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > When #SMI is asserted, the CPU can be in interrupt shadow
> > > due to sti or mov ss.
> > >
> > > It is not mandatory in  Intel/AMD prm to have the #SMI
> > > blocked during the shadow, and on top of
> > > that, since neither SVM nor VMX has true support for SMI
> > > window, waiting for one instruction would mean single stepping
> > > the guest.
> > >
> > > Instead, allow #SMI in this case, but both reset the interrupt
> > > window and stash its value in SMRAM to restore it on exit
> > > from SMM.
> > >
> > > This fixes rare failures seen mostly on windows guests on VMX,
> > > when #SMI falls on the sti instruction which mainfest in
> > > VM entry failure due to EFLAGS.IF not being set, but STI interrupt
> > > window still being set in the VMCS.
> >
> > I think you're just making stuff up! See Note #5 at
> > https://sandpile.org/x86/inter.htm.
> >
> > Can you reference the vendors' documentation that supports this change?
> >
>
> First of all, just to note that the actual issue here was that
> we don't clear the shadow bits in the guest interruptability field
> in the vmcb on SMM entry, that triggered a consistency check because
> we do clear EFLAGS.IF.
> Preserving the interrupt shadow is just nice to have.
>
>
> That what Intel's spec says for the 'STI':
>
> "The IF flag and the STI and CLI instructions do not prohibit the generation of exceptions and nonmaskable inter-
> rupts (NMIs). However, NMIs (and system-management interrupts) may be inhibited on the instruction boundary
> following an execution of STI that begins with IF = 0."
>
> Thus it is likely that #SMI are just blocked when in shadow, but it is easier to implement
> it this way (avoids single stepping the guest) and without any user visable difference,
> which I noted in the patch description, I noted that there are two ways to solve this,
> and preserving the int shadow in SMRAM is just more simple way.

It's not true that there is no user-visible difference. In your
implementation, the SMI handler can see that the interrupt was
delivered in the interrupt shadow.

The right fix for this problem is to block SMI in an interrupt shadow,
as is likely the case for all modern CPUs.

>
> As for CPUS that neither block SMI nor preserve the int shadaw, in theory they can, but that would
> break things, as noted in this mail
>
> https://lore.kernel.org/lkml/1284913699-14986-1-git-send-email-avi@redhat.com/
>
> It is possible though that real cpu supports HLT restart flag, which makes this a non issue,
> still. I can't rule out that a real cpu doesn't preserve the interrupt shadow on SMI, but
> I don't see why we can't do this to make things more robust.

Because, as I said, I think you're just making stuff up...unless, of
course, you have documentation to back this up.
Maxim Levitsky July 5, 2022, 1:38 p.m. UTC | #4
On Thu, 2022-06-30 at 09:00 -0700, Jim Mattson wrote:
> On Wed, Jun 29, 2022 at 11:00 PM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > 
> > On Wed, 2022-06-29 at 09:31 -0700, Jim Mattson wrote:
> > > On Tue, Jun 21, 2022 at 8:09 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > > When #SMI is asserted, the CPU can be in interrupt shadow
> > > > due to sti or mov ss.
> > > > 
> > > > It is not mandatory in  Intel/AMD prm to have the #SMI
> > > > blocked during the shadow, and on top of
> > > > that, since neither SVM nor VMX has true support for SMI
> > > > window, waiting for one instruction would mean single stepping
> > > > the guest.
> > > > 
> > > > Instead, allow #SMI in this case, but both reset the interrupt
> > > > window and stash its value in SMRAM to restore it on exit
> > > > from SMM.
> > > > 
> > > > This fixes rare failures seen mostly on windows guests on VMX,
> > > > when #SMI falls on the sti instruction which mainfest in
> > > > VM entry failure due to EFLAGS.IF not being set, but STI interrupt
> > > > window still being set in the VMCS.
> > > 
> > > I think you're just making stuff up! See Note #5 at
> > > https://sandpile.org/x86/inter.htm.
> > > 
> > > Can you reference the vendors' documentation that supports this change?
> > > 
> > 
> > First of all, just to note that the actual issue here was that
> > we don't clear the shadow bits in the guest interruptability field
> > in the vmcb on SMM entry, that triggered a consistency check because
> > we do clear EFLAGS.IF.
> > Preserving the interrupt shadow is just nice to have.
> > 
> > 
> > That what Intel's spec says for the 'STI':
> > 
> > "The IF flag and the STI and CLI instructions do not prohibit the generation of exceptions and nonmaskable inter-
> > rupts (NMIs). However, NMIs (and system-management interrupts) may be inhibited on the instruction boundary
> > following an execution of STI that begins with IF = 0."
> > 
> > Thus it is likely that #SMI are just blocked when in shadow, but it is easier to implement
> > it this way (avoids single stepping the guest) and without any user visable difference,
> > which I noted in the patch description, I noted that there are two ways to solve this,
> > and preserving the int shadow in SMRAM is just more simple way.
> 
> It's not true that there is no user-visible difference. In your
> implementation, the SMI handler can see that the interrupt was
> delivered in the interrupt shadow.

Most of the SMI save state area is reserved, and the handler has no way of knowing
what CPU stored there, it can only access the fields that are reserved in the spec.

Yes, if the SMI handler really insists it can see that the saved RIP points to an
instruction that follows the STI, but does that really matter? It is allowed by the
spec explicitly anyway.

Plus our SMI layout (at least for 32 bit) doesn't conform to the x86 spec anyway;
as I found out, we flat out write over fields that have other meanings in the x86 spec.

Also, I proposed to preserve the int shadow in internal KVM state and migrate
it in the upper 4 bits of the 'shadow' field of struct kvm_vcpu_events.
Both Paolo and Sean proposed to store the int shadow in SMRAM instead,
and you didn't object to this, and now, after I refactored and implemented
the whole thing, you suddenly do.
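
For reference, the alternative mentioned above (carrying the pre-SMI shadow in the upper 4 bits of an 8-bit 'shadow' field) could look something like the sketch below. This is only an illustration of the bit packing: the helper names are hypothetical, the flag values are assumed to mirror the MOV SS / STI shadow bits, and nothing here is an existing KVM interface.

```c
#include <stdint.h>

/* Assumed flag values for the live interrupt shadow (low nibble). */
#define TOY_SHADOW_INT_MOV_SS 0x01
#define TOY_SHADOW_INT_STI    0x02

/* Hypothetical: the shadow saved at SMM entry lives in the high nibble. */
#define TOY_SAVED_SHADOW_SHIFT 4

/* Pack the live shadow and the shadow saved at SMM entry into one byte. */
static uint8_t toy_pack_shadow(uint8_t live, uint8_t saved)
{
	return (uint8_t)((live & 0x0f) |
			 ((saved & 0x0f) << TOY_SAVED_SHADOW_SHIFT));
}

/* Extract the live shadow (low nibble). */
static uint8_t toy_live_shadow(uint8_t field)
{
	return field & 0x0f;
}

/* Extract the shadow that was saved at SMM entry (high nibble). */
static uint8_t toy_saved_shadow(uint8_t field)
{
	return (uint8_t)(field >> TOY_SAVED_SHADOW_SHIFT);
}
```

While in SMM after an #SMI that landed in an STI shadow, such a field would hold `toy_pack_shadow(0, TOY_SHADOW_INT_STI)`: no live shadow, with the STI shadow parked in the high nibble until RSM.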

BTW, just FYI, I found out that qemu doesn't migrate the 'shadow' field,
this needs to be fixed (not related to the issue, just FYI).

> 
> The right fix for this problem is to block SMI in an interrupt shadow,
> as is likely the case for all modern CPUs.

Yes, I agree that this is the most correct fix. 

However AMD just recently posted a VNMI patch series to avoid
single stepping the CPU when NMI is blocked due to the same reason, because
it is fragile.

Do you really want KVM to single step the guest in this case, to deliver the #SMI?
I can do it, but it is bound to cause a lot of trouble.

Note that I will have to do it on both Intel and AMD, as neither has support for SMI
window, unless I were to use MTF, which is broken on nested virt as you know,
so a nested hypervisor running a guest with SMI will now have to cope with broken MTF.

Note that I can't use the VIRQ hack we use for interrupt window, because there
is no guarantee that the guest's EFLAGS.IF is on.

Best regards,	
	Maxim Levitsky

> 
> > 
> > As for CPUS that neither block SMI nor preserve the int shadaw, in theory they can, but that would
> > break things, as noted in this mail
> > 
> > https://lore.kernel.org/lkml/1284913699-14986-1-git-send-email-avi@redhat.com/
> > 
> > It is possible though that real cpu supports HLT restart flag, which makes this a non issue,
> > still. I can't rule out that a real cpu doesn't preserve the interrupt shadow on SMI, but
> > I don't see why we can't do this to make things more robust.
> 
> Because, as I said, I think you're just making stuff up...unless, of
> course, you have documentation to back this up.
>
Maxim Levitsky July 5, 2022, 1:40 p.m. UTC | #5
On Tue, 2022-07-05 at 16:38 +0300, Maxim Levitsky wrote:
> On Thu, 2022-06-30 at 09:00 -0700, Jim Mattson wrote:
> > On Wed, Jun 29, 2022 at 11:00 PM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > 
> > > On Wed, 2022-06-29 at 09:31 -0700, Jim Mattson wrote:
> > > > On Tue, Jun 21, 2022 at 8:09 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > > > When #SMI is asserted, the CPU can be in interrupt shadow
> > > > > due to sti or mov ss.
> > > > > 
> > > > > It is not mandatory in  Intel/AMD prm to have the #SMI
> > > > > blocked during the shadow, and on top of
> > > > > that, since neither SVM nor VMX has true support for SMI
> > > > > window, waiting for one instruction would mean single stepping
> > > > > the guest.
> > > > > 
> > > > > Instead, allow #SMI in this case, but both reset the interrupt
> > > > > window and stash its value in SMRAM to restore it on exit
> > > > > from SMM.
> > > > > 
> > > > > This fixes rare failures seen mostly on windows guests on VMX,
> > > > > when #SMI falls on the sti instruction which mainfest in
> > > > > VM entry failure due to EFLAGS.IF not being set, but STI interrupt
> > > > > window still being set in the VMCS.
> > > > 
> > > > I think you're just making stuff up! See Note #5 at
> > > > https://sandpile.org/x86/inter.htm.
> > > > 
> > > > Can you reference the vendors' documentation that supports this change?
> > > > 
> > > 
> > > First of all, just to note that the actual issue here was that
> > > we don't clear the shadow bits in the guest interruptability field
> > > in the vmcb on SMM entry, that triggered a consistency check because
> > > we do clear EFLAGS.IF.
> > > Preserving the interrupt shadow is just nice to have.
> > > 
> > > 
> > > That what Intel's spec says for the 'STI':
> > > 
> > > "The IF flag and the STI and CLI instructions do not prohibit the generation of exceptions and nonmaskable inter-
> > > rupts (NMIs). However, NMIs (and system-management interrupts) may be inhibited on the instruction boundary
> > > following an execution of STI that begins with IF = 0."
> > > 
> > > Thus it is likely that #SMI are just blocked when in shadow, but it is easier to implement
> > > it this way (avoids single stepping the guest) and without any user visable difference,
> > > which I noted in the patch description, I noted that there are two ways to solve this,
> > > and preserving the int shadow in SMRAM is just more simple way.
> > 
> > It's not true that there is no user-visible difference. In your
> > implementation, the SMI handler can see that the interrupt was
> > delivered in the interrupt shadow.
> 
> Most of the SMI save state area is reserved, and the handler has no way of knowing
> what CPU stored there, it can only access the fields that are reserved in the spec.
I mean fields that are not reserved in the spec.

Best regards,
	Maxim Levitsky
> 
> Yes, if the SMI handler really insists it can see that the saved RIP points to an
> instruction that follows the STI, but does that really matter? It is allowed by the
> spec explicitly anyway.
> 
> Plus our SMI layout (at least for 32 bit) doesn't confirm to the X86 spec anyway,
> we as I found out flat out write over the fields that have other meaning in the X86 spec.
> 
> Also I proposed to preserve the int shadow in internal kvm state and migrate
> it in upper 4 bits of the 'shadow' field of struct kvm_vcpu_events.
> Both Paolo and Sean proposed to store the int shadow in the SMRAM instead,
> and you didn't object to this, and now after I refactored and implemented
> the whole thing you suddently do.
> 
> BTW, just FYI, I found out that qemu doesn't migrate the 'shadow' field,
> this needs to be fixed (not related to the issue, just FYI).
> 
> > 
> > The right fix for this problem is to block SMI in an interrupt shadow,
> > as is likely the case for all modern CPUs.
> 
> Yes, I agree that this is the most correct fix. 
> 
> However AMD just recently posted a VNMI patch series to avoid
> single stepping the CPU when NMI is blocked due to the same reason, because
> it is fragile.
> 
> Do you really want KVM to single step the guest in this case, to deliver the #SMI?
> I can do it, but it is bound to cause lot of trouble.
> 
> Note that I will have to do it on both Intel and AMD, as neither has support for SMI
> window, unless I were to use MTF, which is broken on nested virt as you know,
> so a nested hypervisor running a guest with SMI will now have to cope with broken MTF.
> 
> Note that I can't use the VIRQ hack we use for interrupt window, because there
> is no guarantee that the guest's EFLAGS.IF is on.
> 
> Best regards,   
>         Maxim Levitsky
> 
> > 
> > > 
> > > As for CPUS that neither block SMI nor preserve the int shadaw, in theory they can, but that would
> > > break things, as noted in this mail
> > > 
> > > https://lore.kernel.org/lkml/1284913699-14986-1-git-send-email-avi@redhat.com/
> > > 
> > > It is possible though that real cpu supports HLT restart flag, which makes this a non issue,
> > > still. I can't rule out that a real cpu doesn't preserve the interrupt shadow on SMI, but
> > > I don't see why we can't do this to make things more robust.
> > 
> > Because, as I said, I think you're just making stuff up...unless, of
> > course, you have documentation to back this up.
> > 
>
Maxim Levitsky July 5, 2022, 1:51 p.m. UTC | #6
On Tue, 2022-07-05 at 16:40 +0300, Maxim Levitsky wrote:
> On Tue, 2022-07-05 at 16:38 +0300, Maxim Levitsky wrote:
> > On Thu, 2022-06-30 at 09:00 -0700, Jim Mattson wrote:
> > > On Wed, Jun 29, 2022 at 11:00 PM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > > 
> > > > On Wed, 2022-06-29 at 09:31 -0700, Jim Mattson wrote:
> > > > > On Tue, Jun 21, 2022 at 8:09 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > > > > When #SMI is asserted, the CPU can be in interrupt shadow
> > > > > > due to sti or mov ss.
> > > > > > 
> > > > > > It is not mandatory in  Intel/AMD prm to have the #SMI
> > > > > > blocked during the shadow, and on top of
> > > > > > that, since neither SVM nor VMX has true support for SMI
> > > > > > window, waiting for one instruction would mean single stepping
> > > > > > the guest.
> > > > > > 
> > > > > > Instead, allow #SMI in this case, but both reset the interrupt
> > > > > > window and stash its value in SMRAM to restore it on exit
> > > > > > from SMM.
> > > > > > 
> > > > > > This fixes rare failures seen mostly on windows guests on VMX,
> > > > > > when #SMI falls on the sti instruction which mainfest in
> > > > > > VM entry failure due to EFLAGS.IF not being set, but STI interrupt
> > > > > > window still being set in the VMCS.
> > > > > 
> > > > > I think you're just making stuff up! See Note #5 at
> > > > > https://sandpile.org/x86/inter.htm.
> > > > > 
> > > > > Can you reference the vendors' documentation that supports this change?
> > > > > 
> > > > 
> > > > First of all, just to note that the actual issue here was that
> > > > we don't clear the shadow bits in the guest interruptability field
> > > > in the vmcb on SMM entry, that triggered a consistency check because
> > > > we do clear EFLAGS.IF.
> > > > Preserving the interrupt shadow is just nice to have.
> > > > 
> > > > 
> > > > That what Intel's spec says for the 'STI':
> > > > 
> > > > "The IF flag and the STI and CLI instructions do not prohibit the generation of exceptions and nonmaskable inter-
> > > > rupts (NMIs). However, NMIs (and system-management interrupts) may be inhibited on the instruction boundary
> > > > following an execution of STI that begins with IF = 0."
> > > > 
> > > > Thus it is likely that #SMI are just blocked when in shadow, but it is easier to implement
> > > > it this way (avoids single stepping the guest) and without any user visable difference,
> > > > which I noted in the patch description, I noted that there are two ways to solve this,
> > > > and preserving the int shadow in SMRAM is just more simple way.
> > > 
> > > It's not true that there is no user-visible difference. In your
> > > implementation, the SMI handler can see that the interrupt was
> > > delivered in the interrupt shadow.
> > 
> > Most of the SMI save state area is reserved, and the handler has no way of knowing
> > what CPU stored there, it can only access the fields that are reserved in the spec.
> I mean fields that are not reserved in the spec.
> 
> Best regards,
>         Maxim Levitsky
> > 
> > Yes, if the SMI handler really insists it can see that the saved RIP points to an
> > instruction that follows the STI, but does that really matter? It is allowed by the
> > spec explicitly anyway.
> > 
> > Plus our SMI layout (at least for 32 bit) doesn't confirm to the X86 spec anyway,
> > we as I found out flat out write over the fields that have other meaning in the X86 spec.
> > 
> > Also I proposed to preserve the int shadow in internal kvm state and migrate
> > it in upper 4 bits of the 'shadow' field of struct kvm_vcpu_events.
> > Both Paolo and Sean proposed to store the int shadow in the SMRAM instead,
> > and you didn't object to this, and now after I refactored and implemented
> > the whole thing you suddently do.
> > 
> > BTW, just FYI, I found out that qemu doesn't migrate the 'shadow' field,
> > this needs to be fixed (not related to the issue, just FYI).
> > 
> > > 
> > > The right fix for this problem is to block SMI in an interrupt shadow,
> > > as is likely the case for all modern CPUs.
> > 
> > Yes, I agree that this is the most correct fix. 
> > 
> > However AMD just recently posted a VNMI patch series to avoid
> > single stepping the CPU when NMI is blocked due to the same reason, because
> > it is fragile.
> > 
> > Do you really want KVM to single step the guest in this case, to deliver the #SMI?
> > I can do it, but it is bound to cause lot of trouble.
> > 
> > Note that I will have to do it on both Intel and AMD, as neither has support for SMI
> > window, unless I were to use MTF, which is broken on nested virt as you know,
> > so a nested hypervisor running a guest with SMI will now have to cope with broken MTF.
> > 
> > Note that I can't use the VIRQ hack we use for interrupt window, because there
> > is no guarantee that the guest's EFLAGS.IF is on.
> > 
> > Best regards,   
> >         Maxim Levitsky
> > 
> > > 
> > > > 
> > > > As for CPUS that neither block SMI nor preserve the int shadaw, in theory they can, but that would
> > > > break things, as noted in this mail
> > > > 
> > > > https://lore.kernel.org/lkml/1284913699-14986-1-git-send-email-avi@redhat.com/
> > > > 
> > > > It is possible though that real cpu supports HLT restart flag, which makes this a non issue,
> > > > still. I can't rule out that a real cpu doesn't preserve the interrupt shadow on SMI, but
> > > > I don't see why we can't do this to make things more robust.
> > > 
> > > Because, as I said, I think you're just making stuff up...unless, of
> > > course, you have documentation to back this up.

Again, I clearly explained that I chose to implement it this way because it
is more robust, _and_ it was approved by both Sean and Paolo.

It is not called making stuff up - I never claimed that a real
CPU does it this way.

Best regards,
	Maxim Levitsky

Jim Mattson July 6, 2022, 6:13 p.m. UTC | #7
On Tue, Jul 5, 2022 at 6:38 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:

> Most of the SMI save state area is reserved, and the handler has no way of knowing
> what CPU stored there, it can only access the fields that are reserved in the spec.
>
> Yes, if the SMI handler really insists it can see that the saved RIP points to an
> instruction that follows the STI, but does that really matter? It is allowed by the
> spec explicitly anyway.

I was just pointing out that the difference between blocking SMI and
not blocking SMI is, in fact, observable.

> Plus our SMI layout (at least for 32 bit) doesn't confirm to the X86 spec anyway,
> we as I found out flat out write over the fields that have other meaning in the X86 spec.

Shouldn't we fix that?

> Also I proposed to preserve the int shadow in internal kvm state and migrate
> it in upper 4 bits of the 'shadow' field of struct kvm_vcpu_events.
> Both Paolo and Sean proposed to store the int shadow in the SMRAM instead,
> and you didn't object to this, and now after I refactored and implemented
> the whole thing you suddently do.

I did not see the prior conversations. I rarely get an opportunity to
read the list.

> However AMD just recently posted a VNMI patch series to avoid
> single stepping the CPU when NMI is blocked due to the same reason, because
> it is fragile.

The vNMI feature isn't available in any shipping processor yet, is it?

> Do you really want KVM to single step the guest in this case, to deliver the #SMI?
> I can do it, but it is bound to cause lot of trouble.

Perhaps you could document this as a KVM erratum...one of many
involving virtual SMI delivery.
Maxim Levitsky July 6, 2022, 8 p.m. UTC | #8
On Wed, 2022-07-06 at 11:13 -0700, Jim Mattson wrote:
> On Tue, Jul 5, 2022 at 6:38 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> 
> > Most of the SMI save state area is reserved, and the handler has no way of knowing
> > what CPU stored there, it can only access the fields that are reserved in the spec.
> > 
> > Yes, if the SMI handler really insists it can see that the saved RIP points to an
> > instruction that follows the STI, but does that really matter? It is allowed by the
> > spec explicitly anyway.
> 
> I was just pointing out that the difference between blocking SMI and
> not blocking SMI is, in fact, observable.

Yes, and I agree, I should have said that while observable,
it should cause no problem.


> 
> > Plus our SMI layout (at least for 32 bit) doesn't confirm to the X86 spec anyway,
> > we as I found out flat out write over the fields that have other meaning in the X86 spec.
> 
> Shouldn't we fix that?
I am afraid we can't, because that would (in theory) break backward compatibility
(e.g. if someone migrates a VM while in SMM).

Plus this is only for the 32 bit layout, which is only used when the guest has no long
mode in CPUID, which these days is only the case with 32 bit qemu.

(I found this out the hard way when SMM with a nested guest didn't work
for me on 32 bit. It was because KVM doesn't bother to save/restore the
running nested guest vmcb address when we use the 32 bit SMM layout, which makes
sense because truly 32 bit only AMD CPUs likely didn't have SVM.)

But then, after looking at the SDM, I also found out that Intel and AMD have completely
different SMM layouts for 64 bit. We follow AMD's layout, but we don't
implement many fields, including some that are barely documented or not documented at all
(e.g. what is svm_guest_virtual_int?).

In theory we could use Intel's layout when we run with Intel's vendor ID,
and AMD's otherwise, but we probably won't bother, and once again there
is the issue of backward compatibility.

Feel free to look at the patch series; I fully documented the SMRAM layout
that KVM uses, including all the places where it differs from the real
thing.


> 
> > Also I proposed to preserve the int shadow in internal kvm state and migrate
> > it in upper 4 bits of the 'shadow' field of struct kvm_vcpu_events.
> > Both Paolo and Sean proposed to store the int shadow in the SMRAM instead,
> > and you didn't object to this, and now after I refactored and implemented
> > the whole thing you suddently do.
> 
> I did not see the prior conversations. I rarely get an opportunity to
> read the list.
I understand.

> 
> > However AMD just recently posted a VNMI patch series to avoid
> > single stepping the CPU when NMI is blocked due to the same reason, because
> > it is fragile.
> 
> The vNMI feature isn't available in any shipping processor yet, is it?
Yes, but one of its purposes is to avoid single stepping the guest,
which is especially painful on AMD, because there is no MTF, so
you have to 'borrow' the TF flag in the EFLAGS, and that can leak into
the guest state (e.g pushed onto the stack).
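
The TF leak described above can be shown with a toy model. The sketch assumes a hypervisor that sets EFLAGS.TF to single step one guest instruction and restores the guest's own TF value afterwards; if that instruction happens to push EFLAGS, the borrowed TF ends up in guest memory. `struct toy_cpu` and both functions are hypothetical and do not model real trap delivery.

```c
#include <stdint.h>

#define X86_EFLAGS_TF (1u << 8)

/* Hypothetical toy CPU with a tiny stack; no bounds checking, sketch only. */
struct toy_cpu {
	uint32_t eflags;
	uint32_t stack[8];
	unsigned int sp;	/* index of the next free stack slot */
};

/* Toy PUSHF: copy the current EFLAGS value onto the stack. */
static void toy_pushf(struct toy_cpu *cpu)
{
	cpu->stack[cpu->sp++] = cpu->eflags;
}

/*
 * Single step one guest PUSHF with a borrowed TF and return the value
 * that landed on the guest stack.
 */
static uint32_t toy_step_pushf_with_borrowed_tf(struct toy_cpu *cpu)
{
	uint32_t guest_tf = cpu->eflags & X86_EFLAGS_TF;

	cpu->eflags |= X86_EFLAGS_TF;		/* hypervisor borrows TF */
	toy_pushf(cpu);				/* the stepped guest instruction */
	cpu->eflags = (cpu->eflags & ~X86_EFLAGS_TF) | guest_tf; /* restore */

	return cpu->stack[cpu->sp - 1];
}
```

After the step, the register copy of EFLAGS is back to what the guest had, but the stack copy still has TF set, which is the kind of leak that makes this technique fragile.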


> 
> > Do you really want KVM to single step the guest in this case, to deliver the #SMI?
> > I can do it, but it is bound to cause lot of trouble.
> 
> Perhaps you could document this as a KVM erratum...one of many
> involving virtual SMI delivery.

Absolutely, I can document that we chose to save/restore the int shadow in
SMRAM, something that CPUs usually don't really do, but which happens to be the best way
to deal with this corner case.

(Actually, looking at the clause on default treatment of SMIs in Intel's PRM,
they do mention that they preserve the int shadow somewhere, at least
on some Intel CPUs.)


BTW, according to my observations, it is really hard to hit this problem,
because it looks like when the CPU is in interrupt shadow, it doesn't process
_real_ interrupts either, despite the fact that in a VM, real interrupts
should never be blocked(*). Yet that is what I observed on both AMD and Intel.

(*) You can allow the guest to control the real EFLAGS.IF on both VMX and SVM
(in which case int shadow should indeed work as on bare metal),
but KVM of course doesn't do it.

I observed that when KVM sends #SMI from other vCPU, it sends a vCPU kick,
and the kick never arrives inside the interrupt shadow.
I have seen it on both VMX and SVM.

What still triggers this problem is that the instruction in the interrupt
shadow can still get a VM exit (e.g. an EPT/NPT violation), and then KVM can notice
the pending SMI.

I think it has to be an EPT/NPT violation, btw, because IMHO most if not all other VM exits
are instruction intercepts, which cause KVM to emulate the instruction
and clear the interrupt shadow, and only after that will it enter SMM.

Even MMIO/IOPORT access is emulated by KVM.

That's not the case with an EPT/NPT violation, because KVM will in this case re-execute
the instruction after it 'fixes' the fault.

Best regards,
	Maxim Levitsky



>
Jim Mattson July 6, 2022, 8:38 p.m. UTC | #9
On Wed, Jul 6, 2022 at 1:00 PM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Wed, 2022-07-06 at 11:13 -0700, Jim Mattson wrote:
> > On Tue, Jul 5, 2022 at 6:38 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
...
> > > Plus our SMI layout (at least for 32 bit) doesn't confirm to the X86 spec anyway,
> > > we as I found out flat out write over the fields that have other meaning in the X86 spec.
> >
> > Shouldn't we fix that?
> I am afraid we can't because that will break (in theory) the backward compatibility
> (e.g if someone migrates a VM while in SMM).

Every time someone says, "We can't fix this, because it breaks
backward compatibility," I think, "Another potential use of
KVM_CAP_DISABLE_QUIRKS2?"

...
> But then after looking at SDM I also found out that Intel and AMD have completely
> different SMM layout for 64 bit. We follow the AMD's layout, but we don't
> implement many fields, including some that are barely/not documented.
> (e.g what is svm_guest_virtual_int?)
>
> In theory we could use Intel's layout when we run with Intel's vendor ID,
> and AMD's vise versa, but we probably won't bother + once again there
> is an issue of backward compatibility.

This seems pretty egregious, since the SDM specifically states, "Some
of the registers in the SMRAM state save area (marked YES in column 3)
may be read and changed by the SMI handler, with the changed values
restored to the processor registers by the RSM instruction." How can
that possibly work with AMD's layout?
(See my comment above regarding backwards compatibility.)

<soapbox>I wish KVM would stop offering virtual CPU features that are
completely broken.</soapbox>

> > The vNMI feature isn't available in any shipping processor yet, is it?
> Yes, but one of its purposes is to avoid single stepping the guest,
> which is especially painful on AMD, because there is no MTF, so
> you have to 'borrow' the TF flag in the EFLAGS, and that can leak into
> the guest state (e.g pushed onto the stack).

So, what's the solution for all of today's SVM-capable processors? KVM
will probably be supporting AMD CPUs without vNMI for the next decade
or two.


> (Actually looking at clause of default treatment of SMIs in Intel's PRM,
> they do mention that they preserve the int shadow somewhere at least
> on some Intel's CPUs).

Yes, this is a required part of VMX-critical state for processors that
support SMI recognition while there is blocking by STI or by MOV SS.
However, I don't believe that KVM actually saves VMX-critical state on
delivery of a virtual SMI.

>
> BTW, according to my observations, it is really hard to hit this problem,
> because it looks like when the CPU is in interrupt shadow, it doesn't process
> _real_ interrupts as well (despite the fact that in VM, real interrupts
> should never be blocked(*), but yet, that is what I observed on both AMD and Intel.
>
> (*) You can allow the guest to control the real EFLAGS.IF on both VMX and SVM,
> (in which case int shadow should indeed work as on bare metal)
> but KVM of course doesn't do it.

It doesn't surprise me that hardware treats a virtual interrupt shadow
as a physical interrupt shadow. IIRC, each vendor has a way of
breaking an endless chain of interrupt shadows, so a malicious guest
can't defer interrupts indefinitely.

> I observed that when KVM sends #SMI from other vCPU, it sends a vCPU kick,
> and the kick never arrives inside the interrupt shadow.
> I have seen it on both VMX and SVM.
>
> What still triggers this problem, is that the instruction which is in the interrupt
> shadow can still get a VM exit, (e.g EPT/NPT violation) and then it can notice
> the pending SMI.
>
> I think it has to be EPT/NPT violation btw, because, IMHO most if not all other VM exits I
> think are instruction intercepts, which will cause KVM to emulate the instruction
> and clear the interrupt shadow, and only after that it will enter SMM.
>
> Even MMIO/IOPORT access is emulated by the KVM.
>
> Its not the case with EPT/NPT violation, because the KVM will in this case re-execute
> the instruction after it 'fixes' the fault.

Probably #PF as well, then, if TDP is disabled.
Maxim Levitsky July 10, 2022, 4:05 p.m. UTC | #10
On Wed, 2022-07-06 at 13:38 -0700, Jim Mattson wrote:
> On Wed, Jul 6, 2022 at 1:00 PM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > On Wed, 2022-07-06 at 11:13 -0700, Jim Mattson wrote:
> > > On Tue, Jul 5, 2022 at 6:38 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> ...
> > > > Plus our SMI layout (at least for 32 bit) doesn't confirm to the X86 spec anyway,
> > > > we as I found out flat out write over the fields that have other meaning in the X86 spec.
> > > 
> > > Shouldn't we fix that?
> > I am afraid we can't because that will break (in theory) the backward compatibility
> > (e.g if someone migrates a VM while in SMM).
> 
> Every time someone says, "We can't fix this, because it breaks
> backward compatibility," I think, "Another potential use of
> KVM_CAP_DISABLE_QUIRKS2?"
> 
> ...
> > But then after looking at SDM I also found out that Intel and AMD have completely
> > different SMM layout for 64 bit. We follow the AMD's layout, but we don't
> > implement many fields, including some that are barely/not documented.
> > (e.g what is svm_guest_virtual_int?)
> > 
> > In theory we could use Intel's layout when we run with Intel's vendor ID,
> > and AMD's vise versa, but we probably won't bother + once again there
> > is an issue of backward compatibility.
> 
> This seems pretty egregious, since the SDM specifically states, "Some
> of the registers in the SMRAM state save area (marked YES in column 3)
> may be read and changed by the
> SMI handler, with the changed values restored to the processor
> registers by the RSM instruction." How can that possibly work with
> AMD's layout?
> (See my comment above regarding backwards compatibility.)
> 
> <soapbox>I wish KVM would stop offering virtual CPU features that are
> completely broken.</soapbox>
> 
> > > The vNMI feature isn't available in any shipping processor yet, is it?
> > Yes, but one of its purposes is to avoid single stepping the guest,
> > which is especially painful on AMD, because there is no MTF, so
> > you have to 'borrow' the TF flag in the EFLAGS, and that can leak into
> > the guest state (e.g pushed onto the stack).
> 
> So, what's the solution for all of today's SVM-capable processors? KVM
> will probably be supporting AMD CPUs without vNMI for the next decade
> or two.

I did some homework on this a few months ago, so here it goes:

First of all, let's assume that GIF is set, because when it is clear we just
intercept STGI and deliver the #NMI there. Same for #SMI.
In other words, GIF is the easy case with regard to the interrupt window.

So it works like this:

When we inject #NMI, we enable the IRET intercept (in svm_inject_nmi).
As long as we haven't hit IRET, the IRET intercept is what will signal our
NMI window, so enable_nmi_window does nothing.

We also mark this situation with

vcpu->arch.hflags |= HF_NMI_MASK;

This means that we are in NMI, but haven't yet
seen IRET.

When we hit the IRET interception, which is a fault-like interception,
we are still in NMI until IRET completes.

We mark this situation with

vcpu->arch.hflags |= HF_IRET_MASK;

Now both HF_NMI_MASK and HF_IRET_MASK are set.


If at that point someone enables the NMI window,
the NMI window code (enable_nmi_window) detects
(HF_NMI_MASK | HF_IRET_MASK), enables single stepping,
and remembers the current RIP.


Finally, svm_complete_interrupts (which is called on each VM exit)
checks the HF_IRET_MASK flag; if it is set and RIP is not the same as
it was when we enabled single stepping, it clears HF_NMI_MASK
and raises KVM_REQ_EVENT to possibly inject another NMI.

Of course, if for example the IRET gets an exception (or even an interrupt,
since EFLAGS.IF can be set), then the TF flag we force-enabled will be pushed
onto the exception stack and leaked to the guest, which is not nice.


Note that the same problem doesn't happen with STGI interception,
because unlike IRET, we fully emulate STGI, so once its emulation completes,
the NMI window is open.

If we could fully emulate IRET, we could do the same with it as well,
but it is hard, and of course in the case of skipping over the interrupt shadow,
we would have to emulate *any* instruction which happens to be there,
which is not feasible at all for KVM's emulator.


That also doesn't work with SEV-ES, due to the encrypted nature of the guest
(but then emulated SMM won't work either); I guess this is another reason
for the vNMI feature.

TL;DR - on #NMI injection we intercept IRET and rely on its interception
to signal that the NMI window is about to open, but this still leaves a short
window of executing the IRET itself during which NMIs are still blocked,
so we have to single step over it.

Note that there is no issue with interrupt shadow here because NMI doesn't
respect it.
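
To make the sequence above easier to follow, here is a toy model of it in
plain C. It is only a sketch under stated assumptions, not KVM's code: the
toy_* names, the struct layout and the flag values are made up for
illustration, and clearing HF_IRET_MASK together with HF_NMI_MASK at the end
is my assumption (the description above only mentions HF_NMI_MASK).

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-ins for the hflags bits; the values are arbitrary. */
#define HF_NMI_MASK  (1u << 0)  /* NMI injected, further NMIs blocked */
#define HF_IRET_MASK (1u << 1)  /* IRET intercepted but not yet completed */

/* Hypothetical, heavily reduced vCPU state. */
struct toy_vcpu {
	uint32_t hflags;
	uint64_t rip;
	uint64_t single_step_rip;  /* RIP when single stepping was enabled */
	bool iret_intercept;
	bool single_step;
	bool req_event;            /* models raising KVM_REQ_EVENT */
};

/* On NMI injection: block NMIs and start intercepting IRET. */
static void toy_inject_nmi(struct toy_vcpu *v)
{
	v->hflags |= HF_NMI_MASK;
	v->iret_intercept = true;
}

/* The IRET intercept is fault-like: RIP still points at the IRET. */
static void toy_iret_interception(struct toy_vcpu *v)
{
	v->iret_intercept = false;
	v->hflags |= HF_IRET_MASK;
}

static void toy_enable_nmi_window(struct toy_vcpu *v)
{
	/* IRET not seen yet: its intercept will signal the window. */
	if ((v->hflags & (HF_NMI_MASK | HF_IRET_MASK)) == HF_NMI_MASK)
		return;

	/* IRET is in flight: single step over it. */
	v->single_step_rip = v->rip;
	v->single_step = true;
}

/* Called on every VM exit. */
static void toy_complete_interrupts(struct toy_vcpu *v)
{
	if ((v->hflags & HF_IRET_MASK) && v->rip != v->single_step_rip) {
		/* RIP moved past the IRET: the NMI window is open. */
		v->hflags &= ~(HF_NMI_MASK | HF_IRET_MASK);
		v->single_step = false;
		v->req_event = true;
	}
}
```

The interesting case is a VM exit that happens while RIP still points at the
IRET: toy_complete_interrupts() leaves everything alone, and NMIs stay blocked
until a later exit sees a different RIP.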



> 
> 
> > (Actually looking at clause of default treatment of SMIs in Intel's PRM,
> > they do mention that they preserve the int shadow somewhere at least
> > on some Intel's CPUs).
> 
> Yes, this is a required part of VMX-critical state for processors that
> support SMI recognition while there is blocking by STI or by MOV SS.
> However, I don't believe that KVM actually saves VMX-critical state on
> delivery of a virtual SMI.

Yes, but that does suggest that older CPUs which allowed SMI in interrupt
shadow did preserve it *somewhere*. It's also not a spec violation to preserve it
in this way.
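
For reference, "this way" boils down to stash, clear, restore. Below is a
minimal sketch of that idea in plain C, assuming a made-up, heavily reduced
state: struct toy_smram, struct toy_vcpu_smm and the toy_* functions are
hypothetical and are not the real SMRAM layout or KVM's code.

```c
#include <stdint.h>

/* Hypothetical one-field SMRAM image; the real layouts are much larger. */
struct toy_smram {
	uint32_t int_shadow;
};

/* Hypothetical reduced vCPU state. */
struct toy_vcpu_smm {
	uint8_t int_shadow;  /* non-zero while blocked by STI or MOV SS */
	int in_smm;
};

/* On #SMI: stash the shadow in SMRAM, then enter SMM without it. */
static void toy_enter_smm(struct toy_vcpu_smm *v, struct toy_smram *s)
{
	s->int_shadow = v->int_shadow;
	v->int_shadow = 0;
	v->in_smm = 1;
}

/* On RSM: bring the stashed shadow back and leave SMM. */
static void toy_rsm(struct toy_vcpu_smm *v, const struct toy_smram *s)
{
	v->int_shadow = (uint8_t)s->int_shadow;
	v->in_smm = 0;
}
```

The point of clearing on entry is that the SMI handler never runs with a
stale shadow, while the interrupted instruction stream gets it back on RSM.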

> 
> > BTW, according to my observations, it is really hard to hit this problem,
> > because it looks like when the CPU is in interrupt shadow, it doesn't process
> > _real_ interrupts as well (despite the fact that in VM, real interrupts
> > should never be blocked(*), but yet, that is what I observed on both AMD and Intel.
> > 
> > (*) You can allow the guest to control the real EFLAGS.IF on both VMX and SVM,
> > (in which case int shadow should indeed work as on bare metal)
> > but KVM of course doesn't do it.
> 
> It doesn't surprise me that hardware treats a virtual interrupt shadow
> as a physical interrupt shadow. IIRC, each vendor has a way of
> breaking an endless chain of interrupt shadows, so a malicious guest
> can't defer interrupts indefinitely.

Thankfully, I think a malicious guest can't abuse the STI interrupt shadow this way,
because the STI interrupt shadow is only valid if the STI actually enables EFLAGS.IF.
If it was already set, there is no shadow.

I don't know how they deal with repeated MOV SS instructions. Maybe this one
doesn't enable a real interrupt shadow, or doesn't enable the shadow if
the shadow is already enabled, I don't know.
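
The STI rule above can be modelled in a few lines of C. This is only a sketch
of the claim as stated (shadow only when STI flips EFLAGS.IF from 0 to 1);
struct toy_cpu and the toy_* helpers are hypothetical names, not anything from
KVM or from the vendor manuals.

```c
#include <stdbool.h>

/* Hypothetical two-bit CPU model. */
struct toy_cpu {
	bool eflags_if;
	bool sti_shadow;
};

/* STI: the shadow arises only if IF was clear before the instruction. */
static void toy_sti(struct toy_cpu *c)
{
	c->sti_shadow = !c->eflags_if;
	c->eflags_if = true;
}

/* Any other instruction retiring ends the shadow. */
static void toy_retire_other(struct toy_cpu *c)
{
	c->sti_shadow = false;
}

/* Maskable interrupts need IF set and no shadow in effect. */
static bool toy_can_take_irq(const struct toy_cpu *c)
{
	return c->eflags_if && !c->sti_shadow;
}
```

In this model a second back-to-back STI finds IF already set, so it creates no
shadow and interrupts can be taken right after it, which is why a chain of
STIs can't defer them indefinitely.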

> 
> > I observed that when KVM sends #SMI from other vCPU, it sends a vCPU kick,
> > and the kick never arrives inside the interrupt shadow.
> > I have seen it on both VMX and SVM.
> > 
> > What still triggers this problem, is that the instruction which is in the interrupt
> > shadow can still get a VM exit, (e.g EPT/NPT violation) and then it can notice
> > the pending SMI.
> > 
> > I think it has to be EPT/NPT violation btw, because, IMHO most if not all other VM exits I
> > think are instruction intercepts, which will cause KVM to emulate the instruction
> > and clear the interrupt shadow, and only after that it will enter SMM.
> > 
> > Even MMIO/IOPORT access is emulated by the KVM.
> > 
> > Its not the case with EPT/NPT violation, because the KVM will in this case re-execute
> > the instruction after it 'fixes' the fault.
> 
> Probably #PF as well, then, if TDP is disabled.

Yep, no doubt about it.

Also, come to think of it, we intercept #AC as well and just forward it to the guest,
and since we let the instruction be re-executed, that won't clear the interrupt
shadow either.
#UD is also intercepted, and if the emulator can't emulate the instruction, it should
also be forwarded to the guest. That gives me an idea to improve my test by sticking
a UD2 there.
I'll take a look.


Best regards,
	Maxim Levitsky

>

Patch

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 7a3a042d6b862a..d4ede5216491ad 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2443,7 +2443,7 @@  static int rsm_load_state_32(struct x86_emulate_ctxt *ctxt,
 			     struct kvm_smram_state_32 *smstate)
 {
 	struct desc_ptr dt;
-	int i;
+	int i, r;
 
 	ctxt->eflags =  smstate->eflags | X86_EFLAGS_FIXED;
 	ctxt->_eip =  smstate->eip;
@@ -2478,8 +2478,16 @@  static int rsm_load_state_32(struct x86_emulate_ctxt *ctxt,
 
 	ctxt->ops->set_smbase(ctxt, smstate->smbase);
 
-	return rsm_enter_protected_mode(ctxt, smstate->cr0,
-					smstate->cr3, smstate->cr4);
+	r = rsm_enter_protected_mode(ctxt, smstate->cr0,
+				     smstate->cr3, smstate->cr4);
+
+	if (r != X86EMUL_CONTINUE)
+		return r;
+
+	ctxt->ops->set_int_shadow(ctxt, 0);
+	ctxt->interruptibility = (u8)smstate->int_shadow;
+
+	return X86EMUL_CONTINUE;
 }
 
 #ifdef CONFIG_X86_64
@@ -2528,6 +2536,9 @@  static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
 	rsm_load_seg_64(ctxt, &smstate->fs, VCPU_SREG_FS);
 	rsm_load_seg_64(ctxt, &smstate->gs, VCPU_SREG_GS);
 
+	ctxt->ops->set_int_shadow(ctxt, 0);
+	ctxt->interruptibility = (u8)smstate->int_shadow;
+
 	return X86EMUL_CONTINUE;
 }
 #endif
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index 7015728da36d5f..11928306439c77 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -232,6 +232,7 @@  struct x86_emulate_ops {
 	bool (*guest_has_rdpid)(struct x86_emulate_ctxt *ctxt);
 
 	void (*set_nmi_mask)(struct x86_emulate_ctxt *ctxt, bool masked);
+	void (*set_int_shadow)(struct x86_emulate_ctxt *ctxt, u8 shadow);
 
 	unsigned (*get_hflags)(struct x86_emulate_ctxt *ctxt);
 	void (*exiting_smm)(struct x86_emulate_ctxt *ctxt);
@@ -520,7 +521,9 @@  struct kvm_smram_state_32 {
 	u32 reserved1[62];			/* FE00 - FEF7 */
 	u32 smbase;				/* FEF8 */
 	u32 smm_revision;			/* FEFC */
-	u32 reserved2[5];			/* FF00-FF13 */
+	u32 reserved2[4];			/* FF00-FF0F*/
+	/* int_shadow is KVM extension*/
+	u32 int_shadow;				/* FF10 */
 	/* CR4 is not present in Intel/AMD SMRAM image*/
 	u32 cr4;				/* FF14 */
 	u32 reserved3[5];			/* FF18 */
@@ -592,13 +595,17 @@  struct kvm_smram_state_64 {
 	struct kvm_smm_seg_state_64 idtr;	/* FE80 (R/O) */
 	struct kvm_smm_seg_state_64 tr;		/* FE90 (R/O) */
 
-	/* I/O restart and auto halt restart are not implemented by KVM */
+	/*
+	 * I/O restart and auto halt restart are not implemented by KVM
+	 * int_shadow is KVM's extension
+	 */
+
 	u64 io_restart_rip;			/* FEA0 (R/O) */
 	u64 io_restart_rcx;			/* FEA8 (R/O) */
 	u64 io_restart_rsi;			/* FEB0 (R/O) */
 	u64 io_restart_rdi;			/* FEB8 (R/O) */
 	u32 io_restart_dword;			/* FEC0 (R/O) */
-	u32 reserved1;				/* FEC4 */
+	u32 int_shadow;				/* FEC4 (R/O) */
 	u8 io_instruction_restart;		/* FEC8 (R/W) */
 	u8 auto_halt_restart;			/* FEC9 (R/W) */
 	u8 reserved2[6];			/* FECA-FECF */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a1b138f0815d30..665134b1096b25 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7887,6 +7887,11 @@  static void emulator_set_nmi_mask(struct x86_emulate_ctxt *ctxt, bool masked)
 	static_call(kvm_x86_set_nmi_mask)(emul_to_vcpu(ctxt), masked);
 }
 
+static void emulator_set_int_shadow(struct x86_emulate_ctxt *ctxt, u8 shadow)
+{
+	static_call(kvm_x86_set_interrupt_shadow)(emul_to_vcpu(ctxt), shadow);
+}
+
 static unsigned emulator_get_hflags(struct x86_emulate_ctxt *ctxt)
 {
 	return emul_to_vcpu(ctxt)->arch.hflags;
@@ -7967,6 +7972,7 @@  static const struct x86_emulate_ops emulate_ops = {
 	.guest_has_fxsr      = emulator_guest_has_fxsr,
 	.guest_has_rdpid     = emulator_guest_has_rdpid,
 	.set_nmi_mask        = emulator_set_nmi_mask,
+	.set_int_shadow      = emulator_set_int_shadow,
 	.get_hflags          = emulator_get_hflags,
 	.exiting_smm         = emulator_exiting_smm,
 	.leave_smm           = emulator_leave_smm,
@@ -9744,6 +9750,8 @@  static void enter_smm_save_state_32(struct kvm_vcpu *vcpu, struct kvm_smram_stat
 	smram->cr4 = kvm_read_cr4(vcpu);
 	smram->smm_revision = 0x00020000;
 	smram->smbase = vcpu->arch.smbase;
+
+	smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
 }
 
 #ifdef CONFIG_X86_64
@@ -9792,6 +9800,8 @@  static void enter_smm_save_state_64(struct kvm_vcpu *vcpu, struct kvm_smram_stat
 	enter_smm_save_seg_64(vcpu, &smram->ds, VCPU_SREG_DS);
 	enter_smm_save_seg_64(vcpu, &smram->fs, VCPU_SREG_FS);
 	enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
+
+	smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
 }
 #endif
 
@@ -9828,6 +9838,8 @@  static void enter_smm(struct kvm_vcpu *vcpu)
 	kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
 	kvm_rip_write(vcpu, 0x8000);
 
+	static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
+
 	cr0 = vcpu->arch.cr0 & ~(X86_CR0_PE | X86_CR0_EM | X86_CR0_TS | X86_CR0_PG);
 	static_call(kvm_x86_set_cr0)(vcpu, cr0);
 	vcpu->arch.cr0 = cr0;