[08/12] KVM: x86: save/load state on SMM switch

Message ID 1431084034-8425-9-git-send-email-pbonzini@redhat.com (mailing list archive)
State New, archived

Commit Message

Paolo Bonzini May 8, 2015, 11:20 a.m. UTC
The big ugly one.  This patch adds support for switching in and out of
system management mode, respectively upon receiving KVM_REQ_SMI and upon
executing a RSM instruction.  Both 32- and 64-bit formats are supported
for the SMM state save area.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
	RFC->v1: shift access rights left by 8 for 32-bit format
		 move tracepoint to kvm_set_hflags
		 fix NMI handling
---
 arch/x86/kvm/cpuid.h   |   8 ++
 arch/x86/kvm/emulate.c | 248 ++++++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/trace.h   |  22 +++++
 arch/x86/kvm/x86.c     | 225 +++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 501 insertions(+), 2 deletions(-)

Comments

Radim Krčmář May 21, 2015, 4:20 p.m. UTC | #1
2015-05-08 13:20+0200, Paolo Bonzini:
> The big ugly one.  This patch adds support for switching in and out of
> system management mode, respectively upon receiving KVM_REQ_SMI and upon
> executing a RSM instruction.  Both 32- and 64-bit formats are supported
> for the SMM state save area.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> 	RFC->v1: shift access rights left by 8 for 32-bit format
> 		 move tracepoint to kvm_set_hflags
> 		 fix NMI handling
> ---
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> @@ -2262,12 +2262,258 @@ static int em_lseg(struct x86_emulate_ctxt *ctxt)
> +static int rsm_load_seg_32(struct x86_emulate_ctxt *ctxt, u64 smbase, int n)
> +{
> +	struct desc_struct desc;
> +	int offset;
> +	u16 selector;
> +
> +	selector = get_smstate(u32, smbase, 0x7fa8 + n * 4);

(This could be u16; the SDM says the most significant 2 bytes are reserved anyway.)

> +	if (n < 3)
> +		offset = 0x7f84 + n * 12;
> +	else
> +		offset = 0x7f2c + (n - 3) * 12;

These numbers made me look up where the hell that is defined, and the
easiest reference seemed to be http://www.sandpile.org/x86/smm.htm,
which has several layouts ... I hopefully checked the intersection of
various Intels and AMDs.
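
For readers following the offsets, here is a minimal standalone C sketch of the 32-bit segment layout arithmetic in the hunk above. The helper names (seg32_selector_offset, seg32_desc_offset) and the SEG_* enum are made up for illustration. The segment order is assumed to match KVM's VCPU_SREG_* numbering (ES, CS, SS, DS, FS, GS), and the constants come from the patch, not from a vendor manual.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror KVM's VCPU_SREG_* order; hypothetical names. */
enum { SEG_ES, SEG_CS, SEG_SS, SEG_DS, SEG_FS, SEG_GS, SEG_COUNT };

/* Offset (relative to smbase + 0x8000) of the 4-byte selector slot. */
static uint32_t seg32_selector_offset(int n)
{
	assert(n >= 0 && n < SEG_COUNT);
	return 0x7fa8 + n * 4;
}

/*
 * Offset of the 12-byte record for segment n, laid out as
 * flags at +0, limit at +4, base at +8 (as the patch reads them).
 */
static uint32_t seg32_desc_offset(int n)
{
	assert(n >= 0 && n < SEG_COUNT);
	return n < 3 ? 0x7f84 + n * 12 : 0x7f2c + (n - 3) * 12;
}
```

With these numbers the ES/CS/SS records fill 0x7f84..0x7fa7 and end exactly where the selector slots begin at 0x7fa8, while the DS/FS/GS records sit lower at 0x7f2c..0x7f4f.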

> +	set_desc_base(&desc,      get_smstate(u32, smbase, offset + 8));
> +	set_desc_limit(&desc,     get_smstate(u32, smbase, offset + 4));
> +	rsm_set_desc_flags(&desc, get_smstate(u32, smbase, offset));

(There wasn't a layout where this would be right, so we could save the
 shifting of those flags in 64-bit mode.  Intel P6 was close, and they
 had only 2 bytes for access rights, which means they weren't shifted.)

> +static int rsm_load_state_32(struct x86_emulate_ctxt *ctxt, u64 smbase)
> +{
> +	cr0 =                      get_smstate(u32, smbase, 0x7ffc);

(I wonder why they made 'smbase + 0x8000' the default offset in SDM,
 when 'smbase + 0xfe00' or 'smbase' would work as well.)
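
As a rough illustration of how the offsets in this patch relate to each other: the RSM path hands "smbase + 0x8000" to the load helpers and reads at that value plus the offset, while the SMI path fills a 512-byte staging buffer whose first byte corresponds to offset 0x7e00 and writes it to guest memory at smbase + 0xfe00. The sketch below only restates that arithmetic; the function and macro names are hypothetical and 0x30000 in the note is just an example SMBASE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SMSTATE_FIRST 0x7e00u	/* lowest state offset the patch uses */
#define SMSTATE_SIZE  512u	/* size of the staging buffer in process_smi() */

/* Index into the staging buffer, as put_smstate() computes it. */
static size_t smstate_buf_index(uint32_t off)
{
	assert(off >= SMSTATE_FIRST && off < SMSTATE_FIRST + SMSTATE_SIZE);
	return off - SMSTATE_FIRST;
}

/* Guest address of a state field, as get_smstate() sees it on RSM. */
static uint64_t smstate_guest_addr(uint64_t smbase, uint32_t off)
{
	return smbase + 0x8000 + off;
}
```

For an SMBASE of 0x30000, offset 0x7e00 lands at 0x3fe00 (that is, smbase + 0xfe00) and CR0 at offset 0x7ffc lands at 0x3fffc, so both paths agree on where each field lives.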

> +static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt, u64 smbase)
> +{
> +	struct desc_struct desc;
> +	u16 selector;
> +	selector =                  get_smstate(u32, smbase, 0x7e90);
> +	rsm_set_desc_flags(&desc,   get_smstate(u32, smbase, 0x7e92) << 8);

(Both reads should be u16.  Luckily, extra data gets ignored.)
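
A small standalone check of why the over-wide reads are harmless, under the assumption that the record is laid out as selector (2 bytes), attributes (2 bytes), limit (4 bytes) in little-endian order. The helper names are hypothetical; DESC_FLAGS_USED is derived from the bit positions that rsm_set_desc_flags() in this patch inspects (8..15 and 20..23).

```c
#include <assert.h>
#include <stdint.h>

/* Bits of the flags word that rsm_set_desc_flags() looks at. */
#define DESC_FLAGS_USED 0x00f0ff00u

static uint16_t rd16le(const uint8_t *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32le(const uint8_t *p)
{
	return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
	       ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Build one record and compare 16-bit reads against 32-bit reads. */
static int wide_reads_match(uint16_t sel, uint16_t attrs, uint32_t limit)
{
	uint8_t rec[16] = { 0 };
	uint32_t narrow, wide;

	rec[0] = sel & 0xff;
	rec[1] = sel >> 8;
	rec[2] = attrs & 0xff;
	rec[3] = attrs >> 8;
	rec[4] = limit & 0xff;
	rec[5] = (limit >> 8) & 0xff;
	rec[6] = (limit >> 16) & 0xff;
	rec[7] = limit >> 24;

	/* Selector: the u32 is truncated when stored in a u16 variable. */
	if ((uint16_t)rd32le(rec) != rd16le(rec))
		return 0;

	/* Flags: the limit bytes end up in bits 24..31, which are unused. */
	narrow = ((uint32_t)rd16le(rec + 2) << 8) & DESC_FLAGS_USED;
	wide = (rd32le(rec + 2) << 8) & DESC_FLAGS_USED;
	return narrow == wide;
}
```

The extra two bytes pulled in by the u32 read are the low half of the limit; after the shift by 8 they occupy bits 24..31, which none of the descriptor fields use.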

>  static int em_rsm(struct x86_emulate_ctxt *ctxt)
>  {
> +	if ((ctxt->emul_flags & X86EMUL_SMM_INSIDE_NMI_MASK) == 0)
> +		ctxt->ops->set_nmi_mask(ctxt, false);

NMI is always fun ... let's see two cases:
1. NMI -> SMI -> RSM -> NMI
NMI is not injected;  ok.

2. NMI -> SMI -> IRET -> RSM -> NMI
NMI is injected;  I think it shouldn't be ... have you based this
behavior on the 3rd paragraph of SDM 34.8 NMI HANDLING WHILE IN SMM
("A special case [...]")?

Why I think we should restore NMI mask on RSM:
- It's consistent with SMI -> IRET -> NMI -> RSM -> NMI (where we,
  I think correctly, unmask NMIs) and the idea that SMM tries to be
  transparent (but maybe they didn't care about retarded SMI handlers).
- APM 2:15.30.3 SMM_CTL MSR (C001_0116h)
  • ENTER—Bit 1. Enter SMM: map the SMRAM memory areas, record whether
    NMI was currently blocked and block further NMI and SMI interrupts.
  • EXIT—Bit 3. Exit SMM: unmap the SMRAM memory areas, restore the
    previous masking status of NMI and unconditionally reenable SMI.
  
  The MSR should mimic real SMM signals and does restore the NMI mask.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Paolo Bonzini May 21, 2015, 4:21 p.m. UTC | #2
On 21/05/2015 18:20, Radim Krčmář wrote:
> 
>> > +	set_desc_base(&desc,      get_smstate(u32, smbase, offset + 8));
>> > +	set_desc_limit(&desc,     get_smstate(u32, smbase, offset + 4));
>> > +	rsm_set_desc_flags(&desc, get_smstate(u32, smbase, offset));
> (There wasn't a layout where this would be right, so we could save the
>  shifting of those flags in 64-bit mode.  Intel P6 was close, and they
>  had only 2 bytes for access rights, which means they weren't shifted.)

Check the AMD architecture manual.

Paolo
--
Paolo Bonzini May 21, 2015, 4:23 p.m. UTC | #3
On 21/05/2015 18:20, Radim Krčmář wrote:
> 2. NMI -> SMI -> IRET -> RSM -> NMI
> NMI is injected;  I think it shouldn't be ... have you based this
> behavior on the 3rd paragraph of SDM 34.8 NMI HANDLING WHILE IN SMM
> ("A special case [...]")?

Yes.

> Why I think we should restore NMI mask on RSM:
> - It's consistent with SMI -> IRET -> NMI -> RSM -> NMI (where we,
>   I think correctly, unmask NMIs)

Yes, we do.

> and the idea that SMM tries to be
>   transparent (but maybe they didn't care about retarded SMI handlers).

That's my reading of that paragraph of the manual. :)

> - APM 2:15.30.3 SMM_CTL MSR (C001_0116h)
>   • ENTER—Bit 1. Enter SMM: map the SMRAM memory areas, record whether
>     NMI was currently blocked and block further NMI and SMI interrupts.
>   • EXIT—Bit 3. Exit SMM: unmap the SMRAM memory areas, restore the
>     previous masking status of NMI and unconditionally reenable SMI.
>   
>   The MSR should mimic real SMM signals and does restore the NMI mask.

No idea...  My implementation does restore the previous masking status,
but only if it was "unmasked".
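
To make the sequences under discussion concrete, here is a toy model of the NMI-mask handling as this patch implements it. It is a sketch, not KVM code: the struct, the function names, and the one-letter operation strings are invented for illustration, smm_inside_nmi stands in for HF_SMM_INSIDE_NMI_MASK, and IRET is assumed to simply clear the mask. Latching of a blocked NMI is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

struct vcpu_model {
	bool nmi_masked;	/* NMIs currently blocked */
	bool smm_inside_nmi;	/* SMI arrived while NMIs were blocked */
};

/* Returns true if the NMI is injected; injection blocks further NMIs. */
static bool model_nmi(struct vcpu_model *v)
{
	if (v->nmi_masked)
		return false;
	v->nmi_masked = true;
	return true;
}

static void model_iret(struct vcpu_model *v)
{
	v->nmi_masked = false;
}

/* SMI entry: remember a pre-existing mask, otherwise start masking. */
static void model_smi(struct vcpu_model *v)
{
	if (v->nmi_masked)
		v->smm_inside_nmi = true;
	else
		v->nmi_masked = true;
}

/* RSM: unmask only if NMIs were not blocked on entry; never re-mask. */
static void model_rsm(struct vcpu_model *v)
{
	if (!v->smm_inside_nmi)
		v->nmi_masked = false;
	v->smm_inside_nmi = false;
}

/* ops: 'N' = NMI, 'S' = SMI, 'I' = IRET, 'R' = RSM. */
static bool last_nmi_injected(const char *ops)
{
	struct vcpu_model v = { false, false };
	bool injected = false;

	for (; *ops; ops++) {
		switch (*ops) {
		case 'N': injected = model_nmi(&v); break;
		case 'S': model_smi(&v); break;
		case 'I': model_iret(&v); break;
		case 'R': model_rsm(&v); break;
		default:  assert(!"unknown op");
		}
	}
	return injected;
}
```

In this model "NSRN" leaves the final NMI blocked, while "NSIRN" and "SINRN" both inject it. Restoring the saved mask in model_rsm() instead, as Radim suggests, would change only the "NSIRN" result.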

Paolo
--
Radim Krčmář May 21, 2015, 4:33 p.m. UTC | #4
2015-05-21 18:21+0200, Paolo Bonzini:
> On 21/05/2015 18:20, Radim Krčmář wrote:
> > 
> >> > +	set_desc_base(&desc,      get_smstate(u32, smbase, offset + 8));
> >> > +	set_desc_limit(&desc,     get_smstate(u32, smbase, offset + 4));
> >> > +	rsm_set_desc_flags(&desc, get_smstate(u32, smbase, offset));
> > (There wasn't a layout where this would be right, so we could save the
> >  shifting of those flags in 64-bit mode.  Intel P6 was close, and they
> >  had only 2 bytes for access rights, which means they weren't shifted.)
> 
> Check the AMD architecture manual.

I must be blind, is there more than Table 10-2?

(And according to the AMD manual, we are overwriting GDT and IDT base at
 offset 0xff88 and 0xff94 with ES and CS data, so it's not the best
 reference for this case ...)
--
Radim Krčmář May 21, 2015, 5 p.m. UTC | #5
2015-05-21 18:23+0200, Paolo Bonzini:
> On 21/05/2015 18:20, Radim Krčmář wrote:
>> 2. NMI -> SMI -> IRET -> RSM -> NMI
>> NMI is injected;  I think it shouldn't be ... have you based this
>> behavior on the 3rd paragraph of SDM 34.8 NMI HANDLING WHILE IN SMM
>> ("A special case [...]")?
> 
> Yes.

Well, if I were to go language lawyer:

 [...] saves the SMRAM state save map but does not save the attribute to
 keep NMI interrupts disabled.

NMI masking is a bit, so it'd be really wasteful not to have an
attribute to keep NMI enabled in the same place ...

  Potentially, an NMI could be latched (while in SMM or upon exit) and
  serviced upon exit [...]

This "Potentially" could be in the sense that the whole 3rd paragraph is
only applicable to some ancient SMM design :)

The 1st paragraph has a quite clear sentence:

  If NMIs were blocked before the SMI occurred, they are blocked after
  execution of RSM.

so I'd just ignore the 3rd paragraph ...

And the APM 2:10.3.3 Exceptions and Interrupts
  NMI—If an NMI occurs while the processor is in SMM, it is latched by
  the processor, but the NMI handler is not invoked until the processor
  leaves SMM with the execution of an RSM instruction.  A pending NMI
  causes the handler to be invoked immediately after the RSM completes
  and before the first instruction in the interrupted program is
  executed.

  An SMM handler can unmask NMI interrupts by simply executing an IRET.
  Upon completion of the IRET instruction, the processor recognizes the
  pending NMI, and transfers control to the NMI handler. Once an NMI is
  recognized within SMM using this technique, subsequent NMIs are
  recognized until SMM is exited. Later SMIs cause NMIs to be masked,
  until the SMM handler unmasks them.

makes me think that we should unmask them unconditionally or that SMM
doesn't do anything with NMI masking.

If we can choose, less NMI nesting seems like a good idea.
--
Paolo Bonzini May 21, 2015, 8:24 p.m. UTC | #6
On 21/05/2015 18:33, Radim Krčmář wrote:
>> > Check the AMD architecture manual.
> I must be blind, is there more than Table 10-2?

There's Table 10-1! :DDD

Paolo
--
Paolo Bonzini May 21, 2015, 9:21 p.m. UTC | #7
On 21/05/2015 19:00, Radim Krčmář wrote:
>   Potentially, an NMI could be latched (while in SMM or upon exit) and
>   serviced upon exit [...]
> 
> This "Potentially" could be in the sense that the whole 3rd paragraph is
> only applicable to some ancient SMM design :)

It could also be in the sense that you cannot exclude an NMI coming at
exactly the wrong time.

If you want to go full language lawyer, it does mention it whenever
behavior is specific to a processor family.

> The 1st paragraph has a quite clear sentence:
> 
>   If NMIs were blocked before the SMI occurred, they are blocked after
>   execution of RSM.
> 
> so I'd just ignore the 3rd paragraph ...
> 
> And the APM 2:10.3.3 Exceptions and Interrupts
>   NMI—If an NMI occurs while the processor is in SMM, it is latched by
>   the processor, but the NMI handler is not invoked until the processor
>   leaves SMM with the execution of an RSM instruction.  A pending NMI
>   causes the handler to be invoked immediately after the RSM completes
>   and before the first instruction in the interrupted program is
>   executed.
> 
>   An SMM handler can unmask NMI interrupts by simply executing an IRET.
>   Upon completion of the IRET instruction, the processor recognizes the
>   pending NMI, and transfers control to the NMI handler. Once an NMI is
>   recognized within SMM using this technique, subsequent NMIs are
>   recognized until SMM is exited. Later SMIs cause NMIs to be masked,
>   until the SMM handler unmasks them.
> 
> makes me think that we should unmask them unconditionally or that SMM
> doesn't do anything with NMI masking.

Actually I hadn't noticed this paragraph.  But I read it the same as the
Intel manual (i.e. what I implemented): it doesn't say anywhere that RSM
may cause the processor to *set* the "NMIs masked" flag.

It makes no sense; as you said it's 1 bit of state!  But it seems that
it's the architectural behavior. :(

> If we can choose, less NMI nesting seems like a good idea.

It would---I'm just preempting future patches from Nadav. :)  That said,
even if OVMF does do IRETs in SMM (in 64-bit mode it fills in page
tables lazily for memory above 4GB), we do not care about asynchronous
SMIs such as those for power management.  So we should never enter SMM
with NMIs masked, to begin with.

Paolo
--
Radim Krčmář May 22, 2015, 1:13 p.m. UTC | #8
2015-05-21 22:24+0200, Paolo Bonzini:
> On 21/05/2015 18:33, Radim Krčmář wrote:
> >> > Check the AMD architecture manual.
> > I must be blind, is there more than Table 10-2?
> 
> There's Table 10-1! :DDD

:D  I think I understand ...

10-1 says that amd64 doesn't shift the segment's attributes (they
wouldn't fit into a word otherwise), but table 10-2 says nothing about
the same for ia32 segment registers;  that behavior is model-specific.
Some people on http://www.sandpile.org/x86/smm.htm found out that P6
stores SMM state like this

  7F84h: ES selector
  7F86h: ES access rights
  7F88h: ES limit
  7F8Ch: ES base

which has an extra selector there (makes little sense), but access
rights cannot be shifted for they have only a word of space.

I guess it stems from conflicting online resources, but it's not an
architectural behavior, so we'll be wrong anyway :)
(Not shifting them would make the code a bit nicer ...)
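
To illustrate the shifting being discussed, here is a minimal sketch of how the patch's flag layout relates to a 16-bit attribute word. The bit positions are the ones used by process_smi_get_segment_flags() and rsm_set_desc_flags() in this patch. The struct and function names are hypothetical, and nothing here claims to match what a particular CPU stores.

```c
#include <assert.h>
#include <stdint.h>

struct seg_attrs {
	uint8_t type, s, dpl, present, avl, l, db, g;
};

static struct seg_attrs attrs_make(uint8_t type, uint8_t s, uint8_t dpl,
				   uint8_t present, uint8_t avl, uint8_t l,
				   uint8_t db, uint8_t g)
{
	struct seg_attrs a = { type, s, dpl, present, avl, l, db, g };
	return a;
}

/* 32-bit format: attributes live in bits 8..23 of a 32-bit field. */
static uint32_t attrs_pack(struct seg_attrs a)
{
	return (uint32_t)a.g << 23 | (uint32_t)a.db << 22 |
	       (uint32_t)a.l << 21 | (uint32_t)a.avl << 20 |
	       (uint32_t)a.present << 15 | (uint32_t)a.dpl << 13 |
	       (uint32_t)a.s << 12 | (uint32_t)a.type << 8;
}

static struct seg_attrs attrs_unpack(uint32_t flags)
{
	return attrs_make((flags >> 8) & 15, (flags >> 12) & 1,
			  (flags >> 13) & 3, (flags >> 15) & 1,
			  (flags >> 20) & 1, (flags >> 21) & 1,
			  (flags >> 22) & 1, (flags >> 23) & 1);
}

/* 64-bit format: the same bits, shifted down to fit a 16-bit word. */
static uint16_t attrs_to_word(struct seg_attrs a)
{
	return (uint16_t)(attrs_pack(a) >> 8);
}

static struct seg_attrs attrs_from_word(uint16_t w)
{
	return attrs_unpack((uint32_t)w << 8);
}
```

For a present 64-bit code segment (type 0xb, S=1, L=1, G=1) this gives 0x00a09b00 in the 32-bit layout and 0xa09b as the 16-bit word; an unshifted 32-bit layout would store 0xa09b in both, which is the simplification suggested above.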
--
Radim Krčmář May 22, 2015, 2:17 p.m. UTC | #9
2015-05-21 23:21+0200, Paolo Bonzini:
> On 21/05/2015 19:00, Radim Krčmář wrote:
>>   Potentially, an NMI could be latched (while in SMM or upon exit) and
>>   serviced upon exit [...]
>> 
>> This "Potentially" could be in the sense that the whole 3rd paragraph is
>> only applicable to some ancient SMM design :)
> 
> It could also be in the sense that you cannot exclude an NMI coming at
> exactly the wrong time.

Yes, but it is hard to figure out how big the wrong time window is ...

Taken to the extreme, the paragraph says that we must inject NMI that
arrived while in SMM after RSM;  regardless of NMI blocking before.
(Which is not how real hardware works.)

> If you want to go full language lawyer, it does mention it whenever
> behavior is specific to a processor family.

True, I don't know of an exception, but that is not proof to the
contrary here :/

>> The 1st paragraph has a quite clear sentence:
>> 
>>   If NMIs were blocked before the SMI occurred, they are blocked after
>>   execution of RSM.
>> 
>> so I'd just ignore the 3rd paragraph ...

It's suspicious in other ways ... I'll focus on another part of the
sentence now

  Potentially, an NMI could be latched (while in SMM or upon exit)
                               ^^^^^^^^^^^^^^^^^^^^^

An NMI can't be latched in SMM and delivered after RSM when we
started with masked NMIs.
It was latched in SMM, so we either didn't unmask NMIs or we were
handling an NMI in SMM.  The first case is covered by

  If NMIs were blocked before the SMI occurred, they are blocked after
  execution of RSM.

The second case, when we specialize the above, would need to unmask NMIs
with IRET, accept an NMI, and then do RSM before IRET (because IRET
would immediately inject the latched NMI);
if CPU unmasks NMIs in that case, I'd slap someone.

Btw. I had a good laugh at Intel's response to a similar question:
https://software.intel.com/en-us/forums/topic/305672

>> And the APM 2:10.3.3 Exceptions and Interrupts
| [...]
>> makes me think that we should unmask them unconditionally or that SMM
>> doesn't do anything with NMI masking.
> 
> Actually I hadn't noticed this paragraph.  But I read it the same as the
> Intel manual (i.e. what I implemented): it doesn't say anywhere that RSM
> may cause the processor to *set* the "NMIs masked" flag.
> 
> It makes no sense; as you said it's 1 bit of state!  But it seems that
> it's the architectural behavior. :(

Ok, it's sad and I'm too lazy to actually try it ...

>> If we can choose, less NMI nesting seems like a good idea.
> 
> It would---I'm just preempting future patches from Nadav. :)

Me too :D

>                                                               That said,
> even if OVMF does do IRETs in SMM (in 64-bit mode it fills in page
> tables lazily for memory above 4GB), we do not care about asynchronous
> SMIs such as those for power management.  So we should never enter SMM
> with NMIs masked, to begin with.

Yeah, it's a stupid corner case, the place where most of time and sanity
is lost.
--
Paolo Bonzini May 25, 2015, 12:46 p.m. UTC | #10
On 22/05/2015 16:17, Radim Krčmář wrote:
> Btw. I had a good laugh at Intel's response to a similar question:
> https://software.intel.com/en-us/forums/topic/305672

Duh... the question is dumb (because he's not doing IRET in SMM), and
the answer is dumber...

Paolo
--

Patch

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index c3b1ad9fca81..02b67616435d 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -70,6 +70,14 @@  static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
 	return best && (best->ebx & bit(X86_FEATURE_FSGSBASE));
 }
 
+static inline bool guest_cpuid_has_longmode(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 0x80000001, 0);
+	return best && (best->edx & bit(X86_FEATURE_LM));
+}
+
 static inline bool guest_cpuid_has_osvw(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c5a6a407afba..b278ea09ed80 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2262,12 +2262,258 @@  static int em_lseg(struct x86_emulate_ctxt *ctxt)
 	return rc;
 }
 
+static int emulator_has_longmode(struct x86_emulate_ctxt *ctxt)
+{
+	u32 eax, ebx, ecx, edx;
+
+	eax = 0x80000001;
+	ecx = 0;
+	ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx);
+	return edx & bit(X86_FEATURE_LM);
+}
+
+#define get_smstate(type, smbase, offset)				  \
+	({								  \
+	 type __val;							  \
+	 int r = ctxt->ops->read_std(ctxt, smbase + offset, &__val,       \
+				     sizeof(__val), NULL);		  \
+	 if (r != X86EMUL_CONTINUE)					  \
+		 return X86EMUL_UNHANDLEABLE;				  \
+	 __val;								  \
+	})
+
+static void rsm_set_desc_flags(struct desc_struct *desc, u32 flags)
+{
+	desc->g    = (flags >> 23) & 1;
+	desc->d    = (flags >> 22) & 1;
+	desc->l    = (flags >> 21) & 1;
+	desc->avl  = (flags >> 20) & 1;
+	desc->p    = (flags >> 15) & 1;
+	desc->dpl  = (flags >> 13) & 3;
+	desc->s    = (flags >> 12) & 1;
+	desc->type = (flags >>  8) & 15;
+}
+
+static int rsm_load_seg_32(struct x86_emulate_ctxt *ctxt, u64 smbase, int n)
+{
+	struct desc_struct desc;
+	int offset;
+	u16 selector;
+
+	selector = get_smstate(u32, smbase, 0x7fa8 + n * 4);
+
+	if (n < 3)
+		offset = 0x7f84 + n * 12;
+	else
+		offset = 0x7f2c + (n - 3) * 12;
+
+	set_desc_base(&desc,      get_smstate(u32, smbase, offset + 8));
+	set_desc_limit(&desc,     get_smstate(u32, smbase, offset + 4));
+	rsm_set_desc_flags(&desc, get_smstate(u32, smbase, offset));
+	ctxt->ops->set_segment(ctxt, selector, &desc, 0, n);
+	return X86EMUL_CONTINUE;
+}
+
+static int rsm_load_seg_64(struct x86_emulate_ctxt *ctxt, u64 smbase, int n)
+{
+	struct desc_struct desc;
+	int offset;
+	u16 selector;
+	u32 base3;
+
+	offset = 0x7e00 + n * 16;
+
+	selector =                get_smstate(u16, smbase, offset);
+	rsm_set_desc_flags(&desc, get_smstate(u16, smbase, offset + 2) << 8);
+	set_desc_limit(&desc,     get_smstate(u32, smbase, offset + 4));
+	set_desc_base(&desc,      get_smstate(u32, smbase, offset + 8));
+	base3 =                   get_smstate(u32, smbase, offset + 12);
+
+	ctxt->ops->set_segment(ctxt, selector, &desc, base3, n);
+	return X86EMUL_CONTINUE;
+}
+
+static int rsm_enter_protected_mode(struct x86_emulate_ctxt *ctxt,
+				     u64 cr0, u64 cr4)
+{
+	int bad;
+
+	/*
+	 * First enable PAE, long mode needs it before CR0.PG = 1 is set.
+	 * Then enable protected mode.	However, PCID cannot be enabled
+	 * if EFER.LMA=0, so set it separately.
+	 */
+	bad = ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE);
+	if (bad)
+		return X86EMUL_UNHANDLEABLE;
+
+	bad = ctxt->ops->set_cr(ctxt, 0, cr0);
+	if (bad)
+		return X86EMUL_UNHANDLEABLE;
+
+	if (cr4 & X86_CR4_PCIDE) {
+		bad = ctxt->ops->set_cr(ctxt, 4, cr4);
+		if (bad)
+			return X86EMUL_UNHANDLEABLE;
+	}
+
+	return X86EMUL_CONTINUE;
+}
+
+static int rsm_load_state_32(struct x86_emulate_ctxt *ctxt, u64 smbase)
+{
+	struct desc_struct desc;
+	struct desc_ptr dt;
+	u16 selector;
+	u32 val, cr0, cr4;
+	int i;
+
+	cr0 =                      get_smstate(u32, smbase, 0x7ffc);
+	ctxt->ops->set_cr(ctxt, 3, get_smstate(u32, smbase, 0x7ff8));
+	ctxt->eflags =             get_smstate(u32, smbase, 0x7ff4) | X86_EFLAGS_FIXED;
+	ctxt->_eip =               get_smstate(u32, smbase, 0x7ff0);
+
+	for (i = 0; i < 8; i++)
+		*reg_write(ctxt, i) = get_smstate(u32, smbase, 0x7fd0 + i * 4);
+
+	val = get_smstate(u32, smbase, 0x7fcc);
+	ctxt->ops->set_dr(ctxt, 6, (val & DR6_VOLATILE) | DR6_FIXED_1);
+	val = get_smstate(u32, smbase, 0x7fc8);
+	ctxt->ops->set_dr(ctxt, 7, (val & DR7_VOLATILE) | DR7_FIXED_1);
+
+	selector =                 get_smstate(u32, smbase, 0x7fc4);
+	set_desc_base(&desc,       get_smstate(u32, smbase, 0x7f64));
+	set_desc_limit(&desc,      get_smstate(u32, smbase, 0x7f60));
+	rsm_set_desc_flags(&desc,  get_smstate(u32, smbase, 0x7f5c));
+	ctxt->ops->set_segment(ctxt, selector, &desc, 0, VCPU_SREG_TR);
+
+	selector =                 get_smstate(u32, smbase, 0x7fc0);
+	set_desc_base(&desc,       get_smstate(u32, smbase, 0x7f80));
+	set_desc_limit(&desc,      get_smstate(u32, smbase, 0x7f7c));
+	rsm_set_desc_flags(&desc,  get_smstate(u32, smbase, 0x7f78));
+	ctxt->ops->set_segment(ctxt, selector, &desc, 0, VCPU_SREG_LDTR);
+
+	dt.address =               get_smstate(u32, smbase, 0x7f74);
+	dt.size =                  get_smstate(u32, smbase, 0x7f70);
+	ctxt->ops->set_gdt(ctxt, &dt);
+
+	dt.address =               get_smstate(u32, smbase, 0x7f58);
+	dt.size =                  get_smstate(u32, smbase, 0x7f54);
+	ctxt->ops->set_idt(ctxt, &dt);
+
+	for (i = 0; i < 6; i++) {
+		int r = rsm_load_seg_32(ctxt, smbase, i);
+		if (r != X86EMUL_CONTINUE)
+			return r;
+	}
+
+	cr4 = get_smstate(u32, smbase, 0x7f14);
+
+	ctxt->ops->set_smbase(ctxt, get_smstate(u32, smbase, 0x7ef8));
+
+	return rsm_enter_protected_mode(ctxt, cr0, cr4);
+}
+
+static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt, u64 smbase)
+{
+	struct desc_struct desc;
+	struct desc_ptr dt;
+	u64 val, cr0, cr4;
+	u32 base3;
+	u16 selector;
+	int i;
+
+	for (i = 0; i < 16; i++)
+		*reg_write(ctxt, i) = get_smstate(u64, smbase, 0x7ff8 - i * 8);
+
+	ctxt->_eip   = get_smstate(u64, smbase, 0x7f78);
+	ctxt->eflags = get_smstate(u32, smbase, 0x7f70) | X86_EFLAGS_FIXED;
+
+	val = get_smstate(u32, smbase, 0x7f68);
+	ctxt->ops->set_dr(ctxt, 6, (val & DR6_VOLATILE) | DR6_FIXED_1);
+	val = get_smstate(u32, smbase, 0x7f60);
+	ctxt->ops->set_dr(ctxt, 7, (val & DR7_VOLATILE) | DR7_FIXED_1);
+
+	cr0 =                       get_smstate(u64, smbase, 0x7f58);
+	ctxt->ops->set_cr(ctxt, 3,  get_smstate(u64, smbase, 0x7f50));
+	cr4 =                       get_smstate(u64, smbase, 0x7f48);
+	ctxt->ops->set_smbase(ctxt, get_smstate(u32, smbase, 0x7f00));
+	val =                       get_smstate(u64, smbase, 0x7ed0);
+	ctxt->ops->set_msr(ctxt, MSR_EFER, val & ~EFER_LMA);
+
+	selector =                  get_smstate(u32, smbase, 0x7e90);
+	rsm_set_desc_flags(&desc,   get_smstate(u32, smbase, 0x7e92) << 8);
+	set_desc_limit(&desc,       get_smstate(u32, smbase, 0x7e94));
+	set_desc_base(&desc,        get_smstate(u32, smbase, 0x7e98));
+	base3 =                     get_smstate(u32, smbase, 0x7e9c);
+	ctxt->ops->set_segment(ctxt, selector, &desc, base3, VCPU_SREG_TR);
+
+	dt.size =                   get_smstate(u32, smbase, 0x7e84);
+	dt.address =                get_smstate(u64, smbase, 0x7e88);
+	ctxt->ops->set_idt(ctxt, &dt);
+
+	selector =                  get_smstate(u32, smbase, 0x7e70);
+	rsm_set_desc_flags(&desc,   get_smstate(u32, smbase, 0x7e72) << 8);
+	set_desc_limit(&desc,       get_smstate(u32, smbase, 0x7e74));
+	set_desc_base(&desc,        get_smstate(u32, smbase, 0x7e78));
+	base3 =                     get_smstate(u32, smbase, 0x7e7c);
+	ctxt->ops->set_segment(ctxt, selector, &desc, base3, VCPU_SREG_LDTR);
+
+	dt.size =                   get_smstate(u32, smbase, 0x7e64);
+	dt.address =                get_smstate(u64, smbase, 0x7e68);
+	ctxt->ops->set_gdt(ctxt, &dt);
+
+	for (i = 0; i < 6; i++) {
+		int r = rsm_load_seg_64(ctxt, smbase, i);
+		if (r != X86EMUL_CONTINUE)
+			return r;
+	}
+
+	return rsm_enter_protected_mode(ctxt, cr0, cr4);
+}
+
 static int em_rsm(struct x86_emulate_ctxt *ctxt)
 {
+	unsigned long cr0, cr4, efer;
+	u64 smbase;
+	int ret;
+
 	if ((ctxt->emul_flags & X86EMUL_SMM_MASK) == 0)
 		return emulate_ud(ctxt);
 
-	return X86EMUL_UNHANDLEABLE;
+	/*
+	 * Get back to real mode, to prepare a safe state in which to load
+	 * CR0/CR3/CR4/EFER.  Also this will ensure that addresses passed
+	 * to read_std/write_std are not virtual.
+	 *
+	 * CR4.PCIDE must be zero, because it is a 64-bit mode only feature.
+	 */
+	cr0 = ctxt->ops->get_cr(ctxt, 0);
+	if (cr0 & X86_CR0_PE)
+		ctxt->ops->set_cr(ctxt, 0, cr0 & ~(X86_CR0_PG | X86_CR0_PE));
+	cr4 = ctxt->ops->get_cr(ctxt, 4);
+	if (cr4 & X86_CR4_PAE)
+		ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PAE);
+	efer = 0;
+	ctxt->ops->set_msr(ctxt, MSR_EFER, efer);
+
+	smbase = ctxt->ops->get_smbase(ctxt);
+	if (emulator_has_longmode(ctxt))
+		ret = rsm_load_state_64(ctxt, smbase + 0x8000);
+	else
+		ret = rsm_load_state_32(ctxt, smbase + 0x8000);
+
+	if (ret != X86EMUL_CONTINUE) {
+		/* FIXME: should triple fault */
+		return X86EMUL_UNHANDLEABLE;
+	}
+
+	if ((ctxt->emul_flags & X86EMUL_SMM_INSIDE_NMI_MASK) == 0)
+		ctxt->ops->set_nmi_mask(ctxt, false);
+
+	ctxt->emul_flags &= ~X86EMUL_SMM_INSIDE_NMI_MASK;
+	ctxt->emul_flags &= ~X86EMUL_SMM_MASK;
+	return X86EMUL_CONTINUE;
 }
 
 static void
diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 7c7bc8bef21f..17e505125b2c 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -952,6 +952,28 @@  TRACE_EVENT(kvm_wait_lapic_expire,
 		  __entry->delta < 0 ? "early" : "late")
 );
 
+TRACE_EVENT(kvm_enter_smm,
+	TP_PROTO(unsigned int vcpu_id, u64 smbase, bool entering),
+	TP_ARGS(vcpu_id, smbase, entering),
+
+	TP_STRUCT__entry(
+		__field(	unsigned int,	vcpu_id		)
+		__field(	u64,		smbase		)
+		__field(	bool,		entering	)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id	   = vcpu_id;
+		__entry->smbase             = smbase;
+		__entry->entering           = entering;
+	),
+
+	TP_printk("vcpu %u: %s SMM, smbase 0x%llx",
+		  __entry->vcpu_id,
+		  __entry->entering ? "entering" : "leaving",
+		  __entry->smbase)
+);
+
 #endif /* _TRACE_KVM_H */
 
 #undef TRACE_INCLUDE_PATH
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2bf5bc4ed00f..9eba0d850d17 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5409,6 +5409,9 @@  static int complete_emulated_pio(struct kvm_vcpu *vcpu);
 void kvm_set_hflags(struct kvm_vcpu *vcpu, unsigned emul_flags)
 {
 	if (is_smm(vcpu) && (emul_flags & HF_SMM_MASK) == 0) {
+		/* This is a good place to trace that we are exiting SMM.  */
+		trace_kvm_enter_smm(vcpu->vcpu_id, vcpu->arch.smbase, false);
+
 		if (unlikely(vcpu->arch.smi_pending)) {
 			kvm_make_request(KVM_REQ_SMI, vcpu);
 			vcpu->arch.smi_pending = 0;
@@ -6309,14 +6312,234 @@  static void process_nmi(struct kvm_vcpu *vcpu)
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 
+#define put_smstate(type, buf, offset, val)			  \
+	*(type *)((buf) + (offset) - 0x7e00) = val
+
+static u32 process_smi_get_segment_flags(struct kvm_segment *seg)
+{
+	u32 flags = 0;
+	flags |= seg->g       << 23;
+	flags |= seg->db      << 22;
+	flags |= seg->l       << 21;
+	flags |= seg->avl     << 20;
+	flags |= seg->present << 15;
+	flags |= seg->dpl     << 13;
+	flags |= seg->s       << 12;
+	flags |= seg->type    << 8;
+	return flags;
+}
+
+static void process_smi_save_seg_32(struct kvm_vcpu *vcpu, char *buf, int n)
+{
+	struct kvm_segment seg;
+	int offset;
+
+	kvm_get_segment(vcpu, &seg, n);
+	put_smstate(u32, buf, 0x7fa8 + n * 4, seg.selector);
+
+	if (n < 3)
+		offset = 0x7f84 + n * 12;
+	else
+		offset = 0x7f2c + (n - 3) * 12;
+
+	put_smstate(u32, buf, offset + 8, seg.base);
+	put_smstate(u32, buf, offset + 4, seg.limit);
+	put_smstate(u32, buf, offset, process_smi_get_segment_flags(&seg));
+}
+
+static void process_smi_save_seg_64(struct kvm_vcpu *vcpu, char *buf, int n)
+{
+	struct kvm_segment seg;
+	int offset;
+	u16 flags;
+
+	kvm_get_segment(vcpu, &seg, n);
+	offset = 0x7e00 + n * 16;
+
+	flags = process_smi_get_segment_flags(&seg) >> 8;
+	put_smstate(u16, buf, offset, seg.selector);
+	put_smstate(u16, buf, offset + 2, flags);
+	put_smstate(u32, buf, offset + 4, seg.limit);
+	put_smstate(u64, buf, offset + 8, seg.base);
+}
+
+static void process_smi_save_state_32(struct kvm_vcpu *vcpu, char *buf)
+{
+	struct desc_ptr dt;
+	struct kvm_segment seg;
+	unsigned long val;
+	int i;
+
+	put_smstate(u32, buf, 0x7ffc, kvm_read_cr0(vcpu));
+	put_smstate(u32, buf, 0x7ff8, kvm_read_cr3(vcpu));
+	put_smstate(u32, buf, 0x7ff4, kvm_get_rflags(vcpu));
+	put_smstate(u32, buf, 0x7ff0, kvm_rip_read(vcpu));
+
+	for (i = 0; i < 8; i++)
+		put_smstate(u32, buf, 0x7fd0 + i * 4, kvm_register_read(vcpu, i));
+
+	kvm_get_dr(vcpu, 6, &val);
+	put_smstate(u32, buf, 0x7fcc, (u32)val);
+	kvm_get_dr(vcpu, 7, &val);
+	put_smstate(u32, buf, 0x7fc8, (u32)val);
+
+	kvm_get_segment(vcpu, &seg, VCPU_SREG_TR);
+	put_smstate(u32, buf, 0x7fc4, seg.selector);
+	put_smstate(u32, buf, 0x7f64, seg.base);
+	put_smstate(u32, buf, 0x7f60, seg.limit);
+	put_smstate(u32, buf, 0x7f5c, process_smi_get_segment_flags(&seg));
+
+	kvm_get_segment(vcpu, &seg, VCPU_SREG_LDTR);
+	put_smstate(u32, buf, 0x7fc0, seg.selector);
+	put_smstate(u32, buf, 0x7f80, seg.base);
+	put_smstate(u32, buf, 0x7f7c, seg.limit);
+	put_smstate(u32, buf, 0x7f78, process_smi_get_segment_flags(&seg));
+
+	kvm_x86_ops->get_gdt(vcpu, &dt);
+	put_smstate(u32, buf, 0x7f74, dt.address);
+	put_smstate(u32, buf, 0x7f70, dt.size);
+
+	kvm_x86_ops->get_idt(vcpu, &dt);
+	put_smstate(u32, buf, 0x7f58, dt.address);
+	put_smstate(u32, buf, 0x7f54, dt.size);
+
+	for (i = 0; i < 6; i++)
+		process_smi_save_seg_32(vcpu, buf, i);
+
+	put_smstate(u32, buf, 0x7f14, kvm_read_cr4(vcpu));
+
+	/* revision id */
+	put_smstate(u32, buf, 0x7efc, 0x00020000);
+	put_smstate(u32, buf, 0x7ef8, vcpu->arch.smbase);
+}
+
+static void process_smi_save_state_64(struct kvm_vcpu *vcpu, char *buf)
+{
+#ifdef CONFIG_X86_64
+	struct desc_ptr dt;
+	struct kvm_segment seg;
+	unsigned long val;
+	int i;
+
+	for (i = 0; i < 16; i++)
+		put_smstate(u64, buf, 0x7ff8 - i * 8, kvm_register_read(vcpu, i));
+
+	put_smstate(u64, buf, 0x7f78, kvm_rip_read(vcpu));
+	put_smstate(u32, buf, 0x7f70, kvm_get_rflags(vcpu));
+
+	kvm_get_dr(vcpu, 6, &val);
+	put_smstate(u64, buf, 0x7f68, val);
+	kvm_get_dr(vcpu, 7, &val);
+	put_smstate(u64, buf, 0x7f60, val);
+
+	put_smstate(u64, buf, 0x7f58, kvm_read_cr0(vcpu));
+	put_smstate(u64, buf, 0x7f50, kvm_read_cr3(vcpu));
+	put_smstate(u64, buf, 0x7f48, kvm_read_cr4(vcpu));
+
+	put_smstate(u32, buf, 0x7f00, vcpu->arch.smbase);
+
+	/* revision id */
+	put_smstate(u32, buf, 0x7efc, 0x00020064);
+
+	put_smstate(u64, buf, 0x7ed0, vcpu->arch.efer);
+
+	kvm_get_segment(vcpu, &seg, VCPU_SREG_TR);
+	put_smstate(u16, buf, 0x7e90, seg.selector);
+	put_smstate(u16, buf, 0x7e92, process_smi_get_segment_flags(&seg) >> 8);
+	put_smstate(u32, buf, 0x7e94, seg.limit);
+	put_smstate(u64, buf, 0x7e98, seg.base);
+
+	kvm_x86_ops->get_idt(vcpu, &dt);
+	put_smstate(u32, buf, 0x7e84, dt.size);
+	put_smstate(u64, buf, 0x7e88, dt.address);
+
+	kvm_get_segment(vcpu, &seg, VCPU_SREG_LDTR);
+	put_smstate(u16, buf, 0x7e70, seg.selector);
+	put_smstate(u16, buf, 0x7e72, process_smi_get_segment_flags(&seg) >> 8);
+	put_smstate(u32, buf, 0x7e74, seg.limit);
+	put_smstate(u64, buf, 0x7e78, seg.base);
+
+	kvm_x86_ops->get_gdt(vcpu, &dt);
+	put_smstate(u32, buf, 0x7e64, dt.size);
+	put_smstate(u64, buf, 0x7e68, dt.address);
+
+	for (i = 0; i < 6; i++)
+		process_smi_save_seg_64(vcpu, buf, i);
+#else
+	WARN_ON_ONCE(1);
+#endif
+}
+
 static void process_smi(struct kvm_vcpu *vcpu)
 {
+	struct kvm_segment cs, ds;
+	char buf[512];
+	u32 cr0;
+	int r;
+
 	if (is_smm(vcpu)) {
 		vcpu->arch.smi_pending = true;
 		return;
 	}
 
-	printk_once(KERN_DEBUG "Ignoring guest SMI\n");
+	trace_kvm_enter_smm(vcpu->vcpu_id, vcpu->arch.smbase, true);
+	vcpu->arch.hflags |= HF_SMM_MASK;
+	memset(buf, 0, 512);
+	if (guest_cpuid_has_longmode(vcpu))
+		process_smi_save_state_64(vcpu, buf);
+	else
+		process_smi_save_state_32(vcpu, buf);
+
+	r = kvm_write_guest(vcpu->kvm, vcpu->arch.smbase + 0xfe00, buf, sizeof(buf));
+	if (r < 0)
+		return;
+
+	if (kvm_x86_ops->get_nmi_mask(vcpu))
+		vcpu->arch.hflags |= HF_SMM_INSIDE_NMI_MASK;
+	else
+		kvm_x86_ops->set_nmi_mask(vcpu, true);
+
+	kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
+	kvm_rip_write(vcpu, 0x8000);
+
+	cr0 = vcpu->arch.cr0 & ~(X86_CR0_PE | X86_CR0_EM | X86_CR0_TS | X86_CR0_PG);
+	kvm_x86_ops->set_cr0(vcpu, cr0);
+	vcpu->arch.cr0 = cr0;
+
+	kvm_x86_ops->set_cr4(vcpu, 0);
+
+	__kvm_set_dr(vcpu, 7, DR7_FIXED_1);
+
+	cs.selector = (vcpu->arch.smbase >> 4) & 0xffff;
+	cs.base = vcpu->arch.smbase;
+
+	ds.selector = 0;
+	ds.base = 0;
+
+	cs.limit    = ds.limit = 0xffffffff;
+	cs.type     = ds.type = 0x3;
+	cs.dpl      = ds.dpl = 0;
+	cs.db       = ds.db = 0;
+	cs.s        = ds.s = 1;
+	cs.l        = ds.l = 0;
+	cs.g        = ds.g = 1;
+	cs.avl      = ds.avl = 0;
+	cs.present  = ds.present = 1;
+	cs.unusable = ds.unusable = 0;
+	cs.padding  = ds.padding = 0;
+
+	kvm_set_segment(vcpu, &cs, VCPU_SREG_CS);
+	kvm_set_segment(vcpu, &ds, VCPU_SREG_DS);
+	kvm_set_segment(vcpu, &ds, VCPU_SREG_ES);
+	kvm_set_segment(vcpu, &ds, VCPU_SREG_FS);
+	kvm_set_segment(vcpu, &ds, VCPU_SREG_GS);
+	kvm_set_segment(vcpu, &ds, VCPU_SREG_SS);
+
+	if (guest_cpuid_has_longmode(vcpu))
+		kvm_x86_ops->set_efer(vcpu, 0);
+
+	kvm_update_cpuid(vcpu);
+	kvm_mmu_reset_context(vcpu);
 }
 
 static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)