
[09/13] KVM: x86: save/load state on SMM switch

Message ID 1430393772-27208-10-git-send-email-pbonzini@redhat.com (mailing list archive)
State New, archived

Commit Message

Paolo Bonzini April 30, 2015, 11:36 a.m. UTC
The big ugly one.  This patch adds support for switching in and out of
system management mode, respectively upon receiving KVM_REQ_SMI and upon
executing a RSM instruction.  Both 32- and 64-bit formats are supported
for the SMM state save area.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/cpuid.h   |   8 ++
 arch/x86/kvm/emulate.c | 243 ++++++++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c     | 216 ++++++++++++++++++++++++++++++++++++++++++-
 3 files changed, 465 insertions(+), 2 deletions(-)

Comments

Radim Krčmář May 4, 2015, 7:59 p.m. UTC | #1
2015-04-30 13:36+0200, Paolo Bonzini:
> The big ugly one.  This patch adds support for switching in and out of
> system management mode, respectively upon receiving KVM_REQ_SMI and upon
> executing a RSM instruction.  Both 32- and 64-bit formats are supported
> for the SMM state save area.
> 
> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> +static void rsm_set_desc_flags(struct desc_struct *desc, u16 flags)
> +{
> +	desc->g    = (flags >> 15) & 1;
> +	desc->d    = (flags >> 14) & 1;
> +	desc->l    = (flags >> 13) & 1;
> +	desc->avl  = (flags >> 12) & 1;
> +	desc->p    = (flags >> 7) & 1;
> +	desc->dpl  = (flags >> 5) & 3;
> +	desc->s    = (flags >> 4) & 1;
> +	desc->type = flags & 15;

I can't find a description of this ... can you point me to a place where
the gap between 'p' and 'avl' is documented?
(Not that it matters unless the guest reads it, but it's a bit weird.)

Thanks.
Paolo Bonzini May 5, 2015, 9:37 a.m. UTC | #2
On 04/05/2015 21:59, Radim Krčmář wrote:
> > The big ugly one.  This patch adds support for switching in and out of
> > system management mode, respectively upon receiving KVM_REQ_SMI and upon
> > executing a RSM instruction.  Both 32- and 64-bit formats are supported
> > for the SMM state save area.
> > 
> > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > ---
> > diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> > +static void rsm_set_desc_flags(struct desc_struct *desc, u16 flags)
> > +{
> > +	desc->g    = (flags >> 15) & 1;
> > +	desc->d    = (flags >> 14) & 1;
> > +	desc->l    = (flags >> 13) & 1;
> > +	desc->avl  = (flags >> 12) & 1;
> > +	desc->p    = (flags >> 7) & 1;
> > +	desc->dpl  = (flags >> 5) & 3;
> > +	desc->s    = (flags >> 4) & 1;
> > +	desc->type = flags & 15;
>
> I can't find a description of this ... can you point me to a place where
> the gap between 'p' and 'avl' is documented?
> (Not that it matters unless the guest reads it, but it's a bit weird.)

It turns out that access rights are stored in the same format as the VMX
access rights.  However, they are shifted by 8, which my code above
doesn't do (bug).

The documentation is, of course, QEMU and Bochs :) but you can also find
it in http://www.rcollins.org/ftp/source/include/struc.inc.  It is not
exactly for SMM, but it is more or less the same.
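
For reference, this is the VMX-style layout in question (the guest
segment access-rights field from the SDM; listing it here only to show
where the gap between 'p' and 'avl' comes from):

	bits  3:0   segment type
	bit   4     S (0 = system, 1 = code/data)
	bits  6:5   DPL
	bit   7     P (present)
	bits 11:8   reserved
	bit  12     AVL
	bit  13     L (64-bit code segment)
	bit  14     D/B
	bit  15     G (granularity)
	bit  16     unusable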

Paolo
Radim Krčmář May 5, 2015, 12:48 p.m. UTC | #3
2015-05-05 11:37+0200, Paolo Bonzini:
> On 04/05/2015 21:59, Radim Krčmář wrote:
> > > The big ugly one.  This patch adds support for switching in and out of
> > > system management mode, respectively upon receiving KVM_REQ_SMI and upon
> > > executing a RSM instruction.  Both 32- and 64-bit formats are supported
> > > for the SMM state save area.
> > > 
> > > Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
> > > ---
> > > diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> > > +static void rsm_set_desc_flags(struct desc_struct *desc, u16 flags)
> > > +{
> > > +	desc->g    = (flags >> 15) & 1;
> > > +	desc->d    = (flags >> 14) & 1;
> > > +	desc->l    = (flags >> 13) & 1;
> > > +	desc->avl  = (flags >> 12) & 1;
> > > +	desc->p    = (flags >> 7) & 1;
> > > +	desc->dpl  = (flags >> 5) & 3;
> > > +	desc->s    = (flags >> 4) & 1;
> > > +	desc->type = flags & 15;
> >
> > I can't find a description of this ... can you point me to a place where
> > the gap between 'p' and 'avl' is documented?
> > (Not that it matters unless the guest reads it, but it's a bit weird.)
> 
> It turns out that access rights are stored in the same format as the VMX
> access rights.

Thanks, so it really has a "reserved" space in the middle, to save some
processing because the format is horrible (backward compatible).

> access rights.  However, they are shifted by 8, which my code above
> doesn't do (bug).

I think you are shifting it right, though ... they wouldn't fit into a
word if shifted left.

(I'd just shorten it after finding the right name for mask
   u32 mask = 0x00f0ff00
   desc->b = desc->b & ~mask | flags << 8 & mask

 and maybe define 'attributes' field in desc that is already shifted.)
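
(Spelled out, just as a sketch; this assumes desc_struct still exposes
 the raw second descriptor dword as 'b' and that the mask is right:

   static void rsm_set_desc_flags(struct desc_struct *desc, u16 flags)
   {
	u32 mask = 0x00f0ff00;	/* type/s/dpl/p in bits 8:15, avl/l/d/g in 20:23 */

	/* The flags word shifted left by 8 lines up with the 'b' dword. */
	desc->b = (desc->b & ~mask) | (((u32)flags << 8) & mask);
   })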

> The documentation is, of course, QEMU and Bochs :) but you can also find
> it in http://www.rcollins.org/ftp/source/include/struc.inc.  It is not
> exactly for SMM, but it is more or less the same.

(It's a register hidden from software, so I have some trust issues :])
Paolo Bonzini May 5, 2015, 1:18 p.m. UTC | #4
On 05/05/2015 14:48, Radim Krčmář wrote:
>>>> > > > +{
>>>> > > > +	desc->g    = (flags >> 15) & 1;
>>>> > > > +	desc->d    = (flags >> 14) & 1;
>>>> > > > +	desc->l    = (flags >> 13) & 1;
>>>> > > > +	desc->avl  = (flags >> 12) & 1;
>>>> > > > +	desc->p    = (flags >> 7) & 1;
>>>> > > > +	desc->dpl  = (flags >> 5) & 3;
>>>> > > > +	desc->s    = (flags >> 4) & 1;
>>>> > > > +	desc->type = flags & 15;
>>> > >
>>> > > I can't find a description of this ... can you point me to a place where
>>> > > the gap between 'p' and 'avl' is documented?
>>> > > (Not that it matters unless the guest reads it, but it's a bit weird.)
>> > 
>> > It turns out that access rights are stored in the same format as the VMX
>> > access rights.
> Thanks, so it really has a "reserved" space in the middle, to save some
> processing because the format is horrible (backward compatible).
> 
>> > access rights.  However, they are shifted by 8, which my code above
>> > doesn't do (bug).
> I think you are shifting it right, though ... they wouldn't fit into a
> word if shifted left.

Right, they have to be shifted right in the 64-bit case but not the
32-bit case.

Paolo
Bandan Das May 5, 2015, 8:44 p.m. UTC | #5
Paolo Bonzini <pbonzini@redhat.com> writes:

> +static inline bool guest_cpuid_has_longmode(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_cpuid_entry2 *best;
> +
> +	best = kvm_find_cpuid_entry(vcpu, 0x80000001, 0);
> +	return best && (best->edx & bit(X86_FEATURE_LM));
> +}
> +

We could combine all guest_cpuid_has* functions into a single
function guest_cpuid_has_feature(vcpu, feature) and avoid code duplication.
Not relevant to this change - just a note (to self).
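
Something along these lines, perhaps (untested sketch; how the
(function, register, bit) triple gets passed is just one possibility):

	static inline bool guest_cpuid_has_feature(struct kvm_vcpu *vcpu,
						   u32 function, u32 index,
						   int reg, u32 feature_bit)
	{
		struct kvm_cpuid_entry2 *best;
		u32 *regs;

		best = kvm_find_cpuid_entry(vcpu, function, index);
		if (!best)
			return false;
		regs = &best->eax;	/* eax..edx are consecutive u32s */
		return regs[reg] & feature_bit;
	}

	/* guest_cpuid_has_longmode() would then become
	 *	guest_cpuid_has_feature(vcpu, 0x80000001, 0, 3 /* edx */,
	 *				bit(X86_FEATURE_LM));
	 */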

>  static inline bool guest_cpuid_has_osvw(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_cpuid_entry2 *best;
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index f6b641207416..a49606871277 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -2262,12 +2262,253 @@ static int em_lseg(struct x86_emulate_ctxt *ctxt)
>  	return rc;
>  }
>  
> +static int emulator_has_longmode(struct x86_emulate_ctxt *ctxt)
> +{
> +	u32 eax, ebx, ecx, edx;
> +
> +	eax = 0x80000001;
> +	ecx = 0;
> +	ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx);
> +	return edx & bit(X86_FEATURE_LM);
> +}
> +
> +#define get_smstate(type, smbase, offset)				  \
> +	({								  \
> +	 type __val;							  \
> +	 int r = ctxt->ops->read_std(ctxt, smbase + offset, &__val,       \
> +				     sizeof(__val), NULL);		  \
> +	 if (r != X86EMUL_CONTINUE)					  \
> +		 return X86EMUL_UNHANDLEABLE;				  \
> +	 __val;								  \
> +	})
> +
> +static void rsm_set_desc_flags(struct desc_struct *desc, u16 flags)
> +{
> +	desc->g    = (flags >> 15) & 1;
> +	desc->d    = (flags >> 14) & 1;
> +	desc->l    = (flags >> 13) & 1;
> +	desc->avl  = (flags >> 12) & 1;
> +	desc->p    = (flags >> 7) & 1;
> +	desc->dpl  = (flags >> 5) & 3;
> +	desc->s    = (flags >> 4) & 1;
> +	desc->type = flags & 15;
> +}
> +
> +static int rsm_load_seg_32(struct x86_emulate_ctxt *ctxt, u64 smbase, int n)
> +{
> +	struct desc_struct desc;
> +	int offset;
> +	u16 selector;
> +
> +	selector = get_smstate(u32, smbase, 0x7fa8 + n * 4);
Probably a good idea to use #defines for all the offsets here
and elsewhere. 

> +
> +	if (n < 3)
> +		offset = 0x7f84 + n * 12;
> +	else
> +		offset = 0x7f2c + (n - 3) * 12;
> +
> +	set_desc_base(&desc,      get_smstate(u32, smbase, offset + 8));
> +	set_desc_limit(&desc,     get_smstate(u32, smbase, offset + 4));
> +	rsm_set_desc_flags(&desc, get_smstate(u32, smbase, offset));
> +	ctxt->ops->set_segment(ctxt, selector, &desc, 0, n);
> +	return X86EMUL_CONTINUE;
> +}
> +
> +static int rsm_load_seg_64(struct x86_emulate_ctxt *ctxt, u64 smbase, int n)
> +{
> +	struct desc_struct desc;
> +	int offset;
> +	u16 selector;
> +	u32 base3;
> +
> +	offset = 0x7e00 + n * 16;
> +
> +	selector =                get_smstate(u16, smbase, offset);
> +	rsm_set_desc_flags(&desc, get_smstate(u16, smbase, offset + 2));
> +	set_desc_limit(&desc,     get_smstate(u32, smbase, offset + 4));
> +	set_desc_base(&desc,      get_smstate(u32, smbase, offset + 8));
> +	base3 =                   get_smstate(u32, smbase, offset + 12);
> +
> +	ctxt->ops->set_segment(ctxt, selector, &desc, base3, n);
> +	return X86EMUL_CONTINUE;
> +}
> +
> +static int rsm_enter_protected_mode(struct x86_emulate_ctxt *ctxt,
> +				     u64 cr0, u64 cr4)
> +{
> +	int bad;
> +
> +	/*
> +	 * First enable PAE, long mode needs it before CR0.PG = 1 is set.
> +	 * Then enable protected mode.	However, PCID cannot be enabled
> +	 * if EFER.LMA=0, so set it separately.
> +	 */
> +	bad = ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE);
> +	if (bad)
> +		return X86EMUL_UNHANDLEABLE;
> +
> +	bad = ctxt->ops->set_cr(ctxt, 0, cr0);
> +	if (bad)
> +		return X86EMUL_UNHANDLEABLE;
> +
> +	if (cr4 & X86_CR4_PCIDE) {
> +		bad = ctxt->ops->set_cr(ctxt, 4, cr4);
> +		if (bad)
> +			return X86EMUL_UNHANDLEABLE;
> +	}
> +
> +	return X86EMUL_CONTINUE;
> +}
> +
> +static int rsm_load_state_32(struct x86_emulate_ctxt *ctxt, u64 smbase)
> +{
> +	struct desc_struct desc;
> +	struct desc_ptr dt;
> +	u16 selector;
> +	u32 val, cr0, cr4;
> +	int i;
> +
> +	cr0 =                      get_smstate(u32, smbase, 0x7ffc);
> +	ctxt->ops->set_cr(ctxt, 3, get_smstate(u32, smbase, 0x7ff8));
> +	ctxt->eflags =             get_smstate(u32, smbase, 0x7ff4) | X86_EFLAGS_FIXED;
> +	ctxt->_eip =               get_smstate(u32, smbase, 0x7ff0);
> +
> +	for (i = 0; i < 8; i++)
> +		*reg_write(ctxt, i) = get_smstate(u32, smbase, 0x7fd0 + i * 4);
> +
> +	val = get_smstate(u32, smbase, 0x7fcc);
> +	ctxt->ops->set_dr(ctxt, 6, (val & DR6_VOLATILE) | DR6_FIXED_1);
> +	val = get_smstate(u32, smbase, 0x7fc8);
> +	ctxt->ops->set_dr(ctxt, 7, (val & DR7_VOLATILE) | DR7_FIXED_1);
> +
> +	selector =                 get_smstate(u32, smbase, 0x7fc4);
> +	set_desc_base(&desc,       get_smstate(u32, smbase, 0x7f64));
> +	set_desc_limit(&desc,      get_smstate(u32, smbase, 0x7f60));
> +	rsm_set_desc_flags(&desc,  get_smstate(u32, smbase, 0x7f5c));
> +	ctxt->ops->set_segment(ctxt, selector, &desc, 0, VCPU_SREG_TR);
> +
> +	selector =                 get_smstate(u32, smbase, 0x7fc0);
> +	set_desc_base(&desc,       get_smstate(u32, smbase, 0x7f80));
> +	set_desc_limit(&desc,      get_smstate(u32, smbase, 0x7f7c));
> +	rsm_set_desc_flags(&desc,  get_smstate(u32, smbase, 0x7f78));
> +	ctxt->ops->set_segment(ctxt, selector, &desc, 0, VCPU_SREG_LDTR);
> +
> +	dt.address =               get_smstate(u32, smbase, 0x7f74);
> +	dt.size =                  get_smstate(u32, smbase, 0x7f70);
> +	ctxt->ops->set_gdt(ctxt, &dt);
> +
> +	dt.address =               get_smstate(u32, smbase, 0x7f58);
> +	dt.size =                  get_smstate(u32, smbase, 0x7f54);
> +	ctxt->ops->set_idt(ctxt, &dt);
> +
> +	for (i = 0; i < 6; i++) {
> +		int r = rsm_load_seg_32(ctxt, smbase, i);
> +		if (r != X86EMUL_CONTINUE)
> +			return r;
This return is redundant since rsm_load_seg_32 always returns
success. Same for rsm_load_seg_64()

> +	}
> +
> +	cr4 = get_smstate(u32, smbase, 0x7f14);
> +
> +	ctxt->ops->set_smbase(ctxt, get_smstate(u32, smbase, 0x7ef8));
> +
> +	return rsm_enter_protected_mode(ctxt, cr0, cr4);
> +}
> +
> +static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt, u64 smbase)
> +{
> +	struct desc_struct desc;
> +	struct desc_ptr dt;
> +	u64 val, cr0, cr4;
> +	u32 base3;
> +	u16 selector;
> +	int i;
> +
> +	for (i = 0; i < 16; i++)
> +		*reg_write(ctxt, i) = get_smstate(u64, smbase, 0x7ff8 - i * 8);
> +
> +	ctxt->_eip   = get_smstate(u64, smbase, 0x7f78);
> +	ctxt->eflags = get_smstate(u32, smbase, 0x7f70) | X86_EFLAGS_FIXED;
> +
> +	val = get_smstate(u32, smbase, 0x7f68);
> +	ctxt->ops->set_dr(ctxt, 6, (val & DR6_VOLATILE) | DR6_FIXED_1);
> +	val = get_smstate(u32, smbase, 0x7f60);
> +	ctxt->ops->set_dr(ctxt, 7, (val & DR7_VOLATILE) | DR7_FIXED_1);
> +
> +	cr0 =                       get_smstate(u64, smbase, 0x7f58);
> +	ctxt->ops->set_cr(ctxt, 3,  get_smstate(u64, smbase, 0x7f50));
> +	cr4 =                       get_smstate(u64, smbase, 0x7f48);
> +	ctxt->ops->set_smbase(ctxt, get_smstate(u32, smbase, 0x7f00));
> +	val =                       get_smstate(u64, smbase, 0x7ed0);
> +	ctxt->ops->set_msr(ctxt, MSR_EFER, val & ~EFER_LMA);
> +
> +	selector =                  get_smstate(u32, smbase, 0x7e90);
> +	rsm_set_desc_flags(&desc,   get_smstate(u32, smbase, 0x7e92));
> +	set_desc_limit(&desc,       get_smstate(u32, smbase, 0x7e94));
> +	set_desc_base(&desc,        get_smstate(u32, smbase, 0x7e98));
> +	base3 =                     get_smstate(u32, smbase, 0x7e9c);
> +	ctxt->ops->set_segment(ctxt, selector, &desc, base3, VCPU_SREG_TR);
> +
> +	dt.size =                   get_smstate(u32, smbase, 0x7e84);
> +	dt.address =                get_smstate(u64, smbase, 0x7e88);
> +	ctxt->ops->set_idt(ctxt, &dt);
> +
> +	selector =                  get_smstate(u32, smbase, 0x7e70);
> +	rsm_set_desc_flags(&desc,   get_smstate(u32, smbase, 0x7e72));
> +	set_desc_limit(&desc,       get_smstate(u32, smbase, 0x7e74));
> +	set_desc_base(&desc,        get_smstate(u32, smbase, 0x7e78));
> +	base3 =                     get_smstate(u32, smbase, 0x7e7c);
> +	ctxt->ops->set_segment(ctxt, selector, &desc, base3, VCPU_SREG_LDTR);
> +
> +	dt.size =                   get_smstate(u32, smbase, 0x7e64);
> +	dt.address =                get_smstate(u64, smbase, 0x7e68);
> +	ctxt->ops->set_gdt(ctxt, &dt);
> +
> +	for (i = 0; i < 6; i++) {
> +		int r = rsm_load_seg_64(ctxt, smbase, i);
> +		if (r != X86EMUL_CONTINUE)
> +			return r;
> +	}
> +
> +	return rsm_enter_protected_mode(ctxt, cr0, cr4);
> +}
> +
>  static int em_rsm(struct x86_emulate_ctxt *ctxt)
>  {
> +	unsigned long cr0, cr4, efer;
> +	u64 smbase;
> +	int ret;
> +
> +	printk("rsm\n");
>  	if ((ctxt->emul_flags & X86EMUL_SMM_MASK) == 0)
>  		return emulate_ud(ctxt);
>  
> -	return X86EMUL_UNHANDLEABLE;
> +	/*
> +	 * Get back to real mode, to prepare a safe state in which
> +	 * to load CR0/CR3/CR4/EFER.  Also this will ensure that
> +	 * addresses passed to read_std/write_std are not virtual.
> +	 */
I am trying to understand this. Aren't we already in real mode
here since we are in SMM, or am I missing something?

> +	cr0 = ctxt->ops->get_cr(ctxt, 0);
> +	if (cr0 & X86_CR0_PE)
> +		ctxt->ops->set_cr(ctxt, 0, cr0 & ~(X86_CR0_PG | X86_CR0_PE));
> +	cr4 = ctxt->ops->get_cr(ctxt, 4);
> +	if (cr0 & X86_CR4_PAE)
> +		ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PAE);
> +	efer = 0;
> +	ctxt->ops->set_msr(ctxt, MSR_EFER, efer);
> +
> +	ctxt->ops->get_msr(ctxt, MSR_IA32_SMBASE, &smbase);
> +	if (emulator_has_longmode(ctxt))
> +		ret = rsm_load_state_64(ctxt, smbase + 0x8000);
> +	else
> +		ret = rsm_load_state_32(ctxt, smbase + 0x8000);
> +
> +	if (ret != X86EMUL_CONTINUE) {
> +		/* FIXME: should triple fault */
> +		return X86EMUL_UNHANDLEABLE;
> +	}
> +
> +	ctxt->emul_flags &= ~X86EMUL_SMM_MASK;
> +	return X86EMUL_CONTINUE;
>  }
>  
>  static void
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4cd7a2a18e93..ab6a38617813 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6232,12 +6232,226 @@ static void process_nmi(struct kvm_vcpu *vcpu)
>  	kvm_make_request(KVM_REQ_EVENT, vcpu);
>  }
>  
> +#define put_smstate(type, buf, offset, val)			  \
> +	*(type *)((buf) + (offset) - 0x7e00) = val
> +
> +static u16 process_smi_get_segment_flags(struct kvm_segment *seg)
> +{
> +	u16 flags = 0;
> +	flags |= seg->g       << 15;
> +	flags |= seg->db      << 14;
> +	flags |= seg->l       << 13;
> +	flags |= seg->avl     << 12;
> +	flags |= seg->present << 7;
> +	flags |= seg->dpl     << 5;
> +	flags |= seg->s       << 4;
> +	flags |= seg->type;
> +	return flags;
> +}
> +
> +static void process_smi_save_seg_32(struct kvm_vcpu *vcpu, char *buf, int n)
> +{
> +	struct kvm_segment seg;
> +	int offset;
> +
> +	kvm_get_segment(vcpu, &seg, n);
> +	put_smstate(u32, buf, 0x7fa8 + n * 4, seg.selector);
> +
> +	if (n < 3)
> +		offset = 0x7f84 + n * 12;
> +	else
> +		offset = 0x7f2c + (n - 3) * 12;
> +
> +	put_smstate(u32, buf, offset + 8, seg.base);
> +	put_smstate(u32, buf, offset + 4, seg.limit);
> +	put_smstate(u32, buf, offset, process_smi_get_segment_flags(&seg));
> +}
> +
> +static void process_smi_save_seg_64(struct kvm_vcpu *vcpu, char *buf, int n)
> +{
> +	struct kvm_segment seg;
> +	int offset;
> +
> +	kvm_get_segment(vcpu, &seg, n);
> +	offset = 0x7e00 + n * 16;
> +
> +	put_smstate(u16, buf, offset, seg.selector);
> +	put_smstate(u16, buf, offset + 2, process_smi_get_segment_flags(&seg));
> +	put_smstate(u32, buf, offset + 4, seg.limit);
> +	put_smstate(u64, buf, offset + 8, seg.base);
> +}
> +
> +static void process_smi_save_state_32(struct kvm_vcpu *vcpu, char *buf)
> +{
> +	struct desc_ptr dt;
> +	struct kvm_segment seg;
> +	unsigned long val;
> +	int i;
> +
> +	put_smstate(u32, buf, 0x7ffc, kvm_read_cr0(vcpu));
> +	put_smstate(u32, buf, 0x7ff8, kvm_read_cr3(vcpu));
> +	put_smstate(u32, buf, 0x7ff4, kvm_get_rflags(vcpu));
> +	put_smstate(u32, buf, 0x7ff0, kvm_rip_read(vcpu));
> +
> +	for (i = 0; i < 8; i++)
> +		put_smstate(u32, buf, 0x7fd0 + i * 4, kvm_register_read(vcpu, i));
> +
> +	kvm_get_dr(vcpu, 6, &val);
> +	put_smstate(u32, buf, 0x7fcc, (u32)val);
> +	kvm_get_dr(vcpu, 7, &val);
> +	put_smstate(u32, buf, 0x7fc8, (u32)val);
> +
> +	kvm_get_segment(vcpu, &seg, VCPU_SREG_TR);
> +	put_smstate(u32, buf, 0x7fc4, seg.selector);
> +	put_smstate(u32, buf, 0x7f64, seg.base);
> +	put_smstate(u32, buf, 0x7f60, seg.limit);
> +	put_smstate(u32, buf, 0x7f5c, process_smi_get_segment_flags(&seg));
> +
> +	kvm_get_segment(vcpu, &seg, VCPU_SREG_LDTR);
> +	put_smstate(u32, buf, 0x7fc0, seg.selector);
> +	put_smstate(u32, buf, 0x7f80, seg.base);
> +	put_smstate(u32, buf, 0x7f7c, seg.limit);
> +	put_smstate(u32, buf, 0x7f78, process_smi_get_segment_flags(&seg));
> +
> +	kvm_x86_ops->get_gdt(vcpu, &dt);
> +	put_smstate(u32, buf, 0x7f74, dt.address);
> +	put_smstate(u32, buf, 0x7f70, dt.size);
> +
> +	kvm_x86_ops->get_idt(vcpu, &dt);
> +	put_smstate(u32, buf, 0x7f58, dt.address);
> +	put_smstate(u32, buf, 0x7f54, dt.size);
> +
> +	for (i = 0; i < 6; i++)
> +		process_smi_save_seg_32(vcpu, buf, i);
> +
> +	put_smstate(u32, buf, 0x7f14, kvm_read_cr4(vcpu));
> +
> +	/* revision id */
> +	put_smstate(u32, buf, 0x7efc, 0x00020000);
> +	put_smstate(u32, buf, 0x7ef8, vcpu->arch.smbase);
> +}
> +
> +static void process_smi_save_state_64(struct kvm_vcpu *vcpu, char *buf)
> +{
> +#ifdef CONFIG_X86_64
> +	struct desc_ptr dt;
> +	struct kvm_segment seg;
> +	unsigned long val;
> +	int i;
> +
> +	for (i = 0; i < 16; i++)
> +		put_smstate(u64, buf, 0x7ff8 - i * 8, kvm_register_read(vcpu, i));
> +
> +	put_smstate(u64, buf, 0x7f78, kvm_rip_read(vcpu));
> +	put_smstate(u32, buf, 0x7f70, kvm_get_rflags(vcpu));
> +
> +	kvm_get_dr(vcpu, 6, &val);
> +	put_smstate(u64, buf, 0x7f68, val);
> +	kvm_get_dr(vcpu, 7, &val);
> +	put_smstate(u64, buf, 0x7f60, val);
> +
> +	put_smstate(u64, buf, 0x7f58, kvm_read_cr0(vcpu));
> +	put_smstate(u64, buf, 0x7f50, kvm_read_cr3(vcpu));
> +	put_smstate(u64, buf, 0x7f48, kvm_read_cr4(vcpu));
> +
> +	put_smstate(u32, buf, 0x7f00, vcpu->arch.smbase);
> +
> +	/* revision id */
> +	put_smstate(u32, buf, 0x7efc, 0x00020064);
Is the revision id (and 0x00020000 for process_smi*_32()) from the
spec? I can't seem to find them.

Bandan
...
Paolo Bonzini May 6, 2015, 10:39 a.m. UTC | #6
On 05/05/2015 22:44, Bandan Das wrote:
> Paolo Bonzini <pbonzini@redhat.com> writes:
> 
>> +static inline bool guest_cpuid_has_longmode(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_cpuid_entry2 *best;
>> +
>> +	best = kvm_find_cpuid_entry(vcpu, 0x80000001, 0);
>> +	return best && (best->edx & bit(X86_FEATURE_LM));
>> +}
>> +
> 
> We could combine all guest_cpuid_has* functions into a single
> function guest_cpuid_has_feature(vcpu, feature) and avoid code duplication.
> Not relevant to this change - just a note (to self).
> 
>>  static inline bool guest_cpuid_has_osvw(struct kvm_vcpu *vcpu)
>>  {
>>  	struct kvm_cpuid_entry2 *best;
>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>> index f6b641207416..a49606871277 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -2262,12 +2262,253 @@ static int em_lseg(struct x86_emulate_ctxt *ctxt)
>>  	return rc;
>>  }
>>  
>> +static int emulator_has_longmode(struct x86_emulate_ctxt *ctxt)
>> +{
>> +	u32 eax, ebx, ecx, edx;
>> +
>> +	eax = 0x80000001;
>> +	ecx = 0;
>> +	ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx);
>> +	return edx & bit(X86_FEATURE_LM);
>> +}
>> +
>> +#define get_smstate(type, smbase, offset)				  \
>> +	({								  \
>> +	 type __val;							  \
>> +	 int r = ctxt->ops->read_std(ctxt, smbase + offset, &__val,       \
>> +				     sizeof(__val), NULL);		  \
>> +	 if (r != X86EMUL_CONTINUE)					  \
>> +		 return X86EMUL_UNHANDLEABLE;				  \
>> +	 __val;								  \
>> +	})
>> +
>> +static void rsm_set_desc_flags(struct desc_struct *desc, u16 flags)
>> +{
>> +	desc->g    = (flags >> 15) & 1;
>> +	desc->d    = (flags >> 14) & 1;
>> +	desc->l    = (flags >> 13) & 1;
>> +	desc->avl  = (flags >> 12) & 1;
>> +	desc->p    = (flags >> 7) & 1;
>> +	desc->dpl  = (flags >> 5) & 3;
>> +	desc->s    = (flags >> 4) & 1;
>> +	desc->type = flags & 15;
>> +}
>> +
>> +static int rsm_load_seg_32(struct x86_emulate_ctxt *ctxt, u64 smbase, int n)
>> +{
>> +	struct desc_struct desc;
>> +	int offset;
>> +	u16 selector;
>> +
>> +	selector = get_smstate(u32, smbase, 0x7fa8 + n * 4);
> Probably a good idea to use #defines for all the offsets here
> and elsewhere. 

Uff, those would be a lot of offsets.  It's just as easy to get the
#defines right, as it is to get them right in get_smstate...  >80
character lines would also mess everything up, I think.

> This return is redundant since rsm_load_seg_32 always returns
> success. Same for rsm_load_seg_64()

Note that the get_smstate macro can do an early return.

>>  static int em_rsm(struct x86_emulate_ctxt *ctxt)
>>  {
>> +	unsigned long cr0, cr4, efer;
>> +	u64 smbase;
>> +	int ret;
>> +
>> +	printk("rsm\n");
>>  	if ((ctxt->emul_flags & X86EMUL_SMM_MASK) == 0)
>>  		return emulate_ud(ctxt);
>>  
>> -	return X86EMUL_UNHANDLEABLE;
>> +	/*
>> +	 * Get back to real mode, to prepare a safe state in which
>> +	 * to load CR0/CR3/CR4/EFER.  Also this will ensure that
>> +	 * addresses passed to read_std/write_std are not virtual.
>> +	 */
> 
> I am trying to understand this. Aren't we already in real mode
> here since we are in SMM, or am I missing something?

SMM starts in big real mode.  The SMI handler can go to protected mode
and even enable paging; as long as it's not 64-bit, it can execute an RSM.

I should check and disable CR4.PCIDE before CR0.PG.
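
Something along these lines at the top of em_rsm(), roughly (only a
sketch of the ordering, not a final patch; on real hardware a MOV to
CR0 that clears PG while CR4.PCIDE=1 takes a #GP, hence the order):

	cr4 = ctxt->ops->get_cr(ctxt, 4);
	if (cr4 & X86_CR4_PCIDE)
		ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE);

	cr0 = ctxt->ops->get_cr(ctxt, 0);
	if (cr0 & X86_CR0_PE)
		ctxt->ops->set_cr(ctxt, 0, cr0 & ~(X86_CR0_PG | X86_CR0_PE));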

>> +	cr0 = ctxt->ops->get_cr(ctxt, 0);
>> +	if (cr0 & X86_CR0_PE)
>> +		ctxt->ops->set_cr(ctxt, 0, cr0 & ~(X86_CR0_PG | X86_CR0_PE));
>> +	cr4 = ctxt->ops->get_cr(ctxt, 4);
>> +	if (cr0 & X86_CR4_PAE)

Oops, cr4 here.

>> +	/* revision id */
>> +	put_smstate(u32, buf, 0x7efc, 0x00020064);
> Is the revision id (and  0x00020000 for process_smi*_32()) from the
> spec ? I can't seem to find them.

This revision id is in the AMD spec.  You can see that SeaBIOS checks
for it and the 32-bit one as well (which I cannot find anywhere).

SeaBIOS should also accept 0x30000 and 0x30064.  Sending a patch.

Paolo

> Bandan
> ...
> 
Bandan Das May 6, 2015, 5:55 p.m. UTC | #7
Paolo Bonzini <pbonzini@redhat.com> writes:
..
>>> +
>>> +	selector = get_smstate(u32, smbase, 0x7fa8 + n * 4);
>> Probably a good idea to use #defines for all the offsets here
>> and elsewhere. 
>
> Uff, those would be a lot of offsets.  It's just as easy to get the
> #defines right, as it is to get them right in get_smstate...  >80
> character lines would also mess everything up, I think.

Valid point. But I would say cryptic names will always be better than
hex offsets. It's a one-time pain in the neck to get the defines
right.

I found this header -
ftp://ftp.xskernel.org/soft/linux-src/bochs-20080511/cpu/smm.h

Maybe we can pick it up ? I can sanity check the list and
make sure the offsets are correct if you prefer.

>> This return is redundant since rsm_load_seg_32 always returns
>> success. Same for rsm_load_seg_64()
>
> Note that the get_smstate macro can do an early return.
>
Oops! Sorry, right. I suggest changing get_smstate to get_smstate_iam_a_macro_not_a_function ;)

>>>  static int em_rsm(struct x86_emulate_ctxt *ctxt)
>>>  {
>>> +	unsigned long cr0, cr4, efer;
>>> +	u64 smbase;
>>> +	int ret;
>>> +
>>> +	printk("rsm\n");
>>>  	if ((ctxt->emul_flags & X86EMUL_SMM_MASK) == 0)
>>>  		return emulate_ud(ctxt);
>>>  
>>> -	return X86EMUL_UNHANDLEABLE;
>>> +	/*
>>> +	 * Get back to real mode, to prepare a safe state in which
>>> +	 * to load CR0/CR3/CR4/EFER.  Also this will ensure that
>>> +	 * addresses passed to read_std/write_std are not virtual.
>>> +	 */
>> 
>> I am trying to understand this. Aren't we already in real mode
>> here since we are in SMM, or am I missing something?
>
> SMM starts in big real mode.  The SMI handler can go to protected mode
> and even enable paging; as long as it's not 64-bit, it can execute an RSM.

Ok. This part. rsm can be called from protected mode. Thanks for the
explanation.

> I should check and disable CR4.PCIDE before CR0.PG.
>
Actually, 4.10.1 says that "The processor ensures that CR4.PCIDE can
only be 1 in IA32e mode".  At this point it should already be 0.

Bandan

>>> +	cr0 = ctxt->ops->get_cr(ctxt, 0);
>>> +	if (cr0 & X86_CR0_PE)
>>> +		ctxt->ops->set_cr(ctxt, 0, cr0 & ~(X86_CR0_PG | X86_CR0_PE));
>>> +	cr4 = ctxt->ops->get_cr(ctxt, 4);
>>> +	if (cr0 & X86_CR4_PAE)
>
> Oops, cr4 here.
>
>>> +	/* revision id */
>>> +	put_smstate(u32, buf, 0x7efc, 0x00020064);
>> Is the revision id (and  0x00020000 for process_smi*_32()) from the
>> spec ? I can't seem to find them.
>
> This revision id is in the AMD spec.  You can see that SeaBIOS checks
> for it and the 32-bit one as well (which I cannot find anywhere).
>
> SeaBIOS should also accept 0x30000 and 0x30064.  Sending a patch.
>
> Paolo
>
>> Bandan
>> ...
>> 
Paolo Bonzini May 6, 2015, 7:38 p.m. UTC | #8
On 06/05/2015 19:55, Bandan Das wrote:
> Valid point. But I would say cryptic names will always be better than
> hex offsets. It's a one-time pain in the neck to get the defines
> right.
> 
> I found this header -
> ftp://ftp.xskernel.org/soft/linux-src/bochs-20080511/cpu/smm.h
> 
> Maybe we can pick it up ? I can sanity check the list and
> make sure the offsets are correct if you prefer.

Thanks, I like that header. :)  No need to sanity check yourself.

I'll have to adjust to define both 32-bit and 64-bit offsets.  Ok to use
just SMRAM32_OFS_xyz and SMRAM64_OFS_xyz?  No need for LO32 and HI32, in
particular.
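
Concretely, something like this, with the values taken from the layout
in this patch (only to illustrate the naming, not a final header):

	#define SMRAM32_OFS_CR0		0x7ffc
	#define SMRAM32_OFS_CR3		0x7ff8
	#define SMRAM32_OFS_EFLAGS	0x7ff4
	#define SMRAM32_OFS_EIP		0x7ff0
	#define SMRAM32_OFS_CR4		0x7f14
	#define SMRAM32_OFS_SMBASE	0x7ef8

	#define SMRAM64_OFS_RIP		0x7f78
	#define SMRAM64_OFS_RFLAGS	0x7f70
	#define SMRAM64_OFS_EFER	0x7ed0
	#define SMRAM64_OFS_SMBASE	0x7f00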

The one problem is that I am doing math on the offsets in some places.
It would suck to save/restore 16 registers manually.  Any suggestions?
I will probably send v1 for review without the offsets.

Paolo
Bandan Das May 12, 2015, 11:56 p.m. UTC | #9
Paolo Bonzini <pbonzini@redhat.com> writes:

> On 06/05/2015 19:55, Bandan Das wrote:
>> Valid point. But I would say cryptic names will always be better than
>> hex offsets. It's a one-time pain in the neck to get the defines
>> right.
>> 
>> I found this header -
>> ftp://ftp.xskernel.org/soft/linux-src/bochs-20080511/cpu/smm.h
>> 
>> Maybe we can pick it up ? I can sanity check the list and
>> make sure the offsets are correct if you prefer.
>
> Thanks, I like that header. :)  No need to sanity check yourself.
>
> I'll have to adjust to define both 32-bit and 64-bit offsets.  Ok to use
> just SMRAM32_OFS_xyz and SMRAM64_OFS_xyz?  No need for LO32 and HI32, in
> particular.
Agreed, that sounds like the right thing to do.

> The one problem is that I am doing math on the offsets in some places.
> It would suck to save/restore 16 registers manually.  Any suggestions?
> I will probably send v1 for review without the offsets.

I think it's ok just to replace the offset number with a name.
If you are concerned about readability, we could probably wrap the
"for loops" doing the math calculations in a more serious-sounding macro,
e.g. "#define foreach_smm_offset(val, start_offset, jump, count)" :)

> Paolo
Paolo Bonzini May 13, 2015, 6:58 a.m. UTC | #10
On 13/05/2015 01:56, Bandan Das wrote:
> I think it's ok just to replace the offset number with a name.
> If you are concerned about readability, we could probably wrap the
> "for loops" in the math calculations with a more serious sounding macro
> eg. "#define foreach_smm_offset(val, start_offset, jump, count)" :)

That doesn't help.  The problem is that the for loops assume that you
know that the offsets are at a constant distance, and what that distance is.
This is already not trivial for the registers (is RAX at the highest or
the lowest offset?); and it could be even worse than hardcoded offsets for
selectors and descriptor caches.

Paolo

Patch

diff --git a/arch/x86/kvm/cpuid.h b/arch/x86/kvm/cpuid.h
index c3b1ad9fca81..02b67616435d 100644
--- a/arch/x86/kvm/cpuid.h
+++ b/arch/x86/kvm/cpuid.h
@@ -70,6 +70,14 @@  static inline bool guest_cpuid_has_fsgsbase(struct kvm_vcpu *vcpu)
 	return best && (best->ebx & bit(X86_FEATURE_FSGSBASE));
 }
 
+static inline bool guest_cpuid_has_longmode(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry(vcpu, 0x80000001, 0);
+	return best && (best->edx & bit(X86_FEATURE_LM));
+}
+
 static inline bool guest_cpuid_has_osvw(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *best;
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index f6b641207416..a49606871277 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -2262,12 +2262,253 @@  static int em_lseg(struct x86_emulate_ctxt *ctxt)
 	return rc;
 }
 
+static int emulator_has_longmode(struct x86_emulate_ctxt *ctxt)
+{
+	u32 eax, ebx, ecx, edx;
+
+	eax = 0x80000001;
+	ecx = 0;
+	ctxt->ops->get_cpuid(ctxt, &eax, &ebx, &ecx, &edx);
+	return edx & bit(X86_FEATURE_LM);
+}
+
+#define get_smstate(type, smbase, offset)				  \
+	({								  \
+	 type __val;							  \
+	 int r = ctxt->ops->read_std(ctxt, smbase + offset, &__val,       \
+				     sizeof(__val), NULL);		  \
+	 if (r != X86EMUL_CONTINUE)					  \
+		 return X86EMUL_UNHANDLEABLE;				  \
+	 __val;								  \
+	})
+
+static void rsm_set_desc_flags(struct desc_struct *desc, u16 flags)
+{
+	desc->g    = (flags >> 15) & 1;
+	desc->d    = (flags >> 14) & 1;
+	desc->l    = (flags >> 13) & 1;
+	desc->avl  = (flags >> 12) & 1;
+	desc->p    = (flags >> 7) & 1;
+	desc->dpl  = (flags >> 5) & 3;
+	desc->s    = (flags >> 4) & 1;
+	desc->type = flags & 15;
+}
+
+static int rsm_load_seg_32(struct x86_emulate_ctxt *ctxt, u64 smbase, int n)
+{
+	struct desc_struct desc;
+	int offset;
+	u16 selector;
+
+	selector = get_smstate(u32, smbase, 0x7fa8 + n * 4);
+
+	if (n < 3)
+		offset = 0x7f84 + n * 12;
+	else
+		offset = 0x7f2c + (n - 3) * 12;
+
+	set_desc_base(&desc,      get_smstate(u32, smbase, offset + 8));
+	set_desc_limit(&desc,     get_smstate(u32, smbase, offset + 4));
+	rsm_set_desc_flags(&desc, get_smstate(u32, smbase, offset));
+	ctxt->ops->set_segment(ctxt, selector, &desc, 0, n);
+	return X86EMUL_CONTINUE;
+}
+
+static int rsm_load_seg_64(struct x86_emulate_ctxt *ctxt, u64 smbase, int n)
+{
+	struct desc_struct desc;
+	int offset;
+	u16 selector;
+	u32 base3;
+
+	offset = 0x7e00 + n * 16;
+
+	selector =                get_smstate(u16, smbase, offset);
+	rsm_set_desc_flags(&desc, get_smstate(u16, smbase, offset + 2));
+	set_desc_limit(&desc,     get_smstate(u32, smbase, offset + 4));
+	set_desc_base(&desc,      get_smstate(u32, smbase, offset + 8));
+	base3 =                   get_smstate(u32, smbase, offset + 12);
+
+	ctxt->ops->set_segment(ctxt, selector, &desc, base3, n);
+	return X86EMUL_CONTINUE;
+}
+
+static int rsm_enter_protected_mode(struct x86_emulate_ctxt *ctxt,
+				     u64 cr0, u64 cr4)
+{
+	int bad;
+
+	/*
+	 * First enable PAE, long mode needs it before CR0.PG = 1 is set.
+	 * Then enable protected mode.	However, PCID cannot be enabled
+	 * if EFER.LMA=0, so set it separately.
+	 */
+	bad = ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PCIDE);
+	if (bad)
+		return X86EMUL_UNHANDLEABLE;
+
+	bad = ctxt->ops->set_cr(ctxt, 0, cr0);
+	if (bad)
+		return X86EMUL_UNHANDLEABLE;
+
+	if (cr4 & X86_CR4_PCIDE) {
+		bad = ctxt->ops->set_cr(ctxt, 4, cr4);
+		if (bad)
+			return X86EMUL_UNHANDLEABLE;
+	}
+
+	return X86EMUL_CONTINUE;
+}
+
+static int rsm_load_state_32(struct x86_emulate_ctxt *ctxt, u64 smbase)
+{
+	struct desc_struct desc;
+	struct desc_ptr dt;
+	u16 selector;
+	u32 val, cr0, cr4;
+	int i;
+
+	cr0 =                      get_smstate(u32, smbase, 0x7ffc);
+	ctxt->ops->set_cr(ctxt, 3, get_smstate(u32, smbase, 0x7ff8));
+	ctxt->eflags =             get_smstate(u32, smbase, 0x7ff4) | X86_EFLAGS_FIXED;
+	ctxt->_eip =               get_smstate(u32, smbase, 0x7ff0);
+
+	for (i = 0; i < 8; i++)
+		*reg_write(ctxt, i) = get_smstate(u32, smbase, 0x7fd0 + i * 4);
+
+	val = get_smstate(u32, smbase, 0x7fcc);
+	ctxt->ops->set_dr(ctxt, 6, (val & DR6_VOLATILE) | DR6_FIXED_1);
+	val = get_smstate(u32, smbase, 0x7fc8);
+	ctxt->ops->set_dr(ctxt, 7, (val & DR7_VOLATILE) | DR7_FIXED_1);
+
+	selector =                 get_smstate(u32, smbase, 0x7fc4);
+	set_desc_base(&desc,       get_smstate(u32, smbase, 0x7f64));
+	set_desc_limit(&desc,      get_smstate(u32, smbase, 0x7f60));
+	rsm_set_desc_flags(&desc,  get_smstate(u32, smbase, 0x7f5c));
+	ctxt->ops->set_segment(ctxt, selector, &desc, 0, VCPU_SREG_TR);
+
+	selector =                 get_smstate(u32, smbase, 0x7fc0);
+	set_desc_base(&desc,       get_smstate(u32, smbase, 0x7f80));
+	set_desc_limit(&desc,      get_smstate(u32, smbase, 0x7f7c));
+	rsm_set_desc_flags(&desc,  get_smstate(u32, smbase, 0x7f78));
+	ctxt->ops->set_segment(ctxt, selector, &desc, 0, VCPU_SREG_LDTR);
+
+	dt.address =               get_smstate(u32, smbase, 0x7f74);
+	dt.size =                  get_smstate(u32, smbase, 0x7f70);
+	ctxt->ops->set_gdt(ctxt, &dt);
+
+	dt.address =               get_smstate(u32, smbase, 0x7f58);
+	dt.size =                  get_smstate(u32, smbase, 0x7f54);
+	ctxt->ops->set_idt(ctxt, &dt);
+
+	for (i = 0; i < 6; i++) {
+		int r = rsm_load_seg_32(ctxt, smbase, i);
+		if (r != X86EMUL_CONTINUE)
+			return r;
+	}
+
+	cr4 = get_smstate(u32, smbase, 0x7f14);
+
+	ctxt->ops->set_smbase(ctxt, get_smstate(u32, smbase, 0x7ef8));
+
+	return rsm_enter_protected_mode(ctxt, cr0, cr4);
+}
+
+static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt, u64 smbase)
+{
+	struct desc_struct desc;
+	struct desc_ptr dt;
+	u64 val, cr0, cr4;
+	u32 base3;
+	u16 selector;
+	int i;
+
+	for (i = 0; i < 16; i++)
+		*reg_write(ctxt, i) = get_smstate(u64, smbase, 0x7ff8 - i * 8);
+
+	ctxt->_eip   = get_smstate(u64, smbase, 0x7f78);
+	ctxt->eflags = get_smstate(u32, smbase, 0x7f70) | X86_EFLAGS_FIXED;
+
+	val = get_smstate(u32, smbase, 0x7f68);
+	ctxt->ops->set_dr(ctxt, 6, (val & DR6_VOLATILE) | DR6_FIXED_1);
+	val = get_smstate(u32, smbase, 0x7f60);
+	ctxt->ops->set_dr(ctxt, 7, (val & DR7_VOLATILE) | DR7_FIXED_1);
+
+	cr0 =                       get_smstate(u64, smbase, 0x7f58);
+	ctxt->ops->set_cr(ctxt, 3,  get_smstate(u64, smbase, 0x7f50));
+	cr4 =                       get_smstate(u64, smbase, 0x7f48);
+	ctxt->ops->set_smbase(ctxt, get_smstate(u32, smbase, 0x7f00));
+	val =                       get_smstate(u64, smbase, 0x7ed0);
+	ctxt->ops->set_msr(ctxt, MSR_EFER, val & ~EFER_LMA);
+
+	selector =                  get_smstate(u32, smbase, 0x7e90);
+	rsm_set_desc_flags(&desc,   get_smstate(u32, smbase, 0x7e92));
+	set_desc_limit(&desc,       get_smstate(u32, smbase, 0x7e94));
+	set_desc_base(&desc,        get_smstate(u32, smbase, 0x7e98));
+	base3 =                     get_smstate(u32, smbase, 0x7e9c);
+	ctxt->ops->set_segment(ctxt, selector, &desc, base3, VCPU_SREG_TR);
+
+	dt.size =                   get_smstate(u32, smbase, 0x7e84);
+	dt.address =                get_smstate(u64, smbase, 0x7e88);
+	ctxt->ops->set_idt(ctxt, &dt);
+
+	selector =                  get_smstate(u32, smbase, 0x7e70);
+	rsm_set_desc_flags(&desc,   get_smstate(u32, smbase, 0x7e72));
+	set_desc_limit(&desc,       get_smstate(u32, smbase, 0x7e74));
+	set_desc_base(&desc,        get_smstate(u32, smbase, 0x7e78));
+	base3 =                     get_smstate(u32, smbase, 0x7e7c);
+	ctxt->ops->set_segment(ctxt, selector, &desc, base3, VCPU_SREG_LDTR);
+
+	dt.size =                   get_smstate(u32, smbase, 0x7e64);
+	dt.address =                get_smstate(u64, smbase, 0x7e68);
+	ctxt->ops->set_gdt(ctxt, &dt);
+
+	for (i = 0; i < 6; i++) {
+		int r = rsm_load_seg_64(ctxt, smbase, i);
+		if (r != X86EMUL_CONTINUE)
+			return r;
+	}
+
+	return rsm_enter_protected_mode(ctxt, cr0, cr4);
+}
+
 static int em_rsm(struct x86_emulate_ctxt *ctxt)
 {
+	unsigned long cr0, cr4, efer;
+	u64 smbase;
+	int ret;
+
+	printk("rsm\n");
 	if ((ctxt->emul_flags & X86EMUL_SMM_MASK) == 0)
 		return emulate_ud(ctxt);
 
-	return X86EMUL_UNHANDLEABLE;
+	/*
+	 * Get back to real mode, to prepare a safe state in which
+	 * to load CR0/CR3/CR4/EFER.  Also this will ensure that
+	 * addresses passed to read_std/write_std are not virtual.
+	 */
+	cr0 = ctxt->ops->get_cr(ctxt, 0);
+	if (cr0 & X86_CR0_PE)
+		ctxt->ops->set_cr(ctxt, 0, cr0 & ~(X86_CR0_PG | X86_CR0_PE));
+	cr4 = ctxt->ops->get_cr(ctxt, 4);
+	if (cr0 & X86_CR4_PAE)
+		ctxt->ops->set_cr(ctxt, 4, cr4 & ~X86_CR4_PAE);
+	efer = 0;
+	ctxt->ops->set_msr(ctxt, MSR_EFER, efer);
+
+	ctxt->ops->get_msr(ctxt, MSR_IA32_SMBASE, &smbase);
+	if (emulator_has_longmode(ctxt))
+		ret = rsm_load_state_64(ctxt, smbase + 0x8000);
+	else
+		ret = rsm_load_state_32(ctxt, smbase + 0x8000);
+
+	if (ret != X86EMUL_CONTINUE) {
+		/* FIXME: should triple fault */
+		return X86EMUL_UNHANDLEABLE;
+	}
+
+	ctxt->emul_flags &= ~X86EMUL_SMM_MASK;
+	return X86EMUL_CONTINUE;
 }
 
 static void
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4cd7a2a18e93..ab6a38617813 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6232,12 +6232,226 @@  static void process_nmi(struct kvm_vcpu *vcpu)
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
 
+#define put_smstate(type, buf, offset, val)			  \
+	*(type *)((buf) + (offset) - 0x7e00) = val
+
+static u16 process_smi_get_segment_flags(struct kvm_segment *seg)
+{
+	u16 flags = 0;
+	flags |= seg->g       << 15;
+	flags |= seg->db      << 14;
+	flags |= seg->l       << 13;
+	flags |= seg->avl     << 12;
+	flags |= seg->present << 7;
+	flags |= seg->dpl     << 5;
+	flags |= seg->s       << 4;
+	flags |= seg->type;
+	return flags;
+}
+
+static void process_smi_save_seg_32(struct kvm_vcpu *vcpu, char *buf, int n)
+{
+	struct kvm_segment seg;
+	int offset;
+
+	kvm_get_segment(vcpu, &seg, n);
+	put_smstate(u32, buf, 0x7fa8 + n * 4, seg.selector);
+
+	if (n < 3)
+		offset = 0x7f84 + n * 12;
+	else
+		offset = 0x7f2c + (n - 3) * 12;
+
+	put_smstate(u32, buf, offset + 8, seg.base);
+	put_smstate(u32, buf, offset + 4, seg.limit);
+	put_smstate(u32, buf, offset, process_smi_get_segment_flags(&seg));
+}
+
+static void process_smi_save_seg_64(struct kvm_vcpu *vcpu, char *buf, int n)
+{
+	struct kvm_segment seg;
+	int offset;
+
+	kvm_get_segment(vcpu, &seg, n);
+	offset = 0x7e00 + n * 16;
+
+	put_smstate(u16, buf, offset, seg.selector);
+	put_smstate(u16, buf, offset + 2, process_smi_get_segment_flags(&seg));
+	put_smstate(u32, buf, offset + 4, seg.limit);
+	put_smstate(u64, buf, offset + 8, seg.base);
+}
+
+static void process_smi_save_state_32(struct kvm_vcpu *vcpu, char *buf)
+{
+	struct desc_ptr dt;
+	struct kvm_segment seg;
+	unsigned long val;
+	int i;
+
+	put_smstate(u32, buf, 0x7ffc, kvm_read_cr0(vcpu));
+	put_smstate(u32, buf, 0x7ff8, kvm_read_cr3(vcpu));
+	put_smstate(u32, buf, 0x7ff4, kvm_get_rflags(vcpu));
+	put_smstate(u32, buf, 0x7ff0, kvm_rip_read(vcpu));
+
+	for (i = 0; i < 8; i++)
+		put_smstate(u32, buf, 0x7fd0 + i * 4, kvm_register_read(vcpu, i));
+
+	kvm_get_dr(vcpu, 6, &val);
+	put_smstate(u32, buf, 0x7fcc, (u32)val);
+	kvm_get_dr(vcpu, 7, &val);
+	put_smstate(u32, buf, 0x7fc8, (u32)val);
+
+	kvm_get_segment(vcpu, &seg, VCPU_SREG_TR);
+	put_smstate(u32, buf, 0x7fc4, seg.selector);
+	put_smstate(u32, buf, 0x7f64, seg.base);
+	put_smstate(u32, buf, 0x7f60, seg.limit);
+	put_smstate(u32, buf, 0x7f5c, process_smi_get_segment_flags(&seg));
+
+	kvm_get_segment(vcpu, &seg, VCPU_SREG_LDTR);
+	put_smstate(u32, buf, 0x7fc0, seg.selector);
+	put_smstate(u32, buf, 0x7f80, seg.base);
+	put_smstate(u32, buf, 0x7f7c, seg.limit);
+	put_smstate(u32, buf, 0x7f78, process_smi_get_segment_flags(&seg));
+
+	kvm_x86_ops->get_gdt(vcpu, &dt);
+	put_smstate(u32, buf, 0x7f74, dt.address);
+	put_smstate(u32, buf, 0x7f70, dt.size);
+
+	kvm_x86_ops->get_idt(vcpu, &dt);
+	put_smstate(u32, buf, 0x7f58, dt.address);
+	put_smstate(u32, buf, 0x7f54, dt.size);
+
+	for (i = 0; i < 6; i++)
+		process_smi_save_seg_32(vcpu, buf, i);
+
+	put_smstate(u32, buf, 0x7f14, kvm_read_cr4(vcpu));
+
+	/* revision id */
+	put_smstate(u32, buf, 0x7efc, 0x00020000);
+	put_smstate(u32, buf, 0x7ef8, vcpu->arch.smbase);
+}
+
+static void process_smi_save_state_64(struct kvm_vcpu *vcpu, char *buf)
+{
+#ifdef CONFIG_X86_64
+	struct desc_ptr dt;
+	struct kvm_segment seg;
+	unsigned long val;
+	int i;
+
+	for (i = 0; i < 16; i++)
+		put_smstate(u64, buf, 0x7ff8 - i * 8, kvm_register_read(vcpu, i));
+
+	put_smstate(u64, buf, 0x7f78, kvm_rip_read(vcpu));
+	put_smstate(u32, buf, 0x7f70, kvm_get_rflags(vcpu));
+
+	kvm_get_dr(vcpu, 6, &val);
+	put_smstate(u64, buf, 0x7f68, val);
+	kvm_get_dr(vcpu, 7, &val);
+	put_smstate(u64, buf, 0x7f60, val);
+
+	put_smstate(u64, buf, 0x7f58, kvm_read_cr0(vcpu));
+	put_smstate(u64, buf, 0x7f50, kvm_read_cr3(vcpu));
+	put_smstate(u64, buf, 0x7f48, kvm_read_cr4(vcpu));
+
+	put_smstate(u32, buf, 0x7f00, vcpu->arch.smbase);
+
+	/* revision id */
+	put_smstate(u32, buf, 0x7efc, 0x00020064);
+
+	put_smstate(u64, buf, 0x7ed0, vcpu->arch.efer);
+
+	kvm_get_segment(vcpu, &seg, VCPU_SREG_TR);
+	put_smstate(u16, buf, 0x7e90, seg.selector);
+	put_smstate(u16, buf, 0x7e92, process_smi_get_segment_flags(&seg));
+	put_smstate(u32, buf, 0x7e94, seg.limit);
+	put_smstate(u64, buf, 0x7e98, seg.base);
+
+	kvm_x86_ops->get_idt(vcpu, &dt);
+	put_smstate(u32, buf, 0x7e84, dt.size);
+	put_smstate(u64, buf, 0x7e88, dt.address);
+
+	kvm_get_segment(vcpu, &seg, VCPU_SREG_LDTR);
+	put_smstate(u16, buf, 0x7e70, seg.selector);
+	put_smstate(u16, buf, 0x7e72, process_smi_get_segment_flags(&seg));
+	put_smstate(u32, buf, 0x7e74, seg.limit);
+	put_smstate(u64, buf, 0x7e78, seg.base);
+
+	kvm_x86_ops->get_gdt(vcpu, &dt);
+	put_smstate(u32, buf, 0x7e64, dt.size);
+	put_smstate(u64, buf, 0x7e68, dt.address);
+
+	for (i = 0; i < 6; i++)
+		process_smi_save_seg_64(vcpu, buf, i);
+#else
+	WARN_ON_ONCE(1);
+#endif
+}
+
 static void process_smi(struct kvm_vcpu *vcpu)
 {
+	struct kvm_segment cs, ds;
+	char buf[512];
+	u32 cr0;
+	int r;
+
 	if (is_smm(vcpu))
 		return;
 
-	printk_once(KERN_DEBUG "Ignoring guest SMI\n");
+	printk("smm %llx\n", vcpu->arch.smbase);
+	vcpu->arch.hflags |= HF_SMM_MASK;
+	memset(buf, 0, 512);
+	if (guest_cpuid_has_longmode(vcpu))
+		process_smi_save_state_64(vcpu, buf);
+	else
+		process_smi_save_state_32(vcpu, buf);
+
+	r = kvm_write_guest(vcpu->kvm, vcpu->arch.smbase + 0xfe00, buf, sizeof(buf));
+	if (r < 0)
+		return;
+
+	kvm_x86_ops->set_nmi_mask(vcpu, true);
+	kvm_set_rflags(vcpu, X86_EFLAGS_FIXED);
+	kvm_rip_write(vcpu, 0x8000);
+
+	cr0 = vcpu->arch.cr0 & ~(X86_CR0_PE | X86_CR0_EM | X86_CR0_TS | X86_CR0_PG);
+	kvm_x86_ops->set_cr0(vcpu, cr0);
+	vcpu->arch.cr0 = cr0;
+
+	kvm_x86_ops->set_cr4(vcpu, 0);
+
+	__kvm_set_dr(vcpu, 7, DR7_FIXED_1);
+
+	cs.selector = (vcpu->arch.smbase >> 4) & 0xffff;
+	cs.base = vcpu->arch.smbase;
+
+	ds.selector = 0;
+	ds.base = 0;
+
+	cs.limit    = ds.limit = 0xffffffff;
+	cs.type     = ds.type = 0x3;
+	cs.dpl      = ds.dpl = 0;
+	cs.db       = ds.db = 0;
+	cs.s        = ds.s = 1;
+	cs.l        = ds.l = 0;
+	cs.g        = ds.g = 1;
+	cs.avl      = ds.avl = 0;
+	cs.present  = ds.present = 1;
+	cs.unusable = ds.unusable = 0;
+	cs.padding  = ds.padding = 0;
+
+	kvm_set_segment(vcpu, &cs, VCPU_SREG_CS);
+	kvm_set_segment(vcpu, &ds, VCPU_SREG_DS);
+	kvm_set_segment(vcpu, &ds, VCPU_SREG_ES);
+	kvm_set_segment(vcpu, &ds, VCPU_SREG_FS);
+	kvm_set_segment(vcpu, &ds, VCPU_SREG_GS);
+	kvm_set_segment(vcpu, &ds, VCPU_SREG_SS);
+
+	if (guest_cpuid_has_longmode(vcpu))
+		kvm_x86_ops->set_efer(vcpu, 0);
+
+	kvm_update_cpuid(vcpu);
+	kvm_mmu_reset_context(vcpu);
 }
 
 static void vcpu_scan_ioapic(struct kvm_vcpu *vcpu)