
[v3,22/22] kvm: x86: Disable interception for IA32_XFD on demand

Message ID 20211222124052.644626-23-jing2.liu@intel.com (mailing list archive)
State New
Series AMX Support in KVM

Commit Message

Jing Liu Dec. 22, 2021, 12:40 p.m. UTC
Always intercepting IA32_XFD causes non-negligible overhead when this
register is updated frequently in the guest.
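
For context, with per-task AMX permission the guest kernel ends up writing
IA32_XFD on context switches between tasks with different XFD values, which
is why the write can be frequent. A rough guest-side sketch (modeled on the
guest-side AMX enabling; the helper and per-cpu names here are illustrative,
not part of this patch):

	static inline void xfd_update_state(struct fpstate *fpstate)
	{
		if (fpu_state_size_dynamic()) {
			u64 xfd = fpstate->xfd;

			/* Skip the WRMSR if the per-cpu cached value matches. */
			if (__this_cpu_read(xfd_state) != xfd) {
				wrmsrl(MSR_IA32_XFD, xfd);
				__this_cpu_write(xfd_state, xfd);
			}
		}
	}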

Disable r/w emulation after intercepting the first WRMSR(IA32_XFD)
with a non-zero value.

Disabling WRMSR emulation implies that IA32_XFD can become out-of-sync
with the software state in fpstate and the per-CPU xfd cache. Call
fpu_sync_guest_vmexit_xfd_state() to bring them back in sync before
preemption is enabled.

p.s. We have confirmed that the SDM is being revised to say that
when setting IA32_XFD[18] the AMX register state is not guaranteed
to be preserved. This clarification avoids adding complexity for a creative
guest which sets IA32_XFD[18]=1 before saving active AMX state to
its own storage.

Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx/vmx.c          | 4 ++++
 arch/x86/kvm/vmx/vmx.h          | 2 +-
 arch/x86/kvm/x86.c              | 3 +++
 4 files changed, 9 insertions(+), 1 deletion(-)

Comments

Sean Christopherson Dec. 29, 2021, 1:04 a.m. UTC | #1
On Wed, Dec 22, 2021, Jing Liu wrote:
> @@ -1968,6 +1969,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	case MSR_IA32_XFD:
>  		ret = kvm_set_msr_common(vcpu, msr_info);
>  		if (!ret && data) {
> +			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_RW);
> +			vcpu->arch.xfd_out_of_sync = true;

xfd_out_of_sync is a poor name, as XFD _may_ be out of sync, or it may not.  It's
also confusing that it's kept set after XFD is explicitly synchronized in
vcpu_enter_guest().

> +
>  			vcpu->arch.trap_nm = true;
>  			vmx_update_exception_bitmap(vcpu);

Ah, this is why #NM interception was made sticky many patches ago.  More at the end.

>  		}
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index bf9d3051cd6c..0a00242a91e7 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -340,7 +340,7 @@ struct vcpu_vmx {
>  	struct lbr_desc lbr_desc;
>  
>  	/* Save desired MSR intercept (read: pass-through) state */
> -#define MAX_POSSIBLE_PASSTHROUGH_MSRS	14
> +#define MAX_POSSIBLE_PASSTHROUGH_MSRS	15
>  	struct {
>  		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
>  		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 3b756ff13103..10a08aa2aa45 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10024,6 +10024,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  	if (vcpu->arch.guest_fpu.xfd_err)
>  		wrmsrl(MSR_IA32_XFD_ERR, 0);
>  
> +	if (vcpu->arch.xfd_out_of_sync)

Rather than adding a flag that tracks whether or not the MSR can be written by
the guest, can't this be:

	if (!vmx_test_msr_bitmap_write(vcpu->loaded_vmcs->msr_bitmap))
		fpu_sync_guest_vmexit_xfd_state();

That might be marginally slower than checking a dedicated flag?  But it has the
advantage of doing the correct thing for nested guests instead of penalizing them
with an unnecessary sync on every exit.  If performance of the check is an issue,
we could add a static key to skip the code unless at least one vCPU has triggered
the XFD crud, a la kvm_has_noapic_vcpu (which may or may not provide any real
performance benefits).
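
A minimal sketch of that static-key variant (the names below are made up here,
following the kvm_has_noapic_vcpu pattern; the bitmap check mirrors the one
above):

	/* Set once, when any vCPU first disables XFD write interception. */
	DEFINE_STATIC_KEY_FALSE(kvm_has_xfd_vcpu);

	/* vcpu_enter_guest(), before preemption is enabled: */
	if (static_branch_unlikely(&kvm_has_xfd_vcpu) &&
	    !vmx_test_msr_bitmap_write(to_vmx(vcpu)->loaded_vmcs->msr_bitmap,
				       MSR_IA32_XFD))
		fpu_sync_guest_vmexit_xfd_state();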

Speaking of nested, interception of #NM in vmx_update_exception_bitmap() is wrong
with respect to nested guests.  Until XFD is supported for L2, which I didn't see
in this series, #NM should not be intercepted while L2 is running.

For the earlier patch that introduced arch.trap_nm, if it's not too gross and not
racy, the code could be:

	if (is_guest_mode(vcpu))
		eb |= get_vmcs12(vcpu)->exception_bitmap;
        else {
		...

		if (vcpu->arch.guest_fpu.fpstate.xfd)
			eb |= (1u << NM_VECTOR);
	}

Though I'm ok with a semi-temporary flag if that's gross/racy.

Then this patch can change it to:

	if (is_guest_mode(vcpu))
		eb |= get_vmcs12(vcpu)->exception_bitmap;
        else {
		...

		if (!vmx_test_msr_bitmap_write(vcpu->vmcs01.msr_bitmap))
			eb |= (1u << NM_VECTOR);
	}

> +		fpu_sync_guest_vmexit_xfd_state();
> +
>  	/*
>  	 * Consume any pending interrupts, including the possible source of
>  	 * VM-Exit on SVM and any ticks that occur between VM-Exit and now.
> -- 
> 2.27.0
>
Tian, Kevin Dec. 29, 2021, 3:35 a.m. UTC | #2
> From: Sean Christopherson <seanjc@google.com>
> Sent: Wednesday, December 29, 2021 9:05 AM
> 
> On Wed, Dec 22, 2021, Jing Liu wrote:
> > @@ -1968,6 +1969,9 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu,
> struct msr_data *msr_info)
> >  	case MSR_IA32_XFD:
> >  		ret = kvm_set_msr_common(vcpu, msr_info);
> >  		if (!ret && data) {
> > +			vmx_disable_intercept_for_msr(vcpu,
> MSR_IA32_XFD, MSR_TYPE_RW);
> > +			vcpu->arch.xfd_out_of_sync = true;
> 
> xfd_out_of_sync is a poor name, as XFD _may_ be out of sync, or it may not.
> It's
> also confusing that it's kept set after XFD is explicitly synchronized in
> vcpu_enter_guest().

yes, sync_xfd_after_exit might be more accurate.

> 
> > +
> >  			vcpu->arch.trap_nm = true;
> >  			vmx_update_exception_bitmap(vcpu);
> 
> Ah, this is why #NM interception was made sticky many patches ago.  More
> at the end.
> 
> >  		}
> > diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> > index bf9d3051cd6c..0a00242a91e7 100644
> > --- a/arch/x86/kvm/vmx/vmx.h
> > +++ b/arch/x86/kvm/vmx/vmx.h
> > @@ -340,7 +340,7 @@ struct vcpu_vmx {
> >  	struct lbr_desc lbr_desc;
> >
> >  	/* Save desired MSR intercept (read: pass-through) state */
> > -#define MAX_POSSIBLE_PASSTHROUGH_MSRS	14
> > +#define MAX_POSSIBLE_PASSTHROUGH_MSRS	15
> >  	struct {
> >  		DECLARE_BITMAP(read,
> MAX_POSSIBLE_PASSTHROUGH_MSRS);
> >  		DECLARE_BITMAP(write,
> MAX_POSSIBLE_PASSTHROUGH_MSRS);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 3b756ff13103..10a08aa2aa45 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10024,6 +10024,9 @@ static int vcpu_enter_guest(struct kvm_vcpu
> *vcpu)
> >  	if (vcpu->arch.guest_fpu.xfd_err)
> >  		wrmsrl(MSR_IA32_XFD_ERR, 0);
> >
> > +	if (vcpu->arch.xfd_out_of_sync)
> 
> Rather than adding a flag that tracks whether or not the MSR can be written
> by
> the guest, can't this be:
> 
> 	if (!vmx_test_msr_bitmap_write(vcpu->loaded_vmcs->msr_bitmap))
> 		fpu_sync_guest_vmexit_xfd_state();
> 

We can use this.

> That might be marginally slower than checking a dedicated flag?  But is has
> the
> advantage of doing the correct thing for nested guests instead of penalizing
> them
> with an unnecessary sync on every exit.  If performance of the check is an
> issue,
> we could add a static key to skip the code unless at least one vCPU has
> triggered
> the XFD crud, a la kvm_has_noapic_vcpu (which may or may not provide any
> real
> performance benefits).
> 
> Speaking of nested, interception of #NM in vmx_update_exception_bitmap()
> is wrong
> with respect to nested guests.  Until XFD is supported for L2, which I didn't
> see
> in this series, #NM should not be intercepted while L2 is running.

Can you remind me what additional thing is required to support XFD for L2?
If it's only about performance, I prefer the current conservative approach
as the first step. As explained earlier, #NM should be rare if the guest
doesn't run AMX applications at all. Adding nested into this picture doesn't
make things a lot worse.

> 
> For the earlier patch that introduced arch.trap_nm, if it's not too gross and
> not
> racy, the code could be:
> 
> 	if (is_guest_mode(vcpu))
> 		eb |= get_vmcs12(vcpu)->exception_bitmap;
>         else {
> 		...
> 
> 		if (vcpu->arch.guest_fpu.fpstate.xfd)
> 			eb |= (1u << NM_VECTOR);
> 	}
> 
> Though I'm ok with a semi-temporary flag if that's gross/racy.
> 
> Then this patch can change it to:
> 
> 	if (is_guest_mode(vcpu))
> 		eb |= get_vmcs12(vcpu)->exception_bitmap;
>         else {
> 		...
> 
> 		if (!vmx_test_msr_bitmap_write(vcpu->vmcs01.msr_bitmap))
> 			eb |= (1u << NM_VECTOR);
> 	}
> 
> > +		fpu_sync_guest_vmexit_xfd_state();
> > +
> >  	/*
> >  	 * Consume any pending interrupts, including the possible source of
> >  	 * VM-Exit on SVM and any ticks that occur between VM-Exit and
> now.
> > --
> > 2.27.0
> >

Thanks
Kevin
Tian, Kevin Dec. 29, 2021, 7:16 a.m. UTC | #3
> From: Tian, Kevin
> Sent: Wednesday, December 29, 2021 11:35 AM
> >
> > Speaking of nested, interception of #NM in
> vmx_update_exception_bitmap()
> > is wrong
> > with respect to nested guests.  Until XFD is supported for L2, which I didn't
> > see
> > in this series, #NM should not be intercepted while L2 is running.
> 
> Can you remind what additional thing is required to support XFD for L2?

ok, at least we require additional work on when to disable write interception.
It can be done only when both L0 and L1 make the same decision, just like 
what we did for many other VMCS settings...
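
e.g. a rough sketch of that merge in nested_vmx_prepare_msr_bitmap()
(illustrative only, not something this series does):

	/*
	 * Pass IA32_XFD writes through to L2 only if both L0 (vmcs01
	 * bitmap) and L1 (vmcs12 bitmap) already pass them through.
	 */
	if (!vmx_test_msr_bitmap_write(vmx->vmcs01.msr_bitmap, MSR_IA32_XFD) &&
	    !vmx_test_msr_bitmap_write(msr_bitmap_l1, MSR_IA32_XFD))
		vmx_clear_msr_bitmap_write(msr_bitmap_l0, MSR_IA32_XFD);
	else
		vmx_set_msr_bitmap_write(msr_bitmap_l0, MSR_IA32_XFD);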

> If only about performance I prefer to the current conservative approach
> as the first step. As explained earlier, #NM should be rare if the guest
> doesn't run AMX applications at all. Adding nested into this picture doesn't
> make things a lot worser.

All your comments are incorporated except this one. As said, always trapping
#NM for L2 is not a big problem, as long as it's rare and doesn't break functionality.
Usually a relatively static scheme is safer than a dynamic one in case
anything is overlooked in the initial implementation. 
Tian, Kevin Dec. 29, 2021, 7:37 a.m. UTC | #4
> From: Sean Christopherson <seanjc@google.com>
> Sent: Wednesday, December 29, 2021 9:05 AM
> > +	if (vcpu->arch.xfd_out_of_sync)
> 
> Rather than adding a flag that tracks whether or not the MSR can be written
> by
> the guest, can't this be:
> 
> 	if (!vmx_test_msr_bitmap_write(vcpu->loaded_vmcs->msr_bitmap))
> 		fpu_sync_guest_vmexit_xfd_state();
> 

And I forgot to mention a change different from the above: this check is in the
x86 common exit path, but the above is a VMX-specific check. I'm not sure
whether it's worth introducing another kvm_x86_ops callback
just for this minor usage.

Introducing an extra flag (e.g. vcpu->arch.xfd_disable_interception)
sounds simpler here.
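
e.g. a sketch of that flag-based variant (same shape as the patch, with the
field name suggested above):

	/* vmx_set_msr(), on the first non-zero write to IA32_XFD: */
	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_RW);
	vcpu->arch.xfd_disable_interception = true;

	/* vcpu_enter_guest(), before preemption is enabled: */
	if (vcpu->arch.xfd_disable_interception)
		fpu_sync_guest_vmexit_xfd_state();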

Thanks
Kevin
Sean Christopherson Dec. 29, 2021, 5:26 p.m. UTC | #5
On Wed, Dec 29, 2021, Tian, Kevin wrote:
> > From: Tian, Kevin
> > Sent: Wednesday, December 29, 2021 11:35 AM
> > >
> > > Speaking of nested, interception of #NM in
> > vmx_update_exception_bitmap()
> > > is wrong
> > > with respect to nested guests.  Until XFD is supported for L2, which I didn't
> > > see
> > > in this series, #NM should not be intercepted while L2 is running.
> > 
> > Can you remind what additional thing is required to support XFD for L2?
> 
> ok, at least we require additional work on when to disable write interception.
> It can be done only when both L0 and L1 make the same decision, just like 
> what we did for many other VMCS settings...

I'm not terribly concerned with exposing XFD to L2 right now, I'm much more
concerned with not breaking nVMX when XFD is exposed to L1.

> > If only about performance I prefer to the current conservative approach
> > as the first step. As explained earlier, #NM should be rare if the guest
> > doesn't run AMX applications at all. Adding nested into this picture doesn't
> > make things a lot worser.
> 
> All your comments incorporated except this one. As said always trapping
> #NM for L2 is not a big problem, as long as it's rare and don't break function.
> Usually a relatively static scheme is safer than a dynamic one in case of
> anything overlooked for the initial implementation. 
Tian, Kevin Dec. 30, 2021, 1:28 a.m. UTC | #6
> From: Sean Christopherson <seanjc@google.com>
> Sent: Thursday, December 30, 2021 1:26 AM
> 
> On Wed, Dec 29, 2021, Tian, Kevin wrote:
> > > From: Tian, Kevin
> > > Sent: Wednesday, December 29, 2021 11:35 AM
> > > >
> > > > Speaking of nested, interception of #NM in
> > > vmx_update_exception_bitmap()
> > > > is wrong
> > > > with respect to nested guests.  Until XFD is supported for L2, which I
> didn't
> > > > see
> > > > in this series, #NM should not be intercepted while L2 is running.
> > >
> > > Can you remind what additional thing is required to support XFD for L2?
> >
> > ok, at least we require additional work on when to disable write
> interception.
> > It can be done only when both L0 and L1 make the same decision, just like
> > what we did for many other VMCS settings...
> 
> I'm not terribly concerned with exposing XFD to L2 right now, I'm much more
> concerned with not breaking nVMX when XFD is exposed to L1.
> 
> > > If only about performance I prefer to the current conservative approach
> > > as the first step. As explained earlier, #NM should be rare if the guest
> > > doesn't run AMX applications at all. Adding nested into this picture
> doesn't
> > > make things a lot worser.
> >
> > All your comments incorporated except this one. As said always trapping
> > #NM for L2 is not a big problem, as long as it's rare and don't break
> function.
> > Usually a relatively static scheme is safer than a dynamic one in case of
> > anything overlooked for the initial implementation. 
Tian, Kevin Dec. 30, 2021, 7:04 a.m. UTC | #7
> From: Tian, Kevin
> Sent: Thursday, December 30, 2021 9:28 AM
> 
> > As posted, there is zero chance that the patches correctly handling #NM in
> L2
> > because nested_vmx_l0_wants_exit() doesn't prevent an #NM from being
> > forwarded
> > to L1.  E.g. if L0 (KVM) and L1 both intercept #NM, handle_exception_nm()
> > will
> > queue a #NM for injection and then syntehsizea nested VM-Exit, i.e. the
> #NM
> > from
> > L2 will get injected into L1 after the nested exit.
> 
> Yes, it's incorrect behavior.
> 
> >
> > That also means handle_exception_nmi_irqoff() => handle_exception_nm()
> > is fatally
> > broken on non-XFD hardware, as it will attempt
> RDMSR(MSR_IA32_XFD_ERR)
> > if L1
> > intercepts #NM since handle_exception_nmi_irqoff() will run before
> > __vmx_handle_exit() => nested_vmx_reflect_vmexit() checks whether L0 or
> > L1 should
> > handle the exit.
> 
> Ditto. Thanks for pointing out those facts that we obviously overlooked.
> 
> So handle_exception_nm() should neither blindly read xfd_err (#NM might
> be
> caused by L1 interception on a non-xfd platform) nor blindly queue a #NM
> exception (triggered in L2) to L1 which intercepts #NM (then expects vm-exit)
> 
> The former suggests that reading xfd_err should be made conditionally
> similar to what we did in vmx_update_exception_bitmap(). The latter
> suggests the actual exception is better postponed until __vmx_handle_exit().
> 
> We are looking at this change now.
> 
> And once #NM handler works correctly to handle interception by either L0
> or L1, I'm not sure whether we still want to special case L1 vs. L2 in
> vmx_update_exception_bitmap(), since in the end L0 wants to intercept
> #NM to save xfd_err for both L1 and L2 (if exposed with xfd capability by L1).
> 

The new change is like below:

static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
{
	/*
	 * Save xfd_err to guest_fpu before interrupts are enabled, so the
	 * guest value is not clobbered by host activity before the guest
	 * has a chance to consume it.
	 *
	 * Since trapping #NM starts only when xfd write interception is
	 * disabled, use this flag to guard the saving operation. This
	 * implies a no-op for a non-xfd #NM due to L1 interception.
	 *
	 * Queuing the exception is done in vmx_handle_exit.
	 */
	if (vcpu->arch.xfd_no_write_intercept)
		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
}

In the final series it will first check vcpu->arch.guest_fpu.fpstate->xfd
before the disable-interception patch is applied, and then become
the above form, similar to your suggestion on vmx_update_exception_bitmap().

Whether to check the msr_bitmap vs. an extra flag is an orthogonal open question.

Then:

handle_exception_nmi(struct kvm_vcpu *vcpu)
{
	...
	if (is_machine_check(intr_info) || is_nmi(intr_info))
		return 1; /* handled by handle_exception_nmi_irqoff() */

	/*
	 * Queue the exception here instead of in handle_nm_fault_irqoff().
	 * This ensures the nested_vmx check is not skipped so vmexit can
	 * be reflected to L1 (when it intercepts #NM) before reaching this
	 * point.
	 */
	if (is_nm_fault(intr_info)) {
		kvm_queue_exception(vcpu, NM_VECTOR);
		return 1;
	}

	...
}

Then, regarding testing the non-AMX nested #NM usage, it might be difficult
to trigger it from a modern OS. As commented in the Linux #NM handler, it's
expected only for XFD or for math emulation when the FPU is missing. So we plan
to run a selftest in L1 which sets CR0.TS and then touches FPU registers, and
for the L1 kernel we will run two binaries, one trapping #NM and the other
not.

Please let us know if you have a better suggestion for the testing here.

Thanks
Kevin
Tian, Kevin Dec. 31, 2021, 9:42 a.m. UTC | #8
> From: Tian, Kevin
> Sent: Thursday, December 30, 2021 3:05 PM
> 
> the new change is like below.
> 
> static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
>  {
> 	/*
> 	 * Save xfd_err to guest_fpu before interrupt is enabled, so the
> 	 * guest value is not clobbered by the host activity before the guest
> 	 * has chance to consume it.
> 	 *
> 	 * Since trapping #NM is started when xfd write interception is
> 	 * disabled, using this flag to guard the saving operation. This
> 	 * implies no-op for a non-xfd #NM due to L1 interception.
> 	 *
> 	 * Queuing exception is done in vmx_handle_exit.
> 	 */
> 	if (vcpu->arch.xfd_no_write_intercept)
> 		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> }
> 
> in the final series it will first check vcpu->arch.guest_fpu.fpstate->xfd
> before the disable interception patch is applied and then becomes
> the above form, similar to your suggestion on
> vmx_update_exception_bitmap().
> 
> whether to check msr_bitmap vs. an extra flag is an orthogonal open.
> 
> Then:
> 
> handle_exception_nmi(struct kvm_vcpu *vcpu)
> {
> 	...
> 	if (is_machine_check(intr_info) || is_nmi(intr_info))
> 		return 1; /* handled by handle_exception_nmi_irqoff() */
> 
> 	/*
> 	 * Queue the exception here instead of in handle_nm_fault_irqoff().
> 	 * This ensures the nested_vmx check is not skipped so vmexit can
> 	 * be reflected to L1 (when it intercepts #NM) before reaching this
> 	 * point.
> 	 */
> 	if (is_nm_fault(intr_info)) {
> 		kvm_queue_exception(vcpu, NM_VECTOR);
> 		return 1;
> 	}
> 
> 	...
> }
> 
> Then regarding to test non-AMX nested #NM usage, it might be difficult
> to trigger it from modern OS. As commented by Linux #NM handler, it's
> expected only for XFD or math emulation when fpu is missing. So we plan
> to run a selftest in L1 which sets CR0.TS and then touch fpu registers. and
> for L1 kernel we will run two binaries with one trapping #NM and the other
> not.
> 

We have verified this scenario and didn't find any problems.

Basically the selftest is like below:

	guest_code()
	{
		cr0 = read_cr0();
		cr0 |= X86_CR0_TS;
		write_cr0(cr0);

		asm volatile("fnop");
	}

	guest_nm_handler()
	{
		cr0 = read_cr0();
		cr0 &= ~X86_CR0_TS;
		write_cr0(cr0);
	}
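
The host-side scaffolding is just the usual guest-IDT plumbing of the
selftest framework (rough sketch; helper names/signatures are assumed from
the framework of this timeframe, guest_nm_handler takes a struct ex_regs *
argument there, and completion signaling via ucall is omitted):

	#include "test_util.h"
	#include "kvm_util.h"
	#include "processor.h"

	#define VCPU_ID		0
	#define NM_VECTOR	7

	int main(void)
	{
		struct kvm_vm *vm;

		vm = vm_create_default(VCPU_ID, 0, guest_code);

		/* Route #NM to the guest-side handler via the guest IDT. */
		vm_init_descriptor_tables(vm);
		vcpu_init_descriptor_tables(vm, VCPU_ID);
		vm_install_exception_handler(vm, NM_VECTOR, guest_nm_handler);

		vcpu_run(vm, VCPU_ID);

		kvm_vm_free(vm);
		return 0;
	}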

We run the selftest in L1 to create a nested scenario.

When L1 intercepts #NM:

	(L2) fnop
	(L0) #NM vmexit
	(L0) reflect a virtual vmexit (reason #NM) to L1
	(L1) #NM vmexit
	(L1) queue #NM exception to L2
	(L2) guest_nm_handler()
	(L2) fnop (succeed)

When L1 doesn't intercept #NM:
	(L2) fnop
	(L0) #NM vmexit
	(L0) queue #NM exception to L2
	(L2) guest_nm_handler()
	(L2) fnop (succeed)

Please suggest if any more tests are necessary here.

Thanks
Kevin
Paolo Bonzini Jan. 4, 2022, 6:32 p.m. UTC | #9
On 12/29/21 02:04, Sean Christopherson wrote:
> 
> Speaking of nested, interception of #NM in vmx_update_exception_bitmap() is wrong
> with respect to nested guests.  Until XFD is supported for L2, which I didn't see
> in this series, #NM should not be intercepted while L2 is running.

Why wouldn't L2 support XFD, since there are no new VMCS bits?  As long 
as L0 knows what to do with XFD and XFD_ERR, it will do the right thing 
no matter if L1 or L2 is running.

Paolo
Sean Christopherson Jan. 4, 2022, 6:58 p.m. UTC | #10
On Tue, Jan 04, 2022, Paolo Bonzini wrote:
> On 12/29/21 02:04, Sean Christopherson wrote:
> > 
> > Speaking of nested, interception of #NM in vmx_update_exception_bitmap() is wrong
> > with respect to nested guests.  Until XFD is supported for L2, which I didn't see
> > in this series, #NM should not be intercepted while L2 is running.
> 
> Why wouldn't L2 support XFD, since there are no new VMCS bits?  As long as
> L0 knows what to do with XFD and XFD_ERR, it will do the right thing no
> matter if L1 or L2 is running.

I incorrectly assumed there was something L0 needed to do in order to support
nested XFD.

Patch

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f7a661f35d1a..7b81d324b32a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -641,6 +641,7 @@  struct kvm_vcpu_arch {
 	bool tpr_access_reporting;
 	bool xsaves_enabled;
 	bool trap_nm;
+	bool xfd_out_of_sync;
 	u64 ia32_xss;
 	u64 microcode_version;
 	u64 arch_capabilities;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8595a3e8fbd8..7aaf905e13b4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -162,6 +162,7 @@  static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
 	MSR_FS_BASE,
 	MSR_GS_BASE,
 	MSR_KERNEL_GS_BASE,
+	MSR_IA32_XFD,
 	MSR_IA32_XFD_ERR,
 #endif
 	MSR_IA32_SYSENTER_CS,
@@ -1968,6 +1969,9 @@  static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_IA32_XFD:
 		ret = kvm_set_msr_common(vcpu, msr_info);
 		if (!ret && data) {
+			vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_RW);
+			vcpu->arch.xfd_out_of_sync = true;
+
 			vcpu->arch.trap_nm = true;
 			vmx_update_exception_bitmap(vcpu);
 		}
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index bf9d3051cd6c..0a00242a91e7 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -340,7 +340,7 @@  struct vcpu_vmx {
 	struct lbr_desc lbr_desc;
 
 	/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS	14
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS	15
 	struct {
 		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3b756ff13103..10a08aa2aa45 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10024,6 +10024,9 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.guest_fpu.xfd_err)
 		wrmsrl(MSR_IA32_XFD_ERR, 0);
 
+	if (vcpu->arch.xfd_out_of_sync)
+		fpu_sync_guest_vmexit_xfd_state();
+
 	/*
 	 * Consume any pending interrupts, including the possible source of
 	 * VM-Exit on SVM and any ticks that occur between VM-Exit and now.