
[15/19] kvm: x86: Save and restore guest XFD_ERR properly

Message ID 20211208000359.2853257-16-yang.zhong@intel.com (mailing list archive)
State New, archived
Series AMX Support in KVM

Commit Message

Yang Zhong Dec. 8, 2021, 12:03 a.m. UTC
KVM needs to save the guest XFD_ERR value before this register
might be accessed by the host and restore it before entering the
guest.

This implementation saves guest XFD_ERR at two transition points:

  - When the vCPU thread exits to the userspace VMM;
  - When the vCPU thread is preempted;

XFD_ERR is cleared to zero right after saving the previous guest
value. Otherwise a stale guest value could cause the host #NM
handler to misinterpret a non-XFD-related #NM as XFD-related.

There is no need to save the host XFD_ERR value because the only
place where XFD_ERR is consumed outside of KVM is the #NM handler
(which cannot be preempted by a vCPU thread). XFD_ERR should
always be observed as zero outside of the #NM handler, so clearing
XFD_ERR meets the host's expectation here.

The saved guest value is restored to XFD_ERR right before entering
the guest (with preemption disabled).
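
For reference, a minimal sketch of what the save/restore helpers used
below are expected to do (they are introduced earlier in the series;
the sentinel-based gating via XFD_ERR_GUEST_DISABLED is an assumption
derived from the initialization shown in the diff, not the actual
implementation):

/* Sketch only -- not the real helpers from the earlier patch. */
static inline void fpu_save_guest_xfd_err(struct fpu_guest *guest_fpu)
{
	/* XFD_ERR_GUEST_DISABLED is assumed to mean "not armed". */
	if (guest_fpu->xfd_err == XFD_ERR_GUEST_DISABLED)
		return;

	/* Save the guest value, then clear the MSR so the host #NM
	 * handler never observes a stale guest XFD_ERR. */
	rdmsrl(MSR_IA32_XFD_ERR, guest_fpu->xfd_err);
	if (guest_fpu->xfd_err)
		wrmsrl(MSR_IA32_XFD_ERR, 0);
}

static inline void fpu_restore_guest_xfd_err(struct fpu_guest *guest_fpu)
{
	/* Called with preemption disabled, right before VM-entry. */
	if (guest_fpu->xfd_err == XFD_ERR_GUEST_DISABLED)
		return;

	if (guest_fpu->xfd_err)
		wrmsrl(MSR_IA32_XFD_ERR, guest_fpu->xfd_err);
}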

The current implementation still has two open questions on which we
would like to hear suggestions:

  1) Will #NM be triggered in the host kernel?

  The code is currently written assuming the above is true, and that
  is the only reason for saving guest XFD_ERR at preemption time.
  Otherwise the save would only be required when the CPU enters ring-3
  (either from the vCPU thread itself or from other threads), by
  leveraging the "user-return notifier" machinery as suggested by Paolo.

  2) When to enable XFD_ERR save/restore?

  There are four options on the table:

    a) As long as the guest CPUID has XFD enabled

       XFD_ERR save/restore is done on every VM-exit (if preemption
       or a return to userspace happens).

    b) When the guest sets IA32_XFD to 1 for the first time

       This indicates that the guest OS supports XFD features. Because
       a guest OS usually initializes IA32_XFD at boot time, XFD_ERR
       save/restore is done for almost every VM-exit (if preemption
       or a return to userspace happens).

       No save/restore is done for a legacy guest OS which doesn't
       support XFD features at all (and thus never touches IA32_XFD).

    c) When the guest sets IA32_XFD to 0 for the first time

       Lazily enable XFD_ERR save/restore once XFD features are used
       inside the guest. However, this option doesn't work because
       XFD_ERR is set when #NM is raised. A VM-exit could happen
       between the CPU raising #NM and the guest #NM handler reading
       XFD_ERR (before setting XFD to 0). The very first XFD_ERR value
       might already be clobbered by the host because there is no
       save/restore in that small window.

    d) When the first guest #NM with a non-zero XFD_ERR occurs

       Lazily enable XFD_ERR save/restore once XFD features are used
       inside the guest. This requires intercepting guest #NM until a
       non-zero XFD_ERR occurs. If a guest with XFD in its CPUID never
       launches an AMX application, #NM stays trapped forever, adding
       a constant overhead which may even be higher than doing the
       RDMSR in the preemption path in a) and b):

         #preempts < #VMEXITS (no #NM trap) < #VMEXITS (#NM trap)

       The number of preemptions and returns to userspace should be a
       small portion of the total #VMEXITs in a healthy virtualization
       environment. Our gut feeling is that adding at most one MSR
       read and one MSR write to the preempt/user-return paths is
       likely more efficient than increasing #VMEXITs by trapping #NM.

Based on the above analysis we plan to go with option b), although this
version currently implements a). But we would like to hear other
suggestions before making this change.
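
To make option b) concrete, a rough sketch of how the enabling could
hang off the first guest IA32_XFD write (not part of this version; the
surrounding MSR-emulation code is elided and assumed):

	case MSR_IA32_XFD:
		/*
		 * Option b) sketch: arm XFD_ERR save/restore the first
		 * time the guest programs a non-zero IA32_XFD value,
		 * instead of doing it for every XFD-capable guest.
		 */
		if (data && vcpu->arch.guest_fpu.xfd_err == XFD_ERR_GUEST_DISABLED)
			vcpu->arch.guest_fpu.xfd_err = 0;
		/* ... existing IA32_XFD write emulation from this series ... */
		break;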

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kernel/fpu/core.c | 2 ++
 arch/x86/kvm/cpuid.c       | 5 +++++
 arch/x86/kvm/vmx/vmx.c     | 2 ++
 arch/x86/kvm/vmx/vmx.h     | 2 +-
 arch/x86/kvm/x86.c         | 5 +++++
 5 files changed, 15 insertions(+), 1 deletion(-)

Comments

Paolo Bonzini Dec. 10, 2021, 4:23 p.m. UTC | #1
On 12/8/21 01:03, Yang Zhong wrote:
>   		kvm_steal_time_set_preempted(vcpu);
>   	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>   
> +	if (vcpu->preempted)
> +		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
> +

Instead of checking vcpu->preempted, can you instead check if the active 
FPU is the guest FPU?  That is, save if 
current->thread.fpu->fpstate->is_guest?

Paolo
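
A sketch of that alternative check in kvm_arch_vcpu_put() (spelling the
embedded struct as current->thread.fpu.fpstate->is_guest is an
assumption about the exact field layout):

	/* Save only when the guest fpstate is the one currently loaded,
	 * instead of keying off vcpu->preempted. */
	if (current->thread.fpu.fpstate->is_guest)
		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);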
Paolo Bonzini Dec. 10, 2021, 10:01 p.m. UTC | #2
On 12/8/21 01:03, Yang Zhong wrote:
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>   		kvm_apic_set_version(vcpu);
>   	}
>   
> +	/* Enable saving guest XFD_ERR */
> +	best = kvm_find_cpuid_entry(vcpu, 7, 0);
> +	if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
> +		vcpu->arch.guest_fpu.xfd_err = 0;
> +

This is incorrect.  Instead it should check whether leaf 0xD includes 
any dynamic features.

Paolo
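
A sketch of the check Paolo is suggesting (using
XFEATURE_MASK_USER_DYNAMIC as the mask of dynamically-enabled features
is an assumption about the exact spelling):

	/* Enable saving guest XFD_ERR only if the guest can use any
	 * dynamically-enabled (XFD-controlled) state component. */
	best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
	if (best && (best->eax & XFEATURE_MASK_USER_DYNAMIC))
		vcpu->arch.guest_fpu.xfd_err = 0;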
Thomas Gleixner Dec. 11, 2021, 12:10 a.m. UTC | #3
On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 5089f2e7dc22..9811dc98d550 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
>  	fpstate->is_guest	= true;
>  
>  	gfpu->fpstate		= fpstate;
> +	gfpu->xfd_err           = XFD_ERR_GUEST_DISABLED;

This wants to be part of the previous patch, which introduces the field.

>  	gfpu->user_xfeatures	= fpu_user_cfg.default_features;
>  	gfpu->user_perm		= fpu_user_cfg.default_features;
>  	fpu_init_guest_permissions(gfpu);
> @@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
>  		fpu->fpstate = guest_fps;
>  		guest_fps->in_use = true;
>  	} else {
> +		fpu_save_guest_xfd_err(guest_fpu);

Hmm. See below.

>  		guest_fps->in_use = false;
>  		fpu->fpstate = fpu->__task_fpstate;
>  		fpu->__task_fpstate = NULL;
> @@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  		kvm_steal_time_set_preempted(vcpu);
>  	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>  
> +	if (vcpu->preempted)
> +		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);

I'm not really excited about the thought of an exception cause register
in guest-clobbered state.

Aside of that I really have to ask the question why all this is needed?

#NM in the guest is slow path, right? So why are you trying to optimize
for it?

The straightforward solution to this is:

    1) Trap #NM and MSR_XFD_ERR write

    2) When the guest triggers #NM it takes a VMEXIT and the host
       does:

                rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);

       injects the #NM and goes on.

    3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
       the host does:

           vcpu->arch.guest_fpu.xfd_err = msrval;
           wrmsrl(MSR_XFD_ERR, msrval);

      and goes back.

    4) Before entering the preemption disabled section of the VCPU loop
       do:
       
           if (vcpu->arch.guest_fpu.xfd_err)
                      wrmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);

    5) Before leaving the preemption disabled section of the VCPU loop
       do:
       
           if (vcpu->arch.guest_fpu.xfd_err)
                      wrmsrl(MSR_XFD_ERR, 0);

It's really that simple and pretty much 0 overhead for the regular case.

If the guest triggers #NM with a high frequency then taking the VMEXITs
is the least of the problems. That's not a realistic use case, really.

Hmm?

Thanks,

        tglx
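
A hedged sketch of how steps 1)-5) above could look wired into KVM
(handler name and exact hook points are assumptions; the follow-ups
below refine where the MSR accesses belong):

/* 2) On the #NM VM-exit: snapshot the guest fault cause and reinject. */
static void handle_nm_fault_sketch(struct kvm_vcpu *vcpu)
{
	rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
	kvm_queue_exception(vcpu, NM_VECTOR);
}

/* 4)/5) In vcpu_enter_guest(), inside the preemption-disabled section: */
	if (vcpu->arch.guest_fpu.xfd_err)
		wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);

	/* ... VM-entry / VM-exit ... */

	if (vcpu->arch.guest_fpu.xfd_err)
		wrmsrl(MSR_IA32_XFD_ERR, 0);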
Paolo Bonzini Dec. 11, 2021, 1:31 a.m. UTC | #4
On 12/11/21 01:10, Thomas Gleixner wrote:
>      2) When the guest triggers #NM it takes a VMEXIT and the host
>         does:
> 
>                  rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> 
>         injects the #NM and goes on.
> 
>      3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
>         the host does:
> 
>             vcpu->arch.guest_fpu.xfd_err = msrval;
>             wrmsrl(MSR_XFD_ERR, msrval);

No wrmsrl here I think, the host value is 0 and should stay so.  Instead 
the wrmsrl will happen the next time the VCPU loop is entered.

Paolo
Tian, Kevin Dec. 11, 2021, 3:07 a.m. UTC | #5
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Saturday, December 11, 2021 8:11 AM
> 
> On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > index 5089f2e7dc22..9811dc98d550 100644
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest
> *gfpu)
> >  	fpstate->is_guest	= true;
> >
> >  	gfpu->fpstate		= fpstate;
> > +	gfpu->xfd_err           = XFD_ERR_GUEST_DISABLED;
> 
> This wants to be part of the previous patch, which introduces the field.
> 
> >  	gfpu->user_xfeatures	= fpu_user_cfg.default_features;
> >  	gfpu->user_perm		= fpu_user_cfg.default_features;
> >  	fpu_init_guest_permissions(gfpu);
> > @@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest
> *guest_fpu, bool enter_guest)
> >  		fpu->fpstate = guest_fps;
> >  		guest_fps->in_use = true;
> >  	} else {
> > +		fpu_save_guest_xfd_err(guest_fpu);
> 
> Hmm. See below.
> 
> >  		guest_fps->in_use = false;
> >  		fpu->fpstate = fpu->__task_fpstate;
> >  		fpu->__task_fpstate = NULL;
> > @@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >  		kvm_steal_time_set_preempted(vcpu);
> >  	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >
> > +	if (vcpu->preempted)
> > +		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
> 
> I'm not really excited about the thought of an exception cause register
> in guest-clobbered state.
> 
> Aside of that I really have to ask the question why all this is needed?
> 
> #NM in the guest is slow path, right? So why are you trying to optimize
> for it?

This is really good information. The current logic is obviously
based on the assumption that #NM is frequently triggered.

> 
> The straightforward solution to this is:
> 
>     1) Trap #NM and MSR_XFD_ERR write

and the #NM vmexit handler should be called in kvm_x86_handle_exit_irqoff()
before preemption is enabled; otherwise there is still a small window
where MSR_XFD_ERR might be clobbered after preemption is enabled and
before the #NM handler is actually called.

> 
>     2) When the guest triggers #NM it takes a VMEXIT and the host
>        does:
> 
>                 rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> 
>        injects the #NM and goes on.
> 
>     3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
>        the host does:
> 
>            vcpu->arch.guest_fpu.xfd_err = msrval;
>            wrmsrl(MSR_XFD_ERR, msrval);
> 
>       and goes back.
> 
>     4) Before entering the preemption disabled section of the VCPU loop
>        do:
> 
>            if (vcpu->arch.guest_fpu.xfd_err)
>                       wrmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> 
>     5) Before leaving the preemption disabled section of the VCPU loop
>        do:
> 
>            if (vcpu->arch.guest_fpu.xfd_err)
>                       wrmsrl(MSR_XFD_ERR, 0);
> 
> It's really that simple and pretty much 0 overhead for the regular case.

Much cleaner.

> 
> If the guest triggers #NM with a high frequency then taking the VMEXITs
> is the least of the problems. That's not a realistic use case, really.
> 
> Hmm?
> 
> Thanks,
> 
>         tglx

Thanks
Kevin
Tian, Kevin Dec. 11, 2021, 3:23 a.m. UTC | #6
> From: Paolo Bonzini
> Sent: Saturday, December 11, 2021 9:32 AM
> 
> On 12/11/21 01:10, Thomas Gleixner wrote:
> >      2) When the guest triggers #NM it takes a VMEXIT and the host
> >         does:
> >
> >                  rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> >
> >         injects the #NM and goes on.
> >
> >      3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
> >         the host does:
> >
> >             vcpu->arch.guest_fpu.xfd_err = msrval;
> >             wrmsrl(MSR_XFD_ERR, msrval);
> 
> No wrmsrl here I think, the host value is 0 and should stay so.  Instead
> the wrmsrl will happen the next time the VCPU loop is entered.
> 

To elaborate, I guess the reason is that MSR_XFD_ERR should always
contain the host value 0 once preemption is enabled, while WRMSR
emulation is called with preemption enabled. Then we just need to wait
for the next time the vcpu loop is entered to restore the guest value
after preemption is disabled.
Thomas Gleixner Dec. 11, 2021, 1:10 p.m. UTC | #7
On Sat, Dec 11 2021 at 02:31, Paolo Bonzini wrote:
> On 12/11/21 01:10, Thomas Gleixner wrote:
>>      2) When the guest triggers #NM it takes a VMEXIT and the host
>>         does:
>> 
>>                  rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>> 
>>         injects the #NM and goes on.
>> 
>>      3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
>>         the host does:
>> 
>>             vcpu->arch.guest_fpu.xfd_err = msrval;
>>             wrmsrl(MSR_XFD_ERR, msrval);
>
> No wrmsrl here I think, the host value is 0 and should stay so.  Instead 
> the wrmsrl will happen the next time the VCPU loop is entered.

I assumed this can be handled in the fast path, but either way.
Thomas Gleixner Dec. 11, 2021, 1:29 p.m. UTC | #8
Kevin,

On Sat, Dec 11 2021 at 03:07, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> #NM in the guest is slow path, right? So why are you trying to optimize
>> for it?
>
> This is really good information. The current logic is obviously
> based on the assumption that #NM is frequently triggered.

More context.

When an application wants to use AMX, it invokes the prctl() which
grants permission. Even if permission is granted, the kernel FPU
state buffers are still default size and XFD is armed.

When a thread of that process issues the first AMX (tile) instruction,
then #NM is raised.

The #NM handler does:

    1) Read MSR_XFD_ERR. If 0, goto regular #NM

    2) Write MSR_XFD_ERR to 0

    3) Check whether the process has permission granted. If not,
       raise SIGILL and return.

    4) Allocate and install a larger FPU state buffer for the task.
       If allocation fails, raise SIGSEGV and return.

    5) Disarm XFD for that task

That means one thread takes at max. one AMX/XFD related #NM during its
lifetime, which means two VMEXITs.

If there are other XFD controlled facilities in the future, then it will
be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which uses
them. Not the end of the world either.

Looking at the targeted application space it's pretty unlikely that
tasks which utilize AMX are going to be so short lived that the overhead
of these VMEXITs really matters.

This of course can be revisited when there is a sane use case, but
optimizing for it prematurely does not buy us anything else than
pointless complexity.
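
For concreteness, a minimal userspace sketch of the permission flow
described above (the numeric fallback for ARCH_REQ_XCOMP_PERM is taken
from the documented arch_prctl interface; treat it as an assumption if
your headers differ):

#include <sys/syscall.h>
#include <unistd.h>

#ifndef ARCH_REQ_XCOMP_PERM
#define ARCH_REQ_XCOMP_PERM	0x1023
#endif
#define XFEATURE_XTILEDATA	18	/* AMX tile data state component */

int main(void)
{
	/* Grant this process AMX permission. XFD stays armed and the
	 * FPU buffers stay default-sized until the first tile
	 * instruction raises #NM. */
	if (syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA))
		return 1;

	/* The first AMX instruction executed after this point goes
	 * through the #NM path described above. */
	return 0;
}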

>> The straightforward solution to this is:
>> 
>>     1) Trap #NM and MSR_XFD_ERR write
>
> and the #NM vmexit handler should be called in kvm_x86_handle_exit_irqoff()
> before preemption is enabled; otherwise there is still a small window
> where MSR_XFD_ERR might be clobbered after preemption is enabled and
> before the #NM handler is actually called.

Yes.

Thanks,

        tglx
Tian, Kevin Dec. 12, 2021, 1:50 a.m. UTC | #9
> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Saturday, December 11, 2021 9:29 PM
> 
> Kevin,
> 
> On Sat, Dec 11 2021 at 03:07, Kevin Tian wrote:
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >> #NM in the guest is slow path, right? So why are you trying to optimize
> >> for it?
> >
> > This is really good information. The current logic is obviously
> > based on the assumption that #NM is frequently triggered.
> 
> More context.
> 
> When an application wants to use AMX, it invokes the prctl() which
> grants permission. Even if permission is granted, the kernel FPU
> state buffers are still default size and XFD is armed.
> 
> When a thread of that process issues the first AMX (tile) instruction,
> then #NM is raised.
> 
> The #NM handler does:
> 
>     1) Read MSR_XFD_ERR. If 0, goto regular #NM
> 
>     2) Write MSR_XFD_ERR to 0
> 
>     3) Check whether the process has permission granted. If not,
>        raise SIGILL and return.
> 
>     4) Allocate and install a larger FPU state buffer for the task.
>        If allocation fails, raise SIGSEGV and return.
> 
>     5) Disarm XFD for that task
> 
> That means one thread takes at max. one AMX/XFD related #NM during its
> lifetime, which means two VMEXITs.
> 
> If there are other XFD controlled facilities in the future, then it will
> be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which
> uses
> them. Not the end of the world either.
> 
> Looking at the targeted application space it's pretty unlikely that
> tasks which utilize AMX are going to be so short lived that the overhead
> of these VMEXITs really matters.
> 
> This of course can be revisited when there is a sane use case, but
> optimizing for it prematurely does not buy us anything else than
> pointless complexity.

I get all above.

I guess the original open question is also about the frequency of #NM
not due to XFD. For a Linux guest it looks like this is not a problem,
since CR0.TS is no longer set when math emulation is not required:

DEFINE_IDTENTRY(exc_device_not_available)
{
	...
	/* This should not happen. */
	if (WARN(cr0 & X86_CR0_TS, "CR0.TS was set")) {
		/* Try to fix it up and carry on. */
		write_cr0(cr0 & ~X86_CR0_TS);
	} else {
		/*
		 * Something terrible happened, and we're better off trying
		 * to kill the task than getting stuck in a never-ending
		 * loop of #NM faults.
		 */
		die("unexpected #NM exception", regs, 0);
	}
}

It may affect a guest which still uses CR0.TS to do lazy save. But
modern OSes have likely all moved to the eager save approach, so always
trapping #NM should be fine.

Is this understanding correct?

Thanks
Kevin
Paolo Bonzini Dec. 12, 2021, 9:10 a.m. UTC | #10
On 12/12/21 02:50, Tian, Kevin wrote:
>>
>> If there are other XFD controlled facilities in the future, then it will
>> be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which
>> uses
>> them. Not the end of the world either.
>>
>> Looking at the targeted application space it's pretty unlikely that
>> tasks which utilize AMX are going to be so short lived that the overhead
>> of these VMEXITs really matters.
>>
>> This of course can be revisited when there is a sane use case, but
>> optimizing for it prematurely does not buy us anything else than
>> pointless complexity.
> It may affect a guest which still uses CR0.TS to do lazy save. But
> modern OSes have likely all moved to the eager save approach, so always
> trapping #NM should be fine.

You also don't need to trap #NM if CPUID includes no dynamic bits, 
because then XFD will never be nonzero.

Paolo
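
A sketch of how that could look when building the exception bitmap
(the variable eb and the gating condition are assumptions, not the
series' actual code):

	/* Only intercept #NM when the guest CPUID exposes dynamically-
	 * enabled (XFD-controlled) features; otherwise guest XFD stays
	 * zero and #NM never carries XFD_ERR state. */
	if (vcpu->arch.guest_supported_xcr0 & XFEATURE_MASK_USER_DYNAMIC)
		eb |= (1u << NM_VECTOR);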
Yang Zhong Dec. 12, 2021, 1:10 p.m. UTC | #11
On Fri, Dec 10, 2021 at 11:01:15PM +0100, Paolo Bonzini wrote:
> On 12/8/21 01:03, Yang Zhong wrote:
> >--- a/arch/x86/kvm/cpuid.c
> >+++ b/arch/x86/kvm/cpuid.c
> >@@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> >  		kvm_apic_set_version(vcpu);
> >  	}
> >+	/* Enable saving guest XFD_ERR */
> >+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
> >+	if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
> >+		vcpu->arch.guest_fpu.xfd_err = 0;
> >+
> 
> This is incorrect.  Instead it should check whether leaf 0xD
> includes any dynamic features.
> 

  Thanks Paolo. So ditto for "[PATCH 04/19] kvm: x86: Check guest xstate permissions when KVM_SET_CPUID2".

  Yang

> Paolo

Patch

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 5089f2e7dc22..9811dc98d550 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -238,6 +238,7 @@  bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
 	fpstate->is_guest	= true;
 
 	gfpu->fpstate		= fpstate;
+	gfpu->xfd_err           = XFD_ERR_GUEST_DISABLED;
 	gfpu->user_xfeatures	= fpu_user_cfg.default_features;
 	gfpu->user_perm		= fpu_user_cfg.default_features;
 	fpu_init_guest_permissions(gfpu);
@@ -297,6 +298,7 @@  int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 		fpu->fpstate = guest_fps;
 		guest_fps->in_use = true;
 	} else {
+		fpu_save_guest_xfd_err(guest_fpu);
 		guest_fps->in_use = false;
 		fpu->fpstate = fpu->__task_fpstate;
 		fpu->__task_fpstate = NULL;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f3c61205bbf4..ea51b986ee67 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -219,6 +219,11 @@  static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		kvm_apic_set_version(vcpu);
 	}
 
+	/* Enable saving guest XFD_ERR */
+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
+	if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
+		vcpu->arch.guest_fpu.xfd_err = 0;
+
 	best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
 	if (!best)
 		vcpu->arch.guest_supported_xcr0 = 0;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6198b13c4846..0db8bdf273e2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -161,6 +161,7 @@  static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
 	MSR_GS_BASE,
 	MSR_KERNEL_GS_BASE,
 	MSR_IA32_XFD,
+	MSR_IA32_XFD_ERR,
 #endif
 	MSR_IA32_SYSENTER_CS,
 	MSR_IA32_SYSENTER_ESP,
@@ -7153,6 +7154,7 @@  static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 static void vmx_update_intercept_xfd(struct kvm_vcpu *vcpu)
 {
 	vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_R, false);
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD_ERR, MSR_TYPE_RW, false);
 }
 
 static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index bf9d3051cd6c..0a00242a91e7 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -340,7 +340,7 @@  struct vcpu_vmx {
 	struct lbr_desc lbr_desc;
 
 	/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS	14
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS	15
 	struct {
 		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d127b229dd29..8b033c9241d6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4550,6 +4550,9 @@  void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 		kvm_steal_time_set_preempted(vcpu);
 	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 
+	if (vcpu->preempted)
+		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
+
 	static_call(kvm_x86_vcpu_put)(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();
 }
@@ -9951,6 +9954,8 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
 		switch_fpu_return();
 
+	fpu_restore_guest_xfd_err(&vcpu->arch.guest_fpu);
+
 	if (unlikely(vcpu->arch.switch_db_regs)) {
 		set_debugreg(0, 7);
 		set_debugreg(vcpu->arch.eff_db[0], 0);