
[10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD

Message ID 20211208000359.2853257-11-yang.zhong@intel.com (mailing list archive)
State New, archived
Series AMX Support in KVM

Commit Message

Yang Zhong Dec. 8, 2021, 12:03 a.m. UTC
From: Jing Liu <jing2.liu@intel.com>

Intel's eXtended Feature Disable (XFD) feature allows software to
dynamically adjust the fpstate buffer size for XSAVE features that
have large state components.

WRMSR to IA32_XFD is intercepted so if the written value enables
a dynamic XSAVE feature the emulation code can exit to userspace
to trigger fpstate reallocation for the state.

Introduce a new KVM exit reason (KVM_EXIT_FPU_REALLOC) for this
purpose. If reallocation succeeds in fpu_swap_kvm_fpstate(), this
exit just bounces to userspace and then back. Otherwise the
userspace VMM should handle the error properly.

Using a new exit reason (instead of KVM_EXIT_X86_WRMSR) is clearer
and allows it to be shared between WRMSR(IA32_XFD) and XSETBV. It
also avoids mixing with the userspace MSR machinery, which is tied
to KVM_EXIT_X86_WRMSR today.

Also introduce a new MSR return type (KVM_MSR_RET_USERSPACE).
Currently MSR emulation returns to userspace only upon error or
per certain filtering rules via the userspace MSR machinery.
The new return type indicates that emulation of a certain MSR has
its own specific reason to bounce to userspace.

IA32_XFD is updated in two ways:

  - If reallocation is not required, the emulation code directly
    updates guest_fpu::xfd and then calls xfd_update_state() to
    update IA32_XFD and per-cpu cache;

  - If reallocation is triggered, the above updates are completed as
    part of the fpstate reallocation process, if it succeeds;

RDMSR of IA32_XFD is not intercepted. fpu_swap_kvm_fpstate() ensures
the guest XFD value is loaded into the MSR before re-entering the
guest, so passing reads through just saves an unnecessary VM exit.

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kvm/vmx/vmx.c   |  8 +++++++
 arch/x86/kvm/x86.c       | 48 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.h       |  1 +
 include/uapi/linux/kvm.h |  1 +
 4 files changed, 58 insertions(+)

Comments

Paolo Bonzini Dec. 10, 2021, 4:02 p.m. UTC | #1
First, the MSR should be added to msrs_to_save_all and 
kvm_cpu_cap_has(X86_FEATURE_XFD) should be checked in kvm_init_msr_list.

It seems that RDMSR support is missing, too.

More important, please include:

- documentation for the new KVM_EXIT_* value

- a selftest that explains how userspace should react to it.

This is a strong requirement for any new API (the first has been for 
years; but the latter is also almost always respected these days).  This 
series should not have been submitted without documentation.

Also:

On 12/8/21 01:03, Yang Zhong wrote:
> 
> +		if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
> +			return 1;

This should allow msr->host_initiated always (even if XFD is not part of 
CPUID).  However, if XFD is nonzero and kvm_check_guest_realloc_fpstate 
returns true, then it should return 1.

The selftest should also cover using KVM_GET_MSR/KVM_SET_MSR.

> +		/* Setting unsupported bits causes #GP */
> +		if (~XFEATURE_MASK_USER_DYNAMIC & data) {
> +			kvm_inject_gp(vcpu, 0);
> +			break;
> +		}

This should check

	if (data & ~(XFEATURE_MASK_USER_DYNAMIC &
		    vcpu->arch.guest_supported_xcr0))

instead.

Paolo
Thomas Gleixner Dec. 10, 2021, 11:09 p.m. UTC | #2
On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> +
> +		/*
> +		 * Update IA32_XFD to the guest value so #NM can be
> +		 * raised properly in the guest. Instead of directly
> +		 * writing the MSR, call a helper to avoid breaking
> +		 * per-cpu cached value in fpu core.
> +		 */
> +		fpregs_lock();
> +		current->thread.fpu.fpstate->xfd = data;
> +		xfd_update_state(current->thread.fpu.fpstate);
> +		fpregs_unlock();
> +		break;

Now looking at the actual call site, the previous patch really should be
something like the below. Why?

It preserves the inline, which allows the compiler to generate better
code in the other hot paths, and it keeps the FPU internals in the core
code. Hmm?

Thanks,

        tglx

--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -125,8 +125,10 @@ DECLARE_PER_CPU(struct fpu *, fpu_fpregs
 /* Process cleanup */
 #ifdef CONFIG_X86_64
 extern void fpstate_free(struct fpu *fpu);
+extern void fpu_update_xfd_state(u64 xfd);
 #else
 static inline void fpstate_free(struct fpu *fpu) { }
+static inline void fpu_update_xfd_state(u64 xfd) { }
 #endif
 
 /* fpstate-related functions which are exported to KVM */
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -322,6 +322,19 @@ int fpu_swap_kvm_fpstate(struct fpu_gues
 }
 EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);
 
+#ifdef CONFIG_X86_64
+void fpu_update_xfd_state(u64 xfd)
+{
+	struct fpstate *fps = current->thread.fpu.fpstate;
+
+	fpregs_lock();
+	fps->xfd = xfd;
+	xfd_update_state(fps);
+	fpregs_unlock();
+}
+EXPORT_SYMBOL_GPL(fpu_update_xfd_state);
+#endif
+
 void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
 				    unsigned int size, u32 pkru)
 {
Jing Liu Dec. 13, 2021, 7:51 a.m. UTC | #3
On 12/11/2021 12:02 AM, Paolo Bonzini wrote:
> 
> Also:
> 
> On 12/8/21 01:03, Yang Zhong wrote:
> >
> > +		if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
> > +			return 1;
> 
> This should allow msr->host_initiated always (even if XFD is not part of
> CPUID). 
Thanks Paolo.

msr->host_initiated handling would be added in next version.

I'd like to ask why we should always allow msr->host_initiated even if XFD
is not part of CPUID, given that the guest doesn't care about that MSR. We
found that some MSRs (e.g. MSR_AMD64_OSVW_STATUS and MSR_AMD64_OSVW_ID_LENGTH)
are specially handled, so we would like to understand the reasoning for
allowing msr->host_initiated.

if (!msr_info->host_initiated && !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
        return 1;


> However, if XFD is nonzero and kvm_check_guest_realloc_fpstate
> returns true, then it should return 1.
> 
If XFD is nonzero, kvm_check_guest_realloc_fpstate() won't return true, so
maybe this check is not needed here?

Thanks,
Jing

> 
> Paolo
Paolo Bonzini Dec. 13, 2021, 9:01 a.m. UTC | #4
On 12/13/21 08:51, Liu, Jing2 wrote:
> On 12/11/2021 12:02 AM, Paolo Bonzini wrote:
>>
>> Also:
>>
>> On 12/8/21 01:03, Yang Zhong wrote:
>>>
>>> +		if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
>>> +			return 1;
>>
>> This should allow msr->host_initiated always (even if XFD is not part of
>> CPUID).
> Thanks Paolo.
> 
> msr->host_initiated handling would be added in next version.
> 
> I'd like to ask why always allow msr->host_initiated even if XFD is not part of
> CPUID, although guest doesn't care that MSR?  We found some MSRs
>   (e.g. MSR_AMD64_OSVW_STATUS and MSR_AMD64_OSVW_ID_LENGTH )
> are specially handled so would like to know the consideration of allowing
> msr->host_initiated.
> 
> if (!msr_info->host_initiated && !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
>          return 1;

Because it's simpler if userspace can just take the entire list from 
KVM_GET_MSR_INDEX_LIST and pass it to KVM_GET/SET_MSR.  See for example 
vcpu_save_state and vcpu_load_state in 
tools/testing/selftests/kvm/lib/x86_64/processor.c.

>>  However, if XFD is nonzero and kvm_check_guest_realloc_fpstate
>> returns true, then it should return 1.
>
> If XFD is nonzero, kvm_check_guest_realloc_fpstate() won't return true. So
> may not need this check here?

It can't for now, because there's a single dynamic feature, but here:

+	if ((xfd & xcr0) != xcr0) {
+		u64 request = (xcr0 ^ xfd) & xcr0;
+		struct fpu_guest *guest_fpu = &vcpu->arch.guest_fpu;
+
+		/*
+		 * If requested features haven't been enabled, update
+		 * the request bitmap and tell the caller to request
+		 * dynamic buffer reallocation.
+		 */
+		if ((guest_fpu->user_xfeatures & request) != request) {
+			vcpu->arch.guest_fpu.realloc_request = request;
+			return true;
+		}
+	}

it is certainly possible to return true with nonzero XFD.

Paolo
Paolo Bonzini Dec. 13, 2021, 3:06 p.m. UTC | #5
On 12/8/21 01:03, Yang Zhong wrote:
> +		/*
> +		 * Update IA32_XFD to the guest value so #NM can be
> +		 * raised properly in the guest. Instead of directly
> +		 * writing the MSR, call a helper to avoid breaking
> +		 * per-cpu cached value in fpu core.
> +		 */
> +		fpregs_lock();
> +		current->thread.fpu.fpstate->xfd = data;

This is wrong, it should be written in vcpu->arch.guest_fpu.

> +		xfd_update_state(current->thread.fpu.fpstate);

This is okay though, so that KVM_SET_MSR will not write XFD and WRMSR will.

That said, I think xfd_update_state should not have an argument. 
current->thread.fpu.fpstate->xfd is the only fpstate that should be 
synced with the xfd_state per-CPU variable.

Paolo

> +		fpregs_unlock();
Thomas Gleixner Dec. 13, 2021, 7:45 p.m. UTC | #6
On Mon, Dec 13 2021 at 16:06, Paolo Bonzini wrote:
> On 12/8/21 01:03, Yang Zhong wrote:
>> +		/*
>> +		 * Update IA32_XFD to the guest value so #NM can be
>> +		 * raised properly in the guest. Instead of directly
>> +		 * writing the MSR, call a helper to avoid breaking
>> +		 * per-cpu cached value in fpu core.
>> +		 */
>> +		fpregs_lock();
>> +		current->thread.fpu.fpstate->xfd = data;
>
> This is wrong, it should be written in vcpu->arch.guest_fpu.
>
>> +		xfd_update_state(current->thread.fpu.fpstate);
>
> This is okay though, so that KVM_SET_MSR will not write XFD and WRMSR
> will.
>
> That said, I think xfd_update_state should not have an argument. 
> current->thread.fpu.fpstate->xfd is the only fpstate that should be 
> synced with the xfd_state per-CPU variable.

I'm looking into this right now. The whole restore versus runtime thing
needs to be handled differently.

Thanks,

        tglx
Thomas Gleixner Dec. 13, 2021, 9:23 p.m. UTC | #7
Paolo,

On Mon, Dec 13 2021 at 20:45, Thomas Gleixner wrote:
> On Mon, Dec 13 2021 at 16:06, Paolo Bonzini wrote:
>> That said, I think xfd_update_state should not have an argument. 
>> current->thread.fpu.fpstate->xfd is the only fpstate that should be 
>> synced with the xfd_state per-CPU variable.
>
> I'm looking into this right now. The whole restore versus runtime thing
> needs to be handled differently.

We need to look at different things here:

   1) XFD MSR write emulation

   2) XFD MSR synchronization when write emulation is disabled

   3) Guest restore

#1 and #2 are in the context of vcpu_run() and

   vcpu->arch.guest_fpu.fpstate == current->thread.fpu.fpstate

while #3 has:

   vcpu->arch.guest_fpu.fpstate != current->thread.fpu.fpstate


#2 is only updating fpstate->xfd and the per CPU shadow.

So the state synchronization wants to be something like this:

void fpu_sync_guest_xfd_state(void)
{
	struct fpstate *fps = current->thread.fpu.fpstate;

	lockdep_assert_irqs_disabled();
	if (fpu_state_size_dynamic()) {
		rdmsrl(MSR_IA32_XFD, fps->xfd);
		__this_cpu_write(xfd_state, fps->xfd);
	}
}
EXPORT_SYMBOL_GPL(fpu_sync_guest_xfd_state);

No wrmsrl() because the MSR is already up to date. The important part is
that fpstate->xfd and the shadow state are updated so that after
reenabling preemption the context switch FPU logic works correctly.


#1 and #3 can trigger a reallocation of guest_fpu.fpstate and
can fail. But this is also true for XSETBV emulation and XCR0 restore.

For #1 modifying fps->xfd in the KVM code before calling into the FPU
code is just _wrong_ because if the guest removes the XFD restriction
then it must be ensured that the buffer is sized correctly _before_ this
is updated.

For #3 it's not really important, but I still try to wrap my head around
the whole picture vs. XCR0.

There are two options:

  1) Require strict ordering of XFD and XCR0 update to avoid pointless
     buffer expansion, i.e. XFD before XCR0.

     Because if XCR0 is updated while guest_fpu->fpstate.xfd is still in
     init state (0) and XCR0 contains extended features, then the buffer
     would be expanded because XFD does not mask the extended features
     out. When XFD is restored with a non-zero value, it's too late
     already.

  2) Ignore buffer expansion up to the point where XSTATE restore happens
     and evaluate guest XCR0 and guest_fpu->fpstate.xfd there.

I'm leaning towards #1 because that means we have exactly _ONE_ place
where we need to deal with buffer expansion. If Qemu gets the ordering
wrong it wastes memory per vCPU, *shrug*.

Thanks,

        tglx
Tian, Kevin Dec. 14, 2021, 7:16 a.m. UTC | #8
Hi, Thomas,

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Tuesday, December 14, 2021 5:23 AM
> 
> Paolo,
> 
> On Mon, Dec 13 2021 at 20:45, Thomas Gleixner wrote:
> > On Mon, Dec 13 2021 at 16:06, Paolo Bonzini wrote:
> >> That said, I think xfd_update_state should not have an argument.
> >> current->thread.fpu.fpstate->xfd is the only fpstate that should be
> >> synced with the xfd_state per-CPU variable.
> >
> > I'm looking into this right now. The whole restore versus runtime thing
> > needs to be handled differently.
> 

After looking at your series, I think it missed Paolo's comment
about changing xfd_update_state() to accept no argument.

Thanks
Kevin
Yang Zhong Dec. 14, 2021, 10:26 a.m. UTC | #9
On Fri, Dec 10, 2021 at 05:02:49PM +0100, Paolo Bonzini wrote:
> First, the MSR should be added to msrs_to_save_all and
> kvm_cpu_cap_has(X86_FEATURE_XFD) should be checked in
> kvm_init_msr_list.
> 
> It seems that RDMSR support is missing, too.
> 
> More important, please include:
> 
> - documentation for the new KVM_EXIT_* value
> 
> - a selftest that explains how userspace should react to it.
> 
> This is a strong requirement for any new API (the first has been for
> years; but the latter is also almost always respected these days).
> This series should not have been submitted without documentation.
> 
> Also:
> 
> On 12/8/21 01:03, Yang Zhong wrote:
> >
> >+		if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
> >+			return 1;
> 
> This should allow msr->host_initiated always (even if XFD is not
> part of CPUID).  However, if XFD is nonzero and
> kvm_check_guest_realloc_fpstate returns true, then it should return
> 1.
> 
> The selftest should also cover using KVM_GET_MSR/KVM_SET_MSR.
> 

  Paolo, it seems we no longer need the new KVM_EXIT_* value, given Thomas' new patchset:
  git://git.kernel.org/pub/scm/linux/kernel/git/people/tglx/devel.git x86/fpu-kvm

  So the selftest still needs to support KVM_GET_MSR/KVM_SET_MSR for MSR_IA32_XFD
  and MSR_IA32_XFD_ERR? If yes, should we only do some read/write tests with
  vcpu_set_msr()/vcpu_get_msr() from the new selftest tool, or do a wrmsr from the
  guest side and check the value from the selftest side?

  I checked some MSR selftest reference code; tsc_msrs_test.c may be the best
  reference for this. If you have a better suggestion, please share it with me. Thanks!

  Yang
Paolo Bonzini Dec. 14, 2021, 11:24 a.m. UTC | #10
On 12/14/21 11:26, Yang Zhong wrote:
>    Paolo, Seems we do not need new KVM_EXIT_* again from below thomas' new patchset:
>    git://git.kernel.org/pub/scm/linux/kernel/git/people/tglx/devel.git x86/fpu-kvm
> 
>    So the selftest still need support KVM_GET_MSR/KVM_SET_MSR for MSR_IA32_XFD
>    and MSR_IA32_XFD_ERR? If yes, we only do some read/write test with vcpu_set_msr()/
>    vcpu_get_msr() from new selftest tool? or do wrmsr from guest side and check this value
>    from selftest side?

You can write a test similar to state_test.c to cover XCR0, XFD and the
new XSAVE extensions.  The test can:

- initialize AMX and write a nonzero value to XFD

- load a matrix into TMM0

- check that #NM is delivered (search for vm_install_exception_handler) and
that XFD_ERR is correct

- write 0 to XFD

- load again the matrix, and check that #NM is not delivered

- store it back into memory

- compare it with the original data

All of this can be done with a full save&restore after every step
(though I suggest that you first get it working without save&restore,
the relevant code in state_test.c is easy to identify and comment out).

You will have to modify vcpu_load_state, so that it does
first KVM_SET_MSRS, then KVM_SET_XCRS, then KVM_SET_XSAVE.
See patch below.

Paolo

>    I checked some msr selftest reference code, tsc_msrs_test.c, which maybe better for this
>    reference. If you have better suggestion, please share it to me. thanks!


------------------ 8< -----------------
From: Paolo Bonzini <pbonzini@redhat.com>
Subject: [PATCH] selftest: kvm: Reorder vcpu_load_state steps for AMX

For AMX support it is recommended to load XCR0 after XFD, so that
KVM does not see XFD=0, XCR=1 for a save state that will eventually
be disabled (which would lead to premature allocation of the space
required for that save state).

It is also required to load XSAVE data after XCR0 and XFD, so that
KVM can trigger allocation of the extra space required to store AMX
state.

Adjust vcpu_load_state to obey these new requirements.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index 82c39db91369..d805f63f7203 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -1157,16 +1157,6 @@ void vcpu_load_state(struct kvm_vm *vm, uint32_t vcpuid, struct kvm_x86_state *s
  	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
  	int r;
  
-	r = ioctl(vcpu->fd, KVM_SET_XSAVE, &state->xsave);
-        TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XSAVE, r: %i",
-                r);
-
-	if (kvm_check_cap(KVM_CAP_XCRS)) {
-		r = ioctl(vcpu->fd, KVM_SET_XCRS, &state->xcrs);
-		TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XCRS, r: %i",
-			    r);
-	}
-
  	r = ioctl(vcpu->fd, KVM_SET_SREGS, &state->sregs);
          TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_SREGS, r: %i",
                  r);
@@ -1175,6 +1165,16 @@ void vcpu_load_state(struct kvm_vm *vm, uint32_t vcpuid, struct kvm_x86_state *s
          TEST_ASSERT(r == state->msrs.nmsrs, "Unexpected result from KVM_SET_MSRS, r: %i (failed at %x)",
                  r, r == state->msrs.nmsrs ? -1 : state->msrs.entries[r].index);
  
+	if (kvm_check_cap(KVM_CAP_XCRS)) {
+		r = ioctl(vcpu->fd, KVM_SET_XCRS, &state->xcrs);
+		TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XCRS, r: %i",
+			    r);
+	}
+
+	r = ioctl(vcpu->fd, KVM_SET_XSAVE, &state->xsave);
+        TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XSAVE, r: %i",
+                r);
+
  	r = ioctl(vcpu->fd, KVM_SET_VCPU_EVENTS, &state->events);
          TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_VCPU_EVENTS, r: %i",
                  r);

Patch

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 70d86ffbccf7..971d60980d5b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7141,6 +7141,11 @@  static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
 }
 
+static void vmx_update_intercept_xfd(struct kvm_vcpu *vcpu)
+{
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_R, false);
+}
+
 static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7181,6 +7186,9 @@  static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	if (cpu_feature_enabled(X86_FEATURE_XFD) && guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+		vmx_update_intercept_xfd(vcpu);
+
 	set_cr4_guest_host_mask(vmx);
 
 	vmx_write_encls_bitmap(vcpu, NULL);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 91cc6f69a7ca..c83887cb55ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1873,6 +1873,16 @@  static int kvm_msr_user_space(struct kvm_vcpu *vcpu, u32 index,
 {
 	u64 msr_reason = kvm_msr_reason(r);
 
+	/*
+	 * MSR emulation may need certain effects triggered on the
+	 * path transitioning to userspace (e.g. fpstate reallocation).
+	 * In this case the actual exit reason and completion
+	 * func should have been set by the emulation code before
+	 * this point.
+	 */
+	if (r == KVM_MSR_RET_USERSPACE)
+		return 1;
+
 	/* Check if the user wanted to know about this MSR fault */
 	if (!(vcpu->kvm->arch.user_space_msr_mask & msr_reason))
 		return 0;
@@ -3692,6 +3702,44 @@  int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;
 		vcpu->arch.msr_misc_features_enables = data;
 		break;
+#ifdef CONFIG_X86_64
+	case MSR_IA32_XFD:
+		if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+			return 1;
+
+		/* Setting unsupported bits causes #GP */
+		if (~XFEATURE_MASK_USER_DYNAMIC & data) {
+			kvm_inject_gp(vcpu, 0);
+			break;
+		}
+
+		WARN_ON_ONCE(current->thread.fpu.fpstate !=
+			     vcpu->arch.guest_fpu.fpstate);
+
+		/*
+		 * Check if fpstate reallocation is required. If yes, then
+		 * let the fpu core do the reallocation and update xfd;
+		 * otherwise, update xfd here.
+		 */
+		if (kvm_check_guest_realloc_fpstate(vcpu, data)) {
+			vcpu->run->exit_reason = KVM_EXIT_FPU_REALLOC;
+			vcpu->arch.complete_userspace_io =
+				kvm_skip_emulated_instruction;
+			return KVM_MSR_RET_USERSPACE;
+		}
+
+		/*
+		 * Update IA32_XFD to the guest value so #NM can be
+		 * raised properly in the guest. Instead of directly
+		 * writing the MSR, call a helper to avoid breaking
+		 * per-cpu cached value in fpu core.
+		 */
+		fpregs_lock();
+		current->thread.fpu.fpstate->xfd = data;
+		xfd_update_state(current->thread.fpu.fpstate);
+		fpregs_unlock();
+		break;
+#endif
 	default:
 		if (kvm_pmu_is_valid_msr(vcpu, msr))
 			return kvm_pmu_set_msr(vcpu, msr_info);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 24a323980146..446ffa8c7804 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -460,6 +460,7 @@  bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
  */
 #define  KVM_MSR_RET_INVALID	2	/* in-kernel MSR emulation #GP condition */
 #define  KVM_MSR_RET_FILTERED	3	/* #GP due to userspace MSR filter */
+#define  KVM_MSR_RET_USERSPACE	4	/* Userspace handling */
 
 #define __cr4_reserved_bits(__cpu_has, __c)             \
 ({                                                      \
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1daa45268de2..0c7b301c7254 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -270,6 +270,7 @@  struct kvm_xen_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_FPU_REALLOC      36
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */