[RFC,10/10] x86/fpu: defer FPU state load until return to userspace

Message ID 20180912133353.20595-11-bigeasy@linutronix.de (mailing list archive)
State New, archived
Series [RFC,01/10] x86/entry: remove _TIF_ALLWORK_MASK

Commit Message

Sebastian Andrzej Siewior Sept. 12, 2018, 1:33 p.m. UTC
From: Rik van Riel <riel@surriel.com>

Defer loading of FPU state until return to userspace. This gives
the kernel the potential to skip loading FPU state for tasks that
stay in kernel mode, or for tasks that end up with repeated
invocations of kernel_fpu_begin.

It also increases the chances that a task's FPU state will remain
valid in the FPU registers until it is scheduled back in, allowing
us to skip restoring that task's FPU state altogether.

Signed-off-by: Rik van Riel <riel@surriel.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
 arch/x86/entry/common.c             |  9 +++++++++
 arch/x86/include/asm/fpu/api.h      |  5 +++++
 arch/x86/include/asm/fpu/internal.h | 16 +++++++++------
 arch/x86/kernel/fpu/core.c          | 31 ++++++++++++++++++++++-------
 arch/x86/kernel/process_32.c        |  7 ++++---
 arch/x86/kernel/process_64.c        |  7 ++++---
 arch/x86/kvm/x86.c                  | 10 +++++++---
 arch/x86/mm/pkeys.c                 |  2 +-
 8 files changed, 64 insertions(+), 23 deletions(-)
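
In outline, the patch stops loading the new task's FPU registers in __switch_to() and instead only marks a reload as pending; the load itself happens on the way back to userspace. A condensed reading aid assembled from the hunks below (not new code):

	/* Context switch (__switch_to): */
	if (prev_fpu->initialized && !test_thread_flag(TIF_LOAD_FPU))
		switch_fpu_prepare(prev_fpu, cpu);	/* save outgoing state */
	...
	switch_fpu_finish(next_fpu);		/* now just sets TIF_LOAD_FPU */

	/* Return to userspace (prepare_exit_to_usermode): */
	if (unlikely(cached_flags & _TIF_LOAD_FPU)) {
		clear_thread_flag(TIF_LOAD_FPU);
		switch_fpu_return();		/* the actual register load */
	}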

Comments

Andy Lutomirski Sept. 12, 2018, 3:47 p.m. UTC | #1
> On Sep 12, 2018, at 6:33 AM, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> From: Rik van Riel <riel@surriel.com>
> 
> Defer loading of FPU state until return to userspace. This gives
> the kernel the potential to skip loading FPU state for tasks that
> stay in kernel mode, or for tasks that end up with repeated
> invocations of kernel_fpu_begin.
> 
> It also increases the chances that a task's FPU state will remain
> valid in the FPU registers until it is scheduled back in, allowing
> us to skip restoring that task's FPU state altogether.
> 
> 

> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -101,14 +101,14 @@ void __kernel_fpu_begin(void)
> 
>    kernel_fpu_disable();
> 
> -    if (fpu->initialized) {
> +    __cpu_invalidate_fpregs_state();
> +
> +    if (!test_and_set_thread_flag(TIF_LOAD_FPU)) {

Since the already-TIF_LOAD_FPU path is supposed to be fast here, use test_thread_flag() instead. test_and_set operations do unconditional RMW operations and are always full barriers, so they’re slow.
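
Concretely, that suggestion would turn the hunk into something like this (a sketch; it assumes the rest of __kernel_fpu_begin() stays as in the patch, and that the caller has preemption disabled, as kernel_fpu_begin() does, so test-then-set cannot race):

	__cpu_invalidate_fpregs_state();

	if (!test_thread_flag(TIF_LOAD_FPU)) {
		set_thread_flag(TIF_LOAD_FPU);
		/*
		 * Ignore return value -- we don't care if reg state
		 * is clobbered.
		 */
		copy_fpregs_to_fpstate(fpu);
	}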

Also, on top of this patch, there should be lots of cleanups available. In particular, all the fpu state accessors could probably be reworked to take TIF_LOAD_FPU into account, which would simplify the callers and maybe even the mess of variables tracking whether the state is in regs.
Sebastian Andrzej Siewior Sept. 19, 2018, 5:05 p.m. UTC | #2
On 2018-09-12 08:47:19 [-0700], Andy Lutomirski wrote:
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -101,14 +101,14 @@ void __kernel_fpu_begin(void)
> > 
> >    kernel_fpu_disable();
> > 
> > -    if (fpu->initialized) {
> > +    __cpu_invalidate_fpregs_state();
> > +
> > +    if (!test_and_set_thread_flag(TIF_LOAD_FPU)) {
> 
> Since the already-TIF_LOAD_FPU path is supposed to be fast here, use test_thread_flag() instead. test_and_set operations do unconditional RMW operations and are always full barriers, so they’re slow.
okay.

> Also, on top of this patch, there should be lots of cleanups available. In particular, all the fpu state accessors could probably be reworked to take TIF_LOAD_FPU into account, which would simplify the callers and maybe even the mess of variables tracking whether the state is in regs.

Do you refer to the fpu.initialized check or something else?

Sebastian
Andy Lutomirski Sept. 21, 2018, 3:45 a.m. UTC | #3
> On Sep 19, 2018, at 10:05 AM, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> On 2018-09-12 08:47:19 [-0700], Andy Lutomirski wrote:
>>> --- a/arch/x86/kernel/fpu/core.c
>>> +++ b/arch/x86/kernel/fpu/core.c
>>> @@ -101,14 +101,14 @@ void __kernel_fpu_begin(void)
>>> 
>>>   kernel_fpu_disable();
>>> 
>>> -    if (fpu->initialized) {
>>> +    __cpu_invalidate_fpregs_state();
>>> +
>>> +    if (!test_and_set_thread_flag(TIF_LOAD_FPU)) {
>> 
>> Since the already-TIF_LOAD_FPU path is supposed to be fast here, use test_thread_flag() instead. test_and_set operations do unconditional RMW operations and are always full barriers, so they’re slow.
> okay.
> 
>> Also, on top of this patch, there should be lots of cleanups available. In particular, all the fpu state accessors could probably be reworked to take TIF_LOAD_FPU into account, which would simplify the callers and maybe even the mess of variables tracking whether the state is in regs.
> 
> Do you refer to the fpu.initialized check or something else?
> 

I mean the fpu.initialized variable entirely. AFAIK, its only use is for kernel threads — setting it to false lets us switch to a kernel thread and back without saving and restoring. But TIF_LOAD_FPU should be able to replace it: when we have FPU regs loaded and we switch to *any* thread, kernel or otherwise, we can set TIF_LOAD_FPU and leave the old regs loaded.  So we don’t need the special case for kernel threads.
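
For illustration, the special-case-free version might look like this (a sketch of the idea only, not code from the series; the pkru handling from the patch is omitted):

	static inline void switch_fpu_finish(struct fpu *new_fpu)
	{
		if (!static_cpu_has(X86_FEATURE_FPU))
			return;

		/*
		 * No new_fpu->initialized check: even when switching to a
		 * kernel thread the old registers can stay loaded; the flag
		 * merely marks them as not belonging to current.
		 */
		set_thread_flag(TIF_LOAD_FPU);
	}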

Which reminds me: if you haven’t already done so, can you add a helper to sanity check the current context?  It should check that the combination of owner_ctx, last_cpu, and TIF_LOAD_FPU is sane. For example, if owner_ctx or last_cpu says the cpu regs are invalid for current but TIF_LOAD_FPU is clear, it should warn.  I think that at least switch_fpu_finish should call it.  Arguably switch_fpu_prepare should too, at the beginning.
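
A minimal sketch of such a helper, built on the existing fpregs_state_valid() check (the name and the exact warn condition are assumptions, not part of the series; callers are expected to have preemption disabled):

	static inline void fpregs_state_check_current(void)
	{
		struct fpu *fpu = &current->thread.fpu;

		/*
		 * With no reload pending, the registers must hold current's
		 * state, i.e. owner_ctx and last_cpu must say they are valid.
		 */
		if (!test_thread_flag(TIF_LOAD_FPU))
			WARN_ON_FPU(!fpregs_state_valid(fpu, smp_processor_id()));
	}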
Andy Lutomirski Sept. 21, 2018, 4:15 a.m. UTC | #4
> On Sep 20, 2018, at 8:45 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> 
> 
> 
>> On Sep 19, 2018, at 10:05 AM, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
>> 
>> On 2018-09-12 08:47:19 [-0700], Andy Lutomirski wrote:
>>>> --- a/arch/x86/kernel/fpu/core.c
>>>> +++ b/arch/x86/kernel/fpu/core.c
>>>> @@ -101,14 +101,14 @@ void __kernel_fpu_begin(void)
>>>> 
>>>>  kernel_fpu_disable();
>>>> 
>>>> -    if (fpu->initialized) {
>>>> +    __cpu_invalidate_fpregs_state();
>>>> +
>>>> +    if (!test_and_set_thread_flag(TIF_LOAD_FPU)) {
>>> 
>>> Since the already-TIF_LOAD_FPU path is supposed to be fast here, use test_thread_flag() instead. test_and_set operations do unconditional RMW operations and are always full barriers, so they’re slow.
>> okay.
>> 
>>> Also, on top of this patch, there should be lots of cleanups available. In particular, all the fpu state accessors could probably be reworked to take TIF_LOAD_FPU into account, which would simplify the callers and maybe even the mess of variables tracking whether the state is in regs.
>> 
>> Do you refer to the fpu.initialized check or something else?
>> 
> 
> I mean the fpu.initialized variable entirely. AFAIK, its only use is for kernel threads — setting it to false lets us switch to a kernel thread and back without saving and restoring. But TIF_LOAD_FPU should be able to replace it: when we have FPU regs loaded and we switch to *any* thread, kernel or otherwise, we can set TIF_LOAD_FPU and leave the old regs loaded.  So we don’t need the special case for kernel threads.
> 
> Which reminds me: if you haven’t already done so, can you add a helper to sanity check the current context?  It should check that the combination of owner_ctx, last_cpu, and TIF_LOAD_FPU is sane. For example, if owner_ctx or last_cpu says the cpu regs are invalid for current but TIF_LOAD_FPU is clear, it should warn.  I think that at least switch_fpu_finish should call it.  Arguably switch_fpu_prepare should too, at the beginning.

Looking some more, the “preload” variable needs to go away or be renamed. It hasn’t had anything to do with preloading for some time.

Also, the interaction between TIF_LOAD_FPU and FPU emulation needs to be documented somewhere.  Probably FPU-less systems should never have TIF_LOAD_FPU set.

Or we could decide that no one uses FPU emulation any more.
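
If TIF_LOAD_FPU is kept off on FPU-less systems, one way to document and enforce that rule (an assumption, not part of the series) is a warning next to the X86_FEATURE_FPU early return that switch_fpu_return() already has:

	/* FPU emulation must never leave TIF_LOAD_FPU behind. */
	WARN_ON_FPU(!static_cpu_has(X86_FEATURE_FPU) &&
		    test_thread_flag(TIF_LOAD_FPU));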
Sebastian Andrzej Siewior Sept. 26, 2018, 11:12 a.m. UTC | #5
On 2018-09-20 21:15:35 [-0700], Andy Lutomirski wrote:
> > I mean the fpu.initialized variable entirely. AFAIK, its only use is for kernel threads — setting it to false lets us switch to a kernel thread and back without saving and restoring. But TIF_LOAD_FPU should be able to replace it: when we have FPU regs loaded and we switch to *any* thread, kernel or otherwise, we can set TIF_LOAD_FPU and leave the old regs loaded.  So we don’t need the special case for kernel threads.
> > 
> > Which reminds me: if you haven’t already done so, can you add a helper to sanity check the current context?  It should check that the combination of owner_ctx, last_cpu, and TIF_LOAD_FPU is sane. For example, if owner_ctx or last_cpu says the cpu regs are invalid for current but TIF_LOAD_FPU is clear, it should warn.  I think that at least switch_fpu_finish should call it.  Arguably switch_fpu_prepare should too, at the beginning.
> 
> Looking some more, the “preload” variable needs to go away or be renamed. It hasn’t had anything to do with preloading for some time.
okay.

> Also, the interaction between TIF_LOAD_FPU and FPU emulation needs to be documented somewhere.  Probably FPU-less systems should never have TIF_LOAD_FPU set.
Yes, they should not.

> Or we could decide that no one uses FPU emulation any more.

Oh. Removing unused code? I'm all yours.
There is this Intel Quark thingy which comes to mind; it can still be
bought. Its data sheet[0] has this:
| 13.1 Features:
| Note: The processor does not provide an x87 Floating Point Unit (FPU) and does
| not support x87 FPU instructions

so not only does it not support the lock prefix, it also relies on
soft-FPU.
The latest BSP release notes[1] quote a package named
	quark_linux_v3.14+v1.2.1.1.tar.gz

so they still use v3.14 (which is not a supported kernel anymore).
And then I took a look into their Yocto-BSP and found this:
| $ fgrep -R MATH_EMULATION .
| ./meta-intel-quark/recipes-kernel/linux/files/quark_3.14.cfg:# CONFIG_MATH_EMULATION is not set

so they don't set this option. This is a small SoC and does not run on any
distro due to the missing lock prefix. And since they use Yocto to recompile
everything, they can rebuild their toolchain with soft-FPU support, which
is more efficient than trapping into the kernel for every opcode.

So I *think* nobody relies on FPU emulation anymore. I would suggest we
get this patch set into shape and then get rid of
CONFIG_MATH_EMULATION?

[0] https://www.intel.de/content/dam/www/public/us/en/documents/datasheets/quark-c1000-datasheet.pdf
[1] https://downloadmirror.intel.com/23197/eng/Quark_SW_RelNotes_330232_007.pdf

Sebastian
Andy Lutomirski Sept. 26, 2018, 2:34 p.m. UTC | #6
> On Sep 26, 2018, at 4:12 AM, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> On 2018-09-20 21:15:35 [-0700], Andy Lutomirski wrote:
>>> I mean the fpu.initialized variable entirely. AFAIK, its only use is for kernel threads — setting it to false lets us switch to a kernel thread and back without saving and restoring. But TIF_LOAD_FPU should be able to replace it: when we have FPU regs loaded and we switch to *any* thread, kernel or otherwise, we can set TIF_LOAD_FPU and leave the old regs loaded.  So we don’t need the special case for kernel threads.
>>> 
>>> Which reminds me: if you haven’t already done so, can you add a helper to sanity check the current context?  It should check that the combination of owner_ctx, last_cpu, and TIF_LOAD_FPU is sane. For example, if owner_ctx or last_cpu says the cpu regs are invalid for current but TIF_LOAD_FPU is clear, it should warn.  I think that at least switch_fpu_finish should call it.  Arguably switch_fpu_prepare should too, at the beginning.
>> 
>> Looking some more, the “preload” variable needs to go away or be renamed. It hasn’t had anything to do with preloading for some time.
> okay.
> 
>> Also, the interaction between TIF_LOAD_FPU and FPU emulation needs to be documented somewhere.  Probably FPU-less systems should never have TIF_LOAD_FPU set.
> Yes, they should not.
> 
>> Or we could decide that no one uses FPU emulation any more.
> 
> Oh. Removing unused code? I'm all yours.
> There is this Intel Quark thingy which comes to mind; it can still be
> bought. Its data sheet[0] has this:
> | 13.1 Features:
> | Note: The processor does not provide an x87 Floating Point Unit (FPU) and does
> | not support x87 FPU instructions
> 
> so not only does it not support the lock prefix, it also relies on
> soft-FPU.
> The latest BSP release notes[1] quote a package named
>    quark_linux_v3.14+v1.2.1.1.tar.gz
> 
> so they still use v3.14 (which is not a supported kernel anymore).
> And then I took a look into their Yocto-BSP and found this:
> | $ fgrep -R MATH_EMULATION .
> | ./meta-intel-quark/recipes-kernel/linux/files/quark_3.14.cfg:# CONFIG_MATH_EMULATION is not set
> 
> so they don't set this option. This is a small SoC and does not run on any
> distro due to the missing lock prefix. And since they use Yocto to recompile
> everything, they can rebuild their toolchain with soft-FPU support, which
> is more efficient than trapping into the kernel for every opcode.
> 
> So I *think* nobody relies on FPU emulation anymore. I would suggest we
> get this patch set into shape and then get rid of
> CONFIG_MATH_EMULATION?

I don’t think you can boot a kernel without math emulation on a no-FPU CPU. So maybe we should table this particular idea.  I didn’t realize there were Quark CPUs without FPU.
Sebastian Andrzej Siewior Sept. 26, 2018, 3:32 p.m. UTC | #7
On 2018-09-26 07:34:09 [-0700], Andy Lutomirski wrote:
> > So I *think* nobody relies on FPU emulation anymore. I would suggest we
> > get this patch set into shape and then get rid of
> > CONFIG_MATH_EMULATION?

Bryan, Denys, do any of you rely on CONFIG_MATH_EMULATION?
The manual for the Quark claims that the CPU has no FPU, and yet neither the
current kernel config nor the Yocto v3.14 one selects this option.
Denys added support for some opcodes in 2015 but I didn't figure out
*why* or which CPU in particular requires this.

> I don’t think you can boot a kernel without math emulation on a no-FPU CPU. So maybe we should table this particular idea.  I didn’t realize there were Quark CPUs without FPU.

Okay.

Sebastian
Andy Lutomirski Sept. 26, 2018, 4:24 p.m. UTC | #8
> On Sep 26, 2018, at 8:32 AM, Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> 
> On 2018-09-26 07:34:09 [-0700], Andy Lutomirski wrote:
>>> So I *think* nobody relies on FPU emulation anymore. I would suggest we
>>> get this patch set into shape and then get rid of
>>> CONFIG_MATH_EMULATION?
> 
> Bryan, Denys, do any of you rely on CONFIG_MATH_EMULATION?
> The manual for the Quark claims that the CPU has no FPU, and yet neither the
> current kernel config nor the Yocto v3.14 one selects this option.
> Denys added support for some opcodes in 2015 but I didn't figure out
> *why* or which CPU in particular requires this.

Certainly the early Quark CPUs were, um, crappy. They had so many terrifying errata that I’m surprised anyone is willing to use them.

I would *hope* that all supported Quark CPUs support FPU instructions even if they’re microcoded and run extremely slowly.

Patch

diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
index 3b2490b819181..180e16a8177d2 100644
--- a/arch/x86/entry/common.c
+++ b/arch/x86/entry/common.c
@@ -31,6 +31,7 @@ 
 #include <asm/vdso.h>
 #include <linux/uaccess.h>
 #include <asm/cpufeature.h>
+#include <asm/fpu/api.h>
 
 #define CREATE_TRACE_POINTS
 #include <trace/events/syscalls.h>
@@ -196,6 +197,14 @@  __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
 	if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
 		exit_to_usermode_loop(regs, cached_flags);
 
+	/* Reload ti->flags; we may have rescheduled above. */
+	cached_flags = READ_ONCE(ti->flags);
+
+	if (unlikely(cached_flags & _TIF_LOAD_FPU)) {
+		clear_thread_flag(TIF_LOAD_FPU);
+		switch_fpu_return();
+	}
+
 #ifdef CONFIG_COMPAT
 	/*
 	 * Compat syscalls set TS_COMPAT.  Make sure we clear it before
diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index a9caac9d4a729..308e66a7fcd62 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -27,6 +27,11 @@  extern void kernel_fpu_begin(void);
 extern void kernel_fpu_end(void);
 extern bool irq_fpu_usable(void);
 
+/*
+ * Load the task FPU state before returning to userspace.
+ */
+extern void switch_fpu_return(void);
+
 /*
  * Query the presence of one or more xfeatures. Works on any legacy CPU as well.
  *
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 184d76c6470b1..370686972592e 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -472,9 +472,11 @@  static inline void fpregs_activate(struct fpu *fpu)
 /*
  * Load the FPU state for the current task. Call with preemption disabled.
  */
-static inline void __fpregs_load_activate(struct fpu *fpu, int cpu)
+static inline void __fpregs_load_activate(void)
 {
-	if (!fpregs_state_valid(fpu, cpu))
+	struct fpu *fpu = &current->thread.fpu;
+
+	if (!fpregs_state_valid(fpu, smp_processor_id()))
 		copy_kernel_to_fpregs(&fpu->state);
 	fpregs_activate(fpu);
 }
@@ -482,11 +484,13 @@  static inline void __fpregs_load_activate(struct fpu *fpu, int cpu)
 static inline void __fpregs_changes_begin(void)
 {
 	preempt_disable();
+	local_bh_disable();
 }
 
 static inline void __fpregs_changes_end(void)
 {
 	preempt_enable();
+	local_bh_enable();
 }
 
 /*
@@ -497,8 +501,8 @@  static inline void __fpregs_changes_end(void)
  *  - switch_fpu_prepare() saves the old state.
  *    This is done within the context of the old process.
  *
- *  - switch_fpu_finish() restores the new state as
- *    necessary.
+ *  - switch_fpu_finish() sets TIF_LOAD_FPU; the floating point state
+ *    will get loaded on return to userspace, or when the kernel needs it.
  */
 static inline void
 switch_fpu_prepare(struct fpu *old_fpu, int cpu)
@@ -523,7 +527,7 @@  switch_fpu_prepare(struct fpu *old_fpu, int cpu)
  * Set up the userspace FPU context for the new task, if the task
  * has used the FPU.
  */
-static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
+static inline void switch_fpu_finish(struct fpu *new_fpu)
 {
 	bool preload = static_cpu_has(X86_FEATURE_FPU) &&
 		       new_fpu->initialized;
@@ -531,7 +535,7 @@  static inline void switch_fpu_finish(struct fpu *new_fpu, int cpu)
 	if (!preload)
 		return;
 
-	__fpregs_load_activate(new_fpu, cpu);
+	set_thread_flag(TIF_LOAD_FPU);
 	/* Protection keys need to be in place right at context switch time. */
 	if (boot_cpu_has(X86_FEATURE_OSPKE)) {
 		if (new_fpu->pkru != __read_pkru())
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 8564086c217fc..cbbd20515cf46 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -101,14 +101,14 @@  void __kernel_fpu_begin(void)
 
 	kernel_fpu_disable();
 
-	if (fpu->initialized) {
+	__cpu_invalidate_fpregs_state();
+
+	if (!test_and_set_thread_flag(TIF_LOAD_FPU)) {
 		/*
 		 * Ignore return value -- we don't care if reg state
 		 * is clobbered.
 		 */
 		copy_fpregs_to_fpstate(fpu);
-	} else {
-		__cpu_invalidate_fpregs_state();
 	}
 }
 EXPORT_SYMBOL(__kernel_fpu_begin);
@@ -117,8 +117,7 @@  void __kernel_fpu_end(void)
 {
 	struct fpu *fpu = &current->thread.fpu;
 
-	if (fpu->initialized)
-		copy_kernel_to_fpregs(&fpu->state);
+	switch_fpu_finish(fpu);
 
 	kernel_fpu_enable();
 }
@@ -147,7 +146,7 @@  void fpu__save(struct fpu *fpu)
 {
 	WARN_ON_FPU(fpu != &current->thread.fpu);
 
-	preempt_disable();
+	__fpregs_changes_begin();
 	trace_x86_fpu_before_save(fpu);
 	if (fpu->initialized) {
 		if (!copy_fpregs_to_fpstate(fpu)) {
@@ -155,7 +154,7 @@  void fpu__save(struct fpu *fpu)
 		}
 	}
 	trace_x86_fpu_after_save(fpu);
-	preempt_enable();
+	__fpregs_changes_end();
 }
 EXPORT_SYMBOL_GPL(fpu__save);
 
@@ -403,6 +402,24 @@  void fpu__clear(struct fpu *fpu)
 	}
 }
 
+/*
+ * Load FPU context before returning to userspace.
+ */
+void switch_fpu_return(void)
+{
+	if (!static_cpu_has(X86_FEATURE_FPU))
+		return;
+
+	/*
+	 * We should never return to user space without the task's
+	 * own FPU contents loaded into the registers. That makes it
+	 * a bug to not have the task's FPU state set up.
+	 */
+	WARN_ON_FPU(!current->thread.fpu.initialized);
+
+	__fpregs_load_activate();
+}
+
 /*
  * x87 math exception handling:
  */
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 5046a3c9dec2f..a65f8ce36379b 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -236,7 +236,8 @@  __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */
 
-	switch_fpu_prepare(prev_fpu, cpu);
+	if (prev_fpu->initialized && !test_thread_flag(TIF_LOAD_FPU))
+		switch_fpu_prepare(prev_fpu, cpu);
 
 	/*
 	 * Save away %gs. No need to save %fs, as it was saved on the
@@ -297,10 +298,10 @@  __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	if (prev->gs | next->gs)
 		lazy_load_gs(next->gs);
 
-	switch_fpu_finish(next_fpu, cpu);
-
 	this_cpu_write(current_task, next_p);
 
+	switch_fpu_finish(next_fpu);
+
 	/* Load the Intel cache allocation PQR MSR. */
 	intel_rdt_sched_in();
 
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ea5ea850348da..66b763f3da6a0 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -427,7 +427,8 @@  __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
 		     this_cpu_read(irq_count) != -1);
 
-	switch_fpu_prepare(prev_fpu, cpu);
+	if (prev_fpu->initialized && !test_thread_flag(TIF_LOAD_FPU))
+		switch_fpu_prepare(prev_fpu, cpu);
 
 	/* We must save %fs and %gs before load_TLS() because
 	 * %fs and %gs may be cleared by load_TLS().
@@ -478,8 +479,6 @@  __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	load_seg_legacy(prev->gsindex, prev->gsbase,
 			next->gsindex, next->gsbase, GS);
 
-	switch_fpu_finish(next_fpu, cpu);
-
 	/*
 	 * Switch the PDA and FPU contexts.
 	 */
@@ -489,6 +488,8 @@  __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	/* Reload sp0. */
 	update_task_stack(next_p);
 
+	switch_fpu_finish(next_fpu);
+
 	/*
 	 * Now maybe reload the debug registers and handle I/O bitmaps
 	 */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2d8b3c257e769..d9a8c5ec8d6fe 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7573,6 +7573,9 @@  static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		wait_lapic_expire(vcpu);
 	guest_enter_irqoff();
 
+	if (test_and_clear_thread_flag(TIF_LOAD_FPU))
+		switch_fpu_return();
+
 	if (unlikely(vcpu->arch.switch_db_regs)) {
 		set_debugreg(0, 7);
 		set_debugreg(vcpu->arch.eff_db[0], 0);
@@ -7832,22 +7835,23 @@  static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
 /* Swap (qemu) user FPU context for the guest FPU context. */
 static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 {
-	preempt_disable();
+	__fpregs_changes_begin();
 	copy_fpregs_to_fpstate(&vcpu->arch.user_fpu);
 	/* PKRU is separately restored in kvm_x86_ops->run.  */
 	__copy_kernel_to_fpregs(&vcpu->arch.guest_fpu.state,
 				~XFEATURE_MASK_PKRU);
-	preempt_enable();
+	__fpregs_changes_end();
 	trace_kvm_fpu(1);
 }
 
 /* When vcpu_run ends, restore user space FPU context. */
 static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 {
-	preempt_disable();
+	__fpregs_changes_begin();
 	copy_fpregs_to_fpstate(&vcpu->arch.guest_fpu);
 	copy_kernel_to_fpregs(&vcpu->arch.user_fpu.state);
-	preempt_enable();
+	__fpregs_changes_end();
 	++vcpu->stat.fpu_reload;
 	trace_kvm_fpu(0);
 }
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 6e91529fe2a30..83835abd526d7 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -28,7 +28,7 @@  void write_pkru(u32 pkru)
 	current->thread.fpu.pkru = pkru;
 
 	__fpregs_changes_begin();
-	__fpregs_load_activate(&current->thread.fpu, smp_processor_id());
+	__fpregs_load_activate();
 	__write_pkru(pkru);
 	__fpregs_changes_end();
 }