[v6] x86: load FPU registers on return to userland

Message ID 20190109114744.10936-1-bigeasy@linutronix.de (mailing list archive)

Pull-request

git://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git x86_fpu_rtu_v6

Message

Sebastian Andrzej Siewior Jan. 9, 2019, 11:47 a.m. UTC
This is a refurbished series originally started by Rik van Riel. The
goal is to load the FPU registers on return to userland rather than on
every context switch. With this optimisation we can:
- avoid loading the registers if the task stays in the kernel and does
  not return to userland
- make kernel_fpu_begin() cheaper: it only saves the registers on the
  first invocation; a second invocation does not need to save them
  again.

To access the FPU registers in the kernel we need to (a sketch of the
resulting pattern follows the list):
- disable preemption, to keep the scheduler from switching tasks. A
  task switch would set TIF_NEED_FPU_LOAD and the FPU registers would
  no longer be valid.
- disable BH, because a softirq might use kernel_fpu_begin() and then
  set TIF_NEED_FPU_LOAD instead of loading the FPU registers on
  completion.
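
Roughly, the begin side then looks like this (a sketch of the scheme
only, not the exact code in the series; TIF_NEED_FPU_LOAD and
copy_fpregs_to_fpstate() are the flag/helper used by these patches):

	static void kernel_fpu_begin_sketch(void)
	{
		preempt_disable();	/* a task switch would set TIF_NEED_FPU_LOAD */
		local_bh_disable();	/* a softirq might use kernel_fpu_begin(), too */

		if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
			/*
			 * First invocation: the user registers are still
			 * live. Save them once and mark them for reload
			 * on the way back to userland.
			 */
			copy_fpregs_to_fpstate(&current->thread.fpu);
			set_thread_flag(TIF_NEED_FPU_LOAD);
		}
		/* Every further invocation skips the save entirely. */
	}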

v5…v6:
Rebased on top of v5.0-rc1. Integrated a few fixes which I noticed
while looking over the patches and dropped the first few patches, which
were already applied.

v4…v5:
Rebased on top of a fix; noticed a problem with XSAVES and then redid
the restore on signal return (patches #26 to #28).

I don't much like the signal save+restore thing that we are doing. It
has always been like that. I *think* this is just because we have
nowhere to stash the FPU state while we are handling the signal. We
could add another fpu->state for the signal handler and avoid the whole
thing. Debian code-search revealed that `criu' is using it (and I
didn't figure out why). Nothing else (that is packaged in Debian).
Maybe we could get rid of this, and `criu' could then use a dedicated
interface for its needs rather than the signal interface that happens
to do what it wants :)

v3…v4:
It has been suggested to remove the `initialized' member of struct
fpu because it should no longer be needed with lazy FPU restore, and
removing it would make the review easier. This is the first part of the
series; the second is basically a rebase of the v3 queue. As a result,
the diffstat became negative (which wasn't the case in the previous
version) :)
I tried to incorporate all the review comments that came up; some of
them were "outdated" after the removal of the `initialized' member. I'm
sorry if I missed any.

v1…v3:
v2 was never posted. I followed the idea to completely decouple PKRU
from xstate. This didn't quite work and made a few things complicated.
One obvious required fixup is copy_fpstate_to_sigframe(), where the
PKRU state needs to be fiddled into xstate. This required another
xfeatures_mask so that the sanity checks were performed and
xstate_offsets would be computed. Additionally, ptrace also reads/sets
xstate in order to get/set the registers, and PKRU is one of them, so
this would need some fiddling, too.
In v3 I dropped that decoupling idea. I also learned that the wrpkru
instruction is not privileged, so caching it in the kernel does not
work. Instead I keep PKRU in the xstate area and load it at context
switch time while the remaining registers are deferred (until return to
userland). The offset of PKRU within xstate is enumerated at boot time,
so why not use it. A sketch of this eager PKRU switch follows.
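
For illustration, the eager PKRU switch could look roughly like this
(a sketch, not the exact patch; it assumes the post-series
get_xsave_addr() which takes a feature number):

	static inline void switch_pkru_sketch(struct fpu *next_fpu)
	{
		struct pkru_state *pk;

		if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
			return;

		/* The offset of PKRU within xstate was enumerated at boot. */
		pk = get_xsave_addr(&next_fpu->state.xsave, XFEATURE_PKRU);
		/* A NULL pointer means PKRU is in its init state: write 0. */
		__write_pkru(pk ? pk->pkru : 0);
	}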

The following changes since commit bfeffd155283772bbe78c6a05dec7c0128ee500c:

  Linux 5.0-rc1 (2019-01-06 17:08:20 -0800)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/bigeasy/staging.git x86_fpu_rtu_v6

for you to fetch changes up to b7b783d84166b6b74aed2bfd4a07128ff303fed6:

  x86/fpu: Defer FPU state load until return to userspace (2019-01-07 11:32:57 +0100)

----------------------------------------------------------------
Rik van Riel (5):
      x86/fpu: Add (__)make_fpregs_active helpers
      x86/fpu: Eager switch PKRU state
      x86/fpu: Always store the registers in copy_fpstate_to_sigframe()
      x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD
      x86/fpu: Defer FPU state load until return to userspace

Sebastian Andrzej Siewior (17):
      x86/fpu: Remove fpu->initialized usage in __fpu__restore_sig()
      x86/fpu: Remove fpu__restore()
      x86/fpu: Remove preempt_disable() in fpu__clear()
      x86/fpu: Always init the `state' in fpu__clear()
      x86/fpu: Remove fpu->initialized usage in copy_fpstate_to_sigframe()
      x86/fpu: Don't save fxregs for ia32 frames in copy_fpstate_to_sigframe()
      x86/fpu: Remove fpu->initialized
      x86/fpu: Remove user_fpu_begin()
      x86/fpu: Make __raw_xsave_addr() use feature number instead of mask
      x86/fpu: Make get_xsave_field_ptr() and get_xsave_addr() use feature number instead of mask
      x86/fpu: Only write PKRU if it is different from current
      x86/pkeys: Don't check if PKRU is zero before writting it
      x86/entry: Add TIF_NEED_FPU_LOAD
      x86/fpu: Update xstate's PKRU value on write_pkru()
      x86/fpu: Inline copy_user_to_fpregs_zeroing()
      x86/fpu: Let __fpu__restore_sig() restore the !32bit+fxsr frame from kernel memory
      x86/fpu: Merge the two code paths in __fpu__restore_sig()

 Documentation/preempt-locking.txt    |   1 -
 arch/x86/entry/common.c              |   8 ++
 arch/x86/ia32/ia32_signal.c          |  17 +--
 arch/x86/include/asm/fpu/api.h       |  31 ++++++
 arch/x86/include/asm/fpu/internal.h  | 151 +++++++++++---------------
 arch/x86/include/asm/fpu/signal.h    |   2 +-
 arch/x86/include/asm/fpu/types.h     |   9 --
 arch/x86/include/asm/fpu/xstate.h    |   5 +-
 arch/x86/include/asm/pgtable.h       |  20 +++-
 arch/x86/include/asm/special_insns.h |  13 ++-
 arch/x86/include/asm/thread_info.h   |   2 +
 arch/x86/include/asm/trace/fpu.h     |   8 +-
 arch/x86/kernel/fpu/core.c           | 193 ++++++++++++++++-----------------
 arch/x86/kernel/fpu/init.c           |   2 -
 arch/x86/kernel/fpu/regset.c         |  24 +---
 arch/x86/kernel/fpu/signal.c         | 205 ++++++++++++++++-------------------
 arch/x86/kernel/fpu/xstate.c         |  43 ++++----
 arch/x86/kernel/process.c            |   2 +-
 arch/x86/kernel/process_32.c         |  11 +-
 arch/x86/kernel/process_64.c         |  11 +-
 arch/x86/kernel/signal.c             |  17 ++-
 arch/x86/kernel/traps.c              |   2 +-
 arch/x86/kvm/x86.c                   |  48 +++++---
 arch/x86/math-emu/fpu_entry.c        |   3 -
 arch/x86/mm/mpx.c                    |   6 +-
 arch/x86/mm/pkeys.c                  |  14 +--
 26 files changed, 411 insertions(+), 437 deletions(-)

Comments

David Laight Jan. 15, 2019, 12:44 p.m. UTC | #1
From:  Sebastian Andrzej Siewior
> Sent: 09 January 2019 11:47
>
> This is a refurbished series originally started by Rik van Riel. The
> goal is to load the FPU registers on return to userland rather than on
> every context switch. With this optimisation we can:
> - avoid loading the registers if the task stays in the kernel and does
>   not return to userland
> - make kernel_fpu_begin() cheaper: it only saves the registers on the
>   first invocation; a second invocation does not need to save them again.
> 
> To access the FPU registers in the kernel we need to:
> - disable preemption, to keep the scheduler from switching tasks. A
>   task switch would set TIF_NEED_FPU_LOAD and the FPU registers would
>   no longer be valid.
> - disable BH, because a softirq might use kernel_fpu_begin() and then
>   set TIF_NEED_FPU_LOAD instead of loading the FPU registers on
>   completion.

Once this is done it might be worth while adding a parameter to
kernel_fpu_begin() to request the registers only when they don't
need saving.
This would benefit code paths where the gains are reasonable but not massive.

The return value from kernel_fpu_begin() ought to indicate which
registers are available - none, SSE, SSE2, AVX, AVX512 etc.
So code can use an appropriate implementation.
(I've not looked to see if this is already the case!)

	David

Sebastian Andrzej Siewior Jan. 15, 2019, 1:15 p.m. UTC | #2
On 2019-01-15 12:44:53 [+0000], David Laight wrote:
> Once this is done it might be worth while adding a parameter to
> kernel_fpu_begin() to request the registers only when they don't
> need saving.
> This would benefit code paths where the gains are reasonable but not massive.

So if saving + FPU code is a small win, why not do it always?

> The return value from kernel_fpu_begin() ought to indicate which
> registers are available - none, SSE, SSE2, AVX, AVX512 etc.
> So code can use an appropriate implementation.
> (I've not looked to see if this is already the case!)

Either everything is saved or nothing. So if the SSE registers are
saved then the AVX512 ones are, too.
I would like to see some benefit of this first before adding/adjusting
the API in a way which makes it possible to do something wrong. That
said, one thing I would like to do is get rid of irq_fpu_usable() so
code can use the FPU registers and need not implement a fallback.
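
The pattern I mean looks roughly like this today (a sketch; the crc
helpers are made up):

	if (irq_fpu_usable()) {
		kernel_fpu_begin();
		crc = crc32c_simd(buf, len);	/* hypothetical SIMD path */
		kernel_fpu_end();
	} else {
		crc = crc32c_scalar(buf, len);	/* hypothetical fallback */
	}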

> 	David

Sebastian
David Laight Jan. 15, 2019, 2:33 p.m. UTC | #3
From: 'Sebastian Andrzej Siewior'
> Sent: 15 January 2019 13:15
> On 2019-01-15 12:44:53 [+0000], David Laight wrote:
> > Once this is done it might be worth while adding a parameter to
> > kernel_fpu_begin() to request the registers only when they don't
> > need saving.
> > This would benefit code paths where the gains are reasonable but not massive.
> 
> So if saving + FPU code is a small win why not do this always?

I was thinking of the case when the cost of the fpu save is greater
than the saving.
This might be true for (say) a crc on a short buffer.

> > The return value from kernel_fpu_begin() ought to indicate which
> > registers are available - none, SSE, SSE2, AVX, AVX512 etc.
> > So code can use an appropriate implementation.
> > (I've not looked to see if this is already the case!)
> 
> Either everything is saved or nothing. So if SSE registers are saved
> then AVX512 are, too.

(I know that - I've written fpu save code for AVX).
I was thinking that the return value would depend on what the cpu supports.

In fact, given some talk about big-little cpus it might be worth being
able to ask for a specific register set.
Potentially that could cause a processor switch.

	David

Dave Hansen Jan. 15, 2019, 7:46 p.m. UTC | #4
On 1/15/19 4:44 AM, David Laight wrote:
> Once this is done it might be worth while adding a parameter to
> kernel_fpu_begin() to request the registers only when they don't
> need saving.
> This would benefit code paths where the gains are reasonable but not massive.
> 
> The return value from kernel_fpu_begin() ought to indicate which
> registers are available - none, SSE, SSE2, AVX, AVX512 etc.
> So code can use an appropriate implementation.
> (I've not looked to see if this is already the case!)

Yeah, it would be sane to have both a mask passed, and returned, say:

	got = kernel_fpu_begin(XFEATURE_MASK_AVX512, NO_XSAVE_ALLOWED);

	if (got == XFEATURE_MASK_AVX512)
		do_avx_512_goo();
	else
		do_integer_goo();

	kernel_fpu_end(got)

Then, kernel_fpu_begin() can actually work without even *doing* an XSAVE:

	/* Do we have to save state for anything in 'ask_mask'? */
	if (all_states_are_init(ask_mask))
		return ask_mask;

Then kernel_fpu_end() just needs to zero out (re-init) the state, which
it can do with XRSTORS and a careful combination of XSTATE_BV and the
requested feature bitmap (RFBM).
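
A purely hypothetical sketch of that re-init (XSTATE_XRESTORE is the
existing restore macro; the header manipulation here is illustrative
only, not proposed code):

	static void kernel_fpu_end_sketch(u64 got)
	{
		struct xregs_state *xs = &current->thread.fpu.state.xsave;

		/*
		 * Features set in RFBM ('got') but cleared in XSTATE_BV
		 * are put back into their init state by XRSTORS.
		 */
		xs->header.xfeatures &= ~got;
		XSTATE_XRESTORE(xs, (u32)got, (u32)(got >> 32));
	}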

This is all just optimization, though.
Andy Lutomirski Jan. 15, 2019, 8:26 p.m. UTC | #5
On Tue, Jan 15, 2019 at 11:46 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/15/19 4:44 AM, David Laight wrote:
> > Once this is done it might be worth while adding a parameter to
> > kernel_fpu_begin() to request the registers only when they don't
> > need saving.
> > This would benefit code paths where the gains are reasonable but not massive.
> >
> > The return value from kernel_fpu_begin() ought to indicate which
> > registers are available - none, SSE, SSE2, AVX, AVX512 etc.
> > So code can use an appropriate implementation.
> > (I've not looked to see if this is already the case!)
>
> Yeah, it would be sane to have both a mask passed, and returned, say:
>
>         got = kernel_fpu_begin(XFEATURE_MASK_AVX512, NO_XSAVE_ALLOWED);
>
>         if (got == XFEATURE_MASK_AVX512)
>                 do_avx_512_goo();
>         else
>                 do_integer_goo();
>
>         kernel_fpu_end(got)
>
> Then, kernel_fpu_begin() can actually work without even *doing* an XSAVE:
>
>         /* Do we have to save state for anything in 'ask_mask'? */
>         if (all_states_are_init(ask_mask))
>                 return ask_mask;
>
> Then kernel_fpu_end() just needs to zero out (re-init) the state, which
> it can do with XRSTORS and a careful combination of XSTATE_BV and the
> requested feature bitmap (RFBM).
>
> This is all just optimization, though.

I don't think we'd ever want kernel_fpu_end() to restore anything,
right?  I'm a bit confused as to when this optimization would actually
be useful.

Jason Donenfeld has a rather nice API for this in his Zinc series.
Jason, how is that coming?
Dave Hansen Jan. 15, 2019, 8:54 p.m. UTC | #6
On 1/15/19 12:26 PM, Andy Lutomirski wrote:
> I don't think we'd ever want kernel_fpu_end() to restore anything,
> right?  I'm a bit confused as to when this optimization would actually
> be useful.

Using AVX-512 as an example...

Let's say there was AVX-512 state, and a kernel_fpu_begin() user only
used AVX2.  We could totally avoid doing *any* AVX-512 state save/restore.

The init optimization doesn't help us if there _is_ AVX-512 state, and
the modified optimization only helps if we recently did a XRSTOR at
context switch and have not written to AVX-512 state since XRSTOR.

This probably only matters for AVX-512-using apps that have run on a
kernel with lots of kernel_fpu_begin()s that don't use AVX-512.  So, not
a big deal right now.
Andy Lutomirski Jan. 15, 2019, 9:11 p.m. UTC | #7
On Tue, Jan 15, 2019 at 12:54 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/15/19 12:26 PM, Andy Lutomirski wrote:
> > I don't think we'd ever want kernel_fpu_end() to restore anything,
> > right?  I'm a bit confused as to when this optimization would actually
> > be useful.
>
> Using AVX-512 as an example...
>
> Let's say there was AVX-512 state, and a kernel_fpu_begin() user only
> used AVX2.  We could totally avoid doing *any* AVX-512 state save/restore.
>
> The init optimization doesn't help us if there _is_ AVX-512 state, and
> the modified optimization only helps if we recently did a XRSTOR at
> context switch and have not written to AVX-512 state since XRSTOR.
>
> This probably only matters for AVX-512-using apps that have run on a
> kernel with lots of kernel_fpu_begin()s that don't use AVX-512.  So, not
> a big deal right now.

On top of this series, this gets rather awkward, I think -- now we
need to be able to keep track of a state in which some of the user
registers live in the CPU and some live in memory, and we need to be
able to do the partial restore if we go back to user mode like this.
We also need to be able to do a partial save if we end up context
switching.  This seems rather complicated.

Last time I measured it (on Skylake IIRC), a full save was only about
twice as slow as a save that saved nothing at all, so I think we'd
need numbers.
David Laight Jan. 16, 2019, 10:18 a.m. UTC | #8
From: Andy Lutomirski
> Sent: 15 January 2019 20:27
> On Tue, Jan 15, 2019 at 11:46 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 1/15/19 4:44 AM, David Laight wrote:
> > > Once this is done it might be worth while adding a parameter to
> > > kernel_fpu_begin() to request the registers only when they don't
> > > need saving.
> > > This would benefit code paths where the gains are reasonable but not massive.
> > >
> > > The return value from kernel_fpu_begin() ought to indicate which
> > > registers are available - none, SSE, SSE2, AVX, AVX512 etc.
> > > So code can use an appropriate implementation.
> > > (I've not looked to see if this is already the case!)
> >
> > Yeah, it would be sane to have both a mask passed, and returned, say:
> >
> >         got = kernel_fpu_begin(XFEATURE_MASK_AVX512, NO_XSAVE_ALLOWED);

You could merge the two arguments.

> >         if (got == XFEATURE_MASK_AVX512)

	got & XFEATURE_MASK_AVX512

> >                 do_avx_512_goo();
> >         else
> >                 do_integer_goo();
> >
> >         kernel_fpu_end(got)
> >
> > Then, kernel_fpu_begin() can actually work without even *doing* an XSAVE:
> >
> >         /* Do we have to save state for anything in 'ask_mask'? */
> >         if (all_states_are_init(ask_mask))
> >                 return ask_mask;

It almost certainly needs to disable pre-emption - there isn't another
fpu save area.

> >
> > Then kernel_fpu_end() just needs to zero out (re-init) the state, which
> > it can do with XRSTORS and a careful combination of XSTATE_BV and the
> > requested feature bitmap (RFBM).
> >
> > This is all just optimization, though.
> 
> I don't think we'd ever want kernel_fpu_end() to restore anything,
> right?  I'm a bit confused as to when this optimization would actually
> be useful.

The user register restore is deferred to 'return to user'.

What you need to ensure is that the kernel values never leak out
to userspace.

ISTR there is a flag that says that all the AVX registers are zero
(XSAVE writes one, I can't remember if it is readable).
If the registers are all zero I think the kernel code can use them
even if they are 'live' - provided they get zeroed again before
return to user.
I also can't remember whether the fpu flags register is set by AVX
instructions - I know that is a pita to recover.

Also, are all system calls entered via asm stubs that look like real
functions? (I think I've seen inline system calls in a linux binary -
but that was a long time ago.)
If that assumption can be made then, because the AVX registers are all
caller-saved, they are not 'live' on system call entry, so they can be
zeroed and need not be saved on a context switch.
(They still need saving if the kernel is entered by trap or interrupt.)

	David

David Laight Jan. 16, 2019, 10:31 a.m. UTC | #9
From: Andy Lutomirski [mailto:luto@kernel.org]
> On Tue, Jan 15, 2019 at 12:54 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 1/15/19 12:26 PM, Andy Lutomirski wrote:
> > > I don't think we'd ever want kernel_fpu_end() to restore anything,
> > > right?  I'm a bit confused as to when this optimization would actually
> > > be useful.
> >
> > Using AVX-512 as an example...
> >
> > Let's say there was AVX-512 state, and a kernel_fpu_begin() user only
> > used AVX2.  We could totally avoid doing *any* AVX-512 state save/restore.
> >
> > The init optimization doesn't help us if there _is_ AVX-512 state, and
> > the modified optimization only helps if we recently did a XRSTOR at
> > context switch and have not written to AVX-512 state since XRSTOR.
> >
> > This probably only matters for AVX-512-using apps that have run on a
> > kernel with lots of kernel_fpu_begin()s that don't use AVX-512.  So, not
> > a big deal right now.
> 
> On top of this series, this gets rather awkward, I think -- now we
> need to be able to keep track of a state in which some of the user
> registers live in the CPU and some live in memory, and we need to be
> able to do the partial restore if we go back to user mode like this.
> We also need to be able to do a partial save if we end up context
> switching.  This seems rather complicated.

If kernel_fpu_begin() requests registers that are 'live' for userspace,
or if the user registers have been saved, then you (more or less) have
to disable pre-emption.
OTOH if the kernel wants the AVX2 registers and the user ones are all 0,
then the kernel can just use the registers provided kernel_fpu_end()
zeroes them. In this case you can allow pre-emption, because everything
will be saved and restored correctly (it will need to be restored when
the process is scheduled, not on return to user).
The register save area might need zapping (if used) because it might
be readable from user space (by a debugger).

The other case is kernel code that guarantees to save and restore
any registers it uses (it might only want 2 registers for a CRC).
Such code can nest with other kernel users (eg in an ISR).
I'm not sure whether it needs a small 'save area' for the fpu flags?
It might be worth adding such a structure to the interface - even
if it is currently a dummy structure.

	David

Borislav Petkov Jan. 30, 2019, 11:35 a.m. UTC | #10
On Wed, Jan 09, 2019 at 12:47:22PM +0100, Sebastian Andrzej Siewior wrote:
> This is a refurbished series originally started by Rik van Riel. The
> goal is to load the FPU registers on return to userland rather than on
> every context switch. With this optimisation we can:
> - avoid loading the registers if the task stays in the kernel and does
>   not return to userland
> - make kernel_fpu_begin() cheaper: it only saves the registers on the
>   first invocation; a second invocation does not need to save them again.

Btw, do we have any benchmark data showing the improvement this brings?
Sebastian Andrzej Siewior Jan. 30, 2019, 12:06 p.m. UTC | #11
On 2019-01-30 12:35:55 [+0100], Borislav Petkov wrote:
> On Wed, Jan 09, 2019 at 12:47:22PM +0100, Sebastian Andrzej Siewior wrote:
> > This is a refurbished series originally started by Rik van Riel. The
> > goal is to load the FPU registers on return to userland rather than on
> > every context switch. With this optimisation we can:
> > - avoid loading the registers if the task stays in the kernel and does
> >   not return to userland
> > - make kernel_fpu_begin() cheaper: it only saves the registers on the
> >   first invocation; a second invocation does not need to save them again.
> 
> Btw, do we have any benchmark data showing the improvement this brings?

nope. There is sig_lat or something like that which measures how many
signals you can handle per second. That could show how bad the changes
are in the signal path.
I don't think I need any numbers to show that every invocation of
kernel_fpu_begin() but the first is "free" - that would be the claim in
the second bullet.
As for the first bullet: hmmm. I could add a trace point to see how
often we enter schedule() without having saved the FPU registers of
user tasks. The benefit would then be that we didn't restore them while
leaving schedule() and didn't save them while entering schedule()
(again).
I don't know if hackbench would show anything besides noise.

Sebastian
Borislav Petkov Jan. 30, 2019, 12:27 p.m. UTC | #12
On Wed, Jan 30, 2019 at 01:06:47PM +0100, Sebastian Andrzej Siewior wrote:
> I don't know if hackbench would show anything besides noise.

Yeah, if a sensible benchmark (dunno if hackbench is among them :))
shows no difference, that is also saying something.
Sebastian Andrzej Siewior Feb. 8, 2019, 1:12 p.m. UTC | #13
On 2019-01-30 13:27:13 [+0100], Borislav Petkov wrote:
> On Wed, Jan 30, 2019 at 01:06:47PM +0100, Sebastian Andrzej Siewior wrote:
> > I don't know if hackbench would show anything besides noise.
> 
> Yeah, if a sensible benchmark (dunno if hackbench is among them :))
> shows no difference, that is also saying something.

"hackbench -g80 -l 1000 -s 255" shows just noise. I don't see any
reasonable difference with or without the series.

Tracing. The following patch

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c5a6edd92de4f..aa1914e5bf5c0 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -292,6 +292,7 @@ struct fpu {
 	 * FPU state should be reloaded next time the task is run.
 	 */
 	unsigned int			last_cpu;
+	unsigned int avoided_loads;
 
 	/*
 	 * @state:
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index c98c54e796186..7560942a550ed 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -358,9 +358,11 @@ void fpu__clear(struct fpu *fpu)
  */
 void switch_fpu_return(void)
 {
+	struct fpu *fpu = &current->thread.fpu;
+
 	if (!static_cpu_has(X86_FEATURE_FPU))
 		return;
-
+	fpu->avoided_loads = 0;
 	__fpregs_load_activate();
 }
 EXPORT_SYMBOL_GPL(switch_fpu_return);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 37b2ecef041e6..875f74b1e8779 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -522,6 +522,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	if (!test_thread_flag(TIF_NEED_FPU_LOAD))
 		switch_fpu_prepare(prev_fpu, cpu);
+	else if (current->mm) {
+		prev_fpu->avoided_loads++;
+		trace_printk("skipped save %d\n", prev_fpu->avoided_loads);
+	}
 
 	/* We must save %fs and %gs before load_TLS() because
 	 * %fs and %gs may be cleared by load_TLS().

should help to spot the optimization. So if TIF_NEED_FPU_LOAD is set at
this point, then between this and the previous invocation of schedule()
we haven't been in userland, and so we avoided loading + storing the
FPU registers. I saw things like:

|  http-1935  [001] d..2   223.460434: sched_switch: prev_comm=http prev_pid=1935 prev_prio=120 prev_state=R+ ==> next_comm=apt next_pid=1931 next_prio=120
|   apt-1931  [001] d..2   223.460680: sched_switch: prev_comm=apt prev_pid=1931 prev_prio=120 prev_state=D ==> next_comm=http next_pid=1935 next_prio=120
|  http-1935  [001] d..2   223.460729: sched_switch: prev_comm=http prev_pid=1935 prev_prio=120 prev_state=R+ ==> next_comm=apt next_pid=1931 next_prio=120
|  http-1935  [001] d..2   223.460732: __switch_to: skipped save 1
|   apt-1931  [001] d..2   223.461076: sched_switch: prev_comm=apt prev_pid=1931 prev_prio=120 prev_state=D ==> next_comm=http next_pid=1935 next_prio=120
|  http-1935  [001] d..2   223.461111: sched_switch: prev_comm=http prev_pid=1935 prev_prio=120 prev_state=R+ ==> next_comm=apt next_pid=1931 next_prio=120
|  http-1935  [001] d..2   223.461112: __switch_to: skipped save 2

which means we avoided loading the FPU registers for `http' because for
some reason it was not required. Here we switched between two user
tasks, so without the patches we would have had to save and restore
them.

I captured a few instances of something like:

|  rcu_preempt-10    [000] d..2  1032.867293: sched_switch: prev_comm=rcu_preempt prev_pid=10 prev_prio=98 prev_state=I ==> next_comm=kswapd0 next_pid=536 next_prio=120
|          apt-1954  [001] d..2  1032.867435: sched_switch: prev_comm=apt prev_pid=1954 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:0 next_pid=1943 next_prio=120
|          apt-1954  [001] d..2  1032.867436: __switch_to: skipped save 30
|  kworker/1:0-1943  [001] d..2  1032.867455: sched_switch: prev_comm=kworker/1:0 prev_pid=1943 prev_prio=120 prev_state=I ==> next_comm=apt next_pid=1954 next_prio=120
|          apt-1954  [001] d..2  1032.867459: sched_switch: prev_comm=apt prev_pid=1954 prev_prio=120 prev_state=D ==> next_comm=swapper/1 next_pid=0 next_prio=120
|          apt-1954  [001] d..2  1032.867460: __switch_to: skipped save 31

We avoided restoring and saving the FPU registers of `apt' 31 times (in
a row). This isn't 100% accurate: we switched to and from a kernel
thread, so switch_fpu_finish() wouldn't load the registers because the
switch to the kernel thread (switch_fpu_prepare()) would not destroy
them. *However*, the switch away from `apt' would save the FPU
registers, so we do avoid that (the current code always saves the FPU
registers on context switch, see switch_fpu_prepare()).
My understanding is that if the CPU supports `xsaves' then it wouldn't
save anything in this scenario, because the CPU would notice that the
FPU state didn't change since the last save, so there is no need to
save anything.

Then we have lat_sig [0]. Without the series 64bit:
|Signal handler overhead: 2.6839 microseconds
|Signal handler overhead: 2.6996 microseconds
|Signal handler overhead: 2.6821 microseconds

with the series:
|Signal handler overhead: 3.2976 microseconds
|Signal handler overhead: 3.3033 microseconds
|Signal handler overhead: 3.2980 microseconds

that is approximately 22% worse. Without the series 64bit kernel with
32bit binary:
| Signal handler overhead: 3.8139 microseconds
| Signal handler overhead: 3.8035 microseconds
| Signal handler overhead: 3.8127 microseconds

with the series:
| Signal handler overhead: 4.0434 microseconds
| Signal handler overhead: 4.0438 microseconds
| Signal handler overhead: 4.0408 microseconds

approximately 6% worse. I'm a little surprised by the 32bit case
because it did a save+copy earlier (while the 64bit case saved directly
to the signal stack).

If we restore directly from the signal stack (instead of using
copy_from_user()) we get to (64bit only):
| Signal handler overhead: 3.0376 microseconds
| Signal handler overhead: 3.0687 microseconds
| Signal handler overhead: 3.0510 microseconds

and if we additionally save the registers to the signal stack:
| Signal handler overhead: 2.7835 microseconds
| Signal handler overhead: 2.7850 microseconds
| Signal handler overhead: 2.7766 microseconds

then we get almost back to where we started. I will run a
commit-by-commit bench to see if I notice something.
Ach, and this was PREEMPT on a
|x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
machine. So those with AVX-512 might be worse, but I don't have any of
those.

[0] Part of lmbench, test
    taskset 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_sig -P 1 -W 64 -N 5000 catch

Sebastian
Sebastian Andrzej Siewior Feb. 13, 2019, 3:54 p.m. UTC | #14
On 2019-02-08 14:12:33 [+0100], To Borislav Petkov wrote:
> Then we have lat_sig [0]. Without the series 64bit:
> |Signal handler overhead: 2.6839 microseconds
> |Signal handler overhead: 2.6996 microseconds
> |Signal handler overhead: 2.6821 microseconds
> 
> with the series:
> |Signal handler overhead: 3.2976 microseconds
> |Signal handler overhead: 3.3033 microseconds
> |Signal handler overhead: 3.2980 microseconds

Did a patch-by-patch run (64bit only, server preemption model, output in
us ("commit")):

2.368 ("Linux 5.0-rc5")

2.603 ("x86/fpu: Always store the registers in copy_fpstate_to_sigframe()")
  copy_fpstate_to_sigframe() stores to thread's FPU area and then copies
  user stack area.

2.668 ("x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD")
  this should be noise since preempt_disable/enable is a nop -
  test_thread_flag() isn't.

2.701 ("x86/fpu: Inline copy_user_to_fpregs_zeroing()")
  This pops up somehow but is simply code movement.

3.474 ("x86/fpu: Let __fpu__restore_sig() restore the !32bit+fxsr frame from kernel memory")
  This stands out. There a kmalloc() + saving to kernel memory + copy
  instead a direct save to kernel stack.

2.928 ("x86/fpu: Defer FPU state load until return to userspace")
  The kmalloc() has been removed. Just "copy-to-kernel-memory" and
  copy_to_user() remained.

So this looks like 0.3us for the save+copy plus 0.3us for the
copy+restore. The numbers for the preempt/low-latency-desktop models
show the same two spikes and the drop at the end.

Sebastian