Message ID | 20190109114744.10936-1-bigeasy@linutronix.de (mailing list archive) |
---|---|
State | New, archived |
Series | [v6] x86: load FPU registers on return to userland |
From: Sebastian Andrzej Siewior
> Sent: 09 January 2019 11:47
>
> This is a refurbished series originally started by Rik van Riel. The
> goal is to load the FPU registers on return to userland and not on every
> context switch. With this optimisation we can:
> - avoid loading the registers if the task stays in the kernel and does
>   not return to userland
> - make kernel_fpu_begin() cheaper: it only saves the registers on the
>   first invocation. The second invocation does not need to save them again.
>
> To access the FPU registers in the kernel we need to:
> - disable preemption so that the scheduler cannot switch tasks. A switch
>   would set TIF_NEED_FPU_LOAD and the FPU registers would no longer be
>   valid.
> - disable BH because a softirq might use kernel_fpu_begin() and then set
>   TIF_NEED_FPU_LOAD instead of loading the FPU registers on completion.

Once this is done it might be worthwhile adding a parameter to
kernel_fpu_begin() to request the registers only when they don't need
saving. This would benefit code paths where the gains are reasonable but
not massive.

The return value from kernel_fpu_begin() ought to indicate which
registers are available - none, SSE, SSE2, AVX, AVX512 etc. - so code
can use an appropriate implementation.
(I've not looked to see if this is already the case!)

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
On 2019-01-15 12:44:53 [+0000], David Laight wrote:
> Once this is done it might be worthwhile adding a parameter to
> kernel_fpu_begin() to request the registers only when they don't
> need saving.
> This would benefit code paths where the gains are reasonable but not massive.

So if saving + FPU code is a small win, why not do this always?

> The return value from kernel_fpu_begin() ought to indicate which
> registers are available - none, SSE, SSE2, AVX, AVX512 etc.
> So code can use an appropriate implementation.
> (I've not looked to see if this is already the case!)

Either everything is saved or nothing. So if SSE registers are saved
then AVX512 registers are, too.
I would like to see some benefit of this first before adding/adjusting
the API in a way which makes it possible to do something wrong.

That said, one thing I would like to do is to get rid of
irq_fpu_usable() so code can use FPU registers and need not implement a
fallback.

> 	David

Sebastian
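Sebastian's point about irq_fpu_usable() refers to the pattern where every FPU-using call site has to carry an integer fallback in case the FPU is not usable in the current context. A stand-alone userspace sketch of that pattern (all names are illustrative, not the real kernel API):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

static bool fpu_usable_now;	/* stands in for irq_fpu_usable() */

/* stand-in for an SSE/AVX-accelerated implementation */
static uint32_t sum_simd(const uint8_t *buf, size_t len)
{
	uint32_t s = 0;

	for (size_t i = 0; i < len; i++)
		s += buf[i];
	return s;
}

/* plain integer fallback that every caller currently has to carry */
static uint32_t sum_scalar(const uint8_t *buf, size_t len)
{
	uint32_t s = 0;

	for (size_t i = 0; i < len; i++)
		s += buf[i];
	return s;
}

static uint32_t checksum(const uint8_t *buf, size_t len)
{
	if (fpu_usable_now) {
		/* kernel_fpu_begin(); */
		uint32_t s = sum_simd(buf, len);
		/* kernel_fpu_end(); */
		return s;
	}
	return sum_scalar(buf, len);	/* the fallback Sebastian wants gone */
}
```

If irq_fpu_usable() could be removed, the scalar branch and the duplicate implementation would disappear from every call site.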
From: 'Sebastian Andrzej Siewior'
> Sent: 15 January 2019 13:15
> On 2019-01-15 12:44:53 [+0000], David Laight wrote:
> > Once this is done it might be worthwhile adding a parameter to
> > kernel_fpu_begin() to request the registers only when they don't
> > need saving.
> > This would benefit code paths where the gains are reasonable but not massive.
>
> So if saving + FPU code is a small win, why not do this always?

I was thinking of the case where the cost of the fpu save is greater
than the saving. This might be true for (say) a crc on a short buffer.

> > The return value from kernel_fpu_begin() ought to indicate which
> > registers are available - none, SSE, SSE2, AVX, AVX512 etc.
> > So code can use an appropriate implementation.
> > (I've not looked to see if this is already the case!)
>
> Either everything is saved or nothing. So if SSE registers are saved
> then AVX512 registers are, too.

(I know that - I've written fpu save code for AVX.)
I was thinking that the return value would depend on what the cpu
supports. In fact, given some talk about big-little cpus, it might be
worth being able to ask for a specific register set. Potentially that
could cause a processor switch.

	David
On 1/15/19 4:44 AM, David Laight wrote:
> Once this is done it might be worthwhile adding a parameter to
> kernel_fpu_begin() to request the registers only when they don't
> need saving.
> This would benefit code paths where the gains are reasonable but not massive.
>
> The return value from kernel_fpu_begin() ought to indicate which
> registers are available - none, SSE, SSE2, AVX, AVX512 etc.
> So code can use an appropriate implementation.
> (I've not looked to see if this is already the case!)

Yeah, it would be sane to have both a mask passed, and returned, say:

	got = kernel_fpu_begin(XFEATURE_MASK_AVX512, NO_XSAVE_ALLOWED);

	if (got == XFEATURE_MASK_AVX512)
		do_avx_512_goo();
	else
		do_integer_goo();

	kernel_fpu_end(got);

Then, kernel_fpu_begin() can actually work without even *doing* an XSAVE:

	/* Do we have to save state for anything in 'ask_mask'? */
	if (all_states_are_init(ask_mask))
		return ask_mask;

Then kernel_fpu_end() just needs to zero out (re-init) the state, which
it can do with XRSTORS and a careful combination of XSTATE_BV and the
requested feature bitmap (RFBM).

This is all just optimization, though.
On Tue, Jan 15, 2019 at 11:46 AM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/15/19 4:44 AM, David Laight wrote:
> > Once this is done it might be worthwhile adding a parameter to
> > kernel_fpu_begin() to request the registers only when they don't
> > need saving.
> > This would benefit code paths where the gains are reasonable but not massive.
> >
> > The return value from kernel_fpu_begin() ought to indicate which
> > registers are available - none, SSE, SSE2, AVX, AVX512 etc.
> > So code can use an appropriate implementation.
> > (I've not looked to see if this is already the case!)
>
> Yeah, it would be sane to have both a mask passed, and returned, say:
>
> 	got = kernel_fpu_begin(XFEATURE_MASK_AVX512, NO_XSAVE_ALLOWED);
>
> 	if (got == XFEATURE_MASK_AVX512)
> 		do_avx_512_goo();
> 	else
> 		do_integer_goo();
>
> 	kernel_fpu_end(got);
>
> Then, kernel_fpu_begin() can actually work without even *doing* an XSAVE:
>
> 	/* Do we have to save state for anything in 'ask_mask'? */
> 	if (all_states_are_init(ask_mask))
> 		return ask_mask;
>
> Then kernel_fpu_end() just needs to zero out (re-init) the state, which
> it can do with XRSTORS and a careful combination of XSTATE_BV and the
> requested feature bitmap (RFBM).
>
> This is all just optimization, though.

I don't think we'd ever want kernel_fpu_end() to restore anything,
right? I'm a bit confused as to when this optimization would actually
be useful.

Jason Donenfeld has a rather nice API for this in his Zinc series.
Jason, how is that coming?
On 1/15/19 12:26 PM, Andy Lutomirski wrote:
> I don't think we'd ever want kernel_fpu_end() to restore anything,
> right? I'm a bit confused as to when this optimization would actually
> be useful.

Using AVX-512 as an example...

Let's say there was AVX-512 state, and a kernel_fpu_begin() user only
used AVX2. We could totally avoid doing *any* AVX-512 state save/restore.

The init optimization doesn't help us if there _is_ AVX-512 state, and
the modified optimization only helps if we recently did an XRSTOR at
context switch and have not written to AVX-512 state since the XRSTOR.

This probably only matters for AVX-512-using apps that have run on a
kernel with lots of kernel_fpu_begin()s that don't use AVX-512. So, not
a big deal right now.
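Dave's save-avoidance argument reduces to a mask intersection: state needs saving only for features that are both live in the user registers and requested by the kernel_fpu_begin() caller; an empty intersection means the XSAVE can be skipped entirely. A userspace sketch with invented mask values and helper names (not the real kernel implementation):

```c
#include <stdint.h>

/* illustrative feature bits, not the kernel's XFEATURE_* values */
#define XFEATURE_MASK_AVX2	0x04u
#define XFEATURE_MASK_AVX512	0xe0u

/*
 * Return the feature set that actually needs an XSAVE, given what is
 * live in the user registers and what the kernel caller asked for.
 * 0 means the save can be skipped entirely.
 */
static uint32_t states_to_save(uint32_t live_mask, uint32_t ask_mask)
{
	return live_mask & ask_mask;
}
```

So a caller that only wants AVX2 never forces an AVX-512 save, and a request whose features are all in their init state saves nothing.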
On Tue, Jan 15, 2019 at 12:54 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 1/15/19 12:26 PM, Andy Lutomirski wrote:
> > I don't think we'd ever want kernel_fpu_end() to restore anything,
> > right? I'm a bit confused as to when this optimization would actually
> > be useful.
>
> Using AVX-512 as an example...
>
> Let's say there was AVX-512 state, and a kernel_fpu_begin() user only
> used AVX2. We could totally avoid doing *any* AVX-512 state save/restore.
>
> The init optimization doesn't help us if there _is_ AVX-512 state, and
> the modified optimization only helps if we recently did an XRSTOR at
> context switch and have not written to AVX-512 state since the XRSTOR.
>
> This probably only matters for AVX-512-using apps that have run on a
> kernel with lots of kernel_fpu_begin()s that don't use AVX-512. So, not
> a big deal right now.

On top of this series, this gets rather awkward, I think -- now we need
to be able to keep track of a state in which some of the user registers
live in the CPU and some live in memory, and we need to be able to do
the partial restore if we go back to user mode like this. We also need
to be able to do a partial save if we end up context switching. This
seems rather complicated.

Last time I measured it (on Skylake IIRC), a full save was only about
twice as slow as a save that saved nothing at all, so I think we'd need
numbers.
From: Andy Lutomirski
> Sent: 15 January 2019 20:27
> On Tue, Jan 15, 2019 at 11:46 AM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 1/15/19 4:44 AM, David Laight wrote:
> > > Once this is done it might be worthwhile adding a parameter to
> > > kernel_fpu_begin() to request the registers only when they don't
> > > need saving.
> > > This would benefit code paths where the gains are reasonable but not massive.
> > >
> > > The return value from kernel_fpu_begin() ought to indicate which
> > > registers are available - none, SSE, SSE2, AVX, AVX512 etc.
> > > So code can use an appropriate implementation.
> > > (I've not looked to see if this is already the case!)
> >
> > Yeah, it would be sane to have both a mask passed, and returned, say:
> >
> > 	got = kernel_fpu_begin(XFEATURE_MASK_AVX512, NO_XSAVE_ALLOWED);

You could merge the two arguments.

> > 	if (got == XFEATURE_MASK_AVX512)

	got & XFEATURE_MASK_AVX512

> > 		do_avx_512_goo();
> > 	else
> > 		do_integer_goo();
> >
> > 	kernel_fpu_end(got);
> >
> > Then, kernel_fpu_begin() can actually work without even *doing* an XSAVE:
> >
> > 	/* Do we have to save state for anything in 'ask_mask'? */
> > 	if (all_states_are_init(ask_mask))
> > 		return ask_mask;

It almost certainly needs to disable pre-emption - there isn't another
fpu save area.

> > Then kernel_fpu_end() just needs to zero out (re-init) the state, which
> > it can do with XRSTORS and a careful combination of XSTATE_BV and the
> > requested feature bitmap (RFBM).
> >
> > This is all just optimization, though.
>
> I don't think we'd ever want kernel_fpu_end() to restore anything,
> right? I'm a bit confused as to when this optimization would actually
> be useful.

The user register restore is deferred to 'return to user'.
What you need to ensure is that the kernel values never leak out to
userspace.

ISTR there is a flag that says that all the AVX registers are zero
(XSAVE writes one, I can't remember if it is readable).
If the registers are all zero I think the kernel code can use them even
if they are 'live' - provided they get zeroed again before the return to
user.

I also can't remember whether the fpu flags register is set by AVX
instructions - I know that it is a pita to recover.

Also, are all system calls entered via asm stubs that look like real
functions? (I think I've seen inline system calls in a linux binary -
but that was a long time ago.) If that assumption can be made then,
because the AVX registers are all caller-saved, they are not 'live' on
system call entry so can be zeroed and need not be saved on a context
switch. (They still need saving if the kernel is entered by trap or
interrupt.)

	David
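David's merged-argument idea, together with testing the grant with `&` rather than `==` so a wider grant still works, can be sketched as a userspace model. The XFEATURE_* values and every function name below are invented for illustration; this is not the real kernel interface:

```c
#include <stdint.h>

/* illustrative feature bits, not the kernel's real XFEATURE_* values */
#define XFEATURE_MASK_SSE	0x02u
#define XFEATURE_MASK_AVX	0x04u
#define XFEATURE_MASK_AVX512	0xe0u

/* what this CPU offers; a big-little system might differ per core */
static uint32_t cpu_feature_mask = XFEATURE_MASK_SSE | XFEATURE_MASK_AVX;

/* single merged argument: the caller asks, the subset it may use comes back */
static uint32_t kernel_fpu_begin_mask(uint32_t ask)
{
	/* grant only what the CPU supports; 0 means "use integer code" */
	return ask & cpu_feature_mask;
}

/* caller pattern: test the grant with '&', not '==' */
static int pick_impl(uint32_t ask)
{
	uint32_t got = kernel_fpu_begin_mask(ask);

	if (got & XFEATURE_MASK_AVX512)
		return 512;	/* do_avx512_goo() */
	if (got & XFEATURE_MASK_AVX)
		return 256;	/* do_avx_goo() */
	return 0;		/* do_integer_goo() */
}
```

On a CPU without AVX-512 an AVX-512-only request falls back to integer code, while a request for "AVX-512 or AVX" still gets the AVX path.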
From: Andy Lutomirski [mailto:luto@kernel.org]
> On Tue, Jan 15, 2019 at 12:54 PM Dave Hansen <dave.hansen@intel.com> wrote:
> >
> > On 1/15/19 12:26 PM, Andy Lutomirski wrote:
> > > I don't think we'd ever want kernel_fpu_end() to restore anything,
> > > right? I'm a bit confused as to when this optimization would actually
> > > be useful.
> >
> > Using AVX-512 as an example...
> >
> > Let's say there was AVX-512 state, and a kernel_fpu_begin() user only
> > used AVX2. We could totally avoid doing *any* AVX-512 state save/restore.
> >
> > The init optimization doesn't help us if there _is_ AVX-512 state, and
> > the modified optimization only helps if we recently did an XRSTOR at
> > context switch and have not written to AVX-512 state since the XRSTOR.
> >
> > This probably only matters for AVX-512-using apps that have run on a
> > kernel with lots of kernel_fpu_begin()s that don't use AVX-512. So, not
> > a big deal right now.
>
> On top of this series, this gets rather awkward, I think -- now we
> need to be able to keep track of a state in which some of the user
> registers live in the CPU and some live in memory, and we need to be
> able to do the partial restore if we go back to user mode like this.
> We also need to be able to do a partial save if we end up context
> switching. This seems rather complicated.

If kernel_fpu_begin() requests registers that are 'live' for userspace,
or if the user registers have been saved, then you (more or less) have
to disable pre-emption.

OTOH if the kernel wants the AVX2 registers and the user ones are all 0
then the kernel can just use the registers, provided kernel_fpu_end()
zeroes them. In this case you can allow pre-emption because a context
switch will save everything and it will all get restored correctly (it
will need to be restored when the process is scheduled, not on return to
user). The register save area might need zapping (if used) because it
might be readable from user space (by a debugger).
The other case is kernel code that guarantees to save and restore any
registers it uses (it might only want 2 registers for a CRC). Such code
can nest with other kernel users (eg in an ISR). I'm not sure whether it
needs a small 'save area' for the fpu flags? It might be worth adding
such a structure to the interface - even if it is currently a dummy
structure.

	David
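The dummy save-area structure David suggests might look like the following userspace model, where a plain variable stands in for MXCSR (the SSE control/status register - the "fpu flags"). Every name here is hypothetical; the point is only that passing a struct, even a nearly empty one, lets the interface grow without touching callers:

```c
#include <stdint.h>

static uint32_t mxcsr_model = 0x1f80;	/* power-on default MXCSR value */

/* hypothetical per-nesting-level save area */
struct kernel_fpu_ctx {
	uint32_t mxcsr;
	/* currently nothing else; reserved so the interface can grow */
};

/* nestable begin: the caller promises to restore registers it touches,
 * so only the flags state needs stashing here */
static void kernel_fpu_nested_begin(struct kernel_fpu_ctx *ctx)
{
	ctx->mxcsr = mxcsr_model;
}

static void kernel_fpu_nested_end(const struct kernel_fpu_ctx *ctx)
{
	mxcsr_model = ctx->mxcsr;	/* restore flags on the way out */
}
```

Because each nesting level carries its own context on its stack, an ISR using the same pair cannot corrupt the outer user's flags.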
On Wed, Jan 09, 2019 at 12:47:22PM +0100, Sebastian Andrzej Siewior wrote:
> This is a refurbished series originally started by Rik van Riel. The
> goal is to load the FPU registers on return to userland and not on every
> context switch. With this optimisation we can:
> - avoid loading the registers if the task stays in the kernel and does
>   not return to userland
> - make kernel_fpu_begin() cheaper: it only saves the registers on the
>   first invocation. The second invocation does not need to save them again.

Btw, do we have any benchmark data showing the improvement this brings?
On 2019-01-30 12:35:55 [+0100], Borislav Petkov wrote:
> On Wed, Jan 09, 2019 at 12:47:22PM +0100, Sebastian Andrzej Siewior wrote:
> > This is a refurbished series originally started by Rik van Riel. The
> > goal is to load the FPU registers on return to userland and not on every
> > context switch. With this optimisation we can:
> > - avoid loading the registers if the task stays in the kernel and does
> >   not return to userland
> > - make kernel_fpu_begin() cheaper: it only saves the registers on the
> >   first invocation. The second invocation does not need to save them again.
>
> Btw, do we have any benchmark data showing the improvement this brings?

Nope. There is sig_lat or something like that which would measure how
many signals you can handle per second. That could show how bad the
changes are in the signal path. I don't think that I need any numbers to
show that everything but the first invocation of kernel_fpu_begin() is
"free" - that is the claim in the second bullet.

And for the first bullet, hmmm. I could add a trace point to see how
often we enter schedule() without saving FPU registers for user tasks.
The benefit would then be that we didn't restore them while leaving
schedule() and didn't save them while entering schedule() (again). I
don't know if hackbench would show anything besides noise.

Sebastian
On Wed, Jan 30, 2019 at 01:06:47PM +0100, Sebastian Andrzej Siewior wrote:
> I don't know if hackbench would show anything besides noise.
Yeah, if a sensible benchmark (dunno if hackbench is among them :))
shows no difference, that is also saying something.
On 2019-01-30 13:27:13 [+0100], Borislav Petkov wrote:
> On Wed, Jan 30, 2019 at 01:06:47PM +0100, Sebastian Andrzej Siewior wrote:
> > I don't know if hackbench would show anything besides noise.
>
> Yeah, if a sensible benchmark (dunno if hackbench is among them :))
> shows no difference, is also saying something.

"hackbench -g80 -l 1000 -s 255" shows just noise. I don't see any
reasonable difference with or without the series.

Tracing. The following patch

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c5a6edd92de4f..aa1914e5bf5c0 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -292,6 +292,7 @@ struct fpu {
 	 * FPU state should be reloaded next time the task is run.
 	 */
 	unsigned int			last_cpu;
+	unsigned int			avoided_loads;
 
 	/*
 	 * @state:
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index c98c54e796186..7560942a550ed 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -358,9 +358,11 @@ void fpu__clear(struct fpu *fpu)
  */
 void switch_fpu_return(void)
 {
+	struct fpu *fpu = &current->thread.fpu;
+
 	if (!static_cpu_has(X86_FEATURE_FPU))
 		return;
-
+	fpu->avoided_loads = 0;
 	__fpregs_load_activate();
 }
 EXPORT_SYMBOL_GPL(switch_fpu_return);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 37b2ecef041e6..875f74b1e8779 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -522,6 +522,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	if (!test_thread_flag(TIF_NEED_FPU_LOAD))
 		switch_fpu_prepare(prev_fpu, cpu);
+	else if (current->mm) {
+		prev_fpu->avoided_loads++;
+		trace_printk("skipped save %d\n", prev_fpu->avoided_loads);
+	}
 
 	/* We must save %fs and %gs before load_TLS() because
 	 * %fs and %gs may be cleared by load_TLS().

should help to spot the optimization.
So if TIF_NEED_FPU_LOAD is set at this point then between this and the
last invocation of schedule() we haven't been in userland and so we
avoided loading + storing of FPU registers. I saw things like:

| http-1935  [001] d..2  223.460434: sched_switch: prev_comm=http prev_pid=1935 prev_prio=120 prev_state=R+ ==> next_comm=apt next_pid=1931 next_prio=120
| apt-1931   [001] d..2  223.460680: sched_switch: prev_comm=apt prev_pid=1931 prev_prio=120 prev_state=D ==> next_comm=http next_pid=1935 next_prio=120
| http-1935  [001] d..2  223.460729: sched_switch: prev_comm=http prev_pid=1935 prev_prio=120 prev_state=R+ ==> next_comm=apt next_pid=1931 next_prio=120
| http-1935  [001] d..2  223.460732: __switch_to: skipped save 1
| apt-1931   [001] d..2  223.461076: sched_switch: prev_comm=apt prev_pid=1931 prev_prio=120 prev_state=D ==> next_comm=http next_pid=1935 next_prio=120
| http-1935  [001] d..2  223.461111: sched_switch: prev_comm=http prev_pid=1935 prev_prio=120 prev_state=R+ ==> next_comm=apt next_pid=1931 next_prio=120
| http-1935  [001] d..2  223.461112: __switch_to: skipped save 2

which means we avoided loading the FPU registers for `http' because for
some reason it was not required. Here we switched between two user tasks
so without the patches we would have had to save and restore them.
I captured a few instances of something like:

| rcu_preempt-10   [000] d..2 1032.867293: sched_switch: prev_comm=rcu_preempt prev_pid=10 prev_prio=98 prev_state=I ==> next_comm=kswapd0 next_pid=536 next_prio=120
| apt-1954         [001] d..2 1032.867435: sched_switch: prev_comm=apt prev_pid=1954 prev_prio=120 prev_state=R+ ==> next_comm=kworker/1:0 next_pid=1943 next_prio=120
| apt-1954         [001] d..2 1032.867436: __switch_to: skipped save 30
| kworker/1:0-1943 [001] d..2 1032.867455: sched_switch: prev_comm=kworker/1:0 prev_pid=1943 prev_prio=120 prev_state=I ==> next_comm=apt next_pid=1954 next_prio=120
| apt-1954         [001] d..2 1032.867459: sched_switch: prev_comm=apt prev_pid=1954 prev_prio=120 prev_state=D ==> next_comm=swapper/1 next_pid=0 next_prio=120
| apt-1954         [001] d..2 1032.867460: __switch_to: skipped save 31

Here restoring and saving the FPU registers of `apt' was avoided 31
times (in a row). This isn't 100% true: we switched to and from a
kernel thread, so switch_fpu_finish() wouldn't load the registers
anyway, because the switch to the kernel thread (switch_fpu_prepare())
would not destroy them. *However*, the switch away from `apt' would
save the FPU registers, so we avoid that (the current code always saves
the FPU registers on context switch, see switch_fpu_prepare()).
My understanding is that if the CPU supports `xsaves' then it wouldn't
save anything in this scenario because the CPU would notice that its
FPU state didn't change since last time, so there would be no need to
save anything.

Then we have lat_sig [0]. Without the series, 64bit:
|Signal handler overhead: 2.6839 microseconds
|Signal handler overhead: 2.6996 microseconds
|Signal handler overhead: 2.6821 microseconds

with the series:
|Signal handler overhead: 3.2976 microseconds
|Signal handler overhead: 3.3033 microseconds
|Signal handler overhead: 3.2980 microseconds

That is approximately 22% worse.
Without the series, 64bit kernel with a 32bit binary:
| Signal handler overhead: 3.8139 microseconds
| Signal handler overhead: 3.8035 microseconds
| Signal handler overhead: 3.8127 microseconds

with the series:
| Signal handler overhead: 4.0434 microseconds
| Signal handler overhead: 4.0438 microseconds
| Signal handler overhead: 4.0408 microseconds

approximately 6% worse. I'm a little surprised by the 32bit case
because it did save+copy even before the series (while the 64bit case
saved directly to the signal stack).

If we restore directly from the signal stack (instead of the
copy_from_user()) we get to (64bit only):
| Signal handler overhead: 3.0376 microseconds
| Signal handler overhead: 3.0687 microseconds
| Signal handler overhead: 3.0510 microseconds

and if we additionally save the registers to the signal stack:
| Signal handler overhead: 2.7835 microseconds
| Signal handler overhead: 2.7850 microseconds
| Signal handler overhead: 2.7766 microseconds

then we get almost to where we started.

I will fire off a commit-per-commit benchmark to see if I notice
something. Ach, and this was PREEMPT on a
| x86/fpu: Enabled xstate features 0x7, context size is 832 bytes, using 'compacted' format.
machine. So those with AVX-512 might be worse, but I don't have any of
those.

[0] Part of lmbench; test:
    taskset 2 /usr/lib/lmbench/bin/x86_64-linux-gnu/lat_sig -P 1 -W 64 -N 5000 catch

Sebastian
On 2019-02-08 14:12:33 [+0100], To Borislav Petkov wrote:
> Then we have lat_sig [0]. Without the series 64bit:
> |Signal handler overhead: 2.6839 microseconds
> |Signal handler overhead: 2.6996 microseconds
> |Signal handler overhead: 2.6821 microseconds
>
> with the series:
> |Signal handler overhead: 3.2976 microseconds
> |Signal handler overhead: 3.3033 microseconds
> |Signal handler overhead: 3.2980 microseconds

Did a patch-by-patch run (64bit only, server preemption model, output
in us ("commit")):

2.368 ("Linux 5.0-rc5")

2.603 ("x86/fpu: Always store the registers in copy_fpstate_to_sigframe()")
      copy_fpstate_to_sigframe() stores to the thread's FPU area and
      then copies it to the user stack area.

2.668 ("x86/fpu: Prepare copy_fpstate_to_sigframe() for TIF_NEED_FPU_LOAD")
      This should be noise since preempt_disable/enable is a nop -
      test_thread_flag() isn't.

2.701 ("x86/fpu: Inline copy_user_to_fpregs_zeroing()")
      This pops up somehow but is simply code movement.

3.474 ("x86/fpu: Let __fpu__restore_sig() restore the !32bit+fxsr frame from kernel memory")
      This stands out. There is a kmalloc() + saving to kernel memory +
      copy instead of a direct save to the kernel stack.

2.928 ("x86/fpu: Defer FPU state load until return to userspace")
      The kmalloc() has been removed. Just "copy-to-kernel-memory" and
      copy_to_user() remained.

So this looks like 0.3us for the save-copy + 0.3us for copy-restore.
The numbers for the preempt/low-lat-desktop kernels show the same two
spikes and the drop at the end.

Sebastian