
[2/3] x86_64,entry: Use sysret to return to userspace when possible

Message ID 49394403b8b12486a6b9c9c70b72bd9f5dce7364.1415403984.git.luto@amacapital.net (mailing list archive)
State New, archived

Commit Message

Andy Lutomirski Nov. 7, 2014, 11:58 p.m. UTC
The x86_64 entry code currently jumps through complex and
inconsistent hoops to try to minimize the impact of syscall exit
work.  For a true fast-path syscall, almost nothing needs to be
done, so returning is just a check for exit work and sysret.  For a
full slow-path return from a syscall, the C exit hook is invoked if
needed and we join the iret path.

Using iret to return to userspace is very slow, so the entry code
has accumulated various special cases to try to do certain forms of
exit work without invoking iret.  This is error-prone, since it
duplicates assembly code paths, and it's dangerous, since sysret
can malfunction in interesting ways if used carelessly.  It's
also inefficient, since a lot of useful cases aren't optimized
and therefore force an iret out of a combination of paranoia and
the fact that no one has bothered to write even more asm code
to avoid it.

I would argue that this approach is backwards.  Rather than
trying to avoid the iret path, we should instead try to make
the iret path fast.  Under a specific set of conditions, iret
is unnecessary.  In particular, if RIP==RCX, RFLAGS==R11, RIP is
canonical, RF is not set, and both SS and CS are as expected, then
movq 32(%rsp),%rsp;sysret does the same thing as iret.  This set of
conditions is nearly always satisfied on return from syscalls, and
it can even occasionally be satisfied on return from an irq.
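
As a rough illustration only (this is not the patch itself, which does
these checks in assembly against the saved register frame), the set of
conditions above could be written as a C predicate.  The struct layout,
the function name, and the SKETCH_* constants are hypothetical stand-ins
for the kernel's pt_regs, __USER_CS, __USER_DS and X86_EFLAGS_RF:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, for illustration; the kernel headers define the real ones. */
#define SKETCH_USER_CS   0x33u
#define SKETCH_USER_DS   0x2bu
#define SKETCH_EFLAGS_RF (1ull << 16)

/* Hypothetical snapshot of the registers that matter for the decision. */
struct sketch_frame {
	uint64_t rcx, r11;                 /* saved general-purpose registers */
	uint64_t rip, cs, rflags, rsp, ss; /* values iret would restore */
};

/* Returns true if sysret would restore the same state as iret. */
static bool sysret_is_equivalent(const struct sketch_frame *f)
{
	if (f->rcx != f->rip)              /* sysret loads RIP from RCX */
		return false;
	if (f->rip >> 47)                  /* any of the 17 high bits set */
		return false;
	if (f->cs != SKETCH_USER_CS)       /* CS must match what sysret loads */
		return false;
	if (f->r11 != f->rflags)           /* sysret loads RFLAGS from R11 */
		return false;
	if (f->rflags & SKETCH_EFLAGS_RF)  /* sysret can't restore RF */
		return false;
	if (f->ss != SKETCH_USER_DS)       /* SS must match what sysret loads */
		return false;
	return true;                       /* nothing to check for RSP */
}
```

Any single failed check falls back to the ordinary iret path, so the
predicate only has to be conservative, not exact.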

Even with the careful checks for sysret applicability, this cuts
nearly 80ns off of the overhead from syscalls with unoptimized exit
work.  This includes tracing and context tracking, and any return
that invokes KVM's user return notifier.  For example, the cost of
getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to
~280ns on my computer.

This may allow the removal and even eventual conversion to C
of a respectable amount of exit asm.

This may require further tweaking to give the full benefit on Xen.

It may be worthwhile to adjust signal delivery and exec to try to hit
the sysret path.

This does not optimize returns to 32-bit userspace.  Making the same
optimization for CS == __USER32_CS is conceptually straightforward,
but it will require some tedious code to handle the differences
between sysretl and sysexitl.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 48 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

Comments

Borislav Petkov Jan. 8, 2015, 12:29 p.m. UTC | #1
On Fri, Nov 07, 2014 at 03:58:18PM -0800, Andy Lutomirski wrote:
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 3710b8241945..a5afdf0f7fa4 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -804,6 +804,54 @@ retint_swapgs:		/* return to user-space */

Ok, so retint_swapgs is also on the error_exit path.

What you're basically proposing is to use SYSRET on exceptions exit
too AFAICT. And while I don't see anything wrong with the patch, you
probably need to run this by more people like tip guys + Linus just in
case. We can't allow ourselves to leak stuff here.

>  	 */
>  	DISABLE_INTERRUPTS(CLBR_ANY)
>  	TRACE_IRQS_IRETQ
> +
> +	/*
> +	 * Try to use SYSRET instead of IRET if we're returning to
> +	 * a completely clean 64-bit userspace context.
> +	 */
> +	movq (RCX-R11)(%rsp), %rcx
> +	cmpq %rcx,(RIP-R11)(%rsp)		/* RCX == RIP */
> +	jne opportunistic_sysret_failed
> +
> +	/*
> +	 * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
> +	 * in kernel space.  This essentially lets the user take over
> +	 * the kernel, since userspace controls RSP.  It's not worth
> +	 * testing for canonicalness exactly -- this check detects any
> +	 * of the 17 high bits set, which is true for non-canonical
> +	 * or kernel addresses.  (This will pessimize vsyscall=native.
> +	 * Big deal.)
> +	 */
> +	shr $47, %rcx

	shr $__VIRTUAL_MASK_SHIFT, %rcx

I guess, in case someone decides to play with the address space again
and forgets this naked bit here.

> +	jnz opportunistic_sysret_failed
> +
> +	cmpq $__USER_CS,(CS-R11)(%rsp)		/* CS must match SYSRET */
> +	jne opportunistic_sysret_failed
> +
> +	movq (R11-R11)(%rsp), %r11
> +	cmpq %r11,(EFLAGS-R11)(%rsp)		/* R11 == RFLAGS */
> +	jne opportunistic_sysret_failed
> +
> +	testq $X86_EFLAGS_RF,%r11		/* sysret can't restore RF */
> +	jnz opportunistic_sysret_failed
> +
> +	/* nothing to check for RSP */
> +
> +	cmpq $__USER_DS,(SS-R11)(%rsp)		/* SS must match SYSRET */
> +	jne opportunistic_sysret_failed
> +
> +	/*
> +	 * We win!  This label is here just for ease of understanding
> +	 * perf profiles.  Nothing jumps here.
> +	 */
> +irq_return_via_sysret:
> +	CFI_REMEMBER_STATE
> +	RESTORE_ARGS 1,8,1
> +	movq (RSP-RIP)(%rsp),%rsp
> +	USERGS_SYSRET64
> +	CFI_RESTORE_STATE
> +
> +opportunistic_sysret_failed:
>  	SWAPGS
>  	jmp restore_args

Ok, dammit, it happened again:

...
[   13.480778] BTRFS info (device sda9): disk space caching is enabled
[   13.487270] BTRFS: has skinny extents
[   14.368392] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   15.928679] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[   15.936406] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
[   15.942879] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  115.065408] ata1.00: exception Emask 0x0 SAct 0x7fd80000 SErr 0x0 action 0x6 frozen
[  115.073159] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.078459] ata1.00: cmd 61/80:98:c0:e7:35/4a:00:1f:00:00/40 tag 19 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.093623] ata1.00: status: { DRDY }
[  115.097314] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.102569] ata1.00: cmd 61/30:a0:40:32:36/20:00:1f:00:00/40 tag 20 ncq 4218880 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.117668] ata1.00: status: { DRDY }
[  115.121351] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.126602] ata1.00: cmd 61/80:b0:80:f7:37/20:00:1f:00:00/40 tag 22 ncq 4259840 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.141701] ata1.00: status: { DRDY }
[  115.145389] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.150638] ata1.00: cmd 61/90:b8:70:52:36/03:00:1f:00:00/40 tag 23 ncq 466944 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.165682] ata1.00: status: { DRDY }
[  115.169357] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.174617] ata1.00: cmd 61/c0:c0:00:58:36/39:00:1f:00:00/40 tag 24 ncq 7569408 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.189713] ata1.00: status: { DRDY }
[  115.193400] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.198650] ata1.00: cmd 61/80:c8:c0:91:36/4b:00:1f:00:00/40 tag 25 ncq 9895936 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.213755] ata1.00: status: { DRDY }
[  115.217431] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.222692] ata1.00: cmd 61/80:d0:40:dd:36/4a:00:1f:00:00/40 tag 26 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.237788] ata1.00: status: { DRDY }
[  115.241479] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.246723] ata1.00: cmd 61/40:d8:c0:27:37/30:00:1f:00:00/40 tag 27 ncq 6324224 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.261825] ata1.00: status: { DRDY }
[  115.265519] ata1.00: failed command: READ FPDMA QUEUED
[  115.270683] ata1.00: cmd 60/08:e0:40:98:18/00:00:1f:00:00/40 tag 28 ncq 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.285432] ata1.00: status: { DRDY }
[  115.289113] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.294367] ata1.00: cmd 61/00:e8:00:58:37/15:00:1f:00:00/40 tag 29 ncq 2752512 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.309463] ata1.00: status: { DRDY }
[  115.313149] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.318399] ata1.00: cmd 61/00:f0:00:6d:37/2b:00:1f:00:00/40 tag 30 ncq 5636096 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.333503] ata1.00: status: { DRDY }
[  115.337201] ata1: hard resetting link
[  115.645895] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  115.743776] ata1.00: configured for UDMA/133
[  115.748074] ata1.00: device reported invalid CHS sector 0
[  115.753516] ata1.00: device reported invalid CHS sector 0
[  115.758947] ata1.00: device reported invalid CHS sector 0
[  115.764383] ata1.00: device reported invalid CHS sector 0
[  115.769825] ata1.00: device reported invalid CHS sector 0
[  115.775260] ata1.00: device reported invalid CHS sector 0
[  115.780689] ata1.00: device reported invalid CHS sector 0
[  115.786123] ata1.00: device reported invalid CHS sector 0
[  115.791563] ata1.00: device reported invalid CHS sector 0
[  115.796998] ata1.00: device reported invalid CHS sector 0
[  115.802431] ata1.00: device reported invalid CHS sector 0
[  115.807914] ata1: EH complete
[  146.085052] ata1.00: exception Emask 0x0 SAct 0x77c SErr 0x0 action 0x6 frozen
[  146.092320] ata1.00: failed command: READ FPDMA QUEUED
[  146.097489] ata1.00: cmd 60/08:10:40:98:18/00:00:1f:00:00/40 tag 2 ncq 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.112367] ata1.00: status: { DRDY }
[  146.116244] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.121696] ata1.00: cmd 61/40:18:c0:27:37/30:00:1f:00:00/40 tag 3 ncq 6324224 out
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.137389] ata1.00: status: { DRDY }
[  146.141267] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.146710] ata1.00: cmd 61/80:20:40:dd:36/4a:00:1f:00:00/40 tag 4 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.162395] ata1.00: status: { DRDY }
[  146.166269] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.171723] ata1.00: cmd 61/80:28:c0:91:36/4b:00:1f:00:00/40 tag 5 ncq 9895936 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.187402] ata1.00: status: { DRDY }
[  146.191278] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.196718] ata1.00: cmd 61/c0:30:00:58:36/39:00:1f:00:00/40 tag 6 ncq 7569408 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.212399] ata1.00: status: { DRDY }
[  146.216275] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.221723] ata1.00: cmd 61/80:40:80:f7:37/20:00:1f:00:00/40 tag 8 ncq 4259840 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.237407] ata1.00: status: { DRDY }
[  146.241280] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.246725] ata1.00: cmd 61/30:48:40:32:36/20:00:1f:00:00/40 tag 9 ncq 4218880 out
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.262407] ata1.00: status: { DRDY }
[  146.266282] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.271731] ata1.00: cmd 61/80:50:c0:e7:35/4a:00:1f:00:00/40 tag 10 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.287498] ata1.00: status: { DRDY }
[  146.291371] ata1: hard resetting link
[  146.599768] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  146.608680] ata1.00: configured for UDMA/133
[  146.613180] ata1.00: device reported invalid CHS sector 0
[  146.618807] ata1.00: device reported invalid CHS sector 0
[  146.624430] ata1.00: device reported invalid CHS sector 0
[  146.630048] ata1.00: device reported invalid CHS sector 0
[  146.635658] ata1.00: device reported invalid CHS sector 0
[  146.641270] ata1.00: device reported invalid CHS sector 0
[  146.646881] ata1.00: device reported invalid CHS sector 0
[  146.652484] ata1.00: device reported invalid CHS sector 0
[  146.658122] ata1: EH complete
[  177.110908] ata1.00: exception Emask 0x0 SAct 0x7f800 SErr 0x0 action 0x6 frozen
[  177.118525] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.123960] ata1.00: cmd 61/80:58:c0:e7:35/4a:00:1f:00:00/40 tag 11 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.139559] ata1.00: status: { DRDY }
[  177.143419] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.148849] ata1.00: cmd 61/30:60:40:32:36/20:00:1f:00:00/40 tag 12 ncq 4218880 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.164454] ata1.00: status: { DRDY }
[  177.168311] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.173747] ata1.00: cmd 61/80:68:80:f7:37/20:00:1f:00:00/40 tag 13 ncq 4259840 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.189387] ata1.00: status: { DRDY }
[  177.193254] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.198691] ata1.00: cmd 61/c0:70:00:58:36/39:00:1f:00:00/40 tag 14 ncq 7569408 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.214430] ata1.00: status: { DRDY }
[  177.218304] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.223755] ata1.00: cmd 61/80:78:c0:91:36/4b:00:1f:00:00/40 tag 15 ncq 9895936 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.239575] ata1.00: status: { DRDY }
[  177.243460] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.248908] ata1.00: cmd 61/80:80:40:dd:36/4a:00:1f:00:00/40 tag 16 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.264743] ata1.00: status: { DRDY }
[  177.268622] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.274075] ata1.00: cmd 61/40:88:c0:27:37/30:00:1f:00:00/40 tag 17 ncq 6324224 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.289913] ata1.00: status: { DRDY }
[  177.293795] ata1.00: failed command: READ FPDMA QUEUED
[  177.299153] ata1.00: cmd 60/08:90:40:98:18/00:00:1f:00:00/40 tag 18 ncq 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.314633] ata1.00: status: { DRDY }
[  177.318509] ata1: hard resetting link
[  177.626616] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  177.968639] ata1.00: configured for UDMA/133
[  177.973609] ata1.00: device reported invalid CHS sector 0
[  177.979669] ata1.00: device reported invalid CHS sector 0
[  177.985723] ata1.00: device reported invalid CHS sector 0
[  177.991371] ata1.00: device reported invalid CHS sector 0
[  177.997008] ata1.00: device reported invalid CHS sector 0
[  178.002641] ata1.00: device reported invalid CHS sector 0
[  178.008260] ata1.00: device reported invalid CHS sector 0
[  178.013886] ata1.00: device reported invalid CHS sector 0
[  178.019558] ata1: EH complete
Borislav Petkov Jan. 8, 2015, 1:57 p.m. UTC | #2
On Thu, Jan 08, 2015 at 01:29:28PM +0100, Borislav Petkov wrote:
> Ok, dammit, it happened again:

Running -rc2+ without your first two patches doesn't trigger it. Well,
I don't know what workload even triggered it, it used to happen during
system upgrade. I left the box without your patches to build the kernel
in a loop and went to lunch.

Now I'm back and it all still looks good.

I'll try running only with this second patch, i.e. the ret-to-user
SYSRET speedup thing. See what happens.
Borislav Petkov Jan. 9, 2015, 10:40 a.m. UTC | #3
On Fri, Nov 07, 2014 at 03:58:18PM -0800, Andy Lutomirski wrote:
> +	/*
> +	 * Try to use SYSRET instead of IRET if we're returning to
> +	 * a completely clean 64-bit userspace context.
> +	 */
> +	movq (RCX-R11)(%rsp), %rcx
> +	cmpq %rcx,(RIP-R11)(%rsp)		/* RCX == RIP */
> +	jne opportunistic_sysret_failed
> +
> +	/*
> +	 * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
> +	 * in kernel space.  This essentially lets the user take over
> +	 * the kernel, since userspace controls RSP.  It's not worth
> +	 * testing for canonicalness exactly -- this check detects any
> +	 * of the 17 high bits set, which is true for non-canonical
> +	 * or kernel addresses.  (This will pessimize vsyscall=native.
> +	 * Big deal.)
> +	 */
> +	shr $47, %rcx
> +	jnz opportunistic_sysret_failed
> +
> +	cmpq $__USER_CS,(CS-R11)(%rsp)		/* CS must match SYSRET */
> +	jne opportunistic_sysret_failed
> +
> +	movq (R11-R11)(%rsp), %r11
> +	cmpq %r11,(EFLAGS-R11)(%rsp)		/* R11 == RFLAGS */
> +	jne opportunistic_sysret_failed
> +
> +	testq $X86_EFLAGS_RF,%r11		/* sysret can't restore RF */
> +	jnz opportunistic_sysret_failed
> +
> +	/* nothing to check for RSP */
> +
> +	cmpq $__USER_DS,(SS-R11)(%rsp)		/* SS must match SYSRET */
> +	jne opportunistic_sysret_failed

Btw, Denys' R11->ARGOFFSET fix makes sense here too - using ARGOFFSET
instead of R11 would make this here clearer.
Andy Lutomirski Jan. 10, 2015, 9:05 p.m. UTC | #4
On Thu, Jan 8, 2015 at 4:29 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 07, 2014 at 03:58:18PM -0800, Andy Lutomirski wrote:
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index 3710b8241945..a5afdf0f7fa4 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -804,6 +804,54 @@ retint_swapgs:           /* return to user-space */
>
> Ok, so retint_swapgs is also on the error_exit path.
>
> What you're basically proposing is to use SYSRET on exceptions exit
> too AFAICT. And while I don't see anything wrong with the patch, you
> probably need to run this by more people like tip guys + Linus just in
> case. We can't allow ourselves to leak stuff here.

I'll cc Linus et al. on v2.

>
>>        */
>>       DISABLE_INTERRUPTS(CLBR_ANY)
>>       TRACE_IRQS_IRETQ
>> +
>> +     /*
>> +      * Try to use SYSRET instead of IRET if we're returning to
>> +      * a completely clean 64-bit userspace context.
>> +      */
>> +     movq (RCX-R11)(%rsp), %rcx
>> +     cmpq %rcx,(RIP-R11)(%rsp)               /* RCX == RIP */
>> +     jne opportunistic_sysret_failed
>> +
>> +     /*
>> +      * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
>> +      * in kernel space.  This essentially lets the user take over
>> +      * the kernel, since userspace controls RSP.  It's not worth
>> +      * testing for canonicalness exactly -- this check detects any
>> +      * of the 17 high bits set, which is true for non-canonical
>> +      * or kernel addresses.  (This will pessimize vsyscall=native.
>> +      * Big deal.)
>> +      */
>> +     shr $47, %rcx
>
>         shr $__VIRTUAL_MASK_SHIFT, %rcx
>
> I guess, in case someone decides to play with the address space again
> and forgets this naked bit here.
>

I'll probably add a build-time assertion that __VIRTUAL_MASK_SHIFT ==
47 instead.  If we ever support CPUs with an extra level of page
tables, we'll probably need to patch the instruction, since we have a
security hole if that shift ever exceeds 47 on existing CPUs.
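
A minimal C sketch of that idea, under stated assumptions: the macro
SKETCH_VIRTUAL_MASK_SHIFT stands in for the kernel's
__VIRTUAL_MASK_SHIFT, and the helper name is made up for illustration.
The real check is the shr/jnz pair in the asm above; this only shows
what the shift tests and how a build-time assertion could pin the
constant:

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for __VIRTUAL_MASK_SHIFT; assumed to be 47 as in the thread. */
#define SKETCH_VIRTUAL_MASK_SHIFT 47

/* Fail the build if the address-space layout changes under this check. */
_Static_assert(SKETCH_VIRTUAL_MASK_SHIFT == 47,
	       "sysret RIP check assumes a 47-bit user address space");

/*
 * True if any of the 17 high bits is set, which covers both
 * non-canonical addresses and kernel addresses.
 */
static bool rip_has_high_bits(uint64_t rip)
{
	return (rip >> SKETCH_VIRTUAL_MASK_SHIFT) != 0;
}
```

With these definitions, the highest user address 0x00007fffffffffff
passes, while the first non-canonical address 0x0000800000000000 and
the vsyscall page at 0xffffffffff600000 are both rejected, the latter
being the vsyscall=native pessimization mentioned in the comment.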

--Andy

>> +     jnz opportunistic_sysret_failed
>> +
>> +     cmpq $__USER_CS,(CS-R11)(%rsp)          /* CS must match SYSRET */
>> +     jne opportunistic_sysret_failed
>> +
>> +     movq (R11-R11)(%rsp), %r11
>> +     cmpq %r11,(EFLAGS-R11)(%rsp)            /* R11 == RFLAGS */
>> +     jne opportunistic_sysret_failed
>> +
>> +     testq $X86_EFLAGS_RF,%r11               /* sysret can't restore RF */
>> +     jnz opportunistic_sysret_failed
>> +
>> +     /* nothing to check for RSP */
>> +
>> +     cmpq $__USER_DS,(SS-R11)(%rsp)          /* SS must match SYSRET */
>> +     jne opportunistic_sysret_failed
>> +
>> +     /*
>> +      * We win!  This label is here just for ease of understanding
>> +      * perf profiles.  Nothing jumps here.
>> +      */
>> +irq_return_via_sysret:
>> +     CFI_REMEMBER_STATE
>> +     RESTORE_ARGS 1,8,1
>> +     movq (RSP-RIP)(%rsp),%rsp
>> +     USERGS_SYSRET64
>> +     CFI_RESTORE_STATE
>> +
>> +opportunistic_sysret_failed:
>>       SWAPGS
>>       jmp restore_args
>
> Ok, dammit, it happened again:
>
> ...
> [   13.480778] BTRFS info (device sda9): disk space caching is enabled
> [   13.487270] BTRFS: has skinny extents
> [   14.368392] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [   15.928679] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> [   15.936406] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> [   15.942879] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [  115.065408] ata1.00: exception Emask 0x0 SAct 0x7fd80000 SErr 0x0 action 0x6 frozen
> [  115.073159] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.078459] ata1.00: cmd 61/80:98:c0:e7:35/4a:00:1f:00:00/40 tag 19 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.093623] ata1.00: status: { DRDY }
> [  115.097314] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.102569] ata1.00: cmd 61/30:a0:40:32:36/20:00:1f:00:00/40 tag 20 ncq 4218880 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.117668] ata1.00: status: { DRDY }
> [  115.121351] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.126602] ata1.00: cmd 61/80:b0:80:f7:37/20:00:1f:00:00/40 tag 22 ncq 4259840 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.141701] ata1.00: status: { DRDY }
> [  115.145389] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.150638] ata1.00: cmd 61/90:b8:70:52:36/03:00:1f:00:00/40 tag 23 ncq 466944 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.165682] ata1.00: status: { DRDY }
> [  115.169357] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.174617] ata1.00: cmd 61/c0:c0:00:58:36/39:00:1f:00:00/40 tag 24 ncq 7569408 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.189713] ata1.00: status: { DRDY }
> [  115.193400] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.198650] ata1.00: cmd 61/80:c8:c0:91:36/4b:00:1f:00:00/40 tag 25 ncq 9895936 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.213755] ata1.00: status: { DRDY }
> [  115.217431] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.222692] ata1.00: cmd 61/80:d0:40:dd:36/4a:00:1f:00:00/40 tag 26 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.237788] ata1.00: status: { DRDY }
> [  115.241479] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.246723] ata1.00: cmd 61/40:d8:c0:27:37/30:00:1f:00:00/40 tag 27 ncq 6324224 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.261825] ata1.00: status: { DRDY }
> [  115.265519] ata1.00: failed command: READ FPDMA QUEUED
> [  115.270683] ata1.00: cmd 60/08:e0:40:98:18/00:00:1f:00:00/40 tag 28 ncq 4096 in
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.285432] ata1.00: status: { DRDY }
> [  115.289113] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.294367] ata1.00: cmd 61/00:e8:00:58:37/15:00:1f:00:00/40 tag 29 ncq 2752512 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.309463] ata1.00: status: { DRDY }
> [  115.313149] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.318399] ata1.00: cmd 61/00:f0:00:6d:37/2b:00:1f:00:00/40 tag 30 ncq 5636096 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.333503] ata1.00: status: { DRDY }
> [  115.337201] ata1: hard resetting link
> [  115.645895] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  115.743776] ata1.00: configured for UDMA/133
> [  115.748074] ata1.00: device reported invalid CHS sector 0
> [  115.753516] ata1.00: device reported invalid CHS sector 0
> [  115.758947] ata1.00: device reported invalid CHS sector 0
> [  115.764383] ata1.00: device reported invalid CHS sector 0
> [  115.769825] ata1.00: device reported invalid CHS sector 0
> [  115.775260] ata1.00: device reported invalid CHS sector 0
> [  115.780689] ata1.00: device reported invalid CHS sector 0
> [  115.786123] ata1.00: device reported invalid CHS sector 0
> [  115.791563] ata1.00: device reported invalid CHS sector 0
> [  115.796998] ata1.00: device reported invalid CHS sector 0
> [  115.802431] ata1.00: device reported invalid CHS sector 0
> [  115.807914] ata1: EH complete
> [  146.085052] ata1.00: exception Emask 0x0 SAct 0x77c SErr 0x0 action 0x6 frozen
> [  146.092320] ata1.00: failed command: READ FPDMA QUEUED
> [  146.097489] ata1.00: cmd 60/08:10:40:98:18/00:00:1f:00:00/40 tag 2 ncq 4096 in
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.112367] ata1.00: status: { DRDY }
> [  146.116244] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.121696] ata1.00: cmd 61/40:18:c0:27:37/30:00:1f:00:00/40 tag 3 ncq 6324224 out
>          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.137389] ata1.00: status: { DRDY }
> [  146.141267] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.146710] ata1.00: cmd 61/80:20:40:dd:36/4a:00:1f:00:00/40 tag 4 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.162395] ata1.00: status: { DRDY }
> [  146.166269] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.171723] ata1.00: cmd 61/80:28:c0:91:36/4b:00:1f:00:00/40 tag 5 ncq 9895936 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.187402] ata1.00: status: { DRDY }
> [  146.191278] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.196718] ata1.00: cmd 61/c0:30:00:58:36/39:00:1f:00:00/40 tag 6 ncq 7569408 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.212399] ata1.00: status: { DRDY }
> [  146.216275] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.221723] ata1.00: cmd 61/80:40:80:f7:37/20:00:1f:00:00/40 tag 8 ncq 4259840 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.237407] ata1.00: status: { DRDY }
> [  146.241280] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.246725] ata1.00: cmd 61/30:48:40:32:36/20:00:1f:00:00/40 tag 9 ncq 4218880 out
>          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.262407] ata1.00: status: { DRDY }
> [  146.266282] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.271731] ata1.00: cmd 61/80:50:c0:e7:35/4a:00:1f:00:00/40 tag 10 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.287498] ata1.00: status: { DRDY }
> [  146.291371] ata1: hard resetting link
> [  146.599768] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  146.608680] ata1.00: configured for UDMA/133
> [  146.613180] ata1.00: device reported invalid CHS sector 0
> [  146.618807] ata1.00: device reported invalid CHS sector 0
> [  146.624430] ata1.00: device reported invalid CHS sector 0
> [  146.630048] ata1.00: device reported invalid CHS sector 0
> [  146.635658] ata1.00: device reported invalid CHS sector 0
> [  146.641270] ata1.00: device reported invalid CHS sector 0
> [  146.646881] ata1.00: device reported invalid CHS sector 0
> [  146.652484] ata1.00: device reported invalid CHS sector 0
> [  146.658122] ata1: EH complete
> [  177.110908] ata1.00: exception Emask 0x0 SAct 0x7f800 SErr 0x0 action 0x6 frozen
> [  177.118525] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.123960] ata1.00: cmd 61/80:58:c0:e7:35/4a:00:1f:00:00/40 tag 11 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.139559] ata1.00: status: { DRDY }
> [  177.143419] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.148849] ata1.00: cmd 61/30:60:40:32:36/20:00:1f:00:00/40 tag 12 ncq 4218880 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.164454] ata1.00: status: { DRDY }
> [  177.168311] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.173747] ata1.00: cmd 61/80:68:80:f7:37/20:00:1f:00:00/40 tag 13 ncq 4259840 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.189387] ata1.00: status: { DRDY }
> [  177.193254] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.198691] ata1.00: cmd 61/c0:70:00:58:36/39:00:1f:00:00/40 tag 14 ncq 7569408 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.214430] ata1.00: status: { DRDY }
> [  177.218304] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.223755] ata1.00: cmd 61/80:78:c0:91:36/4b:00:1f:00:00/40 tag 15 ncq 9895936 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.239575] ata1.00: status: { DRDY }
> [  177.243460] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.248908] ata1.00: cmd 61/80:80:40:dd:36/4a:00:1f:00:00/40 tag 16 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.264743] ata1.00: status: { DRDY }
> [  177.268622] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.274075] ata1.00: cmd 61/40:88:c0:27:37/30:00:1f:00:00/40 tag 17 ncq 6324224 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.289913] ata1.00: status: { DRDY }
> [  177.293795] ata1.00: failed command: READ FPDMA QUEUED
> [  177.299153] ata1.00: cmd 60/08:90:40:98:18/00:00:1f:00:00/40 tag 18 ncq 4096 in
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.314633] ata1.00: status: { DRDY }
> [  177.318509] ata1: hard resetting link
> [  177.626616] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  177.968639] ata1.00: configured for UDMA/133
> [  177.973609] ata1.00: device reported invalid CHS sector 0
> [  177.979669] ata1.00: device reported invalid CHS sector 0
> [  177.985723] ata1.00: device reported invalid CHS sector 0
> [  177.991371] ata1.00: device reported invalid CHS sector 0
> [  177.997008] ata1.00: device reported invalid CHS sector 0
> [  178.002641] ata1.00: device reported invalid CHS sector 0
> [  178.008260] ata1.00: device reported invalid CHS sector 0
> [  178.013886] ata1.00: device reported invalid CHS sector 0
> [  178.019558] ata1: EH complete
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

Patch

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3710b8241945..a5afdf0f7fa4 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -804,6 +804,54 @@  retint_swapgs:		/* return to user-space */
 	 */
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_IRETQ
+
+	/*
+	 * Try to use SYSRET instead of IRET if we're returning to
+	 * a completely clean 64-bit userspace context.
+	 */
+	movq (RCX-R11)(%rsp), %rcx
+	cmpq %rcx,(RIP-R11)(%rsp)		/* RCX == RIP */
+	jne opportunistic_sysret_failed
+
+	/*
+	 * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
+	 * in kernel space.  This essentially lets the user take over
+	 * the kernel, since userspace controls RSP.  It's not worth
+	 * testing for canonicalness exactly -- this check detects any
+	 * of the 17 high bits set, which is true for non-canonical
+	 * or kernel addresses.  (This will pessimize vsyscall=native.
+	 * Big deal.)
+	 */
+	shr $47, %rcx
+	jnz opportunistic_sysret_failed
+
+	cmpq $__USER_CS,(CS-R11)(%rsp)		/* CS must match SYSRET */
+	jne opportunistic_sysret_failed
+
+	movq (R11-R11)(%rsp), %r11
+	cmpq %r11,(EFLAGS-R11)(%rsp)		/* R11 == RFLAGS */
+	jne opportunistic_sysret_failed
+
+	testq $X86_EFLAGS_RF,%r11		/* sysret can't restore RF */
+	jnz opportunistic_sysret_failed
+
+	/* nothing to check for RSP */
+
+	cmpq $__USER_DS,(SS-R11)(%rsp)		/* SS must match SYSRET */
+	jne opportunistic_sysret_failed
+
+	/*
+	 * We win!  This label is here just for ease of understanding
+	 * perf profiles.  Nothing jumps here.
+	 */
+irq_return_via_sysret:
+	CFI_REMEMBER_STATE
+	RESTORE_ARGS 1,8,1
+	movq (RSP-RIP)(%rsp),%rsp
+	USERGS_SYSRET64
+	CFI_RESTORE_STATE
+
+opportunistic_sysret_failed:
 	SWAPGS
 	jmp restore_args