[RFC,1/2] x86: Allow breakpoints to emulate call functions

Message ID 20190501203152.397154664@goodmis.org
State New
Series
  • [RFC,1/2] x86: Allow breakpoints to emulate call functions

Commit Message

Steven Rostedt May 1, 2019, 8:28 p.m. UTC
From: Peter Zijlstra <peterz@infradead.org>

In order to allow breakpoints to emulate call functions, they need to push
the return address onto the stack. But because the breakpoint exception
frame is added to the stack when the breakpoint is hit, there's no room to
add the address onto the stack and return to the address of the emulated
called function.

To handle this, copy the exception frame on entry to the breakpoint handler
and leave a gap that can be used to add a return address to the stack
frame and return from the breakpoint to the emulated called function,
allowing that called function to return back to the location after the
breakpoint was placed.

The following helper functions were also added:

  int3_emulate_push(): push a value into the gap left on the stack
  int3_emulate_jmp():  change regs->ip so execution resumes there
  int3_emulate_call(): push the return address and change regs->ip

Cc: Andy Lutomirski <luto@kernel.org>
Cc: Nicolai Stange <nstange@suse.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: the arch/x86 maintainers <x86@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jiri Kosina <jikos@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Petr Mladek <pmladek@suse.com>
Cc: Joe Lawrence <joe.lawrence@redhat.com>
Cc: Shuah Khan <shuah@kernel.org>
Cc: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Tim Chen <tim.c.chen@linux.intel.com>
Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Cc: Mimi Zohar <zohar@linux.ibm.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Nick Desaulniers <ndesaulniers@google.com>
Cc: Nayna Jain <nayna@linux.ibm.com>
Cc: Masahiro Yamada <yamada.masahiro@socionext.com>
Cc: Joerg Roedel <jroedel@suse.de>
Cc: "open list:KERNEL SELFTEST FRAMEWORK" <linux-kselftest@vger.kernel.org>
Cc: stable@vger.kernel.org
Fixes: b700e7f03df5 ("livepatch: kernel: add support for live patching")
Signed-off-by: *** Need Peter Zijlstra's SoB here! ***
Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>
---
 arch/x86/entry/entry_32.S            | 11 +++++++++++
 arch/x86/entry/entry_64.S            | 14 ++++++++++++--
 arch/x86/include/asm/text-patching.h | 20 ++++++++++++++++++++
 3 files changed, 43 insertions(+), 2 deletions(-)

Comments

Steven Rostedt May 2, 2019, 3:24 a.m. UTC | #1
On Wed, 01 May 2019 16:28:31 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index d309f30cf7af..50bbf4035baf 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -1478,6 +1478,17 @@ ENTRY(int3)
>  	ASM_CLAC
>  	pushl	$-1				# mark this as an int
>  
> +#ifdef CONFIG_VM86
> +	testl	$X86_EFLAGS_VM, PT_EFLAGS(%esp)
> +	jnz	.Lfrom_usermode_no_gap
> +#endif
> +	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
> +	jnz	.Lfrom_usermode_no_gap
> +	.rept 6
> +	pushl	5*4(%esp)
> +	.endr
> +.Lfrom_usermode_no_gap:
> +
>  	SAVE_ALL switch_stacks=1
>  	ENCODE_FRAME_POINTER
>  	TRACE_IRQS_OFF

This failed to work on 32 bit at all (crashed and burned badly - triple
fault!). Looking at it I found one issue: this code runs before the
regs are saved, but PT_EFLAGS(%esp) and PT_CS(%esp) assume %esp already
points at them. I applied the patch below on top of this, but it didn't
totally fix the problems. It still constantly crashes (although, with
this update, I can put in some printks to get some ideas). I haven't
spent too much time on it, but it looks like there's an issue with the
entry stack that int3 switches to. I'm not sure it's handling the copy
well.

-- Steve

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 50bbf4035baf..4f427285e421 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1479,10 +1479,10 @@ ENTRY(int3)
 	pushl	$-1				# mark this as an int
 
 #ifdef CONFIG_VM86
-	testl	$X86_EFLAGS_VM, PT_EFLAGS(%esp)
+	testl	$X86_EFLAGS_VM, PT_EFLAGS-PT_ORIG_EAX(%esp)
 	jnz	.Lfrom_usermode_no_gap
 #endif
-	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
+	testl	$SEGMENT_RPL_MASK, PT_CS-PT_ORIG_EAX(%esp)
 	jnz	.Lfrom_usermode_no_gap
 	.rept 6
 	pushl	5*4(%esp)
Peter Zijlstra May 2, 2019, 4:21 p.m. UTC | #2
On Wed, May 01, 2019 at 11:24:12PM -0400, Steven Rostedt wrote:
> On Wed, 01 May 2019 16:28:31 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> > index d309f30cf7af..50bbf4035baf 100644
> > --- a/arch/x86/entry/entry_32.S
> > +++ b/arch/x86/entry/entry_32.S
> > @@ -1478,6 +1478,17 @@ ENTRY(int3)
> >  	ASM_CLAC
> >  	pushl	$-1				# mark this as an int
> >  
> > +#ifdef CONFIG_VM86
> > +	testl	$X86_EFLAGS_VM, PT_EFLAGS(%esp)
> > +	jnz	.Lfrom_usermode_no_gap
> > +#endif
> > +	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
> > +	jnz	.Lfrom_usermode_no_gap
> > +	.rept 6
> > +	pushl	5*4(%esp)
> > +	.endr
> > +.Lfrom_usermode_no_gap:
> > +
> >  	SAVE_ALL switch_stacks=1
> >  	ENCODE_FRAME_POINTER
> >  	TRACE_IRQS_OFF
> 
> This failed to work on 32 bit at all (crashed and burned badly - triple
> fault!). 

Indeed so; find a working version below (albeit with a lot of debug
garbage still in).

It also includes the self-test code that Andy wanted -- it's what I used
to debug this mess.

Much thanks to Joerg Roedel for talking entry_32.S with me.

TL;DR, on x86_32 kernel->kernel IRET frames are only 3 entries and do
not include ESP/SS, so not only wasn't regs->sp set up, but if you
changed it, the change wouldn't take effect and would corrupt random
stack state.

---
 arch/x86/entry/entry_32.S            |  87 +++++++++++++++++++++++---
 arch/x86/entry/entry_64.S            |  14 ++++-
 arch/x86/include/asm/text-patching.h |  20 ++++++
 arch/x86/kernel/alternative.c        | 116 +++++++++++++++++++++++++++++++++--
 arch/x86/kernel/traps.c              |   1 +
 5 files changed, 225 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 7b23431be5cb..01c5bdbe5f39 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -203,7 +203,7 @@
 .Lend_\@:
 .endm
 
-.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
+.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 clear_cs=1
 	cld
 	PUSH_GS
 	pushl	%fs
@@ -225,7 +225,7 @@
 
 	/* Switch to kernel stack if necessary */
 .if \switch_stacks > 0
-	SWITCH_TO_KERNEL_STACK
+	SWITCH_TO_KERNEL_STACK \clear_cs
 .endif
 
 .endm
@@ -377,8 +377,9 @@
 
 #define CS_FROM_ENTRY_STACK	(1 << 31)
 #define CS_FROM_USER_CR3	(1 << 30)
+#define CS_FROM_INT3		(1 << 29)
 
-.macro SWITCH_TO_KERNEL_STACK
+.macro SWITCH_TO_KERNEL_STACK clear_cs=1
 
 	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
 
@@ -391,12 +392,13 @@
 	 * that register for the time this macro runs
 	 */
 
+	.if \clear_cs
 	/*
-	 * The high bits of the CS dword (__csh) are used for
-	 * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case
-	 * hardware didn't do this for us.
+	 * The high bits of the CS dword (__csh) are used for CS_FROM_*. Clear
+	 * them in case hardware didn't do this for us.
 	 */
 	andl	$(0x0000ffff), PT_CS(%esp)
+	.endif
 
 	/* Are we on the entry stack? Bail out if not! */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
@@ -1019,6 +1021,29 @@ ENTRY(entry_INT80_32)
 	/* Restore user state */
 	RESTORE_REGS pop=4			# skip orig_eax/error_code
 .Lirq_return:
+	testl $CS_FROM_INT3, 4(%esp)
+	jz .Lno_iret_fixup
+
+	/*
+	 * Undo the magic from ENTRY(int3), in particular consider the case
+	 * where regs->sp has been modified.
+	 */
+
+	pushl	%eax
+	movl	%esp, %eax
+
+	movl	4*4(%eax), %esp		# restore (modified) regs->sp
+
+	/* rebuild IRET frame */
+	pushl	3*4(%eax)		# flags
+	pushl	2*4(%eax)		# cs
+	pushl	1*4(%eax)		# ip
+
+	andl	$0x0000ffff, 4(%esp)	# clear high CS bits
+
+	movl	(%eax), %eax		# restore eax
+
+.Lno_iret_fixup:
 	/*
 	 * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
 	 * when returning from IPI handler and when returning from
@@ -1477,9 +1502,57 @@ END(nmi)
 
 ENTRY(int3)
 	ASM_CLAC
+
+	/*
+	 * The high bits of the CS dword (__csh) are used for CS_FROM_*. Clear
+	 * them in case hardware didn't do this for us.
+	 */
+	andl	$0x0000ffff, 4(%esp)
+
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, 8(%esp)
+	jnz	.Lfrom_usermode_no_gap
+#endif
+	testl	$SEGMENT_RPL_MASK, 4(%esp)
+	jnz	.Lfrom_usermode_no_gap
+
+	/*
+	 * Here from kernel mode; so the (exception) stack looks like:
+	 *
+	 * 12(esp) - <previous context>
+	 *  8(esp) - flags
+	 *  4(esp) - cs
+	 *  0(esp) - ip
+	 *
+	 * Lets build a 5 entry IRET frame after that, such that struct pt_regs
+	 * is complete and in particular regs->sp is correct. This gives us
+	 * the original 3 enties as gap:
+	 *
+	 * 32(esp) - <previous context>
+	 * 28(esp) - orig_flags / gap
+	 * 24(esp) - orig_cs	/ gap
+	 * 20(esp) - orig_ip	/ gap
+	 * 16(esp) - ss
+	 * 12(esp) - sp
+	 *  8(esp) - flags
+	 *  4(esp) - cs
+	 *  0(esp) - ip
+	 */
+	pushl	%ss	  # ss
+	pushl	%esp      # sp (points at ss)
+	pushl	4*4(%esp) # flags
+	pushl	4*4(%esp) # cs
+	pushl	4*4(%esp) # ip
+
+	add	$16, 12(%esp) # point sp back at the previous context
+
+	orl	$CS_FROM_INT3, 4(%esp) # mark magic IRET
+
+.Lfrom_usermode_no_gap:
+
 	pushl	$-1				# mark this as an int
 
-	SAVE_ALL switch_stacks=1
+	SAVE_ALL switch_stacks=1 clear_cs=0
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
 	xorl	%edx, %edx			# zero error code
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 20e45d9b4e15..268cd9affe04 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -878,7 +878,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -898,6 +898,16 @@ ENTRY(\sym)
 	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
 
+	.if \create_gap == 1
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jnz	.Lfrom_usermode_no_gap_\@
+	.rept 6
+	pushq	5*8(%rsp)
+	.endr
+	UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+	.endif
+
 	.if \paranoid
 	call	paranoid_entry
 	.else
@@ -1129,7 +1139,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET
-idtentry int3			do_int3			has_error_code=0
+idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index c90678fd391a..6aac6abf931e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -42,4 +42,24 @@ extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
 
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+	regs->sp -= sizeof(unsigned long);
+	*(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+	regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+	int3_emulate_jmp(regs, func);
+}
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 4db9c0d29bc1..1e11076c3a2b 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -613,11 +613,118 @@ extern struct paravirt_patch_site __start_parainstructions[],
 	__stop_parainstructions[];
 #endif	/* CONFIG_PARAVIRT */
 
+static __always_inline void print_stack(struct pt_regs *regs)
+{
+#if 1
+	unsigned long *end = (unsigned long *)current_stack_pointer;
+	unsigned long *frame = (unsigned long *)__builtin_frame_address(0);
+	unsigned long *stack = (unsigned long *)(current_stack_pointer & ~(THREAD_SIZE - 1));
+	int i, j;
+
+	stack += THREAD_SIZE / sizeof(unsigned long);
+
+	printk("stack dump from: %lx\n", stack);
+
+	for (i=0; ; i++) {
+		pr_info("stack[%03d]: ", 16*i);
+		for (j=0; j<16; j++) {
+			if (i==0 && j==0) {
+				pr_cont(" %08lx  ", 0UL);
+				stack--;
+				continue;
+			}
+			if (stack == end)
+				pr_cont(">%08lx< ", *(stack--));
+			else if (stack == frame)
+				pr_cont("*%08lx* ", *(stack--));
+			else if (stack == regs)
+				pr_cont("r%08lxr ", *(stack--));
+			else if (regs && stack == regs->sp)
+				pr_cont("s%08lxs ", *(stack--));
+			else
+				pr_cont(" %08lx  ", *(stack--));
+		}
+		pr_cont("\n");
+
+		if (stack < end)
+			break;
+	}
+#endif
+}
+
+static void __init int3_magic(unsigned int *ptr)
+{
+	printk("*************** %lx\n", (unsigned long)ptr);
+	print_stack(NULL);
+	*ptr = 1;
+}
+
+static __initdata unsigned long int3_ip;
+
+static int __init int3_exception_notify(struct notifier_block *self, unsigned long val, void *data)
+{
+	struct die_args *args = data;
+	struct pt_regs *regs = args->regs;
+
+	if (!regs || user_mode(regs))
+		return NOTIFY_DONE;
+
+	if (val != DIE_INT3)
+		return NOTIFY_DONE;
+
+	printk("XXXXXXXXXXXXXXXXXXXXXXXXXX %lx %lx\n", regs->ip, int3_ip);
+	if (regs->ip - INT3_INSN_SIZE != int3_ip)
+		return NOTIFY_DONE;
+
+	print_stack(regs);
+	int3_emulate_call(regs, (unsigned long)&int3_magic);
+	print_stack(regs);
+
+	return NOTIFY_STOP;
+}
+
+static void __init int3_selftest(void)
+{
+	static __initdata struct notifier_block int3_exception_nb = {
+		.notifier_call	= int3_exception_notify,
+		.priority	= INT_MAX-1, /* last */
+	};
+	unsigned int val = 0;
+
+	BUG_ON(register_die_notifier(&int3_exception_nb));
+
+	printk("+++++++++++++++++++ %lx %lx\n", (unsigned long)&val, (unsigned long)&int3_ip);
+
+	print_stack(NULL);
+
+	/*
+	 * Basically: int3_magic(&val); but really complicated :-)
+	 *
+	 * Stick the address of the INT3 instruction into int3_ip, then trigger
+	 * the INT3, padded with NOPs to match a CALL instruction length.
+	 */
+#ifdef CONFIG_X86_32
+	asm volatile ("call 1f; 1: pop (%%edx); add $5, (%%edx);"
+		      "int3; nop; nop; nop; nop" : : "d" (&int3_ip), "a" (&val) : "memory");
+#else /* CONFIG_X86_64 */
+	asm volatile ("call 1f; 1: pop (%%rdx); add $5, (%%rdx);"
+		      "int3; nop; nop; nop; nop" : : "d" (&int3_ip), "D" (&val) : "memory");
+#endif
+
+	BUG_ON(val != 1);
+
+	unregister_die_notifier(&int3_exception_nb);
+}
+
 void __init alternative_instructions(void)
 {
-	/* The patching is not fully atomic, so try to avoid local interruptions
-	   that might execute the to be patched code.
-	   Other CPUs are not running. */
+	int3_selftest();
+
+	/*
+	 * The patching is not fully atomic, so try to avoid local
+	 * interruptions that might execute the to be patched code.
+	 * Other CPUs are not running.
+	 */
 	stop_nmi();
 
 	/*
@@ -642,10 +749,11 @@ void __init alternative_instructions(void)
 					    _text, _etext);
 	}
 
-	if (!uniproc_patched || num_possible_cpus() == 1)
+	if (!uniproc_patched || num_possible_cpus() == 1) {
 		free_init_pages("SMP alternatives",
 				(unsigned long)__smp_locks,
 				(unsigned long)__smp_locks_end);
+	}
 #endif
 
 	apply_paravirt(__parainstructions, __parainstructions_end);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 8b6d03e55d2f..e072cdd07284 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -572,6 +572,7 @@ NOKPROBE_SYMBOL(do_general_protection);
 
 dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 {
+	printk("int3 frame: %lx\n", __builtin_frame_address(0));
 #ifdef CONFIG_DYNAMIC_FTRACE
 	/*
 	 * ftrace must be first, everything else may cause a recursive crash.
Peter Zijlstra May 2, 2019, 4:29 p.m. UTC | #3
On Thu, May 02, 2019 at 06:21:33PM +0200, Peter Zijlstra wrote:
> Much thanks to Joerg Roedel for talking entry_32.S with me.
> 
> TL;DR, on x86_32 kernel->kernel IRET frames are only 3 entries and do
> not include ESP/SS, so not only wasn't regs->sp setup, if you changed it
> it wouldn't be effective and corrupt random stack state.

Also, i386 is bloody insane...
Linus Torvalds May 2, 2019, 6:02 p.m. UTC | #4
On Thu, May 2, 2019 at 9:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> TL;DR, on x86_32 kernel->kernel IRET frames are only 3 entries and do
> not include ESP/SS, so not only wasn't regs->sp setup, if you changed it
> it wouldn't be effective and corrupt random stack state.

Indeed, the 32-bit case for same-RPL exceptions/iret is entirely
different, and I'd forgotten about that.

And honestly, this makes the 32-bit case much worse. Now the entry
stack modifications of int3 suddenly affect not just the entry, but
every exit too.

This is _exactly_ the kind of subtle kernel entry/exit code I wanted
us to avoid.

And while your code looks kind of ok, it's subtly buggy. This sequence:

+       pushl   %eax
+       movl    %esp, %eax
+
+       movl    4*4(%eax), %esp         # restore (modified) regs->sp
+
+       /* rebuild IRET frame */
+       pushl   3*4(%eax)               # flags
+       pushl   2*4(%eax)               # cs
+       pushl   1*4(%eax)               # ip
+
+       andl    $0x0000ffff, 4(%esp)    # clear high CS bits
+
+       movl    (%eax), %eax            # restore eax

looks very wrong to me. When you do that "restore (modified)
regs->sp", isn't that now resetting %esp to the point where %eax now
points below the stack? So if we get an NMI in this sequence, that
will overwrite the parts you are trying to copy from?

Am I missing something? doesn't it need to be done something like

  pushl %eax
  pushl %ecx
  movl 20(%esp),%eax   # possibly modified regs->sp
  movl 16(%esp),%ecx   # flags
  movl %ecx,-4(%eax)
  movl 12(%esp),%ecx   # cs
  movl %ecx,-8(%eax)
  movl 8(%esp),%ecx   # ip
  movl %ecx, -12(%eax)
  movl 4(%esp),%ecx   # eax
  movl %ecx, -16(%eax)
  popl %ecx
  lea -16(%eax),%esp
  popl %eax

(NOTE NOTE NOTE I might have gotten the offsets and the direction of
the moves *completely* wrong, this is not a serious patch, it's meant
as a "something like this" thing!!)

But now I confused myself, and maybe I'm wrong.

                   Linus
Peter Zijlstra May 2, 2019, 6:18 p.m. UTC | #5
On Thu, May 02, 2019 at 11:02:40AM -0700, Linus Torvalds wrote:
> On Thu, May 2, 2019 at 9:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > TL;DR, on x86_32 kernel->kernel IRET frames are only 3 entries and do
> > not include ESP/SS, so not only wasn't regs->sp setup, if you changed it
> > it wouldn't be effective and corrupt random stack state.
> 
> Indeed, the 32-bit case for same-RPL exceptions/iret is entirely
> different, and I'd forgotten about that.
> 
> And honestly, this makes the 32-bit case much worse. Now the entry
> stack modifications of int3 suddenly affect not just the entry, but
> every exit too.

We could fix this by not using the common exit path on int3; not sure we
want to go there, but that is an option.

> This is _exactly_ the kind of subtle kernel entry/exit code I wanted
> us to avoid.
> 
> And while your code looks kind of ok, it's subtly buggy. This sequence:
> 
> +       pushl   %eax
> +       movl    %esp, %eax
> +
> +       movl    4*4(%eax), %esp         # restore (modified) regs->sp
> +
> +       /* rebuild IRET frame */
> +       pushl   3*4(%eax)               # flags
> +       pushl   2*4(%eax)               # cs
> +       pushl   1*4(%eax)               # ip
> +
> +       andl    $0x0000ffff, 4(%esp)    # clear high CS bits
> +
> +       movl    (%eax), %eax            # restore eax
> 
> looks very wrong to me. When you do that "restore (modified)
> regs->sp", isn't that now resetting %esp to the point where %eax now
> points below the stack? So if we get an NMI in this sequence, that
> will overwrite the parts you are trying to copy from?

ARGH; I knew it was too pretty :/ Yes, something like what you suggest
will be needed, I'll go look at that once my brain recovers a bit from
staring at entry code all day.
Peter Zijlstra May 2, 2019, 6:30 p.m. UTC | #6
On Thu, May 02, 2019 at 08:18:11PM +0200, Peter Zijlstra wrote:

> ARGH; I knew it was too pretty :/ Yes, something like what you suggest
> will be needed, I'll go look at that once my brain recovers a bit from
> staring at entry code all day.

I forgot I can just run the thing, and it works!


---
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 7b23431be5cb..73b7bca8712f 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -203,7 +203,7 @@
 .Lend_\@:
 .endm
 
-.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
+.macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0 clear_csh=1
 	cld
 	PUSH_GS
 	pushl	%fs
@@ -225,7 +225,7 @@
 
 	/* Switch to kernel stack if necessary */
 .if \switch_stacks > 0
-	SWITCH_TO_KERNEL_STACK
+	SWITCH_TO_KERNEL_STACK \clear_csh
 .endif
 
 .endm
@@ -377,8 +377,9 @@
 
 #define CS_FROM_ENTRY_STACK	(1 << 31)
 #define CS_FROM_USER_CR3	(1 << 30)
+#define CS_FROM_INT3		(1 << 29)
 
-.macro SWITCH_TO_KERNEL_STACK
+.macro SWITCH_TO_KERNEL_STACK clear_csh=1
 
 	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
 
@@ -391,12 +392,13 @@
 	 * that register for the time this macro runs
 	 */
 
+	.if \clear_csh
 	/*
-	 * The high bits of the CS dword (__csh) are used for
-	 * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case
-	 * hardware didn't do this for us.
+	 * The high bits of the CS dword (__csh) are used for CS_FROM_*. Clear
+	 * them in case hardware didn't do this for us.
 	 */
 	andl	$(0x0000ffff), PT_CS(%esp)
+	.endif
 
 	/* Are we on the entry stack? Bail out if not! */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
@@ -1019,6 +1021,40 @@ ENTRY(entry_INT80_32)
 	/* Restore user state */
 	RESTORE_REGS pop=4			# skip orig_eax/error_code
 .Lirq_return:
+	testl $CS_FROM_INT3, 4(%esp)
+	jz .Lno_iret_fixup
+
+	/*
+	 * Undo the magic from ENTRY(int3), in particular consider the case
+	 * where regs->sp has been modified.
+	 *
+	 * Reconstruct the 3 entry IRET frame right after the (modified)
+	 * regs->sp without lowering %esp in between, such that an NMI in the
+	 * middle doesn't scribble our stack.
+	 */
+
+	pushl	%eax
+	pushl	%ecx
+	movl	5*4(%esp), %eax		# (modified) regs->sp
+
+	movl	4*4(%esp), %ecx		# flags
+	movl	%ecx, -4(%eax)
+
+	movl	3*4(%esp), %ecx		# cs
+	andl	$0x0000ffff, %ecx
+	movl	%ecx, -8(%eax)
+
+	movl	2*4(%esp), %ecx		# ip
+	movl	%ecx, -12(%eax)
+
+	movl	1*4(%esp), %ecx		# eax
+	movl	%ecx, -16(%eax)
+
+	popl	%ecx
+	lea	-16(%eax), %esp
+	popl	%eax
+
+.Lno_iret_fixup:
 	/*
 	 * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
 	 * when returning from IPI handler and when returning from
@@ -1477,9 +1513,57 @@ END(nmi)
 
 ENTRY(int3)
 	ASM_CLAC
+
+	/*
+	 * The high bits of the CS dword (__csh) are used for CS_FROM_*. Clear
+	 * them in case hardware didn't do this for us.
+	 */
+	andl	$0x0000ffff, 4(%esp)
+
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, 8(%esp)
+	jnz	.Lfrom_usermode_no_gap
+#endif
+	testl	$SEGMENT_RPL_MASK, 4(%esp)
+	jnz	.Lfrom_usermode_no_gap
+
+	/*
+	 * Here from kernel mode; so the (exception) stack looks like:
+	 *
+	 * 12(esp) - <previous context>
+	 *  8(esp) - flags
+	 *  4(esp) - cs
+	 *  0(esp) - ip
+	 *
+	 * Lets build a 5 entry IRET frame after that, such that struct pt_regs
+	 * is complete and in particular regs->sp is correct. This gives us
+	 * the original 3 enties as gap:
+	 *
+	 * 32(esp) - <previous context>
+	 * 28(esp) - orig_flags / gap
+	 * 24(esp) - orig_cs	/ gap
+	 * 20(esp) - orig_ip	/ gap
+	 * 16(esp) - ss
+	 * 12(esp) - sp
+	 *  8(esp) - flags
+	 *  4(esp) - cs
+	 *  0(esp) - ip
+	 */
+	pushl	%ss	  # ss
+	pushl	%esp      # sp (points at ss)
+	pushl	4*4(%esp) # flags
+	pushl	4*4(%esp) # cs
+	pushl	4*4(%esp) # ip
+
+	add	$16, 12(%esp) # point sp back at the previous context
+
+	orl	$CS_FROM_INT3, 4(%esp) # mark magic IRET
+
+.Lfrom_usermode_no_gap:
+
 	pushl	$-1				# mark this as an int
 
-	SAVE_ALL switch_stacks=1
+	SAVE_ALL switch_stacks=1 clear_csh=0
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
 	xorl	%edx, %edx			# zero error code
Linus Torvalds May 2, 2019, 6:43 p.m. UTC | #7
On Thu, May 2, 2019 at 11:18 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> We could fix this by not using the common exit path on int3; not sure we
> want to go there, but that is an option.

I don't think it's an option in general, because *some* int3
invocations will need all the usual error return.

But I guess we could make "int3 from kernel space" special.

I'm not sure how much that would help, but it might be worth looking into.

> ARGH; I knew it was too pretty :/ Yes, something like what you suggest
> will be needed, I'll go look at that once my brain recovers a bit from
> staring at entry code all day.

Looks like it works based on your other email.

What would it look like with the "int3-from-kernel is special" modification?

Because *if* we can make the "kernel int3" entirely special, that
would make the "Eww factor" much less of this whole thing.

I forget: is #BP _only_ for the "int3" instruction? I know we have
really nasty cases with #DB (int1) because of "pending exceptions
happen on the first instruction in kernel space", and that makes it
really really nasty to handle with all the stack switch and %cr3
handling etc.

But if "int3 from kernel space" _only_ happens on actual "int3"
instructions, then we really could just special-case that case. We'd
know that %cr3 has been switched, we'd know that we don't need to do
fsgs switching, we'd know we already have a good stack and percpu data
etc set up.

So then special casing #BP would actually allow us to have a simple
and straightforward kernel-int3-only sequence?

And then having that odd stack setup special case would be *much* more
palatable to me.

               Linus
Jiri Kosina May 2, 2019, 7:28 p.m. UTC | #8
On Thu, 2 May 2019, Linus Torvalds wrote:

> I forget: is #BP _only_ for the "int3" instruction? 

Hmm, according to 17.3.2 in vol 3 of SDM (and table 6-1 there), that 
indeed seems to be the case, so we should be fine.

> But if "int3 from kernel space" _only_ happens on actual "int3"
> instructions, then we really could just special-case that case. We'd
> know that %cr3 has been switched, we'd know that we don't need to do
> fsgs switching, we'd know we already have a good stack and percpu data
> etc set up.

That should indeed be doable, under the assumption that no one is doing any 
int3 games before we've switched away from the entry trampoline.

I've briefly looked, and it seems like we have proper notrace annotations for 
stackleak_erase(), which seems to be the only C (ftrace-able) code that's 
running on a trampoline (off-topic: where does the PTI code make sure that we 
actually map this symbol into the user pagetables in case we're not doing a 
global mapping?).
Peter Zijlstra May 2, 2019, 8:21 p.m. UTC | #9
On Thu, May 02, 2019 at 11:43:37AM -0700, Linus Torvalds wrote:
> What would it look like with the "int3-from-kernel is special" modification?

Something like so; it boots; but I could've made some horrible mistake
(again).

---
diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 7b23431be5cb..4de51cff5b8a 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -67,9 +67,20 @@
 # define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
 # define preempt_stop(clobbers)
-# define resume_kernel		restore_all_kernel
 #endif
 
+.macro RETINT_PREEMPT
+#ifdef CONFIG_PREEMPT
+	DISABLE_INTERRUPTS(CLBR_ANY)
+	cmpl	$0, PER_CPU_VAR(__preempt_count)
+	jnz	.Lend_\@
+	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
+	jz	.Lend_\@
+	call	preempt_schedule_irq
+.Lend_\@:
+#endif
+.endm
+
 .macro TRACE_IRQS_IRET
 #ifdef CONFIG_TRACE_IRQFLAGS
 	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)     # interrupts off?
@@ -755,7 +766,7 @@ END(ret_from_fork)
 	andl	$SEGMENT_RPL_MASK, %eax
 #endif
 	cmpl	$USER_RPL, %eax
-	jb	resume_kernel			# not returning to v8086 or userspace
+	jb	restore_all_kernel		# not returning to v8086 or userspace
 
 ENTRY(resume_userspace)
 	DISABLE_INTERRUPTS(CLBR_ANY)
@@ -765,18 +776,6 @@ ENTRY(resume_userspace)
 	jmp	restore_all
 END(ret_from_exception)
 
-#ifdef CONFIG_PREEMPT
-ENTRY(resume_kernel)
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	cmpl	$0, PER_CPU_VAR(__preempt_count)
-	jnz	restore_all_kernel
-	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
-	jz	restore_all_kernel
-	call	preempt_schedule_irq
-	jmp	restore_all_kernel
-END(resume_kernel)
-#endif
-
 GLOBAL(__begin_SYSENTER_singlestep_region)
 /*
  * All code from here through __end_SYSENTER_singlestep_region is subject
@@ -1027,6 +1026,7 @@ ENTRY(entry_INT80_32)
 	INTERRUPT_RETURN
 
 restore_all_kernel:
+	RETINT_PREEMPT
 	TRACE_IRQS_IRET
 	PARANOID_EXIT_TO_KERNEL_MODE
 	BUG_IF_WRONG_CR3
@@ -1477,6 +1477,94 @@ END(nmi)
 
 ENTRY(int3)
 	ASM_CLAC
+
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, 8(%esp)
+	jnz	.Lfrom_usermode_no_gap
+#endif
+	testl	$SEGMENT_RPL_MASK, 4(%esp)
+	jnz	.Lfrom_usermode_no_gap
+
+	/*
+	 * Here from kernel mode; so the (exception) stack looks like:
+	 *
+	 * 12(esp) - <previous context>
+	 *  8(esp) - flags
+	 *  4(esp) - cs
+	 *  0(esp) - ip
+	 *
+	 * Lets build a 5 entry IRET frame after that, such that struct pt_regs
+	 * is complete and in particular regs->sp is correct. This gives us
+	 * the original 3 enties as gap:
+	 *
+	 * 32(esp) - <previous context>
+	 * 28(esp) - orig_flags / gap
+	 * 24(esp) - orig_cs	/ gap
+	 * 20(esp) - orig_ip	/ gap
+	 * 16(esp) - ss
+	 * 12(esp) - sp
+	 *  8(esp) - flags
+	 *  4(esp) - cs
+	 *  0(esp) - ip
+	 */
+	pushl	%ss	  # ss
+	pushl	%esp      # sp (points at ss)
+	pushl	4*4(%esp) # flags
+	pushl	4*4(%esp) # cs
+	pushl	4*4(%esp) # ip
+
+	add	$16, 12(%esp) # point sp back at the previous context
+
+	pushl	$-1				# orig_eax; mark as interrupt
+
+	SAVE_ALL
+	ENCODE_FRAME_POINTER
+	TRACE_IRQS_OFF
+	xorl	%edx, %edx			# zero error code
+	movl	%esp, %eax			# pt_regs pointer
+	call	do_int3
+
+	RETINT_PREEMPT
+	TRACE_IRQS_IRET
+	/*
+	 * If we really never INT3 from entry code, it looks like
+	 * we can skip this one.
+	PARANOID_EXIT_TO_KERNEL_MODE
+	 */
+	BUG_IF_WRONG_CR3
+	RESTORE_REGS 4				# consume orig_eax
+
+	/*
+	 * Reconstruct the 3 entry IRET frame right after the (modified)
+	 * regs->sp without lowering %esp in between, such that an NMI in the
+	 * middle doesn't scribble our stack.
+	 */
+
+	pushl	%eax
+	pushl	%ecx
+	movl	5*4(%esp), %eax		# (modified) regs->sp
+
+	movl	4*4(%esp), %ecx		# flags
+	movl	%ecx, -4(%eax)
+
+	movl	3*4(%esp), %ecx		# cs
+	andl	$0x0000ffff, %ecx
+	movl	%ecx, -8(%eax)
+
+	movl	2*4(%esp), %ecx		# ip
+	movl	%ecx, -12(%eax)
+
+	movl	1*4(%esp), %ecx		# eax
+	movl	%ecx, -16(%eax)
+
+	popl	%ecx
+	lea	-16(%eax), %esp
+	popl	%eax
+
+	jmp	.Lirq_return
+
+.Lfrom_usermode_no_gap:
+
 	pushl	$-1				# mark this as an int
 
 	SAVE_ALL switch_stacks=1
Andy Lutomirski May 2, 2019, 8:25 p.m. UTC | #10
> On May 2, 2019, at 12:28 PM, Jiri Kosina <jikos@kernel.org> wrote:
> 
>> On Thu, 2 May 2019, Linus Torvalds wrote:
>> 
>> I forget: is #BP _only_ for the "int3" instruction? 
> 
> Hmm, according to 17.3.2 in vol 3 of SDM (and table 6-1 there), that 
> indeed seems to be the case, so we should be fine.

I’m reasonably confident that the absurd MOV SS; INT3 sequence results in #BP from user mode and then #DB inside that (as opposed to vice versa), so this should be okay.
Steven Rostedt May 2, 2019, 8:48 p.m. UTC | #11
On Thu, 2 May 2019 11:02:40 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Indeed, the 32-bit case for same-RPL exceptions/iret is entirely
> different, and I'd forgotten about that.
> 
> And honestly, this makes the 32-bit case much worse. Now the entry
> stack modifications of int3 suddenly affect not just the entry, but
> every exit too.
> 
> This is _exactly_ the kind of subtle kernel entry/exit code I wanted
> us to avoid.

I just want to point out that I never got the trampoline version
working on i386. I didn't have the time to debug why it would crash
when stressed.

-- Steve
Linus Torvalds May 2, 2019, 8:49 p.m. UTC | #12
On Thu, May 2, 2019 at 1:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Something like so; it boots; but I could've made some horrible mistake
> (again).

This actually looks much better to me.

Maybe it's more lines (I didn't check), but it's a lot simpler in that
now the magic of the int3 stack doesn't get exposed to anything else.

We *could* also make this kernel-mode-only do_int3() be a special
function, and do something like

        # args: pt_regs pointer (no error code for int3)
        movl %esp,%eax
        # allocate a bit of extra room on the stack, so that
        # 'kernel_int3' can move the pt_regs
        subl $8,%esp
        call kernel_int3
        movl %eax,%esp

and not do any stack switching magic in the asm code AT ALL. We'd do

    struct pt_regs *kernel_int3(struct pt_regs *regs)
    {
        ..
        return regs;
    }

and now the rule for call emulation ends up being that you need to
"memmove()" the pt_regs up and down properly, and return the new
pt_regs pointer.

Hmm? That would simplify the asm code further, but some people might
find it objectionable?

                  Linus
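Linus's pt_regs-relocation idea can be sketched outside the kernel with a toy frame. Everything below (the three-field pt_regs, the helper name, the buffer standing in for the kernel stack) is illustrative only, and a real version would also have to adjust regs->sp:

```c
#include <assert.h>
#include <string.h>

/* Toy pt_regs: just enough fields to show the relocation. */
struct pt_regs { unsigned long ip, cs, flags; };

/*
 * Emulate pushing a word into the interrupted context: slide the whole
 * pt_regs one word down the stack, store the value in the vacated word
 * just above it, and hand the new pointer back for the asm stub to load.
 */
static struct pt_regs *kernel_int3_push(struct pt_regs *regs, unsigned long val)
{
	struct pt_regs *new_regs =
		(struct pt_regs *)((char *)regs - sizeof(unsigned long));

	memmove(new_regs, regs, sizeof(*regs));
	*(unsigned long *)((char *)new_regs + sizeof(*new_regs)) = val;
	return new_regs;
}

static int run_demo(void)
{
	unsigned long stack[8] = { 0 };
	/* place the frame away from the buffer edges so it can slide down */
	struct pt_regs *regs = (struct pt_regs *)&stack[4];

	regs->ip = 0x1001;
	regs->cs = 0x10;
	regs->flags = 0x200;

	struct pt_regs *new_regs = kernel_int3_push(regs, 0xdead);

	assert((char *)new_regs == (char *)regs - sizeof(unsigned long));
	assert(new_regs->ip == 0x1001 && new_regs->flags == 0x200);
	/* the pushed word sits where the old frame's top word used to be */
	assert(*(unsigned long *)((char *)new_regs + sizeof(*new_regs)) == 0xdead);
	return 0;
}
```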
Peter Zijlstra May 2, 2019, 9:32 p.m. UTC | #13
On Thu, May 02, 2019 at 01:49:29PM -0700, Linus Torvalds wrote:

> We *could* also make this kernel-mode-only do_int3() be a special
> function, and do something like

I think I prefer the variant we have now. The int3_emulate_*() things
work uniformly and as expected on 32 and 64 bit (it would even work for
userspace if it weren't for SMAP).

So while the 32bit kernel entry is 'special', all the INT3 handlers can
uniformly prod at pt_regs in a natural way and have it work.

Making it special -- just for 32bit, seems like the wrong thing to me.
Steven Rostedt May 2, 2019, 10:52 p.m. UTC | #14
On Thu, 2 May 2019 22:21:46 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Thu, May 02, 2019 at 11:43:37AM -0700, Linus Torvalds wrote:
> > What would it look like with the "int3-from-kernel is special" modification?  
> 
> Something like so; it boots; but I could've made some horrible mistake
> (again).
> 
> ---
> diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> index 7b23431be5cb..4de51cff5b8a 100644
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S

Oh, and so close!

I was running this on my i386 tests and for test 8 of 9 (passed 1-7) I
hit this:

Testing all events: OK
Testing ftrace filter: OK
trace_kprobe: Testing kprobe tracing: 
BUG: unable to handle kernel paging request at 10ec839c
#PF error: [INSTR]
*pdpt = 0000000000000000 *pde = 0000000000000000 
Oops: 0010 [#1] SMP PTI
CPU: 1 PID: 1 Comm: swapper/0 Not tainted 5.1.0-rc3-test+ #1
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
EIP: 0x10ec839c
Code: Bad RIP value.
EAX: 54e24bbc EBX: 00000000 ECX: 00000003 EDX: 00000002
ESI: ebcae400 EDI: c1513a88 EBP: ee641f30 ESP: c0439a07
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210202
CR0: 80050033 CR2: 10ec8392 CR3: 0167e000 CR4: 001406f0
Call Trace:
Modules linked in:
CR2: 0000000010ec839c
---[ end trace 8aa996061578b437 ]---
EIP: 0x10ec839c
Code: Bad RIP value.
EAX: 54e24bbc EBX: 00000000 ECX: 00000003 EDX: 00000002
ESI: ebcae400 EDI: c1513a88 EBP: ee641f30 ESP: c16834bc
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210202
CR0: 80050033 CR2: 10ec8392 CR3: 0167e000 CR4: 001406f0
Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009 ]---
------------[ cut here ]------------
sched: Unexpected reschedule of offline CPU#3!
WARNING: CPU: 1 PID: 1 at arch/x86/kernel/smp.c:128 native_smp_send_reschedule+0x1c/0x37
Modules linked in:
CPU: 1 PID: 1 Comm: swapper/0 Tainted: G      D           5.1.0-rc3-test+ #1
Hardware name: MSI MS-7823/CSM-H87M-G43 (MS-7823), BIOS V1.6 02/22/2014
EIP: native_smp_send_reschedule+0x1c/0x37
Code: 4a 18 ba fb 00 00 00 e8 71 d6 9c 00 5d c3 55 89 e5 e8 37 6f 00 00 0f a3 05 e0 20 44 c1 72 11 50 68 d2 75 15 c1 e8 c7 a4 01 00 <0f> 0b 58 5a eb 13 8b 15 c0 36 28 c1 8b 4a 18 ba fd 00 00 00 e8 3a
EAX: 0000002e EBX: 00000003 ECX: c049977e EDX: ee638000
ESI: 00000003 EDI: 00000003 EBP: ee641e04 ESP: ee641dfc
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210092
CR0: 80050033 CR2: 10ec8392 CR3: 0167e000 CR4: 001406f0
Call Trace:
 kick_ilb+0x77/0x7c
 trigger_load_balance+0x279/0x280
 ? __pick_first_entity+0x18/0x18
 scheduler_tick+0x90/0xa9
 ? __pick_first_entity+0x18/0x18
 update_process_times+0x3f/0x4a
 tick_sched_handle+0x4c/0x5a
 tick_sched_timer+0x3b/0x79
 ? tick_sched_do_timer+0x44/0x44
 __hrtimer_run_queues+0x180/0x26a
 ? tick_sched_do_timer+0x44/0x44
 hrtimer_interrupt+0xa6/0x18e
 ? rcu_irq_enter+0x60/0x7e
 smp_apic_timer_interrupt+0xdf/0x179
 apic_timer_interrupt+0xda/0xe0
EIP: panic+0x208/0x24a
Code: 83 c3 64 eb ad 83 3d a8 d1 68 c1 00 74 05 e8 01 bf 01 00 68 e0 d1 68 c1 68 cc e6 15 c1 e8 bd e7 04 00 e8 02 ff 0a 00 fb 31 db <58> 5a e8 c5 a9 09 00 39 fb 7c 18 83 f6 01 8b 15 a0 d1 68 c1 89 f0
EAX: c044c764 EBX: 00000000 ECX: c049977e EDX: 318e4000
ESI: 00000000 EDI: 00000000 EBP: ee641f6c ESP: ee641f54
DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00200246
 ? console_unlock+0x466/0x4b8
 ? panic+0x205/0x24a
 ? do_exit+0x4bd/0x8d1
 do_exit+0x4e3/0x8d1
 ? copy_oldmem_page+0x63/0x9b
 rewind_stack_do_exit+0x11/0x13
irq event stamp: 1400780
hardirqs last  enabled at (1400779): [<c0401568>] trace_hardirqs_on_thunk+0xc/0x10
hardirqs last disabled at (1400780): [<c0401578>] trace_hardirqs_off_thunk+0xc/0x10
softirqs last  enabled at (1400760): [<c0dfefb2>] __do_softirq+0x2a2/0x2d2
softirqs last disabled at (1400753): [<c0416fb7>] call_on_stack+0x45/0x4b
---[ end trace 8aa996061578b438 ]---
------------[ cut here ]------------

That's one of the startup tests that happens on boot up.

Config attached.

Good news is, it appears very reproducible.

-- Steve

Here's the patch I applied (both folded together):

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..2885acd691ac 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -67,9 +67,20 @@
 # define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
 # define preempt_stop(clobbers)
-# define resume_kernel		restore_all_kernel
 #endif
 
+.macro RETINT_PREEMPT
+#ifdef CONFIG_PREEMPT
+	DISABLE_INTERRUPTS(CLBR_ANY)
+	cmpl	$0, PER_CPU_VAR(__preempt_count)
+	jnz	.Lend_\@
+	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
+	jz	.Lend_\@
+	call	preempt_schedule_irq
+.Lend_\@:
+#endif
+.endm
+
 .macro TRACE_IRQS_IRET
 #ifdef CONFIG_TRACE_IRQFLAGS
 	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)     # interrupts off?
@@ -753,7 +764,7 @@ ret_from_intr:
 	andl	$SEGMENT_RPL_MASK, %eax
 #endif
 	cmpl	$USER_RPL, %eax
-	jb	resume_kernel			# not returning to v8086 or userspace
+	jb	restore_all_kernel		# not returning to v8086 or userspace
 
 ENTRY(resume_userspace)
 	DISABLE_INTERRUPTS(CLBR_ANY)
@@ -763,19 +774,6 @@ ENTRY(resume_userspace)
 	jmp	restore_all
 END(ret_from_exception)
 
-#ifdef CONFIG_PREEMPT
-ENTRY(resume_kernel)
-	DISABLE_INTERRUPTS(CLBR_ANY)
-.Lneed_resched:
-	cmpl	$0, PER_CPU_VAR(__preempt_count)
-	jnz	restore_all_kernel
-	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
-	jz	restore_all_kernel
-	call	preempt_schedule_irq
-	jmp	.Lneed_resched
-END(resume_kernel)
-#endif
-
 GLOBAL(__begin_SYSENTER_singlestep_region)
 /*
  * All code from here through __end_SYSENTER_singlestep_region is subject
@@ -1026,6 +1024,7 @@ restore_all:
 	INTERRUPT_RETURN
 
 restore_all_kernel:
+	RETINT_PREEMPT
 	TRACE_IRQS_IRET
 	PARANOID_EXIT_TO_KERNEL_MODE
 	BUG_IF_WRONG_CR3
@@ -1476,6 +1475,94 @@ END(nmi)
 
 ENTRY(int3)
 	ASM_CLAC
+
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, 8(%esp)
+	jnz	.Lfrom_usermode_no_gap
+#endif
+	testl	$SEGMENT_RPL_MASK, 4(%esp)
+	jnz	.Lfrom_usermode_no_gap
+
+	/*
+	 * Here from kernel mode; so the (exception) stack looks like:
+	 *
+	 * 12(esp) - <previous context>
+	 *  8(esp) - flags
+	 *  4(esp) - cs
+	 *  0(esp) - ip
+	 *
+	 * Let's build a 5 entry IRET frame after that, such that struct pt_regs
+	 * is complete and in particular regs->sp is correct. This gives us
+	 * the original 3 entries as gap:
+	 *
+	 * 32(esp) - <previous context>
+	 * 28(esp) - orig_flags / gap
+	 * 24(esp) - orig_cs	/ gap
+	 * 20(esp) - orig_ip	/ gap
+	 * 16(esp) - ss
+	 * 12(esp) - sp
+	 *  8(esp) - flags
+	 *  4(esp) - cs
+	 *  0(esp) - ip
+	 */
+	pushl	%ss	  # ss
+	pushl	%esp      # sp (points at ss)
+	pushl	4*4(%esp) # flags
+	pushl	4*4(%esp) # cs
+	pushl	4*4(%esp) # ip
+
+	add	$16, 12(%esp) # point sp back at the previous context
+
+	pushl	$-1				# orig_eax; mark as interrupt
+
+	SAVE_ALL
+	ENCODE_FRAME_POINTER
+	TRACE_IRQS_OFF
+	xorl	%edx, %edx			# zero error code
+	movl	%esp, %eax			# pt_regs pointer
+	call	do_int3
+
+	RETINT_PREEMPT
+	TRACE_IRQS_IRET
+	/*
+	 * If we really never INT3 from entry code, it looks like
+	 * we can skip this one.
+	PARANOID_EXIT_TO_KERNEL_MODE
+	 */
+	BUG_IF_WRONG_CR3
+	RESTORE_REGS 4				# consume orig_eax
+
+	/*
+	 * Reconstruct the 3 entry IRET frame right after the (modified)
+	 * regs->sp without lowering %esp in between, such that an NMI in the
+	 * middle doesn't scribble our stack.
+	 */
+
+	pushl	%eax
+	pushl	%ecx
+	movl	5*4(%esp), %eax		# (modified) regs->sp
+
+	movl	4*4(%esp), %ecx		# flags
+	movl	%ecx, -4(%eax)
+
+	movl	3*4(%esp), %ecx		# cs
+	andl	$0x0000ffff, %ecx
+	movl	%ecx, -8(%eax)
+
+	movl	2*4(%esp), %ecx		# ip
+	movl	%ecx, -12(%eax)
+
+	movl	1*4(%esp), %ecx		# eax
+	movl	%ecx, -16(%eax)
+
+	popl	%ecx
+	lea	-16(%eax), %esp
+	popl	%eax
+
+	jmp	.Lirq_return
+
+.Lfrom_usermode_no_gap:
+
 	pushl	$-1				# mark this as an int
 
 	SAVE_ALL switch_stacks=1
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..834ec1397dab 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -879,7 +879,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -899,6 +899,16 @@ ENTRY(\sym)
 	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
 
+	.if \create_gap == 1
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jnz	.Lfrom_usermode_no_gap_\@
+	.rept 6
+	pushq	5*8(%rsp)
+	.endr
+	UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+	.endif
+
 	.if \paranoid
 	call	paranoid_entry
 	.else
@@ -1130,7 +1140,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3			do_int3			has_error_code=0
+idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..ba275b6292db 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,24 @@ extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
 
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+	regs->sp -= sizeof(unsigned long);
+	*(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+	regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+	int3_emulate_jmp(regs, func);
+}
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..fd152f5a937b 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
 #include <asm/kprobes.h>
 #include <asm/ftrace.h>
 #include <asm/nops.h>
+#include <asm/text-patching.h>
 
 #ifdef CONFIG_DYNAMIC_FTRACE
 
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
 }
 
 static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
 
 static int update_ftrace_func(unsigned long ip, void *new)
 {
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 	unsigned char *new;
 	int ret;
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
 
@@ -294,13 +298,21 @@ int ftrace_int3_handler(struct pt_regs *regs)
 	if (WARN_ON_ONCE(!regs))
 		return 0;
 
-	ip = regs->ip - 1;
-	if (!ftrace_location(ip) && !is_ftrace_caller(ip))
-		return 0;
+	ip = regs->ip - INT3_INSN_SIZE;
 
-	regs->ip += MCOUNT_INSN_SIZE - 1;
+	if (ftrace_location(ip)) {
+		int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+		return 1;
+	} else if (is_ftrace_caller(ip)) {
+		if (!ftrace_update_func_call) {
+			int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+			return 1;
+		}
+		int3_emulate_call(regs, ftrace_update_func_call);
+		return 1;
+	}
 
-	return 1;
+	return 0;
 }
 NOKPROBE_SYMBOL(ftrace_int3_handler);
 
@@ -859,6 +871,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
 
 	func = ftrace_ops_get_func(ops);
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	/* Do a safe modify in case the trampoline is executing */
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
@@ -960,6 +974,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
 {
 	unsigned char *new;
 
+	ftrace_update_func_call = 0UL;
 	new = ftrace_jmp_replace(ip, (unsigned long)func);
 
 	return update_ftrace_func(ip, new);
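The int3_emulate_*() helpers in the text-patching.h hunk above are small enough to exercise in user space. A stand-alone sketch (toy pt_regs, and the 0x1000/0x2000 addresses are arbitrary markers, not real call sites):

```c
#include <assert.h>

/* Toy pt_regs plus the helpers from the hunk above, same in spirit. */
struct pt_regs { unsigned long ip, sp; };

#define INT3_INSN_SIZE 1
#define CALL_INSN_SIZE 5

static void int3_emulate_push(struct pt_regs *regs, unsigned long val)
{
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = val;
}

static void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
{
	regs->ip = ip;
}

static void int3_emulate_call(struct pt_regs *regs, unsigned long func)
{
	/* return address = first byte after the 5-byte call being emulated */
	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
	int3_emulate_jmp(regs, func);
}

static unsigned long demo_stack[8];

static int run_demo(void)
{
	struct pt_regs regs = {
		.ip = 0x1000 + INT3_INSN_SIZE,	/* trap ip: just past the int3 byte */
		.sp = (unsigned long)&demo_stack[8],
	};

	int3_emulate_call(&regs, 0x2000);

	assert(regs.ip == 0x2000);			/* resumes at the target */
	assert(*(unsigned long *)regs.sp == 0x1005);	/* 0x1000 + 5-byte call */
	return 0;
}
```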
Steven Rostedt May 2, 2019, 11:31 p.m. UTC | #15
On Thu, 2 May 2019 18:52:25 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> On Thu, 2 May 2019 22:21:46 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Thu, May 02, 2019 at 11:43:37AM -0700, Linus Torvalds wrote:  
> > > What would it look like with the "int3-from-kernel is special" modification?    
> > 
> > Something like so; it boots; but I could've made some horrible mistake
> > (again).
> > 
> > ---
> > diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
> > index 7b23431be5cb..4de51cff5b8a 100644
> > --- a/arch/x86/entry/entry_32.S
> > +++ b/arch/x86/entry/entry_32.S  
> 
> Oh, and so close!
> 
> I was running this on my i386 tests and for test 8 of 9 (passed 1-7) I
> hit this:

Digging a little further, I pinpointed it to kretprobes. The
problem I believe is the use of kernel_stack_pointer() which does some
magic on x86_32. kretprobes uses this to hijack the return address of
the function (much like the function graph tracer does). I do have code
that would allow kretprobes to use the function graph tracer instead,
but that's still in progress (almost done!). But still, we should not
have this break the use of kernel_stack_pointer() either.

Adding some printks in that code, it looks to be returning "&regs->sp"
which I think we changed.

-- Steve
Steven Rostedt May 2, 2019, 11:50 p.m. UTC | #16
On Thu, 2 May 2019 19:31:29 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Digging a little further, I pinpointed it out to being kretprobes. The
> problem I believe is the use of kernel_stack_pointer() which does some
> magic on x86_32. kretprobes uses this to hijack the return address of
> the function (much like the function graph tracer does). I do have code
> that would allow kretprobes to use the function graph tracer instead,
> but that's still in progress (almost done!). But still, we should not
> have this break the use of kernel_stack_pointer() either.
> 
> Adding some printks in that code, it looks to be returning "&regs->sp"
> which I think we changed.
>

This appears to fix it!

-- Steve

diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 4b8ee05dd6ad..600ead178bf4 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -171,8 +171,12 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs)
 	unsigned long sp = (unsigned long)&regs->sp;
 	u32 *prev_esp;
 
-	if (context == (sp & ~(THREAD_SIZE - 1)))
+	if (context == (sp & ~(THREAD_SIZE - 1))) {
+		/* int3 code adds a gap */
+		if (sp == regs->sp - 5*4)
+			return regs->sp;
 		return sp;
+	}
 
 	prev_esp = (u32 *)(context);
 	if (*prev_esp)
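The 5*4 in this check falls out of the frame layout built by the new entry code: regs->sp points back at the previous context, which sits exactly five words (sp, ss, and the 3-word gap) above the sp field itself. A stand-alone sketch (word size generalized from i386's 4 bytes so it runs on any host):

```c
#include <assert.h>

static int run_demo(void)
{
	unsigned long stack[16];
	unsigned long *prev_ctx = &stack[16];		/* "previous context" base */
	const unsigned long W = sizeof(unsigned long);	/* 4 on i386 */

	/* the sp field sits 5 words below prev context: 3-word gap + ss + sp */
	unsigned long *sp_field = prev_ctx - 5;
	*sp_field = (unsigned long)prev_ctx;	/* regs->sp after the "add $16" fixup */

	/* the detection from the fix: &regs->sp == regs->sp - 5 words */
	assert((unsigned long)sp_field == *sp_field - 5 * W);
	return 0;
}
```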
Peter Zijlstra May 3, 2019, 9:29 a.m. UTC | #17
On Thu, May 02, 2019 at 07:50:52PM -0400, Steven Rostedt wrote:
> On Thu, 2 May 2019 19:31:29 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > Digging a little further, I pinpointed it out to being kretprobes. The
> > problem I believe is the use of kernel_stack_pointer() which does some
> > magic on x86_32. kretprobes uses this to hijack the return address of
> > the function (much like the function graph tracer does). I do have code
> > that would allow kretprobes to use the function graph tracer instead,
> > but that's still in progress (almost done!). But still, we should not
> > have this break the use of kernel_stack_pointer() either.
> > 
> > Adding some printks in that code, it looks to be returning "&regs->sp"
> > which I think we changed.
> >
> 
> This appears to fix it!
> 
> -- Steve
> 
> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> index 4b8ee05dd6ad..600ead178bf4 100644
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -171,8 +171,12 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs)
>  	unsigned long sp = (unsigned long)&regs->sp;
>  	u32 *prev_esp;
>  
> -	if (context == (sp & ~(THREAD_SIZE - 1)))
> +	if (context == (sp & ~(THREAD_SIZE - 1))) {
> +		/* int3 code adds a gap */
> +		if (sp == regs->sp - 5*4)
> +			return regs->sp;
>  		return sp;
> +	}
>  
>  	prev_esp = (u32 *)(context);
>  	if (*prev_esp)

OMG, WTF, ARGH... That code is fsck'ing horrible. I'd almost argue to
always do the INT3 thing, just to avoid games like that.

That said; for normal traps &regs->sp is indeed the previous context --
if it doesn't fall off the stack. Your hack detects the regular INT3
frame. However if regs->sp has been modified (int3_emulate_push, for
example) your detection comes unstuck.

Now, it is rather unlikely these two code paths interact, but just to be
safe, something like so might be more reliable:


diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 4b8ee05dd6ad..aceaad0cc9a9 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -163,6 +163,9 @@ static inline bool invalid_selector(u16 value)
  * stack pointer we fall back to regs as stack if no previous stack
  * exists.
  *
+ * There is a special case for INT3, there we construct a full pt_regs
+ * environment. We can detect this case by a high bit in regs->cs
+ *
  * This is valid only for kernel mode traps.
  */
 unsigned long kernel_stack_pointer(struct pt_regs *regs)
@@ -171,6 +174,9 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs)
 	unsigned long sp = (unsigned long)&regs->sp;
 	u32 *prev_esp;
 
+	if (regs->__csh & (1 << 13)) /* test CS_FROM_INT3 */
+		return regs->sp;
+
 	if (context == (sp & ~(THREAD_SIZE - 1)))
 		return sp;
 
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -388,6 +388,7 @@
 
 #define CS_FROM_ENTRY_STACK	(1 << 31)
 #define CS_FROM_USER_CR3	(1 << 30)
+#define CS_FROM_INT3		(1 << 29)
 
 .macro SWITCH_TO_KERNEL_STACK
 
@@ -1515,6 +1516,9 @@ ENTRY(int3)
 
 	add	$16, 12(%esp) # point sp back at the previous context
 
+	andl	$0x0000ffff, 4(%esp)
+	orl	$CS_FROM_INT3, 4(%esp)
+
 	pushl	$-1				# orig_eax; mark as interrupt
 
 	SAVE_ALL
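As an aside on why the `(1 << 13)` test against regs->__csh matches CS_FROM_INT3 == (1 << 29): pt_regs splits the 32-bit CS slot into a 16-bit cs and a 16-bit __csh, so bit 29 of the full word is bit 13 of the high half. A quick sketch (the 0x68 selector value is just an example):

```c
#include <assert.h>
#include <stdint.h>

#define CS_FROM_INT3 (1u << 29)	/* flag in the 32-bit CS slot */

/* Model __csh: on 32-bit x86 the CS slot is one 32-bit word, but pt_regs
 * declares it as two 16-bit halves, cs (the selector) and __csh (the
 * spare high half the entry code can stash flags in). */
static uint16_t csh_of(uint32_t cs_word)
{
	return (uint16_t)(cs_word >> 16);
}

static int run_demo(void)
{
	uint32_t cs = 0x0068;			/* a kernel CS, no flags */
	assert(!(csh_of(cs) & (1 << 13)));

	cs |= CS_FROM_INT3;			/* what the entry asm or's in */
	/* bit 29 of the word == bit (29 - 16) == bit 13 of __csh */
	assert(csh_of(cs) & (1 << 13));
	return 0;
}
```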
Steven Rostedt May 3, 2019, 1:22 p.m. UTC | #18
On Fri, 3 May 2019 11:29:59 +0200
Peter Zijlstra <peterz@infradead.org> wrote:


> OMG, WTF, ARGH... That code is fsck'ing horrible. I'd almost argue to
> always do the INT3 thing, just to avoid games like that.

Hehe, that's almost the exact same thoughts I had when seeing this
code ;-)

> 
> That said; for normal traps &regs->sp is indeed the previous context --
> if it doesn't fall off the stack. Your hack detects the regular INT3
> frame. Howver if regs->sp has been modified (int3_emulate_push, for
> example) your detectoring comes unstuck.

Yep. I realized the issue as well. But wanted to make sure this did
work when sp wasn't changed.

> 
> Now, it is rather unlikely these two code paths interact, but just to be
> safe, something like so might be more reliable:
> 
> 
> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
> index 4b8ee05dd6ad..aceaad0cc9a9 100644
> --- a/arch/x86/kernel/ptrace.c
> +++ b/arch/x86/kernel/ptrace.c
> @@ -163,6 +163,9 @@ static inline bool invalid_selector(u16 value)
>   * stack pointer we fall back to regs as stack if no previous stack
>   * exists.
>   *
> + * There is a special case for INT3, there we construct a full pt_regs
> + * environment. We can detect this case by a high bit in regs->cs
> + *
>   * This is valid only for kernel mode traps.
>   */
>  unsigned long kernel_stack_pointer(struct pt_regs *regs)
> @@ -171,6 +174,9 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs)
>  	unsigned long sp = (unsigned long)&regs->sp;
>  	u32 *prev_esp;
>  
> +	if (regs->__csh & (1 << 13)) /* test CS_FROM_INT3 */
> +		return regs->sp;
> +

Thanks, I was looking into doing something like this (setting a flag in
the int3 code), but didn't have the time to see the best way to do this.

I'll add this version of the code and run it through my tests.

-- Steve

>  	if (context == (sp & ~(THREAD_SIZE - 1)))
>  		return sp;
>  
> --- a/arch/x86/entry/entry_32.S
> +++ b/arch/x86/entry/entry_32.S
> @@ -388,6 +388,7 @@
>  
>  #define CS_FROM_ENTRY_STACK	(1 << 31)
>  #define CS_FROM_USER_CR3	(1 << 30)
> +#define CS_FROM_INT3		(1 << 29)
>  
>  .macro SWITCH_TO_KERNEL_STACK
>  
> @@ -1515,6 +1516,9 @@ ENTRY(int3)
>  
>  	add	$16, 12(%esp) # point sp back at the previous context
>  
> +	andl	$0x0000ffff, 4(%esp)
> +	orl	$CS_FROM_INT3, 4(%esp)
> +
>  	pushl	$-1				# orig_eax; mark as interrupt
>  
>  	SAVE_ALL
Andy Lutomirski May 3, 2019, 4:20 p.m. UTC | #19
> On May 3, 2019, at 6:22 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> On Fri, 3 May 2019 11:29:59 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> 
>> OMG, WTF, ARGH... That code is fsck'ing horrible. I'd almost argue to
>> always do the INT3 thing, just to avoid games like that.
> 
> Hehe, that's almost the exact same thoughts I had when seeing this
> code ;-)
> 
>> 
>> That said; for normal traps &regs->sp is indeed the previous context --
>> if it doesn't fall off the stack. Your hack detects the regular INT3
>> frame. However if regs->sp has been modified (int3_emulate_push, for
>> example) your detection comes unstuck.
> 
> Yep. I realized the issue as well. But wanted to make sure this did
> work when sp wasn't changed.
> 
>> 
>> Now, it is rather unlikely these two code paths interact, but just to be
>> safe, something like so might be more reliable:
>> 
>> 
>> diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
>> index 4b8ee05dd6ad..aceaad0cc9a9 100644
>> --- a/arch/x86/kernel/ptrace.c
>> +++ b/arch/x86/kernel/ptrace.c
>> @@ -163,6 +163,9 @@ static inline bool invalid_selector(u16 value)
>>  * stack pointer we fall back to regs as stack if no previous stack
>>  * exists.
>>  *
>> + * There is a special case for INT3, there we construct a full pt_regs
>> + * environment. We can detect this case by a high bit in regs->cs
>> + *
>>  * This is valid only for kernel mode traps.
>>  */
>> unsigned long kernel_stack_pointer(struct pt_regs *regs)
>> @@ -171,6 +174,9 @@ unsigned long kernel_stack_pointer(struct pt_regs *regs)
>>    unsigned long sp = (unsigned long)&regs->sp;
>>    u32 *prev_esp;
>> 
>> +    if (regs->__csh & (1 << 13)) /* test CS_FROM_INT3 */
>> +        return regs->sp;
>> +
> 
> Thanks, I was looking into doing something like this (setting a flag in
> the int3 code), but didn't have the time to see the best way to do this.
> 
> I'll add this version of the code and run it through my tests.
> 
> -- Steve
> 
>>    if (context == (sp & ~(THREAD_SIZE - 1)))
>>        return sp;
>> 
>> --- a/arch/x86/entry/entry_32.S
>> +++ b/arch/x86/entry/entry_32.S
>> @@ -388,6 +388,7 @@
>> 
>> #define CS_FROM_ENTRY_STACK    (1 << 31)
>> #define CS_FROM_USER_CR3    (1 << 30)
>> +#define CS_FROM_INT3        (1 << 29)
>> 
>> .macro SWITCH_TO_KERNEL_STACK
>> 
>> @@ -1515,6 +1516,9 @@ ENTRY(int3)
>> 
>>    add    $16, 12(%esp) # point sp back at the previous context
>> 
>> +    andl    $0x0000ffff, 4(%esp)
>> +    orl    $CS_FROM_INT3, 4(%esp)
>> +
>>    pushl    $-1                # orig_eax; mark as interrupt
>> 
>>    SAVE_ALL
> 

So here’s a somewhat nutty suggestion: how about we tweak the 32-bit entry code to emulate the sane 64-bit frame, not just for int3 but always?  Basically, the entry asm for entries from kernel mode would do, roughly:

push $0 ;dummy for call emulation
push %ss
push $0 ;a dummy for ESP
push 3*4(%esp) ;EFLAGS
push 3*4(%esp) ;CS
push 3*4(%esp) ;EIP
push %rax
lea 7*4(%esp), %rax
mov %rax, 4*4(%esp) ;ESP

And the exit asm would do a little dance to write EFLAGS, CS, and EIP to the right spot, then load ESP-3*4 into %esp and do IRET.

Now the annoying kernel_stack_pointer() hack can just go away, since regs->sp is always correct!

I probably screwed up some arithmetic there, but it’s the idea that counts :)
Steven Rostedt May 3, 2019, 4:31 p.m. UTC | #20
On Fri, 3 May 2019 09:20:55 -0700
Andy Lutomirski <luto@amacapital.net> wrote:

> So here’s a somewhat nutty suggestion: how about we tweak the 32-bit entry code to emulate the sane 64-bit frame, not just for int3 but always?  Basically, the entry asm for entries from kernel mode would do, roughly:
> 
> push $0 ;dummy for call emulation
> push %ss
> push $0 ;a dummy for ESP
> push 3*4(%esp) ;EFLAGS
> push 3*4(%esp) ;CS
> push 3*4(%esp) ;EIP
> push %rax
> lea 7*4(%esp), %rax
> mov %rax, 4*4(%esp) ;ESP
> 
> And the exit asm would do a little dance to write EFLAGS, CS, and EIP to the right spot, then load ESP-3*4 into %esp and do IRET.
> 
> Now the annoying kernel_stack_pointer() hack can just go away, since regs->sp is always correct!
> 
> I probably screwed up some arithmetic there, but it’s the idea that counts :)

Yeah, as it will end up with:

 $0 ; dummy for call emulation
 %ss
 $0 ; dummy for ESP
 EIP
 $0
 %ss
 $0

As 3 only gets you over what you already pushed. I think 5*4 is what
you want.

I guess the real question is, what's the performance impact of doing
that? Although, this is only needed for kernel -> kernel exceptions,
which are hopefully a rarity.

-- Steve
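The offset correction can be checked with a little stack simulation; the word values are arbitrary markers, and the point is only which slot `n*4(%esp)` lands on after the three dummy pushes:

```c
#include <assert.h>

#define STACK_WORDS 16

static unsigned long stk[STACK_WORDS];
static int top;		/* index of the current top-of-stack word */

static void push(unsigned long val)
{
	stk[--top] = val;
}

static int run_demo(void)
{
	const unsigned long EIP = 0x11, CS = 0x22, EFLAGS = 0x33, SS = 0x68;

	/* hardware exception frame, EIP ending up on top */
	top = STACK_WORDS;
	push(EFLAGS);
	push(CS);
	push(EIP);

	/* the three dummies from the proposed prologue */
	push(0);	/* dummy for call emulation */
	push(SS);
	push(0);	/* dummy for ESP */

	/* 3*4(%esp) only reaches past the three dummies, onto EIP ... */
	assert(stk[top + 3] == EIP);
	/* ... while EFLAGS is 5 words up, hence the 5*4 correction */
	assert(stk[top + 5] == EFLAGS);
	return 0;
}
```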
Peter Zijlstra May 3, 2019, 4:32 p.m. UTC | #21
On Fri, May 03, 2019 at 09:20:55AM -0700, Andy Lutomirski wrote:
> > On May 3, 2019, at 6:22 AM, Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > On Fri, 3 May 2019 11:29:59 +0200
> > Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > 
> >> OMG, WTF, ARGH... That code is fsck'ing horrible. I'd almost argue to
> >> always do the INT3 thing, just to avoid games like that.

> So here’s a somewhat nutty suggestion: how about we tweak the 32-bit
> entry code to emulate the sane 64-bit frame, not just for int3 but
> always? 

Basically what I wrote above, right? I'd be fine with that... Linus?
Peter Zijlstra May 3, 2019, 4:35 p.m. UTC | #22
On Fri, May 03, 2019 at 12:31:26PM -0400, Steven Rostedt wrote:
> I guess the real question is, what's the performance impact of doing
> that?

Is there anyone that considers i386 a performance platform?
Andy Lutomirski May 3, 2019, 4:44 p.m. UTC | #23
On Fri, May 3, 2019 at 9:35 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> On Fri, May 03, 2019 at 12:31:26PM -0400, Steven Rostedt wrote:
> > I guess the real question is, what's the performance impact of doing
> > that?
>
> Is there anyone that considers i386 a performance platform?

Not me.  As far as I'm concerned, I will basically always gladly trade
several cycles for simplicity on 32-bit.

--Andy
Steven Rostedt May 3, 2019, 4:49 p.m. UTC | #24
On Fri, 3 May 2019 09:44:35 -0700
Andy Lutomirski <luto@kernel.org> wrote:

> On Fri, May 3, 2019 at 9:35 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > On Fri, May 03, 2019 at 12:31:26PM -0400, Steven Rostedt wrote:  
> > > I guess the real question is, what's the performance impact of doing
> > > that?  
> >
> > Is there anyone that considers i386 a performance platform?  
> 
> Not me.  As far as I'm concerned, I will basically always gladly trade
> several cycles for simplicity on 32-bit.
>

So should I wait for a new PATCH 1 from Peter to implement it this way?
I'd like to get this into the next merge window (although, I am marking
it for stable since it fixes a bug for live kernel patching).

Linus, what's your thoughts?

-- Steve
Linus Torvalds May 3, 2019, 6:57 p.m. UTC | #25
On Fri, May 3, 2019 at 9:21 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> So here’s a somewhat nutty suggestion: how about we tweak the 32-bit entry code to emulate the sane 64-bit frame, not just for int3 but always?

What would the code actually end up looking like? I don't necessarily
object, since that kernel_stack_pointer() thing certainly looks
horrible, but honestly, my suggestion to just pass in the 'struct
pt_regs' and let the call emulation fix it up would have also worked,
and avoided that bug (and who knows what else might be hiding).

I really think that you're now hitting all the special case magic
low-level crap that I wanted to avoid.

                   Linus
Steven Rostedt May 3, 2019, 7:24 p.m. UTC | #26
On Thu, 2 May 2019 13:49:29 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Thu, May 2, 2019 at 1:22 PM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Something like so; it boots; but I could've made some horrible mistake
> > (again).  
> 
> This actually looks much better to me.
> 
> Maybe it's more lines (I didn't check), but it's a lot simpler in that
> now the magic of the int3 stack doesn't get exposed to anything else.
> 
> We *could* also make this kernel-mode-only do_int3() be a special
> function, and do something like
> 
>         # args: pt_regs pointer (no error code for int3)
>         movl %esp,%eax
>         # allocate a bit of extra room on the stack, so that
>         # 'kernel_int3' can move the pt_regs
>         subl $8,%esp
>         call kernel_int3
>         movl %eax,%esp
> 
> and not do any stack switching magic in the asm code AT ALL. We'd do
> 
>     struct pt_regs *kernel_int3(struct pt_regs *regs)
>     {
>         ..
>         return regs;
>     }
> 
> and now the rule for call emulation ends up being that you need to
> "memmove()" the pt_regs up and down properly, and return the new
> pt_regs pointer.
> 
> Hmm? That would simplify the asm code further, but some people might
> find it objectionable?
> 

The problem with this approach is that it would require doing the same
for x86_64, as the int3 C code is the same for both. And that may be a
bit more difficult on the x86_64 side because it's all done with a
simple flag in the idtentry macro to add the gap.

-- Steve
Linus Torvalds May 3, 2019, 9:46 p.m. UTC | #27
On Fri, May 3, 2019 at 12:24 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> The problem with this approach is that it would require doing the same
> for x86_64, as the int3 C code is the same for both. And that may be a
> bit more difficult on the x86_64 side because it's all done with a
> simple flag in the idtentry macro to add the gap.

That argument is weakened by the fact that we have to do _other_
things differently on 32-bit and 64-bit anyway.

So we might as well have a "on 32-bit, the call emulation needs to
move the pt_regs to make space" special case in the call emulation
code. It's very easy to explain why.

And then we'd limit the special case to where it matters (with a big
comment about what's going on), rather than adding random special case
handling to random _other_ places.

Having to add a magic special case to kernel_stack_pointer() is
certainly not obvious. Neither is adding magic special cases to system
call exit paths etc.

This has been why I've been arguing against the entry code changes.
Exactly because they tend to have these kind of odd cascading effects.
The entry code is fragile not just because it's a complex hardware
interface, but also because we know about those complex hardware
interfaces in random other places.

I'd much rather have the code that does special things be in one
place, and be the place that *needs* to do the special thing. If we
copy the pt_regs around when we do the "call" emulation, it's *really*
easy to explain *exactly* what we're doing and why in *exactly* that
one context where we are doing it. And it won't affect anything else,
and our existing code that looks at pt_regs will work both before and
after.

Would it need a #ifdef CONFIG_X86_32 around it because it's not needed
on x86-64? Sure. But that #ifdef would be right there, and the comment
that explains why the pt_regs need to be moved would also make it very
obvious why it is only needed for x86-32.

There's a lot of advantages to keeping your problems localized,
instead of letting your random hacks escape and become problems for
other, entirely unrelated, code.

                Linus
Steven Rostedt May 3, 2019, 10:49 p.m. UTC | #28
On Fri, 3 May 2019 14:46:11 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Fri, May 3, 2019 at 12:24 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > The problem with this approach is that it would require doing the same
> > for x86_64, as the int3 C code is the same for both. And that may be a
> > bit more difficult on the x86_64 side because it's all done with a
> > simple flag in the idtentry macro to add the gap.  
> 
> That argument is weakened by the fact that we have to do _other_
> things differently on 32-bit and 64-bit anyway.
> 
> So we might as well have a "on 32-bit, the call emulation needs to
> move the pt_regs to make space" special case in the call emulation
> code. It's very easy to explain why.

So if I understand correctly what you are implying, is to have the int3
code be different for 32 bit and 64 bit? This would require handling
text_poke, ftrace and kprobes differently for the two. Or perhaps we
can build hacks on top.

> 
> And then we'd limit the special case to where it matters (with a big
> comment about what's going on), rather than adding random special case
> handling to random _other_ places.
> 
> Having to add a magic special case to kernel_stack_pointer() is
> certainly not obvious. Neither is adding magic special cases to system
> call exit paths etc.

Honestly, this sounds more like an argument for doing the buffered space
for all exceptions on 32 bit, because it removes one of the special
cases. If we were to move the frame and give it a full frame like
x86_64, then we can remove kernel_stack_pointer() altogether. I've hated
that function for some time, as I tripped over it in the past too, and
it keeps coming up. It's sad that regs->sp is unreliable as is.

> 
> This has been why I've been arguing against the entry code changes.
> Exactly because they tend to have these kind of odd cascading effects.
> The entry code is fragile not just because it's a complex hardware
> interface, but also because we know about those complex hardware
> interfaces in random other places.

IMO, getting rid of the kernel_stack_pointer() function is a positive
side effect of these changes.

> 
> I'd much rather have the code that does special things be in one
> place, and be the place that *needs* to do the special thing. If we
> copy the pt_regs around when we do the "call" emulation, it's *really*
> easy to explain *exactly* what we're doing and why in *exactly* that
> one context where we are doing it. And it won't affect anything else,
> and our existing code that looks at pt_regs will work both before and
> after.
> 
> Would it need a #ifdef CONFIG_X86_32 around it because it's not needed
> on x86-64? Sure. But that #ifdef would be right there, and the comment
> that explains why the pt_regs need to be moved would also make it very
> obvious why it is only needed for x86-32.
> 
> There's a lot of advantages to keeping your problems localized,
> instead of letting your random hacks escape and become problems for
> other, entirely unrelated, code.

But is it localized? It would definitely affect do_int3().

You are saying that we have a do_int3() for user space int3, and
do_kernel_int3() for kernel space. That would need to be done in asm
for both, because having x86_64 call do_int3() for kernel and
user would be interesting. Looking at the do_int3() code, I'm not sure
how to safely separate kernel and user int3 handlers if 64bit doesn't
call do_kernel_int3() directly. It may end up looking something like
this:

dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
{
#ifdef CONFIG_X86_64
	do_kernel_int3(&regs);
#endif
	/*
	 * Use ist_enter despite the fact that we don't use an IST stack.
	 * We can be called from a kprobe in non-CONTEXT_KERNEL kernel
	 * mode or even during context tracking state changes.
	 *
	 * This means that we can't schedule.  That's okay.
	 */
	ist_enter(regs);
	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");

#ifdef CONFIG_X86_64
#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
				SIGTRAP) == NOTIFY_STOP)
		goto exit;
#endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */

#ifdef CONFIG_KPROBES
	if (kprobe_int3_handler(&regs))
		goto exit;
#endif
#endif

	if (notify_die(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
			SIGTRAP) == NOTIFY_STOP)
		goto exit;

	cond_local_irq_enable(regs);
	do_trap(X86_TRAP_BP, SIGTRAP, "int3", regs, error_code, 0, NULL);
	cond_local_irq_disable(regs);

exit:
	ist_exit(regs);
}

dotraplinkage void notrace do_kernel_int3(struct pt_regs **regs)
{
#ifdef CONFIG_DYNAMIC_FTRACE
	/*
	 * ftrace must be first, everything else may cause a recursive crash.
	 * See note by declaration of modifying_ftrace_code in ftrace.c
	 */
	if (unlikely(atomic_read(&modifying_ftrace_code)) &&
	    ftrace_int3_handler(regs))
		return;
#endif
	if (poke_int3_handler(regs))
		return;

#ifdef CONFIG_X86_64
	return;
#endif

	/*
	 * Use ist_enter despite the fact that we don't use an IST stack.
	 * We can be called from a kprobe in non-CONTEXT_KERNEL kernel
	 * mode or even during context tracking state changes.
	 *
	 * This means that we can't schedule.  That's okay.
	 */
	ist_enter(*regs);
	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");

#ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
	if (kgdb_ll_trap(DIE_INT3, "int3", *regs, 0, X86_TRAP_BP,
				SIGTRAP) == NOTIFY_STOP)
		goto exit;
#endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */

#ifdef CONFIG_KPROBES
	if (kprobe_int3_handler(regs))
		goto exit;
#endif

	notify_die(DIE_INT3, "int3", *regs, 0, X86_TRAP_BP,
			SIGTRAP);
	ist_exit(*regs);
}

Or maybe I misunderstood what you envision :-/

-- Steve
Andy Lutomirski May 3, 2019, 10:55 p.m. UTC | #29
> On May 3, 2019, at 2:46 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Fri, May 3, 2019 at 12:24 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>> 
>> The problem with this approach is that it would require doing the same
>> for x86_64, as the int3 C code is the same for both. And that may be a
>> bit more difficult on the x86_64 side because it's all done with a
>> simple flag in the idtentry macro to add the gap.
> 
> That argument is weakened by the fact that we have to do _other_
> things differently on 32-bit and 64-bit anyway.
> 
> So we might as well have a "on 32-bit, the call emulation needs to
> move the pt_regs to make space" special case in the call emulation
> code. It's very easy to explain why.
> 
> And then we'd limit the special case to where it matters (with a big
> comment about what's going on), rather than adding random special case
> handling to random _other_ places.

If we do this, it should IMO look like this:

struct pt_regs *change_kernel_stack_pointer(struct pt_regs *, unsigned long new_sp);

And that helper should be used on both variants.

But I think this will end up worse than the version where the entry code fixes it up.  This is because, if the C code moves pt_regs, then we need some way to pass the new pointer back to the asm.  We will also have a much harder time with runtime sanity checks.  In the model where the C code merely updates regs->sp, it’s very easy to compare sp and &regs to check for overlap, but it’s much harder to tell whether memmove()ing it is going to overwrite something important.  And we have to worry about whether there’s some other code that assumes that pt_regs stays put.

So my intuition is that the pure asm fixup will result in more maintainable code.
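As an aside, the single-comparison sanity check described above can be sketched in isolation (a hypothetical illustration with invented names and a simplified frame layout, not kernel code): in the model where call emulation only decrements regs->sp, checking that an emulated push stays out of the saved frame is one compare against &regs.

```c
#include <assert.h>

/* Hypothetical model frame; the real struct pt_regs layout differs. */
struct fake_frame {
	unsigned long ip, cs, flags, sp, ss;
};

/* In the regs->sp-only model, the gap (when present) sits directly
 * above the saved frame in memory. An emulated push is safe as long as
 * regs->sp, already decremented for the push, still points at or above
 * the end of the frame rather than inside it. */
static int emulated_push_is_safe(const struct fake_frame *regs)
{
	return regs->sp >= (unsigned long)(regs + 1);
}
```

After a memmove of the frame there is no equally simple invariant left to compare against, which is the point being made here.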
Linus Torvalds May 3, 2019, 11:07 p.m. UTC | #30
On Fri, May 3, 2019 at 3:49 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> You are saying that we have a do_int3() for user space int3, and
> do_kernel_int3() for kernel space. That would need to be done in asm
> for both, because having x86_64 call do_int3() for kernel and
> user would be interesting.

The clean/simple way is to just do this

 - x86-32 does the special asm for the kernel_do_int3() case, and
calls user_do_int3() otherwise.

 - x86-64 doesn't care, and just calls "do_int3()".

We have a trivial helper function like

    dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
    {
        if (user_mode(regs))
                user_int3(regs);
        else
                WARN_ON_ONCE(kernel_int3(regs) != regs);
    }

which adds that warning just for debug purposes.

Then we make the rule be that user_int3() does the normal stuff, and
kernel_int3() returns the pt_regs it was passed in.

Easy-peasy, there is absolutely no difference between x86-64 and
x86-32 here except for the trivial case that x86-32 does its thing at
the asm layer, which is what allows "kernel_int3()" to move pt_regs
around by a small amount.

Now, the _real_ difference is when you do the "call_emulate()" case,
which will have to do something like this

    static struct pt_regs *emulate_call(struct pt_regs *regs, unsigned
long value, unsigned long target)
    {
    #ifdef CONFIG_X86_32
            /* BIG comment about how we need to move pt_regs to make
room and to update the return 'sp' */
            struct pt_regs *new = (void *)regs - 4;
            unsigned long *sp = (unsigned long *)(new + 1);
            memmove(new, regs, sizeof(*regs));
            regs = new;
    #else
            unsigned long *sp = regs->sp;
            regs->sp -= 4;
    #endif
            *sp = value;
            regs->ip = target;
            return regs;
    }

but look, the above isn't that complicated, is it? And notice how the
subtle pt_regs movement is exactly where it needs to be and nowhere
else.

And what's the cost of all of this? NOTHING. The x86-32 entry code has
to do the test for kernel space anyway, and *all* it does now is to
call "kernel_int3" for the kernel case after having made a bit of
extra room on the stack so that you *can* move pt_regs around (maybe
people want to pop things too? It would work as well).

See what I mean by "localized to the cases that need it"?

              Linus
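The 32-bit branch of the emulate_call() sketch above can be exercised as a self-contained userspace model (field names, the frame layout, and the addresses below are invented for illustration; the real pt_regs differs and the real slot width is 4 bytes on x86-32): sliding the frame down one word opens a slot immediately above it for the emulated push.

```c
#include <assert.h>
#include <string.h>

/* Userspace model of the x86-32 call emulation: memmove() the register
 * frame down one word, store the return address in the slot freed
 * directly above the new frame, and point ip at the call target. */
struct fake_regs {
	unsigned long ip;
	unsigned long cs;
	unsigned long flags;
};

static struct fake_regs *model_emulate_call(struct fake_regs *regs,
					    unsigned long ret_addr,
					    unsigned long target)
{
	struct fake_regs *new = (struct fake_regs *)
		((char *)regs - sizeof(unsigned long));
	unsigned long *slot = (unsigned long *)(new + 1);

	memmove(new, regs, sizeof(*regs));	/* slide the frame down */
	*slot = ret_addr;			/* the emulated push */
	new->ip = target;			/* the emulated call */
	return new;
}
```

Note that the caller must switch to the returned pointer; as Linus points out later in the thread, throwing the new pt_regs away is exactly the bug in the notifier-based self-test.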
Linus Torvalds May 3, 2019, 11:16 p.m. UTC | #31
On Fri, May 3, 2019 at 3:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> But I think this will end up worse than the version where the entry code fixes it up.  This is because, if the C code moves pt_regs, then we need some way to pass the new pointer back to the asm.

What? I already posted that code. Let me quote it again:

Message-ID: <CAHk-=wh8bi5c_GkyjPtDAiaXaZRqtmhWs30usUvs4qK_F+c9tg@mail.gmail.com>

        # args: pt_regs pointer (no error code for int3)
        movl %esp,%eax
        # allocate a bit of extra room on the stack, so that
        # 'kernel_int3' can move the pt_regs
        subl $8,%esp
        call kernel_int3
        movl %eax,%esp

It's that easy (this is with the assumption that we've already applied
the "standalone simple int3" case, but I think the above might work
even with the current code model, just the "call do_int3" needs to
have the kernel/not-kernel distinction and do the above for the kernel
case)

That's *MUCH* easier than your code to move entries around on the
stack just as you return, and has the advantage of not changing any
C-visible layout.

The C interface looks like this

    /* Note: on x86-32, we can move 'regs' around for push/pop emulation */
    struct pt_regs *kernel_int3(struct pt_regs *regs)
    {
        ..
        .. need to pass regs to emulation functions
        .. and call emulation needs to return it
        ..
        return regs;
    }

and I just posted as a response to Steven the *trivial* do_int3()
wrapper (so that x86-64 doesn't need to care), and the *trivial* code
to actually emulate a call instruction.

And when I say "trivial", I obviously mean "totally untested and
probably buggy", but it sure seems *simple*.

Notice? Simple and minimal changes to entry code that only affect
int3, and nothing else.

                 Linus
Andy Lutomirski May 3, 2019, 11:32 p.m. UTC | #32
> On May 3, 2019, at 4:16 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
>> On Fri, May 3, 2019 at 3:55 PM Andy Lutomirski <luto@amacapital.net> wrote:
>> 
>> But I think this will end up worse than the version where the entry code fixes it up.  This is because, if the C code moves pt_regs, then we need some way to pass the new pointer back to the asm.
> 
> What? I already posted that code. Let me quote it again:
> 
> Message-ID: <CAHk-=wh8bi5c_GkyjPtDAiaXaZRqtmhWs30usUvs4qK_F+c9tg@mail.gmail.com>
> 
>        # args: pt_regs pointer (no error code for int3)
>        movl %esp,%eax
>        # allocate a bit of extra room on the stack, so that
>        # 'kernel_int3' can move the pt_regs
>        subl $8,%esp
>        call kernel_int3
>        movl %eax,%esp
> 
> It's that easy (this is with the assumption that we've already applied
> the "standalone simple int3" case, but I think the above might work
> even with the current code model, just the "call do_int3" needs to
> have the kernel/not-kernel distinction and do the above for the kernel
> case)
> 
> That's *MUCH* easier than your code to move entries around on the
> stack just as you return, and has the advantage of not changing any
> C-visible layout.
> 
> The C interface looks like this
> 
>    /* Note: on x86-32, we can move 'regs' around for push/pop emulation */
>    struct pt_regs *kernel_int3(struct pt_regs *regs)
>    {
>        ..
>        .. need to pass regs to emulation functions
>        .. and call emulation needs to return it
>        ..
>        return regs;
>    }
> 
> and I just posted as a response to Stephen the *trivial* do_int3()
> wrapper (so that x86-64 doesn't need to care), and the *trivial* code
> to actually emulate a call instruction.
> 
> And when I say "trivial", I obviously mean "totally untested and
> probably buggy", but it sure seems *simple*.,
> 
> Notice? Simple and minimal changes to entry code that only affect
> int3, and nothing else.
> 
>  

I can get on board with this.
Steven Rostedt May 4, 2019, 4:17 a.m. UTC | #33
On Fri, 3 May 2019 16:07:59 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

>     static struct pt_regs *emulate_call(struct pt_regs *regs, unsigned
> long value, unsigned long target)
>     {
>     #ifdef CONFIG_X86_32
>             /* BIG comment about how we need to move pt_regs to make
> room and to update the return 'sp' */
>             struct pt_regs *new = (void *)regs - 4;
>             unsigned long *sp = (unsigned long *)(new + 1);

I tried this, and it crashed. After a bit of debugging, I found that
the issue is that we can't use sizeof(*regs), because we don't have a
full pt_regs. We are missing the first entry (which is why we have
kernel_stack_pointer() in the first place!)


>             memmove(new, regs, sizeof(*regs));

This would need to be something like:

#define SAVED_REGS_SIZE (sizeof(struct pt_regs) - 4)

		struct pt_regs *new = (void *)regs - 4;
		unsigned long *sp = (unsigned long *)
			((void *)new + SAVED_REGS_SIZE);

		memmove(new, regs, SAVED_REGS_SIZE);

>             regs = new;
>     #else
>             unsigned long *sp = regs->sp;
>             regs->sp -= 4;

should be:

		unsigned long *sp;

		regs->sp -= sizeof(long);
		sp = (unsigned long *)regs->sp;

>     #endif
>             *sp = value;
>             regs->ip = target;
>             return regs;
>     }

I got it sorta working. But it appears that printk() no longer flushes
partial lines, so I couldn't figure out where my self tests broke, as
they do:

	pr_info("Testing ....: ");
	ret = do_test();
	if (ret)
		pr_info("PASSED\n");
	else
		pr_info("FAILED\n");

but because it doesn't flush anymore, I don't see what test it is :-p
But that's another issue.

With some "early_printk" in the code, I do see it doing the calls, but
for some reason, the start up test never moves forward.

Below is the patch that I tried. I started with Peter's changes (this
works on x86_64) and tried to work your ideas in for x86_32. This still
gets stuck (but doesn't crash). 

Also note that this method prevents the die notifiers from doing the
call emulation, which means I had to disable Peter's selftest because
it no longer works the way it is written.

-- Steve

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..f0fd3e17bedd 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -67,9 +67,20 @@
 # define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
 # define preempt_stop(clobbers)
-# define resume_kernel		restore_all_kernel
 #endif
 
+.macro RETINT_PREEMPT
+#ifdef CONFIG_PREEMPT
+	DISABLE_INTERRUPTS(CLBR_ANY)
+	cmpl	$0, PER_CPU_VAR(__preempt_count)
+	jnz	.Lend_\@
+	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
+	jz	.Lend_\@
+	call	preempt_schedule_irq
+.Lend_\@:
+#endif
+.endm
+
 .macro TRACE_IRQS_IRET
 #ifdef CONFIG_TRACE_IRQFLAGS
 	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)     # interrupts off?
@@ -377,6 +388,7 @@
 
 #define CS_FROM_ENTRY_STACK	(1 << 31)
 #define CS_FROM_USER_CR3	(1 << 30)
+#define CS_FROM_INT3		(1 << 29)
 
 .macro SWITCH_TO_KERNEL_STACK
 
@@ -753,7 +765,7 @@ ret_from_intr:
 	andl	$SEGMENT_RPL_MASK, %eax
 #endif
 	cmpl	$USER_RPL, %eax
-	jb	resume_kernel			# not returning to v8086 or userspace
+	jb	restore_all_kernel		# not returning to v8086 or userspace
 
 ENTRY(resume_userspace)
 	DISABLE_INTERRUPTS(CLBR_ANY)
@@ -763,19 +775,6 @@ ENTRY(resume_userspace)
 	jmp	restore_all
 END(ret_from_exception)
 
-#ifdef CONFIG_PREEMPT
-ENTRY(resume_kernel)
-	DISABLE_INTERRUPTS(CLBR_ANY)
-.Lneed_resched:
-	cmpl	$0, PER_CPU_VAR(__preempt_count)
-	jnz	restore_all_kernel
-	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
-	jz	restore_all_kernel
-	call	preempt_schedule_irq
-	jmp	.Lneed_resched
-END(resume_kernel)
-#endif
-
 GLOBAL(__begin_SYSENTER_singlestep_region)
 /*
  * All code from here through __end_SYSENTER_singlestep_region is subject
@@ -1026,6 +1025,7 @@ restore_all:
 	INTERRUPT_RETURN
 
 restore_all_kernel:
+	RETINT_PREEMPT
 	TRACE_IRQS_IRET
 	PARANOID_EXIT_TO_KERNEL_MODE
 	BUG_IF_WRONG_CR3
@@ -1476,6 +1476,27 @@ END(nmi)
 
 ENTRY(int3)
 	ASM_CLAC
+
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, 8(%esp)
+	jnz	.Lfrom_usermode_no_gap
+#endif
+	testl	$SEGMENT_RPL_MASK, 4(%esp)
+	jnz	.Lfrom_usermode_no_gap
+
+	pushl	$-1				# mark this as an int
+
+	SAVE_ALL switch_stacks=1
+	ENCODE_FRAME_POINTER
+	TRACE_IRQS_OFF
+	movl	%esp, %eax			# pt_regs pointer
+	subl	$8, %esp
+	call	do_kernel_int3
+	movl	%eax, %esp
+	jmp	ret_from_exception
+
+.Lfrom_usermode_no_gap:
+
 	pushl	$-1				# mark this as an int
 
 	SAVE_ALL switch_stacks=1
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..834ec1397dab 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -879,7 +879,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -899,6 +899,16 @@ ENTRY(\sym)
 	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
 
+	.if \create_gap == 1
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jnz	.Lfrom_usermode_no_gap_\@
+	.rept 6
+	pushq	5*8(%rsp)
+	.endr
+	UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+	.endif
+
 	.if \paranoid
 	call	paranoid_entry
 	.else
@@ -1130,7 +1140,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3			do_int3			has_error_code=0
+idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index cf350639e76d..4b335ac5afcc 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -37,7 +37,7 @@ struct dyn_arch_ftrace {
 	/* No extra data needed for x86 */
 };
 
-int ftrace_int3_handler(struct pt_regs *regs);
+int ftrace_int3_handler(struct pt_regs **regs);
 
 #define FTRACE_GRAPH_TRAMP_ADDR FTRACE_GRAPH_ADDR
 
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..8e638231cf52 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,33 @@ extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
 
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+	regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+static inline struct pt_regs * int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+#ifdef CONFIG_X86_32
+#define SAVED_REGS_SIZE (sizeof(struct pt_regs) - 8)
+	/* BIG comment about how we need to move pt_regs to make
+	   room and to update the return 'sp' */
+	struct pt_regs *new = (void *)regs - sizeof(long);
+	unsigned long *sp = (unsigned long *)
+		((void *)new + SAVED_REGS_SIZE);
+	memmove(new, regs, SAVED_REGS_SIZE);
+	regs = new;
+#else
+	unsigned long *sp;
+	regs->sp -= sizeof(long);
+	sp = (unsigned long *)regs->sp;
+#endif
+	*sp = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
+	regs->ip = func;
+	return regs;
+}
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 9a79c7808f9c..c0e9002e2708 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -613,11 +613,84 @@ extern struct paravirt_patch_site __start_parainstructions[],
 	__stop_parainstructions[];
 #endif	/* CONFIG_PARAVIRT */
 
+/*
+ * Self-test for the INT3 based CALL emulation code.
+ *
+ * This exercises int3_emulate_call() to make sure INT3 pt_regs are set up
+ * properly and that there is a stack gap between the INT3 frame and the
+ * previous context. Without this gap doing a virtual PUSH on the interrupted
+ * stack would corrupt the INT3 IRET frame.
+ *
+ * See entry_{32,64}.S for more details.
+ */
+static void __init int3_magic(unsigned int *ptr)
+{
+	*ptr = 1;
+}
+
+extern __initdata unsigned long int3_selftest_ip; /* defined in asm below */
+
+static int __init
+int3_exception_notify(struct notifier_block *self, unsigned long val, void *data)
+{
+	struct die_args *args = data;
+	struct pt_regs *regs = args->regs;
+
+	if (!regs || user_mode(regs))
+		return NOTIFY_DONE;
+
+	if (val != DIE_INT3)
+		return NOTIFY_DONE;
+
+	if (regs->ip - INT3_INSN_SIZE != int3_selftest_ip)
+		return NOTIFY_DONE;
+
+	int3_emulate_call(regs, (unsigned long)&int3_magic);
+	return NOTIFY_STOP;
+}
+
+static void __init int3_selftest(void)
+{
+	static __initdata struct notifier_block int3_exception_nb = {
+		.notifier_call	= int3_exception_notify,
+		.priority	= INT_MAX-1, /* last */
+	};
+	unsigned int val = 0;
+
+	return;
+	BUG_ON(register_die_notifier(&int3_exception_nb));
+
+	/*
+	 * Basically: int3_magic(&val); but really complicated :-)
+	 *
+	 * Stick the address of the INT3 instruction into int3_selftest_ip,
+	 * then trigger the INT3, padded with NOPs to match a CALL instruction
+	 * length.
+	 */
+	asm volatile ("1: int3; nop; nop; nop; nop\n\t"
+		      ".pushsection .init.data,\"aw\"\n\t"
+		      ".align " __ASM_SEL(4, 8) "\n\t"
+		      ".type int3_selftest_ip, @object\n\t"
+		      ".size int3_selftest_ip, " __ASM_SEL(4, 8) "\n\t"
+		      "int3_selftest_ip:\n\t"
+		      __ASM_SEL(.long, .quad) " 1b\n\t"
+		      ".popsection\n\t"
+		      : : __ASM_SEL_RAW(a, D) (&val) : "memory");
+
+	BUG_ON(val != 1);
+
+	unregister_die_notifier(&int3_exception_nb);
+}
+
 void __init alternative_instructions(void)
 {
-	/* The patching is not fully atomic, so try to avoid local interruptions
-	   that might execute the to be patched code.
-	   Other CPUs are not running. */
+	int3_selftest();
+
+	/*
+	 * The patching is not fully atomic, so try to avoid local
+	 * interruptions that might execute the to be patched code.
+	 * Other CPUs are not running.
+	 */
 	stop_nmi();
 
 	/*
@@ -642,10 +715,11 @@ void __init alternative_instructions(void)
 					    _text, _etext);
 	}
 
-	if (!uniproc_patched || num_possible_cpus() == 1)
+	if (!uniproc_patched || num_possible_cpus() == 1) {
 		free_init_pages("SMP alternatives",
 				(unsigned long)__smp_locks,
 				(unsigned long)__smp_locks_end);
+	}
 #endif
 
 	apply_paravirt(__parainstructions, __parainstructions_end);
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..c5e2ae1efbd8 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
 #include <asm/kprobes.h>
 #include <asm/ftrace.h>
 #include <asm/nops.h>
+#include <asm/text-patching.h>
 
 #ifdef CONFIG_DYNAMIC_FTRACE
 
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
 }
 
 static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
 
 static int update_ftrace_func(unsigned long ip, void *new)
 {
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 	unsigned char *new;
 	int ret;
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
 
@@ -287,20 +291,30 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
  * call to a nop. While the change is taking place, we treat
  * it just like it was a nop.
  */
-int ftrace_int3_handler(struct pt_regs *regs)
+int ftrace_int3_handler(struct pt_regs **pregs)
 {
+	struct pt_regs *regs = *pregs;
 	unsigned long ip;
 
 	if (WARN_ON_ONCE(!regs))
 		return 0;
 
-	ip = regs->ip - 1;
-	if (!ftrace_location(ip) && !is_ftrace_caller(ip))
-		return 0;
+	ip = regs->ip - INT3_INSN_SIZE;
 
-	regs->ip += MCOUNT_INSN_SIZE - 1;
+	if (ftrace_location(ip)) {
+		*pregs = int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+//		early_printk("ip=%pS was=%px is=%px\n", (void *)ip, regs, *pregs);
+		return 1;
+	} else if (is_ftrace_caller(ip)) {
+		if (!ftrace_update_func_call) {
+			int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+			return 1;
+		}
+		*pregs = int3_emulate_call(regs, ftrace_update_func_call);
+		return 1;
+	}
 
-	return 1;
+	return 0;
 }
 NOKPROBE_SYMBOL(ftrace_int3_handler);
 
@@ -859,6 +873,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
 
 	func = ftrace_ops_get_func(ops);
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	/* Do a safe modify in case the trampoline is executing */
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
@@ -960,6 +976,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
 {
 	unsigned char *new;
 
+	ftrace_update_func_call = 0UL;
 	new = ftrace_jmp_replace(ip, (unsigned long)func);
 
 	return update_ftrace_func(ip, new);
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 4b8ee05dd6ad..03ba69a6e594 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -163,6 +163,9 @@ static inline bool invalid_selector(u16 value)
  * stack pointer we fall back to regs as stack if no previous stack
  * exists.
  *
+ * There is a special case for INT3, there we construct a full pt_regs
+ * environment. We can detect this case by a high bit in regs->cs
+ *
  * This is valid only for kernel mode traps.
  */
 unsigned long kernel_stack_pointer(struct pt_regs *regs)
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d26f9e9c3d83..17e0c56326c9 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -570,7 +570,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
 }
 NOKPROBE_SYMBOL(do_general_protection);
 
-dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
+dotraplinkage struct pt_regs * notrace do_kernel_int3(struct pt_regs *regs)
 {
 #ifdef CONFIG_DYNAMIC_FTRACE
 	/*
@@ -578,11 +578,11 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	 * See note by declaration of modifying_ftrace_code in ftrace.c
 	 */
 	if (unlikely(atomic_read(&modifying_ftrace_code)) &&
-	    ftrace_int3_handler(regs))
-		return;
+	    ftrace_int3_handler(&regs))
+		return regs;
 #endif
 	if (poke_int3_handler(regs))
-		return;
+		return regs;
 
 	/*
 	 * Use ist_enter despite the fact that we don't use an IST stack.
@@ -594,7 +594,7 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	ist_enter(regs);
 	RCU_LOCKDEP_WARN(!rcu_is_watching(), "entry code didn't wake RCU");
 #ifdef CONFIG_KGDB_LOW_LEVEL_TRAP
-	if (kgdb_ll_trap(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
+	if (kgdb_ll_trap(DIE_INT3, "int3", regs, 0, X86_TRAP_BP,
 				SIGTRAP) == NOTIFY_STOP)
 		goto exit;
 #endif /* CONFIG_KGDB_LOW_LEVEL_TRAP */
@@ -604,16 +604,26 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 		goto exit;
 #endif
 
-	if (notify_die(DIE_INT3, "int3", regs, error_code, X86_TRAP_BP,
+	if (notify_die(DIE_INT3, "int3", regs, 0, X86_TRAP_BP,
 			SIGTRAP) == NOTIFY_STOP)
 		goto exit;
 
 	cond_local_irq_enable(regs);
-	do_trap(X86_TRAP_BP, SIGTRAP, "int3", regs, error_code, 0, NULL);
+	do_trap(X86_TRAP_BP, SIGTRAP, "int3", regs, 0, 0, NULL);
 	cond_local_irq_disable(regs);
 
 exit:
 	ist_exit(regs);
+	return regs;
+}
+NOKPROBE_SYMBOL(do_kernel_int3);
+
+dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
+{
+	struct pt_regs *ret_regs;
+
+	ret_regs = do_kernel_int3(regs);
+	WARN_ON_ONCE(ret_regs != regs);
 }
 NOKPROBE_SYMBOL(do_int3);
Linus Torvalds May 4, 2019, 6:59 p.m. UTC | #34
On Fri, May 3, 2019 at 10:08 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> I'll look at it tomorrow, but I think this actually makes unnecessary changes.
>
> In particular, I think we could keep the existing entry code almost unchanged with this whole approach.

So here's what I *think* should work. Note that I also removed your
test-case code, because it really didn't have a chance in hell of
working. Doing that

        int3_emulate_call(regs, (unsigned long)&int3_magic);

inside of int3_exception_notify() could not possibly be valid, since
int3_emulate_call() returns the new pt_regs that need to be used, and
throwing it away is clearly wrong.

So you can't use a register_die_notifier() to try to intercept the
'int3' exception and then do it manually; it needs to be done by the
ftrace_int3_handler() code that actually returns the new regs, which
do_kernel_int3() will then return to the low-level handler.

End result: I haven't actually tested this code, but I've looked
through the patch something like ten times without finding any new
errors.

I've also tried *very* hard to make the patch minimal, with the
exception of the comments, which I tried to make extensive for any of
the subtle cases.

But without testing, it's probably still buggy.

I have to say, I finally like the end result here. Maybe it's because
I got to make my mark and pee in the snow, but I will say that

 (a) the actual entry code modifications really are minimal now

 (b) the instruction emulation really is very simple and straightforward

 (c) yes, we play some stack tricks (and yes, we play them differently
on x86-64 and x86-32), but the tricks are again at least
straightforward, and we never really change the layout of any stack.

So on the whole, I think this is about as good as it gets. Did I get
all the details actually right, and it _works_? I guess we'll see.

                 Linus
arch/x86/entry/entry_32.S            |  7 +++-
 arch/x86/entry/entry_64.S            | 14 ++++++--
 arch/x86/include/asm/ftrace.h        |  2 +-
 arch/x86/include/asm/text-patching.h | 62 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/ftrace.c             | 29 ++++++++++++++---
 arch/x86/kernel/traps.c              | 15 ++++++---
 6 files changed, 116 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..138ac432221b 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1483,7 +1483,12 @@ ENTRY(int3)
 	TRACE_IRQS_OFF
 	xorl	%edx, %edx			# zero error code
 	movl	%esp, %eax			# pt_regs pointer
-	call	do_int3
+
+	# make room on kernel stack for push emulation
+	# do_kernel_int3 returns possibly updated pt_regs
+	subl	$8, %esp
+	call	do_kernel_int3
+	movl	%eax, %esp
 	jmp	ret_from_exception
 END(int3)
 
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..834ec1397dab 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -879,7 +879,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -899,6 +899,16 @@ ENTRY(\sym)
 	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
 
+	.if \create_gap == 1
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jnz	.Lfrom_usermode_no_gap_\@
+	.rept 6
+	pushq	5*8(%rsp)
+	.endr
+	UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+	.endif
+
 	.if \paranoid
 	call	paranoid_entry
 	.else
@@ -1130,7 +1140,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3			do_int3			has_error_code=0
+idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index cf350639e76d..4b335ac5afcc 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -37,7 +37,7 @@ struct dyn_arch_ftrace {
 	/* No extra data needed for x86 */
 };
 
-int ftrace_int3_handler(struct pt_regs *regs);
+int ftrace_int3_handler(struct pt_regs **regs);
 
 #define FTRACE_GRAPH_TRAMP_ADDR FTRACE_GRAPH_ADDR
 
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..87fad9f5c31b 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,66 @@ extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
 
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+	regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+/*
+ * The 32-bit kernel entry exception stack does not contain the whole
+ * 'struct pt_regs', because the hardware will not push sp/ss when the
+ * protection level doesn't change.
+ *
+ * So pushing a value onto the stack requires us to make room for it by
+ * moving the whole truncated 'pt_regs' down by four bytes, and the
+ * stack we return to will be directly after it. The low-level x86-32
+ * entry code will have made room on the stack for this, so we're not
+ * overwriting anything else there.
+ *
+ * On x86-64, the exception stack is much simpler, and always contains
+ * sp/ss, so we can just update the values directly. Again, the entry
+ * code has made sure there is a gap (now above pt_regs!) so this is ok.
+ */
+static inline struct pt_regs *int3_emulate_push(struct pt_regs *regs, unsigned long value)
+{
+#ifdef CONFIG_X86_32
+#define SAVED_KERNEL_REGS_SIZE (offsetof(struct pt_regs, sp))
+	struct pt_regs *new = (void *)regs - sizeof(long);
+	memmove(new, regs, SAVED_KERNEL_REGS_SIZE);
+
+	/*
+	 * NOTE! '&new->sp' is actually the top of stack that the 'int3'
+	 * will return to! So despite what this looks like, this doesn't
+	 * update the stack pointer, but writing the value to 'new->sp'
+	 * is actually writing the value *to* the top of the stack in
+	 * the returning context!
+	 *
+	 * The thing that updates the stack pointer in the context we
+	 * return to is the fact that we moved 'struct pt_regs' itself,
+	 * and will be returning using that moved stack frame.
+	 */
+	new->sp = value;
+	return new;
+#else
+	unsigned long *sp;
+
+	regs->sp -= sizeof(long);
+	sp = (unsigned long *)regs->sp;
+	*sp = value;
+	return regs;
+#endif
+}
+
+static inline struct pt_regs *int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+	unsigned long next_ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
+
+	regs = int3_emulate_push(regs, next_ip);
+	regs->ip = func;
+	return regs;
+}
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..6c239fca99d0 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
 #include <asm/kprobes.h>
 #include <asm/ftrace.h>
 #include <asm/nops.h>
+#include <asm/text-patching.h>
 
 #ifdef CONFIG_DYNAMIC_FTRACE
 
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
 }
 
 static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
 
 static int update_ftrace_func(unsigned long ip, void *new)
 {
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 	unsigned char *new;
 	int ret;
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
 
@@ -287,20 +291,32 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
  * call to a nop. While the change is taking place, we treat
  * it just like it was a nop.
  */
-int ftrace_int3_handler(struct pt_regs *regs)
+int ftrace_int3_handler(struct pt_regs **pregs)
 {
+	struct pt_regs *regs = *pregs;
 	unsigned long ip;
 
 	if (WARN_ON_ONCE(!regs))
 		return 0;
 
-	ip = regs->ip - 1;
-	if (!ftrace_location(ip) && !is_ftrace_caller(ip))
+	if (user_mode(regs))
 		return 0;
 
-	regs->ip += MCOUNT_INSN_SIZE - 1;
+	ip = regs->ip - INT3_INSN_SIZE;
 
-	return 1;
+	if (ftrace_location(ip)) {
+		*pregs = int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+		return 1;
+	} else if (is_ftrace_caller(ip)) {
+		if (!ftrace_update_func_call) {
+			int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+			return 1;
+		}
+		*pregs = int3_emulate_call(regs, ftrace_update_func_call);
+		return 1;
+	}
+
+	return 0;
 }
 NOKPROBE_SYMBOL(ftrace_int3_handler);
 
@@ -859,6 +875,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
 
 	func = ftrace_ops_get_func(ops);
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	/* Do a safe modify in case the trampoline is executing */
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
@@ -960,6 +978,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
 {
 	unsigned char *new;
 
+	ftrace_update_func_call = 0UL;
 	new = ftrace_jmp_replace(ip, (unsigned long)func);
 
 	return update_ftrace_func(ip, new);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d26f9e9c3d83..e811e55ab030 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -570,7 +570,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
 }
 NOKPROBE_SYMBOL(do_general_protection);
 
-dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
+dotraplinkage struct pt_regs * notrace do_kernel_int3(struct pt_regs *regs, long error_code)
 {
 #ifdef CONFIG_DYNAMIC_FTRACE
 	/*
@@ -578,11 +578,11 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	 * See note by declaration of modifying_ftrace_code in ftrace.c
 	 */
 	if (unlikely(atomic_read(&modifying_ftrace_code)) &&
-	    ftrace_int3_handler(regs))
-		return;
+	    ftrace_int3_handler(&regs))
+		return regs;
 #endif
 	if (poke_int3_handler(regs))
-		return;
+		return regs;
 
 	/*
 	 * Use ist_enter despite the fact that we don't use an IST stack.
@@ -614,6 +614,13 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 
 exit:
 	ist_exit(regs);
+	return regs;
+}
+NOKPROBE_SYMBOL(do_kernel_int3);
+
+dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
+{
+	WARN_ON_ONCE(do_kernel_int3(regs, error_code) != regs);
 }
 NOKPROBE_SYMBOL(do_int3);
Andy Lutomirski May 4, 2019, 8:12 p.m. UTC | #35
> On May 4, 2019, at 11:59 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> 
> On Fri, May 3, 2019 at 10:08 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> 
>> I'll look at it tomorrow, but I think this actually makes unnecessary changes.
>> 
>> In particular, I think we could keep the existing entry code almost unchanged with this whole approach.
> 
> So here's what I *think* should work. Note that I also removed your
> test-case code, because it really didn't have a chance in hell of
> working. Doing that
> 
>        int3_emulate_call(regs, (unsigned long)&int3_magic);
> 
> inside of int3_exception_notify() could not possibly be valid, since
> int3_emulate_call() returns the new pt_regs that need to be used, and
> throwing it away is clearly wrong.
> 
> So you can't use a register_die_notifier() to try to intercept the
> 'int3' error and then do it manually, it needs to be done by the
> ftrace_int3_handler() code that actually returns the new regs, and
> where do_kernel_int3() will then return it to the low-level handler.

I hate register_die_notifier(), so I consider this a plus. I’ve occasionally considered removing the ability for the notifiers to skip normal processing, because, as it stands, figuring out what actually happens in the trap handlers is almost impossible.

It generally looks sane to me.

As an aside, is it even *possible* to get #BP from v8086 mode?  On a quick SDM read, the INT3 instruction causes #GP if VM=1 and IOPL<3.  And, if we allow vm86() to have IOPL=3, we should just remove that ability. It’s nuts.

(We should maybe consider a config option for iopl() that defaults off. We’ve supported ioperm() for a long, long time.)
Linus Torvalds May 4, 2019, 8:28 p.m. UTC | #36
On Sat, May 4, 2019 at 1:12 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> As an aside, is it even *possible* to get #BP from v8086 mode?  On a quick SDM read, the INT3 instruction causes #GP if VM=1 and IOPL<3.  And, if we allow vm86() to have IOPL=3, we should just remove that ability. It’s nuts.

We've definitely historically allowed IOPL=3 with the whole "iopl()"
system call. And yes, afaik it works together with the vm86 system
call too. I think we copy the unsafe bits from the original eflags, so
if you do iopl(3) followed by vm86(), you will be running in vm86 mode
with iopl 3.

> (We should maybe consider a config option for iopl() that defaults off. We’ve supported ioperm() for a long, long time.)

It's entirely possible that nobody uses iopl() and we should make it a
config option that defaults to off.

But we've already done that with the VM86 support entirely, and I'm
not sure modern distros even enable it.

And obviously vm86 mode isn't available at all with a 64-bit kernel,
so this is all slowly becoming more or less moot.

                  Linus
Linus Torvalds May 4, 2019, 8:36 p.m. UTC | #37
On Sat, May 4, 2019 at 1:12 PM Andy Lutomirski <luto@amacapital.net> wrote:
>
> As an aside, is it even *possible* to get #BP from v8086 mode?  On a quick SDM read, the INT3 instruction causes #GP if VM=1 and IOPL<3.  And, if we allow vm86() to have IOPL=3, we should just remove that ability. It’s nuts.

Oh, and I think you mis-read the SDM.

Yes, iopl matters for "int X" (cd xx) instruction in vm86 mode.

But no, iopl does *not* matter for the special "int3/into/int1"
(cc/ce/f1) instructions, I think.

              Linus
Peter Zijlstra May 6, 2019, 8:19 a.m. UTC | #38
On Fri, May 03, 2019 at 11:57:22AM -0700, Linus Torvalds wrote:
> On Fri, May 3, 2019 at 9:21 AM Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > So here’s a somewhat nutty suggestion: how about we tweak the 32-bit
> > entry code to emulate the sane 64-bit frame, not just for int3 but
> > always?
> 
> What would the code actually end up looking like? I don't necessarily
> object, since that kernel_stack_pointer() thing certainly looks
> horrible, but honestly, my suggestion to just pass in the 'struct
> pt_regs' and let the call emulation fix it up would have also worked,
> and avoided that bug (and who knows what else might be hiding).
> 
> I really think that you're now hitting all the special case magic
> low-level crap that I wanted to avoid.

This did actually boot on first try; so there must be something horribly
wrong...

Now, I know you like that other approach; but I figured I should at
least show you what this one looks like. Maybe I've been staring at
entry_32.S too much, but I really don't dislike this.

---
 arch/x86/entry/entry_32.S            | 150 +++++++++++++++++++++++++++++------
 arch/x86/entry/entry_64.S            |  14 +++-
 arch/x86/include/asm/ptrace.h        |   4 -
 arch/x86/include/asm/text-patching.h |  20 +++++
 arch/x86/kernel/alternative.c        |  81 ++++++++++++++++++-
 arch/x86/kernel/ptrace.c             |  29 -------
 6 files changed, 235 insertions(+), 63 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 7b23431be5cb..d29cf03219c5 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -67,9 +67,20 @@
 # define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
 # define preempt_stop(clobbers)
-# define resume_kernel		restore_all_kernel
 #endif
 
+.macro RETINT_PREEMPT
+#ifdef CONFIG_PREEMPT
+	DISABLE_INTERRUPTS(CLBR_ANY)
+	cmpl	$0, PER_CPU_VAR(__preempt_count)
+	jnz	.Lend_\@
+	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
+	jz	.Lend_\@
+	call	preempt_schedule_irq
+.Lend_\@:
+#endif
+.endm
+
 .macro TRACE_IRQS_IRET
 #ifdef CONFIG_TRACE_IRQFLAGS
 	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)     # interrupts off?
@@ -203,8 +214,119 @@
 .Lend_\@:
 .endm
 
+#define CS_FROM_ENTRY_STACK	(1 << 31)
+#define CS_FROM_USER_CR3	(1 << 30)
+#define CS_FROM_KERNEL		(1 << 29)
+
+.macro FIXUP_FRAME
+
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, 3*4(%esp)
+	jnz	.Lfrom_usermode_no_fixup_\@
+#endif
+	testl	$SEGMENT_RPL_MASK, 2*4(%esp)
+	jnz	.Lfrom_usermode_no_fixup_\@
+
+	orl	$CS_FROM_KERNEL, 2*4(%esp)
+
+	/*
+	 * When we're here from kernel mode; the (exception) stack looks like:
+	 *
+	 * 4*4(%esp) - <previous context>
+	 * 3*4(%esp) - flags
+	 * 2*4(%esp) - cs
+	 * 1*4(%esp) - ip
+	 * 0*4(%esp) - orig_eax
+	 *
+	 * Let's build a 5-entry IRET frame after that, such that struct pt_regs
+	 * is complete and in particular regs->sp is correct. This gives us
+	 * the original 4 entries as a gap:
+	 *
+	 * 10*4(%esp) - <previous context>
+	 *  9*4(%esp) - gap / flags
+	 *  8*4(%esp) - gap / cs
+	 *  7*4(%esp) - gap / ip
+	 *  6*4(%esp) - gap / orig_eax
+	 *  5*4(%esp) - ss
+	 *  4*4(%esp) - sp
+	 *  3*4(%esp) - flags
+	 *  2*4(%esp) - cs
+	 *  1*4(%esp) - ip
+	 *  0*4(%esp) - orig_eax
+	 */
+
+	pushl	%ss		# ss
+	pushl	%esp		# sp (points at ss)
+	add	$5*4, (%esp)	# point sp back at the previous context
+	pushl	5*4(%esp)	# flags
+	pushl	5*4(%esp)	# cs
+	pushl	5*4(%esp)	# ip
+	pushl	5*4(%esp)	# orig_eax
+
+.Lfrom_usermode_no_fixup_\@:
+.endm
+
+.macro IRET_FRAME
+
+	/* orig_eax is already POP'ed when we're here */
+
+	testl $CS_FROM_KERNEL, 1*4(%esp)
+	jz .Lfinished_frame_\@
+
+	pushl %eax
+
+	lea 10*4(%esp), %eax	# address of <previous context>
+	cmpl %eax, 4*4(%esp)	# if ->sp is unmodified
+	jnz .Lmodified_sp_do_fixup_\@
+
+	/*
+	 * Fast path; regs->sp wasn't modified, reuse the original IRET frame.
+	 */
+	pop %eax
+	add $6*4, %esp
+	jmp .Lfinished_frame_\@;
+
+.Lmodified_sp_do_fixup_\@:
+
+	/*
+	 * Reconstruct the 3 entry IRET frame right after the (modified)
+	 * regs->sp without lowering %esp in between, such that an NMI in the
+	 * middle doesn't scribble our stack.
+	 */
+	pushl	%ecx
+	movl	5*4(%esp), %eax		# (modified) regs->sp
+
+	movl	4*4(%esp), %ecx		# flags
+	movl	%ecx, -4(%eax)
+
+	movl	3*4(%esp), %ecx		# cs
+	andl	$0x0000ffff, %ecx
+	movl	%ecx, -8(%eax)
+
+	movl	2*4(%esp), %ecx		# ip
+	movl	%ecx, -12(%eax)
+
+	movl	1*4(%esp), %ecx		# eax
+	movl	%ecx, -16(%eax)
+
+	popl	%ecx
+	lea	-16(%eax), %esp
+	popl	%eax
+
+.Lfinished_frame_\@:
+.endm
+
 .macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
+
 	cld
+	/*
+	 * The high bits of the CS dword (__csh) are used for CS_FROM_*.
+	 * Clear them in case hardware didn't do this for us.
+	 */
+	andl	$(0x0000ffff), 2*4(%esp)
+
+	FIXUP_FRAME
+
 	PUSH_GS
 	pushl	%fs
 	pushl	%es
@@ -375,9 +497,6 @@
  * switch to it before we do any copying.
  */
 
-#define CS_FROM_ENTRY_STACK	(1 << 31)
-#define CS_FROM_USER_CR3	(1 << 30)
-
 .macro SWITCH_TO_KERNEL_STACK
 
 	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
@@ -391,13 +510,6 @@
 	 * that register for the time this macro runs
 	 */
 
-	/*
-	 * The high bits of the CS dword (__csh) are used for
-	 * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case
-	 * hardware didn't do this for us.
-	 */
-	andl	$(0x0000ffff), PT_CS(%esp)
-
 	/* Are we on the entry stack? Bail out if not! */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
 	addl	$CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx
@@ -755,7 +867,7 @@ END(ret_from_fork)
 	andl	$SEGMENT_RPL_MASK, %eax
 #endif
 	cmpl	$USER_RPL, %eax
-	jb	resume_kernel			# not returning to v8086 or userspace
+	jb	restore_all_kernel		# not returning to v8086 or userspace
 
 ENTRY(resume_userspace)
 	DISABLE_INTERRUPTS(CLBR_ANY)
@@ -765,18 +877,6 @@ ENTRY(resume_userspace)
 	jmp	restore_all
 END(ret_from_exception)
 
-#ifdef CONFIG_PREEMPT
-ENTRY(resume_kernel)
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	cmpl	$0, PER_CPU_VAR(__preempt_count)
-	jnz	restore_all_kernel
-	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
-	jz	restore_all_kernel
-	call	preempt_schedule_irq
-	jmp	restore_all_kernel
-END(resume_kernel)
-#endif
-
 GLOBAL(__begin_SYSENTER_singlestep_region)
 /*
  * All code from here through __end_SYSENTER_singlestep_region is subject
@@ -1019,6 +1119,7 @@ ENTRY(entry_INT80_32)
 	/* Restore user state */
 	RESTORE_REGS pop=4			# skip orig_eax/error_code
 .Lirq_return:
+	IRET_FRAME
 	/*
 	 * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
 	 * when returning from IPI handler and when returning from
@@ -1027,6 +1128,7 @@ ENTRY(entry_INT80_32)
 	INTERRUPT_RETURN
 
 restore_all_kernel:
+	RETINT_PREEMPT
 	TRACE_IRQS_IRET
 	PARANOID_EXIT_TO_KERNEL_MODE
 	BUG_IF_WRONG_CR3
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 20e45d9b4e15..268cd9affe04 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -878,7 +878,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -898,6 +898,16 @@ ENTRY(\sym)
 	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
 
+	.if \create_gap == 1
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jnz	.Lfrom_usermode_no_gap_\@
+	.rept 6
+	pushq	5*8(%rsp)
+	.endr
+	UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+	.endif
+
 	.if \paranoid
 	call	paranoid_entry
 	.else
@@ -1129,7 +1139,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET
-idtentry int3			do_int3			has_error_code=0
+idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 8a7fc0cca2d1..5ff42dc8b396 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -166,14 +166,10 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
 #define compat_user_stack_pointer()	current_pt_regs()->sp
 #endif
 
-#ifdef CONFIG_X86_32
-extern unsigned long kernel_stack_pointer(struct pt_regs *regs);
-#else
 static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
 {
 	return regs->sp;
 }
-#endif
 
 #define GET_IP(regs) ((regs)->ip)
 #define GET_FP(regs) ((regs)->bp)
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index c90678fd391a..6aac6abf931e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -42,4 +42,24 @@ extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
 
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+	regs->sp -= sizeof(unsigned long);
+	*(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+	regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+	int3_emulate_jmp(regs, func);
+}
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 4db9c0d29bc1..9519bc553d6d 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -613,11 +613,83 @@ extern struct paravirt_patch_site __start_parainstructions[],
 	__stop_parainstructions[];
 #endif	/* CONFIG_PARAVIRT */
 
+/*
+ * Self-test for the INT3 based CALL emulation code.
+ *
+ * This exercises int3_emulate_call() to make sure INT3 pt_regs are set up
+ * properly and that there is a stack gap between the INT3 frame and the
+ * previous context. Without this gap doing a virtual PUSH on the interrupted
+ * stack would corrupt the INT3 IRET frame.
+ *
+ * See entry_{32,64}.S for more details.
+ */
+static void __init int3_magic(unsigned int *ptr)
+{
+	*ptr = 1;
+}
+
+extern __initdata unsigned long int3_selftest_ip; /* defined in asm below */
+
+static int __init
+int3_exception_notify(struct notifier_block *self, unsigned long val, void *data)
+{
+	struct die_args *args = data;
+	struct pt_regs *regs = args->regs;
+
+	if (!regs || user_mode(regs))
+		return NOTIFY_DONE;
+
+	if (val != DIE_INT3)
+		return NOTIFY_DONE;
+
+	if (regs->ip - INT3_INSN_SIZE != int3_selftest_ip)
+		return NOTIFY_DONE;
+
+	int3_emulate_call(regs, (unsigned long)&int3_magic);
+	return NOTIFY_STOP;
+}
+
+static void __init int3_selftest(void)
+{
+	static __initdata struct notifier_block int3_exception_nb = {
+		.notifier_call	= int3_exception_notify,
+		.priority	= INT_MAX-1, /* last */
+	};
+	unsigned int val = 0;
+
+	BUG_ON(register_die_notifier(&int3_exception_nb));
+
+	/*
+	 * Basically: int3_magic(&val); but really complicated :-)
+	 *
+	 * Stick the address of the INT3 instruction into int3_selftest_ip,
+	 * then trigger the INT3, padded with NOPs to match a CALL instruction
+	 * length.
+	 */
+	asm volatile ("1: int3; nop; nop; nop; nop\n\t"
+		      ".pushsection .init.data,\"aw\"\n\t"
+		      ".align " __ASM_SEL(4, 8) "\n\t"
+		      ".type int3_selftest_ip, @object\n\t"
+		      ".size int3_selftest_ip, " __ASM_SEL(4, 8) "\n\t"
+		      "int3_selftest_ip:\n\t"
+		      __ASM_SEL(.long, .quad) " 1b\n\t"
+		      ".popsection\n\t"
+		      : : __ASM_SEL_RAW(a, D) (&val) : "memory");
+
+	BUG_ON(val != 1);
+
+	unregister_die_notifier(&int3_exception_nb);
+}
+
 void __init alternative_instructions(void)
 {
-	/* The patching is not fully atomic, so try to avoid local interruptions
-	   that might execute the to be patched code.
-	   Other CPUs are not running. */
+	int3_selftest();
+
+	/*
+	 * The patching is not fully atomic, so try to avoid local
+	 * interruptions that might execute the to be patched code.
+	 * Other CPUs are not running.
+	 */
 	stop_nmi();
 
 	/*
@@ -642,10 +714,11 @@ void __init alternative_instructions(void)
 					    _text, _etext);
 	}
 
-	if (!uniproc_patched || num_possible_cpus() == 1)
+	if (!uniproc_patched || num_possible_cpus() == 1) {
 		free_init_pages("SMP alternatives",
 				(unsigned long)__smp_locks,
 				(unsigned long)__smp_locks_end);
+	}
 #endif
 
 	apply_paravirt(__parainstructions, __parainstructions_end);
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 4b8ee05dd6ad..d13f892d2c47 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -153,35 +153,6 @@ static inline bool invalid_selector(u16 value)
 
 #define FLAG_MASK		FLAG_MASK_32
 
-/*
- * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
- * when it traps.  The previous stack will be directly underneath the saved
- * registers, and 'sp/ss' won't even have been saved. Thus the '&regs->sp'.
- *
- * Now, if the stack is empty, '&regs->sp' is out of range. In this
- * case we try to take the previous stack. To always return a non-null
- * stack pointer we fall back to regs as stack if no previous stack
- * exists.
- *
- * This is valid only for kernel mode traps.
- */
-unsigned long kernel_stack_pointer(struct pt_regs *regs)
-{
-	unsigned long context = (unsigned long)regs & ~(THREAD_SIZE - 1);
-	unsigned long sp = (unsigned long)&regs->sp;
-	u32 *prev_esp;
-
-	if (context == (sp & ~(THREAD_SIZE - 1)))
-		return sp;
-
-	prev_esp = (u32 *)(context);
-	if (*prev_esp)
-		return (unsigned long)*prev_esp;
-
-	return (unsigned long)regs;
-}
-EXPORT_SYMBOL_GPL(kernel_stack_pointer);
-
 static unsigned long *pt_regs_access(struct pt_regs *regs, unsigned long regno)
 {
 	BUILD_BUG_ON(offsetof(struct pt_regs, bx) != 0);
Steven Rostedt May 6, 2019, 1:56 p.m. UTC | #39
On Mon, 6 May 2019 10:19:51 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, May 03, 2019 at 11:57:22AM -0700, Linus Torvalds wrote:
> > On Fri, May 3, 2019 at 9:21 AM Andy Lutomirski <luto@amacapital.net> wrote:  
> > >
> > > So here’s a somewhat nutty suggestion: how about we tweak the 32-bit
> > > entry code to emulate the sane 64-bit frame, not just for int3 but
> > > always?  
> > 
> > What would the code actually end up looking like? I don't necessarily
> > object, since that kernel_stack_pointer() thing certainly looks
> > horrible, but honestly, my suggestion to just pass in the 'struct
> > pt_regs' and let the call emulation fix it up would have also worked,
> > and avoided that bug (and who knows what else might be hiding).
> > 
> > I really think that you're now hitting all the special case magic
> > low-level crap that I wanted to avoid.  
> 
> This did actually boot on first try; so there must be something horribly
> wrong...
> 
> Now, I know you like that other approach; but I figured I should at
> least show you what this one looks like. Maybe I've been staring at
> entry_32.S too much, but I really don't dislike this.

I can test this too. I was hoping to get this in by this merge window.
I spent 3 hours yesterday trying to get Linus's version working on
i386 with no success. Not sure how much time Linus will have to look at
this, as he just opened the merge window.

Again, I think Peter's solution here is the more elegant one. But as
long as we get *a* solution, I'll be happy. And my time to work on it
has pretty much already been depleted.

-- Steve
Peter Zijlstra May 6, 2019, 2:22 p.m. UTC | #40
On Mon, May 06, 2019 at 10:19:51AM +0200, Peter Zijlstra wrote:
> +.Lfrom_usermode_no_fixup_\@:
> +.endm
> +
> +.macro IRET_FRAME
> +
> +	/* orig_eax is already POP'ed when we're here */
> +
> +	testl $CS_FROM_KERNEL, 1*4(%esp)
> +	jz .Lfinished_frame_\@
> +
> +	pushl %eax
> +

From there..

> +	lea 10*4(%esp), %eax	# address of <previous context>
> +	cmpl %eax, 4*4(%esp)	# if ->sp is unmodified
> +	jnz .Lmodified_sp_do_fixup_\@
> +
> +	/*
> +	 * Fast path; regs->sp wasn't modified, reuse the original IRET frame.
> +	 */
> +	pop %eax
> +	add $6*4, %esp
> +	jmp .Lfinished_frame_\@;
> +
> +.Lmodified_sp_do_fixup_\@:

... until here, needs to go; it is buggy. While a clever idea, it loses
updates to regs->ip and ->flags.

> +
> +	/*
> +	 * Reconstruct the 3 entry IRET frame right after the (modified)
> +	 * regs->sp without lowering %esp in between, such that an NMI in the
> +	 * middle doesn't scribble our stack.
> +	 */
> +	pushl	%ecx
> +	movl	5*4(%esp), %eax		# (modified) regs->sp
> +
> +	movl	4*4(%esp), %ecx		# flags
> +	movl	%ecx, -4(%eax)
> +
> +	movl	3*4(%esp), %ecx		# cs
> +	andl	$0x0000ffff, %ecx
> +	movl	%ecx, -8(%eax)
> +
> +	movl	2*4(%esp), %ecx		# ip
> +	movl	%ecx, -12(%eax)
> +
> +	movl	1*4(%esp), %ecx		# eax
> +	movl	%ecx, -16(%eax)
> +
> +	popl	%ecx
> +	lea	-16(%eax), %esp
> +	popl	%eax
> +
> +.Lfinished_frame_\@:
> +.endm
Josh Poimboeuf May 6, 2019, 3:14 p.m. UTC | #41
On Thu, May 02, 2019 at 11:02:40AM -0700, Linus Torvalds wrote:
> On Thu, May 2, 2019 at 9:21 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > TL;DR, on x86_32 kernel->kernel IRET frames are only 3 entries and do
> > not include ESP/SS, so not only wasn't regs->sp setup, if you changed it
> > it wouldn't be effective and corrupt random stack state.
> 
> Indeed, the 32-bit case for same-RPL exceptions/iret is entirely
> different, and I'd forgotten about that.
> 
> And honestly, this makes the 32-bit case much worse. Now the entry
> stack modifications of int3 suddenly affect not just the entry, but
> every exit too.
> 
> This is _exactly_ the kind of subtle kernel entry/exit code I wanted
> us to avoid.

I actually love this patch (absent the bugs).  This is something
that has been sorely needed for years.

The "struct pt_regs is incomplete on x86-32" thing is a monstrosity
which has long been a source of confusion and bugs.  Sure, this patch
adds some complexity to the entry code, but on the other hand it
actually makes it possible to use pt_regs sanely: regs->sp is no longer
uninitialized.  So a class of (very non-obvious) bugs is eliminated.

I don't think it would make sense to make this change for int3 only,
because the benefits are global.
Linus Torvalds May 6, 2019, 4:17 p.m. UTC | #42
On Mon, May 6, 2019 at 6:56 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> I can test this too. I was hoping to get this in by this merge window.
> I spent 3 hours yesterday trying to get Linus's version working on
> i386 with no success. Not sure how much time Linus will have to look at
> this, as he just opened the merge window.

I actually just tested it in qemu, and it worked fine.

Ok, so my test was admittedly really simple and stupid, in that all I did was

        printk("Before int3\n");
        asm("int3; nop; nop; nop; nop" : : :"ax","dx","cx");
        printk("After int3\n");

and then I hacked up do_kernel_int3() to just unconditionally do

        return int3_emulate_call(regs, (unsigned long)test_int3);

with a stupid test-function:

    void test_int3(void)
    {
        printk("In int3 handler");
    }

instead of anything fancy.

But it did exactly the expected thing, and resulted in

    Before int3
    In int3 handler
    After int3

on the screen.

So what is it that doesn't actually work? I've looked at the patch
even more, and I can't for the life of me see how it wouldn't work.

Of course, I didn't test any of the actual ftrace parts, since I
short-circuited them intentionally with the above test function hack.
I have no idea what the semantics for those
ftrace_location(ip)/is_ftrace_caller(ip) cases are supposed to be, I
only tested that yes, the infrastructure clearly emulates a call
instruction.
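
For reference, a toy C model of the helpers this series adds (the two-field pt_regs and the array standing in for the kernel stack are invented for illustration; the real helpers write into the gap left by the copied exception frame):

```c
struct toy_regs {
	unsigned long ip;	/* already points past the int3 */
	unsigned long sp;
};

static void int3_emulate_jmp(struct toy_regs *regs, unsigned long ip)
{
	regs->ip = ip;		/* resume execution at the new address */
}

static void int3_emulate_push(struct toy_regs *regs, unsigned long val)
{
	/* uses the gap the entry code leaves below the copied frame */
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = val;
}

static void int3_emulate_call(struct toy_regs *regs, unsigned long func)
{
	int3_emulate_push(regs, regs->ip);	/* return address */
	int3_emulate_jmp(regs, func);
}
```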

               Linus
Linus Torvalds May 6, 2019, 4:19 p.m. UTC | #43
On Mon, May 6, 2019 at 9:17 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> So what is it that doesn't actually work? I've looked at the patch
> even more, and I can't for the life of me see how it wouldn't work.

And I do still react to PeterZ's

 arch/x86/entry/entry_32.S            | 150 +++++++++++++++++++++++++++++------

vs

 arch/x86/entry/entry_32.S            |  7 ++++++-

for mine. And PeterZ's  apparently still had bugs.

Yes, mine can have bugs too, but I really *have* done some basic
testing, and with 6 lines of code, I think they are more localized.

                Linus
Steven Rostedt May 6, 2019, 5:06 p.m. UTC | #44
On Mon, 6 May 2019 09:17:19 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> So what is it that doesn't actually work? I've looked at the patch
> even more, and I can't for the life of me see how it wouldn't work.
> 
> Of course, I didn't test any of the actual ftrace parts, since I
> short-circuited them intentionally with the above test function hack.
> I have no idea what the semantics for those
> ftrace_location(ip)/is_ftrace_caller(ip) cases are supposed to be, I
> only tested that yes, the infrastructure clearly emulates a call
> instruction.

Can you try booting with:

CONFIG_KPROBE_EVENTS=y
CONFIG_UPROBE_EVENTS=y
CONFIG_DYNAMIC_EVENTS=y
CONFIG_PROBE_EVENTS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
CONFIG_FTRACE_SELFTEST=y
CONFIG_FTRACE_STARTUP_TEST=y
CONFIG_TRACEPOINT_BENCHMARK=y
CONFIG_RING_BUFFER_BENCHMARK=y
CONFIG_RING_BUFFER_STARTUP_TEST=y


And see if it boots?

-- Steve
Linus Torvalds May 6, 2019, 6:06 p.m. UTC | #45
On Mon, May 6, 2019 at 10:06 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Can you try booting with:
> [ snip snip ]
>
> And see if it boots?

No it doesn't.

Dang, I tried to figure out what's up, but now I really have to start
handling all the pull requests...

I thought it might be an int3 that happens on the entry stack, but I
don't think that should ever happen. Either it's a user-mode int3, or
we're in the kernel and have switched stacks. So I still don't see why
my patch doesn't work, but now I have no time to debug it.

                   Linus
Steven Rostedt May 6, 2019, 6:57 p.m. UTC | #46
On Mon, 6 May 2019 11:06:51 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, May 6, 2019 at 10:06 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Can you try booting with:
> > [ snip snip ]
> >
> > And see if it boots?  
> 
> No it doesn't.
> 
> Dang, I tried to figure out what's up, but now I really have to start
> handling all the pull requests...
> 
> I thought it might be an int3 that happens on the entry stack, but I
> don't think that should ever happen. Either it's a user-mode int3, or
> we're in the kernel and have switched stacks. So I still don't see why
> my patch doesn't work, but now I have no time to debug it.
> 

You should have waited another week to open that merge window ;-)

-- Steve
Linus Torvalds May 6, 2019, 7:46 p.m. UTC | #47
On Mon, May 6, 2019 at 11:57 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> You should have waited another week to open that merge window ;-)

Hmm. I'm looking at it while the test builds happen, and since I don't
see what's wrong in the low-level entry code, I'm looking at the
ftrace code instead.

What's going on here?

               *pregs = int3_emulate_call(regs, (unsigned
long)ftrace_regs_caller);

that line makes no sense. Why would we emulate a call to
ftrace_regs_caller()? That function sets up a pt_regs, and then calls
ftrace_stub().

But we *have* pt_regs here already with the right values. Why isn't
this just a direct call to ftrace_stub() from within the int3 handler?

And the thing is, calling ftrace_regs_caller() looks wrong, because
that's not what happens for *real* mcount handling, which uses that
"create_trampoline()" to create the thing we're supposed to really
use?

Anyway, I simply don't understand the code, so I'm confused. But why
is the int3 emulation creating a call that doesn't seem to match what
the instruction that we're actually rewriting is supposed to do?

IOW, it looks to me like ftrace_int3_handler() is actually emulating
something different than what ftrace_modify_code() is actually
modifying the code to do!

Since the only caller of ftrace_modify_code() is update_ftrace_func(),
why is that function not just saving the target and we'd emulate the
call to that? Using anything else looks crazy?

But as mentioned, I just don't understand the ftrace logic. It looks
insane to me, and much more likely to be buggy than the very simple
entry code.

             Linus
Steven Rostedt May 6, 2019, 8:29 p.m. UTC | #48
On Mon, 6 May 2019 12:46:11 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, May 6, 2019 at 11:57 AM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > You should have waited another week to open that merge window ;-)  
> 
> Hmm. I'm looking at it while the test builds happen, and since I don't
> see what's wrong in the low-level entry code, I'm looking at the
> ftrace code instead.
> 
> What's going on here?
> 
>                *pregs = int3_emulate_call(regs, (unsigned
> long)ftrace_regs_caller);
> 
> that line makes no sense. Why would we emulate a call to
> ftrace_regs_caller()? That function sets up a pt_regs, and then calls
> ftrace_stub().

Because that call to ftrace_stub is also dynamic.

In entry_32.S

.globl ftrace_call
ftrace_call:
	call	ftrace_stub


update_ftrace_func()
{
  [..]
	} else if (is_ftrace_caller(ip)) {
		if (!ftrace_update_func_call) {
			int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
			return 1;
		}
		*pregs = int3_emulate_call(regs, ftrace_update_func_call);
		return 1;
	}

Part of the code will change it to call the function needed directly.

struct ftrace_ops my_ops {
	.func = my_handler
};

	register_ftrace_function(&my_ops);

Will change "call ftrace_stub" into "call my_handler"

If you register another ftrace_ops, it will change that to

	call ftrace_ops_list_func

Which will iterate over all registered ftrace_ops, and depending on the
hashes in each ftrace_ops, will call the registered handler for them.
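
A hypothetical sketch of that dispatch in plain C; the structures are simplified stand-ins (the real ftrace_ops carries a hash of filtered functions, modelled here as a single filter_ip, with 0 meaning "trace everything"):

```c
#include <stddef.h>

typedef void (*toy_ftrace_func_t)(unsigned long ip);

struct toy_ftrace_ops {
	toy_ftrace_func_t func;
	unsigned long filter_ip;	/* stand-in for the ops hash */
	struct toy_ftrace_ops *next;
};

static struct toy_ftrace_ops *toy_ops_list;
static int live_patch_hits, tracer_hits;

static void toy_live_patch_handler(unsigned long ip) { (void)ip; live_patch_hits++; }
static void toy_tracer_handler(unsigned long ip)     { (void)ip; tracer_hits++; }

static void toy_register(struct toy_ftrace_ops *ops)
{
	ops->next = toy_ops_list;
	toy_ops_list = ops;
}

/* What "call ftrace_ops_list_func" amounts to once more than one
 * callback is registered: iterate all ops, calling each handler whose
 * filter matches the traced function. */
static void toy_list_func(unsigned long ip)
{
	struct toy_ftrace_ops *ops;

	for (ops = toy_ops_list; ops; ops = ops->next)
		if (!ops->filter_ip || ops->filter_ip == ip)
			ops->func(ip);
}
```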


> 
> But we *have* pt_regs here already with the right values. Why isn't
> this just a direct call to ftrace_stub() from within the int3 handler?
> 
> And the thing is, calling ftrace_regs_caller() looks wrong, because
> that's not what happens for *real* mcount handling, which uses that
> "create_trampoline()" to create the thing we're supposed to really
> use?

The ftrace_regs_caller() is what is called if there are two or more
callbacks registered to a single function. For example, you have a
function that is being live patched (it uses the ftrace_regs_caller copy
of the trampoline). But if you enable function tracing (which doesn't
need a copy of regs), it will call the ftrace_regs_caller, which will
call a ftrace_ops_list_func() which will look at the ftrace_ops (which
is the descriptor representing registered callbacks to ftrace), and
based on the hash value in them, will call their handler if the
ftrace_ops hashes match the function being called.


> 
> Anyway, I simply don't understand the code, so I'm confused. But why
> is the int3 emulation creating a call that doesn't seem to match what
> the instruction that we're actually rewriting is supposed to do?
> 
> IOW, it looks to me like ftrace_int3_handler() is actually emulating
> something different than what ftrace_modify_code() is actually
> modifying the code to do!
> 
> Since the only caller of ftrace_modify_code() is update_ftrace_func(),
> why is that function not just saving the target and we'd emulate the
> call to that? Using anything else looks crazy?
> 
> But as mentioned, I just don't understand the ftrace logic. It looks
> insane to me, and much more likely to be buggy than the very simple
> entry code.


Let's go through an example. Let's say we live patched do_IRQ() and
__migrate_task(). We would have this:

live_patch_trampoline:
   (which is a modified copy of the ftrace_regs_caller)
	pushl	$__KERNEL_CS
	pushl	4(%esp)
	pushl	$0
	pushl	%gs
	pushl	%fs
	pushl	%es
	pushl	%ds
	pushl	%eax
	pushf
	popl	%eax
	movl	%eax, 8*4(%esp)
	pushl	%ebp
	pushl	%edi
	pushl	%esi
	pushl	%edx
	pushl	%ecx
	pushl	%ebx
	movl	12*4(%esp), %eax
	subl	$MCOUNT_INSN_SIZE, %eax
	movl	15*4(%esp), %edx		/* Load parent ip (2nd parameter) */
	movl	function_trace_op, %ecx		/* Save ftrace_pos in 3rd parameter */
	pushl	%esp				/* Save pt_regs as 4th parameter */

	call	live_kernel_patch_func

	addl	$4, %esp			/* Skip pt_regs */
	push	14*4(%esp)
	popf
	movl	12*4(%esp), %eax
	movl	%eax, 14*4(%esp)
	popl	%ebx
	popl	%ecx
	popl	%edx
	popl	%esi
	popl	%edi
	popl	%ebp
	popl	%eax
	popl	%ds
	popl	%es
	popl	%fs
	popl	%gs
	lea	3*4(%esp), %esp			/* Skip orig_ax, ip and cs */
	jmp	.Lftrace_ret


<do_IRQ>:
	call live_patch_trampoline
	[..]

<__migrate_task>:
	call live_patch_trampoline


Now we enable function tracing on all functions that can be traced, and
this includes do_IRQ() and __migrate_task(). Thus, we first modify that
call to ftrace_stub in the ftrace_regs_caller to point to the
ftrace_ops_list_func() as that will iterate over the ftrace_ops for
live kernel patching, and the ftrace_ops for the function tracer. That
iterator will check the hashes against the called functions, and for
live kernel patching, it will call its handler if the passed-in
ip matches either do_IRQ() or __migrate_task(). It will see that the
ftrace_ops for function tracing is set to trace all functions and just
call its handler in that loop too.

Today, when we place an int3 on those functions, we basically turn them
into nops.

<do_IRQ>:
	<int3>(convert from call live_patch_trampoline
	                 to call ftrace_regs_caller)
	[..]

But that int3 handler doesn't call either the live_patch_trampoline or
ftrace_regs_caller, which means the live kernel patching doesn't get
to make that function call something different. We basically just
disabled tracing completely for those functions during that transition.

Remember that ftrace_regs_caller gets updated to not call ftrace_stub,
but to the list iterator if there's more than one handler registered
with ftrace (and so does ftrace_caller). By making the int3 handler
call it, we still do the iteration over all registered ftrace_ops and
nothing gets missed.

Does that help explain what's going on?

-- Steve
Linus Torvalds May 6, 2019, 8:42 p.m. UTC | #49
On Mon, May 6, 2019 at 1:29 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Because that call to ftrace_stub is also dynamic.

You're missing the point.

We are rewriting a single "call" instruction to point to something.

The "int3" emulation should call THE SAME THING.

Right now it doesn't.

> Part of the code will change it to call the function needed directly.
>
> struct ftrace_ops my_ops {
>         .func = my_handler
> };
>
>         register_ftrace_function(&my_ops);
>
> Will change "call ftrace_stub" into "call my_handler"

But that's not what you're actually *doing*.

Instead, you're now _emulating_ calling ftrace_regs_caller, which will
call that ftrace_stub, which in turn will try to update the call site.

But that's insane. It's insane because

 - it's not even what your call rewriting is doing. Why aren't you just
doing the emulation using the *SAME* target that you're rewriting the
actual call instruction with?

 - even if ftrace_regs_caller ends up being that same function, you'd
be better off just passing the "struct pt_regs" that you *ALREADY
HAVE* directly to ftrace_stub in the int3 handler, rather than create
*another* pt_regs stack

See? In that second case, why don't you just use "int3_emulate_call()"
to do the required 'struct pt_regs' updates, and then call
ftrace_stub() *with* that fixed-up pt_regs thing?

In other words, I think you should always do "int3_emulate_call()"
with the *exact* same address that the instruction you are rewriting
is using. There's no "three different cases". The only possible cases
are "am I rewriting a jump" or "am I rewriting a call".

There is no "am I rewriting a call to one address, and then emulating
it with a call to another address" case that makes sense.

What *can* make sense is "Oh, I'm emulating a call, but I know that
call will be rewritten, so let me emulate the call and then
short-circuit the emulation immediately".

But that is not what the ftrace code is doing. The ftrace code is
doing something odd and insane.

And no, your "explanation" makes no sense. Because it doesn't actually
touch on the fundamental insanity.

            Linus
Linus Torvalds May 6, 2019, 8:44 p.m. UTC | #50
On Mon, May 6, 2019 at 1:42 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> What *can* make sense is "Oh, I'm emulating a call, but I know that
> call will be rewritten, so let me emulate the call and then
> short-circuit the emulation immediately".

That made no sense. The end should have been "and then short-circuit
the _rewriting_ immediately" of course.

The "emulate a call" is just to make the "struct pt_regs" state look
like it would have after the call. The "short circuit the rewriting"
part is the thing that then rewrites the actual instruction based on
that pt_regs state.

                 Linus
Steven Rostedt May 6, 2019, 9:45 p.m. UTC | #51
On Mon, 6 May 2019 13:42:12 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, May 6, 2019 at 1:29 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > Because that call to ftrace_stub is also dynamic.  
> 
> You're missing the point.
> 
> We are rewriting a single "call" instruction to point to something.
> 
> The "int3" emulation should call THE SAME THING.
> 
> Right now it doesn't.

To do that we would need to rewrite the logic to update each of those
40,000 calls one at a time, or group them together by what gets
changed, as each function can call a different trampoline: one of the
custom trampolines associated with a ftrace_ops, or the list one when a
function has more than one ftrace_ops attached to it.

The ftrace code does this in batches:

 1) Add int3 to all functions being affected (could be 40,000 of
    them) (sync the cores)

 (Now all those functions are going through the int3 handler)

 2) Modify the address of those call sites (sync the cores)

 (Still going through the int3 handlers)

 3) Remove the int3

Now each of those functions is calling something, and they may not all
be calling the same thing.

> 
> > Part of the code will change it to call the function needed directly.
> >
> > struct ftrace_ops my_ops {
> >         .func = my_handler
> > };
> >
> >         register_ftrace_function(&my_ops);
> >
> > Will change "call ftrace_stub" into "call my_handler"  
> 
> But that's not what you're actually *doing*.
> 
> Instead, you're now _emulating_ calling ftrace_regs_caller, which will
> call that ftrace_stub, which in turn will try to update the call site.

The call site gets updated before the loop gets called (at least it
should, I have to go and look at the code, but I'm 99% sure that it does).
There should not be a breakpoint on the call to ftrace_stub when we add
breakpoints to the functions that call the ftrace_regs_caller.

> 
> But that's insane. It's insane because
> 
>  - it's not even what your call rewriting is doing Why aren't you just
> doing the emulation using the *SAME* target that you're rewriting the
> actual call instruction with?
> 
>  - even if ftrace_regs_caller ends up being that same function, you'd
> be better off just passing the "struct pt_regs" that you *ALREADY
> HAVE* directly to ftrace_stub in the int3 handler, rather than create
> *another* pt_regs stack

Are you saying to call the ftrace_ops handlers from int3 directly?


> 
> See? In that second case, why don't you just use "int3_emulate_call()"
> to do the required 'struct pt_regs' updates, and then call
> ftrace_stub() *with* that fixed-up pt_regs thing?
> 
> In other words, I think you should always do "int3_emulate_call()"
> with the *exact* same address that the instruction you are rewriting
> is using. There's no "three different cases". The only possible cases
> are "am I rewriting a jump" or "am I rewriting a call".
> 
> There is no "am I rewriting a call to one address, and then emulating
> it with a call to another address" case that makes sense.
> 
> What *can* make sense is "Oh, I'm emulating a call, but I know that
> call will be rewritten, so let me emulate the call and then
> short-circuit the emulation immediately".
> 
> But that is not what the ftrace code is doing. The ftrace code is
> doing something odd and insane.
> 
> And no, your "explanation" makes no sense. Because it doesn't actually
> touch on the fundamental insanity.

Also, talking with Peter about the ftrace_32.S version of
ftrace_regs_caller makes Peter's patch sound even better.

The "ftrace_regs_caller" was created to allow kprobes to use the
function tracer when a probe was added to the start of a function
(which is a common occurrence). That is, kprobes uses int3 to inject a
handler pretty much anywhere in the code. kprobes can even be used for
dynamic trace events.

Now we found that if we use function tracing, it's faster to do the
call than to take the int3 hit. So kprobes knows of ftrace, and will
register a callback if it happens to be placed on a fentry call site.

Thus, Masami asked me to create a way to have ftrace be able to
simulate an int3. As kprobe handlers can do pretty much anything (bpf
uses them), I had to make that call from ftrace look like a real int3
just happened.

As ftrace_caller is optimized for fast function tracing, I needed to
make another trampoline for the slower "emulate int3" operation, and
that was the birth of ftrace_regs_caller. For x86_64, it was really
straightforward and I had it done rather quickly. For i386, it was
much more difficult, because regs->sp is not on the
stack. I had to play this game to be able to pass in a pt_regs that
would be the same regardless of whether it was called by ftrace or an int3.

The problem was that the call to ftrace_regs_caller would place the
return address on the stack, but I had to move it, because the kprobes
handlers expected &regs->sp to point to the location of the stack just
before the call was hit!  This meant regs->flags was where the return
address had been.

When we enter ftrace_regs_caller from the function being traced, the
top of the stack has the return address. But then we needed to do this:

	pushl	$__KERNEL_CS
	pushl	4(%esp)				/* Save the return ip */
	pushl	$0				/* Load 0 into orig_ax */
	pushl	%gs
	pushl	%fs
	pushl	%es
	pushl	%ds
	pushl	%eax

The above push regs->cs, regs->ip (the return code), then regs->gs...
to regs->ax, where now I finally have saved a scratch register to use.


	/* Get flags and place them into the return ip slot */
	pushf
	popl	%eax
	movl	%eax, 8*4(%esp)

I would then save flags into %eax and move it to where the return
address was placed by the call to this trampoline.

At the end, I had to undo this song and dance as well:

	/* restore flags */
	push	14*4(%esp)
	popf

	/* Move return ip back to its original location */
	movl	12*4(%esp), %eax
	movl	%eax, 14*4(%esp)


If we go with Peter's patch, I can make this code much more sane, and
not have to worry about having &regs->sp be at the top of the stack. I
could simply push everything in the order of pt_regs and call the
handler.

-- Steve
Linus Torvalds May 6, 2019, 10:06 p.m. UTC | #52
On Mon, May 6, 2019 at 2:45 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> To do that we would need to rewrite the logic to update each of those
> 40,000 calls one at a time, or group them together to what gets
> changed.

Stephen, YOU ARE NOT LISTENING.

You are already fixing the value of the call in the instruction as
part of the instruction rewriting.

When you do things like this:

        unsigned long ip = (unsigned long)(&ftrace_call);
        unsigned char *new;
        int ret;

        new = ftrace_call_replace(ip, (unsigned long)func);
        ret = update_ftrace_func(ip, new);

you have already decided to rewrite the instruction with one single
fixed call target: "func".

I'm just saying that you should ALWAYS use the same call target in the
int3 emulation.

Instead, you hardcode something other than what you are AT THE SAME
TIME rewriting the instruction with.

See what I'm saying?

You already save off the "ip" of the instruction you modify in
update_ftrace_func(). I'm just saying that you should *also* save off
the actual target of the call, and use *THAT*.

So that the int3 emulation and the instruction rewriting *match*.

What you do now makes no sense. You're modifying the code with one
thing (the "func" argument in update_ftrace_func), so if your
modification completed, that's what you'll actually *run*. But you're
then _emulating_ doing something completely different, not using
"func" at all there.

So let me say one more time: how can it *possibly* make sense to
emulate something other than what you are changing the instruction to read?

Are you finally understanding what craziness I'm talking about?

Stop with the "there could be thousands of targets" argument. The
"call" instruction THAT YOU ARE REWRITING has exactly one target.
There aren't 40,000 of them. x86 does not have that kind of "call"
instruction that randomly calls 40k different functions. You are
replacing FIVE BYTES of memory, and the emulation you do should
emulate those FIVE BYTES.

See?
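
Concretely, those five bytes are "e8 <rel32>": a near call whose single target is ip + 5 + rel32. A sketch of the arithmetic, with call_replace() as a stand-in for what ftrace_call_replace() computes, and call_target() recovering the one target the int3 handler should emulate:

```c
#include <string.h>
#include <stdint.h>

static void call_replace(unsigned char insn[5], unsigned long ip,
			 unsigned long target)
{
	/* displacement is relative to the end of the 5-byte call */
	int32_t rel = (int32_t)(target - (ip + 5));

	insn[0] = 0xe8;				/* near call opcode */
	memcpy(insn + 1, &rel, sizeof(rel));
}

static unsigned long call_target(const unsigned char insn[5], unsigned long ip)
{
	int32_t rel;

	memcpy(&rel, insn + 1, sizeof(rel));	/* insn[0] is the opcode */
	return ip + 5 + rel;
}
```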

Why are you emulating something different than what you are rewriting?

                    Linus
Linus Torvalds May 6, 2019, 10:31 p.m. UTC | #53
On Mon, May 6, 2019 at 3:06 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Why are you emulating something different than what you are rewriting?

Side note: I'm also finding another bug on the ftrace side, which is a
simple race condition.

In particular, the logic with 'modifying_ftrace_code' is fundamentally racy.

What can happen is that on one CPU we rewrite one instruction:

        ftrace_update_func = ip;
        /* Make sure the breakpoints see the ftrace_update_func update */
        smp_wmb();

        /* See comment above by declaration of modifying_ftrace_code */
        atomic_inc(&modifying_ftrace_code);

        ret = ftrace_modify_code(ip, old, new);

        atomic_dec(&modifying_ftrace_code);

   but then another CPU hits the 'int3' while the modification is
going on, and takes the fault.

The fault handler does that

        if (unlikely(atomic_read(&modifying_ftrace_code))..

and sees that "yes, it's in the middle of modifying the ftrace code",
and calls ftrace_int3_handler().  All good and "obviously correct" so
far, no?

HOWEVER. It's actually buggy. Because in the meantime, the CPU that
was rewriting instructions has finished, and decrements the
modifying_ftrace_code, which doesn't hurt us (because we already saw
that the int3 was due to the modification).

BUT! There are two different races here:

 (a) maybe the fault handling was slow, and we saw the 'int3' and took
the fault, but the modifying CPU had already finished, so that
atomic_read(&modifying_ftrace_code) didn't actually trigger at all.

 (b) maybe the int3-faulting CPU *did* see the proper value of
modifying_ftrace_code, but the modifying CPU went on and started
*another* modification, and has changed ftrace_update_func in the
meantime, so now the int3 handling is looking at the wrong values!

In the case of (a), we'll die with an oops due to the inexplicable
'int3' we hit. And in the case of (b) we'll be fixing up using the
wrong address.

Things like this are why I'm wondering how much of the problems are due
to the entry code, and how much of it is due to simply races and
timing differences?

Again, I don't actually know the ftrace code, and maybe I'm missing
something, but this really looks like _another_ fundamental bug.

The way to handle that modifying_ftrace_code thing is most likely by
using a sequence counter. For example, one way to actually do
something like this might be

        ftrace_update_func = ip;
        ftrace_update_target = func;
        smp_wmb();
        atomic_inc(&modifying_ftrace_head);

        ret = ftrace_modify_code(ip, old, new);

        atomic_inc(&modifying_ftrace_tail);
        smp_wmb();

and now the int3 code could do something like

        int head, tail;

        head = atomic_read(&modifying_ftrace_head);
        smp_rmb();
        tail = atomic_read(&modifying_ftrace_tail);

        /* Are we still in the process of modification? */
        if (unlikely(head != tail+1))
                return 0;

        ip = ftrace_update_func;
        func = ftrace_update_target;
        smp_rmb();
        /* Need to re-check that the above two values are consistent
and we didn't exit */
        if (atomic_read(&modifying_ftrace_tail) != tail)
                return 0;

        *pregs = int3_emulate_call(regs, ip, func);
        return 1;

although it probably really would be better to use a seqcount instead
of writing it out like the above.
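
(A single-threaded sketch of that seqcount pattern in plain C: the writer bumps the counter to odd before publishing (ip, target) and back to even afterwards, and a reader only trusts a snapshot taken while the counter was even and unchanged. No memory barriers here; real code would use the kernel's seqcount_t, this is only the control flow.)

```c
#include <stdbool.h>

static unsigned int update_seq;
static unsigned long update_ip, update_target;

static void modify_begin(unsigned long ip, unsigned long target)
{
	update_seq++;			/* odd: modification in progress */
	update_ip = ip;
	update_target = target;
}

static void modify_end(void)
{
	update_seq++;			/* even: values stable again */
}

/* Fills *ip/*target and returns true only for a consistent snapshot. */
static bool int3_snapshot(unsigned long *ip, unsigned long *target)
{
	unsigned int start = update_seq;

	if (start & 1)
		return false;		/* writer is mid-update */
	*ip = update_ip;
	*target = update_target;
	return update_seq == start;	/* counter moved: caller retries */
}
```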

NOTE! The above only fixes the (b) race. The (a) race is probably best
handled by actually checking if the 'int3' instruction is still there
before dying.

Again, maybe there's something I'm missing, but having looked at that
patch now what feels like a million times, I'm finding more worrisome
things in the ftrace code than in the kernel entry code..

                 Linus
Steven Rostedt May 7, 2019, 12:10 a.m. UTC | #54
On Mon, 6 May 2019 15:31:57 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, May 6, 2019 at 3:06 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > Why are you emulating something different than what you are rewriting?  
> 
> Side note: I'm also finding another bug on the ftrace side, which is a
> simple race condition.
> 
> In particular, the logic with 'modifying_ftrace_code' is fundamentally racy.
> 
> What can happen is that on one CPU we rewrite one instruction:
> 
>         ftrace_update_func = ip;
>         /* Make sure the breakpoints see the ftrace_update_func update */
>         smp_wmb();
> 
>         /* See comment above by declaration of modifying_ftrace_code */
>         atomic_inc(&modifying_ftrace_code);
> 
>         ret = ftrace_modify_code(ip, old, new);
> 
>         atomic_dec(&modifying_ftrace_code);
> 
>    but then another CPU hits the 'int3' while the modification is
> going on, and takes the fault.
> 
> The fault handler does that
> 
>         if (unlikely(atomic_read(&modifying_ftrace_code))..
> 
> and sees that "yes, it's in the middle of modifying the ftrace code",
> and calls ftrace_int3_handler().  All good and "obviously correct" so
> far, no?
> 
> HOWEVER. It's actually buggy. Because in the meantime, the CPU that
> was rewriting instructions has finished, and decrements the
> modifying_ftrace_code, which doesn't hurt us (because we already saw
> that the int3 was due to the modification).

But the CPU that was rewriting instructions does a run_sync() after
removing the int3:

static void run_sync(void)
{
	int enable_irqs;

	/* No need to sync if there's only one CPU */
	if (num_online_cpus() == 1)
		return;

	enable_irqs = irqs_disabled();

	/* We may be called with interrupts disabled (on bootup). */
	if (enable_irqs)
		local_irq_enable();
	on_each_cpu(do_sync_core, NULL, 1);
	if (enable_irqs)
		local_irq_disable();
}

Which sends an IPI to all CPUs to make sure they no longer see the int3.

> 
> BUT! There are two different races here:
> 
>  (a) maybe the fault handling was slow, and we saw the 'int3' and took
> the fault, but the modifying CPU had already finished, so that
> atomic_read(&modifying_ftrace_code) didn't actually trigger at all.
> 
>  (b) maybe the int3-faulting CPU *did* see the proper value of
> modifying_ftrace_code, but the modifying CPU went on and started
> *another* modification, and has changed ftrace_update_func in the
> meantime, so now the int3 handling is looking at the wrong values!
> 
> In the case of (a), we'll die with an oops due to the inexplicable
> 'int3' we hit. And in the case of (b) we'll be fixing up using the
> wrong address.
> 
> Things like this is why I'm wondering how much of the problems are due
> to the entry code, and how much of it is due to simply races and
> timing differences?
> 
> Again, I don't actually know the ftrace code, and maybe I'm missing
> something, but this really looks like _another_ fundamental bug.
> 
> The way to handle that modifying_ftrace_code thing is most likely by
> using a sequence counter. For example, one way to actually do some
> thing like this might be
> 
>         ftrace_update_func = ip;
>         ftrace_update_target = func;
>         smp_wmb();
>         atomic_inc(&modifying_ftrace_head);
> 
>         ret = ftrace_modify_code(ip, old, new);
> 
>         atomic_inc(&modifying_ftrace_tail);
>         smp_wmb();
> 
> and now the int3 code could do something like
> 
>         int head, tail;
> 
>         head = atomic_read(&modifying_ftrace_head);
>         smp_rmb();
>         tail = atomic_read(&modifying_ftrace_tail);
> 
>         /* Are we still in the process of modification? */
>         if (unlikely(head != tail+1))
>                 return 0;
> 
>         ip = ftrace_update_func;
>         func = ftrace_update_target;
>         smp_rmb();
>         /* Need to re-check that the above two values are consistent
> and we didn't exit */
>         if (atomic_read(&modifying_ftrace_tail) != tail)
>                 return 0;
> 
>         *pregs int3_emulate_call(regs, ip, func);
>         return 1;
> 
> although it probably really would be better to use a seqcount instead
> of writing it out like the above.
> 
> NOTE! The above only fixes the (b) race. The (a) race is probably best
> handled by actually checking if the 'int3' instruction is still there
> before dying.
> 
> Again, maybe there's something I'm missing, but having looked at that
> patch now what feels like a million times, I'm finding more worrisome
> things in the ftrace code than in the kernel entry code..

I think you are missing the run_sync() which is the heavy hammer to
make sure all CPUs are in sync. And this is done at each stage:

	add int3
	run_sync();
	update call site outside of int3
	run_sync()
	remove int3
	run_sync()

HPA said that the last run_sync() isn't needed, but I kept it because I
wanted to make sure. Looks like your analysis shows that it is needed.

-- Steve
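
[The three-stage sequence above can be sketched as a userspace C simulation — hypothetical byte values, with run_sync() stubbed out since the real one IPIs every CPU:]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define INT3 0xCC

/* A 5-byte instruction slot being rewritten (here: 5-byte NOP). */
uint8_t insn[5] = { 0x0f, 0x1f, 0x44, 0x00, 0x00 };

/* Stand-in for on_each_cpu(do_sync_core, NULL, 1): a no-op here. */
void run_sync(void) { }

void update_insn(const uint8_t *new5)
{
	insn[0] = INT3;			/* 1: arm the breakpoint */
	run_sync();
	memcpy(insn + 1, new5 + 1, 4);	/* 2: rewrite the tail behind the int3 */
	run_sync();
	insn[0] = new5[0];		/* 3: replace the int3 with the new opcode */
	run_sync();
}
```

At no point can another CPU fetch a half-rewritten instruction: it sees either the old bytes, the int3, or the new bytes, because each stage is fenced by the sync.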
Steven Rostedt May 7, 2019, 1:04 a.m. UTC | #55
On Mon, 6 May 2019 15:06:57 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, May 6, 2019 at 2:45 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > To do that we would need to rewrite the logic to update each of those
> > 40,000 calls one at a time, or group them together to what gets
> > changed.  
> 
> Stephen, YOU ARE NOT LISTENING.

 (note, it's Steven ;-)

I'm listening, I'm just trying to understand.

> 
> You are already fixing the value of the call in the instruction as
> part of the instruction rewriting.
> 
> When you do things like this:
> 
>         unsigned long ip = (unsigned long)(&ftrace_call);
>         unsigned char *new;
>         int ret;
> 
>         new = ftrace_call_replace(ip, (unsigned long)func);
>         ret = update_ftrace_func(ip, new);
> 
> you have already decided to rewrite the instruction with one single
> fixed call target: "func".
> 
> I'm just saying that you should ALWAYS use the same call target in the
> int3 emulation.
> 
> Instead, you hardcode something else than what you are AT THE SAME
> TIME rewriting the instruction with.
> 
> See what I'm saying?

Yes, but that's not the code I'm worried about.

ftrace_replace_code() is, which does:

	for_ftrace_rec_iter(iter) {
		rec = ftrace_rec_iter_record(iter);

		ret = add_breakpoints(rec, enable);
		if (ret)
			goto remove_breakpoints;
		count++;
	}

	run_sync();


And there's two more iterator loops that will do the modification of
the call site, and then the third loop will remove the breakpoint.

That iterator does something special for each individual record. All
40,000 of them.

That add_breakpoint() does:

static int add_breakpoints(struct dyn_ftrace *rec, int enable)
{
	unsigned long ftrace_addr;
	int ret;

	ftrace_addr = ftrace_get_addr_curr(rec);

	ret = ftrace_test_record(rec, enable);

	switch (ret) {
	case FTRACE_UPDATE_IGNORE:
		return 0;

	case FTRACE_UPDATE_MAKE_CALL:
		/* converting nop to call */
		return add_brk_on_nop(rec);

	case FTRACE_UPDATE_MODIFY_CALL:
	case FTRACE_UPDATE_MAKE_NOP:
		/* converting a call to a nop */
		return add_brk_on_call(rec, ftrace_addr);
	}
	return 0;
}

And to get what the target is, we call ftrace_get_addr_curr(), which
will return a function based on the flags in the record. Which can be
anything from a call to a customized trampoline, to either
ftrace_caller, or to ftrace_regs_caller, or it can turn the record into
a nop.

This is what I'm talking about. We are adding thousands of int3s
throughout the kernel, and we have a single handler to handle each one
of them.

The reason I picked ftrace_regs_caller() is because that one does
anything that any of the callers can do (albeit slower). If it does
not, then ftrace will be broken, because it handles the case that all
types of trampolines are attached to a single function, and that code
had better do the exact same thing for each of those trampolines as if
the trampolines were called directly, because the handlers that those
trampolines call, shouldn't care who else is using that function.

Note, the only exception to that rule, is that we only allow one
function attached to the function to modify the return address (and the
record has a flag for that). If a record already modifies the ip
address on return, the registering of another ftrace_ops that modifies
the ip address will fail to register.


> 
> Stop with the "there could be thousands of targets" argument. The
> "call" instruction THAT YOU ARE REWRITING has exactly one target.
> There aren't 40,000 of them. x86 does not have that kind of "call"
> instruction that randomly calls 40k different functions. You are
> replacing FIVE BYTES of memory, and the emulation you do should
> emulate those FIVE BYTES.
> 
> See?
> 
> Why are you emulating something different than what you are rewriting?

I'm not having one call site call 40,000 different functions. You are
right about that. But I have 40,000 different call sites that could be
calling different functions and all of them are being processed by a
single int3 handler.

That's my point.

-- Steve
Linus Torvalds May 7, 2019, 1:06 a.m. UTC | #56
On Mon, May 6, 2019 at 5:10 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> But the CPU that was rewriting instructions does a run_sync() after
> removing the int3:
>
> static void run_sync(void)
> {
>         int enable_irqs;
>
>         /* No need to sync if there's only one CPU */
>         if (num_online_cpus() == 1)
>                 return;
>
>         enable_irqs = irqs_disabled();
>
>         /* We may be called with interrupts disabled (on bootup). */
>         if (enable_irqs)
>                 local_irq_enable();
>         on_each_cpu(do_sync_core, NULL, 1);
>         if (enable_irqs)
>                 local_irq_disable();
> }
>
> Which sends an IPI to all CPUs to make sure they no longer see the int3.

Duh. I have been looking back and forth in that file, and I was mixing
ftrace_modify_code_direct() (which only does a local sync) with
ftrace_modify_code() (which does run_sync()). The dangers of moving
around by searching for function names.

That file is a maze of several functions that are very similarly named
and do slightly different things.

But yes, I was looking at the "direct" sequence.

> I think you are missing the run_sync() which is the heavy hammer to
> make sure all CPUs are in sync. And this is done at each stage:
>
>         add int3
>         run_sync();
>         update call site outside of int3
>         run_sync()
>         remove int3
>         run_sync()
>
> HPA said that the last run_sync() isn't needed, but I kept it because I
> wanted to make sure. Looks like your analysis shows that it is needed.

Absolutely. I think we could get rid of it, but yes, to then avoid the
race we'd need to be a lot more clever.

Yeah, with the three run_sync() things, the races I thought it had can't happen.

             Linus
Steven Rostedt May 7, 2019, 1:34 a.m. UTC | #57
On Mon, 6 May 2019 21:04:16 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> static int add_breakpoints(struct dyn_ftrace *rec, int enable)
> {
> 	unsigned long ftrace_addr;
> 	int ret;
> 
> 	ftrace_addr = ftrace_get_addr_curr(rec);
> 
> 	ret = ftrace_test_record(rec, enable);
> 
> 	switch (ret) {
> 	case FTRACE_UPDATE_IGNORE:
> 		return 0;
> 
> 	case FTRACE_UPDATE_MAKE_CALL:
> 		/* converting nop to call */
> 		return add_brk_on_nop(rec);
> 
> 	case FTRACE_UPDATE_MODIFY_CALL:

The FTRACE_UPDATE_MODIFY_CALL is the case where we switch from calling
one trampoline to calling another. As you can plainly see from this
switch statement, we treat it the same as converting to a nop. That's
the bug we are trying to fix. It shouldn't be treated as a nop. It
should be treated as a call to the old or new function. Either is fine,
but it must be one or the other, not a nop.

-- Steve


> 	case FTRACE_UPDATE_MAKE_NOP:
> 		/* converting a call to a nop */
> 		return add_brk_on_call(rec, ftrace_addr);
> 	}
> 	return 0;
> }
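
[The rule being argued for here can be sketched as a dispatch table — entirely hypothetical names, showing only that MODIFY_CALL must emulate a call rather than a nop:]

```c
#include <assert.h>

/* Mirrors the ftrace_test_record() outcomes seen in the switch above. */
enum update { IGNORE, MAKE_CALL, MODIFY_CALL, MAKE_NOP };

/* What the int3 handler should emulate while the breakpoint is armed. */
enum emu { EMU_NOP, EMU_CALL_OLD };

enum emu int3_emulation_for(enum update u)
{
	switch (u) {
	case MAKE_CALL:
		return EMU_NOP;		/* old state was a nop: jump over */
	case MODIFY_CALL:
		return EMU_CALL_OLD;	/* call -> call: old target still valid */
	case MAKE_NOP:
		return EMU_CALL_OLD;	/* call being removed: old target fine */
	default:
		return EMU_NOP;
	}
}
```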
Linus Torvalds May 7, 2019, 1:34 a.m. UTC | #58
On Mon, May 6, 2019 at 6:04 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> That iterator does something special for each individual record. All
> 40,000 of them.

.. yes, but the 'int3' only happens for *one* of them at a time.

Why would it bother with the other 39,999 calls?

You could easily just look up the record at the int3 time, and just
use the record. Exactly the same way you use the one-at-a-time ones.

Instead, you emulate a fake call to a function that *wouldn't* get
called, which now does the lookup there. That's the part I don't get.
Why are you emulating something else than what you'd be rewriting?

             Linus
Steven Rostedt May 7, 2019, 1:53 a.m. UTC | #59
On Mon, 6 May 2019 18:34:59 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Mon, May 6, 2019 at 6:04 PM Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > That iterator does something special for each individual record. All
> > 40,000 of them.  
> 
> .. yes, but the 'int3' only happens for *one* of them at a time.
> 
> Why would it bother with the other 39,999 calls?
> 
> You could easily just look up the record at the int3 time, and just
> use the record. Exactly the same way you use the one-at-a-time ones.
> 
> Instead, you emulate a fake call to a function that *wouldn't* get
> called, which now does the lookup there. That's the part I don't get.
> Why are you emulating something else than what you'd be rewriting?
>

Ah, now I see what you are saying. Yes, I could pass in what it is
supposed to call. But I was trying to use the same code for all the
alternative solutions we were passing around, and this became the
"default" case that would work with any int3_emulate_call
implementation we came up with.

That is, if we call ftrace_regs_caller() for any scenario it should
work. Even if the call was supposed to be a nop, because in that case,
all the ftrace_ops registered in the iterator would refuse to have
their handler be called for that function.

I sent you a single patch, but that was really just a diff of several
applied patches against your unmodified tree. The last patch implements
the ftrace code. And I had it this way because it should work for any
of the implementations.

I could modify it so that it picks what function to call when the int3
is triggered. I think all the solutions we are down to allow that now.
Some of the early ideas had me call one function for all int3s due to
trampolines and such.

Also, I figured just calling ftrace_regs_caller() was simpler than
having that int3 handler do the hash look ups to determine what handler
it needs to call.

-- Steve
Linus Torvalds May 7, 2019, 2:22 a.m. UTC | #60
On Mon, May 6, 2019 at 6:53 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Also, I figured just calling ftrace_regs_caller() was simpler than
> having that int3 handler do the hash look ups to determine what handler
> it needs to call.

So what got me looking at this - and the races (that didn't turn out
to be races) - and why I really hate it, is because of the whole
notion of "atomic state".

Running an "int3" handler (when the int3 is in the kernel) is in some
sense "atomic". There is the state in the caller, of course, and
there's the state that the int3 handler has, but you can *mostly*
think of the two as independent.

In particular, if the int3 handler doesn't ever enable interrupts, and
if it doesn't need to do any stack switches, the int3 handler part
looks "obvious". It can be interrupted by NMI, but it can't be
interrupted (for example) by the cross-cpu IPI.

That was at least the mental model I was going for.

Similarly, the "caller" side mostly looks obvious too. If we don't
take an int3, none of this matter, and if we *do* take an int3, if we
can at least make it "atomic" wrt the rewriter (before or after
state), it should be easy to think about.

One of the things I was thinking of doing, for example, was to simply
make the call emulation do a "load four bytes from the instruction
stream, and just use that as the emulation target offset".

Why would that matter?

We do *not* have very strict guarantees for D$-vs-I$ coherency on x86,
but we *do* have very strict guarantees for D$-vs-D$ coherency. And so
we could use the D$ coherency to give us atomicity guarantees for
loading and storing the instruction offset for instruction emulation,
in ways we can *not* use the D$-to-I$ guarantees and just executing it
directly.

So while we still need those nasty IPI's to guarantee the D$-vs-I$
coherency in the "big picture" model and to get the serialization with
the actual 'int3' exception right, we *could* just do all the other
parts of the instruction emulation using the D$ coherency.

So we could do the actual "call offset" write with a single atomic
4-byte locked cycle (just use "xchg" to write - it's always locked).
And similarly we could do the call offset *read* with a single locked
cycle (cmpxchg with a 0 value, for example). It would be atomic even
if it crosses a cacheline boundary.

And now we'd be guaranteed that on a D$ side, we'd get either the old
value _or_ the new value for the emulation, and never a mix of them.
Which is very much unlike the I$ side guarantees that aren't
documented and people have been worrying about.

Notice? We'd not even have to look up any values.  We'd literally just
do something like

        int offset = locked_atomic_read(ip+1);
        return int3_emulate_call(ip, ip+5+offset);

and it would be *atomic* with respect to whatever other user that
updates the instruction, as long as they update the offset with a
"xchg" instruction.

Wouldn't that be lovely? We'd literally use the emulation as a way to
get the kinds of coherency guarantees that we don't get natively with
the I$.

Except that was _so_ not the case with the whole ftrace_regs_caller()
thing. That whole hack made me go "this doesn't emulate either the old
_or_ the new call". And see above why that would matter. Because a
"let's just make the four bytes after the 'int3' instruction be
atomic" would very much be all about "emulate either the old or the
new call".

So this was why I was then looking at what the heck that code was
doing, and it was making zero sense to me. Emulating something that is
different from what you're actually rewriting means that tricks like
the above don't work.

                  Linus
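
[The "load four bytes after the int3" idea above can be sketched in userspace — a plain memcpy standing in for the locked 4-byte read, and assuming a little-endian machine as on x86:]

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Decode the rel32 of an E8 call at 'ip' and return the branch target:
 * target = ip + 5 + offset, exactly as suggested for the int3 handler.
 * (Userspace sketch: the kernel would read the 4 bytes with a single
 * locked/atomic cycle so it sees either the old or the new offset.) */
uintptr_t call_target(const uint8_t *ip)
{
	int32_t offset;

	memcpy(&offset, ip + 1, sizeof(offset));	/* bytes after 0xE8 */
	return (uintptr_t)ip + 5 + offset;
}
```

Because the handler derives the target from the bytes themselves, it needs no lookup and no shared state with the CPU doing the rewriting.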
Steven Rostedt May 7, 2019, 2:58 a.m. UTC | #61
On Mon, 6 May 2019 19:22:06 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Notice? We'd not even have to look up any values.  We'd literally just
> do something like
> 
>         int offset = locked_atomic_read(ip+1);
>         return int3_emulate_call(ip, ip+5+offset);
> 
> and it would be *atomic* with respect to whatever other user that
> updates the instruction, as long as they update the offset with a
> "xchg" instruction.

Honestly, I'm not really sure what you are trying to do here.

Are you talking about making the update to the code in the int3
handler? And then just returning back to the regs->ip and executing the
new call (not really emulating).

-- Steve
Linus Torvalds May 7, 2019, 3:05 a.m. UTC | #62
On Mon, May 6, 2019 at 7:58 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> > Notice? We'd not even have to look up any values.  We'd literally just
> > do something like
> >
> >         int offset = locked_atomic_read(ip+1);
> >         return int3_emulate_call(ip, ip+5+offset);
> >
> > and it would be *atomic* with respect to whatever other user that
> > updates the instruction, as long as they update the offset with a
> > "xchg" instruction.
>
> Honestly, I'm not really sure what you are trying to do here.
>
> Are you talking about making the update to the code in the int3
> handler?

No. The above would be pretty much the entirety of the ftrace_int3_handler.

It would emulate the call that has had its first byte overwritten by
'int3'. Without doing any lookups of what it was supposed to change
the call to, because it simply depends on what the rewriting code is
doing on another CPU (or on the same CPU - it wouldn't care).

So no need to look up anything, not at int3 time, and not at return
time. It would just emulate the instruction atomically, with no state,
and no need to look up what the 'ip' instruction is at the time.

It could literally just use a single flag: "is ftrace updating call
instructions". Add another flag for the "I'm nop'ing out call
instructions" so that it knows to emulate a jump-over instead. That's
it.

Because all the actual *values* would be entirely determined by the
actual rewriting that is going on independently of the 'int3'
exception.

                   Linus
Steven Rostedt May 7, 2019, 3:21 a.m. UTC | #63
On Mon, 6 May 2019 20:05:24 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:


> It would emulate the call that has had its first byte overwritten by
> 'int3'. Without doing any lookups of what it was supposed to change
> the call to, because it simply depends on what the rewriting code is
> doing on another CPU (or on the same CPU - it wouldn't care).

OK, so this is just about what to have it call.

> 
> So no need to look up anything, not at int3 time, and not at return
> time. It would just emulate the instruction atomically, with no state,
> and no need to look up what the 'ip' instruction is at the time.
> 
> It could literally just use a single flag: "is ftrace updating call
> instructions". Add another flag for the "I'm nop'ing out call
> instructions" so that it knows to emulate a jump-over instead. That's
> it.

Well we have that, and we have to look up the record regardless to know
if this was a ftrace int3 or not (the ftrace_location(ip) does that).
And the record has a counter to # of attached callers. Zero being to
turn it into a nop.

Note, if we are going from nop to call or call to nop, it would need to
read the offset to see if it is a nop (don't want to call with the nop
offset).

> 
> Because all the actual *values* would be entirely determined by the
> actual rewriting that is going on independently of the 'int3'
> exception.

But still, we need to emulate the call, which requires pushing the
return address back onto the stack. I believe that is the part we are
struggling with.

-- Steve
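
[The push/jmp/call emulation being discussed boils down to the following userspace sketch — a fake pt_regs and stack; the real helpers depend on the entry code having left a gap below the exception frame so the virtual push has room:]

```c
#include <assert.h>

#define INT3_INSN_SIZE 1
#define CALL_INSN_SIZE 5

struct fake_regs { unsigned long sp, ip; };

void emulate_push(struct fake_regs *regs, unsigned long val)
{
	regs->sp -= sizeof(unsigned long);
	*(unsigned long *)regs->sp = val;	/* lands in the stack gap */
}

void emulate_jmp(struct fake_regs *regs, unsigned long ip)
{
	regs->ip = ip;
}

void emulate_call(struct fake_regs *regs, unsigned long func)
{
	/* regs->ip points just past the int3; the return address must
	 * point just past the full 5-byte call being emulated. */
	emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
	emulate_jmp(regs, func);
}
```

On iret, execution resumes in func with a return address that brings it back to the instruction after the patched call site.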
Linus Torvalds May 7, 2019, 3:28 a.m. UTC | #64
On Mon, May 6, 2019 at 8:22 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> But still, we need to emulate the call, which requires pushing the
> return address back onto the stack. I believe that is the part we are
> struggling with.

Yes. But I was looking at the ftrace parts because I didn't see the
bug in the low-level x86 side, so...

The x86 int3 faulting code itself looks so *trivially* simple, and it
does work for the trivial test-case too. Which was what made me go
"Hmm, maybe there's timing or something".

But it could obviously also be that the trivial test-case is just too
trivial, and doesn't involve nmi etc etc.

               Linus
Peter Zijlstra May 7, 2019, 8:57 a.m. UTC | #65
On Mon, May 06, 2019 at 10:19:51AM +0200, Peter Zijlstra wrote:
> On Fri, May 03, 2019 at 11:57:22AM -0700, Linus Torvalds wrote:
> > On Fri, May 3, 2019 at 9:21 AM Andy Lutomirski <luto@amacapital.net> wrote:
> > >
> > > So here’s a somewhat nutty suggestion: how about we tweak the 32-bit
> > > entry code to emulate the sane 64-bit frame, not just for int3 but
> > > always?
> > 
> > What would the code actually end up looking like? I don't necessarily
> > object, since that kernel_stack_pointer() thing certainly looks
> > horrible, but honestly, my suggestion to just pass in the 'struct
> > pt_regs' and let the call emulation fix it up would have also worked,
> > and avoided that bug (and who knows what else might be hiding).
> > 
> > I really think that you're now hitting all the special case magic
> > low-level crap that I wanted to avoid.
> 
> This did actually boot on first try; so there must be something horribly
> wrong...
> 
> Now, I know you like that other approach; but I figured I should at
> least show you what this one looks like. Maybe I've been staring at
> entry_32.S too much, but I really don't dislike this.

So this one boots with all of Steve's self-test code enabled.

Yes, it's fairly huge, and it really should be multiple patches. But it
does make a lot of the 32bit code a lot more like the 64bit code.

It also fixes a bunch of bugs in the various trampolines (notably the
EBP frame pointer crud, which was wrong or missing). And all the weird
32bit exceptions in the unwinder can go. However, I still cannot get an
unwind from trace_selftest_test_regs_func() for some reason. Josh?

---
 arch/x86/entry/entry_32.S            | 136 ++++++++++++++++++++++++++++-------
 arch/x86/entry/entry_64.S            |  14 +++-
 arch/x86/include/asm/ptrace.h        |   4 --
 arch/x86/include/asm/text-patching.h |  20 ++++++
 arch/x86/kernel/alternative.c        |  81 +++++++++++++++++++--
 arch/x86/kernel/ftrace_32.S          |  85 +++++++++++++---------
 arch/x86/kernel/kprobes/common.h     |  36 +++++++---
 arch/x86/kernel/kprobes/core.c       |  27 +++----
 arch/x86/kernel/kprobes/opt.c        |  20 +++---
 arch/x86/kernel/ptrace.c             |  29 --------
 arch/x86/kernel/unwind_frame.c       |   8 ---
 kernel/trace/trace_selftest.c        |   3 +
 12 files changed, 323 insertions(+), 140 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 7b23431be5cb..183d0cf5c167 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -67,9 +67,20 @@
 # define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
 # define preempt_stop(clobbers)
-# define resume_kernel		restore_all_kernel
 #endif
 
+.macro RETINT_PREEMPT
+#ifdef CONFIG_PREEMPT
+	DISABLE_INTERRUPTS(CLBR_ANY)
+	cmpl	$0, PER_CPU_VAR(__preempt_count)
+	jnz	.Lend_\@
+	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
+	jz	.Lend_\@
+	call	preempt_schedule_irq
+.Lend_\@:
+#endif
+.endm
+
 .macro TRACE_IRQS_IRET
 #ifdef CONFIG_TRACE_IRQFLAGS
 	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)     # interrupts off?
@@ -203,8 +214,105 @@
 .Lend_\@:
 .endm
 
+#define CS_FROM_ENTRY_STACK	(1 << 31)
+#define CS_FROM_USER_CR3	(1 << 30)
+#define CS_FROM_KERNEL		(1 << 29)
+
+.macro FIXUP_FRAME
+
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, 3*4(%esp)
+	jnz	.Lfrom_usermode_no_fixup_\@
+#endif
+	testl	$SEGMENT_RPL_MASK, 2*4(%esp)
+	jnz	.Lfrom_usermode_no_fixup_\@
+
+	orl	$CS_FROM_KERNEL, 2*4(%esp)
+
+	/*
+	 * When we're here from kernel mode; the (exception) stack looks like:
+	 *
+	 * 4*4(%esp) - <previous context>
+	 * 3*4(%esp) - flags
+	 * 2*4(%esp) - cs
+	 * 1*4(%esp) - ip
+	 * 0*4(%esp) - orig_eax
+	 *
+	 * Lets build a 5 entry IRET frame after that, such that struct pt_regs
+	 * is complete and in particular regs->sp is correct. This gives us
+	 * the original 4 enties as gap:
+	 *
+	 * 10*4(%esp) - <previous context>
+	 *  9*4(%esp) - gap / flags
+	 *  8*4(%esp) - gap / cs
+	 *  7*4(%esp) - gap / ip
+	 *  6*4(%esp) - gap / orig_eax
+	 *  5*4(%esp) - ss
+	 *  4*4(%esp) - sp
+	 *  3*4(%esp) - flags
+	 *  2*4(%esp) - cs
+	 *  1*4(%esp) - ip
+	 *  0*4(%esp) - orig_eax
+	 */
+
+	pushl	%ss		# ss
+	pushl	%esp		# sp (points at ss)
+	addl	$5*4, (%esp)	# point sp back at the previous context
+	pushl	5*4(%esp)	# flags
+	pushl	5*4(%esp)	# cs
+	pushl	5*4(%esp)	# ip
+	pushl	5*4(%esp)	# orig_eax
+
+.Lfrom_usermode_no_fixup_\@:
+.endm
+
+.macro IRET_FRAME
+
+	/* orig_eax is already POP'ed when we're here */
+
+	testl $CS_FROM_KERNEL, 1*4(%esp)
+	jz .Lfinished_frame_\@
+
+	/*
+	 * Reconstruct the 3 entry IRET frame right after the (modified)
+	 * regs->sp without lowering %esp in between, such that an NMI in the
+	 * middle doesn't scribble our stack.
+	 */
+	pushl	%eax
+	pushl	%ecx
+	movl	5*4(%esp), %eax		# (modified) regs->sp
+
+	movl	4*4(%esp), %ecx		# flags
+	movl	%ecx, -4(%eax)
+
+	movl	3*4(%esp), %ecx		# cs
+	andl	$0x0000ffff, %ecx
+	movl	%ecx, -8(%eax)
+
+	movl	2*4(%esp), %ecx		# ip
+	movl	%ecx, -12(%eax)
+
+	movl	1*4(%esp), %ecx		# eax
+	movl	%ecx, -16(%eax)
+
+	popl	%ecx
+	lea	-16(%eax), %esp
+	popl	%eax
+
+.Lfinished_frame_\@:
+.endm
+
 .macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
+
 	cld
+	/*
+	 * The high bits of the CS dword (__csh) are used for CS_FROM_*.
+	 * Clear them in case hardware didn't do this for us.
+	 */
+	andl	$(0x0000ffff), 2*4(%esp)
+
+	FIXUP_FRAME
+
 	PUSH_GS
 	pushl	%fs
 	pushl	%es
@@ -375,9 +483,6 @@
  * switch to it before we do any copying.
  */
 
-#define CS_FROM_ENTRY_STACK	(1 << 31)
-#define CS_FROM_USER_CR3	(1 << 30)
-
 .macro SWITCH_TO_KERNEL_STACK
 
 	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
@@ -391,13 +496,6 @@
 	 * that register for the time this macro runs
 	 */
 
-	/*
-	 * The high bits of the CS dword (__csh) are used for
-	 * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case
-	 * hardware didn't do this for us.
-	 */
-	andl	$(0x0000ffff), PT_CS(%esp)
-
 	/* Are we on the entry stack? Bail out if not! */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
 	addl	$CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx
@@ -755,7 +853,7 @@ END(ret_from_fork)
 	andl	$SEGMENT_RPL_MASK, %eax
 #endif
 	cmpl	$USER_RPL, %eax
-	jb	resume_kernel			# not returning to v8086 or userspace
+	jb	restore_all_kernel		# not returning to v8086 or userspace
 
 ENTRY(resume_userspace)
 	DISABLE_INTERRUPTS(CLBR_ANY)
@@ -765,18 +863,6 @@ ENTRY(resume_userspace)
 	jmp	restore_all
 END(ret_from_exception)
 
-#ifdef CONFIG_PREEMPT
-ENTRY(resume_kernel)
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	cmpl	$0, PER_CPU_VAR(__preempt_count)
-	jnz	restore_all_kernel
-	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
-	jz	restore_all_kernel
-	call	preempt_schedule_irq
-	jmp	restore_all_kernel
-END(resume_kernel)
-#endif
-
 GLOBAL(__begin_SYSENTER_singlestep_region)
 /*
  * All code from here through __end_SYSENTER_singlestep_region is subject
@@ -1019,6 +1105,7 @@ ENTRY(entry_INT80_32)
 	/* Restore user state */
 	RESTORE_REGS pop=4			# skip orig_eax/error_code
 .Lirq_return:
+	IRET_FRAME
 	/*
 	 * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
 	 * when returning from IPI handler and when returning from
@@ -1027,6 +1114,7 @@ ENTRY(entry_INT80_32)
 	INTERRUPT_RETURN
 
 restore_all_kernel:
+	RETINT_PREEMPT
 	TRACE_IRQS_IRET
 	PARANOID_EXIT_TO_KERNEL_MODE
 	BUG_IF_WRONG_CR3
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 20e45d9b4e15..268cd9affe04 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -878,7 +878,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -898,6 +898,16 @@ ENTRY(\sym)
 	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
 
+	.if \create_gap == 1
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jnz	.Lfrom_usermode_no_gap_\@
+	.rept 6
+	pushq	5*8(%rsp)
+	.endr
+	UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+	.endif
+
 	.if \paranoid
 	call	paranoid_entry
 	.else
@@ -1129,7 +1139,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET
-idtentry int3			do_int3			has_error_code=0
+idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 8a7fc0cca2d1..5ff42dc8b396 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -166,14 +166,10 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
 #define compat_user_stack_pointer()	current_pt_regs()->sp
 #endif
 
-#ifdef CONFIG_X86_32
-extern unsigned long kernel_stack_pointer(struct pt_regs *regs);
-#else
 static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
 {
 	return regs->sp;
 }
-#endif
 
 #define GET_IP(regs) ((regs)->ip)
 #define GET_FP(regs) ((regs)->bp)
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index c90678fd391a..6aac6abf931e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -42,4 +42,24 @@ extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
 
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+	regs->sp -= sizeof(unsigned long);
+	*(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+	regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+	int3_emulate_jmp(regs, func);
+}
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 7b9b49dfc05a..8e1fafffb926 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -614,11 +614,83 @@ extern struct paravirt_patch_site __start_parainstructions[],
 	__stop_parainstructions[];
 #endif	/* CONFIG_PARAVIRT */
 
+/*
+ * Self-test for the INT3 based CALL emulation code.
+ *
+ * This exercises int3_emulate_call() to make sure INT3 pt_regs are set up
+ * properly and that there is a stack gap between the INT3 frame and the
+ * previous context. Without this gap doing a virtual PUSH on the interrupted
+ * stack would corrupt the INT3 IRET frame.
+ *
+ * See entry_{32,64}.S for more details.
+ */
+static void __init int3_magic(unsigned int *ptr)
+{
+	*ptr = 1;
+}
+
+extern __initdata unsigned long int3_selftest_ip; /* defined in asm below */
+
+static int __init
+int3_exception_notify(struct notifier_block *self, unsigned long val, void *data)
+{
+	struct die_args *args = data;
+	struct pt_regs *regs = args->regs;
+
+	if (!regs || user_mode(regs))
+		return NOTIFY_DONE;
+
+	if (val != DIE_INT3)
+		return NOTIFY_DONE;
+
+	if (regs->ip - INT3_INSN_SIZE != int3_selftest_ip)
+		return NOTIFY_DONE;
+
+	int3_emulate_call(regs, (unsigned long)&int3_magic);
+	return NOTIFY_STOP;
+}
+
+static void __init int3_selftest(void)
+{
+	static __initdata struct notifier_block int3_exception_nb = {
+		.notifier_call	= int3_exception_notify,
+		.priority	= INT_MAX-1, /* last */
+	};
+	unsigned int val = 0;
+
+	BUG_ON(register_die_notifier(&int3_exception_nb));
+
+	/*
+	 * Basically: int3_magic(&val); but really complicated :-)
+	 *
+	 * Stick the address of the INT3 instruction into int3_selftest_ip,
+	 * then trigger the INT3, padded with NOPs to match a CALL instruction
+	 * length.
+	 */
+	asm volatile ("1: int3; nop; nop; nop; nop\n\t"
+		      ".pushsection .init.data,\"aw\"\n\t"
+		      ".align " __ASM_SEL(4, 8) "\n\t"
+		      ".type int3_selftest_ip, @object\n\t"
+		      ".size int3_selftest_ip, " __ASM_SEL(4, 8) "\n\t"
+		      "int3_selftest_ip:\n\t"
+		      __ASM_SEL(.long, .quad) " 1b\n\t"
+		      ".popsection\n\t"
+		      : : __ASM_SEL_RAW(a, D) (&val) : "memory");
+
+	BUG_ON(val != 1);
+
+	unregister_die_notifier(&int3_exception_nb);
+}
+
 void __init alternative_instructions(void)
 {
-	/* The patching is not fully atomic, so try to avoid local interruptions
-	   that might execute the to be patched code.
-	   Other CPUs are not running. */
+	int3_selftest();
+
+	/*
+	 * The patching is not fully atomic, so try to avoid local
+	 * interruptions that might execute the to be patched code.
+	 * Other CPUs are not running.
+	 */
 	stop_nmi();
 
 	/*
@@ -643,10 +715,11 @@ void __init alternative_instructions(void)
 					    _text, _etext);
 	}
 
-	if (!uniproc_patched || num_possible_cpus() == 1)
+	if (!uniproc_patched || num_possible_cpus() == 1) {
 		free_init_pages("SMP alternatives",
 				(unsigned long)__smp_locks,
 				(unsigned long)__smp_locks_end);
+	}
 #endif
 
 	apply_paravirt(__parainstructions, __parainstructions_end);
diff --git a/arch/x86/kernel/ftrace_32.S b/arch/x86/kernel/ftrace_32.S
index 4c8440de3355..dbf929489b08 100644
--- a/arch/x86/kernel/ftrace_32.S
+++ b/arch/x86/kernel/ftrace_32.S
@@ -9,6 +9,7 @@
 #include <asm/export.h>
 #include <asm/ftrace.h>
 #include <asm/nospec-branch.h>
+#include <asm/asm-offsets.h>
 
 #ifdef CC_USING_FENTRY
 # define function_hook	__fentry__
@@ -104,26 +105,38 @@ END(ftrace_caller)
 
 ENTRY(ftrace_regs_caller)
 	/*
-	 * i386 does not save SS and ESP when coming from kernel.
-	 * Instead, to get sp, &regs->sp is used (see ptrace.h).
-	 * Unfortunately, that means eflags must be at the same location
-	 * as the current return ip is. We move the return ip into the
-	 * regs->ip location, and move flags into the return ip location.
+	 * We're here from an mcount/fentry CALL, and the stack frame looks like:
+	 *
+	 *  <previous context>
+	 *  RET-IP
+	 *
+	 * The purpose of this function is to call out in an emulated INT3
+	 * environment with a stack frame like:
+	 *
+	 *  <previous context>
+	 *  gap
+	 *  gap
+	 *  gap
+	 *  gap
+	 *  pt_regs
+	 *
+	 * We do _NOT_ restore: ss, flags, cs, gs, fs, es, ds
 	 */
-	pushl	$__KERNEL_CS
-	pushl	4(%esp)				/* Save the return ip */
-	pushl	$0				/* Load 0 into orig_ax */
+	subl	$3*4, %esp	# RET-IP + 3 gaps
+	pushl	%ss		# ss
+	pushl	%esp		# points at ss
+	addl	$5*4, (%esp)	#   make it point at <previous context>
+	pushfl			# flags
+	pushl	$__KERNEL_CS	# cs
+	pushl	8*4(%esp)	# ip <- RET-IP
+	pushl	$0		# orig_eax
+
 	pushl	%gs
 	pushl	%fs
 	pushl	%es
 	pushl	%ds
-	pushl	%eax
-
-	/* Get flags and place them into the return ip slot */
-	pushf
-	popl	%eax
-	movl	%eax, 8*4(%esp)
 
+	pushl	%eax
 	pushl	%ebp
 	pushl	%edi
 	pushl	%esi
@@ -131,28 +144,36 @@ ENTRY(ftrace_regs_caller)
 	pushl	%ecx
 	pushl	%ebx
 
-	movl	12*4(%esp), %eax		/* Load ip (1st parameter) */
-	subl	$MCOUNT_INSN_SIZE, %eax		/* Adjust ip */
+	movl	PT_EIP(%esp), %eax	# 1st argument: IP
+	subl	$MCOUNT_INSN_SIZE, %eax
+
 #ifdef CC_USING_FENTRY
-	movl	15*4(%esp), %edx		/* Load parent ip (2nd parameter) */
+	movl	21*4(%esp), %edx	# 2nd argument: parent ip
 #else
-	movl	0x4(%ebp), %edx			/* Load parent ip (2nd parameter) */
+	movl	1*4(%ebp), %edx		# 2nd argument: parent ip
 #endif
-	movl	function_trace_op, %ecx		/* Save ftrace_pos in 3rd parameter */
-	pushl	%esp				/* Save pt_regs as 4th parameter */
+
+#ifdef CONFIG_FRAME_POINTER
+	movl	%esp, %ebp
+	andl	$0x7fffffff, %ebp
+#endif
+
+	movl	function_trace_op, %ecx	# 3rd argument: ftrace_pos
+	pushl	%esp			# 4th argument: pt_regs
 
 GLOBAL(ftrace_regs_call)
 	call	ftrace_stub
 
-	addl	$4, %esp			/* Skip pt_regs */
+	addl	$4, %esp		# skip 4th argument
 
-	/* restore flags */
-	push	14*4(%esp)
-	popf
+	/* place IP below the new SP */
+	movl	PT_OLDESP(%esp), %eax
+	movl	PT_EIP(%esp), %ecx
+	movl	%ecx, -4(%eax)
 
-	/* Move return ip back to its original location */
-	movl	12*4(%esp), %eax
-	movl	%eax, 14*4(%esp)
+	/* place EAX below that */
+	movl	PT_EAX(%esp), %ecx
+	movl	%ecx, -8(%eax)
 
 	popl	%ebx
 	popl	%ecx
@@ -160,16 +181,12 @@ GLOBAL(ftrace_regs_call)
 	popl	%esi
 	popl	%edi
 	popl	%ebp
-	popl	%eax
-	popl	%ds
-	popl	%es
-	popl	%fs
-	popl	%gs
 
-	/* use lea to not affect flags */
-	lea	3*4(%esp), %esp			/* Skip orig_ax, ip and cs */
+	lea	-8(%eax), %esp
+	popl	%eax
 
 	jmp	.Lftrace_ret
+
 #else /* ! CONFIG_DYNAMIC_FTRACE */
 
 ENTRY(function_hook)
diff --git a/arch/x86/kernel/kprobes/common.h b/arch/x86/kernel/kprobes/common.h
index 2b949f4fd4d8..f54b5d666169 100644
--- a/arch/x86/kernel/kprobes/common.h
+++ b/arch/x86/kernel/kprobes/common.h
@@ -6,14 +6,15 @@
 
 #include <asm/asm.h>
 
+#ifdef CONFIG_X86_64
+
 #ifdef CONFIG_FRAME_POINTER
-# define SAVE_RBP_STRING "	push %" _ASM_BP "\n" \
-			 "	mov  %" _ASM_SP ", %" _ASM_BP "\n"
+#define ENCODE_FRAME_POINTER			\
+	"	leaq 1(%rsp), %rbp\n"
 #else
-# define SAVE_RBP_STRING "	push %" _ASM_BP "\n"
+#define ENCODE_FRAME_POINTER
 #endif
 
-#ifdef CONFIG_X86_64
 #define SAVE_REGS_STRING			\
 	/* Skip cs, ip, orig_ax. */		\
 	"	subq $24, %rsp\n"		\
@@ -27,11 +28,13 @@
 	"	pushq %r10\n"			\
 	"	pushq %r11\n"			\
 	"	pushq %rbx\n"			\
-	SAVE_RBP_STRING				\
+	"	pushq %rbp\n"			\
 	"	pushq %r12\n"			\
 	"	pushq %r13\n"			\
 	"	pushq %r14\n"			\
-	"	pushq %r15\n"
+	"	pushq %r15\n"			\
+	ENCODE_FRAME_POINTER
+
 #define RESTORE_REGS_STRING			\
 	"	popq %r15\n"			\
 	"	popq %r14\n"			\
@@ -51,19 +54,30 @@
 	/* Skip orig_ax, ip, cs */		\
 	"	addq $24, %rsp\n"
 #else
+
+#ifdef CONFIG_FRAME_POINTER
+#define ENCODE_FRAME_POINTER			\
+	"	movl %esp, %ebp\n"		\
+	"	andl $0x7fffffff, %ebp\n"
+#else
+#define ENCODE_FRAME_POINTER
+#endif
+
 #define SAVE_REGS_STRING			\
 	/* Skip cs, ip, orig_ax and gs. */	\
-	"	subl $16, %esp\n"		\
+	"	subl $4*4, %esp\n"		\
 	"	pushl %fs\n"			\
 	"	pushl %es\n"			\
 	"	pushl %ds\n"			\
 	"	pushl %eax\n"			\
-	SAVE_RBP_STRING				\
+	"	pushl %ebp\n"			\
 	"	pushl %edi\n"			\
 	"	pushl %esi\n"			\
 	"	pushl %edx\n"			\
 	"	pushl %ecx\n"			\
-	"	pushl %ebx\n"
+	"	pushl %ebx\n"			\
+	ENCODE_FRAME_POINTER
+
 #define RESTORE_REGS_STRING			\
 	"	popl %ebx\n"			\
 	"	popl %ecx\n"			\
@@ -72,8 +86,8 @@
 	"	popl %edi\n"			\
 	"	popl %ebp\n"			\
 	"	popl %eax\n"			\
-	/* Skip ds, es, fs, gs, orig_ax, and ip. Note: don't pop cs here*/\
-	"	addl $24, %esp\n"
+	/* Skip ds, es, fs, gs, orig_ax, ip, and cs. */\
+	"	addl $7*4, %esp\n"
 #endif
 
 /* Ensure if the instruction can be boostable */
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 9e4fa2484d10..28d8ba3b9add 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -731,29 +731,27 @@ asm(
 	".global kretprobe_trampoline\n"
 	".type kretprobe_trampoline, @function\n"
 	"kretprobe_trampoline:\n"
-#ifdef CONFIG_X86_64
 	/* We don't bother saving the ss register */
+#ifdef CONFIG_X86_64
 	"	pushq %rsp\n"
 	"	pushfq\n"
 	SAVE_REGS_STRING
 	"	movq %rsp, %rdi\n"
 	"	call trampoline_handler\n"
 	/* Replace saved sp with true return address. */
-	"	movq %rax, 152(%rsp)\n"
+	"	movq %rax, 19*8(%rsp)\n"
 	RESTORE_REGS_STRING
 	"	popfq\n"
 #else
-	"	pushf\n"
+	"	pushl %esp\n"
+	"	pushfl\n"
 	SAVE_REGS_STRING
 	"	movl %esp, %eax\n"
 	"	call trampoline_handler\n"
-	/* Move flags to cs */
-	"	movl 56(%esp), %edx\n"
-	"	movl %edx, 52(%esp)\n"
-	/* Replace saved flags with true return address. */
-	"	movl %eax, 56(%esp)\n"
+	/* Replace saved sp with true return address. */
+	"	movl %eax, 15*4(%esp)\n"
 	RESTORE_REGS_STRING
-	"	popf\n"
+	"	popfl\n"
 #endif
 	"	ret\n"
 	".size kretprobe_trampoline, .-kretprobe_trampoline\n"
@@ -794,16 +792,13 @@ __used __visible void *trampoline_handler(struct pt_regs *regs)
 	INIT_HLIST_HEAD(&empty_rp);
 	kretprobe_hash_lock(current, &head, &flags);
 	/* fixup registers */
-#ifdef CONFIG_X86_64
 	regs->cs = __KERNEL_CS;
-	/* On x86-64, we use pt_regs->sp for return address holder. */
-	frame_pointer = &regs->sp;
-#else
-	regs->cs = __KERNEL_CS | get_kernel_rpl();
+#ifdef CONFIG_X86_32
+	regs->cs |= get_kernel_rpl();
 	regs->gs = 0;
-	/* On x86-32, we use pt_regs->flags for return address holder. */
-	frame_pointer = &regs->flags;
 #endif
+	/* We use pt_regs->sp for return address holder. */
+	frame_pointer = &regs->sp;
 	regs->ip = trampoline_address;
 	regs->orig_ax = ~0UL;
 
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index f14262952015..c1010207d036 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -115,14 +115,15 @@ asm (
 			"optprobe_template_call:\n"
 			ASM_NOP5
 			/* Move flags to rsp */
-			"	movq 144(%rsp), %rdx\n"
-			"	movq %rdx, 152(%rsp)\n"
+			"	movq 18*8(%rsp), %rdx\n"
+			"	movq %rdx, 19*8(%rsp)\n"
 			RESTORE_REGS_STRING
 			/* Skip flags entry */
 			"	addq $8, %rsp\n"
 			"	popfq\n"
 #else /* CONFIG_X86_32 */
-			"	pushf\n"
+			"	pushl %esp\n"
+			"	pushfl\n"
 			SAVE_REGS_STRING
 			"	movl %esp, %edx\n"
 			".global optprobe_template_val\n"
@@ -131,9 +132,13 @@ asm (
 			".global optprobe_template_call\n"
 			"optprobe_template_call:\n"
 			ASM_NOP5
+			/* Move flags into esp */
+			"	movl 14*4(%esp), %edx\n"
+			"	movl %edx, 15*4(%esp)\n"
 			RESTORE_REGS_STRING
-			"	addl $4, %esp\n"	/* skip cs */
-			"	popf\n"
+			/* Skip flags entry */
+			"	addl $4, %esp\n"
+			"	popfl\n"
 #endif
 			".global optprobe_template_end\n"
 			"optprobe_template_end:\n"
@@ -165,10 +170,9 @@ optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs)
 	} else {
 		struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
 		/* Save skipped registers */
-#ifdef CONFIG_X86_64
 		regs->cs = __KERNEL_CS;
-#else
-		regs->cs = __KERNEL_CS | get_kernel_rpl();
+#ifdef CONFIG_X86_32
+		regs->cs |= get_kernel_rpl();
 		regs->gs = 0;
 #endif
 		regs->ip = (unsigned long)op->kp.addr + INT3_SIZE;
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 4b8ee05dd6ad..d13f892d2c47 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -153,35 +153,6 @@ static inline bool invalid_selector(u16 value)
 
 #define FLAG_MASK		FLAG_MASK_32
 
-/*
- * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
- * when it traps.  The previous stack will be directly underneath the saved
- * registers, and 'sp/ss' won't even have been saved. Thus the '&regs->sp'.
- *
- * Now, if the stack is empty, '&regs->sp' is out of range. In this
- * case we try to take the previous stack. To always return a non-null
- * stack pointer we fall back to regs as stack if no previous stack
- * exists.
- *
- * This is valid only for kernel mode traps.
- */
-unsigned long kernel_stack_pointer(struct pt_regs *regs)
-{
-	unsigned long context = (unsigned long)regs & ~(THREAD_SIZE - 1);
-	unsigned long sp = (unsigned long)&regs->sp;
-	u32 *prev_esp;
-
-	if (context == (sp & ~(THREAD_SIZE - 1)))
-		return sp;
-
-	prev_esp = (u32 *)(context);
-	if (*prev_esp)
-		return (unsigned long)*prev_esp;
-
-	return (unsigned long)regs;
-}
-EXPORT_SYMBOL_GPL(kernel_stack_pointer);
-
 static unsigned long *pt_regs_access(struct pt_regs *regs, unsigned long regno)
 {
 	BUILD_BUG_ON(offsetof(struct pt_regs, bx) != 0);
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index 9b9fd4826e7a..df51040d1689 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -71,10 +71,6 @@ static void unwind_dump(struct unwind_state *state)
 
 static size_t regs_size(struct pt_regs *regs)
 {
-	/* x86_32 regs from kernel mode are two words shorter: */
-	if (IS_ENABLED(CONFIG_X86_32) && !user_mode(regs))
-		return sizeof(*regs) - 2*sizeof(long);
-
 	return sizeof(*regs);
 }
 
@@ -197,11 +193,7 @@ static struct pt_regs *decode_frame_pointer(unsigned long *bp)
 }
 #endif
 
-#ifdef CONFIG_X86_32
-#define KERNEL_REGS_SIZE (sizeof(struct pt_regs) - 2*sizeof(long))
-#else
 #define KERNEL_REGS_SIZE (sizeof(struct pt_regs))
-#endif
 
 static bool update_stack_state(struct unwind_state *state,
 			       unsigned long *next_bp)
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 9d402e7fc949..acfd7a174337 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -548,6 +548,9 @@ static void trace_selftest_test_regs_func(unsigned long ip,
 					  struct ftrace_ops *op,
 					  struct pt_regs *pt_regs)
 {
+
+	WARN(1, "whoomp!\n");
+
 	if (pt_regs)
 		trace_selftest_regs_stat = TRACE_SELFTEST_REGS_FOUND;
 	else
David Laight May 7, 2019, 9:18 a.m. UTC | #66
From: Peter Zijlstra
> Sent: 07 May 2019 09:58
...
> +	/*
> +	 * When we're here from kernel mode; the (exception) stack looks like:
> +	 *
> +	 * 4*4(%esp) - <previous context>
> +	 * 3*4(%esp) - flags
> +	 * 2*4(%esp) - cs
> +	 * 1*4(%esp) - ip
> +	 * 0*4(%esp) - orig_eax

Am I right in thinking that this is the only 'INT3' stack frame that
needs to be 'fiddled' with?
And that the 'emulate a call instruction' has verified that is the case??
So the %cs is always the kernel %cs.

If the 'call target' address is saved in a per-cpu location it ought
to be possible to get the code that returns from the INT3 with the
call target address (or zero) in %ax.
If non-zero, instead of 'pop %ax; iret' execute:
	xchg %eax, 4(%esp)   - swap function address and callers ip
	push 12(%esp)        - old flags
	mov  14(%esp),%eax   - callers address over flags
	popf                 - enables interrupts (etc)
	pop  %eax
	retf                 - Jump to called function and remove %cs

Nothing else needs to be moved.

	David

	

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Peter Zijlstra May 7, 2019, 9:27 a.m. UTC | #67
On Tue, May 07, 2019 at 10:57:53AM +0200, Peter Zijlstra wrote:
> diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
> index 9e4fa2484d10..28d8ba3b9add 100644
> --- a/arch/x86/kernel/kprobes/core.c
> +++ b/arch/x86/kernel/kprobes/core.c
> @@ -731,29 +731,27 @@ asm(
>  	".global kretprobe_trampoline\n"
>  	".type kretprobe_trampoline, @function\n"
>  	"kretprobe_trampoline:\n"
> -#ifdef CONFIG_X86_64
>  	/* We don't bother saving the ss register */
> +#ifdef CONFIG_X86_64
>  	"	pushq %rsp\n"
>  	"	pushfq\n"
>  	SAVE_REGS_STRING
>  	"	movq %rsp, %rdi\n"
>  	"	call trampoline_handler\n"
>  	/* Replace saved sp with true return address. */
> -	"	movq %rax, 152(%rsp)\n"
> +	"	movq %rax, 19*8(%rsp)\n"
>  	RESTORE_REGS_STRING
>  	"	popfq\n"
>  #else
> -	"	pushf\n"
> +	"	pushl %esp\n"
> +	"	pushfl\n"
>  	SAVE_REGS_STRING
>  	"	movl %esp, %eax\n"
>  	"	call trampoline_handler\n"
> -	/* Move flags to cs */
> -	"	movl 56(%esp), %edx\n"
> -	"	movl %edx, 52(%esp)\n"
> -	/* Replace saved flags with true return address. */
> -	"	movl %eax, 56(%esp)\n"
> +	/* Replace saved sp with true return address. */
> +	"	movl %eax, 15*4(%esp)\n"
>  	RESTORE_REGS_STRING
> -	"	popf\n"
> +	"	popfl\n"
>  #endif
>  	"	ret\n"
>  	".size kretprobe_trampoline, .-kretprobe_trampoline\n"

> diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
> index f14262952015..c1010207d036 100644
> --- a/arch/x86/kernel/kprobes/opt.c
> +++ b/arch/x86/kernel/kprobes/opt.c
> @@ -115,14 +115,15 @@ asm (
>  			"optprobe_template_call:\n"
>  			ASM_NOP5
>  			/* Move flags to rsp */
> -			"	movq 144(%rsp), %rdx\n"
> -			"	movq %rdx, 152(%rsp)\n"
> +			"	movq 18*8(%rsp), %rdx\n"
> +			"	movq %rdx, 19*8(%rsp)\n"
>  			RESTORE_REGS_STRING
>  			/* Skip flags entry */
>  			"	addq $8, %rsp\n"
>  			"	popfq\n"
>  #else /* CONFIG_X86_32 */
> -			"	pushf\n"
> +			"	pushl %esp\n"
> +			"	pushfl\n"
>  			SAVE_REGS_STRING
>  			"	movl %esp, %edx\n"
>  			".global optprobe_template_val\n"
> @@ -131,9 +132,13 @@ asm (
>  			".global optprobe_template_call\n"
>  			"optprobe_template_call:\n"
>  			ASM_NOP5
> +			/* Move flags into esp */
> +			"	movl 14*4(%esp), %edx\n"
> +			"	movl %edx, 15*4(%esp)\n"
>  			RESTORE_REGS_STRING
> -			"	addl $4, %esp\n"	/* skip cs */
> -			"	popf\n"
> +			/* Skip flags entry */
> +			"	addl $4, %esp\n"
> +			"	popfl\n"
>  #endif
>  			".global optprobe_template_end\n"
>  			"optprobe_template_end:\n"

FWIW, both these trampolines assume a kprobe will not
int3_emulate_{push/call}(), for both bitnesses.

But then; I'm thinking kprobes should be inspection only and not modify
things. So that might just be good enough.
Peter Zijlstra May 7, 2019, 9:51 a.m. UTC | #68
On Mon, May 06, 2019 at 07:22:06PM -0700, Linus Torvalds wrote:
> We do *not* have very strict guarantees for D$-vs-I$ coherency on x86,
> but we *do* have very strict guarantees for D$-vs-D$ coherency. And so
> we could use the D$ coherency to give us atomicity guarantees for
> loading and storing the instruction offset for instruction emulation,
> in ways we can *not* use the D$-to-I$ guarantees and just executing it
> directly.
> 
> So while we still need those nasty IPI's to guarantee the D$-vs-I$
> coherency in the "big picture" model and to get the serialization with
> the actual 'int3' exception right, we *could* just do all the other
> parts of the instruction emulation using the D$ coherency.
> 
> So we could do the actual "call offset" write with a single atomic
> 4-byte locked cycle (just use "xchg" to write - it's always locked).
> And similarly we could do the call offset *read* with a single locked
> cycle (cmpxchg with a 0 value, for example). It would be atomic even
> if it crosses a cacheline boundary.

Very 'soon', x86 will start to #AC if you do unaligned LOCK prefixed
instructions. The problem is that while aligned LOCK instructions can do
the atomicity with the coherency protocol, unaligned (esp, line or page
boundary crossing ones) needs that bus-lock thing the SDM talks about.

For giggles, write yourself a while(1) loop that XCHGs across a
page-boundary and see what it does to the rest of the system.

So _please_, do not rely on unaligned atomic ops. We really want them to
go the way of the Dodo.
Peter Zijlstra May 7, 2019, 11:30 a.m. UTC | #69
On Tue, May 07, 2019 at 09:18:51AM +0000, David Laight wrote:
> From: Peter Zijlstra
> > Sent: 07 May 2019 09:58
> ...
> > +	/*
> > +	 * When we're here from kernel mode; the (exception) stack looks like:
> > +	 *
> > +	 * 4*4(%esp) - <previous context>
> > +	 * 3*4(%esp) - flags
> > +	 * 2*4(%esp) - cs
> > +	 * 1*4(%esp) - ip
> > +	 * 0*4(%esp) - orig_eax
> 
> Am I right in thinking that this is the only 'INT3' stack frame that
> needs to be 'fiddled' with?
> And that the 'emulate a call instruction' has verified that is the case??
> So the %cs is always the kernel %cs.

Only the INT3 thing needs 'the gap', but the far bigger change here is
that kernel frames now have a complete pt_regs set and all sorts of
horrible crap can go away.

For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes
it is possible to build a 5 entry frame on top of the old 3 entry one,
but why bother...
Steven Rostedt May 7, 2019, 12:27 p.m. UTC | #70
On Tue, 7 May 2019 11:27:31 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> FWIW, both these trampolines assume a kprobe will not
> int3_emulate_{push/call}(), for both bitnesses.
> 
> But then; I'm thinking kprobes should be inspection only and not modify
> things. So that might just be good enough.

I believe there are kprobe calls that do modify things. Note, they can
modify regs->ip. Kprobes sets the FTRACE_OPS_FL_IPMODIFY flag, thus
they can never be put at the same location that is being live patched.

-- Steve
Peter Zijlstra May 7, 2019, 12:41 p.m. UTC | #71
On Tue, May 07, 2019 at 08:27:16AM -0400, Steven Rostedt wrote:
> On Tue, 7 May 2019 11:27:31 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > FWIW, both these trampolines assume a kprobe will not
> > int3_emulate_{push/call}(), for both bitnesses.
> > 
> > But then; I'm thinking kprobes should be inspection only and not modify
> > things. So that might just be good enough.
> 
> I believe there are kprobe calls that do modify things. Note, they can
> modify regs->ip. 

The kprobe pre_handler as used by kretprobes does, and that is indeed
handled by the trampolines.

> Kprobes sets the FTRACE_OPS_FL_IPMODIFY flag, thus
> they can never be put at the same location that is being live patched.

OK, so do we want to allow kprobes that also modify regs->sp ? Because
then we need to change these trampolines a bit.

I'd prefer not to allow kprobes this.
Steven Rostedt May 7, 2019, 12:54 p.m. UTC | #72
On Tue, 7 May 2019 14:41:31 +0200
Peter Zijlstra <peterz@infradead.org> wrote:

> > Kprobes sets the FTRACE_OPS_FL_IPMODIFY flag, thus
> > they can never be put at the same location that is being live patched.  
> 
> OK, so do we want to allow kprobes that also modify regs->sp ? Because
> then we need to change these trampolines a bit.
> 
> I'd prefer not to allow kprobes this.

I believe no kprobe changes sp, because it would have had the same
issues we are trying to solve now. And even though we are changing
things to allow it, it's not a regression to keep kprobes from doing it.

-- Steve
David Laight May 7, 2019, 12:57 p.m. UTC | #73
From: Peter Zijlstra
> Sent: 07 May 2019 12:31
> To: David Laight
> On Tue, May 07, 2019 at 09:18:51AM +0000, David Laight wrote:
> > From: Peter Zijlstra
> > > Sent: 07 May 2019 09:58
> > ...
> > > +	/*
> > > +	 * When we're here from kernel mode; the (exception) stack looks like:
> > > +	 *
> > > +	 * 4*4(%esp) - <previous context>
> > > +	 * 3*4(%esp) - flags
> > > +	 * 2*4(%esp) - cs
> > > +	 * 1*4(%esp) - ip
> > > +	 * 0*4(%esp) - orig_eax
> >
> > Am I right in thinking that this is the only 'INT3' stack frame that
> > needs to be 'fiddled' with?
> > And that the 'emulate a call instruction' has verified that is the case??
> > So the %cs is always the kernel %cs.
> 
> Only the INT3 thing needs 'the gap', but the far bigger change here is
> that kernel frames now have a complete pt_regs set and all sorts of
> horrible crap can go away.

I'm not doubting that generating the 'five register' interrupt stack frame
for faults in kernel space makes life simpler, just suggesting that the
'emulated call' can be done by emulating the 'iret' rather than generating
a gap in the stack.

> For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes
> it is possible to build a 5 entry frame on top of the old 3 entry one,
> but why bother...

Presumably there is 'horrid' code to generate the gap in 64bit mode?
(less horrid than 32bit, but still horrid?)
Or does it copy the entire pt_regs into a local stack frame and use
that for the iret?

I've just tried to parse the pseudo code for IRET in the intel docs.
Does anyone find that readable?
I wonder if you can force 32bit mode to do a stack switch 'iret'
by doing something like a far jump to a different %cs ?

	David

Steven Rostedt May 7, 2019, 1:14 p.m. UTC | #74
On Tue, 7 May 2019 12:57:15 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> > Only the INT3 thing needs 'the gap', but the far bigger change here is
> > that kernel frames now have a complete pt_regs set and all sorts of
> > horrible crap can go away.  
> 
> I'm not doubting that generating the 'five register' interrupt stack frame
> for faults in kernel space makes life simpler just suggesting that the
> 'emulated call' can be done by emulating the 'iret' rather than generating
> a gap in the stack.

But how would the user put something on the stack? I don't see how
emulating an iret helps here. Can you write some pseudo code to explain
what you mean? I also believe the gap is only added for kernel->kernel
entries.

> 
> > For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes
> > it is possible to build a 5 entry frame on top of the old 3 entry one,
> > but why bother...  
> 
> Presumably there is 'horrid' code to generate the gap in 64bit mode?
> (less horrid than 32bit, but still horrid?)
> Or does it copy the entire pt_regs into a local stack frame and use
> that for the iret?

On x86_64, the gap is only done for int3 and nothing else, thus it is
much less horrid. That's because x86_64 has a sane pt_regs storage for
all exceptions.

-- Steve
Peter Zijlstra May 7, 2019, 1:32 p.m. UTC | #75
On Tue, May 07, 2019 at 12:57:15PM +0000, David Laight wrote:

> > Only the INT3 thing needs 'the gap', but the far bigger change here is
> > that kernel frames now have a complete pt_regs set and all sorts of
> > horrible crap can go away.
> 
> I'm not doubting that generating the 'five register' interrupt stack frame
> for faults in kernel space makes life simpler just suggesting that the
> 'emulated call' can be done by emulating the 'iret' rather than generating
> a gap in the stack.

The thing you suggested doesn't actually work, INT3 can nest.

> > For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes
> > it is possible to build a 5 entry frame on top of the old 3 entry one,
> > but why bother...
> 
> Presumably there is 'horrid' code to generate the gap in 64bit mode?
> (less horrid than 32bit, but still horrid?)
> Or does it copy the entire pt_regs into a local stack frame and use
> that for the iret?

It's in the patch you replied to; it is so small you might have
overlooked it. It simply pushes another copy on top of what was already
there.

> I've just tried to parse the pseudo code for IRET in the intel docs.
> Does anyone find that readable?

No; it's abysmal.

> I wonder if you can force 32bit mode to do a stack switch 'iret'
> by doing something like a far jump to a different %cs ?

I don't think that'll work.
Masami Hiramatsu May 7, 2019, 2:13 p.m. UTC | #76
On Mon, 6 May 2019 17:45:11 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> If we go with Peter's patch, I can make this code much more sane, and
> not have to worry about having &regs->sp be at the top of the stack. I
> could simply, just push everything in the order of pt_regs and call the
> handler.

Hi Steve, I need to catch up with the origin of this series, but this seems
also good for optprobes, which do a similar trick on pt_regs. If we can
assume that the int3 pt_regs can have a gap, optprobes can also make a gap,
and it can also be used for storing the destination address.

Thank you,
Peter Zijlstra May 7, 2019, 2:28 p.m. UTC | #77
On Tue, May 07, 2019 at 10:57:53AM +0200, Peter Zijlstra wrote:
> So this one boots with all of Steve's self-test code enabled.

The selftests need improvement; I passed the 'ftrace regs test' but that
trampline was buggered.

> Yes, its fairly huge, and it really should be multiple patches. But it
> does make a lot of the 32bit code a lot more like the 64bit code.
> 
> It also fixes a bunch of bugs in the various trampolines (notably the
> EBP frame pointer crud, which was wrong or missing). And all the weird
> 32bit exceptions in the unwinder can go. However, I still cannot get an
> unwind from trace_selftest_test_regs_func() for some reason. Josh?
> 
> ---

 arch/x86/entry/entry_32.S            | 136 ++++++++++++++++++++++++++++-------
 arch/x86/entry/entry_64.S            |  14 +++-
 arch/x86/include/asm/ptrace.h        |   4 --
 arch/x86/include/asm/text-patching.h |  20 ++++++
 arch/x86/kernel/alternative.c        |  81 +++++++++++++++++++--
 arch/x86/kernel/ftrace.c             |  25 +++++--
 arch/x86/kernel/ftrace_32.S          |  85 +++++++++++++---------
 arch/x86/kernel/kprobes/common.h     |  36 +++++++---
 arch/x86/kernel/kprobes/core.c       |  27 +++----
 arch/x86/kernel/kprobes/opt.c        |  20 +++---
 arch/x86/kernel/ptrace.c             |  29 --------
 arch/x86/kernel/unwind_frame.c       |   8 ---
 kernel/trace/trace_selftest.c        |   3 +
 13 files changed, 343 insertions(+), 145 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 7b23431be5cb..183d0cf5c167 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -67,9 +67,20 @@
 # define preempt_stop(clobbers)	DISABLE_INTERRUPTS(clobbers); TRACE_IRQS_OFF
 #else
 # define preempt_stop(clobbers)
-# define resume_kernel		restore_all_kernel
 #endif
 
+.macro RETINT_PREEMPT
+#ifdef CONFIG_PREEMPT
+	DISABLE_INTERRUPTS(CLBR_ANY)
+	cmpl	$0, PER_CPU_VAR(__preempt_count)
+	jnz	.Lend_\@
+	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
+	jz	.Lend_\@
+	call	preempt_schedule_irq
+.Lend_\@:
+#endif
+.endm
+
 .macro TRACE_IRQS_IRET
 #ifdef CONFIG_TRACE_IRQFLAGS
 	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)     # interrupts off?
@@ -203,8 +214,105 @@
 .Lend_\@:
 .endm
 
+#define CS_FROM_ENTRY_STACK	(1 << 31)
+#define CS_FROM_USER_CR3	(1 << 30)
+#define CS_FROM_KERNEL		(1 << 29)
+
+.macro FIXUP_FRAME
+
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, 3*4(%esp)
+	jnz	.Lfrom_usermode_no_fixup_\@
+#endif
+	testl	$SEGMENT_RPL_MASK, 2*4(%esp)
+	jnz	.Lfrom_usermode_no_fixup_\@
+
+	orl	$CS_FROM_KERNEL, 2*4(%esp)
+
+	/*
+	 * When we're here from kernel mode; the (exception) stack looks like:
+	 *
+	 * 4*4(%esp) - <previous context>
+	 * 3*4(%esp) - flags
+	 * 2*4(%esp) - cs
+	 * 1*4(%esp) - ip
+	 * 0*4(%esp) - orig_eax
+	 *
+	 * Let's build a 5 entry IRET frame after that, such that struct pt_regs
+	 * is complete and in particular regs->sp is correct. This gives us
+	 * the original 4 entries as gap:
+	 *
+	 * 10*4(%esp) - <previous context>
+	 *  9*4(%esp) - gap / flags
+	 *  8*4(%esp) - gap / cs
+	 *  7*4(%esp) - gap / ip
+	 *  6*4(%esp) - gap / orig_eax
+	 *  5*4(%esp) - ss
+	 *  4*4(%esp) - sp
+	 *  3*4(%esp) - flags
+	 *  2*4(%esp) - cs
+	 *  1*4(%esp) - ip
+	 *  0*4(%esp) - orig_eax
+	 */
+
+	pushl	%ss		# ss
+	pushl	%esp		# sp (points at ss)
+	addl	$5*4, (%esp)	# point sp back at the previous context
+	pushl	5*4(%esp)	# flags
+	pushl	5*4(%esp)	# cs
+	pushl	5*4(%esp)	# ip
+	pushl	5*4(%esp)	# orig_eax
+
+.Lfrom_usermode_no_fixup_\@:
+.endm
+
+.macro IRET_FRAME
+
+	/* orig_eax is already POP'ed when we're here */
+
+	testl $CS_FROM_KERNEL, 1*4(%esp)
+	jz .Lfinished_frame_\@
+
+	/*
+	 * Reconstruct the 3 entry IRET frame right after the (modified)
+	 * regs->sp without lowering %esp in between, such that an NMI in the
+	 * middle doesn't scribble our stack.
+	 */
+	pushl	%eax
+	pushl	%ecx
+	movl	5*4(%esp), %eax		# (modified) regs->sp
+
+	movl	4*4(%esp), %ecx		# flags
+	movl	%ecx, -4(%eax)
+
+	movl	3*4(%esp), %ecx		# cs
+	andl	$0x0000ffff, %ecx
+	movl	%ecx, -8(%eax)
+
+	movl	2*4(%esp), %ecx		# ip
+	movl	%ecx, -12(%eax)
+
+	movl	1*4(%esp), %ecx		# eax
+	movl	%ecx, -16(%eax)
+
+	popl	%ecx
+	lea	-16(%eax), %esp
+	popl	%eax
+
+.Lfinished_frame_\@:
+.endm
+
 .macro SAVE_ALL pt_regs_ax=%eax switch_stacks=0
+
 	cld
+	/*
+	 * The high bits of the CS dword (__csh) are used for CS_FROM_*.
+	 * Clear them in case hardware didn't do this for us.
+	 */
+	andl	$(0x0000ffff), 2*4(%esp)
+
+	FIXUP_FRAME
+
 	PUSH_GS
 	pushl	%fs
 	pushl	%es
@@ -375,9 +483,6 @@
  * switch to it before we do any copying.
  */
 
-#define CS_FROM_ENTRY_STACK	(1 << 31)
-#define CS_FROM_USER_CR3	(1 << 30)
-
 .macro SWITCH_TO_KERNEL_STACK
 
 	ALTERNATIVE     "", "jmp .Lend_\@", X86_FEATURE_XENPV
@@ -391,13 +496,6 @@
 	 * that register for the time this macro runs
 	 */
 
-	/*
-	 * The high bits of the CS dword (__csh) are used for
-	 * CS_FROM_ENTRY_STACK and CS_FROM_USER_CR3. Clear them in case
-	 * hardware didn't do this for us.
-	 */
-	andl	$(0x0000ffff), PT_CS(%esp)
-
 	/* Are we on the entry stack? Bail out if not! */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
 	addl	$CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx
@@ -755,7 +853,7 @@ END(ret_from_fork)
 	andl	$SEGMENT_RPL_MASK, %eax
 #endif
 	cmpl	$USER_RPL, %eax
-	jb	resume_kernel			# not returning to v8086 or userspace
+	jb	restore_all_kernel		# not returning to v8086 or userspace
 
 ENTRY(resume_userspace)
 	DISABLE_INTERRUPTS(CLBR_ANY)
@@ -765,18 +863,6 @@ ENTRY(resume_userspace)
 	jmp	restore_all
 END(ret_from_exception)
 
-#ifdef CONFIG_PREEMPT
-ENTRY(resume_kernel)
-	DISABLE_INTERRUPTS(CLBR_ANY)
-	cmpl	$0, PER_CPU_VAR(__preempt_count)
-	jnz	restore_all_kernel
-	testl	$X86_EFLAGS_IF, PT_EFLAGS(%esp)	# interrupts off (exception path) ?
-	jz	restore_all_kernel
-	call	preempt_schedule_irq
-	jmp	restore_all_kernel
-END(resume_kernel)
-#endif
-
 GLOBAL(__begin_SYSENTER_singlestep_region)
 /*
  * All code from here through __end_SYSENTER_singlestep_region is subject
@@ -1019,6 +1105,7 @@ ENTRY(entry_INT80_32)
 	/* Restore user state */
 	RESTORE_REGS pop=4			# skip orig_eax/error_code
 .Lirq_return:
+	IRET_FRAME
 	/*
 	 * ARCH_HAS_MEMBARRIER_SYNC_CORE rely on IRET core serialization
 	 * when returning from IPI handler and when returning from
@@ -1027,6 +1114,7 @@ ENTRY(entry_INT80_32)
 	INTERRUPT_RETURN
 
 restore_all_kernel:
+	RETINT_PREEMPT
 	TRACE_IRQS_IRET
 	PARANOID_EXIT_TO_KERNEL_MODE
 	BUG_IF_WRONG_CR3
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 20e45d9b4e15..268cd9affe04 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -878,7 +878,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 ist_offset=0 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -898,6 +898,16 @@ ENTRY(\sym)
 	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
 
+	.if \create_gap == 1
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jnz	.Lfrom_usermode_no_gap_\@
+	.rept 6
+	pushq	5*8(%rsp)
+	.endr
+	UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+	.endif
+
 	.if \paranoid
 	call	paranoid_entry
 	.else
@@ -1129,7 +1139,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=IST_INDEX_DB ist_offset=DB_STACK_OFFSET
-idtentry int3			do_int3			has_error_code=0
+idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
diff --git a/arch/x86/include/asm/ptrace.h b/arch/x86/include/asm/ptrace.h
index 8a7fc0cca2d1..5ff42dc8b396 100644
--- a/arch/x86/include/asm/ptrace.h
+++ b/arch/x86/include/asm/ptrace.h
@@ -166,14 +166,10 @@ static inline bool user_64bit_mode(struct pt_regs *regs)
 #define compat_user_stack_pointer()	current_pt_regs()->sp
 #endif
 
-#ifdef CONFIG_X86_32
-extern unsigned long kernel_stack_pointer(struct pt_regs *regs);
-#else
 static inline unsigned long kernel_stack_pointer(struct pt_regs *regs)
 {
 	return regs->sp;
 }
-#endif
 
 #define GET_IP(regs) ((regs)->ip)
 #define GET_FP(regs) ((regs)->bp)
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index c90678fd391a..6aac6abf931e 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -42,4 +42,24 @@ extern int after_bootmem;
 extern __ro_after_init struct mm_struct *poking_mm;
 extern __ro_after_init unsigned long poking_addr;
 
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+	regs->sp -= sizeof(unsigned long);
+	*(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+	regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+	int3_emulate_jmp(regs, func);
+}
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/alternative.c b/arch/x86/kernel/alternative.c
index 7b9b49dfc05a..8e1fafffb926 100644
--- a/arch/x86/kernel/alternative.c
+++ b/arch/x86/kernel/alternative.c
@@ -614,11 +614,83 @@ extern struct paravirt_patch_site __start_parainstructions[],
 	__stop_parainstructions[];
 #endif	/* CONFIG_PARAVIRT */
 
+/*
+ * Self-test for the INT3 based CALL emulation code.
+ *
+ * This exercises int3_emulate_call() to make sure INT3 pt_regs are set up
+ * properly and that there is a stack gap between the INT3 frame and the
+ * previous context. Without this gap doing a virtual PUSH on the interrupted
+ * stack would corrupt the INT3 IRET frame.
+ *
+ * See entry_{32,64}.S for more details.
+ */
+static void __init int3_magic(unsigned int *ptr)
+{
+	*ptr = 1;
+}
+
+extern __initdata unsigned long int3_selftest_ip; /* defined in asm below */
+
+static int __init
+int3_exception_notify(struct notifier_block *self, unsigned long val, void *data)
+{
+	struct die_args *args = data;
+	struct pt_regs *regs = args->regs;
+
+	if (!regs || user_mode(regs))
+		return NOTIFY_DONE;
+
+	if (val != DIE_INT3)
+		return NOTIFY_DONE;
+
+	if (regs->ip - INT3_INSN_SIZE != int3_selftest_ip)
+		return NOTIFY_DONE;
+
+	int3_emulate_call(regs, (unsigned long)&int3_magic);
+	return NOTIFY_STOP;
+}
+
+static void __init int3_selftest(void)
+{
+	static __initdata struct notifier_block int3_exception_nb = {
+		.notifier_call	= int3_exception_notify,
+		.priority	= INT_MAX-1, /* last */
+	};
+	unsigned int val = 0;
+
+	BUG_ON(register_die_notifier(&int3_exception_nb));
+
+	/*
+	 * Basically: int3_magic(&val); but really complicated :-)
+	 *
+	 * Stick the address of the INT3 instruction into int3_selftest_ip,
+	 * then trigger the INT3, padded with NOPs to match a CALL instruction
+	 * length.
+	 */
+	asm volatile ("1: int3; nop; nop; nop; nop\n\t"
+		      ".pushsection .init.data,\"aw\"\n\t"
+		      ".align " __ASM_SEL(4, 8) "\n\t"
+		      ".type int3_selftest_ip, @object\n\t"
+		      ".size int3_selftest_ip, " __ASM_SEL(4, 8) "\n\t"
+		      "int3_selftest_ip:\n\t"
+		      __ASM_SEL(.long, .quad) " 1b\n\t"
+		      ".popsection\n\t"
+		      : : __ASM_SEL_RAW(a, D) (&val) : "memory");
+
+	BUG_ON(val != 1);
+
+	unregister_die_notifier(&int3_exception_nb);
+}
+
 void __init alternative_instructions(void)
 {
-	/* The patching is not fully atomic, so try to avoid local interruptions
-	   that might execute the to be patched code.
-	   Other CPUs are not running. */
+	int3_selftest();
+
+	/*
+	 * The patching is not fully atomic, so try to avoid local
+	 * interruptions that might execute the to be patched code.
+	 * Other CPUs are not running.
+	 */
 	stop_nmi();
 
 	/*
@@ -643,10 +715,11 @@ void __init alternative_instructions(void)
 					    _text, _etext);
 	}
 
-	if (!uniproc_patched || num_possible_cpus() == 1)
+	if (!uniproc_patched || num_possible_cpus() == 1) {
 		free_init_pages("SMP alternatives",
 				(unsigned long)__smp_locks,
 				(unsigned long)__smp_locks_end);
+	}
 #endif
 
 	apply_paravirt(__parainstructions, __parainstructions_end);
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index 0caf8122d680..a4eea7bad4a1 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
 #include <asm/kprobes.h>
 #include <asm/ftrace.h>
 #include <asm/nops.h>
+#include <asm/text-patching.h>
 
 #ifdef CONFIG_DYNAMIC_FTRACE
 
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
 }
 
 static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
 
 static int update_ftrace_func(unsigned long ip, void *new)
 {
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 	unsigned char *new;
 	int ret;
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
 
@@ -294,13 +298,21 @@ int ftrace_int3_handler(struct pt_regs *regs)
 	if (WARN_ON_ONCE(!regs))
 		return 0;
 
-	ip = regs->ip - 1;
-	if (!ftrace_location(ip) && !is_ftrace_caller(ip))
-		return 0;
+	ip = regs->ip - INT3_INSN_SIZE;
 
-	regs->ip += MCOUNT_INSN_SIZE - 1;
+	if (ftrace_location(ip)) {
+		int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+		return 1;
+	} else if (is_ftrace_caller(ip)) {
+		if (!ftrace_update_func_call) {
+			int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+			return 1;
+		}
+		int3_emulate_call(regs, ftrace_update_func_call);
+		return 1;
+	}
 
-	return 1;
+	return 0;
 }
 NOKPROBE_SYMBOL(ftrace_int3_handler);
 
@@ -865,6 +877,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
 
 	func = ftrace_ops_get_func(ops);
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	/* Do a safe modify in case the trampoline is executing */
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
@@ -966,6 +980,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
 {
 	unsigned char *new;
 
+	ftrace_update_func_call = 0UL;
 	new = ftrace_jmp_replace(ip, (unsigned long)func);
 
 	return update_ftrace_func(ip, new);
diff --git a/arch/x86/kernel/ftrace_32.S b/arch/x86/kernel/ftrace_32.S
index 4c8440de3355..6ff9911adbb7 100644
--- a/arch/x86/kernel/ftrace_32.S
+++ b/arch/x86/kernel/ftrace_32.S
@@ -9,6 +9,7 @@
 #include <asm/export.h>
 #include <asm/ftrace.h>
 #include <asm/nospec-branch.h>
+#include <asm/asm-offsets.h>
 
 #ifdef CC_USING_FENTRY
 # define function_hook	__fentry__
@@ -104,26 +105,38 @@ END(ftrace_caller)
 
 ENTRY(ftrace_regs_caller)
 	/*
-	 * i386 does not save SS and ESP when coming from kernel.
-	 * Instead, to get sp, &regs->sp is used (see ptrace.h).
-	 * Unfortunately, that means eflags must be at the same location
-	 * as the current return ip is. We move the return ip into the
-	 * regs->ip location, and move flags into the return ip location.
+	 * We're here from an mcount/fentry CALL, and the stack frame looks like:
+	 *
+	 *  <previous context>
+	 *  RET-IP
+	 *
+	 * The purpose of this function is to call out in an emulated INT3
+	 * environment with a stack frame like:
+	 *
+	 *  <previous context>
+	 *  gap / RET-IP
+	 *  gap
+	 *  gap
+	 *  gap
+	 *  pt_regs
+	 *
+	 * We do _NOT_ restore: ss, flags, cs, gs, fs, es, ds
 	 */
-	pushl	$__KERNEL_CS
-	pushl	4(%esp)				/* Save the return ip */
-	pushl	$0				/* Load 0 into orig_ax */
+	subl	$3*4, %esp	# RET-IP + 3 gaps
+	pushl	%ss		# ss
+	pushl	%esp		# points at ss
+	addl	$5*4, (%esp)	#   make it point at <previous context>
+	pushfl			# flags
+	pushl	$__KERNEL_CS	# cs
+	pushl	7*4(%esp)	# ip <- RET-IP
+	pushl	$0		# orig_eax
+
 	pushl	%gs
 	pushl	%fs
 	pushl	%es
 	pushl	%ds
-	pushl	%eax
-
-	/* Get flags and place them into the return ip slot */
-	pushf
-	popl	%eax
-	movl	%eax, 8*4(%esp)
 
+	pushl	%eax
 	pushl	%ebp
 	pushl	%edi
 	pushl	%esi
@@ -131,28 +144,36 @@ ENTRY(ftrace_regs_caller)
 	pushl	%ecx
 	pushl	%ebx
 
-	movl	12*4(%esp), %eax		/* Load ip (1st parameter) */
-	subl	$MCOUNT_INSN_SIZE, %eax		/* Adjust ip */
+	movl	PT_EIP(%esp), %eax	# 1st argument: IP
+	subl	$MCOUNT_INSN_SIZE, %eax
+
 #ifdef CC_USING_FENTRY
-	movl	15*4(%esp), %edx		/* Load parent ip (2nd parameter) */
+	movl	21*4(%esp), %edx	# 2nd argument: parent ip
 #else
-	movl	0x4(%ebp), %edx			/* Load parent ip (2nd parameter) */
+	movl	1*4(%ebp), %edx		# 2nd argument: parent ip
 #endif
-	movl	function_trace_op, %ecx		/* Save ftrace_pos in 3rd parameter */
-	pushl	%esp				/* Save pt_regs as 4th parameter */
+
+#ifdef CONFIG_FRAME_POINTER
+	movl	%esp, %ebp
+	andl	$0x7fffffff, %ebp
+#endif
+
+	movl	function_trace_op, %ecx	# 3rd argument: ftrace_pos
+	pushl	%esp			# 4th argument: pt_regs
 
 GLOBAL(ftrace_regs_call)
 	call	ftrace_stub
 
-	addl	$4, %esp			/* Skip pt_regs */
+	addl	$4, %esp		# skip 4th argument
 
-	/* restore flags */
-	push	14*4(%esp)
-	popf
+	/* place IP below the new SP */
+	movl	PT_OLDESP(%esp), %eax
+	movl	PT_EIP(%esp), %ecx
+	movl	%ecx, -4(%eax)
 
-	/* Move return ip back to its original location */
-	movl	12*4(%esp), %eax
-	movl	%eax, 14*4(%esp)
+	/* place EAX below that */
+	movl	PT_EAX(%esp), %ecx
+	movl	%ecx, -8(%eax)
 
 	popl	%ebx
 	popl	%ecx
@@ -160,16 +181,12 @@ GLOBAL(ftrace_regs_call)
 	popl	%esi
 	popl	%edi
 	popl	%ebp
-	popl	%eax
-	popl	%ds
-	popl	%es
-	popl	%fs
-	popl	%gs
 
-	/* use lea to not affect flags */
-	lea	3*4(%esp), %esp			/* Skip orig_ax, ip and cs */
+	lea	-8(%eax), %esp
+	popl	%eax
 
 	jmp	.Lftrace_ret
+
 #else /* ! CONFIG_DYNAMIC_FTRACE */
 
 ENTRY(function_hook)
diff --git a/arch/x86/kernel/kprobes/common.h b/arch/x86/kernel/kprobes/common.h
index 2b949f4fd4d8..f54b5d666169 100644
--- a/arch/x86/kernel/kprobes/common.h
+++ b/arch/x86/kernel/kprobes/common.h
@@ -6,14 +6,15 @@
 
 #include <asm/asm.h>
 
+#ifdef CONFIG_X86_64
+
 #ifdef CONFIG_FRAME_POINTER
-# define SAVE_RBP_STRING "	push %" _ASM_BP "\n" \
-			 "	mov  %" _ASM_SP ", %" _ASM_BP "\n"
+#define ENCODE_FRAME_POINTER			\
+	"	leaq 1(%rsp), %rbp\n"
 #else
-# define SAVE_RBP_STRING "	push %" _ASM_BP "\n"
+#define ENCODE_FRAME_POINTER
 #endif
 
-#ifdef CONFIG_X86_64
 #define SAVE_REGS_STRING			\
 	/* Skip cs, ip, orig_ax. */		\
 	"	subq $24, %rsp\n"		\
@@ -27,11 +28,13 @@
 	"	pushq %r10\n"			\
 	"	pushq %r11\n"			\
 	"	pushq %rbx\n"			\
-	SAVE_RBP_STRING				\
+	"	pushq %rbp\n"			\
 	"	pushq %r12\n"			\
 	"	pushq %r13\n"			\
 	"	pushq %r14\n"			\
-	"	pushq %r15\n"
+	"	pushq %r15\n"			\
+	ENCODE_FRAME_POINTER
+
 #define RESTORE_REGS_STRING			\
 	"	popq %r15\n"			\
 	"	popq %r14\n"			\
@@ -51,19 +54,30 @@
 	/* Skip orig_ax, ip, cs */		\
 	"	addq $24, %rsp\n"
 #else
+
+#ifdef CONFIG_FRAME_POINTER
+#define ENCODE_FRAME_POINTER 			\
+	"	movl %esp, %ebp\n"		\
+	"	andl $0x7fffffff, %ebp\n"
+#else
+#define ENCODE_FRAME_POINTER
+#endif
+
 #define SAVE_REGS_STRING			\
 	/* Skip cs, ip, orig_ax and gs. */	\
-	"	subl $16, %esp\n"		\
+	"	subl $4*4, %esp\n"		\
 	"	pushl %fs\n"			\
 	"	pushl %es\n"			\
 	"	pushl %ds\n"			\
 	"	pushl %eax\n"			\
-	SAVE_RBP_STRING				\
+	"	pushl %ebp\n"			\
 	"	pushl %edi\n"			\
 	"	pushl %esi\n"			\
 	"	pushl %edx\n"			\
 	"	pushl %ecx\n"			\
-	"	pushl %ebx\n"
+	"	pushl %ebx\n"			\
+	ENCODE_FRAME_POINTER
+
 #define RESTORE_REGS_STRING			\
 	"	popl %ebx\n"			\
 	"	popl %ecx\n"			\
@@ -72,8 +86,8 @@
 	"	popl %edi\n"			\
 	"	popl %ebp\n"			\
 	"	popl %eax\n"			\
-	/* Skip ds, es, fs, gs, orig_ax, and ip. Note: don't pop cs here*/\
-	"	addl $24, %esp\n"
+	/* Skip ds, es, fs, gs, orig_ax, ip, and cs. */\
+	"	addl $7*4, %esp\n"
 #endif
 
 /* Ensure if the instruction can be boostable */
diff --git a/arch/x86/kernel/kprobes/core.c b/arch/x86/kernel/kprobes/core.c
index 9e4fa2484d10..28d8ba3b9add 100644
--- a/arch/x86/kernel/kprobes/core.c
+++ b/arch/x86/kernel/kprobes/core.c
@@ -731,29 +731,27 @@ asm(
 	".global kretprobe_trampoline\n"
 	".type kretprobe_trampoline, @function\n"
 	"kretprobe_trampoline:\n"
-#ifdef CONFIG_X86_64
 	/* We don't bother saving the ss register */
+#ifdef CONFIG_X86_64
 	"	pushq %rsp\n"
 	"	pushfq\n"
 	SAVE_REGS_STRING
 	"	movq %rsp, %rdi\n"
 	"	call trampoline_handler\n"
 	/* Replace saved sp with true return address. */
-	"	movq %rax, 152(%rsp)\n"
+	"	movq %rax, 19*8(%rsp)\n"
 	RESTORE_REGS_STRING
 	"	popfq\n"
 #else
-	"	pushf\n"
+	"	pushl %esp\n"
+	"	pushfl\n"
 	SAVE_REGS_STRING
 	"	movl %esp, %eax\n"
 	"	call trampoline_handler\n"
-	/* Move flags to cs */
-	"	movl 56(%esp), %edx\n"
-	"	movl %edx, 52(%esp)\n"
-	/* Replace saved flags with true return address. */
-	"	movl %eax, 56(%esp)\n"
+	/* Replace saved sp with true return address. */
+	"	movl %eax, 15*4(%esp)\n"
 	RESTORE_REGS_STRING
-	"	popf\n"
+	"	popfl\n"
 #endif
 	"	ret\n"
 	".size kretprobe_trampoline, .-kretprobe_trampoline\n"
@@ -794,16 +792,13 @@ __used __visible void *trampoline_handler(struct pt_regs *regs)
 	INIT_HLIST_HEAD(&empty_rp);
 	kretprobe_hash_lock(current, &head, &flags);
 	/* fixup registers */
-#ifdef CONFIG_X86_64
 	regs->cs = __KERNEL_CS;
-	/* On x86-64, we use pt_regs->sp for return address holder. */
-	frame_pointer = &regs->sp;
-#else
-	regs->cs = __KERNEL_CS | get_kernel_rpl();
+#ifdef CONFIG_X86_32
+	regs->cs |= get_kernel_rpl();
 	regs->gs = 0;
-	/* On x86-32, we use pt_regs->flags for return address holder. */
-	frame_pointer = &regs->flags;
 #endif
+	/* We use pt_regs->sp for return address holder. */
+	frame_pointer = &regs->sp;
 	regs->ip = trampoline_address;
 	regs->orig_ax = ~0UL;
 
diff --git a/arch/x86/kernel/kprobes/opt.c b/arch/x86/kernel/kprobes/opt.c
index f14262952015..c1010207d036 100644
--- a/arch/x86/kernel/kprobes/opt.c
+++ b/arch/x86/kernel/kprobes/opt.c
@@ -115,14 +115,15 @@ asm (
 			"optprobe_template_call:\n"
 			ASM_NOP5
 			/* Move flags to rsp */
-			"	movq 144(%rsp), %rdx\n"
-			"	movq %rdx, 152(%rsp)\n"
+			"	movq 18*8(%rsp), %rdx\n"
+			"	movq %rdx, 19*8(%rsp)\n"
 			RESTORE_REGS_STRING
 			/* Skip flags entry */
 			"	addq $8, %rsp\n"
 			"	popfq\n"
 #else /* CONFIG_X86_32 */
-			"	pushf\n"
+			"	pushl %esp\n"
+			"	pushfl\n"
 			SAVE_REGS_STRING
 			"	movl %esp, %edx\n"
 			".global optprobe_template_val\n"
@@ -131,9 +132,13 @@ asm (
 			".global optprobe_template_call\n"
 			"optprobe_template_call:\n"
 			ASM_NOP5
+			/* Move flags into esp */
+			"	movl 14*4(%esp), %edx\n"
+			"	movl %edx, 15*4(%esp)\n"
 			RESTORE_REGS_STRING
-			"	addl $4, %esp\n"	/* skip cs */
-			"	popf\n"
+			/* Skip flags entry */
+			"	addl $4, %esp\n"
+			"	popfl\n"
 #endif
 			".global optprobe_template_end\n"
 			"optprobe_template_end:\n"
@@ -165,10 +170,9 @@ optimized_callback(struct optimized_kprobe *op, struct pt_regs *regs)
 	} else {
 		struct kprobe_ctlblk *kcb = get_kprobe_ctlblk();
 		/* Save skipped registers */
-#ifdef CONFIG_X86_64
 		regs->cs = __KERNEL_CS;
-#else
-		regs->cs = __KERNEL_CS | get_kernel_rpl();
+#ifdef CONFIG_X86_32
+		regs->cs |= get_kernel_rpl();
 		regs->gs = 0;
 #endif
 		regs->ip = (unsigned long)op->kp.addr + INT3_SIZE;
diff --git a/arch/x86/kernel/ptrace.c b/arch/x86/kernel/ptrace.c
index 4b8ee05dd6ad..d13f892d2c47 100644
--- a/arch/x86/kernel/ptrace.c
+++ b/arch/x86/kernel/ptrace.c
@@ -153,35 +153,6 @@ static inline bool invalid_selector(u16 value)
 
 #define FLAG_MASK		FLAG_MASK_32
 
-/*
- * X86_32 CPUs don't save ss and esp if the CPU is already in kernel mode
- * when it traps.  The previous stack will be directly underneath the saved
- * registers, and 'sp/ss' won't even have been saved. Thus the '&regs->sp'.
- *
- * Now, if the stack is empty, '&regs->sp' is out of range. In this
- * case we try to take the previous stack. To always return a non-null
- * stack pointer we fall back to regs as stack if no previous stack
- * exists.
- *
- * This is valid only for kernel mode traps.
- */
-unsigned long kernel_stack_pointer(struct pt_regs *regs)
-{
-	unsigned long context = (unsigned long)regs & ~(THREAD_SIZE - 1);
-	unsigned long sp = (unsigned long)&regs->sp;
-	u32 *prev_esp;
-
-	if (context == (sp & ~(THREAD_SIZE - 1)))
-		return sp;
-
-	prev_esp = (u32 *)(context);
-	if (*prev_esp)
-		return (unsigned long)*prev_esp;
-
-	return (unsigned long)regs;
-}
-EXPORT_SYMBOL_GPL(kernel_stack_pointer);
-
 static unsigned long *pt_regs_access(struct pt_regs *regs, unsigned long regno)
 {
 	BUILD_BUG_ON(offsetof(struct pt_regs, bx) != 0);
diff --git a/arch/x86/kernel/unwind_frame.c b/arch/x86/kernel/unwind_frame.c
index 9b9fd4826e7a..df51040d1689 100644
--- a/arch/x86/kernel/unwind_frame.c
+++ b/arch/x86/kernel/unwind_frame.c
@@ -71,10 +71,6 @@ static void unwind_dump(struct unwind_state *state)
 
 static size_t regs_size(struct pt_regs *regs)
 {
-	/* x86_32 regs from kernel mode are two words shorter: */
-	if (IS_ENABLED(CONFIG_X86_32) && !user_mode(regs))
-		return sizeof(*regs) - 2*sizeof(long);
-
 	return sizeof(*regs);
 }
 
@@ -197,11 +193,7 @@ static struct pt_regs *decode_frame_pointer(unsigned long *bp)
 }
 #endif
 
-#ifdef CONFIG_X86_32
-#define KERNEL_REGS_SIZE (sizeof(struct pt_regs) - 2*sizeof(long))
-#else
 #define KERNEL_REGS_SIZE (sizeof(struct pt_regs))
-#endif
 
 static bool update_stack_state(struct unwind_state *state,
 			       unsigned long *next_bp)
diff --git a/kernel/trace/trace_selftest.c b/kernel/trace/trace_selftest.c
index 9d402e7fc949..acfd7a174337 100644
--- a/kernel/trace/trace_selftest.c
+++ b/kernel/trace/trace_selftest.c
@@ -548,6 +548,9 @@ static void trace_selftest_test_regs_func(unsigned long ip,
 					  struct ftrace_ops *op,
 					  struct pt_regs *pt_regs)
 {
+
+	WARN(1, "whoomp!\n");
+
 	if (pt_regs)
 		trace_selftest_regs_stat = TRACE_SELFTEST_REGS_FOUND;
 	else
Andy Lutomirski May 7, 2019, 2:48 p.m. UTC | #78
>> On May 6, 2019, at 7:22 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> 
>> On Mon, May 6, 2019 at 6:53 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>> 
>> Also, I figured just calling ftrace_regs_caller() was simpler than
>> having that int3 handler do the hash lookups to determine what handler
>> it needs to call.
> 
> So what got me looking at this - and the races (that didn't turn out
> to be races) - and why I really hate it, is because of the whole
> notion of "atomic state".
> 
> Running an "int3" handler (when the int3 is in the kernel) is in some
> sense "atomic". There is the state in the caller, of course, and
> there's the state that the int3 handler has, but you can *mostly*
> think of the two as independent.
> 
> In particular, if the int3 handler doesn't ever enable interrupts, and
> if it doesn't need to do any stack switches, the int3 handler part
> looks "obvious". It can be interrupted by NMI, but it can't be
> interrupted (for example) by the cross-cpu IPI.
> 
> That was at least the mental model I was going for.
> 
> Similarly, the "caller" side mostly looks obvious too. If we don't
> take an int3, none of this matter, and if we *do* take an int3, if we
> can at least make it "atomic" wrt the rewriter (before or after
> state), it should be easy to think about.
> 
> One of the things I was thinking of doing, for example, was to simply
> make the call emulation do a "load four bytes from the instruction
> stream, and just use that as the emulation target offset".
> 
> Why would that matter?
> 
> We do *not* have very strict guarantees for D$-vs-I$ coherency on x86,
> but we *do* have very strict guarantees for D$-vs-D$ coherency. And so
> we could use the D$ coherency to give us atomicity guarantees for
> loading and storing the instruction offset for instruction emulation,
> in ways we can *not* use the D$-to-I$ guarantees and just executing it
> directly.
> 
> So while we still need those nasty IPI's to guarantee the D$-vs-I$
> coherency in the "big picture" model and to get the serialization with
> the actual 'int3' exception right, we *could* just do all the other
> parts of the instruction emulation using the D$ coherency.
> 
> So we could do the actual "call offset" write with a single atomic
> 4-byte locked cycle (just use "xchg" to write - it's always locked).
> And similarly we could do the call offset *read* with a single locked
> cycle (cmpxchg with a 0 value, for example). It would be atomic even
> if it crosses a cacheline boundary.

I don’t quite get how this could work.  Suppose we start with a five-byte NOP (0F 1F ...).  Then we change the first byte to INT3 (CC).  Now we can atomically change the other four bytes, but the INT3 could happen first. I suppose that we could treat 1F 00 00 00 or similar as a known-bogus call target, but that seems dangerous.

IOW I think your trick only works if the old and new states are CALL, but we don’t know that until we’ve looked up the record, at which point we can just use the result of the lookup.

Am I missing something clever?  IMO it’s a bummer that there isn’t a way to turn NOP into CALL by changing only one byte.
David Laight May 7, 2019, 2:50 p.m. UTC | #79
From: Steven Rostedt
> Sent: 07 May 2019 14:14
> On Tue, 7 May 2019 12:57:15 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > > Only the INT3 thing needs 'the gap', but the far bigger change here is
> > > that kernel frames now have a complete pt_regs set and all sorts of
> > > horrible crap can go away.
> >
> > I'm not doubting that generating the 'five register' interrupt stack frame
> > for faults in kernel space makes life simpler, just suggesting that the
> > 'emulated call' can be done by emulating the 'iret' rather than generating
> > a gap in the stack.
> 
> But how would the user put something on the stack? I don't see how
> emulating an iret helps here. Can you write some pseudo code to explain
> what you mean. I also believe the gap is only added for kernel->kernel
> entries.

The 'user' (ie the kernel code that needs to emulate the call) doesn't
write the data to the stack, just to some per-cpu location.
(Actually it could be on the stack at the other end of pt_regs.)
So you get to the 'register restore and iret' code with the stack unaltered.
It is then a SMOP to replace the %flags saved by the int3 with the %ip
saved by the int3, the %ip with the address of the function to call,
restore the flags (push and popf) and issue a ret.f to remove the %ip and %cs.

(Actually you need to add 4 to the caller's %ip address to allow for the
difference between the size of int3 (hopefully 0xcc, not 0xcd 0x3).)

> > > For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes
> > > it is possible to build a 5 entry frame on top of the old 3 entry one,
> > > but why bother...
> >
> > Presumably there is 'horrid' code to generate the gap in 64bit mode?
> > (less horrid than 32bit, but still horrid?)
> > Or does it copy the entire pt_regs into a local stack frame and use
> > that for the iret?
> 
> On x86_64, the gap is only done for int3 and nothing else, thus it is
> much less horrid. That's because x86_64 has a sane pt_regs storage for
> all exceptions.

Well, in particular, it always loads %sp as part of the iret.
So you can create a gap and the cpu will remove it for you.

In 64bit mode you could overwrite the %ss with the return address
to the caller restore %eax and %flags, push the function address
and use ret.n to jump to the function subtracting the right amount
from %esp.

Actually that means you can do the following in both modes:
	if not emulated_call_address then pop %ax; iret else
	# assume kernel<->kernel return
	push emulated_call_address;
	push flags_saved_by_int3
	load %ax, return_address_from_iret
	add %ax,#4
	store %ax, first_stack_location_written_by_int3
	load %ax, value_saved_by_int3_entry
	popf
	ret.n

The ret.n discards everything from the %ax to the required return address.
So 'n' is the size of the int3 frame, so 12 for i386 and 40 for amd64.

If the register restore (done just before this code) finished with
'add %sp, sizeof *pt_regs' then the emulated_call_address can be
loaded in %ax from the other end of pt_regs.

This all reminds me of fixing up the faults that happen in kernel
space when loading the user segment registers during 'return to
user'.

	David

Linus Torvalds May 7, 2019, 2:54 p.m. UTC | #80
Duh.

I woke up this morning, realizing what was wrong with my patch.



On Mon, May 6, 2019 at 8:28 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> Yes. But I was looking at the ftrace parts because I didn't see the
> bug in the low-level x86 side, so...

There was nothing wrong in the *low-level* parts. There was one thing
wrong with the "in3_emulate_push()" code:

        memmove(new, regs, SAVED_KERNEL_REGS_SIZE);

which ends up calling an out-of-line function. One that is traced. One
that will recursively result in 'int3'. Which will fill up the stack
until you get a triple fault and reboot.

Stupid stupid.

Anyway, changing that to just copying things one word at a time makes
everything work. The attached patch boots with the whole ftrace test
thing.The only change is literally changing that memmove()  into

        /* Inlined "memmove()" of the pt_regs */
        unsigned long *p = (unsigned long *) new;
        int i = SAVED_KERNEL_REGS_SIZE / sizeof(unsigned long);
        do { *p = p[1]; p++; } while (--i);

which I'm not entirely proud of, but it sure is still simple.

And honestly, I absolutely despise PeterZ's patch. The notion that we
should suddenly say that "oh, the i386 kernel stack is odd" after 28
years of having that standard i386 stack is just crazy. And this:

 arch/x86/entry/entry_32.S            | 136 ++++++++++++++++++++++++++++-------
...
 12 files changed, 323 insertions(+), 140 deletions(-)


vs this:

 arch/x86/entry/entry_32.S            |  7 +++-
...
 6 files changed, 120 insertions(+), 13 deletions(-)

is still pretty damn conclusive. Not to mention that the simple
approach had a truly mindbogglingly simple solution with no actual
subtle changes anywhere else.

So I still claim that we should do my patch. Because it is SIMPLE.
It's straightforward, and I can explain every single line in it. Even
if I spent *way* too long until I realized that the "trivial"
memmove() wasn't so trivial.

                     Linus
arch/x86/entry/entry_32.S            |  7 +++-
 arch/x86/entry/entry_64.S            | 14 ++++++--
 arch/x86/include/asm/ftrace.h        |  2 +-
 arch/x86/include/asm/text-patching.h | 66 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/ftrace.c             | 29 +++++++++++++---
 arch/x86/kernel/traps.c              | 15 +++++---
 6 files changed, 120 insertions(+), 13 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..138ac432221b 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1483,7 +1483,12 @@ ENTRY(int3)
 	TRACE_IRQS_OFF
 	xorl	%edx, %edx			# zero error code
 	movl	%esp, %eax			# pt_regs pointer
-	call	do_int3
+
+	# make room on kernel stack for push emulation
+	# do_kernel_int3 returns possibly updated pt_regs
+	subl	$8, %esp
+	call	do_kernel_int3
+	movl	%eax, %esp
 	jmp	ret_from_exception
 END(int3)
 
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..834ec1397dab 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -879,7 +879,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -899,6 +899,16 @@ ENTRY(\sym)
 	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
 
+	.if \create_gap == 1
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jnz	.Lfrom_usermode_no_gap_\@
+	.rept 6
+	pushq	5*8(%rsp)
+	.endr
+	UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+	.endif
+
 	.if \paranoid
 	call	paranoid_entry
 	.else
@@ -1130,7 +1140,7 @@ apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3			do_int3			has_error_code=0
+idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
diff --git a/arch/x86/include/asm/ftrace.h b/arch/x86/include/asm/ftrace.h
index cf350639e76d..4b335ac5afcc 100644
--- a/arch/x86/include/asm/ftrace.h
+++ b/arch/x86/include/asm/ftrace.h
@@ -37,7 +37,7 @@ struct dyn_arch_ftrace {
 	/* No extra data needed for x86 */
 };
 
-int ftrace_int3_handler(struct pt_regs *regs);
+int ftrace_int3_handler(struct pt_regs **regs);
 
 #define FTRACE_GRAPH_TRAMP_ADDR FTRACE_GRAPH_ADDR
 
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..c91fdecab9c2 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,70 @@ extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
 
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+	regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+/*
+ * The 32-bit kernel entry exception stack does not contain the whole
+ * 'struct pt_regs', because the hardware will not push sp/ss when the
+ * protection level doesn't change.
+ *
+ * So pushing a value onto the stack requires us to make room for it by
+ * moving the whole truncated 'pt_regs' down by four bytes, and the
+ * stack we return to will be directly after it. The lowlevel x86-32
+ * entry code will have made room on the stack for this, so we're not
+ * overwriting anything else there.
+ *
+ * On x86-64, the exception stack is much simpler, and always contains
+ * sp/ss, so we can just update the values directly. Again, the entry
+ * code has made sure there is a gap (now above pt_regs!) so this is ok.
+ */
+static inline struct pt_regs *int3_emulate_push(struct pt_regs *regs, unsigned long value)
+{
+#ifdef CONFIG_X86_32
+#define SAVED_KERNEL_REGS_SIZE (offsetof(struct pt_regs, sp))
+	struct pt_regs *new = (void *)regs - sizeof(long);
+
+	/* Inlined "memmove()" of the pt_regs */
+	unsigned long *p = (unsigned long *) new;
+	int i = SAVED_KERNEL_REGS_SIZE / sizeof(unsigned long);
+	do { *p = p[1]; p++; } while (--i);
+
+	/*
+	 * NOTE! '&new->sp' is actually the top of stack that the 'int3'
+	 * will return to! So despite what this looks like, this doesn't
+	 * update the stack pointer, but writing the value to 'new->sp'
+	 * is actually writing the value *to* the top of the stack in
+	 * the returning context!
+	 *
+	 * The thing that updates the stack pointer in the context we
+	 * return to is the fact that we moved 'struct pt_regs' itself,
+	 * and will be returning using that moved stack frame.
+	 */
+	new->sp = value;
+	return new;
+#else
+	unsigned long *sp;
+
+	regs->sp -= sizeof(long);
+	sp = (unsigned long *)regs->sp;
+	*sp = value;
+	return regs;
+#endif
+}
+
+static inline struct pt_regs *int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+	unsigned long next_ip = regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE;
+
+	regs = int3_emulate_push(regs, next_ip);
+	regs->ip = func;
+	return regs;
+}
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */
diff --git a/arch/x86/kernel/ftrace.c b/arch/x86/kernel/ftrace.c
index ef49517f6bb2..6c239fca99d0 100644
--- a/arch/x86/kernel/ftrace.c
+++ b/arch/x86/kernel/ftrace.c
@@ -29,6 +29,7 @@
 #include <asm/kprobes.h>
 #include <asm/ftrace.h>
 #include <asm/nops.h>
+#include <asm/text-patching.h>
 
 #ifdef CONFIG_DYNAMIC_FTRACE
 
@@ -231,6 +232,7 @@ int ftrace_modify_call(struct dyn_ftrace *rec, unsigned long old_addr,
 }
 
 static unsigned long ftrace_update_func;
+static unsigned long ftrace_update_func_call;
 
 static int update_ftrace_func(unsigned long ip, void *new)
 {
@@ -259,6 +261,8 @@ int ftrace_update_ftrace_func(ftrace_func_t func)
 	unsigned char *new;
 	int ret;
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
 
@@ -287,20 +291,32 @@ static nokprobe_inline int is_ftrace_caller(unsigned long ip)
  * call to a nop. While the change is taking place, we treat
  * it just like it was a nop.
  */
-int ftrace_int3_handler(struct pt_regs *regs)
+int ftrace_int3_handler(struct pt_regs **pregs)
 {
+	struct pt_regs *regs = *pregs;
 	unsigned long ip;
 
 	if (WARN_ON_ONCE(!regs))
 		return 0;
 
-	ip = regs->ip - 1;
-	if (!ftrace_location(ip) && !is_ftrace_caller(ip))
+	if (user_mode(regs))
 		return 0;
 
-	regs->ip += MCOUNT_INSN_SIZE - 1;
+	ip = regs->ip - INT3_INSN_SIZE;
 
-	return 1;
+	if (ftrace_location(ip)) {
+		*pregs = int3_emulate_call(regs, (unsigned long)ftrace_regs_caller);
+		return 1;
+	} else if (is_ftrace_caller(ip)) {
+		if (!ftrace_update_func_call) {
+			int3_emulate_jmp(regs, ip + CALL_INSN_SIZE);
+			return 1;
+		}
+		*pregs = int3_emulate_call(regs, ftrace_update_func_call);
+		return 1;
+	}
+
+	return 0;
 }
 NOKPROBE_SYMBOL(ftrace_int3_handler);
 
@@ -859,6 +875,8 @@ void arch_ftrace_update_trampoline(struct ftrace_ops *ops)
 
 	func = ftrace_ops_get_func(ops);
 
+	ftrace_update_func_call = (unsigned long)func;
+
 	/* Do a safe modify in case the trampoline is executing */
 	new = ftrace_call_replace(ip, (unsigned long)func);
 	ret = update_ftrace_func(ip, new);
@@ -960,6 +978,7 @@ static int ftrace_mod_jmp(unsigned long ip, void *func)
 {
 	unsigned char *new;
 
+	ftrace_update_func_call = 0UL;
 	new = ftrace_jmp_replace(ip, (unsigned long)func);
 
 	return update_ftrace_func(ip, new);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index d26f9e9c3d83..e811e55ab030 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -570,7 +570,7 @@ do_general_protection(struct pt_regs *regs, long error_code)
 }
 NOKPROBE_SYMBOL(do_general_protection);
 
-dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
+dotraplinkage struct pt_regs * notrace do_kernel_int3(struct pt_regs *regs, long error_code)
 {
 #ifdef CONFIG_DYNAMIC_FTRACE
 	/*
@@ -578,11 +578,11 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 	 * See note by declaration of modifying_ftrace_code in ftrace.c
 	 */
 	if (unlikely(atomic_read(&modifying_ftrace_code)) &&
-	    ftrace_int3_handler(regs))
-		return;
+	    ftrace_int3_handler(&regs))
+		return regs;
 #endif
 	if (poke_int3_handler(regs))
-		return;
+		return regs;
 
 	/*
 	 * Use ist_enter despite the fact that we don't use an IST stack.
@@ -614,6 +614,13 @@ dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
 
 exit:
 	ist_exit(regs);
+	return regs;
+}
+NOKPROBE_SYMBOL(do_kernel_int3);
+
+dotraplinkage void notrace do_int3(struct pt_regs *regs, long error_code)
+{
+	WARN_ON_ONCE(do_kernel_int3(regs, error_code) != regs);
 }
 NOKPROBE_SYMBOL(do_int3);
Linus Torvalds May 7, 2019, 2:57 p.m. UTC | #81
On Tue, May 7, 2019 at 7:48 AM Andy Lutomirski <luto@amacapital.net> wrote:
>
> IOW I think your trick only works if the old and new states are CALL, but we don’t know that until we’ve looked up the record, at which point we can just use the result of the lookup.

It would indeed only work for call instructions. I was thinking we'd
know that because we only ever batch up call instructions, though.

But it doesn't matter. I was looking at the ftrace code because I
thought there was some subtle timing bug or race or similar. But it
turned out my "memmove()" was the problem. See the patch I just sent
out.

               Linus
Steven Rostedt May 7, 2019, 2:57 p.m. UTC | #82
On Tue, 7 May 2019 14:50:26 +0000
David Laight <David.Laight@ACULAB.COM> wrote:

> From: Steven Rostedt
> > Sent: 07 May 2019 14:14
> > On Tue, 7 May 2019 12:57:15 +0000
> > David Laight <David.Laight@ACULAB.COM> wrote:

> The 'user' (ie the kernel code that needs to emulate the call) doesn't
> write the data to the stack, just to some per-cpu location.
> (Actually it could be on the stack at the other end of pt-regs.)
> So you get to the 'register restore and iret' code with the stack unaltered.
> It is then a SMOP to replace the %flags saved by the int3 with the %ip
> saved by the int3, the %ip with the address of the function to call,
> restore the flags (push and popf) and issue a ret.f to remove the %ip and %cs.

How would you handle NMIs doing the same thing? Yes, the NMI handlers
have breakpoints that will need to emulate calls as well.

> 
> (Actually you need to add 4 to the caller's %ip address to allow for the
> difference between the size of int3 (hopefully 0xcc, not 0xcd 0x3).)
> 
> > > > For 32bit 'the gap' happens naturally when building a 5 entry frame. Yes
> > > > it is possible to build a 5 entry frame on top of the old 3 entry one,
> > > > but why bother...  
> > >
> > > Presumably there is 'horrid' code to generate the gap in 64bit mode?
> > > (less horrid than 32bit, but still horrid?)
> > > Or does it copy the entire pt_regs into a local stack frame and use
> > > that for the iret?  
> > 
> > On x86_64, the gap is only done for int3 and nothing else, thus it is
> > much less horrid. That's because x86_64 has a sane pt_regs storage for
> > all exceptions.  
> 
> Well, in particular, it always loads %sp as part of the iret.
> So you can create a gap and the cpu will remove it for you.
> 
> In 64bit mode you could overwrite the %ss with the return address
> to the caller, restore %eax and %flags, push the function address
> and use ret.n to jump to the function subtracting the right amount
> from %esp.
> 
> Actually that means you can do the following in both modes:
> 	if not emulated_call_address then pop %ax; iret else
> 	# assume kernel<->kernel return
> 	push emulated_call_address;
> 	push flags_saved_by_int3
> 	load %ax, return_address_from_iret
> 	add %ax,#4
> 	store %ax, first_stack_location_written_by_int3
> 	load %ax, value_saved_by_int3_entry
> 	popf
> 	ret.n
> 
> The ret.n discards everything from the %ax to the required return address.
> So 'n' is the size of the int3 frame, so 12 for i386 and 40 for amd64.
> 
> If the register restore (done just before this code) finished with
> 'add %sp, sizeof *pt_regs' then the emulated_call_address can be
> loaded in %ax from the other end of pt_regs.
> 
> This all reminds me of fixing up the in-kernel faults that happen
> when loading the user segment registers during 'return to user'
> fault in kernel space.

This all sounds much more complex and fragile than the proposed
solution. Why would we do this over what is being proposed?

-- Steve
Steven Rostedt May 7, 2019, 3:12 p.m. UTC | #83
On Tue, 7 May 2019 07:54:53 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> And honestly, I absolutely despise PeterZ's patch. The notion that we
> should suddenly say that "oh, the i386 kernel stack is odd" after 28
> years of having that standard i386 stack is just crazy. And this:
> 
>  arch/x86/entry/entry_32.S            | 136 ++++++++++++++++++++++++++++-------
> ...
>  12 files changed, 323 insertions(+), 140 deletions(-)
> 
> 
> vs this:
> 
>  arch/x86/entry/entry_32.S            |  7 +++-
> ...
>  6 files changed, 120 insertions(+), 13 deletions(-)
> 
> is still pretty damn conclusive. Not to mention that the simple
> approach had a truly mindbogglingly simple solution with no actual
> subtle changes anywhere else.
> 
> So I still claim that we should do my patch. Because it is SIMPLE.
> It's straightforward, and I can explain every single line in it. Even
> if I spent *way* too long until I realized that the "trivial"
> memmove() wasn't so trivial.

Yes, band-aids are usually simpler than a proper fix. We have 28 years
of hacks built on hacks. There's a lot of hacks in the C code to handle
the differences between the crappy way x86_32 does pt_regs and the
proper way x86_64 does them.

If the goal was just to add another band-aid to this, we now have one
more subtle workaround caused by two different methods being handled
by a single code base.

I don't look at Peter's patch and think "this is the solution for int3
emulate calls". I see Peter's patch as "Thanks God, we are finally
getting rid of the cause of all these workaround hacks all over the
place! and oh by the way, we can easily implement int3 call emulation
because of it".

To implement your way, we need to change how the int3 handler works.
It will be the only exception handler having to return regs, otherwise
it will crash.

Sure, it's an easy solution for the one-off change of emulating
calls, but it's just another complex workaround that nobody is going
to understand in 5 years.

-- Steve
Steven Rostedt May 7, 2019, 3:25 p.m. UTC | #84
On Tue, 7 May 2019 11:12:27 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> I don't look at Peter's patch and think "this is the solution for int3
> emulate calls". I see Peter's patch as "Thanks God, we are finally
> getting rid of the cause of all these workaround hacks all over the
> place! and oh by the way, we can easily implement int3 call emulation
> because of it".

Note, if you really are adamant on your solution, I can write them up,
test them, and get them out for this merge window. I really want a
solution for the int3 emulate calls, as there is a real bug here that
they fix.

But I will say, I still prefer Peter's solution. Maybe in the future we
can revisit it again.

-- Steve
Linus Torvalds May 7, 2019, 3:31 p.m. UTC | #85
On Tue, May 7, 2019 at 8:12 AM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> Yes, band-aids are usually simpler than a proper fix.

What? No.

My fix is the *proper* fix.

PeterZ's is the bandaid.

>             We have 28 years
> of hacks built on hacks. There's a lot of hacks in the C code to handle
> the differences between the crappy way x86_32 does pt_regs and the
> proper way x86_64 does them.

You're confusing "reality" with "your dream world". They are different.

The reality is that the i386 kernel stack is just how things work. End of story.

The reality is that changing something fundamental like the kernel
stack at this point for an architecture that will not change in the
future is silly.

The reality is that Peter's patch is much bigger than mine, because it
needed a lot of other changes *because* it did that change.

> To implement your way, we need to change how the int3 handler works.
> It will be the only exception handler having to return regs, otherwise
> it will crash.

What? That's what the patch *does*. It's trivial.  It is done.

                  Linus
Steven Rostedt May 7, 2019, 3:45 p.m. UTC | #86
On Tue, 7 May 2019 08:31:14 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> The reality is that changing something fundamental like the kernel
> stack at this point for an architecture that will not change in the
> future is silly.

x86_32 will no longer have updates, but will x86_64? And we will
constantly be adding more workaround hacks to handle the difference of
the pt_regs in the future. I see Peter's patch easing that future pain.

> 
> The reality is that Peter's patch is much bigger than mine, because it
> needed a lot of other changes *because* it did that change.

It was much bigger because it removed a lot of the workaround hacks.

But you are the "benevolent dictator", and I don't see me changing your
mind.

I'll go and implement it the way you like.

-- Steve
David Laight May 7, 2019, 3:46 p.m. UTC | #87
From: Steven Rostedt
> Sent: 07 May 2019 15:57
> On Tue, 7 May 2019 14:50:26 +0000
> David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > From: Steven Rostedt
> > > Sent: 07 May 2019 14:14
> > > On Tue, 7 May 2019 12:57:15 +0000
> > > David Laight <David.Laight@ACULAB.COM> wrote:
> 
> > The 'user' (ie the kernel code that needs to emulate the call) doesn't
> > write the data to the stack, just to some per-cpu location.
> > (Actually it could be on the stack at the other end of pt-regs.)
> > So you get to the 'register restore and iret' code with the stack unaltered.
> > It is then a SMOP to replace the %flags saved by the int3 with the %ip
> > saved by the int3, the %ip with the address of the function to call,
> > restore the flags (push and popf) and issue a ret.f to remove the %ip and %cs.
> 
> How would you handle NMIs doing the same thing? Yes, the NMI handlers
> have breakpoints that will need to emulated calls as well.

That means you'd have to use a field in the on-stack pt_regs for the
'address to call' rather than a per-cpu variable.
Then it would all nest.

...
> > Actually that means you can do the following in both modes:
> > 	if not emulated_call_address then pop %ax; iret else
> > 	# assume kernel<->kernel return
> > 	push emulated_call_address;
> > 	push flags_saved_by_int3
> > 	load %ax, return_address_from_iret
> > 	add %ax,#4
> > 	store %ax, first_stack_location_written_by_int3
> > 	load %ax, value_saved_by_int3_entry
> > 	popf
> > 	ret.n
> >
> > The ret.n discards everything from the %ax to the required return address.
> > So 'n' is the size of the int3 frame, so 12 for i386 and 40 for amd64.
> >
> > If the register restore (done just before this code) finished with
> > 'add %sp, sizeof *pt_regs' then the emulated_call_address can be
> > loaded in %ax from the other end of pt_regs.
...
> 
> This all sounds much more complex and fragile than the proposed
> solution. Why would we do this over what is being proposed?

It is all complex and fragile however you do it.

One problem I see with converting the 3-register trap frame to a
5-register one is that the entry code and exit code both have to know
whether it is necessary or was done.

AFAICT it is actually quite hard to tell from the stack which form it is.
Although the %sp value might tell you because %ss:%sp might only be
pushed when a stack switch happens so the kernel %sp will be a known
value (for the switched to stack).

The advantage of converting the frame is that, as pointed out earlier, it
does let you have a pt_regs that always contains %ss:%sp.

	David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)
Steven Rostedt May 7, 2019, 4:25 p.m. UTC | #88
On Tue, 7 May 2019 11:25:13 -0400
Steven Rostedt <rostedt@goodmis.org> wrote:

> Note, if you really are adamant on your solution, I can write them up,
> test them, and get them out for this merge window. I really want a
> solution for the int3 emulate calls, as there is a real bug here that
> they fix.

Thinking about this more, as my real motivation for getting this in
(for this merge window), is to fix the live kernel patching bug. We
only need to implement int3 call emulation for x86_64. We don't need to
implement it for 32bit. The ftrace code can continue to just make it a
nop. Live kernel patching does not support x86_32, and the only issue
that happens on 32bit when we use a nop while converting one call to another,
is that we might lose a trace. But that's been the case since this
started, and not a critical issue. But with live kernel patching,
losing a trace could crash the machine.

As I need to mark all this for stable, I'm going to look into just
implementing this for x86_64. Then we can continue the debate about
how to do this for x86_32 if we care about lost traces. But we don't
need to commit to anything yet.

-- Steve
Peter Zijlstra May 7, 2019, 4:34 p.m. UTC | #89
On Tue, May 07, 2019 at 08:31:14AM -0700, Linus Torvalds wrote:
> The reality is that changing something fundamental like the kernel
> stack at this point for an architecture that will not change in the
> future is silly.

In my eyes it makes sense because i386 is a minority architecture at
this point, and 'nobody' wants to care about how it's different if they
don't have to.

> The reality is that Peter's patch is much bigger than mine, because it
> needed a lot of other changes *because* it did that change.

Yes, I change the way stack layout works on i386, and yes that affects a
lot of code. _However_ all of that code is now more like x86_64 than it
was.

Earlier you said that kernel_stack_pointer() was a horrible thing; and
most/all the code that I ended up touching was similarly horrible.

Would you consider my approach later on, under the guise of unification?
We can work on it for a while, and make sure all the iffy bits are
sorted, no need to rush?
Linus Torvalds May 7, 2019, 5:08 p.m. UTC | #90
On Tue, May 7, 2019 at 9:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
>
> Would you consider my approach later on, under the guise of unification?

WHY?

The *only* advantage of your patch is that trivial "look up kernel stack" macro.

Seriously. There's absolutely nothing else.

Is that macro ugly? Yes. But it's directly explainable by just
pointing to the architecture documentation.

It's a one-liner hack.

And for that, you want to complicate the x86-32 entry and exit code?

Do we have different emulation for "push" on 32-bit and 64-bit? Yes.
But again, that's just how the hardware works. This is not some
"generic hw-independent code". This is literally emulating
instructions that care about instruction encoding and bit size
details, where there are certainly _similarities_ (and in the case of
'call', they look bit-identical), but it's also not like "same code"
is a big argument. That's why we have a helper function, to hide the
details.

I point to my diffstat once again. It's smaller, and I argue that it
is actually conceptually *simpler* to simply say "this is how the
architecture works".

And yes, I realize that I may be biased by the fact that I simply know
i386 so well, so to me it simply makes more sense to just work with
what the hardware gives us. The i386 exception model with the kernel
stack nesting is a *hell* of a lot simpler than the x86-64 one. The
fact is, x86-64 messed things up, and swapgs and friends are an
abomination against God.

So the whole "let's clean up x86-32 to look like x86-64, which got
things right" is to me a completely bogus argument. x86-64 got the
"yes, push ss/sp unconditionally" part right, but got a lot of other
things horribly wrong. So this is all just one small detail that
differs, across two architectures that are similar but have very
different warts.

But that diffstat is still hard, cold, unbiased data.

                    Linus
Masami Hiramatsu May 7, 2019, 5:15 p.m. UTC | #91
On Tue, 7 May 2019 23:13:40 +0900
Masami Hiramatsu <mhiramat@kernel.org> wrote:

> On Mon, 6 May 2019 17:45:11 -0400
> Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > If we go with Peter's patch, I can make this code much more sane, and
> > not have to worry about having &regs->sp be at the top of the stack. I
> > could simply, just push everything in the order of pt_regs and call the
> > handler.
> 
> Hi Steve, I need to catch up with the origin of this series, but it seems
> also good for optprobe, which is doing a similar trick on pt_regs. If we can
> assume that int3 pt_regs can have a gap, optprobe can also make a gap, and
> it can be also used for storing destination address.

Sorry, I misunderstood. I see the issue ( https://lkml.org/lkml/2019/5/1/497 )
and solutions on the thread. If we really need to fix this trace-livepatch
combination issue, it may be good to backport to stable trees.
From this viewpoint, Linus's suggestion (no pt_regs changes on x86-32) seems
to have a point.

BTW, even though I think Peter's patch (unifying pt_regs behavior) will also
be good for us for a more general reason (not only for fixing the actual issue).

Thank you,
Josh Poimboeuf May 7, 2019, 5:21 p.m. UTC | #92
On Tue, May 07, 2019 at 10:08:50AM -0700, Linus Torvalds wrote:
> On Tue, May 7, 2019 at 9:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Would you consider my approach later on, under the guise of unification?
> 
> WHY?
> 
> The *only* advantage of your patch is that trivial "look up kernel stack" macro.
> 
> Seriously. There's absolutely nothing else.
> 
> Is that macro ugly? Yes. But it's directly explainable by just
> pointing to the architecture documentation.
> 
> It's a one-liner hack.
> 
> And for that, you want to complicate the x86-32 entry and exit code?
> 
> Do we have different emulation for "push" on 32-bit and 64-bit? Yes.
> But again, that's just how the hardware works. This is not some
> "generic hw-independent code". This is literally emulating
> instructions that care about instruction encoding and bit size
> details, where there are certainly _similarities_ (and in the case of
> 'call', they look bit-identical), but it's also not like "same code"
> is a big argument. That's why we have a helper function, to hide the
> details.
> 
> I point to my diffstat once again. It's smaller, and I argue that it
> is actually conceptually *simpler* to simply say "this is how the
> architecture works".
> 
> And yes, I realize that I may be biased by the fact that I simply know
> i386 so well, so to me it simply makes more sense to just work with
> what the hardware gives us. The i386 exception model with the kernel
> stack nesting is a *hell* of a lot simpler than the x86-64 one. The
> fact is, x86-64 messed things up, and swapgs and friends are an
> abomination against God.
> 
> So the whole "let's clean up x86-32 to look like x86-64, which got
> things right" is to me a completely bogus argument. x86-64 got the
> "yes, push ss/sp unconditionally" part right, but got a lot of other
> things horribly wrong. So this is all just one small detail that
> differs, across two architectures that are similar but have very
> different warts.
> 
> But that diffstat is still hard, cold, unbiased data.

regs->sp is *undefined* on x86-32.  We're damning our future selves to
have to always remember to use that darn kernel_stack_pointer() helper
for eternity just because of x86-32.

There have already been several bugs related to that.  Because regs->sp
is there, so why wouldn't you use it?

If we truly want the code to reflect the HW, then we should have a
pt_regs_kernel and a pt_regs_user on 32-bit.  I'm pretty sure we don't
want to go there...

IMO, we either need to make the pt_regs struct(s) match the HW behavior,
or make entry code match pt_regs.  But this in-between thing just
creates a bunch of headaches.
Masami Hiramatsu May 7, 2019, 5:22 p.m. UTC | #93
2019年5月7日(火) 21:54 Steven Rostedt <rostedt@goodmis.org>:
>
> On Tue, 7 May 2019 14:41:31 +0200
> Peter Zijlstra <peterz@infradead.org> wrote:
>
> > > Kprobes sets the FTRACE_OPS_FL_IPMODIFY flag, thus
> > > they can never be put at the same location that is being live patched.
> >
> > OK, so do we want to allow kprobes that also modify regs->sp ? Because
> > then we need to change these trampolines a bit.
> >
> > I'd prefer not to allow kprobes this.
>
> I believe no kprobe changes sp, because it would have had the same
> issues we are trying to solve now. And even though we are changing
> things to allow it, it's not a regression to keep kprobes from doing it.

No, kprobes doesn't allow to change sp. At this moment we can't change
"&regs->sp" since it is just a value :)
A kprobes user (e.g. function fault-injection) will change regs->ip,
which is why kprobes sets the IPMODIFY flag.

Thank you,
Peter Zijlstra May 7, 2019, 5:38 p.m. UTC | #94
On Tue, May 07, 2019 at 10:08:50AM -0700, Linus Torvalds wrote:
> On Tue, May 7, 2019 at 9:34 AM Peter Zijlstra <peterz@infradead.org> wrote:
> >
> > Would you consider my approach later on, under the guise of unification?
> 
> WHY?
> 
> The *only* advantage of your patch is that trivial "look up kernel stack" macro.
> 
> Seriously. There's absolutely nothing else.

The ftrace_regs_caller, the kprobe trampolines, the unwinder, they all
have 'funny' bits because pt_regs isn't 'right'.

> So the whole "let's clean up x86-32 to look like x86-64, which got
> things right" is to me a completely bogus argument. x86-64 got the
> "yes, push ss/sp unconditionally" part right, but got a lot of other
> things horribly wrong. So this is all just one small detail that
> differs, across two architectures that are similar but have very
> different warts.

It's a detail that leaks into the C code. Yes SWAPGS is horrible crap,
but C code doesn't much care. The partial pt_regs thing otoh comes up a
fair number of times.

Anyway; I think we're at the point where we'll have to agree to
disagree (or maybe slightly past it).
Steven Rostedt May 7, 2019, 9:24 p.m. UTC | #95
On Tue, 7 May 2019 12:21:59 -0500
Josh Poimboeuf <jpoimboe@redhat.com> wrote:

> regs->sp is *undefined* on x86-32.  We're damning our future selves to
> have to always remember to use that darn kernel_stack_pointer() helper
> for eternity just because of x86-32.

And there's been several times I forget that regs->sp can not be read
directly. Especially most of my bug reports are for x86_64 these days.
But when I hit that rare x86_32 one and went debugging, I would print
out "regs->sp" and then the system would crash. And I spent some time
wondering why?

It's been a bane of mine for some time.


-- Steve
Linus Torvalds May 8, 2019, 4:50 a.m. UTC | #96
On Tue, May 7, 2019 at 2:24 PM Steven Rostedt <rostedt@goodmis.org> wrote:
>
> And there's been several times I forget that regs->sp can not be read
> directly. Especially most of my bug reports are for x86_64 these days.
> But when I had that seldom x86_32 one, and go debugging, I would print
> out "regs->sp" and then the system would crash. And I spend some time
> wondering why?
>
> It's been a bane of mine for some time.

Guys, I have basically a one-liner patch for your hangups.

It's called "rename 'sp' to 'user_sp' on x86-32".

Then we make the 'sp' field on x86-64 be a union, so that you can call
it user_sp or sp as you wish.

Yeah, it's really more than one line, because obviously the users will
need changing, but honestly, that would be a _real_ cleanup. Make the
register match what it actually is.

And it doesn't mess up the stack frame, and it doesn't change the
entry code. It just reminds people that the entry is the USER stack
pointer.

Problem solved.

               Linus
Steven Rostedt May 8, 2019, 4:37 p.m. UTC | #97
On Tue, 7 May 2019 21:50:52 -0700
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> > It's been a bane of mine for some time.  
> 
> Guys, I have basically a one-liner patch for your hangups.
> 
> It's called "rename 'sp' to 'user_sp' on x86-32".
> 
> Then we make the 'sp' field on x86-64 be a union, so that you can call
> it user_sp or sp as you wish.
> 
> Yeah, it's really more than one line, because obviously the users will
> need changing, but honestly, that would be a _real_ cleanup. Make the
> register match what it actually is.

But is it? Sure, it will be a reminder that it's different for x86-32,
but that still doesn't take away the fact that pt_regs on x86_32 is an
anomaly! Where else do we have part of a data structure that can't be
read because it can possibly fault? If regs is a valid pointer, one
would think that simply reading regs->sp (or regs->user_sp) would be no
more dangerous than reading regs->ip.

The difference between entry_32.S and entry_64.S causes it to spill
into C code, making the x86_64 code more difficult to deal with. Sure,
10 to 15 years ago, all your arguments would make sense. But today, who
uses x86_32?  Perhaps you may use it in a VM, but I asked a few
developers when was the last time they used one, they told me 5 to 7
years ago. I only boot x86_32 to test to make sure I didn't break it.

Yes, your diffstat is really nice to the changes to entry_32.S, but at
what cost? To make the x86_64 code more complex? That whole returning
the regs in the int3 handler makes no sense on x86_64, but yet we would
need to do it to handle x86_32. Why burden the architecture of today
and tomorrow with the architecture of yesterday? x86_32 is becoming
more obsolete by the day. It baffles me why we wouldn't want to contain
its complexity in a single file then to spread it out like wildfire
across the generic x86 code.

The &regs->sp hack is just one more rung in the complex learning curve
ladder of becoming a Linux kernel developer.

-- Steve

Patch
diff mbox series

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index d309f30cf7af..50bbf4035baf 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -1478,6 +1478,17 @@  ENTRY(int3)
 	ASM_CLAC
 	pushl	$-1				# mark this as an int
 
+#ifdef CONFIG_VM86
+	testl	$X86_EFLAGS_VM, PT_EFLAGS(%esp)
+	jnz	.Lfrom_usermode_no_gap
+#endif
+	testl	$SEGMENT_RPL_MASK, PT_CS(%esp)
+	jnz	.Lfrom_usermode_no_gap
+	.rept 6
+	pushl	5*4(%esp)
+	.endr
+.Lfrom_usermode_no_gap:
+
 	SAVE_ALL switch_stacks=1
 	ENCODE_FRAME_POINTER
 	TRACE_IRQS_OFF
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 1f0efdb7b629..834ec1397dab 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -879,7 +879,7 @@  apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  * @paranoid == 2 is special: the stub will never switch stacks.  This is for
  * #DF: if the thread stack is somehow unusable, we'll still get a useful OOPS.
  */
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 create_gap=0
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
 
@@ -899,6 +899,16 @@  ENTRY(\sym)
 	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
 
+	.if \create_gap == 1
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jnz	.Lfrom_usermode_no_gap_\@
+	.rept 6
+	pushq	5*8(%rsp)
+	.endr
+	UNWIND_HINT_IRET_REGS offset=8
+.Lfrom_usermode_no_gap_\@:
+	.endif
+
 	.if \paranoid
 	call	paranoid_entry
 	.else
@@ -1130,7 +1140,7 @@  apicinterrupt3 HYPERV_STIMER0_VECTOR \
 #endif /* CONFIG_HYPERV */
 
 idtentry debug			do_debug		has_error_code=0	paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3			do_int3			has_error_code=0
+idtentry int3			do_int3			has_error_code=0	create_gap=1
 idtentry stack_segment		do_stack_segment	has_error_code=1
 
 #ifdef CONFIG_XEN_PV
diff --git a/arch/x86/include/asm/text-patching.h b/arch/x86/include/asm/text-patching.h
index e85ff65c43c3..ba275b6292db 100644
--- a/arch/x86/include/asm/text-patching.h
+++ b/arch/x86/include/asm/text-patching.h
@@ -39,4 +39,24 @@  extern int poke_int3_handler(struct pt_regs *regs);
 extern void *text_poke_bp(void *addr, const void *opcode, size_t len, void *handler);
 extern int after_bootmem;
 
+static inline void int3_emulate_push(struct pt_regs *regs, unsigned long val)
+{
+	regs->sp -= sizeof(unsigned long);
+	*(unsigned long *)regs->sp = val;
+}
+
+static inline void int3_emulate_jmp(struct pt_regs *regs, unsigned long ip)
+{
+	regs->ip = ip;
+}
+
+#define INT3_INSN_SIZE 1
+#define CALL_INSN_SIZE 5
+
+static inline void int3_emulate_call(struct pt_regs *regs, unsigned long func)
+{
+	int3_emulate_push(regs, regs->ip - INT3_INSN_SIZE + CALL_INSN_SIZE);
+	int3_emulate_jmp(regs, func);
+}
+
 #endif /* _ASM_X86_TEXT_PATCHING_H */