diff mbox series

arm64: preserve pt_regs::stackframe during exec*()

Message ID 20241021164456.2275285-1-mark.rutland@arm.com (mailing list archive)
State New
Headers show
Series arm64: preserve pt_regs::stackframe during exec*() | expand

Commit Message

Mark Rutland Oct. 21, 2024, 4:44 p.m. UTC
When performing an exec*(), there's a transient period before the return
to userspace where any stacktrace will result in a warning triggered by
kunwind_next_frame_record_meta() encountering a struct frame_record_meta
with an unknown type. This can be seen fairly reliably by enabling KASAN
or KFENCE, e.g.

| WARNING: CPU: 3 PID: 143 at arch/arm64/kernel/stacktrace.c:223 arch_stack_walk+0x264/0x3b0
| Modules linked in:
| CPU: 3 UID: 0 PID: 143 Comm: login Not tainted 6.12.0-rc2-00010-g0f0b9a3f6a50 #1
| Hardware name: linux,dummy-virt (DT)
| pstate: 814000c5 (Nzcv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
| pc : arch_stack_walk+0x264/0x3b0
| lr : arch_stack_walk+0x1ec/0x3b0
| sp : ffff80008060b970
| x29: ffff80008060ba10 x28: fff00000051133c0 x27: 0000000000000000
| x26: 0000000000000000 x25: 0000000000000000 x24: fff000007fe84000
| x23: ffff9d1b3c940af0 x22: 0000000000000000 x21: fff00000051133c0
| x20: ffff80008060ba50 x19: ffff9d1b3c9408e0 x18: 0000000000000014
| x17: 000000006d50da47 x16: 000000008e3f265e x15: fff0000004e8bf40
| x14: 0000ffffc5e50e48 x13: 000000000000000f x12: 0000ffffc5e50fed
| x11: 000000000000001f x10: 000018007f8bffff x9 : 0000000000000000
| x8 : ffff80008060b9c0 x7 : ffff80008060bfd8 x6 : ffff80008060ba80
| x5 : ffff80008060ba00 x4 : ffff80008060c000 x3 : ffff80008060bff0
| x2 : 0000000000000018 x1 : ffff80008060bfd8 x0 : 0000000000000000
| Call trace:
|  arch_stack_walk+0x264/0x3b0 (P)
|  arch_stack_walk+0x1ec/0x3b0 (L)
|  stack_trace_save+0x50/0x80
|  metadata_update_state+0x98/0xa0
|  kfence_guarded_free+0xec/0x2c4
|  __kfence_free+0x50/0x100
|  kmem_cache_free+0x1a4/0x37c
|  putname+0x9c/0xc0
|  do_execveat_common.isra.0+0xf0/0x1e4
|  __arm64_sys_execve+0x40/0x60
|  invoke_syscall+0x48/0x104
|  el0_svc_common.constprop.0+0x40/0xe0
|  do_el0_svc+0x1c/0x28
|  el0_svc+0x34/0xe0
|  el0t_64_sync_handler+0x120/0x12c
|  el0t_64_sync+0x198/0x19c

This happens because start_thread_common() zeroes the entirety of
current_pt_regs(), including pt_regs::stackframe::type, changing this
from FRAME_META_TYPE_FINAL to 0 and making the final record invalid.
The stacktrace code will reject this until the next return to userspace,
where a subsequent exception entry will reinitialize the type to
FRAME_META_TYPE_FINAL.

This zeroing wasn't a problem prior to commit:

  0f0b9a3f6a50fa36 ("arm64: stacktrace: unwind exception boundaries")

... as before that commit the stacktrace code only expected the final
pt_regs::stackframe to contain zeroes, which was unchanged by
start_thread_common().

A stacktrace could occur at any time, either due to instrumentation or
an exception, and so start_thread_common() must ensure that
pt_regs::stackframe is always valid.

Fix this by changing the way start_thread_common() zeroes and
reinitializes the pt_regs fields:

* The '{regs,pc,pstate}' fields are initialized in one go via a struct
  assignment to the user_regs, with start_thread() and
  compat_start_thread() modified to pass 'pstate' into
  start_thread_common().

* The 'sp' and 'compat_sp' fields are zeroed by the struct assignment in
  start_thread_common(), and subsequently overwritten in start_thread()
  and compat_start_thread respectively, matching existing behaviour.

* The 'syscallno' field is implicitly preserved while the 'orig_x0'
  field is explicitly zeroed, maintaining existing ABI.

* The 'pmr' field is explicitly initialized, as necessary for an exec*()
  from a kernel thread, matching existing behaviour.

* The 'stackframe' field is implicitly preserved, with a new comment and
  some assertions to ensure we don't accidentally break this in future.

* All other fields are implicitly preserved, and should have no
  functional impact:

  - 'sdei_ttbr1' is only used for SDEI exception entry/exit, and we
    never exec*() inside an SDEI handler.

  - 'lockdep_hardirqs' and 'exit_rcu' are only used for EL1 exception
    entry/exit, and we never exec*() inside an EL1 exception handler.

While updating compat_start_thread() to pass 'pstate' into
start_thread_common(), I've also updated the logic to remove the
ifdeffery, replacing:

| #ifdef __AARCH64EB__
|        regs->pstate |= PSR_AA32_E_BIT;
| #endif

... with:

| if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
|         pstate |= PSR_AA32_E_BIT;

... which should be functionally equivalent, and matches our preferred
code style.

Fixes: 0f0b9a3f6a50fa36 ("arm64: stacktrace: unwind exception boundaries")
Signed-off-by: Mark Rutland <mark.rutland@arm.com>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Mark Brown <broonie@kernel.org>
Cc: Miroslav Benes <mbenes@suse.cz>
Cc: Puranjay Mohan <puranjay12@gmail.com>
Cc: Will Deacon <will@kernel.org>
---
 arch/arm64/include/asm/processor.h | 48 +++++++++++++++++++++---------
 1 file changed, 34 insertions(+), 14 deletions(-)

Comments

Mark Rutland Oct. 21, 2024, 5:13 p.m. UTC | #1
On Mon, Oct 21, 2024 at 05:44:56PM +0100, Mark Rutland wrote:
> When performing an exec*(), there's a transient period before the return
> to userspace where any stacktrace will result in a warning triggered by
> kunwind_next_frame_record_meta() encountering a struct frame_record_meta
> with an unknown type. This can be seen fairly reliably by enabling KASAN
> or KFENCE, e.g.

[...]

> This zeroing wasn't a problem prior to commit:
> 
>   0f0b9a3f6a50fa36 ("arm64: stacktrace: unwind exception boundaries")

Sorry, that's the commit ID from my arm64/stacktrace/metadata branch.
The version queued in the arm64 for-next/stacktrace branch has a
different ID, and this should say:

  c2c6b27b5aa14fa2 ("arm64: stacktrace: unwind exception boundaries")

[...]

> Fixes: 0f0b9a3f6a50fa36 ("arm64: stacktrace: unwind exception boundaries")

... and likewise that should be:
  
  Fixes: c2c6b27b5aa14fa2 ("arm64: stacktrace: unwind exception boundaries")

Mark.
Puranjay Mohan Oct. 21, 2024, 7:40 p.m. UTC | #2
Mark Rutland <mark.rutland@arm.com> writes:

> When performing an exec*(), there's a transient period before the return
> to userspace where any stacktrace will result in a warning triggered by
> kunwind_next_frame_record_meta() encountering a struct frame_record_meta
> with an unknown type. This can be seen fairly reliably by enabling KASAN
> or KFENCE, e.g.
>
> | WARNING: CPU: 3 PID: 143 at arch/arm64/kernel/stacktrace.c:223 arch_stack_walk+0x264/0x3b0
> | Modules linked in:
> | CPU: 3 UID: 0 PID: 143 Comm: login Not tainted 6.12.0-rc2-00010-g0f0b9a3f6a50 #1
> | Hardware name: linux,dummy-virt (DT)
> | pstate: 814000c5 (Nzcv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> | pc : arch_stack_walk+0x264/0x3b0
> | lr : arch_stack_walk+0x1ec/0x3b0
> | sp : ffff80008060b970
> | x29: ffff80008060ba10 x28: fff00000051133c0 x27: 0000000000000000
> | x26: 0000000000000000 x25: 0000000000000000 x24: fff000007fe84000
> | x23: ffff9d1b3c940af0 x22: 0000000000000000 x21: fff00000051133c0
> | x20: ffff80008060ba50 x19: ffff9d1b3c9408e0 x18: 0000000000000014
> | x17: 000000006d50da47 x16: 000000008e3f265e x15: fff0000004e8bf40
> | x14: 0000ffffc5e50e48 x13: 000000000000000f x12: 0000ffffc5e50fed
> | x11: 000000000000001f x10: 000018007f8bffff x9 : 0000000000000000
> | x8 : ffff80008060b9c0 x7 : ffff80008060bfd8 x6 : ffff80008060ba80
> | x5 : ffff80008060ba00 x4 : ffff80008060c000 x3 : ffff80008060bff0
> | x2 : 0000000000000018 x1 : ffff80008060bfd8 x0 : 0000000000000000
> | Call trace:
> |  arch_stack_walk+0x264/0x3b0 (P)
> |  arch_stack_walk+0x1ec/0x3b0 (L)
> |  stack_trace_save+0x50/0x80
> |  metadata_update_state+0x98/0xa0
> |  kfence_guarded_free+0xec/0x2c4
> |  __kfence_free+0x50/0x100
> |  kmem_cache_free+0x1a4/0x37c
> |  putname+0x9c/0xc0
> |  do_execveat_common.isra.0+0xf0/0x1e4
> |  __arm64_sys_execve+0x40/0x60
> |  invoke_syscall+0x48/0x104
> |  el0_svc_common.constprop.0+0x40/0xe0
> |  do_el0_svc+0x1c/0x28
> |  el0_svc+0x34/0xe0
> |  el0t_64_sync_handler+0x120/0x12c
> |  el0t_64_sync+0x198/0x19c
>
> This happens because start_thread_common() zeroes the entirety of
> current_pt_regs(), including pt_regs::stackframe::type, changing this
> from FRAME_META_TYPE_FINAL to 0 and making the final record invalid.
> The stacktrace code will reject this until the next return to userspace,
> where a subsequent exception entry will reinitialize the type to
> FRAME_META_TYPE_FINAL.
>
> This zeroing wasn't a problem prior to commit:
>
>   0f0b9a3f6a50fa36 ("arm64: stacktrace: unwind exception boundaries")
>
> ... as before that commit the stacktrace code only expected the final
> pt_regs::stackframe to contain zeroes, which was unchanged by
> start_thread_common().
>
> A stacktrace could occur at any time, either due to instrumentation or
> an exception, and so start_thread_common() must ensure that
> pt_regs::stackframe is always valid.
>
> Fix this by changing the way start_thread_common() zeroes and
> reinitializes the pt_regs fields:
>
> * The '{regs,pc,pstate}' fields are initialized in one go via a struct
>   assignment to the user_regs, with start_thread() and
>   compat_start_thread() modified to pass 'pstate' into
>   start_thread_common().
>
> * The 'sp' and 'compat_sp' fields are zeroed by the struct assignment in
>   start_thread_common(), and subsequently overwritten in start_thread()
>   and compat_start_thread respectively, matching existing behaviour.
>
> * The 'syscallno' field is implicitly preserved while the 'orig_x0'
>   field is explicitly zeroed, maintaining existing ABI.
>
> * The 'pmr' field is explicitly initialized, as necessary for an exec*()
>   from a kernel thread, matching existing behaviour.
>
> * The 'stackframe' field is implicitly preserved, with a new comment and
>   some assertions to ensure we don't accidentally break this in future.
>
> * All other fields are implicitly preserved, and should have no
>   functional impact:
>
>   - 'sdei_ttbr1' is only used for SDEI exception entry/exit, and we
>     never exec*() inside an SDEI handler.
>
>   - 'lockdep_hardirqs' and 'exit_rcu' are only used for EL1 exception
>     entry/exit, and we never exec*() inside an EL1 exception handler.
>
> While updating compat_start_thread() to pass 'pstate' into
> start_thread_common(), I've also updated the logic to remove the
> ifdeffery, replacing:
>
> | #ifdef __AARCH64EB__
> |        regs->pstate |= PSR_AA32_E_BIT;
> | #endif
>
> ... with:
>
> | if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
> |         pstate |= PSR_AA32_E_BIT;
>
> ... which should be functionally equivalent, and matches our preferred
> code style.
>
> Fixes: 0f0b9a3f6a50fa36 ("arm64: stacktrace: unwind exception boundaries")
> Signed-off-by: Mark Rutland <mark.rutland@arm.com>

I tested this with KVM on an AWS graviton metal host. The warning is
gone now.

Tested-by: Puranjay Mohan <puranjay12@gmail.com>
Reviewed-by: Puranjay Mohan <puranjay12@gmail.com>

Thanks,
Puranjay
Catalin Marinas Oct. 22, 2024, 11:37 a.m. UTC | #3
On Mon, 21 Oct 2024 17:44:56 +0100, Mark Rutland wrote:
> When performing an exec*(), there's a transient period before the return
> to userspace where any stacktrace will result in a warning triggered by
> kunwind_next_frame_record_meta() encountering a struct frame_record_meta
> with an unknown type. This can be seen fairly reliably by enabling KASAN
> or KFENCE, e.g.
> 
> | WARNING: CPU: 3 PID: 143 at arch/arm64/kernel/stacktrace.c:223 arch_stack_walk+0x264/0x3b0
> | Modules linked in:
> | CPU: 3 UID: 0 PID: 143 Comm: login Not tainted 6.12.0-rc2-00010-g0f0b9a3f6a50 #1
> | Hardware name: linux,dummy-virt (DT)
> | pstate: 814000c5 (Nzcv daIF +PAN -UAO -TCO +DIT -SSBS BTYPE=--)
> | pc : arch_stack_walk+0x264/0x3b0
> | lr : arch_stack_walk+0x1ec/0x3b0
> | sp : ffff80008060b970
> | x29: ffff80008060ba10 x28: fff00000051133c0 x27: 0000000000000000
> | x26: 0000000000000000 x25: 0000000000000000 x24: fff000007fe84000
> | x23: ffff9d1b3c940af0 x22: 0000000000000000 x21: fff00000051133c0
> | x20: ffff80008060ba50 x19: ffff9d1b3c9408e0 x18: 0000000000000014
> | x17: 000000006d50da47 x16: 000000008e3f265e x15: fff0000004e8bf40
> | x14: 0000ffffc5e50e48 x13: 000000000000000f x12: 0000ffffc5e50fed
> | x11: 000000000000001f x10: 000018007f8bffff x9 : 0000000000000000
> | x8 : ffff80008060b9c0 x7 : ffff80008060bfd8 x6 : ffff80008060ba80
> | x5 : ffff80008060ba00 x4 : ffff80008060c000 x3 : ffff80008060bff0
> | x2 : 0000000000000018 x1 : ffff80008060bfd8 x0 : 0000000000000000
> | Call trace:
> |  arch_stack_walk+0x264/0x3b0 (P)
> |  arch_stack_walk+0x1ec/0x3b0 (L)
> |  stack_trace_save+0x50/0x80
> |  metadata_update_state+0x98/0xa0
> |  kfence_guarded_free+0xec/0x2c4
> |  __kfence_free+0x50/0x100
> |  kmem_cache_free+0x1a4/0x37c
> |  putname+0x9c/0xc0
> |  do_execveat_common.isra.0+0xf0/0x1e4
> |  __arm64_sys_execve+0x40/0x60
> |  invoke_syscall+0x48/0x104
> |  el0_svc_common.constprop.0+0x40/0xe0
> |  do_el0_svc+0x1c/0x28
> |  el0_svc+0x34/0xe0
> |  el0t_64_sync_handler+0x120/0x12c
> |  el0t_64_sync+0x198/0x19c
> 
> [...]

Applied to arm64 (for-next/stacktrace), thanks!

[1/1] arm64: preserve pt_regs::stackframe during exec*()
      https://git.kernel.org/arm64/c/f260c4426763
diff mbox series

Patch

diff --git a/arch/arm64/include/asm/processor.h b/arch/arm64/include/asm/processor.h
index 03b99164f7f46..e277105fb87a2 100644
--- a/arch/arm64/include/asm/processor.h
+++ b/arch/arm64/include/asm/processor.h
@@ -285,22 +285,44 @@  void tls_preserve_current_state(void);
 	.fpsimd_cpu = NR_CPUS,			\
 }
 
-static inline void start_thread_common(struct pt_regs *regs, unsigned long pc)
+static inline void start_thread_common(struct pt_regs *regs, unsigned long pc,
+				       unsigned long pstate)
 {
-	s32 previous_syscall = regs->syscallno;
-	memset(regs, 0, sizeof(*regs));
-	regs->syscallno = previous_syscall;
-	regs->pc = pc;
+	/*
+	 * Ensure all GPRs are zeroed, and initialize PC + PSTATE.
+	 * The SP (or compat SP) will be initialized later.
+	 */
+	regs->user_regs = (struct user_pt_regs) {
+		.pc = pc,
+		.pstate = pstate,
+	};
+
+	/*
+	 * To allow the syscalls:sys_exit_execve tracepoint we need to preserve
+	 * syscallno, but do not need orig_x0 or the original GPRs.
+	 */
+	regs->orig_x0 = 0;
 
+	/*
+	 * An exec from a kernel thread won't have an existing PMR value.
+	 */
 	if (system_uses_irq_prio_masking())
 		regs->pmr = GIC_PRIO_IRQON;
+
+	/*
+	 * The pt_regs::stackframe field must remain valid throughout this
+	 * function as a stacktrace can be taken at any time. Any user or
+	 * kernel task should have a valid final frame.
+	 */
+	WARN_ON_ONCE(regs->stackframe.record.fp != 0);
+	WARN_ON_ONCE(regs->stackframe.record.lr != 0);
+	WARN_ON_ONCE(regs->stackframe.type != FRAME_META_TYPE_FINAL);
 }
 
 static inline void start_thread(struct pt_regs *regs, unsigned long pc,
 				unsigned long sp)
 {
-	start_thread_common(regs, pc);
-	regs->pstate = PSR_MODE_EL0t;
+	start_thread_common(regs, pc, PSR_MODE_EL0t);
 	spectre_v4_enable_task_mitigation(current);
 	regs->sp = sp;
 }
@@ -309,15 +331,13 @@  static inline void start_thread(struct pt_regs *regs, unsigned long pc,
 static inline void compat_start_thread(struct pt_regs *regs, unsigned long pc,
 				       unsigned long sp)
 {
-	start_thread_common(regs, pc);
-	regs->pstate = PSR_AA32_MODE_USR;
+	unsigned long pstate = PSR_AA32_MODE_USR;
 	if (pc & 1)
-		regs->pstate |= PSR_AA32_T_BIT;
-
-#ifdef __AARCH64EB__
-	regs->pstate |= PSR_AA32_E_BIT;
-#endif
+		pstate |= PSR_AA32_T_BIT;
+	if (IS_ENABLED(CONFIG_CPU_BIG_ENDIAN))
+		pstate |= PSR_AA32_E_BIT;
 
+	start_thread_common(regs, pc, pstate);
 	spectre_v4_enable_task_mitigation(current);
 	regs->compat_sp = sp;
 }