
[RFC,v2,5/8] arm64: Detect an FTRACE frame and mark a stack trace unreliable

Message ID 20210315165800.5948-6-madvenka@linux.microsoft.com (mailing list archive)
State New, archived
Series arm64: Implement reliable stack trace

Commit Message

Madhavan T. Venkataraman March 15, 2021, 4:57 p.m. UTC
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled and tracing is activated
for a function, the ftrace infrastructure is called for the function at
the very beginning. Ftrace creates two frames:

	- One for the traced function

	- One for the caller of the traced function

That gives a reliable stack trace while executing in the ftrace
infrastructure code. When ftrace returns to the traced function, the frames
are popped and everything is back to normal.

However, in cases like live patch, execution is redirected to a different
function when ftrace returns. A stack trace taken while still in the ftrace
infrastructure code will not show the target function. The target function
is the real function that we want to track.

So, if an FTRACE frame is detected on the stack, just mark the stack trace
as unreliable.

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/arm64/kernel/entry-ftrace.S |  2 ++
 arch/arm64/kernel/stacktrace.c   | 33 ++++++++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

Comments

Mark Rutland March 23, 2021, 10:51 a.m. UTC | #1
On Mon, Mar 15, 2021 at 11:57:57AM -0500, madvenka@linux.microsoft.com wrote:
> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> 
> When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled and tracing is activated
> for a function, the ftrace infrastructure is called for the function at
> the very beginning. Ftrace creates two frames:
> 
> 	- One for the traced function
> 
> 	- One for the caller of the traced function
> 
> That gives a reliable stack trace while executing in the ftrace
> infrastructure code. When ftrace returns to the traced function, the frames
> are popped and everything is back to normal.
> 
> However, in cases like live patch, execution is redirected to a different
> function when ftrace returns. A stack trace taken while still in the ftrace
> infrastructure code will not show the target function. The target function
> is the real function that we want to track.
> 
> So, if an FTRACE frame is detected on the stack, just mark the stack trace
> as unreliable.

To identify this case, please identify the ftrace trampolines instead,
e.g. ftrace_regs_caller, return_to_handler.
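
In rough userspace terms, that check amounts to something like the
following (an illustrative sketch only: struct code_range and the helper
are invented here, and in the kernel the ranges would come from the
extents of the trampoline symbols, not a hand-built table):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical model of "identify the ftrace trampolines": compare the
 * unwound PC against the address ranges of the trampolines instead of
 * tagging frames on the stack.  In the kernel these ranges would be the
 * extents of symbols such as ftrace_regs_caller and return_to_handler.
 */
struct code_range {
	uintptr_t start;
	uintptr_t end;		/* exclusive */
};

/* Return true if pc falls inside any of the given trampoline ranges. */
static bool pc_in_tramp(const struct code_range *ranges, int nranges,
			uintptr_t pc)
{
	for (int i = 0; i < nranges; i++) {
		if (pc >= ranges[i].start && pc < ranges[i].end)
			return true;
	}
	return false;
}
```

An unwinder using this would mark the trace unreliable as soon as any
frame's PC lands inside one of the ranges, with no dependence on frame
contents that the traced code could have corrupted.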

It'd be good to check *exactly* when we need to reject, since IIUC when
we have a graph stack entry the unwind will be correct from livepatch's
PoV.

Thanks,
Mark.

> 
> Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
> ---
>  arch/arm64/kernel/entry-ftrace.S |  2 ++
>  arch/arm64/kernel/stacktrace.c   | 33 ++++++++++++++++++++++++++++++++
>  2 files changed, 35 insertions(+)
> 
> diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
> index b3e4f9a088b1..1ec8c5180fc0 100644
> --- a/arch/arm64/kernel/entry-ftrace.S
> +++ b/arch/arm64/kernel/entry-ftrace.S
> @@ -74,6 +74,8 @@
>  	/* Create our frame record within pt_regs. */
>  	stp	x29, x30, [sp, #S_STACKFRAME]
>  	add	x29, sp, #S_STACKFRAME
> +	ldr	w17, =FTRACE_FRAME
> +	str	w17, [sp, #S_FRAME_TYPE]
>  	.endm
>  
>  SYM_CODE_START(ftrace_regs_caller)
> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 6ae103326f7b..594806a0c225 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -23,6 +23,7 @@ static void check_if_reliable(unsigned long fp, struct stackframe *frame,
>  {
>  	struct pt_regs *regs;
>  	unsigned long regs_start, regs_end;
> +	unsigned long caller_fp;
>  
>  	/*
>  	 * If the stack trace has already been marked unreliable, just
> @@ -68,6 +69,38 @@ static void check_if_reliable(unsigned long fp, struct stackframe *frame,
>  		frame->reliable = false;
>  		return;
>  	}
> +
> +#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
> +	/*
> +	 * When tracing is active for a function, the ftrace code is called
> +	 * from the function even before the frame pointer prolog and
> +	 * epilog. ftrace creates a pt_regs structure on the stack to save
> +	 * register state.
> +	 *
> +	 * In addition, ftrace sets up two stack frames and chains them
> +	 * with other frames on the stack. One frame is pt_regs->stackframe
> +	 * that is for the traced function. The other frame is set up right
> +	 * after the pt_regs structure and it is for the caller of the
> +	 * traced function. This is done to ensure a proper stack trace.
> +	 *
> +	 * If the ftrace code returns to the traced function, then all is
> +	 * fine. But if it transfers control to a different function (like
> +	 * in livepatch), then a stack walk performed while still in the
> +	 * ftrace code will not find the target function.
> +	 *
> +	 * So, mark the stack trace as unreliable if an ftrace frame is
> +	 * detected.
> +	 */
> +	if (regs->frame_type == FTRACE_FRAME && frame->fp == regs_end &&
> +	    frame->fp < info->high) {
> +		/* Check the traced function's caller's frame. */
> +		caller_fp = READ_ONCE_NOCHECK(*(unsigned long *)(frame->fp));
> +		if (caller_fp == regs->regs[29]) {
> +			frame->reliable = false;
> +			return;
> +		}
> +	}
> +#endif
>  }
>  
>  /*
> -- 
> 2.25.1
>
Madhavan T. Venkataraman March 23, 2021, 12:56 p.m. UTC | #2
On 3/23/21 5:51 AM, Mark Rutland wrote:
> On Mon, Mar 15, 2021 at 11:57:57AM -0500, madvenka@linux.microsoft.com wrote:
>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>
>> When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled and tracing is activated
>> for a function, the ftrace infrastructure is called for the function at
>> the very beginning. Ftrace creates two frames:
>>
>> 	- One for the traced function
>>
>> 	- One for the caller of the traced function
>>
>> That gives a reliable stack trace while executing in the ftrace
>> infrastructure code. When ftrace returns to the traced function, the frames
>> are popped and everything is back to normal.
>>
>> However, in cases like live patch, execution is redirected to a different
>> function when ftrace returns. A stack trace taken while still in the ftrace
>> infrastructure code will not show the target function. The target function
>> is the real function that we want to track.
>>
>> So, if an FTRACE frame is detected on the stack, just mark the stack trace
>> as unreliable.
> 
> To identify this case, please identify the ftrace trampolines instead,
> e.g. ftrace_regs_caller, return_to_handler.
> 

Yes. As part of the return address checking, I will check this. IIUC, I think that
I need to check for the inner labels that are defined at the point where the
instructions are patched for ftrace. E.g., ftrace_call and ftrace_graph_call.

SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
        bl      ftrace_stub	<====================================

#ifdef CONFIG_FUNCTION_GRAPH_TRACER
SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) // ftrace_graph_caller();
        nop	<=======                // If enabled, this will be replaced
                                        // with "b ftrace_graph_caller"
#endif

For instance, the stack trace I got while tracing do_mmap() with the stack trace
tracer looks like this:

		 ...
[  338.911793]   trace_function+0xc4/0x160
[  338.911801]   function_stack_trace_call+0xac/0x130
[  338.911807]   ftrace_graph_call+0x0/0x4
[  338.911813]   do_mmap+0x8/0x598
[  338.911820]   vm_mmap_pgoff+0xf4/0x188
[  338.911826]   ksys_mmap_pgoff+0x1d8/0x220
[  338.911832]   __arm64_sys_mmap+0x38/0x50
[  338.911839]   el0_svc_common.constprop.0+0x70/0x1a8
[  338.911846]   do_el0_svc+0x2c/0x98
[  338.911851]   el0_svc+0x2c/0x70
[  338.911859]   el0_sync_handler+0xb0/0xb8
[  338.911864]   el0_sync+0x180/0x1c0

> It'd be good to check *exactly* when we need to reject, since IIUC when
> we have a graph stack entry the unwind will be correct from livepatch's
> PoV.
> 

The current unwinder already handles this like this:

#ifdef CONFIG_FUNCTION_GRAPH_TRACER
        if (tsk->ret_stack &&
                (ptrauth_strip_insn_pac(frame->pc) == (unsigned long)return_to_handler)) {
                struct ftrace_ret_stack *ret_stack;
                /*
                 * This is a case where function graph tracer has
                 * modified a return address (LR) in a stack frame
                 * to hook a function return.
                 * So replace it to an original value.
                 */
                ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++);
                if (WARN_ON_ONCE(!ret_stack))
                        return -EINVAL;
                frame->pc = ret_stack->ret;
        }
#endif /* CONFIG_FUNCTION_GRAPH_TRACER */

Is there anything else that needs handling here?

Thanks,

Madhavan
Mark Rutland March 23, 2021, 1:36 p.m. UTC | #3
On Tue, Mar 23, 2021 at 07:56:40AM -0500, Madhavan T. Venkataraman wrote:
> 
> 
> On 3/23/21 5:51 AM, Mark Rutland wrote:
> > On Mon, Mar 15, 2021 at 11:57:57AM -0500, madvenka@linux.microsoft.com wrote:
> >> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
> >>
> >> When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled and tracing is activated
> >> for a function, the ftrace infrastructure is called for the function at
> >> the very beginning. Ftrace creates two frames:
> >>
> >> 	- One for the traced function
> >>
> >> 	- One for the caller of the traced function
> >>
> >> That gives a reliable stack trace while executing in the ftrace
> >> infrastructure code. When ftrace returns to the traced function, the frames
> >> are popped and everything is back to normal.
> >>
> >> However, in cases like live patch, execution is redirected to a different
> >> function when ftrace returns. A stack trace taken while still in the ftrace
> >> infrastructure code will not show the target function. The target function
> >> is the real function that we want to track.
> >>
> >> So, if an FTRACE frame is detected on the stack, just mark the stack trace
> >> as unreliable.
> > 
> > To identify this case, please identify the ftrace trampolines instead,
> > e.g. ftrace_regs_caller, return_to_handler.
> > 
> 
> Yes. As part of the return address checking, I will check this. IIUC, I think that
> I need to check for the inner labels that are defined at the point where the
> instructions are patched for ftrace. E.g., ftrace_call and ftrace_graph_call.
> 
> SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
>         bl      ftrace_stub	<====================================
> 
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
> SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) // ftrace_graph_caller();
>         nop	<=======                // If enabled, this will be replaced
>                                         // with "b ftrace_graph_caller"
> #endif
> 
> For instance, the stack trace I got while tracing do_mmap() with the stack trace
> tracer looks like this:
> 
> 		 ...
> [  338.911793]   trace_function+0xc4/0x160
> [  338.911801]   function_stack_trace_call+0xac/0x130
> [  338.911807]   ftrace_graph_call+0x0/0x4
> [  338.911813]   do_mmap+0x8/0x598
> [  338.911820]   vm_mmap_pgoff+0xf4/0x188
> [  338.911826]   ksys_mmap_pgoff+0x1d8/0x220
> [  338.911832]   __arm64_sys_mmap+0x38/0x50
> [  338.911839]   el0_svc_common.constprop.0+0x70/0x1a8
> [  338.911846]   do_el0_svc+0x2c/0x98
> [  338.911851]   el0_svc+0x2c/0x70
> [  338.911859]   el0_sync_handler+0xb0/0xb8
> [  338.911864]   el0_sync+0x180/0x1c0
> 
> > It'd be good to check *exactly* when we need to reject, since IIUC when
> > we have a graph stack entry the unwind will be correct from livepatch's
> > PoV.
> > 
> 
> The current unwinder already handles this like this:
> 
> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>         if (tsk->ret_stack &&
>                 (ptrauth_strip_insn_pac(frame->pc) == (unsigned long)return_to_handler)) {
>                 struct ftrace_ret_stack *ret_stack;
>                 /*
>                  * This is a case where function graph tracer has
>                  * modified a return address (LR) in a stack frame
>                  * to hook a function return.
>                  * So replace it to an original value.
>                  */
>                 ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++);
>                 if (WARN_ON_ONCE(!ret_stack))
>                         return -EINVAL;
>                 frame->pc = ret_stack->ret;
>         }
> #endif /* CONFIG_FUNCTION_GRAPH_TRACER */

Beware that this handles the case where a function will return to
return_to_handler, but doesn't handle unwinding from *within*
return_to_handler, which we can't do reliably today, so that might need
special handling.

> Is there anything else that needs handling here?

I wrote up a few cases to consider in:

https://www.kernel.org/doc/html/latest/livepatch/reliable-stacktrace.html

... e.g. the "Obscuring of return addresses" case.

It might be that we're fine so long as we refuse to unwind across
exception boundaries, but it needs some thought. We probably need to go
over each of the trampolines instruction-by-instruction to consider
that.

As mentioned above, within return_to_handler when we call
ftrace_return_to_handler, there's a period where the real return address
has been removed from the ftrace return stack, but hasn't yet been
placed in x30, and wouldn't show up in a trace (e.g. if we could somehow
hook the return from ftrace_return_to_handler).

We might be saved by the fact we'll mark traces across exception
boundaries as unreliable, but I haven't thought very hard about it. We
might want to explicitly reject unwinds within return_to_handler in case
it's possible to interpose ftrace_return_to_handler somehow.

Thanks,
Mark.
Madhavan T. Venkataraman March 23, 2021, 1:38 p.m. UTC | #4
On 3/23/21 8:36 AM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 07:56:40AM -0500, Madhavan T. Venkataraman wrote:
>>
>>
>> On 3/23/21 5:51 AM, Mark Rutland wrote:
>>> On Mon, Mar 15, 2021 at 11:57:57AM -0500, madvenka@linux.microsoft.com wrote:
>>>> From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>
>>>>
>>>> When CONFIG_DYNAMIC_FTRACE_WITH_REGS is enabled and tracing is activated
>>>> for a function, the ftrace infrastructure is called for the function at
>>>> the very beginning. Ftrace creates two frames:
>>>>
>>>> 	- One for the traced function
>>>>
>>>> 	- One for the caller of the traced function
>>>>
>>>> That gives a reliable stack trace while executing in the ftrace
>>>> infrastructure code. When ftrace returns to the traced function, the frames
>>>> are popped and everything is back to normal.
>>>>
>>>> However, in cases like live patch, execution is redirected to a different
>>>> function when ftrace returns. A stack trace taken while still in the ftrace
>>>> infrastructure code will not show the target function. The target function
>>>> is the real function that we want to track.
>>>>
>>>> So, if an FTRACE frame is detected on the stack, just mark the stack trace
>>>> as unreliable.
>>>
>>> To identify this case, please identify the ftrace trampolines instead,
>>> e.g. ftrace_regs_caller, return_to_handler.
>>>
>>
>> Yes. As part of the return address checking, I will check this. IIUC, I think that
>> I need to check for the inner labels that are defined at the point where the
>> instructions are patched for ftrace. E.g., ftrace_call and ftrace_graph_call.
>>
>> SYM_INNER_LABEL(ftrace_call, SYM_L_GLOBAL)
>>         bl      ftrace_stub	<====================================
>>
>> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>> SYM_INNER_LABEL(ftrace_graph_call, SYM_L_GLOBAL) // ftrace_graph_caller();
>>         nop	<=======                // If enabled, this will be replaced
>>                                         // with "b ftrace_graph_caller"
>> #endif
>>
>> For instance, the stack trace I got while tracing do_mmap() with the stack trace
>> tracer looks like this:
>>
>> 		 ...
>> [  338.911793]   trace_function+0xc4/0x160
>> [  338.911801]   function_stack_trace_call+0xac/0x130
>> [  338.911807]   ftrace_graph_call+0x0/0x4
>> [  338.911813]   do_mmap+0x8/0x598
>> [  338.911820]   vm_mmap_pgoff+0xf4/0x188
>> [  338.911826]   ksys_mmap_pgoff+0x1d8/0x220
>> [  338.911832]   __arm64_sys_mmap+0x38/0x50
>> [  338.911839]   el0_svc_common.constprop.0+0x70/0x1a8
>> [  338.911846]   do_el0_svc+0x2c/0x98
>> [  338.911851]   el0_svc+0x2c/0x70
>> [  338.911859]   el0_sync_handler+0xb0/0xb8
>> [  338.911864]   el0_sync+0x180/0x1c0
>>
>>> It'd be good to check *exactly* when we need to reject, since IIUC when
>>> we have a graph stack entry the unwind will be correct from livepatch's
>>> PoV.
>>>
>>
>> The current unwinder already handles this like this:
>>
>> #ifdef CONFIG_FUNCTION_GRAPH_TRACER
>>         if (tsk->ret_stack &&
>>                 (ptrauth_strip_insn_pac(frame->pc) == (unsigned long)return_to_handler)) {
>>                 struct ftrace_ret_stack *ret_stack;
>>                 /*
>>                  * This is a case where function graph tracer has
>>                  * modified a return address (LR) in a stack frame
>>                  * to hook a function return.
>>                  * So replace it to an original value.
>>                  */
>>                 ret_stack = ftrace_graph_get_ret_stack(tsk, frame->graph++);
>>                 if (WARN_ON_ONCE(!ret_stack))
>>                         return -EINVAL;
>>                 frame->pc = ret_stack->ret;
>>         }
>> #endif /* CONFIG_FUNCTION_GRAPH_TRACER */
> 
> Beware that this handles the case where a function will return to
> return_to_handler, but doesn't handle unwinding from *within*
> return_to_handler, which we can't do reliably today, so that might need
> special handling.
> 

OK. I will take a look at this.

>> Is there anything else that needs handling here?
> 
> I wrote up a few cases to consider in:
> 
> https://www.kernel.org/doc/html/latest/livepatch/reliable-stacktrace.html
> 
> ... e.g. the "Obscuring of return addresses" case.
> 
> It might be that we're fine so long as we refuse to unwind across
> exception boundaries, but it needs some thought. We probably need to go
> over each of the trampolines instruction-by-instruction to consider
> that.
> 
> As mentioned above, within return_to_handler when we call
> ftrace_return_to_handler, there's a period where the real return address
> has been removed from the ftrace return stack, but hasn't yet been
> placed in x30, and wouldn't show up in a trace (e.g. if we could somehow
> hook the return from ftrace_return_to_handler).
> 
> We might be saved by the fact we'll mark traces across exception
> boundaries as unreliable, but I haven't thought very hard about it. We
> might want to explicitly reject unwinds within return_to_handler in case
> it's possible to interpose ftrace_return_to_handler somehow.
> 

OK. I will study the above.

Thanks.

Madhavan
Madhavan T. Venkataraman March 23, 2021, 2:15 p.m. UTC | #5
Hi Mark,

I have a general question. When exceptions are nested, how does it work? Let us consider 2 cases:

1. Exception in a page fault handler itself. In this case, I guess one more pt_regs will get
   established in the task stack for the second exception.

2. Exception in an interrupt handler. Here the interrupt handler is running on the IRQ stack.
   Will the pt_regs get created on the IRQ stack?

Also, is there a maximum nesting for exceptions?

Madhavan
Mark Rutland March 23, 2021, 2:57 p.m. UTC | #6
On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
> Hi Mark,
> 
> I have a general question. When exceptions are nested, how does it work? Let us consider 2 cases:
> 
> 1. Exception in a page fault handler itself. In this case, I guess one more pt_regs will get
>    established in the task stack for the second exception.

Generally (ignoring SDEI and stack overflow exceptions) the regs will be
placed on the stack that was in use when the exception occurred, e.g.

  task -> task
  irq -> irq
  overflow -> overflow

For SDEI and stack overflow, we'll place the regs on the relevant SDEI
or overflow stack, e.g.

  task -> overflow
  irq -> overflow

  task -> sdei
  irq -> sdei
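
That mapping can be modelled as a small lookup (an illustrative sketch
with invented names; the real decision is made in the entry assembly,
not in a C helper like this):

```c
#include <stdbool.h>

/* Stack identifiers, modelled after the arm64 unwinder's stack types. */
enum stack_id { STACK_TASK, STACK_IRQ, STACK_OVERFLOW, STACK_SDEI };

/*
 * Sketch of where pt_regs land on exception entry: normally on
 * whichever stack was live when the exception was taken, except that
 * stack-overflow and SDEI exceptions switch to their dedicated stacks
 * before the registers are stacked.
 */
static enum stack_id regs_stack(enum stack_id live, bool is_overflow,
				bool is_sdei)
{
	if (is_overflow)
		return STACK_OVERFLOW;
	if (is_sdei)
		return STACK_SDEI;
	return live;		/* task -> task, irq -> irq, ... */
}
```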

I tried to explain the nesting rules in:

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/stacktrace.c?h=v5.11#n59
  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kernel/stacktrace.c?h=v5.11&id=592700f094be229b5c9cc1192d5cea46eb4c7afc

> 2. Exception in an interrupt handler. Here the interrupt handler is running on the IRQ stack.
>    Will the pt_regs get created on the IRQ stack?

For an interrupt the regs will be placed on the stack that was in use
when the interrupt was taken. The kernel switches to the IRQ stack
*after* stacking the registers. e.g.

  task -> task // subsequently switches to IRQ stack
  irq -> irq

> Also, is there a maximum nesting for exceptions?

In practice, yes, but the specific number isn't a constant, so in the
unwind code we have to act as if there is no limit other than stack
sizing.

We try to prevent certain exceptions from nesting (e.g. debug exceptions
cannot nest), but there are still several levels of nesting, and some
exceptions which can be nested safely (like faults). For example, it's
possible to have a chain:

 syscall -> fault -> interrupt -> fault -> pNMI -> fault -> SError -> fault -> watchpoint -> fault -> overflow -> fault -> BRK

... and potentially longer than that.

The practical limit is the size of all the stacks, and the unwinder's 
stack monotonicity checks ensure that an unwind will terminate.
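
As a rough userspace model of those termination guarantees (an
illustrative sketch with invented names; the real checks live in the
stacktrace.c code linked earlier in this mail):

```c
#include <stdbool.h>
#include <stdint.h>

enum stack_id { STACK_TASK, STACK_IRQ, STACK_OVERFLOW, STACK_SDEI };

struct unwind_state {
	uintptr_t fp;		/* current frame pointer */
	enum stack_id stack;	/* stack the current fp lives on */
	unsigned int visited;	/* bitmask of stacks already unwound */
};

/*
 * A step is valid if fp strictly increases while staying on the same
 * stack, or if it moves to a stack we have not unwound through before.
 * Either way the walk makes forward progress, so with a finite set of
 * finite stacks it must terminate.
 */
static bool unwind_step(struct unwind_state *st, uintptr_t next_fp,
			enum stack_id next_stack)
{
	if (next_stack == st->stack) {
		if (next_fp <= st->fp)
			return false;	/* not monotonic: reject */
	} else {
		if (st->visited & (1U << next_stack))
			return false;	/* revisiting: possible loop */
		st->visited |= 1U << st->stack;	/* done with old stack */
	}
	st->fp = next_fp;
	st->stack = next_stack;
	return true;
}
```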

Thanks,
Mark.
Madhavan T. Venkataraman March 23, 2021, 3:26 p.m. UTC | #7
On 3/23/21 9:57 AM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
>> Hi Mark,
>>
>> I have a general question. When exceptions are nested, how does it work? Let us consider 2 cases:
>>
>> 1. Exception in a page fault handler itself. In this case, I guess one more pt_regs will get
>>    established in the task stack for the second exception.
> 
> Generally (ignoring SDEI and stack overflow exceptions) the regs will be
> placed on the stack that was in use when the exception occurred, e.g.
> 
>   task -> task
>   irq -> irq
>   overflow -> overflow
> 
> For SDEI and stack overflow, we'll place the regs on the relevant SDEI
> or overflow stack, e.g.
> 
>   task -> overflow
>   irq -> overflow
> 
>   task -> sdei
>   irq -> sdei
> 
> I tried to explain the nesting rules in:
> 
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/stacktrace.c?h=v5.11#n59
>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kernel/stacktrace.c?h=v5.11&id=592700f094be229b5c9cc1192d5cea46eb4c7afc
> 
>> 2. Exception in an interrupt handler. Here the interrupt handler is running on the IRQ stack.
>>    Will the pt_regs get created on the IRQ stack?
> 
> For an interrupt the regs will be placed on the stack that was in use
> when the interrupt was taken. The kernel switches to the IRQ stack
> *after* stacking the registers. e.g.
> 
>   task -> task // subsequently switches to IRQ stack
>   irq -> irq
> 
>> Also, is there a maximum nesting for exceptions?
> 
> In practice, yes, but the specific number isn't a constant, so in the
> unwind code we have to act as if there is no limit other than stack
> sizing.
> 
> We try to prevent certain exceptions from nesting (e.g. debug exceptions
> cannot nest), but there are still several levels of nesting, and some
> exceptions which can be nested safely (like faults). For example, it's
> possible to have a chain:
> 
>  syscall -> fault -> interrupt -> fault -> pNMI -> fault -> SError -> fault -> watchpoint -> fault -> overflow -> fault -> BRK
> 
> ... and potentially longer than that.
> 
> The practical limit is the size of all the stacks, and the unwinder's 
> stack monotonicity checks ensure that an unwind will terminate.
> 

Thanks for explaining the nesting. It is now clear to me.

So, my next question is - can we define a practical limit for the nesting so that any nesting beyond that
is fatal? The reason I ask is - if there is a max, then we can allocate an array of stack frames out of
band for the special frames so they are not part of the stack and will not likely get corrupted.

Also, we don't have to do any special detection. If the number of out of band frames used is one or more
then we have exceptions and the stack trace is unreliable.

Thanks.

Madhavan
Madhavan T. Venkataraman March 23, 2021, 4:20 p.m. UTC | #8
On 3/23/21 10:26 AM, Madhavan T. Venkataraman wrote:
> 
> 
> On 3/23/21 9:57 AM, Mark Rutland wrote:
>> On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
>>> Hi Mark,
>>>
>>> I have a general question. When exceptions are nested, how does it work? Let us consider 2 cases:
>>>
>>> 1. Exception in a page fault handler itself. In this case, I guess one more pt_regs will get
>>>    established in the task stack for the second exception.
>>
>> Generally (ignoring SDEI and stack overflow exceptions) the regs will be
>> placed on the stack that was in use when the exception occurred, e.g.
>>
>>   task -> task
>>   irq -> irq
>>   overflow -> overflow
>>
>> For SDEI and stack overflow, we'll place the regs on the relevant SDEI
>> or overflow stack, e.g.
>>
>>   task -> overflow
>>   irq -> overflow
>>
>>   task -> sdei
>>   irq -> sdei
>>
>> I tried to explain the nesting rules in:
>>
>>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/arch/arm64/kernel/stacktrace.c?h=v5.11#n59
>>   https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/arch/arm64/kernel/stacktrace.c?h=v5.11&id=592700f094be229b5c9cc1192d5cea46eb4c7afc
>>
>>> 2. Exception in an interrupt handler. Here the interrupt handler is running on the IRQ stack.
>>>    Will the pt_regs get created on the IRQ stack?
>>
>> For an interrupt the regs will be placed on the stack that was in use
>> when the interrupt was taken. The kernel switches to the IRQ stack
>> *after* stacking the registers. e.g.
>>
>>   task -> task // subsequently switches to IRQ stack
>>   irq -> irq
>>
>>> Also, is there a maximum nesting for exceptions?
>>
>> In practice, yes, but the specific number isn't a constant, so in the
>> unwind code we have to act as if there is no limit other than stack
>> sizing.
>>
>> We try to prevent certain exceptions from nesting (e.g. debug exceptions
>> cannot nest), but there are still several levels of nesting, and some
>> exceptions which can be nested safely (like faults). For example, it's
>> possible to have a chain:
>>
>>  syscall -> fault -> interrupt -> fault -> pNMI -> fault -> SError -> fault -> watchpoint -> fault -> overflow -> fault -> BRK
>>
>> ... and potentially longer than that.
>>
>> The practical limit is the size of all the stacks, and the unwinder's 
>> stack monotonicity checks ensure that an unwind will terminate.
>>
> 
> Thanks for explaining the nesting. It is now clear to me.
> 
> So, my next question is - can we define a practical limit for the nesting so that any nesting beyond that
> is fatal? The reason I ask is - if there is a max, then we can allocate an array of stack frames out of
> band for the special frames so they are not part of the stack and will not likely get corrupted.
> 
> Also, we don't have to do any special detection. If the number of out of band frames used is one or more
> then we have exceptions and the stack trace is unreliable.
> 

Alternatively, if we just increment a counter in the task structure when an exception
is entered and decrement it when the exception returns, that counter will tell us
whether the stack trace is unreliable.
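
Roughly like this (a hypothetical sketch; the field and helpers are
invented for illustration and are not an existing kernel interface):

```c
#include <stdbool.h>

/*
 * Invented model of the counter idea: entry code bumps the counter on
 * exception entry and drops it on return, so the reliable unwinder only
 * has to look at the counter instead of walking to an exception frame.
 */
struct task_model {
	int exception_depth;	/* nested exceptions in progress */
};

static void exception_enter(struct task_model *t)
{
	t->exception_depth++;
}

static void exception_exit(struct task_model *t)
{
	t->exception_depth--;
}

/* A trace can only be reliable if no exception frame is on the stack. */
static bool stack_trace_reliable(const struct task_model *t)
{
	return t->exception_depth == 0;
}
```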

Is this feasible?

I think I have enough for v3 at this point. If you think that the counter idea is OK, I can
implement it in v3. Once you confirm, I will start working on v3.

Thanks for all the input.

Madhavan
Mark Rutland March 23, 2021, 4:48 p.m. UTC | #9
On Tue, Mar 23, 2021 at 10:26:50AM -0500, Madhavan T. Venkataraman wrote:
> On 3/23/21 9:57 AM, Mark Rutland wrote:
> Thanks for explaining the nesting. It is now clear to me.

No problem!

> So, my next question is - can we define a practical limit for the
> nesting so that any nesting beyond that is fatal? The reason I ask is
> - if there is a max, then we can allocate an array of stack frames out
> of band for the special frames so they are not part of the stack and
> will not likely get corrupted.

I suspect we can't define such a fatal limit without introducing a local
DoS vector on some otherwise legitimate workload, and I fear this will
further complicate the entry/exit logic, so I'd prefer to avoid
introducing a new limit.

What exactly do you mean by a "special frame", and why do those need
additional protection over regular frame records?

> Also, we don't have to do any special detection. If the number of out
> of band frames used is one or more then we have exceptions and the
> stack trace is unreliable.

What is this expected to protect against?

Thanks,
Mark.
Madhavan T. Venkataraman March 23, 2021, 4:53 p.m. UTC | #10
On 3/23/21 11:48 AM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 10:26:50AM -0500, Madhavan T. Venkataraman wrote:
>> On 3/23/21 9:57 AM, Mark Rutland wrote:
>> Thanks for explaining the nesting. It is now clear to me.
> 
> No problem!
> 
>> So, my next question is - can we define a practical limit for the
>> nesting so that any nesting beyond that is fatal? The reason I ask is
>> - if there is a max, then we can allocate an array of stack frames out
>> of band for the special frames so they are not part of the stack and
>> will not likely get corrupted.
> 
> I suspect we can't define such a fatal limit without introducing a local
> DoS vector on some otherwise legitimate workload, and I fear this will
> further complicate the entry/exit logic, so I'd prefer to avoid
> introducing a new limit.
> 

I suspected as much. But I thought I will ask anyway.

> What exactly do you mean by a "special frame", and why do those need
> additional protection over regular frame records?
> 

Special frame just means pt_regs->stackframe that is used for exceptions.
No additional protection is needed. I just meant that since they are
out of band, we can reliably tell that there are exceptions without
examining the stack. That is all.

>> Also, we don't have to do any special detection. If the number of out
>> of band frames used is one or more then we have exceptions and the
>> stack trace is unreliable.
> 
> What is this expected to protect against?
> 

It is not a protection thing. I just wanted a reliable way to tell that there
is an exception without having to unwind the stack up to the exception frame.
That is all.

Thanks.

Madhavan
Mark Rutland March 23, 2021, 5:02 p.m. UTC | #11
On Tue, Mar 23, 2021 at 11:20:44AM -0500, Madhavan T. Venkataraman wrote:
> On 3/23/21 10:26 AM, Madhavan T. Venkataraman wrote:
> > On 3/23/21 9:57 AM, Mark Rutland wrote:
> >> On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
> > So, my next question is - can we define a practical limit for the
> > nesting so that any nesting beyond that is fatal? The reason I ask
> > is - if there is a max, then we can allocate an array of stack
> > frames out of band for the special frames so they are not part of
> > the stack and will not likely get corrupted.
> > 
> > Also, we don't have to do any special detection. If the number of
> > out of band frames used is one or more then we have exceptions and
> > the stack trace is unreliable.
> 
> Alternatively, if we can just increment a counter in the task
> structure when an exception is entered and decrement it when an
> exception returns, that counter will tell us that the stack trace is
> unreliable.

As I noted earlier, we must treat *any* EL1 exception boundary as
unreliable for unwinding, and per my other comments w.r.t. corrupting
the call chain I don't think we need additional protection on exception
boundaries specifically.

> Is this feasible?
> 
> I think I have enough for v3 at this point. If you think that the
> counter idea is OK, I can implement it in v3. Once you confirm, I will
> start working on v3.

Currently, I don't see a compelling reason to need this, and would
prefer to avoid it.

More generally, could we please break this work into smaller steps? I
reckon we can break this down into the following chunks:

1. Add the explicit final frame and associated handling. I suspect that
   this is complicated enough on its own to be an independent series,
   and it's something that we can merge without all the bits and pieces
   necessary for truly reliable stacktracing.

2. Figure out how we must handle kprobes and ftrace. That probably means
   rejecting unwinds from specific places, but we might also want to
   adjust the trampolines if that makes this easier.

3. Figure out exception boundary handling. I'm currently working to
   simplify the entry assembly down to a uniform set of stubs, and I'd
   prefer to get that sorted before we teach the unwinder about
   exception boundaries, as it'll be significantly simpler to reason
   about and won't end up clashing with the rework.

Thanks,
Mark.
Mark Rutland March 23, 2021, 5:09 p.m. UTC | #12
On Tue, Mar 23, 2021 at 11:53:04AM -0500, Madhavan T. Venkataraman wrote:
> On 3/23/21 11:48 AM, Mark Rutland wrote:
> > On Tue, Mar 23, 2021 at 10:26:50AM -0500, Madhavan T. Venkataraman wrote:
> >> So, my next question is - can we define a practical limit for the
> >> nesting so that any nesting beyond that is fatal? The reason I ask is
> >> - if there is a max, then we can allocate an array of stack frames out
> >> of band for the special frames so they are not part of the stack and
> >> will not likely get corrupted.

> >> Also, we don't have to do any special detection. If the number of out
> >> of band frames used is one or more then we have exceptions and the
> >> stack trace is unreliable.
> > 
> > What is expected to protect against?
> 
> It is not a protection thing. I just wanted a reliable way to tell that there
> is an exception without having to unwind the stack up to the exception frame.
> That is all.

I see.

Given that's an optimization, we can consider doing something like that
after we have the functional bits in place, where we'll be in a position
to see whether this is even a measurable concern in practice.

I suspect that longer-term we'll end up trying to use metadata to unwind
across exception boundaries, since it's possible to get blocked within
those for long periods (e.g. for a uaccess fault), and the larger scale
optimization for patching is to not block the patch.

Thanks,
Mark.
Madhavan T. Venkataraman March 23, 2021, 5:23 p.m. UTC | #13
On 3/23/21 12:02 PM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 11:20:44AM -0500, Madhavan T. Venkataraman wrote:
>> On 3/23/21 10:26 AM, Madhavan T. Venkataraman wrote:
>>> On 3/23/21 9:57 AM, Mark Rutland wrote:
>>>> On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
>>> So, my next question is - can we define a practical limit for the
>>> nesting so that any nesting beyond that is fatal? The reason I ask
>>> is - if there is a max, then we can allocate an array of stack
>>> frames out of band for the special frames so they are not part of
>>> the stack and will not likely get corrupted.
>>>
>>> Also, we don't have to do any special detection. If the number of
>>> out of band frames used is one or more then we have exceptions and
>>> the stack trace is unreliable.
>>
>> Alternatively, if we can just increment a counter in the task
>> structure when an exception is entered and decrement it when an
>> exception returns, that counter will tell us that the stack trace is
>> unreliable.
> 
> As I noted earlier, we must treat *any* EL1 exception boundary as
> unreliable for unwinding, and per my other comments w.r.t. corrupting
> the call chain I don't think we need additional protection on exception
> boundaries specifically.
> 
>> Is this feasible?
>>
>> I think I have enough for v3 at this point. If you think that the
>> counter idea is OK, I can implement it in v3. Once you confirm, I will
>> start working on v3.
> 
> Currently, I don't see a compelling reason to need this, and would
> prefer to avoid it.
> 

I think that I did a bad job of explaining what I wanted to do. It is not
for any additional protection at all.

So, let us say we create a field in the task structure:

	u64		unreliable_stack;

Whenever an EL1 exception is entered or FTRACE is entered and pt_regs get
set up and pt_regs->stackframe gets chained, increment unreliable_stack.
On exiting the above, decrement unreliable_stack.

In arch_stack_walk_reliable(), simply do this check upfront:

	if (task->unreliable_stack)
		return -EINVAL;

This way, the function does not even bother unwinding the stack to find
exception frames or checking for different return addresses or anything.
We also don't have to worry about code being reorganized, functions
being renamed, etc. It also may help in debugging to know if a task is
experiencing an exception and the level of nesting, etc.
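
The proposal above could be sketched in user space roughly as follows.
This is only an illustration of the idea, not kernel code: the field and
function names (unreliable_stack, special_frame_enter/exit) mirror the
proposal, and the enter/exit helpers stand in for the real EL1 exception
and ftrace entry/return paths.

```c
#include <assert.h>
#include <errno.h>

/* Stand-in for the task structure field proposed above. */
struct task_struct {
	unsigned long long unreliable_stack;	/* nested special frames */
};

/* Called when an EL1 exception or ftrace entry chains pt_regs->stackframe. */
static void special_frame_enter(struct task_struct *task)
{
	task->unreliable_stack++;
}

/* Called on return from the exception or the ftrace trampoline. */
static void special_frame_exit(struct task_struct *task)
{
	task->unreliable_stack--;
}

/* Upfront check: refuse to unwind while any special frame is live. */
static int arch_stack_walk_reliable(struct task_struct *task)
{
	if (task->unreliable_stack)
		return -EINVAL;
	return 0;	/* the actual unwind would happen here */
}
```

Nested exceptions just increment the counter further, so the check stays
a single load and compare regardless of nesting depth.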

> More generally, could we please break this work into smaller steps? I
> reckon we can break this down into the following chunks:
> 
> 1. Add the explicit final frame and associated handling. I suspect that
>    this is complicated enough on its own to be an independent series,
>    and it's something that we can merge without all the bits and pieces
>    necessary for truly reliable stacktracing.
> 

OK. I can do that.

> 2. Figure out how we must handle kprobes and ftrace. That probably means
>    rejecting unwinds from specific places, but we might also want to
>    adjust the trampolines if that makes this easier.
> 

I think I am already doing all the checks except the one you mentioned
earlier. Yes, I can do this separately.

> 3. Figure out exception boundary handling. I'm currently working to
>    simplify the entry assembly down to a uniform set of stubs, and I'd
>    prefer to get that sorted before we teach the unwinder about
>    exception boundaries, as it'll be significantly simpler to reason
>    about and won't end up clashing with the rework.
> 

So, here is where I still have a question. Is it necessary for the unwinder
to know the exception boundaries? Is it not enough if it knows if there are
exceptions present? For instance, using something like num_special_frames
I suggested above?

Thanks,

Madhavan
Madhavan T. Venkataraman March 23, 2021, 5:27 p.m. UTC | #14
On 3/23/21 12:23 PM, Madhavan T. Venkataraman wrote:
> 
> 
> On 3/23/21 12:02 PM, Mark Rutland wrote:
>> On Tue, Mar 23, 2021 at 11:20:44AM -0500, Madhavan T. Venkataraman wrote:
>>> On 3/23/21 10:26 AM, Madhavan T. Venkataraman wrote:
>>>> On 3/23/21 9:57 AM, Mark Rutland wrote:
>>>>> On Tue, Mar 23, 2021 at 09:15:36AM -0500, Madhavan T. Venkataraman wrote:
>>>> So, my next question is - can we define a practical limit for the
>>>> nesting so that any nesting beyond that is fatal? The reason I ask
>>>> is - if there is a max, then we can allocate an array of stack
>>>> frames out of band for the special frames so they are not part of
>>>> the stack and will not likely get corrupted.
>>>>
>>>> Also, we don't have to do any special detection. If the number of
>>>> out of band frames used is one or more then we have exceptions and
>>>> the stack trace is unreliable.
>>>
>>> Alternatively, if we can just increment a counter in the task
>>> structure when an exception is entered and decrement it when an
>>> exception returns, that counter will tell us that the stack trace is
>>> unreliable.
>>
>> As I noted earlier, we must treat *any* EL1 exception boundary as
>> unreliable for unwinding, and per my other comments w.r.t. corrupting
>> the call chain I don't think we need additional protection on exception
>> boundaries specifically.
>>
>>> Is this feasible?
>>>
>>> I think I have enough for v3 at this point. If you think that the
>>> counter idea is OK, I can implement it in v3. Once you confirm, I will
>>> start working on v3.
>>
>> Currently, I don't see a compelling reason to need this, and would
>> prefer to avoid it.
>>
> 
> I think that I did a bad job of explaining what I wanted to do. It is not
> for any additional protection at all.
> 
> So, let us say we create a field in the task structure:
> 
> 	u64		unreliable_stack;
> 
> Whenever an EL1 exception is entered or FTRACE is entered and pt_regs get
> set up and pt_regs->stackframe gets chained, increment unreliable_stack.
> On exiting the above, decrement unreliable_stack.
> 
> In arch_stack_walk_reliable(), simply do this check upfront:
> 
> 	if (task->unreliable_stack)
> 		return -EINVAL;
> 
> This way, the function does not even bother unwinding the stack to find
> exception frames or checking for different return addresses or anything.
> We also don't have to worry about code being reorganized, functions
> being renamed, etc. It also may help in debugging to know if a task is
> experiencing an exception and the level of nesting, etc.
> 
>> More generally, could we please break this work into smaller steps? I
>> reckon we can break this down into the following chunks:
>>
>> 1. Add the explicit final frame and associated handling. I suspect that
>>    this is complicated enough on its own to be an independent series,
>>    and it's something that we can merge without all the bits and pieces
>>    necessary for truly reliable stacktracing.
>>
> 
> OK. I can do that.
> 
>> 2. Figure out how we must handle kprobes and ftrace. That probably means
>>    rejecting unwinds from specific places, but we might also want to
>>    adjust the trampolines if that makes this easier.
>>
> 
> I think I am already doing all the checks except the one you mentioned
> earlier. Yes, I can do this separately.
> 
>> 3. Figure out exception boundary handling. I'm currently working to
>>    simplify the entry assembly down to a uniform set of stubs, and I'd
>>    prefer to get that sorted before we teach the unwinder about
>>    exception boundaries, as it'll be significantly simpler to reason
>>    about and won't end up clashing with the rework.
>>
> 
> So, here is where I still have a question. Is it necessary for the unwinder
> to know the exception boundaries? Is it not enough if it knows if there are
> exceptions present? For instance, using something like num_special_frames

Typo - num_special_frames should be unreliable_stack. That is the name of
the counter I used above.

Sorry about that.

Madhavan
Mark Brown March 23, 2021, 6:27 p.m. UTC | #15
On Tue, Mar 23, 2021 at 12:23:34PM -0500, Madhavan T. Venkataraman wrote:
> On 3/23/21 12:02 PM, Mark Rutland wrote:

> > 3. Figure out exception boundary handling. I'm currently working to
> >    simplify the entry assembly down to a uniform set of stubs, and I'd
> >    prefer to get that sorted before we teach the unwinder about
> >    exception boundaries, as it'll be significantly simpler to reason
> >    about and won't end up clashing with the rework.

> So, here is where I still have a question. Is it necessary for the unwinder
> to know the exception boundaries? Is it not enough if it knows if there are
> exceptions present? For instance, using something like num_special_frames
> I suggested above?

For reliable stack trace we can live with just flagging things as
unreliable when we know there's an exception boundary somewhere, but (as
Mark mentioned elsewhere) being able to actually go through a subset of
exception boundaries safely is likely to help usefully improve the
performance of live patching. For defensiveness we want to try to detect
boundaries during an actual unwind anyway, so this ends up being a
performance improvement and a double check rather than saving us code.
A better understanding of what's going on in the presence of exceptions
may also help other users of the unwinder, which can use stacks that
aren't reliable, to get better results.
Mark Rutland March 23, 2021, 6:30 p.m. UTC | #16
On Tue, Mar 23, 2021 at 12:23:34PM -0500, Madhavan T. Venkataraman wrote:
> On 3/23/21 12:02 PM, Mark Rutland wrote:

[...]

> I think that I did a bad job of explaining what I wanted to do. It is not
> for any additional protection at all.
> 
> So, let us say we create a field in the task structure:
> 
> 	u64		unreliable_stack;
> 
> Whenever an EL1 exception is entered or FTRACE is entered and pt_regs get
> set up and pt_regs->stackframe gets chained, increment unreliable_stack.
> On exiting the above, decrement unreliable_stack.
> 
> In arch_stack_walk_reliable(), simply do this check upfront:
> 
> 	if (task->unreliable_stack)
> 		return -EINVAL;
> 
> This way, the function does not even bother unwinding the stack to find
> exception frames or checking for different return addresses or anything.
> We also don't have to worry about code being reorganized, functions
> being renamed, etc. It also may help in debugging to know if a task is
> experiencing an exception and the level of nesting, etc.

As in my other reply, since this is an optimization that is not
necessary for functional correctness, I would prefer to avoid this for
now. We can reconsider that in future if we encounter performance
problems.

Even with this there will be cases where we have to identify
non-unwindable functions explicitly (e.g. the patchable-function-entry
trampolines, where the real return address is in x9), and I'd prefer
that we use one mechanism consistently.

I suspect that in the future we'll need to unwind across exception
boundaries using metadata, and we can treat the non-unwindable metadata
in the same way.

[...]

> > 3. Figure out exception boundary handling. I'm currently working to
> >    simplify the entry assembly down to a uniform set of stubs, and I'd
> >    prefer to get that sorted before we teach the unwinder about
> >    exception boundaries, as it'll be significantly simpler to reason
> >    about and won't end up clashing with the rework.
> 
> So, here is where I still have a question. Is it necessary for the unwinder
> to know the exception boundaries? Is it not enough if it knows if there are
> exceptions present? For instance, using something like num_special_frames
> I suggested above?

I agree that it would be legitimate to bail out early if we knew there
was going to be an exception somewhere in the trace. Regardless, I think
it's simpler overall to identify non-unwindability during the trace, and
doing that during the trace aligns more closely with the structure that
we'll need to permit unwinding across these boundaries in future, so I'd
prefer we do that rather than trying to optimize for early returns
today.

Thanks,
Mark.
Madhavan T. Venkataraman March 23, 2021, 8:23 p.m. UTC | #17
On 3/23/21 1:27 PM, Mark Brown wrote:
> On Tue, Mar 23, 2021 at 12:23:34PM -0500, Madhavan T. Venkataraman wrote:
>> On 3/23/21 12:02 PM, Mark Rutland wrote:
> 
>>> 3. Figure out exception boundary handling. I'm currently working to
>>>    simplify the entry assembly down to a uniform set of stubs, and I'd
>>>    prefer to get that sorted before we teach the unwinder about
>>>    exception boundaries, as it'll be significantly simpler to reason
>>>    about and won't end up clashing with the rework.
> 
>> So, here is where I still have a question. Is it necessary for the unwinder
>> to know the exception boundaries? Is it not enough if it knows if there are
>> exceptions present? For instance, using something like num_special_frames
>> I suggested above?
> 
> For reliable stack trace we can live with just flagging things as
> unreliable when we know there's an exception boundary somewhere, but (as
> Mark mentioned elsewhere) being able to actually go through a subset of
> exception boundaries safely is likely to help usefully improve the
> performance of live patching. For defensiveness we want to try to detect
> boundaries during an actual unwind anyway, so this ends up being a
> performance improvement and a double check rather than saving us code.
> A better understanding of what's going on in the presence of exceptions
> may also help other users of the unwinder, which can use stacks that
> aren't reliable, to get better results.
> 

Actually, I was not suggesting that the counter replace the unwinder
intelligence to recognize exception boundaries. I was only suggesting
the use of the counter for arch_stack_walk_reliable().

But I am fine with not implementing the counter for now.

Madhavan
Madhavan T. Venkataraman March 23, 2021, 8:24 p.m. UTC | #18
On 3/23/21 1:30 PM, Mark Rutland wrote:
> On Tue, Mar 23, 2021 at 12:23:34PM -0500, Madhavan T. Venkataraman wrote:
>> On 3/23/21 12:02 PM, Mark Rutland wrote:
> 
> [...]
> 
>> I think that I did a bad job of explaining what I wanted to do. It is not
>> for any additional protection at all.
>>
>> So, let us say we create a field in the task structure:
>>
>> 	u64		unreliable_stack;
>>
>> Whenever an EL1 exception is entered or FTRACE is entered and pt_regs get
>> set up and pt_regs->stackframe gets chained, increment unreliable_stack.
>> On exiting the above, decrement unreliable_stack.
>>
>> In arch_stack_walk_reliable(), simply do this check upfront:
>>
>> 	if (task->unreliable_stack)
>> 		return -EINVAL;
>>
>> This way, the function does not even bother unwinding the stack to find
>> exception frames or checking for different return addresses or anything.
>> We also don't have to worry about code being reorganized, functions
>> being renamed, etc. It also may help in debugging to know if a task is
>> experiencing an exception and the level of nesting, etc.
> 
> As in my other reply, since this is an optimization that is not
> necessary for functional correctness, I would prefer to avoid this for
> now. We can reconsider that in future if we encounter performance
> problems.
> 
> Even with this there will be cases where we have to identify
> non-unwindable functions explicitly (e.g. the patchable-function-entry
> trampolines, where the real return address is in x9), and I'd prefer
> that we use one mechanism consistently.
> 
> I suspect that in the future we'll need to unwind across exception
> boundaries using metadata, and we can treat the non-unwindable metadata
> in the same way.
> 
> [...]
> 
>>> 3. Figure out exception boundary handling. I'm currently working to
>>>    simplify the entry assembly down to a uniform set of stubs, and I'd
>>>    prefer to get that sorted before we teach the unwinder about
>>>    exception boundaries, as it'll be significantly simpler to reason
>>>    about and won't end up clashing with the rework.
>>
>> So, here is where I still have a question. Is it necessary for the unwinder
>> to know the exception boundaries? Is it not enough if it knows if there are
>> exceptions present? For instance, using something like num_special_frames
>> I suggested above?
> 
> I agree that it would be legitimate to bail out early if we knew there
> was going to be an exception somewhere in the trace. Regardless, I think
> it's simpler overall to identify non-unwindability during the trace, and
> doing that during the trace aligns more closely with the structure that
> we'll need to permit unwinding across these boundaries in future, so I'd
> prefer we do that rather than trying to optimize for early returns
> today.
> 

OK. Fair enough.

Thanks.

Madhavan
Madhavan T. Venkataraman March 23, 2021, 9:04 p.m. UTC | #19
Thanks for all the input - Mark Rutland and Mark Brown.

I will send out the stack termination patch next. Since I am splitting
the original series into 3 separate series, I will change the titles and
start with version 1 in each case, if there is no objection.

Again, Thanks.

Madhavan

On 3/23/21 3:24 PM, Madhavan T. Venkataraman wrote:
> 
> 
> On 3/23/21 1:30 PM, Mark Rutland wrote:
>> On Tue, Mar 23, 2021 at 12:23:34PM -0500, Madhavan T. Venkataraman wrote:
>>> On 3/23/21 12:02 PM, Mark Rutland wrote:
>>
>> [...]
>>
>>> I think that I did a bad job of explaining what I wanted to do. It is not
>>> for any additional protection at all.
>>>
>>> So, let us say we create a field in the task structure:
>>>
>>> 	u64		unreliable_stack;
>>>
>>> Whenever an EL1 exception is entered or FTRACE is entered and pt_regs get
>>> set up and pt_regs->stackframe gets chained, increment unreliable_stack.
>>> On exiting the above, decrement unreliable_stack.
>>>
>>> In arch_stack_walk_reliable(), simply do this check upfront:
>>>
>>> 	if (task->unreliable_stack)
>>> 		return -EINVAL;
>>>
>>> This way, the function does not even bother unwinding the stack to find
>>> exception frames or checking for different return addresses or anything.
>>> We also don't have to worry about code being reorganized, functions
>>> being renamed, etc. It also may help in debugging to know if a task is
>>> experiencing an exception and the level of nesting, etc.
>>
>> As in my other reply, since this is an optimization that is not
>> necessary for functional correctness, I would prefer to avoid this for
>> now. We can reconsider that in future if we encounter performance
>> problems.
>>
>> Even with this there will be cases where we have to identify
>> non-unwindable functions explicitly (e.g. the patchable-function-entry
>> trampolines, where the real return address is in x9), and I'd prefer
>> that we use one mechanism consistently.
>>
>> I suspect that in the future we'll need to unwind across exception
>> boundaries using metadata, and we can treat the non-unwindable metadata
>> in the same way.
>>
>> [...]
>>
>>>> 3. Figure out exception boundary handling. I'm currently working to
>>>>    simplify the entry assembly down to a uniform set of stubs, and I'd
>>>>    prefer to get that sorted before we teach the unwinder about
>>>>    exception boundaries, as it'll be significantly simpler to reason
>>>>    about and won't end up clashing with the rework.
>>>
>>> So, here is where I still have a question. Is it necessary for the unwinder
>>> to know the exception boundaries? Is it not enough if it knows if there are
>>> exceptions present? For instance, using something like num_special_frames
>>> I suggested above?
>>
>> I agree that it would be legitimate to bail out early if we knew there
>> was going to be an exception somewhere in the trace. Regardless, I think
>> it's simpler overall to identify non-unwindability during the trace, and
>> doing that during the trace aligns more closely with the structure that
>> we'll need to permit unwinding across these boundaries in future, so I'd
>> prefer we do that rather than trying to optimize for early returns
>> today.
>>
> 
> OK. Fair enough.
> 
> Thanks.
> 
> Madhavan
>
Patch

diff --git a/arch/arm64/kernel/entry-ftrace.S b/arch/arm64/kernel/entry-ftrace.S
index b3e4f9a088b1..1ec8c5180fc0 100644
--- a/arch/arm64/kernel/entry-ftrace.S
+++ b/arch/arm64/kernel/entry-ftrace.S
@@ -74,6 +74,8 @@ 
 	/* Create our frame record within pt_regs. */
 	stp	x29, x30, [sp, #S_STACKFRAME]
 	add	x29, sp, #S_STACKFRAME
+	ldr	w17, =FTRACE_FRAME
+	str	w17, [sp, #S_FRAME_TYPE]
 	.endm
 
 SYM_CODE_START(ftrace_regs_caller)
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index 6ae103326f7b..594806a0c225 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -23,6 +23,7 @@  static void check_if_reliable(unsigned long fp, struct stackframe *frame,
 {
 	struct pt_regs *regs;
 	unsigned long regs_start, regs_end;
+	unsigned long caller_fp;
 
 	/*
 	 * If the stack trace has already been marked unreliable, just
@@ -68,6 +69,38 @@  static void check_if_reliable(unsigned long fp, struct stackframe *frame,
 		frame->reliable = false;
 		return;
 	}
+
+#ifdef CONFIG_DYNAMIC_FTRACE_WITH_REGS
+	/*
+	 * When tracing is active for a function, the ftrace code is called
+	 * from the function even before the frame pointer prolog and
+	 * epilog. ftrace creates a pt_regs structure on the stack to save
+	 * register state.
+	 *
+	 * In addition, ftrace sets up two stack frames and chains them
+	 * with other frames on the stack. One frame is pt_regs->stackframe
+	 * that is for the traced function. The other frame is set up right
+	 * after the pt_regs structure and it is for the caller of the
+	 * traced function. This is done to ensure a proper stack trace.
+	 *
+	 * If the ftrace code returns to the traced function, then all is
+	 * fine. But if it transfers control to a different function (like
+	 * in livepatch), then a stack walk performed while still in the
+	 * ftrace code will not find the target function.
+	 *
+	 * So, mark the stack trace as unreliable if an ftrace frame is
+	 * detected.
+	 */
+	if (regs->frame_type == FTRACE_FRAME && frame->fp == regs_end &&
+	    frame->fp < info->high) {
+		/* Check the traced function's caller's frame. */
+		caller_fp = READ_ONCE_NOCHECK(*(unsigned long *)(frame->fp));
+		if (caller_fp == regs->regs[29]) {
+			frame->reliable = false;
+			return;
+		}
+	}
+#endif
 }
 
 /*