diff mbox series

[RFC,v8,3/4] arm64: Introduce stack trace reliability checks in the unwinder

Message ID 20210812190603.25326-4-madvenka@linux.microsoft.com (mailing list archive)
State New, archived
Series arm64: Reorganize the unwinder and implement stack trace reliability checks

Commit Message

Madhavan T. Venkataraman Aug. 12, 2021, 7:06 p.m. UTC
From: "Madhavan T. Venkataraman" <madvenka@linux.microsoft.com>

There are some kernel features and conditions that make a stack trace
unreliable. Callers such as livepatch may require the unwinder to detect
these cases.

Introduce a new function called unwind_is_reliable() that will detect
these cases and return a boolean.

Introduce a new argument to unwind() called "need_reliable" so a caller
can tell unwind() that it requires a reliable stack trace. For such a
caller, any unreliability in the stack trace must be treated as a fatal
error and the unwind must be aborted.

Call unwind_is_reliable() from unwind_consume() like this:

	if (frame->need_reliable && !unwind_is_reliable(frame)) {
		frame->failed = true;
		return false;
	}

In other words, if the return PC in the stackframe falls in unreliable code,
then it cannot be unwound reliably.

arch_stack_walk() will pass "false" for need_reliable because its callers
don't care about reliability. arch_stack_walk() is used for debug and
test purposes.

Introduce arch_stack_walk_reliable() for ARM64. This works like
arch_stack_walk() except for two things:

	- It passes "true" for need_reliable.

	- It returns -EINVAL if unwind() says that the stack trace is
	  unreliable.

Introduce the first reliability check in unwind_is_reliable(): if a
return PC is not a valid kernel text address, consider the stack trace
unreliable, since the PC could point into some generated code.

Other reliability checks will be added in the future. Until all of the
checks are in place, arch_stack_walk_reliable() may not be used by
livepatch. But it may be used by debug and test code.

Signed-off-by: Madhavan T. Venkataraman <madvenka@linux.microsoft.com>
---
 arch/arm64/include/asm/stacktrace.h |  4 ++
 arch/arm64/kernel/stacktrace.c      | 63 +++++++++++++++++++++++++++--
 2 files changed, 63 insertions(+), 4 deletions(-)
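
As a rough illustration of the commit message above, the following user-space sketch models how need_reliable turns an unreliable return PC into a fatal error. It is not kernel code: model_walk(), model_is_reliable(), and the MODEL_TEXT_* range are hypothetical stand-ins (the real check uses __kernel_text_address()), and an array of PCs replaces actual frame-pointer unwinding.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical text range standing in for __kernel_text_address(). */
#define MODEL_TEXT_START 0x1000UL
#define MODEL_TEXT_END   0x2000UL

/* Reduced model of struct stackframe: only the fields this sketch needs. */
struct walk_frame {
	unsigned long pc;
	bool failed;
	bool need_reliable;
};

/* Models unwind_is_reliable(): only the text-address check exists so far. */
static bool model_is_reliable(const struct walk_frame *frame)
{
	return frame->pc >= MODEL_TEXT_START && frame->pc < MODEL_TEXT_END;
}

/*
 * Walk a pre-recorded array of return PCs (an assumption replacing real
 * unwinding). Returns 0 if the walk completed, or -EINVAL if a reliable
 * trace was requested and some PC failed the check.
 */
int model_walk(const unsigned long *pcs, size_t n, bool need_reliable)
{
	struct walk_frame frame = {
		.failed = false,
		.need_reliable = need_reliable,
	};
	size_t i;

	assert(pcs != NULL);
	for (i = 0; i < n; i++) {
		frame.pc = pcs[i];
		if (frame.need_reliable && !model_is_reliable(&frame)) {
			/* Cannot continue reliably: abort the walk. */
			frame.failed = true;
			break;
		}
	}
	return frame.failed ? -EINVAL : 0;
}
```

With need_reliable set to false the same PCs are walked to the end, which corresponds to how arch_stack_walk() is described as behaving; with true, the first PC outside the modelled text range ends the walk with -EINVAL.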

Comments

nobuta.keiya@fujitsu.com Aug. 24, 2021, 5:55 a.m. UTC | #1
Hi Madhavan,

> @@ -245,7 +271,36 @@ noinline notrace void arch_stack_walk(stack_trace_consume_fn consume_entry,
>  		fp = thread_saved_fp(task);
>  		pc = thread_saved_pc(task);
>  	}
> -	unwind(consume_entry, cookie, task, fp, pc);
> +	unwind(consume_entry, cookie, task, fp, pc, false);
> +}
> +
> +/*
> + * arch_stack_walk_reliable() may not be used for livepatch until all of
> + * the reliability checks are in place in unwind_consume(). However,
> + * debug and test code can choose to use it even if all the checks are not
> + * in place.
> + */

I'm glad to see the long-awaited function :)

Does the above comment mean that it will be removed by another patch
series about livepatch enablement, rather than by [PATCH 4/4]?

It seems this will take time, but I will start thinking about test code.

Thanks,
Keiya


> +noinline int notrace arch_stack_walk_reliable(stack_trace_consume_fn consume_fn,
> +					      void *cookie,
> +					      struct task_struct *task)
> +{
> +	unsigned long fp, pc;
> +
> +	if (!task)
> +		task = current;
> +
> +	if (task == current) {
> +		/* Skip arch_stack_walk_reliable() in the stack trace. */
> +		fp = (unsigned long)__builtin_frame_address(1);
> +		pc = (unsigned long)__builtin_return_address(0);
> +	} else {
> +		/* Caller guarantees that the task is not running. */
> +		fp = thread_saved_fp(task);
> +		pc = thread_saved_pc(task);
> +	}
> +	if (unwind(consume_fn, cookie, task, fp, pc, true))
> +		return 0;
> +	return -EINVAL;
>  }
> 
>  #endif
> --
> 2.25.1
Madhavan T. Venkataraman Aug. 24, 2021, 12:19 p.m. UTC | #2
On 8/24/21 12:55 AM, nobuta.keiya@fujitsu.com wrote:
> Hi Madhavan,
> 
>> @@ -245,7 +271,36 @@ noinline notrace void arch_stack_walk(stack_trace_consume_fn consume_entry,
>>  		fp = thread_saved_fp(task);
>>  		pc = thread_saved_pc(task);
>>  	}
>> -	unwind(consume_entry, cookie, task, fp, pc);
>> +	unwind(consume_entry, cookie, task, fp, pc, false);
>> +}
>> +
>> +/*
>> + * arch_stack_walk_reliable() may not be used for livepatch until all of
>> + * the reliability checks are in place in unwind_consume(). However,
>> + * debug and test code can choose to use it even if all the checks are not
>> + * in place.
>> + */
> 
> I'm glad to see the long-awaited function :)
> 
> Does the above comment mean that it will be removed by another patch
> series about livepatch enablement, rather than by [PATCH 4/4]?
> 
> It seems this will take time, but I will start thinking about test code.
> 

Yes. This comment will be removed when livepatch is eventually enabled.
So, AFAICT, there are 4 pieces that are needed:

- Reliable stack trace in the kernel. I am trying to address that with my patch
  series.

- Mark Rutland's work for making patching safe on ARM64.

- Objtool (or alternative method) for stack validation.

- Suraj Jitindar Singh's patch for miscellaneous things needed to enable live patch.

Once all of these pieces are in place, livepatch can be enabled.

That said, arch_stack_walk_reliable() can be used for test and debug purposes anytime
once this patch series gets accepted.

Thanks.

Madhavan
nobuta.keiya@fujitsu.com Aug. 25, 2021, 12:01 a.m. UTC | #3
> > Hi Madhavan,
> >
> >> @@ -245,7 +271,36 @@ noinline notrace void arch_stack_walk(stack_trace_consume_fn consume_entry,
> >>  		fp = thread_saved_fp(task);
> >>  		pc = thread_saved_pc(task);
> >>  	}
> >> -	unwind(consume_entry, cookie, task, fp, pc);
> >> +	unwind(consume_entry, cookie, task, fp, pc, false);
> >> +}
> >> +
> >> +/*
> >> + * arch_stack_walk_reliable() may not be used for livepatch until all of
> >> + * the reliability checks are in place in unwind_consume(). However,
> >> + * debug and test code can choose to use it even if all the checks are not
> >> + * in place.
> >> + */
> >
> > I'm glad to see the long-awaited function :)
> >
> > Does the above comment mean that it will be removed by another patch
> > series about livepatch enablement, rather than by [PATCH 4/4]?
> >
> > It seems this will take time, but I will start thinking about test code.
> >
> 
> Yes. This comment will be removed when livepatch is eventually enabled.
> So, AFAICT, there are 4 pieces that are needed:
> 
> - Reliable stack trace in the kernel. I am trying to address that with my patch
>   series.
> 
> - Mark Rutland's work for making patching safe on ARM64.
> 
> - Objtool (or alternative method) for stack validation.
> 
> - Suraj Jitindar Singh's patch for miscellaneous things needed to enable live patch.
> 
> Once all of these pieces are in place, livepatch can be enabled.
> 
> That said, arch_stack_walk_reliable() can be used for test and debug purposes anytime once this patch series gets accepted.
> 
> Thanks.
> 
> Madhavan


Thank you for the information.

Keiya
Mark Brown Aug. 26, 2021, 3:57 p.m. UTC | #4
On Thu, Aug 12, 2021 at 02:06:02PM -0500, madvenka@linux.microsoft.com wrote:

> +	if (frame->need_reliable && !unwind_is_reliable(frame)) {
> +		/* Cannot unwind to the next frame reliably. */
> +		frame->failed = true;
> +		return false;
> +	}

This means we only collect reliability information in the case
where we're specifically doing a reliable stacktrace.  For
example when printing stack traces on the console it might be
useful to print a ? or something if the frame is unreliable as a
hint to the reader that the information might be misleading.
Could we therefore change the flag here to a reliability one, and
adjust our need_reliable check, so that we always run
unwind_is_reliable()?

I'm not sure if we need to abandon the trace on first error when
doing a reliable trace but I can see it's a bit safer so perhaps
better to do so.  If we don't abandon then we don't require the
need_reliable check at all.
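
The idea of flagging unreliable entries in console output could look roughly like the user-space sketch below. It assumes a hypothetical per-entry record (struct model_entry) that carries the verdict of the reliability check; neither that structure nor model_format_entry() exists in the patch, and the "? " prefix is only one possible marker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical record: a return PC plus the result of the reliability check. */
struct model_entry {
	unsigned long pc;
	bool reliable;
};

/*
 * Format one trace entry into buf, prefixing "? " when the entry was
 * flagged unreliable. Returns what snprintf() returns: the number of
 * characters the full string needs, excluding the terminating NUL.
 */
int model_format_entry(char *buf, size_t len, const struct model_entry *e)
{
	assert(buf != NULL && e != NULL);
	return snprintf(buf, len, "%s0x%lx", e->reliable ? "" : "? ", e->pc);
}
```

A reader of such a trace would still see every entry, with the marker hinting that anything from the flagged entry onward may be misleading.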
Madhavan T. Venkataraman Aug. 26, 2021, 11:31 p.m. UTC | #5
On 8/26/21 10:57 AM, Mark Brown wrote:
> On Thu, Aug 12, 2021 at 02:06:02PM -0500, madvenka@linux.microsoft.com wrote:
> 
>> +	if (frame->need_reliable && !unwind_is_reliable(frame)) {
>> +		/* Cannot unwind to the next frame reliably. */
>> +		frame->failed = true;
>> +		return false;
>> +	}
> 
> This means we only collect reliability information in the case
> where we're specifically doing a reliable stacktrace.  For
> example when printing stack traces on the console it might be
> useful to print a ? or something if the frame is unreliable as a
> hint to the reader that the information might be misleading.
> Could we therefore change the flag here to a reliability one, and
> adjust our need_reliable check, so that we always run
> unwind_is_reliable()?
> 
> I'm not sure if we need to abandon the trace on first error when
> doing a reliable trace but I can see it's a bit safer so perhaps
> better to do so.  If we don't abandon then we don't require the
> need_reliable check at all.
> 

I think that a caller, such as livepatch, should be able to specify that the
stack trace should be abandoned.

So, we could always do the reliability check but keep need_reliable.

Thanks.

Madhavan
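
One way to read the outcome of this exchange is sketched below as a user-space model: the check always runs and its result is recorded, while need_reliable only decides whether an unreliable frame aborts the unwind. The reliable field and model_consume() are assumptions made for illustration, and check_result stands in for the return value of unwind_is_reliable(); this is not the code of any posted version of the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Reduced model of struct stackframe for this policy sketch. */
struct policy_frame {
	bool reliable;      /* hypothetical: verdict of the check, always recorded */
	bool need_reliable; /* from the patch: unreliability is fatal for this caller */
	bool failed;        /* from the patch: the unwind failed */
};

/*
 * Decide whether unwinding may continue to the next frame.
 * check_result stands in for unwind_is_reliable(frame).
 */
bool model_consume(struct policy_frame *frame, bool check_result)
{
	assert(frame != NULL);

	/* Always record the verdict, whoever the caller is. */
	frame->reliable = check_result;

	/* Abandon the trace only if the caller asked for reliability. */
	if (!frame->reliable && frame->need_reliable) {
		frame->failed = true;
		return false;
	}
	return true;
}
```

Under this reading, a debug caller keeps walking and can inspect reliable per frame, whereas a livepatch-style caller stops at the first unreliable frame.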

Patch

diff --git a/arch/arm64/include/asm/stacktrace.h b/arch/arm64/include/asm/stacktrace.h
index 407007376e97..65ea151da5da 100644
--- a/arch/arm64/include/asm/stacktrace.h
+++ b/arch/arm64/include/asm/stacktrace.h
@@ -53,6 +53,9 @@  struct stack_info {
  *               replacement lr value in the ftrace graph stack.
  *
  * @failed:      Unwind failed.
+ *
+ * @need_reliable: The caller needs a reliable stack trace. Treat any
+ *                 unreliability as a fatal error.
  */
 struct stackframe {
 	struct task_struct *task;
@@ -65,6 +68,7 @@  struct stackframe {
 	int graph;
 #endif
 	bool failed;
+	bool need_reliable;
 };
 
 extern void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk,
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index ec8f5163c4d0..b60f8a20ba64 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -34,7 +34,8 @@ 
 
 static void notrace unwind_start(struct stackframe *frame,
 				 struct task_struct *task,
-				 unsigned long fp, unsigned long pc)
+				 unsigned long fp, unsigned long pc,
+				 bool need_reliable)
 {
 	frame->task = task;
 	frame->fp = fp;
@@ -56,6 +57,7 @@  static void notrace unwind_start(struct stackframe *frame,
 	frame->prev_fp = 0;
 	frame->prev_type = STACK_TYPE_UNKNOWN;
 	frame->failed = false;
+	frame->need_reliable = need_reliable;
 }
 
 NOKPROBE_SYMBOL(unwind_start);
@@ -178,6 +180,23 @@  void show_stack(struct task_struct *tsk, unsigned long *sp, const char *loglvl)
 	barrier();
 }
 
+/*
+ * Check the stack frame for conditions that make further unwinding unreliable.
+ */
+static bool notrace unwind_is_reliable(struct stackframe *frame)
+{
+	/*
+	 * If the PC is not a known kernel text address, then we cannot
+	 * be sure that a subsequent unwind will be reliable, as we
+	 * don't know that the code follows our unwind requirements.
+	 */
+	if (!__kernel_text_address(frame->pc))
+		return false;
+	return true;
+}
+
+NOKPROBE_SYMBOL(unwind_is_reliable);
+
 static bool notrace unwind_consume(struct stackframe *frame,
 				   stack_trace_consume_fn consume_entry,
 				   void *cookie)
@@ -197,6 +216,12 @@  static bool notrace unwind_consume(struct stackframe *frame,
 		/* Final frame; nothing to unwind */
 		return false;
 	}
+
+	if (frame->need_reliable && !unwind_is_reliable(frame)) {
+		/* Cannot unwind to the next frame reliably. */
+		frame->failed = true;
+		return false;
+	}
 	return true;
 }
 
@@ -210,11 +235,12 @@  static inline bool unwind_failed(struct stackframe *frame)
 /* Core unwind function */
 static bool notrace unwind(stack_trace_consume_fn consume_entry, void *cookie,
 			   struct task_struct *task,
-			   unsigned long fp, unsigned long pc)
+			   unsigned long fp, unsigned long pc,
+			   bool need_reliable)
 {
 	struct stackframe frame;
 
-	unwind_start(&frame, task, fp, pc);
+	unwind_start(&frame, task, fp, pc, need_reliable);
 	while (unwind_consume(&frame, consume_entry, cookie))
 		unwind_next(&frame);
 	return !unwind_failed(&frame);
@@ -245,7 +271,36 @@  noinline notrace void arch_stack_walk(stack_trace_consume_fn consume_entry,
 		fp = thread_saved_fp(task);
 		pc = thread_saved_pc(task);
 	}
-	unwind(consume_entry, cookie, task, fp, pc);
+	unwind(consume_entry, cookie, task, fp, pc, false);
+}
+
+/*
+ * arch_stack_walk_reliable() may not be used for livepatch until all of
+ * the reliability checks are in place in unwind_consume(). However,
+ * debug and test code can choose to use it even if all the checks are not
+ * in place.
+ */
+noinline int notrace arch_stack_walk_reliable(stack_trace_consume_fn consume_fn,
+					      void *cookie,
+					      struct task_struct *task)
+{
+	unsigned long fp, pc;
+
+	if (!task)
+		task = current;
+
+	if (task == current) {
+		/* Skip arch_stack_walk_reliable() in the stack trace. */
+		fp = (unsigned long)__builtin_frame_address(1);
+		pc = (unsigned long)__builtin_return_address(0);
+	} else {
+		/* Caller guarantees that the task is not running. */
+		fp = thread_saved_fp(task);
+		pc = thread_saved_pc(task);
+	}
+	if (unwind(consume_fn, cookie, task, fp, pc, true))
+		return 0;
+	return -EINVAL;
 }
 
 #endif