diff mbox

[v4,2/2] arm64: Expand the stack trace feature to support IRQ stack

Message ID 1444231692-32722-3-git-send-email-jungseoklee85@gmail.com (mailing list archive)
State New, archived
Headers show

Commit Message

Jungseok Lee Oct. 7, 2015, 3:28 p.m. UTC
Currently, a call trace drops a process stack walk when a separate IRQ
stack is used. It makes a call trace information much less useful when
a system gets paniked in interrupt context.

This patch addresses the issue with the following schemes:

  - Store aborted stack frame data
  - Decide whether another stack walk is needed or not via current sp
  - Loosen the frame pointer upper bound condition

Cc: AKASHI Takahiro <takahiro.akashi@linaro.org>
Cc: James Morse <james.morse@arm.com>
Signed-off-by: Jungseok Lee <jungseoklee85@gmail.com>
---
 arch/arm64/include/asm/irq.h    | 12 +++++++++++
 arch/arm64/kernel/asm-offsets.c |  3 +++
 arch/arm64/kernel/entry.S       | 10 ++++++++--
 arch/arm64/kernel/stacktrace.c  | 22 ++++++++++++++++++++-
 arch/arm64/kernel/traps.c       | 13 ++++++++++++
 5 files changed, 57 insertions(+), 3 deletions(-)

Comments

James Morse Oct. 9, 2015, 2:24 p.m. UTC | #1
Hi Jungseok,

On 07/10/15 16:28, Jungseok Lee wrote:
> Currently, a call trace drops a process stack walk when a separate IRQ
> stack is used. It makes a call trace information much less useful when
> a system gets paniked in interrupt context.

panicked

> This patch addresses the issue with the following schemes:
> 
>   - Store aborted stack frame data
>   - Decide whether another stack walk is needed or not via current sp
>   - Loosen the frame pointer upper bound condition

It may be worth merging this patch with its predecessor - anyone trying to
bisect a problem could land between these two patches, and spend time
debugging the truncated call traces.


> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
> index 6ea82e8..e5904a1 100644
> --- a/arch/arm64/include/asm/irq.h
> +++ b/arch/arm64/include/asm/irq.h
> @@ -2,13 +2,25 @@
>  #define __ASM_IRQ_H
>  
>  #include <linux/irqchip/arm-gic-acpi.h>
> +#include <asm/stacktrace.h>
>  
>  #include <asm-generic/irq.h>
>  
>  struct irq_stack {
>  	void *stack;
> +	struct stackframe frame;
>  };
>  
> +DECLARE_PER_CPU(struct irq_stack, irq_stacks);

Good idea, storing this in the per-cpu data makes it immune to stack
corruption.


> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
> index 407991b..5124649 100644
> --- a/arch/arm64/kernel/stacktrace.c
> +++ b/arch/arm64/kernel/stacktrace.c
> @@ -43,7 +43,27 @@ int notrace unwind_frame(struct stackframe *frame)
>  	low  = frame->sp;
>  	high = ALIGN(low, THREAD_SIZE);
>  
> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
> +	/*
> +	 * A frame pointer would reach an upper bound if a prologue of the
> +	 * first function of call trace looks as follows:
> +	 *
> +	 *	stp     x29, x30, [sp,#-16]!
> +	 *	mov     x29, sp
> +	 *
> +	 * Thus, the upper bound is (top of stack - 0x20) with consideration

The terms 'top' and 'bottom' of the stack are confusing, your 'top' appears
to be the highest address, which is used first, making it the bottom of the
stack.

I would try to use the terms low/est and high/est, in keeping with the
variable names in use here.


> +	 * of a 16-byte empty space in THREAD_START_SP.
> +	 *
> +	 * The value, 0x20, however, does not cover all cases as interrupts
> +	 * are handled using a separate stack. That is, a call trace can start
> +	 * from elx_irq exception vectors. The symbols could not be promoted
> +	 * to candidates for a stack trace under the restriction, 0x20.
> +	 *
> +	 * The scenario is handled without complexity as 1) considering
> +	 * (bottom of stack + THREAD_START_SP) as a dummy frame pointer, the
> +	 * content of which is 0, and 2) allowing the case, which changes
> +	 * the value to 0x10 from 0x20.

Where has 0x20 come from? The old value was 0x18.

My understanding is the highest part of the stack looks like this:
high        [ off-stack ]
high - 0x08 [ left free by THREAD_START_SP ]
high - 0x10 [ left free by THREAD_START_SP ]
high - 0x18 [#1 x30 ]
high - 0x20 [#1 x29 ]

So the condition 'fp > high - 0x18' prevents returning either 'left free'
address, or off-stack-value as a frame. Changing it to 'fp > high - 0x10'
allows the first half of that reserved area to be a valid stack frame.

This change is breaking perf using incantations [0] and [1]:

Before, with just patch 1/2:
                  ---__do_softirq
                     |
                     |--92.95%-- __handle_domain_irq
                     |          __irqentry_text_start
                     |          el1_irq
                     |

After, with both patches:
                 ---__do_softirq
                    |
                    |--83.83%-- __handle_domain_irq
                    |          __irqentry_text_start
                    |          el1_irq
                    |          |
                    |          |--99.39%-- 0x400008040d00000c
                    |           --0.61%-- [...]
                    |

Changing the condition to 'fp >= high - 0x10' fixes this.

I agree it needs documenting, it is quite fiddly - I think Akashi Takahiro
is the expert.


I think unwind_frame() needs to walk the irq stack too. [2] is an example
of perf tracing back to userspace, (and there are patches on the list to
do/fix this), so we need to walk back to the start of the first stack for
the perf accounting to be correct.


> +	 */
> +	if (fp < low || fp > high - 0x10 || fp & 0xf)
>  		return -EINVAL;
>  
>  	frame->sp = fp + 0x10;
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index f93aae5..44b2f828 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -146,6 +146,8 @@ static void dump_instr(const char *lvl, struct pt_regs *regs)
>  static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>  {
>  	struct stackframe frame;
> +	unsigned int cpu = smp_processor_id();

I wonder if there is any case where dump_backtrace() is called on another cpu?

Setting the cpu value from task_thread_info(tsk)->cpu would protect against
this.


> +	bool in_irq = in_irq_stack(cpu);
>  
>  	pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk);
>  
> @@ -170,6 +172,10 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>  	}
>  
>  	pr_emerg("Call trace:\n");
> +repeat:
> +	if (in_irq)
> +		pr_emerg("<IRQ>\n");

Do we need these? 'el1_irq()' in the trace is a giveaway...


> +
>  	while (1) {
>  		unsigned long where = frame.pc;
>  		int ret;
> @@ -179,6 +185,13 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>  			break;
>  		dump_backtrace_entry(where, frame.sp);
>  	}
> +
> +	if (in_irq) {
> +		frame = per_cpu(irq_stacks, cpu).frame;
> +		in_irq = false;
> +		pr_emerg("<EOI>\n");
> +		goto repeat;
> +	}
>  }
>  
>  void show_stack(struct task_struct *tsk, unsigned long *sp)


Thanks!

James


[0] sudo ./perf record -e mem:<address of __do_softirq()>:x -ag -- sleep 10
[1] sudo ./perf report --call-graph --stdio
[2] http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html
Jungseok Lee Oct. 12, 2015, 2:53 p.m. UTC | #2
On Oct 9, 2015, at 11:24 PM, James Morse wrote:
> Hi Jungseok,

Hi James,

> On 07/10/15 16:28, Jungseok Lee wrote:
>> Currently, a call trace drops a process stack walk when a separate IRQ
>> stack is used. It makes a call trace information much less useful when
>> a system gets paniked in interrupt context.
> 
> panicked

I will fix the typo.

>> This patch addresses the issue with the following schemes:
>> 
>>  - Store aborted stack frame data
>>  - Decide whether another stack walk is needed or not via current sp
>>  - Loosen the frame pointer upper bound condition
> 
> It may be worth merging this patch with its predecessor - anyone trying to
> bisect a problem could land between these two patches, and spend time
> debugging the truncated call traces.

It was an original intention to lead them to this patch, not the [1/2] one.
This separation would help anyone touching the call trace feature including
me focus on these changes apart from stack allocation, IRQ recursion check
and thread_info management.

In addition, I would like to add a clear and sufficient explanation on the
frame pointer condition.

>> diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
>> index 6ea82e8..e5904a1 100644
>> --- a/arch/arm64/include/asm/irq.h
>> +++ b/arch/arm64/include/asm/irq.h
>> @@ -2,13 +2,25 @@
>> #define __ASM_IRQ_H
>> 
>> #include <linux/irqchip/arm-gic-acpi.h>
>> +#include <asm/stacktrace.h>
>> 
>> #include <asm-generic/irq.h>
>> 
>> struct irq_stack {
>> 	void *stack;
>> +	struct stackframe frame;
>> };
>> 
>> +DECLARE_PER_CPU(struct irq_stack, irq_stacks);
> 
> Good idea, storing this in the per-cpu data makes it immune to stack
> corruption.
> 
> 
>> diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
>> index 407991b..5124649 100644
>> --- a/arch/arm64/kernel/stacktrace.c
>> +++ b/arch/arm64/kernel/stacktrace.c
>> @@ -43,7 +43,27 @@ int notrace unwind_frame(struct stackframe *frame)
>> 	low  = frame->sp;
>> 	high = ALIGN(low, THREAD_SIZE);
>> 
>> -	if (fp < low || fp > high - 0x18 || fp & 0xf)
>> +	/*
>> +	 * A frame pointer would reach an upper bound if a prologue of the
>> +	 * first function of call trace looks as follows:
>> +	 *
>> +	 *	stp     x29, x30, [sp,#-16]!
>> +	 *	mov     x29, sp
>> +	 *
>> +	 * Thus, the upper bound is (top of stack - 0x20) with consideration
> 
> The terms 'top' and 'bottom' of the stack are confusing, your 'top' appears
> to be the highest address, which is used first, making it the bottom of the
> stack.
> 
> I would try to use the terms low/est and high/est, in keeping with the
> variable names in use here.

Good idea. I'm favor of those terms.

>> +	 * of a 16-byte empty space in THREAD_START_SP.
>> +	 *
>> +	 * The value, 0x20, however, does not cover all cases as interrupts
>> +	 * are handled using a separate stack. That is, a call trace can start
>> +	 * from elx_irq exception vectors. The symbols could not be promoted
>> +	 * to candidates for a stack trace under the restriction, 0x20.
>> +	 *
>> +	 * The scenario is handled without complexity as 1) considering
>> +	 * (bottom of stack + THREAD_START_SP) as a dummy frame pointer, the
>> +	 * content of which is 0, and 2) allowing the case, which changes
>> +	 * the value to 0x10 from 0x20.
> 
> Where has 0x20 come from? The old value was 0x18.

What I meant is 0x20 is the highest valid frame pointer. The comment should
have been described more clearly.

> My understanding is the highest part of the stack looks like this:
> high        [ off-stack ]
> high - 0x08 [ left free by THREAD_START_SP ]
> high - 0x10 [ left free by THREAD_START_SP ]
> high - 0x18 [#1 x30 ]
> high - 0x20 [#1 x29 ]

Clear description than mine!

> So the condition 'fp > high - 0x18' prevents returning either 'left free'
> address, or off-stack-value as a frame. Changing it to 'fp > high - 0x10'
> allows the first half of that reserved area to be a valid stack frame.

I believe my understanding is aligned with yours.

Under a current condition, 'fp > high - 0x18', it is impossible to catch the
'el1_irq' symbol. This is why I set x29 to high - 0x10 and changed the frame
pointer condition, but the changes fail to cover perf according to your data.

> This change is breaking perf using incantations [0] and [1]:

I'm reviewing how perf stack trace works..

> Before, with just patch 1/2:
>                  ---__do_softirq
>                     |
>                     |--92.95%-- __handle_domain_irq
>                     |          __irqentry_text_start
>                     |          el1_irq
>                     |
> 
> After, with both patches:
>                 ---__do_softirq
>                    |
>                    |--83.83%-- __handle_domain_irq
>                    |          __irqentry_text_start
>                    |          el1_irq
>                    |          |
>                    |          |--99.39%-- 0x400008040d00000c
>                    |           --0.61%-- [...]
>                    |
> 
> Changing the condition to 'fp >= high - 0x10' fixes this.

'fp >= high - 0x10' drops 'el1_irq' when dump_stack() or panic() is called.

> I agree it needs documenting, it is quite fiddly - I think Akashi Takahiro
> is the expert.

If possible, it would be greatly helpful.

> I think unwind_frame() needs to walk the irq stack too. [2] is an example
> of perf tracing back to userspace, (and there are patches on the list to
> do/fix this), so we need to walk back to the start of the first stack for
> the perf accounting to be correct.

Frankly, I missed the case where perf does backtrace to userspace.

IMO, this statement supports why the stack trace feature commit should be
written independently. The [1/2] patch would be pretty stable if 64KB page
is supported. The separation might help us concentrate on the stack trace
feature in a generic dump stack, perf, and ftrace point of view.

>> +	 */
>> +	if (fp < low || fp > high - 0x10 || fp & 0xf)
>> 		return -EINVAL;
>> 
>> 	frame->sp = fp + 0x10;
>> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
>> index f93aae5..44b2f828 100644
>> --- a/arch/arm64/kernel/traps.c
>> +++ b/arch/arm64/kernel/traps.c
>> @@ -146,6 +146,8 @@ static void dump_instr(const char *lvl, struct pt_regs *regs)
>> static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>> {
>> 	struct stackframe frame;
>> +	unsigned int cpu = smp_processor_id();
> 
> I wonder if there is any case where dump_backtrace() is called on another cpu?
> 
> Setting the cpu value from task_thread_info(tsk)->cpu would protect against
> this.

IMO, no, but your suggestion makes sense. I will update it.

>> +	bool in_irq = in_irq_stack(cpu);
>> 
>> 	pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk);
>> 
>> @@ -170,6 +172,10 @@ static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>> 	}
>> 
>> 	pr_emerg("Call trace:\n");
>> +repeat:
>> +	if (in_irq)
>> +		pr_emerg("<IRQ>\n");
> 
> Do we need these? 'el1_irq()' in the trace is a giveaway…

I borrow this idea from x86 implementation in order to show a separate stack
explicitly. There is no issue to remove these tags, <IRQ> and <EOI>.

Great thanks!

Best Regards
Jungseok Lee
Jungseok Lee Oct. 15, 2015, 2:24 p.m. UTC | #3
On Oct 9, 2015, at 11:24 PM, James Morse wrote:

Hi James,

[ ... ]

> I think unwind_frame() needs to walk the irq stack too. [2] is an example
> of perf tracing back to userspace, (and there are patches on the list to
> do/fix this), so we need to walk back to the start of the first stack for
> the perf accounting to be correct.

I plan to do re-spin this series without [PATCH 2/2] since 1) Akashi's
approach looks better than mine and 2) you have the perf patches for [2].
This would help us to move forward.

Thoughts?

[ ... ]

> [0] sudo ./perf record -e mem:<address of __do_softirq()>:x -ag -- sleep 10
> [1] sudo ./perf report --call-graph --stdio
> [2] http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

Best Regards
Jungseok Lee
James Morse Oct. 15, 2015, 4:01 p.m. UTC | #4
On 15/10/15 15:24, Jungseok Lee wrote:
> On Oct 9, 2015, at 11:24 PM, James Morse wrote:
>> I think unwind_frame() needs to walk the irq stack too. [2] is an example
>> of perf tracing back to userspace, (and there are patches on the list to
>> do/fix this), so we need to walk back to the start of the first stack for
>> the perf accounting to be correct.
> 
> I plan to do re-spin this series without [PATCH 2/2] since 1) Akashi's
> approach looks better than mine and 2) you have the perf patches for [2].

They aren't my patches - the ones I saw on the list were for arm:
https://lkml.org/lkml/2015/10/1/769 - its evidently something perf
supports, so we shouldn't make it worse...


Thanks!

James
Jungseok Lee Oct. 16, 2015, 1:02 p.m. UTC | #5
On Oct 16, 2015, at 1:01 AM, James Morse wrote:
> On 15/10/15 15:24, Jungseok Lee wrote:
>> On Oct 9, 2015, at 11:24 PM, James Morse wrote:
>>> I think unwind_frame() needs to walk the irq stack too. [2] is an example
>>> of perf tracing back to userspace, (and there are patches on the list to
>>> do/fix this), so we need to walk back to the start of the first stack for
>>> the perf accounting to be correct.
>> 
>> I plan to do re-spin this series without [PATCH 2/2] since 1) Akashi's
>> approach looks better than mine and 2) you have the perf patches for [2].
> 
> They aren't my patches - the ones I saw on the list were for arm:
> https://lkml.org/lkml/2015/10/1/769 - its evidently something perf
> supports, so we shouldn't make it worse…

Aha, thanks for the information!

Best Regards
Jungseok Lee
diff mbox

Patch

diff --git a/arch/arm64/include/asm/irq.h b/arch/arm64/include/asm/irq.h
index 6ea82e8..e5904a1 100644
--- a/arch/arm64/include/asm/irq.h
+++ b/arch/arm64/include/asm/irq.h
@@ -2,13 +2,25 @@ 
 #define __ASM_IRQ_H
 
 #include <linux/irqchip/arm-gic-acpi.h>
+#include <asm/stacktrace.h>
 
 #include <asm-generic/irq.h>
 
 struct irq_stack {
 	void *stack;
+	struct stackframe frame;
 };
 
+DECLARE_PER_CPU(struct irq_stack, irq_stacks);
+
+static inline bool in_irq_stack(unsigned int cpu)
+{
+	unsigned long high = (unsigned long)per_cpu(irq_stacks, cpu).stack;
+
+	return (current_stack_pointer >= round_down(high, THREAD_SIZE)) &&
+		current_stack_pointer < high;
+}
+
 struct pt_regs;
 
 extern void migrate_irqs(void);
diff --git a/arch/arm64/kernel/asm-offsets.c b/arch/arm64/kernel/asm-offsets.c
index b16e3cf..fbb52f2d 100644
--- a/arch/arm64/kernel/asm-offsets.c
+++ b/arch/arm64/kernel/asm-offsets.c
@@ -42,6 +42,9 @@  int main(void)
   DEFINE(THREAD_CPU_CONTEXT,	offsetof(struct task_struct, thread.cpu_context));
   BLANK();
   DEFINE(IRQ_STACK,		offsetof(struct irq_stack, stack));
+  DEFINE(IRQ_FRAME_FP,		offsetof(struct irq_stack, frame.fp));
+  DEFINE(IRQ_FRAME_SP,		offsetof(struct irq_stack, frame.sp));
+  DEFINE(IRQ_FRAME_PC,		offsetof(struct irq_stack, frame.pc));
   BLANK();
   DEFINE(S_X0,			offsetof(struct pt_regs, regs[0]));
   DEFINE(S_X1,			offsetof(struct pt_regs, regs[1]));
diff --git a/arch/arm64/kernel/entry.S b/arch/arm64/kernel/entry.S
index 6d4e8c5..650cc05 100644
--- a/arch/arm64/kernel/entry.S
+++ b/arch/arm64/kernel/entry.S
@@ -121,7 +121,8 @@ 
 	 * x21 - aborted SP
 	 * x22 - aborted PC
 	 * x23 - aborted PSTATE
-	*/
+	 * x29 - aborted FP
+	 */
 	.endm
 
 	.macro	kernel_exit, el
@@ -184,7 +185,12 @@  alternative_endif
 	mov	x23, sp
 	and	x23, x23, #~(THREAD_SIZE - 1)
 	cmp	x20, x23			// check irq re-enterance
-	mov	x19, sp
+	beq	1f
+	str	x29, [x19, #IRQ_FRAME_FP]
+	str	x21, [x19, #IRQ_FRAME_SP]
+	str	x22, [x19, #IRQ_FRAME_PC]
+	mov	x29, x24
+1:	mov	x19, sp
 	csel	x23, x19, x24, eq		// x24 = top of irq stack
 	mov	sp, x23
 	.endm
diff --git a/arch/arm64/kernel/stacktrace.c b/arch/arm64/kernel/stacktrace.c
index 407991b..5124649 100644
--- a/arch/arm64/kernel/stacktrace.c
+++ b/arch/arm64/kernel/stacktrace.c
@@ -43,7 +43,27 @@  int notrace unwind_frame(struct stackframe *frame)
 	low  = frame->sp;
 	high = ALIGN(low, THREAD_SIZE);
 
-	if (fp < low || fp > high - 0x18 || fp & 0xf)
+	/*
+	 * A frame pointer would reach an upper bound if a prologue of the
+	 * first function of call trace looks as follows:
+	 *
+	 *	stp     x29, x30, [sp,#-16]!
+	 *	mov     x29, sp
+	 *
+	 * Thus, the upper bound is (top of stack - 0x20) with consideration
+	 * of a 16-byte empty space in THREAD_START_SP.
+	 *
+	 * The value, 0x20, however, does not cover all cases as interrupts
+	 * are handled using a separate stack. That is, a call trace can start
+	 * from elx_irq exception vectors. The symbols could not be promoted
+	 * to candidates for a stack trace under the restriction, 0x20.
+	 *
+	 * The scenario is handled without complexity as 1) considering
+	 * (bottom of stack + THREAD_START_SP) as a dummy frame pointer, the
+	 * content of which is 0, and 2) allowing the case, which changes
+	 * the value to 0x10 from 0x20.
+	 */
+	if (fp < low || fp > high - 0x10 || fp & 0xf)
 		return -EINVAL;
 
 	frame->sp = fp + 0x10;
diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index f93aae5..44b2f828 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -146,6 +146,8 @@  static void dump_instr(const char *lvl, struct pt_regs *regs)
 static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
 {
 	struct stackframe frame;
+	unsigned int cpu = smp_processor_id();
+	bool in_irq = in_irq_stack(cpu);
 
 	pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk);
 
@@ -170,6 +172,10 @@  static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
 	}
 
 	pr_emerg("Call trace:\n");
+repeat:
+	if (in_irq)
+		pr_emerg("<IRQ>\n");
+
 	while (1) {
 		unsigned long where = frame.pc;
 		int ret;
@@ -179,6 +185,13 @@  static void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
 			break;
 		dump_backtrace_entry(where, frame.sp);
 	}
+
+	if (in_irq) {
+		frame = per_cpu(irq_stacks, cpu).frame;
+		in_irq = false;
+		pr_emerg("<EOI>\n");
+		goto repeat;
+	}
 }
 
 void show_stack(struct task_struct *tsk, unsigned long *sp)