diff mbox

arm64: avoid race condition issue in dump_backtrace

Message ID 1521687960-3744-1-git-send-email-ji.zhang@mediatek.com (mailing list archive)
State New, archived
Headers show

Commit Message

Ji Zhang March 22, 2018, 3:06 a.m. UTC
When we dump the backtrace of some specific task, there is a potential race
condition due to the task may be running on other cores if SMP enabled.
That is because for current implementation, if the task is not the current
task, we will get the registers used for unwind from cpu_context saved in
thread_info, which is the snapshot before context switch, but if the task
is running on other cores, the registers and the content of stack are
changed.
This may cause that we get the wrong backtrace or incomplete backtrace or
even crash the kernel.
To avoid this case, do not dump the backtrace of the tasks which are
running on other cores.
This patch cannot solve the issue completely but can shrink the window of
race condition.

Signed-off-by: Ji Zhang <ji.zhang@mediatek.com>
---
 arch/arm64/kernel/traps.c | 3 +++
 1 file changed, 3 insertions(+)

Comments

Baruch Siach March 22, 2018, 4:57 a.m. UTC | #1
Hi Ji Zhang,

On Thu, Mar 22, 2018 at 11:06:00AM +0800, Ji Zhang wrote:
> When we dump the backtrace of some specific task, there is a potential race
> condition due to the task may be running on other cores if SMP enabled.
> That is because for current implementation, if the task is not the current
> task, we will get the registers used for unwind from cpu_context saved in
> thread_info, which is the snapshot before context switch, but if the task
> is running on other cores, the registers and the content of stack are
> changed.
> This may cause that we get the wrong backtrace or incomplete backtrace or
> even crash the kernel.
> To avoid this case, do not dump the backtrace of the tasks which are
> running on other cores.
> This patch cannot solve the issue completely but can shrink the window of
> race condition.
> 
> Signed-off-by: Ji Zhang <ji.zhang@mediatek.com>
> ---
>  arch/arm64/kernel/traps.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> index eb2d151..95749364 100644
> --- a/arch/arm64/kernel/traps.c
> +++ b/arch/arm64/kernel/traps.c
> @@ -113,6 +113,9 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>  	if (tsk == current) {
>  		frame.fp = (unsigned long)__builtin_frame_address(0);
>  		frame.pc = (unsigned long)dump_backtrace;
> +	else if (tsk->state == TASK_RUNNING) {

Missing closing brace. Does this build?

> +		pr_notice("Do not dump other running tasks\n");
> +		return;
>  	} else {
>  		/*
>  		 * task blocked in __switch_to

baruch
Ji Zhang March 22, 2018, 5:33 a.m. UTC | #2
Dear Baruch,

Thank you for your kind and careful examination.
That is my stupid fault due to I use different environments to do
verification and submit patch which I do the merge manually.
Anyway, thanks so much for the reminder, I will send another patch.

BRs,
Ji 

On Thu, 2018-03-22 at 06:57 +0200, Baruch Siach wrote:
> Hi Ji Zhang,
> 
> On Thu, Mar 22, 2018 at 11:06:00AM +0800, Ji Zhang wrote:
> > When we dump the backtrace of some specific task, there is a potential race
> > condition due to the task may be running on other cores if SMP enabled.
> > That is because for current implementation, if the task is not the current
> > task, we will get the registers used for unwind from cpu_context saved in
> > thread_info, which is the snapshot before context switch, but if the task
> > is running on other cores, the registers and the content of stack are
> > changed.
> > This may cause that we get the wrong backtrace or incomplete backtrace or
> > even crash the kernel.
> > To avoid this case, do not dump the backtrace of the tasks which are
> > running on other cores.
> > This patch cannot solve the issue completely but can shrink the window of
> > race condition.
> > 
> > Signed-off-by: Ji Zhang <ji.zhang@mediatek.com>
> > ---
> >  arch/arm64/kernel/traps.c | 3 +++
> >  1 file changed, 3 insertions(+)
> > 
> > diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
> > index eb2d151..95749364 100644
> > --- a/arch/arm64/kernel/traps.c
> > +++ b/arch/arm64/kernel/traps.c
> > @@ -113,6 +113,9 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
> >  	if (tsk == current) {
> >  		frame.fp = (unsigned long)__builtin_frame_address(0);
> >  		frame.pc = (unsigned long)dump_backtrace;
> > +	else if (tsk->state == TASK_RUNNING) {
> 
> Missing closing brace. Does this build?
> 
> > +		pr_notice("Do not dump other running tasks\n");
> > +		return;
> >  	} else {
> >  		/*
> >  		 * task blocked in __switch_to
> 
> baruch
>
Mark Rutland March 22, 2018, 5:59 a.m. UTC | #3
On Thu, Mar 22, 2018 at 11:06:00AM +0800, Ji Zhang wrote:
> When we dump the backtrace of some specific task, there is a potential race
> condition due to the task may be running on other cores if SMP enabled.
> That is because for current implementation, if the task is not the current
> task, we will get the registers used for unwind from cpu_context saved in
> thread_info, which is the snapshot before context switch, but if the task
> is running on other cores, the registers and the content of stack are
> changed.
> This may cause that we get the wrong backtrace or incomplete backtrace or
> even crash the kernel.

When do we call dump_backtrace() on a running task that is not current?

AFAICT, we don't do that in the arm64-specific callers of dump_backtrace(), and
this would have to be some caller of show_stack() in generic code.

We pin the task's stack via try_get_task_stack(), so this cannot be unmapped
while we walk it. In unwind_frame() we check that the frame record falls
entirely within the task's stack. So AFAICT, we cannot crash the kernel here,
though the backtrace may be misleading (and we could potentially get stuck in
an infinite loop).

> To avoid this case, do not dump the backtrace of the tasks which are
> running on other cores.
> This patch cannot solve the issue completely but can shrink the window of
> race condition.

> @@ -113,6 +113,9 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
>  	if (tsk == current) {
>  		frame.fp = (unsigned long)__builtin_frame_address(0);
>  		frame.pc = (unsigned long)dump_backtrace;
> +	else if (tsk->state == TASK_RUNNING) {
> +		pr_notice("Do not dump other running tasks\n");
> +		return;

As you note, if we can race with the task being scheduled, this doesn't help.

Can we rule this out at a higher level?

Thanks,
Mark.
Ji Zhang March 22, 2018, 9:35 a.m. UTC | #4
On Thu, 2018-03-22 at 05:59 +0000, Mark Rutland wrote:
> On Thu, Mar 22, 2018 at 11:06:00AM +0800, Ji Zhang wrote:
> > When we dump the backtrace of some specific task, there is a potential race
> > condition due to the task may be running on other cores if SMP enabled.
> > That is because for current implementation, if the task is not the current
> > task, we will get the registers used for unwind from cpu_context saved in
> > thread_info, which is the snapshot before context switch, but if the task
> > is running on other cores, the registers and the content of stack are
> > changed.
> > This may cause that we get the wrong backtrace or incomplete backtrace or
> > even crash the kernel.
> 
> When do we call dump_backtrace() on a running task that is not current?
> 
> AFAICT, we don't do that in the arm64-specific callers of dump_backtrace(), and
> this would have to be some caller of show_stack() in generic code.
Yes, show_stack() can make caller specify a task and dump its backtrace.
For example, SysRq-T (echo t > /proc/sysrq-trigger) will use this to
dump the backtrace of specific tasks.
> 
> We pin the task's stack via try_get_task_stack(), so this cannot be unmapped
> while we walk it. In unwind_frame() we check that the frame record falls
> entirely within the task's stack. So AFAICT, we cannot crash the kernel here,
> though the backtrace may be misleading (and we could potentially get stuck in
> an infinite loop).
You are right, I have checked the code and it seems that the check for
fp in unwind_frame() is strong enough to handle the case which stack
being changed due to task running. And as you mentioned, if
unfortunately fp is point to the address of itself, the unwind will be
an infinite loop, but it is a very small probability event, so we can
ignore this, is that right?
> 
> > To avoid this case, do not dump the backtrace of the tasks which are
> > running on other cores.
> > This patch cannot solve the issue completely but can shrink the window of
> > race condition.
> 
> > @@ -113,6 +113,9 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
> >  	if (tsk == current) {
> >  		frame.fp = (unsigned long)__builtin_frame_address(0);
> >  		frame.pc = (unsigned long)dump_backtrace;
> > +	else if (tsk->state == TASK_RUNNING) {
> > +		pr_notice("Do not dump other running tasks\n");
> > +		return;
> 
> As you note, if we can race with the task being scheduled, this doesn't help.
> 
> Can we rule this out at a higher level?
> Thanks,
> Mark.
Actually, according to my previous understanding, the low level function
should be transparent to callers and should provide the right result and
handle some unexpected cases, which means that if the result may be
misleading, we should drop it. That is why I bypass all TASK_RUNNING
tasks. I am not sure if this understanding is reasonable for this case.

And as you mentioned that rule this out at a higher level, is there any
suggestions, handle this in the caller of show_stack()?

PS, the dump_backtrace in arm64 is strong enough while arm has a weak
one, which strongly depend on the content of stack(in c_backtrace), and
it indeed can crash the kernel due to stack changed, I have submit a
similar patch for arm (arm: avoid race condition issue in
dump_backtrace)
If we can find a way to handle this at a higher way, should it be
suitable for both arm and arm64?

Thanks for your kindly support.

Best Regards,
Ji
kernel test robot March 24, 2018, 11:01 a.m. UTC | #5
Hi Ji,

Thank you for the patch! Yet something to improve:

[auto build test ERROR on arm64/for-next/core]
[also build test ERROR on v4.16-rc6 next-20180323]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]

url:    https://github.com/0day-ci/linux/commits/Ji-Zhang/arm64-avoid-race-condition-issue-in-dump_backtrace/20180324-165040
base:   https://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux.git for-next/core
config: arm64-defconfig (attached as .config)
compiler: aarch64-linux-gnu-gcc (Debian 7.2.0-11) 7.2.0
reproduce:
        wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
        chmod +x ~/bin/make.cross
        # save the attached .config to linux build tree
        make.cross ARCH=arm64 

All errors (new ones prefixed by >>):

   arch/arm64/kernel/traps.c: In function 'dump_backtrace':
>> arch/arm64/kernel/traps.c:116:2: error: expected '}' before 'else'
     else if (tsk->state == TASK_RUNNING) {
     ^~~~

vim +116 arch/arm64/kernel/traps.c

    99	
   100	void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
   101	{
   102		struct stackframe frame;
   103		int skip;
   104	
   105		pr_debug("%s(regs = %p tsk = %p)\n", __func__, regs, tsk);
   106	
   107		if (!tsk)
   108			tsk = current;
   109	
   110		if (!try_get_task_stack(tsk))
   111			return;
   112	
   113		if (tsk == current) {
   114			frame.fp = (unsigned long)__builtin_frame_address(0);
   115			frame.pc = (unsigned long)dump_backtrace;
 > 116		else if (tsk->state == TASK_RUNNING) {
   117			pr_notice("Do not dump other running tasks\n");
   118			return;
   119		} else {
   120			/*
   121			 * task blocked in __switch_to
   122			 */
   123			frame.fp = thread_saved_fp(tsk);
   124			frame.pc = thread_saved_pc(tsk);
   125		}
   126	#ifdef CONFIG_FUNCTION_GRAPH_TRACER
   127		frame.graph = tsk->curr_ret_stack;
   128	#endif
   129	
   130		skip = !!regs;
   131		printk("Call trace:\n");
   132		do {
   133			/* skip until specified stack frame */
   134			if (!skip) {
   135				dump_backtrace_entry(frame.pc);
   136			} else if (frame.fp == regs->regs[29]) {
   137				skip = 0;
   138				/*
   139				 * Mostly, this is the case where this function is
   140				 * called in panic/abort. As exception handler's
   141				 * stack frame does not contain the corresponding pc
   142				 * at which an exception has taken place, use regs->pc
   143				 * instead.
   144				 */
   145				dump_backtrace_entry(regs->pc);
   146			}
   147		} while (!unwind_frame(tsk, &frame));
   148	
   149		put_task_stack(tsk);
   150	}
   151	

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation
Mark Rutland March 26, 2018, 11:39 a.m. UTC | #6
On Thu, Mar 22, 2018 at 05:35:29PM +0800, Ji.Zhang wrote:
> On Thu, 2018-03-22 at 05:59 +0000, Mark Rutland wrote:
> > On Thu, Mar 22, 2018 at 11:06:00AM +0800, Ji Zhang wrote:
> > > When we dump the backtrace of some specific task, there is a potential race
> > > condition due to the task may be running on other cores if SMP enabled.
> > > That is because for current implementation, if the task is not the current
> > > task, we will get the registers used for unwind from cpu_context saved in
> > > thread_info, which is the snapshot before context switch, but if the task
> > > is running on other cores, the registers and the content of stack are
> > > changed.
> > > This may cause that we get the wrong backtrace or incomplete backtrace or
> > > even crash the kernel.
> > 
> > When do we call dump_backtrace() on a running task that is not current?
> > 
> > AFAICT, we don't do that in the arm64-specific callers of dump_backtrace(), and
> > this would have to be some caller of show_stack() in generic code.
> Yes, show_stack() can make caller specify a task and dump its backtrace.
> For example, SysRq-T (echo t > /proc/sysrq-trigger) will use this to
> dump the backtrace of specific tasks.

Ok. I see that this eventually calls show_state_filter(0), where we call
sched_show_task() for every task.

> > We pin the task's stack via try_get_task_stack(), so this cannot be unmapped
> > while we walk it. In unwind_frame() we check that the frame record falls
> > entirely within the task's stack. So AFAICT, we cannot crash the kernel here,
> > though the backtrace may be misleading (and we could potentially get stuck in
> > an infinite loop).
> You are right, I have checked the code and it seems that the check for
> fp in unwind_frame() is strong enough to handle the case which stack
> being changed due to task running. And as you mentioned, if
> unfortunately fp is point to the address of itself, the unwind will be
> an infinite loop, but it is a very small probability event, so we can
> ignore this, is that right?

I think that it would be preferable to try to avoid the inifinite loop
case. We could hit that by accident if we're tracing a live task.

It's a little tricky to ensure that we don't loop, since we can have
traces that span several stacks, e.g. overflow -> irq -> task, so we
need to know where the last frame was, and we need to defnie a strict
order for stack nesting.

> > > To avoid this case, do not dump the backtrace of the tasks which are
> > > running on other cores.
> > > This patch cannot solve the issue completely but can shrink the window of
> > > race condition.
> > 
> > > @@ -113,6 +113,9 @@ void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
> > >  	if (tsk == current) {
> > >  		frame.fp = (unsigned long)__builtin_frame_address(0);
> > >  		frame.pc = (unsigned long)dump_backtrace;
> > > +	else if (tsk->state == TASK_RUNNING) {
> > > +		pr_notice("Do not dump other running tasks\n");
> > > +		return;
> > 
> > As you note, if we can race with the task being scheduled, this doesn't help.
> > 
> > Can we rule this out at a higher level?
> > Thanks,
> > Mark.
> Actually, according to my previous understanding, the low level function
> should be transparent to callers and should provide the right result and
> handle some unexpected cases, which means that if the result may be
> misleading, we should drop it. That is why I bypass all TASK_RUNNING
> tasks. I am not sure if this understanding is reasonable for this case.

Given that this can change under our feet, I think this only provides a
false sense of security and complicates the code.

> And as you mentioned that rule this out at a higher level, is there any
> suggestions, handle this in the caller of show_stack()?

Unfortunately, it doesn't look like we can do this in general given
cases like sysrq-t.

Thanks,
Mark.
diff mbox

Patch

diff --git a/arch/arm64/kernel/traps.c b/arch/arm64/kernel/traps.c
index eb2d151..95749364 100644
--- a/arch/arm64/kernel/traps.c
+++ b/arch/arm64/kernel/traps.c
@@ -113,6 +113,9 @@  void dump_backtrace(struct pt_regs *regs, struct task_struct *tsk)
 	if (tsk == current) {
 		frame.fp = (unsigned long)__builtin_frame_address(0);
 		frame.pc = (unsigned long)dump_backtrace;
+	else if (tsk->state == TASK_RUNNING) {
+		pr_notice("Do not dump other running tasks\n");
+		return;
 	} else {
 		/*
 		 * task blocked in __switch_to