arm64/v4.16-rc1: KASAN: use-after-free Read in finish_task_switch

Message ID 1144433342.22716.1518732536600.JavaMail.zimbra@efficios.com (mailing list archive)
State New, archived

Commit Message

Mathieu Desnoyers Feb. 15, 2018, 10:08 p.m. UTC
----- On Feb 15, 2018, at 1:21 PM, Will Deacon will.deacon@arm.com wrote:

> On Thu, Feb 15, 2018 at 05:47:54PM +0100, Peter Zijlstra wrote:
>> On Thu, Feb 15, 2018 at 02:22:39PM +0000, Will Deacon wrote:
>> > Instead, we've come up with a more plausible sequence that can in theory
>> > happen on a single CPU:
>> > 
>> > <task foo calls exit()>
>> > 
>> > do_exit
>> > 	exit_mm
>> 
>> If this is the last task of the process, we would expect:
>> 
>>   mm_count == 1
>>   mm_users == 1
>> 
>> at this point.
>> 
>> > 		mmgrab(mm);			// foo's mm has count +1
>> > 		BUG_ON(mm != current->active_mm);
>> > 		task_lock(current);
>> > 		current->mm = NULL;
>> > 		task_unlock(current);
>> 
>> So the whole active_mm is basically the last 'real' mm, and its purpose
>> is to avoid switch_mm() between user tasks and kernel tasks.
>> 
>> A kernel task has !->mm. We do this by incrementing mm_count when
>> switching from user to kernel task and decrementing when switching from
>> kernel to user.
>> 
>> What exit_mm() does is change a user task into a 'kernel' task. So it
>> should increment mm_count to mirror the context switch. I suspect this
>> is what the mmgrab() in exit_mm() is for.
>> 
>> > <irq and ctxsw to kthread>
>> > 
>> > context_switch(prev=foo, next=kthread)
>> > 	mm = next->mm;
>> > 	oldmm = prev->active_mm;
>> > 
>> > 	if (!mm) {				// True for kthread
>> > 		next->active_mm = oldmm;
>> > 		mmgrab(oldmm);			// foo's mm has count +2
>> > 	}
>> > 
>> > 	if (!prev->mm) {			// True for foo
>> > 		rq->prev_mm = oldmm;
>> > 	}
>> > 
>> > 	finish_task_switch
>> > 		mm = rq->prev_mm;
>> > 		if (mm) {			// True (foo's mm)
>> > 			mmdrop(mm);		// foo's mm has count +1
>> > 		}
>> > 
>> > 	[...]
>> > 
>> > <ctxsw to task bar>
>> > 
>> > context_switch(prev=kthread, next=bar)
>> > 	mm = next->mm;
>> > 	oldmm = prev->active_mm;		// foo's mm!
>> > 
>> > 	if (!prev->mm) {			// True for kthread
>> > 		rq->prev_mm = oldmm;
>> > 	}
>> > 
>> > 	finish_task_switch
>> > 		mm = rq->prev_mm;
>> > 		if (mm) {			// True (foo's mm)
>> > 			mmdrop(mm);		// foo's mm has count +0
>> 
>> The context switch into the next user task will then decrement. At this
>> point foo no longer has a reference to its mm, except on the stack.
>> 
>> > 		}
>> > 
>> > 	[...]
>> > 
>> > <ctxsw back to task foo>
>> > 
>> > context_switch(prev=bar, next=foo)
>> > 	mm = next->mm;
>> > 	oldmm = prev->active_mm;
>> > 
>> > 	if (!mm) {				// True for foo
>> > 		next->active_mm = oldmm;	// This is bar's mm
>> > 		mmgrab(oldmm);			// bar's mm has count +1
>> > 	}
>> > 
>> > 
>> > 	[return back to exit_mm]
>> 
>> Enter mm_users, this counts the number of tasks associated with the mm.
>> We start with 1 in mm_init(), and when it drops to 0, we decrement
>> mm_count. Since we also start with mm_count == 1, this would appear
>> consistent.
>> 
>>   mmput() // --mm_users == 0, which then results in:
>> 
>> > mmdrop(mm);					// foo's mm has count -1
>> 
>> In the above case, that's the very last reference to the mm, and since
>> we started out with mm_count == 1, this -1 makes 0 and we do the actual
>> free.
>> 
>> > At this point, we've got an imbalanced count on the mm and could free it
>> > prematurely as seen in the KASAN log.
>> 
>> I'm not sure I see premature. At this point mm_users==0, mm_count==0 and
>> we freed mm and there is no further use of the on-stack mm pointer and
>> foo no longer has a pointer to it in either ->mm or ->active_mm. It's
>> well and proper dead.
>> 
>> > A subsequent context-switch away from foo would therefore result in a
>> > use-after-free.
>> 
>> At the above point, foo no longer has a reference to mm, we cleared ->mm
>> early, and the context switch to bar cleared ->active_mm. The switch
>> back into foo then results with foo->active_mm == bar->mm, which is
>> fine.
> 
> Bugger, you're right. When we switch off foo after freeing the mm, we'll
> actually access its active_mm, which points to bar's mm. So whilst this
> can explain part of the kasan splat, it doesn't explain the actual
> use-after-free.
> 
> More head-scratching required :(
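
To double-check the arithmetic in the quoted trace, here is a minimal
userspace sketch (not kernel code) that replays it with Peter's starting
values of mm_users == 1 and mm_count == 1. The names struct mm_model and
replay_single_cpu_trace() are made up for illustration, and mmgrab(),
mmdrop() and mmput() are simplified stand-ins that only count. The
asserts would fire if any step touched foo's mm after it was freed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-in for the two counters in struct mm_struct. */
struct mm_model {
	int mm_users;	/* tasks using the address space */
	int mm_count;	/* references to the structure itself */
	bool freed;	/* set once mm_count reaches zero */
};

static void mmgrab(struct mm_model *mm)
{
	assert(!mm->freed);	/* grabbing a freed mm would be the use-after-free */
	mm->mm_count++;
}

static void mmdrop(struct mm_model *mm)
{
	assert(!mm->freed);
	if (--mm->mm_count == 0)
		mm->freed = true;
}

static void mmput(struct mm_model *mm)
{
	assert(!mm->freed);
	if (--mm->mm_users == 0)
		mmdrop(mm);	/* all users together hold one mm_count reference */
}

/* Replays the trace; returns foo's mm_count just before the final mmput(). */
int replay_single_cpu_trace(struct mm_model *foo_mm)
{
	int before_mmput;

	foo_mm->mm_users = 1;
	foo_mm->mm_count = 1;
	foo_mm->freed = false;

	mmgrab(foo_mm);		/* exit_mm(): mm_count 2 */

	/* <ctxsw foo -> kthread> */
	mmgrab(foo_mm);		/* kthread borrows foo's mm: mm_count 3 */
	mmdrop(foo_mm);		/* finish_task_switch(): mm_count 2 */

	/* <ctxsw kthread -> bar> */
	mmdrop(foo_mm);		/* finish_task_switch(): mm_count 1 */

	/* <ctxsw bar -> foo> grabs bar's mm, which is not modeled here */

	before_mmput = foo_mm->mm_count;
	mmput(foo_mm);		/* mm_users 0, then mm_count 0: freed */
	return before_mmput;
}
```

Under these assumptions the count is 1 when foo reaches mmput() and the
free happens exactly once, at the end, which agrees with Peter's reading
that the sequence is balanced.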

My current theory: do_exit() gets preempted after having set current->mm
to NULL, and after having issued mmput(), which brings the mm_count down
to 0. Unfortunately, if the scheduler switches from a userspace thread
to a kernel thread, context_switch() loads prev->active_mm which still
points to the now-freed mm, calls mmgrab() on it, and eventually calls mmdrop()
in finish_task_switch().

If my understanding is correct, the following patch should help. The idea
is to keep a reference on the mm_count until after we are sure the scheduler
cannot schedule the task anymore. What I'm not sure about is the exact point
in do_exit() after which the task can never be preempted again.

Comments

Mark Rutland Feb. 16, 2018, 4:53 p.m. UTC | #1
Hi,

On Thu, Feb 15, 2018 at 10:08:56PM +0000, Mathieu Desnoyers wrote:
> My current theory: do_exit() gets preempted after having set current->mm
> to NULL, and after having issued mmput(), which brings the mm_count down
> to 0.
>
> Unfortunately, if the scheduler switches from a userspace thread
> to a kernel thread, context_switch() loads prev->active_mm which still
> points to the now-freed mm, calls mmgrab() on it, and eventually calls mmdrop()
> in finish_task_switch().

For this to happen, we need to get to the mmput() in exit_mm() with:

  mm->mm_count == 1
  mm->mm_users == 1
  mm == active_mm

... but AFAICT, this cannot happen.

If there's no context_switch between clearing current->mm and the
mmput(), then mm->mm_count >= 2, thanks to the prior mmgrab() and the
active_mm reference (in mm_count) that context_switch+finish_task_switch
manage.

If there is a context_switch between the two, then AFAICT, either:

a) The task re-inherits its old mm as active_mm, and mm_count >= 2. In
   context_switch we mmgrab() the active_mm to inherit it, and in
   finish_task_switch() we drop the oldmm, balancing the mmgrab() with
   an mmdrop().

   e.g. we go task -> kernel_task -> task

b) At some point, another user task is scheduled, and we switch to its
   mm. We don't mmgrab() the active_mm, but we mmdrop() the oldmm, which
   means mm_count >= 1. Since we switched to a new mm, if we switch back
   to the first task, it cannot have its own mm as active_mm.

   e.g. we go task -> other_task -> task

I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
finish_task_switch() aren't to blame.

Thanks,
Mark.
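
The two cases above can be checked mechanically with a small userspace
model. This is only a sketch under stated assumptions: struct mm_model,
struct task_model, switch_model(), exit_mm_until_mmput() and run_case()
are hypothetical names, switch_model() condenses the mm accounting of
context_switch() and finish_task_switch() into one step, and nothing
else about the scheduler is modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the fields the argument needs. */
struct mm_model {
	int mm_users;
	int mm_count;
};

struct task_model {
	struct mm_model *mm;		/* NULL for kthreads and after exit_mm() clears it */
	struct mm_model *active_mm;	/* NULL while a borrowing task is switched out */
};

struct outcome {
	int mm_count;		/* count on the exiting task's mm at mmput() time */
	int own_mm_active;	/* 1 if that mm is still the task's active_mm */
};

static void mmgrab(struct mm_model *mm)
{
	mm->mm_count++;
}

static void mmdrop(struct mm_model *mm)
{
	assert(mm->mm_count > 0);	/* going below zero would be a bogus mmdrop */
	mm->mm_count--;
}

/* mm accounting of context_switch() + finish_task_switch(), condensed. */
static void switch_model(struct task_model *prev, struct task_model *next)
{
	struct mm_model *oldmm = prev->active_mm;

	if (!next->mm) {		/* next borrows the outgoing mm */
		next->active_mm = oldmm;
		mmgrab(oldmm);
	}
	if (!prev->mm) {		/* prev was borrowing: hand its reference back */
		prev->active_mm = NULL;
		mmdrop(oldmm);
	}
}

/* exit_mm() up to, but not including, the final mmput(). */
static void exit_mm_until_mmput(struct task_model *tsk)
{
	mmgrab(tsk->mm);
	tsk->mm = NULL;
}

/* via_kthread != 0 runs case a), otherwise case b). */
struct outcome run_case(int via_kthread)
{
	struct mm_model mm = { 1, 1 }, other_mm = { 1, 1 };
	struct task_model task = { &mm, &mm };
	struct task_model kthread = { NULL, NULL };
	struct task_model other = { &other_mm, &other_mm };
	struct task_model *mid = via_kthread ? &kthread : &other;
	struct outcome out;

	exit_mm_until_mmput(&task);
	switch_model(&task, mid);	/* task -> kernel_task or other_task */
	switch_model(mid, &task);	/* and back to task */

	out.mm_count = mm.mm_count;
	out.own_mm_active = (task.active_mm == &mm);
	return out;
}
```

In this model, case a) reaches mmput() with mm_count == 2 and the task's
own mm as active_mm, while case b) reaches it with mm_count == 1 but with
the other task's mm as active_mm. Neither path produces mm_count == 1
together with mm == active_mm.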

Mathieu Desnoyers Feb. 16, 2018, 5:17 p.m. UTC | #2
----- On Feb 16, 2018, at 11:53 AM, Mark Rutland mark.rutland@arm.com wrote:

> Hi,
> 
> On Thu, Feb 15, 2018 at 10:08:56PM +0000, Mathieu Desnoyers wrote:
>> My current theory: do_exit() gets preempted after having set current->mm
>> to NULL, and after having issued mmput(), which brings the mm_count down
>> to 0.
>>
>> Unfortunately, if the scheduler switches from a userspace thread
>> to a kernel thread, context_switch() loads prev->active_mm which still
>> points to the now-freed mm, calls mmgrab() on it, and eventually calls mmdrop()
>> in finish_task_switch().
> 
> For this to happen, we need to get to the mmput() in exit_mm() with:
> 
>  mm->mm_count == 1
>  mm->mm_users == 1
>  mm == active_mm
> 
> ... but AFAICT, this cannot happen.
> 
> If there's no context_switch between clearing current->mm and the
> mmput(), then mm->mm_count >= 2, thanks to the prior mmgrab() and the
> active_mm reference (in mm_count) that context_switch+finish_task_switch
> manage.
> 
> If there is a context_switch between the two, then AFAICT, either:
> 
> a) The task re-inherits its old mm as active_mm, and mm_count >= 2. In
>   context_switch we mmgrab() the active_mm to inherit it, and in
>   finish_task_switch() we drop the oldmm, balancing the mmgrab() with
>   an mmdrop().
> 
>   e.g. we go task -> kernel_task -> task
> 
> b) At some point, another user task is scheduled, and we switch to its
>   mm. We don't mmgrab() the active_mm, but we mmdrop() the oldmm, which
>   means mm_count >= 1. Since we switched to a new mm, if we switch back
>   to the first task, it cannot have its own mm as active_mm.
> 
>   e.g. we go task -> other_task -> task
> 
> I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
> finish_task_switch() aren't to blame.

Currently reviewing: fs/proc/base.c: __set_oom_adj()

        /*
         * Make sure we will check other processes sharing the mm if this is
         * not vfork which wants its own oom_score_adj.
         * pin the mm so it doesn't go away and get reused after task_unlock
         */
        if (!task->vfork_done) {
                struct task_struct *p = find_lock_task_mm(task);

                if (p) {
                        if (atomic_read(&p->mm->mm_users) > 1) {
                                mm = p->mm;
                                mmgrab(mm);
                        }
                        task_unlock(p);
                }
        }

Considering that exit_mm() calls mmput() outside of the task_lock
critical section, I wonder how the mm_users load is synchronized?

Thanks,

Mathieu

Mark Rutland Feb. 16, 2018, 6:33 p.m. UTC | #3
On Fri, Feb 16, 2018 at 05:17:57PM +0000, Mathieu Desnoyers wrote:
> ----- On Feb 16, 2018, at 11:53 AM, Mark Rutland mark.rutland@arm.com wrote:
> > I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
> > finish_task_switch() aren't to blame.
> 
> Currently reviewing: fs/proc/base.c: __set_oom_adj()
> 
>         /*
>          * Make sure we will check other processes sharing the mm if this is
>          * not vfork which wants its own oom_score_adj.
>          * pin the mm so it doesn't go away and get reused after task_unlock
>          */
>         if (!task->vfork_done) {
>                 struct task_struct *p = find_lock_task_mm(task);
> 
>                 if (p) {
>                         if (atomic_read(&p->mm->mm_users) > 1) {
>                                 mm = p->mm;
>                                 mmgrab(mm);
>                         }
>                         task_unlock(p);
>                 }
>         }
> 
> Considering that exit_mm() calls mmput() outside of the task_lock
> critical section, I wonder how the mm_users load is synchronized?

That looks suspicious, but I don't think it can result in this
particular problem.

In find_lock_task_mm() we get the task lock, and check mm != NULL, which
means that mm->mm_count >= 1 (thanks to the implicit reference
context_switch()+finish_task_switch() manage). While we hold the task
lock, task->mm can't change beneath our feet, and hence that reference
can't be dropped by finish_task_switch().

Thus, immediately after the mmgrab(), we know mm->mm_count >= 2. That
shouldn't drop below 1 until the subsequent mmdrop(), even after we drop
the task lock, unless there's a misplaced mmdrop() elsewhere. Locally,
mmgrab() and mmdrop() are balanced.

However, if mm_users can be incremented behind our back, we might skip
updating the oom adjustments for other users of the mm.

Thanks,
Mark.

Patch

diff --git a/kernel/exit.c b/kernel/exit.c
index 995453d..fefba68 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -764,6 +764,7 @@  void __noreturn do_exit(long code)
 {
        struct task_struct *tsk = current;
        int group_dead;
+       struct mm_struct *mm;
 
        profile_task_exit(tsk);
        kcov_task_exit(tsk);
@@ -849,6 +850,10 @@  void __noreturn do_exit(long code)
        tsk->exit_code = code;
        taskstats_exit(tsk, group_dead);
 
+       mm = current->mm;
+       if (mm)
+               mmgrab(mm);
+
        exit_mm();
 
        if (group_dead)
@@ -920,6 +925,9 @@  void __noreturn do_exit(long code)
 
        lockdep_free_task(tsk);
        do_task_dead();
+
+       if (mm)
+               mmdrop(mm);
 }
 EXPORT_SYMBOL_GPL(do_exit);