Message ID | 1144433342.22716.1518732536600.JavaMail.zimbra@efficios.com (mailing list archive) |
---|---|
State | New, archived |
Hi,

On Thu, Feb 15, 2018 at 10:08:56PM +0000, Mathieu Desnoyers wrote:
> My current theory: do_exit() gets preempted after having set current->mm
> to NULL, and after having issued mmput(), which brings the mm_count down
> to 0.
>
> Unfortunately, if the scheduler switches from a userspace thread
> to a kernel thread, context_switch() loads prev->active_mm which still
> points to the now-freed mm, mmgrab the mm, and eventually does mmdrop
> in finish_task_switch().

For this to happen, we need to get to the mmput() in exit_mm() with:

  mm->mm_count == 1
  mm->mm_users == 1
  mm == active_mm

... but AFAICT, this cannot happen.

If there's no context_switch between clearing current->mm and the
mmput(), then mm->mm_count >= 2, thanks to the prior mmgrab() and the
active_mm reference (in mm_count) that context_switch+finish_task_switch
manage.

If there is a context_switch between the two, then AFAICT, either:

a) The task re-inherits its old mm as active_mm, and mm_count >= 2. In
   context_switch we mmgrab() the active_mm to inherit it, and in
   finish_task_switch() we drop the oldmm, balancing the mmgrab() with
   an mmdrop().

   e.g. we go task -> kernel_task -> task

b) At some point, another user task is scheduled, and we switch to its
   mm. We don't mmgrab() the active_mm, but we mmdrop() the oldmm, which
   means mm_count >= 1. Since we switched to a new mm, if we switch back
   to the first task, it cannot have its own mm as active_mm.

   e.g. we go task -> other_task -> task

I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
finish_task_switch() aren't to blame.

Thanks,
Mark.
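The counting argument above can be replayed in a small user-space model. This is only a sketch, not kernel code: `toy_mm`, `toy_task`, `toy_switch()`, `toy_exit_mm()` and the other `toy_*` names are hypothetical stand-ins, the model is single-threaded, and it assumes the simplified rules described in the mail (all mm_users share one mm_count reference, a task without an mm borrows the previous active_mm with a grab, and a previous task without an mm has its borrowed mm dropped).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct mm_struct: only the counters matter. */
struct toy_mm {
	int mm_users;	/* users of the address space */
	int mm_count;	/* references to the struct; all mm_users share one */
	bool freed;
};

/* Hypothetical stand-in for struct task_struct. */
struct toy_task {
	struct toy_mm *mm;		/* NULL for kernel threads */
	struct toy_mm *active_mm;	/* mm in use while the task runs */
};

static void toy_mmgrab(struct toy_mm *mm)
{
	assert(!mm->freed);
	mm->mm_count++;
}

static void toy_mmdrop(struct toy_mm *mm)
{
	assert(!mm->freed && mm->mm_count > 0);
	if (--mm->mm_count == 0)
		mm->freed = true;
}

static void toy_mmput(struct toy_mm *mm)
{
	assert(mm->mm_users > 0);
	if (--mm->mm_users == 0)
		toy_mmdrop(mm);	/* drop the reference shared by all users */
}

/* context_switch() + finish_task_switch(), reduced to the ref counting. */
static void toy_switch(struct toy_task *prev, struct toy_task *next)
{
	struct toy_mm *oldmm = prev->active_mm;

	if (!next->mm) {		/* no mm of its own: borrow oldmm */
		next->active_mm = oldmm;
		toy_mmgrab(oldmm);
	} else {			/* user task: runs on its own mm */
		next->active_mm = next->mm;
	}
	if (!prev->mm) {		/* prev was borrowing: give it back */
		prev->active_mm = NULL;
		toy_mmdrop(oldmm);
	}
}

enum toy_path { NO_SWITCH, VIA_KTHREAD, VIA_USER_TASK };

struct toy_result {
	int count_before_mmput;	/* mm_count seen just before mmput() */
	bool freed_after_mmput;
	bool active_mm_is_own;	/* does the task still run on its old mm? */
};

/* exit_mm() reduced to: mmgrab(), clear ->mm, optional switches, mmput(). */
static struct toy_result toy_exit_mm(enum toy_path path)
{
	struct toy_mm mm = { .mm_users = 1, .mm_count = 1, .freed = false };
	struct toy_mm other_mm = { .mm_users = 1, .mm_count = 1, .freed = false };
	struct toy_task task = { .mm = &mm, .active_mm = &mm };
	struct toy_task kthread = { .mm = NULL, .active_mm = NULL };
	struct toy_task other = { .mm = &other_mm, .active_mm = &other_mm };
	struct toy_result res;

	toy_mmgrab(&mm);	/* the "prior mmgrab()" */
	task.mm = NULL;		/* current->mm = NULL */

	if (path == VIA_KTHREAD) {		/* case a) */
		toy_switch(&task, &kthread);
		toy_switch(&kthread, &task);
	} else if (path == VIA_USER_TASK) {	/* case b) */
		toy_switch(&task, &other);
		toy_switch(&other, &task);
	}

	res.count_before_mmput = mm.mm_count;
	toy_mmput(&mm);
	res.freed_after_mmput = mm.freed;
	res.active_mm_is_own = (task.active_mm == &mm);
	return res;
}
```

Under these assumptions, `toy_exit_mm(NO_SWITCH)` and `toy_exit_mm(VIA_KTHREAD)` both reach the mmput() with a count of 2 and leave the mm alive. `toy_exit_mm(VIA_USER_TASK)` reaches it with a count of 1 and frees the mm, but by then the task is running on the other task's mm, so nothing still points at the freed one.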
----- On Feb 16, 2018, at 11:53 AM, Mark Rutland mark.rutland@arm.com wrote:

> Hi,
>
> On Thu, Feb 15, 2018 at 10:08:56PM +0000, Mathieu Desnoyers wrote:
>> My current theory: do_exit() gets preempted after having set current->mm
>> to NULL, and after having issued mmput(), which brings the mm_count down
>> to 0.
>>
>> Unfortunately, if the scheduler switches from a userspace thread
>> to a kernel thread, context_switch() loads prev->active_mm which still
>> points to the now-freed mm, mmgrab the mm, and eventually does mmdrop
>> in finish_task_switch().
>
> For this to happen, we need to get to the mmput() in exit_mm() with:
>
>   mm->mm_count == 1
>   mm->mm_users == 1
>   mm == active_mm
>
> ... but AFAICT, this cannot happen.
>
> If there's no context_switch between clearing current->mm and the
> mmput(), then mm->mm_count >= 2, thanks to the prior mmgrab() and the
> active_mm reference (in mm_count) that context_switch+finish_task_switch
> manage.
>
> If there is a context_switch between the two, then AFAICT, either:
>
> a) The task re-inherits its old mm as active_mm, and mm_count >= 2. In
>    context_switch we mmgrab() the active_mm to inherit it, and in
>    finish_task_switch() we drop the oldmm, balancing the mmgrab() with
>    an mmdrop().
>
>    e.g. we go task -> kernel_task -> task
>
> b) At some point, another user task is scheduled, and we switch to its
>    mm. We don't mmgrab() the active_mm, but we mmdrop() the oldmm, which
>    means mm_count >= 1. Since we switched to a new mm, if we switch back
>    to the first task, it cannot have its own mm as active_mm.
>
>    e.g. we go task -> other_task -> task
>
> I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
> finish_task_switch() aren't to blame.

Currently reviewing: fs/proc/base.c: __set_oom_adj()

        /*
         * Make sure we will check other processes sharing the mm if this is
         * not vfork which wants its own oom_score_adj.
         * pin the mm so it doesn't go away and get reused after task_unlock
         */
        if (!task->vfork_done) {
                struct task_struct *p = find_lock_task_mm(task);

                if (p) {
                        if (atomic_read(&p->mm->mm_users) > 1) {
                                mm = p->mm;
                                mmgrab(mm);
                        }
                        task_unlock(p);
                }
        }

Considering that the mmput() in exit_mm() is done outside of the
task_lock critical section, I wonder how the mm_users load is
synchronized?

Thanks,

Mathieu
On Fri, Feb 16, 2018 at 05:17:57PM +0000, Mathieu Desnoyers wrote:
> ----- On Feb 16, 2018, at 11:53 AM, Mark Rutland mark.rutland@arm.com wrote:
> > I suspect we have a bogus mmdrop or mmput elsewhere, and do_exit() and
> > finish_task_switch() aren't to blame.
>
> Currently reviewing: fs/proc/base.c: __set_oom_adj()
>
>         /*
>          * Make sure we will check other processes sharing the mm if this is
>          * not vfork which wants its own oom_score_adj.
>          * pin the mm so it doesn't go away and get reused after task_unlock
>          */
>         if (!task->vfork_done) {
>                 struct task_struct *p = find_lock_task_mm(task);
>
>                 if (p) {
>                         if (atomic_read(&p->mm->mm_users) > 1) {
>                                 mm = p->mm;
>                                 mmgrab(mm);
>                         }
>                         task_unlock(p);
>                 }
>         }
>
> Considering that the mmput() in exit_mm() is done outside of the
> task_lock critical section, I wonder how the mm_users load is
> synchronized?

That looks suspicious, but I don't think it can result in this
particular problem.

In find_lock_task_mm() we get the task lock, and check mm != NULL, which
means that mm->mm_count >= 1 (thanks to the implicit reference
context_switch()+finish_task_switch() manage).

While we hold the task lock, task->mm can't change beneath our feet, and
hence that reference can't be dropped by finish_task_switch().

Thus, immediately after the mmgrab(), we know mm->mm_count >= 2. That
shouldn't drop below 1 until the subsequent mmdrop(), even after we drop
the task lock, unless there's a misplaced mmdrop() elsewhere. Locally,
mmgrab() and mmdrop() are balanced.

However, if mm_users can be incremented behind our back, we might skip
updating the oom adjustments for other users of the mm.

Thanks,
Mark.
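The local balance of mmgrab() and mmdrop() described above can be illustrated with a second user-space sketch. Again, every `toy_*` name is hypothetical, locking is only indicated by comments, and the model tracks just the single mm_count reference shared by all mm_users (it ignores the lazy active_mm references entirely).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins; only the two counters are modelled. */
struct toy_mm {
	int mm_users;
	int mm_count;	/* starts at 1: the reference shared by mm_users */
	bool freed;
};

struct toy_task {
	struct toy_mm *mm;
};

static void toy_mmgrab(struct toy_mm *mm)
{
	assert(!mm->freed);
	mm->mm_count++;
}

static void toy_mmdrop(struct toy_mm *mm)
{
	assert(!mm->freed && mm->mm_count > 0);
	if (--mm->mm_count == 0)
		mm->freed = true;
}

static void toy_mmput(struct toy_mm *mm)
{
	assert(mm->mm_users > 0);
	if (--mm->mm_users == 0)
		toy_mmdrop(mm);
}

/* The pin step quoted from __set_oom_adj(), reduced to the counters. */
static struct toy_mm *toy_pin_if_shared(struct toy_task *p)
{
	struct toy_mm *mm = NULL;

	/* task_lock(p) would be held from here ... */
	if (p->mm && p->mm->mm_users > 1) {	/* one snapshot of mm_users */
		mm = p->mm;
		toy_mmgrab(mm);
	}
	/* ... to here; p->mm cannot change in between. */
	return mm;
}

struct toy_pin_result {
	bool pinned;
	int count_after_pin;
	bool freed_before_unpin;	/* freed while we still hold the pin? */
	bool freed_at_end;
};

/* Pin the mm, let every user exit afterwards, then drop the pin. */
static struct toy_pin_result toy_pin_then_exit(int users)
{
	struct toy_mm mm = { .mm_users = users, .mm_count = 1, .freed = false };
	struct toy_task task = { .mm = &mm };
	struct toy_pin_result res;
	struct toy_mm *pinned = toy_pin_if_shared(&task);
	int i;

	res.pinned = (pinned != NULL);
	res.count_after_pin = mm.mm_count;

	for (i = 0; i < users; i++)	/* all users exit after the unlock */
		toy_mmput(&mm);
	res.freed_before_unpin = mm.freed;

	if (pinned)
		toy_mmdrop(pinned);	/* the balancing mmdrop() */
	res.freed_at_end = mm.freed;
	return res;
}
```

With two users, the pin raises the count to 2, so the mm survives both exits and is only freed by the balancing drop. With a single user, nothing is pinned and the walk over other users is skipped. Because that decision rests on one snapshot of mm_users, a user that appears after the check would be missed, which is the caveat raised at the end of the mail.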
diff --git a/kernel/exit.c b/kernel/exit.c
index 995453d..fefba68 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -764,6 +764,7 @@ void __noreturn do_exit(long code)
 {
 	struct task_struct *tsk = current;
 	int group_dead;
+	struct mm_struct *mm;
 
 	profile_task_exit(tsk);
 	kcov_task_exit(tsk);
@@ -849,6 +850,10 @@ void __noreturn do_exit(long code)
 	tsk->exit_code = code;
 	taskstats_exit(tsk, group_dead);
 
+	mm = current->mm;
+	if (mm)
+		mmgrab(mm);
+
 	exit_mm();
 
 	if (group_dead)
@@ -920,6 +925,9 @@ void __noreturn do_exit(long code)
 
 	lockdep_free_task(tsk);
 	do_task_dead();
+
+	if (mm)
+		mmdrop(mm);
 }
 EXPORT_SYMBOL_GPL(do_exit);