
[v2] mm: optimize the redundant loop of mm_update_next_owner()

Message ID 20240620152744.4038983-1-alexjlzheng@tencent.com (mailing list archive)
State New
Series [v2] mm: optimize the redundant loop of mm_update_next_owner()

Commit Message

Jinliang Zheng June 20, 2024, 3:27 p.m. UTC
From: Jinliang Zheng <alexjlzheng@tencent.com>

When mm_update_next_owner() races with swapoff (try_to_unuse()), /proc,
ptrace, or page migration (get_task_mm()), the loop cannot find an
appropriate task_struct whose mm_struct is the same as the target
mm_struct.

If this race is combined with the stress-ng-zombie and stress-ng-dup
tests, such a long loop can easily cause a hard lockup in
write_lock_irq() on tasklist_lock.

Recognize this situation in advance and exit early.

Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>
---
Changelog:

V2: Fix mm_update_owner_next() to mm_update_next_owner() in comment
---
 kernel/exit.c | 2 ++
 1 file changed, 2 insertions(+)

Comments

Oleg Nesterov June 20, 2024, 5:30 p.m. UTC | #1
Can't review, I forgot everything about mm_update_next_owner().
So I am sorry for the noise I am going to add, feel free to ignore.
Just in case, I see nothing wrong in this patch.

On 06/20, alexjlzheng@gmail.com wrote:
>
> When mm_update_next_owner() races with swapoff (try_to_unuse()), /proc,
> ptrace, or page migration (get_task_mm()), the loop cannot find an
> appropriate task_struct whose mm_struct is the same as the target
> mm_struct.
>
> If this race is combined with the stress-ng-zombie and stress-ng-dup
> tests, such a long loop can easily cause a hard lockup in
> write_lock_irq() on tasklist_lock.
>
> Recognize this situation in advance and exit early.

But this patch won't help if (say) ptrace_access_vm() sleeps while
for_each_process() tries to find another owner, right?

> @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
>  	 * Search through everything else, we should not get here often.
>  	 */
>  	for_each_process(g) {
> +		if (atomic_read(&mm->mm_users) <= 1)
> +			break;

I think this deserves a comment to explain that this is an optimization
for the case where we race with a pending mmput(). mm_update_next_owner()
checks mm_users at the start.

And. Can we drop tasklist and use rcu_read_lock() before for_each_process?
Yes, this will probably need more changes even if possible...


Or even better. Can't we finally kill mm_update_next_owner() and turn the
ugly mm->owner into mm->mem_cgroup ?

Michal, Eric, iirc you had the patch(es) which do this?

Oleg.
Michal Hocko June 21, 2024, 8:50 a.m. UTC | #2
On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> Can't review, I forgot everything about mm_update_next_owner().
> So I am sorry for the noise I am going to add, feel free to ignore.
> Just in case, I see nothing wrong in this patch.
> 
> On 06/20, alexjlzheng@gmail.com wrote:
> >
> > When mm_update_next_owner() races with swapoff (try_to_unuse()), /proc,
> > ptrace, or page migration (get_task_mm()), the loop cannot find an
> > appropriate task_struct whose mm_struct is the same as the target
> > mm_struct.
> >
> > If this race is combined with the stress-ng-zombie and stress-ng-dup
> > tests, such a long loop can easily cause a hard lockup in
> > write_lock_irq() on tasklist_lock.
> >
> > Recognize this situation in advance and exit early.
> 
> But this patch won't help if (say) ptrace_access_vm() sleeps while
> for_each_process() tries to find another owner, right?
> 
> > @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
> >  	 * Search through everything else, we should not get here often.
> >  	 */
> >  	for_each_process(g) {
> > +		if (atomic_read(&mm->mm_users) <= 1)
> > +			break;
> 
> I think this deserves a comment to explain that this is an optimization
> for the case where we race with a pending mmput(). mm_update_next_owner()
> checks mm_users at the start.
> 
> And. Can we drop tasklist and use rcu_read_lock() before for_each_process?
> Yes, this will probably need more changes even if possible...
> 
> 
> Or even better. Can't we finally kill mm_update_next_owner() and turn the
> ugly mm->owner into mm->mem_cgroup ?

Yes, dropping mm->owner should be the way to go. Replacing it with
mem_cgroup sounds like an improvement. I have a vague recollection that
this has some traps on the way, e.g. tasks sharing the mm but living in
different cgroups. Things have changed since the last time I checked,
and for example memcg charge migration on task move will be deprecated
soon, so chances are there are fewer roadblocks on the way.
Andrew Morton June 25, 2024, 10:21 p.m. UTC | #3
On Fri, 21 Jun 2024 10:50:10 +0200 Michal Hocko <mhocko@suse.com> wrote:

> On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> > Can't review, I forgot everything about mm_update_next_owner().
> > So I am sorry for the noise I am going to add, feel free to ignore.
> > Just in case, I see nothing wrong in this patch.
> > 
> > I think this deserves a comment to explain that this is an optimization
> > for the case where we race with a pending mmput(). mm_update_next_owner()
> > checks mm_users at the start.
> > 
> > And. Can we drop tasklist and use rcu_read_lock() before for_each_process?
> > Yes, this will probably need more changes even if possible...
> > 
> > 
> > Or even better. Can't we finally kill mm_update_next_owner() and turn the
> > ugly mm->owner into mm->mem_cgroup ?
> 
> Yes, dropping mm->owner should be the way to go. Replacing it with
> mem_cgroup sounds like an improvement. I have a vague recollection that
> this has some traps on the way, e.g. tasks sharing the mm but living in
> different cgroups. Things have changed since the last time I checked,
> and for example memcg charge migration on task move will be deprecated
> soon, so chances are there are fewer roadblocks on the way.

I think this was alexjlzheng's first kernel contribution and as such we
might not be hearing from him(?) again.  And that's OK, thanks for the
bug report - it helps Linux.

Meanwhile we have a stalled patch in mm-unstable.  If someone could add
this issue to their todo list, that would be great.
Jinliang Zheng June 26, 2024, 6:43 a.m. UTC | #4
On Fri, 21 Jun 2024 10:50:10 +0200, Michal Hocko wrote:
> On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> > Can't review, I forgot everything about mm_update_next_owner().
> > So I am sorry for the noise I am going to add, feel free to ignore.
> > Just in case, I see nothing wrong in this patch.
> > 
> > On 06/20, alexjlzheng@gmail.com wrote:
> > >
> > > When mm_update_next_owner() races with swapoff (try_to_unuse()), /proc,
> > > ptrace, or page migration (get_task_mm()), the loop cannot find an
> > > appropriate task_struct whose mm_struct is the same as the target
> > > mm_struct.
> > >
> > > If this race is combined with the stress-ng-zombie and stress-ng-dup
> > > tests, such a long loop can easily cause a hard lockup in
> > > write_lock_irq() on tasklist_lock.
> > >
> > > Recognize this situation in advance and exit early.
> > 
> > But this patch won't help if (say) ptrace_access_vm() sleeps while
> > for_each_process() tries to find another owner, right?
> > 
> > > @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
> > >  	 * Search through everything else, we should not get here often.
> > >  	 */
> > >  	for_each_process(g) {
> > > +		if (atomic_read(&mm->mm_users) <= 1)
> > > +			break;
> > 
> > I think this deserves a comment to explain that this is an optimization
> > for the case where we race with a pending mmput(). mm_update_next_owner()
> > checks mm_users at the start.
> > 
> > And. Can we drop tasklist and use rcu_read_lock() before for_each_process?
> > Yes, this will probably need more changes even if possible...
> > 
> > 
> > Or even better. Can't we finally kill mm_update_next_owner() and turn the
> > ugly mm->owner into mm->mem_cgroup ?
> 
> Yes, dropping mm->owner should be the way to go. Replacing it with
> mem_cgroup sounds like an improvement. I have a vague recollection that

Sorry for the late reply.

Replacing it with mem_cgroup may be a good idea, and an RCU lock looks
good, too. But until that larger rework is implemented, I recommend
using this patch to alleviate the problem. Both [PATCH] and [PATCH v2]
are acceptable; they only differ in the commit log.

Thanks for your reply. :)
Jinliang Zheng

> this has some traps on the way, e.g. tasks sharing the mm but living in
> different cgroups. Things have changed since the last time I checked,
> and for example memcg charge migration on task move will be deprecated
> soon, so chances are there are fewer roadblocks on the way.
> -- 
> Michal Hocko
> SUSE Labs
Oleg Nesterov June 26, 2024, 3:23 p.m. UTC | #5
On 06/21, Michal Hocko wrote:
>
> On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> >
> > Or even better. Can't we finally kill mm_update_next_owner() and turn the
> > ugly mm->owner into mm->mem_cgroup ?
>
> Yes, dropping mm->owner should be the way to go. Replacing it with
> mem_cgroup sounds like an improvement. I have a vague recollection that
> this has some traps on the way, e.g. tasks sharing the mm but living in
> different cgroups. Things have changed since the last time I checked,
> and for example memcg charge migration on task move will be deprecated
> soon, so chances are there are fewer roadblocks on the way.

OK, thanks...

So if we can't do this right now, can we at least clean it up? To me it
looks just ugly.

We don't need get/put_task_struct. The "retry" logic is obviously suboptimal.
The search in the children/siblings doesn't handle zombie leaders.

I'll send 2 (hopefully simple) patches in a minute, could you review? I have
no idea how to test them...

Oleg.
Michal Hocko June 27, 2024, 7:44 a.m. UTC | #6
On Thu 20-06-24 23:27:45, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
> 
> When mm_update_next_owner() races with swapoff (try_to_unuse()), /proc,
> ptrace, or page migration (get_task_mm()), the loop cannot find an
> appropriate task_struct whose mm_struct is the same as the target
> mm_struct.
>
> If this race is combined with the stress-ng-zombie and stress-ng-dup
> tests, such a long loop can easily cause a hard lockup in
> write_lock_irq() on tasklist_lock.
>
> Recognize this situation in advance and exit early.
> 
> Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>

Even if this is not really a full fix, it is a useful stopgap that
catches at least some cases.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> Changelog:
> 
> V2: Fix mm_update_owner_next() to mm_update_next_owner() in comment
> ---
>  kernel/exit.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/kernel/exit.c b/kernel/exit.c
> index f95a2c1338a8..81fcee45d630 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
>  	 * Search through everything else, we should not get here often.
>  	 */
>  	for_each_process(g) {
> +		if (atomic_read(&mm->mm_users) <= 1)
> +			break;
>  		if (g->flags & PF_KTHREAD)
>  			continue;
>  		for_each_thread(g, c) {
> -- 
> 2.39.3
>

Patch

diff --git a/kernel/exit.c b/kernel/exit.c
index f95a2c1338a8..81fcee45d630 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
 	 * Search through everything else, we should not get here often.
 	 */
 	for_each_process(g) {
+		if (atomic_read(&mm->mm_users) <= 1)
+			break;
 		if (g->flags & PF_KTHREAD)
 			continue;
 		for_each_thread(g, c) {
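Oleg asked above for a comment explaining the new check. One possible wording, applied to the hunk above, might look like this (a sketch only; no such revision was posted in this thread):

```c
	for_each_process(g) {
		/*
		 * Optimization for the case where we race with a pending
		 * mmput(): mm_update_next_owner() already checked mm_users
		 * at entry, so if it has since dropped to 1 no other task
		 * can become the owner and the rest of the scan is futile.
		 */
		if (atomic_read(&mm->mm_users) <= 1)
			break;
		if (g->flags & PF_KTHREAD)
			continue;
```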