Message ID | 20240620152744.4038983-1-alexjlzheng@tencent.com |
---|---|
State | New |
Series | [v2] mm: optimize the redundant loop of mm_update_next_owner() |
Can't review, I forgot everything about mm_update_next_owner(). So I am
sorry for the noise I am going to add, feel free to ignore. Just in
case, I see nothing wrong in this patch.

On 06/20, alexjlzheng@gmail.com wrote:
>
> When mm_update_next_owner() is racing with swapoff (try_to_unuse()), /proc,
> ptrace, or page migration (get_task_mm()), it is impossible to find an
> appropriate task_struct in the loop whose mm_struct is the same as the
> target mm_struct.
>
> If the above race condition is combined with the stress-ng-zombie and
> stress-ng-dup tests, such a long loop can easily cause a hard lockup in
> write_lock_irq() for tasklist_lock.
>
> Recognize this situation in advance and exit early.

But this patch won't help if (say) ptrace_access_vm() sleeps while
for_each_process() tries to find another owner, right?

> @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
>  	 * Search through everything else, we should not get here often.
>  	 */
>  	for_each_process(g) {
> +		if (atomic_read(&mm->mm_users) <= 1)
> +			break;

I think this deserves a comment to explain that this is an optimization
for the case where we race with the pending mmput(); mm_update_next_owner()
checks mm_users at the start.

And: can we drop tasklist_lock and use rcu_read_lock() before
for_each_process()? Yes, this will probably need more changes even if
possible...

Or even better: can't we finally kill mm_update_next_owner() and turn
the ugly mm->owner into mm->mem_cgroup?

Michal, Eric, IIRC you had the patch(es) which do this?

Oleg.
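For reference, the comment Oleg asks for could read something like the
following. This is only a sketch against the hunk above, not an actual
follow-up patch:

	for_each_process(g) {
		/*
		 * Sketch: we raced with the pending mmput(). Once
		 * mm_users drops to 1, the only remaining reference
		 * is the one being released by the exiting task, so
		 * no task on the list can still use this mm and
		 * scanning further cannot find a new owner.
		 */
		if (atomic_read(&mm->mm_users) <= 1)
			break;
		if (g->flags & PF_KTHREAD)
			continue;
		/* ... per-thread search continues as before ... */
	}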
On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> Can't review, I forgot everything about mm_update_next_owner(). So I am
> sorry for the noise I am going to add, feel free to ignore. Just in
> case, I see nothing wrong in this patch.
>
> On 06/20, alexjlzheng@gmail.com wrote:
> >
> > When mm_update_next_owner() is racing with swapoff (try_to_unuse()), /proc,
> > ptrace, or page migration (get_task_mm()), it is impossible to find an
> > appropriate task_struct in the loop whose mm_struct is the same as the
> > target mm_struct.
> >
> > If the above race condition is combined with the stress-ng-zombie and
> > stress-ng-dup tests, such a long loop can easily cause a hard lockup in
> > write_lock_irq() for tasklist_lock.
> >
> > Recognize this situation in advance and exit early.
>
> But this patch won't help if (say) ptrace_access_vm() sleeps while
> for_each_process() tries to find another owner, right?
>
> > @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
> >  	 * Search through everything else, we should not get here often.
> >  	 */
> >  	for_each_process(g) {
> > +		if (atomic_read(&mm->mm_users) <= 1)
> > +			break;
>
> I think this deserves a comment to explain that this is an optimization
> for the case where we race with the pending mmput(); mm_update_next_owner()
> checks mm_users at the start.
>
> And: can we drop tasklist_lock and use rcu_read_lock() before
> for_each_process()? Yes, this will probably need more changes even if
> possible...
>
> Or even better: can't we finally kill mm_update_next_owner() and turn
> the ugly mm->owner into mm->mem_cgroup?

Yes, dropping mm->owner should be the way to go. Replacing it with
mem_cgroup sounds like an improvement. I have a vague recollection that
this has some traps on the way, e.g. tasks sharing the mm but living in
different cgroups. Things have changed since I last checked, and for
example memcg charge migration on task move will be deprecated soon, so
chances are that there are fewer roadblocks on the way.
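To make the mm->mem_cgroup idea concrete, here is a minimal hypothetical
sketch. The field and helper below are assumptions for illustration, not
code from any of the patches discussed, and they deliberately gloss over
the traps Michal mentions (e.g. threads of one mm living in different
cgroups):

	struct mm_struct {
		/* ... existing fields ... */
	#ifdef CONFIG_MEMCG
		/*
		 * Hypothetical: pin the charge target when the mm is
		 * created instead of tracking an owner task that has
		 * to be replaced on exit; mm_update_next_owner() and
		 * mm->owner would then both disappear.
		 */
		struct mem_cgroup *mem_cgroup;
	#endif
	};

	/* Hypothetical helper replacing the owner-task indirection. */
	static inline struct mem_cgroup *mm_memcg(struct mm_struct *mm)
	{
		return READ_ONCE(mm->mem_cgroup);
	}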
On Fri, 21 Jun 2024 10:50:10 +0200 Michal Hocko <mhocko@suse.com> wrote:

> On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> > Can't review, I forgot everything about mm_update_next_owner(). So I am
> > sorry for the noise I am going to add, feel free to ignore. Just in
> > case, I see nothing wrong in this patch.
> >
> > I think this deserves a comment to explain that this is an optimization
> > for the case where we race with the pending mmput(); mm_update_next_owner()
> > checks mm_users at the start.
> >
> > And: can we drop tasklist_lock and use rcu_read_lock() before
> > for_each_process()? Yes, this will probably need more changes even if
> > possible...
> >
> > Or even better: can't we finally kill mm_update_next_owner() and turn
> > the ugly mm->owner into mm->mem_cgroup?
>
> Yes, dropping mm->owner should be the way to go. Replacing it with
> mem_cgroup sounds like an improvement. I have a vague recollection that
> this has some traps on the way, e.g. tasks sharing the mm but living in
> different cgroups. Things have changed since I last checked, and for
> example memcg charge migration on task move will be deprecated soon, so
> chances are that there are fewer roadblocks on the way.

I think this was alexjlzheng's first kernel contribution, and as such we
might not be hearing from him(?) again. And that's OK; thanks for the
bug report, it helps Linux.

Meanwhile we have a stalled patch in mm-unstable. If someone could add
this issue to their todo list, that would be great.
On Fri, 21 Jun 2024 10:50:10 +0200, Michal Hocko wrote:
> On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> > Can't review, I forgot everything about mm_update_next_owner(). So I am
> > sorry for the noise I am going to add, feel free to ignore. Just in
> > case, I see nothing wrong in this patch.
> >
> > On 06/20, alexjlzheng@gmail.com wrote:
> > >
> > > When mm_update_next_owner() is racing with swapoff (try_to_unuse()), /proc,
> > > ptrace, or page migration (get_task_mm()), it is impossible to find an
> > > appropriate task_struct in the loop whose mm_struct is the same as the
> > > target mm_struct.
> > >
> > > If the above race condition is combined with the stress-ng-zombie and
> > > stress-ng-dup tests, such a long loop can easily cause a hard lockup in
> > > write_lock_irq() for tasklist_lock.
> > >
> > > Recognize this situation in advance and exit early.
> >
> > But this patch won't help if (say) ptrace_access_vm() sleeps while
> > for_each_process() tries to find another owner, right?
> >
> > > @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
> > >  	 * Search through everything else, we should not get here often.
> > >  	 */
> > >  	for_each_process(g) {
> > > +		if (atomic_read(&mm->mm_users) <= 1)
> > > +			break;
> >
> > I think this deserves a comment to explain that this is an optimization
> > for the case where we race with the pending mmput(); mm_update_next_owner()
> > checks mm_users at the start.
> >
> > And: can we drop tasklist_lock and use rcu_read_lock() before
> > for_each_process()? Yes, this will probably need more changes even if
> > possible...
> >
> > Or even better: can't we finally kill mm_update_next_owner() and turn
> > the ugly mm->owner into mm->mem_cgroup?
>
> Yes, dropping mm->owner should be the way to go. Replacing it with
> mem_cgroup sounds like an improvement. I have a vague recollection that

Sorry for the late reply.

Replacing that with mem_cgroup may be a good idea; an RCU lock looks
good, too. But until the above optimization is implemented, I recommend
using this patch to alleviate the issue. Both [PATCH] and [PATCH v2]
are acceptable; they only differ in the commit log.

Thanks for your reply. :)

Jinliang Zheng

> this has some traps on the way, e.g. tasks sharing the mm but living in
> different cgroups. Things have changed since I last checked, and for
> example memcg charge migration on task move will be deprecated soon, so
> chances are that there are fewer roadblocks on the way.
> --
> Michal Hocko
> SUSE Labs
On 06/21, Michal Hocko wrote:
>
> On Thu 20-06-24 19:30:19, Oleg Nesterov wrote:
> >
> > Or even better: can't we finally kill mm_update_next_owner() and turn
> > the ugly mm->owner into mm->mem_cgroup?
>
> Yes, dropping mm->owner should be the way to go. Replacing it with
> mem_cgroup sounds like an improvement. I have a vague recollection that
> this has some traps on the way, e.g. tasks sharing the mm but living in
> different cgroups. Things have changed since I last checked, and for
> example memcg charge migration on task move will be deprecated soon, so
> chances are that there are fewer roadblocks on the way.

OK, thanks... So if we can't do this right now, can we at least clean
it up? To me it looks just ugly.

We don't need get/put_task_struct. The "retry" logic is obviously
suboptimal. The search in the children/siblings doesn't handle zombie
leaders.

I'll send 2 (hopefully simple) patches in a minute, could you review?
I have no idea how to test them...

Oleg.
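On the zombie-leader point: the children/siblings scan in
mm_update_next_owner() compares only each child's own ->mm, and a group
leader that has already exited has ->mm == NULL even while other threads
in its group still share the mm. A hedged illustration of a thread-aware
scan (illustrative only, with t as an extra task_struct iterator; not
necessarily what Oleg's patches do):

	list_for_each_entry(c, &p->children, sibling) {
		/*
		 * Check every thread in the child's thread group, not
		 * just the (possibly zombie) leader whose ->mm is
		 * already NULL.
		 */
		for_each_thread(c, t) {
			if (t->mm == mm)
				goto assign_new_owner;
		}
	}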
On Thu 20-06-24 23:27:45, alexjlzheng@gmail.com wrote:
> From: Jinliang Zheng <alexjlzheng@tencent.com>
>
> When mm_update_next_owner() is racing with swapoff (try_to_unuse()), /proc,
> ptrace, or page migration (get_task_mm()), it is impossible to find an
> appropriate task_struct in the loop whose mm_struct is the same as the
> target mm_struct.
>
> If the above race condition is combined with the stress-ng-zombie and
> stress-ng-dup tests, such a long loop can easily cause a hard lockup in
> write_lock_irq() for tasklist_lock.
>
> Recognize this situation in advance and exit early.
>
> Signed-off-by: Jinliang Zheng <alexjlzheng@tencent.com>

Even if this is not really a full fix, it is a useful stopgap to catch
at least some cases.

Acked-by: Michal Hocko <mhocko@suse.com>

> ---
> Changelog:
>
> V2: Fix mm_update_owner_next() to mm_update_next_owner() in comment
> ---
>  kernel/exit.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/exit.c b/kernel/exit.c
> index f95a2c1338a8..81fcee45d630 100644
> --- a/kernel/exit.c
> +++ b/kernel/exit.c
> @@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
>  	 * Search through everything else, we should not get here often.
>  	 */
>  	for_each_process(g) {
> +		if (atomic_read(&mm->mm_users) <= 1)
> +			break;
>  		if (g->flags & PF_KTHREAD)
>  			continue;
>  		for_each_thread(g, c) {
> --
> 2.39.3
diff --git a/kernel/exit.c b/kernel/exit.c
index f95a2c1338a8..81fcee45d630 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -484,6 +484,8 @@ void mm_update_next_owner(struct mm_struct *mm)
 	 * Search through everything else, we should not get here often.
 	 */
 	for_each_process(g) {
+		if (atomic_read(&mm->mm_users) <= 1)
+			break;
 		if (g->flags & PF_KTHREAD)
 			continue;
 		for_each_thread(g, c) {
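For context on how the race arises: paths such as /proc, ptrace and
swapoff take a temporary mm reference via get_task_mm(), which keeps
mm_users elevated even after every task owning the mm has exited.
Roughly, simplified from the kernel of that era:

	struct mm_struct *get_task_mm(struct task_struct *task)
	{
		struct mm_struct *mm;

		task_lock(task);
		mm = task->mm;
		if (mm) {
			if (task->flags & PF_KTHREAD)
				mm = NULL;
			else
				mmget(mm);	/* atomic_inc(&mm->mm_users) */
		}
		task_unlock(task);
		return mm;
	}

While such a reference is held, no task on the list has c->mm == mm,
yet mm_users stays above 1, so the early break above never fires; that
is exactly Oleg's objection about ptrace_access_vm() sleeping during
the scan.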