
mm, oom_adj: avoid meaningless loop to find processes sharing mm

Message ID 20181008011931epcms1p82dd01b7e5c067ea99946418bc97de46a@epcms1p8 (mailing list archive)
State New, archived
Series: mm, oom_adj: avoid meaningless loop to find processes sharing mm

Commit Message

Yong-Taek Lee Oct. 8, 2018, 1:19 a.m. UTC
The for_each_process() loop was introduced by commit 44a70adec910
("mm, oom_adj: make sure processes sharing mm have same view of
oom_score_adj"). For most user processes mm_users is bigger than 1
even though there is only one thread group. In that case the
for_each_process() loop meaninglessly tries to find processes sharing
the same mm even though no other thread group exists.

My idea is that the target task's thread count is smaller than mm_users
only when other thread groups share the same mm. So we can skip the
loop when mm_users and get_nr_threads() are equal.
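
For illustration only (not part of the patch), here is a minimal userspace
sketch of the one case the for_each_process() walk exists for: a second thread
group created with clone(CLONE_VM) but without CLONE_THREAD/CLONE_SIGHAND, so
it has its own signal_struct (and oom_score_adj) while sharing the parent's mm.
For an ordinary process, mm_users and get_nr_threads() match and the walk finds
nothing, which is what the check above skips. The value 500, the 2-second
sleep, and the helper names are arbitrary choices for this sketch.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[64 * 1024];

static int child_fn(void *arg)
{
    (void)arg;
    sleep(2);               /* keep the second thread group alive briefly */
    return 0;
}

static void show_adj(pid_t pid, const char *who)
{
    char path[64], buf[16] = "";
    int fd;

    snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
    fd = open(path, O_RDONLY);
    if (fd >= 0) {
        if (read(fd, buf, sizeof(buf) - 1) < 0)
            buf[0] = '\0';
        close(fd);
    }
    buf[strcspn(buf, "\n")] = '\0';
    printf("%s (pid %d): oom_score_adj = %s\n", who, (int)pid, buf);
}

int main(void)
{
    pid_t child;
    int fd;

    /*
     * CLONE_VM without CLONE_THREAD/CLONE_SIGHAND: a separate thread group
     * with its own signal_struct (and oom_score_adj) sharing the parent's
     * mm, so mm_users ends up larger than the parent's thread count.
     */
    child = clone(child_fn, child_stack + sizeof(child_stack),
                  CLONE_VM | SIGCHLD, NULL);
    if (child < 0) {
        perror("clone");
        return 1;
    }

    /* Raising our own oom_score_adj needs no privilege; __set_oom_adj() is
     * expected to walk the processes sharing the mm and update the child's
     * thread group as well. */
    fd = open("/proc/self/oom_score_adj", O_WRONLY);
    if (fd >= 0) {
        if (write(fd, "500", 3) != 3)
            perror("write");
        close(fd);
    }

    show_adj(getpid(), "parent");
    show_adj(child, "child ");

    waitpid(child, NULL, 0);
    return 0;
}

Run unprivileged; with commit 44a70adec910 in place the child is expected to
report 500 as well.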

test result
while true; do count=0; time while [ $count -lt 10000 ]; do echo -1000 > /proc/1457/oom_score_adj; count=$((count+1)); done; done;

before patch
0m00.59s real     0m00.09s user     0m00.51s system
0m00.59s real     0m00.14s user     0m00.45s system
0m00.58s real     0m00.11s user     0m00.47s system
0m00.58s real     0m00.10s user     0m00.48s system
0m00.59s real     0m00.11s user     0m00.48s system

after patch
0m00.15s real     0m00.07s user     0m00.08s system
0m00.14s real     0m00.10s user     0m00.04s system
0m00.14s real     0m00.10s user     0m00.05s system
0m00.14s real     0m00.08s user     0m00.07s system
0m00.14s real     0m00.08s user     0m00.07s system

Signed-off-by: Lee YongTaek <ytk.lee@samsung.com>
---
 fs/proc/base.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

--

Comments

Tetsuo Handa Oct. 8, 2018, 2:52 a.m. UTC | #1
On 2018/10/08 10:19, Yong-Taek Lee wrote:
> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>         struct mm_struct *mm = NULL;
>         struct task_struct *task;
>         int err = 0;
> +       int mm_users = 0;
> 
>         task = get_proc_task(file_inode(file));
>         if (!task)
> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>                 struct task_struct *p = find_lock_task_mm(task);
> 
>                 if (p) {
> -                       if (atomic_read(&p->mm->mm_users) > 1) {
> +                       mm_users = atomic_read(&p->mm->mm_users);
> +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {

How can this work (even before this patch)? When clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
is requested, copy_process() calls copy_signal() in order to copy sig->oom_score_adj and
sig->oom_score_adj_min before calling copy_mm() in order to increment mm->mm_users, doesn't it?
Then, we will get two different "struct signal_struct" with different oom_score_adj/oom_score_adj_min
but one "struct mm_struct" shared by two thread groups.

>                                 mm = p->mm;
>                                 atomic_inc(&mm->mm_count);
>                         }
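
Below is a best-effort reproducer sketch for the window described above. It is
an illustration based on this description, not something posted in the thread,
and it may need many runs (or never trigger) depending on timing. One thread
keeps raising the process's own oom_score_adj in monotonically increasing,
unprivileged steps while the main thread clones CLONE_VM children. Assuming
nothing else on the system touches this process's oom_score_adj, a freshly
cloned child that reports a value lower than one the parent already had can
only have copied its signal_struct before a write whose for_each_process()
walk did not reach it. Helper names, pacing, and limits are arbitrary; compile
with -pthread.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static char child_stack[64 * 1024];
static volatile int done;

static int read_adj(pid_t pid)
{
    char path[64], buf[16] = "";
    int fd;

    snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -2000;               /* sentinel: could not read */
    if (read(fd, buf, sizeof(buf) - 1) < 0)
        buf[0] = '\0';
    close(fd);
    return atoi(buf);
}

static int child_fn(void *arg)
{
    (void)arg;
    pause();                        /* linger until the parent kills us */
    return 0;
}

static void *writer(void *arg)
{
    int fd = open("/proc/self/oom_score_adj", O_WRONLY);
    int adj = read_adj(getpid());

    (void)arg;
    if (adj < 0)
        adj = 0;
    /* Monotonically increasing, unprivileged writes; each one goes through
     * __set_oom_adj() and should reach every thread group sharing the mm. */
    for (adj++; adj <= 999 && fd >= 0; adj++) {
        char buf[8];
        int len = snprintf(buf, sizeof(buf), "%d", adj);

        if (write(fd, buf, len) != len)
            break;
        usleep(1000);
    }
    if (fd >= 0)
        close(fd);
    done = 1;
    return NULL;
}

int main(void)
{
    pthread_t tid;

    if (pthread_create(&tid, NULL, writer, NULL))
        return 1;

    while (!done) {
        pid_t child = clone(child_fn, child_stack + sizeof(child_stack),
                            CLONE_VM | SIGCHLD, NULL);
        int parent_adj, child_adj;

        if (child < 0)
            break;
        parent_adj = read_adj(getpid());        /* read the parent first... */
        child_adj = read_adj(child);            /* ...then the new child    */
        /* Values only ever increase here, so a child lagging behind a value
         * the parent already had means a write's walk missed the child even
         * though its signal_struct had already been copied. */
        if (child_adj >= -1000 && child_adj < parent_adj)
            printf("hit: child %d has %d, parent already had %d\n",
                   (int)child, child_adj, parent_adj);
        kill(child, SIGKILL);
        waitpid(child, NULL, 0);
    }
    pthread_join(tid, NULL);
    return 0;
}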
Yong-Taek Lee Oct. 8, 2018, 6:14 a.m. UTC | #2
>On 2018/10/08 10:19, Yong-Taek Lee wrote:
>> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>         struct mm_struct *mm = NULL;
>>         struct task_struct *task;
>>         int err = 0;
>> +       int mm_users = 0;
>>
>>         task = get_proc_task(file_inode(file));
>>         if (!task)
>> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>                 struct task_struct *p = find_lock_task_mm(task);
>>
>>                 if (p) {
>> -                       if (atomic_read(&p->mm->mm_users) > 1) {
>> +                       mm_users = atomic_read(&p->mm->mm_users);
>> +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
>
> How can this work (even before this patch)? When clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
> is requested, copy_process() calls copy_signal() in order to copy sig->oom_score_adj and
> sig->oom_score_adj_min before calling copy_mm() in order to increment mm->mm_users, doesn't it?
> Then, we will get two different "struct signal_struct" with different oom_score_adj/oom_score_adj_min
> but one "struct mm_struct" shared by two thread groups.
>

Are you talking about the race between __set_oom_adj and copy_process?
If so, I agree with your opinion. __set_oom_adj cannot set oom_score_adj properly
for the copied process if it checks mm_users after copy_process has called
copy_signal but before it calls copy_mm. Please correct me if I misunderstood anything.

>>                                 mm = p->mm;
>>                                 atomic_inc(&mm->mm_count);
>>                         }
Tetsuo Handa Oct. 8, 2018, 6:22 a.m. UTC | #3
On 2018/10/08 15:14, Yong-Taek Lee wrote:
>> On 2018/10/08 10:19, Yong-Taek Lee wrote:
>>> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>>         struct mm_struct *mm = NULL;
>>>         struct task_struct *task;
>>>         int err = 0;
>>> +       int mm_users = 0;
>>>
>>>         task = get_proc_task(file_inode(file));
>>>         if (!task)
>>> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>>                 struct task_struct *p = find_lock_task_mm(task);
>>>
>>>                 if (p) {
>>> -                       if (atomic_read(&p->mm->mm_users) > 1) {
>>> +                       mm_users = atomic_read(&p->mm->mm_users);
>>> +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
>>
>> How can this work (even before this patch)? When clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
>> is requested, copy_process() calls copy_signal() in order to copy sig->oom_score_adj and
>> sig->oom_score_adj_min before calling copy_mm() in order to increment mm->mm_users, doesn't it?
>> Then, we will get two different "struct signal_struct" with different oom_score_adj/oom_score_adj_min
>> but one "struct mm_struct" shared by two thread groups.
>>
> 
> Are you talking about race between __set_oom_adj and copy_process?
> If so, i agree with your opinion. It can not set oom_score_adj properly for copied process if __set_oom_adj
> check mm_users before copy_process calls copy_mm after copy_signal. Please correct me if i misunderstood anything.

You understand it correctly.

Reversing copy_signal() and copy_mm() is not sufficient either. We need to use a read/write lock
(read lock for copy_process() and write lock for __set_oom_adj()) in order to make sure that
the thread created by clone() becomes reachable from for_each_process() path in __set_oom_adj().

> 
>>>                                 mm = p->mm;
>>>                                 atomic_inc(&mm->mm_count);
>>>                         }
>
Yong-Taek Lee Oct. 8, 2018, 8:38 a.m. UTC | #4
>
>On 2018/10/08 15:14, Yong-Taek Lee wrote:
>>> On 2018/10/08 10:19, Yong-Taek Lee wrote:
>>>> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>>>         struct mm_struct *mm = NULL;
>>>>         struct task_struct *task;
>>>>         int err = 0;
>>>> +       int mm_users = 0;
>>>>
>>>>         task = get_proc_task(file_inode(file));
>>>>         if (!task)
>>>> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>>>                 struct task_struct *p = find_lock_task_mm(task);
>>>>
>>>>                 if (p) {
>>>> -                       if (atomic_read(&p->mm->mm_users) > 1) {
>>>> +                       mm_users = atomic_read(&p->mm->mm_users);
>>>> +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
>>>
>>> How can this work (even before this patch)? When clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
>>> is requested, copy_process() calls copy_signal() in order to copy sig->oom_score_adj and
>>> sig->oom_score_adj_min before calling copy_mm() in order to increment mm->mm_users, doesn't it?
>>> Then, we will get two different "struct signal_struct" with different oom_score_adj/oom_score_adj_min
>>> but one "struct mm_struct" shared by two thread groups.
>>>
>>
>> Are you talking about race between __set_oom_adj and copy_process?
>> If so, i agree with your opinion. It can not set oom_score_adj properly for copied process if __set_oom_adj
>> check mm_users before copy_process calls copy_mm after copy_signal. Please correct me if i misunderstood anything.
>
> You understand it correctly.
>
> Reversing copy_signal() and copy_mm() is not sufficient either. We need to use a read/write lock
> (read lock for copy_process() and write lock for __set_oom_adj()) in order to make sure that
> the thread created by clone() becomes reachable from for_each_process() path in __set_oom_adj().
>

Thank you for your suggestion. But I think it would be better to treat these as two
separate issues, since there is no dependency between the race issue and my patch.
As I already explained, the for_each_process path is meaningless if there is only one
thread group with many threads (mm_users > 1 but no other thread group sharing the
same mm). Do you have any other idea to avoid the meaningless loop?

>>
>>>>                                 mm = p->mm;
>>>>                                 atomic_inc(&mm->mm_count);
>>>>                         }
>>
>
Tetsuo Handa Oct. 8, 2018, 9:27 a.m. UTC | #5
On 2018/10/08 17:38, Yong-Taek Lee wrote:
>>
>> On 2018/10/08 15:14, Yong-Taek Lee wrote:
>>>> On 2018/10/08 10:19, Yong-Taek Lee wrote:
>>>>> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>>>>         struct mm_struct *mm = NULL;
>>>>>         struct task_struct *task;
>>>>>         int err = 0;
>>>>> +       int mm_users = 0;
>>>>>
>>>>>         task = get_proc_task(file_inode(file));
>>>>>         if (!task)
>>>>> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
>>>>>                 struct task_struct *p = find_lock_task_mm(task);
>>>>>
>>>>>                 if (p) {
>>>>> -                       if (atomic_read(&p->mm->mm_users) > 1) {
>>>>> +                       mm_users = atomic_read(&p->mm->mm_users);
>>>>> +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
>>>>
>>>> How can this work (even before this patch)? When clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
>>>> is requested, copy_process() calls copy_signal() in order to copy sig->oom_score_adj and
>>>> sig->oom_score_adj_min before calling copy_mm() in order to increment mm->mm_users, doesn't it?
>>>> Then, we will get two different "struct signal_struct" with different oom_score_adj/oom_score_adj_min
>>>> but one "struct mm_struct" shared by two thread groups.
>>>>
>>>
>>> Are you talking about race between __set_oom_adj and copy_process?
>>> If so, i agree with your opinion. It can not set oom_score_adj properly for copied process if __set_oom_adj
>>> check mm_users before copy_process calls copy_mm after copy_signal. Please correct me if i misunderstood anything.
>>
>> You understand it correctly.
>>
>> Reversing copy_signal() and copy_mm() is not sufficient either. We need to use a read/write lock
>> (read lock for copy_process() and write lock for __set_oom_adj()) in order to make sure that
>> the thread created by clone() becomes reachable from for_each_process() path in __set_oom_adj().
>>
> 
> Thank you for your suggestion. But i think it would be better to seperate to 2 issues. How about think these
> issues separately because there are no dependency between race issue and my patch. As i already explained,
> for_each_process path is meaningless if there is only one thread group with many threads(mm_users > 1 but 
> no other thread group sharing same mm). Do you have any other idea to avoid meaningless loop ? 

Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure processes
sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded ("mm, oom:
kill all tasks sharing the mm").

> 
>>>
>>>>>                                 mm = p->mm;
>>>>>                                 atomic_inc(&mm->mm_count);
>>>>>                         }
>>>
>>
>
Michal Hocko Oct. 9, 2018, 6:35 a.m. UTC | #6
[I have only now noticed that the patch has been reposted]

On Mon 08-10-18 18:27:39, Tetsuo Handa wrote:
> On 2018/10/08 17:38, Yong-Taek Lee wrote:
> >>
> >> On 2018/10/08 15:14, Yong-Taek Lee wrote:
> >>>> On 2018/10/08 10:19, Yong-Taek Lee wrote:
> >>>>> @@ -1056,6 +1056,7 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> >>>>>         struct mm_struct *mm = NULL;
> >>>>>         struct task_struct *task;
> >>>>>         int err = 0;
> >>>>> +       int mm_users = 0;
> >>>>>
> >>>>>         task = get_proc_task(file_inode(file));
> >>>>>         if (!task)
> >>>>> @@ -1092,7 +1093,8 @@ static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
> >>>>>                 struct task_struct *p = find_lock_task_mm(task);
> >>>>>
> >>>>>                 if (p) {
> >>>>> -                       if (atomic_read(&p->mm->mm_users) > 1) {
> >>>>> +                       mm_users = atomic_read(&p->mm->mm_users);
> >>>>> +                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
> >>>>
> >>>> How can this work (even before this patch)? When clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
> >>>> is requested, copy_process() calls copy_signal() in order to copy sig->oom_score_adj and
> >>>> sig->oom_score_adj_min before calling copy_mm() in order to increment mm->mm_users, doesn't it?
> >>>> Then, we will get two different "struct signal_struct" with different oom_score_adj/oom_score_adj_min
> >>>> but one "struct mm_struct" shared by two thread groups.
> >>>>
> >>>
> >>> Are you talking about race between __set_oom_adj and copy_process?
> >>> If so, i agree with your opinion. It can not set oom_score_adj properly for copied process if __set_oom_adj
> >>> check mm_users before copy_process calls copy_mm after copy_signal. Please correct me if i misunderstood anything.
> >>
> >> You understand it correctly.
> >>
> >> Reversing copy_signal() and copy_mm() is not sufficient either. We need to use a read/write lock
> >> (read lock for copy_process() and write lock for __set_oom_adj()) in order to make sure that
> >> the thread created by clone() becomes reachable from for_each_process() path in __set_oom_adj().
> >>
> > 
> > Thank you for your suggestion. But i think it would be better to seperate to 2 issues. How about think these
> > issues separately because there are no dependency between race issue and my patch. As i already explained,
> > for_each_process path is meaningless if there is only one thread group with many threads(mm_users > 1 but 
> > no other thread group sharing same mm). Do you have any other idea to avoid meaningless loop ? 
> 
> Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure processes
> sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded ("mm, oom:
> kill all tasks sharing the mm").

This would require a lot of other work for something as borderline as
a weird threading model like this. I will think about something more
appropriate - e.g. we can take mmap_sem for read while doing this check
and that should prevent races with [v]fork.
Michal Hocko Oct. 9, 2018, 7:50 a.m. UTC | #7
On Tue 09-10-18 08:35:41, Michal Hocko wrote:
> [I have only now noticed that the patch has been reposted]
> 
> On Mon 08-10-18 18:27:39, Tetsuo Handa wrote:
> > On 2018/10/08 17:38, Yong-Taek Lee wrote:
[...]
> > > Thank you for your suggestion. But i think it would be better to seperate to 2 issues. How about think these
> > > issues separately because there are no dependency between race issue and my patch. As i already explained,
> > > for_each_process path is meaningless if there is only one thread group with many threads(mm_users > 1 but 
> > > no other thread group sharing same mm). Do you have any other idea to avoid meaningless loop ? 
> > 
> > Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure processes
> > sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded ("mm, oom:
> > kill all tasks sharing the mm").
> 
> This would require a lot of other work for something as border line as
> weird threading model like this. I will think about something more
> appropriate - e.g. we can take mmap_sem for read while doing this check
> and that should prevent from races with [v]fork.

Not really. We do not even take the mmap_sem when CLONE_VM. So this is
not the way. Doing a proper synchronization seems much harder. So let's
consider what is the worst case scenario. We would basically hit a race
window between copy_signal and copy_mm and the only relevant case would
be OOM_SCORE_ADJ_MIN which wouldn't propagate to the new "thread". OOM
killer could then pick up the "thread" and kill it along with the whole
process group sharing the mm. Well, that is unfortunate indeed and it
breaks the OOM_SCORE_ADJ_MIN contract. There are basically two ways here
1) do not care and encourage users to use a saner way to set
OOM_SCORE_ADJ_MIN because doing that externally is racy anyway e.g.
setting it before [v]fork & exec. Btw. do we know about an actual user
who would care?
2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
reap the mm in the rare case of the race.

I would prefer the first but if this race really has to be addressed then
the second sounds more reasonable than the wholesale revert.
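
As a reference for the first option, here is a small sketch of the raceless
pattern being referred to (an illustration, not code from the thread): the
launcher sets its own oom_score_adj before fork()/exec(), so the value is
inherited by the child and nothing has to chase the new process through /proc
afterwards. "sleep 60" and the value 500 are placeholders; OOM_SCORE_ADJ_MIN
(-1000) would additionally require CAP_SYS_RESOURCE.

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/proc/self/oom_score_adj", O_WRONLY);
    pid_t pid;

    if (fd >= 0) {
        /* Set before fork(): the value is copied into the child's
         * signal_struct at fork time and survives exec(), so no external
         * /proc write has to chase the new process. */
        if (write(fd, "500", 3) != 3)
            perror("write");
        close(fd);
    }

    pid = fork();
    if (pid == 0) {
        execlp("sleep", "sleep", "60", (char *)NULL);
        _exit(127);
    }
    if (pid > 0) {
        printf("launched pid %d with oom_score_adj preset\n", (int)pid);
        waitpid(pid, NULL, 0);
    }
    return 0;
}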
Michal Hocko Oct. 9, 2018, 8:02 a.m. UTC | #8
On Mon 08-10-18 17:38:55, Yong-Taek Lee wrote:
> Do you have any other idea to avoid meaningless loop ? 

I have already asked in the earlier posting but let's follow up here.
Could you please expand on why this actually matters and what the
consequences are?
Tetsuo Handa Oct. 9, 2018, 10 a.m. UTC | #9
On 2018/10/09 16:50, Michal Hocko wrote:
> On Tue 09-10-18 08:35:41, Michal Hocko wrote:
>> [I have only now noticed that the patch has been reposted]
>>
>> On Mon 08-10-18 18:27:39, Tetsuo Handa wrote:
>>> On 2018/10/08 17:38, Yong-Taek Lee wrote:
> [...]
>>>> Thank you for your suggestion. But i think it would be better to seperate to 2 issues. How about think these
>>>> issues separately because there are no dependency between race issue and my patch. As i already explained,
>>>> for_each_process path is meaningless if there is only one thread group with many threads(mm_users > 1 but 
>>>> no other thread group sharing same mm). Do you have any other idea to avoid meaningless loop ? 
>>>
>>> Yes. I suggest reverting commit 44a70adec910d692 ("mm, oom_adj: make sure processes
>>> sharing mm have same view of oom_score_adj") and commit 97fd49c2355ffded ("mm, oom:
>>> kill all tasks sharing the mm").
>>
>> This would require a lot of other work for something as border line as
>> weird threading model like this. I will think about something more
>> appropriate - e.g. we can take mmap_sem for read while doing this check
>> and that should prevent from races with [v]fork.
> 
> Not really. We do not even take the mmap_sem when CLONE_VM. So this is
> not the way. Doing a proper synchronization seems much harder. So let's
> consider what is the worst case scenario. We would basically hit a race
> window between copy_signal and copy_mm and the only relevant case would
> be OOM_SCORE_ADJ_MIN which wouldn't propagate to the new "thread".

The "between copy_signal() and copy_mm()" race window is merely whether we need
to run for_each_process() loop. The race window is much larger than that; it is
between "copy_signal() copies oom_score_adj/oom_score_adj_min" and "the created
thread becomes accessible from for_each_process() loop".

>                                                                    OOM
> killer could then pick up the "thread" and kill it along with the whole
> process group sharing the mm.

Just reverting commit 44a70adec910d692 and commit 97fd49c2355ffded is
sufficient.

>                               Well, that is unfortunate indeed and it
> breaks the OOM_SCORE_ADJ_MIN contract. There are basically two ways here
> 1) do not care and encourage users to use a saner way to set
> OOM_SCORE_ADJ_MIN because doing that externally is racy anyway e.g.
> setting it before [v]fork & exec. Btw. do we know about an actual user
> who would care?

I'm not talking about [v]fork & exec. Why are you talking about [v]fork & exec ?

> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> reap the mm in the rare case of the race.

That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
to -1000 (and allowed unprivileged users to OOM-lockup the system). Now that we set
MMF_OOM_SKIP, there is no need to worry about an "oom_score_adj != -1000" thread
group and an "oom_score_adj == -1000" thread group sharing the same mm. Since
updating oom_score_adj to -1000 is a privileged operation, it is the administrator's
wish if such a case happens; the kernel should respect that wish.

> 
> I would prefer the firs but if this race really has to be addressed then
> the 2 sounds more reasonable than the wholesale revert.
>
Michal Hocko Oct. 9, 2018, 11:10 a.m. UTC | #10
On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> On 2018/10/09 16:50, Michal Hocko wrote:
[...]
> >                               Well, that is unfortunate indeed and it
> > breaks the OOM_SCORE_ADJ_MIN contract. There are basically two ways here
> > 1) do not care and encourage users to use a saner way to set
> > OOM_SCORE_ADJ_MIN because doing that externally is racy anyway e.g.
> > setting it before [v]fork & exec. Btw. do we know about an actual user
> > who would care?
> 
> I'm not talking about [v]fork & exec. Why are you talking about [v]fork & exec ?

Because that is the only raceless way to set your oom_score_adj.

> > 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> > reap the mm in the rare case of the race.
> 
> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
> to -1000 (and allowed unprivileged users to OOM-lockup the system).

I do not follow.
Tetsuo Handa Oct. 9, 2018, 12:52 p.m. UTC | #11
On 2018/10/09 20:10, Michal Hocko wrote:
> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
>>> reap the mm in the rare case of the race.
>>
>> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
>> to -1000 (and allowed unprivileged users to OOM-lockup the system).
> 
> I do not follow.
> 

http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493

[  177.722853] a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), order=0, oom_score_adj=0
[  177.724956] a.out cpuset=/ mems_allowed=0
[  177.725735] CPU: 3 PID: 3962 Comm: a.out Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  177.802889] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  177.872248] [ 3941]  1000  3941    28880      124      14       3        0             0 bash
[  177.874279] [ 3962]  1000  3962   541717   395780     784       6        0             0 a.out
[  177.876274] [ 3963]  1000  3963     1078       21       7       3        0          1000 a.out
[  177.878261] [ 3964]  1000  3964     1078       21       7       3        0          1000 a.out
[  177.880194] [ 3965]  1000  3965     1078       21       7       3        0          1000 a.out
[  177.882262] Out of memory: Kill process 3963 (a.out) score 998 or sacrifice child
[  177.884129] Killed process 3963 (a.out) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[  177.887100] oom_reaper: reaped process 3963 (a.out) anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
[  179.638399] crond invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
[  179.647708] crond cpuset=/ mems_allowed=0
[  179.652996] CPU: 3 PID: 742 Comm: crond Not tainted 4.5.0-rc2-next-20160204 #291
(...snipped...)
[  179.771311] [ pid ]   uid  tgid total_vm      rss nr_ptes nr_pmds swapents oom_score_adj name
(...snipped...)
[  179.836221] [ 3941]  1000  3941    28880      124      14       3        0             0 bash
[  179.838278] [ 3962]  1000  3962   541717   396308     785       6        0             0 a.out
[  179.840328] [ 3963]  1000  3963     1078        0       7       3        0         -1000 a.out
[  179.842443] [ 3965]  1000  3965     1078        0       7       3        0          1000 a.out
[  179.844557] Out of memory: Kill process 3965 (a.out) score 998 or sacrifice child
[  179.846404] Killed process 3965 (a.out) total-vm:4312kB, anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
Michal Hocko Oct. 9, 2018, 12:58 p.m. UTC | #12
On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
> On 2018/10/09 20:10, Michal Hocko wrote:
> > On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> >>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> >>> reap the mm in the rare case of the race.
> >>
> >> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
> >> to -1000 (and allowed unprivileged users to OOM-lockup the system).
> > 
> > I do not follow.
> > 
> 
> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493

Ahh, so you are not referring to the current upstream code. Do you see
any specific problem with the current one (well, except for the possible
race which I have tried to evaluate).
Tetsuo Handa Oct. 9, 2018, 1:14 p.m. UTC | #13
On 2018/10/09 21:58, Michal Hocko wrote:
> On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
>> On 2018/10/09 20:10, Michal Hocko wrote:
>>> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
>>>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
>>>>> reap the mm in the rare case of the race.
>>>>
>>>> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
>>>> to -1000 (and allowed unprivileged users to OOM-lockup the system).
>>>
>>> I do not follow.
>>>
>>
>> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
> 
> Ahh, so you are not referring to the current upstream code. Do you see
> any specific problem with the current one (well, except for the possible
> race which I have tried to evaluate).
> 

Yes. "task_will_free_mem(current) in out_of_memory() returns false due to MMF_OOM_SKIP
being already set" is a problem for clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
with the current code.
Michal Hocko Oct. 9, 2018, 1:26 p.m. UTC | #14
On Tue 09-10-18 22:14:24, Tetsuo Handa wrote:
> On 2018/10/09 21:58, Michal Hocko wrote:
> > On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
> >> On 2018/10/09 20:10, Michal Hocko wrote:
> >>> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> >>>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> >>>>> reap the mm in the rare case of the race.
> >>>>
> >>>> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
> >>>> to -1000 (and allowed unprivileged users to OOM-lockup the system).
> >>>
> >>> I do not follow.
> >>>
> >>
> >> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
> > 
> > Ahh, so you are not referring to the current upstream code. Do you see
> > any specific problem with the current one (well, except for the possible
> > race which I have tried to evaluate).
> > 
> 
> Yes. "task_will_free_mem(current) in out_of_memory() returns false due to MMF_OOM_SKIP
> being already set" is a problem for clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
> with the current code.

a) I fail to see how that is related to your previous post and b) could
you be more specific. Is there any other scenario from the two described
in my earlier email?
Tetsuo Handa Oct. 9, 2018, 1:51 p.m. UTC | #15
On 2018/10/09 22:26, Michal Hocko wrote:
> On Tue 09-10-18 22:14:24, Tetsuo Handa wrote:
>> On 2018/10/09 21:58, Michal Hocko wrote:
>>> On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
>>>> On 2018/10/09 20:10, Michal Hocko wrote:
>>>>> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
>>>>>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
>>>>>>> reap the mm in the rare case of the race.
>>>>>>
>>>>>> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
>>>>>> to -1000 (and allowed unprivileged users to OOM-lockup the system).
>>>>>
>>>>> I do not follow.
>>>>>
>>>>
>>>> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
>>>
>>> Ahh, so you are not referring to the current upstream code. Do you see
>>> any specific problem with the current one (well, except for the possible
>>> race which I have tried to evaluate).
>>>
>>
>> Yes. "task_will_free_mem(current) in out_of_memory() returns false due to MMF_OOM_SKIP
>> being already set" is a problem for clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
>> with the current code.
> 
> a) I fail to see how that is related to your previous post and b) could
> you be more specific. Is there any other scenario from the two described
> in my earlier email?
> 

I do not follow. Just reverting commit 44a70adec910d692 and commit 97fd49c2355ffded
is sufficient for closing the copy_process() versus __set_oom_adj() race.

We went too far towards complete "struct mm_struct" based OOM handling. But stepping
back to "struct signal_struct" based OOM handling solves Yong-Taek's for_each_process()
latency problem and your copy_process() versus __set_oom_adj() race problem and my
task_will_free_mem(current) race problem.
Michal Hocko Oct. 9, 2018, 2:09 p.m. UTC | #16
On Tue 09-10-18 22:51:00, Tetsuo Handa wrote:
> On 2018/10/09 22:26, Michal Hocko wrote:
> > On Tue 09-10-18 22:14:24, Tetsuo Handa wrote:
> >> On 2018/10/09 21:58, Michal Hocko wrote:
> >>> On Tue 09-10-18 21:52:12, Tetsuo Handa wrote:
> >>>> On 2018/10/09 20:10, Michal Hocko wrote:
> >>>>> On Tue 09-10-18 19:00:44, Tetsuo Handa wrote:
> >>>>>>> 2) add OOM_SCORE_ADJ_MIN and do not kill tasks sharing mm and do not
> >>>>>>> reap the mm in the rare case of the race.
> >>>>>>
> >>>>>> That is no problem. The mistake we made in 4.6 was that we updated oom_score_adj
> >>>>>> to -1000 (and allowed unprivileged users to OOM-lockup the system).
> >>>>>
> >>>>> I do not follow.
> >>>>>
> >>>>
> >>>> http://tomoyo.osdn.jp/cgi-bin/lxr/source/mm/oom_kill.c?v=linux-4.6.7#L493
> >>>
> >>> Ahh, so you are not referring to the current upstream code. Do you see
> >>> any specific problem with the current one (well, except for the possible
> >>> race which I have tried to evaluate).
> >>>
> >>
> >> Yes. "task_will_free_mem(current) in out_of_memory() returns false due to MMF_OOM_SKIP
> >> being already set" is a problem for clone(CLONE_VM without CLONE_THREAD/CLONE_SIGHAND)
> >> with the current code.
> > 
> > a) I fail to see how that is related to your previous post and b) could
> > you be more specific. Is there any other scenario from the two described
> > in my earlier email?
> > 
> 
> I do not follow. Just reverting commit 44a70adec910d692 and commit 97fd49c2355ffded
> is sufficient for closing the copy_process() versus __set_oom_adj() race.

Please go back and see why this has been done in the first place.

> We went too far towards complete "struct mm_struct" based OOM handling. But stepping
> back to "struct signal_struct" based OOM handling solves Yong-Taek's for_each_process()
> latency problem and your copy_process() versus __set_oom_adj() race problem and my
> task_will_free_mem(current) race problem.

And again, I have put together an evaluation of the race and tried to see what
the effect is. Then you have started to fire hard-to-follow notes and it is
not clear whether the analysis/conclusions are wrong/incomplete.

So can we get back to that analysis and stick to the topic please?

Patch

diff --git a/fs/proc/base.c b/fs/proc/base.c
index f9f72aee6d45..54b2fb5e9c51 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1056,6 +1056,7 @@  static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
        struct mm_struct *mm = NULL;
        struct task_struct *task;
        int err = 0;
+       int mm_users = 0;

        task = get_proc_task(file_inode(file));
        if (!task)
@@ -1092,7 +1093,8 @@  static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
                struct task_struct *p = find_lock_task_mm(task);

                if (p) {
-                       if (atomic_read(&p->mm->mm_users) > 1) {
+                       mm_users = atomic_read(&p->mm->mm_users);
+                       if ((mm_users > 1) && (mm_users != get_nr_threads(p))) {
                                mm = p->mm;
                                atomic_inc(&mm->mm_count);
                        }