
proc, oom: do not report alien mms when setting oom_score_adj

Message ID 20190212102129.26288-1-mhocko@kernel.org (mailing list archive)
State New, archived
Series: proc, oom: do not report alien mms when setting oom_score_adj

Commit Message

Michal Hocko Feb. 12, 2019, 10:21 a.m. UTC
From: Michal Hocko <mhocko@suse.com>

Tetsuo has reported that creating thousands of processes sharing MM
without SIGHAND (aka alien threads) and setting
/proc/<pid>/oom_score_adj swamps the kernel log and takes ages [1] to
finish. This is especially worrisome because all that printing is done
under the RCU lock, which can potentially trigger the RCU stall or
softlockup detector.

The primary reason for the printk was to catch potential users who might
depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure
processes sharing mm have same view of oom_score_adj"), but after more
than 2 years without a single report I guess it is safe to simply remove
the printk altogether.

The next step should be moving oom_score_adj over to the mm struct and
removing all the task crawling, as suggested by [2].

[1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
[2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz
Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 fs/proc/base.c | 4 ----
 1 file changed, 4 deletions(-)
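
The hunk at the bottom of this page removes only the pr_info() call; for
context, here is a rough, simplified sketch of the loop that call sits in,
reconstructed from the patch context and the discussion below. It is not
the verbatim fs/proc/base.c code, and the helper name and condensed
signature are illustrative assumptions; it only shows why the message was
printed once per alien process from inside the RCU read-side section:
----------------------------------------
/* Simplified sketch, not verbatim kernel source. */
static void propagate_oom_adj_sketch(struct task_struct *task,
				     struct mm_struct *mm,
				     int oom_adj, bool legacy)
{
	struct task_struct *p;

	rcu_read_lock();
	for_each_process(p) {
		/* skip our own thread group, kernel threads and global init */
		if (same_thread_group(task, p))
			continue;
		if (p->flags & PF_KTHREAD || is_global_init(p))
			continue;

		task_lock(p);
		if (!p->vfork_done && process_shares_mm(p, mm)) {
			/* the pr_info() removed by this patch lived here */
			p->signal->oom_score_adj = oom_adj;
			if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
				p->signal->oom_score_adj_min = (short)oom_adj;
		}
		task_unlock(p);
	}
	rcu_read_unlock();
}
----------------------------------------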

Comments

Johannes Weiner Feb. 12, 2019, 4:08 p.m. UTC | #1
On Tue, Feb 12, 2019 at 11:21:29AM +0100, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> Tetsuo has reported that creating thousands of processes sharing MM
> without SIGHAND (aka alien threads) and setting
> /proc/<pid>/oom_score_adj swamps the kernel log and takes ages [1] to
> finish. This is especially worrisome because all that printing is done
> under the RCU lock, which can potentially trigger the RCU stall or
> softlockup detector.
> 
> The primary reason for the printk was to catch potential users who might
> depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure
> processes sharing mm have same view of oom_score_adj"), but after more
> than 2 years without a single report I guess it is safe to simply remove
> the printk altogether.
> 
> The next step should be moving oom_score_adj over to the mm struct and
> removing all the task crawling, as suggested by [2].
> 
> [1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
> [2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz
> Reported-by: Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
> Signed-off-by: Michal Hocko <mhocko@suse.com>

Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Andrew Morton Feb. 12, 2019, 8:56 p.m. UTC | #2
On Tue, 12 Feb 2019 11:21:29 +0100 Michal Hocko <mhocko@kernel.org> wrote:

> Tetsuo has reported that creating thousands of processes sharing MM
> without SIGHAND (aka alien threads) and setting
> /proc/<pid>/oom_score_adj swamps the kernel log and takes ages [1] to
> finish. This is especially worrisome because all that printing is done
> under the RCU lock, which can potentially trigger the RCU stall or
> softlockup detector.
> 
> The primary reason for the printk was to catch potential users who might
> depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure
> processes sharing mm have same view of oom_score_adj"), but after more
> than 2 years without a single report I guess it is safe to simply remove
> the printk altogether.
> 
> The next step should be moving oom_score_adj over to the mm struct and
> removing all the task crawling, as suggested by [2].
> 
> [1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
> [2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz

I think I'll put a cc:stable on this.  Deleting a might-trigger debug
printk is safe and welcome.
Tetsuo Handa Feb. 12, 2019, 9:07 p.m. UTC | #3
On 2019/02/13 5:56, Andrew Morton wrote:
> On Tue, 12 Feb 2019 11:21:29 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
>> Tetsuo has reported that creating thousands of processes sharing MM
>> without SIGHAND (aka alien threads) and setting
>> /proc/<pid>/oom_score_adj swamps the kernel log and takes ages [1] to
>> finish. This is especially worrisome because all that printing is done
>> under the RCU lock, which can potentially trigger the RCU stall or
>> softlockup detector.
>>
>> The primary reason for the printk was to catch potential users who might
>> depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure
>> processes sharing mm have same view of oom_score_adj"), but after more
>> than 2 years without a single report I guess it is safe to simply remove
>> the printk altogether.
>>
>> The next step should be moving oom_score_adj over to the mm struct and
>> removing all the task crawling, as suggested by [2].
>>
>> [1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
>> [2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz
> 
> I think I'll put a cc:stable on this.  Deleting a might-trigger debug
> printk is safe and welcome.
> 

Putting cc:stable is fine. But I doubt the usefulness of this patch.
If nobody really depends on the behavior prior to 44a70adec910,
we should remove the pointless (otherwise racy) iteration itself.
Tetsuo Handa Feb. 13, 2019, 1:24 a.m. UTC | #4
Andrew Morton wrote:
> On Tue, 12 Feb 2019 11:21:29 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> 
> > Tetsuo has reported that creating thousands of processes sharing MM
> > without SIGHAND (aka alien threads) and setting
> > /proc/<pid>/oom_score_adj swamps the kernel log and takes ages [1] to
> > finish. This is especially worrisome because all that printing is done
> > under the RCU lock, which can potentially trigger the RCU stall or
> > softlockup detector.
> > 
> > The primary reason for the printk was to catch potential users who might
> > depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure
> > processes sharing mm have same view of oom_score_adj"), but after more
> > than 2 years without a single report I guess it is safe to simply remove
> > the printk altogether.
> > 
> > The next step should be moving oom_score_adj over to the mm struct and
> > removing all the task crawling, as suggested by [2].
> > 
> > [1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
> > [2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz
> 
> I think I'll put a cc:stable on this.  Deleting a might-trigger debug
> printk is safe and welcome.
> 

I don't like this patch, for I can confirm that removing only the printk()
is not sufficient to avoid the hung task warning. If the reason for
removing the printk() is that we have never heard of anyone hitting it in
more than 2 years, then the whole iteration is nothing but garbage. I
insist that this iteration should be removed.

Nacked-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>

Reproducer:
----------------------------------------
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <sched.h>
#include <stdlib.h>
#include <signal.h>

#define STACKSIZE 8192

/*
 * Each clone writes to its own oom_score_adj, which triggers the
 * __set_oom_adj() crawl over all processes.
 */
static int child(void *unused)
{
	int fd = open("/proc/self/oom_score_adj", O_WRONLY);

	write(fd, "0\n", 2);
	close(fd);
	pause();
	return 0;
}

int main(int argc, char *argv[])
{
	int i;

	/* Spawn up to 32768 CLONE_VM-without-CLONE_SIGHAND tasks (alien threads). */
	for (i = 0; i < 8192 * 4; i++)
		if (clone(child, malloc(STACKSIZE) + STACKSIZE, CLONE_VM, NULL) == -1)
			break;
	kill(0, SIGSEGV);
	return 0;
}
----------------------------------------

Removing only the printk() from the iteration:
----------------------------------------
[root@localhost tmp]# time ./a.out
Segmentation fault

real    2m16.565s
user    0m0.029s
sys     0m2.631s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    2m20.900s
user    0m0.023s
sys     0m2.380s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    2m19.322s
user    0m0.017s
sys     0m2.433s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    2m22.571s
user    0m0.010s
sys     0m2.447s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    2m17.661s
user    0m0.020s
sys     0m2.390s
----------------------------------------

----------------------------------------
[  189.025075] INFO: task a.out:20327 blocked for more than 120 seconds.
[  189.027580]       Not tainted 5.0.0-rc6+ #828
[  189.029142] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  189.031503] a.out           D28432 20327   9408 0x00000084
[  189.033163] Call Trace:
[  189.034005]  __schedule+0x69a/0x1890
[  189.035363]  ? pci_mmcfg_check_reserved+0x120/0x120
[  189.036863]  schedule+0x7f/0x180
[  189.037910]  schedule_preempt_disabled+0x13/0x20
[  189.039470]  __mutex_lock+0x4c0/0x11a0
[  189.040664]  ? __set_oom_adj+0x84/0xd00
[  189.041870]  ? ww_mutex_lock+0xb0/0xb0
[  189.043111]  ? sched_clock_cpu+0x1b/0x1b0
[  189.044318]  ? find_held_lock+0x40/0x1e0
[  189.045550]  ? kasan_check_read+0x11/0x20
[  189.047060]  mutex_lock_nested+0x16/0x20
[  189.048334]  ? mutex_lock_nested+0x16/0x20
[  189.049562]  __set_oom_adj+0x84/0xd00
[  189.050701]  ? kasan_check_write+0x14/0x20
[  189.051943]  oom_score_adj_write+0x136/0x150
[  189.053217]  ? __set_oom_adj+0xd00/0xd00
[  189.054502]  ? check_prev_add.constprop.42+0x14c0/0x14c0
[  189.055959]  ? sched_clock+0x9/0x10
[  189.057756]  ? check_prev_add.constprop.42+0x14c0/0x14c0
[  189.059323]  __vfs_write+0xe3/0x970
[  189.060406]  ? kernel_read+0x130/0x130
[  189.061578]  ? __lock_acquire+0x7f3/0x1210
[  189.062965]  ? __lock_is_held+0xbc/0x140
[  189.064208]  ? rcu_read_lock_sched_held+0x114/0x130
[  189.065672]  ? rcu_sync_lockdep_assert+0x6d/0xb0
[  189.067042]  ? __sb_start_write+0x1ff/0x2b0
[  189.068297]  vfs_write+0x15b/0x480
[  189.069352]  ksys_write+0xcd/0x1b0
[  189.070581]  ? __ia32_sys_read+0xa0/0xa0
[  189.071710]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  189.073245]  ? __this_cpu_preempt_check+0x13/0x20
[  189.074686]  __x64_sys_write+0x6e/0xb0
[  189.075834]  do_syscall_64+0x8f/0x3e0
[  189.077001]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  189.078696] RIP: 0033:0x7f01546c7fd0
[  189.079836] Code: Bad RIP value.
[  189.081075] RSP: 002b:0000000007aeda58 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  189.083315] RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007f01546c7fd0
[  189.085446] RDX: 0000000000000002 RSI: 0000000000400809 RDI: 0000000000000003
[  189.088254] RBP: 0000000000000000 R08: 0000000000002000 R09: 0000000000002000
[  189.092279] R10: 0000000000000000 R11: 0000000000000246 R12: 000000000040062e
[  189.095213] R13: 00007ffde843a8e0 R14: 0000000000000000 R15: 0000000000000000

[  916.244660] INFO: task a.out:2027 blocked for more than 120 seconds.
[  916.247443]       Not tainted 5.0.0-rc6+ #828
[  916.249667] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  916.252876] a.out           D28432  2027  55700 0x00000084
[  916.255374] Call Trace:
[  916.257012]  ? check_prev_add.constprop.42+0x14c0/0x14c0
[  916.259527]  ? sched_clock_cpu+0x1b/0x1b0
[  916.261620]  ? sched_clock+0x9/0x10
[  916.263620]  ? sched_clock_cpu+0x1b/0x1b0
[  916.265803]  ? find_held_lock+0x40/0x1e0
[  916.267956]  ? lock_release+0x746/0x1050
[  916.270014]  ? schedule+0x7f/0x180
[  916.271887]  ? do_exit+0x54b/0x2ff0
[  916.273879]  ? check_prev_add.constprop.42+0x14c0/0x14c0
[  916.276294]  ? mm_update_next_owner+0x680/0x680
[  916.278454]  ? sched_clock_cpu+0x1b/0x1b0
[  916.280556]  ? find_held_lock+0x40/0x1e0
[  916.282713]  ? get_signal+0x270/0x1850
[  916.284695]  ? __this_cpu_preempt_check+0x13/0x20
[  916.286788]  ? do_group_exit+0xf4/0x2f0
[  916.288738]  ? get_signal+0x2be/0x1850
[  916.290869]  ? __vfs_write+0xe3/0x970
[  916.292751]  ? sched_clock+0x9/0x10
[  916.294608]  ? do_signal+0x99/0x1b90
[  916.296831]  ? check_flags.part.40+0x420/0x420
[  916.299131]  ? setup_sigcontext+0x7d0/0x7d0
[  916.301134]  ? __audit_syscall_exit+0x71f/0x9a0
[  916.303319]  ? rcu_read_lock_sched_held+0x114/0x130
[  916.305503]  ? do_syscall_64+0x2df/0x3e0
[  916.307565]  ? __this_cpu_preempt_check+0x13/0x20
[  916.309703]  ? lockdep_hardirqs_on+0x347/0x5a0
[  916.311748]  ? exit_to_usermode_loop+0x5a/0x120
[  916.314011]  ? trace_hardirqs_on+0x28/0x170
[  916.316218]  ? exit_to_usermode_loop+0x72/0x120
[  916.318416]  ? do_syscall_64+0x2df/0x3e0
[  916.320471]  ? entry_SYSCALL_64_after_hwframe+0x49/0xbe
----------------------------------------

Removing the whole iteration:
----------------------------------------
[root@localhost tmp]# time ./a.out
Segmentation fault

real    0m0.309s
user    0m0.001s
sys     0m0.197s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    0m0.722s
user    0m0.007s
sys     0m0.543s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    0m0.415s
user    0m0.002s
sys     0m0.250s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    0m0.473s
user    0m0.001s
sys     0m0.233s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    0m0.327s
user    0m0.001s
sys     0m0.204s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    0m0.325s
user    0m0.001s
sys     0m0.190s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    0m0.370s
user    0m0.002s
sys     0m0.217s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    0m0.320s
user    0m0.002s
sys     0m0.184s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    0m0.361s
user    0m0.002s
sys     0m0.248s
[root@localhost tmp]# time ./a.out
Segmentation fault

real    0m0.358s
user    0m0.000s
sys     0m0.231s
----------------------------------------
Michal Hocko Feb. 13, 2019, 11:47 a.m. UTC | #5
On Wed 13-02-19 10:24:16, Tetsuo Handa wrote:
> Andrew Morton wrote:
> > On Tue, 12 Feb 2019 11:21:29 +0100 Michal Hocko <mhocko@kernel.org> wrote:
> > 
> > > Tetsuo has reported that creating thousands of processes sharing MM
> > > without SIGHAND (aka alien threads) and setting
> > > /proc/<pid>/oom_score_adj swamps the kernel log and takes ages [1] to
> > > finish. This is especially worrisome because all that printing is done
> > > under the RCU lock, which can potentially trigger the RCU stall or
> > > softlockup detector.
> > > 
> > > The primary reason for the printk was to catch potential users who might
> > > depend on the behavior prior to 44a70adec910 ("mm, oom_adj: make sure
> > > processes sharing mm have same view of oom_score_adj"), but after more
> > > than 2 years without a single report I guess it is safe to simply remove
> > > the printk altogether.
> > > 
> > > The next step should be moving oom_score_adj over to the mm struct and
> > > removing all the task crawling, as suggested by [2].
> > > 
> > > [1] http://lkml.kernel.org/r/97fce864-6f75-bca5-14bc-12c9f890e740@i-love.sakura.ne.jp
> > > [2] http://lkml.kernel.org/r/20190117155159.GA4087@dhcp22.suse.cz
> > 
> > I think I'll put a cc:stable on this.  Deleting a might-trigger debug
> > printk is safe and welcome.
> > 
> 
> I don't like this patch, for I can confirm that removing only the printk()
> is not sufficient to avoid the hung task warning. If the reason for
> removing the printk() is that we have never heard of anyone hitting it in
> more than 2 years, then the whole iteration is nothing but garbage. I
> insist that this iteration should be removed.

As the changelog states explicitly, removing the loop should be the next
step, and the implementation is outlined in [2]. It is not as simple as
doing the revert you have proposed. We simply cannot allow different
processes sharing the same mm to disagree on oom_score_adj. This could
easily lead to breaking the OOM_SCORE_ADJ_MIN protection. And that is a
correctness issue.
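
A minimal, hypothetical sketch of that idea, to make the referenced next
step more concrete: the mm->oom_score_adj / mm->oom_score_adj_min fields
and the helper below are illustrative assumptions only, not the actual
implementation outlined in [2].
----------------------------------------
/* Hypothetical sketch: assumes oom_score_adj would live on mm_struct. */
static void set_oom_adj_on_mm(struct mm_struct *mm, int oom_adj, bool legacy)
{
	/*
	 * A single shared location means every task using this mm observes
	 * the same value, so no process-list crawl is needed at all.
	 */
	mm->oom_score_adj = (short)oom_adj;
	if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
		mm->oom_score_adj_min = (short)oom_adj;
}
----------------------------------------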

As a side note, I am pretty sure I would have more time to do that if only
I didn't have to spend it on pointless and repeated discussions. You are
clearly not interested in spending _your_ time to address this issue
properly yourself. That is fair, but nacking a low-hanging-fruit patch
that doesn't make the situation any worse, while it removes a potentially
expensive operation from within RCU context, is nothing but obstruction.
It is even sadder that this is not the first example of this attitude,
which makes it pretty hard, if not impossible, to work with you.

And another side note: I have already pointed out that this is by far not
the only problem with the CLONE_VM without CLONE_SIGHAND threading model.
Put your "only the oom paths matter" glasses down for a moment and look at
the actual, much more serious consequences of this threading model. Hint:
have a look at mm_update_next_owner and how we have to for_each_process
from under tasklist_lock, or at zap_threads with RCU as well.
Tetsuo Handa Feb. 15, 2019, 12:57 a.m. UTC | #6
Sigh, you are again misunderstanding...

I'm not opposed to forbidding the CLONE_VM without CLONE_SIGHAND threading
model. I'm asserting that we had better revert the iteration for now, even
if we strive towards forbidding the CLONE_VM without CLONE_SIGHAND
threading model.

You say "And that is a correctness issue." but your patch is broken because
it does not close the race. Since nobody seems to be using CLONE_VM without
CLONE_SIGHAND threading, we can both avoid the hung task problem and close
the race by eliminating this broken iteration. We don't need to worry about
the "This could easily lead to breaking the OOM_SCORE_ADJ_MIN protection."
case, because setting OOM_SCORE_ADJ_MIN needs administrator privileges. And
it is YOUR PATCH that still allows breaking the OOM_SCORE_ADJ_MIN
protection. My patch is simpler and more accurate than yours.
Michal Hocko Feb. 15, 2019, 9:37 a.m. UTC | #7
On Fri 15-02-19 09:57:59, Tetsuo Handa wrote:
> Sigh, you are again misunderstanding...
> 
> I'm not opposed to forbidding the CLONE_VM without CLONE_SIGHAND threading model.

We cannot do that, unfortunately. This threading model has been allowed
for a long time and somebody might depend on it.

> I'm asserting that we had better revert the iteration for now, even if we
> strive towards forbidding the CLONE_VM without CLONE_SIGHAND threading model.
> 
> You say "And that is a correctness issue." but your patch is broken because
> it does not close the race.

Removing the printk, as this patch does, has hardly anything to do with
race conditions, and it is not advertised to close any either. So please
stop going off topic again.

> Since nobody seems to be using CLONE_VM without CLONE_SIGHAND threading,
> we can both avoid the hung task problem and close the race by eliminating
> this broken iteration. We don't need to worry about the "This could easily
> lead to breaking the OOM_SCORE_ADJ_MIN protection." case, because setting
> OOM_SCORE_ADJ_MIN needs administrator privileges.

This is simply wrong. We have to care about OOM_SCORE_ADJ_MIN precisely
because it is the _admin's_ decision to hide a task from the OOM killer.

> And it is YOUR PATCH that still allows breaking the OOM_SCORE_ADJ_MIN
> protection. My patch is simpler and more accurate than yours.

Please stop this already. Your patch to revert the oom_score_adj
consistency is simply broken. Full stop. I have already outlined how to do
this properly. If you really do care, go and play with that idea. I can be
convinced that there are holes in that approach and can discuss further
solutions, but proposing a broken approach again and again is just wasting
time.

Patch

diff --git a/fs/proc/base.c b/fs/proc/base.c
index 633a63462573..f5ed9512d193 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -1086,10 +1086,6 @@  static int __set_oom_adj(struct file *file, int oom_adj, bool legacy)
 
 			task_lock(p);
 			if (!p->vfork_done && process_shares_mm(p, mm)) {
-				pr_info("updating oom_score_adj for %d (%s) from %d to %d because it shares mm with %d (%s). Report if this is unexpected.\n",
-						task_pid_nr(p), p->comm,
-						p->signal->oom_score_adj, oom_adj,
-						task_pid_nr(task), task->comm);
 				p->signal->oom_score_adj = oom_adj;
 				if (!legacy && has_capability_noaudit(current, CAP_SYS_RESOURCE))
 					p->signal->oom_score_adj_min = (short)oom_adj;