Message ID | 20190220032245.2413-1-stepan@pex.com (mailing list archive) |
---|---|
State | New, archived |
Series | mm/oom: added option 'oom_dump_task_cmdline' |
Hi,

Spell it out correctly (2 places):

On 2/19/19 7:22 PM, Stepan Bujnak wrote:
> When oom_dump_tasks is enabled, this option will try to display task

When oom_dump_task_cmdline is enabled,

> cmdline instead of the command name in the system-wide task dump.
>
> This is useful in some cases e.g. on postgres server. If OOM killer is
> invoked it will show a bunch of tasks called 'postgres'. With this
> option enabled it will show additional information like the database
> user, database name and what it is currently doing.
>
> Other example is python. Instead of just 'python' it will also show the
> script name currently being executed.
>
> Signed-off-by: Stepan Bujnak <stepan@pex.com>
> ---
>  Documentation/sysctl/vm.txt | 10 ++++++++++
>  include/linux/oom.h         |  1 +
>  kernel/sysctl.c             |  7 +++++++
>  mm/oom_kill.c               | 20 ++++++++++++++++++--
>  4 files changed, 36 insertions(+), 2 deletions(-)
>
> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> index 187ce4f599a2..74278c8c30d2 100644
> --- a/Documentation/sysctl/vm.txt
> +++ b/Documentation/sysctl/vm.txt
> @@ -50,6 +50,7 @@ Currently, these files are in /proc/sys/vm:
>  - nr_trim_pages (only if CONFIG_MMU=n)
>  - numa_zonelist_order
>  - oom_dump_tasks
> +- oom_dump_task_cmdline
>  - oom_kill_allocating_task
>  - overcommit_kbytes
>  - overcommit_memory
> @@ -639,6 +640,15 @@ The default value is 1 (enabled).
>
>  ==============================================================
>
> +oom_dump_task_cmdline
> +
> +When oom_dump_tasks is enabled, this option will try to display task cmdline

When oom_dump_task_cmdline is enabled,

> +instead of the command name in the system-wide task dump.
> +
> +The default value is 0 (disabled).
> +
> +==============================================================
> +
>  oom_kill_allocating_task
>
>  This enables or disables killing the OOM-triggering task in
> diff --git a/include/linux/oom.h b/include/linux/oom.h
> index d07992009265..461b15b3b695 100644
> --- a/include/linux/oom.h
> +++ b/include/linux/oom.h
> @@ -125,6 +125,7 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>
>  /* sysctls */
>  extern int sysctl_oom_dump_tasks;
> +extern int sysctl_oom_dump_task_cmdline;
>  extern int sysctl_oom_kill_allocating_task;
>  extern int sysctl_panic_on_oom;
>  #endif /* _INCLUDE_LINUX_OOM_H */
> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> index ba4d9e85feb8..4edc5f8e6cf9 100644
> --- a/kernel/sysctl.c
> +++ b/kernel/sysctl.c
> @@ -1288,6 +1288,13 @@ static struct ctl_table vm_table[] = {
>  		.mode = 0644,
>  		.proc_handler = proc_dointvec,
>  	},
> +	{
> +		.procname = "oom_dump_task_cmdline",
> +		.data = &sysctl_oom_dump_task_cmdline,
> +		.maxlen = sizeof(sysctl_oom_dump_task_cmdline),
> +		.mode = 0644,
> +		.proc_handler = proc_dointvec,
> +	},
>  	{
>  		.procname = "overcommit_ratio",
>  		.data = &sysctl_overcommit_ratio,

thanks.
On Wed, Feb 20, 2019 at 5:10 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>
> Hi,
>
> Spell it out correctly (2 places):

This is not a typo. It actually refers to the oom_dump_tasks option,
in the sense that when that option is enabled,
this option (oom_dump_task_cmdline) additionally displays task
cmdline instead of task name.

>
>
> On 2/19/19 7:22 PM, Stepan Bujnak wrote:
> > When oom_dump_tasks is enabled, this option will try to display task
>
> When oom_dump_task_cmdline is enabled,
>
> > cmdline instead of the command name in the system-wide task dump.
> >
> > This is useful in some cases e.g. on postgres server. If OOM killer is
> > invoked it will show a bunch of tasks called 'postgres'. With this
> > option enabled it will show additional information like the database
> > user, database name and what it is currently doing.
> >
> > Other example is python. Instead of just 'python' it will also show the
> > script name currently being executed.
> >
> > Signed-off-by: Stepan Bujnak <stepan@pex.com>
> > ---
> >  Documentation/sysctl/vm.txt | 10 ++++++++++
> >  include/linux/oom.h         |  1 +
> >  kernel/sysctl.c             |  7 +++++++
> >  mm/oom_kill.c               | 20 ++++++++++++++++++--
> >  4 files changed, 36 insertions(+), 2 deletions(-)
> >
> > diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
> > index 187ce4f599a2..74278c8c30d2 100644
> > --- a/Documentation/sysctl/vm.txt
> > +++ b/Documentation/sysctl/vm.txt
> > @@ -50,6 +50,7 @@ Currently, these files are in /proc/sys/vm:
> >  - nr_trim_pages (only if CONFIG_MMU=n)
> >  - numa_zonelist_order
> >  - oom_dump_tasks
> > +- oom_dump_task_cmdline
> >  - oom_kill_allocating_task
> >  - overcommit_kbytes
> >  - overcommit_memory
> > @@ -639,6 +640,15 @@ The default value is 1 (enabled).
> >
> >  ==============================================================
> >
> > +oom_dump_task_cmdline
> > +
> > +When oom_dump_tasks is enabled, this option will try to display task cmdline
>
> When oom_dump_task_cmdline is enabled,
>
> > +instead of the command name in the system-wide task dump.
> > +
> > +The default value is 0 (disabled).
> > +
> > +==============================================================
> > +
> >  oom_kill_allocating_task
> >
> >  This enables or disables killing the OOM-triggering task in
> > diff --git a/include/linux/oom.h b/include/linux/oom.h
> > index d07992009265..461b15b3b695 100644
> > --- a/include/linux/oom.h
> > +++ b/include/linux/oom.h
> > @@ -125,6 +125,7 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p);
> >
> >  /* sysctls */
> >  extern int sysctl_oom_dump_tasks;
> > +extern int sysctl_oom_dump_task_cmdline;
> >  extern int sysctl_oom_kill_allocating_task;
> >  extern int sysctl_panic_on_oom;
> >  #endif /* _INCLUDE_LINUX_OOM_H */
> > diff --git a/kernel/sysctl.c b/kernel/sysctl.c
> > index ba4d9e85feb8..4edc5f8e6cf9 100644
> > --- a/kernel/sysctl.c
> > +++ b/kernel/sysctl.c
> > @@ -1288,6 +1288,13 @@ static struct ctl_table vm_table[] = {
> >  		.mode = 0644,
> >  		.proc_handler = proc_dointvec,
> >  	},
> > +	{
> > +		.procname = "oom_dump_task_cmdline",
> > +		.data = &sysctl_oom_dump_task_cmdline,
> > +		.maxlen = sizeof(sysctl_oom_dump_task_cmdline),
> > +		.mode = 0644,
> > +		.proc_handler = proc_dointvec,
> > +	},
> >  	{
> >  		.procname = "overcommit_ratio",
> >  		.data = &sysctl_overcommit_ratio,
>
>
> thanks.
> --
> ~Randy
On 2/19/19 8:30 PM, Bujnak, Stepan wrote:
> On Wed, Feb 20, 2019 at 5:10 AM Randy Dunlap <rdunlap@infradead.org> wrote:
>>
>> Hi,
>>
>> Spell it out correctly (2 places):
> This is not a typo. It actually refers to the oom_dump_tasks option,
> in the sense that when that option is enabled,
> this option (oom_dump_task_cmdline) additionally displays task
> cmdline instead of task name.
>>

OK, thanks for clarifying.

>>
>> On 2/19/19 7:22 PM, Stepan Bujnak wrote:
>>> When oom_dump_tasks is enabled, this option will try to display task
>>
>> When oom_dump_task_cmdline is enabled,
>>
>>> cmdline instead of the command name in the system-wide task dump.
>>>
>>> This is useful in some cases e.g. on postgres server. If OOM killer is
>>> invoked it will show a bunch of tasks called 'postgres'. With this
>>> option enabled it will show additional information like the database
>>> user, database name and what it is currently doing.
>>>
>>> Other example is python. Instead of just 'python' it will also show the
>>> script name currently being executed.
>>>
>>> Signed-off-by: Stepan Bujnak <stepan@pex.com>
>>> ---
>>>  Documentation/sysctl/vm.txt | 10 ++++++++++
>>>  include/linux/oom.h         |  1 +
>>>  kernel/sysctl.c             |  7 +++++++
>>>  mm/oom_kill.c               | 20 ++++++++++++++++++--
>>>  4 files changed, 36 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
>>> index 187ce4f599a2..74278c8c30d2 100644
>>> --- a/Documentation/sysctl/vm.txt
>>> +++ b/Documentation/sysctl/vm.txt
>>> @@ -50,6 +50,7 @@ Currently, these files are in /proc/sys/vm:
>>>  - nr_trim_pages (only if CONFIG_MMU=n)
>>>  - numa_zonelist_order
>>>  - oom_dump_tasks
>>> +- oom_dump_task_cmdline
>>>  - oom_kill_allocating_task
>>>  - overcommit_kbytes
>>>  - overcommit_memory
>>> @@ -639,6 +640,15 @@ The default value is 1 (enabled).
>>>
>>>  ==============================================================
>>>
>>> +oom_dump_task_cmdline
>>> +
>>> +When oom_dump_tasks is enabled, this option will try to display task cmdline
>>
>> When oom_dump_task_cmdline is enabled,
>>
>>> +instead of the command name in the system-wide task dump.
>>> +
>>> +The default value is 0 (disabled).
>>> +
>>> +==============================================================
>>> +
>>>  oom_kill_allocating_task
>>>
>>>  This enables or disables killing the OOM-triggering task in
>>> diff --git a/include/linux/oom.h b/include/linux/oom.h
>>> index d07992009265..461b15b3b695 100644
>>> --- a/include/linux/oom.h
>>> +++ b/include/linux/oom.h
>>> @@ -125,6 +125,7 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p);
>>>
>>>  /* sysctls */
>>>  extern int sysctl_oom_dump_tasks;
>>> +extern int sysctl_oom_dump_task_cmdline;
>>>  extern int sysctl_oom_kill_allocating_task;
>>>  extern int sysctl_panic_on_oom;
>>>  #endif /* _INCLUDE_LINUX_OOM_H */
>>> diff --git a/kernel/sysctl.c b/kernel/sysctl.c
>>> index ba4d9e85feb8..4edc5f8e6cf9 100644
>>> --- a/kernel/sysctl.c
>>> +++ b/kernel/sysctl.c
>>> @@ -1288,6 +1288,13 @@ static struct ctl_table vm_table[] = {
>>>  		.mode = 0644,
>>>  		.proc_handler = proc_dointvec,
>>>  	},
>>> +	{
>>> +		.procname = "oom_dump_task_cmdline",
>>> +		.data = &sysctl_oom_dump_task_cmdline,
>>> +		.maxlen = sizeof(sysctl_oom_dump_task_cmdline),
>>> +		.mode = 0644,
>>> +		.proc_handler = proc_dointvec,
>>> +	},
>>>  	{
>>>  		.procname = "overcommit_ratio",
>>>  		.data = &sysctl_overcommit_ratio,
>>
>>
>> thanks.
>> --
>> ~Randy
On Wed 20-02-19 04:22:45, Stepan Bujnak wrote:
> When oom_dump_tasks is enabled, this option will try to display task
> cmdline instead of the command name in the system-wide task dump.
>
> This is useful in some cases e.g. on postgres server. If OOM killer is
> invoked it will show a bunch of tasks called 'postgres'. With this
> option enabled it will show additional information like the database
> user, database name and what it is currently doing.
>
> Other example is python. Instead of just 'python' it will also show the
> script name currently being executed.

The size of OOM report output is quite large already and this will just
add much more for some workloads and printing from this context is quite
a problem already.

> Signed-off-by: Stepan Bujnak <stepan@pex.com>
> ---
>  Documentation/sysctl/vm.txt | 10 ++++++++++
>  include/linux/oom.h         |  1 +
>  kernel/sysctl.c             |  7 +++++++
>  mm/oom_kill.c               | 20 ++++++++++++++++++--
>  4 files changed, 36 insertions(+), 2 deletions(-)
>
[...]
> @@ -404,9 +406,18 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
>  	pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");
>  	rcu_read_lock();
>  	for_each_process(p) {
> +		char *name, *cmd = NULL;
> +
>  		if (oom_unkillable_task(p, memcg, nodemask))
>  			continue;
>
> +		/*
> +		 * This needs to be done before calling find_lock_task_mm()
> +		 * since both grab a task lock which would result in deadlock.
> +		 */
> +		if (sysctl_oom_dump_task_cmdline)
> +			cmd = kstrdup_quotable_cmdline(p, GFP_KERNEL);
> +
>  		task = find_lock_task_mm(p);
>  		if (!task) {
>  			/*

You are trying to allocate from the OOM context. That is a big no no.
Not to mention that this is deadlock prone because get_cmdline needs
mmap_sem and the allocating context might hold the lock already. So the
patch is simply wrong.
On Wed, Feb 20, 2019 at 7:49 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 20-02-19 04:22:45, Stepan Bujnak wrote:
> > When oom_dump_tasks is enabled, this option will try to display task
> > cmdline instead of the command name in the system-wide task dump.
> >
> > This is useful in some cases e.g. on postgres server. If OOM killer is
> > invoked it will show a bunch of tasks called 'postgres'. With this
> > option enabled it will show additional information like the database
> > user, database name and what it is currently doing.
> >
> > Other example is python. Instead of just 'python' it will also show the
> > script name currently being executed.
>
> The size of OOM report output is quite large already and this will just
> add much more for some workloads and printing from this context is quite
> a problem already.
>

The option defaults to false so most workloads wouldn't be affected. As an
alternative, the cmdline could be printed only for the victim task in the
OOM summary.

> > Signed-off-by: Stepan Bujnak <stepan@pex.com>
> > ---
> >  Documentation/sysctl/vm.txt | 10 ++++++++++
> >  include/linux/oom.h         |  1 +
> >  kernel/sysctl.c             |  7 +++++++
> >  mm/oom_kill.c               | 20 ++++++++++++++++++--
> >  4 files changed, 36 insertions(+), 2 deletions(-)
> >
> [...]
> > @@ -404,9 +406,18 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
> >  	pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");
> >  	rcu_read_lock();
> >  	for_each_process(p) {
> > +		char *name, *cmd = NULL;
> > +
> >  		if (oom_unkillable_task(p, memcg, nodemask))
> >  			continue;
> >
> > +		/*
> > +		 * This needs to be done before calling find_lock_task_mm()
> > +		 * since both grab a task lock which would result in deadlock.
> > +		 */
> > +		if (sysctl_oom_dump_task_cmdline)
> > +			cmd = kstrdup_quotable_cmdline(p, GFP_KERNEL);
> > +
> >  		task = find_lock_task_mm(p);
> >  		if (!task) {
> >  			/*
> You are trying to allocate from the OOM context. That is a big no no.
> Not to mention that this is deadlock prone because get_cmdline needs
> mmap_sem and the allocating context might hold the lock already. So the
> patch is simply wrong.
>

Thanks for the notes. I understand how allocating from OOM context
is a problem. However I still believe that this would be helpful
for debugging OOM kills since task->comm is often not descriptive
enough. Would it help if instead of calling kstrdup_quotable_cmdline()
which allocates the buffer on heap I called get_cmdline() directly
passing it stack-allocated buffer of certain size e.g. 256?

> --
> Michal Hocko
> SUSE Labs
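The stack-buffer idea floated above could look roughly like the following userspace sketch. It covers only the formatting half of the proposal: flattening a raw, NUL-separated argument area into a caller-provided fixed-size buffer with no heap allocation. `format_cmdline` is a hypothetical name used for illustration, not a kernel function, and replacing non-printable bytes with `?` is a simplification chosen here rather than the kernel's quoting scheme. The sketch says nothing about how the raw bytes are obtained, which in the kernel goes through get_cmdline() and is where the locking concern raised in this thread applies.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not a kernel API): copy a raw, NUL-separated
 * argument area into a caller-provided buffer. Argument separators
 * become spaces and non-printable bytes become '?'. The output is
 * truncated to fit and is always NUL-terminated when out_size > 0.
 * Returns the number of bytes written, excluding the terminator.
 */
static size_t format_cmdline(const char *raw, size_t raw_len,
                             char *out, size_t out_size)
{
    size_t n = 0;

    if (out_size == 0)
        return 0;

    /* Drop trailing NULs so the result does not end in spaces. */
    while (raw_len > 0 && raw[raw_len - 1] == '\0')
        raw_len--;

    for (size_t i = 0; i < raw_len && n + 1 < out_size; i++) {
        unsigned char c = (unsigned char)raw[i];

        if (c == '\0')
            out[n++] = ' ';        /* separator between arguments */
        else if (c < 0x20 || c > 0x7e)
            out[n++] = '?';        /* keep the log entry on one line */
        else
            out[n++] = (char)c;
    }
    out[n] = '\0';
    return n;
}
```

With a 256-byte buffer as suggested, the bytes `python\0worker.py\0` come out as `python worker.py`; anything longer than the buffer is silently truncated.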
On 2019/02/20 17:37, Bujnak, Stepan wrote:
>>> @@ -404,9 +406,18 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
>>>  	pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");
>>>  	rcu_read_lock();
>>>  	for_each_process(p) {
>>> +		char *name, *cmd = NULL;
>>> +
>>>  		if (oom_unkillable_task(p, memcg, nodemask))
>>>  			continue;
>>>
>>> +		/*
>>> +		 * This needs to be done before calling find_lock_task_mm()
>>> +		 * since both grab a task lock which would result in deadlock.
>>> +		 */
>>> +		if (sysctl_oom_dump_task_cmdline)
>>> +			cmd = kstrdup_quotable_cmdline(p, GFP_KERNEL);
>>> +
>>>  		task = find_lock_task_mm(p);
>>>  		if (!task) {
>>>  			/*
>> You are trying to allocate from the OOM context. That is a big no no.
>> Not to mention that this is deadlock prone because get_cmdline needs
>> mmap_sem and the allocating context might hold the lock already. So the
>> patch is simply wrong.
>>
>
> Thanks for the notes. I understand how allocating from OOM context
> is a problem. However I still believe that this would be helpful
> for debugging OOM kills since task->comm is often not descriptive
> enough. Would it help if instead of calling kstrdup_quotable_cmdline()
> which allocates the buffer on heap I called get_cmdline() directly
> passing it stack-allocated buffer of certain size e.g. 256?

You made three errors.

First is that doing GFP_KERNEL allocation inside
rcu_read_lock()/rcu_read_unlock() is not permitted.

Second is that doing GFP_KERNEL allocation with oom_lock held is not
permitted.

Third is that somebody might be already holding p->mm->mmap_sem for write
when get_cmdline() tries to hold it for read.

That is, your patch can't work (even if you update your patch to use
static buffer).
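The third point can be pictured with a toy model of a reader/writer lock. This is a deliberately simplified userspace sketch, not the kernel's rw_semaphore: every `toy_` name is hypothetical, and the only rule it captures is that a read acquisition cannot succeed while a writer holds the lock. A real blocking read acquisition would sleep at the point where the sketch returns false, and if the writer is itself waiting for memory to be freed, neither side can make progress.

```c
#include <stdbool.h>

/* Toy reader/writer lock; all names here are illustrative only. */
struct toy_rwsem {
    bool writer;   /* true while a writer holds the lock */
    int readers;   /* number of readers currently inside */
};

/* Take the lock for reading, or fail instead of sleeping. */
static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
    if (sem->writer)
        return false;
    sem->readers++;
    return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
    sem->readers--;
}

/* Take the lock for writing, or fail instead of sleeping. */
static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
    if (sem->writer || sem->readers > 0)
        return false;
    sem->writer = true;
    return true;
}

static void toy_up_write(struct toy_rwsem *sem)
{
    sem->writer = false;
}

/*
 * Models the task dump wanting to read a command line that is
 * protected by the lock. Returns true if the read could proceed.
 */
static bool toy_dump_can_read_cmdline(struct toy_rwsem *mmap_sem)
{
    if (!toy_down_read_trylock(mmap_sem))
        return false;   /* a blocking acquisition would sleep here */
    /* ... the command line would be copied out at this point ... */
    toy_up_read(mmap_sem);
    return true;
}
```

In this model, once `toy_down_write_trylock()` has succeeded on a lock, `toy_dump_can_read_cmdline()` on the same lock returns false until `toy_up_write()` is called. Switching to a stack or static buffer changes nothing about that, which matches the closing remark above.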
On Wed 20-02-19 09:37:56, Bujnak, Stepan wrote:
> On Wed, Feb 20, 2019 at 7:49 AM Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > You are trying to allocate from the OOM context. That is a big no no.
> > Not to mention that this is deadlock prone because get_cmdline needs
> > mmap_sem and the allocating context might hold the lock already. So the
> > patch is simply wrong.
> >
>
> Thanks for the notes. I understand how allocating from OOM context
> is a problem. However I still believe that this would be helpful
> for debugging OOM kills since task->comm is often not descriptive
> enough. Would it help if instead of calling kstrdup_quotable_cmdline()
> which allocates the buffer on heap I called get_cmdline() directly
> passing it stack-allocated buffer of certain size e.g. 256?

No it wouldn't, because get_cmdline takes the mmap_sem lock, as already
pointed out.

Please also note that the cmd line might be considered security/privacy
sensitive information and dumping it to the log sounds like a bad idea
in general.
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 187ce4f599a2..74278c8c30d2 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -50,6 +50,7 @@ Currently, these files are in /proc/sys/vm:
 - nr_trim_pages (only if CONFIG_MMU=n)
 - numa_zonelist_order
 - oom_dump_tasks
+- oom_dump_task_cmdline
 - oom_kill_allocating_task
 - overcommit_kbytes
 - overcommit_memory
@@ -639,6 +640,15 @@ The default value is 1 (enabled).
 
 ==============================================================
 
+oom_dump_task_cmdline
+
+When oom_dump_tasks is enabled, this option will try to display task cmdline
+instead of the command name in the system-wide task dump.
+
+The default value is 0 (disabled).
+
+==============================================================
+
 oom_kill_allocating_task
 
 This enables or disables killing the OOM-triggering task in
diff --git a/include/linux/oom.h b/include/linux/oom.h
index d07992009265..461b15b3b695 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -125,6 +125,7 @@ extern struct task_struct *find_lock_task_mm(struct task_struct *p);
 
 /* sysctls */
 extern int sysctl_oom_dump_tasks;
+extern int sysctl_oom_dump_task_cmdline;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index ba4d9e85feb8..4edc5f8e6cf9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1288,6 +1288,13 @@ static struct ctl_table vm_table[] = {
 		.mode = 0644,
 		.proc_handler = proc_dointvec,
 	},
+	{
+		.procname = "oom_dump_task_cmdline",
+		.data = &sysctl_oom_dump_task_cmdline,
+		.maxlen = sizeof(sysctl_oom_dump_task_cmdline),
+		.mode = 0644,
+		.proc_handler = proc_dointvec,
+	},
 	{
 		.procname = "overcommit_ratio",
 		.data = &sysctl_overcommit_ratio,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 26ea8636758f..736fa0a6ab8d 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -41,6 +41,7 @@
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/string_helpers.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -52,6 +53,7 @@
 int sysctl_panic_on_oom;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
+int sysctl_oom_dump_task_cmdline;
 
 /*
  * Serializes oom killer invocations (out_of_memory()) from all contexts to
@@ -404,9 +406,18 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 	pr_info("[ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name\n");
 	rcu_read_lock();
 	for_each_process(p) {
+		char *name, *cmd = NULL;
+
 		if (oom_unkillable_task(p, memcg, nodemask))
 			continue;
 
+		/*
+		 * This needs to be done before calling find_lock_task_mm()
+		 * since both grab a task lock which would result in deadlock.
+		 */
+		if (sysctl_oom_dump_task_cmdline)
+			cmd = kstrdup_quotable_cmdline(p, GFP_KERNEL);
+
 		task = find_lock_task_mm(p);
 		if (!task) {
 			/*
@@ -414,16 +425,21 @@ static void dump_tasks(struct mem_cgroup *memcg, const nodemask_t *nodemask)
 			 * detached their mm's. There's no need to report
 			 * them; they can't be oom killed anyway.
 			 */
-			continue;
+			goto done;
 		}
 
+		name = cmd ? cmd : task->comm;
+
 		pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu %5hd %s\n",
 			task->pid, from_kuid(&init_user_ns, task_uid(task)),
 			task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
 			mm_pgtables_bytes(task->mm),
 			get_mm_counter(task->mm, MM_SWAPENTS),
-			task->signal->oom_score_adj, task->comm);
+			task->signal->oom_score_adj, name);
 		task_unlock(task);
+
+done:
+		kfree(cmd);
 	}
 	rcu_read_unlock();
 }
When oom_dump_tasks is enabled, this option will try to display task
cmdline instead of the command name in the system-wide task dump.

This is useful in some cases e.g. on postgres server. If OOM killer is
invoked it will show a bunch of tasks called 'postgres'. With this
option enabled it will show additional information like the database
user, database name and what it is currently doing.

Other example is python. Instead of just 'python' it will also show the
script name currently being executed.

Signed-off-by: Stepan Bujnak <stepan@pex.com>
---
 Documentation/sysctl/vm.txt | 10 ++++++++++
 include/linux/oom.h         |  1 +
 kernel/sysctl.c             |  7 +++++++
 mm/oom_kill.c               | 20 ++++++++++++++++++--
 4 files changed, 36 insertions(+), 2 deletions(-)
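For comparison, the two strings the description contrasts can already be inspected from userspace on Linux through /proc/<pid>/comm and /proc/<pid>/cmdline. The sketch below is only meant to show the difference between them on a live system; `read_proc_entry` is a hypothetical helper name, it assumes procfs is mounted at /proc, and it is unrelated to how the kernel's task dump obtains the name.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Illustrative, Linux-specific helper: read /proc/<pid>/<name>
 * (for example "comm" or "cmdline") into buf. Returns the number of
 * bytes read, or -1 if the file cannot be opened. On success buf is
 * NUL-terminated; "cmdline" data keeps its embedded NUL separators,
 * so use the returned length rather than strlen() on it.
 */
static long read_proc_entry(pid_t pid, const char *name,
                            char *buf, size_t size)
{
    char path[64];
    FILE *f;
    size_t n;

    if (size == 0)
        return -1;

    snprintf(path, sizeof(path), "/proc/%ld/%s", (long)pid, name);
    f = fopen(path, "r");
    if (!f)
        return -1;

    n = fread(buf, 1, size - 1, f);
    fclose(f);
    buf[n] = '\0';
    return (long)n;
}
```

Calling `read_proc_entry(getpid(), "comm", buf, sizeof(buf))` yields the short command name followed by a newline, while `"cmdline"` yields the full argument vector with NUL bytes between arguments.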