
mm/oom: Add oom_score_adj value to oom Killed process message

Message ID 20190821001445.32114-1-echron@arista.com (mailing list archive)
State New, archived
Series mm/oom: Add oom_score_adj value to oom Killed process message

Commit Message

Edward Chron Aug. 21, 2019, 12:14 a.m. UTC
For an OOM event: print the oom_score_adj value for the OOM Killed process
to document what the oom score adjust value was at the time the process was
OOM Killed. The adjustment value can be set by user code and it affects the
resulting oom_score, so it is used to influence kill process selection.

When eligible tasks are not printed (sysctl oom_dump_tasks = 0), printing
this value is the only documentation of the value for the process being
killed. Having this value on the Killed process message documents whether a
misconfiguration occurred or confirms that the oom_score_adj value applied
as expected.

An example which illustrates both misconfiguration and validation that
the oom_score_adj was applied as expected is:

Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
 (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
 shmem-rss:0kB oom_score_adj:1000

systemd-udevd is a critical system application that should have an
oom_score_adj of -1000. Here it was misconfigured with an adjustment of
1000, making it a highly favored OOM kill target process. The output
documents both the misconfiguration and the fact that the process was
correctly targeted by OOM due to the misconfiguration. Having the
oom_score_adj on the Killed message ensures that it is documented.

Signed-off-by: Edward Chron <echron@arista.com>
Acked-by: Michal Hocko <mhocko@suse.com>
---
 mm/oom_kill.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)
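
For context: user code sets the adjustment by writing
/proc/<pid>/oom_score_adj. A minimal sketch of doing that from C (the pid
and value here are hypothetical, error handling trimmed for brevity):

#include <stdio.h>

/* Sketch: exempt a critical daemon from OOM kill selection by writing
 * -1000 (OOM_SCORE_ADJ_MIN). The pid 2692 matches the systemd-udevd
 * example above and is hypothetical.
 */
int main(void)
{
        FILE *f = fopen("/proc/2692/oom_score_adj", "w");

        if (!f)
                return 1;
        fprintf(f, "%d\n", -1000);
        return fclose(f) ? 1 : 0;
}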

Comments

David Rientjes Aug. 21, 2019, 3:25 a.m. UTC | #1
On Tue, 20 Aug 2019, Edward Chron wrote:

> For an OOM event: print the oom_score_adj value for the OOM Killed process
> to document what the oom score adjust value was at the time the process was
> OOM Killed. The adjustment value can be set by user code and it affects the
> resulting oom_score, so it is used to influence kill process selection.
>
> When eligible tasks are not printed (sysctl oom_dump_tasks = 0), printing
> this value is the only documentation of the value for the process being
> killed. Having this value on the Killed process message documents whether a
> misconfiguration occurred or confirms that the oom_score_adj value applied
> as expected.
>
> An example which illustrates both misconfiguration and validation that
> the oom_score_adj was applied as expected is:
>
> Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
>  (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
>  shmem-rss:0kB oom_score_adj:1000
>
> systemd-udevd is a critical system application that should have an
> oom_score_adj of -1000. Here it was misconfigured with an adjustment of
> 1000, making it a highly favored OOM kill target process. The output
> documents both the misconfiguration and the fact that the process was
> correctly targeted by OOM due to the misconfiguration. Having the
> oom_score_adj on the Killed message ensures that it is documented.
> 
> Signed-off-by: Edward Chron <echron@arista.com>
> Acked-by: Michal Hocko <mhocko@suse.com>

Acked-by: David Rientjes <rientjes@google.com>

vm.oom_dump_tasks is pretty useful, however, so it's curious why you 
haven't left it enabled :/

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eda2e2a0bdc6..c781f73b6cd6 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -884,12 +884,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
>  	 */
>  	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
>  	mark_oom_victim(victim);
> -	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> +	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB oom_score_adj:%ld\n",
>  		message, task_pid_nr(victim), victim->comm,
>  		K(victim->mm->total_vm),
>  		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
>  		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
> -		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
> +		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
> +		(long)victim->signal->oom_score_adj);
>  	task_unlock(victim);
>  
>  	/*

Nit: why not just use %hd and avoid the cast to long?
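
For illustration, the %hd variant suggested here would look like this (a
sketch only; oom_score_adj is a signed short in struct signal_struct, so
%hd prints it without a cast):

        pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB oom_score_adj:%hd\n",
                message, task_pid_nr(victim), victim->comm,
                K(victim->mm->total_vm),
                K(get_mm_counter(victim->mm, MM_ANONPAGES)),
                K(get_mm_counter(victim->mm, MM_FILEPAGES)),
                K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
                victim->signal->oom_score_adj);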
Michal Hocko Aug. 21, 2019, 6:47 a.m. UTC | #2
On Tue 20-08-19 20:25:32, David Rientjes wrote:
> On Tue, 20 Aug 2019, Edward Chron wrote:
> 
> > For an OOM event: print the oom_score_adj value for the OOM Killed process
> > to document what the oom score adjust value was at the time the process was
> > OOM Killed. The adjustment value can be set by user code and it affects the
> > resulting oom_score, so it is used to influence kill process selection.
> >
> > When eligible tasks are not printed (sysctl oom_dump_tasks = 0), printing
> > this value is the only documentation of the value for the process being
> > killed. Having this value on the Killed process message documents whether a
> > misconfiguration occurred or confirms that the oom_score_adj value applied
> > as expected.
> >
> > An example which illustrates both misconfiguration and validation that
> > the oom_score_adj was applied as expected is:
> >
> > Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
> >  (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
> >  shmem-rss:0kB oom_score_adj:1000
> >
> > systemd-udevd is a critical system application that should have an
> > oom_score_adj of -1000. Here it was misconfigured with an adjustment of
> > 1000, making it a highly favored OOM kill target process. The output
> > documents both the misconfiguration and the fact that the process was
> > correctly targeted by OOM due to the misconfiguration. Having the
> > oom_score_adj on the Killed message ensures that it is documented.
> > 
> > Signed-off-by: Edward Chron <echron@arista.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
> 
> Acked-by: David Rientjes <rientjes@google.com>
> 
> vm.oom_dump_tasks is pretty useful, however, so it's curious why you 
> haven't left it enabled :/

Because it generates a lot of output potentially. Think of a workload
with too many tasks which is not uncommon.
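
(For reference, the knob in question is the vm.oom_dump_tasks sysctl; a
usage sketch:)

sysctl -w vm.oom_dump_tasks=0    # runtime: skip the per-task dump on OOM
echo "vm.oom_dump_tasks = 0" >> /etc/sysctl.conf    # persistent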
David Rientjes Aug. 21, 2019, 7:19 a.m. UTC | #3
On Wed, 21 Aug 2019, Michal Hocko wrote:

> > vm.oom_dump_tasks is pretty useful, however, so it's curious why you 
> > haven't left it enabled :/
> 
> Because it generates a lot of output potentially. Think of a workload
> with too many tasks which is not uncommon.

Probably better to always print all the info for the victim so we don't 
need to duplicate everything between dump_tasks() and dump_oom_summary().

Edward, how about this?

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -420,11 +420,17 @@ static int dump_task(struct task_struct *p, void *arg)
  * State information includes task's pid, uid, tgid, vm size, rss,
  * pgtables_bytes, swapents, oom_score_adj value, and name.
  */
-static void dump_tasks(struct oom_control *oc)
+static void dump_tasks(struct oom_control *oc, struct task_struct *victim)
 {
 	pr_info("Tasks state (memory values in pages):\n");
 	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
 
+	/* If vm.oom_dump_tasks is disabled, only show the victim */
+	if (!sysctl_oom_dump_tasks) {
+		dump_task(victim, oc);
+		return;
+	}
+
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
 	else {
@@ -465,8 +471,8 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
 		if (is_dump_unreclaim_slabs())
 			dump_unreclaimable_slab();
 	}
-	if (sysctl_oom_dump_tasks)
-		dump_tasks(oc);
+	if (p || sysctl_oom_dump_tasks)
+		dump_tasks(oc, p);
 	if (p)
 		dump_oom_summary(oc, p);
 }
Michal Hocko Aug. 21, 2019, 7:47 a.m. UTC | #4
On Wed 21-08-19 00:19:37, David Rientjes wrote:
> On Wed, 21 Aug 2019, Michal Hocko wrote:
> 
> > > vm.oom_dump_tasks is pretty useful, however, so it's curious why you 
> > > haven't left it enabled :/
> > 
> > Because it generates a lot of output potentially. Think of a workload
> > with too many tasks which is not uncommon.
> 
> Probably better to always print all the info for the victim so we don't 
> need to duplicate everything between dump_tasks() and dump_oom_summary().

I believe that the motivation was to have a one-line summary that is
already parsed by log consumers. And that is the one in __oom_kill_process.

Also I do not think this patch improves things much, for at least two
reasons: a) it doesn't really give you the whole list of killed tasks
(this might be the whole memcg) and b) we already have the most important
information in __oom_kill_process. If something is missing there I do
not see a strong reason we cannot add it there. Like in this case.

> Edward, how about this?
> 
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -420,11 +420,17 @@ static int dump_task(struct task_struct *p, void *arg)
>   * State information includes task's pid, uid, tgid, vm size, rss,
>   * pgtables_bytes, swapents, oom_score_adj value, and name.
>   */
> -static void dump_tasks(struct oom_control *oc)
> +static void dump_tasks(struct oom_control *oc, struct task_struct *victim)
>  {
>  	pr_info("Tasks state (memory values in pages):\n");
>  	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
>  
> +	/* If vm.oom_dump_tasks is disabled, only show the victim */
> +	if (!sysctl_oom_dump_tasks) {
> +		dump_task(victim, oc);
> +		return;
> +	}
> +
>  	if (is_memcg_oom(oc))
>  		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
>  	else {
> @@ -465,8 +471,8 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
>  		if (is_dump_unreclaim_slabs())
>  			dump_unreclaimable_slab();
>  	}
> -	if (sysctl_oom_dump_tasks)
> -		dump_tasks(oc);
> +	if (p || sysctl_oom_dump_tasks)
> +		dump_tasks(oc, p);
>  	if (p)
>  		dump_oom_summary(oc, p);
>  }
Edward Chron Aug. 21, 2019, 9:51 p.m. UTC | #5
Good point, I can post this with your correction.

I will add your Acked-by: David Rientjes <rientjes@google.com>

I am adding your Acked-by to the revised patch as this is what Michal
asked me to do (so I assume that is what I should do).

Should I post as a separate fix again or simply post here?
I'll post here and if you prefer a fresh submission, let me know and
I'll do that.

Thank-you for reviewing this patch.

-Edward Chron
Arista Networks

On Tue, Aug 20, 2019 at 8:25 PM David Rientjes <rientjes@google.com> wrote:
>
> On Tue, 20 Aug 2019, Edward Chron wrote:
>
> > For an OOM event: print the oom_score_adj value for the OOM Killed process
> > to document what the oom score adjust value was at the time the process was
> > OOM Killed. The adjustment value can be set by user code and it affects the
> > resulting oom_score, so it is used to influence kill process selection.
> >
> > When eligible tasks are not printed (sysctl oom_dump_tasks = 0), printing
> > this value is the only documentation of the value for the process being
> > killed. Having this value on the Killed process message documents whether a
> > misconfiguration occurred or confirms that the oom_score_adj value applied
> > as expected.
> >
> > An example which illustrates both misconfiguration and validation that
> > the oom_score_adj was applied as expected is:
> >
> > Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
> >  (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
> >  shmem-rss:0kB oom_score_adj:1000
> >
> > systemd-udevd is a critical system application that should have an
> > oom_score_adj of -1000. Here it was misconfigured with an adjustment of
> > 1000, making it a highly favored OOM kill target process. The output
> > documents both the misconfiguration and the fact that the process was
> > correctly targeted by OOM due to the misconfiguration. Having the
> > oom_score_adj on the Killed message ensures that it is documented.
> >
> > Signed-off-by: Edward Chron <echron@arista.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
>
> Acked-by: David Rientjes <rientjes@google.com>
>
> vm.oom_dump_tasks is pretty useful, however, so it's curious why you
> haven't left it enabled :/
>
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index eda2e2a0bdc6..c781f73b6cd6 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -884,12 +884,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> >        */
> >       do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
> >       mark_oom_victim(victim);
> > -     pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > +     pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB oom_score_adj:%ld\n",
> >               message, task_pid_nr(victim), victim->comm,
> >               K(victim->mm->total_vm),
> >               K(get_mm_counter(victim->mm, MM_ANONPAGES)),
> >               K(get_mm_counter(victim->mm, MM_FILEPAGES)),
> > -             K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
> > +             K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
> > +             (long)victim->signal->oom_score_adj);
> >       task_unlock(victim);
> >
> >       /*
>
> Nit: why not just use %hd and avoid the cast to long?
Edward Chron Aug. 21, 2019, 10:22 p.m. UTC | #6
On Wed, Aug 21, 2019 at 12:19 AM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 21 Aug 2019, Michal Hocko wrote:
>
> > > vm.oom_dump_tasks is pretty useful, however, so it's curious why you
> > > haven't left it enabled :/
> >
> > Because it generates a lot of output potentially. Think of a workload
> > with too many tasks which is not uncommon.
>
> Probably better to always print all the info for the victim so we don't
> need to duplicate everything between dump_tasks() and dump_oom_summary().
>
> Edward, how about this?
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -420,11 +420,17 @@ static int dump_task(struct task_struct *p, void *arg)
>   * State information includes task's pid, uid, tgid, vm size, rss,
>   * pgtables_bytes, swapents, oom_score_adj value, and name.
>   */
> -static void dump_tasks(struct oom_control *oc)
> +static void dump_tasks(struct oom_control *oc, struct task_struct *victim)
>  {
>         pr_info("Tasks state (memory values in pages):\n");
>         pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
>
> +       /* If vm.oom_dump_tasks is disabled, only show the victim */
> +       if (!sysctl_oom_dump_tasks) {
> +               dump_task(victim, oc);
> +               return;
> +       }
> +
>         if (is_memcg_oom(oc))
>                 mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
>         else {
> @@ -465,8 +471,8 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
>                 if (is_dump_unreclaim_slabs())
>                         dump_unreclaimable_slab();
>         }
> -       if (sysctl_oom_dump_tasks)
> -               dump_tasks(oc);
> +       if (p || sysctl_oom_dump_tasks)
> +               dump_tasks(oc, p);
>         if (p)
>                 dump_oom_summary(oc, p);
>  }

I would be willing to accept this, though as Michal mentions in his
post, it would be very helpful to have the oom_score_adj on the Killed
process message.

One reason for that is that the Killed process message is the one
message that is printed with error priority (pr_err), so that message
can be filtered out and forwarded to notify support that an OOM event
occurred. Putting any information that can be shared in that message is
useful, as in my experience it is the initial point of triage for an OOM
event. Even if the full log with the per-process dump is available, it
is the starting point for triage for an OOM event.
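
(A minimal example of that kind of filtering, assuming systemd's
journalctl or util-linux dmesg is available; exact tooling varies by
site:)

journalctl -k -p err     # kernel messages at error priority or above
dmesg --level=err,crit,alert,emerg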

So from my perspective I would be happy having both, with David's
proposal providing a bit of extra information as shown here:

Jul 21 20:07:48 linuxserver kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
Jul 21 20:07:48 linuxserver kernel: [    547]     0   547    31664      615         299008        0             0 systemd-journal

The OOM Killed process message will print as:

Jul 21 20:07:48 linuxserver kernel: Out of memory: Killed process 2826
(oomprocs) total-vm:1056800kB, anon-rss:1052784kB, file-rss:4kB,
shmem-rss:0kB oom_score_adj:1000

But if only one output change is allowed I'd favor the Killed process
message, since that one can be singled out due to its print priority and
forwarded.

By the way, right now there is redundancy in that the Killed process
message prints vm and rss even if vm.oom_dump_tasks is enabled.
I don't see why that is a big deal.
It is very useful to have all the information that is there.
I wouldn't mind also having pgtables, but we would be able to get
that from the output of dump_task if that is enabled.

If it is acceptable to also add the dump_task for the killed process
for !sysctl_oom_dump_tasks I can repost the patch including that as
well.

Thank-you,

Edward Chron
Arista Networks
Edward Chron Aug. 21, 2019, 10:25 p.m. UTC | #7
On Tue, Aug 20, 2019 at 8:25 PM David Rientjes <rientjes@google.com> wrote:
>
> On Tue, 20 Aug 2019, Edward Chron wrote:
>
> > For an OOM event: print the oom_score_adj value for the OOM Killed process
> > to document what the oom score adjust value was at the time the process was
> > OOM Killed. The adjustment value can be set by user code and it affects the
> > resulting oom_score, so it is used to influence kill process selection.
> >
> > When eligible tasks are not printed (sysctl oom_dump_tasks = 0), printing
> > this value is the only documentation of the value for the process being
> > killed. Having this value on the Killed process message documents whether a
> > misconfiguration occurred or confirms that the oom_score_adj value applied
> > as expected.
> >
> > An example which illustrates both misconfiguration and validation that
> > the oom_score_adj was applied as expected is:
> >
> > Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
> >  (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
> >  shmem-rss:0kB oom_score_adj:1000
> >
> > systemd-udevd is a critical system application that should have an
> > oom_score_adj of -1000. Here it was misconfigured with an adjustment of
> > 1000, making it a highly favored OOM kill target process. The output
> > documents both the misconfiguration and the fact that the process was
> > correctly targeted by OOM due to the misconfiguration. Having the
> > oom_score_adj on the Killed message ensures that it is documented.
> >
> > Signed-off-by: Edward Chron <echron@arista.com>
> > Acked-by: Michal Hocko <mhocko@suse.com>
>
> Acked-by: David Rientjes <rientjes@google.com>
>
> vm.oom_dump_tasks is pretty useful, however, so it's curious why you
> haven't left it enabled :/
>
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index eda2e2a0bdc6..c781f73b6cd6 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -884,12 +884,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> >        */
> >       do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
> >       mark_oom_victim(victim);
> > -     pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > +     pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB oom_score_adj:%ld\n",
> >               message, task_pid_nr(victim), victim->comm,
> >               K(victim->mm->total_vm),
> >               K(get_mm_counter(victim->mm, MM_ANONPAGES)),
> >               K(get_mm_counter(victim->mm, MM_FILEPAGES)),
> > -             K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
> > +             K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
> > +             (long)victim->signal->oom_score_adj);
> >       task_unlock(victim);
> >
> >       /*
>
> Nit: why not just use %hd and avoid the cast to long?

Sorry, I may have accidentally top-posted my response to this. Here is
where my response should go:
-----------------------------------------------------------------------------------------------------------------------------------

Good point, I can post this with your correction.

I will add your Acked-by: David Rientjes <rientjes@google.com>

I am adding your Acked-by to the revised patch as this is what Michal
asked me to do (so I assume that is what I should do).

Should I post as a separate fix again or simply post here?

I'll post here and if you prefer a fresh submission, let me know and
I'll do that.

Thank-you for reviewing this patch.
Edward Chron Aug. 21, 2019, 11:12 p.m. UTC | #8
On Wed, Aug 21, 2019 at 12:47 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 21-08-19 00:19:37, David Rientjes wrote:
> > On Wed, 21 Aug 2019, Michal Hocko wrote:
> >
> > > > vm.oom_dump_tasks is pretty useful, however, so it's curious why you
> > > > haven't left it enabled :/
> > >
> > > Because it generates a lot of output potentially. Think of a workload
> > > with too many tasks which is not uncommon.
> >
> > Probably better to always print all the info for the victim so we don't
> > need to duplicate everything between dump_tasks() and dump_oom_summary().
>
> I believe that the motivation was to have a one-line summary that is
> already parsed by log consumers. And that is the one in __oom_kill_process.
>

Yes, the motivation was the one-line summary that the OOM Killed Process
message supplies, along with the fact that it is error priority, as I
mentioned. It is a very desirable place to put summarized information.

> Also I do not think this patch improves things much, for at least two
> reasons: a) it doesn't really give you the whole list of killed tasks
> (this might be the whole memcg) and b) we already have the most important
> information in __oom_kill_process. If something is missing there I do
> not see a strong reason we cannot add it there. Like in this case.
>

This is a good point.

Additionally (which you know, but mentioning for reference) the OOM
output used to look like this:

Nov 14 15:23:48 oldserver kernel: [337631.991218] Out of memory: Kill
process 19961 (python) score 17 or sacrifice child
Nov 14 15:23:48 oldserver kernel: [337631.991237] Killed process 31357
(sh) total-vm:5400kB, anon-rss:252kB, file-rss:4kB, shmem-rss:0kB

It now looks like this with 5.3.0-rc5 (minus the oom_score_adj):

Jul 22 10:42:40 newserver kernel:
oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-10383.slice/user@10383.service,task=oomprocs,pid=3035,uid=10383
Jul 22 10:42:40 newserver kernel: Out of memory: Killed process 3035
(oomprocs) total-vm:1056800kB, anon-rss:8kB, file-rss:4kB,
shmem-rss:0kB
Jul 22 10:42:40 newserver kernel: oom_reaper: reaped process 3035
(oomprocs), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

The old output did explain that an oom_score of 17 must have either
tied for the highest or was the highest. This did document why OOM
selected the process it did, even if it ends up killing the related sh
process.

With the newer format that added the constraint message, it does provide
the uid, which can be helpful, and the oom_reaper message showing that
the memory was reclaimed is certainly reassuring.

My understanding now is that printing the oom_score is discouraged.
This seems unfortunate. The oom_score_adj can be adjusted appropriately
if the oom_score is known. So it would be useful to have both.

But at least if oom_score_adj is printed you can confirm the value at
the time of the OOM event.
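
(Both values are also exposed per process under /proc, so they can be
sampled ahead of an OOM event; the pid here is hypothetical:)

cat /proc/2692/oom_score        # kernel-computed badness score, read-only
cat /proc/2692/oom_score_adj    # user-set adjustment, -1000..1000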

Thank-you,
-Edward Chron
Arista Networks

> > Edward, how about this?
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -420,11 +420,17 @@ static int dump_task(struct task_struct *p, void *arg)
> >   * State information includes task's pid, uid, tgid, vm size, rss,
> >   * pgtables_bytes, swapents, oom_score_adj value, and name.
> >   */
> > -static void dump_tasks(struct oom_control *oc)
> > +static void dump_tasks(struct oom_control *oc, struct task_struct *victim)
> >  {
> >       pr_info("Tasks state (memory values in pages):\n");
> >       pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
> >
> > +     /* If vm.oom_dump_tasks is disabled, only show the victim */
> > +     if (!sysctl_oom_dump_tasks) {
> > +             dump_task(victim, oc);
> > +             return;
> > +     }
> > +
> >       if (is_memcg_oom(oc))
> >               mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
> >       else {
> > @@ -465,8 +471,8 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
> >               if (is_dump_unreclaim_slabs())
> >                       dump_unreclaimable_slab();
> >       }
> > -     if (sysctl_oom_dump_tasks)
> > -             dump_tasks(oc);
> > +     if (p || sysctl_oom_dump_tasks)
> > +             dump_tasks(oc, p);
> >       if (p)
> >               dump_oom_summary(oc, p);
> >  }
>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 22, 2019, 7:09 a.m. UTC | #9
On Wed 21-08-19 15:25:13, Edward Chron wrote:
> On Tue, Aug 20, 2019 at 8:25 PM David Rientjes <rientjes@google.com> wrote:
> >
> > On Tue, 20 Aug 2019, Edward Chron wrote:
> >
> > > For an OOM event: print the oom_score_adj value for the OOM Killed process
> > > to document what the oom score adjust value was at the time the process was
> > > OOM Killed. The adjustment value can be set by user code and it affects the
> > > resulting oom_score, so it is used to influence kill process selection.
> > >
> > > When eligible tasks are not printed (sysctl oom_dump_tasks = 0), printing
> > > this value is the only documentation of the value for the process being
> > > killed. Having this value on the Killed process message documents whether a
> > > misconfiguration occurred or confirms that the oom_score_adj value applied
> > > as expected.
> > >
> > > An example which illustrates both misconfiguration and validation that
> > > the oom_score_adj was applied as expected is:
> > >
> > > Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
> > >  (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
> > >  shmem-rss:0kB oom_score_adj:1000
> > >
> > > systemd-udevd is a critical system application that should have an
> > > oom_score_adj of -1000. Here it was misconfigured with an adjustment of
> > > 1000, making it a highly favored OOM kill target process. The output
> > > documents both the misconfiguration and the fact that the process was
> > > correctly targeted by OOM due to the misconfiguration. Having the
> > > oom_score_adj on the Killed message ensures that it is documented.
> > >
> > > Signed-off-by: Edward Chron <echron@arista.com>
> > > Acked-by: Michal Hocko <mhocko@suse.com>
> >
> > Acked-by: David Rientjes <rientjes@google.com>
> >
> > vm.oom_dump_tasks is pretty useful, however, so it's curious why you
> > haven't left it enabled :/
> >
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > index eda2e2a0bdc6..c781f73b6cd6 100644
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -884,12 +884,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> > >        */
> > >       do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
> > >       mark_oom_victim(victim);
> > > -     pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > > +     pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB oom_score_adj:%ld\n",
> > >               message, task_pid_nr(victim), victim->comm,
> > >               K(victim->mm->total_vm),
> > >               K(get_mm_counter(victim->mm, MM_ANONPAGES)),
> > >               K(get_mm_counter(victim->mm, MM_FILEPAGES)),
> > > -             K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
> > > +             K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
> > > +             (long)victim->signal->oom_score_adj);
> > >       task_unlock(victim);
> > >
> > >       /*
> >
> > Nit: why not just use %hd and avoid the cast to long?
> 
> Sorry, I may have accidentally top-posted my response to this. Here is
> where my response should go:
> -----------------------------------------------------------------------------------------------------------------------------------
> 
> Good point, I can post this with your correction.
> 
> I will add your Acked-by: David Rientjes <rientjes@google.com>
> 
> I am adding your Acked-by to the revised patch as this is what Michal
> asked me to do (so I assume that is what I should do).
> 
> Should I post as a separate fix again or simply post here?

Andrew usually folds these small fixups automagically. If that doesn't
happen here for some reason then just repost with acks and the fixup.

Thanks!
Michal Hocko Aug. 22, 2019, 7:15 a.m. UTC | #10
On Wed 21-08-19 15:22:07, Edward Chron wrote:
> On Wed, Aug 21, 2019 at 12:19 AM David Rientjes <rientjes@google.com> wrote:
> >
> > On Wed, 21 Aug 2019, Michal Hocko wrote:
> >
> > > > vm.oom_dump_tasks is pretty useful, however, so it's curious why you
> > > > haven't left it enabled :/
> > >
> > > Because it generates a lot of output potentially. Think of a workload
> > > with too many tasks which is not uncommon.
> >
> > Probably better to always print all the info for the victim so we don't
> > need to duplicate everything between dump_tasks() and dump_oom_summary().
> >
> > Edward, how about this?
> >
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -420,11 +420,17 @@ static int dump_task(struct task_struct *p, void *arg)
> >   * State information includes task's pid, uid, tgid, vm size, rss,
> >   * pgtables_bytes, swapents, oom_score_adj value, and name.
> >   */
> > -static void dump_tasks(struct oom_control *oc)
> > +static void dump_tasks(struct oom_control *oc, struct task_struct *victim)
> >  {
> >         pr_info("Tasks state (memory values in pages):\n");
> >         pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
> >
> > +       /* If vm.oom_dump_tasks is disabled, only show the victim */
> > +       if (!sysctl_oom_dump_tasks) {
> > +               dump_task(victim, oc);
> > +               return;
> > +       }
> > +
> >         if (is_memcg_oom(oc))
> >                 mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
> >         else {
> > @@ -465,8 +471,8 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
> >                 if (is_dump_unreclaim_slabs())
> >                         dump_unreclaimable_slab();
> >         }
> > -       if (sysctl_oom_dump_tasks)
> > -               dump_tasks(oc);
> > +       if (p || sysctl_oom_dump_tasks)
> > +               dump_tasks(oc, p);
> >         if (p)
> >                 dump_oom_summary(oc, p);
> >  }
> 
> I would be willing to accept this, though as Michal mentions in his
> post, it would be very helpful to have the oom_score_adj on the Killed
> process message.
> 
> One reason for that is that the Killed process message is the one
> message that is printed with error priority (pr_err), so that message
> can be filtered out and forwarded to notify support that an OOM event
> occurred. Putting any information that can be shared in that message is
> useful, as in my experience it is the initial point of triage for an OOM
> event. Even if the full log with the per-process dump is available, it
> is the starting point for triage for an OOM event.
> 
> So from my perspective I would be happy having both, with David's
> proposal providing a bit of extra information as shown here:
> 
> Jul 21 20:07:48 linuxserver kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> Jul 21 20:07:48 linuxserver kernel: [    547]     0   547    31664      615         299008        0             0 systemd-journal
> 
> The OOM Killed process message will print as:
> 
> Jul 21 20:07:48 linuxserver kernel: Out of memory: Killed process 2826
> (oomprocs) total-vm:1056800kB, anon-rss:1052784kB, file-rss:4kB,
> shmem-rss:0kB oom_score_adj:1000
> 
> But if only one output change is allowed I'd favor the Killed process
> message, since that one can be singled out due to its print priority and
> forwarded.
> 
> By the way, right now there is redundancy in that the Killed process
> message prints vm and rss even if vm.oom_dump_tasks is enabled.
> I don't see why that is a big deal.

There will always be redundancy there because the dump_tasks part is
there mostly to check the oom victim decision for potential
wrong/unexpected selection, while the "Killed..." message is there to
inform who has been killed. Most people really do care about that part
only.

> It is very useful to have all the information that is there.
> I wouldn't mind also having pgtables, but we would be able to get
> that from the output of dump_task if that is enabled.

I am not against adding pgtable information there. That memory is going
to be released when the task dies.
 
> If it is acceptable to also add the dump_task for the killed process
> for !sysctl_oom_dump_tasks I can repost the patch including that as
> well.

Well, I would rather focus on adding the missing pieces to the killed
task message instead.
Michal Hocko Aug. 22, 2019, 7:21 a.m. UTC | #11
On Wed 21-08-19 16:12:08, Edward Chron wrote:
[...]
> Additionally (which you know, but mentioning for reference) the OOM
> output used to look like this:
> 
> Nov 14 15:23:48 oldserver kernel: [337631.991218] Out of memory: Kill
> process 19961 (python) score 17 or sacrifice child
> Nov 14 15:23:48 oldserver kernel: [337631.991237] Killed process 31357
> (sh) total-vm:5400kB, anon-rss:252kB, file-rss:4kB, shmem-rss:0kB
> 
> It now looks like this with 5.3.0-rc5 (minus the oom_score_adj):
> 
> Jul 22 10:42:40 newserver kernel:
> oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-10383.slice/user@10383.service,task=oomprocs,pid=3035,uid=10383
> Jul 22 10:42:40 newserver kernel: Out of memory: Killed process 3035
> (oomprocs) total-vm:1056800kB, anon-rss:8kB, file-rss:4kB,
> shmem-rss:0kB
> Jul 22 10:42:40 newserver kernel: oom_reaper: reaped process 3035
> (oomprocs), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> 
> The old output did explain that an oom_score of 17 must have either
> tied for the highest or was the highest. This did document why OOM
> selected the process it did, even if it ends up killing the related sh
> process.
>
> With the newer format that added the constraint message, it does provide
> the uid, which can be helpful, and the oom_reaper message showing that
> the memory was reclaimed is certainly reassuring.
>
> My understanding now is that printing the oom_score is discouraged.
> This seems unfortunate. The oom_score_adj can be adjusted appropriately
> if the oom_score is known. So it would be useful to have both.

As already mentioned in our previous discussion, I am really not happy
about exporting oom_score without a larger context - aka other tasks'
scores to have something to compare against. Other than that, the value
is an internal implementation detail and it is meaningless without
knowing the exact algorithm, which can change at any time, so no
userspace should really depend on it. All important metrics should be
displayed by the oom report message already.
Edward Chron Aug. 22, 2019, 2:47 p.m. UTC | #12
On Thu, Aug 22, 2019 at 12:15 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 21-08-19 15:22:07, Edward Chron wrote:
> > On Wed, Aug 21, 2019 at 12:19 AM David Rientjes <rientjes@google.com> wrote:
> > >
> > > On Wed, 21 Aug 2019, Michal Hocko wrote:
> > >
> > > > > vm.oom_dump_tasks is pretty useful, however, so it's curious why you
> > > > > haven't left it enabled :/
> > > >
> > > > Because it generates a lot of output potentially. Think of a workload
> > > > with too many tasks which is not uncommon.
> > >
> > > Probably better to always print all the info for the victim so we don't
> > > need to duplicate everything between dump_tasks() and dump_oom_summary().
> > >
> > > Edward, how about this?
> > >
> > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > --- a/mm/oom_kill.c
> > > +++ b/mm/oom_kill.c
> > > @@ -420,11 +420,17 @@ static int dump_task(struct task_struct *p, void *arg)
> > >   * State information includes task's pid, uid, tgid, vm size, rss,
> > >   * pgtables_bytes, swapents, oom_score_adj value, and name.
> > >   */
> > > -static void dump_tasks(struct oom_control *oc)
> > > +static void dump_tasks(struct oom_control *oc, struct task_struct *victim)
> > >  {
> > >         pr_info("Tasks state (memory values in pages):\n");
> > >         pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
> > >
> > > +       /* If vm.oom_dump_tasks is disabled, only show the victim */
> > > +       if (!sysctl_oom_dump_tasks) {
> > > +               dump_task(victim, oc);
> > > +               return;
> > > +       }
> > > +
> > >         if (is_memcg_oom(oc))
> > >                 mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
> > >         else {
> > > @@ -465,8 +471,8 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
> > >                 if (is_dump_unreclaim_slabs())
> > >                         dump_unreclaimable_slab();
> > >         }
> > > -       if (sysctl_oom_dump_tasks)
> > > -               dump_tasks(oc);
> > > +       if (p || sysctl_oom_dump_tasks)
> > > +               dump_tasks(oc, p);
> > >         if (p)
> > >                 dump_oom_summary(oc, p);
> > >  }
> >
> > I would be willing to accept this, though as Michal mentions in his
> > post, it would be very helpful to have the oom_score_adj on the Killed
> > process message.
> >
> > One reason for that is that the Killed process message is the one
> > message that is printed with error priority (pr_err), so that message
> > can be filtered out and forwarded to notify support that an OOM event
> > occurred. Putting any information that can be shared in that message is
> > useful, as in my experience it is the initial point of triage for an OOM
> > event. Even if the full log with the per-process dump is available, it
> > is the starting point for triage for an OOM event.
> >
> > So from my perspective I would be happy having both, with David's
> > proposal providing a bit of extra information as shown here:
> >
> > Jul 21 20:07:48 linuxserver kernel: [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
> > Jul 21 20:07:48 linuxserver kernel: [    547]     0   547    31664      615         299008        0             0 systemd-journal
> >
> > The OOM Killed process message will print as:
> >
> > Jul 21 20:07:48 linuxserver kernel: Out of memory: Killed process 2826
> > (oomprocs) total-vm:1056800kB, anon-rss:1052784kB, file-rss:4kB,
> > shmem-rss:0kB oom_score_adj:1000
> >
> > But if only one output change is allowed I'd favor the Killed process
> > message, since that one can be singled out due to its print priority and
> > forwarded.
> >
> > By the way, right now there is redundancy in that the Killed process
> > message prints vm and rss even if vm.oom_dump_tasks is enabled.
> > I don't see why that is a big deal.
>
> There will always be redundancy there because the dump_tasks part is
> there mostly to check the oom victim decision for potential
> wrong/unexpected selection, while the "Killed..." message is there to
> inform who has been killed. Most people really do care about that part
> only.
>
> > It is very useful to have all the information that is there.
> > I wouldn't mind also having pgtables, but we would be able to get
> > that from the output of dump_task if that is enabled.
>
> I am not against adding pgtable information there. That memory is going
> to be released when the task dies.

Oh, Thank-you, I will include that in the updated patch as it is useful
information.

>
> > If it is acceptable to also add the dump_task for the killed process
> > for !sysctl_oom_dump_tasks I can repost the patch including that as
> > well.
>
> Well, I would rather focus on adding the missing pieces to the killed
> task message instead.
>

Will do.
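
For example, something along these lines (a sketch only, assuming
mm_pgtables_bytes() - the accessor dump_task() already uses - is
appropriate here; not the final patch):

        pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB pgtables:%lukB oom_score_adj:%hd\n",
                message, task_pid_nr(victim), victim->comm,
                K(victim->mm->total_vm),
                K(get_mm_counter(victim->mm, MM_ANONPAGES)),
                K(get_mm_counter(victim->mm, MM_FILEPAGES)),
                K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
                mm_pgtables_bytes(victim->mm) >> 10,    /* bytes -> kB */
                victim->signal->oom_score_adj);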

> --
> Michal Hocko
> SUSE Labs
Edward Chron Aug. 22, 2019, 2:55 p.m. UTC | #13
On Thu, Aug 22, 2019 at 12:21 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 21-08-19 16:12:08, Edward Chron wrote:
> [...]
> > Additionally (which you know, but mentioning for reference) the OOM
> > output used to look like this:
> >
> > Nov 14 15:23:48 oldserver kernel: [337631.991218] Out of memory: Kill
> > process 19961 (python) score 17 or sacrifice child
> > Nov 14 15:23:48 oldserver kernel: [337631.991237] Killed process 31357
> > (sh) total-vm:5400kB, anon-rss:252kB, file-rss:4kB, shmem-rss:0kB
> >
> > It now looks like this with 5.3.0-rc5 (minus the oom_score_adj):
> >
> > Jul 22 10:42:40 newserver kernel:
> > oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/user.slice/user-10383.slice/user@10383.service,task=oomprocs,pid=3035,uid=10383
> > Jul 22 10:42:40 newserver kernel: Out of memory: Killed process 3035
> > (oomprocs) total-vm:1056800kB, anon-rss:8kB, file-rss:4kB,
> > shmem-rss:0kB
> > Jul 22 10:42:40 newserver kernel: oom_reaper: reaped process 3035
> > (oomprocs), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
> >
> > The old output did explain that an oom_score of 17 must have either
> > tied for the highest or was the highest. This did document why OOM
> > selected the process it did, even if it ends up killing the related sh
> > process.
> >
> > With the newer format that added the constraint message, it does provide
> > the uid, which can be helpful, and the oom_reaper message showing that
> > the memory was reclaimed is certainly reassuring.
> >
> > My understanding now is that printing the oom_score is discouraged.
> > This seems unfortunate. The oom_score_adj can be adjusted appropriately
> > if the oom_score is known. So it would be useful to have both.
>
> As already mentioned in our previous discussion, I am really not happy
> about exporting oom_score without a larger context - aka other tasks'
> scores to have something to compare against. Other than that, the value
> is an internal implementation detail and it is meaningless without
> knowing the exact algorithm, which can change at any time, so no
> userspace should really depend on it. All important metrics should be
> displayed by the oom report message already.

The oom_score is no longer displayed anywhere in the OOM output with 5.3,
so there isn't anything to compare against anyway with the current OOM
per-process output and for the killed process.

I understand the reasoning for this from your discussion.
Thanks for explaining the rationale.

>
> --
> Michal Hocko
> SUSE Labs
Edward Chron Aug. 22, 2019, 2:58 p.m. UTC | #14
On Thu, Aug 22, 2019 at 12:09 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Wed 21-08-19 15:25:13, Edward Chron wrote:
> > On Tue, Aug 20, 2019 at 8:25 PM David Rientjes <rientjes@google.com> wrote:
> > >
> > > On Tue, 20 Aug 2019, Edward Chron wrote:
> > >
> > > > For an OOM event: print the oom_score_adj value for the OOM Killed process
> > > > to document what the oom score adjust value was at the time the process was
> > > > OOM Killed. The adjustment value can be set by user code and it affects the
> > > > resulting oom_score, so it is used to influence kill process selection.
> > > >
> > > > When eligible tasks are not printed (sysctl oom_dump_tasks = 0), printing
> > > > this value is the only documentation of the value for the process being
> > > > killed. Having this value on the Killed process message documents whether a
> > > > misconfiguration occurred or confirms that the oom_score_adj value applied
> > > > as expected.
> > > >
> > > > An example which illustrates both misconfiguration and validation that
> > > > the oom_score_adj was applied as expected is:
> > > >
> > > > Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
> > > >  (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
> > > >  shmem-rss:0kB oom_score_adj:1000
> > > >
> > > > systemd-udevd is a critical system application that should have an
> > > > oom_score_adj of -1000. Here it was misconfigured with an adjustment of
> > > > 1000, making it a highly favored OOM kill target process. The output
> > > > documents both the misconfiguration and the fact that the process was
> > > > correctly targeted by OOM due to the misconfiguration. Having the
> > > > oom_score_adj on the Killed message ensures that it is documented.
> > > >
> > > > Signed-off-by: Edward Chron <echron@arista.com>
> > > > Acked-by: Michal Hocko <mhocko@suse.com>
> > >
> > > Acked-by: David Rientjes <rientjes@google.com>
> > >
> > > vm.oom_dump_tasks is pretty useful, however, so it's curious why you
> > > haven't left it enabled :/
> > >
> > > > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > > > index eda2e2a0bdc6..c781f73b6cd6 100644
> > > > --- a/mm/oom_kill.c
> > > > +++ b/mm/oom_kill.c
> > > > @@ -884,12 +884,13 @@ static void __oom_kill_process(struct task_struct *victim, const char *message)
> > > >        */
> > > >       do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
> > > >       mark_oom_victim(victim);
> > > > -     pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
> > > > +     pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB oom_score_adj:%ld\n",
> > > >               message, task_pid_nr(victim), victim->comm,
> > > >               K(victim->mm->total_vm),
> > > >               K(get_mm_counter(victim->mm, MM_ANONPAGES)),
> > > >               K(get_mm_counter(victim->mm, MM_FILEPAGES)),
> > > > -             K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
> > > > +             K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
> > > > +             (long)victim->signal->oom_score_adj);
> > > >       task_unlock(victim);
> > > >
> > > >       /*
> > >
> > > Nit: why not just use %hd and avoid the cast to long?
> >
> > Sorry, I may have accidentally top-posted my response to this. Here is
> > where my response should go:
> > -----------------------------------------------------------------------------------------------------------------------------------
> >
> > Good point, I can post this with your correction.
> >
> > I will add your Acked-by: David Rientjes <rientjes@google.com>
> >
> > I am adding your Acked-by to the revised patch as this is what Michal
> > asked me to do (so I assume that is what I should do).
> >
> > Should I post as a separate fix again or simply post here?
>
> Andrew usually folds these small fixups automagically. If that doesn't
> happen here for some reason then just repost with acks and the fixup.
>

OK, I will resubmit. I wasn't sure if I should use --subject-prefix
"PATCH v2" or -v2, or just resubmit, but it sounds like it should work
either way.
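
(For reference, either spelling produces the conventional v2 subject; a
usage sketch with git format-patch:)

git format-patch -v2 -1
git format-patch --subject-prefix="PATCH v2" -1    # equivalent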

> Thanks!
>
> --
> Michal Hocko
> SUSE Labs
Edward Chron Aug. 22, 2019, 3:18 p.m. UTC | #15
On Wed, Aug 21, 2019 at 12:19 AM David Rientjes <rientjes@google.com> wrote:
>
> On Wed, 21 Aug 2019, Michal Hocko wrote:
>
> > > vm.oom_dump_tasks is pretty useful, however, so it's curious why you
> > > haven't left it enabled :/
> >
> > Because it generates a lot of output potentially. Think of a workload
> > with too many tasks which is not uncommon.
>
> Probably better to always print all the info for the victim so we don't
> need to duplicate everything between dump_tasks() and dump_oom_summary().
>
> Edward, how about this?

It is worth mentioning that David's suggested change certainly has merit,
though I agree with Michal that it should be a separate issue from
updating the OOM Killed process message. It is also not strictly
necessary for what I was asking for.

If you have scripts that scan your logs for OOM events, having a regular
format for the OOM output makes parsing easier. With David's suggestion
there would always be a "Tasks state" section, and vm.oom_dump_tasks
still works, but disabling it only prevents the other tasks from being
dumped, not the killed process.
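
(An illustration of that kind of log scanning - a sketch; the kernel log
path varies by distribution:)

grep -h 'Out of memory: Killed process' /var/log/kern.log | \
  sed -E 's/.*process ([0-9]+) .*oom_score_adj:(-?[0-9]+).*/pid=\1 adj=\2/'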

OOM output was reorganized not that long ago, as we discussed earlier, to
improve the organization of the data, so this proposal would be in line
with that change.

If there is interest in this I can submit it as a separate patch.

>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -420,11 +420,17 @@ static int dump_task(struct task_struct *p, void *arg)
>   * State information includes task's pid, uid, tgid, vm size, rss,
>   * pgtables_bytes, swapents, oom_score_adj value, and name.
>   */
> -static void dump_tasks(struct oom_control *oc)
> +static void dump_tasks(struct oom_control *oc, struct task_struct *victim)
>  {
>         pr_info("Tasks state (memory values in pages):\n");
>         pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
>
> +       /* If vm.oom_dump_tasks is disabled, only show the victim */
> +       if (!sysctl_oom_dump_tasks) {
> +               dump_task(victim, oc);
> +               return;
> +       }
> +
>         if (is_memcg_oom(oc))
>                 mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
>         else {
> @@ -465,8 +471,8 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
>                 if (is_dump_unreclaim_slabs())
>                         dump_unreclaimable_slab();
>         }
> -       if (sysctl_oom_dump_tasks)
> -               dump_tasks(oc);
> +       if (p || sysctl_oom_dump_tasks)
> +               dump_tasks(oc, p);
>         if (p)
>                 dump_oom_summary(oc, p);
>  }

Patch

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eda2e2a0bdc6..c781f73b6cd6 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -884,12 +884,13 @@  static void __oom_kill_process(struct task_struct *victim, const char *message)
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
 	mark_oom_victim(victim);
-	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB oom_score_adj:%ld\n",
 		message, task_pid_nr(victim), victim->comm,
 		K(victim->mm->total_vm),
 		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
 		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
-		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
+		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
+		(long)victim->signal->oom_score_adj);
 	task_unlock(victim);
 
 	/*