
mm/oom: Add killed process selection information

Message ID 20190808183247.28206-1-echron@arista.com (mailing list archive)
State New, archived
Series: mm/oom: Add killed process selection information

Commit Message

Edward Chron Aug. 8, 2019, 6:32 p.m. UTC
For an OOM event: print the oom_score, memory-usage percentage, and oom
adjustment (oom_score_adj) of the process that the OOM killer kills, plus
the totalpages value in kB (KiB) used in the calculation, as part of the
OOM killed-process message. This helps document why the process was
selected by the OOM killer at the time of the OOM event.

Sample message output:
Jul 21 20:07:48 yoursystem kernel: Out of memory: Killed process 2826
 (processname) total-vm:1056800kB, anon-rss:1052784kB, file-rss:4kB,
 shmem-rss:0kB memory-usage:3.2% oom_score:1032 oom_score_adj:1000
 total-pages: 32791748kB

Signed-off-by: Edward Chron <echron@arista.com>
---
 fs/proc/base.c      |  2 +-
 include/linux/oom.h | 18 +++++++++++-
 mm/oom_kill.c       | 67 +++++++++++++++++++++++++++++++++------------
 3 files changed, 68 insertions(+), 19 deletions(-)

Comments

Michal Hocko Aug. 8, 2019, 6:51 p.m. UTC | #1
On Thu 08-08-19 11:32:47, Edward Chron wrote:
> For an OOM event: print oomscore, memory pct, oom adjustment of the process
> that OOM kills and the totalpages value in kB (KiB) used in the calculation
> with the OOM killed process message. This is helpful to document why the
> process was selected by OOM at the time of the OOM event.
> 
> Sample message output:
> Jul 21 20:07:48 yoursystem kernel: Out of memory: Killed process 2826
>  (processname) total-vm:1056800kB, anon-rss:1052784kB, file-rss:4kB,
>  shmem-rss:0kB memory-usage:3.2% oom_score:1032 oom_score_adj:1000
>  total-pages: 32791748kB

A large part of this information is already printed in the oom-eligible
task list, namely rss and oom_score_adj; page table consumption, which
might be a serious contributor as well, is shown there too. Why would you
like to see oom_score, memory-usage and total-pages printed as well? How
is that information useful?
Edward Chron Aug. 8, 2019, 7:21 p.m. UTC | #2
It is helpful to the admin who looks at the kill message and records this
information. OOMs can come in bunches.
Knowing how much of a resource the oom-selected process was using at the
time of the OOM event is very useful; these fields document key process
and system memory/swap values and can be quite helpful.

Also, can't printing of the oom-eligible task list be disabled? For
systems with very large numbers of oom-eligible processes that seems
very desirable.
We have servers with many thousands of processes, and printing them all,
especially when several oom events occur in quick succession, can be
problematic and can trigger printk rate limiting.
Having this information in the kill message is of extra value in that case.

We've included it on the many thousands of Linux systems that we've
shipped, as well as on our internal Linux systems, and for us it has been
helpful.

Also, on our systems we log the Killed-process message at pr_err rather
than pr_info, because we want just that message sent to the console.
Customers and our internal support people find the message in that format
valuable: they want to know when OOM events occur, and this message gives
them a decent amount to go on.
Very few messages go to the console, to avoid clutter, but this is one
that people agree belongs there.
I'm not sure that change would be supported upstream, but again, in our
experience we've found it helpful, since you asked.

Thanks.

On Thu, Aug 8, 2019 at 11:51 AM Michal Hocko <mhocko@kernel.org> wrote:

> On Thu 08-08-19 11:32:47, Edward Chron wrote:
> > For an OOM event: print oomscore, memory pct, oom adjustment of the
> process
> > that OOM kills and the totalpages value in kB (KiB) used in the
> calculation
> > with the OOM killed process message. This is helpful to document why the
> > process was selected by OOM at the time of the OOM event.
> >
> > Sample message output:
> > Jul 21 20:07:48 yoursystem kernel: Out of memory: Killed process 2826
> >  (processname) total-vm:1056800kB, anon-rss:1052784kB, file-rss:4kB,
> >  shmem-rss:0kB memory-usage:3.2% oom_score:1032 oom_score_adj:1000
> >  total-pages: 32791748kB
>
> A large part of this information is already printed in the oom eligible
> task list. Namely rss, oom_score_adj, there is also page tables
> consumption which might be a serious contributor as well. Why would you
> like to see oom_score, memory-usage and total-pages to be printed as
> well? How is that information useful?
> --
> Michal Hocko
> SUSE Labs
>
Michal Hocko Aug. 8, 2019, 8:07 p.m. UTC | #3
[please do not top-post]

On Thu 08-08-19 12:21:30, Edward Chron wrote:
> It is helpful to the admin that looks at the kill message and records this
> information. OOMs can come in bunches.
> Knowing how much resource the oom selected process was using at the time of
> the OOM event is very useful, these fields document key process and system
> memory/swap values and can be quite helpful.

I do agree, and we already print that information. rss, with a breakdown
into anonymous, file-backed and shmem, is usually a large part of the oom
victim's footprint. It is not complete information, because a lot of
memory might be hidden behind other resources (open files etc.). We do
not print that information because it is not considered in the oom
selection, and it is also not guaranteed to be freed upon task exit.
 
> Also can't you disable printing the oom eligible task list? For systems
> with very large numbers of oom eligible processes that would seem to be
> very desirable.

Yes, that is indeed the case. But how do the oom_score and oom_score_adj
alone, without a comparison to the other eligible tasks, help in
isolation?

[...]

> I'm not sure that change would be supported upstream but again in our
> experience we've found it helpful, since you asked.

Could you be more specific about how that information is useful beyond
recording it? I am all for giving useful information in the OOM report,
but I would like to hear a sound justification for each additional piece
of information.

E.g. "this helped us to understand why the task has been selected" - that
is usually the dump_tasks portion of the report, because it gives a
picture of what the OOM killer sees when choosing whom to kill.

Then we have the summary, which gives us an estimate of how much memory
will be freed when the victim dies; rss is a very rough estimate. But is
a portion of the overall memory, or oom_score{_adj}, important to print
as well? Those are relative values. Say you get memory-usage:10%,
oom_score:42 and oom_score_adj:0. What are you going to tell from that
information?
Edward Chron Aug. 8, 2019, 10:15 p.m. UTC | #4
In our experience the vast majority (99.9%+) of OOM events are not kernel
issues; they are user-task memory issues.
A properly maintained Linux kernel only rarely has issues.
So useful information about the killed task, displayed in a manner that
can be quickly digested, is very helpful.
But it turns out the totalpages parameter is also critical for making
sense of what is shown.

So if we report that the fooWidget task was using ~15% of memory (I know
this is just an approximation, but it is often an adequate metric), we
can often tell just from that whether the number is larger than expected,
and we can start there.
Even though the percentage is a ballpark number, if you are familiar with
the tasks on your system and roughly how much memory you expect them to
use, you can often tell whether memory usage is excessive.
This is not always the case, but it is a fair amount of the time.
So the percent-of-memory field is helpful, but we've found we need
totalpages as well: totalpages determines what the percentage means.

You're an OOM expert so I don't need to tell you this, but just for
clarity: if a system that we expect to have swap space has its swap
diminished or removed, that can have a significant effect both on the
available memory and on the percentage of memory/swap the task consumes.
Likewise, if you run a task of a fixed memory size on a system with half
the memory, its percentage of memory jumps up, as does its oom_score.
We often know the memory size and how much swap space a system is
expected to have; printing totalpages allows us to confirm that this was
in fact the case at the time of the oom event.
The size of totalpages is also very important for telling approximately
how much memory/swap the task was using, because combined with the
percentage we can quickly get an idea of usage.
For systems that come in a variety of sizes that is essential: the
percentage is only meaningful in conjunction with totalpages, since the
two depend on each other.
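
To make that dependence concrete, here is a minimal userspace sketch (not
kernel code; the footprint and totalpages numbers are assumed, loosely
based on the sample message above) showing how the same task footprint
maps to very different percentages once swap shrinks totalpages:

#include <stdio.h>

/*
 * Same victim footprint (rss + page tables + swap, in kB) scored against
 * two different totalpages values: with swap present and with swap lost.
 * The integer math mirrors the kernel's points * 1000 / totalpages scaling.
 */
int main(void)
{
	unsigned long task_kb = 1052788;          /* ~1GB victim footprint */
	unsigned long with_swap_kb = 32791748;    /* RAM + swap            */
	unsigned long without_swap_kb = 16395874; /* swap lost             */

	unsigned long a = task_kb * 1000 / with_swap_kb;
	unsigned long b = task_kb * 1000 / without_swap_kb;

	printf("memory-usage:%lu.%lu%% vs memory-usage:%lu.%lu%%\n",
	       a / 10, a % 10, b / 10, b % 10);  /* 3.2% vs 6.4% */
	return 0;
}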

With memcg usage increasing, having totalpages readily available is even
more important, because the memory cgroup caps the value and it is
helpful to know what that limit was at the time of the oom event.

The oom_score tells us how Linux scored the task; the oom_score_adj
affects this, so it is helpful to have it alongside the oom_score.
If the adjustment is high, it can tell us that the task was acting as a
canary, so its oom_score is high even though its memory utilization may
be modest or low.
In that case we may need more information about what was going on,
because the selected task was not necessarily using much memory.
But at least we know why it was selected: a kill message with a high
oom_score_adj and a high oom_score makes that obvious.

Just by adding a few values to the kill message we're often able to
quickly get an idea of what caused an oom event, or at least a better
idea of where to start looking.

Since we're running a business, and so are our customers, anything we can
do to speed up the triage process saves money and makes people more
productive, so we find it valuable.

What other justification is needed? Let me know.

Thanks!

P.S.

By the way, just as feedback: the recent reorganization of the OOM
sections and print output was appreciated by those of us who do have to
wade through OOM output:
commit ef8444ea01d7442652f8e1b8a8b94278cb57eafd    (v5.0-rc1-107^2~63)
Author: yuzhoujian <yuzhoujian@didichuxing.com>
Date:   Fri Dec 28 00:36:07 2018 -0800

    mm, oom: reorganize the oom report in dump_header



On Thu, Aug 8, 2019 at 1:07 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> [please do not top-post]
>
> On Thu 08-08-19 12:21:30, Edward Chron wrote:
> > It is helpful to the admin that looks at the kill message and records this
> > information. OOMs can come in bunches.
> > Knowing how much resource the oom selected process was using at the time of
> > the OOM event is very useful, these fields document key process and system
> > memory/swap values and can be quite helpful.
>
> I do agree and we already print that information. rss with a break down
> to anonymous, file backed and shmem, is usually a large part of the oom
> victims foot print. It is not a complete information because there might
> be a lot of memory hidden by other resource (open files etc.). We do not
> print that information because it is not considered in the oom
> selection. It is also not guaranteed to be freed upon the task exit.
>
> > Also can't you disable printing the oom eligible task list? For systems
> > with very large numbers of oom eligible processes that would seem to be
> > very desirable.
>
> Yes that is indeed the case. But how does the oom_score and
> oom_score_adj alone without comparing it to other eligible tasks help in
> isolation?
>
> [...]
>
> > I'm not sure that change would be supported upstream but again in our
> > experience we've found it helpful, since you asked.
>
> Could you be more specific about how that information is useful except
> for recording it? I am all for giving an useful information in the OOM
> report but I would like to hear a sound justification for each
> additional piece of information.
>
> E.g. this helped us to understand why the task has been selected - this
> is usually dump_tasks portion of the report because it gives a picture
> of what the OOM killer sees when choosing who to kill.
>
> Then we have the summary to give us an estimation on how much
> memory will get freed when the victim dies - rss is a very rough
> estimation. But is a portion of the overal memory or oom_score{_adj}
> important to print as well? Those are relative values. Say you get
> memory-usage:10%, oom_score:42 and oom_score_adj:0. What are you going
> to tell from that information?
> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 9, 2019, 6:40 a.m. UTC | #5
[Again, please do not top post - it makes a mess of any longer
discussion]

On Thu 08-08-19 15:15:12, Edward Chron wrote:
> In our experience far more (99.9%+) OOM events are not kernel issues,
> they're user task memory issues.
> Properly maintained Linux kernel only rarely have issues.
> So useful information about the killed task, displayed in a manner
> that can be quickly digested, is very helpful.
> But it turns out the totalpages parameter is also critical to make
> sense of what is shown.

We already do print that information (see mem_cgroup_print_oom_meminfo
resp. show_mem).

> So if we report the fooWidget task was using ~15% of memory (I know
> this is just an approximation but it is often an adequate metric) we
> often can tell just from that the number is larger than expected so we
> can start there.
> Even though the % is a ballpark number, if you are familiar with the
> tasks on your system and approximately how much memory you expect them
> to use you can often tell if memory usage is excessive.
> This is not always the case but it is a fair amount of the time.
> So the % of memory field is helpful. But we've found we need totalpages as well.
> The totalpages effects the % of memory the task uses.

Is it too difficult to calculate that % from the data available in the
existing report? I would expect this to be a quite simple script, which I
would consider better than changing the kernel code.

[...]
> The oom_score tells us how Linux calculated the score for the task,
> the oom_score_adj effects this so it is helpful to have that in
> conjunction with the oom_score.
> If the adjust is high it can tell us that the task was acting as a
> canary and so it's oom_score is high even though it's memory
> utilization can be modest or low.

I am sorry but I still do not get it. How are you going to use that
information without seeing the other eligible tasks? oom_score is just
normalized memory usage plus, potentially, some heuristics (we gave a
discount to root processes until just recently). So this value only makes
sense to the kernel oom killer implementation. Note that the equation
might change in the future (that has happened several times in the past),
so looking at the value in isolation might be quite misleading.

I can see some point in printing oom_score_adj, though. Seeing biased
tasks - one way or the other - being selected might confirm the setting
is reasonable, or otherwise (e.g. seeing tasks with negative scores being
selected would indicate that they might not be biased enough). Then you
can go and check the eligible-tasks dump and see what happened. So this
part makes some sense to me.
Edward Chron Aug. 9, 2019, 10:15 p.m. UTC | #6
Sorry about top posting, responses inline.

On Thu, Aug 8, 2019 at 11:40 PM Michal Hocko <mhocko@kernel.org> wrote:
>
> [Again, please do not top post - it makes a mess of any longer
> discussion]
>
> On Thu 08-08-19 15:15:12, Edward Chron wrote:
> > In our experience far more (99.9%+) OOM events are not kernel issues,
> > they're user task memory issues.
> > Properly maintained Linux kernel only rarely have issues.
> > So useful information about the killed task, displayed in a manner
> > that can be quickly digested, is very helpful.
> > But it turns out the totalpages parameter is also critical to make
> > sense of what is shown.
>
> We already do print that information (see mem_cgroup_print_oom_meminfo
> resp. show_mem).
>
> > So if we report the fooWidget task was using ~15% of memory (I know
> > this is just an approximation but it is often an adequate metric) we
> > often can tell just from that the number is larger than expected so we
> > can start there.
> > Even though the % is a ballpark number, if you are familiar with the
> > tasks on your system and approximately how much memory you expect them
> > to use you can often tell if memory usage is excessive.
> > This is not always the case but it is a fair amount of the time.
> > So the % of memory field is helpful. But we've found we need totalpages as well.
> > The totalpages effects the % of memory the task uses.
>
> Is it too difficult to calculate that % from the data available in the
> existing report? I would expect this would be a quite simple script
> which I would consider a better than changing the kernel code.
>

Depending on your environment the answer is yes: we don't have the full
/var/log/messages (dmesg buffer) readily available, so it can be painful.

If you live in the data-center world with large numbers of servers and
switches, it is very common that select messages are sent to your laptop
or phone, because you can't possibly log in and check all of your systems.

Logs get moved off servers, and in some cases the servers run diskless
and the logs are sent elsewhere over the network.

So it is optimal if you only have to go and find the correct log and search
or run your script(s) when you absolutely need to, not on every OOM event.

That is the whole point of triage and triage is easier when you have
relevant information to decide which events require action and with what
priority.

The OOM Killed message is the one message that we have go to the console,
and/or it is sent as an SNMP alert to the admin, to let the admin know
that a server or switch has suffered a low-memory OOM event.

Maybe a few examples will show why the few extra bits of information are
helpful in such an environment.

For example, if we see that serverA and serverB are taking oom events in
which fooWidget is being killed, you will get messages like this:

Jul 21 20:07:48 serverA kernel: Out of memory: Killed process 2826
 (fooWidget) total-vm:10493400kB, anon-rss:10492996kB, file-rss:128kB,
 shmem-rss:0kB memory-usage:32.0% oom_score: 320 oom_score_adj:0
 total-pages: 32791748kB

Jul 21 20:13:51 serverB kernel: Out of memory: Killed process 2911
 (fooWidget) total-vm:11149196kB, anon-rss:11148508kB, file-rss:128kB,
 shmem-rss:0kB memory-usage:34.0% oom_score: 340 oom_score_adj:0
 total-pages: 32791748kB

It is often possible to recognize that fooWidget is using more memory
than expected on those systems, and you can act on that, possibly without
ever having to hunt down the log and run a script or otherwise analyze it.
The percent of memory and the memory size can often help you judge
whether the numbers look reasonable. Maybe the application was updated on
just those systems, which would explain why we don't see issues on the
other servers running the same application: a possible application memory
leak.

Another example of an application being targeted where the extra
information is helpful:

Aug  6 09:37:21 serverC kernel: Killed process 7583
(fooWidget) total-vm:528408kB, anon-rss:527144kB, file-rss:32kB,
shmem-rss:0kB, memory-usage:1.6% oom_score:16 oom_score_adj:0
total-pages: 32579088kB

Here the fooWidget process is only using about ~1.6% of the memory
resources. Note that it has a zero oom_score_adj and that Linux
calculated the oom_score to be 16, so with no boosts the oom_score of 16
simply marked the highest memory-consuming process on the system.
If that is a reasonable size for this application, we know that if we
want to debug this further we'll need to access the log. Either we have a
number of applications consuming enough memory to drive a low-memory OOM
event, or a process consuming more memory has an oom adjustment that
lowers its score and avoids making it a target, but may still help drive
the system to OOM.
Again, the information provided allowed a quick triage of the OOM event,
and we can act accordingly.

You can also imagine that if, for example, systemd-udevd gets OOM killed,
that should really grab your attention:

Jul 21 20:08:11 serverX kernel: Out of memory: Killed process 2911
 (systemd-udevd) total-vm:83128kB, anon-rss:80520kB, file-rss:128kB,
 shmem-rss:0kB memory-usage:0.1% oom_score: 1001 oom_score_adj:1000
 total-pages: 8312512kB

Here we see an obvious issue: systemd-udevd is a critical system app and
it should not have oom_score_adj:1000; that has clearly been changed, as
it should be -1000. So we'll need to track down what happened there.
Also, this is an 8GB system, so it may be running some low-priority
offload work; we may not need to prioritize finding out why the system
ran low on memory, though we will want to track down why the
oom_score_adj was changed from unkillable to most favored. Possibly a
script or command error.
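
As an aside (not part of the proposed patch), fixing that kind of
misconfiguration is purely a userspace matter; a minimal sketch of
pinning a critical daemon via the standard /proc interface might look
like the following (the PID is hypothetical and error handling is kept
short):

#include <stdio.h>

/*
 * Write -1000 (OOM_SCORE_ADJ_MIN) so the task is never selected by the
 * OOM killer, then read back /proc/<pid>/oom_score to confirm the
 * resulting score. Requires appropriate privileges.
 */
int main(void)
{
	const char *pid = "2911";  /* hypothetical PID of the daemon */
	char path[64];
	FILE *f;
	long score;

	snprintf(path, sizeof(path), "/proc/%s/oom_score_adj", pid);
	f = fopen(path, "w");
	if (!f)
		return 1;
	fprintf(f, "-1000\n");
	fclose(f);

	snprintf(path, sizeof(path), "/proc/%s/oom_score", pid);
	f = fopen(path, "r");
	if (f && fscanf(f, "%ld", &score) == 1)
		printf("oom_score is now %ld\n", score);
	if (f)
		fclose(f);
	return 0;
}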

I can give you additional examples of cases where first-order triage of
OOM events is aided by having the additional information present in the
OOM kill message, if you need them to justify adding these fields.

> [...]
> > The oom_score tells us how Linux calculated the score for the task,
> > the oom_score_adj effects this so it is helpful to have that in
> > conjunction with the oom_score.
> > If the adjust is high it can tell us that the task was acting as a
> > canary and so it's oom_score is high even though it's memory
> > utilization can be modest or low.
>
> I am sorry but I still do not get it. How are you going to use that
> information without seeing other eligible tasks. oom_score is just a
> normalized memory usage + some heuristics potentially (we have given a
> discount to root processes until just recently). So this value only
> makes sense to the kernel oom killer implementation. Note that the
> equation might change in the future (that has happen in the past several
> times) so looking at the value in isolation might be quite misleading.

We've been through the change where the oom adjustment scale went from
-17..15 to -1000..1000. This was the change David Rientjes from Google
made back around 2010.

This was not a problem for us then, and if you change things again in the
future (though the current implementation seems quite reasonable) it
shouldn't be an issue for us going forward, or for anyone else who can
use the additional information in the OOM kill message we're proposing.
Here is why, looking at the proposed message:

Jul 21 20:07:48 yoursystem kernel: Out of memory: Killed process 2826
 (processname) total-vm:1056800kB, anon-rss:1052784kB, file-rss:4kB,
 shmem-rss:0kB memory-usage:3.2% oom_score:1032 oom_score_adj:1000
 total-pages: 32791748kB

Let me go through each field again, apologies for stating much that
you already know, but just to be clear:

oom_score_adj: Useful to document the adjustment at the time of the OOM
               event, and helpful in interpreting the oom_score. It
               really should have been included from day one, in my
               opinion.

oom_score: The value, computed by your internal algorithm (documented via
           the source code, so it's no secret), that is used to select
           the task to kill on the OOM event. Having this and the % of
           memory used tells us whether any additional adjustments were
           made to the process. As you can see from the sample messages
           I've given, oom_score is the % of memory, plus or minus the
           adjustment value, plus any internal adjustment. Since David's
           implementation became the OOM algorithm there has been only
           one such internal adjustment, the 3% root oom_score reduction;
           that was added and then removed. If it came back, or others
           were added, it would be reflected in the oom_score. That is
           why having oom_score and % memory together would be quite
           helpful.

% memory: Simple for the kernel to calculate at the time of the OOM
          event, this documents how much memory the task was using and is
          easier for humans to read and digest than total-vm:1056800kB,
          anon-rss:1052784kB, file-rss:4kB, shmem-rss:0kB, though those
          fields are useful to know. Strictly speaking, if you provide
          totalpages in the output we can calculate the % of memory used,
          except that oom_badness calculates it as rss + pte + swap,
          which is not exactly what you provide in the kill message.
          Since oom_badness already calculates this, and there is little
          overhead in printing it, it would be better to have the kernel
          print it. If the calculation changes for some reason, the
          kernel would print whatever value it calculates. Knowing how
          much memory a task was using seems quite valuable to an
          algorithm like OOM, so it seems unlikely that it won't matter.

totalpages: Gives the size of memory plus swap (if any) at the time of
            the event. Quite useful to have with the kill message, and it
            is readily available.
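
For what it's worth, the arithmetic tying these fields together is small
enough to show in a few lines. The following is a userspace sketch (not
kernel code) that mirrors the scaling done by oom_badness() and
select_bad_process() in the patch below; the sample numbers are taken
from the proposed message, and rounding is integer, as in the kernel:

#include <stdio.h>

/*
 * points is the raw badness in pages (rss + page tables + swap) and adj
 * is oom_score_adj; both are scaled against totalpages, also in pages.
 */
static long long oom_score(long long points, long long adj,
			   long long totalpages)
{
	points += adj * (totalpages / 1000);
	if (points < 0)
		points = 0;
	return points * 1000 / totalpages;
}

int main(void)
{
	long long totalpages = 32791748 / 4; /* 32791748kB mem+swap, 4kB pages */
	long long points = 1052788 / 4;      /* ~1052788kB victim footprint    */
	long long pct = points * 1000 / totalpages;

	/* Prints roughly memory-usage:3.2% oom_score:1032 (+/- rounding) */
	printf("memory-usage:%lld.%lld%% oom_score:%lld\n",
	       pct / 10, pct % 10, oom_score(points, 1000, totalpages));
	return 0;
}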

That's all we're asking. I hope I have explained why it is useful to
have these values in the kill message. Gosh, all the fields you currently
print are also included elsewhere in the OOM output (assuming you print
all the per-task information); you could remove them and make the same
argument you're making to me now, that they are (probably) printed
somewhere else. However, we would prefer that you keep them in the
message and add the additional fields if possible.

Now, what about the oom_score value changing, which you mentioned?
What if you toss David's OOM kill algorithm for a new algorithm?
That could happen. What happens to the message, and how do we tell that
things have changed?

A different oom_score requires a different oom adjustment variable.
I hope we can agree on that; history supports it.

As you recall, when David's algorithm was brought in, the kernel OOM
team took good care of us. They added a new adjustment value,
oom_score_adj; as you'll recall, the previous oom adjustment variable was
oom_adj. To keep user-level code from breaking, the kernel OOM developers
provided a conversion, so that if your application set oom_adj = -17 the
Linux OOM code internally set oom_score_adj = -1000. The conversion
handled all the values. Eventually the deprecated oom_adj field was
removed, but it was around for several years.

It is true that you can change the OOM algorithm, but not overnight. If
that does happen, when you update the code in the kernel you can change
the oom_score_adj: header to oom_new_adj: or whatever you wise guys and
gals decide to call it. That will tell us definitively what the printed
oom_score means, because we know which version of the Linux kernel we're
running; you told us by the naming in this message. If small adjustments
occur, like the 3% reduction in oom_score that was present for a while
for tasks with root privilege (but didn't last), they will be included in
the oom_score, and since we'd also like the % of memory, that won't
confuse anything.

Further, you export oom_score through the /proc/<pid>/oom_score
interface. How the score is calculated could change, but it is
accessible. It's accessible for a reason: it's useful to know how the OOM
algorithm scores a task, and that can be used to help set appropriate oom
adjustment values. This is because what the oom_score means is in fact
well documented. It needs to be; otherwise the oom adjustment value
becomes impossible to use intelligently. Thanks to David Rientjes et al.
for making this so.

One of the really nice design points of David Rientjes's implementation
is that it is very straightforward to use and understand. So hopefully,
if there is a change in the future, it will be to something that is just
as easy to use and understand.

>
> I can see some point in printing oom_score_adj, though. Seeing biased -
> one way or the other - tasks being selected might confirm the setting is
> reasonable or otherwise (e.g. seeing tasks with negative scores will
> give an indication that they might be not biased enough). Then you can
> go and check the eligible tasks dump and see what happened. So this part
> makes some sense to me.

Agreed, the oom_score_adj is sorely needed and should be included.

In Summary:
----------------
I hope I have presented a reasonable enough argument for the proposed
additional parameters.

If you need more information I will oblige as quickly as I can.

Of course it is your call what you are willing to include.
Any of the suggested parameters would be useful, and we'll gladly take
whatever you can allow.

Again, thank you for your time and your consideration.

Best wishes,

-Edward Chron
Arista Networks

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 12, 2019, 11:42 a.m. UTC | #7
On Fri 09-08-19 15:15:18, Edward Chron wrote:
[...]
> So it is optimal if you only have to go and find the correct log and search
> or run your script(s) when you absolutely need to, not on every OOM event.

OK, understood.

> That is the whole point of triage and triage is easier when you have
> relevant information to decide which events require action and with what
> priority.
> 
> The OOM Killed message is the one message that we have go to
> the console and or is sent as SNMP alert to the Admin to let the
> Admin know that a server or switch has suffered a low memory OOM
> event.
> 
> Maybe a few examples would be helpful to show why the few extra
> bits of information would be helpful in such an environment.
> 
> For example if we see serverA and serverB are taking oom events
> with the fooWidget being killed, something along the lines of
> the following you will get message likes this:
> 
> Jul 21 20:07:48 serverA kernel: Out of memory: Killed process 2826
>  (fooWidget) total-vm:10493400kB, anon-rss:10492996kB, file-rss:128kB,
>  shmem-rss:0kB memory-usage:32.0% oom_score: 320 oom_score_adj:0
>  total-pages: 32791748kB
> 
> Jul 21 20:13:51 serverB kernel: Out of memory: Killed process 2911
>  (fooWidget) total-vm:11149196kB, anon-rss:11148508kB, file-rss:128kB,
>  shmem-rss:0kB memory-usage:34.0% oom_score: 340 oom_score_adj:0
>  total-pages: 32791748kB
> 
> It is often possible to recognize that fooWidget is using more memory than
> expected on those systems and you can act on that possibly without ever
> having to hunt down the log and run a script or otherwise analyze the
> log. The % of memory and memory size can often be helpful to understand
> if the numbers look reasonable or not. Maybe the application was updated
> on just the those systems which explains why we don't see issues on the
> other servers running that application, possible application memory leak.

This is all quite vague and requires a lot of guessing. Also, your
trained eye might easily be confused by constrained OOMs (e.g. due to
NUMA or memcg). So I am not really sold on the percentage idea, and
likewise the oom_score.

[...]

> You can also imagine that if for example systemd-udev gets OOM killed,
> well that should really grab your attention:
> 
> Jul 21 20:08:11 serverX kernel: Out of memory: Killed process 2911
>  (systemd-udevd) total-vm:83128kB, anon-rss:80520kB, file-rss:128kB,
>  shmem-rss:0kB memory-usage:0.1% oom_score: 1001 oom_score_adj:1000
>  total-pages: 8312512kB
> 
> Here we see an obvious issue: systemd-udevd is a critical system app
> and it should not have an oom_score_adj: 1000 that clearly has been changed
> it should be -1000.

I do agree here. As I've said in the previous email, oom_score_adj indeed
has some value, and this is a nice example of that. So I am completely
fine with a patch that adds this part to the changelog.

[...]
> > > The oom_score tells us how Linux calculated the score for the task,
> > > the oom_score_adj effects this so it is helpful to have that in
> > > conjunction with the oom_score.
> > > If the adjust is high it can tell us that the task was acting as a
> > > canary and so it's oom_score is high even though it's memory
> > > utilization can be modest or low.
> >
> > I am sorry but I still do not get it. How are you going to use that
> > information without seeing other eligible tasks. oom_score is just a
> > normalized memory usage + some heuristics potentially (we have given a
> > discount to root processes until just recently). So this value only
> > makes sense to the kernel oom killer implementation. Note that the
> > equation might change in the future (that has happen in the past several
> > times) so looking at the value in isolation might be quite misleading.
> 
> We've been through the change where oom_scores went from -17 to 16
> to -1000 to 1000. This was the change David Rientjes from Google made
> back around 2010.
> 
> This was not a problem for us then and if you change again in the future
> (though the current implementation seems quite reasonable) it shouldn't
> be an issue for us going forward or for anyone else that can use the
> additional information in the OOM Kill message we're proposing.

While I appreciate that you are flexible enough to cope with those
changes, there are other users who might be less so, and there is a
strong "no regressions" rule that might back us into a corner, so we try
hard not to export too much internal information, so that userspace does
not start depending on it.

[...]

> Now what about the oom_score value changing that you mentioned?
> What if you toss David's OOM Kill algorithm for a new algorithm?
> That could happen. What happens to the message and how do we
> tell things have changed?
> 
> A different oom_score requires a different oom adjustment variable.
> I hope we can agree on that and history supports this.

The idea is that we would have to try to fit the oom_score_adj semantics
into a new algorithm, and the -1000..1000 value range would hopefully be
good enough. That does not really dictate the internal calculation of the
badness, if such a theoretical algorithm would use one at all.

> As you recall when David's algorithm was brought in, the Kernel OOM
> team took good care of us. They added a new adjustment value:
> oom_score_adj. As you'll recall the previous oom adjustment variable
> was oom_adj. To keep user level code from breaking the Kernel OOM
> developers provided a conversion so that if your application set
> oom_adj = -17 the Linux OOM code internally set oom_score_adj = -1000.
> They had a conversion that handled all the values. Eventually the
> deprecated oom_adj field was removed, but it was around for several years.

Yes, the scaling just happened to work back then.

[...]

> Further, you export oom_score through the /proc/pid/oom_score
> interface. How the score is calculated could change but it is
> accessible. It's accessible for a reason, it's useful to know how
> the OOM algorithm scores a task and that can be used to help
> set appropriate oom adjustment values. This because what the
> oom_score means is in fact well documented. It needs to.
> Otherwise, the oom adjustment value becomes impossible to
> use intelligently. Thanks to David Rientjes et al for making this so.

The point I am trying to push through is that the score (exported via
proc or displayed via dump_tasks) is valuable only insofar as you have a
meaningful comparison to make, i.e. comparing it to the scores of other
tasks. The value on its own cannot tell you much without a deep
understanding of how it is calculated, and I absolutely do not want
userspace to hardcode that algorithm and rely on it being stable. You
really do not need that internal knowledge when comparing the scores of
different tasks, so comparison is quite safe and robust against future
changes.

We have made the mistake of exporting way too much internal detail to
userspace in the past and got burnt.
 
> One of the really nice design points of David Rientjes implementation
> is that it is very straight forward to use and understand. So hopefully
> if there is a change in the future it's to something that is just as easy
> to use and to understand.
> 
> >
> > I can see some point in printing oom_score_adj, though. Seeing biased -
> > one way or the other - tasks being selected might confirm the setting is
> > reasonable or otherwise (e.g. seeing tasks with negative scores will
> > give an indication that they might be not biased enough). Then you can
> > go and check the eligible tasks dump and see what happened. So this part
> > makes some sense to me.
> 
> Agreed, the oom_score_adj is sorely needed and should be included.

I am willing to ack a patch that adds oom_score_adj, on the grounds that
this information is helpful to pinpoint misconfigurations and is not
generally available when dump_tasks is disabled.

> In Summary:
> ----------------
> I hope I have presented a reasonable enough argument for the proposed
> additional parameters.

I am not convinced about the oom_score and percentage parts, because the
score on its own is an implementation detail that makes sense when
comparing tasks but not in isolation, and the percentage might even be
confusing, as explained above.

Thanks for your detailed information!
Edward Chron Aug. 15, 2019, 6:24 a.m. UTC | #8
On Mon, Aug 12, 2019 at 4:42 AM Michal Hocko <mhocko@kernel.org> wrote:
>
> On Fri 09-08-19 15:15:18, Edward Chron wrote:
> [...]
> > So it is optimal if you only have to go and find the correct log and search
> > or run your script(s) when you absolutely need to, not on every OOM event.
>
> OK, understood.
>
> > That is the whole point of triage and triage is easier when you have
> > relevant information to decide which events require action and with what
> > priority.
> >
> > The OOM Killed message is the one message that we have go to
> > the console and or is sent as SNMP alert to the Admin to let the
> > Admin know that a server or switch has suffered a low memory OOM
> > event.
> >
> > Maybe a few examples would be helpful to show why the few extra
> > bits of information would be helpful in such an environment.
> >
> > For example if we see serverA and serverB are taking oom events
> > with the fooWidget being killed, something along the lines of
> > the following you will get message likes this:
> >
> > Jul 21 20:07:48 serverA kernel: Out of memory: Killed process 2826
> >  (fooWidget) total-vm:10493400kB, anon-rss:10492996kB, file-rss:128kB,
> >  shmem-rss:0kB memory-usage:32.0% oom_score: 320 oom_score_adj:0
> >  total-pages: 32791748kB
> >
> > Jul 21 20:13:51 serverB kernel: Out of memory: Killed process 2911
> >  (fooWidget) total-vm:11149196kB, anon-rss:11148508kB, file-rss:128kB,
> >  shmem-rss:0kB memory-usage:34.0% oom_score: 340 oom_score_adj:0
> >  total-pages: 32791748kB
> >
> > It is often possible to recognize that fooWidget is using more memory than
> > expected on those systems and you can act on that possibly without ever
> > having to hunt down the log and run a script or otherwise analyze the
> > log. The % of memory and memory size can often be helpful to understand
> > if the numbers look reasonable or not. Maybe the application was updated
> > on just the those systems which explains why we don't see issues on the
> > other servers running that application, possible application memory leak.
>
> This is all quite vague and requires a lot of guessing. Also your
> trained guess eye might easily get confused for constrained OOMs (e.g.
> due to NUMA or memcg). So I am not really sold to the percentage idea.
> And likewise the oom_score.
>
> [...]
>

Actually, totalpages is used by the oom control and is set to the
appropriate value: the memcg limit for a memcg OOM event, or
totalram_pages + total swap for a system-wide OOM event.

The percentage coupled with totalpages is how we know what we're looking
at in our environments. That seems to work fine, but maybe there are some
environments where that is not the case.

I must be missing something here, so I need to go back and study this.

> > You can also imagine that if for example systemd-udev gets OOM killed,
> > well that should really grab your attention:
> >
> > Jul 21 20:08:11 serverX kernel: Out of memory: Killed process 2911
> >  (systemd-udevd) total-vm:83128kB, anon-rss:80520kB, file-rss:128kB,
> >  shmem-rss:0kB memory-usage:0.1% oom_score: 1001 oom_score_adj:1000
> >  total-pages: 8312512kB
> >
> > Here we see an obvious issue: systemd-udevd is a critical system app
> > and it should not have an oom_score_adj: 1000 that clearly has been changed
> > it should be -1000.
>
> I do agree here. As I've said in the previous email oom_score_adj indeed
> has some value, and this is a nice example of that. So I am completely
> fine with a patch that adds this part to the changelog.
>
> [...]
> > > > The oom_score tells us how Linux calculated the score for the task,
> > > > the oom_score_adj effects this so it is helpful to have that in
> > > > conjunction with the oom_score.
> > > > If the adjust is high it can tell us that the task was acting as a
> > > > canary and so it's oom_score is high even though it's memory
> > > > utilization can be modest or low.
> > >
> > > I am sorry but I still do not get it. How are you going to use that
> > > information without seeing other eligible tasks. oom_score is just a
> > > normalized memory usage + some heuristics potentially (we have given a
> > > discount to root processes until just recently). So this value only
> > > makes sense to the kernel oom killer implementation. Note that the
> > > equation might change in the future (that has happen in the past several
> > > times) so looking at the value in isolation might be quite misleading.
> >
> > We've been through the change where oom_scores went from -17 to 16
> > to -1000 to 1000. This was the change David Rientjes from Google made
> > back around 2010.
> >
> > This was not a problem for us then and if you change again in the future
> > (though the current implementation seems quite reasonable) it shouldn't
> > be an issue for us going forward or for anyone else that can use the
> > additional information in the OOM Kill message we're proposing.
>
> While I appreciate that you are flexible enough to cope with those
> changes there are other users which might be less so and there is a
> strong "no regressions" rule which might get us into the corner so we
> are trying hard to not export to much of an internal information so that
> userspace doesn't start depending on them.
>
> [...]
>
> > Now what about the oom_score value changing that you mentioned?
> > What if you toss David's OOM Kill algorithm for a new algorithm?
> > That could happen. What happens to the message and how do we
> > tell things have changed?
> >
> > A different oom_score requires a different oom adjustment variable.
> > I hope we can agree on that and history supports this.
>
> The idea is that we would have to try to fit oom_score_adj semantic into
> a new algoritm and -1000..1000 value range would be hopefully good
> enough. That doesn't really dictate the internal calculation of the
> badness, if such a theretical alg. would use at all.
>
> > As you recall when David's algorithm was brought in, the Kernel OOM
> > team took good care of us. They added a new adjustment value:
> > oom_score_adj. As you'll recall the previous oom adjustment variable
> > was oom_adj. To keep user level code from breaking the Kernel OOM
> > developers provided a conversion so that if your application set
> > oom_adj = -17 the Linux OOM code internally set oom_score_adj = -1000.
> > They had a conversion that handled all the values. Eventually the
> > deprecated oom_adj field was removed, but it was around for several years.
>
> Yes, the scaling just happened to work back then.
>
> [...]
>
> > Further, you export oom_score through the /proc/pid/oom_score
> > interface. How the score is calculated could change but it is
> > accessible. It's accessible for a reason, it's useful to know how
> > the OOM algorithm scores a task and that can be used to help
> > set appropriate oom adjustment values. This because what the
> > oom_score means is in fact well documented. It needs to.
> > Otherwise, the oom adjustment value becomes impossible to
> > use intelligently. Thanks to David Rientjes et al for making this so.
>
> The point I am trying to push through is that the score (exported via
> proc or displayed via dump_tasks) is valuable only as far as you have a
> meaningful comparision to make - aka compare to scores of other tasks.
> The value on its own cannot tell you really much without a deep
> understanding of how it is calculated. And I absolutely do not want
> userspace to hardcode that alg. and rely on it being stable. You really
> do not need this internal knowledge when comparing scores of different
> tasks, though so it is quite safe and robust from future changes.
>
> We have made those mistakes when exporting way to much internal details
> to userspace in the past and got burnt.
>

Interesting. Knowing how the OOM code works and what oom_score means
allows us to set a meaningful oom_score_adj. When you provide an
interface that allows adjustment, it is helpful to know as much as you
can about the impact it will have so that you can set an appropriate
value; at least that is how I think about it. We reference the source
code as needed, but of course documentation is always appreciated, and,
as you point out, the code changes as needed.

> > One of the really nice design points of David Rientjes implementation
> > is that it is very straight forward to use and understand. So hopefully
> > if there is a change in the future it's to something that is just as easy
> > to use and to understand.
> >
> > >
> > > I can see some point in printing oom_score_adj, though. Seeing biased -
> > > one way or the other - tasks being selected might confirm the setting is
> > > reasonable or otherwise (e.g. seeing tasks with negative scores will
> > > give an indication that they might be not biased enough). Then you can
> > > go and check the eligible tasks dump and see what happened. So this part
> > > makes some sense to me.
> >
> > Agreed, the oom_score_adj is sorely needed and should be included.
>
> I am willing to ack a patch to add oom_score_adj on the grounds that
> this information is helpful to pinpoint misconfigurations and it is not
> generally available when dump_tasks is disabled.
>
> > In Summary:
> > ----------------
> > I hope I have presented a reasonable enough argument for the proposed
> > additional parameters.
>
> I am not convinced on oom_score and percentage part because score on its
> own is an implementation detail that makes sense when comparing tasks
> but on on its own and percentage might be even confusing as explained
> above.
>
> Thanks for your detailed information!

OK, Thank-you Michal.

I've coded up the small change to add oom_score_adj to the OOM
killed-process message and have sent it up for your review.

I will go back and study the constrained cases and try to figure out what
I'm missing there.
We're doing a lot of memcg processing on our systems now, so I want to
make sure I understand this.
The OOM code's memcg handling has improved in more recent releases.

Thanks again for your help!

Edward Chron
Arista Networks

> --
> Michal Hocko
> SUSE Labs
Michal Hocko Aug. 15, 2019, 8:18 a.m. UTC | #9
On Wed 14-08-19 23:24:51, Edward Chron wrote:
> On Mon, Aug 12, 2019 at 4:42 AM Michal Hocko <mhocko@kernel.org> wrote:
> >
> > On Fri 09-08-19 15:15:18, Edward Chron wrote:
> > [...]
> > > So it is optimal if you only have to go and find the correct log and search
> > > or run your script(s) when you absolutely need to, not on every OOM event.
> >
> > OK, understood.
> >
> > > That is the whole point of triage and triage is easier when you have
> > > relevant information to decide which events require action and with what
> > > priority.
> > >
> > > The OOM Killed message is the one message that we have go to
> > > the console and or is sent as SNMP alert to the Admin to let the
> > > Admin know that a server or switch has suffered a low memory OOM
> > > event.
> > >
> > > Maybe a few examples would be helpful to show why the few extra
> > > bits of information would be helpful in such an environment.
> > >
> > > For example if we see serverA and serverB are taking oom events
> > > with the fooWidget being killed, something along the lines of
> > > the following you will get message likes this:
> > >
> > > Jul 21 20:07:48 serverA kernel: Out of memory: Killed process 2826
> > >  (fooWidget) total-vm:10493400kB, anon-rss:10492996kB, file-rss:128kB,
> > >  shmem-rss:0kB memory-usage:32.0% oom_score: 320 oom_score_adj:0
> > >  total-pages: 32791748kB
> > >
> > > Jul 21 20:13:51 serverB kernel: Out of memory: Killed process 2911
> > >  (fooWidget) total-vm:11149196kB, anon-rss:11148508kB, file-rss:128kB,
> > >  shmem-rss:0kB memory-usage:34.0% oom_score: 340 oom_score_adj:0
> > >  total-pages: 32791748kB
> > >
> > > It is often possible to recognize that fooWidget is using more memory than
> > > expected on those systems and you can act on that possibly without ever
> > > having to hunt down the log and run a script or otherwise analyze the
> > > log. The % of memory and memory size can often be helpful to understand
> > > if the numbers look reasonable or not. Maybe the application was updated
> > > on just the those systems which explains why we don't see issues on the
> > > other servers running that application, possible application memory leak.
> >
> > This is all quite vague and requires a lot of guessing. Also your
> > trained guess eye might easily get confused for constrained OOMs (e.g.
> > due to NUMA or memcg). So I am not really sold to the percentage idea.
> > And likewise the oom_score.
> >
> > [...]
> >
> 
> Actually totalpages is used by oom control and is set to the appropriate
> value for a memcg OOM event or to totalram_pages + totalswap if it is
> a system wide OOM event.
> 
> The percentage coupled with the totalpages is how we know what we're
> looking at and for our environments. Seems to work fine, but maybe there are
> some environments where that is not the case.
> 
> I must be missing something here so I need to go back and study this.

total pages is the amount of memory (the limit, for a memcg) in the OOM
domain (e.g. a subset of NUMA nodes), while the oom victim might span
more NUMA nodes or have memory charged to a different memcg (e.g. when
the task has been moved between memcgs). That is why the percentage might
be misleading for anything but whole-system or static-memcg ooms, and
that is likely why it works in your case.

Patch

diff --git a/fs/proc/base.c b/fs/proc/base.c
index ebea9501afb8..41880990e6a8 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -550,7 +550,7 @@  static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long totalpages = totalram_pages() + total_swap_pages;
 	unsigned long points = 0;
 
-	points = oom_badness(task, totalpages) * 1000 / totalpages;
+	points = oom_badness(task, totalpages, NULL) * 1000 / totalpages;
 	seq_printf(m, "%lu\n", points);
 
 	return 0;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index c696c265f019..7f7ab125c21c 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -49,6 +49,8 @@  struct oom_control {
 	unsigned long totalpages;
 	struct task_struct *chosen;
 	unsigned long chosen_points;
+	unsigned long chosen_mempts;
+	unsigned long chosen_adj;
 
 	/* Used to print the constraint info. */
 	enum oom_constraint constraint;
@@ -105,10 +107,24 @@  static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
+/*
+ * Optional argument that can be passed to oom_badness in the arg field
+ *
+ * Input fields that can be filled in: memcg and nodemask
+ * Output fields that can be returned: mempts, adj
+ */
+struct oom_bad_parms {
+	struct mem_cgroup *memcg;
+	const nodemask_t *nodemask;
+	unsigned long mempts;
+	long adj;
+};
+
 bool __oom_reap_task_mm(struct mm_struct *mm);
 
 extern unsigned long oom_badness(struct task_struct *p,
-		unsigned long totalpages);
+				 unsigned long totalpages,
+				 struct oom_bad_parms *obp);
 
 extern bool out_of_memory(struct oom_control *oc);
 
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eda2e2a0bdc6..0548845dbef8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -42,6 +42,7 @@ 
 #include <linux/kthread.h>
 #include <linux/init.h>
 #include <linux/mmu_notifier.h>
+#include <linux/oom.h>
 
 #include <asm/tlb.h>
 #include "internal.h"
@@ -195,7 +196,8 @@  static bool is_dump_unreclaim_slabs(void)
  * predictable as possible.  The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
+unsigned long oom_badness(struct task_struct *p, unsigned long totalpages,
+			  struct oom_bad_parms *obp)
 {
 	long points;
 	long adj;
@@ -208,15 +210,16 @@  unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 		return 0;
 
 	/*
-	 * Do not even consider tasks which are explicitly marked oom
-	 * unkillable or have been already oom reaped or the are in
-	 * the middle of vfork
+	 * Do not consider tasks which have already been oom reaped or
+	 * that are in the middle of vfork.
 	 */
 	adj = (long)p->signal->oom_score_adj;
-	if (adj == OOM_SCORE_ADJ_MIN ||
-			test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
-			in_vfork(p)) {
+	if (test_bit(MMF_OOM_SKIP, &p->mm->flags) || in_vfork(p)) {
 		task_unlock(p);
+		if (obp != NULL) {
+			obp->mempts = 0;
+			obp->adj = adj;
+		}
 		return 0;
 	}
 
@@ -228,6 +231,16 @@  unsigned long oom_badness(struct task_struct *p, unsigned long totalpages)
 		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
 	task_unlock(p);
 
+	/* Also return raw mempts and oom_score_adj along */
+	if (obp != NULL) {
+		obp->mempts = points;
+		obp->adj = adj;
+	}
+
+	/* Unkillable oom task skipped but returns mempts and oom_score_adj */
+	if (adj == OOM_SCORE_ADJ_MIN)
+		return 0;
+
 	/* Normalize to oom_score_adj units */
 	adj *= totalpages / 1000;
 	points += adj;
@@ -310,6 +323,8 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
 {
 	struct oom_control *oc = arg;
 	unsigned long points;
+	struct oom_bad_parms obp = { .memcg = NULL, .nodemask = oc->nodemask,
+				     .mempts = 0, .adj = 0 };
 
 	if (oom_unkillable_task(task))
 		goto next;
@@ -339,7 +354,7 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto select;
 	}
 
-	points = oom_badness(task, oc->totalpages);
+	points = oom_badness(task, oc->totalpages, &obp);
 	if (!points || points < oc->chosen_points)
 		goto next;
 
@@ -349,6 +364,8 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
 	get_task_struct(task);
 	oc->chosen = task;
 	oc->chosen_points = points;
+	oc->chosen_mempts = obp.mempts;
+	oc->chosen_adj = obp.adj;
 next:
 	return 0;
 abort:
@@ -375,6 +392,9 @@  static void select_bad_process(struct oom_control *oc)
 				break;
 		rcu_read_unlock();
 	}
+
+	oc->chosen_points = oc->chosen_points * 1000 / oc->totalpages;
+	oc->chosen_mempts = oc->chosen_mempts * 1000 / oc->totalpages;
 }
 
 static int dump_task(struct task_struct *p, void *arg)
@@ -853,7 +873,8 @@  static bool task_will_free_mem(struct task_struct *task)
 	return ret;
 }
 
-static void __oom_kill_process(struct task_struct *victim, const char *message)
+static void __oom_kill_process(struct task_struct *victim, const char *message,
+				struct oom_control *oc)
 {
 	struct task_struct *p;
 	struct mm_struct *mm;
@@ -884,12 +905,24 @@  static void __oom_kill_process(struct task_struct *victim, const char *message)
 	 */
 	do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);
 	mark_oom_victim(victim);
-	pr_err("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
-		message, task_pid_nr(victim), victim->comm,
-		K(victim->mm->total_vm),
-		K(get_mm_counter(victim->mm, MM_ANONPAGES)),
-		K(get_mm_counter(victim->mm, MM_FILEPAGES)),
-		K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
+
+	if (oc != NULL && oc->chosen_mempts > 0)
+		pr_info("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB, memory-usage:%lu.%1lu%% oom_score:%lu oom_score_adj:%ld total-pages: %lukB",
+			message, task_pid_nr(victim), victim->comm,
+			K(victim->mm->total_vm),
+			K(get_mm_counter(victim->mm, MM_ANONPAGES)),
+			K(get_mm_counter(victim->mm, MM_FILEPAGES)),
+			K(get_mm_counter(victim->mm, MM_SHMEMPAGES)),
+			oc->chosen_mempts / 10, oc->chosen_mempts % 10,
+			oc->chosen_points, oc->chosen_adj, K(oc->totalpages));
+	else
+		pr_info("%s: Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB",
+			message, task_pid_nr(victim), victim->comm,
+			K(victim->mm->total_vm),
+			K(get_mm_counter(victim->mm, MM_ANONPAGES)),
+			K(get_mm_counter(victim->mm, MM_FILEPAGES)),
+			K(get_mm_counter(victim->mm, MM_SHMEMPAGES)));
+
 	task_unlock(victim);
 
 	/*
@@ -942,7 +975,7 @@  static int oom_kill_memcg_member(struct task_struct *task, void *message)
 	if (task->signal->oom_score_adj != OOM_SCORE_ADJ_MIN &&
 	    !is_global_init(task)) {
 		get_task_struct(task);
-		__oom_kill_process(task, message);
+		__oom_kill_process(task, message, NULL);
 	}
 	return 0;
 }
@@ -979,7 +1012,7 @@  static void oom_kill_process(struct oom_control *oc, const char *message)
 	 */
 	oom_group = mem_cgroup_get_oom_group(victim, oc->memcg);
 
-	__oom_kill_process(victim, message);
+	__oom_kill_process(victim, message, oc);
 
 	/*
 	 * If necessary, kill all tasks in the selected memory cgroup.