diff mbox series

[v2] mm/oom_kill: show oom eligibility when displaying the current memory state of all tasks

Message ID 20210612204634.1102472-1-atomlin@redhat.com (mailing list archive)
State New
Headers show
Series [v2] mm/oom_kill: show oom eligibility when displaying the current memory state of all tasks | expand

Commit Message

Aaron Tomlin June 12, 2021, 8:46 p.m. UTC
Changes since v2:
 - Use single character (e.g. 'R' for MMF_OOM_SKIP) as suggested
   by Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
 - Add new header to oom_dump_tasks documentation


At the present time, when showing potential OOM victims, we do not
exclude tasks which already have MMF_OOM_SKIP set; it is possible that
the last OOM killable victim was already OOM killed, yet the OOM
reaper failed to reclaim memory and set MMF_OOM_SKIP.
This can be confusing/or perhaps even misleading, to the reader of the
OOM report. Now, we already unconditionally display a task's
oom_score_adj_min value that can be set to OOM_SCORE_ADJ_MIN which is
indicative of an "unkillable" task i.e. is not eligible.

This patch provides a clear indication with regard to the OOM
eligibility of each displayed task.

Signed-off-by: Aaron Tomlin <atomlin@redhat.com>
---
 Documentation/admin-guide/sysctl/vm.rst |  5 ++--
 mm/oom_kill.c                           | 31 +++++++++++++++++++++----
 2 files changed, 30 insertions(+), 6 deletions(-)

Comments

David Rientjes June 13, 2021, 11:47 p.m. UTC | #1
On Sat, 12 Jun 2021, Aaron Tomlin wrote:

> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index eefd3f5fde46..094b7b61d66f 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -160,6 +160,27 @@ static inline bool is_sysrq_oom(struct oom_control *oc)
>  	return oc->order == -1;
>  }
>  
> +/**
> + * is_task_eligible_oom - determine if and why a task cannot be OOM killed
> + * @tsk: task to check
> + *
> + * Needs to be called with task_lock().
> + */
> +static const char * is_task_oom_eligible(struct task_struct *p)

You should be able to just return a char.

> +{
> +	long adj;
> +
> +	adj = (long)p->signal->oom_score_adj;
> +	if (adj == OOM_SCORE_ADJ_MIN)
> +		return "S";

The value is already printed in the task dump, this doesn't look to add 
any information.

> +	else if (test_bit(MMF_OOM_SKIP, &p->mm->flags)
> +		return "R";

We should be doing the task dump only when we're killing a victim (unless 
we're panicking), so something else has been chosen.  Since we would have 
oom killed a process with MMF_OOM_SKIP already, can we simply choose to 
not print a line for this process?

> +	else if (in_vfork(p))
> +		return "V";

Is this a transition state that we can simply disregard from the output as 
well unless/until it becomes eligible?

> +	else
> +		return "";
> +}
> +
>  /* return true if the task is not adequate as candidate victim task. */
>  static bool oom_unkillable_task(struct task_struct *p)
>  {
> @@ -401,12 +422,13 @@ static int dump_task(struct task_struct *p, void *arg)
>  		return 0;
>  	}
>  
> -	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %s\n",
> +	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %13s %s\n",

13 characters for one char output?

>  		task->pid, from_kuid(&init_user_ns, task_uid(task)),
>  		task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
>  		mm_pgtables_bytes(task->mm),
>  		get_mm_counter(task->mm, MM_SWAPENTS),
> -		task->signal->oom_score_adj, task->comm);
> +		task->signal->oom_score_adj, is_task_oom_eligible(task),
> +		task->comm);
>  	task_unlock(task);
>  
>  	return 0;
> @@ -420,12 +442,13 @@ static int dump_task(struct task_struct *p, void *arg)
>   * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
>   * are not shown.
>   * State information includes task's pid, uid, tgid, vm size, rss,
> - * pgtables_bytes, swapents, oom_score_adj value, and name.
> + * pgtables_bytes, swapents, oom_score_adj value, oom eligibility status
> + * and name.
>   */
>  static void dump_tasks(struct oom_control *oc)
>  {
>  	pr_info("Tasks state (memory values in pages):\n");
> -	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
> +	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj oom eligible? name\n");

Field names are single words.

>  
>  	if (is_memcg_oom(oc))
>  		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);
Michal Hocko June 14, 2021, 6:44 a.m. UTC | #2
On Sat 12-06-21 21:46:34, Aaron Tomlin wrote:
> Changes since v2:
>  - Use single character (e.g. 'R' for MMF_OOM_SKIP) as suggested
>    by Tetsuo Handa <penguin-kernel@i-love.sakura.ne.jp>
>  - Add new header to oom_dump_tasks documentation
> 
> 
> At the present time, when showing potential OOM victims, we do not
> exclude tasks which already have MMF_OOM_SKIP set; it is possible that
> the last OOM killable victim was already OOM killed, yet the OOM
> reaper failed to reclaim memory and set MMF_OOM_SKIP.
> This can be confusing/or perhaps even misleading, to the reader of the
> OOM report. Now, we already unconditionally display a task's
> oom_score_adj_min value that can be set to OOM_SCORE_ADJ_MIN which is
> indicative of an "unkillable" task i.e. is not eligible.

Well, I have to say that I have a bit hard time understand the problem
statement here. First of all you are very likely basing your observation
on an old kernel which is missing a fix which should make the situation
impossible IIRC. You should be focusing on a justification why the new
information is helpful for the current tree.

Historically, all tasks eligible for the oom killing have been
printed. That includes also tasks which are excluded later in the
selection. E.g. OOMS_SCORE_ADJ_MIN which can be tricky indeed. IIRC the
primary reason was to have a sufficient amount of information to
evaluate whether the system is configured properly (e.g.
OOMS_SCORE_ADJ_MIN is not used too extensively). More excluded
criterion have been added due to implementation details
(e.g.MMF_OOM_SKIP or mm shared with otherwise ineligible task.

You are correctly pointing out that those internal states are not
exposed but you should focus on explanation why that gap really stands
in the way for the current upstream. Who is going to consume that
information and for what purpose?

> This patch provides a clear indication with regard to the OOM
> eligibility of each displayed task.

This should provide an example of the output with a clarification of the
meaning.

[...]
Aaron Tomlin June 15, 2021, 11:51 a.m. UTC | #3
On Mon 2021-06-14 08:44 +0200, Michal Hocko wrote:
> Well, I have to say that I have a bit hard time understand the problem
> statement here. First of all you are very likely basing your observation
> on an old kernel which is missing a fix which should make the situation
> impossible IIRC. You should be focusing on a justification why the new
> information is helpful for the current tree.

Michal,

Not exactly.

See oom_reap_task(). Let's assume an OOM event occurred within the context
of a memcg and 'memory.oom.group' was not set. If I understand correctly,
once all attempts to OOM reap the specified task were "unsuccessful" then
MMF_OOM_SKIP is applied; and, the assumption is it will be terminated
shorty due to the pending fatal signal (see __oom_kill_process()) i.e. a
SIGKILL is sent to the "victim" before the OOM reaper is notified. Now
assuming the above task did not exited yet, another task, in the same
memcg, could also trigger an OOM event.  Therefore, when showing potential
OOM victims the task above with MMF_OOM_SKIP set, will indeed be displayed.

I understanding the point on OOM_SCORE_ADJ_MIN. This can be easily
identified and is clear to the viewer. However, the same cannot be stated
for MMF_OOM_SKIP.

So, if we prefer to display rather than exclude such tasks, in my opinion
having a flag/or marker of some description might prove useful to avoid any
misunderstanding.

> This should provide an example of the output with a clarification of the
> meaning.

Fair enough.




Kind regards,
Aaron Tomlin June 15, 2021, 12:02 p.m. UTC | #4
On Sun 2021-06-13 16:47 -0700, David Rientjes wrote:
> On Sat, 12 Jun 2021, Aaron Tomlin wrote:
> 
> > diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> > index eefd3f5fde46..094b7b61d66f 100644
> > --- a/mm/oom_kill.c
> > +++ b/mm/oom_kill.c
> > @@ -160,6 +160,27 @@ static inline bool is_sysrq_oom(struct oom_control *oc)
> >  	return oc->order == -1;
> >  }
> >  
> > +/**
> > + * is_task_eligible_oom - determine if and why a task cannot be OOM killed
> > + * @tsk: task to check
> > + *
> > + * Needs to be called with task_lock().
> > + */
> > +static const char * is_task_oom_eligible(struct task_struct *p)
> 
> You should be able to just return a char.

I see, sure.

> > +{
> > +	long adj;
> > +
> > +	adj = (long)p->signal->oom_score_adj;
> > +	if (adj == OOM_SCORE_ADJ_MIN)
> > +		return "S";
> 
> The value is already printed in the task dump, this doesn't look to add 
> any information.

I understand and perhaps it does not make sense; albeit, the reader might
not understand the meaning of OOM_SCORE_ADJ_MIN.

> > +	else if (test_bit(MMF_OOM_SKIP, &p->mm->flags)
> > +		return "R";
> 
> We should be doing the task dump only when we're killing a victim (unless 
> we're panicking), so something else has been chosen.  Since we would have 
> oom killed a process with MMF_OOM_SKIP already, can we simply choose to 
> not print a line for this process?

I'd prefer not to show such tasks, when displaying potential OOM victims
(including those in_vfork()) as in my opinion, it can be misleading to the
reader. That said, a case has been made to maintain their inclusion.
However, should in_vfork() even be shown in the actual report?

> 
> > @@ -401,12 +422,13 @@ static int dump_task(struct task_struct *p, void *arg)
> >  		return 0;
> >  	}
> >  
> > -	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %s\n",
> > +	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %13s %s\n",
> 
> 13 characters for one char output?

This was to maintain some alignment but fair enough.

> >  static void dump_tasks(struct oom_control *oc)
> >  {
> >  	pr_info("Tasks state (memory values in pages):\n");
> > -	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
> > +	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj oom eligible? name\n");
> 
> Field names are single words.

Understood.



Kind regards,
Michal Hocko June 15, 2021, 12:42 p.m. UTC | #5
On Tue 15-06-21 12:51:47, Aaron Tomlin wrote:
> On Mon 2021-06-14 08:44 +0200, Michal Hocko wrote:
> > Well, I have to say that I have a bit hard time understand the problem
> > statement here. First of all you are very likely basing your observation
> > on an old kernel which is missing a fix which should make the situation
> > impossible IIRC. You should be focusing on a justification why the new
> > information is helpful for the current tree.
> 
> Michal,
> 
> Not exactly.
> 
> See oom_reap_task(). Let's assume an OOM event occurred within the context
> of a memcg and 'memory.oom.group' was not set. If I understand correctly,
> once all attempts to OOM reap the specified task were "unsuccessful" then
> MMF_OOM_SKIP is applied; and, the assumption is it will be terminated
> shorty due to the pending fatal signal (see __oom_kill_process()) i.e. a
> SIGKILL is sent to the "victim" before the OOM reaper is notified. Now
> assuming the above task did not exited yet, another task, in the same
> memcg, could also trigger an OOM event.  Therefore, when showing potential
> OOM victims the task above with MMF_OOM_SKIP set, will indeed be displayed.
> 
> I understanding the point on OOM_SCORE_ADJ_MIN. This can be easily
> identified and is clear to the viewer. However, the same cannot be stated
> for MMF_OOM_SKIP.

This is all true but it is not really clear why that is really a
problem. Kernel log already contains information about reaped processes
as they are reported to the log. I fully acknowledge that this is rather
spartan but on the other hand from years of experience reading oom
reports I have to say the dump_tasks is the least interesting part of
the report (while being the most verbose one).

All that being said, I am not really opposing extending the information
although I am a bit worried about leaking too much internal state to the
log. What I am asking for here is a justification why this addition is a
general improvement and how it helps uderstanding oom reports further.
So please focus on that part.
Aaron Tomlin June 16, 2021, 8:18 p.m. UTC | #6
On Tue 2021-06-15 14:42 +0200, Michal Hocko wrote:
> This is all true but it is not really clear why that is really a
> problem. Kernel log already contains information about reaped processes
> as they are reported to the log. I fully acknowledge that this is rather
> spartan but on the other hand from years of experience reading oom
> reports I have to say the dump_tasks is the least interesting part of
> the report (while being the most verbose one).

I understand. I suppose, in a situation whereby dump_tasks() output is only
available, for whatever reason, it can provide at least some
insight into what tasks were actually considered not OOM eligible and why.

> All that being said, I am not really opposing extending the information
> although I am a bit worried about leaking too much internal state to the
> log.

Fair enough. That said, I still feel highlighting such "ineligible" tasks
could be useful to the viewer for troubleshooting purposes; we already
display OOM_SCORE_ADJ_MIN. Consider a situation when only a few tasks in
a memcg are displayed as possibly OOM eligible but one had MMF_OOM_SKIP
applied.

In my opinion, perhaps it is better to just exclude such details
altogether. That being said, as you know, we only provide this facility
when one is interested in it anyway i.e., if oom_dump_tasks is enabled.

> What I am asking for here is a justification why this addition is a
> general improvement and how it helps uderstanding oom reports further.
> So please focus on that part.

Sure; albeit, thinking about this more, it does not provide much
understanding in simple isolation.



Kind regards,
diff mbox series

Patch

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 586cd4b86428..123be642bc7e 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -658,8 +658,9 @@  oom_dump_tasks
 Enables a system-wide task dump (excluding kernel threads) to be produced
 when the kernel performs an OOM-killing and includes such information as
 pid, uid, tgid, vm size, rss, pgtables_bytes, swapents, oom_score_adj
-score, and name.  This is helpful to determine why the OOM killer was
-invoked, to identify the rogue task that caused it, and to determine why
+score, oom eligibility status and name.  This is helpful to determine why
+the OOM killer was invoked, to identify the rogue task that caused it, and
+to determine why
 the OOM killer chose the task it did to kill.
 
 If this is set to zero, this information is suppressed.  On very
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index eefd3f5fde46..094b7b61d66f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -160,6 +160,27 @@  static inline bool is_sysrq_oom(struct oom_control *oc)
 	return oc->order == -1;
 }
 
+/**
+ * is_task_eligible_oom - determine if and why a task cannot be OOM killed
+ * @tsk: task to check
+ *
+ * Needs to be called with task_lock().
+ */
+static const char * is_task_oom_eligible(struct task_struct *p)
+{
+	long adj;
+
+	adj = (long)p->signal->oom_score_adj;
+	if (adj == OOM_SCORE_ADJ_MIN)
+		return "S";
+	else if (test_bit(MMF_OOM_SKIP, &p->mm->flags)
+		return "R";
+	else if (in_vfork(p))
+		return "V";
+	else
+		return "";
+}
+
 /* return true if the task is not adequate as candidate victim task. */
 static bool oom_unkillable_task(struct task_struct *p)
 {
@@ -401,12 +422,13 @@  static int dump_task(struct task_struct *p, void *arg)
 		return 0;
 	}
 
-	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %s\n",
+	pr_info("[%7d] %5d %5d %8lu %8lu %8ld %8lu         %5hd %13s %s\n",
 		task->pid, from_kuid(&init_user_ns, task_uid(task)),
 		task->tgid, task->mm->total_vm, get_mm_rss(task->mm),
 		mm_pgtables_bytes(task->mm),
 		get_mm_counter(task->mm, MM_SWAPENTS),
-		task->signal->oom_score_adj, task->comm);
+		task->signal->oom_score_adj, is_task_oom_eligible(task),
+		task->comm);
 	task_unlock(task);
 
 	return 0;
@@ -420,12 +442,13 @@  static int dump_task(struct task_struct *p, void *arg)
  * memcg, not in the same cpuset, or bound to a disjoint set of mempolicy nodes
  * are not shown.
  * State information includes task's pid, uid, tgid, vm size, rss,
- * pgtables_bytes, swapents, oom_score_adj value, and name.
+ * pgtables_bytes, swapents, oom_score_adj value, oom eligibility status
+ * and name.
  */
 static void dump_tasks(struct oom_control *oc)
 {
 	pr_info("Tasks state (memory values in pages):\n");
-	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name\n");
+	pr_info("[  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj oom eligible? name\n");
 
 	if (is_memcg_oom(oc))
 		mem_cgroup_scan_tasks(oc->memcg, dump_task, oc);