
[v3,0/2] memcontrol: support cgroup level OOM protection

Message ID 20230506114948.6862-1-chengkaitao@didiglobal.com (mailing list archive)

Message

程垲涛 Chengkaitao Cheng May 6, 2023, 11:49 a.m. UTC
Establish a new OOM score algorithm that supports a cgroup-level OOM
protection mechanism. When a global/memcg OOM event occurs, we treat
all processes in the cgroup as a whole, and the OOM killer selects
the process to kill based on the protection quota of the cgroup.

Changelog:
v3:
  * Add "auto" option for memory.oom.protect. (patch 1)
  * Fix division errors. (patch 1)
  * Add observation indicator oom_kill_inherit. (patch 2)
v2:
  * Modify the formula for the memcg protection quota requested by a process.
  https://lore.kernel.org/linux-mm/20221208034644.3077-1-chengkaitao@didiglobal.com/
v1:
  https://lore.kernel.org/linux-mm/20221130070158.44221-1-chengkaitao@didiglobal.com/

chengkaitao (2):
  mm: memcontrol: protect the memory in cgroup from being oom killed
  memcg: add oom_kill_inherit event indicator

 Documentation/admin-guide/cgroup-v2.rst |  29 ++++-
 fs/proc/base.c                          |  17 ++-
 include/linux/memcontrol.h              |  46 +++++++-
 include/linux/oom.h                     |   3 +-
 include/linux/page_counter.h            |   6 +
 mm/memcontrol.c                         | 199 ++++++++++++++++++++++++++++++++
 mm/oom_kill.c                           |  25 ++--
 mm/page_counter.c                       |  30 +++++
 8 files changed, 334 insertions(+), 21 deletions(-)

Comments

Michal Hocko May 7, 2023, 10:11 a.m. UTC | #1
On Sat 06-05-23 19:49:46, chengkaitao wrote:
> Establish a new OOM score algorithm, supports the cgroup level OOM
> protection mechanism. When an global/memcg oom event occurs, we treat
> all processes in the cgroup as a whole, and OOM killers need to select
> the process to kill based on the protection quota of the cgroup

Although your patch 1 briefly touches on some advantages of this
interface, there is a lack of an actual usecase. Arguing that oom_score_adj
is hard because it needs a parent process is rather weak, to be honest.
It is just trivial to create a thin wrapper, use systemd to launch
important services or simply update the value after the fact. Now
oom_score_adj has its own downsides of course (most notably its coarse
granularity and a lack of group protection).

That being said, make sure you describe your usecase more thoroughly.
Please also make sure you describe the intended heuristic of the knob.
It is not really clear from the description how this fits the hierarchical
behavior of cgroups. I would be especially interested in the semantics
of non-leaf memcgs protection as they do not have any actual processes
to protect.

Also, there have been concerns mentioned in the v2 discussion and it would be
really appreciated if you could summarize how you have dealt with them.

Please also note that many people are going to be slow in responding
this week because of LSFMM conference
(https://events.linuxfoundation.org/lsfmm/)
程垲涛 Chengkaitao Cheng May 8, 2023, 9:08 a.m. UTC | #2
At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@suse.com> wrote:
>On Sat 06-05-23 19:49:46, chengkaitao wrote:
>> Establish a new OOM score algorithm, supports the cgroup level OOM
>> protection mechanism. When an global/memcg oom event occurs, we treat
>> all processes in the cgroup as a whole, and OOM killers need to select
>> the process to kill based on the protection quota of the cgroup
>
>Although your patch 1 briefly touches on some advantages of this
>interface there is a lack of actual usecase. Arguing that oom_score_adj
>is hard because it needs a parent process is rather weak to be honest.
>It is just trivial to create a thin wrapper, use systemd to launch
>important services or simply update the value after the fact. Now
>oom_score_adj has its own downsides of course (most notably a
>granularity and a lack of group protection.
>
>That being said, make sure you describe your usecase more thoroughly.
>Please also make sure you describe the intended heuristic of the knob.
>It is not really clear from the description how this fits hierarchical
>behavior of cgroups. I would be especially interested in the semantics
>of non-leaf memcgs protection as they do not have any actual processes
>to protect.
>
>Also there have been concerns mentioned in v2 discussion and it would be
>really appreciated to summarize how you have dealt with them.
>
>Please also note that many people are going to be slow in responding
>this week because of LSFMM conference
>(https://events.linuxfoundation.org/lsfmm/)

Here is a more detailed comparison and introduction of the old oom_score_adj
mechanism and the new oom_protect mechanism:
1. The regulating granularity of oom_protect is finer than that of oom_score_adj.
On a 512G physical machine, the minimum granularity adjustable via oom_score_adj
is about 512M, while the minimum granularity adjustable via oom_protect is one
page (4K); a short sketch of this arithmetic follows this list.
2. It may be simple to create a lightweight parent process and uniformly set the
oom_score_adj of some important processes, but it is not simple to make
multi-level settings for tens of thousands of processes on a physical machine
through such lightweight parent processes. We would need a huge table to record
the oom_score_adj values maintained by all the lightweight parent processes, and a
user process limited by its parent process has no ability to change its own
oom_score_adj, because it does not know the details of that huge table. The new
patch adopts the cgroup mechanism instead. It does not need any parent process to
manage oom_score_adj; the settings of each memcg are independent of each other,
making it easier to plan the OOM order of all processes. Due to the unique nature
of memory resources, current cloud service vendors do not oversell memory in their
capacity planning. I would like to use the new patch to try to make overselling of
memory resources possible.
3. I conducted a test and deployed an excessive number of containers on a physical
machine. With the oom_score_adj value of all processes in a container set to
a positive number through dockerinit, even processes that occupy very little memory
in the container are easily killed, resulting in a large number of invalid kill behaviors.
If dockerinit is unfortunately killed as well, it triggers container self-healing and the
container is rebuilt, resulting in even more severe memory oscillations. The new patch
abandons the behavior of adding an equal amount of oom_score_adj to each process
in the container and adopts a shared oom_protect quota for all processes in the
container. If a process in the container is killed, the remaining processes receive
more oom_protect quota, making it more difficult for them to be killed.
In my test case, the new patch reduced the number of invalid kill behaviors by 70%.
4. oom_score_adj is a global configuration that cannot express a kill order that only
affects a certain memcg-oom-killer. The oom_protect mechanism, however, inherits
downwards: a user can only change the kill order of its own memcg's OOM killer, while
the kill order of its parent memcg-oom-killer or the global-oom-killer is not affected.
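
As an illustration of point 1, here is a minimal userspace sketch of the
arithmetic (my own example, not code from the patch; it assumes oom_badness()
scales oom_score_adj by totalpages/1000, and the one-page figure for
oom_protect comes from the cover letter):

#include <stdio.h>

int main(void)
{
	unsigned long long total_bytes = 512ULL << 30;	/* 512G machine */
	unsigned long long page_size = 4096;		/* 4K pages */
	unsigned long long totalpages = total_bytes / page_size;

	/* one oom_score_adj step shifts the badness by ~totalpages/1000 pages */
	unsigned long long adj_step_bytes = totalpages / 1000 * page_size;

	/* prints ~524 MiB, i.e. the ~512M figure mentioned above */
	printf("oom_score_adj granularity: ~%llu MiB\n", adj_step_bytes >> 20);
	printf("oom.protect granularity:   %llu KiB (one page)\n", page_size >> 10);
	return 0;
}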

In the final discussion of patch v2, we noted that although the adjustment range
of oom_score_adj is [-1000,1000], essentially it only allows two usecases
(OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
clumsy at best. To solve this problem, in the new patch I introduced a new
indicator, oom_kill_inherit, which counts the number of times the local and child
cgroups have been selected by the OOM killer of an ancestor cgroup. By observing
the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
value of oom_protect to achieve the best.
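
To make that observation concrete, here is a hedged userspace sketch of how the
counters could be compared; the file locations below are assumptions for
illustration only (patch 2 just adds the oom_kill_inherit indicator itself):

#include <stdio.h>
#include <string.h>

/* read a named counter from a memory.events-style file; 0 if not found */
static unsigned long long read_counter(const char *path, const char *key)
{
	char line[256];
	unsigned long long val = 0;
	FILE *f = fopen(path, "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f)) {
		char name[64];
		unsigned long long v;

		if (sscanf(line, "%63s %llu", name, &v) == 2 &&
		    !strcmp(name, key)) {
			val = v;
			break;
		}
	}
	fclose(f);
	return val;
}

int main(void)
{
	/* hypothetical paths; adjust to wherever the indicator is exposed */
	unsigned long long child = read_counter(
		"/sys/fs/cgroup/parent/child/memory.events", "oom_kill_inherit");
	unsigned long long parent = read_counter(
		"/sys/fs/cgroup/parent/memory.events", "oom_kill_inherit");

	if (parent)
		printf("child share of parent oom kills: %.0f%%\n",
		       100.0 * child / parent);
	return 0;
}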

About the semantics of non-leaf memcg protection: if a non-leaf memcg's
oom_protect quota is set, its leaf memcgs will proportionally calculate their
new effective oom_protect quota based on the non-leaf memcg's quota.
Michal Hocko May 8, 2023, 2:18 p.m. UTC | #3
On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@suse.com> wrote:
> >On Sat 06-05-23 19:49:46, chengkaitao wrote:
> >> Establish a new OOM score algorithm, supports the cgroup level OOM
> >> protection mechanism. When an global/memcg oom event occurs, we treat
> >> all processes in the cgroup as a whole, and OOM killers need to select
> >> the process to kill based on the protection quota of the cgroup
> >
> >Although your patch 1 briefly touches on some advantages of this
> >interface there is a lack of actual usecase. Arguing that oom_score_adj
> >is hard because it needs a parent process is rather weak to be honest.
> >It is just trivial to create a thin wrapper, use systemd to launch
> >important services or simply update the value after the fact. Now
> >oom_score_adj has its own downsides of course (most notably a
> >granularity and a lack of group protection.
> >
> >That being said, make sure you describe your usecase more thoroughly.
> >Please also make sure you describe the intended heuristic of the knob.
> >It is not really clear from the description how this fits hierarchical
> >behavior of cgroups. I would be especially interested in the semantics
> >of non-leaf memcgs protection as they do not have any actual processes
> >to protect.
> >
> >Also there have been concerns mentioned in v2 discussion and it would be
> >really appreciated to summarize how you have dealt with them.
> >
> >Please also note that many people are going to be slow in responding
> >this week because of LSFMM conference
> >(https://events.linuxfoundation.org/lsfmm/)
> 
> Here is a more detailed comparison and introduction of the old oom_score_adj
> mechanism and the new oom_protect mechanism,
> 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj.
> On a 512G physical machine, the minimum granularity adjusted by oom_score_adj
> is 512M, and the minimum granularity adjusted by oom_protect is one page (4K).
> 2. It may be simple to create a lightweight parent process and uniformly set the 
> oom_score_adj of some important processes, but it is not a simple matter to make 
> multi-level settings for tens of thousands of processes on the physical machine 
> through the lightweight parent processes. We may need a huge table to record the 
> value of oom_score_adj maintained by all lightweight parent processes, and the 
> user process limited by the parent process has no ability to change its own 
> oom_score_adj, because it does not know the details of the huge table. The new 
> patch adopts the cgroup mechanism. It does not need any parent process to manage 
> oom_score_adj. the settings between each memcg are independent of each other, 
> making it easier to plan the OOM order of all processes. Due to the unique nature 
> of memory resources, current Service cloud vendors are not oversold in memory 
> planning. I would like to use the new patch to try to achieve the possibility of 
> oversold memory resources.

OK, this is more specific about the usecase. Thanks! So essentially what
it boils down to is that you are handling many containers (memcgs from
our POV) and they have different priorities. You want to overcommit the
memory to the extent that global ooms are not an unexpected event. Once
that happens, the total memory consumption of a specific memcg is less
important than its "priority". You define that priority by the excess of
the memory usage above a user-defined threshold. Correct?

Your cover letter mentions that you then treat "all processes in the cgroup as a
whole". That to me reads as the oom.group oom killer policy. But a brief
look into the patch suggests you are still looking at specific tasks, and
this has been a concern in the previous version of the patch because
memcg accounting and per-process accounting are detached.

> 3. I conducted a test and deployed an excessive number of containers on a physical 
> machine, By setting the oom_score_adj value of all processes in the container to 
> a positive number through dockerinit, even processes that occupy very little memory 
> in the container are easily killed, resulting in a large number of invalid kill behaviors. 
> If dockerinit is also killed unfortunately, it will trigger container self-healing, and the 
> container will rebuild, resulting in more severe memory oscillations. The new patch 
> abandons the behavior of adding an equal amount of oom_score_adj to each process 
> in the container and adopts a shared oom_protect quota for all processes in the container. 
> If a process in the container is killed, the remaining other processes will receive more 
> oom_protect quota, making it more difficult for the remaining processes to be killed.
> In my test case, the new patch reduced the number of invalid kill behaviors by 70%.
> 4. oom_score_adj is a global configuration that cannot achieve a kill order that only 
> affects a certain memcg-oom-killer. However, the oom_protect mechanism inherits 
> downwards, and user can only change the kill order of its own memcg oom, but the 
> kill order of their parent memcg-oom-killer or global-oom-killer will not be affected

Yes, oom_score_adj has shortcomings.

> In the final discussion of patch v2, we discussed that although the adjustment range 
> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is 
> clumsy at best. In order to solve this problem in the new patch, I introduced a new 
> indicator oom_kill_inherit, which counts the number of times the local and child 
> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing 
> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the 
> value of oom_protect to achieve the best.

What does the best mean in this context?

> about the semantics of non-leaf memcgs protection,
> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally 
> calculate the new effective oom_protect quota based on non-leaf memcg's quota.

So the non-leaf memcg is never used as a target? What if the workload is
distributed over several sub-groups? Our current oom.group
implementation traverses the tree to find a common ancestor in the oom
domain with the oom.group.

All that being said, and with the usecase described more specifically, I
can see that memcg-based oom victim selection makes some sense. That
means that it is always a memcg that is selected and all tasks within it killed.
Memcg-based protection can be used to evaluate which memcg to choose, and
the overall scheme should still be manageable. It would indeed resemble
memory protection for the regular reclaim.

One thing that is still not really clear to me is how group vs.
non-group ooms could be handled gracefully. Right now we can handle that
because the oom selection is still process based, but with the protection
this will become more problematic, as explained previously. Essentially
we would need to enforce the oom selection to be memcg based for all
memcgs. Maybe a mount knob? What do you think?
程垲涛 Chengkaitao Cheng May 9, 2023, 6:50 a.m. UTC | #4
At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@suse.com> wrote:
>On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote:
>> At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@suse.com> wrote:
>> >On Sat 06-05-23 19:49:46, chengkaitao wrote:
>> >
>> >That being said, make sure you describe your usecase more thoroughly.
>> >Please also make sure you describe the intended heuristic of the knob.
>> >It is not really clear from the description how this fits hierarchical
>> >behavior of cgroups. I would be especially interested in the semantics
>> >of non-leaf memcgs protection as they do not have any actual processes
>> >to protect.
>> >
>> >Also there have been concerns mentioned in v2 discussion and it would be
>> >really appreciated to summarize how you have dealt with them.
>> >
>> >Please also note that many people are going to be slow in responding
>> >this week because of LSFMM conference
>> >(https://events.linuxfoundation.org/lsfmm/)
>> 
>> Here is a more detailed comparison and introduction of the old oom_score_adj
>> mechanism and the new oom_protect mechanism,
>> 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj.
>> On a 512G physical machine, the minimum granularity adjusted by oom_score_adj
>> is 512M, and the minimum granularity adjusted by oom_protect is one page (4K).
>> 2. It may be simple to create a lightweight parent process and uniformly set the 
>> oom_score_adj of some important processes, but it is not a simple matter to make 
>> multi-level settings for tens of thousands of processes on the physical machine 
>> through the lightweight parent processes. We may need a huge table to record the 
>> value of oom_score_adj maintained by all lightweight parent processes, and the 
>> user process limited by the parent process has no ability to change its own 
>> oom_score_adj, because it does not know the details of the huge table. The new 
>> patch adopts the cgroup mechanism. It does not need any parent process to manage 
>> oom_score_adj. the settings between each memcg are independent of each other, 
>> making it easier to plan the OOM order of all processes. Due to the unique nature 
>> of memory resources, current Service cloud vendors are not oversold in memory 
>> planning. I would like to use the new patch to try to achieve the possibility of 
>> oversold memory resources.
>
>OK, this is more specific about the usecase. Thanks! So essentially what
>it boils down to is that you are handling many containers (memcgs from
>our POV) and they have different priorities. You want to overcommit the
>memory to the extend that global ooms are not an unexpected event. Once
>that happens the total memory consumption of a specific memcg is less
>important than its "priority". You define that priority by the excess of
>the memory usage above a user defined threshold. Correct?

It's correct.

>Your cover letter mentions that then "all processes in the cgroup as a
>whole". That to me reads as oom.group oom killer policy. But a brief
>look into the patch suggests you are still looking at specific tasks and
>this has been a concern in the previous version of the patch because
>memcg accounting and per-process accounting are detached.

I think memcg accounting may be more reasonable, as its memory
statistics are more comprehensive; for example, they include active page
cache, which also increases the probability of an OOM kill. In the new patch,
all shared memory also consumes the oom_protect quota of the memcg, so
each process's share of the memcg's oom_protect quota decreases.

>> 3. I conducted a test and deployed an excessive number of containers on a physical 
>> machine, By setting the oom_score_adj value of all processes in the container to 
>> a positive number through dockerinit, even processes that occupy very little memory 
>> in the container are easily killed, resulting in a large number of invalid kill behaviors. 
>> If dockerinit is also killed unfortunately, it will trigger container self-healing, and the 
>> container will rebuild, resulting in more severe memory oscillations. The new patch 
>> abandons the behavior of adding an equal amount of oom_score_adj to each process 
>> in the container and adopts a shared oom_protect quota for all processes in the container. 
>> If a process in the container is killed, the remaining other processes will receive more 
>> oom_protect quota, making it more difficult for the remaining processes to be killed.
>> In my test case, the new patch reduced the number of invalid kill behaviors by 70%.
>> 4. oom_score_adj is a global configuration that cannot achieve a kill order that only 
>> affects a certain memcg-oom-killer. However, the oom_protect mechanism inherits 
>> downwards, and user can only change the kill order of its own memcg oom, but the 
>> kill order of their parent memcg-oom-killer or global-oom-killer will not be affected
>
>Yes oom_score_adj has shortcomings.
>
>> In the final discussion of patch v2, we discussed that although the adjustment range 
>> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
>> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is 
>> clumsy at best. In order to solve this problem in the new patch, I introduced a new 
>> indicator oom_kill_inherit, which counts the number of times the local and child 
>> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing 
>> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the 
>> value of oom_protect to achieve the best.
>
>What does the best mean in this context?

I have created a new indicator oom_kill_inherit that maintains a negative correlation 
with memory.oom.protect, so we have a ruler to measure the optimal value of 
memory.oom.protect.

>> about the semantics of non-leaf memcgs protection,
>> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally 
>> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
>
>So the non-leaf memcg is never used as a target? What if the workload is
>distributed over several sub-groups? Our current oom.group
>implementation traverses the tree to find a common ancestor in the oom
>domain with the oom.group.

If the oom_protect quota of the parent non-leaf memcg is less than the sum of
its sub-groups' oom_protect quotas, the oom_protect quota of each sub-group will
be proportionally reduced.
If the oom_protect quota of the parent non-leaf memcg is greater than the sum
of its sub-groups' oom_protect quotas, the oom_protect quota of each sub-group
will be proportionally increased.
The purpose of doing so is that users can set oom_protect quotas according to
their own needs, while the system management process can set an appropriate
oom_protect quota on the parent non-leaf memcg as the final cover, so that
the system management process can indirectly manage all user processes.
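
A minimal userspace sketch of this calculation (my reading of the semantics,
not the patch code; the helper name and the MiB numbers are made up for
illustration):

#include <stdio.h>

/* child's effective quota = own quota scaled by parent_quota / sum(children) */
static unsigned long long effective_protect(unsigned long long child,
					    unsigned long long parent,
					    unsigned long long children_sum)
{
	if (!children_sum)
		return 0;
	return child * parent / children_sum;
}

int main(void)
{
	unsigned long long c1 = 60, c2 = 90, sum = c1 + c2;

	/* parent grants less (100M) than the children request (150M) */
	printf("%lluM %lluM\n", effective_protect(c1, 100, sum),
	       effective_protect(c2, 100, sum));	/* 40M 60M */

	/* parent grants more (300M) than the children request (150M) */
	printf("%lluM %lluM\n", effective_protect(c1, 300, sum),
	       effective_protect(c2, 300, sum));	/* 120M 180M */
	return 0;
}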

>All that being said and with the usecase described more specifically. I
>can see that memcg based oom victim selection makes some sense. That
>menas that it is always a memcg selected and all tasks withing killed.
>Memcg based protection can be used to evaluate which memcg to choose and
>the overall scheme should be still manageable. It would indeed resemble
>memory protection for the regular reclaim.
>
>One thing that is still not really clear to me is to how group vs.
>non-group ooms could be handled gracefully. Right now we can handle that
>because the oom selection is still process based but with the protection
>this will become more problematic as explained previously. Essentially
>we would need to enforce the oom selection to be memcg based for all
>memcgs. Maybe a mount knob? What do you think?

There is a function in the patch to determine whether the oom_protect
mechanism is enabled. All memory.oom.protect nodes default to 0, so the function
<is_root_oom_protect> returns 0 by default. The oom_protect mechanism will
only take effect when "root_mem_cgroup->memory.children_oom_protect_usage"
is not 0, and only memcgs with the memory.oom.protect node set are affected.

+bool is_root_oom_protect(void)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	return !!atomic_long_read(&root_mem_cgroup->memory.children_oom_protect_usage);
+}
Please let me know if there are any problems with my understanding.
Michal Hocko May 22, 2023, 1:03 p.m. UTC | #5
[Sorry for a late reply but I was mostly offline last 2 weeks]

On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@suse.com> wrote:
[...]
> >Your cover letter mentions that then "all processes in the cgroup as a
> >whole". That to me reads as oom.group oom killer policy. But a brief
> >look into the patch suggests you are still looking at specific tasks and
> >this has been a concern in the previous version of the patch because
> >memcg accounting and per-process accounting are detached.
> 
> I think the memcg accounting may be more reasonable, as its memory 
> statistics are more comprehensive, similar to active page cache, which 
> also increases the probability of OOM-kill. In the new patch, all the 
> shared memory will also consume the oom_protect quota of the memcg, 
> and the process's oom_protect quota of the memcg will decrease.

I am sorry but I do not follow. Could you elaborate please? Are you
arguing for per memcg or per process metrics?

[...]

> >> In the final discussion of patch v2, we discussed that although the adjustment range 
> >> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
> >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is 
> >> clumsy at best. In order to solve this problem in the new patch, I introduced a new 
> >> indicator oom_kill_inherit, which counts the number of times the local and child 
> >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing 
> >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the 
> >> value of oom_protect to achieve the best.
> >
> >What does the best mean in this context?
> 
> I have created a new indicator oom_kill_inherit that maintains a negative correlation 
> with memory.oom.protect, so we have a ruler to measure the optimal value of 
> memory.oom.protect.

An example might help here.

> >> about the semantics of non-leaf memcgs protection,
> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally 
> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
> >
> >So the non-leaf memcg is never used as a target? What if the workload is
> >distributed over several sub-groups? Our current oom.group
> >implementation traverses the tree to find a common ancestor in the oom
> >domain with the oom.group.
> 
> If the oom_protect quota of the parent non-leaf memcg is less than the sum of 
> sub-groups oom_protect quota, the oom_protect quota of each sub-group will 
> be proportionally reduced
> If the oom_protect quota of the parent non-leaf memcg is greater than the sum 
> of sub-groups oom_protect quota, the oom_protect quota of each sub-group 
> will be proportionally increased
> The purpose of doing so is that users can set oom_protect quota according to 
> their own needs, and the system management process can set appropriate 
> oom_protect quota on the parent non-leaf memcg as the final cover, so that 
> the system management process can indirectly manage all user processes.

I guess that you are trying to say that the oom protection has
standard hierarchical behavior. And that is fine; in fact it is
mandatory for any control knob to have sane hierarchical properties.
But that doesn't address my above question. Let me try again. When is a
non-leaf memcg potentially selected as the oom victim? It doesn't have
any tasks directly, but it might be a suitable target to kill a
multi-memcg based workload (e.g. a full container).

> >All that being said and with the usecase described more specifically. I
> >can see that memcg based oom victim selection makes some sense. That
> >menas that it is always a memcg selected and all tasks withing killed.
> >Memcg based protection can be used to evaluate which memcg to choose and
> >the overall scheme should be still manageable. It would indeed resemble
> >memory protection for the regular reclaim.
> >
> >One thing that is still not really clear to me is to how group vs.
> >non-group ooms could be handled gracefully. Right now we can handle that
> >because the oom selection is still process based but with the protection
> >this will become more problematic as explained previously. Essentially
> >we would need to enforce the oom selection to be memcg based for all
> >memcgs. Maybe a mount knob? What do you think?
> 
> There is a function in the patch to determine whether the oom_protect 
> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function 
> <is_root_oom_protect> returns 0 by default.

How can an admin determine what is the current oom detection logic?
程垲涛 Chengkaitao Cheng May 25, 2023, 7:35 a.m. UTC | #6
At 2023-05-22 21:03:50, "Michal Hocko" <mhocko@suse.com> wrote:
>[Sorry for a late reply but I was mostly offline last 2 weeks]
>
>On Tue 09-05-23 06:50:59, 程垲涛 Chengkaitao Cheng wrote:
>> At 2023-05-08 22:18:18, "Michal Hocko" <mhocko@suse.com> wrote:
>[...]
>> >Your cover letter mentions that then "all processes in the cgroup as a
>> >whole". That to me reads as oom.group oom killer policy. But a brief
>> >look into the patch suggests you are still looking at specific tasks and
>> >this has been a concern in the previous version of the patch because
>> >memcg accounting and per-process accounting are detached.
>> 
>> I think the memcg accounting may be more reasonable, as its memory 
>> statistics are more comprehensive, similar to active page cache, which 
>> also increases the probability of OOM-kill. In the new patch, all the 
>> shared memory will also consume the oom_protect quota of the memcg, 
>> and the process's oom_protect quota of the memcg will decrease.
>
>I am sorry but I do not follow. Could you elaborate please? Are you
>arguing for per memcg or per process metrics?

You mentioned earlier that 'memcg accounting and per-process accounting
are detached', and I may have misunderstood your question. What I wanted to
express above is that memcg accounting is more comprehensive than per-process
accounting, so using memcg accounting in the OOM-killer mechanism would
be more reasonable.

>[...]
>
>> >> In the final discussion of patch v2, we discussed that although the adjustment range 
>> >> of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
>> >> (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is 
>> >> clumsy at best. In order to solve this problem in the new patch, I introduced a new 
>> >> indicator oom_kill_inherit, which counts the number of times the local and child 
>> >> cgroups have been selected by the OOM killer of the ancestor cgroup. By observing 
>> >> the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the 
>> >> value of oom_protect to achieve the best.
>> >
>> >What does the best mean in this context?
>> 
>> I have created a new indicator oom_kill_inherit that maintains a negative correlation 
>> with memory.oom.protect, so we have a ruler to measure the optimal value of 
>> memory.oom.protect.
>
>An example might help here.

In my testing, by adjusting memory.oom.protect, I was able to significantly
reduce the oom_kill_inherit count of the corresponding cgroup. On a physical
machine with severely oversold memory, I divided all cgroups into three
categories and controlled their probability of being selected by the oom-killer
to 0%, 20%, and 80%, respectively.

>> >> about the semantics of non-leaf memcgs protection,
>> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally 
>> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
>> >
>> >So the non-leaf memcg is never used as a target? What if the workload is
>> >distributed over several sub-groups? Our current oom.group
>> >implementation traverses the tree to find a common ancestor in the oom
>> >domain with the oom.group.
>> 
>> If the oom_protect quota of the parent non-leaf memcg is less than the sum of 
>> sub-groups oom_protect quota, the oom_protect quota of each sub-group will 
>> be proportionally reduced
>> If the oom_protect quota of the parent non-leaf memcg is greater than the sum 
>> of sub-groups oom_protect quota, the oom_protect quota of each sub-group 
>> will be proportionally increased
>> The purpose of doing so is that users can set oom_protect quota according to 
>> their own needs, and the system management process can set appropriate 
>> oom_protect quota on the parent non-leaf memcg as the final cover, so that 
>> the system management process can indirectly manage all user processes.
>
>I guess that you are trying to say that the oom protection has a
>standard hierarchical behavior. And that is fine, well, in fact it is
>mandatory for any control knob to have a sane hierarchical properties.
>But that doesn't address my above question. Let me try again. When is a
>non-leaf memcg potentially selected as the oom victim? It doesn't have
>any tasks directly but it might be a suitable target to kill a multi
>memcg based workload (e.g. a full container).

If a non-leaf memcg has higher memory usage and a smaller
memory.oom.protect, it has a higher probability of being
selected by the killer. If a non-leaf memcg is selected as the oom
victim, the OOM killer will continue to select the appropriate child
memcg downwards until a leaf memcg is selected.
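
A toy userspace model of that descent (an assumption about the traversal for
illustration, not the patch's selection code; it uses "usage minus effective
protection" as the comparison key, with made-up names and numbers):

#include <stdio.h>

struct memcg {
	const char *name;
	long long usage;	/* MiB */
	long long eprotect;	/* effective oom.protect, MiB */
	struct memcg *children;
	int nr_children;
};

static struct memcg *pick_victim(struct memcg *mg)
{
	while (mg->nr_children) {
		struct memcg *worst = &mg->children[0];
		int i;

		for (i = 1; i < mg->nr_children; i++) {
			struct memcg *c = &mg->children[i];

			if (c->usage - c->eprotect >
			    worst->usage - worst->eprotect)
				worst = c;
		}
		mg = worst;	/* descend until a leaf memcg is selected */
	}
	return mg;
}

int main(void)
{
	struct memcg leaves[] = {
		{ "A", 50, 10 }, { "B", 30, 40 }, { "C", 40, 10 },
	};
	struct memcg container_A = { "container_A", 120, 100, leaves, 3 };

	printf("victim: %s\n", pick_victim(&container_A)->name); /* prints "A" */
	return 0;
}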

>> >All that being said and with the usecase described more specifically. I
>> >can see that memcg based oom victim selection makes some sense. That
>> >menas that it is always a memcg selected and all tasks withing killed.
>> >Memcg based protection can be used to evaluate which memcg to choose and
>> >the overall scheme should be still manageable. It would indeed resemble
>> >memory protection for the regular reclaim.
>> >
>> >One thing that is still not really clear to me is to how group vs.
>> >non-group ooms could be handled gracefully. Right now we can handle that
>> >because the oom selection is still process based but with the protection
>> >this will become more problematic as explained previously. Essentially
>> >we would need to enforce the oom selection to be memcg based for all
>> >memcgs. Maybe a mount knob? What do you think?
>> 
>> There is a function in the patch to determine whether the oom_protect 
>> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function 
>> <is_root_oom_protect> returns 0 by default.
>
>How can an admin determine what is the current oom detection logic?

The memory.oom.protect values are set by the administrators themselves, so they
must know what the current OOM policy is. Reading the memory.oom.protect
of a first-level cgroup directory and observing whether it is 0 can also
determine whether the oom.protect policy is enabled.
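
A hedged sketch of that check from userspace (the cgroup path below is a
hypothetical example, and since v3 the file may also hold "auto", so anything
other than a plain 0 is treated as configured here):

#include <stdio.h>
#include <string.h>

int main(void)
{
	const char *path = "/sys/fs/cgroup/mycg/memory.oom.protect";
	char buf[64];
	FILE *f = fopen(path, "r");

	if (!f || !fgets(buf, sizeof(buf), f)) {
		perror(path);
		if (f)
			fclose(f);
		return 1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';

	if (!strcmp(buf, "0"))
		printf("%s: 0 -> oom.protect not configured here\n", path);
	else
		printf("%s: %s -> oom.protect configured\n", path, buf);
	return 0;
}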

For a given process, the physical machine administrator, k8s administrator,
agent administrator, and container administrator each see a different effective
memory.oom.protect, so they only need to pay attention to the
memory.oom.protect of their local cgroup directory. As for an administrator
wanting to know the OOM detection logic of all other administrators, I don't
think there is such a business requirement.
Michal Hocko May 29, 2023, 2:02 p.m. UTC | #7
On Thu 25-05-23 07:35:41, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-22 21:03:50, "Michal Hocko" <mhocko@suse.com> wrote:
[...]
> >> I have created a new indicator oom_kill_inherit that maintains a negative correlation 
> >> with memory.oom.protect, so we have a ruler to measure the optimal value of 
> >> memory.oom.protect.
> >
> >An example might help here.
> 
> In my testing case, by adjusting memory.oom.protect, I was able to significantly 
> reduce the oom_kill_inherit of the corresponding cgroup. In a physical machine 
> with severely oversold memory, I divided all cgroups into three categories and 
> controlled their probability of being selected by the oom-killer to 0%,% 20, 
> and 80%, respectively.

I might be just dense but I am lost. Can we focus on the barebone
semantics of the group oom selection and killing first? No magic
auto-tuning at this stage, please.

> >> >> about the semantics of non-leaf memcgs protection,
> >> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally 
> >> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
> >> >
> >> >So the non-leaf memcg is never used as a target? What if the workload is
> >> >distributed over several sub-groups? Our current oom.group
> >> >implementation traverses the tree to find a common ancestor in the oom
> >> >domain with the oom.group.
> >> 
> >> If the oom_protect quota of the parent non-leaf memcg is less than the sum of 
> >> sub-groups oom_protect quota, the oom_protect quota of each sub-group will 
> >> be proportionally reduced
> >> If the oom_protect quota of the parent non-leaf memcg is greater than the sum 
> >> of sub-groups oom_protect quota, the oom_protect quota of each sub-group 
> >> will be proportionally increased
> >> The purpose of doing so is that users can set oom_protect quota according to 
> >> their own needs, and the system management process can set appropriate 
> >> oom_protect quota on the parent non-leaf memcg as the final cover, so that 
> >> the system management process can indirectly manage all user processes.
> >
> >I guess that you are trying to say that the oom protection has a
> >standard hierarchical behavior. And that is fine, well, in fact it is
> >mandatory for any control knob to have a sane hierarchical properties.
> >But that doesn't address my above question. Let me try again. When is a
> >non-leaf memcg potentially selected as the oom victim? It doesn't have
> >any tasks directly but it might be a suitable target to kill a multi
> >memcg based workload (e.g. a full container).
> 
> If nonleaf memcg have the higher memory usage and the smaller 
> memory.oom.protect, it will have the higher the probability being 
> selected by the killer. If the non-leaf memcg is selected as the oom 
> victim, OOM-killer will continue to select the appropriate child 
> memcg downwards until the leaf memcg is selected.

A parent memcg has memory charged greater than or equal to its child(ren) by
definition. Let me try to ask differently. Say you have the following
hierarchy

		  root
		/     \
       container_A     container_B
     (oom.prot=100M)   (oom.prot=200M)
     (usage=120M)      (usage=180M)
     /     |     \
    A      B      C
                 / \
		C1  C2


container_B is protected so it should be excluded. Correct? So we are at
container_A to choose from. There are multiple outcomes the system and
container admin might want to achieve:
1) system admin might want to shut down the whole container.
2) container admin might want to shut the whole container down.
3) cont. admin might want to shut down a whole sub-group (e.g. C, as it
   is a self-contained workload and killing a portion of it would put it into
   an inconsistent state).
4) cont. admin might want to kill the most excess cgroup with tasks (i.e. a
   leaf memcg).
5) admin might want to kill a process in the most excess memcg.

Now we already have the oom.group thingy that can drive the group killing
policy, but it is not really clear how you want to incorporate that into
the protection.

Again, I think that an oom.protection makes sense, but the semantics have
to be very carefully thought through because it is quite easy to create
corner cases and weird behavior. I also think that oom.group has to be
consistent with the protection.

> >> >All that being said and with the usecase described more specifically. I
> >> >can see that memcg based oom victim selection makes some sense. That
> >> >menas that it is always a memcg selected and all tasks withing killed.
> >> >Memcg based protection can be used to evaluate which memcg to choose and
> >> >the overall scheme should be still manageable. It would indeed resemble
> >> >memory protection for the regular reclaim.
> >> >
> >> >One thing that is still not really clear to me is to how group vs.
> >> >non-group ooms could be handled gracefully. Right now we can handle that
> >> >because the oom selection is still process based but with the protection
> >> >this will become more problematic as explained previously. Essentially
> >> >we would need to enforce the oom selection to be memcg based for all
> >> >memcgs. Maybe a mount knob? What do you think?
> >> 
> >> There is a function in the patch to determine whether the oom_protect 
> >> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function 
> >> <is_root_oom_protect> returns 0 by default.
> >
> >How can an admin determine what is the current oom detection logic?
> 
> The memory.oom.protect are set by the administrator themselves, and they 
> must know what the current OOM policy is. Reading the memory.oom.protect 
> of the first level cgroup directory and observing whether it is 0 can also 
> determine whether the oom.protect policy is enabled.

How do you achieve that from within a container which doesn't have
full visibility into the cgroup tree?
程垲涛 Chengkaitao Cheng June 4, 2023, 8:05 a.m. UTC | #8
At 2023-05-29 22:02:47, "Michal Hocko" <mhocko@suse.com> wrote:
>On Thu 25-05-23 07:35:41, 程垲涛 Chengkaitao Cheng wrote:
>> At 2023-05-22 21:03:50, "Michal Hocko" <mhocko@suse.com> wrote:
>[...]
>> >> I have created a new indicator oom_kill_inherit that maintains a negative correlation 
>> >> with memory.oom.protect, so we have a ruler to measure the optimal value of 
>> >> memory.oom.protect.
>> >
>> >An example might help here.
>> 
>> In my testing case, by adjusting memory.oom.protect, I was able to significantly 
>> reduce the oom_kill_inherit of the corresponding cgroup. In a physical machine 
>> with severely oversold memory, I divided all cgroups into three categories and 
>> controlled their probability of being selected by the oom-killer to 0%,% 20, 
>> and 80%, respectively.
>
>I might be just dense but I am lost. Can we focus on the barebone
>semantic of the group oom selection and killing first. No magic
>auto-tuning at this stage please.
>
>> >> >> about the semantics of non-leaf memcgs protection,
>> >> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally 
>> >> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
>> >> >
>> >> >So the non-leaf memcg is never used as a target? What if the workload is
>> >> >distributed over several sub-groups? Our current oom.group
>> >> >implementation traverses the tree to find a common ancestor in the oom
>> >> >domain with the oom.group.
>> >> 
>> >> If the oom_protect quota of the parent non-leaf memcg is less than the sum of 
>> >> sub-groups oom_protect quota, the oom_protect quota of each sub-group will 
>> >> be proportionally reduced
>> >> If the oom_protect quota of the parent non-leaf memcg is greater than the sum 
>> >> of sub-groups oom_protect quota, the oom_protect quota of each sub-group 
>> >> will be proportionally increased
>> >> The purpose of doing so is that users can set oom_protect quota according to 
>> >> their own needs, and the system management process can set appropriate 
>> >> oom_protect quota on the parent non-leaf memcg as the final cover, so that 
>> >> the system management process can indirectly manage all user processes.
>> >
>> >I guess that you are trying to say that the oom protection has a
>> >standard hierarchical behavior. And that is fine, well, in fact it is
>> >mandatory for any control knob to have a sane hierarchical properties.
>> >But that doesn't address my above question. Let me try again. When is a
>> >non-leaf memcg potentially selected as the oom victim? It doesn't have
>> >any tasks directly but it might be a suitable target to kill a multi
>> >memcg based workload (e.g. a full container).
>> 
>> If nonleaf memcg have the higher memory usage and the smaller 
>> memory.oom.protect, it will have the higher the probability being 
>> selected by the killer. If the non-leaf memcg is selected as the oom 
>> victim, OOM-killer will continue to select the appropriate child 
>> memcg downwards until the leaf memcg is selected.
>
>Parent memcg has more or equal memory charged than its child(ren) by
>definition. Let me try to ask differently. Say you have the following
>hierarchy
>
>		  root
>		/     \
>       container_A     container_B
>     (oom.prot=100M)   (oom.prot=200M)
>     (usage=120M)      (usage=180M)
>     /     |     \
>    A      B      C
>                 / \
>		C1  C2
>
>
>container_B is protected so it should be excluded. Correct? So we are at
>container_A to chose from. There are multiple ways the system and
>continer admin might want to achieve.
>1) system admin might want to shut down the whole container.
>2) continer admin might want to shut the whole container down
>3) cont. admin might want to shut down a whole sub group (e.g. C as it
>   is a self contained workload and killing portion of it will put it into
>   inconsistent state).
>4) cont. admin might want to kill the most excess cgroup with tasks (i.e. a
>   leaf memcg).
>5) admin might want to kill a process in the most excess memcg.
>
>Now we already have oom.group thingy that can drive the group killing
>policy but it is not really clear how you want to incorporate that to
>the protection.
>
>Again, I think that an oom.protection makes sense but the semantic has
>to be very carefully thought through because it is quite easy to create
>corner cases and weird behavior. I also think that oom.group has to be
>consistent with the protection.

The barebone semantics of the function implemented by my patch are
summarized as follows:
a memcg only allows processes in the memcg to be selected by an
ancestor's OOM killer when its memory usage exceeds "oom.protect".

It should be noted that "oom.protect" and "oom.group" are completely 
different things, and kneading them together may make the explanation 
more confusing.

>> >> >All that being said and with the usecase described more specifically. I
>> >> >can see that memcg based oom victim selection makes some sense. That
>> >> >menas that it is always a memcg selected and all tasks withing killed.
>> >> >Memcg based protection can be used to evaluate which memcg to choose and
>> >> >the overall scheme should be still manageable. It would indeed resemble
>> >> >memory protection for the regular reclaim.
>> >> >
>> >> >One thing that is still not really clear to me is to how group vs.
>> >> >non-group ooms could be handled gracefully. Right now we can handle that
>> >> >because the oom selection is still process based but with the protection
>> >> >this will become more problematic as explained previously. Essentially
>> >> >we would need to enforce the oom selection to be memcg based for all
>> >> >memcgs. Maybe a mount knob? What do you think?
>> >> 
>> >> There is a function in the patch to determine whether the oom_protect 
>> >> mechanism is enabled. All memory.oom.protect nodes default to 0, so the function 
>> >> <is_root_oom_protect> returns 0 by default.
>> >
>> >How can an admin determine what is the current oom detection logic?
>> 
>> The memory.oom.protect are set by the administrator themselves, and they 
>> must know what the current OOM policy is. Reading the memory.oom.protect 
>> of the first level cgroup directory and observing whether it is 0 can also 
>> determine whether the oom.protect policy is enabled.
>
>How do you achieve that from withing a container which doesn't have a
>full visibility to the cgroup tree?

The container does not need full visibility into the cgroup tree; the
"oom.protect" feature only requires it to focus on its local cgroup and subgroup trees.

--
Thanks for your comment!
chengkaitao
Yosry Ahmed June 4, 2023, 8:25 a.m. UTC | #9
On Mon, May 8, 2023 at 7:18 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Mon 08-05-23 09:08:25, 程垲涛 Chengkaitao Cheng wrote:
> > At 2023-05-07 18:11:58, "Michal Hocko" <mhocko@suse.com> wrote:
> > >On Sat 06-05-23 19:49:46, chengkaitao wrote:
> > >> Establish a new OOM score algorithm, supports the cgroup level OOM
> > >> protection mechanism. When an global/memcg oom event occurs, we treat
> > >> all processes in the cgroup as a whole, and OOM killers need to select
> > >> the process to kill based on the protection quota of the cgroup
> > >
> > >Although your patch 1 briefly touches on some advantages of this
> > >interface there is a lack of actual usecase. Arguing that oom_score_adj
> > >is hard because it needs a parent process is rather weak to be honest.
> > >It is just trivial to create a thin wrapper, use systemd to launch
> > >important services or simply update the value after the fact. Now
> > >oom_score_adj has its own downsides of course (most notably a
> > >granularity and a lack of group protection.
> > >
> > >That being said, make sure you describe your usecase more thoroughly.
> > >Please also make sure you describe the intended heuristic of the knob.
> > >It is not really clear from the description how this fits hierarchical
> > >behavior of cgroups. I would be especially interested in the semantics
> > >of non-leaf memcgs protection as they do not have any actual processes
> > >to protect.
> > >
> > >Also there have been concerns mentioned in v2 discussion and it would be
> > >really appreciated to summarize how you have dealt with them.
> > >
> > >Please also note that many people are going to be slow in responding
> > >this week because of LSFMM conference
> > >(https://events.linuxfoundation.org/lsfmm/)
> >
> > Here is a more detailed comparison and introduction of the old oom_score_adj
> > mechanism and the new oom_protect mechanism,
> > 1. The regulating granularity of oom_protect is smaller than that of oom_score_adj.
> > On a 512G physical machine, the minimum granularity adjusted by oom_score_adj
> > is 512M, and the minimum granularity adjusted by oom_protect is one page (4K).
> > 2. It may be simple to create a lightweight parent process and uniformly set the
> > oom_score_adj of some important processes, but it is not a simple matter to make
> > multi-level settings for tens of thousands of processes on the physical machine
> > through the lightweight parent processes. We may need a huge table to record the
> > value of oom_score_adj maintained by all lightweight parent processes, and the
> > user process limited by the parent process has no ability to change its own
> > oom_score_adj, because it does not know the details of the huge table. The new
> > patch adopts the cgroup mechanism. It does not need any parent process to manage
> > oom_score_adj. the settings between each memcg are independent of each other,
> > making it easier to plan the OOM order of all processes. Due to the unique nature
> > of memory resources, current Service cloud vendors are not oversold in memory
> > planning. I would like to use the new patch to try to achieve the possibility of
> > oversold memory resources.
>
> OK, this is more specific about the usecase. Thanks! So essentially what
> it boils down to is that you are handling many containers (memcgs from
> our POV) and they have different priorities. You want to overcommit the
> memory to the extend that global ooms are not an unexpected event. Once
> that happens the total memory consumption of a specific memcg is less
> important than its "priority". You define that priority by the excess of
> the memory usage above a user defined threshold. Correct?

There has been a parallel discussion in the cover letter thread of v4
[1]. To summarize, at Google, we have been using OOM scores to
describe different job priorities in a more explicit way -- regardless
of memory usage. It is strictly priority-based OOM killing. Ties are
broken based on memory usage.

We understand that something like memory.oom.protect has an advantage
in the sense that you can skip killing a process if you know that it
won't free enough memory anyway, but for an environment where multiple
jobs of different priorities are running, we find it crucial to be
able to define strict ordering. Some jobs are simply more important
than others, regardless of their memory usage.

It would be great if we can arrive at an interface that serves this
use case as well.

Thanks!

[1]https://lore.kernel.org/linux-mm/CAJD7tkaQdSTDX0Q7zvvYrA3Y4TcvLdWKnN3yc8VpfWRpUjcYBw@mail.gmail.com/

>
> Your cover letter mentions that then "all processes in the cgroup as a
> whole". That to me reads as oom.group oom killer policy. But a brief
> look into the patch suggests you are still looking at specific tasks and
> this has been a concern in the previous version of the patch because
> memcg accounting and per-process accounting are detached.
>
> > 3. I conducted a test and deployed an excessive number of containers on a physical
> > machine, By setting the oom_score_adj value of all processes in the container to
> > a positive number through dockerinit, even processes that occupy very little memory
> > in the container are easily killed, resulting in a large number of invalid kill behaviors.
> > If dockerinit is also killed unfortunately, it will trigger container self-healing, and the
> > container will rebuild, resulting in more severe memory oscillations. The new patch
> > abandons the behavior of adding an equal amount of oom_score_adj to each process
> > in the container and adopts a shared oom_protect quota for all processes in the container.
> > If a process in the container is killed, the remaining other processes will receive more
> > oom_protect quota, making it more difficult for the remaining processes to be killed.
> > In my test case, the new patch reduced the number of invalid kill behaviors by 70%.
> > 4. oom_score_adj is a global configuration that cannot achieve a kill order that only
> > affects a certain memcg-oom-killer. However, the oom_protect mechanism inherits
> > downwards, and user can only change the kill order of its own memcg oom, but the
> > kill order of their parent memcg-oom-killer or global-oom-killer will not be affected
>
> Yes oom_score_adj has shortcomings.
>
> > In the final discussion of patch v2, we discussed that although the adjustment range
> > of oom_score_adj is [-1000,1000], but essentially it only allows two usecases
> > (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between is
> > clumsy at best. In order to solve this problem in the new patch, I introduced a new
> > indicator oom_kill_inherit, which counts the number of times the local and child
> > cgroups have been selected by the OOM killer of the ancestor cgroup. By observing
> > the proportion of oom_kill_inherit in the parent cgroup, I can effectively adjust the
> > value of oom_protect to achieve the best.
>
> What does the best mean in this context?
>
> > about the semantics of non-leaf memcgs protection,
> > If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally
> > calculate the new effective oom_protect quota based on non-leaf memcg's quota.
>
> So the non-leaf memcg is never used as a target? What if the workload is
> distributed over several sub-groups? Our current oom.group
> implementation traverses the tree to find a common ancestor in the oom
> domain with the oom.group.
>
> All that being said and with the usecase described more specifically. I
> can see that memcg based oom victim selection makes some sense. That
> menas that it is always a memcg selected and all tasks withing killed.
> Memcg based protection can be used to evaluate which memcg to choose and
> the overall scheme should be still manageable. It would indeed resemble
> memory protection for the regular reclaim.
>
> One thing that is still not really clear to me is to how group vs.
> non-group ooms could be handled gracefully. Right now we can handle that
> because the oom selection is still process based but with the protection
> this will become more problematic as explained previously. Essentially
> we would need to enforce the oom selection to be memcg based for all
> memcgs. Maybe a mount knob? What do you think?
> --
> Michal Hocko
> SUSE Labs
Michal Hocko June 13, 2023, 8:16 a.m. UTC | #10
On Sun 04-06-23 08:05:53, 程垲涛 Chengkaitao Cheng wrote:
> At 2023-05-29 22:02:47, "Michal Hocko" <mhocko@suse.com> wrote:
> >On Thu 25-05-23 07:35:41, 程垲涛 Chengkaitao Cheng wrote:
> >> At 2023-05-22 21:03:50, "Michal Hocko" <mhocko@suse.com> wrote:
> >[...]
> >> >> I have created a new indicator oom_kill_inherit that maintains a negative correlation 
> >> >> with memory.oom.protect, so we have a ruler to measure the optimal value of 
> >> >> memory.oom.protect.
> >> >
> >> >An example might help here.
> >> 
> >> In my testing case, by adjusting memory.oom.protect, I was able to significantly 
> >> reduce the oom_kill_inherit of the corresponding cgroup. In a physical machine 
> >> with severely oversold memory, I divided all cgroups into three categories and 
> >> controlled their probability of being selected by the oom-killer to 0%,% 20, 
> >> and 80%, respectively.
> >
> >I might be just dense but I am lost. Can we focus on the barebone
> >semantic of the group oom selection and killing first. No magic
> >auto-tuning at this stage please.
> >
> >> >> >> about the semantics of non-leaf memcgs protection,
> >> >> >> If a non-leaf memcg's oom_protect quota is set, its leaf memcg will proportionally 
> >> >> >> calculate the new effective oom_protect quota based on non-leaf memcg's quota.
> >> >> >
> >> >> >So the non-leaf memcg is never used as a target? What if the workload is
> >> >> >distributed over several sub-groups? Our current oom.group
> >> >> >implementation traverses the tree to find a common ancestor in the oom
> >> >> >domain with the oom.group.
> >> >> 
> >> >> If the oom_protect quota of the parent non-leaf memcg is less than the sum of 
> >> >> sub-groups oom_protect quota, the oom_protect quota of each sub-group will 
> >> >> be proportionally reduced
> >> >> If the oom_protect quota of the parent non-leaf memcg is greater than the sum 
> >> >> of sub-groups oom_protect quota, the oom_protect quota of each sub-group 
> >> >> will be proportionally increased
> >> >> The purpose of doing so is that users can set oom_protect quota according to 
> >> >> their own needs, and the system management process can set appropriate 
> >> >> oom_protect quota on the parent non-leaf memcg as the final cover, so that 
> >> >> the system management process can indirectly manage all user processes.
> >> >
> >> >I guess that you are trying to say that the oom protection has a
> >> >standard hierarchical behavior. And that is fine, well, in fact it is
> >> >mandatory for any control knob to have sane hierarchical properties.
> >> >But that doesn't address my above question. Let me try again. When is a
> >> >non-leaf memcg potentially selected as the oom victim? It doesn't have
> >> >any tasks directly but it might be a suitable target to kill a multi
> >> >memcg based workload (e.g. a full container).
> >> 
> >> If a non-leaf memcg has higher memory usage and a smaller
> >> memory.oom.protect, it will have a higher probability of being
> >> selected by the killer. If a non-leaf memcg is selected as the oom
> >> victim, the OOM killer will continue to select the appropriate child
> >> memcg downwards until a leaf memcg is selected.
> >
> >A parent memcg has at least as much memory charged as its child(ren) by
> >definition. Let me try to ask differently. Say you have the following
> >hierarchy
> >
> >		  root
> >		/     \
> >       container_A     container_B
> >     (oom.prot=100M)   (oom.prot=200M)
> >     (usage=120M)      (usage=180M)
> >     /     |     \
> >    A      B      C
> >                 / \
> >		C1  C2
> >
> >
> >container_B is protected so it should be excluded. Correct? So we are at
> >container_A to choose from. There are multiple outcomes the system and
> >container admins might want to achieve:
> >1) system admin might want to shut down the whole container.
> >2) container admin might want to shut the whole container down
> >3) cont. admin might want to shut down a whole sub group (e.g. C as it
> >   is a self contained workload and killing portion of it will put it into
> >   inconsistent state).
> >4) cont. admin might want to kill the most excess cgroup with tasks (i.e. a
> >   leaf memcg).
> >5) admin might want to kill a process in the most excess memcg.
> >
> >Now we already have oom.group thingy that can drive the group killing
> >policy but it is not really clear how you want to incorporate that into
> >the protection.
> >
> >Again, I think that an oom.protection makes sense but the semantic has
> >to be very carefully thought through because it is quite easy to create
> >corner cases and weird behavior. I also think that oom.group has to be
> >consistent with the protection.
> 
> The barebone semantics of the function implemented by my patch are
> summarized as follows:
> A memcg only allows processes in the memcg to be selected by its
> ancestors' OOM killers when its memory usage exceeds "oom.protect".

I am sure you would need to break this expectation if there is no such
memcg with tasks available. Or do you panic the system in the global
case and retry forever for the memcg oom?

> It should be noted that "oom.protect" and "oom.group" are completely 
> different things, and kneading them together may make the explanation 
> more confusing.

I am not suggesting to tie those two together by any means. I am
merely saying that those two have to be mutually cooperative and still
represent a reasonable semantic. Please have a look at the above example
usecases and try to explain how the memory protection fits in here as
you have defined and implemented it.
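
To make the example above more concrete, here is a small, purely
illustrative C sketch of the barebone rule "a memcg is only eligible once
its usage exceeds its oom.protect", applied to the container_A/container_B
numbers from the diagram; the struct layout and the excess-based tie
breaking are assumptions for the example, not necessarily how the patch
scores candidates:

#include <stdio.h>

struct memcg {
        const char *name;
        unsigned long usage;    /* MB */
        unsigned long protect;  /* MB, memory.oom.protect */
};

/* a memcg is eligible only once its usage exceeds its protection */
static long excess(const struct memcg *m)
{
        return (long)m->usage - (long)m->protect;
}

int main(void)
{
        struct memcg candidates[] = {
                { "container_A", 120, 100 },
                { "container_B", 180, 200 },
        };
        const struct memcg *victim = NULL;

        for (int i = 0; i < 2; i++) {
                if (excess(&candidates[i]) <= 0)
                        continue;       /* fully protected, skip */
                if (!victim || excess(&candidates[i]) > excess(victim))
                        victim = &candidates[i];
        }
        printf("victim: %s\n", victim ? victim->name : "none");
        return 0;
}

With the quoted numbers container_B (180M used, 200M protected) is never
eligible, so only container_A (120M used, 100M protected) remains, which
is where the questions about cases 1)-5) above start to matter.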
Michal Hocko June 13, 2023, 8:27 a.m. UTC | #11
On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
[...]
> There has been a parallel discussion in the cover letter thread of v4
> [1]. To summarize, at Google, we have been using OOM scores to
> describe different job priorities in a more explicit way -- regardless
> of memory usage. It is strictly priority-based OOM killing. Ties are
> broken based on memory usage.
> 
> We understand that something like memory.oom.protect has an advantage
> in the sense that you can skip killing a process if you know that it
> won't free enough memory anyway, but for an environment where multiple
> jobs of different priorities are running, we find it crucial to be
> able to define strict ordering. Some jobs are simply more important
> than others, regardless of their memory usage.

I do remember that discussion. I am not a great fan of simple priority
based interfaces TBH. It sounds as an easy interface but it hits
complications as soon as you try to define a proper/sensible
hierarchical semantic. I can see how they might work on leaf memcgs with
statically assigned priorities but that sounds like a very narrow
usecase IMHO.

I do not think we can afford a plethora of different OOM selection
algorithms implemented in the kernel. Therefore we should really
consider a control interface to be as much extensible and in line
with the existing interfaces as much as possible. That is why I am
really open to the oom protection concept which fits reasonably well
to the reclaim protection scheme. After all oom killer is just a very
aggressive method of the memory reclaim.

On the other hand I can see a need to customizable OOM victim selection
functionality. We've been through that discussion on several other
occasions and the best thing we could come up with was to allow to plug
BPF into the victim selection process and allow to bypass the system
default method. No code has ever materialized from those discussions
though. Maybe this is the time to revive that idea again?
 
> It would be great if we can arrive at an interface that serves this
> use case as well.
> 
> Thanks!
> 
> [1]https://lore.kernel.org/linux-mm/CAJD7tkaQdSTDX0Q7zvvYrA3Y4TcvLdWKnN3yc8VpfWRpUjcYBw@mail.gmail.com/
Yosry Ahmed June 13, 2023, 8:36 a.m. UTC | #12
+David Rientjes

On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> [...]
> > There has been a parallel discussion in the cover letter thread of v4
> > [1]. To summarize, at Google, we have been using OOM scores to
> > describe different job priorities in a more explicit way -- regardless
> > of memory usage. It is strictly priority-based OOM killing. Ties are
> > broken based on memory usage.
> >
> > We understand that something like memory.oom.protect has an advantage
> > in the sense that you can skip killing a process if you know that it
> > won't free enough memory anyway, but for an environment where multiple
> > jobs of different priorities are running, we find it crucial to be
> > able to define strict ordering. Some jobs are simply more important
> > than others, regardless of their memory usage.
>
> I do remember that discussion. I am not a great fan of simple priority
> based interfaces TBH. It sounds as an easy interface but it hits
> complications as soon as you try to define a proper/sensible
> hierarchical semantic. I can see how they might work on leaf memcgs with
> statically assigned priorities but that sounds like a very narrow
> usecase IMHO.

Do you mind elaborating the problem with the hierarchical semantics?

The way it works with our internal implementation is (imo) sensible
and straightforward from a hierarchy POV. Starting at the OOM memcg
(which can be root), we recursively compare the OOM scores of the
children memcgs and pick the one with the lowest score, until we
arrive at a leaf memcg. Within that leaf, we also define per-process
scores, but these are less important to us.
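
A minimal sketch of the top-down selection just described, under the
assumption that the child memcg with the lowest score is picked at each
level until a leaf is reached; the node layout and the scores are
invented for the example, and this is not the internal implementation
referred to above:

#include <stdio.h>
#include <stddef.h>

struct node {
        const char *name;
        int score;              /* per-memcg OOM score; lowest is picked */
        struct node *child[4];  /* NULL-terminated list of children */
};

static struct node *pick_victim(struct node *memcg)
{
        for (;;) {
                struct node *best = NULL;

                for (int i = 0; i < 4 && memcg->child[i]; i++)
                        if (!best || memcg->child[i]->score < best->score)
                                best = memcg->child[i];
                if (!best)      /* leaf reached */
                        return memcg;
                memcg = best;
        }
}

int main(void)
{
        struct node leaf1 = { "cont_A/low",  200, { NULL } };
        struct node leaf2 = { "cont_A/high", 800, { NULL } };
        struct node cont_a = { "cont_A", 300, { &leaf1, &leaf2, NULL } };
        struct node cont_b = { "cont_B", 700, { NULL } };
        struct node root = { "root", 0, { &cont_a, &cont_b, NULL } };

        /* walks root -> cont_A -> cont_A/low with these scores */
        printf("selected leaf: %s\n", pick_victim(&root)->name);
        return 0;
}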

I am not sure I understand why this is not sensible from a hierarchy
POV or a very narrow use case. Note that all this is optional; by
default all memcgs are given the same score, and ties are broken based
on per-memcg (or per-process) usage.

>
> I do not think we can afford a plethora of different OOM selection
> algorithms implemented in the kernel. Therefore we should really
> consider a control interface to be as much extensible and in line
> with the existing interfaces as much as possible. That is why I am
> really open to the oom protection concept which fits reasonably well
> to the reclaim protection scheme. After all oom killer is just a very
> aggressive method of the memory reclaim.
>
> On the other hand I can see a need to customizable OOM victim selection
> functionality. We've been through that discussion on several other
> occasions and the best thing we could come up with was to allow to plug
> BPF into the victim selection process and allow to bypass the system
> default method. No code has ever materialized from those discussions
> though. Maybe this is the time to revive that idea again?

That definitely sounds interesting, and it was brought up before. It
does sound like BPF (or a different customization framework) can be
the answer here. Interested to hear what others think as well.

>
> > It would be great if we can arrive at an interface that serves this
> > use case as well.
> >
> > Thanks!
> >
> > [1]https://lore.kernel.org/linux-mm/CAJD7tkaQdSTDX0Q7zvvYrA3Y4TcvLdWKnN3yc8VpfWRpUjcYBw@mail.gmail.com/
> --
> Michal Hocko
> SUSE Labs
Tejun Heo June 13, 2023, 8:40 a.m. UTC | #13
Hello,

On Tue, Jun 13, 2023 at 10:27:32AM +0200, Michal Hocko wrote:
> On the other hand I can see a need to customizable OOM victim selection
> functionality. We've been through that discussion on several other
> occasions and the best thing we could come up with was to allow to plug
> BPF into the victim selection process and allow to bypass the system
> default method. No code has ever materialized from those discussions
> though. Maybe this is the time to revive that idea again?

Yeah, my 5 cent - trying to define a rigid interface for something complex
and flexible is a fool's errand.

Johannes knows a lot better than me but we (meta) are handling most OOMs
with oomd which gives more than sufficient policy flexibility. That said,
handling OOMs in userspace requires wholesale configuration changes (e.g.
working IO isolation) and being able to steer kernel OOM kills can be useful
for many folks. I haven't thought too much about it but the problem seems
pretty well suited to offloading policy decisions to a BPF program.

Thanks.
Michal Hocko June 13, 2023, 12:06 p.m. UTC | #14
On Tue 13-06-23 01:36:51, Yosry Ahmed wrote:
> +David Rientjes
> 
> On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> > [...]
> > > There has been a parallel discussion in the cover letter thread of v4
> > > [1]. To summarize, at Google, we have been using OOM scores to
> > > describe different job priorities in a more explicit way -- regardless
> > > of memory usage. It is strictly priority-based OOM killing. Ties are
> > > broken based on memory usage.
> > >
> > > We understand that something like memory.oom.protect has an advantage
> > > in the sense that you can skip killing a process if you know that it
> > > won't free enough memory anyway, but for an environment where multiple
> > > jobs of different priorities are running, we find it crucial to be
> > > able to define strict ordering. Some jobs are simply more important
> > > than others, regardless of their memory usage.
> >
> > I do remember that discussion. I am not a great fan of simple priority
> > based interfaces TBH. It sounds as an easy interface but it hits
> > complications as soon as you try to define a proper/sensible
> > hierarchical semantic. I can see how they might work on leaf memcgs with
> > statically assigned priorities but that sounds like a very narrow
> > usecase IMHO.
> 
> Do you mind elaborating the problem with the hierarchical semantics?

Well, let me be more specific. If you have a simple hierarchical numeric
enforcement (assume higher priority more likely to be chosen and the
effective priority to be max(self, max(parents))), then the semantic
itself is straightforward.
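
For completeness, a tiny sketch of the max(self, max(parents)) rule
mentioned above, i.e. a memcg's effective priority being the maximum of
its own value and its ancestors' values; this is only an illustration of
the semantic under discussion, with invented names and numbers:

#include <stdio.h>

struct memcg {
        const char *name;
        int prio;
        struct memcg *parent;
};

/* effective priority = max(own value, all ancestors' values) */
static int effective_prio(const struct memcg *m)
{
        int eff = m->prio;

        for (m = m->parent; m; m = m->parent)
                if (m->prio > eff)
                        eff = m->prio;
        return eff;
}

int main(void)
{
        struct memcg root = { "root", 0, NULL };
        struct memcg cont = { "cont_A", 5, &root };
        struct memcg leaf = { "cont_A/job", 2, &cont };

        /* the leaf's own value (2) is overridden by the container's (5) */
        printf("%s: effective priority %d\n", leaf.name, effective_prio(&leaf));
        return 0;
}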

I am not really sure about the practical manageability though. I have
hard time to imagine priority assignment on something like a shared
workload with a more complex hierarchy. For example:
	    root
	/    |    \
cont_A    cont_B  cont_C

each container running its workload with own hierarchy structures that
might be rather dynamic during the lifetime. In order to have a
predictable OOM behavior you need to watch and reassign priorities all
the time, no?

> The way it works with our internal implementation is (imo) sensible
> and straightforward from a hierarchy POV. Starting at the OOM memcg
> (which can be root), we recursively compare the OOM scores of the
> children memcgs and pick the one with the lowest score, until we
> arrive at a leaf memcg.

This approach has a strong requirement on the memcg hierarchy
organization. Siblings have to be directly comparable because you cut
off many potential sub-trees this way (e.g. is it easy to tell
whether you want to rule out all system or user slices?).

I can imagine usecases where this could work reasonably well, e.g. a set
of workers of different priorities, all of them running under a shared
memcg parent. But more involved hierarchies seem more complex
because you always have to keep in mind how the hierarchy is organized to get to
your desired victim.
Yosry Ahmed June 13, 2023, 8:24 p.m. UTC | #15
On Tue, Jun 13, 2023 at 5:06 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 13-06-23 01:36:51, Yosry Ahmed wrote:
> > +David Rientjes
> >
> > On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> > > [...]
> > > > There has been a parallel discussion in the cover letter thread of v4
> > > > [1]. To summarize, at Google, we have been using OOM scores to
> > > > describe different job priorities in a more explicit way -- regardless
> > > > of memory usage. It is strictly priority-based OOM killing. Ties are
> > > > broken based on memory usage.
> > > >
> > > > We understand that something like memory.oom.protect has an advantage
> > > > in the sense that you can skip killing a process if you know that it
> > > > won't free enough memory anyway, but for an environment where multiple
> > > > jobs of different priorities are running, we find it crucial to be
> > > > able to define strict ordering. Some jobs are simply more important
> > > > than others, regardless of their memory usage.
> > >
> > > I do remember that discussion. I am not a great fan of simple priority
> > > based interfaces TBH. It sounds as an easy interface but it hits
> > > complications as soon as you try to define a proper/sensible
> > > hierarchical semantic. I can see how they might work on leaf memcgs with
> > > statically assigned priorities but that sounds like a very narrow
> > > usecase IMHO.
> >
> > Do you mind elaborating the problem with the hierarchical semantics?
>
> Well, let me be more specific. If you have a simple hierarchical numeric
> enforcement (assume higher priority more likely to be chosen and the
> > effective priority to be max(self, max(parents))), then the semantic
> > itself is straightforward.
>
> I am not really sure about the practical manageability though. I have
> hard time to imagine priority assignment on something like a shared
> workload with a more complex hierarchy. For example:
>             root
>         /    |    \
> cont_A    cont_B  cont_C
>
> each container running its workload with own hierarchy structures that
> might be rather dynamic during the lifetime. In order to have a
> predictable OOM behavior you need to watch and reassign priorities all
> the time, no?

In our case we don't really manage the entire hierarchy in a
centralized fashion. Each container gets a score based on their
relative priority, and each container is free to set scores within its
subcontainers if needed. Isn't this what the hierarchy is all about?
Each parent only cares about its direct children. On the system level,
we care about the priority ordering of containers. Ordering within
containers can be deferred to containers.

>
> > The way it works with our internal implementation is (imo) sensible
> > and straightforward from a hierarchy POV. Starting at the OOM memcg
> > (which can be root), we recursively compare the OOM scores of the
> > children memcgs and pick the one with the lowest score, until we
> > arrive at a leaf memcg.
>
> This approach has a strong requirement on the memcg hierarchy
> organization. Siblings have to be directly comparable because you cut
> off many potential sub-trees this way (e.g. is it easy to tell
> whether you want to rule out all system or user slices?).
>
> I can imagine usecases where this could work reasonably well, e.g. a set
> of workers of different priorities, all of them running under a shared
> memcg parent. But more involved hierarchies seem more complex
> because you always have to keep in mind how the hierarchy is organized to get to
> your desired victim.

I guess the main point is what I mentioned above: you don't need to
manage the entire tree; containers can manage their subtrees. The most
important thing is to provide the kernel with priority ordering among
containers, and optionally priority ordering within a container
(disregarding other containers).

>
> --
> Michal Hocko
> SUSE Labs
Michal Hocko June 15, 2023, 10:39 a.m. UTC | #16
On Tue 13-06-23 13:24:24, Yosry Ahmed wrote:
> On Tue, Jun 13, 2023 at 5:06 AM Michal Hocko <mhocko@suse.com> wrote:
> >
> > On Tue 13-06-23 01:36:51, Yosry Ahmed wrote:
> > > +David Rientjes
> > >
> > > On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <mhocko@suse.com> wrote:
> > > >
> > > > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> > > > [...]
> > > > > There has been a parallel discussion in the cover letter thread of v4
> > > > > [1]. To summarize, at Google, we have been using OOM scores to
> > > > > describe different job priorities in a more explicit way -- regardless
> > > > > of memory usage. It is strictly priority-based OOM killing. Ties are
> > > > > broken based on memory usage.
> > > > >
> > > > > We understand that something like memory.oom.protect has an advantage
> > > > > in the sense that you can skip killing a process if you know that it
> > > > > won't free enough memory anyway, but for an environment where multiple
> > > > > jobs of different priorities are running, we find it crucial to be
> > > > > able to define strict ordering. Some jobs are simply more important
> > > > > than others, regardless of their memory usage.
> > > >
> > > > I do remember that discussion. I am not a great fan of simple priority
> > > > based interfaces TBH. It sounds as an easy interface but it hits
> > > > complications as soon as you try to define a proper/sensible
> > > > hierarchical semantic. I can see how they might work on leaf memcgs with
> > > > statically assigned priorities but that sounds like a very narrow
> > > > usecase IMHO.
> > >
> > > Do you mind elaborating the problem with the hierarchical semantics?
> >
> > Well, let me be more specific. If you have a simple hierarchical numeric
> > enforcement (assume higher priority more likely to be chosen and the
> > > effective priority to be max(self, max(parents))), then the semantic
> > > itself is straightforward.
> >
> > I am not really sure about the practical manageability though. I have
> > hard time to imagine priority assignment on something like a shared
> > workload with a more complex hierarchy. For example:
> >             root
> >         /    |    \
> > cont_A    cont_B  cont_C
> >
> > each container running its workload with own hierarchy structures that
> > might be rather dynamic during the lifetime. In order to have a
> > predictable OOM behavior you need to watch and reassign priorities all
> > the time, no?
> 
> In our case we don't really manage the entire hierarchy in a
> centralized fashion. Each container gets a score based on their
> relative priority, and each container is free to set scores within its
> subcontainers if needed. Isn't this what the hierarchy is all about?
> Each parent only cares about its direct children. On the system level,
> we care about the priority ordering of containers. Ordering within
> containers can be deferred to containers.

This really depends on the workload. This might be working for your
setup but as I've said above, many workloads would be struggling with
re-prioritizing as soon as a new workload is started and oom priorities
would need to be reorganized as a result. The setup is just too static
to be generally useful IMHO. 
You can avoid that by essentially making mid-layers carry no priority and only
relying on leaf memcgs, which would make this more flexible. This is
something even more complicated with the top-down approach.

That being said, I can see workloads which could benefit from a
priority (essentially userspace controlled oom pre-selection) based
policy. But there are many other policies like that which would be
usecase specific and not generic enough, so I do not think this is worth
a generic interface; it would rather fall under BPF or similar pluggable policies.
Yosry Ahmed June 16, 2023, 1:44 a.m. UTC | #17
On Thu, Jun 15, 2023 at 3:39 AM Michal Hocko <mhocko@suse.com> wrote:
>
> On Tue 13-06-23 13:24:24, Yosry Ahmed wrote:
> > On Tue, Jun 13, 2023 at 5:06 AM Michal Hocko <mhocko@suse.com> wrote:
> > >
> > > On Tue 13-06-23 01:36:51, Yosry Ahmed wrote:
> > > > +David Rientjes
> > > >
> > > > On Tue, Jun 13, 2023 at 1:27 AM Michal Hocko <mhocko@suse.com> wrote:
> > > > >
> > > > > On Sun 04-06-23 01:25:42, Yosry Ahmed wrote:
> > > > > [...]
> > > > > > There has been a parallel discussion in the cover letter thread of v4
> > > > > > [1]. To summarize, at Google, we have been using OOM scores to
> > > > > > describe different job priorities in a more explicit way -- regardless
> > > > > > of memory usage. It is strictly priority-based OOM killing. Ties are
> > > > > > broken based on memory usage.
> > > > > >
> > > > > > We understand that something like memory.oom.protect has an advantage
> > > > > > in the sense that you can skip killing a process if you know that it
> > > > > > won't free enough memory anyway, but for an environment where multiple
> > > > > > jobs of different priorities are running, we find it crucial to be
> > > > > > able to define strict ordering. Some jobs are simply more important
> > > > > > than others, regardless of their memory usage.
> > > > >
> > > > > I do remember that discussion. I am not a great fan of simple priority
> > > > > based interfaces TBH. It sounds as an easy interface but it hits
> > > > > complications as soon as you try to define a proper/sensible
> > > > > hierarchical semantic. I can see how they might work on leaf memcgs with
> > > > > statically assigned priorities but that sounds like a very narrow
> > > > > usecase IMHO.
> > > >
> > > > Do you mind elaborating the problem with the hierarchical semantics?
> > >
> > > Well, let me be more specific. If you have a simple hierarchical numeric
> > > enforcement (assume higher priority more likely to be chosen and the
> > > > effective priority to be max(self, max(parents))), then the semantic
> > > > itself is straightforward.
> > >
> > > I am not really sure about the practical manageability though. I have
> > > hard time to imagine priority assignment on something like a shared
> > > workload with a more complex hierarchy. For example:
> > >             root
> > >         /    |    \
> > > cont_A    cont_B  cont_C
> > >
> > > each container running its workload with own hierarchy structures that
> > > might be rather dynamic during the lifetime. In order to have a
> > > predictable OOM behavior you need to watch and reassign priorities all
> > > the time, no?
> >
> > In our case we don't really manage the entire hierarchy in a
> > centralized fashion. Each container gets a score based on their
> > relative priority, and each container is free to set scores within its
> > subcontainers if needed. Isn't this what the hierarchy is all about?
> > Each parent only cares about its direct children. On the system level,
> > we care about the priority ordering of containers. Ordering within
> > containers can be deferred to containers.
>
> This really depends on the workload. This might be working for your
> setup but as I've said above, many workloads would be struggling with
> re-prioritizing as soon as a new workload is started and oom priorities
> would need to be reorganized as a result. The setup is just too static
> to be generally useful IMHO.
> You can avoid that by essentially making mid-layers carry no priority and only
> relying on leaf memcgs, which would make this more flexible. This is
> something even more complicated with the top-down approach.

I agree that other setups may find it more difficult if one entity
needs to manage the entire tree, although if the score range is large
enough, I don't really think it's that static. When a new workload is
started you decide what its priority is compared to the existing
workloads and set its score as such. We use a range of scores from 0
to 10,000 (and it can easily be larger), so it's easy to assign new
scores without reorganizing the existing scores.

>
> That being said, I can see workloads which could benefit from a
> priority (essentially userspace controlled oom pre-selection) based
> policy. But there are many other policies like that which would be
> usecase specific and not generic enough, so I do not think this is worth
> a generic interface; it would rather fall under BPF or similar pluggable policies.

That's reasonable. I can't speak for other folks. Perhaps no single
policy will be generic enough, and we should focus on enabling
customized policy. Perhaps other userspace OOM agents can benefit from
this as well.

>
> --
> Michal Hocko
> SUSE Labs