
[v2] mm: memcontrol: protect the memory in cgroup from being oom killed

Message ID 20221208034644.3077-1-chengkaitao@didiglobal.com (mailing list archive)
State New, archived

Commit Message

chengkaitao Dec. 8, 2022, 3:46 a.m. UTC
From: chengkaitao <pilgrimtao@gmail.com>

We add a new interface, <memory.oom.protect>, to the memory controller.
If the OOM killer is invoked under a parent memory cgroup, and the
memory usage of a child cgroup is within its effective oom.protect
boundary, the cgroup's tasks won't be OOM killed unless there are no
unprotected tasks in other children cgroups. It draws on the logic of
<memory.min/low> in the inheritance relationship.

It has the following advantages:
1. We gain the ability to protect more important processes when a
memcg OOM killer is invoked. The oom.protect only takes effect in the
local memcg and does not affect the OOM killer of the host.
2. Historically, oom_score_adj has often been used to control a group
of processes, but it requires that all processes in the cgroup share a
common parent process: we have to set that parent's oom_score_adj
before it forks all of its children, which makes it very difficult to
apply in other situations. oom.protect has no such restriction, so we
can protect a cgroup of processes more easily. The cgroup can keep some
memory even if the OOM killer has to be called.

Signed-off-by: chengkaitao <pilgrimtao@gmail.com>
---
v2: Modify the formula by which a process requests the memcg protection quota.
---
 Documentation/admin-guide/cgroup-v2.rst |  22 +++-
 fs/proc/base.c                          |  17 ++-
 include/linux/memcontrol.h              |  45 +++++++-
 include/linux/oom.h                     |   3 +-
 include/linux/page_counter.h            |   6 ++
 mm/memcontrol.c                         | 182 ++++++++++++++++++++++++++++++++
 mm/oom_kill.c                           |  25 +++--
 mm/page_counter.c                       |  26 +++++
 8 files changed, 305 insertions(+), 21 deletions(-)

Comments

Michal Hocko Dec. 8, 2022, 7:33 a.m. UTC | #1
On Thu 08-12-22 11:46:44, chengkaitao wrote:
> [...]
> v2: Modify the formula of the process request memcg protection quota.

The new formula doesn't really address concerns expressed previously.
Please read my feedback carefully again and follow up with questions if
something is not clear.
程垲涛 Chengkaitao Cheng Dec. 8, 2022, 7:59 a.m. UTC | #2
At 2022-12-08 15:33:07, "Michal Hocko" <mhocko@suse.com> wrote:
>On Thu 08-12-22 11:46:44, chengkaitao wrote:
>> [...]
>
>The new formula doesn't really address concerns expressed previously.
>Please read my feedback carefully again and follow up with questions if
>something is not clear.

The previous discussion was quite scattered. Could you help me summarize
your concerns again? I will then think about how to address these
problems.
Michal Hocko Dec. 8, 2022, 8:14 a.m. UTC | #3
On Thu 08-12-22 07:59:27, 程垲涛 Chengkaitao Cheng wrote:
> >[...]
> 
> The previous discussion was quite scattered. Can you help me summarize
> your concerns again?

The most important part is http://lkml.kernel.org/r/Y4jFnY7kMdB8ReSW@dhcp22.suse.cz
: Let me just emphasise that we are talking about fundamental disconnect.
: Rss based accounting has been used for the OOM killer selection because
: the memory gets unmapped and _potentially_ freed when the process goes
: away. Memcg changes are bound to the object life time and as said in
: many cases there is no direct relation with any process life time.

That is to the per-process discount based on rss or any per-process
memory metrics.

Another really important question is the actual configurability. The
hierarchical protection has to be enforced and that means that same as
memory reclaim protection it has to be enforced top-to-bottom in the
cgroup hierarchy. That makes the oom protection rather non-trivial to
configure without having a good picture of a larger part of the cgroup
hierarchy as it cannot be tuned based on a reclaim feedback.
程垲涛 Chengkaitao Cheng Dec. 8, 2022, 2:07 p.m. UTC | #4
At 2022-12-08 16:14:10, "Michal Hocko" <mhocko@suse.com> wrote:
>On Thu 08-12-22 07:59:27, 程垲涛 Chengkaitao Cheng wrote:
> >[...]
>> 
>> The previous discussion was quite scattered. Can you help me summarize
>> your concerns again?
>
>The most important part is http://lkml.kernel.org/r/Y4jFnY7kMdB8ReSW@dhcp22.suse.cz
>: Let me just emphasise that we are talking about fundamental disconnect.
>: Rss based accounting has been used for the OOM killer selection because
>: the memory gets unmapped and _potentially_ freed when the process goes
>: away. Memcg changes are bound to the object life time and as said in
>: many cases there is no direct relation with any process life time.
>
We need to discuss the relationship between a memcg's memory and its
processes' memory:

task_usage  = task_anon(rss_anon) + task_mapped_file(rss_file)
            + task_mapped_share(rss_share) + task_pgtables + task_swapents

memcg_usage = memcg_anon + memcg_file + memcg_pgtables + memcg_share
            = all_task_anon + all_task_mapped_file + all_task_mapped_share
            + all_task_pgtables + unmapped_file + unmapped_share
            = all_task_usage + unmapped_file + unmapped_share - all_task_swapents

For most of its memory, a memcg is directly related to its processes. On
the other hand, unmapped_file and unmapped_share pages are not charged
to any process, and that memory cannot be released by the OOM killer.
Therefore, those pages should not claim any of the cgroup's protection
quota and can be excluded from the calculation.

       memcg A
    /     |     \
task-x  task-y  common-cache
   2G     2G        2G

eoom.protect(memcg A) = 3G;
usage(memcg A) = 6G
usage(task x) = 2G
usage(task y) = 2G
common-cache = 2G

After calculation,
actual-protection(task x) = 1G
actual-protection(task y) = 1G

This formula is fairer to groups with fewer common caches (unmapped_file
and unmapped_share pages). In extreme cases, unmapped_file and
unmapped_share pages may lock up a large share of the protection quota,
but that is expected.
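
For illustration, the arithmetic above can be written out as a small
userspace C sketch (the names are invented for the example; in the patch
below the real calculation lives in get_task_eoom_protect()):

#include <stdio.h>

/*
 * Per-task share of a memcg's effective oom.protect quota: the quota
 * is handed out in proportion to each task's killable usage, i.e. the
 * memory that an oom kill could actually free.
 */
static unsigned long task_protection(unsigned long memcg_eoom_protect,
				     unsigned long task_points,
				     unsigned long memcg_usage)
{
	if (!memcg_usage)
		return 0;
	return memcg_eoom_protect * task_points / memcg_usage;
}

int main(void)
{
	const unsigned long G = 262144;	/* pages per GiB with 4 KiB pages */
	unsigned long eoom = 3 * G;	/* eoom.protect(memcg A) = 3G */
	unsigned long usage = 6 * G;	/* usage(memcg A) = 6G */

	/* tasks x and y use 2G each; the remaining 2G is common cache */
	printf("task x protected: %lu pages\n",
	       task_protection(eoom, 2 * G, usage));
	printf("task y protected: %lu pages\n",
	       task_protection(eoom, 2 * G, usage));
	return 0;
}

With the example numbers, tasks x and y each end up with 262144 pages
(1G) of protection, and the 2G of common cache implicitly holds the
remaining 1G of quota, matching the figures above.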

>That is to the per-process discount based on rss or any per-process
>memory metrics.
>
>Another really important question is the actual configurability. The
>hierarchical protection has to be enforced and that means that same as
>memory reclaim protection it has to be enforced top-to-bottom in the
>cgroup hierarchy. That makes the oom protection rather non-trivial to
>configure without having a good picture of a larger part of the cgroup
>hierarchy as it cannot be tuned based on a reclaim feedback.

There is an essential difference between reclaim and the OOM killer.
Reclaim cannot be directly perceived by users, so the memcg needs to
export indicators such as pgscan_(kswapd/direct). However, when a user
process is killed by the OOM killer, users can perceive and count it
directly (for example, the number of restarts of a certain type of
process). The kernel also provides memory.events to report information
about the OOM killer, which can likewise be used for feedback
adjustment. Of course, I could also add indicators such as the
accumulated memory released by the OOM killer to help users better
track its behavior. Do you think that would be valuable?
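
For instance, a user-space feedback loop could watch the existing
oom_kill counter. A minimal sketch, assuming a hypothetical cgroup path
(memory.events and its oom_kill key are existing cgroup v2 interfaces):

#include <stdio.h>
#include <string.h>

/* Read the oom_kill counter from a cgroup's memory.events file. */
static long read_oom_kills(const char *path)
{
	char key[64];
	long val;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	while (fscanf(f, "%63s %ld", key, &val) == 2) {
		if (!strcmp(key, "oom_kill")) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return -1;
}

int main(void)
{
	long kills = read_oom_kills("/sys/fs/cgroup/app/memory.events");

	if (kills >= 0)
		printf("oom_kill events so far: %ld\n", kills);
	return 0;
}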
Michal Hocko Dec. 8, 2022, 2:23 p.m. UTC | #5
On Thu 08-12-22 14:07:06, 程垲涛 Chengkaitao Cheng wrote:
> >[...]
> We need to discuss the relationship between a memcg's memory and its
> processes' memory:
>
> task_usage  = task_anon(rss_anon) + task_mapped_file(rss_file)
>             + task_mapped_share(rss_share) + task_pgtables + task_swapents
>
> memcg_usage = memcg_anon + memcg_file + memcg_pgtables + memcg_share
>             = all_task_anon + all_task_mapped_file + all_task_mapped_share
>             + all_task_pgtables + unmapped_file + unmapped_share
>             = all_task_usage + unmapped_file + unmapped_share - all_task_swapents

You are missing all the kernel charged objects (aka __GFP_ACCOUNT
allocations resp. SLAB_ACCOUNT for slab caches). Depending on the
workload this can be really a very noticeable portion. So not this is
not just about unmapped cache or shm.

> >[...]
> 
> There is an essential difference between reclaim and oom killer.

oom killer is a memory reclaim of the last resort. So yes, there is some
difference but fundamentally it is about releasing some memory. And long
term we have learned that the more clever it tries to be the more likely
corner cases can happen. It is simply impossible to know the best
candidate so this is a just a best effort. We try to aim for
predictability at least.

> The reclaim 
> cannot be directly perceived by users,

I very strongly disagree with this statement. First the direct reclaim is a
direct source of latencies because the work is done on behalf of the
allocating process. There are side effect possible as well because
refaults have their cost as well.

> so memcg need to count indicators 
> similar to pgscan_(kswapd/direct). However, when the user process is killed 
> by oom killer, users can clearly perceive and count (such as the number of 
> restarts of a certain type of process). At the same time, the kernel also has 
> memory.events to count some information about the oom killer, which can 
> also be used for feedback adjustment.

Yes we have those metrics already. I suspect I haven't made myself
clear. I didn't say there are no measures to see how oom behaves. What
I've said that I _suspect_ that oom protection would be really hard to
configure correctly because unlike the memory reclaim which happens
during the normal operation, oom is a relatively rare event and it is
quite hard to use it for any feedback mechanisms. But I am really open
to be convinced otherwise and this is in fact what I have been asking
for since the beginning. I would love to see some examples on the
reasonable configuration for a practical usecase. It is one thing to say
that you can set the protection to a certain value and a different one
to have a way to determine that value. See my point?
程垲涛 Chengkaitao Cheng Dec. 9, 2022, 5:07 a.m. UTC | #6
At 2022-12-08 22:23:56, "Michal Hocko" <mhocko@suse.com> wrote:
>> [...]
>> We need to discuss the relationship between a memcg's memory and its
>> processes' memory:
>>
>> task_usage  = task_anon(rss_anon) + task_mapped_file(rss_file)
>>             + task_mapped_share(rss_share) + task_pgtables + task_swapents
>>
>> memcg_usage = memcg_anon + memcg_file + memcg_pgtables + memcg_share
>>             = all_task_anon + all_task_mapped_file + all_task_mapped_share
>>             + all_task_pgtables + unmapped_file + unmapped_share
>>             = all_task_usage + unmapped_file + unmapped_share - all_task_swapents
>
>You are missing all the kernel charged objects (aka __GFP_ACCOUNT
>allocations resp. SLAB_ACCOUNT for slab caches). Depending on the
>workload this can be really a very noticeable portion. So not this is
>not just about unmapped cache or shm.
>
Kmem is indeed missing here, thanks for the reminder, but the patch
still applies once kmem is accounted for.

>> >[...]
>> 
>> There is an essential difference between reclaim and oom killer.
>
>oom killer is a memory reclaim of the last resort. So yes, there is some
>difference but fundamentally it is about releasing some memory. And long
>term we have learned that the more clever it tries to be the more likely
>corner cases can happen. It is simply impossible to know the best
>candidate so this is a just a best effort. We try to aim for
>predictability at least.

Is the current oom_score strategy predictable? I don't think so. The
score_adj has already broken the predictability of oom_score (it is no
longer simply killing the process that uses the most memory). And I
think score_adj and oom.protect are not there for the kernel to choose
the best candidate, but for the user to choose a candidate more
conveniently. If the user configures neither score_adj nor oom.protect,
the kernel follows the simplest and most direct logic (killing the
process that uses the most memory).
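
That default logic is roughly the following (a simplified userspace
restatement of the kernel's oom_badness() heuristic, for illustration
only):

#include <stdio.h>

/*
 * Baseline badness: the memory an oom kill could plausibly free
 * (rss + swap entries + page tables), shifted by oom_score_adj
 * normalized to thousandths of total memory.
 */
static long badness(unsigned long rss, unsigned long swapents,
		    unsigned long pgtable_pages, long oom_score_adj,
		    unsigned long totalpages)
{
	long points = rss + swapents + pgtable_pages;

	return points + oom_score_adj * (long)(totalpages / 1000);
}

int main(void)
{
	/* 1 GiB rss, no swap, adj 0, on a 16 GiB machine (4 KiB pages) */
	printf("points = %ld\n", badness(262144, 0, 512, 0, 4194304));
	return 0;
}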

>
>> The reclaim 
>> cannot be directly perceived by users,
>
>I very strongly disagree with this statement. First the direct reclaim is a
>direct source of latencies because the work is done on behalf of the
>allocating process. There are side effect possible as well because
>refaults have their cost as well.

"Direct perception" here does not mean that reclaim has no effect on
the performance of user processes; it emphasizes that users cannot make
feedback adjustments based on their own state and must instead rely on
kernel indicators.
>
>> so memcg need to count indicators 
>> similar to pgscan_(kswapd/direct). However, when the user process is killed 
>> by oom killer, users can clearly perceive and count (such as the number of 
>> restarts of a certain type of process). At the same time, the kernel also has 
>> memory.events to count some information about the oom killer, which can 
>> also be used for feedback adjustment.
>
>Yes we have those metrics already. I suspect I haven't made myself
>clear. I didn't say there are no measures to see how oom behaves. What
>I've said that I _suspect_ that oom protection would be really hard to
>configure correctly because unlike the memory reclaim which happens
>during the normal operation, oom is a relatively rare event and it is
>quite hard to use it for any feedback mechanisms. 

We discussed a similar case before:
https://lore.kernel.org/linux-mm/EF1DC035-442F-4BAE-B86F-6C6B10B4A094@didiglobal.com/
* More and more users want to save costs as much as possible by setting
* mem.max to a very small value. This results in a small number of oom
* events, which users can tolerate, and they want to minimize the impact
* of those oom events. In such scenarios, oom events are no longer
* abnormal and unpredictable. We need to provide convenient oom policies
* for users to choose.

> But I am really open
>to be convinced otherwise and this is in fact what I have been asking
>for since the beginning. I would love to see some examples on the
>reasonable configuration for a practical usecase.

Here is a simple example. In a docker container, users can divide all processes 
into two categories (important and normal), and put them in different cgroups. 
One cgroup's oom.protect is set to "max", the other is set to "0". In this way, 
important processes in the container can be protected.
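
Concretely, that setup could look like the following sketch (the cgroup
paths are hypothetical and assume the container's subtree is delegated):

#include <stdio.h>

/* Write a value to a cgroup control file; returns 0 on success. */
static int cg_write(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

int main(void)
{
	cg_write("/sys/fs/cgroup/container/important/memory.oom.protect",
		 "max");
	cg_write("/sys/fs/cgroup/container/normal/memory.oom.protect",
		 "0");
	return 0;
}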

> It is one thing to say
>that you can set the protection to a certain value and a different one
>to have a way to determine that value. See my point?

As things stand, if score_adj is set, the only way to determine the
resulting value is "cat /proc/pid/oom_score". In the oom.protect
scheme, I also propose to change "/proc/pid/oom_score". Please refer
to the link,
https://lore.kernel.org/linux-mm/C2CC36C1-29AE-4B65-A18A-19A745652182@didiglobal.com/

>
>-- 
>Michal Hocko
>SUSE Labs
Michal Hocko Dec. 9, 2022, 8:25 a.m. UTC | #7
On Fri 09-12-22 05:07:15, 程垲涛 Chengkaitao Cheng wrote:
> At 2022-12-08 22:23:56, "Michal Hocko" <mhocko@suse.com> wrote:
[...]
> >oom killer is a memory reclaim of the last resort. So yes, there is some
> >difference but fundamentally it is about releasing some memory. And long
> >term we have learned that the more clever it tries to be the more likely
> >corner cases can happen. It is simply impossible to know the best
> >candidate so this is a just a best effort. We try to aim for
> >predictability at least.
> 
> Is the current oom_score strategy predictable? I don't think so. The score_adj 
> has broken the predictability of oom_score (it is no longer simply killing the 
> process that uses the most mems).

oom_score as reported to userspace already considers oom_score_adj,
which means that you can compare processes and get a reasonable guess
at what the current oom victim would be. There is a certain fuzz level
because this is not atomic, and there is no clear candidate when
multiple processes have an equal score. So yes, it is not 100%
predictable. memory.oom.protect as you propose doesn't change that though.

Is oom_score_adj a good interface? No, not really. If I could go back in
time I would nack it, but here we are. We have an interface that
promises quite a lot but essentially only allows two usecases
(OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between
is clumsy at best, because a real user-space oom policy would require
re-evaluating the whole oom domain (be it global or memcg oom) as
memory consumption evolves over time. I am really worried that your
memory.oom.protect is headed down a very similar trajectory, because
the protection really needs to consider other memcgs to balance properly.

[...]

> > But I am really open
> >to be convinced otherwise and this is in fact what I have been asking
> >for since the beginning. I would love to see some examples on the
> >reasonable configuration for a practical usecase.
> 
> Here is a simple example. In a docker container, users can divide all processes 
> into two categories (important and normal), and put them in different cgroups. 
> One cgroup's oom.protect is set to "max", the other is set to "0". In this way, 
> important processes in the container can be protected.

That is effectively oom_score_adj = OOM_SCORE_ADJ_MIN - 1 for all
processes in the important group. I would argue you can achieve a very
similar result by having the process launcher set the oom_score_adj and
let all processes in that important container inherit it. You do not
need any memcg tunable for that. I am really much more interested in
examples where the protection is to be fine-tuned.
Michal Hocko Dec. 9, 2022, 12:27 p.m. UTC | #8
On Fri 09-12-22 09:25:38, Michal Hocko wrote:
> On Fri 09-12-22 05:07:15, 程垲涛 Chengkaitao Cheng wrote:
[...]
> > Here is a simple example. In a docker container, users can divide all processes 
> > into two categories (important and normal), and put them in different cgroups. 
> > One cgroup's oom.protect is set to "max", the other is set to "0". In this way, 
> > important processes in the container can be protected.
> 
> That is effectivelly oom_score_adj = OOM_SCORE_ADJ_MIN - 1 to all

Sorry that should have been OOM_SCORE_ADJ_MIN + 1
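
To illustrate the launcher approach described above, a sketch (an
illustration only; -999 is OOM_SCORE_ADJ_MIN + 1, and oom_score_adj is
inherited across fork() and execve()):

#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
	/* just above the "never kill" value of -1000 */
	FILE *f = fopen("/proc/self/oom_score_adj", "w");

	if (!f)
		return 1;
	fprintf(f, "-999\n");
	fclose(f);

	/* every process launched from here inherits the adjustment */
	if (argc > 1)
		execvp(argv[1], &argv[1]);
	return 0;
}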
程垲涛 Chengkaitao Cheng Dec. 10, 2022, 9:18 a.m. UTC | #9
At 2022-12-09 16:25:37, "Michal Hocko" <mhocko@suse.com> wrote:
>On Fri 09-12-22 05:07:15, 程垲涛 Chengkaitao Cheng wrote:
>[...]
>
>oom_score as reported to the userspace already considers oom_score_adj
>which means that you can compare processes and get a reasonable guess
>what would be the current oom_victim. There is a certain fuzz level
>because this is not atomic and also there is no clear candidate when
>multiple processes have equal score. 

If multiple processes have the same score, it is reasonable to kill any
one of them. Why must we determine exactly which one it will be?

> So yes, it is not 100% predictable.
>memory.oom.protect as you propose doesn't change that though.
>
This scheme gives the power to choose the candidate to the user. The
user's behavior is random, and I think it is impossible to predict a
random event with 100% accuracy.

Is it really necessary to make everything 100% predictable? Just as we
can't accurately predict which cgroup will access the page cache
frequently, we can't accurately predict whether memory is hot or cold.
These strategies are fuzzy, but that does not make them unreasonable.

>Is oom_score_adj a good interface? No, not really. If I could go back in
>time I would nack it, but here we are. We have an interface that
>promises quite a lot but essentially only allows two usecases
>(OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX) reliably. Everything in between
>is clumsy at best, because a real user-space oom policy would require
>re-evaluating the whole oom domain (be it global or memcg oom) as
>memory consumption evolves over time. I am really worried that your
>memory.oom.protect is headed down a very similar trajectory, because
>the protection really needs to consider other memcgs to balance properly.
>
The score_adj is an interface that promises quite a lot. I think the
reason why only the two usecases (OOM_SCORE_ADJ_MIN, OOM_SCORE_ADJ_MAX)
are reliable is that users cannot evaluate the priority of every
process on the physical machine. If there were an agent process on the
physical machine that could accurately divide all of its user processes
into different levels, the other usecases of score_adj would apply
well, but that is almost impossible to achieve in real life.

Here is an example of a practical application: kubelet sets the
score_adj of the dockerinit process of every burstable container
according to the following formula,

score_adj = 1000 - request * 1000 / totalpages
(request = "fixed coefficient" * "memory.max")

Because kubelet has a clear picture of the memory behavior of all the
containers on the physical machine, it can make use of more score_adj
usecases. The advantage of oom.protect is that users do not need a
clear picture of all the processes on the physical machine; they only
need a clear picture of the processes in their local cgroup. I think
that requirement is much easier to meet.
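
As a quick sanity check of that formula with hypothetical numbers:

#include <stdio.h>

/* kubelet-style adjustment: 1000 - request * 1000 / totalpages */
static long burstable_score_adj(unsigned long request_pages,
				unsigned long totalpages)
{
	return 1000 - (long)(request_pages * 1000 / totalpages);
}

int main(void)
{
	/* a container requesting 4 GiB on a 16 GiB node (4 KiB pages) */
	printf("score_adj = %ld\n", burstable_score_adj(1048576, 4194304));
	/* prints 750: the smaller the request, the closer to 1000 */
	return 0;
}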

>[...]
>
>> > But I am really open
>> >to be convinced otherwise and this is in fact what I have been asking
>> >for since the beginning. I would love to see some examples on the
>> >reasonable configuration for a practical usecase.
>> 
>> Here is a simple example. In a docker container, users can divide all processes 
>> into two categories (important and normal), and put them in different cgroups. 
>> One cgroup's oom.protect is set to "max", the other is set to "0". In this way, 
>> important processes in the container can be protected.
>
>That is effectively oom_score_adj = OOM_SCORE_ADJ_MIN - 1 for all
>processes in the important group. I would argue you can achieve a very
>similar result by having the process launcher set the oom_score_adj and
>let all processes in that important container inherit it. You do not
>need any memcg tunable for that.

Your method is not feasible. Please refer to the previous email
https://lore.kernel.org/linux-mm/E5A5BCC3-460E-4E81-8DD3-88B4A2868285@didiglobal.com/
* usecase 1: users say that they want to protect an important process
* with high memory consumption from being killed by the oom killer in
* case of docker container failure, so as to retain more critical
* on-site information or a self-recovery mechanism. At this time, they
* suggest setting the score_adj of this process to -1000, but I don't
* agree, because the docker container is not important to the other
* docker containers on the same physical machine. If the process's
* score_adj is set to -1000, the probability of oom in other containers'
* processes will increase.

>I am really much more interested in examples
>where the protection is to be fine-tuned.
程垲涛 Chengkaitao Cheng Dec. 19, 2022, 3:16 a.m. UTC | #10
Hi Michal Hocko,
Looking forward to your reply.

Michal Hocko Dec. 19, 2022, 12:06 p.m. UTC | #11
On Mon 19-12-22 03:16:33, 程垲涛 Chengkaitao Cheng wrote:
> Hi Michal Hocko,
> Looking forward to your reply.

I am sorry, I do not have anything to add to my previous concerns. But
let me summarize. I think your way of mixing per-memcg protection with
the per-process oom_score is very dubious. This is not an unfixable
problem. All you need to do is to discount all processes in the same
memcg equally. A bigger problem, though, is that I am not convinced the
memory-protection-based interface is really viable. Based on experience
with the existing reclaim protection interface, this is not a trivial
interface to use. You either have to have a good overview of the
working set size or you have to auto-tune it based on a feedback
mechanism (e.g. PSI). Auto-tuning based on oom, which should be a rare
event, is rather problematic I would say.

All that being said, I am not convinced that the interface is
practically usable, and you haven't really provided good examples to
prove me wrong.
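
For what it's worth, one way to read the "discount all processes in the
same memcg equally" suggestion is the following sketch (an
interpretation only, not code from this thread):

#include <stdio.h>

/*
 * Equal per-memcg discount: every task in the memcg gets the same
 * protection share, eoom_protect / nr_tasks, regardless of its own rss.
 */
static long equal_discount_badness(long points,
				   unsigned long memcg_eoom_protect,
				   unsigned long nr_tasks_in_memcg)
{
	if (nr_tasks_in_memcg)
		points -= memcg_eoom_protect / nr_tasks_in_memcg;
	return points;
}

int main(void)
{
	/* 3 GiB of protection (786432 pages) split among 3 tasks */
	printf("%ld\n", equal_discount_badness(524288, 786432, 3));
	return 0;
}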

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 74cec76be9f2..e816172c80a5 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1191,7 +1191,7 @@  PAGE_SIZE multiple when read back.
 	cgroup is within its effective low boundary, the cgroup's
 	memory won't be reclaimed unless there is no reclaimable
 	memory available in unprotected cgroups.
-	Above the effective low	boundary (or 
+	Above the effective low	boundary (or
 	effective min boundary if it is higher), pages are reclaimed
 	proportionally to the overage, reducing reclaim pressure for
 	smaller overages.
@@ -1292,6 +1292,24 @@  PAGE_SIZE multiple when read back.
 	to kill any tasks outside of this cgroup, regardless
 	memory.oom.group values of ancestor cgroups.
 
+  memory.oom.protect
+	A read-write single value file which exists on non-root
+	cgroups. The default value is "0".
+
+	If the OOM killer is invoked under the parent memory cgroup, and
+	the memory usage of a child cgroup is within its effective
+	oom.protect boundary, the cgroup's processes won't be OOM killed
+	unless there are no unprotected processes in other children
+	cgroups. The effective oom.protect boundary is distributed among
+	the processes in this cgroup in proportion to their actual usage;
+	this factor is taken into account when calculating the oom_score.
+	The effective oom.protect boundary is limited by the
+	memory.oom.protect values of all ancestor cgroups. If there is
+	memory.oom.protect overcommitment (child cgroups require more
+	protected memory than the parent will allow), then each child
+	cgroup gets a part of the parent's protection proportional to its
+	actual memory usage below memory.oom.protect.
+
   memory.events
 	A read-only flat-keyed file which exists on non-root cgroups.
 	The following entries are defined.  Unless specified
@@ -1891,7 +1909,7 @@  of the two is enforced.
 
 cgroup writeback requires explicit support from the underlying
 filesystem.  Currently, cgroup writeback is implemented on ext2, ext4,
-btrfs, f2fs, and xfs.  On other filesystems, all writeback IOs are 
+btrfs, f2fs, and xfs.  On other filesystems, all writeback IOs are
 attributed to the root cgroup.
 
 There are inherent differences in memory and writeback management
diff --git a/fs/proc/base.c b/fs/proc/base.c
index 9e479d7d202b..f169abcfbe21 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -552,8 +552,19 @@  static int proc_oom_score(struct seq_file *m, struct pid_namespace *ns,
 	unsigned long totalpages = totalram_pages() + total_swap_pages;
 	unsigned long points = 0;
 	long badness;
+#ifdef CONFIG_MEMCG
+	struct mem_cgroup *memcg;
 
-	badness = oom_badness(task, totalpages);
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(task);
+	if (memcg && !css_tryget(&memcg->css))
+		memcg = NULL;
+	rcu_read_unlock();
+
+	update_parent_oom_protection(root_mem_cgroup, memcg);
+	if (memcg) css_put(&memcg->css); /* memcg may be NULL here */
+#endif
+	badness = oom_badness(task, totalpages, MEMCG_OOM_PROTECT);
 	/*
 	 * Special case OOM_SCORE_ADJ_MIN for all others scale the
 	 * badness value into [0, 2000] range which we have been
@@ -2657,7 +2668,7 @@  static struct dentry *proc_pident_instantiate(struct dentry *dentry,
 	return d_splice_alias(inode, dentry);
 }
 
-static struct dentry *proc_pident_lookup(struct inode *dir, 
+static struct dentry *proc_pident_lookup(struct inode *dir,
 					 struct dentry *dentry,
 					 const struct pid_entry *p,
 					 const struct pid_entry *end)
@@ -2870,7 +2881,7 @@  static const struct pid_entry attr_dir_stuff[] = {
 
 static int proc_attr_dir_readdir(struct file *file, struct dir_context *ctx)
 {
-	return proc_pident_readdir(file, ctx, 
+	return proc_pident_readdir(file, ctx,
 				   attr_dir_stuff, ARRAY_SIZE(attr_dir_stuff));
 }
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index e1644a24009c..d1628b1859d4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -53,6 +53,11 @@  enum memcg_memory_event {
 	MEMCG_NR_MEMORY_EVENTS,
 };
 
+enum memcg_oom_evaluate {
+	MEMCG_OOM_EVALUATE_NONE,
+	MEMCG_OOM_PROTECT,
+};
+
 struct mem_cgroup_reclaim_cookie {
 	pg_data_t *pgdat;
 	unsigned int generation;
@@ -614,6 +619,14 @@  static inline void mem_cgroup_protection(struct mem_cgroup *root,
 
 void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 				     struct mem_cgroup *memcg);
+void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+				     struct mem_cgroup *memcg);
+void update_parent_oom_protection(struct mem_cgroup *root,
+				     struct mem_cgroup *memcg);
+unsigned long get_task_eoom_protect(struct task_struct *p, long points);
+struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
+struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
+bool is_root_oom_protect(void);
 
 static inline bool mem_cgroup_supports_protection(struct mem_cgroup *memcg)
 {
@@ -746,10 +759,6 @@  static inline struct lruvec *folio_lruvec(struct folio *folio)
 	return mem_cgroup_lruvec(memcg, folio_pgdat(folio));
 }
 
-struct mem_cgroup *mem_cgroup_from_task(struct task_struct *p);
-
-struct mem_cgroup *get_mem_cgroup_from_mm(struct mm_struct *mm);
-
 struct lruvec *folio_lruvec_lock(struct folio *folio);
 struct lruvec *folio_lruvec_lock_irq(struct folio *folio);
 struct lruvec *folio_lruvec_lock_irqsave(struct folio *folio,
@@ -805,6 +814,8 @@  struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
 void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
 int mem_cgroup_scan_tasks(struct mem_cgroup *,
 			  int (*)(struct task_struct *, void *), void *);
+int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+		int (*fn)(struct task_struct *, void *, int), void *arg);
 
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
@@ -1209,6 +1220,16 @@  static inline void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 {
 }
 
+static inline void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+						   struct mem_cgroup *memcg)
+{
+}
+
+static inline void update_parent_oom_protection(struct mem_cgroup *root,
+						struct mem_cgroup *memcg)
+{
+}
+
 static inline bool mem_cgroup_below_low(struct mem_cgroup *memcg)
 {
 	return false;
@@ -1219,6 +1240,16 @@  static inline bool mem_cgroup_below_min(struct mem_cgroup *memcg)
 	return false;
 }
 
+static inline unsigned long get_task_eoom_protect(struct task_struct *p, long points)
+{
+	return 0;
+}
+
+static inline bool is_root_oom_protect(void)
+{
+	return 0;
+}
+
 static inline int mem_cgroup_charge(struct folio *folio,
 		struct mm_struct *mm, gfp_t gfp)
 {
@@ -1338,6 +1369,12 @@  static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return 0;
 }
 
+static inline int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+		int (*fn)(struct task_struct *, void *, int), void *arg)
+{
+	return 0;
+}
+
 static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
 {
 	return 0;
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7d0c9c48a0c5..04b6daca5a9c 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -97,8 +97,7 @@  static inline vm_fault_t check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
-long oom_badness(struct task_struct *p,
-		unsigned long totalpages);
+long oom_badness(struct task_struct *p, unsigned long totalpages, int flag);
 
 extern bool out_of_memory(struct oom_control *oc);
 
diff --git a/include/linux/page_counter.h b/include/linux/page_counter.h
index c141ea9a95ef..d730a7373c1d 100644
--- a/include/linux/page_counter.h
+++ b/include/linux/page_counter.h
@@ -25,6 +25,10 @@  struct page_counter {
 	atomic_long_t low_usage;
 	atomic_long_t children_low_usage;
 
+	unsigned long eoom_protect;
+	atomic_long_t oom_protect_usage;
+	atomic_long_t children_oom_protect_usage;
+
 	unsigned long watermark;
 	unsigned long failcnt;
 
@@ -35,6 +39,7 @@  struct page_counter {
 	unsigned long low;
 	unsigned long high;
 	unsigned long max;
+	unsigned long oom_protect;
 	struct page_counter *parent;
 } ____cacheline_internodealigned_in_smp;
 
@@ -65,6 +70,7 @@  bool page_counter_try_charge(struct page_counter *counter,
 void page_counter_uncharge(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_min(struct page_counter *counter, unsigned long nr_pages);
 void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages);
+void page_counter_set_oom_protect(struct page_counter *counter, unsigned long nr_pages);
 
 static inline void page_counter_set_high(struct page_counter *counter,
 					 unsigned long nr_pages)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 23750cec0036..1da956a80d56 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1263,6 +1263,52 @@  int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
 	return ret;
 }
 
+/**
+ * mem_cgroup_scan_tasks_update_eoom - iterate over tasks of a memory cgroup
+ * hierarchy and update memcg's eoom_protect
+ * @memcg: hierarchy root
+ * @fn: function to call for each task
+ * @arg: argument passed to @fn
+ *
+ * This function iterates over tasks attached to @memcg or to any of its
+ * descendants and update all memcg's eoom_protect, then calls @fn for each
+ * task. If @fn returns a non-zero value, the function breaks the iteration
+ * loop and returns the value. Otherwise, it will iterate over all tasks and
+ * return 0.
+ *
+ * This function may be called for the root memory cgroup.
+ */
+int mem_cgroup_scan_tasks_update_eoom(struct mem_cgroup *memcg,
+		int (*fn)(struct task_struct *, void *, int), void *arg)
+{
+	struct mem_cgroup *iter;
+	int ret = 0;
+
+	for_each_mem_cgroup_tree(iter, memcg) {
+		struct css_task_iter it;
+		struct task_struct *task;
+
+		mem_cgroup_calculate_oom_protection(memcg, iter);
+		css_task_iter_start(&iter->css, CSS_TASK_ITER_PROCS, &it);
+		while (!ret && (task = css_task_iter_next(&it)))
+			ret = fn(task, arg, MEMCG_OOM_PROTECT);
+		css_task_iter_end(&it);
+		if (ret) {
+			mem_cgroup_iter_break(memcg, iter);
+			break;
+		}
+	}
+	return ret;
+}
+
+bool is_root_oom_protect(void)
+{
+	if (mem_cgroup_disabled())
+		return 0;
+
+	return !!atomic_long_read(&root_mem_cgroup->memory.children_oom_protect_usage);
+}
+
 #ifdef CONFIG_DEBUG_VM
 void lruvec_memcg_debug(struct lruvec *lruvec, struct folio *folio)
 {
@@ -6579,6 +6625,29 @@  static ssize_t memory_oom_group_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static int memory_oom_protect_show(struct seq_file *m, void *v)
+{
+	return seq_puts_memcg_tunable(m,
+		READ_ONCE(mem_cgroup_from_seq(m)->memory.oom_protect));
+}
+
+static ssize_t memory_oom_protect_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	unsigned long oom_protect;
+	int err;
+
+	buf = strstrip(buf);
+	err = page_counter_memparse(buf, "max", &oom_protect);
+	if (err)
+		return err;
+
+	page_counter_set_oom_protect(&memcg->memory, oom_protect);
+
+	return nbytes;
+}
+
 static ssize_t memory_reclaim(struct kernfs_open_file *of, char *buf,
 			      size_t nbytes, loff_t off)
 {
@@ -6684,6 +6753,12 @@  static struct cftype memory_files[] = {
 		.seq_show = memory_oom_group_show,
 		.write = memory_oom_group_write,
 	},
+	{
+		.name = "oom.protect",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = memory_oom_protect_show,
+		.write = memory_oom_protect_write,
+	},
 	{
 		.name = "reclaim",
 		.flags = CFTYPE_NS_DELEGATABLE,
@@ -6880,6 +6955,113 @@  void mem_cgroup_calculate_protection(struct mem_cgroup *root,
 			atomic_long_read(&parent->memory.children_low_usage)));
 }
 
+static void __mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+				     struct mem_cgroup *memcg)
+{
+	unsigned long usage, parent_usage;
+	struct mem_cgroup *parent;
+
+	usage = page_counter_read(&memcg->memory);
+	if (!usage)
+		return;
+
+	parent = parent_mem_cgroup(memcg);
+
+	if (parent == root) {
+		memcg->memory.eoom_protect = READ_ONCE(memcg->memory.oom_protect);
+		return;
+	}
+
+	parent_usage = page_counter_read(&parent->memory);
+
+	WRITE_ONCE(memcg->memory.eoom_protect, effective_protection(usage, parent_usage,
+			READ_ONCE(memcg->memory.oom_protect),
+			READ_ONCE(parent->memory.eoom_protect),
+			atomic_long_read(&parent->memory.children_oom_protect_usage)));
+}
+
+/**
+ * mem_cgroup_calculate_oom_protection - check if memory consumption is in the
+ * normal range of oom's protection
+ * @root: the top ancestor of the sub-tree being checked
+ * @memcg: the memory cgroup to check
+ *
+ * WARNING: This function is not stateless! It can only be used as part
+ *          of a top-down tree iteration, not for isolated queries.
+ */
+void mem_cgroup_calculate_oom_protection(struct mem_cgroup *root,
+				     struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!root)
+		root = root_mem_cgroup;
+
+	/*
+	 * Effective values of the reclaim targets are ignored so they
+	 * can be stale. Have a look at mem_cgroup_protection for more
+	 * details.
+	 * TODO: calculation should be more robust so that we do not need
+	 * that special casing.
+	 */
+	if (memcg == root)
+		return;
+
+	__mem_cgroup_calculate_oom_protection(root, memcg);
+}
+
+static void list_postorder_for_memcg_parent(
+		struct mem_cgroup *root, struct mem_cgroup *memcg,
+		void (*fn)(struct mem_cgroup *, struct mem_cgroup *))
+{
+	struct mem_cgroup *parent;
+
+	if (!memcg || memcg == root)
+		return;
+
+	parent = parent_mem_cgroup(memcg);
+	list_postorder_for_memcg_parent(root, parent, fn);
+	fn(root, memcg);
+}
+
+void update_parent_oom_protection(struct mem_cgroup *root,
+						struct mem_cgroup *memcg)
+{
+	if (mem_cgroup_disabled())
+		return;
+
+	if (!root)
+		root = root_mem_cgroup;
+
+	list_postorder_for_memcg_parent(root, memcg,
+			__mem_cgroup_calculate_oom_protection);
+}
+
+unsigned long get_task_eoom_protect(struct task_struct *p, long points)
+{
+	struct mem_cgroup *memcg;
+	unsigned long usage, eoom;
+
+	rcu_read_lock();
+	memcg = mem_cgroup_from_task(p);
+
+	if (!memcg || !mem_cgroup_supports_protection(memcg)) {
+		rcu_read_unlock();
+		return 0;
+	}
+
+	if (do_memsw_account())
+		usage = page_counter_read(&memcg->memsw);
+	else
+		usage = page_counter_read(&memcg->memory)
+			+ page_counter_read(&memcg->swap);
+	eoom = READ_ONCE(memcg->memory.eoom_protect) * points / usage;
+	rcu_read_unlock();
+
+	return eoom;
+}
+
 static int charge_memcg(struct folio *folio, struct mem_cgroup *memcg,
 			gfp_t gfp)
 {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 1276e49b31b0..d07b10778754 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -1,7 +1,7 @@ 
 // SPDX-License-Identifier: GPL-2.0-only
 /*
  *  linux/mm/oom_kill.c
- * 
+ *
  *  Copyright (C)  1998,2000  Rik van Riel
  *	Thanks go out to Claus Fischer for some serious inspiration and
  *	for goading me into coding this file...
@@ -193,15 +193,16 @@  static bool should_dump_unreclaim_slab(void)
  * oom_badness - heuristic function to determine which candidate task to kill
  * @p: task struct of which task we should calculate
  * @totalpages: total present RAM allowed for page allocation
+ * @flag: whether to apply the oom.protect discount (MEMCG_OOM_PROTECT)
  *
  * The heuristic for determining which task to kill is made to be as simple and
  * predictable as possible.  The goal is to return the highest value for the
  * task consuming the most memory to avoid subsequent oom failures.
  */
-long oom_badness(struct task_struct *p, unsigned long totalpages)
+long oom_badness(struct task_struct *p, unsigned long totalpages, int flag)
 {
-	long points;
-	long adj;
+	long points, adj, val = 0;
+	unsigned long shmem;
 
 	if (oom_unkillable_task(p))
 		return LONG_MIN;
@@ -229,11 +230,15 @@  long oom_badness(struct task_struct *p, unsigned long totalpages)
 	 */
 	points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
 		mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+
+	shmem = get_mm_counter(p->mm, MM_SHMEMPAGES);
 	task_unlock(p);
 
+	if (flag == MEMCG_OOM_PROTECT)
+		val = get_task_eoom_protect(p, points - shmem);
 	/* Normalize to oom_score_adj units */
 	adj *= totalpages / 1000;
-	points += adj;
+	points = points + adj - val;
 
 	return points;
 }
@@ -305,7 +310,7 @@  static enum oom_constraint constrained_alloc(struct oom_control *oc)
 	return CONSTRAINT_NONE;
 }
 
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+static int oom_evaluate_task(struct task_struct *task, void *arg, int flag)
 {
 	struct oom_control *oc = arg;
 	long points;
@@ -338,7 +343,7 @@  static int oom_evaluate_task(struct task_struct *task, void *arg)
 		goto select;
 	}
 
-	points = oom_badness(task, oc->totalpages);
+	points = oom_badness(task, oc->totalpages, flag);
 	if (points == LONG_MIN || points < oc->chosen_points)
 		goto next;
 
@@ -365,14 +370,14 @@  static void select_bad_process(struct oom_control *oc)
 {
 	oc->chosen_points = LONG_MIN;
 
-	if (is_memcg_oom(oc))
-		mem_cgroup_scan_tasks(oc->memcg, oom_evaluate_task, oc);
+	if (is_memcg_oom(oc) || is_root_oom_protect())
+		mem_cgroup_scan_tasks_update_eoom(oc->memcg, oom_evaluate_task, oc);
 	else {
 		struct task_struct *p;
 
 		rcu_read_lock();
 		for_each_process(p)
-			if (oom_evaluate_task(p, oc))
+			if (oom_evaluate_task(p, oc, MEMCG_OOM_EVALUATE_NONE))
 				break;
 		rcu_read_unlock();
 	}
diff --git a/mm/page_counter.c b/mm/page_counter.c
index db20d6452b71..43987cc59443 100644
--- a/mm/page_counter.c
+++ b/mm/page_counter.c
@@ -39,6 +39,15 @@  static void propagate_protected_usage(struct page_counter *c,
 		if (delta)
 			atomic_long_add(delta, &c->parent->children_low_usage);
 	}
+
+	protected = min(usage, READ_ONCE(c->oom_protect));
+	old_protected = atomic_long_read(&c->oom_protect_usage);
+	if (protected != old_protected) {
+		old_protected = atomic_long_xchg(&c->oom_protect_usage, protected);
+		delta = protected - old_protected;
+		if (delta)
+			atomic_long_add(delta, &c->parent->children_oom_protect_usage);
+	}
 }
 
 /**
@@ -234,6 +243,23 @@  void page_counter_set_low(struct page_counter *counter, unsigned long nr_pages)
 		propagate_protected_usage(c, atomic_long_read(&c->usage));
 }
 
+/**
+ * page_counter_set_oom_protect - set the amount of oom protected memory
+ * @counter: counter
+ * @nr_pages: value to set
+ *
+ * The caller must serialize invocations on the same counter.
+ */
+void page_counter_set_oom_protect(struct page_counter *counter, unsigned long nr_pages)
+{
+	struct page_counter *c;
+
+	WRITE_ONCE(counter->oom_protect, nr_pages);
+
+	for (c = counter; c; c = c->parent)
+		propagate_protected_usage(c, atomic_long_read(&c->usage));
+}
+
 /**
  * page_counter_memparse - memparse() for page counter limits
  * @buf: string to parse