
[RFC] mm, memcg: introduce memory.high.throttle

Message ID 20250129191204.368199-1-longman@redhat.com (mailing list archive)
State New

Commit Message

Waiman Long Jan. 29, 2025, 7:12 p.m. UTC
Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
reclaim over memory.high"), the amount of allocator throttling has
increased substantially. As a result, it can be difficult for a
misbehaving application that consumes an increasing amount of memory
to get OOM-killed if memory.high is set. Instead, the application may
just crawl along, holding close to the allowed memory.high amount of
memory for its memory cgroup for a very long time, especially if it
does a lot of memcg charging and uncharging operations.

This behavior makes the upstream Kubernetes community hesitant to
use memory.high. Instead, they use only memory.max for memory control,
similar to what is done for cgroup v1 [1].

To allow better control of the amount of throttling, and hence the
speed at which a misbehaving task can be OOM-killed, a new single-value
memory.high.throttle control file is added. The allowable range
is 0-32.  By default, it has a value of 0, which means maximum throttling
as before. Any positive value represents the corresponding power-of-2
reduction in throttling and makes OOM kills easier to happen.
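
As a rough illustration of the intended scaling, the toy program below
prints how a hypothetical base throttling delay shrinks as
memory.high.throttle grows. It is not the kernel code; the base delay
of 2000 jiffies and the helper name are made up for the example.

  /*
   * Illustrative sketch only, not the kernel implementation: each unit
   * of memory.high.throttle is assumed to halve the delay imposed on
   * allocators running over memory.high.
   */
  #include <stdio.h>

  static unsigned long scaled_delay(unsigned long base_delay,
                                    unsigned int throttle)      /* 0..32 */
  {
          /* Guard the shift so a setting of 32 is well defined everywhere. */
          return throttle >= 32 ? 0 : base_delay >> throttle;
  }

  int main(void)
  {
          unsigned long base = 2000;      /* hypothetical base penalty, in jiffies */

          for (unsigned int t = 0; t <= 8; t++)
                  printf("throttle=%u -> delay=%lu jiffies\n",
                         t, scaled_delay(base, t));
          return 0;
  }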

System administrators can now use this parameter to determine how
easily they want OOM kills to happen for applications that tend to
consume a lot of memory, without the need to run a special userspace
memory management tool to monitor memory consumption when memory.high
is set.

Below are the test results of a simple program showing how different
values of memory.high.throttle affect its run time (in seconds) until
it gets OOM-killed. This test program continuously allocates pages
from the kernel. There is some run-to-run variation, and the results
are just one possible set of samples.

  # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
	--wait -t timeout 300 /tmp/mmap-oom

  memory.high.throttle	service runtime
  --------------------	---------------
            0		    120.521
            1		    103.376
            2		     85.881
            3		     69.698
            4		     42.668
            5		     45.782
            6		     22.179
            7		      9.909
            8		      5.347
            9		      3.100
           10		      1.757
           11		      1.084
           12		      0.919
           13		      0.650
           14		      0.650
           15		      0.655

[1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
 include/linux/memcontrol.h              |  2 ++
 mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 2 deletions(-)

Comments

Yosry Ahmed Jan. 29, 2025, 8:10 p.m. UTC | #1
On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote:
> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
> reclaim over memory.high"), the amount of allocator throttling had
> increased substantially. As a result, it could be difficult for a
> misbehaving application that consumes increasing amount of memory from
> being OOM-killed if memory.high is set. Instead, the application may
> just be crawling along holding close to the allowed memory.high memory
> for the current memory cgroup for a very long time especially those
> that do a lot of memcg charging and uncharging operations.
> 
> This behavior makes the upstream Kubernetes community hesitate to
> use memory.high. Instead, they use only memory.max for memory control
> similar to what is being done for cgroup v1 [1].
> 
> To allow better control of the amount of throttling and hence the
> speed that a misbehving task can be OOM killed, a new single-value
> memory.high.throttle control file is now added. The allowable range
> is 0-32.  By default, it has a value of 0 which means maximum throttling
> like before. Any non-zero positive value represents the corresponding
> power of 2 reduction of throttling and makes OOM kills easier to happen.
> 
> System administrators can now use this parameter to determine how easy
> they want OOM kills to happen for applications that tend to consume
> a lot of memory without the need to run a special userspace memory
> management tool to monitor memory consumption when memory.high is set.
> 
> Below are the test results of a simple program showing how different
> values of memory.high.throttle can affect its run time (in secs) until
> it gets OOM killed. This test program allocates pages from kernel
> continuously. There are some run-to-run variations and the results
> are just one possible set of samples.
> 
>   # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
> 	--wait -t timeout 300 /tmp/mmap-oom
> 
>   memory.high.throttle	service runtime
>   --------------------	---------------
>             0		    120.521
>             1		    103.376
>             2		     85.881
>             3		     69.698
>             4		     42.668
>             5		     45.782
>             6		     22.179
>             7		      9.909
>             8		      5.347
>             9		      3.100
>            10		      1.757
>            11		      1.084
>            12		      0.919
>            13		      0.650
>            14		      0.650
>            15		      0.655
> 
> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0
> 
> Signed-off-by: Waiman Long <longman@redhat.com>
> ---
>  Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
>  include/linux/memcontrol.h              |  2 ++
>  mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
>  3 files changed, 57 insertions(+), 2 deletions(-)
> 
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index cb1b4e759b7e..df9410ad8b3b 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
>  	Going over the high limit never invokes the OOM killer and
>  	under extreme conditions the limit may be breached. The high
>  	limit should be used in scenarios where an external process
> -	monitors the limited cgroup to alleviate heavy reclaim
> -	pressure.
> +	monitors the limited cgroup to alleviate heavy reclaim pressure
> +	unless a high enough value is set in "memory.high.throttle".
> +
> +  memory.high.throttle
> +	A read-write single value file which exists on non-root
> +	cgroups.  The default is 0.
> +
> +	Memory usage throttle control.	This value controls the amount
> +	of throttling that will be applied when memory consumption
> +	exceeds the "memory.high" limit.  The larger the value is,
> +	the smaller the amount of throttling will be and the easier an
> +	offending application may get OOM killed.

memory.high is supposed to never invoke the OOM killer (see above). It's
unclear to me if you are referring to OOM kills from the kernel or
userspace in the commit message. If the latter, I think it shouldn't be
in kernel docs.
Michal Hocko Jan. 30, 2025, 8:15 a.m. UTC | #2
On Wed 29-01-25 14:12:04, Waiman Long wrote:
> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
> reclaim over memory.high"), the amount of allocator throttling had
> increased substantially. As a result, it could be difficult for a
> misbehaving application that consumes increasing amount of memory from
> being OOM-killed if memory.high is set. Instead, the application may
> just be crawling along holding close to the allowed memory.high memory
> for the current memory cgroup for a very long time especially those
> that do a lot of memcg charging and uncharging operations.
> 
> This behavior makes the upstream Kubernetes community hesitate to
> use memory.high. Instead, they use only memory.max for memory control
> similar to what is being done for cgroup v1 [1].

Why is this a problem for them?

> To allow better control of the amount of throttling and hence the
> speed that a misbehving task can be OOM killed, a new single-value
> memory.high.throttle control file is now added. The allowable range
> is 0-32.  By default, it has a value of 0 which means maximum throttling
> like before. Any non-zero positive value represents the corresponding
> power of 2 reduction of throttling and makes OOM kills easier to happen.

I do not like the interface to be honest. It exposes an implementation
detail and casts it into a user API. If we ever need to change the way
the throttling is implemented, this will stand in the way because
there will be applications depending on a behavior they were carefully
tuned to.

It is also not entirely clear how this is supposed to be used in
practice. How do people know what kind of value they should use?

> System administrators can now use this parameter to determine how easy
> they want OOM kills to happen for applications that tend to consume
> a lot of memory without the need to run a special userspace memory
> management tool to monitor memory consumption when memory.high is set.

Why can't they achieve the same with the existing events/metrics we
already provide? Most notably PSI, which is properly accounted when
a task is throttled due to memory.high throttling.
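
For reference, such a userspace agent can be fairly small. The sketch
below only polls a cgroup's memory.pressure file and logs when the
"full" avg10 stall exceeds a threshold; the cgroup path, the threshold
and the log-only action are arbitrary choices for illustration, not an
existing tool.

  #include <stdio.h>
  #include <unistd.h>

  int main(void)
  {
          /* Arbitrary example path and threshold. */
          const char *psi = "/sys/fs/cgroup/myservice/memory.pressure";
          const double threshold = 40.0;  /* "full" avg10 stall, in percent */
          char line[256];

          for (;;) {
                  FILE *f = fopen(psi, "r");

                  if (!f)
                          return 1;
                  while (fgets(line, sizeof(line), f)) {
                          double avg10;

                          /* memory.pressure has "some ..." and "full ..." lines. */
                          if (sscanf(line, "full avg10=%lf", &avg10) == 1 &&
                              avg10 > threshold)
                                  fprintf(stderr,
                                          "sustained memory stall: %.2f%%\n",
                                          avg10);
                  }
                  fclose(f);
                  sleep(10);
          }
  }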
Waiman Long Jan. 30, 2025, 2:52 p.m. UTC | #3
On 1/29/25 3:10 PM, Yosry Ahmed wrote:
> On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote:
>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
>> reclaim over memory.high"), the amount of allocator throttling had
>> increased substantially. As a result, it could be difficult for a
>> misbehaving application that consumes increasing amount of memory from
>> being OOM-killed if memory.high is set. Instead, the application may
>> just be crawling along holding close to the allowed memory.high memory
>> for the current memory cgroup for a very long time especially those
>> that do a lot of memcg charging and uncharging operations.
>>
>> This behavior makes the upstream Kubernetes community hesitate to
>> use memory.high. Instead, they use only memory.max for memory control
>> similar to what is being done for cgroup v1 [1].
>>
>> To allow better control of the amount of throttling and hence the
>> speed that a misbehving task can be OOM killed, a new single-value
>> memory.high.throttle control file is now added. The allowable range
>> is 0-32.  By default, it has a value of 0 which means maximum throttling
>> like before. Any non-zero positive value represents the corresponding
>> power of 2 reduction of throttling and makes OOM kills easier to happen.
>>
>> System administrators can now use this parameter to determine how easy
>> they want OOM kills to happen for applications that tend to consume
>> a lot of memory without the need to run a special userspace memory
>> management tool to monitor memory consumption when memory.high is set.
>>
>> Below are the test results of a simple program showing how different
>> values of memory.high.throttle can affect its run time (in secs) until
>> it gets OOM killed. This test program allocates pages from kernel
>> continuously. There are some run-to-run variations and the results
>> are just one possible set of samples.
>>
>>    # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
>> 	--wait -t timeout 300 /tmp/mmap-oom
>>
>>    memory.high.throttle	service runtime
>>    --------------------	---------------
>>              0		    120.521
>>              1		    103.376
>>              2		     85.881
>>              3		     69.698
>>              4		     42.668
>>              5		     45.782
>>              6		     22.179
>>              7		      9.909
>>              8		      5.347
>>              9		      3.100
>>             10		      1.757
>>             11		      1.084
>>             12		      0.919
>>             13		      0.650
>>             14		      0.650
>>             15		      0.655
>>
>> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0
>>
>> Signed-off-by: Waiman Long <longman@redhat.com>
>> ---
>>   Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
>>   include/linux/memcontrol.h              |  2 ++
>>   mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
>>   3 files changed, 57 insertions(+), 2 deletions(-)
>>
>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>> index cb1b4e759b7e..df9410ad8b3b 100644
>> --- a/Documentation/admin-guide/cgroup-v2.rst
>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
>>   	Going over the high limit never invokes the OOM killer and
>>   	under extreme conditions the limit may be breached. The high
>>   	limit should be used in scenarios where an external process
>> -	monitors the limited cgroup to alleviate heavy reclaim
>> -	pressure.
>> +	monitors the limited cgroup to alleviate heavy reclaim pressure
>> +	unless a high enough value is set in "memory.high.throttle".
>> +
>> +  memory.high.throttle
>> +	A read-write single value file which exists on non-root
>> +	cgroups.  The default is 0.
>> +
>> +	Memory usage throttle control.	This value controls the amount
>> +	of throttling that will be applied when memory consumption
>> +	exceeds the "memory.high" limit.  The larger the value is,
>> +	the smaller the amount of throttling will be and the easier an
>> +	offending application may get OOM killed.
> memory.high is supposed to never invoke the OOM killer (see above). It's
> unclear to me if you are referring to OOM kills from the kernel or
> userspace in the commit message. If the latter, I think it shouldn't be
> in kernel docs.

I am sorry for not being clear. What I meant is that if an application 
is consuming more memory than what can be recovered by memory reclaim, 
it will reach memory.max faster, if set, and get OOM killed. Will 
clarify that in the next version.

Cheers,
Longman
Waiman Long Jan. 30, 2025, 3:05 p.m. UTC | #4
On 1/30/25 3:15 AM, Michal Hocko wrote:
> On Wed 29-01-25 14:12:04, Waiman Long wrote:
>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
>> reclaim over memory.high"), the amount of allocator throttling had
>> increased substantially. As a result, it could be difficult for a
>> misbehaving application that consumes increasing amount of memory from
>> being OOM-killed if memory.high is set. Instead, the application may
>> just be crawling along holding close to the allowed memory.high memory
>> for the current memory cgroup for a very long time especially those
>> that do a lot of memcg charging and uncharging operations.
>>
>> This behavior makes the upstream Kubernetes community hesitate to
>> use memory.high. Instead, they use only memory.max for memory control
>> similar to what is being done for cgroup v1 [1].
> Why is this a problem for them?
My understanding is that a misbehaving container will hold on to
memory.high amount of memory for a long time instead of getting OOM
killed sooner so that the memory can be put to more productive use
elsewhere.
>
>> To allow better control of the amount of throttling and hence the
>> speed that a misbehving task can be OOM killed, a new single-value
>> memory.high.throttle control file is now added. The allowable range
>> is 0-32.  By default, it has a value of 0 which means maximum throttling
>> like before. Any non-zero positive value represents the corresponding
>> power of 2 reduction of throttling and makes OOM kills easier to happen.
> I do not like the interface to be honest. It exposes an implementation
> detail and casts it into a user API. If we ever need to change the way
> how the throttling is implemented this will stand in the way because
> there will be applications depending on a behavior they were carefuly
> tuned to.
>
> It is also not entirely sure how is this supposed to be used in
> practice? How do people what kind of value they should use?
Yes, I agree that a user may need to do some trial runs to find a 
proper value. Perhaps a simpler binary interface of "off" and "on" may 
be easier to understand and use.
>
>> System administrators can now use this parameter to determine how easy
>> they want OOM kills to happen for applications that tend to consume
>> a lot of memory without the need to run a special userspace memory
>> management tool to monitor memory consumption when memory.high is set.
> Why cannot they achieve the same with the existing events/metrics we
> already do provide? Most notably PSI which is properly accounted when
> a task is throttled due to memory.high throttling.

That will require the use of a userspace management agent that looks for 
these stalling conditions and makes the kill, if necessary. There are 
certainly users out there that want to get some benefit from using 
memory.high, like early memory reclaim, without the trouble of handling 
these kinds of stalling conditions.

Cheers,
Longman
Johannes Weiner Jan. 30, 2025, 4:39 p.m. UTC | #5
On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote:
> On 1/29/25 3:10 PM, Yosry Ahmed wrote:
> > On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote:
> >> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
> >> reclaim over memory.high"), the amount of allocator throttling had
> >> increased substantially. As a result, it could be difficult for a
> >> misbehaving application that consumes increasing amount of memory from
> >> being OOM-killed if memory.high is set. Instead, the application may
> >> just be crawling along holding close to the allowed memory.high memory
> >> for the current memory cgroup for a very long time especially those
> >> that do a lot of memcg charging and uncharging operations.
> >>
> >> This behavior makes the upstream Kubernetes community hesitate to
> >> use memory.high. Instead, they use only memory.max for memory control
> >> similar to what is being done for cgroup v1 [1].
> >>
> >> To allow better control of the amount of throttling and hence the
> >> speed that a misbehving task can be OOM killed, a new single-value
> >> memory.high.throttle control file is now added. The allowable range
> >> is 0-32.  By default, it has a value of 0 which means maximum throttling
> >> like before. Any non-zero positive value represents the corresponding
> >> power of 2 reduction of throttling and makes OOM kills easier to happen.
> >>
> >> System administrators can now use this parameter to determine how easy
> >> they want OOM kills to happen for applications that tend to consume
> >> a lot of memory without the need to run a special userspace memory
> >> management tool to monitor memory consumption when memory.high is set.
> >>
> >> Below are the test results of a simple program showing how different
> >> values of memory.high.throttle can affect its run time (in secs) until
> >> it gets OOM killed. This test program allocates pages from kernel
> >> continuously. There are some run-to-run variations and the results
> >> are just one possible set of samples.
> >>
> >>    # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
> >> 	--wait -t timeout 300 /tmp/mmap-oom
> >>
> >>    memory.high.throttle	service runtime
> >>    --------------------	---------------
> >>              0		    120.521
> >>              1		    103.376
> >>              2		     85.881
> >>              3		     69.698
> >>              4		     42.668
> >>              5		     45.782
> >>              6		     22.179
> >>              7		      9.909
> >>              8		      5.347
> >>              9		      3.100
> >>             10		      1.757
> >>             11		      1.084
> >>             12		      0.919
> >>             13		      0.650
> >>             14		      0.650
> >>             15		      0.655
> >>
> >> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0
> >>
> >> Signed-off-by: Waiman Long <longman@redhat.com>
> >> ---
> >>   Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
> >>   include/linux/memcontrol.h              |  2 ++
> >>   mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
> >>   3 files changed, 57 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> >> index cb1b4e759b7e..df9410ad8b3b 100644
> >> --- a/Documentation/admin-guide/cgroup-v2.rst
> >> +++ b/Documentation/admin-guide/cgroup-v2.rst
> >> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
> >>   	Going over the high limit never invokes the OOM killer and
> >>   	under extreme conditions the limit may be breached. The high
> >>   	limit should be used in scenarios where an external process
> >> -	monitors the limited cgroup to alleviate heavy reclaim
> >> -	pressure.
> >> +	monitors the limited cgroup to alleviate heavy reclaim pressure
> >> +	unless a high enough value is set in "memory.high.throttle".
> >> +
> >> +  memory.high.throttle
> >> +	A read-write single value file which exists on non-root
> >> +	cgroups.  The default is 0.
> >> +
> >> +	Memory usage throttle control.	This value controls the amount
> >> +	of throttling that will be applied when memory consumption
> >> +	exceeds the "memory.high" limit.  The larger the value is,
> >> +	the smaller the amount of throttling will be and the easier an
> >> +	offending application may get OOM killed.
> > memory.high is supposed to never invoke the OOM killer (see above). It's
> > unclear to me if you are referring to OOM kills from the kernel or
> > userspace in the commit message. If the latter, I think it shouldn't be
> > in kernel docs.
> 
> I am sorry for not being clear. What I meant is that if an application 
> is consuming more memory than what can be recovered by memory reclaim, 
> it will reach memory.max faster, if set, and get OOM killed. Will 
> clarify that in the next version.

You're not really supposed to use max and high in conjunction. One is
for kernel OOM killing, the other for userspace OOM killing. That's
what the documentation that you edited is trying to explain.

What's the usecase you have in mind?
Roman Gushchin Jan. 30, 2025, 5:05 p.m. UTC | #6
On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote:
> On 1/30/25 3:15 AM, Michal Hocko wrote:
> > On Wed 29-01-25 14:12:04, Waiman Long wrote:
> > > Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
> > > reclaim over memory.high"), the amount of allocator throttling had
> > > increased substantially. As a result, it could be difficult for a
> > > misbehaving application that consumes increasing amount of memory from
> > > being OOM-killed if memory.high is set. Instead, the application may
> > > just be crawling along holding close to the allowed memory.high memory
> > > for the current memory cgroup for a very long time especially those
> > > that do a lot of memcg charging and uncharging operations.
> > > 
> > > This behavior makes the upstream Kubernetes community hesitate to
> > > use memory.high. Instead, they use only memory.max for memory control
> > > similar to what is being done for cgroup v1 [1].
> > Why is this a problem for them?
> My understanding is that a mishaving container will hold up memory.high
> amount of memory for a long time instead of getting OOM killed sooner and be
> more productively used elsewhere.
> > 
> > > To allow better control of the amount of throttling and hence the
> > > speed that a misbehving task can be OOM killed, a new single-value
> > > memory.high.throttle control file is now added. The allowable range
> > > is 0-32.  By default, it has a value of 0 which means maximum throttling
> > > like before. Any non-zero positive value represents the corresponding
> > > power of 2 reduction of throttling and makes OOM kills easier to happen.
> > I do not like the interface to be honest. It exposes an implementation
> > detail and casts it into a user API. If we ever need to change the way
> > how the throttling is implemented this will stand in the way because
> > there will be applications depending on a behavior they were carefuly
> > tuned to.
> > 
> > It is also not entirely sure how is this supposed to be used in
> > practice? How do people what kind of value they should use?
> Yes, I agree that a user may need to run some trial runs to find a proper
> value. Perhaps a simpler binary interface of "off" and "on" may be easier to
> understand and use.
> > 
> > > System administrators can now use this parameter to determine how easy
> > > they want OOM kills to happen for applications that tend to consume
> > > a lot of memory without the need to run a special userspace memory
> > > management tool to monitor memory consumption when memory.high is set.
> > Why cannot they achieve the same with the existing events/metrics we
> > already do provide? Most notably PSI which is properly accounted when
> > a task is throttled due to memory.high throttling.
> 
> That will require the use of a userspace management agent that looks for
> these stalling conditions and make the kill, if necessary. There are
> certainly users out there that want to get some benefit of using memory.high
> like early memory reclaim without the trouble of handling these kind of
> stalling conditions.

So you basically want to force the workload into some sort of proactive
reclaim, but without an artificial slowdown?
It makes some sense to me, but
1) Idk if it deserves a new API, because it can be relatively easily implemented
  in userspace by a daemon which monitors cgroup usage and reclaims the memory
  if necessary (see the sketch after this list). No kernel changes are needed.
2) If a new API is introduced, I think it's better to introduce a new limit,
  e.g. memory.target, keeping memory.high semantics intact.
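
A rough sketch of 1) is below: a loop that reads memory.current and
writes the excess over a chosen target to memory.reclaim. The cgroup
path and the target value are arbitrary examples, and a real daemon
would need error handling and rate limiting.

  #include <stdio.h>
  #include <unistd.h>

  #define CG "/sys/fs/cgroup/myservice"           /* arbitrary example cgroup */

  static long long read_current(void)
  {
          long long val = -1;
          FILE *f = fopen(CG "/memory.current", "r");

          if (f) {
                  if (fscanf(f, "%lld", &val) != 1)
                          val = -1;
                  fclose(f);
          }
          return val;
  }

  int main(void)
  {
          const long long target = 512LL << 20;   /* 512M, arbitrary */

          for (;;) {
                  long long cur = read_current();

                  if (cur > target) {
                          FILE *f = fopen(CG "/memory.reclaim", "w");

                          if (f) {
                                  /* Ask the kernel to reclaim the excess (cgroup v2). */
                                  fprintf(f, "%lld", cur - target);
                                  fclose(f);
                          }
                  }
                  sleep(5);
          }
  }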

Thanks!
Waiman Long Jan. 30, 2025, 5:07 p.m. UTC | #7
On 1/30/25 11:39 AM, Johannes Weiner wrote:
> On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote:
>> On 1/29/25 3:10 PM, Yosry Ahmed wrote:
>>> On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote:
>>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
>>>> reclaim over memory.high"), the amount of allocator throttling had
>>>> increased substantially. As a result, it could be difficult for a
>>>> misbehaving application that consumes increasing amount of memory from
>>>> being OOM-killed if memory.high is set. Instead, the application may
>>>> just be crawling along holding close to the allowed memory.high memory
>>>> for the current memory cgroup for a very long time especially those
>>>> that do a lot of memcg charging and uncharging operations.
>>>>
>>>> This behavior makes the upstream Kubernetes community hesitate to
>>>> use memory.high. Instead, they use only memory.max for memory control
>>>> similar to what is being done for cgroup v1 [1].
>>>>
>>>> To allow better control of the amount of throttling and hence the
>>>> speed that a misbehving task can be OOM killed, a new single-value
>>>> memory.high.throttle control file is now added. The allowable range
>>>> is 0-32.  By default, it has a value of 0 which means maximum throttling
>>>> like before. Any non-zero positive value represents the corresponding
>>>> power of 2 reduction of throttling and makes OOM kills easier to happen.
>>>>
>>>> System administrators can now use this parameter to determine how easy
>>>> they want OOM kills to happen for applications that tend to consume
>>>> a lot of memory without the need to run a special userspace memory
>>>> management tool to monitor memory consumption when memory.high is set.
>>>>
>>>> Below are the test results of a simple program showing how different
>>>> values of memory.high.throttle can affect its run time (in secs) until
>>>> it gets OOM killed. This test program allocates pages from kernel
>>>> continuously. There are some run-to-run variations and the results
>>>> are just one possible set of samples.
>>>>
>>>>     # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
>>>> 	--wait -t timeout 300 /tmp/mmap-oom
>>>>
>>>>     memory.high.throttle	service runtime
>>>>     --------------------	---------------
>>>>               0		    120.521
>>>>               1		    103.376
>>>>               2		     85.881
>>>>               3		     69.698
>>>>               4		     42.668
>>>>               5		     45.782
>>>>               6		     22.179
>>>>               7		      9.909
>>>>               8		      5.347
>>>>               9		      3.100
>>>>              10		      1.757
>>>>              11		      1.084
>>>>              12		      0.919
>>>>              13		      0.650
>>>>              14		      0.650
>>>>              15		      0.655
>>>>
>>>> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0
>>>>
>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>> ---
>>>>    Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
>>>>    include/linux/memcontrol.h              |  2 ++
>>>>    mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
>>>>    3 files changed, 57 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>>>> index cb1b4e759b7e..df9410ad8b3b 100644
>>>> --- a/Documentation/admin-guide/cgroup-v2.rst
>>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>>>> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
>>>>    	Going over the high limit never invokes the OOM killer and
>>>>    	under extreme conditions the limit may be breached. The high
>>>>    	limit should be used in scenarios where an external process
>>>> -	monitors the limited cgroup to alleviate heavy reclaim
>>>> -	pressure.
>>>> +	monitors the limited cgroup to alleviate heavy reclaim pressure
>>>> +	unless a high enough value is set in "memory.high.throttle".
>>>> +
>>>> +  memory.high.throttle
>>>> +	A read-write single value file which exists on non-root
>>>> +	cgroups.  The default is 0.
>>>> +
>>>> +	Memory usage throttle control.	This value controls the amount
>>>> +	of throttling that will be applied when memory consumption
>>>> +	exceeds the "memory.high" limit.  The larger the value is,
>>>> +	the smaller the amount of throttling will be and the easier an
>>>> +	offending application may get OOM killed.
>>> memory.high is supposed to never invoke the OOM killer (see above). It's
>>> unclear to me if you are referring to OOM kills from the kernel or
>>> userspace in the commit message. If the latter, I think it shouldn't be
>>> in kernel docs.
>> I am sorry for not being clear. What I meant is that if an application
>> is consuming more memory than what can be recovered by memory reclaim,
>> it will reach memory.max faster, if set, and get OOM killed. Will
>> clarify that in the next version.
> You're not really supposed to use max and high in conjunction. One is
> for kernel OOM killing, the other for userspace OOM killing. That's
> what the documentation that you edited is trying to explain.
>
> What's the usecase you have in mind?

That is news to me that high and max are not supposed to be used 
together. One problem with v1 is that by the time the limit is reached 
and memory reclaim is not able to recover enough memory in time, the 
task will be OOM killed. I always thought that by setting high to a bit 
below max, say 90%, early memory reclaim would reduce the chance of OOM 
kills. There are certainly others who think like that.

So the use case here is to reduce the chance of OOM kills without 
letting really misbehaving tasks hold up useful memory for too long.

Cheers,
Longman
Waiman Long Jan. 30, 2025, 5:19 p.m. UTC | #8
On 1/30/25 12:05 PM, Roman Gushchin wrote:
> On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote:
>> On 1/30/25 3:15 AM, Michal Hocko wrote:
>>> On Wed 29-01-25 14:12:04, Waiman Long wrote:
>>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
>>>> reclaim over memory.high"), the amount of allocator throttling had
>>>> increased substantially. As a result, it could be difficult for a
>>>> misbehaving application that consumes increasing amount of memory from
>>>> being OOM-killed if memory.high is set. Instead, the application may
>>>> just be crawling along holding close to the allowed memory.high memory
>>>> for the current memory cgroup for a very long time especially those
>>>> that do a lot of memcg charging and uncharging operations.
>>>>
>>>> This behavior makes the upstream Kubernetes community hesitate to
>>>> use memory.high. Instead, they use only memory.max for memory control
>>>> similar to what is being done for cgroup v1 [1].
>>> Why is this a problem for them?
>> My understanding is that a mishaving container will hold up memory.high
>> amount of memory for a long time instead of getting OOM killed sooner and be
>> more productively used elsewhere.
>>>> To allow better control of the amount of throttling and hence the
>>>> speed that a misbehving task can be OOM killed, a new single-value
>>>> memory.high.throttle control file is now added. The allowable range
>>>> is 0-32.  By default, it has a value of 0 which means maximum throttling
>>>> like before. Any non-zero positive value represents the corresponding
>>>> power of 2 reduction of throttling and makes OOM kills easier to happen.
>>> I do not like the interface to be honest. It exposes an implementation
>>> detail and casts it into a user API. If we ever need to change the way
>>> how the throttling is implemented this will stand in the way because
>>> there will be applications depending on a behavior they were carefuly
>>> tuned to.
>>>
>>> It is also not entirely sure how is this supposed to be used in
>>> practice? How do people what kind of value they should use?
>> Yes, I agree that a user may need to run some trial runs to find a proper
>> value. Perhaps a simpler binary interface of "off" and "on" may be easier to
>> understand and use.
>>>> System administrators can now use this parameter to determine how easy
>>>> they want OOM kills to happen for applications that tend to consume
>>>> a lot of memory without the need to run a special userspace memory
>>>> management tool to monitor memory consumption when memory.high is set.
>>> Why cannot they achieve the same with the existing events/metrics we
>>> already do provide? Most notably PSI which is properly accounted when
>>> a task is throttled due to memory.high throttling.
>> That will require the use of a userspace management agent that looks for
>> these stalling conditions and make the kill, if necessary. There are
>> certainly users out there that want to get some benefit of using memory.high
>> like early memory reclaim without the trouble of handling these kind of
>> stalling conditions.
> So you basically want to force the workload into some sort of a proactive
> reclaim but without an artificial slow down?
> It makes some sense to me, but
> 1) Idk if it deserves a new API, because it can be relatively easy implemented
>    in userspace by a daemon which monitors cgroups usage and reclaims the memory
>    if necessarily. No kernel changes are needed.
> 2) If new API is introduced, I think it's better to introduce a new limit,
>    e.g. memory.target, keeping memory.high semantics intact.

Yes, you are right about that. Introducing a new "memory.target" without 
disturbing the existing "memory.high" semantics will work for me too.

Cheers,
Longman
Shakeel Butt Jan. 30, 2025, 5:32 p.m. UTC | #9
On Thu, Jan 30, 2025 at 12:19:38PM -0500, Waiman Long wrote:
> On 1/30/25 12:05 PM, Roman Gushchin wrote:
> > On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote:
> > > On 1/30/25 3:15 AM, Michal Hocko wrote:
> > > > On Wed 29-01-25 14:12:04, Waiman Long wrote:
> > > > > Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
> > > > > reclaim over memory.high"), the amount of allocator throttling had
> > > > > increased substantially. As a result, it could be difficult for a
> > > > > misbehaving application that consumes increasing amount of memory from
> > > > > being OOM-killed if memory.high is set. Instead, the application may
> > > > > just be crawling along holding close to the allowed memory.high memory
> > > > > for the current memory cgroup for a very long time especially those
> > > > > that do a lot of memcg charging and uncharging operations.
> > > > > 
> > > > > This behavior makes the upstream Kubernetes community hesitate to
> > > > > use memory.high. Instead, they use only memory.max for memory control
> > > > > similar to what is being done for cgroup v1 [1].
> > > > Why is this a problem for them?
> > > My understanding is that a mishaving container will hold up memory.high
> > > amount of memory for a long time instead of getting OOM killed sooner and be
> > > more productively used elsewhere.
> > > > > To allow better control of the amount of throttling and hence the
> > > > > speed that a misbehving task can be OOM killed, a new single-value
> > > > > memory.high.throttle control file is now added. The allowable range
> > > > > is 0-32.  By default, it has a value of 0 which means maximum throttling
> > > > > like before. Any non-zero positive value represents the corresponding
> > > > > power of 2 reduction of throttling and makes OOM kills easier to happen.
> > > > I do not like the interface to be honest. It exposes an implementation
> > > > detail and casts it into a user API. If we ever need to change the way
> > > > how the throttling is implemented this will stand in the way because
> > > > there will be applications depending on a behavior they were carefuly
> > > > tuned to.
> > > > 
> > > > It is also not entirely sure how is this supposed to be used in
> > > > practice? How do people what kind of value they should use?
> > > Yes, I agree that a user may need to run some trial runs to find a proper
> > > value. Perhaps a simpler binary interface of "off" and "on" may be easier to
> > > understand and use.
> > > > > System administrators can now use this parameter to determine how easy
> > > > > they want OOM kills to happen for applications that tend to consume
> > > > > a lot of memory without the need to run a special userspace memory
> > > > > management tool to monitor memory consumption when memory.high is set.
> > > > Why cannot they achieve the same with the existing events/metrics we
> > > > already do provide? Most notably PSI which is properly accounted when
> > > > a task is throttled due to memory.high throttling.
> > > That will require the use of a userspace management agent that looks for
> > > these stalling conditions and make the kill, if necessary. There are
> > > certainly users out there that want to get some benefit of using memory.high
> > > like early memory reclaim without the trouble of handling these kind of
> > > stalling conditions.
> > So you basically want to force the workload into some sort of a proactive
> > reclaim but without an artificial slow down?

I wouldn't call it a proactive reclaim as reclaim will happen
synchronously in the allocating thread.

> > It makes some sense to me, but
> > 1) Idk if it deserves a new API, because it can be relatively easy implemented
> >    in userspace by a daemon which monitors cgroups usage and reclaims the memory
> >    if necessarily. No kernel changes are needed.
> > 2) If new API is introduced, I think it's better to introduce a new limit,
> >    e.g. memory.target, keeping memory.high semantics intact.
> 
> Yes, you are right about that. Introducing a new "memory.target" without
> disturbing the existing "memory.high" semantics will work for me too.
> 

So, what happens if reclaim can not reduce usage below memory.target?
Infinite reclaim cycles or just give up?

> Cheers,
> Longman
>
Waiman Long Jan. 30, 2025, 5:41 p.m. UTC | #10
On 1/30/25 12:32 PM, Shakeel Butt wrote:
> On Thu, Jan 30, 2025 at 12:19:38PM -0500, Waiman Long wrote:
>> On 1/30/25 12:05 PM, Roman Gushchin wrote:
>>> On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote:
>>>> On 1/30/25 3:15 AM, Michal Hocko wrote:
>>>>> On Wed 29-01-25 14:12:04, Waiman Long wrote:
>>>>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
>>>>>> reclaim over memory.high"), the amount of allocator throttling had
>>>>>> increased substantially. As a result, it could be difficult for a
>>>>>> misbehaving application that consumes increasing amount of memory from
>>>>>> being OOM-killed if memory.high is set. Instead, the application may
>>>>>> just be crawling along holding close to the allowed memory.high memory
>>>>>> for the current memory cgroup for a very long time especially those
>>>>>> that do a lot of memcg charging and uncharging operations.
>>>>>>
>>>>>> This behavior makes the upstream Kubernetes community hesitate to
>>>>>> use memory.high. Instead, they use only memory.max for memory control
>>>>>> similar to what is being done for cgroup v1 [1].
>>>>> Why is this a problem for them?
>>>> My understanding is that a mishaving container will hold up memory.high
>>>> amount of memory for a long time instead of getting OOM killed sooner and be
>>>> more productively used elsewhere.
>>>>>> To allow better control of the amount of throttling and hence the
>>>>>> speed that a misbehving task can be OOM killed, a new single-value
>>>>>> memory.high.throttle control file is now added. The allowable range
>>>>>> is 0-32.  By default, it has a value of 0 which means maximum throttling
>>>>>> like before. Any non-zero positive value represents the corresponding
>>>>>> power of 2 reduction of throttling and makes OOM kills easier to happen.
>>>>> I do not like the interface to be honest. It exposes an implementation
>>>>> detail and casts it into a user API. If we ever need to change the way
>>>>> how the throttling is implemented this will stand in the way because
>>>>> there will be applications depending on a behavior they were carefuly
>>>>> tuned to.
>>>>>
>>>>> It is also not entirely sure how is this supposed to be used in
>>>>> practice? How do people what kind of value they should use?
>>>> Yes, I agree that a user may need to run some trial runs to find a proper
>>>> value. Perhaps a simpler binary interface of "off" and "on" may be easier to
>>>> understand and use.
>>>>>> System administrators can now use this parameter to determine how easy
>>>>>> they want OOM kills to happen for applications that tend to consume
>>>>>> a lot of memory without the need to run a special userspace memory
>>>>>> management tool to monitor memory consumption when memory.high is set.
>>>>> Why cannot they achieve the same with the existing events/metrics we
>>>>> already do provide? Most notably PSI which is properly accounted when
>>>>> a task is throttled due to memory.high throttling.
>>>> That will require the use of a userspace management agent that looks for
>>>> these stalling conditions and make the kill, if necessary. There are
>>>> certainly users out there that want to get some benefit of using memory.high
>>>> like early memory reclaim without the trouble of handling these kind of
>>>> stalling conditions.
>>> So you basically want to force the workload into some sort of a proactive
>>> reclaim but without an artificial slow down?
> I wouldn't call it a proactive reclaim as reclaim will happen
> synchronously in allocating thread.
>
>>> It makes some sense to me, but
>>> 1) Idk if it deserves a new API, because it can be relatively easy implemented
>>>     in userspace by a daemon which monitors cgroups usage and reclaims the memory
>>>     if necessarily. No kernel changes are needed.
>>> 2) If new API is introduced, I think it's better to introduce a new limit,
>>>     e.g. memory.target, keeping memory.high semantics intact.
>> Yes, you are right about that. Introducing a new "memory.target" without
>> disturbing the existing "memory.high" semantics will work for me too.
>>
> So, what happens if reclaim can not reduce usage below memory.target?
> Infinite reclaim cycles or just give up?

Just give up in this case. It is used mainly to reduce the chance of 
reaching max and causing an OOM kill.

Cheers,
Longman
Michal Hocko Jan. 30, 2025, 5:46 p.m. UTC | #11
On Thu 30-01-25 12:19:38, Waiman Long wrote:
> On 1/30/25 12:05 PM, Roman Gushchin wrote:
> > On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote:
[...]
> > > > Why cannot they achieve the same with the existing events/metrics we
> > > > already do provide? Most notably PSI which is properly accounted when
> > > > a task is throttled due to memory.high throttling.
> > > That will require the use of a userspace management agent that looks for
> > > these stalling conditions and make the kill, if necessary. There are
> > > certainly users out there that want to get some benefit of using memory.high
> > > like early memory reclaim without the trouble of handling these kind of
> > > stalling conditions.

Could you expand more on that? As long as the memory is reasonably
reclaimable, then the hard (max) limit is exactly what you need. If you
want to implement OOM policy in userspace, you have the high limit to get
notifications about memory pressure. Why is this insufficient?

> > So you basically want to force the workload into some sort of a proactive
> > reclaim but without an artificial slow down?
> > It makes some sense to me, but
> > 1) Idk if it deserves a new API, because it can be relatively easy implemented
> >    in userspace by a daemon which monitors cgroups usage and reclaims the memory
> >    if necessarily. No kernel changes are needed.
> > 2) If new API is introduced, I think it's better to introduce a new limit,
> >    e.g. memory.target, keeping memory.high semantics intact.
> 
> Yes, you are right about that. Introducing a new "memory.target" without
> disturbing the existing "memory.high" semantics will work for me too.

I thought those use cases want to kill misbehaving containers rather than
burn time trying to reclaim. I do not understand how such a new
limit will help achieve that.
Johannes Weiner Jan. 30, 2025, 8:19 p.m. UTC | #12
On Thu, Jan 30, 2025 at 12:07:31PM -0500, Waiman Long wrote:
> On 1/30/25 11:39 AM, Johannes Weiner wrote:
> > On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote:
> >> On 1/29/25 3:10 PM, Yosry Ahmed wrote:
> >>> On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote:
> >>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
> >>>> reclaim over memory.high"), the amount of allocator throttling had
> >>>> increased substantially. As a result, it could be difficult for a
> >>>> misbehaving application that consumes increasing amount of memory from
> >>>> being OOM-killed if memory.high is set. Instead, the application may
> >>>> just be crawling along holding close to the allowed memory.high memory
> >>>> for the current memory cgroup for a very long time especially those
> >>>> that do a lot of memcg charging and uncharging operations.
> >>>>
> >>>> This behavior makes the upstream Kubernetes community hesitate to
> >>>> use memory.high. Instead, they use only memory.max for memory control
> >>>> similar to what is being done for cgroup v1 [1].
> >>>>
> >>>> To allow better control of the amount of throttling and hence the
> >>>> speed that a misbehving task can be OOM killed, a new single-value
> >>>> memory.high.throttle control file is now added. The allowable range
> >>>> is 0-32.  By default, it has a value of 0 which means maximum throttling
> >>>> like before. Any non-zero positive value represents the corresponding
> >>>> power of 2 reduction of throttling and makes OOM kills easier to happen.
> >>>>
> >>>> System administrators can now use this parameter to determine how easy
> >>>> they want OOM kills to happen for applications that tend to consume
> >>>> a lot of memory without the need to run a special userspace memory
> >>>> management tool to monitor memory consumption when memory.high is set.
> >>>>
> >>>> Below are the test results of a simple program showing how different
> >>>> values of memory.high.throttle can affect its run time (in secs) until
> >>>> it gets OOM killed. This test program allocates pages from kernel
> >>>> continuously. There are some run-to-run variations and the results
> >>>> are just one possible set of samples.
> >>>>
> >>>>     # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
> >>>> 	--wait -t timeout 300 /tmp/mmap-oom
> >>>>
> >>>>     memory.high.throttle	service runtime
> >>>>     --------------------	---------------
> >>>>               0		    120.521
> >>>>               1		    103.376
> >>>>               2		     85.881
> >>>>               3		     69.698
> >>>>               4		     42.668
> >>>>               5		     45.782
> >>>>               6		     22.179
> >>>>               7		      9.909
> >>>>               8		      5.347
> >>>>               9		      3.100
> >>>>              10		      1.757
> >>>>              11		      1.084
> >>>>              12		      0.919
> >>>>              13		      0.650
> >>>>              14		      0.650
> >>>>              15		      0.655
> >>>>
> >>>> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0
> >>>>
> >>>> Signed-off-by: Waiman Long <longman@redhat.com>
> >>>> ---
> >>>>    Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
> >>>>    include/linux/memcontrol.h              |  2 ++
> >>>>    mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
> >>>>    3 files changed, 57 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> >>>> index cb1b4e759b7e..df9410ad8b3b 100644
> >>>> --- a/Documentation/admin-guide/cgroup-v2.rst
> >>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
> >>>> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
> >>>>    	Going over the high limit never invokes the OOM killer and
> >>>>    	under extreme conditions the limit may be breached. The high
> >>>>    	limit should be used in scenarios where an external process
> >>>> -	monitors the limited cgroup to alleviate heavy reclaim
> >>>> -	pressure.
> >>>> +	monitors the limited cgroup to alleviate heavy reclaim pressure
> >>>> +	unless a high enough value is set in "memory.high.throttle".
> >>>> +
> >>>> +  memory.high.throttle
> >>>> +	A read-write single value file which exists on non-root
> >>>> +	cgroups.  The default is 0.
> >>>> +
> >>>> +	Memory usage throttle control.	This value controls the amount
> >>>> +	of throttling that will be applied when memory consumption
> >>>> +	exceeds the "memory.high" limit.  The larger the value is,
> >>>> +	the smaller the amount of throttling will be and the easier an
> >>>> +	offending application may get OOM killed.
> >>> memory.high is supposed to never invoke the OOM killer (see above). It's
> >>> unclear to me if you are referring to OOM kills from the kernel or
> >>> userspace in the commit message. If the latter, I think it shouldn't be
> >>> in kernel docs.
> >> I am sorry for not being clear. What I meant is that if an application
> >> is consuming more memory than what can be recovered by memory reclaim,
> >> it will reach memory.max faster, if set, and get OOM killed. Will
> >> clarify that in the next version.
> > You're not really supposed to use max and high in conjunction. One is
> > for kernel OOM killing, the other for userspace OOM killing. That's
> > what the documentation that you edited is trying to explain.
> >
> > What's the usecase you have in mind?
> 
> That is new to me that high and max are not supposed to be used 
> together. One problem with v1 is that by the time the limit is reached 
> and memory reclaim is not able to recover enough memory in time, the 
> task will be OOM killed. I always thought that by setting high to a bit 
> below max, say 90%, early memory reclaim will reduce the chance of OOM 
> kills. There are certainly others that think like that.

I can't fault you or them for this, because this was the original plan
for these knobs. However, this didn't end up working in practice.

If you have a non-throttling, non-killing limit, then reclaim will
either work and keep the workload to that limit; or it won't work, and
the workload escapes to the hard limit and gets killed.

You'll notice you get the same behavior with just memory.max set by
itself - either reclaim can keep up, or OOM is triggered.

> So the use case here is to reduce the chance of OOM kills without 
> letting really mishaving tasks from holding up useful memory for too long.

That brings us to the idea of a medium amount of throttling.

The premise would be that, by throttling *to a certain degree*, you
can slow the workload down just enough to tide over the pressure peak
and avert the OOM kill.

This assumes that some tasks inside the cgroup can independently make
forward progress and release memory, while allocating tasks inside the
group are already throttled.

[ Keep in mind, it's a cgroup-internal limit, so no memory freeing
  outside of the group can alleviate the situation. Progress must
  happen from within the cgroup. ]

But this sort of parallelism in a pressured cgroup is unlikely in
practice. By the time reclaim fails, usually *every task* in the
cgroup ends up having to allocate. Because they lose executables to
cache reclaim, or heap memory to swap etc, and then page fault.

We found that more often than not, it just deteriorates into a single
sequence of events. Slowing it down just drags out the inevitable.

As a result we eventually moved away from the idea of gradual
throttling. The last remnants of this idea finally disappeared from
the docs last year (commit 5647e53f7856bb39dae781fe26aa65a699e2fc9f).

memory.high now effectively puts the cgroup to sleep when reclaim
fails (similar to oom killer disabling in v1, but without the caveats
of that implementation). This is useful to let userspace implement
custom OOM killing policies.
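
A custom policy of that sort can be as simple as the sketch below: watch
the cgroup's "high" event counter in memory.events and kill the group
via cgroup.kill once it keeps climbing. The path, the interval and the
crude threshold heuristic are arbitrary illustrations, not a recommended
policy.

  #include <stdio.h>
  #include <unistd.h>

  #define CG "/sys/fs/cgroup/myservice"   /* arbitrary example cgroup */

  static unsigned long long high_events(void)
  {
          unsigned long long val = 0;
          char line[128];
          FILE *f = fopen(CG "/memory.events", "r");

          if (!f)
                  return 0;
          while (fgets(line, sizeof(line), f))
                  if (sscanf(line, "high %llu", &val) == 1)
                          break;
          fclose(f);
          return val;
  }

  int main(void)
  {
          unsigned long long prev = high_events();

          for (;;) {
                  sleep(30);
                  unsigned long long cur = high_events();

                  /* Crude policy: kill if the group kept breaching memory.high. */
                  if (cur > prev + 1000) {
                          FILE *f = fopen(CG "/cgroup.kill", "w");

                          if (f) {
                                  fputs("1", f);
                                  fclose(f);
                          }
                  }
                  prev = cur;
          }
  }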
Balbir Singh Jan. 30, 2025, 10:27 p.m. UTC | #13
On 1/31/25 07:19, Johannes Weiner wrote:
> On Thu, Jan 30, 2025 at 12:07:31PM -0500, Waiman Long wrote:
>> On 1/30/25 11:39 AM, Johannes Weiner wrote:
>>> On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote:
>>>> On 1/29/25 3:10 PM, Yosry Ahmed wrote:
>>>>> On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote:
>>>>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
>>>>>> reclaim over memory.high"), the amount of allocator throttling had
>>>>>> increased substantially. As a result, it could be difficult for a
>>>>>> misbehaving application that consumes increasing amount of memory from
>>>>>> being OOM-killed if memory.high is set. Instead, the application may
>>>>>> just be crawling along holding close to the allowed memory.high memory
>>>>>> for the current memory cgroup for a very long time especially those
>>>>>> that do a lot of memcg charging and uncharging operations.
>>>>>>
>>>>>> This behavior makes the upstream Kubernetes community hesitate to
>>>>>> use memory.high. Instead, they use only memory.max for memory control
>>>>>> similar to what is being done for cgroup v1 [1].
>>>>>>
>>>>>> To allow better control of the amount of throttling and hence the
>>>>>> speed that a misbehving task can be OOM killed, a new single-value
>>>>>> memory.high.throttle control file is now added. The allowable range
>>>>>> is 0-32.  By default, it has a value of 0 which means maximum throttling
>>>>>> like before. Any non-zero positive value represents the corresponding
>>>>>> power of 2 reduction of throttling and makes OOM kills easier to happen.
>>>>>>
>>>>>> System administrators can now use this parameter to determine how easy
>>>>>> they want OOM kills to happen for applications that tend to consume
>>>>>> a lot of memory without the need to run a special userspace memory
>>>>>> management tool to monitor memory consumption when memory.high is set.
>>>>>>
>>>>>> Below are the test results of a simple program showing how different
>>>>>> values of memory.high.throttle can affect its run time (in secs) until
>>>>>> it gets OOM killed. This test program allocates pages from kernel
>>>>>> continuously. There are some run-to-run variations and the results
>>>>>> are just one possible set of samples.
>>>>>>
>>>>>>     # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
>>>>>> 	--wait -t timeout 300 /tmp/mmap-oom
>>>>>>
>>>>>>     memory.high.throttle	service runtime
>>>>>>     --------------------	---------------
>>>>>>               0		    120.521
>>>>>>               1		    103.376
>>>>>>               2		     85.881
>>>>>>               3		     69.698
>>>>>>               4		     42.668
>>>>>>               5		     45.782
>>>>>>               6		     22.179
>>>>>>               7		      9.909
>>>>>>               8		      5.347
>>>>>>               9		      3.100
>>>>>>              10		      1.757
>>>>>>              11		      1.084
>>>>>>              12		      0.919
>>>>>>              13		      0.650
>>>>>>              14		      0.650
>>>>>>              15		      0.655
>>>>>>
>>>>>> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0
>>>>>>
>>>>>> Signed-off-by: Waiman Long <longman@redhat.com>
>>>>>> ---
>>>>>>    Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
>>>>>>    include/linux/memcontrol.h              |  2 ++
>>>>>>    mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
>>>>>>    3 files changed, 57 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
>>>>>> index cb1b4e759b7e..df9410ad8b3b 100644
>>>>>> --- a/Documentation/admin-guide/cgroup-v2.rst
>>>>>> +++ b/Documentation/admin-guide/cgroup-v2.rst
>>>>>> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
>>>>>>    	Going over the high limit never invokes the OOM killer and
>>>>>>    	under extreme conditions the limit may be breached. The high
>>>>>>    	limit should be used in scenarios where an external process
>>>>>> -	monitors the limited cgroup to alleviate heavy reclaim
>>>>>> -	pressure.
>>>>>> +	monitors the limited cgroup to alleviate heavy reclaim pressure
>>>>>> +	unless a high enough value is set in "memory.high.throttle".
>>>>>> +
>>>>>> +  memory.high.throttle
>>>>>> +	A read-write single value file which exists on non-root
>>>>>> +	cgroups.  The default is 0.
>>>>>> +
>>>>>> +	Memory usage throttle control.	This value controls the amount
>>>>>> +	of throttling that will be applied when memory consumption
>>>>>> +	exceeds the "memory.high" limit.  The larger the value is,
>>>>>> +	the smaller the amount of throttling will be and the easier an
>>>>>> +	offending application may get OOM killed.
>>>>> memory.high is supposed to never invoke the OOM killer (see above). It's
>>>>> unclear to me if you are referring to OOM kills from the kernel or
>>>>> userspace in the commit message. If the latter, I think it shouldn't be
>>>>> in kernel docs.
>>>> I am sorry for not being clear. What I meant is that if an application
>>>> is consuming more memory than what can be recovered by memory reclaim,
>>>> it will reach memory.max faster, if set, and get OOM killed. Will
>>>> clarify that in the next version.
>>> You're not really supposed to use max and high in conjunction. One is
>>> for kernel OOM killing, the other for userspace OOM killing. That's what
>>> the documentation that you edited is trying to explain, though.
>>>
>>> What's the usecase you have in mind?
>>
>> That is new to me that high and max are not supposed to be used 
>> together. One problem with v1 is that by the time the limit is reached 
>> and memory reclaim is not able to recover enough memory in time, the 
>> task will be OOM killed. I always thought that by setting high to a bit 
>> below max, say 90%, early memory reclaim will reduce the chance of OOM 
>> kills. There are certainly others that think like that.
> 
> I can't fault you or them for this, because this was the original plan
> for these knobs. However, this didn't end up working in practice.
> 
> If you have a non-throttling, non-killing limit, then reclaim will
> either work and keep the workload to that limit; or it won't work, and
> the workload escapes to the hard limit and gets killed.
> 
> You'll notice you get the same behavior with just memory.max set by
> itself - either reclaim can keep up, or OOM is triggered.

Yep that was intentional, it was best effort.

> 
>> So the use case here is to reduce the chance of OOM kills without
>> letting really misbehaving tasks hold up useful memory for too long.
> 
> That brings us to the idea of a medium amount of throttling.
> 
> The premise would be that, by throttling *to a certain degree*, you
> can slow the workload down just enough to tide over the pressure peak
> and avert the OOM kill.
> 
> This assumes that some tasks inside the cgroup can independently make
> forward progress and release memory, while allocating tasks inside the
> group are already throttled.
> 
> [ Keep in mind, it's a cgroup-internal limit, so no memory freeing
>   outside of the group can alleviate the situation. Progress must
>   happen from within the cgroup. ]
> 
> But this sort of parallelism in a pressured cgroup is unlikely in
> practice. By the time reclaim fails, usually *every task* in the
> cgroup ends up having to allocate. Because they lose executables to
> cache reclaim, or heap memory to swap etc, and then page fault.
> 
> We found that more often than not, it just deteriorates into a single
> sequence of events. Slowing it down just drags out the inevitable.
> 
> As a result we eventually moved away from the idea of gradual
> throttling. The last remnants of this idea finally disappeared from
> the docs last year (commit 5647e53f7856bb39dae781fe26aa65a699e2fc9f).
> 
> memory.high now effectively puts the cgroup to sleep when reclaim
> fails (similar to oom killer disabling in v1, but without the caveats
> of that implementation). This is useful to let userspace implement
> custom OOM killing policies.
> 

I've seen memory.high act as a limit in the way you describe: with a benchmark
like STREAM, the run did not finish and stalled for several hours when it was
short of a few GBs of memory, and I did not have a background user-space process
to do a user-space kill. In my case, reclaim was able to recover a little memory,
so some forward progress was made and the nr_retries limit was never hit (IIRC).

Effectively, in v1 the soft_limit was supposed to be best-effort pushback, and the
OOM killer could find a task to kill globally (in the initial design) if there was
global memory pressure.

For this discussion, adding memory.high.throttle seems to bridge the transition
from memory.high to memory.max/OOM without external intervention. I do feel that
not killing the task just locks it in the memcg forever (at least in my case), and
it sounds like using memory.high requires an external process monitor to kill the
task if it does not make progress.
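
Concretely, the setup being discussed would look something like the
sketch below (the group path and numbers are only illustrative, and
memory.high.throttle is the interface proposed by this RFC, not
something that exists in mainline):

  CG=/sys/fs/cgroup/job
  echo 10M > "$CG/memory.high"            # reclaim/throttling starts here
  echo 20M > "$CG/memory.max"             # hard limit, kernel OOM kill
  echo 8   > "$CG/memory.high.throttle"   # proposed knob: shift the throttle
                                          # penalty right by 8 (divide by 256),
                                          # so a runaway task reaches
                                          # memory.max and is OOM killed sooner

With memory.high.throttle left at its default of 0, the behaviour is
the same maximum throttling as today.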

Warm Regards,
Balbir Singh

Patch

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cb1b4e759b7e..df9410ad8b3b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1291,8 +1291,20 @@  PAGE_SIZE multiple when read back.
 	Going over the high limit never invokes the OOM killer and
 	under extreme conditions the limit may be breached. The high
 	limit should be used in scenarios where an external process
-	monitors the limited cgroup to alleviate heavy reclaim
-	pressure.
+	monitors the limited cgroup to alleviate heavy reclaim pressure
+	unless a high enough value is set in "memory.high.throttle".
+
+  memory.high.throttle
+	A read-write single value file which exists on non-root
+	cgroups.  The default is 0.
+
+	Memory usage throttle control.	This value controls the amount
+	of throttling that will be applied when memory consumption
+	exceeds the "memory.high" limit.  The larger the value is,
+	the smaller the amount of throttling will be and the easier an
+	offending application may get OOM killed.
+
+	The valid range of this control file is 0-32.
 
   memory.max
 	A read-write single value file which exists on non-root
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6e74b8254d9b..b184d7b008d4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -199,6 +199,8 @@  struct mem_cgroup {
 	struct list_head swap_peaks;
 	spinlock_t	 peaks_lock;
 
+	int high_throttle_shift;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46f8b372d212..2fa3fd99ebc9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2112,6 +2112,7 @@  void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	unsigned long nr_reclaimed;
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
 	int nr_retries = MAX_RECLAIM_RETRIES;
+	int throttle_shift;
 	struct mem_cgroup *memcg;
 	bool in_retry = false;
 
@@ -2156,6 +2157,13 @@  void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
 						swap_find_max_overage(memcg));
 
+	/*
+	 * Reduce penalty according to the high_throttle_shift value.
+	 */
+	throttle_shift = READ_ONCE(memcg->high_throttle_shift);
+	if (throttle_shift)
+		penalty_jiffies >>= throttle_shift;
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
@@ -4172,6 +4180,33 @@  static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static u64 memory_high_throttle_read(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return READ_ONCE(memcg->high_throttle_shift);
+}
+
+static ssize_t memory_high_throttle_write(struct kernfs_open_file *of,
+				char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	u64 val;
+	int err;
+
+	buf = strstrip(buf);
+	err = kstrtoull(buf, 10, &val);
+	if (err)
+		return err;
+
+	if (val > 32)
+		return -EINVAL;
+
+	WRITE_ONCE(memcg->high_throttle_shift, (int)val);
+	return nbytes;
+}
+
 /*
  * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
  * if any new events become available.
@@ -4396,6 +4431,12 @@  static struct cftype memory_files[] = {
 		.seq_show = memory_high_show,
 		.write = memory_high_write,
 	},
+	{
+		.name = "high.throttle",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = memory_high_throttle_read,
+		.write = memory_high_throttle_write,
+	},
 	{
 		.name = "max",
 		.flags = CFTYPE_NOT_ON_ROOT,