Message ID: 20250129191204.368199-1-longman@redhat.com (mailing list archive)
State:      New
Series:     [RFC] mm, memcg: introduce memory.high.throttle
On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote: > Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing > reclaim over memory.high"), the amount of allocator throttling had > increased substantially. As a result, it could be difficult for a > misbehaving application that consumes increasing amount of memory from > being OOM-killed if memory.high is set. Instead, the application may > just be crawling along holding close to the allowed memory.high memory > for the current memory cgroup for a very long time especially those > that do a lot of memcg charging and uncharging operations. > > This behavior makes the upstream Kubernetes community hesitate to > use memory.high. Instead, they use only memory.max for memory control > similar to what is being done for cgroup v1 [1]. > > To allow better control of the amount of throttling and hence the > speed that a misbehving task can be OOM killed, a new single-value > memory.high.throttle control file is now added. The allowable range > is 0-32. By default, it has a value of 0 which means maximum throttling > like before. Any non-zero positive value represents the corresponding > power of 2 reduction of throttling and makes OOM kills easier to happen. > > System administrators can now use this parameter to determine how easy > they want OOM kills to happen for applications that tend to consume > a lot of memory without the need to run a special userspace memory > management tool to monitor memory consumption when memory.high is set. > > Below are the test results of a simple program showing how different > values of memory.high.throttle can affect its run time (in secs) until > it gets OOM killed. This test program allocates pages from kernel > continuously. There are some run-to-run variations and the results > are just one possible set of samples. > > # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \ > --wait -t timeout 300 /tmp/mmap-oom > > memory.high.throttle service runtime > -------------------- --------------- > 0 120.521 > 1 103.376 > 2 85.881 > 3 69.698 > 4 42.668 > 5 45.782 > 6 22.179 > 7 9.909 > 8 5.347 > 9 3.100 > 10 1.757 > 11 1.084 > 12 0.919 > 13 0.650 > 14 0.650 > 15 0.655 > > [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0 > > Signed-off-by: Waiman Long <longman@redhat.com> > --- > Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++-- > include/linux/memcontrol.h | 2 ++ > mm/memcontrol.c | 41 +++++++++++++++++++++++++ > 3 files changed, 57 insertions(+), 2 deletions(-) > > diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > index cb1b4e759b7e..df9410ad8b3b 100644 > --- a/Documentation/admin-guide/cgroup-v2.rst > +++ b/Documentation/admin-guide/cgroup-v2.rst > @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back. > Going over the high limit never invokes the OOM killer and > under extreme conditions the limit may be breached. The high > limit should be used in scenarios where an external process > - monitors the limited cgroup to alleviate heavy reclaim > - pressure. > + monitors the limited cgroup to alleviate heavy reclaim pressure > + unless a high enough value is set in "memory.high.throttle". > + > + memory.high.throttle > + A read-write single value file which exists on non-root > + cgroups. The default is 0. > + > + Memory usage throttle control. This value controls the amount > + of throttling that will be applied when memory consumption > + exceeds the "memory.high" limit. 
The larger the value is, > + the smaller the amount of throttling will be and the easier an > + offending application may get OOM killed. memory.high is supposed to never invoke the OOM killer (see above). It's unclear to me if you are referring to OOM kills from the kernel or userspace in the commit message. If the latter, I think it shouldn't be in kernel docs.
On Wed 29-01-25 14:12:04, Waiman Long wrote: > Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing > reclaim over memory.high"), the amount of allocator throttling had > increased substantially. As a result, it could be difficult for a > misbehaving application that consumes increasing amount of memory from > being OOM-killed if memory.high is set. Instead, the application may > just be crawling along holding close to the allowed memory.high memory > for the current memory cgroup for a very long time especially those > that do a lot of memcg charging and uncharging operations. > > This behavior makes the upstream Kubernetes community hesitate to > use memory.high. Instead, they use only memory.max for memory control > similar to what is being done for cgroup v1 [1]. Why is this a problem for them? > To allow better control of the amount of throttling and hence the > speed that a misbehving task can be OOM killed, a new single-value > memory.high.throttle control file is now added. The allowable range > is 0-32. By default, it has a value of 0 which means maximum throttling > like before. Any non-zero positive value represents the corresponding > power of 2 reduction of throttling and makes OOM kills easier to happen. I do not like the interface, to be honest. It exposes an implementation detail and casts it into a user API. If we ever need to change the way the throttling is implemented, this will stand in the way because there will be applications depending on a behavior they were carefully tuned to. It is also not entirely clear how this is supposed to be used in practice. How do people know what kind of value they should use? > System administrators can now use this parameter to determine how easy > they want OOM kills to happen for applications that tend to consume > a lot of memory without the need to run a special userspace memory > management tool to monitor memory consumption when memory.high is set. Why can't they achieve the same with the existing events/metrics we already provide? Most notably PSI, which is properly accounted when a task is throttled due to memory.high throttling.
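For context on the PSI route mentioned above: each cgroup exposes a memory.pressure file with "some" and "full" stall percentages, and a userspace agent can act on those rather than on a new kernel knob. A minimal sketch, assuming a systemd-managed cgroup path and an arbitrary 25% threshold (both are placeholders, not anything mandated by the kernel):

/* psi-watch.c: poll a cgroup's memory.pressure and report when the
 * "full" 10-second stall average crosses a threshold.  The cgroup
 * path and the 25.0 threshold are illustrative only.
 */
#include <stdio.h>
#include <unistd.h>

#define PSI_FILE  "/sys/fs/cgroup/system.slice/myapp.service/memory.pressure"
#define THRESHOLD 25.0	/* percent of time fully stalled over the last 10s */

int main(void)
{
	char line[256];

	for (;;) {
		FILE *f = fopen(PSI_FILE, "r");

		if (!f) {
			perror("fopen");
			return 1;
		}
		while (fgets(line, sizeof(line), f)) {
			double avg10;

			/* "full avg10=... avg60=... avg300=... total=..." */
			if (sscanf(line, "full avg10=%lf", &avg10) == 1 &&
			    avg10 >= THRESHOLD)
				printf("memory.high stall: full avg10=%.2f%%\n",
				       avg10);
		}
		fclose(f);
		sleep(2);
	}
}

The same file also accepts PSI trigger registrations, so the polling loop could be replaced by a blocking poll() on the file descriptor if desired.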
On 1/29/25 3:10 PM, Yosry Ahmed wrote: > On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote: >> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing >> reclaim over memory.high"), the amount of allocator throttling had >> increased substantially. As a result, it could be difficult for a >> misbehaving application that consumes increasing amount of memory from >> being OOM-killed if memory.high is set. Instead, the application may >> just be crawling along holding close to the allowed memory.high memory >> for the current memory cgroup for a very long time especially those >> that do a lot of memcg charging and uncharging operations. >> >> This behavior makes the upstream Kubernetes community hesitate to >> use memory.high. Instead, they use only memory.max for memory control >> similar to what is being done for cgroup v1 [1]. >> >> To allow better control of the amount of throttling and hence the >> speed that a misbehving task can be OOM killed, a new single-value >> memory.high.throttle control file is now added. The allowable range >> is 0-32. By default, it has a value of 0 which means maximum throttling >> like before. Any non-zero positive value represents the corresponding >> power of 2 reduction of throttling and makes OOM kills easier to happen. >> >> System administrators can now use this parameter to determine how easy >> they want OOM kills to happen for applications that tend to consume >> a lot of memory without the need to run a special userspace memory >> management tool to monitor memory consumption when memory.high is set. >> >> Below are the test results of a simple program showing how different >> values of memory.high.throttle can affect its run time (in secs) until >> it gets OOM killed. This test program allocates pages from kernel >> continuously. There are some run-to-run variations and the results >> are just one possible set of samples. >> >> # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \ >> --wait -t timeout 300 /tmp/mmap-oom >> >> memory.high.throttle service runtime >> -------------------- --------------- >> 0 120.521 >> 1 103.376 >> 2 85.881 >> 3 69.698 >> 4 42.668 >> 5 45.782 >> 6 22.179 >> 7 9.909 >> 8 5.347 >> 9 3.100 >> 10 1.757 >> 11 1.084 >> 12 0.919 >> 13 0.650 >> 14 0.650 >> 15 0.655 >> >> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0 >> >> Signed-off-by: Waiman Long <longman@redhat.com> >> --- >> Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++-- >> include/linux/memcontrol.h | 2 ++ >> mm/memcontrol.c | 41 +++++++++++++++++++++++++ >> 3 files changed, 57 insertions(+), 2 deletions(-) >> >> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst >> index cb1b4e759b7e..df9410ad8b3b 100644 >> --- a/Documentation/admin-guide/cgroup-v2.rst >> +++ b/Documentation/admin-guide/cgroup-v2.rst >> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back. >> Going over the high limit never invokes the OOM killer and >> under extreme conditions the limit may be breached. The high >> limit should be used in scenarios where an external process >> - monitors the limited cgroup to alleviate heavy reclaim >> - pressure. >> + monitors the limited cgroup to alleviate heavy reclaim pressure >> + unless a high enough value is set in "memory.high.throttle". >> + >> + memory.high.throttle >> + A read-write single value file which exists on non-root >> + cgroups. The default is 0. >> + >> + Memory usage throttle control. 
This value controls the amount >> + of throttling that will be applied when memory consumption >> + exceeds the "memory.high" limit. The larger the value is, >> + the smaller the amount of throttling will be and the easier an >> + offending application may get OOM killed. > memory.high is supposed to never invoke the OOM killer (see above). It's > unclear to me if you are referring to OOM kills from the kernel or > userspace in the commit message. If the latter, I think it shouldn't be > in kernel docs. I am sorry for not being clear. What I meant is that if an application is consuming more memory than what can be recovered by memory reclaim, it will reach memory.max faster, if set, and get OOM killed. Will clarify that in the next version. Cheers, Longman
On 1/30/25 3:15 AM, Michal Hocko wrote: > On Wed 29-01-25 14:12:04, Waiman Long wrote: >> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing >> reclaim over memory.high"), the amount of allocator throttling had >> increased substantially. As a result, it could be difficult for a >> misbehaving application that consumes increasing amount of memory from >> being OOM-killed if memory.high is set. Instead, the application may >> just be crawling along holding close to the allowed memory.high memory >> for the current memory cgroup for a very long time especially those >> that do a lot of memcg charging and uncharging operations. >> >> This behavior makes the upstream Kubernetes community hesitate to >> use memory.high. Instead, they use only memory.max for memory control >> similar to what is being done for cgroup v1 [1]. > Why is this a problem for them? My understanding is that a misbehaving container will hold on to close to memory.high worth of memory for a long time instead of getting OOM killed sooner so that the memory can be put to more productive use elsewhere. > >> To allow better control of the amount of throttling and hence the >> speed that a misbehving task can be OOM killed, a new single-value >> memory.high.throttle control file is now added. The allowable range >> is 0-32. By default, it has a value of 0 which means maximum throttling >> like before. Any non-zero positive value represents the corresponding >> power of 2 reduction of throttling and makes OOM kills easier to happen. > I do not like the interface to be honest. It exposes an implementation > detail and casts it into a user API. If we ever need to change the way > how the throttling is implemented this will stand in the way because > there will be applications depending on a behavior they were carefuly > tuned to. > > It is also not entirely sure how is this supposed to be used in > practice? How do people what kind of value they should use? Yes, I agree that a user may need some trial runs to find a proper value. Perhaps a simpler binary interface of "off" and "on" may be easier to understand and use. > >> System administrators can now use this parameter to determine how easy >> they want OOM kills to happen for applications that tend to consume >> a lot of memory without the need to run a special userspace memory >> management tool to monitor memory consumption when memory.high is set. > Why cannot they achieve the same with the existing events/metrics we > already do provide? Most notably PSI which is properly accounted when > a task is throttled due to memory.high throttling. That will require the use of a userspace management agent that looks for these stalling conditions and makes the kill, if necessary. There are certainly users out there who want to get some benefit from using memory.high, like early memory reclaim, without the trouble of handling these kinds of stalling conditions. Cheers, Longman
On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote: > On 1/29/25 3:10 PM, Yosry Ahmed wrote: > > On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote: > >> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing > >> reclaim over memory.high"), the amount of allocator throttling had > >> increased substantially. As a result, it could be difficult for a > >> misbehaving application that consumes increasing amount of memory from > >> being OOM-killed if memory.high is set. Instead, the application may > >> just be crawling along holding close to the allowed memory.high memory > >> for the current memory cgroup for a very long time especially those > >> that do a lot of memcg charging and uncharging operations. > >> > >> This behavior makes the upstream Kubernetes community hesitate to > >> use memory.high. Instead, they use only memory.max for memory control > >> similar to what is being done for cgroup v1 [1]. > >> > >> To allow better control of the amount of throttling and hence the > >> speed that a misbehving task can be OOM killed, a new single-value > >> memory.high.throttle control file is now added. The allowable range > >> is 0-32. By default, it has a value of 0 which means maximum throttling > >> like before. Any non-zero positive value represents the corresponding > >> power of 2 reduction of throttling and makes OOM kills easier to happen. > >> > >> System administrators can now use this parameter to determine how easy > >> they want OOM kills to happen for applications that tend to consume > >> a lot of memory without the need to run a special userspace memory > >> management tool to monitor memory consumption when memory.high is set. > >> > >> Below are the test results of a simple program showing how different > >> values of memory.high.throttle can affect its run time (in secs) until > >> it gets OOM killed. This test program allocates pages from kernel > >> continuously. There are some run-to-run variations and the results > >> are just one possible set of samples. > >> > >> # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \ > >> --wait -t timeout 300 /tmp/mmap-oom > >> > >> memory.high.throttle service runtime > >> -------------------- --------------- > >> 0 120.521 > >> 1 103.376 > >> 2 85.881 > >> 3 69.698 > >> 4 42.668 > >> 5 45.782 > >> 6 22.179 > >> 7 9.909 > >> 8 5.347 > >> 9 3.100 > >> 10 1.757 > >> 11 1.084 > >> 12 0.919 > >> 13 0.650 > >> 14 0.650 > >> 15 0.655 > >> > >> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0 > >> > >> Signed-off-by: Waiman Long <longman@redhat.com> > >> --- > >> Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++-- > >> include/linux/memcontrol.h | 2 ++ > >> mm/memcontrol.c | 41 +++++++++++++++++++++++++ > >> 3 files changed, 57 insertions(+), 2 deletions(-) > >> > >> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > >> index cb1b4e759b7e..df9410ad8b3b 100644 > >> --- a/Documentation/admin-guide/cgroup-v2.rst > >> +++ b/Documentation/admin-guide/cgroup-v2.rst > >> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back. > >> Going over the high limit never invokes the OOM killer and > >> under extreme conditions the limit may be breached. The high > >> limit should be used in scenarios where an external process > >> - monitors the limited cgroup to alleviate heavy reclaim > >> - pressure. 
> >> + monitors the limited cgroup to alleviate heavy reclaim pressure > >> + unless a high enough value is set in "memory.high.throttle". > >> + > >> + memory.high.throttle > >> + A read-write single value file which exists on non-root > >> + cgroups. The default is 0. > >> + > >> + Memory usage throttle control. This value controls the amount > >> + of throttling that will be applied when memory consumption > >> + exceeds the "memory.high" limit. The larger the value is, > >> + the smaller the amount of throttling will be and the easier an > >> + offending application may get OOM killed. > > memory.high is supposed to never invoke the OOM killer (see above). It's > > unclear to me if you are referring to OOM kills from the kernel or > > userspace in the commit message. If the latter, I think it shouldn't be > > in kernel docs. > > I am sorry for not being clear. What I meant is that if an application > is consuming more memory than what can be recovered by memory reclaim, > it will reach memory.max faster, if set, and get OOM killed. Will > clarify that in the next version. You're not really supposed to use max and high in conjunction. One is for kernel OOM killing, the other for userspace OOM killing. That's what the documentation that you edited is trying to explain. What's the usecase you have in mind?
On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote: > On 1/30/25 3:15 AM, Michal Hocko wrote: > > On Wed 29-01-25 14:12:04, Waiman Long wrote: > > > Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing > > > reclaim over memory.high"), the amount of allocator throttling had > > > increased substantially. As a result, it could be difficult for a > > > misbehaving application that consumes increasing amount of memory from > > > being OOM-killed if memory.high is set. Instead, the application may > > > just be crawling along holding close to the allowed memory.high memory > > > for the current memory cgroup for a very long time especially those > > > that do a lot of memcg charging and uncharging operations. > > > > > > This behavior makes the upstream Kubernetes community hesitate to > > > use memory.high. Instead, they use only memory.max for memory control > > > similar to what is being done for cgroup v1 [1]. > > Why is this a problem for them? > My understanding is that a mishaving container will hold up memory.high > amount of memory for a long time instead of getting OOM killed sooner and be > more productively used elsewhere. > > > > > To allow better control of the amount of throttling and hence the > > > speed that a misbehving task can be OOM killed, a new single-value > > > memory.high.throttle control file is now added. The allowable range > > > is 0-32. By default, it has a value of 0 which means maximum throttling > > > like before. Any non-zero positive value represents the corresponding > > > power of 2 reduction of throttling and makes OOM kills easier to happen. > > I do not like the interface to be honest. It exposes an implementation > > detail and casts it into a user API. If we ever need to change the way > > how the throttling is implemented this will stand in the way because > > there will be applications depending on a behavior they were carefuly > > tuned to. > > > > It is also not entirely sure how is this supposed to be used in > > practice? How do people what kind of value they should use? > Yes, I agree that a user may need to run some trial runs to find a proper > value. Perhaps a simpler binary interface of "off" and "on" may be easier to > understand and use. > > > > > System administrators can now use this parameter to determine how easy > > > they want OOM kills to happen for applications that tend to consume > > > a lot of memory without the need to run a special userspace memory > > > management tool to monitor memory consumption when memory.high is set. > > Why cannot they achieve the same with the existing events/metrics we > > already do provide? Most notably PSI which is properly accounted when > > a task is throttled due to memory.high throttling. > > That will require the use of a userspace management agent that looks for > these stalling conditions and make the kill, if necessary. There are > certainly users out there that want to get some benefit of using memory.high > like early memory reclaim without the trouble of handling these kind of > stalling conditions. So you basically want to force the workload into some sort of a proactive reclaim but without an artificial slow down? It makes some sense to me, but 1) Idk if it deserves a new API, because it can be relatively easy implemented in userspace by a daemon which monitors cgroups usage and reclaims the memory if necessarily. No kernel changes are needed. 2) If new API is introduced, I think it's better to introduce a new limit, e.g. 
memory.target, keeping memory.high semantics intact. Thanks!
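A rough sketch of the userspace daemon described in point 1 above, assuming cgroup v2 with the memory.reclaim interface available; the cgroup path and the 512 MiB target are placeholders for illustration, not part of any proposal:

/* reclaimd.c: keep a cgroup's usage near a target by asking the kernel
 * to reclaim the excess through memory.reclaim.  Path and target are
 * illustrative placeholders.
 */
#include <stdio.h>
#include <unistd.h>

#define CG      "/sys/fs/cgroup/system.slice/myapp.service"
#define TARGET  (512UL << 20)		/* 512 MiB, illustrative */

static unsigned long read_current(void)
{
	unsigned long val = 0;
	FILE *f = fopen(CG "/memory.current", "r");

	if (f) {
		fscanf(f, "%lu", &val);
		fclose(f);
	}
	return val;
}

int main(void)
{
	for (;;) {
		unsigned long usage = read_current();

		if (usage > TARGET) {
			FILE *f = fopen(CG "/memory.reclaim", "w");

			/* best effort: the write reports an error if the
			 * kernel could not reclaim the requested amount */
			if (f) {
				fprintf(f, "%lu", usage - TARGET);
				fclose(f);
			}
		}
		sleep(1);
	}
}

This gives the early-reclaim benefit without putting the allocating tasks to sleep, which is roughly the "reclaim without the artificial slowdown" behavior discussed above.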
On 1/30/25 11:39 AM, Johannes Weiner wrote: > On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote: >> On 1/29/25 3:10 PM, Yosry Ahmed wrote: >>> On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote: >>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing >>>> reclaim over memory.high"), the amount of allocator throttling had >>>> increased substantially. As a result, it could be difficult for a >>>> misbehaving application that consumes increasing amount of memory from >>>> being OOM-killed if memory.high is set. Instead, the application may >>>> just be crawling along holding close to the allowed memory.high memory >>>> for the current memory cgroup for a very long time especially those >>>> that do a lot of memcg charging and uncharging operations. >>>> >>>> This behavior makes the upstream Kubernetes community hesitate to >>>> use memory.high. Instead, they use only memory.max for memory control >>>> similar to what is being done for cgroup v1 [1]. >>>> >>>> To allow better control of the amount of throttling and hence the >>>> speed that a misbehving task can be OOM killed, a new single-value >>>> memory.high.throttle control file is now added. The allowable range >>>> is 0-32. By default, it has a value of 0 which means maximum throttling >>>> like before. Any non-zero positive value represents the corresponding >>>> power of 2 reduction of throttling and makes OOM kills easier to happen. >>>> >>>> System administrators can now use this parameter to determine how easy >>>> they want OOM kills to happen for applications that tend to consume >>>> a lot of memory without the need to run a special userspace memory >>>> management tool to monitor memory consumption when memory.high is set. >>>> >>>> Below are the test results of a simple program showing how different >>>> values of memory.high.throttle can affect its run time (in secs) until >>>> it gets OOM killed. This test program allocates pages from kernel >>>> continuously. There are some run-to-run variations and the results >>>> are just one possible set of samples. >>>> >>>> # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \ >>>> --wait -t timeout 300 /tmp/mmap-oom >>>> >>>> memory.high.throttle service runtime >>>> -------------------- --------------- >>>> 0 120.521 >>>> 1 103.376 >>>> 2 85.881 >>>> 3 69.698 >>>> 4 42.668 >>>> 5 45.782 >>>> 6 22.179 >>>> 7 9.909 >>>> 8 5.347 >>>> 9 3.100 >>>> 10 1.757 >>>> 11 1.084 >>>> 12 0.919 >>>> 13 0.650 >>>> 14 0.650 >>>> 15 0.655 >>>> >>>> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0 >>>> >>>> Signed-off-by: Waiman Long <longman@redhat.com> >>>> --- >>>> Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++-- >>>> include/linux/memcontrol.h | 2 ++ >>>> mm/memcontrol.c | 41 +++++++++++++++++++++++++ >>>> 3 files changed, 57 insertions(+), 2 deletions(-) >>>> >>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst >>>> index cb1b4e759b7e..df9410ad8b3b 100644 >>>> --- a/Documentation/admin-guide/cgroup-v2.rst >>>> +++ b/Documentation/admin-guide/cgroup-v2.rst >>>> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back. >>>> Going over the high limit never invokes the OOM killer and >>>> under extreme conditions the limit may be breached. The high >>>> limit should be used in scenarios where an external process >>>> - monitors the limited cgroup to alleviate heavy reclaim >>>> - pressure. 
>>>> + monitors the limited cgroup to alleviate heavy reclaim pressure >>>> + unless a high enough value is set in "memory.high.throttle". >>>> + >>>> + memory.high.throttle >>>> + A read-write single value file which exists on non-root >>>> + cgroups. The default is 0. >>>> + >>>> + Memory usage throttle control. This value controls the amount >>>> + of throttling that will be applied when memory consumption >>>> + exceeds the "memory.high" limit. The larger the value is, >>>> + the smaller the amount of throttling will be and the easier an >>>> + offending application may get OOM killed. >>> memory.high is supposed to never invoke the OOM killer (see above). It's >>> unclear to me if you are referring to OOM kills from the kernel or >>> userspace in the commit message. If the latter, I think it shouldn't be >>> in kernel docs. >> I am sorry for not being clear. What I meant is that if an application >> is consuming more memory than what can be recovered by memory reclaim, >> it will reach memory.max faster, if set, and get OOM killed. Will >> clarify that in the next version. > You're not really supposed to use max and high in conjunction. One is > for kernel OOM killing, the other for userspace OOM killing. That's > what the documentation that you edited is trying to explain. > > What's the usecase you have in mind? It is new to me that high and max are not supposed to be used together. One problem with v1 is that by the time the limit is reached and memory reclaim is not able to recover enough memory in time, the task will be OOM killed. I always thought that by setting high to a bit below max, say 90%, early memory reclaim would reduce the chance of OOM kills. There are certainly others who think like that. So the use case here is to reduce the chance of OOM kills without letting really misbehaving tasks hold up useful memory for too long. Cheers, Longman
On 1/30/25 12:05 PM, Roman Gushchin wrote: > On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote: >> On 1/30/25 3:15 AM, Michal Hocko wrote: >>> On Wed 29-01-25 14:12:04, Waiman Long wrote: >>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing >>>> reclaim over memory.high"), the amount of allocator throttling had >>>> increased substantially. As a result, it could be difficult for a >>>> misbehaving application that consumes increasing amount of memory from >>>> being OOM-killed if memory.high is set. Instead, the application may >>>> just be crawling along holding close to the allowed memory.high memory >>>> for the current memory cgroup for a very long time especially those >>>> that do a lot of memcg charging and uncharging operations. >>>> >>>> This behavior makes the upstream Kubernetes community hesitate to >>>> use memory.high. Instead, they use only memory.max for memory control >>>> similar to what is being done for cgroup v1 [1]. >>> Why is this a problem for them? >> My understanding is that a mishaving container will hold up memory.high >> amount of memory for a long time instead of getting OOM killed sooner and be >> more productively used elsewhere. >>>> To allow better control of the amount of throttling and hence the >>>> speed that a misbehving task can be OOM killed, a new single-value >>>> memory.high.throttle control file is now added. The allowable range >>>> is 0-32. By default, it has a value of 0 which means maximum throttling >>>> like before. Any non-zero positive value represents the corresponding >>>> power of 2 reduction of throttling and makes OOM kills easier to happen. >>> I do not like the interface to be honest. It exposes an implementation >>> detail and casts it into a user API. If we ever need to change the way >>> how the throttling is implemented this will stand in the way because >>> there will be applications depending on a behavior they were carefuly >>> tuned to. >>> >>> It is also not entirely sure how is this supposed to be used in >>> practice? How do people what kind of value they should use? >> Yes, I agree that a user may need to run some trial runs to find a proper >> value. Perhaps a simpler binary interface of "off" and "on" may be easier to >> understand and use. >>>> System administrators can now use this parameter to determine how easy >>>> they want OOM kills to happen for applications that tend to consume >>>> a lot of memory without the need to run a special userspace memory >>>> management tool to monitor memory consumption when memory.high is set. >>> Why cannot they achieve the same with the existing events/metrics we >>> already do provide? Most notably PSI which is properly accounted when >>> a task is throttled due to memory.high throttling. >> That will require the use of a userspace management agent that looks for >> these stalling conditions and make the kill, if necessary. There are >> certainly users out there that want to get some benefit of using memory.high >> like early memory reclaim without the trouble of handling these kind of >> stalling conditions. > So you basically want to force the workload into some sort of a proactive > reclaim but without an artificial slow down? > It makes some sense to me, but > 1) Idk if it deserves a new API, because it can be relatively easy implemented > in userspace by a daemon which monitors cgroups usage and reclaims the memory > if necessarily. No kernel changes are needed. 
> 2) If new API is introduced, I think it's better to introduce a new limit, > e.g. memory.target, keeping memory.high semantics intact. Yes, you are right about that. Introducing a new "memory.target" without disturbing the existing "memory.high" semantics will work for me too. Cheers, Longman
On Thu, Jan 30, 2025 at 12:19:38PM -0500, Waiman Long wrote: > On 1/30/25 12:05 PM, Roman Gushchin wrote: > > On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote: > > > On 1/30/25 3:15 AM, Michal Hocko wrote: > > > > On Wed 29-01-25 14:12:04, Waiman Long wrote: > > > > > Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing > > > > > reclaim over memory.high"), the amount of allocator throttling had > > > > > increased substantially. As a result, it could be difficult for a > > > > > misbehaving application that consumes increasing amount of memory from > > > > > being OOM-killed if memory.high is set. Instead, the application may > > > > > just be crawling along holding close to the allowed memory.high memory > > > > > for the current memory cgroup for a very long time especially those > > > > > that do a lot of memcg charging and uncharging operations. > > > > > > > > > > This behavior makes the upstream Kubernetes community hesitate to > > > > > use memory.high. Instead, they use only memory.max for memory control > > > > > similar to what is being done for cgroup v1 [1]. > > > > Why is this a problem for them? > > > My understanding is that a mishaving container will hold up memory.high > > > amount of memory for a long time instead of getting OOM killed sooner and be > > > more productively used elsewhere. > > > > > To allow better control of the amount of throttling and hence the > > > > > speed that a misbehving task can be OOM killed, a new single-value > > > > > memory.high.throttle control file is now added. The allowable range > > > > > is 0-32. By default, it has a value of 0 which means maximum throttling > > > > > like before. Any non-zero positive value represents the corresponding > > > > > power of 2 reduction of throttling and makes OOM kills easier to happen. > > > > I do not like the interface to be honest. It exposes an implementation > > > > detail and casts it into a user API. If we ever need to change the way > > > > how the throttling is implemented this will stand in the way because > > > > there will be applications depending on a behavior they were carefuly > > > > tuned to. > > > > > > > > It is also not entirely sure how is this supposed to be used in > > > > practice? How do people what kind of value they should use? > > > Yes, I agree that a user may need to run some trial runs to find a proper > > > value. Perhaps a simpler binary interface of "off" and "on" may be easier to > > > understand and use. > > > > > System administrators can now use this parameter to determine how easy > > > > > they want OOM kills to happen for applications that tend to consume > > > > > a lot of memory without the need to run a special userspace memory > > > > > management tool to monitor memory consumption when memory.high is set. > > > > Why cannot they achieve the same with the existing events/metrics we > > > > already do provide? Most notably PSI which is properly accounted when > > > > a task is throttled due to memory.high throttling. > > > That will require the use of a userspace management agent that looks for > > > these stalling conditions and make the kill, if necessary. There are > > > certainly users out there that want to get some benefit of using memory.high > > > like early memory reclaim without the trouble of handling these kind of > > > stalling conditions. > > So you basically want to force the workload into some sort of a proactive > > reclaim but without an artificial slow down? 
I wouldn't call it a proactive reclaim as reclaim will happen synchronously in allocating thread. > > It makes some sense to me, but > > 1) Idk if it deserves a new API, because it can be relatively easy implemented > > in userspace by a daemon which monitors cgroups usage and reclaims the memory > > if necessarily. No kernel changes are needed. > > 2) If new API is introduced, I think it's better to introduce a new limit, > > e.g. memory.target, keeping memory.high semantics intact. > > Yes, you are right about that. Introducing a new "memory.target" without > disturbing the existing "memory.high" semantics will work for me too. > So, what happens if reclaim can not reduce usage below memory.target? Infinite reclaim cycles or just give up? > Cheers, > Longman >
On 1/30/25 12:32 PM, Shakeel Butt wrote: > On Thu, Jan 30, 2025 at 12:19:38PM -0500, Waiman Long wrote: >> On 1/30/25 12:05 PM, Roman Gushchin wrote: >>> On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote: >>>> On 1/30/25 3:15 AM, Michal Hocko wrote: >>>>> On Wed 29-01-25 14:12:04, Waiman Long wrote: >>>>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing >>>>>> reclaim over memory.high"), the amount of allocator throttling had >>>>>> increased substantially. As a result, it could be difficult for a >>>>>> misbehaving application that consumes increasing amount of memory from >>>>>> being OOM-killed if memory.high is set. Instead, the application may >>>>>> just be crawling along holding close to the allowed memory.high memory >>>>>> for the current memory cgroup for a very long time especially those >>>>>> that do a lot of memcg charging and uncharging operations. >>>>>> >>>>>> This behavior makes the upstream Kubernetes community hesitate to >>>>>> use memory.high. Instead, they use only memory.max for memory control >>>>>> similar to what is being done for cgroup v1 [1]. >>>>> Why is this a problem for them? >>>> My understanding is that a mishaving container will hold up memory.high >>>> amount of memory for a long time instead of getting OOM killed sooner and be >>>> more productively used elsewhere. >>>>>> To allow better control of the amount of throttling and hence the >>>>>> speed that a misbehving task can be OOM killed, a new single-value >>>>>> memory.high.throttle control file is now added. The allowable range >>>>>> is 0-32. By default, it has a value of 0 which means maximum throttling >>>>>> like before. Any non-zero positive value represents the corresponding >>>>>> power of 2 reduction of throttling and makes OOM kills easier to happen. >>>>> I do not like the interface to be honest. It exposes an implementation >>>>> detail and casts it into a user API. If we ever need to change the way >>>>> how the throttling is implemented this will stand in the way because >>>>> there will be applications depending on a behavior they were carefuly >>>>> tuned to. >>>>> >>>>> It is also not entirely sure how is this supposed to be used in >>>>> practice? How do people what kind of value they should use? >>>> Yes, I agree that a user may need to run some trial runs to find a proper >>>> value. Perhaps a simpler binary interface of "off" and "on" may be easier to >>>> understand and use. >>>>>> System administrators can now use this parameter to determine how easy >>>>>> they want OOM kills to happen for applications that tend to consume >>>>>> a lot of memory without the need to run a special userspace memory >>>>>> management tool to monitor memory consumption when memory.high is set. >>>>> Why cannot they achieve the same with the existing events/metrics we >>>>> already do provide? Most notably PSI which is properly accounted when >>>>> a task is throttled due to memory.high throttling. >>>> That will require the use of a userspace management agent that looks for >>>> these stalling conditions and make the kill, if necessary. There are >>>> certainly users out there that want to get some benefit of using memory.high >>>> like early memory reclaim without the trouble of handling these kind of >>>> stalling conditions. >>> So you basically want to force the workload into some sort of a proactive >>> reclaim but without an artificial slow down? > I wouldn't call it a proactive reclaim as reclaim will happen > synchronously in allocating thread. 
> >>> It makes some sense to me, but >>> 1) Idk if it deserves a new API, because it can be relatively easy implemented >>> in userspace by a daemon which monitors cgroups usage and reclaims the memory >>> if necessarily. No kernel changes are needed. >>> 2) If new API is introduced, I think it's better to introduce a new limit, >>> e.g. memory.target, keeping memory.high semantics intact. >> Yes, you are right about that. Introducing a new "memory.target" without >> disturbing the existing "memory.high" semantics will work for me too. >> > So, what happens if reclaim can not reduce usage below memory.target? > Infinite reclaim cycles or just give up? Just give up in this case. It is used mainly to reduce the chance of reaching max and cause OOM kill. Cheers, Longman
On Thu 30-01-25 12:19:38, Waiman Long wrote: > On 1/30/25 12:05 PM, Roman Gushchin wrote: > > On Thu, Jan 30, 2025 at 10:05:34AM -0500, Waiman Long wrote: [...] > > > > Why cannot they achieve the same with the existing events/metrics we > > > > already do provide? Most notably PSI which is properly accounted when > > > > a task is throttled due to memory.high throttling. > > > That will require the use of a userspace management agent that looks for > > > these stalling conditions and make the kill, if necessary. There are > > > certainly users out there that want to get some benefit of using memory.high > > > like early memory reclaim without the trouble of handling these kind of > > > stalling conditions. Could you expand more on that? As long as the memory is reasonably reclaimable then the hard (max) limit is exactly what you need. If you want to implement oom policy in the userspace you have high limit to get notifications about the memory pressure. Why this is insufficient? > > So you basically want to force the workload into some sort of a proactive > > reclaim but without an artificial slow down? > > It makes some sense to me, but > > 1) Idk if it deserves a new API, because it can be relatively easy implemented > > in userspace by a daemon which monitors cgroups usage and reclaims the memory > > if necessarily. No kernel changes are needed. > > 2) If new API is introduced, I think it's better to introduce a new limit, > > e.g. memory.target, keeping memory.high semantics intact. > > Yes, you are right about that. Introducing a new "memory.target" without > disturbing the existing "memory.high" semantics will work for me too. I thought those usecases want to kill misbehaving containers rather than burning time trying to reclaim. I do not understand how will such a new limit help to achieve that.
On Thu, Jan 30, 2025 at 12:07:31PM -0500, Waiman Long wrote: > On 1/30/25 11:39 AM, Johannes Weiner wrote: > > On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote: > >> On 1/29/25 3:10 PM, Yosry Ahmed wrote: > >>> On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote: > >>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing > >>>> reclaim over memory.high"), the amount of allocator throttling had > >>>> increased substantially. As a result, it could be difficult for a > >>>> misbehaving application that consumes increasing amount of memory from > >>>> being OOM-killed if memory.high is set. Instead, the application may > >>>> just be crawling along holding close to the allowed memory.high memory > >>>> for the current memory cgroup for a very long time especially those > >>>> that do a lot of memcg charging and uncharging operations. > >>>> > >>>> This behavior makes the upstream Kubernetes community hesitate to > >>>> use memory.high. Instead, they use only memory.max for memory control > >>>> similar to what is being done for cgroup v1 [1]. > >>>> > >>>> To allow better control of the amount of throttling and hence the > >>>> speed that a misbehving task can be OOM killed, a new single-value > >>>> memory.high.throttle control file is now added. The allowable range > >>>> is 0-32. By default, it has a value of 0 which means maximum throttling > >>>> like before. Any non-zero positive value represents the corresponding > >>>> power of 2 reduction of throttling and makes OOM kills easier to happen. > >>>> > >>>> System administrators can now use this parameter to determine how easy > >>>> they want OOM kills to happen for applications that tend to consume > >>>> a lot of memory without the need to run a special userspace memory > >>>> management tool to monitor memory consumption when memory.high is set. > >>>> > >>>> Below are the test results of a simple program showing how different > >>>> values of memory.high.throttle can affect its run time (in secs) until > >>>> it gets OOM killed. This test program allocates pages from kernel > >>>> continuously. There are some run-to-run variations and the results > >>>> are just one possible set of samples. > >>>> > >>>> # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \ > >>>> --wait -t timeout 300 /tmp/mmap-oom > >>>> > >>>> memory.high.throttle service runtime > >>>> -------------------- --------------- > >>>> 0 120.521 > >>>> 1 103.376 > >>>> 2 85.881 > >>>> 3 69.698 > >>>> 4 42.668 > >>>> 5 45.782 > >>>> 6 22.179 > >>>> 7 9.909 > >>>> 8 5.347 > >>>> 9 3.100 > >>>> 10 1.757 > >>>> 11 1.084 > >>>> 12 0.919 > >>>> 13 0.650 > >>>> 14 0.650 > >>>> 15 0.655 > >>>> > >>>> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0 > >>>> > >>>> Signed-off-by: Waiman Long <longman@redhat.com> > >>>> --- > >>>> Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++-- > >>>> include/linux/memcontrol.h | 2 ++ > >>>> mm/memcontrol.c | 41 +++++++++++++++++++++++++ > >>>> 3 files changed, 57 insertions(+), 2 deletions(-) > >>>> > >>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst > >>>> index cb1b4e759b7e..df9410ad8b3b 100644 > >>>> --- a/Documentation/admin-guide/cgroup-v2.rst > >>>> +++ b/Documentation/admin-guide/cgroup-v2.rst > >>>> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back. > >>>> Going over the high limit never invokes the OOM killer and > >>>> under extreme conditions the limit may be breached. 
The high > >>>> limit should be used in scenarios where an external process > >>>> - monitors the limited cgroup to alleviate heavy reclaim > >>>> - pressure. > >>>> + monitors the limited cgroup to alleviate heavy reclaim pressure > >>>> + unless a high enough value is set in "memory.high.throttle". > >>>> + > >>>> + memory.high.throttle > >>>> + A read-write single value file which exists on non-root > >>>> + cgroups. The default is 0. > >>>> + > >>>> + Memory usage throttle control. This value controls the amount > >>>> + of throttling that will be applied when memory consumption > >>>> + exceeds the "memory.high" limit. The larger the value is, > >>>> + the smaller the amount of throttling will be and the easier an > >>>> + offending application may get OOM killed. > >>> memory.high is supposed to never invoke the OOM killer (see above). It's > >>> unclear to me if you are referring to OOM kills from the kernel or > >>> userspace in the commit message. If the latter, I think it shouldn't be > >>> in kernel docs. > >> I am sorry for not being clear. What I meant is that if an application > >> is consuming more memory than what can be recovered by memory reclaim, > >> it will reach memory.max faster, if set, and get OOM killed. Will > >> clarify that in the next version. > > You're not really supposed to use max and high in conjunction. One is > > for kernel OOM killing, the other for userspace OOM killing. That's tho > > what the documentation that you edited is trying to explain. > > > > What's the usecase you have in mind? > > That is new to me that high and max are not supposed to be used > together. One problem with v1 is that by the time the limit is reached > and memory reclaim is not able to recover enough memory in time, the > task will be OOM killed. I always thought that by setting high to a bit > below max, say 90%, early memory reclaim will reduce the chance of OOM > kills. There are certainly others that think like that. I can't fault you or them for this, because this was the original plan for these knobs. However, this didn't end up working in practice. If you have a non-throttling, non-killing limit, then reclaim will either work and keep the workload to that limit; or it won't work, and the workload escapes to the hard limit and gets killed. You'll notice you get the same behavior with just memory.max set by itself - either reclaim can keep up, or OOM is triggered. > So the use case here is to reduce the chance of OOM kills without > letting really mishaving tasks from holding up useful memory for too long. That brings us to the idea of a medium amount of throttling. The premise would be that, by throttling *to a certain degree*, you can slow the workload down just enough to tide over the pressure peak and avert the OOM kill. This assumes that some tasks inside the cgroup can independently make forward progress and release memory, while allocating tasks inside the group are already throttled. [ Keep in mind, it's a cgroup-internal limit, so no memory freeing outside of the group can alleviate the situation. Progress must happen from within the cgroup. ] But this sort of parallelism in a pressured cgroup is unlikely in practice. By the time reclaim fails, usually *every task* in the cgroup ends up having to allocate. Because they lose executables to cache reclaim, or heap memory to swap etc, and then page fault. We found that more often than not, it just deteriorates into a single sequence of events. Slowing it down just drags out the inevitable. 
As a result we eventually moved away from the idea of gradual throttling. The last remnants of this idea finally disappeared from the docs last year (commit 5647e53f7856bb39dae781fe26aa65a699e2fc9f). memory.high now effectively puts the cgroup to sleep when reclaim fails (similar to oom killer disabling in v1, but without the caveats of that implementation). This is useful to let userspace implement custom OOM killing policies.
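As a concrete shape for such a userspace OOM killing policy: a small agent can watch the cgroup's memory.events file for the "high" breach counter and, if the group stays over its high limit for too long, kill it wholesale via cgroup.kill (available on recent cgroup v2 kernels). The path, grace period and decision rule below are illustrative assumptions only, not a recommended policy:

/* oomd-lite.c: toy userspace OOM policy of the kind memory.high is
 * meant to support.  If the cgroup keeps breaching memory.high (the
 * "high" counter in memory.events keeps growing) for GRACE_SECS in a
 * row, kill the whole group through cgroup.kill.  The cgroup path and
 * grace period are illustrative placeholders.
 */
#include <stdio.h>
#include <unistd.h>

#define CG         "/sys/fs/cgroup/system.slice/myapp.service"
#define GRACE_SECS 30

static unsigned long read_high_events(void)
{
	char line[128];
	unsigned long val = 0;
	FILE *f = fopen(CG "/memory.events", "r");

	if (!f)
		return 0;
	while (fgets(line, sizeof(line), f))
		if (sscanf(line, "high %lu", &val) == 1)
			break;
	fclose(f);
	return val;
}

int main(void)
{
	unsigned long prev = read_high_events();
	int stalled = 0;

	for (;;) {
		unsigned long cur;

		sleep(1);
		cur = read_high_events();
		stalled = (cur > prev) ? stalled + 1 : 0;
		prev = cur;

		if (stalled >= GRACE_SECS) {
			FILE *f = fopen(CG "/cgroup.kill", "w");

			if (f) {
				fputs("1", f);	/* kill every task in the cgroup */
				fclose(f);
			}
			return 0;
		}
	}
}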
On 1/31/25 07:19, Johannes Weiner wrote: > On Thu, Jan 30, 2025 at 12:07:31PM -0500, Waiman Long wrote: >> On 1/30/25 11:39 AM, Johannes Weiner wrote: >>> On Thu, Jan 30, 2025 at 09:52:29AM -0500, Waiman Long wrote: >>>> On 1/29/25 3:10 PM, Yosry Ahmed wrote: >>>>> On Wed, Jan 29, 2025 at 02:12:04PM -0500, Waiman Long wrote: >>>>>> Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing >>>>>> reclaim over memory.high"), the amount of allocator throttling had >>>>>> increased substantially. As a result, it could be difficult for a >>>>>> misbehaving application that consumes increasing amount of memory from >>>>>> being OOM-killed if memory.high is set. Instead, the application may >>>>>> just be crawling along holding close to the allowed memory.high memory >>>>>> for the current memory cgroup for a very long time especially those >>>>>> that do a lot of memcg charging and uncharging operations. >>>>>> >>>>>> This behavior makes the upstream Kubernetes community hesitate to >>>>>> use memory.high. Instead, they use only memory.max for memory control >>>>>> similar to what is being done for cgroup v1 [1]. >>>>>> >>>>>> To allow better control of the amount of throttling and hence the >>>>>> speed that a misbehving task can be OOM killed, a new single-value >>>>>> memory.high.throttle control file is now added. The allowable range >>>>>> is 0-32. By default, it has a value of 0 which means maximum throttling >>>>>> like before. Any non-zero positive value represents the corresponding >>>>>> power of 2 reduction of throttling and makes OOM kills easier to happen. >>>>>> >>>>>> System administrators can now use this parameter to determine how easy >>>>>> they want OOM kills to happen for applications that tend to consume >>>>>> a lot of memory without the need to run a special userspace memory >>>>>> management tool to monitor memory consumption when memory.high is set. >>>>>> >>>>>> Below are the test results of a simple program showing how different >>>>>> values of memory.high.throttle can affect its run time (in secs) until >>>>>> it gets OOM killed. This test program allocates pages from kernel >>>>>> continuously. There are some run-to-run variations and the results >>>>>> are just one possible set of samples. >>>>>> >>>>>> # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \ >>>>>> --wait -t timeout 300 /tmp/mmap-oom >>>>>> >>>>>> memory.high.throttle service runtime >>>>>> -------------------- --------------- >>>>>> 0 120.521 >>>>>> 1 103.376 >>>>>> 2 85.881 >>>>>> 3 69.698 >>>>>> 4 42.668 >>>>>> 5 45.782 >>>>>> 6 22.179 >>>>>> 7 9.909 >>>>>> 8 5.347 >>>>>> 9 3.100 >>>>>> 10 1.757 >>>>>> 11 1.084 >>>>>> 12 0.919 >>>>>> 13 0.650 >>>>>> 14 0.650 >>>>>> 15 0.655 >>>>>> >>>>>> [1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0 >>>>>> >>>>>> Signed-off-by: Waiman Long <longman@redhat.com> >>>>>> --- >>>>>> Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++-- >>>>>> include/linux/memcontrol.h | 2 ++ >>>>>> mm/memcontrol.c | 41 +++++++++++++++++++++++++ >>>>>> 3 files changed, 57 insertions(+), 2 deletions(-) >>>>>> >>>>>> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst >>>>>> index cb1b4e759b7e..df9410ad8b3b 100644 >>>>>> --- a/Documentation/admin-guide/cgroup-v2.rst >>>>>> +++ b/Documentation/admin-guide/cgroup-v2.rst >>>>>> @@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back. 
>>>>>> Going over the high limit never invokes the OOM killer and >>>>>> under extreme conditions the limit may be breached. The high >>>>>> limit should be used in scenarios where an external process >>>>>> - monitors the limited cgroup to alleviate heavy reclaim >>>>>> - pressure. >>>>>> + monitors the limited cgroup to alleviate heavy reclaim pressure >>>>>> + unless a high enough value is set in "memory.high.throttle". >>>>>> + >>>>>> + memory.high.throttle >>>>>> + A read-write single value file which exists on non-root >>>>>> + cgroups. The default is 0. >>>>>> + >>>>>> + Memory usage throttle control. This value controls the amount >>>>>> + of throttling that will be applied when memory consumption >>>>>> + exceeds the "memory.high" limit. The larger the value is, >>>>>> + the smaller the amount of throttling will be and the easier an >>>>>> + offending application may get OOM killed. >>>>> memory.high is supposed to never invoke the OOM killer (see above). It's >>>>> unclear to me if you are referring to OOM kills from the kernel or >>>>> userspace in the commit message. If the latter, I think it shouldn't be >>>>> in kernel docs. >>>> I am sorry for not being clear. What I meant is that if an application >>>> is consuming more memory than what can be recovered by memory reclaim, >>>> it will reach memory.max faster, if set, and get OOM killed. Will >>>> clarify that in the next version. >>> You're not really supposed to use max and high in conjunction. One is >>> for kernel OOM killing, the other for userspace OOM killing. That's tho >>> what the documentation that you edited is trying to explain. >>> >>> What's the usecase you have in mind? >> >> That is new to me that high and max are not supposed to be used >> together. One problem with v1 is that by the time the limit is reached >> and memory reclaim is not able to recover enough memory in time, the >> task will be OOM killed. I always thought that by setting high to a bit >> below max, say 90%, early memory reclaim will reduce the chance of OOM >> kills. There are certainly others that think like that. > > I can't fault you or them for this, because this was the original plan > for these knobs. However, this didn't end up working in practice. > > If you have a non-throttling, non-killing limit, then reclaim will > either work and keep the workload to that limit; or it won't work, and > the workload escapes to the hard limit and gets killed. > > You'll notice you get the same behavior with just memory.max set by > itself - either reclaim can keep up, or OOM is triggered. Yep that was intentional, it was best effort. > >> So the use case here is to reduce the chance of OOM kills without >> letting really mishaving tasks from holding up useful memory for too long. > > That brings us to the idea of a medium amount of throttling. > > The premise would be that, by throttling *to a certain degree*, you > can slow the workload down just enough to tide over the pressure peak > and avert the OOM kill. > > This assumes that some tasks inside the cgroup can independently make > forward progress and release memory, while allocating tasks inside the > group are already throttled. > > [ Keep in mind, it's a cgroup-internal limit, so no memory freeing > outside of the group can alleviate the situation. Progress must > happen from within the cgroup. ] > > But this sort of parallelism in a pressured cgroup is unlikely in > practice. By the time reclaim fails, usually *every task* in the > cgroup ends up having to allocate. 
Because they lose executables to > cache reclaim, or heap memory to swap etc, and then page fault. > > We found that more often than not, it just deteriorates into a single > sequence of events. Slowing it down just drags out the inevitable. > > As a result we eventually moved away from the idea of gradual > throttling. The last remnants of this idea finally disappeared from > the docs last year (commit 5647e53f7856bb39dae781fe26aa65a699e2fc9f). > > memory.high now effectively puts the cgroup to sleep when reclaim > fails (similar to oom killer disabling in v1, but without the caveats > of that implementation). This is useful to let userspace implement > custom OOM killing policies. > I've found using memory.high as limiting the way you've defined (using a benchmark like STREAM, the benchmark did not finish and was stalled for several hours when it was short of a few GB's of memory) and I did not have a background user space process to do a user space kill. In my case, reclaim was able to reclaim a little bit, so some forward progress was made and nr_retries limit was never hit (IIRC). Effectively in v1 soft_limit was supposed to be best effort pushing back and OOM kill could find a task to kill globally (initial design) if there was global memory pressure. For this discussion adding memory.high.throttle seems like it's bridging the transition from memory.high to memory.max/OOM without external intervention. I do feel that not killing the task, just locks the task in the memcg forever (at-least in my case) and it sounds like using memory.high requires an external process monitor to kill the task if it does not make progress. Warm Regards, Balbir Singh
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index cb1b4e759b7e..df9410ad8b3b 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -1291,8 +1291,20 @@ PAGE_SIZE multiple when read back.
 	Going over the high limit never invokes the OOM killer and
 	under extreme conditions the limit may be breached. The high
 	limit should be used in scenarios where an external process
-	monitors the limited cgroup to alleviate heavy reclaim
-	pressure.
+	monitors the limited cgroup to alleviate heavy reclaim pressure
+	unless a high enough value is set in "memory.high.throttle".
+
+  memory.high.throttle
+	A read-write single value file which exists on non-root
+	cgroups.  The default is 0.
+
+	Memory usage throttle control.  This value controls the amount
+	of throttling that will be applied when memory consumption
+	exceeds the "memory.high" limit.  The larger the value is,
+	the smaller the amount of throttling will be and the easier an
+	offending application may get OOM killed.
+
+	The valid range of this control file is 0-32.
 
   memory.max
 	A read-write single value file which exists on non-root
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6e74b8254d9b..b184d7b008d4 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -199,6 +199,8 @@ struct mem_cgroup {
 	struct list_head swap_peaks;
 	spinlock_t	 peaks_lock;
 
+	int high_throttle_shift;
+
 	/* Range enforcement for interrupt charges */
 	struct work_struct high_work;
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 46f8b372d212..2fa3fd99ebc9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2112,6 +2112,7 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	unsigned long nr_reclaimed;
 	unsigned int nr_pages = current->memcg_nr_pages_over_high;
 	int nr_retries = MAX_RECLAIM_RETRIES;
+	int throttle_shift;
 	struct mem_cgroup *memcg;
 	bool in_retry = false;
 
@@ -2156,6 +2157,13 @@ void mem_cgroup_handle_over_high(gfp_t gfp_mask)
 	penalty_jiffies += calculate_high_delay(memcg, nr_pages,
 						swap_find_max_overage(memcg));
 
+	/*
+	 * Reduce penalty according to the high_throttle_shift value.
+	 */
+	throttle_shift = READ_ONCE(memcg->high_throttle_shift);
+	if (throttle_shift)
+		penalty_jiffies >>= throttle_shift;
+
 	/*
 	 * Clamp the max delay per usermode return so as to still keep the
 	 * application moving forwards and also permit diagnostics, albeit
@@ -4172,6 +4180,33 @@ static ssize_t memory_max_write(struct kernfs_open_file *of,
 	return nbytes;
 }
 
+static u64 memory_high_throttle_read(struct cgroup_subsys_state *css,
+				     struct cftype *cft)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+
+	return READ_ONCE(memcg->high_throttle_shift);
+}
+
+static ssize_t memory_high_throttle_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes, loff_t off)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of));
+	u64 val;
+	int err;
+
+	buf = strstrip(buf);
+	err = kstrtoull(buf, 10, &val);
+	if (err)
+		return err;
+
+	if (val > 32)
+		return -EINVAL;
+
+	WRITE_ONCE(memcg->high_throttle_shift, (int)val);
+	return nbytes;
+}
+
 /*
  * Note: don't forget to update the 'samples/cgroup/memcg_event_listener'
  * if any new events become available.
@@ -4396,6 +4431,12 @@ static struct cftype memory_files[] = {
 		.seq_show = memory_high_show,
 		.write = memory_high_write,
 	},
+	{
+		.name = "high.throttle",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.read_u64 = memory_high_throttle_read,
+		.write = memory_high_throttle_write,
+	},
 	{
 		.name = "max",
 		.flags = CFTYPE_NOT_ON_ROOT,
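To make the power-of-2 semantics of the new file concrete, the following userspace-side sketch mirrors the shift-then-clamp logic added to mem_cgroup_handle_over_high() above. The HZ value and the 2-second per-return clamp are assumptions taken from the current mm/memcontrol.c and may differ on a given kernel; this only illustrates the arithmetic and is not kernel code:

/* Mirror of the new logic: the shift divides the computed penalty by
 * 2^shift before the existing clamp is applied.  Purely illustrative.
 */
#include <stdio.h>

#define HZ                            1000UL
#define MEMCG_MAX_HIGH_DELAY_JIFFIES  (2UL * HZ)

static unsigned long effective_delay(unsigned long penalty_jiffies,
				     int throttle_shift)
{
	if (throttle_shift)
		penalty_jiffies >>= throttle_shift;

	if (penalty_jiffies > MEMCG_MAX_HIGH_DELAY_JIFFIES)
		penalty_jiffies = MEMCG_MAX_HIGH_DELAY_JIFFIES;

	return penalty_jiffies;
}

int main(void)
{
	/* a raw 2-second penalty with memory.high.throttle = 0, 4, 8 */
	for (int shift = 0; shift <= 8; shift += 4)
		printf("shift %d -> %lu ms per return to userspace\n",
		       shift, effective_delay(2 * HZ, shift) * 1000 / HZ);
}

With a raw 2-second penalty, values of 0, 4 and 8 come out to roughly 2000 ms, 125 ms and 7 ms of sleep per return to userspace, which matches the trend of the run times in the table below shrinking as the value grows.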
Since commit 0e4b01df8659 ("mm, memcg: throttle allocators when failing
reclaim over memory.high"), the amount of allocator throttling has
increased substantially. As a result, it can be difficult for a
misbehaving application that consumes an increasing amount of memory to
be OOM-killed if memory.high is set. Instead, the application may just
crawl along, holding close to the allowed memory.high amount of memory
for the current memory cgroup for a very long time, especially if it
does a lot of memcg charging and uncharging operations.

This behavior makes the upstream Kubernetes community hesitate to use
memory.high. Instead, they use only memory.max for memory control,
similar to what is being done for cgroup v1 [1].

To allow better control of the amount of throttling, and hence of the
speed at which a misbehaving task can be OOM-killed, a new single-value
memory.high.throttle control file is added. The allowable range is
0-32. By default, it has a value of 0, which means maximum throttling
as before. Any non-zero positive value represents the corresponding
power-of-2 reduction of throttling and makes OOM kills easier to
happen.

System administrators can now use this parameter to determine how
easily they want OOM kills to happen for applications that tend to
consume a lot of memory, without needing to run a special userspace
memory management tool to monitor memory consumption when memory.high
is set.

Below are the test results of a simple program showing how different
values of memory.high.throttle affect its run time (in seconds) until
it gets OOM-killed. The test program allocates pages from the kernel
continuously. There are some run-to-run variations and the results are
just one possible set of samples.

  # systemd-run -p MemoryHigh=10M -p MemoryMax=20M -p MemorySwapMax=10M \
      --wait -t timeout 300 /tmp/mmap-oom

  memory.high.throttle    service runtime
  --------------------    ---------------
           0                  120.521
           1                  103.376
           2                   85.881
           3                   69.698
           4                   42.668
           5                   45.782
           6                   22.179
           7                    9.909
           8                    5.347
           9                    3.100
          10                    1.757
          11                    1.084
          12                    0.919
          13                    0.650
          14                    0.650
          15                    0.655

[1] https://docs.google.com/document/d/1mY0MTT34P-Eyv5G1t_Pqs4OWyIH-cg9caRKWmqYlSbI/edit?tab=t.0

Signed-off-by: Waiman Long <longman@redhat.com>
---
 Documentation/admin-guide/cgroup-v2.rst | 16 ++++++++--
 include/linux/memcontrol.h              |  2 ++
 mm/memcontrol.c                         | 41 +++++++++++++++++++++++++
 3 files changed, 57 insertions(+), 2 deletions(-)
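The /tmp/mmap-oom test program referenced above is not included in the patch. A hypothetical stand-in that behaves as described, repeatedly faulting in anonymous memory until the group hits memory.max and the OOM killer fires, might look like this (illustrative only, not the program actually used for the measurements):

/* mmap-oom.c: hypothetical stand-in for the test program referenced
 * above.  It keeps mapping and touching anonymous memory, 1 MiB at a
 * time, until memory.max is hit and the OOM killer intervenes.
 */
#include <string.h>
#include <sys/mman.h>

#define CHUNK (1UL << 20)	/* 1 MiB per iteration */

int main(void)
{
	for (;;) {
		char *p = mmap(NULL, CHUNK, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		if (p == MAP_FAILED)
			continue;	/* keep trying until OOM-killed */
		memset(p, 0xa5, CHUNK);	/* fault the pages in, charging the memcg */
	}
}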