Prevent OOM casualties by enforcing memcg limits

Message ID	ea6db5cc-f862-7c4b-d872-acb29c2d8193@sosna.de (mailing list archive)
State	New, archived
Headers	From: Alexander Sosna <alexander@sosna.de> To: linux-mm@kvack.org, linux-kernel@vger.kernel.org Subject: [PATCH] Prevent OOM casualties by enforcing memcg limits Message-ID: <ea6db5cc-f862-7c4b-d872-acb29c2d8193@sosna.de> Date: Mon, 26 Apr 2021 22:04:56 +0200
Series	Prevent OOM casualties by enforcing memcg limits

Alexander Sosna April 26, 2021, 8:04 p.m. UTC

Before this commit memory cgroup limits were not enforced during
allocation.  If a process within a cgroup tries to allocate more
memory than allowed, the kernel will not prevent the allocation even if
OVERCOMMIT_NEVER is set.  Then the OOM killer is activated to kill
processes in the corresponding cgroup.  This behavior is not to be expected
when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
problem for applications assuming that the kernel will deny an allocation
if not enough memory is available, like PostgreSQL.  To prevent this, a
check is implemented to not allow a process to allocate more memory than
its cgroup limit permits.  This means a process will not be killed while
accessing pages but will receive errors on memory allocation as
appropriate.  This gives programs a chance to handle memory allocation
failures gracefully instead of being reaped.

Signed-off-by: Alexander Sosna <alexander@sosna.de>

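diff --git a/mm/util.c b/mm/util.c
index a8bf17f18a81..c84b83c532c6 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -853,6 +853,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
  *
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.
+ * Code to enforce memory cgroup limits added 2021 by Alexander Sosna.
  *
  * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
  *
@@ -891,6 +892,34 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
 
 		allowed -= min_t(long, mm->total_vm / 32, reserve);
+
+#ifdef CONFIG_MEMCG
+		/*
+		 * If we are in a memory cgroup we also evaluate if the cgroup
+		 * has enough memory to allocate a new virtual mapping.
+		 * This is how we can keep processes from exceeding their
+		 * limits and also prevent that the OOM killer must be
+		 * awakened.  This gives programs a chance to handle memory
+		 * allocation failures gracefully and not being reaped.
+		 * In the current version mem_cgroup_get_max() is used which
+		 * allows the processes to exceeded their memory limits if
+		 * enough SWAP is available.  If this is not intended we could
+		 * use READ_ONCE(memcg->memory.max) instead.
+		 *
+		 * This code is only reached if sysctl_overcommit_memory equals
+		 * OVERCOMMIT_NEVER, both other options are handled above.
+		 */
+		{
+			struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+
+			if (memcg) {
+				long available = mem_cgroup_get_max(memcg)
+						- mem_cgroup_size(memcg);
+
+				allowed = min_t(long, available, allowed);
+			}
+		}
+#endif
 	}
 
 	if (percpu_counter_read_positive(&vm_committed_as) < allowed)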
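
For readers following the thread, the clamp this patch adds to
__vm_enough_memory() can be modeled in plain userspace C.  This is only a
sketch under stated assumptions: clamp_allowed_to_memcg() and commit_fits()
are hypothetical names, and the plain long parameters stand in for
mem_cgroup_get_max(), mem_cgroup_size() and the vm_committed_as counter.
All values are in pages.

```c
#include <assert.h>

/* Hypothetical userspace model; none of these names are kernel API. */
static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/*
 * Models the added block: the commit budget is capped at what the
 * cgroup still has left (limit minus current usage).
 */
long clamp_allowed_to_memcg(long allowed, long memcg_max, long memcg_usage)
{
	assert(memcg_max >= 0 && memcg_usage >= 0);

	long available = memcg_max - memcg_usage;

	return min_long(available, allowed);
}

/* Models the final comparison: the request fits if committed < allowed. */
int commit_fits(long committed, long allowed)
{
	return committed < allowed;
}
```

In the kernel, the committed value compared here is the system-wide
vm_committed_as counter, while the clamp is computed per cgroup.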

Chris Down April 27, 2021, 12:09 a.m. UTC | #1

Hi Alexander,

Alexander Sosna writes:
>Before this commit memory cgroup limits were not enforced during
>allocation.  If a process within a cgroup tries to allocates more
>memory than allowed, the kernel will not prevent the allocation even if
>OVERCOMMIT_NEVER is set.  Than the OOM killer is activated to kill
>processes in the corresponding cgroup.

Unresolvable cgroup overages are indifferent to vm.overcommit_memory, since 
exceeding memory.max is not overcommitment, it's just a natural consequence of 
the fact that allocation and reclaim are not atomic processes. Overcommitment, 
on the other hand, is about the bounds of available memory at the global 
resource level.

>This behavior is not to be expected
>when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
>problem for applications assuming that the kernel will deny an allocation
>if not enough memory is available, like PostgreSQL.  To prevent this a
>check is implemented to not allow a process to allocate more memory than
>limited by it's cgroup.  This means a process will not be killed while
>accessing pages but will receive errors on memory allocation as
>appropriate.  This gives programs a chance to handle memory allocation
>failures gracefully instead of being reaped.

We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It can 
still happen for a bunch of reasons, so I really hope PostgreSQL isn't relying 
on that.

Could you please be more clear about the "huge problem" being solved here? I'm 
not seeing it.

>Signed-off-by: Alexander Sosna <alexander@sosna.de>
>
>diff --git a/mm/util.c b/mm/util.c
>index a8bf17f18a81..c84b83c532c6 100644
>--- a/mm/util.c
>+++ b/mm/util.c
>@@ -853,6 +853,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
>  *
>  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
>  * Additional code 2002 Jul 20 by Robert Love.
>+ * Code to enforce memory cgroup limits added 2021 by Alexander Sosna.
>  *
>  * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
>  *
>@@ -891,6 +892,34 @@ int __vm_enough_memory(struct mm_struct *mm, long
>pages, int cap_sys_admin)
> 		long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
>
> 		allowed -= min_t(long, mm->total_vm / 32, reserve);
>+
>+#ifdef CONFIG_MEMCG
>+		/*
>+		 * If we are in a memory cgroup we also evaluate if the cgroup
>+		 * has enough memory to allocate a new virtual mapping.

This comment confuses me further, I'm afraid. You're talking about virtual 
mappings, but then checking memory.max, which is about allocated pages.

>+		 * This is how we can keep processes from exceeding their
>+		 * limits and also prevent that the OOM killer must be
>+		 * awakened.  This gives programs a chance to handle memory
>+		 * allocation failures gracefully and not being reaped.
>+		 * In the current version mem_cgroup_get_max() is used which
>+		 * allows the processes to exceeded their memory limits if
>+		 * enough SWAP is available.  If this is not intended we could
>+		 * use READ_ONCE(memcg->memory.max) instead.
>+		 *
>+		 * This code is only reached if sysctl_overcommit_memory equals
>+		 * OVERCOMMIT_NEVER, both other options are handled above.
>+		 */
>+		{
>+			struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
>+
>+			if (memcg) {
>+				long available = mem_cgroup_get_max(memcg)
>+						- mem_cgroup_size(memcg);
>+
>+				allowed = min_t(long, available, allowed);
>+			}
>+		}
>+#endif
> 	}
>
> 	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
>

Alexander Sosna April 27, 2021, 6:37 a.m. UTC | #2

Hi Chris,

Am 27.04.21 um 02:09 schrieb Chris Down:
> Hi Alexander,
> 
> Alexander Sosna writes:
>> Before this commit memory cgroup limits were not enforced during
>> allocation.  If a process within a cgroup tries to allocates more
>> memory than allowed, the kernel will not prevent the allocation even if
>> OVERCOMMIT_NEVER is set.  Than the OOM killer is activated to kill
>> processes in the corresponding cgroup.
> 
> Unresolvable cgroup overages are indifferent to vm.overcommit_memory,
> since exceeding memory.max is not overcommitment, it's just a natural
> consequence of the fact that allocation and reclaim are not atomic
> processes. Overcommitment, on the other hand, is about the bounds of
> available memory at the global resource level.
> 
>> This behavior is not to be expected
>> when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
>> problem for applications assuming that the kernel will deny an allocation
>> if not enough memory is available, like PostgreSQL.  To prevent this a
>> check is implemented to not allow a process to allocate more memory than
>> limited by it's cgroup.  This means a process will not be killed while
>> accessing pages but will receive errors on memory allocation as
>> appropriate.  This gives programs a chance to handle memory allocation
>> failures gracefully instead of being reaped.
> 
> We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
> can still happen for a bunch of reasons, so I really hope PostgreSQL
> isn't relying on that.
> 
> Could you please be more clear about the "huge problem" being solved
> here? I'm not seeing it.

Let me explain the problem I encounter and why I fell down the mm rabbit
hole.  It is not a PostgreSQL-specific problem, but that's where I ran
into it.  PostgreSQL forks a backend for each client connection.  All
backends have shared memory as well as local work memory.  When a
backend needs more dynamic work_mem to execute a query, new memory
is allocated.  It is normal that such an allocation can fail.  If the
backend gets an ENOMEM, the current query is rolled back and all dynamic
work_mem is freed.  The RDBMS stays operational and no other query is
disturbed.
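
The recovery pattern described above can be sketched in C roughly as
follows.  This is not PostgreSQL code: struct backend, backend_run_query()
and the injectable alloc_fn are hypothetical names chosen for illustration,
and always_fail() is a stand-in that simulates the kernel refusing the
allocation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Allocator hook; must return malloc()-compatible memory or NULL. */
typedef void *(*alloc_fn)(size_t size);

/* Hypothetical per-connection state, only counters for this sketch. */
struct backend {
	size_t queries_ok;
	size_t queries_aborted;
};

/* Stand-in allocator that simulates an allocation denied with ENOMEM. */
void *always_fail(size_t size)
{
	(void)size;
	errno = ENOMEM;
	return NULL;
}

/*
 * Runs one "query" that needs work_mem bytes.  On allocation failure
 * only this query is aborted; the backend itself stays usable.
 */
int backend_run_query(struct backend *b, size_t work_mem, alloc_fn alloc)
{
	assert(b != NULL && alloc != NULL);

	char *buf = alloc(work_mem);

	if (buf == NULL) {
		b->queries_aborted++;	/* roll back, nothing to free */
		return -ENOMEM;
	}

	memset(buf, 0, work_mem);	/* placeholder for the real work */
	b->queries_ok++;
	free(buf);			/* release the dynamic work memory */
	return 0;
}
```

Calling backend_run_query(&b, 4096, malloc) and then
backend_run_query(&b, 4096, always_fail) leaves the backend with one
successful and one aborted query.  The pattern only helps if the failure is
reported at allocation time, which is the behavior requested in this thread.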

When running in a memory cgroup - for example via systemd or on k8s -
the kernel will not return ENOMEM even if the cgroup's memory limit is
exceeded.  Instead the OOM killer is awakened and kills processes in the
violating cgroup.  If any backend is killed with SIGKILL the shared
memory of the whole cluster is deemed potentially corrupted and
PostgreSQL needs to do an emergency restart.  This cancels all operations
on all backends and entails a potentially lengthy recovery process.
Therefore the behavior is quite "costly".

I totally understand that vm.overcommit_memory 2 does not mean "no OOM
killer". IMHO it should mean "no OOM killer if we can avoid it" and I
would highly appreciate it if the kernel used less invasive means
whenever possible.  I guess this might also be the expectation of many
other users.  In my described case - which is a real pain for me - it is
quite easy to tweak the kernel behavior in order to handle this and
other similar situations with fewer casualties.  This is why I sent a
patch instead of starting a theoretical discussion.

What do you think is necessary to get this to an approvable quality?

>> Signed-off-by: Alexander Sosna <alexander@sosna.de>
>>
>> diff --git a/mm/util.c b/mm/util.c
>> index a8bf17f18a81..c84b83c532c6 100644
>> --- a/mm/util.c
>> +++ b/mm/util.c
>> @@ -853,6 +853,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
>>  *
>>  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
>>  * Additional code 2002 Jul 20 by Robert Love.
>> + * Code to enforce memory cgroup limits added 2021 by Alexander Sosna.
>>  *
>>  * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
>>  *
>> @@ -891,6 +892,34 @@ int __vm_enough_memory(struct mm_struct *mm, long
>> pages, int cap_sys_admin)
>>         long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
>>
>>         allowed -= min_t(long, mm->total_vm / 32, reserve);
>> +
>> +#ifdef CONFIG_MEMCG
>> +        /*
>> +         * If we are in a memory cgroup we also evaluate if the cgroup
>> +         * has enough memory to allocate a new virtual mapping.
> 
> This comment confuses me further, I'm afraid. You're talking about
> virtual mappings, but then checking memory.max, which is about allocated
> pages.

I had some problems understanding all the mm and cgroup related code in the
kernel and wished for helpful comments here and there.  So I tried at
least to document my code and made it worse.  Thank you for pointing
this out.

>> +         * This is how we can keep processes from exceeding their
>> +         * limits and also prevent that the OOM killer must be
>> +         * awakened.  This gives programs a chance to handle memory
>> +         * allocation failures gracefully and not being reaped.
>> +         * In the current version mem_cgroup_get_max() is used which
>> +         * allows the processes to exceeded their memory limits if
>> +         * enough SWAP is available.  If this is not intended we could
>> +         * use READ_ONCE(memcg->memory.max) instead.
>> +         *
>> +         * This code is only reached if sysctl_overcommit_memory equals
>> +         * OVERCOMMIT_NEVER, both other options are handled above.
>> +         */
>> +        {
>> +            struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
>> +
>> +            if (memcg) {
>> +                long available = mem_cgroup_get_max(memcg)
>> +                        - mem_cgroup_size(memcg);
>> +
>> +                allowed = min_t(long, available, allowed);
>> +            }
>> +        }
>> +#endif
>>     }
>>
>>     if (percpu_counter_read_positive(&vm_committed_as) < allowed)
>>

Michal Hocko April 27, 2021, 7:53 a.m. UTC | #3

On Mon 26-04-21 22:04:56, Alexander Sosna wrote:
> Before this commit memory cgroup limits were not enforced during
> allocation.  If a process within a cgroup tries to allocates more
> memory than allowed, the kernel will not prevent the allocation even if
> OVERCOMMIT_NEVER is set.  Than the OOM killer is activated to kill
> processes in the corresponding cgroup.  This behavior is not to be expected
> when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
> problem for applications assuming that the kernel will deny an allocation
> if not enough memory is available, like PostgreSQL.

The memory cgroup controller by design accounts for physically allocated
memory, while the overcommit policy is a global control on virtual memory
allocation. Memcg is not aware of the virtual memory commitment, so it
cannot really evaluate the OVERCOMMIT_NEVER heuristic.

> To prevent this a
> check is implemented to not allow a process to allocate more memory than
> limited by it's cgroup.  This means a process will not be killed while
> accessing pages but will receive errors on memory allocation as
> appropriate.  This gives programs a chance to handle memory allocation
> failures gracefully instead of being reaped.

I am afraid I have to nak this patch. It changes the long-standing
semantics of a user interface, which can break many existing applications.
So you would need to create a new overcommit mode which would be
explicitly memcg aware.

As mentioned above memcg would need to have some awareness of the
virtual memory committed for the memcg. Without that
OVERCOMMIT_NEVER_MEMCG would effectively turn into OVERCOMMIT_GUESS.
 
> Signed-off-by: Alexander Sosna <alexander@sosna.de>

Nacked-by: Michal Hocko <mhocko@suse.com>

> diff --git a/mm/util.c b/mm/util.c
> index a8bf17f18a81..c84b83c532c6 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -853,6 +853,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
>   *
>   * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
>   * Additional code 2002 Jul 20 by Robert Love.
> + * Code to enforce memory cgroup limits added 2021 by Alexander Sosna.
>   *
>   * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
>   *
> @@ -891,6 +892,34 @@ int __vm_enough_memory(struct mm_struct *mm, long
> pages, int cap_sys_admin)
>  		long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
> 
>  		allowed -= min_t(long, mm->total_vm / 32, reserve);
> +
> +#ifdef CONFIG_MEMCG
> +		/*
> +		 * If we are in a memory cgroup we also evaluate if the cgroup
> +		 * has enough memory to allocate a new virtual mapping.
> +		 * This is how we can keep processes from exceeding their
> +		 * limits and also prevent that the OOM killer must be
> +		 * awakened.  This gives programs a chance to handle memory
> +		 * allocation failures gracefully and not being reaped.
> +		 * In the current version mem_cgroup_get_max() is used which
> +		 * allows the processes to exceeded their memory limits if
> +		 * enough SWAP is available.  If this is not intended we could
> +		 * use READ_ONCE(memcg->memory.max) instead.
> +		 *
> +		 * This code is only reached if sysctl_overcommit_memory equals
> +		 * OVERCOMMIT_NEVER, both other options are handled above.
> +		 */
> +		{
> +			struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> +
> +			if (memcg) {
> +				long available = mem_cgroup_get_max(memcg)
> +						- mem_cgroup_size(memcg);
> +
> +				allowed = min_t(long, available, allowed);
> +			}
> +		}
> +#endif
>  	}
> 
>  	if (percpu_counter_read_positive(&vm_committed_as) < allowed)

Michal Hocko April 27, 2021, 8:08 a.m. UTC | #4

On Tue 27-04-21 08:37:30, Alexander Sosna wrote:
> Hi Chris,
> 
> Am 27.04.21 um 02:09 schrieb Chris Down:
> > Hi Alexander,
> > 
> > Alexander Sosna writes:
> >> Before this commit memory cgroup limits were not enforced during
> >> allocation.  If a process within a cgroup tries to allocates more
> >> memory than allowed, the kernel will not prevent the allocation even if
> >> OVERCOMMIT_NEVER is set.  Than the OOM killer is activated to kill
> >> processes in the corresponding cgroup.
> > 
> > Unresolvable cgroup overages are indifferent to vm.overcommit_memory,
> > since exceeding memory.max is not overcommitment, it's just a natural
> > consequence of the fact that allocation and reclaim are not atomic
> > processes. Overcommitment, on the other hand, is about the bounds of
> > available memory at the global resource level.
> > 
> >> This behavior is not to be expected
> >> when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
> >> problem for applications assuming that the kernel will deny an allocation
> >> if not enough memory is available, like PostgreSQL.  To prevent this a
> >> check is implemented to not allow a process to allocate more memory than
> >> limited by it's cgroup.  This means a process will not be killed while
> >> accessing pages but will receive errors on memory allocation as
> >> appropriate.  This gives programs a chance to handle memory allocation
> >> failures gracefully instead of being reaped.
> > 
> > We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
> > can still happen for a bunch of reasons, so I really hope PostgreSQL
> > isn't relying on that.
> > 
> > Could you please be more clear about the "huge problem" being solved
> > here? I'm not seeing it.
> 
> let me explain the problem I encounter and why I fell down the mm rabbit
> hole.  It is not a PostgreSQL specific problem but that's where I run
> into it.  PostgreSQL forks a backend for each client connection.  All
> backends have shared memory as well as local work memory.  When a
> backend needs more dynamic work_mem to execute a query, new memory
> is allocated.  It is normal that such an allocation can fail.  If the
> backend gets an ENOMEM the current query is rolled back an all dynamic
> work_mem is freed. The RDBMS stays operational an no other query is
> disturbed.

I am afraid the kernel MM implementation has never been really
compatible with such a memory allocation model. Linux has always
preferred to pretend there is always memory available and to reclaim
memory - including by killing some processes - rather than fail the
allocation with ENOMEM. Overcommit configuration (especially
OVERCOMMIT_NEVER) is an attempt to somehow mitigate this ambitious
memory allocation approach but in reality this has turned out to be a)
unreliable and b) unusable with modern userspace which relies on
considerable virtual memory overcommit.

> When running in a memory cgroup - for example via systemd or on k8s -
> the kernel will not return ENOMEM even if the cgroup's memory limit is
> exceeded.

Yes, memcg doesn't change the overall approach. It just restricts the
existing semantics with a smaller memory limit. Also, the overcommit
heuristic has never been implemented for memory controllers.

> Instead the OOM killer is awakened and kills processes in the
> violating cgroup.  If any backend is killed with SIGKILL the shared
> memory of the whole cluster is deemed potentially corrupted and
> PostgreSQL needs to do an emergency restart.  This cancels all operation
> on all backends and it entails a potentially lengthy recovery process.
> Therefore the behavior is quite "costly".

One way around that would be to use high limit rather than hard limit
and pro-actively watch for memory utilization and communicate that back
to the application to throttle its workers. I can see how that
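
A userspace monitor along these lines could be sketched as follows.  It
assumes the cgroup v2 files memory.current and memory.high have already
been read into strings; should_throttle() and its percent threshold are
hypothetical and only illustrate the decision, not a recommended policy.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parses the text of a cgroup v2 memory file.  "max" means no limit.
 * Returns 0 on success, -1 if the text is not a number or "max".
 */
static int parse_cgroup_value(const char *text, unsigned long long *out,
			      int *unlimited)
{
	char *end;

	*unlimited = 0;
	*out = 0;
	if (strncmp(text, "max", 3) == 0) {
		*unlimited = 1;
		return 0;
	}
	errno = 0;
	*out = strtoull(text, &end, 10);
	if (errno != 0 || end == text)
		return -1;
	return 0;
}

/*
 * Returns 1 if usage has reached roughly percent of memory.high,
 * 0 if not (or if no high limit is set), -1 on a parse error.
 */
int should_throttle(const char *current_text, const char *high_text,
		    unsigned int percent)
{
	unsigned long long cur, high;
	int cur_unlimited, high_unlimited;

	assert(current_text != NULL && high_text != NULL);
	if (percent > 100)
		percent = 100;
	if (parse_cgroup_value(current_text, &cur, &cur_unlimited) != 0 ||
	    cur_unlimited)
		return -1;
	if (parse_cgroup_value(high_text, &high, &high_unlimited) != 0)
		return -1;
	if (high_unlimited)
		return 0;
	/* Divide first so the multiplication cannot overflow. */
	return cur >= (high / 100) * percent;
}
```

A supervisor would call this periodically and ask the application to slow
down its workers once it returns 1.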

> I totally understand that vm.overcommit_memory 2 does not mean "no OOM
> killer". IMHO it should mean "no OOM killer if we can avoid it" and I

I do not see how it can ever promise anything like that. Memory
consumption by kernel subsystems cannot be predicted at the time virtual
memory is allocated from userspace. Not only can it not be predicted, but
it is also highly impractical to force kernel allocations - necessary
for the OS operation - to fail just because userspace has reserved
virtual memory. So this all is just a heuristic to help in some
extreme cases but overall I consider OVERCOMMIT_NEVER impractical, to
say the least.

> would highly appreciate if the kernel would use a less invasive means
> whenever possible.  I guess this might also be the expectation by many
> other users.  In my described case - which is a real pain for me - it is
> quite easy to tweak the kernel behavior in order to handle this and
> other similar situations with less casualties.  This is why I send a
> patch instead of starting a theoretical discussion.

I am pretty sure that many users would agree with you on that but the
matter of fact is that a different approach has been chosen
historically. We can argue whether this has been a good or bad design
decision but I do not see that to change without a lot of fallouts. Btw.
a strong memory reservation approach can be found with hugetlb pages and
this one has turned out to be very tricky both from implementation and
userspace usage POV. Needless to say that it operates on a single
purpose preallocated memory pool and it would be quite reasonable to
expect the complexity would grow with more users of the pool which is
the general case for general purpose memory allocator.

> What do you think is necessary to get this to an approvable quality?

See my other reply.

Alexander Sosna April 27, 2021, 11:01 a.m. UTC | #5

On 27.04.21 10:08, Michal Hocko wrote:
> On Tue 27-04-21 08:37:30, Alexander Sosna wrote:
>> Hi Chris,
>>
>> Am 27.04.21 um 02:09 schrieb Chris Down:
>>> Hi Alexander,
>>>
>>> Alexander Sosna writes:
>>>> Before this commit memory cgroup limits were not enforced during
>>>> allocation.  If a process within a cgroup tries to allocates more
>>>> memory than allowed, the kernel will not prevent the allocation even if
>>>> OVERCOMMIT_NEVER is set.  Than the OOM killer is activated to kill
>>>> processes in the corresponding cgroup.
>>>
>>> Unresolvable cgroup overages are indifferent to vm.overcommit_memory,
>>> since exceeding memory.max is not overcommitment, it's just a natural
>>> consequence of the fact that allocation and reclaim are not atomic
>>> processes. Overcommitment, on the other hand, is about the bounds of
>>> available memory at the global resource level.
>>>
>>>> This behavior is not to be expected
>>>> when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
>>>> problem for applications assuming that the kernel will deny an allocation
>>>> if not enough memory is available, like PostgreSQL.  To prevent this a
>>>> check is implemented to not allow a process to allocate more memory than
>>>> limited by it's cgroup.  This means a process will not be killed while
>>>> accessing pages but will receive errors on memory allocation as
>>>> appropriate.  This gives programs a chance to handle memory allocation
>>>> failures gracefully instead of being reaped.
>>>
>>> We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
>>> can still happen for a bunch of reasons, so I really hope PostgreSQL
>>> isn't relying on that.
>>>
>>> Could you please be more clear about the "huge problem" being solved
>>> here? I'm not seeing it.
>>
>> let me explain the problem I encounter and why I fell down the mm rabbit
>> hole.  It is not a PostgreSQL specific problem but that's where I run
>> into it.  PostgreSQL forks a backend for each client connection.  All
>> backends have shared memory as well as local work memory.  When a
>> backend needs more dynamic work_mem to execute a query, new memory
>> is allocated.  It is normal that such an allocation can fail.  If the
>> backend gets an ENOMEM the current query is rolled back an all dynamic
>> work_mem is freed. The RDBMS stays operational an no other query is
>> disturbed.
> 
> I am afraid the kernel MM implementation has never been really
> compatible with such a memory allocation model. Linux has always
> preferred to pretend there is always memory available and rather reclaim
> memory - including by killing some processes - rather than fail the
> allocation eith ENOMEM. Overcommit configuration (especially
> OVERCOMMIT_NEVER) is an attempt to somehow mitigate this ambitious
> memory allocation approach but in reality this has turned out a)
> unreliable and b) unsuable with modern userspace which relies on
> considerable virtual memory overcommit.

Thank you for taking the time to discuss this issue with me.  I agree
that the kernel and a lot of software prefer to pretend there is more
memory than there really is.  It was also never possible to assume that
the OOM killer is fully absent.  I have been running production Linux
systems for quite a while now, and without memory cgroups involved
OVERCOMMIT_NEVER does a pretty good job.  I can't even remember the last
time the OOM killer caused me any problems on a properly configured
database server.  This is what I would like, and what users should be
able to expect, when using cgroup memory limits as well.

Please correct me if I am wrong, but "modern userspace which relies on
considerable virtual memory overcommit" should not rely on the kernel to
overcommit memory when OVERCOMMIT_NEVER is explicitly set.

>> When running in a memory cgroup - for example via systemd or on k8s -
>> the kernel will not return ENOMEM even if the cgroup's memory limit is
>> exceeded.
> 
> Yes, memcg doesn't change the overal approach. It just restricts the
> existing semantic with a smaller memory limit. Also overcommit heuristic
> has never been implemented for memory controllers.
> 
>> Instead the OOM killer is awakened and kills processes in the
>> violating cgroup.  If any backend is killed with SIGKILL the shared
>> memory of the whole cluster is deemed potentially corrupted and
>> PostgreSQL needs to do an emergency restart.  This cancels all operation
>> on all backends and it entails a potentially lengthy recovery process.
>> Therefore the behavior is quite "costly".
> 
> One way around that would be to use high limit rather than hard limit
> and pro-actively watch for memory utilization and communicate that back
> to the application to throttle its workers. I can see how that
> 
>> I totally understand that vm.overcommit_memory 2 does not mean "no OOM
>> killer". IMHO it should mean "no OOM killer if we can avoid it" and I
> 
> I do not see how it can ever promise anything like that. Memory
> consumption by kernel subsystems cannot be predicted at the time virtual
> memory allocated from the userspace. Not only it cannot be predicted but
> it is also highly impractical to force kernel allocations - necessary
> for the OS operation - to fail just because userspace has reserved
> virtual memory. So this all is just a heuristic to help in some
> extreme cases but overall I consider OVERCOMMIT_NEVER as impractical to
> say the least.

I can't quite follow why we would need to let kernel allocations
fail here.  Yes, if you run a system to a point where the kernel can't
free enough memory, invasive decisions have to be made.  Think of an
application server running multiple applications in memcgs, each with its
limits way below the available resources.  Why is it preferable to
SIGKILL a process rather than just deny the limit-exceeding malloc, when
OVERCOMMIT_NEVER is set, of course?

>> would highly appreciate if the kernel would use a less invasive means
>> whenever possible.  I guess this might also be the expectation by many
>> other users.  In my described case - which is a real pain for me - it is
>> quite easy to tweak the kernel behavior in order to handle this and
>> other similar situations with less casualties.  This is why I send a
>> patch instead of starting a theoretical discussion.
> 
> I am pretty sure that many users would agree with you on that but the
> matter of fact is that a different approach has been chosen
> historically. We can argue whether this has been a good or bad design
> decision but I do not see that to change without a lot of fallouts. Btw.
> a strong memory reservation approach can be found with hugetlb pages and
> this one has turned out to be very tricky both from implementation and
> userspace usage POV. Needless to say that it operates on a single
> purpose preallocated memory pool and it would be quite reasonable to
> expect the complexity would grow with more users of the pool which is
> the general case for general purpose memory allocator.

The history is very interesting and needs to be taken into
consideration.  What drives me is to help myself and all other Linux
users to run workloads like RDBMS reliably, even in modern environments
like k8s which make use of memory cgroups.  I see a gain for the
community in developing a reliable and easily available solution, even if
my current approach might be amateurish and is not the right answer.  Could
you elaborate on where you see "a lot of fallouts"?  overcommit_memory 2
is only set when needed for the desired workload.

If the gain is worth it, one could implement an overcommit_memory 3 in
order to enable this behavior.  overcommit_memory needs to be explicitly
set by the sysadmin anyway.

>> What do you think is necessary to get this to an approvable quality?
> 
> See my other reply.

Michal Hocko April 27, 2021, 12:11 p.m. UTC | #6

On Tue 27-04-21 13:01:33, Alexander Sosna wrote:
[...]
> Please correct me if I am wrong, but "modern userspace which relies on
> considerable virtual memory overcommit" should not rely on the kernel to
> overcommit memory when OVERCOMMIT_NEVER is explicitly set.

Correct. Which makes its application very limited, in my experience.

> >> When running in a memory cgroup - for example via systemd or on k8s -
> >> the kernel will not return ENOMEM even if the cgroup's memory limit is
> >> exceeded.
> > 
> > Yes, memcg doesn't change the overal approach. It just restricts the
> > existing semantic with a smaller memory limit. Also overcommit heuristic
> > has never been implemented for memory controllers.
> > 
> >> Instead the OOM killer is awakened and kills processes in the
> >> violating cgroup.  If any backend is killed with SIGKILL the shared
> >> memory of the whole cluster is deemed potentially corrupted and
> >> PostgreSQL needs to do an emergency restart.  This cancels all operation
> >> on all backends and it entails a potentially lengthy recovery process.
> >> Therefore the behavior is quite "costly".
> > 
> > One way around that would be to use high limit rather than hard limit
> > and pro-actively watch for memory utilization and communicate that back
> > to the application to throttle its workers. I can see how that
> > 
> >> I totally understand that vm.overcommit_memory 2 does not mean "no OOM
> >> killer". IMHO it should mean "no OOM killer if we can avoid it" and I
> > 
> > I do not see how it can ever promise anything like that. Memory
> > consumption by kernel subsystems cannot be predicted at the time virtual
> > memory allocated from the userspace. Not only it cannot be predicted but
> > it is also highly impractical to force kernel allocations - necessary
> > for the OS operation - to fail just because userspace has reserved
> > virtual memory. So this all is just a heuristic to help in some
> > extreme cases but overall I consider OVERCOMMIT_NEVER as impractical to
> > say the least.
> 
> I'm not fully able to follow you why we need to let kernel allocations
> fail here.  Yes, if you run a system to a point where the kernel can't
> free enough memory, invasive decisions have to be made.

OK. But then I do not see what "no OOM killer if we can avoid it" is
supposed to mean. There are only 2 ways around that. Either start
failing allocations or reclaim by tearing down processes, as all other
means of memory reclaim have already been exercised.

> Think of an
> application server running multiple applications in memcgs each with its
> limits way below the available resources.  Why is it preferable to
> SIGKILL a process rather than just deny the limit exceeding malloc, when
> OVERCOMMIT_NEVER is set of cause?

Because the actual physical memory allocation for malloc might (and
usually does) happen much later than the virtual memory allocated for it
(brk or mmap). Memory requirements could have changed considerably
between the two events. An allocation struggling to make forward
progress might be for a completely different purpose than the overcommit
accounted one. Does this make more sense now?
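
The gap between reserving address space and actually consuming physical
memory can be observed from userspace. The following is a minimal Python
sketch, not part of the patch or of this thread. It assumes Linux with
/proc/self/statm available; the function names and the 64 MiB size are
chosen purely for illustration.

```python
import mmap
import os

PAGE = os.sysconf("SC_PAGE_SIZE")


def resident_bytes() -> int:
    """Resident set size of this process, from /proc/self/statm (Linux only)."""
    with open("/proc/self/statm") as f:
        # Second field is the number of resident pages.
        return int(f.read().split()[1]) * PAGE


def reserve_then_touch(size: int = 64 * 1024 * 1024):
    """Map `size` bytes anonymously, then write one byte per page.

    Returns (RSS growth after mapping, RSS growth after touching).
    """
    before = resident_bytes()
    region = mmap.mmap(-1, size)      # virtual reservation only
    after_map = resident_bytes()
    for offset in range(0, size, PAGE):
        region[offset] = 1            # page fault: physical page is charged now
    after_touch = resident_bytes()
    region.close()
    return after_map - before, after_touch - after_map


if __name__ == "__main__":
    reserved, touched = reserve_then_touch()
    print(f"RSS growth after mmap:  {reserved // 1024} KiB")
    print(f"RSS growth after touch: {touched // 1024} KiB")
```

On a typical system the first number should stay near zero while the
second approaches the mapping size. In other words, the point at which a
memcg limit is hit is the page fault, not the mmap or brk call that an
overcommit check could refuse.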

> >> would highly appreciate if the kernel would use a less invasive means
> >> whenever possible.  I guess this might also be the expectation by many
> >> other users.  In my described case - which is a real pain for me - it is
> >> quite easy to tweak the kernel behavior in order to handle this and
> >> other similar situations with less casualties.  This is why I send a
> >> patch instead of starting a theoretical discussion.
> > 
> > I am pretty sure that many users would agree with you on that but the
> > matter of fact is that a different approach has been chosen
> > historically. We can argue whether this has been a good or bad design
> > decision but I do not see that to change without a lot of fallouts. Btw.
> > a strong memory reservation approach can be found with hugetlb pages and
> > this one has turned out to be very tricky both from implementation and
> > userspace usage POV. Needless to say that it operates on a single
> > purpose preallocated memory pool and it would be quite reasonable to
> > expect the complexity would grow with more users of the pool which is
> > the general case for general purpose memory allocator.
> 
> The history is very interesting and needs to be taken into
> consideration.  What drives me is to help myself and all other Linux
> user to run workloads like RDBMS reliable, even in modern environments
> like k8s which make use of memory cgroups.  I see a gain for the
> community to develop a reliable and easy available solution, even if my
> current approach might be amateurish and is not the right answer.

Well, I am afraid that a reliable and easy solution would be extremely
hard to find. A memcg aware overcommit policy is certainly possible but,
as I've said, it would require additional accounting and it would be
quite unreliable - especially with small limits where the mapped (and
accounted) address space is not predominant. A lack of background
reclaim (kswapd in the global case) would result in ENOMEM being
reported even though there is reclaimable memory to satisfy the reserved
address space, etc.

> Could
> you elaborate on where you see "a lot of fallouts"?  overcommit_memory 2
> is only set when needed for the desired workload.

My above comment was more general, about the approach where Linux
embraces overcommit and relies on the OOM killer to handle the fallout.
Changing this would lead to a lot of fallout, e.g. many syscalls
returning unexpected and unhandled ENOMEM etc.

Chris Down April 27, 2021, 12:26 p.m. UTC | #7

Alexander Sosna writes:
>> We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
>> can still happen for a bunch of reasons, so I really hope PostgreSQL
>> isn't relying on that.
>>
>> Could you please be more clear about the "huge problem" being solved
>> here? I'm not seeing it.
>
>let me explain the problem I encounter and why I fell down the mm rabbit
>hole.  It is not a PostgreSQL specific problem but that's where I run
>into it.  PostgreSQL forks a backend for each client connection.  All
>backends have shared memory as well as local work memory.  When a
>backend needs more dynamic work_mem to execute a query, new memory
>is allocated.  It is normal that such an allocation can fail.  If the
>backend gets an ENOMEM the current query is rolled back an all dynamic
>work_mem is freed. The RDBMS stays operational an no other query is
>disturbed.
>
>When running in a memory cgroup - for example via systemd or on k8s -
>the kernel will not return ENOMEM even if the cgroup's memory limit is
>exceeded.  Instead the OOM killer is awakened and kills processes in the
>violating cgroup.  If any backend is killed with SIGKILL the shared
>memory of the whole cluster is deemed potentially corrupted and
>PostgreSQL needs to do an emergency restart.  This cancels all operation
>on all backends and it entails a potentially lengthy recovery process.
>Therefore the behavior is quite "costly".

My point that memory cgroups are completely overcommit agnostic isn't just a 
question of abstract semantics, but a practical one. Exceeding memory.max is 
not overcommitment, because overages are physical, not virtual, and that has 
vastly different ramifications in terms of what managing that overage means.

For example, if we aggressively ENOMEM at the memory.max bounds, there's no 
provision for the natural bounds of memory reclaim to occur. Now maybe 
your application likes that (which I find highly dubious), but from a memory 
balancing perspective it's just nonsensical: we need to ensure that we're 
assisting forward progress of the system at the cgroup level, especially with 
the huge amounts of slack generated.

>I totally understand that vm.overcommit_memory 2 does not mean "no OOM
>killer". IMHO it should mean "no OOM killer if we can avoid it" and I
>would highly appreciate if the kernel would use a less invasive means
>whenever possible.  I guess this might also be the expectation by many
>other users.  In my described case - which is a real pain for me - it is
>quite easy to tweak the kernel behavior in order to handle this and
>other similar situations with less casualties.  This is why I send a
>patch instead of starting a theoretical discussion.

vm.overcommit_memory=2 means "don't overcommit", nothing less, nothing more. 
Adding more semantics is a very good way to make an extremely confusing and 
overloaded API.
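
For reference, the global accounting that vm.overcommit_memory=2 enforces
can be read back from /proc/meminfo. The sketch below is illustrative
only and not part of this thread. It uses the documented relation
CommitLimit = (MemTotal - hugetlb) * overcommit_ratio / 100 + SwapTotal
and ignores vm.overcommit_kbytes (which replaces the ratio when set) as
well as the kernel's reserve adjustments. The function names and the
sample numbers are assumptions made for the example.

```python
def parse_meminfo(text: str) -> dict:
    """Parse /proc/meminfo-style text into {field: integer value}."""
    fields = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])
    return fields


def commit_limit_kib(mem_total_kib: int, swap_total_kib: int,
                     overcommit_ratio: int = 50, hugetlb_kib: int = 0) -> int:
    """Approximate CommitLimit for the ratio-based configuration."""
    return (mem_total_kib - hugetlb_kib) * overcommit_ratio // 100 + swap_total_kib


def commit_headroom_kib(meminfo: dict) -> int:
    """Address space that can still be committed system-wide."""
    return meminfo["CommitLimit"] - meminfo["Committed_AS"]


if __name__ == "__main__":
    with open("/proc/meminfo") as f:
        info = parse_meminfo(f.read())
    print("CommitLimit  :", info["CommitLimit"], "kB")
    print("Committed_AS :", info["Committed_AS"], "kB")
    print("headroom     :", commit_headroom_kib(info), "kB")
```

Note that both values are system-wide and count virtual address space.
Neither is tracked per memory cgroup, which is why memory.max and the
overcommit policy do not interact.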

This commit reminds me of the comments on cosmetic products that say "no 
parabens". Ok, so there's no parabens -- great, parabens are terrible -- but 
are you now using a much more dangerous preservative instead?

Likewise, this commit claims that it reduces the likelihood of invoking the OOM 
killer -- great, nobody wants their processes to be OOM killed. What do we have 
instead? Code that calls off memory allocations way, way before it's needed to 
do so, and prevents the system from even getting into a state where it can 
efficiently evaluate how it should rebalance memory. That's really not a good 
tradeoff.

>What do you think is necessary to get this to an approvable quality?

The problem is not the code, it's the concept and the way it interacts with the 
rest of the mm subsystem. It asks the mm subsystem to deny memory allocations 
long before it has even had a chance to reliably rebalance (just as one 
example, to punt anon pages to swap) based on the new allocations, which 
doesn't make very much sense. It may not break in some highly trivial setups, 
but it certainly will not work well with stacking or machines with high 
volatility of the anon/file LRUs. You're also likely to see random ENOMEM 
failures from kernelspace when operating under this memcg context long before 
such a response was necessary, which doesn't make much sense.

If you want to know when to back off allocations, use memory.high with PSI 
pressure metrics.
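
As a rough illustration of that suggestion (not code from this thread), a
supervisor could poll the cgroup v2 memory.pressure file and tell workers
to throttle once the "some" stall average crosses a threshold. In the
Python sketch below the cgroup directory, the 10% threshold and the
function names are hypothetical; only the file format is taken from the
kernel's PSI interface.

```python
def parse_pressure(text: str) -> dict:
    """Parse PSI text such as
    'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'
    into {'some': {'avg10': 0.0, ...}, 'full': {...}}.
    """
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, pairs = parts[0], parts[1:]
        result[kind] = {
            key: float(value)
            for key, value in (pair.split("=", 1) for pair in pairs)
        }
    return result


def should_back_off(pressure: dict, threshold_pct: float = 10.0) -> bool:
    """True if tasks were stalled on memory for at least threshold_pct
    of the last 10 seconds (the 'some avg10' value)."""
    return pressure.get("some", {}).get("avg10", 0.0) >= threshold_pct


def read_cgroup_pressure(cgroup_dir: str) -> dict:
    """Read memory.pressure from a cgroup v2 directory (path is an assumption)."""
    with open(f"{cgroup_dir}/memory.pressure") as f:
        return parse_pressure(f.read())
```

With memory.high set below memory.max, the cgroup is reclaimed and
throttled when it exceeds the high boundary instead of being OOM killed,
so a check like should_back_off() gives the application a chance to shed
load before the hard limit is reached.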

I also would strongly suggest that vm.overcommit_memory=2 is the equivalent of 
using a bucket of ignited thermite to warm one's house.

Alexander Sosna April 27, 2021, 1:43 p.m. UTC | #8

On 27.04.21 14:11, Michal Hocko wrote:
> On Tue 27-04-21 13:01:33, Alexander Sosna wrote:
> [...]
>> Please correct me if I am wrong, but "modern userspace which relies on
>> considerable virtual memory overcommit" should not rely on the kernel to
>> overcommit memory when OVERCOMMIT_NEVER is explicitly set.
> 
> Correct. Which makes it application very limited from my experience.

Yes.  It is a special tool for a special use case; I use it exclusively
for database servers and hosting DBaaS.  Therefore my point is that a
change in its behavior only affects special use cases and should take
their requirements into consideration.

>>>> When running in a memory cgroup - for example via systemd or on k8s -
>>>> the kernel will not return ENOMEM even if the cgroup's memory limit is
>>>> exceeded.
>>>
>>> Yes, memcg doesn't change the overal approach. It just restricts the
>>> existing semantic with a smaller memory limit. Also overcommit heuristic
>>> has never been implemented for memory controllers.
>>>
>>>> Instead the OOM killer is awakened and kills processes in the
>>>> violating cgroup.  If any backend is killed with SIGKILL the shared
>>>> memory of the whole cluster is deemed potentially corrupted and
>>>> PostgreSQL needs to do an emergency restart.  This cancels all operation
>>>> on all backends and it entails a potentially lengthy recovery process.
>>>> Therefore the behavior is quite "costly".
>>>
>>> One way around that would be to use high limit rather than hard limit
>>> and pro-actively watch for memory utilization and communicate that back
>>> to the application to throttle its workers. I can see how that
>>>
>>>> I totally understand that vm.overcommit_memory 2 does not mean "no OOM
>>>> killer". IMHO it should mean "no OOM killer if we can avoid it" and I
>>>
>>> I do not see how it can ever promise anything like that. Memory
>>> consumption by kernel subsystems cannot be predicted at the time virtual
>>> memory allocated from the userspace. Not only it cannot be predicted but
>>> it is also highly impractical to force kernel allocations - necessary
>>> for the OS operation - to fail just because userspace has reserved
>>> virtual memory. So this all is just a heuristic to help in some
>>> extreme cases but overall I consider OVERCOMMIT_NEVER as impractical to
>>> say the least.
>>
>> I'm not fully able to follow you why we need to let kernel allocations
>> fail here.  Yes, if you run a system to a point where the kernel can't
>> free enough memory, invasive decisions have to be made.
> 
> OK. But then I do not see what "no OOM killer if we can avoid it" is
> suppose to mean. There are only 2 ways around that. Either start
> failing allocations or reclaim by tearing down processes as all other
> means of memory reclaim have been already exercised.

Yes, of course, if all means of memory reclaim have already been
exercised, there is not much to do.  But I want to prevent the OOM killer
from reaping processes, especially if more than enough memory is
available.  There are many reasons to enforce limits even if enough
resources are available.  For example, to prevent bad neighbor behavior,
to leave plenty of free memory for the file system cache, or because
someone pays only for a given amount of resources in a SaaS offering and
the SREs would like to enforce the limit without shutting down their
client's database service.  :)
With KVM this is quite easy: just use one VM per RDBMS instance and set
overcommit_memory=2, but this would waste a lot of resources.

>> Think of an
>> application server running multiple applications in memcgs each with its
>> limits way below the available resources.  Why is it preferable to
>> SIGKILL a process rather than just deny the limit exceeding malloc, when
>> OVERCOMMIT_NEVER is set of cause?
> 
> Because the actual physical memory allocation for malloc might (and
> usually does) happen much later than the virtual memory allocated for it
> (brk or mmap). Memory requirements could have changed considerably
> between the two events. An allocation struggling to make a forward
> progress might be for a completely different purpose than the overcommit
> accounted one. Does this make more sense now?

Yes.  Thank you for the explanation.  What you describe is the
observable behavior in many pieces of software.  We have to keep in mind
that we are talking about the special case of OVERCOMMIT_NEVER here.
Software that wants / expects to run on such a system normally has a
much tighter allocation behavior.  PostgreSQL, for example, allocates
memory when it is needed and in the needed quantity, and will swiftly
write to all of it.  Further, it deals very gracefully with an
out-of-memory situation (malloc returns NULL) by simply reporting back
to the client that a query was aborted due to out of memory.
Overcommitting doesn't make any sense with such a disciplined
application, and sysadmins configure their kernel accordingly.  The
implied overcommit and OOM killer activity currently rule out using
cgroup limits with such an application, as there is no way to handle a
SIGKILL gracefully.  Therefore I was quite happy with the results when
testing my patch with a DBaaS workload.

>>>> would highly appreciate if the kernel would use a less invasive means
>>>> whenever possible.  I guess this might also be the expectation by many
>>>> other users.  In my described case - which is a real pain for me - it is
>>>> quite easy to tweak the kernel behavior in order to handle this and
>>>> other similar situations with less casualties.  This is why I send a
>>>> patch instead of starting a theoretical discussion.
>>>
>>> I am pretty sure that many users would agree with you on that but the
>>> matter of fact is that a different approach has been chosen
>>> historically. We can argue whether this has been a good or bad design
>>> decision but I do not see that to change without a lot of fallouts. Btw.
>>> a strong memory reservation approach can be found with hugetlb pages and
>>> this one has turned out to be very tricky both from implementation and
>>> userspace usage POV. Needless to say that it operates on a single
>>> purpose preallocated memory pool and it would be quite reasonable to
>>> expect the complexity would grow with more users of the pool which is
>>> the general case for general purpose memory allocator.
>>
>> The history is very interesting and needs to be taken into
>> consideration.  What drives me is to help myself and all other Linux
>> user to run workloads like RDBMS reliable, even in modern environments
>> like k8s which make use of memory cgroups.  I see a gain for the
>> community to develop a reliable and easy available solution, even if my
>> current approach might be amateurish and is not the right answer.
> 
> Well, I am afraid that a reliable and easy solutions would be extremely
> hard to find. A memcg aware overcommit policy is certainly possible but
> as I've said it would require an additional accounting, it would be
> quite unreliable - especially with small limits where the mapped (and
> accounted) address space is not predominant. A lack of background
> reclaim (kswapd in the global case) would result in ENOMEM reported even
> though there is reclaimable memory to satisfy the reserved address space
> etc.

Thank you very much for this information.  Would you share the opinion
that it would be too hacky to define an arbitrary memory threshold here?
One could say that while memory usage is below X, the memory cgroup
limit is not enforced by denying a malloc(), so that the status quo
behavior is only altered when the memory usage is above X.  This would
mitigate the problem with small limits and does not introduce new risks
or surprises, because in this edge case it will behave identically to
the current kernel.

>> Could
>> you elaborate on where you see "a lot of fallouts"?  overcommit_memory 2
>> is only set when needed for the desired workload.
> 
> My above comment was more general to the approach Linux is embracing
> overcommit and relies on oom killer to handle fallouts. This to change
> would lead to lot of fallouts. E.g. many syscalls returning unexpected
> and unhandled ENOMEM etc.

We are talking about a special use case here.  Do you see a problem in
the domain where and how overcommit_memory=2 is used today?

Michal Hocko April 27, 2021, 2:17 p.m. UTC | #9

On Tue 27-04-21 15:43:25, Alexander Sosna wrote:
> 
> On 27.04.21 14:11, Michal Hocko wrote:
[...]
> > Well, I am afraid that a reliable and easy solutions would be extremely
> > hard to find. A memcg aware overcommit policy is certainly possible but
> > as I've said it would require an additional accounting, it would be
> > quite unreliable - especially with small limits where the mapped (and
> > accounted) address space is not predominant. A lack of background
> > reclaim (kswapd in the global case) would result in ENOMEM reported even
> > though there is reclaimable memory to satisfy the reserved address space
> > etc.
> 
> Thank you very much for this information.  Would you share the opinion
> that it would be too hacky to define an arbitrary memory threshold here?
>  One could say that below a used memory of X the memory cgroup limit is
> not enforced by denying a malloc().  So that the status quo behavior is
> only altered when the memory usage is above X.  This would mitigate the
> problem with small limits and does not introduce new risks or surprises,
> because in this edge case it will behaves identical to the current kernel.

It will not. Please read again about the memory reclaim concern. There
is no background reclaim, so (and I believe Chris has mentioned that in
another email) the only way to balance memory consumption (e.g. caches)
would be memory allocations which are excluded from the virtual memory
accounting. That can lead to hard-to-predict behavior.

> >> Could
> >> you elaborate on where you see "a lot of fallouts"?  overcommit_memory 2
> >> is only set when needed for the desired workload.
> > 
> > My above comment was more general to the approach Linux is embracing
> > overcommit and relies on oom killer to handle fallouts. This to change
> > would lead to lot of fallouts. E.g. many syscalls returning unexpected
> > and unhandled ENOMEM etc.
> 
> We are talking about a special use case here.  Do you see a problem in
> the domain where and how overcommit_memory=2 is used today?

Yes, I do. I believe I have already provided some real challenges. All
that being said, a virtual memory overcommit control could be
implemented, but I am not sure this is worth the additional complexity
and overhead introduced by the additional accounting.
