Message ID | ea6db5cc-f862-7c4b-d872-acb29c2d8193@sosna.de (mailing list archive) |
---|---|
State | New, archived |
Series | Prevent OOM casualties by enforcing memcg limits |
Hi Alexander,

Alexander Sosna writes:
>Before this commit memory cgroup limits were not enforced during
>allocation. If a process within a cgroup tries to allocate more
>memory than allowed, the kernel will not prevent the allocation even if
>OVERCOMMIT_NEVER is set. Then the OOM killer is activated to kill
>processes in the corresponding cgroup.

Unresolvable cgroup overages are indifferent to vm.overcommit_memory,
since exceeding memory.max is not overcommitment, it's just a natural
consequence of the fact that allocation and reclaim are not atomic
processes. Overcommitment, on the other hand, is about the bounds of
available memory at the global resource level.

>This behavior is not to be expected
>when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
>problem for applications assuming that the kernel will deny an allocation
>if not enough memory is available, like PostgreSQL. To prevent this a
>check is implemented to not allow a process to allocate more memory than
>limited by its cgroup. This means a process will not be killed while
>accessing pages but will receive errors on memory allocation as
>appropriate. This gives programs a chance to handle memory allocation
>failures gracefully instead of being reaped.

We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
can still happen for a bunch of reasons, so I really hope PostgreSQL
isn't relying on that.

Could you please be more clear about the "huge problem" being solved
here? I'm not seeing it.

>Signed-off-by: Alexander Sosna <alexander@sosna.de>
>
>diff --git a/mm/util.c b/mm/util.c
>index a8bf17f18a81..c84b83c532c6 100644
>--- a/mm/util.c
>+++ b/mm/util.c
>@@ -853,6 +853,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
>  *
>  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
>  * Additional code 2002 Jul 20 by Robert Love.
>+ * Code to enforce memory cgroup limits added 2021 by Alexander Sosna.
>  *
>  * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
>  *
>@@ -891,6 +892,34 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
> 		long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
>
> 		allowed -= min_t(long, mm->total_vm / 32, reserve);
>+
>+#ifdef CONFIG_MEMCG
>+		/*
>+		 * If we are in a memory cgroup we also evaluate if the cgroup
>+		 * has enough memory to allocate a new virtual mapping.

This comment confuses me further, I'm afraid. You're talking about
virtual mappings, but then checking memory.max, which is about allocated
pages.

>+		 * This is how we can keep processes from exceeding their
>+		 * limits and also prevent that the OOM killer must be
>+		 * awakened. This gives programs a chance to handle memory
>+		 * allocation failures gracefully and not being reaped.
>+		 * In the current version mem_cgroup_get_max() is used which
>+		 * allows the processes to exceeded their memory limits if
>+		 * enough SWAP is available. If this is not intended we could
>+		 * use READ_ONCE(memcg->memory.max) instead.
>+		 *
>+		 * This code is only reached if sysctl_overcommit_memory equals
>+		 * OVERCOMMIT_NEVER, both other options are handled above.
>+		 */
>+		{
>+			struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
>+
>+			if (memcg) {
>+				long available = mem_cgroup_get_max(memcg)
>+						- mem_cgroup_size(memcg);
>+
>+				allowed = min_t(long, available, allowed);
>+			}
>+		}
>+#endif
> 	}
>
> 	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
>
Hi Chris,

On 27.04.21 at 02:09, Chris Down wrote:
> Hi Alexander,
>
> Alexander Sosna writes:
>> Before this commit memory cgroup limits were not enforced during
>> allocation. If a process within a cgroup tries to allocate more
>> memory than allowed, the kernel will not prevent the allocation even if
>> OVERCOMMIT_NEVER is set. Then the OOM killer is activated to kill
>> processes in the corresponding cgroup.
>
> Unresolvable cgroup overages are indifferent to vm.overcommit_memory,
> since exceeding memory.max is not overcommitment, it's just a natural
> consequence of the fact that allocation and reclaim are not atomic
> processes. Overcommitment, on the other hand, is about the bounds of
> available memory at the global resource level.
>
>> This behavior is not to be expected
>> when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
>> problem for applications assuming that the kernel will deny an allocation
>> if not enough memory is available, like PostgreSQL. To prevent this a
>> check is implemented to not allow a process to allocate more memory than
>> limited by its cgroup. This means a process will not be killed while
>> accessing pages but will receive errors on memory allocation as
>> appropriate. This gives programs a chance to handle memory allocation
>> failures gracefully instead of being reaped.
>
> We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
> can still happen for a bunch of reasons, so I really hope PostgreSQL
> isn't relying on that.
>
> Could you please be more clear about the "huge problem" being solved
> here? I'm not seeing it.

Let me explain the problem I encounter and why I fell down the mm rabbit
hole. It is not a PostgreSQL specific problem but that's where I run
into it. PostgreSQL forks a backend for each client connection. All
backends have shared memory as well as local work memory. When a
backend needs more dynamic work_mem to execute a query, new memory
is allocated. It is normal that such an allocation can fail. If the
backend gets an ENOMEM the current query is rolled back and all dynamic
work_mem is freed. The RDBMS stays operational and no other query is
disturbed.

When running in a memory cgroup - for example via systemd or on k8s -
the kernel will not return ENOMEM even if the cgroup's memory limit is
exceeded. Instead the OOM killer is awakened and kills processes in the
violating cgroup. If any backend is killed with SIGKILL the shared
memory of the whole cluster is deemed potentially corrupted and
PostgreSQL needs to do an emergency restart. This cancels all operations
on all backends and it entails a potentially lengthy recovery process.
Therefore the behavior is quite "costly".

I totally understand that vm.overcommit_memory 2 does not mean "no OOM
killer". IMHO it should mean "no OOM killer if we can avoid it" and I
would highly appreciate if the kernel would use a less invasive means
whenever possible. I guess this might also be the expectation by many
other users. In my described case - which is a real pain for me - it is
quite easy to tweak the kernel behavior in order to handle this and
other similar situations with fewer casualties. This is why I send a
patch instead of starting a theoretical discussion.

What do you think is necessary to get this to an approvable quality?

>> Signed-off-by: Alexander Sosna <alexander@sosna.de>
>>
>> diff --git a/mm/util.c b/mm/util.c
>> index a8bf17f18a81..c84b83c532c6 100644
>> --- a/mm/util.c
>> +++ b/mm/util.c
>> @@ -853,6 +853,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
>>   *
>>   * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
>>   * Additional code 2002 Jul 20 by Robert Love.
>> + * Code to enforce memory cgroup limits added 2021 by Alexander Sosna.
>>   *
>>   * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
>>   *
>> @@ -891,6 +892,34 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>>  		long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
>>
>>  		allowed -= min_t(long, mm->total_vm / 32, reserve);
>> +
>> +#ifdef CONFIG_MEMCG
>> +		/*
>> +		 * If we are in a memory cgroup we also evaluate if the cgroup
>> +		 * has enough memory to allocate a new virtual mapping.
>
> This comment confuses me further, I'm afraid. You're talking about
> virtual mappings, but then checking memory.max, which is about allocated
> pages.

I had some problems understanding all mm and cgroup related code in the
kernel and wished for helpful comments here and there. So I tried at
least to document my code and made it worse. Thank you for pointing
this out.

>> +		 * This is how we can keep processes from exceeding their
>> +		 * limits and also prevent that the OOM killer must be
>> +		 * awakened. This gives programs a chance to handle memory
>> +		 * allocation failures gracefully and not being reaped.
>> +		 * In the current version mem_cgroup_get_max() is used which
>> +		 * allows the processes to exceeded their memory limits if
>> +		 * enough SWAP is available. If this is not intended we could
>> +		 * use READ_ONCE(memcg->memory.max) instead.
>> +		 *
>> +		 * This code is only reached if sysctl_overcommit_memory equals
>> +		 * OVERCOMMIT_NEVER, both other options are handled above.
>> +		 */
>> +		{
>> +			struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
>> +
>> +			if (memcg) {
>> +				long available = mem_cgroup_get_max(memcg)
>> +						- mem_cgroup_size(memcg);
>> +
>> +				allowed = min_t(long, available, allowed);
>> +			}
>> +		}
>> +#endif
>>  	}
>>
>>  	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
>>
On Mon 26-04-21 22:04:56, Alexander Sosna wrote:
> Before this commit memory cgroup limits were not enforced during
> allocation. If a process within a cgroup tries to allocate more
> memory than allowed, the kernel will not prevent the allocation even if
> OVERCOMMIT_NEVER is set. Then the OOM killer is activated to kill
> processes in the corresponding cgroup. This behavior is not to be expected
> when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
> problem for applications assuming that the kernel will deny an allocation
> if not enough memory is available, like PostgreSQL.

The memory cgroup controller by design accounts for physically allocated
memory, while the overcommit policy is a global control over virtual
memory allocation. Memcg is not aware of the virtual memory commitment,
so it cannot really evaluate the OVERCOMMIT_NEVER heuristic.

> To prevent this a
> check is implemented to not allow a process to allocate more memory than
> limited by its cgroup. This means a process will not be killed while
> accessing pages but will receive errors on memory allocation as
> appropriate. This gives programs a chance to handle memory allocation
> failures gracefully instead of being reaped.

I am afraid I have to nak this patch. It changes a long-standing
semantic of a user interface, which can break many existing
applications. So you would need to create a new overcommit mode which
would be explicitly memcg aware. As mentioned above, memcg would need to
have some awareness of the virtual memory committed for the memcg.
Without that, OVERCOMMIT_NEVER_MEMCG would effectively turn into
OVERCOMMIT_GUESS.

> Signed-off-by: Alexander Sosna <alexander@sosna.de>

Nacked-by: Michal Hocko <mhocko@suse.com>

> diff --git a/mm/util.c b/mm/util.c
> index a8bf17f18a81..c84b83c532c6 100644
> --- a/mm/util.c
> +++ b/mm/util.c
> @@ -853,6 +853,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
>   *
>   * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
>   * Additional code 2002 Jul 20 by Robert Love.
> + * Code to enforce memory cgroup limits added 2021 by Alexander Sosna.
>   *
>   * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
>   *
> @@ -891,6 +892,34 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
>  		long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
>
>  		allowed -= min_t(long, mm->total_vm / 32, reserve);
> +
> +#ifdef CONFIG_MEMCG
> +		/*
> +		 * If we are in a memory cgroup we also evaluate if the cgroup
> +		 * has enough memory to allocate a new virtual mapping.
> +		 * This is how we can keep processes from exceeding their
> +		 * limits and also prevent that the OOM killer must be
> +		 * awakened. This gives programs a chance to handle memory
> +		 * allocation failures gracefully and not being reaped.
> +		 * In the current version mem_cgroup_get_max() is used which
> +		 * allows the processes to exceeded their memory limits if
> +		 * enough SWAP is available. If this is not intended we could
> +		 * use READ_ONCE(memcg->memory.max) instead.
> +		 *
> +		 * This code is only reached if sysctl_overcommit_memory equals
> +		 * OVERCOMMIT_NEVER, both other options are handled above.
> +		 */
> +		{
> +			struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
> +
> +			if (memcg) {
> +				long available = mem_cgroup_get_max(memcg)
> +						- mem_cgroup_size(memcg);
> +
> +				allowed = min_t(long, available, allowed);
> +			}
> +		}
> +#endif
>  	}
>
>  	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
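The mismatch raised in this reply, a system-wide virtual commitment being compared against a per-cgroup physical headroom, can be illustrated with a small userspace model. This is only a sketch under stated assumptions: `struct commit_model`, `min_long()` and `model_enough_memory()` are hypothetical names invented for illustration, the numbers are plain page counts, and none of this is kernel code or a faithful copy of `__vm_enough_memory()`.

```c
#include <stdbool.h>

/* Hypothetical userspace model of the patched check; all values are pages. */
struct commit_model {
	long global_committed; /* stands in for vm_committed_as (system-wide, virtual) */
	long global_allowed;   /* stands in for the OVERCOMMIT_NEVER commit limit */
	long memcg_max;        /* stands in for mem_cgroup_get_max() (physical) */
	long memcg_usage;      /* stands in for mem_cgroup_size() (physical) */
};

static long min_long(long a, long b)
{
	return a < b ? a : b;
}

/* Same shape as the patch: clamp "allowed" by cgroup headroom, then compare. */
static bool model_enough_memory(const struct commit_model *m)
{
	long available = m->memcg_max - m->memcg_usage;
	long allowed = min_long(available, m->global_allowed);

	/* The left side is a global virtual figure, the right side may now be
	 * a per-cgroup physical figure. */
	return m->global_committed < allowed;
}
```

In this model a cgroup with a 500-page limit and zero usage is still refused as soon as the rest of the system has committed 1000 pages, because the comparison never looks at what the cgroup itself has committed.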
On Tue 27-04-21 08:37:30, Alexander Sosna wrote:
> Hi Chris,
>
> On 27.04.21 at 02:09, Chris Down wrote:
> > Hi Alexander,
> >
> > Alexander Sosna writes:
> >> Before this commit memory cgroup limits were not enforced during
> >> allocation. If a process within a cgroup tries to allocate more
> >> memory than allowed, the kernel will not prevent the allocation even if
> >> OVERCOMMIT_NEVER is set. Then the OOM killer is activated to kill
> >> processes in the corresponding cgroup.
> >
> > Unresolvable cgroup overages are indifferent to vm.overcommit_memory,
> > since exceeding memory.max is not overcommitment, it's just a natural
> > consequence of the fact that allocation and reclaim are not atomic
> > processes. Overcommitment, on the other hand, is about the bounds of
> > available memory at the global resource level.
> >
> >> This behavior is not to be expected
> >> when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
> >> problem for applications assuming that the kernel will deny an allocation
> >> if not enough memory is available, like PostgreSQL. To prevent this a
> >> check is implemented to not allow a process to allocate more memory than
> >> limited by its cgroup. This means a process will not be killed while
> >> accessing pages but will receive errors on memory allocation as
> >> appropriate. This gives programs a chance to handle memory allocation
> >> failures gracefully instead of being reaped.
> >
> > We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
> > can still happen for a bunch of reasons, so I really hope PostgreSQL
> > isn't relying on that.
> >
> > Could you please be more clear about the "huge problem" being solved
> > here? I'm not seeing it.
>
> Let me explain the problem I encounter and why I fell down the mm rabbit
> hole. It is not a PostgreSQL specific problem but that's where I run
> into it. PostgreSQL forks a backend for each client connection. All
> backends have shared memory as well as local work memory. When a
> backend needs more dynamic work_mem to execute a query, new memory
> is allocated. It is normal that such an allocation can fail. If the
> backend gets an ENOMEM the current query is rolled back and all dynamic
> work_mem is freed. The RDBMS stays operational and no other query is
> disturbed.

I am afraid the kernel MM implementation has never been really
compatible with such a memory allocation model. Linux has always
preferred to pretend there is always memory available and rather reclaim
memory - including by killing some processes - rather than fail the
allocation with ENOMEM. Overcommit configuration (especially
OVERCOMMIT_NEVER) is an attempt to somehow mitigate this ambitious
memory allocation approach but in reality this has turned out a)
unreliable and b) unusable with modern userspace which relies on
considerable virtual memory overcommit.

> When running in a memory cgroup - for example via systemd or on k8s -
> the kernel will not return ENOMEM even if the cgroup's memory limit is
> exceeded.

Yes, memcg doesn't change the overall approach. It just restricts the
existing semantic with a smaller memory limit. Also overcommit heuristic
has never been implemented for memory controllers.

> Instead the OOM killer is awakened and kills processes in the
> violating cgroup. If any backend is killed with SIGKILL the shared
> memory of the whole cluster is deemed potentially corrupted and
> PostgreSQL needs to do an emergency restart. This cancels all operations
> on all backends and it entails a potentially lengthy recovery process.
> Therefore the behavior is quite "costly".

One way around that would be to use high limit rather than hard limit
and pro-actively watch for memory utilization and communicate that back
to the application to throttle its workers. I can see how that

> I totally understand that vm.overcommit_memory 2 does not mean "no OOM
> killer". IMHO it should mean "no OOM killer if we can avoid it" and I

I do not see how it can ever promise anything like that. Memory
consumption by kernel subsystems cannot be predicted at the time virtual
memory is allocated from userspace. Not only it cannot be predicted but
it is also highly impractical to force kernel allocations - necessary
for the OS operation - to fail just because userspace has reserved
virtual memory. So this all is just a heuristic to help in some
extreme cases but overall I consider OVERCOMMIT_NEVER as impractical to
say the least.

> would highly appreciate if the kernel would use a less invasive means
> whenever possible. I guess this might also be the expectation by many
> other users. In my described case - which is a real pain for me - it is
> quite easy to tweak the kernel behavior in order to handle this and
> other similar situations with fewer casualties. This is why I send a
> patch instead of starting a theoretical discussion.

I am pretty sure that many users would agree with you on that but the
matter of fact is that a different approach has been chosen
historically. We can argue whether this has been a good or bad design
decision but I do not see that to change without a lot of fallouts. Btw.
a strong memory reservation approach can be found with hugetlb pages and
this one has turned out to be very tricky both from implementation and
userspace usage POV. Needless to say that it operates on a single
purpose preallocated memory pool and it would be quite reasonable to
expect the complexity would grow with more users of the pool which is
the general case for general purpose memory allocator.

> What do you think is necessary to get this to an approvable quality?

See my other reply.
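The suggestion above, to set a high limit and proactively watch utilization, could be sketched in userspace roughly as follows. It assumes the caller has already read the contents of the cgroup v2 files `memory.current` and `memory.high` into strings; the helper names `parse_memcg_value()` and `should_throttle()` and the percentage threshold are hypothetical and not taken from the thread.

```c
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the text of a cgroup v2 memory file: either "max" or a decimal
 * byte count. Returns false if the text is neither.
 */
static bool parse_memcg_value(const char *text, unsigned long long *out,
			      bool *is_max)
{
	char *end;

	if (strncmp(text, "max", 3) == 0) {
		*is_max = true;
		*out = 0;
		return true;
	}
	*is_max = false;
	*out = strtoull(text, &end, 10);
	return end != text;
}

/*
 * Decide whether the application should throttle its workers: true once
 * usage reaches threshold_percent of the high limit. With no high limit
 * configured ("max") this sketch never asks for throttling.
 */
static bool should_throttle(const char *current_text, const char *high_text,
			    unsigned int threshold_percent)
{
	unsigned long long current, high;
	bool current_is_max, high_is_max;

	if (!parse_memcg_value(current_text, &current, &current_is_max) ||
	    current_is_max)
		return false;
	if (!parse_memcg_value(high_text, &high, &high_is_max) || high_is_max)
		return false;

	/* Integer form of current / high >= threshold_percent / 100. */
	return current * 100 >= high * threshold_percent;
}
```

A supervisor process might poll both files periodically and tell the application to stop admitting new memory-hungry work once `should_throttle()` returns true; how that signal reaches the workers is left open here.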
On 27.04.21 10:08, Michal Hocko wrote:
> On Tue 27-04-21 08:37:30, Alexander Sosna wrote:
>> Hi Chris,
>>
>> On 27.04.21 at 02:09, Chris Down wrote:
>>> Hi Alexander,
>>>
>>> Alexander Sosna writes:
>>>> Before this commit memory cgroup limits were not enforced during
>>>> allocation. If a process within a cgroup tries to allocate more
>>>> memory than allowed, the kernel will not prevent the allocation even if
>>>> OVERCOMMIT_NEVER is set. Then the OOM killer is activated to kill
>>>> processes in the corresponding cgroup.
>>>
>>> Unresolvable cgroup overages are indifferent to vm.overcommit_memory,
>>> since exceeding memory.max is not overcommitment, it's just a natural
>>> consequence of the fact that allocation and reclaim are not atomic
>>> processes. Overcommitment, on the other hand, is about the bounds of
>>> available memory at the global resource level.
>>>
>>>> This behavior is not to be expected
>>>> when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
>>>> problem for applications assuming that the kernel will deny an allocation
>>>> if not enough memory is available, like PostgreSQL. To prevent this a
>>>> check is implemented to not allow a process to allocate more memory than
>>>> limited by its cgroup. This means a process will not be killed while
>>>> accessing pages but will receive errors on memory allocation as
>>>> appropriate. This gives programs a chance to handle memory allocation
>>>> failures gracefully instead of being reaped.
>>>
>>> We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
>>> can still happen for a bunch of reasons, so I really hope PostgreSQL
>>> isn't relying on that.
>>>
>>> Could you please be more clear about the "huge problem" being solved
>>> here? I'm not seeing it.
>>
>> Let me explain the problem I encounter and why I fell down the mm rabbit
>> hole. It is not a PostgreSQL specific problem but that's where I run
>> into it. PostgreSQL forks a backend for each client connection. All
>> backends have shared memory as well as local work memory. When a
>> backend needs more dynamic work_mem to execute a query, new memory
>> is allocated. It is normal that such an allocation can fail. If the
>> backend gets an ENOMEM the current query is rolled back and all dynamic
>> work_mem is freed. The RDBMS stays operational and no other query is
>> disturbed.
>
> I am afraid the kernel MM implementation has never been really
> compatible with such a memory allocation model. Linux has always
> preferred to pretend there is always memory available and rather reclaim
> memory - including by killing some processes - rather than fail the
> allocation with ENOMEM. Overcommit configuration (especially
> OVERCOMMIT_NEVER) is an attempt to somehow mitigate this ambitious
> memory allocation approach but in reality this has turned out a)
> unreliable and b) unusable with modern userspace which relies on
> considerable virtual memory overcommit.

Thank you for taking the time to discuss this issue with me. I agree
that the kernel and a lot of software prefer to pretend there is more
memory than there really is. It was also never possible to assume that
the OOM killer is fully absent.

I have been running production Linux systems for quite a while now, and
without memory cgroups involved OVERCOMMIT_NEVER does a pretty good job.
I can't even remember the last time the OOM killer caused me any
problems on a properly configured database server. This is what I would
like, and what users should be able to expect, when cgroup memory limits
are in use as well.

Please correct me if I am wrong, but "modern userspace which relies on
considerable virtual memory overcommit" should not rely on the kernel to
overcommit memory when OVERCOMMIT_NEVER is explicitly set.

>> When running in a memory cgroup - for example via systemd or on k8s -
>> the kernel will not return ENOMEM even if the cgroup's memory limit is
>> exceeded.
>
> Yes, memcg doesn't change the overall approach. It just restricts the
> existing semantic with a smaller memory limit. Also overcommit heuristic
> has never been implemented for memory controllers.
>
>> Instead the OOM killer is awakened and kills processes in the
>> violating cgroup. If any backend is killed with SIGKILL the shared
>> memory of the whole cluster is deemed potentially corrupted and
>> PostgreSQL needs to do an emergency restart. This cancels all operations
>> on all backends and it entails a potentially lengthy recovery process.
>> Therefore the behavior is quite "costly".
>
> One way around that would be to use high limit rather than hard limit
> and pro-actively watch for memory utilization and communicate that back
> to the application to throttle its workers. I can see how that
>
>> I totally understand that vm.overcommit_memory 2 does not mean "no OOM
>> killer". IMHO it should mean "no OOM killer if we can avoid it" and I
>
> I do not see how it can ever promise anything like that. Memory
> consumption by kernel subsystems cannot be predicted at the time virtual
> memory is allocated from userspace. Not only it cannot be predicted but
> it is also highly impractical to force kernel allocations - necessary
> for the OS operation - to fail just because userspace has reserved
> virtual memory. So this all is just a heuristic to help in some
> extreme cases but overall I consider OVERCOMMIT_NEVER as impractical to
> say the least.

I'm not fully able to follow you why we need to let kernel allocations
fail here. Yes, if you run a system to a point where the kernel can't
free enough memory, invasive decisions have to be made. Think of an
application server running multiple applications in memcgs each with its
limits way below the available resources. Why is it preferable to
SIGKILL a process rather than just deny the limit exceeding malloc, when
OVERCOMMIT_NEVER is set of course?

>> would highly appreciate if the kernel would use a less invasive means
>> whenever possible. I guess this might also be the expectation by many
>> other users. In my described case - which is a real pain for me - it is
>> quite easy to tweak the kernel behavior in order to handle this and
>> other similar situations with fewer casualties. This is why I send a
>> patch instead of starting a theoretical discussion.
>
> I am pretty sure that many users would agree with you on that but the
> matter of fact is that a different approach has been chosen
> historically. We can argue whether this has been a good or bad design
> decision but I do not see that to change without a lot of fallouts. Btw.
> a strong memory reservation approach can be found with hugetlb pages and
> this one has turned out to be very tricky both from implementation and
> userspace usage POV. Needless to say that it operates on a single
> purpose preallocated memory pool and it would be quite reasonable to
> expect the complexity would grow with more users of the pool which is
> the general case for general purpose memory allocator.

The history is very interesting and needs to be taken into
consideration. What drives me is to help myself and all other Linux
users to run workloads like RDBMS reliably, even in modern environments
like k8s which make use of memory cgroups. I see a gain for the
community to develop a reliable and easily available solution, even if my
current approach might be amateurish and is not the right answer. Could
you elaborate on where you see "a lot of fallouts"? overcommit_memory 2
is only set when needed for the desired workload. If the gain is worth
it, one could implement an overcommit_memory 3 to select this behavior;
overcommit_memory needs to be explicitly set by the sysadmin anyway.

>> What do you think is necessary to get this to an approvable quality?
>
> See my other reply.
On Tue 27-04-21 13:01:33, Alexander Sosna wrote:
[...]
> Please correct me if I am wrong, but "modern userspace which relies on
> considerable virtual memory overcommit" should not rely on the kernel to
> overcommit memory when OVERCOMMIT_NEVER is explicitly set.

Correct. Which makes its application very limited in my experience.

> >> When running in a memory cgroup - for example via systemd or on k8s -
> >> the kernel will not return ENOMEM even if the cgroup's memory limit is
> >> exceeded.
> >
> > Yes, memcg doesn't change the overall approach. It just restricts the
> > existing semantic with a smaller memory limit. Also overcommit heuristic
> > has never been implemented for memory controllers.
> >
> >> Instead the OOM killer is awakened and kills processes in the
> >> violating cgroup. If any backend is killed with SIGKILL the shared
> >> memory of the whole cluster is deemed potentially corrupted and
> >> PostgreSQL needs to do an emergency restart. This cancels all operations
> >> on all backends and it entails a potentially lengthy recovery process.
> >> Therefore the behavior is quite "costly".
> >
> > One way around that would be to use high limit rather than hard limit
> > and pro-actively watch for memory utilization and communicate that back
> > to the application to throttle its workers. I can see how that
> >
> >> I totally understand that vm.overcommit_memory 2 does not mean "no OOM
> >> killer". IMHO it should mean "no OOM killer if we can avoid it" and I
> >
> > I do not see how it can ever promise anything like that. Memory
> > consumption by kernel subsystems cannot be predicted at the time virtual
> > memory is allocated from userspace. Not only it cannot be predicted but
> > it is also highly impractical to force kernel allocations - necessary
> > for the OS operation - to fail just because userspace has reserved
> > virtual memory. So this all is just a heuristic to help in some
> > extreme cases but overall I consider OVERCOMMIT_NEVER as impractical to
> > say the least.
>
> I'm not fully able to follow you why we need to let kernel allocations
> fail here. Yes, if you run a system to a point where the kernel can't
> free enough memory, invasive decisions have to be made.

OK. But then I do not see what "no OOM killer if we can avoid it" is
supposed to mean. There are only 2 ways around that. Either start
failing allocations or reclaim by tearing down processes, as all other
means of memory reclaim have already been exercised.

> Think of an
> application server running multiple applications in memcgs each with its
> limits way below the available resources. Why is it preferable to
> SIGKILL a process rather than just deny the limit exceeding malloc, when
> OVERCOMMIT_NEVER is set of course?

Because the actual physical memory allocation for malloc might (and
usually does) happen much later than the virtual memory allocated for it
(brk or mmap). Memory requirements could have changed considerably
between the two events. An allocation struggling to make forward
progress might be for a completely different purpose than the overcommit
accounted one. Does this make more sense now?

> >> would highly appreciate if the kernel would use a less invasive means
> >> whenever possible. I guess this might also be the expectation by many
> >> other users. In my described case - which is a real pain for me - it is
> >> quite easy to tweak the kernel behavior in order to handle this and
> >> other similar situations with fewer casualties. This is why I send a
> >> patch instead of starting a theoretical discussion.
> >
> > I am pretty sure that many users would agree with you on that but the
> > matter of fact is that a different approach has been chosen
> > historically. We can argue whether this has been a good or bad design
> > decision but I do not see that to change without a lot of fallouts. Btw.
> > a strong memory reservation approach can be found with hugetlb pages and
> > this one has turned out to be very tricky both from implementation and
> > userspace usage POV. Needless to say that it operates on a single
> > purpose preallocated memory pool and it would be quite reasonable to
> > expect the complexity would grow with more users of the pool which is
> > the general case for general purpose memory allocator.
>
> The history is very interesting and needs to be taken into
> consideration. What drives me is to help myself and all other Linux
> users to run workloads like RDBMS reliably, even in modern environments
> like k8s which make use of memory cgroups. I see a gain for the
> community to develop a reliable and easily available solution, even if my
> current approach might be amateurish and is not the right answer.

Well, I am afraid that a reliable and easy solution would be extremely
hard to find. A memcg aware overcommit policy is certainly possible but,
as I've said, it would require additional accounting and it would be
quite unreliable - especially with small limits where the mapped (and
accounted) address space is not predominant. A lack of background
reclaim (kswapd in the global case) would result in ENOMEM being
reported even though there is reclaimable memory to satisfy the reserved
address space etc.

> Could
> you elaborate on where you see "a lot of fallouts"? overcommit_memory 2
> is only set when needed for the desired workload.

My comment above was a more general one: Linux embraces overcommit and
relies on the OOM killer to handle the fallout. Changing this would lead
to a lot of fallout, e.g. many syscalls returning unexpected and
unhandled ENOMEM etc.
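The gap between reserving address space and actually consuming physical pages, described in this reply, can be observed from userspace. The following is a minimal sketch, assuming a Linux system where `mmap()` with `MAP_ANONYMOUS` and `mincore()` are available; the function name `probe_first_touch()` is hypothetical. On a typical Linux kernel one would expect the page to be reported as not resident before the first write, but the sketch only reports what it sees and does not guarantee that.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Map 16 anonymous pages, then record whether the first page is resident
 * before and after the first write to it. Returns 0 on success, -1 if
 * mmap() or mincore() failed.
 */
static int probe_first_touch(int *resident_before, int *resident_after)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len = (size_t)page * 16;
	unsigned char vec[16];
	char *p;

	/* Virtual allocation: this is the point overcommit accounting sees. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	if (mincore(p, len, vec) != 0) {
		munmap(p, len);
		return -1;
	}
	*resident_before = vec[0] & 1;

	/* Physical allocation: the first write faults a real page in. */
	p[0] = 1;

	if (mincore(p, len, vec) != 0) {
		munmap(p, len);
		return -1;
	}
	*resident_after = vec[0] & 1;

	munmap(p, len);
	return 0;
}
```

Any amount of time, and any number of unrelated allocations, can pass between the `mmap()` call and the write, which is why a decision made at `mmap()` time says little about the state of memory when the page is finally needed.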
Alexander Sosna writes:
>> We don't guarantee that vm.overcommit_memory 2 means "no OOM killer". It
>> can still happen for a bunch of reasons, so I really hope PostgreSQL
>> isn't relying on that.
>>
>> Could you please be more clear about the "huge problem" being solved
>> here? I'm not seeing it.
>
>Let me explain the problem I encounter and why I fell down the mm rabbit
>hole. It is not a PostgreSQL specific problem but that's where I run
>into it. PostgreSQL forks a backend for each client connection. All
>backends have shared memory as well as local work memory. When a
>backend needs more dynamic work_mem to execute a query, new memory
>is allocated. It is normal that such an allocation can fail. If the
>backend gets an ENOMEM the current query is rolled back and all dynamic
>work_mem is freed. The RDBMS stays operational and no other query is
>disturbed.
>
>When running in a memory cgroup - for example via systemd or on k8s -
>the kernel will not return ENOMEM even if the cgroup's memory limit is
>exceeded. Instead the OOM killer is awakened and kills processes in the
>violating cgroup. If any backend is killed with SIGKILL the shared
>memory of the whole cluster is deemed potentially corrupted and
>PostgreSQL needs to do an emergency restart. This cancels all operations
>on all backends and it entails a potentially lengthy recovery process.
>Therefore the behavior is quite "costly".

My point that memory cgroups are completely overcommit agnostic isn't just a
question of abstract semantics, but a practical one. Exceeding memory.max is
not overcommitment, because overages are physical, not virtual, and that has
vastly different ramifications in terms of what managing that overage means.

For example, if we aggressively ENOMEM at the memory.max bounds, there's no
provision provided for the natural bounds of memory reclaim to occur. Now maybe
your application likes that (which I find highly dubious), but from a memory
balancing perspective it's just nonsensical: we need to ensure that we're
assisting forward progress of the system at the cgroup level, especially with
the huge amounts of slack generated.

>I totally understand that vm.overcommit_memory 2 does not mean "no OOM
>killer". IMHO it should mean "no OOM killer if we can avoid it" and I
>would highly appreciate if the kernel would use a less invasive means
>whenever possible. I guess this might also be the expectation by many
>other users. In my described case - which is a real pain for me - it is
>quite easy to tweak the kernel behavior in order to handle this and
>other similar situations with fewer casualties. This is why I sent a
>patch instead of starting a theoretical discussion.

vm.overcommit_memory=2 means "don't overcommit", nothing less, nothing more.
Adding more semantics is a very good way to make an extremely confusing and
overloaded API.

This commit reminds me of the comments on cosmetic products that say "no
parabens". Ok, so there's no parabens -- great, parabens are terrible -- but
are you now using a much more dangerous preservative instead?

Likewise, this commit claims that it reduces the likelihood of invoking the OOM
killer -- great, nobody wants their processes to be OOM killed. What do we have
instead? Code that calls off memory allocations way, way before it's needed to
do so, and prevents the system from even getting into a state where it can
efficiently evaluate how it should rebalance memory. That's really not a good
tradeoff.

>What do you think is necessary to get this to an approvable quality?

The problem is not the code, it's the concept and the way it interacts with the
rest of the mm subsystem. It asks the mm subsystem to deny memory allocations
long before it has even had a chance to reliably rebalance (just as one
example, to punt anon pages to swap) based on the new allocations, which
doesn't make very much sense. It may not break in some highly trivial setups,
but it certainly will not work well with stacking or machines with high
volatility of the anon/file LRUs. You're also likely to see random ENOMEM
failures from kernelspace when operating under this memcg context long before
such a response was necessary, which doesn't make much sense.

If you want to know when to back off allocations, use memory.high with PSI
pressure metrics. I also would strongly suggest that vm.overcommit_memory=2 is
the equivalent of using a bucket of ignited thermite to warm one's house.
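As an illustration of the memory.high plus PSI suggestion above, here is a minimal userspace sketch in C. It is not part of the thread. It parses the "some" line of a cgroup v2 memory.pressure file and applies a back-off rule. The function names, the threshold, and the path passed to psi_read_some() are hypothetical; where the cgroup's memory.pressure file lives depends on how cgroup v2 is mounted and how the group is named.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* One parsed line of a cgroup v2 memory.pressure file. */
struct psi_line {
	double avg10;             /* share of time stalled (%), 10 s window */
	double avg60;             /* same, 60 s window */
	double avg300;            /* same, 300 s window */
	unsigned long long total; /* cumulative stall time in microseconds */
};

/*
 * Parse a line such as
 *   "some avg10=0.12 avg60=0.05 avg300=0.01 total=12345"
 * Returns 0 if the line is well formed and of the requested kind
 * ("some" or "full"), -1 otherwise.
 */
int psi_parse_line(const char *line, const char *kind, struct psi_line *out)
{
	char name[8];

	assert(line != NULL && kind != NULL && out != NULL);
	if (sscanf(line, "%7s avg10=%lf avg60=%lf avg300=%lf total=%llu",
		   name, &out->avg10, &out->avg60, &out->avg300,
		   &out->total) != 5)
		return -1;
	return strcmp(name, kind) == 0 ? 0 : -1;
}

/*
 * Hypothetical policy: back off when the 10 s "some" average exceeds a
 * threshold picked by the application. The threshold is an assumption,
 * not a recommendation from the thread.
 */
int psi_should_back_off(const struct psi_line *some, double threshold_pct)
{
	return some->avg10 > threshold_pct;
}

/* Read the "some" line from the memory.pressure file at the given path. */
int psi_read_some(const char *path, struct psi_line *out)
{
	char buf[256];
	FILE *f = fopen(path, "r");
	int ret = -1;

	if (!f)
		return -1;
	while (fgets(buf, sizeof(buf), f)) {
		if (psi_parse_line(buf, "some", out) == 0) {
			ret = 0;
			break;
		}
	}
	fclose(f);
	return ret;
}
```

A supervisor process could poll psi_read_some() periodically and ask the workers to stop growing their work memory while psi_should_back_off() is true. With memory.high set below memory.max, the group is throttled and reclaimed before the hard limit is reached.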
On 27.04.21 14:11, Michal Hocko wrote:
> On Tue 27-04-21 13:01:33, Alexander Sosna wrote:
> [...]
>> Please correct me if I am wrong, but "modern userspace which relies on
>> considerable virtual memory overcommit" should not rely on the kernel to
>> overcommit memory when OVERCOMMIT_NEVER is explicitly set.
>
> Correct. Which makes its application very limited from my experience.

Yes. It is a special tool for a special use case, I use it exclusively
for database servers and hosting DBaaS. Therefore my point is that a
change in its behavior only affects special use cases and should take
their requirements into consideration.

>>>> When running in a memory cgroup - for example via systemd or on k8s -
>>>> the kernel will not return ENOMEM even if the cgroup's memory limit is
>>>> exceeded.
>>>
>>> Yes, memcg doesn't change the overall approach. It just restricts the
>>> existing semantic with a smaller memory limit. Also overcommit heuristic
>>> has never been implemented for memory controllers.
>>>
>>>> Instead the OOM killer is awakened and kills processes in the
>>>> violating cgroup. If any backend is killed with SIGKILL the shared
>>>> memory of the whole cluster is deemed potentially corrupted and
>>>> PostgreSQL needs to do an emergency restart. This cancels all operations
>>>> on all backends and it entails a potentially lengthy recovery process.
>>>> Therefore the behavior is quite "costly".
>>>
>>> One way around that would be to use high limit rather than hard limit
>>> and pro-actively watch for memory utilization and communicate that back
>>> to the application to throttle its workers. I can see how that
>>>
>>>> I totally understand that vm.overcommit_memory 2 does not mean "no OOM
>>>> killer". IMHO it should mean "no OOM killer if we can avoid it" and I
>>>
>>> I do not see how it can ever promise anything like that. Memory
>>> consumption by kernel subsystems cannot be predicted at the time virtual
>>> memory is allocated from userspace. Not only it cannot be predicted but
>>> it is also highly impractical to force kernel allocations - necessary
>>> for the OS operation - to fail just because userspace has reserved
>>> virtual memory. So this all is just a heuristic to help in some
>>> extreme cases but overall I consider OVERCOMMIT_NEVER as impractical to
>>> say the least.
>>
>> I'm not fully able to follow you why we need to let kernel allocations
>> fail here. Yes, if you run a system to a point where the kernel can't
>> free enough memory, invasive decisions have to be made.
>
> OK. But then I do not see what "no OOM killer if we can avoid it" is
> supposed to mean. There are only 2 ways around that. Either start
> failing allocations or reclaim by tearing down processes as all other
> means of memory reclaim have been already exercised.

Yes, of course, if all means of memory reclaim have already been
exercised, there is not much to do. But I want to prevent the OOM killer
from reaping processes especially if more than enough memory is
available. There are many reasons to enforce limits even if enough
resources are available. For example to prevent bad neighbor behavior,
leave plenty of free memory for the file system cache or because someone
pays only for a given amount of resources in a SaaS offering and the
SREs would like to enforce the limit without shutting down their
clients' database service. :)

With KVM this is quite easy, just use one VM per RDBMS instance and set
overcommit_memory=2, but this would waste a lot of resources.

>> Think of an
>> application server running multiple applications in memcgs each with its
>> limits way below the available resources. Why is it preferable to
>> SIGKILL a process rather than just deny the limit exceeding malloc, when
>> OVERCOMMIT_NEVER is set of course?
>
> Because the actual physical memory allocation for malloc might (and
> usually does) happen much later than the virtual memory allocated for it
> (brk or mmap). Memory requirements could have changed considerably
> between the two events. An allocation struggling to make a forward
> progress might be for a completely different purpose than the overcommit
> accounted one. Does this make more sense now?

Yes. Thank you for the explanation. What you describe is the observable
behavior in many pieces of software. We have to keep in mind that we
are talking about the special case of OVERCOMMIT_NEVER here. Software
that wants / expects to run on such a system normally has a much tighter
allocation behavior. PostgreSQL for example allocates memory when it is
needed and in the needed quantity and will swiftly write on all of it.
Further, it deals very gracefully with an out-of-memory situation
(malloc returns NULL) by simply reporting back to the client that a
query was aborted due to out of memory. Overcommitting doesn't make any
sense with such a disciplined application and sysadmins configure their
kernel accordingly. Implying overcommit and OOM-killer-activity
currently rules out using cgroup-limits with such an application, there
is no way to handle a SIGKILL gracefully. Therefore I was quite happy
with the results when testing my patch with a DBaaS workload.

>>>> would highly appreciate if the kernel would use a less invasive means
>>>> whenever possible. I guess this might also be the expectation by many
>>>> other users. In my described case - which is a real pain for me - it is
>>>> quite easy to tweak the kernel behavior in order to handle this and
>>>> other similar situations with fewer casualties. This is why I sent a
>>>> patch instead of starting a theoretical discussion.
>>>
>>> I am pretty sure that many users would agree with you on that but the
>>> matter of fact is that a different approach has been chosen
>>> historically. We can argue whether this has been a good or bad design
>>> decision but I do not see that to change without a lot of fallouts. Btw.
>>> a strong memory reservation approach can be found with hugetlb pages and
>>> this one has turned out to be very tricky both from implementation and
>>> userspace usage POV. Needless to say that it operates on a single
>>> purpose preallocated memory pool and it would be quite reasonable to
>>> expect the complexity would grow with more users of the pool which is
>>> the general case for general purpose memory allocator.
>>
>> The history is very interesting and needs to be taken into
>> consideration. What drives me is to help myself and all other Linux
>> users to run workloads like RDBMS reliably, even in modern environments
>> like k8s which make use of memory cgroups. I see a gain for the
>> community to develop a reliable and easily available solution, even if my
>> current approach might be amateurish and is not the right answer.
>
> Well, I am afraid that a reliable and easy solution would be extremely
> hard to find. A memcg aware overcommit policy is certainly possible but
> as I've said it would require additional accounting and it would be
> quite unreliable - especially with small limits where the mapped (and
> accounted) address space is not predominant. A lack of background
> reclaim (kswapd in the global case) would result in ENOMEM reported even
> though there is reclaimable memory to satisfy the reserved address space
> etc.

Thank you very much for this information. Would you share the opinion
that it would be too hacky to define an arbitrary memory threshold here?
One could say that below a used memory of X the memory cgroup limit is
not enforced by denying a malloc(). So that the status quo behavior is
only altered when the memory usage is above X. This would mitigate the
problem with small limits and does not introduce new risks or surprises,
because in this edge case it will behave identically to the current kernel.

>> Could
>> you elaborate on where you see "a lot of fallouts"? overcommit_memory 2
>> is only set when needed for the desired workload.
>
> My above comment was more general to the approach Linux is embracing
> overcommit and relies on oom killer to handle fallouts. Changing this
> would lead to a lot of fallouts. E.g. many syscalls returning unexpected
> and unhandled ENOMEM etc.

We are talking about a special use case here. Do you see a problem in
the domain where and how overcommit_memory=2 is used today?
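The gap Michal describes above, between reserving virtual memory (brk or mmap) and the later physical allocation, can be observed from userspace. The following C sketch is not from the thread; the helper names are hypothetical. It reserves an anonymous mapping, which is the step overcommit accounting sees, and then writes to it, which is the step that faults pages in and charges them to the memory cgroup. mincore() reports how many pages are resident.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Step 1: reserve address space only; no page is backed by RAM yet. */
void *reserve_region(size_t len)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	return p == MAP_FAILED ? NULL : p;
}

/* Step 2: write to every byte, which faults the pages in. */
void touch_region(void *p, size_t len)
{
	memset(p, 0xA5, len);
}

/* Count the pages of [addr, addr + len) that mincore() reports resident. */
size_t resident_pages(void *addr, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t n = (len + page - 1) / page;
	unsigned char *vec = calloc(n, 1);
	size_t i, count = 0;

	assert(vec != NULL);
	if (mincore(addr, len, vec) == 0)
		for (i = 0; i < n; i++)
			count += vec[i] & 1; /* bit 0 set: page is resident */
	free(vec);
	return count;
}
```

On a typical Linux system, resident_pages() called between reserve_region() and touch_region() reports no resident pages, and after touch_region() it reports all of them. A failure of the first step surfaces as ENOMEM from mmap(). A shortage in the second step cannot be reported through a return value, since it happens inside a page fault.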
On Tue 27-04-21 15:43:25, Alexander Sosna wrote:
>
> On 27.04.21 14:11, Michal Hocko wrote:
[...]
> > Well, I am afraid that a reliable and easy solution would be extremely
> > hard to find. A memcg aware overcommit policy is certainly possible but
> > as I've said it would require additional accounting and it would be
> > quite unreliable - especially with small limits where the mapped (and
> > accounted) address space is not predominant. A lack of background
> > reclaim (kswapd in the global case) would result in ENOMEM reported even
> > though there is reclaimable memory to satisfy the reserved address space
> > etc.
>
> Thank you very much for this information. Would you share the opinion
> that it would be too hacky to define an arbitrary memory threshold here?
> One could say that below a used memory of X the memory cgroup limit is
> not enforced by denying a malloc(). So that the status quo behavior is
> only altered when the memory usage is above X. This would mitigate the
> problem with small limits and does not introduce new risks or surprises,
> because in this edge case it will behave identically to the current kernel.

It will not. Please read again about the memory reclaim concern. There
is no background reclaim so (and I believe Chris has mentioned that in
other email) the only way to balance memory consumption (e.g. caches)
would be memory allocations which are excluded from the virtual memory
accounting. That can lead to a hard to predict behavior.

> >> Could
> >> you elaborate on where you see "a lot of fallouts"? overcommit_memory 2
> >> is only set when needed for the desired workload.
> >
> > My above comment was more general to the approach Linux is embracing
> > overcommit and relies on oom killer to handle fallouts. Changing this
> > would lead to a lot of fallouts. E.g. many syscalls returning unexpected
> > and unhandled ENOMEM etc.
>
> We are talking about a special use case here. Do you see a problem in
> the domain where and how overcommit_memory=2 is used today?

Yes I do. I believe I have already provided some real challenges.

All that being said, a virtual memory overcommit control could be
implemented but I am not sure this is worth the additional complexity
and overhead introduced by the additional accounting.
Before this commit memory cgroup limits were not enforced during
allocation. If a process within a cgroup tries to allocate more
memory than allowed, the kernel will not prevent the allocation even if
OVERCOMMIT_NEVER is set. Then the OOM killer is activated to kill
processes in the corresponding cgroup. This behavior is not to be expected
when setting OVERCOMMIT_NEVER (vm.overcommit_memory = 2) and it is a huge
problem for applications assuming that the kernel will deny an allocation
if not enough memory is available, like PostgreSQL. To prevent this a
check is implemented to not allow a process to allocate more memory than
limited by its cgroup. This means a process will not be killed while
accessing pages but will receive errors on memory allocation as
appropriate. This gives programs a chance to handle memory allocation
failures gracefully instead of being reaped.

Signed-off-by: Alexander Sosna <alexander@sosna.de>

diff --git a/mm/util.c b/mm/util.c
index a8bf17f18a81..c84b83c532c6 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -853,6 +853,7 @@ EXPORT_SYMBOL_GPL(vm_memory_committed);
  *
  * Strict overcommit modes added 2002 Feb 26 by Alan Cox.
  * Additional code 2002 Jul 20 by Robert Love.
+ * Code to enforce memory cgroup limits added 2021 by Alexander Sosna.
  *
  * cap_sys_admin is 1 if the process has admin privileges, 0 otherwise.
  *
@@ -891,6 +892,34 @@ int __vm_enough_memory(struct mm_struct *mm, long pages, int cap_sys_admin)
 		long reserve = sysctl_user_reserve_kbytes >> (PAGE_SHIFT - 10);
 
 		allowed -= min_t(long, mm->total_vm / 32, reserve);
+
+#ifdef CONFIG_MEMCG
+		/*
+		 * If we are in a memory cgroup we also evaluate if the cgroup
+		 * has enough memory to allocate a new virtual mapping.
+		 * This is how we can keep processes from exceeding their
+		 * limits and also prevent that the OOM killer must be
+		 * awakened. This gives programs a chance to handle memory
+		 * allocation failures gracefully and not being reaped.
+		 * In the current version mem_cgroup_get_max() is used which
+		 * allows the processes to exceeded their memory limits if
+		 * enough SWAP is available. If this is not intended we could
+		 * use READ_ONCE(memcg->memory.max) instead.
+		 *
+		 * This code is only reached if sysctl_overcommit_memory equals
+		 * OVERCOMMIT_NEVER, both other options are handled above.
+		 */
+		{
+			struct mem_cgroup *memcg = get_mem_cgroup_from_mm(mm);
+
+			if (memcg) {
+				long available = mem_cgroup_get_max(memcg)
+						- mem_cgroup_size(memcg);
+
+				allowed = min_t(long, available, allowed);
+			}
+		}
+#endif
 	}
 
 	if (percpu_counter_read_positive(&vm_committed_as) < allowed)
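The arithmetic the patch adds (limit minus usage, then the smaller of that headroom and the global allowance) can be approximated from userspace for experimentation. The C sketch below is not the patch and not a kernel API; the function names are hypothetical. It works in bytes on the text of the cgroup v2 memory.max and memory.current files, and it ignores swap, whereas the patch works in pages and uses mem_cgroup_get_max(). Note that memory.current reports charged usage rather than reserved address space, which is the mismatch the reviewers raise in this thread.

```c
#include <assert.h>
#include <ctype.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

/*
 * Parse the text of a cgroup v2 memory.max or memory.current file.
 * memory.max may hold the literal "max" (no limit), which is mapped to
 * ULLONG_MAX here. Returns 0 on success, -1 on malformed input.
 */
int memcg_parse_bytes(const char *text, unsigned long long *out)
{
	char *end;

	assert(text != NULL && out != NULL);
	if (strncmp(text, "max", 3) == 0) {
		*out = ULLONG_MAX;
		return 0;
	}
	if (!isdigit((unsigned char)text[0]))
		return -1;
	errno = 0;
	*out = strtoull(text, &end, 10);
	if (errno != 0 || end == text)
		return -1;
	return 0;
}

/* Headroom in bytes: limit minus usage, clamped at zero on overage. */
unsigned long long memcg_headroom(unsigned long long limit,
				  unsigned long long usage)
{
	return usage >= limit ? 0 : limit - usage;
}

/* Like the patch's min_t(): the tighter of the two bounds wins. */
unsigned long long effective_allowed(unsigned long long global_allowed,
				     unsigned long long headroom)
{
	return headroom < global_allowed ? headroom : global_allowed;
}
```

Unlike the `long` subtraction in the patch, which can go negative when usage already exceeds the limit, the sketch clamps the headroom at zero. That is a simplification made for the sketch, not a claim about how the kernel check should behave.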