Message ID | 20171025141221.xm4cqp2z6nunr6vy@dhcp22.suse.cz (mailing list archive) |
---|---|
State | New, archived |
On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
> On Wed 25-10-17 09:11:51, Johannes Weiner wrote:
> > On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote:
> [...]
> > > ... we shouldn't make it more loose though.
> >
> > Then we can end this discussion right now. I pointed out right from
> > the start that the only way to replace -ENOMEM with OOM killing in the
> > syscall is to force charges. If we don't, we either deadlock or still
> > return -ENOMEM occasionally. Nobody has refuted that this is the case.
>
> Yes, this is true. I guess we are back to the non-failing allocations
> discussion... Currently we are too ENOMEM happy for memcg !PF paths,
> which can lead to weird issues Greg has pointed out earlier. Going in
> the opposite direction, to basically never ENOMEM and rather pretend
> success (which allows runaways for extreme setups with no oom eligible
> tasks), sounds like going from one extreme to another. This basically
> means that those charges will effectively be GFP_NOFAIL. Too much to
> guarantee IMHO.

We're talking in circles. Overrunning the hard limit in the syscall
path isn't worse than an infinite loop in the page fault path. Once you
make tasks ineligible for OOM, that's what you have to expect. It's the
OOM disabling that results in extreme consequences across the board.
It's silly to keep bringing this up as an example.

> > They're a challenge as it is. The only sane options are to stick with
> > the status quo,
>
> One thing that really worries me about the current status quo is that
> the behavior depends on whether you run under memcg or not. The global
> policy is "almost never fail unless something horrible is going on".
> But we _do not_ guarantee that ENOMEM stays inside the kernel.
>
> So if we need to do something about that, I would think we need a
> universal solution rather than something memcg specific. Sure, global
> ENOMEMs are so rare that nobody will probably trigger those, but that
> is just wishful thinking...

The difference in current behavior comes from the fact that real-life
workloads could trigger a deadlock with memcg looping indefinitely in
the charge path, but not with the allocator looping indefinitely in the
allocation path.

That also means that if we change it like you proposed, it'll be much
more likely to trigger -ENOMEM from syscalls inside memcg than it is
outside. The margins are simply tighter inside cgroups. You often have
only a few closely interacting tasks in there, and the probability for
something to go wrong is much higher than at the system scope.

But also, the system is constrained by physical limits. It's a hard
wall; once you hit it, the kernel is dead, so we don't have a lot of
options. Memcg on the other hand is a completely artificial limit. It
doesn't make sense to me to pretend it's a hard wall to the point where
we actively *create* for ourselves the downsides of a physical
limitation.

> So how about we start with a BIG FAT WARNING for the failure case?
> Something resembling warn_alloc for the failure case.
>
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5d9323028870..3ba62c73eee5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
>  	 * we would fall back to the global oom killer in pagefault_out_of_memory
>  	 */
> -	if (!memcg->oom_kill_disable &&
> -	    mem_cgroup_out_of_memory(memcg, mask, order))
> -		return true;
> +	if (!memcg->oom_kill_disable) {
> +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> +			return true;
> +
> +		WARN(!current->memcg_may_oom,
> +		     "Memory cgroup charge failed because of no reclaimable memory! "
> +		     "This looks like a misconfiguration or a kernel bug.");
> +	}

That's crazy!

We shouldn't create interfaces that make it possible to accidentally
livelock the kernel, then warn about it and let it crash. That is a
DOS-level lack of OS abstraction.

In such a situation, we should ignore oom_score_adj or ignore the hard
limit. Even panic() would be better from a machine management point of
view than leaving random tasks inside infinite loops.

Why is OOM-disabling a thing? Why isn't this simply a "kill everything
else before you kill me"? It's crashing the kernel in trying to protect
a userspace application. How is that not insane?
On Wed 25-10-17 12:44:02, Johannes Weiner wrote:
> On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
[...]

I have yet to digest the first part of the email, but the rest just
sounds like we are not on the same page.

> > So how about we start with a BIG FAT WARNING for the failure case?
> > Something resembling warn_alloc for the failure case.
> >
> > [...]
>
> That's crazy!
>
> We shouldn't create interfaces that make it possible to accidentally
> livelock the kernel, then warn about it and let it crash. That is a
> DOS-level lack of OS abstraction.
>
> In such a situation, we should ignore oom_score_adj or ignore the hard
> limit. Even panic() would be better from a machine management point of
> view than leaving random tasks inside infinite loops.
>
> Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> else before you kill me"? It's crashing the kernel in trying to
> protect a userspace application. How is that not insane?

I really do not follow. What kind of livelock or crash are you talking
about? All this code does is make the charge request (which is not
explicitly GFP_NOFAIL) fail with ENOMEM if the oom killer is not able
to make forward progress. That sounds like a safer option than failing
with ENOMEM unconditionally, which is what we do currently.

So the only change I am really proposing is to keep retrying as long
as the oom killer makes forward progress, and ENOMEM otherwise.

I am also not trying to protect a userspace application. Quite the
contrary, I would like the application to get ENOMEM when it should run
away from the constraint it runs within. I am protecting everything
outside of the hard limited memcg, actually.

So what is it that I am missing?
On Wed, Oct 25, 2017 at 07:29:24PM +0200, Michal Hocko wrote:
> On Wed 25-10-17 12:44:02, Johannes Weiner wrote:
> > On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
> > > So how about we start with a BIG FAT WARNING for the failure case?
> > > Something resembling warn_alloc for the failure case.
> > >
> > > [...]
> >
> > That's crazy!
> >
> > [...]
>
> I really do not follow. What kind of livelock or crash are you talking
> about? All this code does is make the charge request (which is not
> explicitly GFP_NOFAIL) fail with ENOMEM if the oom killer is not able
> to make forward progress. That sounds like a safer option than failing
> with ENOMEM unconditionally, which is what we do currently.

I pointed out multiple times now that a consistent -ENOMEM is better
than a rare one, because it's easier to verify application behavior
with test runs. And I don't understand what your counter-argument is.

"Safe" is a vague term, and it doesn't make much sense to me in this
situation. The OOM behavior should be predictable and consistent.

Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
don't have to do that in memcg because we're not physically limited.

> So the only change I am really proposing is to keep retrying as long
> as the oom killer makes forward progress, and ENOMEM otherwise.

That's the behavior change I'm against.

> I am also not trying to protect a userspace application. Quite the
> contrary, I would like the application to get ENOMEM when it should
> run away from the constraint it runs within. I am protecting
> everything outside of the hard limited memcg, actually.
>
> So what is it that I am missing?

The page fault path.

Maybe that's not what you set out to fix, but it will now not only
enter an infinite loop, it will also WARN() on each iteration.

It would make sense to step back and think about the comprehensive
user-visible behavior of an out-of-memory situation, and not just the
one syscall aspect.
On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
> On Wed, Oct 25, 2017 at 07:29:24PM +0200, Michal Hocko wrote:
> > On Wed 25-10-17 12:44:02, Johannes Weiner wrote:
> > > On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
[...]
> > I really do not follow. What kind of livelock or crash are you
> > talking about? All this code does is make the charge request (which
> > is not explicitly GFP_NOFAIL) fail with ENOMEM if the oom killer is
> > not able to make forward progress. That sounds like a safer option
> > than failing with ENOMEM unconditionally, which is what we do
> > currently.
>
> I pointed out multiple times now that a consistent -ENOMEM is better
> than a rare one, because it's easier to verify application behavior
> with test runs. And I don't understand what your counter-argument is.

My counter-argument is that running inside a memcg shouldn't be too
different from running outside. And that means that if we try to reduce
the chances of ENOMEM in the global case, then we should do the same in
the memcg case as well. Failing overly eagerly is more harmful than
what the determinism gives you for testing. You have other means to
test failure paths, like fault injection.

> "Safe" is a vague term, and it doesn't make much sense to me in this
> situation. The OOM behavior should be predictable and consistent.
>
> Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
> don't have to do that in memcg because we're not physically limited.

OK, so here seems to be the biggest disconnect. Being physically or
artificially constrained shouldn't make much difference IMHO. In both
cases the resource is simply limited for the consumer. And once all
attempts to fit within the limit fail, the request for the resource has
to fail.

> > So the only change I am really proposing is to keep retrying as long
> > as the oom killer makes forward progress, and ENOMEM otherwise.
>
> That's the behavior change I'm against.

So just to make it clear, you would be OK with a retry on a successful
OOM killer invocation and a force charge on oom failure, right?

> > I am also not trying to protect a userspace application. Quite the
> > contrary, I would like the application to get ENOMEM when it should
> > run away from the constraint it runs within. I am protecting
> > everything outside of the hard limited memcg, actually.
> >
> > So what is it that I am missing?
>
> The page fault path.
>
> Maybe that's not what you set out to fix, but it will now not only
> enter an infinite loop, it will also WARN() on each iteration.

It will not warn! The WARN is explicitly for non-PF paths, unless I
have screwed something up there, because admittedly I didn't even try
to compile that code - it was merely an illustration. Please note that
the diff is on top of the previous one.

And we already have an endless loop if the memcg PF oom path doesn't
make forward progress. So there shouldn't be any difference in that
regard.

> It would make sense to step back and think about the comprehensive
> user-visible behavior of an out-of-memory situation, and not just the
> one syscall aspect.
On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
> On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
> > "Safe" is a vague term, and it doesn't make much sense to me in this
> > situation. The OOM behavior should be predictable and consistent.
> >
> > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
> > don't have to do that in memcg because we're not physically limited.
>
> OK, so here seems to be the biggest disconnect. Being physically or
> artificially constrained shouldn't make much difference IMHO. In both
> cases the resource is simply limited for the consumer. And once all
> attempts to fit within the limit fail, the request for the resource
> has to fail.

It's a huge difference. In the global case, we have to make trade-offs
to not deadlock the kernel. In the memcg case, we have to make a
trade-off between desirable OOM behavior and the desirable meaning of
memory.max.

If we can borrow a resource temporarily from the ether to resolve the
OOM situation, I don't see why we shouldn't. We're only briefly
ignoring the limit to make sure the allocating task isn't preventing
the OOM victim from exiting or the OOM reaper from reaping. It's more
of an implementation detail than an interface.

The only scenario you brought up where this might be a permanent
overrun is the single, oom-disabled task. And I explained why that is
a silly argument, why that's the least problematic consequence of
oom-disabling, and why it probably shouldn't even be configurable.

The idea that memory.max must never be breached is an extreme and
narrow view. As Greg points out, there are allocations we do not even
track. There are other scenarios that force allocations. They may
violate the limit on paper, but they're not notably weakening the goal
of memory.max - isolating workloads from each other.

Let's look at it this way.

There are two deadlock relationships the OOM killer needs to solve
between the triggerer and the potential OOM victim:

  #1 Memory. The triggerer needs memory that the victim has,
     but the victim needs some additional memory to release it.

  #2 Locks. The triggerer needs memory that the victim has, but
     the victim needs a lock the triggerer holds to release it.

We have no qualms letting the victim temporarily (until the victim's
exit) ignore memory.max to resolve the memory deadlock #1.

I don't understand why it's such a stretch to let the triggerer
temporarily (until the victim's exit) ignore memory.max to resolve the
locks deadlock #2. [1]

We need both for the OOM killer to function correctly.

We've solved #1 both for memcg and globally. But we haven't solved #2.
Global can still deadlock, and memcg copped out and returns -ENOMEM.

Adding speculative OOM killing before the -ENOMEM makes things more
muddy and unpredictable. It doesn't actually solve deadlock #2.

[1] And arguably that's what we should be doing in the global case
too: give the triggerer access to reserves. If you recall this
thread here: https://patchwork.kernel.org/patch/6088511/

> > > So the only change I am really proposing is to keep retrying as
> > > long as the oom killer makes forward progress, and ENOMEM
> > > otherwise.
> >
> > That's the behavior change I'm against.
>
> So just to make it clear, you would be OK with a retry on a successful
> OOM killer invocation and a force charge on oom failure, right?

Yeah, that sounds reasonable to me.
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
>> On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
[...]
> There are two deadlock relationships the OOM killer needs to solve
> between the triggerer and the potential OOM victim:
>
>   #1 Memory. The triggerer needs memory that the victim has,
>      but the victim needs some additional memory to release it.
>
>   #2 Locks. The triggerer needs memory that the victim has, but
>      the victim needs a lock the triggerer holds to release it.
>
[...]
>> So just to make it clear, you would be OK with a retry on a
>> successful OOM killer invocation and a force charge on oom failure,
>> right?
>
> Yeah, that sounds reasonable to me.

Assuming we're talking about retrying within try_charge(), then there's
a detail to iron out...

If there is a pending oom victim blocked on a lock held by the
try_charge() caller (the "#2 Locks" case), then I think repeated calls
to out_of_memory() will return true until the victim either gets
MMF_OOM_SKIP or disappears. So a force charge fallback might be needed
even with successful oom killer invocations. Or we'll need to teach
out_of_memory() to return three values (e.g. NO_VICTIM, NEW_VICTIM,
PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM.
On Wed 25-10-17 15:49:21, Greg Thelen wrote:
> Johannes Weiner <hannes@cmpxchg.org> wrote:
>
> > On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
[...]
> >> So just to make it clear, you would be OK with a retry on a
> >> successful OOM killer invocation and a force charge on oom failure,
> >> right?
> >
> > Yeah, that sounds reasonable to me.
>
> Assuming we're talking about retrying within try_charge(), then
> there's a detail to iron out...
>
> If there is a pending oom victim blocked on a lock held by the
> try_charge() caller (the "#2 Locks" case), then I think repeated calls
> to out_of_memory() will return true until the victim either gets
> MMF_OOM_SKIP or disappears.

True. And the oom_reaper guarantees that MMF_OOM_SKIP gets set in a
finite amount of time.

> So a force charge fallback might be needed even with successful oom
> killer invocations. Or we'll need to teach out_of_memory() to return
> three values (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and
> try_charge() can loop on NEW_VICTIM.

No, we really want to wait for the oom victim to do its job. The only
thing we should be worried about is when out_of_memory doesn't invoke
the reaper. There is only one case like that AFAIK - a GFP_NOFS
request. I have to think about this case some more. We currently fail
the request in that case.
On 2017/10/26 16:49, Michal Hocko wrote:
> On Wed 25-10-17 15:49:21, Greg Thelen wrote:
>> If there is a pending oom victim blocked on a lock held by the
>> try_charge() caller (the "#2 Locks" case), then I think repeated
>> calls to out_of_memory() will return true until the victim either
>> gets MMF_OOM_SKIP or disappears.
>
> True. And the oom_reaper guarantees that MMF_OOM_SKIP gets set in a
> finite amount of time.

Just a confirmation. You are talking about kmemcg, aren't you? And
kmemcg depends on CONFIG_MMU=y, doesn't it? If not, there is no such
guarantee.

>> So a force charge fallback might be needed even with successful oom
>> killer invocations. Or we'll need to teach out_of_memory() to return
>> three values (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and
>> try_charge() can loop on NEW_VICTIM.
>
> No, we really want to wait for the oom victim to do its job. The only
> thing we should be worried about is when out_of_memory doesn't invoke
> the reaper. There is only one case like that AFAIK - a GFP_NOFS
> request. I have to think about this case some more. We currently fail
> the request in that case.

Do we really need to apply

	/*
	 * The OOM killer does not compensate for IO-less reclaim.
	 * pagefault_out_of_memory lost its gfp context so we have to
	 * make sure exclude 0 mask - all other users should have at least
	 * ___GFP_DIRECT_RECLAIM to get here.
	 */
	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
		return true;

unconditionally? We can encourage !__GFP_FS allocations to use
__GFP_NORETRY or __GFP_RETRY_MAYFAIL if their allocations are not
important. Then, only important !__GFP_FS allocations will be checked
here. I think that we can allow such important allocations to invoke
the OOM killer (i.e. remove this check) because the situation is
already hopeless if important !__GFP_FS allocations cannot make
progress.
On Wed, Oct 25, 2017 at 03:49:21PM -0700, Greg Thelen wrote:
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
[...]
> >> So just to make it clear, you would be OK with a retry on a
> >> successful OOM killer invocation and a force charge on oom failure,
> >> right?
> >
> > Yeah, that sounds reasonable to me.
>
> Assuming we're talking about retrying within try_charge(), then
> there's a detail to iron out...
>
> If there is a pending oom victim blocked on a lock held by the
> try_charge() caller (the "#2 Locks" case), then I think repeated calls
> to out_of_memory() will return true until the victim either gets
> MMF_OOM_SKIP or disappears. So a force charge fallback might be needed
> even with successful oom killer invocations. Or we'll need to teach
> out_of_memory() to return three values (e.g. NO_VICTIM, NEW_VICTIM,
> PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM.

True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a
maximum, even if the OOM killer indicates a kill has been issued. What
you propose makes sense too.
Michal Hocko wrote:
> Greg Thelen wrote:
> > So a force charge fallback might be needed even with successful oom
> > killer invocations. Or we'll need to teach out_of_memory() to return
> > three values (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and
> > try_charge() can loop on NEW_VICTIM.
>
> No, we really want to wait for the oom victim to do its job. The only
> thing we should be worried about is when out_of_memory doesn't invoke
> the reaper. There is only one case like that AFAIK - a GFP_NOFS
> request. I have to think about this case some more. We currently fail
> the request in that case.

Nod, but I think we should only wait a short time (more below). The
schedule_timeout_killable(1) in out_of_memory() seems ok to me. I don't
think there's a problem with overcharging a little bit to expedite oom
killing.

Johannes Weiner wrote:
> True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a
> maximum, even if the OOM killer indicates a kill has been issued. What
> you propose makes sense too.

Sounds good.

It looks like the oom reaper will wait 1 second
(MAX_OOM_REAP_RETRIES*HZ/10) before giving up and setting MMF_OOM_SKIP,
which enables the oom killer to select another victim. Repeated
try_charge() => out_of_memory() calls will return true while there's a
pending victim. After the first call, out_of_memory() doesn't appear to
sleep. So I assume try_charge() would quickly burn through the
MEM_CGROUP_RECLAIM_RETRIES (5) attempts before resorting to
overcharging. IMO, this is fine because:
1) it's possible the victim wants locks held by the try_charge()
   caller. So waiting for the oom reaper to time out and
   out_of_memory() to select additional victims would kill more than
   required.
2) waiting 1 sec to detect a livelock between try_charge() and a
   pending oom victim seems unfortunate.
On Thu 26-10-17 12:56:48, Greg Thelen wrote:
> Michal Hocko wrote:
> > Greg Thelen wrote:
> > > So a force charge fallback might be needed even with successful
> > > oom killer invocations. Or we'll need to teach out_of_memory() to
> > > return three values (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM)
> > > and try_charge() can loop on NEW_VICTIM.
> >
> > No, we really want to wait for the oom victim to do its job. The
> > only thing we should be worried about is when out_of_memory doesn't
> > invoke the reaper. There is only one case like that AFAIK - a
> > GFP_NOFS request. I have to think about this case some more. We
> > currently fail the request in that case.
>
> Nod, but I think we should only wait a short time (more below). The
> schedule_timeout_killable(1) in out_of_memory() seems ok to me. I
> don't think there's a problem with overcharging a little bit to
> expedite oom killing.

This is not about time. This is about the feedback mechanism the
oom_reaper provides. We shouldn't take any action until we get that
feedback.

> Johannes Weiner wrote:
> > True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at
> > a maximum, even if the OOM killer indicates a kill has been issued.
> > What you propose makes sense too.
>
> Sounds good.
>
> It looks like the oom reaper will wait 1 second
> (MAX_OOM_REAP_RETRIES*HZ/10) before giving up and setting
> MMF_OOM_SKIP, which enables the oom killer to select another victim.
> Repeated try_charge() => out_of_memory() calls will return true while
> there's a pending victim. After the first call, out_of_memory()
> doesn't appear to sleep. So I assume try_charge() would quickly burn
> through its MEM_CGROUP_RECLAIM_RETRIES (5) attempts before resorting
> to overcharging. IMO, this is fine because:
> 1) it's possible the victim wants locks held by the try_charge()
>    caller, so waiting for the oom reaper to time out and
>    out_of_memory() to select additional victims would kill more than
>    required.
> 2) waiting 1 sec to detect a livelock between try_charge() and a
>    pending oom victim seems unfortunate.

I am not yet sure whether overcharging or ENOMEM is the right way to go
(I have to think that through much more), but one thing is clear I
guess: the charge path shouldn't make any decision as long as it gets
positive feedback from the oom killer path. And that pretty much
depends on what the oom reaper is able to do. Implementation details
such as how long the reaper waits when it is not able to reclaim any
memory are not all that important.
> Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> else before you kill me"? It's crashing the kernel in trying to
> protect a userspace application. How is that not insane?

In parallel to other discussion, I think we should definitely move
from "completely oom-disabled" semantics to something similar to "kill
me last" semantics. Is there any objection to this idea?
On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
> > Why is OOM-disabling a thing? Why isn't this simply a "kill
> > everything else before you kill me"? It's crashing the kernel in
> > trying to protect a userspace application. How is that not insane?
>
> In parallel to other discussion, I think we should definitely move
> from "completely oom-disabled" semantics to something similar to
> "kill me last" semantics. Is there any objection to this idea?

Could you be more specific what you mean?
On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
>> > Why is OOM-disabling a thing? Why isn't this simply a "kill
>> > everything else before you kill me"? It's crashing the kernel in
>> > trying to protect a userspace application. How is that not insane?
>>
>> In parallel to other discussion, I think we should definitely move
>> from "completely oom-disabled" semantics to something similar to
>> "kill me last" semantics. Is there any objection to this idea?
>
> Could you be more specific what you mean?

I get the impression that the main reason behind the complexity of the
oom-killer is allowing processes to be protected from it, i.e.
disabling oom-killing of a process by setting
/proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add
an interface which will let users/admins set a process to be oom-killed
only as a last resort.
On Mon 30-10-17 12:28:13, Shakeel Butt wrote:
> On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
> > > In parallel to other discussion, I think we should definitely
> > > move from "completely oom-disabled" semantics to something
> > > similar to "kill me last" semantics. Is there any objection to
> > > this idea?
> >
> > Could you be more specific what you mean?
>
> I get the impression that the main reason behind the complexity of
> the oom-killer is allowing processes to be protected from it, i.e.
> disabling oom-killing of a process by setting
> /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add
> an interface which will let users/admins set a process to be
> oom-killed only as a last resort.

If a process opts in to be oom disabled it needs CAP_SYS_RESOURCE, and
it probably has a strong reason to do that, e.g. no unexpected SIGKILL
which could leave inconsistent data behind. We cannot simply break that
contract. Yes, it is a PITA configuration to support, but it has its
reasons to exist.

We are not guaranteeing it to 100% though; e.g. the global case just
panics if there is no eligible task. It is the responsibility of
userspace to make sure that the protected task doesn't blow up
completely, otherwise they are on their own.

We should do something similar for the memcg case: protect those tasks
as long as we are able to make forward progress and then either give
them ENOMEM or overcharge. Which one to go with requires more
discussion, but I do not think that an unexpected SIGKILL is the way to
go. We just want to give those tasks a chance to do a graceful
shutdown.
On Tue, Oct 31, 2017 at 09:00:48AM +0100, Michal Hocko wrote:
> On Mon 30-10-17 12:28:13, Shakeel Butt wrote:
> > I get the impression that the main reason behind the complexity of
> > the oom-killer is allowing processes to be protected from it, i.e.
> > disabling oom-killing of a process by setting
> > /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling,
> > add an interface which will let users/admins set a process to be
> > oom-killed only as a last resort.
>
> If a process opts in to be oom disabled it needs CAP_SYS_RESOURCE,
> and it probably has a strong reason to do that, e.g. no unexpected
> SIGKILL which could leave inconsistent data behind. We cannot simply
> break that contract. Yes, it is a PITA configuration to support, but
> it has its reasons to exist.

I don't think that's true. The most prominent users are things like X
and sshd, and all they wanted to say was "kill me last."

If sshd were to have a bug and swell up, currently the system would
kill everything and then panic. It'd be much better to kill sshd at
the end and let the init system restart it.

Can you describe a scenario in which the NEVERKILL semantics actually
make sense? You're still OOM-killing the task anyway, it's not like it
can run without the kernel. So why kill the kernel?
On Tue 31-10-17 12:49:59, Johannes Weiner wrote:
> On Tue, Oct 31, 2017 at 09:00:48AM +0100, Michal Hocko wrote:
> > If a process opts in to be oom disabled it needs CAP_SYS_RESOURCE,
> > and it probably has a strong reason to do that, e.g. no unexpected
> > SIGKILL which could leave inconsistent data behind. We cannot
> > simply break that contract. Yes, it is a PITA configuration to
> > support, but it has its reasons to exist.
>
> I don't think that's true. The most prominent users are things like X
> and sshd, and all they wanted to say was "kill me last."

This might be the case for the desktop environment, and I would tend
to agree that those can handle a restart easily. I was considering
applications which need an explicit shutdown and manual intervention
when that is not done. Think of a database or similar.
> If sshd were to have a bug and swell up, currently the system would
> kill everything and then panic. It'd be much better to kill sshd at
> the end and let the init system restart it.
>
> Can you describe a scenario in which the NEVERKILL semantics actually
> make sense? You're still OOM-killing the task anyway, it's not like
> it can run without the kernel. So why kill the kernel?

Yes, but you start with a clean state after a reboot, which is a
rather different thing than restarting from an inconsistent state.

In any case, I am not trying to defend this configuration! I really
dislike it and it shouldn't have ever been introduced. But it is an
established behavior of many years and I am not really willing to
break it without having a _really strong_ reason.
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5d9323028870..3ba62c73eee5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
 	 * we would fall back to the global oom killer in pagefault_out_of_memory
 	 */
-	if (!memcg->oom_kill_disable &&
-	    mem_cgroup_out_of_memory(memcg, mask, order))
-		return true;
+	if (!memcg->oom_kill_disable) {
+		if (mem_cgroup_out_of_memory(memcg, mask, order))
+			return true;
+
+		WARN(!current->memcg_may_oom,
+		     "Memory cgroup charge failed because of no reclaimable memory! "
+		     "This looks like a misconfiguration or a kernel bug.");
+	}
 
 	if (!current->memcg_may_oom)
 		return false;