fs, mm: account filp and names caches to kmemcg

Message ID 20171025141221.xm4cqp2z6nunr6vy@dhcp22.suse.cz (mailing list archive)
State New, archived

Commit Message

Michal Hocko Oct. 25, 2017, 2:12 p.m. UTC
On Wed 25-10-17 09:11:51, Johannes Weiner wrote:
> On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote:
[...]
> > ... we shouldn't make it more loose though.
> 
> Then we can end this discussion right now. I pointed out right from
> the start that the only way to replace -ENOMEM with OOM killing in the
> syscall is to force charges. If we don't, we either deadlock or still
> return -ENOMEM occasionally. Nobody has refuted that this is the case.

Yes this is true. I guess we are back to the non-failing allocations
discussion...  Currently we are too ENOMEM happy for memcg !PF paths, which
can lead to the weird issues Greg pointed out earlier. Going in the opposite
direction to basically never ENOMEM and rather pretend success (which
allows runaways for extreme setups with no oom eligible tasks) sounds
like going from one extreme to another. It basically means that those
charges would effectively be GFP_NOFAIL. Too much to guarantee IMHO.

> > > The current thread can loop in syscall exit until
> > > usage is reconciled (either via reclaim or kill).  This seems consistent
> > > with pagefault oom handling and compatible with overcommit use case.
> > 
> > But we do not really want to make the syscall exit path any more complex
> > or more expensive than it is. The point is that we shouldn't be afraid
> > about triggering the oom killer from the charge patch because we do have
> > async OOM killer. This is very same with the standard allocator path. So
> > why should be memcg any different?
> 
> I have nothing against triggering the OOM killer from the allocation
> path. I am dead-set against making the -ENOMEM return from syscalls
> rare and unpredictable.

Isn't that the case when we put memcg out of the picture already? More
on that below.

> They're a challenge as it is.  The only sane options are to stick with
> the status quo,

One thing that really worries me about the current status quo is that
the behavior depends on whether you run under memcg or not. The global
policy is "almost never fail unless something horrible is going on".
But we _do not_ guarantee that ENOMEM stays inside the kernel.

So if we need to do something about that, I would think we need a
universal solution rather than something memcg-specific. Sure, global
ENOMEMs are so rare that probably nobody will trigger them, but that is
just wishful thinking...

So how about we start with a BIG FAT WARNING for the failure case?
Something resembling warn_alloc for the failure case.
---

Comments

Johannes Weiner Oct. 25, 2017, 4:44 p.m. UTC | #1
On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
> On Wed 25-10-17 09:11:51, Johannes Weiner wrote:
> > On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote:
> [...]
> > > ... we shouldn't make it more loose though.
> > 
> > Then we can end this discussion right now. I pointed out right from
> > the start that the only way to replace -ENOMEM with OOM killing in the
> > syscall is to force charges. If we don't, we either deadlock or still
> > return -ENOMEM occasionally. Nobody has refuted that this is the case.
> 
> Yes this is true. I guess we are back to the non-failing allocations
> discussion...  Currently we are too ENOMEM happy for memcg !PF paths which
> can lead to weird issues Greg has pointed out earlier. Going to opposite
> direction to basically never ENOMEM and rather pretend a success (which
> allows runaways for extreme setups with no oom eligible tasks) sounds
> like going from one extreme to another. This basically means that those
> charges will effectively GFP_NOFAIL. Too much to guarantee IMHO.

We're talking in circles.

Overrunning the hard limit in the syscall path isn't worse than an
infinite loop in the page fault path. Once you make tasks ineligible
for OOM, that's what you have to expect. It's the OOM disabling that
results in extreme consequences across the board.

It's silly to keep bringing this up as an example.

> > They're a challenge as it is.  The only sane options are to stick with
> > the status quo,
> 
> One thing that really worries me about the current status quo is that
> the behavior depends on whether you run under memcg or not. The global
> policy is "almost never fail unless something horrible is going on".
> But we _do not_ guarantee that ENOMEM stays inside the kernel.
> 
> So if we need to do something about that I would think we need an
> universal solution rather than something memcg specific. Sure global
> ENOMEMs are so rare that nobody will probably trigger those but that is
> just a wishful thinking...

The difference in current behavior comes from the fact that real-life
workloads could trigger a deadlock with memcg looping indefinitely in
the charge path, but not with the allocator looping indefinitely in
the allocation path.

That also means that if we change it like you proposed, it'll be much
more likely to trigger -ENOMEM from syscalls inside memcg than it is
outside.

The margins are simply tighter inside cgroups. You often have only a
few closely interacting tasks in there, and the probability for
something to go wrong is much higher than on the system scope.

But also, the system is constrained by physical limits. It's a hard
wall; once you hit it, the kernel is dead, so we don't have a lot of
options. Memcg on the other hand is a completely artificial limit. It
doesn't make sense to me to pretend it's a hard wall to the point
where we actively *create* for ourselves the downsides of a physical
limitation.

> So how about we start with a BIG FAT WARNING for the failure case?
> Something resembling warn_alloc for the failure case.
>
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 5d9323028870..3ba62c73eee5 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
>  	 * we would fall back to the global oom killer in pagefault_out_of_memory
>  	 */
> -	if (!memcg->oom_kill_disable &&
> -			mem_cgroup_out_of_memory(memcg, mask, order))
> -		return true;
> +	if (!memcg->oom_kill_disable) {
> +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> +			return true;
> +
> +		WARN(!current->memcg_may_oom,
> +				"Memory cgroup charge failed because of no reclaimable memory! "
> +				"This looks like a misconfiguration or a kernel bug.");
> +	}

That's crazy!

We shouldn't create interfaces that make it possible to accidentally
livelock the kernel. Then warn about it and let it crash. That is a
DOS-level lack of OS abstraction.

In such a situation, we should ignore oom_score_adj or ignore the hard
limit. Even panic() would be better from a machine management point of
view than leaving random tasks inside infinite loops.

Why is OOM-disabling a thing? Why isn't this simply a "kill everything
else before you kill me"? It's crashing the kernel in trying to
protect a userspace application. How is that not insane?
Michal Hocko Oct. 25, 2017, 5:29 p.m. UTC | #2
On Wed 25-10-17 12:44:02, Johannes Weiner wrote:
> On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
[...]

I have yet to digest the first part of the email, but the rest just
sounds like we are not on the same page.

> > So how about we start with a BIG FAT WARNING for the failure case?
> > Something resembling warn_alloc for the failure case.
> >
> > ---
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 5d9323028870..3ba62c73eee5 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >  	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> >  	 * we would fall back to the global oom killer in pagefault_out_of_memory
> >  	 */
> > -	if (!memcg->oom_kill_disable &&
> > -			mem_cgroup_out_of_memory(memcg, mask, order))
> > -		return true;
> > +	if (!memcg->oom_kill_disable) {
> > +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> > +			return true;
> > +
> > +		WARN(!current->memcg_may_oom,
> > +				"Memory cgroup charge failed because of no reclaimable memory! "
> > +				"This looks like a misconfiguration or a kernel bug.");
> > +	}
> 
> That's crazy!
> 
> We shouldn't create interfaces that make it possible to accidentally
> livelock the kernel. Then warn about it and let it crash. That is a
> DOS-level lack of OS abstraction.
> 
> In such a situation, we should ignore oom_score_adj or ignore the hard
> limit. Even panic() would be better from a machine management point of
> view than leaving random tasks inside infinite loops.
> 
> Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> else before you kill me"? It's crashing the kernel in trying to
> protect a userspace application. How is that not insane?

I really do not follow. What kind of livelock or crash are you talking
about? All this code does is make the charge request (which is not
explicitly GFP_NOFAIL) fail with ENOMEM if the oom killer is not able
to make forward progress. That sounds like a safer option than failing
with ENOMEM unconditionally, which is what we do currently. So the only
change I am really proposing is to keep retrying as long as the oom
killer makes forward progress and return ENOMEM otherwise.

I am also not trying to protect a userspace application. Quite the
contrary, I would like the application to get ENOMEM when it should back
away from the constraint it runs within. I am actually protecting
everything outside of the hard-limited memcg.

So what is it that I am missing?
Johannes Weiner Oct. 25, 2017, 6:11 p.m. UTC | #3
On Wed, Oct 25, 2017 at 07:29:24PM +0200, Michal Hocko wrote:
> On Wed 25-10-17 12:44:02, Johannes Weiner wrote:
> > On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
> > > So how about we start with a BIG FAT WARNING for the failure case?
> > > Something resembling warn_alloc for the failure case.
> > >
> > > ---
> > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > index 5d9323028870..3ba62c73eee5 100644
> > > --- a/mm/memcontrol.c
> > > +++ b/mm/memcontrol.c
> > > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > >  	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> > >  	 * we would fall back to the global oom killer in pagefault_out_of_memory
> > >  	 */
> > > -	if (!memcg->oom_kill_disable &&
> > > -			mem_cgroup_out_of_memory(memcg, mask, order))
> > > -		return true;
> > > +	if (!memcg->oom_kill_disable) {
> > > +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> > > +			return true;
> > > +
> > > +		WARN(!current->memcg_may_oom,
> > > +				"Memory cgroup charge failed because of no reclaimable memory! "
> > > +				"This looks like a misconfiguration or a kernel bug.");
> > > +	}
> > 
> > That's crazy!
> > 
> > We shouldn't create interfaces that make it possible to accidentally
> > livelock the kernel. Then warn about it and let it crash. That is a
> > DOS-level lack of OS abstraction.
> > 
> > In such a situation, we should ignore oom_score_adj or ignore the hard
> > limit. Even panic() would be better from a machine management point of
> > view than leaving random tasks inside infinite loops.
> > 
> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> > else before you kill me"? It's crashing the kernel in trying to
> > protect a userspace application. How is that not insane?
> 
> I really do not follow. What kind of livelock or crash are you talking
> about. All this code does is that the charge request (which is not
> explicitly GFP_NOFAIL) fails with ENOMEM if the oom killer is not able
> to make a forward progress. That sounds like a safer option than failing
> with ENOMEM unconditionally which is what we do currently.

I pointed out multiple times now that consistent -ENOMEM is better
than a rare one because it's easier to verify application behavior
with test runs. And I don't understand what your counter-argument is.

"Safe" is a vague term, and it doesn't make much sense to me in this
situation. The OOM behavior should be predictable and consistent.

Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
don't have to do that in memcg because we're not physically limited.

> So the only change I am really proposing is to keep retrying as long
> as the oom killer makes a forward progress and ENOMEM otherwise.

That's the behavior change I'm against.

> I am also not trying to protect an userspace application. Quite
> contrary, I would like the application gets ENOMEM when it should run
> away from the constraint it runs within. I am protecting everything
> outside of the hard limited memcg actually.
> 
> So what is that I am missing?

The page fault path.

Maybe that's not what you set out to fix, but it will now not only
enter an infinite loop, it will also WARN() on each iteration.

It would make sense to step back and think about the comprehensive
user-visible behavior of an out-of-memory situation, and not just the
one syscall aspect.
Michal Hocko Oct. 25, 2017, 7 p.m. UTC | #4
On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
> On Wed, Oct 25, 2017 at 07:29:24PM +0200, Michal Hocko wrote:
> > On Wed 25-10-17 12:44:02, Johannes Weiner wrote:
> > > On Wed, Oct 25, 2017 at 04:12:21PM +0200, Michal Hocko wrote:
> > > > So how about we start with a BIG FAT WARNING for the failure case?
> > > > Something resembling warn_alloc for the failure case.
> > > >
> > > > ---
> > > > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > > > index 5d9323028870..3ba62c73eee5 100644
> > > > --- a/mm/memcontrol.c
> > > > +++ b/mm/memcontrol.c
> > > > @@ -1547,9 +1547,14 @@ static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > > >  	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> > > >  	 * we would fall back to the global oom killer in pagefault_out_of_memory
> > > >  	 */
> > > > -	if (!memcg->oom_kill_disable &&
> > > > -			mem_cgroup_out_of_memory(memcg, mask, order))
> > > > -		return true;
> > > > +	if (!memcg->oom_kill_disable) {
> > > > +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> > > > +			return true;
> > > > +
> > > > +		WARN(!current->memcg_may_oom,
> > > > +				"Memory cgroup charge failed because of no reclaimable memory! "
> > > > +				"This looks like a misconfiguration or a kernel bug.");
> > > > +	}
> > > 
> > > That's crazy!
> > > 
> > > We shouldn't create interfaces that make it possible to accidentally
> > > livelock the kernel. Then warn about it and let it crash. That is a
> > > DOS-level lack of OS abstraction.
> > > 
> > > In such a situation, we should ignore oom_score_adj or ignore the hard
> > > limit. Even panic() would be better from a machine management point of
> > > view than leaving random tasks inside infinite loops.
> > > 
> > > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> > > else before you kill me"? It's crashing the kernel in trying to
> > > protect a userspace application. How is that not insane?
> > 
> > I really do not follow. What kind of livelock or crash are you talking
> > about. All this code does is that the charge request (which is not
> > explicitly GFP_NOFAIL) fails with ENOMEM if the oom killer is not able
> > to make a forward progress. That sounds like a safer option than failing
> > with ENOMEM unconditionally which is what we do currently.
> 
> I pointed out multiple times now that consistent -ENOMEM is better
> than a rare one because it's easier to verify application behavior
> with test runs. And I don't understand what your counter-argument is.

My counter-argument is that running inside a memcg shouldn't be too
different from running outside of it. And that means that if we try to
reduce the chances of ENOMEM in the global case then we should do the same
in the memcg case as well. Failing overly eagerly is more harmful than
the determinism it gives you for testing. You have other means to test
failure paths, like fault injection.
 
> "Safe" is a vague term, and it doesn't make much sense to me in this
> situation. The OOM behavior should be predictable and consistent.
> 
> Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
> don't have to do that in memcg because we're not physically limited.

OK, so here seems to be the biggest disconnect. Being physically or
artificially constrained shouldn't make much difference IMHO. In both
cases the resource is simply limited for the consumer. And once all the
attempts to fit within the limit fail then the request for the resource
has to fail.
 
> > So the only change I am really proposing is to keep retrying as long
> > as the oom killer makes a forward progress and ENOMEM otherwise.
> 
> That's the behavior change I'm against.

So just to make it clear you would be OK with the retry on successful
OOM killer invocation and force charge on oom failure, right?

> > I am also not trying to protect an userspace application. Quite
> > contrary, I would like the application gets ENOMEM when it should run
> > away from the constraint it runs within. I am protecting everything
> > outside of the hard limited memcg actually.
> > 
> > So what is that I am missing?
> 
> The page fault path.
> 
> Maybe that's not what you set out to fix, but it will now not only
> enter an infinite loop, it will also WARN() on each iteration.

It will not warn! The WARN is explicitly for non-PF paths, unless I
have screwed something up there - admittedly I didn't even try to
compile that code; it was merely an illustration. Please note that
the diff is on top of the previous one.

And we already have an endless loop if the memcg PF oom path doesn't
make forward progress. So there shouldn't be any difference in that
regard.

> It would make sense to step back and think about the comprehensive
> user-visible behavior of an out-of-memory situation, and not just the
> one syscall aspect.
Johannes Weiner Oct. 25, 2017, 9:13 p.m. UTC | #5
On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
> On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
> > "Safe" is a vague term, and it doesn't make much sense to me in this
> > situation. The OOM behavior should be predictable and consistent.
> > 
> > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
> > don't have to do that in memcg because we're not physically limited.
> 
> OK, so here seems to be the biggest disconnect. Being physically or
> artificially constrained shouldn't make much difference IMHO. In both
> cases the resource is simply limited for the consumer. And once all the
> attempts to fit within the limit fail then the request for the resource
> has to fail.

It's a huge difference. In the global case, we have to make trade-offs
to not deadlock the kernel. In the memcg case, we have to make a trade-off
between desirable OOM behavior and the desirable meaning of memory.max.

If we can borrow a resource temporarily from the ether to resolve the
OOM situation, I don't see why we shouldn't. We're only briefly
ignoring the limit to make sure the allocating task isn't preventing
the OOM victim from exiting or the OOM reaper from reaping. It's more
of an implementation detail than interface.

The only scenario you brought up where this might be a permanent
overrun is the single, oom-disabled task. And I explained why that is
a silly argument, why that's the least problematic consequence of
oom-disabling, and why it probably shouldn't even be configurable.

The idea that memory.max must never be breached is an extreme and
narrow view. As Greg points out, there are allocations we do not even
track. There are other scenarios that force allocations. They may
violate the limit on paper, but they're not notably weakening the goal
of memory.max - isolating workloads from each other.

Let's look at it this way.

There are two deadlock relationships the OOM killer needs to solve
between the triggerer and the potential OOM victim:

	#1 Memory. The triggerer needs memory that the victim has,
	    but the victim needs some additional memory to release it.

	#2 Locks. The triggerer needs memory that the victim has, but
	    the victim needs a lock the triggerer holds to release it.

We have no qualms letting the victim temporarily (until the victim's
exit) ignore memory.max to resolve the memory deadlock #1.

I don't understand why it's such a stretch to let the triggerer
temporarily (until the victim's exit) ignore memory.max to resolve the
locks deadlock #2. [1]

We need both for the OOM killer to function correctly.

We've solved #1 both for memcg and globally. But we haven't solved #2.
Global can still deadlock, and memcg copped out and returns -ENOMEM.

Adding speculative OOM killing before the -ENOMEM makes things more
muddy and unpredictable. It doesn't actually solve deadlock #2.

[1] And arguably that's what we should be doing in the global case
    too: give the triggerer access to reserves. If you recall this
    thread here: https://patchwork.kernel.org/patch/6088511/
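To illustrate deadlock #2 and why a temporary overcharge for the
triggerer resolves it, here is a small user-space simulation; the
charge()/victim() helpers, the limit and the locking are made up for
the example and stand in for the kernel paths, they are not the real
mm/memcontrol.c code:

/*
 * Hypothetical simulation of deadlock #2: the triggerer holds a lock the
 * OOM victim needs in order to exit and release its charge.  Letting the
 * triggerer briefly overrun the limit unblocks everyone; without the
 * overcharge branch in charge(), neither thread would make progress.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

#define LIMIT 100

static pthread_mutex_t shared_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t usage_lock = PTHREAD_MUTEX_INITIALIZER;
static int usage = LIMIT;		/* the victim already fills the group */
static bool victim_killed;

static bool is_killed(void)
{
	bool killed;

	pthread_mutex_lock(&usage_lock);
	killed = victim_killed;
	pthread_mutex_unlock(&usage_lock);
	return killed;
}

/* Charge that may briefly exceed the limit once a kill has been issued. */
static void charge(int pages)
{
	for (;;) {
		pthread_mutex_lock(&usage_lock);
		if (usage + pages <= LIMIT || victim_killed) {
			usage += pages;		/* possibly an overcharge */
			pthread_mutex_unlock(&usage_lock);
			return;
		}
		victim_killed = true;		/* "invoke the OOM killer" */
		pthread_mutex_unlock(&usage_lock);
	}
}

static void *victim(void *arg)
{
	(void)arg;
	while (!is_killed())			/* wait to be OOM-killed */
		usleep(1000);
	pthread_mutex_lock(&shared_lock);	/* needs the triggerer's lock */
	pthread_mutex_unlock(&shared_lock);
	pthread_mutex_lock(&usage_lock);
	usage -= LIMIT;				/* exit and uncharge */
	pthread_mutex_unlock(&usage_lock);
	return NULL;
}

int main(void)
{
	pthread_t v;

	pthread_create(&v, NULL, victim, NULL);
	pthread_mutex_lock(&shared_lock);	/* triggerer holds the lock... */
	charge(10);				/* ...and charges over the limit */
	pthread_mutex_unlock(&shared_lock);
	pthread_join(v, NULL);
	printf("final usage: %d\n", usage);	/* back under the limit */
	return 0;
}

If the victim_killed branch in charge() is removed, it spins forever
while holding shared_lock and the victim can never exit - exactly the
deadlock the OOM killer is supposed to resolve.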

> > > So the only change I am really proposing is to keep retrying as long
> > > as the oom killer makes a forward progress and ENOMEM otherwise.
> > 
> > That's the behavior change I'm against.
> 
> So just to make it clear you would be OK with the retry on successful
> OOM killer invocation and force charge on oom failure, right?

Yeah, that sounds reasonable to me.
Greg Thelen Oct. 25, 2017, 10:49 p.m. UTC | #6
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
>> On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
>> > "Safe" is a vague term, and it doesn't make much sense to me in this
>> > situation. The OOM behavior should be predictable and consistent.
>> > 
>> > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
>> > don't have to do that in memcg because we're not physically limited.
>> 
>> OK, so here seems to be the biggest disconnect. Being physically or
>> artificially constrained shouldn't make much difference IMHO. In both
>> cases the resource is simply limited for the consumer. And once all the
>> attempts to fit within the limit fail then the request for the resource
>> has to fail.
>
> It's a huge difference. In the global case, we have to make trade-offs
> to not deadlock the kernel. In the memcg case, we have to make a trade
> off between desirable OOM behavior and desirable meaning of memory.max.
>
> If we can borrow a resource temporarily from the ether to resolve the
> OOM situation, I don't see why we shouldn't. We're only briefly
> ignoring the limit to make sure the allocating task isn't preventing
> the OOM victim from exiting or the OOM reaper from reaping. It's more
> of an implementation detail than interface.
>
> The only scenario you brought up where this might be the permanent
> overrun is the single, oom-disabled task. And I explained why that is
> a silly argument, why that's the least problematic consequence of
> oom-disabling, and why it probably shouldn't even be configurable.
>
> The idea that memory.max must never be breached is an extreme and
> narrow view. As Greg points out, there are allocations we do not even
> track. There are other scenarios that force allocations. They may
> violate the limit on paper, but they're not notably weakening the goal
> of memory.max - isolating workloads from each other.
>
> Let's look at it this way.
>
> There are two deadlock relationships the OOM killer needs to solve
> between the triggerer and the potential OOM victim:
>
> 	#1 Memory. The triggerer needs memory that the victim has,
> 	    but the victim needs some additional memory to release it.
>
> 	#2 Locks. The triggerer needs memory that the victim has, but
> 	    the victim needs a lock the triggerer holds to release it.
>
> We have no qualms letting the victim temporarily (until the victim's
> exit) ignore memory.max to resolve the memory deadlock #1.
>
> I don't understand why it's such a stretch to let the triggerer
> temporarily (until the victim's exit) ignore memory.max to resolve the
> locks deadlock #2. [1]
>
> We need both for the OOM killer to function correctly.
>
> We've solved #1 both for memcg and globally. But we haven't solved #2.
> Global can still deadlock, and memcg copped out and returns -ENOMEM.
>
> Adding speculative OOM killing before the -ENOMEM makes things more
> muddy and unpredictable. It doesn't actually solve deadlock #2.
>
> [1] And arguably that's what we should be doing in the global case
>     too: give the triggerer access to reserves. If you recall this
>     thread here: https://patchwork.kernel.org/patch/6088511/
>
>> > > So the only change I am really proposing is to keep retrying as long
>> > > as the oom killer makes a forward progress and ENOMEM otherwise.
>> > 
>> > That's the behavior change I'm against.
>> 
>> So just to make it clear you would be OK with the retry on successful
>> OOM killer invocation and force charge on oom failure, right?
>
> Yeah, that sounds reasonable to me.

Assuming we're talking about retrying within try_charge(), then there's
a detail to iron out...

If there is a pending oom victim blocked on a lock held by the try_charge()
caller (the "#2 Locks" case), then I think repeated calls to out_of_memory()
will return true until the victim either gets MMF_OOM_SKIP or disappears.  So a
force charge fallback might be needed even with successful oom killer
invocations.  Or we'll need to teach out_of_memory() to return three values
(e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) so that try_charge() can loop on
NEW_VICTIM.
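To make that concrete, a minimal user-space sketch of such a tri-state
policy could look like the following; the enum values, the *_sim() stubs
and the decision to force the charge on PENDING_VICTIM are illustrative
assumptions, not the actual mm/memcontrol.c interfaces:

/*
 * Hypothetical sketch of the tri-state out_of_memory() return discussed
 * above; the names and the retry policy are illustrative, not kernel code.
 */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

enum oom_status { NO_VICTIM, NEW_VICTIM, PENDING_VICTIM };

#define MEM_CGROUP_RECLAIM_RETRIES 5

/* Stand-ins for the reclaim/oom decisions the kernel would make. */
static bool reclaim_made_room_sim(void) { return false; }

static enum oom_status out_of_memory_sim(void)
{
	static int calls;

	/* First call issues a kill; later calls find the victim pending. */
	return calls++ ? PENDING_VICTIM : NEW_VICTIM;
}

static int try_charge_sim(void)
{
	int retries = MEM_CGROUP_RECLAIM_RETRIES;

	while (retries--) {
		if (reclaim_made_room_sim())
			return 0;		/* charge fits again */

		switch (out_of_memory_sim()) {
		case NEW_VICTIM:	/* a kill was issued, retry */
			continue;
		case PENDING_VICTIM:	/* the victim may need our locks */
			return 0;	/* force the charge through */
		case NO_VICTIM:		/* nothing left to kill */
			return -ENOMEM;
		}
	}
	return -ENOMEM;
}

int main(void)
{
	printf("try_charge_sim() = %d\n", try_charge_sim());
	return 0;
}

Whether PENDING_VICTIM should mean "overcharge immediately" or "keep
waiting for the reaper" is exactly the open question in the rest of the
thread.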
Michal Hocko Oct. 26, 2017, 7:49 a.m. UTC | #7
On Wed 25-10-17 15:49:21, Greg Thelen wrote:
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
[...]
> >> So just to make it clear you would be OK with the retry on successful
> >> OOM killer invocation and force charge on oom failure, right?
> >
> > Yeah, that sounds reasonable to me.
> 
> Assuming we're talking about retrying within try_charge(), then there's
> a detail to iron out...
> 
> If there is a pending oom victim blocked on a lock held by try_charge() caller
> (the "#2 Locks" case), then I think repeated calls to out_of_memory() will
> return true until the victim either gets MMF_OOM_SKIP or disappears.

True. And the oom_reaper guarantees that MMF_OOM_SKIP gets set in a finite
amount of time.

> So a force
> charge fallback might be a needed even with oom killer successful invocations.
> Or we'll need to teach out_of_memory() to return three values (e.g. NO_VICTIM,
> NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM.

No, we really want to wait for the oom victim to do its job. The only
thing we should be worried about is when out_of_memory doesn't invoke
the reaper. There is only one case like that AFAIK - a GFP_NOFS request.
I have to think about this case some more. We currently fail the request
in that case.
Tetsuo Handa Oct. 26, 2017, 12:45 p.m. UTC | #8
On 2017/10/26 16:49, Michal Hocko wrote:
> On Wed 25-10-17 15:49:21, Greg Thelen wrote:
>> Johannes Weiner <hannes@cmpxchg.org> wrote:
>>
>>> On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
> [...]
>>>> So just to make it clear you would be OK with the retry on successful
>>>> OOM killer invocation and force charge on oom failure, right?
>>>
>>> Yeah, that sounds reasonable to me.
>>
>> Assuming we're talking about retrying within try_charge(), then there's
>> a detail to iron out...
>>
>> If there is a pending oom victim blocked on a lock held by try_charge() caller
>> (the "#2 Locks" case), then I think repeated calls to out_of_memory() will
>> return true until the victim either gets MMF_OOM_SKIP or disappears.
> 
> true. And oom_reaper guarantees that MMF_OOM_SKIP gets set in the finit
> amount of time.

Just a confirmation. You are talking about kmemcg, aren't you? And kmemcg
depends on CONFIG_MMU=y, doesn't it? If not, there is no such guarantee.

> 
>> So a force
>> charge fallback might be a needed even with oom killer successful invocations.
>> Or we'll need to teach out_of_memory() to return three values (e.g. NO_VICTIM,
>> NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM.
> 
> No we, really want to wait for the oom victim to do its job. The only
> thing we should be worried about is when out_of_memory doesn't invoke
> the reaper. There is only one case like that AFAIK - GFP_NOFS request. I
> have to think about this case some more. We currently fail in that case
> the request.
> 

Do we really need to apply

	/*
	 * The OOM killer does not compensate for IO-less reclaim.
	 * pagefault_out_of_memory lost its gfp context so we have to
	 * make sure exclude 0 mask - all other users should have at least
	 * ___GFP_DIRECT_RECLAIM to get here.
	 */
	if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
		return true;

unconditionally?

We can encourage !__GFP_FS allocations to use __GFP_NORETRY or
__GFP_RETRY_MAYFAIL if their allocations are not important.
Then, only important !__GFP_FS allocations will be checked here.
I think that we can allow such important allocations to invoke the OOM
killer (i.e. remove this check) because the situation is already hopeless
if important !__GFP_FS allocations cannot make progress.
Johannes Weiner Oct. 26, 2017, 2:31 p.m. UTC | #9
On Wed, Oct 25, 2017 at 03:49:21PM -0700, Greg Thelen wrote:
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> 
> > On Wed, Oct 25, 2017 at 09:00:57PM +0200, Michal Hocko wrote:
> >> On Wed 25-10-17 14:11:06, Johannes Weiner wrote:
> >> > "Safe" is a vague term, and it doesn't make much sense to me in this
> >> > situation. The OOM behavior should be predictable and consistent.
> >> > 
> >> > Yes, global might in the rarest cases also return -ENOMEM. Maybe. We
> >> > don't have to do that in memcg because we're not physically limited.
> >> 
> >> OK, so here seems to be the biggest disconnect. Being physically or
> >> artificially constrained shouldn't make much difference IMHO. In both
> >> cases the resource is simply limited for the consumer. And once all the
> >> attempts to fit within the limit fail then the request for the resource
> >> has to fail.
> >
> > It's a huge difference. In the global case, we have to make trade-offs
> > to not deadlock the kernel. In the memcg case, we have to make a trade
> > off between desirable OOM behavior and desirable meaning of memory.max.
> >
> > If we can borrow a resource temporarily from the ether to resolve the
> > OOM situation, I don't see why we shouldn't. We're only briefly
> > ignoring the limit to make sure the allocating task isn't preventing
> > the OOM victim from exiting or the OOM reaper from reaping. It's more
> > of an implementation detail than interface.
> >
> > The only scenario you brought up where this might be the permanent
> > overrun is the single, oom-disabled task. And I explained why that is
> > a silly argument, why that's the least problematic consequence of
> > oom-disabling, and why it probably shouldn't even be configurable.
> >
> > The idea that memory.max must never be breached is an extreme and
> > narrow view. As Greg points out, there are allocations we do not even
> > track. There are other scenarios that force allocations. They may
> > violate the limit on paper, but they're not notably weakening the goal
> > of memory.max - isolating workloads from each other.
> >
> > Let's look at it this way.
> >
> > There are two deadlock relationships the OOM killer needs to solve
> > between the triggerer and the potential OOM victim:
> >
> > 	#1 Memory. The triggerer needs memory that the victim has,
> > 	    but the victim needs some additional memory to release it.
> >
> > 	#2 Locks. The triggerer needs memory that the victim has, but
> > 	    the victim needs a lock the triggerer holds to release it.
> >
> > We have no qualms letting the victim temporarily (until the victim's
> > exit) ignore memory.max to resolve the memory deadlock #1.
> >
> > I don't understand why it's such a stretch to let the triggerer
> > temporarily (until the victim's exit) ignore memory.max to resolve the
> > locks deadlock #2. [1]
> >
> > We need both for the OOM killer to function correctly.
> >
> > We've solved #1 both for memcg and globally. But we haven't solved #2.
> > Global can still deadlock, and memcg copped out and returns -ENOMEM.
> >
> > Adding speculative OOM killing before the -ENOMEM makes things more
> > muddy and unpredictable. It doesn't actually solve deadlock #2.
> >
> > [1] And arguably that's what we should be doing in the global case
> >     too: give the triggerer access to reserves. If you recall this
> >     thread here: https://patchwork.kernel.org/patch/6088511/
> >
> >> > > So the only change I am really proposing is to keep retrying as long
> >> > > as the oom killer makes a forward progress and ENOMEM otherwise.
> >> > 
> >> > That's the behavior change I'm against.
> >> 
> >> So just to make it clear you would be OK with the retry on successful
> >> OOM killer invocation and force charge on oom failure, right?
> >
> > Yeah, that sounds reasonable to me.
> 
> Assuming we're talking about retrying within try_charge(), then there's
> a detail to iron out...
> 
> If there is a pending oom victim blocked on a lock held by try_charge() caller
> (the "#2 Locks" case), then I think repeated calls to out_of_memory() will
> return true until the victim either gets MMF_OOM_SKIP or disappears.  So a force
> charge fallback might be a needed even with oom killer successful invocations.
> Or we'll need to teach out_of_memory() to return three values (e.g. NO_VICTIM,
> NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on NEW_VICTIM.

True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a
maximum, even if the OOM killer indicates a kill has been issued. What
you propose makes sense too.
Greg Thelen Oct. 26, 2017, 7:56 p.m. UTC | #10
Michal Hocko wrote:
> Greg Thelen wrote:
> > So a force charge fallback might be a needed even with oom killer successful
> > invocations.  Or we'll need to teach out_of_memory() to return three values
> > (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on
> > NEW_VICTIM.
> 
> No we, really want to wait for the oom victim to do its job. The only thing we
> should be worried about is when out_of_memory doesn't invoke the reaper. There
> is only one case like that AFAIK - GFP_NOFS request. I have to think about
> this case some more. We currently fail in that case the request.

Nod, but I think we should only wait a short time (more below).  The
schedule_timeout_killable(1) in out_of_memory() seems ok to me.  I don't
think there's a problem overcharging a little bit to expedite oom
killing.

Johannes Weiner wrote:
> True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a maximum,
> even if the OOM killer indicates a kill has been issued. What you propose
> makes sense too.

Sounds good.

It looks like the oom reaper will wait 1 second
(MAX_OOM_REAP_RETRIES*HZ/10) before giving up and setting MMF_OOM_SKIP,
which enables the oom killer to select another victim.  Repeated
try_charge() => out_of_memory() calls will return true while there's a
pending victim.  After the first call, out_of_memory() doesn't appear to
sleep.  So I assume try_charge() would quickly burn through
MEM_CGROUP_RECLAIM_RETRIES (5) attempts before resorting to
overcharging.  IMO, this is fine because:
1) it's possible the victim wants locks held by the try_charge() caller.  So
   waiting for the oom reaper to time out and out_of_memory() to select
   additional victims would kill more than required.
2) waiting 1 sec to detect a livelock between try_charge() and a pending
   oom victim seems unfortunate.
Michal Hocko Oct. 27, 2017, 8:20 a.m. UTC | #11
On Thu 26-10-17 12:56:48, Greg Thelen wrote:
> Michal Hocko wrote:
> > Greg Thelen wrote:
> > > So a force charge fallback might be a needed even with oom killer successful
> > > invocations.  Or we'll need to teach out_of_memory() to return three values
> > > (e.g. NO_VICTIM, NEW_VICTIM, PENDING_VICTIM) and try_charge() can loop on
> > > NEW_VICTIM.
> > 
> > No we, really want to wait for the oom victim to do its job. The only thing we
> > should be worried about is when out_of_memory doesn't invoke the reaper. There
> > is only one case like that AFAIK - GFP_NOFS request. I have to think about
> > this case some more. We currently fail in that case the request.
> 
> Nod, but I think only wait a short time (more below).  The
> schedule_timeout_killable(1) in out_of_memory() seems ok to me.  I don't
> think there's a problem overcharging a little bit to expedite oom
> killing.

This is not about time. This is about the feedback mechanism the oom_reaper
provides. We shouldn't take any action until we get that feedback.

> Johannes Weiner wrote:
> > True. I was assuming we'd retry MEM_CGROUP_RECLAIM_RETRIES times at a maximum,
> > even if the OOM killer indicates a kill has been issued. What you propose
> > makes sense too.
> 
> Sounds good.
> 
> It looks like the oom reaper will wait 1 second
> (MAX_OOM_REAP_RETRIES*HZ/10) before giving up and setting MMF_OOM_SKIP,
> which enables the oom killer to select another victim.  Repeated
> try_charge() => out_of_memory() calls will return true while there's a
> pending victim.  After the first call, out_of_memory() doesn't appear to
> sleep.  So I assume try_charge() would quickly burn through
> MEM_CGROUP_RECLAIM_RETRIES (5) attempts before resorting to
> overcharging.  IMO, this is fine because:
> 1) it's possible the victim wants locks held by try_charge caller.  So
>    waiting for the oom reaper to timeout and out_of_memory to select
>    additional victims would kill more than required.
> 2) waiting 1 sec to detect a livelock between try_charge() and pending
>    oom victim seems unfortunate.

I am not yet sure whether overcharging or ENOMEM is the right way to go
(I have to think that through much more) but one thing is clear I guess.
The charge path shouldn't make any decision as long as it gets positive
feedback from the oom killer path. And that pretty much depends on what
the oom reaper is able to do. Implementation details about how long the
reaper waits when it is not able to reclaim any memory are not all that
important.
Shakeel Butt Oct. 27, 2017, 8:50 p.m. UTC | #12
> Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> else before you kill me"? It's crashing the kernel in trying to
> protect a userspace application. How is that not insane?

In parallel to the other discussion, I think we should definitely move
from "completely oom-disabled" semantics to something similar to "kill
me last" semantics. Is there any objection to this idea?
Michal Hocko Oct. 30, 2017, 8:29 a.m. UTC | #13
On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> > else before you kill me"? It's crashing the kernel in trying to
> > protect a userspace application. How is that not insane?
> 
> In parallel to other discussion, I think we should definitely move
> from "completely oom-disabled" semantics to something similar to "kill
> me last" semantics. Is there any objection to this idea?

Could you be more specific what you mean?
Shakeel Butt Oct. 30, 2017, 7:28 p.m. UTC | #14
On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote:
> On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
>> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
>> > else before you kill me"? It's crashing the kernel in trying to
>> > protect a userspace application. How is that not insane?
>>
>> In parallel to other discussion, I think we should definitely move
>> from "completely oom-disabled" semantics to something similar to "kill
>> me last" semantics. Is there any objection to this idea?
>
> Could you be more specific what you mean?
>

I get the impression that the main reason behind the complexity of the
oom-killer is allowing processes to be protected from it, i.e. disabling
oom-killing of a process by setting /proc/[pid]/oom_score_adj to -1000.
So, instead of oom-disabling, add an interface which lets users/admins
mark a process to be oom-killed only as a last resort.
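For reference, the knob in question is per-process; a minimal sketch of
how a task opts out of OOM killing today (error handling trimmed, and
lowering the value requires CAP_SYS_RESOURCE) could look like:

/* Minimal sketch: opt the current process out of OOM killing by writing
 * OOM_SCORE_ADJ_MIN (-1000) to its oom_score_adj file. */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/self/oom_score_adj", "w");

	if (!f) {
		perror("oom_score_adj");
		return 1;
	}
	fprintf(f, "-1000\n");	/* -1000 == never select this task */
	fclose(f);
	return 0;
}

The proposal above would replace the complete exemption that -1000
grants with a weaker "pick me only when nothing else is left" semantic.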
Michal Hocko Oct. 31, 2017, 8 a.m. UTC | #15
On Mon 30-10-17 12:28:13, Shakeel Butt wrote:
> On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
> >> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> >> > else before you kill me"? It's crashing the kernel in trying to
> >> > protect a userspace application. How is that not insane?
> >>
> >> In parallel to other discussion, I think we should definitely move
> >> from "completely oom-disabled" semantics to something similar to "kill
> >> me last" semantics. Is there any objection to this idea?
> >
> > Could you be more specific what you mean?
> >
> 
> I get the impression that the main reason behind the complexity of
> oom-killer is allowing processes to be protected from the oom-killer
> i.e. disabling oom-killing a process by setting
> /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add
> an interface which will let users/admins to set a process to be
> oom-killed as a last resort.

If a process opts in to be oom-disabled it needs CAP_SYS_RESOURCE and it
probably has a strong reason to do that, e.g. no unexpected SIGKILL
which could leave inconsistent data behind. We cannot simply break that
contract. Yes, it is a PITA configuration to support but it has its
reasons to exist. We are not guaranteeing it 100% though, e.g. the
global case just panics if there is no eligible task. It is the
responsibility of userspace to make sure that the protected task
doesn't blow up completely, otherwise they are on their own. We should do
something similar for the memcg case: protect those tasks as long as we
are able to make forward progress and then either give them ENOMEM or
overcharge. Which way to go requires more discussion, but I do not think
that an unexpected SIGKILL is the way to go. We just want to give those
tasks a chance to do a graceful shutdown.
Johannes Weiner Oct. 31, 2017, 4:49 p.m. UTC | #16
On Tue, Oct 31, 2017 at 09:00:48AM +0100, Michal Hocko wrote:
> On Mon 30-10-17 12:28:13, Shakeel Butt wrote:
> > On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
> > >> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> > >> > else before you kill me"? It's crashing the kernel in trying to
> > >> > protect a userspace application. How is that not insane?
> > >>
> > >> In parallel to other discussion, I think we should definitely move
> > >> from "completely oom-disabled" semantics to something similar to "kill
> > >> me last" semantics. Is there any objection to this idea?
> > >
> > > Could you be more specific what you mean?
> > 
> > I get the impression that the main reason behind the complexity of
> > oom-killer is allowing processes to be protected from the oom-killer
> > i.e. disabling oom-killing a process by setting
> > /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add
> > an interface which will let users/admins to set a process to be
> > oom-killed as a last resort.
> 
> If a process opts in to be oom disabled it needs CAP_SYS_RESOURCE and it
> probably has a strong reason to do that. E.g. no unexpected SIGKILL
> which could leave inconsistent data behind. We cannot simply break that
> contract. Yes, it is a PITA configuration to support but it has its
> reasons to exit.

I don't think that's true. The most prominent users are things like X
and sshd, and all they wanted to say was "kill me last."

If sshd were to have a bug and swell up, currently the system would
kill everything and then panic. It'd be much better to kill sshd at
the end and let the init system restart it.

Can you describe a scenario in which the NEVERKILL semantics actually
make sense? You're still OOM-killing the task anyway, it's not like it
can run without the kernel. So why kill the kernel?
Michal Hocko Oct. 31, 2017, 6:50 p.m. UTC | #17
On Tue 31-10-17 12:49:59, Johannes Weiner wrote:
> On Tue, Oct 31, 2017 at 09:00:48AM +0100, Michal Hocko wrote:
> > On Mon 30-10-17 12:28:13, Shakeel Butt wrote:
> > > On Mon, Oct 30, 2017 at 1:29 AM, Michal Hocko <mhocko@kernel.org> wrote:
> > > > On Fri 27-10-17 13:50:47, Shakeel Butt wrote:
> > > >> > Why is OOM-disabling a thing? Why isn't this simply a "kill everything
> > > >> > else before you kill me"? It's crashing the kernel in trying to
> > > >> > protect a userspace application. How is that not insane?
> > > >>
> > > >> In parallel to other discussion, I think we should definitely move
> > > >> from "completely oom-disabled" semantics to something similar to "kill
> > > >> me last" semantics. Is there any objection to this idea?
> > > >
> > > > Could you be more specific what you mean?
> > > 
> > > I get the impression that the main reason behind the complexity of
> > > oom-killer is allowing processes to be protected from the oom-killer
> > > i.e. disabling oom-killing a process by setting
> > > /proc/[pid]/oom_score_adj to -1000. So, instead of oom-disabling, add
> > > an interface which will let users/admins to set a process to be
> > > oom-killed as a last resort.
> > 
> > If a process opts in to be oom disabled it needs CAP_SYS_RESOURCE and it
> > probably has a strong reason to do that. E.g. no unexpected SIGKILL
> > which could leave inconsistent data behind. We cannot simply break that
> > contract. Yes, it is a PITA configuration to support but it has its
> > reasons to exit.
> 
> I don't think that's true. The most prominent users are things like X
> and sshd, and all they wanted to say was "kill me last."

This might be the case for the desktop environment, and I would tend to
agree that those can handle a restart easily. I was considering
applications which need an explicit shutdown and manual intervention
when that doesn't happen. Think of a database or similar.

> If sshd were to have a bug and swell up, currently the system would
> kill everything and then panic. It'd be much better to kill sshd at
> the end and let the init system restart it.
> 
> Can you describe a scenario in which the NEVERKILL semantics actually
> make sense? You're still OOM-killing the task anyway, it's not like it
> can run without the kernel. So why kill the kernel?

Yes, but you start with a clean state after reboot, which is rather a
different thing than restarting from an inconsistent state.

In any case I am not trying to defend this configuration! I really
dislike it and it shouldn't have ever been introduced. But it has been the
established behavior for many years and I am not really willing to break
it without having a _really strong_ reason.

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5d9323028870..3ba62c73eee5 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1547,9 +1547,14 @@  static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
 	 * we would fall back to the global oom killer in pagefault_out_of_memory
 	 */
-	if (!memcg->oom_kill_disable &&
-			mem_cgroup_out_of_memory(memcg, mask, order))
-		return true;
+	if (!memcg->oom_kill_disable) {
+		if (mem_cgroup_out_of_memory(memcg, mask, order))
+			return true;
+
+		WARN(!current->memcg_may_oom,
+				"Memory cgroup charge failed because of no reclaimable memory! "
+				"This looks like a misconfiguration or a kernel bug.");
+	}
 
 	if (!current->memcg_may_oom)
 		return false;