
fs, mm: account filp and names caches to kmemcg

Message ID 20171013152421.yf76n7jui3z5bbn4@dhcp22.suse.cz (mailing list archive)
State New, archived

Commit Message

Michal Hocko Oct. 13, 2017, 3:24 p.m. UTC
Well, it actually occurred to me that this would trigger the global oom
killer in case no memcg-specific victim can be found, which is definitely
not something we would like to do. This should work better. I am not
sure we can trigger this corner case but we should cover it and it
actually doesn't make the code much worse.
---

Comments

Michal Hocko Oct. 24, 2017, 12:18 p.m. UTC | #1
Does this sound like something that you would be interested in? I can spend
some more time on it if it is worthwhile.

On Fri 13-10-17 17:24:21, Michal Hocko wrote:
> Well, it actually occurred to me that this would trigger the global oom
> killer in case no memcg-specific victim can be found, which is definitely
> not something we would like to do. This should work better. I am not
> sure we can trigger this corner case but we should cover it and it
> actually doesn't make the code much worse.
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d5f3a62887cf..7b370f070b82 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  
>  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	if (!current->memcg_may_oom)
> -		return;
>  	/*
>  	 * We are in the middle of the charge context here, so we
>  	 * don't want to block when potentially sitting on a callstack
>  	 * that holds all kinds of filesystem and mm locks.
>  	 *
> -	 * Also, the caller may handle a failed allocation gracefully
> -	 * (like optional page cache readahead) and so an OOM killer
> -	 * invocation might not even be necessary.
> +	 * cgroup v1 allows sync user space handling so we cannot afford
> +	 * to get stuck here for that configuration. That's why we don't do
> +	 * anything here except remember the OOM context and then deal with
> +	 * it at the end of the page fault when the stack is unwound, the
> +	 * locks are released, and when we know whether the fault was overall
> +	 * successful.
> +	 *
> +	 * On the other hand, in-kernel OOM killer allows for an async victim
> +	 * memory reclaim (oom_reaper) and that means that we are not solely
> +	 * relying on the oom victim to make forward progress so we can stay
> +	 * in the try_charge context and keep retrying as long as there
> +	 * are oom victims to select.
>  	 *
> -	 * That's why we don't do anything here except remember the
> -	 * OOM context and then deal with it at the end of the page
> -	 * fault when the stack is unwound, the locks are released,
> -	 * and when we know whether the fault was overall successful.
> +	 * Please note that mem_cgroup_out_of_memory might fail to find a
> +	 * victim and then we have to rely on mem_cgroup_oom_synchronize, otherwise
> +	 * we would fall back to the global oom killer in pagefault_out_of_memory.
>  	 */
> +	if (!memcg->oom_kill_disable &&
> +			mem_cgroup_out_of_memory(memcg, mask, order))
> +		return true;
> +
> +	if (!current->memcg_may_oom)
> +		return false;
>  	css_get(&memcg->css);
>  	current->memcg_in_oom = memcg;
>  	current->memcg_oom_gfp_mask = mask;
>  	current->memcg_oom_order = order;
> +
> +	return false;
>  }
>  
>  /**
> @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
>  
> -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> -		       get_order(nr_pages * PAGE_SIZE));
> +	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
> +		       get_order(nr_pages * PAGE_SIZE))) {
> +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +		goto retry;
> +	}
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
>  		return -ENOMEM;
> -- 
> Michal Hocko
> SUSE Labs
Johannes Weiner Oct. 24, 2017, 4:06 p.m. UTC | #2
On Fri, Oct 13, 2017 at 05:24:21PM +0200, Michal Hocko wrote:
> Well, it actually occurred to me that this would trigger the global oom
> killer in case no memcg-specific victim can be found, which is definitely
> not something we would like to do. This should work better. I am not
> sure we can trigger this corner case but we should cover it and it
> actually doesn't make the code much worse.
> ---
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d5f3a62887cf..7b370f070b82 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  
>  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	if (!current->memcg_may_oom)
> -		return;
>  	/*
>  	 * We are in the middle of the charge context here, so we
>  	 * don't want to block when potentially sitting on a callstack
>  	 * that holds all kinds of filesystem and mm locks.
>  	 *
> -	 * Also, the caller may handle a failed allocation gracefully
> -	 * (like optional page cache readahead) and so an OOM killer
> -	 * invocation might not even be necessary.
> +	 * cgroup v1 allows sync user space handling so we cannot afford
> +	 * to get stuck here for that configuration. That's why we don't do
> +	 * anything here except remember the OOM context and then deal with
> +	 * it at the end of the page fault when the stack is unwound, the
> +	 * locks are released, and when we know whether the fault was overall
> +	 * successful.

How about

"cgroup1 allows disabling the OOM killer and waiting for outside
handling until the charge can succeed; remember the context and put
the task to sleep at the end of the page fault when all locks are
released."

and then follow it directly with the branch that handles this:

	if (memcg->oom_kill_disable) {
		css_get(&memcg->css);
		current->memcg_in_oom = memcg;
		...
		return false;
	}

	return mem_cgroup_out_of_memory(memcg, mask, order);	

> +	 * On the other hand, in-kernel OOM killer allows for an async victim
> +	 * memory reclaim (oom_reaper) and that means that we are not solely
> +	 * relying on the oom victim to make forward progress so we can stay
> +	 * in the try_charge context and keep retrying as long as there
> +	 * are oom victims to select.

I would put that part into try_charge, where that decision is made.
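
For illustration, folding both of those suggestions in, mem_cgroup_oom
might end up looking roughly like this (a sketch only, reusing the
fields and helpers from the quoted patch; not a posted follow-up):

static bool mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
	/*
	 * cgroup1 allows disabling the OOM killer and waiting for outside
	 * handling until the charge can succeed; remember the context and
	 * put the task to sleep at the end of the page fault when all
	 * locks are released.
	 */
	if (memcg->oom_kill_disable) {
		if (!current->memcg_may_oom)
			return false;
		css_get(&memcg->css);
		current->memcg_in_oom = memcg;
		current->memcg_oom_gfp_mask = mask;
		current->memcg_oom_order = order;
		return false;
	}

	/* In-kernel handling; the oom_reaper rationale would move to try_charge. */
	return mem_cgroup_out_of_memory(memcg, mask, order);
}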

>  	 *
> -	 * That's why we don't do anything here except remember the
> -	 * OOM context and then deal with it at the end of the page
> -	 * fault when the stack is unwound, the locks are released,
> -	 * and when we know whether the fault was overall successful.
> +	 * Please note that mem_cgroup_out_of_memory might fail to find a
> +	 * victim and then we have to rely on mem_cgroup_oom_synchronize, otherwise
> +	 * we would fall back to the global oom killer in pagefault_out_of_memory.

Ah, that's why... Ugh, that's really duct-tapey.

>  	 */
> +	if (!memcg->oom_kill_disable &&
> +			mem_cgroup_out_of_memory(memcg, mask, order))
> +		return true;
> +
> +	if (!current->memcg_may_oom)
> +		return false;
>  	css_get(&memcg->css);
>  	current->memcg_in_oom = memcg;
>  	current->memcg_oom_gfp_mask = mask;
>  	current->memcg_oom_order = order;
> +
> +	return false;
>  }
>  
>  /**
> @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
>  
> -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> -		       get_order(nr_pages * PAGE_SIZE));
> +	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
> +		       get_order(nr_pages * PAGE_SIZE))) {
> +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +		goto retry;
> +	}

As per the previous email, this has to goto force, otherwise we return
-ENOMEM from syscalls once in a blue moon, which makes verification an
absolute nightmare. The behavior should be reliable, without weird p99
corner cases.

I think what we should be doing here is: if a charge fails, set up an
oom context and force the charge; add mem_cgroup_oom_synchronize() to
the end of syscalls and kernel-context faults.
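
Expressed against the hunk quoted above, that idea would look roughly
like this (a sketch only; the force: label and its overcharge fallback
already exist at the tail of try_charge, but routing the OOM case to it
and hooking mem_cgroup_oom_synchronize() into the syscall-exit path are
assumptions about the proposal, not posted code):

	mem_cgroup_event(mem_over_limit, MEMCG_OOM);

	/* Remember the OOM context for a later mem_cgroup_oom_synchronize(). */
	mem_cgroup_oom(mem_over_limit, gfp_mask,
		       get_order(nr_pages * PAGE_SIZE));

	/* Never return -ENOMEM from here; overcharge and reconcile later. */
	goto force;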
Michal Hocko Oct. 24, 2017, 4:22 p.m. UTC | #3
On Tue 24-10-17 12:06:37, Johannes Weiner wrote:
> On Fri, Oct 13, 2017 at 05:24:21PM +0200, Michal Hocko wrote:
> > Well, it actually occurred to me that this would trigger the global oom
> > killer in case no memcg-specific victim can be found, which is definitely
> > not something we would like to do. This should work better. I am not
> > sure we can trigger this corner case but we should cover it and it
> > actually doesn't make the code much worse.
> > ---
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index d5f3a62887cf..7b370f070b82 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -1528,26 +1528,40 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> >  
> >  static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >  {
> > -	if (!current->memcg_may_oom)
> > -		return;
> >  	/*
> >  	 * We are in the middle of the charge context here, so we
> >  	 * don't want to block when potentially sitting on a callstack
> >  	 * that holds all kinds of filesystem and mm locks.
> >  	 *
> > -	 * Also, the caller may handle a failed allocation gracefully
> > -	 * (like optional page cache readahead) and so an OOM killer
> > -	 * invocation might not even be necessary.
> > +	 * cgroup v1 allows sync user space handling so we cannot afford
> > +	 * to get stuck here for that configuration. That's why we don't do
> > +	 * anything here except remember the OOM context and then deal with
> > +	 * it at the end of the page fault when the stack is unwound, the
> > +	 * locks are released, and when we know whether the fault was overall
> > +	 * successful.
> 
> How about
> 
> "cgroup1 allows disabling the OOM killer and waiting for outside
> handling until the charge can succeed; remember the context and put
> the task to sleep at the end of the page fault when all locks are
> released."

OK

> and then follow it directly with the branch that handles this:
> 
> 	if (memcg->oom_kill_disable) {
> 		css_get(&memcg->css);
> 		current->memcg_in_oom = memcg;
> 		...
> 		return false;
> 	}
> 
> 	return mem_cgroup_out_of_memory(memcg, mask, order);	
> 
> > +	 * On the other hand, in-kernel OOM killer allows for an async victim
> > +	 * memory reclaim (oom_reaper) and that means that we are not solely
> > +	 * relying on the oom victim to make forward progress so we can stay
> > +	 * in the try_charge context and keep retrying as long as there
> > +	 * are oom victims to select.
> 
> I would put that part into try_charge, where that decision is made.

OK

> >  	 *
> > -	 * That's why we don't do anything here except remember the
> > -	 * OOM context and then deal with it at the end of the page
> > -	 * fault when the stack is unwound, the locks are released,
> > -	 * and when we know whether the fault was overall successful.
> > +	 * Please note that mem_cgroup_out_of_memory might fail to find a
> > +	 * victim and then we have to rely on mem_cgroup_oom_synchronize, otherwise
> > +	 * we would fall back to the global oom killer in pagefault_out_of_memory.
> 
> Ah, that's why... Ugh, that's really duct-tapey.

As you know, I really hate the #PF OOM path. We should get rid of it.
 
> >  	 */
> > +	if (!memcg->oom_kill_disable &&
> > +			mem_cgroup_out_of_memory(memcg, mask, order))
> > +		return true;
> > +
> > +	if (!current->memcg_may_oom)
> > +		return false;
> >  	css_get(&memcg->css);
> >  	current->memcg_in_oom = memcg;
> >  	current->memcg_oom_gfp_mask = mask;
> >  	current->memcg_oom_order = order;
> > +
> > +	return false;
> >  }
> >  
> >  /**
> > @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> >  
> >  	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
> >  
> > -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> > -		       get_order(nr_pages * PAGE_SIZE));
> > +	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
> > +		       get_order(nr_pages * PAGE_SIZE))) {
> > +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > +		goto retry;
> > +	}
> 
> As per the previous email, this has to goto force, otherwise we return
> -ENOMEM from syscalls once in a blue moon, which makes verification an
> absolute nightmare. The behavior should be reliable, without weird p99
> corner cases.
>
> I think what we should be doing here is: if a charge fails, set up an
> oom context and force the charge; add mem_cgroup_oom_synchronize() to
> the end of syscalls and kernel-context faults.

What would prevent a runaway in case the only process in the memcg is
oom unkillable then?
Johannes Weiner Oct. 24, 2017, 5:23 p.m. UTC | #4
On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
> On Tue 24-10-17 12:06:37, Johannes Weiner wrote:
> > >  	 *
> > > -	 * That's why we don't do anything here except remember the
> > > -	 * OOM context and then deal with it at the end of the page
> > > -	 * fault when the stack is unwound, the locks are released,
> > > -	 * and when we know whether the fault was overall successful.
> > > +	 * Please note that mem_cgroup_out_of_memory might fail to find a
> > > +	 * victim and then we have to rely on mem_cgroup_oom_synchronize, otherwise
> > > +	 * we would fall back to the global oom killer in pagefault_out_of_memory.
> > 
> > Ah, that's why... Ugh, that's really duct-tapey.
> 
> As you know, I really hate the #PF OOM path. We should get rid of it.

I agree, but this isn't getting rid of it, it just adds more layers.

> > > @@ -2007,8 +2021,11 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> > >  
> > >  	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
> > >  
> > > -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> > > -		       get_order(nr_pages * PAGE_SIZE));
> > > +	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
> > > +		       get_order(nr_pages * PAGE_SIZE))) {
> > > +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > > +		goto retry;
> > > +	}
> > 
> > As per the previous email, this has to goto force, otherwise we return
> > -ENOMEM from syscalls once in a blue moon, which makes verification an
> > absolute nightmare. The behavior should be reliable, without weird p99
> > corner cases.
> >
> > I think what we should be doing here is: if a charge fails, set up an
> > oom context and force the charge; add mem_cgroup_oom_synchronize() to
> > the end of syscalls and kernel-context faults.
> 
> What would prevent a runaway in case the only process in the memcg is
> oom unkillable then?

In such a scenario, the page fault handler would busy-loop right now.

Disabling oom kills is a privileged operation with dire consequences
if used incorrectly. You can panic the kernel with it. Why should the
cgroup OOM killer implement protective semantics around this setting?
Breaching the limit in such a setup is entirely acceptable.

Really, I think it's an enormous mistake to start modeling semantics
based on the most contrived and non-sensical edge case configurations.
Start the discussion with what is sane and what most users should
optimally experience, and keep the cornercases simple.
Johannes Weiner Oct. 24, 2017, 5:54 p.m. UTC | #5
On Tue, Oct 24, 2017 at 02:18:59PM +0200, Michal Hocko wrote:
> Does this sound like something that you would be interested in? I can spend
> some more time on it if it is worthwhile.

Before you invest too much time in this, I think the rationale for
changing the current behavior so far is very weak. The ideas that have
been floated around in this thread barely cross into nice-to-have
territory, and as a result the acceptable additional complexity to
implement them is very low as well.

Making the OOM behavior less consistent, or introducing very rare
problem behavior (e.g. merely reducing the probability of syscalls
returning -ENOMEM instead of fully eliminating it, re-adding avenues
for deadlocks, no matter how rare, etc.) is a non-starter.
Michal Hocko Oct. 24, 2017, 5:55 p.m. UTC | #6
On Tue 24-10-17 13:23:30, Johannes Weiner wrote:
> On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
[...]
> > What would prevent a runaway in case the only process in the memcg is
> > oom unkillable then?
> 
> In such a scenario, the page fault handler would busy-loop right now.
> 
> Disabling oom kills is a privileged operation with dire consequences
> if used incorrectly. You can panic the kernel with it. Why should the
> cgroup OOM killer implement protective semantics around this setting?
> Breaching the limit in such a setup is entirely acceptable.
> 
> Really, I think it's an enormous mistake to start modeling semantics
> based on the most contrived and non-sensical edge case configurations.
> Start the discussion with what is sane and what most users should
> optimally experience, and keep the cornercases simple.

I am not really seeing your concern about the semantic. The most
important property of the hard limit is to protect from runaways and
stop them if they happen. Users can use the softer variant (high limit)
if they are not afraid of those scenarios. It is not so insane to
imagine that a master task (which I can easily imagine would be oom
disabled) has a leak and runs away as a result. We are not talking only
about the page fault path. There are other allocation paths to consume a
lot of memory and spill over and break the isolation restriction. So it
makes much more sense to me to fail the allocation in such a situation
rather than allow the runaway to continue. Just consider that such a
situation shouldn't happen in the first place because there should
always be an eligible task to kill - who would own all the memory
otherwise?
Johannes Weiner Oct. 24, 2017, 6:58 p.m. UTC | #7
On Tue, Oct 24, 2017 at 07:55:58PM +0200, Michal Hocko wrote:
> On Tue 24-10-17 13:23:30, Johannes Weiner wrote:
> > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
> [...]
> > > What would prevent a runaway in case the only process in the memcg is
> > > oom unkillable then?
> > 
> > In such a scenario, the page fault handler would busy-loop right now.
> > 
> > Disabling oom kills is a privileged operation with dire consequences
> > if used incorrectly. You can panic the kernel with it. Why should the
> > cgroup OOM killer implement protective semantics around this setting?
> > Breaching the limit in such a setup is entirely acceptable.
> > 
> > Really, I think it's an enormous mistake to start modeling semantics
> > based on the most contrived and non-sensical edge case configurations.
> > Start the discussion with what is sane and what most users should
> > optimally experience, and keep the cornercases simple.
> 
> I am not really seeing your concern about the semantic. The most
> important property of the hard limit is to protect from runaways and
> stop them if they happen. Users can use the softer variant (high limit)
> if they are not afraid of those scenarios. It is not so insane to
> imagine that a master task (which I can easily imagine would be oom
> disabled) has a leak and runs away as a result.

Then you're screwed either way. Where do you return -ENOMEM in a page
fault path that cannot OOM kill anything? Your choice is between
maintaining the hard limit semantics or going into an infinite loop.

I fail to see how this setup has any impact on the semantics we pick
here. And even if it were real, it's really not what most users do.

> We are not talking only about the page fault path. There are other
> allocation paths to consume a lot of memory and spill over and break
> the isolation restriction. So it makes much more sense to me to fail
> the allocation in such a situation rather than allow the runaway to
> continue. Just consider that such a situation shouldn't happen in
> the first place because there should always be an eligible task to
> kill - who would own all the memory otherwise?

Okay, then let's just stick to the current behavior.
Michal Hocko Oct. 24, 2017, 8:15 p.m. UTC | #8
On Tue 24-10-17 14:58:54, Johannes Weiner wrote:
> On Tue, Oct 24, 2017 at 07:55:58PM +0200, Michal Hocko wrote:
> > On Tue 24-10-17 13:23:30, Johannes Weiner wrote:
> > > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
> > [...]
> > > > What would prevent a runaway in case the only process in the memcg is
> > > > oom unkillable then?
> > > 
> > > In such a scenario, the page fault handler would busy-loop right now.
> > > 
> > > Disabling oom kills is a privileged operation with dire consequences
> > > if used incorrectly. You can panic the kernel with it. Why should the
> > > cgroup OOM killer implement protective semantics around this setting?
> > > Breaching the limit in such a setup is entirely acceptable.
> > > 
> > > Really, I think it's an enormous mistake to start modeling semantics
> > > based on the most contrived and non-sensical edge case configurations.
> > > Start the discussion with what is sane and what most users should
> > > optimally experience, and keep the cornercases simple.
> > 
> > I am not really seeing your concern about the semantic. The most
> > important property of the hard limit is to protect from runaways and
> > stop them if they happen. Users can use the softer variant (high limit)
> > if they are not afraid of those scenarios. It is not so insane to
> > imagine that a master task (which I can easily imagine would be oom
> > disabled) has a leak and runs away as a result.
> 
> Then you're screwed either way. Where do you return -ENOMEM in a page
> fault path that cannot OOM kill anything? Your choice is between
> maintaining the hard limit semantics or going into an infinite loop.

In the PF path, yes. And I would argue that this is a reasonable
compromise to provide the guarantee the hard limit is giving us (and
the resulting isolation, which is the whole point). Btw. we already
have that behavior. All we are talking about is the non-PF path, which
ENOMEMs right now and which the meta-patch tried to handle more gracefully,
returning ENOMEM only when there is no other option.

> I fail to see how this setup has any impact on the semantics we pick
> here. And even if it were real, it's really not what most users do.

sure, such a scenario is really on the edge but my main point was that
the hard limit is an enforcement of an isolation guarantee (as much as
possible of course).

> > We are not talking only about the page fault path. There are other
> > allocation paths to consume a lot of memory and spill over and break
> > the isolation restriction. So it makes much more sense to me to fail
> > the allocation in such a situation rather than allow the runaway to
> > continue. Just consider that such a situation shouldn't happen in
> > the first place because there should always be an eligible task to
> > kill - who would own all the memory otherwise?
> 
> Okay, then let's just stick to the current behavior.

I am definitely not pushing that thing right now. It is good to discuss
it, though. The more kernel allocations we will track the more careful we
will have to be. So maybe we will have to reconsider the current
approach. I am not sure we need it _right now_ but I feel we will
eventually have to reconsider it.
Greg Thelen Oct. 25, 2017, 6:51 a.m. UTC | #9
Michal Hocko <mhocko@kernel.org> wrote:

> On Tue 24-10-17 14:58:54, Johannes Weiner wrote:
>> On Tue, Oct 24, 2017 at 07:55:58PM +0200, Michal Hocko wrote:
>> > On Tue 24-10-17 13:23:30, Johannes Weiner wrote:
>> > > On Tue, Oct 24, 2017 at 06:22:13PM +0200, Michal Hocko wrote:
>> > [...]
>> > > > What would prevent a runaway in case the only process in the memcg is
>> > > > oom unkillable then?
>> > > 
>> > > In such a scenario, the page fault handler would busy-loop right now.
>> > > 
>> > > Disabling oom kills is a privileged operation with dire consequences
>> > > if used incorrectly. You can panic the kernel with it. Why should the
>> > > cgroup OOM killer implement protective semantics around this setting?
>> > > Breaching the limit in such a setup is entirely acceptable.
>> > > 
>> > > Really, I think it's an enormous mistake to start modeling semantics
>> > > based on the most contrived and non-sensical edge case configurations.
>> > > Start the discussion with what is sane and what most users should
>> > > optimally experience, and keep the cornercases simple.
>> > 
>> > I am not really seeing your concern about the semantic. The most
>> > important property of the hard limit is to protect from runaways and
>> > stop them if they happen. Users can use the softer variant (high limit)
>> > if they are not afraid of those scenarios. It is not so insane to
>> > imagine that a master task (which I can easily imagine would be oom
>> > disabled) has a leak and runs away as a result.
>> 
>> Then you're screwed either way. Where do you return -ENOMEM in a page
>> fault path that cannot OOM kill anything? Your choice is between
>> maintaining the hard limit semantics or going into an infinite loop.
>
> In the PF path, yes. And I would argue that this is a reasonable
> compromise to provide the guarantee the hard limit is giving us (and
> the resulting isolation, which is the whole point). Btw. we already
> have that behavior. All we are talking about is the non-PF path, which
> ENOMEMs right now and which the meta-patch tried to handle more gracefully,
> returning ENOMEM only when there is no other option.
>
>> I fail to see how this setup has any impact on the semantics we pick
>> here. And even if it were real, it's really not what most users do.
>
> sure, such a scenario is really on the edge but my main point was that
> the hard limit is an enforcement of an isolation guarantee (as much as
> possible of course).
>
>> > We are not talking only about the page fault path. There are other
>> > allocation paths to consume a lot of memory and spill over and break
>> > the isolation restriction. So it makes much more sense to me to fail
>> > the allocation in such a situation rather than allow the runaway to
>> > continue. Just consider that such a situation shouldn't happen in
>> > the first place because there should always be an eligible task to
>> > kill - who would own all the memory otherwise?
>> 
>> Okay, then let's just stick to the current behavior.
>
> I am definitely not pushing that thing right now. It is good to discuss
> it, though. The more kernel allocations we will track the more careful we
> will have to be. So maybe we will have to reconsider the current
> approach. I am not sure we need it _right now_ but I feel we will
> eventually have to reconsider it.

The kernel already attempts to charge radix_tree_nodes.  If they fail
then we fall back to unaccounted memory.  So the memcg limit already
isn't an air tight constraint.

I agree that unchecked overcharging could be bad, but wonder if we could
overcharge kmem so long as there is a pending oom kill victim.  If
current is the victim or no victim, then fail allocations (as is
currently done).  The current thread can loop in syscall exit until
usage is reconciled (either via reclaim or kill).  This seems consistent
with pagefault oom handling and compatible with overcommit use case.

Here's an example of an overcommit case we've found quite useful.  Memcg A has
memory which is shared between children B and C.  B is more important than C.
B and C are unprivileged, neither has the authority to kill the other.

    /A(limit=100MB) - B(limit=80MB,prio=high)
                     \ C(limit=80MB,prio=low)

If memcg charge drives B.usage+C.usage>=A.limit, then C should be killed due to
its low priority.  B pagefault can kill, but if a syscall returns ENOMEM then B
can't do anything useful with it.
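
For concreteness, the hierarchy above could be set up on cgroup v1
roughly as follows (illustrative only: memory.limit_in_bytes is the v1
limit knob, the mount point is assumed, and the prio labels have no
direct memcg counterpart; they would have to be approximated per task,
e.g. with oom_score_adj as mentioned below):

#include <stdio.h>
#include <sys/stat.h>

/* Write a single value into a cgroup control file. */
static void cg_write(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (f) {
		fputs(val, f);
		fclose(f);
	}
}

int main(void)
{
	/* Assumes the v1 memory controller is mounted at /sys/fs/cgroup/memory. */
	mkdir("/sys/fs/cgroup/memory/A", 0755);
	cg_write("/sys/fs/cgroup/memory/A/memory.limit_in_bytes", "100M");

	mkdir("/sys/fs/cgroup/memory/A/B", 0755);
	cg_write("/sys/fs/cgroup/memory/A/B/memory.limit_in_bytes", "80M");

	mkdir("/sys/fs/cgroup/memory/A/C", 0755);
	cg_write("/sys/fs/cgroup/memory/A/C/memory.limit_in_bytes", "80M");

	return 0;
}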

I know there are related oom killer victim selections discussions afoot.
Even with classic oom_score_adj killing it's possible to heavily bias
oom killer to select C over B.
Michal Hocko Oct. 25, 2017, 7:15 a.m. UTC | #10
On Tue 24-10-17 23:51:30, Greg Thelen wrote:
> Michal Hocko <mhocko@kernel.org> wrote:
[...]
> > I am definitely not pushing that thing right now. It is good to discuss
> > it, though. The more kernel allocations we will track the more careful we
> > will have to be. So maybe we will have to reconsider the current
> > approach. I am not sure we need it _right now_ but I feel we will
> > eventually have to reconsider it.
> 
> The kernel already attempts to charge radix_tree_nodes.  If they fail
> then we fall back to unaccounted memory.

I am not sure which code path you have in mind. All I can see is that we
drop __GFP_ACCOUNT when preloading radix tree nodes. Anyway...

> So the memcg limit already
> isn't an air tight constraint.

... we shouldn't make it more loose though.

> I agree that unchecked overcharging could be bad, but wonder if we could
> overcharge kmem so long as there is a pending oom kill victim.

Why is this any better than simply trying to charge as long as the oom
killer makes progress?

> If
> current is the victim or no victim, then fail allocations (as is
> currently done).

we actually force the charge in that case so we will proceed.

> The current thread can loop in syscall exit until
> usage is reconciled (either via reclaim or kill).  This seems consistent
> with pagefault oom handling and compatible with overcommit use case.

But we do not really want to make the syscall exit path any more complex
or more expensive than it is. The point is that we shouldn't be afraid
of triggering the oom killer from the charge path because we do have an
async OOM killer. This is the very same as the standard allocator path. So
why should memcg be any different?

> Here's an example of an overcommit case we've found quite useful.  Memcg A has
> memory which is shared between children B and C.  B is more important than C.
> B and C are unprivileged, neither has the authority to kill the other.
> 
>     /A(limit=100MB) - B(limit=80MB,prio=high)
>                      \ C(limit=80MB,prio=low)
> 
> If memcg charge drives B.usage+C.usage>=A.limit, then C should be killed due to
> its low priority.  B pagefault can kill, but if a syscall returns ENOMEM then B
> can't do anything useful with it.

well, my proposal was to not return ENOMEM and rather loop in the charge
path and wait for the oom killer to free up some charges. Who gets
killed is really out of scope of this discussion.
Johannes Weiner Oct. 25, 2017, 1:11 p.m. UTC | #11
On Wed, Oct 25, 2017 at 09:15:22AM +0200, Michal Hocko wrote:
> On Tue 24-10-17 23:51:30, Greg Thelen wrote:
> > Michal Hocko <mhocko@kernel.org> wrote:
> [...]
> > > I am definitely not pushing that thing right now. It is good to discuss
> > > it, though. The more kernel allocations we will track the more careful we
> > > will have to be. So maybe we will have to reconsider the current
> > > approach. I am not sure we need it _right now_ but I feel we will
> > > eventually have to reconsider it.
> > 
> > The kernel already attempts to charge radix_tree_nodes.  If they fail
> > then we fall back to unaccounted memory.
> 
> I am not sure which code path you have in mind. All I can see is that we
> drop __GFP_ACCOUNT when preloading radix tree nodes. Anyway...
> 
> > So the memcg limit already
> > isn't an air tight constraint.

I fully agree with this. Socket buffers overcharge too. There are
plenty of memory allocations that aren't even tracked.

The point is, it's a hard limit in the sense that breaching it will
trigger the OOM killer. It's not a hard limit in the sense that the
kernel will deadlock to avoid crossing it.

> ... we shouldn't make it more loose though.

Then we can end this discussion right now. I pointed out right from
the start that the only way to replace -ENOMEM with OOM killing in the
syscall is to force charges. If we don't, we either deadlock or still
return -ENOMEM occasionally. Nobody has refuted that this is the case.

> > The current thread can loop in syscall exit until
> > usage is reconciled (either via reclaim or kill).  This seems consistent
> > with pagefault oom handling and compatible with overcommit use case.
> 
> But we do not really want to make the syscall exit path any more complex
> or more expensive than it is. The point is that we shouldn't be afraid
> of triggering the oom killer from the charge path because we do have an
> async OOM killer. This is the very same as the standard allocator path. So
> why should memcg be any different?

I have nothing against triggering the OOM killer from the allocation
path. I am dead-set against making the -ENOMEM return from syscalls
rare and unpredictable. They're a challenge as it is.

The only sane options are to stick with the status quo, or make sure
the task never returns before the allocation succeeds. Making things
in this path more speculative is a downgrade, not an improvement.
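
For reference, "forcing charges" refers to the existing fallback at the
tail of try_charge, which at the time looked approximately like this
(paraphrased rather than quoted, so treat the details as approximate):

force:
	/*
	 * The allocation either can't fail or will lead to more memory
	 * being freed very soon.  Allow the usage to go over the limit
	 * temporarily by force charging it.
	 */
	page_counter_charge(&memcg->memory, nr_pages);
	if (do_memsw_account())
		page_counter_charge(&memcg->memsw, nr_pages);
	css_get_many(&memcg->css, nr_pages);

	return 0;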

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d5f3a62887cf..7b370f070b82 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1528,26 +1528,40 @@  static void memcg_oom_recover(struct mem_cgroup *memcg)
 
 static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_may_oom)
-		return;
 	/*
 	 * We are in the middle of the charge context here, so we
 	 * don't want to block when potentially sitting on a callstack
 	 * that holds all kinds of filesystem and mm locks.
 	 *
-	 * Also, the caller may handle a failed allocation gracefully
-	 * (like optional page cache readahead) and so an OOM killer
-	 * invocation might not even be necessary.
+	 * cgroup v1 allows sync user space handling so we cannot afford
+	 * to get stuck here for that configuration. That's why we don't do
+	 * anything here except remember the OOM context and then deal with
+	 * it at the end of the page fault when the stack is unwound, the
+	 * locks are released, and when we know whether the fault was overall
+	 * successful.
+	 *
+	 * On the other hand, in-kernel OOM killer allows for an async victim
+	 * memory reclaim (oom_reaper) and that means that we are not solely
+	 * relying on the oom victim to make forward progress so we can stay
+	 * in the try_charge context and keep retrying as long as there
+	 * are oom victims to select.
 	 *
-	 * That's why we don't do anything here except remember the
-	 * OOM context and then deal with it at the end of the page
-	 * fault when the stack is unwound, the locks are released,
-	 * and when we know whether the fault was overall successful.
+	 * Please note that mem_cgroup_out_of_memory might fail to find a
+	 * victim and then we have to rely on mem_cgroup_oom_synchronize, otherwise
+	 * we would fall back to the global oom killer in pagefault_out_of_memory.
 	 */
+	if (!memcg->oom_kill_disable &&
+			mem_cgroup_out_of_memory(memcg, mask, order))
+		return true;
+
+	if (!current->memcg_may_oom)
+		return false;
 	css_get(&memcg->css);
 	current->memcg_in_oom = memcg;
 	current->memcg_oom_gfp_mask = mask;
 	current->memcg_oom_order = order;
+
+	return false;
 }
 
 /**
@@ -2007,8 +2021,11 @@  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	mem_cgroup_event(mem_over_limit, MEMCG_OOM);
 
-	mem_cgroup_oom(mem_over_limit, gfp_mask,
-		       get_order(nr_pages * PAGE_SIZE));
+	if (mem_cgroup_oom(mem_over_limit, gfp_mask,
+		       get_order(nr_pages * PAGE_SIZE))) {
+		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+		goto retry;
+	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;