diff mbox

[RFC] memcg, oom: move out_of_memory back to the charge path

Message ID 20180620103736.13880-1-mhocko@kernel.org (mailing list archive)
State New, archived

Commit Message

Michal Hocko June 20, 2018, 10:37 a.m. UTC
From: Michal Hocko <mhocko@suse.com>

3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")
has changed the ENOMEM semantic of memcg charges. Rather than invoking
the oom killer from the charging context it delays the oom killer to the
page fault path (pagefault_out_of_memory). This in turn means that many
users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg
hits the hard limit and the memcg is OOM. This behavior is inconsistent
with the !memcg case where the oom killer is invoked from the allocation
context and the allocator keeps retrying until it succeeds.

The difference in the behavior is user visible. mmap(MAP_POPULATE) might
result in not fully populated ranges while the mmap return code doesn't
tell that to the userspace. Random syscalls might fail with ENOMEM etc.

The primary motivation of the different memcg oom semantic was the
deadlock avoidance. Things have changed since then, though. We have
an async oom teardown by the oom reaper now and so we do not have to
rely on the victim to tear down its memory anymore. Therefore we can
return to the original semantic as long as the memcg oom killer is not
handed over to user space.

There is still one thing to be careful about here though. If the oom
killer is not able to make any forward progress - e.g. because there is
no eligible task to kill - then we have to bail out of the charge path
to prevent the same class of deadlocks. We have basically two options
here. Either we fail the charge with ENOMEM or force the charge and
allow overcharge. The first option has been considered more harmful than
useful because rare inconsistencies in the ENOMEM behavior are hard to
test for and error prone. This is basically the same reason why the page
allocator doesn't fail allocations under such conditions. The latter
might allow runaways but those should be really unlikely unless somebody
misconfigures the system, e.g. by allowing tasks to be migrated away
from the memcg to a different unlimited memcg with
move_charge_at_immigrate disabled.

Signed-off-by: Michal Hocko <mhocko@suse.com>
---

Hi,
we have discussed this at LSFMM this year and my recollection is that
we have agreed that we should do this. So I am sending the patch as an
RFC. Please note I have only forward ported the patch without any
testing yet. I would like to see a general agreement before I spend more
time on this.

Thoughts? Objections?

 mm/memcontrol.c | 67 +++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 56 insertions(+), 11 deletions(-)

Comments

Johannes Weiner June 20, 2018, 3:18 p.m. UTC | #1
On Wed, Jun 20, 2018 at 12:37:36PM +0200, Michal Hocko wrote:
> From: Michal Hocko <mhocko@suse.com>
> 
> 3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")
> has changed the ENOMEM semantic of memcg charges. Rather than invoking
> the oom killer from the charging context it delays the oom killer to the
> page fault path (pagefault_out_of_memory). This in turn means that many
> users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg
> hits the hard limit and the memcg is OOM. This behavior is inconsistent
> with the !memcg case where the oom killer is invoked from the allocation
> context and the allocator keeps retrying until it succeeds.
> 
> The difference in the behavior is user visible. mmap(MAP_POPULATE) might
> result in not fully populated ranges while the mmap return code doesn't
> tell that to the userspace. Random syscalls might fail with ENOMEM etc.
> 
> The primary motivation of the different memcg oom semantic was the
> deadlock avoidance. Things have changed since then, though. We have
> an async oom teardown by the oom reaper now and so we do not have to
> rely on the victim to tear down its memory anymore. Therefore we can
> return to the original semantic as long as the memcg oom killer is not
> handed over to user space.
> 
> There is still one thing to be careful about here though. If the oom
> killer is not able to make any forward progress - e.g. because there is
> no eligible task to kill - then we have to bail out of the charge path
> to prevent the same class of deadlocks. We have basically two options
> here. Either we fail the charge with ENOMEM or force the charge and
> allow overcharge. The first option has been considered more harmful than
> useful because rare inconsistencies in the ENOMEM behavior are hard to
> test for and error prone. This is basically the same reason why the page
> allocator doesn't fail allocations under such conditions. The latter
> might allow runaways but those should be really unlikely unless somebody
> misconfigures the system, e.g. by allowing tasks to be migrated away
> from the memcg to a different unlimited memcg with
> move_charge_at_immigrate disabled.

This is more straight-forward than I thought it would be. I have no
objections to this going forward, just a couple of minor notes.

> @@ -1483,28 +1483,54 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
>  		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
>  }
>  
> -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> +enum oom_status {
> +	OOM_SUCCESS,
> +	OOM_FAILED,
> +	OOM_ASYNC,
> +	OOM_SKIPPED

Either SUCCESS & FAILURE, or SUCCEEDED & FAILED ;)

We're not distinguishing ASYNC and SKIPPED anywhere below, but I
cannot think of a good name to communicate them both without this
function making assumptions about the charge function's behavior.
So it's a bit weird, but probably the best way to go.

> +static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
>  {
> -	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> -		return;
> +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> +		return OOM_SKIPPED;
>  	/*
>  	 * We are in the middle of the charge context here, so we
>  	 * don't want to block when potentially sitting on a callstack
>  	 * that holds all kinds of filesystem and mm locks.
>  	 *
> -	 * Also, the caller may handle a failed allocation gracefully
> -	 * (like optional page cache readahead) and so an OOM killer
> -	 * invocation might not even be necessary.
> +	 * cgroup1 allows disabling the OOM killer and waiting for outside
> +	 * handling until the charge can succeed; remember the context and put
> +	 * the task to sleep at the end of the page fault when all locks are
> +	 * released.
> +	 *
> +	 * On the other hand, in-kernel OOM killer allows for an async victim
> +	 * memory reclaim (oom_reaper) and that means that we are not solely
> +	 * relying on the oom victim to make a forward progress and we can
> +	 * invoke the oom killer here.
>  	 *
> -	 * That's why we don't do anything here except remember the
> -	 * OOM context and then deal with it at the end of the page
> -	 * fault when the stack is unwound, the locks are released,
> -	 * and when we know whether the fault was overall successful.
> +	 * Please note that mem_cgroup_oom_synchronize might fail to find a
> +	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> +	 * we would fall back to the global oom killer in pagefault_out_of_memory
>  	 */
> +	if (!memcg->oom_kill_disable) {
> +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> +			return OOM_SUCCESS;
> +
> +		WARN(!current->memcg_may_oom,
> +				"Memory cgroup charge failed because of no reclaimable memory! "
> +				"This looks like a misconfiguration or a kernel bug.");
> +		return OOM_FAILED;
> +	}
> +
> +	if (!current->memcg_may_oom)
> +		return OOM_SKIPPED;

memcg_may_oom was introduced to distinguish between userspace faults
that can OOM and contexts that return -ENOMEM. Now we're using it
slightly differently and it's confusing.

1) Why warn for kernel allocations, but not userspace ones? This
should have a comment at least.

2) We invoke the OOM killer when !memcg_may_oom. We want to OOM kill
in either case, but only set up the mem_cgroup_oom_synchronize() for
userspace faults. So the code makes sense, but a better name would be
in order -- current->in_user_fault?

>  	css_get(&memcg->css);
>  	current->memcg_in_oom = memcg;
>  	current->memcg_oom_gfp_mask = mask;
>  	current->memcg_oom_order = order;
> +
> +	return OOM_ASYNC;

In terms of code flow, it would be much clearer to handle the
memcg->oom_kill_disable case first, as a special case with early
return, and make the OOM invocation the main code of this function,
given its name.

> @@ -1994,8 +2024,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
>  
>  	memcg_memory_event(mem_over_limit, MEMCG_OOM);
>  
> -	mem_cgroup_oom(mem_over_limit, gfp_mask,
> +	/*
> +	 * keep retrying as long as the memcg oom killer is able to make
> +	 * a forward progress or bypass the charge if the oom killer
> +	 * couldn't make any progress.
> +	 */
> +	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
>  		       get_order(nr_pages * PAGE_SIZE));
> +	switch (oom_status) {
> +	case OOM_SUCCESS:
> +		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> +		oomed = true;
> +		goto retry;
> +	case OOM_FAILED:
> +		goto force;
> +	default:
> +		goto nomem;
> +	}
>  nomem:
>  	if (!(gfp_mask & __GFP_NOFAIL))
>  		return -ENOMEM;
> -- 
> 2.17.1
>
Michal Hocko June 20, 2018, 3:31 p.m. UTC | #2
On Wed 20-06-18 11:18:12, Johannes Weiner wrote:
> On Wed, Jun 20, 2018 at 12:37:36PM +0200, Michal Hocko wrote:
[...]
> > -static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > +enum oom_status {
> > +	OOM_SUCCESS,
> > +	OOM_FAILED,
> > +	OOM_ASYNC,
> > +	OOM_SKIPPED
> 
> Either SUCCESS & FAILURE, or SUCCEEDED & FAILED ;)

Sure, I will go with the latter.

> We're not distinguishing ASYNC and SKIPPED anywhere below, but I
> cannot think of a good name to communicate them both without this
> function making assumptions about the charge function's behavior.
> So it's a bit weird, but probably the best way to go.

Yeah, that was what I was fighting with. My original proposal, which
simply returned ENOMEM in the failure case, was a simple bool, but once
we have those different states and failure behaviors I think it is
better to communicate that and let the caller do whatever it finds
reasonable.

> 
> > +static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> >  {
> > -	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
> > -		return;
> > +	if (order > PAGE_ALLOC_COSTLY_ORDER)
> > +		return OOM_SKIPPED;
> >  	/*
> >  	 * We are in the middle of the charge context here, so we
> >  	 * don't want to block when potentially sitting on a callstack
> >  	 * that holds all kinds of filesystem and mm locks.
> >  	 *
> > -	 * Also, the caller may handle a failed allocation gracefully
> > -	 * (like optional page cache readahead) and so an OOM killer
> > -	 * invocation might not even be necessary.
> > +	 * cgroup1 allows disabling the OOM killer and waiting for outside
> > +	 * handling until the charge can succeed; remember the context and put
> > +	 * the task to sleep at the end of the page fault when all locks are
> > +	 * released.
> > +	 *
> > +	 * On the other hand, in-kernel OOM killer allows for an async victim
> > +	 * memory reclaim (oom_reaper) and that means that we are not solely
> > +	 * relying on the oom victim to make a forward progress and we can
> > +	 * invoke the oom killer here.
> >  	 *
> > -	 * That's why we don't do anything here except remember the
> > -	 * OOM context and then deal with it at the end of the page
> > -	 * fault when the stack is unwound, the locks are released,
> > -	 * and when we know whether the fault was overall successful.
> > +	 * Please note that mem_cgroup_oom_synchronize might fail to find a
> > +	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> > +	 * we would fall back to the global oom killer in pagefault_out_of_memory
> >  	 */
> > +	if (!memcg->oom_kill_disable) {
> > +		if (mem_cgroup_out_of_memory(memcg, mask, order))
> > +			return OOM_SUCCESS;
> > +
> > +		WARN(!current->memcg_may_oom,
> > +				"Memory cgroup charge failed because of no reclaimable memory! "
> > +				"This looks like a misconfiguration or a kernel bug.");
> > +		return OOM_FAILED;
> > +	}
> > +
> > +	if (!current->memcg_may_oom)
> > +		return OOM_SKIPPED;
> 
> memcg_may_oom was introduced to distinguish between userspace faults
> that can OOM and contexts that return -ENOMEM. Now we're using it
> slightly differently and it's confusing.
> 
> 1) Why warn for kernel allocations, but not userspace ones? This
> should have a comment at least.

I am not sure I understand. We do warn for all allocation types if
mem_cgroup_out_of_memory fails, as long as we are not in the legacy
oom_disabled case.

> 2) We invoke the OOM killer when !memcg_may_oom. We want to OOM kill
> in either case, but only set up the mem_cgroup_oom_synchronize() for
> userspace faults. So the code makes sense, but a better name would be
> in order -- current->in_user_fault?

in_user_fault is definitely better than memcg_may_oom.

> >  	css_get(&memcg->css);
> >  	current->memcg_in_oom = memcg;
> >  	current->memcg_oom_gfp_mask = mask;
> >  	current->memcg_oom_order = order;
> > +
> > +	return OOM_ASYNC;
> 
> In terms of code flow, it would be much clearer to handle the
> memcg->oom_kill_disable case first, as a special case with early
> return, and make the OOM invocation the main code of this function,
> given its name.

This?
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return OOM_SKIPPED;

	/*
	 * We are in the middle of the charge context here, so we
	 * don't want to block when potentially sitting on a callstack
	 * that holds all kinds of filesystem and mm locks.
	 *
	 * cgroup1 allows disabling the OOM killer and waiting for outside
	 * handling until the charge can succeed; remember the context and put
	 * the task to sleep at the end of the page fault when all locks are
	 * released.
	 *
	 * On the other hand, in-kernel OOM killer allows for an async victim
	 * memory reclaim (oom_reaper) and that means that we are not solely
	 * relying on the oom victim to make a forward progress and we can
	 * invoke the oom killer here.
	 *
	 * Please note that mem_cgroup_oom_synchronize might fail to find a
	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
	 * we would fall back to the global oom killer in pagefault_out_of_memory
	 */
	if (memcg->oom_kill_disable) {
		if (!current->memcg_may_oom)
			return OOM_SKIPPED;
		css_get(&memcg->css);
		current->memcg_in_oom = memcg;
		current->memcg_oom_gfp_mask = mask;
		current->memcg_oom_order = order;

		return OOM_ASYNC;
	}

	if (mem_cgroup_out_of_memory(memcg, mask, order))
		return OOM_SUCCESS;

	WARN(!current->memcg_may_oom,
			"Memory cgroup charge failed because of no reclaimable memory! "
			"This looks like a misconfiguration or a kernel bug.");
	return OOM_FAILED;

Thanks!
Michal Hocko June 20, 2018, 3:34 p.m. UTC | #3
On Wed 20-06-18 17:31:48, Michal Hocko wrote:
> On Wed 20-06-18 11:18:12, Johannes Weiner wrote:
[...]
> > 1) Why warn for kernel allocations, but not userspace ones? This
> > should have a comment at least.
> 
> I am not sure I understand. We do warn for all allocation types if
> mem_cgroup_out_of_memory fails, as long as we are not in the legacy
> oom_disabled case.

OK, I can see it now. It wasn't in the quoted context and I just forgot
about that WARN(!current->memcg_may_oom, ...). Well, I do not remember
why I made it conditional, and you are right, it doesn't make any sense.
Probably a different code flow back then.

Updated to warn regardless of memcg_may_oom.
Johannes Weiner June 20, 2018, 7:35 p.m. UTC | #4
On Wed, Jun 20, 2018 at 05:31:48PM +0200, Michal Hocko wrote:
> This?
> 	if (order > PAGE_ALLOC_COSTLY_ORDER)
> 		return OOM_SKIPPED;
> 
> 	/*
> 	 * We are in the middle of the charge context here, so we
> 	 * don't want to block when potentially sitting on a callstack
> 	 * that holds all kinds of filesystem and mm locks.
> 	 *
> 	 * cgroup1 allows disabling the OOM killer and waiting for outside
> 	 * handling until the charge can succeed; remember the context and put
> 	 * the task to sleep at the end of the page fault when all locks are
> 	 * released.
> 	 *
> 	 * On the other hand, in-kernel OOM killer allows for an async victim
> 	 * memory reclaim (oom_reaper) and that means that we are not solely
> 	 * relying on the oom victim to make a forward progress and we can
> 	 * invoke the oom killer here.
> 	 *
> 	 * Please note that mem_cgroup_oom_synchronize might fail to find a
> 	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> 	 * we would fall back to the global oom killer in pagefault_out_of_memory
> 	 */
> 	if (memcg->oom_kill_disable) {
> 		if (!current->memcg_may_oom)
> 			return OOM_SKIPPED;
> 		css_get(&memcg->css);
> 		current->memcg_in_oom = memcg;
> 		current->memcg_oom_gfp_mask = mask;
> 		current->memcg_oom_order = order;
> 
> 		return OOM_ASYNC;
> 	}
> 
> 	if (mem_cgroup_out_of_memory(memcg, mask, order))
> 		return OOM_SUCCESS;
> 
> 	WARN(!current->memcg_may_oom,
> 			"Memory cgroup charge failed because of no reclaimable memory! "
> 			"This looks like a misconfiguration or a kernel bug.");
> 	return OOM_FAILED;

Yep, this looks good IMO.
Johannes Weiner June 20, 2018, 7:38 p.m. UTC | #5
On Wed, Jun 20, 2018 at 05:31:48PM +0200, Michal Hocko wrote:
> 	 * Please note that mem_cgroup_oom_synchronize might fail to find a
> 	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> 	 * we would fall back to the global oom killer in pagefault_out_of_memory

I can't quite figure out what this paragraph is trying to
say. "oom_synchronize might fail [...] and we have to rely on
oom_synchronize". Hm?
Michal Hocko June 21, 2018, 7:36 a.m. UTC | #6
On Wed 20-06-18 15:38:36, Johannes Weiner wrote:
> On Wed, Jun 20, 2018 at 05:31:48PM +0200, Michal Hocko wrote:
> > 	 * Please note that mem_cgroup_oom_synchronize might fail to find a
> > 	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
> > 	 * we would fall back to the global oom killer in pagefault_out_of_memory
> 
> I can't quite figure out what this paragraph is trying to
> say. "oom_synchronize might fail [...] and we have to rely on
> oom_synchronize". Hm?

heh, vim autocompletion + a stale comment from the previous
implementation which returned ENOMEM on the fail path. I went with

	 * Please note that mem_cgroup_out_of_memory might fail to find a
	 * victim and then we have to bail out from the charge path.

Patch

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e6f0d5ef320a..7fe3ce1fd625 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1483,28 +1483,54 @@  static void memcg_oom_recover(struct mem_cgroup *memcg)
 		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
 
-static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
+enum oom_status {
+	OOM_SUCCESS,
+	OOM_FAILED,
+	OOM_ASYNC,
+	OOM_SKIPPED
+};
+
+static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
-		return;
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		return OOM_SKIPPED;
 	/*
 	 * We are in the middle of the charge context here, so we
 	 * don't want to block when potentially sitting on a callstack
 	 * that holds all kinds of filesystem and mm locks.
 	 *
-	 * Also, the caller may handle a failed allocation gracefully
-	 * (like optional page cache readahead) and so an OOM killer
-	 * invocation might not even be necessary.
+	 * cgroup1 allows disabling the OOM killer and waiting for outside
+	 * handling until the charge can succeed; remember the context and put
+	 * the task to sleep at the end of the page fault when all locks are
+	 * released.
+	 *
+	 * On the other hand, in-kernel OOM killer allows for an async victim
+	 * memory reclaim (oom_reaper) and that means that we are not solely
+	 * relying on the oom victim to make a forward progress and we can
+	 * invoke the oom killer here.
 	 *
-	 * That's why we don't do anything here except remember the
-	 * OOM context and then deal with it at the end of the page
-	 * fault when the stack is unwound, the locks are released,
-	 * and when we know whether the fault was overall successful.
+	 * Please note that mem_cgroup_oom_synchronize might fail to find a
+	 * victim and then we have rely on mem_cgroup_oom_synchronize otherwise
+	 * we would fall back to the global oom killer in pagefault_out_of_memory
 	 */
+	if (!memcg->oom_kill_disable) {
+		if (mem_cgroup_out_of_memory(memcg, mask, order))
+			return OOM_SUCCESS;
+
+		WARN(!current->memcg_may_oom,
+				"Memory cgroup charge failed because of no reclaimable memory! "
+				"This looks like a misconfiguration or a kernel bug.");
+		return OOM_FAILED;
+	}
+
+	if (!current->memcg_may_oom)
+		return OOM_SKIPPED;
 	css_get(&memcg->css);
 	current->memcg_in_oom = memcg;
 	current->memcg_oom_gfp_mask = mask;
 	current->memcg_oom_order = order;
+
+	return OOM_ASYNC;
 }
 
 /**
@@ -1899,6 +1925,7 @@  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
+	bool oomed = false;
 
 	if (mem_cgroup_is_root(memcg))
 		return 0;
@@ -1986,6 +2013,9 @@  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (nr_retries--)
 		goto retry;
 
+	if (gfp_mask & __GFP_RETRY_MAYFAIL && oomed)
+		goto nomem;
+
 	if (gfp_mask & __GFP_NOFAIL)
 		goto force;
 
@@ -1994,8 +2024,23 @@  static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	memcg_memory_event(mem_over_limit, MEMCG_OOM);
 
-	mem_cgroup_oom(mem_over_limit, gfp_mask,
+	/*
+	 * keep retrying as long as the memcg oom killer is able to make
+	 * a forward progress or bypass the charge if the oom killer
+	 * couldn't make any progress.
+	 */
+	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
+	switch (oom_status) {
+	case OOM_SUCCESS:
+		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+		oomed = true;
+		goto retry;
+	case OOM_FAILED:
+		goto force;
+	default:
+		goto nomem;
+	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;