From patchwork Thu Jun 21 08:09:27 2018
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Michal Hocko <mhocko@kernel.org>
X-Patchwork-Id: 10479397
Return-Path: <owner-linux-mm@kvack.org>
Received: from mail.wl.linuxfoundation.org (pdx-wl-mail.web.codeaurora.org
	[172.30.200.125])
	by pdx-korg-patchwork.web.codeaurora.org (Postfix) with ESMTP id
	1CA7D604D3 for <patchwork-linux-mm@patchwork.kernel.org>;
	Thu, 21 Jun 2018 08:09:34 +0000 (UTC)
Received: from mail.wl.linuxfoundation.org (localhost [127.0.0.1])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 0AAC629074
	for <patchwork-linux-mm@patchwork.kernel.org>;
	Thu, 21 Jun 2018 08:09:34 +0000 (UTC)
Received: by mail.wl.linuxfoundation.org (Postfix, from userid 486)
	id F26712907D; Thu, 21 Jun 2018 08:09:33 +0000 (UTC)
X-Spam-Checker-Version: SpamAssassin 3.3.1 (2010-03-16) on
	pdx-wl-mail.web.codeaurora.org
X-Spam-Level: 
X-Spam-Status: No, score=-2.9 required=2.0 tests=BAYES_00, MAILING_LIST_MULTI,
	RCVD_IN_DNSWL_NONE autolearn=ham version=3.3.1
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by mail.wl.linuxfoundation.org (Postfix) with ESMTP id 3B56E29074
	for <patchwork-linux-mm@patchwork.kernel.org>;
	Thu, 21 Jun 2018 08:09:32 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id CAEA36B0003; Thu, 21 Jun 2018 04:09:31 -0400 (EDT)
Delivered-To: linux-mm-outgoing@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 40)
	id C5F2C6B0006; Thu, 21 Jun 2018 04:09:31 -0400 (EDT)
X-Original-To: int-list-linux-mm@kvack.org
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id B4E6F6B0007; Thu, 21 Jun 2018 04:09:31 -0400 (EDT)
X-Original-To: linux-mm@kvack.org
X-Delivered-To: linux-mm@kvack.org
Received: from mail-wm0-f71.google.com (mail-wm0-f71.google.com
	[74.125.82.71])
	by kanga.kvack.org (Postfix) with ESMTP id 5ABEA6B0003
	for <linux-mm@kvack.org>; Thu, 21 Jun 2018 04:09:31 -0400 (EDT)
Received: by mail-wm0-f71.google.com with SMTP id t7-v6so1367855wmg.3
	for <linux-mm@kvack.org>; Thu, 21 Jun 2018 01:09:31 -0700 (PDT)
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
	d=1e100.net; s=20161025;
	h=x-original-authentication-results:x-gm-message-state:date:from:to
	:cc:subject:message-id:references:mime-version:content-disposition
	:in-reply-to:user-agent;
	bh=l6bV9KiMEyUgPbmsuogYQ+iZYFbi3rR6MuvZctsJgds=;
	b=nAO0AmuwRzw5lKhXFQ89XozhvlqMJNG1KQ0PTn6j8uny+wl4kJ4FwBloD+rVg6IgLH
	sBL4gU14RW+nvxFbuOzL5CNF2Z0qUlnBBmwQRekPQ2AyH8aqpdH5yNsvva96LTDaebRI
	Za5GMzaWfLwKl6TOmZfENNK2a63KdiQkHoGxGXeeaPNUAikrrPwrnragN3Qd2c2Xns3Q
	vN4Z7rpkuI05h/8exwG+LC+v+DC3+BgfZ8WUpgNe9v3M9iqcv16PMti6PWYgHaJL18V6
	Heftr1M3FbZSWsIO+LETwT3LMt6rEhZ5FrVNCjdQrkGkOiqVIF/MxuINRjJde9FYQXU7
	kYxw==
X-Original-Authentication-Results: mx.google.com;
	spf=softfail (google.com: domain of
	transitioning mhocko@kernel.org does not designate
	195.135.220.15 as permitted sender)
	smtp.mailfrom=mhocko@kernel.org;
	dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org
X-Gm-Message-State: APt69E1DxUNHKIwzXWOw5fr6vtJO/KNFWrEeeJhTo69pQBQ+1ORA+aXs
	UR1w/8SG6KHnJN0+eV763yRcosP6EPW9LMDRvlgLyeOYBTgRZtwcC3tICfb1ysVHJP8YNWFIMOb
	DIvgZ4HUiyQgJlLfLL0m2xv2CtPK/KRWY59/FSfB4FUW7hPholgOk/wIzcK1sqnQ=
X-Received: by 2002:a50:8367:: with SMTP id
	94-v6mr21295418edh.5.1529568570886;
	Thu, 21 Jun 2018 01:09:30 -0700 (PDT)
X-Google-Smtp-Source: 
 ADUXVKJpIGqRJJIf6ZM92DsxrAm6CBBQNIivlubSLv1YNELQ6Yu9M8ffKZxjZ29QWAsXOo7/ncEr
X-Received: by 2002:a50:8367:: with SMTP id
	94-v6mr21295360edh.5.1529568569753;
	Thu, 21 Jun 2018 01:09:29 -0700 (PDT)
ARC-Seal: i=1; a=rsa-sha256; t=1529568569; cv=none;
	d=google.com; s=arc-20160816;
	b=QP56E8IaAJZsYo47r+TgIjt/ECShqh8nyXaxgTCcURKO5iYfsylT326gyYV7av6CNo
	ZW5e95Npx9UBl1y3zqUH4xLmoiC4xcpi3XIDFocxdnSY153ygGklSyOp2PeOfkv2ujKC
	JAZh2AIT1KEUNhKvB524BD5GiPOIZLVE/7PZ0FqoyeIDK9YHg0QueyryTYxxgU/VCIib
	X2o8RoT4CoW2YtYOcANIZCTJEJUqEVjo0sZA8BfBXOkv2bud4tllfudzTmmQIC5XaU+f
	S6/jLgHsye17eRoIUKvMYTyyfN0Jb/OfkkgM/cfyw424rOW+U6tkd5cV9f5Y0cACnqHj
	ZABg==
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com;
	s=arc-20160816;
	h=user-agent:in-reply-to:content-disposition:mime-version:references
	:message-id:subject:cc:to:from:date:arc-authentication-results;
	bh=l6bV9KiMEyUgPbmsuogYQ+iZYFbi3rR6MuvZctsJgds=;
	b=nCHdYHa8rJNIkKfzhM//DY/347ku7ffl4eTQtEgR7kVqCUx28G7RbNgbOzYZ2w/vmT
	0r4mFS2Jkr2vTEeRjc9BOHG3ME8GKFYG+BA9R1ngnicUJnj0GyuR+XssECQCqDtC81ru
	7vT8IwcWqv3s3Ehua0LmdzmHkqU71s715Ms2JFeCCpyRytMzloBzXyJ/2S9otDfW8/OX
	TLASiAK3xe0rtF3S/pIslvuD0gAV0wDG4hwogHT9Wk5vWlpezIL75q3YsaJls4qYaO0k
	5xMFCbcFxWT7GL2avnuR/pIwWQJvveYja1rgfFa30XI5PHjNit2g8K2FxLW8l38vr6zx
	EBWQ==
ARC-Authentication-Results: i=1; mx.google.com;
	spf=softfail (google.com: domain of transitioning mhocko@kernel.org
	does not designate 195.135.220.15 as permitted sender)
	smtp.mailfrom=mhocko@kernel.org;
	dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org
Received: from mx2.suse.de (mx2.suse.de. [195.135.220.15])
	by mx.google.com with ESMTPS id
	z12-v6si2599738edi.394.2018.06.21.01.09.29 for <linux-mm@kvack.org>
	(version=TLS1 cipher=AES128-SHA bits=128/128);
	Thu, 21 Jun 2018 01:09:29 -0700 (PDT)
Received-SPF: softfail (google.com: domain of transitioning
	mhocko@kernel.org does not designate 195.135.220.15 as
	permitted sender) client-ip=195.135.220.15;
Authentication-Results: mx.google.com;
	spf=softfail (google.com: domain of transitioning mhocko@kernel.org
	does not designate 195.135.220.15 as permitted sender)
	smtp.mailfrom=mhocko@kernel.org;
	dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=kernel.org
X-Virus-Scanned: by amavisd-new at test-mx.suse.de
Received: from relay2.suse.de (charybdis-ext-too.suse.de [195.135.220.254])
	by mx2.suse.de (Postfix) with ESMTP id D627EAEFF;
	Thu, 21 Jun 2018 08:09:28 +0000 (UTC)
Date: Thu, 21 Jun 2018 10:09:27 +0200
From: Michal Hocko <mhocko@kernel.org>
To: linux-mm@kvack.org
Cc: Johannes Weiner <hannes@cmpxchg.org>, Greg Thelen <gthelen@google.com>,
	Shakeel Butt <shakeelb@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH] memcg, oom: move out_of_memory back to the charge
	path
Message-ID: <20180621080927.GE10465@dhcp22.suse.cz>
References: <20180620103736.13880-1-mhocko@kernel.org>
MIME-Version: 1.0
Content-Disposition: inline
In-Reply-To: <20180620103736.13880-1-mhocko@kernel.org>
User-Agent: Mutt/1.9.5 (2018-04-13)
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
X-Virus-Scanned: ClamAV using ClamSMTP

This is an updated version with feedback from Johannes integrated. Still
not runtime tested but I am posting it to make further review easier.

From ed2796dc3894f93ddf0fc9ec74b83c58abc2b4ff Mon Sep 17 00:00:00 2001
From: Michal Hocko <mhocko@suse.com>
Date: Wed, 20 Jun 2018 10:25:10 +0200
Subject: [PATCH] memcg, oom: move out_of_memory back to the charge path

3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on OOM")
has changed the ENOMEM semantic of memcg charges. Rather than invoking
the oom killer from the charging context it delays the oom killer to the
page fault path (pagefault_out_of_memory). This in turn means that many
users (e.g. slab or g-u-p) will get ENOMEM when the corresponding memcg
hits the hard limit and the memcg is OOM. This behavior is inconsistent
with the !memcg case where the oom killer is invoked from the
allocation context and the allocator keeps retrying until it succeeds.

The difference in the behavior is user visible. mmap(MAP_POPULATE) might
result in not fully populated ranges while the mmap return code doesn't
tell userspace about that. Random syscalls might fail with ENOMEM etc.
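
One way to observe this from userspace is to count resident pages with
mincore(2) right after the mmap call. The following is a minimal sketch
and not part of this patch; resident_pages() and map_populated() are
hypothetical helper names chosen for illustration.

```c
#define _DEFAULT_SOURCE
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Count how many pages of [addr, addr + len) are resident according to
 * mincore(2). addr must be page aligned. Returns -1 on error.
 */
static long resident_pages(void *addr, size_t len)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	size_t nr = (len + page - 1) / page;
	unsigned char *vec = malloc(nr);
	long resident = 0;

	if (!vec)
		return -1;
	if (mincore(addr, len, vec) != 0) {
		free(vec);
		return -1;
	}
	/* Bit 0 of each vector entry is set when the page is resident. */
	for (size_t i = 0; i < nr; i++)
		resident += vec[i] & 1;
	free(vec);
	return resident;
}

/*
 * Map len bytes of anonymous memory with MAP_POPULATE and report in
 * *resident how many pages ended up resident. Returns the mapping or
 * MAP_FAILED; *resident is only written when the mapping succeeded.
 */
static void *map_populated(size_t len, long *resident)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);

	if (p != MAP_FAILED)
		*resident = resident_pages(p, len);
	return p;
}
```

A caller would compare the reported count against len divided by the
page size; a smaller count after a successful mmap would be the kind of
partially populated range described above.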

The primary motivation for the different memcg oom semantic was
deadlock avoidance. Things have changed since then, though. We have
an async oom teardown by the oom reaper now and so we do not have to
rely on the victim to tear down its memory anymore. Therefore we can
return to the original semantic as long as the memcg oom killer is not
handed over to user space.

There is still one thing to be careful about here though. If the oom
killer is not able to make any forward progress - e.g. because there is
no eligible task to kill - then we have to bail out of the charge path
to prevent the same class of deadlocks. We have basically two options
here. Either we fail the charge with ENOMEM or force the charge and
allow overcharge. The first option has been considered more harmful than
useful because rare inconsistencies in the ENOMEM behavior are hard to
test for and error prone. This is basically the same reason why the page
allocator doesn't fail allocations under such conditions. The latter
might allow runaways but those should be really unlikely unless somebody
misconfigures the system, e.g. by allowing tasks to migrate away from
the memcg to a different unlimited memcg with move_charge_at_immigrate
disabled.
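
The decision described above can be modelled in plain userspace C. This
is only a sketch of the logic in the diff below, not kernel code: struct
oom_ctx, enum charge_action and the model_*() functions are hypothetical
names, victim_found stands in for the return value of
mem_cgroup_out_of_memory(), and the __GFP_NOFAIL and
__GFP_RETRY_MAYFAIL checks that try_charge performs earlier are left
out.

```c
#include <stdbool.h>

/* Mirrors the enum added by the patch. */
enum oom_status { OOM_SUCCESS, OOM_FAILED, OOM_ASYNC, OOM_SKIPPED };

/* Hypothetical names for what the charge path does next. */
enum charge_action { CHARGE_RETRY, CHARGE_FORCE, CHARGE_NOMEM };

/* Hypothetical bundle of the inputs the patched mem_cgroup_oom() checks. */
struct oom_ctx {
	int order;
	bool oom_kill_disable;	/* cgroup1 knob: OOM handled by userspace */
	bool in_user_fault;	/* charge happens on behalf of a page fault */
	bool victim_found;	/* assumed result of the in-kernel OOM killer */
};

#define COSTLY_ORDER 3	/* value of PAGE_ALLOC_COSTLY_ORDER */

static enum oom_status model_mem_cgroup_oom(const struct oom_ctx *c)
{
	if (c->order > COSTLY_ORDER)
		return OOM_SKIPPED;

	/* Exceptional case: defer to the end of the page fault, if any. */
	if (c->oom_kill_disable)
		return c->in_user_fault ? OOM_ASYNC : OOM_SKIPPED;

	/* Normal case: invoke the OOM killer from the charge path. */
	return c->victim_found ? OOM_SUCCESS : OOM_FAILED;
}

static enum charge_action model_charge_action(enum oom_status status)
{
	switch (status) {
	case OOM_SUCCESS:
		return CHARGE_RETRY;	/* progress was made, try again */
	case OOM_FAILED:
		return CHARGE_FORCE;	/* no victim: allow the overcharge */
	default:
		return CHARGE_NOMEM;	/* OOM_ASYNC and OOM_SKIPPED */
	}
}
```

Only the no-victim case overcharges; every other failure still ends in
ENOMEM, which keeps the forced charge limited to the situation where
retrying could not help.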

Changes since rfc v1
- s@memcg_may_oom@in_user_fault@ suggested by Johannes. The purpose of
  the flag is much clearer now
- make oom_kill_disable an exceptional case because it should be rare
  and make the normal oom handling the core of the function - per Johannes

Signed-off-by: Michal Hocko <mhocko@suse.com>
---
 include/linux/memcontrol.h |  8 ++--
 include/linux/sched.h      |  2 +-
 mm/memcontrol.c            | 75 ++++++++++++++++++++++++++++++--------
 3 files changed, 65 insertions(+), 20 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c6fb116e925..8753bc313ef6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -496,14 +496,14 @@ void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 
 static inline void mem_cgroup_oom_enable(void)
 {
-	WARN_ON(current->memcg_may_oom);
-	current->memcg_may_oom = 1;
+	WARN_ON(current->in_user_fault);
+	current->in_user_fault = 1;
 }
 
 static inline void mem_cgroup_oom_disable(void)
 {
-	WARN_ON(!current->memcg_may_oom);
-	current->memcg_may_oom = 0;
+	WARN_ON(!current->in_user_fault);
+	current->in_user_fault = 0;
 }
 
 static inline bool task_in_memcg_oom(struct task_struct *p)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..34cc95b751cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -722,7 +722,7 @@ struct task_struct {
 	unsigned			restore_sigmask:1;
 #endif
 #ifdef CONFIG_MEMCG
-	unsigned			memcg_may_oom:1;
+	unsigned			in_user_fault:1;
 #ifndef CONFIG_SLOB
 	unsigned			memcg_kmem_skip_account:1;
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e6f0d5ef320a..cff6c75137c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1483,28 +1483,53 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 		__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
 
-static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
+enum oom_status {
+	OOM_SUCCESS,
+	OOM_FAILED,
+	OOM_ASYNC,
+	OOM_SKIPPED
+};
+
+static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
-		return;
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		return OOM_SKIPPED;
+
 	/*
 	 * We are in the middle of the charge context here, so we
 	 * don't want to block when potentially sitting on a callstack
 	 * that holds all kinds of filesystem and mm locks.
 	 *
-	 * Also, the caller may handle a failed allocation gracefully
-	 * (like optional page cache readahead) and so an OOM killer
-	 * invocation might not even be necessary.
+	 * cgroup1 allows disabling the OOM killer and waiting for outside
+	 * handling until the charge can succeed; remember the context and put
+	 * the task to sleep at the end of the page fault when all locks are
+	 * released.
+	 *
+	 * On the other hand, the in-kernel OOM killer allows for an async
+	 * victim memory reclaim (oom_reaper) and that means that we are not
+	 * solely relying on the oom victim to make forward progress and we
+	 * can invoke the oom killer here.
 	 *
-	 * That's why we don't do anything here except remember the
-	 * OOM context and then deal with it at the end of the page
-	 * fault when the stack is unwound, the locks are released,
-	 * and when we know whether the fault was overall successful.
+	 * Please note that mem_cgroup_out_of_memory might fail to find a
+	 * victim and then we have to bail out from the charge path.
 	 */
-	css_get(&memcg->css);
-	current->memcg_in_oom = memcg;
-	current->memcg_oom_gfp_mask = mask;
-	current->memcg_oom_order = order;
+	if (memcg->oom_kill_disable) {
+		if (!current->in_user_fault)
+			return OOM_SKIPPED;
+		css_get(&memcg->css);
+		current->memcg_in_oom = memcg;
+		current->memcg_oom_gfp_mask = mask;
+		current->memcg_oom_order = order;
+
+		return OOM_ASYNC;
+	}
+
+	if (mem_cgroup_out_of_memory(memcg, mask, order))
+		return OOM_SUCCESS;
+
+	WARN(1, "Memory cgroup charge failed because of no reclaimable memory! "
+		"This looks like a misconfiguration or a kernel bug.");
+	return OOM_FAILED;
 }
 
 /**
@@ -1899,6 +1924,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
+	bool oomed = false;
+	enum oom_status oom_status;
 
 	if (mem_cgroup_is_root(memcg))
 		return 0;
@@ -1986,6 +2013,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (nr_retries--)
 		goto retry;
 
+	if (gfp_mask & __GFP_RETRY_MAYFAIL && oomed)
+		goto nomem;
+
 	if (gfp_mask & __GFP_NOFAIL)
 		goto force;
 
@@ -1994,8 +2024,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	memcg_memory_event(mem_over_limit, MEMCG_OOM);
 
-	mem_cgroup_oom(mem_over_limit, gfp_mask,
+	/*
+	 * Keep retrying as long as the memcg oom killer is able to make
+	 * forward progress, or bypass the charge if the oom killer
+	 * couldn't make any progress.
+	 */
+	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
+	switch (oom_status) {
+	case OOM_SUCCESS:
+		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+		oomed = true;
+		goto retry;
+	case OOM_FAILED:
+		goto force;
+	default:
+		goto nomem;
+	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;