From patchwork Thu Jun 28 15:11:01 2018
X-Patchwork-Submitter: Michal Hocko
X-Patchwork-Id: 10494245
From: Michal Hocko
To: Andrew Morton
Cc: Johannes Weiner, Shakeel Butt, Greg Thelen, , LKML, Michal Hocko
Subject: [PATCH] memcg, oom: move out_of_memory back to the charge path
Date: Thu, 28 Jun 2018 17:11:01 +0200
Message-Id: <20180628151101.25307-1-mhocko@kernel.org>
X-Mailer: git-send-email 2.18.0

From: Michal Hocko

3812c8c8f395 ("mm: memcg: do not trap chargers with full callstack on
OOM") has changed the ENOMEM semantic of memcg charges. Rather than
invoking the oom killer from the charging context it delays the oom
killer to the page fault path (pagefault_out_of_memory). This in turn
means that many users (e.g. slab or g-u-p) will get ENOMEM when the
corresponding memcg hits the hard limit and the memcg is OOM. This
behavior is inconsistent with the !memcg case where the oom killer is
invoked from the allocation context and the allocator keeps retrying
until it succeeds.

The difference in the behavior is user visible. mmap(MAP_POPULATE)
might result in not fully populated ranges while the mmap return code
doesn't tell that to the userspace. Random syscalls might fail with
ENOMEM etc.
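To make the user visible difference concrete, here is a minimal userspace
sketch (not part of the patch; the 512M mapping size and the memcg hard
limit it assumes, e.g. 256M via memory.limit_in_bytes, are purely
illustrative). Run inside such a memcg, the mmap(MAP_POPULATE) call below
can return success even though part of the populating charges failed, and
mincore() makes the partial population visible:

/*
 * Illustrative only: mmap() reports success for a MAP_POPULATE mapping
 * while mincore() reveals it was only partially populated because the
 * (assumed) memcg hard limit was hit during the populate.
 */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 512UL << 20;	/* assumed to exceed the memcg limit */
	long page_size = sysconf(_SC_PAGESIZE);
	size_t nr_pages = len / page_size;
	unsigned char *vec = malloc(nr_pages);
	size_t resident = 0, i;
	void *p;

	if (!vec)
		return 1;

	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* mincore() reports which pages of the range are actually resident */
	if (mincore(p, len, vec) == 0) {
		for (i = 0; i < nr_pages; i++)
			resident += vec[i] & 1;
		printf("populated %zu of %zu pages\n", resident, nr_pages);
	}
	return 0;
}

With this patch applied, hitting the hard limit during the populate invokes
the memcg oom killer again instead of leaving the range silently half
populated (unless the oom killer is disabled, see below).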
The primary motivation of the different memcg oom semantic was the
deadlock avoidance. Things have changed since then, though. We have an
async oom teardown by the oom reaper now and so we do not have to rely
on the victim to tear down its memory anymore. Therefore we can return
to the original semantic as long as the memcg oom killer is not handed
over to user space.

There is still one thing to be careful about here though. If the oom
killer is not able to make any forward progress - e.g. because there is
no eligible task to kill - then we have to bail out of the charge path
to prevent the same class of deadlocks. We have basically two options
here. Either we fail the charge with ENOMEM or force the charge and
allow overcharge. The first option has been considered more harmful than
useful because rare inconsistencies in the ENOMEM behavior are hard to
test for and error prone. This is basically the same reason why the page
allocator doesn't fail allocations under such conditions. The latter
might allow runaways but those should be really unlikely unless somebody
misconfigures the system, e.g. by allowing tasks to migrate away from
the memcg to a different unlimited memcg with move_charge_at_immigrate
disabled.

Changes since rfc v1
- s@memcg_may_oom@in_user_fault@ suggested by Johannes. It is much more
  clear what the purpose of the flag is now
- s@mem_cgroup_oom_enable@mem_cgroup_enter_user_fault@g
  s@mem_cgroup_oom_disable@mem_cgroup_exit_user_fault@g as per Johannes
- make oom_kill_disable an exceptional case because it should be rare
  and the normal oom handling the core of the function - per Johannes

Signed-off-by: Michal Hocko
Acked-by: Greg Thelen
---
Hi,
I've posted this as an RFC previously [1]. There was no fundamental
disagreement so I've integrated all the suggested changes and tested it.
mmap(MAP_POPULATE) hits the oom killer again rather than silently
failing to populate the mapping on the hard limit excess. On the other
hand g-u-p and other charge paths keep the ENOMEM semantic when the
memcg oom killer is disabled. All the forward progress guarantee relies
on the oom reaper.
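For reviewers, the resulting charge-path decision can be condensed from the
diff below roughly as follows (a simplified restatement of the new
mem_cgroup_oom(), not new code; the real hunks also keep the explanatory
comments and the WARN on failure):

/* Condensed sketch of the new mem_cgroup_oom() decision from the diff below. */
static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
	/* costly orders are still not worth an oom kill */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return OOM_SKIPPED;

	if (memcg->oom_kill_disable) {
		/* cgroup1 userspace oom handling: remember the context and
		 * deal with it at the end of the page fault */
		if (!current->in_user_fault)
			return OOM_SKIPPED;
		css_get(&memcg->css);
		current->memcg_in_oom = memcg;
		current->memcg_oom_gfp_mask = mask;
		current->memcg_oom_order = order;
		return OOM_ASYNC;
	}

	/* otherwise invoke the oom killer right from the charge path */
	return mem_cgroup_out_of_memory(memcg, mask, order) ?
		OOM_SUCCESS : OOM_FAILED;
}

try_charge() then retries the charge on OOM_SUCCESS, forces the charge on
OOM_FAILED (no eligible victim), and falls through to -ENOMEM otherwise.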
Unless there are objections I think this is ready to go to mmotm and
ready for the next merge window.

[1] http://lkml.kernel.org/r/20180620103736.13880-1-mhocko@kernel.org

 include/linux/memcontrol.h | 16 ++++----
 include/linux/sched.h      |  2 +-
 mm/memcontrol.c            | 75 ++++++++++++++++++++++++++++++--------
 mm/memory.c                |  4 +-
 4 files changed, 71 insertions(+), 26 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c6fb116e925..5a69bb4026f6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -494,16 +494,16 @@ unsigned long mem_cgroup_get_max(struct mem_cgroup *memcg);
 void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
 				struct task_struct *p);
 
-static inline void mem_cgroup_oom_enable(void)
+static inline void mem_cgroup_enter_user_fault(void)
 {
-	WARN_ON(current->memcg_may_oom);
-	current->memcg_may_oom = 1;
+	WARN_ON(current->in_user_fault);
+	current->in_user_fault = 1;
 }
 
-static inline void mem_cgroup_oom_disable(void)
+static inline void mem_cgroup_exit_user_fault(void)
 {
-	WARN_ON(!current->memcg_may_oom);
-	current->memcg_may_oom = 0;
+	WARN_ON(!current->in_user_fault);
+	current->in_user_fault = 0;
 }
 
 static inline bool task_in_memcg_oom(struct task_struct *p)
@@ -924,11 +924,11 @@ static inline void mem_cgroup_handle_over_high(void)
 {
 }
 
-static inline void mem_cgroup_oom_enable(void)
+static inline void mem_cgroup_enter_user_fault(void)
 {
 }
 
-static inline void mem_cgroup_oom_disable(void)
+static inline void mem_cgroup_exit_user_fault(void)
 {
 }
 
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d93a27..34cc95b751cd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -722,7 +722,7 @@ struct task_struct {
 	unsigned			restore_sigmask:1;
 #endif
 #ifdef CONFIG_MEMCG
-	unsigned			memcg_may_oom:1;
+	unsigned			in_user_fault:1;
 #ifndef CONFIG_SLOB
 	unsigned			memcg_kmem_skip_account:1;
 #endif
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e6f0d5ef320a..cff6c75137c1 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1483,28 +1483,53 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
 	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
 }
 
-static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
+enum oom_status {
+	OOM_SUCCESS,
+	OOM_FAILED,
+	OOM_ASYNC,
+	OOM_SKIPPED
+};
+
+static enum oom_status mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
 {
-	if (!current->memcg_may_oom || order > PAGE_ALLOC_COSTLY_ORDER)
-		return;
+	if (order > PAGE_ALLOC_COSTLY_ORDER)
+		return OOM_SKIPPED;
+
 	/*
 	 * We are in the middle of the charge context here, so we
 	 * don't want to block when potentially sitting on a callstack
 	 * that holds all kinds of filesystem and mm locks.
 	 *
-	 * Also, the caller may handle a failed allocation gracefully
-	 * (like optional page cache readahead) and so an OOM killer
-	 * invocation might not even be necessary.
+	 * cgroup1 allows disabling the OOM killer and waiting for outside
+	 * handling until the charge can succeed; remember the context and put
+	 * the task to sleep at the end of the page fault when all locks are
+	 * released.
+	 *
+	 * On the other hand, in-kernel OOM killer allows for an async victim
+	 * memory reclaim (oom_reaper) and that means that we are not solely
+	 * relying on the oom victim to make a forward progress and we can
+	 * invoke the oom killer here.
 	 *
-	 * That's why we don't do anything here except remember the
-	 * OOM context and then deal with it at the end of the page
-	 * fault when the stack is unwound, the locks are released,
-	 * and when we know whether the fault was overall successful.
+	 * Please note that mem_cgroup_out_of_memory might fail to find a
+	 * victim and then we have to bail out from the charge path.
 	 */
-	css_get(&memcg->css);
-	current->memcg_in_oom = memcg;
-	current->memcg_oom_gfp_mask = mask;
-	current->memcg_oom_order = order;
+	if (memcg->oom_kill_disable) {
+		if (!current->in_user_fault)
+			return OOM_SKIPPED;
+		css_get(&memcg->css);
+		current->memcg_in_oom = memcg;
+		current->memcg_oom_gfp_mask = mask;
+		current->memcg_oom_order = order;
+
+		return OOM_ASYNC;
+	}
+
+	if (mem_cgroup_out_of_memory(memcg, mask, order))
+		return OOM_SUCCESS;
+
+	WARN(1,"Memory cgroup charge failed because of no reclaimable memory! "
+		"This looks like a misconfiguration or a kernel bug.");
+	return OOM_FAILED;
 }
 
 /**
@@ -1899,6 +1924,8 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	unsigned long nr_reclaimed;
 	bool may_swap = true;
 	bool drained = false;
+	bool oomed = false;
+	enum oom_status oom_status;
 
 	if (mem_cgroup_is_root(memcg))
 		return 0;
@@ -1986,6 +2013,9 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 	if (nr_retries--)
 		goto retry;
 
+	if (gfp_mask & __GFP_RETRY_MAYFAIL && oomed)
+		goto nomem;
+
 	if (gfp_mask & __GFP_NOFAIL)
 		goto force;
 
@@ -1994,8 +2024,23 @@ static int try_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
 
 	memcg_memory_event(mem_over_limit, MEMCG_OOM);
 
-	mem_cgroup_oom(mem_over_limit, gfp_mask,
+	/*
+	 * keep retrying as long as the memcg oom killer is able to make
+	 * a forward progress or bypass the charge if the oom killer
+	 * couldn't make any progress.
+	 */
+	oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask,
 		       get_order(nr_pages * PAGE_SIZE));
+	switch (oom_status) {
+	case OOM_SUCCESS:
+		nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+		oomed = true;
+		goto retry;
+	case OOM_FAILED:
+		goto force;
+	default:
+		goto nomem;
+	}
 nomem:
 	if (!(gfp_mask & __GFP_NOFAIL))
 		return -ENOMEM;
diff --git a/mm/memory.c b/mm/memory.c
index 7206a634270b..a4b1f8c24884 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4125,7 +4125,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 	 * space. Kernel faults are handled more gracefully.
 	 */
 	if (flags & FAULT_FLAG_USER)
-		mem_cgroup_oom_enable();
+		mem_cgroup_enter_user_fault();
 
 	if (unlikely(is_vm_hugetlb_page(vma)))
 		ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
@@ -4133,7 +4133,7 @@ int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
 		ret = __handle_mm_fault(vma, address, flags);
 
 	if (flags & FAULT_FLAG_USER) {
-		mem_cgroup_oom_disable();
+		mem_cgroup_exit_user_fault();
 		/*
 		 * The task may have entered a memcg OOM situation but
 		 * if the allocation error was handled gracefully (no