From patchwork Fri Oct 22 08:11:09 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vasily Averin X-Patchwork-Id: 12577379 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 0DCE2C433F5 for ; Fri, 22 Oct 2021 08:11:26 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id 8A3EA6112D for ; Fri, 22 Oct 2021 08:11:25 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org 8A3EA6112D Authentication-Results: mail.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=virtuozzo.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 2923A940007; Fri, 22 Oct 2021 04:11:25 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 2427B900002; Fri, 22 Oct 2021 04:11:25 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 131D6940007; Fri, 22 Oct 2021 04:11:25 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0034.hostedemail.com [216.40.44.34]) by kanga.kvack.org (Postfix) with ESMTP id 05EAD900002 for ; Fri, 22 Oct 2021 04:11:25 -0400 (EDT) Received: from smtpin06.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay05.hostedemail.com (Postfix) with ESMTP id B1F581835229C for ; Fri, 22 Oct 2021 08:11:24 +0000 (UTC) X-FDA: 78723353688.06.54FEEAE Received: from relay.sw.ru (relay.sw.ru [185.231.240.75]) by imf14.hostedemail.com (Postfix) with ESMTP id 769AC6001988 for ; Fri, 22 Oct 2021 08:11:25 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=virtuozzo.com; s=relay; h=Content-Type:MIME-Version:Date:Message-ID:Subject :From; bh=eJF2bPLrfoGml4ORgOmoEHee/5o6h/70a4+fd/beNO0=; b=GJaMAcBDG07/QtFE6wl YEZpKgkBqBSPGv34ZLajUt1kV7fCReA6AvcZ5N7R9P6/0/iLZTyJzFXjjDFwCeFY1HL0f1xK5O9ST K6l8RBQzOdZpOZBPBqxkIP1Z3EC/kM3m2/rn+jY53Z7KY0/a91rdGQZKO3HP3dmnrqD7JoYsZmg=; Received: from [172.29.1.17] by relay.sw.ru with esmtp (Exim 4.94.2) (envelope-from ) id 1mdpe5-006oNB-E3; Fri, 22 Oct 2021 11:11:21 +0300 From: Vasily Averin Subject: [PATCH memcg v2 1/2] mm, oom: do not trigger out_of_memory from the #PF To: Michal Hocko , Johannes Weiner , Vladimir Davydov , Andrew Morton Cc: Roman Gushchin , Uladzislau Rezki , Vlastimil Babka , Shakeel Butt , Mel Gorman , Tetsuo Handa , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel@openvz.org References: Message-ID: <91d9196e-842a-757f-a3f2-caeb4a89a0d8@virtuozzo.com> Date: Fri, 22 Oct 2021 11:11:09 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0 MIME-Version: 1.0 In-Reply-To: Content-Language: en-US X-Rspamd-Server: rspam02 X-Rspamd-Queue-Id: 769AC6001988 X-Stat-Signature: rbetzimxufie5g34tw9yrbk18c17r6ab Authentication-Results: imf14.hostedemail.com; dkim=pass header.d=virtuozzo.com header.s=relay header.b=GJaMAcBD; dmarc=pass (policy=quarantine) header.from=virtuozzo.com; spf=pass (imf14.hostedemail.com: domain of vvs@virtuozzo.com designates 185.231.240.75 as permitted sender) smtp.mailfrom=vvs@virtuozzo.com X-HE-Tag: 1634890285-913486 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: From: Michal Hocko Any allocation failure during the #PF path will return with VM_FAULT_OOM which in turn results in pagefault_out_of_memory. This can happen for 2 different reasons. a) Memcg is out of memory and we rely on mem_cgroup_oom_synchronize to perform the memcg OOM handling or b) normal allocation fails. The later is quite problematic because allocation paths already trigger out_of_memory and the page allocator tries really hard to not fail allocations. Anyway, if the OOM killer has been already invoked there is no reason to invoke it again from the #PF path. Especially when the OOM condition might be gone by that time and we have no way to find out other than allocate. Moreover if the allocation failed and the OOM killer hasn't been invoked then we are unlikely to do the right thing from the #PF context because we have already lost the allocation context and restictions and therefore might oom kill a task from a different NUMA domain. An allocation might fail also when the current task is the oom victim and there are no memory reserves left and we should simply bail out from the #PF rather than invoking out_of_memory. This all suggests that there is no legitimate reason to trigger out_of_memory from pagefault_out_of_memory so drop it. Just to be sure that no #PF path returns with VM_FAULT_OOM without allocation print a warning that this is happening before we restart the #PF. [VvS: #PF allocation can hit into limit of cgroup v1 kmem controlle. This is a local problem related to memcg, however, it causes unnecessary global OOM kills that are repeated over and over again and escalate into a real disaster. It was broken long time ago, most likely since 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg"). In upstream the problem will be fixed by removing the outdated kmem limit, however stable and LTS kernels cannot do it and are still affected. This patch fixes the problem and should be backported into stable.] Fixes: 6a1a0d3b625a ("mm: allocate kernel pages to the right memcg") Cc: stable@vger.kernel.org Signed-off-by: Michal Hocko Reviewed-by: Vasily Averin --- mm/oom_kill.c | 23 ++++++++++------------- 1 file changed, 10 insertions(+), 13 deletions(-) diff --git a/mm/oom_kill.c b/mm/oom_kill.c index 831340e7ad8b..f98954befafb 100644 --- a/mm/oom_kill.c +++ b/mm/oom_kill.c @@ -1120,27 +1120,24 @@ bool out_of_memory(struct oom_control *oc) } /* - * The pagefault handler calls here because it is out of memory, so kill a - * memory-hogging task. If oom_lock is held by somebody else, a parallel oom - * killing is already in progress so do nothing. + * The pagefault handler calls here because some allocation has failed. We have + * to take care of the memcg OOM here because this is the only safe context without + * any locks held but let the oom killer triggered from the allocation context care + * about the global OOM. */ void pagefault_out_of_memory(void) { - struct oom_control oc = { - .zonelist = NULL, - .nodemask = NULL, - .memcg = NULL, - .gfp_mask = 0, - .order = 0, - }; + static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL, + DEFAULT_RATELIMIT_BURST); if (mem_cgroup_oom_synchronize(true)) return; - if (!mutex_trylock(&oom_lock)) + if (fatal_signal_pending(current)) return; - out_of_memory(&oc); - mutex_unlock(&oom_lock); + + if (__ratelimit(&pfoom_rs)) + pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n"); } SYSCALL_DEFINE2(process_mrelease, int, pidfd, unsigned int, flags) From patchwork Fri Oct 22 08:11:29 2021 Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: 7bit X-Patchwork-Submitter: Vasily Averin X-Patchwork-Id: 12577381 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 1DBFFC433F5 for ; Fri, 22 Oct 2021 08:11:56 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id C87F0610C8 for ; Fri, 22 Oct 2021 08:11:55 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.4.1 mail.kernel.org C87F0610C8 Authentication-Results: mail.kernel.org; dmarc=fail (p=quarantine dis=none) header.from=virtuozzo.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=kvack.org Received: by kanga.kvack.org (Postfix) id 68B8C940008; Fri, 22 Oct 2021 04:11:55 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 63A1F900002; Fri, 22 Oct 2021 04:11:55 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 55052940008; Fri, 22 Oct 2021 04:11:55 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0101.hostedemail.com [216.40.44.101]) by kanga.kvack.org (Postfix) with ESMTP id 47B7E900002 for ; Fri, 22 Oct 2021 04:11:55 -0400 (EDT) Received: from smtpin08.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay02.hostedemail.com (Postfix) with ESMTP id EC74D2D4D8 for ; Fri, 22 Oct 2021 08:11:54 +0000 (UTC) X-FDA: 78723354948.08.53D1975 Received: from relay.sw.ru (relay.sw.ru [185.231.240.75]) by imf26.hostedemail.com (Postfix) with ESMTP id 86B7720019C6 for ; Fri, 22 Oct 2021 08:11:55 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; q=dns/txt; c=relaxed/relaxed; d=virtuozzo.com; s=relay; h=Content-Type:MIME-Version:Date:Message-ID:Subject :From; bh=Z5qJC2PMSc7dBPFzy7GUbPnzdUrepr/bhnEyppDvcMM=; b=Mmkju3VsxYztt/ednVO ezhdPmoJb+bKP6OO0zWJaGiKZ4LzOLJN6/ETQ4iP9fyFIdll/Tj6SlV7li0AQDCK1cOqKhqVcAVXT kbFSbXJj0kcbU/cLT9fPyIgvDD+w2bUxBV3/YDB+IxDgVjyFN/WnjbQes9gFBEIGP/OZd095JgU=; Received: from [172.29.1.17] by relay.sw.ru with esmtp (Exim 4.94.2) (envelope-from ) id 1mdpeZ-006oNy-43; Fri, 22 Oct 2021 11:11:51 +0300 From: Vasily Averin Subject: [PATCH memcg v2 2/2] memcg: prohibit unconditional exceeding the limit of dying tasks To: Michal Hocko , Johannes Weiner , Vladimir Davydov , Andrew Morton Cc: Roman Gushchin , Uladzislau Rezki , Vlastimil Babka , Shakeel Butt , Mel Gorman , Tetsuo Handa , cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, kernel@openvz.org References: Message-ID: <4b315938-5600-b7f5-bde9-82f638a2e595@virtuozzo.com> Date: Fri, 22 Oct 2021 11:11:29 +0300 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Thunderbird/78.13.0 MIME-Version: 1.0 In-Reply-To: Content-Language: en-US X-Rspamd-Queue-Id: 86B7720019C6 Authentication-Results: imf26.hostedemail.com; dkim=pass header.d=virtuozzo.com header.s=relay header.b=Mmkju3Vs; spf=pass (imf26.hostedemail.com: domain of vvs@virtuozzo.com designates 185.231.240.75 as permitted sender) smtp.mailfrom=vvs@virtuozzo.com; dmarc=pass (policy=quarantine) header.from=virtuozzo.com X-Stat-Signature: jwyykai1mx5xpnascb7ju6x11wr4nm7x X-Rspamd-Server: rspam06 X-HE-Tag: 1634890315-117734 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: Memory cgroup charging allows killed or exiting tasks to exceed the hard limit. It is assumed that the amount of the memory charged by those tasks is bound and most of the memory will get released while the task is exiting. This is resembling a heuristic for the global OOM situation when tasks get access to memory reserves. There is no global memory shortage at the memcg level so the memcg heuristic is more relieved. The above assumption is overly optimistic though. E.g. vmalloc can scale to really large requests and the heuristic would allow that. We used to have an early break in the vmalloc allocator for killed tasks but this has been reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when the current task is killed""). There are likely other similar code paths which do not check for fatal signals in an allocation&charge loop. Also there are some kernel objects charged to a memcg which are not bound to a process life time. It has been observed that it is not really hard to trigger these bypasses and cause global OOM situation. One potential way to address these runaways would be to limit the amount of excess (similar to the global OOM with limited oom reserves). This is certainly possible but it is not really clear how much of an excess is desirable and still protects from global OOMs as that would have to consider the overall memcg configuration. This patch is addressing the problem by removing the heuristic altogether. Bypass is only allowed for requests which either cannot fail or where the failure is not desirable while excess should be still limited (e.g. atomic requests). Implementation wise a killed or dying task fails to charge if it has passed the OOM killer stage. That should give all forms of reclaim chance to restore the limit before the failure (ENOMEM) and tell the caller to back off. In addition, this patch renames should_force_charge() helper to task_is_dying() because now its use is not associated witch forced charging. Fixes: a636b327f731 ("memcg: avoid unnecessary system-wide-oom-killer") Cc: stable@vger.kernel.org Suggested-by: Michal Hocko Signed-off-by: Vasily Averin Acked-by: Michal Hocko --- mm/memcontrol.c | 27 ++++++++------------------- 1 file changed, 8 insertions(+), 19 deletions(-) diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 6da5020a8656..87e41c3cac10 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -239,7 +239,7 @@ enum res_type { iter != NULL; \ iter = mem_cgroup_iter(NULL, iter, NULL)) -static inline bool should_force_charge(void) +static inline bool task_is_dying(void) { return tsk_is_oom_victim(current) || fatal_signal_pending(current) || (current->flags & PF_EXITING); @@ -1575,7 +1575,7 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask, * A few threads which were not waiting at mutex_lock_killable() can * fail to bail out. Therefore, check again after holding oom_lock. */ - ret = should_force_charge() || out_of_memory(&oc); + ret = task_is_dying() || out_of_memory(&oc); unlock: mutex_unlock(&oom_lock); @@ -2530,6 +2530,7 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, struct page_counter *counter; enum oom_status oom_status; unsigned long nr_reclaimed; + bool passed_oom = false; bool may_swap = true; bool drained = false; unsigned long pflags; @@ -2564,15 +2565,6 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, if (gfp_mask & __GFP_ATOMIC) goto force; - /* - * Unlike in global OOM situations, memcg is not in a physical - * memory shortage. Allow dying and OOM-killed tasks to - * bypass the last charges so that they can exit quickly and - * free their memory. - */ - if (unlikely(should_force_charge())) - goto force; - /* * Prevent unbounded recursion when reclaim operations need to * allocate memory. This might exceed the limits temporarily, @@ -2630,8 +2622,9 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, if (gfp_mask & __GFP_RETRY_MAYFAIL) goto nomem; - if (fatal_signal_pending(current)) - goto force; + /* Avoid endless loop for tasks bypassed by the oom killer */ + if (passed_oom && task_is_dying()) + goto nomem; /* * keep retrying as long as the memcg oom killer is able to make @@ -2640,14 +2633,10 @@ static int try_charge_memcg(struct mem_cgroup *memcg, gfp_t gfp_mask, */ oom_status = mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(nr_pages * PAGE_SIZE)); - switch (oom_status) { - case OOM_SUCCESS: + if (oom_status == OOM_SUCCESS) { + passed_oom = true; nr_retries = MAX_RECLAIM_RETRIES; goto retry; - case OOM_FAILED: - goto force; - default: - goto nomem; } nomem: if (!(gfp_mask & __GFP_NOFAIL))