From patchwork Tue Sep 7 08:25:50 2021
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Feng Tang
X-Patchwork-Id: 12477837
From: Feng Tang
To: Andrew Morton, Michal Hocko, David Rientjes, Mel Gorman,
 Vlastimil Babka, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, Feng Tang
Subject: [PATCH] mm/page_alloc: detect allocation forbidden by cpuset and
 bail out early
Date: Tue, 7 Sep 2021 16:25:50 +0800
Message-Id: <1631003150-96935-1-git-send-email-feng.tang@intel.com>
X-Mailer: git-send-email 2.7.4

There was a report that starting an Ubuntu container in docker while
using cpuset to bind it to movable nodes (a node that only has a movable
zone, like a node for hotplug or a Persistent Memory node in normal
usage) fails due to memory allocation failure. The OOM killer is then
invoked and many other innocent processes get killed.

It can be reproduced with the command:

  $ docker run -it --rm --cpuset-mems 4 ubuntu:latest bash -c "grep Mems_allowed /proc/self/status"

(node 4 is a movable node)

The reason is that, in this case, the target cpuset nodes only have a
movable zone, while the creation of an OS in docker sometimes needs to
allocate memory in non-movable zones (dma/dma32/normal), e.g. with
GFP_HIGHUSER, and the cpuset limit forbids the allocation. The OOM
killer is then invoked even though both the normal nodes and the movable
nodes have plenty of free memory.

The failure is reasonable, but there is still one problem: when the
allocation cannot possibly succeed because of the cpuset limit, it
should not trigger reclaim/compaction and, more importantly, should not
get any innocent process oom-killed.

So add detection for cases like this in the slowpath of allocation, and
bail out early, returning NULL for the allocation.

We've run some cases of malloc/mmap/page_fault/lru-shm/swap from
will-it-scale and vm-scalability, and didn't see any obvious performance
change (all within +/- 1%). The test boxes are 2-socket Cascade Lake and
Ice Lake servers.

[thanks to Michal Hocko and David Rientjes for suggesting not to handle
 it inside the OOM code]

Suggested-by: Michal Hocko
Signed-off-by: Feng Tang
---
Changelog:

  since RFC
  * move the handling from oom code to page allocation path
    (Michal/David)

 mm/page_alloc.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f95e1d2386a1..d6657f68d1fb 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4929,6 +4929,19 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	if (!ac->preferred_zoneref->zone)
 		goto nopage;
 
+	/*
+	 * Check for insane configurations where the cpuset doesn't contain
+	 * any suitable zone to satisfy the request - e.g. non-movable
+	 * GFP_HIGHUSER allocations from MOVABLE nodes only.
+	 */
+	if (cpusets_enabled() && (gfp_mask & __GFP_HARDWALL)) {
+		struct zoneref *z = first_zones_zonelist(ac->zonelist,
+					ac->highest_zoneidx,
+					&cpuset_current_mems_allowed);
+		if (!z->zone)
+			goto nopage;
+	}
+
 	if (alloc_flags & ALLOC_KSWAPD)
 		wake_all_kswapds(order, gfp_mask, ac);
 
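
For anyone who wants to reason about the new check outside the kernel,
below is a minimal user-space sketch of the same idea. It is not kernel
code: every model_* name is hypothetical, a plain 64-bit mask stands in
for cpuset_current_mems_allowed, and the sample zonelist (node 0 with
dma/dma32/normal, node 4 with only a movable zone) is an assumed
topology chosen to match the reproducer. It only illustrates the rule
"no zone at or below the highest allowed zone index lives on an allowed
node, so the request can never be satisfied".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Zone types from lowest to highest (simplified, hypothetical model). */
enum model_zone_idx { MZ_DMA, MZ_DMA32, MZ_NORMAL, MZ_MOVABLE };

/* One zonelist entry: the node a zone lives on and the zone's type. */
struct model_zoneref {
	int node;			/* assumed to be in the range 0..63 */
	enum model_zone_idx idx;
};

/* Assumed topology: node 0 is a normal node, node 4 is movable-only. */
const struct model_zoneref model_zonelist[] = {
	{ 0, MZ_NORMAL }, { 0, MZ_DMA32 }, { 0, MZ_DMA }, { 4, MZ_MOVABLE },
};
#define MODEL_ZONELIST_LEN \
	(sizeof(model_zonelist) / sizeof(model_zonelist[0]))

/*
 * Return the first entry whose zone index does not exceed highest_idx
 * and whose node bit is set in allowed_nodes, or NULL if there is none.
 */
const struct model_zoneref *
model_first_zone(const struct model_zoneref *list, size_t n,
		 enum model_zone_idx highest_idx, uint64_t allowed_nodes)
{
	for (size_t i = 0; i < n; i++) {
		bool node_ok = (allowed_nodes >> list[i].node) & 1u;

		if (list[i].idx <= highest_idx && node_ok)
			return &list[i];
	}
	return NULL;
}

/*
 * True when a hardwall-constrained request cannot be satisfied by any
 * zone inside the allowed node mask, so the caller should bail out
 * early instead of entering reclaim, compaction or OOM handling.
 */
bool model_alloc_impossible(const struct model_zoneref *list, size_t n,
			    enum model_zone_idx highest_idx,
			    uint64_t allowed_nodes, bool hardwall)
{
	if (!hardwall)
		return false;
	return model_first_zone(list, n, highest_idx, allowed_nodes) == NULL;
}
```

In this model, a GFP_HIGHUSER-like request corresponds to highest_idx =
MZ_NORMAL. With allowed_nodes = 1ULL << 4 and hardwall set,
model_alloc_impossible() reports true, which is the case the patch
short-circuits. A movable request (highest_idx = MZ_MOVABLE) on the same
mask, or the same request on node 0, reports false.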