From patchwork Mon Jul  1 21:23:43 2024
X-Patchwork-Submitter: Aristeu Rozanski
X-Patchwork-Id: 13718649
Date: Mon, 1 Jul 2024 17:23:43 -0400
From: Aristeu Rozanski
To: linux-mm@kvack.org
Cc: Andrew Morton, Vishal Moola, Aristeu Rozanski, Muchun Song, stable@vger.kernel.org
Subject: [PATCH v2] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
Message-ID: <20240701212343.GG844599@cathedrallabs.org>
References: <20240621190050.mhxwb65zn37doegp@redhat.com>
 <20240621175609.9658bb023d6271125c685af8@linux-foundation.org>
 <20240625155438.98bfbac706b05d7ccc9b74a3@linux-foundation.org>
 <20240701191207.GB43545@cathedrallabs.org>
 <6683024a.050a0220.45e6c.7312@mx.google.com>
In-Reply-To: <6683024a.050a0220.45e6c.7312@mx.google.com>
MIME-Version: 1.0
Content-Disposition: inline
When trying to allocate a hugepage and no reserved ones are free, the
allocation may still be allowed if a number of overcommit hugepages was
configured (using /proc/sys/vm/nr_overcommit_hugepages) and that number
hasn't been reached yet. This allows extra hugepages to be allocated
dynamically when there are resources for it. Some sysadmins even prefer
not to reserve any hugepages and instead set a large number of
overcommit hugepages.
But while attempting to allocate overcommit hugepages on a multi-node
system (restricted by NUMA mempolicy or cpuset), said allocations may
randomly fail even when there are resources available for them. This
happens because allowed_mems_nr() only accounts for the number of free
hugepages on the nodes the current process is allowed to use, while the
surplus hugepage allocation may place pages on any node. If one or more
of the requested surplus hugepages end up allocated on a different
node, the whole allocation fails because allowed_mems_nr() returns a
lower value.

So allocate surplus hugepages on one of the nodes the current process
is allowed to use.

An easy way to reproduce this issue is on a system with 2+ NUMA nodes:

	# echo 0 >/proc/sys/vm/nr_hugepages
	# echo 1 >/proc/sys/vm/nr_overcommit_hugepages
	# numactl -m0 ./tools/testing/selftests/mm/map_hugetlb 2

Repeating the execution of the map_hugetlb test application will
eventually fail when the hugepage ends up allocated on a different node.
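The accounting mismatch described above can be illustrated with a toy
model outside the kernel. The following Python sketch uses hypothetical
names (it does not mirror the kernel's actual data structures): a
process bound to node 0, as with numactl -m0, has its reservation
checked by a counter that only sees its allowed nodes, while the
surplus page may land anywhere:

```python
# Toy model of the bug on a 2-node system. All names are illustrative,
# not kernel APIs.

free_huge_pages_node = {0: 0, 1: 0}   # per-node free surplus hugepages
allowed_nodes = {0}                   # process mempolicy: bound to node 0

def alloc_surplus(node):
    """Simulate a surplus hugepage allocation landing on `node`."""
    free_huge_pages_node[node] += 1

def allowed_mems_nr():
    """Count free hugepages only on nodes the process may use."""
    return sum(free_huge_pages_node[n] for n in allowed_nodes)

# Before the fix, the surplus page could land on any node (NUMA_NO_NODE):
alloc_surplus(1)                      # page ends up on node 1
print(allowed_mems_nr())              # 0 -> reservation check fails

# The fix allocates on a node from the process's nodemask instead:
free_huge_pages_node[1] = 0           # reset the toy state
alloc_surplus(0)                      # page lands on an allowed node
print(allowed_mems_nr())              # 1 -> reservation succeeds
```

The second half of the sketch is what the patch below achieves: the
surplus page is counted by the same nodemask that later validates the
reservation.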
v2:
- attempt to make the description more clear
- prevent uninitialized usage of folio in case current process isn't
  part of any nodes with memory

Cc: Vishal Moola
Cc: David Hildenbrand
Cc: Aristeu Rozanski
Cc: Muchun Song
Cc: Andrew Morton
Cc: stable@vger.kernel.org
Signed-off-by: Aristeu Rozanski

---
 mm/hugetlb.c | 47 ++++++++++++++++++++++++++++-------------------
 1 file changed, 28 insertions(+), 19 deletions(-)

--- upstream.orig/mm/hugetlb.c	2024-06-20 13:42:25.699568114 -0400
+++ upstream/mm/hugetlb.c	2024-07-01 16:48:53.693298053 -0400
@@ -2618,6 +2618,23 @@ struct folio *alloc_hugetlb_folio_nodema
 	return alloc_migrate_hugetlb_folio(h, gfp_mask, preferred_nid, nmask);
 }
 
+static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
+{
+#ifdef CONFIG_NUMA
+	struct mempolicy *mpol = get_task_policy(current);
+
+	/*
+	 * Only enforce MPOL_BIND policy which overlaps with cpuset policy
+	 * (from policy_nodemask) specifically for hugetlb case
+	 */
+	if (mpol->mode == MPOL_BIND &&
+		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
+		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
+		return &mpol->nodes;
+#endif
+	return NULL;
+}
+
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
@@ -2631,6 +2648,8 @@ static int gather_surplus_pages(struct h
 	long i;
 	long needed, allocated;
 	bool alloc_ok = true;
+	int node;
+	nodemask_t *mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
 
 	lockdep_assert_held(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
@@ -2645,8 +2664,15 @@
 	allocated = 0;
 
 retry:
 	spin_unlock_irq(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
-				NUMA_NO_NODE, NULL);
+		folio = NULL;
+		for_each_node_mask(node, cpuset_current_mems_allowed) {
+			if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) {
+				folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
+						node, NULL);
+				if (folio)
+					break;
+			}
+		}
 		if (!folio) {
 			alloc_ok = false;
 			break;
@@ -4876,23 +4902,6 @@
 	default_hstate_max_huge_pages = 0;
 }
 __setup("default_hugepagesz=", default_hugepagesz_setup);
 
-static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
-{
-#ifdef CONFIG_NUMA
-	struct mempolicy *mpol = get_task_policy(current);
-
-	/*
-	 * Only enforce MPOL_BIND policy which overlaps with cpuset policy
-	 * (from policy_nodemask) specifically for hugetlb case
-	 */
-	if (mpol->mode == MPOL_BIND &&
-		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
-		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
-		return &mpol->nodes;
-#endif
-	return NULL;
-}
-
 static unsigned int allowed_mems_nr(struct hstate *h)
 {
 	int node;
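The node-selection logic introduced in gather_surplus_pages() above can
be summarized as: walk the cpuset-allowed nodes in order, skip any node
outside the MPOL_BIND nodemask (when one is set), and stop at the first
node where the allocation succeeds. A small Python sketch of that
selection policy (function and parameter names are hypothetical, not
kernel identifiers):

```python
def pick_surplus_node(cpuset_allowed, mbind_nodemask, try_alloc):
    """Mirror of the patched loop: try each cpuset-allowed node in
    order, honoring an optional MPOL_BIND nodemask, and return the
    first node where allocation succeeds (None on total failure)."""
    for node in sorted(cpuset_allowed):
        if mbind_nodemask is None or node in mbind_nodemask:
            if try_alloc(node):
                return node
    return None

# cpuset allows nodes {0, 1}, mbind restricts to {1}, and only node 1
# has a free hugepage: node 0 is skipped, node 1 is chosen.
print(pick_surplus_node({0, 1}, {1}, lambda n: n == 1))  # 1
```

With no mbind nodemask the loop degrades to plain cpuset iteration,
which is why the patch also covers the cpuset-only case mentioned in
the commit message.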