From patchwork Wed Apr 2 20:56:13 2025
X-Patchwork-Submitter: Frank van der Linden
X-Patchwork-Id: 14036470
Date: Wed, 2 Apr 2025 20:56:13 +0000
Message-ID: <20250402205613.3086864-1-fvdl@google.com>
Subject: [PATCH] mm/hugetlb: use separate nodemask for bootmem allocations
From: Frank van der Linden <fvdl@google.com>
To: akpm@linux-foundation.org, muchun.song@linux.dev, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Cc: david@redhat.com, osalvador@suse.de, luizcap@redhat.com,
	Frank van der Linden <fvdl@google.com>
Hugetlb boot allocation has used online nodes for allocation since
commit de55996d7188 ("mm/hugetlb: use online nodes for bootmem
allocation"). This was needed to be able to do the allocations earlier
in boot, before N_MEMORY was set.
This might lead to a different distribution of gigantic hugepages
across NUMA nodes if there are memoryless nodes in the system. What
happens is that the memoryless nodes are tried, but then the memblock
allocation fails and falls back, which usually means that the node
with the highest available physical address will be used (top-down
allocation). While this will end up getting the same total number of
hugetlb pages, they might not be distributed the same way. The
fallback for each memoryless node might not end up coming from the
same node as a successful round-robin allocation from N_MEMORY nodes
would.

While administrators who rely on having a specific number of hugepages
per node should use the hugepages=N:X syntax, it's better not to
change the old behavior for the plain hugepages=N case.

To do this, construct a nodemask for hugetlb bootmem purposes only,
containing only nodes that have memory, and use it for the round-robin
bootmem allocations. This saves some cycles, and the added advantage
is that hugetlb_cma can use it too, avoiding the older issue of
pointless attempts to create a CMA area for memoryless nodes (which
would also cause the per-node CMA area size to be too small).
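The distribution difference described above can be sketched with a
minimal Python simulation. This is an illustration only, not kernel
code: the fallback model, in which a failed allocation simply lands on
one fixed node with memory, is a simplified stand-in for memblock's
top-down fallback, and the node layout is made up.

```python
def allocate(nodes_tried, has_memory, npages, fallback_node):
    """Round-robin npages over nodes_tried; an attempt on a memoryless
    node fails and falls back to fallback_node (a simplified model of
    memblock's top-down fallback)."""
    counts = {n: 0 for n in has_memory if has_memory[n]}
    for i in range(npages):
        node = nodes_tried[i % len(nodes_tried)]
        if not has_memory[node]:
            node = fallback_node
        counts[node] += 1
    return counts

# Hypothetical 4-node system where node 1 is memoryless.
has_memory = {0: True, 1: False, 2: True, 3: True}

# Old behavior: round-robin over all online nodes. Node 1's share all
# lands on the fallback node, skewing the distribution: {0: 3, 2: 3, 3: 6}.
old = allocate([0, 1, 2, 3], has_memory, 12, fallback_node=3)

# New behavior: round-robin only over nodes that actually have memory,
# giving an even split: {0: 4, 2: 4, 3: 4}.
new = allocate([0, 2, 3], has_memory, 12, fallback_node=3)
```

Both runs place the same total number of pages; only the per-node
spread differs, which is exactly the behavior change this patch avoids.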
Fixes: de55996d7188 ("mm/hugetlb: use online nodes for bootmem allocation")
Signed-off-by: Frank van der Linden <fvdl@google.com>
---
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb.c            | 30 ++++++++++++++++++++++++++++--
 mm/hugetlb_cma.c        | 11 +++++++----
 3 files changed, 38 insertions(+), 6 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 8f3ac832ee7f..fc9166f7f679 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -14,6 +14,7 @@
 #include
 #include
 #include
+#include <linux/nodemask.h>
 
 struct ctl_table;
 struct user_struct;
@@ -176,6 +177,8 @@ extern struct list_head huge_boot_pages[MAX_NUMNODES];
 
 void hugetlb_bootmem_alloc(void);
 bool hugetlb_bootmem_allocated(void);
+extern nodemask_t hugetlb_bootmem_nodes;
+void hugetlb_bootmem_set_nodes(void);
 
 /* arch callbacks */
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6fccfe6d046c..e69f6f31e082 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -58,6 +58,7 @@ int hugetlb_max_hstate __read_mostly;
 unsigned int default_hstate_idx;
 struct hstate hstates[HUGE_MAX_HSTATE];
 
+__initdata nodemask_t hugetlb_bootmem_nodes;
 __initdata struct list_head huge_boot_pages[MAX_NUMNODES];
 static unsigned long hstate_boot_nrinvalid[HUGE_MAX_HSTATE] __initdata;
 
@@ -3237,7 +3238,8 @@ int __alloc_bootmem_huge_page(struct hstate *h, int nid)
 	}
 
 	/* allocate from next node when distributing huge pages */
-	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node, &node_states[N_ONLINE]) {
+	for_each_node_mask_to_alloc(&h->next_nid_to_alloc, nr_nodes, node,
+				    &hugetlb_bootmem_nodes) {
 		m = alloc_bootmem(h, node, false);
 		if (!m)
 			return 0;
@@ -3701,6 +3703,15 @@ static void __init hugetlb_init_hstates(void)
 	struct hstate *h, *h2;
 
 	for_each_hstate(h) {
+		/*
+		 * Always reset to first_memory_node here, even if
+		 * next_nid_to_alloc was set before - we can't
+		 * reference hugetlb_bootmem_nodes after init, and
+		 * first_memory_node is right for all further allocations.
+		 */
+		h->next_nid_to_alloc = first_memory_node;
+		h->next_nid_to_free = first_memory_node;
+
 		/* oversize hugepages were init'ed in early boot */
 		if (!hstate_is_gigantic(h))
 			hugetlb_hstate_alloc_pages(h);
@@ -4990,6 +5001,20 @@ static int __init default_hugepagesz_setup(char *s)
 }
 hugetlb_early_param("default_hugepagesz", default_hugepagesz_setup);
 
+void __init hugetlb_bootmem_set_nodes(void)
+{
+	int i, nid;
+	unsigned long start_pfn, end_pfn;
+
+	if (!nodes_empty(hugetlb_bootmem_nodes))
+		return;
+
+	for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn, &nid) {
+		if (end_pfn > start_pfn)
+			node_set(nid, hugetlb_bootmem_nodes);
+	}
+}
+
 static bool __hugetlb_bootmem_allocated __initdata;
 
 bool __init hugetlb_bootmem_allocated(void)
@@ -5005,6 +5030,8 @@ void __init hugetlb_bootmem_alloc(void)
 	if (__hugetlb_bootmem_allocated)
 		return;
 
+	hugetlb_bootmem_set_nodes();
+
 	for (i = 0; i < MAX_NUMNODES; i++)
 		INIT_LIST_HEAD(&huge_boot_pages[i]);
 
@@ -5012,7 +5039,6 @@ void __init hugetlb_bootmem_alloc(void)
 
 	for_each_hstate(h) {
 		h->next_nid_to_alloc = first_online_node;
-		h->next_nid_to_free = first_online_node;
 
 		if (hstate_is_gigantic(h))
 			hugetlb_hstate_alloc_pages(h);
diff --git a/mm/hugetlb_cma.c b/mm/hugetlb_cma.c
index e0f2d5c3a84c..f58ef4969e7a 100644
--- a/mm/hugetlb_cma.c
+++ b/mm/hugetlb_cma.c
@@ -66,7 +66,7 @@ hugetlb_cma_alloc_bootmem(struct hstate *h, int *nid, bool node_exact)
 	if (node_exact)
 		return NULL;
 
-	for_each_online_node(node) {
+	for_each_node_mask(node, hugetlb_bootmem_nodes) {
 		cma = hugetlb_cma[node];
 		if (!cma || node == *nid)
 			continue;
@@ -153,11 +153,13 @@ void __init hugetlb_cma_reserve(int order)
 	if (!hugetlb_cma_size)
 		return;
 
+	hugetlb_bootmem_set_nodes();
+
 	for (nid = 0; nid < MAX_NUMNODES; nid++) {
 		if (hugetlb_cma_size_in_node[nid] == 0)
 			continue;
 
-		if (!node_online(nid)) {
+		if (!node_isset(nid, hugetlb_bootmem_nodes)) {
 			pr_warn("hugetlb_cma: invalid node %d specified\n", nid);
 			hugetlb_cma_size -= hugetlb_cma_size_in_node[nid];
 			hugetlb_cma_size_in_node[nid] = 0;
@@ -190,13 +192,14 @@ void __init hugetlb_cma_reserve(int order)
 		 * If 3 GB area is requested on a machine with 4 numa nodes,
 		 * let's allocate 1 GB on first three nodes and ignore the last one.
 		 */
-		per_node = DIV_ROUND_UP(hugetlb_cma_size, nr_online_nodes);
+		per_node = DIV_ROUND_UP(hugetlb_cma_size,
+					nodes_weight(hugetlb_bootmem_nodes));
 		pr_info("hugetlb_cma: reserve %lu MiB, up to %lu MiB per node\n",
 			hugetlb_cma_size / SZ_1M, per_node / SZ_1M);
 	}
 
 	reserved = 0;
-	for_each_online_node(nid) {
+	for_each_node_mask(nid, hugetlb_bootmem_nodes) {
 		int res;
 		char name[CMA_MAX_NAME];