From patchwork Thu Feb 27 22:45:05 2025
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Thomas Prescher via B4 Relay
 <devnull+thomas.prescher.cyberus-technology.de@kernel.org>
X-Patchwork-Id: 13995318
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 94564C197BF
	for <linux-mm@archiver.kernel.org>; Thu, 27 Feb 2025 22:45:23 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id C0FB26B0099; Thu, 27 Feb 2025 17:45:16 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id B95BB6B009C; Thu, 27 Feb 2025 17:45:16 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 7C6D46B009A; Thu, 27 Feb 2025 17:45:16 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com
 [216.40.44.15])
	by kanga.kvack.org (Postfix) with ESMTP id 3C8FC6B0099
	for <linux-mm@kvack.org>; Thu, 27 Feb 2025 17:45:16 -0500 (EST)
Received: from smtpin10.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay04.hostedemail.com (Postfix) with ESMTP id 8EFC21A1F52
	for <linux-mm@kvack.org>; Thu, 27 Feb 2025 22:45:15 +0000 (UTC)
X-FDA: 83167206990.10.986CAF3
Received: from dfw.source.kernel.org (dfw.source.kernel.org [139.178.84.217])
	by imf26.hostedemail.com (Postfix) with ESMTP id 90F00140008
	for <linux-mm@kvack.org>; Thu, 27 Feb 2025 22:45:13 +0000 (UTC)
Authentication-Results: imf26.hostedemail.com;
	dkim=pass header.d=kernel.org header.s=k20201202 header.b=ReviaM9y;
	spf=pass (imf26.hostedemail.com: domain of
 devnull+thomas.prescher.cyberus-technology.de@kernel.org designates
 139.178.84.217 as permitted sender)
 smtp.mailfrom=devnull+thomas.prescher.cyberus-technology.de@kernel.org;
	dmarc=pass (policy=quarantine) header.from=kernel.org
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed;
 d=hostedemail.com;
	s=arc-20220608; t=1740696313;
	h=from:from:sender:reply-to:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:
	 content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=amRfIu7egDuh2gQT2LLaey0nNPyqBorglSAqnFUklxU=;
	b=qQ0JrcZL57sJOEOH8jYxO+rW7wRkKPzrmdayN34okxAWbTusTaTrLyjokIztfdYI8gBu0D
	yg37oQOjMN+lpt4yBppaOdU9IIlYNVPMlPKtUC3mUB8S+SfiQzaE9pcuYNe1tpraKc+tNS
	4XlXNKkrCJzGsBY/vwAEkuDj4rL2hm4=
ARC-Authentication-Results: i=1;
	imf26.hostedemail.com;
	dkim=pass header.d=kernel.org header.s=k20201202 header.b=ReviaM9y;
	spf=pass (imf26.hostedemail.com: domain of
 devnull+thomas.prescher.cyberus-technology.de@kernel.org designates
 139.178.84.217 as permitted sender)
 smtp.mailfrom=devnull+thomas.prescher.cyberus-technology.de@kernel.org;
	dmarc=pass (policy=quarantine) header.from=kernel.org
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1740696313; a=rsa-sha256;
	cv=none;
	b=byTTA2VOBi1fY5Uh6lRoiB7MDN9rhj3B/+I0XNyeV3evo4n1RZ/iky/eA2vbM3msJ7RkrR
	KP6wRDePSm2gKXyCyTTfKd08kRb+zeLLcQgnOqsJd/XWCcTun82hH2HIPhf0HrTeJoj/yn
	T2TxHidIJcjBCCE7lFyjVUpWj4SW3c8=
Received: from smtp.kernel.org (transwarp.subspace.kernel.org [100.75.92.58])
	by dfw.source.kernel.org (Postfix) with ESMTP id C25875C5A2D;
	Thu, 27 Feb 2025 22:42:55 +0000 (UTC)
Received: by smtp.kernel.org (Postfix) with ESMTPS id 47134C4CEE5;
	Thu, 27 Feb 2025 22:45:12 +0000 (UTC)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org;
	s=k20201202; t=1740696312;
	bh=oJkzUc1eO6G6pDf5SWNcF21d4R6h8JEKsl5niSERdXM=;
	h=From:Date:Subject:References:In-Reply-To:To:Cc:Reply-To:From;
	b=ReviaM9yL+36eqR29e6Vy+s7kRMgf6UR2DuPXer6kYpF66TbEH8IrrMV/pIKDRZgQ
	 uxWGq/oaUO4smsW4N8sgNwJ/kzKWuFe4PgYWKM3M4ej2Sx4u8qSjwULkJ6zQSil0Su
	 aJM49uQc/djfCltKu3NRTkStM8oA4D4PFi+OmLa4zdy2awpUtnW+9dexgdg74yqjxo
	 J+/rKLyJQKOclSVXJY/mYx3y8Gzc0GIZrJlf7K0VD92zPA8i2Y0ztc06GfaVrAQFLE
	 fcltTlj2HA+NqTuf5Mv7QneE6o4IcWlPDAcROIahDrHFUWQeobE2Icb9mV6YuVj9gX
	 VdCaZ5LiAT0ag==
Received: from aws-us-west-2-korg-lkml-1.web.codeaurora.org
 (localhost.localdomain [127.0.0.1])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 376BAC282C5;
	Thu, 27 Feb 2025 22:45:12 +0000 (UTC)
From: Thomas Prescher via B4 Relay
 <devnull+thomas.prescher.cyberus-technology.de@kernel.org>
Date: Thu, 27 Feb 2025 23:45:05 +0100
Subject: [PATCH v2 1/3] mm: hugetlb: improve parallel huge page allocation
 time
MIME-Version: 1.0
Message-Id: 
 <20250227-hugepage-parameter-v2-1-7db8c6dc0453@cyberus-technology.de>
References: 
 <20250227-hugepage-parameter-v2-0-7db8c6dc0453@cyberus-technology.de>
In-Reply-To: 
 <20250227-hugepage-parameter-v2-0-7db8c6dc0453@cyberus-technology.de>
To: Jonathan Corbet <corbet@lwn.net>, Muchun Song <muchun.song@linux.dev>,
 Andrew Morton <akpm@linux-foundation.org>
Cc: linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, Thomas Prescher <thomas.prescher@cyberus-technology.de>
X-Mailer: b4 0.14.2
X-Developer-Signature: v=1; a=ed25519-sha256; t=1740696310; l=3225;
 i=thomas.prescher@cyberus-technology.de; s=20250221;
 h=from:subject:message-id;
 bh=J38/TB8riQAZ4fzuSMdQ3k6+xa/EprG/GBf8GbXV050=;
 b=N4v8H7XB40hwjOeVzS9b78vwFut9BRoOByegtj+ANmDvtJNJUb0+/dyY3IgOtGis0NqPG1rj7
 K9li2G7PWLgD7gQQ4vqzz3BzxbBK12dZtF1nd1XSS1XJkld/sxlzqqQ
X-Developer-Key: i=thomas.prescher@cyberus-technology.de; a=ed25519;
 pk=T5MVdLVCc/0UUyv5IcSqGVvGcVkgWW/KtuEo2RRJwM8=
X-Endpoint-Received: by B4 Relay for
 thomas.prescher@cyberus-technology.de/20250221 with auth_id=345
X-Original-From: Thomas Prescher <thomas.prescher@cyberus-technology.de>
Reply-To: thomas.prescher@cyberus-technology.de
X-Stat-Signature: gpgsxf6b33rejn4pzcgb3brtai4xp9eg
X-Rspamd-Queue-Id: 90F00140008
X-Rspamd-Server: rspam06
X-Rspam-User: 
X-HE-Tag: 1740696313-980027
X-HE-Meta: 
 U2FsdGVkX19E8jIsbk52Ddcnio8Qzoe4OVUcC3m1Km2OHtAD1jT8ePCN5LbqpVRtt3wE/kJKoz0OqtUK7rRj9OAoZ+hdfRZ8SEFYVrj6iq1XclRZ0OvR0Wi/gs7P9kwtXqg2A/xniskjohALqHS2yPx+OrLmmZzb4XYX0JxtmdklfTI1jpMoVClNRvcJ3XELA7Wj9sSRUL/YDJxr8x6iC2gMGaq7R5tLvj0g5beaqfF4CxQlvf7eptvcLlqdv7o9S12DVtVRv1ozQ181xzTMk9TCnxZglNTm5MXeOicWgpxANQfK4tCWtd+BWg6mWAhlmiD9slzG0yj0dYbc3DlpKGOoKmm1in1+Ysdxws1/DOAzDyvajUq7EjoKMfRt1hD7DHiN0ODvf9IIq7L/XGAA2Sa+L7Tm/tNE1yv5dMvvpJwEsiUHCqbQZaIUsIEh5to5eBQo6Vu8bEQpAK36/BbJjoopCGzxuu4OSveqbl13LBxbXN1ZVThWVcGdRWl6BE/EKnX0AIHzyam9zmtbiaKoNb28H6yXJF9ofq7x6iRHwci0yGB3Emf9bZOcIx5EwYl0II0YtE74Gftaf8/Ar9PYcz3iwoUTsF75uC0dSzmu3byv74DNTM2qyCf5jUlkd22cYEiAababWHWlcEZQ94pst2Bhj4eVqzcpKE3T456m7f6WmXMwWJMj+sRyQA4ulRAsZROi+QMSqI77xSisoDGM3jRfDoZsVlfVL6uLN8EDJfH2tEHUY7Q3aCXYUwF8ph6j4byUutq9QPepWuUIGBYcP0arOIcWPcCArWcN3lRR2pTMpcqbcvbaRaB08wzpW077FTxg1mhAhzmP1QJ4s+rzi3MOBvIYdtmEOdTnlINh3i32yiu36tWLFZ0IKWaxvRnjil3suH+uomwCVBaQ40+q5v5LjhF3ljsUPgOOudiNth5ukHKAV8A9R0mS7lvb7O3NZIMLD//TkvuDow3KJ5K
 70Gj2SOa
 1EtS2GMNUMas5R5aNjjz5rVW3PvPs6wKsFgEccJxVDA7+mmBUZ9mx7PQqQhqoRaDGK+0AZ0pysFmR/YlcRmKK1di/wf+lfZUQ0xnb4pp3Io3M2qCH1zPbC4PFjVGGCI1vAoiPFgZP19icn3Ta0XieiKnnc9WQQ2hB4vWtEQwGz8WiMS6Uc92hHAk25IebDkWH73y03A0sLWxu+SDGB2WUKCXo6+qnddgqYCfk0afU59rVImUPpGXF2xpNd5mV9Vb5WeJYSBc0lF41KI0njnYC/YkR7bJH2yDKkwPFix3QBd4dIBhPlHMSXCnIzsyMEpRHjy706M6TtOZ9RjqNU53G/IXrUp08L6GPKDVjjzcHiexXmtXzs4j4q9Qnp6rin/gz8V5QPfKdws11oWM=
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

From: Thomas Prescher <thomas.prescher@cyberus-technology.de>

Before this patch, the kernel used a hard-coded value of 2 threads
per NUMA node for parallel huge page allocation at boot.

This patch changes that policy: the kernel now uses 25% of the
online CPUs (but at least one thread) for these allocations.

Signed-off-by: Thomas Prescher <thomas.prescher@cyberus-technology.de>
---
 mm/hugetlb.c | 34 ++++++++++++++++++----------------
 1 file changed, 18 insertions(+), 16 deletions(-)
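
For reviewers, the new sizing arithmetic can be sketched in userspace as
below. This is only an illustration under stated assumptions, not the
kernel code: alloc_threads() and min_chunk() are hypothetical helper
names that mirror max(num_online_cpus() / 4, 1) and
h->max_huge_pages / num_allocation_threads from the diff.

```c
#include <assert.h>

/*
 * Hypothetical helper (not in the kernel): a quarter of the online
 * CPUs, but never fewer than one thread. Mirrors
 * max(num_online_cpus() / 4, 1) from the patch.
 */
static unsigned int alloc_threads(unsigned int online_cpus)
{
	unsigned int n = online_cpus / 4;

	return n > 0 ? n : 1;
}

/*
 * Hypothetical helper (not in the kernel): smallest number of huge
 * pages handed to one thread. Mirrors
 * h->max_huge_pages / num_allocation_threads from the patch.
 */
static unsigned long min_chunk(unsigned long max_huge_pages,
			       unsigned int threads)
{
	assert(threads > 0);
	return max_huge_pages / threads;
}
```

With these helpers, a 144-CPU machine gets 36 threads and a 192-CPU
machine gets 48. 1 TiB of 2MiB huge pages is 524288 pages, so the
per-thread chunk would be 14563 and 10922 pages respectively.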

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 163190e89ea16450026496c020b544877db147d1..e9b1b3e2b9d467f067d54359e1401a03f9926108 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -14,9 +14,11 @@
 #include <linux/pagemap.h>
 #include <linux/mempolicy.h>
 #include <linux/compiler.h>
+#include <linux/cpumask.h>
 #include <linux/cpuset.h>
 #include <linux/mutex.h>
 #include <linux/memblock.h>
+#include <linux/minmax.h>
 #include <linux/sysfs.h>
 #include <linux/slab.h>
 #include <linux/sched/mm.h>
@@ -3427,31 +3429,31 @@ static unsigned long __init hugetlb_pages_alloc_boot(struct hstate *h)
 		.numa_aware	= true
 	};
 
+	unsigned int num_allocation_threads = max(num_online_cpus() / 4, 1);
+
 	job.thread_fn	= hugetlb_pages_alloc_boot_node;
 	job.start	= 0;
 	job.size	= h->max_huge_pages;
 
 	/*
-	 * job.max_threads is twice the num_node_state(N_MEMORY),
+	 * job.max_threads is 25% of the online CPUs (at least 1) by default.
 	 *
-	 * Tests below indicate that a multiplier of 2 significantly improves
-	 * performance, and although larger values also provide improvements,
-	 * the gains are marginal.
+	 * On large servers with terabytes of memory, huge page allocation
+	 * can consume a considerable amount of time.
 	 *
-	 * Therefore, choosing 2 as the multiplier strikes a good balance between
-	 * enhancing parallel processing capabilities and maintaining efficient
-	 * resource management.
+	 * Tests below show how long it takes to allocate 1 TiB of memory with
+	 * 2MiB huge pages. Using more threads can significantly improve allocation time.
 	 *
-	 * +------------+-------+-------+-------+-------+-------+
-	 * | multiplier |   1   |   2   |   3   |   4   |   5   |
-	 * +------------+-------+-------+-------+-------+-------+
-	 * | 256G 2node | 358ms | 215ms | 157ms | 134ms | 126ms |
-	 * | 2T   4node | 979ms | 679ms | 543ms | 489ms | 481ms |
-	 * | 50G  2node | 71ms  | 44ms  | 37ms  | 30ms  | 31ms  |
-	 * +------------+-------+-------+-------+-------+-------+
+	 * +-----------------------+-------+-------+-------+-------+-------+
+	 * | threads               |   8   |   16  |   32  |   64  |   128 |
+	 * +-----------------------+-------+-------+-------+-------+-------+
+	 * | skylake      144 cpus |   44s |   22s |   16s |   19s |   20s |
+	 * | cascade lake 192 cpus |   39s |   20s |   11s |   10s |    9s |
+	 * +-----------------------+-------+-------+-------+-------+-------+
 	 */
-	job.max_threads	= num_node_state(N_MEMORY) * 2;
-	job.min_chunk	= h->max_huge_pages / num_node_state(N_MEMORY) / 2;
+
+	job.max_threads	= num_allocation_threads;
+	job.min_chunk	= h->max_huge_pages / num_allocation_threads;
 	padata_do_multithreaded(&job);
 
 	return h->nr_huge_pages;