From patchwork Thu Jan 25 18:43:43 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Gregory Price <gourry.memverge@gmail.com>
X-Patchwork-Id: 13531498
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 8B91EC47258
	for <linux-mm@archiver.kernel.org>; Thu, 25 Jan 2024 18:44:05 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 1A6AA6B0078; Thu, 25 Jan 2024 13:44:05 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 12EF96B0096; Thu, 25 Jan 2024 13:44:05 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id F11406B0098; Thu, 25 Jan 2024 13:44:04 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0015.hostedemail.com
 [216.40.44.15])
	by kanga.kvack.org (Postfix) with ESMTP id D73CD6B0078
	for <linux-mm@kvack.org>; Thu, 25 Jan 2024 13:44:04 -0500 (EST)
Received: from smtpin08.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay02.hostedemail.com (Postfix) with ESMTP id 7C09D120E6B
	for <linux-mm@kvack.org>; Thu, 25 Jan 2024 18:44:04 +0000 (UTC)
X-FDA: 81718708008.08.97FB44B
Received: from mail-pf1-f195.google.com (mail-pf1-f195.google.com
 [209.85.210.195])
	by imf09.hostedemail.com (Postfix) with ESMTP id 97B7114002C
	for <linux-mm@kvack.org>; Thu, 25 Jan 2024 18:44:02 +0000 (UTC)
Authentication-Results: imf09.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b="Spye/DAz";
	spf=pass (imf09.hostedemail.com: domain of gourry.memverge@gmail.com
 designates 209.85.210.195 as permitted sender)
 smtp.mailfrom=gourry.memverge@gmail.com;
	dmarc=pass (policy=none) header.from=gmail.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed;
 d=hostedemail.com;
	s=arc-20220608; t=1706208242;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=VzGhMbeDHhADN16eti5ZGSWn33ZQML077ijAcKWfxyI=;
	b=RsmJJI6TOXbLeAfbrYXNbqReqy4XC2oH05IOQRF58YPVSa9CzaPN6y5HWYnRr2M4BhHfpc
	CimEk8H+nD2OvLAWJ/IoDrAS7Kq5NYRdIOFLr2zgUEK8KPWdjTU5UOPaHZ0/4uU1CzL63S
	Cdi6Z6tA5cXrLbX+qJLHMWh0WFpd3XI=
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1706208242; a=rsa-sha256;
	cv=none;
	b=SxcM3Gks7Ncs95lcuG/kgnzRzDXxwzjAXfj9d3BRrRzDQRceGIdO+SnaI2LuR5k8Zr0Bvq
	R9JpM2K0S9Uc09/zNq2ooFnyam8aki8V9Lkq1NZeO74RG5OanHWP+wgsCs4szrFI9s6ToG
	WPIM46DeVETfQcavuX1bWVA+CcOpubc=
ARC-Authentication-Results: i=1;
	imf09.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b="Spye/DAz";
	spf=pass (imf09.hostedemail.com: domain of gourry.memverge@gmail.com
 designates 209.85.210.195 as permitted sender)
 smtp.mailfrom=gourry.memverge@gmail.com;
	dmarc=pass (policy=none) header.from=gmail.com
Received: by mail-pf1-f195.google.com with SMTP id
 d2e1a72fcca58-6dd87e7c355so10721b3a.0
        for <linux-mm@kvack.org>; Thu, 25 Jan 2024 10:44:02 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1706208241; x=1706813041; darn=kvack.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=VzGhMbeDHhADN16eti5ZGSWn33ZQML077ijAcKWfxyI=;
        b=Spye/DAzCGv/232sllAKvgsmwwKDcYNFdhhVoynj4aFiwuu1t7oB8aLVmJKW2r/jIa
         uBwhJbFDbSuD7HvhCjyY2NfCsvt8OTU9wyurrpA5TcAsmk+7xvZVJgrQvP95k5iXTVwU
         bhVKdxhwg1vvFoVwDTd71mWeG8pU3HW8C/L47tTMaVk5PCoQE6zudTZ4mkGFrjG8vjHR
         CgMoNsJRKzjKSsAUZWR/lY58V2S+lxtK0NMZ/iq2d9jsJlFWlnwoZ/+QK4rWaebzNLL8
         h1ZGhE4MQzGal8y349bhBynCMD9Rz8vzPYG4ilojUjrUaUmzl5yn3UrGBWbK9IvprVlO
         MelQ==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1706208241; x=1706813041;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=VzGhMbeDHhADN16eti5ZGSWn33ZQML077ijAcKWfxyI=;
        b=E9mowVYLVzN6sutCSJj559uRFGtnDYGiTbVPY/jhuIdwNrhC14aVoywT3rCnmGCU5y
         xEL8rawxeao29g/vT19Zug4sem4KJWwxZwexAMmgOqOZK0OeD695LFfBCbXl46dxev3Q
         MMBbFXhf0dOxqURtnmCdqLzbnwqbIrMLXBsdjVZ9cAAD7X/dnXyUAy/EV/TKunA+Ilbv
         iztlnIPztDbe0qo6g/WC3bypycPi9mBtXLIYD42KfLLcQwtlwuRKCnDLvnYukkZBr251
         U7qsI59Zn23Bdu59KaQN+RDBpBQAPb0KylUJ3GYtpW75NMteXN9Zi6n57FDFmhypJH2i
         2rBA==
X-Gm-Message-State: AOJu0Yw/owPdjh1ln1myROrTef9AmFf9uc9heJTbTXu/l2xtNQTsmCiO
	KMluD9J3WtXMMMSTFU60D9eiR7hnSQmB0Qk7XqAhcumP2D3RqwMuOnBbDXT6b67H
X-Google-Smtp-Source: 
 AGHT+IEcH1Tk+08R16sVhPM8oB55q9tIFfKiHt4xwLriTagfM09GAnko6cUAoLV1xiWOuP2rG1zkBg==
X-Received: by 2002:a05:6a20:1e52:b0:194:f8dd:4277 with SMTP id
 cy18-20020a056a201e5200b00194f8dd4277mr81192pzb.106.1706208241161;
        Thu, 25 Jan 2024 10:44:01 -0800 (PST)
Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net.
 [173.79.56.208])
        by smtp.gmail.com with ESMTPSA id
 p14-20020aa7860e000000b006ddcf56fb78sm1815070pfn.62.2024.01.25.10.43.58
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 25 Jan 2024 10:44:00 -0800 (PST)
From: Gregory Price <gourry.memverge@gmail.com>
X-Google-Original-From: Gregory Price <gregory.price@memverge.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org,
	corbet@lwn.net,
	akpm@linux-foundation.org,
	gregory.price@memverge.com,
	honggyu.kim@sk.com,
	rakie.kim@sk.com,
	hyeongtak.ji@sk.com,
	mhocko@kernel.org,
	ying.huang@intel.com,
	vtavarespetr@micron.com,
	jgroves@micron.com,
	ravis.opensrc@micron.com,
	sthanneeru@micron.com,
	emirakhur@micron.com,
	Hasan.Maruf@amd.com,
	seungjun.ha@samsung.com,
	hannes@cmpxchg.org,
	dan.j.williams@intel.com
Subject: [PATCH v3 2/4] mm/mempolicy: refactor a read-once mechanism into a
 function for re-use
Date: Thu, 25 Jan 2024 13:43:43 -0500
Message-Id: <20240125184345.47074-3-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20240125184345.47074-1-gregory.price@memverge.com>
References: <20240125184345.47074-1-gregory.price@memverge.com>
MIME-Version: 1.0
X-Stat-Signature: 4bnmtuckj1qonemy8jutqrtgfmatdmya
X-Rspamd-Server: rspam10
X-Rspamd-Queue-Id: 97B7114002C
X-Rspam-User: 
X-HE-Tag: 1706208242-723468
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000172, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

Move the barrier()-protected copy of pol->nodes onto the stack into a
helper function, `read_once_policy_nodemask`, so that it may be re-used.
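
As an illustration only, the snapshot pattern can be sketched in
userspace C.  Everything named demo_* below is hypothetical: the struct
stands in for nodemask_t, and demo_barrier() is a GCC/Clang compiler
barrier assumed to play the role of the kernel's barrier().

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for nodemask_t: bit N set means node N is in the mask. */
struct demo_nodemask {
	uint64_t bits;
};

/* Compiler-only barrier (GCC/Clang syntax); an assumed analogue of barrier(). */
#define demo_barrier() __asm__ __volatile__("" ::: "memory")

/*
 * Copy the shared mask into caller-provided storage and return the number
 * of set bits.  Callers iterate over the private copy, never the shared one.
 */
static unsigned int demo_read_once_nodemask(const struct demo_nodemask *shared,
					    struct demo_nodemask *snap)
{
	unsigned int weight = 0;
	uint64_t b;

	assert(shared && snap);
	demo_barrier();
	memcpy(snap, shared, sizeof(*snap));
	demo_barrier();

	/* Count set bits in the snapshot by clearing the lowest one each pass. */
	for (b = snap->bits; b; b &= b - 1)
		weight++;
	return weight;
}
```

Later writes to the shared mask do not affect the caller's copy, which
is the property the interleave code relies on while walking the nodes.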

Suggested-by: Huang Ying <ying.huang@intel.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
---
 mm/mempolicy.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f1627d45b0c8..b13c45a0bfcb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1894,6 +1894,20 @@ unsigned int mempolicy_slab_node(void)
 	}
 }
 
+static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
+					      nodemask_t *mask)
+{
+	/*
+	 * barrier stabilizes the nodemask locally so that it can be iterated
+	 * over safely without concern for changes. Allocators validate node
+	 * selection does not violate mems_allowed, so this is safe.
+	 */
+	barrier();
+	memcpy(mask, &pol->nodes, sizeof(nodemask_t));
+	barrier();
+	return nodes_weight(*mask);
+}
+
 /*
  * Do static interleaving for interleave index @ilx.  Returns the ilx'th
  * node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1901,20 +1915,12 @@ unsigned int mempolicy_slab_node(void)
  */
 static unsigned int interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
-	nodemask_t nodemask = pol->nodes;
+	nodemask_t nodemask;
 	unsigned int target, nnodes;
 	int i;
 	int nid;
-	/*
-	 * The barrier will stabilize the nodemask in a register or on
-	 * the stack so that it will stop changing under the code.
-	 *
-	 * Between first_node() and next_node(), pol->nodes could be changed
-	 * by other threads. So we put pol->nodes in a local stack.
-	 */
-	barrier();
 
-	nnodes = nodes_weight(nodemask);
+	nnodes = read_once_policy_nodemask(pol, &nodemask);
 	if (!nnodes)
 		return numa_node_id();
 	target = ilx % nnodes;

From patchwork Thu Jan 25 18:43:44 2024
Content-Type: text/plain; charset="utf-8"
MIME-Version: 1.0
Content-Transfer-Encoding: 7bit
X-Patchwork-Submitter: Gregory Price <gourry.memverge@gmail.com>
X-Patchwork-Id: 13531499
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id A422CC47422
	for <linux-mm@archiver.kernel.org>; Thu, 25 Jan 2024 18:44:09 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 3F00C8D0002; Thu, 25 Jan 2024 13:44:09 -0500 (EST)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 377CE6B0099; Thu, 25 Jan 2024 13:44:09 -0500 (EST)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 1A60C8D0002; Thu, 25 Jan 2024 13:44:09 -0500 (EST)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0013.hostedemail.com
 [216.40.44.13])
	by kanga.kvack.org (Postfix) with ESMTP id 01CD36B0098
	for <linux-mm@kvack.org>; Thu, 25 Jan 2024 13:44:08 -0500 (EST)
Received: from smtpin24.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay06.hostedemail.com (Postfix) with ESMTP id 9C327A2334
	for <linux-mm@kvack.org>; Thu, 25 Jan 2024 18:44:08 +0000 (UTC)
X-FDA: 81718708176.24.67A0BF8
Received: from mail-pf1-f195.google.com (mail-pf1-f195.google.com
 [209.85.210.195])
	by imf06.hostedemail.com (Postfix) with ESMTP id 9F7E518001C
	for <linux-mm@kvack.org>; Thu, 25 Jan 2024 18:44:06 +0000 (UTC)
Authentication-Results: imf06.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=KgI9MbZz;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf06.hostedemail.com: domain of gourry.memverge@gmail.com
 designates 209.85.210.195 as permitted sender)
 smtp.mailfrom=gourry.memverge@gmail.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed;
 d=hostedemail.com;
	s=arc-20220608; t=1706208246;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-transfer-encoding:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=XCM3+Jwfhpm+dmL8fIjOhTZIVTHnDTgH9yjGWopoEEE=;
	b=mj0IcBr0csWBTD/JknbsA73J/qX/qVacOIaQshlqghtQ18oMCtG5je4UBb0xHIfxK73igs
	DPfxxWKbkWPU26E25leljIqozx/CcjwkVuFWUKyJul6GPG1Z/oeLIRsjJZ67Dgkvps/jIv
	iGLy9nO8zjMVPVeOW94ggt7Lcioui1Y=
ARC-Authentication-Results: i=1;
	imf06.hostedemail.com;
	dkim=pass header.d=gmail.com header.s=20230601 header.b=KgI9MbZz;
	dmarc=pass (policy=none) header.from=gmail.com;
	spf=pass (imf06.hostedemail.com: domain of gourry.memverge@gmail.com
 designates 209.85.210.195 as permitted sender)
 smtp.mailfrom=gourry.memverge@gmail.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1706208246; a=rsa-sha256;
	cv=none;
	b=seMLolUl8yameWEqnrwcMCyTGEp/Vl2QHduYw7Y9xk28NbIils0d+w0eebFlYb2HnnLouR
	00pOetRPezTczNAUJDCl8quNE0/x4nC4886/gwrRj76y0XQ34SPC7EXrvE6KqrhIEASZnC
	MhlD1REXLZYGX8WSCY1/+IqNx8OYiYU=
Received: by mail-pf1-f195.google.com with SMTP id
 d2e1a72fcca58-6ddb1115e82so1433593b3a.0
        for <linux-mm@kvack.org>; Thu, 25 Jan 2024 10:44:06 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=gmail.com; s=20230601; t=1706208245; x=1706813045; darn=kvack.org;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:from:to:cc:subject:date
         :message-id:reply-to;
        bh=XCM3+Jwfhpm+dmL8fIjOhTZIVTHnDTgH9yjGWopoEEE=;
        b=KgI9MbZzrecXtRQt6PJ5sNoT4t9Wo7/PsGlQzpqu9kRyismSUnUBcuzdjVrEdJdwg5
         ZHLUt9em+bz/neXDT/sKLb6i/Fp/AjkBXgpOTaECDPN9rBD+h1Nws4R3/BwqUKhMwDKi
         5LAQCUj/sy3apQOsTJGzAE77JUxoThL2rNeSetdhtwx5qr99GJhmeUE/7lw9ae1AhPJK
         IHPjlcvNELxzQaElX0/iEcTBxYze0PNMLEDvcbj08LH+JbvRAyRvH5+bkQxQv2CxVyue
         /epu4vOCvU5xwIa9NbXLEetaespSMpjUvZEDVu0lLo57sHBCg59lSFg1k3vbCPPpA2dS
         h3Kw==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20230601; t=1706208245; x=1706813045;
        h=content-transfer-encoding:mime-version:references:in-reply-to
         :message-id:date:subject:cc:to:from:x-gm-message-state:from:to:cc
         :subject:date:message-id:reply-to;
        bh=XCM3+Jwfhpm+dmL8fIjOhTZIVTHnDTgH9yjGWopoEEE=;
        b=P+lSzBEq4FkE1bzdg81AAanNLdfumstsUAXFT5Okf7ruCqYkU+rAKJXKdDOyxd3Knn
         zRC7zKJQJ3xlexVAJdT7o8Aa+LgJlRAPWE2q2HuR5nTWnlcfqVpzHecFllmitdwfrDyL
         B7ZRfYPaDAZnAXq1Byh9rwy7TrRi/eO1DMAWHyaWt9/Mu1jk93/LZaC7FpbGB8DaTkdB
         u4EDMxlHaiyka8nikWF8FtTRJyVXIjccY1o1Umhaf5LvAkNb21Xo7P6tSgTe33eBnY+5
         gpmKnnj9M5v23Wkqru1rD1JZymHFnfB2isUBaeyK52hO1UXUePYJr2TDlWP1LdOGEBl/
         Je/g==
X-Gm-Message-State: AOJu0YxgbtUfCcK9r54zs+DtrPSwJFcS48VyWlFFF6jIjGd/VCINGkXB
	DdNx7UzfxlZxgoMMuhzp5ueIu4SXZAqAGpVbN+KioIoi86yxKKxoVUyBkQ+CeiHG
X-Google-Smtp-Source: 
 AGHT+IG8IKF2LbF/tulleNXQK9a0buVlgJX+T6ILRE02yC4Dg04nb88+bqtRLNJHNefgetU7IyONPg==
X-Received: by 2002:a05:6a00:1d13:b0:6db:cdbc:311e with SMTP id
 a19-20020a056a001d1300b006dbcdbc311emr167814pfx.61.1706208245137;
        Thu, 25 Jan 2024 10:44:05 -0800 (PST)
Received: from fedora.mshome.net (pool-173-79-56-208.washdc.fios.verizon.net.
 [173.79.56.208])
        by smtp.gmail.com with ESMTPSA id
 p14-20020aa7860e000000b006ddcf56fb78sm1815070pfn.62.2024.01.25.10.44.02
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Thu, 25 Jan 2024 10:44:04 -0800 (PST)
From: Gregory Price <gourry.memverge@gmail.com>
X-Google-Original-From: Gregory Price <gregory.price@memverge.com>
To: linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org,
	linux-doc@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	linux-api@vger.kernel.org,
	corbet@lwn.net,
	akpm@linux-foundation.org,
	gregory.price@memverge.com,
	honggyu.kim@sk.com,
	rakie.kim@sk.com,
	hyeongtak.ji@sk.com,
	mhocko@kernel.org,
	ying.huang@intel.com,
	vtavarespetr@micron.com,
	jgroves@micron.com,
	ravis.opensrc@micron.com,
	sthanneeru@micron.com,
	emirakhur@micron.com,
	Hasan.Maruf@amd.com,
	seungjun.ha@samsung.com,
	hannes@cmpxchg.org,
	dan.j.williams@intel.com,
	Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Subject: [PATCH v3 3/4] mm/mempolicy: introduce MPOL_WEIGHTED_INTERLEAVE for
 weighted interleaving
Date: Thu, 25 Jan 2024 13:43:44 -0500
Message-Id: <20240125184345.47074-4-gregory.price@memverge.com>
X-Mailer: git-send-email 2.39.1
In-Reply-To: <20240125184345.47074-1-gregory.price@memverge.com>
References: <20240125184345.47074-1-gregory.price@memverge.com>
MIME-Version: 1.0
X-Rspam-User: 
X-Rspamd-Server: rspam12
X-Rspamd-Queue-Id: 9F7E518001C
X-Stat-Signature: ad44hsfhazkdiaenw3mytmtnkpp3cjyf
X-HE-Tag: 1706208246-653558
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>
List-Subscribe: <mailto:majordomo@kvack.org>
List-Unsubscribe: <mailto:majordomo@kvack.org>

When a system has multiple NUMA nodes and a workload becomes bandwidth
hungry, using the existing MPOL_INTERLEAVE policy can be a wise option.

However, if those NUMA nodes consist of different types of memory such
as socket-attached DRAM and CXL/PCIe attached DRAM, the round-robin
based interleave policy does not optimally distribute data to make use
of their different bandwidth characteristics.

Instead, interleave is more effective when the allocation policy follows
each NUMA node's bandwidth weight rather than a simple 1:1 distribution.

This patch introduces a new memory policy, MPOL_WEIGHTED_INTERLEAVE,
enabling weighted interleave between NUMA nodes.  Weighted interleave
allows for proportional distribution of memory across multiple NUMA
nodes, preferably apportioned to match the bandwidth of each node.

For example, if a system has 1 CPU node (0), and 2 memory nodes (0,1),
with bandwidth of (100GB/s, 50GB/s) respectively, the appropriate
weight distribution is (2:1).
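
One way to arrive at such a ratio is to divide every bandwidth by their
greatest common divisor.  The sketch below is illustrative and not part
of this patch; the demo_* names are hypothetical, and it assumes all
bandwidths are nonzero and expressed in the same unit.

```c
#include <assert.h>
#include <stddef.h>

/* Euclid's algorithm; demo_gcd(0, x) returns x. */
static unsigned int demo_gcd(unsigned int a, unsigned int b)
{
	while (b) {
		unsigned int t = a % b;

		a = b;
		b = t;
	}
	return a;
}

/*
 * Reduce per-node bandwidths to the smallest integer ratio.
 * weights[] receives one entry per node.
 */
static void demo_bandwidth_to_weights(const unsigned int *bw,
				      unsigned int *weights, size_t n)
{
	unsigned int g = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(bw[i] > 0);
		g = demo_gcd(g, bw[i]);
	}
	for (i = 0; i < n; i++)
		weights[i] = bw[i] / g;
}
```

For bandwidths of (100, 50) this yields (2, 1).  Note that the policy in
this patch tracks weights as u8, so a reduced ratio may still need to be
scaled down before use.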

Weights for each node can be assigned via the new sysfs extension:
/sys/kernel/mm/mempolicy/weighted_interleave/

For now, the default value of all nodes will be `1`, which matches
the behavior of standard 1:1 round-robin interleave. An extension
will be added in the future to allow default values to be registered
at kernel and device bringup time.

The policy allocates a number of pages equal to the set weights. For
example, if the weights are (2,1), then 2 pages will be allocated on
node0 for every 1 page allocated on node1.
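
The resulting allocation order can be modelled with a small counter, in
the spirit of weighted_interleave_nodes().  This is a simplified
userspace sketch: the demo_* names are hypothetical, nodes are plain
array indices rather than a nodemask, and a weight of 0 is treated as 1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-task state: current node and pages left on it. */
struct demo_il_state {
	size_t node;		/* index into weights[] */
	unsigned int remaining;	/* pages still owed to 'node'; 0 = not loaded */
};

/* Return the node for the next single-page allocation and advance the state. */
static size_t demo_next_node(struct demo_il_state *st,
			     const unsigned char *weights, size_t nr_nodes)
{
	size_t node;

	assert(nr_nodes > 0);
	/* Load the weight of the current node when starting on it. */
	if (!st->remaining)
		st->remaining = weights[st->node] ? weights[st->node] : 1;

	node = st->node;
	/* Account for this allocation; move on once the weight is used up. */
	if (--st->remaining == 0)
		st->node = (st->node + 1) % nr_nodes;
	return node;
}
```

With weights (2,1) and a zero-initialized state, successive calls return
nodes 0, 0, 1, 0, 0, 1, and so on.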

The new mode MPOL_WEIGHTED_INTERLEAVE can be used in set_mempolicy(2)
and mbind(2).
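
A userspace caller might request the policy roughly as follows.  This is
a sketch under assumptions: the numeric value 6 is inferred from the
position of MPOL_WEIGHTED_INTERLEAVE in the uapi enum and should be
taken from <linux/mempolicy.h> once installed headers carry it, the
demo_* helpers are hypothetical, and the raw syscall is used because
library wrappers may not know the mode yet.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Assumption: enum position of MPOL_WEIGHTED_INTERLEAVE in the uapi header. */
#define DEMO_MPOL_WEIGHTED_INTERLEAVE 6

/* Build a single-word nodemask from a list of node ids. */
static unsigned long demo_nodemask(const unsigned int *nodes, size_t n)
{
	unsigned long mask = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		assert(nodes[i] < 8 * sizeof(mask));
		mask |= 1UL << nodes[i];
	}
	return mask;
}

/*
 * Apply the policy to the calling thread.  Returns 0 on success, or -1
 * with errno set (for example EINVAL on kernels without this mode).
 */
static long demo_set_weighted_interleave(unsigned long mask)
{
	return syscall(SYS_set_mempolicy, DEMO_MPOL_WEIGHTED_INTERLEAVE,
		       &mask, 8 * sizeof(mask));
}
```

For nodes 0 and 1, demo_nodemask() produces 0x3, which is then passed to
demo_set_weighted_interleave().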

There are 3 integration points:

weighted_interleave_nodes:
    Counts allocations as they occur and decrements the remaining
    weight of the current node.  When that weight reaches 0, switches
    to the next node.

weighted_interleave_nid:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the node based on the given index.

alloc_pages_bulk_array_weighted_interleave:
    Gets the total weight of the nodemask as well as each individual
    node weight, then calculates the number of "interleave rounds" as
    well as any delta ("partial round").  Calculates the number of
    pages for each node and allocates them.

    If a node was already selected via weighted_interleave_nodes, its
    remaining weight (pol->cur_il_weight) is allocated first, before
    the rest of the bulk calculation is done.
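
The rounds/delta arithmetic of the bulk path can be sketched as below.
This is not the kernel implementation: demo_bulk_split is a hypothetical
helper that assumes a fresh start at the first node (no carried-over
cur_il_weight) and nonzero weights.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Split nr_pages across nodes in proportion to weights[].
 * node_pages[i] receives the page count for node i.
 */
static void demo_bulk_split(const unsigned char *weights, size_t nr_nodes,
			    unsigned long nr_pages, unsigned long *node_pages)
{
	unsigned long weight_total = 0, rounds, delta;
	size_t i;

	for (i = 0; i < nr_nodes; i++)
		weight_total += weights[i];
	assert(weight_total > 0);

	rounds = nr_pages / weight_total;	/* complete interleave rounds */
	delta = nr_pages % weight_total;	/* pages in the partial round */

	for (i = 0; i < nr_nodes; i++) {
		/* The partial round fills nodes in order, up to each weight. */
		unsigned long extra = delta < weights[i] ? delta : weights[i];

		node_pages[i] = rounds * weights[i] + extra;
		delta -= extra;
	}
}
```

With weights (2,1) and 10 pages there are 3 full rounds and a delta of
1, giving 7 pages on node0 and 3 on node1.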

One piece of complexity is the interaction with a recent refactor, which
split the logic that acquires the "ilx" (interleave index) of an
allocation from the actual application of the interleave.  The
interleave index is calculated by `get_vma_policy()`, while the node is
selected later by the relevant weighted_interleave function.
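
The two stages can be sketched separately in userspace C.  The demo_*
names are hypothetical, DEMO_PAGE_SHIFT assumes 4 KiB pages, nodes are
plain array indices, and every weight is assumed to be nonzero.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SHIFT 12	/* assumption: 4 KiB base pages */

/* Stage 1: derive the interleave index from the mapping offset and address. */
static unsigned long demo_ilx(unsigned long vm_pgoff, unsigned long vm_start,
			      unsigned long addr, unsigned int order)
{
	return (vm_pgoff >> order) +
	       ((addr - vm_start) >> (DEMO_PAGE_SHIFT + order));
}

/* Stage 2: walk the weights until the index falls inside a node's share. */
static size_t demo_weighted_nid(const unsigned char *weights, size_t nr_nodes,
				unsigned long ilx)
{
	unsigned long weight_total = 0, target;
	size_t i;

	for (i = 0; i < nr_nodes; i++)
		weight_total += weights[i];
	assert(weight_total > 0);

	target = ilx % weight_total;
	for (i = 0; i < nr_nodes; i++) {
		if (target < weights[i])
			return i;
		target -= weights[i];
	}
	return 0;	/* not reached when weight_total > 0 */
}
```

Because the node depends only on the index and the weights, the same
offset in a mapping resolves to the same node as long as the weights do
not change.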

Suggested-by: Hasan Al Maruf <Hasan.Maruf@amd.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Co-developed-by: Honggyu Kim <honggyu.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Co-developed-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Signed-off-by: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
Co-developed-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
Signed-off-by: Ravi Jonnalagadda <ravis.opensrc@micron.com>
---
 .../admin-guide/mm/numa_memory_policy.rst     |   9 +
 include/linux/mempolicy.h                     |   3 +
 include/uapi/linux/mempolicy.h                |   1 +
 mm/mempolicy.c                                | 274 +++++++++++++++++-
 4 files changed, 283 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/mm/numa_memory_policy.rst b/Documentation/admin-guide/mm/numa_memory_policy.rst
index eca38fa81e0f..a70f20ce1ffb 100644
--- a/Documentation/admin-guide/mm/numa_memory_policy.rst
+++ b/Documentation/admin-guide/mm/numa_memory_policy.rst
@@ -250,6 +250,15 @@ MPOL_PREFERRED_MANY
 	can fall back to all existing numa nodes. This is effectively
 	MPOL_PREFERRED allowed for a mask rather than a single node.
 
+MPOL_WEIGHTED_INTERLEAVE
+	This mode operates the same as MPOL_INTERLEAVE, except that
+	interleaving behavior is executed based on weights set in
+	/sys/kernel/mm/mempolicy/weighted_interleave/
+
+	Weighted interleave allocates pages on nodes according to a
+	weight.  For example if nodes [0,1] are weighted [5,2], 5 pages
+	will be allocated on node0 for every 2 pages allocated on node1.
+
 NUMA memory policy supports the following optional mode flags:
 
 MPOL_F_STATIC_NODES
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..c644d7bbd396 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -54,6 +54,9 @@ struct mempolicy {
 		nodemask_t cpuset_mems_allowed;	/* relative to these nodes */
 		nodemask_t user_nodemask;	/* nodemask passed by user */
 	} w;
+
+	/* Weighted interleave settings */
+	u8 cur_il_weight;
 };
 
 /*
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index a8963f7ef4c2..1f9bb10d1a47 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -23,6 +23,7 @@ enum {
 	MPOL_INTERLEAVE,
 	MPOL_LOCAL,
 	MPOL_PREFERRED_MANY,
+	MPOL_WEIGHTED_INTERLEAVE,
 	MPOL_MAX,	/* always last member of enum */
 };
 
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b13c45a0bfcb..5a517511658e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -19,6 +19,13 @@
  *                for anonymous memory. For process policy an process counter
  *                is used.
  *
+ * weighted interleave
+ *                Allocate memory interleaved over a set of nodes based on
+ *                a set of weights (per-node), with normal fallback if it
+ *                fails.  Otherwise operates the same as interleave.
+ *                Example: nodeset(0,1) & weights (2,1) - 2 pages allocated
+ *                on node 0 for every 1 page allocated on node 1.
+ *
  * bind           Only allocate memory on a specific set of nodes,
  *                no fallback.
  *                FIXME: memory is allocated starting with the first node
@@ -314,6 +321,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
 	policy->mode = mode;
 	policy->flags = flags;
 	policy->home_node = NUMA_NO_NODE;
+	policy->cur_il_weight = 0;
 
 	return policy;
 }
@@ -426,6 +434,10 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
 		.create = mpol_new_nodemask,
 		.rebind = mpol_rebind_preferred,
 	},
+	[MPOL_WEIGHTED_INTERLEAVE] = {
+		.create = mpol_new_nodemask,
+		.rebind = mpol_rebind_nodemask,
+	},
 };
 
 static bool migrate_folio_add(struct folio *folio, struct list_head *foliolist,
@@ -847,7 +859,8 @@ static long do_set_mempolicy(unsigned short mode, unsigned short flags,
 
 	old = current->mempolicy;
 	current->mempolicy = new;
-	if (new && new->mode == MPOL_INTERLEAVE)
+	if (new && (new->mode == MPOL_INTERLEAVE ||
+		    new->mode == MPOL_WEIGHTED_INTERLEAVE))
 		current->il_prev = MAX_NUMNODES-1;
 	task_unlock(current);
 	mpol_put(old);
@@ -873,6 +886,7 @@ static void get_policy_nodemask(struct mempolicy *pol, nodemask_t *nodes)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*nodes = pol->nodes;
 		break;
 	case MPOL_LOCAL:
@@ -957,6 +971,13 @@ static long do_get_mempolicy(int *policy, nodemask_t *nmask,
 		} else if (pol == current->mempolicy &&
 				pol->mode == MPOL_INTERLEAVE) {
 			*policy = next_node_in(current->il_prev, pol->nodes);
+		} else if (pol == current->mempolicy &&
+				(pol->mode == MPOL_WEIGHTED_INTERLEAVE)) {
+			if (pol->cur_il_weight)
+				*policy = current->il_prev;
+			else
+				*policy = next_node_in(current->il_prev,
+						       pol->nodes);
 		} else {
 			err = -EINVAL;
 			goto out;
@@ -1769,7 +1790,8 @@ struct mempolicy *__get_vma_policy(struct vm_area_struct *vma,
  * @vma: virtual memory area whose policy is sought
  * @addr: address in @vma for shared policy lookup
  * @order: 0, or appropriate huge_page_order for interleaving
- * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE
+ * @ilx: interleave index (output), for use only when MPOL_INTERLEAVE or
+ *       MPOL_WEIGHTED_INTERLEAVE
  *
  * Returns effective policy for a VMA at specified address.
  * Falls back to current->mempolicy or system default policy, as necessary.
@@ -1786,7 +1808,8 @@ struct mempolicy *get_vma_policy(struct vm_area_struct *vma,
 	pol = __get_vma_policy(vma, addr, ilx);
 	if (!pol)
 		pol = get_task_policy(current);
-	if (pol->mode == MPOL_INTERLEAVE) {
+	if (pol->mode == MPOL_INTERLEAVE ||
+	    pol->mode == MPOL_WEIGHTED_INTERLEAVE) {
 		*ilx += vma->vm_pgoff >> order;
 		*ilx += (addr - vma->vm_start) >> (PAGE_SHIFT + order);
 	}
@@ -1836,6 +1859,44 @@ bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone)
 	return zone >= dynamic_policy_zone;
 }
 
+static unsigned int weighted_interleave_nodes(struct mempolicy *policy)
+{
+	unsigned int node, next;
+	struct task_struct *me = current;
+	u8 __rcu *table;
+	u8 weight;
+
+	node = next_node_in(me->il_prev, policy->nodes);
+	if (node == MAX_NUMNODES)
+		return node;
+
+	/* on first alloc after setting mempolicy, acquire first weight */
+	if (unlikely(!policy->cur_il_weight)) {
+		rcu_read_lock();
+		table = rcu_dereference(iw_table);
+		/* detect system-default values */
+		weight = table ? table[node] : 1;
+		policy->cur_il_weight = weight ? weight : 1;
+		rcu_read_unlock();
+	}
+
+	/* account for this allocation call */
+	policy->cur_il_weight--;
+
+	/* if now at 0, move to next node and set up that node's weight */
+	if (unlikely(!policy->cur_il_weight)) {
+		me->il_prev = node;
+		next = next_node_in(node, policy->nodes);
+		rcu_read_lock();
+		table = rcu_dereference(iw_table);
+		/* detect system-default values */
+		weight = table ? table[next] : 1;
+		policy->cur_il_weight = weight ? weight : 1;
+		rcu_read_unlock();
+	}
+	return node;
+}
+
 /* Do dynamic interleaving for a process */
 static unsigned int interleave_nodes(struct mempolicy *policy)
 {
@@ -1870,6 +1931,9 @@ unsigned int mempolicy_slab_node(void)
 	case MPOL_INTERLEAVE:
 		return interleave_nodes(policy);
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		return weighted_interleave_nodes(policy);
+
 	case MPOL_BIND:
 	case MPOL_PREFERRED_MANY:
 	{
@@ -1908,6 +1972,39 @@ static unsigned int read_once_policy_nodemask(struct mempolicy *pol,
 	return nodes_weight(*mask);
 }
 
+static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
+{
+	nodemask_t nodemask;
+	unsigned int target, nr_nodes;
+	u8 __rcu *table;
+	unsigned int weight_total = 0;
+	u8 weight;
+	int nid;
+
+	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
+	if (!nr_nodes)
+		return numa_node_id();
+
+	rcu_read_lock();
+	table = rcu_dereference(iw_table);
+	/* calculate the total weight */
+	for_each_node_mask(nid, nodemask)
+		weight_total += table ? table[nid] : 1;
+
+	/* Calculate the node offset based on totals */
+	target = ilx % weight_total;
+	nid = first_node(nodemask);
+	while (target) {
+		weight = table ? table[nid] : 1;
+		if (target < weight)
+			break;
+		target -= weight;
+		nid = next_node_in(nid, nodemask);
+	}
+	rcu_read_unlock();
+	return nid;
+}
+
 /*
  * Do static interleaving for interleave index @ilx.  Returns the ilx'th
  * node in pol->nodes (starting from ilx=0), wrapping around if ilx
@@ -1968,6 +2065,11 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
 			interleave_nodes(pol) : interleave_nid(pol, ilx);
 		break;
+	case MPOL_WEIGHTED_INTERLEAVE:
+		*nid = (ilx == NO_INTERLEAVE_INDEX) ?
+			weighted_interleave_nodes(pol) :
+			weighted_interleave_nid(pol, ilx);
+		break;
 	}
 
 	return nodemask;
@@ -2029,6 +2131,7 @@ bool init_nodemask_of_mempolicy(nodemask_t *mask)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		*mask = mempolicy->nodes;
 		break;
 
@@ -2128,7 +2231,8 @@ struct page *alloc_pages_mpol(gfp_t gfp, unsigned int order,
 		 * If the policy is interleave or does not allow the current
 		 * node in its nodemask, we allocate the standard way.
 		 */
-		if (pol->mode != MPOL_INTERLEAVE &&
+		if ((pol->mode != MPOL_INTERLEAVE &&
+		    pol->mode != MPOL_WEIGHTED_INTERLEAVE) &&
 		    (!nodemask || node_isset(nid, *nodemask))) {
 			/*
 			 * First, try to allocate THP only on local node, but
@@ -2264,6 +2368,156 @@ static unsigned long alloc_pages_bulk_array_interleave(gfp_t gfp,
 	return total_allocated;
 }
 
+static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
+		struct mempolicy *pol, unsigned long nr_pages,
+		struct page **page_array)
+{
+	struct task_struct *me = current;
+	unsigned long total_allocated = 0;
+	unsigned long nr_allocated;
+	unsigned long rounds;
+	unsigned long node_pages, delta;
+	u8 weight, resume_weight;
+	u8 *table;
+	u8 *weights;
+	unsigned int weight_total = 0;
+	unsigned long rem_pages = nr_pages;
+	nodemask_t nodes;
+	int nnodes, node, resume_node, next_node;
+	int prev_node = me->il_prev;
+	int i;
+
+	if (!nr_pages)
+		return 0;
+
+	nnodes = read_once_policy_nodemask(pol, &nodes);
+	if (!nnodes)
+		return 0;
+
+	/* Continue allocating from most recent node and adjust the nr_pages */
+	if (pol->cur_il_weight) {
+		node = next_node_in(prev_node, nodes);
+		node_pages = pol->cur_il_weight;
+		if (node_pages > rem_pages)
+			node_pages = rem_pages;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		/*
+		 * If that's all the pages, there is no need to interleave;
+		 * otherwise set up the next interleave node/weight correctly.
+		 */
+		if (rem_pages < pol->cur_il_weight) {
+			/* stay on current node, adjust cur_il_weight */
+			pol->cur_il_weight -= rem_pages;
+			return total_allocated;
+		} else if (rem_pages == pol->cur_il_weight) {
+			/* move to next node / weight */
+			me->il_prev = node;
+			next_node = next_node_in(node, nodes);
+			rcu_read_lock();
+			table = rcu_dereference(iw_table);
+			weight = table ? table[next_node] : 1;
+			/* detect system-default usage */
+			pol->cur_il_weight = weight ? weight : 1;
+			rcu_read_unlock();
+			return total_allocated;
+		}
+		/* Otherwise we adjust nr_pages down, and continue from there */
+		rem_pages -= pol->cur_il_weight;
+		pol->cur_il_weight = 0;
+		prev_node = node;
+	}
+
+	/* create a local copy of node weights to operate on outside rcu */
+	weights = kmalloc(nr_node_ids, GFP_KERNEL);
+	if (!weights)
+		return total_allocated;
+
+	rcu_read_lock();
+	table = rcu_dereference(iw_table);
+	/* If table is not registered, use system defaults */
+	if (table)
+		memcpy(weights, table, nr_node_ids);
+	else
+		memset(weights, 1, nr_node_ids);
+	rcu_read_unlock();
+
+	/* calculate the total weight over the policy nodemask */
+	for_each_node_mask(node, nodes) {
+		/* detect system-default usage */
+		if (!weights[node])
+			weights[node] = 1;
+		weight_total += weights[node];
+	}
+
+	/*
+	 * Now we can continue allocating from 0 instead of an offset.
+	 * We calculate the number of rounds and any partial rounds so
+	 * that we minimize the number of calls to __alloc_pages_bulk.
+	 * This requires us to track which node we should resume from.
+	 *
+	 * If (rounds > 0) and (delta == 0), resume_node will always be
+	 * the current value of prev_node, which may be NUMA_NO_NODE if
+	 * this is the first allocation after a policy is replaced. The
+	 * resume weight will be the weight of the next node.
+	 *
+	 * If (delta > 0) and delta is depleted exactly on a node-weight
+	 * boundary, resume_node will be the node last allocated from when
+	 * delta reached 0.
+	 *
+	 * If (delta > 0) and delta is not depleted on a node-weight boundary,
+	 * resume_node will be the node prior to the node last allocated from.
+	 *
+	 * (rounds == 0) and (delta == 0) is not possible (earlier exit).
+	 */
+	rounds = rem_pages / weight_total;
+	delta = rem_pages % weight_total;
+	resume_node = prev_node;
+	resume_weight = weights[next_node_in(prev_node, nodes)];
+	/* If no delta, we'll resume from current prev_node and first weight */
+	for (i = 0; i < nnodes; i++) {
+		node = next_node_in(prev_node, nodes);
+		weight = weights[node];
+		node_pages = weight * rounds;
+		/* If a delta exists, add this node's portion of the delta */
+		if (delta > weight) {
+			node_pages += weight;
+			delta -= weight;
+			resume_node = node;
+		} else if (delta) {
+			node_pages += delta;
+			if (delta == weight) {
+				/* resume from next node with its weight */
+				resume_node = node;
+				next_node = next_node_in(node, nodes);
+				resume_weight = weights[next_node];
+			} else {
+				/* resume from this node w/ remaining weight */
+				resume_node = prev_node;
+				resume_weight = weight - (node_pages % weight);
+			}
+			delta = 0;
+		}
+		/* node_pages can be 0 if an allocation fails and rounds == 0 */
+		if (!node_pages)
+			break;
+		nr_allocated = __alloc_pages_bulk(gfp, node, NULL, node_pages,
+						  NULL, page_array);
+		page_array += nr_allocated;
+		total_allocated += nr_allocated;
+		if (total_allocated == nr_pages)
+			break;
+		prev_node = node;
+	}
+	/* resume allocating from the calculated node and weight */
+	me->il_prev = resume_node;
+	pol->cur_il_weight = resume_weight;
+	kfree(weights);
+	return total_allocated;
+}
+
 static unsigned long alloc_pages_bulk_array_preferred_many(gfp_t gfp, int nid,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
@@ -2304,6 +2558,10 @@ unsigned long alloc_pages_bulk_array_mempolicy(gfp_t gfp,
 		return alloc_pages_bulk_array_interleave(gfp, pol,
 							 nr_pages, page_array);
 
+	if (pol->mode == MPOL_WEIGHTED_INTERLEAVE)
+		return alloc_pages_bulk_array_weighted_interleave(
+				  gfp, pol, nr_pages, page_array);
+
 	if (pol->mode == MPOL_PREFERRED_MANY)
 		return alloc_pages_bulk_array_preferred_many(gfp,
 				numa_node_id(), pol, nr_pages, page_array);
@@ -2379,6 +2637,7 @@ bool __mpol_equal(struct mempolicy *a, struct mempolicy *b)
 	case MPOL_INTERLEAVE:
 	case MPOL_PREFERRED:
 	case MPOL_PREFERRED_MANY:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		return !!nodes_equal(a->nodes, b->nodes);
 	case MPOL_LOCAL:
 		return true;
@@ -2515,6 +2774,10 @@ int mpol_misplaced(struct folio *folio, struct vm_area_struct *vma,
 		polnid = interleave_nid(pol, ilx);
 		break;
 
+	case MPOL_WEIGHTED_INTERLEAVE:
+		polnid = weighted_interleave_nid(pol, ilx);
+		break;
+
 	case MPOL_PREFERRED:
 		if (node_isset(curnid, pol->nodes))
 			goto out;
@@ -2889,6 +3152,7 @@ static const char * const policy_modes[] =
 	[MPOL_PREFERRED]  = "prefer",
 	[MPOL_BIND]       = "bind",
 	[MPOL_INTERLEAVE] = "interleave",
+	[MPOL_WEIGHTED_INTERLEAVE] = "weighted interleave",
 	[MPOL_LOCAL]      = "local",
 	[MPOL_PREFERRED_MANY]  = "prefer (many)",
 };
@@ -2948,6 +3212,7 @@ int mpol_parse_str(char *str, struct mempolicy **mpol)
 		}
 		break;
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		/*
 		 * Default to online nodes with memory if no nodelist
 		 */
@@ -3058,6 +3323,7 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 	case MPOL_PREFERRED_MANY:
 	case MPOL_BIND:
 	case MPOL_INTERLEAVE:
+	case MPOL_WEIGHTED_INTERLEAVE:
 		nodes = pol->nodes;
 		break;
 	default: