[v7] mm/mempolicy: Weighted Interleave Auto-tuning

Message ID 20250305200506.2529583-1-joshua.hahnjy@gmail.com (mailing list archive)
State Handled Elsewhere, archived
Series [v7] mm/mempolicy: Weighted Interleave Auto-tuning

Commit Message

Joshua Hahn March 5, 2025, 8:05 p.m. UTC
On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allowed for pages to be allocated across nodes according to user-set ratios.

Ideally, these weights should be proportional to each node's bandwidth,
so that under bandwidth pressure, each node operates at its maximum
efficient bandwidth and latency is prevented from increasing
exponentially.

Previously, weighted interleave's default weights were all 1, making it
equivalent to the (unweighted) interleave mempolicy, which goes through
the nodes in a round-robin fashion, ignoring bandwidth information.

This patch has two main goals:
First, it makes weighted interleave easier to use for users who wish to
relieve bandwidth pressure when using nodes with varying bandwidth (CXL).
By providing a set of "real" default weights that just work out of the
box, users who are unable (or do not wish) to experiment to find the
optimal weights for their system can still take advantage of
bandwidth-informed weighted interleave.

Second, it allows weighted interleave to dynamically adjust to
hotplugged memory with new bandwidth information. Instead of requiring
node weights to be updated manually every time bandwidth information is
reported or removed, weighted interleave recalculates and provides a
new set of default weights to use whenever the bandwidth information
changes.

To meet these goals, this patch introduces an auto-configuration mode
for the interleave weights that provides a reasonable set of default
weights, calculated using bandwidth data reported by the system. In
auto mode, weights are dynamically adjusted whenever the current
bandwidth information changes, including in response to hotplug events.
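
As a concrete illustration (bandwidth numbers are hypothetical), on a
two-node system where node0 reports 300 GB/s and node1 reports 100 GB/s,
the reduce_interleave_weights() logic in this patch (weightiness = 32)
would derive:

   node0: (32 * 300) / (300 + 100) = 24
   node1: (32 * 100) / (300 + 100) =  8

and then reduce by GCD(24, 8) = 8, giving final weights of 3 : 1 -- i.e.
roughly three pages allocated on node0 for every page on node1.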

This patch still supports users manually writing weights into the nodeN
sysfs interface, which switches the system into manual mode. In manual
mode, the system stops dynamically updating any of the node weights,
even during hotplug events that shift the optimal weight distribution.

A new sysfs interface "auto" is introduced, which allows users to switch
between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The
system also automatically enters manual mode when a nodeN interface is
manually written to.
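
For example, a hypothetical session (node numbers and weight values are
illustrative):

   # cd /sys/kernel/mm/mempolicy/weighted_interleave/
   # cat auto
   true
   # echo 4 > node0
   # cat auto
   false
   # echo Y > auto
   # cat auto
   true

Writing a weight to node0 switches the system to manual mode; writing Y
to auto re-enables auto mode and recalculates the weights from the
stored bandwidth data.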

There is one functional change that this patch makes to the existing
weighted_interleave ABI: previously, writing 0 directly to a nodeN
interface was said to reset the weight to the system default. Before
this patch, the default for all weights was 1, which meant that writing
0 and 1 were functionally equivalent.
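
With this patch, writing 0 to a nodeN interface is instead rejected with
-EINVAL (hypothetical session, matching the node_store() check added by
this patch):

   # echo 0 > node1
   -bash: echo: write error: Invalid argument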

[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/

Suggested-by: Yunjeong Mun <yunjeong.mun@sk.com>
Suggested-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Ying Huang <ying.huang@linux.alibaba.com>
Suggested-by: Harry Yoo <harry.yoo@oracle.com> 
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
Co-developed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
---
Changelog
v7:
- Wordsmithing
- Rename iw_table_lock to wi_state_lock
- Clean up reduce_interleave_weights, as suggested by Yunjeong Mun.
  - Combine iw_table allocation & initialization to be outside the function.
  - Skip scaling to [1,100] before scaling to [1,weightiness].
- Removed the second part of this patch, which prevented creating weight
  sysfs interfaces for memoryless nodes.
- Added Suggested-by tags; I should have done this much, much earlier.

v6:
- iw_weights and mode_auto are combined into one rcu-protected struct.
- Protection against memoryless nodes, as suggested by Oscar Salvador
- Wordsmithing (documentation, commit message and comments), as suggested
  by Andrew Morton.
- Removed unnecessary #include statement in hmat.c, as pointed out by
  Harry (Hyeonggon) Yoo and Ying Huang.
- Bandwidth values changed from u64 to unsigned int, as pointed out by
  Ying Huang and Dan Carpenter.
- RCU optimizations, as suggested by Ying Huang.
- A second patch is included to fix unintended behavior that creates a
  weight knob for memoryless nodes as well.
- Sysfs show/store functions use str_true_false & kstrtobool.
- Fix a build error on 32-bit systems, which cannot perform 64-bit
  division, by casting 64-bit values to 32-bit when they are within range.

v5:
- I accidentally forgot to add the mm/mempolicy: subject tag since v1 of
  this patch. Added to the subject now!
- Wordsmithing, correcting typos, and re-naming variables for clarity.
- No functional changes.

v4:
- Renamed the mode interface to the "auto" interface, which now only
  emits either 'Y' or 'N'. Users can now interact with it by
  writing 'Y', '1', 'N', or '0' to it.
- Added additional documentation to the nodeN sysfs interface.
- Makes sure iw_table locks are properly held.
- Removed unlikely() call in reduce_interleave_weights.
- Wordsmithing

v3:
- Weightiness (max_node_weight) is now fixed to 32.
- Instead, the sysfs interface now exposes a "mode" parameter, which
  can either be "auto" or "manual".
  - Thank you Hyeonggon and Honggyu for the feedback.
- Documentation updated to reflect new sysfs interface, explicitly
  specifies that 0 is invalid.
  - Thank you Gregory and Ying for the discussion on how best to
    handle the 0 case.
- Re-worked nodeN sysfs store to handle auto --> manual shifts
- mempolicy_set_node_perf internally handles the auto / manual
  case differently now. bw is always updated, iw updates depend on
  what mode the user is in.
- Wordsmithing comments for clarity.
- Removed RFC tag.

v2:
- Name of the interface is changed: "max_node_weight" --> "weightiness"
- Default interleave weight table no longer exists. Rather, the
  interleave weight table is initialized with the defaults, if bandwidth
  information is available.
  - In addition, all sections that handle iw_table have been changed
    to reference iw_table if it exists, otherwise defaulting to 1.
- All instances of unsigned long are converted to uint64_t to guarantee
  support for both 32-bit and 64-bit machines
- sysfs initialization cleanup
- Documentation has been rewritten to explicitly outline expected
  behavior and expand on the interpretation of "weightiness".
- kzalloc replaced with kcalloc for readability
- Thank you Gregory and Hyeonggon for your review & feedback!

 ...fs-kernel-mm-mempolicy-weighted-interleave |  34 +-
 drivers/base/node.c                           |   9 +
 include/linux/mempolicy.h                     |   9 +
 mm/mempolicy.c                                | 318 +++++++++++++++---
 4 files changed, 311 insertions(+), 59 deletions(-)


base-commit: 99fa936e8e4f117d62f229003c9799686f74cebc

Comments

Honggyu Kim March 6, 2025, 12:58 p.m. UTC | #1
Hi Joshua,

I didn't expect Andrew to take this patch that fast, but anyway ...

   Reviewed-by: Honggyu Kim <honggyu.kim@sk.com>

I can also confirm that the auto-configured weights are sane on our
4-channel CXL memory system (with our upcoming fix for 12 nodes -> 4
nodes).

   # cd /sys/kernel/mm/mempolicy/weighted_interleave/
   # ls
   auto  node0  node1  node2  node3

   # cat auto node0 node1 node2 node3
   true
   3
   3
   2
   2

So I can also add this.

   Tested-by: Honggyu Kim <honggyu.kim@sk.com>

Thanks,
Honggyu

On 3/6/2025 5:05 AM, Joshua Hahn wrote:
> [...]
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..862b19943a85 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -20,6 +20,34 @@ Description:	Weight configuration interface for nodeN
>   		Minimum weight: 1
>   		Maximum weight: 255
>   
> -		Writing an empty string or `0` will reset the weight to the
> -		system default. The system default may be set by the kernel
> -		or drivers at boot or during hotplug events.
> +		Writing invalid values (i.e. any values not in [1,255],
> +		empty string, ...) will return -EINVAL.
> +
> +		Changing the weight to a valid value will automatically
> +		update the system to manual mode as well.
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> +Date:		February 2025
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Auto-weighting configuration interface
> +
> +		Configuration mode for weighted interleave. A 'Y' indicates
> +		that the system is in auto mode, and a 'N' indicates that
> +		the system is in manual mode.
> +
> +		In auto mode, all node weights are re-calculated and overwritten
> +		(visible via the nodeN interfaces) whenever new bandwidth data
> +		is made available during either boot or hotplug events.
> +
> +		In manual mode, node weights can only be updated by the user.
> +		Note that nodes that are onlined with previously set weights
> +		will inherit those weights. If they were not previously set or
> +		are onlined with missing bandwidth data, the weights will use
> +		a default weight of 1.
> +
> +		Writing Y or 1 to the interface will enable auto mode, while
> +		writing N or 0 will enable manual mode. All other strings will
> +		be ignored, and -EINVAL will be returned.
> +
> +		Writing a new weight to a node directly via the nodeN interface
> +		will also automatically update the system to manual mode.
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..f3c01fb90db1 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -7,6 +7,7 @@
>   #include <linux/init.h>
>   #include <linux/mm.h>
>   #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>   #include <linux/vmstat.h>
>   #include <linux/notifier.h>
>   #include <linux/node.h>
> @@ -214,6 +215,14 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>   			break;
>   		}
>   	}
> +
> +	/* When setting CPU access coordinates, update mempolicy */
> +	if (access == ACCESS_COORDINATE_CPU) {
> +		if (mempolicy_set_node_perf(nid, coord)) {
> +			pr_info("failed to set mempolicy attrs for node %d\n",
> +				nid);
> +		}
> +	}
>   }
>   EXPORT_SYMBOL_GPL(node_set_perf_attrs);
>   
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index ce9885e0178a..78f1299bdd42 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -11,6 +11,7 @@
>   #include <linux/slab.h>
>   #include <linux/rbtree.h>
>   #include <linux/spinlock.h>
> +#include <linux/node.h>
>   #include <linux/nodemask.h>
>   #include <linux/pagemap.h>
>   #include <uapi/linux/mempolicy.h>
> @@ -56,6 +57,11 @@ struct mempolicy {
>   	} w;
>   };
>   
> +struct weighted_interleave_state {
> +	bool mode_auto;
> +	u8 iw_table[]; /* A null iw_table is interpreted as an array of 1s. */
> +};
> +
>   /*
>    * Support for managing mempolicy data objects (clone, copy, destroy)
>    * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
> @@ -178,6 +184,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>   
>   extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
>   
> +extern int mempolicy_set_node_perf(unsigned int node,
> +				   struct access_coordinate *coords);
> +
>   #else
>   
>   struct mempolicy {};
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index bbaadbeeb291..857ea3faa5cb 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
>   #include <linux/mmu_notifier.h>
>   #include <linux/printk.h>
>   #include <linux/swapops.h>
> +#include <linux/gcd.h>
>   
>   #include <asm/tlbflush.h>
>   #include <asm/tlb.h>
> @@ -139,31 +140,135 @@ static struct mempolicy default_policy = {
>   static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>   
>   /*
> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> - * system-default value should be used. A NULL iw_table also denotes that
> - * system-default values should be used. Until the system-default table
> - * is implemented, the system-default is always 1.
> - *
> - * iw_table is RCU protected
> + * weightiness balances the tradeoff between small weights (cycles through nodes
> + * faster, more fair/even distribution) and large weights (smaller errors
> + * between actual bandwidth ratios and weight ratios). 32 is a number that has
> + * been found to perform at a reasonable compromise between the two goals.
>    */
> -static u8 __rcu *iw_table;
> -static DEFINE_MUTEX(iw_table_lock);
> +static const int weightiness = 32;
> +
> +/* wi_state is RCU protected */
> +static struct weighted_interleave_state __rcu *wi_state;
> +static unsigned int *node_bw_table;
> +
> +/*
> + * wi_state_lock protects both wi_state and node_bw_table.
> + * node_bw_table is only used by writers to update wi_state.
> + */
> +static DEFINE_MUTEX(wi_state_lock);
>   
>   static u8 get_il_weight(int node)
>   {
> -	u8 *table;
> -	u8 weight;
> +	u8 weight = 1;
>   
>   	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	/* if no iw_table, use system default */
> -	weight = table ? table[node] : 1;
> -	/* if value in iw_table is 0, use system default */
> -	weight = weight ? weight : 1;
> +	if (rcu_access_pointer(wi_state))
> +		weight = rcu_dereference(wi_state)->iw_table[node];
>   	rcu_read_unlock();
> +
>   	return weight;
>   }
>   
> +/*
> + * Convert bandwidth values into weighted interleave weights.
> + * Call with wi_state_lock.
> + */
> +static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> +{
> +	u64 sum_bw = 0;
> +	unsigned int cast_sum_bw, scaling_factor = 1, iw_gcd = 0;
> +	int nid;
> +
> +	for_each_node_state(nid, N_MEMORY)
> +		sum_bw += bw[nid];
> +
> +	/* Scale bandwidths to whole numbers in the range [1, weightiness] */
> +	for_each_node_state(nid, N_MEMORY) {
> +		/*
> +		 * Try not to perform 64-bit division.
> +		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> +		 * If sum_bw > scaling_factor, then round the weight up to 1.
> +		 */
> +		scaling_factor = weightiness * bw[nid];
> +		if (bw[nid] && sum_bw < scaling_factor) {
> +			cast_sum_bw = (unsigned int)sum_bw;
> +			new_iw[nid] = scaling_factor / cast_sum_bw;
> +		} else {
> +			new_iw[nid] = 1;
> +		}
> +		if (!iw_gcd)
> +			iw_gcd = new_iw[nid];
> +		iw_gcd = gcd(iw_gcd, new_iw[nid]);
> +	}
> +
> +	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
> +	for_each_node_state(nid, N_MEMORY)
> +		new_iw[nid] /= iw_gcd;
> +}
> +
> +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> +{
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> +	unsigned int *old_bw, *new_bw;
> +	unsigned int bw_val;
> +	int i;
> +
> +	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> +	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
> +	if (!new_bw)
> +		return -ENOMEM;
> +
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state) {
> +		kfree(new_bw);
> +		return -ENOMEM;
> +	}
> +	for (i = 0; i < nr_node_ids; i++)
> +		new_wi_state->iw_table[i] = 1;
> +
> +	/*
> +	 * Update bandwidth info, even in manual mode. That way, when switching
> +	 * to auto mode in the future, iw_table can be overwritten using
> +	 * accurate bw data.
> +	 */
> +	mutex_lock(&wi_state_lock);
> +
> +	old_bw = node_bw_table;
> +	if (old_bw)
> +		memcpy(new_bw, old_bw, nr_node_ids * sizeof(unsigned int));
> +	new_bw[node] = bw_val;
> +	node_bw_table = new_bw;
> +
> +	/* wi_state not initialized yet; assume auto == true */
> +	if (!rcu_access_pointer(wi_state))
> +		goto reduce;
> +
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&wi_state_lock));
> +	if (old_wi_state->mode_auto)
> +		goto reduce;
> +
> +	mutex_unlock(&wi_state_lock);
> +	kfree(new_wi_state);
> +	kfree(old_bw);
> +	return 0;
> +
> +reduce:
> +	new_wi_state->mode_auto = true;
> +	reduce_interleave_weights(new_bw, new_wi_state->iw_table);
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
> +	kfree(old_bw);
> +
> +	return 0;
> +}
> +
>   /**
>    * numa_nearest_node - Find nearest node by state
>    * @node: Node id to start the search
> @@ -1988,34 +2093,33 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>   	u8 *table;
>   	unsigned int weight_total = 0;
>   	u8 weight;
> -	int nid;
> +	int nid = 0;
>   
>   	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
>   	if (!nr_nodes)
>   		return numa_node_id();
>   
>   	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> +	if (!rcu_access_pointer(wi_state))
> +		goto out;
> +
> +	table = rcu_dereference(wi_state)->iw_table;
>   	/* calculate the total weight */
> -	for_each_node_mask(nid, nodemask) {
> -		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> -		weight_total += weight;
> -	}
> +	for_each_node_mask(nid, nodemask)
> +		weight_total += table ? table[nid] : 1;
>   
>   	/* Calculate the node offset based on totals */
>   	target = ilx % weight_total;
>   	nid = first_node(nodemask);
>   	while (target) {
>   		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> +		weight = table[nid];
>   		if (target < weight)
>   			break;
>   		target -= weight;
>   		nid = next_node_in(nid, nodemask);
>   	}
> +out:
>   	rcu_read_unlock();
>   	return nid;
>   }
> @@ -2411,13 +2515,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>   		struct mempolicy *pol, unsigned long nr_pages,
>   		struct page **page_array)
>   {
> +	struct weighted_interleave_state *state;
>   	struct task_struct *me = current;
>   	unsigned int cpuset_mems_cookie;
>   	unsigned long total_allocated = 0;
>   	unsigned long nr_allocated = 0;
>   	unsigned long rounds;
>   	unsigned long node_pages, delta;
> -	u8 *table, *weights, weight;
> +	u8 *weights, weight;
>   	unsigned int weight_total = 0;
>   	unsigned long rem_pages = nr_pages;
>   	nodemask_t nodes;
> @@ -2467,17 +2572,19 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>   		return total_allocated;
>   
>   	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	if (table)
> -		memcpy(weights, table, nr_node_ids);
> -	rcu_read_unlock();
> +	if (rcu_access_pointer(wi_state)) {
> +		state = rcu_dereference(wi_state);
> +		memcpy(weights, state->iw_table, nr_node_ids * sizeof(u8));
> +		rcu_read_unlock();
> +	} else {
> +		rcu_read_unlock();
> +		for (i = 0; i < nr_node_ids; i++)
> +			weights[i] = 1;
> +	}
>   
>   	/* calculate total, detect system default usage */
> -	for_each_node_mask(node, nodes) {
> -		if (!weights[node])
> -			weights[node] = 1;
> +	for_each_node_mask(node, nodes)
>   		weight_total += weights[node];
> -	}
>   
>   	/*
>   	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
> @@ -3402,36 +3509,112 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
>   static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>   			  const char *buf, size_t count)
>   {
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
>   	struct iw_node_attr *node_attr;
> -	u8 *new;
> -	u8 *old;
>   	u8 weight = 0;
> +	int i;
>   
>   	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
>   	if (count == 0 || sysfs_streq(buf, ""))
>   		weight = 0;
> -	else if (kstrtou8(buf, 0, &weight))
> +	else if (kstrtou8(buf, 0, &weight) || weight == 0)
>   		return -EINVAL;
>   
> -	new = kzalloc(nr_node_ids, GFP_KERNEL);
> -	if (!new)
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state)
>   		return -ENOMEM;
>   
> -	mutex_lock(&iw_table_lock);
> -	old = rcu_dereference_protected(iw_table,
> -					lockdep_is_held(&iw_table_lock));
> -	if (old)
> -		memcpy(new, old, nr_node_ids);
> -	new[node_attr->nid] = weight;
> -	rcu_assign_pointer(iw_table, new);
> -	mutex_unlock(&iw_table_lock);
> -	synchronize_rcu();
> -	kfree(old);
> +	mutex_lock(&wi_state_lock);
> +	if (rcu_access_pointer(wi_state)) {
> +		old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&wi_state_lock));
> +		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> +					nr_node_ids * sizeof(u8));
> +	} else {
> +		for (i = 0; i < nr_node_ids; i++)
> +			new_wi_state->iw_table[i] = 1;
> +	}
> +	new_wi_state->iw_table[node_attr->nid] = weight;
> +	new_wi_state->mode_auto = false;
> +
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
>   	return count;
>   }
>   
>   static struct iw_node_attr **node_attrs;
>   
> +static ssize_t weighted_interleave_auto_show(struct kobject *kobj,
> +		struct kobj_attribute *attr, char *buf)
> +{
> +	bool wi_auto = true;
> +
> +	rcu_read_lock();
> +	if (rcu_access_pointer(wi_state))
> +		wi_auto = rcu_dereference(wi_state)->mode_auto;
> +	rcu_read_unlock();
> +
> +	return sysfs_emit(buf, "%s\n", str_true_false(wi_auto));
> +}
> +
> +static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> +	unsigned int *bw;
> +	bool input;
> +	int i;
> +
> +	if (kstrtobool(buf, &input))
> +		return -EINVAL;
> +
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state)
> +		return -ENOMEM;
> +	for (i = 0; i < nr_node_ids; i++)
> +		new_wi_state->iw_table[i] = 1;
> +
> +	mutex_lock(&wi_state_lock);
> +	if (!input) {
> +		if (rcu_access_pointer(wi_state)) {
> +			old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&wi_state_lock));
> +			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> +					nr_node_ids * sizeof(u8));
> +		}
> +		goto update_wi_state;
> +	}
> +
> +	bw = node_bw_table;
> +	if (!bw) {
> +		mutex_unlock(&wi_state_lock);
> +		kfree(new_wi_state);
> +		return -ENODEV;
> +	}
> +
> +	new_wi_state->mode_auto = true;
> +	reduce_interleave_weights(bw, new_wi_state->iw_table);
> +
> +update_wi_state:
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
> +	return count;
> +}
> +
> +static struct kobj_attribute wi_attr =
> +	__ATTR(auto, 0664, weighted_interleave_auto_show,
> +			   weighted_interleave_auto_store);
> +
>   static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
>   				  struct kobject *parent)
>   {
> @@ -3489,6 +3672,15 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
>   	return 0;
>   }
>   
> +static struct attribute *wi_default_attrs[] = {
> +	&wi_attr.attr,
> +	NULL
> +};
> +
> +static const struct attribute_group wi_attr_group = {
> +	.attrs = wi_default_attrs,
> +};
> +
>   static int add_weighted_interleave_group(struct kobject *root_kobj)
>   {
>   	struct kobject *wi_kobj;
> @@ -3505,6 +3697,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>   		return err;
>   	}
>   
> +	err = sysfs_create_group(wi_kobj, &wi_attr_group);
> +	if (err) {
> +		pr_err("failed to add sysfs [auto]\n");
> +		kobject_put(wi_kobj);
> +		return err;
> +	}
> +
>   	for_each_node_state(nid, N_POSSIBLE) {
>   		err = add_weight_node(nid, wi_kobj);
>   		if (err) {
> @@ -3519,15 +3718,22 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>   
>   static void mempolicy_kobj_release(struct kobject *kobj)
>   {
> -	u8 *old;
> +	struct weighted_interleave_state *old_wi_state;
> +
> +	mutex_lock(&wi_state_lock);
> +	if (!rcu_access_pointer(wi_state)) {
> +		mutex_unlock(&wi_state_lock);
> +		goto out;
> +	}
> +
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +			lockdep_is_held(&wi_state_lock));
>   
> -	mutex_lock(&iw_table_lock);
> -	old = rcu_dereference_protected(iw_table,
> -					lockdep_is_held(&iw_table_lock));
> -	rcu_assign_pointer(iw_table, NULL);
> -	mutex_unlock(&iw_table_lock);
> +	rcu_assign_pointer(wi_state, NULL);
> +	mutex_unlock(&wi_state_lock);
>   	synchronize_rcu();
> -	kfree(old);
> +	kfree(old_wi_state);
> +out:
>   	kfree(node_attrs);
>   	kfree(kobj);
>   }
> 
> base-commit: 99fa936e8e4f117d62f229003c9799686f74cebc
Huang, Ying March 10, 2025, 2:22 a.m. UTC | #2
Hi, Joshua,

Thanks for your new version.

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across nodes according to user-set ratios.
>
> Ideally, these weights should be proportional to each node's bandwidth,
> so that under bandwidth pressure, each node operates at its maximum
> efficient bandwidth and latency is prevented from increasing
> exponentially.
>
> Previously, weighted interleave's default weights were all 1, making it
> equivalent to the (unweighted) interleave mempolicy, which goes through
> the nodes in a round-robin fashion, ignoring bandwidth information.
>
> This patch has two main goals:
> First, it makes weighted interleave easier to use for users who wish to
> relieve bandwidth pressure when using nodes with varying bandwidth (CXL).
> By providing a set of "real" default weights that just work out of the
> box, users who are unable (or do not wish) to experiment to find the
> optimal weights for their system can still take advantage of
> bandwidth-informed weighted interleave.
>
> Second, it allows weighted interleave to dynamically adjust to
> hotplugged memory with new bandwidth information. Instead of requiring
> node weights to be updated manually every time bandwidth information is
> reported or removed, weighted interleave recalculates and provides a
> new set of default weights to use whenever the bandwidth information
> changes.
>
> To meet these goals, this patch introduces an auto-configuration mode
> for the interleave weights that provides a reasonable set of default
> weights, calculated using bandwidth data reported by the system. In
> auto mode, weights are dynamically adjusted whenever the current
> bandwidth information changes, including in response to hotplug events.
>
> This patch still supports users manually writing weights into the nodeN
> sysfs interface, which switches the system into manual mode. In manual
> mode, the system stops dynamically updating any of the node weights,
> even during hotplug events that shift the optimal weight distribution.
>
> A new sysfs interface "auto" is introduced, which allows users to switch
> between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The
> system also automatically enters manual mode when a nodeN interface is
> manually written to.
>
> There is one functional change that this patch makes to the existing
> weighted_interleave ABI: previously, writing 0 directly to a nodeN
> interface was said to reset the weight to the system default. Before
> this patch, the default for all weights was 1, which meant that writing
> 0 and 1 were functionally equivalent.

Did you forget to describe the new functionality?

> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
>
> Suggested-by: Yunjeong Mun <yunjeong.mun@sk.com>
> Suggested-by: Oscar Salvador <osalvador@suse.de>
> Suggested-by: Ying Huang <ying.huang@linux.alibaba.com>
> Suggested-by: Harry Yoo <harry.yoo@oracle.com> 
> Reviewed-by: Harry Yoo <harry.yoo@oracle.com>
> Co-developed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> ---
> Changelog
> v7:
> - Wordsmithing
> - Rename iw_table_lock to wi_state_lock
> - Clean up reduce_interleave_weights, as suggested by Yunjeong Mun.
>   - Combine iw_table allocation & initialization to be outside the function.
>   - Skip scaling to [1,100] before scaling to [1,weightiness].
> - Removed the second part of this patch, which prevented creating weight
>   sysfs interfaces for memoryless nodes.
> - Added Suggested-by tags; I should have done this much, much earlier.
>
> v6:
> - iw_weights and mode_auto are combined into one rcu-protected struct.
> - Protection against memoryless nodes, as suggested by Oscar Salvador
> - Wordsmithing (documentation, commit message and comments), as suggested
>   by Andrew Morton.
> - Removed unnecessary #include statement in hmat.c, as pointed out by
>   Harry (Hyeonggon) Yoo and Ying Huang.
> - Bandwidth values changed from u64 to unsigned int, as pointed out by
>   Ying Huang and Dan Carpenter.
> - RCU optimizations, as suggested by Ying Huang.
> - A second patch is included to fix unintended behavior that creates a
>   weight knob for memoryless nodes as well.
> - Sysfs show/store functions use str_true_false & kstrtobool.
> - Fix a build error on 32-bit systems, which cannot perform 64-bit
>   division, by casting 64-bit values to 32-bit when they are within range.
>
> v5:
> - I accidentally forgot to add the mm/mempolicy: subject tag since v1 of
>   this patch. Added to the subject now!
> - Wordsmithing, correcting typos, and re-naming variables for clarity.
> - No functional changes.
>
> v4:
> - Renamed the mode interface to the "auto" interface, which now only
>   emits either 'Y' or 'N'. Users can now interact with it by
>   writing 'Y', '1', 'N', or '0' to it.
> - Added additional documentation to the nodeN sysfs interface.
> - Makes sure iw_table locks are properly held.
> - Removed unlikely() call in reduce_interleave_weights.
> - Wordsmithing
>
> v3:
> - Weightiness (max_node_weight) is now fixed to 32.
> - Instead, the sysfs interface now exposes a "mode" parameter, which
>   can either be "auto" or "manual".
>   - Thank you Hyeonggon and Honggyu for the feedback.
> - Documentation updated to reflect new sysfs interface, explicitly
>   specifies that 0 is invalid.
>   - Thank you Gregory and Ying for the discussion on how best to
>     handle the 0 case.
> - Re-worked nodeN sysfs store to handle auto --> manual shifts
> - mempolicy_set_node_perf internally handles the auto / manual
>   case differently now. bw is always updated, iw updates depend on
>   what mode the user is in.
> - Wordsmithing comments for clarity.
> - Removed RFC tag.
>
> v2:
> - Name of the interface is changed: "max_node_weight" --> "weightiness"
> - Default interleave weight table no longer exists. Rather, the
>   interleave weight table is initialized with the defaults, if bandwidth
>   information is available.
>   - In addition, all sections that handle iw_table have been changed
>     to reference iw_table if it exists, otherwise defaulting to 1.
> - All instances of unsigned long are converted to uint64_t to guarantee
>   support for both 32-bit and 64-bit machines
> - sysfs initialization cleanup
> - Documentation has been rewritten to explicitly outline expected
>   behavior and expand on the interpretation of "weightiness".
> - kzalloc replaced with kcalloc for readability
> - Thank you Gregory and Hyeonggon for your review & feedback!
>
>  ...fs-kernel-mm-mempolicy-weighted-interleave |  34 +-
>  drivers/base/node.c                           |   9 +
>  include/linux/mempolicy.h                     |   9 +
>  mm/mempolicy.c                                | 318 +++++++++++++++---
>  4 files changed, 311 insertions(+), 59 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..862b19943a85 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -20,6 +20,34 @@ Description:	Weight configuration interface for nodeN
>  		Minimum weight: 1
>  		Maximum weight: 255
>  
> -		Writing an empty string or `0` will reset the weight to the
> -		system default. The system default may be set by the kernel
> -		or drivers at boot or during hotplug events.
> +		Writing invalid values (i.e. any values not in [1,255],
> +		empty string, ...) will return -EINVAL.
> +
> +		Changing the weight to a valid value will automatically
> +		update the system to manual mode as well.
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> +Date:		February 2025
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Auto-weighting configuration interface
> +
> +		Configuration mode for weighted interleave. A 'Y' indicates
> +		that the system is in auto mode, and a 'N' indicates that
> +		the system is in manual mode.

str_true_false() is used to show the attribute, so "true"/"false" will
be displayed rather than 'Y'/'N'?

> +
> +		In auto mode, all node weights are re-calculated and overwritten
> +		(visible via the nodeN interfaces) whenever new bandwidth data
> +		is made available during either boot or hotplug events.
> +
> +		In manual mode, node weights can only be updated by the user.
> +		Note that nodes that are onlined with previously set weights
> +		will inherit those weights. If they were not previously set or

s/inherit/reuse/?

However, my English is poor, so keep it if you think that is better.

> +		are onlined with missing bandwidth data, the weights will use
> +		a default weight of 1.
> +
> +		Writing Y or 1 to the interface will enable auto mode, while

kstrtobool() is used to parse user input, so maybe something like
below?

Writing any true value string (e.g., Y or 1) will enable auto mode.

> +		writing N or 0 will enable manual mode. All other strings will
> +		be ignored, and -EINVAL will be returned.
> +
> +		Writing a new weight to a node directly via the nodeN interface
> +		will also automatically update the system to manual mode.

s/update/switch/?

Again, keep your words if you think that it's better.

> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..f3c01fb90db1 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -7,6 +7,7 @@
>  #include <linux/init.h>
>  #include <linux/mm.h>
>  #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>  #include <linux/vmstat.h>
>  #include <linux/notifier.h>
>  #include <linux/node.h>
> @@ -214,6 +215,14 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>  			break;
>  		}
>  	}
> +
> +	/* When setting CPU access coordinates, update mempolicy */
> +	if (access == ACCESS_COORDINATE_CPU) {
> +		if (mempolicy_set_node_perf(nid, coord)) {
> +			pr_info("failed to set mempolicy attrs for node %d\n",
> +				nid);
> +		}
> +	}
>  }
>  EXPORT_SYMBOL_GPL(node_set_perf_attrs);
>  
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index ce9885e0178a..78f1299bdd42 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -11,6 +11,7 @@
>  #include <linux/slab.h>
>  #include <linux/rbtree.h>
>  #include <linux/spinlock.h>
> +#include <linux/node.h>
>  #include <linux/nodemask.h>
>  #include <linux/pagemap.h>
>  #include <uapi/linux/mempolicy.h>
> @@ -56,6 +57,11 @@ struct mempolicy {
>  	} w;
>  };
>  
> +struct weighted_interleave_state {
> +	bool mode_auto;

Just "auto" looks more natural for me.  However, I have no strong
opinion on this.

> +	u8 iw_table[]; /* A null iw_table is interpreted as an array of 1s. */

What is a "null" array?

IIUC, wi_state replaces the previous iw_table now, so we may replace
this with:

A NULL wi_state is interpreted as mode "auto" with a weight of 1 for
every node.

?

> +};
> +
>  /*
>   * Support for managing mempolicy data objects (clone, copy, destroy)
>   * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
> @@ -178,6 +184,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>  
>  extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
>  
> +extern int mempolicy_set_node_perf(unsigned int node,
> +				   struct access_coordinate *coords);
> +
>  #else
>  
>  struct mempolicy {};
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index bbaadbeeb291..857ea3faa5cb 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/printk.h>
>  #include <linux/swapops.h>
> +#include <linux/gcd.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
> @@ -139,31 +140,135 @@ static struct mempolicy default_policy = {
>  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>  
>  /*
> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> - * system-default value should be used. A NULL iw_table also denotes that
> - * system-default values should be used. Until the system-default table
> - * is implemented, the system-default is always 1.
> - *
> - * iw_table is RCU protected
> + * weightiness balances the tradeoff between small weights (cycles through nodes
> + * faster, more fair/even distribution) and large weights (smaller errors
> + * between actual bandwidth ratios and weight ratios). 32 is a number that has
> + * been found to perform at a reasonable compromise between the two goals.
>   */
> -static u8 __rcu *iw_table;
> -static DEFINE_MUTEX(iw_table_lock);
> +static const int weightiness = 32;
> +
> +/* wi_state is RCU protected */

"__rcu" below can replace the above comments?

> +static struct weighted_interleave_state __rcu *wi_state;
> +static unsigned int *node_bw_table;
> +
> +/*
> + * wi_state_lock protects both wi_state and node_bw_table.
> + * node_bw_table is only used by writers to update wi_state.
> + */
> +static DEFINE_MUTEX(wi_state_lock);
>  
>  static u8 get_il_weight(int node)
>  {
> -	u8 *table;
> -	u8 weight;
> +	u8 weight = 1;
>  
>  	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	/* if no iw_table, use system default */
> -	weight = table ? table[node] : 1;
> -	/* if value in iw_table is 0, use system default */
> -	weight = weight ? weight : 1;
> +	if (rcu_access_pointer(wi_state))
> +		weight = rcu_dereference(wi_state)->iw_table[node];

IIUC, wi_state may be changed between rcu_access_pointer() and
rcu_dereference().  If so, it's better to use rcu_dereference()
directly.
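
For example, an untested sketch of the suggested change, reusing the
names from the patch:

	static u8 get_il_weight(int node)
	{
		struct weighted_interleave_state *state;
		u8 weight = 1;

		rcu_read_lock();
		/* Read the pointer once, so the table cannot change under us */
		state = rcu_dereference(wi_state);
		if (state)
			weight = state->iw_table[node];
		rcu_read_unlock();

		return weight;
	}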

>  	rcu_read_unlock();
> +
>  	return weight;
>  }
>  
> +/*
> + * Convert bandwidth values into weighted interleave weights.
> + * Call with wi_state_lock.
> + */
> +static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
> +{
> +	u64 sum_bw = 0;
> +	unsigned int cast_sum_bw, scaling_factor = 1, iw_gcd = 0;
> +	int nid;
> +
> +	for_each_node_state(nid, N_MEMORY)
> +		sum_bw += bw[nid];
> +
> +	/* Scale bandwidths to whole numbers in the range [1, weightiness] */
> +	for_each_node_state(nid, N_MEMORY) {
> +		/*
> +		 * Try not to perform 64-bit division.
> +		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
> +		 * If sum_bw > scaling_factor, then round the weight up to 1.
> +		 */
> +		scaling_factor = weightiness * bw[nid];
> +		if (bw[nid] && sum_bw < scaling_factor) {
> +			cast_sum_bw = (unsigned int)sum_bw;
> +			new_iw[nid] = scaling_factor / cast_sum_bw;
> +		} else {
> +			new_iw[nid] = 1;
> +		}
> +		if (!iw_gcd)
> +			iw_gcd = new_iw[nid];
> +		iw_gcd = gcd(iw_gcd, new_iw[nid]);
> +	}
> +
> +	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
> +	for_each_node_state(nid, N_MEMORY)
> +		new_iw[nid] /= iw_gcd;
> +}
> +
> +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> +{
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> +	unsigned int *old_bw, *new_bw;
> +	unsigned int bw_val;
> +	int i;
> +
> +	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> +	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
> +	if (!new_bw)
> +		return -ENOMEM;
> +
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);

NIT: because we will always initialize new_wi_state->iw_table[] below,
we can just use kmalloc() and initialize new_wi_state->mode_auto?

> +	if (!new_wi_state) {
> +		kfree(new_bw);
> +		return -ENOMEM;
> +	}
> +	for (i = 0; i < nr_node_ids; i++)
> +		new_wi_state->iw_table[i] = 1;
> +
> +	/*
> +	 * Update bandwidth info, even in manual mode. That way, when switching
> +	 * to auto mode in the future, iw_table can be overwritten using
> +	 * accurate bw data.
> +	 */
> +	mutex_lock(&wi_state_lock);
> +
> +	old_bw = node_bw_table;
> +	if (old_bw)
> +		memcpy(new_bw, old_bw, nr_node_ids * sizeof(unsigned int));

I prefer

		memcpy(new_bw, old_bw, nr_node_ids * sizeof(*old_bw));

a little.  But it's not a big deal.

> +	new_bw[node] = bw_val;
> +	node_bw_table = new_bw;
> +
> +	/* wi_state not initialized yet; assume auto == true */
> +	if (!rcu_access_pointer(wi_state))
> +		goto reduce;
> +
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&wi_state_lock));
> +	if (old_wi_state->mode_auto)

Because we can use "!old_wi_state || old_wi_state->mode_auto" here, I
don't think the rcu_access_pointer() above gives us anything.
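
For example (untested sketch):

	old_wi_state = rcu_dereference_protected(wi_state,
					lockdep_is_held(&wi_state_lock));
	if (!old_wi_state || old_wi_state->mode_auto)
		goto reduce;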

> +		goto reduce;
> +
> +	mutex_unlock(&wi_state_lock);
> +	kfree(new_wi_state);
> +	kfree(old_bw);
> +	return 0;
> +
> +reduce:
> +	new_wi_state->mode_auto = true;
> +	reduce_interleave_weights(new_bw, new_wi_state->iw_table);
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
> +	kfree(old_bw);
> +
> +	return 0;
> +}
> +
>  /**
>   * numa_nearest_node - Find nearest node by state
>   * @node: Node id to start the search
> @@ -1988,34 +2093,33 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>  	u8 *table;
>  	unsigned int weight_total = 0;
>  	u8 weight;
> -	int nid;
> +	int nid = 0;
>  
>  	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
>  	if (!nr_nodes)
>  		return numa_node_id();
>  
>  	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> +	if (!rcu_access_pointer(wi_state))
> +		goto out;

If wi_state == NULL, why should we always return 0?  IIUC, wi_state ==
NULL means the weight of any node is 1.
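
For example (untested sketch), the NULL case could fall back to plain
round-robin, since every weight is effectively 1:

	/* A NULL wi_state means every node has weight 1 */
	target = ilx % nr_nodes;
	nid = first_node(nodemask);
	while (target--)
		nid = next_node_in(nid, nodemask);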

> +
> +	table = rcu_dereference(wi_state)->iw_table;
>  	/* calculate the total weight */
> -	for_each_node_mask(nid, nodemask) {
> -		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> -		weight_total += weight;
> -	}
> +	for_each_node_mask(nid, nodemask)
> +		weight_total += table ? table[nid] : 1;

When will table be NULL here?

>  
>  	/* Calculate the node offset based on totals */
>  	target = ilx % weight_total;
>  	nid = first_node(nodemask);
>  	while (target) {
>  		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> +		weight = table[nid];
>  		if (target < weight)
>  			break;
>  		target -= weight;
>  		nid = next_node_in(nid, nodemask);
>  	}
> +out:
>  	rcu_read_unlock();
>  	return nid;
>  }
> @@ -2411,13 +2515,14 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  		struct mempolicy *pol, unsigned long nr_pages,
>  		struct page **page_array)
>  {
> +	struct weighted_interleave_state *state;
>  	struct task_struct *me = current;
>  	unsigned int cpuset_mems_cookie;
>  	unsigned long total_allocated = 0;
>  	unsigned long nr_allocated = 0;
>  	unsigned long rounds;
>  	unsigned long node_pages, delta;
> -	u8 *table, *weights, weight;
> +	u8 *weights, weight;
>  	unsigned int weight_total = 0;
>  	unsigned long rem_pages = nr_pages;
>  	nodemask_t nodes;
> @@ -2467,17 +2572,19 @@ static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
>  		return total_allocated;
>  
>  	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	if (table)
> -		memcpy(weights, table, nr_node_ids);
> -	rcu_read_unlock();
> +	if (rcu_access_pointer(wi_state)) {
> +		state = rcu_dereference(wi_state);
> +		memcpy(weights, state->iw_table, nr_node_ids * sizeof(u8));
> +		rcu_read_unlock();
> +	} else {
> +		rcu_read_unlock();
> +		for (i = 0; i < nr_node_ids; i++)
> +			weights[i] = 1;
> +	}
>  
>  	/* calculate total, detect system default usage */
> -	for_each_node_mask(node, nodes) {
> -		if (!weights[node])
> -			weights[node] = 1;
> +	for_each_node_mask(node, nodes)
>  		weight_total += weights[node];
> -	}
>  
>  	/*
>  	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
> @@ -3402,36 +3509,112 @@ static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
>  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>  			  const char *buf, size_t count)
>  {
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
>  	struct iw_node_attr *node_attr;
> -	u8 *new;
> -	u8 *old;
>  	u8 weight = 0;
> +	int i;
>  
>  	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
>  	if (count == 0 || sysfs_streq(buf, ""))
>  		weight = 0;
> -	else if (kstrtou8(buf, 0, &weight))
> +	else if (kstrtou8(buf, 0, &weight) || weight == 0)
>  		return -EINVAL;
>  
> -	new = kzalloc(nr_node_ids, GFP_KERNEL);
> -	if (!new)
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state)
>  		return -ENOMEM;
>  
> -	mutex_lock(&iw_table_lock);
> -	old = rcu_dereference_protected(iw_table,
> -					lockdep_is_held(&iw_table_lock));
> -	if (old)
> -		memcpy(new, old, nr_node_ids);
> -	new[node_attr->nid] = weight;
> -	rcu_assign_pointer(iw_table, new);
> -	mutex_unlock(&iw_table_lock);
> -	synchronize_rcu();
> -	kfree(old);
> +	mutex_lock(&wi_state_lock);
> +	if (rcu_access_pointer(wi_state)) {
> +		old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&wi_state_lock));
> +		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> +					nr_node_ids * sizeof(u8));
> +	} else {
> +		for (i = 0; i < nr_node_ids; i++)
> +			new_wi_state->iw_table[i] = 1;
> +	}
> +	new_wi_state->iw_table[node_attr->nid] = weight;
> +	new_wi_state->mode_auto = false;
> +
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
>  	return count;
>  }
>  
>  static struct iw_node_attr **node_attrs;
>  
> +static ssize_t weighted_interleave_auto_show(struct kobject *kobj,
> +		struct kobj_attribute *attr, char *buf)
> +{
> +	bool wi_auto = true;
> +
> +	rcu_read_lock();
> +	if (rcu_access_pointer(wi_state))
> +		wi_auto = rcu_dereference(wi_state)->mode_auto;
> +	rcu_read_unlock();
> +
> +	return sysfs_emit(buf, "%s\n", str_true_false(wi_auto));
> +}
> +
> +static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> +	unsigned int *bw;
> +	bool input;
> +	int i;
> +
> +	if (kstrtobool(buf, &input))
> +		return -EINVAL;
> +
> +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> +			       GFP_KERNEL);
> +	if (!new_wi_state)
> +		return -ENOMEM;
> +	for (i = 0; i < nr_node_ids; i++)
> +		new_wi_state->iw_table[i] = 1;
> +
> +	mutex_lock(&wi_state_lock);
> +	if (!input) {
> +		if (rcu_access_pointer(wi_state)) {
> +			old_wi_state = rcu_dereference_protected(wi_state,
> +					lockdep_is_held(&wi_state_lock));
> +			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
> +					nr_node_ids * sizeof(u8));
> +		}
> +		goto update_wi_state;
> +	}
> +
> +	bw = node_bw_table;
> +	if (!bw) {
> +		mutex_unlock(&wi_state_lock);
> +		kfree(new_wi_state);
> +		return -ENODEV;
> +	}
> +
> +	new_wi_state->mode_auto = true;
> +	reduce_interleave_weights(bw, new_wi_state->iw_table);
> +
> +update_wi_state:
> +	rcu_assign_pointer(wi_state, new_wi_state);
> +	mutex_unlock(&wi_state_lock);
> +	if (old_wi_state) {
> +		synchronize_rcu();
> +		kfree(old_wi_state);
> +	}
> +	return count;
> +}
> +
> +static struct kobj_attribute wi_attr =

NIT: "wi_attr" appears too general for me.  Maybe something like
"wi_auto_attr"?

> +	__ATTR(auto, 0664, weighted_interleave_auto_show,
> +			   weighted_interleave_auto_store);
> +
>  static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
>  				  struct kobject *parent)
>  {
> @@ -3489,6 +3672,15 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
>  	return 0;
>  }
>  
> +static struct attribute *wi_default_attrs[] = {
> +	&wi_attr.attr,
> +	NULL
> +};
> +
> +static const struct attribute_group wi_attr_group = {
> +	.attrs = wi_default_attrs,
> +};
> +
>  static int add_weighted_interleave_group(struct kobject *root_kobj)
>  {
>  	struct kobject *wi_kobj;
> @@ -3505,6 +3697,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>  		return err;
>  	}
>  
> +	err = sysfs_create_group(wi_kobj, &wi_attr_group);
> +	if (err) {
> +		pr_err("failed to add sysfs [auto]\n");
> +		kobject_put(wi_kobj);
> +		return err;
> +	}
> +
>  	for_each_node_state(nid, N_POSSIBLE) {
>  		err = add_weight_node(nid, wi_kobj);
>  		if (err) {
> @@ -3519,15 +3718,22 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>  
>  static void mempolicy_kobj_release(struct kobject *kobj)
>  {
> -	u8 *old;
> +	struct weighted_interleave_state *old_wi_state;
> +
> +	mutex_lock(&wi_state_lock);
> +	if (!rcu_access_pointer(wi_state)) {
> +		mutex_unlock(&wi_state_lock);
> +		goto out;
> +	}
> +
> +	old_wi_state = rcu_dereference_protected(wi_state,
> +			lockdep_is_held(&wi_state_lock));
>  
> -	mutex_lock(&iw_table_lock);
> -	old = rcu_dereference_protected(iw_table,
> -					lockdep_is_held(&iw_table_lock));
> -	rcu_assign_pointer(iw_table, NULL);
> -	mutex_unlock(&iw_table_lock);
> +	rcu_assign_pointer(wi_state, NULL);
> +	mutex_unlock(&wi_state_lock);
>  	synchronize_rcu();
> -	kfree(old);
> +	kfree(old_wi_state);
> +out:
>  	kfree(node_attrs);
>  	kfree(kobj);
>  }
>
> base-commit: 99fa936e8e4f117d62f229003c9799686f74cebc

---
Best Regards,
Huang, Ying
Joshua Hahn March 10, 2025, 5:33 p.m. UTC | #3
On Mon, 10 Mar 2025 10:22:30 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

Hello Andrew,

I'm sincerely sorry, but I think that there are some RCU race
conditions that I overlooked in this patch. Would it be ok with you to
pull the patch out of mm-unstable once more, and for me to send a v8? I think
it would also be safe to wait for Ying's review tag on this patch, since he
has been reviewing it since the first iteration.
Thank you for your help as always!

> Hi, Joshua,
> 
> Thanks for your new version.
> 
> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> 
> > On machines with multiple memory nodes, interleaving page allocations
> > across nodes allows for better utilization of each node's bandwidth.
> > Previous work by Gregory Price [1] introduced weighted interleave, which
> > allowed for pages to be allocated across nodes according to user-set ratios.
> >
> > Ideally, these weights should be proportional to their bandwidth, so
> > that under bandwidth pressure, each node uses its maximal efficient
> > bandwidth and prevents latency from increasing exponentially.
> >
> > Previously, weighted interleave's default weights were just 1s -- which
> > would be equivalent to the (unweighted) interleave mempolicy, which goes
> > through the nodes in a round-robin fashion, ignoring bandwidth information.
> >
> > This patch has two main goals:
> > First, it makes weighted interleave easier to use for users who wish to
> > relieve bandwidth pressure when using nodes with varying bandwidth (CXL).
> > By providing a set of "real" default weights that just work out of the
> > box, users who might not have the capability (or wish to) perform
> > experimentation to find the most optimal weights for their system can
> > still take advantage of bandwidth-informed weighted interleave.
> >
> > Second, it allows for weighted interleave to dynamically adjust to
> > hotplugged memory with new bandwidth information. Instead of manually
> > updating node weights every time new bandwidth information is reported
> > or taken off, weighted interleave adjusts and provides a new set of
> > default weights for weighted interleave to use when there is a change
> > in bandwidth information.
> >
> > To meet these goals, this patch introduces an auto-configuration mode
> > for the interleave weights that provides a reasonable set of default
> > weights, calculated using bandwidth data reported by the system. In auto
> > mode, weights are dynamically adjusted based on whatever the current
> > bandwidth information reports (and responds to hotplug events).
> >
> > This patch still supports users manually writing weights into the nodeN
> > sysfs interface by entering into manual mode. When a user enters manual
> > mode, the system stops dynamically updating any of the node weights,
> > even during hotplug events that shift the optimal weight distribution.
> >
> > A new sysfs interface "auto" is introduced, which allows users to switch
> > between the auto (writing 1 or Y) and manual (writing 0 or N) modes. The
> > system also automatically enters manual mode when a nodeN interface is
> > manually written to.
> >
> > There is one functional change that this patch makes to the existing
> > weighted_interleave ABI: previously, writing 0 directly to a nodeN
> > interface was said to reset the weight to the system default. Before
> > this patch, the default for all weights were 1, which meant that writing
> > 0 and 1 were functionally equivalent.
> 
> Did you forget to describe the new functionality?

Hi Ying, thank you for reviewing my patch again!
Thank you for letting me know. When I rewrote the patch letter from v5 to v6,
I kept trying to make this portion shorter and shorter... and in the process
I missed being explicit about what the new behavior is. I will describe it
explicitly in the next version.

[...snip...]

> > diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > index 0b7972de04e9..862b19943a85 100644
> > --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > @@ -20,6 +20,34 @@ Description:	Weight configuration interface for nodeN
> >  		Minimum weight: 1
> >  		Maximum weight: 255
> >  
> > -		Writing an empty string or `0` will reset the weight to the
> > -		system default. The system default may be set by the kernel
> > -		or drivers at boot or during hotplug events.
> > +		Writing invalid values (i.e. any values not in [1,255],
> > +		empty string, ...) will return -EINVAL.
> > +
> > +		Changing the weight to a valid value will automatically
> > +		update the system to manual mode as well.
> > +
> > +What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> > +Date:		February 2025
> > +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> > +Description:	Auto-weighting configuration interface
> > +
> > +		Configuration mode for weighted interleave. A 'Y' indicates
> > +		that the system is in auto mode, and a 'N' indicates that
> > +		the system is in manual mode.
> 
> str_true_false() is used to show the attribute, so "true/false" will
> be displayed?

Yep, makes sense to me!
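I will update the ABI documentation to match; something like this (exact
wording TBD, assuming we keep str_true_false()):

		Configuration mode for weighted interleave. "true" indicates
		that the system is in auto mode, and "false" indicates that
		the system is in manual mode.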

> > +
> > +		In auto mode, all node weights are re-calculated and overwritten
> > +		(visible via the nodeN interfaces) whenever new bandwidth data
> > +		is made available during either boot or hotplug events.
> > +
> > +		In manual mode, node weights can only be updated by the user.
> > +		Note that nodes that are onlined with previously set weights
> > +		will inherit those weights. If they were not previously set or
> 
> s/inherit/reuse/?
> 
> However, my English is poor, so keep it if you think it reads better.

Hmm, I think reuse is indeed the better word to use here. Inherit kind of makes
it seem like there is some parent-child hierarchy, which is definitely
not the case here.

> > +		are onlined with missing bandwidth data, the weights will use
> > +		a default weight of 1.
> > +
> > +		Writing Y or 1 to the interface will enable auto mode, while
> 
> kstrtobool() is used to parse user input, so maybe something like
> below?
> 
> Writing any true value string (e.g., Y or 1) will enable auto mode.

Noted, I will take this change as well.

> > +		writing N or 0 will enable manual mode. All other strings will
> > +		be ignored, and -EINVAL will be returned.
> > +
> > +		Writing a new weight to a node directly via the nodeN interface
> > +		will also automatically update the system to manual mode.
> 
> s/update/switch/?
> 
> Again, keep your words if think that it's better.

And here as well. Thank you for the suggestions!

[...snip...]

> > diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> > index ce9885e0178a..78f1299bdd42 100644
> > --- a/include/linux/mempolicy.h
> > +++ b/include/linux/mempolicy.h
> > @@ -11,6 +11,7 @@
> >  #include <linux/slab.h>
> >  #include <linux/rbtree.h>
> >  #include <linux/spinlock.h>
> > +#include <linux/node.h>
> >  #include <linux/nodemask.h>
> >  #include <linux/pagemap.h>
> >  #include <uapi/linux/mempolicy.h>
> > @@ -56,6 +57,11 @@ struct mempolicy {
> >  	} w;
> >  };
> >  
> > +struct weighted_interleave_state {
> > +	bool mode_auto;
> 
> Just "auto" looks more natural for me.  However, I have no strong
> opinion on thist.

Yep, my concern was that just leaving "auto" might be a bit vague, but
there is always the documentation for folks to reference if they are confused.

> > +	u8 iw_table[]; /* A null iw_table is interpreted as an array of 1s. */
> 
> What is "null" array?

You are right; there is no concept of a null array in a dynamically sized
struct.

> IIUC, iw_state is prevous iw_table now, so we may replace this with,
> 
> A null wi_state is interpreted as mode is "auto" and the weight of any
> node is "1".

Yup, this makes sense. The only 2 cases with a "null" iw_table are if
wi_state itself is null, or if the length is 0 (in which case the table
isn't null; it just points to the next address in memory). I'll take your
new description here. Thank you for the suggestion!
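
Concretely, I'm thinking of something like this above the wi_state
declaration (untested):

/*
 * A NULL wi_state is interpreted as the mode being "auto", with the
 * weight of every node being 1.
 */
static struct weighted_interleave_state __rcu *wi_state;
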
> >  /*
> > - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> > - * system-default value should be used. A NULL iw_table also denotes that
> > - * system-default values should be used. Until the system-default table
> > - * is implemented, the system-default is always 1.
> > - *
> > - * iw_table is RCU protected
> > + * weightiness balances the tradeoff between small weights (cycles through nodes
> > + * faster, more fair/even distribution) and large weights (smaller errors
> > + * between actual bandwidth ratios and weight ratios). 32 is a number that has
> > + * been found to perform at a reasonable compromise between the two goals.
> >   */
> > -static u8 __rcu *iw_table;
> > -static DEFINE_MUTEX(iw_table_lock);
> > +static const int weightiness = 32;
> > +
> > +/* wi_state is RCU protected */
> 
> "__rcu" below can replace the above comments?

Yes, I will remove the comment above.

> > +static struct weighted_interleave_state __rcu *wi_state;
> > +static unsigned int *node_bw_table;
> > +
> > +/*
> > + * wi_state_lock protects both wi_state and node_bw_table.
> > + * node_bw_table is only used by writers to update wi_state.
> > + */
> > +static DEFINE_MUTEX(wi_state_lock);
> >  
> >  static u8 get_il_weight(int node)
> >  {
> > -	u8 *table;
> > -	u8 weight;
> > +	u8 weight = 1;
> >  
> >  	rcu_read_lock();
> > -	table = rcu_dereference(iw_table);
> > -	/* if no iw_table, use system default */
> > -	weight = table ? table[node] : 1;
> > -	/* if value in iw_table is 0, use system default */
> > -	weight = weight ? weight : 1;
> > +	if (rcu_access_pointer(wi_state))
> > +		weight = rcu_dereference(wi_state)->iw_table[node];
> 
> IIUC, wi_state may change between rcu_access_pointer() and
> rcu_dereference().  If so, it's better to use rcu_dereference()
> directly.

Yes, you are correct. To be completely transparent, I had misunderstood
rcu_dereference() and had assumed that NULL pointers should not be passed
to it. I now understand that we should actually do the NULL check on the
returned pointer afterwards. There are a few other places where this
pattern is used -- I'll go and fix all of them.
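
For get_il_weight(), the fixed version would look something like this
(untested):

static u8 get_il_weight(int node)
{
	struct weighted_interleave_state *state;
	u8 weight = 1;

	rcu_read_lock();
	state = rcu_dereference(wi_state);
	if (state)
		weight = state->iw_table[node];
	rcu_read_unlock();

	return weight;
}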

[...snip...]

> > +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> > +{
> > +	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
> > +	unsigned int *old_bw, *new_bw;
> > +	unsigned int bw_val;
> > +	int i;
> > +
> > +	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> > +	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
> > +	if (!new_bw)
> > +		return -ENOMEM;
> > +
> > +	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
> > +			       GFP_KERNEL);
> 
> NIT: because we will always initialize new_wi_state->iw_table[] below,
> we can just use kmalloc() and initialize new_wi_state->mode_auto?

Yes, this makes sense to me.
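Something like this, then (untested sketch):

	new_wi_state = kmalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
			       GFP_KERNEL);
	if (!new_wi_state) {
		kfree(new_bw);
		return -ENOMEM;
	}
	new_wi_state->mode_auto = false;
	for (i = 0; i < nr_node_ids; i++)
		new_wi_state->iw_table[i] = 1;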

> > +	if (!new_wi_state) {
> > +		kfree(new_bw);
> > +		return -ENOMEM;
> > +	}
> > +	for (i = 0; i < nr_node_ids; i++)
> > +		new_wi_state->iw_table[i] = 1;
> > +
> > +	/*
> > +	 * Update bandwidth info, even in manual mode. That way, when switching
> > +	 * to auto mode in the future, iw_table can be overwritten using
> > +	 * accurate bw data.
> > +	 */
> > +	mutex_lock(&wi_state_lock);
> > +
> > +	old_bw = node_bw_table;
> > +	if (old_bw)
> > +		memcpy(new_bw, old_bw, nr_node_ids * sizeof(unsigned int));
> 
> I prefer
> 
> 		memcpy(new_bw, old_bw, nr_node_ids * sizeof(*old_bw));
> 
> a little.  But it's not a big deal.

We can do this. old_bw should not be NULL here anyway!

> > +	new_bw[node] = bw_val;
> > +	node_bw_table = new_bw;
> > +
> > +	/* wi_state not initialized yet; assume auto == true */
> > +	if (!rcu_access_pointer(wi_state))
> > +		goto reduce;
> > +
> > +	old_wi_state = rcu_dereference_protected(wi_state,
> > +					lockdep_is_held(&wi_state_lock));
> > +	if (old_wi_state->mode_auto)
> 
> Because we can use "!old_wi_state || old_wi_state->mode_auto" here, I
> don't think rcu_access_pointer() above buys us anything.

Sounds good as well.
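With the NULL check folded in, the reduce-path check would become
(untested):

	old_wi_state = rcu_dereference_protected(wi_state,
					lockdep_is_held(&wi_state_lock));
	if (!old_wi_state || old_wi_state->mode_auto)
		goto reduce;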

> > +		goto reduce;
> > +
> > +	mutex_unlock(&wi_state_lock);
> > +	kfree(new_wi_state);
> > +	kfree(old_bw);
> > +	return 0;
> > +
> > +reduce:
> > +	new_wi_state->mode_auto = true;
> > +	reduce_interleave_weights(new_bw, new_wi_state->iw_table);
> > +	rcu_assign_pointer(wi_state, new_wi_state);
> > +
> > +	mutex_unlock(&wi_state_lock);
> > +	if (old_wi_state) {
> > +		synchronize_rcu();
> > +		kfree(old_wi_state);
> > +	}
> > +	kfree(old_bw);
> > +
> > +	return 0;
> > +}
> > +
> >  /**
> >   * numa_nearest_node - Find nearest node by state
> >   * @node: Node id to start the search
> > @@ -1988,34 +2093,33 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> >  	u8 *table;
> >  	unsigned int weight_total = 0;
> >  	u8 weight;
> > -	int nid;
> > +	int nid = 0;
> >  
> >  	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
> >  	if (!nr_nodes)
> >  		return numa_node_id();
> >  
> >  	rcu_read_lock();
> > -	table = rcu_dereference(iw_table);
> > +	if (!rcu_access_pointer(wi_state))
> > +		goto out;
> 
> If wi_state == NULL, why should we always return 0?  IIUC, wi_state ==
> NULL means that the weight of every node is 1.

That is true. We can still compute the correct node by assuming all weights
are 1 -- I will make this change.

> > +
> > +	table = rcu_dereference(wi_state)->iw_table;
> >  	/* calculate the total weight */
> > -	for_each_node_mask(nid, nodemask) {
> > -		/* detect system default usage */
> > -		weight = table ? table[nid] : 1;
> > -		weight = weight ? weight : 1;
> > -		weight_total += weight;
> > -	}
> > +	for_each_node_mask(nid, nodemask)
> > +		weight_total += table ? table[nid] : 1;
> 
> When will table be NULL here?

It couldn't be NULL before. But given your feedback above, we can just set
table to NULL when wi_state does not exist, and the code will behave
as intended.
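
Roughly, the body of weighted_interleave_nid() would then become
(untested sketch):

	struct weighted_interleave_state *state;
	u8 *table, weight;
	...

	rcu_read_lock();
	state = rcu_dereference(wi_state);
	table = state ? state->iw_table : NULL;

	/* calculate the total weight */
	for_each_node_mask(nid, nodemask)
		weight_total += table ? table[nid] : 1;

	/* Calculate the node offset based on totals */
	target = ilx % weight_total;
	nid = first_node(nodemask);
	while (target) {
		weight = table ? table[nid] : 1;
		if (target < weight)
			break;
		target -= weight;
		nid = next_node_in(nid, nodemask);
	}
	rcu_read_unlock();
	return nid;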

[...snip...]

> > +update_wi_state:
> > +	rcu_assign_pointer(wi_state, new_wi_state);
> > +	mutex_unlock(&wi_state_lock);
> > +	if (old_wi_state) {
> > +		synchronize_rcu();
> > +		kfree(old_wi_state);
> > +	}
> > +	return count;
> > +}
> > +
> > +static struct kobj_attribute wi_attr =
> 
> NIT: "wi_attr" appears too general for me.  Maybe something like
> "wi_auto_attr"?

Will do!
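It will become:

static struct kobj_attribute wi_auto_attr =
	__ATTR(auto, 0664, weighted_interleave_auto_show,
			   weighted_interleave_auto_store);
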
> ---
> Best Regards,
> Huang, Ying

Thank you for all of your feedback, Ying! I will send out a v8 soon with
all of your proposed changes. Have a great day!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)
diff mbox series

Patch

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
index 0b7972de04e9..862b19943a85 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -20,6 +20,34 @@  Description:	Weight configuration interface for nodeN
 		Minimum weight: 1
 		Maximum weight: 255
 
-		Writing an empty string or `0` will reset the weight to the
-		system default. The system default may be set by the kernel
-		or drivers at boot or during hotplug events.
+		Writing invalid values (i.e. any values not in [1,255],
+		empty string, ...) will return -EINVAL.
+
+		Changing the weight to a valid value will automatically
+		update the system to manual mode as well.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
+Date:		February 2025
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Auto-weighting configuration interface
+
+		Configuration mode for weighted interleave. A 'Y' indicates
+		that the system is in auto mode, and a 'N' indicates that
+		the system is in manual mode.
+
+		In auto mode, all node weights are re-calculated and overwritten
+		(visible via the nodeN interfaces) whenever new bandwidth data
+		is made available during either boot or hotplug events.
+
+		In manual mode, node weights can only be updated by the user.
+		Note that nodes that are onlined with previously set weights
+		will inherit those weights. If they were not previously set or
+		are onlined with missing bandwidth data, the weights will use
+		a default weight of 1.
+
+		Writing Y or 1 to the interface will enable auto mode, while
+		writing N or 0 will enable manual mode. All other strings will
+		be ignored, and -EINVAL will be returned.
+
+		Writing a new weight to a node directly via the nodeN interface
+		will also automatically update the system to manual mode.
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..f3c01fb90db1 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@ 
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/mempolicy.h>
 #include <linux/vmstat.h>
 #include <linux/notifier.h>
 #include <linux/node.h>
@@ -214,6 +215,14 @@  void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
 			break;
 		}
 	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access == ACCESS_COORDINATE_CPU) {
+		if (mempolicy_set_node_perf(nid, coord)) {
+			pr_info("failed to set mempolicy attrs for node %d\n",
+				nid);
+		}
+	}
 }
 EXPORT_SYMBOL_GPL(node_set_perf_attrs);
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ce9885e0178a..78f1299bdd42 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -11,6 +11,7 @@ 
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
+#include <linux/node.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <uapi/linux/mempolicy.h>
@@ -56,6 +57,11 @@  struct mempolicy {
 	} w;
 };
 
+struct weighted_interleave_state {
+	bool mode_auto;
+	u8 iw_table[]; /* A null iw_table is interpreted as an array of 1s. */
+};
+
 /*
  * Support for managing mempolicy data objects (clone, copy, destroy)
  * The default fast path of a NULL MPOL_DEFAULT policy is always inlined.
@@ -178,6 +184,9 @@  static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 
 extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
 
+extern int mempolicy_set_node_perf(unsigned int node,
+				   struct access_coordinate *coords);
+
 #else
 
 struct mempolicy {};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index bbaadbeeb291..857ea3faa5cb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -109,6 +109,7 @@ 
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
 #include <linux/swapops.h>
+#include <linux/gcd.h>
 
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -139,31 +140,135 @@  static struct mempolicy default_policy = {
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
 /*
- * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
- * system-default value should be used. A NULL iw_table also denotes that
- * system-default values should be used. Until the system-default table
- * is implemented, the system-default is always 1.
- *
- * iw_table is RCU protected
+ * weightiness balances the tradeoff between small weights (cycles through nodes
+ * faster, more fair/even distribution) and large weights (smaller errors
+ * between actual bandwidth ratios and weight ratios). 32 is a number that has
+ * been found to perform at a reasonable compromise between the two goals.
  */
-static u8 __rcu *iw_table;
-static DEFINE_MUTEX(iw_table_lock);
+static const int weightiness = 32;
+
+/* wi_state is RCU protected */
+static struct weighted_interleave_state __rcu *wi_state;
+static unsigned int *node_bw_table;
+
+/*
+ * wi_state_lock protects both wi_state and node_bw_table.
+ * node_bw_table is only used by writers to update wi_state.
+ */
+static DEFINE_MUTEX(wi_state_lock);
 
 static u8 get_il_weight(int node)
 {
-	u8 *table;
-	u8 weight;
+	u8 weight = 1;
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	/* if no iw_table, use system default */
-	weight = table ? table[node] : 1;
-	/* if value in iw_table is 0, use system default */
-	weight = weight ? weight : 1;
+	if (rcu_access_pointer(wi_state))
+		weight = rcu_dereference(wi_state)->iw_table[node];
 	rcu_read_unlock();
+
 	return weight;
 }
 
+/*
+ * Convert bandwidth values into weighted interleave weights.
+ * Call with wi_state_lock.
+ */
+static void reduce_interleave_weights(unsigned int *bw, u8 *new_iw)
+{
+	u64 sum_bw = 0;
+	unsigned int cast_sum_bw, scaling_factor = 1, iw_gcd = 0;
+	int nid;
+
+	for_each_node_state(nid, N_MEMORY)
+		sum_bw += bw[nid];
+
+	/* Scale bandwidths to whole numbers in the range [1, weightiness] */
+	for_each_node_state(nid, N_MEMORY) {
+		/*
+		 * Try not to perform 64-bit division.
+		 * If sum_bw < scaling_factor, then sum_bw < U32_MAX.
+		 * If sum_bw > scaling_factor, then round the weight up to 1.
+		 */
+		scaling_factor = weightiness * bw[nid];
+		if (bw[nid] && sum_bw < scaling_factor) {
+			cast_sum_bw = (unsigned int)sum_bw;
+			new_iw[nid] = scaling_factor / cast_sum_bw;
+		} else {
+			new_iw[nid] = 1;
+		}
+		if (!iw_gcd)
+			iw_gcd = new_iw[nid];
+		iw_gcd = gcd(iw_gcd, new_iw[nid]);
+	}
+
+	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
+	for_each_node_state(nid, N_MEMORY)
+		new_iw[nid] /= iw_gcd;
+}
+
+int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
+{
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
+	unsigned int *old_bw, *new_bw;
+	unsigned int bw_val;
+	int i;
+
+	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
+	new_bw = kcalloc(nr_node_ids, sizeof(unsigned int), GFP_KERNEL);
+	if (!new_bw)
+		return -ENOMEM;
+
+	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state) {
+		kfree(new_bw);
+		return -ENOMEM;
+	}
+	for (i = 0; i < nr_node_ids; i++)
+		new_wi_state->iw_table[i] = 1;
+
+	/*
+	 * Update bandwidth info, even in manual mode. That way, when switching
+	 * to auto mode in the future, iw_table can be overwritten using
+	 * accurate bw data.
+	 */
+	mutex_lock(&wi_state_lock);
+
+	old_bw = node_bw_table;
+	if (old_bw)
+		memcpy(new_bw, old_bw, nr_node_ids * sizeof(unsigned int));
+	new_bw[node] = bw_val;
+	node_bw_table = new_bw;
+
+	/* wi_state not initialized yet; assume auto == true */
+	if (!rcu_access_pointer(wi_state))
+		goto reduce;
+
+	old_wi_state = rcu_dereference_protected(wi_state,
+					lockdep_is_held(&wi_state_lock));
+	if (old_wi_state->mode_auto)
+		goto reduce;
+
+	mutex_unlock(&wi_state_lock);
+	kfree(new_wi_state);
+	kfree(old_bw);
+	return 0;
+
+reduce:
+	new_wi_state->mode_auto = true;
+	reduce_interleave_weights(new_bw, new_wi_state->iw_table);
+	rcu_assign_pointer(wi_state, new_wi_state);
+
+	mutex_unlock(&wi_state_lock);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
+	kfree(old_bw);
+
+	return 0;
+}
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -1988,34 +2093,33 @@  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	u8 *table;
 	unsigned int weight_total = 0;
 	u8 weight;
-	int nid;
+	int nid = 0;
 
 	nr_nodes = read_once_policy_nodemask(pol, &nodemask);
 	if (!nr_nodes)
 		return numa_node_id();
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
+	if (!rcu_access_pointer(wi_state))
+		goto out;
+
+	table = rcu_dereference(wi_state)->iw_table;
 	/* calculate the total weight */
-	for_each_node_mask(nid, nodemask) {
-		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
-		weight_total += weight;
-	}
+	for_each_node_mask(nid, nodemask)
+		weight_total += table ? table[nid] : 1;
 
 	/* Calculate the node offset based on totals */
 	target = ilx % weight_total;
 	nid = first_node(nodemask);
 	while (target) {
 		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
+		weight = table[nid];
 		if (target < weight)
 			break;
 		target -= weight;
 		nid = next_node_in(nid, nodemask);
 	}
+out:
 	rcu_read_unlock();
 	return nid;
 }
@@ -2411,13 +2515,14 @@  static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 		struct mempolicy *pol, unsigned long nr_pages,
 		struct page **page_array)
 {
+	struct weighted_interleave_state *state;
 	struct task_struct *me = current;
 	unsigned int cpuset_mems_cookie;
 	unsigned long total_allocated = 0;
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
-	u8 *table, *weights, weight;
+	u8 *weights, weight;
 	unsigned int weight_total = 0;
 	unsigned long rem_pages = nr_pages;
 	nodemask_t nodes;
@@ -2467,17 +2572,19 @@  static unsigned long alloc_pages_bulk_weighted_interleave(gfp_t gfp,
 		return total_allocated;
 
 	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	if (table)
-		memcpy(weights, table, nr_node_ids);
-	rcu_read_unlock();
+	if (rcu_access_pointer(wi_state)) {
+		state = rcu_dereference(wi_state);
+		memcpy(weights, state->iw_table, nr_node_ids * sizeof(u8));
+		rcu_read_unlock();
+	} else {
+		rcu_read_unlock();
+		for (i = 0; i < nr_node_ids; i++)
+			weights[i] = 1;
+	}
 
 	/* calculate total, detect system default usage */
-	for_each_node_mask(node, nodes) {
-		if (!weights[node])
-			weights[node] = 1;
+	for_each_node_mask(node, nodes)
 		weight_total += weights[node];
-	}
 
 	/*
 	 * Calculate rounds/partial rounds to minimize __alloc_pages_bulk calls.
@@ -3402,36 +3509,112 @@  static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
 static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 			  const char *buf, size_t count)
 {
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
 	struct iw_node_attr *node_attr;
-	u8 *new;
-	u8 *old;
 	u8 weight = 0;
+	int i;
 
 	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
 	if (count == 0 || sysfs_streq(buf, ""))
 		weight = 0;
-	else if (kstrtou8(buf, 0, &weight))
+	else if (kstrtou8(buf, 0, &weight) || weight == 0)
 		return -EINVAL;
 
-	new = kzalloc(nr_node_ids, GFP_KERNEL);
-	if (!new)
+	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state)
 		return -ENOMEM;
 
-	mutex_lock(&iw_table_lock);
-	old = rcu_dereference_protected(iw_table,
-					lockdep_is_held(&iw_table_lock));
-	if (old)
-		memcpy(new, old, nr_node_ids);
-	new[node_attr->nid] = weight;
-	rcu_assign_pointer(iw_table, new);
-	mutex_unlock(&iw_table_lock);
-	synchronize_rcu();
-	kfree(old);
+	mutex_lock(&wi_state_lock);
+	if (rcu_access_pointer(wi_state)) {
+		old_wi_state = rcu_dereference_protected(wi_state,
+					lockdep_is_held(&wi_state_lock));
+		memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
+					nr_node_ids * sizeof(u8));
+	} else {
+		for (i = 0; i < nr_node_ids; i++)
+			new_wi_state->iw_table[i] = 1;
+	}
+	new_wi_state->iw_table[node_attr->nid] = weight;
+	new_wi_state->mode_auto = false;
+
+	rcu_assign_pointer(wi_state, new_wi_state);
+	mutex_unlock(&wi_state_lock);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
 	return count;
 }
 
 static struct iw_node_attr **node_attrs;
 
+static ssize_t weighted_interleave_auto_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	bool wi_auto = true;
+
+	rcu_read_lock();
+	if (rcu_access_pointer(wi_state))
+		wi_auto = rcu_dereference(wi_state)->mode_auto;
+	rcu_read_unlock();
+
+	return sysfs_emit(buf, "%s\n", str_true_false(wi_auto));
+}
+
+static ssize_t weighted_interleave_auto_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	struct weighted_interleave_state *new_wi_state, *old_wi_state = NULL;
+	unsigned int *bw;
+	bool input;
+	int i;
+
+	if (kstrtobool(buf, &input))
+		return -EINVAL;
+
+	new_wi_state = kzalloc(struct_size(new_wi_state, iw_table, nr_node_ids),
+			       GFP_KERNEL);
+	if (!new_wi_state)
+		return -ENOMEM;
+	for (i = 0; i < nr_node_ids; i++)
+		new_wi_state->iw_table[i] = 1;
+
+	mutex_lock(&wi_state_lock);
+	if (!input) {
+		if (rcu_access_pointer(wi_state)) {
+			old_wi_state = rcu_dereference_protected(wi_state,
+					lockdep_is_held(&wi_state_lock));
+			memcpy(new_wi_state->iw_table, old_wi_state->iw_table,
+					nr_node_ids * sizeof(u8));
+		}
+		goto update_wi_state;
+	}
+
+	bw = node_bw_table;
+	if (!bw) {
+		mutex_unlock(&wi_state_lock);
+		kfree(new_wi_state);
+		return -ENODEV;
+	}
+
+	new_wi_state->mode_auto = true;
+	reduce_interleave_weights(bw, new_wi_state->iw_table);
+
+update_wi_state:
+	rcu_assign_pointer(wi_state, new_wi_state);
+	mutex_unlock(&wi_state_lock);
+	if (old_wi_state) {
+		synchronize_rcu();
+		kfree(old_wi_state);
+	}
+	return count;
+}
+
+static struct kobj_attribute wi_attr =
+	__ATTR(auto, 0664, weighted_interleave_auto_show,
+			   weighted_interleave_auto_store);
+
 static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
 				  struct kobject *parent)
 {
@@ -3489,6 +3672,15 @@  static int add_weight_node(int nid, struct kobject *wi_kobj)
 	return 0;
 }
 
+static struct attribute *wi_default_attrs[] = {
+	&wi_attr.attr,
+	NULL
+};
+
+static const struct attribute_group wi_attr_group = {
+	.attrs = wi_default_attrs,
+};
+
 static int add_weighted_interleave_group(struct kobject *root_kobj)
 {
 	struct kobject *wi_kobj;
@@ -3505,6 +3697,13 @@  static int add_weighted_interleave_group(struct kobject *root_kobj)
 		return err;
 	}
 
+	err = sysfs_create_group(wi_kobj, &wi_attr_group);
+	if (err) {
+		pr_err("failed to add sysfs [auto]\n");
+		kobject_put(wi_kobj);
+		return err;
+	}
+
 	for_each_node_state(nid, N_POSSIBLE) {
 		err = add_weight_node(nid, wi_kobj);
 		if (err) {
@@ -3519,15 +3718,22 @@  static int add_weighted_interleave_group(struct kobject *root_kobj)
 
 static void mempolicy_kobj_release(struct kobject *kobj)
 {
-	u8 *old;
+	struct weighted_interleave_state *old_wi_state;
+
+	mutex_lock(&wi_state_lock);
+	if (!rcu_access_pointer(wi_state)) {
+		mutex_unlock(&wi_state_lock);
+		goto out;
+	}
+
+	old_wi_state = rcu_dereference_protected(wi_state,
+			lockdep_is_held(&wi_state_lock));
 
-	mutex_lock(&iw_table_lock);
-	old = rcu_dereference_protected(iw_table,
-					lockdep_is_held(&iw_table_lock));
-	rcu_assign_pointer(iw_table, NULL);
-	mutex_unlock(&iw_table_lock);
+	rcu_assign_pointer(wi_state, NULL);
+	mutex_unlock(&wi_state_lock);
 	synchronize_rcu();
-	kfree(old);
+	kfree(old_wi_state);
+out:
 	kfree(node_attrs);
 	kfree(kobj);
 }