[v5] mm/mempolicy: Weighted Interleave Auto-tuning

On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allowed for pages to be allocated across nodes according to user-set ratios.   

Ideally, these weights should be proportional to their bandwidth, so
that under bandwidth pressure, each node uses its maximal efficient
bandwidth and prevents latency from increasing exponentially.

At the same time, we want these weights to be as small as possible.
Having ratios that involve large co-prime numbers like 7639:1345:7 leads
to awkward and inefficient allocations, since the node with weight 7
will remain mostly unused (and despite being proportional to bandwidth,
will not aid in relieving the bandwidth pressure in the other two nodes).

This patch introduces an auto-configuration mode for the interleave
weights that aims to balance the two goals of setting node weights to be
proportional to their bandwidths and keeping the weight values low.
In order to perform the weight re-scaling, we use an internal
"weightiness" value (fixed to 32) that defines interleave aggression.

In this auto configuration mode, node weights are dynamically updated
every time there is a hotplug event that introduces new bandwidth.

Users can also enter manual mode by writing "N" or "0" to the new "auto"
sysfs interface. When a user enters manual mode, the system stops
dynamically updating any of the node weights, even during hotplug events
that can shift the optimal weight distribution. The system also enters
manual mode any time a user sets a node's weight directly by using the
nodeN interface introduced in [1]. On the other hand, auto mode is
only entered by explicitly writing "Y" or "1" to the auto interface.

There is one functional change that this patch makes to the existing
weighted_interleave ABI: previously, writing 0 directly to a nodeN
interface was said to reset the weight to the system default. Before
this patch, the default for all weights were 1, which meant that writing
0 and 1 were functionally equivalent.

This patch introduces "real" defaults, but moves away from letting users
use 0 as a "set to default" interface. Rather, users who want to use
system defaults should use auto mode. This patch seems to be the
appropriate place to make this change, since we would like to remove
this usage before users begin to rely on the feature in userspace.
Moreover, users will not be losing any functionality; they can still
write 1 into a node if they want a weight of 1. Thus, we deprecate the
"write zero to reset" feature in favor of returning an error, the same
way we would return an error when the user writes any other invalid
weight to the interface.

[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Co-developed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
Changelog
v5:
- I accidentally forgot to add the mm/mempolicy: subject tag since v1 of
  this patch. Added to the subject now!
- Wordsmithing, correcting typos, and re-naming variables for clarity.
- No functional changes.
v4:
- Renamed the mode interface to the "auto" interface, which now only
  emits either 'Y' or 'N'. Users can now interact with it by
  writing 'Y', '1', 'N', or '0' to it.
- Added additional documentation to the nodeN sysfs interface.
- Makes sure iw_table locks are properly held.
- Removed unlikely() call in reduce_interleave_weights.
- Wordsmithing

v3:
- Weightiness (max_node_weight) is now fixed to 32.
- Instead, the sysfs interface now exposes a "mode" parameter, which
  can either be "auto" or "manual".
  - Thank you Hyeonggon and Honggyu for the feedback.
- Documentation updated to reflect new sysfs interface, explicitly
  specifies that 0 is invalid.
  - Thank you Gregory and Ying for the discussion on how best to
    handle the 0 case.
- Re-worked nodeN sysfs store to handle auto --> manual shifts
- mempolicy_set_node_perf internally handles the auto / manual
  case differently now. bw is always updated, iw updates depend on
  what mode the user is in.
- Wordsmithing comments for clarity.
- Removed RFC tag.

v2:
- Name of the interface is changed: "max_node_weight" --> "weightiness"
- Default interleave weight table no longer exists. Rather, the
  interleave weight table is initialized with the defaults, if bandwidth
  information is available.
  - In addition, all sections that handle iw_table have been changed
    to reference iw_table if it exists, otherwise defaulting to 1.
- All instances of unsigned long are converted to uint64_t to guarantee
  support for both 32-bit and 64-bit machines
- sysfs initialization cleanup
- Documentation has been rewritten to explicitly outline expected
  behavior and expand on the interpretation of "weightiness".
- kzalloc replaced with kcalloc for readability
- Thank you Gregory and Hyeonggon for your review & feedback!
 ...fs-kernel-mm-mempolicy-weighted-interleave |  38 +++-
 drivers/acpi/numa/hmat.c                      |   1 +
 drivers/base/node.c                           |   7 +
 include/linux/mempolicy.h                     |   4 +
 mm/mempolicy.c                                | 200 ++++++++++++++++--
 5 files changed, 233 insertions(+), 17 deletions(-)

Message ID	20250207201335.2105488-1-joshua.hahnjy@gmail.com (mailing list archive)
State	Handled Elsewhere, archived
Headers	show Received: from mail-yw1-f171.google.com (mail-yw1-f171.google.com [209.85.128.171]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id DB48A187342 for <linux-acpi@vger.kernel.org>; Fri, 7 Feb 2025 20:13:37 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.128.171 ARC-Seal: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738959220; cv=none; b=eyrWt/YCHwbLvbf0BZbUB+H6vLgCKmuIBIJF41WG9iaHPzQy7mJ7kzIBzJOzC48OXnUuRvxeCQ67Zf1T403Wf8Csr6xKAJ7LYfprZ4GsLMHfS4CTGgdQk88+Vzig2PACiE2L6oprDD0o/87yjwGdKi+J3UkxMqM2gIxy58eXmUQ= ARC-Message-Signature: i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1738959220; c=relaxed/simple; bh=hwXmYzqG510So7CEj3Y9AKpPBPRpw5a+ZoJmoly4wW4=; h=From:To:Cc:Subject:Date:Message-ID:MIME-Version; b=bIVMayE6dV4sNnW48h4yZM9alT/qfiilCAlRnKNavJOlVw9YqoTaCZInY/xdJ20X7KiaHrOEguELpU4SHJxON6NIGNgoy4gDLqMZF6reRcUI/INMkIkAQ30Co+XfuLVmMQSbTdFsZLUwCB5++hewF+Gt3MP9VQkzobsQev9a+AI= ARC-Authentication-Results: i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com; spf=pass smtp.mailfrom=gmail.com; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b=W3gStFaX; arc=none smtp.client-ip=209.85.128.171 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=gmail.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=gmail.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=gmail.com header.i=@gmail.com header.b="W3gStFaX" Received: by mail-yw1-f171.google.com with SMTP id 00721157ae682-6f99efce804so16572747b3.0 for <linux-acpi@vger.kernel.org>; Fri, 07 Feb 2025 12:13:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20230601; t=1738959216; x=1739564016; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:from:to:cc:subject:date:message-id:reply-to; bh=7mYJFebpoi4yOQtzJHtftr4mu1KzlJKTN1yGdBctV7U=; b=W3gStFaXUtO07Aj/bB09Gvn2zyv1kfihMUyfEvjmTUBuAMqDkWVr5eEze+RwKPDnLC W1ROfzE0wAVHc0C7ttMZIFUZMc81WqxnydiV9YrTW8BkONgH2mbsJb5t4HknfWSwDa8x 8z300WqmnHVNj6lQmwfWKcKrMROH4+3WOfxiPJ/ohdcU7EMI/tdA8dyBNWtoiTtLyJrx Lmd46m+8wmc3EB2DynzlJ/GIwoXGcsmGX8TdKCfa6EfY0hnBKYvf7fjm099mltSJsi0F 6mLiF4b255U6Fl95lnA41Al+ujt66PrqxcLF9Ehm3U9QGnU7Tm1mhF/Yx2pQFqX/sto9 gmqw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1738959216; x=1739564016; h=content-transfer-encoding:mime-version:message-id:date:subject:cc :to:from:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=7mYJFebpoi4yOQtzJHtftr4mu1KzlJKTN1yGdBctV7U=; b=APMZvov+LUMYRQai5vSKHvMWrVA8zY69oOFt9TexTGuzAZ/g4x5O1mvlJUf91PHLPn aKBvFnMvNyjJgulOm69zXP+DoGzeHsFj2t1KlR4ZYE67vGOyfdICRnfWxadMSCQ2Ijwt 2qhiLq4Z8O2z+uHVWiIxj3YWWv8YwJMPwdEJxtfbGzDMdcx78CKyvTkW44JvQPkz1v4H hTVVWZEmLI2bEb/pBmdiJDWWnPxoclqxexrKasT0XgXnkusLehbvI35+e4q9AFSCwMuC mAqpDCqc+RhVgztg22DEgmd8bxd1ewWYQGbHv2B8KODc/oOvRf9Vrhq5BkjbPmmBO6fn x+jw== X-Forwarded-Encrypted: i=1; AJvYcCVIDzMRKOvA9iT5rlvhQ/4ToGy9S6E6X496TfxAIe0WhyN68eNdYyObdnY81Dw8ZY0KPiK3K6r80N3f@vger.kernel.org X-Gm-Message-State: AOJu0YzgqZMBA0evXGhGKc5BEZTlKR+6r6/oqDANd2LlHRD22/staUuR HngP7beVFlFLPwjMXngPgRma7rHNIOmM3UeG2sow1ZoDKawWpPbq X-Gm-Gg: ASbGncutQoXq1pW9+qDEjg6Dyhvdj9H1ppmn7PYbEY/H0VPDP57uduF2qpgVp3LM/ec Q/UVGM7cLgbf3oEob5DPQqB/w/ItTGXLjePfx8WmnoEFXsu0FwzefrK3p6BBCZsxedWJPwRTPHO m4pxibL8k8ImLrJcJhHF5I3zhqniaHKLMm9vdmbsnIEmM/HZfereEy87mw2TlMgh6LAdDFbGvKa xx0CGcQu3mQYQubLQt9zop10T7lr6A69qJ8aoBbL1VVkpcCV4iPNqN+4kBjGhkN5x9f93Rxfnoe VXvvBMKeqegQsg== X-Google-Smtp-Source: AGHT+IEj5+2JaKmommFYoDfCG03GC7hDejaD+3XiTT60PmEG6Fsl7wApn+z8yPD0v86UsaiMG5Lr6w== X-Received: by 2002:a05:690c:4909:b0:6f9:a9ee:d92e with SMTP id 00721157ae682-6f9b2a06f98mr49954037b3.35.1738959216405; Fri, 07 Feb 2025 12:13:36 -0800 (PST) Received: from localhost ([2a03:2880:25ff:8::]) by smtp.gmail.com with ESMTPSA id 00721157ae682-6f99ffc0c0csm6627667b3.114.2025.02.07.12.13.35 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Fri, 07 Feb 2025 12:13:35 -0800 (PST) From: Joshua Hahn <joshua.hahnjy@gmail.com> To: gourry@gourry.net, hyeonggon.yoo@sk.com, ying.huang@linux.alibaba.com, honggyu.kim@sk.com Cc: akpm@linux-foundation.org, rafael@kernel.org, lenb@kernel.org, gregkh@linuxfoundation.org, rakie.kim@sk.com, dan.j.williams@intel.com, Jonathan.Cameron@huawei.com, dave.jiang@intel.com, horen.chuang@linux.dev, hannes@cmpxchg.org, linux-kernel@vger.org, linux-acpi@vger.kernel.org, linux-mm@kvack.org, kernel-team@meta.com Subject: [PATCH v5] mm/mempolicy: Weighted Interleave Auto-tuning Date: Fri, 7 Feb 2025 12:13:35 -0800 Message-ID: <20250207201335.2105488-1-joshua.hahnjy@gmail.com> X-Mailer: git-send-email 2.43.5 Precedence: bulk X-Mailing-List: linux-acpi@vger.kernel.org List-Id: <linux-acpi.vger.kernel.org> List-Subscribe: <mailto:linux-acpi+subscribe@vger.kernel.org> List-Unsubscribe: <mailto:linux-acpi+unsubscribe@vger.kernel.org> MIME-Version: 1.0 Content-Transfer-Encoding: 8bit
Series	[v5] mm/mempolicy: Weighted Interleave Auto-tuning \| expand [v5] mm/mempolicy: Weighted Interleave Auto-tuning

[v5] mm/mempolicy: Weighted Interleave Auto-tuning

Commit Message

Comments

Patch