
[v5] mm/mempolicy: Weighted Interleave Auto-tuning

Message ID 20250207201335.2105488-1-joshua.hahnjy@gmail.com (mailing list archive)
State Handled Elsewhere, archived

Commit Message

Joshua Hahn Feb. 7, 2025, 8:13 p.m. UTC
On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allowed for pages to be allocated across nodes according to user-set ratios.   

Ideally, these weights should be proportional to their bandwidth, so
that under bandwidth pressure, each node uses its maximal efficient
bandwidth and prevents latency from increasing exponentially.

At the same time, we want these weights to be as small as possible.
Having ratios that involve large co-prime numbers like 7639:1345:7 leads
to awkward and inefficient allocations, since the node with weight 7
will remain mostly unused (and despite being proportional to bandwidth,
will not aid in relieving the bandwidth pressure in the other two nodes).

This patch introduces an auto-configuration mode for the interleave
weights that aims to balance the two goals of setting node weights to be
proportional to their bandwidths and keeping the weight values low.
In order to perform the weight re-scaling, we use an internal
"weightiness" value (fixed to 32) that defines interleave aggression.
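To sketch the idea (illustrative Python only, not a verbatim port of the
patch's reduce_interleave_weights()): scale each node's bandwidth so the
weights sum to roughly the weightiness value, clamp each weight to at
least 1, then divide through by the GCD:

```python
from functools import reduce
from math import gcd

def rescale_weights(bandwidths, weightiness=32):
    """Sketch of the rescaling described above: scale raw bandwidths so
    the weights sum to roughly `weightiness`, clamp each to at least 1,
    then reduce by the GCD. Not the kernel's actual implementation."""
    total = sum(bandwidths)
    weights = [max(1, round(bw * weightiness / total)) for bw in bandwidths]
    g = reduce(gcd, weights)
    return [w // g for w in weights]

# The awkward co-prime ratio from above collapses to small weights:
print(rescale_weights([7639, 1345, 7]))   # [27, 5, 1]
```

Under this sketch, the node with the tiny bandwidth still gets weight 1
(so it is not starved entirely), while the other two keep roughly their
bandwidth-proportional shares with small values.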

In this auto configuration mode, node weights are dynamically updated
every time there is a hotplug event that introduces new bandwidth.

Users can also enter manual mode by writing "N" or "0" to the new "auto"
sysfs interface. When a user enters manual mode, the system stops
dynamically updating any of the node weights, even during hotplug events
that can shift the optimal weight distribution. The system also enters
manual mode any time a user sets a node's weight directly by using the
nodeN interface introduced in [1]. On the other hand, auto mode is
only entered by explicitly writing "Y" or "1" to the auto interface.

There is one functional change that this patch makes to the existing
weighted_interleave ABI: previously, writing 0 directly to a nodeN
interface was said to reset the weight to the system default. Before
this patch, the default for all weights was 1, which meant that writing
0 and 1 were functionally equivalent.

This patch introduces "real" defaults, but moves away from letting users
use 0 as a "set to default" interface. Rather, users who want to use
system defaults should use auto mode. This patch seems to be the
appropriate place to make this change, since we would like to remove
this usage before users begin to rely on the feature in userspace.
Moreover, users will not be losing any functionality; they can still
write 1 into a node if they want a weight of 1. Thus, we deprecate the
"write zero to reset" feature in favor of returning an error, the same
way we would return an error when the user writes any other invalid
weight to the interface.
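To make the new semantics concrete, here is a small, purely illustrative
Python model of the nodeN store behavior described above (the function
and state names are hypothetical; the real interface is a kernel sysfs
attribute, not Python):

```python
EINVAL = 22

def store_node_weight(buf, state):
    """Hypothetical model of the new nodeN sysfs store semantics:
    only weights in [1, 255] are accepted, and a successful direct
    write switches the system to manual mode."""
    try:
        weight = int(buf.strip())
    except ValueError:
        return -EINVAL           # empty string / non-numeric input is rejected
    if not 1 <= weight <= 255:
        return -EINVAL           # 0 no longer means "reset to system default"
    state["auto"] = False        # a direct write drops the system to manual mode
    state["weight"] = weight
    return 0

state = {"auto": True, "weight": 1}
print(store_node_weight("0", state))                 # -22 (deprecated reset)
print(store_node_weight("5", state), state["auto"])  # 0 False
```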

[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Co-developed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Gregory Price <gourry@gourry.net>
Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
---
Changelog
v5:
- I accidentally forgot to add the mm/mempolicy: subject tag since v1 of
  this patch. Added to the subject now!
- Wordsmithing, correcting typos, and re-naming variables for clarity.
- No functional changes.
v4:
- Renamed the mode interface to the "auto" interface, which now only
  emits either 'Y' or 'N'. Users can now interact with it by
  writing 'Y', '1', 'N', or '0' to it.
- Added additional documentation to the nodeN sysfs interface.
- Makes sure iw_table locks are properly held.
- Removed unlikely() call in reduce_interleave_weights.
- Wordsmithing

v3:
- Weightiness (max_node_weight) is now fixed to 32.
- Instead, the sysfs interface now exposes a "mode" parameter, which
  can either be "auto" or "manual".
  - Thank you Hyeonggon and Honggyu for the feedback.
- Documentation updated to reflect new sysfs interface, explicitly
  specifies that 0 is invalid.
  - Thank you Gregory and Ying for the discussion on how best to
    handle the 0 case.
- Re-worked nodeN sysfs store to handle auto --> manual shifts
- mempolicy_set_node_perf internally handles the auto / manual
  case differently now. bw is always updated, iw updates depend on
  what mode the user is in.
- Wordsmithing comments for clarity.
- Removed RFC tag.

v2:
- Name of the interface is changed: "max_node_weight" --> "weightiness"
- Default interleave weight table no longer exists. Rather, the
  interleave weight table is initialized with the defaults, if bandwidth
  information is available.
  - In addition, all sections that handle iw_table have been changed
    to reference iw_table if it exists, otherwise defaulting to 1.
- All instances of unsigned long are converted to uint64_t to guarantee
  support for both 32-bit and 64-bit machines
- sysfs initialization cleanup
- Documentation has been rewritten to explicitly outline expected
  behavior and expand on the interpretation of "weightiness".
- kzalloc replaced with kcalloc for readability
- Thank you Gregory and Hyeonggon for your review & feedback!
 ...fs-kernel-mm-mempolicy-weighted-interleave |  38 +++-
 drivers/acpi/numa/hmat.c                      |   1 +
 drivers/base/node.c                           |   7 +
 include/linux/mempolicy.h                     |   4 +
 mm/mempolicy.c                                | 200 ++++++++++++++++--
 5 files changed, 233 insertions(+), 17 deletions(-)

Comments

Andrew Morton Feb. 8, 2025, 2:20 a.m. UTC | #1
On Fri,  7 Feb 2025 12:13:35 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> This patch introduces an auto-configuration mode for the interleave
> weights that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> In order to perform the weight re-scaling, we use an internal
> "weightiness" value (fixed to 32) that defines interleave aggression.

Question please.  How does one determine whether a particular
configuration is working well?  To determine whether
manual-configuration-A is better than manual-configuration-B is better
than auto-configuration?

Leading to... how do we know that this patch makes the kernel better?
Joshua Hahn Feb. 8, 2025, 5:06 a.m. UTC | #2
On Fri, 7 Feb 2025 18:20:09 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri,  7 Feb 2025 12:13:35 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> > This patch introduces an auto-configuration mode for the interleave
> > weights that aims to balance the two goals of setting node weights to be
> > proportional to their bandwidths and keeping the weight values low.
> > In order to perform the weight re-scaling, we use an internal
> > "weightiness" value (fixed to 32) that defines interleave aggression.
> 
> Question please.  How does one determine whether a particular
> configuration is working well?  To determine whether
> manual-configuration-A is better than manual-configuration-B is better
> than auto-configuration?
> 
> Leading to... how do we know that this patch makes the kernel better?

Hello Andrew,

Thank you for your interest in this patch!

To answer your 1st question: I think that users can do some
experimentation with the specific workloads they expect to be running.
In particular, since the weights that provide the best results
are workload-specific, it might make sense to compare results across
the variety of workloads the user expects and see which settings
produce the least throttling.

With that said, this patch introduces defaults that will hopefully help
those who are either unable or uninterested in setting weights themselves.
For users who have already been using weighted interleave and
know what specific weights they should use, the auto settings may
have less impact than for someone who is unsure what the best weights
are (and would rather defer the decision-making to the system).

As for measuring the accuracy of the default weights generated:
The auto mode works by taking nodes' bandwidth data and trying to use
small numbers (between 1 and 255) to approximate those bandwidth values.
For instance, [19000, 4000, 7000] might be converted to something like
[4:1:2], since of course we don't want to be allocating from the second
node only after 19000 pages have already been allocated from the first.

But simultaneously... 4:1:2 is not the same ratio as 19000:4000:7000.
So there is a tradeoff between keeping the weight values accurate
and keeping them small so as not to produce unbalanced distributions.
This is where we chose the value of 32 as the magic "weightiness" value.
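One way to see this tradeoff is to model the rescaling and measure how
far each node's share of allocations drifts from its share of bandwidth
at different weightiness values (a rough sketch under the assumptions
described above, not the kernel code):

```python
from functools import reduce
from math import gcd

def approx(bw, weightiness):
    """Sketch of the rescaling: scale bandwidths to sum ~weightiness,
    clamp each weight to >= 1, reduce by the GCD."""
    total = sum(bw)
    w = [max(1, round(b * weightiness / total)) for b in bw]
    g = reduce(gcd, w)
    return [x // g for x in w]

def max_rel_err(bw, weights):
    """Worst-case relative deviation of a node's share of allocations
    from its share of total bandwidth."""
    bw_total, w_total = sum(bw), sum(weights)
    return max(abs(w / w_total - b / bw_total) / (b / bw_total)
               for b, w in zip(bw, weights))

bw = [19000, 4000, 7000]
for wt in (4, 8, 32):
    weights = approx(bw, wt)
    print(wt, weights, round(max_rel_err(bw, weights), 3))
```

Under this sketch, weightiness 32 keeps every node within a few percent
of its bandwidth share for the example above, while weightiness 4
misses the middle node's share by roughly 50%: larger weightiness
tracks the ratio more closely, smaller weightiness gives coarser but
smaller weights.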

Gregory and I spent quite some time modeling this behavior, trying
different reduction algorithms and weightiness to see what could give
us the most accurate bandwidth data while using the most reasonably
small numbers possible, and ended up with 32. (Earlier versions of
this patch also exposed the weightiness parameter as a sysfs knob,
but it was removed for simplicity's sake.)

We've gotten some nice results (under reasonable conditions) after
running exhaustive tests for a wide array of bandwidth configurations,
which is why we were confident with selecting 32 as the default value.

As for the 2nd question and how this patch makes the kernel better :-)
Like I mentioned above, this patch might not have a large impact to
those already using weighted interleave to see performance gains and
know what weights work the best. However, we believe there are users
out there who (1) have nodes with varying bandwidths (CXL),
(2) have workloads that are bandwidth-bound, and (3) would like to
take advantage of weighted interleave but do not have the capacity
or are not willing to manually change the weights. For these folks,
having defaults that make sense (as opposed to the previous defaults
in weighted interleave, which would make it functionally the same
as unweighted interleave) can provide more options and performance
gains to those who wish to opt in.

I apologize for the long explanation, but I hope that this answers
your question. Please let me know if there is anything else I can do!

Thank you again for your interest. I hope you have a great day!
Joshua
Oscar Salvador Feb. 8, 2025, 6:51 a.m. UTC | #3
On Fri, Feb 07, 2025 at 12:13:35PM -0800, Joshua Hahn wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across nodes according to user-set ratios.   
> 
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
> 
> At the same time, we want these weights to be as small as possible.
> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
> to awkward and inefficient allocations, since the node with weight 7
> will remain mostly unused (and despite being proportional to bandwidth,
> will not aid in relieving the bandwidth pressure in the other two nodes).
> 
> This patch introduces an auto-configuration mode for the interleave
> weights that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> In order to perform the weight re-scaling, we use an internal
> "weightiness" value (fixed to 32) that defines interleave aggression.
> 
> In this auto configuration mode, node weights are dynamically updated
> every time there is a hotplug event that introduces new bandwidth.
> 
> Users can also enter manual mode by writing "N" or "0" to the new "auto"
> sysfs interface. When a user enters manual mode, the system stops
> dynamically updating any of the node weights, even during hotplug events
> that can shift the optimal weight distribution. The system also enters
> manual mode any time a user sets a node's weight directly by using the
> nodeN interface introduced in [1]. On the other hand, auto mode is
> only entered by explicitly writing "Y" or "1" to the auto interface.
> 
> There is one functional change that this patch makes to the existing
> weighted_interleave ABI: previously, writing 0 directly to a nodeN
> interface was said to reset the weight to the system default. Before
> this patch, the default for all weights was 1, which meant that writing
> 0 and 1 were functionally equivalent.
> 
> This patch introduces "real" defaults, but moves away from letting users
> use 0 as a "set to default" interface. Rather, users who want to use
> system defaults should use auto mode. This patch seems to be the
> appropriate place to make this change, since we would like to remove
> this usage before users begin to rely on the feature in userspace.
> Moreover, users will not be losing any functionality; they can still
> write 1 into a node if they want a weight of 1. Thus, we deprecate the
> "write zero to reset" feature in favor of returning an error, the same
> way we would return an error when the user writes any other invalid
> weight to the interface.
> 
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> Co-developed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---

Hi Joshua

> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..16e7a5a8ebe7 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -7,6 +7,7 @@
>  #include <linux/init.h>
>  #include <linux/mm.h>
>  #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>  #include <linux/vmstat.h>
>  #include <linux/notifier.h>
>  #include <linux/node.h>
> @@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>  			break;
>  		}
>  	}
> +
> +	/* When setting CPU access coordinates, update mempolicy */
> +	if (access == ACCESS_COORDINATE_CPU) {
> +		if (mempolicy_set_node_perf(nid, coord))
> +			pr_info("failed to set node%d mempolicy attrs\n", nid);

Not a big deal but I think you want to make that consistent with the error
pr_info? that is: "failed to set mempolicy attrs for node %d".

Also, I guess we cannot reach here with a memoryless node, right?

> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 04f35659717a..51edd3663667 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/printk.h>
>  #include <linux/swapops.h>
> +#include <linux/gcd.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
> @@ -138,16 +139,18 @@ static struct mempolicy default_policy = {
>  
>  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>  
> +static uint64_t *node_bw_table;
> +
>  /*
> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> - * system-default value should be used. A NULL iw_table also denotes that
> - * system-default values should be used. Until the system-default table
> - * is implemented, the system-default is always 1.
> - *
> + * iw_table is the interleave weight table.
> + * If bandwidth data is available and the user is in auto mode, the table
> + * is populated with default values in [1,255].
>   * iw_table is RCU protected
>   */
>  static u8 __rcu *iw_table;
>  static DEFINE_MUTEX(iw_table_lock);
> +static const int weightiness = 32;

You explain why you chose this value in the changelog, but I would
drop a comment, probably in reduce_interleave_weights(), elaborating a
little bit; otherwise someone who stumbles upon that when reading the
code will have to go through the changelog.
Gregory Price Feb. 10, 2025, 5:36 a.m. UTC | #4
On Fri, Feb 07, 2025 at 06:20:09PM -0800, Andrew Morton wrote:
> On Fri,  7 Feb 2025 12:13:35 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> Leading to... how do we know that this patch makes the kernel better?

Just focusing on this question:

The default behavior of weighted interleave without this patch is
equivalent to normal interleave.  This provides a differentiation
out of the box, and that's just a better experience.

We may find the default values / calculations need tweaking in the
future, but this gives us a good starting point.  Anecdotally, I've
seen an "optimal" distribution of 10:1 (based on the reported numbers)
run sub-optimally compared to 7:1 or 13:1 (but better than default mempol).

So there will always be a "try it and see" component to this.

(Not to mention hardware/firmware lies regularly, and their reported
 performance numbers rarely if ever match their tested numbers - so
 *at best* this can be considered a best-effort feature)

~Gregory
Andrew Morton Feb. 11, 2025, 12:39 a.m. UTC | #5
On Mon, 10 Feb 2025 00:36:16 -0500 Gregory Price <gourry@gourry.net> wrote:

> On Fri, Feb 07, 2025 at 06:20:09PM -0800, Andrew Morton wrote:
> > On Fri,  7 Feb 2025 12:13:35 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> > 
> > Leading to... how do we know that this patch makes the kernel better?
> 
> Just focusing on this question:
> 
> The default behavior of weighted interleave without this patch is
> equivalent to normal interleave.  This provides a differentiation
> out of the box, and that's just a better experience.
> 
> We may find the default values / calculations need tweaking in the
> future, but this gives us a good starting point.  Anecdotally, I've
> seen an "optimal" distribution of 10:1 based on the numbers run
> sub-optimally compared to 7:1 or 13:1 (but better than default mempol).

How was this optimality measured/observed?

> So there will always be a "try it and see" component to this.
> 
> (Not to mention hardware/firmware lies regularly, and their reported
>  performance numbers rarely if ever match their tested numbers - so
>  *at best* this can be considered a best-effort feature)
Gregory Price Feb. 11, 2025, 2:14 a.m. UTC | #6
On Mon, Feb 10, 2025 at 04:39:41PM -0800, Andrew Morton wrote:
> On Mon, 10 Feb 2025 00:36:16 -0500 Gregory Price <gourry@gourry.net> wrote:
> 
> > On Fri, Feb 07, 2025 at 06:20:09PM -0800, Andrew Morton wrote:
> > > On Fri,  7 Feb 2025 12:13:35 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> > > 
> > > Leading to... how do we know that this patch makes the kernel better?
> > 
> > Just focusing on this question:
> > 
> > The default behavior of weighted interleave without this patch is
> > equivalent to normal interleave.  This provides a differentiation
> > out of the box, and that's just a better experience.
> > 
> > We may find the default values / calculations need tweaking in the
> > future, but this gives us a good starting point.  Anecdotally, I've
> > seen an "optimal" distribution of 10:1 based on the numbers run
> > sub-optimally compared to 7:1 or 13:1 (but better than default mempol).
> 
> How was this optimality measured/observed?
> 

TL;DR: We used MLC to observe highest sustained bandwidth.

Unfortunately I can't post exact numbers at this time.

To simplify the results - HMAT reported bandwidth often drifted
+/- 10% compared to real observed bandwidth. So the distributions
produced by auto-configuration were mildly off - but not by enough
to cause performance degradation; we still saw higher sustained
bandwidth.

When testing the manual configurations I saw that changing from the
auto-selected values to a few ticks in one direction or the other
resulted in *slightly* better results. Not too surprising.

So as long as hardware doesn't lie horrifically, which might be a
tall ask, auto config has a good shot at giving a decent default.

~Gregory
Andrew Morton Feb. 12, 2025, 12:17 a.m. UTC | #7
On Fri,  7 Feb 2025 21:06:04 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:

> On Fri, 7 Feb 2025 18:20:09 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
> 
> > On Fri,  7 Feb 2025 12:13:35 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> > 
> > > This patch introduces an auto-configuration mode for the interleave
> > > weights that aims to balance the two goals of setting node weights to be
> > > proportional to their bandwidths and keeping the weight values low.
> > > In order to perform the weight re-scaling, we use an internal
> > > "weightiness" value (fixed to 32) that defines interleave aggression.
> > 
> > Question please.  How does one determine whether a particular
> > configuration is working well?  To determine whether
> > manual-configuration-A is better than manual-configuration-B is better
> > than auto-configuration?
> > 
> > Leading to... how do we know that this patch makes the kernel better?
> 
> Hello Andrew,
> 
> Thank you for your interest in this patch!
> 
> To answer your 1st question: I think that users can do some
>
> ...
>

Interesting, thanks.

Have we adequately documented all these considerations for our users or
can we add some additional words in an appropriate place?
Huang, Ying Feb. 12, 2025, 2:49 a.m. UTC | #8
Hi, Joshua,

Thanks for your patch and sorry for late reply.

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across nodes according to user-set ratios.   
>
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
>
> At the same time, we want these weights to be as small as possible.
> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
> to awkward and inefficient allocations, since the node with weight 7
> will remain mostly unused (and despite being proportional to bandwidth,
> will not aid in relieving the bandwidth pressure in the other two nodes).
>
> This patch introduces an auto-configuration mode for the interleave
> weights that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> In order to perform the weight re-scaling, we use an internal
> "weightiness" value (fixed to 32) that defines interleave aggression.

As asked by Andrew in another thread, you may need to make it more
explicit about why we need this patch.

> In this auto configuration mode, node weights are dynamically updated
> every time there is a hotplug event that introduces new bandwidth.
>
> Users can also enter manual mode by writing "N" or "0" to the new "auto"
> sysfs interface. When a user enters manual mode, the system stops
> dynamically updating any of the node weights, even during hotplug events
> that can shift the optimal weight distribution. The system also enters
> manual mode any time a user sets a node's weight directly by using the
> nodeN interface introduced in [1]. On the other hand, auto mode is
> only entered by explicitly writing "Y" or "1" to the auto interface.
>
> There is one functional change that this patch makes to the existing
> weighted_interleave ABI: previously, writing 0 directly to a nodeN
> interface was said to reset the weight to the system default. Before
> this patch, the default for all weights was 1, which meant that writing
> 0 and 1 were functionally equivalent.
>
> This patch introduces "real" defaults, but moves away from letting users
> use 0 as a "set to default" interface. Rather, users who want to use
> system defaults should use auto mode. This patch seems to be the
> appropriate place to make this change, since we would like to remove
> this usage before users begin to rely on the feature in userspace.
> Moreover, users will not be losing any functionality; they can still
> write 1 into a node if they want a weight of 1. Thus, we deprecate the
> "write zero to reset" feature in favor of returning an error, the same
> way we would return an error when the user writes any other invalid
> weight to the interface.
>
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> Co-developed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Reviewed-by: Hyeonggon Yoo <42.hyeyoo@gmail.com>
> ---
> Changelog
> v5:
> - I accidentally forgot to add the mm/mempolicy: subject tag since v1 of
>   this patch. Added to the subject now!
> - Wordsmithing, correcting typos, and re-naming variables for clarity.
> - No functional changes.
> v4:
> - Renamed the mode interface to the "auto" interface, which now only
>   emits either 'Y' or 'N'. Users can now interact with it by
>   writing 'Y', '1', 'N', or '0' to it.
> - Added additional documentation to the nodeN sysfs interface.
> - Makes sure iw_table locks are properly held.
> - Removed unlikely() call in reduce_interleave_weights.
> - Wordsmithing
>
> v3:
> - Weightiness (max_node_weight) is now fixed to 32.
> - Instead, the sysfs interface now exposes a "mode" parameter, which
>   can either be "auto" or "manual".
>   - Thank you Hyeonggon and Honggyu for the feedback.
> - Documentation updated to reflect new sysfs interface, explicitly
>   specifies that 0 is invalid.
>   - Thank you Gregory and Ying for the discussion on how best to
>     handle the 0 case.
> - Re-worked nodeN sysfs store to handle auto --> manual shifts
> - mempolicy_set_node_perf internally handles the auto / manual
>   case differently now. bw is always updated, iw updates depend on
>   what mode the user is in.
> - Wordsmithing comments for clarity.
> - Removed RFC tag.
>
> v2:
> - Name of the interface is changed: "max_node_weight" --> "weightiness"
> - Default interleave weight table no longer exists. Rather, the
>   interleave weight table is initialized with the defaults, if bandwidth
>   information is available.
>   - In addition, all sections that handle iw_table have been changed
>     to reference iw_table if it exists, otherwise defaulting to 1.
> - All instances of unsigned long are converted to uint64_t to guarantee
>   support for both 32-bit and 64-bit machines
> - sysfs initialization cleanup
> - Documentation has been rewritten to explicitly outline expected
>   behavior and expand on the interpretation of "weightiness".
> - kzalloc replaced with kcalloc for readability
> - Thank you Gregory and Hyeonggon for your review & feedback!
>  ...fs-kernel-mm-mempolicy-weighted-interleave |  38 +++-
>  drivers/acpi/numa/hmat.c                      |   1 +
>  drivers/base/node.c                           |   7 +
>  include/linux/mempolicy.h                     |   4 +
>  mm/mempolicy.c                                | 200 ++++++++++++++++--
>  5 files changed, 233 insertions(+), 17 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..ef43228d135d 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -20,6 +20,38 @@ Description:	Weight configuration interface for nodeN
>  		Minimum weight: 1
>  		Maximum weight: 255
>  
> -		Writing an empty string or `0` will reset the weight to the
> -		system default. The system default may be set by the kernel
> -		or drivers at boot or during hotplug events.
> +		Writing invalid values (i.e. any values not in [1,255],
> +		empty string, ...) will return -EINVAL.
> +
> +		Changing the weight to a valid value will automatically
> +		update the system to manual mode as well.
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> +Date:		February 2025
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Auto-weighting configuration interface
> +
> +		Configuration mode for weighted interleave. A 'Y' indicates
> +		that the system is in auto mode, and a 'N' indicates that
> +		the system is in manual mode. All other values are invalid.
> +
> +		In auto mode, all node weights are re-calculated and overwritten
> +		(visible via the nodeN interfaces) whenever new bandwidth data
> +		is made available during either boot or hotplug events.
> +
> +		In manual mode, node weights can only be updated by the user.
> +		Note that nodes that are onlined with previously set / defaulted

Sorry, I don't understand this well.  What is "defaulted" weights?  Why
not just "previously set"?

> +		weights will inherit those weights. If they were not previously
> +		set or are onlined with missing bandwidth data, they will be
> +		defaulted to a weight of 1.
> +
> +		Writing Y or 1 to the interface will enable auto mode, while
> +		writing N or 0 will enable manual mode. All other strings will
> +		be ignored, and -EINVAL will be returned.
> +
> +		If Y or 1 is written to the interface but the recalculation or
> +		updates fail at any point (-ENOMEM or -ENODEV), then the system
> +		will remain in manual mode.

IMHO, it's unnecessary to make this a part of the interface definition.
However, I have no strong opinion on this.

> +		Writing a new weight to a node directly via the nodeN interface
> +		will also automatically update the system to manual mode.
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 80a3481c0470..cc94cba112dd 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -20,6 +20,7 @@
>  #include <linux/list_sort.h>
>  #include <linux/memregion.h>
>  #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>  #include <linux/mutex.h>
>  #include <linux/node.h>
>  #include <linux/sysfs.h>

This change should be removed?

> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..16e7a5a8ebe7 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -7,6 +7,7 @@
>  #include <linux/init.h>
>  #include <linux/mm.h>
>  #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>  #include <linux/vmstat.h>
>  #include <linux/notifier.h>
>  #include <linux/node.h>
> @@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>  			break;
>  		}
>  	}
> +
> +	/* When setting CPU access coordinates, update mempolicy */
> +	if (access == ACCESS_COORDINATE_CPU) {
> +		if (mempolicy_set_node_perf(nid, coord))
> +			pr_info("failed to set node%d mempolicy attrs\n", nid);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(node_set_perf_attrs);
>  
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index ce9885e0178a..0fe96f3ab3ef 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -11,6 +11,7 @@
>  #include <linux/slab.h>
>  #include <linux/rbtree.h>
>  #include <linux/spinlock.h>
> +#include <linux/node.h>
>  #include <linux/nodemask.h>
>  #include <linux/pagemap.h>
>  #include <uapi/linux/mempolicy.h>
> @@ -178,6 +179,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>  
>  extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
>  
> +extern int mempolicy_set_node_perf(unsigned int node,
> +				   struct access_coordinate *coords);
> +
>  #else
>  
>  struct mempolicy {};
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 04f35659717a..51edd3663667 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/printk.h>
>  #include <linux/swapops.h>
> +#include <linux/gcd.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
> @@ -138,16 +139,18 @@ static struct mempolicy default_policy = {
>  
>  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>  
> +static uint64_t *node_bw_table;

Usually, we use "u64" instead of "uint64_t" in kernel.  It appears that
we don't need u64. "unsigned int" should be enough because "struct
access_coordinate" uses that too.

It's better to move this after iw_table definition below.  And make it
explicit that iw_table_lock protects all interleave weight global
variables.

> +
>  /*
> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> - * system-default value should be used. A NULL iw_table also denotes that
> - * system-default values should be used. Until the system-default table
> - * is implemented, the system-default is always 1.
> - *
> + * iw_table is the interleave weight table.
> + * If bandwidth data is available and the user is in auto mode, the table
                                             ~~~~
                                             system?

> + * is populated with default values in [1,255].

If the system is in manual mode, are the weight values in [1, 255] too?

NULL iw_table is still a valid configuration, so we still need to
describe its behavior.  Right?

>   * iw_table is RCU protected
>   */
>  static u8 __rcu *iw_table;
>  static DEFINE_MUTEX(iw_table_lock);
> +static const int weightiness = 32;
> +static bool weighted_interleave_auto = true;
>  
>  static u8 get_il_weight(int node)
>  {
> @@ -156,14 +159,114 @@ static u8 get_il_weight(int node)
>  
>  	rcu_read_lock();
>  	table = rcu_dereference(iw_table);
> -	/* if no iw_table, use system default */

It appears that this still applies to some degree.

>  	weight = table ? table[node] : 1;
> -	/* if value in iw_table is 0, use system default */
> -	weight = weight ? weight : 1;
>  	rcu_read_unlock();
>  	return weight;
>  }
>  
> +/*
> + * Convert bandwidth values into weighted interleave weights.
> + * Call with iw_table_lock.
> + */
> +static void reduce_interleave_weights(uint64_t *bw, u8 *new_iw)
> +{
> +	uint64_t sum_bw = 0, sum_iw = 0;
> +	uint64_t scaling_factor = 1, iw_gcd = 1;
> +	unsigned int i = 0;
> +
> +	/* Recalculate the bandwidth distribution given the new info */
> +	for (i = 0; i < nr_node_ids; i++)
> +		sum_bw += bw[i];
> +
> +	/* If node is not set or has < 1% of total bw, use minimum value of 1 */
> +	for (i = 0; i < nr_node_ids; i++) {
> +		if (bw[i]) {
> +			scaling_factor = 100 * bw[i];
> +			new_iw[i] = max(scaling_factor / sum_bw, 1);
> +		} else {
> +			new_iw[i] = 1;
> +		}
> +		sum_iw += new_iw[i];
> +	}
> +
> +	/*
> +	 * Scale each node's share of the total bandwidth from percentages
> +	 * to whole numbers in the range [1, weightiness]
> +	 */
> +	for (i = 0; i < nr_node_ids; i++) {
> +		scaling_factor = weightiness * new_iw[i];
> +		new_iw[i] = max(scaling_factor / sum_iw, 1);
> +		if (i == 0)
> +			iw_gcd = new_iw[0];
> +		iw_gcd = gcd(iw_gcd, new_iw[i]);
> +	}
> +
> +	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
> +	for (i = 0; i < nr_node_ids; i++)
> +		new_iw[i] /= iw_gcd;
> +}
> +
> +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> +{
> +	uint64_t *old_bw, *new_bw;
> +	uint64_t bw_val;
> +	u8 *old_iw, *new_iw;
> +
> +	/*
> +	 * Bandwidths above this limit cause rounding errors when reducing
> +	 * weights. This value is ~16 exabytes, which is unreasonable anyways.
> +	 */
> +	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> +	if (bw_val > (U64_MAX / 10))
> +		return -EINVAL;
> +
> +	new_bw = kcalloc(nr_node_ids, sizeof(uint64_t), GFP_KERNEL);
> +	if (!new_bw)
> +		return -ENOMEM;
> +
> +	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
> +	if (!new_iw) {
> +		kfree(new_bw);
> +		return -ENOMEM;
> +	}
> +
> +	/*
> +	 * Update bandwidth info, even in manual mode. That way, when switching
> +	 * to auto mode in the future, iw_table can be overwritten using
> +	 * accurate bw data.
> +	 */
> +	mutex_lock(&iw_table_lock);
> +	old_bw = node_bw_table;
> +	old_iw = rcu_dereference_protected(iw_table,
> +					   lockdep_is_held(&iw_table_lock));
> +
> +	if (old_bw)
> +		memcpy(new_bw, old_bw, nr_node_ids * sizeof(uint64_t));
> +	new_bw[node] = bw_val;
> +	node_bw_table = new_bw;
> +
> +	if (weighted_interleave_auto) {
> +		reduce_interleave_weights(new_bw, new_iw);
> +	} else if (old_iw) {
> +		/*
> +		 * The first time mempolicy_set_node_perf is called, old_iw
> +		 * (iw_table) is null. If that is the case, assign a zeroed
> +		 * table to it. Otherwise, free the newly allocated iw_table.

We shouldn't assign a zeroed iw_table?  Because 0 isn't valid now.
Please check changes in weighted_interleave_nid() below.

> +		 */
> +		mutex_unlock(&iw_table_lock);
> +		kfree(new_iw);
> +		kfree(old_bw);
> +		return 0;
> +	}
> +
> +	rcu_assign_pointer(iw_table, new_iw);
> +	mutex_unlock(&iw_table_lock);
> +	synchronize_rcu();
> +	kfree(old_iw);

If you want to save one synchronize_rcu() if unnecessary, you can use

        if (old_iw) {
                synchronize_rcu();
                kfree(old_iw);
        }

> +	kfree(old_bw);
> +	return 0;
> +}
> +
>  /**
>   * numa_nearest_node - Find nearest node by state
>   * @node: Node id to start the search
> @@ -1998,10 +2101,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>  	table = rcu_dereference(iw_table);
>  	/* calculate the total weight */
>  	for_each_node_mask(nid, nodemask) {
> -		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> -		weight_total += weight;
> +		weight_total += table ? table[nid] : 1;
>  	}
>  
>  	/* Calculate the node offset based on totals */
> @@ -2010,7 +2110,6 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>  	while (target) {
>  		/* detect system default usage */
>  		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
>  		if (target < weight)
>  			break;
>  		target -= weight;
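The simplified walk above can be modelled in userspace C (pick_node, the flat table array, and nr_nodes are illustrative stand-ins for this sketch; the kernel iterates a nodemask):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Userspace model of the node-selection walk in weighted_interleave_nid().
 * A NULL table means "no weights available": every node weighs 1, so
 * weighted interleave degrades to plain round-robin interleave.
 */
static unsigned int pick_node(const uint8_t *table, unsigned int nr_nodes,
			      uint64_t ilx)
{
	uint64_t weight_total = 0, target;
	unsigned int nid = 0;
	unsigned int i;

	/* calculate the total weight */
	for (i = 0; i < nr_nodes; i++)
		weight_total += table ? table[i] : 1;

	/* walk the weights until the offset lands inside a node's share */
	target = ilx % weight_total;
	while (target) {
		uint64_t weight = table ? table[nid] : 1;

		if (target < weight)
			break;
		target -= weight;
		nid++;
	}
	return nid;
}
```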
> @@ -3394,7 +3493,7 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>  	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
>  	if (count == 0 || sysfs_streq(buf, ""))
>  		weight = 0;
> -	else if (kstrtou8(buf, 0, &weight))
> +	else if (kstrtou8(buf, 0, &weight) || weight == 0)
>  		return -EINVAL;
>  
>  	new = kzalloc(nr_node_ids, GFP_KERNEL);
> @@ -3411,11 +3510,68 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>  	mutex_unlock(&iw_table_lock);
>  	synchronize_rcu();
>  	kfree(old);
> +	weighted_interleave_auto = false;
>  	return count;
>  }
>  
>  static struct iw_node_attr **node_attrs;
>  
> +static ssize_t weighted_interleave_mode_show(struct kobject *kobj,

IMHO, it's better to name the function as
weighted_interleave_auto_show()?

> +		struct kobj_attribute *attr, char *buf)
> +{
> +	if (weighted_interleave_auto)
> +		return sysfs_emit(buf, "Y\n");
> +	else
> +		return sysfs_emit(buf, "N\n");

str_true_false() can be used here.

> +}
> +
> +static ssize_t weighted_interleave_mode_store(struct kobject *kobj,
> +		struct kobj_attribute *attr, const char *buf, size_t count)
> +{
> +	uint64_t *bw;
> +	u8 *old_iw, *new_iw;
> +
> +	if (count == 0)
> +		return -EINVAL;
> +
> +	if (sysfs_streq(buf, "N") || sysfs_streq(buf, "0")) {

kstrtobool() can be used here.  It can deal with 'count == 0' case too.
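A userspace model of the subset of kstrtobool() semantics this store function needs (the real helper also accepts "on"/"off"; this sketch, with the made-up name parse_bool, only looks at the first character, which is also why a sysfs "Y\n" parses fine):

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/*
 * Minimal stand-in for kstrtobool(): only the first character matters,
 * and the empty string is rejected -- which is why the explicit
 * 'count == 0' check becomes unnecessary.
 */
static int parse_bool(const char *s, bool *res)
{
	if (!s || !s[0])
		return -EINVAL;

	switch (s[0]) {
	case 'y': case 'Y': case '1':
		*res = true;
		return 0;
	case 'n': case 'N': case '0':
		*res = false;
		return 0;
	}
	return -EINVAL;
}
```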

> +		weighted_interleave_auto = false;
> +		return count;
> +	} else if (!sysfs_streq(buf, "Y") && !sysfs_streq(buf, "1")) {
> +		return -EINVAL;
> +	}
> +
> +	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
> +	if (!new_iw)
> +		return -ENOMEM;
> +
> +	mutex_lock(&iw_table_lock);
> +	bw = node_bw_table;
> +
> +	if (!bw) {
> +		mutex_unlock(&iw_table_lock);
> +		kfree(new_iw);
> +		return -ENODEV;
> +	}
> +
> +	old_iw = rcu_dereference_protected(iw_table,
> +					   lockdep_is_held(&iw_table_lock));
> +
> +	reduce_interleave_weights(bw, new_iw);
> +	rcu_assign_pointer(iw_table, new_iw);
> +	mutex_unlock(&iw_table_lock);
> +
> +	synchronize_rcu();
> +	kfree(old_iw);
> +
> +	weighted_interleave_auto = true;

Why assign weighted_interleave_auto after synchronize_rcu()?  To reduce
the race window, it's better to change weighted_interleave_auto and
iw_table together?  Is it better to put them into a data structure and
change them together always?

        struct weighted_interleave_state {
                bool weighted_interleave_auto;
                u8 iw_table[0]
        };
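A userspace sketch of what that combined state could look like (wi_state_alloc is a made-up helper; the kernel would use kzalloc() with struct_size(), and current style spells a flexible array member `[]` rather than `[0]`):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

struct weighted_interleave_state {
	bool weighted_interleave_auto;
	uint8_t iw_table[];		/* one weight per possible node */
};

/* struct_size(s, iw_table, n) boils down to this, minus overflow checks */
#define WI_STATE_SIZE(n) \
	(offsetof(struct weighted_interleave_state, iw_table) + \
	 (n) * sizeof(uint8_t))

static struct weighted_interleave_state *wi_state_alloc(unsigned int nr_nodes)
{
	/* zeroed, like kzalloc(): manual mode, all weights unset */
	return calloc(1, WI_STATE_SIZE(nr_nodes));
}
```

Swapping the whole structure under RCU then replaces the mode flag and the table in one atomic pointer assignment.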

> +	return count;
> +}
> +
> +static struct kobj_attribute wi_attr =
> +	__ATTR(auto, 0664, weighted_interleave_mode_show,
> +			   weighted_interleave_mode_store);
> +
>  static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
>  				  struct kobject *parent)
>  {
> @@ -3473,6 +3629,15 @@ static int add_weight_node(int nid, struct kobject *wi_kobj)
>  	return 0;
>  }
>  
> +static struct attribute *wi_default_attrs[] = {
> +	&wi_attr.attr,
> +	NULL
> +};
> +
> +static const struct attribute_group wi_attr_group = {
> +	.attrs = wi_default_attrs,
> +};
> +
>  static int add_weighted_interleave_group(struct kobject *root_kobj)
>  {
>  	struct kobject *wi_kobj;
> @@ -3489,6 +3654,13 @@ static int add_weighted_interleave_group(struct kobject *root_kobj)
>  		return err;
>  	}
>  
> +	err = sysfs_create_group(wi_kobj, &wi_attr_group);
> +	if (err) {
> +		pr_err("failed to add sysfs [auto]\n");
> +		kobject_put(wi_kobj);
> +		return err;
> +	}
> +
>  	for_each_node_state(nid, N_POSSIBLE) {
>  		err = add_weight_node(nid, wi_kobj);
>  		if (err) {

---
Best Regards,
Huang, Ying
Joshua Hahn Feb. 12, 2025, 3:18 p.m. UTC | #9
On Sat, 8 Feb 2025 07:51:38 +0100 Oscar Salvador <osalvador@suse.de> wrote:

> On Fri, Feb 07, 2025 at 12:13:35PM -0800, Joshua Hahn wrote:

[...snip...]

> Hi Joshua
> 
> > diff --git a/drivers/base/node.c b/drivers/base/node.c
> > index 0ea653fa3433..16e7a5a8ebe7 100644
> > --- a/drivers/base/node.c
> > +++ b/drivers/base/node.c
> > @@ -7,6 +7,7 @@
> >  #include <linux/init.h>
> >  #include <linux/mm.h>
> >  #include <linux/memory.h>
> > +#include <linux/mempolicy.h>
> >  #include <linux/vmstat.h>
> >  #include <linux/notifier.h>
> >  #include <linux/node.h>
> > @@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
> >  			break;
> >  		}
> >  	}
> > +
> > +	/* When setting CPU access coordinates, update mempolicy */
> > +	if (access == ACCESS_COORDINATE_CPU) {
> > +		if (mempolicy_set_node_perf(nid, coord))
> > +			pr_info("failed to set node%d mempolicy attrs\n", nid);
> 
> Not a big deal but I think you want to make that consistent with the error
> pr_info? that is: "failed to set mempolicy attrs for node %d".

Hi Oscar, thank you for reviewing my patch!
That sounds good to me. Then that line can be replaced with
pr_info("failed to set mempolicy attrs for node %d\n", nid);

> Also, I guess we cannot reach here with a memoryless node, right?

I think that they should not reach this line, but since memoryless
nodes do not have any memory bandwidth / latency information, it should
be fine. With that said, I think a check like the following would
make this more explicit and possibly guard against any unexpected
calls to mempolicy_set_node_perf:

- if (access == ACCESS_COORDINATE_CPU) {
+ if (access == ACCESS_COORDINATE_CPU && node_state(nid, N_MEMORY)) {

> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 04f35659717a..51edd3663667 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -109,6 +109,7 @@
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/printk.h>
> >  #include <linux/swapops.h>
> > +#include <linux/gcd.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> > @@ -138,16 +139,18 @@ static struct mempolicy default_policy = {
> >  
> >  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
> >  
> > +static uint64_t *node_bw_table;
> > +
> >  /*
> > - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> > - * system-default value should be used. A NULL iw_table also denotes that
> > - * system-default values should be used. Until the system-default table
> > - * is implemented, the system-default is always 1.
> > - *
> > + * iw_table is the interleave weight table.
> > + * If bandwidth data is available and the user is in auto mode, the table
> > + * is populated with default values in [1,255].
> >   * iw_table is RCU protected
> >   */
> >  static u8 __rcu *iw_table;
> >  static DEFINE_MUTEX(iw_table_lock);
> > +static const int weightiness = 32;
> 
> You explain why you chose this value in the changelog, but I would either
> drop a comment, probably in reduce_interleave_weights() elaborating a
> little bit, otherwise someone who stumbles upon that when reading that
> code will have to go through the changelog.

I also think this feedback makes a lot of sense. Maybe something like:
/*
 * 32 is selected as a constant that balances the two goals of:
 * (1) Keeping weight magnitude low, as to prevent too many consecutive
 *     pages from being allocated from the same node before other nodes
 *     get a chance
 * (2) Minimizing the error between bandwidth ratios and weight ratios
 */

Andrew -- I will send over a new v6 for this patch, if that is alright with
you. I will also probably wait a few days before sending it out, in case
there are other changes that folks would like to see. Please let me know
if this works for you / if there is something else I can do to make this
process smoother.

Thank you again! Have a nice day!
Joshua

> 
> -- 
> Oscar Salvador
> SUSE Labs
Joshua Hahn Feb. 12, 2025, 3:26 p.m. UTC | #10
On Tue, 11 Feb 2025 16:17:52 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri,  7 Feb 2025 21:06:04 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> 
> > On Fri, 7 Feb 2025 18:20:09 -0800 Andrew Morton <akpm@linux-foundation.org> wrote:
> > 
> > > On Fri,  7 Feb 2025 12:13:35 -0800 Joshua Hahn <joshua.hahnjy@gmail.com> wrote:
> > > 
> > > > This patch introduces an auto-configuration mode for the interleave
> > > > weights that aims to balance the two goals of setting node weights to be
> > > > proportional to their bandwidths and keeping the weight values low.
> > > > In order to perform the weight re-scaling, we use an internal
> > > > "weightiness" value (fixed to 32) that defines interleave aggression.
> > > 
> > > Question please.  How does one determine whether a particular
> > > configuration is working well?  To determine whether
> > > manual-configuration-A is better than manual-configuration-B is better
> > > than auto-configuration?
> > > 
> > > Leading to... how do we know that this patch makes the kernel better?
> > 
> > Hello Andrew,
> > 
> > Thank you for your interest in this patch!
> > 
> > To answer your 1st question: I think that users can do some
> >
> > ...
> >
> 
> Interesting, thanks.
> 
> Have we adequately documented all these considerations for our users or
> can we add some additional words in an appropriate place?

Hello Andrew,

I have documented these thoughts in a private document, but I think that
it will be beneficial for weighted interleave users to have this
knowledge to reference in the future as well.

I can think of two places where this information will benefit users the
most: I can elaborate further on the motivations & decisions Gregory
and I made for this patch within the patch commit message, and also
in the ABI documentation. As Oscar suggested, appropriate details in
the code should hopefully make the decisions clearer for future
maintainers and developers as well.

Thank you again for your insight! I will have a v6 drafted up, and
I think it makes sense to pull this patch out of mm-unstable for now.
Have a great day!

Joshua
Joshua Hahn Feb. 12, 2025, 5:06 p.m. UTC | #11
On Wed, 12 Feb 2025 10:49:32 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

> Hi, Joshua,
> 
> Thanks for your patch and sorry for late reply.

Hi Ying, no worries! Thank you for taking time to review this patch.

> Joshua Hahn <joshua.hahnjy@gmail.com> writes:

[...snip...]

> > This patch introduces an auto-configuration mode for the interleave
> > weights that aims to balance the two goals of setting node weights to be
> > proportional to their bandwidths and keeping the weight values low.
> > In order to perform the weight re-scaling, we use an internal
> > "weightiness" value (fixed to 32) that defines interleave aggression.
> 
> As asked by Andrew in another thread, you may need to make it more
> explicit about why we need this patch.

Yes, I think that makes a lot of sense. Probably some details about
why we use 32 as the default weightiness value will also help with clarity.

[...snip...]

> > diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > index 0b7972de04e9..ef43228d135d 100644
> > --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > @@ -20,6 +20,38 @@ Description:	Weight configuration interface for nodeN
> >  		Minimum weight: 1
> >  		Maximum weight: 255
> >  
> > -		Writing an empty string or `0` will reset the weight to the
> > -		system default. The system default may be set by the kernel
> > -		or drivers at boot or during hotplug events.
> > +		Writing invalid values (i.e. any values not in [1,255],
> > +		empty string, ...) will return -EINVAL.
> > +
> > +		Changing the weight to a valid value will automatically
> > +		update the system to manual mode as well.
> > +
> > +What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
> > +Date:		February 2025
> > +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> > +Description:	Auto-weighting configuration interface
> > +
> > +		Configuration mode for weighted interleave. A 'Y' indicates
> > +		that the system is in auto mode, and a 'N' indicates that
> > +		the system is in manual mode. All other values are invalid.
> > +
> > +		In auto mode, all node weights are re-calculated and overwritten
> > +		(visible via the nodeN interfaces) whenever new bandwidth data
> > +		is made available during either boot or hotplug events.
> > +
> > +		In manual mode, node weights can only be updated by the user.
> > +		Note that nodes that are onlined with previously set / defaulted
> 
> Sorry, I don't understand this well.  What is "defaulted" weights?  Why
> not just "previously set"?

This is just poor word choice on my end. I meant to say that if they had
been using auto mode previously and the node contained system-calculated
weights, then offlining and onlining while in manual mode would bring
back the weights from the system. This is the scenario:

Auto: weights are set to [1,2,3]
Mode changes from auto --> manual
Manual: weights are still [1,2,3]
Manual: update node0 to be [7,2,3]
Offline node2
Manual: weights are still [7,2,3], even though node2 is offline
Online node2
Manual: weights are still [7,2,3]
                               ^ I just wanted to highlight that it will
			         online with a weight of 3, which is a
				 value set by the system.

> > +		weights will inherit those weights. If they were not previously
> > +		set or are onlined with missing bandwidth data, they will be
> > +		defaulted to a weight of 1.
> > +
> > +		Writing Y or 1 to the interface will enable auto mode, while
> > +		writing N or 0 will enable manual mode. All other strings will
> > +		be ignored, and -EINVAL will be returned.
> > +
> > +		If Y or 1 is written to the interface but the recalculation or
> > +		updates fail at any point (-ENOMEM or -ENODEV), then the system
> > +		will remain in manual mode.
> 
> IMHO, it's unnecessary to make this a part of the interface definition.
> However, I have no strong opinion on this.

Thank you for your input. I was just trying to be as explicit and
thorough as possible, but I can see how this might feel a bit cluttered.

> > +		Writing a new weight to a node directly via the nodeN interface
> > +		will also automatically update the system to manual mode.
> > diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> > index 80a3481c0470..cc94cba112dd 100644
> > --- a/drivers/acpi/numa/hmat.c
> > +++ b/drivers/acpi/numa/hmat.c
> > @@ -20,6 +20,7 @@
> >  #include <linux/list_sort.h>
> >  #include <linux/memregion.h>
> >  #include <linux/memory.h>
> > +#include <linux/mempolicy.h>
> >  #include <linux/mutex.h>
> >  #include <linux/node.h>
> >  #include <linux/sysfs.h>
> 
> This change should be removed?

Ah I see. Harry (Hyeonggon) had pointed this out in v4 of the patch,
but I misread the response because I thought his comment was about
the include in node.c, not in hmat.c. Sorry, Harry! This is correct;
this include statement should not be here. 

[...snip...]

> >  struct mempolicy {};
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 04f35659717a..51edd3663667 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -109,6 +109,7 @@
> >  #include <linux/mmu_notifier.h>
> >  #include <linux/printk.h>
> >  #include <linux/swapops.h>
> > +#include <linux/gcd.h>
> >  
> >  #include <asm/tlbflush.h>
> >  #include <asm/tlb.h>
> > @@ -138,16 +139,18 @@ static struct mempolicy default_policy = {
> >  
> >  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
> >  
> > +static uint64_t *node_bw_table;
> 
> Usually, we use "u64" instead of "uint64_t" in kernel.  It appears that
> we don't need u64. "unsigned int" should be enough because "struct
> access_coordinate" uses that too.
> 
> It's better to move this after iw_table definition below.  And make it
> explicit that iw_table_lock protects all interleave weight global
> variables.

I see, that makes sense. I will change it to unsigned int and move it
below; I agree that the scope of the lock should be clearer as well.

> > +
> >  /*
> > - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> > - * system-default value should be used. A NULL iw_table also denotes that
> > - * system-default values should be used. Until the system-default table
> > - * is implemented, the system-default is always 1.
> > - *
> > + * iw_table is the interleave weight table.
> > + * If bandwidth data is available and the user is in auto mode, the table
>                                              ~~~~
>                                              system?

Thank you, I'll fix this.

> > + * is populated with default values in [1,255].
> 
> If the system is in manual mode, are the weight values in [1, 255] too?
> 
> NULL iw_table is still a valid configuration, so we still need to
> describe its behavior.  Right?

Yes, that is true. I just wanted to explicitly note that the default
values selected by the system in auto mode will be in that range.

I'll also include what happens when iw_table is null. It only happens
when there is absolutely no bandwidth information available, so I
will note that in the description as well.

> >   * iw_table is RCU protected
> >   */
> >  static u8 __rcu *iw_table;
> >  static DEFINE_MUTEX(iw_table_lock);
> > +static const int weightiness = 32;
> > +static bool weighted_interleave_auto = true;
> >  
> >  static u8 get_il_weight(int node)
> >  {
> > @@ -156,14 +159,114 @@ static u8 get_il_weight(int node)
> >  
> >  	rcu_read_lock();
> >  	table = rcu_dereference(iw_table);
> > -	/* if no iw_table, use system default */
> 
> It appears that this still applies to some degree.

Yes -- but this is only possible if there is no bandwidth data available
(and therefore, mempolicy_set_node_perf is never called). If this is
the case, then all nodes will have the same weight. Setting the default
weight to 1 means weighted interleave is reduced to unweighted interleave,
but setting the default weight to 0 will mean no memory can be allocated.

[...snip...]

> > +	if (old_bw)
> > +		memcpy(new_bw, old_bw, nr_node_ids * sizeof(uint64_t));
> > +	new_bw[node] = bw_val;
> > +	node_bw_table = new_bw;
> > +
> > +	if (weighted_interleave_auto) {
> > +		reduce_interleave_weights(new_bw, new_iw);
> > +	} else if (old_iw) {
> > +		/*
> > +		 * The first time mempolicy_set_node_perf is called, old_iw
> > +		 * (iw_table) is null. If that is the case, assign a zeroed
> > +		 * table to it. Otherwise, free the newly allocated iw_table.
> 
> We shouldn't assign a zeroed iw_table?  Because 0 isn't valid now.
> Please check changes in weighted_interleave_nid() below.

Thank you for the catch! I will fix this and write that it will contain
default values of 1 instead.

> > +		 */
> > +		mutex_unlock(&iw_table_lock);
> > +		kfree(new_iw);
> > +		kfree(old_bw);
> > +		return 0;
> > +	}
> > +
> > +	rcu_assign_pointer(iw_table, new_iw);
> > +	mutex_unlock(&iw_table_lock);
> > +	synchronize_rcu();
> > +	kfree(old_iw);
> 
> If you want to save one synchronize_rcu() if unnecessary, you can use
> 
>         if (old_iw) {
>                 synchronize_rcu();
>                 kfree(old_iw);
>         }

I see, that makes sense. Thank you!

[...snip...]

> > +static ssize_t weighted_interleave_mode_show(struct kobject *kobj,
> 
> IMHO, it's better to name the function as
> weighted_interleave_auto_show()?

That makes sense; this is an artifact from when the interface was
named "mode", and it seems I overlooked renaming this function.

> > +		struct kobj_attribute *attr, char *buf)
> > +{
> > +	if (weighted_interleave_auto)
> > +		return sysfs_emit(buf, "Y\n");
> > +	else
> > +		return sysfs_emit(buf, "N\n");
> 
> str_true_false() can be used here.

Thank you!

> > +}
> > +
> > +static ssize_t weighted_interleave_mode_store(struct kobject *kobj,
> > +		struct kobj_attribute *attr, const char *buf, size_t count)
> > +{
> > +	uint64_t *bw;
> > +	u8 *old_iw, *new_iw;
> > +
> > +	if (count == 0)
> > +		return -EINVAL;
> > +
> > +	if (sysfs_streq(buf, "N") || sysfs_streq(buf, "0")) {
> 
> kstrtobool() can be used here.  It can deal with 'count == 0' case too.

These kernel string tools are very helpful, thank you for bringing
them to my attention :-)

> > +		weighted_interleave_auto = false;
> > +		return count;
> > +	} else if (!sysfs_streq(buf, "Y") && !sysfs_streq(buf, "1")) {
> > +		return -EINVAL;
> > +	}
> > +
> > +	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
> > +	if (!new_iw)
> > +		return -ENOMEM;
> > +
> > +	mutex_lock(&iw_table_lock);
> > +	bw = node_bw_table;
> > +
> > +	if (!bw) {
> > +		mutex_unlock(&iw_table_lock);
> > +		kfree(new_iw);
> > +		return -ENODEV;
> > +	}
> > +
> > +	old_iw = rcu_dereference_protected(iw_table,
> > +					   lockdep_is_held(&iw_table_lock));
> > +
> > +	reduce_interleave_weights(bw, new_iw);
> > +	rcu_assign_pointer(iw_table, new_iw);
> > +	mutex_unlock(&iw_table_lock);
> > +
> > +	synchronize_rcu();
> > +	kfree(old_iw);
> > +
> > +	weighted_interleave_auto = true;
> 
> Why assign weighted_interleave_auto after synchronize_rcu()?  To reduce
> the race window, it's better to change weighted_interleave_auto and
> iw_table together?  Is it better to put them into a data structure and
> change them together always?
> 
>         struct weighted_interleave_state {
>                 bool weighted_interleave_auto;
>                 u8 iw_table[0]
>         };

I see, I think your explanation makes sense. For the first question,
I think your point makes sense, so I will move the updating to be
inside the rcu section.

As for the combined data structure, I think that this makes sense,
but I have a few thoughts. First, there are times when we don't
update both of them, like moving from auto --> manual; whenever we
just update iw_table, we don't need to update the
weighted_interleave_auto field. I am also concerned that this might
make the code a bit harder to read, but that is just my humble opinion.

> ---
> Best Regards,
> Huang, Ying

Thank you Ying for your detailed comments! It seems like there are a lot
of things that I missed, thank you for reviewing my patch and catching
the mistakes. Please let me know if there are any other changes that you
see fit for the patch. Have a great day!

Joshua

Sent using hkml (https://github.com/sjp38/hackermail)
Huang, Ying Feb. 13, 2025, 1:32 a.m. UTC | #12
Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Wed, 12 Feb 2025 10:49:32 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
>
>> Hi, Joshua,
>> 
>> Thanks for your patch and sorry for late reply.
>
> Hi Ying, no worries! Thank you for taking time to review this patch.
>
>> Joshua Hahn <joshua.hahnjy@gmail.com> writes:

[snip]

>> > +
>> > +static ssize_t weighted_interleave_mode_store(struct kobject *kobj,
>> > +		struct kobj_attribute *attr, const char *buf, size_t count)
>> > +{
>> > +	uint64_t *bw;
>> > +	u8 *old_iw, *new_iw;
>> > +
>> > +	if (count == 0)
>> > +		return -EINVAL;
>> > +
>> > +	if (sysfs_streq(buf, "N") || sysfs_streq(buf, "0")) {
>> 
>> kstrtobool() can be used here.  It can deal with 'count == 0' case too.
>
> These kernel string tools are very helpful, thank you for bringing
> them to my attention :-)
>
>> > +		weighted_interleave_auto = false;
>> > +		return count;
>> > +	} else if (!sysfs_streq(buf, "Y") && !sysfs_streq(buf, "1")) {
>> > +		return -EINVAL;
>> > +	}
>> > +
>> > +	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
>> > +	if (!new_iw)
>> > +		return -ENOMEM;
>> > +
>> > +	mutex_lock(&iw_table_lock);
>> > +	bw = node_bw_table;
>> > +
>> > +	if (!bw) {
>> > +		mutex_unlock(&iw_table_lock);
>> > +		kfree(new_iw);
>> > +		return -ENODEV;
>> > +	}
>> > +
>> > +	old_iw = rcu_dereference_protected(iw_table,
>> > +					   lockdep_is_held(&iw_table_lock));
>> > +
>> > +	reduce_interleave_weights(bw, new_iw);
>> > +	rcu_assign_pointer(iw_table, new_iw);
>> > +	mutex_unlock(&iw_table_lock);
>> > +
>> > +	synchronize_rcu();
>> > +	kfree(old_iw);
>> > +
>> > +	weighted_interleave_auto = true;
>> 
>> Why assign weighted_interleave_auto after synchronize_rcu()?  To reduce
>> the race window, it's better to change weighted_interleave_auto and
>> iw_table together?  Is it better to put them into a data structure and
>> change them together always?
>> 
>>         struct weighted_interleave_state {
>>                 bool weighted_interleave_auto;
>>                 u8 iw_table[0]
>>         };
>
> I see, I think your explanation makes sense. For the first question,
> I think your point makes sense, so I will move the updating to be
> inside the rcu section.
>
> As for the combined data structure, I think that this makes sense,
> but I have a few thoughts. First, there are times when we don't
> update both of them, like moving from auto --> manual; whenever we
> just update iw_table, we don't need to update the
> weighted_interleave_auto field. I am also concerned that this might
> make the code a bit harder to read, but that is just my humble opinion.

I think the overhead is relatively small.  With that, we can avoid the
inconsistency between weighted_interleave_auto and iw_table[].
struct_size() or struct_size_t() family helpers can be used to manage
the flexible array at the end of the struct.

---
Best Regards,
Huang, Ying
Joshua Hahn Feb. 14, 2025, 3:45 p.m. UTC | #13
On Thu, 13 Feb 2025 09:32:49 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> 
> > On Wed, 12 Feb 2025 10:49:32 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
> >
> >> Hi, Joshua,
> >> 

[...snip...]

> >> > +		weighted_interleave_auto = false;
> >> > +		return count;
> >> > +	} else if (!sysfs_streq(buf, "Y") && !sysfs_streq(buf, "1")) {
> >> > +		return -EINVAL;
> >> > +	}
> >> > +
> >> > +	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
> >> > +	if (!new_iw)
> >> > +		return -ENOMEM;
> >> > +
> >> > +	mutex_lock(&iw_table_lock);
> >> > +	bw = node_bw_table;
> >> > +
> >> > +	if (!bw) {
> >> > +		mutex_unlock(&iw_table_lock);
> >> > +		kfree(new_iw);
> >> > +		return -ENODEV;
> >> > +	}
> >> > +
> >> > +	old_iw = rcu_dereference_protected(iw_table,
> >> > +					   lockdep_is_held(&iw_table_lock));
> >> > +
> >> > +	reduce_interleave_weights(bw, new_iw);
> >> > +	rcu_assign_pointer(iw_table, new_iw);
> >> > +	mutex_unlock(&iw_table_lock);
> >> > +
> >> > +	synchronize_rcu();
> >> > +	kfree(old_iw);
> >> > +
> >> > +	weighted_interleave_auto = true;
> >> 
> >> Why assign weighted_interleave_auto after synchronize_rcu()?  To reduce
> >> the race window, it's better to change weighted_interleave_auto and
> >> iw_table together?  Is it better to put them into a data structure and
> >> change them together always?
> >> 
> >>         struct weighted_interleave_state {
> >>                 bool weighted_interleave_auto;
> >>                 u8 iw_table[];
> >>         };
> >
> > I see, I think your explanation makes sense. For the first question,
> > I think your point makes sense, so I will move the updating to be
> > inside the rcu section.
> >
> > As for the combined data structure, I think that this makes sense,
> > but I have a few thoughts. First, there are some times when we don't
> > update both of them, like moving from auto --> manual, and whenever
> > we just update iw_table, we don't need to update the weighted_interleave
> > auto field. I also have a concern that this might make the code a bit
> > harder to read, but that is just my humble opinion.
> 
> I think the overhead is relatively small.  With that, we can avoid the
> inconsistency between weighted_interleave_auto and iw_table[].
> struct_size() or struct_size_t() family helpers can be used to manage
> the flexible array at the end of the struct.

That sounds good to me. I don't have any strong opinions about this
change, so I am happy to combine them into a struct. I just want to
make sure I am understanding your perspective correctly: what is the
inconsistency between weighted_interleave_auto and iw_table[]?
If I move the weighted_interleave_auto = true statement inside the
rcu section, will the inconsistency still be there?

Just want to make sure so that I am not missing anything important!

Thank you again for your great feedback. I hope you have a happy Friday!
Joshua

> ---
> Best Regards,
> Huang, Ying

Sent using hkml (https://github.com/sjp38/hackermail)
Huang, Ying Feb. 16, 2025, 12:40 a.m. UTC | #14
Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Thu, 13 Feb 2025 09:32:49 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
>
>> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>> 
>> > On Wed, 12 Feb 2025 10:49:32 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
>> >
>> >> Hi, Joshua,
>> >> 
>
> [...snip...]
>
>> >> > +		weighted_interleave_auto = false;
>> >> > +		return count;
>> >> > +	} else if (!sysfs_streq(buf, "Y") && !sysfs_streq(buf, "1")) {
>> >> > +		return -EINVAL;
>> >> > +	}
>> >> > +
>> >> > +	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
>> >> > +	if (!new_iw)
>> >> > +		return -ENOMEM;
>> >> > +
>> >> > +	mutex_lock(&iw_table_lock);
>> >> > +	bw = node_bw_table;
>> >> > +
>> >> > +	if (!bw) {
>> >> > +		mutex_unlock(&iw_table_lock);
>> >> > +		kfree(new_iw);
>> >> > +		return -ENODEV;
>> >> > +	}
>> >> > +
>> >> > +	old_iw = rcu_dereference_protected(iw_table,
>> >> > +					   lockdep_is_held(&iw_table_lock));
>> >> > +
>> >> > +	reduce_interleave_weights(bw, new_iw);
>> >> > +	rcu_assign_pointer(iw_table, new_iw);
>> >> > +	mutex_unlock(&iw_table_lock);
>> >> > +
>> >> > +	synchronize_rcu();
>> >> > +	kfree(old_iw);
>> >> > +
>> >> > +	weighted_interleave_auto = true;
>> >> 
>> >> Why assign weighted_interleave_auto after synchronize_rcu()?  To reduce
>> >> the race window, it's better to change weighted_interleave_auto and
>> >> iw_table together?  Is it better to put them into a data structure and
>> >> change them together always?
>> >> 
>> >>         struct weighted_interleave_state {
>> >>                 bool weighted_interleave_auto;
>> >>                 u8 iw_table[];
>> >>         };
>> >
>> > I see, I think your explanation makes sense. For the first question,
>> > I think your point makes sense, so I will move the updating to be
>> > inside the rcu section.
>> >
>> > As for the combined data structure, I think that this makes sense,
>> > but I have a few thoughts. First, there are some times when we don't
>> > update both of them, like moving from auto --> manual, and whenever
>> > we just update iw_table, we don't need to update the weighted_interleave
>> > auto field. I also have a concern that this might make the code a bit
>> > harder to read, but that is just my humble opinion.
>> 
>> I think the overhead is relatively small.  With that, we can avoid the
>> inconsistency between weighted_interleave_auto and iw_table[].
>> struct_size() or struct_size_t() family helpers can be used to manage
>> the flexible array at the end of the struct.
>
> That sounds good to me. I don't have any strong opinions about this
> change, so I am happy to combine them into a struct. I just want to
> make sure I am understanding your perspective correctly: what is the
> inconsistency between weighted_interleave_auto and iw_table[]?
> If I move the weighted_interleave_auto = true statement inside the
> rcu section, will the inconsistency still be there?

Because weighted_interleave_auto and iw_table are 2 variables, you may
read new weighted_interleave_auto and old iw_table or vice versa.  If
we put them into one struct and write/read the pointer to the struct
with rcu_assign_pointer() / rcu_dereference(), we can avoid this.

> Just want to make sure so that I am not missing anything important!
>
> Thank you again for your great feedback. I hope you have a happy Friday!

Thanks!

---
Best Regards,
Huang, Ying
Patch

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
index 0b7972de04e9..ef43228d135d 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -20,6 +20,38 @@  Description:	Weight configuration interface for nodeN
 		Minimum weight: 1
 		Maximum weight: 255
 
-		Writing an empty string or `0` will reset the weight to the
-		system default. The system default may be set by the kernel
-		or drivers at boot or during hotplug events.
+		Writing invalid values (i.e. any values not in [1,255],
+		empty string, ...) will return -EINVAL.
+
+		Changing the weight to a valid value will automatically
+		update the system to manual mode as well.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/auto
+Date:		February 2025
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Auto-weighting configuration interface
+
+		Configuration mode for weighted interleave. A 'Y' indicates
+		that the system is in auto mode, and a 'N' indicates that
+		the system is in manual mode. All other values are invalid.
+
+		In auto mode, all node weights are re-calculated and overwritten
+		(visible via the nodeN interfaces) whenever new bandwidth data
+		is made available during either boot or hotplug events.
+
+		In manual mode, node weights can only be updated by the user.
+		Note that nodes that are onlined with previously set / defaulted
+		weights will inherit those weights. If they were not previously
+		set or are onlined with missing bandwidth data, they will be
+		defaulted to a weight of 1.
+
+		Writing Y or 1 to the interface will enable auto mode, while
+		writing N or 0 will enable manual mode. All other strings will
+		be ignored, and -EINVAL will be returned.
+
+		If Y or 1 is written to the interface but the recalculation or
+		updates fail at any point (-ENOMEM or -ENODEV), then the system
+		will remain in manual mode.
+
+		Writing a new weight to a node directly via the nodeN interface
+		will also automatically update the system to manual mode.
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 80a3481c0470..cc94cba112dd 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -20,6 +20,7 @@ 
 #include <linux/list_sort.h>
 #include <linux/memregion.h>
 #include <linux/memory.h>
+#include <linux/mempolicy.h>
 #include <linux/mutex.h>
 #include <linux/node.h>
 #include <linux/sysfs.h>
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..16e7a5a8ebe7 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@ 
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/mempolicy.h>
 #include <linux/vmstat.h>
 #include <linux/notifier.h>
 #include <linux/node.h>
@@ -214,6 +215,12 @@  void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
 			break;
 		}
 	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access == ACCESS_COORDINATE_CPU) {
+		if (mempolicy_set_node_perf(nid, coord))
+			pr_info("failed to set node%d mempolicy attrs\n", nid);
+	}
 }
 EXPORT_SYMBOL_GPL(node_set_perf_attrs);
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ce9885e0178a..0fe96f3ab3ef 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -11,6 +11,7 @@ 
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
+#include <linux/node.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <uapi/linux/mempolicy.h>
@@ -178,6 +179,9 @@  static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 
 extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
 
+extern int mempolicy_set_node_perf(unsigned int node,
+				   struct access_coordinate *coords);
+
 #else
 
 struct mempolicy {};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 04f35659717a..51edd3663667 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -109,6 +109,7 @@ 
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
 #include <linux/swapops.h>
+#include <linux/gcd.h>
 
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -138,16 +139,18 @@  static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
+static uint64_t *node_bw_table;
+
 /*
- * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
- * system-default value should be used. A NULL iw_table also denotes that
- * system-default values should be used. Until the system-default table
- * is implemented, the system-default is always 1.
- *
+ * iw_table is the interleave weight table.
+ * If bandwidth data is available and the user is in auto mode, the table
+ * is populated with default values in [1,255].
  * iw_table is RCU protected
  */
 static u8 __rcu *iw_table;
 static DEFINE_MUTEX(iw_table_lock);
+static const int weightiness = 32;
+static bool weighted_interleave_auto = true;
 
 static u8 get_il_weight(int node)
 {
@@ -156,14 +159,114 @@  static u8 get_il_weight(int node)
 
 	rcu_read_lock();
 	table = rcu_dereference(iw_table);
-	/* if no iw_table, use system default */
 	weight = table ? table[node] : 1;
-	/* if value in iw_table is 0, use system default */
-	weight = weight ? weight : 1;
 	rcu_read_unlock();
 	return weight;
 }
 
+/*
+ * Convert bandwidth values into weighted interleave weights.
+ * Call with iw_table_lock.
+ */
+static void reduce_interleave_weights(uint64_t *bw, u8 *new_iw)
+{
+	uint64_t sum_bw = 0, sum_iw = 0;
+	uint64_t scaling_factor = 1, iw_gcd = 1;
+	unsigned int i = 0;
+
+	/* Recalculate the bandwidth distribution given the new info */
+	for (i = 0; i < nr_node_ids; i++)
+		sum_bw += bw[i];
+
+	/* If node is not set or has < 1% of total bw, use minimum value of 1 */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (bw[i]) {
+			scaling_factor = 100 * bw[i];
+			new_iw[i] = max(scaling_factor / sum_bw, 1);
+		} else {
+			new_iw[i] = 1;
+		}
+		sum_iw += new_iw[i];
+	}
+
+	/*
+	 * Scale each node's share of the total bandwidth from percentages
+	 * to whole numbers in the range [1, weightiness]
+	 */
+	for (i = 0; i < nr_node_ids; i++) {
+		scaling_factor = weightiness * new_iw[i];
+		new_iw[i] = max(scaling_factor / sum_iw, 1);
+		if (i == 0)
+			iw_gcd = new_iw[0];
+		iw_gcd = gcd(iw_gcd, new_iw[i]);
+	}
+
+	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
+	for (i = 0; i < nr_node_ids; i++)
+		new_iw[i] /= iw_gcd;
+}
+
+int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
+{
+	uint64_t *old_bw, *new_bw;
+	uint64_t bw_val;
+	u8 *old_iw, *new_iw;
+
+	/*
+	 * Bandwidths above this limit cause rounding errors when reducing
+	 * weights. This value is ~16 exabytes, which is unreasonable anyways.
+	 */
+	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
+	if (bw_val > (U64_MAX / 10))
+		return -EINVAL;
+
+	new_bw = kcalloc(nr_node_ids, sizeof(uint64_t), GFP_KERNEL);
+	if (!new_bw)
+		return -ENOMEM;
+
+	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
+	if (!new_iw) {
+		kfree(new_bw);
+		return -ENOMEM;
+	}
+
+	/*
+	 * Update bandwidth info, even in manual mode. That way, when switching
+	 * to auto mode in the future, iw_table can be overwritten using
+	 * accurate bw data.
+	 */
+	mutex_lock(&iw_table_lock);
+	old_bw = node_bw_table;
+	old_iw = rcu_dereference_protected(iw_table,
+					   lockdep_is_held(&iw_table_lock));
+
+	if (old_bw)
+		memcpy(new_bw, old_bw, nr_node_ids * sizeof(uint64_t));
+	new_bw[node] = bw_val;
+	node_bw_table = new_bw;
+
+	if (weighted_interleave_auto) {
+		reduce_interleave_weights(new_bw, new_iw);
+	} else if (old_iw) {
+		/*
+		 * The first time mempolicy_set_node_perf is called, old_iw
+		 * (iw_table) is null. If that is the case, assign a zeroed
+		 * table to it. Otherwise, free the newly allocated iw_table.
+		 */
+		mutex_unlock(&iw_table_lock);
+		kfree(new_iw);
+		kfree(old_bw);
+		return 0;
+	}
+
+	rcu_assign_pointer(iw_table, new_iw);
+	mutex_unlock(&iw_table_lock);
+	synchronize_rcu();
+	kfree(old_iw);
+	kfree(old_bw);
+	return 0;
+}
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -1998,10 +2101,7 @@  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	table = rcu_dereference(iw_table);
 	/* calculate the total weight */
 	for_each_node_mask(nid, nodemask) {
-		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
-		weight_total += weight;
+		weight_total += table ? table[nid] : 1;
 	}
 
 	/* Calculate the node offset based on totals */
@@ -2010,7 +2110,6 @@  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	while (target) {
 		/* detect system default usage */
 		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
 		if (target < weight)
 			break;
 		target -= weight;
@@ -3394,7 +3493,7 @@  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
 	if (count == 0 || sysfs_streq(buf, ""))
 		weight = 0;
-	else if (kstrtou8(buf, 0, &weight))
+	else if (kstrtou8(buf, 0, &weight) || weight == 0)
 		return -EINVAL;
 
 	new = kzalloc(nr_node_ids, GFP_KERNEL);
@@ -3411,11 +3510,68 @@  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 	mutex_unlock(&iw_table_lock);
 	synchronize_rcu();
 	kfree(old);
+	weighted_interleave_auto = false;
 	return count;
 }
 
 static struct iw_node_attr **node_attrs;
 
+static ssize_t weighted_interleave_mode_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	if (weighted_interleave_auto)
+		return sysfs_emit(buf, "Y\n");
+	else
+		return sysfs_emit(buf, "N\n");
+}
+
+static ssize_t weighted_interleave_mode_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	uint64_t *bw;
+	u8 *old_iw, *new_iw;
+
+	if (count == 0)
+		return -EINVAL;
+
+	if (sysfs_streq(buf, "N") || sysfs_streq(buf, "0")) {
+		weighted_interleave_auto = false;
+		return count;
+	} else if (!sysfs_streq(buf, "Y") && !sysfs_streq(buf, "1")) {
+		return -EINVAL;
+	}
+
+	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
+	if (!new_iw)
+		return -ENOMEM;
+
+	mutex_lock(&iw_table_lock);
+	bw = node_bw_table;
+
+	if (!bw) {
+		mutex_unlock(&iw_table_lock);
+		kfree(new_iw);
+		return -ENODEV;
+	}
+
+	old_iw = rcu_dereference_protected(iw_table,
+					   lockdep_is_held(&iw_table_lock));
+
+	reduce_interleave_weights(bw, new_iw);
+	rcu_assign_pointer(iw_table, new_iw);
+	mutex_unlock(&iw_table_lock);
+
+	synchronize_rcu();
+	kfree(old_iw);
+
+	weighted_interleave_auto = true;
+	return count;
+}
+
+static struct kobj_attribute wi_attr =
+	__ATTR(auto, 0664, weighted_interleave_mode_show,
+			   weighted_interleave_mode_store);
+
 static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
 				  struct kobject *parent)
 {
@@ -3473,6 +3629,15 @@  static int add_weight_node(int nid, struct kobject *wi_kobj)
 	return 0;
 }
 
+static struct attribute *wi_default_attrs[] = {
+	&wi_attr.attr,
+	NULL
+};
+
+static const struct attribute_group wi_attr_group = {
+	.attrs = wi_default_attrs,
+};
+
 static int add_weighted_interleave_group(struct kobject *root_kobj)
 {
 	struct kobject *wi_kobj;
@@ -3489,6 +3654,13 @@  static int add_weighted_interleave_group(struct kobject *root_kobj)
 		return err;
 	}
 
+	err = sysfs_create_group(wi_kobj, &wi_attr_group);
+	if (err) {
+		pr_err("failed to add sysfs [auto]\n");
+		kobject_put(wi_kobj);
+		return err;
+	}
+
 	for_each_node_state(nid, N_POSSIBLE) {
 		err = add_weight_node(nid, wi_kobj);
 		if (err) {