diff mbox series

[v3] Weighted interleave auto-tuning

Message ID 20250115185854.1991771-1-joshua.hahnjy@gmail.com (mailing list archive)
State Handled Elsewhere, archived
Headers show
Series [v3] Weighted interleave auto-tuning | expand

Commit Message

Joshua Hahn Jan. 15, 2025, 6:58 p.m. UTC
On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allowed for pages to be allocated across NUMA nodes according to
user-set ratios.

Ideally, these weights should be proportional to their bandwidth, so
that under bandwidth pressure, each node uses its maximal efficient
bandwidth and prevents latency from increasing exponentially.

At the same time, we want these weights to be as small as possible.
Having ratios that involve large co-prime numbers like 7639:1345:7 leads
to awkward and inefficient allocations, since the node with weight 7
will remain mostly unused (and despite being proportional to bandwidth,
will not aid in relieving the bandwidth pressure in the other two nodes).

This patch introduces an auto-configuration mode for the interleave
weights that aims to balance the two goals of setting node weights to be
proportional to their bandwidths and keeping the weight values low.
In order to perform the weight re-scaling, we use an internal
"weightiness" value (fixed to 32) that defines interleave aggression.

In this auto configuration mode, node weights are dynamically updated
every time there is a hotplug event that introduces new bandwidth.

Users can also enter manual mode by writing "manual" to the new "mode"
sysfs interface. When a user enters manual mode, the system stops
dynamically updating any of the node weights, even during hotplug events
that can shift the optimal weight distribution. The system also enters
manual mode any time a user sets a node's weight by hand, using the
nodeN interface introduced in [1]. On the other hand, auto mode is
only entered by explicitly writing "auto" to the mode interface.

There is one functional change that this patch makes to the existing
weighted_interleave ABI: previously, writing 0 directly to a nodeN
interface was said to reset the weight to the system default. Before
this patch, the default for all weights were 1, which meant that writing
0 and 1 were functionally equivalent.

This patch introduces "real" defaults, but we have decided to move away
from letting users use 0 as a "set to default" interface. Rather, users
who want to use system defaults should use "auto" mode. This patch seems
to be the appropriate place to make this change, since we would like to
remove this usage before users begin to rely on the feature in
userspace. Moreover, users will not be losing any functionality; they
can still write 1 into a node if they want a weight of 1. Thus, we
deprecate the "write zero to reset" feature in favor of returning an
error, the same way we would return an error when the user writes any
other invalid weight to the interface.

[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Co-developed-by: Gregory Price <gourry@gourry.net>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
Changelog
v3:
- Weightiness (max_node_weight) is now fixed to 32.
- Instead, the sysfs interface now exposes a "mode" parameter, which
  can either be "auto" or "manual".
  - Thank you Hyeonggon and Honggyu for the feedback.
- Documentation updated to reflect new sysfs interface, explicitly
  specifies that 0 is invalid.
  - Thank you Gregory and Ying for the discussion on how best to
    handle the 0 case.
- Re-worked nodeN sysfs store to handle auto --> manual shifts
- mempolicy_set_node_perf internally handles the auto / manual
  caes differently now. bw is always updated, iw updates depend on
  what mode the user is in.
- Wordsmithing comments for clarity.
- Removed RFC tag.

v2:
- Name of the interface is changed: "max_node_weight" --> "weightiness"
- Default interleave weight table no longer exists. Rather, the
  interleave weight table is initialized with the defaults, if bandwidth
  information is available.
  - In addition, all sections that handle iw_table have been changed
    to reference iw_table if it exists, otherwise defaulting to 1.
- All instances of unsigned long are converted to uint64_t to guarantee
  support for both 32-bit and 64-bit machines
- sysfs initialization cleanup
- Documentation has been rewritten to explicitly outline expected
  behavior and expand on the interpretation of "weightiness".
- kzalloc replaced with kcalloc for readability
- Thank you Gregory and Hyeonggon for your review & feedback!

 ...fs-kernel-mm-mempolicy-weighted-interleave |  30 ++-
 drivers/acpi/numa/hmat.c                      |   1 +
 drivers/base/node.c                           |   7 +
 include/linux/mempolicy.h                     |   4 +
 mm/mempolicy.c                                | 212 ++++++++++++++++--
 5 files changed, 227 insertions(+), 27 deletions(-)

Comments

Hyeonggon Yoo Jan. 20, 2025, 4:47 a.m. UTC | #1
On 1/16/2025 3:58 AM, Joshua Hahn wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across NUMA nodes according to
> user-set ratios.
> 
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
> 
> At the same time, we want these weights to be as small as possible.
> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
> to awkward and inefficient allocations, since the node with weight 7
> will remain mostly unused (and despite being proportional to bandwidth,
> will not aid in relieving the bandwidth pressure in the other two nodes).
> 
> This patch introduces an auto-configuration mode for the interleave
> weights that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> In order to perform the weight re-scaling, we use an internal
> "weightiness" value (fixed to 32) that defines interleave aggression.
> 
> In this auto configuration mode, node weights are dynamically updated
> every time there is a hotplug event that introduces new bandwidth.
> 
> Users can also enter manual mode by writing "manual" to the new "mode"
> sysfs interface. When a user enters manual mode, the system stops
> dynamically updating any of the node weights, even during hotplug events
> that can shift the optimal weight distribution. The system also enters
> manual mode any time a user sets a node's weight by hand, using the
> nodeN interface introduced in [1]. On the other hand, auto mode is
> only entered by explicitly writing "auto" to the mode interface.
> 
> There is one functional change that this patch makes to the existing
> weighted_interleave ABI: previously, writing 0 directly to a nodeN
> interface was said to reset the weight to the system default. Before
> this patch, the default for all weights were 1, which meant that writing
> 0 and 1 were functionally equivalent.
> 
> This patch introduces "real" defaults, but we have decided to move away
> from letting users use 0 as a "set to default" interface. Rather, users
> who want to use system defaults should use "auto" mode. This patch seems
> to be the appropriate place to make this change, since we would like to
> remove this usage before users begin to rely on the feature in
> userspace. Moreover, users will not be losing any functionality; they
> can still write 1 into a node if they want a weight of 1. Thus, we
> deprecate the "write zero to reset" feature in favor of returning an
> error, the same way we would return an error when the user writes any
> other invalid weight to the interface.
> 
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> Co-developed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> Changelog
> v3:
> - Weightiness (max_node_weight) is now fixed to 32.
> - Instead, the sysfs interface now exposes a "mode" parameter, which
>    can either be "auto" or "manual".
>    - Thank you Hyeonggon and Honggyu for the feedback.
> - Documentation updated to reflect new sysfs interface, explicitly
>    specifies that 0 is invalid.
>    - Thank you Gregory and Ying for the discussion on how best to
>      handle the 0 case.
> - Re-worked nodeN sysfs store to handle auto --> manual shifts
> - mempolicy_set_node_perf internally handles the auto / manual
>    caes differently now. bw is always updated, iw updates depend on
>    what mode the user is in.
> - Wordsmithing comments for clarity.
> - Removed RFC tag.
> 
> v2:
> - Name of the interface is changed: "max_node_weight" --> "weightiness"
> - Default interleave weight table no longer exists. Rather, the
>    interleave weight table is initialized with the defaults, if bandwidth
>    information is available.
>    - In addition, all sections that handle iw_table have been changed
>      to reference iw_table if it exists, otherwise defaulting to 1.
> - All instances of unsigned long are converted to uint64_t to guarantee
>    support for both 32-bit and 64-bit machines
> - sysfs initialization cleanup
> - Documentation has been rewritten to explicitly outline expected
>    behavior and expand on the interpretation of "weightiness".
> - kzalloc replaced with kcalloc for readability
> - Thank you Gregory and Hyeonggon for your review & feedback!
> 
>   ...fs-kernel-mm-mempolicy-weighted-interleave |  30 ++-
>   drivers/acpi/numa/hmat.c                      |   1 +
>   drivers/base/node.c                           |   7 +
>   include/linux/mempolicy.h                     |   4 +
>   mm/mempolicy.c                                | 212 ++++++++++++++++--
>   5 files changed, 227 insertions(+), 27 deletions(-)

Hi Joshua, thanks for the update!
It actually is what I was intended in the manual / auto mode description.

I don't have a strong opinion on the weight of the hot-plugged NUMA node
in manual mode, as it's not ideal whatever weight we choose and the user
need to update the weight after hot-plug events anyway.

Some comments inlined below:

> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..d30dc29c53ff 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -20,6 +20,30 @@ Description:	Weight configuration interface for nodeN
>   		Minimum weight: 1
>   		Maximum weight: 255
>   
> -		Writing an empty string or `0` will reset the weight to the
> -		system default. The system default may be set by the kernel
> -		or drivers at boot or during hotplug events.
> +		Writing invalid values (i.e. any values not in [1,255],
> +		empty string, ...) will return -EINVAL.
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/mode
> +Date:		January 2025
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Auto-weighting configuration interface
> +
> +		Configuration modes for weighted interleave. Can take one of
> +		two options: "manual" and "auto". Default is "auto".
> +
> +		In auto mode, all node weights are re-calculated and overwritten
> +		(visible via the nodeN interfaces) whenever new bandwidth data
> +		is made available either during boot or hotplug events.
> +
> +		In manual mode, node weights can only be updated by the user.
> +		If a node is hotplugged while the user is in manual mode,
> +		the node will have a default weight of 1.
> +
> +		Modes can be changed by writing either "auto" or "manual" to the
> +		interface. All other strings will be ignored, and -EINVAL will
> +		be returned. If "auto" is written to the interface but the
> +		recalculation / updates fail at any point (-ENOMEM or -ENODEV)
> +		then the mode will remain in manual mode.
> +
> +		Writing a new weight to a node directly via the nodeN interface
> +		will also automatically update the system to manual mode.

I think the last paragraph should also be included in the nodeX parameter.

> @@ -2450,16 +2548,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>   	if (!weights)
>   		return total_allocated;
>   
> -	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	if (table)
> -		memcpy(weights, table, nr_node_ids);
> -	rcu_read_unlock();
> -
> -	/* calculate total, detect system default usage */
>   	for_each_node_mask(node, nodes) {
> -		if (!weights[node])
> -			weights[node] = 1;
> +		weights[node] = get_il_weight(node);
>   		weight_total += weights[node];
>   	}

Uh-hum...
Looks like it now allows copying weights from different versions of iw_tables?

Otherwise this patch looks good to me.

Best,
Hyeonggon
Huang, Ying Jan. 21, 2025, 11:17 a.m. UTC | #2
Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across NUMA nodes according to
> user-set ratios.
>
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
>
> At the same time, we want these weights to be as small as possible.
> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
> to awkward and inefficient allocations, since the node with weight 7
> will remain mostly unused (and despite being proportional to bandwidth,
> will not aid in relieving the bandwidth pressure in the other two nodes).
>
> This patch introduces an auto-configuration mode for the interleave
> weights that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> In order to perform the weight re-scaling, we use an internal
> "weightiness" value (fixed to 32) that defines interleave aggression.
>
> In this auto configuration mode, node weights are dynamically updated
> every time there is a hotplug event that introduces new bandwidth.
>
> Users can also enter manual mode by writing "manual" to the new "mode"
> sysfs interface. When a user enters manual mode, the system stops
> dynamically updating any of the node weights, even during hotplug events
> that can shift the optimal weight distribution. The system also enters
> manual mode any time a user sets a node's weight by hand, using the
> nodeN interface introduced in [1]. On the other hand, auto mode is
> only entered by explicitly writing "auto" to the mode interface.
>
> There is one functional change that this patch makes to the existing
> weighted_interleave ABI: previously, writing 0 directly to a nodeN
> interface was said to reset the weight to the system default. Before
> this patch, the default for all weights were 1, which meant that writing
> 0 and 1 were functionally equivalent.
>
> This patch introduces "real" defaults, but we have decided to move away
> from letting users use 0 as a "set to default" interface. Rather, users
> who want to use system defaults should use "auto" mode. This patch seems
> to be the appropriate place to make this change, since we would like to
> remove this usage before users begin to rely on the feature in
> userspace. Moreover, users will not be losing any functionality; they
> can still write 1 into a node if they want a weight of 1. Thus, we
> deprecate the "write zero to reset" feature in favor of returning an
> error, the same way we would return an error when the user writes any
> other invalid weight to the interface.
>
> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
>
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> Co-developed-by: Gregory Price <gourry@gourry.net>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> Changelog
> v3:
> - Weightiness (max_node_weight) is now fixed to 32.
> - Instead, the sysfs interface now exposes a "mode" parameter, which
>   can either be "auto" or "manual".
>   - Thank you Hyeonggon and Honggyu for the feedback.
> - Documentation updated to reflect new sysfs interface, explicitly
>   specifies that 0 is invalid.
>   - Thank you Gregory and Ying for the discussion on how best to
>     handle the 0 case.
> - Re-worked nodeN sysfs store to handle auto --> manual shifts
> - mempolicy_set_node_perf internally handles the auto / manual
>   caes differently now. bw is always updated, iw updates depend on
>   what mode the user is in.
> - Wordsmithing comments for clarity.
> - Removed RFC tag.
>
> v2:
> - Name of the interface is changed: "max_node_weight" --> "weightiness"
> - Default interleave weight table no longer exists. Rather, the
>   interleave weight table is initialized with the defaults, if bandwidth
>   information is available.
>   - In addition, all sections that handle iw_table have been changed
>     to reference iw_table if it exists, otherwise defaulting to 1.
> - All instances of unsigned long are converted to uint64_t to guarantee
>   support for both 32-bit and 64-bit machines
> - sysfs initialization cleanup
> - Documentation has been rewritten to explicitly outline expected
>   behavior and expand on the interpretation of "weightiness".
> - kzalloc replaced with kcalloc for readability
> - Thank you Gregory and Hyeonggon for your review & feedback!
>
>  ...fs-kernel-mm-mempolicy-weighted-interleave |  30 ++-
>  drivers/acpi/numa/hmat.c                      |   1 +
>  drivers/base/node.c                           |   7 +
>  include/linux/mempolicy.h                     |   4 +
>  mm/mempolicy.c                                | 212 ++++++++++++++++--
>  5 files changed, 227 insertions(+), 27 deletions(-)
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..d30dc29c53ff 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -20,6 +20,30 @@ Description:	Weight configuration interface for nodeN
>  		Minimum weight: 1
>  		Maximum weight: 255
>  
> -		Writing an empty string or `0` will reset the weight to the
> -		system default. The system default may be set by the kernel
> -		or drivers at boot or during hotplug events.
> +		Writing invalid values (i.e. any values not in [1,255],
> +		empty string, ...) will return -EINVAL.
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/mode
> +Date:		January 2025
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Auto-weighting configuration interface
> +
> +		Configuration modes for weighted interleave. Can take one of
> +		two options: "manual" and "auto". Default is "auto".
> +
> +		In auto mode, all node weights are re-calculated and overwritten
> +		(visible via the nodeN interfaces) whenever new bandwidth data
> +		is made available either during boot or hotplug events.
> +
> +		In manual mode, node weights can only be updated by the user.
> +		If a node is hotplugged while the user is in manual mode,
> +		the node will have a default weight of 1.
> +
> +		Modes can be changed by writing either "auto" or "manual" to the
> +		interface. All other strings will be ignored, and -EINVAL will
> +		be returned. If "auto" is written to the interface but the
> +		recalculation / updates fail at any point (-ENOMEM or -ENODEV)
> +		then the mode will remain in manual mode.
> +
> +		Writing a new weight to a node directly via the nodeN interface
> +		will also automatically update the system to manual mode.

IMHO, this interface is somewhat hard to be used.  Users need to know
which value is legal.  So, this will become something like,

$ cat mode
auto [manual]
$ echo auto > mode
$ cat mode
[auto] manual

Unless it's possible we will add more modes in the future, this is kind
of overkill for me.  How about something simpler as below?

$ cat auto
true
$ echo 0 > auto
$ cat auto
false

> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index 80a3481c0470..cc94cba112dd 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -20,6 +20,7 @@
>  #include <linux/list_sort.h>
>  #include <linux/memregion.h>
>  #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>  #include <linux/mutex.h>
>  #include <linux/node.h>
>  #include <linux/sysfs.h>
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..16e7a5a8ebe7 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -7,6 +7,7 @@
>  #include <linux/init.h>
>  #include <linux/mm.h>
>  #include <linux/memory.h>
> +#include <linux/mempolicy.h>
>  #include <linux/vmstat.h>
>  #include <linux/notifier.h>
>  #include <linux/node.h>
> @@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>  			break;
>  		}
>  	}
> +
> +	/* When setting CPU access coordinates, update mempolicy */
> +	if (access == ACCESS_COORDINATE_CPU) {
> +		if (mempolicy_set_node_perf(nid, coord))
> +			pr_info("failed to set node%d mempolicy attrs\n", nid);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(node_set_perf_attrs);
>  
> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index ce9885e0178a..0fe96f3ab3ef 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -11,6 +11,7 @@
>  #include <linux/slab.h>
>  #include <linux/rbtree.h>
>  #include <linux/spinlock.h>
> +#include <linux/node.h>
>  #include <linux/nodemask.h>
>  #include <linux/pagemap.h>
>  #include <uapi/linux/mempolicy.h>
> @@ -178,6 +179,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>  
>  extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
>  
> +extern int mempolicy_set_node_perf(unsigned int node,
> +				   struct access_coordinate *coords);
> +
>  #else
>  
>  struct mempolicy {};
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 04f35659717a..8777bd6229bc 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
>  #include <linux/mmu_notifier.h>
>  #include <linux/printk.h>
>  #include <linux/swapops.h>
> +#include <linux/gcd.h>
>  
>  #include <asm/tlbflush.h>
>  #include <asm/tlb.h>
> @@ -138,16 +139,18 @@ static struct mempolicy default_policy = {
>  
>  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>  
> +static uint64_t *node_bw_table;
> +
>  /*
> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
> - * system-default value should be used. A NULL iw_table also denotes that
> - * system-default values should be used. Until the system-default table
> - * is implemented, the system-default is always 1.
> - *
> + * iw_table is the interleave weight table.
> + * If bandwiddth data is available and the user is in auto mode, the table
> + * is populated with default values in [1,255].
>   * iw_table is RCU protected
>   */
>  static u8 __rcu *iw_table;
>  static DEFINE_MUTEX(iw_table_lock);
> +static const int weightiness = 32;
> +static bool weighted_interleave_auto = true;

I still prefer to use 2 iw_table, one is for default, the other is for
manual.  The default one will be used if the manual one is NULL.  Both
are protected by RCU.  The default one can be updated upon hotplug
blindly.  This makes the whole model easier to be understood IMHO.

What do you think about that.

>  static u8 get_il_weight(int node)
>  {
> @@ -156,14 +159,113 @@ static u8 get_il_weight(int node)
>  
>  	rcu_read_lock();
>  	table = rcu_dereference(iw_table);
> -	/* if no iw_table, use system default */
>  	weight = table ? table[node] : 1;
> -	/* if value in iw_table is 0, use system default */
> -	weight = weight ? weight : 1;
>  	rcu_read_unlock();
>  	return weight;
>  }

[snip]

---
Best Regards,
Huang, Ying
Honggyu Kim Jan. 21, 2025, 11:27 a.m. UTC | #3
Hi Ying and Joshua,

On 1/21/2025 8:17 PM, Huang, Ying wrote:
> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> 
>> On machines with multiple memory nodes, interleaving page allocations
>> across nodes allows for better utilization of each node's bandwidth.
>> Previous work by Gregory Price [1] introduced weighted interleave, which
>> allowed for pages to be allocated across NUMA nodes according to
>> user-set ratios.
>>
>> Ideally, these weights should be proportional to their bandwidth, so
>> that under bandwidth pressure, each node uses its maximal efficient
>> bandwidth and prevents latency from increasing exponentially.
>>
>> At the same time, we want these weights to be as small as possible.
>> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
>> to awkward and inefficient allocations, since the node with weight 7
>> will remain mostly unused (and despite being proportional to bandwidth,
>> will not aid in relieving the bandwidth pressure in the other two nodes).
>>
>> This patch introduces an auto-configuration mode for the interleave
>> weights that aims to balance the two goals of setting node weights to be
>> proportional to their bandwidths and keeping the weight values low.
>> In order to perform the weight re-scaling, we use an internal
>> "weightiness" value (fixed to 32) that defines interleave aggression.
>>
>> In this auto configuration mode, node weights are dynamically updated
>> every time there is a hotplug event that introduces new bandwidth.
>>
>> Users can also enter manual mode by writing "manual" to the new "mode"
>> sysfs interface. When a user enters manual mode, the system stops
>> dynamically updating any of the node weights, even during hotplug events
>> that can shift the optimal weight distribution. The system also enters
>> manual mode any time a user sets a node's weight by hand, using the
>> nodeN interface introduced in [1]. On the other hand, auto mode is
>> only entered by explicitly writing "auto" to the mode interface.
>>
>> There is one functional change that this patch makes to the existing
>> weighted_interleave ABI: previously, writing 0 directly to a nodeN
>> interface was said to reset the weight to the system default. Before
>> this patch, the default for all weights were 1, which meant that writing
>> 0 and 1 were functionally equivalent.
>>
>> This patch introduces "real" defaults, but we have decided to move away
>> from letting users use 0 as a "set to default" interface. Rather, users
>> who want to use system defaults should use "auto" mode. This patch seems
>> to be the appropriate place to make this change, since we would like to
>> remove this usage before users begin to rely on the feature in
>> userspace. Moreover, users will not be losing any functionality; they
>> can still write 1 into a node if they want a weight of 1. Thus, we
>> deprecate the "write zero to reset" feature in favor of returning an
>> error, the same way we would return an error when the user writes any
>> other invalid weight to the interface.
>>
>> [1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
>>
>> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>> Co-developed-by: Gregory Price <gourry@gourry.net>
>> Signed-off-by: Gregory Price <gourry@gourry.net>
>> ---
>> Changelog
>> v3:
>> - Weightiness (max_node_weight) is now fixed to 32.
>> - Instead, the sysfs interface now exposes a "mode" parameter, which
>>    can either be "auto" or "manual".
>>    - Thank you Hyeonggon and Honggyu for the feedback.
>> - Documentation updated to reflect new sysfs interface, explicitly
>>    specifies that 0 is invalid.
>>    - Thank you Gregory and Ying for the discussion on how best to
>>      handle the 0 case.
>> - Re-worked nodeN sysfs store to handle auto --> manual shifts
>> - mempolicy_set_node_perf internally handles the auto / manual
>>    caes differently now. bw is always updated, iw updates depend on
>>    what mode the user is in.
>> - Wordsmithing comments for clarity.
>> - Removed RFC tag.
>>
>> v2:
>> - Name of the interface is changed: "max_node_weight" --> "weightiness"
>> - Default interleave weight table no longer exists. Rather, the
>>    interleave weight table is initialized with the defaults, if bandwidth
>>    information is available.
>>    - In addition, all sections that handle iw_table have been changed
>>      to reference iw_table if it exists, otherwise defaulting to 1.
>> - All instances of unsigned long are converted to uint64_t to guarantee
>>    support for both 32-bit and 64-bit machines
>> - sysfs initialization cleanup
>> - Documentation has been rewritten to explicitly outline expected
>>    behavior and expand on the interpretation of "weightiness".
>> - kzalloc replaced with kcalloc for readability
>> - Thank you Gregory and Hyeonggon for your review & feedback!
>>
>>   ...fs-kernel-mm-mempolicy-weighted-interleave |  30 ++-
>>   drivers/acpi/numa/hmat.c                      |   1 +
>>   drivers/base/node.c                           |   7 +
>>   include/linux/mempolicy.h                     |   4 +
>>   mm/mempolicy.c                                | 212 ++++++++++++++++--
>>   5 files changed, 227 insertions(+), 27 deletions(-)
>>
>> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>> index 0b7972de04e9..d30dc29c53ff 100644
>> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>> @@ -20,6 +20,30 @@ Description:	Weight configuration interface for nodeN
>>   		Minimum weight: 1
>>   		Maximum weight: 255
>>   
>> -		Writing an empty string or `0` will reset the weight to the
>> -		system default. The system default may be set by the kernel
>> -		or drivers at boot or during hotplug events.
>> +		Writing invalid values (i.e. any values not in [1,255],
>> +		empty string, ...) will return -EINVAL.
>> +
>> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/mode
>> +Date:		January 2025
>> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
>> +Description:	Auto-weighting configuration interface
>> +
>> +		Configuration modes for weighted interleave. Can take one of
>> +		two options: "manual" and "auto". Default is "auto".
>> +
>> +		In auto mode, all node weights are re-calculated and overwritten
>> +		(visible via the nodeN interfaces) whenever new bandwidth data
>> +		is made available either during boot or hotplug events.
>> +
>> +		In manual mode, node weights can only be updated by the user.
>> +		If a node is hotplugged while the user is in manual mode,
>> +		the node will have a default weight of 1.
>> +
>> +		Modes can be changed by writing either "auto" or "manual" to the
>> +		interface. All other strings will be ignored, and -EINVAL will
>> +		be returned. If "auto" is written to the interface but the
>> +		recalculation / updates fail at any point (-ENOMEM or -ENODEV)
>> +		then the mode will remain in manual mode.
>> +
>> +		Writing a new weight to a node directly via the nodeN interface
>> +		will also automatically update the system to manual mode.
> 
> IMHO, this interface is somewhat hard to be used.  Users need to know
> which value is legal.  So, this will become something like,
> 
> $ cat mode
> auto [manual]
> $ echo auto > mode
> $ cat mode
> [auto] manual

This is exactly I internally proposed to Hyeonggon, but couldn't share
the idea directly here.

> 
> Unless it's possible we will add more modes in the future, this is kind
> of overkill for me.  How about something simpler as below?
> 
> $ cat auto
> true
> $ echo 0 > auto
> $ cat auto
> false

That also makes sense, but I feel like somewhat vague what "auto" false
means. The "auto" might be better to be "use_hmat" instead and this
makes "use_hmat" false more meaningful. "use_hmat_weight" or
"use_hmat_info" might be another candidates.

Thanks,
Honggyu

> 
>> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
>> index 80a3481c0470..cc94cba112dd 100644
>> --- a/drivers/acpi/numa/hmat.c
>> +++ b/drivers/acpi/numa/hmat.c
>> @@ -20,6 +20,7 @@
>>   #include <linux/list_sort.h>
>>   #include <linux/memregion.h>
>>   #include <linux/memory.h>
>> +#include <linux/mempolicy.h>
>>   #include <linux/mutex.h>
>>   #include <linux/node.h>
>>   #include <linux/sysfs.h>
>> diff --git a/drivers/base/node.c b/drivers/base/node.c
>> index 0ea653fa3433..16e7a5a8ebe7 100644
>> --- a/drivers/base/node.c
>> +++ b/drivers/base/node.c
>> @@ -7,6 +7,7 @@
>>   #include <linux/init.h>
>>   #include <linux/mm.h>
>>   #include <linux/memory.h>
>> +#include <linux/mempolicy.h>
>>   #include <linux/vmstat.h>
>>   #include <linux/notifier.h>
>>   #include <linux/node.h>
>> @@ -214,6 +215,12 @@ void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
>>   			break;
>>   		}
>>   	}
>> +
>> +	/* When setting CPU access coordinates, update mempolicy */
>> +	if (access == ACCESS_COORDINATE_CPU) {
>> +		if (mempolicy_set_node_perf(nid, coord))
>> +			pr_info("failed to set node%d mempolicy attrs\n", nid);
>> +	}
>>   }
>>   EXPORT_SYMBOL_GPL(node_set_perf_attrs);
>>   
>> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
>> index ce9885e0178a..0fe96f3ab3ef 100644
>> --- a/include/linux/mempolicy.h
>> +++ b/include/linux/mempolicy.h
>> @@ -11,6 +11,7 @@
>>   #include <linux/slab.h>
>>   #include <linux/rbtree.h>
>>   #include <linux/spinlock.h>
>> +#include <linux/node.h>
>>   #include <linux/nodemask.h>
>>   #include <linux/pagemap.h>
>>   #include <uapi/linux/mempolicy.h>
>> @@ -178,6 +179,9 @@ static inline bool mpol_is_preferred_many(struct mempolicy *pol)
>>   
>>   extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
>>   
>> +extern int mempolicy_set_node_perf(unsigned int node,
>> +				   struct access_coordinate *coords);
>> +
>>   #else
>>   
>>   struct mempolicy {};
>> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> index 04f35659717a..8777bd6229bc 100644
>> --- a/mm/mempolicy.c
>> +++ b/mm/mempolicy.c
>> @@ -109,6 +109,7 @@
>>   #include <linux/mmu_notifier.h>
>>   #include <linux/printk.h>
>>   #include <linux/swapops.h>
>> +#include <linux/gcd.h>
>>   
>>   #include <asm/tlbflush.h>
>>   #include <asm/tlb.h>
>> @@ -138,16 +139,18 @@ static struct mempolicy default_policy = {
>>   
>>   static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>>   
>> +static uint64_t *node_bw_table;
>> +
>>   /*
>> - * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
>> - * system-default value should be used. A NULL iw_table also denotes that
>> - * system-default values should be used. Until the system-default table
>> - * is implemented, the system-default is always 1.
>> - *
>> + * iw_table is the interleave weight table.
>> + * If bandwiddth data is available and the user is in auto mode, the table
>> + * is populated with default values in [1,255].
>>    * iw_table is RCU protected
>>    */
>>   static u8 __rcu *iw_table;
>>   static DEFINE_MUTEX(iw_table_lock);
>> +static const int weightiness = 32;
>> +static bool weighted_interleave_auto = true;
> 
> I still prefer to use 2 iw_table, one is for default, the other is for
> manual.  The default one will be used if the manual one is NULL.  Both
> are protected by RCU.  The default one can be updated upon hotplug
> blindly.  This makes the whole model easier to be understood IMHO.
> 
> What do you think about that.
> 
>>   static u8 get_il_weight(int node)
>>   {
>> @@ -156,14 +159,113 @@ static u8 get_il_weight(int node)
>>   
>>   	rcu_read_lock();
>>   	table = rcu_dereference(iw_table);
>> -	/* if no iw_table, use system default */
>>   	weight = table ? table[node] : 1;
>> -	/* if value in iw_table is 0, use system default */
>> -	weight = weight ? weight : 1;
>>   	rcu_read_unlock();
>>   	return weight;
>>   }
> 
> [snip]
> 
> ---
> Best Regards,
> Huang, Ying
Gregory Price Jan. 21, 2025, 7:56 p.m. UTC | #4
On Tue, Jan 21, 2025 at 07:17:15PM +0800, Huang, Ying wrote:
... snip ...
> 
> Unless it's possible we will add more modes in the future, this is kind
> of overkill for me.  How about something simpler as below?
> 
> $ cat auto
> true
> $ echo 0 > auto
> $ cat auto
> false

We have discussed having a dynamic-mode where the weights might adjust
on the fly based on system-state, but i think this ends up being
controlled under mempolicy/dynamic_interleave or something.

So this seems reasonable.

> >  static u8 __rcu *iw_table;
> >  static DEFINE_MUTEX(iw_table_lock);
> > +static const int weightiness = 32;
> > +static bool weighted_interleave_auto = true;
> 
> I still prefer to use 2 iw_table, one is for default, the other is for
> manual.  The default one will be used if the manual one is NULL.  Both
> are protected by RCU.  The default one can be updated upon hotplug
> blindly.  This makes the whole model easier to be understood IMHO.
> 
> What do you think about that.
> 

only question is, lets say you have

`cat auto node0 node1` -> `true 5 1`
and you do
echo 0 > auto

what should a subsequent `cat auto node0 node1` output?

`false 5 1`
or
`false 1 1`

Then lets say we do
echo 7 > node0

what should
echo true > auto
result in?

`true 5 1`
or
`true 7 1`

The current code makes sure that when you switch modes from auto
to manual, it inherits the current state - instead of there being
some hidden state that suddenly takes precedence.

So I prefer to just have one IW array and no hidden state.

~Gregory
Gregory Price Jan. 21, 2025, 8:02 p.m. UTC | #5
On Tue, Jan 21, 2025 at 08:27:17PM +0900, Honggyu Kim wrote:
> Hi Ying and Joshua,
> > IMHO, this interface is somewhat hard to be used.  Users need to know
> > which value is legal.  So, this will become something like,
> > 
> > $ cat mode
> > auto [manual]
> > $ echo auto > mode
> > $ cat mode
> > [auto] manual
> 
> This is exactly I internally proposed to Hyeonggon, but couldn't share
> the idea directly here.
> 
> That also makes sense, but I feel like somewhat vague what "auto" false
> means. The "auto" might be better to be "use_hmat" instead and this
> makes "use_hmat" false more meaningful. "use_hmat_weight" or
> "use_hmat_info" might be another candidates.
> 

I don't think we want to encode hmat-ism into the uapi. In fact,
mempolicy doesn't even know about hmat.  It just gets source information
from *somewhere* and applies it accordingly.

I think what you might be asking for is

auto -> [true, false]

if auto=true
  mode -> [default,
           read_bw, write_bw, combined_bw,
           read_ltc, write_ltc, combined_ltc]
if auto=false
  mode -> [disabled]

Where default mode is the kernel selection of whatever combination of
read/write bw/ltc data, and user could switch the attribute.

HOWEVER, such `mode` would require us to cache the attribute structure
per-node, and maybe some thought on what's reasonable, so that I would
prefer that to be a completely different feature / discussion.

~Gregory
Joshua Hahn Jan. 21, 2025, 9:24 p.m. UTC | #6
Hi Hyeonggon, thank you for the review!

[...snip...]

> Hi Joshua, thanks for the update!
> It actually is what I was intended in the manual / auto mode description.
> 
> I don't have a strong opinion on the weight of the hot-plugged NUMA node
> in manual mode, as it's not ideal whatever weight we choose and the user
> need to update the weight after hot-plug events anyway.

I'm glad that I was able to correctly interpret the framework you laid
out in the previous conversations. And yes -- I agree, I think no matter
what value I choose, it will always be sub-optimal for some definition
of optimality. I simply chose 1 because it is now the new smallest
weight possible, since 0 no longer works.

> Some comments inlined below:
> 
> > diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > index 0b7972de04e9..d30dc29c53ff 100644
> > --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> > @@ -20,6 +20,30 @@ Description:	Weight configuration interface for nodeN
> >   		Minimum weight: 1
> >   		Maximum weight: 255
> >   
> > -		Writing an empty string or `0` will reset the weight to the
> > -		system default. The system default may be set by the kernel
> > -		or drivers at boot or during hotplug events.
> > +		Writing invalid values (i.e. any values not in [1,255],
> > +		empty string, ...) will return -EINVAL.
> > +
> > +What:		/sys/kernel/mm/mempolicy/weighted_interleave/mode
> > +Date:		January 2025
> > +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> > +Description:	Auto-weighting configuration interface
> > +
> > +		Configuration modes for weighted interleave. Can take one of
> > +		two options: "manual" and "auto". Default is "auto".
> > +
> > +		In auto mode, all node weights are re-calculated and overwritten
> > +		(visible via the nodeN interfaces) whenever new bandwidth data
> > +		is made available either during boot or hotplug events.
> > +
> > +		In manual mode, node weights can only be updated by the user.
> > +		If a node is hotplugged while the user is in manual mode,
> > +		the node will have a default weight of 1.
> > +
> > +		Modes can be changed by writing either "auto" or "manual" to the
> > +		interface. All other strings will be ignored, and -EINVAL will
> > +		be returned. If "auto" is written to the interface but the
> > +		recalculation / updates fail at any point (-ENOMEM or -ENODEV)
> > +		then the mode will remain in manual mode.
> > +
> > +		Writing a new weight to a node directly via the nodeN interface
> > +		will also automatically update the system to manual mode.
> 
> I think the last paragraph should also be included in the nodeX parameter.

I agree, I will definitely add this in the next version!
 
> > @@ -2450,16 +2548,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
> >   	if (!weights)
> >   		return total_allocated;
> >   
> > -	rcu_read_lock();
> > -	table = rcu_dereference(iw_table);
> > -	if (table)
> > -		memcpy(weights, table, nr_node_ids);
> > -	rcu_read_unlock();
> > -
> > -	/* calculate total, detect system default usage */
> >   	for_each_node_mask(node, nodes) {
> > -		if (!weights[node])
> > -			weights[node] = 1;
> > +		weights[node] = get_il_weight(node);
> >   		weight_total += weights[node];
> >   	}
> 
> Uh-hum...
> Looks like it now allows copying weights from different versions of iw_tables?

This is a good point, this is actually an artifact from a previous
iteration where get_il_weight was needed to handle the weight being
0, but since we no longer allow 0 as a value, it makes more sense to
just take a snapshot under a single rcu lock. Thank you for the catch!

I will also go over the other places this is used and just make sure the
locking behavior is as intended.

> Otherwise this patch looks good to me.
> 
> Best,
> Hyeonggon

Thanks again Hyeonggon, I'll send out a v4 with the changes you mentioned!
Have a great day!!
Joshua
Huang, Ying Jan. 22, 2025, 1:24 a.m. UTC | #7
Hi, Honggyu,

Honggyu Kim <honggyu.kim@sk.com> writes:

> Hi Ying and Joshua,
>
> On 1/21/2025 8:17 PM, Huang, Ying wrote:
>> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>> 

[snip]

>>> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>>> index 0b7972de04e9..d30dc29c53ff 100644
>>> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>>> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>>> @@ -20,6 +20,30 @@ Description:	Weight configuration interface for nodeN
>>>   		Minimum weight: 1
>>>   		Maximum weight: 255
>>>   -		Writing an empty string or `0` will reset the weight
>>> to the
>>> -		system default. The system default may be set by the kernel
>>> -		or drivers at boot or during hotplug events.
>>> +		Writing invalid values (i.e. any values not in [1,255],
>>> +		empty string, ...) will return -EINVAL.
>>> +
>>> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/mode
>>> +Date:		January 2025
>>> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
>>> +Description:	Auto-weighting configuration interface
>>> +
>>> +		Configuration modes for weighted interleave. Can take one of
>>> +		two options: "manual" and "auto". Default is "auto".
>>> +
>>> +		In auto mode, all node weights are re-calculated and overwritten
>>> +		(visible via the nodeN interfaces) whenever new bandwidth data
>>> +		is made available either during boot or hotplug events.
>>> +
>>> +		In manual mode, node weights can only be updated by the user.
>>> +		If a node is hotplugged while the user is in manual mode,
>>> +		the node will have a default weight of 1.
>>> +
>>> +		Modes can be changed by writing either "auto" or "manual" to the
>>> +		interface. All other strings will be ignored, and -EINVAL will
>>> +		be returned. If "auto" is written to the interface but the
>>> +		recalculation / updates fail at any point (-ENOMEM or -ENODEV)
>>> +		then the mode will remain in manual mode.
>>> +
>>> +		Writing a new weight to a node directly via the nodeN interface
>>> +		will also automatically update the system to manual mode.
>> IMHO, this interface is somewhat hard to be used.  Users need to
>> know
>> which value is legal.  So, this will become something like,
>> $ cat mode
>> auto [manual]
>> $ echo auto > mode
>> $ cat mode
>> [auto] manual
>
> This is exactly I internally proposed to Hyeonggon, but couldn't share
> the idea directly here.
>
>> Unless it's possible we will add more modes in the future, this is
>> kind
>> of overkill for me.  How about something simpler as below?
>> $ cat auto
>> true
>> $ echo 0 > auto
>> $ cat auto
>> false
>
> That also makes sense, but I feel like somewhat vague what "auto" false
> means. The "auto" might be better to be "use_hmat" instead and this
> makes "use_hmat" false more meaningful. "use_hmat_weight" or
> "use_hmat_info" might be another candidates.

As Gregory pointed out in another email.  hmat isn't good because it's
platform dependent.  We may use something else to get default weights on
another platform.

I'm open to other platform independent naming.  For example, use_default.

[snip]

---
Best Regards,
Huang, Ying
Huang, Ying Jan. 22, 2025, 1:37 a.m. UTC | #8
Gregory Price <gourry@gourry.net> writes:

> On Tue, Jan 21, 2025 at 07:17:15PM +0800, Huang, Ying wrote:
> ... snip ...
>> 
>> Unless it's possible we will add more modes in the future, this is kind
>> of overkill for me.  How about something simpler as below?
>> 
>> $ cat auto
>> true
>> $ echo 0 > auto
>> $ cat auto
>> false
>
> We have discussed having a dynamic-mode where the weights might adjust
> on the fly based on system-state, but i think this ends up being
> controlled under mempolicy/dynamic_interleave or something.
>
> So this seems reasonable.
>
>> >  static u8 __rcu *iw_table;
>> >  static DEFINE_MUTEX(iw_table_lock);
>> > +static const int weightiness = 32;
>> > +static bool weighted_interleave_auto = true;
>> 
>> I still prefer to use 2 iw_table, one is for default, the other is for
>> manual.  The default one will be used if the manual one is NULL.  Both
>> are protected by RCU.  The default one can be updated upon hotplug
>> blindly.  This makes the whole model easier to be understood IMHO.
>> 
>> What do you think about that.
>> 
>
> only question is, lets say you have
>
> `cat auto node0 node1` -> `true 5 1`
> and you do
> echo 0 > auto
>
> what should a subsequent `cat auto node0 node1` output?
>
> `false 5 1`
> or
> `false 1 1`

IMO, it should be

`false 5 1`

That is, we copy auto-generated weights to manual weights atomically and
use it.

> Then lets say we do
> echo 7 > node0

Now, `cat auto node0 node1` outputs,

`false 7 1`

That is, we delete manual weights and use the auto-generated ones.

> what should
> echo true > auto
> result in?
>
> `true 5 1`
> or
> `true 7 1`

It should be

`true 5 1`

> The current code makes sure that when you switch modes from auto
> to manual, it inherits the current state - instead of there being
> some hidden state that suddenly takes precedence.

I think that we can do that with two weight arrays.

> So I prefer to just have one IW array and no hidden state.

Then, when we switch from manual to auto mode, where to find
auto-generated weights?  Re-calculate them?

---
Best Regards,
Huang, Ying
Joshua Hahn Jan. 22, 2025, 3:59 p.m. UTC | #9
On Wed, 22 Jan 2025 09:37:20 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

Hi Gregory and Ying, thank you both for your insights!

[...snip...]

> >> I still prefer to use 2 iw_table, one is for default, the other is for
> >> manual.  The default one will be used if the manual one is NULL.  Both
> >> are protected by RCU.  The default one can be updated upon hotplug
> >> blindly.  This makes the whole model easier to be understood IMHO.
> > `cat auto node0 node1` -> `true 5 1`
> > and you do
> > echo 0 > auto

I think that when initially developing this patch, this was the intent
that I had as well (in the v1 of this RFC patch, there was an iw_table
and a separate default_iw_table). However, I think that the ideas of
having a "default" and "manual" table made less sense over time, given
that they behaved more like a "default" and "visible" table instead.
That is, the visible layer is directly manipulable by the user, but
does not necessarily only contain manually-set values; rather, most of
the time, it probably still has a lot of auto-generated weights.

I think that this analysis runs the risk of being a bit too semantically
nit-picky, but as I'll explain below, I think both the 1-layer approach
that I implemented in this RFC and the expected 2-layer behavior that
you outline below are essentially the same, functionally. In other
words, I think we agree on what the expected behavior should be : -)
We just have to agree on what presentation would make the most sense
for the user.

> > what should a subsequent `cat auto node0 node1` output?
> >
> > `false 5 1`
> > or
> > `false 1 1`
> 
> IMO, it should be
> 
> `false 5 1`
> 
> That is, we copy auto-generated weights to manual weights atomically and
> use it.
> 
> > Then lets say we do
> > echo 7 > node0
> 
> Now, `cat auto node0 node1` outputs,
> 
> `false 7 1`
> 
> That is, we delete manual weights and use the auto-generated ones.
> 
> > what should
> > echo true > auto
> > result in?
> >
> > `true 5 1`
> > or
> > `true 7 1`
> 
> It should be
> 
> `true 5 1`

I see, so I think we actually agree on what the behavior for this is.
Then there is no real "hidden state", it's either just using the
default weights, or turning that off and being able to edit the
states.
 
> > The current code makes sure that when you switch modes from auto
> > to manual, it inherits the current state - instead of there being
> > some hidden state that suddenly takes precedence.
> 
> I think that we can do that with two weight arrays.
> 
> > So I prefer to just have one IW array and no hidden state.
> 
> Then, when we switch from manual to auto mode, where to find
> auto-generated weights?  Re-calculate them?

Even in manual mode, incoming bandwidth data is continuously stored.
This way, when a user does decide to switch back to auto mode later,
the system doesn't have to retrieve the bandwidth data all over again.
As for the auto-generated weights, they are re-calculated based solely
on the bandwidth data available. (I will note that re-calculating
the weights are very quick, see reduce_interleave_weights)

Based on your description of the expected behavior, everything you
listed out is actually what currently happens in the one-layer system.
Switching from auto --> manual inherits the auto-generated weights, and
switching from manual --> auto wipes all previous user-stored data.

At this point, I think that I am happy with either option. I wrote and
re-wrote this a bunch of times, and came to the conclusion that now
that we agree on the behavior of the interface, I have no strong
opinion on whether we have a "hidden" default layer or a phantom
default layer that is just generated every time a user needs it : -)

Please let me know if I missed anything as well! Thank you all for
your continued feedback and interest! Have a great day,
Joshua
Gregory Price Jan. 22, 2025, 4:53 p.m. UTC | #10
On Wed, Jan 22, 2025 at 07:59:34AM -0800, Joshua Hahn wrote:
> On Wed, 22 Jan 2025 09:37:20 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
> > > The current code makes sure that when you switch modes from auto
> > > to manual, it inherits the current state - instead of there being
> > > some hidden state that suddenly takes precedence.
> > 
> > I think that we can do that with two weight arrays.
> > 
> > > So I prefer to just have one IW array and no hidden state.
> > 
> > Then, when we switch from manual to auto mode, where to find
> > auto-generated weights?  Re-calculate them?
> 
> Even in manual mode, incoming bandwidth data is continuously stored.
> This way, when a user does decide to switch back to auto mode later,
> the system doesn't have to retrieve the bandwidth data all over again.
> As for the auto-generated weights, they are re-calculated based solely
> on the bandwidth data available. (I will note that re-calculating
> the weights are very quick, see reduce_interleave_weights)
> 
> Based on your description of the expected behavior, everything you
> listed out is actually what currently happens in the one-layer system.
> Switching from auto --> manual inherits the auto-generated weights, and
> switching from manual --> auto wipes all previous user-stored data.
>

Piling on - the single-layer system is very bluntly simpler (one fewer array)
with the exact same behavior.  Therefore it's better (in my opinion).

But this is all hidden from the user - so I don't care.  If you (Ying)
have strong feelings, we're happy to ship either.

~Gregory
Huang, Ying Jan. 23, 2025, 3:32 a.m. UTC | #11
Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Wed, 22 Jan 2025 09:37:20 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
>
> Hi Gregory and Ying, thank you both for your insights!
>
> [...snip...]
>
>> >> I still prefer to use 2 iw_table, one is for default, the other is for
>> >> manual.  The default one will be used if the manual one is NULL.  Both
>> >> are protected by RCU.  The default one can be updated upon hotplug
>> >> blindly.  This makes the whole model easier to be understood IMHO.
>> > `cat auto node0 node1` -> `true 5 1`
>> > and you do
>> > echo 0 > auto
>
> I think that when initially developing this patch, this was the intent
> that I had as well (in the v1 of this RFC patch, there was an iw_table
> and a separate default_iw_table). However, I think that the ideas of
> having a "default" and "manual" table made less sense over time, given
> that they behaved more like a "default" and "visible" table instead.
> That is, the visible layer is directly manipulable by the user, but
> does not necessarily only contain manually-set values; rather, most of
> the time, it probably still has a lot of auto-generated weights.
>
> I think that this analysis runs the risk of being a bit too semantically
> nit-picky, but as I'll explain below, I think both the 1-layer approach
> that I implemented in this RFC and the expected 2-layer behavior that
> you outline below are essentially the same, functionally. In other
> words, I think we agree on what the expected behavior should be : -)
> We just have to agree on what presentation would make the most sense
> for the user.

This sounds good to me.

We still need to finalize the interface, 'mode' or 'auto'.  Personally,
I prefer 'auto', because it's simpler.

>> > what should a subsequent `cat auto node0 node1` output?
>> >
>> > `false 5 1`
>> > or
>> > `false 1 1`
>> 
>> IMO, it should be
>> 
>> `false 5 1`
>> 
>> That is, we copy auto-generated weights to manual weights atomically and
>> use it.
>> 
>> > Then lets say we do
>> > echo 7 > node0
>> 
>> Now, `cat auto node0 node1` outputs,
>> 
>> `false 7 1`
>> 
>> That is, we delete manual weights and use the auto-generated ones.
>> 
>> > what should
>> > echo true > auto
>> > result in?
>> >
>> > `true 5 1`
>> > or
>> > `true 7 1`
>> 
>> It should be
>> 
>> `true 5 1`
>
> I see, so I think we actually agree on what the behavior for this is.
> Then there is no real "hidden state", it's either just using the
> default weights, or turning that off and being able to edit the
> states.
>  
>> > The current code makes sure that when you switch modes from auto
>> > to manual, it inherits the current state - instead of there being
>> > some hidden state that suddenly takes precedence.
>> 
>> I think that we can do that with two weight arrays.
>> 
>> > So I prefer to just have one IW array and no hidden state.
>> 
>> Then, when we switch from manual to auto mode, where to find
>> auto-generated weights?  Re-calculate them?
>
> Even in manual mode, incoming bandwidth data is continuously stored.
> This way, when a user does decide to switch back to auto mode later,
> the system doesn't have to retrieve the bandwidth data all over again.
> As for the auto-generated weights, they are re-calculated based solely
> on the bandwidth data available. (I will note that re-calculating
> the weights are very quick, see reduce_interleave_weights)
>
> Based on your description of the expected behavior, everything you
> listed out is actually what currently happens in the one-layer system.
> Switching from auto --> manual inherits the auto-generated weights, and
> switching from manual --> auto wipes all previous user-stored data.
>
> At this point, I think that I am happy with either option. I wrote and
> re-wrote this a bunch of times, and came to the conclusion that now
> that we agree on the behavior of the interface, I have no strong
> opinion on whether we have a "hidden" default layer or a phantom
> default layer that is just generated every time a user needs it : -)
>
> Please let me know if I missed anything as well! Thank you all for
> your continued feedback and interest! Have a great day,

I see.  You cache the nodes bandwidth instead of default weights.  That
works.  I am fine with either way too.

---
Best Regards,
Huang, Ying
Matthew Wilcox Jan. 24, 2025, 5:58 a.m. UTC | #12
On Wed, Jan 15, 2025 at 10:58:54AM -0800, Joshua Hahn wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across NUMA nodes according to
> user-set ratios.

I still don't get it.  You always want memory to be on the local node or
the fabric gets horribly congested and slows you right down.  But you're
not really talking about NUMA, are you?  You're talking about CXL.

And CXL is terrible for bandwidth.  I just ran the numbers.

On a current Intel top-end CPU, we're looking at 8x DDR5-4800 DIMMs,
each with a bandwidth of 38.4GB/s for a total of 300GB/s.

For each CXL lane, you take a lane of PCIe gen5 away.  So that's
notionally 32Gbit/s, or 4GB/s per lane.  But CXL is crap, and you'll be
lucky to get 3 cachelines per 256 byte packet, dropping you down to 3GB/s.
You're not going to use all 80 lanes for CXL (presumably these CPUs are
going to want to do I/O somehow), so maybe allocate 20 of them to CXL.
That's 60GB/s, or a 20% improvement in bandwidth.  On top of that,
it's slow, with a minimum of 10ns latency penalty just from the CXL
encode/decode penalty.

Putting page cache in the CXL seems like nonsense to me.  I can see it
making sense to swap to CXL, or allocating anonymous memory for tasks
with low priority on it.  But I just can't see the point of putting
pagecache on CXL.
Gregory Price Jan. 24, 2025, 3:48 p.m. UTC | #13
On Fri, Jan 24, 2025 at 05:58:09AM +0000, Matthew Wilcox wrote:
> On Wed, Jan 15, 2025 at 10:58:54AM -0800, Joshua Hahn wrote:
> > On machines with multiple memory nodes, interleaving page allocations
> > across nodes allows for better utilization of each node's bandwidth.
> > Previous work by Gregory Price [1] introduced weighted interleave, which
> > allowed for pages to be allocated across NUMA nodes according to
> > user-set ratios.
> 
> I still don't get it.  You always want memory to be on the local node or
> the fabric gets horribly congested and slows you right down.  But you're
> not really talking about NUMA, are you?  You're talking about CXL.
> 
> And CXL is terrible for bandwidth.  I just ran the numbers.
> 
> On a current Intel top-end CPU, we're looking at 8x DDR5-4800 DIMMs,
> each with a bandwidth of 38.4GB/s for a total of 300GB/s.
> 
> For each CXL lane, you take a lane of PCIe gen5 away.  So that's
> notionally 32Gbit/s, or 4GB/s per lane.  But CXL is crap, and you'll be
> lucky to get 3 cachelines per 256 byte packet, dropping you down to 3GB/s.
> You're not going to use all 80 lanes for CXL (presumably these CPUs are
> going to want to do I/O somehow), so maybe allocate 20 of them to CXL.
> That's 60GB/s, or a 20% improvement in bandwidth.  On top of that,
> it's slow, with a minimum of 10ns latency penalty just from the CXL
> encode/decode penalty.
>

From the original - the performance tests show considerable opportunity
in the scenarios where DRAM bandwidth is pressured - as you can either

1) Lower the DRAM bandwidth pressure by offloading some cachelines to
   CXL - reducing latency on DRAM and reducing average latency overall.

   The latency cost on CXL lines gets amortized over all DRAM fetches
   no longer hitting stalls.

2) Under full-pressure scenarios (DRAM and CXL are saturated), the
   additional lanes / buffers provide more concurrent fetches - i.e.
   you're just doing more work (and avoiding going to storage).

   This is the weaker of the two scenarios.

No one is proposing we switch the default policy to weighted interleave.

= Performance summary =
(tests may have different configurations, see extended info below)
1) MLC (W2) : +38% over DRAM. +264% over default interleave.
   MLC (W5) : +40% over DRAM. +226% over default interleave.
2) Stream   : -6% to +4% over DRAM, +430% over default interleave.
3) XSBench  : +19% over DRAM. +47% over default interleave.

=====================================================================
Performance tests - MLC
From - Ravi Jonnalagadda <ravis.opensrc@micron.com>

Hardware: Single-socket, multiple CXL memory expanders.

Workload:                               W2
Data Signature:                         2:1 read:write
DRAM only bandwidth (GBps):             298.8
DRAM + CXL (default interleave) (GBps): 113.04
DRAM + CXL (weighted interleave)(GBps): 412.5
Gain over DRAM only:                    1.38x
Gain over default interleave:           2.64x

Workload:                               W5
Data Signature:                         1:1 read:write
DRAM only bandwidth (GBps):             273.2
DRAM + CXL (default interleave) (GBps): 117.23
DRAM + CXL (weighted interleave)(GBps): 382.7
Gain over DRAM only:                    1.4x
Gain over default interleave:           2.26x

=====================================================================
Performance test - Stream
From - Gregory Price <gregory.price@memverge.com>

Hardware: Single socket, single CXL expander

Summary: 64 threads, ~18GB workload, 3GB per array, executed 100 times
Default interleave : -78% (slower than DRAM)
Global weighting   : -6% to +4% (workload dependant)
mbind  weights     : +2.5% to +4% (consistently better than DRAM)

=====================================================================
Performance tests - XSBench
From - Hyeongtak Ji <hyeongtak.ji@sk.com>

Hardware: Single socket, Single CXL memory Expander

NUMA node 0: 56 logical cores, 128 GB memory
NUMA node 2: 96 GB CXL memory
Threads:     56
Lookups:     170,000,000

Summary: +19% over DRAM. +47% over default interleave.


> Putting page cache in the CXL seems like nonsense to me.  I can see it
> making sense to swap to CXL, or allocating anonymous memory for tasks
> with low priority on it.  But I just can't see the point of putting
> pagecache on CXL.

No one said anything about page cache - but it depends.

If you can keep your entire working set in-memory and on-CXL, as opposed
to swapping to disk - you win.  "Swapping to CXL" incurs a bunch of page
faults, that sounds like a lose.

However - the stream test from the original proposal agrees with you
that just making everything interleaved (code, pagecache, etc) is at
best a wash:

Global weighting   : -6% to +4% (workload dependant)

But targeting specific regions can provide a modest bump

mbind weights      : +2.5% to +4% (consistently better than DRAM)

~Gregory
Gregory Price Jan. 24, 2025, 3:53 p.m. UTC | #14
On Fri, Jan 24, 2025 at 05:58:09AM +0000, Matthew Wilcox wrote:
> Putting page cache in the CXL seems like nonsense to me.  I can see it
> making sense to swap to CXL, or allocating anonymous memory for tasks
> with low priority on it.  But I just can't see the point of putting
> pagecache on CXL.

Also for what it's worth, I'm trying to get page cache *off* CXL
when it becomes warm/hot :]

https://lore.kernel.org/linux-mm/20250107000346.1338481-1-gourry@gourry.net/

~Gregory
Honggyu Kim Feb. 5, 2025, 5:34 a.m. UTC | #15
Hi Ying,

On 1/22/2025 10:24 AM, Huang, Ying wrote:

[snip]

>>>> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>>>> index 0b7972de04e9..d30dc29c53ff 100644
>>>> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>>>> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>>>> @@ -20,6 +20,30 @@ Description:	Weight configuration interface for nodeN
>>>>    		Minimum weight: 1
>>>>    		Maximum weight: 255
>>>>    -		Writing an empty string or `0` will reset the weight
>>>> to the
>>>> -		system default. The system default may be set by the kernel
>>>> -		or drivers at boot or during hotplug events.
>>>> +		Writing invalid values (i.e. any values not in [1,255],
>>>> +		empty string, ...) will return -EINVAL.
>>>> +
>>>> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/mode
>>>> +Date:		January 2025
>>>> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
>>>> +Description:	Auto-weighting configuration interface
>>>> +
>>>> +		Configuration modes for weighted interleave. Can take one of
>>>> +		two options: "manual" and "auto". Default is "auto".
>>>> +
>>>> +		In auto mode, all node weights are re-calculated and overwritten
>>>> +		(visible via the nodeN interfaces) whenever new bandwidth data
>>>> +		is made available either during boot or hotplug events.
>>>> +
>>>> +		In manual mode, node weights can only be updated by the user.
>>>> +		If a node is hotplugged while the user is in manual mode,
>>>> +		the node will have a default weight of 1.
>>>> +
>>>> +		Modes can be changed by writing either "auto" or "manual" to the
>>>> +		interface. All other strings will be ignored, and -EINVAL will
>>>> +		be returned. If "auto" is written to the interface but the
>>>> +		recalculation / updates fail at any point (-ENOMEM or -ENODEV)
>>>> +		then the mode will remain in manual mode.
>>>> +
>>>> +		Writing a new weight to a node directly via the nodeN interface
>>>> +		will also automatically update the system to manual mode.
>>> IMHO, this interface is somewhat hard to be used.  Users need to
>>> know
>>> which value is legal.  So, this will become something like,
>>> $ cat mode
>>> auto [manual]
>>> $ echo auto > mode
>>> $ cat mode
>>> [auto] manual
>>
>> This is exactly I internally proposed to Hyeonggon, but couldn't share
>> the idea directly here.
>>
>>> Unless it's possible we will add more modes in the future, this is
>>> kind
>>> of overkill for me.  How about something simpler as below?
>>> $ cat auto
>>> true
>>> $ echo 0 > auto
>>> $ cat auto
>>> false
>>
>> That also makes sense, but I feel like somewhat vague what "auto" false
>> means. The "auto" might be better to be "use_hmat" instead and this
>> makes "use_hmat" false more meaningful. "use_hmat_weight" or
>> "use_hmat_info" might be another candidates.
> 
> As Gregory pointed out in another email.  hmat isn't good because it's
> platform dependent.  We may use something else to get default weights on
> another platform.

Sorry for the late reply.  I prefered "mode" contains auto/manual and
"hmat" names were my second thought, which may not be proper here as you
mentioned.

> 
> I'm open to other platform independent naming.  For example, use_default.

I'm also fine to your another proposal "auto" with Y/N, which is used in
v4.

> 
> [snip]
> 
> ---
> Best Regards,
> Huang, Ying

Thanks,
Honggyu
diff mbox series

Patch

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
index 0b7972de04e9..d30dc29c53ff 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -20,6 +20,30 @@  Description:	Weight configuration interface for nodeN
 		Minimum weight: 1
 		Maximum weight: 255
 
-		Writing an empty string or `0` will reset the weight to the
-		system default. The system default may be set by the kernel
-		or drivers at boot or during hotplug events.
+		Writing invalid values (i.e. any values not in [1,255],
+		empty string, ...) will return -EINVAL.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/mode
+Date:		January 2025
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Auto-weighting configuration interface
+
+		Configuration modes for weighted interleave. Can take one of
+		two options: "manual" and "auto". Default is "auto".
+
+		In auto mode, all node weights are re-calculated and overwritten
+		(visible via the nodeN interfaces) whenever new bandwidth data
+		is made available either during boot or hotplug events.
+
+		In manual mode, node weights can only be updated by the user.
+		If a node is hotplugged while the user is in manual mode,
+		the node will have a default weight of 1.
+
+		Modes can be changed by writing either "auto" or "manual" to the
+		interface. All other strings will be ignored, and -EINVAL will
+		be returned. If "auto" is written to the interface but the
+		recalculation / updates fail at any point (-ENOMEM or -ENODEV)
+		then the mode will remain in manual mode.
+
+		Writing a new weight to a node directly via the nodeN interface
+		will also automatically update the system to manual mode.
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index 80a3481c0470..cc94cba112dd 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -20,6 +20,7 @@ 
 #include <linux/list_sort.h>
 #include <linux/memregion.h>
 #include <linux/memory.h>
+#include <linux/mempolicy.h>
 #include <linux/mutex.h>
 #include <linux/node.h>
 #include <linux/sysfs.h>
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..16e7a5a8ebe7 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@ 
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/mempolicy.h>
 #include <linux/vmstat.h>
 #include <linux/notifier.h>
 #include <linux/node.h>
@@ -214,6 +215,12 @@  void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
 			break;
 		}
 	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access == ACCESS_COORDINATE_CPU) {
+		if (mempolicy_set_node_perf(nid, coord))
+			pr_info("failed to set node%d mempolicy attrs\n", nid);
+	}
 }
 EXPORT_SYMBOL_GPL(node_set_perf_attrs);
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index ce9885e0178a..0fe96f3ab3ef 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -11,6 +11,7 @@ 
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
+#include <linux/node.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <uapi/linux/mempolicy.h>
@@ -178,6 +179,9 @@  static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 
 extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
 
+extern int mempolicy_set_node_perf(unsigned int node,
+				   struct access_coordinate *coords);
+
 #else
 
 struct mempolicy {};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 04f35659717a..8777bd6229bc 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -109,6 +109,7 @@ 
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
 #include <linux/swapops.h>
+#include <linux/gcd.h>
 
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -138,16 +139,18 @@  static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
+static uint64_t *node_bw_table;
+
 /*
- * iw_table is the sysfs-set interleave weight table, a value of 0 denotes
- * system-default value should be used. A NULL iw_table also denotes that
- * system-default values should be used. Until the system-default table
- * is implemented, the system-default is always 1.
- *
+ * iw_table is the interleave weight table.
+ * If bandwiddth data is available and the user is in auto mode, the table
+ * is populated with default values in [1,255].
  * iw_table is RCU protected
  */
 static u8 __rcu *iw_table;
 static DEFINE_MUTEX(iw_table_lock);
+static const int weightiness = 32;
+static bool weighted_interleave_auto = true;
 
 static u8 get_il_weight(int node)
 {
@@ -156,14 +159,113 @@  static u8 get_il_weight(int node)
 
 	rcu_read_lock();
 	table = rcu_dereference(iw_table);
-	/* if no iw_table, use system default */
 	weight = table ? table[node] : 1;
-	/* if value in iw_table is 0, use system default */
-	weight = weight ? weight : 1;
 	rcu_read_unlock();
 	return weight;
 }
 
+/*
+ * Convert bandwidth values into weighted interleave weights.
+ * Call with iw_table_lock.
+ */
+static void reduce_interleave_weights(uint64_t *bw, u8 *new_iw)
+{
+	uint64_t ttl_bw = 0, ttl_iw = 0, scaling_factor = 1, iw_gcd = 1;
+	unsigned int i = 0;
+
+	/* Recalculate the bandwidth distribution given the new info */
+	for (i = 0; i < nr_node_ids; i++)
+		ttl_bw += bw[i];
+
+	/* If node is not set or has < 1% of total bw, use minimum value of 1 */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (bw[i]) {
+			scaling_factor = 100 * bw[i];
+			new_iw[i] = max(scaling_factor / ttl_bw, 1);
+		} else {
+			new_iw[i] = 1;
+		}
+		ttl_iw += new_iw[i];
+	}
+
+	/*
+	 * Scale each node's share of the total bandwidth from percentages
+	 * to whole numbers in the range [1, weightiness]
+	 */
+	for (i = 0; i < nr_node_ids; i++) {
+		scaling_factor = weightiness * new_iw[i];
+		new_iw[i] = max(scaling_factor / ttl_iw, 1);
+		if (unlikely(i == 0))
+			iw_gcd = new_iw[0];
+		iw_gcd = gcd(iw_gcd, new_iw[i]);
+	}
+
+	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
+	for (i = 0; i < nr_node_ids; i++)
+		new_iw[i] /= iw_gcd;
+}
+
+int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
+{
+	uint64_t *old_bw, *new_bw;
+	uint64_t bw_val;
+	u8 *old_iw, *new_iw;
+
+	/*
+	 * Bandwidths above this limit cause rounding errors when reducing
+	 * weights. This value is ~16 exabytes, which is unreasonable anyways.
+	 */
+	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
+	if (bw_val > (U64_MAX / 10))
+		return -EINVAL;
+
+	new_bw = kcalloc(nr_node_ids, sizeof(uint64_t), GFP_KERNEL);
+	if (!new_bw)
+		return -ENOMEM;
+
+	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
+	if (!new_iw) {
+		kfree(new_bw);
+		return -ENOMEM;
+	}
+
+	/*
+	 * Update bandwidth info, even in manual mode. That way, when switching
+	 * to auto mode in the future, iw_table can be overwritten using
+	 * accurate bw data.
+	 */
+	mutex_lock(&iw_table_lock);
+	old_bw = node_bw_table;
+	old_iw = rcu_dereference_protected(iw_table,
+					   lockdep_is_held(&iw_table_lock));
+
+	if (old_bw)
+		memcpy(new_bw, old_bw, nr_node_ids*sizeof(uint64_t));
+	new_bw[node] = bw_val;
+	node_bw_table = new_bw;
+
+	if (weighted_interleave_auto) {
+		reduce_interleave_weights(new_bw, new_iw);
+	} else if (old_iw) {
+		/*
+		 * The first time mempolicy_set_node_perf is called, old_iw
+		 * (iw_table) is null. If that is the case, assign a zeroed
+		 * table to it. Otherwise, free the newly allocated iw_table.
+		 */
+		mutex_unlock(&iw_table_lock);
+		kfree(new_iw);
+		kfree(old_bw);
+		return 0;
+	}
+
+	rcu_assign_pointer(iw_table, new_iw);
+	mutex_unlock(&iw_table_lock);
+	synchronize_rcu();
+	kfree(old_iw);
+	kfree(old_bw);
+	return 0;
+}
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -1998,10 +2100,7 @@  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	table = rcu_dereference(iw_table);
 	/* calculate the total weight */
 	for_each_node_mask(nid, nodemask) {
-		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
-		weight_total += weight;
+		weight_total += table ? table[nid] : 1;
 	}
 
 	/* Calculate the node offset based on totals */
@@ -2010,7 +2109,6 @@  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	while (target) {
 		/* detect system default usage */
 		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
 		if (target < weight)
 			break;
 		target -= weight;
@@ -2401,7 +2499,7 @@  static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
-	u8 *table, *weights, weight;
+	u8 *weights, weight;
 	unsigned int weight_total = 0;
 	unsigned long rem_pages = nr_pages;
 	nodemask_t nodes;
@@ -2450,16 +2548,8 @@  static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	if (!weights)
 		return total_allocated;
 
-	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	if (table)
-		memcpy(weights, table, nr_node_ids);
-	rcu_read_unlock();
-
-	/* calculate total, detect system default usage */
 	for_each_node_mask(node, nodes) {
-		if (!weights[node])
-			weights[node] = 1;
+		weights[node] = get_il_weight(node);
 		weight_total += weights[node];
 	}
 
@@ -3394,7 +3484,7 @@  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
 	if (count == 0 || sysfs_streq(buf, ""))
 		weight = 0;
-	else if (kstrtou8(buf, 0, &weight))
+	else if (kstrtou8(buf, 0, &weight) || weight == 0)
 		return -EINVAL;
 
 	new = kzalloc(nr_node_ids, GFP_KERNEL);
@@ -3411,11 +3501,68 @@  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 	mutex_unlock(&iw_table_lock);
 	synchronize_rcu();
 	kfree(old);
+	weighted_interleave_auto = false;
 	return count;
 }
 
 static struct iw_node_attr **node_attrs;
 
+static ssize_t weighted_interleave_mode_show(struct kobject *kobj,
+		struct kobj_attribute *attr, char *buf)
+{
+	if (weighted_interleave_auto)
+		return sysfs_emit(buf, "auto\n");
+	else
+		return sysfs_emit(buf, "manual\n");
+}
+
+static ssize_t weighted_interleave_mode_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	uint64_t *bw;
+	u8 *old_iw, *new_iw;
+
+	if (count == 0)
+		return -EINVAL;
+
+	if (sysfs_streq(buf, "manual")) {
+		weighted_interleave_auto = false;
+		return count;
+	} else if (!sysfs_streq(buf, "auto")) {
+		return -EINVAL;
+	}
+
+	new_iw = kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL);
+	if (!new_iw)
+		return -ENOMEM;
+
+	mutex_lock(&iw_table_lock);
+	bw = node_bw_table;
+
+	if (!bw) {
+		mutex_unlock(&iw_table_lock);
+		kfree(new_iw);
+		return -ENODEV;
+	}
+
+	old_iw = rcu_dereference_protected(iw_table,
+					   lockdep_is_held(&iw_table_lock));
+
+	reduce_interleave_weights(bw, new_iw);
+	rcu_assign_pointer(iw_table, new_iw);
+	mutex_unlock(&iw_table_lock);
+
+	synchronize_rcu();
+	kfree(old_iw);
+
+	weighted_interleave_auto = true;
+	return count;
+}
+
+static struct kobj_attribute wi_attr =
+	__ATTR(mode, 0664, weighted_interleave_mode_show,
+			   weighted_interleave_mode_store);
+
 static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
 				  struct kobject *parent)
 {
@@ -3432,6 +3579,7 @@  static void sysfs_wi_release(struct kobject *wi_kobj)
 
 	for (i = 0; i < nr_node_ids; i++)
 		sysfs_wi_node_release(node_attrs[i], wi_kobj);
+
 	kobject_put(wi_kobj);
 }
 
@@ -3473,6 +3621,15 @@  static int add_weight_node(int nid, struct kobject *wi_kobj)
 	return 0;
 }
 
+static struct attribute *wi_default_attrs[] = {
+	&wi_attr.attr,
+	NULL
+};
+
+static const struct attribute_group wi_attr_group = {
+	.attrs = wi_default_attrs,
+};
+
 static int add_weighted_interleave_group(struct kobject *root_kobj)
 {
 	struct kobject *wi_kobj;
@@ -3489,6 +3646,13 @@  static int add_weighted_interleave_group(struct kobject *root_kobj)
 		return err;
 	}
 
+	err = sysfs_create_group(wi_kobj, &wi_attr_group);
+	if (err) {
+		pr_err("failed to add sysfs [mode]\n");
+		kobject_put(wi_kobj);
+		return err;
+	}
+
 	for_each_node_state(nid, N_POSSIBLE) {
 		err = add_weight_node(nid, wi_kobj);
 		if (err) {