diff mbox series

[RFC] mm/mempolicy: Weighted interleave auto-tuning

Message ID 20241210215439.94819-1-joshua.hahnjy@gmail.com (mailing list archive)
State New
Headers show
Series [RFC] mm/mempolicy: Weighted interleave auto-tuning | expand

Commit Message

Joshua Hahn Dec. 10, 2024, 9:54 p.m. UTC
On machines with multiple memory nodes, interleaving page allocations
across nodes allows for better utilization of each node's bandwidth.
Previous work by Gregory Price [1] introduced weighted interleave, which
allowed for pages to be allocated across NUMA nodes according to
user-set ratios.

Ideally, these weights should be proportional to their bandwidth, so
that under bandwidth pressure, each node uses its maximal efficient
bandwidth and prevents latency from increasing exponentially.

At the same time, we want these weights to be as small as possible.
Having ratios that involve large co-prime numbers like 7639:1345:7 leads
to awkward and inefficient allocations, since the node with weight 7
will remain mostly unused (and despite being proportional to bandwidth,
will not aid in relieving the pressure present in the other two nodes).

This patch introduces an auto-configuration for the interleave weights
that aims to balance the two goals of setting node weights to be
proportional to their bandwidths and keeping the weight values low.
This balance is controlled by a value max_node_weight, which defines the
maximum weight a single node can take.

Large max_node_weights generally lead to increased weight-bandwidth
proportionality, but can lead to underutilized nodes (think worst-case
scenario, which is 1:max_node_weight). Lower max_node_weights reduce the
effects of underutilized nodes, but may lead to improperly loaded
distributions.

This knob is exposed as a sysfs interface with a default value of 32.
Weights are re-calculated once at boottime and then every time the knob
is changed by the user, or when the ACPI table is updated.

[1] https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/

Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
Co-Developed-by: Gregory Price <gourry@gourry.net>
---
 ...fs-kernel-mm-mempolicy-weighted-interleave |  24 +++
 drivers/acpi/numa/hmat.c                      |   1 +
 drivers/base/node.c                           |   7 +
 include/linux/mempolicy.h                     |   4 +
 mm/mempolicy.c                                | 195 ++++++++++++++++--
 5 files changed, 211 insertions(+), 20 deletions(-)

Comments

Hyeonggon Yoo Dec. 13, 2024, 6:19 a.m. UTC | #1
On 2024-12-11 06:54 AM, Joshua Hahn wrote:
> On machines with multiple memory nodes, interleaving page allocations
> across nodes allows for better utilization of each node's bandwidth.
> Previous work by Gregory Price [1] introduced weighted interleave, which
> allowed for pages to be allocated across NUMA nodes according to
> user-set ratios.
> 
> Ideally, these weights should be proportional to their bandwidth, so
> that under bandwidth pressure, each node uses its maximal efficient
> bandwidth and prevents latency from increasing exponentially.
> 
> At the same time, we want these weights to be as small as possible.
> Having ratios that involve large co-prime numbers like 7639:1345:7 leads
> to awkward and inefficient allocations, since the node with weight 7
> will remain mostly unused (and despite being proportional to bandwidth,
> will not aid in relieving the pressure present in the other two nodes).
> 
> This patch introduces an auto-configuration for the interleave weights
> that aims to balance the two goals of setting node weights to be
> proportional to their bandwidths and keeping the weight values low.
> This balance is controlled by a value max_node_weight, which defines the
> maximum weight a single node can take.

Hi Joshua,

I am wondering how this is going to work for host memory + CXL memory 
interleaving. I guess by "the ACPI table" you mean the ACPI HMAT or CXL 
CDAT, both of which does not provide the bandwidth of host memory.
If this feature does not consider the bandwidth of host memory, manual 
configuration will be inevitable anyway.

> Large max_node_weights generally lead to increased weight-bandwidth
> proportionality, but can lead to underutilized nodes (think worst-case
> scenario, which is 1:max_node_weight). Lower max_node_weights reduce the
> effects of underutilized nodes, but may lead to improperly loaded
> distributions.
> 
> This knob is exposed as a sysfs interface with a default value of 32.
> Weights are re-calculated once at boottime and then every time the knob
> is changed by the user, or when the ACPI table is updated.
>> [1] 
https://lore.kernel.org/linux-mm/20240202170238.90004-1-gregory.price@memverge.com/
> 
> Signed-off-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> Co-Developed-by: Gregory Price <gourry@gourry.net>
> ---
>   ...fs-kernel-mm-mempolicy-weighted-interleave |  24 +++
>   drivers/acpi/numa/hmat.c                      |   1 +
>   drivers/base/node.c                           |   7 +
>   include/linux/mempolicy.h                     |   4 +
>   mm/mempolicy.c                                | 195 ++++++++++++++++--
>   5 files changed, 211 insertions(+), 20 deletions(-)
> 
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> index 0b7972de04e9..2ef9a87ce878 100644
> --- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -23,3 +23,27 @@ Description:	Weight configuration interface for nodeN
>   		Writing an empty string or `0` will reset the weight to the
>   		system default. The system default may be set by the kernel
>   		or drivers at boot or during hotplug events.
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/max_node_weight
> +Date:		December 2024
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Weight limiting / scaling interface
> +
> +		The maximum interleave weight for a memory node. When it is
> +		updated, any previous changes to interleave weights (i.e. via
> +		the nodeN sysfs interfaces) are ignored, and new weights are
> +		calculated using ACPI-reported bandwidths and scaled.
> +

At first this paragraph sounded like "previously stored weights are 
discarded after setting max_node_weight", but I think you mean
"User can override the default values, but defaults values are 
calculated regardless of the values set by the user". Right?

> +		It is possible for weights to be greater than max_node_weight if
> +		the nodeN interfaces are directly modified to be greater.
> +
> +		Minimum weight: 1
> +		Default value: 32
> +		Maximum weight: 255
> +
> +		Writing an empty string will set the value to be the default
> +		(32). Writing a value outside the valid range  will return
> +		EINVAL and will not re-trigger a weight scaling.
> +
> +		Setting max_node_weight to 1 is equivalent to unweighted
> +		interleave.
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index ee32a10e992c..f789280acdcb 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -109,6 +109,7 @@
>   #include <linux/mmu_notifier.h>
>   #include <linux/printk.h>
>   #include <linux/swapops.h>
> +#include <linux/gcd.h>
>   
>   #include <asm/tlbflush.h>
>   #include <asm/tlb.h>
> @@ -153,24 +154,116 @@ static unsigned int mempolicy_behavior;

[...snip...]

> +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> +{
> +	unsigned long *old_bw, *new_bw;
> +	unsigned long bw_val;
> +	u8 *old_iw, *new_iw;
> +
> +	/*
> +	 * Bandwidths above this limit causes rounding errors when reducing
> +	 * weights. This value is ~16 exabytes, which is unreasonable anyways.
> +	 */
> +	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> +	if (bw_val > (U64_MAX / 10))
> +		return -EINVAL;
> +
> +	new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
> +	if (!new_bw)
> +		return -ENOMEM;
> +
> +	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);

I think kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL); will be more readable.

> +	if (!new_iw) {
> +		kfree(new_bw);
> +		return -ENOMEM;
> +	}
> +
> +	mutex_lock(&default_iwt_lock);
> +	old_bw = node_bw_table;
> +	old_iw = rcu_dereference_protected(default_iw_table,
> +					   lockdep_is_held(&default_iwt_lock));
> +
> +	if (old_bw)
> +		memcpy(new_bw, old_bw, nr_node_ids*sizeof(unsigned long));
> +	new_bw[node] = bw_val;
> +	node_bw_table = new_bw;
> +
> +	reduce_interleave_weights(new_bw, new_iw);
> +	rcu_assign_pointer(default_iw_table, new_iw);
> +
> +	mutex_unlock(&default_iwt_lock);
> +	synchronize_rcu();
> +	kfree(old_bw);
> +	kfree(old_iw);
> +	return 0;
> +}
> +
>   /**
>    * numa_nearest_node - Find nearest node by state
>    * @node: Node id to start the search
> @@ -2001,7 +2094,7 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>   {
>   	nodemask_t nodemask;
>   	unsigned int target, nr_nodes;
> -	u8 *table;
> +	u8 *table, *defaults;
>   	unsigned int weight_total = 0;
>   	u8 weight;
>   	int nid;
> @@ -2012,11 +2105,12 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>   
>   	rcu_read_lock();
>   	table = rcu_dereference(iw_table);
> +	defaults = rcu_dereference(iw_table);

Probably you intended rcu_dereference(default_iw_table)?

>   	/* calculate the total weight */
>   	for_each_node_mask(nid, nodemask) {
>   		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> +		weight = table ? table[nid] : 0;
> +		weight = weight ? weight : (defaults ? defaults[nid] : 1);
>   		weight_total += weight;
>   	}
>   
> @@ -2025,8 +2119,8 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
>   	nid = first_node(nodemask);
>   	while (target) {
>   		/* detect system default usage */
> -		weight = table ? table[nid] : 1;
> -		weight = weight ? weight : 1;
> +		weight = table ? table[nid] : 0;
> +		weight = weight ? weight : (defaults ? defaults[nid] : 1);
>   		if (target < weight)
>   			break;
>   		target -= weight;
> @@ -2409,7 +2503,7 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>   	unsigned long nr_allocated = 0;
>   	unsigned long rounds;
>   	unsigned long node_pages, delta;
> -	u8 *table, *weights, weight;
> +	u8 *weights, weight;
>   	unsigned int weight_total = 0;
>   	unsigned long rem_pages = nr_pages;
>   	nodemask_t nodes;
> @@ -2458,16 +2552,8 @@ static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
>   	if (!weights)
>   		return total_allocated;
>   
> -	rcu_read_lock();
> -	table = rcu_dereference(iw_table);
> -	if (table)
> -		memcpy(weights, table, nr_node_ids);
> -	rcu_read_unlock();
> -
> -	/* calculate total, detect system default usage */
>   	for_each_node_mask(node, nodes) {
> -		if (!weights[node])
> -			weights[node] = 1;
> +		weights[node] = get_il_weight(node);
>   		weight_total += weights[node];
>   	}
>   
> @@ -3396,6 +3482,7 @@ static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
>   }
>   
>   static struct iw_node_attr **node_attrs;
> +static struct kobj_attribute *max_nw_attr;

Where is max_nw_attr initialized?

Best,
Hyeonggon

>   static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
>   				  struct kobject *parent)
> @@ -3413,6 +3500,10 @@ static void sysfs_wi_release(struct kobject *wi_kobj)
>   
>   	for (i = 0; i < nr_node_ids; i++)
>   		sysfs_wi_node_release(node_attrs[i], wi_kobj);
> +
> +	sysfs_remove_file(wi_kobj, &max_nw_attr->attr);
> +	kfree(max_nw_attr->attr.name);
> +	kfree(max_nw_attr);
>   	kobject_put(wi_kobj);
>   }
Gregory Price Dec. 13, 2024, 4:28 p.m. UTC | #2
On Fri, Dec 13, 2024 at 03:19:20PM +0900, Hyeonggon Yoo wrote:
> On 2024-12-11 06:54 AM, Joshua Hahn wrote:
> > This patch introduces an auto-configuration for the interleave weights
> > that aims to balance the two goals of setting node weights to be
> > proportional to their bandwidths and keeping the weight values low.
> > This balance is controlled by a value max_node_weight, which defines the
> > maximum weight a single node can take.
> 
> Hi Joshua,
> 
> I am wondering how this is going to work for host memory + CXL memory
> interleaving. I guess by "the ACPI table" you mean the ACPI HMAT or CXL
> CDAT, both of which does not provide the bandwidth of host memory.

Then your BIOS vendor needs to fix their ACPI table generation, because
HMAT can absolutely contain that information.

[078h 0120   2]               Structure Type : 0001 [System Locality Latency and Bandwidth Information]
[07Ah 0122   2]                     Reserved : 0000
[07Ch 0124   4]                       Length : 00000030
[080h 0128   1]        Flags (decoded below) : 00
                            Memory Hierarchy : 0
                   Use Minimum Transfer Size : 0
                    Non-sequential Transfers : 0
[081h 0129   1]                    Data Type : 00
[082h 0130   1]        Minimum Transfer Size : 00
[083h 0131   1]                    Reserved1 : 00
[084h 0132   4] Initiator Proximity Domains # : 00000001
[088h 0136   4]   Target Proximity Domains # : 00000002
[08Ch 0140   4]                    Reserved2 : 00000000
[090h 0144   8]              Entry Base Unit : 00000000000003E8
[098h 0152   4] Initiator Proximity Domain List : 00000000
[09Ch 0156   4] Target Proximity Domain List : 00000000
[0A0h 0160   4] Target Proximity Domain List : 00000001
[0A4h 0164   2]                        Entry : 006E
[0A6h 0166   2]                        Entry : 01FE

[0A8h 0168   2]               Structure Type : 0001 [System Locality Latency and Bandwidth Information]
[0AAh 0170   2]                     Reserved : 0000
[0ACh 0172   4]                       Length : 00000030
[0B0h 0176   1]        Flags (decoded below) : 00
                            Memory Hierarchy : 0
                   Use Minimum Transfer Size : 0
                    Non-sequential Transfers : 0
[0B1h 0177   1]                    Data Type : 03
[0B2h 0178   1]        Minimum Transfer Size : 00
[0B3h 0179   1]                    Reserved1 : 00
[0B4h 0180   4] Initiator Proximity Domains # : 00000001
[0B8h 0184   4]   Target Proximity Domains # : 00000002
[0BCh 0188   4]                    Reserved2 : 00000000
[0C0h 0192   8]              Entry Base Unit : 0000000000000064
[0C8h 0200   4] Initiator Proximity Domain List : 00000000
[0CCh 0204   4] Target Proximity Domain List : 00000000
[0D0h 0208   4] Target Proximity Domain List : 00000001
[0D4h 0212   2]                        Entry : 1200
[0D6h 0214   2]                        Entry : 0064

Obviously if information is missing, then manual is the only way forward.

> > +		The maximum interleave weight for a memory node. When it is
> > +		updated, any previous changes to interleave weights (i.e. via
> > +		the nodeN sysfs interfaces) are ignored, and new weights are
> > +		calculated using ACPI-reported bandwidths and scaled.
> > +
> 
> At first this paragraph sounded like "previously stored weights are
> discarded after setting max_node_weight", but I think you mean
> "User can override the default values, but defaults values are calculated
> regardless of the values set by the user". Right?
> 

Agree that these comments need clarification, we'll workshop it.

~Gregory
Joshua Hahn Dec. 13, 2024, 7:57 p.m. UTC | #3
On Fri, 13 Dec 2024 15:19:20 +0900 Hyeonggon Yoo <hyeonggon.yoo@sk.com> wrote:

> On 2024-12-11 06:54 AM, Joshua Hahn wrote:
> > This patch introduces an auto-configuration for the interleave weights
> > that aims to balance the two goals of setting node weights to be
> > proportional to their bandwidths and keeping the weight values low.
> > This balance is controlled by a value max_node_weight, which defines the
> > maximum weight a single node can take.
> 
> Hi Joshua,
> 
> I am wondering how this is going to work for host memory + CXL memory 
> interleaving. I guess by "the ACPI table" you mean the ACPI HMAT or CXL 
> CDAT, both of which does not provide the bandwidth of host memory.
> If this feature does not consider the bandwidth of host memory, manual 
> configuration will be inevitable anyway.

Hi Hyeonggon,

Thank you for reviewing my patch! As Gregory showed in his reply,
I think it would be possible to get host bandwidth information
using the ACPI HMAT.

[-----8<-----]

> > +What:		/sys/kernel/mm/mempolicy/weighted_interleave/max_node_weight
> > +Date:		December 2024
> > +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> > +Description:	Weight limiting / scaling interface
> > +
> > +		The maximum interleave weight for a memory node. When it is
> > +		updated, any previous changes to interleave weights (i.e. via
> > +		the nodeN sysfs interfaces) are ignored, and new weights are
> > +		calculated using ACPI-reported bandwidths and scaled.
> > +
> 
> At first this paragraph sounded like "previously stored weights are 
> discarded after setting max_node_weight", but I think you mean
> "User can override the default values, but defaults values are 
> calculated regardless of the values set by the user". Right?

In the implementation, the first way you interpreted is the correct
description. That is, if a user manually changes a ndoe weight,
then updates the max_node_weight, the previous manual change will
be overwritten by the newly scaled values.

Does this behavior make sense? Perhaps it makes sense to ignore
user-changed values when doing the re-scaling if a user decides to
change the max_node_weight value themselves. 

Regardless of what implementation makes sense, I can re-write the
description so that there is no ambiguity when it comes to the
expected behavior of the code. Thank you for pointing this out!

> [...snip...]
> 
> > +int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
> > +{
> > +	unsigned long *old_bw, *new_bw;
> > +	unsigned long bw_val;
> > +	u8 *old_iw, *new_iw;
> > +
> > +	/*
> > +	 * Bandwidths above this limit causes rounding errors when reducing
> > +	 * weights. This value is ~16 exabytes, which is unreasonable anyways.
> > +	 */
> > +	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
> > +	if (bw_val > (U64_MAX / 10))
> > +		return -EINVAL;
> > +
> > +	new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
> > +	if (!new_bw)
> > +		return -ENOMEM;
> > +
> > +	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
> 
> I think kcalloc(nr_node_ids, sizeof(u8), GFP_KERNEL); will be more readable.

I see, thank you for your input. I will make this change in a v2.

> > @@ -2012,11 +2105,12 @@ static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
> >   
> >   	rcu_read_lock();
> >   	table = rcu_dereference(iw_table);
> > +	defaults = rcu_dereference(iw_table);
> 
> Probably you intended rcu_dereference(default_iw_table)?

Yes -- thank you for the catch. I will also make this change.

> >   static struct iw_node_attr **node_attrs;
> > +static struct kobj_attribute *max_nw_attr;
> 
> Where is max_nw_attr initialized?

Oh thank you for this catch! You are correct, max_nw_attr is never
initalized. Actually, there is a typo in which I never use
max_nw_attr, I accidentally rename a different sysfs interface
to act as the intended max_nw_attr. I will make this change as well
and post a v2. 
 
> Best,
> Hyeonggon

Thank you for your input, I will make the changes that you mentioned
regardnig readability & typos. I hope to hear from you regarding the
thoughts on the behavior of re-scaling all node weights when users
update max_node_weight, and whether that should overwrite manually
set node weights.

Have a great day!
Joshua
Hyeonggon Yoo Dec. 16, 2024, 7:53 a.m. UTC | #4
On 2024-12-14 4:57 AM, Joshua Hahn wrote:
> On Fri, 13 Dec 2024 15:19:20 +0900 Hyeonggon Yoo <hyeonggon.yoo@sk.com> wrote:
> 
>> On 2024-12-11 06:54 AM, Joshua Hahn wrote:
>>> This patch introduces an auto-configuration for the interleave weights
>>> that aims to balance the two goals of setting node weights to be
>>> proportional to their bandwidths and keeping the weight values low.
>>> This balance is controlled by a value max_node_weight, which defines the
>>> maximum weight a single node can take.
>>
>> Hi Joshua,
>>
>> I am wondering how this is going to work for host memory + CXL memory
>> interleaving. I guess by "the ACPI table" you mean the ACPI HMAT or CXL
>> CDAT, both of which does not provide the bandwidth of host memory.
>> If this feature does not consider the bandwidth of host memory, manual
>> configuration will be inevitable anyway.
> 
> Hi Hyeonggon,
> 
> Thank you for reviewing my patch!

No problem :)

> As Gregory showed in his reply, I think it would be possible to get host bandwidth information
> using the ACPI HMAT.

You're right. I was wrong. I checked on our server, and its bandwidth 
information was valid for both local memory and CXL memory. Sorry for
the noise.

> [-----8<-----]
> 
>>> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/max_node_weight
>>> +Date:		December 2024
>>> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
>>> +Description:	Weight limiting / scaling interface
>>> +
>>> +		The maximum interleave weight for a memory node. When it is
>>> +		updated, any previous changes to interleave weights (i.e. via
>>> +		the nodeN sysfs interfaces) are ignored, and new weights are
>>> +		calculated using ACPI-reported bandwidths and scaled.
>>> +
>>
>> At first this paragraph sounded like "previously stored weights are
>> discarded after setting max_node_weight", but I think you mean
>> "User can override the default values, but defaults values are
>> calculated regardless of the values set by the user". Right?
> 
> In the implementation, the first way you interpreted is the correct
> description. That is, if a user manually changes a ndoe weight,
> then updates the max_node_weight, the previous manual change will
> be overwritten by the newly scaled values.
 >
> Does this behavior make sense?

Ok. then current implementation overwrites the node weights
previously set by the user.

I think it makes sense to re-scale all nodes and overwrite manually set 
node weights, because it's what the knob is meant to do, and the user 
still can override the weight by setting node weight after updating
max_node_weight.

By the way, could you please explain which part of this patch implements
this rule? IIUC it does not invalidate iw_table after updating these
default values, which makes get_il_weight() to use manully set node 
weights even after updating max_node_weight. (Or probably I just
misunderstood the code?)

> Have a great day!

Have a great day too :)

--
Hyeonggon
Joshua Hahn Dec. 16, 2024, 3:46 p.m. UTC | #5
On Mon, 16 Dec 2024 16:53:47 +0900 Hyeonggon Yoo <hyeonggon.yoo@sk.com> wrote:
> 
> On 2024-12-14 4:57 AM, Joshua Hahn wrote:
> > On Fri, 13 Dec 2024 15:19:20 +0900 Hyeonggon Yoo <hyeonggon.yoo@sk.com> wrote:
> > 
> >> On 2024-12-11 06:54 AM, Joshua Hahn wrote:

[-----8<-----]

> > Hi Hyeonggon,
> > 
> > Thank you for reviewing my patch!
> 
> No problem :)
> 
> > As Gregory showed in his reply, I think it would be possible to get host bandwidth information
> > using the ACPI HMAT.
> 
> You're right. I was wrong. I checked on our server, and its bandwidth 
> information was valid for both local memory and CXL memory. Sorry for
> the noise.

No worries, thank you for verifying from your end as well!

> > [-----8<-----]
> > 
> >>> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/max_node_weight
> >>> +Date:		December 2024
> >>> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> >>> +Description:	Weight limiting / scaling interface
> >>> +
> >>> +		The maximum interleave weight for a memory node. When it is
> >>> +		updated, any previous changes to interleave weights (i.e. via
> >>> +		the nodeN sysfs interfaces) are ignored, and new weights are
> >>> +		calculated using ACPI-reported bandwidths and scaled.
> >>> +
> >>
> >> At first this paragraph sounded like "previously stored weights are
> >> discarded after setting max_node_weight", but I think you mean
> >> "User can override the default values, but defaults values are
> >> calculated regardless of the values set by the user". Right?
> > 
> > In the implementation, the first way you interpreted is the correct
> > description. That is, if a user manually changes a ndoe weight,
> > then updates the max_node_weight, the previous manual change will
> > be overwritten by the newly scaled values.
>  >
> > Does this behavior make sense?
> 
> Ok. then current implementation overwrites the node weights
> previously set by the user.
> 
> I think it makes sense to re-scale all nodes and overwrite manually set 
> node weights, because it's what the knob is meant to do, and the user 
> still can override the weight by setting node weight after updating
> max_node_weight.

Thank you for your feedback. There is a slight concern, however, where there
is a semantic mismatch between the name "max_node_weight" and its value.
Like the description suggests, it is possible for individual nodes to have
weights greater than the "max node weight".

However, the alternative would be to re-scale all weights whenever an
individual node's weight is manually overwritten to be greater than
the max, which I think makes even less sense since users probably don't
expect changing one weight to influence other nodes as well.
 
> By the way, could you please explain which part of this patch implements
> this rule? IIUC it does not invalidate iw_table after updating these
> default values, which makes get_il_weight() to use manully set node 
> weights even after updating max_node_weight. (Or probably I just
> misunderstood the code?)

Ah, I am sorry for this mistake. It seems like I didn't update the
actual iw table, updating the default instead. Thank you for the catch,
this will be updated in the v2.

Actually, there are a few more changes that I would like to make in the
v2, the biggest of which is to lay down the intended behavior more
explicitly in the documentation so that there is no ambiguity
from the user perspective, and make sure that the code actually
does reflect the intention as well. 

> > Have a great day!
> 
> Have a great day too :)
> 
> --
> Hyeonggon

Thank you again for your feedback, happy holidays!

Joshua
Huang, Ying Dec. 21, 2024, 5:57 a.m. UTC | #6
Hi, Joshua,

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Fri, 13 Dec 2024 15:19:20 +0900 Hyeonggon Yoo <hyeonggon.yoo@sk.com> wrote:
>
>> On 2024-12-11 06:54 AM, Joshua Hahn wrote:

[snip]

>
> [-----8<-----]
>
>> > +What:		/sys/kernel/mm/mempolicy/weighted_interleave/max_node_weight

I don't think that we need a new knob for this.  Just use a reasonable
default value, for example, 32 or 16.  If it turns out that a knob will
be really helpful, we can add it at that time.  For now, I don't think
the requirements are clear.  And, this is a new ABI and we need to
maintain it almost for ever.  We must be careful about new knob.

>> > +Date:		December 2024
>> > +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
>> > +Description:	Weight limiting / scaling interface
>> > +
>> > +		The maximum interleave weight for a memory node. When it is
>> > +		updated, any previous changes to interleave weights (i.e. via
>> > +		the nodeN sysfs interfaces) are ignored, and new weights are
>> > +		calculated using ACPI-reported bandwidths and scaled.
>> > +
>> 
>> At first this paragraph sounded like "previously stored weights are 
>> discarded after setting max_node_weight", but I think you mean
>> "User can override the default values, but defaults values are 
>> calculated regardless of the values set by the user". Right?
>
> In the implementation, the first way you interpreted is the correct
> description. That is, if a user manually changes a ndoe weight,
> then updates the max_node_weight, the previous manual change will
> be overwritten by the newly scaled values.
>
> Does this behavior make sense? Perhaps it makes sense to ignore
> user-changed values when doing the re-scaling if a user decides to
> change the max_node_weight value themselves. 
>
> Regardless of what implementation makes sense, I can re-write the
> description so that there is no ambiguity when it comes to the
> expected behavior of the code. Thank you for pointing this out!

I don't think that it's a good idea to override the user supplied
configuration values.  User configurations always have higher priority
than system default configurations.  IIUC, this is the general rule of
Linux kernel user space interface.

---
Best Regards,
Huang, Ying
Gregory Price Dec. 21, 2024, 2:58 p.m. UTC | #7
On Sat, Dec 21, 2024 at 01:57:58PM +0800, Huang, Ying wrote:
> Hi, Joshua,
> 
> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> 
> > On Fri, 13 Dec 2024 15:19:20 +0900 Hyeonggon Yoo <hyeonggon.yoo@sk.com> wrote:
> >
> >> On 2024-12-11 06:54 AM, Joshua Hahn wrote:
> 
> [snip]
> 
> >
> > [-----8<-----]
> >
> >> > +What:		/sys/kernel/mm/mempolicy/weighted_interleave/max_node_weight
> 
> I don't think that we need a new knob for this.  Just use a reasonable
> default value, for example, 32 or 16.  If it turns out that a knob will
> be really helpful, we can add it at that time.  For now, I don't think
> the requirements are clear.  And, this is a new ABI and we need to
> maintain it almost for ever.  We must be careful about new knob.
> 

This is fair.  We spent a good amount of time modeling the best
effective maximum weight and basically came to the conclusion that 32
has a good balance of minimizing error and being somewhat aggressive.

Ripping out the sysfs is easy enough.

> >
> > Regardless of what implementation makes sense, I can re-write the
> > description so that there is no ambiguity when it comes to the
> > expected behavior of the code. Thank you for pointing this out!
> 
> I don't think that it's a good idea to override the user supplied
> configuration values.  User configurations always have higher priority
> than system default configurations.  IIUC, this is the general rule of
> Linux kernel user space interface.
> 

We discussed this and decided it was confusing no matter what we did.

If new data comes in (CDAT data from a hotplug event), then the weights
are now wrong for the new global state - regardless of whether the user
set a weight manually or not.  This also allowed us to simplify the
implementation a bit.

But if generally we need to preserve user settings, then I think the
best we can do to provide a sane system is ignore the user setting when
re-weighting on a hotplug event.

e.g. user has not set a value

default_values [5,2,-] <- 1 node not set, expected to be hotplugged
user_values    [-,-,-] <- user has not set values
effective      [5,2,-]

hotplug event
default_values [2,1,1] - reweight has occurred
user_values    [-,-,-]
effective      [2,1,1]

e.g. user has set a value

default_values [5,2,-] <- 1 node not set, expected to be hotplugged
user_values    [4,-,-] <- user has only set one value
effective      [4,2,-]

hotplug event
default_values [2,1,1] - reweight has occurred
user_values    [4,-,-]
effective      [4,1,1]


So default values get updated, but user values get left alone.

If that's sane we'll fix it up.

> ---
> Best Regards,
> Huang, Ying
Huang, Ying Dec. 22, 2024, 8:29 a.m. UTC | #8
Gregory Price <gourry@gourry.net> writes:

> On Sat, Dec 21, 2024 at 01:57:58PM +0800, Huang, Ying wrote:
>> Hi, Joshua,
>> 
>> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
>> 
>> > On Fri, 13 Dec 2024 15:19:20 +0900 Hyeonggon Yoo <hyeonggon.yoo@sk.com> wrote:
>> >
>> >> On 2024-12-11 06:54 AM, Joshua Hahn wrote:
>>

[snip]

>
>> >
>> > Regardless of what implementation makes sense, I can re-write the
>> > description so that there is no ambiguity when it comes to the
>> > expected behavior of the code. Thank you for pointing this out!
>> 
>> I don't think that it's a good idea to override the user supplied
>> configuration values.  User configurations always have higher priority
>> than system default configurations.  IIUC, this is the general rule of
>> Linux kernel user space interface.
>> 
>
> We discussed this and decided it was confusing no matter what we did.
>
> If new data comes in (CDAT data from a hotplug event), then the weights
> are now wrong for the new global state - regardless of whether the user
> set a weight manually or not.  This also allowed us to simplify the
> implementation a bit.
>
> But if generally we need to preserve user settings, then I think the
> best we can do to provide a sane system is ignore the user setting when
> re-weighting on a hotplug event.
>
> e.g. user has not set a value
>
> default_values [5,2,-] <- 1 node not set, expected to be hotplugged
> user_values    [-,-,-] <- user has not set values
> effective      [5,2,-]
>
> hotplug event
> default_values [2,1,1] - reweight has occurred
> user_values    [-,-,-]
> effective      [2,1,1]
>
> e.g. user has set a value
>
> default_values [5,2,-] <- 1 node not set, expected to be hotplugged
> user_values    [4,-,-] <- user has only set one value
> effective      [4,2,-]
>
> hotplug event
> default_values [2,1,1] - reweight has occurred
> user_values    [4,-,-]
> effective      [4,1,1]

Another choice is that if the user set a value, he/she set all values
effectively.  Even if he/she doesn't set the other values, he/she thinks
that the other values are good, and more importantly, the ratio is good.
If so,

default_values [5,2,-] <- 1 node not set, expected to be hotplugged
user_values    [4,2,0] <- user has only set one value, not populated nodes have value 0
effective      [4,2,0]

hotplug event
default_values [2,1,1] - reweight has occurred
user_values    [4,2,0]
effective      [4,2,0]

In this way, 0 becomes a valid value too.

What do you think about this?

> So default values get updated, but user values get left alone.
>
> If that's sane we'll fix it up.

---
Best Regards,
Huang, Ying
Gregory Price Dec. 22, 2024, 4:54 p.m. UTC | #9
On Sun, Dec 22, 2024 at 04:29:30PM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> 
> > On Sat, Dec 21, 2024 at 01:57:58PM +0800, Huang, Ying wrote:
> 
> Another choice is that if the user set a value, he/she set all values
> effectively.  Even if he/she doesn't set the other values, he/she thinks
> that the other values are good, and more importantly, the ratio is good.

This is probably the actual way to go.  

> If so,
> 
> default_values [5,2,-] <- 1 node not set, expected to be hotplugged
> user_values    [4,2,0] <- user has only set one value, not populated nodes have value 0
> effective      [4,2,0]
>
> hotplug event
> default_values [2,1,1] - reweight has occurred
> user_values    [4,2,0]
> effective      [4,2,0]
> 
> In this way, 0 becomes a valid value too.
> 
> What do you think about this?
> 

We decided when implementing weights that 0 was a special value that
reverts to the system default:

  Writing an empty string or `0` will reset the weight to the
  system default. The system default may be set by the kernel
  or drivers at boot or during hotplug events.

I'm ok pulling the default weights in collectively once the first one is
written, but 0 is an invalid value which causes issues.

We went through that when we initially implemented the feature w/ task-local
weights and why the help function overrides it to 1 if it's ever seen.

We'll revert back to our initial implementation w/ default_iw_table and
iw_table - where iw_table contains user-defined weights.  Writing a 0 to
iw_table[N] will allow get_il_weight() to retrieve default_iw_table[N]
as the docs imply it should.

~Gregory
Huang, Ying Dec. 25, 2024, 12:25 a.m. UTC | #10
Gregory Price <gourry@gourry.net> writes:

> On Sun, Dec 22, 2024 at 04:29:30PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry@gourry.net> writes:
>> 
>> > On Sat, Dec 21, 2024 at 01:57:58PM +0800, Huang, Ying wrote:
>> 
>> Another choice is that if the user set a value, he/she set all values
>> effectively.  Even if he/she doesn't set the other values, he/she thinks
>> that the other values are good, and more importantly, the ratio is good.
>
> This is probably the actual way to go.  
>
>> If so,
>> 
>> default_values [5,2,-] <- 1 node not set, expected to be hotplugged
>> user_values    [4,2,0] <- user has only set one value, not populated nodes have value 0
>> effective      [4,2,0]
>>
>> hotplug event
>> default_values [2,1,1] - reweight has occurred
>> user_values    [4,2,0]
>> effective      [4,2,0]
>> 
>> In this way, 0 becomes a valid value too.
>> 
>> What do you think about this?
>> 
>
> We decided when implementing weights that 0 was a special value that
> reverts to the system default:
>
>   Writing an empty string or `0` will reset the weight to the
>   system default. The system default may be set by the kernel
>   or drivers at boot or during hotplug events.
>
> I'm ok pulling the default weights in collectively once the first one is
> written, but 0 is an invalid value which causes issues.
>
> We went through that when we initially implemented the feature w/ task-local
> weights and why the help function overrides it to 1 if it's ever seen.
>
> We'll revert back to our initial implementation w/ default_iw_table and
> iw_table - where iw_table contains user-defined weights.  Writing a 0 to
> iw_table[N] will allow get_il_weight() to retrieve default_iw_table[N]
> as the docs imply it should.

So, the suggested behavior becomes the following?

default_values [5,2,-] <- 1 node not set, expected to be hotplugged
user_values    [4,2,1] <- user has only set one value, not populated nodes have value 1
effective      [4,2,1]

hotplug event
default_values [2,1,1] - reweight has occurred
user_values    [4,2,1]
effective      [4,2,1]

Even if so, we still have another issue.  The effective values may be a
combination of default_values and user_values and it's hard for users to
identify which one is from default_values and subject to change.  For
example,

user reset weight of node 0 to default: echo 0 > node0
default_values [2,1,1]
user_values    [0,2,1]
effective      [2,2,1]

change the default again
default_values [3,1,1] - reweight again
user_values    [0,2,1]
effective      [3,2,1]

This is still quite confusing.  Another possible solution is to copy the
default value instead,

user reset weight of node 0 to default: echo 0 > node0
default_values [2,1,1]
user_values    [2,2,1] - copy default value when echo 0
effective      [2,2,1]

change the default again
default_values [3,1,1] - reweight again
user_values    [2,2,1]
effective      [2,2,1]

The remaining issue is that we cannot revert to default atomically.
That is, user_values may becomea  combination of old and new
default_values if users echo 0 to each node one by one when kernel is
changing default_values.  To resolve this, we may add another interface
to do that, for example, "use_default".

echo 1 > use_default

will use default_values for all nodes.  We can check whether we are
using default via

cat use_default

Anyway, I think that we need a thorough thought about the user space
interface.  And add good document, at least in change log.  It's really
hard to make user space interface right.

I'm open to better user space interface design.

---
Best Regards,
Huang, Ying
Joshua Hahn Dec. 25, 2024, 9:30 a.m. UTC | #11
Hi Gregory and Huang,

Sorry for the silence on my end for the past few days. I decided to take
some time off of the computer, but I should be more reponsive now!

On Wed, 25 Dec 2024 08:25:13 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:

> Gregory Price <gourry@gourry.net> writes:
> 
> > On Sun, Dec 22, 2024 at 04:29:30PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry@gourry.net> writes:
> >> 
> >> > On Sat, Dec 21, 2024 at 01:57:58PM +0800, Huang, Ying wrote:

[.....8<.....]

> > We decided when implementing weights that 0 was a special value that
> > reverts to the system default:
> >
> >   Writing an empty string or `0` will reset the weight to the
> >   system default. The system default may be set by the kernel
> >   or drivers at boot or during hotplug events.
> >
> > I'm ok pulling the default weights in collectively once the first one is
> > written, but 0 is an invalid value which causes issues.
> >
> > We went through that when we initially implemented the feature w/ task-local
> > weights and why the help function overrides it to 1 if it's ever seen.
> >
> > We'll revert back to our initial implementation w/ default_iw_table and
> > iw_table - where iw_table contains user-defined weights.  Writing a 0 to
> > iw_table[N] will allow get_il_weight() to retrieve default_iw_table[N]
> > as the docs imply it should.
> 
> So, the suggested behavior becomes the following?
> 
> default_values [5,2,-] <- 1 node not set, expected to be hotplugged
> user_values    [4,2,1] <- user has only set one value, not populated nodes have value 1
> effective      [4,2,1]
> 
> hotplug event
> default_values [2,1,1] - reweight has occurred
> user_values    [4,2,1]
> effective      [4,2,1]

Yes, I think this was the intended effect when we were discussing what
interface made the most sense.

> Even if so, we still have another issue.  The effective values may be a
> combination of default_values and user_values and it's hard for users to
> identify which one is from default_values and subject to change.  For
> example,
> 
> user reset weight of node 0 to default: echo 0 > node0
> default_values [2,1,1]
> user_values    [0,2,1]
> effective      [2,2,1]
> 
> change the default again
> default_values [3,1,1] - reweight again
> user_values    [0,2,1]
> effective      [3,2,1]

Agreed. Actually, this confusion was partly what motivated our new
re-work of the patch in v2, which got rid of the default and user
layers, and made all internal values transparent to the user as well.
That way, there would be no confusion as to the true source of the
value, and the user could be aware that re-weighting would impact
all values, regardless of whehter they were default values or not.

If we are moving away from allowing users to dynamically change the
weightiness (max_node_weight) parameter however, then I think that there
may be more merit to using the two-level default & user values system to
allow for more flexibility.
 
> This is still quite confusing.  Another possible solution is to copy the
> default value instead,
> 
> user reset weight of node 0 to default: echo 0 > node0
> default_values [2,1,1]
> user_values    [2,2,1] - copy default value when echo 0
> effective      [2,2,1]
> 
> change the default again
> default_values [3,1,1] - reweight again
> user_values    [2,2,1]
> effective      [2,2,1]

This makes a lot sense to me, I think it lets us keep both the
transparency of the new one-layered system and all the benefits that
come with having default values that can adapt to hotplug events. One
thing we should consider is that the user should probably be able to
check what the default value is for a given node before deciding to
copy that value over to the weight table.

Having two files for each node (nodeN, defaultN) seems a bit too
cluttered for the user perspective. Making the nodeN interfaces serve
multiple purposes (i.e. echo -1 into the nodes will output the default
value for that node) also seems a bit too complicated as well, in my
opinion. Maybe having a file 'weight_tables' that contains a table of
default/user/effective weights (as have been used in these conversations)
might be useful for the user? (Or maybe just the defaults)

Then a workflow for the user may be as such:

$ cat /sys/kernel/mm/mempolicy/weighted_interleave/weight_tables
default vales: [4,7,2]
  user values: [-,-,-]
    effective: [4,7,2]

$ echo 4 > /sys/kernel/mm/mempolicy/weighted_interleave/node2
4
...

> The remaining issue is that we cannot revert to default atomically.
> That is, user_values may becomea  combination of old and new
> default_values if users echo 0 to each node one by one when kernel is
> changing default_values.  To resolve this, we may add another interface
> to do that, for example, "use_default".
> 
> echo 1 > use_default
> 
> will use default_values for all nodes.  We can check whether we are
> using default via
> 
> cat use_default

Like mentioned in the previous comments, I think that the "setting one
value to set all the others" is a good method, especially since the
more I think about it (in my limited experience), I think there is rarely
a scenario where a user wants to use a hybrid of manually-set and
default values and is switching back and forth between default and
manual values.

> Anyway, I think that we need a thorough thought about the user space
> interface.  And add good document, at least in change log.  It's really
> hard to make user space interface right.
> 
> I'm open to better user space interface design.

I agree with this, thank you for your feedback. I think there has been
a lot of great points raised in these conversations, and I will do my
best to take these comments into consideration when writing better
documentation. 

> ---
> Best Regards,
> Huang, Ying

Thank you for your input! I hope you have a great day and happy holidays!
Joshua
Huang, Ying Dec. 26, 2024, 1:35 a.m. UTC | #12
Hi, Joshua,

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> Hi Gregory and Huang,
>
> Sorry for the silence on my end for the past few days. I decided to take
> some time off of the computer, but I should be more reponsive now!
>
> On Wed, 25 Dec 2024 08:25:13 +0800 "Huang, Ying" <ying.huang@linux.alibaba.com> wrote:
>
>> Gregory Price <gourry@gourry.net> writes:
>> 
>> > On Sun, Dec 22, 2024 at 04:29:30PM +0800, Huang, Ying wrote:
>> >> Gregory Price <gourry@gourry.net> writes:
>> >> 
>> >> > On Sat, Dec 21, 2024 at 01:57:58PM +0800, Huang, Ying wrote:
>
> [.....8<.....]
>
>> > We decided when implementing weights that 0 was a special value that
>> > reverts to the system default:
>> >
>> >   Writing an empty string or `0` will reset the weight to the
>> >   system default. The system default may be set by the kernel
>> >   or drivers at boot or during hotplug events.
>> >
>> > I'm ok pulling the default weights in collectively once the first one is
>> > written, but 0 is an invalid value which causes issues.
>> >
>> > We went through that when we initially implemented the feature w/ task-local
>> > weights and why the help function overrides it to 1 if it's ever seen.
>> >
>> > We'll revert back to our initial implementation w/ default_iw_table and
>> > iw_table - where iw_table contains user-defined weights.  Writing a 0 to
>> > iw_table[N] will allow get_il_weight() to retrieve default_iw_table[N]
>> > as the docs imply it should.
>> 
>> So, the suggested behavior becomes the following?
>> 
>> default_values [5,2,-] <- 1 node not set, expected to be hotplugged
>> user_values    [4,2,1] <- user has only set one value, not populated nodes have value 1
>> effective      [4,2,1]
>> 
>> hotplug event
>> default_values [2,1,1] - reweight has occurred
>> user_values    [4,2,1]
>> effective      [4,2,1]
>
> Yes, I think this was the intended effect when we were discussing what
> interface made the most sense.
>
>> Even if so, we still have another issue.  The effective values may be a
>> combination of default_values and user_values and it's hard for users to
>> identify which one is from default_values and subject to change.  For
>> example,
>> 
>> user reset weight of node 0 to default: echo 0 > node0
>> default_values [2,1,1]
>> user_values    [0,2,1]
>> effective      [2,2,1]
>> 
>> change the default again
>> default_values [3,1,1] - reweight again
>> user_values    [0,2,1]
>> effective      [3,2,1]
>
> Agreed. Actually, this confusion was partly what motivated our new
> re-work of the patch in v2, which got rid of the default and user
> layers, and made all internal values transparent to the user as well.
> That way, there would be no confusion as to the true source of the
> value, and the user could be aware that re-weighting would impact
> all values, regardless of whehter they were default values or not.
>
> If we are moving away from allowing users to dynamically change the
> weightiness (max_node_weight) parameter however, then I think that there
> may be more merit to using the two-level default & user values system to
> allow for more flexibility.
>  
>> This is still quite confusing.  Another possible solution is to copy the
>> default value instead,
>> 
>> user reset weight of node 0 to default: echo 0 > node0
>> default_values [2,1,1]
>> user_values    [2,2,1] - copy default value when echo 0
>> effective      [2,2,1]
>> 
>> change the default again
>> default_values [3,1,1] - reweight again
>> user_values    [2,2,1]
>> effective      [2,2,1]
>
> This makes a lot sense to me, I think it lets us keep both the
> transparency of the new one-layered system and all the benefits that
> come with having default values that can adapt to hotplug events. One
> thing we should consider is that the user should probably be able to
> check what the default value is for a given node before deciding to
> copy that value over to the weight table.
>
> Having two files for each node (nodeN, defaultN) seems a bit too
> cluttered for the user perspective. Making the nodeN interfaces serve
> multiple purposes (i.e. echo -1 into the nodes will output the default
> value for that node) also seems a bit too complicated as well, in my
> opinion. Maybe having a file 'weight_tables' that contains a table of
> default/user/effective weights (as have been used in these conversations)
> might be useful for the user? (Or maybe just the defaults)
>
> Then a workflow for the user may be as such:
>
> $ cat /sys/kernel/mm/mempolicy/weighted_interleave/weight_tables
> default vales: [4,7,2]
>   user values: [-,-,-]
>     effective: [4,7,2]

AFAIK, this breaks the sysfs attribute format rule as follows.

https://docs.kernel.org/filesystems/sysfs.html#attributes

It's hard to use array sysfs attribute here too.  Because the node ID
may be non-consecutive.  This makes it hard to read.

> $ echo 4 > /sys/kernel/mm/mempolicy/weighted_interleave/node2
> 4
> ...
>
>> The remaining issue is that we cannot revert to default atomically.
>> That is, user_values may becomea  combination of old and new
>> default_values if users echo 0 to each node one by one when kernel is
>> changing default_values.  To resolve this, we may add another interface
>> to do that, for example, "use_default".
>> 
>> echo 1 > use_default
>> 
>> will use default_values for all nodes.  We can check whether we are
>> using default via
>> 
>> cat use_default
>
> Like mentioned in the previous comments, I think that the "setting one
> value to set all the others" is a good method, especially since the
> more I think about it (in my limited experience), I think there is rarely
> a scenario where a user wants to use a hybrid of manually-set and
> default values and is switching back and forth between default and
> manual values.
>
>> Anyway, I think that we need a thorough thought about the user space
>> interface.  And add good document, at least in change log.  It's really
>> hard to make user space interface right.
>> 
>> I'm open to better user space interface design.
>
> I agree with this, thank you for your feedback. I think there has been
> a lot of great points raised in these conversations, and I will do my
> best to take these comments into consideration when writing better
> documentation. 
>
> Thank you for your input! I hope you have a great day and happy holidays!

Happy holidays!

---
Best Regards,
Huang, Ying
Gregory Price Dec. 26, 2024, 6:13 p.m. UTC | #13
On Thu, Dec 26, 2024 at 09:35:32AM +0800, Huang, Ying wrote:
> > Having two files for each node (nodeN, defaultN) seems a bit too
> > cluttered for the user perspective. Making the nodeN interfaces serve
> > multiple purposes (i.e. echo -1 into the nodes will output the default
> > value for that node) also seems a bit too complicated as well, in my
> > opinion. Maybe having a file 'weight_tables' that contains a table of
> > default/user/effective weights (as have been used in these conversations)
> > might be useful for the user? (Or maybe just the defaults)
> >
> > Then a workflow for the user may be as such:
> >
> > $ cat /sys/kernel/mm/mempolicy/weighted_interleave/weight_tables
> > default vales: [4,7,2]
> >   user values: [-,-,-]
> >     effective: [4,7,2]
> 
> AFAIK, this breaks the sysfs attribute format rule as follows.
> 
> https://docs.kernel.org/filesystems/sysfs.html#attributes
> 
> It's hard to use array sysfs attribute here too.  Because the node ID
> may be non-consecutive.  This makes it hard to read.
>

Would generally agree. I think essentially a
   use_defaults => (0 | 1)
interface is probably the best we can do.

Setting any node changes use_defaults from 1 => 0
echoing 1 into use_default clears user_values

This still allows 0 to be a manual "reset specific node to default"
mechanism for a specific node, and gives us a clean override.

The only question is a matter of hotplug behavior

nodes_online: 0,1
  default_values: [5,3]
  user_values   : [-,-]

event: node1 is taken offline
  default_values: [5,3] <-- nothing happens

event: node1 comes back online with different bandwidth attribute
  default_values: [6,5] <-- reweight as occured silently

event: user sets a custom value (node1 <= 2)
  default_values: [6,5]
  user_values:    [6,2] <= note, *no reduction*

event: node1 is taken offline
  default_values: [6,5]
  user_values:    [6,2] <= value still present but not used

event: node1 comes back online with different bandwidth attribute
  default_values: [5,3] <-- default reweight has occurred silently
  user_values   : [6,2] <-- user responsible for triggering re-weight

The user has the option of

echo 1 > /sys/.../weghted_interleave/user_defaults
result
	default_values: [5,3]
	user_values   : [-,-]
or
echo 0 > /sys/.../weighted_interleave/node1
result
	default_values: [5,3]
	user_values   : [6,3] <= only node1 is updated, no re-weight

Basically, if the user ever sets any value, we never automatically pull
new values in, and the admin is responsible for triggering a re-weight
(use_default) or manually reweighting *all* nodes - because changing
values implies a change in the bandwidth distribution anyway.

I think this makes the most sense.

~Gregory
Huang, Ying Dec. 27, 2024, 1:59 a.m. UTC | #14
Gregory Price <gourry@gourry.net> writes:

> On Thu, Dec 26, 2024 at 09:35:32AM +0800, Huang, Ying wrote:
>> > Having two files for each node (nodeN, defaultN) seems a bit too
>> > cluttered for the user perspective. Making the nodeN interfaces serve
>> > multiple purposes (i.e. echo -1 into the nodes will output the default
>> > value for that node) also seems a bit too complicated as well, in my
>> > opinion. Maybe having a file 'weight_tables' that contains a table of
>> > default/user/effective weights (as have been used in these conversations)
>> > might be useful for the user? (Or maybe just the defaults)
>> >
>> > Then a workflow for the user may be as such:
>> >
>> > $ cat /sys/kernel/mm/mempolicy/weighted_interleave/weight_tables
>> > default vales: [4,7,2]
>> >   user values: [-,-,-]
>> >     effective: [4,7,2]
>> 
>> AFAIK, this breaks the sysfs attribute format rule as follows.
>> 
>> https://docs.kernel.org/filesystems/sysfs.html#attributes
>> 
>> It's hard to use array sysfs attribute here too.  Because the node ID
>> may be non-consecutive.  This makes it hard to read.
>>
>
> Would generally agree. I think essentially a
>    use_defaults => (0 | 1)
> interface is probably the best we can do.
>
> Setting any node changes use_defaults from 1 => 0
> echoing 1 into use_default clears user_values
>
> This still allows 0 to be a manual "reset specific node to default"
> mechanism for a specific node, and gives us a clean override.

The difficulty is that users don't know the default value when they
reset a node's weight.  We don't have an interface to show them.  So, I
suggest to disable the functionality: "reset specific node to default".
They can still use "echo 1 > use_defaults" to reset all nodes to
default.

> The only question is a matter of hotplug behavior
>
> nodes_online: 0,1
>   default_values: [5,3]
>   user_values   : [-,-]
>
> event: node1 is taken offline
>   default_values: [5,3] <-- nothing happens
>
> event: node1 comes back online with different bandwidth attribute
>   default_values: [6,5] <-- reweight as occured silently
>
> event: user sets a custom value (node1 <= 2)
>   default_values: [6,5]
>   user_values:    [6,2] <= note, *no reduction*
>
> event: node1 is taken offline
>   default_values: [6,5]
>   user_values:    [6,2] <= value still present but not used
>
> event: node1 comes back online with different bandwidth attribute
>   default_values: [5,3] <-- default reweight has occurred silently
>   user_values   : [6,2] <-- user responsible for triggering re-weight
>
> The user has the option of
>
> echo 1 > /sys/.../weghted_interleave/user_defaults
> result
> 	default_values: [5,3]
> 	user_values   : [-,-]
> or
> echo 0 > /sys/.../weighted_interleave/node1
> result
> 	default_values: [5,3]
> 	user_values   : [6,3] <= only node1 is updated, no re-weight
>
> Basically, if the user ever sets any value, we never automatically pull
> new values in, and the admin is responsible for triggering a re-weight
> (use_default) or manually reweighting *all* nodes - because changing
> values implies a change in the bandwidth distribution anyway.
>
> I think this makes the most sense.

---
Best Regards,
Huang, Ying
Gregory Price Dec. 27, 2024, 3:35 p.m. UTC | #15
On Fri, Dec 27, 2024 at 09:59:30AM +0800, Huang, Ying wrote:
> Gregory Price <gourry@gourry.net> writes:
> > This still allows 0 to be a manual "reset specific node to default"
> > mechanism for a specific node, and gives us a clean override.
> 
> The difficulty is that users don't know the default value when they
> reset a node's weight.  We don't have an interface to show them.  So, I
> suggest to disable the functionality: "reset specific node to default".
> They can still use "echo 1 > use_defaults" to reset all nodes to
> default.
>

Good point, and agree.  So lets just ditch 0.  Since that "feature"
wasn't even functional in the current code (it just reset it to 1 at
this point), it's probably safe to just ditch it.  Worst case scenario
if someone takes issues, we can just have it revert the weight to 1.

~Gregory
diff mbox series

Patch

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
index 0b7972de04e9..2ef9a87ce878 100644
--- a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -23,3 +23,27 @@  Description:	Weight configuration interface for nodeN
 		Writing an empty string or `0` will reset the weight to the
 		system default. The system default may be set by the kernel
 		or drivers at boot or during hotplug events.
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/max_node_weight
+Date:		December 2024
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Weight limiting / scaling interface
+
+		The maximum interleave weight for a memory node. When it is
+		updated, any previous changes to interleave weights (i.e. via
+		the nodeN sysfs interfaces) are ignored, and new weights are
+		calculated using ACPI-reported bandwidths and scaled.
+
+		It is possible for weights to be greater than max_node_weight if
+		the nodeN interfaces are directly modified to be greater.
+
+		Minimum weight: 1
+		Default value: 32
+		Maximum weight: 255
+
+		Writing an empty string will set the value to be the default
+		(32). Writing a value outside the valid range  will return
+		EINVAL and will not re-trigger a weight scaling.
+
+		Setting max_node_weight to 1 is equivalent to unweighted
+		interleave.
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index a2f9e7a4b479..83f3858a773f 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -20,6 +20,7 @@ 
 #include <linux/list_sort.h>
 #include <linux/memregion.h>
 #include <linux/memory.h>
+#include <linux/mempolicy.h>
 #include <linux/mutex.h>
 #include <linux/node.h>
 #include <linux/sysfs.h>
diff --git a/drivers/base/node.c b/drivers/base/node.c
index eb72580288e6..d45216386c03 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -7,6 +7,7 @@ 
 #include <linux/init.h>
 #include <linux/mm.h>
 #include <linux/memory.h>
+#include <linux/mempolicy.h>
 #include <linux/vmstat.h>
 #include <linux/notifier.h>
 #include <linux/node.h>
@@ -214,6 +215,12 @@  void node_set_perf_attrs(unsigned int nid, struct access_coordinate *coord,
 			break;
 		}
 	}
+
+	/* When setting CPU access coordinates, update mempolicy */
+	if (access == ACCESS_COORDINATE_CPU) {
+		if (mempolicy_set_node_perf(nid, coord))
+			pr_info("failed to set node%d mempolicy attrs\n", nid);
+	}
 }
 EXPORT_SYMBOL_GPL(node_set_perf_attrs);
 
diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 931b118336f4..d564e9e893ea 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -11,6 +11,7 @@ 
 #include <linux/slab.h>
 #include <linux/rbtree.h>
 #include <linux/spinlock.h>
+#include <linux/node.h>
 #include <linux/nodemask.h>
 #include <linux/pagemap.h>
 #include <uapi/linux/mempolicy.h>
@@ -177,6 +178,9 @@  static inline bool mpol_is_preferred_many(struct mempolicy *pol)
 
 extern bool apply_policy_zone(struct mempolicy *policy, enum zone_type zone);
 
+extern int mempolicy_set_node_perf(unsigned int node,
+				   struct access_coordinate *coords);
+
 #else
 
 struct mempolicy {};
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index ee32a10e992c..f789280acdcb 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -109,6 +109,7 @@ 
 #include <linux/mmu_notifier.h>
 #include <linux/printk.h>
 #include <linux/swapops.h>
+#include <linux/gcd.h>
 
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
@@ -153,24 +154,116 @@  static unsigned int mempolicy_behavior;
  *
  * iw_table is RCU protected
  */
+static unsigned long *node_bw_table;
+static u8 __rcu *default_iw_table;
+static DEFINE_MUTEX(default_iwt_lock);
+
 static u8 __rcu *iw_table;
 static DEFINE_MUTEX(iw_table_lock);
 
+static int max_node_weight = 32;
+
 static u8 get_il_weight(int node)
 {
-	u8 *table;
+	u8 *table, *defaults;
 	u8 weight;
 
 	rcu_read_lock();
+	defaults = rcu_dereference(default_iw_table);
 	table = rcu_dereference(iw_table);
-	/* if no iw_table, use system default */
-	weight = table ? table[node] : 1;
-	/* if value in iw_table is 0, use system default */
-	weight = weight ? weight : 1;
+	/* if no iw_table, use system default - if no default, use 1 */
+	weight = table ? table[node] : 0;
+	weight = weight ? weight : (defaults ? defaults[node] : 1);
 	rcu_read_unlock();
 	return weight;
 }
 
+/*
+ * Convert ACPI-reported bandwidths into weighted interleave weights for
+ * informed page allocation.
+ * Call with default_iwt_lock held
+ */
+static void reduce_interleave_weights(unsigned long *bw, u8 *new_iw)
+{
+	uint64_t ttl_bw = 0, ttl_iw = 0, scaling_factor = 1;
+	unsigned int iw_gcd = 1, i = 0;
+
+	/* Recalculate the bandwidth distribution given the new info */
+	for (i = 0; i < nr_node_ids; i++)
+		ttl_bw += bw[i];
+
+	/* If node is not set or has < 1% of total bw, use minimum value of 1 */
+	for (i = 0; i < nr_node_ids; i++) {
+		if (bw[i]) {
+			scaling_factor = 100 * bw[i];
+			new_iw[i] = max(scaling_factor / ttl_bw, 1);
+		} else {
+			new_iw[i] = 1;
+		}
+		ttl_iw += new_iw[i];
+	}
+
+	/*
+	 * Scale each node's share of the total bandwidth from percentages
+	 * to whole numbers in the range [1, max_node_weight]
+	 */
+	for (i = 0; i < nr_node_ids; i++) {
+		scaling_factor = max_node_weight * new_iw[i];
+		new_iw[i] = max(scaling_factor / ttl_iw, 1);
+		if (unlikely(i == 0))
+			iw_gcd = new_iw[0];
+		iw_gcd = gcd(iw_gcd, new_iw[i]);
+	}
+
+	/* 1:2 is strictly better than 16:32. Reduce by the weights' GCD. */
+	for (i = 0; i < nr_node_ids; i++)
+		new_iw[i] /= iw_gcd;
+}
+
+int mempolicy_set_node_perf(unsigned int node, struct access_coordinate *coords)
+{
+	unsigned long *old_bw, *new_bw;
+	unsigned long bw_val;
+	u8 *old_iw, *new_iw;
+
+	/*
+	 * Bandwidths above this limit causes rounding errors when reducing
+	 * weights. This value is ~16 exabytes, which is unreasonable anyways.
+	 */
+	bw_val = min(coords->read_bandwidth, coords->write_bandwidth);
+	if (bw_val > (U64_MAX / 10))
+		return -EINVAL;
+
+	new_bw = kcalloc(nr_node_ids, sizeof(unsigned long), GFP_KERNEL);
+	if (!new_bw)
+		return -ENOMEM;
+
+	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
+	if (!new_iw) {
+		kfree(new_bw);
+		return -ENOMEM;
+	}
+
+	mutex_lock(&default_iwt_lock);
+	old_bw = node_bw_table;
+	old_iw = rcu_dereference_protected(default_iw_table,
+					   lockdep_is_held(&default_iwt_lock));
+
+	if (old_bw)
+		memcpy(new_bw, old_bw, nr_node_ids*sizeof(unsigned long));
+	new_bw[node] = bw_val;
+	node_bw_table = new_bw;
+
+	reduce_interleave_weights(new_bw, new_iw);
+	rcu_assign_pointer(default_iw_table, new_iw);
+
+	mutex_unlock(&default_iwt_lock);
+	synchronize_rcu();
+	kfree(old_bw);
+	kfree(old_iw);
+	return 0;
+}
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -2001,7 +2094,7 @@  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 {
 	nodemask_t nodemask;
 	unsigned int target, nr_nodes;
-	u8 *table;
+	u8 *table, *defaults;
 	unsigned int weight_total = 0;
 	u8 weight;
 	int nid;
@@ -2012,11 +2105,12 @@  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 
 	rcu_read_lock();
 	table = rcu_dereference(iw_table);
+	defaults = rcu_dereference(iw_table);
 	/* calculate the total weight */
 	for_each_node_mask(nid, nodemask) {
 		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
+		weight = table ? table[nid] : 0;
+		weight = weight ? weight : (defaults ? defaults[nid] : 1);
 		weight_total += weight;
 	}
 
@@ -2025,8 +2119,8 @@  static unsigned int weighted_interleave_nid(struct mempolicy *pol, pgoff_t ilx)
 	nid = first_node(nodemask);
 	while (target) {
 		/* detect system default usage */
-		weight = table ? table[nid] : 1;
-		weight = weight ? weight : 1;
+		weight = table ? table[nid] : 0;
+		weight = weight ? weight : (defaults ? defaults[nid] : 1);
 		if (target < weight)
 			break;
 		target -= weight;
@@ -2409,7 +2503,7 @@  static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	unsigned long nr_allocated = 0;
 	unsigned long rounds;
 	unsigned long node_pages, delta;
-	u8 *table, *weights, weight;
+	u8 *weights, weight;
 	unsigned int weight_total = 0;
 	unsigned long rem_pages = nr_pages;
 	nodemask_t nodes;
@@ -2458,16 +2552,8 @@  static unsigned long alloc_pages_bulk_array_weighted_interleave(gfp_t gfp,
 	if (!weights)
 		return total_allocated;
 
-	rcu_read_lock();
-	table = rcu_dereference(iw_table);
-	if (table)
-		memcpy(weights, table, nr_node_ids);
-	rcu_read_unlock();
-
-	/* calculate total, detect system default usage */
 	for_each_node_mask(node, nodes) {
-		if (!weights[node])
-			weights[node] = 1;
+		weights[node] = get_il_weight(node);
 		weight_total += weights[node];
 	}
 
@@ -3396,6 +3482,7 @@  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 }
 
 static struct iw_node_attr **node_attrs;
+static struct kobj_attribute *max_nw_attr;
 
 static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
 				  struct kobject *parent)
@@ -3413,6 +3500,10 @@  static void sysfs_wi_release(struct kobject *wi_kobj)
 
 	for (i = 0; i < nr_node_ids; i++)
 		sysfs_wi_node_release(node_attrs[i], wi_kobj);
+
+	sysfs_remove_file(wi_kobj, &max_nw_attr->attr);
+	kfree(max_nw_attr->attr.name);
+	kfree(max_nw_attr);
 	kobject_put(wi_kobj);
 }
 
@@ -3454,6 +3545,63 @@  static int add_weight_node(int nid, struct kobject *wi_kobj)
 	return 0;
 }
 
+static ssize_t max_nw_show(struct kobject *kobj, struct kobj_attribute *attr,
+			char *buf)
+{
+	return sysfs_emit(buf, "%d\n", max_node_weight);
+}
+
+static ssize_t max_nw_store(struct kobject *kobj,
+		struct kobj_attribute *attr, const char *buf, size_t count)
+{
+	unsigned long *bw;
+	u8 *old_iw, *new_iw;
+	u8 max_weight;
+
+	if (count == 0 || sysfs_streq(buf, ""))
+		max_weight = 32;
+	else if (kstrtou8(buf, 0, &max_weight) || max_weight == 0)
+		return -EINVAL;
+
+	new_iw = kzalloc(nr_node_ids, GFP_KERNEL);
+	if (!new_iw)
+		return -ENOMEM;
+
+	mutex_lock(&default_iwt_lock);
+	bw = node_bw_table;
+
+	if (!bw) {
+		mutex_unlock(&default_iwt_lock);
+		kfree(new_iw);
+		return -ENODEV;
+	}
+
+	max_node_weight = max_weight;
+	old_iw = rcu_dereference_protected(default_iw_table,
+					   lockdep_is_held(&default_iwt_lock));
+
+	reduce_interleave_weights(bw, new_iw);
+	rcu_assign_pointer(default_iw_table, new_iw);
+	mutex_unlock(&default_iwt_lock);
+
+	synchronize_rcu();
+	kfree(old_iw);
+
+	return count;
+}
+
+static struct kobj_attribute wi_attr =
+	__ATTR(max_node_weight, 0664, max_nw_show, max_nw_store);
+
+static struct attribute *wi_default_attrs[] = {
+	&wi_attr.attr,
+	NULL
+};
+
+static const struct attribute_group wi_attr_group = {
+	.attrs = wi_default_attrs,
+};
+
 static int add_weighted_interleave_group(struct kobject *root_kobj)
 {
 	struct kobject *wi_kobj;
@@ -3470,6 +3618,13 @@  static int add_weighted_interleave_group(struct kobject *root_kobj)
 		return err;
 	}
 
+	err = sysfs_create_group(wi_kobj, &wi_attr_group);
+	if (err) {
+		pr_err("failed to add sysfs [max_node_weight]\n");
+		kobject_put(wi_kobj);
+		return err;
+	}
+
 	for_each_node_state(nid, N_POSSIBLE) {
 		err = add_weight_node(nid, wi_kobj);
 		if (err) {