diff mbox series

[v5,01/11] mm/mempolicy: implement the sysfs-based weighted_interleave interface

Message ID 20231223181101.1954-2-gregory.price@memverge.com (mailing list archive)
State New
Headers show
Series mempolicy2, mbind2, and weighted interleave | expand

Commit Message

Gregory Price Dec. 23, 2023, 6:10 p.m. UTC
From: Rakie Kim <rakie.kim@sk.com>

This patch provides a way to set interleave weight information under
sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN

The sysfs structure is designed as follows.

  $ tree /sys/kernel/mm/mempolicy/
  /sys/kernel/mm/mempolicy/ [1]
  └── weighted_interleave [2]
      ├── node0 [3]
      └── node1

Each file above can be explained as follows.

[1] mm/mempolicy: configuration interface for mempolicy subsystem

[2] weighted_interleave/: config interface for weighted interleave policy

[3] weighted_interleave/nodeN: weight for nodeN

If sysfs is disabled in the config, the global interleave weights
will default to "1" for all nodes.

Signed-off-by: Rakie Kim <rakie.kim@sk.com>
Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
Co-developed-by: Gregory Price <gregory.price@memverge.com>
Signed-off-by: Gregory Price <gregory.price@memverge.com>
Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
---
 .../ABI/testing/sysfs-kernel-mm-mempolicy     |   4 +
 ...fs-kernel-mm-mempolicy-weighted-interleave |  22 +++
 mm/mempolicy.c                                | 156 ++++++++++++++++++
 3 files changed, 182 insertions(+)
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
 create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave

Comments

Gregory Price Dec. 26, 2023, 6:48 a.m. UTC | #1
On Wed, Dec 27, 2023 at 02:42:15PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > +		These weights only affect new allocations, and changes at runtime
> > +		will not cause migrations on already allocated pages.
> > +
> > +		Writing an empty string resets the weight value to 1.
> 
> I still think that it's a good idea to provide some better default
> weight value with HMAT or CDAT if available.  So, better not to make "1"
> as part of ABI?
> 

That's the eventual goal, but this is just the initial mechanism.

My current thought is that the CXL driver will apply weights as the
system iterates through devices and creates numa nodes.  In the
meantime, you have to give the "possible" nodes a default value to
prevent nodes onlined after boot from showing up with 0-value.

Not allowing 0-value weights is simply easier in many respects.

> > +
> > +		Minimum weight: 1
> 
> Can weight be "0"?  Do we need a way to specify that a node don't want
> to participate weighted interleave?
> 

In this code, weight cannot be 0.  My thoguht is that removing the node
from the nodemask is the way to denote 0.

The problem with 0 is hotplug, migration, and cpusets.mems_allowed.  

Example issue:  Use set local weights to [1,0,1,0] for nodes [0-3],
and has a cpusets.mems_allowed mask of (0, 2).

Lets say the user migrates the task via cgroups from nodes (0,2) to
(1,3).

The task will instantly crash as basically OOM because weights of
[1,0,1,0] will prevent memory from being allocations.

Not allowing nodes weights of 0 is defensive.  Instead, simply removing
the node from the nodemask and/or mems_allowed is both equivalent to and
the preferred way to apply a weight of 0.

> > +		Maximum weight: 255
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 10a590ee1c89..0e77633b07a5 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -131,6 +131,8 @@ static struct mempolicy default_policy = {
> >  
> >  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
> >  
> > +static char iw_table[MAX_NUMNODES];
> > +
> 
> It's kind of obscure whether "char" is "signed" or "unsigned".  Given
> the max weight is 255 above, it's better to use "u8"?
>

bah, stupid mistake.  I will switch this to u8.

> And, we may need a way to specify whether the weight has been overridden
> by the user.
> A special value (such as 255) can be used for that.  If
> so, the maximum weight should be 254 instead of 255.  As a user space
> interface, is it better to use 100 as the maximum value?
> 

There's global weights and local weights.  These are the global weights.

Local weights are stored in task->mempolicy.wil.il_weights.

(policy->mode_flags & MPOL_F_GWEIGHT) denotes the override.
This is set if (mempolicy_args->il_weights) was provided.

This simplifies the interface.

(note: local weights are not introduced until the last patch 11/11)

> > +
> > +static void sysfs_mempolicy_release(struct kobject *mempolicy_kobj)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < MAX_NUMNODES; i++)
> > +		sysfs_wi_node_release(node_attrs[i], mempolicy_kobj);
> 
> IIUC, if this is called in error path (such as, in
> add_weighted_interleave_group()), some node_attrs[] element may be
> "NULL"?
> 

The null check is present in sysfs_wi_node_release

if (!node_attr)
	return;

Is it preferable to pull this out? Seemed more defensive to put it
inside the function.

~Gregory
Huang, Ying Dec. 27, 2023, 6:42 a.m. UTC | #2
Gregory Price <gourry.memverge@gmail.com> writes:

> From: Rakie Kim <rakie.kim@sk.com>
>
> This patch provides a way to set interleave weight information under
> sysfs at /sys/kernel/mm/mempolicy/weighted_interleave/nodeN
>
> The sysfs structure is designed as follows.
>
>   $ tree /sys/kernel/mm/mempolicy/
>   /sys/kernel/mm/mempolicy/ [1]
>   └── weighted_interleave [2]
>       ├── node0 [3]
>       └── node1
>
> Each file above can be explained as follows.
>
> [1] mm/mempolicy: configuration interface for mempolicy subsystem
>
> [2] weighted_interleave/: config interface for weighted interleave policy
>
> [3] weighted_interleave/nodeN: weight for nodeN
>
> If sysfs is disabled in the config, the global interleave weights
> will default to "1" for all nodes.
>
> Signed-off-by: Rakie Kim <rakie.kim@sk.com>
> Signed-off-by: Honggyu Kim <honggyu.kim@sk.com>
> Co-developed-by: Gregory Price <gregory.price@memverge.com>
> Signed-off-by: Gregory Price <gregory.price@memverge.com>
> Co-developed-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> Signed-off-by: Hyeongtak Ji <hyeongtak.ji@sk.com>
> ---
>  .../ABI/testing/sysfs-kernel-mm-mempolicy     |   4 +
>  ...fs-kernel-mm-mempolicy-weighted-interleave |  22 +++
>  mm/mempolicy.c                                | 156 ++++++++++++++++++
>  3 files changed, 182 insertions(+)
>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
>  create mode 100644 Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
>
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
> new file mode 100644
> index 000000000000..2dcf24f4384a
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
> @@ -0,0 +1,4 @@
> +What:		/sys/kernel/mm/mempolicy/
> +Date:		December 2023
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Interface for Mempolicy
> diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> new file mode 100644
> index 000000000000..aa27fdf08c19
> --- /dev/null
> +++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
> @@ -0,0 +1,22 @@
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/
> +Date:		December 2023
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Configuration Interface for the Weighted Interleave policy
> +
> +What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN
> +Date:		December 2023
> +Contact:	Linux memory management mailing list <linux-mm@kvack.org>
> +Description:	Weight configuration interface for nodeN
> +
> +		The interleave weight for a memory node (N). These weights are
> +		utilized by processes which have set their mempolicy to
> +		MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by
> +		omitting a task-local weight array.
> +
> +		These weights only affect new allocations, and changes at runtime
> +		will not cause migrations on already allocated pages.
> +
> +		Writing an empty string resets the weight value to 1.

I still think that it's a good idea to provide some better default
weight value with HMAT or CDAT if available.  So, better not to make "1"
as part of ABI?

> +
> +		Minimum weight: 1

Can weight be "0"?  Do we need a way to specify that a node don't want
to participate weighted interleave?

> +		Maximum weight: 255
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 10a590ee1c89..0e77633b07a5 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -131,6 +131,8 @@ static struct mempolicy default_policy = {
>  
>  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>  
> +static char iw_table[MAX_NUMNODES];
> +

It's kind of obscure whether "char" is "signed" or "unsigned".  Given
the max weight is 255 above, it's better to use "u8"?

And, we may need a way to specify whether the weight has been overridden
by the user.  A special value (such as 255) can be used for that.  If
so, the maximum weight should be 254 instead of 255.  As a user space
interface, is it better to use 100 as the maximum value?

>  /**
>   * numa_nearest_node - Find nearest node by state
>   * @node: Node id to start the search
> @@ -3067,3 +3069,157 @@ void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
>  		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
>  			       nodemask_pr_args(&nodes));
>  }
> +
> +#ifdef CONFIG_SYSFS
> +struct iw_node_attr {
> +	struct kobj_attribute kobj_attr;
> +	int nid;
> +};
> +
> +static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
> +			 char *buf)
> +{
> +	struct iw_node_attr *node_attr;
> +
> +	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> +	return sysfs_emit(buf, "%d\n", iw_table[node_attr->nid]);
> +}
> +
> +static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
> +			  const char *buf, size_t count)
> +{
> +	struct iw_node_attr *node_attr;
> +	unsigned char weight = 0;
> +
> +	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
> +	/* If no input, set default weight to 1 */
> +	if (count == 0 || sysfs_streq(buf, ""))
> +		weight = 1;
> +	else if (kstrtou8(buf, 0, &weight) || !weight)
> +		return -EINVAL;
> +
> +	iw_table[node_attr->nid] = weight;

kstrtou8(), "unsigned char weight", "char iw_table[]" isn't completely
consistent.  It's better to make them consistent?

> +	return count;
> +}
> +
> +static struct iw_node_attr *node_attrs[MAX_NUMNODES];
> +
> +static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
> +				  struct kobject *parent)
> +{
> +	if (!node_attr)
> +		return;
> +	sysfs_remove_file(parent, &node_attr->kobj_attr.attr);
> +	kfree(node_attr->kobj_attr.attr.name);
> +	kfree(node_attr);
> +}
> +
> +static void sysfs_mempolicy_release(struct kobject *mempolicy_kobj)
> +{
> +	int i;
> +
> +	for (i = 0; i < MAX_NUMNODES; i++)
> +		sysfs_wi_node_release(node_attrs[i], mempolicy_kobj);

IIUC, if this is called in error path (such as, in
add_weighted_interleave_group()), some node_attrs[] element may be
"NULL"?

> +	kobject_put(mempolicy_kobj);
> +}
> +
> +static const struct kobj_type mempolicy_ktype = {
> +	.sysfs_ops = &kobj_sysfs_ops,
> +	.release = sysfs_mempolicy_release,
> +};
> +
> +static int add_weight_node(int nid, struct kobject *wi_kobj)
> +{
> +	struct iw_node_attr *node_attr;
> +	char *name;
> +
> +	node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
> +	if (!node_attr)
> +		return -ENOMEM;
> +
> +	name = kasprintf(GFP_KERNEL, "node%d", nid);
> +	if (!name) {
> +		kfree(node_attr);
> +		return -ENOMEM;
> +	}
> +
> +	sysfs_attr_init(&node_attr->kobj_attr.attr);
> +	node_attr->kobj_attr.attr.name = name;
> +	node_attr->kobj_attr.attr.mode = 0644;
> +	node_attr->kobj_attr.show = node_show;
> +	node_attr->kobj_attr.store = node_store;
> +	node_attr->nid = nid;
> +
> +	if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) {
> +		kfree(node_attr->kobj_attr.attr.name);
> +		kfree(node_attr);
> +		pr_err("failed to add attribute to weighted_interleave\n");
> +		return -ENOMEM;
> +	}
> +
> +	node_attrs[nid] = node_attr;
> +	return 0;
> +}
> +
> +static int add_weighted_interleave_group(struct kobject *root_kobj)
> +{
> +	struct kobject *wi_kobj;
> +	int nid, err;
> +
> +	wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
> +	if (!wi_kobj)
> +		return -ENOMEM;
> +
> +	err = kobject_init_and_add(wi_kobj, &mempolicy_ktype, root_kobj,
> +				   "weighted_interleave");
> +	if (err) {
> +		kfree(wi_kobj);
> +		return err;
> +	}
> +
> +	memset(node_attrs, 0, sizeof(node_attrs));
> +	for_each_node_state(nid, N_POSSIBLE) {
> +		err = add_weight_node(nid, wi_kobj);
> +		if (err) {
> +			pr_err("failed to add sysfs [node%d]\n", nid);
> +			break;
> +		}
> +	}
> +	if (err)
> +		kobject_put(wi_kobj);
> +	return 0;
> +}
> +
> +static int __init mempolicy_sysfs_init(void)
> +{
> +	int err;
> +	struct kobject *root_kobj;
> +
> +	memset(&iw_table, 1, sizeof(iw_table));
> +
> +	root_kobj = kobject_create_and_add("mempolicy", mm_kobj);
> +	if (!root_kobj) {
> +		pr_err("failed to add mempolicy kobject to the system\n");
> +		return -ENOMEM;
> +	}
> +
> +	err = add_weighted_interleave_group(root_kobj);
> +
> +	if (err)
> +		kobject_put(root_kobj);
> +	return err;
> +
> +}
> +#else
> +static int __init mempolicy_sysfs_init(void)
> +{
> +	/*
> +	 * if sysfs is not enabled MPOL_WEIGHTED_INTERLEAVE defaults to
> +	 * MPOL_INTERLEAVE behavior, but is still defined separately to
> +	 * allow task-local weighted interleave to operate as intended.
> +	 */
> +	memset(&iw_table, 1, sizeof(iw_table));
> +	return 0;
> +}
> +#endif /* CONFIG_SYSFS */
> +late_initcall(mempolicy_sysfs_init);

--
Best Regards,
Huang, Ying
Huang, Ying Jan. 2, 2024, 7:41 a.m. UTC | #3
Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Dec 27, 2023 at 02:42:15PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > +		These weights only affect new allocations, and changes at runtime
>> > +		will not cause migrations on already allocated pages.
>> > +
>> > +		Writing an empty string resets the weight value to 1.
>> 
>> I still think that it's a good idea to provide some better default
>> weight value with HMAT or CDAT if available.  So, better not to make "1"
>> as part of ABI?
>> 
>
> That's the eventual goal,

Good to know this.

> but this is just the initial mechanism.
>
> My current thought is that the CXL driver will apply weights as the
> system iterates through devices and creates numa nodes.  In the
> meantime, you have to give the "possible" nodes a default value to
> prevent nodes onlined after boot from showing up with 0-value.
>
> Not allowing 0-value weights is simply easier in many respects.
>
>> > +
>> > +		Minimum weight: 1
>> 
>> Can weight be "0"?  Do we need a way to specify that a node don't want
>> to participate weighted interleave?
>> 
>
> In this code, weight cannot be 0.  My thoguht is that removing the node
> from the nodemask is the way to denote 0.
>
> The problem with 0 is hotplug, migration, and cpusets.mems_allowed.  
>
> Example issue:  Use set local weights to [1,0,1,0] for nodes [0-3],
> and has a cpusets.mems_allowed mask of (0, 2).
>
> Lets say the user migrates the task via cgroups from nodes (0,2) to
> (1,3).
>
> The task will instantly crash as basically OOM because weights of
> [1,0,1,0] will prevent memory from being allocations.
>
> Not allowing nodes weights of 0 is defensive.  Instead, simply removing
> the node from the nodemask and/or mems_allowed is both equivalent to and
> the preferred way to apply a weight of 0.

It sounds reasonable to set minimum weight to 1.  But "1" may be not the
default weight, so, I don't think it's a good idea to make "1" as
default in ABI.

>> > +		Maximum weight: 255
>> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
>> > index 10a590ee1c89..0e77633b07a5 100644
>> > --- a/mm/mempolicy.c
>> > +++ b/mm/mempolicy.c
>> > @@ -131,6 +131,8 @@ static struct mempolicy default_policy = {
>> >  
>> >  static struct mempolicy preferred_node_policy[MAX_NUMNODES];
>> >  
>> > +static char iw_table[MAX_NUMNODES];
>> > +
>> 
>> It's kind of obscure whether "char" is "signed" or "unsigned".  Given
>> the max weight is 255 above, it's better to use "u8"?
>>
>
> bah, stupid mistake.  I will switch this to u8.
>
>> And, we may need a way to specify whether the weight has been overridden
>> by the user.
>> A special value (such as 255) can be used for that.  If
>> so, the maximum weight should be 254 instead of 255.  As a user space
>> interface, is it better to use 100 as the maximum value?
>> 
>
> There's global weights and local weights.  These are the global weights.
>
> Local weights are stored in task->mempolicy.wil.il_weights.
>
> (policy->mode_flags & MPOL_F_GWEIGHT) denotes the override.
> This is set if (mempolicy_args->il_weights) was provided.
>
> This simplifies the interface.
>
> (note: local weights are not introduced until the last patch 11/11)

The global weight is writable via sysfs too, right?  Then, for global
weights, we have 2 sets of values,

iw_table_default[], and iw_table[].

iw_table_default[] is set to "1" now, and will be set to some other
values after we have enabled HMAT/CDAT based default value.

iw_table[] is initialized with a special value (for example, "0", if "1"
is the minimal weight).  And users can change it via sysfs.  Then the
actual global weight for a node becomes

    iw_table[node] ? : iw_table_default[node]


As for global weight and local weight, we may need a way to copy from
the global weights to the local weights to simplify the memory policy
setup.  For example, if users want to use the global weights of node 0,
1, 2 and override the weight of node 3.  They can specify some special
value, for example, 0, in mempolicy_args->il_weights[0], [1], [2] to
copy from the global values, and override [3] via some other value.


Think about the default weight value via HMAT/CDAT again.  It may be not
a good idea to use "1" as default even for now.

For example,

- The memory bandwidth of DRAM is 100GB, whose default weight is "1".

- We hot-plug CXL.mem A with memory bandwidth 20GB.  So, we change the
  weight of DRAM to 5, and use "1" as the weight of CXL.mem A.

- We hot-plug CXL.mem B with memory bandwidth 10GB.  So, we change the
  weight of DRAM to 10, the weight of CXL.mem A to 2, and use "1" as the
  weight of CXL.mem B.

That is, if we use "1" as default weight, we need to change weights of
nodes frequently because we haven't a "base" weight.  The best candidate
base weight is the weight of DRAM node.  For example, if we set the
default weight of DRAM node to be "16" and use that as the base weight,
we don't need to change it in most cases.  The weight of other nodes can
be set according to the ratio of its memory bandwidth to that of DRAM.

This makes it easy to set the default weight via HMAT/CDAT too.

What do you think about that?

>> > +
>> > +static void sysfs_mempolicy_release(struct kobject *mempolicy_kobj)
>> > +{
>> > +	int i;
>> > +
>> > +	for (i = 0; i < MAX_NUMNODES; i++)
>> > +		sysfs_wi_node_release(node_attrs[i], mempolicy_kobj);
>> 
>> IIUC, if this is called in error path (such as, in
>> add_weighted_interleave_group()), some node_attrs[] element may be
>> "NULL"?
>> 
>
> The null check is present in sysfs_wi_node_release
>
> if (!node_attr)
> 	return;

This works.  Sorry for noise.

> Is it preferable to pull this out? Seemed more defensive to put it
> inside the function.

Both are OK for me.

--
Best Regards,
Huang, Ying
Gregory Price Jan. 2, 2024, 7:45 p.m. UTC | #4
On Tue, Jan 02, 2024 at 03:41:08PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> That is, if we use "1" as default weight, we need to change weights of
> nodes frequently because we haven't a "base" weight.  The best candidate
> base weight is the weight of DRAM node.  For example, if we set the
> default weight of DRAM node to be "16" and use that as the base weight,
> we don't need to change it in most cases.  The weight of other nodes can
> be set according to the ratio of its memory bandwidth to that of DRAM.
> 
> This makes it easy to set the default weight via HMAT/CDAT too.
> 
> What do you think about that?
> 

You're getting a bit ahead of the patch set.  There is "what is a
reasonable default weight" and "what is the minumum functionality".

The minimum functionality is everything receiving a default weight of 1,
such that weighted interleave's behavior defaults to round-robin
interleave. This gets the system off the ground.

We can then expose an internal interface to drivers for them to set the
default weight to some reasonable number during system and device
initialization. The question at that point is what system is responsible
for setting the default weights... node? cxl? anything? What happens on
hotplug? etc.  That seems outside the scope of this patch set.


If you want me to add the default_iw_table with special value 0 denoting
"use default" at each layer, I can do that.

The basic change is this snippet:
```
if (pol->flags & MPOL_F_GWEIGHT)
	pol_weights = iw_table;
else
	pol_weights = pol->wil.weights;

for_each_node_mask(nid, nodemask) {
	weight = pol_weights[nid];
	weight_total += weight;
	weights[nid] = weight;
}
```

changes to:
```
for_each_node_mask(nid, nodemask) {
	weight = pol->wil.weights[node]
	if (!weight)
		weight = iw_table[node]
	if (!weight)
		weight = default_iw_table[node]
	weight_total += weight;
	weights[nid] = weight
}
```

It's a bit ugly, but it allows a 0 value to represent "use default",
and default_iw_table just ends up being initialized to `1` for now.

I think it also allows MPOL_F_GWEIGHT to be eliminated.

~Gregory
Huang, Ying Jan. 3, 2024, 2:45 a.m. UTC | #5
Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Jan 02, 2024 at 03:41:08PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> That is, if we use "1" as default weight, we need to change weights of
>> nodes frequently because we haven't a "base" weight.  The best candidate
>> base weight is the weight of DRAM node.  For example, if we set the
>> default weight of DRAM node to be "16" and use that as the base weight,
>> we don't need to change it in most cases.  The weight of other nodes can
>> be set according to the ratio of its memory bandwidth to that of DRAM.
>> 
>> This makes it easy to set the default weight via HMAT/CDAT too.
>> 
>> What do you think about that?
>> 
>
> You're getting a bit ahead of the patch set.  There is "what is a
> reasonable default weight" and "what is the minumum functionality".

I totally agree that we need the minimal functionality firstly.

> The minimum functionality is everything receiving a default weight of 1,
> such that weighted interleave's behavior defaults to round-robin
> interleave. This gets the system off the ground.

I don't think that we need to implement all functionalities now.  But,
we may need to consider more especially if it may impact the user space
interface.  The default base weight is something like that.  If we
change the default base weight from "1" to "16" later, users may be
surprised.  So, I think it's better to discuss it now.

> We can then expose an internal interface to drivers for them to set the
> default weight to some reasonable number during system and device
> initialization. The question at that point is what system is responsible
> for setting the default weights... node? cxl? anything? What happens on
> hotplug? etc.  That seems outside the scope of this patch set.
>
>
> If you want me to add the default_iw_table with special value 0 denoting
> "use default" at each layer, I can do that.
>
> The basic change is this snippet:
> ```
> if (pol->flags & MPOL_F_GWEIGHT)
> 	pol_weights = iw_table;
> else
> 	pol_weights = pol->wil.weights;
>
> for_each_node_mask(nid, nodemask) {
> 	weight = pol_weights[nid];
> 	weight_total += weight;
> 	weights[nid] = weight;
> }
> ```
>
> changes to:
> ```
> for_each_node_mask(nid, nodemask) {
> 	weight = pol->wil.weights[node]
> 	if (!weight)
> 		weight = iw_table[node]
> 	if (!weight)
> 		weight = default_iw_table[node]
> 	weight_total += weight;
> 	weights[nid] = weight
> }
> ```
>
> It's a bit ugly,

We can use a wrapper function to hide the logic.

> but it allows a 0 value to represent "use default",
> and default_iw_table just ends up being initialized to `1` for now.

Because the contents of default_iw_table are just "default weight" for
now.  We don't need it for now.  We can add it later.

> I think it also allows MPOL_F_GWEIGHT to be eliminated.

Do we need a way to distinguish whether to copy the global weights to
local weights when the memory policy is created?  That is, when the
global weights are changed later, will the changes be used?  One
possible solution is

- If no weights are specified in set_mempolicy2(), the global weights
  will be used always.

- If at least one weight is specified in set_mempolicy2(), it will be
  used, and the other weights in global weights will be copied to the
  local weights.  That is, changes to the global weights will not be
  used.

--
Best Regards,
Huang, Ying
Gregory Price Jan. 3, 2024, 2:46 a.m. UTC | #6
On Tue, Jan 02, 2024 at 03:41:08PM +0800, Huang, Ying wrote:
> Think about the default weight value via HMAT/CDAT again.  It may be not
> a good idea to use "1" as default even for now.
> 
> For example,
> 
> - The memory bandwidth of DRAM is 100GB, whose default weight is "1".
> 
> - We hot-plug CXL.mem A with memory bandwidth 20GB.  So, we change the
>   weight of DRAM to 5, and use "1" as the weight of CXL.mem A.
> 
> - We hot-plug CXL.mem B with memory bandwidth 10GB.  So, we change the
>   weight of DRAM to 10, the weight of CXL.mem A to 2, and use "1" as the
>   weight of CXL.mem B.
> 
> That is, if we use "1" as default weight, we need to change weights of
> nodes frequently because we haven't a "base" weight.  The best candidate
> base weight is the weight of DRAM node.  For example, if we set the
> default weight of DRAM node to be "16" and use that as the base weight,
> we don't need to change it in most cases.  The weight of other nodes can
> be set according to the ratio of its memory bandwidth to that of DRAM.
> 
> This makes it easy to set the default weight via HMAT/CDAT too.
> 
> What do you think about that?
> 

Giving this more thought.

Hotplug should be an incredibly rare event. I don't think swapping defaults
"frequently" is a real problem we should handle.

It's expected that dynamic capacity devices will not cause a node to
hotplug, but instead cause a node to grow/shrink.

Seems perfectly fine to rebalance weights in response to rare events.

~Gregory
Gregory Price Jan. 3, 2024, 2:59 a.m. UTC | #7
On Wed, Jan 03, 2024 at 10:45:53AM +0800, Huang, Ying wrote:
> 
> > The minimum functionality is everything receiving a default weight of 1,
> > such that weighted interleave's behavior defaults to round-robin
> > interleave. This gets the system off the ground.
> 
> I don't think that we need to implement all functionalities now.  But,
> we may need to consider more especially if it may impact the user space
> interface.  The default base weight is something like that.  If we
> change the default base weight from "1" to "16" later, users may be
> surprised.  So, I think it's better to discuss it now.
>

This is a hill I don't particularly care to die on.  I think the weights
are likely to end up being set at boot and rebalanced as (rare) hotplug
events occur.

So if people think the default weight should be 3,16,24 or 123, i don't
think it's going to matter.

> 
> We can use a wrapper function to hide the logic.
>

Done.  I'll push a new set tomorrow.

> > I think it also allows MPOL_F_GWEIGHT to be eliminated.
> 
> Do we need a way to distinguish whether to copy the global weights to
> local weights when the memory policy is created?  That is, when the
> global weights are changed later, will the changes be used?  One
> possible solution is
> 
> - If no weights are specified in set_mempolicy2(), the global weights
>   will be used always.
> 
> - If at least one weight is specified in set_mempolicy2(), it will be
>   used, and the other weights in global weights will be copied to the
>   local weights.  That is, changes to the global weights will not be
>   used.
> 

What's confusing about that is that if a user sets a weight to 0,
they'll get a non-0 weight - always.

In my opinion, if we want to make '0' mean 'use system default', then
it should mean 'ALWAYS use system default for this node'.

"Use the system default at the time the syscall was called, and do not
update to use a new system default if that default is changed" is
confusing.

If you say use a global value, use the global value. Simple.

> --
> Best Regards,
> Huang, Ying
Huang, Ying Jan. 3, 2024, 6:03 a.m. UTC | #8
Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Jan 03, 2024 at 10:45:53AM +0800, Huang, Ying wrote:
>> 
>> > The minimum functionality is everything receiving a default weight of 1,
>> > such that weighted interleave's behavior defaults to round-robin
>> > interleave. This gets the system off the ground.
>> 
>> I don't think that we need to implement all functionalities now.  But,
>> we may need to consider more especially if it may impact the user space
>> interface.  The default base weight is something like that.  If we
>> change the default base weight from "1" to "16" later, users may be
>> surprised.  So, I think it's better to discuss it now.
>>
>
> This is a hill I don't particularly care to die on.  I think the weights
> are likely to end up being set at boot and rebalanced as (rare) hotplug
> events occur.
>
> So if people think the default weight should be 3,16,24 or 123, i don't
> think it's going to matter.
>
>> 
>> We can use a wrapper function to hide the logic.
>>
>
> Done.  I'll push a new set tomorrow.
>
>> > I think it also allows MPOL_F_GWEIGHT to be eliminated.
>> 
>> Do we need a way to distinguish whether to copy the global weights to
>> local weights when the memory policy is created?  That is, when the
>> global weights are changed later, will the changes be used?  One
>> possible solution is
>> 
>> - If no weights are specified in set_mempolicy2(), the global weights
>>   will be used always.
>> 
>> - If at least one weight is specified in set_mempolicy2(), it will be
>>   used, and the other weights in global weights will be copied to the
>>   local weights.  That is, changes to the global weights will not be
>>   used.
>> 
>
> What's confusing about that is that if a user sets a weight to 0,
> they'll get a non-0 weight - always.
>
> In my opinion, if we want to make '0' mean 'use system default', then
> it should mean 'ALWAYS use system default for this node'.
>
> "Use the system default at the time the syscall was called, and do not
> update to use a new system default if that default is changed" is
> confusing.
>
> If you say use a global value, use the global value. Simple.

I mainly have concerns about consistency.  The global weights can be
changed while the local weights are fixed.  For example,

- Weights of node 0,1 is [3, 1] initially

- Process A call set_mempolicy2() to set weights to [4, 0], that is, use
  default weight for node 1.

- After hotplug, the weights of node is changed to [12, 4, 1], now the
  effective weights used in process A becomes [4, 4].  Which is hardly
  desired.

Another choice is to disallow "0" as weight in set_mempolicy2().

--
Best Regards,
Huang, Ying
diff mbox series

Patch

diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
new file mode 100644
index 000000000000..2dcf24f4384a
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy
@@ -0,0 +1,4 @@ 
+What:		/sys/kernel/mm/mempolicy/
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Interface for Mempolicy
diff --git a/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
new file mode 100644
index 000000000000..aa27fdf08c19
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-kernel-mm-mempolicy-weighted-interleave
@@ -0,0 +1,22 @@ 
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Configuration Interface for the Weighted Interleave policy
+
+What:		/sys/kernel/mm/mempolicy/weighted_interleave/nodeN
+Date:		December 2023
+Contact:	Linux memory management mailing list <linux-mm@kvack.org>
+Description:	Weight configuration interface for nodeN
+
+		The interleave weight for a memory node (N). These weights are
+		utilized by processes which have set their mempolicy to
+		MPOL_WEIGHTED_INTERLEAVE and have opted into global weights by
+		omitting a task-local weight array.
+
+		These weights only affect new allocations, and changes at runtime
+		will not cause migrations on already allocated pages.
+
+		Writing an empty string resets the weight value to 1.
+
+		Minimum weight: 1
+		Maximum weight: 255
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..0e77633b07a5 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -131,6 +131,8 @@  static struct mempolicy default_policy = {
 
 static struct mempolicy preferred_node_policy[MAX_NUMNODES];
 
+static char iw_table[MAX_NUMNODES];
+
 /**
  * numa_nearest_node - Find nearest node by state
  * @node: Node id to start the search
@@ -3067,3 +3069,157 @@  void mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol)
 		p += scnprintf(p, buffer + maxlen - p, ":%*pbl",
 			       nodemask_pr_args(&nodes));
 }
+
+#ifdef CONFIG_SYSFS
+struct iw_node_attr {
+	struct kobj_attribute kobj_attr;
+	int nid;
+};
+
+static ssize_t node_show(struct kobject *kobj, struct kobj_attribute *attr,
+			 char *buf)
+{
+	struct iw_node_attr *node_attr;
+
+	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+	return sysfs_emit(buf, "%d\n", iw_table[node_attr->nid]);
+}
+
+static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
+			  const char *buf, size_t count)
+{
+	struct iw_node_attr *node_attr;
+	unsigned char weight = 0;
+
+	node_attr = container_of(attr, struct iw_node_attr, kobj_attr);
+	/* If no input, set default weight to 1 */
+	if (count == 0 || sysfs_streq(buf, ""))
+		weight = 1;
+	else if (kstrtou8(buf, 0, &weight) || !weight)
+		return -EINVAL;
+
+	iw_table[node_attr->nid] = weight;
+	return count;
+}
+
+static struct iw_node_attr *node_attrs[MAX_NUMNODES];
+
+static void sysfs_wi_node_release(struct iw_node_attr *node_attr,
+				  struct kobject *parent)
+{
+	if (!node_attr)
+		return;
+	sysfs_remove_file(parent, &node_attr->kobj_attr.attr);
+	kfree(node_attr->kobj_attr.attr.name);
+	kfree(node_attr);
+}
+
+static void sysfs_mempolicy_release(struct kobject *mempolicy_kobj)
+{
+	int i;
+
+	for (i = 0; i < MAX_NUMNODES; i++)
+		sysfs_wi_node_release(node_attrs[i], mempolicy_kobj);
+	kobject_put(mempolicy_kobj);
+}
+
+static const struct kobj_type mempolicy_ktype = {
+	.sysfs_ops = &kobj_sysfs_ops,
+	.release = sysfs_mempolicy_release,
+};
+
+static int add_weight_node(int nid, struct kobject *wi_kobj)
+{
+	struct iw_node_attr *node_attr;
+	char *name;
+
+	node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
+	if (!node_attr)
+		return -ENOMEM;
+
+	name = kasprintf(GFP_KERNEL, "node%d", nid);
+	if (!name) {
+		kfree(node_attr);
+		return -ENOMEM;
+	}
+
+	sysfs_attr_init(&node_attr->kobj_attr.attr);
+	node_attr->kobj_attr.attr.name = name;
+	node_attr->kobj_attr.attr.mode = 0644;
+	node_attr->kobj_attr.show = node_show;
+	node_attr->kobj_attr.store = node_store;
+	node_attr->nid = nid;
+
+	if (sysfs_create_file(wi_kobj, &node_attr->kobj_attr.attr)) {
+		kfree(node_attr->kobj_attr.attr.name);
+		kfree(node_attr);
+		pr_err("failed to add attribute to weighted_interleave\n");
+		return -ENOMEM;
+	}
+
+	node_attrs[nid] = node_attr;
+	return 0;
+}
+
+static int add_weighted_interleave_group(struct kobject *root_kobj)
+{
+	struct kobject *wi_kobj;
+	int nid, err;
+
+	wi_kobj = kzalloc(sizeof(struct kobject), GFP_KERNEL);
+	if (!wi_kobj)
+		return -ENOMEM;
+
+	err = kobject_init_and_add(wi_kobj, &mempolicy_ktype, root_kobj,
+				   "weighted_interleave");
+	if (err) {
+		kfree(wi_kobj);
+		return err;
+	}
+
+	memset(node_attrs, 0, sizeof(node_attrs));
+	for_each_node_state(nid, N_POSSIBLE) {
+		err = add_weight_node(nid, wi_kobj);
+		if (err) {
+			pr_err("failed to add sysfs [node%d]\n", nid);
+			break;
+		}
+	}
+	if (err)
+		kobject_put(wi_kobj);
+	return 0;
+}
+
+static int __init mempolicy_sysfs_init(void)
+{
+	int err;
+	struct kobject *root_kobj;
+
+	memset(&iw_table, 1, sizeof(iw_table));
+
+	root_kobj = kobject_create_and_add("mempolicy", mm_kobj);
+	if (!root_kobj) {
+		pr_err("failed to add mempolicy kobject to the system\n");
+		return -ENOMEM;
+	}
+
+	err = add_weighted_interleave_group(root_kobj);
+
+	if (err)
+		kobject_put(root_kobj);
+	return err;
+
+}
+#else
+static int __init mempolicy_sysfs_init(void)
+{
+	/*
+	 * if sysfs is not enabled MPOL_WEIGHTED_INTERLEAVE defaults to
+	 * MPOL_INTERLEAVE behavior, but is still defined separately to
+	 * allow task-local weighted interleave to operate as intended.
+	 */
+	memset(&iw_table, 1, sizeof(iw_table));
+	return 0;
+}
+#endif /* CONFIG_SYSFS */
+late_initcall(mempolicy_sysfs_init);