
[v3,3/3] mm/mempolicy: Support memory hotplug in weighted interleave

Message ID 20250320041749.881-4-rakie.kim@sk.com (mailing list archive)
State New
Series Enhance sysfs handling for memory hotplug in weighted interleave

Commit Message

Rakie Kim March 20, 2025, 4:17 a.m. UTC
The weighted interleave policy distributes page allocations across multiple
NUMA nodes based on their performance weight, thereby improving memory
bandwidth utilization. The weight values for each node are configured
through sysfs.
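For context, the per-node weights live under `/sys/kernel/mm/mempolicy/weighted_interleave/`; a hypothetical tuning session (node numbers and weight values illustrative) looks like:

```shell
# Weight node0 (e.g. local DRAM) 3x relative to node1 (e.g. CXL memory).
# Weights are small positive integers; pages are interleaved across the
# nodes in proportion to them.
echo 3 > /sys/kernel/mm/mempolicy/weighted_interleave/node0
echo 1 > /sys/kernel/mm/mempolicy/weighted_interleave/node1

# Tasks opt in via the MPOL_WEIGHTED_INTERLEAVE policy, e.g.:
numactl --weighted-interleave=0,1 ./my_app   # needs a recent numactl
```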

Previously, sysfs entries for configuring weighted interleave were created
for all possible nodes (N_POSSIBLE) at initialization, including nodes that
might not have memory. However, not all N_POSSIBLE nodes are usable at
runtime: some remain memoryless or offline. As a result, sysfs entries
were created for unusable nodes, inviting misconfiguration.

To address this issue, this patch modifies the sysfs creation logic to:
1) Limit sysfs entries to nodes that are online and have memory, reducing
   the creation of sysfs attributes for unusable nodes.
2) Support memory hotplug by dynamically adding and removing sysfs entries
   based on whether a node transitions into or out of the N_MEMORY state.

Additionally, the patch ensures that sysfs attributes are properly managed
when nodes go offline, preventing stale or redundant entries from persisting
in the system.

By making these changes, the weighted interleave policy now manages its
sysfs entries more efficiently, ensuring that only relevant nodes are
considered for interleaving, and dynamically adapting to memory hotplug
events.

Signed-off-by: Rakie Kim <rakie.kim@sk.com>
---
 mm/mempolicy.c | 108 +++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 86 insertions(+), 22 deletions(-)

Comments

Gregory Price March 21, 2025, 2:24 p.m. UTC | #1
On Thu, Mar 20, 2025 at 01:17:48PM +0900, Rakie Kim wrote:
... snip ...
> +	mutex_lock(&sgrp->kobj_lock);
> +	if (sgrp->nattrs[nid]) {
> +		mutex_unlock(&sgrp->kobj_lock);
> +		pr_info("Node [%d] already exists\n", nid);
> +		kfree(new_attr);
> +		kfree(name);
> +		return 0;
> +	}
>  
> -	if (sysfs_create_file(&sgrp->wi_kobj, &node_attr->kobj_attr.attr)) {
> -		kfree(node_attr->kobj_attr.attr.name);
> -		kfree(node_attr);
> -		pr_err("failed to add attribute to weighted_interleave\n");
> -		return -ENOMEM;
> +	sgrp->nattrs[nid] = new_attr;
> +	mutex_unlock(&sgrp->kobj_lock);
> +
> +	sysfs_attr_init(&sgrp->nattrs[nid]->kobj_attr.attr);
> +	sgrp->nattrs[nid]->kobj_attr.attr.name = name;
> +	sgrp->nattrs[nid]->kobj_attr.attr.mode = 0644;
> +	sgrp->nattrs[nid]->kobj_attr.show = node_show;
> +	sgrp->nattrs[nid]->kobj_attr.store = node_store;
> +	sgrp->nattrs[nid]->nid = nid;

These accesses need to be inside the lock as well.  Probably we can't
get here concurrently, but I can't say so definitively enough that I'm
comfortable blind-accessing it outside the lock.

> +static int wi_node_notifier(struct notifier_block *nb,
> +			       unsigned long action, void *data)
> +{
... snip ...
> +	case MEM_OFFLINE:
> +		sysfs_wi_node_release(nid);

I'm still not convinced this is correct.  `offline_pages()` says this:

/*
 * {on,off}lining is constrained to full memory sections (or more
 * precisely to memory blocks from the user space POV).
 */

And that is the function calling:
	memory_notify(MEM_OFFLINE, &arg);

David pointed out that this should be called when offlining each memory
block.  This is not the same as simply doing `echo 0 > online`, you need
to remove the dax device associated with the memory.

For example:

      node1
    /       \
 dax0.0    dax1.0
   |          |
  mb1        mb2


With this code, if I `daxctl reconfigure-device --mode=devdax dax0.0` it will
remove the first memory block, causing MEM_OFFLINE event to fire and
removing the node - despite the fact that dax1.0 is still present.

This matters for systems with memory holes in CXL hotplug memory and
also for systems with Dynamic Capacity Devices surfacing capacity as
separate dax devices.
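The distinction matters operationally: offlining blocks through sysfs is not the same as tearing down the dax device. The two paths could be reproduced roughly as follows (device and node names illustrative):

```shell
# Path 1: offline every memory block of node1 via sysfs; the node drops
# out of N_MEMORY only once its last block goes down.
for mb in /sys/devices/system/node/node1/memory*; do
    echo 0 > "$mb/online"
done

# Path 2: reconfigure only dax0.0 away from system-ram; its blocks are
# offlined (firing MEM_OFFLINE) while dax1.0 still backs node1.
daxctl reconfigure-device --mode=devdax dax0.0
```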

~Gregory
Rakie Kim March 24, 2025, 8:48 a.m. UTC | #2
On Fri, 21 Mar 2025 10:24:46 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Thu, Mar 20, 2025 at 01:17:48PM +0900, Rakie Kim wrote:
> ... snip ...
> > +	mutex_lock(&sgrp->kobj_lock);
> > +	if (sgrp->nattrs[nid]) {
> > +		mutex_unlock(&sgrp->kobj_lock);
> > +		pr_info("Node [%d] already exists\n", nid);
> > +		kfree(new_attr);
> > +		kfree(name);
> > +		return 0;
> > +	}
> >  
> > -	if (sysfs_create_file(&sgrp->wi_kobj, &node_attr->kobj_attr.attr)) {
> > -		kfree(node_attr->kobj_attr.attr.name);
> > -		kfree(node_attr);
> > -		pr_err("failed to add attribute to weighted_interleave\n");
> > -		return -ENOMEM;
> > +	sgrp->nattrs[nid] = new_attr;
> > +	mutex_unlock(&sgrp->kobj_lock);
> > +
> > +	sysfs_attr_init(&sgrp->nattrs[nid]->kobj_attr.attr);
> > +	sgrp->nattrs[nid]->kobj_attr.attr.name = name;
> > +	sgrp->nattrs[nid]->kobj_attr.attr.mode = 0644;
> > +	sgrp->nattrs[nid]->kobj_attr.show = node_show;
> > +	sgrp->nattrs[nid]->kobj_attr.store = node_store;
> > +	sgrp->nattrs[nid]->nid = nid;
> 
> These accesses need to be inside the lock as well.  Probably we can't
> get here concurrently, but I can't say so definitively enough that I'm
> comfortable blind-accessing it outside the lock.

You're right, and I appreciate your point. It's not difficult to apply your
suggestion, so I plan to update the code as follows:

    sgrp->nattrs[nid] = new_attr;

    sysfs_attr_init(&sgrp->nattrs[nid]->kobj_attr.attr);
    sgrp->nattrs[nid]->kobj_attr.attr.name = name;
    sgrp->nattrs[nid]->kobj_attr.attr.mode = 0644;
    sgrp->nattrs[nid]->kobj_attr.show = node_show;
    sgrp->nattrs[nid]->kobj_attr.store = node_store;
    sgrp->nattrs[nid]->nid = nid;

    ret = sysfs_create_file(&sgrp->wi_kobj,
           &sgrp->nattrs[nid]->kobj_attr.attr);
    if (ret) {
        mutex_unlock(&sgrp->kobj_lock);
        ...
    }
    mutex_unlock(&sgrp->kobj_lock);

> 
> > +static int wi_node_notifier(struct notifier_block *nb,
> > +			       unsigned long action, void *data)
> > +{
> ... snip ...
> > +	case MEM_OFFLINE:
> > +		sysfs_wi_node_release(nid);
> 
> I'm still not convinced this is correct.  `offline_pages()` says this:
> 
> /*
>  * {on,off}lining is constrained to full memory sections (or more
>  * precisely to memory blocks from the user space POV).
>  */
> 
> And that is the function calling:
> 	memory_notify(MEM_OFFLINE, &arg);
> 
> David pointed out that this should be called when offlining each memory
> block.  This is not the same as simply doing `echo 0 > online`, you need
> to remove the dax device associated with the memory.
> 
> For example:
> 
>       node1
>     /       \
>  dax0.0    dax1.0
>    |          |
>   mb1        mb2
> 
> 
> With this code, if I `daxctl reconfigure-device devmem dax0.0` it will
> remove the first memory block, causing MEM_OFFLINE event to fire and
> removing the node - despite the fact that dax1.0 is still present.
> 
> This matters for systems with memory holes in CXL hotplug memory and
> also for systems with Dynamic Capacity Devices surfacing capacity as
> separate dax devices.
> 
> ~Gregory

If all memory blocks belonging to a node are offlined, the node will lose its
`N_MEMORY` state before the notifier callback is invoked. This should help avoid
the issue you mentioned.
Please let me know your thoughts on this approach.

Rakie
Rakie Kim March 24, 2025, 8:54 a.m. UTC | #3
On Mon, 24 Mar 2025 17:48:39 +0900 Rakie Kim <rakie.kim@sk.com> wrote:
> On Fri, 21 Mar 2025 10:24:46 -0400 Gregory Price <gourry@gourry.net> wrote:
> > On Thu, Mar 20, 2025 at 01:17:48PM +0900, Rakie Kim wrote:
> > ... snip ...
> > > +	mutex_lock(&sgrp->kobj_lock);
> > > +	if (sgrp->nattrs[nid]) {
> > > +		mutex_unlock(&sgrp->kobj_lock);
> > > +		pr_info("Node [%d] already exists\n", nid);
> > > +		kfree(new_attr);
> > > +		kfree(name);
> > > +		return 0;
> > > +	}
> > >  
> > > -	if (sysfs_create_file(&sgrp->wi_kobj, &node_attr->kobj_attr.attr)) {
> > > -		kfree(node_attr->kobj_attr.attr.name);
> > > -		kfree(node_attr);
> > > -		pr_err("failed to add attribute to weighted_interleave\n");
> > > -		return -ENOMEM;
> > > +	sgrp->nattrs[nid] = new_attr;
> > > +	mutex_unlock(&sgrp->kobj_lock);
> > > +
> > > +	sysfs_attr_init(&sgrp->nattrs[nid]->kobj_attr.attr);
> > > +	sgrp->nattrs[nid]->kobj_attr.attr.name = name;
> > > +	sgrp->nattrs[nid]->kobj_attr.attr.mode = 0644;
> > > +	sgrp->nattrs[nid]->kobj_attr.show = node_show;
> > > +	sgrp->nattrs[nid]->kobj_attr.store = node_store;
> > > +	sgrp->nattrs[nid]->nid = nid;
> > 
> > These accesses need to be inside the lock as well.  Probably we can't
> > get here concurrently, but I can't say so definitively enough that I'm
> > comfortable blind-accessing it outside the lock.
> 
> You're right, and I appreciate your point. It's not difficult to apply your
> suggestion, so I plan to update the code as follows:
> 
>     sgrp->nattrs[nid] = new_attr;
> 
>     sysfs_attr_init(&sgrp->nattrs[nid]->kobj_attr.attr);
>     sgrp->nattrs[nid]->kobj_attr.attr.name = name;
>     sgrp->nattrs[nid]->kobj_attr.attr.mode = 0644;
>     sgrp->nattrs[nid]->kobj_attr.show = node_show;
>     sgrp->nattrs[nid]->kobj_attr.store = node_store;
>     sgrp->nattrs[nid]->nid = nid;
> 
>     ret = sysfs_create_file(&sgrp->wi_kobj,
>            &sgrp->nattrs[nid]->kobj_attr.attr);
>     if (ret) {
>         mutex_unlock(&sgrp->kobj_lock);
>         ...
>     }
>     mutex_unlock(&sgrp->kobj_lock);
> 
> > 
> > > +static int wi_node_notifier(struct notifier_block *nb,
> > > +			       unsigned long action, void *data)
> > > +{
> > ... snip ...
> > > +	case MEM_OFFLINE:
> > > +		sysfs_wi_node_release(nid);
> > 
> > I'm still not convinced this is correct.  `offline_pages()` says this:
> > 
> > /*
> >  * {on,off}lining is constrained to full memory sections (or more
> >  * precisely to memory blocks from the user space POV).
> >  */
> > 
> > And that is the function calling:
> > 	memory_notify(MEM_OFFLINE, &arg);
> > 
> > David pointed out that this should be called when offlining each memory
> > block.  This is not the same as simply doing `echo 0 > online`, you need
> > to remove the dax device associated with the memory.
> > 
> > For example:
> > 
> >       node1
> >     /       \
> >  dax0.0    dax1.0
> >    |          |
> >   mb1        mb2
> > 
> > 
> > With this code, if I `daxctl reconfigure-device devmem dax0.0` it will
> > remove the first memory block, causing MEM_OFFLINE event to fire and
> > removing the node - despite the fact that dax1.0 is still present.
> > 
> > This matters for systems with memory holes in CXL hotplug memory and
> > also for systems with Dynamic Capacity Devices surfacing capacity as
> > separate dax devices.
> > 
> > ~Gregory
> 
> If all memory blocks belonging to a node are offlined, the node will lose its
> `N_MEMORY` state before the notifier callback is invoked. This should help avoid
> the issue you mentioned.
> Please let me know your thoughts on this approach.
> 
> Rakie
> 

I'm sorry, the code was missing from my previous reply.
I may not fully understand the scenario you described, but I think your concern
can be addressed by adding a simple check like the following:

    case MEM_OFFLINE:
        if (!node_state(nid, N_MEMORY)) --> this point
            sysfs_wi_node_release(nid);

If all memory blocks belonging to a node are offlined, the node will lose its
`N_MEMORY` state before the notifier callback is invoked. This should help avoid
the issue you mentioned.
Please let me know your thoughts on this approach.

Rakie.
Gregory Price March 24, 2025, 1:32 p.m. UTC | #4
On Mon, Mar 24, 2025 at 05:54:27PM +0900, Rakie Kim wrote:
> 
> I'm sorry, the code is missing.
> I may not fully understand the scenario you described, but I think your concern
> can be addressed by adding a simple check like the following:
> 
>     case MEM_OFFLINE:
>         if (!node_state(nid, N_MEMORY)) --> this point
>             sysfs_wi_node_release(nid);
>

This should work.  I have some questions about whether there might be
some subtle race conditions with this implementation, but I can take a
look after LSFMM.  (Example: if two blocks are being offlined/onlined at
the same time, is node_state(nid, N_MEMORY) racy?)

~Gregory
Rakie Kim March 25, 2025, 10:27 a.m. UTC | #5
On Mon, 24 Mar 2025 09:32:49 -0400 Gregory Price <gourry@gourry.net> wrote:
> On Mon, Mar 24, 2025 at 05:54:27PM +0900, Rakie Kim wrote:
> > 
> > I'm sorry, the code is missing.
> > I may not fully understand the scenario you described, but I think your concern
> > can be addressed by adding a simple check like the following:
> > 
> >     case MEM_OFFLINE:
> >         if (!node_state(nid, N_MEMORY)) --> this point
> >             sysfs_wi_node_release(nid);
> >
> 
> This should work.  I have some questions about whether there might be
> some subtle race conditions with this implementation, but I can take a
> look after LSFMM.  (Example: if two blocks are being offlined/onlined at
> the same time, is node_state(nid, N_MEMORY) racy?)
> 
> ~Gregory

I will also review the code further for race condition issues. I intend
to incorporate the discussions on this patch series into v4. Your
feedback and review of v4 after LSFMM would be greatly appreciated.

Rakie

Patch

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6c8843114afd..91cdc1d9d43e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -113,6 +113,7 @@ 
 #include <asm/tlbflush.h>
 #include <asm/tlb.h>
 #include <linux/uaccess.h>
+#include <linux/memory.h>
 
 #include "internal.h"
 
@@ -3390,6 +3391,7 @@  struct iw_node_attr {
 
 struct sysfs_wi_group {
 	struct kobject wi_kobj;
+	struct mutex kobj_lock;
 	struct iw_node_attr *nattrs[];
 };
 
@@ -3439,12 +3441,24 @@  static ssize_t node_store(struct kobject *kobj, struct kobj_attribute *attr,
 
 static void sysfs_wi_node_release(int nid)
 {
-	if (!sgrp->nattrs[nid])
+	struct iw_node_attr *attr;
+
+	if (nid < 0 || nid >= nr_node_ids)
+		return;
+
+	mutex_lock(&sgrp->kobj_lock);
+	attr = sgrp->nattrs[nid];
+	if (!attr) {
+		mutex_unlock(&sgrp->kobj_lock);
 		return;
+	}
+
+	sgrp->nattrs[nid] = NULL;
+	mutex_unlock(&sgrp->kobj_lock);
 
-	sysfs_remove_file(&sgrp->wi_kobj, &sgrp->nattrs[nid]->kobj_attr.attr);
-	kfree(sgrp->nattrs[nid]->kobj_attr.attr.name);
-	kfree(sgrp->nattrs[nid]);
+	sysfs_remove_file(&sgrp->wi_kobj, &attr->kobj_attr.attr);
+	kfree(attr->kobj_attr.attr.name);
+	kfree(attr);
 }
 
 static void sysfs_wi_release(struct kobject *wi_kobj)
@@ -3463,35 +3477,80 @@  static const struct kobj_type wi_ktype = {
 
 static int sysfs_wi_node_add(int nid)
 {
-	struct iw_node_attr *node_attr;
+	int ret = 0;
 	char *name;
+	struct iw_node_attr *new_attr = NULL;
+
+	if (nid < 0 || nid >= nr_node_ids) {
+		pr_err("Invalid node id: %d\n", nid);
+		return -EINVAL;
+	}
 
-	node_attr = kzalloc(sizeof(*node_attr), GFP_KERNEL);
-	if (!node_attr)
+	new_attr = kzalloc(sizeof(struct iw_node_attr), GFP_KERNEL);
+	if (!new_attr)
 		return -ENOMEM;
 
 	name = kasprintf(GFP_KERNEL, "node%d", nid);
 	if (!name) {
-		kfree(node_attr);
+		kfree(new_attr);
 		return -ENOMEM;
 	}
 
-	sysfs_attr_init(&node_attr->kobj_attr.attr);
-	node_attr->kobj_attr.attr.name = name;
-	node_attr->kobj_attr.attr.mode = 0644;
-	node_attr->kobj_attr.show = node_show;
-	node_attr->kobj_attr.store = node_store;
-	node_attr->nid = nid;
+	mutex_lock(&sgrp->kobj_lock);
+	if (sgrp->nattrs[nid]) {
+		mutex_unlock(&sgrp->kobj_lock);
+		pr_info("Node [%d] already exists\n", nid);
+		kfree(new_attr);
+		kfree(name);
+		return 0;
+	}
 
-	if (sysfs_create_file(&sgrp->wi_kobj, &node_attr->kobj_attr.attr)) {
-		kfree(node_attr->kobj_attr.attr.name);
-		kfree(node_attr);
-		pr_err("failed to add attribute to weighted_interleave\n");
-		return -ENOMEM;
+	sgrp->nattrs[nid] = new_attr;
+	mutex_unlock(&sgrp->kobj_lock);
+
+	sysfs_attr_init(&sgrp->nattrs[nid]->kobj_attr.attr);
+	sgrp->nattrs[nid]->kobj_attr.attr.name = name;
+	sgrp->nattrs[nid]->kobj_attr.attr.mode = 0644;
+	sgrp->nattrs[nid]->kobj_attr.show = node_show;
+	sgrp->nattrs[nid]->kobj_attr.store = node_store;
+	sgrp->nattrs[nid]->nid = nid;
+
+	ret = sysfs_create_file(&sgrp->wi_kobj, &sgrp->nattrs[nid]->kobj_attr.attr);
+	if (ret) {
+		kfree(sgrp->nattrs[nid]->kobj_attr.attr.name);
+		kfree(sgrp->nattrs[nid]);
+		sgrp->nattrs[nid] = NULL;
+		pr_err("Failed to add attribute to weighted_interleave: %d\n", ret);
 	}
 
-	sgrp->nattrs[nid] = node_attr;
-	return 0;
+	return ret;
+}
+
+static int wi_node_notifier(struct notifier_block *nb,
+			       unsigned long action, void *data)
+{
+	int err;
+	struct memory_notify *arg = data;
+	int nid = arg->status_change_nid;
+
+	if (nid < 0)
+		goto notifier_end;
+
+	switch(action) {
+	case MEM_ONLINE:
+		err = sysfs_wi_node_add(nid);
+		if (err) {
+			pr_err("failed to add sysfs [node%d]\n", nid);
+			return NOTIFY_BAD;
+		}
+		break;
+	case MEM_OFFLINE:
+		sysfs_wi_node_release(nid);
+		break;
+	}
+
+notifier_end:
+	return NOTIFY_OK;
 }
 
 static int add_weighted_interleave_group(struct kobject *mempolicy_kobj)
@@ -3503,13 +3562,17 @@  static int add_weighted_interleave_group(struct kobject *mempolicy_kobj)
 		       GFP_KERNEL);
 	if (!sgrp)
 		return -ENOMEM;
+	mutex_init(&sgrp->kobj_lock);
 
 	err = kobject_init_and_add(&sgrp->wi_kobj, &wi_ktype, mempolicy_kobj,
 				   "weighted_interleave");
 	if (err)
 		goto err_out;
 
-	for_each_node_state(nid, N_POSSIBLE) {
+	for_each_online_node(nid) {
+		if (!node_state(nid, N_MEMORY))
+			continue;
+
 		err = sysfs_wi_node_add(nid);
 		if (err) {
 			pr_err("failed to add sysfs [node%d]\n", nid);
@@ -3517,6 +3580,7 @@  static int add_weighted_interleave_group(struct kobject *mempolicy_kobj)
 		}
 	}
 
+	hotplug_memory_notifier(wi_node_notifier, DEFAULT_CALLBACK_PRI);
 	return 0;
 
 err_out: