Message ID | 20250401092716.537512-2-osalvador@suse.de (mailing list archive)
---|---
State | New
Series | Implement numa node notifier
On Tue, Apr 01, 2025 at 11:27:15AM +0200, Oscar Salvador wrote:
> There are at least four consumers of hotplug_memory_notifier that what they
> really are interested in is whether any numa node changed its state, e.g: going
> from being memory aware to becoming memoryless.
>
> Implement a specific notifier for numa nodes when their state gets changed,
> and have those consumers that only care about numa node state changes use it.
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---

Hi Oscar,

Please Cc harry.yoo@oracle.com instead of 42.hyeyoo@gmail.com in the
future; I recently changed my email.

FWIW, I reviewed this patch and it looks good to me:

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

> drivers/acpi/numa/hmat.c | 6 +--
> drivers/base/node.c | 19 +++++++++
> drivers/cxl/core/region.c | 14 +++---
> drivers/cxl/cxl.h | 4 +-
> include/linux/memory.h | 38 ++++++++++++++++++
> kernel/cgroup/cpuset.c | 2 +-
> mm/memory-tiers.c | 8 ++--
> mm/memory_hotplug.c | 84 +++++++++++++++++++++++++++++----------
> mm/slub.c | 22 +++++-----
> 9 files changed, 148 insertions(+), 49 deletions(-)
>
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 75401866fb76..4bb9ff282ec9 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1963,7 +1987,9 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> unsigned long pfn, managed_pages, system_ram_pages = 0;
> const int node = zone_to_nid(zone);
> unsigned long flags;
> - struct memory_notify arg;
> + struct memory_notify mem_arg;
> + struct node_notify node_arg;
> + bool cancel_mem_notifier_on_err = false, cancel_node_notifier_on_err = false;
> char *reason;
> int ret;
>
> [...]
> + if (node_arg.status_change_nid >= 0) {
> kcompactd_stop(node);
> kswapd_stop(node);
> + /*Node went memoryless. Notifiy interested consumers */
> + node_notify(NODE_BECAME_MEMORYLESS, &node_arg);
> }

nit: /* Node went memoryless. ... */
        ^ a whitespace omitted.

> writeback_set_ratelimit();
>
> - memory_notify(MEM_OFFLINE, &arg);
> + memory_notify(MEM_OFFLINE, &mem_arg);
> remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
> return 0;
> diff --git a/mm/slub.c b/mm/slub.c
> index 184fd2b14758..74350f6c8ddd 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5928,10 +5928,10 @@ static int slab_mem_going_offline_callback(void *arg)
>
> static void slab_mem_offline_callback(void *arg)
> {
> - struct memory_notify *marg = arg;
> + struct node_notify *narg = arg;
> int offline_node;
>
> - offline_node = marg->status_change_nid_normal;
> + offline_node = narg->status_change_nid_normal;
>
> /*
> * If the node still has available memory. we need kmem_cache_node

I was wondering if this offline_node check is still necessary after this
patch, but as SLUB is still notified for N_HIGH_MEMORY-only nodes, it is
necessary.
On 4/1/25 11:27, Oscar Salvador wrote:
> There are at least four consumers of hotplug_memory_notifier that what they
> really are interested in is whether any numa node changed its state, e.g: going
> from being memory aware to becoming memoryless.
>
> Implement a specific notifier for numa nodes when their state gets changed,
> and have those consumers that only care about numa node state changes use it.
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

<snip>

> -static void node_states_set_node(int node, struct memory_notify *arg)
> +static void node_states_set_node(int node, struct node_notify *arg)
> {
> if (arg->status_change_nid_normal >= 0)
> node_set_state(node, N_NORMAL_MEMORY);
> @@ -1177,7 +1177,9 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
> int need_zonelists_rebuild = 0;
> const int nid = zone_to_nid(zone);
> int ret;
> - struct memory_notify arg;
> + struct memory_notify mem_arg;
> + struct node_notify node_arg;
> + bool cancel_mem_notifier_on_err = false, cancel_node_notifier_on_err = false;
>
> /*
> * {on,off}lining is constrained to full memory sections (or more
> @@ -1194,11 +1196,23 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
> /* associate pfn range with the zone */
> move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);
>
> - arg.start_pfn = pfn;
> - arg.nr_pages = nr_pages;
> - node_states_check_changes_online(nr_pages, zone, &arg);
> + mem_arg.start_pfn = pfn;
> + mem_arg.nr_pages = nr_pages;
> + node_states_check_changes_online(nr_pages, zone, &node_arg);
>
> - ret = memory_notify(MEM_GOING_ONLINE, &arg);
> + if (node_arg.status_change_nid >= 0) {

Hmm, don't we need to add "|| node_arg.status_change_nid_normal >= 0"? Or we
fail to notify addition of normal memory to a node that already has !normal
memory?

> + /* Node is becoming memory aware. Notify consumers */
> + cancel_node_notifier_on_err = true;
> + ret = node_notify(NODE_BECOMING_MEM_AWARE, &node_arg);
> + ret = notifier_to_errno(ret);
> + if (ret)
> + goto failed_addition;
> + }
> +
> + cancel_mem_notifier_on_err = true;
> + mem_arg.status_change_nid = node_arg.status_change_nid;
> + mem_arg.status_change_nid_normal = node_arg.status_change_nid_normal;
> + ret = memory_notify(MEM_GOING_ONLINE, &mem_arg);
> ret = notifier_to_errno(ret);
> if (ret)
> goto failed_addition;
> @@ -1224,7 +1238,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
> online_pages_range(pfn, nr_pages);
> adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);
>
> - node_states_set_node(nid, &arg);
> + node_states_set_node(nid, &node_arg);
> if (need_zonelists_rebuild)
> build_all_zonelists(NULL);
>
> @@ -1245,16 +1259,26 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
> kswapd_run(nid);
> kcompactd_run(nid);
>
> + if (node_arg.status_change_nid >= 0)
> + /*
> + * Node went from memoryless to have memory. Notifiy interested
> + * consumers
> + */
> + node_notify(NODE_BECAME_MEM_AWARE, &node_arg);
> +
> writeback_set_ratelimit();
>
> - memory_notify(MEM_ONLINE, &arg);
> + memory_notify(MEM_ONLINE, &mem_arg);
> return 0;
>
> failed_addition:
> pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
> (unsigned long long) pfn << PAGE_SHIFT,
> (((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
> - memory_notify(MEM_CANCEL_ONLINE, &arg);
> + if (cancel_node_notifier_on_err)
> + node_notify(NODE_CANCEL_MEM_AWARE, &node_arg);
> + if (cancel_mem_notifier_on_err)
> + memory_notify(MEM_CANCEL_ONLINE, &mem_arg);

Switch the order of those just for symmetry? :)

> remove_pfn_range_from_zone(zone, pfn, nr_pages);
> return ret;
> }
> @@ -1898,7 +1922,7 @@ early_param("movable_node", cmdline_parse_movable_node);
>
> /* check which state of node_states will be changed when offline memory */
> static void node_states_check_changes_offline(unsigned long nr_pages,
> - struct zone *zone, struct memory_notify *arg)
> + struct zone *zone, struct node_notify *arg)
> {
> struct pglist_data *pgdat = zone->zone_pgdat;
> unsigned long present_pages = 0;
> @@ -1935,7 +1959,7 @@ static void node_states_check_changes_offline(unsigned long nr_pages,
> arg->status_change_nid = zone_to_nid(zone);
> }
>
> -static void node_states_clear_node(int node, struct memory_notify *arg)
> +static void node_states_clear_node(int node, struct node_notify *arg)
> {
> if (arg->status_change_nid_normal >= 0)
> node_clear_state(node, N_NORMAL_MEMORY);
> @@ -1963,7 +1987,9 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> unsigned long pfn, managed_pages, system_ram_pages = 0;
> const int node = zone_to_nid(zone);
> unsigned long flags;
> - struct memory_notify arg;
> + struct memory_notify mem_arg;
> + struct node_notify node_arg;
> + bool cancel_mem_notifier_on_err = false, cancel_node_notifier_on_err = false;
> char *reason;
> int ret;
>
> @@ -2022,11 +2048,22 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> goto failed_removal_pcplists_disabled;
> }
>
> - arg.start_pfn = start_pfn;
> - arg.nr_pages = nr_pages;
> - node_states_check_changes_offline(nr_pages, zone, &arg);
> + mem_arg.start_pfn = start_pfn;
> + mem_arg.nr_pages = nr_pages;
> + node_states_check_changes_offline(nr_pages, zone, &node_arg);
> +
> + if (node_arg.status_change_nid >= 0) {

Ditto.

> + cancel_node_notifier_on_err = true;
> + ret = node_notify(NODE_BECOMING_MEMORYLESS, &node_arg);
> + ret = notifier_to_errno(ret);
> + if (ret)
> + goto failed_removal_isolated;
> + }
>
> - ret = memory_notify(MEM_GOING_OFFLINE, &arg);
> + cancel_mem_notifier_on_err = true;
> + mem_arg.status_change_nid = node_arg.status_change_nid;
> + mem_arg.status_change_nid_normal = node_arg.status_change_nid_normal;
> + ret = memory_notify(MEM_GOING_OFFLINE, &mem_arg);
> ret = notifier_to_errno(ret);
> if (ret) {
> reason = "notifier failure";
> @@ -2106,27 +2143,32 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> * Make sure to mark the node as memory-less before rebuilding the zone
> * list. Otherwise this node would still appear in the fallback lists.
> */
> - node_states_clear_node(node, &arg);
> + node_states_clear_node(node, &node_arg);
> if (!populated_zone(zone)) {
> zone_pcp_reset(zone);
> build_all_zonelists(NULL);
> }
>
> - if (arg.status_change_nid >= 0) {
> + if (node_arg.status_change_nid >= 0) {
> kcompactd_stop(node);
> kswapd_stop(node);
> + /*Node went memoryless. Notifiy interested consumers */
> + node_notify(NODE_BECAME_MEMORYLESS, &node_arg);
> }
>
> writeback_set_ratelimit();
>
> - memory_notify(MEM_OFFLINE, &arg);
> + memory_notify(MEM_OFFLINE, &mem_arg);
> remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
> return 0;
>
> failed_removal_isolated:
> /* pushback to free area */
> undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> - memory_notify(MEM_CANCEL_OFFLINE, &arg);
> + if (cancel_node_notifier_on_err)
> + node_notify(NODE_CANCEL_MEMORYLESS, &node_arg);
> + if (cancel_mem_notifier_on_err)
> + memory_notify(MEM_CANCEL_OFFLINE, &mem_arg);

Ditto.

> failed_removal_pcplists_disabled:
> lru_cache_enable();
> zone_pcp_enable(zone);
> diff --git a/mm/slub.c b/mm/slub.c
> index 184fd2b14758..74350f6c8ddd 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -5928,10 +5928,10 @@ static int slab_mem_going_offline_callback(void *arg)
>
> static void slab_mem_offline_callback(void *arg)
> {
> - struct memory_notify *marg = arg;
> + struct node_notify *narg = arg;
> int offline_node;
>
> - offline_node = marg->status_change_nid_normal;
> + offline_node = narg->status_change_nid_normal;
>
> /*
> * If the node still has available memory. we need kmem_cache_node
> @@ -5954,8 +5954,8 @@ static int slab_mem_going_online_callback(void *arg)
> {
> struct kmem_cache_node *n;
> struct kmem_cache *s;
> - struct memory_notify *marg = arg;
> - int nid = marg->status_change_nid_normal;
> + struct node_notify *narg = arg;
> + int nid = narg->status_change_nid_normal;
> int ret = 0;
>
> /*
> @@ -6007,18 +6007,18 @@ static int slab_memory_callback(struct notifier_block *self,
> int ret = 0;
>
> switch (action) {
> - case MEM_GOING_ONLINE:
> + case NODE_BECOMING_MEM_AWARE:
> ret = slab_mem_going_online_callback(arg);
> break;
> - case MEM_GOING_OFFLINE:
> + case NODE_BECOMING_MEMORYLESS:
> ret = slab_mem_going_offline_callback(arg);
> break;
> - case MEM_OFFLINE:
> - case MEM_CANCEL_ONLINE:
> + case NODE_BECAME_MEMORYLESS:
> + case NODE_CANCEL_MEM_AWARE:
> slab_mem_offline_callback(arg);
> break;
> - case MEM_ONLINE:
> - case MEM_CANCEL_OFFLINE:
> + case NODE_BECAME_MEM_AWARE:
> + case NODE_CANCEL_MEMORYLESS:
> break;
> }
> if (ret)
> @@ -6094,7 +6094,7 @@ void __init kmem_cache_init(void)
> sizeof(struct kmem_cache_node),
> SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0);
>
> - hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
> + hotplug_node_notifier(slab_memory_callback, SLAB_CALLBACK_PRI);
>
> /* Able to allocate the per node structures */
> slab_state = PARTIAL;
On Wed, Apr 02, 2025 at 06:03:04PM +0200, Vlastimil Babka wrote:
> On 4/1/25 11:27, Oscar Salvador wrote:
> > - arg.start_pfn = pfn;
> > - arg.nr_pages = nr_pages;
> > - node_states_check_changes_online(nr_pages, zone, &arg);
> > + mem_arg.start_pfn = pfn;
> > + mem_arg.nr_pages = nr_pages;
> > + node_states_check_changes_online(nr_pages, zone, &node_arg);
> >
> > - ret = memory_notify(MEM_GOING_ONLINE, &arg);
> > + if (node_arg.status_change_nid >= 0) {
>
> Hmm, don't we need to add "|| node_arg.status_change_nid_normal >= 0"? Or we
> fail to notify addition of normal memory to a node that already has !normal
> memory?

Ah, I think you are right, nice catch.

Yes, if we add normal (<= ZONE_NORMAL) memory to a node that only has
ZONE_MOVABLE, status_change_nid_normal will change, but status_change_nid
will not.

So yes, we need a "|| node_arg.status_change_nid_normal >= 0".

> > - memory_notify(MEM_CANCEL_ONLINE, &arg);
> > + if (cancel_node_notifier_on_err)
> > + node_notify(NODE_CANCEL_MEM_AWARE, &node_arg);
> > + if (cancel_mem_notifier_on_err)
> > + memory_notify(MEM_CANCEL_ONLINE, &mem_arg);
>
> Switch the order of those just for symmetry? :)

Will do.

> > - arg.start_pfn = start_pfn;
> > - arg.nr_pages = nr_pages;
> > - node_states_check_changes_offline(nr_pages, zone, &arg);
> > + mem_arg.start_pfn = start_pfn;
> > + mem_arg.nr_pages = nr_pages;
> > + node_states_check_changes_offline(nr_pages, zone, &node_arg);
> > +
> > + if (node_arg.status_change_nid >= 0) {
>
> Ditto.

Yap.

> > failed_removal_isolated:
> > /* pushback to free area */
> > undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> > - memory_notify(MEM_CANCEL_OFFLINE, &arg);
> > + if (cancel_node_notifier_on_err)
> > + node_notify(NODE_CANCEL_MEMORYLESS, &node_arg);
> > + if (cancel_mem_notifier_on_err)
> > + memory_notify(MEM_CANCEL_OFFLINE, &mem_arg);
>
> Ditto.

Copy that.

Thanks Vlastimil!
On Tue, 1 Apr 2025 11:27:15 +0200
Oscar Salvador <osalvador@suse.de> wrote:

> There are at least four consumers of hotplug_memory_notifier that what they
> really are interested in is whether any numa node changed its state, e.g: going
> from being memory aware to becoming memoryless.

Cover letter says 5. Whilst that's at least 4, maybe update this if you do
a v2 to say at least 5 :)

> Implement a specific notifier for numa nodes when their state gets changed,
> and have those consumers that only care about numa node state changes use it.
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

A couple of trivial things below. To me this looks fine otherwise.

Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>

> +#define hotplug_node_notifier(fn, pri) ({ \
> + static __meminitdata struct notifier_block fn##_node_nb =\
> + { .notifier_call = fn, .priority = pri };\
> + register_node_notifier(&fn##_node_nb); \
> +})

Trivial but spacing before \ seems rather random. Maybe I'm missing how it
is consistent.

> +
> #ifdef CONFIG_NUMA
> void memory_block_add_nid(struct memory_block *mem, int nid,
> enum meminit_context context);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 75401866fb76..4bb9ff282ec9 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -2106,27 +2143,32 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> * Make sure to mark the node as memory-less before rebuilding the zone
> * list. Otherwise this node would still appear in the fallback lists.
> */
> - node_states_clear_node(node, &arg);
> + node_states_clear_node(node, &node_arg);
> if (!populated_zone(zone)) {
> zone_pcp_reset(zone);
> build_all_zonelists(NULL);
> }
>
> - if (arg.status_change_nid >= 0) {
> + if (node_arg.status_change_nid >= 0) {
> kcompactd_stop(node);
> kswapd_stop(node);
> + /*Node went memoryless. Notifiy interested consumers */

Trivial: missing space after *

> + node_notify(NODE_BECAME_MEMORYLESS, &node_arg);
> }
>
> writeback_set_ratelimit();
>
> - memory_notify(MEM_OFFLINE, &arg);
> + memory_notify(MEM_OFFLINE, &mem_arg);
> remove_pfn_range_from_zone(zone, start_pfn, nr_pages);
> return 0;
>
> failed_removal_isolated:
> /* pushback to free area */
> undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE);
> - memory_notify(MEM_CANCEL_OFFLINE, &arg);
> + if (cancel_node_notifier_on_err)
> + node_notify(NODE_CANCEL_MEMORYLESS, &node_arg);
> + if (cancel_mem_notifier_on_err)
> + memory_notify(MEM_CANCEL_OFFLINE, &mem_arg);
> failed_removal_pcplists_disabled:
> lru_cache_enable();
> zone_pcp_enable(zone);
On 01.04.25 11:27, Oscar Salvador wrote:
> There are at least four consumers of hotplug_memory_notifier that what they
> really are interested in is whether any numa node changed its state, e.g: going
> from being memory aware to becoming memoryless.
>
> Implement a specific notifier for numa nodes when their state gets changed,
> and have those consumers that only care about numa node state changes use it.
>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
> drivers/acpi/numa/hmat.c | 6 +--
> drivers/base/node.c | 19 +++++++++
> drivers/cxl/core/region.c | 14 +++---
> drivers/cxl/cxl.h | 4 +-
> include/linux/memory.h | 38 ++++++++++++++++++
> kernel/cgroup/cpuset.c | 2 +-
> mm/memory-tiers.c | 8 ++--
> mm/memory_hotplug.c | 84 +++++++++++++++++++++++++++++----------
> mm/slub.c | 22 +++++-----
> 9 files changed, 148 insertions(+), 49 deletions(-)
>
> diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
> index bfbb08b1e6af..d18f3efa2149 100644
> --- a/drivers/acpi/numa/hmat.c
> +++ b/drivers/acpi/numa/hmat.c
> @@ -918,10 +918,10 @@ static int hmat_callback(struct notifier_block *self,
> unsigned long action, void *arg)
> {
> struct memory_target *target;
> - struct memory_notify *mnb = arg;
> + struct node_notify *mnb = arg;
> int pxm, nid = mnb->status_change_nid;
>
> - if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
> + if (nid == NUMA_NO_NODE || action != NODE_BECAME_MEM_AWARE)
> return NOTIFY_OK;
>
> pxm = node_to_pxm(nid);
> @@ -1074,7 +1074,7 @@ static __init int hmat_init(void)
> hmat_register_targets();
>
> /* Keep the table and structures if the notifier may use them */
> - if (hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
> + if (hotplug_node_notifier(hmat_callback, HMAT_CALLBACK_PRI))
> goto out_put;
>
> if (!hmat_set_default_dram_perf())
> diff --git a/drivers/base/node.c b/drivers/base/node.c
> index 0ea653fa3433..182c71dfb5b8 100644
> --- a/drivers/base/node.c
> +++ b/drivers/base/node.c
> @@ -110,6 +110,25 @@ static const struct attribute_group *node_access_node_groups[] = {
> NULL,
> };
>
> +static BLOCKING_NOTIFIER_HEAD(node_chain);
> +
> +int register_node_notifier(struct notifier_block *nb)
> +{
> + return blocking_notifier_chain_register(&node_chain, nb);
> +}
> +EXPORT_SYMBOL(register_node_notifier);
> +
> +void unregister_node_notifier(struct notifier_block *nb)
> +{
> + blocking_notifier_chain_unregister(&node_chain, nb);
> +}
> +EXPORT_SYMBOL(unregister_node_notifier);
> +
> +int node_notify(unsigned long val, void *v)
> +{
> + return blocking_notifier_call_chain(&node_chain, val, v);
> +}
> +
> static void node_remove_accesses(struct node *node)
> {
> struct node_access_nodes *c, *cnext;
> diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
> index e8d11a988fd9..7d187088f557 100644
> --- a/drivers/cxl/core/region.c
> +++ b/drivers/cxl/core/region.c
> @@ -2409,12 +2409,12 @@ static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
> unsigned long action, void *arg)
> {
> struct cxl_region *cxlr = container_of(nb, struct cxl_region,
> - memory_notifier);
> - struct memory_notify *mnb = arg;
> + node_notifier);
> + struct node_notify *mnb = arg;
> int nid = mnb->status_change_nid;
> int region_nid;
>
> - if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
> + if (nid == NUMA_NO_NODE || action != NODE_BECAME_MEM_AWARE)
> return NOTIFY_DONE;
>
> /*
> @@ -3388,7 +3388,7 @@ static void shutdown_notifiers(void *_cxlr)
> {
> struct cxl_region *cxlr = _cxlr;
>
> - unregister_memory_notifier(&cxlr->memory_notifier);
> + unregister_node_notifier(&cxlr->node_notifier);
> unregister_mt_adistance_algorithm(&cxlr->adist_notifier);
> }
>
> @@ -3427,9 +3427,9 @@ static int cxl_region_probe(struct device *dev)
> if (rc)
> return rc;
>
> - cxlr->memory_notifier.notifier_call = cxl_region_perf_attrs_callback;
> - cxlr->memory_notifier.priority = CXL_CALLBACK_PRI;
> - register_memory_notifier(&cxlr->memory_notifier);
> + cxlr->node_notifier.notifier_call = cxl_region_perf_attrs_callback;
> + cxlr->node_notifier.priority = CXL_CALLBACK_PRI;
> + register_node_notifier(&cxlr->node_notifier);
>
> cxlr->adist_notifier.notifier_call = cxl_region_calculate_adistance;
> cxlr->adist_notifier.priority = 100;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index bbbaa0d0a670..d4c9a499de7a 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -532,7 +532,7 @@ struct cxl_region_params {
> * @flags: Region state flags
> * @params: active + config params for the region
> * @coord: QoS access coordinates for the region
> - * @memory_notifier: notifier for setting the access coordinates to node
> + * @node_notifier: notifier for setting the access coordinates to node
> * @adist_notifier: notifier for calculating the abstract distance of node
> */
> struct cxl_region {
> @@ -545,7 +545,7 @@ struct cxl_region {
> unsigned long flags;
> struct cxl_region_params params;
> struct access_coordinate coord[ACCESS_COORDINATE_MAX];
> - struct notifier_block memory_notifier;
> + struct notifier_block node_notifier;
> struct notifier_block adist_notifier;
> };
>
> diff --git a/include/linux/memory.h b/include/linux/memory.h
> index 12daa6ec7d09..1d814dfbb8a8 100644
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -99,6 +99,14 @@ int set_memory_block_size_order(unsigned int order);
> #define MEM_PREPARE_ONLINE (1<<6)
> #define MEM_FINISH_OFFLINE (1<<7)
>
> +/* These states are used for numa node notifiers */
> +#define NODE_BECOMING_MEM_AWARE (1<<0)
> +#define NODE_BECAME_MEM_AWARE (1<<1)
> +#define NODE_BECOMING_MEMORYLESS (1<<2)
> +#define NODE_BECAME_MEMORYLESS (1<<3)
> +#define NODE_CANCEL_MEM_AWARE (1<<4)
> +#define NODE_CANCEL_MEMORYLESS (1<<5)

Assuming we can remove the _normal stuff and we can do what we do in patch
#2 here already ... meaning we unconditionally store the nid in the MEM
notifier ...

What about extending the existing memory notifier instead? That is, we add
MEM_NODE_BECOMING_MEM_AWARE ... and trigger it using the same notifier
chain. We only have to make sure these new events will be properly
filtered out (IIRC, for most we do that already).

Of course, the range will not apply to these events, but the nid would
apply to all.

Just a thought.
diff --git a/drivers/acpi/numa/hmat.c b/drivers/acpi/numa/hmat.c
index bfbb08b1e6af..d18f3efa2149 100644
--- a/drivers/acpi/numa/hmat.c
+++ b/drivers/acpi/numa/hmat.c
@@ -918,10 +918,10 @@ static int hmat_callback(struct notifier_block *self,
unsigned long action, void *arg)
{
struct memory_target *target;
- struct memory_notify *mnb = arg;
+ struct node_notify *mnb = arg;
int pxm, nid = mnb->status_change_nid;

- if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+ if (nid == NUMA_NO_NODE || action != NODE_BECAME_MEM_AWARE)
return NOTIFY_OK;

pxm = node_to_pxm(nid);
@@ -1074,7 +1074,7 @@ static __init int hmat_init(void)
hmat_register_targets();

/* Keep the table and structures if the notifier may use them */
- if (hotplug_memory_notifier(hmat_callback, HMAT_CALLBACK_PRI))
+ if (hotplug_node_notifier(hmat_callback, HMAT_CALLBACK_PRI))
goto out_put;

if (!hmat_set_default_dram_perf())
diff --git a/drivers/base/node.c b/drivers/base/node.c
index 0ea653fa3433..182c71dfb5b8 100644
--- a/drivers/base/node.c
+++ b/drivers/base/node.c
@@ -110,6 +110,25 @@ static const struct attribute_group *node_access_node_groups[] = {
NULL,
};

+static BLOCKING_NOTIFIER_HEAD(node_chain);
+
+int register_node_notifier(struct notifier_block *nb)
+{
+ return blocking_notifier_chain_register(&node_chain, nb);
+}
+EXPORT_SYMBOL(register_node_notifier);
+
+void unregister_node_notifier(struct notifier_block *nb)
+{
+ blocking_notifier_chain_unregister(&node_chain, nb);
+}
+EXPORT_SYMBOL(unregister_node_notifier);
+
+int node_notify(unsigned long val, void *v)
+{
+ return blocking_notifier_call_chain(&node_chain, val, v);
+}
+
static void node_remove_accesses(struct node *node)
{
struct node_access_nodes *c, *cnext;
diff --git a/drivers/cxl/core/region.c b/drivers/cxl/core/region.c
index e8d11a988fd9..7d187088f557 100644
--- a/drivers/cxl/core/region.c
+++ b/drivers/cxl/core/region.c
@@ -2409,12 +2409,12 @@ static int cxl_region_perf_attrs_callback(struct notifier_block *nb,
unsigned long action, void *arg)
{
struct cxl_region *cxlr = container_of(nb, struct cxl_region,
- memory_notifier);
- struct memory_notify *mnb = arg;
+ node_notifier);
+ struct node_notify *mnb = arg;
int nid = mnb->status_change_nid;
int region_nid;

- if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+ if (nid == NUMA_NO_NODE || action != NODE_BECAME_MEM_AWARE)
return NOTIFY_DONE;

/*
@@ -3388,7 +3388,7 @@ static void shutdown_notifiers(void *_cxlr)
{
struct cxl_region *cxlr = _cxlr;

- unregister_memory_notifier(&cxlr->memory_notifier);
+ unregister_node_notifier(&cxlr->node_notifier);
unregister_mt_adistance_algorithm(&cxlr->adist_notifier);
}

@@ -3427,9 +3427,9 @@ static int cxl_region_probe(struct device *dev)
if (rc)
return rc;

- cxlr->memory_notifier.notifier_call = cxl_region_perf_attrs_callback;
- cxlr->memory_notifier.priority = CXL_CALLBACK_PRI;
- register_memory_notifier(&cxlr->memory_notifier);
+ cxlr->node_notifier.notifier_call = cxl_region_perf_attrs_callback;
+ cxlr->node_notifier.priority = CXL_CALLBACK_PRI;
+ register_node_notifier(&cxlr->node_notifier);

cxlr->adist_notifier.notifier_call = cxl_region_calculate_adistance;
cxlr->adist_notifier.priority = 100;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index bbbaa0d0a670..d4c9a499de7a 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -532,7 +532,7 @@ struct cxl_region_params {
* @flags: Region state flags
* @params: active + config params for the region
* @coord: QoS access coordinates for the region
- * @memory_notifier: notifier for setting the access coordinates to node
+ * @node_notifier: notifier for setting the access coordinates to node
* @adist_notifier: notifier for calculating the abstract distance of node
*/
struct cxl_region {
@@ -545,7 +545,7 @@ struct cxl_region {
unsigned long flags;
struct cxl_region_params params;
struct access_coordinate coord[ACCESS_COORDINATE_MAX];
- struct notifier_block memory_notifier;
+ struct notifier_block node_notifier;
struct notifier_block adist_notifier;
};

diff --git a/include/linux/memory.h b/include/linux/memory.h
index 12daa6ec7d09..1d814dfbb8a8 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -99,6 +99,14 @@ int set_memory_block_size_order(unsigned int order);
#define MEM_PREPARE_ONLINE (1<<6)
#define MEM_FINISH_OFFLINE (1<<7)

+/* These states are used for numa node notifiers */
+#define NODE_BECOMING_MEM_AWARE (1<<0)
+#define NODE_BECAME_MEM_AWARE (1<<1)
+#define NODE_BECOMING_MEMORYLESS (1<<2)
+#define NODE_BECAME_MEMORYLESS (1<<3)
+#define NODE_CANCEL_MEM_AWARE (1<<4)
+#define NODE_CANCEL_MEMORYLESS (1<<5)
+
struct memory_notify {
/*
* The altmap_start_pfn and altmap_nr_pages fields are designated for
@@ -113,6 +121,11 @@ struct memory_notify {
int status_change_nid;
};

+struct node_notify {
+ int status_change_nid_normal;
+ int status_change_nid;
+};
+
struct notifier_block;
struct mem_section;

@@ -149,15 +162,34 @@ static inline int hotplug_memory_notifier(notifier_fn_t fn, int pri)
{
return 0;
}
+
+static inline int register_node_notifier(struct notifier_block *nb)
+{
+ return 0;
+}
+static inline void unregister_node_notifier(struct notifier_block *nb)
+{
+}
+static inline int node_notify(unsigned long val, void *v)
+{
+ return 0;
+}
+static inline int hotplug_node_notifier(notifier_fn_t fn, int pri)
+{
+ return 0;
+}
#else /* CONFIG_MEMORY_HOTPLUG */
extern int register_memory_notifier(struct notifier_block *nb);
+extern int register_node_notifier(struct notifier_block *nb);
extern void unregister_memory_notifier(struct notifier_block *nb);
+extern void unregister_node_notifier(struct notifier_block *nb);
int create_memory_block_devices(unsigned long start, unsigned long size,
struct vmem_altmap *altmap,
struct memory_group *group);
void remove_memory_block_devices(unsigned long start, unsigned long size);
extern void memory_dev_init(void);
extern int memory_notify(unsigned long val, void *v);
+extern int node_notify(unsigned long val, void *v);
extern struct memory_block *find_memory_block(unsigned long section_nr);

typedef int (*walk_memory_blocks_func_t)(struct memory_block *, void *);
extern int walk_memory_blocks(unsigned long start, unsigned long size,
@@ -177,6 +209,12 @@ int walk_dynamic_memory_groups(int nid, walk_memory_groups_func_t func,
register_memory_notifier(&fn##_mem_nb); \
})

+#define hotplug_node_notifier(fn, pri) ({ \
+ static __meminitdata struct notifier_block fn##_node_nb =\
+ { .notifier_call = fn, .priority = pri };\
+ register_node_notifier(&fn##_node_nb); \
+})
+
#ifdef CONFIG_NUMA
void memory_block_add_nid(struct memory_block *mem, int nid,
enum meminit_context context);
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0f910c828973..62a5d34c4331 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3939,7 +3939,7 @@ void __init cpuset_init_smp(void)
cpumask_copy(top_cpuset.effective_cpus, cpu_active_mask);
top_cpuset.effective_mems = node_states[N_MEMORY];

- hotplug_memory_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
+ hotplug_node_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);

cpuset_migrate_mm_wq = alloc_ordered_workqueue("cpuset_migrate_mm", 0);
BUG_ON(!cpuset_migrate_mm_wq);
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
index fc14fe53e9b7..dfe6c28c8352 100644
--- a/mm/memory-tiers.c
+++ b/mm/memory-tiers.c
@@ -872,7 +872,7 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
unsigned long action, void *_arg)
{
struct memory_tier *memtier;
- struct memory_notify *arg = _arg;
+ struct node_notify *arg = _arg;

/*
* Only update the node migration order when a node is
@@ -882,13 +882,13 @@ static int __meminit memtier_hotplug_callback(struct notifier_block *self,
return notifier_from_errno(0);

switch (action) {
- case MEM_OFFLINE:
+ case NODE_BECAME_MEMORYLESS:
mutex_lock(&memory_tier_lock);
if (clear_node_memory_tier(arg->status_change_nid))
establish_demotion_targets();
mutex_unlock(&memory_tier_lock);
break;
- case MEM_ONLINE:
+ case NODE_BECAME_MEM_AWARE:
mutex_lock(&memory_tier_lock);
memtier = set_node_memory_tier(arg->status_change_nid);
if (!IS_ERR(memtier))
@@ -929,7 +929,7 @@ static int __init memory_tier_init(void)
nodes_and(default_dram_nodes, node_states[N_MEMORY],
node_states[N_CPU]);

- hotplug_memory_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
+ hotplug_node_notifier(memtier_hotplug_callback, MEMTIER_HOTPLUG_PRI);
return 0;
}
subsys_initcall(memory_tier_init);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 75401866fb76..4bb9ff282ec9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -701,7 +701,7 @@ static void online_pages_range(unsigned long start_pfn, unsigned long nr_pages)

/* check which state of node_states will be changed when online memory */
static void node_states_check_changes_online(unsigned long nr_pages,
- struct zone *zone, struct memory_notify *arg)
+ struct zone *zone, struct node_notify *arg)
{
int nid = zone_to_nid(zone);

@@ -714,7 +714,7 @@ static void node_states_check_changes_online(unsigned long nr_pages,
arg->status_change_nid_normal = nid;
}

-static void node_states_set_node(int node, struct memory_notify *arg)
+static void node_states_set_node(int node, struct node_notify *arg)
{
if (arg->status_change_nid_normal >= 0)
node_set_state(node, N_NORMAL_MEMORY);
@@ -1177,7 +1177,9 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
int need_zonelists_rebuild = 0;
const int nid = zone_to_nid(zone);
int ret;
- struct memory_notify arg;
+ struct memory_notify mem_arg;
+ struct node_notify node_arg;
+ bool cancel_mem_notifier_on_err = false, cancel_node_notifier_on_err = false;

/*
* {on,off}lining is constrained to full memory sections (or more
@@ -1194,11 +1196,23 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
/* associate pfn range with the zone */
move_pfn_range_to_zone(zone, pfn, nr_pages, NULL, MIGRATE_ISOLATE);

- arg.start_pfn = pfn;
- arg.nr_pages = nr_pages;
- node_states_check_changes_online(nr_pages, zone, &arg);
+ mem_arg.start_pfn = pfn;
+ mem_arg.nr_pages = nr_pages;
+ node_states_check_changes_online(nr_pages, zone, &node_arg);

- ret = memory_notify(MEM_GOING_ONLINE, &arg);
+ if (node_arg.status_change_nid >= 0) {
+ /* Node is becoming memory aware. Notify consumers */
+ cancel_node_notifier_on_err = true;
+ ret = node_notify(NODE_BECOMING_MEM_AWARE, &node_arg);
+ ret = notifier_to_errno(ret);
+ if (ret)
+ goto failed_addition;
+ }
+
+ cancel_mem_notifier_on_err = true;
+ mem_arg.status_change_nid = node_arg.status_change_nid;
+ mem_arg.status_change_nid_normal = node_arg.status_change_nid_normal;
+ ret = memory_notify(MEM_GOING_ONLINE, &mem_arg);
ret = notifier_to_errno(ret);
if (ret)
goto failed_addition;
@@ -1224,7 +1238,7 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
online_pages_range(pfn, nr_pages);
adjust_present_page_count(pfn_to_page(pfn), group, nr_pages);

- node_states_set_node(nid, &arg);
+ node_states_set_node(nid, &node_arg);
if (need_zonelists_rebuild)
build_all_zonelists(NULL);

@@ -1245,16 +1259,26 @@ int online_pages(unsigned long pfn, unsigned long nr_pages,
kswapd_run(nid);
kcompactd_run(nid);

+ if (node_arg.status_change_nid >= 0)
+ /*
+ * Node went from memoryless to have memory. Notifiy interested
+ * consumers
+ */
+ node_notify(NODE_BECAME_MEM_AWARE, &node_arg);
+
writeback_set_ratelimit();

- memory_notify(MEM_ONLINE, &arg);
+ memory_notify(MEM_ONLINE, &mem_arg);
return 0;

failed_addition:
pr_debug("online_pages [mem %#010llx-%#010llx] failed\n",
(unsigned long long) pfn << PAGE_SHIFT,
(((unsigned long long) pfn + nr_pages) << PAGE_SHIFT) - 1);
- memory_notify(MEM_CANCEL_ONLINE, &arg);
+ if (cancel_node_notifier_on_err)
+ node_notify(NODE_CANCEL_MEM_AWARE, &node_arg);
+ if (cancel_mem_notifier_on_err)
+ memory_notify(MEM_CANCEL_ONLINE, &mem_arg);
remove_pfn_range_from_zone(zone, pfn, nr_pages);
return ret;
}
@@ -1898,7 +1922,7 @@ early_param("movable_node", cmdline_parse_movable_node);

/* check which state of node_states will be changed when offline memory */
static void node_states_check_changes_offline(unsigned long nr_pages,
- struct zone *zone, struct memory_notify *arg)
+ struct zone *zone, struct node_notify *arg)
{
struct pglist_data *pgdat = zone->zone_pgdat;
unsigned long present_pages = 0;
@@ -1935,7 +1959,7 @@ static void node_states_check_changes_offline(unsigned long nr_pages,
arg->status_change_nid = zone_to_nid(zone);
}

-static void node_states_clear_node(int node, struct memory_notify *arg)
+static void node_states_clear_node(int node, struct node_notify *arg)
{
if (arg->status_change_nid_normal >= 0)
node_clear_state(node, N_NORMAL_MEMORY);
@@ -1963,7 +1987,9 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
unsigned long pfn, managed_pages, system_ram_pages = 0;
const int node = zone_to_nid(zone);
unsigned long flags;
- struct memory_notify arg;
+ struct memory_notify mem_arg;
+ struct node_notify node_arg;
+ bool cancel_mem_notifier_on_err = false, cancel_node_notifier_on_err = false;
char *reason;
int ret;

@@ -2022,11 +2048,22 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
goto failed_removal_pcplists_disabled;
}

- arg.start_pfn = start_pfn;
- arg.nr_pages = nr_pages;
-
node_states_check_changes_offline(nr_pages, zone, &arg); + mem_arg.start_pfn = start_pfn; + mem_arg.nr_pages = nr_pages; + node_states_check_changes_offline(nr_pages, zone, &node_arg); + + if (node_arg.status_change_nid >= 0) { + cancel_node_notifier_on_err = true; + ret = node_notify(NODE_BECOMING_MEMORYLESS, &node_arg); + ret = notifier_to_errno(ret); + if (ret) + goto failed_removal_isolated; + } - ret = memory_notify(MEM_GOING_OFFLINE, &arg); + cancel_mem_notifier_on_err = true; + mem_arg.status_change_nid = node_arg.status_change_nid; + mem_arg.status_change_nid_normal = node_arg.status_change_nid_normal; + ret = memory_notify(MEM_GOING_OFFLINE, &mem_arg); ret = notifier_to_errno(ret); if (ret) { reason = "notifier failure"; @@ -2106,27 +2143,32 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages, * Make sure to mark the node as memory-less before rebuilding the zone * list. Otherwise this node would still appear in the fallback lists. */ - node_states_clear_node(node, &arg); + node_states_clear_node(node, &node_arg); if (!populated_zone(zone)) { zone_pcp_reset(zone); build_all_zonelists(NULL); } - if (arg.status_change_nid >= 0) { + if (node_arg.status_change_nid >= 0) { kcompactd_stop(node); kswapd_stop(node); + /*Node went memoryless. 
Notifiy interested consumers */ + node_notify(NODE_BECAME_MEMORYLESS, &node_arg); } writeback_set_ratelimit(); - memory_notify(MEM_OFFLINE, &arg); + memory_notify(MEM_OFFLINE, &mem_arg); remove_pfn_range_from_zone(zone, start_pfn, nr_pages); return 0; failed_removal_isolated: /* pushback to free area */ undo_isolate_page_range(start_pfn, end_pfn, MIGRATE_MOVABLE); - memory_notify(MEM_CANCEL_OFFLINE, &arg); + if (cancel_node_notifier_on_err) + node_notify(NODE_CANCEL_MEMORYLESS, &node_arg); + if (cancel_mem_notifier_on_err) + memory_notify(MEM_CANCEL_OFFLINE, &mem_arg); failed_removal_pcplists_disabled: lru_cache_enable(); zone_pcp_enable(zone); diff --git a/mm/slub.c b/mm/slub.c index 184fd2b14758..74350f6c8ddd 100644 --- a/mm/slub.c +++ b/mm/slub.c @@ -5928,10 +5928,10 @@ static int slab_mem_going_offline_callback(void *arg) static void slab_mem_offline_callback(void *arg) { - struct memory_notify *marg = arg; + struct node_notify *narg = arg; int offline_node; - offline_node = marg->status_change_nid_normal; + offline_node = narg->status_change_nid_normal; /* * If the node still has available memory. 
we need kmem_cache_node @@ -5954,8 +5954,8 @@ static int slab_mem_going_online_callback(void *arg) { struct kmem_cache_node *n; struct kmem_cache *s; - struct memory_notify *marg = arg; - int nid = marg->status_change_nid_normal; + struct node_notify *narg = arg; + int nid = narg->status_change_nid_normal; int ret = 0; /* @@ -6007,18 +6007,18 @@ static int slab_memory_callback(struct notifier_block *self, int ret = 0; switch (action) { - case MEM_GOING_ONLINE: + case NODE_BECOMING_MEM_AWARE: ret = slab_mem_going_online_callback(arg); break; - case MEM_GOING_OFFLINE: + case NODE_BECOMING_MEMORYLESS: ret = slab_mem_going_offline_callback(arg); break; - case MEM_OFFLINE: - case MEM_CANCEL_ONLINE: + case NODE_BECAME_MEMORYLESS: + case NODE_CANCEL_MEM_AWARE: slab_mem_offline_callback(arg); break; - case MEM_ONLINE: - case MEM_CANCEL_OFFLINE: + case NODE_BECAME_MEM_AWARE: + case NODE_CANCEL_MEMORYLESS: break; } if (ret) @@ -6094,7 +6094,7 @@ void __init kmem_cache_init(void) sizeof(struct kmem_cache_node), SLAB_HWCACHE_ALIGN | SLAB_NO_OBJ_EXT, 0, 0); - hotplug_memory_notifier(slab_memory_callback, SLAB_CALLBACK_PRI); + hotplug_node_notifier(slab_memory_callback, SLAB_CALLBACK_PRI); /* Able to allocate the per node structures */ slab_state = PARTIAL;