[0/2] Implement numa node notifier

Message ID 20250401092716.537512-1-osalvador@suse.de (mailing list archive)
Series Implement numa node notifier

Message

Oscar Salvador April 1, 2025, 9:27 a.m. UTC
Memory notifier is a tool that allows consumers to get notified whenever
memory gets onlined or offlined in the system.
Currently, there are 10 consumers of it, but 5 out of those 10 consumers
are only interested in getting notified when a numa node has changed its
state, that is, when it goes from memoryless to memory-aware or vice versa.

This means that for every {online,offline}_pages operation they get
notified even though the numa node might not have changed its state.

The first patch implements a numa node notifier that does just that, and has
those consumers register there, so they only get notified when a node
changes state.

The second patch replaces the 'status_change_nid{_normal}' fields within
memory_notify with a 'nid', as that is all we need for the memory
notifier, and updates its only user (page_ext).

Consumers that are only interested in numa node state changes are:

 - memory-tier
 - slub
 - cpuset
 - hmat
 - cxl


Oscar Salvador (2):
  mm,memory_hotplug: Implement numa node notifier
  mm,memory_hotplug: Replace status_change_nid parameter in
    memory_notify

 drivers/acpi/numa/hmat.c  |  6 +--
 drivers/base/node.c       | 19 +++++++++
 drivers/cxl/core/region.c | 14 +++----
 drivers/cxl/cxl.h         |  4 +-
 include/linux/memory.h    | 37 ++++++++++++++++++
 kernel/cgroup/cpuset.c    |  2 +-
 mm/memory-tiers.c         |  8 ++--
 mm/memory_hotplug.c       | 82 +++++++++++++++++++++++++++++----------
 mm/page_ext.c             | 12 +-----
 mm/slub.c                 | 22 +++++------
 10 files changed, 146 insertions(+), 60 deletions(-)

Comments

Vlastimil Babka April 2, 2025, 4:06 p.m. UTC | #1
On 4/1/25 11:27, Oscar Salvador wrote:
> Memory notifier is a tool that allows consumers to get notified whenever
> memory gets onlined or offlined in the system.
> Currently, there are 10 consumers of it, but 5 out of those 10 consumers
> are only interested in getting notified when a numa node has changed its
> state, that is, when it goes from memoryless to memory-aware or vice versa.
> 
> This means that for every {online,offline}_pages operation they get
> notified even though the numa node might not have changed its state.
> 
> The first patch implements a numa node notifier that does just that, and has
> those consumers register there, so they only get notified when a node
> changes state.

What if we had two chains:

register_node_notifier()
register_node_normal_notifier()

I think they could have shared the state #defines and struct node_notify
would have just one nid and be always >= 0.

Or would it add too much extra boilerplate and only slab cares?

> The second patch replaces the 'status_change_nid{_normal}' fields within
> memory_notify with a 'nid', as that is all we need for the memory
> notifier, and updates its only user (page_ext).
> 
> Consumers that are only interested in numa node state changes are:
> 
>  - memory-tier
>  - slub
>  - cpuset
>  - hmat
>  - cxl
> 
> 
> Oscar Salvador (2):
>   mm,memory_hotplug: Implement numa node notifier
>   mm,memory_hotplug: Replace status_change_nid parameter in
>     memory_notify
> 
>  drivers/acpi/numa/hmat.c  |  6 +--
>  drivers/base/node.c       | 19 +++++++++
>  drivers/cxl/core/region.c | 14 +++----
>  drivers/cxl/cxl.h         |  4 +-
>  include/linux/memory.h    | 37 ++++++++++++++++++
>  kernel/cgroup/cpuset.c    |  2 +-
>  mm/memory-tiers.c         |  8 ++--
>  mm/memory_hotplug.c       | 82 +++++++++++++++++++++++++++++----------
>  mm/page_ext.c             | 12 +-----
>  mm/slub.c                 | 22 +++++------
>  10 files changed, 146 insertions(+), 60 deletions(-)
>
Oscar Salvador April 2, 2025, 5:03 p.m. UTC | #2
On Wed, Apr 02, 2025 at 06:06:51PM +0200, Vlastimil Babka wrote:
> What if we had two chains:
> 
> register_node_notifier()
> register_node_normal_notifier()
> 
> I think they could have shared the state #defines and struct node_notify
> would have just one nid and be always >= 0.
> 
> Or would it add too much extra boilerplate and only slab cares?

We could indeed go in that direction to try to decouple
status_change_nid from status_change_nid_normal.

Although, as you said, slub is the only user of status_change_nid_normal
for the time being, so I am not sure about adding a second chain for only
one user.

It might look cleaner though, and the advantage is that slub would not get
notified for nodes acquiring only ZONE_MOVABLE.

Let us see what David thinks about it.

Thanks for the suggestion ;-)
Jonathan Cameron April 3, 2025, 12:29 p.m. UTC | #3
On Tue,  1 Apr 2025 11:27:14 +0200
Oscar Salvador <osalvador@suse.de> wrote:

> Memory notifier is a tool that allows consumers to get notified whenever
> memory gets onlined or offlined in the system.
> Currently, there are 10 consumers of it, but 5 out of those 10 consumers
> are only interested in getting notified when a numa node has changed its
> state, that is, when it goes from memoryless to memory-aware or vice versa.
> 
> This means that for every {online,offline}_pages operation they get
> notified even though the numa node might not have changed its state.
> 
> The first patch implements a numa node notifier that does just that, and has
> those consumers register there, so they only get notified when a node
> changes state.
> 
> The second patch replaces the 'status_change_nid{_normal}' fields within
> memory_notify with a 'nid', as that is all we need for the memory
> notifier, and updates its only user (page_ext).
> 
> Consumers that are only interested in numa node state changes are:
> 
>  - memory-tier
>  - slub
>  - cpuset
>  - hmat
>  - cxl
> 
Hi Oscar,

Idea seems good to me.

+CC linux-cxl for information of others not on the thread.

> 
> Oscar Salvador (2):
>   mm,memory_hotplug: Implement numa node notifier
>   mm,memory_hotplug: Replace status_change_nid parameter in
>     memory_notify
> 
>  drivers/acpi/numa/hmat.c  |  6 +--
>  drivers/base/node.c       | 19 +++++++++
>  drivers/cxl/core/region.c | 14 +++----
>  drivers/cxl/cxl.h         |  4 +-
>  include/linux/memory.h    | 37 ++++++++++++++++++
>  kernel/cgroup/cpuset.c    |  2 +-
>  mm/memory-tiers.c         |  8 ++--
>  mm/memory_hotplug.c       | 82 +++++++++++++++++++++++++++++----------
>  mm/page_ext.c             | 12 +-----
>  mm/slub.c                 | 22 +++++------
>  10 files changed, 146 insertions(+), 60 deletions(-)
>
David Hildenbrand April 3, 2025, 1:02 p.m. UTC | #4
On 02.04.25 19:03, Oscar Salvador wrote:
> On Wed, Apr 02, 2025 at 06:06:51PM +0200, Vlastimil Babka wrote:
>> What if we had two chains:
>>
>> register_node_notifier()
>> register_node_normal_notifier()
>>
>> I think they could have shared the state #defines and struct node_notify
>> would have just one nid and be always >= 0.
>>
>> Or would it add too much extra boilerplate and only slab cares?
> 
> We could indeed go in that direction to try to decouple
> status_change_nid from status_change_nid_normal.
> 
> Although, as you said, slub is the only user of status_change_nid_normal
> for the time being, so I am not sure about adding a second chain for only
> one user.
> 
> It might look cleaner though, and the advantage is that slub would not get
> notified for nodes acquiring only ZONE_MOVABLE.
> 
> Let us see what David thinks about it.

I'd hope we'd be able to get rid of the _normal stuff completely, it seems
way too specialized.

We added that in

commit b9d5ab2562eceeada5e4837a621b6260574dd11d
Author: Lai Jiangshan <laijs@cn.fujitsu.com>
Date:   Tue Dec 11 16:01:05 2012 -0800

     slub, hotplug: ignore unrelated node's hot-adding and hot-removing
     
     SLUB only focuses on the nodes which have normal memory and it ignores the
     other node's hot-adding and hot-removing.
     
     Aka: if some memory of a node which has no onlined memory is online, but
     this new memory onlined is not normal memory (for example, highmem), we
     should not allocate kmem_cache_node for SLUB.
     
     And if the last normal memory is offlined, but the node still has memory,
     we should remove kmem_cache_node for that node.  (The current code delays
     it when all of the memory is offlined)
     
     So we only do something when marg->status_change_nid_normal > 0.
     marg->status_change_nid is not suitable here.
     
     The same problem doesn't exist in SLAB, because SLAB allocates kmem_list3
     for every node even the node don't have normal memory, SLAB tolerates
     kmem_list3 on alien nodes.  SLUB only focuses on the nodes which have
     normal memory, it don't tolerate alien kmem_cache_node.  The patch makes
     SLUB become self-compatible and avoids WARNs and BUGs in rare conditions.


How "bad" would it be if we do the slab_mem_going_online_callback() etc even
for completely-movable nodes? I assume one kmem_cache_alloc() per slab_caches.

slab_mem_going_offline_callback() only does shrinking, #dontcare

Looking at slab_mem_offline_callback(), we never even free the caches either
way when offlining. So the implication would be that we would have movable-only nodes
set in slab_nodes.


We don't expect many such nodes, so ... do we care?
David Hildenbrand April 3, 2025, 1:08 p.m. UTC | #5
On 03.04.25 15:02, David Hildenbrand wrote:
> On 02.04.25 19:03, Oscar Salvador wrote:
>> On Wed, Apr 02, 2025 at 06:06:51PM +0200, Vlastimil Babka wrote:
>>> What if we had two chains:
>>>
>>> register_node_notifier()
>>> register_node_normal_notifier()
>>>
>>> I think they could have shared the state #defines and struct node_notify
>>> would have just one nid and be always >= 0.
>>>
>>> Or would it add too much extra boilerplate and only slab cares?
>>
>> We could indeed go in that direction to try to decouple
>> status_change_nid from status_change_nid_normal.
>>
>> Although, as you said, slub is the only user of status_change_nid_normal
>> for the time being, so I am not sure about adding a second chain for only
>> one user.
>>
>> It might look cleaner though, and the advantage is that slub would not get
>> notified for nodes acquiring only ZONE_MOVABLE.
>>
>> Let us see what David thinks about it.
> 
> I'd hope we'd be able to get rid of the _normal stuff completely, it seems
> way too specialized.
> 
> We added that in
> 
> commit b9d5ab2562eceeada5e4837a621b6260574dd11d
> Author: Lai Jiangshan <laijs@cn.fujitsu.com>
> Date:   Tue Dec 11 16:01:05 2012 -0800
> 
>       slub, hotplug: ignore unrelated node's hot-adding and hot-removing
>       
>       SLUB only focuses on the nodes which have normal memory and it ignores the
>       other node's hot-adding and hot-removing.
>       
>       Aka: if some memory of a node which has no onlined memory is online, but
>       this new memory onlined is not normal memory (for example, highmem), we
>       should not allocate kmem_cache_node for SLUB.
>       
>       And if the last normal memory is offlined, but the node still has memory,
>       we should remove kmem_cache_node for that node.  (The current code delays
>       it when all of the memory is offlined)
>       
>       So we only do something when marg->status_change_nid_normal > 0.
>       marg->status_change_nid is not suitable here.
>       
>       The same problem doesn't exist in SLAB, because SLAB allocates kmem_list3
>       for every node even the node don't have normal memory, SLAB tolerates
>       kmem_list3 on alien nodes.  SLUB only focuses on the nodes which have
>       normal memory, it don't tolerate alien kmem_cache_node.  The patch makes
>       SLUB become self-compatible and avoids WARNs and BUGs in rare conditions.
> 
> 
> How "bad" would it be if we do the slab_mem_going_online_callback() etc even
> for completely-movable nodes? I assume one kmem_cache_alloc() per slab_caches.
> 
> slab_mem_going_offline_callback() only does shrinking, #dontcare
> 
> Looking at slab_mem_offline_callback(), we never even free the caches either
> way when offlining. So the implication would be that we would have movable-only nodes
> set in slab_nodes.
> 
> 
> We don't expect many such nodes, so ... do we care?

BTW, isn't the description of slab_nodes wrong?

"Tracks for which NUMA nodes we have kmem_cache_nodes allocated." -- but 
as there is no freeing done in slab_mem_offline_callback(), isn't it 
always kept allocated?

(probably I am missing something)
Harry Yoo April 3, 2025, 1:57 p.m. UTC | #6
On Thu, Apr 03, 2025 at 03:08:18PM +0200, David Hildenbrand wrote:
> On 03.04.25 15:02, David Hildenbrand wrote:
> > On 02.04.25 19:03, Oscar Salvador wrote:
> > > On Wed, Apr 02, 2025 at 06:06:51PM +0200, Vlastimil Babka wrote:
> > > > What if we had two chains:
> > > > 
> > > > register_node_notifier()
> > > > register_node_normal_notifier()
> > > > 
> > > > I think they could have shared the state #defines and struct node_notify
> > > > would have just one nid and be always >= 0.
> > > > 
> > > > Or would it add too much extra boilerplate and only slab cares?
> > > 
> > > We could indeed go in that direction to try to decouple
> > > status_change_nid from status_change_nid_normal.
> > > 
> > > Although, as you said, slub is the only user of status_change_nid_normal
> > > for the time being, so I am not sure about adding a second chain for only
> > > one user.
> > > 
> > > It might look cleaner though, and the advantage is that slub would not get
> > > notified for nodes acquiring only ZONE_MOVABLE.
> > > 
> > > Let us see what David thinks about it.
> > 
> > I'd hope we'd be able to get rid of the _normal stuff completely, it seems
> > way too specialized.
> > 
> > We added that in
> > 
> > commit b9d5ab2562eceeada5e4837a621b6260574dd11d
> > Author: Lai Jiangshan <laijs@cn.fujitsu.com>
> > Date:   Tue Dec 11 16:01:05 2012 -0800
> > 
> >       slub, hotplug: ignore unrelated node's hot-adding and hot-removing
> >       SLUB only focuses on the nodes which have normal memory and it ignores the
> >       other node's hot-adding and hot-removing.
> >       Aka: if some memory of a node which has no onlined memory is online, but
> >       this new memory onlined is not normal memory (for example, highmem), we
> >       should not allocate kmem_cache_node for SLUB.
> >       And if the last normal memory is offlined, but the node still has memory,
> >       we should remove kmem_cache_node for that node.  (The current code delays
> >       it when all of the memory is offlined)
> >       So we only do something when marg->status_change_nid_normal > 0.
> >       marg->status_change_nid is not suitable here.
> >       The same problem doesn't exist in SLAB, because SLAB allocates kmem_list3
> >       for every node even the node don't have normal memory, SLAB tolerates
> >       kmem_list3 on alien nodes.  SLUB only focuses on the nodes which have
> >       normal memory, it don't tolerate alien kmem_cache_node.  The patch makes
> >       SLUB become self-compatible and avoids WARNs and BUGs in rare conditions.
> > 
> > 
> > How "bad" would it be if we do the slab_mem_going_online_callback() etc even
> > for completely-movable nodes? I assume one kmem_cache_alloc() per slab_caches.
> > 
> > slab_mem_going_offline_callback() only does shrinking, #dontcare
> > 
> > Looking at slab_mem_offline_callback(), we never even free the caches either
> > way when offlining. So the implication would be that we would have movable-only nodes
> > set in slab_nodes.
> > 
> > 
> > We don't expect many such nodes, so ... do we care?
> 
> BTW, isn't the description of slab_nodes wrong?
> 
> "Tracks for which NUMA nodes we have kmem_cache_nodes allocated." -- but as
> there is no freeing done in slab_mem_offline_callback(), isn't it always
> kept allocated?

It was, but not anymore :)

I think this patch series [1] overlooked the fact that it changed the meaning
from 'NUMA nodes that have kmem_cache_node' to 'NUMA nodes that have normal
memory (that can be allocated as slab memory)'?

[1] https://lore.kernel.org/all/20210113131634.3671-1-vbabka@suse.cz

> 
> (probably I am missing something)
> 