[0/4] mm/mempolicy: Add memory hotplug support in weighted interleave

Message ID 20250307063534.540-1-rakie.kim@sk.com (mailing list archive)

Message

Rakie Kim March 7, 2025, 6:35 a.m. UTC
This patch series enhances the weighted interleave policy in mempolicy
to support memory hotplug, ensuring that newly added memory nodes are
properly recognized and integrated into the weighted interleave mechanism.

The weighted interleave policy distributes page allocations across
multiple NUMA nodes based on their performance weight, optimizing memory
bandwidth utilization. The weight values for each node are configured
through sysfs. However, the existing implementation only created sysfs
entries at initialization, leading to the following issues:

- Unnecessary sysfs entries: Nodes without memory were included in sysfs
  at boot.
- Missing hotplug support: Nodes that became online after initialization
  were not recognized, causing incomplete interleave configurations.

To resolve these issues, the first patch introduces two key changes
(a sketch of the approach follows below):

- Filtered sysfs creation at initialization: Only nodes that are online
  and have memory are registered.
- Dynamic sysfs updates for hotplugged nodes: New memory nodes are
  recognized and integrated via the memory hotplug mechanism.

Subsequent patches refine this functionality:

- Patch 2: Enables sysfs registration for memory nodes added via hotplug.
- Patch 3: Fixes a race condition that caused duplicate sysfs entries when
  registering interleave settings.
- Patch 4: Ensures proper deallocation of kobjects and memory, preventing
  resource leaks in mempolicy_sysfs_init().
With these changes, the weighted interleave policy can dynamically adapt
to memory hotplug events, improving NUMA memory management and system
stability.
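
For reference, the overall shape of the mechanism is roughly the
following. This is only a sketch of the approach: sysfs_wi_node_add()
and sysfs_wi_node_release() are placeholder names for the routines that
create and remove a node's weight attribute, not the exact symbols used
in the patches.

/* Sketch only; the two sysfs helpers are placeholder stubs. */
#include <linux/init.h>
#include <linux/memory.h>
#include <linux/nodemask.h>
#include <linux/notifier.h>

static void sysfs_wi_node_add(int nid) { /* create the nodeN attribute */ }
static void sysfs_wi_node_release(int nid) { /* remove the nodeN attribute */ }

static int wi_node_notifier(struct notifier_block *nb,
                            unsigned long action, void *arg)
{
        struct memory_notify *mnb = arg;
        int nid = mnb->status_change_nid;

        /* No node changed its N_MEMORY state for this event. */
        if (nid < 0)
                return NOTIFY_OK;

        switch (action) {
        case MEM_ONLINE:
                /* The node gained its first memory: expose its entry. */
                sysfs_wi_node_add(nid);
                break;
        case MEM_OFFLINE:
                /* The node lost its last memory block: remove the entry. */
                sysfs_wi_node_release(nid);
                break;
        }
        return NOTIFY_OK;
}

static int __init wi_sysfs_init(void)
{
        int nid;

        /* At init, register only nodes that already have memory... */
        for_each_node_state(nid, N_MEMORY)
                sysfs_wi_node_add(nid);

        /* ...and catch later transitions through the hotplug notifier. */
        hotplug_memory_notifier(wi_node_notifier, DEFAULT_CALLBACK_PRI);
        return 0;
}
late_initcall(wi_sysfs_init);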

Patch Summary

[PATCH 1/4] mm/mempolicy: Support memory hotplug in weighted interleave
  Adds dynamic sysfs integration for memory hotplug in weighted interleave.

[PATCH 2/4] mm/mempolicy: Enable sysfs support for memory hotplug in
            weighted interleave
  Implements sysfs attribute registration for newly detected memory nodes.

[PATCH 3/4] mm/mempolicy: Fix duplicate node addition in sysfs for
            weighted interleave
  Prevents redundant sysfs entries when configuring interleave settings.

[PATCH 4/4] mm/mempolicy: Fix memory leaks in mempolicy_sysfs_init()
  Ensures proper kobject and memory deallocation to prevent resource leaks.

These patches have been tested to ensure correct memory node detection,
proper sysfs updates, and stability improvements in memory hotplug scenarios.

 mm/mempolicy.c | 172 +++++++++++++++++++++++++++++++++++--------------
 1 file changed, 122 insertions(+), 50 deletions(-)


base-commit: 7eb172143d5508b4da468ed59ee857c6e5e01da6

Comments

Gregory Price March 7, 2025, 3:56 p.m. UTC | #1
On Fri, Mar 07, 2025 at 03:35:29PM +0900, Rakie Kim wrote:
> Unnecessary sysfs entries: Nodes without memory were included in sysfs
> at boot.
> Missing hotplug support: Nodes that became online after initialization
> were not recognized, causing incomplete interleave configurations.

This comment is misleading.  Nodes can "come online" but they are
absolutely detected during init - as nodes cannot be "hotplugged"
themselves.  Resources can be added *to* nodes, but nodes themselves
cannot be added or removed.

I think what you're trying to say here is:

1) The current system creates 1 entry per possible node (explicitly)
2) Not all nodes may have memory at all times (memory can be hotplugged)
3) It would be nice to let sysfs and weighted interleave omit memoryless
   nodes until those nodes had memory hotplugged into them.

> Dynamic sysfs updates for hotplugged nodes  New memory nodes are
> recognized and integrated via the memory hotplug mechanism.
> Subsequent patches refine this functionality:
>

Just going to reiterate that there's no such thing as a hotplug node
or "new nodes" - only nodes that have their attributes changed (i.e.
!N_MEMORY -> N_MEMORY).  The node exists, it may just not have anything
associated with it.
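
To make the distinction concrete, here is a minimal sketch using the
in-kernel nodemask helpers (the messages are illustrative only):

#include <linux/nodemask.h>
#include <linux/printk.h>

/*
 * Every node in N_POSSIBLE exists for the lifetime of the system; what
 * memory hotplug changes is only the per-node state bit, i.e.
 * !N_MEMORY -> N_MEMORY when memory is onlined, and back again when the
 * last block is removed.
 */
static void dump_node_states(void)
{
        int nid;

        for_each_node_state(nid, N_POSSIBLE) {
                if (node_state(nid, N_MEMORY))
                        pr_info("node %d: has memory online\n", nid);
                else
                        pr_info("node %d: exists, currently memoryless\n", nid);
        }
}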

Maybe semantic nits, but it matters.  The nodes are present and can be
operated on before memory comes online, and that has implications for
users.  Depending on how that hardware comes online, it may or may not
report its performance data prior to memory hotplug.

If it doesn't report its performance data, then hiding the node before
it hotplugs memory means a user can't pre-configure the system for when
the memory is added (which could be used immediately).
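
(Pre-configuring here just means writing a weight to the node's
attribute before its memory shows up. A minimal user-space sketch,
assuming the documented weighted_interleave sysfs path; the node number
and weight are only examples:)

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
        /* Example values: node 2, weight 4. */
        const char *path =
                "/sys/kernel/mm/mempolicy/weighted_interleave/node2";
        FILE *f = fopen(path, "w");

        if (!f) {
                perror("fopen");
                return EXIT_FAILURE;
        }
        if (fprintf(f, "4\n") < 0)
                perror("fprintf");
        fclose(f);
        return 0;
}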

Hiding the node until hotplug also means we have hidden state.  We need
to capture pre-hotplug reported performance data so that if it comes
online the auto-calculation of weights is correct.  But if the user has
already switched from auto to manual mode, then a node suddenly
appearing will have an unknown state.

This is why I initially chose to just expose N_POSSIBLE entries in
sysfs, because the transition state causes hidden information - and that
felt worse than extra entries.  I suppose I should add some
documentation somewhere that discusses this issue.

I think the underlying issue you're dealing with is that the system is
creating more nodes for you than it should.

~Gregory
Gregory Price March 7, 2025, 9:55 p.m. UTC | #2
On Fri, Mar 07, 2025 at 10:56:04AM -0500, Gregory Price wrote:
> 
> I think the underlying issue you're dealing with is that the system is
> creating more nodes for you than it should.
> 

Looking into this for other reasons, I think you are right that multiple
numa nodes can exist that cover the same memory - just different
regions.

I can see why you would want to hide the nodes that don't actively have
memory online, but I still have concerns about nodes that may come and
go, and about hiding this configuration from the user until memory arrives.

An example would be a DCD device where a node could add or remove memory
at any time.  If you removed the last block of memory, the node would
disappear - but the block could come back at any time.  That seems
problematic, as you might want to manage that node while no memory is
present.

~Gregory
Rakie Kim March 10, 2025, 9:03 a.m. UTC | #3
On Fri, 7 Mar 2025 10:56:04 -0500 Gregory Price <gourry@gourry.net> wrote:

Hi Gregory
Thank you for your response regarding this patch.

> On Fri, Mar 07, 2025 at 03:35:29PM +0900, Rakie Kim wrote:
> > Unnecessary sysfs entries: Nodes without memory were included in sysfs
> > at boot.
> > Missing hotplug support: Nodes that became online after initialization
> > were not recognized, causing incomplete interleave configurations.
> 
> This comment is misleading.  Nodes can "come online" but they are
> absolutely detected during init - as nodes cannot be "hotplugged"
> themselves.  Resources can be added *to* nodes, but nodes themselves
> cannot be added or removed.
> 
> I think what you're trying to say here is:
> 
> 1) The current system creates 1 entry per possible node (explicitly)
> 2) Not all nodes may have memory at all times (memory can be hotplugged)
> 3) It would be nice to let sysfs and weighted interleave omit memoryless
>    nodes until those nodes had memory hotplugged into them.
> 
> > Dynamic sysfs updates for hotplugged nodes  New memory nodes are
> > recognized and integrated via the memory hotplug mechanism.
> > Subsequent patches refine this functionality:
> >
> 
> Just going to reiterate that there's no such thing as a hotplug node
> or "new nodes" - only nodes that have their attributes changed (i.e.
> !N_MEMORY -> N_MEMORY).  The node exists, it may just not have anything
> associated with it.
> 
> Maybe semantic nits, but it matters.  The nodes are present and can be
> operated on before memory comes online, and that has implications for
> users.  Depending on how that hardware comes online, it may or may not
> report its performance data prior to memory hotplug.

I agree with your assessment. The existing comments, as you pointed out,
might indeed be confusing or misleading. I'll make sure this issue is
addressed in version 2.

> 
> If it doesn't report its performance data, then hiding the node before
> it hotplugs memory means a user can't pre-configure the system for when
> the memory is added (which could be used immediately).
> 
> Hiding the node until hotplug also means we have hidden state.  We need
> to capture pre-hotplug reported performance data so that if it comes
> online the auto-calculation of weights is correct.  But if the user has
> already switched from auto to manual mode, then a node suddenly
> appearing will have an unknown state.
> 
> This is why I initially chose to just expose N_POSSIBLE entries in
> sysfs, because the transition state causes hidden information - and that
> felt worse than extra entries.  I suppose I should add some
> documentation somewhere that discusses this issue.
> 
> I think the underlying issue you're dealing with is that the system is
> creating more nodes for you than it should.

I will reply to your next comment on this issue soon.

> 
> ~Gregory

Rakie
Rakie Kim March 10, 2025, 9:03 a.m. UTC | #4
On Fri, 7 Mar 2025 16:55:40 -0500 Gregory Price <gourry@gourry.net> wrote:
> On Fri, Mar 07, 2025 at 10:56:04AM -0500, Gregory Price wrote:
> > 
> > I think the underlying issue you're dealing with is that the system is
> > creating more nodes for you than it should.
> > 
> 
> Looking into this for other reasons, I think you are right that multiple
> numa nodes can exist that cover the same memory - just different
> regions.
> 

I understand your concerns, and I agree that the most critical issue at the
moment is that the system is generating more nodes than necessary.
We need to conduct a more thorough analysis of this problem, but a detailed
investigation will require a significant amount of time. In the meantime,
these patches might offer a quick way to address the issue.

Additionally, it's important to note that not many CXL devices have been
developed yet, and their operation is not yet fully optimized. As a result,
we may encounter behavior from CXL devices and servers that differs from
our expectations, and I hope these patches can help cover such unforeseen
cases.

> I can see why you would want to hide the nodes that don't actively have
> memory online, but I still have concerns about nodes that may come and
> go, and about hiding this configuration from the user until memory arrives.
> 
> An example would be a DCD device where a node could add or remove memory
> at any time.  If you removed the last block of memory, the node would
> disappear - but the block could come back at any time.  That seems
> problematic, as you might want to manage that node while no memory is
> present.
> 
> ~Gregory

Of course, the patches may need further refinements. Therefore, I plan to
simplify the patches and remove any unnecessary modifications in the upcoming
version 2 update. Once it's ready, I would be very grateful if you could take
the time to review version 2 and share any further feedback you might have.

Rakie
Gregory Price March 10, 2025, 2:13 p.m. UTC | #5
On Mon, Mar 10, 2025 at 06:03:59PM +0900, Rakie Kim wrote:
> On Fri, 7 Mar 2025 16:55:40 -0500 Gregory Price <gourry@gourry.net> wrote:
> > On Fri, Mar 07, 2025 at 10:56:04AM -0500, Gregory Price wrote:
> > > 
> > > I think the underlying issue you're dealing with is that the system is
> > > creating more nodes for you than it should.
> > > 
> > 
> > Looking into this for other reasons, I think you are right that multiple
> > numa nodes can exist that cover the same memory - just different
> > regions.
> > 
> 
> I understand your concerns, and I agree that the most critical issue at the
> moment is that the system is generating more nodes than necessary.
> We need to conduct a more thorough analysis of this problem, but a detailed
> investigation will require a significant amount of time. In this context,
> these patches might offer a quick solution to address the issue.
> 

I dug into the expected CEDT / CFMWS behaviors and had some discussions
with Dan and Jonathan - assuming your CEDT has multiple CFMWS to cover
the same set of devices, this is the expected behavior.

https://lore.kernel.org/linux-mm/Z226PG9t-Ih7fJDL@gourry-fedora-PF4VCD3F/T/#m2780e47df7f0962a79182502afc99843bb046205

Basically your BIOS is likely creating one per device and likely one
per host bridge (to allow intra-host-bridge interleave).

This puts us in an awkward state, and I need some time to consider
whether we should expose N_POSSIBLE nodes or N_MEMORY nodes.

Probably it makes sense to expose N_MEMORY nodes and allow for hidden
state, as the annoying corner condition of a DCD coming and going
most likely means a user wouldn't be using weighted interleave anyway.

So if you can confirm what your CEDT says compared to the notes above, I
think we can move forward with this.

~Gregory
Rakie Kim March 12, 2025, 8:18 a.m. UTC | #6
On Mon, 10 Mar 2025 10:13:58 -0400 Gregory Price <gourry@gourry.net> wrote:

Hi Gregory,

I have updated version 2 of the patch series, incorporating the feedback from
you and Joshua.
However, this version does not yet include updates to the commit messages
regarding the points you previously mentioned.

Your detailed explanations have been incredibly valuable in helping us analyze
the system, and I sincerely appreciate your insights.

> 2) We need to clearly define what the weight of a node will be when
>    in manual mode and a node goes (memory -> no memory -> memory)

Additionally, I will soon provide an updated document addressing this and
other points you raised in your emails.

Thank you again for your guidance and support.

Rakie

> On Mon, Mar 10, 2025 at 06:03:59PM +0900, Rakie Kim wrote:
> > On Fri, 7 Mar 2025 16:55:40 -0500 Gregory Price <gourry@gourry.net> wrote:
> > > On Fri, Mar 07, 2025 at 10:56:04AM -0500, Gregory Price wrote:
> > > > 
> > > > I think the underlying issue you're dealing with is that the system is
> > > > creating more nodes for you than it should.
> > > > 
> > > 
> > > Looking into this for other reasons, I think you are right that multiple
> > > numa nodes can exist that cover the same memory - just different
> > > regions.
> > > 
> > 
> > I understand your concerns, and I agree that the most critical issue at the
> > moment is that the system is generating more nodes than necessary.
> > We need to conduct a more thorough analysis of this problem, but a detailed
> > investigation will require a significant amount of time. In this context,
> > these patches might offer a quick solution to address the issue.
> > 
> 
> I dug into the expected CEDT / CFMWS behaviors and had some discussions
> with Dan and Jonathan - assuming your CEDT has multiple CFMWS to cover
> the same set of devices, this is the expected behavior.
> 
> https://lore.kernel.org/linux-mm/Z226PG9t-Ih7fJDL@gourry-fedora-PF4VCD3F/T/#m2780e47df7f0962a79182502afc99843bb046205
> 
> Basically your BIOS is likely creating one per device and likely one
> per host bridge (to allow intra-host-bridge interleave).
> 
> This puts us in an awkward state, and I need some time to consider
> whether we should expose N_POSSIBLE nodes or N_MEMORY nodes.
> 
> Probably it makes sense to expose N_MEMORY nodes and allow for hidden
> state, as the annoying corner condition of a DCD coming and going
> most likely means a user wouldn't be using weighted interleave anyway.
> 
> So if you can confirm what your CEDT says compared to the notes above, I
> think we can move forward with this.
> 
> ~Gregory