[RFC,v4,0/3] memcg weighted interleave mempolicy control

Message ID 20231109002517.106829-1-gregory.price@memverge.com
Series memcg weighted interleave mempolicy control

Message

Gregory Price Nov. 9, 2023, 12:25 a.m. UTC
This patchset implements weighted interleave and adds a new cgroup
sysfs entry: cgroup/memory.interleave_weights (excluded from root).

The il_weight of a node is used by mempolicy to implement weighted
interleave when `numactl --interleave=...` is invoked.  By default
il_weight for a node is always 1, which preserves the default round
robin interleave behavior.

Interleave weights denote the number of pages that should be
allocated from the node when interleaving occurs and have a range
of 1-255.  The weight of a node can never be 0, and instead the
preferred way to prevent allocation is to remove the node from the
cpuset or mempolicy altogether.

For example, if a node's interleave weight is set to 5, 5 pages
will be allocated from that node before the next node is scheduled
for allocations.

# Set node weight for node 0 to 5
echo 0:5 > /sys/fs/cgroup/user.slice/memory.interleave_weights

# Set node weight for node 1 to 3
echo 1:3 > /sys/fs/cgroup/user.slice/memory.interleave_weights

# View the currently set weights
cat /sys/fs/cgroup/user.slice/memory.interleave_weights
0:5,1:3

Weights will only be displayed for possible nodes.
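
As a purely illustrative sketch (a userspace simulation, not the kernel
implementation), the allocation pattern produced by the weights above
works out to a weighted round robin:

    /* Simulates the weighted round robin produced by weights 0:5, 1:3.
     * Illustration only; not the actual allocator code. */
    #include <stdio.h>

    int main(void)
    {
            const unsigned char weight[] = { 5, 3 };  /* node 0, node 1 */
            int node = 0, used = 0;

            for (int page = 0; page < 16; page++) {
                    printf("page %2d -> node %d\n", page, node);
                    if (++used >= weight[node]) {
                            used = 0;
                            node = (node + 1) % 2;  /* advance to next node */
                    }
            }
            return 0;
    }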

With this it becomes possible to set an interleaving strategy
that fits the available bandwidth for the devices available on
the system. An example system:

Node 0 - CPU+DRAM, 400GB/s BW (200GB/s cross-socket)
Node 1 - CXL Memory, 64GB/s BW, on Node 0 root complex

In this setup, the effective weights for a node set of [0,1]
may be [86, 14] (86% of memory on Node 0, 14% on Node 1),
or some smaller fraction thereof to encourage quicker rounds
for better overall distribution.
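
(For reference, those example weights can be derived by normalizing each
node's bandwidth against the total; a minimal sketch, assuming only the
bandwidth numbers given above:)

    /* 400/(400+64) ~= 86%, 64/(400+64) ~= 14% -> weights [86, 14]. */
    #include <stdio.h>

    int main(void)
    {
            const unsigned int bw[] = { 400, 64 };  /* GB/s, nodes 0 and 1 */
            const unsigned int total = bw[0] + bw[1];

            for (int i = 0; i < 2; i++)
                    printf("node %d weight ~= %u\n", i,
                           (bw[i] * 100 + total / 2) / total);
            return 0;
    }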

This spreads memory out across devices which all have different
latency and bandwidth attributes in a way that can maximize the
available resources.

~Gregory

=============
Version Notes:

= v4 notes

Moved interleave weights to cgroups from nodes.

Omitted them from the root cgroup for initial testing/comment, but
it seems like it may be a reasonable idea to place them there too.

== Weighted interleave

mm/mempolicy: modify interleave mempolicy to use node weights

The mempolicy MPOL_INTERLEAVE utilizes the node weights defined in
the cgroup memory.interleave_weights interface to implement weighted
interleave.  By default, since all nodes default to a weight of 1,
the original interleave behavior is retained.

============
RFC History

Node based weights
By: Gregory Price
https://lore.kernel.org/linux-mm/20231031003810.4532-1-gregory.price@memverge.com/

Memory-tier based weights
By: Ravi Shankar
https://lore.kernel.org/all/20230927095002.10245-1-ravis.opensrc@micron.com/

Mempolicy multi-node weighting w/ set_mempolicy2:
By: Gregory Price
https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@memverge.com/

Hasan Al Maruf: N:M weighting in mempolicy
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

Huang, Ying's presentation in lpc22, 16th slide in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

===================

Gregory Price (3):
  mm/memcontrol: implement memcg.interleave_weights
  mm/mempolicy: implement weighted interleave
  Documentation: sysfs entries for cgroup.memory.interleave_weights

 Documentation/admin-guide/cgroup-v2.rst       |  45 +++++
 .../admin-guide/mm/numa_memory_policy.rst     |  11 ++
 include/linux/memcontrol.h                    |  31 ++++
 include/linux/mempolicy.h                     |   3 +
 mm/memcontrol.c                               | 172 ++++++++++++++++++
 mm/mempolicy.c                                | 153 +++++++++++++---
 6 files changed, 387 insertions(+), 28 deletions(-)

Comments

Michal Hocko Nov. 9, 2023, 10:02 a.m. UTC | #1
On Wed 08-11-23 19:25:14, Gregory Price wrote:
> This patchset implements weighted interleave and adds a new cgroup
> sysfs entry: cgroup/memory.interleave_weights (excluded from root).

Why have you chosen the memory controller rather than the cpuset controller?
TBH I do not think memcg is the best fit because traditionally memcg
accounts consumption rather than memory placement. This means that the
memory is already allocated when it is charged for a memcg. On the other
hand cpuset controller is the one to control the allocation placement so
it would seem a better fit.
Gregory Price Nov. 9, 2023, 3:10 p.m. UTC | #2
On Thu, Nov 09, 2023 at 11:02:23AM +0100, Michal Hocko wrote:
> On Wed 08-11-23 19:25:14, Gregory Price wrote:
> > This patchset implements weighted interleave and adds a new cgroup
> > sysfs entry: cgroup/memory.interleave_weights (excluded from root).
> 
> Why have you chosen the memory controller rather than the cpuset controller?
> TBH I do not think memcg is the best fit because traditionally memcg
> accounts consumption rather than memory placement. This means that the
> memory is already allocated when it is charged for a memcg. On the other
> hand cpuset controller is the one to control the allocation placement so
> it would seem a better fit.
> -- 
> Michal Hocko
> SUSE Labs

Wasn't sure between the two, so I tossed it in memcg. Relocation is easy,
and the code should remain basically the same, so I will wait a bit
before sending out an update.  Assuming we've found the right place
for it, I'll probably drop RFC at that point.

~Gregory
Gregory Price Nov. 9, 2023, 4:34 p.m. UTC | #3
On Thu, Nov 09, 2023 at 11:02:23AM +0100, Michal Hocko wrote:
> On Wed 08-11-23 19:25:14, Gregory Price wrote:
> > This patchset implements weighted interleave and adds a new cgroup
> > sysfs entry: cgroup/memory.interleave_weights (excluded from root).
> 
> Why have you chosen the memory controller rather than the cpuset controller?
> TBH I do not think memcg is the best fit because traditionally memcg
> accounts consumption rather than memory placement. This means that the
> memory is already allocated when it is charged for a memcg. On the other
> hand cpuset controller is the one to control the allocation placement so
> it would seem a better fit.
> -- 
> Michal Hocko
> SUSE Labs

Actually going to walk back my last email, memcg actually feels more
correct than cpuset, if only because of what the admin-guide says:

"""
The "memory" controller regulates distribution of memory. [... snip ...]

While not completely water-tight, all major memory usages by a given
cgroup are tracked so that the total memory consumption can be accounted
and controlled to a reasonable extent.
"""

'And controlled to a reasonable extent' seems to fit the description of
this mechanism better than the cpuset description:

"""
The "cpuset" controller provides a mechanism for constraining the CPU and
memory node placement of tasks to only the resources specified in the
cpuset interface files in a task's current cgroup.
"""

This is not a constraining interface... it's "more of a suggestion". In
particular, anything not using interleave doesn't even care about these
weights at all.

The distribution is only enforced at allocation time; it does not cause
migrations... though that would be a neat idea. This is explicitly
why the interface does not allow a weight of 0 (the node should be
omitted from the policy nodemask or cpuset instead).

Even if this were designed to enforce a particular distribution of
memory, I'm not certain that would belong in cpusets either - but I
suppose that is a separate discussion.  It's possible this array of
weights could be used to do both, but it seems (at least on the surface)
that making this a hard control is an excellent way to induce OOMs where
you may not want them.


Anyway, summarizing:  After a bit of reading, this does seem to map
better to the "accounting consumption" subsystem than the "constrain"
subsystem. However, if you think it's better suited for cpuset, I'm
happy to push in that direction.

~Gregory
John Groves Nov. 9, 2023, 10:48 p.m. UTC | #4
On 23/11/08 07:25PM, Gregory Price wrote:
> This patchset implements weighted interleave and adds a new cgroup
> sysfs entry: cgroup/memory.interleave_weights (excluded from root).
 
<snip>

We at Micron think this capability is really important, and moving it into
cgroups introduces important flexibility that was missing from prior versions
of the patchset.

To me this mainly represents the antidote to having the system BIOS program
the memory map to be heterogeneously interleaved - which I expect will be
an option, and some customers seem to want it. But that approach will affect
all apps (and also the kernel), and will hide at least the interleaved
NUMA nodes from Linux's view of the topology. There is a lot not to like
about that approach...

This approach checks all the important boxes: it only applies to apps where
it's enabled, the weighting can vary from one app to another, the
kernel is not affected, and the numa topology is not buried.

I see Michal's comment about cpuset vs. memory, and have no opinion there.

Thumbs up!
John Groves at Micron
Huang, Ying Nov. 10, 2023, 6:16 a.m. UTC | #5
Gregory Price <gourry.memverge@gmail.com> writes:

> This patchset implements weighted interleave and adds a new cgroup
> sysfs entry: cgroup/memory.interleave_weights (excluded from root).
>
> The il_weight of a node is used by mempolicy to implement weighted
> interleave when `numactl --interleave=...` is invoked.  By default
> il_weight for a node is always 1, which preserves the default round
> robin interleave behavior.

IIUC, this makes it almost impossible to set the default weight of a
node from the node memory bandwidth information.  This will make the
life of users a little harder.

If so, how about using a new memory policy mode, for example
MPOL_WEIGHTED_INTERLEAVE, etc.

> Interleave weights denote the number of pages that should be
> allocated from the node when interleaving occurs and have a range
> of 1-255.  The weight of a node can never be 0, and instead the
> preferred way to prevent allocation is to remove the node from the
> cpuset or mempolicy altogether.
>
> For example, if a node's interleave weight is set to 5, 5 pages
> will be allocated from that node before the next node is scheduled
> for allocations.
>
> # Set node weight for node 0 to 5
> echo 0:5 > /sys/fs/cgroup/user.slice/memory.interleave_weights
>
> # Set node weight for node 1 to 3
> echo 1:3 > /sys/fs/cgroup/user.slice/memory.interleave_weights
>
> # View the currently set weights
> cat /sys/fs/cgroup/user.slice/memory.interleave_weights
> 0:5,1:3
>
> Weights will only be displayed for possible nodes.
>
> With this it becomes possible to set an interleaving strategy
> that fits the available bandwidth for the devices available on
> the system. An example system:
>
> Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> Node 1 - CXL Memory. 64GB/s BW, on Node 0 root complex
>
> In this setup, the effective weights for a node set of [0,1]
> may be [86, 14] (86% of memory on Node 0, 14% on Node 1),
> or some smaller fraction thereof to encourage quicker rounds
> for better overall distribution.
>
> This spreads memory out across devices which all have different
> latency and bandwidth attributes in a way that can maximize the
> available resources.
>

--
Best Regards,
Huang, Ying
Michal Hocko Nov. 10, 2023, 9:05 a.m. UTC | #6
On Thu 09-11-23 11:34:01, Gregory Price wrote:
[...]
> Anyway, summarizing:  After a bit of reading, this does seem to map
> better to the "accounting consumption" subsystem than the "constrain"
> subsystem. However, if you think it's better suited for cpuset, I'm
> happy to push in that direction.

Maybe others see it differently, but I stick with my previous position.
Memcg is not a great fit for the reasons already mentioned - most notably
that the controller doesn't control allocation, but rather accounts for
what has already been allocated. Cpusets, on the other hand, constrain
the allocations, and that is exactly what you want to achieve.
Gregory Price Nov. 10, 2023, 7:54 p.m. UTC | #7
On Fri, Nov 10, 2023 at 02:16:05PM +0800, Huang, Ying wrote:
> Gregory Price <gourry.memverge@gmail.com> writes:
> 
> > This patchset implements weighted interleave and adds a new cgroup
> > sysfs entry: cgroup/memory.interleave_weights (excluded from root).
> >
> > The il_weight of a node is used by mempolicy to implement weighted
> > interleave when `numactl --interleave=...` is invoked.  By default
> > il_weight for a node is always 1, which preserves the default round
> > robin interleave behavior.
> 
> IIUC, this makes it almost impossible to set the default weight of a
> node from the node memory bandwidth information.  This will make the
> life of users a little harder.
> 
> If so, how about use a new memory policy mode, for example
> MPOL_WEIGHTED_INTERLEAVE, etc.
>

Weights are also inherited from parent cgroups, so if you set them in
parent slices you can automatically update system-wide settings.

By default the parent slice weights will always be 1 until set
otherwise.  Once they're set, children inherit them naturally.

Maybe there's an argument here for including interleave_weights in the
root cgroup.

~Gregory
Gregory Price Nov. 10, 2023, 9:24 p.m. UTC | #8
On Fri, Nov 10, 2023 at 10:05:57AM +0100, Michal Hocko wrote:
> On Thu 09-11-23 11:34:01, Gregory Price wrote:
> [...]
> > Anyway, summarizing:  After a bit of reading, this does seem to map
> > better to the "accounting consumption" subsystem than the "constrain"
> > subsystem. However, if you think it's better suited for cpuset, I'm
> > happy to push in that direction.
> 
> Maybe others see it differently but I stick with my previous position.
> Memcg is not a great fit for reasons already mentioned - most notably
> that the controller doesn't control the allocation but accounting what
> has been already allocated. Cpusets on the other hand constrains the
> allocations and that is exactly what you want to achieve. 
> -- 
> Michal Hocko
> SUSE Labs

Digging in a bit, placing it in cpusets has locking requirements that
concern me.  Maybe I'm being a bit over-cautious, so if none of this
matters, then I'll go ahead and swap the code over to cpusets.
Otherwise, just more food for thought in cpusets vs memcg.

In cpuset.c it states that when acquiring read-only access, we have to
acquire the (global) callback lock:

https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L391
* There are two global locks guarding cpuset structures - cpuset_mutex and
* callback_lock. We also require taking task_lock() when dereferencing a
* task's cpuset pointer. See "The task_lock() exception", at the end of this
* comment.

Examples:

cpuset_node_allowed:
https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L4780
    spin_lock_irqsave(&callback_lock, flags);
    rcu_read_lock();
    cs = nearest_hardwall_ancestor(task_cs(current)); <-- walks parents
    allowed = node_isset(node, cs->mems_allowed);
    rcu_read_unlock();
    spin_unlock_irqrestore(&callback_lock, flags);

cpuset_mems_allowed:
https://github.com/torvalds/linux/blob/master/kernel/cgroup/cpuset.c#L4679
    spin_lock_irqsave(&callback_lock, flags);
    rcu_read_lock();
    guarantee_online_mems(task_cs(tsk), &mask); <-- walks parents
    rcu_read_unlock();
    spin_unlock_irqrestore(&callback_lock, flags);

Seems apparent that any form of parent walk in cpusets will require the
acquisition of &callback_lock.  This does not appear true of memcg.

Implementing an inheritance structure similar to the one described in this
patch set would therefore require acquiring the callback lock during
node selection. So if we want this in cpuset, we're going to eat that
lock acquisition, despite not really needing it.


I was not intending to do checks against cpuset mems_allowed when
acquiring weights, as this is already enforced between cpusets and
mempolicy on hotplug and mask changes, as well as in the allocators via
read_mems_allowed_begin/retry.
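
(For reference, the allocator-side pattern referred to above looks roughly
like the following; this is a simplified illustration, not the exact
upstream code, and attempt_allocation() is a placeholder:)

    /* Sketch of the mems_allowed seqcount retry pattern; see
     * read_mems_allowed_begin()/read_mems_allowed_retry() in
     * include/linux/cpuset.h. */
    unsigned int cpuset_mems_cookie;
    struct page *page;

    do {
            cpuset_mems_cookie = read_mems_allowed_begin();
            page = attempt_allocation(gfp_mask, order);  /* placeholder */
    } while (!page && read_mems_allowed_retry(cpuset_mems_cookie));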

This is why I said this was *not* a constraining feature. Additionally,
if the node selected by mpol is exhausted, the allocator will simply
acquire memory from another (allowed) node, disregarding the weights
entirely (which is the correct / expected behavior). Another example of
"this is more of a suggestion" rather than a constraint.

So I'm contending here that putting it in cpusets is overkill.  But if
it likewise doesn't fit in memcg, is it insane to suggest that maybe we
should consider adding cgroup.mpol, and maybe consider migrating features
from mempolicy.c into cgroups (while keeping mpol the way it is).

~Gregory
Tejun Heo Nov. 10, 2023, 10:05 p.m. UTC | #9
Hello,

On Thu, Nov 09, 2023 at 10:48:56PM +0000, John Groves wrote:
> This approach checks all the important boxes: it only applies to apps where
> it's enabled, the weighting can vary from one app to another, the
> kernel is not affected, and the numa topology is not buried.

Can't it be a mempol property which is inherited by child processes? Then
all you'll need is e.g. adding systemd support to configure this at service
unit level. I'm having a bit of a hard time seeing why this needs to be a
cgroup feature when it doesn't involve dynamic resource accounting /
enforcement at all.

Thanks.
Gregory Price Nov. 10, 2023, 10:29 p.m. UTC | #10
On Fri, Nov 10, 2023 at 12:05:59PM -1000, tj@kernel.org wrote:
> Hello,
> 
> On Thu, Nov 09, 2023 at 10:48:56PM +0000, John Groves wrote:
> > This approach checks all the important boxes: it only applies to apps where
> > it's enabled, the weighting can vary from one app to another, the
> > kernel is not affected, and the numa topology is not buried.
> 
> Can't it be a mempol property which is inherited by child processes? Then
> all you'll need is e.g. adding systemd support to configure this at service
> unit level. I'm having a bit of hard time seeing why this needs to be a
> cgroup feature when it doesn't involve dynamic resource accounting /
> enforcement at all.
> 
> Thanks.
> 
> -- 
> tejun

I did originally implement it this way, but note that it will require
either some creative extension of set_mempolicy or the set_mempolicy2
as proposed here:

https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@memverge.com/

One of the problems to consider is task migration.  If a task is
migrated from one socket to another, for example by being moved to a new
cgroup with a different cpuset - the weights might be completely nonsensical
for the new allowed topology.

Unfortunately mpol has no way of being changed from outside the task
itself once it's applied, other than changing its nodemasks via cpusets.

So one concrete use case: kubernetes might like to change cpusets or move
tasks from one cgroup to another, or a VM might be migrated from one set
of nodes to another (technically not mutually exclusive here).  Some
memory policy settings (like weights) may no longer apply when this
happens, so it would be preferable to have a way to change them.

~Gregory
Tejun Heo Nov. 11, 2023, 3:05 a.m. UTC | #11
Hello, Gregory.

On Fri, Nov 10, 2023 at 05:29:25PM -0500, Gregory Price wrote:
> I did originally implement it this way, but note that it will either
> require some creative extension of set_mempolicy or even set_mempolicy2
> as proposed here:
> 
> https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@memverge.com/
> 
> One of the problems to consider is task migration.  If a task is
> migrated from one socket to another, for example by being moved to a new
> cgroup with a different cpuset - the weights might be completely nonsensical
> for the new allowed topology.
> 
> Unfortunately mpol has no way of being changed from outside the task
> itself once it's applied, other than changing its nodemasks via cpusets.

Maybe it's time to add one?

> So one concrete use case: kubernetes might like to change cpusets or move
> tasks from one cgroup to another, or a VM might be migrated from one set
> of nodes to another (technically not mutually exclusive here).  Some
> memory policy settings (like weights) may no longer apply when this
> happens, so it would be preferable to have a way to change them.

Neither covers all use cases. As you noted in your mempolicy message, if the
application wants finer grained control, the cgroup interface isn't great. In
general, any change which is dynamically initiated by the application
itself isn't a great fit for cgroup.

I'm generally pretty wary of adding non-resource group configuration
interfaces, especially when they don't have a counterpart in the regular
per-process/thread API, for a few reasons:

1. The reason why people try to add those through cgroup sometimes is because
   it seems easier to add those new features through cgroup, which may be
   true to some degree, but shortcuts often aren't very conducive to long
   term maintainability.

2. As noted above, just having cgroup often excludes a significant portion of
   use cases. Not all systems enable cgroups, and programmatic accesses from
   target processes / threads are coarse-grained and can be really awkward.

3. Cgroup can be convenient when group config change is necessary. However,
   we really don't want to keep adding kernel interfaces just for changing
   configs for a group of threads. For config changes which aren't high
   frequency, userspace iterating the member processes and applying the
   changes if possible is usually good enough; this typically involves looping
   until no new process is found. If the looping is problematic, cgroup
   freezer can be used to atomically stop all member threads to provide
   atomicity too.

Thanks.
Gregory Price Nov. 11, 2023, 3:42 a.m. UTC | #12
On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@kernel.org wrote:
> Hello, Gregory.
> 
> On Fri, Nov 10, 2023 at 05:29:25PM -0500, Gregory Price wrote:
> > Unfortunately mpol has no way of being changed from outside the task
> > itself once it's applied, other than changing its nodemasks via cpusets.
> 
> Maybe it's time to add one?
> 

I've been considering this as well, but there's more context here being
lost.  It's not just about being able to toggle the policy of a single
task, or related tasks, but also about supporting a more global data
interleaving strategy that makes use of bandwidth more effectively as
memory expansion and bandwidth expansion begin to occur on the
PCIe/CXL bus.

If the memory landscape of a system changes, for example due to a
hotplug event, you actually want to change the behavior of *every* task
that is using interleaving.  The fundamental bandwidth distribution of
the entire system changed, so the behavior of every task using that
memory should change with it.

We've explored adding weights to: mempolicy, memory tiers, nodes, memcg,
and now additionally cpusets. In the last email, I'd asked whether it
might actually be worth adding a new mpol component of cgroups to
aggregate these issues, rather than jam them into either component.
I would love your thoughts on that.


> > So one concrete use case: kubernetes might like to change cpusets or move
> > tasks from one cgroup to another, or a VM might be migrated from one set
> > of nodes to another (technically not mutually exclusive here).  Some
> > memory policy settings (like weights) may no longer apply when this
> > happens, so it would be preferable to have a way to change them.
> 
> Neither covers all use cases. As you noted in your mempolicy message, if the
> application wants finer grained control, cgroup interface isn't great. In
> general, any changes which are dynamically initiated by the application
> itself isn't a great fit for cgroup.
> 

It is certainly simple enough to add weights to mempolicy, but there
are limitations.  In particular, mempolicy is extremely `current task`
focused, and significant refactor work would need to be done to allow
external tasks the ability to toggle a target task's mempolicy.

Specifically, I worry about the potential concurrency issues, since
mempolicy can be in the hot allocation path.

(additionally, as you note below, you would have to hit every child
thread separately to make effective changes, since it is per-task).

I'm not opposed to this, but it was suggested to me that maybe there is
a better place to put these weights.  Maybe it can be managed mostly
through RCU, though, so maybe the concern is overblown.

Anyway...

It's certainly my intent to add weights to mempolicy, as that's where
I started. If that is the preferred starting point from the perspective
of the mm community, I will revert to proposing set_mempolicy2 and/or
full on converting mempolicy into a sys/procfs friendly mechanism.

The goal here is to enable mempolicy, or something like it, to acquire
additional flexibility in a heterogeneous memory world, considering how
threads may be migrated, checkpointed/restored, and how use cases like
bandwidth expansion may be insufficiently serviced by something as fine
grained as per-task mempolicies.

> I'm generally pretty wary of adding non-resource group configuration
> interfaces, especially when they don't have a counterpart in the regular
> per-process/thread API, for a few reasons:
> 
> 1. The reason why people try to add those through cgroup sometimes is because
>    it seems easier to add those new features through cgroup, which may be
>    true to some degree, but shortcuts often aren't very conducive to long
>    term maintainability.
>

Concur.  That's why I originally proposed the mempolicy extension, since
I wasn't convinced by global settings, but I've been brought around by
the fact that migrations and hotplug events may want to affect mass
changes across a large number of unrelated tasks.

> 2. As noted above, just having cgroup often excludes a significant portion of
>    use cases. Not all systems enable cgroups, and programmatic accesses from
>    target processes / threads are coarse-grained and can be really awkward.
> 
> 3. Cgroup can be convenient when group config change is necessary. However,
>    we really don't want to keep adding kernel interface just for changing
>    configs for a group of threads. For config changes which aren't high
>    frequency, userspace iterating the member processes and applying the
>    changes if possible is usually good enough which usually involves looping
>    until no new process is found. If the looping is problematic, cgroup
>    freezer can be used to atomically stop all member threads to provide
>    atomicity too.
> 

If I can ask, do you think it would be out of line to propose a major
refactor to mempolicy that enables external tasks to change a
running task's mempolicy *as well as* a cgroup-wide mempolicy component?

As you've alluded to here, I don't think either mechanism on its own
is sufficient to handle all use cases, but the two combined do seem
sufficient.

I do appreciate the feedback here, thank you.  I think we are getting
to the bottom of how/where such new mempolicy mechanisms should be
implemented.

~Gregory
Tejun Heo Nov. 11, 2023, 11:16 a.m. UTC | #13
Hello,

On Fri, Nov 10, 2023 at 10:42:39PM -0500, Gregory Price wrote:
> On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@kernel.org wrote:
...
> I've been considering this as well, but there's more context here being
> lost.  It's not just about being able to toggle the policy of a single
> task, or related tasks, but actually in support of a more global data
> interleaving strategy that makes use of bandwidth more effectively as
> memory expansion and bandwidth expansion begin to occur on the
> PCIe/CXL bus.
> 
> If the memory landscape of a system changes, for example due to a
> hotplug event, you actually want to change the behavior of *every* task
> that is using interleaving.  The fundamental bandwidth distribution of
> the entire system changed, so the behavior of every task using that
> memory should change with it.
> 
> We've explored adding weights to: mempolicy, memory tiers, nodes, memcg,
> and now additionally cpusets. In the last email, I'd asked whether it
> might actually be worth adding a new mpol component of cgroups to
> aggregate these issues, rather than jam them into either component.
> I would love your thoughts on that.

As for CXL and the changing memory landscape, I think some caution is
necessary as with any expected "future" technology changes. The recent
example with non-volatile memory isn't too far from CXL either. Note that
this is not to say that we shouldn't change anything until the hardware is
wildly popular but more that we need to be cognizant of the speculative
nature and the possibility of overbuilding for it.

I don't have a golden answer but here are general suggestions: Build
something which is small and/or useful even outside the context of the
expected hardware landscape changes. Enable the core feature which is
absolutely required in a minimal manner. Avoid being maximalist in feature
and convenience coverage.

Here, even if CXL actually becomes popular, how many are going to use memory
hotplug and need to dynamically rebalance memory in actively running
workloads? What's the scenario? Is there going to be an army of data center
technicians going around plugging and unplugging CXL devices depending on
system memory usage?

Maybe there are some cases where this is actually useful, but for those niche
use cases, isn't a per-task interface with iteration enough? How frequent are
these hotplug events going to be?

> > > So one concrete use case: kubernetes might like to change cpusets or move
> > > tasks from one cgroup to another, or a VM might be migrated from one set
> > > of nodes to another (technically not mutually exclusive here).  Some
> > > memory policy settings (like weights) may no longer apply when this
> > > happens, so it would be preferable to have a way to change them.
> > 
> > Neither covers all use cases. As you noted in your mempolicy message, if the
> > application wants finer grained control, cgroup interface isn't great. In
> > general, any changes which are dynamically initiated by the application
> > itself isn't a great fit for cgroup.
> 
> It is certainly simple enough to add weights to mempolicy, but there
> are limitations.  In particular, mempolicy is extremely `current task`
> focused, and significant refactor work would need to be done to allow
> external tasks the ability to toggle a target task's mempolicy.
> 
> In particular I worry about the potential concurrency issues since
> mempolicy can be in the hot allocation path.

Changing mpol from outside the task is a feature which is inherently useful
regardless of CXL, and I don't quite understand why hot path concurrency
issues would be different whether the configuration is coming from mempol or
cgroup, but that could easily be me not being familiar with the involved
code.

...
> > 3. Cgroup can be convenient when group config change is necessary. However,
> >    we really don't want to keep adding kernel interface just for changing
> >    configs for a group of threads. For config changes which aren't high
> >    frequency, userspace iterating the member processes and applying the
> >    changes if possible is usually good enough which usually involves looping
> >    until no new process is found. If the looping is problematic, cgroup
> >    freezer can be used to atomically stop all member threads to provide
> >    atomicity too.
> > 
> 
> If I can ask, do you think it would be out of line to propose a major
> refactor to mempolicy that enables external tasks to change a
> running task's mempolicy *as well as* a cgroup-wide mempolicy component?

I don't think these group configurations fit the cgroup filesystem interface
very well. As these aren't resource allocations, it's unclear what the
hierarchical relationship means. Besides, it feels awkward to keep adding
duplicate interfaces where the modality changes completely based on the
operation scope.

There are ample examples where other subsystems use cgroup membership
information and while we haven't expanded that to syscalls yet, I don't see
why that'd be all that difference. So, maybe it'd make sense to have the new
mempolicy syscall take a cgroup ID as a target identifier too? ie. so that
the scope of the operation (e.g. task, process, cgroup) and the content of
the policy can stay orthogonal?

Thanks.
Dan Williams Nov. 11, 2023, 11:54 p.m. UTC | #14
tj@kernel.org wrote:
> Hello,
> 
> On Fri, Nov 10, 2023 at 10:42:39PM -0500, Gregory Price wrote:
> > On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@kernel.org wrote:
> ...
> > I've been considering this as well, but there's more context here being
> > lost.  It's not just about being able to toggle the policy of a single
> > task, or related tasks, but actually in support of a more global data
> > interleaving strategy that makes use of bandwidth more effectively as
> > memory expansion and bandwidth expansion begin to occur on the
> > PCIe/CXL bus.
> > 
> > If the memory landscape of a system changes, for example due to a
> > hotplug event, you actually want to change the behavior of *every* task
> > that is using interleaving.  The fundamental bandwidth distribution of
> > the entire system changed, so the behavior of every task using that
> > memory should change with it.
> > 
> > We've explored adding weights to: mempolicy, memory tiers, nodes, memcg,
> > and now additionally cpusets. In the last email, I'd asked whether it
> > might actually be worth adding a new mpol component of cgroups to
> > aggregate these issues, rather than jam them into either component.
> > I would love your thoughts on that.
> 
> As for CXL and the changing memory landscape, I think some caution is
> necessary as with any expected "future" technology changes. The recent
> example with non-volatile memory isn't too far from CXL either. Note that
> this is not to say that we shouldn't change anything until the hardware is
> wildly popular but more that we need to be cognizant of the speculative
> nature and the possibility of overbuilding for it.
> 
> I don't have a golden answer but here are general suggestions: Build
> something which is small and/or useful even outside the context of the
> expected hardware landscape changes. Enable the core feature which is
> absolutely required in a minimal manner. Avoid being maximalist in feature
> and convenience coverage.

If I had to state the golden rule of kernel enabling, this paragraph
comes close to being it.

> Here, even if CXL actually becomes popular, how many are going to use memory
> hotplug and need to dynamically rebalance memory in actively running
> workloads? What's the scenario? Are there going to be an army of data center
> technicians going around plugging and unplugging CXL devices depending on
> system memory usage?

While I have personal skepticism that all of the infrastructure in the
CXL specification is going to become popular, one mechanism that seems
poised to cross that threshold is "dynamic capacity". So it is not the
case that techs are running around hot-adjusting physical memory. A host
will have a cable hop to a shared memory pool in the rack where it can
be dynamically provisioned across hosts.

However, even then the bounds of what is dynamic are going to be
constrained to a fixed address space with likely predictable performance
characteristics for that address range. That potentially allows for a
system wide memory interleave policy to be viable. That might be the
place to start and mirrors, at a coarser granularity, what hardware
interleaving can do.

[..]
Huang, Ying Nov. 13, 2023, 1:31 a.m. UTC | #15
Gregory Price <gregory.price@memverge.com> writes:

> On Fri, Nov 10, 2023 at 02:16:05PM +0800, Huang, Ying wrote:
>> Gregory Price <gourry.memverge@gmail.com> writes:
>> 
>> > This patchset implements weighted interleave and adds a new cgroup
>> > sysfs entry: cgroup/memory.interleave_weights (excluded from root).
>> >
>> > The il_weight of a node is used by mempolicy to implement weighted
>> > interleave when `numactl --interleave=...` is invoked.  By default
>> > il_weight for a node is always 1, which preserves the default round
>> > robin interleave behavior.
>> 
>> IIUC, this makes it almost impossible to set the default weight of a
>> node from the node memory bandwidth information.  This will make the
>> life of users a little harder.
>> 
>> If so, how about use a new memory policy mode, for example
>> MPOL_WEIGHTED_INTERLEAVE, etc.
>>
>
> weights are also inherited from parent cgroups, so if you set them in
> parent slices you can automatically update system-wide settings.
>
> by default the parent slice weights will always be 1 until set
> otherwise.  Once they're set, children inherit naturally.
>
> Maybe there's an argument here for including interleave_weights in the
> root cgroup.

Even if interleave_weights is introduced in the root cgroup, the initial
default weight needs to be 1 to be backward-compatible with the original
MPOL_INTERLEAVE.

If we don't reuse MPOL_INTERLEAVE, but instead use a new memory policy mode
(say MPOL_WEIGHTED_INTERLEAVE), the default values of the interleave weight
in the root cgroup needn't be 1.  So, we can provide a more helpful
default interleave weight based on the node memory bandwidth information
(e.g., from HMAT, CDAT, etc.).  That will make users' lives much easier.
Do you agree?

--
Best Regards,
Huang, Ying
Gregory Price Nov. 13, 2023, 2:22 a.m. UTC | #16
On Sat, Nov 11, 2023 at 03:54:55PM -0800, Dan Williams wrote:
> tj@kernel.org wrote:
> > Hello,
> > 
> > On Fri, Nov 10, 2023 at 10:42:39PM -0500, Gregory Price wrote:
> > > On Fri, Nov 10, 2023 at 05:05:50PM -1000, tj@kernel.org wrote:
> 
> > Here, even if CXL actually becomes popular, how many are going to use memory
> > hotplug and need to dynamically rebalance memory in actively running
> > workloads? What's the scenario? Are there going to be an army of data center
> > technicians going around plugging and unplugging CXL devices depending on
> > system memory usage?
> 
> While I have personal skepticism that all of the infrastructure in the
> CXL specification is going to become popular, one mechanism that seems
> poised to cross that threshold is "dynamic capacity". So it is not the
> case that techs are running around hot-adjusting physical memory. A host
> will have a cable hop to a shared memory pool in the rack where it can
> be dynamically provisioned across hosts.
> 
> However, even then the bounds of what is dynamic is going to be
> constrained to a fixed address space with likely predictable performance
> characteristics for that address range. That potentially allows for a
> system wide memory interleave policy to be viable. That might be the
> place to start and mirrors, at a coarser granularity, what hardware
> interleaving can do.
> 
> [..]

Funny enough, this is exactly why I skipped cgroups and went directly to 
implementing the weights as an attribute of numa nodes. It cuts out a
middle-man and lets you apply weights globally.

BUT the policy is still ultimately opt-in, so you don't really get a
global effect, just a global control.  Just given that lesson, yeah
it's better to reduce the scope to mempolicy first.

Getting to global interleave weights from there... more complicated.

The simplest way I can think of to test system-wide weighted interleave
is to have the init task create a default mempolicy and have all tasks
inherit it.  That feels like a big, dumb hammer - but it might work.

Comparatively, implementing a mempolicy in the root cgroup and having
tasks use that directly "feels" better, though lessons from this patch
apply - iterating cgroup parent trees on allocations does not feel great.

Barring that, if a cgroup.mempolicy and a default mempolicy for init
aren't realistic, I don't see a good path to fruition for a global
interleave approach that doesn't require nastier allocator changes.

In the meantime, unless there's other pro-cgroups voices, I'm going to
pivot back to my initial approach of doing it in mempolicy, though I
may explore extending mempolicy into procfs at the same time.

~Gregory
Gregory Price Nov. 13, 2023, 2:28 a.m. UTC | #17
On Mon, Nov 13, 2023 at 09:31:07AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Fri, Nov 10, 2023 at 02:16:05PM +0800, Huang, Ying wrote:
> >> Gregory Price <gourry.memverge@gmail.com> writes:
> >> 
> >
> > weights are also inherited from parent cgroups, so if you set them in
> > parent slices you can automatically update system-wide settings.
> >
> > by default the parent slice weights will always be 1 until set
> > otherwise.  Once they're set, children inherit naturally.
> >
> > Maybe there's an argument here for including interleave_weights in the
> > root cgroup.
> 
> Even if the interleave_weights is introduced in root cgroup, the initial
> default weight need to be 1 to be back-compatible with the original
> MPOL_INTERLEAVE.
>

Sorry, I am maybe not explaining correctly.  Right now, the weights
are not *replicated* when a child cgroup is created.  Instead, when
weights are requested (during allocation) the requestor searches for the
first cgroup in its family that has weights.

while (!memcg->weights && !memcg_is_root(memcg))
    memcg = parent(memcg);

We only create new weights on each child if the child explicitly has
their weights set.  We manage everything else via RCU to keep it all
consistent.

This would allow a set of weights in the root cgroup to be set and then
immediately inherited by the entire system, though it does add the
overhead of searching the cgroup family lineage on allocations (which
could be non-trivial, we are still testing it).
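
Roughly, the lookup would look something like this (a hedged sketch;
memcg->weights is a field added by this patchset, not upstream, and the
exact helpers/locking may differ in the actual code):

    /* Walk up the memcg tree under RCU until a cgroup with explicit
     * weights is found; fall back to the default weight of 1. */
    rcu_read_lock();
    memcg = mem_cgroup_from_task(current);
    while (memcg && !memcg->weights && !mem_cgroup_is_root(memcg))
            memcg = parent_mem_cgroup(memcg);
    weight = (memcg && memcg->weights) ? memcg->weights[nid] : 1;
    rcu_read_unlock();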

> If we don't reuse MPOL_INTERLEAVE, but use a new memory policy mode (say
> MPOL_WEIGHTED_INTERLEAVE).  The default values of the interleave weight
> in root cgroup needn't to be 1.

I agree, and I already have patches that do just this.  Though based on
other feedback, it's looking like I'll be reverting back to implementing
all of this in mempolicy, and maybe trying to pull mempolicy forward
into the procfs world.

~Gregory
Michal Hocko Nov. 14, 2023, 9:43 a.m. UTC | #18
On Fri 10-11-23 22:42:39, Gregory Price wrote:
[...]
> If I can ask, do you think it would be out of line to propose a major
> refactor to mempolicy that enables external tasks to change a
> running task's mempolicy *as well as* a cgroup-wide mempolicy component?

No, I actually think this is a reasonable idea. pidfd_setmempolicy is a
generally useful extension. The mempolicy code is heavily current-task
based and there might be some challenges, but I believe this will a)
improve the code base and b) allow more use cases.

That being said, I still believe that a cgroup based interface is a much
better choice over a global one. Cpusets seem to be a good fit as the
controller does control memory placement wrt NUMA interfaces.
Gregory Price Nov. 14, 2023, 3:50 p.m. UTC | #19
On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
> On Fri 10-11-23 22:42:39, Gregory Price wrote:
> [...]
> > If I can ask, do you think it would be out of line to propose a major
> > refactor to mempolicy that enables external tasks to change a
> > running task's mempolicy *as well as* a cgroup-wide mempolicy component?
> 
> No, I actually think this is a reasonable idea. pidfd_setmempolicy is a
> generally useful extension. The mempolicy code is heavily current task
> based and there might be some challenges but I believe this will a)
> improve the code base and b) allow more usecases.

Just read up on the pidfd_set_mempolicy lore, and yes I'm seeing all the
same problems (I know there was discussion of VMA policies, but I think
that can be a topic for later).  Have some thoughts on this, but will
take some time to work through a few refactoring tickets first.

> 
> That being said, I still believe that a cgroup based interface is a much
> better choice over a global one. Cpusets seem to be a good fit as the
> controller does control memory placement wrt NUMA interfaces.

I think cpusets is a non-starter due to the global spinlock required when
reading information from it:

https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391

Unless the proposal is to place the weights as a global cgroups value,
in which case I think it would be better placed in default_mempolicy :]
Michal Hocko Nov. 14, 2023, 5:01 p.m. UTC | #20
On Tue 14-11-23 10:50:51, Gregory Price wrote:
> On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
[...]
> > That being said, I still believe that a cgroup based interface is a much
> > better choice over a global one. Cpusets seem to be a good fit as the
> > controller does control memory placement wrt NUMA interfaces.
> 
> I think cpusets is a non-starter due to the global spinlock required when
> reading information from it:
> 
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391

Right, our current cpuset implementation indeed requires the callback lock
from the page allocator. But that is an implementation detail. I do not
remember bug reports about the lock being a bottleneck though. If
anything, cpuset lock optimizations would be a win also for users who do
not want to use the weighted interleave interface.
Gregory Price Nov. 14, 2023, 5:49 p.m. UTC | #21
On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote:
> On Tue 14-11-23 10:50:51, Gregory Price wrote:
> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
> [...]
> > > That being said, I still believe that a cgroup based interface is a much
> > > better choice over a global one. Cpusets seem to be a good fit as the
> > > controller does control memory placement wrt NUMA interfaces.
> > 
> > I think cpusets is a non-starter due to the global spinlock required when
> > reading information from it:
> > 
> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391
> 
> Right, our current cpuset implementation indeed requires callback lock
> from the page allocator. But that is an implementation detail. I do not
> remember bug reports about the lock being a bottle neck though. If
> anything cpusets lock optimizations would be win also for users who do
> not want to use weighted interleave interface.

Definitely agree, but that's a rather large increase of scope :[

We could consider a push-model similar to how cpuset nodemasks are
pushed down to mempolicies, rather than a pull-model of having
mempolicy read directly from cpusets, at least until cpusets lock
optimization is undertaken.

This pattern looks like a wart to me, which is why I avoided it, but the
locking implications on the pull-model make me sad.

Would like to point out that Tejun pushed back on implementing weights
in cgroups (regardless of subcomponent), so I think we need to come
to a consensus on where this data should live in a "more global"
context (cpusets, memcg, nodes, etc) before I go mucking around
further.

So far we have:
* mempolicy: updating weights is a very complicated undertaking,
             and there is no (good) way to do this from outside the task;
             it would be better to have coarser-grained control.

             New syscall is likely needed to add/set weights in the
	     per-task mempolicy, or bite the bullet on set_mempolicy2
	     and make the syscall extensible for the future.

* memtiers: tier=node when devices are already interleaved or when all
            devices are different, so why add yet another layer of
	    complexity if other constructs already exist.  Additionally,
	    you lose task-placement relative weighting (or it becomes
	    very complex to implement).

* cgroups: "this doesn't involve dynamic resource accounting /
            enforcement at all" and "these aren't resource
	    allocations, it's unclear what the hierarchical
	    relationship mean".

* node: too global, explore smaller scope first then expand.

For now I think there is consensus that mempolicy should have weights
per-task regardless of how the more-global mechanism is defined, so I'll
go ahead and put up another RFC for some options on that in the next
week or so.

The limitations on the first pass will be that only the task is capable
of re-weighting should cpusets.mems or the nodemask change.

~Gregory
Huang, Ying Nov. 15, 2023, 5:56 a.m. UTC | #22
Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Nov 14, 2023 at 06:01:13PM +0100, Michal Hocko wrote:
>> On Tue 14-11-23 10:50:51, Gregory Price wrote:
>> > On Tue, Nov 14, 2023 at 10:43:13AM +0100, Michal Hocko wrote:
>> [...]
>> > > That being said, I still believe that a cgroup based interface is a much
>> > > better choice over a global one. Cpusets seem to be a good fit as the
>> > > controller does control memory placement wrt NUMA interfaces.
>> > 
>> > I think cpusets is a non-starter due to the global spinlock required when
>> > reading information from it:
>> > 
>> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L391
>> 
>> Right, our current cpuset implementation indeed requires callback lock
>> from the page allocator. But that is an implementation detail. I do not
>> remember bug reports about the lock being a bottle neck though. If
>> anything cpusets lock optimizations would be win also for users who do
>> not want to use weighted interleave interface.
>
> Definitely agree, but that's a rather large increase of scope :[
>
> We could consider a push-model similar to how cpuset nodemasks are
> pushed down to mempolicies, rather than a pull-model of having
> mempolicy read directly from cpusets, at least until cpusets lock
> optimization is undertaken.
>
> This pattern looks like a wart to me, which is why I avoided it, but the
> locking implications on the pull-model make me sad.
>
> Would like to point out that Tejun pushed back on implementing weights
> in cgroups (regardless of subcomponent), so I think we need to come
> to a consensus on where this data should live in a "more global"
> context (cpusets, memcg, nodes, etc) before I go mucking around
> further.
>
> So far we have:
> * mempolicy: updating weights is a very complicated undertaking,
>              and no (good) way to do this from outside the task.
> 	     would be better to have a coarser grained control.
>
>              New syscall is likely needed to add/set weights in the
> 	     per-task mempolicy, or bite the bullet on set_mempolicy2
> 	     and make the syscall extensible for the future.
>
> * memtiers: tier=node when devices are already interleaved or when all
>             devices are different, so why add yet another layer of
> 	    complexity if other constructs already exist.  Additionally,
> 	    you lose task-placement relative weighting (or it becomes
> 	    very complex to implement).

Because we usually have multiple nodes in one mem-tier, I still think
a mem-tier-based interface is simpler than a node-based one.  But it seems
more complex to introduce mem-tiers into mempolicy, especially if we have
per-task weights.  So, I am fine with going with a node-based interface.

> * cgroups: "this doesn't involve dynamic resource accounting /
>             enforcement at all" and "these aren't resource
> 	    allocations, it's unclear what the hierarchical
> 	    relationship mean".
>
> * node: too global, explore smaller scope first then expand.

Why is it too global?  I understand that it doesn't cover all possible
use cases (although I don't know whether these use cases are practical
or not).  But it can provide a reasonable default per-node weight based
on available node performance information (such as HMAT, CDAT, etc.).
And quite a few workloads can just use it.  I think this is a useful
feature.

> For now I think there is consensus that mempolicy should have weights
> per-task regardless of how the more-global mechanism is defined, so I'll
> go ahead and put up another RFC for some options on that in the next
> week or so.
>
> The limitations on the first pass will be that only the task is capable
> of re-weighting should cpusets.mems or the nodemask change.

--
Best Regards,
Huang, Ying
Gregory Price Dec. 4, 2023, 3:33 a.m. UTC | #23
On Wed, Nov 15, 2023 at 01:56:53PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> Because we usually have multiple nodes in one mem-tier, I still think
> mem-tier-based interface is simpler than node-based.  But, it seems more
> complex to introduce mem-tier into mempolicy.  Especially if we have
> per-task weights.  So, I am fine to go with node-based interface.
> 
> > * cgroups: "this doesn't involve dynamic resource accounting /
> >             enforcement at all" and "these aren't resource
> > 	    allocations, it's unclear what the hierarchical
> > 	    relationship mean".
> >
> > * node: too global, explore smaller scope first then expand.
> 
> Why is it too global?  I understand that it doesn't cover all possible
> use cases (although I don't know whether these use cases are practical
> or not).  But it can provide a reasonable default per-node weight based
>> on available node performance information (such as HMAT, CDAT, etc.).
>> And quite a few workloads can just use it.  I think this is a useful
>> feature.
>

Have been sharing notes with more folks.  Michal thinks a global set of
weights is unintuitive and not useful, and would prefer to see the
per-task weights first.

Though this may have been in response to adding it as an attribute of
nodes directly. 

Another proposal here suggested adding a new sysfs setting
https://github.com/skhynix/linux/commit/61d2fcc7a880185df186fa2544edcd2f8785952a

  $ tree /sys/kernel/mm/interleave_weight/
  /sys/kernel/mm/interleave_weight/
  ├── enabled [1]
  ├── possible [2]
  └── node
      ├── node0
      │   └── interleave_weight [3]
      └── node1
          └── interleave_weight [3]

(this could be changed to /sys/kernel/mm/mempolicy/...)

I think the internal representation of this can be simplified greatly
compared to what the patch provides now, but maybe this solves the "it doesn't
belong in these other components" issue.

Answer: Simply leave it as a static global kobject in mempolicy, which
also deals with many of the issues regarding race conditions.

If a user provides weights, use those.  If they do not, use globals.

On a cpuset rebind event (container migration, mems_allowed changes),
manually set weights would have to remain, so in a bad case, the
weights would be very out of line with the real distribution of memory.

Example: if your nodemask is (0,1,2) and a migration changes it to
(3,4,5), then unfortunately your weights will likely revert to [1,1,1]

If set with global weights, they could automatically adjust.  It
would not be perfect, but it would be better than the potential worst
case above.  If that same migration occurs, the next allocation would
simply use whatever the target node weights are in the global config.

So if globally you have weights [3,2,1,1,2,3], and you move from
nodemask (0,1,2) to (3,4,5), your weights change from [3,2,1] to
[1,2,3].  If the structure is built as a matrix of (cpu_node,mem_nodes),
then you can also optimize based on the node the task is running on.
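
A toy userspace sketch of that idea (names and layout purely illustrative):
the effective weights are simply looked up in the global table for whatever
nodes are currently allowed, so a rebind picks up the new values automatically.

    #include <stdio.h>

    /* Hypothetical global per-node weight table: [3,2,1,1,2,3]. */
    static const unsigned char global_weight[6] = { 3, 2, 1, 1, 2, 3 };

    static void show_effective(const char *when, const int *nodes, int count)
    {
            printf("%s:\n", when);
            for (int i = 0; i < count; i++)
                    printf("  node %d -> weight %u\n",
                           nodes[i], global_weight[nodes[i]]);
    }

    int main(void)
    {
            const int before[] = { 0, 1, 2 };  /* original nodemask */
            const int after[]  = { 3, 4, 5 };  /* nodemask after rebind */

            show_effective("before rebind", before, 3);
            show_effective("after rebind", after, 3);
            return 0;
    }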

That feels very intuitive, deals with many race condition issues, and
the global setting can actually be implemented without the need for
set_mempolicy2 at all - which is certainly a bonus.

Would love more thoughts here.  Will have a new RFC with set_mempolicy2,
mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrates the above.

Regards
~Gregory
Huang, Ying Dec. 4, 2023, 8:19 a.m. UTC | #24
Gregory Price <gregory.price@memverge.com> writes:

> On Wed, Nov 15, 2023 at 01:56:53PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> Because we usually have multiple nodes in one mem-tier, I still think
>> mem-tier-based interface is simpler than node-based.  But, it seems more
>> complex to introduce mem-tier into mempolicy.  Especially if we have
>> per-task weights.  So, I am fine to go with node-based interface.
>> 
>> > * cgroups: "this doesn't involve dynamic resource accounting /
>> >             enforcement at all" and "these aren't resource
>> > 	    allocations, it's unclear what the hierarchical
>> > 	    relationship mean".
>> >
>> > * node: too global, explore smaller scope first then expand.
>> 
>> Why is it too global?  I understand that it doesn't cover all possible
>> use cases (although I don't know whether these use cases are practical
>> or not).  But it can provide a reasonable default per-node weight based
>> on available node performance information (such as, HMAT, CDAT, etc.).
>> And, quite some workloads can just use it.  I think this is an useful
>> feature.
>>
>
> Have been sharing notes with more folks.  Michal thinks a global set of
> weights is unintuitive and not useful, and would prefer to see the
> per-task weights first.
>
> Though this may have been in response to adding it as an attribute of
> nodes directly. 
>
> Another proposal here suggested adding a new sysfs setting
> https://github.com/skhynix/linux/commit/61d2fcc7a880185df186fa2544edcd2f8785952a
>
>   $ tree /sys/kernel/mm/interleave_weight/
>   /sys/kernel/mm/interleave_weight/
>   ├── enabled [1]
>   ├── possible [2]
>   └── node
>       ├── node0
>       │   └── interleave_weight [3]
>       └── node1
>           └── interleave_weight [3]
>
> (this could be changed to /sys/kernel/mm/mempolicy/...)
>
> I think the internal representation of this can be simplified greatly,
> over what the patch provides now, but maybe this solves the "it doesn't
> belong in these other components" issue.
>
> Answer: Simply leave it as a static global kobject in mempolicy, which
> also deals with many of the issues regarding race conditions.

Although I personally prefer to add interleave weight as an attribute of
nodes, I understand that some people think it's not appropriate to
place anything node-specific there.  So, some place under /sys/kernel/mm
sounds reasonable too.

> If a user provides weights, use those.  If they do not, use globals.

Yes.  That is the target use case.

> On a cpuset rebind event (container migration, mems_allowed changes),
> manually set weights would have to remain, so in a bad case, the
> weights would be very out of line with the real distribution of memory.
>
> Example: if your nodemask is (0,1,2) and a migration changes it to
> (3,4,5), then unfortunately your weights will likely revert to [1,1,1]
>
> If set with global weights, they could automatically adjust.  It
> would not be perfect, but it would be better than the potential worst
> case above.  If that same migration occurs, the next allocation would
> simply use whatever the target node weights are in the global config.
>
> So if globally you have weights [3,2,1,1,2,3], and you move from
> nodemask (0,1,2) to (3,4,5), your weights change from [3,2,1] to
> [1,2,3].

That is nice.  And I prefer to emphasize the simple use case.  Users
don't always need to specify interleave weights.  Just use the
MPOL_WEIGHTED_INTERLEAVE policy, and the system will provide reasonable
default weights.
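
For illustration, that simple case might look roughly like this from
userspace (MPOL_WEIGHTED_INTERLEAVE is the proposed mode and is not in
numaif.h yet, so the value below is only a placeholder):

  #include <numaif.h>     /* link with -lnuma */

  #ifndef MPOL_WEIGHTED_INTERLEAVE
  #define MPOL_WEIGHTED_INTERLEAVE 6    /* placeholder for the proposed mode */
  #endif

  /* interleave new allocations across nodes 0 and 1 using whatever
   * default weights the administrator (or system) has configured */
  int opt_into_weighted_interleave(void)
  {
          unsigned long nodemask = (1UL << 0) | (1UL << 1);

          return set_mempolicy(MPOL_WEIGHTED_INTERLEAVE, &nodemask,
                               sizeof(nodemask) * 8);
  }

No weights are passed at all; the kernel falls back to the global defaults.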

> If the structure is built as a matrix of (cpu_node,mem_nodes),
> then you can also optimize based on the node the task is running on.

The matrix stuff makes the situation complex.  If people do need
something like that, they can just use set_mempolicy2() with
user-specified weights.  I still believe in "make simple stuff simple,
and complex stuff possible".

> That feels very intuitive, deals with many race condition issues, and
> the global setting can actually be implemented without the need for
> set_mempolicy2 at all - which is certainly a bonus.
>
> Would love more thoughts here.  Will have a new RFC with set_mempolicy2,
> mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrate the above.

Thanks for doing all these!

--
Best Regards,
Huang, Ying
Gregory Price Dec. 4, 2023, 1:50 p.m. UTC | #25
On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > If the structure is built as a matrix of (cpu_node,mem_nodes),
> > then you can also optimize based on the node the task is running on.
> 
> The matrix stuff makes the situation complex.  If people do need
> something like that, they can just use set_mempolicy2() with user
> specified weights.  I still believe that "make simple stuff simple, and
> complex stuff possible".
> 

I don't think it's particularly complex, since we already have a
distance matrix for numa nodes:

available: 2 nodes (0-1)
... snip ...
node distances:
node   0   1
  0:  10  21
  1:  21  10

This would follow the same thing, just adjustable for bandwidth.

I personally find the (src,dst) matrix very important for flexibility.

But if there is particular pushback against it, having a one dimensional
array is better than not having it, so I will take what I can get.

> > That feels very intuitive, deals with many race condition issues, and
> > the global setting can actually be implemented without the need for
> > set_mempolicy2 at all - which is certainly a bonus.
> >
> > Would love more thoughts here.  Will have a new RFC with set_mempolicy2,
> > mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrate the above.
> 
> Thanks for doing all these!
> 

Someone's got to :]

~Gregory
Huang, Ying Dec. 5, 2023, 9:01 a.m. UTC | #26
Gregory Price <gregory.price@memverge.com> writes:

> On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> > If the structure is built as a matrix of (cpu_node,mem_nodes),
>> > then you can also optimize based on the node the task is running on.
>> 
>> The matrix stuff makes the situation complex.  If people do need
>> something like that, they can just use set_mempolicy2() with user
>> specified weights.  I still believe that "make simple stuff simple, and
>> complex stuff possible".
>> 
>
> I don't think it's particularly complex, since we already have a
> distance matrix for numa nodes:
>
> available: 2 nodes (0-1)
> ... snip ...
> node distances:
> node   0   1
>   0:  10  21
>   1:  21  10
>
> This would follow the same thing, just adjustable for bandwidth.

We add complexity for a real requirement, not because something
similar already exists.

> I personally find the (src,dst) matrix very important for flexibility.

With set_mempolicy2(), I think we have the needed flexibility for
users who need the complexity.

> But if there is particular pushback against it, having a one dimensional
> array is better than not having it, so I will take what I can get.

TBH, I don't think that we really need that.  Especially given we will
have set_mempolicy2().

>> > That feels very intuitive, deals with many race condition issues, and
>> > the global setting can actually be implemented without the need for
>> > set_mempolicy2 at all - which is certainly a bonus.
>> >
>> > Would love more thoughts here.  Will have a new RFC with set_mempolicy2,
>> > mbind2, and MPOL_WEIGHTED_INTERLEAVE soon that demonstrate the above.
>> 
>> Thanks for doing all these!
>> 
>
> Someone's got to :]
>

--
Best Regards,
Huang, Ying
Gregory Price Dec. 5, 2023, 2:47 p.m. UTC | #27
On Tue, Dec 05, 2023 at 05:01:51PM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote:
> >> Gregory Price <gregory.price@memverge.com> writes:
> >> 
> >> > If the structure is built as a matrix of (cpu_node,mem_nodes),
> >> > then you can also optimize based on the node the task is running on.
> >> 
> >> The matrix stuff makes the situation complex.  If people do need
> >> something like that, they can just use set_mempolicy2() with user
> >> specified weights.  I still believe that "make simple stuff simple, and
> >> complex stuff possible".
> >> 
> >
> > I don't think it's particularly complex, since we already have a
> > distance matrix for numa nodes:
> >
> > available: 2 nodes (0-1)
> > ... snip ...
> > node distances:
> > node   0   1
> >   0:  10  21
> >   1:  21  10
> >
> > This would follow the same thing, just adjustable for bandwidth.
> 
> We add complexity for a real requirement, not because something
> similar already exists.
> 
> > I personally find the (src,dst) matrix very important for flexibility.
> 
> With set_mempolicy2(), I think we have the needed flexibility for
> users who need the complexity.
> 
> > But if there is particular pushback against it, having a one dimensional
> > array is better than not having it, so I will take what I can get.
> 
> TBH, I don't think that we really need that.  Especially given we will
> have set_mempolicy2().
>

From a complexity standpoint, it is exactly as complex as the hardware
configuration itself:  each socket has a different view of the memory
topology. If you have a non-homogeneous memory configuration (e.g. a
different number of CXL expanders on one socket than the other), a flat
array of weights has no way of capturing this hardware configuration.

That makes the feature significantly less useful. In fact, it makes the
feature equivalent to set_mempolicy2 - except that weights could be
changed at runtime from outside a process.


A matrix resolves one very specific use case: task migration


set_mempolicy2 is not sufficient to solve this.  There is presently no
way for an external task to change the mempolicy of an existing task.
That means a task must become "migration aware" to use weighting in the
context of containers where migrations are likely.

Two things to consider: A task...
   a) has no way of knowing a migration occurred
   b) may not have visibility of numa nodes outside its cpusets prior to
      a migration - making it unlikely or impossible for it to set
      weights correctly in the event a migration occurs.

If a server with 2 sockets is set up non-homogeneously (a different
number of CXL memory expanders on each socket), then the effective
bandwidth distribution between sockets will be different.

If a container is migrated between sockets in this situation, then tasks
with manually set weights - or tasks relying on a single global array of
weights - will have a poor memory distribution relative to the new view
of the system.

Requiring the global setting to be a flat array basically guarantees
that the global weights are sub-optimal for any use case that is not a
single workload consuming all the cores on the system.

If the system provides a matrix, then the global settings can be optimal
and re-weighting in response to migration happens cleanly and transparently.

~Gregory
Huang, Ying Dec. 6, 2023, 12:50 a.m. UTC | #28
Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Dec 05, 2023 at 05:01:51PM +0800, Huang, Ying wrote:
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> > On Mon, Dec 04, 2023 at 04:19:02PM +0800, Huang, Ying wrote:
>> >> Gregory Price <gregory.price@memverge.com> writes:
>> >> 
>> >> > If the structure is built as a matrix of (cpu_node,mem_nodes),
>> >> > then you can also optimize based on the node the task is running on.
>> >> 
>> >> The matrix stuff makes the situation complex.  If people do need
>> >> something like that, they can just use set_mempolicy2() with user
>> >> specified weights.  I still believe that "make simple stuff simple, and
>> >> complex stuff possible".
>> >> 
>> >
>> > I don't think it's particularly complex, since we already have a
>> > distance matrix for numa nodes:
>> >
>> > available: 2 nodes (0-1)
>> > ... snip ...
>> > node distances:
>> > node   0   1
>> >   0:  10  21
>> >   1:  21  10
>> >
>> > This would follow the same thing, just adjustable for bandwidth.
>> 
>> We add complexity for a real requirement, not because something
>> similar already exists.
>> 
>> > I personally find the (src,dst) matrix very important for flexibility.
>> 
>> With set_mempolicy2(), I think we have the needed flexibility for
>> users who need the complexity.
>> 
>> > But if there is particular pushback against it, having a one dimensional
>> > array is better than not having it, so I will take what I can get.
>> 
>> TBH, I don't think that we really need that.  Especially given we will
>> have set_mempolicy2().
>>
>
> From a complexity standpoint, it is exactly as complex as the hardware
> configuration itself:  each socket has a different view of the memory
> topology. If you have a non-homogeneous memory configuration (e.g. a 
> different number of CXL expanders on one socket than the other), a flat
> array of weights has no way of capturing this hardware configuration.

One important task of the software is to hide the complexity of hardware
from the users.  At least it should provide the option.  It should only
add complexity based on real requirements.

> That makes the feature significantly less useful. In fact, it makes the
> feature equivalent to set_mempolicy2 - except that weights could be
> changed at runtime from outside a process.
>
>
> A matrix resolves one very specific use case: task migration
>
>
> set_mempolicy2 is not sufficient to solve this.  There is presently no
> way for an external task to change the mempolicy of an existing task.
> That means a task must become "migration aware" to use weighting in the
> context of containers where migrations are likely.
>
> Two things to consider: A task...
>    a) has no way of knowing a migration occurred
>    b) may not have visibility of numa nodes outside its cpusets prior to
>       a migration - making it unlikely/not possible for them to set
>       weights correctly in the event a migration occurs.
>
> If a server with 2 sockets is set up non-homogeneously (different amount
> of CXL memory expanders on each socket), then the effective bandwidth
> distribution between sockets will be different.
>
> If a container is migrated between sockets in this situation, then tasks
> with manually set weights, or if global weights are a single array, will
> have poor memory distributions in relation to the new view of the system.
>
> Requiring the global settings to be an array basically requires global
> weights to be sub-optimal for any use cases that is not explicitly a
> single workload that consumes all the cores on the system.
>
> If the system provides a matrix, then the global settings can be optimal
> and re-weighting in response to migration happens cleanly and transparently.

For these complex requirements, we will have process_set_mempolicy2().
I think that it's even more flexible than the global matrix.

--
Best Regards,
Huang, Ying
Gregory Price Dec. 6, 2023, 2:01 a.m. UTC | #29
On Wed, Dec 06, 2023 at 08:50:23AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> >
> > From a complexity standpoint, it is exactly as complex as the hardware
> > configuration itself:  each socket has a different view of the memory
> > topology. If you have a non-homogeneous memory configuration (e.g. a 
> > different number of CXL expanders on one socket than the other), a flat
> > array of weights has no way of capturing this hardware configuration.
> 
> One important task of the software is to hide the complexity of hardware
> from the users.  At least it should provide the option.  It should only add
> complexity based on real requirements.
> 

The global weights are intended to help administrators hide that
complexity from actual end-users.

The administrator of a system should already be aware of the hardware
configuration. To hide that complexity even further, a system service
can be made which auto-configures these weights at system bring-up and
on memory-device hotplug.

A system service can use ACPI HMAT (Heterogeneous Memory Attribute
Table) information to automatically set the global weights at boot time
and/or on hotplug. Such extensions have already been proposed in prior
RFCs and on the cxl mailing list.
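
As a rough sketch of that idea (the access0/initiators attributes are the
existing HMAT sysfs files; where the computed weights get written depends
on which global interface we land on, so that step is left as a comment):

  /* userspace sketch: derive per-node interleave weights from HMAT data */
  #include <stdio.h>

  static long node_read_bw(int nid)
  {
          char path[96];
          long bw = 0;
          FILE *f;

          snprintf(path, sizeof(path),
                   "/sys/devices/system/node/node%d/access0/initiators/read_bandwidth",
                   nid);
          f = fopen(path, "r");
          if (!f)
                  return 0;
          if (fscanf(f, "%ld", &bw) != 1)
                  bw = 0;
          fclose(f);
          return bw;
  }

  int main(void)
  {
          long bw[2] = { node_read_bw(0), node_read_bw(1) };
          long min = bw[0] < bw[1] ? bw[0] : bw[1];
          int nid;

          if (min <= 0)
                  return 1;

          /* crude normalization: the slowest node gets weight 1; a real
           * service would reduce the ratios more carefully and clamp to
           * the supported weight range before writing them out */
          for (nid = 0; nid < 2; nid++)
                  printf("%d:%ld\n", nid, bw[nid] / min);

          /* ...then write the result to the global weight interface */
          return 0;
  }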



To break this down a little more explicitly into 6 example use cases,
let's consider the potential ways in which weighted interleave may be
set via set_mempolicy() or set_mempolicy2().

1. Actual end-user software calls it directly (or through libnuma)
   a) they can call set_mempolicy() without task-weights and accept the
      administrator configured global weights
   b) they can call set_mempolicy2() with task-weights and use task-local
      defined weighting
2. Actual end-user uses `numactl -w[weights] --interleave ...`
   a) if weights are not defined, use global weights
   b) if weights are defined, use task-local weights
3. Administrator / Orchestrator opts user-software into weighted
   interleave by wrapping their software into `numactl -w --interleave`
   a) if weights are not defined, use global weights
   b) if weights are defined, use task-local weights

The most common use case is likely to be (3a) - an administrator opting
a user-workload into weighted-interleave via `numactl -w --interleave`
or an orchestrator such as kubernetes doing something similar on
pod/container dispatching.



In all cases where the user does not define weights, they are trusting
the weights set by the administrator (or a system daemon) to provide the
optimal distribution, removing the complexity of understanding the
hardware environment from the end-user.



In all cases where the user does define weights, they are accepting the
complexity of understanding the hardware environment.



On the topic of the ACTUAL complexity of system hardware that is being
hidden, we must consider a non-homogeneous bandwidth environment. The
simplest form is an off-the-shelf Intel 2-socket server with CXL memory
expanders.

Let's consider a 2-socket system with the following configuration:

DRAM on Socket0:  300GB/s local DRAM bandwidth  (node 0)
DRAM on Socket1:  300GB/s local DRAM bandwidth  (node 1)
CXL on socket0:   128GB/s bandwidth             (node 2)
CXL on socket1:   128GB/s bandwidth             (node 3)

A single linear array of weights is not sufficient to capture the
complexities of bandwidth distributions on this system, because
of the presence of a UPI link between socket0 and socket1, which
changes the bandwidth distribution depending on where a task runs.

For example, 3 UPI links provide 62.4GB/s full-duplex.

From the perspective of socket 0, the following is true:

Bandwidth to Socket0 DRAM:  300GB/s    (node 0)
Bandwidth to Socket0 CXL:   100GB/s    (node 2)
Aggregate bandwidth to nodes (1,3):  62.4GB/s

From the perspective of socket 1, this changes to:

Bandwidth to Socket1 DRAM:  300GB/s    (node 1)
Bandwidth to Socket1 CXL:   100GB/s    (node 3)
Aggregate bandwidth to nodes (0,2):  62.4GB/s

With a single linear array of weights that applies to the entire system,
you cannot represent this configuration.  In fact, a single set of
weights will always produce a sub-optimal distribution.

Insisting on that kind of simplicity defeats the entire goal of weighted
interleave in a heterogeneous environment.
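
For concreteness, assuming weights roughly proportional to the bandwidth
numbers above (with the UPI bandwidth split across the two remote nodes),
a (src,dst) weight matrix for this box might look something like:

               target:  node0  node1  node2  node3
    task on socket0:       10      1      3      1
    task on socket1:        1     10      1      3

i.e. each socket interleaves mostly into its local DRAM, a meaningful
share into its local CXL, and only a trickle across the UPI link -
something no single per-node array can express for both sockets at once.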

> 
> For these complex requirements, we will have process_set_mempolicy2().
> I think that it's even more flexible than the global matrix.
>

process_set_mempolicy2() has a *very* long road ahead of it. The problem
of mempolicy reference counting is non-trivial, and the plumbing requires
changes to no fewer than 4 subsystems.

Beyond that, the complexity of actually using process_set_mempolicy2()
is the same as in any situation where set_mempolicy2() is used with
task-local weights:  the absolute highest.

The global weighting matrix actually hides this complexity entirely.

> --
> Best Regards,
> Huang, Ying