[RFC,v3,0/4] Node Weights and Weighted Interleave

Message ID 20231031003810.4532-1-gregory.price@memverge.com

Gregory Price Oct. 31, 2023, 12:38 a.m. UTC
This patchset implements weighted interleave and adds a new sysfs
entry: /sys/devices/system/node/nodeN/accessM/il_weight.

The il_weight of a node is used by mempolicy to implement weighted
interleave when `numactl --interleave=...` is invoked.  By default
il_weight for a node is always 1, which preserves the default round
robin interleave behavior.

Interleave weights may be set from 0-100, and denote the number of
pages that should be allocated from the node when interleaving
occurs.

For example, if a node's interleave weight is set to 5, 5 pages
will be allocated from that node before the next node is scheduled
for allocations.

Additionally, "node accessors" (synonymous with cpu nodes) are used
to allow for accessor-relative weighting.  The "accessor" for a task
is defined as the node the task is presently running on.

# Set node weight for node0 accessed by tasks on node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight

# Set node weight for node0 accessed by tasks on node1 to 3
echo 3 > /sys/devices/system/node/node0/access1/il_weight

In this way it becomes possible to set an interleaving strategy
that fits the available bandwidth for the devices available on
the system. An example system:

Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex

In this setup, the effective weights for nodes 0-3 for a task
running on Node 0 may be [60, 20, 10, 10].

This spreads memory out across devices which all have different
latency and bandwidth attributes in a way that can maximize the
available resources.
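
One way to arrive at weights of this kind is to normalize the bandwidth each
node offers to the accessor. The sketch below is only an assumption about how
an administrator might do that by hand; it is not part of this series. The
function name `weights_from_bandwidth` is hypothetical, Node 1 is taken at its
cross-socket rate, and any cross-socket penalty for Node 3 is ignored. The
result approximates, but does not reproduce, the [60, 20, 10, 10] above.

```python
def weights_from_bandwidth(bandwidth_gbs, total=100):
    """Scale per-node bandwidth (GB/s) to integer weights summing to about `total`."""
    bw_sum = sum(bandwidth_gbs.values())
    # Clamp to at least 1 so that a slow node still takes part in the interleave.
    return {node: max(1, round(total * bw / bw_sum))
            for node, bw in bandwidth_gbs.items()}


# Bandwidth as seen by a task on Node 0 (Node 1 is reached cross-socket).
seen_from_node0 = {0: 400, 1: 200, 2: 64, 3: 64}
print(weights_from_bandwidth(seen_from_node0))  # {0: 55, 1: 27, 2: 9, 3: 9}
```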

~Gregory

(sorry for the repeat send, automation failure)

================================================================

Version Notes:

v3: move weights into node rather than memtiers
    some additional fixes to node.c to support this

v1/v2: add weighted-interleave support to mempolicy

= v3 notes

This update effectively removes the connection between mempolicy
and memory-tiers by simply placing the interleave weights directly
in the node accessor information structure.

Node was recommended by Huang, Ying
Accessor was recommended by Ravi Shankar


== Move weights into node

Originally this work was done by placing weights in the memory tier.
In this patch set we changed the weights to live in the numa node
accessor structure, which allows for a more natural weighting scheme
and also supports source-node relative weighting.

Interleave weight is located in:
/sys/devices/system/node/nodeN/accessM/il_weight

and is set with a value between 1 and 100:

# Set node weight for node0 accessed by node0 to 5
echo 5 > /sys/devices/system/node/node0/access0/il_weight

By default, il_weight is always set to 1, which mimics the default
interleave behavior (simple round-robin).


== Other Node fixes

Two other updates to node.c were required to support this:

1) The access list must be initialized prior to the node struct
   pointer being registered in the node array

2) The accessors in the list must be registered regardless of
   whether HMAT/HMEM information is reported. Presently this
   results in 0-value information being present in the various
   access subgroups.


== Weighted interleave

mm/mempolicy: modify interleave mempolicy to use node weights

The node subsystem implements interleave weighting for the purpose
of bandwidth optimization.  Each node may have different weights in
relation to each compute node ("access node").

The mempolicy MPOL_INTERLEAVE utilizes the node weights to implement
weighted interleave.  By default, since all nodes default to a weight
of 1, the original interleave behavior is retained.

Examples

Weight settings:
  echo 4 > node0/access0/il_weight
  echo 3 > node1/access0/il_weight
  echo 2 > node1/access1/il_weight
  echo 1 > node0/access1/il_weight

Results:

Task A:
  cpunode:  0
  nodemask: [0,1]
  weights:  [4,3]
  allocation result: [0,0,0,0,1,1,1 repeat]

Task B:
  cpunode:  1
  nodemask: [0,1]
  weights:  [1,2]
  allocation result: [0,1,1 repeat]
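
The two results above can be reproduced with a minimal sketch of the stated
semantics: each node in the nodemask receives `weight` consecutive pages
before the next node is scheduled. This is not the kernel code from this
series. The generator name `weighted_interleave` and the dict of weights are
illustrative, and skipping nodes with weight 0 is an assumption.

```python
from itertools import islice


def weighted_interleave(nodemask, weights):
    """Yield target node ids forever: each node gets weights[node]
    consecutive pages before the next node in the nodemask is used.
    Nodes without an entry fall back to the default weight of 1."""
    # Assumption: a node with weight 0 is simply skipped.
    active = [n for n in nodemask if weights.get(n, 1) > 0]
    if not active:
        raise ValueError("no node with a non-zero weight in nodemask")
    while True:
        for node in active:
            for _ in range(weights.get(node, 1)):
                yield node


# Task A runs on node 0: weights come from node0/access0 and node1/access0.
task_a = list(islice(weighted_interleave([0, 1], {0: 4, 1: 3}), 7))
# Task B runs on node 1: weights come from node0/access1 and node1/access1.
task_b = list(islice(weighted_interleave([0, 1], {0: 1, 1: 2}), 3))
print(task_a)  # [0, 0, 0, 0, 1, 1, 1]
print(task_b)  # [0, 1, 1]
```

With every weight left at the default of 1 the generator degenerates to plain
round-robin, which is the behavior-preserving case described above.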

=== Original RFCs ===

Memory-tier based weights
By: Ravi Shankar
https://lore.kernel.org/all/20230927095002.10245-1-ravis.opensrc@micron.com/

Mempolicy multi-node weighting w/ set_mempolicy2:
By: Gregory Price
https://lore.kernel.org/all/20231003002156.740595-1-gregory.price@memverge.com/

N:M weighting in mempolicy
By: Hasan Al Maruf
https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

Ying Huang's presentation in lpc22, 16th slide in
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/\
Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

Gregory Price (4):
  base/node.c: initialize the accessor list before registering
  node: add accessors to sysfs when nodes are created
  node: add interleave weights to node accessor
  mm/mempolicy: modify interleave mempolicy to use node weights

 drivers/base/node.c       | 120 ++++++++++++++++++++++++++++++++-
 include/linux/mempolicy.h |   4 ++
 include/linux/node.h      |  17 +++++
 mm/mempolicy.c            | 138 +++++++++++++++++++++++++++++---------
 4 files changed, 246 insertions(+), 33 deletions(-)

Comments

Gregory Price Oct. 31, 2023, 4:27 a.m. UTC | #1
On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:

> > This hopefully also explains why it's a global setting. The usecase is
> > different from conventional NUMA interleaving, which is used as a
> > locality measure: spread shared data evenly between compute
> > nodes. This one isn't about locality - the CXL tier doesn't have local
> > compute. Instead, the optimal spread is based on hardware parameters,
> > which is a global property rather than a per-workload one.
> 
> Well, I am not convinced about that TBH. Sure it is probably a good fit
> for this specific CXL usecase but it just doesn't fit into many others I
> can think of - e.g. proportional use of those tiers based on the
> workload - you get what you pay for.
> 
> Is there any specific reason for not having a new interleave interface
> which defines weights for the nodemask? Is this because the policy
> itself is very dynamic or is this more driven by simplicity of use?
> 

I had originally implemented it this way while experimenting with new
mempolicies.

https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@memverge.com/

The downsides of doing it in mempolicy are:
1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
   non-trivial task.  It is very "current-task" centric.

2) Barring a change to mempolicy to be sysfs friendly, the options for
   implementing weights in the mempolicy are either a) new flag and
   setting every weight individually in many syscalls, or b) a new
   syscall (set_mempolicy2), which is what I demonstrated in the RFC.

3) mempolicy is also subject to cgroup nodemasks, and as a result you
   end up with a rat's nest of interactions between mempolicy nodemasks
   changing as a result of cgroup migrations, nodes potentially coming
   and going (hotplug under CXL), and others I'm probably forgetting.

   Basically:  If a node leaves the nodemask, should you retain the
   weight, or should you reset it? If a new node comes into the node
   mask... what weight should you set? I did not have answers to these
   questions.


It was recommended to explore placing it in tiers instead, so I took a
crack at it here: 

https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/

This had a similar issue with the idea of hotplug nodes: if you give a
tier a weight, and one or more of the nodes goes away/comes back... what
should you do with the weight?  Split it up among the remaining nodes?
Rebalance? Etc.

The result of this discussion led us to simply say "What if we place
the weights directly in the node".  And that led us to this RFC.


I am not against implementing it in mempolicy (as proof: my first RFC).
I am simply searching for the acceptable way to implement it.

One of the benefits of having it set as a global setting is that weights
can be automatically generated from HMAT/HMEM information (ACPI tables)
and programs already using MPOL_INTERLEAVE will have a direct benefit.

I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added
alongside this patch so that MPOL_INTERLEAVE is left entirely alone.

Happy to discuss more,
~Gregory

Gregory Price Oct. 31, 2023, 4:29 a.m. UTC | #2
On Tue, Oct 31, 2023 at 12:22:16PM -0400, Johannes Weiner wrote:
> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
> > 
> > Well, I am not convinced about that TBH. Sure it is probably a good fit
> > for this specific CXL usecase but it just doesn't fit into many others I
> > can think of - e.g. proportional use of those tiers based on the
> > workload - you get what you pay for.
> > 
> > Is there any specific reason for not having a new interleave interface
> > which defines weights for the nodemask? Is this because the policy
> > itself is very dynamic or is this more driven by simplicity of use?
> 
> A downside of *requiring* weights to be paired with the mempolicy is
> that it's then the application that would have to figure out the
> weights dynamically, instead of having a static host configuration. A
> policy of "I want to be spread for optimal bus bandwidth" translates
> between different hardware configurations, but optimal weights will
> vary depending on the type of machine a job runs on.
> 
> That doesn't mean there couldn't be usecases for having weights as
> policy as well in other scenarios, like you allude to above. It's just
> so far such usecases haven't really materialized or spelled out
> concretely. Maybe we just want both - a global default, and the
> ability to override it locally. Could you elaborate on the 'get what
> you pay for' usecase you mentioned?

I've been considering "por qué no los dos" (why not both) for some time.
I already have the code for both, I just need to clean up the original RFC.

Michal Hocko Oct. 31, 2023, 9:53 a.m. UTC | #3
On Mon 30-10-23 20:38:06, Gregory Price wrote:
> This patchset implements weighted interleave and adds a new sysfs
> entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> 
> The il_weight of a node is used by mempolicy to implement weighted
> interleave when `numactl --interleave=...` is invoked.  By default
> il_weight for a node is always 1, which preserves the default round
> robin interleave behavior.
> 
> Interleave weights may be set from 0-100, and denote the number of
> pages that should be allocated from the node when interleaving
> occurs.
> 
> For example, if a node's interleave weight is set to 5, 5 pages
> will be allocated from that node before the next node is scheduled
> for allocations.

I find this semantic rather weird TBH. First of all why do you think it
makes sense to have those weights global for all users? What if
different applications have different views on how to spread their
interleaved memory?

I do get that you might have different tiers with largely different
runtime characteristics but why would you want to interleave them into a
single mapping and have hard to predict runtime behavior?

[...]
> In this way it becomes possible to set an interleaving strategy
> that fits the available bandwidth for the devices available on
> the system. An example system:
> 
> Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> 
> In this setup, the effective weights for nodes 0-3 for a task
> running on Node 0 may be [60, 20, 10, 10].
> 
> This spreads memory out across devices which all have different
> latency and bandwidth attributes at a way that can maximize the
> available resources.

OK, so why is this any better than not using any memory policy and relying
on demotion to push cold memory down the tier hierarchy?

What is the actual real life usecase and what kind of benefits can you
present?

Johannes Weiner Oct. 31, 2023, 3:21 p.m. UTC | #4
On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > This patchset implements weighted interleave and adds a new sysfs
> > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> > 
> > The il_weight of a node is used by mempolicy to implement weighted
> > interleave when `numactl --interleave=...` is invoked.  By default
> > il_weight for a node is always 1, which preserves the default round
> > robin interleave behavior.
> > 
> > Interleave weights may be set from 0-100, and denote the number of
> > pages that should be allocated from the node when interleaving
> > occurs.
> > 
> > For example, if a node's interleave weight is set to 5, 5 pages
> > will be allocated from that node before the next node is scheduled
> > for allocations.
> 
> I find this semantic rather weird TBH. First of all why do you think it
> makes sense to have those weights global for all users? What if
> different applications have different view on how to spred their
> interleaved memory?
> 
> I do get that you might have a different tiers with largerly different
> runtime characteristics but why would you want to interleave them into a
> single mapping and have hard to predict runtime behavior?
> 
> [...]
> > In this way it becomes possible to set an interleaving strategy
> > that fits the available bandwidth for the devices available on
> > the system. An example system:
> > 
> > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> > 
> > In this setup, the effective weights for nodes 0-3 for a task
> > running on Node 0 may be [60, 20, 10, 10].
> > 
> > This spreads memory out across devices which all have different
> > latency and bandwidth attributes at a way that can maximize the
> > available resources.
> 
> OK, so why is this any better than not using any memory policy rely
> on demotion to push out cold memory down the tier hierarchy?
> 
> What is the actual real life usecase and what kind of benefits you can
> present?

There are two things CXL gives you: additional capacity and additional
bus bandwidth.

The promotion/demotion mechanism is good for the capacity usecase,
where you have a nice hot/cold gradient in the workingset and want
placement accordingly across faster and slower memory.

The interleaving is useful when you have a flatter workingset
distribution and poorer access locality. In that case, the CPU caches
are less effective and the workload can be bus-bound. The workload
might fit entirely into DRAM, but concentrating it there is
suboptimal. Fanning it out in proportion to the relative performance
of each memory tier gives better results.
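
A toy model of that last point, under the simplifying assumption that a
bus-bound workload streams from all nodes in parallel and is done when the
slowest share is done. The 400 and 64 GB/s figures are the local DRAM and CXL
numbers from the example system in the cover letter; the function name
`effective_bandwidth` is illustrative and the model ignores latency, caches
and cross-socket effects.

```python
def effective_bandwidth(bandwidth_gbs, fractions):
    """Aggregate GB/s when fraction f of the data lives on each node.

    Each node needs f / bw time units per unit of data; the transfer ends
    when the slowest node is done, so throughput is 1 / max(f / bw).
    """
    assert abs(sum(fractions.values()) - 1.0) < 1e-9
    slowest = max(f / bandwidth_gbs[n] for n, f in fractions.items() if f > 0)
    return 1.0 / slowest


bw = {"dram": 400, "cxl": 64}
total = sum(bw.values())
dram_only = effective_bandwidth(bw, {"dram": 1.0, "cxl": 0.0})
even = effective_bandwidth(bw, {"dram": 0.5, "cxl": 0.5})
proportional = effective_bandwidth(bw, {n: b / total for n, b in bw.items()})
# Roughly 400, 128 and 464 GB/s respectively.
print(dram_only, even, proportional)
```

In this model an even 1:1 interleave is worse than DRAM alone because the CXL
share becomes the bottleneck, while a bandwidth-proportional split approaches
the sum of both links.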

We experimented with datacenter workloads on such machines last year
and found significant performance benefits:

https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

This hopefully also explains why it's a global setting. The usecase is
different from conventional NUMA interleaving, which is used as a
locality measure: spread shared data evenly between compute
nodes. This one isn't about locality - the CXL tier doesn't have local
compute. Instead, the optimal spread is based on hardware parameters,
which is a global property rather than a per-workload one.

Michal Hocko Oct. 31, 2023, 3:56 p.m. UTC | #5
On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
> On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> > On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > > This patchset implements weighted interleave and adds a new sysfs
> > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> > > 
> > > The il_weight of a node is used by mempolicy to implement weighted
> > > interleave when `numactl --interleave=...` is invoked.  By default
> > > il_weight for a node is always 1, which preserves the default round
> > > robin interleave behavior.
> > > 
> > > Interleave weights may be set from 0-100, and denote the number of
> > > pages that should be allocated from the node when interleaving
> > > occurs.
> > > 
> > > For example, if a node's interleave weight is set to 5, 5 pages
> > > will be allocated from that node before the next node is scheduled
> > > for allocations.
> > 
> > I find this semantic rather weird TBH. First of all why do you think it
> > makes sense to have those weights global for all users? What if
> > different applications have different view on how to spred their
> > interleaved memory?
> > 
> > I do get that you might have a different tiers with largerly different
> > runtime characteristics but why would you want to interleave them into a
> > single mapping and have hard to predict runtime behavior?
> > 
> > [...]
> > > In this way it becomes possible to set an interleaving strategy
> > > that fits the available bandwidth for the devices available on
> > > the system. An example system:
> > > 
> > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> > > 
> > > In this setup, the effective weights for nodes 0-3 for a task
> > > running on Node 0 may be [60, 20, 10, 10].
> > > 
> > > This spreads memory out across devices which all have different
> > > latency and bandwidth attributes at a way that can maximize the
> > > available resources.
> > 
> > OK, so why is this any better than not using any memory policy rely
> > on demotion to push out cold memory down the tier hierarchy?
> > 
> > What is the actual real life usecase and what kind of benefits you can
> > present?
> 
> There are two things CXL gives you: additional capacity and additional
> bus bandwidth.
> 
> The promotion/demotion mechanism is good for the capacity usecase,
> where you have a nice hot/cold gradient in the workingset and want
> placement accordingly across faster and slower memory.
> 
> The interleaving is useful when you have a flatter workingset
> distribution and poorer access locality. In that case, the CPU caches
> are less effective and the workload can be bus-bound. The workload
> might fit entirely into DRAM, but concentrating it there is
> suboptimal. Fanning it out in proportion to the relative performance
> of each memory tier gives better resuls.
> 
> We experimented with datacenter workloads on such machines last year
> and found significant performance benefits:
> 
> https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/

Thanks, this is a useful insight.
 
> This hopefully also explains why it's a global setting. The usecase is
> different from conventional NUMA interleaving, which is used as a
> locality measure: spread shared data evenly between compute
> nodes. This one isn't about locality - the CXL tier doesn't have local
> compute. Instead, the optimal spread is based on hardware parameters,
> which is a global property rather than a per-workload one.

Well, I am not convinced about that TBH. Sure it is probably a good fit
for this specific CXL usecase but it just doesn't fit into many others I
can think of - e.g. proportional use of those tiers based on the
workload - you get what you pay for.

Is there any specific reason for not having a new interleave interface
which defines weights for the nodemask? Is this because the policy
itself is very dynamic or is this more driven by simplicity of use?

Johannes Weiner Oct. 31, 2023, 4:22 p.m. UTC | #6
On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
> > > On Mon 30-10-23 20:38:06, Gregory Price wrote:
> > > > This patchset implements weighted interleave and adds a new sysfs
> > > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
> > > > 
> > > > The il_weight of a node is used by mempolicy to implement weighted
> > > > interleave when `numactl --interleave=...` is invoked.  By default
> > > > il_weight for a node is always 1, which preserves the default round
> > > > robin interleave behavior.
> > > > 
> > > > Interleave weights may be set from 0-100, and denote the number of
> > > > pages that should be allocated from the node when interleaving
> > > > occurs.
> > > > 
> > > > For example, if a node's interleave weight is set to 5, 5 pages
> > > > will be allocated from that node before the next node is scheduled
> > > > for allocations.
> > > 
> > > I find this semantic rather weird TBH. First of all why do you think it
> > > makes sense to have those weights global for all users? What if
> > > different applications have different view on how to spred their
> > > interleaved memory?
> > > 
> > > I do get that you might have a different tiers with largerly different
> > > runtime characteristics but why would you want to interleave them into a
> > > single mapping and have hard to predict runtime behavior?
> > > 
> > > [...]
> > > > In this way it becomes possible to set an interleaving strategy
> > > > that fits the available bandwidth for the devices available on
> > > > the system. An example system:
> > > > 
> > > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
> > > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
> > > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
> > > > 
> > > > In this setup, the effective weights for nodes 0-3 for a task
> > > > running on Node 0 may be [60, 20, 10, 10].
> > > > 
> > > > This spreads memory out across devices which all have different
> > > > latency and bandwidth attributes at a way that can maximize the
> > > > available resources.
> > > 
> > > OK, so why is this any better than not using any memory policy rely
> > > on demotion to push out cold memory down the tier hierarchy?
> > > 
> > > What is the actual real life usecase and what kind of benefits you can
> > > present?
> > 
> > There are two things CXL gives you: additional capacity and additional
> > bus bandwidth.
> > 
> > The promotion/demotion mechanism is good for the capacity usecase,
> > where you have a nice hot/cold gradient in the workingset and want
> > placement accordingly across faster and slower memory.
> > 
> > The interleaving is useful when you have a flatter workingset
> > distribution and poorer access locality. In that case, the CPU caches
> > are less effective and the workload can be bus-bound. The workload
> > might fit entirely into DRAM, but concentrating it there is
> > suboptimal. Fanning it out in proportion to the relative performance
> > of each memory tier gives better resuls.
> > 
> > We experimented with datacenter workloads on such machines last year
> > and found significant performance benefits:
> > 
> > https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/
> 
> Thanks, this is a useful insight.
>  
> > This hopefully also explains why it's a global setting. The usecase is
> > different from conventional NUMA interleaving, which is used as a
> > locality measure: spread shared data evenly between compute
> > nodes. This one isn't about locality - the CXL tier doesn't have local
> > compute. Instead, the optimal spread is based on hardware parameters,
> > which is a global property rather than a per-workload one.
> 
> Well, I am not convinced about that TBH. Sure it is probably a good fit
> for this specific CXL usecase but it just doesn't fit into many others I
> can think of - e.g. proportional use of those tiers based on the
> workload - you get what you pay for.
> 
> Is there any specific reason for not having a new interleave interface
> which defines weights for the nodemask? Is this because the policy
> itself is very dynamic or is this more driven by simplicity of use?

A downside of *requiring* weights to be paired with the mempolicy is
that it's then the application that would have to figure out the
weights dynamically, instead of having a static host configuration. A
policy of "I want to be spread for optimal bus bandwidth" translates
between different hardware configurations, but optimal weights will
vary depending on the type of machine a job runs on.

That doesn't mean there couldn't be usecases for having weights as
policy as well in other scenarios, like you allude to above. It's just
that so far such usecases haven't really materialized or been spelled out
concretely. Maybe we just want both - a global default, and the
ability to override it locally. Could you elaborate on the 'get what
you pay for' usecase you mentioned?

Huang, Ying Nov. 1, 2023, 2:21 a.m. UTC | #7
Michal Hocko <mhocko@suse.com> writes:

> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>> On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>> > On Mon 30-10-23 20:38:06, Gregory Price wrote:
>> > > This patchset implements weighted interleave and adds a new sysfs
>> > > entry: /sys/devices/system/node/nodeN/accessM/il_weight.
>> > > 
>> > > The il_weight of a node is used by mempolicy to implement weighted
>> > > interleave when `numactl --interleave=...` is invoked.  By default
>> > > il_weight for a node is always 1, which preserves the default round
>> > > robin interleave behavior.
>> > > 
>> > > Interleave weights may be set from 0-100, and denote the number of
>> > > pages that should be allocated from the node when interleaving
>> > > occurs.
>> > > 
>> > > For example, if a node's interleave weight is set to 5, 5 pages
>> > > will be allocated from that node before the next node is scheduled
>> > > for allocations.
>> > 
>> > I find this semantic rather weird TBH. First of all why do you think it
>> > makes sense to have those weights global for all users? What if
>> > different applications have different view on how to spred their
>> > interleaved memory?
>> > 
>> > I do get that you might have a different tiers with largerly different
>> > runtime characteristics but why would you want to interleave them into a
>> > single mapping and have hard to predict runtime behavior?
>> > 
>> > [...]
>> > > In this way it becomes possible to set an interleaving strategy
>> > > that fits the available bandwidth for the devices available on
>> > > the system. An example system:
>> > > 
>> > > Node 0 - CPU+DRAM, 400GB/s BW (200 cross socket)
>> > > Node 1 - CPU+DRAM, 400GB/s BW (200 cross socket)
>> > > Node 2 - CXL Memory. 64GB/s BW, on Node 0 root complex
>> > > Node 3 - CXL Memory. 64GB/s BW, on Node 1 root complex
>> > > 
>> > > In this setup, the effective weights for nodes 0-3 for a task
>> > > running on Node 0 may be [60, 20, 10, 10].
>> > > 
>> > > This spreads memory out across devices which all have different
>> > > latency and bandwidth attributes at a way that can maximize the
>> > > available resources.
>> > 
>> > OK, so why is this any better than not using any memory policy rely
>> > on demotion to push out cold memory down the tier hierarchy?
>> > 
>> > What is the actual real life usecase and what kind of benefits you can
>> > present?
>> 
>> There are two things CXL gives you: additional capacity and additional
>> bus bandwidth.
>> 
>> The promotion/demotion mechanism is good for the capacity usecase,
>> where you have a nice hot/cold gradient in the workingset and want
>> placement accordingly across faster and slower memory.
>> 
>> The interleaving is useful when you have a flatter workingset
>> distribution and poorer access locality. In that case, the CPU caches
>> are less effective and the workload can be bus-bound. The workload
>> might fit entirely into DRAM, but concentrating it there is
>> suboptimal. Fanning it out in proportion to the relative performance
>> of each memory tier gives better resuls.
>> 
>> We experimented with datacenter workloads on such machines last year
>> and found significant performance benefits:
>> 
>> https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1gK6@cmpxchg.org/T/
>
> Thanks, this is a useful insight.
>  
>> This hopefully also explains why it's a global setting. The usecase is
>> different from conventional NUMA interleaving, which is used as a
>> locality measure: spread shared data evenly between compute
>> nodes. This one isn't about locality - the CXL tier doesn't have local
>> compute. Instead, the optimal spread is based on hardware parameters,
>> which is a global property rather than a per-workload one.
>
> Well, I am not convinced about that TBH. Sure it is probably a good fit
> for this specific CXL usecase but it just doesn't fit into many others I
> can think of - e.g. proportional use of those tiers based on the
> workload - you get what you pay for.

For "pay", per my understanding, we need some cgroup-based
per-memory-tier (or per-node) usage limit.  The following patchset is
the first step for that.

https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/

--
Best Regards,
Huang, Ying

Huang, Ying Nov. 1, 2023, 2:34 a.m. UTC | #8
Johannes Weiner <hannes@cmpxchg.org> writes:

> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>> > > On Mon 30-10-23 20:38:06, Gregory Price wrote:

[snip]

>>  
>> > This hopefully also explains why it's a global setting. The usecase is
>> > different from conventional NUMA interleaving, which is used as a
>> > locality measure: spread shared data evenly between compute
>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>> > compute. Instead, the optimal spread is based on hardware parameters,
>> > which is a global property rather than a per-workload one.
>> 
>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>> for this specific CXL usecase but it just doesn't fit into many others I
>> can think of - e.g. proportional use of those tiers based on the
>> workload - you get what you pay for.
>> 
>> Is there any specific reason for not having a new interleave interface
>> which defines weights for the nodemask? Is this because the policy
>> itself is very dynamic or is this more driven by simplicity of use?
>
> A downside of *requiring* weights to be paired with the mempolicy is
> that it's then the application that would have to figure out the
> weights dynamically, instead of having a static host configuration. A
> policy of "I want to be spread for optimal bus bandwidth" translates
> between different hardware configurations, but optimal weights will
> vary depending on the type of machine a job runs on.
>
> That doesn't mean there couldn't be usecases for having weights as
> policy as well in other scenarios, like you allude to above. It's just
> so far such usecases haven't really materialized or spelled out
> concretely. Maybe we just want both - a global default, and the
> ability to override it locally.

I think that this is a good idea.  A system-wide configuration with a
reasonable default makes applications' lives much easier.  If more control
is needed, some kind of workload-specific configuration can be added.
And, instead of adding another memory policy, a per-cgroup
configuration may be easier to use.  The per-workload weight may
need to be adjusted when we deploy different combinations of workloads
in the system.

Another question is whether the weight should be per-memory-tier or
per-node.  In this patchset, the weight is per-source-target-node
combination.  That is, the weights form a matrix instead of a vector.
IIUC, this is used to control cross-socket memory access in addition to
per-memory-type memory access.  Do you think the added complexity is
necessary?
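
To make the matrix-versus-vector distinction concrete, here is a minimal
sketch in plain userspace C. It is not code from the patchset: the table
names, the helper functions, and every value except the [60, 20, 10, 10]
row (taken from the cover letter's example for a task on node 0) are
assumptions made for illustration.

```c
#include <assert.h>

#define NR_NODES 4

/* Vector form: one weight per target node, regardless of the accessor. */
const unsigned char il_weight_vec[NR_NODES] = { 60, 20, 10, 10 };

/*
 * Matrix form: row = node the task is running on (the "accessor"),
 * column = node the page is allocated from. Only row 0 comes from the
 * cover letter; the other rows are made-up illustrative values.
 */
const unsigned char il_weight_mat[NR_NODES][NR_NODES] = {
	{ 60, 20, 10, 10 },	/* task running on node 0 */
	{ 20, 60, 10, 10 },	/* task running on node 1 (assumed mirror) */
	{  1,  1,  1,  1 },	/* node 2 has no CPUs: defaults */
	{  1,  1,  1,  1 },	/* node 3 has no CPUs: defaults */
};

/* Vector lookup: the accessor plays no role. */
unsigned int vec_weight(int target)
{
	assert(target >= 0 && target < NR_NODES);
	return il_weight_vec[target];
}

/* Matrix lookup: the same target can weigh differently per accessor. */
unsigned int mat_weight(int accessor, int target)
{
	assert(accessor >= 0 && accessor < NR_NODES);
	assert(target >= 0 && target < NR_NODES);
	return il_weight_mat[accessor][target];
}
```

With the matrix, node 0 gets weight 60 for a task on node 0 but only 20
for a task on node 1, which is the cross-socket control asked about
above. The cost is N*N values to program instead of N.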

> Could you elaborate on the 'get what you pay for' usecase you
> mentioned?

--
Best Regards,
Huang, Ying
Ravi Jonnalagadda Nov. 1, 2023, 9:29 a.m. UTC | #9
>> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>>> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>>> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>>> > > On Mon 30-10-23 20:38:06, Gregory Price wrote:
>
>[snip]
>
>>>
>>> > This hopefully also explains why it's a global setting. The usecase is
>>> > different from conventional NUMA interleaving, which is used as a
>>> > locality measure: spread shared data evenly between compute
>>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>>> > compute. Instead, the optimal spread is based on hardware parameters,
>>> > which is a global property rather than a per-workload one.
>>>
>>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>>> for this specific CXL usecase but it just doesn't fit into many others I
>>> can think of - e.g. proportional use of those tiers based on the
>>> workload - you get what you pay for.
>>>
>>> Is there any specific reason for not having a new interleave interface
>>> which defines weights for the nodemask? Is this because the policy
>>> itself is very dynamic or is this more driven by simplicity of use?
>>
>> A downside of *requiring* weights to be paired with the mempolicy is
>> that it's then the application that would have to figure out the
>> weights dynamically, instead of having a static host configuration. A
>> policy of "I want to be spread for optimal bus bandwidth" translates
>> between different hardware configurations, but optimal weights will
>> vary depending on the type of machine a job runs on.
>>
>> That doesn't mean there couldn't be usecases for having weights as
>> policy as well in other scenarios, like you allude to above. It's just
>> so far such usecases haven't really materialized or spelled out
>> concretely. Maybe we just want both - a global default, and the
>> ability to override it locally.
>
>I think that this is a good idea.  The system-wise configuration with
>reasonable default makes applications life much easier.  If more control
>is needed, some kind of workload specific configuration can be added.

Glad that we are in agreement here. For the bandwidth expansion use cases
that this interleave patchset is trying to cater to, most applications
would have to follow the "reasonable defaults" for weights.
The main reason for an application to choose different weights while
interleaving would probably be capacity expansion, which the
default memory tiering implementation would support anyway, while
providing better latency.

>And, instead of adding another memory policy, a cgroup-wise
>configuration may be easier to be used.  The per-workload weight may
>need to be adjusted when we deploying different combination of workloads
>in the system.
>
>Another question is that should the weight be per-memory-tier or
>per-node?  In this patchset, the weight is per-source-target-node
>combination.  That is, the weight becomes a matrix instead of a vector.
>IIUC, this is used to control cross-socket memory access in addition to
>per-memory-type memory access.  Do you think the added complexity is
>necessary?

Pros and Cons of Node based interleave:
Pros:
1. Weights can be defined individually for devices with different bandwidth
and latency characteristics, irrespective of which tier they fall into.
2. Defining the weight per-source-target-node would be necessary for multi
socket systems where some devices may be closer to one socket than the other.
Cons:
1. Weights need to be programmed for all the nodes, which can be tedious for
systems with a lot of NUMA nodes.

Pros and Cons of Memory Tier based interleave:
Pros:
1. A weight programmed per initiator would apply to all the nodes in the tier.
2. Weights can be calculated from the cumulative bandwidth of all
the nodes in the tier and need to be programmed only once for all the nodes
in a given tier.
3. It may be useful as the number of NUMA nodes with similar latency and
bandwidth characteristics increases, possibly with pooling use cases.
Cons:
1. If nodes with different bandwidth and latency characteristics are placed
in the same tier, as seen in the current mainline kernel, it will be difficult
to apply a correct interleave weight policy.
2. There will be a need for functionality to move nodes between tiers
or to create new tiers for such nodes, so that correct interleave weights
can be programmed. We are currently working on a patch to support this.
3. On systems where each NUMA node has different characteristics,
each node might end up in its own memory tier, which would be
equivalent to node based interleaving. This scenario might apply on newer
systems where all CXL memory from different devices under a port is
combined to form a single NUMA node.
4. Users may need to keep track of the different memory tiers and which nodes
are present in each tier when invoking the interleave policy.

>
>> Could you elaborate on the 'get what you pay for' usecase you
>> mentioned?
>
>--
>Best Regards,
>Huang, Ying
--
Best Regards,
Ravi Jonnalagadda
Michal Hocko Nov. 1, 2023, 1:45 p.m. UTC | #10
On Tue 31-10-23 00:27:04, Gregory Price wrote:
> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
> 
> > > This hopefully also explains why it's a global setting. The usecase is
> > > different from conventional NUMA interleaving, which is used as a
> > > locality measure: spread shared data evenly between compute
> > > nodes. This one isn't about locality - the CXL tier doesn't have local
> > > compute. Instead, the optimal spread is based on hardware parameters,
> > > which is a global property rather than a per-workload one.
> > 
> > Well, I am not convinced about that TBH. Sure it is probably a good fit
> > for this specific CXL usecase but it just doesn't fit into many others I
> > can think of - e.g. proportional use of those tiers based on the
> > workload - you get what you pay for.
> > 
> > Is there any specific reason for not having a new interleave interface
> > which defines weights for the nodemask? Is this because the policy
> > itself is very dynamic or is this more driven by simplicity of use?
> > 
> 
> I had originally implemented it this way while experimenting with new
> mempolicies.
> 
> https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@memverge.com/
> 
> The downside of doing it in mempolicy is...
> 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
>    non-trivial task.  It is very "current-task" centric.

True. Cpusets are the way to make it less process centric, but that comes
with its own constraints (namely which NUMA policies are supported).
 
> 2) Barring a change to mempolicy to be sysfs friendly, the options for
>    implementing weights in the mempolicy are either a) new flag and
>    setting every weight individually in many syscalls, or b) a new
>    syscall (set_mempolicy2), which is what I demonstrated in the RFC.

Yes, that would likely require a new syscall.
 
> 3) mempolicy is also subject to cgroup nodemasks, and as a result you
>    end up with a rats nest of interactions between mempolicy nodemasks
>    changing as a result of cgroup migrations, nodes potentially coming
>    and going (hotplug under CXL), and others I'm probably forgetting.

Is this really any different from what you are proposing though?

>    Basically:  If a node leaves the nodemask, should you retain the
>    weight, or should you reset it? If a new node comes into the node
>    mask... what weight should you set? I did not have answers to these
>    questions.

I am not really sure I follow you. Are you talking about cpuset
nodemask changes or memory hotplug here.

> It was recommended to explore placing it in tiers instead, so I took a
> crack at it here: 
> 
> https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/
> 
> This had similar issue with the idea of hotplug nodes: if you give a
> tier a weight, and one or more of the nodes goes away/comes back... what
> should you do with the weight?  Split it up among the remaining nodes?
> Rebalance? Etc.

How is this any different from node becoming depleted? You cannot
really expect that you get memory you are asking for and you can easily
end up getting memory from a different node instead.
 
> The result of this discussion lead us to simply say "What if we place
> the weights directly in the node".  And that lead us to this RFC.

Maybe I am missing something really crucial here but I do not see how
this fundamentally changes anything.

Memory hotremove (or mere node memory depletion) is not really a thing
because interleaving is a best effort operation so you have to live with
memory not being strictly distributed per your preferences.

Memory hotadd will be easier to manage because you just update a single
place after a node is hotadded rather than gazillions of partial policies.
But that requires that the interleave policy nodemask is assuming future
nodes going online and puts them into the mask.

> I am not against implementing it in mempolicy (as proof: my first RFC).
> I am simply searching for the acceptable way to implement it.
> 
> One of the benefits of having it set as a global setting is that weights
> can be automatically generated from HMAT/HMEM information (ACPI tables)
> and programs already using MPOL_INTERLEAVE will have a direct benefit.

Right. This is understood. My main concern is whether this outweighs
the limitations of having a _global_ policy _only_. Historically a single
global policy usually led to finding ways to make that more scoped
(usually through cgroups).
 
> I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added
> along side this patch so that MPOL_INTERLEAVE is left entirely alone.
> 
> Happy to discuss more,
> ~Gregory
Michal Hocko Nov. 1, 2023, 1:56 p.m. UTC | #11
On Tue 31-10-23 12:22:16, Johannes Weiner wrote:
> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
[...]
> > Is there any specific reason for not having a new interleave interface
> > which defines weights for the nodemask? Is this because the policy
> > itself is very dynamic or is this more driven by simplicity of use?
> 
> A downside of *requiring* weights to be paired with the mempolicy is
> that it's then the application that would have to figure out the
> weights dynamically, instead of having a static host configuration. A
> policy of "I want to be spread for optimal bus bandwidth" translates
> between different hardware configurations, but optimal weights will
> vary depending on the type of machine a job runs on.

I can imagine this could be achieved by numactl(8) so that the process
management tool could set this up for the process at startup. Sure,
it wouldn't be very dynamic after that, and that is why I was asking
about how dynamic the situation might be in practice.

> That doesn't mean there couldn't be usecases for having weights as
> policy as well in other scenarios, like you allude to above. It's just
> so far such usecases haven't really materialized or spelled out
> concretely. Maybe we just want both - a global default, and the
> ability to override it locally. Could you elaborate on the 'get what
> you pay for' usecase you mentioned?

This is more or less just an idea that first came to my mind when
hearing about bus bandwidth optimizations. I suspect that sooner or
later we will learn about usecases where the optimization function
optimizes not only for bandwidth but also for the cost of that bandwidth.
Consider a hosting system serving different workloads, each paying for a
different QoS.
Do I know about anybody requiring that now? No! But we should really
test the proposed interface against potential future extensions. If such an
extension is not reasonable and/or we can achieve that by different
means then great.
Michal Hocko Nov. 1, 2023, 2:01 p.m. UTC | #12
On Wed 01-11-23 10:21:47, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
[...]
> > Well, I am not convinced about that TBH. Sure it is probably a good fit
> > for this specific CXL usecase but it just doesn't fit into many others I
> > can think of - e.g. proportional use of those tiers based on the
> > workload - you get what you pay for.
> 
> For "pay", per my understanding, we need some cgroup based
> per-memory-tier (or per-node) usage limit.  The following patchset is
> the first step for that.
> 
> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/

Why do we need a sysfs interface if there are plans for cgroup API?
Gregory Price Nov. 1, 2023, 4:58 p.m. UTC | #13
On Wed, Nov 01, 2023 at 02:45:50PM +0100, Michal Hocko wrote:
> On Tue 31-10-23 00:27:04, Gregory Price wrote:
[... snip ...]
> > 
> > The downside of doing it in mempolicy is...
> > 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
> >    non-trivial task.  It is very "current-task" centric.
> 
> True. Cpusets is the way to make it less process centric but that comes
> with its own constraints (namely which NUMA policies are supported).
>  
> > 2) Barring a change to mempolicy to be sysfs friendly, the options for
> >    implementing weights in the mempolicy are either a) new flag and
> >    setting every weight individually in many syscalls, or b) a new
> >    syscall (set_mempolicy2), which is what I demonstrated in the RFC.
> 
> Yes, that would likely require a new syscall.
>  
> > 3) mempolicy is also subject to cgroup nodemasks, and as a result you
> >    end up with a rats nest of interactions between mempolicy nodemasks
> >    changing as a result of cgroup migrations, nodes potentially coming
> >    and going (hotplug under CXL), and others I'm probably forgetting.
> 
> Is this really any different from what you are proposing though?
>

In only one manner: An external user can set the weight of a node that
is added later on.  If it is implemented in mempolicy, then this is not
possible.

Basically consider: `numactl --interleave=all ...`

If `--weights=...`: when a node hotplug event occurs, there is no
recourse for adding a weight for the new node (it will default to 1).

Maybe the answer is "Best effort, sorry" and we don't handle that
situation.  That doesn't seem entirely unreasonable.

At least with weights in node (or cgroup, or memtier, whatever) it
provides the ability to set that weight outside the mempolicy context.
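
A minimal sketch of that point, assuming a hypothetical global per-node
table that lives outside any mempolicy. None of these names exist in
the kernel, and treating 0 as "never set" is a simplification made for
this example only.

```c
#include <assert.h>

#define MAX_NODES 8
#define DEFAULT_WEIGHT 1

/* Hypothetical global table, one slot per possible node. 0 = never set. */
unsigned char node_weight[MAX_NODES];

/* What an external (sysfs-like) writer could call at any time. */
void set_node_weight(int nid, unsigned char w)
{
	assert(nid >= 0 && nid < MAX_NODES);
	node_weight[nid] = w;
}

/* What the interleave path would read at allocation time. */
unsigned int get_node_weight(int nid)
{
	assert(nid >= 0 && nid < MAX_NODES);
	return node_weight[nid] ? node_weight[nid] : DEFAULT_WEIGHT;
}
```

A node that appears after the task set its policy starts at the default
weight of 1, but an administrator can still call set_node_weight() for
it later. If the weights lived only inside the task's mempolicy, there
would be no such outside hook.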

> >    weight, or should you reset it? If a new node comes into the node
> >    mask... what weight should you set? I did not have answers to these
> >    questions.
> 
> I am not really sure I follow you. Are you talking about cpuset
> nodemask changes or memory hotplug here.
>

Actually both - slightly different context.

If the weights are implemented in mempolicy, if the cpuset nodemask
changes then the mempolicy nodemask changes with it.

If the node is removed from the system, I believe (need to validate
this, but IIRC) the node will be removed from any registered cpusets.
As a result, that falls down to mempolicy, and the node is removed.

Not entirely sure what happens if a node is added.  The only case where
I think that is relevant is when cpuset is empty ("all") and mempolicy
is set to something like `--interleave=all`.  In this case, it's
possible that the new node will simply have a default weight (1), and if
weights are implemented in mempolicy only there is no recourse for changing
it.

> > It was recommended to explore placing it in tiers instead, so I took a
> > crack at it here: 
> > 
> > https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/
> > 
> > This had similar issue with the idea of hotplug nodes: if you give a
> > tier a weight, and one or more of the nodes goes away/comes back... what
> > should you do with the weight?  Split it up among the remaining nodes?
> > Rebalance? Etc.
> 
> How is this any different from node becoming depleted? You cannot
> really expect that you get memory you are asking for and you can easily
> end up getting memory from a different node instead.
>  
... snip ... 
> Maybe I am missing something really crucial here but I do not see how
> this fundamentally changes anything.
> 
> Memory hotremove
... snip ... 
> Memory hotadd
... snip ...
> But, that requires that interleave policy nodemask is assuming future
> nodes going online and put them to the mask.
> 

The difference is the nodemask changes in mempolicy and cpuset.  If a
node is removed entirely from the nodemask, and then it comes back
(through cpuset or something), then "what do you do with it"?

If memory is depleted but opens up later - the interleave policy starts
working as intended again.  If a node disappears and comes back... that
bit of plumbing is a bit more complex.

So yes, the "assuming future nodes going online and put them into the
mask" is the concern I have.  A node being added to or removed from the
nodemask poses different plumbing issues than just depletion.

If that's really not a concern and we're happy to just let it be OBO
until an actual use case for handling node hotplug for weighting shows
up, then mempolicy-based-weighting alone seems more than sufficient.

> > I am not against implementing it in mempolicy (as proof: my first RFC).
> > I am simply searching for the acceptable way to implement it.
> > 
> > One of the benefits of having it set as a global setting is that weights
> > can be automatically generated from HMAT/HMEM information (ACPI tables)
> > and programs already using MPOL_INTERLEAVE will have a direct benefit.
> 
> Right. This is understood. My main concern is whether this outweighs
> the limitations of having a _global_ policy _only_. Historically a single
> global policy usually led to finding ways how to make that more scoped
> (usually through cgroups).
>

Maybe the answer here is put it in cgroups + mempolicy, and don't handle
hotplug?  It would be an easy shift to move this patch to cgroups, and
then pull my syscall patch forward to add weights directly to mempolicy.

I think the interleave code stays pretty much the same, the only
difference would be where the task gets the weight from:

if (pol->mode == WEIGHTED_INTERLEAVE)
	weight = pol->weight[target_node];
else
	weight = cgroups.get_weight(from_node, target_node);
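
The same idea as a self-contained userspace sketch. The struct layout,
the enum, the cgroup_weight table and the fallback to 1 are all
assumptions made for illustration; none of this is the actual mempolicy
or cgroup code.

```c
#include <assert.h>

#define NR_NODES 4

enum mode { INTERLEAVE, WEIGHTED_INTERLEAVE };

/* Hypothetical stand-in for a task's mempolicy. */
struct policy {
	enum mode mode;
	unsigned char weight[NR_NODES];	/* only used in WEIGHTED_INTERLEAVE */
};

/* Hypothetical cgroup-provided table, indexed [from_node][target_node]. */
unsigned char cgroup_weight[NR_NODES][NR_NODES];

unsigned int interleave_weight(const struct policy *pol,
			       int from_node, int target_node)
{
	unsigned int w;

	assert(from_node >= 0 && from_node < NR_NODES);
	assert(target_node >= 0 && target_node < NR_NODES);

	if (pol->mode == WEIGHTED_INTERLEAVE)
		w = pol->weight[target_node];		/* task-local weights */
	else
		w = cgroup_weight[from_node][target_node]; /* cgroup default */

	return w ? w : 1;	/* assumption: an unset (0) weight means 1 */
}
```

Only the source of the weight differs between the two modes; the
interleave loop that consumes the weight would not need to know which
one was used.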

~Gregory
Huang, Ying Nov. 2, 2023, 2:01 a.m. UTC | #14
Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>
>> > This hopefully also explains why it's a global setting. The usecase is
>> > different from conventional NUMA interleaving, which is used as a
>> > locality measure: spread shared data evenly between compute
>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>> > compute. Instead, the optimal spread is based on hardware parameters,
>> > which is a global property rather than a per-workload one.
>> 
>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>> for this specific CXL usecase but it just doesn't fit into many others I
>> can think of - e.g. proportional use of those tiers based on the
>> workload - you get what you pay for.
>> 
>> Is there any specific reason for not having a new interleave interface
>> which defines weights for the nodemask? Is this because the policy
>> itself is very dynamic or is this more driven by simplicity of use?
>> 
>
> I had originally implemented it this way while experimenting with new
> mempolicies.
>
> https://lore.kernel.org/linux-cxl/20231003002156.740595-5-gregory.price@memverge.com/
>
> The downside of doing it in mempolicy is...
> 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
>    non-trivial task.  It is very "current-task" centric.
>
> 2) Barring a change to mempolicy to be sysfs friendly, the options for
>    implementing weights in the mempolicy are either a) new flag and
>    setting every weight individually in many syscalls, or b) a new
>    syscall (set_mempolicy2), which is what I demonstrated in the RFC.
>
> 3) mempolicy is also subject to cgroup nodemasks, and as a result you
>    end up with a rats nest of interactions between mempolicy nodemasks
>    changing as a result of cgroup migrations, nodes potentially coming
>    and going (hotplug under CXL), and others I'm probably forgetting.
>
>    Basically:  If a node leaves the nodemask, should you retain the
>    weight, or should you reset it? If a new node comes into the node
>    mask... what weight should you set? I did not have answers to these
>    questions.
>
>
> It was recommended to explore placing it in tiers instead, so I took a
> crack at it here: 
>
> https://lore.kernel.org/linux-mm/20231009204259.875232-1-gregory.price@memverge.com/
>
> This had similar issue with the idea of hotplug nodes: if you give a
> tier a weight, and one or more of the nodes goes away/comes back... what
> should you do with the weight?  Split it up among the remaining nodes?
> Rebalance? Etc.

The weight of a tier can be defined as the weight of one node of the
tier instead of the weight of all nodes of the tier.  That is, for a
system as follows,

tier 0: node 0, node 1; weight=4
tier 1: node 2, node 3; weight=1

If you run a workload with `numactl --weighted-interleave -n 0,2,3`, the
proportion will be: "4:0:1:1" on each node.

While for `numactl --weighted-interleave -n 0,2`, it will be: "4:0:1:0".
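
The rule above can be sketched as follows, assuming the two-tier layout
from the example and a nodemask passed as a plain bitmask (bit n = node
n). The function and table names are hypothetical.

```c
#include <assert.h>

#define NR_NODES 4

/* Layout from the example: tier 0 = {node 0, node 1}, tier 1 = {node 2, node 3}. */
const int node_tier[NR_NODES] = { 0, 0, 1, 1 };
const unsigned int tier_weight[2] = { 4, 1 };

/*
 * Fill out[] with the per-node weight for each node in nodemask.
 * Every selected node gets its tier's weight; unselected nodes get 0.
 */
void tier_node_weights(unsigned int nodemask, unsigned int out[NR_NODES])
{
	assert(out != 0);
	for (int nid = 0; nid < NR_NODES; nid++)
		out[nid] = (nodemask & (1u << nid)) ?
			   tier_weight[node_tier[nid]] : 0;
}
```

Because the tier weight is applied per node rather than split among the
tier's members, dropping node 3 from the mask changes only node 3's
share and leaves the other nodes' weights untouched.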

--
Best Regards,
Huang, Ying

> The result of this discussion lead us to simply say "What if we place
> the weights directly in the node".  And that lead us to this RFC.
>
>
> I am not against implementing it in mempolicy (as proof: my first RFC).
> I am simply searching for the acceptable way to implement it.
>
> One of the benefits of having it set as a global setting is that weights
> can be automatically generated from HMAT/HMEM information (ACPI tables)
> and programs already using MPOL_INTERLEAVE will have a direct benefit.
>
> I have been considering whether MPOL_WEIGHTED_INTERLEAVE should be added
> along side this patch so that MPOL_INTERLEAVE is left entirely alone.
>
> Happy to discuss more,
> ~Gregory
Gregory Price Nov. 2, 2023, 3:18 a.m. UTC | #15
On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:
> On Wed 01-11-23 12:58:55, Gregory Price wrote:
> > Basically consider: `numactl --interleave=all ...`
> > 
> > If `--weights=...`: when a node hotplug event occurs, there is no
> > recourse for adding a weight for the new node (it will default to 1).
> 
> Correct and this is what I was asking about in an earlier email. How
> much do we really need to consider this setup. Is this something nice to
> have or does the nature of the technology requires to be fully dynamic
> and expect new nodes coming up at any moment?
>  

Dynamic Capacity is expected to cause a numa node to change size (in
number of memory blocks) rather than cause numa nodes to come and go, so
maybe handling the full node hotplug is a bit of an overreach.

Good call, I'll stop considering this problem for now.

> > If the node is removed from the system, I believe (need to validate
> > this, but IIRC) the node will be removed from any registered cpusets.
> > As a result, that falls down to mempolicy, and the node is removed.
> 
> I do not think we do anything like that. Userspace might decide to
> change the numa mask when a node is offlined but I do not think we do
> anything like that automagically.
>

mpol_rebind_policy called by update_tasks_nodemask
https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319
https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016

falls down from cpuset_hotplug_workfn:
https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771

/*
 * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
 * Call this routine anytime after node_states[N_MEMORY] changes.
 * See cpuset_update_active_cpus() for CPU hotplug handling.
 */
static int cpuset_track_online_nodes(struct notifier_block *self,
				unsigned long action, void *arg)
{
	schedule_work(&cpuset_hotplug_work);
	return NOTIFY_OK;
}

void __init cpuset_init_smp(void)
{
...
	hotplug_memory_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
}


Causes 1 of 3 situations:
MPOL_F_STATIC_NODES:   overwrite with (old & new)
MPOL_F_RELATIVE_NODES: overwrite with a "relative" nodemask (fold+onto?)
Default:               either does a remap or replaces old with new.

My assumption based on this is that a hot-unplugged node would be
completely removed.  Doesn't look like hot-add is handled at all, so I can
just drop that entirely for now (except for adding a default weight of 1 in
case it is ever added in the future).

I've been pushing against the weights being in memory-tiers.c for this
reason, as a weight set per-tier is meaningless if a node disappears.

Example: Tier has 2 nodes with some weight N split between them, such
that interleave gives each node N/2 pages.  If 1 node is removed, the
remaining node gets N pages, which is twice the allocation. Presumably
a node is an abstraction of 1 or more devices, therefore if the node is
removed, the weight should change.
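
The arithmetic of that example, as a tiny sketch. It assumes the tier
weight is divided evenly among the tier's member nodes, which is the
model described in the paragraph above; the function name is
hypothetical.

```c
#include <assert.h>

/* Pages each member node receives per interleave round of the tier. */
unsigned int per_node_share(unsigned int tier_weight, unsigned int nr_nodes)
{
	assert(nr_nodes > 0);
	return tier_weight / nr_nodes;
}
```

With a tier weight of 10, two nodes get 5 pages each; after one node is
removed, the survivor gets all 10 unless the tier weight is rewritten.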

You could handle hotplug in tiers, but if a node being hot-unplugged forcibly
removes the node from cpusets and mempolicy nodemasks, then it's
irrelevant since the node can never get selected for allocation anyway.

It's looking more like cgroups is the right place to put this.

> 
> Moving the global policy to cgroups would make the main concern of
> different workloads looking for different policy less problematic.
> I didn't have much time to think that through but the main question is
> how to sanely define hierarchical properties of those weights? This is
> more of a resource distribution than enforcement so maybe a simple
> inherit or overwrite (if you have more specific needs) semantic makes
> sense and it is sufficient.
>

As a user I would assume it would operate much the same way as other
nested cgroups, which is inherit by default (with subsets) or an
explicit overwrite that can't exceed the higher level settings.

Weights could arguably allow different settings than capacity controls,
but that could be an extension.
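
One possible reading of "inherit by default, or an explicit overwrite
that can't exceed the higher level settings", as a sketch. The struct,
the use of 0 for "not set", and the root default of 1 are assumptions
made for illustration and do not correspond to any existing cgroup
interface.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-cgroup weight for one node. */
struct cg {
	const struct cg *parent;	/* NULL for the root */
	unsigned int weight;		/* 0 = not set, inherit from parent */
};

/* Own value if set, capped by the parent's effective weight. */
unsigned int effective_weight(const struct cg *cg)
{
	unsigned int parent_w;

	assert(cg != NULL);
	if (!cg->parent)
		return cg->weight ? cg->weight : 1;	/* root default of 1 */

	parent_w = effective_weight(cg->parent);
	if (cg->weight == 0)
		return parent_w;			/* inherit */
	return cg->weight < parent_w ? cg->weight : parent_w; /* capped */
}
```

Under these assumptions a child that sets nothing sees its parent's
weight, a child that sets a smaller weight gets it, and a child that
asks for more than its parent is clamped to the parent's value.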

> This is not as much about the code as it is about the proper interface
> because that will get cast in stone once introduced. It would be really
> bad to realize that we have a global policy that doesn't fit well and
> have hard time to work it around without breaking anybody.

o7 I concur now.  I'll take some time to rework this into a
cgroups+mempolicy proposal based on my earlier RFCs.

~Gregory
Huang, Ying Nov. 2, 2023, 6:11 a.m. UTC | #16
Michal Hocko <mhocko@suse.com> writes:

> On Wed 01-11-23 10:21:47, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
> [...]
>> > Well, I am not convinced about that TBH. Sure it is probably a good fit
>> > for this specific CXL usecase but it just doesn't fit into many others I
>> > can think of - e.g. proportional use of those tiers based on the
>> > workload - you get what you pay for.
>> 
>> For "pay", per my understanding, we need some cgroup based
>> per-memory-tier (or per-node) usage limit.  The following patchset is
>> the first step for that.
>> 
>> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/
>
> Why do we need a sysfs interface if there are plans for cgroup API?

They are for different targets.  The cgroup API proposed here is to
constrain the DRAM usage in a system with DRAM and CXL memory.  The less
you pay, the less DRAM and more CXL memory you use.

--
Best Regards,
Huang, Ying
Huang, Ying Nov. 2, 2023, 6:21 a.m. UTC | #17
Michal Hocko <mhocko@suse.com> writes:

> On Tue 31-10-23 12:22:16, Johannes Weiner wrote:
>> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
> [...]
>> > Is there any specific reason for not having a new interleave interface
>> > which defines weights for the nodemask? Is this because the policy
>> > itself is very dynamic or is this more driven by simplicity of use?
>> 
>> A downside of *requiring* weights to be paired with the mempolicy is
>> that it's then the application that would have to figure out the
>> weights dynamically, instead of having a static host configuration. A
>> policy of "I want to be spread for optimal bus bandwidth" translates
>> between different hardware configurations, but optimal weights will
>> vary depending on the type of machine a job runs on.
>
> I can imagine this could be achieved by numactl(8) so that the process
> management tool could set this up for the process on the start up. Sure
> it wouldn't be very dynamic after then and that is why I was asking
> about how dynamic the situation might be in practice.
>
>> That doesn't mean there couldn't be usecases for having weights as
>> policy as well in other scenarios, like you allude to above. It's just
>> so far such usecases haven't really materialized or spelled out
>> concretely. Maybe we just want both - a global default, and the
>> ability to override it locally. Could you elaborate on the 'get what
>> you pay for' usecase you mentioned?
>
> This is more or less just an idea that came first to my mind when
> hearing about bus bandwidth optimizations. I suspect that sooner or
> later we just learn about usecases where the optimization function
> maximizes not only bandwidth but also cost for that bandwidth. Consider
> a hosting system serving different workloads each paying different
> QoS.

I don't think a pure software solution can enforce memory bandwidth
allocation.  For that, we will need something like MBA (Memory Bandwidth
Allocation) as in the following URL,

https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-allocation.html

At least, something like MBM (Memory Bandwidth Monitoring) as in the
following URL will be needed.

https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-monitoring.html

The interleave solution helps cooperative workloads only.

> Do I know about anybody requiring that now? No! But we should really
> test the proposed interface for potential future extensions. If such an
> extension is not reasonable and/or we can achieve that by different
> means then great.

--
Best Regards,
Huang, Ying
Huang, Ying Nov. 2, 2023, 6:41 a.m. UTC | #18
Ravi Jonnalagadda <ravis.opensrc@micron.com> writes:

>>> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
>>>> On Tue 31-10-23 11:21:42, Johannes Weiner wrote:
>>>> > On Tue, Oct 31, 2023 at 10:53:41AM +0100, Michal Hocko wrote:
>>>> > > On Mon 30-10-23 20:38:06, Gregory Price wrote:
>>
>>[snip]
>>
>>>>
>>>> > This hopefully also explains why it's a global setting. The usecase is
>>>> > different from conventional NUMA interleaving, which is used as a
>>>> > locality measure: spread shared data evenly between compute
>>>> > nodes. This one isn't about locality - the CXL tier doesn't have local
>>>> > compute. Instead, the optimal spread is based on hardware parameters,
>>>> > which is a global property rather than a per-workload one.
>>>>
>>>> Well, I am not convinced about that TBH. Sure it is probably a good fit
>>>> for this specific CXL usecase but it just doesn't fit into many others I
>>>> can think of - e.g. proportional use of those tiers based on the
>>>> workload - you get what you pay for.
>>>>
>>>> Is there any specific reason for not having a new interleave interface
>>>> which defines weights for the nodemask? Is this because the policy
>>>> itself is very dynamic or is this more driven by simplicity of use?
>>>
>>> A downside of *requiring* weights to be paired with the mempolicy is
>>> that it's then the application that would have to figure out the
>>> weights dynamically, instead of having a static host configuration. A
>>> policy of "I want to be spread for optimal bus bandwidth" translates
>>> between different hardware configurations, but optimal weights will
>>> vary depending on the type of machine a job runs on.
>>>
>>> That doesn't mean there couldn't be usecases for having weights as
>>> policy as well in other scenarios, like you allude to above. It's just
>>> so far such usecases haven't really materialized or spelled out
>>> concretely. Maybe we just want both - a global default, and the
>>> ability to override it locally.
>>
>>I think that this is a good idea.  The system-wise configuration with
>>reasonable default makes applications life much easier.  If more control
>>is needed, some kind of workload specific configuration can be added.
>
> Glad that we are in agreement here. For bandwidth expansion use cases
> that this interleave patchset is trying to cater to, most applications
> would have to follow the "reasonable defaults" for weights.
> The necessity for applications to choose different weights while
> interleaving would probably be to do capacity expansion which the
> default memory tiering implementation would anyway support and provide
> better latency.
>
>>And, instead of adding another memory policy, a cgroup-wise
>>configuration may be easier to be used.  The per-workload weight may
>>need to be adjusted when we deploying different combination of workloads
>>in the system.
>>
>>Another question is that should the weight be per-memory-tier or
>>per-node?  In this patchset, the weight is per-source-target-node
>>combination.  That is, the weight becomes a matrix instead of a vector.
>>IIUC, this is used to control cross-socket memory access in addition to
>>per-memory-type memory access.  Do you think the added complexity is
>>necessary?
>
> Pros and Cons of Node based interleave:
> Pros:
> 1. Weights can be defined for devices with different bandwidth and latency
> characteristics individually irrespective of which tier they fall into.
> 2. Defining the weight per-source-target-node would be necessary for multi
> socket systems where few devices may be closer to one socket rather than other.
> Cons:
> 1. Weights need to be programmed for all the nodes which can be tedious for
> systems with lot of NUMA nodes.

2. More complex, so it needs justification, for example, a practical use case.

> Pros and Cons of Memory Tier based interleave:
> Pros:
> 1. Programming weight per initiator would apply for all the nodes in the tier.
> 2. Weights can be calculated considering the cumulative bandwidth of all
> the nodes in the tier and need to be programmed once for all the nodes in a
> given tier.
> 3. It may be useful in cases where numa nodes with similar latency and bandwidth
> characteristics increase, possibly with pooling use cases.

4. simpler.

> Cons:
> 1. If nodes with different bandwidth and latency characteristics are placed
> in same tier as seen in the current mainline kernel, it will be difficult to
> apply a correct interleave weight policy.
> 2. There will be a need for functionality to move nodes between different tiers
> or create new tiers to place such nodes for programming correct interleave weights.
> We are working on a patch to support it currently.

Thanks!  If we have such a system, we will need this.

> 3. For systems where each numa node is having different characteristics,
> a single node might end up existing in different memory tier, which would be
> equivalent to node based interleaving.

No.  A node can only exist in one memory tier.

> On newer systems where all CXL memory from different devices under a
> port are combined to form single numa node, this scenario might be
> applicable.

You mean the different memory ranges of a NUMA node may have different
performance?  I don't think that we can deal with this.

> 4. Users may need to keep track of different memory tiers and what nodes are present
> in each tier for invoking interleave policy.

I don't think this is a con.  With node based solution, you need to know
your system too.

>>
>>> Could you elaborate on the 'get what you pay for' usecase you
>>> mentioned?
>>

--
Best Regards,
Huang, Ying
Michal Hocko Nov. 2, 2023, 9:28 a.m. UTC | #19
On Thu 02-11-23 14:11:09, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Wed 01-11-23 10:21:47, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> > [...]
> >> > Well, I am not convinced about that TBH. Sure it is probably a good fit
> >> > for this specific CXL usecase but it just doesn't fit into many others I
> >> > can think of - e.g. proportional use of those tiers based on the
> >> > workload - you get what you pay for.
> >> 
> >> For "pay", per my understanding, we need some cgroup based
> >> per-memory-tier (or per-node) usage limit.  The following patchset is
> >> the first step for that.
> >> 
> >> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/
> >
> > Why do we need a sysfs interface if there are plans for cgroup API?
> 
> They are for different target.  The cgroup API proposed here is to
> constrain the DRAM usage in a system with DRAM and CXL memory.  The less
> you pay, the less DRAM and more CXL memory you use.

Right, but why does the usage distribution require its own interface, and
why can it not be combined with the access control part of it?
Michal Hocko Nov. 2, 2023, 9:30 a.m. UTC | #20
On Thu 02-11-23 14:21:49, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Tue 31-10-23 12:22:16, Johannes Weiner wrote:
> >> On Tue, Oct 31, 2023 at 04:56:27PM +0100, Michal Hocko wrote:
> > [...]
> >> > Is there any specific reason for not having a new interleave interface
> >> > which defines weights for the nodemask? Is this because the policy
> >> > itself is very dynamic or is this more driven by simplicity of use?
> >> 
> >> A downside of *requiring* weights to be paired with the mempolicy is
> >> that it's then the application that would have to figure out the
> >> weights dynamically, instead of having a static host configuration. A
> >> policy of "I want to be spread for optimal bus bandwidth" translates
> >> between different hardware configurations, but optimal weights will
> >> vary depending on the type of machine a job runs on.
> >
> > I can imagine this could be achieved by numactl(8) so that the process
> > management tool could set this up for the process on the start up. Sure
> > it wouldn't be very dynamic after then and that is why I was asking
> > about how dynamic the situation might be in practice.
> >
> >> That doesn't mean there couldn't be usecases for having weights as
> >> policy as well in other scenarios, like you allude to above. It's just
> >> so far such usecases haven't really materialized or spelled out
> >> concretely. Maybe we just want both - a global default, and the
> >> ability to override it locally. Could you elaborate on the 'get what
> >> you pay for' usecase you mentioned?
> >
> > This is more or less just an idea that came first to my mind when
> > hearing about bus bandwidth optimizations. I suspect that sooner or
> > later we just learn about usecases where the optimization function
> > maximizes not only bandwidth but also cost for that bandwidth. Consider
> > a hosting system serving different workloads each paying different
> > QoS.
> 
> I don't think pure software solution can enforce the memory bandwidth
> allocation.  For that, we will need something like MBA (Memory Bandwidth
> Allocation) as in the following URL,
> 
> https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-allocation.html
> 
> At least, something like MBM (Memory Bandwidth Monitoring) as in the
> following URL will be needed.
> 
> https://www.intel.com/content/www/us/en/developer/articles/technical/introduction-to-memory-bandwidth-monitoring.html
> 
> The interleave solution helps the cooperative workloads only.

Enforcement is an orthogonal thing IMO. We are talking about a best
effort interface.
Ravi Jonnalagadda Nov. 2, 2023, 9:35 a.m. UTC | #21
Whether the node based interleave solution should be considered complex would
probably depend on the number of NUMA nodes present in the system and on whether
we are able to set up the default weights correctly to obtain optimum bandwidth
expansion.

>
>> Pros and Cons of Memory Tier based interleave:
>> Pros:
>> 1. Programming weight per initiator would apply for all the nodes in the tier.
>> 2. Weights can be calculated considering the cumulative bandwidth of all
>> the nodes in the tier and need to be programmed once for all the nodes in a
>> given tier.
>> 3. It may be useful in cases where numa nodes with similar latency and bandwidth
>> characteristics increase, possibly with pooling use cases.
>
>4. simpler.
>
>> Cons:
>> 1. If nodes with different bandwidth and latency characteristics are placed
>> in same tier as seen in the current mainline kernel, it will be difficult to
>> apply a correct interleave weight policy.
>> 2. There will be a need for functionality to move nodes between different tiers
>> or create new tiers to place such nodes for programming correct interleave weights.
>> We are working on a patch to support it currently.
>
>Thanks!  If we have such system, we will need this.
>
>> 3. For systems where each numa node is having different characteristics,
>> a single node might end up existing in different memory tier, which would be
>> equivalent to node based interleaving.
>
>No.  A node can only exist in one memory tier.

Sorry for the confusion. What I meant was: if each node has different
characteristics, then to program the memory tier weights correctly we need to
place each node in a separate tier of its own. So each memory tier will contain
only a single node and the solution would resemble node based interleaving.

>
>> On newer systems where all CXL memory from different devices under a
>> port are combined to form single numa node, this scenario might be
>> applicable.
>
>You mean the different memory ranges of a NUMA node may have different
>performance?  I don't think that we can deal with this.

Example Configuration: On a server that we are using now, four different
CXL cards are combined to form a single NUMA node and two other cards are
exposed as two individual numa nodes.
So if we have the ability to combine multiple CXL memory ranges to a
single NUMA node the number of NUMA nodes in the system would potentially
decrease even if we can't combine the entire range to form a single node.

>
>> 4. Users may need to keep track of different memory tiers and what nodes are present
>> in each tier for invoking interleave policy.
>
>I don't think this is a con.  With node based solution, you need to know
>your system too.
>
>>>
>>>> Could you elaborate on the 'get what you pay for' usecase you
>>>> mentioned?
>>>
>
>--
>Best Regards,
>Huang, Ying
--
Best Regards,
Ravi Jonnalagadda
Michal Hocko Nov. 2, 2023, 9:47 a.m. UTC | #22
On Wed 01-11-23 12:58:55, Gregory Price wrote:
> On Wed, Nov 01, 2023 at 02:45:50PM +0100, Michal Hocko wrote:
> > On Tue 31-10-23 00:27:04, Gregory Price wrote:
> [... snip ...]
> > > 
> > > The downside of doing it in mempolicy is...
> > > 1) mempolicy is not sysfs friendly, and to make it sysfs friendly is a
> > >    non-trivial task.  It is very "current-task" centric.
> > 
> > True. Cpusets is the way to make it less process centric but that comes
> > with its own constraints (namely which NUMA policies are supported).
> >  
> > > 2) Barring a change to mempolicy to be sysfs friendly, the options for
> > >    implementing weights in the mempolicy are either a) new flag and
> > >    setting every weight individually in many syscalls, or b) a new
> > >    syscall (set_mempolicy2), which is what I demonstrated in the RFC.
> > 
> > Yes, that would likely require a new syscall.
> >  
> > > 3) mempolicy is also subject to cgroup nodemasks, and as a result you
> > >    end up with a rats nest of interactions between mempolicy nodemasks
> > >    changing as a result of cgroup migrations, nodes potentially coming
> > >    and going (hotplug under CXL), and others I'm probably forgetting.
> > 
> > Is this really any different from what you are proposing though?
> >
> 
> In only one manner: An external user can set the weight of a node that
> is added later on.  If it is implemented in mempolicy, then this is not
> possible.
> 
> Basically consider: `numactl --interleave=all ...`
> 
> If `--weights=...`: when a node hotplug event occurs, there is no
> recourse for adding a weight for the new node (it will default to 1).

Correct, and this is what I was asking about in an earlier email. How
much do we really need to consider this setup? Is this something nice to
have, or does the nature of the technology require it to be fully dynamic
and to expect new nodes coming up at any moment?
 
> Maybe the answer is "Best effort, sorry" and we don't handle that
> situation.  That doesn't seem entirely unreasonable.
> 
> At least with weights in node (or cgroup, or memtier, whatever) it
> provides the ability to set that weight outside the mempolicy context.
> 
> > >    weight, or should you reset it? If a new node comes into the node
> > >    mask... what weight should you set? I did not have answers to these
> > >    questions.
> > 
> > I am not really sure I follow you. Are you talking about cpuset
> > nodemask changes or memory hotplug here.
> >
> 
> Actually both - slightly different context.
> 
> If the weights are implemented in mempolicy, if the cpuset nodemask
> changes then the mempolicy nodemask changes with it.
> 
> If the node is removed from the system, I believe (need to validate
> this, but IIRC) the node will be removed from any registered cpusets.
> As a result, that falls down to mempolicy, and the node is removed.

I do not think we do anything like that. Userspace might decide to
change the numa mask when a node is offlined but I do not think we do
anything like that automagically.

> Not entirely sure what happens if a node is added.  The only case where
> I think that is relevant is when cpuset is empty ("all") and mempolicy
> is set to something like `--interleave=all`.  In this case, it's
> possible that the new node will simply have a default weight (1), and if
> weights are implemented in mempolicy only there is no recourse for changing
> it.

That is what I would expect.
 
[...]
> > Right. This is understood. My main concern is whether this outweighs
> > the limitations of having a _global_ policy _only_. Historically a single
> > global policy usually led to finding ways how to make that more scoped
> > (usually through cgroups).
> >
> 
> Maybe the answer here is put it in cgroups + mempolicy, and don't handle
> hotplug?  This is an easy shift my this patch to cgroups, and then
> pulling my syscall patch forward to add weights directly to mempolicy.

Moving the global policy to cgroups would make the main concern of
different workloads looking for different policies less problematic.
I didn't have much time to think that through, but the main question is
how to sanely define the hierarchical properties of those weights. This is
more of a resource distribution than enforcement, so maybe a simple
inherit or overwrite (if you have more specific needs) semantic makes
sense and is sufficient.

> I think the interleave code stays pretty much the same, the only
> difference would be where the task gets the weight from:
> 
> if (pol->mode == WEIGHTED_INTERLEAVE)
>   weight = pol->weight[target_node];
> else
>   weight = cgroups.get_weight(from_node, target_node);
> 
> ~Gregory

This is not as much about the code as it is about the proper interface,
because that will get cast in stone once introduced. It would be really
bad to realize that we have a global policy that doesn't fit well and
to have a hard time working around it without breaking anybody.
Jonathan Cameron Nov. 2, 2023, 2:13 p.m. UTC | #23
> >
> >You mean the different memory ranges of a NUMA node may have different
> >performance?  I don't think that we can deal with this.
  
> 
> Example Configuration: On a server that we are using now, four different
> CXL cards are combined to form a single NUMA node and two other cards are
> exposed as two individual numa nodes.
> So if we have the ability to combine multiple CXL memory ranges to a
> single NUMA node the number of NUMA nodes in the system would potentially
> decrease even if we can't combine the entire range to form a single node.
>

If it's under the kernel's control, NUMA nodes for CXL are today defined by
CXL Fixed Memory Windows rather than by the individual characteristics of the
devices that might be accessed from those windows.

That's a useful simplification to get things going and it's not clear how the
QoS aspects of CFMWS will be used.  So will we always have enough windows with
fine enough granularity coming from the _DSM QTG magic that they don't end up
with different performance devices (or topologies) within each one?

No idea.  It's a bunch of trade offs of where the complexity lies and how much
memory is being provided over CXL vs physical address space exhaustion.
 
Long term, my guess is we'll need to support something more sophisticated with
dynamic 'creation' of NUMA  nodes (or something that looks like that anyway)
so we can always have a separate one for each significantly different set of
memory access characteristics.  If they are coming from ACPI that's already
required by the specification.  This space is going to continue getting more
complex.

Upshot is that I wouldn't focus too much on the possibility of a NUMA node having
devices with very different memory access characteristics in it.  That's a quirk
of today's world that we can and should look to fix.

If your BIOS is setting this up for you and presenting them in SRAT / HMAT etc
then it's not complying with the ACPI spec.

Jonathan
Gregory Price Nov. 2, 2023, 6:21 p.m. UTC | #24
On Fri, Nov 03, 2023 at 10:56:01AM +0100, Michal Hocko wrote:
> On Wed 01-11-23 23:18:59, Gregory Price wrote:
> > On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:
> > > On Wed 01-11-23 12:58:55, Gregory Price wrote:
> > > > Basically consider: `numactl --interleave=all ...`
> > > > 
> > > > If `--weights=...`: when a node hotplug event occurs, there is no
> > > > recourse for adding a weight for the new node (it will default to 1).
> > > 
> > > Correct and this is what I was asking about in an earlier email. How
> > > much do we really need to consider this setup. Is this something nice to
> > > have or does the nature of the technology requires to be fully dynamic
> > > and expect new nodes coming up at any moment?
> > >  
> > 
> > Dynamic Capacity is expected to cause a numa node to change size (in
> > number of memory blocks) rather than cause numa nodes to come and go, so
> > maybe handling the full node hotplug is a bit of an overreach.
> > 
> > Good call, I'll stop considering this problem for now.
> > 
> > > > If the node is removed from the system, I believe (need to validate
> > > > this, but IIRC) the node will be removed from any registered cpusets.
> > > > As a result, that falls down to mempolicy, and the node is removed.
> > > 
> > > I do not think we do anything like that. Userspace might decide to
> > > change the numa mask when a node is offlined but I do not think we do
> > > anything like that automagically.
> > >
> > 
> > mpol_rebind_policy called by update_tasks_nodemask
> > https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319
> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016
> > 
> > falls down from cpuset_hotplug_workfn:
> > https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771
> 
> Ohh, have missed that. Thanks for the reference. Quite honestly I am not
> sure this code is really a) necessary and b) ever exercised. For the
> former I would argue that offline node could be treated as completely
> depleted one. From the correctness POV it shouldn't make any difference
> and I am rather skeptical it would have performance improvements. 

The only thing I'm not sure of is what happens if mempolicy is allowed to
select a node that doesn't exist.  I could hack up a contrived test, but
I don't think the state is reachable at the moment.

More importantly, the rebind code is needed for task migration and for
allowing the cpusets to be changeable.  From the perspective of
mempolicy, a node being hotplugged and the nodemask being changed due to
a cgroup cpuset change look very similar and come with the same
question:  what do I do about weights when a change to the effective
nodemask is made?

This is why I'm falling toward "cgroups seem about right", because we
can make mempolicy ask cgroups for the weight, and also allow mempolicy
to carry its own explicit weight array - which allows for flexibility.
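
As a rough illustration of that lookup order, here is a minimal sketch in
plain userspace C. It assumes a per-policy weight array that takes
precedence and a cgroup-provided array as the fallback. The names
il_policy, cg_weights, il_effective_weight and MAX_NODES are hypothetical
stand-ins for illustration only; they are not kernel structures or an
agreed interface.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8	/* hypothetical node limit for this sketch */

/* Hypothetical stand-in for a cgroup's per-node interleave weights. */
struct cg_weights {
	unsigned char weight[MAX_NODES];
};

/* Hypothetical stand-in for a mempolicy; not the kernel's struct mempolicy. */
struct il_policy {
	int has_weights;			/* nonzero: policy carries its own array */
	unsigned char weight[MAX_NODES];	/* explicit per-policy weights */
};

/*
 * Pick the weight for one target node: an explicit per-policy weight
 * wins, otherwise the value configured on the cgroup is used.
 */
static unsigned int il_effective_weight(const struct il_policy *pol,
					const struct cg_weights *cg,
					int node)
{
	assert(pol != NULL && cg != NULL);
	assert(node >= 0 && node < MAX_NODES);

	return pol->has_weights ? pol->weight[node] : cg->weight[node];
}
```

The interleave path itself would not need to know where the number came
from; only this one lookup differs between the two sources.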

I think this may end up generalizing to a cgroup-wide mempolicy interface
ala cgroup/mempolicy/[policy, nodemask, weights, ...]

but one thing at a time :]

> for the latter, full node offlines are really rare from experience. I
> would be interested about actual real life usecases which do that

yeah i'm just going to drop this from my requirement list and go OBO,
for areas where i see it may cause an issue (potential for 0-weights) i
will do something simple (initialize weights to 1), but otherwise I
think it's too much to expect from the kernel.
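
A minimal sketch of that "something simple", assuming weights live in a
small per-node array and the effective nodemask fits in one unsigned
long. The helper name il_fixup_weights and MAX_NODES are hypothetical;
the only behaviour taken from the discussion is that a node without a
weight falls back to 1.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8	/* hypothetical node limit for this sketch */

/*
 * After the effective nodemask changes, make sure every allowed node has
 * a usable weight. Allowed nodes whose weight is 0 get the default of 1.
 * Weights of nodes outside the mask are left untouched.
 * Returns the number of weights that were defaulted.
 */
static int il_fixup_weights(unsigned char weight[MAX_NODES],
			    unsigned long nodemask)
{
	int node, fixed = 0;

	assert(weight != NULL);

	for (node = 0; node < MAX_NODES; node++) {
		if (!(nodemask & (1UL << node)))
			continue;
		if (weight[node] == 0) {
			weight[node] = 1;
			fixed++;
		}
	}
	return fixed;
}
```

Run on every nodemask change (cpuset update or hotplug alike), this keeps
the allocator from ever seeing a selectable node with weight 0.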

> > 
> > As a user I would assume it would operate much the same way as other
> > nested cgroups, which is inherit by default (with subsets) or an
> > explicit overwrite that can't exceed the higher level settings.
> 
> This would make it rather impractical because a default (everything set
> to 1) would be cast in stone. As mentioned above this is not an
> enforcement limit. So I _think_ that a simple hierarchical rule like
> 	cgroup_interleaving_mask(cgroup)
> 		interleaving_mask = (cgroup->interleaving_mask) ? : cgroup_interleaving_mask(parent_cgroup(cgroup))
> 
> So child cgroups could overwrite parent as they wish. If there is any
> enforcement (like a cpuset) that would filter useable nodes and the
> allocation policy would simply apply weights on those.
> 

Sorry yes, this is what I intended, I'm just bad at words.
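
For reference, the inherit-or-overwrite rule quoted above could look
roughly like this in userspace C. This is only a sketch under
assumptions: struct il_cgroup, il_cgroup_weight, MAX_NODES and
IL_DEFAULT_WEIGHT are hypothetical names, and a NULL weights pointer is
used to mean "nothing set at this level".

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8		/* hypothetical node limit for this sketch */
#define IL_DEFAULT_WEIGHT 1	/* plain round-robin when nothing is set */

/* Hypothetical cgroup node; weights is NULL when nothing was set locally. */
struct il_cgroup {
	const struct il_cgroup *parent;
	const unsigned char *weights;	/* MAX_NODES entries, or NULL to inherit */
};

/*
 * Walk towards the root and use the first level that set weights
 * explicitly. A child may freely overwrite its parent; nothing is
 * clamped, because the weights distribute allocations rather than
 * enforce a limit.
 */
static unsigned int il_cgroup_weight(const struct il_cgroup *cg, int node)
{
	assert(node >= 0 && node < MAX_NODES);

	for (; cg != NULL; cg = cg->parent) {
		if (cg->weights != NULL)
			return cg->weights[node];
	}
	return IL_DEFAULT_WEIGHT;
}
```

Any enforcement, such as a cpuset restricting the usable nodes, would be
applied separately before the weights are consulted.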

~Gregory
Huang, Ying Nov. 3, 2023, 7 a.m. UTC | #25
Ravi Jonnalagadda <ravis.opensrc@micron.com> writes:

> Should Node based interleave solution be considered complex or not would probably
> depend on number of numa nodes that would be present in the system and whether
> we are able to setup the default weights correctly to obtain optimum bandwidth
> expansion.

Node based interleave is more complex than tier based interleave,
because you have fewer tiers than nodes in general.

>>
>>> Pros and Cons of Memory Tier based interleave:
>>> Pros:
>>> 1. Programming weight per initiator would apply for all the nodes in the tier.
>>> 2. Weights can be calculated considering the cumulative bandwidth of all
>>> the nodes in the tier and need to be programmed once for all the nodes in a
>>> given tier.
>>> 3. It may be useful in cases where numa nodes with similar latency and bandwidth
>>> characteristics increase, possibly with pooling use cases.
>>
>>4. simpler.
>>
>>> Cons:
>>> 1. If nodes with different bandwidth and latency characteristics are placed
>>> in same tier as seen in the current mainline kernel, it will be difficult to
>>> apply a correct interleave weight policy.
>>> 2. There will be a need for functionality to move nodes between different tiers
>>> or create new tiers to place such nodes for programming correct interleave weights.
>>> We are working on a patch to support it currently.
>>
>>Thanks!  If we have such system, we will need this.
>>
>>> 3. For systems where each numa node is having different characteristics,
>>> a single node might end up existing in different memory tier, which would be
>>> equivalent to node based interleaving.
>>
>>No.  A node can only exist in one memory tier.
>
> Sorry for the confusion what i meant was, if each node is having different 
> characteristics, to program the memory tier weights correctly we need to place
> each node in a separate tier of it's own. So each memory tier will contain
> only a single node and the solution would resemble node based interleaving.
>
>>
>>> On newer systems where all CXL memory from different devices under a
>>> port are combined to form single numa node, this scenario might be
>>> applicable.
>>
>>You mean the different memory ranges of a NUMA node may have different
>>performance?  I don't think that we can deal with this.
>
> Example Configuration: On a server that we are using now, four different
> CXL cards are combined to form a single NUMA node and two other cards are
> exposed as two individual numa nodes.
> So if we have the ability to combine multiple CXL memory ranges to a
> single NUMA node the number of NUMA nodes in the system would potentially
> decrease even if we can't combine the entire range to form a single node.

Sorry, I misunderstood your words.  Yes, it's possible that there is one
tier for each node in some systems.  But I guess we will have fewer
tiers than nodes in general.

--
Best Regards,
Huang, Ying

>>
>>> 4. Users may need to keep track of different memory tiers and what nodes are present
>>> in each tier for invoking interleave policy.
>>
>>I don't think this is a con.  With node based solution, you need to know
>>your system too.
>>
>>>>
>>>>> Could you elaborate on the 'get what you pay for' usecase you
>>>>> mentioned?
>>>>
>>
>>--
>>Best Regards,
>>Huang, Ying
> --
> Best Regards,
> Ravi Jonnalagadda
Huang, Ying Nov. 3, 2023, 7:10 a.m. UTC | #26
Michal Hocko <mhocko@suse.com> writes:

> On Thu 02-11-23 14:11:09, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Wed 01-11-23 10:21:47, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
>> > [...]
>> >> > Well, I am not convinced about that TBH. Sure it is probably a good fit
>> >> > for this specific CXL usecase but it just doesn't fit into many others I
>> >> > can think of - e.g. proportional use of those tiers based on the
>> >> > workload - you get what you pay for.
>> >> 
>> >> For "pay", per my understanding, we need some cgroup based
>> >> per-memory-tier (or per-node) usage limit.  The following patchset is
>> >> the first step for that.
>> >> 
>> >> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/
>> >
>> > Why do we need a sysfs interface if there are plans for cgroup API?
>> 
>> They are for different target.  The cgroup API proposed here is to
>> constrain the DRAM usage in a system with DRAM and CXL memory.  The less
>> you pay, the less DRAM and more CXL memory you use.
>
> Right, but why the usage distribution requires its own interface and
> cannot be combined with the access control part of it?

Per my understanding, they are orthogonal.

Weighted-interleave is a memory allocation policy, other memory
allocation policies include local first, etc.

Usage limit is to constrain the usage of specific memory types
(e.g. DRAM) for a cgroup.  It can be used together with local first
policy and some other memory allocation policy.

--
Best Regards,
Huang, Ying
Huang, Ying Nov. 3, 2023, 7:45 a.m. UTC | #27
Gregory Price <gregory.price@memverge.com> writes:

> On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:
>> On Wed 01-11-23 12:58:55, Gregory Price wrote:
>> > Basically consider: `numactl --interleave=all ...`
>> > 
>> > If `--weights=...`: when a node hotplug event occurs, there is no
>> > recourse for adding a weight for the new node (it will default to 1).
>> 
>> Correct and this is what I was asking about in an earlier email. How
>> much do we really need to consider this setup. Is this something nice to
>> have or does the nature of the technology requires to be fully dynamic
>> and expect new nodes coming up at any moment?
>>  
>
> Dynamic Capacity is expected to cause a numa node to change size (in
> number of memory blocks) rather than cause numa nodes to come and go, so
> maybe handling the full node hotplug is a bit of an overreach.

Will a node's maximum bandwidth change with the number of memory blocks?

> Good call, I'll stop considering this problem for now.
>
>> > If the node is removed from the system, I believe (need to validate
>> > this, but IIRC) the node will be removed from any registered cpusets.
>> > As a result, that falls down to mempolicy, and the node is removed.
>> 
>> I do not think we do anything like that. Userspace might decide to
>> change the numa mask when a node is offlined but I do not think we do
>> anything like that automagically.
>>
>
> mpol_rebind_policy called by update_tasks_nodemask
> https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016
>
> falls down from cpuset_hotplug_workfn:
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771
>
> /*
>  * Keep top_cpuset.mems_allowed tracking node_states[N_MEMORY].
>  * Call this routine anytime after node_states[N_MEMORY] changes.
>  * See cpuset_update_active_cpus() for CPU hotplug handling.
>  */
> static int cpuset_track_online_nodes(struct notifier_block *self,
> 				unsigned long action, void *arg)
> {
> 	schedule_work(&cpuset_hotplug_work);
> 	return NOTIFY_OK;
> }
>
> void __init cpuset_init_smp(void)
> {
> ...
> 	hotplug_memory_notifier(cpuset_track_online_nodes, CPUSET_CALLBACK_PRI);
> }
>
>
> Causes 1 of 3 situations:
> MPOL_F_STATIC_NODES:   overwrite with (old & new)
> MPOL_F_RELATIVE_NODES: overwrite with a "relative" nodemask (fold+onto?)
> Default:               either does a remap or replaces old with new.
>
> My assumption based on this is that a hot-unplugged node would completely
> be removed.  Doesn't look like hot-add is handled at all, so I can just
> drop that entirely for now (except add default weight of 1 in case it is
> ever added in the future).
>
> I've been pushing against the weights being in memory-tiers.c for this
> reason, as a weight set per-tier is meaningless if a node disappears.
>
> Example: Tier has 2 nodes with some weight N split between them, such
> that interleave gives each node N/2 pages.  If 1 node is removed, the
> remaining node gets N pages, which is twice the allocation. Presumably
> a node is an abstraction of 1 or more devices, therefore if the node is
> removed, the weight should change.

The per-tier weight can be defined as the interleave weight of each node in
the tier.  A tier just groups NUMA nodes with similar performance.  The
performance (including bandwidth) is still per-node in the context of a
tier.

If we have multiple nodes in one tier, this makes weight definition
easier.
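
To make the difference between the two readings concrete, here is a
small sketch in C. The function names are hypothetical. The first
function models the quoted example, where a tier's weight N is split
between its nodes; the second models the definition above, where the
tier's weight applies to each node of the tier.

```c
#include <assert.h>

/*
 * Reading 1 (weight shared by the tier): the tier's weight is divided
 * among the nodes currently in it, so removing a node raises the share
 * of each remaining node.
 */
static unsigned int il_node_weight_split(unsigned int tier_weight,
					 unsigned int nr_nodes)
{
	assert(nr_nodes > 0);
	return tier_weight / nr_nodes;
}

/*
 * Reading 2 (weight applies per node): every node gets the tier's
 * weight, independent of how many nodes the tier holds.
 */
static unsigned int il_node_weight_each(unsigned int tier_weight,
					unsigned int nr_nodes)
{
	assert(nr_nodes > 0);
	return tier_weight;
}
```

Under the first reading a node removal doubles the remaining node's share
in a two-node tier; under the second the remaining node's weight does not
change.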

> You could handle hotplug in tiers, but if a node being hotplugged forcibly
> removes the node from cpusets and mempolicy nodemasks, then it's
> irrelevant since the node can never get selected for allocation anyway.
>
> It's looking more like cgroups is the right place to put this.

Having a cgroup/task level interface doesn't prevent us from having a system
level interface that provides a default for cgroups/tasks, where performance
information (e.g., from HMAT) can help define a reasonable default
automatically.
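
One possible way to derive such a default, sketched in C under stated
assumptions: the per-node bandwidth values are already known to the
caller (how they would be read from HMAT is not shown), and weights are
scaled proportionally into the 0-100 range from the cover letter. The
names il_weights_from_bw, MAX_NODES and IL_WEIGHT_MAX are hypothetical,
and proportional scaling is just one candidate heuristic, not something
the patchset defines.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 8		/* hypothetical node limit for this sketch */
#define IL_WEIGHT_MAX 100	/* upper bound on a weight, per the cover letter */

/*
 * Fill weight[] so that each node's weight is its share of the total
 * bandwidth, scaled to IL_WEIGHT_MAX and rounded to nearest. A node that
 * would round down to 0 gets 1 so it still takes part in the interleave.
 */
static void il_weights_from_bw(const unsigned int bw[],
			       unsigned char weight[], int nr_nodes)
{
	unsigned long total = 0;
	int i;

	assert(bw != NULL && weight != NULL);
	assert(nr_nodes > 0 && nr_nodes <= MAX_NODES);

	for (i = 0; i < nr_nodes; i++)
		total += bw[i];
	assert(total > 0);

	for (i = 0; i < nr_nodes; i++) {
		unsigned long w;

		w = ((unsigned long)bw[i] * IL_WEIGHT_MAX + total / 2) / total;
		weight[i] = (unsigned char)(w ? w : 1);
	}
}
```

Fed the cover letter's figures for a task on node 0 (400, 200, 64 and 64
GB/s), this gives 55, 27, 9 and 9, in the same ballpark as the
[60, 20, 10, 10] given there as an illustration.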

>> 
>> Moving the global policy to cgroups would make the main concern of
>> different workloads looking for different policy less problematic.
>> I didn't have much time to think that through but the main question is
>> how to sanely define hierarchical properties of those weights? This is
>> more of a resource distribution than enforcement so maybe a simple
>> inherit or overwrite (if you have a more specific needs) semantic makes
>> sense and it is sufficient.
>>
>
> As a user I would assume it would operate much the same way as other
> nested cgroups, which is inherit by default (with subsets) or an
> explicit overwrite that can't exceed the higher level settings.
>
> Weights could arguably allow different settings than capacity controls,
> but that could be an extension.
>
>> This is not as much about the code as it is about the proper interface
>> because that will get cast in stone once introduced. It would be really
>> bad to realize that we have a global policy that doesn't fit well and
>> have hard time to work it around without breaking anybody.
>
> o7 I concur now.  I'll take some time to rework this into a
> cgroups+mempolicy proposal based on my earlier RFCs.

--
Best Regards,
Huang, Ying
Michal Hocko Nov. 3, 2023, 9:39 a.m. UTC | #28
On Fri 03-11-23 15:10:37, Huang, Ying wrote:
> Michal Hocko <mhocko@suse.com> writes:
> 
> > On Thu 02-11-23 14:11:09, Huang, Ying wrote:
> >> Michal Hocko <mhocko@suse.com> writes:
> >> 
> >> > On Wed 01-11-23 10:21:47, Huang, Ying wrote:
> >> >> Michal Hocko <mhocko@suse.com> writes:
> >> > [...]
> >> >> > Well, I am not convinced about that TBH. Sure it is probably a good fit
> >> >> > for this specific CXL usecase but it just doesn't fit into many others I
> >> >> > can think of - e.g. proportional use of those tiers based on the
> >> >> > workload - you get what you pay for.
> >> >> 
> >> >> For "pay", per my understanding, we need some cgroup based
> >> >> per-memory-tier (or per-node) usage limit.  The following patchset is
> >> >> the first step for that.
> >> >> 
> >> >> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/
> >> >
> >> > Why do we need a sysfs interface if there are plans for cgroup API?
> >> 
> >> They are for different target.  The cgroup API proposed here is to
> >> constrain the DRAM usage in a system with DRAM and CXL memory.  The less
> >> you pay, the less DRAM and more CXL memory you use.
> >
> > Right, but why the usage distribution requires its own interface and
> > cannot be combined with the access control part of it?
> 
> Per my understanding, they are orthogonal.
> 
> Weighted-interleave is a memory allocation policy, other memory
> allocation policies include local first, etc.
> 
> Usage limit is to constrain the usage of specific memory types
> (e.g. DRAM) for a cgroup.  It can be used together with local first
> policy and some other memory allocation policy.

Bad wording from me. Sorry for the confusion. Sure those are two
orthogonal things and I didn't mean to suggest a single API to cover
both. But if cgroup semantic can be reasonably defined for the usage
enforcement can we put the interleaving behavior API under the same
cgroup controller as well?
Michal Hocko Nov. 3, 2023, 9:56 a.m. UTC | #29
On Wed 01-11-23 23:18:59, Gregory Price wrote:
> On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:
> > On Wed 01-11-23 12:58:55, Gregory Price wrote:
> > > Basically consider: `numactl --interleave=all ...`
> > > 
> > > If `--weights=...`: when a node hotplug event occurs, there is no
> > > recourse for adding a weight for the new node (it will default to 1).
> > 
> > Correct and this is what I was asking about in an earlier email. How
> > much do we really need to consider this setup. Is this something nice to
> > have or does the nature of the technology requires to be fully dynamic
> > and expect new nodes coming up at any moment?
> >  
> 
> Dynamic Capacity is expected to cause a numa node to change size (in
> number of memory blocks) rather than cause numa nodes to come and go, so
> maybe handling the full node hotplug is a bit of an overreach.
> 
> Good call, I'll stop considering this problem for now.
> 
> > > If the node is removed from the system, I believe (need to validate
> > > this, but IIRC) the node will be removed from any registered cpusets.
> > > As a result, that falls down to mempolicy, and the node is removed.
> > 
> > I do not think we do anything like that. Userspace might decide to
> > change the numa mask when a node is offlined but I do not think we do
> > anything like that automagically.
> >
> 
> mpol_rebind_policy called by update_tasks_nodemask
> https://elixir.bootlin.com/linux/latest/source/mm/mempolicy.c#L319
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L2016
> 
> falls down from cpuset_hotplug_workfn:
> https://elixir.bootlin.com/linux/latest/source/kernel/cgroup/cpuset.c#L3771

Ohh, I have missed that. Thanks for the reference. Quite honestly I am not
sure this code is really a) necessary and b) ever exercised. For the
former I would argue that an offline node could be treated as a completely
depleted one. From the correctness POV it shouldn't make any difference
and I am rather skeptical it would bring performance improvements. And
for the latter, full node offlines are really rare from experience. I
would be interested in actual real-life usecases which do that
regularly. I do remember a certain HW vendor working on a hotpluggable
system (both CPUs and memory) to reduce downtimes caused by misbehaving
CPUs/memory. This has turned out very impractical because of movable
memory requirements and also some HW limitations (like most HW attached
to Node0, which has turned out to be a single point of failure anyway).
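
The "offline is just depleted" argument can be modelled in a few lines. This is a userspace sketch with a hypothetical `struct node_state`; the kernel's real fallback walks zonelists and looks nothing like this, but the point is that one test covers both cases without rewriting any nodemask:

```c
#include <assert.h>
#include <stdbool.h>

#define NR_NODES 4	/* hypothetical node count for the example */

struct node_state {
	bool online;
	unsigned long free_pages;
};

/*
 * Try the preferred node first, then the remaining nodes in order.
 * An offline node is skipped by the same test as a node with no free
 * pages. Returns the node allocated from, or -1 if nothing is left.
 */
static int alloc_page_from(struct node_state nodes[NR_NODES], int preferred)
{
	assert(preferred >= 0 && preferred < NR_NODES);

	for (int i = 0; i < NR_NODES; i++) {
		int nid = (preferred + i) % NR_NODES;
		struct node_state *n = &nodes[nid];

		if (!n->online || n->free_pages == 0)
			continue;	/* offline behaves like depleted */
		n->free_pages--;
		return nid;
	}
	return -1;
}
```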

[...]
> > Moving the global policy to cgroups would make the main concern of
> > different workloads looking for different policy less problematic.
> > I didn't have much time to think that through but the main question is
> > how to sanely define hierarchical properties of those weights? This is
> > more of a resource distribution than enforcement so maybe a simple
> > inherit or overwrite (if you have a more specific needs) semantic makes
> > sense and it is sufficient.
> >
> 
> As a user I would assume it would operate much the same way as other
> nested cgroups, which is inherit by default (with subsets) or an
> explicit overwrite that can't exceed the higher level settings.

This would make it rather impractical because a default (everything set
to 1) would be cast in stone. As mentioned above, this is not an
enforcement limit. So I _think_ that a simple hierarchical rule like the
following would be sufficient:
	cgroup_interleaving_mask(cgroup)
		interleaving_mask = (cgroup->interleaving_mask) ? : cgroup_interleaving_mask(parent_cgroup(cgroup))

So child cgroups could overwrite the parent as they wish. If there is any
enforcement (like a cpuset), that would filter the usable nodes and the
allocation policy would simply apply weights to those.
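
Spelled out as compilable C, the rule above might look like the following. `struct cg_node`, the weight arrays and the bitmask standing in for a cpuset are all hypothetical; this only illustrates the "nearest ancestor that set something wins, enforcement merely filters" semantic:

```c
#include <assert.h>
#include <stddef.h>

#define NR_NODES 4	/* hypothetical node count for the example */

struct cg_node {
	struct cg_node *parent;
	/* NULL means "not set here": inherit from the parent. */
	const unsigned int *il_weights;
};

/* System-wide default: every node weighted 1 (plain round robin). */
static const unsigned int system_default_weights[NR_NODES] = { 1, 1, 1, 1 };

/* The nearest ancestor (including cg itself) with weights set wins. */
static const unsigned int *cg_il_weights(const struct cg_node *cg)
{
	for (; cg; cg = cg->parent)
		if (cg->il_weights)
			return cg->il_weights;
	return system_default_weights;
}

/*
 * Enforcement (e.g. a cpuset) only filters which nodes are usable;
 * the weights are then applied to whatever remains.
 */
static unsigned int cg_effective_weight(const struct cg_node *cg,
					unsigned long allowed_mask, int nid)
{
	assert(nid >= 0 && nid < NR_NODES);
	if (!(allowed_mask & (1UL << nid)))
		return 0;
	return cg_il_weights(cg)[nid];
}
```

A child that sets nothing picks up its parent's weights, and a child that sets its own is free to exceed them, which is the difference from a limit-style controller.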
Jonathan Cameron Nov. 3, 2023, 2:16 p.m. UTC | #30
On Fri, 03 Nov 2023 15:45:13 +0800
"Huang, Ying" <ying.huang@intel.com> wrote:

> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:  
> >> On Wed 01-11-23 12:58:55, Gregory Price wrote:  
> >> > Basically consider: `numactl --interleave=all ...`
> >> > 
> >> > If `--weights=...`: when a node hotplug event occurs, there is no
> >> > recourse for adding a weight for the new node (it will default to 1).  
> >> 
> >> Correct and this is what I was asking about in an earlier email. How
> >> much do we really need to consider this setup. Is this something nice to
> >> have or does the nature of the technology requires to be fully dynamic
> >> and expect new nodes coming up at any moment?
> >>    
> >
> > Dynamic Capacity is expected to cause a numa node to change size (in
> > number of memory blocks) rather than cause numa nodes to come and go, so
> > maybe handling the full node hotplug is a bit of an overreach.  
> 
> Will node max bandwidth change with the number of memory blocks?

Typically no as even a single memory extent would probably be interleaved
across all the actual memory devices (think DIMMS for simplicity) within
a CXL device. I guess a device 'could' do some scaling based on capacity
provided to a particular host but feels like they should be separate controls.
I don't recall there being anything in the specification to suggest the
need to recheck the CDAT info for updates when DC add / remove events happen.

Mind you, who knows in future :)  We'll point out in relevant forums that
doing so would be very hard to handle cleanly in Linux.

Jonathan
Michal Hocko Nov. 3, 2023, 4:59 p.m. UTC | #31
On Thu 02-11-23 14:21:14, Gregory Price wrote:
[...]
> Only thing I'm not sure of is what happens if mempolicy is allowed to
> select a node that doesn't exist.  I could hack up a contrived test, but
> i don't think the state is reachable at the moment.

There are two different kinds of "doesn't exist". One is an offline node
and the other is one with a number higher than the config option allows.
Although we do have a concept of possible nodes (N_POSSIBLE), I do not
think we enforce that in any user interface and we only reject nodes
outside of MAX_NUMNODES.

The possible nodes concept is more about optimizing for real HW so that
we do not overshoot when the config allows a huge number of nodes while
only a handful of them are actually used (which is the majority of cases).
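
To make the distinction concrete, here is a small userspace sketch that classifies a node id. The 64-node limit and the plain bitmasks are stand-ins chosen for the example; in the kernel the corresponding state lives in `MAX_NUMNODES` and the `N_POSSIBLE` / `N_ONLINE` node states:

```c
#include <assert.h>
#include <stdint.h>

#define MAX_NUMNODES 64	/* stand-in for the compile-time limit */

enum node_check {
	NODE_OUT_OF_RANGE,	/* id the configuration cannot represent */
	NODE_NOT_POSSIBLE,	/* representable, but no such HW on this system */
	NODE_OFFLINE,		/* possible, but currently not online */
	NODE_ONLINE,
};

static enum node_check classify_node(int nid, uint64_t possible,
				     uint64_t online)
{
	/* Every online node must also be a possible node. */
	assert((online & ~possible) == 0);

	if (nid < 0 || nid >= MAX_NUMNODES)
		return NODE_OUT_OF_RANGE;
	if (!(possible & (UINT64_C(1) << nid)))
		return NODE_NOT_POSSIBLE;
	if (!(online & (UINT64_C(1) << nid)))
		return NODE_OFFLINE;
	return NODE_ONLINE;
}
```

Per the description above, a user interface would typically reject only the first case, so a policy may legitimately name a node that falls into the second or third.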
Huang, Ying Nov. 6, 2023, 3:20 a.m. UTC | #32
Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:

> On Fri, 03 Nov 2023 15:45:13 +0800
> "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> > On Thu, Nov 02, 2023 at 10:47:33AM +0100, Michal Hocko wrote:  
>> >> On Wed 01-11-23 12:58:55, Gregory Price wrote:  
>> >> > Basically consider: `numactl --interleave=all ...`
>> >> > 
>> >> > If `--weights=...`: when a node hotplug event occurs, there is no
>> >> > recourse for adding a weight for the new node (it will default to 1).  
>> >> 
>> >> Correct and this is what I was asking about in an earlier email. How
>> >> much do we really need to consider this setup. Is this something nice to
>> >> have or does the nature of the technology requires to be fully dynamic
>> >> and expect new nodes coming up at any moment?
>> >>    
>> >
>> > Dynamic Capacity is expected to cause a numa node to change size (in
>> > number of memory blocks) rather than cause numa nodes to come and go, so
>> > maybe handling the full node hotplug is a bit of an overreach.  
>> 
>> Will node max bandwidth change with the number of memory blocks?
>
> Typically no as even a single memory extent would probably be interleaved
> across all the actual memory devices (think DIMMS for simplicity) within
> a CXL device. I guess a device 'could' do some scaling based on capacity
> provided to a particular host but feels like they should be separate controls.
> I don't recall there being anything in the specification to suggest the
> need to recheck the CDAT info for updates when DC add / remove events happen.

Sounds good!  Thank you for detailed explanation.

> Mind you, who knows in future :)  We'll point out in relevant forums that
> doing so would be very hard to handle cleanly in Linux.

Thanks!

--
Best Regards,
Huang, Ying
Huang, Ying Nov. 6, 2023, 5:08 a.m. UTC | #33
Michal Hocko <mhocko@suse.com> writes:

> On Fri 03-11-23 15:10:37, Huang, Ying wrote:
>> Michal Hocko <mhocko@suse.com> writes:
>> 
>> > On Thu 02-11-23 14:11:09, Huang, Ying wrote:
>> >> Michal Hocko <mhocko@suse.com> writes:
>> >> 
>> >> > On Wed 01-11-23 10:21:47, Huang, Ying wrote:
>> >> >> Michal Hocko <mhocko@suse.com> writes:
>> >> > [...]
>> >> >> > Well, I am not convinced about that TBH. Sure it is probably a good fit
>> >> >> > for this specific CXL usecase but it just doesn't fit into many others I
>> >> >> > can think of - e.g. proportional use of those tiers based on the
>> >> >> > workload - you get what you pay for.
>> >> >> 
>> >> >> For "pay", per my understanding, we need some cgroup based
>> >> >> per-memory-tier (or per-node) usage limit.  The following patchset is
>> >> >> the first step for that.
>> >> >> 
>> >> >> https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com/
>> >> >
>> >> > Why do we need a sysfs interface if there are plans for cgroup API?
>> >> 
>> >> They are for different target.  The cgroup API proposed here is to
>> >> constrain the DRAM usage in a system with DRAM and CXL memory.  The less
>> >> you pay, the less DRAM and more CXL memory you use.
>> >
>> > Right, but why the usage distribution requires its own interface and
>> > cannot be combined with the access control part of it?
>> 
>> Per my understanding, they are orthogonal.
>> 
>> Weighted-interleave is a memory allocation policy, other memory
>> allocation policies include local first, etc.
>> 
>> Usage limit is to constrain the usage of specific memory types
>> (e.g. DRAM) for a cgroup.  It can be used together with local first
>> policy and some other memory allocation policy.
>
> Bad wording from me. Sorry for the confusion.

Never mind.

> Sure those are two orthogonal things and I didn't mean to suggest a
> single API to cover both. But if cgroup semantic can be reasonably
> defined for the usage enforcement can we put the interleaving behavior
> API under the same cgroup controller as well?

I haven't thought about it thoroughly.  But I think it should be the
direction.

--
Best Regards,
Huang, Ying