
[v2,0/5] mm: demotion: Introduce new node state N_DEMOTION_TARGETS

Message ID 20220413092206.73974-1-jvgediya@linux.ibm.com (mailing list archive)

Message

Jagdish Gediya April 13, 2022, 9:22 a.m. UTC
The current implementation finds demotion targets based on the node
state N_MEMORY. However, some systems may have DRAM-only memory NUMA
nodes which are N_MEMORY but are not the right choice as demotion
targets.

This patch series introduces a new node state, N_DEMOTION_TARGETS,
which is used to distinguish the nodes that can be used as demotion
targets. node_states[N_DEMOTION_TARGETS] holds the list of nodes that
can be used as demotion targets. Support is also added to set the
demotion target list from user space so that the default behavior can
be overridden.
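
Concretely, the nodemask change in patch 2/5 boils down to adding one
entry to enum node_states in include/linux/nodemask.h, roughly along
these lines (a sketch only; see the patch for the exact placement and
comment):

 enum node_states {
 	N_POSSIBLE,		/* The node could become online at some point */
 	N_ONLINE,		/* The node is online */
 	...
 	N_MEMORY,		/* The node has memory(regular, high, movable) */
 	N_CPU,			/* The node has one or more cpus */
+	N_DEMOTION_TARGETS,	/* The node can be used as a demotion target */
 	NR_NODE_STATES
 };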

The node state N_DEMOTION_TARGETS is also set from the dax kmem
driver. Certain types of memory which register through dax kmem
(e.g. HBM) may not be the right choice for demotion, so in the future
they should be distinguished based on certain attributes and the dax
kmem driver should avoid setting them as N_DEMOTION_TARGETS. However,
the current implementation doesn't distinguish any such memory either
and considers all N_MEMORY nodes as demotion targets, so this patch
series doesn't modify the current behavior.
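
The kmem side of this (patch 4/5) is a two-line change in concept:
once the hotplugged memory has been added, the driver marks the
corresponding node as a demotion target. A sketch of the idea (the
exact hook point inside dev_dax_kmem_probe() is an assumption, see the
patch for the real hunk):

 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	int numa_node = dev_dax->target_node;
 	...
 	/* after the memory has been hot-added successfully */
+	node_set_state(numa_node, N_DEMOTION_TARGETS);
 	...
 }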

The current code which sets migration targets is modified in this
patch series to avoid some of the limitations on demotion target
sharing and to consider only N_DEMOTION_TARGETS nodes while finding
demotion targets.
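
The core idea of the target-selection change is to restrict the
candidate set when searching for the nearest demotion target of a
given node. A simplified sketch (not the actual mm/migrate.c hunk,
which works on top of the existing node_demotion[] machinery):

	/*
	 * Pick the closest node that is marked as a demotion target and
	 * has not already been used as a target in this pass.
	 */
	static int find_next_best_demotion_node(int node, nodemask_t *used)
	{
		int n, best_node = NUMA_NO_NODE;
		int min_dist = INT_MAX;

		for_each_node_state(n, N_DEMOTION_TARGETS) {
			if (node_isset(n, *used))
				continue;
			if (node_distance(node, n) < min_dist) {
				min_dist = node_distance(node, n);
				best_node = n;
			}
		}

		return best_node;
	}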

Changelog
----------

v2:
In v1, only the 1st patch of this series was sent, which was
implemented to avoid some of the limitations on demotion target
sharing. However, for certain NUMA topologies the demotion targets
found by that patch were not optimal, so the 1st patch in this series
is modified according to suggestions from Huang and Baolin. Examples
comparing the demotion lists produced by the existing and the changed
implementation can be found in the commit message of the 1st patch.

Jagdish Gediya (5):
  mm: demotion: Set demotion list differently
  mm: demotion: Add new node state N_DEMOTION_TARGETS
  mm: demotion: Add support to set targets from userspace
  device-dax/kmem: Set node state as N_DEMOTION_TARGETS
  mm: demotion: Build demotion list based on N_DEMOTION_TARGETS

 .../ABI/testing/sysfs-kernel-mm-numa          | 12 ++++
 drivers/base/node.c                           |  4 ++
 drivers/dax/kmem.c                            |  2 +
 include/linux/nodemask.h                      |  1 +
 mm/migrate.c                                  | 67 +++++++++++++++----
 5 files changed, 72 insertions(+), 14 deletions(-)

Comments

Andrew Morton April 13, 2022, 9:44 p.m. UTC | #1
On Wed, 13 Apr 2022 14:52:01 +0530 Jagdish Gediya <jvgediya@linux.ibm.com> wrote:

> Current implementation to find the demotion targets works
> based on node state N_MEMORY, however some systems may have
> dram only memory numa node which are N_MEMORY but not the
> right choices as demotion targets.

Why are they not the right choice?  Please describe this fully so we
can understand the motivation and end-user benefit of the proposed
change.  And please more fully describe the end-user benefits of this
change.

> This patch series introduces the new node state
> N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> is used to hold the list of nodes which can be used as demotion
> targets, support is also added to set the demotion target
> list from user space so that default behavior can be overridden.

Permanently extending the kernel ABI is a fairly big deal.  Please
fully explain the end-user value, usage scenarios, etc.

What would go wrong if we simply omitted this interface?

> node state N_DEMOTION_TARGETS is also set from the dax kmem
> driver, certain type of memory which registers through dax kmem
> (e.g. HBM) may not be the right choices for demotion so in future
> they should be distinguished based on certain attributes and dax
> kmem driver should avoid setting them as N_DEMOTION_TARGETS,
> however current implementation also doesn't distinguish any 
> such memory and it considers all N_MEMORY as demotion targets
> so this patch series doesn't modify the current behavior.
> 
> Current code which sets migration targets is modified in
> this patch series to avoid some of the limitations on the demotion
> target sharing and to use N_DEMOTION_TARGETS only nodes while
> finding demotion targets.
> 
> Changelog
> ----------
> 
> v2:
> In v1, only 1st patch of this patch series was sent, which was
> implemented to avoid some of the limitations on the demotion
> target sharing, however for certain numa topology, the demotion
> targets found by that patch was not most optimal, so 1st patch
> in this series is modified according to suggestions from Huang
> and Baolin. Different examples of demotion list comparasion
> between existing implementation and changed implementation can
> be found in the commit message of 1st patch.
> 
> Jagdish Gediya (5):
>   mm: demotion: Set demotion list differently
>   mm: demotion: Add new node state N_DEMOTION_TARGETS
>   mm: demotion: Add support to set targets from userspace
>   device-dax/kmem: Set node state as N_DEMOTION_TARGETS
>   mm: demotion: Build demotion list based on N_DEMOTION_TARGETS
> 
>  .../ABI/testing/sysfs-kernel-mm-numa          | 12 ++++

This description is rather brief.  Some additional user-facing material
under Documentation/ would help.  Describe the format for writing to the
file, what is seen when reading from it, provide a bit of help to the
user so they can understand how to use it, what effects they might see,
etc.

>  drivers/base/node.c                           |  4 ++
>  drivers/dax/kmem.c                            |  2 +
>  include/linux/nodemask.h                      |  1 +
>  mm/migrate.c                                  | 67 +++++++++++++++----
>  5 files changed, 72 insertions(+), 14 deletions(-)
Huang, Ying April 14, 2022, 7 a.m. UTC | #2
On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> Current implementation to find the demotion targets works
> based on node state N_MEMORY, however some systems may have
> dram only memory numa node which are N_MEMORY but not the
> right choices as demotion targets.
> 
> This patch series introduces the new node state
> N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> is used to hold the list of nodes which can be used as demotion
> targets, support is also added to set the demotion target
> list from user space so that default behavior can be overridden.

It appears that your proposed user space interface cannot solve all
problems.  For example, for a system as follows,

Node 0 & 2 are cpu + dram nodes and node 1 is a slow memory node near
node 0,

available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus:
node 1 size: n MB
node 1 free: n MB
node 2 cpus: 2 3
node 2 size: n MB
node 2 free: n MB
node distances:
node   0   1   2
  0:  10  40  20
  1:  40  10  80
  2:  20  80  10

Demotion order 1:

node    demotion_target
 0              1
 1              X
 2              X

Demotion order 2:

node    demotion_target
 0              1
 1              X
 2              1

The demotion order 1 is preferred if we want to reduce cross-socket
traffic.  While the demotion order 2 is preferred if we want to take
full advantage of the slow memory node.  We can take either choice as
the automatically-generated order, while making the other choice
possible via user space override.

I don't know how to implement this via your proposed user space
interface.  How about the following user space interface?

1. Add a file "demotion_order_override" in
        /sys/devices/system/node/

2. When read, "1" is output if the demotion order of the system has been
overridden; "0" is output if not.

3. When "1" is written, the demotion order of the system will enter the
overridden mode.  When "0" is written, the demotion order of the system
will return to the automatic mode and the demotion order will be re-generated.

4. Add a file "demotion_targets" for each node in
        /sys/devices/system/node/nodeX/

5. When read, the demotion targets of nodeX will be output.

6. When a node list is written to the file, the demotion targets of nodeX
will be set to the written nodes.  And the demotion order of the system
will enter the overridden mode.

To reduce the complexity, the demotion order of the system is either in
overridden mode or automatic mode.  When converting from the automatic
mode to the overridden mode, the existing demotion targets of all nodes
will be retained until they are changed.  When converting from overridden
mode to automatic mode, the demotion order of the system will be re-
generated automatically.

In overridden mode, the demotion targets of a hot-added or hot-
removed node will be set to empty.  And the hot-removed node will be
removed from the demotion targets of any node.
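
To make items 4-6 a little more concrete, the per-node file could be
implemented roughly as below in drivers/base/node.c.  This is only a
sketch; get_demotion_targets() and set_demotion_targets_override() are
hypothetical helpers, not existing kernel functions:

static ssize_t demotion_targets_show(struct device *dev,
				     struct device_attribute *attr,
				     char *buf)
{
	/* hypothetical helper returning the node's current target mask */
	nodemask_t *targets = get_demotion_targets(dev->id);

	return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(targets));
}

static ssize_t demotion_targets_store(struct device *dev,
				      struct device_attribute *attr,
				      const char *buf, size_t count)
{
	nodemask_t targets;
	int err;

	err = nodelist_parse(buf, targets);
	if (err)
		return err;

	if (!nodes_subset(targets, node_states[N_MEMORY]))
		return -EINVAL;

	/* hypothetical helper: install the targets and switch to overridden mode */
	err = set_demotion_targets_override(dev->id, &targets);

	return err ? err : count;
}
static DEVICE_ATTR_RW(demotion_targets);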

This is an extension of the interface used in the following patch,

https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/

What do you think about this?

> node state N_DEMOTION_TARGETS is also set from the dax kmem
> driver, certain type of memory which registers through dax kmem
> (e.g. HBM) may not be the right choices for demotion so in future
> they should be distinguished based on certain attributes and dax
> kmem driver should avoid setting them as N_DEMOTION_TARGETS,
> however current implementation also doesn't distinguish any 
> such memory and it considers all N_MEMORY as demotion targets
> so this patch series doesn't modify the current behavior.
> 

Best Regards,
Huang, Ying

[snip]
Jagdish Gediya April 14, 2022, 10:16 a.m. UTC | #3
On Wed, Apr 13, 2022 at 02:44:34PM -0700, Andrew Morton wrote:
> On Wed, 13 Apr 2022 14:52:01 +0530 Jagdish Gediya <jvgediya@linux.ibm.com> wrote:
> 
> > Current implementation to find the demotion targets works
> > based on node state N_MEMORY, however some systems may have
> > dram only memory numa node which are N_MEMORY but not the
> > right choices as demotion targets.
> 
> Why are they not the right choice?  Please describe this fully so we
> can understand the motivation and end-user benefit of the proposed
> change.  And please more fully describe the end-user benefits of this
> change.

Some systems (e.g. PowerVM) can have DRAM (fast memory) only NUMA
nodes which are N_MEMORY as well as slow memory (persistent memory)
only NUMA nodes which are also N_MEMORY. As the current demotion
target finding algorithm works based on N_MEMORY and best distance, it
will choose the DRAM-only NUMA node as the demotion target instead of
the persistent memory node on such systems. If the DRAM-only NUMA node
is filled with demoted pages, then at some point new allocations can
start falling back to persistent memory, so basically cold pages end
up in fast memory (due to demotion) and new pages in slow memory. This
is why persistent memory nodes should be utilized for demotion and
DRAM nodes should be avoided as demotion targets, so that they can be
used for new allocations.

The current implementation can work fine on systems where memory-only
NUMA nodes can only be persistent/slow memory, but it is not suitable
for the kind of systems mentioned above.

Introducing the new node state N_DEMOTION_TARGETS provides a way to
handle demotion for such systems without affecting the existing
behavior.

> > This patch series introduces the new node state
> > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > is used to hold the list of nodes which can be used as demotion
> > targets, support is also added to set the demotion target
> > list from user space so that default behavior can be overridden.
> 
> Permanently extending the kernel ABI is a fairly big deal.  Please
> fully explain the end-user value, usage scenarios, etc.
> 
> What would go wrong if we simply omitted this interface?

I am going to modify this interface according to the review feedback
in the next version, but let me explain why it is needed with
examples.

Based on the topology and the available memory tiers in the system, it
may be possible that users don't want to utilize all the demotion
targets configured by the kernel by default, e.g.,

1. To reduce cross-socket traffic
2. To use only the slowest memory as demotion targets when there are
   multiple slow memory only nodes available

The current patch series handles option 2 above but doesn't handle
option 1, so the next version will add that support and might use a
different implementation to handle such scenarios.

Example 1
---------

With the below NUMA topology, where nodes 0 & 1 are cpu + dram nodes,
nodes 2 & 3 are equally slower memory-only nodes, and node 4 is the
slowest memory-only node,

available: 5 nodes (0-4)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus: 2 3
node 1 size: n MB
node 1 free: n MB
node 2 cpus:
node 2 size: n MB
node 2 free: n MB
node 3 cpus:
node 3 size: n MB
node 3 free: n MB
node 4 cpus:
node 4 size: n MB
node 4 free: n MB
node distances:
node   0   1   2   3   4
  0:  10  20  40  40  80
  1:  20  10  40  40  80
  2:  40  40  10  40  80
  3:  40  40  40  10  80
  4:  80  80  80  80  10

This patch series by default prepares the below demotion list,

node    demotion_target
 0              3, 2
 1              3, 2
 2              4
 3              4
 4              X

but it may be possible that the user wants to utilize nodes 2 & 3
only for new allocations and only node 4 for demotion.

Example 2
---------

With the below NUMA topology, where nodes 0 & 2 are cpu + dram nodes
and node 1 is a slow memory node near node 0,

available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus:
node 1 size: n MB
node 1 free: n MB
node 2 cpus: 2 3
node 2 size: n MB
node 2 free: n MB
node distances:
node   0   1   2
  0:  10  40  20
  1:  40  10  80
  2:  20  80  10

This patch series by default prepares the below demotion list,

node    demotion_target
 0              1
 1              X
 2              1

However, it may be possible that the user wants to avoid node 1 as
the demotion target for node 2 to reduce cross-socket traffic.

> > node state N_DEMOTION_TARGETS is also set from the dax kmem
> > driver, certain type of memory which registers through dax kmem
> > (e.g. HBM) may not be the right choices for demotion so in future
> > they should be distinguished based on certain attributes and dax
> > kmem driver should avoid setting them as N_DEMOTION_TARGETS,
> > however current implementation also doesn't distinguish any 
> > such memory and it considers all N_MEMORY as demotion targets
> > so this patch series doesn't modify the current behavior.
> > 
> > Current code which sets migration targets is modified in
> > this patch series to avoid some of the limitations on the demotion
> > target sharing and to use N_DEMOTION_TARGETS only nodes while
> > finding demotion targets.
> > 
> > Changelog
> > ----------
> > 
> > v2:
> > In v1, only 1st patch of this patch series was sent, which was
> > implemented to avoid some of the limitations on the demotion
> > target sharing, however for certain numa topology, the demotion
> > targets found by that patch was not most optimal, so 1st patch
> > in this series is modified according to suggestions from Huang
> > and Baolin. Different examples of demotion list comparasion
> > between existing implementation and changed implementation can
> > be found in the commit message of 1st patch.
> > 
> > Jagdish Gediya (5):
> >   mm: demotion: Set demotion list differently
> >   mm: demotion: Add new node state N_DEMOTION_TARGETS
> >   mm: demotion: Add support to set targets from userspace
> >   device-dax/kmem: Set node state as N_DEMOTION_TARGETS
> >   mm: demotion: Build demotion list based on N_DEMOTION_TARGETS
> > 
> >  .../ABI/testing/sysfs-kernel-mm-numa          | 12 ++++
> 
> This description is rather brief.  Some additional user-facing material
> under Documentation/ would help.  Describe the format for writing to the
> file, what is seen when reading from it, provide a bit of help to the
> user so they can understand how to use it, what effects they might see,
> etc.

Sure, will do in the next version.

> >  drivers/base/node.c                           |  4 ++
> >  drivers/dax/kmem.c                            |  2 +
> >  include/linux/nodemask.h                      |  1 +
> >  mm/migrate.c                                  | 67 +++++++++++++++----
> >  5 files changed, 72 insertions(+), 14 deletions(-)
>
Jagdish Gediya April 14, 2022, 10:19 a.m. UTC | #4
On Thu, Apr 14, 2022 at 03:00:46PM +0800, ying.huang@intel.com wrote:
> On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > Current implementation to find the demotion targets works
> > based on node state N_MEMORY, however some systems may have
> > dram only memory numa node which are N_MEMORY but not the
> > right choices as demotion targets.
> > 
> > This patch series introduces the new node state
> > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > is used to hold the list of nodes which can be used as demotion
> > targets, support is also added to set the demotion target
> > list from user space so that default behavior can be overridden.
> 
> It appears that your proposed user space interface cannot solve all
> problems.  For example, for system as follows,
> 
> Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> node 0,
> 
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus:
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 2 3
> node 2 size: n MB
> node 2 free: n MB
> node distances:
> node   0   1   2
>   0:  10  40  20
>   1:  40  10  80
>   2:  20  80  10
> 
> Demotion order 1:
> 
> node    demotion_target
>  0              1
>  1              X
>  2              X
> 
> Demotion order 2:
> 
> node    demotion_target
>  0              1
>  1              X
>  2              1
> 
> The demotion order 1 is preferred if we want to reduce cross-socket
> traffic.  While the demotion order 2 is preferred if we want to take
> full advantage of the slow memory node.  We can take any choice as
> automatic-generated order, while make the other choice possible via user
> space overridden.
> 
> I don't know how to implement this via your proposed user space
> interface.  How about the following user space interface?
> 
> 1. Add a file "demotion_order_override" in
>         /sys/devices/system/node/
> 
> 2. When read, "1" is output if the demotion order of the system has been
> overridden; "0" is output if not.
> 
> 3. When write "1", the demotion order of the system will become the
> overridden mode.  When write "0", the demotion order of the system will
> become the automatic mode and the demotion order will be re-generated. 
> 
> 4. Add a file "demotion_targets" for each node in
>         /sys/devices/system/node/nodeX/
> 
> 5. When read, the demotion targets of nodeX will be output.
> 
> 6. When write a node list to the file, the demotion targets of nodeX
> will be set to the written nodes.  And the demotion order of the system
> will become the overridden mode.
> 
> To reduce the complexity, the demotion order of the system is either in
> overridden mode or automatic mode.  When converting from the automatic
> mode to the overridden mode, the existing demotion targets of all nodes
> will be retained before being changed.  When converting from overridden
> mode to automatic mode, the demotion order of the system will be re-
> generated automatically.
> 
> In overridden mode, the demotion targets of the hot-added and hot-
> removed node will be set to empty.  And the hot-removed node will be
> removed from the demotion targets of any node.
> 
> This is an extention of the interface used in the following patch,
> 
> https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/
> 
> What do you think about this?

It looks good, I will implement it in the next version.

> > node state N_DEMOTION_TARGETS is also set from the dax kmem
> > driver, certain type of memory which registers through dax kmem
> > (e.g. HBM) may not be the right choices for demotion so in future
> > they should be distinguished based on certain attributes and dax
> > kmem driver should avoid setting them as N_DEMOTION_TARGETS,
> > however current implementation also doesn't distinguish any 
> > such memory and it considers all N_MEMORY as demotion targets
> > so this patch series doesn't modify the current behavior.
> > 
> 
> Best Regards,
> Huang, Ying
> 
> [snip]
> 
Best regards,
Jagdish
Yang Shi April 21, 2022, 3:11 a.m. UTC | #5
On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > Current implementation to find the demotion targets works
> > based on node state N_MEMORY, however some systems may have
> > dram only memory numa node which are N_MEMORY but not the
> > right choices as demotion targets.
> >
> > This patch series introduces the new node state
> > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > is used to hold the list of nodes which can be used as demotion
> > targets, support is also added to set the demotion target
> > list from user space so that default behavior can be overridden.
>
> It appears that your proposed user space interface cannot solve all
> problems.  For example, for system as follows,
>
> Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> node 0,
>
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus:
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 2 3
> node 2 size: n MB
> node 2 free: n MB
> node distances:
> node   0   1   2
>   0:  10  40  20
>   1:  40  10  80
>   2:  20  80  10
>
> Demotion order 1:
>
> node    demotion_target
>  0              1
>  1              X
>  2              X
>
> Demotion order 2:
>
> node    demotion_target
>  0              1
>  1              X
>  2              1
>
> The demotion order 1 is preferred if we want to reduce cross-socket
> traffic.  While the demotion order 2 is preferred if we want to take
> full advantage of the slow memory node.  We can take any choice as
> automatic-generated order, while make the other choice possible via user
> space overridden.
>
> I don't know how to implement this via your proposed user space
> interface.  How about the following user space interface?
>
> 1. Add a file "demotion_order_override" in
>         /sys/devices/system/node/
>
> 2. When read, "1" is output if the demotion order of the system has been
> overridden; "0" is output if not.
>
> 3. When write "1", the demotion order of the system will become the
> overridden mode.  When write "0", the demotion order of the system will
> become the automatic mode and the demotion order will be re-generated.
>
> 4. Add a file "demotion_targets" for each node in
>         /sys/devices/system/node/nodeX/
>
> 5. When read, the demotion targets of nodeX will be output.
>
> 6. When write a node list to the file, the demotion targets of nodeX
> will be set to the written nodes.  And the demotion order of the system
> will become the overridden mode.

TBH I don't think having the ability to override demotion targets from
userspace is quite useful in real life for now (it might become useful
in the future, I can't tell). Imagine you manage hundreds of thousands
of machines, which may come from different vendors, have different
generations of hardware, and have different versions of firmware; it
would be a nightmare for the users to configure the demotion targets
properly. So it would be great to have the kernel configure it
properly *without* intervention from the users.

So we should pick a proper default policy and stick with that policy
unless it doesn't work well for most workloads. I do understand it is
hard to make everyone happy. My proposal that every node in the fast
tier has a demotion target (at least one) if the slow tier exists
sounds like a reasonable default policy. I think this is also the
current implementation.

>
> To reduce the complexity, the demotion order of the system is either in
> overridden mode or automatic mode.  When converting from the automatic
> mode to the overridden mode, the existing demotion targets of all nodes
> will be retained before being changed.  When converting from overridden
> mode to automatic mode, the demotion order of the system will be re-
> generated automatically.
>
> In overridden mode, the demotion targets of the hot-added and hot-
> removed node will be set to empty.  And the hot-removed node will be
> removed from the demotion targets of any node.
>
> This is an extention of the interface used in the following patch,
>
> https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/
>
> What do you think about this?
>
> > node state N_DEMOTION_TARGETS is also set from the dax kmem
> > driver, certain type of memory which registers through dax kmem
> > (e.g. HBM) may not be the right choices for demotion so in future
> > they should be distinguished based on certain attributes and dax
> > kmem driver should avoid setting them as N_DEMOTION_TARGETS,
> > however current implementation also doesn't distinguish any
> > such memory and it considers all N_MEMORY as demotion targets
> > so this patch series doesn't modify the current behavior.
> >
>
> Best Regards,
> Huang, Ying
>
> [snip]
>
Wei Xu April 21, 2022, 5:41 a.m. UTC | #6
On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> >
> > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > Current implementation to find the demotion targets works
> > > based on node state N_MEMORY, however some systems may have
> > > dram only memory numa node which are N_MEMORY but not the
> > > right choices as demotion targets.
> > >
> > > This patch series introduces the new node state
> > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > is used to hold the list of nodes which can be used as demotion
> > > targets, support is also added to set the demotion target
> > > list from user space so that default behavior can be overridden.
> >
> > It appears that your proposed user space interface cannot solve all
> > problems.  For example, for system as follows,
> >
> > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > node 0,
> >
> > available: 3 nodes (0-2)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus:
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus: 2 3
> > node 2 size: n MB
> > node 2 free: n MB
> > node distances:
> > node   0   1   2
> >   0:  10  40  20
> >   1:  40  10  80
> >   2:  20  80  10
> >
> > Demotion order 1:
> >
> > node    demotion_target
> >  0              1
> >  1              X
> >  2              X
> >
> > Demotion order 2:
> >
> > node    demotion_target
> >  0              1
> >  1              X
> >  2              1
> >
> > The demotion order 1 is preferred if we want to reduce cross-socket
> > traffic.  While the demotion order 2 is preferred if we want to take
> > full advantage of the slow memory node.  We can take any choice as
> > automatic-generated order, while make the other choice possible via user
> > space overridden.
> >
> > I don't know how to implement this via your proposed user space
> > interface.  How about the following user space interface?
> >
> > 1. Add a file "demotion_order_override" in
> >         /sys/devices/system/node/
> >
> > 2. When read, "1" is output if the demotion order of the system has been
> > overridden; "0" is output if not.
> >
> > 3. When write "1", the demotion order of the system will become the
> > overridden mode.  When write "0", the demotion order of the system will
> > become the automatic mode and the demotion order will be re-generated.
> >
> > 4. Add a file "demotion_targets" for each node in
> >         /sys/devices/system/node/nodeX/
> >
> > 5. When read, the demotion targets of nodeX will be output.
> >
> > 6. When write a node list to the file, the demotion targets of nodeX
> > will be set to the written nodes.  And the demotion order of the system
> > will become the overridden mode.
>
> TBH I don't think having override demotion targets in userspace is
> quite useful in real life for now (it might become useful in the
> future, I can't tell). Imagine you manage hundred thousands of
> machines, which may come from different vendors, have different
> generations of hardware, have different versions of firmware, it would
> be a nightmare for the users to configure the demotion targets
> properly. So it would be great to have the kernel properly configure
> it *without* intervening from the users.
>
> So we should pick up a proper default policy and stick with that
> policy unless it doesn't work well for the most workloads. I do
> understand it is hard to make everyone happy. My proposal is having
> every node in the fast tier has a demotion target (at least one) if
> the slow tier exists sounds like a reasonable default policy. I think
> this is also the current implementation.
>

This is reasonable.  I agree that with a decent default policy, the
overriding of per-node demotion targets can be deferred.  The most
important problem here is that we should allow the configurations
where memory-only nodes are not used as demotion targets, which this
patch set has already addressed.

> >
> > To reduce the complexity, the demotion order of the system is either in
> > overridden mode or automatic mode.  When converting from the automatic
> > mode to the overridden mode, the existing demotion targets of all nodes
> > will be retained before being changed.  When converting from overridden
> > mode to automatic mode, the demotion order of the system will be re-
> > generated automatically.
> >
> > In overridden mode, the demotion targets of the hot-added and hot-
> > removed node will be set to empty.  And the hot-removed node will be
> > removed from the demotion targets of any node.
> >
> > This is an extention of the interface used in the following patch,
> >
> > https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/
> >
> > What do you think about this?
> >
> > > node state N_DEMOTION_TARGETS is also set from the dax kmem
> > > driver, certain type of memory which registers through dax kmem
> > > (e.g. HBM) may not be the right choices for demotion so in future
> > > they should be distinguished based on certain attributes and dax
> > > kmem driver should avoid setting them as N_DEMOTION_TARGETS,
> > > however current implementation also doesn't distinguish any
> > > such memory and it considers all N_MEMORY as demotion targets
> > > so this patch series doesn't modify the current behavior.
> > >
> >
> > Best Regards,
> > Huang, Ying
> >
> > [snip]
> >
Huang, Ying April 21, 2022, 6:24 a.m. UTC | #7
On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > 
> > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > > 
> > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > Current implementation to find the demotion targets works
> > > > based on node state N_MEMORY, however some systems may have
> > > > dram only memory numa node which are N_MEMORY but not the
> > > > right choices as demotion targets.
> > > > 
> > > > This patch series introduces the new node state
> > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > is used to hold the list of nodes which can be used as demotion
> > > > targets, support is also added to set the demotion target
> > > > list from user space so that default behavior can be overridden.
> > > 
> > > It appears that your proposed user space interface cannot solve all
> > > problems.  For example, for system as follows,
> > > 
> > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > node 0,
> > > 
> > > available: 3 nodes (0-2)
> > > node 0 cpus: 0 1
> > > node 0 size: n MB
> > > node 0 free: n MB
> > > node 1 cpus:
> > > node 1 size: n MB
> > > node 1 free: n MB
> > > node 2 cpus: 2 3
> > > node 2 size: n MB
> > > node 2 free: n MB
> > > node distances:
> > > node   0   1   2
> > >   0:  10  40  20
> > >   1:  40  10  80
> > >   2:  20  80  10
> > > 
> > > Demotion order 1:
> > > 
> > > node    demotion_target
> > >  0              1
> > >  1              X
> > >  2              X
> > > 
> > > Demotion order 2:
> > > 
> > > node    demotion_target
> > >  0              1
> > >  1              X
> > >  2              1
> > > 
> > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > traffic.  While the demotion order 2 is preferred if we want to take
> > > full advantage of the slow memory node.  We can take any choice as
> > > automatic-generated order, while make the other choice possible via user
> > > space overridden.
> > > 
> > > I don't know how to implement this via your proposed user space
> > > interface.  How about the following user space interface?
> > > 
> > > 1. Add a file "demotion_order_override" in
> > >         /sys/devices/system/node/
> > > 
> > > 2. When read, "1" is output if the demotion order of the system has been
> > > overridden; "0" is output if not.
> > > 
> > > 3. When write "1", the demotion order of the system will become the
> > > overridden mode.  When write "0", the demotion order of the system will
> > > become the automatic mode and the demotion order will be re-generated.
> > > 
> > > 4. Add a file "demotion_targets" for each node in
> > >         /sys/devices/system/node/nodeX/
> > > 
> > > 5. When read, the demotion targets of nodeX will be output.
> > > 
> > > 6. When write a node list to the file, the demotion targets of nodeX
> > > will be set to the written nodes.  And the demotion order of the system
> > > will become the overridden mode.
> > 
> > TBH I don't think having override demotion targets in userspace is
> > quite useful in real life for now (it might become useful in the
> > future, I can't tell). Imagine you manage hundred thousands of
> > machines, which may come from different vendors, have different
> > generations of hardware, have different versions of firmware, it would
> > be a nightmare for the users to configure the demotion targets
> > properly. So it would be great to have the kernel properly configure
> > it *without* intervening from the users.
> > 
> > So we should pick up a proper default policy and stick with that
> > policy unless it doesn't work well for the most workloads. I do
> > understand it is hard to make everyone happy. My proposal is having
> > every node in the fast tier has a demotion target (at least one) if
> > the slow tier exists sounds like a reasonable default policy. I think
> > this is also the current implementation.
> > 
> 
> This is reasonable.  I agree that with a decent default policy, 
> 

I agree that a decent default policy is important.  That was enhanced
in [1/5] of this patchset.

> the
> overriding of per-node demotion targets can be deferred.  The most
> important problem here is that we should allow the configurations
> where memory-only nodes are not used as demotion targets, which this
> patch set has already addressed.

Do you mean the user space interface proposed by [3/5] of this patchset?
 
IMHO, if we want to add a user space interface, I think that it should
be powerful enough to address all existing issues and some potential
future issues, so that it can be stable.  I don't think it's a good idea
to define a partial user space interface that works only for a specific
use case and cannot be extended for other use cases.

Best Regards,
Huang, Ying

[snip]

> >
Wei Xu April 21, 2022, 6:49 a.m. UTC | #8
On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > >
> > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > Current implementation to find the demotion targets works
> > > > > based on node state N_MEMORY, however some systems may have
> > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > right choices as demotion targets.
> > > > >
> > > > > This patch series introduces the new node state
> > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > is used to hold the list of nodes which can be used as demotion
> > > > > targets, support is also added to set the demotion target
> > > > > list from user space so that default behavior can be overridden.
> > > >
> > > > It appears that your proposed user space interface cannot solve all
> > > > problems.  For example, for system as follows,
> > > >
> > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > node 0,
> > > >
> > > > available: 3 nodes (0-2)
> > > > node 0 cpus: 0 1
> > > > node 0 size: n MB
> > > > node 0 free: n MB
> > > > node 1 cpus:
> > > > node 1 size: n MB
> > > > node 1 free: n MB
> > > > node 2 cpus: 2 3
> > > > node 2 size: n MB
> > > > node 2 free: n MB
> > > > node distances:
> > > > node   0   1   2
> > > >   0:  10  40  20
> > > >   1:  40  10  80
> > > >   2:  20  80  10
> > > >
> > > > Demotion order 1:
> > > >
> > > > node    demotion_target
> > > >  0              1
> > > >  1              X
> > > >  2              X
> > > >
> > > > Demotion order 2:
> > > >
> > > > node    demotion_target
> > > >  0              1
> > > >  1              X
> > > >  2              1
> > > >
> > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > full advantage of the slow memory node.  We can take any choice as
> > > > automatic-generated order, while make the other choice possible via user
> > > > space overridden.
> > > >
> > > > I don't know how to implement this via your proposed user space
> > > > interface.  How about the following user space interface?
> > > >
> > > > 1. Add a file "demotion_order_override" in
> > > >         /sys/devices/system/node/
> > > >
> > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > overridden; "0" is output if not.
> > > >
> > > > 3. When write "1", the demotion order of the system will become the
> > > > overridden mode.  When write "0", the demotion order of the system will
> > > > become the automatic mode and the demotion order will be re-generated.
> > > >
> > > > 4. Add a file "demotion_targets" for each node in
> > > >         /sys/devices/system/node/nodeX/
> > > >
> > > > 5. When read, the demotion targets of nodeX will be output.
> > > >
> > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > will be set to the written nodes.  And the demotion order of the system
> > > > will become the overridden mode.
> > >
> > > TBH I don't think having override demotion targets in userspace is
> > > quite useful in real life for now (it might become useful in the
> > > future, I can't tell). Imagine you manage hundred thousands of
> > > machines, which may come from different vendors, have different
> > > generations of hardware, have different versions of firmware, it would
> > > be a nightmare for the users to configure the demotion targets
> > > properly. So it would be great to have the kernel properly configure
> > > it *without* intervening from the users.
> > >
> > > So we should pick up a proper default policy and stick with that
> > > policy unless it doesn't work well for the most workloads. I do
> > > understand it is hard to make everyone happy. My proposal is having
> > > every node in the fast tier has a demotion target (at least one) if
> > > the slow tier exists sounds like a reasonable default policy. I think
> > > this is also the current implementation.
> > >
> >
> > This is reasonable.  I agree that with a decent default policy,
> >
>
> I agree that a decent default policy is important.  As that was enhanced
> in [1/5] of this patchset.
>
> > the
> > overriding of per-node demotion targets can be deferred.  The most
> > important problem here is that we should allow the configurations
> > where memory-only nodes are not used as demotion targets, which this
> > patch set has already addressed.
>
> Do you mean the user space interface proposed by [3/5] of this patchset?

Yes.

> IMHO, if we want to add a user space interface, I think that it should
> be powerful enough to address all existing issues and some potential
> future issues, so that it can be stable.  I don't think it's a good idea
> to define a partial user space interface that works only for a specific
> use case and cannot be extended for other use cases.

I actually think that they can be viewed as two separate problems: one
is to define which nodes can be used as demotion targets (this patch
set), and the other is how to initialize the per-node demotion path
(node_demotion[]).  We don't have to solve both problems at the same
time.

If we decide to go with a per-node demotion path customization
interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
is a single global control to turn off all demotion targets (for the
machines that don't use memory-only nodes for demotion).

> Best Regards,
> Huang, Ying
>
> [snip]
>
> > >
>
>
Huang, Ying April 21, 2022, 7:08 a.m. UTC | #9
On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > 
> > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > <ying.huang@intel.com> wrote:
> > > > > 
> > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > Current implementation to find the demotion targets works
> > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > right choices as demotion targets.
> > > > > > 
> > > > > > This patch series introduces the new node state
> > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > targets, support is also added to set the demotion target
> > > > > > list from user space so that default behavior can be overridden.
> > > > > 
> > > > > It appears that your proposed user space interface cannot solve all
> > > > > problems.  For example, for system as follows,
> > > > > 
> > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > node 0,
> > > > > 
> > > > > available: 3 nodes (0-2)
> > > > > node 0 cpus: 0 1
> > > > > node 0 size: n MB
> > > > > node 0 free: n MB
> > > > > node 1 cpus:
> > > > > node 1 size: n MB
> > > > > node 1 free: n MB
> > > > > node 2 cpus: 2 3
> > > > > node 2 size: n MB
> > > > > node 2 free: n MB
> > > > > node distances:
> > > > > node   0   1   2
> > > > >   0:  10  40  20
> > > > >   1:  40  10  80
> > > > >   2:  20  80  10
> > > > > 
> > > > > Demotion order 1:
> > > > > 
> > > > > node    demotion_target
> > > > >  0              1
> > > > >  1              X
> > > > >  2              X
> > > > > 
> > > > > Demotion order 2:
> > > > > 
> > > > > node    demotion_target
> > > > >  0              1
> > > > >  1              X
> > > > >  2              1
> > > > > 
> > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > automatic-generated order, while make the other choice possible via user
> > > > > space overridden.
> > > > > 
> > > > > I don't know how to implement this via your proposed user space
> > > > > interface.  How about the following user space interface?
> > > > > 
> > > > > 1. Add a file "demotion_order_override" in
> > > > >         /sys/devices/system/node/
> > > > > 
> > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > overridden; "0" is output if not.
> > > > > 
> > > > > 3. When write "1", the demotion order of the system will become the
> > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > 
> > > > > 4. Add a file "demotion_targets" for each node in
> > > > >         /sys/devices/system/node/nodeX/
> > > > > 
> > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > 
> > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > will become the overridden mode.
> > > > 
> > > > TBH I don't think having override demotion targets in userspace is
> > > > quite useful in real life for now (it might become useful in the
> > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > machines, which may come from different vendors, have different
> > > > generations of hardware, have different versions of firmware, it would
> > > > be a nightmare for the users to configure the demotion targets
> > > > properly. So it would be great to have the kernel properly configure
> > > > it *without* intervening from the users.
> > > > 
> > > > So we should pick up a proper default policy and stick with that
> > > > policy unless it doesn't work well for the most workloads. I do
> > > > understand it is hard to make everyone happy. My proposal is having
> > > > every node in the fast tier has a demotion target (at least one) if
> > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > this is also the current implementation.
> > > > 
> > > 
> > > This is reasonable.  I agree that with a decent default policy,
> > > 
> > 
> > I agree that a decent default policy is important.  As that was enhanced
> > in [1/5] of this patchset.
> > 
> > > the
> > > overriding of per-node demotion targets can be deferred.  The most
> > > important problem here is that we should allow the configurations
> > > where memory-only nodes are not used as demotion targets, which this
> > > patch set has already addressed.
> > 
> > Do you mean the user space interface proposed by [3/5] of this patchset?
> 
> Yes.
> 
> > IMHO, if we want to add a user space interface, I think that it should
> > be powerful enough to address all existing issues and some potential
> > future issues, so that it can be stable.  I don't think it's a good idea
> > to define a partial user space interface that works only for a specific
> > use case and cannot be extended for other use cases.
> 
> I actually think that they can be viewed as two separate problems: one
> is to define which nodes can be used as demotion targets (this patch
> set), and the other is how to initialize the per-node demotion path
> (node_demotion[]).  We don't have to solve both problems at the same
> time.
> 
> If we decide to go with a per-node demotion path customization
> interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> is a single global control to turn off all demotion targets (for the
> machines that don't use memory-only nodes for demotion).
> 

There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
interface to enable reclaim migration"), a sysfs interface

	/sys/kernel/mm/numa/demotion_enabled

is added to turn off all demotion targets.
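
For a post-boot agent, toggling it is just a boolean sysfs write, for
example (illustrative userspace snippet, not part of any kernel
series):

	#include <stdio.h>

	/* write 1/0 to /sys/kernel/mm/numa/demotion_enabled */
	static int set_demotion_enabled(int enabled)
	{
		FILE *f = fopen("/sys/kernel/mm/numa/demotion_enabled", "w");

		if (!f)
			return -1;
		fprintf(f, "%d\n", enabled);

		return fclose(f);
	}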

Best Regards,
Huang, Ying
Wei Xu April 21, 2022, 7:29 a.m. UTC | #10
On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > > <ying.huang@intel.com> wrote:
> > > > > >
> > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > Current implementation to find the demotion targets works
> > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > right choices as demotion targets.
> > > > > > >
> > > > > > > This patch series introduces the new node state
> > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > targets, support is also added to set the demotion target
> > > > > > > list from user space so that default behavior can be overridden.
> > > > > >
> > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > problems.  For example, for system as follows,
> > > > > >
> > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > node 0,
> > > > > >
> > > > > > available: 3 nodes (0-2)
> > > > > > node 0 cpus: 0 1
> > > > > > node 0 size: n MB
> > > > > > node 0 free: n MB
> > > > > > node 1 cpus:
> > > > > > node 1 size: n MB
> > > > > > node 1 free: n MB
> > > > > > node 2 cpus: 2 3
> > > > > > node 2 size: n MB
> > > > > > node 2 free: n MB
> > > > > > node distances:
> > > > > > node   0   1   2
> > > > > >   0:  10  40  20
> > > > > >   1:  40  10  80
> > > > > >   2:  20  80  10
> > > > > >
> > > > > > Demotion order 1:
> > > > > >
> > > > > > node    demotion_target
> > > > > >  0              1
> > > > > >  1              X
> > > > > >  2              X
> > > > > >
> > > > > > Demotion order 2:
> > > > > >
> > > > > > node    demotion_target
> > > > > >  0              1
> > > > > >  1              X
> > > > > >  2              1
> > > > > >
> > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > space overridden.
> > > > > >
> > > > > > I don't know how to implement this via your proposed user space
> > > > > > interface.  How about the following user space interface?
> > > > > >
> > > > > > 1. Add a file "demotion_order_override" in
> > > > > >         /sys/devices/system/node/
> > > > > >
> > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > overridden; "0" is output if not.
> > > > > >
> > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > >
> > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > >         /sys/devices/system/node/nodeX/
> > > > > >
> > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > >
> > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > > will become the overridden mode.
> > > > >
> > > > > TBH I don't think having override demotion targets in userspace is
> > > > > quite useful in real life for now (it might become useful in the
> > > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > > machines, which may come from different vendors, have different
> > > > > generations of hardware, have different versions of firmware, it would
> > > > > be a nightmare for the users to configure the demotion targets
> > > > > properly. So it would be great to have the kernel properly configure
> > > > > it *without* intervening from the users.
> > > > >
> > > > > So we should pick up a proper default policy and stick with that
> > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > this is also the current implementation.
> > > > >
> > > >
> > > > This is reasonable.  I agree that with a decent default policy,
> > > >
> > >
> > > I agree that a decent default policy is important.  As that was enhanced
> > > in [1/5] of this patchset.
> > >
> > > > the
> > > > overriding of per-node demotion targets can be deferred.  The most
> > > > important problem here is that we should allow the configurations
> > > > where memory-only nodes are not used as demotion targets, which this
> > > > patch set has already addressed.
> > >
> > > Do you mean the user space interface proposed by [3/5] of this patchset?
> >
> > Yes.
> >
> > > IMHO, if we want to add a user space interface, I think that it should
> > > be powerful enough to address all existing issues and some potential
> > > future issues, so that it can be stable.  I don't think it's a good idea
> > > to define a partial user space interface that works only for a specific
> > > use case and cannot be extended for other use cases.
> >
> > I actually think that they can be viewed as two separate problems: one
> > is to define which nodes can be used as demotion targets (this patch
> > set), and the other is how to initialize the per-node demotion path
> > (node_demotion[]).  We don't have to solve both problems at the same
> > time.
> >
> > If we decide to go with a per-node demotion path customization
> > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > is a single global control to turn off all demotion targets (for the
> > machines that don't use memory-only nodes for demotion).
> >
>
> There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
> interface to enable reclaim migration"), a sysfs interface
>
>         /sys/kernel/mm/numa/demotion_enabled
>
> is added to turn off all demotion targets.

IIUC, this sysfs interface only turns off demotion-in-reclaim.  It
would be even cleaner if we had an easy way to clear node_demotion[]
and N_DEMOTION_TARGETS so that userspace (a post-boot agent, not
init scripts) can know that the machine doesn't even have memory
tiering hardware enabled.

> Best Regards,
> Huang, Ying
>
>
>
Huang, Ying April 21, 2022, 7:45 a.m. UTC | #11
On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote:
> On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > > 
> > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > 
> > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > > > <ying.huang@intel.com> wrote:
> > > > > > > 
> > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > > Current implementation to find the demotion targets works
> > > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > > right choices as demotion targets.
> > > > > > > > 
> > > > > > > > This patch series introduces the new node state
> > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > > targets, support is also added to set the demotion target
> > > > > > > > list from user space so that default behavior can be overridden.
> > > > > > > 
> > > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > > problems.  For example, for system as follows,
> > > > > > > 
> > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > > node 0,
> > > > > > > 
> > > > > > > available: 3 nodes (0-2)
> > > > > > > node 0 cpus: 0 1
> > > > > > > node 0 size: n MB
> > > > > > > node 0 free: n MB
> > > > > > > node 1 cpus:
> > > > > > > node 1 size: n MB
> > > > > > > node 1 free: n MB
> > > > > > > node 2 cpus: 2 3
> > > > > > > node 2 size: n MB
> > > > > > > node 2 free: n MB
> > > > > > > node distances:
> > > > > > > node   0   1   2
> > > > > > >   0:  10  40  20
> > > > > > >   1:  40  10  80
> > > > > > >   2:  20  80  10
> > > > > > > 
> > > > > > > Demotion order 1:
> > > > > > > 
> > > > > > > node    demotion_target
> > > > > > >  0              1
> > > > > > >  1              X
> > > > > > >  2              X
> > > > > > > 
> > > > > > > Demotion order 2:
> > > > > > > 
> > > > > > > node    demotion_target
> > > > > > >  0              1
> > > > > > >  1              X
> > > > > > >  2              1
> > > > > > > 
> > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > > space overridden.
> > > > > > > 
> > > > > > > I don't know how to implement this via your proposed user space
> > > > > > > interface.  How about the following user space interface?
> > > > > > > 
> > > > > > > 1. Add a file "demotion_order_override" in
> > > > > > >         /sys/devices/system/node/
> > > > > > > 
> > > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > > overridden; "0" is output if not.
> > > > > > > 
> > > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > > > 
> > > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > > >         /sys/devices/system/node/nodeX/
> > > > > > > 
> > > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > > > 
> > > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > > > will become the overridden mode.
> > > > > > 
> > > > > > TBH I don't think having override demotion targets in userspace is
> > > > > > quite useful in real life for now (it might become useful in the
> > > > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > > > machines, which may come from different vendors, have different
> > > > > > generations of hardware, have different versions of firmware, it would
> > > > > > be a nightmare for the users to configure the demotion targets
> > > > > > properly. So it would be great to have the kernel properly configure
> > > > > > it *without* intervening from the users.
> > > > > > 
> > > > > > So we should pick up a proper default policy and stick with that
> > > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > > this is also the current implementation.
> > > > > > 
> > > > > 
> > > > > This is reasonable.  I agree that with a decent default policy,
> > > > > 
> > > > 
> > > > I agree that a decent default policy is important.  As that was enhanced
> > > > in [1/5] of this patchset.
> > > > 
> > > > > the
> > > > > overriding of per-node demotion targets can be deferred.  The most
> > > > > important problem here is that we should allow the configurations
> > > > > where memory-only nodes are not used as demotion targets, which this
> > > > > patch set has already addressed.
> > > > 
> > > > Do you mean the user space interface proposed by [3/5] of this patchset?
> > > 
> > > Yes.
> > > 
> > > > IMHO, if we want to add a user space interface, I think that it should
> > > > be powerful enough to address all existing issues and some potential
> > > > future issues, so that it can be stable.  I don't think it's a good idea
> > > > to define a partial user space interface that works only for a specific
> > > > use case and cannot be extended for other use cases.
> > > 
> > > I actually think that they can be viewed as two separate problems: one
> > > is to define which nodes can be used as demotion targets (this patch
> > > set), and the other is how to initialize the per-node demotion path
> > > (node_demotion[]).  We don't have to solve both problems at the same
> > > time.
> > > 
> > > If we decide to go with a per-node demotion path customization
> > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > > is a single global control to turn off all demotion targets (for the
> > > machines that don't use memory-only nodes for demotion).
> > > 
> > 
> > There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
> > interface to enable reclaim migration"), a sysfs interface
> > 
> >         /sys/kernel/mm/numa/demotion_enabled
> > 
> > is added to turn off all demotion targets.
> 
> IIUC, this sysfs interface only turns off demotion-in-reclaim.  It
> will be even cleaner if we have an easy way to clear node_demotion[]
> and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not
> init scripts) can know that the machine doesn't even have memory
> tiering hardware enabled.
> 

What is the difference?  Right now we have no interface to show the
demotion targets of a node; that is in-kernel only.  And what is memory
tiering hardware?  The Optane PMEM?  Some information about it is
already available via the ACPI HMAT table.

Except for demotion-in-reclaim, what else do you care about?
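
(As an aside on the HMAT point above: on kernels with HMAT support the
derived access characteristics are already exported per node, per
Documentation/admin-guide/mm/numaperf.rst, so a userspace check is just
a sysfs read.  A rough sketch, with the node number hard-coded and only
the sysfs paths taken from that document; everything else is
illustrative:)

#include <stdio.h>

static void print_access0(int nid, const char *attr)
{
        char path[128], val[64];
        FILE *f;

        snprintf(path, sizeof(path),
                 "/sys/devices/system/node/node%d/access0/initiators/%s",
                 nid, attr);
        f = fopen(path, "r");
        if (!f)
                return;
        if (fgets(val, sizeof(val), f))
                printf("node%d %s: %s", nid, attr, val);
        fclose(f);
}

int main(void)
{
        print_access0(1, "read_latency");
        print_access0(1, "read_bandwidth");
        return 0;
}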

Best Regards,
Huang, Ying
Yang Shi April 21, 2022, 5:56 p.m. UTC | #12
On Wed, Apr 20, 2022 at 10:41 PM Wei Xu <weixugc@google.com> wrote:
>
> On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > Current implementation to find the demotion targets works
> > > > based on node state N_MEMORY, however some systems may have
> > > > dram only memory numa node which are N_MEMORY but not the
> > > > right choices as demotion targets.
> > > >
> > > > This patch series introduces the new node state
> > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > is used to hold the list of nodes which can be used as demotion
> > > > targets, support is also added to set the demotion target
> > > > list from user space so that default behavior can be overridden.
> > >
> > > It appears that your proposed user space interface cannot solve all
> > > problems.  For example, for system as follows,
> > >
> > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > node 0,
> > >
> > > available: 3 nodes (0-2)
> > > node 0 cpus: 0 1
> > > node 0 size: n MB
> > > node 0 free: n MB
> > > node 1 cpus:
> > > node 1 size: n MB
> > > node 1 free: n MB
> > > node 2 cpus: 2 3
> > > node 2 size: n MB
> > > node 2 free: n MB
> > > node distances:
> > > node   0   1   2
> > >   0:  10  40  20
> > >   1:  40  10  80
> > >   2:  20  80  10
> > >
> > > Demotion order 1:
> > >
> > > node    demotion_target
> > >  0              1
> > >  1              X
> > >  2              X
> > >
> > > Demotion order 2:
> > >
> > > node    demotion_target
> > >  0              1
> > >  1              X
> > >  2              1
> > >
> > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > traffic.  While the demotion order 2 is preferred if we want to take
> > > full advantage of the slow memory node.  We can take any choice as
> > > automatic-generated order, while make the other choice possible via user
> > > space overridden.
> > >
> > > I don't know how to implement this via your proposed user space
> > > interface.  How about the following user space interface?
> > >
> > > 1. Add a file "demotion_order_override" in
> > >         /sys/devices/system/node/
> > >
> > > 2. When read, "1" is output if the demotion order of the system has been
> > > overridden; "0" is output if not.
> > >
> > > 3. When write "1", the demotion order of the system will become the
> > > overridden mode.  When write "0", the demotion order of the system will
> > > become the automatic mode and the demotion order will be re-generated.
> > >
> > > 4. Add a file "demotion_targets" for each node in
> > >         /sys/devices/system/node/nodeX/
> > >
> > > 5. When read, the demotion targets of nodeX will be output.
> > >
> > > 6. When write a node list to the file, the demotion targets of nodeX
> > > will be set to the written nodes.  And the demotion order of the system
> > > will become the overridden mode.
> >
> > TBH I don't think having override demotion targets in userspace is
> > quite useful in real life for now (it might become useful in the
> > future, I can't tell). Imagine you manage hundred thousands of
> > machines, which may come from different vendors, have different
> > generations of hardware, have different versions of firmware, it would
> > be a nightmare for the users to configure the demotion targets
> > properly. So it would be great to have the kernel properly configure
> > it *without* intervening from the users.
> >
> > So we should pick up a proper default policy and stick with that
> > policy unless it doesn't work well for the most workloads. I do
> > understand it is hard to make everyone happy. My proposal is having
> > every node in the fast tier has a demotion target (at least one) if
> > the slow tier exists sounds like a reasonable default policy. I think
> > this is also the current implementation.
> >
>
> This is reasonable.  I agree that with a decent default policy, the
> overriding of per-node demotion targets can be deferred.  The most
> important problem here is that we should allow the configurations
> where memory-only nodes are not used as demotion targets, which this
> patch set has already addressed.

Yes, I agree. Fixing the bug and allowing override by userspace are
totally two separate things.

>
> > >
> > > To reduce the complexity, the demotion order of the system is either in
> > > overridden mode or automatic mode.  When converting from the automatic
> > > mode to the overridden mode, the existing demotion targets of all nodes
> > > will be retained before being changed.  When converting from overridden
> > > mode to automatic mode, the demotion order of the system will be re-
> > > generated automatically.
> > >
> > > In overridden mode, the demotion targets of the hot-added and hot-
> > > removed node will be set to empty.  And the hot-removed node will be
> > > removed from the demotion targets of any node.
> > >
> > > This is an extention of the interface used in the following patch,
> > >
> > > https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/
> > >
> > > What do you think about this?
> > >
> > > > node state N_DEMOTION_TARGETS is also set from the dax kmem
> > > > driver, certain type of memory which registers through dax kmem
> > > > (e.g. HBM) may not be the right choices for demotion so in future
> > > > they should be distinguished based on certain attributes and dax
> > > > kmem driver should avoid setting them as N_DEMOTION_TARGETS,
> > > > however current implementation also doesn't distinguish any
> > > > such memory and it considers all N_MEMORY as demotion targets
> > > > so this patch series doesn't modify the current behavior.
> > > >
> > >
> > > Best Regards,
> > > Huang, Ying
> > >
> > > [snip]
> > >
Wei Xu April 21, 2022, 6:26 p.m. UTC | #13
On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote:
> > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> > > > <ying.huang@intel.com> wrote:
> > > > >
> > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > >
> > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > > > Current implementation to find the demotion targets works
> > > > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > > > right choices as demotion targets.
> > > > > > > > >
> > > > > > > > > This patch series introduces the new node state
> > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > > > targets, support is also added to set the demotion target
> > > > > > > > > list from user space so that default behavior can be overridden.
> > > > > > > >
> > > > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > > > problems.  For example, for system as follows,
> > > > > > > >
> > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > > > node 0,
> > > > > > > >
> > > > > > > > available: 3 nodes (0-2)
> > > > > > > > node 0 cpus: 0 1
> > > > > > > > node 0 size: n MB
> > > > > > > > node 0 free: n MB
> > > > > > > > node 1 cpus:
> > > > > > > > node 1 size: n MB
> > > > > > > > node 1 free: n MB
> > > > > > > > node 2 cpus: 2 3
> > > > > > > > node 2 size: n MB
> > > > > > > > node 2 free: n MB
> > > > > > > > node distances:
> > > > > > > > node   0   1   2
> > > > > > > >   0:  10  40  20
> > > > > > > >   1:  40  10  80
> > > > > > > >   2:  20  80  10
> > > > > > > >
> > > > > > > > Demotion order 1:
> > > > > > > >
> > > > > > > > node    demotion_target
> > > > > > > >  0              1
> > > > > > > >  1              X
> > > > > > > >  2              X
> > > > > > > >
> > > > > > > > Demotion order 2:
> > > > > > > >
> > > > > > > > node    demotion_target
> > > > > > > >  0              1
> > > > > > > >  1              X
> > > > > > > >  2              1
> > > > > > > >
> > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > > > space overridden.
> > > > > > > >
> > > > > > > > I don't know how to implement this via your proposed user space
> > > > > > > > interface.  How about the following user space interface?
> > > > > > > >
> > > > > > > > 1. Add a file "demotion_order_override" in
> > > > > > > >         /sys/devices/system/node/
> > > > > > > >
> > > > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > > > overridden; "0" is output if not.
> > > > > > > >
> > > > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > > > >
> > > > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > > > >         /sys/devices/system/node/nodeX/
> > > > > > > >
> > > > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > > > >
> > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > > > > will become the overridden mode.
> > > > > > >
> > > > > > > TBH I don't think having override demotion targets in userspace is
> > > > > > > quite useful in real life for now (it might become useful in the
> > > > > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > > > > machines, which may come from different vendors, have different
> > > > > > > generations of hardware, have different versions of firmware, it would
> > > > > > > be a nightmare for the users to configure the demotion targets
> > > > > > > properly. So it would be great to have the kernel properly configure
> > > > > > > it *without* intervening from the users.
> > > > > > >
> > > > > > > So we should pick up a proper default policy and stick with that
> > > > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > > > this is also the current implementation.
> > > > > > >
> > > > > >
> > > > > > This is reasonable.  I agree that with a decent default policy,
> > > > > >
> > > > >
> > > > > I agree that a decent default policy is important.  As that was enhanced
> > > > > in [1/5] of this patchset.
> > > > >
> > > > > > the
> > > > > > overriding of per-node demotion targets can be deferred.  The most
> > > > > > important problem here is that we should allow the configurations
> > > > > > where memory-only nodes are not used as demotion targets, which this
> > > > > > patch set has already addressed.
> > > > >
> > > > > Do you mean the user space interface proposed by [3/5] of this patchset?
> > > >
> > > > Yes.
> > > >
> > > > > IMHO, if we want to add a user space interface, I think that it should
> > > > > be powerful enough to address all existing issues and some potential
> > > > > future issues, so that it can be stable.  I don't think it's a good idea
> > > > > to define a partial user space interface that works only for a specific
> > > > > use case and cannot be extended for other use cases.
> > > >
> > > > I actually think that they can be viewed as two separate problems: one
> > > > is to define which nodes can be used as demotion targets (this patch
> > > > set), and the other is how to initialize the per-node demotion path
> > > > (node_demotion[]).  We don't have to solve both problems at the same
> > > > time.
> > > >
> > > > If we decide to go with a per-node demotion path customization
> > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > > > is a single global control to turn off all demotion targets (for the
> > > > machines that don't use memory-only nodes for demotion).
> > > >
> > >
> > > There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
> > > interface to enable reclaim migration"), a sysfs interface
> > >
> > >         /sys/kernel/mm/numa/demotion_enabled
> > >
> > > is added to turn off all demotion targets.
> >
> > IIUC, this sysfs interface only turns off demotion-in-reclaim.  It
> > will be even cleaner if we have an easy way to clear node_demotion[]
> > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not
> > init scripts) can know that the machine doesn't even have memory
> > tiering hardware enabled.
> >
>
> What is the difference?  Now we have no interface to show demotion
> targets of a node.  That is in-kernel only.  What is memory tiering
> hardware?  The Optane PMEM?  Some information for it is available via
> ACPI HMAT table.
>
> Except demotion-in-reclaim, what else do you care about?

There is a difference: one is to indicate the availability of the
memory tiering hardware, and the other is to indicate whether
transparent, kernel-driven demotion from the reclaim path is activated.
With /sys/devices/system/node/demote_targets or the per-node demotion
target interface, userspace can figure out the memory tiering
topology abstracted by the kernel.  It is possible to use
application-guided demotion without having to enable reclaim-based
demotion in the kernel.  Logically it is also cleaner to me to
decouple the tiering node representation from the actual demotion
mechanism enablement.
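
As a rough userspace sketch of that usage (the demote_targets path
follows the discussion here and is not a merged interface; the program
is illustrative only):

#include <stdio.h>
#include <string.h>

int main(void)
{
        char buf[256] = "";
        FILE *f = fopen("/sys/devices/system/node/demote_targets", "r");

        if (!f)
                return 1;       /* interface not available on this kernel */
        if (fgets(buf, sizeof(buf), f))
                buf[strcspn(buf, "\n")] = '\0';
        fclose(f);

        /* A non-empty list means the kernel sees tiering nodes, even if
         * reclaim-based demotion itself is disabled. */
        if (buf[0])
                printf("demotion target nodes: %s\n", buf);
        else
                printf("no demotion targets configured\n");
        return 0;
}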

> Best Regards,
> Huang, Ying
>
>
>
Huang, Ying April 21, 2022, 11:48 p.m. UTC | #14
On Thu, 2022-04-21 at 10:56 -0700, Yang Shi wrote:
> On Wed, Apr 20, 2022 at 10:41 PM Wei Xu <weixugc@google.com> wrote:
> > 
> > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > 
> > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > > 
> > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > Current implementation to find the demotion targets works
> > > > > based on node state N_MEMORY, however some systems may have
> > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > right choices as demotion targets.
> > > > > 
> > > > > This patch series introduces the new node state
> > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > is used to hold the list of nodes which can be used as demotion
> > > > > targets, support is also added to set the demotion target
> > > > > list from user space so that default behavior can be overridden.
> > > > 
> > > > It appears that your proposed user space interface cannot solve all
> > > > problems.  For example, for system as follows,
> > > > 
> > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > node 0,
> > > > 
> > > > available: 3 nodes (0-2)
> > > > node 0 cpus: 0 1
> > > > node 0 size: n MB
> > > > node 0 free: n MB
> > > > node 1 cpus:
> > > > node 1 size: n MB
> > > > node 1 free: n MB
> > > > node 2 cpus: 2 3
> > > > node 2 size: n MB
> > > > node 2 free: n MB
> > > > node distances:
> > > > node   0   1   2
> > > >   0:  10  40  20
> > > >   1:  40  10  80
> > > >   2:  20  80  10
> > > > 
> > > > Demotion order 1:
> > > > 
> > > > node    demotion_target
> > > >  0              1
> > > >  1              X
> > > >  2              X
> > > > 
> > > > Demotion order 2:
> > > > 
> > > > node    demotion_target
> > > >  0              1
> > > >  1              X
> > > >  2              1
> > > > 
> > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > full advantage of the slow memory node.  We can take any choice as
> > > > automatic-generated order, while make the other choice possible via user
> > > > space overridden.
> > > > 
> > > > I don't know how to implement this via your proposed user space
> > > > interface.  How about the following user space interface?
> > > > 
> > > > 1. Add a file "demotion_order_override" in
> > > >         /sys/devices/system/node/
> > > > 
> > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > overridden; "0" is output if not.
> > > > 
> > > > 3. When write "1", the demotion order of the system will become the
> > > > overridden mode.  When write "0", the demotion order of the system will
> > > > become the automatic mode and the demotion order will be re-generated.
> > > > 
> > > > 4. Add a file "demotion_targets" for each node in
> > > >         /sys/devices/system/node/nodeX/
> > > > 
> > > > 5. When read, the demotion targets of nodeX will be output.
> > > > 
> > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > will be set to the written nodes.  And the demotion order of the system
> > > > will become the overridden mode.
> > > 
> > > TBH I don't think having override demotion targets in userspace is
> > > quite useful in real life for now (it might become useful in the
> > > future, I can't tell). Imagine you manage hundred thousands of
> > > machines, which may come from different vendors, have different
> > > generations of hardware, have different versions of firmware, it would
> > > be a nightmare for the users to configure the demotion targets
> > > properly. So it would be great to have the kernel properly configure
> > > it *without* intervening from the users.
> > > 
> > > So we should pick up a proper default policy and stick with that
> > > policy unless it doesn't work well for the most workloads. I do
> > > understand it is hard to make everyone happy. My proposal is having
> > > every node in the fast tier has a demotion target (at least one) if
> > > the slow tier exists sounds like a reasonable default policy. I think
> > > this is also the current implementation.
> > > 
> > 
> > This is reasonable.  I agree that with a decent default policy, the
> > overriding of per-node demotion targets can be deferred.  The most
> > important problem here is that we should allow the configurations
> > where memory-only nodes are not used as demotion targets, which this
> > patch set has already addressed.
> 
> Yes, I agree. Fixing the bug and allowing override by userspace are
> totally two separate things.
> 

Yes.  I agree with separating the two, although [1/5] doesn't fix a
bug; it improves the automatic order generation method.  So I think it's
better to split this patchset into 2 patchsets: [1/5] for improving the
automatic order generation, and [2-5/5] for the user space overriding.

Best Regards,
Huang, Ying

> > 
> > > > 
> > > > To reduce the complexity, the demotion order of the system is either in
> > > > overridden mode or automatic mode.  When converting from the automatic
> > > > mode to the overridden mode, the existing demotion targets of all nodes
> > > > will be retained before being changed.  When converting from overridden
> > > > mode to automatic mode, the demotion order of the system will be re-
> > > > generated automatically.
> > > > 
> > > > In overridden mode, the demotion targets of the hot-added and hot-
> > > > removed node will be set to empty.  And the hot-removed node will be
> > > > removed from the demotion targets of any node.
> > > > 
> > > > This is an extention of the interface used in the following patch,
> > > > 
> > > > https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/
> > > > 
> > > > What do you think about this?
> > > > 
> > > > > node state N_DEMOTION_TARGETS is also set from the dax kmem
> > > > > driver, certain type of memory which registers through dax kmem
> > > > > (e.g. HBM) may not be the right choices for demotion so in future
> > > > > they should be distinguished based on certain attributes and dax
> > > > > kmem driver should avoid setting them as N_DEMOTION_TARGETS,
> > > > > however current implementation also doesn't distinguish any
> > > > > such memory and it considers all N_MEMORY as demotion targets
> > > > > so this patch series doesn't modify the current behavior.
> > > > > 
> > > > 
> > > > Best Regards,
> > > > Huang, Ying
> > > > 
> > > > [snip]
> > > >
Huang, Ying April 22, 2022, 12:58 a.m. UTC | #15
On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote:
> On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote:
> > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > > 
> > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> > > > > <ying.huang@intel.com> wrote:
> > > > > > 
> > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > > > 
> > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > 
> > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > > > > Current implementation to find the demotion targets works
> > > > > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > > > > right choices as demotion targets.
> > > > > > > > > > 
> > > > > > > > > > This patch series introduces the new node state
> > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > > > > targets, support is also added to set the demotion target
> > > > > > > > > > list from user space so that default behavior can be overridden.
> > > > > > > > > 
> > > > > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > > > > problems.  For example, for system as follows,
> > > > > > > > > 
> > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > > > > node 0,
> > > > > > > > > 
> > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > node 0 size: n MB
> > > > > > > > > node 0 free: n MB
> > > > > > > > > node 1 cpus:
> > > > > > > > > node 1 size: n MB
> > > > > > > > > node 1 free: n MB
> > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > node 2 size: n MB
> > > > > > > > > node 2 free: n MB
> > > > > > > > > node distances:
> > > > > > > > > node   0   1   2
> > > > > > > > >   0:  10  40  20
> > > > > > > > >   1:  40  10  80
> > > > > > > > >   2:  20  80  10
> > > > > > > > > 
> > > > > > > > > Demotion order 1:
> > > > > > > > > 
> > > > > > > > > node    demotion_target
> > > > > > > > >  0              1
> > > > > > > > >  1              X
> > > > > > > > >  2              X
> > > > > > > > > 
> > > > > > > > > Demotion order 2:
> > > > > > > > > 
> > > > > > > > > node    demotion_target
> > > > > > > > >  0              1
> > > > > > > > >  1              X
> > > > > > > > >  2              1
> > > > > > > > > 
> > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > > > > space overridden.
> > > > > > > > > 
> > > > > > > > > I don't know how to implement this via your proposed user space
> > > > > > > > > interface.  How about the following user space interface?
> > > > > > > > > 
> > > > > > > > > 1. Add a file "demotion_order_override" in
> > > > > > > > >         /sys/devices/system/node/
> > > > > > > > > 
> > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > > > > overridden; "0" is output if not.
> > > > > > > > > 
> > > > > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > > > > > 
> > > > > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > > > > >         /sys/devices/system/node/nodeX/
> > > > > > > > > 
> > > > > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > > > > > 
> > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > > > > > will become the overridden mode.
> > > > > > > > 
> > > > > > > > TBH I don't think having override demotion targets in userspace is
> > > > > > > > quite useful in real life for now (it might become useful in the
> > > > > > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > > > > > machines, which may come from different vendors, have different
> > > > > > > > generations of hardware, have different versions of firmware, it would
> > > > > > > > be a nightmare for the users to configure the demotion targets
> > > > > > > > properly. So it would be great to have the kernel properly configure
> > > > > > > > it *without* intervening from the users.
> > > > > > > > 
> > > > > > > > So we should pick up a proper default policy and stick with that
> > > > > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > > > > this is also the current implementation.
> > > > > > > > 
> > > > > > > 
> > > > > > > This is reasonable.  I agree that with a decent default policy,
> > > > > > > 
> > > > > > 
> > > > > > I agree that a decent default policy is important.  As that was enhanced
> > > > > > in [1/5] of this patchset.
> > > > > > 
> > > > > > > the
> > > > > > > overriding of per-node demotion targets can be deferred.  The most
> > > > > > > important problem here is that we should allow the configurations
> > > > > > > where memory-only nodes are not used as demotion targets, which this
> > > > > > > patch set has already addressed.
> > > > > > 
> > > > > > Do you mean the user space interface proposed by [3/5] of this patchset?
> > > > > 
> > > > > Yes.
> > > > > 
> > > > > > IMHO, if we want to add a user space interface, I think that it should
> > > > > > be powerful enough to address all existing issues and some potential
> > > > > > future issues, so that it can be stable.  I don't think it's a good idea
> > > > > > to define a partial user space interface that works only for a specific
> > > > > > use case and cannot be extended for other use cases.
> > > > > 
> > > > > I actually think that they can be viewed as two separate problems: one
> > > > > is to define which nodes can be used as demotion targets (this patch
> > > > > set), and the other is how to initialize the per-node demotion path
> > > > > (node_demotion[]).  We don't have to solve both problems at the same
> > > > > time.
> > > > > 
> > > > > If we decide to go with a per-node demotion path customization
> > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > > > > is a single global control to turn off all demotion targets (for the
> > > > > machines that don't use memory-only nodes for demotion).
> > > > > 
> > > > 
> > > > There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
> > > > interface to enable reclaim migration"), a sysfs interface
> > > > 
> > > >         /sys/kernel/mm/numa/demotion_enabled
> > > > 
> > > > is added to turn off all demotion targets.
> > > 
> > > IIUC, this sysfs interface only turns off demotion-in-reclaim.  It
> > > will be even cleaner if we have an easy way to clear node_demotion[]
> > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not
> > > init scripts) can know that the machine doesn't even have memory
> > > tiering hardware enabled.
> > > 
> > 
> > What is the difference?  Now we have no interface to show demotion
> > targets of a node.  That is in-kernel only.  What is memory tiering
> > hardware?  The Optane PMEM?  Some information for it is available via
> > ACPI HMAT table.
> > 
> > Except demotion-in-reclaim, what else do you care about?
> 
> There is a difference: one is to indicate the availability of the
> memory tiering hardware and the other is to indicate whether
> transparent kernel-driven demotion from the reclaim path is activated.
> With /sys/devices/system/node/demote_targets or the per-node demotion
> target interface, the userspace can figure out the memory tiering
> topology abstracted by the kernel.  It is possible to use
> application-guided demotion without having to enable reclaim-based
> demotion in the kernel.  Logically it is also cleaner to me to
> decouple the tiering node representation from the actual demotion
> mechanism enablement.

I am confused here.  It appears that you need a way to expose the
automatically generated demotion order from the kernel to user space.
We can talk about that if you really need it.

But [2-5/5] of this patchset is an interface for user space to override
the automatically generated demotion order in the kernel.

Best Regards,
Huang, Ying
Wei Xu April 22, 2022, 4:46 a.m. UTC | #16
On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote:
> > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote:
> > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com
> > > > <ying.huang@intel.com> wrote:
> > > > >
> > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> > > > > > <ying.huang@intel.com> wrote:
> > > > > > >
> > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > > > > > Current implementation to find the demotion targets works
> > > > > > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > > > > > right choices as demotion targets.
> > > > > > > > > > >
> > > > > > > > > > > This patch series introduces the new node state
> > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > > > > > targets, support is also added to set the demotion target
> > > > > > > > > > > list from user space so that default behavior can be overridden.
> > > > > > > > > >
> > > > > > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > > > > > problems.  For example, for system as follows,
> > > > > > > > > >
> > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > > > > > node 0,
> > > > > > > > > >
> > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > node 0 size: n MB
> > > > > > > > > > node 0 free: n MB
> > > > > > > > > > node 1 cpus:
> > > > > > > > > > node 1 size: n MB
> > > > > > > > > > node 1 free: n MB
> > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > node 2 size: n MB
> > > > > > > > > > node 2 free: n MB
> > > > > > > > > > node distances:
> > > > > > > > > > node   0   1   2
> > > > > > > > > >   0:  10  40  20
> > > > > > > > > >   1:  40  10  80
> > > > > > > > > >   2:  20  80  10
> > > > > > > > > >
> > > > > > > > > > Demotion order 1:
> > > > > > > > > >
> > > > > > > > > > node    demotion_target
> > > > > > > > > >  0              1
> > > > > > > > > >  1              X
> > > > > > > > > >  2              X
> > > > > > > > > >
> > > > > > > > > > Demotion order 2:
> > > > > > > > > >
> > > > > > > > > > node    demotion_target
> > > > > > > > > >  0              1
> > > > > > > > > >  1              X
> > > > > > > > > >  2              1
> > > > > > > > > >
> > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > > > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > > > > > space overridden.
> > > > > > > > > >
> > > > > > > > > > I don't know how to implement this via your proposed user space
> > > > > > > > > > interface.  How about the following user space interface?
> > > > > > > > > >
> > > > > > > > > > 1. Add a file "demotion_order_override" in
> > > > > > > > > >         /sys/devices/system/node/
> > > > > > > > > >
> > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > > > > > overridden; "0" is output if not.
> > > > > > > > > >
> > > > > > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > > > > > >
> > > > > > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > > > > > >         /sys/devices/system/node/nodeX/
> > > > > > > > > >
> > > > > > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > > > > > >
> > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > > > > > > will become the overridden mode.
> > > > > > > > >
> > > > > > > > > TBH I don't think having override demotion targets in userspace is
> > > > > > > > > quite useful in real life for now (it might become useful in the
> > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > > > > > > machines, which may come from different vendors, have different
> > > > > > > > > generations of hardware, have different versions of firmware, it would
> > > > > > > > > be a nightmare for the users to configure the demotion targets
> > > > > > > > > properly. So it would be great to have the kernel properly configure
> > > > > > > > > it *without* intervening from the users.
> > > > > > > > >
> > > > > > > > > So we should pick up a proper default policy and stick with that
> > > > > > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > > > > > this is also the current implementation.
> > > > > > > > >
> > > > > > > >
> > > > > > > > This is reasonable.  I agree that with a decent default policy,
> > > > > > > >
> > > > > > >
> > > > > > > I agree that a decent default policy is important.  As that was enhanced
> > > > > > > in [1/5] of this patchset.
> > > > > > >
> > > > > > > > the
> > > > > > > > overriding of per-node demotion targets can be deferred.  The most
> > > > > > > > important problem here is that we should allow the configurations
> > > > > > > > where memory-only nodes are not used as demotion targets, which this
> > > > > > > > patch set has already addressed.
> > > > > > >
> > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset?
> > > > > >
> > > > > > Yes.
> > > > > >
> > > > > > > IMHO, if we want to add a user space interface, I think that it should
> > > > > > > be powerful enough to address all existing issues and some potential
> > > > > > > future issues, so that it can be stable.  I don't think it's a good idea
> > > > > > > to define a partial user space interface that works only for a specific
> > > > > > > use case and cannot be extended for other use cases.
> > > > > >
> > > > > > I actually think that they can be viewed as two separate problems: one
> > > > > > is to define which nodes can be used as demotion targets (this patch
> > > > > > set), and the other is how to initialize the per-node demotion path
> > > > > > (node_demotion[]).  We don't have to solve both problems at the same
> > > > > > time.
> > > > > >
> > > > > > If we decide to go with a per-node demotion path customization
> > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > > > > > is a single global control to turn off all demotion targets (for the
> > > > > > machines that don't use memory-only nodes for demotion).
> > > > > >
> > > > >
> > > > > There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
> > > > > interface to enable reclaim migration"), a sysfs interface
> > > > >
> > > > >         /sys/kernel/mm/numa/demotion_enabled
> > > > >
> > > > > is added to turn off all demotion targets.
> > > >
> > > > IIUC, this sysfs interface only turns off demotion-in-reclaim.  It
> > > > will be even cleaner if we have an easy way to clear node_demotion[]
> > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not
> > > > init scripts) can know that the machine doesn't even have memory
> > > > tiering hardware enabled.
> > > >
> > >
> > > What is the difference?  Now we have no interface to show demotion
> > > targets of a node.  That is in-kernel only.  What is memory tiering
> > > hardware?  The Optane PMEM?  Some information for it is available via
> > > ACPI HMAT table.
> > >
> > > Except demotion-in-reclaim, what else do you care about?
> >
> > There is a difference: one is to indicate the availability of the
> > memory tiering hardware and the other is to indicate whether
> > transparent kernel-driven demotion from the reclaim path is activated.
> > With /sys/devices/system/node/demote_targets or the per-node demotion
> > target interface, the userspace can figure out the memory tiering
> > topology abstracted by the kernel.  It is possible to use
> > application-guided demotion without having to enable reclaim-based
> > demotion in the kernel.  Logically it is also cleaner to me to
> > decouple the tiering node representation from the actual demotion
> > mechanism enablement.
>
> I am confused here.  It appears that you need a way to expose the
> automatic generated demotion order from kernel to user space interface.
> We can talk about that if you really need it.
>
> But [2-5/5] of this patchset is to override the automatic generated
> demotion order from user space to kernel interface.

As a side effect of allowing user space to override the default set of
demotion target nodes, this patchset also provides a sysfs interface
that lets userspace read which nodes are currently designated as
demotion targets.

The initialization of demotion targets is expected to complete during
boot (either by the kernel or via an init script).  After that,
userspace processes (e.g. a proactive tiering daemon or tiering-aware
applications) can query this sysfs interface to learn whether any
tiering nodes are present and act accordingly.

It would be even better to expose the per-node demotion order
(node_demotion[]) via the sysfs interface (e.g.
/sys/devices/system/node/nodeX/demotion_targets as you have
suggested). It can be read-only until there are good use cases to
require overriding the per-node demotion order.
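
A read-only per-node file along those lines could be quite small.  As a
hedged sketch for drivers/base/node.c, reusing the existing
next_demotion_node() helper from mm/migrate.c and showing only a single
target (with the multi-target order from this series, the show routine
would print a node list instead):

static ssize_t demotion_targets_show(struct device *dev,
                                     struct device_attribute *attr,
                                     char *buf)
{
        int target = next_demotion_node(dev->id);

        /* Print just a newline when the node has no demotion target. */
        if (target == NUMA_NO_NODE)
                return sysfs_emit(buf, "\n");

        return sysfs_emit(buf, "%d\n", target);
}
static DEVICE_ATTR_RO(demotion_targets);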
Huang, Ying April 22, 2022, 5:40 a.m. UTC | #17
On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote:
> On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote:
> > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > > 
> > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote:
> > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com
> > > > > <ying.huang@intel.com> wrote:
> > > > > > 
> > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > 
> > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > > > > > > Current implementation to find the demotion targets works
> > > > > > > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > > > > > > right choices as demotion targets.
> > > > > > > > > > > > 
> > > > > > > > > > > > This patch series introduces the new node state
> > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > > > > > > targets, support is also added to set the demotion target
> > > > > > > > > > > > list from user space so that default behavior can be overridden.
> > > > > > > > > > > 
> > > > > > > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > > > > > > problems.  For example, for system as follows,
> > > > > > > > > > > 
> > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > > > > > > node 0,
> > > > > > > > > > > 
> > > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > > node 0 size: n MB
> > > > > > > > > > > node 0 free: n MB
> > > > > > > > > > > node 1 cpus:
> > > > > > > > > > > node 1 size: n MB
> > > > > > > > > > > node 1 free: n MB
> > > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > > node 2 size: n MB
> > > > > > > > > > > node 2 free: n MB
> > > > > > > > > > > node distances:
> > > > > > > > > > > node   0   1   2
> > > > > > > > > > >   0:  10  40  20
> > > > > > > > > > >   1:  40  10  80
> > > > > > > > > > >   2:  20  80  10
> > > > > > > > > > > 
> > > > > > > > > > > Demotion order 1:
> > > > > > > > > > > 
> > > > > > > > > > > node    demotion_target
> > > > > > > > > > >  0              1
> > > > > > > > > > >  1              X
> > > > > > > > > > >  2              X
> > > > > > > > > > > 
> > > > > > > > > > > Demotion order 2:
> > > > > > > > > > > 
> > > > > > > > > > > node    demotion_target
> > > > > > > > > > >  0              1
> > > > > > > > > > >  1              X
> > > > > > > > > > >  2              1
> > > > > > > > > > > 
> > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > > > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > > > > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > > > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > > > > > > space overridden.
> > > > > > > > > > > 
> > > > > > > > > > > I don't know how to implement this via your proposed user space
> > > > > > > > > > > interface.  How about the following user space interface?
> > > > > > > > > > > 
> > > > > > > > > > > 1. Add a file "demotion_order_override" in
> > > > > > > > > > >         /sys/devices/system/node/
> > > > > > > > > > > 
> > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > > > > > > overridden; "0" is output if not.
> > > > > > > > > > > 
> > > > > > > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > > > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > > > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > > > > > > > 
> > > > > > > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > > > > > > >         /sys/devices/system/node/nodeX/
> > > > > > > > > > > 
> > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > > > > > > > 
> > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > > > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > > > > > > > will become the overridden mode.
> > > > > > > > > > 
> > > > > > > > > > TBH I don't think having override demotion targets in userspace is
> > > > > > > > > > quite useful in real life for now (it might become useful in the
> > > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > > > > > > > machines, which may come from different vendors, have different
> > > > > > > > > > generations of hardware, have different versions of firmware, it would
> > > > > > > > > > be a nightmare for the users to configure the demotion targets
> > > > > > > > > > properly. So it would be great to have the kernel properly configure
> > > > > > > > > > it *without* intervening from the users.
> > > > > > > > > > 
> > > > > > > > > > So we should pick up a proper default policy and stick with that
> > > > > > > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > > > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > > > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > > > > > > this is also the current implementation.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > This is reasonable.  I agree that with a decent default policy,
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I agree that a decent default policy is important.  As that was enhanced
> > > > > > > > in [1/5] of this patchset.
> > > > > > > > 
> > > > > > > > > the
> > > > > > > > > overriding of per-node demotion targets can be deferred.  The most
> > > > > > > > > important problem here is that we should allow the configurations
> > > > > > > > > where memory-only nodes are not used as demotion targets, which this
> > > > > > > > > patch set has already addressed.
> > > > > > > > 
> > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset?
> > > > > > > 
> > > > > > > Yes.
> > > > > > > 
> > > > > > > > IMHO, if we want to add a user space interface, I think that it should
> > > > > > > > be powerful enough to address all existing issues and some potential
> > > > > > > > future issues, so that it can be stable.  I don't think it's a good idea
> > > > > > > > to define a partial user space interface that works only for a specific
> > > > > > > > use case and cannot be extended for other use cases.
> > > > > > > 
> > > > > > > I actually think that they can be viewed as two separate problems: one
> > > > > > > is to define which nodes can be used as demotion targets (this patch
> > > > > > > set), and the other is how to initialize the per-node demotion path
> > > > > > > (node_demotion[]).  We don't have to solve both problems at the same
> > > > > > > time.
> > > > > > > 
> > > > > > > If we decide to go with a per-node demotion path customization
> > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > > > > > > is a single global control to turn off all demotion targets (for the
> > > > > > > machines that don't use memory-only nodes for demotion).
> > > > > > > 
> > > > > > 
> > > > > > There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
> > > > > > interface to enable reclaim migration"), a sysfs interface
> > > > > > 
> > > > > >         /sys/kernel/mm/numa/demotion_enabled
> > > > > > 
> > > > > > is added to turn off all demotion targets.
> > > > > 
> > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim.  It
> > > > > will be even cleaner if we have an easy way to clear node_demotion[]
> > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not
> > > > > init scripts) can know that the machine doesn't even have memory
> > > > > tiering hardware enabled.
> > > > > 
> > > > 
> > > > What is the difference?  Now we have no interface to show demotion
> > > > targets of a node.  That is in-kernel only.  What is memory tiering
> > > > hardware?  The Optane PMEM?  Some information for it is available via
> > > > ACPI HMAT table.
> > > > 
> > > > Except demotion-in-reclaim, what else do you care about?
> > > 
> > > There is a difference: one is to indicate the availability of the
> > > memory tiering hardware and the other is to indicate whether
> > > transparent kernel-driven demotion from the reclaim path is activated.
> > > With /sys/devices/system/node/demote_targets or the per-node demotion
> > > target interface, the userspace can figure out the memory tiering
> > > topology abstracted by the kernel.  It is possible to use
> > > application-guided demotion without having to enable reclaim-based
> > > demotion in the kernel.  Logically it is also cleaner to me to
> > > decouple the tiering node representation from the actual demotion
> > > mechanism enablement.
> > 
> > I am confused here.  It appears that you need a way to expose the
> > automatic generated demotion order from kernel to user space interface.
> > We can talk about that if you really need it.
> > 
> > But [2-5/5] of this patchset is to override the automatic generated
> > demotion order from user space to kernel interface.
> 
> As a side effect of allowing user space to override the default set of
> demotion target nodes, it also provides a sysfs interface to allow
> userspace to read which nodes are currently being designated as
> demotion targets.
> 
> The initialization of demotion targets is expected to complete during
> boot (either by kernel or via an init script).  After that, the
> userspace processes (e.g. proactive tiering daemon or tiering-aware
> applications) can query this sysfs interface to know if there are any
> tiering nodes present and act accordingly.
> 
> It would be even better to expose the per-node demotion order
> (node_demotion[]) via the sysfs interface (e.g.
> /sys/devices/system/node/nodeX/demotion_targets as you have
> suggested). It can be read-only until there are good use cases to
> require overriding the per-node demotion order.

I am OK with exposing the system demotion order to user space, for
example via /sys/devices/system/node/nodeX/demotion_targets, but
read-only.
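
As a quick illustration, a minimal userspace sketch of reading such a
file could look like the following.  The path and the one-line format
are assumptions taken from this discussion, not a merged kernel ABI:

/*
 * Hypothetical example: read the proposed read-only
 * /sys/devices/system/node/nodeX/demotion_targets file.
 */
#include <stdio.h>
#include <string.h>

static int read_demotion_targets(int nid, char *buf, size_t len)
{
	char path[128];
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/node/node%d/demotion_targets", nid);
	f = fopen(path, "r");
	if (!f)
		return -1;	/* node absent, or kernel lacks the interface */
	if (!fgets(buf, len, f)) {
		fclose(f);
		return -1;
	}
	fclose(f);
	buf[strcspn(buf, "\n")] = '\0';	/* strip trailing newline */
	return 0;
}

int main(void)
{
	char targets[64];

	if (!read_demotion_targets(0, targets, sizeof(targets)))
		printf("node0 demotion targets: %s\n", targets);
	return 0;
}

An empty or missing file would then tell a post-boot agent that the
node has no demotion target configured.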

But if we want to add functionality to override the system demotion
order, we need to design the user space interface carefully, at least
after collecting all of the requirements raised so far.  I don't
think the interface proposed in [2-5/5] of this patchset is
sufficient or extensible enough.

Best Regards,
Huang, Ying
Wei Xu April 22, 2022, 6:13 a.m. UTC | #19
On Thu, Apr 21, 2022 at 10:40 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote:
> > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote:
> > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com
> > > > <ying.huang@intel.com> wrote:
> > > > >
> > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote:
> > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com
> > > > > > <ying.huang@intel.com> wrote:
> > > > > > >
> > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > > > > > > > Current implementation to find the demotion targets works
> > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > > > > > > > right choices as demotion targets.
> > > > > > > > > > > > >
> > > > > > > > > > > > > This patch series introduces the new node state
> > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > > > > > > > targets, support is also added to set the demotion target
> > > > > > > > > > > > > list from user space so that default behavior can be overridden.
> > > > > > > > > > > >
> > > > > > > > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > > > > > > > problems.  For example, for system as follows,
> > > > > > > > > > > >
> > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > > > > > > > node 0,
> > > > > > > > > > > >
> > > > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > > > node 0 size: n MB
> > > > > > > > > > > > node 0 free: n MB
> > > > > > > > > > > > node 1 cpus:
> > > > > > > > > > > > node 1 size: n MB
> > > > > > > > > > > > node 1 free: n MB
> > > > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > > > node 2 size: n MB
> > > > > > > > > > > > node 2 free: n MB
> > > > > > > > > > > > node distances:
> > > > > > > > > > > > node   0   1   2
> > > > > > > > > > > >   0:  10  40  20
> > > > > > > > > > > >   1:  40  10  80
> > > > > > > > > > > >   2:  20  80  10
> > > > > > > > > > > >
> > > > > > > > > > > > Demotion order 1:
> > > > > > > > > > > >
> > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > >  0              1
> > > > > > > > > > > >  1              X
> > > > > > > > > > > >  2              X
> > > > > > > > > > > >
> > > > > > > > > > > > Demotion order 2:
> > > > > > > > > > > >
> > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > >  0              1
> > > > > > > > > > > >  1              X
> > > > > > > > > > > >  2              1
> > > > > > > > > > > >
> > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > > > > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > > > > > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > > > > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > > > > > > > space overridden.
> > > > > > > > > > > >
> > > > > > > > > > > > I don't know how to implement this via your proposed user space
> > > > > > > > > > > > interface.  How about the following user space interface?
> > > > > > > > > > > >
> > > > > > > > > > > > 1. Add a file "demotion_order_override" in
> > > > > > > > > > > >         /sys/devices/system/node/
> > > > > > > > > > > >
> > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > > > > > > > overridden; "0" is output if not.
> > > > > > > > > > > >
> > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > > > > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > > > > > > > >
> > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > > > > > > > >         /sys/devices/system/node/nodeX/
> > > > > > > > > > > >
> > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > > > > > > > >
> > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > > > > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > > > > > > > > will become the overridden mode.
> > > > > > > > > > >
> > > > > > > > > > > TBH I don't think having override demotion targets in userspace is
> > > > > > > > > > > quite useful in real life for now (it might become useful in the
> > > > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > > > > > > > > machines, which may come from different vendors, have different
> > > > > > > > > > > generations of hardware, have different versions of firmware, it would
> > > > > > > > > > > be a nightmare for the users to configure the demotion targets
> > > > > > > > > > > properly. So it would be great to have the kernel properly configure
> > > > > > > > > > > it *without* intervening from the users.
> > > > > > > > > > >
> > > > > > > > > > > So we should pick up a proper default policy and stick with that
> > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > > > > > > > this is also the current implementation.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > This is reasonable.  I agree that with a decent default policy,
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I agree that a decent default policy is important.  As that was enhanced
> > > > > > > > > in [1/5] of this patchset.
> > > > > > > > >
> > > > > > > > > > the
> > > > > > > > > > overriding of per-node demotion targets can be deferred.  The most
> > > > > > > > > > important problem here is that we should allow the configurations
> > > > > > > > > > where memory-only nodes are not used as demotion targets, which this
> > > > > > > > > > patch set has already addressed.
> > > > > > > > >
> > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset?
> > > > > > > >
> > > > > > > > Yes.
> > > > > > > >
> > > > > > > > > IMHO, if we want to add a user space interface, I think that it should
> > > > > > > > > be powerful enough to address all existing issues and some potential
> > > > > > > > > future issues, so that it can be stable.  I don't think it's a good idea
> > > > > > > > > to define a partial user space interface that works only for a specific
> > > > > > > > > use case and cannot be extended for other use cases.
> > > > > > > >
> > > > > > > > I actually think that they can be viewed as two separate problems: one
> > > > > > > > is to define which nodes can be used as demotion targets (this patch
> > > > > > > > set), and the other is how to initialize the per-node demotion path
> > > > > > > > (node_demotion[]).  We don't have to solve both problems at the same
> > > > > > > > time.
> > > > > > > >
> > > > > > > > If we decide to go with a per-node demotion path customization
> > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > > > > > > > is a single global control to turn off all demotion targets (for the
> > > > > > > > machines that don't use memory-only nodes for demotion).
> > > > > > > >
> > > > > > >
> > > > > > > There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
> > > > > > > interface to enable reclaim migration"), a sysfs interface
> > > > > > >
> > > > > > >         /sys/kernel/mm/numa/demotion_enabled
> > > > > > >
> > > > > > > is added to turn off all demotion targets.
> > > > > >
> > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim.  It
> > > > > > will be even cleaner if we have an easy way to clear node_demotion[]
> > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not
> > > > > > init scripts) can know that the machine doesn't even have memory
> > > > > > tiering hardware enabled.
> > > > > >
> > > > >
> > > > > What is the difference?  Now we have no interface to show demotion
> > > > > targets of a node.  That is in-kernel only.  What is memory tiering
> > > > > hardware?  The Optane PMEM?  Some information for it is available via
> > > > > ACPI HMAT table.
> > > > >
> > > > > Except demotion-in-reclaim, what else do you care about?
> > > >
> > > > There is a difference: one is to indicate the availability of the
> > > > memory tiering hardware and the other is to indicate whether
> > > > transparent kernel-driven demotion from the reclaim path is activated.
> > > > With /sys/devices/system/node/demote_targets or the per-node demotion
> > > > target interface, the userspace can figure out the memory tiering
> > > > topology abstracted by the kernel.  It is possible to use
> > > > application-guided demotion without having to enable reclaim-based
> > > > demotion in the kernel.  Logically it is also cleaner to me to
> > > > decouple the tiering node representation from the actual demotion
> > > > mechanism enablement.
> > >
> > > I am confused here.  It appears that you need a way to expose the
> > > automatic generated demotion order from kernel to user space interface.
> > > We can talk about that if you really need it.
> > >
> > > But [2-5/5] of this patchset is to override the automatic generated
> > > demotion order from user space to kernel interface.
> >
> > As a side effect of allowing user space to override the default set of
> > demotion target nodes, it also provides a sysfs interface to allow
> > userspace to read which nodes are currently being designated as
> > demotion targets.
> >
> > The initialization of demotion targets is expected to complete during
> > boot (either by kernel or via an init script).  After that, the
> > userspace processes (e.g. proactive tiering daemon or tiering-aware
> > applications) can query this sysfs interface to know if there are any
> > tiering nodes present and act accordingly.
> >
> > It would be even better to expose the per-node demotion order
> > (node_demotion[]) via the sysfs interface (e.g.
> > /sys/devices/system/node/nodeX/demotion_targets as you have
> > suggested). It can be read-only until there are good use cases to
> > require overriding the per-node demotion order.
>
> I am OK to expose the system demotion order to user space.  For example,
> via /sys/devices/system/node/nodeX/demotion_targets, but read-only.

Sounds good. We can send out a patch for such a read-only interface.
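
For reference, a rough sketch (not the actual patch) of what the
kernel side of such a read-only node attribute could look like.  It
assumes next_demotion_node() is reachable from drivers/base/node.c
and that a single next target per node is exposed; the real patch may
well print a node list instead:

#include <linux/device.h>
#include <linux/migrate.h>
#include <linux/numa.h>
#include <linux/sysfs.h>

static ssize_t demotion_targets_show(struct device *dev,
				     struct device_attribute *attr, char *buf)
{
	/* node devices store the NUMA node id in dev->id */
	int target = next_demotion_node(dev->id);

	if (target == NUMA_NO_NODE)
		return sysfs_emit(buf, "\n");
	return sysfs_emit(buf, "%d\n", target);
}
static DEVICE_ATTR_RO(demotion_targets);

The attribute would of course still have to be registered in the node
device's sysfs group; the fragment only shows the read path.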

> But if we want to add functionality to override system demotion order,
> we need to consider the user space interface carefully, at least after
> collecting all requirement so far.  I don't think the interface proposed
> in [2-5/5] of this patchset is sufficient or extensible enough.

The currently proposed interface should be sufficient to override
which nodes can serve as demotion targets.  I agree that it is not
sufficient if userspace wants to redefine the per-node demotion
targets; a suitable user space interface for that purpose needs to be
designed carefully.

I also agree that it is better to move patch 1/5 out of this patchset.

> Best Regards,
> Huang, Ying
>
>
>
Huang, Ying April 22, 2022, 6:21 a.m. UTC | #20
On Thu, 2022-04-21 at 23:13 -0700, Wei Xu wrote:
> On Thu, Apr 21, 2022 at 10:40 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote:
> > > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > > 
> > > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote:
> > > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com
> > > > > <ying.huang@intel.com> wrote:
> > > > > > 
> > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote:
> > > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com
> > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > 
> > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > 
> > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > > > > > > > 
> > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > > > > > > > > Current implementation to find the demotion targets works
> > > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > > > > > > > > right choices as demotion targets.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > This patch series introduces the new node state
> > > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > > > > > > > > targets, support is also added to set the demotion target
> > > > > > > > > > > > > > list from user space so that default behavior can be overridden.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > > > > > > > > problems.  For example, for system as follows,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > > > > > > > > node 0,
> > > > > > > > > > > > > 
> > > > > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > > > > node 0 size: n MB
> > > > > > > > > > > > > node 0 free: n MB
> > > > > > > > > > > > > node 1 cpus:
> > > > > > > > > > > > > node 1 size: n MB
> > > > > > > > > > > > > node 1 free: n MB
> > > > > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > > > > node 2 size: n MB
> > > > > > > > > > > > > node 2 free: n MB
> > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > node   0   1   2
> > > > > > > > > > > > >   0:  10  40  20
> > > > > > > > > > > > >   1:  40  10  80
> > > > > > > > > > > > >   2:  20  80  10
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Demotion order 1:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > > >  0              1
> > > > > > > > > > > > >  1              X
> > > > > > > > > > > > >  2              X
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Demotion order 2:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > > >  0              1
> > > > > > > > > > > > >  1              X
> > > > > > > > > > > > >  2              1
> > > > > > > > > > > > > 
> > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > > > > > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > > > > > > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > > > > > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > > > > > > > > space overridden.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > I don't know how to implement this via your proposed user space
> > > > > > > > > > > > > interface.  How about the following user space interface?
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 1. Add a file "demotion_order_override" in
> > > > > > > > > > > > >         /sys/devices/system/node/
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > > > > > > > > overridden; "0" is output if not.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > > > > > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > > > > > > > > >         /sys/devices/system/node/nodeX/
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > > > > > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > > > > > > > > > will become the overridden mode.
> > > > > > > > > > > > 
> > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is
> > > > > > > > > > > > quite useful in real life for now (it might become useful in the
> > > > > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > > > > > > > > > machines, which may come from different vendors, have different
> > > > > > > > > > > > generations of hardware, have different versions of firmware, it would
> > > > > > > > > > > > be a nightmare for the users to configure the demotion targets
> > > > > > > > > > > > properly. So it would be great to have the kernel properly configure
> > > > > > > > > > > > it *without* intervening from the users.
> > > > > > > > > > > > 
> > > > > > > > > > > > So we should pick up a proper default policy and stick with that
> > > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > > > > > > > > this is also the current implementation.
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > This is reasonable.  I agree that with a decent default policy,
> > > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > I agree that a decent default policy is important.  As that was enhanced
> > > > > > > > > > in [1/5] of this patchset.
> > > > > > > > > > 
> > > > > > > > > > > the
> > > > > > > > > > > overriding of per-node demotion targets can be deferred.  The most
> > > > > > > > > > > important problem here is that we should allow the configurations
> > > > > > > > > > > where memory-only nodes are not used as demotion targets, which this
> > > > > > > > > > > patch set has already addressed.
> > > > > > > > > > 
> > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset?
> > > > > > > > > 
> > > > > > > > > Yes.
> > > > > > > > > 
> > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should
> > > > > > > > > > be powerful enough to address all existing issues and some potential
> > > > > > > > > > future issues, so that it can be stable.  I don't think it's a good idea
> > > > > > > > > > to define a partial user space interface that works only for a specific
> > > > > > > > > > use case and cannot be extended for other use cases.
> > > > > > > > > 
> > > > > > > > > I actually think that they can be viewed as two separate problems: one
> > > > > > > > > is to define which nodes can be used as demotion targets (this patch
> > > > > > > > > set), and the other is how to initialize the per-node demotion path
> > > > > > > > > (node_demotion[]).  We don't have to solve both problems at the same
> > > > > > > > > time.
> > > > > > > > > 
> > > > > > > > > If we decide to go with a per-node demotion path customization
> > > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > > > > > > > > is a single global control to turn off all demotion targets (for the
> > > > > > > > > machines that don't use memory-only nodes for demotion).
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
> > > > > > > > interface to enable reclaim migration"), a sysfs interface
> > > > > > > > 
> > > > > > > >         /sys/kernel/mm/numa/demotion_enabled
> > > > > > > > 
> > > > > > > > is added to turn off all demotion targets.
> > > > > > > 
> > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim.  It
> > > > > > > will be even cleaner if we have an easy way to clear node_demotion[]
> > > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not
> > > > > > > init scripts) can know that the machine doesn't even have memory
> > > > > > > tiering hardware enabled.
> > > > > > > 
> > > > > > 
> > > > > > What is the difference?  Now we have no interface to show demotion
> > > > > > targets of a node.  That is in-kernel only.  What is memory tiering
> > > > > > hardware?  The Optane PMEM?  Some information for it is available via
> > > > > > ACPI HMAT table.
> > > > > > 
> > > > > > Except demotion-in-reclaim, what else do you care about?
> > > > > 
> > > > > There is a difference: one is to indicate the availability of the
> > > > > memory tiering hardware and the other is to indicate whether
> > > > > transparent kernel-driven demotion from the reclaim path is activated.
> > > > > With /sys/devices/system/node/demote_targets or the per-node demotion
> > > > > target interface, the userspace can figure out the memory tiering
> > > > > topology abstracted by the kernel.  It is possible to use
> > > > > application-guided demotion without having to enable reclaim-based
> > > > > demotion in the kernel.  Logically it is also cleaner to me to
> > > > > decouple the tiering node representation from the actual demotion
> > > > > mechanism enablement.
> > > > 
> > > > I am confused here.  It appears that you need a way to expose the
> > > > automatic generated demotion order from kernel to user space interface.
> > > > We can talk about that if you really need it.
> > > > 
> > > > But [2-5/5] of this patchset is to override the automatic generated
> > > > demotion order from user space to kernel interface.
> > > 
> > > As a side effect of allowing user space to override the default set of
> > > demotion target nodes, it also provides a sysfs interface to allow
> > > userspace to read which nodes are currently being designated as
> > > demotion targets.
> > > 
> > > The initialization of demotion targets is expected to complete during
> > > boot (either by kernel or via an init script).  After that, the
> > > userspace processes (e.g. proactive tiering daemon or tiering-aware
> > > applications) can query this sysfs interface to know if there are any
> > > tiering nodes present and act accordingly.
> > > 
> > > It would be even better to expose the per-node demotion order
> > > (node_demotion[]) via the sysfs interface (e.g.
> > > /sys/devices/system/node/nodeX/demotion_targets as you have
> > > suggested). It can be read-only until there are good use cases to
> > > require overriding the per-node demotion order.
> > 
> > I am OK to expose the system demotion order to user space.  For example,
> > via /sys/devices/system/node/nodeX/demotion_targets, but read-only.
> 
> Sounds good. We can send out a patch for such a read-only interface.
> 
> > But if we want to add functionality to override system demotion order,
> > we need to consider the user space interface carefully, at least after
> > collecting all requirement so far.  I don't think the interface proposed
> > in [2-5/5] of this patchset is sufficient or extensible enough.
> 
> The current proposed interface should be sufficient to override which
> nodes can serve as demotion targets.  I agree that it is not
> sufficient if userspace wants to redefine the per-node demotion
> targets and a suitable user space interface for that purpose needs to
> be designed carefully.
> 

IMHO, it's better to define both together.  That is, collect all of
the requirements and design the interface carefully, keeping
extensibility in mind.  If the timing isn't right yet, we can defer
the work until more requirements have been collected.  That's not
urgent even for the authors' system, because they can simply leave
demotion-in-reclaim disabled.

Best Regards,
Huang, Ying

> I also agree that it is better to move out patch 1/5 from this patchset.
> 
> > Best Regards,
> > Huang, Ying
> > 
> > 
> >
Jagdish Gediya April 22, 2022, 11 a.m. UTC | #21
On Fri, Apr 22, 2022 at 02:21:47PM +0800, ying.huang@intel.com wrote:
> On Thu, 2022-04-21 at 23:13 -0700, Wei Xu wrote:
> > On Thu, Apr 21, 2022 at 10:40 PM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > > 
> > > On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote:
> > > > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com
> > > > <ying.huang@intel.com> wrote:
> > > > > 
> > > > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote:
> > > > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com
> > > > > > <ying.huang@intel.com> wrote:
> > > > > > > 
> > > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote:
> > > > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com
> > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > 
> > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > > > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > > > > > > > > 
> > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > > > > > > > > > Current implementation to find the demotion targets works
> > > > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > > > > > > > > > right choices as demotion targets.
> > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > This patch series introduces the new node state
> > > > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > > > > > > > > > targets, support is also added to set the demotion target
> > > > > > > > > > > > > > > list from user space so that default behavior can be overridden.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > > > > > > > > > problems.  For example, for system as follows,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > > > > > > > > > node 0,
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > > > > > node 0 size: n MB
> > > > > > > > > > > > > > node 0 free: n MB
> > > > > > > > > > > > > > node 1 cpus:
> > > > > > > > > > > > > > node 1 size: n MB
> > > > > > > > > > > > > > node 1 free: n MB
> > > > > > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > > > > > node 2 size: n MB
> > > > > > > > > > > > > > node 2 free: n MB
> > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > node   0   1   2
> > > > > > > > > > > > > >   0:  10  40  20
> > > > > > > > > > > > > >   1:  40  10  80
> > > > > > > > > > > > > >   2:  20  80  10
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Demotion order 1:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > > > >  0              1
> > > > > > > > > > > > > >  1              X
> > > > > > > > > > > > > >  2              X
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > Demotion order 2:
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > > > >  0              1
> > > > > > > > > > > > > >  1              X
> > > > > > > > > > > > > >  2              1
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > > > > > > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > > > > > > > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > > > > > > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > > > > > > > > > space overridden.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > I don't know how to implement this via your proposed user space
> > > > > > > > > > > > > > interface.  How about the following user space interface?
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in
> > > > > > > > > > > > > >         /sys/devices/system/node/
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > > > > > > > > > overridden; "0" is output if not.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > > > > > > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > > > > > > > > > >         /sys/devices/system/node/nodeX/
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > > > > > > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > > > > > > > > > > will become the overridden mode.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is
> > > > > > > > > > > > > quite useful in real life for now (it might become useful in the
> > > > > > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > > > > > > > > > > machines, which may come from different vendors, have different
> > > > > > > > > > > > > generations of hardware, have different versions of firmware, it would
> > > > > > > > > > > > > be a nightmare for the users to configure the demotion targets
> > > > > > > > > > > > > properly. So it would be great to have the kernel properly configure
> > > > > > > > > > > > > it *without* intervening from the users.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > So we should pick up a proper default policy and stick with that
> > > > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > > > > > > > > > this is also the current implementation.
> > > > > > > > > > > > > 
> > > > > > > > > > > > 
> > > > > > > > > > > > This is reasonable.  I agree that with a decent default policy,
> > > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > I agree that a decent default policy is important.  As that was enhanced
> > > > > > > > > > > in [1/5] of this patchset.
> > > > > > > > > > > 
> > > > > > > > > > > > the
> > > > > > > > > > > > overriding of per-node demotion targets can be deferred.  The most
> > > > > > > > > > > > important problem here is that we should allow the configurations
> > > > > > > > > > > > where memory-only nodes are not used as demotion targets, which this
> > > > > > > > > > > > patch set has already addressed.
> > > > > > > > > > > 
> > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset?
> > > > > > > > > > 
> > > > > > > > > > Yes.
> > > > > > > > > > 
> > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should
> > > > > > > > > > > be powerful enough to address all existing issues and some potential
> > > > > > > > > > > future issues, so that it can be stable.  I don't think it's a good idea
> > > > > > > > > > > to define a partial user space interface that works only for a specific
> > > > > > > > > > > use case and cannot be extended for other use cases.
> > > > > > > > > > 
> > > > > > > > > > I actually think that they can be viewed as two separate problems: one
> > > > > > > > > > is to define which nodes can be used as demotion targets (this patch
> > > > > > > > > > set), and the other is how to initialize the per-node demotion path
> > > > > > > > > > (node_demotion[]).  We don't have to solve both problems at the same
> > > > > > > > > > time.
> > > > > > > > > > 
> > > > > > > > > > If we decide to go with a per-node demotion path customization
> > > > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > > > > > > > > > is a single global control to turn off all demotion targets (for the
> > > > > > > > > > machines that don't use memory-only nodes for demotion).
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
> > > > > > > > > interface to enable reclaim migration"), a sysfs interface
> > > > > > > > > 
> > > > > > > > >         /sys/kernel/mm/numa/demotion_enabled
> > > > > > > > > 
> > > > > > > > > is added to turn off all demotion targets.
> > > > > > > > 
> > > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim.  It
> > > > > > > > will be even cleaner if we have an easy way to clear node_demotion[]
> > > > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not
> > > > > > > > init scripts) can know that the machine doesn't even have memory
> > > > > > > > tiering hardware enabled.
> > > > > > > > 
> > > > > > > 
> > > > > > > What is the difference?  Now we have no interface to show demotion
> > > > > > > targets of a node.  That is in-kernel only.  What is memory tiering
> > > > > > > hardware?  The Optane PMEM?  Some information for it is available via
> > > > > > > ACPI HMAT table.
> > > > > > > 
> > > > > > > Except demotion-in-reclaim, what else do you care about?
> > > > > > 
> > > > > > There is a difference: one is to indicate the availability of the
> > > > > > memory tiering hardware and the other is to indicate whether
> > > > > > transparent kernel-driven demotion from the reclaim path is activated.
> > > > > > With /sys/devices/system/node/demote_targets or the per-node demotion
> > > > > > target interface, the userspace can figure out the memory tiering
> > > > > > topology abstracted by the kernel.  It is possible to use
> > > > > > application-guided demotion without having to enable reclaim-based
> > > > > > demotion in the kernel.  Logically it is also cleaner to me to
> > > > > > decouple the tiering node representation from the actual demotion
> > > > > > mechanism enablement.
> > > > > 
> > > > > I am confused here.  It appears that you need a way to expose the
> > > > > automatic generated demotion order from kernel to user space interface.
> > > > > We can talk about that if you really need it.
> > > > > 
> > > > > But [2-5/5] of this patchset is to override the automatic generated
> > > > > demotion order from user space to kernel interface.
> > > > 
> > > > As a side effect of allowing user space to override the default set of
> > > > demotion target nodes, it also provides a sysfs interface to allow
> > > > userspace to read which nodes are currently being designated as
> > > > demotion targets.
> > > > 
> > > > The initialization of demotion targets is expected to complete during
> > > > boot (either by kernel or via an init script).  After that, the
> > > > userspace processes (e.g. proactive tiering daemon or tiering-aware
> > > > applications) can query this sysfs interface to know if there are any
> > > > tiering nodes present and act accordingly.
> > > > 
> > > > It would be even better to expose the per-node demotion order
> > > > (node_demotion[]) via the sysfs interface (e.g.
> > > > /sys/devices/system/node/nodeX/demotion_targets as you have
> > > > suggested). It can be read-only until there are good use cases to
> > > > require overriding the per-node demotion order.
> > > 
> > > I am OK to expose the system demotion order to user space.  For example,
> > > via /sys/devices/system/node/nodeX/demotion_targets, but read-only.
> > 
> > Sounds good. We can send out a patch for such a read-only interface.
> > 
> > > But if we want to add functionality to override system demotion order,
> > > we need to consider the user space interface carefully, at least after
> > > collecting all requirement so far.  I don't think the interface proposed
> > > in [2-5/5] of this patchset is sufficient or extensible enough.
> > 
> > The current proposed interface should be sufficient to override which
> > nodes can serve as demotion targets.  I agree that it is not
> > sufficient if userspace wants to redefine the per-node demotion
> > targets and a suitable user space interface for that purpose needs to
> > be designed carefully.
> > 
> 
> IMHO, it's better to define both together.  That is, collect all
> requirement, and design it carefully, keeping extensible in mind.  If
> it's not the good timing yet, we can defer it to collect more
> requirement.  That's not urgent even for authors' system, because they
> can just don't enable demotion-in-reclaim.
> 
> Best Regards,
> Huang, Ying

I think it is necessary to have either a per-node demotion targets
configuration or the user space interface supported by this patch
series.  As we don't have a clear consensus on how the user interface
should look, we can defer the per-node demotion target interface
until a real need arises.

The current patch series sets N_DEMOTION_TARGETS from the dax kmem
driver; it is possible that some memory node desired as a demotion
target is not detected in the system via the dax kmem probe path.

It is also possible that some of the dax devices are not preferred
as demotion targets, e.g. HBM; for such devices, the node shouldn't
be set in N_DEMOTION_TARGETS.  In the future, support should be added
to distinguish such dax devices and avoid marking them as
N_DEMOTION_TARGETS from the kernel, but for now this user space
interface will be useful to keep such devices out of the demotion
targets.

We can add a read-only interface to view per-node demotion targets
at /sys/devices/system/node/nodeX/demotion_targets, remove the
duplicated /sys/kernel/mm/numa/demotion_target interface, and instead
make /sys/devices/system/node/demotion_targets writable.
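
To make the intended usage concrete, here is a hedged userspace
sketch against that proposal.  Both the writable
/sys/devices/system/node/demotion_targets path and the node-list
write format ("2-3") are assumptions from this thread, not an
existing kernel ABI:

/*
 * Hypothetical example: an init script or admin tool overriding the
 * set of demotion target nodes through the proposed writable
 * /sys/devices/system/node/demotion_targets file.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int set_demotion_targets(const char *nodelist)
{
	int fd = open("/sys/devices/system/node/demotion_targets", O_WRONLY);
	ssize_t ret;

	if (fd < 0)
		return -1;	/* kernel without the proposed interface */
	ret = write(fd, nodelist, strlen(nodelist));
	close(fd);
	return ret < 0 ? -1 : 0;
}

int main(void)
{
	/* e.g. designate slow-memory nodes 2-3 as the only demotion targets */
	if (set_demotion_targets("2-3\n"))
		perror("demotion_targets");
	return 0;
}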

Huang, Wei, Yang,
What do you suggest?
Wei Xu April 22, 2022, 4:43 p.m. UTC | #22
On Fri, Apr 22, 2022 at 4:00 AM Jagdish Gediya <jvgediya@linux.ibm.com> wrote:
>
> On Fri, Apr 22, 2022 at 02:21:47PM +0800, ying.huang@intel.com wrote:
> > On Thu, 2022-04-21 at 23:13 -0700, Wei Xu wrote:
> > > On Thu, Apr 21, 2022 at 10:40 PM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > >
> > > > On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote:
> > > > > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com
> > > > > <ying.huang@intel.com> wrote:
> > > > > >
> > > > > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote:
> > > > > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com
> > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote:
> > > > > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com
> > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > > > > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> > > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > > > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > > > > > > > > > > Current implementation to find the demotion targets works
> > > > > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > > > > > > > > > > right choices as demotion targets.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patch series introduces the new node state
> > > > > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > > > > > > > > > > targets, support is also added to set the demotion target
> > > > > > > > > > > > > > > > list from user space so that default behavior can be overridden.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > > > > > > > > > > problems.  For example, for system as follows,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > > > > > > > > > > node 0,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > > > > > > node 0 size: n MB
> > > > > > > > > > > > > > > node 0 free: n MB
> > > > > > > > > > > > > > > node 1 cpus:
> > > > > > > > > > > > > > > node 1 size: n MB
> > > > > > > > > > > > > > > node 1 free: n MB
> > > > > > > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > > > > > > node 2 size: n MB
> > > > > > > > > > > > > > > node 2 free: n MB
> > > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > > node   0   1   2
> > > > > > > > > > > > > > >   0:  10  40  20
> > > > > > > > > > > > > > >   1:  40  10  80
> > > > > > > > > > > > > > >   2:  20  80  10
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Demotion order 1:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > > > > >  0              1
> > > > > > > > > > > > > > >  1              X
> > > > > > > > > > > > > > >  2              X
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Demotion order 2:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > > > > >  0              1
> > > > > > > > > > > > > > >  1              X
> > > > > > > > > > > > > > >  2              1
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > > > > > > > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > > > > > > > > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > > > > > > > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > > > > > > > > > > space overridden.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space
> > > > > > > > > > > > > > > interface.  How about the following user space interface?
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in
> > > > > > > > > > > > > > >         /sys/devices/system/node/
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > > > > > > > > > > overridden; "0" is output if not.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > > > > > > > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > > > > > > > > > > >         /sys/devices/system/node/nodeX/
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > > > > > > > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > > > > > > > > > > > will become the overridden mode.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is
> > > > > > > > > > > > > > quite useful in real life for now (it might become useful in the
> > > > > > > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > > > > > > > > > > > machines, which may come from different vendors, have different
> > > > > > > > > > > > > > generations of hardware, have different versions of firmware, it would
> > > > > > > > > > > > > > be a nightmare for the users to configure the demotion targets
> > > > > > > > > > > > > > properly. So it would be great to have the kernel properly configure
> > > > > > > > > > > > > > it *without* intervening from the users.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > So we should pick up a proper default policy and stick with that
> > > > > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > > > > > > > > > > this is also the current implementation.
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > This is reasonable.  I agree that with a decent default policy,
> > > > > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > I agree that a decent default policy is important.  As that was enhanced
> > > > > > > > > > > > in [1/5] of this patchset.
> > > > > > > > > > > >
> > > > > > > > > > > > > the
> > > > > > > > > > > > > overriding of per-node demotion targets can be deferred.  The most
> > > > > > > > > > > > > important problem here is that we should allow the configurations
> > > > > > > > > > > > > where memory-only nodes are not used as demotion targets, which this
> > > > > > > > > > > > > patch set has already addressed.
> > > > > > > > > > > >
> > > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset?
> > > > > > > > > > >
> > > > > > > > > > > Yes.
> > > > > > > > > > >
> > > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should
> > > > > > > > > > > > be powerful enough to address all existing issues and some potential
> > > > > > > > > > > > future issues, so that it can be stable.  I don't think it's a good idea
> > > > > > > > > > > > to define a partial user space interface that works only for a specific
> > > > > > > > > > > > use case and cannot be extended for other use cases.
> > > > > > > > > > >
> > > > > > > > > > > I actually think that they can be viewed as two separate problems: one
> > > > > > > > > > > is to define which nodes can be used as demotion targets (this patch
> > > > > > > > > > > set), and the other is how to initialize the per-node demotion path
> > > > > > > > > > > (node_demotion[]).  We don't have to solve both problems at the same
> > > > > > > > > > > time.
> > > > > > > > > > >
> > > > > > > > > > > If we decide to go with a per-node demotion path customization
> > > > > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > > > > > > > > > > is a single global control to turn off all demotion targets (for the
> > > > > > > > > > > machines that don't use memory-only nodes for demotion).
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
> > > > > > > > > > interface to enable reclaim migration"), a sysfs interface
> > > > > > > > > >
> > > > > > > > > >         /sys/kernel/mm/numa/demotion_enabled
> > > > > > > > > >
> > > > > > > > > > is added to turn off all demotion targets.
> > > > > > > > >
> > > > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim.  It
> > > > > > > > > will be even cleaner if we have an easy way to clear node_demotion[]
> > > > > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not
> > > > > > > > > init scripts) can know that the machine doesn't even have memory
> > > > > > > > > tiering hardware enabled.
> > > > > > > > >
> > > > > > > >
> > > > > > > > What is the difference?  Now we have no interface to show demotion
> > > > > > > > targets of a node.  That is in-kernel only.  What is memory tiering
> > > > > > > > hardware?  The Optane PMEM?  Some information for it is available via
> > > > > > > > ACPI HMAT table.
> > > > > > > >
> > > > > > > > Except demotion-in-reclaim, what else do you care about?
> > > > > > >
> > > > > > > There is a difference: one is to indicate the availability of the
> > > > > > > memory tiering hardware and the other is to indicate whether
> > > > > > > transparent kernel-driven demotion from the reclaim path is activated.
> > > > > > > With /sys/devices/system/node/demote_targets or the per-node demotion
> > > > > > > target interface, the userspace can figure out the memory tiering
> > > > > > > topology abstracted by the kernel.  It is possible to use
> > > > > > > application-guided demotion without having to enable reclaim-based
> > > > > > > demotion in the kernel.  Logically it is also cleaner to me to
> > > > > > > decouple the tiering node representation from the actual demotion
> > > > > > > mechanism enablement.
> > > > > >
> > > > > > I am confused here.  It appears that you need a way to expose the
> > > > > > automatic generated demotion order from kernel to user space interface.
> > > > > > We can talk about that if you really need it.
> > > > > >
> > > > > > But [2-5/5] of this patchset is to override the automatic generated
> > > > > > demotion order from user space to kernel interface.
> > > > >
> > > > > As a side effect of allowing user space to override the default set of
> > > > > demotion target nodes, it also provides a sysfs interface to allow
> > > > > userspace to read which nodes are currently being designated as
> > > > > demotion targets.
> > > > >
> > > > > The initialization of demotion targets is expected to complete during
> > > > > boot (either by kernel or via an init script).  After that, the
> > > > > userspace processes (e.g. proactive tiering daemon or tiering-aware
> > > > > applications) can query this sysfs interface to know if there are any
> > > > > tiering nodes present and act accordingly.
> > > > >
> > > > > It would be even better to expose the per-node demotion order
> > > > > (node_demotion[]) via the sysfs interface (e.g.
> > > > > /sys/devices/system/node/nodeX/demotion_targets as you have
> > > > > suggested). It can be read-only until there are good use cases to
> > > > > require overriding the per-node demotion order.
> > > >
> > > > I am OK to expose the system demotion order to user space.  For example,
> > > > via /sys/devices/system/node/nodeX/demotion_targets, but read-only.
> > >
> > > Sounds good. We can send out a patch for such a read-only interface.
> > >
> > > > But if we want to add functionality to override system demotion order,
> > > > we need to consider the user space interface carefully, at least after
> > > > collecting all requirement so far.  I don't think the interface proposed
> > > > in [2-5/5] of this patchset is sufficient or extensible enough.
> > >
> > > The current proposed interface should be sufficient to override which
> > > nodes can serve as demotion targets.  I agree that it is not
> > > sufficient if userspace wants to redefine the per-node demotion
> > > targets and a suitable user space interface for that purpose needs to
> > > be designed carefully.
> > >
> >
> > IMHO, it's better to define both together.  That is, collect all
> > requirement, and design it carefully, keeping extensible in mind.  If
> > it's not the good timing yet, we can defer it to collect more
> > requirement.  That's not urgent even for authors' system, because they
> > can just don't enable demotion-in-reclaim.
> >
> > Best Regards,
> > Huang, Ying
>
> I think it is necessary to either have per node demotion targets
> configuration or the user space interface supported by this patch
> series. As we don't have clear consensus on how the user interface
> should look like, we can defer the per node demotion target set
> interface to future until the real need arises.
>
> Current patch series sets N_DEMOTION_TARGET from dax device kmem
> driver, it may be possible that some memory node desired as demotion
> target is not detected in the system from dax-device kmem probe path.
>
> It is also possible that some of the dax-devices are not preferred as
> demotion target e.g. HBM, for such devices, node shouldn't be set to
> N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> kernel, but for now this user space interface will be useful to avoid
> such devices as demotion targets.
>
> We can add read only interface to view per node demotion targets
> from /sys/devices/system/node/nodeX/demotion_targets, remove
> duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> make /sys/devices/system/node/demotion_targets writable.
>
> Huang, Wei, Yang,
> What do you suggest?

This sounds good to me.

I don't know of a clear use case where we want to set the per-node
demotion order from userspace.  In the long term, in my view, it would
be better if the per-node demotion order were still initialized only
by the kernel, just like the allocation zonelist, but with the help of
more hardware information (e.g. HMAT) when available.  Userspace can
still control which nodes can be used for demotion for a
process/cgroup through the typical NUMA interfaces (e.g. mbind,
cpuset.mems).
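
For instance, those existing interfaces look like this (a sketch only;
whether and how they constrain demotion is exactly what is being
discussed in this thread):

  # bind a workload's allocations to the DRAM nodes 0 and 2
  numactl --membind=0,2 ./workload

  # or restrict a cgroup v2 cpuset (assuming the cpuset controller is
  # enabled and the group "mygroup" exists)
  echo "0,2" > /sys/fs/cgroup/mygroup/cpuset.mems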

Wei
Yang Shi April 22, 2022, 5:29 p.m. UTC | #23
On Fri, Apr 22, 2022 at 9:43 AM Wei Xu <weixugc@google.com> wrote:
>
> On Fri, Apr 22, 2022 at 4:00 AM Jagdish Gediya <jvgediya@linux.ibm.com> wrote:
> >
> > On Fri, Apr 22, 2022 at 02:21:47PM +0800, ying.huang@intel.com wrote:
> > > On Thu, 2022-04-21 at 23:13 -0700, Wei Xu wrote:
> > > > On Thu, Apr 21, 2022 at 10:40 PM ying.huang@intel.com
> > > > <ying.huang@intel.com> wrote:
> > > > >
> > > > > On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote:
> > > > > > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com
> > > > > > <ying.huang@intel.com> wrote:
> > > > > > >
> > > > > > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote:
> > > > > > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com
> > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote:
> > > > > > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com
> > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote:
> > > > > > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com
> > > > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote:
> > > > > > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com
> > > > > > > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote:
> > > > > > > > > > > > > > > > > Current implementation to find the demotion targets works
> > > > > > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have
> > > > > > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the
> > > > > > > > > > > > > > > > > right choices as demotion targets.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > This patch series introduces the new node state
> > > > > > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which
> > > > > > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS]
> > > > > > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion
> > > > > > > > > > > > > > > > > targets, support is also added to set the demotion target
> > > > > > > > > > > > > > > > > list from user space so that default behavior can be overridden.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all
> > > > > > > > > > > > > > > > problems.  For example, for system as follows,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near
> > > > > > > > > > > > > > > > node 0,
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > > > > > > > node 0 size: n MB
> > > > > > > > > > > > > > > > node 0 free: n MB
> > > > > > > > > > > > > > > > node 1 cpus:
> > > > > > > > > > > > > > > > node 1 size: n MB
> > > > > > > > > > > > > > > > node 1 free: n MB
> > > > > > > > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > > > > > > > node 2 size: n MB
> > > > > > > > > > > > > > > > node 2 free: n MB
> > > > > > > > > > > > > > > > node distances:
> > > > > > > > > > > > > > > > node   0   1   2
> > > > > > > > > > > > > > > >   0:  10  40  20
> > > > > > > > > > > > > > > >   1:  40  10  80
> > > > > > > > > > > > > > > >   2:  20  80  10
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Demotion order 1:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > > > > > >  0              1
> > > > > > > > > > > > > > > >  1              X
> > > > > > > > > > > > > > > >  2              X
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Demotion order 2:
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > node    demotion_target
> > > > > > > > > > > > > > > >  0              1
> > > > > > > > > > > > > > > >  1              X
> > > > > > > > > > > > > > > >  2              1
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket
> > > > > > > > > > > > > > > > traffic.  While the demotion order 2 is preferred if we want to take
> > > > > > > > > > > > > > > > full advantage of the slow memory node.  We can take any choice as
> > > > > > > > > > > > > > > > automatic-generated order, while make the other choice possible via user
> > > > > > > > > > > > > > > > space overridden.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space
> > > > > > > > > > > > > > > > interface.  How about the following user space interface?
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in
> > > > > > > > > > > > > > > >         /sys/devices/system/node/
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been
> > > > > > > > > > > > > > > > overridden; "0" is output if not.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the
> > > > > > > > > > > > > > > > overridden mode.  When write "0", the demotion order of the system will
> > > > > > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in
> > > > > > > > > > > > > > > >         /sys/devices/system/node/nodeX/
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX
> > > > > > > > > > > > > > > > will be set to the written nodes.  And the demotion order of the system
> > > > > > > > > > > > > > > > will become the overridden mode.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is
> > > > > > > > > > > > > > > quite useful in real life for now (it might become useful in the
> > > > > > > > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of
> > > > > > > > > > > > > > > machines, which may come from different vendors, have different
> > > > > > > > > > > > > > > generations of hardware, have different versions of firmware, it would
> > > > > > > > > > > > > > > be a nightmare for the users to configure the demotion targets
> > > > > > > > > > > > > > > properly. So it would be great to have the kernel properly configure
> > > > > > > > > > > > > > > it *without* intervening from the users.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > So we should pick up a proper default policy and stick with that
> > > > > > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do
> > > > > > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having
> > > > > > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if
> > > > > > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think
> > > > > > > > > > > > > > > this is also the current implementation.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This is reasonable.  I agree that with a decent default policy,
> > > > > > > > > > > > > >
> > > > > > > > > > > > >
> > > > > > > > > > > > > I agree that a decent default policy is important.  As that was enhanced
> > > > > > > > > > > > > in [1/5] of this patchset.
> > > > > > > > > > > > >
> > > > > > > > > > > > > > the
> > > > > > > > > > > > > > overriding of per-node demotion targets can be deferred.  The most
> > > > > > > > > > > > > > important problem here is that we should allow the configurations
> > > > > > > > > > > > > > where memory-only nodes are not used as demotion targets, which this
> > > > > > > > > > > > > > patch set has already addressed.
> > > > > > > > > > > > >
> > > > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset?
> > > > > > > > > > > >
> > > > > > > > > > > > Yes.
> > > > > > > > > > > >
> > > > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should
> > > > > > > > > > > > > be powerful enough to address all existing issues and some potential
> > > > > > > > > > > > > future issues, so that it can be stable.  I don't think it's a good idea
> > > > > > > > > > > > > to define a partial user space interface that works only for a specific
> > > > > > > > > > > > > use case and cannot be extended for other use cases.
> > > > > > > > > > > >
> > > > > > > > > > > > I actually think that they can be viewed as two separate problems: one
> > > > > > > > > > > > is to define which nodes can be used as demotion targets (this patch
> > > > > > > > > > > > set), and the other is how to initialize the per-node demotion path
> > > > > > > > > > > > (node_demotion[]).  We don't have to solve both problems at the same
> > > > > > > > > > > > time.
> > > > > > > > > > > >
> > > > > > > > > > > > If we decide to go with a per-node demotion path customization
> > > > > > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there
> > > > > > > > > > > > is a single global control to turn off all demotion targets (for the
> > > > > > > > > > > > machines that don't use memory-only nodes for demotion).
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > There's one already.  In commit 20b51af15e01 ("mm/migrate: add sysfs
> > > > > > > > > > > interface to enable reclaim migration"), a sysfs interface
> > > > > > > > > > >
> > > > > > > > > > >         /sys/kernel/mm/numa/demotion_enabled
> > > > > > > > > > >
> > > > > > > > > > > is added to turn off all demotion targets.
> > > > > > > > > >
> > > > > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim.  It
> > > > > > > > > > will be even cleaner if we have an easy way to clear node_demotion[]
> > > > > > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not
> > > > > > > > > > init scripts) can know that the machine doesn't even have memory
> > > > > > > > > > tiering hardware enabled.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > What is the difference?  Now we have no interface to show demotion
> > > > > > > > > targets of a node.  That is in-kernel only.  What is memory tiering
> > > > > > > > > hardware?  The Optane PMEM?  Some information for it is available via
> > > > > > > > > ACPI HMAT table.
> > > > > > > > >
> > > > > > > > > Except demotion-in-reclaim, what else do you care about?
> > > > > > > >
> > > > > > > > There is a difference: one is to indicate the availability of the
> > > > > > > > memory tiering hardware and the other is to indicate whether
> > > > > > > > transparent kernel-driven demotion from the reclaim path is activated.
> > > > > > > > With /sys/devices/system/node/demote_targets or the per-node demotion
> > > > > > > > target interface, the userspace can figure out the memory tiering
> > > > > > > > topology abstracted by the kernel.  It is possible to use
> > > > > > > > application-guided demotion without having to enable reclaim-based
> > > > > > > > demotion in the kernel.  Logically it is also cleaner to me to
> > > > > > > > decouple the tiering node representation from the actual demotion
> > > > > > > > mechanism enablement.
> > > > > > >
> > > > > > > I am confused here.  It appears that you need a way to expose the
> > > > > > > automatic generated demotion order from kernel to user space interface.
> > > > > > > We can talk about that if you really need it.
> > > > > > >
> > > > > > > But [2-5/5] of this patchset is to override the automatic generated
> > > > > > > demotion order from user space to kernel interface.
> > > > > >
> > > > > > As a side effect of allowing user space to override the default set of
> > > > > > demotion target nodes, it also provides a sysfs interface to allow
> > > > > > userspace to read which nodes are currently being designated as
> > > > > > demotion targets.
> > > > > >
> > > > > > The initialization of demotion targets is expected to complete during
> > > > > > boot (either by kernel or via an init script).  After that, the
> > > > > > userspace processes (e.g. proactive tiering daemon or tiering-aware
> > > > > > applications) can query this sysfs interface to know if there are any
> > > > > > tiering nodes present and act accordingly.
> > > > > >
> > > > > > It would be even better to expose the per-node demotion order
> > > > > > (node_demotion[]) via the sysfs interface (e.g.
> > > > > > /sys/devices/system/node/nodeX/demotion_targets as you have
> > > > > > suggested). It can be read-only until there are good use cases to
> > > > > > require overriding the per-node demotion order.
> > > > >
> > > > > I am OK to expose the system demotion order to user space.  For example,
> > > > > via /sys/devices/system/node/nodeX/demotion_targets, but read-only.
> > > >
> > > > Sounds good. We can send out a patch for such a read-only interface.
> > > >
> > > > > But if we want to add functionality to override system demotion order,
> > > > > we need to consider the user space interface carefully, at least after
> > > > > collecting all requirement so far.  I don't think the interface proposed
> > > > > in [2-5/5] of this patchset is sufficient or extensible enough.
> > > >
> > > > The current proposed interface should be sufficient to override which
> > > > nodes can serve as demotion targets.  I agree that it is not
> > > > sufficient if userspace wants to redefine the per-node demotion
> > > > targets and a suitable user space interface for that purpose needs to
> > > > be designed carefully.
> > > >
> > >
> > > IMHO, it's better to define both together.  That is, collect all
> > > requirement, and design it carefully, keeping extensible in mind.  If
> > > it's not the good timing yet, we can defer it to collect more
> > > requirement.  That's not urgent even for authors' system, because they
> > > can just don't enable demotion-in-reclaim.
> > >
> > > Best Regards,
> > > Huang, Ying
> >
> > I think it is necessary to either have per node demotion targets
> > configuration or the user space interface supported by this patch
> > series. As we don't have clear consensus on how the user interface
> > should look like, we can defer the per node demotion target set
> > interface to future until the real need arises.
> >
> > Current patch series sets N_DEMOTION_TARGET from dax device kmem
> > driver, it may be possible that some memory node desired as demotion
> > target is not detected in the system from dax-device kmem probe path.
> >
> > It is also possible that some of the dax-devices are not preferred as
> > demotion target e.g. HBM, for such devices, node shouldn't be set to
> > N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> > such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> > kernel, but for now this user space interface will be useful to avoid
> > such devices as demotion targets.
> >
> > We can add read only interface to view per node demotion targets
> > from /sys/devices/system/node/nodeX/demotion_targets, remove
> > duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> > make /sys/devices/system/node/demotion_targets writable.
> >
> > Huang, Wei, Yang,
> > What do you suggest?
>
> This sounds good to me.
>
> I don't know a clear use case where we want to set per-node demotion
> order from the userspace.  In the long term, in my view, it would be
> better that per-node demotion order is still only initialized by the
> kernel, just like the allocation zonelist, but with the help of more
> hardware information (e.g. HMAT) when available.  Userspace can still
> control which nodes can be used for demotion on a process/cgroup
> through the typical NUMA interfaces (e.g. mbind, cpuset.mems).

+1

>
> Wei
Huang, Ying April 24, 2022, 3:02 a.m. UTC | #24
Hi, All,

On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:

[snip]

> I think it is necessary to either have per node demotion targets
> configuration or the user space interface supported by this patch
> series. As we don't have clear consensus on how the user interface
> should look like, we can defer the per node demotion target set
> interface to future until the real need arises.
> 
> Current patch series sets N_DEMOTION_TARGET from dax device kmem
> driver, it may be possible that some memory node desired as demotion
> target is not detected in the system from dax-device kmem probe path.
> 
> It is also possible that some of the dax-devices are not preferred as
> demotion target e.g. HBM, for such devices, node shouldn't be set to
> N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> kernel, but for now this user space interface will be useful to avoid
> such devices as demotion targets.
> 
> We can add read only interface to view per node demotion targets
> from /sys/devices/system/node/nodeX/demotion_targets, remove
> duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> make /sys/devices/system/node/demotion_targets writable.
> 
> Huang, Wei, Yang,
> What do you suggest?

We cannot remove a kernel ABI in practice.  So we need to get it right
the first time.  Let's try to collect some information for the kernel
ABI definition.

The below is just a starting point, please add your requirements.

1. Jagdish has some machines with DRAM-only NUMA nodes, but they don't
want to use those as demotion targets.  But I don't think this is an
issue in practice for now, because demote-in-reclaim is disabled by
default.

2. For machines with PMEM installed in only 1 of 2 sockets, for example,

Node 0 & 2 are cpu + dram nodes and node 1 are slow
memory node near node 0,

available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus:
node 1 size: n MB
node 1 free: n MB
node 2 cpus: 2 3
node 2 size: n MB
node 2 free: n MB
node distances:
node   0   1   2
  0:  10  40  20
  1:  40  10  80
  2:  20  80  10

We have 2 choices,

a)
node	demotion targets
0	1
2	1

b)
node	demotion targets
0	1
2	X

a) is good for taking advantage of the PMEM.  b) is good for reducing
cross-socket traffic.  Both are OK as the default configuration.  But
some users may prefer the other one.  So we need a user space ABI to
override the default configuration.
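
For example, with the writable per-node interface discussed later in
this thread (a sketch only; nothing is final yet), choosing between
the two could look like:

  # choice a): both CPU nodes demote to the PMEM node 1
  echo 1 > /sys/devices/system/node/node0/demotion_targets
  echo 1 > /sys/devices/system/node/node2/demotion_targets

  # choice b): node 2 keeps no demotion target
  echo 1 > /sys/devices/system/node/node0/demotion_targets
  echo > /sys/devices/system/node/node2/demotion_targets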

3. For machines with HBM (High Bandwidth Memory), as in

https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/

> [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41

Although HBM has better performance than DDR, in the ACPI SLIT its
distance to the CPU is longer.  We need to provide a way to fix this.
The user space ABI is one way.  The desired result would be to use the
local DDR as the demotion target of the local HBM.
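
For example (a sketch only, with hypothetical node numbering: nodes
0/1 are DDR and nodes 2/3 are the HBM nodes attached to them, assuming
a writable per-node interface):

  echo 0 > /sys/devices/system/node/node2/demotion_targets
  echo 1 > /sys/devices/system/node/node3/demotion_targets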

Best Regards,
Huang, Ying
Aneesh Kumar K.V April 25, 2022, 3:50 a.m. UTC | #25
"ying.huang@intel.com" <ying.huang@intel.com> writes:

> Hi, All,
>
> On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
>
> [snip]
>
>> I think it is necessary to either have per node demotion targets
>> configuration or the user space interface supported by this patch
>> series. As we don't have clear consensus on how the user interface
>> should look like, we can defer the per node demotion target set
>> interface to future until the real need arises.
>> 
>> Current patch series sets N_DEMOTION_TARGET from dax device kmem
>> driver, it may be possible that some memory node desired as demotion
>> target is not detected in the system from dax-device kmem probe path.
>> 
>> It is also possible that some of the dax-devices are not preferred as
>> demotion target e.g. HBM, for such devices, node shouldn't be set to
>> N_DEMOTION_TARGETS. In future, Support should be added to distinguish
>> such dax-devices and not mark them as N_DEMOTION_TARGETS from the
>> kernel, but for now this user space interface will be useful to avoid
>> such devices as demotion targets.
>> 
>> We can add read only interface to view per node demotion targets
>> from /sys/devices/system/node/nodeX/demotion_targets, remove
>> duplicated /sys/kernel/mm/numa/demotion_target interface and instead
>> make /sys/devices/system/node/demotion_targets writable.
>> 
>> Huang, Wei, Yang,
>> What do you suggest?
>
> We cannot remove a kernel ABI in practice.  So we need to make it right
> at the first time.  Let's try to collect some information for the kernel
> ABI definitation.
>
> The below is just a starting point, please add your requirements.
>
> 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
> want to use that as the demotion targets.  But I don't think this is a
> issue in practice for now, because demote-in-reclaim is disabled by
> default.

It is not just that the demotion can be disabled. We should be able to
use demotion on a system where we can find DRAM-only NUMA nodes. That
cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs
something similar to N_DEMOTION_TARGETS.

>
> 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
>
> Node 0 & 2 are cpu + dram nodes and node 1 are slow
> memory node near node 0,
>
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus:
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 2 3
> node 2 size: n MB
> node 2 free: n MB
> node distances:
> node   0   1   2
>   0:  10  40  20
>   1:  40  10  80
>   2:  20  80  10
>
> We have 2 choices,
>
> a)
> node	demotion targets
> 0	1
> 2	1

This is achieved by 

[PATCH v2 1/5] mm: demotion: Set demotion list differently

>
> b)
> node	demotion targets
> 0	1
> 2	X


>
> a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> traffic.  Both are OK as defualt configuration.  But some users may
> prefer the other one.  So we need a user space ABI to override the
> default configuration.
>
> 3. For machines with HBM (High Bandwidth Memory), as in
>
> https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
>
>> [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
>
> Although HBM has better performance than DDR, in ACPI SLIT, their
> distance to CPU is longer.  We need to provide a way to fix this.  The
> user space ABI is one way.  The desired result will be to use local DDR
> as demotion targets of local HBM.


IMHO the above (2b and 3) can be done using per-node demotion targets.
Below is what I think we could do with a single slow-memory NUMA node 4.

/sys/devices/system/node# cat node[0-4]/demotion_targets
4
4
4
4

/sys/devices/system/node# echo 1 > node1/demotion_targets 
bash: echo: write error: Invalid argument
/sys/devices/system/node# cat node[0-4]/demotion_targets
4
4
4
4

/sys/devices/system/node# echo 0 > node1/demotion_targets 
/sys/devices/system/node# cat node[0-4]/demotion_targets
4
0
4
4

/sys/devices/system/node# echo 1 > node0/demotion_targets 
bash: echo: write error: Invalid argument
/sys/devices/system/node# cat node[0-4]/demotion_targets
4
0
4
4

Disable demotion for a specific node.
/sys/devices/system/node# echo > node1/demotion_targets 
/sys/devices/system/node# cat node[0-4]/demotion_targets
4

4
4

Reset demotion to default
/sys/devices/system/node# echo -1 > node1/demotion_targets 
/sys/devices/system/node# cat node[0-4]/demotion_targets
4
4
4
4

When a specific device/NUMA node is used as a demotion target via the
user interface, it is taken out of the other NUMA nodes' target lists.
root@ubuntu-guest:/sys/devices/system/node# cat node[0-4]/demotion_targets
4
4
4
4

/sys/devices/system/node# echo 4 > node1/demotion_targets 
/sys/devices/system/node# cat node[0-4]/demotion_targets

4



If more than one node requires the same demotion target
/sys/devices/system/node# echo 4 > node0/demotion_targets 
/sys/devices/system/node# cat node[0-4]/demotion_targets
4
4



-aneesh
Huang, Ying April 25, 2022, 6:10 a.m. UTC | #26
On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote:
> "ying.huang@intel.com" <ying.huang@intel.com> writes:
> 
> > Hi, All,
> > 
> > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
> > 
> > [snip]
> > 
> > > I think it is necessary to either have per node demotion targets
> > > configuration or the user space interface supported by this patch
> > > series. As we don't have clear consensus on how the user interface
> > > should look like, we can defer the per node demotion target set
> > > interface to future until the real need arises.
> > > 
> > > Current patch series sets N_DEMOTION_TARGET from dax device kmem
> > > driver, it may be possible that some memory node desired as demotion
> > > target is not detected in the system from dax-device kmem probe path.
> > > 
> > > It is also possible that some of the dax-devices are not preferred as
> > > demotion target e.g. HBM, for such devices, node shouldn't be set to
> > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> > > kernel, but for now this user space interface will be useful to avoid
> > > such devices as demotion targets.
> > > 
> > > We can add read only interface to view per node demotion targets
> > > from /sys/devices/system/node/nodeX/demotion_targets, remove
> > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> > > make /sys/devices/system/node/demotion_targets writable.
> > > 
> > > Huang, Wei, Yang,
> > > What do you suggest?
> > 
> > We cannot remove a kernel ABI in practice.  So we need to make it right
> > at the first time.  Let's try to collect some information for the kernel
> > ABI definitation.
> > 
> > The below is just a starting point, please add your requirements.
> > 
> > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
> > want to use that as the demotion targets.  But I don't think this is a
> > issue in practice for now, because demote-in-reclaim is disabled by
> > default.
> 
> It is not just that the demotion can be disabled. We should be able to
> use demotion on a system where we can find DRAM only NUMA nodes. That
> cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs
> something similar to to N_DEMOTION_TARGETS
> 

Can you show the NUMA information of your machines with DRAM-only
nodes and PMEM nodes?  We can try to find the proper demotion order
for the system.  If you cannot show it, we can defer N_DEMOTION_TARGETS
until such a machine is available.

> > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > 
> > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > memory node near node 0,
> > 
> > available: 3 nodes (0-2)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus:
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus: 2 3
> > node 2 size: n MB
> > node 2 free: n MB
> > node distances:
> > node   0   1   2
> >   0:  10  40  20
> >   1:  40  10  80
> >   2:  20  80  10
> > 
> > We have 2 choices,
> > 
> > a)
> > node	demotion targets
> > 0	1
> > 2	1
> 
> This is achieved by 
> 
> [PATCH v2 1/5] mm: demotion: Set demotion list differently
> 
> > 
> > b)
> > node	demotion targets
> > 0	1
> > 2	X
> 
> 
> > 
> > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > traffic.  Both are OK as defualt configuration.  But some users may
> > prefer the other one.  So we need a user space ABI to override the
> > default configuration.
> > 
> > 3. For machines with HBM (High Bandwidth Memory), as in
> > 
> > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
> > 
> > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
> > 
> > Although HBM has better performance than DDR, in ACPI SLIT, their
> > distance to CPU is longer.  We need to provide a way to fix this.  The
> > user space ABI is one way.  The desired result will be to use local DDR
> > as demotion targets of local HBM.
> 
> 
> IMHO the above (2b and 3) can be done using per node demotion targets. Below is
> what I think we could do with a single slow memory NUMA node 4.

If we can use writable per-node demotion targets as ABI, then we don't
need N_DEMOTION_TARGETS.

> /sys/devices/system/node# cat node[0-4]/demotion_targets
> 4
> 4
> 4
> 4
> 
> /sys/devices/system/node# echo 1 > node1/demotion_targets 
> bash: echo: write error: Invalid argument
> /sys/devices/system/node# cat node[0-4]/demotion_targets
> 4
> 4
> 4
> 4
> 
> /sys/devices/system/node# echo 0 > node1/demotion_targets 
> /sys/devices/system/node# cat node[0-4]/demotion_targets
> 4
> 0
> 4
> 4
> 
> /sys/devices/system/node# echo 1 > node0/demotion_targets 
> bash: echo: write error: Invalid argument
> /sys/devices/system/node# cat node[0-4]/demotion_targets
> 4
> 0
> 4
> 4
> 
> Disable demotion for a specific node.
> /sys/devices/system/node# echo > node1/demotion_targets 
> /sys/devices/system/node# cat node[0-4]/demotion_targets
> 4
> 
> 4
> 4
> 
> Reset demotion to default
> /sys/devices/system/node# echo -1 > node1/demotion_targets 
> /sys/devices/system/node# cat node[0-4]/demotion_targets
> 4
> 4
> 4
> 4
> 
> When a specific device/NUMA node is used for demotion target via the user interface, it is taken
> out of other NUMA node targets.

IMHO, we should be careful about the interaction between the
auto-generated and the overridden demotion orders.

Best Regards,
Huang, Ying

> root@ubuntu-guest:/sys/devices/system/node# cat node[0-4]/demotion_targets
> 4
> 4
> 4
> 4
> 
> /sys/devices/system/node# echo 4 > node1/demotion_targets 
> /sys/devices/system/node# cat node[0-4]/demotion_targets
> 
> 4
> 
> 
> 
> If more than one node requies the same demotion target
> /sys/devices/system/node# echo 4 > node0/demotion_targets 
> /sys/devices/system/node# cat node[0-4]/demotion_targets
> 4
> 4
> 
> 
> 
> -aneesh
Jagdish Gediya April 25, 2022, 7:26 a.m. UTC | #27
On Sun, Apr 24, 2022 at 11:02:47AM +0800, ying.huang@intel.com wrote:
> Hi, All,
> 
> On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
> 
> [snip]
> 
> > I think it is necessary to either have per node demotion targets
> > configuration or the user space interface supported by this patch
> > series. As we don't have clear consensus on how the user interface
> > should look like, we can defer the per node demotion target set
> > interface to future until the real need arises.
> > 
> > Current patch series sets N_DEMOTION_TARGET from dax device kmem
> > driver, it may be possible that some memory node desired as demotion
> > target is not detected in the system from dax-device kmem probe path.
> > 
> > It is also possible that some of the dax-devices are not preferred as
> > demotion target e.g. HBM, for such devices, node shouldn't be set to
> > N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> > such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> > kernel, but for now this user space interface will be useful to avoid
> > such devices as demotion targets.
> > 
> > We can add read only interface to view per node demotion targets
> > from /sys/devices/system/node/nodeX/demotion_targets, remove
> > duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> > make /sys/devices/system/node/demotion_targets writable.
> > 
> > Huang, Wei, Yang,
> > What do you suggest?
> 
> We cannot remove a kernel ABI in practice.  So we need to make it right
> at the first time.  Let's try to collect some information for the kernel
> ABI definitation.

/sys/kernel/mm/numa/demotion_target was introduced in v2. I was
talking about removing it from the next version of the series, as a
similar interface becomes available as a result of introducing
N_DEMOTION_TARGETS at /sys/devices/system/node/demotion_targets. So,
instead of introducing a duplicate interface to write
N_DEMOTION_TARGETS, we can make
/sys/devices/system/node/demotion_targets writable.

> The below is just a starting point, please add your requirements.
> 
> 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
> want to use that as the demotion targets.  But I don't think this is a
> issue in practice for now, because demote-in-reclaim is disabled by
> default.
> 
> 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> 
> Node 0 & 2 are cpu + dram nodes and node 1 are slow
> memory node near node 0,
> 
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus:
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 2 3
> node 2 size: n MB
> node 2 free: n MB
> node distances:
> node   0   1   2
>   0:  10  40  20
>   1:  40  10  80
>   2:  20  80  10
> 
> We have 2 choices,
> 
> a)
> node	demotion targets
> 0	1
> 2	1
> 
> b)
> node	demotion targets
> 0	1
> 2	X
> 
> a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> traffic.  Both are OK as defualt configuration.  But some users may
> prefer the other one.  So we need a user space ABI to override the
> default configuration.
> 
> 3. For machines with HBM (High Bandwidth Memory), as in
> 
> https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
> 
> > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
> 
> Although HBM has better performance than DDR, in ACPI SLIT, their
> distance to CPU is longer.  We need to provide a way to fix this.  The
> user space ABI is one way.  The desired result will be to use local DDR
> as demotion targets of local HBM.
> 
> Best Regards,
> Huang, Ying
> 
>
Aneesh Kumar K.V April 25, 2022, 8:09 a.m. UTC | #28
On 4/25/22 11:40 AM, ying.huang@intel.com wrote:
> On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote:
>> "ying.huang@intel.com" <ying.huang@intel.com> writes:
>>
>>> Hi, All,
>>>
>>> On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
>>>
>>> [snip]
>>>
>>>> I think it is necessary to either have per node demotion targets
>>>> configuration or the user space interface supported by this patch
>>>> series. As we don't have clear consensus on how the user interface
>>>> should look like, we can defer the per node demotion target set
>>>> interface to future until the real need arises.
>>>>
>>>> Current patch series sets N_DEMOTION_TARGET from dax device kmem
>>>> driver, it may be possible that some memory node desired as demotion
>>>> target is not detected in the system from dax-device kmem probe path.
>>>>
>>>> It is also possible that some of the dax-devices are not preferred as
>>>> demotion target e.g. HBM, for such devices, node shouldn't be set to
>>>> N_DEMOTION_TARGETS. In future, Support should be added to distinguish
>>>> such dax-devices and not mark them as N_DEMOTION_TARGETS from the
>>>> kernel, but for now this user space interface will be useful to avoid
>>>> such devices as demotion targets.
>>>>
>>>> We can add read only interface to view per node demotion targets
>>>> from /sys/devices/system/node/nodeX/demotion_targets, remove
>>>> duplicated /sys/kernel/mm/numa/demotion_target interface and instead
>>>> make /sys/devices/system/node/demotion_targets writable.
>>>>
>>>> Huang, Wei, Yang,
>>>> What do you suggest?
>>>
>>> We cannot remove a kernel ABI in practice.  So we need to make it right
>>> at the first time.  Let's try to collect some information for the kernel
>>> ABI definitation.
>>>
>>> The below is just a starting point, please add your requirements.
>>>
>>> 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
>>> want to use that as the demotion targets.  But I don't think this is a
>>> issue in practice for now, because demote-in-reclaim is disabled by
>>> default.
>>
>> It is not just that the demotion can be disabled. We should be able to
>> use demotion on a system where we can find DRAM only NUMA nodes. That
>> cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs
>> something similar to to N_DEMOTION_TARGETS
>>
> 
> Can you show NUMA information of your machines with DRAM-only nodes and
> PMEM nodes?  We can try to find the proper demotion order for the
> system.  If you can not show it, we can defer N_DEMOTION_TARGETS until
> the machine is available.


Sure, I will find one such config. As you might have noticed, this is
very easy to end up with in a virtualization setup, because the
hypervisor can assign memory to a guest VM from a NUMA node that has
no CPUs assigned to the same guest. This depends on the config of the
other guest VM instances running on the system. So any virtualization
config that has persistent memory attached can easily end up like this.
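
For example, a CPU-less memory node can be given to a guest with
something like the following QEMU options (a sketch; exact backends
and sizes depend on the setup):

  -object memory-backend-ram,id=mem0,size=14G \
  -object memory-backend-ram,id=mem1,size=2G \
  -numa node,nodeid=0,cpus=0-7,memdev=mem0 \
  -numa node,nodeid=1,memdev=mem1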


>>> 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
>>>
>>> Node 0 & 2 are cpu + dram nodes and node 1 are slow
>>> memory node near node 0,
>>>
>>> available: 3 nodes (0-2)
>>> node 0 cpus: 0 1
>>> node 0 size: n MB
>>> node 0 free: n MB
>>> node 1 cpus:
>>> node 1 size: n MB
>>> node 1 free: n MB
>>> node 2 cpus: 2 3
>>> node 2 size: n MB
>>> node 2 free: n MB
>>> node distances:
>>> node   0   1   2
>>>    0:  10  40  20
>>>    1:  40  10  80
>>>    2:  20  80  10
>>>
>>> We have 2 choices,
>>>
>>> a)
>>> node	demotion targets
>>> 0	1
>>> 2	1
>>
>> This is achieved by
>>
>> [PATCH v2 1/5] mm: demotion: Set demotion list differently
>>
>>>
>>> b)
>>> node	demotion targets
>>> 0	1
>>> 2	X
>>
>>
>>>
>>> a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
>>> traffic.  Both are OK as defualt configuration.  But some users may
>>> prefer the other one.  So we need a user space ABI to override the
>>> default configuration.
>>>
>>> 3. For machines with HBM (High Bandwidth Memory), as in
>>>
>>> https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
>>>
>>>> [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
>>>
>>> Although HBM has better performance than DDR, in ACPI SLIT, their
>>> distance to CPU is longer.  We need to provide a way to fix this.  The
>>> user space ABI is one way.  The desired result will be to use local DDR
>>> as demotion targets of local HBM.
>>
>>
>> IMHO the above (2b and 3) can be done using per node demotion targets. Below is
>> what I think we could do with a single slow memory NUMA node 4.
> 
> If we can use writable per-node demotion targets as ABI, then we don't
> need N_DEMOTION_TARGETS.


Not sure I understand that. Yes, once you have a writable per-node
demotion target it is easy to build any demotion order. But that
doesn't mean we should not improve the default, unless you have reason
to say that using N_DEMOTION_TARGETS breaks any existing config.

> 
>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>> 4
>> 4
>> 4
>> 4
>>
>> /sys/devices/system/node# echo 1 > node1/demotion_targets
>> bash: echo: write error: Invalid argument
>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>> 4
>> 4
>> 4
>> 4
>>
>> /sys/devices/system/node# echo 0 > node1/demotion_targets
>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>> 4
>> 0
>> 4
>> 4
>>
>> /sys/devices/system/node# echo 1 > node0/demotion_targets
>> bash: echo: write error: Invalid argument
>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>> 4
>> 0
>> 4
>> 4
>>
>> Disable demotion for a specific node.
>> /sys/devices/system/node# echo > node1/demotion_targets
>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>> 4
>>
>> 4
>> 4
>>
>> Reset demotion to default
>> /sys/devices/system/node# echo -1 > node1/demotion_targets
>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>> 4
>> 4
>> 4
>> 4
>>
>> When a specific device/NUMA node is used for demotion target via the user interface, it is taken
>> out of other NUMA node targets.
> 
> IMHO, we should be careful about interaction between auto-generated and
> overridden demotion order.
> 

Yes, we should avoid a loop between them. But if you agree with the
above ABI, we could go ahead and share the implementation code.


> Best Regards,
> Huang, Ying
> 
>> root@ubuntu-guest:/sys/devices/system/node# cat node[0-4]/demotion_targets
>> 4
>> 4
>> 4
>> 4
>>
>> /sys/devices/system/node# echo 4 > node1/demotion_targets
>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>>
>> 4
>>
>>
>>
>> If more than one node requies the same demotion target
>> /sys/devices/system/node# echo 4 > node0/demotion_targets
>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>> 4
>> 4
>>
>>
>>
>> -aneesh
> 
> 

-aneesh
Aneesh Kumar K.V April 25, 2022, 8:54 a.m. UTC | #29
On 4/25/22 1:39 PM, Aneesh Kumar K V wrote:
> On 4/25/22 11:40 AM, ying.huang@intel.com wrote:
>> On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote:
>>> "ying.huang@intel.com" <ying.huang@intel.com> writes:
>>>
>>>> Hi, All,
>>>>
>>>> On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
>>>>
>>>> [snip]
>>>>
>>>>> I think it is necessary to either have per node demotion targets
>>>>> configuration or the user space interface supported by this patch
>>>>> series. As we don't have clear consensus on how the user interface
>>>>> should look like, we can defer the per node demotion target set
>>>>> interface to future until the real need arises.
>>>>>
>>>>> Current patch series sets N_DEMOTION_TARGET from dax device kmem
>>>>> driver, it may be possible that some memory node desired as demotion
>>>>> target is not detected in the system from dax-device kmem probe path.
>>>>>
>>>>> It is also possible that some of the dax-devices are not preferred as
>>>>> demotion target e.g. HBM, for such devices, node shouldn't be set to
>>>>> N_DEMOTION_TARGETS. In future, Support should be added to distinguish
>>>>> such dax-devices and not mark them as N_DEMOTION_TARGETS from the
>>>>> kernel, but for now this user space interface will be useful to avoid
>>>>> such devices as demotion targets.
>>>>>
>>>>> We can add read only interface to view per node demotion targets
>>>>> from /sys/devices/system/node/nodeX/demotion_targets, remove
>>>>> duplicated /sys/kernel/mm/numa/demotion_target interface and instead
>>>>> make /sys/devices/system/node/demotion_targets writable.
>>>>>
>>>>> Huang, Wei, Yang,
>>>>> What do you suggest?
>>>>
>>>> We cannot remove a kernel ABI in practice.  So we need to make it right
>>>> at the first time.  Let's try to collect some information for the 
>>>> kernel
>>>> ABI definitation.
>>>>
>>>> The below is just a starting point, please add your requirements.
>>>>
>>>> 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
>>>> want to use that as the demotion targets.  But I don't think this is a
>>>> issue in practice for now, because demote-in-reclaim is disabled by
>>>> default.
>>>
>>> It is not just that the demotion can be disabled. We should be able to
>>> use demotion on a system where we can find DRAM only NUMA nodes. That
>>> cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs
>>> something similar to to N_DEMOTION_TARGETS
>>>
>>
>> Can you show NUMA information of your machines with DRAM-only nodes and
>> PMEM nodes?  We can try to find the proper demotion order for the
>> system.  If you can not show it, we can defer N_DEMOTION_TARGETS until
>> the machine is available.
> 
> 
> Sure will find one such config. As you might have noticed this is very 
> easy to have in a virtualization setup because the hypervisor can assign 
> memory to a guest VM from a numa node that doesn't have CPU assigned to 
> the same guest. This depends on the other guest VM instance config 
> running on the system. So on any virtualization config that has got 
> persistent memory attached, this can become an easy config to end up with.
> 
> 

something like this

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 14272 MB
node 0 free: 13392 MB
node 1 cpus:
node 1 size: 2028 MB
node 1 free: 1971 MB
node distances:
node   0   1
   0:  10  40
   1:  40  10
$ cat /sys/bus/nd/devices/dax0.0/target_node
2
$
# cd /sys/bus/dax/drivers/
:/sys/bus/dax/drivers# ls
device_dax  kmem
:/sys/bus/dax/drivers# cd device_dax/
:/sys/bus/dax/drivers/device_dax# echo dax0.0 > unbind
:/sys/bus/dax/drivers/device_dax# echo dax0.0 >  ../kmem/new_id
:/sys/bus/dax/drivers/device_dax# numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 14272 MB
node 0 free: 13380 MB
node 1 cpus:
node 1 size: 2028 MB
node 1 free: 1961 MB
node 2 cpus:
node 2 size: 0 MB
node 2 free: 0 MB
node distances:
node   0   1   2
   0:  10  40  80
   1:  40  10  80
   2:  80  80  10
:/sys/bus/dax/drivers/device_dax#
Wei Xu April 25, 2022, 4:56 p.m. UTC | #30
On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> Hi, All,
>
> On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
>
> [snip]
>
> > I think it is necessary to either have per node demotion targets
> > configuration or the user space interface supported by this patch
> > series. As we don't have clear consensus on how the user interface
> > should look like, we can defer the per node demotion target set
> > interface to future until the real need arises.
> >
> > Current patch series sets N_DEMOTION_TARGET from dax device kmem
> > driver, it may be possible that some memory node desired as demotion
> > target is not detected in the system from dax-device kmem probe path.
> >
> > It is also possible that some of the dax-devices are not preferred as
> > demotion target e.g. HBM, for such devices, node shouldn't be set to
> > N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> > such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> > kernel, but for now this user space interface will be useful to avoid
> > such devices as demotion targets.
> >
> > We can add read only interface to view per node demotion targets
> > from /sys/devices/system/node/nodeX/demotion_targets, remove
> > duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> > make /sys/devices/system/node/demotion_targets writable.
> >
> > Huang, Wei, Yang,
> > What do you suggest?
>
> We cannot remove a kernel ABI in practice.  So we need to make it right
> at the first time.  Let's try to collect some information for the kernel
> ABI definitation.
>
> The below is just a starting point, please add your requirements.
>
> 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
> want to use that as the demotion targets.  But I don't think this is a
> issue in practice for now, because demote-in-reclaim is disabled by
> default.
>
> 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
>
> Node 0 & 2 are cpu + dram nodes and node 1 are slow
> memory node near node 0,
>
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: n MB
> node 0 free: n MB
> node 1 cpus:
> node 1 size: n MB
> node 1 free: n MB
> node 2 cpus: 2 3
> node 2 size: n MB
> node 2 free: n MB
> node distances:
> node   0   1   2
>   0:  10  40  20
>   1:  40  10  80
>   2:  20  80  10
>
> We have 2 choices,
>
> a)
> node    demotion targets
> 0       1
> 2       1
>
> b)
> node    demotion targets
> 0       1
> 2       X
>
> a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> traffic.  Both are OK as defualt configuration.  But some users may
> prefer the other one.  So we need a user space ABI to override the
> default configuration.

I think 2(a) should be the system-wide configuration and 2(b) can be
achieved with NUMA mempolicy (which needs to be added to demotion).

In general, we can view the demotion order in a way similar to
allocation fallback order (after all, if we don't demote or demotion
lags behind, the allocations will go to these demotion target nodes
according to the allocation fallback order anyway).  If we initialize
the demotion order in that way (i.e. every node can demote to any node
in the next tier, and the priority of the target nodes is sorted for
each source node), we don't need per-node demotion order override from
the userspace.  What we need is to specify what nodes should be in
each tier and support NUMA mempolicy in demotion.
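
As a rough illustration of that kind of initialization (just a sketch;
append_demotion_target() is a made-up helper name): every N_MEMORY node
outside the next tier gets all next-tier nodes as targets, nearest first
by node_distance():

static void build_default_demotion_order(const nodemask_t *next_tier)
{
	int src, tgt, best;
	nodemask_t remaining;

	for_each_node_state(src, N_MEMORY) {
		if (node_isset(src, *next_tier))
			continue;

		/* Pick next-tier nodes one by one, closest first. */
		remaining = *next_tier;
		while (!nodes_empty(remaining)) {
			best = NUMA_NO_NODE;
			for_each_node_mask(tgt, remaining) {
				if (best == NUMA_NO_NODE ||
				    node_distance(src, tgt) <
				    node_distance(src, best))
					best = tgt;
			}
			append_demotion_target(src, best);	/* made-up */
			node_clear(best, remaining);
		}
	}
}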

Cross-socket demotion should not be too big a problem in practice
because we can optimize the code to do the demotion from the local CPU
node (i.e. local writes to the target node and remote read from the
source node).  The bigger issue is cross-socket memory access onto the
demoted pages from the applications, which is why NUMA mempolicy is
important here.

> 3. For machines with HBM (High Bandwidth Memory), as in
>
> https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
>
> > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
>
> Although HBM has better performance than DDR, in ACPI SLIT, their
> distance to CPU is longer.  We need to provide a way to fix this.  The
> user space ABI is one way.  The desired result will be to use local DDR
> as demotion targets of local HBM.
>
> Best Regards,
> Huang, Ying
>
Davidlohr Bueso April 25, 2022, 8:17 p.m. UTC | #31
On Mon, 25 Apr 2022, Aneesh Kumar K V wrote:

>On 4/25/22 11:40 AM, ying.huang@intel.com wrote:
>>On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote:
>>>"ying.huang@intel.com" <ying.huang@intel.com> writes:
>>>
>>>>Hi, All,
>>>>
>>>>On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
>>>>
>>>>[snip]
>>>>
>>>>>I think it is necessary to either have per node demotion targets
>>>>>configuration or the user space interface supported by this patch
>>>>>series. As we don't have clear consensus on how the user interface
>>>>>should look like, we can defer the per node demotion target set
>>>>>interface to future until the real need arises.
>>>>>
>>>>>Current patch series sets N_DEMOTION_TARGET from dax device kmem
>>>>>driver, it may be possible that some memory node desired as demotion
>>>>>target is not detected in the system from dax-device kmem probe path.
>>>>>
>>>>>It is also possible that some of the dax-devices are not preferred as
>>>>>demotion target e.g. HBM, for such devices, node shouldn't be set to
>>>>>N_DEMOTION_TARGETS. In future, Support should be added to distinguish
>>>>>such dax-devices and not mark them as N_DEMOTION_TARGETS from the
>>>>>kernel, but for now this user space interface will be useful to avoid
>>>>>such devices as demotion targets.
>>>>>
>>>>>We can add read only interface to view per node demotion targets
>>>>>from /sys/devices/system/node/nodeX/demotion_targets, remove
>>>>>duplicated /sys/kernel/mm/numa/demotion_target interface and instead
>>>>>make /sys/devices/system/node/demotion_targets writable.
>>>>>
>>>>>Huang, Wei, Yang,
>>>>>What do you suggest?
>>>>
>>>>We cannot remove a kernel ABI in practice.  So we need to make it right
>>>>at the first time.  Let's try to collect some information for the kernel
>>>>ABI definitation.
>>>>
>>>>The below is just a starting point, please add your requirements.
>>>>
>>>>1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
>>>>want to use that as the demotion targets.  But I don't think this is a
>>>>issue in practice for now, because demote-in-reclaim is disabled by
>>>>default.
>>>
>>>It is not just that the demotion can be disabled. We should be able to
>>>use demotion on a system where we can find DRAM only NUMA nodes. That
>>>cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs
>>>something similar to to N_DEMOTION_TARGETS
>>>
>>
>>Can you show NUMA information of your machines with DRAM-only nodes and
>>PMEM nodes?  We can try to find the proper demotion order for the
>>system.  If you can not show it, we can defer N_DEMOTION_TARGETS until
>>the machine is available.
>
>
>Sure will find one such config. As you might have noticed this is very
>easy to have in a virtualization setup because the hypervisor can
>assign memory to a guest VM from a numa node that doesn't have CPU
>assigned to the same guest. This depends on the other guest VM
>instance config running on the system. So on any virtualization config
>that has got persistent memory attached, this can become an easy
>config to end up with.

And as hw becomes available things like CXL will also start to show
"interesting" setups. You have a mix of volatile and/or pmem nodes
with different access costs, so: CPU+DRAM, DRAM (?), volatile CXL mem,
CXL pmem, non-cxl pmem.

imo, by default, slower mem should be demotion candidates regardless of
type or socket layout (which can be a last consideration such that this
is somewhat mitigated). And afaict this is along the lines of what Jagdish's
first example refers to in patch 1/5.

>
>>>>2. For machines with PMEM installed in only 1 of 2 sockets, for example,
>>>>
>>>>Node 0 & 2 are cpu + dram nodes and node 1 are slow
>>>>memory node near node 0,
>>>>
>>>>available: 3 nodes (0-2)
>>>>node 0 cpus: 0 1
>>>>node 0 size: n MB
>>>>node 0 free: n MB
>>>>node 1 cpus:
>>>>node 1 size: n MB
>>>>node 1 free: n MB
>>>>node 2 cpus: 2 3
>>>>node 2 size: n MB
>>>>node 2 free: n MB
>>>>node distances:
>>>>node   0   1   2
>>>>   0:  10  40  20
>>>>   1:  40  10  80
>>>>   2:  20  80  10
>>>>
>>>>We have 2 choices,
>>>>
>>>>a)
>>>>node	demotion targets
>>>>0	1
>>>>2	1
>>>
>>>This is achieved by
>>>
>>>[PATCH v2 1/5] mm: demotion: Set demotion list differently

Yes, I think it makes sense to do 2a.

Thanks,
Davidlohr
Huang, Ying April 26, 2022, 8:42 a.m. UTC | #32
On Mon, 2022-04-25 at 13:39 +0530, Aneesh Kumar K V wrote:
> On 4/25/22 11:40 AM, ying.huang@intel.com wrote:
> > On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote:
> > > "ying.huang@intel.com" <ying.huang@intel.com> writes:
> > > 
> > > > Hi, All,
> > > > 
> > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
> > > > 
> > > > [snip]
> > > > 
> > > > > I think it is necessary to either have per node demotion targets
> > > > > configuration or the user space interface supported by this patch
> > > > > series. As we don't have clear consensus on how the user interface
> > > > > should look like, we can defer the per node demotion target set
> > > > > interface to future until the real need arises.
> > > > > 
> > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem
> > > > > driver, it may be possible that some memory node desired as demotion
> > > > > target is not detected in the system from dax-device kmem probe path.
> > > > > 
> > > > > It is also possible that some of the dax-devices are not preferred as
> > > > > demotion target e.g. HBM, for such devices, node shouldn't be set to
> > > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> > > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> > > > > kernel, but for now this user space interface will be useful to avoid
> > > > > such devices as demotion targets.
> > > > > 
> > > > > We can add read only interface to view per node demotion targets
> > > > > from /sys/devices/system/node/nodeX/demotion_targets, remove
> > > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> > > > > make /sys/devices/system/node/demotion_targets writable.
> > > > > 
> > > > > Huang, Wei, Yang,
> > > > > What do you suggest?
> > > > 
> > > > We cannot remove a kernel ABI in practice.  So we need to make it right
> > > > at the first time.  Let's try to collect some information for the kernel
> > > > ABI definitation.
> > > > 
> > > > The below is just a starting point, please add your requirements.
> > > > 
> > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
> > > > want to use that as the demotion targets.  But I don't think this is a
> > > > issue in practice for now, because demote-in-reclaim is disabled by
> > > > default.
> > > 
> > > It is not just that the demotion can be disabled. We should be able to
> > > use demotion on a system where we can find DRAM only NUMA nodes. That
> > > cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs
> > > something similar to to N_DEMOTION_TARGETS
> > > 
> > 
> > Can you show NUMA information of your machines with DRAM-only nodes and
> > PMEM nodes?  We can try to find the proper demotion order for the
> > system.  If you can not show it, we can defer N_DEMOTION_TARGETS until
> > the machine is available.
> 
> 
> Sure will find one such config. As you might have noticed this is very 
> easy to have in a virtualization setup because the hypervisor can assign 
> memory to a guest VM from a numa node that doesn't have CPU assigned to 
> the same guest. This depends on the other guest VM instance config 
> running on the system. So on any virtualization config that has got 
> persistent memory attached, this can become an easy config to end up with.
> 

Why do they want to do that?  I am looking for a real issue, not a
theoretical possibility.

> 
> > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > 
> > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > memory node near node 0,
> > > > 
> > > > available: 3 nodes (0-2)
> > > > node 0 cpus: 0 1
> > > > node 0 size: n MB
> > > > node 0 free: n MB
> > > > node 1 cpus:
> > > > node 1 size: n MB
> > > > node 1 free: n MB
> > > > node 2 cpus: 2 3
> > > > node 2 size: n MB
> > > > node 2 free: n MB
> > > > node distances:
> > > > node   0   1   2
> > > >    0:  10  40  20
> > > >    1:  40  10  80
> > > >    2:  20  80  10
> > > > 
> > > > We have 2 choices,
> > > > 
> > > > a)
> > > > node	demotion targets
> > > > 0	1
> > > > 2	1
> > > 
> > > This is achieved by
> > > 
> > > [PATCH v2 1/5] mm: demotion: Set demotion list differently
> > > 
> > > > 
> > > > b)
> > > > node	demotion targets
> > > > 0	1
> > > > 2	X
> > > 
> > > 
> > > > 
> > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > prefer the other one.  So we need a user space ABI to override the
> > > > default configuration.
> > > > 
> > > > 3. For machines with HBM (High Bandwidth Memory), as in
> > > > 
> > > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
> > > > 
> > > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
> > > > 
> > > > Although HBM has better performance than DDR, in ACPI SLIT, their
> > > > distance to CPU is longer.  We need to provide a way to fix this.  The
> > > > user space ABI is one way.  The desired result will be to use local DDR
> > > > as demotion targets of local HBM.
> > > 
> > > 
> > > IMHO the above (2b and 3) can be done using per node demotion targets. Below is
> > > what I think we could do with a single slow memory NUMA node 4.
> > 
> > If we can use writable per-node demotion targets as ABI, then we don't
> > need N_DEMOTION_TARGETS.
> 
> 
> Not sure I understand that. Yes, once you have a writeable per node 
> demotion target it is easy to build any demotion order.

Yes.

> But that doesn't 
> mean we should not improve the default unless you have reason to say 
> that using N_DEMOTTION_TARGETS breaks any existing config.
> 

Because N_DEMOTION_TARGETS is a new kernel ABI to override the default,
not the default itself.  [1/5] of this patchset improves the default
behavior itself, and I think that's good.

Because we must maintain the kernel ABI almost forever, we need to be
careful about adding new ABI and add as little as possible.  If writable
per-node demotion targets can address your issue, then it's unnecessary
to add another redundant kernel ABI for that.

> > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > 4
> > > 4
> > > 4
> > > 4
> > > 
> > > /sys/devices/system/node# echo 1 > node1/demotion_targets
> > > bash: echo: write error: Invalid argument
> > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > 4
> > > 4
> > > 4
> > > 4
> > > 
> > > /sys/devices/system/node# echo 0 > node1/demotion_targets
> > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > 4
> > > 0
> > > 4
> > > 4
> > > 
> > > /sys/devices/system/node# echo 1 > node0/demotion_targets
> > > bash: echo: write error: Invalid argument
> > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > 4
> > > 0
> > > 4
> > > 4
> > > 
> > > Disable demotion for a specific node.
> > > /sys/devices/system/node# echo > node1/demotion_targets
> > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > 4
> > > 
> > > 4
> > > 4
> > > 
> > > Reset demotion to default
> > > /sys/devices/system/node# echo -1 > node1/demotion_targets
> > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > 4
> > > 4
> > > 4
> > > 4
> > > 
> > > When a specific device/NUMA node is used for demotion target via the user interface, it is taken
> > > out of other NUMA node targets.
> > 
> > IMHO, we should be careful about interaction between auto-generated and
> > overridden demotion order.
> > 
> 
> yes, we should avoid loop between that.

In addition to that, we need to get the same result after hot-removing
then hot-adding the same node.  That is, the result should be stable
after a NOOP.  I guess we can just always,

- Generate the default demotion order automatically without any
overriding.

- Apply the overriding, after removing the invalid targets, etc.

> But if you agree for the above 
> ABI we could go ahead and share the implementation code.

I think we need to add a way to distinguish auto-generated and overridden
demotion targets in the output of nodeX/demotion_targets.  Otherwise it
looks good to me.

Best Regards,
Huang, Ying

> > > root@ubuntu-guest:/sys/devices/system/node# cat node[0-4]/demotion_targets
> > > 4
> > > 4
> > > 4
> > > 4
> > > 
> > > /sys/devices/system/node# echo 4 > node1/demotion_targets
> > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > 
> > > 4
> > > 
> > > 
> > > 
> > > If more than one node requies the same demotion target
> > > /sys/devices/system/node# echo 4 > node0/demotion_targets
> > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > 4
> > > 4
> > > 
> > > 
> > > 
> > > -aneesh
> > 
> > 
> 
> -aneesh
Aneesh Kumar K.V April 26, 2022, 9:02 a.m. UTC | #33
On 4/26/22 2:12 PM, ying.huang@intel.com wrote:
> On Mon, 2022-04-25 at 13:39 +0530, Aneesh Kumar K V wrote:
>> On 4/25/22 11:40 AM, ying.huang@intel.com wrote:
>>> On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote:
>>>> "ying.huang@intel.com" <ying.huang@intel.com> writes:
>>>>
>>>>> Hi, All,
>>>>>
>>>>> On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
>>>>>
>>>>> [snip]
>>>>>
>>>>>> I think it is necessary to either have per node demotion targets
>>>>>> configuration or the user space interface supported by this patch
>>>>>> series. As we don't have clear consensus on how the user interface
>>>>>> should look like, we can defer the per node demotion target set
>>>>>> interface to future until the real need arises.
>>>>>>
>>>>>> Current patch series sets N_DEMOTION_TARGET from dax device kmem
>>>>>> driver, it may be possible that some memory node desired as demotion
>>>>>> target is not detected in the system from dax-device kmem probe path.
>>>>>>
>>>>>> It is also possible that some of the dax-devices are not preferred as
>>>>>> demotion target e.g. HBM, for such devices, node shouldn't be set to
>>>>>> N_DEMOTION_TARGETS. In future, Support should be added to distinguish
>>>>>> such dax-devices and not mark them as N_DEMOTION_TARGETS from the
>>>>>> kernel, but for now this user space interface will be useful to avoid
>>>>>> such devices as demotion targets.
>>>>>>
>>>>>> We can add read only interface to view per node demotion targets
>>>>>> from /sys/devices/system/node/nodeX/demotion_targets, remove
>>>>>> duplicated /sys/kernel/mm/numa/demotion_target interface and instead
>>>>>> make /sys/devices/system/node/demotion_targets writable.
>>>>>>
>>>>>> Huang, Wei, Yang,
>>>>>> What do you suggest?
>>>>>
>>>>> We cannot remove a kernel ABI in practice.  So we need to make it right
>>>>> at the first time.  Let's try to collect some information for the kernel
>>>>> ABI definitation.
>>>>>
>>>>> The below is just a starting point, please add your requirements.
>>>>>
>>>>> 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
>>>>> want to use that as the demotion targets.  But I don't think this is a
>>>>> issue in practice for now, because demote-in-reclaim is disabled by
>>>>> default.
>>>>
>>>> It is not just that the demotion can be disabled. We should be able to
>>>> use demotion on a system where we can find DRAM only NUMA nodes. That
>>>> cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs
>>>> something similar to to N_DEMOTION_TARGETS
>>>>
>>>
>>> Can you show NUMA information of your machines with DRAM-only nodes and
>>> PMEM nodes?  We can try to find the proper demotion order for the
>>> system.  If you can not show it, we can defer N_DEMOTION_TARGETS until
>>> the machine is available.
>>
>>
>> Sure will find one such config. As you might have noticed this is very
>> easy to have in a virtualization setup because the hypervisor can assign
>> memory to a guest VM from a numa node that doesn't have CPU assigned to
>> the same guest. This depends on the other guest VM instance config
>> running on the system. So on any virtualization config that has got
>> persistent memory attached, this can become an easy config to end up with.
>>
> 
> Why they want to do that?  I am looking forward to a real issue, not
> theoritical possibility.
> 


Can you elaborate on this more? That is a real config.


>>
>>>>> 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
>>>>>
>>>>> Node 0 & 2 are cpu + dram nodes and node 1 are slow
>>>>> memory node near node 0,
>>>>>
>>>>> available: 3 nodes (0-2)
>>>>> node 0 cpus: 0 1
>>>>> node 0 size: n MB
>>>>> node 0 free: n MB
>>>>> node 1 cpus:
>>>>> node 1 size: n MB
>>>>> node 1 free: n MB
>>>>> node 2 cpus: 2 3
>>>>> node 2 size: n MB
>>>>> node 2 free: n MB
>>>>> node distances:
>>>>> node   0   1   2
>>>>>     0:  10  40  20
>>>>>     1:  40  10  80
>>>>>     2:  20  80  10
>>>>>
>>>>> We have 2 choices,
>>>>>
>>>>> a)
>>>>> node	demotion targets
>>>>> 0	1
>>>>> 2	1
>>>>
>>>> This is achieved by
>>>>
>>>> [PATCH v2 1/5] mm: demotion: Set demotion list differently
>>>>
>>>>>
>>>>> b)
>>>>> node	demotion targets
>>>>> 0	1
>>>>> 2	X
>>>>
>>>>
>>>>>
>>>>> a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
>>>>> traffic.  Both are OK as defualt configuration.  But some users may
>>>>> prefer the other one.  So we need a user space ABI to override the
>>>>> default configuration.
>>>>>
>>>>> 3. For machines with HBM (High Bandwidth Memory), as in
>>>>>
>>>>> https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
>>>>>
>>>>>> [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
>>>>>
>>>>> Although HBM has better performance than DDR, in ACPI SLIT, their
>>>>> distance to CPU is longer.  We need to provide a way to fix this.  The
>>>>> user space ABI is one way.  The desired result will be to use local DDR
>>>>> as demotion targets of local HBM.
>>>>
>>>>
>>>> IMHO the above (2b and 3) can be done using per node demotion targets. Below is
>>>> what I think we could do with a single slow memory NUMA node 4.
>>>
>>> If we can use writable per-node demotion targets as ABI, then we don't
>>> need N_DEMOTION_TARGETS.
>>
>>
>> Not sure I understand that. Yes, once you have a writeable per node
>> demotion target it is easy to build any demotion order.
> 
> Yes.
> 
>> But that doesn't
>> mean we should not improve the default unless you have reason to say
>> that using N_DEMOTTION_TARGETS breaks any existing config.
>>
> 
> Becuase N_DEMOTTION_TARGETS is a new kernel ABI to override the default,
> not the default itself.  [1/5] of this patchset improve the default
> behavior itself, and I think that's good.
> 

We are improving the default by using N_DEMOTION_TARGETS because the
current default breaks configs which can end up with memory-only NUMA
nodes.  I would not consider it an override.

> Because we must maintain the kernel ABI almost for ever, we need to be
> careful about adding new ABI and add less if possible.  If writable per-
> node demotion targets can address your issue.  Then it's unnecessary to
> add another redundant kernel ABI for that.

This means that on platforms like powerpc, we would always need
userspace-managed demotion, because we can end up with memory-only NUMA
nodes there.  Why force that?


> 
>>>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>>>> 4
>>>> 4
>>>> 4
>>>> 4
>>>>
>>>> /sys/devices/system/node# echo 1 > node1/demotion_targets
>>>> bash: echo: write error: Invalid argument
>>>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>>>> 4
>>>> 4
>>>> 4
>>>> 4
>>>>
>>>> /sys/devices/system/node# echo 0 > node1/demotion_targets
>>>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>>>> 4
>>>> 0
>>>> 4
>>>> 4
>>>>
>>>> /sys/devices/system/node# echo 1 > node0/demotion_targets
>>>> bash: echo: write error: Invalid argument
>>>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>>>> 4
>>>> 0
>>>> 4
>>>> 4
>>>>
>>>> Disable demotion for a specific node.
>>>> /sys/devices/system/node# echo > node1/demotion_targets
>>>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>>>> 4
>>>>
>>>> 4
>>>> 4
>>>>
>>>> Reset demotion to default
>>>> /sys/devices/system/node# echo -1 > node1/demotion_targets
>>>> /sys/devices/system/node# cat node[0-4]/demotion_targets
>>>> 4
>>>> 4
>>>> 4
>>>> 4
>>>>
>>>> When a specific device/NUMA node is used for demotion target via the user interface, it is taken
>>>> out of other NUMA node targets.
>>>
>>> IMHO, we should be careful about interaction between auto-generated and
>>> overridden demotion order.
>>>
>>
>> yes, we should avoid loop between that.
> 
> In addition to that, we need to get same result after hot-remove then
> hot-add the same node.  That is, the result should be stable after NOOP.
> I guess we can just always,
> 
> - Generate the default demotion order automatically without any
> overriding.
> 
> - Apply the overriding, after removing the invalid targets, etc.
> 
>> But if you agree for the above
>> ABI we could go ahead and share the implementation code.
> 
> I think we need to add a way to distinguish auto-generated and overriden
> demotion targets in the output of nodeX/demotion_targets.  Otherwise it
> looks good to me.
> 


something like:

/sys/devices/system/node# echo 4 > node1/demotion_targets
/sys/devices/system/node# cat node[0-4]/demotion_targets
-
4 (userspace override)
-
-
-

-aneesh
Huang, Ying April 26, 2022, 9:44 a.m. UTC | #34
On Tue, 2022-04-26 at 14:32 +0530, Aneesh Kumar K V wrote:
> On 4/26/22 2:12 PM, ying.huang@intel.com wrote:
> > On Mon, 2022-04-25 at 13:39 +0530, Aneesh Kumar K V wrote:
> > > On 4/25/22 11:40 AM, ying.huang@intel.com wrote:
> > > > On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote:
> > > > > "ying.huang@intel.com" <ying.huang@intel.com> writes:
> > > > > 
> > > > > > Hi, All,
> > > > > > 
> > > > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
> > > > > > 
> > > > > > [snip]
> > > > > > 
> > > > > > > I think it is necessary to either have per node demotion targets
> > > > > > > configuration or the user space interface supported by this patch
> > > > > > > series. As we don't have clear consensus on how the user interface
> > > > > > > should look like, we can defer the per node demotion target set
> > > > > > > interface to future until the real need arises.
> > > > > > > 
> > > > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem
> > > > > > > driver, it may be possible that some memory node desired as demotion
> > > > > > > target is not detected in the system from dax-device kmem probe path.
> > > > > > > 
> > > > > > > It is also possible that some of the dax-devices are not preferred as
> > > > > > > demotion target e.g. HBM, for such devices, node shouldn't be set to
> > > > > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> > > > > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> > > > > > > kernel, but for now this user space interface will be useful to avoid
> > > > > > > such devices as demotion targets.
> > > > > > > 
> > > > > > > We can add read only interface to view per node demotion targets
> > > > > > > from /sys/devices/system/node/nodeX/demotion_targets, remove
> > > > > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> > > > > > > make /sys/devices/system/node/demotion_targets writable.
> > > > > > > 
> > > > > > > Huang, Wei, Yang,
> > > > > > > What do you suggest?
> > > > > > 
> > > > > > We cannot remove a kernel ABI in practice.  So we need to make it right
> > > > > > at the first time.  Let's try to collect some information for the kernel
> > > > > > ABI definitation.
> > > > > > 
> > > > > > The below is just a starting point, please add your requirements.
> > > > > > 
> > > > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
> > > > > > want to use that as the demotion targets.  But I don't think this is a
> > > > > > issue in practice for now, because demote-in-reclaim is disabled by
> > > > > > default.
> > > > > 
> > > > > It is not just that the demotion can be disabled. We should be able to
> > > > > use demotion on a system where we can find DRAM only NUMA nodes. That
> > > > > cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs
> > > > > something similar to to N_DEMOTION_TARGETS
> > > > > 
> > > > 
> > > > Can you show NUMA information of your machines with DRAM-only nodes and
> > > > PMEM nodes?  We can try to find the proper demotion order for the
> > > > system.  If you can not show it, we can defer N_DEMOTION_TARGETS until
> > > > the machine is available.
> > > 
> > > 
> > > Sure will find one such config. As you might have noticed this is very
> > > easy to have in a virtualization setup because the hypervisor can assign
> > > memory to a guest VM from a numa node that doesn't have CPU assigned to
> > > the same guest. This depends on the other guest VM instance config
> > > running on the system. So on any virtualization config that has got
> > > persistent memory attached, this can become an easy config to end up with.
> > > 
> > 
> > Why they want to do that?  I am looking forward to a real issue, not
> > theoritical possibility.
> > 
> 
> 
> Can you elaborate this more? That is a real config.
> 
> 
> > > 
> > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > > > 
> > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > > > memory node near node 0,
> > > > > > 
> > > > > > available: 3 nodes (0-2)
> > > > > > node 0 cpus: 0 1
> > > > > > node 0 size: n MB
> > > > > > node 0 free: n MB
> > > > > > node 1 cpus:
> > > > > > node 1 size: n MB
> > > > > > node 1 free: n MB
> > > > > > node 2 cpus: 2 3
> > > > > > node 2 size: n MB
> > > > > > node 2 free: n MB
> > > > > > node distances:
> > > > > > node   0   1   2
> > > > > >     0:  10  40  20
> > > > > >     1:  40  10  80
> > > > > >     2:  20  80  10
> > > > > > 
> > > > > > We have 2 choices,
> > > > > > 
> > > > > > a)
> > > > > > node	demotion targets
> > > > > > 0	1
> > > > > > 2	1
> > > > > 
> > > > > This is achieved by
> > > > > 
> > > > > [PATCH v2 1/5] mm: demotion: Set demotion list differently
> > > > > 
> > > > > > 
> > > > > > b)
> > > > > > node	demotion targets
> > > > > > 0	1
> > > > > > 2	X
> > > > > 
> > > > > 
> > > > > > 
> > > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > > default configuration.
> > > > > > 
> > > > > > 3. For machines with HBM (High Bandwidth Memory), as in
> > > > > > 
> > > > > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
> > > > > > 
> > > > > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
> > > > > > 
> > > > > > Although HBM has better performance than DDR, in ACPI SLIT, their
> > > > > > distance to CPU is longer.  We need to provide a way to fix this.  The
> > > > > > user space ABI is one way.  The desired result will be to use local DDR
> > > > > > as demotion targets of local HBM.
> > > > > 
> > > > > 
> > > > > IMHO the above (2b and 3) can be done using per node demotion targets. Below is
> > > > > what I think we could do with a single slow memory NUMA node 4.
> > > > 
> > > > If we can use writable per-node demotion targets as ABI, then we don't
> > > > need N_DEMOTION_TARGETS.
> > > 
> > > 
> > > Not sure I understand that. Yes, once you have a writeable per node
> > > demotion target it is easy to build any demotion order.
> > 
> > Yes.
> > 
> > > But that doesn't
> > > mean we should not improve the default unless you have reason to say
> > > that using N_DEMOTTION_TARGETS breaks any existing config.
> > > 
> > 
> > Becuase N_DEMOTTION_TARGETS is a new kernel ABI to override the default,
> > not the default itself.  [1/5] of this patchset improve the default
> > behavior itself, and I think that's good.
> > 
> 
> we are improving the default by using N_DEMOTION_TARGETS because the 
> current default breaks configs which can get you memory only NUMA nodes. 
> I would not consider it an override.
> 

OK.  I guess that there is some misunderstanding here.  I thought that
you were referring to N_DEMOTION_TARGETS being overridden by making the
following file writable,

  /sys/devices/system/node/demotion_targets

Now, I think you are referring to setting N_DEMOTION_TARGETS in kmem
driver by default.  Sorry if I misunderstood you.

So, to be clear, I am OK with restricting the default demotion targets
via the kmem driver (we can improve this in the future with more
sources).  But I don't think it's good to make

  /sys/devices/system/node/demotion_targets

writable.  Instead, I think it's better to make

  /sys/devices/system/node/nodeX/demotion_targets

writable.
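
For reference, the kmem driver side that restricts the default can be as
small as the below.  This is a sketch of what patch [4/5] is described as
doing; the helper name and the exact call site in dev_dax_kmem_probe()
are my assumptions:

static void kmem_set_default_demotion_target(int numa_node)
{
	/* Slow (dax/kmem-backed) nodes are demotion targets by default. */
	if (numa_node != NUMA_NO_NODE)
		node_set_state(numa_node, N_DEMOTION_TARGETS);
}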

> > Because we must maintain the kernel ABI almost for ever, we need to be
> > careful about adding new ABI and add less if possible.  If writable per-
> > node demotion targets can address your issue.  Then it's unnecessary to
> > add another redundant kernel ABI for that.
> 
> This means on platform like powerpc, we would always need to have a 
> userspace managed demotion because we can end up with memory only numa 
> nodes for them. Why force that?

Please take a look at the above.


> > 
> > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > > 4
> > > > > 4
> > > > > 4
> > > > > 4
> > > > > 
> > > > > /sys/devices/system/node# echo 1 > node1/demotion_targets
> > > > > bash: echo: write error: Invalid argument
> > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > > 4
> > > > > 4
> > > > > 4
> > > > > 4
> > > > > 
> > > > > /sys/devices/system/node# echo 0 > node1/demotion_targets
> > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > > 4
> > > > > 0
> > > > > 4
> > > > > 4
> > > > > 
> > > > > /sys/devices/system/node# echo 1 > node0/demotion_targets
> > > > > bash: echo: write error: Invalid argument
> > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > > 4
> > > > > 0
> > > > > 4
> > > > > 4
> > > > > 
> > > > > Disable demotion for a specific node.
> > > > > /sys/devices/system/node# echo > node1/demotion_targets
> > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > > 4
> > > > > 
> > > > > 4
> > > > > 4
> > > > > 
> > > > > Reset demotion to default
> > > > > /sys/devices/system/node# echo -1 > node1/demotion_targets
> > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > > 4
> > > > > 4
> > > > > 4
> > > > > 4
> > > > > 
> > > > > When a specific device/NUMA node is used for demotion target via the user interface, it is taken
> > > > > out of other NUMA node targets.
> > > > 
> > > > IMHO, we should be careful about interaction between auto-generated and
> > > > overridden demotion order.
> > > > 
> > > 
> > > yes, we should avoid loop between that.
> > 
> > In addition to that, we need to get same result after hot-remove then
> > hot-add the same node.  That is, the result should be stable after NOOP.
> > I guess we can just always,
> > 
> > - Generate the default demotion order automatically without any
> > overriding.
> > 
> > - Apply the overriding, after removing the invalid targets, etc.
> > 
> > > But if you agree for the above
> > > ABI we could go ahead and share the implementation code.
> > 
> > I think we need to add a way to distinguish auto-generated and overriden
> > demotion targets in the output of nodeX/demotion_targets.  Otherwise it
> > looks good to me.
> > 
> 
> 
> something like:
> 
> /sys/devices/system/node# echo 4 > node1/demotion_targets
> /sys/devices/system/node# cat node[0-4]/demotion_targets
> -
> 4 (userspace override)
> -
> -
> -
> 

Or

/sys/devices/system/node# echo 4 > node1/demotion_targets
/sys/devices/system/node# cat node[0-4]/demotion_targets
-
*4
-
-
-
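
In code, the show side could mark overridden entries roughly like this.
It is only a sketch; node_demotion_targets() and
node_demotion_overridden() are made-up names for whatever bookkeeping we
end up with:

static ssize_t demotion_targets_show(struct device *dev,
				     struct device_attribute *attr, char *buf)
{
	int nid = dev->id;
	nodemask_t *targets = node_demotion_targets(nid);	/* made-up */

	if (nodes_empty(*targets))
		return sysfs_emit(buf, "-\n");

	/* Prefix user-space overridden targets with '*'. */
	return sysfs_emit(buf, "%s%*pbl\n",
			  node_demotion_overridden(nid) ? "*" : "",
			  nodemask_pr_args(targets));
}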

Best Regards,
Huang, Ying
Wei Xu April 27, 2022, 4:27 a.m. UTC | #35
On Tue, Apr 26, 2022 at 1:43 AM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Mon, 2022-04-25 at 13:39 +0530, Aneesh Kumar K V wrote:
> > On 4/25/22 11:40 AM, ying.huang@intel.com wrote:
> > > On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote:
> > > > "ying.huang@intel.com" <ying.huang@intel.com> writes:
> > > >
> > > > > Hi, All,
> > > > >
> > > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
> > > > >
> > > > > [snip]
> > > > >
> > > > > > I think it is necessary to either have per node demotion targets
> > > > > > configuration or the user space interface supported by this patch
> > > > > > series. As we don't have clear consensus on how the user interface
> > > > > > should look like, we can defer the per node demotion target set
> > > > > > interface to future until the real need arises.
> > > > > >
> > > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem
> > > > > > driver, it may be possible that some memory node desired as demotion
> > > > > > target is not detected in the system from dax-device kmem probe path.
> > > > > >
> > > > > > It is also possible that some of the dax-devices are not preferred as
> > > > > > demotion target e.g. HBM, for such devices, node shouldn't be set to
> > > > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> > > > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> > > > > > kernel, but for now this user space interface will be useful to avoid
> > > > > > such devices as demotion targets.
> > > > > >
> > > > > > We can add read only interface to view per node demotion targets
> > > > > > from /sys/devices/system/node/nodeX/demotion_targets, remove
> > > > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> > > > > > make /sys/devices/system/node/demotion_targets writable.
> > > > > >
> > > > > > Huang, Wei, Yang,
> > > > > > What do you suggest?
> > > > >
> > > > > We cannot remove a kernel ABI in practice.  So we need to make it right
> > > > > at the first time.  Let's try to collect some information for the kernel
> > > > > ABI definitation.
> > > > >
> > > > > The below is just a starting point, please add your requirements.
> > > > >
> > > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
> > > > > want to use that as the demotion targets.  But I don't think this is a
> > > > > issue in practice for now, because demote-in-reclaim is disabled by
> > > > > default.
> > > >
> > > > It is not just that the demotion can be disabled. We should be able to
> > > > use demotion on a system where we can find DRAM only NUMA nodes. That
> > > > cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs
> > > > something similar to to N_DEMOTION_TARGETS
> > > >
> > >
> > > Can you show NUMA information of your machines with DRAM-only nodes and
> > > PMEM nodes?  We can try to find the proper demotion order for the
> > > system.  If you can not show it, we can defer N_DEMOTION_TARGETS until
> > > the machine is available.
> >
> >
> > Sure will find one such config. As you might have noticed this is very
> > easy to have in a virtualization setup because the hypervisor can assign
> > memory to a guest VM from a numa node that doesn't have CPU assigned to
> > the same guest. This depends on the other guest VM instance config
> > running on the system. So on any virtualization config that has got
> > persistent memory attached, this can become an easy config to end up with.
> >
>
> Why they want to do that?  I am looking forward to a real issue, not
> theoritical possibility.
>
> >
> > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > >
> > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > > memory node near node 0,
> > > > >
> > > > > available: 3 nodes (0-2)
> > > > > node 0 cpus: 0 1
> > > > > node 0 size: n MB
> > > > > node 0 free: n MB
> > > > > node 1 cpus:
> > > > > node 1 size: n MB
> > > > > node 1 free: n MB
> > > > > node 2 cpus: 2 3
> > > > > node 2 size: n MB
> > > > > node 2 free: n MB
> > > > > node distances:
> > > > > node   0   1   2
> > > > >    0:  10  40  20
> > > > >    1:  40  10  80
> > > > >    2:  20  80  10
> > > > >
> > > > > We have 2 choices,
> > > > >
> > > > > a)
> > > > > node    demotion targets
> > > > > 0       1
> > > > > 2       1
> > > >
> > > > This is achieved by
> > > >
> > > > [PATCH v2 1/5] mm: demotion: Set demotion list differently
> > > >
> > > > >
> > > > > b)
> > > > > node    demotion targets
> > > > > 0       1
> > > > > 2       X
> > > >
> > > >
> > > > >
> > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > default configuration.
> > > > >
> > > > > 3. For machines with HBM (High Bandwidth Memory), as in
> > > > >
> > > > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
> > > > >
> > > > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
> > > > >
> > > > > Although HBM has better performance than DDR, in ACPI SLIT, their
> > > > > distance to CPU is longer.  We need to provide a way to fix this.  The
> > > > > user space ABI is one way.  The desired result will be to use local DDR
> > > > > as demotion targets of local HBM.
> > > >
> > > >
> > > > IMHO the above (2b and 3) can be done using per node demotion targets. Below is
> > > > what I think we could do with a single slow memory NUMA node 4.
> > >
> > > If we can use writable per-node demotion targets as ABI, then we don't
> > > need N_DEMOTION_TARGETS.
> >
> >
> > Not sure I understand that. Yes, once you have a writeable per node
> > demotion target it is easy to build any demotion order.
>
> Yes.
>
> > But that doesn't
> > mean we should not improve the default unless you have reason to say
> > that using N_DEMOTTION_TARGETS breaks any existing config.
> >
>
> Becuase N_DEMOTTION_TARGETS is a new kernel ABI to override the default,
> not the default itself.  [1/5] of this patchset improve the default
> behavior itself, and I think that's good.
>
> Because we must maintain the kernel ABI almost for ever, we need to be
> careful about adding new ABI and add less if possible.  If writable per-
> node demotion targets can address your issue.  Then it's unnecessary to
> add another redundant kernel ABI for that.

I still think the kernel should initialize the per-node demotion order
in a way similar to allocation fallback order and there is no need for
a userspace interface to override per-node demotion order. But I don't
object to such a per-node demotion order override interface proposed
here.

On the other hand, I think it is better to preserve the system-wide
/sys/devices/system/node/demotion_targets as writable.  If the
userspace only wants to specify a specific set of nodes as the
demotion tier and is perfectly fine with the per-node demotion order
generated by the kernel, why should we enforce the userspace to have
to manually define the per-node demotion order as well?

> > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > 4
> > > > 4
> > > > 4
> > > > 4
> > > >
> > > > /sys/devices/system/node# echo 1 > node1/demotion_targets
> > > > bash: echo: write error: Invalid argument
> > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > 4
> > > > 4
> > > > 4
> > > > 4
> > > >
> > > > /sys/devices/system/node# echo 0 > node1/demotion_targets
> > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > 4
> > > > 0
> > > > 4
> > > > 4
> > > >
> > > > /sys/devices/system/node# echo 1 > node0/demotion_targets
> > > > bash: echo: write error: Invalid argument
> > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > 4
> > > > 0
> > > > 4
> > > > 4
> > > >
> > > > Disable demotion for a specific node.
> > > > /sys/devices/system/node# echo > node1/demotion_targets
> > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > 4
> > > >
> > > > 4
> > > > 4
> > > >
> > > > Reset demotion to default
> > > > /sys/devices/system/node# echo -1 > node1/demotion_targets
> > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > 4
> > > > 4
> > > > 4
> > > > 4
> > > >
> > > > When a specific device/NUMA node is used for demotion target via the user interface, it is taken
> > > > out of other NUMA node targets.
> > >
> > > IMHO, we should be careful about interaction between auto-generated and
> > > overridden demotion order.
> > >
> >
> > yes, we should avoid loop between that.
>
> In addition to that, we need to get same result after hot-remove then
> hot-add the same node.  That is, the result should be stable after NOOP.
> I guess we can just always,
>
> - Generate the default demotion order automatically without any
> overriding.
>
> - Apply the overriding, after removing the invalid targets, etc.
>
> > But if you agree for the above
> > ABI we could go ahead and share the implementation code.
>
> I think we need to add a way to distinguish auto-generated and overriden
> demotion targets in the output of nodeX/demotion_targets.  Otherwise it
> looks good to me.
>
> Best Regards,
> Huang, Ying
>
> > > > root@ubuntu-guest:/sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > 4
> > > > 4
> > > > 4
> > > > 4
> > > >
> > > > /sys/devices/system/node# echo 4 > node1/demotion_targets
> > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > >
> > > > 4
> > > >
> > > >
> > > >
> > > > If more than one node requies the same demotion target
> > > > /sys/devices/system/node# echo 4 > node0/demotion_targets
> > > > /sys/devices/system/node# cat node[0-4]/demotion_targets
> > > > 4
> > > > 4
> > > >
> > > >
> > > >
> > > > -aneesh
> > >
> > >
> >
> > -aneesh
>
>
Aneesh Kumar K.V April 27, 2022, 5:06 a.m. UTC | #36
On 4/25/22 10:26 PM, Wei Xu wrote:
> On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
>>

....

>> 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
>>
>> Node 0 & 2 are cpu + dram nodes and node 1 are slow
>> memory node near node 0,
>>
>> available: 3 nodes (0-2)
>> node 0 cpus: 0 1
>> node 0 size: n MB
>> node 0 free: n MB
>> node 1 cpus:
>> node 1 size: n MB
>> node 1 free: n MB
>> node 2 cpus: 2 3
>> node 2 size: n MB
>> node 2 free: n MB
>> node distances:
>> node   0   1   2
>>    0:  10  40  20
>>    1:  40  10  80
>>    2:  20  80  10
>>
>> We have 2 choices,
>>
>> a)
>> node    demotion targets
>> 0       1
>> 2       1
>>
>> b)
>> node    demotion targets
>> 0       1
>> 2       X
>>
>> a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
>> traffic.  Both are OK as defualt configuration.  But some users may
>> prefer the other one.  So we need a user space ABI to override the
>> default configuration.
> 
> I think 2(a) should be the system-wide configuration and 2(b) can be
> achieved with NUMA mempolicy (which needs to be added to demotion).
> 
> In general, we can view the demotion order in a way similar to
> allocation fallback order (after all, if we don't demote or demotion
> lags behind, the allocations will go to these demotion target nodes
> according to the allocation fallback order anyway).  If we initialize
> the demotion order in that way (i.e. every node can demote to any node
> in the next tier, and the priority of the target nodes is sorted for
> each source node), we don't need per-node demotion order override from
> the userspace.  What we need is to specify what nodes should be in
> each tier and support NUMA mempolicy in demotion.
> 

I have been wondering how we would handle this. For example: if an
application has specified an MPOL_BIND policy and restricted its
allocations to Node0 and Node1, should we demote pages allocated by that
application to Node10? The other alternative for that demotion is
swapping. So from the page's point of view, we either demote to slow
memory or page out to swap. But if we demote, we are also breaking the
MPOL_BIND rule.

The above suggests we would need some kind of mempolicy interaction, but
what I am not sure about is how to find the memory policy in the
demotion path.


> Cross-socket demotion should not be too big a problem in practice
> because we can optimize the code to do the demotion from the local CPU
> node (i.e. local writes to the target node and remote read from the
> source node).  The bigger issue is cross-socket memory access onto the
> demoted pages from the applications, which is why NUMA mempolicy is
> important here.
> 
>
-aneesh
Huang, Ying April 27, 2022, 7:11 a.m. UTC | #37
On Mon, 2022-04-25 at 09:56 -0700, Wei Xu wrote:
> On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > Hi, All,
> > 
> > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
> > 
> > [snip]
> > 
> > > I think it is necessary to either have per node demotion targets
> > > configuration or the user space interface supported by this patch
> > > series. As we don't have clear consensus on how the user interface
> > > should look like, we can defer the per node demotion target set
> > > interface to future until the real need arises.
> > > 
> > > Current patch series sets N_DEMOTION_TARGET from dax device kmem
> > > driver, it may be possible that some memory node desired as demotion
> > > target is not detected in the system from dax-device kmem probe path.
> > > 
> > > It is also possible that some of the dax-devices are not preferred as
> > > demotion target e.g. HBM, for such devices, node shouldn't be set to
> > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> > > kernel, but for now this user space interface will be useful to avoid
> > > such devices as demotion targets.
> > > 
> > > We can add read only interface to view per node demotion targets
> > > from /sys/devices/system/node/nodeX/demotion_targets, remove
> > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> > > make /sys/devices/system/node/demotion_targets writable.
> > > 
> > > Huang, Wei, Yang,
> > > What do you suggest?
> > 
> > We cannot remove a kernel ABI in practice.  So we need to make it right
> > at the first time.  Let's try to collect some information for the kernel
> > ABI definitation.
> > 
> > The below is just a starting point, please add your requirements.
> > 
> > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
> > want to use that as the demotion targets.  But I don't think this is a
> > issue in practice for now, because demote-in-reclaim is disabled by
> > default.
> > 
> > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > 
> > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > memory node near node 0,
> > 
> > available: 3 nodes (0-2)
> > node 0 cpus: 0 1
> > node 0 size: n MB
> > node 0 free: n MB
> > node 1 cpus:
> > node 1 size: n MB
> > node 1 free: n MB
> > node 2 cpus: 2 3
> > node 2 size: n MB
> > node 2 free: n MB
> > node distances:
> > node   0   1   2
> >   0:  10  40  20
> >   1:  40  10  80
> >   2:  20  80  10
> > 
> > We have 2 choices,
> > 
> > a)
> > node    demotion targets
> > 0       1
> > 2       1
> > 
> > b)
> > node    demotion targets
> > 0       1
> > 2       X
> > 
> > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > traffic.  Both are OK as defualt configuration.  But some users may
> > prefer the other one.  So we need a user space ABI to override the
> > default configuration.
> 
> I think 2(a) should be the system-wide configuration and 2(b) can be
> achieved with NUMA mempolicy (which needs to be added to demotion).

Unfortunately, some NUMA mempolicy information isn't available at
demotion time; for example, the mempolicy enforced via set_mempolicy() is
per thread.  But I think that cpusets can work for demotion.

> In general, we can view the demotion order in a way similar to
> allocation fallback order (after all, if we don't demote or demotion
> lags behind, the allocations will go to these demotion target nodes
> according to the allocation fallback order anyway).  If we initialize
> the demotion order in that way (i.e. every node can demote to any node
> in the next tier, and the priority of the target nodes is sorted for
> each source node), we don't need per-node demotion order override from
> the userspace.  What we need is to specify what nodes should be in
> each tier and support NUMA mempolicy in demotion.

This sounds interesting.  Tier sounds like a natural and general
concept for these memory types.  It's attractive to use it for the
user space interface too.  For example, we may use that for mem_cgroup
limits of a specific memory type (tier).

And if we take a look at N_DEMOTION_TARGETS again from the "tier"
point of view, the nodes are divided into 2 classes via
N_DEMOTION_TARGETS.

- The nodes without N_DEMOTION_TARGETS are top tier (or tier 0).

- The nodes with N_DEMOTION_TARGETS are non-top tier (or tier 1, 2, 3,
...)

So, another possibility is to fit N_DEMOTION_TARGETS and its
overriding into the "tier" concept too.  !N_DEMOTION_TARGETS == TIER0.

- All nodes start with TIER0

- TIER0 can be cleared for some nodes via e.g. kmem driver

The TIER0 node list can be read or overridden by the user space via
the following interface,

  /sys/devices/system/node/tier0

In the future, if we want to customize more tiers, we can add tier1,
tier2, tier3, .....  For now, we can add just tier0.  That is, the
interface is extensible in the future compared with
.../node/demote_targets.
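
A minimal sketch of what that attribute could look like, assuming a
new node state (call it N_TIER0 here) backs the mask; the state name,
the attribute and the omitted rebuild of the demotion order are all
illustrative, not part of the posted series:

  /* Hypothetical /sys/devices/system/node/tier0 attribute.
   * N_TIER0 is an assumed new entry in enum node_states; a real
   * store hook would also rebuild the demotion order.
   */
  static ssize_t tier0_show(struct device *dev,
                            struct device_attribute *attr, char *buf)
  {
          return sysfs_emit(buf, "%*pbl\n",
                            nodemask_pr_args(&node_states[N_TIER0]));
  }

  static ssize_t tier0_store(struct device *dev,
                             struct device_attribute *attr,
                             const char *buf, size_t count)
  {
          nodemask_t new_mask;

          if (nodelist_parse(buf, new_mask))
                  return -EINVAL;

          /* Only currently present memory nodes can be tier 0. */
          if (!nodes_subset(new_mask, node_states[N_MEMORY]))
                  return -EINVAL;

          node_states[N_TIER0] = new_mask;
          /* ...rebuild the demotion order here... */
          return count;
  }
  static DEVICE_ATTR_RW(tier0);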

This isn't as flexible as the writable per-node demotion targets.  But
it may be enough for most requirements?

Best Regards,
Huang, Ying

> Cross-socket demotion should not be too big a problem in practice
> because we can optimize the code to do the demotion from the local CPU
> node (i.e. local writes to the target node and remote read from the
> source node).  The bigger issue is cross-socket memory access onto the
> demoted pages from the applications, which is why NUMA mempolicy is
> important here.
> 
> > 3. For machines with HBM (High Bandwidth Memory), as in
> > 
> > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
> > 
> > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
> > 
> > Although HBM has better performance than DDR, in ACPI SLIT, their
> > distance to CPU is longer.  We need to provide a way to fix this.  The
> > user space ABI is one way.  The desired result will be to use local DDR
> > as demotion targets of local HBM.
> > 
> > Best Regards,
> > Huang, Ying
> >
Wei Xu April 27, 2022, 4:27 p.m. UTC | #38
On Wed, Apr 27, 2022 at 12:11 AM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Mon, 2022-04-25 at 09:56 -0700, Wei Xu wrote:
> > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > Hi, All,
> > >
> > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
> > >
> > > [snip]
> > >
> > > > I think it is necessary to either have per node demotion targets
> > > > configuration or the user space interface supported by this patch
> > > > series. As we don't have clear consensus on how the user interface
> > > > should look like, we can defer the per node demotion target set
> > > > interface to future until the real need arises.
> > > >
> > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem
> > > > driver, it may be possible that some memory node desired as demotion
> > > > target is not detected in the system from dax-device kmem probe path.
> > > >
> > > > It is also possible that some of the dax-devices are not preferred as
> > > > demotion target e.g. HBM, for such devices, node shouldn't be set to
> > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> > > > kernel, but for now this user space interface will be useful to avoid
> > > > such devices as demotion targets.
> > > >
> > > > We can add read only interface to view per node demotion targets
> > > > from /sys/devices/system/node/nodeX/demotion_targets, remove
> > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> > > > make /sys/devices/system/node/demotion_targets writable.
> > > >
> > > > Huang, Wei, Yang,
> > > > What do you suggest?
> > >
> > > We cannot remove a kernel ABI in practice.  So we need to make it right
> > > at the first time.  Let's try to collect some information for the kernel
> > > ABI definitation.
> > >
> > > The below is just a starting point, please add your requirements.
> > >
> > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
> > > want to use that as the demotion targets.  But I don't think this is a
> > > issue in practice for now, because demote-in-reclaim is disabled by
> > > default.
> > >
> > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > >
> > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > memory node near node 0,
> > >
> > > available: 3 nodes (0-2)
> > > node 0 cpus: 0 1
> > > node 0 size: n MB
> > > node 0 free: n MB
> > > node 1 cpus:
> > > node 1 size: n MB
> > > node 1 free: n MB
> > > node 2 cpus: 2 3
> > > node 2 size: n MB
> > > node 2 free: n MB
> > > node distances:
> > > node   0   1   2
> > >   0:  10  40  20
> > >   1:  40  10  80
> > >   2:  20  80  10
> > >
> > > We have 2 choices,
> > >
> > > a)
> > > node    demotion targets
> > > 0       1
> > > 2       1
> > >
> > > b)
> > > node    demotion targets
> > > 0       1
> > > 2       X
> > >
> > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > traffic.  Both are OK as defualt configuration.  But some users may
> > > prefer the other one.  So we need a user space ABI to override the
> > > default configuration.
> >
> > I think 2(a) should be the system-wide configuration and 2(b) can be
> > achieved with NUMA mempolicy (which needs to be added to demotion).
>
> Unfortunately, some NUMA mempolicy information isn't available at
> demotion time, for example, mempolicy enforced via set_mempolicy() is
> for thread. But I think that cpusets can work for demotion.
>
> > In general, we can view the demotion order in a way similar to
> > allocation fallback order (after all, if we don't demote or demotion
> > lags behind, the allocations will go to these demotion target nodes
> > according to the allocation fallback order anyway).  If we initialize
> > the demotion order in that way (i.e. every node can demote to any node
> > in the next tier, and the priority of the target nodes is sorted for
> > each source node), we don't need per-node demotion order override from
> > the userspace.  What we need is to specify what nodes should be in
> > each tier and support NUMA mempolicy in demotion.
>
> This sounds interesting. Tier sounds like a natural and general concept
> for these memory types. It's attracting to use it for user space
> interface too. For example, we may use that for mem_cgroup limits of a
> specific memory type (tier).
>
> And if we take a look at the N_DEMOTION_TARGETS again from the "tier"
> point of view. The nodes are divided to 2 classes via
> N_DEMOTION_TARGETS.
>
> - The nodes without N_DEMOTION_TARGETS are top tier (or tier 0).
>
> - The nodes with N_DEMOTION_TARGETS are non-top tier (or tier 1, 2, 3,
> ...)
>

Yes, this is one of the main reasons why we (Google) want this interface.

> So, another possibility is to fit N_DEMOTION_TARGETS and its overriding
> into "tier" concept too.  !N_DEMOTION_TARGETS == TIER0.
>
> - All nodes start with TIER0
>
> - TIER0 can be cleared for some nodes via e.g. kmem driver
>
> TIER0 node list can be read or overriden by the user space via the
> following interface,
>
>   /sys/devices/system/node/tier0
>
> In the future, if we want to customize more tiers, we can add tier1,
> tier2, tier3, .....  For now, we can add just tier0.  That is, the
> interface is extensible in the future compared with
> .../node/demote_targets.
>

This more explicit tier definition interface works, too.

> This isn't as flexible as the writable per-node demotion targets.  But
> it may be enough for most requirements?

I would think so. Besides, it doesn't really conflict with the
per-node demotion target interface if we really want to introduce the
latter.

> Best Regards,
> Huang, Ying
>
> > Cross-socket demotion should not be too big a problem in practice
> > because we can optimize the code to do the demotion from the local CPU
> > node (i.e. local writes to the target node and remote read from the
> > source node).  The bigger issue is cross-socket memory access onto the
> > demoted pages from the applications, which is why NUMA mempolicy is
> > important here.
> >
> > > 3. For machines with HBM (High Bandwidth Memory), as in
> > >
> > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
> > >
> > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
> > >
> > > Although HBM has better performance than DDR, in ACPI SLIT, their
> > > distance to CPU is longer.  We need to provide a way to fix this.  The
> > > user space ABI is one way.  The desired result will be to use local DDR
> > > as demotion targets of local HBM.
> > >
> > > Best Regards,
> > > Huang, Ying
> > >
>
>
>
Wei Xu April 27, 2022, 6:27 p.m. UTC | #39
On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 4/25/22 10:26 PM, Wei Xu wrote:
> > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> >>
>
> ....
>
> >> 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> >>
> >> Node 0 & 2 are cpu + dram nodes and node 1 are slow
> >> memory node near node 0,
> >>
> >> available: 3 nodes (0-2)
> >> node 0 cpus: 0 1
> >> node 0 size: n MB
> >> node 0 free: n MB
> >> node 1 cpus:
> >> node 1 size: n MB
> >> node 1 free: n MB
> >> node 2 cpus: 2 3
> >> node 2 size: n MB
> >> node 2 free: n MB
> >> node distances:
> >> node   0   1   2
> >>    0:  10  40  20
> >>    1:  40  10  80
> >>    2:  20  80  10
> >>
> >> We have 2 choices,
> >>
> >> a)
> >> node    demotion targets
> >> 0       1
> >> 2       1
> >>
> >> b)
> >> node    demotion targets
> >> 0       1
> >> 2       X
> >>
> >> a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> >> traffic.  Both are OK as defualt configuration.  But some users may
> >> prefer the other one.  So we need a user space ABI to override the
> >> default configuration.
> >
> > I think 2(a) should be the system-wide configuration and 2(b) can be
> > achieved with NUMA mempolicy (which needs to be added to demotion).
> >
> > In general, we can view the demotion order in a way similar to
> > allocation fallback order (after all, if we don't demote or demotion
> > lags behind, the allocations will go to these demotion target nodes
> > according to the allocation fallback order anyway).  If we initialize
> > the demotion order in that way (i.e. every node can demote to any node
> > in the next tier, and the priority of the target nodes is sorted for
> > each source node), we don't need per-node demotion order override from
> > the userspace.  What we need is to specify what nodes should be in
> > each tier and support NUMA mempolicy in demotion.
> >
>
> I have been wondering how we would handle this. For ex: If an
> application has specified an MPOL_BIND policy and restricted the
> allocation to be from Node0 and Node1, should we demote pages allocated
> by that application
> to Node10? The other alternative for that demotion is swapping. So from
> the page point of view, we either demote to a slow memory or pageout to
> swap. But then if we demote we are also breaking the MPOL_BIND rule.

IMHO, the MPOL_BIND policy should be respected and demotion should be
skipped in such cases.  Such MPOL_BIND policies can be an important
tool for applications to override and control their memory placement
when transparent memory tiering is enabled.  If the application
doesn't want swapping, there are other ways to achieve that (e.g.
mlock, disabling swap globally, setting memcg parameters, etc).

> The above says we would need some kind of mem policy interaction, but
> what I am not sure about is how to find the memory policy in the
> demotion path.

This is indeed an important and challenging problem.  One possible
approach is to retrieve the allowed demotion nodemask from
page_referenced() similar to vm_flags.
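
A rough sketch of that approach, assuming page_referenced_arg grows a
nodemask that each mapped VMA's policy narrows down during the rmap
walk; the new field and the helper are made up for illustration, and
only the per-VMA policy (not the task policy) is considered:

  /* Sketch: accumulate an "allowed demotion" nodemask the same way
   * page_referenced() accumulates vm_flags.
   */
  struct page_referenced_arg {
          int mapcount;
          int referenced;
          unsigned long vm_flags;
          struct mem_cgroup *memcg;
          nodemask_t demotion_allowed;    /* hypothetical addition */
  };

  static void narrow_demotion_mask(struct vm_area_struct *vma,
                                   struct page_referenced_arg *pra)
  {
          struct mempolicy *pol = vma->vm_policy;

          if (pol && pol->mode == MPOL_BIND)
                  nodes_and(pra->demotion_allowed,
                            pra->demotion_allowed, pol->nodes);
  }

Reclaim could then intersect that mask with the node's demotion
targets and fall back to swap only when the intersection is empty.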

>
> > Cross-socket demotion should not be too big a problem in practice
> > because we can optimize the code to do the demotion from the local CPU
> > node (i.e. local writes to the target node and remote read from the
> > source node).  The bigger issue is cross-socket memory access onto the
> > demoted pages from the applications, which is why NUMA mempolicy is
> > important here.
> >
> >
> -aneesh
Huang, Ying April 28, 2022, 12:56 a.m. UTC | #40
On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
> <aneesh.kumar@linux.ibm.com> wrote:
> > 
> > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > > 
> > 
> > ....
> > 
> > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > 
> > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > memory node near node 0,
> > > > 
> > > > available: 3 nodes (0-2)
> > > > node 0 cpus: 0 1
> > > > node 0 size: n MB
> > > > node 0 free: n MB
> > > > node 1 cpus:
> > > > node 1 size: n MB
> > > > node 1 free: n MB
> > > > node 2 cpus: 2 3
> > > > node 2 size: n MB
> > > > node 2 free: n MB
> > > > node distances:
> > > > node   0   1   2
> > > >    0:  10  40  20
> > > >    1:  40  10  80
> > > >    2:  20  80  10
> > > > 
> > > > We have 2 choices,
> > > > 
> > > > a)
> > > > node    demotion targets
> > > > 0       1
> > > > 2       1
> > > > 
> > > > b)
> > > > node    demotion targets
> > > > 0       1
> > > > 2       X
> > > > 
> > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > prefer the other one.  So we need a user space ABI to override the
> > > > default configuration.
> > > 
> > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > > 
> > > In general, we can view the demotion order in a way similar to
> > > allocation fallback order (after all, if we don't demote or demotion
> > > lags behind, the allocations will go to these demotion target nodes
> > > according to the allocation fallback order anyway).  If we initialize
> > > the demotion order in that way (i.e. every node can demote to any node
> > > in the next tier, and the priority of the target nodes is sorted for
> > > each source node), we don't need per-node demotion order override from
> > > the userspace.  What we need is to specify what nodes should be in
> > > each tier and support NUMA mempolicy in demotion.
> > > 
> > 
> > I have been wondering how we would handle this. For ex: If an
> > application has specified an MPOL_BIND policy and restricted the
> > allocation to be from Node0 and Node1, should we demote pages allocated
> > by that application
> > to Node10? The other alternative for that demotion is swapping. So from
> > the page point of view, we either demote to a slow memory or pageout to
> > swap. But then if we demote we are also breaking the MPOL_BIND rule.
> 
> IMHO, the MPOL_BIND policy should be respected and demotion should be
> skipped in such cases.  Such MPOL_BIND policies can be an important
> tool for applications to override and control their memory placement
> when transparent memory tiering is enabled.  If the application
> doesn't want swapping, there are other ways to achieve that (e.g.
> mlock, disabling swap globally, setting memcg parameters, etc).
> 
>
> > The above says we would need some kind of mem policy interaction, but
> > what I am not sure about is how to find the memory policy in the
> > demotion path.
> 
> This is indeed an important and challenging problem.  One possible
> approach is to retrieve the allowed demotion nodemask from
> page_referenced() similar to vm_flags.

This works for the mempolicy in struct vm_area_struct, but not for
that in struct task_struct.  Multiple threads in a process may have
different mempolicies.
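
To illustrate why the task policy is hard to recover later, here is a
small userspace example (not from the series; link with -lnuma) where
two threads of one process bind themselves to different nodes.  A page
touched from either thread carries no record of which thread's policy
applied:

  #define _GNU_SOURCE
  #include <numaif.h>
  #include <pthread.h>
  #include <stdio.h>

  /* set_mempolicy() installs a default policy for the calling thread
   * only; the node numbers below are just examples.
   */
  static void *bind_to_node(void *arg)
  {
          unsigned long node = (unsigned long)arg;
          unsigned long mask = 1UL << node;

          if (set_mempolicy(MPOL_BIND, &mask, 8 * sizeof(mask) + 1))
                  perror("set_mempolicy");
          /* allocations from this thread now come from 'node' only */
          return NULL;
  }

  int main(void)
  {
          pthread_t t0, t1;

          pthread_create(&t0, NULL, bind_to_node, (void *)0UL);
          pthread_create(&t1, NULL, bind_to_node, (void *)1UL);
          pthread_join(t0, NULL);
          pthread_join(t1, NULL);
          return 0;
  }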

Best Regards,
Huang, Ying

> > 
> > > Cross-socket demotion should not be too big a problem in practice
> > > because we can optimize the code to do the demotion from the local CPU
> > > node (i.e. local writes to the target node and remote read from the
> > > source node).  The bigger issue is cross-socket memory access onto the
> > > demoted pages from the applications, which is why NUMA mempolicy is
> > > important here.
> > > 
> > > 
> > -aneesh
Wei Xu April 28, 2022, 4:11 a.m. UTC | #41
On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
> > <aneesh.kumar@linux.ibm.com> wrote:
> > >
> > > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > > <ying.huang@intel.com> wrote:
> > > > >
> > >
> > > ....
> > >
> > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > >
> > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > > memory node near node 0,
> > > > >
> > > > > available: 3 nodes (0-2)
> > > > > node 0 cpus: 0 1
> > > > > node 0 size: n MB
> > > > > node 0 free: n MB
> > > > > node 1 cpus:
> > > > > node 1 size: n MB
> > > > > node 1 free: n MB
> > > > > node 2 cpus: 2 3
> > > > > node 2 size: n MB
> > > > > node 2 free: n MB
> > > > > node distances:
> > > > > node   0   1   2
> > > > >    0:  10  40  20
> > > > >    1:  40  10  80
> > > > >    2:  20  80  10
> > > > >
> > > > > We have 2 choices,
> > > > >
> > > > > a)
> > > > > node    demotion targets
> > > > > 0       1
> > > > > 2       1
> > > > >
> > > > > b)
> > > > > node    demotion targets
> > > > > 0       1
> > > > > 2       X
> > > > >
> > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > default configuration.
> > > >
> > > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > > >
> > > > In general, we can view the demotion order in a way similar to
> > > > allocation fallback order (after all, if we don't demote or demotion
> > > > lags behind, the allocations will go to these demotion target nodes
> > > > according to the allocation fallback order anyway).  If we initialize
> > > > the demotion order in that way (i.e. every node can demote to any node
> > > > in the next tier, and the priority of the target nodes is sorted for
> > > > each source node), we don't need per-node demotion order override from
> > > > the userspace.  What we need is to specify what nodes should be in
> > > > each tier and support NUMA mempolicy in demotion.
> > > >
> > >
> > > I have been wondering how we would handle this. For ex: If an
> > > application has specified an MPOL_BIND policy and restricted the
> > > allocation to be from Node0 and Node1, should we demote pages allocated
> > > by that application
> > > to Node10? The other alternative for that demotion is swapping. So from
> > > the page point of view, we either demote to a slow memory or pageout to
> > > swap. But then if we demote we are also breaking the MPOL_BIND rule.
> >
> > IMHO, the MPOL_BIND policy should be respected and demotion should be
> > skipped in such cases.  Such MPOL_BIND policies can be an important
> > tool for applications to override and control their memory placement
> > when transparent memory tiering is enabled.  If the application
> > doesn't want swapping, there are other ways to achieve that (e.g.
> > mlock, disabling swap globally, setting memcg parameters, etc).
> >
> >
> > > The above says we would need some kind of mem policy interaction, but
> > > what I am not sure about is how to find the memory policy in the
> > > demotion path.
> >
> > This is indeed an important and challenging problem.  One possible
> > approach is to retrieve the allowed demotion nodemask from
> > page_referenced() similar to vm_flags.
>
> This works for mempolicy in struct vm_area_struct, but not for that in
> struct task_struct.  Mutiple threads in a process may have different
> mempolicy.

From vm_area_struct, we can get to mm_struct and then to the owner
task_struct, which has the process mempolicy.

It is indeed a problem when a page is shared by different threads or
different processes that have different thread default mempolicy
values.

On the other hand, it can already support most of the interesting use
cases for demotion (e.g. selecting the demotion node, using mbind to
prevent demotion) by respecting cpuset and vma mempolicies.

> Best Regards,
> Huang, Ying
>
> > >
> > > > Cross-socket demotion should not be too big a problem in practice
> > > > because we can optimize the code to do the demotion from the local CPU
> > > > node (i.e. local writes to the target node and remote read from the
> > > > source node).  The bigger issue is cross-socket memory access onto the
> > > > demoted pages from the applications, which is why NUMA mempolicy is
> > > > important here.
> > > >
> > > >
> > > -aneesh
>
>
Huang, Ying April 28, 2022, 8:37 a.m. UTC | #42
On Wed, 2022-04-27 at 09:27 -0700, Wei Xu wrote:
> On Wed, Apr 27, 2022 at 12:11 AM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Mon, 2022-04-25 at 09:56 -0700, Wei Xu wrote:
> > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > > 
> > > > Hi, All,
> > > > 
> > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
> > > > 
> > > > [snip]
> > > > 
> > > > > I think it is necessary to either have per node demotion targets
> > > > > configuration or the user space interface supported by this patch
> > > > > series. As we don't have clear consensus on how the user interface
> > > > > should look like, we can defer the per node demotion target set
> > > > > interface to future until the real need arises.
> > > > > 
> > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem
> > > > > driver, it may be possible that some memory node desired as demotion
> > > > > target is not detected in the system from dax-device kmem probe path.
> > > > > 
> > > > > It is also possible that some of the dax-devices are not preferred as
> > > > > demotion target e.g. HBM, for such devices, node shouldn't be set to
> > > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish
> > > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the
> > > > > kernel, but for now this user space interface will be useful to avoid
> > > > > such devices as demotion targets.
> > > > > 
> > > > > We can add read only interface to view per node demotion targets
> > > > > from /sys/devices/system/node/nodeX/demotion_targets, remove
> > > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead
> > > > > make /sys/devices/system/node/demotion_targets writable.
> > > > > 
> > > > > Huang, Wei, Yang,
> > > > > What do you suggest?
> > > > 
> > > > We cannot remove a kernel ABI in practice.  So we need to make it right
> > > > at the first time.  Let's try to collect some information for the kernel
> > > > ABI definitation.
> > > > 
> > > > The below is just a starting point, please add your requirements.
> > > > 
> > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't
> > > > want to use that as the demotion targets.  But I don't think this is a
> > > > issue in practice for now, because demote-in-reclaim is disabled by
> > > > default.
> > > > 
> > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > 
> > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > memory node near node 0,
> > > > 
> > > > available: 3 nodes (0-2)
> > > > node 0 cpus: 0 1
> > > > node 0 size: n MB
> > > > node 0 free: n MB
> > > > node 1 cpus:
> > > > node 1 size: n MB
> > > > node 1 free: n MB
> > > > node 2 cpus: 2 3
> > > > node 2 size: n MB
> > > > node 2 free: n MB
> > > > node distances:
> > > > node   0   1   2
> > > >   0:  10  40  20
> > > >   1:  40  10  80
> > > >   2:  20  80  10
> > > > 
> > > > We have 2 choices,
> > > > 
> > > > a)
> > > > node    demotion targets
> > > > 0       1
> > > > 2       1
> > > > 
> > > > b)
> > > > node    demotion targets
> > > > 0       1
> > > > 2       X
> > > > 
> > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > prefer the other one.  So we need a user space ABI to override the
> > > > default configuration.
> > > 
> > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > 
> > Unfortunately, some NUMA mempolicy information isn't available at
> > demotion time, for example, mempolicy enforced via set_mempolicy() is
> > for thread. But I think that cpusets can work for demotion.
> > 
> > > In general, we can view the demotion order in a way similar to
> > > allocation fallback order (after all, if we don't demote or demotion
> > > lags behind, the allocations will go to these demotion target nodes
> > > according to the allocation fallback order anyway).  If we initialize
> > > the demotion order in that way (i.e. every node can demote to any node
> > > in the next tier, and the priority of the target nodes is sorted for
> > > each source node), we don't need per-node demotion order override from
> > > the userspace.  What we need is to specify what nodes should be in
> > > each tier and support NUMA mempolicy in demotion.
> > 
> > This sounds interesting. Tier sounds like a natural and general concept
> > for these memory types. It's attracting to use it for user space
> > interface too. For example, we may use that for mem_cgroup limits of a
> > specific memory type (tier).
> > 
> > And if we take a look at the N_DEMOTION_TARGETS again from the "tier"
> > point of view. The nodes are divided to 2 classes via
> > N_DEMOTION_TARGETS.
> > 
> > - The nodes without N_DEMOTION_TARGETS are top tier (or tier 0).
> > 
> > - The nodes with N_DEMOTION_TARGETS are non-top tier (or tier 1, 2, 3,
> > ...)
> > 
> 
> Yes, this is one of the main reasons why we (Google) want this interface.
> 
> > So, another possibility is to fit N_DEMOTION_TARGETS and its overriding
> > into "tier" concept too.  !N_DEMOTION_TARGETS == TIER0.
> > 
> > - All nodes start with TIER0
> > 
> > - TIER0 can be cleared for some nodes via e.g. kmem driver
> > 
> > TIER0 node list can be read or overriden by the user space via the
> > following interface,
> > 
> >   /sys/devices/system/node/tier0
> > 
> > In the future, if we want to customize more tiers, we can add tier1,
> > tier2, tier3, .....  For now, we can add just tier0.  That is, the
> > interface is extensible in the future compared with
> > .../node/demote_targets.
> > 
> 
> This more explicit tier definition interface works, too.
> 

In addition to making the tiering definition explicit, more
importantly, this makes it much easier to support more than 2 tiers.
For example, for a system with HBM (High Bandwidth Memory), CPU+DRAM,
DRAM only, and PMEM, that is, 3 tiers, we can put HBM in tier 0,
CPU+DRAM and DRAM-only in tier 1, and PMEM in tier 2, either
automatically or via user space overriding.  N_DEMOTION_TARGETS
doesn't extend naturally to support this.
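
A sketch of how such a default assignment could look.  node_is_hbm()
is a placeholder (how HBM is actually recognized, via HMAT or a driver
hint, is left open); node_state() and N_CPU are existing kernel
helpers, and N_DEMOTION_TARGETS is from this series:

  /* Sketch: default tier for a memory node when it comes online. */
  static int default_memory_tier(int nid)
  {
          if (node_is_hbm(nid))           /* placeholder check */
                  return 0;               /* HBM: top tier */
          if (node_state(nid, N_CPU))
                  return 1;               /* CPU + DRAM */
          if (node_state(nid, N_DEMOTION_TARGETS))
                  return 2;               /* e.g. PMEM via dax/kmem */
          return 1;                       /* CPU-less DRAM-only node */
  }

User space overriding would then only have to move nodes between tiers
rather than rewrite per-node demotion target lists.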

Best Regards,
Huang, Ying

> > This isn't as flexible as the writable per-node demotion targets.  But
> > it may be enough for most requirements?
> 
> I would think so. Besides, it doesn't really conflict with the
> per-node demotion target interface if we really want to introduce the
> latter.
> 
> > Best Regards,
> > Huang, Ying
> > 
> > > Cross-socket demotion should not be too big a problem in practice
> > > because we can optimize the code to do the demotion from the local CPU
> > > node (i.e. local writes to the target node and remote read from the
> > > source node).  The bigger issue is cross-socket memory access onto the
> > > demoted pages from the applications, which is why NUMA mempolicy is
> > > important here.
> > > 
> > > > 3. For machines with HBM (High Bandwidth Memory), as in
> > > > 
> > > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/
> > > > 
> > > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41
> > > > 
> > > > Although HBM has better performance than DDR, in ACPI SLIT, their
> > > > distance to CPU is longer.  We need to provide a way to fix this.  The
> > > > user space ABI is one way.  The desired result will be to use local DDR
> > > > as demotion targets of local HBM.
> > > > 
> > > > Best Regards,
> > > > Huang, Ying
> > > > 
> > 
> > 
> >
Yang Shi April 28, 2022, 5:14 p.m. UTC | #43
On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote:
>
> On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> >
> > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
> > > <aneesh.kumar@linux.ibm.com> wrote:
> > > >
> > > > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > > > <ying.huang@intel.com> wrote:
> > > > > >
> > > >
> > > > ....
> > > >
> > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > > >
> > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > > > memory node near node 0,
> > > > > >
> > > > > > available: 3 nodes (0-2)
> > > > > > node 0 cpus: 0 1
> > > > > > node 0 size: n MB
> > > > > > node 0 free: n MB
> > > > > > node 1 cpus:
> > > > > > node 1 size: n MB
> > > > > > node 1 free: n MB
> > > > > > node 2 cpus: 2 3
> > > > > > node 2 size: n MB
> > > > > > node 2 free: n MB
> > > > > > node distances:
> > > > > > node   0   1   2
> > > > > >    0:  10  40  20
> > > > > >    1:  40  10  80
> > > > > >    2:  20  80  10
> > > > > >
> > > > > > We have 2 choices,
> > > > > >
> > > > > > a)
> > > > > > node    demotion targets
> > > > > > 0       1
> > > > > > 2       1
> > > > > >
> > > > > > b)
> > > > > > node    demotion targets
> > > > > > 0       1
> > > > > > 2       X
> > > > > >
> > > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > > default configuration.
> > > > >
> > > > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > > > >
> > > > > In general, we can view the demotion order in a way similar to
> > > > > allocation fallback order (after all, if we don't demote or demotion
> > > > > lags behind, the allocations will go to these demotion target nodes
> > > > > according to the allocation fallback order anyway).  If we initialize
> > > > > the demotion order in that way (i.e. every node can demote to any node
> > > > > in the next tier, and the priority of the target nodes is sorted for
> > > > > each source node), we don't need per-node demotion order override from
> > > > > the userspace.  What we need is to specify what nodes should be in
> > > > > each tier and support NUMA mempolicy in demotion.
> > > > >
> > > >
> > > > I have been wondering how we would handle this. For ex: If an
> > > > application has specified an MPOL_BIND policy and restricted the
> > > > allocation to be from Node0 and Node1, should we demote pages allocated
> > > > by that application
> > > > to Node10? The other alternative for that demotion is swapping. So from
> > > > the page point of view, we either demote to a slow memory or pageout to
> > > > swap. But then if we demote we are also breaking the MPOL_BIND rule.
> > >
> > > IMHO, the MPOL_BIND policy should be respected and demotion should be
> > > skipped in such cases.  Such MPOL_BIND policies can be an important
> > > tool for applications to override and control their memory placement
> > > when transparent memory tiering is enabled.  If the application
> > > doesn't want swapping, there are other ways to achieve that (e.g.
> > > mlock, disabling swap globally, setting memcg parameters, etc).
> > >
> > >
> > > > The above says we would need some kind of mem policy interaction, but
> > > > what I am not sure about is how to find the memory policy in the
> > > > demotion path.
> > >
> > > This is indeed an important and challenging problem.  One possible
> > > approach is to retrieve the allowed demotion nodemask from
> > > page_referenced() similar to vm_flags.
> >
> > This works for mempolicy in struct vm_area_struct, but not for that in
> > struct task_struct.  Mutiple threads in a process may have different
> > mempolicy.
>
> From vm_area_struct, we can get to mm_struct and then to the owner
> task_struct, which has the process mempolicy.
>
> It is indeed a problem when a page is shared by different threads or
> different processes that have different thread default mempolicy
> values.

Sorry for chiming in late; this was a known issue when we were working
on demotion.  Yes, it is hard to handle shared pages and multiple
threads, since mempolicy is applied per thread, so each thread may
have a different mempolicy.  And I don't think this case is rare.  It
is not only mempolicy: cpuset settings may cause a similar problem,
since different threads may have different cpuset settings with
cgroup v1.

If this is really a problem for real-life workloads, we may consider
tackling it for exclusively owned pages first.  Thanks to David's
patches, we now have dedicated flags to identify exclusively owned
pages.
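
A sketch of what "exclusively owned pages first" could mean in the
demotion path, assuming the PG_anon_exclusive bit from David's series
is what we key off; locking and the exact call site are glossed over:

  /* Sketch: only consult the owning task's policy when the page is
   * anonymous and exclusively owned; for shared pages, fall back to
   * the per-VMA policy or skip the check entirely.
   */
  static bool demotion_may_use_task_policy(struct page *page)
  {
          return PageAnon(page) && PageAnonExclusive(page);
  }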

>
> On the other hand, it can already support most interesting use cases
> for demotion (e.g. selecting the demotion node, mbind to prevent
> demotion) by respecting cpuset and vma mempolicies.
>
> > Best Regards,
> > Huang, Ying
> >
> > > >
> > > > > Cross-socket demotion should not be too big a problem in practice
> > > > > because we can optimize the code to do the demotion from the local CPU
> > > > > node (i.e. local writes to the target node and remote read from the
> > > > > source node).  The bigger issue is cross-socket memory access onto the
> > > > > demoted pages from the applications, which is why NUMA mempolicy is
> > > > > important here.
> > > > >
> > > > >
> > > > -aneesh
> >
> >
Chen, Tim C April 28, 2022, 7:30 p.m. UTC | #44
>
>On Wed, 2022-04-27 at 09:27 -0700, Wei Xu wrote:
>> On Wed, Apr 27, 2022 at 12:11 AM ying.huang@intel.com
>> <ying.huang@intel.com> wrote:
>> >
>> > On Mon, 2022-04-25 at 09:56 -0700, Wei Xu wrote:
>> > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
>> > > <ying.huang@intel.com> wrote:
>> > > >
>> > > > Hi, All,
>> > > >
>> > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
>> > > >
>> > > > [snip]
>> > > >
>> > > > > I think it is necessary to either have per node demotion
>> > > > > targets configuration or the user space interface supported by
>> > > > > this patch series. As we don't have clear consensus on how the
>> > > > > user interface should look like, we can defer the per node
>> > > > > demotion target set interface to future until the real need arises.
>> > > > >
>> > > > > Current patch series sets N_DEMOTION_TARGET from dax device
>> > > > > kmem driver, it may be possible that some memory node desired
>> > > > > as demotion target is not detected in the system from dax-device
>kmem probe path.
>> > > > >
>> > > > > It is also possible that some of the dax-devices are not
>> > > > > preferred as demotion target e.g. HBM, for such devices, node
>> > > > > shouldn't be set to N_DEMOTION_TARGETS. In future, Support
>> > > > > should be added to distinguish such dax-devices and not mark
>> > > > > them as N_DEMOTION_TARGETS from the kernel, but for now this
>> > > > > user space interface will be useful to avoid such devices as demotion
>targets.
>> > > > >
>> > > > > We can add read only interface to view per node demotion
>> > > > > targets from /sys/devices/system/node/nodeX/demotion_targets,
>> > > > > remove duplicated /sys/kernel/mm/numa/demotion_target
>> > > > > interface and instead make
>/sys/devices/system/node/demotion_targets writable.
>> > > > >
>> > > > > Huang, Wei, Yang,
>> > > > > What do you suggest?
>> > > >
>> > > > We cannot remove a kernel ABI in practice.  So we need to make
>> > > > it right at the first time.  Let's try to collect some
>> > > > information for the kernel ABI definitation.
>> > > >
>> > > > The below is just a starting point, please add your requirements.
>> > > >
>> > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they
>> > > > don't want to use that as the demotion targets.  But I don't
>> > > > think this is a issue in practice for now, because
>> > > > demote-in-reclaim is disabled by default.
>> > > >
>> > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for
>> > > > example,
>> > > >
>> > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node
>> > > > near node 0,
>> > > >
>> > > > available: 3 nodes (0-2)
>> > > > node 0 cpus: 0 1
>> > > > node 0 size: n MB
>> > > > node 0 free: n MB
>> > > > node 1 cpus:
>> > > > node 1 size: n MB
>> > > > node 1 free: n MB
>> > > > node 2 cpus: 2 3
>> > > > node 2 size: n MB
>> > > > node 2 free: n MB
>> > > > node distances:
>> > > > node   0   1   2
>> > > >   0:  10  40  20
>> > > >   1:  40  10  80
>> > > >   2:  20  80  10
>> > > >
>> > > > We have 2 choices,
>> > > >
>> > > > a)
>> > > > node    demotion targets
>> > > > 0       1
>> > > > 2       1
>> > > >
>> > > > b)
>> > > > node    demotion targets
>> > > > 0       1
>> > > > 2       X
>> > > >
>> > > > a) is good to take advantage of PMEM.  b) is good to reduce
>> > > > cross-socket traffic.  Both are OK as defualt configuration.
>> > > > But some users may prefer the other one.  So we need a user
>> > > > space ABI to override the default configuration.
>> > >
>> > > I think 2(a) should be the system-wide configuration and 2(b) can
>> > > be achieved with NUMA mempolicy (which needs to be added to
>demotion).
>> >
>> > Unfortunately, some NUMA mempolicy information isn't available at
>> > demotion time, for example, mempolicy enforced via set_mempolicy()
>> > is for thread. But I think that cpusets can work for demotion.
>> >
>> > > In general, we can view the demotion order in a way similar to
>> > > allocation fallback order (after all, if we don't demote or
>> > > demotion lags behind, the allocations will go to these demotion
>> > > target nodes according to the allocation fallback order anyway).
>> > > If we initialize the demotion order in that way (i.e. every node
>> > > can demote to any node in the next tier, and the priority of the
>> > > target nodes is sorted for each source node), we don't need
>> > > per-node demotion order override from the userspace.  What we need
>> > > is to specify what nodes should be in each tier and support NUMA
>mempolicy in demotion.
>> >
>> > This sounds interesting. Tier sounds like a natural and general
>> > concept for these memory types. It's attracting to use it for user
>> > space interface too. For example, we may use that for mem_cgroup
>> > limits of a specific memory type (tier).
>> >
>> > And if we take a look at the N_DEMOTION_TARGETS again from the "tier"
>> > point of view. The nodes are divided to 2 classes via
>> > N_DEMOTION_TARGETS.
>> >
>> > - The nodes without N_DEMOTION_TARGETS are top tier (or tier 0).
>> >
>> > - The nodes with N_DEMOTION_TARGETS are non-top tier (or tier 1, 2,
>> > 3,
>> > ...)
>> >
>>
>> Yes, this is one of the main reasons why we (Google) want this interface.
>>
>> > So, another possibility is to fit N_DEMOTION_TARGETS and its
>> > overriding into "tier" concept too.  !N_DEMOTION_TARGETS == TIER0.
>> >
>> > - All nodes start with TIER0
>> >
>> > - TIER0 can be cleared for some nodes via e.g. kmem driver
>> >
>> > TIER0 node list can be read or overriden by the user space via the
>> > following interface,
>> >
>> >   /sys/devices/system/node/tier0
>> >
>> > In the future, if we want to customize more tiers, we can add tier1,
>> > tier2, tier3, .....  For now, we can add just tier0.  That is, the
>> > interface is extensible in the future compared with
>> > .../node/demote_targets.
>> >
>>
>> This more explicit tier definition interface works, too.
>>
>
>In addition to make tiering definition explicit, more importantly, this makes it
>much easier to support more than 2 tiers.  For example, for a system with
>HBM (High Bandwidth Memory), CPU+DRAM, DRAM only, and PMEM, that is,
>3 tiers, we can put HBM in tier 0, CPU+DRAM and DRAM only in tier 1, and
>PMEM in tier 2, automatically, or via user space overridding.
>N_DEMOTION_TARGETS isn't natural to be extended to support this.

Agree with Ying that making the tier explicit is fundamental to the rest of the API.

I think that the tier organization should come before setting the demotion targets,
not the other way round.

That makes the demotion direction clear (a node in tier X demotes to
tier Y, X < Y).  With that, explicitly specifying the demotion targets
or order is only needed when we truly want that level of control.
Otherwise all the higher-numbered tiers are valid targets.
Configuring a tier level for each node is a lot easier than fixing up
the demotion targets for each and every node.

We can prevent demotion target configuration that goes in the wrong
direction by looking at the tier level.
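
For example, a check along these lines (sketch only;
node_memory_tier() is an assumed helper returning the configured tier
of a node) would reject any configured demotion edge that points "up"
the hierarchy:

  /* Sketch: a demotion edge is valid only if it goes from a
   * lower-numbered (faster) tier to a higher-numbered (slower) one.
   */
  static bool demotion_edge_valid(int source_nid, int target_nid)
  {
          return node_memory_tier(source_nid) <
                 node_memory_tier(target_nid);
  }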

Tim
Alistair Popple April 29, 2022, 1:27 a.m. UTC | #45
On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote:
> On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote:
> >
> > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > >
> > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
> > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > >
> > > > > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > > > > <ying.huang@intel.com> wrote:
> > > > > > >
> > > > >
> > > > > ....
> > > > >
> > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > > > >
> > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > > > > memory node near node 0,
> > > > > > >
> > > > > > > available: 3 nodes (0-2)
> > > > > > > node 0 cpus: 0 1
> > > > > > > node 0 size: n MB
> > > > > > > node 0 free: n MB
> > > > > > > node 1 cpus:
> > > > > > > node 1 size: n MB
> > > > > > > node 1 free: n MB
> > > > > > > node 2 cpus: 2 3
> > > > > > > node 2 size: n MB
> > > > > > > node 2 free: n MB
> > > > > > > node distances:
> > > > > > > node   0   1   2
> > > > > > >    0:  10  40  20
> > > > > > >    1:  40  10  80
> > > > > > >    2:  20  80  10
> > > > > > >
> > > > > > > We have 2 choices,
> > > > > > >
> > > > > > > a)
> > > > > > > node    demotion targets
> > > > > > > 0       1
> > > > > > > 2       1
> > > > > > >
> > > > > > > b)
> > > > > > > node    demotion targets
> > > > > > > 0       1
> > > > > > > 2       X
> > > > > > >
> > > > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > > > default configuration.
> > > > > >
> > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > > > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > > > > >
> > > > > > In general, we can view the demotion order in a way similar to
> > > > > > allocation fallback order (after all, if we don't demote or demotion
> > > > > > lags behind, the allocations will go to these demotion target nodes
> > > > > > according to the allocation fallback order anyway).  If we initialize
> > > > > > the demotion order in that way (i.e. every node can demote to any node
> > > > > > in the next tier, and the priority of the target nodes is sorted for
> > > > > > each source node), we don't need per-node demotion order override from
> > > > > > the userspace.  What we need is to specify what nodes should be in
> > > > > > each tier and support NUMA mempolicy in demotion.
> > > > > >
> > > > >
> > > > > I have been wondering how we would handle this. For ex: If an
> > > > > application has specified an MPOL_BIND policy and restricted the
> > > > > allocation to be from Node0 and Node1, should we demote pages allocated
> > > > > by that application
> > > > > to Node10? The other alternative for that demotion is swapping. So from
> > > > > the page point of view, we either demote to a slow memory or pageout to
> > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule.
> > > >
> > > > IMHO, the MPOL_BIND policy should be respected and demotion should be
> > > > skipped in such cases.  Such MPOL_BIND policies can be an important
> > > > tool for applications to override and control their memory placement
> > > > when transparent memory tiering is enabled.  If the application
> > > > doesn't want swapping, there are other ways to achieve that (e.g.
> > > > mlock, disabling swap globally, setting memcg parameters, etc).
> > > >
> > > >
> > > > > The above says we would need some kind of mem policy interaction, but
> > > > > what I am not sure about is how to find the memory policy in the
> > > > > demotion path.
> > > >
> > > > This is indeed an important and challenging problem.  One possible
> > > > approach is to retrieve the allowed demotion nodemask from
> > > > page_referenced() similar to vm_flags.
> > >
> > > This works for mempolicy in struct vm_area_struct, but not for that in
> > > struct task_struct.  Mutiple threads in a process may have different
> > > mempolicy.
> >
> > From vm_area_struct, we can get to mm_struct and then to the owner
> > task_struct, which has the process mempolicy.
> >
> > It is indeed a problem when a page is shared by different threads or
> > different processes that have different thread default mempolicy
> > values.
> 
> Sorry for chiming in late, this is a known issue when we were working
> on demotion. Yes, it is hard to handle the shared pages and multi
> threads since mempolicy is applied to each thread so each thread may
> have different mempolicy. And I don't think this case is rare. And not
> only mempolicy but also may cpuset settings cause the similar problem,
> different threads may have different cpuset settings for cgroupv1.
> 
> If this is really a problem for real life workloads, we may consider
> tackling it for exclusively owned pages first. Thanks to David's
> patches, now we have dedicated flags to tell exclusively owned pages.

One of the problems with demotion, when I last looked, is that it does
almost exactly the opposite of what we want on systems like POWER9
where GPU memory is a CPU-less memory node.

On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate
memory on the GPU node. Under memory pressure demotion should migrate GPU
allocations to the CPU node and finally other slow memory nodes or swap.

Currently, though, demotion considers the GPU node slow memory
(because it is CPU-less), so it will demote CPU memory to GPU memory,
which is a limited resource.  And when trying to allocate GPU memory
with MPOL_BIND/PREFERRED, it will swap everything to disk rather than
demote to CPU memory (which would be preferred).

I'm still looking at this series, but as I understand it, it will help
somewhat because we could make GPU memory the top tier so nothing gets
demoted to it.

However, I wouldn't want to see demotion skipped entirely when a
memory policy such as MPOL_BIND is specified.  For example, most
memory on a GPU node will have some kind of policy specified, and IMHO
it would be better to demote to another node in the mempolicy nodemask
rather than going straight to swap, particularly as GPU memory
capacity tends to be limited in comparison to CPU memory capacity.
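
A sketch of that preference, with made-up helper names
(node_demotion_mask() stands in for whatever per-node target set the
kernel ends up with): pick a demotion target the page's policy still
allows, and only fall back to swap when there is none:

  /* Sketch: 'allowed' is the nodemask derived from the page's
   * mempolicy; NUMA_NO_NODE means "no usable target, let reclaim
   * fall back to swap".
   */
  static int demotion_target_in_mask(int source_nid,
                                     const nodemask_t *allowed)
  {
          nodemask_t candidates;
          int nid;

          nodes_and(candidates, *node_demotion_mask(source_nid),
                    *allowed);
          nid = first_node(candidates);

          return nid < MAX_NUMNODES ? nid : NUMA_NO_NODE;
  }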

> >
> > On the other hand, it can already support most interesting use cases
> > for demotion (e.g. selecting the demotion node, mbind to prevent
> > demotion) by respecting cpuset and vma mempolicies.
> >
> > > Best Regards,
> > > Huang, Ying
> > >
> > > > >
> > > > > > Cross-socket demotion should not be too big a problem in practice
> > > > > > because we can optimize the code to do the demotion from the local CPU
> > > > > > node (i.e. local writes to the target node and remote read from the
> > > > > > source node).  The bigger issue is cross-socket memory access onto the
> > > > > > demoted pages from the applications, which is why NUMA mempolicy is
> > > > > > important here.
> > > > > >
> > > > > >
> > > > > -aneesh
> > >
> > >
> 
>
Huang, Ying April 29, 2022, 2:21 a.m. UTC | #46
On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote:
> On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote:
> > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote:
> > > 
> > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > > 
> > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
> > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > 
> > > > > > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > 
> > > > > > 
> > > > > > ....
> > > > > > 
> > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > > > > > 
> > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > > > > > memory node near node 0,
> > > > > > > > 
> > > > > > > > available: 3 nodes (0-2)
> > > > > > > > node 0 cpus: 0 1
> > > > > > > > node 0 size: n MB
> > > > > > > > node 0 free: n MB
> > > > > > > > node 1 cpus:
> > > > > > > > node 1 size: n MB
> > > > > > > > node 1 free: n MB
> > > > > > > > node 2 cpus: 2 3
> > > > > > > > node 2 size: n MB
> > > > > > > > node 2 free: n MB
> > > > > > > > node distances:
> > > > > > > > node   0   1   2
> > > > > > > >    0:  10  40  20
> > > > > > > >    1:  40  10  80
> > > > > > > >    2:  20  80  10
> > > > > > > > 
> > > > > > > > We have 2 choices,
> > > > > > > > 
> > > > > > > > a)
> > > > > > > > node    demotion targets
> > > > > > > > 0       1
> > > > > > > > 2       1
> > > > > > > > 
> > > > > > > > b)
> > > > > > > > node    demotion targets
> > > > > > > > 0       1
> > > > > > > > 2       X
> > > > > > > > 
> > > > > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > > > > default configuration.
> > > > > > > 
> > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > > > > > > 
> > > > > > > In general, we can view the demotion order in a way similar to
> > > > > > > allocation fallback order (after all, if we don't demote or demotion
> > > > > > > lags behind, the allocations will go to these demotion target nodes
> > > > > > > according to the allocation fallback order anyway).  If we initialize
> > > > > > > the demotion order in that way (i.e. every node can demote to any node
> > > > > > > in the next tier, and the priority of the target nodes is sorted for
> > > > > > > each source node), we don't need per-node demotion order override from
> > > > > > > the userspace.  What we need is to specify what nodes should be in
> > > > > > > each tier and support NUMA mempolicy in demotion.
> > > > > > > 
> > > > > > 
> > > > > > I have been wondering how we would handle this. For ex: If an
> > > > > > application has specified an MPOL_BIND policy and restricted the
> > > > > > allocation to be from Node0 and Node1, should we demote pages allocated
> > > > > > by that application
> > > > > > to Node10? The other alternative for that demotion is swapping. So from
> > > > > > the page point of view, we either demote to a slow memory or pageout to
> > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule.
> > > > > 
> > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be
> > > > > skipped in such cases.  Such MPOL_BIND policies can be an important
> > > > > tool for applications to override and control their memory placement
> > > > > when transparent memory tiering is enabled.  If the application
> > > > > doesn't want swapping, there are other ways to achieve that (e.g.
> > > > > mlock, disabling swap globally, setting memcg parameters, etc).
> > > > > 
> > > > > 
> > > > > > The above says we would need some kind of mem policy interaction, but
> > > > > > what I am not sure about is how to find the memory policy in the
> > > > > > demotion path.
> > > > > 
> > > > > This is indeed an important and challenging problem.  One possible
> > > > > approach is to retrieve the allowed demotion nodemask from
> > > > > page_referenced() similar to vm_flags.
> > > > 
> > > > This works for mempolicy in struct vm_area_struct, but not for that in
> > > > struct task_struct.  Mutiple threads in a process may have different
> > > > mempolicy.
> > > 
> > > From vm_area_struct, we can get to mm_struct and then to the owner
> > > task_struct, which has the process mempolicy.
> > > 
> > > It is indeed a problem when a page is shared by different threads or
> > > different processes that have different thread default mempolicy
> > > values.
> > 
> > Sorry for chiming in late, this is a known issue when we were working
> > on demotion. Yes, it is hard to handle the shared pages and multi
> > threads since mempolicy is applied to each thread so each thread may
> > have different mempolicy. And I don't think this case is rare. And not
> > only mempolicy but also may cpuset settings cause the similar problem,
> > different threads may have different cpuset settings for cgroupv1.
> > 
> > If this is really a problem for real life workloads, we may consider
> > tackling it for exclusively owned pages first. Thanks to David's
> > patches, now we have dedicated flags to tell exclusively owned pages.
> 
> One of the problems with demotion when I last looked is it does almost exactly
> the opposite of what we want on systems like POWER9 where GPU memory is a
> CPU-less memory node.
> 
> On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate
> memory on the GPU node. Under memory pressure demotion should migrate GPU
> allocations to the CPU node and finally other slow memory nodes or swap.
> 
> Currently though demotion considers the GPU node slow memory (because it is
> CPU-less) so will demote CPU memory to GPU memory which is a limited resource.
> And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap
> everything to disk rather than demote to CPU memory (which would be preferred).
>
> I'm still looking at this series but as I understand it it will help somewhat
> because we could make GPU memory the top-tier so nothing gets demoted to it.

Yes.  If we have a way to put GPU memory in the top tier (tier 0) and
CPU+DRAM in tier 1, your requirement can be satisfied.  One way is to
override the auto-generated demotion order via some user space tool.
Another way is to change the GPU driver (I guess that is where the GPU
memory is enumerated and onlined?) to change the tier of the GPU
memory node.
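
A sketch of the driver-side option, assuming a hypothetical
node_set_memory_tier() helper (nothing like it is in the posted
series); the point is only that the driver onlining the coherent GPU
memory node declares it top tier so it never becomes a demotion
target:

  /* Sketch: when the GPU driver onlines its coherent memory node,
   * mark it tier 0 instead of letting it default to a CPU-less (and
   * therefore "slow") node.
   */
  static void gpu_memory_node_online(int nid)
  {
          node_set_memory_tier(nid, 0);              /* hypothetical */
          node_clear_state(nid, N_DEMOTION_TARGETS); /* from this series */
  }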

> However I wouldn't want to see demotion skipped entirely when a memory policy
> such as MPOL_BIND is specified. For example most memory on a GPU node will have
> some kind of policy specified and IMHO it would be better to demote to another
> node in the mempolicy nodemask rather than going straight to swap, particularly
> as GPU memory capacity tends to be limited in comparison to CPU memory
> capacity.
> > 

Can you use MPOL_PREFERRED?  Even if we enforce MPOL_BIND as much as
possible, we will not stop demoting from GPU to DRAM with
MPOL_PREFERRED.  And in addition to demotion, allocation fallback can
be used too to avoid the allocation latency caused by demotion.
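
For reference, a minimal user space sketch of that MPOL_PREFERRED usage
(set_mempolicy() from libnuma's <numaif.h>; treating node 1 as the GPU
node is purely an assumption for illustration):

/*
 * Sketch: prefer the (assumed) GPU node but keep fallback, and therefore
 * demotion, to other nodes possible -- in contrast to MPOL_BIND, which
 * pins allocations to the nodemask.  Link with -lnuma for set_mempolicy().
 */
#include <numaif.h>
#include <stdio.h>

int main(void)
{
        unsigned long nodemask = 1UL << 1;      /* prefer node 1 */

        if (set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8)) {
                perror("set_mempolicy");
                return 1;
        }
        /* Allocations from here on prefer node 1 but may fall back. */
        return 0;
}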

This is another example of a system with 3 tiers if PMEM is installed in
this machine too.

Best Regards,
Huang, Ying

> > > On the other hand, it can already support most interesting use cases
> > > for demotion (e.g. selecting the demotion node, mbind to prevent
> > > demotion) by respecting cpuset and vma mempolicies.
> > > 
> > > > Best Regards,
> > > > Huang, Ying
> > > > 
> > > > > > 
> > > > > > > Cross-socket demotion should not be too big a problem in practice
> > > > > > > because we can optimize the code to do the demotion from the local CPU
> > > > > > > node (i.e. local writes to the target node and remote read from the
> > > > > > > source node).  The bigger issue is cross-socket memory access onto the
> > > > > > > demoted pages from the applications, which is why NUMA mempolicy is
> > > > > > > important here.
> > > > > > > 
> > > > > > > 
> > > > > > -aneesh
> > > > 
> > > > 
> > 
> > 
> 
> 
> 
>
Wei Xu April 29, 2022, 2:58 a.m. UTC | #47
On Thu, Apr 28, 2022 at 7:21 PM ying.huang@intel.com
<ying.huang@intel.com> wrote:
>
> On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote:
> > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote:
> > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote:
> > > >
> > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com
> > > > <ying.huang@intel.com> wrote:
> > > > >
> > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
> > > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > >
> > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > >
> > > > > > >
> > > > > > > ....
> > > > > > >
> > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > > > > > >
> > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > > > > > > memory node near node 0,
> > > > > > > > >
> > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > node 0 size: n MB
> > > > > > > > > node 0 free: n MB
> > > > > > > > > node 1 cpus:
> > > > > > > > > node 1 size: n MB
> > > > > > > > > node 1 free: n MB
> > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > node 2 size: n MB
> > > > > > > > > node 2 free: n MB
> > > > > > > > > node distances:
> > > > > > > > > node   0   1   2
> > > > > > > > >    0:  10  40  20
> > > > > > > > >    1:  40  10  80
> > > > > > > > >    2:  20  80  10
> > > > > > > > >
> > > > > > > > > We have 2 choices,
> > > > > > > > >
> > > > > > > > > a)
> > > > > > > > > node    demotion targets
> > > > > > > > > 0       1
> > > > > > > > > 2       1
> > > > > > > > >
> > > > > > > > > b)
> > > > > > > > > node    demotion targets
> > > > > > > > > 0       1
> > > > > > > > > 2       X
> > > > > > > > >
> > > > > > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > > > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > > > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > > > > > default configuration.
> > > > > > > >
> > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > > > > > > >
> > > > > > > > In general, we can view the demotion order in a way similar to
> > > > > > > > allocation fallback order (after all, if we don't demote or demotion
> > > > > > > > lags behind, the allocations will go to these demotion target nodes
> > > > > > > > according to the allocation fallback order anyway).  If we initialize
> > > > > > > > the demotion order in that way (i.e. every node can demote to any node
> > > > > > > > in the next tier, and the priority of the target nodes is sorted for
> > > > > > > > each source node), we don't need per-node demotion order override from
> > > > > > > > the userspace.  What we need is to specify what nodes should be in
> > > > > > > > each tier and support NUMA mempolicy in demotion.
> > > > > > > >
> > > > > > >
> > > > > > > I have been wondering how we would handle this. For ex: If an
> > > > > > > application has specified an MPOL_BIND policy and restricted the
> > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated
> > > > > > > by that application
> > > > > > > to Node10? The other alternative for that demotion is swapping. So from
> > > > > > > the page point of view, we either demote to a slow memory or pageout to
> > > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule.
> > > > > >
> > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be
> > > > > > skipped in such cases.  Such MPOL_BIND policies can be an important
> > > > > > tool for applications to override and control their memory placement
> > > > > > when transparent memory tiering is enabled.  If the application
> > > > > > doesn't want swapping, there are other ways to achieve that (e.g.
> > > > > > mlock, disabling swap globally, setting memcg parameters, etc).
> > > > > >
> > > > > >
> > > > > > > The above says we would need some kind of mem policy interaction, but
> > > > > > > what I am not sure about is how to find the memory policy in the
> > > > > > > demotion path.
> > > > > >
> > > > > > This is indeed an important and challenging problem.  One possible
> > > > > > approach is to retrieve the allowed demotion nodemask from
> > > > > > page_referenced() similar to vm_flags.
> > > > >
> > > > > This works for mempolicy in struct vm_area_struct, but not for that in
> > > > > struct task_struct.  Mutiple threads in a process may have different
> > > > > mempolicy.
> > > >
> > > > From vm_area_struct, we can get to mm_struct and then to the owner
> > > > task_struct, which has the process mempolicy.
> > > >
> > > > It is indeed a problem when a page is shared by different threads or
> > > > different processes that have different thread default mempolicy
> > > > values.
> > >
> > > Sorry for chiming in late, this is a known issue when we were working
> > > on demotion. Yes, it is hard to handle the shared pages and multi
> > > threads since mempolicy is applied to each thread so each thread may
> > > have different mempolicy. And I don't think this case is rare. And not
> > > only mempolicy but also may cpuset settings cause the similar problem,
> > > different threads may have different cpuset settings for cgroupv1.
> > >
> > > If this is really a problem for real life workloads, we may consider
> > > tackling it for exclusively owned pages first. Thanks to David's
> > > patches, now we have dedicated flags to tell exclusively owned pages.
> >
> > One of the problems with demotion when I last looked is it does almost exactly
> > the opposite of what we want on systems like POWER9 where GPU memory is a
> > CPU-less memory node.
> >
> > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate
> > memory on the GPU node. Under memory pressure demotion should migrate GPU
> > allocations to the CPU node and finally other slow memory nodes or swap.
> >
> > Currently though demotion considers the GPU node slow memory (because it is
> > CPU-less) so will demote CPU memory to GPU memory which is a limited resource.
> > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap
> > everything to disk rather than demote to CPU memory (which would be preferred).
> >
> > I'm still looking at this series but as I understand it it will help somewhat
> > because we could make GPU memory the top-tier so nothing gets demoted to it.
>
> Yes.  If we have a way to put GPU memory in top-tier (tier 0) and
> CPU+DRAM in tier 1.  Your requirement can be satisfied.  One way is to
> override the auto-generated demotion order via some user space tool.
> Another way is to change the GPU driver (I guess where the GPU memory is
> enumerated and onlined?) to change the tier of GPU memory node.
>
> > However I wouldn't want to see demotion skipped entirely when a memory policy
> > such as MPOL_BIND is specified. For example most memory on a GPU node will have
> > some kind of policy specified and IMHO it would be better to demote to another
> > node in the mempolicy nodemask rather than going straight to swap, particularly
> > as GPU memory capacity tends to be limited in comparison to CPU memory
> > capacity.
> > >
>
> Can you use MPOL_PREFERRED?  Even if we enforce MPOL_BIND as much as
> possible, we will not stop demoting from GPU to DRAM with
> MPOL_PREFERRED.  And in addition to demotion, allocation fallbacking can
> be used too to avoid allocation latency caused by demotion.

I expect that MPOL_BIND can be used to either prevent demotion or
select a particular demotion node/nodemask. It all depends on the
mempolicy nodemask specified by MPOL_BIND.
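
To make the two cases concrete, here is a hedged application-side sketch
using mbind() from <numaif.h>.  Node numbering follows the 3-node example
quoted above (nodes 0/2 DRAM, node 1 PMEM); that mapping is an assumption.
Link with -lnuma.

/*
 * Sketch: the same MPOL_BIND call either rules demotion out or selects
 * the demotion target, depending on whether a lower-tier node is in the
 * nodemask.
 */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>

static int bind_range(void *addr, size_t len, unsigned long nodemask)
{
        return mbind(addr, len, MPOL_BIND, &nodemask, sizeof(nodemask) * 8, 0);
}

int main(void)
{
        size_t len = 2UL << 20;                 /* 2MB per range */
        void *a = aligned_alloc(4096, len);
        void *b = aligned_alloc(4096, len);

        if (!a || !b)
                return 1;

        /* DRAM only (node 0): no lower-tier node in the mask, so under the
         * behaviour discussed here these pages would be swapped, not demoted. */
        if (bind_range(a, len, 1UL << 0))
                perror("mbind a");

        /* DRAM + PMEM (nodes 0 and 1): demotion restricted to node 1. */
        if (bind_range(b, len, (1UL << 0) | (1UL << 1)))
                perror("mbind b");

        return 0;
}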

> This is another example of a system with 3 tiers if PMEM is installed in
> this machine too.
>
> Best Regards,
> Huang, Ying
>
> > > > On the other hand, it can already support most interesting use cases
> > > > for demotion (e.g. selecting the demotion node, mbind to prevent
> > > > demotion) by respecting cpuset and vma mempolicies.
> > > >
> > > > > Best Regards,
> > > > > Huang, Ying
> > > > >
> > > > > > >
> > > > > > > > Cross-socket demotion should not be too big a problem in practice
> > > > > > > > because we can optimize the code to do the demotion from the local CPU
> > > > > > > > node (i.e. local writes to the target node and remote read from the
> > > > > > > > source node).  The bigger issue is cross-socket memory access onto the
> > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is
> > > > > > > > important here.
> > > > > > > >
> > > > > > > >
> > > > > > > -aneesh
> > > > >
> > > > >
> > >
> > >
> >
> >
> >
> >
>
>
Huang, Ying April 29, 2022, 3:27 a.m. UTC | #48
On Thu, 2022-04-28 at 19:58 -0700, Wei Xu wrote:
> On Thu, Apr 28, 2022 at 7:21 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> > 
> > On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote:
> > > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote:
> > > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote:
> > > > > 
> > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com
> > > > > <ying.huang@intel.com> wrote:
> > > > > > 
> > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> > > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
> > > > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > > > 
> > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > 
> > > > > > > > 
> > > > > > > > ....
> > > > > > > > 
> > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > > > > > > > 
> > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > > > > > > > memory node near node 0,
> > > > > > > > > > 
> > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > node 0 size: n MB
> > > > > > > > > > node 0 free: n MB
> > > > > > > > > > node 1 cpus:
> > > > > > > > > > node 1 size: n MB
> > > > > > > > > > node 1 free: n MB
> > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > node 2 size: n MB
> > > > > > > > > > node 2 free: n MB
> > > > > > > > > > node distances:
> > > > > > > > > > node   0   1   2
> > > > > > > > > >    0:  10  40  20
> > > > > > > > > >    1:  40  10  80
> > > > > > > > > >    2:  20  80  10
> > > > > > > > > > 
> > > > > > > > > > We have 2 choices,
> > > > > > > > > > 
> > > > > > > > > > a)
> > > > > > > > > > node    demotion targets
> > > > > > > > > > 0       1
> > > > > > > > > > 2       1
> > > > > > > > > > 
> > > > > > > > > > b)
> > > > > > > > > > node    demotion targets
> > > > > > > > > > 0       1
> > > > > > > > > > 2       X
> > > > > > > > > > 
> > > > > > > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > > > > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > > > > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > > > > > > default configuration.
> > > > > > > > > 
> > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > > > > > > > > 
> > > > > > > > > In general, we can view the demotion order in a way similar to
> > > > > > > > > allocation fallback order (after all, if we don't demote or demotion
> > > > > > > > > lags behind, the allocations will go to these demotion target nodes
> > > > > > > > > according to the allocation fallback order anyway).  If we initialize
> > > > > > > > > the demotion order in that way (i.e. every node can demote to any node
> > > > > > > > > in the next tier, and the priority of the target nodes is sorted for
> > > > > > > > > each source node), we don't need per-node demotion order override from
> > > > > > > > > the userspace.  What we need is to specify what nodes should be in
> > > > > > > > > each tier and support NUMA mempolicy in demotion.
> > > > > > > > > 
> > > > > > > > 
> > > > > > > > I have been wondering how we would handle this. For ex: If an
> > > > > > > > application has specified an MPOL_BIND policy and restricted the
> > > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated
> > > > > > > > by that application
> > > > > > > > to Node10? The other alternative for that demotion is swapping. So from
> > > > > > > > the page point of view, we either demote to a slow memory or pageout to
> > > > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule.
> > > > > > > 
> > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be
> > > > > > > skipped in such cases.  Such MPOL_BIND policies can be an important
> > > > > > > tool for applications to override and control their memory placement
> > > > > > > when transparent memory tiering is enabled.  If the application
> > > > > > > doesn't want swapping, there are other ways to achieve that (e.g.
> > > > > > > mlock, disabling swap globally, setting memcg parameters, etc).
> > > > > > > 
> > > > > > > 
> > > > > > > > The above says we would need some kind of mem policy interaction, but
> > > > > > > > what I am not sure about is how to find the memory policy in the
> > > > > > > > demotion path.
> > > > > > > 
> > > > > > > This is indeed an important and challenging problem.  One possible
> > > > > > > approach is to retrieve the allowed demotion nodemask from
> > > > > > > page_referenced() similar to vm_flags.
> > > > > > 
> > > > > > This works for mempolicy in struct vm_area_struct, but not for that in
> > > > > > struct task_struct.  Mutiple threads in a process may have different
> > > > > > mempolicy.
> > > > > 
> > > > > From vm_area_struct, we can get to mm_struct and then to the owner
> > > > > task_struct, which has the process mempolicy.
> > > > > 
> > > > > It is indeed a problem when a page is shared by different threads or
> > > > > different processes that have different thread default mempolicy
> > > > > values.
> > > > 
> > > > Sorry for chiming in late, this is a known issue when we were working
> > > > on demotion. Yes, it is hard to handle the shared pages and multi
> > > > threads since mempolicy is applied to each thread so each thread may
> > > > have different mempolicy. And I don't think this case is rare. And not
> > > > only mempolicy but also may cpuset settings cause the similar problem,
> > > > different threads may have different cpuset settings for cgroupv1.
> > > > 
> > > > If this is really a problem for real life workloads, we may consider
> > > > tackling it for exclusively owned pages first. Thanks to David's
> > > > patches, now we have dedicated flags to tell exclusively owned pages.
> > > 
> > > One of the problems with demotion when I last looked is it does almost exactly
> > > the opposite of what we want on systems like POWER9 where GPU memory is a
> > > CPU-less memory node.
> > > 
> > > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate
> > > memory on the GPU node. Under memory pressure demotion should migrate GPU
> > > allocations to the CPU node and finally other slow memory nodes or swap.
> > > 
> > > Currently though demotion considers the GPU node slow memory (because it is
> > > CPU-less) so will demote CPU memory to GPU memory which is a limited resource.
> > > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap
> > > everything to disk rather than demote to CPU memory (which would be preferred).
> > > 
> > > I'm still looking at this series but as I understand it it will help somewhat
> > > because we could make GPU memory the top-tier so nothing gets demoted to it.
> > 
> > Yes.  If we have a way to put GPU memory in top-tier (tier 0) and
> > CPU+DRAM in tier 1.  Your requirement can be satisfied.  One way is to
> > override the auto-generated demotion order via some user space tool.
> > Another way is to change the GPU driver (I guess where the GPU memory is
> > enumerated and onlined?) to change the tier of GPU memory node.
> > 
> > > However I wouldn't want to see demotion skipped entirely when a memory policy
> > > such as MPOL_BIND is specified. For example most memory on a GPU node will have
> > > some kind of policy specified and IMHO it would be better to demote to another
> > > node in the mempolicy nodemask rather than going straight to swap, particularly
> > > as GPU memory capacity tends to be limited in comparison to CPU memory
> > > capacity.
> > > > 
> > 
> > Can you use MPOL_PREFERRED?  Even if we enforce MPOL_BIND as much as
> > possible, we will not stop demoting from GPU to DRAM with
> > MPOL_PREFERRED.  And in addition to demotion, allocation fallbacking can
> > be used too to avoid allocation latency caused by demotion.
> 
> I expect that MPOL_BIND can be used to either prevent demotion or
> select a particular demotion node/nodemask. It all depends on the
> mempolicy nodemask specified by MPOL_BIND.

Yes.  I think so too.

Best Regards,
Huang, Ying

> > This is another example of a system with 3 tiers if PMEM is installed in
> > this machine too.
> > 
> > Best Regards,
> > Huang, Ying
> > 
> > > > > On the other hand, it can already support most interesting use cases
> > > > > for demotion (e.g. selecting the demotion node, mbind to prevent
> > > > > demotion) by respecting cpuset and vma mempolicies.
> > > > > 
> > > > > > Best Regards,
> > > > > > Huang, Ying
> > > > > > 
> > > > > > > > 
> > > > > > > > > Cross-socket demotion should not be too big a problem in practice
> > > > > > > > > because we can optimize the code to do the demotion from the local CPU
> > > > > > > > > node (i.e. local writes to the target node and remote read from the
> > > > > > > > > source node).  The bigger issue is cross-socket memory access onto the
> > > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is
> > > > > > > > > important here.
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > -aneesh
> > > > > > 
> > > > > > 
> > > > 
> > > > 
> > > 
> > > 
> > > 
> > > 
> > 
> >
Alistair Popple April 29, 2022, 4:45 a.m. UTC | #49
On Friday, 29 April 2022 1:27:36 PM AEST ying.huang@intel.com wrote:
> On Thu, 2022-04-28 at 19:58 -0700, Wei Xu wrote:
> > On Thu, Apr 28, 2022 at 7:21 PM ying.huang@intel.com
> > <ying.huang@intel.com> wrote:
> > > 
> > > On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote:
> > > > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote:
> > > > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote:
> > > > > > 
> > > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com
> > > > > > <ying.huang@intel.com> wrote:
> > > > > > > 
> > > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> > > > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
> > > > > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > > > > 
> > > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > ....
> > > > > > > > > 
> > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > > > > > > > > 
> > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > > > > > > > > memory node near node 0,
> > > > > > > > > > > 
> > > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > > node 0 size: n MB
> > > > > > > > > > > node 0 free: n MB
> > > > > > > > > > > node 1 cpus:
> > > > > > > > > > > node 1 size: n MB
> > > > > > > > > > > node 1 free: n MB
> > > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > > node 2 size: n MB
> > > > > > > > > > > node 2 free: n MB
> > > > > > > > > > > node distances:
> > > > > > > > > > > node   0   1   2
> > > > > > > > > > >    0:  10  40  20
> > > > > > > > > > >    1:  40  10  80
> > > > > > > > > > >    2:  20  80  10
> > > > > > > > > > > 
> > > > > > > > > > > We have 2 choices,
> > > > > > > > > > > 
> > > > > > > > > > > a)
> > > > > > > > > > > node    demotion targets
> > > > > > > > > > > 0       1
> > > > > > > > > > > 2       1
> > > > > > > > > > > 
> > > > > > > > > > > b)
> > > > > > > > > > > node    demotion targets
> > > > > > > > > > > 0       1
> > > > > > > > > > > 2       X
> > > > > > > > > > > 
> > > > > > > > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > > > > > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > > > > > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > > > > > > > default configuration.
> > > > > > > > > > 
> > > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > > > > > > > > > 
> > > > > > > > > > In general, we can view the demotion order in a way similar to
> > > > > > > > > > allocation fallback order (after all, if we don't demote or demotion
> > > > > > > > > > lags behind, the allocations will go to these demotion target nodes
> > > > > > > > > > according to the allocation fallback order anyway).  If we initialize
> > > > > > > > > > the demotion order in that way (i.e. every node can demote to any node
> > > > > > > > > > in the next tier, and the priority of the target nodes is sorted for
> > > > > > > > > > each source node), we don't need per-node demotion order override from
> > > > > > > > > > the userspace.  What we need is to specify what nodes should be in
> > > > > > > > > > each tier and support NUMA mempolicy in demotion.
> > > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > I have been wondering how we would handle this. For ex: If an
> > > > > > > > > application has specified an MPOL_BIND policy and restricted the
> > > > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated
> > > > > > > > > by that application
> > > > > > > > > to Node10? The other alternative for that demotion is swapping. So from
> > > > > > > > > the page point of view, we either demote to a slow memory or pageout to
> > > > > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule.
> > > > > > > > 
> > > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be
> > > > > > > > skipped in such cases.  Such MPOL_BIND policies can be an important
> > > > > > > > tool for applications to override and control their memory placement
> > > > > > > > when transparent memory tiering is enabled.  If the application
> > > > > > > > doesn't want swapping, there are other ways to achieve that (e.g.
> > > > > > > > mlock, disabling swap globally, setting memcg parameters, etc).
> > > > > > > > 
> > > > > > > > 
> > > > > > > > > The above says we would need some kind of mem policy interaction, but
> > > > > > > > > what I am not sure about is how to find the memory policy in the
> > > > > > > > > demotion path.
> > > > > > > > 
> > > > > > > > This is indeed an important and challenging problem.  One possible
> > > > > > > > approach is to retrieve the allowed demotion nodemask from
> > > > > > > > page_referenced() similar to vm_flags.
> > > > > > > 
> > > > > > > This works for mempolicy in struct vm_area_struct, but not for that in
> > > > > > > struct task_struct.  Mutiple threads in a process may have different
> > > > > > > mempolicy.
> > > > > > 
> > > > > > From vm_area_struct, we can get to mm_struct and then to the owner
> > > > > > task_struct, which has the process mempolicy.
> > > > > > 
> > > > > > It is indeed a problem when a page is shared by different threads or
> > > > > > different processes that have different thread default mempolicy
> > > > > > values.
> > > > > 
> > > > > Sorry for chiming in late, this is a known issue when we were working
> > > > > on demotion. Yes, it is hard to handle the shared pages and multi
> > > > > threads since mempolicy is applied to each thread so each thread may
> > > > > have different mempolicy. And I don't think this case is rare. And not
> > > > > only mempolicy but also may cpuset settings cause the similar problem,
> > > > > different threads may have different cpuset settings for cgroupv1.
> > > > > 
> > > > > If this is really a problem for real life workloads, we may consider
> > > > > tackling it for exclusively owned pages first. Thanks to David's
> > > > > patches, now we have dedicated flags to tell exclusively owned pages.
> > > > 
> > > > One of the problems with demotion when I last looked is it does almost exactly
> > > > the opposite of what we want on systems like POWER9 where GPU memory is a
> > > > CPU-less memory node.
> > > > 
> > > > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate
> > > > memory on the GPU node. Under memory pressure demotion should migrate GPU
> > > > allocations to the CPU node and finally other slow memory nodes or swap.
> > > > 
> > > > Currently though demotion considers the GPU node slow memory (because it is
> > > > CPU-less) so will demote CPU memory to GPU memory which is a limited resource.
> > > > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap
> > > > everything to disk rather than demote to CPU memory (which would be preferred).
> > > > 
> > > > I'm still looking at this series but as I understand it it will help somewhat
> > > > because we could make GPU memory the top-tier so nothing gets demoted to it.
> > > 
> > > Yes.  If we have a way to put GPU memory in top-tier (tier 0) and
> > > CPU+DRAM in tier 1.  Your requirement can be satisfied.  One way is to
> > > override the auto-generated demotion order via some user space tool.
> > > Another way is to change the GPU driver (I guess where the GPU memory is
> > > enumerated and onlined?) to change the tier of GPU memory node.

Yes, although I think in this case it would be firmware that determines the
memory tiers (similar to ACPI HMAT, which I saw discussed somewhere here). I
agree, though, that it's a system-level property that in an ideal world
shouldn't need overriding from userspace. However, being able to override it
with a user space tool could be useful.

> > > > However I wouldn't want to see demotion skipped entirely when a memory policy
> > > > such as MPOL_BIND is specified. For example most memory on a GPU node will have
> > > > some kind of policy specified and IMHO it would be better to demote to another
> > > > node in the mempolicy nodemask rather than going straight to swap, particularly
> > > > as GPU memory capacity tends to be limited in comparison to CPU memory
> > > > capacity.
> > > > > 
> > > 
> > > Can you use MPOL_PREFERRED?  Even if we enforce MPOL_BIND as much as
> > > possible, we will not stop demoting from GPU to DRAM with
> > > MPOL_PREFERRED.  And in addition to demotion, allocation fallbacking can
> > > be used too to avoid allocation latency caused by demotion.

I think so. It's been a little while since I last looked at this, but I was
under the impression MPOL_PREFERRED didn't do direct reclaim (and therefore
wouldn't trigger demotion, so once GPU memory was full it effectively became a
no-op). However, looking at the source I don't think that's the case now - if
I'm understanding correctly, MPOL_PREFERRED will do reclaim/demotion.

The other problem with MPOL_PREFERRED is that it doesn't allow the fallback
nodes to be specified. I was hoping the new MPOL_PREFERRED_MANY and
set_mempolicy_home_node() would help here, but currently that does disable
reclaim (and therefore demotion) in the first pass.
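
For what it's worth, a minimal sketch of combining the two: an
MPOL_PREFERRED_MANY policy set via mbind() plus the
set_mempolicy_home_node() syscall.  The guarded constants (and the x86_64
syscall number) are assumptions for older headers, as is treating node 1
as the GPU node.  Link with -lnuma for mbind().

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5
#endif
#ifndef __NR_set_mempolicy_home_node
#define __NR_set_mempolicy_home_node 450        /* x86_64; assumption for older headers */
#endif

int main(void)
{
        size_t len = 2UL << 20;
        void *p = aligned_alloc(4096, len);
        unsigned long mask = (1UL << 0) | (1UL << 1);   /* nodes 0 and 1 */

        if (!p)
                return 1;
        if (mbind(p, len, MPOL_PREFERRED_MANY, &mask, sizeof(mask) * 8, 0)) {
                perror("mbind");
                return 1;
        }
        /* Try the (assumed GPU) node 1 first, fall back to node 0. */
        if (syscall(__NR_set_mempolicy_home_node, (unsigned long)p, len, 1UL, 0UL)) {
                perror("set_mempolicy_home_node");
                return 1;
        }
        return 0;
}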

However, that problem is tangential to this series and I can look at it
separately. My main aim here, given you were looking at requirements, was just
to raise this as a slightly different use case (one where the CPU isn't the top
tier).

Thanks for looking into all this.

 - Alistair

> > I expect that MPOL_BIND can be used to either prevent demotion or
> > select a particular demotion node/nodemask. It all depends on the
> > mempolicy nodemask specified by MPOL_BIND.
> 
> Yes.  I think so too.
> 
> Best Regards,
> Huang, Ying
> 
> > > This is another example of a system with 3 tiers if PMEM is installed in
> > > this machine too.
> > > 
> > > Best Regards,
> > > Huang, Ying
> > > 
> > > > > > On the other hand, it can already support most interesting use cases
> > > > > > for demotion (e.g. selecting the demotion node, mbind to prevent
> > > > > > demotion) by respecting cpuset and vma mempolicies.
> > > > > > 
> > > > > > > Best Regards,
> > > > > > > Huang, Ying
> > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Cross-socket demotion should not be too big a problem in practice
> > > > > > > > > > because we can optimize the code to do the demotion from the local CPU
> > > > > > > > > > node (i.e. local writes to the target node and remote read from the
> > > > > > > > > > source node).  The bigger issue is cross-socket memory access onto the
> > > > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is
> > > > > > > > > > important here.
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > -aneesh
> > > > > > > 
> > > > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > > 
> 
> 
>
Yang Shi April 29, 2022, 6:52 p.m. UTC | #50
On Thu, Apr 28, 2022 at 7:59 PM Wei Xu <weixugc@google.com> wrote:
>
> On Thu, Apr 28, 2022 at 7:21 PM ying.huang@intel.com
> <ying.huang@intel.com> wrote:
> >
> > On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote:
> > > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote:
> > > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote:
> > > > >
> > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com
> > > > > <ying.huang@intel.com> wrote:
> > > > > >
> > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> > > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
> > > > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > > >
> > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > >
> > > > > > > >
> > > > > > > > ....
> > > > > > > >
> > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > > > > > > >
> > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > > > > > > > memory node near node 0,
> > > > > > > > > >
> > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > node 0 size: n MB
> > > > > > > > > > node 0 free: n MB
> > > > > > > > > > node 1 cpus:
> > > > > > > > > > node 1 size: n MB
> > > > > > > > > > node 1 free: n MB
> > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > node 2 size: n MB
> > > > > > > > > > node 2 free: n MB
> > > > > > > > > > node distances:
> > > > > > > > > > node   0   1   2
> > > > > > > > > >    0:  10  40  20
> > > > > > > > > >    1:  40  10  80
> > > > > > > > > >    2:  20  80  10
> > > > > > > > > >
> > > > > > > > > > We have 2 choices,
> > > > > > > > > >
> > > > > > > > > > a)
> > > > > > > > > > node    demotion targets
> > > > > > > > > > 0       1
> > > > > > > > > > 2       1
> > > > > > > > > >
> > > > > > > > > > b)
> > > > > > > > > > node    demotion targets
> > > > > > > > > > 0       1
> > > > > > > > > > 2       X
> > > > > > > > > >
> > > > > > > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > > > > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > > > > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > > > > > > default configuration.
> > > > > > > > >
> > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > > > > > > > >
> > > > > > > > > In general, we can view the demotion order in a way similar to
> > > > > > > > > allocation fallback order (after all, if we don't demote or demotion
> > > > > > > > > lags behind, the allocations will go to these demotion target nodes
> > > > > > > > > according to the allocation fallback order anyway).  If we initialize
> > > > > > > > > the demotion order in that way (i.e. every node can demote to any node
> > > > > > > > > in the next tier, and the priority of the target nodes is sorted for
> > > > > > > > > each source node), we don't need per-node demotion order override from
> > > > > > > > > the userspace.  What we need is to specify what nodes should be in
> > > > > > > > > each tier and support NUMA mempolicy in demotion.
> > > > > > > > >
> > > > > > > >
> > > > > > > > I have been wondering how we would handle this. For ex: If an
> > > > > > > > application has specified an MPOL_BIND policy and restricted the
> > > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated
> > > > > > > > by that application
> > > > > > > > to Node10? The other alternative for that demotion is swapping. So from
> > > > > > > > the page point of view, we either demote to a slow memory or pageout to
> > > > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule.
> > > > > > >
> > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be
> > > > > > > skipped in such cases.  Such MPOL_BIND policies can be an important
> > > > > > > tool for applications to override and control their memory placement
> > > > > > > when transparent memory tiering is enabled.  If the application
> > > > > > > doesn't want swapping, there are other ways to achieve that (e.g.
> > > > > > > mlock, disabling swap globally, setting memcg parameters, etc).
> > > > > > >
> > > > > > >
> > > > > > > > The above says we would need some kind of mem policy interaction, but
> > > > > > > > what I am not sure about is how to find the memory policy in the
> > > > > > > > demotion path.
> > > > > > >
> > > > > > > This is indeed an important and challenging problem.  One possible
> > > > > > > approach is to retrieve the allowed demotion nodemask from
> > > > > > > page_referenced() similar to vm_flags.
> > > > > >
> > > > > > This works for mempolicy in struct vm_area_struct, but not for that in
> > > > > > struct task_struct.  Mutiple threads in a process may have different
> > > > > > mempolicy.
> > > > >
> > > > > From vm_area_struct, we can get to mm_struct and then to the owner
> > > > > task_struct, which has the process mempolicy.
> > > > >
> > > > > It is indeed a problem when a page is shared by different threads or
> > > > > different processes that have different thread default mempolicy
> > > > > values.
> > > >
> > > > Sorry for chiming in late, this is a known issue when we were working
> > > > on demotion. Yes, it is hard to handle the shared pages and multi
> > > > threads since mempolicy is applied to each thread so each thread may
> > > > have different mempolicy. And I don't think this case is rare. And not
> > > > only mempolicy but also may cpuset settings cause the similar problem,
> > > > different threads may have different cpuset settings for cgroupv1.
> > > >
> > > > If this is really a problem for real life workloads, we may consider
> > > > tackling it for exclusively owned pages first. Thanks to David's
> > > > patches, now we have dedicated flags to tell exclusively owned pages.
> > >
> > > One of the problems with demotion when I last looked is it does almost exactly
> > > the opposite of what we want on systems like POWER9 where GPU memory is a
> > > CPU-less memory node.
> > >
> > > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate
> > > memory on the GPU node. Under memory pressure demotion should migrate GPU
> > > allocations to the CPU node and finally other slow memory nodes or swap.
> > >
> > > Currently though demotion considers the GPU node slow memory (because it is
> > > CPU-less) so will demote CPU memory to GPU memory which is a limited resource.
> > > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap
> > > everything to disk rather than demote to CPU memory (which would be preferred).
> > >
> > > I'm still looking at this series but as I understand it it will help somewhat
> > > because we could make GPU memory the top-tier so nothing gets demoted to it.
> >
> > Yes.  If we have a way to put GPU memory in top-tier (tier 0) and
> > CPU+DRAM in tier 1.  Your requirement can be satisfied.  One way is to
> > override the auto-generated demotion order via some user space tool.
> > Another way is to change the GPU driver (I guess where the GPU memory is
> > enumerated and onlined?) to change the tier of GPU memory node.
> >
> > > However I wouldn't want to see demotion skipped entirely when a memory policy
> > > such as MPOL_BIND is specified. For example most memory on a GPU node will have
> > > some kind of policy specified and IMHO it would be better to demote to another
> > > node in the mempolicy nodemask rather than going straight to swap, particularly
> > > as GPU memory capacity tends to be limited in comparison to CPU memory
> > > capacity.
> > > >
> >
> > Can you use MPOL_PREFERRED?  Even if we enforce MPOL_BIND as much as
> > possible, we will not stop demoting from GPU to DRAM with
> > MPOL_PREFERRED.  And in addition to demotion, allocation fallbacking can
> > be used too to avoid allocation latency caused by demotion.
>
> I expect that MPOL_BIND can be used to either prevent demotion or
> select a particular demotion node/nodemask. It all depends on the
> mempolicy nodemask specified by MPOL_BIND.

Preventing demotion doesn't make too much sense to me, IMHO. But I tend
to agree the demotion target should be selected from the nodemask. I
think this could follow what the NUMA fault handling does.
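
In kernel terms that could look roughly like the sketch below.
next_demotion_node() exists in mm/migrate.c today and N_DEMOTION_TARGETS
is the node state proposed by this series; the helper name, its "allowed"
argument and the plumbing that would pass the mempolicy/cpuset nodemask
down into the demotion path are made up here for illustration.

/*
 * Hypothetical sketch only: pick a demotion target in the spirit of how
 * the NUMA fault path restricts migration targets to allowed nodes.
 */
#include <linux/migrate.h>
#include <linux/nodemask.h>

static int demotion_node_in_allowed(int node, const nodemask_t *allowed)
{
        int target = next_demotion_node(node);

        if (target != NUMA_NO_NODE && node_isset(target, *allowed))
                return target;

        /* Otherwise try any other allowed node that is a demotion target. */
        for_each_node_mask(target, *allowed)
                if (target != node && node_state(target, N_DEMOTION_TARGETS))
                        return target;

        return NUMA_NO_NODE;    /* no allowed target: skip demotion, swap instead */
}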

>
> > This is another example of a system with 3 tiers if PMEM is installed in
> > this machine too.
> >
> > Best Regards,
> > Huang, Ying
> >
> > > > > On the other hand, it can already support most interesting use cases
> > > > > for demotion (e.g. selecting the demotion node, mbind to prevent
> > > > > demotion) by respecting cpuset and vma mempolicies.
> > > > >
> > > > > > Best Regards,
> > > > > > Huang, Ying
> > > > > >
> > > > > > > >
> > > > > > > > > Cross-socket demotion should not be too big a problem in practice
> > > > > > > > > because we can optimize the code to do the demotion from the local CPU
> > > > > > > > > node (i.e. local writes to the target node and remote read from the
> > > > > > > > > source node).  The bigger issue is cross-socket memory access onto the
> > > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is
> > > > > > > > > important here.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > -aneesh
> > > > > >
> > > > > >
> > > >
> > > >
> > >
> > >
> > >
> > >
> >
> >
Yang Shi April 29, 2022, 6:53 p.m. UTC | #51
On Thu, Apr 28, 2022 at 9:45 PM Alistair Popple <apopple@nvidia.com> wrote:
>
> On Friday, 29 April 2022 1:27:36 PM AEST ying.huang@intel.com wrote:
> > On Thu, 2022-04-28 at 19:58 -0700, Wei Xu wrote:
> > > On Thu, Apr 28, 2022 at 7:21 PM ying.huang@intel.com
> > > <ying.huang@intel.com> wrote:
> > > >
> > > > On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote:
> > > > > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote:
> > > > > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote:
> > > > > > >
> > > > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com
> > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > >
> > > > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote:
> > > > > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V
> > > > > > > > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > > > > > > >
> > > > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote:
> > > > > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> > > > > > > > > > > <ying.huang@intel.com> wrote:
> > > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > ....
> > > > > > > > > >
> > > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example,
> > > > > > > > > > > >
> > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow
> > > > > > > > > > > > memory node near node 0,
> > > > > > > > > > > >
> > > > > > > > > > > > available: 3 nodes (0-2)
> > > > > > > > > > > > node 0 cpus: 0 1
> > > > > > > > > > > > node 0 size: n MB
> > > > > > > > > > > > node 0 free: n MB
> > > > > > > > > > > > node 1 cpus:
> > > > > > > > > > > > node 1 size: n MB
> > > > > > > > > > > > node 1 free: n MB
> > > > > > > > > > > > node 2 cpus: 2 3
> > > > > > > > > > > > node 2 size: n MB
> > > > > > > > > > > > node 2 free: n MB
> > > > > > > > > > > > node distances:
> > > > > > > > > > > > node   0   1   2
> > > > > > > > > > > >    0:  10  40  20
> > > > > > > > > > > >    1:  40  10  80
> > > > > > > > > > > >    2:  20  80  10
> > > > > > > > > > > >
> > > > > > > > > > > > We have 2 choices,
> > > > > > > > > > > >
> > > > > > > > > > > > a)
> > > > > > > > > > > > node    demotion targets
> > > > > > > > > > > > 0       1
> > > > > > > > > > > > 2       1
> > > > > > > > > > > >
> > > > > > > > > > > > b)
> > > > > > > > > > > > node    demotion targets
> > > > > > > > > > > > 0       1
> > > > > > > > > > > > 2       X
> > > > > > > > > > > >
> > > > > > > > > > > > a) is good to take advantage of PMEM.  b) is good to reduce cross-socket
> > > > > > > > > > > > traffic.  Both are OK as defualt configuration.  But some users may
> > > > > > > > > > > > prefer the other one.  So we need a user space ABI to override the
> > > > > > > > > > > > default configuration.
> > > > > > > > > > >
> > > > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be
> > > > > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion).
> > > > > > > > > > >
> > > > > > > > > > > In general, we can view the demotion order in a way similar to
> > > > > > > > > > > allocation fallback order (after all, if we don't demote or demotion
> > > > > > > > > > > lags behind, the allocations will go to these demotion target nodes
> > > > > > > > > > > according to the allocation fallback order anyway).  If we initialize
> > > > > > > > > > > the demotion order in that way (i.e. every node can demote to any node
> > > > > > > > > > > in the next tier, and the priority of the target nodes is sorted for
> > > > > > > > > > > each source node), we don't need per-node demotion order override from
> > > > > > > > > > > the userspace.  What we need is to specify what nodes should be in
> > > > > > > > > > > each tier and support NUMA mempolicy in demotion.
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I have been wondering how we would handle this. For ex: If an
> > > > > > > > > > application has specified an MPOL_BIND policy and restricted the
> > > > > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated
> > > > > > > > > > by that application
> > > > > > > > > > to Node10? The other alternative for that demotion is swapping. So from
> > > > > > > > > > the page point of view, we either demote to a slow memory or pageout to
> > > > > > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule.
> > > > > > > > >
> > > > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be
> > > > > > > > > skipped in such cases.  Such MPOL_BIND policies can be an important
> > > > > > > > > tool for applications to override and control their memory placement
> > > > > > > > > when transparent memory tiering is enabled.  If the application
> > > > > > > > > doesn't want swapping, there are other ways to achieve that (e.g.
> > > > > > > > > mlock, disabling swap globally, setting memcg parameters, etc).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > The above says we would need some kind of mem policy interaction, but
> > > > > > > > > > what I am not sure about is how to find the memory policy in the
> > > > > > > > > > demotion path.
> > > > > > > > >
> > > > > > > > > This is indeed an important and challenging problem.  One possible
> > > > > > > > > approach is to retrieve the allowed demotion nodemask from
> > > > > > > > > page_referenced() similar to vm_flags.
> > > > > > > >
> > > > > > > > This works for mempolicy in struct vm_area_struct, but not for that in
> > > > > > > > struct task_struct.  Mutiple threads in a process may have different
> > > > > > > > mempolicy.
> > > > > > >
> > > > > > > From vm_area_struct, we can get to mm_struct and then to the owner
> > > > > > > task_struct, which has the process mempolicy.
> > > > > > >
> > > > > > > It is indeed a problem when a page is shared by different threads or
> > > > > > > different processes that have different thread default mempolicy
> > > > > > > values.
> > > > > >
> > > > > > Sorry for chiming in late, this is a known issue when we were working
> > > > > > on demotion. Yes, it is hard to handle the shared pages and multi
> > > > > > threads since mempolicy is applied to each thread so each thread may
> > > > > > have different mempolicy. And I don't think this case is rare. And not
> > > > > > only mempolicy but also may cpuset settings cause the similar problem,
> > > > > > different threads may have different cpuset settings for cgroupv1.
> > > > > >
> > > > > > If this is really a problem for real life workloads, we may consider
> > > > > > tackling it for exclusively owned pages first. Thanks to David's
> > > > > > patches, now we have dedicated flags to tell exclusively owned pages.
> > > > >
> > > > > One of the problems with demotion when I last looked is it does almost exactly
> > > > > the opposite of what we want on systems like POWER9 where GPU memory is a
> > > > > CPU-less memory node.
> > > > >
> > > > > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate
> > > > > memory on the GPU node. Under memory pressure demotion should migrate GPU
> > > > > allocations to the CPU node and finally other slow memory nodes or swap.
> > > > >
> > > > > Currently though demotion considers the GPU node slow memory (because it is
> > > > > CPU-less) so will demote CPU memory to GPU memory which is a limited resource.
> > > > > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap
> > > > > everything to disk rather than demote to CPU memory (which would be preferred).
> > > > >
> > > > > I'm still looking at this series but as I understand it it will help somewhat
> > > > > because we could make GPU memory the top-tier so nothing gets demoted to it.
> > > >
> > > > Yes.  If we have a way to put GPU memory in top-tier (tier 0) and
> > > > CPU+DRAM in tier 1.  Your requirement can be satisfied.  One way is to
> > > > override the auto-generated demotion order via some user space tool.
> > > > Another way is to change the GPU driver (I guess where the GPU memory is
> > > > enumerated and onlined?) to change the tier of GPU memory node.
>
> Yes, although I think in this case it would be firmware that determines memory
> tiers (similar to ACPI HMAT which I saw discussed somewhere here). I agree it's
> a system level property though that in an ideal world shouldn't need overriding
> from userspace. However being able to override it with a user space tool could
> be useful.
>
> > > > > However I wouldn't want to see demotion skipped entirely when a memory policy
> > > > > such as MPOL_BIND is specified. For example most memory on a GPU node will have
> > > > > some kind of policy specified and IMHO it would be better to demote to another
> > > > > node in the mempolicy nodemask rather than going straight to swap, particularly
> > > > > as GPU memory capacity tends to be limited in comparison to CPU memory
> > > > > capacity.
> > > > > >
> > > >
> > > > Can you use MPOL_PREFERRED?  Even if we enforce MPOL_BIND as much as
> > > > possible, we will not stop demoting from GPU to DRAM with
> > > > MPOL_PREFERRED.  And in addition to demotion, allocation fallbacking can
> > > > be used too to avoid allocation latency caused by demotion.
>
> I think so. It's been a little while since I last looked at this but I was
> under the impression MPOL_PREFERRED didn't do direct reclaim (and therefore
> wouldn't trigger demotion so once GPU memory was full became effectively a
> no-op). However looking at the source I don't think that's the case now - if
> I'm understanding correctly MPOL_PREFERRED will do reclaim/demotion.

You are right. Whether reclaim is done depends on the GFP flags and
memory pressure, not on the mempolicy.
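
Roughly speaking (this is a sketch, not the actual allocator code), the
gating looks like:

/*
 * Whether an allocation can enter direct reclaim -- and hence trigger
 * demotion under memory pressure -- is a property of its GFP flags,
 * independent of which mempolicy mode chose the node.
 */
#include <linux/gfp.h>

static bool alloc_may_trigger_demotion(gfp_t gfp_mask)
{
        return !!(gfp_mask & __GFP_DIRECT_RECLAIM);
}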

>
> The other problem with MPOL_PREFERRED is it doesn't allow the fallback nodes to
> be specified. I was hoping the new MPOL_PREFERRED_MANY and
> set_mempolicy_home_node() would help here but currently that does disable
> reclaim (and therefore demotion) in the first pass.
>
> However that problem is tangential to this series and I can look at that
> separately. My main aim here given you were looking at requirements was just
> to raise this as a slightly different use case (one where the CPU isn't the top
> tier).
>
> Thanks for looking into all this.
>
>  - Alistair
>
> > > I expect that MPOL_BIND can be used to either prevent demotion or
> > > select a particular demotion node/nodemask. It all depends on the
> > > mempolicy nodemask specified by MPOL_BIND.
> >
> > Yes.  I think so too.
> >
> > Best Regards,
> > Huang, Ying
> >
> > > > This is another example of a system with 3 tiers if PMEM is installed in
> > > > this machine too.
> > > >
> > > > Best Regards,
> > > > Huang, Ying
> > > >
> > > > > > > On the other hand, it can already support most interesting use cases
> > > > > > > for demotion (e.g. selecting the demotion node, mbind to prevent
> > > > > > > demotion) by respecting cpuset and vma mempolicies.
> > > > > > >
> > > > > > > > Best Regards,
> > > > > > > > Huang, Ying
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > Cross-socket demotion should not be too big a problem in practice
> > > > > > > > > > > because we can optimize the code to do the demotion from the local CPU
> > > > > > > > > > > node (i.e. local writes to the target node and remote read from the
> > > > > > > > > > > source node).  The bigger issue is cross-socket memory access onto the
> > > > > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is
> > > > > > > > > > > important here.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > -aneesh
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > >
> > > >
> >
> >
> >
>
>
>
>
Wei Xu April 30, 2022, 2:21 a.m. UTC | #52
On Thu, Apr 28, 2022 at 12:30 PM Chen, Tim C <tim.c.chen@intel.com> wrote:
>
> >
> >On Wed, 2022-04-27 at 09:27 -0700, Wei Xu wrote:
> >> On Wed, Apr 27, 2022 at 12:11 AM ying.huang@intel.com
> >> <ying.huang@intel.com> wrote:
> >> >
> >> > On Mon, 2022-04-25 at 09:56 -0700, Wei Xu wrote:
> >> > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com
> >> > > <ying.huang@intel.com> wrote:
> >> > > >
> >> > > > Hi, All,
> >> > > >
> >> > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote:
> >> > > >
> >> > > > [snip]
> >> > > >
> >> > > > > I think it is necessary to either have per node demotion
> >> > > > > targets configuration or the user space interface supported by
> >> > > > > this patch series. As we don't have clear consensus on how the
> >> > > > > user interface should look like, we can defer the per node
> >> > > > > demotion target set interface to future until the real need arises.
> >> > > > >
> >> > > > > Current patch series sets N_DEMOTION_TARGET from dax device
> >> > > > > kmem driver, it may be possible that some memory node desired
> >> > > > > as demotion target is not detected in the system from dax-device
> >kmem probe path.
> >> > > > >
> >> > > > > It is also possible that some of the dax-devices are not
> >> > > > > preferred as demotion target e.g. HBM, for such devices, node
> >> > > > > shouldn't be set to N_DEMOTION_TARGETS. In future, Support
> >> > > > > should be added to distinguish such dax-devices and not mark
> >> > > > > them as N_DEMOTION_TARGETS from the kernel, but for now this
> >> > > > > user space interface will be useful to avoid such devices as demotion
> >targets.
> >> > > > >
> >> > > > > We can add read only interface to view per node demotion
> >> > > > > targets from /sys/devices/system/node/nodeX/demotion_targets,
> >> > > > > remove duplicated /sys/kernel/mm/numa/demotion_target
> >> > > > > interface and instead make
> >/sys/devices/system/node/demotion_targets writable.
> >> > > > >
> >> > > > > Huang, Wei, Yang,
> >> > > > > What do you suggest?
> >> > > >
> >> > > > We cannot remove a kernel ABI in practice.  So we need to make
> >> > > > it right at the first time.  Let's try to collect some
> >> > > > information for the kernel ABI definitation.
> >> > > >
> >> > > > The below is just a starting point, please add your requirements.
> >> > > >
> >> > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they
> >> > > > don't want to use that as the demotion targets.  But I don't
> >> > > > think this is a issue in practice for now, because
> >> > > > demote-in-reclaim is disabled by default.
> >> > > >
> >> > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for
> >> > > > example,
> >> > > >
> >> > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node
> >> > > > near node 0,
> >> > > >
> >> > > > available: 3 nodes (0-2)
> >> > > > node 0 cpus: 0 1
> >> > > > node 0 size: n MB
> >> > > > node 0 free: n MB
> >> > > > node 1 cpus:
> >> > > > node 1 size: n MB
> >> > > > node 1 free: n MB
> >> > > > node 2 cpus: 2 3
> >> > > > node 2 size: n MB
> >> > > > node 2 free: n MB
> >> > > > node distances:
> >> > > > node   0   1   2
> >> > > >   0:  10  40  20
> >> > > >   1:  40  10  80
> >> > > >   2:  20  80  10
> >> > > >
> >> > > > We have 2 choices,
> >> > > >
> >> > > > a)
> >> > > > node    demotion targets
> >> > > > 0       1
> >> > > > 2       1
> >> > > >
> >> > > > b)
> >> > > > node    demotion targets
> >> > > > 0       1
> >> > > > 2       X
> >> > > >
> >> > > > a) is good to take advantage of PMEM.  b) is good to reduce
> >> > > > cross-socket traffic.  Both are OK as defualt configuration.
> >> > > > But some users may prefer the other one.  So we need a user
> >> > > > space ABI to override the default configuration.
> >> > >
> >> > > I think 2(a) should be the system-wide configuration and 2(b) can
> >> > > be achieved with NUMA mempolicy (which needs to be added to
> >demotion).
> >> >
> >> > Unfortunately, some NUMA mempolicy information isn't available at
> >> > demotion time; for example, the mempolicy enforced via set_mempolicy()
> >> > is per thread. But I think that cpusets can work for demotion.
> >> >
> >> > > In general, we can view the demotion order in a way similar to
> >> > > allocation fallback order (after all, if we don't demote or
> >> > > demotion lags behind, the allocations will go to these demotion
> >> > > target nodes according to the allocation fallback order anyway).
> >> > > If we initialize the demotion order in that way (i.e. every node
> >> > > can demote to any node in the next tier, and the priority of the
> >> > > target nodes is sorted for each source node), we don't need a
> >> > > per-node demotion order override from userspace.  What we need
> >> > > is to specify which nodes should be in each tier and to support
> >> > > NUMA mempolicy in demotion.
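
As a side note, below is a minimal sketch of what "initialize the demotion
order like allocation fallback" could look like, assuming a nodemask of
next-tier nodes is already available.  The function and parameter names are
made up for illustration and are not taken from the posted patches.

#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/topology.h>

/*
 * Order all next-tier nodes for @node by increasing NUMA distance,
 * mirroring the allocation fallback order.
 */
static void build_demotion_order(int node, const nodemask_t *next_tier,
				 int *order, int *nr)
{
	nodemask_t remaining = *next_tier;

	*nr = 0;
	while (!nodes_empty(remaining)) {
		int n, best = NUMA_NO_NODE;

		for_each_node_mask(n, remaining) {
			if (best == NUMA_NO_NODE ||
			    node_distance(node, n) < node_distance(node, best))
				best = n;
		}
		order[(*nr)++] = best;
		node_clear(best, remaining);
	}
}
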
> >> >
> >> > This sounds interesting. Tier sounds like a natural and general
> >> > concept for these memory types. It's attractive to use it for the
> >> > user space interface too. For example, we may use it for mem_cgroup
> >> > limits of a specific memory type (tier).
> >> >
> >> > And if we look at N_DEMOTION_TARGETS again from the "tier"
> >> > point of view, the nodes are divided into 2 classes via
> >> > N_DEMOTION_TARGETS.
> >> >
> >> > - The nodes without N_DEMOTION_TARGETS are top tier (or tier 0).
> >> >
> >> > - The nodes with N_DEMOTION_TARGETS are non-top tier (or tier 1,
> >> > 2, 3, ...)
> >> >
> >>
> >> Yes, this is one of the main reasons why we (Google) want this interface.
> >>
> >> > So, another possibility is to fit N_DEMOTION_TARGETS and its
> >> > overriding into the "tier" concept too.  !N_DEMOTION_TARGETS == TIER0.
> >> >
> >> > - All nodes start with TIER0
> >> >
> >> > - TIER0 can be cleared for some nodes via e.g. the kmem driver
> >> >
> >> > The TIER0 node list can be read or overridden by user space via the
> >> > following interface,
> >> >
> >> >   /sys/devices/system/node/tier0
> >> >
> >> > In the future, if we want to customize more tiers, we can add tier1,
> >> > tier2, tier3, and so on.  For now, we can add just tier0.  That is,
> >> > the interface is extensible in the future, compared with
> >> > .../node/demote_targets.
> >> >
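
If TIER0 were tracked as a node state in the same way as N_DEMOTION_TARGETS,
the "cleared via kmem driver" step above could presumably look roughly like
the fragment below.  N_TIER0 is a made-up state name used only for
illustration; it does not exist in any posted patch.

/*
 * Purely illustrative fragment.
 *
 * Boot-time default: every node starts in tier 0.
 */
for_each_node(nid)
	node_set_state(nid, N_TIER0);

/* dax kmem probe path: slow memory leaves tier 0. */
node_clear_state(nid, N_TIER0);
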
> >>
> >> This more explicit tier definition interface works, too.
> >>
> >
> >In addition to making the tiering definition explicit, more importantly, this
> >makes it much easier to support more than 2 tiers.  For example, for a system
> >with HBM (High Bandwidth Memory), CPU+DRAM, DRAM only, and PMEM, that is,
> >3 tiers, we can put HBM in tier 0, CPU+DRAM and DRAM only in tier 1, and
> >PMEM in tier 2, automatically or via user space overriding.
> >N_DEMOTION_TARGETS doesn't extend naturally to support this.
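
For what it's worth, the 3-tier example above could be represented with
something as simple as a small array of nodemasks.  The names below
(MAX_MEMORY_TIERS, memory_tier_nodes, node_tier) are invented for this
sketch and do not appear in any posted patch.

#include <linux/nodemask.h>

#define MAX_MEMORY_TIERS	3	/* illustrative limit */

/* tier 0: HBM, tier 1: CPU+DRAM and DRAM-only, tier 2: PMEM */
static nodemask_t memory_tier_nodes[MAX_MEMORY_TIERS];

/* Return the tier a node belongs to, or -1 if it is in no tier. */
static int node_tier(int node)
{
	int tier;

	for (tier = 0; tier < MAX_MEMORY_TIERS; tier++)
		if (node_isset(node, memory_tier_nodes[tier]))
			return tier;
	return -1;
}
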
>
> Agree with Ying that making the tier explicit is fundamental to the rest of the API.
>
> I think that the tier organization should come before setting the demotion targets,
> not the other way round.
>
> That makes the demotion direction clear (a node in tier X demotes to
> tier Y, X < Y).  With that, explicitly specifying the demotion target or
> order is only needed when we truly want that level of control.
> Otherwise all the higher-numbered tiers are valid targets.
> Configuring a tier level for each node is a lot easier than fixing up all
> demotion targets for each and every node.
>
> We can prevent demotion target configuration that goes in the wrong
> direction by looking at the tier level.
>
> Tim
>
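
Building on the node_tier() sketch further up, the direction check Tim
describes could be as small as the helper below (again, illustrative only,
not code from any posted series):

/* A demotion target must sit in a strictly lower (higher-numbered) tier. */
static bool valid_demotion_target(int src_node, int target_node)
{
	return node_tier(target_node) > node_tier(src_node);
}
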

I have just posted an RFC on the tier-oriented memory tiering kernel
interface based on the discussions here.  The RFC proposes a sysfs
interface, /sys/devices/system/node/memory_tiers, to display and
override the nodes in each memory tier.  It also proposes that we rely
on the kernel allocation order to select the demotion target node from
the next tier and don't expose a userspace overriding interface for
per-node demotion order.  The RFC also drops the approach of treating
CPU nodes as the top tier by default.
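
For anyone who wants to poke at the proposed file once the RFC patches are
applied, a trivial userspace reader might look like the snippet below.  The
one-node-list-per-tier-per-line output format is my assumption here; the
authoritative format is whatever the RFC documents.

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/devices/system/node/memory_tiers", "r");
	char line[256];
	int tier = 0;

	if (!f) {
		perror("memory_tiers");
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		printf("tier %d: %s", tier++, line);
	fclose(f);
	return 0;
}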