Message ID: 20220413092206.73974-1-jvgediya@linux.ibm.com (mailing list archive)
Series: mm: demotion: Introduce new node state N_DEMOTION_TARGETS
On Wed, 13 Apr 2022 14:52:01 +0530 Jagdish Gediya <jvgediya@linux.ibm.com> wrote: > Current implementation to find the demotion targets works > based on node state N_MEMORY, however some systems may have > dram only memory numa node which are N_MEMORY but not the > right choices as demotion targets. Why are they not the right choice? Please describe this fully so we can understand the motivation and end-user benefit of the proposed change. And please more fully describe the end-user benefits of this change. > This patch series introduces the new node state > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > is used to hold the list of nodes which can be used as demotion > targets, support is also added to set the demotion target > list from user space so that default behavior can be overridden. Permanently extending the kernel ABI is a fairly big deal. Please fully explain the end-user value, usage scenarios, etc. What would go wrong if we simply omitted this interface? > node state N_DEMOTION_TARGETS is also set from the dax kmem > driver, certain type of memory which registers through dax kmem > (e.g. HBM) may not be the right choices for demotion so in future > they should be distinguished based on certain attributes and dax > kmem driver should avoid setting them as N_DEMOTION_TARGETS, > however current implementation also doesn't distinguish any > such memory and it considers all N_MEMORY as demotion targets > so this patch series doesn't modify the current behavior. > > Current code which sets migration targets is modified in > this patch series to avoid some of the limitations on the demotion > target sharing and to use N_DEMOTION_TARGETS only nodes while > finding demotion targets. > > Changelog > ---------- > > v2: > In v1, only 1st patch of this patch series was sent, which was > implemented to avoid some of the limitations on the demotion > target sharing, however for certain numa topology, the demotion > targets found by that patch was not most optimal, so 1st patch > in this series is modified according to suggestions from Huang > and Baolin. Different examples of demotion list comparasion > between existing implementation and changed implementation can > be found in the commit message of 1st patch. > > Jagdish Gediya (5): > mm: demotion: Set demotion list differently > mm: demotion: Add new node state N_DEMOTION_TARGETS > mm: demotion: Add support to set targets from userspace > device-dax/kmem: Set node state as N_DEMOTION_TARGETS > mm: demotion: Build demotion list based on N_DEMOTION_TARGETS > > .../ABI/testing/sysfs-kernel-mm-numa | 12 ++++ This description is rather brief. Some additional user-facing material under Documentation/ would help. Describe the format for writing to the file, what is seen when reading from it, provide a bit of help to the user so they can understand how to use it, what effects they might see, etc. > drivers/base/node.c | 4 ++ > drivers/dax/kmem.c | 2 + > include/linux/nodemask.h | 1 + > mm/migrate.c | 67 +++++++++++++++---- > 5 files changed, 72 insertions(+), 14 deletions(-)
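For context on the one-line include/linux/nodemask.h change in the diffstat above, the proposed node state amounts to one more entry in the node_states enum. A rough sketch, not the posted patch itself (the neighbouring entries reflect the mainline enum of that era; the exact placement chosen by the series may differ):

/* include/linux/nodemask.h (sketch, not the actual patch) */
enum node_states {
        N_POSSIBLE,             /* The node could become online at some point */
        N_ONLINE,               /* The node is online */
        N_NORMAL_MEMORY,        /* The node has regular memory */
#ifdef CONFIG_HIGHMEM
        N_HIGH_MEMORY,          /* The node has regular or high memory */
#else
        N_HIGH_MEMORY = N_NORMAL_MEMORY,
#endif
        N_MEMORY,               /* The node has memory (regular, high, movable) */
        N_CPU,                  /* The node has one or more cpus */
        N_GENERIC_INITIATOR,    /* The node has one or more Generic Initiators */
        N_DEMOTION_TARGETS,     /* Proposed: the node is a usable demotion target */
        NR_NODE_STATES
};

With such a state in place, node_states[N_DEMOTION_TARGETS] is an ordinary nodemask_t, so the usual helpers (node_set_state(), node_clear_state(), for_each_node_state()) apply to it.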
On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > Current implementation to find the demotion targets works > based on node state N_MEMORY, however some systems may have > dram only memory numa node which are N_MEMORY but not the > right choices as demotion targets. > > This patch series introduces the new node state > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > is used to hold the list of nodes which can be used as demotion > targets, support is also added to set the demotion target > list from user space so that default behavior can be overridden.

It appears that your proposed user space interface cannot solve all problems. For example, for a system as follows, where node 0 & 2 are cpu + dram nodes and node 1 is a slow memory node near node 0:

available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus:
node 1 size: n MB
node 1 free: n MB
node 2 cpus: 2 3
node 2 size: n MB
node 2 free: n MB
node distances:
node   0   1   2
  0:  10  40  20
  1:  40  10  80
  2:  20  80  10

Demotion order 1:

node    demotion_target
   0    1
   1    X
   2    X

Demotion order 2:

node    demotion_target
   0    1
   1    X
   2    1

Demotion order 1 is preferred if we want to reduce cross-socket traffic, while demotion order 2 is preferred if we want to take full advantage of the slow memory node. We can take either choice as the automatically generated order, while making the other choice possible via a user space override.

I don't know how to implement this via your proposed user space interface. How about the following user space interface?

1. Add a file "demotion_order_override" in /sys/devices/system/node/

2. When read, "1" is output if the demotion order of the system has been overridden; "0" is output if not.

3. When "1" is written, the demotion order of the system will become the overridden mode. When "0" is written, the demotion order of the system will become the automatic mode and the demotion order will be re-generated.

4. Add a file "demotion_targets" for each node in /sys/devices/system/node/nodeX/

5. When read, the demotion targets of nodeX will be output.

6. When a node list is written to the file, the demotion targets of nodeX will be set to the written nodes, and the demotion order of the system will become the overridden mode.

To reduce the complexity, the demotion order of the system is either in overridden mode or automatic mode. When converting from the automatic mode to the overridden mode, the existing demotion targets of all nodes will be retained before being changed. When converting from overridden mode to automatic mode, the demotion order of the system will be re-generated automatically.

In overridden mode, the demotion targets of hot-added and hot-removed nodes will be set to empty, and a hot-removed node will be removed from the demotion targets of any node.

This is an extension of the interface used in the following patch:

https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/

What do you think about this?

> node state N_DEMOTION_TARGETS is also set from the dax kmem > driver, certain type of memory which registers through dax kmem > (e.g. HBM) may not be the right choices for demotion so in future > they should be distinguished based on certain attributes and dax > kmem driver should avoid setting them as N_DEMOTION_TARGETS, > however current implementation also doesn't distinguish any > such memory and it considers all N_MEMORY as demotion targets > so this patch series doesn't modify the current behavior.
Best Regards,
Huang, Ying

[snip]
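To make items 4-6 of the proposal above concrete, here is a rough sketch of what a per-node demotion_targets store handler could look like. This is not code from the posted series; node_set_demotion_targets() is a hypothetical helper standing in for whatever records the override and regenerates the demotion order:

/* Sketch only; not from this series. */
#include <linux/device.h>
#include <linux/nodemask.h>

static ssize_t demotion_targets_store(struct device *dev,
                                      struct device_attribute *attr,
                                      const char *buf, size_t count)
{
        nodemask_t targets;
        int ret;

        /* Accept the usual node-list syntax, e.g. "1", "2,4" or "1-3". */
        ret = nodelist_parse(buf, targets);
        if (ret)
                return ret;

        /* A demotion target must at least have memory. */
        if (!nodes_subset(targets, node_states[N_MEMORY]))
                return -EINVAL;

        /* Hypothetical helper: record the override for this node (dev->id). */
        ret = node_set_demotion_targets(dev->id, &targets);

        return ret ? ret : count;
}
static DEVICE_ATTR_WO(demotion_targets);

A matching show() handler and DEVICE_ATTR_RW() would be needed for item 5 of the proposal; reading back would just format the stored mask, e.g. with "%*pbl" and nodemask_pr_args().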
On Wed, Apr 13, 2022 at 02:44:34PM -0700, Andrew Morton wrote: > On Wed, 13 Apr 2022 14:52:01 +0530 Jagdish Gediya <jvgediya@linux.ibm.com> wrote: > > > Current implementation to find the demotion targets works > > based on node state N_MEMORY, however some systems may have > > dram only memory numa node which are N_MEMORY but not the > > right choices as demotion targets. > > Why are they not the right choice? Please describe this fully so we > can understand the motivation and end-user benefit of the proposed > change. And please more fully describe the end-user benefits of this > change.

Some systems (e.g. PowerVM) have DRAM (fast memory) only NUMA nodes which are N_MEMORY, as well as slow memory (persistent memory) only NUMA nodes which are also N_MEMORY. As the current demotion target finding algorithm works based on N_MEMORY and best distance, it will choose a DRAM-only NUMA node as the demotion target instead of a persistent memory node on such systems. If a DRAM-only NUMA node is filled with demoted pages, then at some point new allocations can start falling to persistent memory, so basically cold pages end up in fast memory (due to demotion) and new pages in slow memory. This is why persistent memory nodes should be utilized for demotion and DRAM nodes should be avoided for demotion, so that they can be used for new allocations.

The current implementation works fine on systems where memory-only NUMA nodes can only be persistent/slow memory, but it is not suitable for systems like the ones mentioned above. Introducing the new node state N_DEMOTION_TARGETS provides a way to handle demotion on such systems without affecting the existing behavior.

> > This patch series introduces the new node state > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > is used to hold the list of nodes which can be used as demotion > > targets, support is also added to set the demotion target > > list from user space so that default behavior can be overridden. > > Permanently extending the kernel ABI is a fairly big deal. Please > fully explain the end-user value, usage scenarios, etc. > > What would go wrong if we simply omitted this interface?

I am going to modify this interface according to review feedback in the next version, but let me explain why it is needed with examples. Based on the topology and the available memory tiers in the system, users may not want to utilize all the demotion targets configured by the kernel by default, for example:

1. To reduce cross-socket traffic
2. To use only the slowest memory as demotion targets when there are multiple slow-memory-only nodes available

The current patch series handles option 2 above, but doesn't handle option 1, so the next version will add that support, possibly with a different implementation to handle such scenarios.
Example 1
---------

With the below NUMA topology, where nodes 0 & 1 are cpu + dram nodes, nodes 2 & 3 are equally slower memory-only nodes, and node 4 is the slowest memory-only node:

available: 5 nodes (0-4)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus: 2 3
node 1 size: n MB
node 1 free: n MB
node 2 cpus:
node 2 size: n MB
node 2 free: n MB
node 3 cpus:
node 3 size: n MB
node 3 free: n MB
node 4 cpus:
node 4 size: n MB
node 4 free: n MB
node distances:
node   0   1   2   3   4
  0:  10  20  40  40  80
  1:  20  10  40  40  80
  2:  40  40  10  40  80
  3:  40  40  40  10  80
  4:  80  80  80  80  10

this patch series by default prepares the below demotion list:

node    demotion_target
   0    3, 2
   1    3, 2
   2    4
   3    4
   4    X

But it may be possible that the user wants to utilize nodes 2 & 3 only for new allocations and only node 4 for demotion.

Example 2
---------

With the below NUMA topology, where nodes 0 & 2 are cpu + dram nodes and node 1 is a slow memory node near node 0:

available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: n MB
node 0 free: n MB
node 1 cpus:
node 1 size: n MB
node 1 free: n MB
node 2 cpus: 2 3
node 2 size: n MB
node 2 free: n MB
node distances:
node   0   1   2
  0:  10  40  20
  1:  40  10  80
  2:  20  80  10

this patch series by default prepares the below demotion list:

node    demotion_target
   0    1
   1    X
   2    1

However, the user may want to avoid node 1 as a demotion target for node 2 to reduce cross-socket traffic.

> > node state N_DEMOTION_TARGETS is also set from the dax kmem > > driver, certain type of memory which registers through dax kmem > > (e.g. HBM) may not be the right choices for demotion so in future > > they should be distinguished based on certain attributes and dax > > kmem driver should avoid setting them as N_DEMOTION_TARGETS, > > however current implementation also doesn't distinguish any > > such memory and it considers all N_MEMORY as demotion targets > > so this patch series doesn't modify the current behavior. > > > > Current code which sets migration targets is modified in > > this patch series to avoid some of the limitations on the demotion > > target sharing and to use N_DEMOTION_TARGETS only nodes while > > finding demotion targets. > > > > Changelog > > ---------- > > > > v2: > > In v1, only 1st patch of this patch series was sent, which was > > implemented to avoid some of the limitations on the demotion > > target sharing, however for certain numa topology, the demotion > > targets found by that patch was not most optimal, so 1st patch > > in this series is modified according to suggestions from Huang > > and Baolin. Different examples of demotion list comparasion > > between existing implementation and changed implementation can > > be found in the commit message of 1st patch. > > > > Jagdish Gediya (5): > > mm: demotion: Set demotion list differently > > mm: demotion: Add new node state N_DEMOTION_TARGETS > > mm: demotion: Add support to set targets from userspace > > device-dax/kmem: Set node state as N_DEMOTION_TARGETS > > mm: demotion: Build demotion list based on N_DEMOTION_TARGETS > > > > .../ABI/testing/sysfs-kernel-mm-numa | 12 ++++ > > This description is rather brief. Some additional user-facing material > under Documentation/ would help. Describe the format for writing to the > file, what is seen when reading from it, provide a bit of help to the > user so they can understand how to use it, what effects they might see, > etc.

Sure, will do in the next version.
> > drivers/base/node.c | 4 ++ > > drivers/dax/kmem.c | 2 + > > include/linux/nodemask.h | 1 + > > mm/migrate.c | 67 +++++++++++++++---- > > 5 files changed, 72 insertions(+), 14 deletions(-) >
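For readers trying to follow the default demotion lists in the two examples above: the core idea is that a node which is not itself a demotion target demotes to the nearest N_DEMOTION_TARGETS node by NUMA distance. A much-simplified illustration follows; the series actually builds per-node target masks and allows several nodes to share a target, so this is not the patch's algorithm:

/* Simplified illustration; not the algorithm used by the series. */
#include <linux/limits.h>
#include <linux/nodemask.h>
#include <linux/numa.h>
#include <linux/topology.h>

static int nearest_demotion_target(int node)
{
        int target, best = NUMA_NO_NODE;
        int best_dist = INT_MAX;

        for_each_node_state(target, N_DEMOTION_TARGETS) {
                int dist = node_distance(node, target);

                if (target == node)
                        continue;
                if (dist < best_dist) {
                        best_dist = dist;
                        best = target;
                }
        }

        return best;    /* NUMA_NO_NODE corresponds to the "X" entries above */
}

In example 2, this picks node 1 for node 0 (distance 40) and node 1 for node 2 (distance 80), i.e. the "take full advantage of the slow memory node" ordering; preferring the low-cross-socket-traffic ordering instead is exactly the kind of policy choice the proposed user space override is meant to cover.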
On Thu, Apr 14, 2022 at 03:00:46PM +0800, ying.huang@intel.com wrote: > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > Current implementation to find the demotion targets works > > based on node state N_MEMORY, however some systems may have > > dram only memory numa node which are N_MEMORY but not the > > right choices as demotion targets. > > > > This patch series introduces the new node state > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > is used to hold the list of nodes which can be used as demotion > > targets, support is also added to set the demotion target > > list from user space so that default behavior can be overridden. > > It appears that your proposed user space interface cannot solve all > problems. For example, for system as follows, > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > node 0, > > available: 3 nodes (0-2) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: > node 1 size: n MB > node 1 free: n MB > node 2 cpus: 2 3 > node 2 size: n MB > node 2 free: n MB > node distances: > node 0 1 2 > 0: 10 40 20 > 1: 40 10 80 > 2: 20 80 10 > > Demotion order 1: > > node demotion_target > 0 1 > 1 X > 2 X > > Demotion order 2: > > node demotion_target > 0 1 > 1 X > 2 1 > > The demotion order 1 is preferred if we want to reduce cross-socket > traffic. While the demotion order 2 is preferred if we want to take > full advantage of the slow memory node. We can take any choice as > automatic-generated order, while make the other choice possible via user > space overridden. > > I don't know how to implement this via your proposed user space > interface. How about the following user space interface? > > 1. Add a file "demotion_order_override" in > /sys/devices/system/node/ > > 2. When read, "1" is output if the demotion order of the system has been > overridden; "0" is output if not. > > 3. When write "1", the demotion order of the system will become the > overridden mode. When write "0", the demotion order of the system will > become the automatic mode and the demotion order will be re-generated. > > 4. Add a file "demotion_targets" for each node in > /sys/devices/system/node/nodeX/ > > 5. When read, the demotion targets of nodeX will be output. > > 6. When write a node list to the file, the demotion targets of nodeX > will be set to the written nodes. And the demotion order of the system > will become the overridden mode. > > To reduce the complexity, the demotion order of the system is either in > overridden mode or automatic mode. When converting from the automatic > mode to the overridden mode, the existing demotion targets of all nodes > will be retained before being changed. When converting from overridden > mode to automatic mode, the demotion order of the system will be re- > generated automatically. > > In overridden mode, the demotion targets of the hot-added and hot- > removed node will be set to empty. And the hot-removed node will be > removed from the demotion targets of any node. > > This is an extention of the interface used in the following patch, > > https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/ > > What do you think about this? It looks good, will implement in next version. > > node state N_DEMOTION_TARGETS is also set from the dax kmem > > driver, certain type of memory which registers through dax kmem > > (e.g. 
HBM) may not be the right choices for demotion so in future > > they should be distinguished based on certain attributes and dax > > kmem driver should avoid setting them as N_DEMOTION_TARGETS, > > however current implementation also doesn't distinguish any > > such memory and it considers all N_MEMORY as demotion targets > > so this patch series doesn't modify the current behavior. > > > > Best Regards, > Huang, Ying > > [snip]

Best regards,
Jagdish
On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com <ying.huang@intel.com> wrote: > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > Current implementation to find the demotion targets works > > based on node state N_MEMORY, however some systems may have > > dram only memory numa node which are N_MEMORY but not the > > right choices as demotion targets. > > > > This patch series introduces the new node state > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > is used to hold the list of nodes which can be used as demotion > > targets, support is also added to set the demotion target > > list from user space so that default behavior can be overridden. > > It appears that your proposed user space interface cannot solve all > problems. For example, for system as follows, > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > node 0, > > available: 3 nodes (0-2) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: > node 1 size: n MB > node 1 free: n MB > node 2 cpus: 2 3 > node 2 size: n MB > node 2 free: n MB > node distances: > node 0 1 2 > 0: 10 40 20 > 1: 40 10 80 > 2: 20 80 10 > > Demotion order 1: > > node demotion_target > 0 1 > 1 X > 2 X > > Demotion order 2: > > node demotion_target > 0 1 > 1 X > 2 1 > > The demotion order 1 is preferred if we want to reduce cross-socket > traffic. While the demotion order 2 is preferred if we want to take > full advantage of the slow memory node. We can take any choice as > automatic-generated order, while make the other choice possible via user > space overridden. > > I don't know how to implement this via your proposed user space > interface. How about the following user space interface? > > 1. Add a file "demotion_order_override" in > /sys/devices/system/node/ > > 2. When read, "1" is output if the demotion order of the system has been > overridden; "0" is output if not. > > 3. When write "1", the demotion order of the system will become the > overridden mode. When write "0", the demotion order of the system will > become the automatic mode and the demotion order will be re-generated. > > 4. Add a file "demotion_targets" for each node in > /sys/devices/system/node/nodeX/ > > 5. When read, the demotion targets of nodeX will be output. > > 6. When write a node list to the file, the demotion targets of nodeX > will be set to the written nodes. And the demotion order of the system > will become the overridden mode. TBH I don't think having override demotion targets in userspace is quite useful in real life for now (it might become useful in the future, I can't tell). Imagine you manage hundred thousands of machines, which may come from different vendors, have different generations of hardware, have different versions of firmware, it would be a nightmare for the users to configure the demotion targets properly. So it would be great to have the kernel properly configure it *without* intervening from the users. So we should pick up a proper default policy and stick with that policy unless it doesn't work well for the most workloads. I do understand it is hard to make everyone happy. My proposal is having every node in the fast tier has a demotion target (at least one) if the slow tier exists sounds like a reasonable default policy. I think this is also the current implementation. > > To reduce the complexity, the demotion order of the system is either in > overridden mode or automatic mode. 
When converting from the automatic > mode to the overridden mode, the existing demotion targets of all nodes > will be retained before being changed. When converting from overridden > mode to automatic mode, the demotion order of the system will be re- > generated automatically. > > In overridden mode, the demotion targets of the hot-added and hot- > removed node will be set to empty. And the hot-removed node will be > removed from the demotion targets of any node. > > This is an extention of the interface used in the following patch, > > https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/ > > What do you think about this? > > > node state N_DEMOTION_TARGETS is also set from the dax kmem > > driver, certain type of memory which registers through dax kmem > > (e.g. HBM) may not be the right choices for demotion so in future > > they should be distinguished based on certain attributes and dax > > kmem driver should avoid setting them as N_DEMOTION_TARGETS, > > however current implementation also doesn't distinguish any > > such memory and it considers all N_MEMORY as demotion targets > > so this patch series doesn't modify the current behavior. > > > > Best Regards, > Huang, Ying > > [snip] >
On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > <ying.huang@intel.com> wrote: > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > Current implementation to find the demotion targets works > > > based on node state N_MEMORY, however some systems may have > > > dram only memory numa node which are N_MEMORY but not the > > > right choices as demotion targets. > > > > > > This patch series introduces the new node state > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > is used to hold the list of nodes which can be used as demotion > > > targets, support is also added to set the demotion target > > > list from user space so that default behavior can be overridden. > > > > It appears that your proposed user space interface cannot solve all > > problems. For example, for system as follows, > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > node 0, > > > > available: 3 nodes (0-2) > > node 0 cpus: 0 1 > > node 0 size: n MB > > node 0 free: n MB > > node 1 cpus: > > node 1 size: n MB > > node 1 free: n MB > > node 2 cpus: 2 3 > > node 2 size: n MB > > node 2 free: n MB > > node distances: > > node 0 1 2 > > 0: 10 40 20 > > 1: 40 10 80 > > 2: 20 80 10 > > > > Demotion order 1: > > > > node demotion_target > > 0 1 > > 1 X > > 2 X > > > > Demotion order 2: > > > > node demotion_target > > 0 1 > > 1 X > > 2 1 > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > traffic. While the demotion order 2 is preferred if we want to take > > full advantage of the slow memory node. We can take any choice as > > automatic-generated order, while make the other choice possible via user > > space overridden. > > > > I don't know how to implement this via your proposed user space > > interface. How about the following user space interface? > > > > 1. Add a file "demotion_order_override" in > > /sys/devices/system/node/ > > > > 2. When read, "1" is output if the demotion order of the system has been > > overridden; "0" is output if not. > > > > 3. When write "1", the demotion order of the system will become the > > overridden mode. When write "0", the demotion order of the system will > > become the automatic mode and the demotion order will be re-generated. > > > > 4. Add a file "demotion_targets" for each node in > > /sys/devices/system/node/nodeX/ > > > > 5. When read, the demotion targets of nodeX will be output. > > > > 6. When write a node list to the file, the demotion targets of nodeX > > will be set to the written nodes. And the demotion order of the system > > will become the overridden mode. > > TBH I don't think having override demotion targets in userspace is > quite useful in real life for now (it might become useful in the > future, I can't tell). Imagine you manage hundred thousands of > machines, which may come from different vendors, have different > generations of hardware, have different versions of firmware, it would > be a nightmare for the users to configure the demotion targets > properly. So it would be great to have the kernel properly configure > it *without* intervening from the users. > > So we should pick up a proper default policy and stick with that > policy unless it doesn't work well for the most workloads. I do > understand it is hard to make everyone happy. 
My proposal is having > every node in the fast tier has a demotion target (at least one) if > the slow tier exists sounds like a reasonable default policy. I think > this is also the current implementation. > This is reasonable. I agree that with a decent default policy, the overriding of per-node demotion targets can be deferred. The most important problem here is that we should allow the configurations where memory-only nodes are not used as demotion targets, which this patch set has already addressed. > > > > To reduce the complexity, the demotion order of the system is either in > > overridden mode or automatic mode. When converting from the automatic > > mode to the overridden mode, the existing demotion targets of all nodes > > will be retained before being changed. When converting from overridden > > mode to automatic mode, the demotion order of the system will be re- > > generated automatically. > > > > In overridden mode, the demotion targets of the hot-added and hot- > > removed node will be set to empty. And the hot-removed node will be > > removed from the demotion targets of any node. > > > > This is an extention of the interface used in the following patch, > > > > https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/ > > > > What do you think about this? > > > > > node state N_DEMOTION_TARGETS is also set from the dax kmem > > > driver, certain type of memory which registers through dax kmem > > > (e.g. HBM) may not be the right choices for demotion so in future > > > they should be distinguished based on certain attributes and dax > > > kmem driver should avoid setting them as N_DEMOTION_TARGETS, > > > however current implementation also doesn't distinguish any > > > such memory and it considers all N_MEMORY as demotion targets > > > so this patch series doesn't modify the current behavior. > > > > > > > Best Regards, > > Huang, Ying > > > > [snip] > >
On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > <ying.huang@intel.com> wrote: > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > Current implementation to find the demotion targets works > > > > based on node state N_MEMORY, however some systems may have > > > > dram only memory numa node which are N_MEMORY but not the > > > > right choices as demotion targets. > > > > > > > > This patch series introduces the new node state > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > is used to hold the list of nodes which can be used as demotion > > > > targets, support is also added to set the demotion target > > > > list from user space so that default behavior can be overridden. > > > > > > It appears that your proposed user space interface cannot solve all > > > problems. For example, for system as follows, > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > node 0, > > > > > > available: 3 nodes (0-2) > > > node 0 cpus: 0 1 > > > node 0 size: n MB > > > node 0 free: n MB > > > node 1 cpus: > > > node 1 size: n MB > > > node 1 free: n MB > > > node 2 cpus: 2 3 > > > node 2 size: n MB > > > node 2 free: n MB > > > node distances: > > > node 0 1 2 > > > 0: 10 40 20 > > > 1: 40 10 80 > > > 2: 20 80 10 > > > > > > Demotion order 1: > > > > > > node demotion_target > > > 0 1 > > > 1 X > > > 2 X > > > > > > Demotion order 2: > > > > > > node demotion_target > > > 0 1 > > > 1 X > > > 2 1 > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > traffic. While the demotion order 2 is preferred if we want to take > > > full advantage of the slow memory node. We can take any choice as > > > automatic-generated order, while make the other choice possible via user > > > space overridden. > > > > > > I don't know how to implement this via your proposed user space > > > interface. How about the following user space interface? > > > > > > 1. Add a file "demotion_order_override" in > > > /sys/devices/system/node/ > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > overridden; "0" is output if not. > > > > > > 3. When write "1", the demotion order of the system will become the > > > overridden mode. When write "0", the demotion order of the system will > > > become the automatic mode and the demotion order will be re-generated. > > > > > > 4. Add a file "demotion_targets" for each node in > > > /sys/devices/system/node/nodeX/ > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > will be set to the written nodes. And the demotion order of the system > > > will become the overridden mode. > > > > TBH I don't think having override demotion targets in userspace is > > quite useful in real life for now (it might become useful in the > > future, I can't tell). Imagine you manage hundred thousands of > > machines, which may come from different vendors, have different > > generations of hardware, have different versions of firmware, it would > > be a nightmare for the users to configure the demotion targets > > properly. So it would be great to have the kernel properly configure > > it *without* intervening from the users. 
> > > > So we should pick up a proper default policy and stick with that > > policy unless it doesn't work well for the most workloads. I do > > understand it is hard to make everyone happy. My proposal is having > > every node in the fast tier has a demotion target (at least one) if > > the slow tier exists sounds like a reasonable default policy. I think > > this is also the current implementation. > > > > This is reasonable. I agree that with a decent default policy, > I agree that a decent default policy is important. As that was enhanced in [1/5] of this patchset. > the > overriding of per-node demotion targets can be deferred. The most > important problem here is that we should allow the configurations > where memory-only nodes are not used as demotion targets, which this > patch set has already addressed. Do you mean the user space interface proposed by [3/5] of this patchset? IMHO, if we want to add a user space interface, I think that it should be powerful enough to address all existing issues and some potential future issues, so that it can be stable. I don't think it's a good idea to define a partial user space interface that works only for a specific use case and cannot be extended for other use cases. Best Regards, Huang, Ying [snip] > >
On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com <ying.huang@intel.com> wrote: > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > <ying.huang@intel.com> wrote: > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > Current implementation to find the demotion targets works > > > > > based on node state N_MEMORY, however some systems may have > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > right choices as demotion targets. > > > > > > > > > > This patch series introduces the new node state > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > is used to hold the list of nodes which can be used as demotion > > > > > targets, support is also added to set the demotion target > > > > > list from user space so that default behavior can be overridden. > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > problems. For example, for system as follows, > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > node 0, > > > > > > > > available: 3 nodes (0-2) > > > > node 0 cpus: 0 1 > > > > node 0 size: n MB > > > > node 0 free: n MB > > > > node 1 cpus: > > > > node 1 size: n MB > > > > node 1 free: n MB > > > > node 2 cpus: 2 3 > > > > node 2 size: n MB > > > > node 2 free: n MB > > > > node distances: > > > > node 0 1 2 > > > > 0: 10 40 20 > > > > 1: 40 10 80 > > > > 2: 20 80 10 > > > > > > > > Demotion order 1: > > > > > > > > node demotion_target > > > > 0 1 > > > > 1 X > > > > 2 X > > > > > > > > Demotion order 2: > > > > > > > > node demotion_target > > > > 0 1 > > > > 1 X > > > > 2 1 > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > full advantage of the slow memory node. We can take any choice as > > > > automatic-generated order, while make the other choice possible via user > > > > space overridden. > > > > > > > > I don't know how to implement this via your proposed user space > > > > interface. How about the following user space interface? > > > > > > > > 1. Add a file "demotion_order_override" in > > > > /sys/devices/system/node/ > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > overridden; "0" is output if not. > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > overridden mode. When write "0", the demotion order of the system will > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > /sys/devices/system/node/nodeX/ > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > will be set to the written nodes. And the demotion order of the system > > > > will become the overridden mode. > > > > > > TBH I don't think having override demotion targets in userspace is > > > quite useful in real life for now (it might become useful in the > > > future, I can't tell). 
Imagine you manage hundred thousands of > > > machines, which may come from different vendors, have different > > > generations of hardware, have different versions of firmware, it would > > > be a nightmare for the users to configure the demotion targets > > > properly. So it would be great to have the kernel properly configure > > > it *without* intervening from the users. > > > > > > So we should pick up a proper default policy and stick with that > > > policy unless it doesn't work well for the most workloads. I do > > > understand it is hard to make everyone happy. My proposal is having > > > every node in the fast tier has a demotion target (at least one) if > > > the slow tier exists sounds like a reasonable default policy. I think > > > this is also the current implementation. > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > I agree that a decent default policy is important. As that was enhanced > in [1/5] of this patchset. > > > the > > overriding of per-node demotion targets can be deferred. The most > > important problem here is that we should allow the configurations > > where memory-only nodes are not used as demotion targets, which this > > patch set has already addressed. > > Do you mean the user space interface proposed by [3/5] of this patchset? Yes. > IMHO, if we want to add a user space interface, I think that it should > be powerful enough to address all existing issues and some potential > future issues, so that it can be stable. I don't think it's a good idea > to define a partial user space interface that works only for a specific > use case and cannot be extended for other use cases. I actually think that they can be viewed as two separate problems: one is to define which nodes can be used as demotion targets (this patch set), and the other is how to initialize the per-node demotion path (node_demotion[]). We don't have to solve both problems at the same time. If we decide to go with a per-node demotion path customization interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there is a single global control to turn off all demotion targets (for the machines that don't use memory-only nodes for demotion). > Best Regards, > Huang, Ying > > [snip] > > > > > >
On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > <ying.huang@intel.com> wrote: > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > Current implementation to find the demotion targets works > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > right choices as demotion targets. > > > > > > > > > > > > This patch series introduces the new node state > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > targets, support is also added to set the demotion target > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > problems. For example, for system as follows, > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > node 0, > > > > > > > > > > available: 3 nodes (0-2) > > > > > node 0 cpus: 0 1 > > > > > node 0 size: n MB > > > > > node 0 free: n MB > > > > > node 1 cpus: > > > > > node 1 size: n MB > > > > > node 1 free: n MB > > > > > node 2 cpus: 2 3 > > > > > node 2 size: n MB > > > > > node 2 free: n MB > > > > > node distances: > > > > > node 0 1 2 > > > > > 0: 10 40 20 > > > > > 1: 40 10 80 > > > > > 2: 20 80 10 > > > > > > > > > > Demotion order 1: > > > > > > > > > > node demotion_target > > > > > 0 1 > > > > > 1 X > > > > > 2 X > > > > > > > > > > Demotion order 2: > > > > > > > > > > node demotion_target > > > > > 0 1 > > > > > 1 X > > > > > 2 1 > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > full advantage of the slow memory node. We can take any choice as > > > > > automatic-generated order, while make the other choice possible via user > > > > > space overridden. > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > interface. How about the following user space interface? > > > > > > > > > > 1. Add a file "demotion_order_override" in > > > > > /sys/devices/system/node/ > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > overridden; "0" is output if not. > > > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > > will be set to the written nodes. And the demotion order of the system > > > > > will become the overridden mode. 
> > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > quite useful in real life for now (it might become useful in the > > > > future, I can't tell). Imagine you manage hundred thousands of > > > > machines, which may come from different vendors, have different > > > > generations of hardware, have different versions of firmware, it would > > > > be a nightmare for the users to configure the demotion targets > > > > properly. So it would be great to have the kernel properly configure > > > > it *without* intervening from the users. > > > > > > > > So we should pick up a proper default policy and stick with that > > > > policy unless it doesn't work well for the most workloads. I do > > > > understand it is hard to make everyone happy. My proposal is having > > > > every node in the fast tier has a demotion target (at least one) if > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > this is also the current implementation. > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > I agree that a decent default policy is important. As that was enhanced > > in [1/5] of this patchset. > > > > > the > > > overriding of per-node demotion targets can be deferred. The most > > > important problem here is that we should allow the configurations > > > where memory-only nodes are not used as demotion targets, which this > > > patch set has already addressed. > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > Yes. > > > IMHO, if we want to add a user space interface, I think that it should > > be powerful enough to address all existing issues and some potential > > future issues, so that it can be stable. I don't think it's a good idea > > to define a partial user space interface that works only for a specific > > use case and cannot be extended for other use cases. > > I actually think that they can be viewed as two separate problems: one > is to define which nodes can be used as demotion targets (this patch > set), and the other is how to initialize the per-node demotion path > (node_demotion[]). We don't have to solve both problems at the same > time. > > If we decide to go with a per-node demotion path customization > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > is a single global control to turn off all demotion targets (for the > machines that don't use memory-only nodes for demotion). > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs interface to enable reclaim migration"), a sysfs interface /sys/kernel/mm/numa/demotion_enabled is added to turn off all demotion targets. Best Regards, Huang, Ying
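For reference, the knob mentioned above is a plain boolean under /sys/kernel/mm/numa/. Roughly (a paraphrased sketch of the code added by that commit in mm/migrate.c, not a verbatim copy), it looks like this:

/* Paraphrased sketch of the demotion_enabled knob; see mm/migrate.c for the real code. */
#include <linux/kobject.h>
#include <linux/string.h>
#include <linux/sysfs.h>

bool numa_demotion_enabled = false;

static ssize_t demotion_enabled_show(struct kobject *kobj,
                                     struct kobj_attribute *attr, char *buf)
{
        return sysfs_emit(buf, "%s\n",
                          numa_demotion_enabled ? "true" : "false");
}

static ssize_t demotion_enabled_store(struct kobject *kobj,
                                      struct kobj_attribute *attr,
                                      const char *buf, size_t count)
{
        if (!strncmp(buf, "true", 4) || !strncmp(buf, "1", 1))
                numa_demotion_enabled = true;
        else if (!strncmp(buf, "false", 5) || !strncmp(buf, "0", 1))
                numa_demotion_enabled = false;
        else
                return -EINVAL;

        return count;
}

static struct kobj_attribute demotion_enabled_attr =
        __ATTR_RW(demotion_enabled);

Reclaim consults numa_demotion_enabled before demoting instead of discarding pages, which is why, as noted in the reply below, it gates demotion-in-reclaim rather than the demotion order itself.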
On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com <ying.huang@intel.com> wrote: > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > > <ying.huang@intel.com> wrote: > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > > Current implementation to find the demotion targets works > > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > > right choices as demotion targets. > > > > > > > > > > > > > > This patch series introduces the new node state > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > > targets, support is also added to set the demotion target > > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > > problems. For example, for system as follows, > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > > node 0, > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > node 0 cpus: 0 1 > > > > > > node 0 size: n MB > > > > > > node 0 free: n MB > > > > > > node 1 cpus: > > > > > > node 1 size: n MB > > > > > > node 1 free: n MB > > > > > > node 2 cpus: 2 3 > > > > > > node 2 size: n MB > > > > > > node 2 free: n MB > > > > > > node distances: > > > > > > node 0 1 2 > > > > > > 0: 10 40 20 > > > > > > 1: 40 10 80 > > > > > > 2: 20 80 10 > > > > > > > > > > > > Demotion order 1: > > > > > > > > > > > > node demotion_target > > > > > > 0 1 > > > > > > 1 X > > > > > > 2 X > > > > > > > > > > > > Demotion order 2: > > > > > > > > > > > > node demotion_target > > > > > > 0 1 > > > > > > 1 X > > > > > > 2 1 > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > > full advantage of the slow memory node. We can take any choice as > > > > > > automatic-generated order, while make the other choice possible via user > > > > > > space overridden. > > > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > > interface. How about the following user space interface? > > > > > > > > > > > > 1. Add a file "demotion_order_override" in > > > > > > /sys/devices/system/node/ > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > > overridden; "0" is output if not. > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > > > 6. 
When write a node list to the file, the demotion targets of nodeX > > > > > > will be set to the written nodes. And the demotion order of the system > > > > > > will become the overridden mode. > > > > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > > quite useful in real life for now (it might become useful in the > > > > > future, I can't tell). Imagine you manage hundred thousands of > > > > > machines, which may come from different vendors, have different > > > > > generations of hardware, have different versions of firmware, it would > > > > > be a nightmare for the users to configure the demotion targets > > > > > properly. So it would be great to have the kernel properly configure > > > > > it *without* intervening from the users. > > > > > > > > > > So we should pick up a proper default policy and stick with that > > > > > policy unless it doesn't work well for the most workloads. I do > > > > > understand it is hard to make everyone happy. My proposal is having > > > > > every node in the fast tier has a demotion target (at least one) if > > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > > this is also the current implementation. > > > > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced > > > in [1/5] of this patchset. > > > > > > > the > > > > overriding of per-node demotion targets can be deferred. The most > > > > important problem here is that we should allow the configurations > > > > where memory-only nodes are not used as demotion targets, which this > > > > patch set has already addressed. > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > > > Yes. > > > > > IMHO, if we want to add a user space interface, I think that it should > > > be powerful enough to address all existing issues and some potential > > > future issues, so that it can be stable. I don't think it's a good idea > > > to define a partial user space interface that works only for a specific > > > use case and cannot be extended for other use cases. > > > > I actually think that they can be viewed as two separate problems: one > > is to define which nodes can be used as demotion targets (this patch > > set), and the other is how to initialize the per-node demotion path > > (node_demotion[]). We don't have to solve both problems at the same > > time. > > > > If we decide to go with a per-node demotion path customization > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > > is a single global control to turn off all demotion targets (for the > > machines that don't use memory-only nodes for demotion). > > > > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs > interface to enable reclaim migration"), a sysfs interface > > /sys/kernel/mm/numa/demotion_enabled > > is added to turn off all demotion targets. IIUC, this sysfs interface only turns off demotion-in-reclaim. It will be even cleaner if we have an easy way to clear node_demotion[] and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not init scripts) can know that the machine doesn't even have memory tiering hardware enabled. > Best Regards, > Huang, Ying > > >
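A rough sketch of the kind of "clearing" suggested here, purely illustrative: the layout of node_demotion[] differs between kernel versions (a single target per node vs. an array of targets), so this assumes the simple one-target-per-node form, and clear_all_demotion_targets() is a made-up name, not code from the series:

/* Hypothetical sketch; not code from the series. */
#include <linux/nodemask.h>
#include <linux/numa.h>

extern int node_demotion[MAX_NUMNODES];	/* assumed one-target-per-node form */

static void clear_all_demotion_targets(void)
{
        int node;

        for_each_node(node) {
                /* No demotion path out of this node ... */
                node_demotion[node] = NUMA_NO_NODE;
                /* ... and the node is no longer advertised as a target. */
                node_clear_state(node, N_DEMOTION_TARGETS);
        }
}

With something like this wired to a sysfs write (or applied by default when no slow memory node is present), user space could tell from empty demotion_targets files that the machine has no demotion setup at all, rather than inferring it from demotion_enabled alone.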
On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote: > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com > <ying.huang@intel.com> wrote: > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > > > <ying.huang@intel.com> wrote: > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > > > Current implementation to find the demotion targets works > > > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > > > right choices as demotion targets. > > > > > > > > > > > > > > > > This patch series introduces the new node state > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > > > targets, support is also added to set the demotion target > > > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > > > problems. For example, for system as follows, > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > > > node 0, > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > node 0 cpus: 0 1 > > > > > > > node 0 size: n MB > > > > > > > node 0 free: n MB > > > > > > > node 1 cpus: > > > > > > > node 1 size: n MB > > > > > > > node 1 free: n MB > > > > > > > node 2 cpus: 2 3 > > > > > > > node 2 size: n MB > > > > > > > node 2 free: n MB > > > > > > > node distances: > > > > > > > node 0 1 2 > > > > > > > 0: 10 40 20 > > > > > > > 1: 40 10 80 > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > Demotion order 1: > > > > > > > > > > > > > > node demotion_target > > > > > > > 0 1 > > > > > > > 1 X > > > > > > > 2 X > > > > > > > > > > > > > > Demotion order 2: > > > > > > > > > > > > > > node demotion_target > > > > > > > 0 1 > > > > > > > 1 X > > > > > > > 2 1 > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > > > full advantage of the slow memory node. We can take any choice as > > > > > > > automatic-generated order, while make the other choice possible via user > > > > > > > space overridden. > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > > > interface. How about the following user space interface? > > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in > > > > > > > /sys/devices/system/node/ > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > > > overridden; "0" is output if not. > > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > > > > > 4. 
Add a file "demotion_targets" for each node in > > > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > > > > will be set to the written nodes. And the demotion order of the system > > > > > > > will become the overridden mode. > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > > > quite useful in real life for now (it might become useful in the > > > > > > future, I can't tell). Imagine you manage hundred thousands of > > > > > > machines, which may come from different vendors, have different > > > > > > generations of hardware, have different versions of firmware, it would > > > > > > be a nightmare for the users to configure the demotion targets > > > > > > properly. So it would be great to have the kernel properly configure > > > > > > it *without* intervening from the users. > > > > > > > > > > > > So we should pick up a proper default policy and stick with that > > > > > > policy unless it doesn't work well for the most workloads. I do > > > > > > understand it is hard to make everyone happy. My proposal is having > > > > > > every node in the fast tier has a demotion target (at least one) if > > > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > > > this is also the current implementation. > > > > > > > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced > > > > in [1/5] of this patchset. > > > > > > > > > the > > > > > overriding of per-node demotion targets can be deferred. The most > > > > > important problem here is that we should allow the configurations > > > > > where memory-only nodes are not used as demotion targets, which this > > > > > patch set has already addressed. > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > > > > > Yes. > > > > > > > IMHO, if we want to add a user space interface, I think that it should > > > > be powerful enough to address all existing issues and some potential > > > > future issues, so that it can be stable. I don't think it's a good idea > > > > to define a partial user space interface that works only for a specific > > > > use case and cannot be extended for other use cases. > > > > > > I actually think that they can be viewed as two separate problems: one > > > is to define which nodes can be used as demotion targets (this patch > > > set), and the other is how to initialize the per-node demotion path > > > (node_demotion[]). We don't have to solve both problems at the same > > > time. > > > > > > If we decide to go with a per-node demotion path customization > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > > > is a single global control to turn off all demotion targets (for the > > > machines that don't use memory-only nodes for demotion). > > > > > > > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs > > interface to enable reclaim migration"), a sysfs interface > > > > /sys/kernel/mm/numa/demotion_enabled > > > > is added to turn off all demotion targets. > > IIUC, this sysfs interface only turns off demotion-in-reclaim. 
It > will be even cleaner if we have an easy way to clear node_demotion[] > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not > init scripts) can know that the machine doesn't even have memory > tiering hardware enabled. > What is the difference? Right now we have no interface to show the demotion targets of a node; that is in-kernel only. What is memory tiering hardware? The Optane PMEM? Some information for it is available via the ACPI HMAT table. Other than demotion-in-reclaim, what else do you care about? Best Regards, Huang, Ying
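For reference, the existing knob mentioned above can be exercised from a shell. This is only a sketch, assuming a kernel that includes commit 20b51af15e01; the exact string returned by the read may differ between kernel versions:

  # Show whether reclaim-based demotion is currently enabled.
  cat /sys/kernel/mm/numa/demotion_enabled
  false

  # Enable or disable demotion-in-reclaim; the file takes boolean values.
  echo 1 > /sys/kernel/mm/numa/demotion_enabled
  echo 0 > /sys/kernel/mm/numa/demotion_enabled

As discussed above, this knob only gates demotion from the reclaim path; it does not reveal which nodes the kernel has selected as demotion targets.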
On Wed, Apr 20, 2022 at 10:41 PM Wei Xu <weixugc@google.com> wrote: > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > <ying.huang@intel.com> wrote: > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > Current implementation to find the demotion targets works > > > > based on node state N_MEMORY, however some systems may have > > > > dram only memory numa node which are N_MEMORY but not the > > > > right choices as demotion targets. > > > > > > > > This patch series introduces the new node state > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > is used to hold the list of nodes which can be used as demotion > > > > targets, support is also added to set the demotion target > > > > list from user space so that default behavior can be overridden. > > > > > > It appears that your proposed user space interface cannot solve all > > > problems. For example, for system as follows, > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > node 0, > > > > > > available: 3 nodes (0-2) > > > node 0 cpus: 0 1 > > > node 0 size: n MB > > > node 0 free: n MB > > > node 1 cpus: > > > node 1 size: n MB > > > node 1 free: n MB > > > node 2 cpus: 2 3 > > > node 2 size: n MB > > > node 2 free: n MB > > > node distances: > > > node 0 1 2 > > > 0: 10 40 20 > > > 1: 40 10 80 > > > 2: 20 80 10 > > > > > > Demotion order 1: > > > > > > node demotion_target > > > 0 1 > > > 1 X > > > 2 X > > > > > > Demotion order 2: > > > > > > node demotion_target > > > 0 1 > > > 1 X > > > 2 1 > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > traffic. While the demotion order 2 is preferred if we want to take > > > full advantage of the slow memory node. We can take any choice as > > > automatic-generated order, while make the other choice possible via user > > > space overridden. > > > > > > I don't know how to implement this via your proposed user space > > > interface. How about the following user space interface? > > > > > > 1. Add a file "demotion_order_override" in > > > /sys/devices/system/node/ > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > overridden; "0" is output if not. > > > > > > 3. When write "1", the demotion order of the system will become the > > > overridden mode. When write "0", the demotion order of the system will > > > become the automatic mode and the demotion order will be re-generated. > > > > > > 4. Add a file "demotion_targets" for each node in > > > /sys/devices/system/node/nodeX/ > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > will be set to the written nodes. And the demotion order of the system > > > will become the overridden mode. > > > > TBH I don't think having override demotion targets in userspace is > > quite useful in real life for now (it might become useful in the > > future, I can't tell). Imagine you manage hundred thousands of > > machines, which may come from different vendors, have different > > generations of hardware, have different versions of firmware, it would > > be a nightmare for the users to configure the demotion targets > > properly. So it would be great to have the kernel properly configure > > it *without* intervening from the users. 
> > > > So we should pick up a proper default policy and stick with that > > policy unless it doesn't work well for the most workloads. I do > > understand it is hard to make everyone happy. My proposal is having > > every node in the fast tier has a demotion target (at least one) if > > the slow tier exists sounds like a reasonable default policy. I think > > this is also the current implementation. > > > > This is reasonable. I agree that with a decent default policy, the > overriding of per-node demotion targets can be deferred. The most > important problem here is that we should allow the configurations > where memory-only nodes are not used as demotion targets, which this > patch set has already addressed. Yes, I agree. Fixing the bug and allowing override by userspace are totally two separate things. > > > > > > > To reduce the complexity, the demotion order of the system is either in > > > overridden mode or automatic mode. When converting from the automatic > > > mode to the overridden mode, the existing demotion targets of all nodes > > > will be retained before being changed. When converting from overridden > > > mode to automatic mode, the demotion order of the system will be re- > > > generated automatically. > > > > > > In overridden mode, the demotion targets of the hot-added and hot- > > > removed node will be set to empty. And the hot-removed node will be > > > removed from the demotion targets of any node. > > > > > > This is an extention of the interface used in the following patch, > > > > > > https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/ > > > > > > What do you think about this? > > > > > > > node state N_DEMOTION_TARGETS is also set from the dax kmem > > > > driver, certain type of memory which registers through dax kmem > > > > (e.g. HBM) may not be the right choices for demotion so in future > > > > they should be distinguished based on certain attributes and dax > > > > kmem driver should avoid setting them as N_DEMOTION_TARGETS, > > > > however current implementation also doesn't distinguish any > > > > such memory and it considers all N_MEMORY as demotion targets > > > > so this patch series doesn't modify the current behavior. > > > > > > > > > > Best Regards, > > > Huang, Ying > > > > > > [snip] > > >
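For concreteness, the override flow proposed earlier in this thread might look roughly like the sketch below. The demotion_order_override and per-node demotion_targets files are only a proposal at this point and do not exist in the mainline kernel; the example assumes the three-node topology quoted above:

  # Hypothetical files from the proposal above; not an existing ABI.
  # Writing a node list sets the demotion targets of node 2 and switches
  # the system into overridden mode (step 6 of the proposal).
  echo 1 > /sys/devices/system/node/node2/demotion_targets

  # "1" indicates the demotion order has been overridden (steps 1-2).
  cat /sys/devices/system/node/demotion_order_override
  1

  # Return to automatic mode; the order is re-generated (step 3).
  echo 0 > /sys/devices/system/node/demotion_order_override

With the example topology, the first write would switch the system from "Demotion order 1" (node 2 has no demotion target) to "Demotion order 2" (node 2 also demotes to node 1).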
On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com <ying.huang@intel.com> wrote: > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote: > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com > > <ying.huang@intel.com> wrote: > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > > > > Current implementation to find the demotion targets works > > > > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > > > > right choices as demotion targets. > > > > > > > > > > > > > > > > > > This patch series introduces the new node state > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > > > > targets, support is also added to set the demotion target > > > > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > > > > problems. For example, for system as follows, > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > > > > node 0, > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > node 0 cpus: 0 1 > > > > > > > > node 0 size: n MB > > > > > > > > node 0 free: n MB > > > > > > > > node 1 cpus: > > > > > > > > node 1 size: n MB > > > > > > > > node 1 free: n MB > > > > > > > > node 2 cpus: 2 3 > > > > > > > > node 2 size: n MB > > > > > > > > node 2 free: n MB > > > > > > > > node distances: > > > > > > > > node 0 1 2 > > > > > > > > 0: 10 40 20 > > > > > > > > 1: 40 10 80 > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > Demotion order 1: > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > 0 1 > > > > > > > > 1 X > > > > > > > > 2 X > > > > > > > > > > > > > > > > Demotion order 2: > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > 0 1 > > > > > > > > 1 X > > > > > > > > 2 1 > > > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > > > > full advantage of the slow memory node. We can take any choice as > > > > > > > > automatic-generated order, while make the other choice possible via user > > > > > > > > space overridden. > > > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > > > > interface. How about the following user space interface? > > > > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in > > > > > > > > /sys/devices/system/node/ > > > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > > > > overridden; "0" is output if not. > > > > > > > > > > > > > > > > 3. 
When write "1", the demotion order of the system will become the > > > > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > > > > > will be set to the written nodes. And the demotion order of the system > > > > > > > > will become the overridden mode. > > > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > > > > quite useful in real life for now (it might become useful in the > > > > > > > future, I can't tell). Imagine you manage hundred thousands of > > > > > > > machines, which may come from different vendors, have different > > > > > > > generations of hardware, have different versions of firmware, it would > > > > > > > be a nightmare for the users to configure the demotion targets > > > > > > > properly. So it would be great to have the kernel properly configure > > > > > > > it *without* intervening from the users. > > > > > > > > > > > > > > So we should pick up a proper default policy and stick with that > > > > > > > policy unless it doesn't work well for the most workloads. I do > > > > > > > understand it is hard to make everyone happy. My proposal is having > > > > > > > every node in the fast tier has a demotion target (at least one) if > > > > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > > > > this is also the current implementation. > > > > > > > > > > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced > > > > > in [1/5] of this patchset. > > > > > > > > > > > the > > > > > > overriding of per-node demotion targets can be deferred. The most > > > > > > important problem here is that we should allow the configurations > > > > > > where memory-only nodes are not used as demotion targets, which this > > > > > > patch set has already addressed. > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > > > > > > > Yes. > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should > > > > > be powerful enough to address all existing issues and some potential > > > > > future issues, so that it can be stable. I don't think it's a good idea > > > > > to define a partial user space interface that works only for a specific > > > > > use case and cannot be extended for other use cases. > > > > > > > > I actually think that they can be viewed as two separate problems: one > > > > is to define which nodes can be used as demotion targets (this patch > > > > set), and the other is how to initialize the per-node demotion path > > > > (node_demotion[]). We don't have to solve both problems at the same > > > > time. > > > > > > > > If we decide to go with a per-node demotion path customization > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > > > > is a single global control to turn off all demotion targets (for the > > > > machines that don't use memory-only nodes for demotion). > > > > > > > > > > There's one already. 
In commit 20b51af15e01 ("mm/migrate: add sysfs > > > interface to enable reclaim migration"), a sysfs interface > > > > > > /sys/kernel/mm/numa/demotion_enabled > > > > > > is added to turn off all demotion targets. > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim. It > > will be even cleaner if we have an easy way to clear node_demotion[] > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not > > init scripts) can know that the machine doesn't even have memory > > tiering hardware enabled. > > > > What is the difference? Now we have no interface to show demotion > targets of a node. That is in-kernel only. What is memory tiering > hardware? The Optane PMEM? Some information for it is available via > ACPI HMAT table. > > Except demotion-in-reclaim, what else do you care about? There is a difference: one is to indicate the availability of the memory tiering hardware and the other is to indicate whether transparent kernel-driven demotion from the reclaim path is activated. With /sys/devices/system/node/demote_targets or the per-node demotion target interface, the userspace can figure out the memory tiering topology abstracted by the kernel. It is possible to use application-guided demotion without having to enable reclaim-based demotion in the kernel. Logically it is also cleaner to me to decouple the tiering node representation from the actual demotion mechanism enablement. > Best Regards, > Huang, Ying > > >
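As a rough illustration of that decoupling, a post-boot agent can already spot CPU-less memory nodes from the existing node state files, and with an interface such as the proposed demote_targets it could read the abstracted tiering topology without touching the reclaim knob at all. The demote_targets path below is the proposal under discussion, not an existing ABI:

  # Existing sysfs files: node lists for N_MEMORY and N_CPU.
  cat /sys/devices/system/node/has_memory    # e.g. "0-2"
  cat /sys/devices/system/node/has_cpu       # e.g. "0,2"
  # A node listed in has_memory but not in has_cpu (node 1 here) is a
  # CPU-less memory node, i.e. a likely demotion-target candidate.

  # Hypothetical interface from this discussion: read the set of
  # demotion-target nodes directly, independent of
  # /sys/kernel/mm/numa/demotion_enabled.
  cat /sys/devices/system/node/demote_targets    # e.g. "1"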
On Thu, 2022-04-21 at 10:56 -0700, Yang Shi wrote: > On Wed, Apr 20, 2022 at 10:41 PM Wei Xu <weixugc@google.com> wrote: > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > <ying.huang@intel.com> wrote: > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > Current implementation to find the demotion targets works > > > > > based on node state N_MEMORY, however some systems may have > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > right choices as demotion targets. > > > > > > > > > > This patch series introduces the new node state > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > is used to hold the list of nodes which can be used as demotion > > > > > targets, support is also added to set the demotion target > > > > > list from user space so that default behavior can be overridden. > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > problems. For example, for system as follows, > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > node 0, > > > > > > > > available: 3 nodes (0-2) > > > > node 0 cpus: 0 1 > > > > node 0 size: n MB > > > > node 0 free: n MB > > > > node 1 cpus: > > > > node 1 size: n MB > > > > node 1 free: n MB > > > > node 2 cpus: 2 3 > > > > node 2 size: n MB > > > > node 2 free: n MB > > > > node distances: > > > > node 0 1 2 > > > > 0: 10 40 20 > > > > 1: 40 10 80 > > > > 2: 20 80 10 > > > > > > > > Demotion order 1: > > > > > > > > node demotion_target > > > > 0 1 > > > > 1 X > > > > 2 X > > > > > > > > Demotion order 2: > > > > > > > > node demotion_target > > > > 0 1 > > > > 1 X > > > > 2 1 > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > full advantage of the slow memory node. We can take any choice as > > > > automatic-generated order, while make the other choice possible via user > > > > space overridden. > > > > > > > > I don't know how to implement this via your proposed user space > > > > interface. How about the following user space interface? > > > > > > > > 1. Add a file "demotion_order_override" in > > > > /sys/devices/system/node/ > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > overridden; "0" is output if not. > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > overridden mode. When write "0", the demotion order of the system will > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > /sys/devices/system/node/nodeX/ > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > will be set to the written nodes. And the demotion order of the system > > > > will become the overridden mode. > > > > > > TBH I don't think having override demotion targets in userspace is > > > quite useful in real life for now (it might become useful in the > > > future, I can't tell). 
Imagine you manage hundred thousands of > > > machines, which may come from different vendors, have different > > > generations of hardware, have different versions of firmware, it would > > > be a nightmare for the users to configure the demotion targets > > > properly. So it would be great to have the kernel properly configure > > > it *without* intervening from the users. > > > > > > So we should pick up a proper default policy and stick with that > > > policy unless it doesn't work well for the most workloads. I do > > > understand it is hard to make everyone happy. My proposal is having > > > every node in the fast tier has a demotion target (at least one) if > > > the slow tier exists sounds like a reasonable default policy. I think > > > this is also the current implementation. > > > > > > > This is reasonable. I agree that with a decent default policy, the > > overriding of per-node demotion targets can be deferred. The most > > important problem here is that we should allow the configurations > > where memory-only nodes are not used as demotion targets, which this > > patch set has already addressed. > > Yes, I agree. Fixing the bug and allowing override by userspace are > totally two separate things. > Yes. I agree with separating the two, although [1/5] doesn't fix a bug but rather improves the automatic order generation method. So I think it's better to split this patchset into 2 patchsets: [1/5] for improving the automatic order generation, and [2-5/5] for the user space overriding. Best Regards, Huang, Ying > > > > > > > > > > To reduce the complexity, the demotion order of the system is either in > > > > overridden mode or automatic mode. When converting from the automatic > > > > mode to the overridden mode, the existing demotion targets of all nodes > > > > will be retained before being changed. When converting from overridden > > > > mode to automatic mode, the demotion order of the system will be re- > > > > generated automatically. > > > > > > > > In overridden mode, the demotion targets of the hot-added and hot- > > > > removed node will be set to empty. And the hot-removed node will be > > > > removed from the demotion targets of any node. > > > > > > > > This is an extention of the interface used in the following patch, > > > > > > > > https://lore.kernel.org/lkml/20191016221149.74AE222C@viggo.jf.intel.com/ > > > > > > > > What do you think about this? > > > > > > > > > node state N_DEMOTION_TARGETS is also set from the dax kmem > > > > > driver, certain type of memory which registers through dax kmem > > > > > (e.g. HBM) may not be the right choices for demotion so in future > > > > > they should be distinguished based on certain attributes and dax > > > > > kmem driver should avoid setting them as N_DEMOTION_TARGETS, > > > > > however current implementation also doesn't distinguish any > > > > > such memory and it considers all N_MEMORY as demotion targets > > > > > so this patch series doesn't modify the current behavior. > > > > > > > > > > > > > Best Regards, > > > > Huang, Ying > > > > > > > > [snip] > > > >
On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote: > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com > <ying.huang@intel.com> wrote: > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote: > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com > > > <ying.huang@intel.com> wrote: > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > > > > > Current implementation to find the demotion targets works > > > > > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > > > > > right choices as demotion targets. > > > > > > > > > > > > > > > > > > > > This patch series introduces the new node state > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > > > > > targets, support is also added to set the demotion target > > > > > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > > > > > problems. For example, for system as follows, > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > > > > > node 0, > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > node 0 size: n MB > > > > > > > > > node 0 free: n MB > > > > > > > > > node 1 cpus: > > > > > > > > > node 1 size: n MB > > > > > > > > > node 1 free: n MB > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > node 2 size: n MB > > > > > > > > > node 2 free: n MB > > > > > > > > > node distances: > > > > > > > > > node 0 1 2 > > > > > > > > > 0: 10 40 20 > > > > > > > > > 1: 40 10 80 > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > Demotion order 1: > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > 0 1 > > > > > > > > > 1 X > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > Demotion order 2: > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > 0 1 > > > > > > > > > 1 X > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > > > > > full advantage of the slow memory node. We can take any choice as > > > > > > > > > automatic-generated order, while make the other choice possible via user > > > > > > > > > space overridden. > > > > > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > > > > > interface. How about the following user space interface? > > > > > > > > > > > > > > > > > > 1. 
Add a file "demotion_order_override" in > > > > > > > > > /sys/devices/system/node/ > > > > > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > > > > > overridden; "0" is output if not. > > > > > > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > > > > > > will be set to the written nodes. And the demotion order of the system > > > > > > > > > will become the overridden mode. > > > > > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > > > > > quite useful in real life for now (it might become useful in the > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of > > > > > > > > machines, which may come from different vendors, have different > > > > > > > > generations of hardware, have different versions of firmware, it would > > > > > > > > be a nightmare for the users to configure the demotion targets > > > > > > > > properly. So it would be great to have the kernel properly configure > > > > > > > > it *without* intervening from the users. > > > > > > > > > > > > > > > > So we should pick up a proper default policy and stick with that > > > > > > > > policy unless it doesn't work well for the most workloads. I do > > > > > > > > understand it is hard to make everyone happy. My proposal is having > > > > > > > > every node in the fast tier has a demotion target (at least one) if > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > > > > > this is also the current implementation. > > > > > > > > > > > > > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > > > > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced > > > > > > in [1/5] of this patchset. > > > > > > > > > > > > > the > > > > > > > overriding of per-node demotion targets can be deferred. The most > > > > > > > important problem here is that we should allow the configurations > > > > > > > where memory-only nodes are not used as demotion targets, which this > > > > > > > patch set has already addressed. > > > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > > > > > > > > > Yes. > > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should > > > > > > be powerful enough to address all existing issues and some potential > > > > > > future issues, so that it can be stable. I don't think it's a good idea > > > > > > to define a partial user space interface that works only for a specific > > > > > > use case and cannot be extended for other use cases. > > > > > > > > > > I actually think that they can be viewed as two separate problems: one > > > > > is to define which nodes can be used as demotion targets (this patch > > > > > set), and the other is how to initialize the per-node demotion path > > > > > (node_demotion[]). 
We don't have to solve both problems at the same > > > > > time. > > > > > > > > > > If we decide to go with a per-node demotion path customization > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > > > > > is a single global control to turn off all demotion targets (for the > > > > > machines that don't use memory-only nodes for demotion). > > > > > > > > > > > > > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs > > > > interface to enable reclaim migration"), a sysfs interface > > > > > > > > /sys/kernel/mm/numa/demotion_enabled > > > > > > > > is added to turn off all demotion targets. > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim. It > > > will be even cleaner if we have an easy way to clear node_demotion[] > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not > > > init scripts) can know that the machine doesn't even have memory > > > tiering hardware enabled. > > > > > > > What is the difference? Now we have no interface to show demotion > > targets of a node. That is in-kernel only. What is memory tiering > > hardware? The Optane PMEM? Some information for it is available via > > ACPI HMAT table. > > > > Except demotion-in-reclaim, what else do you care about? > > There is a difference: one is to indicate the availability of the > memory tiering hardware and the other is to indicate whether > transparent kernel-driven demotion from the reclaim path is activated. > With /sys/devices/system/node/demote_targets or the per-node demotion > target interface, the userspace can figure out the memory tiering > topology abstracted by the kernel. It is possible to use > application-guided demotion without having to enable reclaim-based > demotion in the kernel. Logically it is also cleaner to me to > decouple the tiering node representation from the actual demotion > mechanism enablement. I am confused here. It appears that you need a way to expose the automatically generated demotion order from the kernel to user space. We can talk about that if you really need it. But [2-5/5] of this patchset is about overriding the automatically generated demotion order from user space. Best Regards, Huang, Ying
On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com <ying.huang@intel.com> wrote: > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote: > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com > > <ying.huang@intel.com> wrote: > > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote: > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > > > > > > Current implementation to find the demotion targets works > > > > > > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > > > > > > right choices as demotion targets. > > > > > > > > > > > > > > > > > > > > > > This patch series introduces the new node state > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > > > > > > targets, support is also added to set the demotion target > > > > > > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > > > > > > problems. For example, for system as follows, > > > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > > > > > > node 0, > > > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > > node 0 size: n MB > > > > > > > > > > node 0 free: n MB > > > > > > > > > > node 1 cpus: > > > > > > > > > > node 1 size: n MB > > > > > > > > > > node 1 free: n MB > > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > > node 2 size: n MB > > > > > > > > > > node 2 free: n MB > > > > > > > > > > node distances: > > > > > > > > > > node 0 1 2 > > > > > > > > > > 0: 10 40 20 > > > > > > > > > > 1: 40 10 80 > > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > > > Demotion order 1: > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > 0 1 > > > > > > > > > > 1 X > > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > Demotion order 2: > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > 0 1 > > > > > > > > > > 1 X > > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > > > > > > full advantage of the slow memory node. We can take any choice as > > > > > > > > > > automatic-generated order, while make the other choice possible via user > > > > > > > > > > space overridden. 
> > > > > > > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > > > > > > interface. How about the following user space interface? > > > > > > > > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in > > > > > > > > > > /sys/devices/system/node/ > > > > > > > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > > > > > > overridden; "0" is output if not. > > > > > > > > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > > > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > > > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > > > > > > > will be set to the written nodes. And the demotion order of the system > > > > > > > > > > will become the overridden mode. > > > > > > > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > > > > > > quite useful in real life for now (it might become useful in the > > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of > > > > > > > > > machines, which may come from different vendors, have different > > > > > > > > > generations of hardware, have different versions of firmware, it would > > > > > > > > > be a nightmare for the users to configure the demotion targets > > > > > > > > > properly. So it would be great to have the kernel properly configure > > > > > > > > > it *without* intervening from the users. > > > > > > > > > > > > > > > > > > So we should pick up a proper default policy and stick with that > > > > > > > > > policy unless it doesn't work well for the most workloads. I do > > > > > > > > > understand it is hard to make everyone happy. My proposal is having > > > > > > > > > every node in the fast tier has a demotion target (at least one) if > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > > > > > > this is also the current implementation. > > > > > > > > > > > > > > > > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > > > > > > > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced > > > > > > > in [1/5] of this patchset. > > > > > > > > > > > > > > > the > > > > > > > > overriding of per-node demotion targets can be deferred. The most > > > > > > > > important problem here is that we should allow the configurations > > > > > > > > where memory-only nodes are not used as demotion targets, which this > > > > > > > > patch set has already addressed. > > > > > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > > > > > > > > > > > Yes. > > > > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should > > > > > > > be powerful enough to address all existing issues and some potential > > > > > > > future issues, so that it can be stable. 
I don't think it's a good idea > > > > > > > to define a partial user space interface that works only for a specific > > > > > > > use case and cannot be extended for other use cases. > > > > > > > > > > > > I actually think that they can be viewed as two separate problems: one > > > > > > is to define which nodes can be used as demotion targets (this patch > > > > > > set), and the other is how to initialize the per-node demotion path > > > > > > (node_demotion[]). We don't have to solve both problems at the same > > > > > > time. > > > > > > > > > > > > If we decide to go with a per-node demotion path customization > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > > > > > > is a single global control to turn off all demotion targets (for the > > > > > > machines that don't use memory-only nodes for demotion). > > > > > > > > > > > > > > > > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs > > > > > interface to enable reclaim migration"), a sysfs interface > > > > > > > > > > /sys/kernel/mm/numa/demotion_enabled > > > > > > > > > > is added to turn off all demotion targets. > > > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim. It > > > > will be even cleaner if we have an easy way to clear node_demotion[] > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not > > > > init scripts) can know that the machine doesn't even have memory > > > > tiering hardware enabled. > > > > > > > > > > What is the difference? Now we have no interface to show demotion > > > targets of a node. That is in-kernel only. What is memory tiering > > > hardware? The Optane PMEM? Some information for it is available via > > > ACPI HMAT table. > > > > > > Except demotion-in-reclaim, what else do you care about? > > > > There is a difference: one is to indicate the availability of the > > memory tiering hardware and the other is to indicate whether > > transparent kernel-driven demotion from the reclaim path is activated. > > With /sys/devices/system/node/demote_targets or the per-node demotion > > target interface, the userspace can figure out the memory tiering > > topology abstracted by the kernel. It is possible to use > > application-guided demotion without having to enable reclaim-based > > demotion in the kernel. Logically it is also cleaner to me to > > decouple the tiering node representation from the actual demotion > > mechanism enablement. > > I am confused here. It appears that you need a way to expose the > automatic generated demotion order from kernel to user space interface. > We can talk about that if you really need it. > > But [2-5/5] of this patchset is to override the automatic generated > demotion order from user space to kernel interface. As a side effect of allowing user space to override the default set of demotion target nodes, it also provides a sysfs interface to allow userspace to read which nodes are currently being designated as demotion targets. The initialization of demotion targets is expected to complete during boot (either by kernel or via an init script). After that, the userspace processes (e.g. proactive tiering daemon or tiering-aware applications) can query this sysfs interface to know if there are any tiering nodes present and act accordingly. It would be even better to expose the per-node demotion order (node_demotion[]) via the sysfs interface (e.g. /sys/devices/system/node/nodeX/demotion_targets as you have suggested). 
It can be read-only until there are good use cases that require overriding the per-node demotion order.
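For the three-node example topology discussed earlier, reading such a read-only per-node attribute might look like the sketch below; the demotion_targets file name and its output format are assumptions based on the proposal in this thread, not an existing interface:

  # Hypothetical read-only view of node_demotion[] (proposed, not mainline).
  for n in /sys/devices/system/node/node*; do
          printf '%s: %s\n' "${n##*/}" "$(cat "$n"/demotion_targets)"
  done
  # Possible output with the automatically generated order
  # ("Demotion order 1" in the example above):
  # node0: 1
  # node1:
  # node2: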
On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote: > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com > <ying.huang@intel.com> wrote: > > > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote: > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com > > > <ying.huang@intel.com> wrote: > > > > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote: > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > > > > > > > Current implementation to find the demotion targets works > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > > > > > > > right choices as demotion targets. > > > > > > > > > > > > > > > > > > > > > > > > This patch series introduces the new node state > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > > > > > > > targets, support is also added to set the demotion target > > > > > > > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > > > > > > > problems. For example, for system as follows, > > > > > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > > > > > > > node 0, > > > > > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > > > node 0 size: n MB > > > > > > > > > > > node 0 free: n MB > > > > > > > > > > > node 1 cpus: > > > > > > > > > > > node 1 size: n MB > > > > > > > > > > > node 1 free: n MB > > > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > > > node 2 size: n MB > > > > > > > > > > > node 2 free: n MB > > > > > > > > > > > node distances: > > > > > > > > > > > node 0 1 2 > > > > > > > > > > > 0: 10 40 20 > > > > > > > > > > > 1: 40 10 80 > > > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > > > > > Demotion order 1: > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > 0 1 > > > > > > > > > > > 1 X > > > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > > > Demotion order 2: > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > 0 1 > > > > > > > > > > > 1 X > > > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > > > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > > > > > > > full advantage of the slow memory node. 
We can take any choice as > > > > > > > > > > > automatic-generated order, while make the other choice possible via user > > > > > > > > > > > space overridden. > > > > > > > > > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > > > > > > > interface. How about the following user space interface? > > > > > > > > > > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in > > > > > > > > > > > /sys/devices/system/node/ > > > > > > > > > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > > > > > > > overridden; "0" is output if not. > > > > > > > > > > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > > > > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > > > > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > > > > > > > > will be set to the written nodes. And the demotion order of the system > > > > > > > > > > > will become the overridden mode. > > > > > > > > > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > > > > > > > quite useful in real life for now (it might become useful in the > > > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of > > > > > > > > > > machines, which may come from different vendors, have different > > > > > > > > > > generations of hardware, have different versions of firmware, it would > > > > > > > > > > be a nightmare for the users to configure the demotion targets > > > > > > > > > > properly. So it would be great to have the kernel properly configure > > > > > > > > > > it *without* intervening from the users. > > > > > > > > > > > > > > > > > > > > So we should pick up a proper default policy and stick with that > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > > > > > > > this is also the current implementation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > > > > > > > > > > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced > > > > > > > > in [1/5] of this patchset. > > > > > > > > > > > > > > > > > the > > > > > > > > > overriding of per-node demotion targets can be deferred. The most > > > > > > > > > important problem here is that we should allow the configurations > > > > > > > > > where memory-only nodes are not used as demotion targets, which this > > > > > > > > > patch set has already addressed. > > > > > > > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > > > > > > > > > > > > > Yes. 
> > > > > > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should > > > > > > > > be powerful enough to address all existing issues and some potential > > > > > > > > future issues, so that it can be stable. I don't think it's a good idea > > > > > > > > to define a partial user space interface that works only for a specific > > > > > > > > use case and cannot be extended for other use cases. > > > > > > > > > > > > > > I actually think that they can be viewed as two separate problems: one > > > > > > > is to define which nodes can be used as demotion targets (this patch > > > > > > > set), and the other is how to initialize the per-node demotion path > > > > > > > (node_demotion[]). We don't have to solve both problems at the same > > > > > > > time. > > > > > > > > > > > > > > If we decide to go with a per-node demotion path customization > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > > > > > > > is a single global control to turn off all demotion targets (for the > > > > > > > machines that don't use memory-only nodes for demotion). > > > > > > > > > > > > > > > > > > > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs > > > > > > interface to enable reclaim migration"), a sysfs interface > > > > > > > > > > > > /sys/kernel/mm/numa/demotion_enabled > > > > > > > > > > > > is added to turn off all demotion targets. > > > > > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim. It > > > > > will be even cleaner if we have an easy way to clear node_demotion[] > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not > > > > > init scripts) can know that the machine doesn't even have memory > > > > > tiering hardware enabled. > > > > > > > > > > > > > What is the difference? Now we have no interface to show demotion > > > > targets of a node. That is in-kernel only. What is memory tiering > > > > hardware? The Optane PMEM? Some information for it is available via > > > > ACPI HMAT table. > > > > > > > > Except demotion-in-reclaim, what else do you care about? > > > > > > There is a difference: one is to indicate the availability of the > > > memory tiering hardware and the other is to indicate whether > > > transparent kernel-driven demotion from the reclaim path is activated. > > > With /sys/devices/system/node/demote_targets or the per-node demotion > > > target interface, the userspace can figure out the memory tiering > > > topology abstracted by the kernel. It is possible to use > > > application-guided demotion without having to enable reclaim-based > > > demotion in the kernel. Logically it is also cleaner to me to > > > decouple the tiering node representation from the actual demotion > > > mechanism enablement. > > > > I am confused here. It appears that you need a way to expose the > > automatic generated demotion order from kernel to user space interface. > > We can talk about that if you really need it. > > > > But [2-5/5] of this patchset is to override the automatic generated > > demotion order from user space to kernel interface. > > As a side effect of allowing user space to override the default set of > demotion target nodes, it also provides a sysfs interface to allow > userspace to read which nodes are currently being designated as > demotion targets. > > The initialization of demotion targets is expected to complete during > boot (either by kernel or via an init script). After that, the > userspace processes (e.g. 
proactive tiering daemon or tiering-aware > applications) can query this sysfs interface to know if there are any > tiering nodes present and act accordingly. > > It would be even better to expose the per-node demotion order > (node_demotion[]) via the sysfs interface (e.g. > /sys/devices/system/node/nodeX/demotion_targets as you have > suggested). It can be read-only until there are good use cases to > require overriding the per-node demotion order. I am OK with exposing the system demotion order to user space, for example via /sys/devices/system/node/nodeX/demotion_targets, but read-only. But if we want to add functionality to override the system demotion order, we need to consider the user space interface carefully, at least after collecting all the requirements raised so far. I don't think the interface proposed in [2-5/5] of this patchset is sufficient or extensible enough. Best Regards, Huang, Ying
On Thu, Apr 21, 2022 at 10:40 PM ying.huang@intel.com <ying.huang@intel.com> wrote: > > On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote: > > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com > > <ying.huang@intel.com> wrote: > > > > > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote: > > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote: > > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > > > > > > > > Current implementation to find the demotion targets works > > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > > > > > > > > right choices as demotion targets. > > > > > > > > > > > > > > > > > > > > > > > > > > This patch series introduces the new node state > > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > > > > > > > > targets, support is also added to set the demotion target > > > > > > > > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > > > > > > > > problems. 
For example, for system as follows, > > > > > > > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > > > > > > > > node 0, > > > > > > > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > > > > node 0 size: n MB > > > > > > > > > > > > node 0 free: n MB > > > > > > > > > > > > node 1 cpus: > > > > > > > > > > > > node 1 size: n MB > > > > > > > > > > > > node 1 free: n MB > > > > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > > > > node 2 size: n MB > > > > > > > > > > > > node 2 free: n MB > > > > > > > > > > > > node distances: > > > > > > > > > > > > node 0 1 2 > > > > > > > > > > > > 0: 10 40 20 > > > > > > > > > > > > 1: 40 10 80 > > > > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > > > > > > > Demotion order 1: > > > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > > 0 1 > > > > > > > > > > > > 1 X > > > > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > > > > > Demotion order 2: > > > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > > 0 1 > > > > > > > > > > > > 1 X > > > > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > > > > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > > > > > > > > full advantage of the slow memory node. We can take any choice as > > > > > > > > > > > > automatic-generated order, while make the other choice possible via user > > > > > > > > > > > > space overridden. > > > > > > > > > > > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > > > > > > > > interface. How about the following user space interface? > > > > > > > > > > > > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in > > > > > > > > > > > > /sys/devices/system/node/ > > > > > > > > > > > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > > > > > > > > overridden; "0" is output if not. > > > > > > > > > > > > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > > > > > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > > > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > > > > > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > > > > > > > > > will be set to the written nodes. And the demotion order of the system > > > > > > > > > > > > will become the overridden mode. > > > > > > > > > > > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > > > > > > > > quite useful in real life for now (it might become useful in the > > > > > > > > > > > future, I can't tell). 
Imagine you manage hundred thousands of > > > > > > > > > > > machines, which may come from different vendors, have different > > > > > > > > > > > generations of hardware, have different versions of firmware, it would > > > > > > > > > > > be a nightmare for the users to configure the demotion targets > > > > > > > > > > > properly. So it would be great to have the kernel properly configure > > > > > > > > > > > it *without* intervening from the users. > > > > > > > > > > > > > > > > > > > > > > So we should pick up a proper default policy and stick with that > > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do > > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having > > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if > > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > > > > > > > > this is also the current implementation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > > > > > > > > > > > > > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced > > > > > > > > > in [1/5] of this patchset. > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > overriding of per-node demotion targets can be deferred. The most > > > > > > > > > > important problem here is that we should allow the configurations > > > > > > > > > > where memory-only nodes are not used as demotion targets, which this > > > > > > > > > > patch set has already addressed. > > > > > > > > > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should > > > > > > > > > be powerful enough to address all existing issues and some potential > > > > > > > > > future issues, so that it can be stable. I don't think it's a good idea > > > > > > > > > to define a partial user space interface that works only for a specific > > > > > > > > > use case and cannot be extended for other use cases. > > > > > > > > > > > > > > > > I actually think that they can be viewed as two separate problems: one > > > > > > > > is to define which nodes can be used as demotion targets (this patch > > > > > > > > set), and the other is how to initialize the per-node demotion path > > > > > > > > (node_demotion[]). We don't have to solve both problems at the same > > > > > > > > time. > > > > > > > > > > > > > > > > If we decide to go with a per-node demotion path customization > > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > > > > > > > > is a single global control to turn off all demotion targets (for the > > > > > > > > machines that don't use memory-only nodes for demotion). > > > > > > > > > > > > > > > > > > > > > > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs > > > > > > > interface to enable reclaim migration"), a sysfs interface > > > > > > > > > > > > > > /sys/kernel/mm/numa/demotion_enabled > > > > > > > > > > > > > > is added to turn off all demotion targets. > > > > > > > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim. 
It > > > > > > will be even cleaner if we have an easy way to clear node_demotion[] > > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not > > > > > > init scripts) can know that the machine doesn't even have memory > > > > > > tiering hardware enabled. > > > > > > > > > > > > > > > > What is the difference? Now we have no interface to show demotion > > > > > targets of a node. That is in-kernel only. What is memory tiering > > > > > hardware? The Optane PMEM? Some information for it is available via > > > > > ACPI HMAT table. > > > > > > > > > > Except demotion-in-reclaim, what else do you care about? > > > > > > > > There is a difference: one is to indicate the availability of the > > > > memory tiering hardware and the other is to indicate whether > > > > transparent kernel-driven demotion from the reclaim path is activated. > > > > With /sys/devices/system/node/demote_targets or the per-node demotion > > > > target interface, the userspace can figure out the memory tiering > > > > topology abstracted by the kernel. It is possible to use > > > > application-guided demotion without having to enable reclaim-based > > > > demotion in the kernel. Logically it is also cleaner to me to > > > > decouple the tiering node representation from the actual demotion > > > > mechanism enablement. > > > > > > I am confused here. It appears that you need a way to expose the > > > automatic generated demotion order from kernel to user space interface. > > > We can talk about that if you really need it. > > > > > > But [2-5/5] of this patchset is to override the automatic generated > > > demotion order from user space to kernel interface. > > > > As a side effect of allowing user space to override the default set of > > demotion target nodes, it also provides a sysfs interface to allow > > userspace to read which nodes are currently being designated as > > demotion targets. > > > > The initialization of demotion targets is expected to complete during > > boot (either by kernel or via an init script). After that, the > > userspace processes (e.g. proactive tiering daemon or tiering-aware > > applications) can query this sysfs interface to know if there are any > > tiering nodes present and act accordingly. > > > > It would be even better to expose the per-node demotion order > > (node_demotion[]) via the sysfs interface (e.g. > > /sys/devices/system/node/nodeX/demotion_targets as you have > > suggested). It can be read-only until there are good use cases to > > require overriding the per-node demotion order. > > I am OK to expose the system demotion order to user space. For example, > via /sys/devices/system/node/nodeX/demotion_targets, but read-only. Sounds good. We can send out a patch for such a read-only interface. > But if we want to add functionality to override system demotion order, > we need to consider the user space interface carefully, at least after > collecting all requirement so far. I don't think the interface proposed > in [2-5/5] of this patchset is sufficient or extensible enough. The current proposed interface should be sufficient to override which nodes can serve as demotion targets. I agree that it is not sufficient if userspace wants to redefine the per-node demotion targets and a suitable user space interface for that purpose needs to be designed carefully. I also agree that it is better to move out patch 1/5 from this patchset. > Best Regards, > Huang, Ying > > >
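For illustration, a minimal userspace sketch of how such a read-only per-node file could be queried. The path follows the proposal in this thread (/sys/devices/system/node/nodeX/demotion_targets); whether and in what form the file exists depends on the kernel, and the node range probed here is arbitrary.

#include <stdio.h>

int main(void)
{
	char path[64], buf[128];

	/* Probe the first few nodes; nodes that are absent, or kernels
	 * without the proposed file, are simply skipped. */
	for (int nid = 0; nid < 8; nid++) {
		snprintf(path, sizeof(path),
			 "/sys/devices/system/node/node%d/demotion_targets", nid);
		FILE *f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(buf, sizeof(buf), f))
			printf("node%d -> %s", nid, buf);
		fclose(f);
	}
	return 0;
}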
On Thu, 2022-04-21 at 23:13 -0700, Wei Xu wrote: > On Thu, Apr 21, 2022 at 10:40 PM ying.huang@intel.com > <ying.huang@intel.com> wrote: > > > > On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote: > > > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com > > > <ying.huang@intel.com> wrote: > > > > > > > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote: > > > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote: > > > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > > > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > > > > > > > > > Current implementation to find the demotion targets works > > > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > > > > > > > > > right choices as demotion targets. > > > > > > > > > > > > > > > > > > > > > > > > > > > > This patch series introduces the new node state > > > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > > > > > > > > > targets, support is also added to set the demotion target > > > > > > > > > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > > > > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > > > > > > > > > problems. 
For example, for system as follows, > > > > > > > > > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > > > > > > > > > node 0, > > > > > > > > > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > > > > > node 0 size: n MB > > > > > > > > > > > > > node 0 free: n MB > > > > > > > > > > > > > node 1 cpus: > > > > > > > > > > > > > node 1 size: n MB > > > > > > > > > > > > > node 1 free: n MB > > > > > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > > > > > node 2 size: n MB > > > > > > > > > > > > > node 2 free: n MB > > > > > > > > > > > > > node distances: > > > > > > > > > > > > > node 0 1 2 > > > > > > > > > > > > > 0: 10 40 20 > > > > > > > > > > > > > 1: 40 10 80 > > > > > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > > > > > > > > > Demotion order 1: > > > > > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > > > 0 1 > > > > > > > > > > > > > 1 X > > > > > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > > > > > > > Demotion order 2: > > > > > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > > > 0 1 > > > > > > > > > > > > > 1 X > > > > > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > > > > > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > > > > > > > > > full advantage of the slow memory node. We can take any choice as > > > > > > > > > > > > > automatic-generated order, while make the other choice possible via user > > > > > > > > > > > > > space overridden. > > > > > > > > > > > > > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > > > > > > > > > interface. How about the following user space interface? > > > > > > > > > > > > > > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in > > > > > > > > > > > > > /sys/devices/system/node/ > > > > > > > > > > > > > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > > > > > > > > > overridden; "0" is output if not. > > > > > > > > > > > > > > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > > > > > > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > > > > > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > > > > > > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > > > > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > > > > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > > > > > > > > > > will be set to the written nodes. And the demotion order of the system > > > > > > > > > > > > > will become the overridden mode. > > > > > > > > > > > > > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > > > > > > > > > quite useful in real life for now (it might become useful in the > > > > > > > > > > > > future, I can't tell). 
Imagine you manage hundred thousands of > > > > > > > > > > > > machines, which may come from different vendors, have different > > > > > > > > > > > > generations of hardware, have different versions of firmware, it would > > > > > > > > > > > > be a nightmare for the users to configure the demotion targets > > > > > > > > > > > > properly. So it would be great to have the kernel properly configure > > > > > > > > > > > > it *without* intervening from the users. > > > > > > > > > > > > > > > > > > > > > > > > So we should pick up a proper default policy and stick with that > > > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do > > > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having > > > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if > > > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > > > > > > > > > this is also the current implementation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced > > > > > > > > > > in [1/5] of this patchset. > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > overriding of per-node demotion targets can be deferred. The most > > > > > > > > > > > important problem here is that we should allow the configurations > > > > > > > > > > > where memory-only nodes are not used as demotion targets, which this > > > > > > > > > > > patch set has already addressed. > > > > > > > > > > > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should > > > > > > > > > > be powerful enough to address all existing issues and some potential > > > > > > > > > > future issues, so that it can be stable. I don't think it's a good idea > > > > > > > > > > to define a partial user space interface that works only for a specific > > > > > > > > > > use case and cannot be extended for other use cases. > > > > > > > > > > > > > > > > > > I actually think that they can be viewed as two separate problems: one > > > > > > > > > is to define which nodes can be used as demotion targets (this patch > > > > > > > > > set), and the other is how to initialize the per-node demotion path > > > > > > > > > (node_demotion[]). We don't have to solve both problems at the same > > > > > > > > > time. > > > > > > > > > > > > > > > > > > If we decide to go with a per-node demotion path customization > > > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > > > > > > > > > is a single global control to turn off all demotion targets (for the > > > > > > > > > machines that don't use memory-only nodes for demotion). > > > > > > > > > > > > > > > > > > > > > > > > > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs > > > > > > > > interface to enable reclaim migration"), a sysfs interface > > > > > > > > > > > > > > > > /sys/kernel/mm/numa/demotion_enabled > > > > > > > > > > > > > > > > is added to turn off all demotion targets. > > > > > > > > > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim. 
It > > > > > > > will be even cleaner if we have an easy way to clear node_demotion[] > > > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not > > > > > > > init scripts) can know that the machine doesn't even have memory > > > > > > > tiering hardware enabled. > > > > > > > > > > > > > > > > > > > What is the difference? Now we have no interface to show demotion > > > > > > targets of a node. That is in-kernel only. What is memory tiering > > > > > > hardware? The Optane PMEM? Some information for it is available via > > > > > > ACPI HMAT table. > > > > > > > > > > > > Except demotion-in-reclaim, what else do you care about? > > > > > > > > > > There is a difference: one is to indicate the availability of the > > > > > memory tiering hardware and the other is to indicate whether > > > > > transparent kernel-driven demotion from the reclaim path is activated. > > > > > With /sys/devices/system/node/demote_targets or the per-node demotion > > > > > target interface, the userspace can figure out the memory tiering > > > > > topology abstracted by the kernel. It is possible to use > > > > > application-guided demotion without having to enable reclaim-based > > > > > demotion in the kernel. Logically it is also cleaner to me to > > > > > decouple the tiering node representation from the actual demotion > > > > > mechanism enablement. > > > > > > > > I am confused here. It appears that you need a way to expose the > > > > automatic generated demotion order from kernel to user space interface. > > > > We can talk about that if you really need it. > > > > > > > > But [2-5/5] of this patchset is to override the automatic generated > > > > demotion order from user space to kernel interface. > > > > > > As a side effect of allowing user space to override the default set of > > > demotion target nodes, it also provides a sysfs interface to allow > > > userspace to read which nodes are currently being designated as > > > demotion targets. > > > > > > The initialization of demotion targets is expected to complete during > > > boot (either by kernel or via an init script). After that, the > > > userspace processes (e.g. proactive tiering daemon or tiering-aware > > > applications) can query this sysfs interface to know if there are any > > > tiering nodes present and act accordingly. > > > > > > It would be even better to expose the per-node demotion order > > > (node_demotion[]) via the sysfs interface (e.g. > > > /sys/devices/system/node/nodeX/demotion_targets as you have > > > suggested). It can be read-only until there are good use cases to > > > require overriding the per-node demotion order. > > > > I am OK to expose the system demotion order to user space. For example, > > via /sys/devices/system/node/nodeX/demotion_targets, but read-only. > > Sounds good. We can send out a patch for such a read-only interface. > > > But if we want to add functionality to override system demotion order, > > we need to consider the user space interface carefully, at least after > > collecting all requirement so far. I don't think the interface proposed > > in [2-5/5] of this patchset is sufficient or extensible enough. > > The current proposed interface should be sufficient to override which > nodes can serve as demotion targets. I agree that it is not > sufficient if userspace wants to redefine the per-node demotion > targets and a suitable user space interface for that purpose needs to > be designed carefully. > IMHO, it's better to define both together. 
That is, collect all the requirements and design it carefully, keeping extensibility in mind. If the timing isn't right yet, we can defer it and collect more requirements first. That's not urgent even for the authors' system, because they can simply leave demotion-in-reclaim disabled. Best Regards, Huang, Ying > I also agree that it is better to move out patch 1/5 from this patchset. > > > Best Regards, > > Huang, Ying > > > >
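As a side note on that last point, reclaim-based demotion can already be left disabled through the existing knob referenced earlier in the thread (commit 20b51af15e01). A minimal sketch, assuming the running kernel provides /sys/kernel/mm/numa/demotion_enabled and the caller may write to it:

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/kernel/mm/numa/demotion_enabled", "w");

	if (!f) {
		perror("demotion_enabled");
		return 1;
	}
	/* "0" disables demotion from the reclaim path. */
	fputs("0\n", f);
	return fclose(f) ? 1 : 0;
}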
On Fri, Apr 22, 2022 at 02:21:47PM +0800, ying.huang@intel.com wrote: > On Thu, 2022-04-21 at 23:13 -0700, Wei Xu wrote: > > On Thu, Apr 21, 2022 at 10:40 PM ying.huang@intel.com > > <ying.huang@intel.com> wrote: > > > > > > On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote: > > > > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote: > > > > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote: > > > > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > > > > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > > > > > > > > > > Current implementation to find the demotion targets works > > > > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > > > > > > > > > > right choices as demotion targets. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This patch series introduces the new node state > > > > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > > > > > > > > > > targets, support is also added to set the demotion target > > > > > > > > > > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > > > > > > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > > > > > > > > > > problems. 
For example, for system as follows, > > > > > > > > > > > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > > > > > > > > > > node 0, > > > > > > > > > > > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > > > > > > node 0 size: n MB > > > > > > > > > > > > > > node 0 free: n MB > > > > > > > > > > > > > > node 1 cpus: > > > > > > > > > > > > > > node 1 size: n MB > > > > > > > > > > > > > > node 1 free: n MB > > > > > > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > > > > > > node 2 size: n MB > > > > > > > > > > > > > > node 2 free: n MB > > > > > > > > > > > > > > node distances: > > > > > > > > > > > > > > node 0 1 2 > > > > > > > > > > > > > > 0: 10 40 20 > > > > > > > > > > > > > > 1: 40 10 80 > > > > > > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > > > > > > > > > > > Demotion order 1: > > > > > > > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > > > > 0 1 > > > > > > > > > > > > > > 1 X > > > > > > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > > > > > > > > > Demotion order 2: > > > > > > > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > > > > 0 1 > > > > > > > > > > > > > > 1 X > > > > > > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > > > > > > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > > > > > > > > > > full advantage of the slow memory node. We can take any choice as > > > > > > > > > > > > > > automatic-generated order, while make the other choice possible via user > > > > > > > > > > > > > > space overridden. > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > > > > > > > > > > interface. How about the following user space interface? > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in > > > > > > > > > > > > > > /sys/devices/system/node/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > > > > > > > > > > overridden; "0" is output if not. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > > > > > > > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > > > > > > > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > > > > > > > > > > > will be set to the written nodes. And the demotion order of the system > > > > > > > > > > > > > > will become the overridden mode. > > > > > > > > > > > > > > > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > > > > > > > > > > quite useful in real life for now (it might become useful in the > > > > > > > > > > > > > future, I can't tell). 
Imagine you manage hundred thousands of > > > > > > > > > > > > > machines, which may come from different vendors, have different > > > > > > > > > > > > > generations of hardware, have different versions of firmware, it would > > > > > > > > > > > > > be a nightmare for the users to configure the demotion targets > > > > > > > > > > > > > properly. So it would be great to have the kernel properly configure > > > > > > > > > > > > > it *without* intervening from the users. > > > > > > > > > > > > > > > > > > > > > > > > > > So we should pick up a proper default policy and stick with that > > > > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do > > > > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having > > > > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if > > > > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > > > > > > > > > > this is also the current implementation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced > > > > > > > > > > > in [1/5] of this patchset. > > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > overriding of per-node demotion targets can be deferred. The most > > > > > > > > > > > > important problem here is that we should allow the configurations > > > > > > > > > > > > where memory-only nodes are not used as demotion targets, which this > > > > > > > > > > > > patch set has already addressed. > > > > > > > > > > > > > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > > > > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should > > > > > > > > > > > be powerful enough to address all existing issues and some potential > > > > > > > > > > > future issues, so that it can be stable. I don't think it's a good idea > > > > > > > > > > > to define a partial user space interface that works only for a specific > > > > > > > > > > > use case and cannot be extended for other use cases. > > > > > > > > > > > > > > > > > > > > I actually think that they can be viewed as two separate problems: one > > > > > > > > > > is to define which nodes can be used as demotion targets (this patch > > > > > > > > > > set), and the other is how to initialize the per-node demotion path > > > > > > > > > > (node_demotion[]). We don't have to solve both problems at the same > > > > > > > > > > time. > > > > > > > > > > > > > > > > > > > > If we decide to go with a per-node demotion path customization > > > > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > > > > > > > > > > is a single global control to turn off all demotion targets (for the > > > > > > > > > > machines that don't use memory-only nodes for demotion). > > > > > > > > > > > > > > > > > > > > > > > > > > > > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs > > > > > > > > > interface to enable reclaim migration"), a sysfs interface > > > > > > > > > > > > > > > > > > /sys/kernel/mm/numa/demotion_enabled > > > > > > > > > > > > > > > > > > is added to turn off all demotion targets. 
> > > > > > > > > > > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim. It > > > > > > > > will be even cleaner if we have an easy way to clear node_demotion[] > > > > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not > > > > > > > > init scripts) can know that the machine doesn't even have memory > > > > > > > > tiering hardware enabled. > > > > > > > > > > > > > > > > > > > > > > What is the difference? Now we have no interface to show demotion > > > > > > > targets of a node. That is in-kernel only. What is memory tiering > > > > > > > hardware? The Optane PMEM? Some information for it is available via > > > > > > > ACPI HMAT table. > > > > > > > > > > > > > > Except demotion-in-reclaim, what else do you care about? > > > > > > > > > > > > There is a difference: one is to indicate the availability of the > > > > > > memory tiering hardware and the other is to indicate whether > > > > > > transparent kernel-driven demotion from the reclaim path is activated. > > > > > > With /sys/devices/system/node/demote_targets or the per-node demotion > > > > > > target interface, the userspace can figure out the memory tiering > > > > > > topology abstracted by the kernel. It is possible to use > > > > > > application-guided demotion without having to enable reclaim-based > > > > > > demotion in the kernel. Logically it is also cleaner to me to > > > > > > decouple the tiering node representation from the actual demotion > > > > > > mechanism enablement. > > > > > > > > > > I am confused here. It appears that you need a way to expose the > > > > > automatic generated demotion order from kernel to user space interface. > > > > > We can talk about that if you really need it. > > > > > > > > > > But [2-5/5] of this patchset is to override the automatic generated > > > > > demotion order from user space to kernel interface. > > > > > > > > As a side effect of allowing user space to override the default set of > > > > demotion target nodes, it also provides a sysfs interface to allow > > > > userspace to read which nodes are currently being designated as > > > > demotion targets. > > > > > > > > The initialization of demotion targets is expected to complete during > > > > boot (either by kernel or via an init script). After that, the > > > > userspace processes (e.g. proactive tiering daemon or tiering-aware > > > > applications) can query this sysfs interface to know if there are any > > > > tiering nodes present and act accordingly. > > > > > > > > It would be even better to expose the per-node demotion order > > > > (node_demotion[]) via the sysfs interface (e.g. > > > > /sys/devices/system/node/nodeX/demotion_targets as you have > > > > suggested). It can be read-only until there are good use cases to > > > > require overriding the per-node demotion order. > > > > > > I am OK to expose the system demotion order to user space. For example, > > > via /sys/devices/system/node/nodeX/demotion_targets, but read-only. > > > > Sounds good. We can send out a patch for such a read-only interface. > > > > > But if we want to add functionality to override system demotion order, > > > we need to consider the user space interface carefully, at least after > > > collecting all requirement so far. I don't think the interface proposed > > > in [2-5/5] of this patchset is sufficient or extensible enough. > > > > The current proposed interface should be sufficient to override which > > nodes can serve as demotion targets. 
I agree that it is not > > sufficient if userspace wants to redefine the per-node demotion > > targets and a suitable user space interface for that purpose needs to > > be designed carefully. > > > > IMHO, it's better to define both together. That is, collect all > requirement, and design it carefully, keeping extensible in mind. If > it's not the good timing yet, we can defer it to collect more > requirement. That's not urgent even for authors' system, because they > can just don't enable demotion-in-reclaim. > > Best Regards, > Huang, Ying I think it is necessary to have either a per-node demotion targets configuration or the user space interface supported by this patch series. As we don't have a clear consensus on what the user interface should look like, we can defer the per-node demotion target set interface to the future, until the real need arises. The current patch series sets N_DEMOTION_TARGETS from the dax kmem driver; it is possible that some memory node desired as a demotion target is not detected in the system through the dax kmem probe path. It is also possible that some of the dax devices are not preferred as demotion targets, e.g. HBM; for such devices, the node shouldn't be set to N_DEMOTION_TARGETS. In the future, support should be added to distinguish such dax devices and not mark them as N_DEMOTION_TARGETS from the kernel, but for now this user space interface will be useful to avoid such devices as demotion targets. We can add a read-only interface to view per-node demotion targets at /sys/devices/system/node/nodeX/demotion_targets, remove the duplicated /sys/kernel/mm/numa/demotion_target interface, and instead make /sys/devices/system/node/demotion_targets writable. Huang, Wei, Yang, what do you suggest?
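If the writable global file suggested above were adopted, a boot-time script or agent might use it roughly as in the sketch below. Both the path (/sys/devices/system/node/demotion_targets) and the node-list format ("2-3") are assumptions taken from this proposal, not an existing ABI.

#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/sys/devices/system/node/demotion_targets", "w");

	if (!f) {
		perror("demotion_targets");
		return 1;
	}
	/* Allow only nodes 2-3 (e.g. the slow memory nodes) to be used
	 * as demotion targets, overriding the kernel default. */
	fprintf(f, "2-3\n");
	return fclose(f) ? 1 : 0;
}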
On Fri, Apr 22, 2022 at 4:00 AM Jagdish Gediya <jvgediya@linux.ibm.com> wrote: > > On Fri, Apr 22, 2022 at 02:21:47PM +0800, ying.huang@intel.com wrote: > > On Thu, 2022-04-21 at 23:13 -0700, Wei Xu wrote: > > > On Thu, Apr 21, 2022 at 10:40 PM ying.huang@intel.com > > > <ying.huang@intel.com> wrote: > > > > > > > > On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote: > > > > > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote: > > > > > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote: > > > > > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > > > > > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > > > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > > > > > > > > > > > Current implementation to find the demotion targets works > > > > > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > > > > > > > > > > > right choices as demotion targets. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This patch series introduces the new node state > > > > > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > > > > > > > > > > > targets, support is also added to set the demotion target > > > > > > > > > > > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > > > > > > > > > > > problems. 
For example, for system as follows, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > > > > > > > > > > > node 0, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > > > > > > > node 0 size: n MB > > > > > > > > > > > > > > > node 0 free: n MB > > > > > > > > > > > > > > > node 1 cpus: > > > > > > > > > > > > > > > node 1 size: n MB > > > > > > > > > > > > > > > node 1 free: n MB > > > > > > > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > > > > > > > node 2 size: n MB > > > > > > > > > > > > > > > node 2 free: n MB > > > > > > > > > > > > > > > node distances: > > > > > > > > > > > > > > > node 0 1 2 > > > > > > > > > > > > > > > 0: 10 40 20 > > > > > > > > > > > > > > > 1: 40 10 80 > > > > > > > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Demotion order 1: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > > > > > 0 1 > > > > > > > > > > > > > > > 1 X > > > > > > > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Demotion order 2: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > > > > > 0 1 > > > > > > > > > > > > > > > 1 X > > > > > > > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > > > > > > > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > > > > > > > > > > > full advantage of the slow memory node. We can take any choice as > > > > > > > > > > > > > > > automatic-generated order, while make the other choice possible via user > > > > > > > > > > > > > > > space overridden. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > > > > > > > > > > > interface. How about the following user space interface? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in > > > > > > > > > > > > > > > /sys/devices/system/node/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > > > > > > > > > > > overridden; "0" is output if not. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > > > > > > > > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > > > > > > > > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > > > > > > > > > > > > will be set to the written nodes. And the demotion order of the system > > > > > > > > > > > > > > > will become the overridden mode. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > > > > > > > > > > > quite useful in real life for now (it might become useful in the > > > > > > > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of > > > > > > > > > > > > > > machines, which may come from different vendors, have different > > > > > > > > > > > > > > generations of hardware, have different versions of firmware, it would > > > > > > > > > > > > > > be a nightmare for the users to configure the demotion targets > > > > > > > > > > > > > > properly. So it would be great to have the kernel properly configure > > > > > > > > > > > > > > it *without* intervening from the users. > > > > > > > > > > > > > > > > > > > > > > > > > > > > So we should pick up a proper default policy and stick with that > > > > > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do > > > > > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having > > > > > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if > > > > > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > > > > > > > > > > > this is also the current implementation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced > > > > > > > > > > > > in [1/5] of this patchset. > > > > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > overriding of per-node demotion targets can be deferred. The most > > > > > > > > > > > > > important problem here is that we should allow the configurations > > > > > > > > > > > > > where memory-only nodes are not used as demotion targets, which this > > > > > > > > > > > > > patch set has already addressed. > > > > > > > > > > > > > > > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > > > > > > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should > > > > > > > > > > > > be powerful enough to address all existing issues and some potential > > > > > > > > > > > > future issues, so that it can be stable. I don't think it's a good idea > > > > > > > > > > > > to define a partial user space interface that works only for a specific > > > > > > > > > > > > use case and cannot be extended for other use cases. > > > > > > > > > > > > > > > > > > > > > > I actually think that they can be viewed as two separate problems: one > > > > > > > > > > > is to define which nodes can be used as demotion targets (this patch > > > > > > > > > > > set), and the other is how to initialize the per-node demotion path > > > > > > > > > > > (node_demotion[]). We don't have to solve both problems at the same > > > > > > > > > > > time. > > > > > > > > > > > > > > > > > > > > > > If we decide to go with a per-node demotion path customization > > > > > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > > > > > > > > > > > is a single global control to turn off all demotion targets (for the > > > > > > > > > > > machines that don't use memory-only nodes for demotion). 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs > > > > > > > > > > interface to enable reclaim migration"), a sysfs interface > > > > > > > > > > > > > > > > > > > > /sys/kernel/mm/numa/demotion_enabled > > > > > > > > > > > > > > > > > > > > is added to turn off all demotion targets. > > > > > > > > > > > > > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim. It > > > > > > > > > will be even cleaner if we have an easy way to clear node_demotion[] > > > > > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not > > > > > > > > > init scripts) can know that the machine doesn't even have memory > > > > > > > > > tiering hardware enabled. > > > > > > > > > > > > > > > > > > > > > > > > > What is the difference? Now we have no interface to show demotion > > > > > > > > targets of a node. That is in-kernel only. What is memory tiering > > > > > > > > hardware? The Optane PMEM? Some information for it is available via > > > > > > > > ACPI HMAT table. > > > > > > > > > > > > > > > > Except demotion-in-reclaim, what else do you care about? > > > > > > > > > > > > > > There is a difference: one is to indicate the availability of the > > > > > > > memory tiering hardware and the other is to indicate whether > > > > > > > transparent kernel-driven demotion from the reclaim path is activated. > > > > > > > With /sys/devices/system/node/demote_targets or the per-node demotion > > > > > > > target interface, the userspace can figure out the memory tiering > > > > > > > topology abstracted by the kernel. It is possible to use > > > > > > > application-guided demotion without having to enable reclaim-based > > > > > > > demotion in the kernel. Logically it is also cleaner to me to > > > > > > > decouple the tiering node representation from the actual demotion > > > > > > > mechanism enablement. > > > > > > > > > > > > I am confused here. It appears that you need a way to expose the > > > > > > automatic generated demotion order from kernel to user space interface. > > > > > > We can talk about that if you really need it. > > > > > > > > > > > > But [2-5/5] of this patchset is to override the automatic generated > > > > > > demotion order from user space to kernel interface. > > > > > > > > > > As a side effect of allowing user space to override the default set of > > > > > demotion target nodes, it also provides a sysfs interface to allow > > > > > userspace to read which nodes are currently being designated as > > > > > demotion targets. > > > > > > > > > > The initialization of demotion targets is expected to complete during > > > > > boot (either by kernel or via an init script). After that, the > > > > > userspace processes (e.g. proactive tiering daemon or tiering-aware > > > > > applications) can query this sysfs interface to know if there are any > > > > > tiering nodes present and act accordingly. > > > > > > > > > > It would be even better to expose the per-node demotion order > > > > > (node_demotion[]) via the sysfs interface (e.g. > > > > > /sys/devices/system/node/nodeX/demotion_targets as you have > > > > > suggested). It can be read-only until there are good use cases to > > > > > require overriding the per-node demotion order. > > > > > > > > I am OK to expose the system demotion order to user space. For example, > > > > via /sys/devices/system/node/nodeX/demotion_targets, but read-only. > > > > > > Sounds good. 
We can send out a patch for such a read-only interface. > > > > > > > But if we want to add functionality to override system demotion order, > > > > we need to consider the user space interface carefully, at least after > > > > collecting all requirement so far. I don't think the interface proposed > > > > in [2-5/5] of this patchset is sufficient or extensible enough. > > > > > > The current proposed interface should be sufficient to override which > > > nodes can serve as demotion targets. I agree that it is not > > > sufficient if userspace wants to redefine the per-node demotion > > > targets and a suitable user space interface for that purpose needs to > > > be designed carefully. > > > > > > > IMHO, it's better to define both together. That is, collect all > > requirement, and design it carefully, keeping extensible in mind. If > > it's not the good timing yet, we can defer it to collect more > > requirement. That's not urgent even for authors' system, because they > > can just don't enable demotion-in-reclaim. > > > > Best Regards, > > Huang, Ying > > I think it is necessary to either have per node demotion targets > configuration or the user space interface supported by this patch > series. As we don't have clear consensus on how the user interface > should look like, we can defer the per node demotion target set > interface to future until the real need arises. > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > driver, it may be possible that some memory node desired as demotion > target is not detected in the system from dax-device kmem probe path. > > It is also possible that some of the dax-devices are not preferred as > demotion target e.g. HBM, for such devices, node shouldn't be set to > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > kernel, but for now this user space interface will be useful to avoid > such devices as demotion targets. > > We can add read only interface to view per node demotion targets > from /sys/devices/system/node/nodeX/demotion_targets, remove > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > make /sys/devices/system/node/demotion_targets writable. > > Huang, Wei, Yang, > What do you suggest? This sounds good to me. I don't know a clear use case where we want to set per-node demotion order from the userspace. In the long term, in my view, it would be better that per-node demotion order is still only initialized by the kernel, just like the allocation zonelist, but with the help of more hardware information (e.g. HMAT) when available. Userspace can still control which nodes can be used for demotion on a process/cgroup through the typical NUMA interfaces (e.g. mbind, cpuset.mems). Wei
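A minimal sketch of how a post-boot agent might consume such a read-only per-node file, assuming the demotion_targets name discussed above is what eventually lands (the file is only proposed in this thread, so both the path and the sample output are illustrative):

for f in /sys/devices/system/node/node*/demotion_targets; do
    targets=$(cat "$f")                  # empty output: this node has no demotion target
    [ -n "$targets" ] && echo "$f: $targets"
done

If the loop prints nothing, the agent can conclude that the kernel found no tiering (demotion-capable) nodes and skip enabling reclaim-based demotion.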
On Fri, Apr 22, 2022 at 9:43 AM Wei Xu <weixugc@google.com> wrote: > > On Fri, Apr 22, 2022 at 4:00 AM Jagdish Gediya <jvgediya@linux.ibm.com> wrote: > > > > On Fri, Apr 22, 2022 at 02:21:47PM +0800, ying.huang@intel.com wrote: > > > On Thu, 2022-04-21 at 23:13 -0700, Wei Xu wrote: > > > > On Thu, Apr 21, 2022 at 10:40 PM ying.huang@intel.com > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > On Thu, 2022-04-21 at 21:46 -0700, Wei Xu wrote: > > > > > > On Thu, Apr 21, 2022 at 5:58 PM ying.huang@intel.com > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > On Thu, 2022-04-21 at 11:26 -0700, Wei Xu wrote: > > > > > > > > On Thu, Apr 21, 2022 at 12:45 AM ying.huang@intel.com > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > On Thu, 2022-04-21 at 00:29 -0700, Wei Xu wrote: > > > > > > > > > > On Thu, Apr 21, 2022 at 12:08 AM ying.huang@intel.com > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-20 at 23:49 -0700, Wei Xu wrote: > > > > > > > > > > > > On Wed, Apr 20, 2022 at 11:24 PM ying.huang@intel.com > > > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-20 at 22:41 -0700, Wei Xu wrote: > > > > > > > > > > > > > > On Wed, Apr 20, 2022 at 8:12 PM Yang Shi <shy828301@gmail.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Thu, Apr 14, 2022 at 12:00 AM ying.huang@intel.com > > > > > > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Wed, 2022-04-13 at 14:52 +0530, Jagdish Gediya wrote: > > > > > > > > > > > > > > > > > Current implementation to find the demotion targets works > > > > > > > > > > > > > > > > > based on node state N_MEMORY, however some systems may have > > > > > > > > > > > > > > > > > dram only memory numa node which are N_MEMORY but not the > > > > > > > > > > > > > > > > > right choices as demotion targets. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This patch series introduces the new node state > > > > > > > > > > > > > > > > > N_DEMOTION_TARGETS, which is used to distinguish the nodes which > > > > > > > > > > > > > > > > > can be used as demotion targets, node_states[N_DEMOTION_TARGETS] > > > > > > > > > > > > > > > > > is used to hold the list of nodes which can be used as demotion > > > > > > > > > > > > > > > > > targets, support is also added to set the demotion target > > > > > > > > > > > > > > > > > list from user space so that default behavior can be overridden. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It appears that your proposed user space interface cannot solve all > > > > > > > > > > > > > > > > problems. 
For example, for system as follows, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node near > > > > > > > > > > > > > > > > node 0, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > > > > > > > > node 0 size: n MB > > > > > > > > > > > > > > > > node 0 free: n MB > > > > > > > > > > > > > > > > node 1 cpus: > > > > > > > > > > > > > > > > node 1 size: n MB > > > > > > > > > > > > > > > > node 1 free: n MB > > > > > > > > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > > > > > > > > node 2 size: n MB > > > > > > > > > > > > > > > > node 2 free: n MB > > > > > > > > > > > > > > > > node distances: > > > > > > > > > > > > > > > > node 0 1 2 > > > > > > > > > > > > > > > > 0: 10 40 20 > > > > > > > > > > > > > > > > 1: 40 10 80 > > > > > > > > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Demotion order 1: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > > > > > > 0 1 > > > > > > > > > > > > > > > > 1 X > > > > > > > > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Demotion order 2: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > node demotion_target > > > > > > > > > > > > > > > > 0 1 > > > > > > > > > > > > > > > > 1 X > > > > > > > > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > The demotion order 1 is preferred if we want to reduce cross-socket > > > > > > > > > > > > > > > > traffic. While the demotion order 2 is preferred if we want to take > > > > > > > > > > > > > > > > full advantage of the slow memory node. We can take any choice as > > > > > > > > > > > > > > > > automatic-generated order, while make the other choice possible via user > > > > > > > > > > > > > > > > space overridden. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I don't know how to implement this via your proposed user space > > > > > > > > > > > > > > > > interface. How about the following user space interface? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 1. Add a file "demotion_order_override" in > > > > > > > > > > > > > > > > /sys/devices/system/node/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 2. When read, "1" is output if the demotion order of the system has been > > > > > > > > > > > > > > > > overridden; "0" is output if not. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 3. When write "1", the demotion order of the system will become the > > > > > > > > > > > > > > > > overridden mode. When write "0", the demotion order of the system will > > > > > > > > > > > > > > > > become the automatic mode and the demotion order will be re-generated. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 4. Add a file "demotion_targets" for each node in > > > > > > > > > > > > > > > > /sys/devices/system/node/nodeX/ > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 5. When read, the demotion targets of nodeX will be output. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 6. When write a node list to the file, the demotion targets of nodeX > > > > > > > > > > > > > > > > will be set to the written nodes. And the demotion order of the system > > > > > > > > > > > > > > > > will become the overridden mode. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > > > TBH I don't think having override demotion targets in userspace is > > > > > > > > > > > > > > > quite useful in real life for now (it might become useful in the > > > > > > > > > > > > > > > future, I can't tell). Imagine you manage hundred thousands of > > > > > > > > > > > > > > > machines, which may come from different vendors, have different > > > > > > > > > > > > > > > generations of hardware, have different versions of firmware, it would > > > > > > > > > > > > > > > be a nightmare for the users to configure the demotion targets > > > > > > > > > > > > > > > properly. So it would be great to have the kernel properly configure > > > > > > > > > > > > > > > it *without* intervening from the users. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > So we should pick up a proper default policy and stick with that > > > > > > > > > > > > > > > policy unless it doesn't work well for the most workloads. I do > > > > > > > > > > > > > > > understand it is hard to make everyone happy. My proposal is having > > > > > > > > > > > > > > > every node in the fast tier has a demotion target (at least one) if > > > > > > > > > > > > > > > the slow tier exists sounds like a reasonable default policy. I think > > > > > > > > > > > > > > > this is also the current implementation. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > This is reasonable. I agree that with a decent default policy, > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I agree that a decent default policy is important. As that was enhanced > > > > > > > > > > > > > in [1/5] of this patchset. > > > > > > > > > > > > > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > overriding of per-node demotion targets can be deferred. The most > > > > > > > > > > > > > > important problem here is that we should allow the configurations > > > > > > > > > > > > > > where memory-only nodes are not used as demotion targets, which this > > > > > > > > > > > > > > patch set has already addressed. > > > > > > > > > > > > > > > > > > > > > > > > > > Do you mean the user space interface proposed by [3/5] of this patchset? > > > > > > > > > > > > > > > > > > > > > > > > Yes. > > > > > > > > > > > > > > > > > > > > > > > > > IMHO, if we want to add a user space interface, I think that it should > > > > > > > > > > > > > be powerful enough to address all existing issues and some potential > > > > > > > > > > > > > future issues, so that it can be stable. I don't think it's a good idea > > > > > > > > > > > > > to define a partial user space interface that works only for a specific > > > > > > > > > > > > > use case and cannot be extended for other use cases. > > > > > > > > > > > > > > > > > > > > > > > > I actually think that they can be viewed as two separate problems: one > > > > > > > > > > > > is to define which nodes can be used as demotion targets (this patch > > > > > > > > > > > > set), and the other is how to initialize the per-node demotion path > > > > > > > > > > > > (node_demotion[]). We don't have to solve both problems at the same > > > > > > > > > > > > time. 
> > > > > > > > > > > > > > > > > > > > > > > > If we decide to go with a per-node demotion path customization > > > > > > > > > > > > interface to indirectly set N_DEMOTION_TARGETS, I'd prefer that there > > > > > > > > > > > > is a single global control to turn off all demotion targets (for the > > > > > > > > > > > > machines that don't use memory-only nodes for demotion). > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > There's one already. In commit 20b51af15e01 ("mm/migrate: add sysfs > > > > > > > > > > > interface to enable reclaim migration"), a sysfs interface > > > > > > > > > > > > > > > > > > > > > > /sys/kernel/mm/numa/demotion_enabled > > > > > > > > > > > > > > > > > > > > > > is added to turn off all demotion targets. > > > > > > > > > > > > > > > > > > > > IIUC, this sysfs interface only turns off demotion-in-reclaim. It > > > > > > > > > > will be even cleaner if we have an easy way to clear node_demotion[] > > > > > > > > > > and N_DEMOTION_TARGETS so that the userspace (post-boot agent, not > > > > > > > > > > init scripts) can know that the machine doesn't even have memory > > > > > > > > > > tiering hardware enabled. > > > > > > > > > > > > > > > > > > > > > > > > > > > > What is the difference? Now we have no interface to show demotion > > > > > > > > > targets of a node. That is in-kernel only. What is memory tiering > > > > > > > > > hardware? The Optane PMEM? Some information for it is available via > > > > > > > > > ACPI HMAT table. > > > > > > > > > > > > > > > > > > Except demotion-in-reclaim, what else do you care about? > > > > > > > > > > > > > > > > There is a difference: one is to indicate the availability of the > > > > > > > > memory tiering hardware and the other is to indicate whether > > > > > > > > transparent kernel-driven demotion from the reclaim path is activated. > > > > > > > > With /sys/devices/system/node/demote_targets or the per-node demotion > > > > > > > > target interface, the userspace can figure out the memory tiering > > > > > > > > topology abstracted by the kernel. It is possible to use > > > > > > > > application-guided demotion without having to enable reclaim-based > > > > > > > > demotion in the kernel. Logically it is also cleaner to me to > > > > > > > > decouple the tiering node representation from the actual demotion > > > > > > > > mechanism enablement. > > > > > > > > > > > > > > I am confused here. It appears that you need a way to expose the > > > > > > > automatic generated demotion order from kernel to user space interface. > > > > > > > We can talk about that if you really need it. > > > > > > > > > > > > > > But [2-5/5] of this patchset is to override the automatic generated > > > > > > > demotion order from user space to kernel interface. > > > > > > > > > > > > As a side effect of allowing user space to override the default set of > > > > > > demotion target nodes, it also provides a sysfs interface to allow > > > > > > userspace to read which nodes are currently being designated as > > > > > > demotion targets. > > > > > > > > > > > > The initialization of demotion targets is expected to complete during > > > > > > boot (either by kernel or via an init script). After that, the > > > > > > userspace processes (e.g. proactive tiering daemon or tiering-aware > > > > > > applications) can query this sysfs interface to know if there are any > > > > > > tiering nodes present and act accordingly. 
> > > > > > > > > > > > It would be even better to expose the per-node demotion order > > > > > > (node_demotion[]) via the sysfs interface (e.g. > > > > > > /sys/devices/system/node/nodeX/demotion_targets as you have > > > > > > suggested). It can be read-only until there are good use cases to > > > > > > require overriding the per-node demotion order. > > > > > > > > > > I am OK to expose the system demotion order to user space. For example, > > > > > via /sys/devices/system/node/nodeX/demotion_targets, but read-only. > > > > > > > > Sounds good. We can send out a patch for such a read-only interface. > > > > > > > > > But if we want to add functionality to override system demotion order, > > > > > we need to consider the user space interface carefully, at least after > > > > > collecting all requirement so far. I don't think the interface proposed > > > > > in [2-5/5] of this patchset is sufficient or extensible enough. > > > > > > > > The current proposed interface should be sufficient to override which > > > > nodes can serve as demotion targets. I agree that it is not > > > > sufficient if userspace wants to redefine the per-node demotion > > > > targets and a suitable user space interface for that purpose needs to > > > > be designed carefully. > > > > > > > > > > IMHO, it's better to define both together. That is, collect all > > > requirement, and design it carefully, keeping extensible in mind. If > > > it's not the good timing yet, we can defer it to collect more > > > requirement. That's not urgent even for authors' system, because they > > > can just don't enable demotion-in-reclaim. > > > > > > Best Regards, > > > Huang, Ying > > > > I think it is necessary to either have per node demotion targets > > configuration or the user space interface supported by this patch > > series. As we don't have clear consensus on how the user interface > > should look like, we can defer the per node demotion target set > > interface to future until the real need arises. > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > > driver, it may be possible that some memory node desired as demotion > > target is not detected in the system from dax-device kmem probe path. > > > > It is also possible that some of the dax-devices are not preferred as > > demotion target e.g. HBM, for such devices, node shouldn't be set to > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > > kernel, but for now this user space interface will be useful to avoid > > such devices as demotion targets. > > > > We can add read only interface to view per node demotion targets > > from /sys/devices/system/node/nodeX/demotion_targets, remove > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > > make /sys/devices/system/node/demotion_targets writable. > > > > Huang, Wei, Yang, > > What do you suggest? > > This sounds good to me. > > I don't know a clear use case where we want to set per-node demotion > order from the userspace. In the long term, in my view, it would be > better that per-node demotion order is still only initialized by the > kernel, just like the allocation zonelist, but with the help of more > hardware information (e.g. HMAT) when available. Userspace can still > control which nodes can be used for demotion on a process/cgroup > through the typical NUMA interfaces (e.g. mbind, cpuset.mems). +1 > > Wei
Hi, All, On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: [snip] > I think it is necessary to either have per node demotion targets > configuration or the user space interface supported by this patch > series. As we don't have clear consensus on how the user interface > should look like, we can defer the per node demotion target set > interface to future until the real need arises. > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > driver, it may be possible that some memory node desired as demotion > target is not detected in the system from dax-device kmem probe path. > > It is also possible that some of the dax-devices are not preferred as > demotion target e.g. HBM, for such devices, node shouldn't be set to > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > kernel, but for now this user space interface will be useful to avoid > such devices as demotion targets. > > We can add read only interface to view per node demotion targets > from /sys/devices/system/node/nodeX/demotion_targets, remove > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > make /sys/devices/system/node/demotion_targets writable. > > Huang, Wei, Yang, > What do you suggest? We cannot remove a kernel ABI in practice. So we need to make it right the first time. Let's try to collect some information for the kernel ABI definition. The below is just a starting point, please add your requirements. 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't want to use them as demotion targets. But I don't think this is an issue in practice for now, because demote-in-reclaim is disabled by default. 2. For machines with PMEM installed in only 1 of 2 sockets, for example, Node 0 & 2 are cpu + dram nodes and node 1 is a slow memory node near node 0, available: 3 nodes (0-2) node 0 cpus: 0 1 node 0 size: n MB node 0 free: n MB node 1 cpus: node 1 size: n MB node 1 free: n MB node 2 cpus: 2 3 node 2 size: n MB node 2 free: n MB node distances: node 0 1 2 0: 10 40 20 1: 40 10 80 2: 20 80 10 We have 2 choices, a) node demotion targets 0 1 2 1 b) node demotion targets 0 1 2 X a) is good to take advantage of PMEM. b) is good to reduce cross-socket traffic. Both are OK as the default configuration. But some users may prefer the other one. So we need a user space ABI to override the default configuration. 3. For machines with HBM (High Bandwidth Memory), as in https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 Although HBM has better performance than DDR, in ACPI SLIT, their distance to CPU is longer. We need to provide a way to fix this. The user space ABI is one way. The desired result will be to use local DDR as demotion targets of local HBM. Best Regards, Huang, Ying
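For point 3, the SLIT values that mislead the default policy can be read straight from sysfs. The numbers below are simply the ones quoted from the linked report, assuming node 0 is the local DDR/CPU node, node 1 the remote DDR node, node 2 the local HBM node and node 3 the remote HBM node (the numbering itself is illustrative):

$ cat /sys/devices/system/node/node0/distance
10 20 31 41

By distance alone, local HBM (31) looks farther from node 0 than even remote DDR (20), so a distance-based default would treat the HBM node as a slow node rather than as a fast tier that should demote into local DDR.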
"ying.huang@intel.com" <ying.huang@intel.com> writes: > Hi, All, > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: > > [snip] > >> I think it is necessary to either have per node demotion targets >> configuration or the user space interface supported by this patch >> series. As we don't have clear consensus on how the user interface >> should look like, we can defer the per node demotion target set >> interface to future until the real need arises. >> >> Current patch series sets N_DEMOTION_TARGET from dax device kmem >> driver, it may be possible that some memory node desired as demotion >> target is not detected in the system from dax-device kmem probe path. >> >> It is also possible that some of the dax-devices are not preferred as >> demotion target e.g. HBM, for such devices, node shouldn't be set to >> N_DEMOTION_TARGETS. In future, Support should be added to distinguish >> such dax-devices and not mark them as N_DEMOTION_TARGETS from the >> kernel, but for now this user space interface will be useful to avoid >> such devices as demotion targets. >> >> We can add read only interface to view per node demotion targets >> from /sys/devices/system/node/nodeX/demotion_targets, remove >> duplicated /sys/kernel/mm/numa/demotion_target interface and instead >> make /sys/devices/system/node/demotion_targets writable. >> >> Huang, Wei, Yang, >> What do you suggest? > > We cannot remove a kernel ABI in practice. So we need to make it right > at the first time. Let's try to collect some information for the kernel > ABI definitation. > > The below is just a starting point, please add your requirements. > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't > want to use that as the demotion targets. But I don't think this is a > issue in practice for now, because demote-in-reclaim is disabled by > default. It is not just that the demotion can be disabled. We should be able to use demotion on a system where we can find DRAM only NUMA nodes. That cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs something similar to to N_DEMOTION_TARGETS > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > memory node near node 0, > > available: 3 nodes (0-2) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: > node 1 size: n MB > node 1 free: n MB > node 2 cpus: 2 3 > node 2 size: n MB > node 2 free: n MB > node distances: > node 0 1 2 > 0: 10 40 20 > 1: 40 10 80 > 2: 20 80 10 > > We have 2 choices, > > a) > node demotion targets > 0 1 > 2 1 This is achieved by [PATCH v2 1/5] mm: demotion: Set demotion list differently > > b) > node demotion targets > 0 1 > 2 X > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > traffic. Both are OK as defualt configuration. But some users may > prefer the other one. So we need a user space ABI to override the > default configuration. > > 3. For machines with HBM (High Bandwidth Memory), as in > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ > >> [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 > > Although HBM has better performance than DDR, in ACPI SLIT, their > distance to CPU is longer. We need to provide a way to fix this. The > user space ABI is one way. The desired result will be to use local DDR > as demotion targets of local HBM. IMHO the above (2b and 3) can be done using per node demotion targets. 
Below is what I think we could do with a single slow memory NUMA node 4. /sys/devices/system/node# cat node[0-4]/demotion_targets 4 4 4 4 /sys/devices/system/node# echo 1 > node1/demotion_targets bash: echo: write error: Invalid argument /sys/devices/system/node# cat node[0-4]/demotion_targets 4 4 4 4 /sys/devices/system/node# echo 0 > node1/demotion_targets /sys/devices/system/node# cat node[0-4]/demotion_targets 4 0 4 4 /sys/devices/system/node# echo 1 > node0/demotion_targets bash: echo: write error: Invalid argument /sys/devices/system/node# cat node[0-4]/demotion_targets 4 0 4 4 Disable demotion for a specific node. /sys/devices/system/node# echo > node1/demotion_targets /sys/devices/system/node# cat node[0-4]/demotion_targets 4 4 4 Reset demotion to default /sys/devices/system/node# echo -1 > node1/demotion_targets /sys/devices/system/node# cat node[0-4]/demotion_targets 4 4 4 4 When a specific device/NUMA node is used as a demotion target via the user interface, it is taken out of other NUMA node targets. root@ubuntu-guest:/sys/devices/system/node# cat node[0-4]/demotion_targets 4 4 4 4 /sys/devices/system/node# echo 4 > node1/demotion_targets /sys/devices/system/node# cat node[0-4]/demotion_targets 4 If more than one node requires the same demotion target: /sys/devices/system/node# echo 4 > node0/demotion_targets /sys/devices/system/node# cat node[0-4]/demotion_targets 4 4 -aneesh
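Applied to the two-socket example given earlier in the thread (node 1 being the slow memory node near node 0), the same proposed semantics would let an administrator flip between the two candidate orders. This is a sketch only: the writable per-node file is exactly what is being debated here, and the output assumes the auto-generated default is order (a):

/sys/devices/system/node# cat node[0-2]/demotion_targets
1

1
/sys/devices/system/node# echo > node2/demotion_targets      # order (b): node 2 no longer demotes across the socket
/sys/devices/system/node# echo -1 > node2/demotion_targets   # back to the auto-generated order (a)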
On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote: > "ying.huang@intel.com" <ying.huang@intel.com> writes: > > > Hi, All, > > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: > > > > [snip] > > > > > I think it is necessary to either have per node demotion targets > > > configuration or the user space interface supported by this patch > > > series. As we don't have clear consensus on how the user interface > > > should look like, we can defer the per node demotion target set > > > interface to future until the real need arises. > > > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > > > driver, it may be possible that some memory node desired as demotion > > > target is not detected in the system from dax-device kmem probe path. > > > > > > It is also possible that some of the dax-devices are not preferred as > > > demotion target e.g. HBM, for such devices, node shouldn't be set to > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > > > kernel, but for now this user space interface will be useful to avoid > > > such devices as demotion targets. > > > > > > We can add read only interface to view per node demotion targets > > > from /sys/devices/system/node/nodeX/demotion_targets, remove > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > > > make /sys/devices/system/node/demotion_targets writable. > > > > > > Huang, Wei, Yang, > > > What do you suggest? > > > > We cannot remove a kernel ABI in practice. So we need to make it right > > at the first time. Let's try to collect some information for the kernel > > ABI definitation. > > > > The below is just a starting point, please add your requirements. > > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't > > want to use that as the demotion targets. But I don't think this is a > > issue in practice for now, because demote-in-reclaim is disabled by > > default. > > It is not just that the demotion can be disabled. We should be able to > use demotion on a system where we can find DRAM only NUMA nodes. That > cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs > something similar to to N_DEMOTION_TARGETS > Can you show NUMA information of your machines with DRAM-only nodes and PMEM nodes? We can try to find the proper demotion order for the system. If you can not show it, we can defer N_DEMOTION_TARGETS until the machine is available. > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > memory node near node 0, > > > > available: 3 nodes (0-2) > > node 0 cpus: 0 1 > > node 0 size: n MB > > node 0 free: n MB > > node 1 cpus: > > node 1 size: n MB > > node 1 free: n MB > > node 2 cpus: 2 3 > > node 2 size: n MB > > node 2 free: n MB > > node distances: > > node 0 1 2 > > 0: 10 40 20 > > 1: 40 10 80 > > 2: 20 80 10 > > > > We have 2 choices, > > > > a) > > node demotion targets > > 0 1 > > 2 1 > > This is achieved by > > [PATCH v2 1/5] mm: demotion: Set demotion list differently > > > > > b) > > node demotion targets > > 0 1 > > 2 X > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > traffic. Both are OK as defualt configuration. But some users may > > prefer the other one. So we need a user space ABI to override the > > default configuration. > > > > 3. 
For machines with HBM (High Bandwidth Memory), as in > > > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ > > > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 > > > > Although HBM has better performance than DDR, in ACPI SLIT, their > > distance to CPU is longer. We need to provide a way to fix this. The > > user space ABI is one way. The desired result will be to use local DDR > > as demotion targets of local HBM. > > > IMHO the above (2b and 3) can be done using per node demotion targets. Below is > what I think we could do with a single slow memory NUMA node 4. If we can use writable per-node demotion targets as ABI, then we don't need N_DEMOTION_TARGETS. > /sys/devices/system/node# cat node[0-4]/demotion_targets > 4 > 4 > 4 > 4 > > /sys/devices/system/node# echo 1 > node1/demotion_targets > bash: echo: write error: Invalid argument > /sys/devices/system/node# cat node[0-4]/demotion_targets > 4 > 4 > 4 > 4 > > /sys/devices/system/node# echo 0 > node1/demotion_targets > /sys/devices/system/node# cat node[0-4]/demotion_targets > 4 > 0 > 4 > 4 > > /sys/devices/system/node# echo 1 > node0/demotion_targets > bash: echo: write error: Invalid argument > /sys/devices/system/node# cat node[0-4]/demotion_targets > 4 > 0 > 4 > 4 > > Disable demotion for a specific node. > /sys/devices/system/node# echo > node1/demotion_targets > /sys/devices/system/node# cat node[0-4]/demotion_targets > 4 > > 4 > 4 > > Reset demotion to default > /sys/devices/system/node# echo -1 > node1/demotion_targets > /sys/devices/system/node# cat node[0-4]/demotion_targets > 4 > 4 > 4 > 4 > > When a specific device/NUMA node is used for demotion target via the user interface, it is taken > out of other NUMA node targets. IMHO, we should be careful about interaction between auto-generated and overridden demotion order. Best Regards, Huang, Ying > root@ubuntu-guest:/sys/devices/system/node# cat node[0-4]/demotion_targets > 4 > 4 > 4 > 4 > > /sys/devices/system/node# echo 4 > node1/demotion_targets > /sys/devices/system/node# cat node[0-4]/demotion_targets > > 4 > > > > If more than one node requies the same demotion target > /sys/devices/system/node# echo 4 > node0/demotion_targets > /sys/devices/system/node# cat node[0-4]/demotion_targets > 4 > 4 > > > > -aneesh
On Sun, Apr 24, 2022 at 11:02:47AM +0800, ying.huang@intel.com wrote: > Hi, All, > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: > > [snip] > > > I think it is necessary to either have per node demotion targets > > configuration or the user space interface supported by this patch > > series. As we don't have clear consensus on how the user interface > > should look like, we can defer the per node demotion target set > > interface to future until the real need arises. > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > > driver, it may be possible that some memory node desired as demotion > > target is not detected in the system from dax-device kmem probe path. > > > > It is also possible that some of the dax-devices are not preferred as > > demotion target e.g. HBM, for such devices, node shouldn't be set to > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > > kernel, but for now this user space interface will be useful to avoid > > such devices as demotion targets. > > > > We can add read only interface to view per node demotion targets > > from /sys/devices/system/node/nodeX/demotion_targets, remove > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > > make /sys/devices/system/node/demotion_targets writable. > > > > Huang, Wei, Yang, > > What do you suggest? > > We cannot remove a kernel ABI in practice. So we need to make it right > at the first time. Let's try to collect some information for the kernel > ABI definitation. /sys/kernel/mm/numa/demotion_target was introduced in v2, I was talking about removing it from next version of the series as the similar interface is available as a result of introducing N_DEMOTION_TARGETS at /sys/devices/system/node/demotion_targets, so instead of introducing duplicate interface to write N_DEMOTION_TARGETS, we can instead make /sys/devices/system/node/demotion_targets writable. > The below is just a starting point, please add your requirements. > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't > want to use that as the demotion targets. But I don't think this is a > issue in practice for now, because demote-in-reclaim is disabled by > default. > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > memory node near node 0, > > available: 3 nodes (0-2) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: > node 1 size: n MB > node 1 free: n MB > node 2 cpus: 2 3 > node 2 size: n MB > node 2 free: n MB > node distances: > node 0 1 2 > 0: 10 40 20 > 1: 40 10 80 > 2: 20 80 10 > > We have 2 choices, > > a) > node demotion targets > 0 1 > 2 1 > > b) > node demotion targets > 0 1 > 2 X > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > traffic. Both are OK as defualt configuration. But some users may > prefer the other one. So we need a user space ABI to override the > default configuration. > > 3. For machines with HBM (High Bandwidth Memory), as in > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 > > Although HBM has better performance than DDR, in ACPI SLIT, their > distance to CPU is longer. We need to provide a way to fix this. The > user space ABI is one way. The desired result will be to use local DDR > as demotion targets of local HBM. 
> > Best Regards, > Huang, Ying > >
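For comparison, the system-wide file Jagdish refers to would sit next to the existing node-mask files under /sys/devices/system/node/ and, by analogy with them, would read and accept a node list. Everything below is an assumption about the proposed interface rather than current kernel behavior; the scenario is a machine where a second slow node (node 2) was not picked up automatically, e.g. because it is not registered through dax/kmem:

$ cat /sys/devices/system/node/online
0-2
$ cat /sys/devices/system/node/demotion_targets
1
$ echo 1-2 > /sys/devices/system/node/demotion_targets   # also allow node 2 as a demotion target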
On 4/25/22 11:40 AM, ying.huang@intel.com wrote: > On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote: >> "ying.huang@intel.com" <ying.huang@intel.com> writes: >> >>> Hi, All, >>> >>> On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: >>> >>> [snip] >>> >>>> I think it is necessary to either have per node demotion targets >>>> configuration or the user space interface supported by this patch >>>> series. As we don't have clear consensus on how the user interface >>>> should look like, we can defer the per node demotion target set >>>> interface to future until the real need arises. >>>> >>>> Current patch series sets N_DEMOTION_TARGET from dax device kmem >>>> driver, it may be possible that some memory node desired as demotion >>>> target is not detected in the system from dax-device kmem probe path. >>>> >>>> It is also possible that some of the dax-devices are not preferred as >>>> demotion target e.g. HBM, for such devices, node shouldn't be set to >>>> N_DEMOTION_TARGETS. In future, Support should be added to distinguish >>>> such dax-devices and not mark them as N_DEMOTION_TARGETS from the >>>> kernel, but for now this user space interface will be useful to avoid >>>> such devices as demotion targets. >>>> >>>> We can add read only interface to view per node demotion targets >>>> from /sys/devices/system/node/nodeX/demotion_targets, remove >>>> duplicated /sys/kernel/mm/numa/demotion_target interface and instead >>>> make /sys/devices/system/node/demotion_targets writable. >>>> >>>> Huang, Wei, Yang, >>>> What do you suggest? >>> >>> We cannot remove a kernel ABI in practice. So we need to make it right >>> at the first time. Let's try to collect some information for the kernel >>> ABI definitation. >>> >>> The below is just a starting point, please add your requirements. >>> >>> 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't >>> want to use that as the demotion targets. But I don't think this is a >>> issue in practice for now, because demote-in-reclaim is disabled by >>> default. >> >> It is not just that the demotion can be disabled. We should be able to >> use demotion on a system where we can find DRAM only NUMA nodes. That >> cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs >> something similar to to N_DEMOTION_TARGETS >> > > Can you show NUMA information of your machines with DRAM-only nodes and > PMEM nodes? We can try to find the proper demotion order for the > system. If you can not show it, we can defer N_DEMOTION_TARGETS until > the machine is available. Sure will find one such config. As you might have noticed this is very easy to have in a virtualization setup because the hypervisor can assign memory to a guest VM from a numa node that doesn't have CPU assigned to the same guest. This depends on the other guest VM instance config running on the system. So on any virtualization config that has got persistent memory attached, this can become an easy config to end up with. >>> 2. 
For machines with PMEM installed in only 1 of 2 sockets, for example, >>> >>> Node 0 & 2 are cpu + dram nodes and node 1 are slow >>> memory node near node 0, >>> >>> available: 3 nodes (0-2) >>> node 0 cpus: 0 1 >>> node 0 size: n MB >>> node 0 free: n MB >>> node 1 cpus: >>> node 1 size: n MB >>> node 1 free: n MB >>> node 2 cpus: 2 3 >>> node 2 size: n MB >>> node 2 free: n MB >>> node distances: >>> node 0 1 2 >>> 0: 10 40 20 >>> 1: 40 10 80 >>> 2: 20 80 10 >>> >>> We have 2 choices, >>> >>> a) >>> node demotion targets >>> 0 1 >>> 2 1 >> >> This is achieved by >> >> [PATCH v2 1/5] mm: demotion: Set demotion list differently >> >>> >>> b) >>> node demotion targets >>> 0 1 >>> 2 X >> >> >>> >>> a) is good to take advantage of PMEM. b) is good to reduce cross-socket >>> traffic. Both are OK as defualt configuration. But some users may >>> prefer the other one. So we need a user space ABI to override the >>> default configuration. >>> >>> 3. For machines with HBM (High Bandwidth Memory), as in >>> >>> https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ >>> >>>> [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 >>> >>> Although HBM has better performance than DDR, in ACPI SLIT, their >>> distance to CPU is longer. We need to provide a way to fix this. The >>> user space ABI is one way. The desired result will be to use local DDR >>> as demotion targets of local HBM. >> >> >> IMHO the above (2b and 3) can be done using per node demotion targets. Below is >> what I think we could do with a single slow memory NUMA node 4. > > If we can use writable per-node demotion targets as ABI, then we don't > need N_DEMOTION_TARGETS. Not sure I understand that. Yes, once you have a writeable per node demotion target it is easy to build any demotion order. But that doesn't mean we should not improve the default unless you have reason to say that using N_DEMOTION_TARGETS breaks any existing config. > >> /sys/devices/system/node# cat node[0-4]/demotion_targets >> 4 >> 4 >> 4 >> 4 >> >> /sys/devices/system/node# echo 1 > node1/demotion_targets >> bash: echo: write error: Invalid argument >> /sys/devices/system/node# cat node[0-4]/demotion_targets >> 4 >> 4 >> 4 >> 4 >> >> /sys/devices/system/node# echo 0 > node1/demotion_targets >> /sys/devices/system/node# cat node[0-4]/demotion_targets >> 4 >> 0 >> 4 >> 4 >> >> /sys/devices/system/node# echo 1 > node0/demotion_targets >> bash: echo: write error: Invalid argument >> /sys/devices/system/node# cat node[0-4]/demotion_targets >> 4 >> 0 >> 4 >> 4 >> >> Disable demotion for a specific node. >> /sys/devices/system/node# echo > node1/demotion_targets >> /sys/devices/system/node# cat node[0-4]/demotion_targets >> 4 >> >> 4 >> 4 >> >> Reset demotion to default >> /sys/devices/system/node# echo -1 > node1/demotion_targets >> /sys/devices/system/node# cat node[0-4]/demotion_targets >> 4 >> 4 >> 4 >> 4 >> >> When a specific device/NUMA node is used for demotion target via the user interface, it is taken >> out of other NUMA node targets. > > IMHO, we should be careful about interaction between auto-generated and > overridden demotion order. > Yes, we should avoid creating a loop there. But if you agree with the above ABI, we could go ahead and share the implementation code.
> Best Regards, > Huang, Ying > >> root@ubuntu-guest:/sys/devices/system/node# cat node[0-4]/demotion_targets >> 4 >> 4 >> 4 >> 4 >> >> /sys/devices/system/node# echo 4 > node1/demotion_targets >> /sys/devices/system/node# cat node[0-4]/demotion_targets >> >> 4 >> >> >> >> If more than one node requies the same demotion target >> /sys/devices/system/node# echo 4 > node0/demotion_targets >> /sys/devices/system/node# cat node[0-4]/demotion_targets >> 4 >> 4 >> >> >> >> -aneesh > > -aneesh
On 4/25/22 1:39 PM, Aneesh Kumar K V wrote: > On 4/25/22 11:40 AM, ying.huang@intel.com wrote: >> On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote: >>> "ying.huang@intel.com" <ying.huang@intel.com> writes: >>> >>>> Hi, All, >>>> >>>> On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: >>>> >>>> [snip] >>>> >>>>> I think it is necessary to either have per node demotion targets >>>>> configuration or the user space interface supported by this patch >>>>> series. As we don't have clear consensus on how the user interface >>>>> should look like, we can defer the per node demotion target set >>>>> interface to future until the real need arises. >>>>> >>>>> Current patch series sets N_DEMOTION_TARGET from dax device kmem >>>>> driver, it may be possible that some memory node desired as demotion >>>>> target is not detected in the system from dax-device kmem probe path. >>>>> >>>>> It is also possible that some of the dax-devices are not preferred as >>>>> demotion target e.g. HBM, for such devices, node shouldn't be set to >>>>> N_DEMOTION_TARGETS. In future, Support should be added to distinguish >>>>> such dax-devices and not mark them as N_DEMOTION_TARGETS from the >>>>> kernel, but for now this user space interface will be useful to avoid >>>>> such devices as demotion targets. >>>>> >>>>> We can add read only interface to view per node demotion targets >>>>> from /sys/devices/system/node/nodeX/demotion_targets, remove >>>>> duplicated /sys/kernel/mm/numa/demotion_target interface and instead >>>>> make /sys/devices/system/node/demotion_targets writable. >>>>> >>>>> Huang, Wei, Yang, >>>>> What do you suggest? >>>> >>>> We cannot remove a kernel ABI in practice. So we need to make it right >>>> at the first time. Let's try to collect some information for the >>>> kernel >>>> ABI definitation. >>>> >>>> The below is just a starting point, please add your requirements. >>>> >>>> 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't >>>> want to use that as the demotion targets. But I don't think this is a >>>> issue in practice for now, because demote-in-reclaim is disabled by >>>> default. >>> >>> It is not just that the demotion can be disabled. We should be able to >>> use demotion on a system where we can find DRAM only NUMA nodes. That >>> cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs >>> something similar to to N_DEMOTION_TARGETS >>> >> >> Can you show NUMA information of your machines with DRAM-only nodes and >> PMEM nodes? We can try to find the proper demotion order for the >> system. If you can not show it, we can defer N_DEMOTION_TARGETS until >> the machine is available. > > > Sure will find one such config. As you might have noticed this is very > easy to have in a virtualization setup because the hypervisor can assign > memory to a guest VM from a numa node that doesn't have CPU assigned to > the same guest. This depends on the other guest VM instance config > running on the system. So on any virtualization config that has got > persistent memory attached, this can become an easy config to end up with. 
> > something like this $ numactl -H available: 2 nodes (0-1) node 0 cpus: 0 1 2 3 4 5 6 7 node 0 size: 14272 MB node 0 free: 13392 MB node 1 cpus: node 1 size: 2028 MB node 1 free: 1971 MB node distances: node 0 1 0: 10 40 1: 40 10 $ cat /sys/bus/nd/devices/dax0.0/target_node 2 $ # cd /sys/bus/dax/drivers/ :/sys/bus/dax/drivers# ls device_dax kmem :/sys/bus/dax/drivers# cd device_dax/ :/sys/bus/dax/drivers/device_dax# echo dax0.0 > unbind :/sys/bus/dax/drivers/device_dax# echo dax0.0 > ../kmem/new_id :/sys/bus/dax/drivers/device_dax# numactl -H available: 3 nodes (0-2) node 0 cpus: 0 1 2 3 4 5 6 7 node 0 size: 14272 MB node 0 free: 13380 MB node 1 cpus: node 1 size: 2028 MB node 1 free: 1961 MB node 2 cpus: node 2 size: 0 MB node 2 free: 0 MB node distances: node 0 1 2 0: 10 40 80 1: 40 10 80 2: 80 80 10 :/sys/bus/dax/drivers/device_dax#
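On a layout like this, the generic node masks cannot tell the CPU-less DRAM node apart from the dax/kmem-backed one, which is the crux of the problem being discussed; the values below are what one would expect for the topology above once the kmem memory is onlined (illustrative, not captured from the same system):

$ cat /sys/devices/system/node/has_cpu
0
$ cat /sys/devices/system/node/has_memory
0-2

Node 1 (plain DRAM that simply has no CPUs assigned to this guest) and node 2 (the former dax device) are both memory-only here, so a rule based on N_MEMORY, or even on "has memory but no CPUs", would treat them identically as demotion targets.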
On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com <ying.huang@intel.com> wrote: > > Hi, All, > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: > > [snip] > > > I think it is necessary to either have per node demotion targets > > configuration or the user space interface supported by this patch > > series. As we don't have clear consensus on how the user interface > > should look like, we can defer the per node demotion target set > > interface to future until the real need arises. > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > > driver, it may be possible that some memory node desired as demotion > > target is not detected in the system from dax-device kmem probe path. > > > > It is also possible that some of the dax-devices are not preferred as > > demotion target e.g. HBM, for such devices, node shouldn't be set to > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > > kernel, but for now this user space interface will be useful to avoid > > such devices as demotion targets. > > > > We can add read only interface to view per node demotion targets > > from /sys/devices/system/node/nodeX/demotion_targets, remove > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > > make /sys/devices/system/node/demotion_targets writable. > > > > Huang, Wei, Yang, > > What do you suggest? > > We cannot remove a kernel ABI in practice. So we need to make it right > at the first time. Let's try to collect some information for the kernel > ABI definitation. > > The below is just a starting point, please add your requirements. > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't > want to use that as the demotion targets. But I don't think this is a > issue in practice for now, because demote-in-reclaim is disabled by > default. > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > memory node near node 0, > > available: 3 nodes (0-2) > node 0 cpus: 0 1 > node 0 size: n MB > node 0 free: n MB > node 1 cpus: > node 1 size: n MB > node 1 free: n MB > node 2 cpus: 2 3 > node 2 size: n MB > node 2 free: n MB > node distances: > node 0 1 2 > 0: 10 40 20 > 1: 40 10 80 > 2: 20 80 10 > > We have 2 choices, > > a) > node demotion targets > 0 1 > 2 1 > > b) > node demotion targets > 0 1 > 2 X > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > traffic. Both are OK as defualt configuration. But some users may > prefer the other one. So we need a user space ABI to override the > default configuration. I think 2(a) should be the system-wide configuration and 2(b) can be achieved with NUMA mempolicy (which needs to be added to demotion). In general, we can view the demotion order in a way similar to allocation fallback order (after all, if we don't demote or demotion lags behind, the allocations will go to these demotion target nodes according to the allocation fallback order anyway). If we initialize the demotion order in that way (i.e. every node can demote to any node in the next tier, and the priority of the target nodes is sorted for each source node), we don't need per-node demotion order override from the userspace. What we need is to specify what nodes should be in each tier and support NUMA mempolicy in demotion. 
Cross-socket demotion should not be too big a problem in practice because we can optimize the code to do the demotion from the local CPU node (i.e. local writes to the target node and remote read from the source node). The bigger issue is cross-socket memory access onto the demoted pages from the applications, which is why NUMA mempolicy is important here. > 3. For machines with HBM (High Bandwidth Memory), as in > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 > > Although HBM has better performance than DDR, in ACPI SLIT, their > distance to CPU is longer. We need to provide a way to fix this. The > user space ABI is one way. The desired result will be to use local DDR > as demotion targets of local HBM. > > Best Regards, > Huang, Ying >
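A rough illustration of the per-workload control Wei describes, using interfaces that already exist; the assumption (explicitly flagged above as something that still needs to be added) is that reclaim-based demotion would be taught to respect mempolicy and cpusets, and the node numbers refer to the earlier example with node 1 as the slow node:

$ numactl --membind=0,2 ./workload                        # this workload's pages stay off node 1
$ echo 0,2 > /sys/fs/cgroup/cpuset/workload/cpuset.mems   # same constraint for a pre-created v1 cpuset

With that, a system-wide order like 2(a) can remain the default while individual jobs or cgroups opt out of being demoted to particular nodes.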
On Mon, 25 Apr 2022, Aneesh Kumar K V wrote: >On 4/25/22 11:40 AM, ying.huang@intel.com wrote: >>On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote: >>>"ying.huang@intel.com" <ying.huang@intel.com> writes: >>> >>>>Hi, All, >>>> >>>>On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: >>>> >>>>[snip] >>>> >>>>>I think it is necessary to either have per node demotion targets >>>>>configuration or the user space interface supported by this patch >>>>>series. As we don't have clear consensus on how the user interface >>>>>should look like, we can defer the per node demotion target set >>>>>interface to future until the real need arises. >>>>> >>>>>Current patch series sets N_DEMOTION_TARGET from dax device kmem >>>>>driver, it may be possible that some memory node desired as demotion >>>>>target is not detected in the system from dax-device kmem probe path. >>>>> >>>>>It is also possible that some of the dax-devices are not preferred as >>>>>demotion target e.g. HBM, for such devices, node shouldn't be set to >>>>>N_DEMOTION_TARGETS. In future, Support should be added to distinguish >>>>>such dax-devices and not mark them as N_DEMOTION_TARGETS from the >>>>>kernel, but for now this user space interface will be useful to avoid >>>>>such devices as demotion targets. >>>>> >>>>>We can add read only interface to view per node demotion targets >>>>>from /sys/devices/system/node/nodeX/demotion_targets, remove >>>>>duplicated /sys/kernel/mm/numa/demotion_target interface and instead >>>>>make /sys/devices/system/node/demotion_targets writable. >>>>> >>>>>Huang, Wei, Yang, >>>>>What do you suggest? >>>> >>>>We cannot remove a kernel ABI in practice. So we need to make it right >>>>at the first time. Let's try to collect some information for the kernel >>>>ABI definitation. >>>> >>>>The below is just a starting point, please add your requirements. >>>> >>>>1. Jagdish has some machines with DRAM only NUMA nodes, but they don't >>>>want to use that as the demotion targets. But I don't think this is a >>>>issue in practice for now, because demote-in-reclaim is disabled by >>>>default. >>> >>>It is not just that the demotion can be disabled. We should be able to >>>use demotion on a system where we can find DRAM only NUMA nodes. That >>>cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs >>>something similar to to N_DEMOTION_TARGETS >>> >> >>Can you show NUMA information of your machines with DRAM-only nodes and >>PMEM nodes? We can try to find the proper demotion order for the >>system. If you can not show it, we can defer N_DEMOTION_TARGETS until >>the machine is available. > > >Sure will find one such config. As you might have noticed this is very >easy to have in a virtualization setup because the hypervisor can >assign memory to a guest VM from a numa node that doesn't have CPU >assigned to the same guest. This depends on the other guest VM >instance config running on the system. So on any virtualization config >that has got persistent memory attached, this can become an easy >config to end up with. And as hw becomes available things like CXL will also start to show "interesting" setups. You have a mix of volatile and/or pmem nodes with different access costs, so: CPU+DRAM, DRAM (?), volatile CXL mem, CXL pmem, non-cxl pmem. imo, by default, slower mem should be demotion candidates regardless of type or socket layout (which can be a last consideration such that this is somewhat mitigated). 
And afaict this is along the lines of what Jagdish's first example refers to in patch 1/5. > >>>>2. For machines with PMEM installed in only 1 of 2 sockets, for example, >>>> >>>>Node 0 & 2 are cpu + dram nodes and node 1 are slow >>>>memory node near node 0, >>>> >>>>available: 3 nodes (0-2) >>>>node 0 cpus: 0 1 >>>>node 0 size: n MB >>>>node 0 free: n MB >>>>node 1 cpus: >>>>node 1 size: n MB >>>>node 1 free: n MB >>>>node 2 cpus: 2 3 >>>>node 2 size: n MB >>>>node 2 free: n MB >>>>node distances: >>>>node 0 1 2 >>>> 0: 10 40 20 >>>> 1: 40 10 80 >>>> 2: 20 80 10 >>>> >>>>We have 2 choices, >>>> >>>>a) >>>>node demotion targets >>>>0 1 >>>>2 1 >>> >>>This is achieved by >>> >>>[PATCH v2 1/5] mm: demotion: Set demotion list differently Yes, I think it makes sense to do 2a. Thanks, Davidlohr
On Mon, 2022-04-25 at 13:39 +0530, Aneesh Kumar K V wrote: > On 4/25/22 11:40 AM, ying.huang@intel.com wrote: > > On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote: > > > "ying.huang@intel.com" <ying.huang@intel.com> writes: > > > > > > > Hi, All, > > > > > > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: > > > > > > > > [snip] > > > > > > > > > I think it is necessary to either have per node demotion targets > > > > > configuration or the user space interface supported by this patch > > > > > series. As we don't have clear consensus on how the user interface > > > > > should look like, we can defer the per node demotion target set > > > > > interface to future until the real need arises. > > > > > > > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > > > > > driver, it may be possible that some memory node desired as demotion > > > > > target is not detected in the system from dax-device kmem probe path. > > > > > > > > > > It is also possible that some of the dax-devices are not preferred as > > > > > demotion target e.g. HBM, for such devices, node shouldn't be set to > > > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > > > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > > > > > kernel, but for now this user space interface will be useful to avoid > > > > > such devices as demotion targets. > > > > > > > > > > We can add read only interface to view per node demotion targets > > > > > from /sys/devices/system/node/nodeX/demotion_targets, remove > > > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > > > > > make /sys/devices/system/node/demotion_targets writable. > > > > > > > > > > Huang, Wei, Yang, > > > > > What do you suggest? > > > > > > > > We cannot remove a kernel ABI in practice. So we need to make it right > > > > at the first time. Let's try to collect some information for the kernel > > > > ABI definitation. > > > > > > > > The below is just a starting point, please add your requirements. > > > > > > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't > > > > want to use that as the demotion targets. But I don't think this is a > > > > issue in practice for now, because demote-in-reclaim is disabled by > > > > default. > > > > > > It is not just that the demotion can be disabled. We should be able to > > > use demotion on a system where we can find DRAM only NUMA nodes. That > > > cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs > > > something similar to to N_DEMOTION_TARGETS > > > > > > > Can you show NUMA information of your machines with DRAM-only nodes and > > PMEM nodes? We can try to find the proper demotion order for the > > system. If you can not show it, we can defer N_DEMOTION_TARGETS until > > the machine is available. > > > Sure will find one such config. As you might have noticed this is very > easy to have in a virtualization setup because the hypervisor can assign > memory to a guest VM from a numa node that doesn't have CPU assigned to > the same guest. This depends on the other guest VM instance config > running on the system. So on any virtualization config that has got > persistent memory attached, this can become an easy config to end up with. > Why would they want to do that? I am looking for a real issue, not a theoretical possibility. > > > > > 2.
For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > memory node near node 0, > > > > > > > > available: 3 nodes (0-2) > > > > node 0 cpus: 0 1 > > > > node 0 size: n MB > > > > node 0 free: n MB > > > > node 1 cpus: > > > > node 1 size: n MB > > > > node 1 free: n MB > > > > node 2 cpus: 2 3 > > > > node 2 size: n MB > > > > node 2 free: n MB > > > > node distances: > > > > node 0 1 2 > > > > 0: 10 40 20 > > > > 1: 40 10 80 > > > > 2: 20 80 10 > > > > > > > > We have 2 choices, > > > > > > > > a) > > > > node demotion targets > > > > 0 1 > > > > 2 1 > > > > > > This is achieved by > > > > > > [PATCH v2 1/5] mm: demotion: Set demotion list differently > > > > > > > > > > > b) > > > > node demotion targets > > > > 0 1 > > > > 2 X > > > > > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > traffic. Both are OK as defualt configuration. But some users may > > > > prefer the other one. So we need a user space ABI to override the > > > > default configuration. > > > > > > > > 3. For machines with HBM (High Bandwidth Memory), as in > > > > > > > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ > > > > > > > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 > > > > > > > > Although HBM has better performance than DDR, in ACPI SLIT, their > > > > distance to CPU is longer. We need to provide a way to fix this. The > > > > user space ABI is one way. The desired result will be to use local DDR > > > > as demotion targets of local HBM. > > > > > > > > > IMHO the above (2b and 3) can be done using per node demotion targets. Below is > > > what I think we could do with a single slow memory NUMA node 4. > > > > If we can use writable per-node demotion targets as ABI, then we don't > > need N_DEMOTION_TARGETS. > > > Not sure I understand that. Yes, once you have a writeable per node > demotion target it is easy to build any demotion order. Yes. > But that doesn't > mean we should not improve the default unless you have reason to say > that using N_DEMOTTION_TARGETS breaks any existing config. > Becuase N_DEMOTTION_TARGETS is a new kernel ABI to override the default, not the default itself. [1/5] of this patchset improve the default behavior itself, and I think that's good. Because we must maintain the kernel ABI almost for ever, we need to be careful about adding new ABI and add less if possible. If writable per- node demotion targets can address your issue. Then it's unnecessary to add another redundant kernel ABI for that. > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > 4 > > > 4 > > > 4 > > > 4 > > > > > > /sys/devices/system/node# echo 1 > node1/demotion_targets > > > bash: echo: write error: Invalid argument > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > 4 > > > 4 > > > 4 > > > 4 > > > > > > /sys/devices/system/node# echo 0 > node1/demotion_targets > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > 4 > > > 0 > > > 4 > > > 4 > > > > > > /sys/devices/system/node# echo 1 > node0/demotion_targets > > > bash: echo: write error: Invalid argument > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > 4 > > > 0 > > > 4 > > > 4 > > > > > > Disable demotion for a specific node. 
> > > /sys/devices/system/node# echo > node1/demotion_targets > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > 4 > > > > > > 4 > > > 4 > > > > > > Reset demotion to default > > > /sys/devices/system/node# echo -1 > node1/demotion_targets > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > 4 > > > 4 > > > 4 > > > 4 > > > > > > When a specific device/NUMA node is used for demotion target via the user interface, it is taken > > > out of other NUMA node targets. > > > > IMHO, we should be careful about interaction between auto-generated and > > overridden demotion order. > > > > yes, we should avoid loop between that. In addition to that, we need to get the same result after hot-removing and then hot-adding the same node. That is, the result should be stable after a NOOP. I guess we can just always, - Generate the default demotion order automatically without any overriding. - Apply the overriding, after removing the invalid targets, etc. > But if you agree for the above > ABI we could go ahead and share the implementation code. I think we need to add a way to distinguish auto-generated and overridden demotion targets in the output of nodeX/demotion_targets. Otherwise it looks good to me. Best Regards, Huang, Ying > > > root@ubuntu-guest:/sys/devices/system/node# cat node[0-4]/demotion_targets > > > 4 > > > 4 > > > 4 > > > 4 > > > > > > /sys/devices/system/node# echo 4 > node1/demotion_targets > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > > > 4 > > > > > > > > > > > > If more than one node requies the same demotion target > > > /sys/devices/system/node# echo 4 > node0/demotion_targets > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > 4 > > > 4 > > > > > > > > > > > > -aneesh > > > > > > -aneesh
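A rough sketch of the two-step rebuild described above (regenerate the automatic order, then apply any user override), assuming a hypothetical per-node override mask; user_override[] and auto_demotion_targets() are placeholders, not symbols from the posted series:

#include <linux/nodemask.h>

static nodemask_t user_override[MAX_NUMNODES];      /* empty = no override */
static nodemask_t auto_demotion_targets(int nid);   /* the automatic default */

static void rebuild_demotion_order(nodemask_t demotion_targets[])
{
	int nid;

	for_each_node_state(nid, N_MEMORY) {
		/* Step 1: regenerate the automatic default from scratch. */
		demotion_targets[nid] = auto_demotion_targets(nid);

		/*
		 * Step 2: apply the override, dropping targets that are no
		 * longer memory nodes, so hot-removing and then hot-adding
		 * the same node ends up as a NOOP.
		 */
		if (!nodes_empty(user_override[nid]))
			nodes_and(demotion_targets[nid], user_override[nid],
				  node_states[N_MEMORY]);
	}
}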
On 4/26/22 2:12 PM, ying.huang@intel.com wrote: > On Mon, 2022-04-25 at 13:39 +0530, Aneesh Kumar K V wrote: >> On 4/25/22 11:40 AM, ying.huang@intel.com wrote: >>> On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote: >>>> "ying.huang@intel.com" <ying.huang@intel.com> writes: >>>> >>>>> Hi, All, >>>>> >>>>> On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: >>>>> >>>>> [snip] >>>>> >>>>>> I think it is necessary to either have per node demotion targets >>>>>> configuration or the user space interface supported by this patch >>>>>> series. As we don't have clear consensus on how the user interface >>>>>> should look like, we can defer the per node demotion target set >>>>>> interface to future until the real need arises. >>>>>> >>>>>> Current patch series sets N_DEMOTION_TARGET from dax device kmem >>>>>> driver, it may be possible that some memory node desired as demotion >>>>>> target is not detected in the system from dax-device kmem probe path. >>>>>> >>>>>> It is also possible that some of the dax-devices are not preferred as >>>>>> demotion target e.g. HBM, for such devices, node shouldn't be set to >>>>>> N_DEMOTION_TARGETS. In future, Support should be added to distinguish >>>>>> such dax-devices and not mark them as N_DEMOTION_TARGETS from the >>>>>> kernel, but for now this user space interface will be useful to avoid >>>>>> such devices as demotion targets. >>>>>> >>>>>> We can add read only interface to view per node demotion targets >>>>>> from /sys/devices/system/node/nodeX/demotion_targets, remove >>>>>> duplicated /sys/kernel/mm/numa/demotion_target interface and instead >>>>>> make /sys/devices/system/node/demotion_targets writable. >>>>>> >>>>>> Huang, Wei, Yang, >>>>>> What do you suggest? >>>>> >>>>> We cannot remove a kernel ABI in practice. So we need to make it right >>>>> at the first time. Let's try to collect some information for the kernel >>>>> ABI definitation. >>>>> >>>>> The below is just a starting point, please add your requirements. >>>>> >>>>> 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't >>>>> want to use that as the demotion targets. But I don't think this is a >>>>> issue in practice for now, because demote-in-reclaim is disabled by >>>>> default. >>>> >>>> It is not just that the demotion can be disabled. We should be able to >>>> use demotion on a system where we can find DRAM only NUMA nodes. That >>>> cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs >>>> something similar to to N_DEMOTION_TARGETS >>>> >>> >>> Can you show NUMA information of your machines with DRAM-only nodes and >>> PMEM nodes? We can try to find the proper demotion order for the >>> system. If you can not show it, we can defer N_DEMOTION_TARGETS until >>> the machine is available. >> >> >> Sure will find one such config. As you might have noticed this is very >> easy to have in a virtualization setup because the hypervisor can assign >> memory to a guest VM from a numa node that doesn't have CPU assigned to >> the same guest. This depends on the other guest VM instance config >> running on the system. So on any virtualization config that has got >> persistent memory attached, this can become an easy config to end up with. >> > > Why they want to do that? I am looking forward to a real issue, not > theoritical possibility. > Can you elaborate this more? That is a real config. >> >>>>> 2. 
For machines with PMEM installed in only 1 of 2 sockets, for example, >>>>> >>>>> Node 0 & 2 are cpu + dram nodes and node 1 are slow >>>>> memory node near node 0, >>>>> >>>>> available: 3 nodes (0-2) >>>>> node 0 cpus: 0 1 >>>>> node 0 size: n MB >>>>> node 0 free: n MB >>>>> node 1 cpus: >>>>> node 1 size: n MB >>>>> node 1 free: n MB >>>>> node 2 cpus: 2 3 >>>>> node 2 size: n MB >>>>> node 2 free: n MB >>>>> node distances: >>>>> node 0 1 2 >>>>> 0: 10 40 20 >>>>> 1: 40 10 80 >>>>> 2: 20 80 10 >>>>> >>>>> We have 2 choices, >>>>> >>>>> a) >>>>> node demotion targets >>>>> 0 1 >>>>> 2 1 >>>> >>>> This is achieved by >>>> >>>> [PATCH v2 1/5] mm: demotion: Set demotion list differently >>>> >>>>> >>>>> b) >>>>> node demotion targets >>>>> 0 1 >>>>> 2 X >>>> >>>> >>>>> >>>>> a) is good to take advantage of PMEM. b) is good to reduce cross-socket >>>>> traffic. Both are OK as defualt configuration. But some users may >>>>> prefer the other one. So we need a user space ABI to override the >>>>> default configuration. >>>>> >>>>> 3. For machines with HBM (High Bandwidth Memory), as in >>>>> >>>>> https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ >>>>> >>>>>> [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 >>>>> >>>>> Although HBM has better performance than DDR, in ACPI SLIT, their >>>>> distance to CPU is longer. We need to provide a way to fix this. The >>>>> user space ABI is one way. The desired result will be to use local DDR >>>>> as demotion targets of local HBM. >>>> >>>> >>>> IMHO the above (2b and 3) can be done using per node demotion targets. Below is >>>> what I think we could do with a single slow memory NUMA node 4. >>> >>> If we can use writable per-node demotion targets as ABI, then we don't >>> need N_DEMOTION_TARGETS. >> >> >> Not sure I understand that. Yes, once you have a writeable per node >> demotion target it is easy to build any demotion order. > > Yes. > >> But that doesn't >> mean we should not improve the default unless you have reason to say >> that using N_DEMOTTION_TARGETS breaks any existing config. >> > > Becuase N_DEMOTTION_TARGETS is a new kernel ABI to override the default, > not the default itself. [1/5] of this patchset improve the default > behavior itself, and I think that's good. > we are improving the default by using N_DEMOTION_TARGETS because the current default breaks configs which can get you memory only NUMA nodes. I would not consider it an override. > Because we must maintain the kernel ABI almost for ever, we need to be > careful about adding new ABI and add less if possible. If writable per- > node demotion targets can address your issue. Then it's unnecessary to > add another redundant kernel ABI for that. This means on platform like powerpc, we would always need to have a userspace managed demotion because we can end up with memory only numa nodes for them. Why force that? 
> >>>> /sys/devices/system/node# cat node[0-4]/demotion_targets >>>> 4 >>>> 4 >>>> 4 >>>> 4 >>>> >>>> /sys/devices/system/node# echo 1 > node1/demotion_targets >>>> bash: echo: write error: Invalid argument >>>> /sys/devices/system/node# cat node[0-4]/demotion_targets >>>> 4 >>>> 4 >>>> 4 >>>> 4 >>>> >>>> /sys/devices/system/node# echo 0 > node1/demotion_targets >>>> /sys/devices/system/node# cat node[0-4]/demotion_targets >>>> 4 >>>> 0 >>>> 4 >>>> 4 >>>> >>>> /sys/devices/system/node# echo 1 > node0/demotion_targets >>>> bash: echo: write error: Invalid argument >>>> /sys/devices/system/node# cat node[0-4]/demotion_targets >>>> 4 >>>> 0 >>>> 4 >>>> 4 >>>> >>>> Disable demotion for a specific node. >>>> /sys/devices/system/node# echo > node1/demotion_targets >>>> /sys/devices/system/node# cat node[0-4]/demotion_targets >>>> 4 >>>> >>>> 4 >>>> 4 >>>> >>>> Reset demotion to default >>>> /sys/devices/system/node# echo -1 > node1/demotion_targets >>>> /sys/devices/system/node# cat node[0-4]/demotion_targets >>>> 4 >>>> 4 >>>> 4 >>>> 4 >>>> >>>> When a specific device/NUMA node is used for demotion target via the user interface, it is taken >>>> out of other NUMA node targets. >>> >>> IMHO, we should be careful about interaction between auto-generated and >>> overridden demotion order. >>> >> >> yes, we should avoid loop between that. > > In addition to that, we need to get same result after hot-remove then > hot-add the same node. That is, the result should be stable after NOOP. > I guess we can just always, > > - Generate the default demotion order automatically without any > overriding. > > - Apply the overriding, after removing the invalid targets, etc. > >> But if you agree for the above >> ABI we could go ahead and share the implementation code. > > I think we need to add a way to distinguish auto-generated and overriden > demotion targets in the output of nodeX/demotion_targets. Otherwise it > looks good to me. > something like: /sys/devices/system/node# echo 4 > node1/demotion_targets /sys/devices/system/node# cat node[0-4]/demotion_targets - 4 (userspace override) - - - -aneesh
On Tue, 2022-04-26 at 14:32 +0530, Aneesh Kumar K V wrote: > On 4/26/22 2:12 PM, ying.huang@intel.com wrote: > > On Mon, 2022-04-25 at 13:39 +0530, Aneesh Kumar K V wrote: > > > On 4/25/22 11:40 AM, ying.huang@intel.com wrote: > > > > On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote: > > > > > "ying.huang@intel.com" <ying.huang@intel.com> writes: > > > > > > > > > > > Hi, All, > > > > > > > > > > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: > > > > > > > > > > > > [snip] > > > > > > > > > > > > > I think it is necessary to either have per node demotion targets > > > > > > > configuration or the user space interface supported by this patch > > > > > > > series. As we don't have clear consensus on how the user interface > > > > > > > should look like, we can defer the per node demotion target set > > > > > > > interface to future until the real need arises. > > > > > > > > > > > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > > > > > > > driver, it may be possible that some memory node desired as demotion > > > > > > > target is not detected in the system from dax-device kmem probe path. > > > > > > > > > > > > > > It is also possible that some of the dax-devices are not preferred as > > > > > > > demotion target e.g. HBM, for such devices, node shouldn't be set to > > > > > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > > > > > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > > > > > > > kernel, but for now this user space interface will be useful to avoid > > > > > > > such devices as demotion targets. > > > > > > > > > > > > > > We can add read only interface to view per node demotion targets > > > > > > > from /sys/devices/system/node/nodeX/demotion_targets, remove > > > > > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > > > > > > > make /sys/devices/system/node/demotion_targets writable. > > > > > > > > > > > > > > Huang, Wei, Yang, > > > > > > > What do you suggest? > > > > > > > > > > > > We cannot remove a kernel ABI in practice. So we need to make it right > > > > > > at the first time. Let's try to collect some information for the kernel > > > > > > ABI definitation. > > > > > > > > > > > > The below is just a starting point, please add your requirements. > > > > > > > > > > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't > > > > > > want to use that as the demotion targets. But I don't think this is a > > > > > > issue in practice for now, because demote-in-reclaim is disabled by > > > > > > default. > > > > > > > > > > It is not just that the demotion can be disabled. We should be able to > > > > > use demotion on a system where we can find DRAM only NUMA nodes. That > > > > > cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs > > > > > something similar to to N_DEMOTION_TARGETS > > > > > > > > > > > > > Can you show NUMA information of your machines with DRAM-only nodes and > > > > PMEM nodes? We can try to find the proper demotion order for the > > > > system. If you can not show it, we can defer N_DEMOTION_TARGETS until > > > > the machine is available. > > > > > > > > > Sure will find one such config. As you might have noticed this is very > > > easy to have in a virtualization setup because the hypervisor can assign > > > memory to a guest VM from a numa node that doesn't have CPU assigned to > > > the same guest. This depends on the other guest VM instance config > > > running on the system. 
So on any virtualization config that has got > > > persistent memory attached, this can become an easy config to end up with. > > > > > > > Why they want to do that? I am looking forward to a real issue, not > > theoritical possibility. > > > > > Can you elaborate this more? That is a real config. > > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > > > memory node near node 0, > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > node 0 cpus: 0 1 > > > > > > node 0 size: n MB > > > > > > node 0 free: n MB > > > > > > node 1 cpus: > > > > > > node 1 size: n MB > > > > > > node 1 free: n MB > > > > > > node 2 cpus: 2 3 > > > > > > node 2 size: n MB > > > > > > node 2 free: n MB > > > > > > node distances: > > > > > > node 0 1 2 > > > > > > 0: 10 40 20 > > > > > > 1: 40 10 80 > > > > > > 2: 20 80 10 > > > > > > > > > > > > We have 2 choices, > > > > > > > > > > > > a) > > > > > > node demotion targets > > > > > > 0 1 > > > > > > 2 1 > > > > > > > > > > This is achieved by > > > > > > > > > > [PATCH v2 1/5] mm: demotion: Set demotion list differently > > > > > > > > > > > > > > > > > b) > > > > > > node demotion targets > > > > > > 0 1 > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > > > traffic. Both are OK as defualt configuration. But some users may > > > > > > prefer the other one. So we need a user space ABI to override the > > > > > > default configuration. > > > > > > > > > > > > 3. For machines with HBM (High Bandwidth Memory), as in > > > > > > > > > > > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ > > > > > > > > > > > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 > > > > > > > > > > > > Although HBM has better performance than DDR, in ACPI SLIT, their > > > > > > distance to CPU is longer. We need to provide a way to fix this. The > > > > > > user space ABI is one way. The desired result will be to use local DDR > > > > > > as demotion targets of local HBM. > > > > > > > > > > > > > > > IMHO the above (2b and 3) can be done using per node demotion targets. Below is > > > > > what I think we could do with a single slow memory NUMA node 4. > > > > > > > > If we can use writable per-node demotion targets as ABI, then we don't > > > > need N_DEMOTION_TARGETS. > > > > > > > > > Not sure I understand that. Yes, once you have a writeable per node > > > demotion target it is easy to build any demotion order. > > > > Yes. > > > > > But that doesn't > > > mean we should not improve the default unless you have reason to say > > > that using N_DEMOTTION_TARGETS breaks any existing config. > > > > > > > Becuase N_DEMOTTION_TARGETS is a new kernel ABI to override the default, > > not the default itself. [1/5] of this patchset improve the default > > behavior itself, and I think that's good. > > > > we are improving the default by using N_DEMOTION_TARGETS because the > current default breaks configs which can get you memory only NUMA nodes. > I would not consider it an override. > OK. I guess that there is some misunderstanding here. I thought that you refer to N_DEMOTION_TARGETS overriden via make the following file writable, /sys/devices/system/node/demotion_targets Now, I think you are referring to setting N_DEMOTION_TARGETS in kmem driver by default. Sorry if I misunderstood you. So, to be clear. 
I am OK to restrict default demotion targets via kmem driver (we can improve this in the future with more source). But I don't think it's good to make /sys/devices/system/node/demotion_targets writable. Instead, I think it's better to make /sys/devices/system/node/nodeX/demotion_targets writable. > > Because we must maintain the kernel ABI almost for ever, we need to be > > careful about adding new ABI and add less if possible. If writable per- > > node demotion targets can address your issue. Then it's unnecessary to > > add another redundant kernel ABI for that. > > This means on platform like powerpc, we would always need to have a > userspace managed demotion because we can end up with memory only numa > nodes for them. Why force that? Please take a look at the above. > > > > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > > 4 > > > > > 4 > > > > > 4 > > > > > 4 > > > > > > > > > > /sys/devices/system/node# echo 1 > node1/demotion_targets > > > > > bash: echo: write error: Invalid argument > > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > > 4 > > > > > 4 > > > > > 4 > > > > > 4 > > > > > > > > > > /sys/devices/system/node# echo 0 > node1/demotion_targets > > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > > 4 > > > > > 0 > > > > > 4 > > > > > 4 > > > > > > > > > > /sys/devices/system/node# echo 1 > node0/demotion_targets > > > > > bash: echo: write error: Invalid argument > > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > > 4 > > > > > 0 > > > > > 4 > > > > > 4 > > > > > > > > > > Disable demotion for a specific node. > > > > > /sys/devices/system/node# echo > node1/demotion_targets > > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > > 4 > > > > > > > > > > 4 > > > > > 4 > > > > > > > > > > Reset demotion to default > > > > > /sys/devices/system/node# echo -1 > node1/demotion_targets > > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > > 4 > > > > > 4 > > > > > 4 > > > > > 4 > > > > > > > > > > When a specific device/NUMA node is used for demotion target via the user interface, it is taken > > > > > out of other NUMA node targets. > > > > > > > > IMHO, we should be careful about interaction between auto-generated and > > > > overridden demotion order. > > > > > > > > > > yes, we should avoid loop between that. > > > > In addition to that, we need to get same result after hot-remove then > > hot-add the same node. That is, the result should be stable after NOOP. > > I guess we can just always, > > > > - Generate the default demotion order automatically without any > > overriding. > > > > - Apply the overriding, after removing the invalid targets, etc. > > > > > But if you agree for the above > > > ABI we could go ahead and share the implementation code. > > > > I think we need to add a way to distinguish auto-generated and overriden > > demotion targets in the output of nodeX/demotion_targets. Otherwise it > > looks good to me. > > > > > something like: > > /sys/devices/system/node# echo 4 > node1/demotion_targets > /sys/devices/system/node# cat node[0-4]/demotion_targets > - > 4 (userspace override) > - > - > - > Or /sys/devices/system/node# echo 4 > node1/demotion_targets /sys/devices/system/node# cat node[0-4]/demotion_targets - *4 - - - Best Regards, Huang, Ying
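To make the "mark overridden targets" idea concrete, a sketch of what a nodeX/demotion_targets show() with the '*' marker could look like; node_demotion[] and target_overridden() are assumed placeholders, not the series' actual symbols:

#include <linux/device.h>
#include <linux/nodemask.h>

static nodemask_t node_demotion[MAX_NUMNODES];       /* assumed per-node targets */
static bool target_overridden(int nid, int target);  /* assumed helper */

static ssize_t demotion_targets_show(struct device *dev,
				     struct device_attribute *attr, char *buf)
{
	int nid = dev->id;	/* node devices use the node id here */
	int t, len = 0;

	for_each_node_mask(t, node_demotion[nid])
		len += sysfs_emit_at(buf, len, "%s%d ",
				     target_overridden(nid, t) ? "*" : "", t);
	len += sysfs_emit_at(buf, len, "\n");
	return len;
}
static DEVICE_ATTR_RO(demotion_targets);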
On Tue, Apr 26, 2022 at 1:43 AM ying.huang@intel.com <ying.huang@intel.com> wrote: > > On Mon, 2022-04-25 at 13:39 +0530, Aneesh Kumar K V wrote: > > On 4/25/22 11:40 AM, ying.huang@intel.com wrote: > > > On Mon, 2022-04-25 at 09:20 +0530, Aneesh Kumar K.V wrote: > > > > "ying.huang@intel.com" <ying.huang@intel.com> writes: > > > > > > > > > Hi, All, > > > > > > > > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: > > > > > > > > > > [snip] > > > > > > > > > > > I think it is necessary to either have per node demotion targets > > > > > > configuration or the user space interface supported by this patch > > > > > > series. As we don't have clear consensus on how the user interface > > > > > > should look like, we can defer the per node demotion target set > > > > > > interface to future until the real need arises. > > > > > > > > > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > > > > > > driver, it may be possible that some memory node desired as demotion > > > > > > target is not detected in the system from dax-device kmem probe path. > > > > > > > > > > > > It is also possible that some of the dax-devices are not preferred as > > > > > > demotion target e.g. HBM, for such devices, node shouldn't be set to > > > > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > > > > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > > > > > > kernel, but for now this user space interface will be useful to avoid > > > > > > such devices as demotion targets. > > > > > > > > > > > > We can add read only interface to view per node demotion targets > > > > > > from /sys/devices/system/node/nodeX/demotion_targets, remove > > > > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > > > > > > make /sys/devices/system/node/demotion_targets writable. > > > > > > > > > > > > Huang, Wei, Yang, > > > > > > What do you suggest? > > > > > > > > > > We cannot remove a kernel ABI in practice. So we need to make it right > > > > > at the first time. Let's try to collect some information for the kernel > > > > > ABI definitation. > > > > > > > > > > The below is just a starting point, please add your requirements. > > > > > > > > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't > > > > > want to use that as the demotion targets. But I don't think this is a > > > > > issue in practice for now, because demote-in-reclaim is disabled by > > > > > default. > > > > > > > > It is not just that the demotion can be disabled. We should be able to > > > > use demotion on a system where we can find DRAM only NUMA nodes. That > > > > cannot be achieved by /sys/kernel/mm/numa/demotion_enabled. It needs > > > > something similar to to N_DEMOTION_TARGETS > > > > > > > > > > Can you show NUMA information of your machines with DRAM-only nodes and > > > PMEM nodes? We can try to find the proper demotion order for the > > > system. If you can not show it, we can defer N_DEMOTION_TARGETS until > > > the machine is available. > > > > > > Sure will find one such config. As you might have noticed this is very > > easy to have in a virtualization setup because the hypervisor can assign > > memory to a guest VM from a numa node that doesn't have CPU assigned to > > the same guest. This depends on the other guest VM instance config > > running on the system. So on any virtualization config that has got > > persistent memory attached, this can become an easy config to end up with. > > > > Why they want to do that? 
I am looking forward to a real issue, not > theoritical possibility. > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > > memory node near node 0, > > > > > > > > > > available: 3 nodes (0-2) > > > > > node 0 cpus: 0 1 > > > > > node 0 size: n MB > > > > > node 0 free: n MB > > > > > node 1 cpus: > > > > > node 1 size: n MB > > > > > node 1 free: n MB > > > > > node 2 cpus: 2 3 > > > > > node 2 size: n MB > > > > > node 2 free: n MB > > > > > node distances: > > > > > node 0 1 2 > > > > > 0: 10 40 20 > > > > > 1: 40 10 80 > > > > > 2: 20 80 10 > > > > > > > > > > We have 2 choices, > > > > > > > > > > a) > > > > > node demotion targets > > > > > 0 1 > > > > > 2 1 > > > > > > > > This is achieved by > > > > > > > > [PATCH v2 1/5] mm: demotion: Set demotion list differently > > > > > > > > > > > > > > b) > > > > > node demotion targets > > > > > 0 1 > > > > > 2 X > > > > > > > > > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > > traffic. Both are OK as defualt configuration. But some users may > > > > > prefer the other one. So we need a user space ABI to override the > > > > > default configuration. > > > > > > > > > > 3. For machines with HBM (High Bandwidth Memory), as in > > > > > > > > > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ > > > > > > > > > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 > > > > > > > > > > Although HBM has better performance than DDR, in ACPI SLIT, their > > > > > distance to CPU is longer. We need to provide a way to fix this. The > > > > > user space ABI is one way. The desired result will be to use local DDR > > > > > as demotion targets of local HBM. > > > > > > > > > > > > IMHO the above (2b and 3) can be done using per node demotion targets. Below is > > > > what I think we could do with a single slow memory NUMA node 4. > > > > > > If we can use writable per-node demotion targets as ABI, then we don't > > > need N_DEMOTION_TARGETS. > > > > > > Not sure I understand that. Yes, once you have a writeable per node > > demotion target it is easy to build any demotion order. > > Yes. > > > But that doesn't > > mean we should not improve the default unless you have reason to say > > that using N_DEMOTTION_TARGETS breaks any existing config. > > > > Becuase N_DEMOTTION_TARGETS is a new kernel ABI to override the default, > not the default itself. [1/5] of this patchset improve the default > behavior itself, and I think that's good. > > Because we must maintain the kernel ABI almost for ever, we need to be > careful about adding new ABI and add less if possible. If writable per- > node demotion targets can address your issue. Then it's unnecessary to > add another redundant kernel ABI for that. I still think the kernel should initialize the per-node demotion order in a way similar to allocation fallback order and there is no need for a userspace interface to override per-node demotion order. But I don't object to such a per-node demotion order override interface proposed here. On the other hand, I think it is better to preserve the system-wide /sys/devices/system/node/demotion_targets as writable. 
If the userspace only wants to specify a specific set of nodes as the demotion tier and is perfectly fine with the per-node demotion order generated by the kernel, why should we enforce the userspace to have to manually define the per-node demotion order as well? > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > 4 > > > > 4 > > > > 4 > > > > 4 > > > > > > > > /sys/devices/system/node# echo 1 > node1/demotion_targets > > > > bash: echo: write error: Invalid argument > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > 4 > > > > 4 > > > > 4 > > > > 4 > > > > > > > > /sys/devices/system/node# echo 0 > node1/demotion_targets > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > 4 > > > > 0 > > > > 4 > > > > 4 > > > > > > > > /sys/devices/system/node# echo 1 > node0/demotion_targets > > > > bash: echo: write error: Invalid argument > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > 4 > > > > 0 > > > > 4 > > > > 4 > > > > > > > > Disable demotion for a specific node. > > > > /sys/devices/system/node# echo > node1/demotion_targets > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > 4 > > > > > > > > 4 > > > > 4 > > > > > > > > Reset demotion to default > > > > /sys/devices/system/node# echo -1 > node1/demotion_targets > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > 4 > > > > 4 > > > > 4 > > > > 4 > > > > > > > > When a specific device/NUMA node is used for demotion target via the user interface, it is taken > > > > out of other NUMA node targets. > > > > > > IMHO, we should be careful about interaction between auto-generated and > > > overridden demotion order. > > > > > > > yes, we should avoid loop between that. > > In addition to that, we need to get same result after hot-remove then > hot-add the same node. That is, the result should be stable after NOOP. > I guess we can just always, > > - Generate the default demotion order automatically without any > overriding. > > - Apply the overriding, after removing the invalid targets, etc. > > > But if you agree for the above > > ABI we could go ahead and share the implementation code. > > I think we need to add a way to distinguish auto-generated and overriden > demotion targets in the output of nodeX/demotion_targets. Otherwise it > looks good to me. > > Best Regards, > Huang, Ying > > > > > root@ubuntu-guest:/sys/devices/system/node# cat node[0-4]/demotion_targets > > > > 4 > > > > 4 > > > > 4 > > > > 4 > > > > > > > > /sys/devices/system/node# echo 4 > node1/demotion_targets > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > > > > > 4 > > > > > > > > > > > > > > > > If more than one node requies the same demotion target > > > > /sys/devices/system/node# echo 4 > node0/demotion_targets > > > > /sys/devices/system/node# cat node[0-4]/demotion_targets > > > > 4 > > > > 4 > > > > > > > > > > > > > > > > -aneesh > > > > > > > > > > -aneesh > >
On 4/25/22 10:26 PM, Wei Xu wrote: > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > <ying.huang@intel.com> wrote: >> .... >> 2. For machines with PMEM installed in only 1 of 2 sockets, for example, >> >> Node 0 & 2 are cpu + dram nodes and node 1 are slow >> memory node near node 0, >> >> available: 3 nodes (0-2) >> node 0 cpus: 0 1 >> node 0 size: n MB >> node 0 free: n MB >> node 1 cpus: >> node 1 size: n MB >> node 1 free: n MB >> node 2 cpus: 2 3 >> node 2 size: n MB >> node 2 free: n MB >> node distances: >> node 0 1 2 >> 0: 10 40 20 >> 1: 40 10 80 >> 2: 20 80 10 >> >> We have 2 choices, >> >> a) >> node demotion targets >> 0 1 >> 2 1 >> >> b) >> node demotion targets >> 0 1 >> 2 X >> >> a) is good to take advantage of PMEM. b) is good to reduce cross-socket >> traffic. Both are OK as defualt configuration. But some users may >> prefer the other one. So we need a user space ABI to override the >> default configuration. > > I think 2(a) should be the system-wide configuration and 2(b) can be > achieved with NUMA mempolicy (which needs to be added to demotion). > > In general, we can view the demotion order in a way similar to > allocation fallback order (after all, if we don't demote or demotion > lags behind, the allocations will go to these demotion target nodes > according to the allocation fallback order anyway). If we initialize > the demotion order in that way (i.e. every node can demote to any node > in the next tier, and the priority of the target nodes is sorted for > each source node), we don't need per-node demotion order override from > the userspace. What we need is to specify what nodes should be in > each tier and support NUMA mempolicy in demotion. > I have been wondering how we would handle this. For ex: If an application has specified an MPOL_BIND policy and restricted the allocation to be from Node0 and Node1, should we demote pages allocated by that application to Node10? The other alternative for that demotion is swapping. So from the page point of view, we either demote to a slow memory or pageout to swap. But then if we demote we are also breaking the MPOL_BIND rule. The above says we would need some kind of mem policy interaction, but what I am not sure about is how to find the memory policy in the demotion path. > Cross-socket demotion should not be too big a problem in practice > because we can optimize the code to do the demotion from the local CPU > node (i.e. local writes to the target node and remote read from the > source node). The bigger issue is cross-socket memory access onto the > demoted pages from the applications, which is why NUMA mempolicy is > important here. > > -aneesh
On Mon, 2022-04-25 at 09:56 -0700, Wei Xu wrote: > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > <ying.huang@intel.com> wrote: > > > > Hi, All, > > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: > > > > [snip] > > > > > I think it is necessary to either have per node demotion targets > > > configuration or the user space interface supported by this patch > > > series. As we don't have clear consensus on how the user interface > > > should look like, we can defer the per node demotion target set > > > interface to future until the real need arises. > > > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > > > driver, it may be possible that some memory node desired as demotion > > > target is not detected in the system from dax-device kmem probe path. > > > > > > It is also possible that some of the dax-devices are not preferred as > > > demotion target e.g. HBM, for such devices, node shouldn't be set to > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > > > kernel, but for now this user space interface will be useful to avoid > > > such devices as demotion targets. > > > > > > We can add read only interface to view per node demotion targets > > > from /sys/devices/system/node/nodeX/demotion_targets, remove > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > > > make /sys/devices/system/node/demotion_targets writable. > > > > > > Huang, Wei, Yang, > > > What do you suggest? > > > > We cannot remove a kernel ABI in practice. So we need to make it right > > at the first time. Let's try to collect some information for the kernel > > ABI definitation. > > > > The below is just a starting point, please add your requirements. > > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't > > want to use that as the demotion targets. But I don't think this is a > > issue in practice for now, because demote-in-reclaim is disabled by > > default. > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > memory node near node 0, > > > > available: 3 nodes (0-2) > > node 0 cpus: 0 1 > > node 0 size: n MB > > node 0 free: n MB > > node 1 cpus: > > node 1 size: n MB > > node 1 free: n MB > > node 2 cpus: 2 3 > > node 2 size: n MB > > node 2 free: n MB > > node distances: > > node 0 1 2 > > 0: 10 40 20 > > 1: 40 10 80 > > 2: 20 80 10 > > > > We have 2 choices, > > > > a) > > node demotion targets > > 0 1 > > 2 1 > > > > b) > > node demotion targets > > 0 1 > > 2 X > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > traffic. Both are OK as defualt configuration. But some users may > > prefer the other one. So we need a user space ABI to override the > > default configuration. > > I think 2(a) should be the system-wide configuration and 2(b) can be > achieved with NUMA mempolicy (which needs to be added to demotion). Unfortunately, some NUMA mempolicy information isn't available at demotion time, for example, mempolicy enforced via set_mempolicy() is for thread. But I think that cpusets can work for demotion. > In general, we can view the demotion order in a way similar to > allocation fallback order (after all, if we don't demote or demotion > lags behind, the allocations will go to these demotion target nodes > according to the allocation fallback order anyway). 
If we initialize > the demotion order in that way (i.e. every node can demote to any node > in the next tier, and the priority of the target nodes is sorted for > each source node), we don't need per-node demotion order override from > the userspace. What we need is to specify what nodes should be in > each tier and support NUMA mempolicy in demotion. This sounds interesting. Tier sounds like a natural and general concept for these memory types. It's attractive to use it for the user space interface too. For example, we may use that for mem_cgroup limits of a specific memory type (tier). And if we take a look at N_DEMOTION_TARGETS again from the "tier" point of view, the nodes are divided into 2 classes via N_DEMOTION_TARGETS. - The nodes without N_DEMOTION_TARGETS are top tier (or tier 0). - The nodes with N_DEMOTION_TARGETS are non-top tier (or tier 1, 2, 3, ...) So, another possibility is to fit N_DEMOTION_TARGETS and its overriding into the "tier" concept too. !N_DEMOTION_TARGETS == TIER0. - All nodes start with TIER0 - TIER0 can be cleared for some nodes via e.g. the kmem driver The TIER0 node list can be read or overridden by user space via the following interface, /sys/devices/system/node/tier0 In the future, if we want to customize more tiers, we can add tier1, tier2, tier3, ..... For now, we can add just tier0. That is, the interface is extensible in the future compared with .../node/demote_targets. This isn't as flexible as the writable per-node demotion targets. But it may be enough for most requirements? Best Regards, Huang, Ying > Cross-socket demotion should not be too big a problem in practice > because we can optimize the code to do the demotion from the local CPU > node (i.e. local writes to the target node and remote read from the > source node). The bigger issue is cross-socket memory access onto the > demoted pages from the applications, which is why NUMA mempolicy is > important here. > > > 3. For machines with HBM (High Bandwidth Memory), as in > > > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ > > > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 > > > > Although HBM has better performance than DDR, in ACPI SLIT, their > > distance to CPU is longer. We need to provide a way to fix this. The > > user space ABI is one way. The desired result will be to use local DDR > > as demotion targets of local HBM. > > > > Best Regards, > > Huang, Ying > >
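A minimal sketch of the proposed /sys/devices/system/node/tier0 file under the "!N_DEMOTION_TARGETS == TIER0" definition; illustrative only, not code from the series (the rebuild of the demotion order after a write is left out):

#include <linux/device.h>
#include <linux/nodemask.h>

/* tier 0 == memory nodes that are not demotion targets */
static ssize_t tier0_show(struct device *dev,
			  struct device_attribute *attr, char *buf)
{
	nodemask_t tier0;

	nodes_andnot(tier0, node_states[N_MEMORY],
		     node_states[N_DEMOTION_TARGETS]);
	return sysfs_emit(buf, "%*pbl\n", nodemask_pr_args(&tier0));
}

static ssize_t tier0_store(struct device *dev, struct device_attribute *attr,
			   const char *buf, size_t count)
{
	nodemask_t tier0;
	int nid;

	if (nodelist_parse(buf, tier0))
		return -EINVAL;

	/*
	 * Memory nodes left out of the written list become demotion
	 * targets; the demotion order would then be rebuilt.
	 */
	for_each_node_state(nid, N_MEMORY) {
		if (node_isset(nid, tier0))
			node_clear_state(nid, N_DEMOTION_TARGETS);
		else
			node_set_state(nid, N_DEMOTION_TARGETS);
	}
	return count;
}
static DEVICE_ATTR_RW(tier0);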
On Wed, Apr 27, 2022 at 12:11 AM ying.huang@intel.com <ying.huang@intel.com> wrote: > > On Mon, 2022-04-25 at 09:56 -0700, Wei Xu wrote: > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > <ying.huang@intel.com> wrote: > > > > > > Hi, All, > > > > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: > > > > > > [snip] > > > > > > > I think it is necessary to either have per node demotion targets > > > > configuration or the user space interface supported by this patch > > > > series. As we don't have clear consensus on how the user interface > > > > should look like, we can defer the per node demotion target set > > > > interface to future until the real need arises. > > > > > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > > > > driver, it may be possible that some memory node desired as demotion > > > > target is not detected in the system from dax-device kmem probe path. > > > > > > > > It is also possible that some of the dax-devices are not preferred as > > > > demotion target e.g. HBM, for such devices, node shouldn't be set to > > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > > > > kernel, but for now this user space interface will be useful to avoid > > > > such devices as demotion targets. > > > > > > > > We can add read only interface to view per node demotion targets > > > > from /sys/devices/system/node/nodeX/demotion_targets, remove > > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > > > > make /sys/devices/system/node/demotion_targets writable. > > > > > > > > Huang, Wei, Yang, > > > > What do you suggest? > > > > > > We cannot remove a kernel ABI in practice. So we need to make it right > > > at the first time. Let's try to collect some information for the kernel > > > ABI definitation. > > > > > > The below is just a starting point, please add your requirements. > > > > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't > > > want to use that as the demotion targets. But I don't think this is a > > > issue in practice for now, because demote-in-reclaim is disabled by > > > default. > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > memory node near node 0, > > > > > > available: 3 nodes (0-2) > > > node 0 cpus: 0 1 > > > node 0 size: n MB > > > node 0 free: n MB > > > node 1 cpus: > > > node 1 size: n MB > > > node 1 free: n MB > > > node 2 cpus: 2 3 > > > node 2 size: n MB > > > node 2 free: n MB > > > node distances: > > > node 0 1 2 > > > 0: 10 40 20 > > > 1: 40 10 80 > > > 2: 20 80 10 > > > > > > We have 2 choices, > > > > > > a) > > > node demotion targets > > > 0 1 > > > 2 1 > > > > > > b) > > > node demotion targets > > > 0 1 > > > 2 X > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > traffic. Both are OK as defualt configuration. But some users may > > > prefer the other one. So we need a user space ABI to override the > > > default configuration. > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > achieved with NUMA mempolicy (which needs to be added to demotion). > > Unfortunately, some NUMA mempolicy information isn't available at > demotion time, for example, mempolicy enforced via set_mempolicy() is > for thread. But I think that cpusets can work for demotion. 
> > > In general, we can view the demotion order in a way similar to > > allocation fallback order (after all, if we don't demote or demotion > > lags behind, the allocations will go to these demotion target nodes > > according to the allocation fallback order anyway). If we initialize > > the demotion order in that way (i.e. every node can demote to any node > > in the next tier, and the priority of the target nodes is sorted for > > each source node), we don't need per-node demotion order override from > > the userspace. What we need is to specify what nodes should be in > > each tier and support NUMA mempolicy in demotion. > > This sounds interesting. Tier sounds like a natural and general concept > for these memory types. It's attracting to use it for user space > interface too. For example, we may use that for mem_cgroup limits of a > specific memory type (tier). > > And if we take a look at the N_DEMOTION_TARGETS again from the "tier" > point of view. The nodes are divided to 2 classes via > N_DEMOTION_TARGETS. > > - The nodes without N_DEMOTION_TARGETS are top tier (or tier 0). > > - The nodes with N_DEMOTION_TARGETS are non-top tier (or tier 1, 2, 3, > ...) > Yes, this is one of the main reasons why we (Google) want this interface. > So, another possibility is to fit N_DEMOTION_TARGETS and its overriding > into "tier" concept too. !N_DEMOTION_TARGETS == TIER0. > > - All nodes start with TIER0 > > - TIER0 can be cleared for some nodes via e.g. kmem driver > > TIER0 node list can be read or overriden by the user space via the > following interface, > > /sys/devices/system/node/tier0 > > In the future, if we want to customize more tiers, we can add tier1, > tier2, tier3, ..... For now, we can add just tier0. That is, the > interface is extensible in the future compared with > .../node/demote_targets. > This more explicit tier definition interface works, too. > This isn't as flexible as the writable per-node demotion targets. But > it may be enough for most requirements? I would think so. Besides, it doesn't really conflict with the per-node demotion target interface if we really want to introduce the latter. > Best Regards, > Huang, Ying > > > Cross-socket demotion should not be too big a problem in practice > > because we can optimize the code to do the demotion from the local CPU > > node (i.e. local writes to the target node and remote read from the > > source node). The bigger issue is cross-socket memory access onto the > > demoted pages from the applications, which is why NUMA mempolicy is > > important here. > > > > > 3. For machines with HBM (High Bandwidth Memory), as in > > > > > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ > > > > > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 > > > > > > Although HBM has better performance than DDR, in ACPI SLIT, their > > > distance to CPU is longer. We need to provide a way to fix this. The > > > user space ABI is one way. The desired result will be to use local DDR > > > as demotion targets of local HBM. > > > > > > Best Regards, > > > Huang, Ying > > > > > >
On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> wrote: > > On 4/25/22 10:26 PM, Wei Xu wrote: > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > <ying.huang@intel.com> wrote: > >> > > .... > > >> 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > >> > >> Node 0 & 2 are cpu + dram nodes and node 1 are slow > >> memory node near node 0, > >> > >> available: 3 nodes (0-2) > >> node 0 cpus: 0 1 > >> node 0 size: n MB > >> node 0 free: n MB > >> node 1 cpus: > >> node 1 size: n MB > >> node 1 free: n MB > >> node 2 cpus: 2 3 > >> node 2 size: n MB > >> node 2 free: n MB > >> node distances: > >> node 0 1 2 > >> 0: 10 40 20 > >> 1: 40 10 80 > >> 2: 20 80 10 > >> > >> We have 2 choices, > >> > >> a) > >> node demotion targets > >> 0 1 > >> 2 1 > >> > >> b) > >> node demotion targets > >> 0 1 > >> 2 X > >> > >> a) is good to take advantage of PMEM. b) is good to reduce cross-socket > >> traffic. Both are OK as defualt configuration. But some users may > >> prefer the other one. So we need a user space ABI to override the > >> default configuration. > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > achieved with NUMA mempolicy (which needs to be added to demotion). > > > > In general, we can view the demotion order in a way similar to > > allocation fallback order (after all, if we don't demote or demotion > > lags behind, the allocations will go to these demotion target nodes > > according to the allocation fallback order anyway). If we initialize > > the demotion order in that way (i.e. every node can demote to any node > > in the next tier, and the priority of the target nodes is sorted for > > each source node), we don't need per-node demotion order override from > > the userspace. What we need is to specify what nodes should be in > > each tier and support NUMA mempolicy in demotion. > > > > I have been wondering how we would handle this. For ex: If an > application has specified an MPOL_BIND policy and restricted the > allocation to be from Node0 and Node1, should we demote pages allocated > by that application > to Node10? The other alternative for that demotion is swapping. So from > the page point of view, we either demote to a slow memory or pageout to > swap. But then if we demote we are also breaking the MPOL_BIND rule. IMHO, the MPOL_BIND policy should be respected and demotion should be skipped in such cases. Such MPOL_BIND policies can be an important tool for applications to override and control their memory placement when transparent memory tiering is enabled. If the application doesn't want swapping, there are other ways to achieve that (e.g. mlock, disabling swap globally, setting memcg parameters, etc). > The above says we would need some kind of mem policy interaction, but > what I am not sure about is how to find the memory policy in the > demotion path. This is indeed an important and challenging problem. One possible approach is to retrieve the allowed demotion nodemask from page_referenced() similar to vm_flags. > > > Cross-socket demotion should not be too big a problem in practice > > because we can optimize the code to do the demotion from the local CPU > > node (i.e. local writes to the target node and remote read from the > > source node). The bigger issue is cross-socket memory access onto the > > demoted pages from the applications, which is why NUMA mempolicy is > > important here. > > > > > -aneesh
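A tiny sketch of the policy check described above, assuming the allowed nodemask has already been collected (e.g. during the rmap walk); demotion_allowed() is a made-up helper, not part of the posted series:

#include <linux/nodemask.h>

/*
 * Skip demotion when an MPOL_BIND-style constraint was found and the
 * chosen target node is outside the allowed set; an empty mask means no
 * constraint was collected.
 */
static bool demotion_allowed(int target_nid, const nodemask_t *allowed)
{
	if (nodes_empty(*allowed))
		return true;
	return node_isset(target_nid, *allowed);
}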
On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote: > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V > <aneesh.kumar@linux.ibm.com> wrote: > > > > On 4/25/22 10:26 PM, Wei Xu wrote: > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > > <ying.huang@intel.com> wrote: > > > > > > > > .... > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > memory node near node 0, > > > > > > > > available: 3 nodes (0-2) > > > > node 0 cpus: 0 1 > > > > node 0 size: n MB > > > > node 0 free: n MB > > > > node 1 cpus: > > > > node 1 size: n MB > > > > node 1 free: n MB > > > > node 2 cpus: 2 3 > > > > node 2 size: n MB > > > > node 2 free: n MB > > > > node distances: > > > > node 0 1 2 > > > > 0: 10 40 20 > > > > 1: 40 10 80 > > > > 2: 20 80 10 > > > > > > > > We have 2 choices, > > > > > > > > a) > > > > node demotion targets > > > > 0 1 > > > > 2 1 > > > > > > > > b) > > > > node demotion targets > > > > 0 1 > > > > 2 X > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > traffic. Both are OK as defualt configuration. But some users may > > > > prefer the other one. So we need a user space ABI to override the > > > > default configuration. > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > > achieved with NUMA mempolicy (which needs to be added to demotion). > > > > > > In general, we can view the demotion order in a way similar to > > > allocation fallback order (after all, if we don't demote or demotion > > > lags behind, the allocations will go to these demotion target nodes > > > according to the allocation fallback order anyway). If we initialize > > > the demotion order in that way (i.e. every node can demote to any node > > > in the next tier, and the priority of the target nodes is sorted for > > > each source node), we don't need per-node demotion order override from > > > the userspace. What we need is to specify what nodes should be in > > > each tier and support NUMA mempolicy in demotion. > > > > > > > I have been wondering how we would handle this. For ex: If an > > application has specified an MPOL_BIND policy and restricted the > > allocation to be from Node0 and Node1, should we demote pages allocated > > by that application > > to Node10? The other alternative for that demotion is swapping. So from > > the page point of view, we either demote to a slow memory or pageout to > > swap. But then if we demote we are also breaking the MPOL_BIND rule. > > IMHO, the MPOL_BIND policy should be respected and demotion should be > skipped in such cases. Such MPOL_BIND policies can be an important > tool for applications to override and control their memory placement > when transparent memory tiering is enabled. If the application > doesn't want swapping, there are other ways to achieve that (e.g. > mlock, disabling swap globally, setting memcg parameters, etc). > > > > The above says we would need some kind of mem policy interaction, but > > what I am not sure about is how to find the memory policy in the > > demotion path. > > This is indeed an important and challenging problem. One possible > approach is to retrieve the allowed demotion nodemask from > page_referenced() similar to vm_flags. This works for mempolicy in struct vm_area_struct, but not for that in struct task_struct. Mutiple threads in a process may have different mempolicy. 
Best Regards, Huang, Ying > > > > > Cross-socket demotion should not be too big a problem in practice > > > because we can optimize the code to do the demotion from the local CPU > > > node (i.e. local writes to the target node and remote read from the > > > source node). The bigger issue is cross-socket memory access onto the > > > demoted pages from the applications, which is why NUMA mempolicy is > > > important here. > > > > > > > > -aneesh
On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com <ying.huang@intel.com> wrote: > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote: > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote: > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > .... > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > > memory node near node 0, > > > > > > > > > > available: 3 nodes (0-2) > > > > > node 0 cpus: 0 1 > > > > > node 0 size: n MB > > > > > node 0 free: n MB > > > > > node 1 cpus: > > > > > node 1 size: n MB > > > > > node 1 free: n MB > > > > > node 2 cpus: 2 3 > > > > > node 2 size: n MB > > > > > node 2 free: n MB > > > > > node distances: > > > > > node 0 1 2 > > > > > 0: 10 40 20 > > > > > 1: 40 10 80 > > > > > 2: 20 80 10 > > > > > > > > > > We have 2 choices, > > > > > > > > > > a) > > > > > node demotion targets > > > > > 0 1 > > > > > 2 1 > > > > > > > > > > b) > > > > > node demotion targets > > > > > 0 1 > > > > > 2 X > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > > traffic. Both are OK as defualt configuration. But some users may > > > > > prefer the other one. So we need a user space ABI to override the > > > > > default configuration. > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > > > achieved with NUMA mempolicy (which needs to be added to demotion). > > > > > > > > In general, we can view the demotion order in a way similar to > > > > allocation fallback order (after all, if we don't demote or demotion > > > > lags behind, the allocations will go to these demotion target nodes > > > > according to the allocation fallback order anyway). If we initialize > > > > the demotion order in that way (i.e. every node can demote to any node > > > > in the next tier, and the priority of the target nodes is sorted for > > > > each source node), we don't need per-node demotion order override from > > > > the userspace. What we need is to specify what nodes should be in > > > > each tier and support NUMA mempolicy in demotion. > > > > > > > > > > I have been wondering how we would handle this. For ex: If an > > > application has specified an MPOL_BIND policy and restricted the > > > allocation to be from Node0 and Node1, should we demote pages allocated > > > by that application > > > to Node10? The other alternative for that demotion is swapping. So from > > > the page point of view, we either demote to a slow memory or pageout to > > > swap. But then if we demote we are also breaking the MPOL_BIND rule. > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be > > skipped in such cases. Such MPOL_BIND policies can be an important > > tool for applications to override and control their memory placement > > when transparent memory tiering is enabled. If the application > > doesn't want swapping, there are other ways to achieve that (e.g. > > mlock, disabling swap globally, setting memcg parameters, etc). > > > > > > > The above says we would need some kind of mem policy interaction, but > > > what I am not sure about is how to find the memory policy in the > > > demotion path. > > > > This is indeed an important and challenging problem. 
One possible > > approach is to retrieve the allowed demotion nodemask from > > page_referenced() similar to vm_flags. > > This works for mempolicy in struct vm_area_struct, but not for that in > struct task_struct. Mutiple threads in a process may have different > mempolicy. From vm_area_struct, we can get to mm_struct and then to the owner task_struct, which has the process mempolicy. It is indeed a problem when a page is shared by different threads or different processes that have different thread default mempolicy values. On the other hand, it can already support most interesting use cases for demotion (e.g. selecting the demotion node, mbind to prevent demotion) by respecting cpuset and vma mempolicies. > Best Regards, > Huang, Ying > > > > > > > > Cross-socket demotion should not be too big a problem in practice > > > > because we can optimize the code to do the demotion from the local CPU > > > > node (i.e. local writes to the target node and remote read from the > > > > source node). The bigger issue is cross-socket memory access onto the > > > > demoted pages from the applications, which is why NUMA mempolicy is > > > > important here. > > > > > > > > > > > -aneesh > >
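A minimal sketch of the approach Wei describes here, collecting an "allowed demotion nodes" mask while walking a page's mappings, in the same spirit as page_referenced() accumulating vm_flags. The callback name and the way it would be hooked into the demotion path are assumptions, not code from the posted series; vma_policy() is the existing accessor for a VMA's policy, and mm->owner (CONFIG_MEMCG only) is one way to reach the process policy, with the multi-thread caveat raised above:

/*
 * Sketch only: intersect the allowed nodes with any MPOL_BIND policy
 * found on the mappings of the page being reclaimed.  Locking and
 * reference counting are ignored here; a real patch would need both.
 */
static bool demotion_allowed_one_vma(struct vm_area_struct *vma, void *arg)
{
	nodemask_t *allowed = arg;
	struct mempolicy *pol = vma_policy(vma);	/* per-VMA policy, if any */

	if (!pol && vma->vm_mm->owner)			/* fall back to the process policy */
		pol = vma->vm_mm->owner->mempolicy;

	if (pol && pol->mode == MPOL_BIND)
		nodes_and(*allowed, *allowed, pol->nodes);

	return true;	/* keep visiting the remaining mappings */
}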
On Wed, 2022-04-27 at 09:27 -0700, Wei Xu wrote: > On Wed, Apr 27, 2022 at 12:11 AM ying.huang@intel.com > <ying.huang@intel.com> wrote: > > > > On Mon, 2022-04-25 at 09:56 -0700, Wei Xu wrote: > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > > <ying.huang@intel.com> wrote: > > > > > > > > Hi, All, > > > > > > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: > > > > > > > > [snip] > > > > > > > > > I think it is necessary to either have per node demotion targets > > > > > configuration or the user space interface supported by this patch > > > > > series. As we don't have clear consensus on how the user interface > > > > > should look like, we can defer the per node demotion target set > > > > > interface to future until the real need arises. > > > > > > > > > > Current patch series sets N_DEMOTION_TARGET from dax device kmem > > > > > driver, it may be possible that some memory node desired as demotion > > > > > target is not detected in the system from dax-device kmem probe path. > > > > > > > > > > It is also possible that some of the dax-devices are not preferred as > > > > > demotion target e.g. HBM, for such devices, node shouldn't be set to > > > > > N_DEMOTION_TARGETS. In future, Support should be added to distinguish > > > > > such dax-devices and not mark them as N_DEMOTION_TARGETS from the > > > > > kernel, but for now this user space interface will be useful to avoid > > > > > such devices as demotion targets. > > > > > > > > > > We can add read only interface to view per node demotion targets > > > > > from /sys/devices/system/node/nodeX/demotion_targets, remove > > > > > duplicated /sys/kernel/mm/numa/demotion_target interface and instead > > > > > make /sys/devices/system/node/demotion_targets writable. > > > > > > > > > > Huang, Wei, Yang, > > > > > What do you suggest? > > > > > > > > We cannot remove a kernel ABI in practice. So we need to make it right > > > > at the first time. Let's try to collect some information for the kernel > > > > ABI definitation. > > > > > > > > The below is just a starting point, please add your requirements. > > > > > > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they don't > > > > want to use that as the demotion targets. But I don't think this is a > > > > issue in practice for now, because demote-in-reclaim is disabled by > > > > default. > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > memory node near node 0, > > > > > > > > available: 3 nodes (0-2) > > > > node 0 cpus: 0 1 > > > > node 0 size: n MB > > > > node 0 free: n MB > > > > node 1 cpus: > > > > node 1 size: n MB > > > > node 1 free: n MB > > > > node 2 cpus: 2 3 > > > > node 2 size: n MB > > > > node 2 free: n MB > > > > node distances: > > > > node 0 1 2 > > > > 0: 10 40 20 > > > > 1: 40 10 80 > > > > 2: 20 80 10 > > > > > > > > We have 2 choices, > > > > > > > > a) > > > > node demotion targets > > > > 0 1 > > > > 2 1 > > > > > > > > b) > > > > node demotion targets > > > > 0 1 > > > > 2 X > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > traffic. Both are OK as defualt configuration. But some users may > > > > prefer the other one. So we need a user space ABI to override the > > > > default configuration. > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > > achieved with NUMA mempolicy (which needs to be added to demotion). 
> > > > Unfortunately, some NUMA mempolicy information isn't available at > > demotion time, for example, mempolicy enforced via set_mempolicy() is > > for thread. But I think that cpusets can work for demotion. > > > > > In general, we can view the demotion order in a way similar to > > > allocation fallback order (after all, if we don't demote or demotion > > > lags behind, the allocations will go to these demotion target nodes > > > according to the allocation fallback order anyway). If we initialize > > > the demotion order in that way (i.e. every node can demote to any node > > > in the next tier, and the priority of the target nodes is sorted for > > > each source node), we don't need per-node demotion order override from > > > the userspace. What we need is to specify what nodes should be in > > > each tier and support NUMA mempolicy in demotion. > > > > This sounds interesting. Tier sounds like a natural and general concept > > for these memory types. It's attracting to use it for user space > > interface too. For example, we may use that for mem_cgroup limits of a > > specific memory type (tier). > > > > And if we take a look at the N_DEMOTION_TARGETS again from the "tier" > > point of view. The nodes are divided to 2 classes via > > N_DEMOTION_TARGETS. > > > > - The nodes without N_DEMOTION_TARGETS are top tier (or tier 0). > > > > - The nodes with N_DEMOTION_TARGETS are non-top tier (or tier 1, 2, 3, > > ...) > > > > Yes, this is one of the main reasons why we (Google) want this interface. > > > So, another possibility is to fit N_DEMOTION_TARGETS and its overriding > > into "tier" concept too. !N_DEMOTION_TARGETS == TIER0. > > > > - All nodes start with TIER0 > > > > - TIER0 can be cleared for some nodes via e.g. kmem driver > > > > TIER0 node list can be read or overriden by the user space via the > > following interface, > > > > /sys/devices/system/node/tier0 > > > > In the future, if we want to customize more tiers, we can add tier1, > > tier2, tier3, ..... For now, we can add just tier0. That is, the > > interface is extensible in the future compared with > > .../node/demote_targets. > > > > This more explicit tier definition interface works, too. > In addition to make tiering definition explicit, more importantly, this makes it much easier to support more than 2 tiers. For example, for a system with HBM (High Bandwidth Memory), CPU+DRAM, DRAM only, and PMEM, that is, 3 tiers, we can put HBM in tier 0, CPU+DRAM and DRAM only in tier 1, and PMEM in tier 2, automatically, or via user space overridding. N_DEMOTION_TARGETS isn't natural to be extended to support this. Best Regards, Huang, Ying > > This isn't as flexible as the writable per-node demotion targets. But > > it may be enough for most requirements? > > I would think so. Besides, it doesn't really conflict with the > per-node demotion target interface if we really want to introduce the > latter. > > > Best Regards, > > Huang, Ying > > > > > Cross-socket demotion should not be too big a problem in practice > > > because we can optimize the code to do the demotion from the local CPU > > > node (i.e. local writes to the target node and remote read from the > > > source node). The bigger issue is cross-socket memory access onto the > > > demoted pages from the applications, which is why NUMA mempolicy is > > > important here. > > > > > > > 3. 
For machines with HBM (High Bandwidth Memory), as in > > > > > > > > https://lore.kernel.org/lkml/39cbe02a-d309-443d-54c9-678a0799342d@gmail.com/ > > > > > > > > > [1] local DDR = 10, remote DDR = 20, local HBM = 31, remote HBM = 41 > > > > > > > > Although HBM has better performance than DDR, in ACPI SLIT, their > > > > distance to CPU is longer. We need to provide a way to fix this. The > > > > user space ABI is one way. The desired result will be to use local DDR > > > > as demotion targets of local HBM. > > > > > > > > Best Regards, > > > > Huang, Ying > > > > > > > > > >
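To make the tier-based alternative concrete, here is an illustrative sketch (not part of the posted patches) of how a node's demotion candidates could be derived purely from per-node tier numbers, so that a node in tier X may demote to any node in a tier greater than X. node_tier() is a hypothetical accessor for whichever tier a node ended up in, whether assigned by the kernel or overridden through the proposed sysfs file:

/*
 * Hypothetical helper: build the demotion candidate mask for @node from
 * tier numbers alone.  Distance-based ordering of the candidates could
 * still be applied on top of this mask.
 */
static void build_demotion_candidates(int node, nodemask_t *targets)
{
	int tier = node_tier(node);	/* hypothetical accessor */
	int n;

	nodes_clear(*targets);
	for_each_node_state(n, N_MEMORY) {
		if (node_tier(n) > tier)	/* demote "downwards" only */
			node_set(n, *targets);
	}
}

With something like this, the HBM / CPU+DRAM / PMEM machine above falls out naturally: tier 0 nodes can demote to tiers 1 and 2, tier 1 nodes to tier 2, and tier 2 nodes have no demotion target.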
On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote: > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com > <ying.huang@intel.com> wrote: > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote: > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V > > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote: > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > .... > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > > > memory node near node 0, > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > node 0 cpus: 0 1 > > > > > > node 0 size: n MB > > > > > > node 0 free: n MB > > > > > > node 1 cpus: > > > > > > node 1 size: n MB > > > > > > node 1 free: n MB > > > > > > node 2 cpus: 2 3 > > > > > > node 2 size: n MB > > > > > > node 2 free: n MB > > > > > > node distances: > > > > > > node 0 1 2 > > > > > > 0: 10 40 20 > > > > > > 1: 40 10 80 > > > > > > 2: 20 80 10 > > > > > > > > > > > > We have 2 choices, > > > > > > > > > > > > a) > > > > > > node demotion targets > > > > > > 0 1 > > > > > > 2 1 > > > > > > > > > > > > b) > > > > > > node demotion targets > > > > > > 0 1 > > > > > > 2 X > > > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > > > traffic. Both are OK as defualt configuration. But some users may > > > > > > prefer the other one. So we need a user space ABI to override the > > > > > > default configuration. > > > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > > > > achieved with NUMA mempolicy (which needs to be added to demotion). > > > > > > > > > > In general, we can view the demotion order in a way similar to > > > > > allocation fallback order (after all, if we don't demote or demotion > > > > > lags behind, the allocations will go to these demotion target nodes > > > > > according to the allocation fallback order anyway). If we initialize > > > > > the demotion order in that way (i.e. every node can demote to any node > > > > > in the next tier, and the priority of the target nodes is sorted for > > > > > each source node), we don't need per-node demotion order override from > > > > > the userspace. What we need is to specify what nodes should be in > > > > > each tier and support NUMA mempolicy in demotion. > > > > > > > > > > > > > I have been wondering how we would handle this. For ex: If an > > > > application has specified an MPOL_BIND policy and restricted the > > > > allocation to be from Node0 and Node1, should we demote pages allocated > > > > by that application > > > > to Node10? The other alternative for that demotion is swapping. So from > > > > the page point of view, we either demote to a slow memory or pageout to > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule. > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be > > > skipped in such cases. Such MPOL_BIND policies can be an important > > > tool for applications to override and control their memory placement > > > when transparent memory tiering is enabled. If the application > > > doesn't want swapping, there are other ways to achieve that (e.g. > > > mlock, disabling swap globally, setting memcg parameters, etc). 
> > > > > > > > > > The above says we would need some kind of mem policy interaction, but > > > > what I am not sure about is how to find the memory policy in the > > > > demotion path. > > > > > > This is indeed an important and challenging problem. One possible > > > approach is to retrieve the allowed demotion nodemask from > > > page_referenced() similar to vm_flags. > > > > This works for mempolicy in struct vm_area_struct, but not for that in > > struct task_struct. Mutiple threads in a process may have different > > mempolicy. > > From vm_area_struct, we can get to mm_struct and then to the owner > task_struct, which has the process mempolicy. > > It is indeed a problem when a page is shared by different threads or > different processes that have different thread default mempolicy > values. Sorry for chiming in late, this is a known issue when we were working on demotion. Yes, it is hard to handle shared pages and multiple threads since mempolicy is applied per thread, so each thread may have a different mempolicy. And I don't think this case is rare. And not only mempolicy but also cpuset settings may cause a similar problem: different threads may have different cpuset settings for cgroupv1. If this is really a problem for real-life workloads, we may consider tackling it for exclusively owned pages first. Thanks to David's patches, we now have dedicated flags to tell exclusively owned pages apart. > > On the other hand, it can already support most interesting use cases > for demotion (e.g. selecting the demotion node, mbind to prevent > demotion) by respecting cpuset and vma mempolicies. > > > Best Regards, > > Huang, Ying > > > > > > > > > > Cross-socket demotion should not be too big a problem in practice > > > > > because we can optimize the code to do the demotion from the local CPU > > > > > node (i.e. local writes to the target node and remote read from the > > > > > source node). The bigger issue is cross-socket memory access onto the > > > > > demoted pages from the applications, which is why NUMA mempolicy is > > > > > important here. > > > > > > > > > > > > > > -aneesh > > > >
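A rough sketch of the "exclusively owned pages first" idea: only consult the owning task's policy when the page is known to have a single anonymous owner. PageAnonExclusive() refers to the flag from David Hildenbrand's series, which was still in flight when this thread was written, so both the helper and the sufficiency of this check are assumptions:

/*
 * Sketch: shared or file-backed pages may be mapped under many
 * policies, so only exclusively owned anonymous pages are considered
 * safe for policy-aware demotion here.
 */
static bool demotion_can_use_owner_policy(struct page *page)
{
	if (!PageAnon(page) || page_mapcount(page) > 1)
		return false;

	return PageAnonExclusive(page);
}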
> >On Wed, 2022-04-27 at 09:27 -0700, Wei Xu wrote: >> On Wed, Apr 27, 2022 at 12:11 AM ying.huang@intel.com >> <ying.huang@intel.com> wrote: >> > >> > On Mon, 2022-04-25 at 09:56 -0700, Wei Xu wrote: >> > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com >> > > <ying.huang@intel.com> wrote: >> > > > >> > > > Hi, All, >> > > > >> > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: >> > > > >> > > > [snip] >> > > > >> > > > > I think it is necessary to either have per node demotion >> > > > > targets configuration or the user space interface supported by >> > > > > this patch series. As we don't have clear consensus on how the >> > > > > user interface should look like, we can defer the per node >> > > > > demotion target set interface to future until the real need arises. >> > > > > >> > > > > Current patch series sets N_DEMOTION_TARGET from dax device >> > > > > kmem driver, it may be possible that some memory node desired >> > > > > as demotion target is not detected in the system from dax-device >kmem probe path. >> > > > > >> > > > > It is also possible that some of the dax-devices are not >> > > > > preferred as demotion target e.g. HBM, for such devices, node >> > > > > shouldn't be set to N_DEMOTION_TARGETS. In future, Support >> > > > > should be added to distinguish such dax-devices and not mark >> > > > > them as N_DEMOTION_TARGETS from the kernel, but for now this >> > > > > user space interface will be useful to avoid such devices as demotion >targets. >> > > > > >> > > > > We can add read only interface to view per node demotion >> > > > > targets from /sys/devices/system/node/nodeX/demotion_targets, >> > > > > remove duplicated /sys/kernel/mm/numa/demotion_target >> > > > > interface and instead make >/sys/devices/system/node/demotion_targets writable. >> > > > > >> > > > > Huang, Wei, Yang, >> > > > > What do you suggest? >> > > > >> > > > We cannot remove a kernel ABI in practice. So we need to make >> > > > it right at the first time. Let's try to collect some >> > > > information for the kernel ABI definitation. >> > > > >> > > > The below is just a starting point, please add your requirements. >> > > > >> > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they >> > > > don't want to use that as the demotion targets. But I don't >> > > > think this is a issue in practice for now, because >> > > > demote-in-reclaim is disabled by default. >> > > > >> > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for >> > > > example, >> > > > >> > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node >> > > > near node 0, >> > > > >> > > > available: 3 nodes (0-2) >> > > > node 0 cpus: 0 1 >> > > > node 0 size: n MB >> > > > node 0 free: n MB >> > > > node 1 cpus: >> > > > node 1 size: n MB >> > > > node 1 free: n MB >> > > > node 2 cpus: 2 3 >> > > > node 2 size: n MB >> > > > node 2 free: n MB >> > > > node distances: >> > > > node 0 1 2 >> > > > 0: 10 40 20 >> > > > 1: 40 10 80 >> > > > 2: 20 80 10 >> > > > >> > > > We have 2 choices, >> > > > >> > > > a) >> > > > node demotion targets >> > > > 0 1 >> > > > 2 1 >> > > > >> > > > b) >> > > > node demotion targets >> > > > 0 1 >> > > > 2 X >> > > > >> > > > a) is good to take advantage of PMEM. b) is good to reduce >> > > > cross-socket traffic. Both are OK as defualt configuration. >> > > > But some users may prefer the other one. So we need a user >> > > > space ABI to override the default configuration. 
>> > > >> > > I think 2(a) should be the system-wide configuration and 2(b) can >> > > be achieved with NUMA mempolicy (which needs to be added to >demotion). >> > >> > Unfortunately, some NUMA mempolicy information isn't available at >> > demotion time, for example, mempolicy enforced via set_mempolicy() >> > is for thread. But I think that cpusets can work for demotion. >> > >> > > In general, we can view the demotion order in a way similar to >> > > allocation fallback order (after all, if we don't demote or >> > > demotion lags behind, the allocations will go to these demotion >> > > target nodes according to the allocation fallback order anyway). >> > > If we initialize the demotion order in that way (i.e. every node >> > > can demote to any node in the next tier, and the priority of the >> > > target nodes is sorted for each source node), we don't need >> > > per-node demotion order override from the userspace. What we need >> > > is to specify what nodes should be in each tier and support NUMA >mempolicy in demotion. >> > >> > This sounds interesting. Tier sounds like a natural and general >> > concept for these memory types. It's attracting to use it for user >> > space interface too. For example, we may use that for mem_cgroup >> > limits of a specific memory type (tier). >> > >> > And if we take a look at the N_DEMOTION_TARGETS again from the "tier" >> > point of view. The nodes are divided to 2 classes via >> > N_DEMOTION_TARGETS. >> > >> > - The nodes without N_DEMOTION_TARGETS are top tier (or tier 0). >> > >> > - The nodes with N_DEMOTION_TARGETS are non-top tier (or tier 1, 2, >> > 3, >> > ...) >> > >> >> Yes, this is one of the main reasons why we (Google) want this interface. >> >> > So, another possibility is to fit N_DEMOTION_TARGETS and its >> > overriding into "tier" concept too. !N_DEMOTION_TARGETS == TIER0. >> > >> > - All nodes start with TIER0 >> > >> > - TIER0 can be cleared for some nodes via e.g. kmem driver >> > >> > TIER0 node list can be read or overriden by the user space via the >> > following interface, >> > >> > /sys/devices/system/node/tier0 >> > >> > In the future, if we want to customize more tiers, we can add tier1, >> > tier2, tier3, ..... For now, we can add just tier0. That is, the >> > interface is extensible in the future compared with >> > .../node/demote_targets. >> > >> >> This more explicit tier definition interface works, too. >> > >In addition to make tiering definition explicit, more importantly, this makes it >much easier to support more than 2 tiers. For example, for a system with >HBM (High Bandwidth Memory), CPU+DRAM, DRAM only, and PMEM, that is, >3 tiers, we can put HBM in tier 0, CPU+DRAM and DRAM only in tier 1, and >PMEM in tier 2, automatically, or via user space overridding. >N_DEMOTION_TARGETS isn't natural to be extended to support this. Agree with Ying that making the tier explicit is fundamental to the rest of the API. I think that the tier organization should come before setting the demotion targets, not the other way round. That makes things clear on the demotion direction, (node in tier X demote to tier Y, X<Y). With that, explicitly specifying the demotion target or order is only needed when we truly want that level of control or a demotion order. Otherwise all the higher numbered tiers are valid targets. Configuring a tier level for each node is a lot easier than fixing up all demotion targets for each and every node. 
We can prevent demotion target configuration that goes in the wrong direction by looking at the tier level. Tim
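The direction check Tim describes could be as small as the sketch below, reusing the hypothetical node_tier() accessor from the earlier sketch: a userspace-supplied demotion target is rejected unless it sits in a strictly lower tier (a higher tier number) than the source node:

/*
 * Sketch: a node in tier X may only demote to a node in tier Y, Y > X,
 * so sideways or upwards configurations are refused.
 */
static int validate_demotion_target(int source, int target)
{
	if (node_tier(target) <= node_tier(source))
		return -EINVAL;
	return 0;
}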
On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote: > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote: > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com > > <ying.huang@intel.com> wrote: > > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote: > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V > > > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote: > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > > > > memory node near node 0, > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > node 0 cpus: 0 1 > > > > > > > node 0 size: n MB > > > > > > > node 0 free: n MB > > > > > > > node 1 cpus: > > > > > > > node 1 size: n MB > > > > > > > node 1 free: n MB > > > > > > > node 2 cpus: 2 3 > > > > > > > node 2 size: n MB > > > > > > > node 2 free: n MB > > > > > > > node distances: > > > > > > > node 0 1 2 > > > > > > > 0: 10 40 20 > > > > > > > 1: 40 10 80 > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > We have 2 choices, > > > > > > > > > > > > > > a) > > > > > > > node demotion targets > > > > > > > 0 1 > > > > > > > 2 1 > > > > > > > > > > > > > > b) > > > > > > > node demotion targets > > > > > > > 0 1 > > > > > > > 2 X > > > > > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > > > > traffic. Both are OK as defualt configuration. But some users may > > > > > > > prefer the other one. So we need a user space ABI to override the > > > > > > > default configuration. > > > > > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion). > > > > > > > > > > > > In general, we can view the demotion order in a way similar to > > > > > > allocation fallback order (after all, if we don't demote or demotion > > > > > > lags behind, the allocations will go to these demotion target nodes > > > > > > according to the allocation fallback order anyway). If we initialize > > > > > > the demotion order in that way (i.e. every node can demote to any node > > > > > > in the next tier, and the priority of the target nodes is sorted for > > > > > > each source node), we don't need per-node demotion order override from > > > > > > the userspace. What we need is to specify what nodes should be in > > > > > > each tier and support NUMA mempolicy in demotion. > > > > > > > > > > > > > > > > I have been wondering how we would handle this. For ex: If an > > > > > application has specified an MPOL_BIND policy and restricted the > > > > > allocation to be from Node0 and Node1, should we demote pages allocated > > > > > by that application > > > > > to Node10? The other alternative for that demotion is swapping. So from > > > > > the page point of view, we either demote to a slow memory or pageout to > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule. > > > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be > > > > skipped in such cases. Such MPOL_BIND policies can be an important > > > > tool for applications to override and control their memory placement > > > > when transparent memory tiering is enabled. 
If the application > > > > doesn't want swapping, there are other ways to achieve that (e.g. > > > > mlock, disabling swap globally, setting memcg parameters, etc). > > > > > > > > > > > > > The above says we would need some kind of mem policy interaction, but > > > > > what I am not sure about is how to find the memory policy in the > > > > > demotion path. > > > > > > > > This is indeed an important and challenging problem. One possible > > > > approach is to retrieve the allowed demotion nodemask from > > > > page_referenced() similar to vm_flags. > > > > > > This works for mempolicy in struct vm_area_struct, but not for that in > > > struct task_struct. Mutiple threads in a process may have different > > > mempolicy. > > > > From vm_area_struct, we can get to mm_struct and then to the owner > > task_struct, which has the process mempolicy. > > > > It is indeed a problem when a page is shared by different threads or > > different processes that have different thread default mempolicy > > values. > > Sorry for chiming in late, this is a known issue when we were working > on demotion. Yes, it is hard to handle the shared pages and multi > threads since mempolicy is applied to each thread so each thread may > have different mempolicy. And I don't think this case is rare. And not > only mempolicy but also may cpuset settings cause the similar problem, > different threads may have different cpuset settings for cgroupv1. > > If this is really a problem for real life workloads, we may consider > tackling it for exclusively owned pages first. Thanks to David's > patches, now we have dedicated flags to tell exclusively owned pages. One of the problems with demotion when I last looked is it does almost exactly the opposite of what we want on systems like POWER9 where GPU memory is a CPU-less memory node. On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate memory on the GPU node. Under memory pressure demotion should migrate GPU allocations to the CPU node and finally other slow memory nodes or swap. Currently though demotion considers the GPU node slow memory (because it is CPU-less) so will demote CPU memory to GPU memory which is a limited resource. And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap everything to disk rather than demote to CPU memory (which would be preferred). I'm still looking at this series but as I understand it it will help somewhat because we could make GPU memory the top-tier so nothing gets demoted to it. However I wouldn't want to see demotion skipped entirely when a memory policy such as MPOL_BIND is specified. For example most memory on a GPU node will have some kind of policy specified and IMHO it would be better to demote to another node in the mempolicy nodemask rather than going straight to swap, particularly as GPU memory capacity tends to be limited in comparison to CPU memory capacity. > > > > On the other hand, it can already support most interesting use cases > > for demotion (e.g. selecting the demotion node, mbind to prevent > > demotion) by respecting cpuset and vma mempolicies. > > > > > Best Regards, > > > Huang, Ying > > > > > > > > > > > > > > Cross-socket demotion should not be too big a problem in practice > > > > > > because we can optimize the code to do the demotion from the local CPU > > > > > > node (i.e. local writes to the target node and remote read from the > > > > > > source node). 
The bigger issue is cross-socket memory access onto the > > > > > > demoted pages from the applications, which is why NUMA mempolicy is > > > > > > important here. > > > > > > > > > > > > > > > > > -aneesh > > > > > > > >
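What Alistair asks for could look roughly like the sketch below: rather than skipping demotion whenever MPOL_BIND is set, intersect the candidate targets with the policy nodemask and fall back to swap only when nothing remains. demotion_targets_of() and nearest_node_in() are hypothetical helpers standing in for the real target lookup and distance ordering:

/*
 * Sketch: pick a demotion target that is also allowed by the page's
 * (MPOL_BIND) nodemask, if one was found; NUMA_NO_NODE tells the caller
 * to fall back to swap.
 */
static int pick_demotion_node(int source, const nodemask_t *policy_nodes)
{
	nodemask_t candidates;

	demotion_targets_of(source, &candidates);	/* hypothetical lookup */
	if (policy_nodes)
		nodes_and(candidates, candidates, *policy_nodes);
	if (nodes_empty(candidates))
		return NUMA_NO_NODE;

	return nearest_node_in(source, &candidates);	/* hypothetical ordering */
}

On the POWER9 example this would let GPU pages bound to {GPU, CPU} be demoted to the CPU node instead of going straight to swap.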
On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote: > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote: > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote: > > > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com > > > <ying.huang@intel.com> wrote: > > > > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote: > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V > > > > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote: > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > > > > > memory node near node 0, > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > node 0 cpus: 0 1 > > > > > > > > node 0 size: n MB > > > > > > > > node 0 free: n MB > > > > > > > > node 1 cpus: > > > > > > > > node 1 size: n MB > > > > > > > > node 1 free: n MB > > > > > > > > node 2 cpus: 2 3 > > > > > > > > node 2 size: n MB > > > > > > > > node 2 free: n MB > > > > > > > > node distances: > > > > > > > > node 0 1 2 > > > > > > > > 0: 10 40 20 > > > > > > > > 1: 40 10 80 > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > We have 2 choices, > > > > > > > > > > > > > > > > a) > > > > > > > > node demotion targets > > > > > > > > 0 1 > > > > > > > > 2 1 > > > > > > > > > > > > > > > > b) > > > > > > > > node demotion targets > > > > > > > > 0 1 > > > > > > > > 2 X > > > > > > > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > > > > > traffic. Both are OK as defualt configuration. But some users may > > > > > > > > prefer the other one. So we need a user space ABI to override the > > > > > > > > default configuration. > > > > > > > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion). > > > > > > > > > > > > > > In general, we can view the demotion order in a way similar to > > > > > > > allocation fallback order (after all, if we don't demote or demotion > > > > > > > lags behind, the allocations will go to these demotion target nodes > > > > > > > according to the allocation fallback order anyway). If we initialize > > > > > > > the demotion order in that way (i.e. every node can demote to any node > > > > > > > in the next tier, and the priority of the target nodes is sorted for > > > > > > > each source node), we don't need per-node demotion order override from > > > > > > > the userspace. What we need is to specify what nodes should be in > > > > > > > each tier and support NUMA mempolicy in demotion. > > > > > > > > > > > > > > > > > > > I have been wondering how we would handle this. For ex: If an > > > > > > application has specified an MPOL_BIND policy and restricted the > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated > > > > > > by that application > > > > > > to Node10? The other alternative for that demotion is swapping. So from > > > > > > the page point of view, we either demote to a slow memory or pageout to > > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule. 
> > > > > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be > > > > > skipped in such cases. Such MPOL_BIND policies can be an important > > > > > tool for applications to override and control their memory placement > > > > > when transparent memory tiering is enabled. If the application > > > > > doesn't want swapping, there are other ways to achieve that (e.g. > > > > > mlock, disabling swap globally, setting memcg parameters, etc). > > > > > > > > > > > > > > > > The above says we would need some kind of mem policy interaction, but > > > > > > what I am not sure about is how to find the memory policy in the > > > > > > demotion path. > > > > > > > > > > This is indeed an important and challenging problem. One possible > > > > > approach is to retrieve the allowed demotion nodemask from > > > > > page_referenced() similar to vm_flags. > > > > > > > > This works for mempolicy in struct vm_area_struct, but not for that in > > > > struct task_struct. Mutiple threads in a process may have different > > > > mempolicy. > > > > > > From vm_area_struct, we can get to mm_struct and then to the owner > > > task_struct, which has the process mempolicy. > > > > > > It is indeed a problem when a page is shared by different threads or > > > different processes that have different thread default mempolicy > > > values. > > > > Sorry for chiming in late, this is a known issue when we were working > > on demotion. Yes, it is hard to handle the shared pages and multi > > threads since mempolicy is applied to each thread so each thread may > > have different mempolicy. And I don't think this case is rare. And not > > only mempolicy but also may cpuset settings cause the similar problem, > > different threads may have different cpuset settings for cgroupv1. > > > > If this is really a problem for real life workloads, we may consider > > tackling it for exclusively owned pages first. Thanks to David's > > patches, now we have dedicated flags to tell exclusively owned pages. > > One of the problems with demotion when I last looked is it does almost exactly > the opposite of what we want on systems like POWER9 where GPU memory is a > CPU-less memory node. > > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate > memory on the GPU node. Under memory pressure demotion should migrate GPU > allocations to the CPU node and finally other slow memory nodes or swap. > > Currently though demotion considers the GPU node slow memory (because it is > CPU-less) so will demote CPU memory to GPU memory which is a limited resource. > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap > everything to disk rather than demote to CPU memory (which would be preferred). > > I'm still looking at this series but as I understand it it will help somewhat > because we could make GPU memory the top-tier so nothing gets demoted to it. Yes. If we have a way to put GPU memory in top-tier (tier 0) and CPU+DRAM in tier 1. Your requirement can be satisfied. One way is to override the auto-generated demotion order via some user space tool. Another way is to change the GPU driver (I guess where the GPU memory is enumerated and onlined?) to change the tier of GPU memory node. > However I wouldn't want to see demotion skipped entirely when a memory policy > such as MPOL_BIND is specified. 
For example most memory on a GPU node will have > some kind of policy specified and IMHO it would be better to demote to another > node in the mempolicy nodemask rather than going straight to swap, particularly > as GPU memory capacity tends to be limited in comparison to CPU memory > capacity. > > Can you use MPOL_PREFERRED? Even if we enforce MPOL_BIND as much as possible, we will not stop demoting from GPU to DRAM with MPOL_PREFERRED. And in addition to demotion, allocation fallbacking can be used too to avoid allocation latency caused by demotion. This is another example of a system with 3 tiers if PMEM is installed in this machine too. Best Regards, Huang, Ying > > > On the other hand, it can already support most interesting use cases > > > for demotion (e.g. selecting the demotion node, mbind to prevent > > > demotion) by respecting cpuset and vma mempolicies. > > > > > > > Best Regards, > > > > Huang, Ying > > > > > > > > > > > > > > > > > Cross-socket demotion should not be too big a problem in practice > > > > > > > because we can optimize the code to do the demotion from the local CPU > > > > > > > node (i.e. local writes to the target node and remote read from the > > > > > > > source node). The bigger issue is cross-socket memory access onto the > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is > > > > > > > important here. > > > > > > > > > > > > > > > > > > > > -aneesh > > > > > > > > > > > > > > > >
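As a userspace illustration of the MPOL_PREFERRED suggestion (the GPU node id is a placeholder; link with -lnuma): allocations prefer the GPU node, but the kernel remains free to fall back to, or later demote into, other nodes:

#include <numaif.h>
#include <stdio.h>

#define GPU_NODE 1	/* placeholder node id for the CPU-less GPU node */

int main(void)
{
	unsigned long nodemask = 1UL << GPU_NODE;

	/* prefer GPU_NODE; no hard bind, so fallback and demotion stay possible */
	if (set_mempolicy(MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8))
		perror("set_mempolicy");
	return 0;
}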
On Thu, Apr 28, 2022 at 7:21 PM ying.huang@intel.com <ying.huang@intel.com> wrote: > > On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote: > > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote: > > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote: > > > > > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote: > > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V > > > > > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote: > > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > > > > > > memory node near node 0, > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > node 0 size: n MB > > > > > > > > > node 0 free: n MB > > > > > > > > > node 1 cpus: > > > > > > > > > node 1 size: n MB > > > > > > > > > node 1 free: n MB > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > node 2 size: n MB > > > > > > > > > node 2 free: n MB > > > > > > > > > node distances: > > > > > > > > > node 0 1 2 > > > > > > > > > 0: 10 40 20 > > > > > > > > > 1: 40 10 80 > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > We have 2 choices, > > > > > > > > > > > > > > > > > > a) > > > > > > > > > node demotion targets > > > > > > > > > 0 1 > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > b) > > > > > > > > > node demotion targets > > > > > > > > > 0 1 > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > > > > > > traffic. Both are OK as defualt configuration. But some users may > > > > > > > > > prefer the other one. So we need a user space ABI to override the > > > > > > > > > default configuration. > > > > > > > > > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion). > > > > > > > > > > > > > > > > In general, we can view the demotion order in a way similar to > > > > > > > > allocation fallback order (after all, if we don't demote or demotion > > > > > > > > lags behind, the allocations will go to these demotion target nodes > > > > > > > > according to the allocation fallback order anyway). If we initialize > > > > > > > > the demotion order in that way (i.e. every node can demote to any node > > > > > > > > in the next tier, and the priority of the target nodes is sorted for > > > > > > > > each source node), we don't need per-node demotion order override from > > > > > > > > the userspace. What we need is to specify what nodes should be in > > > > > > > > each tier and support NUMA mempolicy in demotion. > > > > > > > > > > > > > > > > > > > > > > I have been wondering how we would handle this. For ex: If an > > > > > > > application has specified an MPOL_BIND policy and restricted the > > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated > > > > > > > by that application > > > > > > > to Node10? The other alternative for that demotion is swapping. 
So from > > > > > > > the page point of view, we either demote to a slow memory or pageout to > > > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule. > > > > > > > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be > > > > > > skipped in such cases. Such MPOL_BIND policies can be an important > > > > > > tool for applications to override and control their memory placement > > > > > > when transparent memory tiering is enabled. If the application > > > > > > doesn't want swapping, there are other ways to achieve that (e.g. > > > > > > mlock, disabling swap globally, setting memcg parameters, etc). > > > > > > > > > > > > > > > > > > > The above says we would need some kind of mem policy interaction, but > > > > > > > what I am not sure about is how to find the memory policy in the > > > > > > > demotion path. > > > > > > > > > > > > This is indeed an important and challenging problem. One possible > > > > > > approach is to retrieve the allowed demotion nodemask from > > > > > > page_referenced() similar to vm_flags. > > > > > > > > > > This works for mempolicy in struct vm_area_struct, but not for that in > > > > > struct task_struct. Mutiple threads in a process may have different > > > > > mempolicy. > > > > > > > > From vm_area_struct, we can get to mm_struct and then to the owner > > > > task_struct, which has the process mempolicy. > > > > > > > > It is indeed a problem when a page is shared by different threads or > > > > different processes that have different thread default mempolicy > > > > values. > > > > > > Sorry for chiming in late, this is a known issue when we were working > > > on demotion. Yes, it is hard to handle the shared pages and multi > > > threads since mempolicy is applied to each thread so each thread may > > > have different mempolicy. And I don't think this case is rare. And not > > > only mempolicy but also may cpuset settings cause the similar problem, > > > different threads may have different cpuset settings for cgroupv1. > > > > > > If this is really a problem for real life workloads, we may consider > > > tackling it for exclusively owned pages first. Thanks to David's > > > patches, now we have dedicated flags to tell exclusively owned pages. > > > > One of the problems with demotion when I last looked is it does almost exactly > > the opposite of what we want on systems like POWER9 where GPU memory is a > > CPU-less memory node. > > > > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate > > memory on the GPU node. Under memory pressure demotion should migrate GPU > > allocations to the CPU node and finally other slow memory nodes or swap. > > > > Currently though demotion considers the GPU node slow memory (because it is > > CPU-less) so will demote CPU memory to GPU memory which is a limited resource. > > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap > > everything to disk rather than demote to CPU memory (which would be preferred). > > > > I'm still looking at this series but as I understand it it will help somewhat > > because we could make GPU memory the top-tier so nothing gets demoted to it. > > Yes. If we have a way to put GPU memory in top-tier (tier 0) and > CPU+DRAM in tier 1. Your requirement can be satisfied. One way is to > override the auto-generated demotion order via some user space tool. > Another way is to change the GPU driver (I guess where the GPU memory is > enumerated and onlined?) to change the tier of GPU memory node. 
> > > However I wouldn't want to see demotion skipped entirely when a memory policy > > such as MPOL_BIND is specified. For example most memory on a GPU node will have > > some kind of policy specified and IMHO it would be better to demote to another > > node in the mempolicy nodemask rather than going straight to swap, particularly > > as GPU memory capacity tends to be limited in comparison to CPU memory > > capacity. > > > > > Can you use MPOL_PREFERRED? Even if we enforce MPOL_BIND as much as > possible, we will not stop demoting from GPU to DRAM with > MPOL_PREFERRED. And in addition to demotion, allocation fallbacking can > be used too to avoid allocation latency caused by demotion. I expect that MPOL_BIND can be used to either prevent demotion or select a particular demotion node/nodemask. It all depends on the mempolicy nodemask specified by MPOL_BIND. > This is another example of a system with 3 tiers if PMEM is installed in > this machine too. > > Best Regards, > Huang, Ying > > > > > On the other hand, it can already support most interesting use cases > > > > for demotion (e.g. selecting the demotion node, mbind to prevent > > > > demotion) by respecting cpuset and vma mempolicies. > > > > > > > > > Best Regards, > > > > > Huang, Ying > > > > > > > > > > > > > > > > > > > > Cross-socket demotion should not be too big a problem in practice > > > > > > > > because we can optimize the code to do the demotion from the local CPU > > > > > > > > node (i.e. local writes to the target node and remote read from the > > > > > > > > source node). The bigger issue is cross-socket memory access onto the > > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is > > > > > > > > important here. > > > > > > > > > > > > > > > > > > > > > > > -aneesh > > > > > > > > > > > > > > > > > > > > > > > > > >
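Wei's point can be shown with mbind() on the mapping itself (node ids are placeholders, link with -lnuma, and this assumes demotion learns to honour the MPOL_BIND nodemask as discussed in this thread): binding to {GPU, DRAM} confines demotion to the local DRAM node, while binding to the GPU node alone leaves no target and prevents demotion:

#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>

#define DRAM_NODE 0	/* placeholder node ids */
#define GPU_NODE  1

int main(void)
{
	size_t len = 1UL << 21;
	void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	/* {GPU, DRAM}: demotion may pick DRAM_NODE; {GPU} alone would forbid it */
	unsigned long nodes = (1UL << GPU_NODE) | (1UL << DRAM_NODE);

	if (buf == MAP_FAILED)
		return 1;
	if (mbind(buf, len, MPOL_BIND, &nodes, sizeof(nodes) * 8, 0))
		perror("mbind");
	return 0;
}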
On Thu, 2022-04-28 at 19:58 -0700, Wei Xu wrote: > On Thu, Apr 28, 2022 at 7:21 PM ying.huang@intel.com > <ying.huang@intel.com> wrote: > > > > On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote: > > > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote: > > > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote: > > > > > > > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote: > > > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V > > > > > > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > > > > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote: > > > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > > > > > > > memory node near node 0, > > > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > > node 0 size: n MB > > > > > > > > > > node 0 free: n MB > > > > > > > > > > node 1 cpus: > > > > > > > > > > node 1 size: n MB > > > > > > > > > > node 1 free: n MB > > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > > node 2 size: n MB > > > > > > > > > > node 2 free: n MB > > > > > > > > > > node distances: > > > > > > > > > > node 0 1 2 > > > > > > > > > > 0: 10 40 20 > > > > > > > > > > 1: 40 10 80 > > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > > > We have 2 choices, > > > > > > > > > > > > > > > > > > > > a) > > > > > > > > > > node demotion targets > > > > > > > > > > 0 1 > > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > > > b) > > > > > > > > > > node demotion targets > > > > > > > > > > 0 1 > > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > > > > > > > traffic. Both are OK as defualt configuration. But some users may > > > > > > > > > > prefer the other one. So we need a user space ABI to override the > > > > > > > > > > default configuration. > > > > > > > > > > > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion). > > > > > > > > > > > > > > > > > > In general, we can view the demotion order in a way similar to > > > > > > > > > allocation fallback order (after all, if we don't demote or demotion > > > > > > > > > lags behind, the allocations will go to these demotion target nodes > > > > > > > > > according to the allocation fallback order anyway). If we initialize > > > > > > > > > the demotion order in that way (i.e. every node can demote to any node > > > > > > > > > in the next tier, and the priority of the target nodes is sorted for > > > > > > > > > each source node), we don't need per-node demotion order override from > > > > > > > > > the userspace. What we need is to specify what nodes should be in > > > > > > > > > each tier and support NUMA mempolicy in demotion. > > > > > > > > > > > > > > > > > > > > > > > > > I have been wondering how we would handle this. 
For ex: If an > > > > > > > > application has specified an MPOL_BIND policy and restricted the > > > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated > > > > > > > > by that application > > > > > > > > to Node10? The other alternative for that demotion is swapping. So from > > > > > > > > the page point of view, we either demote to a slow memory or pageout to > > > > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule. > > > > > > > > > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be > > > > > > > skipped in such cases. Such MPOL_BIND policies can be an important > > > > > > > tool for applications to override and control their memory placement > > > > > > > when transparent memory tiering is enabled. If the application > > > > > > > doesn't want swapping, there are other ways to achieve that (e.g. > > > > > > > mlock, disabling swap globally, setting memcg parameters, etc). > > > > > > > > > > > > > > > > > > > > > > The above says we would need some kind of mem policy interaction, but > > > > > > > > what I am not sure about is how to find the memory policy in the > > > > > > > > demotion path. > > > > > > > > > > > > > > This is indeed an important and challenging problem. One possible > > > > > > > approach is to retrieve the allowed demotion nodemask from > > > > > > > page_referenced() similar to vm_flags. > > > > > > > > > > > > This works for mempolicy in struct vm_area_struct, but not for that in > > > > > > struct task_struct. Mutiple threads in a process may have different > > > > > > mempolicy. > > > > > > > > > > From vm_area_struct, we can get to mm_struct and then to the owner > > > > > task_struct, which has the process mempolicy. > > > > > > > > > > It is indeed a problem when a page is shared by different threads or > > > > > different processes that have different thread default mempolicy > > > > > values. > > > > > > > > Sorry for chiming in late, this is a known issue when we were working > > > > on demotion. Yes, it is hard to handle the shared pages and multi > > > > threads since mempolicy is applied to each thread so each thread may > > > > have different mempolicy. And I don't think this case is rare. And not > > > > only mempolicy but also may cpuset settings cause the similar problem, > > > > different threads may have different cpuset settings for cgroupv1. > > > > > > > > If this is really a problem for real life workloads, we may consider > > > > tackling it for exclusively owned pages first. Thanks to David's > > > > patches, now we have dedicated flags to tell exclusively owned pages. > > > > > > One of the problems with demotion when I last looked is it does almost exactly > > > the opposite of what we want on systems like POWER9 where GPU memory is a > > > CPU-less memory node. > > > > > > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate > > > memory on the GPU node. Under memory pressure demotion should migrate GPU > > > allocations to the CPU node and finally other slow memory nodes or swap. > > > > > > Currently though demotion considers the GPU node slow memory (because it is > > > CPU-less) so will demote CPU memory to GPU memory which is a limited resource. > > > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap > > > everything to disk rather than demote to CPU memory (which would be preferred). 
> > > > > > I'm still looking at this series but as I understand it it will help somewhat > > > because we could make GPU memory the top-tier so nothing gets demoted to it. > > > > Yes. If we have a way to put GPU memory in top-tier (tier 0) and > > CPU+DRAM in tier 1. Your requirement can be satisfied. One way is to > > override the auto-generated demotion order via some user space tool. > > Another way is to change the GPU driver (I guess where the GPU memory is > > enumerated and onlined?) to change the tier of GPU memory node. > > > > > However I wouldn't want to see demotion skipped entirely when a memory policy > > > such as MPOL_BIND is specified. For example most memory on a GPU node will have > > > some kind of policy specified and IMHO it would be better to demote to another > > > node in the mempolicy nodemask rather than going straight to swap, particularly > > > as GPU memory capacity tends to be limited in comparison to CPU memory > > > capacity. > > > > > > > > Can you use MPOL_PREFERRED? Even if we enforce MPOL_BIND as much as > > possible, we will not stop demoting from GPU to DRAM with > > MPOL_PREFERRED. And in addition to demotion, allocation fallbacking can > > be used too to avoid allocation latency caused by demotion. > > I expect that MPOL_BIND can be used to either prevent demotion or > select a particular demotion node/nodemask. It all depends on the > mempolicy nodemask specified by MPOL_BIND. Yes. I think so too. Best Regards, Huang, Ying > > This is another example of a system with 3 tiers if PMEM is installed in > > this machine too. > > > > Best Regards, > > Huang, Ying > > > > > > > On the other hand, it can already support most interesting use cases > > > > > for demotion (e.g. selecting the demotion node, mbind to prevent > > > > > demotion) by respecting cpuset and vma mempolicies. > > > > > > > > > > > Best Regards, > > > > > > Huang, Ying > > > > > > > > > > > > > > > > > > > > > > > Cross-socket demotion should not be too big a problem in practice > > > > > > > > > because we can optimize the code to do the demotion from the local CPU > > > > > > > > > node (i.e. local writes to the target node and remote read from the > > > > > > > > > source node). The bigger issue is cross-socket memory access onto the > > > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is > > > > > > > > > important here. > > > > > > > > > > > > > > > > > > > > > > > > > > -aneesh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
On Friday, 29 April 2022 1:27:36 PM AEST ying.huang@intel.com wrote: > On Thu, 2022-04-28 at 19:58 -0700, Wei Xu wrote: > > On Thu, Apr 28, 2022 at 7:21 PM ying.huang@intel.com > > <ying.huang@intel.com> wrote: > > > > > > On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote: > > > > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote: > > > > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote: > > > > > > > > > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote: > > > > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V > > > > > > > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > > > > > > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote: > > > > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > > > > > > > > memory node near node 0, > > > > > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > > > node 0 size: n MB > > > > > > > > > > > node 0 free: n MB > > > > > > > > > > > node 1 cpus: > > > > > > > > > > > node 1 size: n MB > > > > > > > > > > > node 1 free: n MB > > > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > > > node 2 size: n MB > > > > > > > > > > > node 2 free: n MB > > > > > > > > > > > node distances: > > > > > > > > > > > node 0 1 2 > > > > > > > > > > > 0: 10 40 20 > > > > > > > > > > > 1: 40 10 80 > > > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > > > > > We have 2 choices, > > > > > > > > > > > > > > > > > > > > > > a) > > > > > > > > > > > node demotion targets > > > > > > > > > > > 0 1 > > > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > > > > > b) > > > > > > > > > > > node demotion targets > > > > > > > > > > > 0 1 > > > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > > > > > > > > traffic. Both are OK as defualt configuration. But some users may > > > > > > > > > > > prefer the other one. So we need a user space ABI to override the > > > > > > > > > > > default configuration. > > > > > > > > > > > > > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > > > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion). > > > > > > > > > > > > > > > > > > > > In general, we can view the demotion order in a way similar to > > > > > > > > > > allocation fallback order (after all, if we don't demote or demotion > > > > > > > > > > lags behind, the allocations will go to these demotion target nodes > > > > > > > > > > according to the allocation fallback order anyway). If we initialize > > > > > > > > > > the demotion order in that way (i.e. every node can demote to any node > > > > > > > > > > in the next tier, and the priority of the target nodes is sorted for > > > > > > > > > > each source node), we don't need per-node demotion order override from > > > > > > > > > > the userspace. What we need is to specify what nodes should be in > > > > > > > > > > each tier and support NUMA mempolicy in demotion. 
> > > > > > > > > > > > > > > > > > > > > > > > > > > > I have been wondering how we would handle this. For ex: If an > > > > > > > > > application has specified an MPOL_BIND policy and restricted the > > > > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated > > > > > > > > > by that application > > > > > > > > > to Node10? The other alternative for that demotion is swapping. So from > > > > > > > > > the page point of view, we either demote to a slow memory or pageout to > > > > > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule. > > > > > > > > > > > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be > > > > > > > > skipped in such cases. Such MPOL_BIND policies can be an important > > > > > > > > tool for applications to override and control their memory placement > > > > > > > > when transparent memory tiering is enabled. If the application > > > > > > > > doesn't want swapping, there are other ways to achieve that (e.g. > > > > > > > > mlock, disabling swap globally, setting memcg parameters, etc). > > > > > > > > > > > > > > > > > > > > > > > > > The above says we would need some kind of mem policy interaction, but > > > > > > > > > what I am not sure about is how to find the memory policy in the > > > > > > > > > demotion path. > > > > > > > > > > > > > > > > This is indeed an important and challenging problem. One possible > > > > > > > > approach is to retrieve the allowed demotion nodemask from > > > > > > > > page_referenced() similar to vm_flags. > > > > > > > > > > > > > > This works for mempolicy in struct vm_area_struct, but not for that in > > > > > > > struct task_struct. Mutiple threads in a process may have different > > > > > > > mempolicy. > > > > > > > > > > > > From vm_area_struct, we can get to mm_struct and then to the owner > > > > > > task_struct, which has the process mempolicy. > > > > > > > > > > > > It is indeed a problem when a page is shared by different threads or > > > > > > different processes that have different thread default mempolicy > > > > > > values. > > > > > > > > > > Sorry for chiming in late, this is a known issue when we were working > > > > > on demotion. Yes, it is hard to handle the shared pages and multi > > > > > threads since mempolicy is applied to each thread so each thread may > > > > > have different mempolicy. And I don't think this case is rare. And not > > > > > only mempolicy but also may cpuset settings cause the similar problem, > > > > > different threads may have different cpuset settings for cgroupv1. > > > > > > > > > > If this is really a problem for real life workloads, we may consider > > > > > tackling it for exclusively owned pages first. Thanks to David's > > > > > patches, now we have dedicated flags to tell exclusively owned pages. > > > > > > > > One of the problems with demotion when I last looked is it does almost exactly > > > > the opposite of what we want on systems like POWER9 where GPU memory is a > > > > CPU-less memory node. > > > > > > > > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate > > > > memory on the GPU node. Under memory pressure demotion should migrate GPU > > > > allocations to the CPU node and finally other slow memory nodes or swap. > > > > > > > > Currently though demotion considers the GPU node slow memory (because it is > > > > CPU-less) so will demote CPU memory to GPU memory which is a limited resource. 
> > > > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap > > > > everything to disk rather than demote to CPU memory (which would be preferred). > > > > > > > > I'm still looking at this series but as I understand it it will help somewhat > > > > because we could make GPU memory the top-tier so nothing gets demoted to it. > > > > > > Yes. If we have a way to put GPU memory in top-tier (tier 0) and > > > CPU+DRAM in tier 1. Your requirement can be satisfied. One way is to > > > override the auto-generated demotion order via some user space tool. > > > Another way is to change the GPU driver (I guess where the GPU memory is > > > enumerated and onlined?) to change the tier of GPU memory node. Yes, although I think in this case it would be firmware that determines memory tiers (similar to ACPI HMAT which I saw discussed somewhere here). I agree it's a system level property though that in an ideal world shouldn't need overriding from userspace. However being able to override it with a user space tool could be useful. > > > > However I wouldn't want to see demotion skipped entirely when a memory policy > > > > such as MPOL_BIND is specified. For example most memory on a GPU node will have > > > > some kind of policy specified and IMHO it would be better to demote to another > > > > node in the mempolicy nodemask rather than going straight to swap, particularly > > > > as GPU memory capacity tends to be limited in comparison to CPU memory > > > > capacity. > > > > > > > > > > > Can you use MPOL_PREFERRED? Even if we enforce MPOL_BIND as much as > > > possible, we will not stop demoting from GPU to DRAM with > > > MPOL_PREFERRED. And in addition to demotion, allocation fallbacking can > > > be used too to avoid allocation latency caused by demotion. I think so. It's been a little while since I last looked at this but I was under the impression MPOL_PREFERRED didn't do direct reclaim (and therefore wouldn't trigger demotion so once GPU memory was full became effectively a no-op). However looking at the source I don't think that's the case now - if I'm understanding correctly MPOL_PREFERRED will do reclaim/demotion. The other problem with MPOL_PREFERRED is it doesn't allow the fallback nodes to be specified. I was hoping the new MPOL_PREFERRED_MANY and set_mempolicy_home_node() would help here but currently that does disable reclaim (and therefore demotion) in the first pass. However that problem is tangential to this series and I can look at that separately. My main aim here given you were looking at requirements was just to raise this as a slightly different use case (one where the CPU isn't the top tier). Thanks for looking into all this. - Alistair > > I expect that MPOL_BIND can be used to either prevent demotion or > > select a particular demotion node/nodemask. It all depends on the > > mempolicy nodemask specified by MPOL_BIND. > > Yes. I think so too. > > Best Regards, > Huang, Ying > > > > This is another example of a system with 3 tiers if PMEM is installed in > > > this machine too. > > > > > > Best Regards, > > > Huang, Ying > > > > > > > > > On the other hand, it can already support most interesting use cases > > > > > > for demotion (e.g. selecting the demotion node, mbind to prevent > > > > > > demotion) by respecting cpuset and vma mempolicies. 
> > > > > > > > > > > > > Best Regards, > > > > > > > Huang, Ying > > > > > > > > > > > > > > > > > > > > > > > > > > Cross-socket demotion should not be too big a problem in practice > > > > > > > > > > because we can optimize the code to do the demotion from the local CPU > > > > > > > > > > node (i.e. local writes to the target node and remote read from the > > > > > > > > > > source node). The bigger issue is cross-socket memory access onto the > > > > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is > > > > > > > > > > important here. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -aneesh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
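A rough sketch of the MPOL_PREFERRED_MANY + set_mempolicy_home_node() combination mentioned above, as it might look from userspace. This is an assumption-laden illustration only: the node numbers are made up, the MPOL_PREFERRED_MANY and syscall-number fallbacks are needed only when the installed headers predate Linux 5.15/5.17 (450 is the x86-64 syscall number), and whether the first allocation pass does reclaim/demotion is exactly the open question raised above. Build with -lnuma:

/*
 * Sketch: prefer node 2 (e.g. a GPU/top-tier node) for a mapping while
 * keeping nodes 0-1 as the fallback set.  Node numbers are hypothetical.
 */
#include <numaif.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>

#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5             /* added in Linux 5.15 */
#endif
#ifndef __NR_set_mempolicy_home_node
#define __NR_set_mempolicy_home_node 450  /* x86-64, added in Linux 5.17 */
#endif

int main(void)
{
        size_t len = 64UL << 20;
        /* Preferred set: nodes 0, 1 and 2. */
        unsigned long nodemask = (1UL << 0) | (1UL << 1) | (1UL << 2);
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        if (buf == MAP_FAILED)
                return 1;
        if (mbind(buf, len, MPOL_PREFERRED_MANY, &nodemask,
                  sizeof(nodemask) * 8, 0))
                perror("mbind");
        /*
         * Ask for node 2 to be tried first within the preferred set;
         * nodes 0-1 remain the fallback targets.
         */
        if (syscall(__NR_set_mempolicy_home_node, (unsigned long)buf, len,
                    2UL, 0UL))
                perror("set_mempolicy_home_node");
        return 0;
}

The mbind() comes first because the home node is only honoured for ranges that already carry an MPOL_BIND or MPOL_PREFERRED_MANY policy.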
On Thu, Apr 28, 2022 at 7:59 PM Wei Xu <weixugc@google.com> wrote: > > On Thu, Apr 28, 2022 at 7:21 PM ying.huang@intel.com > <ying.huang@intel.com> wrote: > > > > On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote: > > > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote: > > > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote: > > > > > > > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote: > > > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V > > > > > > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > > > > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote: > > > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > > > > > > > memory node near node 0, > > > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > > node 0 size: n MB > > > > > > > > > > node 0 free: n MB > > > > > > > > > > node 1 cpus: > > > > > > > > > > node 1 size: n MB > > > > > > > > > > node 1 free: n MB > > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > > node 2 size: n MB > > > > > > > > > > node 2 free: n MB > > > > > > > > > > node distances: > > > > > > > > > > node 0 1 2 > > > > > > > > > > 0: 10 40 20 > > > > > > > > > > 1: 40 10 80 > > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > > > We have 2 choices, > > > > > > > > > > > > > > > > > > > > a) > > > > > > > > > > node demotion targets > > > > > > > > > > 0 1 > > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > > > b) > > > > > > > > > > node demotion targets > > > > > > > > > > 0 1 > > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > > > > > > > traffic. Both are OK as defualt configuration. But some users may > > > > > > > > > > prefer the other one. So we need a user space ABI to override the > > > > > > > > > > default configuration. > > > > > > > > > > > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion). > > > > > > > > > > > > > > > > > > In general, we can view the demotion order in a way similar to > > > > > > > > > allocation fallback order (after all, if we don't demote or demotion > > > > > > > > > lags behind, the allocations will go to these demotion target nodes > > > > > > > > > according to the allocation fallback order anyway). If we initialize > > > > > > > > > the demotion order in that way (i.e. every node can demote to any node > > > > > > > > > in the next tier, and the priority of the target nodes is sorted for > > > > > > > > > each source node), we don't need per-node demotion order override from > > > > > > > > > the userspace. What we need is to specify what nodes should be in > > > > > > > > > each tier and support NUMA mempolicy in demotion. > > > > > > > > > > > > > > > > > > > > > > > > > I have been wondering how we would handle this. 
For ex: If an > > > > > > > > application has specified an MPOL_BIND policy and restricted the > > > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated > > > > > > > > by that application > > > > > > > > to Node10? The other alternative for that demotion is swapping. So from > > > > > > > > the page point of view, we either demote to a slow memory or pageout to > > > > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule. > > > > > > > > > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be > > > > > > > skipped in such cases. Such MPOL_BIND policies can be an important > > > > > > > tool for applications to override and control their memory placement > > > > > > > when transparent memory tiering is enabled. If the application > > > > > > > doesn't want swapping, there are other ways to achieve that (e.g. > > > > > > > mlock, disabling swap globally, setting memcg parameters, etc). > > > > > > > > > > > > > > > > > > > > > > The above says we would need some kind of mem policy interaction, but > > > > > > > > what I am not sure about is how to find the memory policy in the > > > > > > > > demotion path. > > > > > > > > > > > > > > This is indeed an important and challenging problem. One possible > > > > > > > approach is to retrieve the allowed demotion nodemask from > > > > > > > page_referenced() similar to vm_flags. > > > > > > > > > > > > This works for mempolicy in struct vm_area_struct, but not for that in > > > > > > struct task_struct. Mutiple threads in a process may have different > > > > > > mempolicy. > > > > > > > > > > From vm_area_struct, we can get to mm_struct and then to the owner > > > > > task_struct, which has the process mempolicy. > > > > > > > > > > It is indeed a problem when a page is shared by different threads or > > > > > different processes that have different thread default mempolicy > > > > > values. > > > > > > > > Sorry for chiming in late, this is a known issue when we were working > > > > on demotion. Yes, it is hard to handle the shared pages and multi > > > > threads since mempolicy is applied to each thread so each thread may > > > > have different mempolicy. And I don't think this case is rare. And not > > > > only mempolicy but also may cpuset settings cause the similar problem, > > > > different threads may have different cpuset settings for cgroupv1. > > > > > > > > If this is really a problem for real life workloads, we may consider > > > > tackling it for exclusively owned pages first. Thanks to David's > > > > patches, now we have dedicated flags to tell exclusively owned pages. > > > > > > One of the problems with demotion when I last looked is it does almost exactly > > > the opposite of what we want on systems like POWER9 where GPU memory is a > > > CPU-less memory node. > > > > > > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate > > > memory on the GPU node. Under memory pressure demotion should migrate GPU > > > allocations to the CPU node and finally other slow memory nodes or swap. > > > > > > Currently though demotion considers the GPU node slow memory (because it is > > > CPU-less) so will demote CPU memory to GPU memory which is a limited resource. > > > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap > > > everything to disk rather than demote to CPU memory (which would be preferred). 
> > > > > > I'm still looking at this series but as I understand it it will help somewhat > > > because we could make GPU memory the top-tier so nothing gets demoted to it. > > > > Yes. If we have a way to put GPU memory in top-tier (tier 0) and > > CPU+DRAM in tier 1. Your requirement can be satisfied. One way is to > > override the auto-generated demotion order via some user space tool. > > Another way is to change the GPU driver (I guess where the GPU memory is > > enumerated and onlined?) to change the tier of GPU memory node. > > > > > However I wouldn't want to see demotion skipped entirely when a memory policy > > > such as MPOL_BIND is specified. For example most memory on a GPU node will have > > > some kind of policy specified and IMHO it would be better to demote to another > > > node in the mempolicy nodemask rather than going straight to swap, particularly > > > as GPU memory capacity tends to be limited in comparison to CPU memory > > > capacity. > > > > > > > > Can you use MPOL_PREFERRED? Even if we enforce MPOL_BIND as much as > > possible, we will not stop demoting from GPU to DRAM with > > MPOL_PREFERRED. And in addition to demotion, allocation fallbacking can > > be used too to avoid allocation latency caused by demotion. > > I expect that MPOL_BIND can be used to either prevent demotion or > select a particular demotion node/nodemask. It all depends on the > mempolicy nodemask specified by MPOL_BIND. Preventing demotion doesn't make too much sense to me IMHO. But I tend to agree the demotion target should be selected from the nodemask. I think this could follow what numa fault does. > > > This is another example of a system with 3 tiers if PMEM is installed in > > this machine too. > > > > Best Regards, > > Huang, Ying > > > > > > > On the other hand, it can already support most interesting use cases > > > > > for demotion (e.g. selecting the demotion node, mbind to prevent > > > > > demotion) by respecting cpuset and vma mempolicies. > > > > > > > > > > > Best Regards, > > > > > > Huang, Ying > > > > > > > > > > > > > > > > > > > > > > > Cross-socket demotion should not be too big a problem in practice > > > > > > > > > because we can optimize the code to do the demotion from the local CPU > > > > > > > > > node (i.e. local writes to the target node and remote read from the > > > > > > > > > source node). The bigger issue is cross-socket memory access onto the > > > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is > > > > > > > > > important here. > > > > > > > > > > > > > > > > > > > > > > > > > > -aneesh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
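To make the "select the demotion target from the nodemask" idea concrete, here is a small self-contained model (ordinary userspace C, not kernel code; the table, names and nodemask encoding are invented for illustration): the per-node default demotion order is walked and the first target that is also in the policy's allowed nodemask wins, with swap as the fallback when the intersection is empty:

/* Toy model of nodemask-constrained demotion-target selection. */
#include <stdio.h>

#define NR_NODES  4
#define NO_TARGET (-1)

/* Default demotion order: source node -> candidate targets, best first. */
static const int node_demotion[NR_NODES][NR_NODES] = {
        /* node 0 (DRAM) */ { 1, 3, NO_TARGET },
        /* node 1 (PMEM) */ { NO_TARGET },
        /* node 2 (DRAM) */ { 3, 1, NO_TARGET },
        /* node 3 (PMEM) */ { NO_TARGET },
};

/*
 * Pick the first default target that is also in the allowed nodemask
 * (e.g. an MPOL_BIND nodemask); NO_TARGET means fall back to swap.
 */
static int demotion_target(int src, unsigned long allowed)
{
        for (int i = 0; i < NR_NODES; i++) {
                int t = node_demotion[src][i];

                if (t == NO_TARGET)
                        break;
                if (allowed & (1UL << t))
                        return t;
        }
        return NO_TARGET;
}

int main(void)
{
        /* Bind to {0,3}: node 0 pages demote to node 3 rather than node 1. */
        printf("bind {0,3}: node 0 -> %d\n", demotion_target(0, 0x9));
        /* Bind to {0,2}: no allowed slow node, so skip demotion (swap). */
        printf("bind {0,2}: node 0 -> %d\n", demotion_target(0, 0x5));
        return 0;
}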
On Thu, Apr 28, 2022 at 9:45 PM Alistair Popple <apopple@nvidia.com> wrote: > > On Friday, 29 April 2022 1:27:36 PM AEST ying.huang@intel.com wrote: > > On Thu, 2022-04-28 at 19:58 -0700, Wei Xu wrote: > > > On Thu, Apr 28, 2022 at 7:21 PM ying.huang@intel.com > > > <ying.huang@intel.com> wrote: > > > > > > > > On Fri, 2022-04-29 at 11:27 +1000, Alistair Popple wrote: > > > > > On Friday, 29 April 2022 3:14:29 AM AEST Yang Shi wrote: > > > > > > On Wed, Apr 27, 2022 at 9:11 PM Wei Xu <weixugc@google.com> wrote: > > > > > > > > > > > > > > On Wed, Apr 27, 2022 at 5:56 PM ying.huang@intel.com > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > On Wed, 2022-04-27 at 11:27 -0700, Wei Xu wrote: > > > > > > > > > On Tue, Apr 26, 2022 at 10:06 PM Aneesh Kumar K V > > > > > > > > > <aneesh.kumar@linux.ibm.com> wrote: > > > > > > > > > > > > > > > > > > > > On 4/25/22 10:26 PM, Wei Xu wrote: > > > > > > > > > > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > > > > > > > > > > > <ying.huang@intel.com> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > .... > > > > > > > > > > > > > > > > > > > > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for example, > > > > > > > > > > > > > > > > > > > > > > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow > > > > > > > > > > > > memory node near node 0, > > > > > > > > > > > > > > > > > > > > > > > > available: 3 nodes (0-2) > > > > > > > > > > > > node 0 cpus: 0 1 > > > > > > > > > > > > node 0 size: n MB > > > > > > > > > > > > node 0 free: n MB > > > > > > > > > > > > node 1 cpus: > > > > > > > > > > > > node 1 size: n MB > > > > > > > > > > > > node 1 free: n MB > > > > > > > > > > > > node 2 cpus: 2 3 > > > > > > > > > > > > node 2 size: n MB > > > > > > > > > > > > node 2 free: n MB > > > > > > > > > > > > node distances: > > > > > > > > > > > > node 0 1 2 > > > > > > > > > > > > 0: 10 40 20 > > > > > > > > > > > > 1: 40 10 80 > > > > > > > > > > > > 2: 20 80 10 > > > > > > > > > > > > > > > > > > > > > > > > We have 2 choices, > > > > > > > > > > > > > > > > > > > > > > > > a) > > > > > > > > > > > > node demotion targets > > > > > > > > > > > > 0 1 > > > > > > > > > > > > 2 1 > > > > > > > > > > > > > > > > > > > > > > > > b) > > > > > > > > > > > > node demotion targets > > > > > > > > > > > > 0 1 > > > > > > > > > > > > 2 X > > > > > > > > > > > > > > > > > > > > > > > > a) is good to take advantage of PMEM. b) is good to reduce cross-socket > > > > > > > > > > > > traffic. Both are OK as defualt configuration. But some users may > > > > > > > > > > > > prefer the other one. So we need a user space ABI to override the > > > > > > > > > > > > default configuration. > > > > > > > > > > > > > > > > > > > > > > I think 2(a) should be the system-wide configuration and 2(b) can be > > > > > > > > > > > achieved with NUMA mempolicy (which needs to be added to demotion). > > > > > > > > > > > > > > > > > > > > > > In general, we can view the demotion order in a way similar to > > > > > > > > > > > allocation fallback order (after all, if we don't demote or demotion > > > > > > > > > > > lags behind, the allocations will go to these demotion target nodes > > > > > > > > > > > according to the allocation fallback order anyway). If we initialize > > > > > > > > > > > the demotion order in that way (i.e. 
every node can demote to any node > > > > > > > > > > > in the next tier, and the priority of the target nodes is sorted for > > > > > > > > > > > each source node), we don't need per-node demotion order override from > > > > > > > > > > > the userspace. What we need is to specify what nodes should be in > > > > > > > > > > > each tier and support NUMA mempolicy in demotion. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > I have been wondering how we would handle this. For ex: If an > > > > > > > > > > application has specified an MPOL_BIND policy and restricted the > > > > > > > > > > allocation to be from Node0 and Node1, should we demote pages allocated > > > > > > > > > > by that application > > > > > > > > > > to Node10? The other alternative for that demotion is swapping. So from > > > > > > > > > > the page point of view, we either demote to a slow memory or pageout to > > > > > > > > > > swap. But then if we demote we are also breaking the MPOL_BIND rule. > > > > > > > > > > > > > > > > > > IMHO, the MPOL_BIND policy should be respected and demotion should be > > > > > > > > > skipped in such cases. Such MPOL_BIND policies can be an important > > > > > > > > > tool for applications to override and control their memory placement > > > > > > > > > when transparent memory tiering is enabled. If the application > > > > > > > > > doesn't want swapping, there are other ways to achieve that (e.g. > > > > > > > > > mlock, disabling swap globally, setting memcg parameters, etc). > > > > > > > > > > > > > > > > > > > > > > > > > > > > The above says we would need some kind of mem policy interaction, but > > > > > > > > > > what I am not sure about is how to find the memory policy in the > > > > > > > > > > demotion path. > > > > > > > > > > > > > > > > > > This is indeed an important and challenging problem. One possible > > > > > > > > > approach is to retrieve the allowed demotion nodemask from > > > > > > > > > page_referenced() similar to vm_flags. > > > > > > > > > > > > > > > > This works for mempolicy in struct vm_area_struct, but not for that in > > > > > > > > struct task_struct. Mutiple threads in a process may have different > > > > > > > > mempolicy. > > > > > > > > > > > > > > From vm_area_struct, we can get to mm_struct and then to the owner > > > > > > > task_struct, which has the process mempolicy. > > > > > > > > > > > > > > It is indeed a problem when a page is shared by different threads or > > > > > > > different processes that have different thread default mempolicy > > > > > > > values. > > > > > > > > > > > > Sorry for chiming in late, this is a known issue when we were working > > > > > > on demotion. Yes, it is hard to handle the shared pages and multi > > > > > > threads since mempolicy is applied to each thread so each thread may > > > > > > have different mempolicy. And I don't think this case is rare. And not > > > > > > only mempolicy but also may cpuset settings cause the similar problem, > > > > > > different threads may have different cpuset settings for cgroupv1. > > > > > > > > > > > > If this is really a problem for real life workloads, we may consider > > > > > > tackling it for exclusively owned pages first. Thanks to David's > > > > > > patches, now we have dedicated flags to tell exclusively owned pages. > > > > > > > > > > One of the problems with demotion when I last looked is it does almost exactly > > > > > the opposite of what we want on systems like POWER9 where GPU memory is a > > > > > CPU-less memory node. 
> > > > > > > > > > On those systems users tend to use MPOL_BIND or MPOL_PREFERRED to allocate > > > > > memory on the GPU node. Under memory pressure demotion should migrate GPU > > > > > allocations to the CPU node and finally other slow memory nodes or swap. > > > > > > > > > > Currently though demotion considers the GPU node slow memory (because it is > > > > > CPU-less) so will demote CPU memory to GPU memory which is a limited resource. > > > > > And when trying to allocate GPU memory with MPOL_BIND/PREFERRED it will swap > > > > > everything to disk rather than demote to CPU memory (which would be preferred). > > > > > > > > > > I'm still looking at this series but as I understand it it will help somewhat > > > > > because we could make GPU memory the top-tier so nothing gets demoted to it. > > > > > > > > Yes. If we have a way to put GPU memory in top-tier (tier 0) and > > > > CPU+DRAM in tier 1. Your requirement can be satisfied. One way is to > > > > override the auto-generated demotion order via some user space tool. > > > > Another way is to change the GPU driver (I guess where the GPU memory is > > > > enumerated and onlined?) to change the tier of GPU memory node. > > Yes, although I think in this case it would be firmware that determines memory > tiers (similar to ACPI HMAT which I saw discussed somewhere here). I agree it's > a system level property though that in an ideal world shouldn't need overriding > from userspace. However being able to override it with a user space tool could > be useful. > > > > > > However I wouldn't want to see demotion skipped entirely when a memory policy > > > > > such as MPOL_BIND is specified. For example most memory on a GPU node will have > > > > > some kind of policy specified and IMHO it would be better to demote to another > > > > > node in the mempolicy nodemask rather than going straight to swap, particularly > > > > > as GPU memory capacity tends to be limited in comparison to CPU memory > > > > > capacity. > > > > > > > > > > > > > > Can you use MPOL_PREFERRED? Even if we enforce MPOL_BIND as much as > > > > possible, we will not stop demoting from GPU to DRAM with > > > > MPOL_PREFERRED. And in addition to demotion, allocation fallbacking can > > > > be used too to avoid allocation latency caused by demotion. > > I think so. It's been a little while since I last looked at this but I was > under the impression MPOL_PREFERRED didn't do direct reclaim (and therefore > wouldn't trigger demotion so once GPU memory was full became effectively a > no-op). However looking at the source I don't think that's the case now - if > I'm understanding correctly MPOL_PREFERRED will do reclaim/demotion. You are right. Whether doing reclaim depends on the GFP flags and memory pressure instead of mempolicy. > > The other problem with MPOL_PREFERRED is it doesn't allow the fallback nodes to > be specified. I was hoping the new MPOL_PREFERRED_MANY and > set_mempolicy_home_node() would help here but currently that does disable > reclaim (and therefore demotion) in the first pass. > > However that problem is tangential to this series and I can look at that > separately. My main aim here given you were looking at requirements was just > to raise this as a slightly different use case (one where the CPU isn't the top > tier). > > Thanks for looking into all this. > > - Alistair > > > > I expect that MPOL_BIND can be used to either prevent demotion or > > > select a particular demotion node/nodemask. 
It all depends on the > > > mempolicy nodemask specified by MPOL_BIND. > > > > Yes. I think so too. > > > > Best Regards, > > Huang, Ying > > > > > > This is another example of a system with 3 tiers if PMEM is installed in > > > > this machine too. > > > > > > > > Best Regards, > > > > Huang, Ying > > > > > > > > > > > On the other hand, it can already support most interesting use cases > > > > > > > for demotion (e.g. selecting the demotion node, mbind to prevent > > > > > > > demotion) by respecting cpuset and vma mempolicies. > > > > > > > > > > > > > > > Best Regards, > > > > > > > > Huang, Ying > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Cross-socket demotion should not be too big a problem in practice > > > > > > > > > > > because we can optimize the code to do the demotion from the local CPU > > > > > > > > > > > node (i.e. local writes to the target node and remote read from the > > > > > > > > > > > source node). The bigger issue is cross-socket memory access onto the > > > > > > > > > > > demoted pages from the applications, which is why NUMA mempolicy is > > > > > > > > > > > important here. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -aneesh > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >
On Thu, Apr 28, 2022 at 12:30 PM Chen, Tim C <tim.c.chen@intel.com> wrote: > > > > >On Wed, 2022-04-27 at 09:27 -0700, Wei Xu wrote: > >> On Wed, Apr 27, 2022 at 12:11 AM ying.huang@intel.com > >> <ying.huang@intel.com> wrote: > >> > > >> > On Mon, 2022-04-25 at 09:56 -0700, Wei Xu wrote: > >> > > On Sat, Apr 23, 2022 at 8:02 PM ying.huang@intel.com > >> > > <ying.huang@intel.com> wrote: > >> > > > > >> > > > Hi, All, > >> > > > > >> > > > On Fri, 2022-04-22 at 16:30 +0530, Jagdish Gediya wrote: > >> > > > > >> > > > [snip] > >> > > > > >> > > > > I think it is necessary to either have per node demotion > >> > > > > targets configuration or the user space interface supported by > >> > > > > this patch series. As we don't have clear consensus on how the > >> > > > > user interface should look like, we can defer the per node > >> > > > > demotion target set interface to future until the real need arises. > >> > > > > > >> > > > > Current patch series sets N_DEMOTION_TARGET from dax device > >> > > > > kmem driver, it may be possible that some memory node desired > >> > > > > as demotion target is not detected in the system from dax-device > >kmem probe path. > >> > > > > > >> > > > > It is also possible that some of the dax-devices are not > >> > > > > preferred as demotion target e.g. HBM, for such devices, node > >> > > > > shouldn't be set to N_DEMOTION_TARGETS. In future, Support > >> > > > > should be added to distinguish such dax-devices and not mark > >> > > > > them as N_DEMOTION_TARGETS from the kernel, but for now this > >> > > > > user space interface will be useful to avoid such devices as demotion > >targets. > >> > > > > > >> > > > > We can add read only interface to view per node demotion > >> > > > > targets from /sys/devices/system/node/nodeX/demotion_targets, > >> > > > > remove duplicated /sys/kernel/mm/numa/demotion_target > >> > > > > interface and instead make > >/sys/devices/system/node/demotion_targets writable. > >> > > > > > >> > > > > Huang, Wei, Yang, > >> > > > > What do you suggest? > >> > > > > >> > > > We cannot remove a kernel ABI in practice. So we need to make > >> > > > it right at the first time. Let's try to collect some > >> > > > information for the kernel ABI definitation. > >> > > > > >> > > > The below is just a starting point, please add your requirements. > >> > > > > >> > > > 1. Jagdish has some machines with DRAM only NUMA nodes, but they > >> > > > don't want to use that as the demotion targets. But I don't > >> > > > think this is a issue in practice for now, because > >> > > > demote-in-reclaim is disabled by default. > >> > > > > >> > > > 2. For machines with PMEM installed in only 1 of 2 sockets, for > >> > > > example, > >> > > > > >> > > > Node 0 & 2 are cpu + dram nodes and node 1 are slow memory node > >> > > > near node 0, > >> > > > > >> > > > available: 3 nodes (0-2) > >> > > > node 0 cpus: 0 1 > >> > > > node 0 size: n MB > >> > > > node 0 free: n MB > >> > > > node 1 cpus: > >> > > > node 1 size: n MB > >> > > > node 1 free: n MB > >> > > > node 2 cpus: 2 3 > >> > > > node 2 size: n MB > >> > > > node 2 free: n MB > >> > > > node distances: > >> > > > node 0 1 2 > >> > > > 0: 10 40 20 > >> > > > 1: 40 10 80 > >> > > > 2: 20 80 10 > >> > > > > >> > > > We have 2 choices, > >> > > > > >> > > > a) > >> > > > node demotion targets > >> > > > 0 1 > >> > > > 2 1 > >> > > > > >> > > > b) > >> > > > node demotion targets > >> > > > 0 1 > >> > > > 2 X > >> > > > > >> > > > a) is good to take advantage of PMEM. 
b) is good to reduce > >> > > > cross-socket traffic. Both are OK as defualt configuration. > >> > > > But some users may prefer the other one. So we need a user > >> > > > space ABI to override the default configuration. > >> > > > >> > > I think 2(a) should be the system-wide configuration and 2(b) can > >> > > be achieved with NUMA mempolicy (which needs to be added to > >demotion). > >> > > >> > Unfortunately, some NUMA mempolicy information isn't available at > >> > demotion time, for example, mempolicy enforced via set_mempolicy() > >> > is for thread. But I think that cpusets can work for demotion. > >> > > >> > > In general, we can view the demotion order in a way similar to > >> > > allocation fallback order (after all, if we don't demote or > >> > > demotion lags behind, the allocations will go to these demotion > >> > > target nodes according to the allocation fallback order anyway). > >> > > If we initialize the demotion order in that way (i.e. every node > >> > > can demote to any node in the next tier, and the priority of the > >> > > target nodes is sorted for each source node), we don't need > >> > > per-node demotion order override from the userspace. What we need > >> > > is to specify what nodes should be in each tier and support NUMA > >mempolicy in demotion. > >> > > >> > This sounds interesting. Tier sounds like a natural and general > >> > concept for these memory types. It's attracting to use it for user > >> > space interface too. For example, we may use that for mem_cgroup > >> > limits of a specific memory type (tier). > >> > > >> > And if we take a look at the N_DEMOTION_TARGETS again from the "tier" > >> > point of view. The nodes are divided to 2 classes via > >> > N_DEMOTION_TARGETS. > >> > > >> > - The nodes without N_DEMOTION_TARGETS are top tier (or tier 0). > >> > > >> > - The nodes with N_DEMOTION_TARGETS are non-top tier (or tier 1, 2, > >> > 3, > >> > ...) > >> > > >> > >> Yes, this is one of the main reasons why we (Google) want this interface. > >> > >> > So, another possibility is to fit N_DEMOTION_TARGETS and its > >> > overriding into "tier" concept too. !N_DEMOTION_TARGETS == TIER0. > >> > > >> > - All nodes start with TIER0 > >> > > >> > - TIER0 can be cleared for some nodes via e.g. kmem driver > >> > > >> > TIER0 node list can be read or overriden by the user space via the > >> > following interface, > >> > > >> > /sys/devices/system/node/tier0 > >> > > >> > In the future, if we want to customize more tiers, we can add tier1, > >> > tier2, tier3, ..... For now, we can add just tier0. That is, the > >> > interface is extensible in the future compared with > >> > .../node/demote_targets. > >> > > >> > >> This more explicit tier definition interface works, too. > >> > > > >In addition to make tiering definition explicit, more importantly, this makes it > >much easier to support more than 2 tiers. For example, for a system with > >HBM (High Bandwidth Memory), CPU+DRAM, DRAM only, and PMEM, that is, > >3 tiers, we can put HBM in tier 0, CPU+DRAM and DRAM only in tier 1, and > >PMEM in tier 2, automatically, or via user space overridding. > >N_DEMOTION_TARGETS isn't natural to be extended to support this. > > Agree with Ying that making the tier explicit is fundamental to the rest of the API. > > I think that the tier organization should come before setting the demotion targets, > not the other way round. > > That makes things clear on the demotion direction, (node in tier X > demote to tier Y, X<Y). 
With that, explicitly specifying the demotion target or > order is only needed when we truly want that level of control or a demotion > order. Otherwise all the higher numbered tiers are valid targets. > Configuring a tier level for each node is a lot easier than fixing up all > demotion targets for each and every node. > > We can prevent demotion target configuration that goes in the wrong > direction by looking at the tier level. > > Tim > I have just posted an RFC on the tier-oriented memory tiering kernel interface based on the discussions here. The RFC proposes a sysfs interface, /sys/devices/system/node/memory_tiers, to display and override the nodes in each memory tier. It also proposes that we rely on the kernel allocation order to select the demotion target node from the next tier, rather than exposing a userspace override interface for per-node demotion order. The RFC also drops the approach of treating CPU nodes as the top tier by default.
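For completeness, a trivial sketch of inspecting the proposed interface. This is only the interface proposed in that RFC: the file may not exist (or may differ) on a given kernel, and the override/write format is defined by the RFC itself, so it is not assumed here:

/* Dump the proposed memory_tiers file, if the running kernel has it. */
#include <stdio.h>

int main(void)
{
        char line[256];
        FILE *f = fopen("/sys/devices/system/node/memory_tiers", "r");

        if (!f) {
                perror("memory_tiers");
                return 1;
        }
        while (fgets(line, sizeof(line), f))
                fputs(line, stdout);
        fclose(f);
        return 0;
}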