diff mbox series

[RFC,v4,1/7] mm/demotion: Add support for explicit memory tiers

Message ID 20220527122528.129445-2-aneesh.kumar@linux.ibm.com (mailing list archive)
State New
Headers show
Series mm/demotion: Memory tiers and demotion | expand

Commit Message

Aneesh Kumar K.V May 27, 2022, 12:25 p.m. UTC
From: Jagdish Gediya <jvgediya@linux.ibm.com>

In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during the kernel initialization and updated when a NUMA node is
hot-added or hot-removed.  The current implementation puts all
nodes with CPU into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases,

The current tier initialization code always initializes
each memory-only NUMA node into a lower tier.  But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM or GPU devices, the
memory-only NUMA nodes mapping these devices should be in the
top tier, and DRAM nodes with CPUs are better to be placed into the
next lower tier.

With current kernel higher tier node can only be demoted to selected nodes on the
next lower tier as defined by the demotion path, not any other
node from any lower tier.  This strict, hard-coded demotion order
does not work in all use cases (e.g. some use cases may want to
allow cross-socket demotion to another node in the same demotion
tier as a fallback when the preferred demotion node is out of
space), This demotion order is also inconsistent with the page
allocation fallback order when all the nodes in a higher tier are
out of space: The page allocation can fall back to any node from
any lower tier, whereas the demotion order doesn't allow that.

The current kernel also don't provide any interfaces for the
userspace to learn about the memory tier hierarchy in order to
optimize its memory allocations.

This patch series address the above by defining memory tiers explicitly.

This patch adds below sysfs interface which is read-only and
can be used to read nodes available in specific tier.

/sys/devices/system/memtier/memtierN/nodelist

Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
lowest tier. The absolute value of a tier id number has no specific
meaning. what matters is the relative order of the tier id numbers.

All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
nodes are by default assigned to DEFAULT_MEMORY_TIER(1).

Default memory tier can be read from,
/sys/devices/system/memtier/default_tier

Max memory tier can be read from,
/sys/devices/system/memtier/max_tiers

This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].

[1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/migrate.h |  38 ++++++++----
 mm/Kconfig              |  11 ++++
 mm/migrate.c            | 134 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 170 insertions(+), 13 deletions(-)

Comments

Jonathan Cameron May 27, 2022, 1:59 p.m. UTC | #1
On Fri, 27 May 2022 17:55:22 +0530
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> wrote:

> From: Jagdish Gediya <jvgediya@linux.ibm.com>
> 
> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created
> during the kernel initialization and updated when a NUMA node is
> hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier, and builds the tier hierarchy
> tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
> 
> This current memory tier kernel interface needs to be improved for
> several important use cases,
> 
> The current tier initialization code always initializes
> each memory-only NUMA node into a lower tier.  But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem or a DRAM-backed memory-only node on
> a virtual machine) and should be put into a higher tier.
> 
> The current tier hierarchy always puts CPU nodes into the top
> tier. But on a system with HBM or GPU devices, the
> memory-only NUMA nodes mapping these devices should be in the
> top tier, and DRAM nodes with CPUs are better to be placed into the
> next lower tier.
> 
> With current kernel higher tier node can only be demoted to selected nodes on the
> next lower tier as defined by the demotion path, not any other
> node from any lower tier.  This strict, hard-coded demotion order
> does not work in all use cases (e.g. some use cases may want to
> allow cross-socket demotion to another node in the same demotion
> tier as a fallback when the preferred demotion node is out of
> space), This demotion order is also inconsistent with the page
> allocation fallback order when all the nodes in a higher tier are
> out of space: The page allocation can fall back to any node from
> any lower tier, whereas the demotion order doesn't allow that.
> 
> The current kernel also don't provide any interfaces for the
> userspace to learn about the memory tier hierarchy in order to
> optimize its memory allocations.
> 
> This patch series address the above by defining memory tiers explicitly.
> 
> This patch adds below sysfs interface which is read-only and
> can be used to read nodes available in specific tier.
> 
> /sys/devices/system/memtier/memtierN/nodelist
> 
> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> lowest tier. The absolute value of a tier id number has no specific
> meaning. what matters is the relative order of the tier id numbers.
> 
> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
> nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
> 
> Default memory tier can be read from,
> /sys/devices/system/memtier/default_tier
> 
> Max memory tier can be read from,
> /sys/devices/system/memtier/max_tiers
> 
> This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
> 
> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> 
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

Hi Aneesh,

Superficial review only for this first pass.  We'll give it a spin
next week and come back with anything more substantial.

Thanks,

Jonathan

> ---
>  include/linux/migrate.h |  38 ++++++++----
>  mm/Kconfig              |  11 ++++
>  mm/migrate.c            | 134 ++++++++++++++++++++++++++++++++++++++++
>  3 files changed, 170 insertions(+), 13 deletions(-)
> 
> diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> index 90e75d5a54d6..0ec653623565 100644
> --- a/include/linux/migrate.h
> +++ b/include/linux/migrate.h
> @@ -47,17 +47,8 @@ void folio_migrate_copy(struct folio *newfolio, struct folio *folio);
>  int folio_migrate_mapping(struct address_space *mapping,
>  		struct folio *newfolio, struct folio *folio, int extra_count);
>  
> -extern bool numa_demotion_enabled;
> -extern void migrate_on_reclaim_init(void);
> -#ifdef CONFIG_HOTPLUG_CPU
> -extern void set_migration_target_nodes(void);
> -#else
> -static inline void set_migration_target_nodes(void) {}
> -#endif
>  #else
>  
> -static inline void set_migration_target_nodes(void) {}
> -
>  static inline void putback_movable_pages(struct list_head *l) {}
>  static inline int migrate_pages(struct list_head *l, new_page_t new,
>  		free_page_t free, unsigned long private, enum migrate_mode mode,
> @@ -82,7 +73,6 @@ static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
>  	return -ENOSYS;
>  }
>  
> -#define numa_demotion_enabled	false
>  #endif /* CONFIG_MIGRATION */
>  
>  #ifdef CONFIG_COMPACTION
> @@ -172,15 +162,37 @@ struct migrate_vma {
>  int migrate_vma_setup(struct migrate_vma *args);
>  void migrate_vma_pages(struct migrate_vma *migrate);
>  void migrate_vma_finalize(struct migrate_vma *migrate);
> -int next_demotion_node(int node);
> +#endif /* CONFIG_MIGRATION */
> +
> +#ifdef CONFIG_TIERED_MEMORY
> +
> +extern bool numa_demotion_enabled;
> +#define DEFAULT_MEMORY_TIER	1
> +
> +enum memory_tier_type {
> +	MEMORY_TIER_HBM_GPU,
> +	MEMORY_TIER_DRAM,
> +	MEMORY_TIER_PMEM,
> +	MAX_MEMORY_TIERS

Not used yet. Introduce it in patch that makes use of it.

> +};
>  
> -#else /* CONFIG_MIGRATION disabled: */
> +int next_demotion_node(int node);
>  
> +extern void migrate_on_reclaim_init(void);
> +#ifdef CONFIG_HOTPLUG_CPU
> +extern void set_migration_target_nodes(void);
> +#else
> +static inline void set_migration_target_nodes(void) {}
> +#endif
> +#else 

Worth noting what this else is for as we have a lot of them about!
Comments as used elsewhere in this file for #else will help.

> +#define numa_demotion_enabled	false
>  static inline int next_demotion_node(int node)
>  {
>  	return NUMA_NO_NODE;
>  }
>  
> -#endif /* CONFIG_MIGRATION */
> +static inline void set_migration_target_nodes(void) {}
> +static inline void migrate_on_reclaim_init(void) {}
> +#endif	/* CONFIG_TIERED_MEMORY */
>  
>  #endif /* _LINUX_MIGRATE_H */
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 034d87953600..7bfbddef46ed 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -258,6 +258,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>  config ARCH_ENABLE_THP_MIGRATION
>  	bool
>  
> +config TIERED_MEMORY
> +	bool "Support for explicit memory tiers"
> +	def_bool y

New options as y is generally hard to argue for

> +	depends on MIGRATION && NUMA
> +	help
> +	  Support to split nodes into memory tiers explicitly and
> +	  to demote pages on reclaim to lower tiers. This option
> +	  also exposes sysfs interface to read nodes available in
> +	  specific tier and to move specific node among different
> +	  possible tiers.
> +
>  config HUGETLB_PAGE_SIZE_VARIABLE
>  	def_bool n
>  	help
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6c31ee1e1c9b..f28ee93fb017 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -2118,6 +2118,113 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
>  #endif /* CONFIG_NUMA_BALANCING */
>  #endif /* CONFIG_NUMA */
>  
> +#ifdef CONFIG_TIERED_MEMORY

I wonder if it makes sense to put this in a separate file given it's reasonably
separable and that file is big enough already...

> +
> +struct memory_tier {
> +	struct device dev;
> +	nodemask_t nodelist;
> +};
> +
> +#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
> +
> +static struct bus_type memory_tier_subsys = {
> +	.name = "memtier",
> +	.dev_name = "memtier",
> +};
> +
> +static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
> +
> +static ssize_t nodelist_show(struct device *dev,
> +			     struct device_attribute *attr, char *buf)
> +{
> +	int tier = dev->id;

	struct memory_tier *tier = memory_tiers[dev->id];
as the local variable will give more readable code I think.

> +
> +	return sysfs_emit(buf, "%*pbl\n",
> +			  nodemask_pr_args(&memory_tiers[tier]->nodelist));
> +
> +}
> +static DEVICE_ATTR_RO(nodelist);
> +
> +static struct attribute *memory_tier_dev_attrs[] = {
> +	&dev_attr_nodelist.attr,
> +	NULL
> +};
> +
> +static const struct attribute_group memory_tier_dev_group = {
> +	.attrs = memory_tier_dev_attrs,
> +};
> +
> +static const struct attribute_group *memory_tier_dev_groups[] = {
> +	&memory_tier_dev_group,
> +	NULL
> +};
> +
> +static void memory_tier_device_release(struct device *dev)
> +{
> +	struct memory_tier *tier = to_memory_tier(dev);
> +
> +	kfree(tier);
> +}
> +
> +static int register_memory_tier(int tier)
> +{
> +	int error;
> +

I would add a sanity check that the tier is not already
present.  Trivial check and might help debugging...

> +	memory_tiers[tier] = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);

prefer sizeof(*memory_tiers[tier]) to avoid need to check type matches.

> +	if (!memory_tiers[tier])
> +		return -ENOMEM;
> +
> +	memory_tiers[tier]->dev.id = tier;
This would all be simpler to read if you use a local variable.

struct memory_tier *tier = kzalloc(sizeof(*tier), GFP_KERNEL);
and only assign it to memory_tiers[tier] on successful device_register

> +	memory_tiers[tier]->dev.bus = &memory_tier_subsys;
> +	memory_tiers[tier]->dev.release = memory_tier_device_release;
> +	memory_tiers[tier]->dev.groups = memory_tier_dev_groups;
> +	error = device_register(&memory_tiers[tier]->dev);
> +
> +	if (error) {
> +		put_device(&memory_tiers[tier]->dev);
> +		memory_tiers[tier] = NULL;
> +	}
> +
> +	return error;
> +}
> +
> +static void unregister_memory_tier(int tier)
> +{
> +	device_unregister(&memory_tiers[tier]->dev);
> +	memory_tiers[tier] = NULL;
> +}
> +
> +static ssize_t
> +max_tiers_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
> +}
> +
> +static DEVICE_ATTR_RO(max_tiers);
> +
> +static ssize_t
> +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +	return sysfs_emit(buf, "%d\n", DEFAULT_MEMORY_TIER);
> +}
Common convention for these is don't leave a blank line.
Helps associate the single function with the ATTR definition.
> +
> +static DEVICE_ATTR_RO(default_tier);
> +
> +static struct attribute *memoty_tier_attrs[] = {

memory

> +	&dev_attr_max_tiers.attr,
> +	&dev_attr_default_tier.attr,
> +	NULL
> +};
> +
> +static const struct attribute_group memory_tier_attr_group = {
> +	.attrs = memoty_tier_attrs,
> +};
> +
> +static const struct attribute_group *memory_tier_attr_groups[] = {
> +	&memory_tier_attr_group,
> +	NULL,
> +};
> +
>  /*
>   * node_demotion[] example:
>   *
> @@ -2569,3 +2676,30 @@ static int __init numa_init_sysfs(void)
>  }
>  subsys_initcall(numa_init_sysfs);
>  #endif

You've caught up some other stuff in your CONFIG_TIERED_MEMORY ifdef.

> +
> +static int __init memory_tier_init(void)
> +{
> +	int ret;
> +
> +	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> +	if (ret)
> +		panic("%s() failed to register subsystem: %d\n", __func__, ret);
> +
> +	/*
> +	 * Register only default memory tier to hide all empty
> +	 * memory tier from sysfs.
> +	 */
> +	ret = register_memory_tier(DEFAULT_MEMORY_TIER);
> +	if (ret)
> +		panic("%s() failed to register memory tier: %d\n", __func__, ret);
> +
> +	/*
> +	 * CPU only nodes are not part of memoty tiers.
memory, plus single line comment syntax probably better.
> +	 */
> +	memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
> +
> +	return 0;
> +}
> +subsys_initcall(memory_tier_init);
> +
> +#endif	/* CONFIG_TIERED_MEMORY */
Huang, Ying June 2, 2022, 6:07 a.m. UTC | #2
On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> From: Jagdish Gediya <jvgediya@linux.ibm.com>
> 
> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created
> during the kernel initialization and updated when a NUMA node is
> hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier, and builds the tier hierarchy
> tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
> 
> This current memory tier kernel interface needs to be improved for
> several important use cases,
> 
> The current tier initialization code always initializes
> each memory-only NUMA node into a lower tier.  But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem or a DRAM-backed memory-only node on
> a virtual machine) and should be put into a higher tier.
> 
> The current tier hierarchy always puts CPU nodes into the top
> tier. But on a system with HBM or GPU devices, the
> memory-only NUMA nodes mapping these devices should be in the
> top tier, and DRAM nodes with CPUs are better to be placed into the
> next lower tier.
> 
> With current kernel higher tier node can only be demoted to selected nodes on the
> next lower tier as defined by the demotion path, not any other
> node from any lower tier.  This strict, hard-coded demotion order
> does not work in all use cases (e.g. some use cases may want to
> allow cross-socket demotion to another node in the same demotion
> tier as a fallback when the preferred demotion node is out of
> space), This demotion order is also inconsistent with the page
> allocation fallback order when all the nodes in a higher tier are
> out of space: The page allocation can fall back to any node from
> any lower tier, whereas the demotion order doesn't allow that.
> 
> The current kernel also don't provide any interfaces for the
> userspace to learn about the memory tier hierarchy in order to
> optimize its memory allocations.
> 
> This patch series address the above by defining memory tiers explicitly.
> 
> This patch adds below sysfs interface which is read-only and
> can be used to read nodes available in specific tier.
> 
> /sys/devices/system/memtier/memtierN/nodelist
> 
> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> lowest tier. The absolute value of a tier id number has no specific
> meaning. what matters is the relative order of the tier id numbers.
> 
> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
> nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
> 
> Default memory tier can be read from,
> /sys/devices/system/memtier/default_tier
> 
> Max memory tier can be read from,
> /sys/devices/system/memtier/max_tiers
> 
> This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
> 
> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> 
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

IMHO, we should change the kernel internal implementation firstly, then
implement the kerne/user space interface.  That is, make memory tier
explicit inside kernel, then expose it to user space.

Best Regards,
Huang, Ying


[snip]
Huang, Ying June 6, 2022, 2:49 a.m. UTC | #3
On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > 
> > In the current kernel, memory tiers are defined implicitly via a
> > demotion path relationship between NUMA nodes, which is created
> > during the kernel initialization and updated when a NUMA node is
> > hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and builds the tier hierarchy
> > tier-by-tier by establishing the per-node demotion targets based
> > on the distances between nodes.
> > 
> > This current memory tier kernel interface needs to be improved for
> > several important use cases,
> > 
> > The current tier initialization code always initializes
> > each memory-only NUMA node into a lower tier.  But a memory-only
> > NUMA node may have a high performance memory device (e.g. a DRAM
> > device attached via CXL.mem or a DRAM-backed memory-only node on
> > a virtual machine) and should be put into a higher tier.
> > 
> > The current tier hierarchy always puts CPU nodes into the top
> > tier. But on a system with HBM or GPU devices, the
> > memory-only NUMA nodes mapping these devices should be in the
> > top tier, and DRAM nodes with CPUs are better to be placed into the
> > next lower tier.
> > 
> > With current kernel higher tier node can only be demoted to selected nodes on the
> > next lower tier as defined by the demotion path, not any other
> > node from any lower tier.  This strict, hard-coded demotion order
> > does not work in all use cases (e.g. some use cases may want to
> > allow cross-socket demotion to another node in the same demotion
> > tier as a fallback when the preferred demotion node is out of
> > space), This demotion order is also inconsistent with the page
> > allocation fallback order when all the nodes in a higher tier are
> > out of space: The page allocation can fall back to any node from
> > any lower tier, whereas the demotion order doesn't allow that.
> > 
> > The current kernel also don't provide any interfaces for the
> > userspace to learn about the memory tier hierarchy in order to
> > optimize its memory allocations.
> > 
> > This patch series address the above by defining memory tiers explicitly.
> > 
> > This patch adds below sysfs interface which is read-only and
> > can be used to read nodes available in specific tier.
> > 
> > /sys/devices/system/memtier/memtierN/nodelist
> > 
> > Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> > lowest tier. The absolute value of a tier id number has no specific
> > meaning. what matters is the relative order of the tier id numbers.
> > 
> > All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> > Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
> > nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
> > 
> > Default memory tier can be read from,
> > /sys/devices/system/memtier/default_tier
> > 
> > Max memory tier can be read from,
> > /sys/devices/system/memtier/max_tiers
> > 
> > This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
> > 
> > [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> > 
> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> 
> IMHO, we should change the kernel internal implementation firstly, then
> implement the kerne/user space interface.  That is, make memory tier
> explicit inside kernel, then expose it to user space.

Why ignore this comment for v5?  If you don't agree, please respond me.

Best Regards,
Huang, Ying
Aneesh Kumar K.V June 6, 2022, 3:56 a.m. UTC | #4
On 6/6/22 8:19 AM, Ying Huang wrote:
> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>
>>> In the current kernel, memory tiers are defined implicitly via a
>>> demotion path relationship between NUMA nodes, which is created
>>> during the kernel initialization and updated when a NUMA node is
>>> hot-added or hot-removed.  The current implementation puts all
>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>> tier-by-tier by establishing the per-node demotion targets based
>>> on the distances between nodes.
>>>
>>> This current memory tier kernel interface needs to be improved for
>>> several important use cases,
>>>
>>> The current tier initialization code always initializes
>>> each memory-only NUMA node into a lower tier.  But a memory-only
>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>> a virtual machine) and should be put into a higher tier.
>>>
>>> The current tier hierarchy always puts CPU nodes into the top
>>> tier. But on a system with HBM or GPU devices, the
>>> memory-only NUMA nodes mapping these devices should be in the
>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>> next lower tier.
>>>
>>> With current kernel higher tier node can only be demoted to selected nodes on the
>>> next lower tier as defined by the demotion path, not any other
>>> node from any lower tier.  This strict, hard-coded demotion order
>>> does not work in all use cases (e.g. some use cases may want to
>>> allow cross-socket demotion to another node in the same demotion
>>> tier as a fallback when the preferred demotion node is out of
>>> space), This demotion order is also inconsistent with the page
>>> allocation fallback order when all the nodes in a higher tier are
>>> out of space: The page allocation can fall back to any node from
>>> any lower tier, whereas the demotion order doesn't allow that.
>>>
>>> The current kernel also don't provide any interfaces for the
>>> userspace to learn about the memory tier hierarchy in order to
>>> optimize its memory allocations.
>>>
>>> This patch series address the above by defining memory tiers explicitly.
>>>
>>> This patch adds below sysfs interface which is read-only and
>>> can be used to read nodes available in specific tier.
>>>
>>> /sys/devices/system/memtier/memtierN/nodelist
>>>
>>> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
>>> lowest tier. The absolute value of a tier id number has no specific
>>> meaning. what matters is the relative order of the tier id numbers.
>>>
>>> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
>>> Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
>>> nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
>>>
>>> Default memory tier can be read from,
>>> /sys/devices/system/memtier/default_tier
>>>
>>> Max memory tier can be read from,
>>> /sys/devices/system/memtier/max_tiers
>>>
>>> This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
>>>
>>> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
>>>
>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>
>> IMHO, we should change the kernel internal implementation firstly, then
>> implement the kerne/user space interface.  That is, make memory tier
>> explicit inside kernel, then expose it to user space.
> 
> Why ignore this comment for v5?  If you don't agree, please respond me.
> 

I am not sure what benefit such a rearrange would bring in? Right now I 
am writing the series from the point of view of introducing all the 
plumbing and them switching the existing demotion logic to use the new 
infrastructure. Redoing the code to hide all the userspace sysfs till we 
switch the demotion logic to use the new infrastructure doesn't really 
bring any additional clarity to patch review and would require me to 
redo the series with a lot of conflicts across the patches in the patchset.

-aneesh
Huang, Ying June 6, 2022, 5:33 a.m. UTC | #5
On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 8:19 AM, Ying Huang wrote:
> > On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > 
> > > > In the current kernel, memory tiers are defined implicitly via a
> > > > demotion path relationship between NUMA nodes, which is created
> > > > during the kernel initialization and updated when a NUMA node is
> > > > hot-added or hot-removed.  The current implementation puts all
> > > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > > tier-by-tier by establishing the per-node demotion targets based
> > > > on the distances between nodes.
> > > > 
> > > > This current memory tier kernel interface needs to be improved for
> > > > several important use cases,
> > > > 
> > > > The current tier initialization code always initializes
> > > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > a virtual machine) and should be put into a higher tier.
> > > > 
> > > > The current tier hierarchy always puts CPU nodes into the top
> > > > tier. But on a system with HBM or GPU devices, the
> > > > memory-only NUMA nodes mapping these devices should be in the
> > > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > > next lower tier.
> > > > 
> > > > With current kernel higher tier node can only be demoted to selected nodes on the
> > > > next lower tier as defined by the demotion path, not any other
> > > > node from any lower tier.  This strict, hard-coded demotion order
> > > > does not work in all use cases (e.g. some use cases may want to
> > > > allow cross-socket demotion to another node in the same demotion
> > > > tier as a fallback when the preferred demotion node is out of
> > > > space), This demotion order is also inconsistent with the page
> > > > allocation fallback order when all the nodes in a higher tier are
> > > > out of space: The page allocation can fall back to any node from
> > > > any lower tier, whereas the demotion order doesn't allow that.
> > > > 
> > > > The current kernel also don't provide any interfaces for the
> > > > userspace to learn about the memory tier hierarchy in order to
> > > > optimize its memory allocations.
> > > > 
> > > > This patch series address the above by defining memory tiers explicitly.
> > > > 
> > > > This patch adds below sysfs interface which is read-only and
> > > > can be used to read nodes available in specific tier.
> > > > 
> > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > 
> > > > Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> > > > lowest tier. The absolute value of a tier id number has no specific
> > > > meaning. what matters is the relative order of the tier id numbers.
> > > > 
> > > > All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> > > > Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
> > > > nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
> > > > 
> > > > Default memory tier can be read from,
> > > > /sys/devices/system/memtier/default_tier
> > > > 
> > > > Max memory tier can be read from,
> > > > /sys/devices/system/memtier/max_tiers
> > > > 
> > > > This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
> > > > 
> > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> > > > 
> > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > 
> > > IMHO, we should change the kernel internal implementation firstly, then
> > > implement the kerne/user space interface.  That is, make memory tier
> > > explicit inside kernel, then expose it to user space.
> > 
> > Why ignore this comment for v5?  If you don't agree, please respond me.
> > 
> 
> I am not sure what benefit such a rearrange would bring in? Right now I 
> am writing the series from the point of view of introducing all the 
> plumbing and them switching the existing demotion logic to use the new 
> infrastructure. Redoing the code to hide all the userspace sysfs till we 
> switch the demotion logic to use the new infrastructure doesn't really 
> bring any additional clarity to patch review and would require me to 
> redo the series with a lot of conflicts across the patches in the patchset.

IMHO, we shouldn't introduce regression even in the middle of a
patchset.  Each step should only rely on previous patches in the series
to work correctly.  In your current way of organization, after patch
[1/7], on a system with 2 memory tiers, the user space interface will
output wrong information (only 1 memory tier).  So I think the correct
way is to make it right inside the kenrel firstly, then expose the right
information to user space.

Best Regards,
Huang, Ying
Aneesh Kumar K.V June 6, 2022, 6:01 a.m. UTC | #6
On 6/6/22 11:03 AM, Ying Huang wrote:
> On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
>> On 6/6/22 8:19 AM, Ying Huang wrote:
>>> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>>>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>
>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>> demotion path relationship between NUMA nodes, which is created
>>>>> during the kernel initialization and updated when a NUMA node is
>>>>> hot-added or hot-removed.  The current implementation puts all
>>>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>>>> tier-by-tier by establishing the per-node demotion targets based
>>>>> on the distances between nodes.
>>>>>
>>>>> This current memory tier kernel interface needs to be improved for
>>>>> several important use cases,
>>>>>
>>>>> The current tier initialization code always initializes
>>>>> each memory-only NUMA node into a lower tier.  But a memory-only
>>>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>> a virtual machine) and should be put into a higher tier.
>>>>>
>>>>> The current tier hierarchy always puts CPU nodes into the top
>>>>> tier. But on a system with HBM or GPU devices, the
>>>>> memory-only NUMA nodes mapping these devices should be in the
>>>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>>>> next lower tier.
>>>>>
>>>>> With current kernel higher tier node can only be demoted to selected nodes on the
>>>>> next lower tier as defined by the demotion path, not any other
>>>>> node from any lower tier.  This strict, hard-coded demotion order
>>>>> does not work in all use cases (e.g. some use cases may want to
>>>>> allow cross-socket demotion to another node in the same demotion
>>>>> tier as a fallback when the preferred demotion node is out of
>>>>> space), This demotion order is also inconsistent with the page
>>>>> allocation fallback order when all the nodes in a higher tier are
>>>>> out of space: The page allocation can fall back to any node from
>>>>> any lower tier, whereas the demotion order doesn't allow that.
>>>>>
>>>>> The current kernel also don't provide any interfaces for the
>>>>> userspace to learn about the memory tier hierarchy in order to
>>>>> optimize its memory allocations.
>>>>>
>>>>> This patch series address the above by defining memory tiers explicitly.
>>>>>
>>>>> This patch adds below sysfs interface which is read-only and
>>>>> can be used to read nodes available in specific tier.
>>>>>
>>>>> /sys/devices/system/memtier/memtierN/nodelist
>>>>>
>>>>> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
>>>>> lowest tier. The absolute value of a tier id number has no specific
>>>>> meaning. what matters is the relative order of the tier id numbers.
>>>>>
>>>>> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
>>>>> Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
>>>>> nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
>>>>>
>>>>> Default memory tier can be read from,
>>>>> /sys/devices/system/memtier/default_tier
>>>>>
>>>>> Max memory tier can be read from,
>>>>> /sys/devices/system/memtier/max_tiers
>>>>>
>>>>> This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
>>>>>
>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>
>>>> IMHO, we should change the kernel internal implementation firstly, then
>>>> implement the kerne/user space interface.  That is, make memory tier
>>>> explicit inside kernel, then expose it to user space.
>>>
>>> Why ignore this comment for v5?  If you don't agree, please respond me.
>>>
>>
>> I am not sure what benefit such a rearrange would bring in? Right now I
>> am writing the series from the point of view of introducing all the
>> plumbing and them switching the existing demotion logic to use the new
>> infrastructure. Redoing the code to hide all the userspace sysfs till we
>> switch the demotion logic to use the new infrastructure doesn't really
>> bring any additional clarity to patch review and would require me to
>> redo the series with a lot of conflicts across the patches in the patchset.
> 
> IMHO, we shouldn't introduce regression even in the middle of a
> patchset.  Each step should only rely on previous patches in the series
> to work correctly.  In your current way of organization, after patch
> [1/7], on a system with 2 memory tiers, the user space interface will
> output wrong information (only 1 memory tier).  So I think the correct
> way is to make it right inside the kenrel firstly, then expose the right
> information to user space.
>

The patchset doesn't add additional tier until "mm/demotion/dax/kmem: 
Set node's memory tier to MEMORY_TIER_PMEM". ie, there is no additional 
tiers done till all the demotion logic is in place. So even if the 
system got dax/kmem, the support for adding dax/kmem as a memory tier 
comes later in the patch series.


-aneesh
Aneesh Kumar K.V June 6, 2022, 6:27 a.m. UTC | #7
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 6/6/22 11:03 AM, Ying Huang wrote:
>> On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
>>> On 6/6/22 8:19 AM, Ying Huang wrote:
>>>> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>>>>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>>
>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>> demotion path relationship between NUMA nodes, which is created
>>>>>> during the kernel initialization and updated when a NUMA node is
>>>>>> hot-added or hot-removed.  The current implementation puts all
>>>>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>>>>> tier-by-tier by establishing the per-node demotion targets based
>>>>>> on the distances between nodes.
>>>>>>
>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>> several important use cases,
>>>>>>
>>>>>> The current tier initialization code always initializes
>>>>>> each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>> a virtual machine) and should be put into a higher tier.
>>>>>>
>>>>>> The current tier hierarchy always puts CPU nodes into the top
>>>>>> tier. But on a system with HBM or GPU devices, the
>>>>>> memory-only NUMA nodes mapping these devices should be in the
>>>>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>>>>> next lower tier.
>>>>>>
>>>>>> With current kernel higher tier node can only be demoted to selected nodes on the
>>>>>> next lower tier as defined by the demotion path, not any other
>>>>>> node from any lower tier.  This strict, hard-coded demotion order
>>>>>> does not work in all use cases (e.g. some use cases may want to
>>>>>> allow cross-socket demotion to another node in the same demotion
>>>>>> tier as a fallback when the preferred demotion node is out of
>>>>>> space), This demotion order is also inconsistent with the page
>>>>>> allocation fallback order when all the nodes in a higher tier are
>>>>>> out of space: The page allocation can fall back to any node from
>>>>>> any lower tier, whereas the demotion order doesn't allow that.
>>>>>>
>>>>>> The current kernel also don't provide any interfaces for the
>>>>>> userspace to learn about the memory tier hierarchy in order to
>>>>>> optimize its memory allocations.
>>>>>>
>>>>>> This patch series address the above by defining memory tiers explicitly.
>>>>>>
>>>>>> This patch adds below sysfs interface which is read-only and
>>>>>> can be used to read nodes available in specific tier.
>>>>>>
>>>>>> /sys/devices/system/memtier/memtierN/nodelist
>>>>>>
>>>>>> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
>>>>>> lowest tier. The absolute value of a tier id number has no specific
>>>>>> meaning. what matters is the relative order of the tier id numbers.
>>>>>>
>>>>>> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
>>>>>> Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
>>>>>> nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
>>>>>>
>>>>>> Default memory tier can be read from,
>>>>>> /sys/devices/system/memtier/default_tier
>>>>>>
>>>>>> Max memory tier can be read from,
>>>>>> /sys/devices/system/memtier/max_tiers
>>>>>>
>>>>>> This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
>>>>>>
>>>>>> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
>>>>>>
>>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>
>>>>> IMHO, we should change the kernel internal implementation firstly, then
>>>>> implement the kerne/user space interface.  That is, make memory tier
>>>>> explicit inside kernel, then expose it to user space.
>>>>
>>>> Why ignore this comment for v5?  If you don't agree, please respond me.
>>>>
>>>
>>> I am not sure what benefit such a rearrange would bring in? Right now I
>>> am writing the series from the point of view of introducing all the
>>> plumbing and them switching the existing demotion logic to use the new
>>> infrastructure. Redoing the code to hide all the userspace sysfs till we
>>> switch the demotion logic to use the new infrastructure doesn't really
>>> bring any additional clarity to patch review and would require me to
>>> redo the series with a lot of conflicts across the patches in the patchset.
>> 
>> IMHO, we shouldn't introduce regression even in the middle of a
>> patchset.  Each step should only rely on previous patches in the series
>> to work correctly.  In your current way of organization, after patch
>> [1/7], on a system with 2 memory tiers, the user space interface will
>> output wrong information (only 1 memory tier).  So I think the correct
>> way is to make it right inside the kenrel firstly, then expose the right
>> information to user space.
>>
>
> The patchset doesn't add additional tier until "mm/demotion/dax/kmem: 
> Set node's memory tier to MEMORY_TIER_PMEM". ie, there is no additional 
> tiers done till all the demotion logic is in place. So even if the 
> system got dax/kmem, the support for adding dax/kmem as a memory tier 
> comes later in the patch series.

Let me clarify this a bit more. This patchset doesn't change the
existing kernel behavior till "mm/demotion: Build demotion targets
based on explicit memory tiers". So there is no regression till then.
It adds a parallel framework (memory tiers to the existing demotion
logic).

I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
with two memory tiers (DRAM and pmem) the demotion continues to work
as expected after patch 3 ("mm/demotion: Build demotion targets based on
explicit memory tiers"). With that, there will not be any regression in
between the patch series.

-aneesh
Huang, Ying June 6, 2022, 7:53 a.m. UTC | #8
On Mon, 2022-06-06 at 11:57 +0530, Aneesh Kumar K.V wrote:
> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> 
> > On 6/6/22 11:03 AM, Ying Huang wrote:
> > > On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
> > > > On 6/6/22 8:19 AM, Ying Huang wrote:
> > > > > On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> > > > > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > > > > > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > > > > 
> > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > demotion path relationship between NUMA nodes, which is created
> > > > > > > during the kernel initialization and updated when a NUMA node is
> > > > > > > hot-added or hot-removed.  The current implementation puts all
> > > > > > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > > > > > tier-by-tier by establishing the per-node demotion targets based
> > > > > > > on the distances between nodes.
> > > > > > > 
> > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > several important use cases,
> > > > > > > 
> > > > > > > The current tier initialization code always initializes
> > > > > > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > a virtual machine) and should be put into a higher tier.
> > > > > > > 
> > > > > > > The current tier hierarchy always puts CPU nodes into the top
> > > > > > > tier. But on a system with HBM or GPU devices, the
> > > > > > > memory-only NUMA nodes mapping these devices should be in the
> > > > > > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > > > > > next lower tier.
> > > > > > > 
> > > > > > > With current kernel higher tier node can only be demoted to selected nodes on the
> > > > > > > next lower tier as defined by the demotion path, not any other
> > > > > > > node from any lower tier.  This strict, hard-coded demotion order
> > > > > > > does not work in all use cases (e.g. some use cases may want to
> > > > > > > allow cross-socket demotion to another node in the same demotion
> > > > > > > tier as a fallback when the preferred demotion node is out of
> > > > > > > space), This demotion order is also inconsistent with the page
> > > > > > > allocation fallback order when all the nodes in a higher tier are
> > > > > > > out of space: The page allocation can fall back to any node from
> > > > > > > any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > 
> > > > > > > The current kernel also don't provide any interfaces for the
> > > > > > > userspace to learn about the memory tier hierarchy in order to
> > > > > > > optimize its memory allocations.
> > > > > > > 
> > > > > > > This patch series address the above by defining memory tiers explicitly.
> > > > > > > 
> > > > > > > This patch adds below sysfs interface which is read-only and
> > > > > > > can be used to read nodes available in specific tier.
> > > > > > > 
> > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > 
> > > > > > > Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> > > > > > > lowest tier. The absolute value of a tier id number has no specific
> > > > > > > meaning. what matters is the relative order of the tier id numbers.
> > > > > > > 
> > > > > > > All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
> > > > > > > nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
> > > > > > > 
> > > > > > > Default memory tier can be read from,
> > > > > > > /sys/devices/system/memtier/default_tier
> > > > > > > 
> > > > > > > Max memory tier can be read from,
> > > > > > > /sys/devices/system/memtier/max_tiers
> > > > > > > 
> > > > > > > This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
> > > > > > > 
> > > > > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> > > > > > > 
> > > > > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > > > > 
> > > > > > IMHO, we should change the kernel internal implementation firstly, then
> > > > > > implement the kerne/user space interface.  That is, make memory tier
> > > > > > explicit inside kernel, then expose it to user space.
> > > > > 
> > > > > Why ignore this comment for v5?  If you don't agree, please respond me.
> > > > > 
> > > > 
> > > > I am not sure what benefit such a rearrange would bring in? Right now I
> > > > am writing the series from the point of view of introducing all the
> > > > plumbing and them switching the existing demotion logic to use the new
> > > > infrastructure. Redoing the code to hide all the userspace sysfs till we
> > > > switch the demotion logic to use the new infrastructure doesn't really
> > > > bring any additional clarity to patch review and would require me to
> > > > redo the series with a lot of conflicts across the patches in the patchset.
> > > 
> > > IMHO, we shouldn't introduce regression even in the middle of a
> > > patchset.  Each step should only rely on previous patches in the series
> > > to work correctly.  In your current way of organization, after patch
> > > [1/7], on a system with 2 memory tiers, the user space interface will
> > > output wrong information (only 1 memory tier).  So I think the correct
> > > way is to make it right inside the kenrel firstly, then expose the right
> > > information to user space.
> > > 
> > 
> > The patchset doesn't add additional tier until "mm/demotion/dax/kmem: 
> > Set node's memory tier to MEMORY_TIER_PMEM". ie, there is no additional 
> > tiers done till all the demotion logic is in place. So even if the 
> > system got dax/kmem, the support for adding dax/kmem as a memory tier 
> > comes later in the patch series.
> 
> Let me clarify this a bit more. This patchset doesn't change the
> existing kernel behavior till "mm/demotion: Build demotion targets
> based on explicit memory tiers". So there is no regression till then.
> It adds a parallel framework (memory tiers to the existing demotion
> logic).
> 
> I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
> MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
> with two memory tiers (DRAM and pmem) the demotion continues to work
> as expected after patch 3 ("mm/demotion: Build demotion targets based on
> explicit memory tiers"). With that, there will not be any regression in
> between the patch series.
> 

Thanks!  Please do that.  And I think you can add sysfs interface after
that patch too.  That is, in [1/7]

+struct memory_tier {
+	nodemask_t nodelist;
+};

And struct device can be added after the kernel has switched the
implementation based on explicit memory tiers.

+struct memory_tier {
+	struct device dev;
+	nodemask_t nodelist;
+};

But I don't think it's a good idea to have "struct device" embedded in
"struct memory_tier".  We don't have "struct device" embedded in "struct
pgdata_list"...

Best Regards,
Huang, Ying
Aneesh Kumar K.V June 6, 2022, 8:01 a.m. UTC | #9
On 6/6/22 1:23 PM, Ying Huang wrote:
> On Mon, 2022-06-06 at 11:57 +0530, Aneesh Kumar K.V wrote:
>> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>>
>>> On 6/6/22 11:03 AM, Ying Huang wrote:
>>>> On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
>>>>> On 6/6/22 8:19 AM, Ying Huang wrote:
>>>>>> On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
>>>>>>> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
>>>>>>>> From: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>>>>
>>>>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>>>>> demotion path relationship between NUMA nodes, which is created
>>>>>>>> during the kernel initialization and updated when a NUMA node is
>>>>>>>> hot-added or hot-removed.  The current implementation puts all
>>>>>>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>>>>>>> tier-by-tier by establishing the per-node demotion targets based
>>>>>>>> on the distances between nodes.
>>>>>>>>
>>>>>>>> This current memory tier kernel interface needs to be improved for
>>>>>>>> several important use cases,
>>>>>>>>
>>>>>>>> The current tier initialization code always initializes
>>>>>>>> each memory-only NUMA node into a lower tier.  But a memory-only
>>>>>>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>>>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>>>>> a virtual machine) and should be put into a higher tier.
>>>>>>>>
>>>>>>>> The current tier hierarchy always puts CPU nodes into the top
>>>>>>>> tier. But on a system with HBM or GPU devices, the
>>>>>>>> memory-only NUMA nodes mapping these devices should be in the
>>>>>>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>>>>>>> next lower tier.
>>>>>>>>
>>>>>>>> With current kernel higher tier node can only be demoted to selected nodes on the
>>>>>>>> next lower tier as defined by the demotion path, not any other
>>>>>>>> node from any lower tier.  This strict, hard-coded demotion order
>>>>>>>> does not work in all use cases (e.g. some use cases may want to
>>>>>>>> allow cross-socket demotion to another node in the same demotion
>>>>>>>> tier as a fallback when the preferred demotion node is out of
>>>>>>>> space), This demotion order is also inconsistent with the page
>>>>>>>> allocation fallback order when all the nodes in a higher tier are
>>>>>>>> out of space: The page allocation can fall back to any node from
>>>>>>>> any lower tier, whereas the demotion order doesn't allow that.
>>>>>>>>
>>>>>>>> The current kernel also don't provide any interfaces for the
>>>>>>>> userspace to learn about the memory tier hierarchy in order to
>>>>>>>> optimize its memory allocations.
>>>>>>>>
>>>>>>>> This patch series address the above by defining memory tiers explicitly.
>>>>>>>>
>>>>>>>> This patch adds below sysfs interface which is read-only and
>>>>>>>> can be used to read nodes available in specific tier.
>>>>>>>>
>>>>>>>> /sys/devices/system/memtier/memtierN/nodelist
>>>>>>>>
>>>>>>>> Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
>>>>>>>> lowest tier. The absolute value of a tier id number has no specific
>>>>>>>> meaning. what matters is the relative order of the tier id numbers.
>>>>>>>>
>>>>>>>> All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
>>>>>>>> Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
>>>>>>>> nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
>>>>>>>>
>>>>>>>> Default memory tier can be read from,
>>>>>>>> /sys/devices/system/memtier/default_tier
>>>>>>>>
>>>>>>>> Max memory tier can be read from,
>>>>>>>> /sys/devices/system/memtier/max_tiers
>>>>>>>>
>>>>>>>> This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
>>>>>>>>
>>>>>>>> [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
>>>>>>>>
>>>>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>>>>
>>>>>>> IMHO, we should change the kernel internal implementation firstly, then
>>>>>>> implement the kerne/user space interface.  That is, make memory tier
>>>>>>> explicit inside kernel, then expose it to user space.
>>>>>>
>>>>>> Why ignore this comment for v5?  If you don't agree, please respond me.
>>>>>>
>>>>>
>>>>> I am not sure what benefit such a rearrange would bring in? Right now I
>>>>> am writing the series from the point of view of introducing all the
>>>>> plumbing and them switching the existing demotion logic to use the new
>>>>> infrastructure. Redoing the code to hide all the userspace sysfs till we
>>>>> switch the demotion logic to use the new infrastructure doesn't really
>>>>> bring any additional clarity to patch review and would require me to
>>>>> redo the series with a lot of conflicts across the patches in the patchset.
>>>>
>>>> IMHO, we shouldn't introduce regression even in the middle of a
>>>> patchset.  Each step should only rely on previous patches in the series
>>>> to work correctly.  In your current way of organization, after patch
>>>> [1/7], on a system with 2 memory tiers, the user space interface will
>>>> output wrong information (only 1 memory tier).  So I think the correct
>>>> way is to make it right inside the kenrel firstly, then expose the right
>>>> information to user space.
>>>>
>>>
>>> The patchset doesn't add additional tier until "mm/demotion/dax/kmem:
>>> Set node's memory tier to MEMORY_TIER_PMEM". ie, there is no additional
>>> tiers done till all the demotion logic is in place. So even if the
>>> system got dax/kmem, the support for adding dax/kmem as a memory tier
>>> comes later in the patch series.
>>
>> Let me clarify this a bit more. This patchset doesn't change the
>> existing kernel behavior till "mm/demotion: Build demotion targets
>> based on explicit memory tiers". So there is no regression till then.
>> It adds a parallel framework (memory tiers to the existing demotion
>> logic).
>>
>> I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
>> MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
>> with two memory tiers (DRAM and pmem) the demotion continues to work
>> as expected after patch 3 ("mm/demotion: Build demotion targets based on
>> explicit memory tiers"). With that, there will not be any regression in
>> between the patch series.
>>
> 
> Thanks!  Please do that.  And I think you can add sysfs interface after
> that patch too.  That is, in [1/7]
> 

I am not sure why you insist on moving sysfs interfaces later. They are 
introduced based on the helper added. It make patch review easier to 
look at both the helpers and the user of the helper together in a patch.

> +struct memory_tier {
> +	nodemask_t nodelist;
> +};
> 
> And struct device can be added after the kernel has switched the
> implementation based on explicit memory tiers.
> 
> +struct memory_tier {
> +	struct device dev;
> +	nodemask_t nodelist;
> +};
> 


Can you elaborate on this? or possibly review the v5 series indicating 
what change you are suggesting here?


> But I don't think it's a good idea to have "struct device" embedded in
> "struct memory_tier".  We don't have "struct device" embedded in "struct
> pgdata_list"...
> 

I avoided creating an array for memory_tier (memory_tier[]) so that we 
can keep it dynamic. Keeping dev embedded in struct memory_tier simplify 
the life cycle management of that dynamic list. We free the struct 
memory_tier allocation via device release function (memtier->dev.release 
= memory_tier_device_release )

Why do you think it is not a good idea?

-aneesh
Huang, Ying June 6, 2022, 8:52 a.m. UTC | #10
On Mon, 2022-06-06 at 13:31 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 1:23 PM, Ying Huang wrote:
> > On Mon, 2022-06-06 at 11:57 +0530, Aneesh Kumar K.V wrote:
> > > Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
> > > 
> > > > On 6/6/22 11:03 AM, Ying Huang wrote:
> > > > > On Mon, 2022-06-06 at 09:26 +0530, Aneesh Kumar K V wrote:
> > > > > > On 6/6/22 8:19 AM, Ying Huang wrote:
> > > > > > > On Thu, 2022-06-02 at 14:07 +0800, Ying Huang wrote:
> > > > > > > > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > > > > > > > > From: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > > > > > > 
> > > > > > > > > In the current kernel, memory tiers are defined implicitly via a
> > > > > > > > > demotion path relationship between NUMA nodes, which is created
> > > > > > > > > during the kernel initialization and updated when a NUMA node is
> > > > > > > > > hot-added or hot-removed.  The current implementation puts all
> > > > > > > > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > > > > > > > tier-by-tier by establishing the per-node demotion targets based
> > > > > > > > > on the distances between nodes.
> > > > > > > > > 
> > > > > > > > > This current memory tier kernel interface needs to be improved for
> > > > > > > > > several important use cases,
> > > > > > > > > 
> > > > > > > > > The current tier initialization code always initializes
> > > > > > > > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > > > > > > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > > > > > > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > > > > > > a virtual machine) and should be put into a higher tier.
> > > > > > > > > 
> > > > > > > > > The current tier hierarchy always puts CPU nodes into the top
> > > > > > > > > tier. But on a system with HBM or GPU devices, the
> > > > > > > > > memory-only NUMA nodes mapping these devices should be in the
> > > > > > > > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > > > > > > > next lower tier.
> > > > > > > > > 
> > > > > > > > > With current kernel higher tier node can only be demoted to selected nodes on the
> > > > > > > > > next lower tier as defined by the demotion path, not any other
> > > > > > > > > node from any lower tier.  This strict, hard-coded demotion order
> > > > > > > > > does not work in all use cases (e.g. some use cases may want to
> > > > > > > > > allow cross-socket demotion to another node in the same demotion
> > > > > > > > > tier as a fallback when the preferred demotion node is out of
> > > > > > > > > space), This demotion order is also inconsistent with the page
> > > > > > > > > allocation fallback order when all the nodes in a higher tier are
> > > > > > > > > out of space: The page allocation can fall back to any node from
> > > > > > > > > any lower tier, whereas the demotion order doesn't allow that.
> > > > > > > > > 
> > > > > > > > > The current kernel also don't provide any interfaces for the
> > > > > > > > > userspace to learn about the memory tier hierarchy in order to
> > > > > > > > > optimize its memory allocations.
> > > > > > > > > 
> > > > > > > > > This patch series address the above by defining memory tiers explicitly.
> > > > > > > > > 
> > > > > > > > > This patch adds below sysfs interface which is read-only and
> > > > > > > > > can be used to read nodes available in specific tier.
> > > > > > > > > 
> > > > > > > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > > > > > > 
> > > > > > > > > Tier 0 is the highest tier, while tier MAX_MEMORY_TIERS - 1 is the
> > > > > > > > > lowest tier. The absolute value of a tier id number has no specific
> > > > > > > > > meaning. what matters is the relative order of the tier id numbers.
> > > > > > > > > 
> > > > > > > > > All the tiered memory code is guarded by CONFIG_TIERED_MEMORY.
> > > > > > > > > Default number of memory tiers are MAX_MEMORY_TIERS(3). All the
> > > > > > > > > nodes are by default assigned to DEFAULT_MEMORY_TIER(1).
> > > > > > > > > 
> > > > > > > > > Default memory tier can be read from,
> > > > > > > > > /sys/devices/system/memtier/default_tier
> > > > > > > > > 
> > > > > > > > > Max memory tier can be read from,
> > > > > > > > > /sys/devices/system/memtier/max_tiers
> > > > > > > > > 
> > > > > > > > > This patch implements the RFC spec sent by Wei Xu <weixugc@google.com> at [1].
> > > > > > > > > 
> > > > > > > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u-DGLcKRVDnChN9ZhxPkfxQvz9Sb93kVoX_4J2oiJSkUw@mail.gmail.com/
> > > > > > > > > 
> > > > > > > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > > > > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > > > > > > 
> > > > > > > > IMHO, we should change the kernel internal implementation firstly, then
> > > > > > > > implement the kerne/user space interface.  That is, make memory tier
> > > > > > > > explicit inside kernel, then expose it to user space.
> > > > > > > 
> > > > > > > Why ignore this comment for v5?  If you don't agree, please respond me.
> > > > > > > 
> > > > > > 
> > > > > > I am not sure what benefit such a rearrange would bring in? Right now I
> > > > > > am writing the series from the point of view of introducing all the
> > > > > > plumbing and them switching the existing demotion logic to use the new
> > > > > > infrastructure. Redoing the code to hide all the userspace sysfs till we
> > > > > > switch the demotion logic to use the new infrastructure doesn't really
> > > > > > bring any additional clarity to patch review and would require me to
> > > > > > redo the series with a lot of conflicts across the patches in the patchset.
> > > > > 
> > > > > IMHO, we shouldn't introduce regression even in the middle of a
> > > > > patchset.  Each step should only rely on previous patches in the series
> > > > > to work correctly.  In your current way of organization, after patch
> > > > > [1/7], on a system with 2 memory tiers, the user space interface will
> > > > > output wrong information (only 1 memory tier).  So I think the correct
> > > > > way is to make it right inside the kenrel firstly, then expose the right
> > > > > information to user space.
> > > > > 
> > > > 
> > > > The patchset doesn't add additional tier until "mm/demotion/dax/kmem:
> > > > Set node's memory tier to MEMORY_TIER_PMEM". ie, there is no additional
> > > > tiers done till all the demotion logic is in place. So even if the
> > > > system got dax/kmem, the support for adding dax/kmem as a memory tier
> > > > comes later in the patch series.
> > > 
> > > Let me clarify this a bit more. This patchset doesn't change the
> > > existing kernel behavior till "mm/demotion: Build demotion targets
> > > based on explicit memory tiers". So there is no regression till then.
> > > It adds a parallel framework (memory tiers to the existing demotion
> > > logic).
> > > 
> > > I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
> > > MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
> > > with two memory tiers (DRAM and pmem) the demotion continues to work
> > > as expected after patch 3 ("mm/demotion: Build demotion targets based on
> > > explicit memory tiers"). With that, there will not be any regression in
> > > between the patch series.
> > > 
> > 
> > Thanks!  Please do that.  And I think you can add sysfs interface after
> > that patch too.  That is, in [1/7]
> > 
> 
> I am not sure why you insist on moving sysfs interfaces later. They are 
> introduced based on the helper added. It make patch review easier to 
> look at both the helpers and the user of the helper together in a patch.

Yes.  We should introduce a function and its user in one patch for
review.  But this doesn't mean that we should introduce the user space
interface as the first step.  I think the user space interface should
output correct information when we expose it.

> > +struct memory_tier {
> > +	nodemask_t nodelist;
> > +};
> > 
> > And struct device can be added after the kernel has switched the
> > implementation based on explicit memory tiers.
> > 
> > +struct memory_tier {
> > +	struct device dev;
> > +	nodemask_t nodelist;
> > +};
> > 
> 
> 
> Can you elaborate on this? or possibly review the v5 series indicating 
> what change you are suggesting here?
> 
> 
> > But I don't think it's a good idea to have "struct device" embedded in
> > "struct memory_tier".  We don't have "struct device" embedded in "struct
> > pgdata_list"...
> > 
> 
> I avoided creating an array for memory_tier (memory_tier[]) so that we 
> can keep it dynamic. Keeping dev embedded in struct memory_tier simplify 
> the life cycle management of that dynamic list. We free the struct 
> memory_tier allocation via device release function (memtier->dev.release 
> = memory_tier_device_release )
> 
> Why do you think it is not a good idea?

I think that we shouldn't bind our kernel internal implementation with
user space interface too much.  Yes.  We can expose kernel internal
implementation to user space in a direct way.  I suggest you to follow
the style of "struct pglist_data" and "struct node".  If we decouple
"struct memory_tier" and "struct memory_tier_dev" (or some other name),
we can refer to "struct memory_tier" without depending on all device
core.  Memory tier should be accessible inside the kernel even without a
user interface.  And memory tier isn't a device in concept.

For life cycle management, I think that we can do that without sysfs
too.

Best Regards,
Huang, Ying
Aneesh Kumar K.V June 6, 2022, 9:02 a.m. UTC | #11
On 6/6/22 2:22 PM, Ying Huang wrote:
....
>>>> I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
>>>> MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
>>>> with two memory tiers (DRAM and pmem) the demotion continues to work
>>>> as expected after patch 3 ("mm/demotion: Build demotion targets based on
>>>> explicit memory tiers"). With that, there will not be any regression in
>>>> between the patch series.
>>>>
>>>
>>> Thanks!  Please do that.  And I think you can add sysfs interface after
>>> that patch too.  That is, in [1/7]
>>>
>>
>> I am not sure why you insist on moving sysfs interfaces later. They are
>> introduced based on the helper added. It make patch review easier to
>> look at both the helpers and the user of the helper together in a patch.
> 
> Yes.  We should introduce a function and its user in one patch for
> review.  But this doesn't mean that we should introduce the user space
> interface as the first step.  I think the user space interface should
> output correct information when we expose it.
> 

If you look at this patchset we are not exposing any wrong information.

patch 1 -> adds ability to register the memory tiers and expose details 
of registered memory tier. At this point the patchset only support DRAM 
tier and hence only one tier is shown

patch 2 -> adds per node memtier attribute. So only DRAM nodes shows the 
details, because the patchset yet has not introduced a slower memory 
tier like PMEM.

patch 4 -> introducing demotion. Will make that patch 5

patch 5 -> add dax kmem numa nodes as slower memory tier. Now this 
becomes patch 4 at which point we will correctly show two memory tiers 
in the system.


>>> +struct memory_tier {
>>> +	nodemask_t nodelist;
>>> +};
>>>
>>> And struct device can be added after the kernel has switched the
>>> implementation based on explicit memory tiers.
>>>
>>> +struct memory_tier {
>>> +	struct device dev;
>>> +	nodemask_t nodelist;
>>> +};
>>>
>>
>>
>> Can you elaborate on this? or possibly review the v5 series indicating
>> what change you are suggesting here?
>>
>>
>>> But I don't think it's a good idea to have "struct device" embedded in
>>> "struct memory_tier".  We don't have "struct device" embedded in "struct
>>> pgdata_list"...
>>>
>>
>> I avoided creating an array for memory_tier (memory_tier[]) so that we
>> can keep it dynamic. Keeping dev embedded in struct memory_tier simplify
>> the life cycle management of that dynamic list. We free the struct
>> memory_tier allocation via device release function (memtier->dev.release
>> = memory_tier_device_release )
>>
>> Why do you think it is not a good idea?
> 
> I think that we shouldn't bind our kernel internal implementation with
> user space interface too much.  Yes.  We can expose kernel internal
> implementation to user space in a direct way.  I suggest you to follow
> the style of "struct pglist_data" and "struct node".  If we decouple
> "struct memory_tier" and "struct memory_tier_dev" (or some other name),
> we can refer to "struct memory_tier" without depending on all device
> core.  Memory tier should be accessible inside the kernel even without a
> user interface.  And memory tier isn't a device in concept.
> 

memory_tiers are different from pglist_data and struct node in that we 
also allow the creation of them from userspace. That is the life time of 
a memory tier is driven from userspace and it is much easier to manage 
them via sysfs file lifetime mechanism rather than inventing an 
independent and more complex way of doing the same.

> For life cycle management, I think that we can do that without sysfs
> too.
> 

unless there are specific details that you think will be broken by 
embedding struct device inside struct memory_tier, IMHO I still consider 
the embedded implementation much simpler and in accordance with other 
kernel design patterns.

-aneesh
Huang, Ying June 8, 2022, 1:24 a.m. UTC | #12
On Mon, 2022-06-06 at 14:32 +0530, Aneesh Kumar K V wrote:
> On 6/6/22 2:22 PM, Ying Huang wrote:
> ....
> > > > > I can move the patch "mm/demotion/dax/kmem: Set node's memory tier to
> > > > > MEMORY_TIER_PMEM" before switching the demotion logic so that on systems
> > > > > with two memory tiers (DRAM and pmem) the demotion continues to work
> > > > > as expected after patch 3 ("mm/demotion: Build demotion targets based on
> > > > > explicit memory tiers"). With that, there will not be any regression in
> > > > > between the patch series.
> > > > > 
> > > > 
> > > > Thanks!  Please do that.  And I think you can add sysfs interface after
> > > > that patch too.  That is, in [1/7]
> > > > 
> > > 
> > > I am not sure why you insist on moving sysfs interfaces later. They are
> > > introduced based on the helper added. It make patch review easier to
> > > look at both the helpers and the user of the helper together in a patch.
> > 
> > Yes.  We should introduce a function and its user in one patch for
> > review.  But this doesn't mean that we should introduce the user space
> > interface as the first step.  I think the user space interface should
> > output correct information when we expose it.
> > 
> 
> If you look at this patchset we are not exposing any wrong information.
> 
> patch 1 -> adds ability to register the memory tiers and expose details 
> of registered memory tier. At this point the patchset only support DRAM 
> tier and hence only one tier is shown

But inside kernel, we actually work with 2 tiers and demote/prmote pages
between them.  With the information from your interface, users would
think that there is no any demotion/promotion in kernel because there's
only 1 tier.

> patch 2 -> adds per node memtier attribute. So only DRAM nodes shows the 
> details, because the patchset yet has not introduced a slower memory 
> tier like PMEM.
> 
> patch 4 -> introducing demotion. Will make that patch 5
> 
> patch 5 -> add dax kmem numa nodes as slower memory tier. Now this 
> becomes patch 4 at which point we will correctly show two memory tiers 
> in the system.
> 
> 
> > > > +struct memory_tier {
> > > > +	nodemask_t nodelist;
> > > > +};
> > > > 
> > > > And struct device can be added after the kernel has switched the
> > > > implementation based on explicit memory tiers.
> > > > 
> > > > +struct memory_tier {
> > > > +	struct device dev;
> > > > +	nodemask_t nodelist;
> > > > +};
> > > > 
> > > 
> > > 
> > > Can you elaborate on this? or possibly review the v5 series indicating
> > > what change you are suggesting here?
> > > 
> > > 
> > > > But I don't think it's a good idea to have "struct device" embedded in
> > > > "struct memory_tier".  We don't have "struct device" embedded in "struct
> > > > pgdata_list"...
> > > > 
> > > 
> > > I avoided creating an array for memory_tier (memory_tier[]) so that we
> > > can keep it dynamic. Keeping dev embedded in struct memory_tier simplify
> > > the life cycle management of that dynamic list. We free the struct
> > > memory_tier allocation via device release function (memtier->dev.release
> > > = memory_tier_device_release )
> > > 
> > > Why do you think it is not a good idea?
> > 
> > I think that we shouldn't bind our kernel internal implementation with
> > user space interface too much.  Yes.  We can expose kernel internal
> > implementation to user space in a direct way.  I suggest you to follow
> > the style of "struct pglist_data" and "struct node".  If we decouple
> > "struct memory_tier" and "struct memory_tier_dev" (or some other name),
> > we can refer to "struct memory_tier" without depending on all device
> > core.  Memory tier should be accessible inside the kernel even without a
> > user interface.  And memory tier isn't a device in concept.
> > 
> 
> memory_tiers are different from pglist_data and struct node in that we 
> also allow the creation of them from userspace.

I don't think that there's much difference.  struct pglist_data and
struct node can be created/destroyed dynamically too.  Please take a
look at

  __try_online_node()
  register_one_node()
  try_offline_node()
  unregister_one_node()

> That is the life time of 
> a memory tier is driven from userspace and it is much easier to manage 
> them via sysfs file lifetime mechanism rather than inventing an 
> independent and more complex way of doing the same.

You needs to manage the lifetime of struct memory_tier in kernel too. 
Because there are kernel users.  And even if you use device core
lifetime mechanism, you don't need to embed struct device in struct
memory_tier too, you can free "separate" struct memory_tier in "release"
callback of struct device.

> > For life cycle management, I think that we can do that without sysfs
> > too.
> > 
> 
> unless there are specific details that you think will be broken by 
> embedding struct device inside struct memory_tier, IMHO I still consider 
> the embedded implementation much simpler and in accordance with other 
> kernel design patterns.

In concept, struct memory_tier isn't a device.  Although we expose it as
a device in sysfs.  That's just an implementation detail.  So I think
it's better to make struct memory_tier independent of struct device if
possible.

Via not embeding struct device in struct memory_tier, it's much easier
to dereference struct memory_tier directly in inline function in ".h". 
We don't need to introduce one accessor function for each field of
struct memory_tier for that.

Best Regards,
Huang, Ying
Huang, Ying June 8, 2022, 7:16 a.m. UTC | #13
On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:

[snip]

> 
> +static int __init memory_tier_init(void)
> +{
> +	int ret;
> +
> +	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> +	if (ret)
> +		panic("%s() failed to register subsystem: %d\n", __func__, ret);

I don't think we should go panic for failing to register subsys and
device for memory tiers.  Just pr_err() should be enough.

Best Regards,
Huang, Ying

> +
> +	/*
> +	 * Register only default memory tier to hide all empty
> +	 * memory tier from sysfs.
> +	 */
> +	ret = register_memory_tier(DEFAULT_MEMORY_TIER);
> +	if (ret)
> +		panic("%s() failed to register memory tier: %d\n", __func__, ret);
> +
> +	/*
> +	 * CPU only nodes are not part of memoty tiers.
> +	 */
> +	memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
> +
> +	return 0;
> +}
> +subsys_initcall(memory_tier_init);
> +
> +#endif	/* CONFIG_TIERED_MEMORY */
Aneesh Kumar K.V June 8, 2022, 8:24 a.m. UTC | #14
On 6/8/22 12:46 PM, Ying Huang wrote:
> On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> 
> [snip]
> 
>>
>> +static int __init memory_tier_init(void)
>> +{
>> +	int ret;
>> +
>> +	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
>> +	if (ret)
>> +		panic("%s() failed to register subsystem: %d\n", __func__, ret);
> 
> I don't think we should go panic for failing to register subsys and
> device for memory tiers.  Just pr_err() should be enough.
> 

So you are suggesting we continue to work with memory tiers with no 
userspace interface?

>> +
>> +	/*
>> +	 * Register only default memory tier to hide all empty
>> +	 * memory tier from sysfs.
>> +	 */
>> +	ret = register_memory_tier(DEFAULT_MEMORY_TIER);
>> +	if (ret)
>> +		panic("%s() failed to register memory tier: %d\n", __func__, ret);
>> +
>> +	/*
>> +	 * CPU only nodes are not part of memoty tiers.
>> +	 */
>> +	memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
>> +
>> +	return 0;
>> +}
>> +subsys_initcall(memory_tier_init);
>> +
>> +#endif	/* CONFIG_TIERED_MEMORY */
> 
>
Huang, Ying June 8, 2022, 8:27 a.m. UTC | #15
On Wed, 2022-06-08 at 13:54 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:46 PM, Ying Huang wrote:
> > On Fri, 2022-05-27 at 17:55 +0530, Aneesh Kumar K.V wrote:
> > 
> > [snip]
> > 
> > > 
> > > +static int __init memory_tier_init(void)
> > > +{
> > > +	int ret;
> > > +
> > > +	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> > > +	if (ret)
> > > +		panic("%s() failed to register subsystem: %d\n", __func__, ret);
> > 
> > I don't think we should go panic for failing to register subsys and
> > device for memory tiers.  Just pr_err() should be enough.
> > 
> 
> So you are suggesting we continue to work with memory tiers with no 
> userspace interface?

Yes.  We don't need to panic system for this.

Best Regards,
Huang, Ying

> > > +
> > > +	/*
> > > +	 * Register only default memory tier to hide all empty
> > > +	 * memory tier from sysfs.
> > > +	 */
> > > +	ret = register_memory_tier(DEFAULT_MEMORY_TIER);
> > > +	if (ret)
> > > +		panic("%s() failed to register memory tier: %d\n", __func__, ret);
> > > +
> > > +	/*
> > > +	 * CPU only nodes are not part of memoty tiers.
> > > +	 */
> > > +	memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
> > > +
> > > +	return 0;
> > > +}
> > > +subsys_initcall(memory_tier_init);
> > > +
> > > +#endif	/* CONFIG_TIERED_MEMORY */
> > 
> > 
>
diff mbox series

Patch

diff --git a/include/linux/migrate.h b/include/linux/migrate.h
index 90e75d5a54d6..0ec653623565 100644
--- a/include/linux/migrate.h
+++ b/include/linux/migrate.h
@@ -47,17 +47,8 @@  void folio_migrate_copy(struct folio *newfolio, struct folio *folio);
 int folio_migrate_mapping(struct address_space *mapping,
 		struct folio *newfolio, struct folio *folio, int extra_count);
 
-extern bool numa_demotion_enabled;
-extern void migrate_on_reclaim_init(void);
-#ifdef CONFIG_HOTPLUG_CPU
-extern void set_migration_target_nodes(void);
-#else
-static inline void set_migration_target_nodes(void) {}
-#endif
 #else
 
-static inline void set_migration_target_nodes(void) {}
-
 static inline void putback_movable_pages(struct list_head *l) {}
 static inline int migrate_pages(struct list_head *l, new_page_t new,
 		free_page_t free, unsigned long private, enum migrate_mode mode,
@@ -82,7 +73,6 @@  static inline int migrate_huge_page_move_mapping(struct address_space *mapping,
 	return -ENOSYS;
 }
 
-#define numa_demotion_enabled	false
 #endif /* CONFIG_MIGRATION */
 
 #ifdef CONFIG_COMPACTION
@@ -172,15 +162,37 @@  struct migrate_vma {
 int migrate_vma_setup(struct migrate_vma *args);
 void migrate_vma_pages(struct migrate_vma *migrate);
 void migrate_vma_finalize(struct migrate_vma *migrate);
-int next_demotion_node(int node);
+#endif /* CONFIG_MIGRATION */
+
+#ifdef CONFIG_TIERED_MEMORY
+
+extern bool numa_demotion_enabled;
+#define DEFAULT_MEMORY_TIER	1
+
+enum memory_tier_type {
+	MEMORY_TIER_HBM_GPU,
+	MEMORY_TIER_DRAM,
+	MEMORY_TIER_PMEM,
+	MAX_MEMORY_TIERS
+};
 
-#else /* CONFIG_MIGRATION disabled: */
+int next_demotion_node(int node);
 
+extern void migrate_on_reclaim_init(void);
+#ifdef CONFIG_HOTPLUG_CPU
+extern void set_migration_target_nodes(void);
+#else
+static inline void set_migration_target_nodes(void) {}
+#endif
+#else
+#define numa_demotion_enabled	false
 static inline int next_demotion_node(int node)
 {
 	return NUMA_NO_NODE;
 }
 
-#endif /* CONFIG_MIGRATION */
+static inline void set_migration_target_nodes(void) {}
+static inline void migrate_on_reclaim_init(void) {}
+#endif	/* CONFIG_TIERED_MEMORY */
 
 #endif /* _LINUX_MIGRATE_H */
diff --git a/mm/Kconfig b/mm/Kconfig
index 034d87953600..7bfbddef46ed 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -258,6 +258,17 @@  config ARCH_ENABLE_HUGEPAGE_MIGRATION
 config ARCH_ENABLE_THP_MIGRATION
 	bool
 
+config TIERED_MEMORY
+	bool "Support for explicit memory tiers"
+	def_bool y
+	depends on MIGRATION && NUMA
+	help
+	  Support to split nodes into memory tiers explicitly and
+	  to demote pages on reclaim to lower tiers. This option
+	  also exposes sysfs interface to read nodes available in
+	  specific tier and to move specific node among different
+	  possible tiers.
+
 config HUGETLB_PAGE_SIZE_VARIABLE
 	def_bool n
 	help
diff --git a/mm/migrate.c b/mm/migrate.c
index 6c31ee1e1c9b..f28ee93fb017 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -2118,6 +2118,113 @@  int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 #endif /* CONFIG_NUMA_BALANCING */
 #endif /* CONFIG_NUMA */
 
+#ifdef CONFIG_TIERED_MEMORY
+
+struct memory_tier {
+	struct device dev;
+	nodemask_t nodelist;
+};
+
+#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
+
+static struct bus_type memory_tier_subsys = {
+	.name = "memtier",
+	.dev_name = "memtier",
+};
+
+static struct memory_tier *memory_tiers[MAX_MEMORY_TIERS];
+
+static ssize_t nodelist_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	int tier = dev->id;
+
+	return sysfs_emit(buf, "%*pbl\n",
+			  nodemask_pr_args(&memory_tiers[tier]->nodelist));
+
+}
+static DEVICE_ATTR_RO(nodelist);
+
+static struct attribute *memory_tier_dev_attrs[] = {
+	&dev_attr_nodelist.attr,
+	NULL
+};
+
+static const struct attribute_group memory_tier_dev_group = {
+	.attrs = memory_tier_dev_attrs,
+};
+
+static const struct attribute_group *memory_tier_dev_groups[] = {
+	&memory_tier_dev_group,
+	NULL
+};
+
+static void memory_tier_device_release(struct device *dev)
+{
+	struct memory_tier *tier = to_memory_tier(dev);
+
+	kfree(tier);
+}
+
+static int register_memory_tier(int tier)
+{
+	int error;
+
+	memory_tiers[tier] = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!memory_tiers[tier])
+		return -ENOMEM;
+
+	memory_tiers[tier]->dev.id = tier;
+	memory_tiers[tier]->dev.bus = &memory_tier_subsys;
+	memory_tiers[tier]->dev.release = memory_tier_device_release;
+	memory_tiers[tier]->dev.groups = memory_tier_dev_groups;
+	error = device_register(&memory_tiers[tier]->dev);
+
+	if (error) {
+		put_device(&memory_tiers[tier]->dev);
+		memory_tiers[tier] = NULL;
+	}
+
+	return error;
+}
+
+static void unregister_memory_tier(int tier)
+{
+	device_unregister(&memory_tiers[tier]->dev);
+	memory_tiers[tier] = NULL;
+}
+
+static ssize_t
+max_tiers_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
+}
+
+static DEVICE_ATTR_RO(max_tiers);
+
+static ssize_t
+default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", DEFAULT_MEMORY_TIER);
+}
+
+static DEVICE_ATTR_RO(default_tier);
+
+static struct attribute *memoty_tier_attrs[] = {
+	&dev_attr_max_tiers.attr,
+	&dev_attr_default_tier.attr,
+	NULL
+};
+
+static const struct attribute_group memory_tier_attr_group = {
+	.attrs = memoty_tier_attrs,
+};
+
+static const struct attribute_group *memory_tier_attr_groups[] = {
+	&memory_tier_attr_group,
+	NULL,
+};
+
 /*
  * node_demotion[] example:
  *
@@ -2569,3 +2676,30 @@  static int __init numa_init_sysfs(void)
 }
 subsys_initcall(numa_init_sysfs);
 #endif
+
+static int __init memory_tier_init(void)
+{
+	int ret;
+
+	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
+	if (ret)
+		panic("%s() failed to register subsystem: %d\n", __func__, ret);
+
+	/*
+	 * Register only default memory tier to hide all empty
+	 * memory tier from sysfs.
+	 */
+	ret = register_memory_tier(DEFAULT_MEMORY_TIER);
+	if (ret)
+		panic("%s() failed to register memory tier: %d\n", __func__, ret);
+
+	/*
+	 * CPU only nodes are not part of memoty tiers.
+	 */
+	memory_tiers[DEFAULT_MEMORY_TIER]->nodelist = node_states[N_MEMORY];
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);
+
+#endif	/* CONFIG_TIERED_MEMORY */