[v9,1/8] mm/demotion: Add support for explicit memory tiers

Message ID 20220714045351.434957-2-aneesh.kumar@linux.ibm.com (mailing list archive)
State New
Series mm/demotion: Memory tiers and demotion

Commit Message

Aneesh Kumar K.V July 14, 2022, 4:53 a.m. UTC
In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during the kernel initialization and updated when a NUMA node is
hot-added or hot-removed.  The current implementation puts all
nodes with CPU into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

The current tier initialization code always initializes
each memory-only NUMA node into a lower tier.  But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM or GPU devices, the
memory-only NUMA nodes mapping these devices should be in the
top tier, and DRAM nodes with CPUs are better placed in the
next lower tier.

With the current kernel, a higher-tier node can only demote to
selected nodes on the next lower tier, as defined by the demotion
path, not to any other node from any lower tier.  This strict,
hard-coded demotion order does not work in all use cases (e.g. some
use cases may want to allow cross-socket demotion to another node in
the same demotion tier as a fallback when the preferred demotion node
is out of space).  This demotion order is also inconsistent with the
page allocation fallback order when all the nodes in a higher tier
are out of space: page allocation can fall back to any node from
any lower tier, whereas the demotion order doesn't allow that.

The current kernel also doesn't provide any interface for
userspace to learn about the memory tier hierarchy in order to
optimize its memory allocations.

This patch series addresses the above by defining memory tiers explicitly.

This patch introduces explicit memory tiers. The tier ID value
of a memory tier is used to derive the demotion order between
NUMA nodes.

For example, if we have 3 memtiers: memtier100, memtier200, memtier300
then the memory tier order is: memtier300 -> memtier200 -> memtier100
where memtier300 is the highest tier and memtier100 is the lowest tier.

During reclaim, we migrate pages from fast (higher) tiers to slow (lower)
tiers when the fast (higher) tier is under memory pressure.
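
As an illustration only (this is not code from the patch), the ordering
rule above can be sketched in userspace C: the demotion target of a tier
is the registered tier with the largest ID below it. The function name
next_lower_tier and the array representation are assumptions made for
this sketch; the patch itself keeps tiers on a sorted linked list.

```c
#include <stddef.h>

/*
 * Illustrative only: return the ID of the closest lower tier, i.e. the
 * largest registered ID that is smaller than `tier`, or -1 if `tier`
 * is already the lowest one. `tiers` need not be sorted, and IDs are
 * assumed to be non-negative.
 */
int next_lower_tier(const int *tiers, size_t n, int tier)
{
	int best = -1;

	for (size_t i = 0; i < n; i++) {
		if (tiers[i] < tier && tiers[i] > best)
			best = tiers[i];
	}
	return best;
}
```

With the tier IDs 100, 200 and 300 from the example, a lookup for 300
yields 200, a lookup for 200 yields 100, and a lookup for 100 yields -1
because there is nothing left to demote to.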

This patchset introduces 3 memory tiers (memtier100, memtier200 and memtier300)
which are created by different kernel subsystems. The default memory
tier created by the kernel is memtier200. A kernel parameter is provided
to override the default memory tier.

Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com

Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 15 +++++++
 mm/Makefile                  |  1 +
 mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
 3 files changed, 94 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

Comments

Huang, Ying July 15, 2022, 7:53 a.m. UTC | #1
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created
> during the kernel initialization and updated when a NUMA node is
> hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier, and builds the tier hierarchy
> tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
>
> This current memory tier kernel interface needs to be improved for
> several important use cases:
>
> The current tier initialization code always initializes
> each memory-only NUMA node into a lower tier.  But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem or a DRAM-backed memory-only node on
> a virtual machine) and should be put into a higher tier.
>
> The current tier hierarchy always puts CPU nodes into the top
> tier. But on a system with HBM or GPU devices, the
> memory-only NUMA nodes mapping these devices should be in the
> top tier, and DRAM nodes with CPUs are better placed in the
> next lower tier.
>
> With the current kernel, a higher-tier node can only demote to
> selected nodes on the next lower tier, as defined by the demotion
> path, not to any other node from any lower tier.  This strict,
> hard-coded demotion order does not work in all use cases (e.g. some
> use cases may want to allow cross-socket demotion to another node in
> the same demotion tier as a fallback when the preferred demotion node
> is out of space).  This demotion order is also inconsistent with the
> page allocation fallback order when all the nodes in a higher tier
> are out of space: page allocation can fall back to any node from
> any lower tier, whereas the demotion order doesn't allow that.
>
> The current kernel also doesn't provide any interface for
> userspace to learn about the memory tier hierarchy in order to
> optimize its memory allocations.
>
> This patch series addresses the above by defining memory tiers explicitly.
>
> This patch introduces explicit memory tiers. The tier ID value
> of a memory tier is used to derive the demotion order between
> NUMA nodes.
>
> For example, if we have 3 memtiers: memtier100, memtier200, memtier300
> then the memory tier order is: memtier300 -> memtier200 -> memtier100
> where memtier300 is the highest tier and memtier100 is the lowest tier.
>
> During reclaim, we migrate pages from fast (higher) tiers to slow (lower)
> tiers when the fast (higher) tier is under memory pressure.
>
> This patchset introduces 3 memory tiers (memtier100, memtier200 and memtier300)
> which are created by different kernel subsystems. The default memory
> tier created by the kernel is memtier200. A kernel parameter is provided
> to override the default memory tier.
>
> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h | 15 +++++++
>  mm/Makefile                  |  1 +
>  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
>  3 files changed, 94 insertions(+)
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..a81dbc20e0d1
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,15 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +#ifdef CONFIG_NUMA
> +
> +#define MEMORY_TIER_HBM_GPU	300
> +#define MEMORY_TIER_DRAM	200
> +#define MEMORY_TIER_PMEM	100
> +
> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> +#define MAX_MEMORY_TIER_ID	400
> +
> +#endif	/* CONFIG_NUMA */
> +#endif  /* _LINUX_MEMORY_TIERS_H */
> diff --git a/mm/Makefile b/mm/Makefile
> index 6f9ffa968a1a..d30acebc2164 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_NUMA) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> new file mode 100644
> index 000000000000..011877b6dbb9
> --- /dev/null
> +++ b/mm/memory-tiers.c
> @@ -0,0 +1,78 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/nodemask.h>
> +#include <linux/slab.h>
> +#include <linux/lockdep.h>
> +#include <linux/moduleparam.h>
> +#include <linux/memory-tiers.h>
> +
> +struct memory_tier {
> +	struct list_head list;
> +	int id;
> +	nodemask_t nodelist;
> +};
> +
> +static DEFINE_MUTEX(memory_tier_lock);
> +static LIST_HEAD(memory_tiers);
> +
> +static void insert_memory_tier(struct memory_tier *memtier)
> +{
> +	struct list_head *ent;
> +	struct memory_tier *tmp_memtier;
> +
> +	lockdep_assert_held_once(&memory_tier_lock);
> +
> +	list_for_each(ent, &memory_tiers) {
> +		tmp_memtier = list_entry(ent, struct memory_tier, list);
> +		if (tmp_memtier->id < memtier->id) {
> +			list_add_tail(&memtier->list, ent);
> +			return;
> +		}
> +	}
> +	list_add_tail(&memtier->list, &memory_tiers);
> +}
> +
> +static struct memory_tier *register_memory_tier(unsigned int tier)
> +{
> +	struct memory_tier *memtier;
> +
> +	if (tier > MAX_MEMORY_TIER_ID)
> +		return ERR_PTR(-EINVAL);
> +
> +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +	if (!memtier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	memtier->id   = tier;
> +
> +	insert_memory_tier(memtier);
> +
> +	return memtier;
> +}
> +
> +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
> +core_param(default_memory_tier, default_memtier, uint, 0644);
> +
> +static int __init memory_tier_init(void)
> +{
> +	struct memory_tier *memtier;
> +
> +	/*
> +	 * Register only default memory tier to hide all empty
> +	 * memory tier from sysfs. Since this is early during
> +	 * boot, we could avoid holding memory_tier_lock. But
> +	 * keep it simple by holding locks. So we can add lock
> +	 * held debug checks in other functions.
> +	 */
> +	mutex_lock(&memory_tier_lock);
> +	memtier = register_memory_tier(default_memtier);
> +	if (IS_ERR(memtier))
> +		panic("%s() failed to register memory tier: %ld\n",
> +		      __func__, PTR_ERR(memtier));
> +
> +	/* CPU only nodes are not part of memory tiers. */
> +	memtier->nodelist = node_states[N_MEMORY];
> +	mutex_unlock(&memory_tier_lock);
> +	return 0;
> +}
> +subsys_initcall(memory_tier_init);

You dropped the original sysfs interface patches from the series, but
the kernel internal implementation is still for the original sysfs
interface.  For example, memory tier ID is for the original sysfs
interface, not for the new proposed sysfs interface.  So I suggest you
implement with the new interface in mind.  What do you think about
the following design?

- Each NUMA node belongs to a memory type, and each memory type
  corresponds to an "abstract distance", so each NUMA node corresponds to
  a "distance".  For simplicity, we can start with static distances, for
  example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
  node can be recorded in a global array,

    int node_distances[MAX_NUMNODES];

  or, just

    pgdat->distance

- Each memory tier corresponds to a range of distance, for example,
  0-100, 100-200, 200-300, >300, we can start with static ranges too.

- The core API of memory tier could be

    struct memory_tier *find_create_memory_tier(int distance);

  it will find the memory tier which covers "distance" in the memory
  tier list, or create a new memory tier if not found.

- The kmem_dax driver will set up the distance for PMEM NUMA nodes before
  onlining them.

- When a NUMA node is onlined, we will use find_create_memory_tier() to
  find or create its memory tier and add the NUMA node into the memory
  tier.

- Or we can add a memory type data structure now.
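
To make the find-or-create idea above concrete, here is a minimal
userspace sketch, not kernel code. It assumes static ranges 100 wide
with an open-ended last range starting at 300, as in the example ranges
above; the struct layout, the TIER_RANGE and LAST_TIER_START constants
and the use of malloc() are assumptions for illustration only.

```c
#include <stdlib.h>

#define TIER_RANGE	100	/* assumed width of each static range */
#define LAST_TIER_START	300	/* assumed start of the open-ended range */

struct memory_tier {
	struct memory_tier *next;
	int start;	/* first distance covered by this tier */
};

/* List head, kept sorted by ascending start (fastest tier first). */
static struct memory_tier *memory_tiers;

/*
 * Find the tier whose range covers `distance`, creating it if needed.
 * Returns NULL for a negative distance or on allocation failure.
 */
struct memory_tier *find_create_memory_tier(int distance)
{
	struct memory_tier **pos = &memory_tiers;
	struct memory_tier *tier;
	int start;

	if (distance < 0)
		return NULL;
	start = distance / TIER_RANGE * TIER_RANGE;
	if (start > LAST_TIER_START)
		start = LAST_TIER_START;

	/* Walk to the first tier whose start is not below the one needed. */
	while (*pos && (*pos)->start < start)
		pos = &(*pos)->next;
	if (*pos && (*pos)->start == start)
		return *pos;

	/* Not found: allocate a new tier and link it in at this position. */
	tier = malloc(sizeof(*tier));
	if (!tier)
		return NULL;
	tier->start = start;
	tier->next = *pos;
	*pos = tier;
	return tier;
}
```

With the example distances, DRAM (150) and PMEM (250) end up in two
different tiers, and a second node with distance 199 shares the DRAM
tier instead of creating a new one.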

Best Regards,
Huang, Ying
Aneesh Kumar K.V July 15, 2022, 9:08 a.m. UTC | #2
On 7/15/22 1:23 PM, Huang, Ying wrote:
> You dropped the original sysfs interface patches from the series, but
> the kernel internal implementation is still for the original sysfs
> interface.  For example, memory tier ID is for the original sysfs
> interface, not for the new proposed sysfs interface.  So I suggest you
> implement with the new interface in mind.  What do you think about
> the following design?
> 

Sorry, I am not able to follow you here. This patchset completely drops
exposing memory tiers to userspace via sysfs. Instead, it allows the
creation of memory tiers with a specific tier ID from within the kernel
or a device driver. The default tier ID is 200, and dax kmem creates a
memory tier with tier ID 100.


> - Each NUMA node belongs to a memory type, and each memory type
>   corresponds to an "abstract distance", so each NUMA node corresponds to
>   a "distance".  For simplicity, we can start with static distances, for
>   example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
>   node can be recorded in a global array,
> 
>     int node_distances[MAX_NUMNODES];
> 
>   or, just
> 
>     pgdat->distance
> 

I don't follow this. I guess you are proposing a different design.
Would it be easier if you wrote this up in the form of a patch?


> - Each memory tier corresponds to a range of distance, for example,
>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
> 
> - The core API of memory tier could be
> 
>     struct memory_tier *find_create_memory_tier(int distance);
> 
>   it will find the memory tier which covers "distance" in the memory
>   tier list, or create a new memory tier if not found.
> 

I was expecting this to be internal to dax kmem, i.e. how dax kmem maps
an "abstract distance" to a memory tier. At this point, this patchset
leaves all of that for a future patchset.

> - The kmem_dax driver will set up the distance for PMEM NUMA nodes before
>   onlining them.
> 

Sure. Can we do that as part of a future patchset?

> - When a NUMA node is onlined, we will use find_create_memory_tier() to
>   find or create its memory tier and add the NUMA node into the memory
>   tier.
> 

This is what this patchset does. When we online a NUMA node, the kernel
finds the memory tier for the node (__node_get_memory_tier). If it doesn't
exist, we create one. (The new one is not created dynamically as you
outlined earlier, but that can be done in a future patchset.) For now I am
keeping this simpler.

static int node_set_memory_tier(int node, int tier)
{
	struct memory_tier *memtier;
	int ret = 0;

	mutex_lock(&memory_tier_lock);
	memtier = __node_get_memory_tier(node);
	/*
	 * if node is already part of the tier proceed with the
	 * current tier value, because we might want to establish
	 * new migration paths now. The node might be added to a tier
	 * before it was made part of N_MEMORY, hence establish_migration_targets
	 * will have skipped this node.
	 */
	if (!memtier)
		ret = __node_set_memory_tier(node, tier);
	establish_migration_targets();

	mutex_unlock(&memory_tier_lock);

	return ret;
}
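
The "existing assignment wins" behaviour of the function above can be
modelled in a few lines of userspace C. This is only a sketch: the
model_ prefix, the fixed MAX_NODES table and the -1 error value are
assumptions, and the locking and the rebuild of demotion targets done by
the real function are left out.

```c
#include <stdbool.h>

#define MAX_NODES 8	/* assumed small node count for the model */

static bool node_has_tier[MAX_NODES];
static int node_tier[MAX_NODES];

/* Returns 0 on success, -1 for an out-of-range node. */
int model_node_set_memory_tier(int node, int tier)
{
	if (node < 0 || node >= MAX_NODES)
		return -1;
	/* A node that already has a tier keeps it; `tier` is ignored. */
	if (!node_has_tier[node]) {
		node_tier[node] = tier;
		node_has_tier[node] = true;
	}
	/* The kernel version re-establishes migration targets here. */
	return 0;
}

/* Returns the node's tier ID, or -1 if none is set or node is invalid. */
int model_node_get_memory_tier(int node)
{
	if (node < 0 || node >= MAX_NODES || !node_has_tier[node])
		return -1;
	return node_tier[node];
}
```

Calling the setter twice for the same node with different tier IDs
leaves the first value in place, which mirrors the comment in the
function above.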





> - Or we can add a memory type data structure now.
> 
> Best Regards,
> Huang, Ying

-aneesh
Aneesh Kumar K.V July 15, 2022, 9:24 a.m. UTC | #3
On 7/15/22 2:38 PM, Aneesh Kumar K V wrote:
> On 7/15/22 1:23 PM, Huang, Ying wrote:
>> You dropped the original sysfs interface patches from the series, but
>> the kernel internal implementation is still for the original sysfs
>> interface.  For example, memory tier ID is for the original sysfs
>> interface, not for the new proposed sysfs interface.  So I suggest you
>> implement with the new interface in mind.  What do you think about
>> the following design?
>>
> 
> Sorry, I am not able to follow you here. This patchset completely drops
> exposing memory tiers to userspace via sysfs. Instead, it allows the
> creation of memory tiers with a specific tier ID from within the kernel
> or a device driver. The default tier ID is 200, and dax kmem creates a
> memory tier with tier ID 100.
> 
> 
>> - Each NUMA node belongs to a memory type, and each memory type
>>   corresponds to an "abstract distance", so each NUMA node corresponds to
>>   a "distance".  For simplicity, we can start with static distances, for
>>   example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
>>   node can be recorded in a global array,
>>
>>     int node_distances[MAX_NUMNODES];
>>
>>   or, just
>>
>>     pgdat->distance
>>
> 
> I don't follow this. I guess you are proposing a different design.
> Would it be easier if you wrote this up in the form of a patch?
> 
> 
>> - Each memory tier corresponds to a range of distance, for example,
>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>>
>> - The core API of memory tier could be
>>
>>     struct memory_tier *find_create_memory_tier(int distance);
>>
>>   it will find the memory tier which covers "distance" in the memory
>>   tier list, or create a new memory tier if not found.
>>
> 
> I was expecting this to be internal to dax kmem, i.e. how dax kmem maps
> an "abstract distance" to a memory tier. At this point, this patchset
> leaves all of that for a future patchset.
> 

At an abstract level, something like this.

modified   drivers/dax/kmem.c
@@ -150,7 +150,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	}
 
 	dev_set_drvdata(dev, data);
-	node_create_and_set_memory_tier(numa_node, dax_kmem_memtier);
+	this_device_tier = find_memtier_from_distance(dev_dax);
+	node_create_and_set_memory_tier(numa_node, this_device_tier);
 	return 0;
 
 err_request_mem:



>> - kmem_dax driver will setup distance for PMEM NUMA nodes before online
>>   them.
>>
> 
> Sure we can do that as part of future patchset ?
> 
>> - When a NUMA node is onlined, we will use find_create_memory_tier() to
>>   find or create its memory tier and add the NUMA node into the memory
>>   tier.
>>
> 
> This is what this patchset does. When we online a numa node the kernel 
> find the memory tier for the node (__node_get_memory_tier). If it doesn't
> exist, we create one. (The new one created is not dynamic as you outlined
> earlier. But then that can be done in a future patchset). For now I am
> keeping this simpler.
> 
> static int node_set_memory_tier(int node, int tier)
> {
> 	struct memory_tier *memtier;
> 	int ret = 0;
> 
> 	mutex_lock(&memory_tier_lock);
> 	memtier = __node_get_memory_tier(node);
> 	/*
> 	 * if node is already part of the tier proceed with the
> 	 * current tier value, because we might want to establish
> 	 * new migration paths now. The node might be added to a tier
> 	 * before it was made part of N_MEMORY, hence estabilish_migration_targets
> 	 * will have skipped this node.
> 	 */
> 	if (!memtier)
> 		ret = __node_set_memory_tier(node, tier);
> 	establish_migration_targets();
> 
> 	mutex_unlock(&memory_tier_lock);
> 
> 	return ret;
> }
> 
> 
> 
> 
> 
>> - Or we can add memory type data structure now.
>>
>> Best Regards,
>> Huang, Ying
> 
> -aneesh
Aneesh Kumar K.V July 15, 2022, 10:27 a.m. UTC | #4
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

....

> 
>> You dropped the original sysfs interface patches from the series, but
>> the kernel internal implementation is still for the original sysfs
>> interface.  For example, memory tier ID is for the original sysfs
>> interface, not for the new proposed sysfs interface.  So I suggest you
>> to implement with the new interface in mind.  What do you think about
>> the following design?
>> 
>
> Sorry I am not able to follow you here. This patchset completely drops
> exposing memory tiers to userspace via sysfs. Instead it allows
> creation of memory tiers with specific tierID from within the kernel/device driver.
> Default tierID is 200 and dax kmem creates memory tier with tierID 100. 
>
>
>> - Each NUMA node belongs to a memory type, and each memory type
>>   corresponds to an "abstract distance", so each NUMA node corresponds to
>>   a "distance".  For simplicity, we can start with static distances, for
>>   example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
>>   node can be recorded in a global array,
>> 
>>     int node_distances[MAX_NUMNODES];
>> 
>>   or, just
>> 
>>     pgdat->distance
>> 
>
> I don't follow this. I guess you are trying to have a different design.
> Would it be much easier if you can write this in the form of a patch? 
>
>
>> - Each memory tier corresponds to a range of distance, for example,
>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>> 
>> - The core API of memory tier could be
>> 
>>     struct memory_tier *find_create_memory_tier(int distance);
>> 
>>   it will find the memory tier which covers "distance" in the memory
>>   tier list, or create a new memory tier if not found.
>> 
>
> I was expecting this to be internal to dax kmem. How dax kmem maps
> "abstract distance" to a memory tier. At this point this patchset is
> keeping all that for a future patchset. 
>

This shows how I was expecting "abstract distance" to be integrated.

diff --git a/arch/powerpc/platforms/pseries/papr_scm.c b/arch/powerpc/platforms/pseries/papr_scm.c
index 82cae08976bc..1281aec63986 100644
--- a/arch/powerpc/platforms/pseries/papr_scm.c
+++ b/arch/powerpc/platforms/pseries/papr_scm.c
@@ -1332,6 +1332,7 @@ static int papr_scm_nvdimm_init(struct papr_scm_priv *p)
 	ndr_desc.mapping = &mapping;
 	ndr_desc.num_mappings = 1;
 	ndr_desc.nd_set = &p->nd_set;
+	ndr_desc.memtier_distance = PMEM_MEMTIER_DEFAULT_DISTANCE;
 
 	if (p->hcall_flush_required) {
 		set_bit(ND_REGION_ASYNC, &ndr_desc.flags);
diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index ae5f4acf2675..7b8cf1f15562 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -2641,6 +2641,10 @@ static int acpi_nfit_register_region(struct acpi_nfit_desc *acpi_desc,
 			NUMA_NO_NODE, ndr_desc->numa_node, &res.start, &res.end);
 	}
 
+	/*
+	 * We may want to look at SLIT/HMAT to fine tune this
+	 */
+	ndr_desc->memtier_distance = PMEM_MEMTIER_DEFAULT_DISTANCE;
 	/*
 	 * Persistence domain bits are hierarchical, if
 	 * ACPI_NFIT_CAPABILITY_CACHE_FLUSH is set then
diff --git a/drivers/dax/bus.c b/drivers/dax/bus.c
index 1dad813ee4a6..708a40cf29c0 100644
--- a/drivers/dax/bus.c
+++ b/drivers/dax/bus.c
@@ -570,8 +570,9 @@ static void dax_region_unregister(void *region)
 }
 
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-		struct range *range, int target_node, unsigned int align,
-		unsigned long flags)
+				    struct range *range, int target_node,
+				    int memtier_distance, unsigned int align,
+				    unsigned long flags)
 {
 	struct dax_region *dax_region;
 
@@ -599,6 +600,7 @@ struct dax_region *alloc_dax_region(struct device *parent, int region_id,
 	dax_region->align = align;
 	dax_region->dev = parent;
 	dax_region->target_node = target_node;
+	dax_region->memtier_distance = memtier_distance;
 	ida_init(&dax_region->ida);
 	dax_region->res = (struct resource) {
 		.start = range->start,
@@ -1370,6 +1372,7 @@ struct dev_dax *devm_create_dev_dax(struct dev_dax_data *data)
 
 	dev_dax->dax_dev = dax_dev;
 	dev_dax->target_node = dax_region->target_node;
+	dev_dax->memtier_distance = dax_region->memtier_distance;
 	dev_dax->align = dax_region->align;
 	ida_init(&dev_dax->ida);
 	kref_get(&dax_region->kref);
diff --git a/drivers/dax/bus.h b/drivers/dax/bus.h
index fbb940293d6d..3de4292392dd 100644
--- a/drivers/dax/bus.h
+++ b/drivers/dax/bus.h
@@ -13,8 +13,9 @@ void dax_region_put(struct dax_region *dax_region);
 
 #define IORESOURCE_DAX_STATIC (1UL << 0)
 struct dax_region *alloc_dax_region(struct device *parent, int region_id,
-		struct range *range, int target_node, unsigned int align,
-		unsigned long flags);
+				    struct range *range, int target_node,
+				    int memtier_distance, unsigned int align,
+				    unsigned long flags);
 
 struct dev_dax_data {
 	struct dax_region *dax_region;
diff --git a/drivers/dax/dax-private.h b/drivers/dax/dax-private.h
index 1c974b7caae6..5db382c78d0e 100644
--- a/drivers/dax/dax-private.h
+++ b/drivers/dax/dax-private.h
@@ -31,6 +31,7 @@ void dax_bus_exit(void);
 struct dax_region {
 	int id;
 	int target_node;
+	int memtier_distance;
 	struct kref kref;
 	struct device *dev;
 	unsigned int align;
@@ -64,6 +65,7 @@ struct dev_dax {
 	struct dax_device *dax_dev;
 	unsigned int align;
 	int target_node;
+	int memtier_distance;
 	int id;
 	struct ida ida;
 	struct device dev;
diff --git a/drivers/dax/hmem/hmem.c b/drivers/dax/hmem/hmem.c
index 1bf040dbc834..b9f80971c07b 100644
--- a/drivers/dax/hmem/hmem.c
+++ b/drivers/dax/hmem/hmem.c
@@ -26,7 +26,7 @@ static int dax_hmem_probe(struct platform_device *pdev)
 	range.start = res->start;
 	range.end = res->end;
 	dax_region = alloc_dax_region(dev, pdev->id, &range, mri->target_node,
-			PMD_SIZE, 0);
+				      mri->memtier_distance, PMD_SIZE, 0);
 	if (!dax_region)
 		return -ENOMEM;
 
diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 0c03889286ac..32878bd96f09 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -45,13 +45,18 @@ struct dax_kmem_data {
 static unsigned int dax_kmem_memtier = MEMORY_TIER_PMEM;
 module_param(dax_kmem_memtier, uint, 0644);
 
+int find_memtier_from_distance(struct dev_dax *dev_dax)
+{
+	return dax_kmem_memtier + dev_dax->memtier_distance;
+}
+
 static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 {
 	struct device *dev = &dev_dax->dev;
 	unsigned long total_len = 0;
 	struct dax_kmem_data *data;
 	int i, rc, mapped = 0;
-	int numa_node;
+	int numa_node, mem_tier;
 
 	/*
 	 * Ensure good NUMA information for the persistent memory.
@@ -150,7 +155,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	}
 
 	dev_set_drvdata(dev, data);
-	node_create_and_set_memory_tier(numa_node, dax_kmem_memtier);
+	mem_tier = find_memtier_from_distance(dev_dax);
+	node_create_and_set_memory_tier(numa_node, mem_tier);
 	return 0;
 
 err_request_mem:
diff --git a/drivers/dax/pmem.c b/drivers/dax/pmem.c
index f050ea78bb83..1b51fc0490de 100644
--- a/drivers/dax/pmem.c
+++ b/drivers/dax/pmem.c
@@ -54,8 +54,10 @@ static struct dev_dax *__dax_pmem_probe(struct device *dev)
 	range = pgmap.range;
 	range.start += offset;
 	dax_region = alloc_dax_region(dev, region_id, &range,
-			nd_region->target_node, le32_to_cpu(pfn_sb->align),
-			IORESOURCE_DAX_STATIC);
+				      nd_region->target_node,
+				      nd_region->memtier_distance,
+				      le32_to_cpu(pfn_sb->align),
+				      IORESOURCE_DAX_STATIC);
 	if (!dax_region)
 		return ERR_PTR(-ENOMEM);
 
diff --git a/drivers/nvdimm/nd.h b/drivers/nvdimm/nd.h
index ec5219680092..cf7a379a2220 100644
--- a/drivers/nvdimm/nd.h
+++ b/drivers/nvdimm/nd.h
@@ -416,6 +416,7 @@ struct nd_region {
 	u64 ndr_size;
 	u64 ndr_start;
 	int id, num_lanes, ro, numa_node, target_node;
+	int memtier_distance;
 	void *provider_data;
 	struct kernfs_node *bb_state;
 	struct badblocks bb;
diff --git a/drivers/nvdimm/region_devs.c b/drivers/nvdimm/region_devs.c
index d976260eca7a..f2067de8d660 100644
--- a/drivers/nvdimm/region_devs.c
+++ b/drivers/nvdimm/region_devs.c
@@ -1019,6 +1019,7 @@ static struct nd_region *nd_region_create(struct nvdimm_bus *nvdimm_bus,
 	nd_region->ro = ro;
 	nd_region->numa_node = ndr_desc->numa_node;
 	nd_region->target_node = ndr_desc->target_node;
+	nd_region->memtier_distance = ndr_desc->memtier_distance;
 	ida_init(&nd_region->ns_ida);
 	ida_init(&nd_region->btt_ida);
 	ida_init(&nd_region->pfn_ida);
diff --git a/include/linux/libnvdimm.h b/include/linux/libnvdimm.h
index 0d61e07b6827..bf20e018074f 100644
--- a/include/linux/libnvdimm.h
+++ b/include/linux/libnvdimm.h
@@ -121,6 +121,7 @@ struct nd_region_desc {
 	int num_lanes;
 	int numa_node;
 	int target_node;
+	int memtier_distance;
 	unsigned long flags;
 	struct device_node *of_node;
 	int (*flush)(struct nd_region *nd_region, struct bio *bio);
@@ -224,6 +225,8 @@ struct nvdimm_fw_ops {
 	int (*arm)(struct nvdimm *nvdimm, enum nvdimm_fwa_trigger arg);
 };
 
+#define PMEM_MEMTIER_DEFAULT_DISTANCE  10
+
 void badrange_init(struct badrange *badrange);
 int badrange_add(struct badrange *badrange, u64 addr, u64 length);
 void badrange_forget(struct badrange *badrange, phys_addr_t start,
diff --git a/include/linux/memregion.h b/include/linux/memregion.h
index c04c4fd2e209..5850e2bbbfed 100644
--- a/include/linux/memregion.h
+++ b/include/linux/memregion.h
@@ -6,6 +6,7 @@
 
 struct memregion_info {
 	int target_node;
+	int memtier_distance;
 };
 
 #ifdef CONFIG_MEMREGION
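
To make the arithmetic in the kmem.c hunk concrete, here is a small
userspace sketch (not kernel code) of what find_memtier_from_distance()
computes.  The two constants copy the values used in this thread;
struct dev_dax_sketch and memtier_from_distance() are hypothetical
stand-ins introduced only for illustration.

```c
/* Userspace sketch only; the constants copy the values used in this thread. */
#define MEMORY_TIER_PMEM               100 /* from memory-tiers.h in this series */
#define PMEM_MEMTIER_DEFAULT_DISTANCE  10  /* from the libnvdimm.h hunk */

/* Hypothetical stand-in for struct dev_dax: only the field the helper reads. */
struct dev_dax_sketch {
	int memtier_distance;
};

/*
 * Mirrors find_memtier_from_distance(): the base tier ID (the
 * dax_kmem_memtier module parameter) plus the per-device distance.
 */
static int memtier_from_distance(unsigned int base_tier,
				 const struct dev_dax_sketch *dev)
{
	return (int)base_tier + dev->memtier_distance;
}
```

With the defaults above, a PMEM device ends up with tier ID
100 + 10 = 110, which is still below the default DRAM tier ID of 200.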
Wei Xu July 15, 2022, 4:59 p.m. UTC | #5
On Fri, Jul 15, 2022 at 12:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>
> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>
> > In the current kernel, memory tiers are defined implicitly via a
> > demotion path relationship between NUMA nodes, which is created
> > during the kernel initialization and updated when a NUMA node is
> > hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and builds the tier hierarchy
> > tier-by-tier by establishing the per-node demotion targets based
> > on the distances between nodes.
> >
> > This current memory tier kernel interface needs to be improved for
> > several important use cases,
> >
> > The current tier initialization code always initializes
> > each memory-only NUMA node into a lower tier.  But a memory-only
> > NUMA node may have a high performance memory device (e.g. a DRAM
> > device attached via CXL.mem or a DRAM-backed memory-only node on
> > a virtual machine) and should be put into a higher tier.
> >
> > The current tier hierarchy always puts CPU nodes into the top
> > tier. But on a system with HBM or GPU devices, the
> > memory-only NUMA nodes mapping these devices should be in the
> > top tier, and DRAM nodes with CPUs are better to be placed into the
> > next lower tier.
> >
> > With the current kernel, a higher tier node can only be demoted to selected
> > nodes on the next lower tier as defined by the demotion path, not any other
> > node from any lower tier.  This strict, hard-coded demotion order
> > does not work in all use cases (e.g. some use cases may want to
> > allow cross-socket demotion to another node in the same demotion
> > tier as a fallback when the preferred demotion node is out of
> > space).  This demotion order is also inconsistent with the page
> > allocation fallback order when all the nodes in a higher tier are
> > out of space: the page allocation can fall back to any node from
> > any lower tier, whereas the demotion order doesn't allow that.
> >
> > The current kernel also doesn't provide any interfaces for the
> > userspace to learn about the memory tier hierarchy in order to
> > optimize its memory allocations.
> >
> > This patch series addresses the above by defining memory tiers explicitly.
> >
> > This patch introduces explicit memory tiers. The tier ID value
> > of a memory tier is used to derive the demotion order between
> > NUMA nodes.
> >
> > For example, if we have 3 memtiers: memtier100, memtier200, memtier300
> > then the memory tier order is: memtier300 -> memtier200 -> memtier100
> > where memtier300 is the highest tier and memtier100 is the lowest tier.
> >
> > During reclaim we migrate pages from fast(higher) tiers to slow(lower)
> > tiers when the fast(higher) tier is under memory pressure.
> >
> > This patchset introduces 3 memory tiers (memtier100, memtier200 and memtier300)
> > which are created by different kernel subsystems. The default memory
> > tier created by the kernel is memtier200. A kernel parameter is provided
> > to override the default memory tier.
> >
> > Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> > Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
> >
> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > ---
> >  include/linux/memory-tiers.h | 15 +++++++
> >  mm/Makefile                  |  1 +
> >  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
> >  3 files changed, 94 insertions(+)
> >  create mode 100644 include/linux/memory-tiers.h
> >  create mode 100644 mm/memory-tiers.c
> >
> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > new file mode 100644
> > index 000000000000..a81dbc20e0d1
> > --- /dev/null
> > +++ b/include/linux/memory-tiers.h
> > @@ -0,0 +1,15 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_MEMORY_TIERS_H
> > +#define _LINUX_MEMORY_TIERS_H
> > +
> > +#ifdef CONFIG_NUMA
> > +
> > +#define MEMORY_TIER_HBM_GPU  300
> > +#define MEMORY_TIER_DRAM     200
> > +#define MEMORY_TIER_PMEM     100
> > +
> > +#define DEFAULT_MEMORY_TIER  MEMORY_TIER_DRAM
> > +#define MAX_MEMORY_TIER_ID   400
> > +
> > +#endif       /* CONFIG_NUMA */
> > +#endif  /* _LINUX_MEMORY_TIERS_H */
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 6f9ffa968a1a..d30acebc2164 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
> >  obj-$(CONFIG_FAILSLAB) += failslab.o
> >  obj-$(CONFIG_MEMTEST)                += memtest.o
> >  obj-$(CONFIG_MIGRATION) += migrate.o
> > +obj-$(CONFIG_NUMA) += memory-tiers.o
> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > new file mode 100644
> > index 000000000000..011877b6dbb9
> > --- /dev/null
> > +++ b/mm/memory-tiers.c
> > @@ -0,0 +1,78 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/types.h>
> > +#include <linux/nodemask.h>
> > +#include <linux/slab.h>
> > +#include <linux/lockdep.h>
> > +#include <linux/moduleparam.h>
> > +#include <linux/memory-tiers.h>
> > +
> > +struct memory_tier {
> > +     struct list_head list;
> > +     int id;
> > +     nodemask_t nodelist;
> > +};
> > +
> > +static DEFINE_MUTEX(memory_tier_lock);
> > +static LIST_HEAD(memory_tiers);
> > +
> > +static void insert_memory_tier(struct memory_tier *memtier)
> > +{
> > +     struct list_head *ent;
> > +     struct memory_tier *tmp_memtier;
> > +
> > +     lockdep_assert_held_once(&memory_tier_lock);
> > +
> > +     list_for_each(ent, &memory_tiers) {
> > +             tmp_memtier = list_entry(ent, struct memory_tier, list);
> > +             if (tmp_memtier->id < memtier->id) {
> > +                     list_add_tail(&memtier->list, ent);
> > +                     return;
> > +             }
> > +     }
> > +     list_add_tail(&memtier->list, &memory_tiers);
> > +}
> > +
> > +static struct memory_tier *register_memory_tier(unsigned int tier)
> > +{
> > +     struct memory_tier *memtier;
> > +
> > +     if (tier > MAX_MEMORY_TIER_ID)
> > +             return ERR_PTR(-EINVAL);
> > +
> > +     memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > +     if (!memtier)
> > +             return ERR_PTR(-ENOMEM);
> > +
> > +     memtier->id   = tier;
> > +
> > +     insert_memory_tier(memtier);
> > +
> > +     return memtier;
> > +}
> > +
> > +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
> > +core_param(default_memory_tier, default_memtier, uint, 0644);
> > +
> > +static int __init memory_tier_init(void)
> > +{
> > +     struct memory_tier *memtier;
> > +
> > +     /*
> > +      * Register only default memory tier to hide all empty
> > +      * memory tier from sysfs. Since this is early during
> > +      * boot, we could avoid holding memory_tier_lock. But
> > +      * keep it simple by holding locks. So we can add lock
> > +      * held debug checks in other functions.
> > +      */
> > +     mutex_lock(&memory_tier_lock);
> > +     memtier = register_memory_tier(default_memtier);
> > +     if (IS_ERR(memtier))
> > +             panic("%s() failed to register memory tier: %ld\n",
> > +                   __func__, PTR_ERR(memtier));
> > +
> > +     /* CPU only nodes are not part of memory tiers. */
> > +     memtier->nodelist = node_states[N_MEMORY];
> > +     mutex_unlock(&memory_tier_lock);
> > +     return 0;
> > +}
> > +subsys_initcall(memory_tier_init);
>
> You dropped the original sysfs interface patches from the series, but
> the kernel internal implementation is still for the original sysfs
> interface.  For example, memory tier ID is for the original sysfs
> interface, not for the new proposed sysfs interface.  So I suggest you
> to implement with the new interface in mind.  What do you think about
> the following design?
>
> - Each NUMA node belongs to a memory type, and each memory type
>   corresponds to an "abstract distance", so each NUMA node corresponds to
>   a "distance".  For simplicity, we can start with static distances, for
>   example, DRAM (default): 150, PMEM: 250.

I agree with this design, though I'd prefer that the new attribute not
be named "distance".  That avoids both confusion with the SLIT
distance and the misconception that only latency matters and
bandwidth doesn't.

How about we call it "performance level" (perf_level) or something
similar instead?

> The distance of each NUMA
>   node can be recorded in a global array,
>
>     int node_distances[MAX_NUMNODES];
>
>   or, just
>
>     pgdat->distance

I think node_devices[] is a better place to record this new attribute.
The HMAT performance data is also listed there.

> - Each memory tier corresponds to a range of distance, for example,
>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>
> - The core API of memory tier could be
>
>     struct memory_tier *find_create_memory_tier(int distance);
>
>   it will find the memory tier which covers "distance" in the memory
>   tier list, or create a new memory tier if not found.
>
> - kmem_dax driver will setup distance for PMEM NUMA nodes before online
>   them.

This attribute should be a property of the NUMA node based on the
device hardware.  For PMEM, it is better to handle this at the ACPI level.
For example, we can consider initializing this attribute for a PMEM
node in acpi_numa_memory_affinity_init() when the node is
non-volatile.
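
As a rough illustration only, the decision could look like the
userspace sketch below.  It does not touch any ACPI code.  The 150/250
values are the static examples proposed earlier in this thread, and
SKETCH_MEM_NON_VOLATILE and perf_level_from_flags() are hypothetical
names; the bit position is assumed to mirror the non-volatile bit of
the SRAT memory affinity flags.

```c
#include <stdint.h>

/* Static example values taken from the proposal quoted above. */
#define PERF_LEVEL_DRAM  150
#define PERF_LEVEL_PMEM  250

/* Hypothetical flag; assumed to mirror the SRAT "non-volatile" bit. */
#define SKETCH_MEM_NON_VOLATILE  (1u << 2)

/* Pick a default perf level for a node from its memory affinity flags. */
static int perf_level_from_flags(uint32_t flags)
{
	if (flags & SKETCH_MEM_NON_VOLATILE)
		return PERF_LEVEL_PMEM;
	return PERF_LEVEL_DRAM;
}
```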

> - When a NUMA node is onlined, we will use find_create_memory_tier() to
>   find or create its memory tier and add the NUMA node into the memory
>   tier.

I think we should create all the memory tiers up-front, just like NUMA
nodes, to keep their devices and IDs stable.  Similar to offline NUMA
nodes, when a memory tier has no online nodes, we can mark it as
offline and exclude it from online-related operations (e.g. demotion).
A memory tier can be made online when an online node is assigned to
it.
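
A minimal userspace sketch of that idea, with hypothetical names
throughout (up_tier, up_tiers_init(), next_demotion_tier()) and an
assumed fixed count of 4 tiers: every tier exists from init with a
fixed ID, a tier counts as online only while its node mask is
non-empty, and demotion skips offline tiers.  Here a larger index is
assumed to mean a slower tier.

```c
#include <stdbool.h>

#define UP_NR_TIERS 4 /* assumed fixed number of tiers, for illustration */

struct up_tier {
	int id;                   /* stable: assigned once at init */
	unsigned long long nodes; /* bitmask of online nodes (node < 64) */
};

static struct up_tier up_tiers[UP_NR_TIERS];

/* Create every tier up-front so IDs never change afterwards. */
static void up_tiers_init(void)
{
	for (int i = 0; i < UP_NR_TIERS; i++) {
		up_tiers[i].id = i;
		up_tiers[i].nodes = 0;
	}
}

/* A tier is online only while it holds at least one online node. */
static bool up_tier_online(const struct up_tier *t)
{
	return t->nodes != 0;
}

static void up_tier_node_online(struct up_tier *t, int node)
{
	t->nodes |= 1ULL << node;
}

static void up_tier_node_offline(struct up_tier *t, int node)
{
	t->nodes &= ~(1ULL << node);
}

/* Next slower online tier after 'from', or -1 if there is none. */
static int next_demotion_tier(int from)
{
	for (int i = from + 1; i < UP_NR_TIERS; i++)
		if (up_tier_online(&up_tiers[i]))
			return i;
	return -1;
}
```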

> - Or we can add memory type data structure now.
>
> Best Regards,
> Huang, Ying
Huang, Ying July 18, 2022, 5:28 a.m. UTC | #6
Wei Xu <weixugc@google.com> writes:

> On Fri, Jul 15, 2022 at 12:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>
>> > In the current kernel, memory tiers are defined implicitly via a
>> > demotion path relationship between NUMA nodes, which is created
>> > during the kernel initialization and updated when a NUMA node is
>> > hot-added or hot-removed.  The current implementation puts all
>> > nodes with CPU into the top tier, and builds the tier hierarchy
>> > tier-by-tier by establishing the per-node demotion targets based
>> > on the distances between nodes.
>> >
>> > This current memory tier kernel interface needs to be improved for
>> > several important use cases,
>> >
>> > The current tier initialization code always initializes
>> > each memory-only NUMA node into a lower tier.  But a memory-only
>> > NUMA node may have a high performance memory device (e.g. a DRAM
>> > device attached via CXL.mem or a DRAM-backed memory-only node on
>> > a virtual machine) and should be put into a higher tier.
>> >
>> > The current tier hierarchy always puts CPU nodes into the top
>> > tier. But on a system with HBM or GPU devices, the
>> > memory-only NUMA nodes mapping these devices should be in the
>> > top tier, and DRAM nodes with CPUs are better to be placed into the
>> > next lower tier.
>> >
>> > With the current kernel, a higher tier node can only be demoted to selected
>> > nodes on the next lower tier as defined by the demotion path, not any other
>> > node from any lower tier.  This strict, hard-coded demotion order
>> > does not work in all use cases (e.g. some use cases may want to
>> > allow cross-socket demotion to another node in the same demotion
>> > tier as a fallback when the preferred demotion node is out of
>> > space).  This demotion order is also inconsistent with the page
>> > allocation fallback order when all the nodes in a higher tier are
>> > out of space: the page allocation can fall back to any node from
>> > any lower tier, whereas the demotion order doesn't allow that.
>> >
>> > The current kernel also doesn't provide any interfaces for the
>> > userspace to learn about the memory tier hierarchy in order to
>> > optimize its memory allocations.
>> >
>> > This patch series addresses the above by defining memory tiers explicitly.
>> >
>> > This patch introduces explicit memory tiers. The tier ID value
>> > of a memory tier is used to derive the demotion order between
>> > NUMA nodes.
>> >
>> > For example, if we have 3 memtiers: memtier100, memtier200, memtier300
>> > then the memory tier order is: memtier300 -> memtier200 -> memtier100
>> > where memtier300 is the highest tier and memtier100 is the lowest tier.
>> >
>> > During reclaim we migrate pages from fast(higher) tiers to slow(lower)
>> > tiers when the fast(higher) tier is under memory pressure.
>> >
>> > This patchset introduces 3 memory tiers (memtier100, memtier200 and memtier300)
>> > which are created by different kernel subsystems. The default memory
>> > tier created by the kernel is memtier200. A kernel parameter is provided
>> > to override the default memory tier.
>> >
>> > Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>> > Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>> >
>> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> > ---
>> >  include/linux/memory-tiers.h | 15 +++++++
>> >  mm/Makefile                  |  1 +
>> >  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
>> >  3 files changed, 94 insertions(+)
>> >  create mode 100644 include/linux/memory-tiers.h
>> >  create mode 100644 mm/memory-tiers.c
>> >
>> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> > new file mode 100644
>> > index 000000000000..a81dbc20e0d1
>> > --- /dev/null
>> > +++ b/include/linux/memory-tiers.h
>> > @@ -0,0 +1,15 @@
>> > +/* SPDX-License-Identifier: GPL-2.0 */
>> > +#ifndef _LINUX_MEMORY_TIERS_H
>> > +#define _LINUX_MEMORY_TIERS_H
>> > +
>> > +#ifdef CONFIG_NUMA
>> > +
>> > +#define MEMORY_TIER_HBM_GPU  300
>> > +#define MEMORY_TIER_DRAM     200
>> > +#define MEMORY_TIER_PMEM     100
>> > +
>> > +#define DEFAULT_MEMORY_TIER  MEMORY_TIER_DRAM
>> > +#define MAX_MEMORY_TIER_ID   400
>> > +
>> > +#endif       /* CONFIG_NUMA */
>> > +#endif  /* _LINUX_MEMORY_TIERS_H */
>> > diff --git a/mm/Makefile b/mm/Makefile
>> > index 6f9ffa968a1a..d30acebc2164 100644
>> > --- a/mm/Makefile
>> > +++ b/mm/Makefile
>> > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>> >  obj-$(CONFIG_FAILSLAB) += failslab.o
>> >  obj-$(CONFIG_MEMTEST)                += memtest.o
>> >  obj-$(CONFIG_MIGRATION) += migrate.o
>> > +obj-$(CONFIG_NUMA) += memory-tiers.o
>> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> > new file mode 100644
>> > index 000000000000..011877b6dbb9
>> > --- /dev/null
>> > +++ b/mm/memory-tiers.c
>> > @@ -0,0 +1,78 @@
>> > +// SPDX-License-Identifier: GPL-2.0
>> > +#include <linux/types.h>
>> > +#include <linux/nodemask.h>
>> > +#include <linux/slab.h>
>> > +#include <linux/lockdep.h>
>> > +#include <linux/moduleparam.h>
>> > +#include <linux/memory-tiers.h>
>> > +
>> > +struct memory_tier {
>> > +     struct list_head list;
>> > +     int id;
>> > +     nodemask_t nodelist;
>> > +};
>> > +
>> > +static DEFINE_MUTEX(memory_tier_lock);
>> > +static LIST_HEAD(memory_tiers);
>> > +
>> > +static void insert_memory_tier(struct memory_tier *memtier)
>> > +{
>> > +     struct list_head *ent;
>> > +     struct memory_tier *tmp_memtier;
>> > +
>> > +     lockdep_assert_held_once(&memory_tier_lock);
>> > +
>> > +     list_for_each(ent, &memory_tiers) {
>> > +             tmp_memtier = list_entry(ent, struct memory_tier, list);
>> > +             if (tmp_memtier->id < memtier->id) {
>> > +                     list_add_tail(&memtier->list, ent);
>> > +                     return;
>> > +             }
>> > +     }
>> > +     list_add_tail(&memtier->list, &memory_tiers);
>> > +}
>> > +
>> > +static struct memory_tier *register_memory_tier(unsigned int tier)
>> > +{
>> > +     struct memory_tier *memtier;
>> > +
>> > +     if (tier > MAX_MEMORY_TIER_ID)
>> > +             return ERR_PTR(-EINVAL);
>> > +
>> > +     memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
>> > +     if (!memtier)
>> > +             return ERR_PTR(-ENOMEM);
>> > +
>> > +     memtier->id   = tier;
>> > +
>> > +     insert_memory_tier(memtier);
>> > +
>> > +     return memtier;
>> > +}
>> > +
>> > +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
>> > +core_param(default_memory_tier, default_memtier, uint, 0644);
>> > +
>> > +static int __init memory_tier_init(void)
>> > +{
>> > +     struct memory_tier *memtier;
>> > +
>> > +     /*
>> > +      * Register only default memory tier to hide all empty
>> > +      * memory tier from sysfs. Since this is early during
>> > +      * boot, we could avoid holding memory_tier_lock. But
>> > +      * keep it simple by holding locks. So we can add lock
>> > +      * held debug checks in other functions.
>> > +      */
>> > +     mutex_lock(&memory_tier_lock);
>> > +     memtier = register_memory_tier(default_memtier);
>> > +     if (IS_ERR(memtier))
>> > +             panic("%s() failed to register memory tier: %ld\n",
>> > +                   __func__, PTR_ERR(memtier));
>> > +
>> > +     /* CPU only nodes are not part of memory tiers. */
>> > +     memtier->nodelist = node_states[N_MEMORY];
>> > +     mutex_unlock(&memory_tier_lock);
>> > +     return 0;
>> > +}
>> > +subsys_initcall(memory_tier_init);
>>
>> You dropped the original sysfs interface patches from the series, but
>> the kernel internal implementation is still for the original sysfs
>> interface.  For example, memory tier ID is for the original sysfs
>> interface, not for the new proposed sysfs interface.  So I suggest you
>> to implement with the new interface in mind.  What do you think about
>> the following design?
>>
>> - Each NUMA node belongs to a memory type, and each memory type
>>   corresponds to an "abstract distance", so each NUMA node corresponds to
>>   a "distance".  For simplicity, we can start with static distances, for
>>   example, DRAM (default): 150, PMEM: 250.
>
> I agree with this design, though I'd prefer the new attribute to not
> be named as "distance".  This is to both avoid the confusion with the
> SLIT distance and to avoid the misconception that only the latency
> matters, but the bandwidth doesn't.
>
> How about we call it "performance level" (perf_level) or something
> similar instead?

I have no strong opinion on this.  Both "distance" and "perf_level" look
OK to me.

>> The distance of each NUMA
>>   node can be recorded in a global array,
>>
>>     int node_distances[MAX_NUMNODES];
>>
>>   or, just
>>
>>     pgdat->distance
>
> I think node_devices[] is a better place to record this new attribute.
> The HMAT performance data is also listed there.

Firstly, we all agree that we need a place to record this information,
per node or per memory type.  Personally, I prefer to separate the data
and its interface (such as sysfs).

>> - Each memory tier corresponds to a range of distance, for example,
>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>>
>> - The core API of memory tier could be
>>
>>     struct memory_tier *find_create_memory_tier(int distance);
>>
>>   it will find the memory tier which covers "distance" in the memory
>>   tier list, or create a new memory tier if not found.
>>
>> - kmem_dax driver will setup distance for PMEM NUMA nodes before online
>>   them.
>
> This attribute should be a property of the NUMA node based on the
> device hardware.

Yes.  Or a property of a memory type.

> For PMEM, it is better to handle at the ACPI level.
> For example, we can consider initializing this attribute for a PMEM
> node in acpi_numa_memory_affinity_init() when the node is
> non-volatile.

The abstract_distance/perf_level may be determined from multiple
information sources, e.g., ACPI SLIT/SRAT/HMAT, etc.  It should be the
responsibility of device drivers (e.g., kmem_dax) to determine the final
value of abstract_distance/perf_level based on the availability and
priority of that information and some specific knowledge of the
hardware.  Yes, ACPI SRAT is valuable for determining the
abstract_distance/perf_level, and it's better for kmem_dax to use it to
determine the final value of abstract_distance/perf_level.

To make the first version as simple as possible, I think we can just use
some static abstract_distance/perf_level in the kmem_dax driver for the
NUMA nodes onlined by it, because we use the driver only for PMEM for
now.  We can enhance the implementation later.

>> - When a NUMA node is onlined, we will use find_create_memory_tier() to
>>   find or create its memory tier and add the NUMA node into the memory
>>   tier.
>
> I think we should create all the memory tiers up-front, just like NUMA
> nodes, to keep their devices and IDs stable.  Similar to offline NUMA
> nodes, when a memory tier has no online nodes, we can mark it as
> offline and exclude it from online-related operations (e.g. demotion).
> A memory tier can be made online when it gets assigned with an online
> node.

Each memory tier corresponds to a range of abstract_distance/perf_level.
For example, if 1 <= abstract_distance/perf_level <= 500, 5 memory tiers
can be defined with abstract_distance/perf_level ranges 1-100, 101-200,
201-300, 301-400, 401-500.  We can create these 5 memory tiers up-front
of course.  In the new design, we may change the ranges at run time
according to policy chosen by the users.  For example, we may change 5
memory tiers above to 500 memory tiers, with
abstract_distance/perf_level ranges 1-1, 2-2, ..., 500-500.  This may
make memory tier devices and their IDs unstable to some degree.  But if
we are careful when customizing the ranges, it's possible to keep the
memory tier devices and their IDs stable in most cases.

Because we may define 500 memory tiers, it's really hard to create all
memory tiers up-front.  But we can create them all in concept and
allocate memory/resources for one when we add the first NUMA node to it.

To make the first version as simple as possible, I suggest defining 500
memory tiers as above statically.
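
The range arithmetic described above can be sketched in userspace C as
follows.  This is only an illustration under stated assumptions: the
names ADISTANCE_MIN, ADISTANCE_MAX and adistance_to_tier() are
hypothetical and do not come from the patch, and the bounds 1..500 are
taken from the example ranges in this mail.

```c
#include <assert.h>

/* Hypothetical bounds from the example above: 1 <= distance <= 500. */
#define ADISTANCE_MIN 1
#define ADISTANCE_MAX 500

/*
 * Map an abstract distance to a zero-based tier index, given the width
 * of each tier's range ("chunk").  chunk == 100 yields the 5 tiers
 * 1-100, 101-200, ..., 401-500; chunk == 1 yields 500 tiers.
 * Returns -1 if the distance is outside the supported range.
 */
static int adistance_to_tier(int adistance, int chunk)
{
	assert(chunk > 0);

	if (adistance < ADISTANCE_MIN || adistance > ADISTANCE_MAX)
		return -1;

	return (adistance - ADISTANCE_MIN) / chunk;
}
```

With chunk == 100, distances 150 (DRAM) and 250 (PMEM) land in different
tiers, and changing the chunk at run time changes only the index
computation, not the per-node distances.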

>> - Or we can add memory type data structure now.

Best Regards,
Huang, Ying
Alistair Popple July 18, 2022, 5:58 a.m. UTC | #7
"Huang, Ying" <ying.huang@intel.com> writes:

> Wei Xu <weixugc@google.com> writes:
>
>> On Fri, Jul 15, 2022 at 12:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>>>
>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>
>>> > In the current kernel, memory tiers are defined implicitly via a
>>> > demotion path relationship between NUMA nodes, which is created
>>> > during the kernel initialization and updated when a NUMA node is
>>> > hot-added or hot-removed.  The current implementation puts all
>>> > nodes with CPU into the top tier, and builds the tier hierarchy
>>> > tier-by-tier by establishing the per-node demotion targets based
>>> > on the distances between nodes.
>>> >
>>> > This current memory tier kernel interface needs to be improved for
>>> > several important use cases,
>>> >
>>> > The current tier initialization code always initializes
>>> > each memory-only NUMA node into a lower tier.  But a memory-only
>>> > NUMA node may have a high performance memory device (e.g. a DRAM
>>> > device attached via CXL.mem or a DRAM-backed memory-only node on
>>> > a virtual machine) and should be put into a higher tier.
>>> >
>>> > The current tier hierarchy always puts CPU nodes into the top
>>> > tier. But on a system with HBM or GPU devices, the
>>> > memory-only NUMA nodes mapping these devices should be in the
>>> > top tier, and DRAM nodes with CPUs are better to be placed into the
>>> > next lower tier.
>>> >
>>> > With current kernel higher tier node can only be demoted to selected nodes on the
>>> > next lower tier as defined by the demotion path, not any other
>>> > node from any lower tier.  This strict, hard-coded demotion order
>>> > does not work in all use cases (e.g. some use cases may want to
>>> > allow cross-socket demotion to another node in the same demotion
>>> > tier as a fallback when the preferred demotion node is out of
>>> > space), This demotion order is also inconsistent with the page
>>> > allocation fallback order when all the nodes in a higher tier are
>>> > out of space: The page allocation can fall back to any node from
>>> > any lower tier, whereas the demotion order doesn't allow that.
>>> >
>>> > The current kernel also don't provide any interfaces for the
>>> > userspace to learn about the memory tier hierarchy in order to
>>> > optimize its memory allocations.
>>> >
>>> > This patch series address the above by defining memory tiers explicitly.
>>> >
>>> > This patch introduce explicity memory tiers. The tier ID value
>>> > of a memory tier is used to derive the demotion order between
>>> > NUMA nodes.
>>> >
>>> > For example, if we have 3 memtiers: memtier100, memtier200, memiter300
>>> > then the memory tier order is: memtier300 -> memtier200 -> memtier100
>>> > where memtier300 is the highest tier and memtier100 is the lowest tier.
>>> >
>>> > While reclaim we migrate pages from fast(higher) tiers to slow(lower)
>>> > tiers when the fast(higher) tier is under memory pressure.
>>> >
>>> > This patchset introduce 3 memory tiers (memtier100, memtier200 and memtier300)
>>> > which are created by different kernel subsystems. The default memory
>>> > tier created by the kernel is memtier200. A kernel parameter is provided
>>> > to override the default memory tier.
>>> >
>>> > Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>>> > Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>> >
>>> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>> > ---
>>> >  include/linux/memory-tiers.h | 15 +++++++
>>> >  mm/Makefile                  |  1 +
>>> >  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
>>> >  3 files changed, 94 insertions(+)
>>> >  create mode 100644 include/linux/memory-tiers.h
>>> >  create mode 100644 mm/memory-tiers.c
>>> >
>>> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>> > new file mode 100644
>>> > index 000000000000..a81dbc20e0d1
>>> > --- /dev/null
>>> > +++ b/include/linux/memory-tiers.h
>>> > @@ -0,0 +1,15 @@
>>> > +/* SPDX-License-Identifier: GPL-2.0 */
>>> > +#ifndef _LINUX_MEMORY_TIERS_H
>>> > +#define _LINUX_MEMORY_TIERS_H
>>> > +
>>> > +#ifdef CONFIG_NUMA
>>> > +
>>> > +#define MEMORY_TIER_HBM_GPU  300
>>> > +#define MEMORY_TIER_DRAM     200
>>> > +#define MEMORY_TIER_PMEM     100
>>> > +
>>> > +#define DEFAULT_MEMORY_TIER  MEMORY_TIER_DRAM
>>> > +#define MAX_MEMORY_TIER_ID   400
>>> > +
>>> > +#endif       /* CONFIG_NUMA */
>>> > +#endif  /* _LINUX_MEMORY_TIERS_H */
>>> > diff --git a/mm/Makefile b/mm/Makefile
>>> > index 6f9ffa968a1a..d30acebc2164 100644
>>> > --- a/mm/Makefile
>>> > +++ b/mm/Makefile
>>> > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>>> >  obj-$(CONFIG_FAILSLAB) += failslab.o
>>> >  obj-$(CONFIG_MEMTEST)                += memtest.o
>>> >  obj-$(CONFIG_MIGRATION) += migrate.o
>>> > +obj-$(CONFIG_NUMA) += memory-tiers.o
>>> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>> > new file mode 100644
>>> > index 000000000000..011877b6dbb9
>>> > --- /dev/null
>>> > +++ b/mm/memory-tiers.c
>>> > @@ -0,0 +1,78 @@
>>> > +// SPDX-License-Identifier: GPL-2.0
>>> > +#include <linux/types.h>
>>> > +#include <linux/nodemask.h>
>>> > +#include <linux/slab.h>
>>> > +#include <linux/lockdep.h>
>>> > +#include <linux/moduleparam.h>
>>> > +#include <linux/memory-tiers.h>
>>> > +
>>> > +struct memory_tier {
>>> > +     struct list_head list;
>>> > +     int id;
>>> > +     nodemask_t nodelist;
>>> > +};
>>> > +
>>> > +static DEFINE_MUTEX(memory_tier_lock);
>>> > +static LIST_HEAD(memory_tiers);
>>> > +
>>> > +static void insert_memory_tier(struct memory_tier *memtier)
>>> > +{
>>> > +     struct list_head *ent;
>>> > +     struct memory_tier *tmp_memtier;
>>> > +
>>> > +     lockdep_assert_held_once(&memory_tier_lock);
>>> > +
>>> > +     list_for_each(ent, &memory_tiers) {
>>> > +             tmp_memtier = list_entry(ent, struct memory_tier, list);
>>> > +             if (tmp_memtier->id < memtier->id) {
>>> > +                     list_add_tail(&memtier->list, ent);
>>> > +                     return;
>>> > +             }
>>> > +     }
>>> > +     list_add_tail(&memtier->list, &memory_tiers);
>>> > +}
>>> > +
>>> > +static struct memory_tier *register_memory_tier(unsigned int tier)
>>> > +{
>>> > +     struct memory_tier *memtier;
>>> > +
>>> > +     if (tier > MAX_MEMORY_TIER_ID)
>>> > +             return ERR_PTR(-EINVAL);
>>> > +
>>> > +     memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
>>> > +     if (!memtier)
>>> > +             return ERR_PTR(-ENOMEM);
>>> > +
>>> > +     memtier->id   = tier;
>>> > +
>>> > +     insert_memory_tier(memtier);
>>> > +
>>> > +     return memtier;
>>> > +}
>>> > +
>>> > +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
>>> > +core_param(default_memory_tier, default_memtier, uint, 0644);
>>> > +
>>> > +static int __init memory_tier_init(void)
>>> > +{
>>> > +     struct memory_tier *memtier;
>>> > +
>>> > +     /*
>>> > +      * Register only default memory tier to hide all empty
>>> > +      * memory tier from sysfs. Since this is early during
>>> > +      * boot, we could avoid holding memtory_tier_lock. But
>>> > +      * keep it simple by holding locks. So we can add lock
>>> > +      * held debug checks in other functions.
>>> > +      */
>>> > +     mutex_lock(&memory_tier_lock);
>>> > +     memtier = register_memory_tier(default_memtier);
>>> > +     if (IS_ERR(memtier))
>>> > +             panic("%s() failed to register memory tier: %ld\n",
>>> > +                   __func__, PTR_ERR(memtier));
>>> > +
>>> > +     /* CPU only nodes are not part of memory tiers. */
>>> > +     memtier->nodelist = node_states[N_MEMORY];
>>> > +     mutex_unlock(&memory_tier_lock);
>>> > +     return 0;
>>> > +}
>>> > +subsys_initcall(memory_tier_init);
>>>
>>> You dropped the original sysfs interface patches from the series, but
>>> the kernel internal implementation is still for the original sysfs
>>> interface.  For example, memory tier ID is for the original sysfs
>>> interface, not for the new proposed sysfs interface.  So I suggest you
>>> to implement with the new interface in mind.  What do you think about
>>> the following design?
>>>
>>> - Each NUMA node belongs to a memory type, and each memory type
>>>   corresponds to a "abstract distance", so each NUMA node corresonds to
>>>   a "distance".  For simplicity, we can start with static distances, for
>>>   example, DRAM (default): 150, PMEM: 250.
>>
>> I agree with this design, though I'd prefer the new attribute to not
>> be named as "distance".  This is to both avoid the confusion with the
>> SLIT distance and to avoid the misconception that only the latency
>> matters, but the bandwidth doesn't.
>>
>> How about we call it "performance level" (perf_level) or something
>> similar instead?
>
> I have no strong opinion on this.  Both "distance" or "perf_level" looks
> OK to me.
>
>>> The distance of each NUMA
>>>   node can be recorded in a global array,
>>>
>>>     int node_distances[MAX_NUMNODES];
>>>
>>>   or, just
>>>
>>>     pgdat->distance
>>
>> I think node_devices[] is a better place to record this new attribute.
>> The HMAT performance data is also listed there.
>
> Firstly, we all agree that we need a place to record this information,
> per node or per memory type.  Personally, I prefer to separate the data
> and its interface (such as sysfs).
>
>>> - Each memory tier corresponds to a range of distance, for example,
>>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>>>
>>> - The core API of memory tier could be
>>>
>>>     struct memory_tier *find_create_memory_tier(int distance);
>>>
>>>   it will find the memory tier which covers "distance" in the memory
>>>   tier list, or create a new memory tier if not found.
>>>
>>> - kmem_dax driver will setup distance for PMEM NUMA nodes before online
>>>   them.
>>
>> This attribute should be a property of the NUMA node based on the
>> device hardware.
>
> Yes.  Or a property of a memory type.
>
>> For PMEM, it is better to handle at the ACPI level.
>> For example, we can consider initializing this attribute for a PMEM
>> node in acpi_numa_memory_affinity_init() when the node is
>> non-volatile.
>
> The abstract_distance/perf_level may be determined from multiple
> information sources, e.g., ACPI SLIT/SRAT/HMAT, etc.  It should be the
> responsibility of device drivers (e.g., kmem_dax) to determine the final
> value of abstract_distance/perf_level based on the information
> availability/priority and some specific knowledge of the hardware.  Yes,
> ACPI SRAT is valuable to determine the abstract_distance/perf_level.
> And, it's better for kmem_dax to use it to determine the final value of
> abstract_distance/perf_level.
>
> To make the first version as simple as possible, I think we can just use
> some static abstract_distance/perf_level in kmem_dax driver for the NUMA
> nodes onlined by it.  Because we use the driver for PMEM only now.  We
> can enhance the implementation later.

I agree. Ideally I think all this should be derived from ACPI tables,
etc. However I think it will take a while for both FW and SW to make
that information available and correct. Letting drivers initialise that
for now at least should aid development in determining how performance
levels should be set from multiple information sources, especially if
there is no way of overriding it from userspace.

>>> - When a NUMA node is onlined, we will use find_create_memory_tier() to
>>>   find or create its memory tier and add the NUMA node into the memory
>>>   tier.
>>
>> I think we should create all the memory tiers up-front, just like NUMA
>> nodes, to keep their devices and IDs stable.  Similar to offline NUMA
>> nodes, when a memory tier has no online nodes, we can mark it as
>> offline and exclude it from online-related operations (e.g. demotion).
>> A memory tier can be made online when it gets assigned with an online
>> node.
>
> Each memory tier corresponds to a range of abstract_distance/perf_level.
> For example, if 1 <= abstract_distance/perf_level <= 500, 5 memory tiers
> can be defined with abstract_distance/perf_level ranges 1-100, 101-200,
> 201-300, 301-400, 401-500.  We can create these 5 memory tiers up-front
> of course.  In the new design, we may change the ranges at run time
> according to policy chosen by the users.  For example, we may change 5
> memory tiers above to 500 memory tiers, with
> abstract_distance/perf_level ranges 1-1, 2-2, ..., 500-500.  This may
> make memory tier devices and their IDs unstable at some degree.  But if
> we are cautious to customize the ranges, it's possible to make the
> memory tier devices and their IDs stable in most cases.
>
> Because we may define 500 memory tiers, it's hard to create all memory
> tiers up-front really.  But we can create them all in concept and
> allocate memory/resources for one when we add the first NUMA node to it.
>
> To make the fist version as simple as possible, I suggest to define 500
> memory tiers as above statically.
>
>>> - Or we can add memory type data structure now.
>
> Best Regards,
> Huang, Ying
Huang, Ying July 18, 2022, 6:08 a.m. UTC | #8
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:

> Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:
>
> ....
>
>> 
>>> You dropped the original sysfs interface patches from the series, but
>>> the kernel internal implementation is still for the original sysfs
>>> interface.  For example, memory tier ID is for the original sysfs
>>> interface, not for the new proposed sysfs interface.  So I suggest you
>>> to implement with the new interface in mind.  What do you think about
>>> the following design?
>>> 
>>
>> Sorry I am not able to follow you here. This patchset completely drops
>> exposing memory tiers to userspace via sysfs. Instead it allow
>> creation of memory tiers with specific tierID from within the kernel/device driver.
>> Default tierID is 200 and dax kmem creates memory tier with tierID 100. 
>>
>>
>>> - Each NUMA node belongs to a memory type, and each memory type
>>>   corresponds to a "abstract distance", so each NUMA node corresonds to
>>>   a "distance".  For simplicity, we can start with static distances, for
>>>   example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
>>>   node can be recorded in a global array,
>>> 
>>>     int node_distances[MAX_NUMNODES];
>>> 
>>>   or, just
>>> 
>>>     pgdat->distance
>>> 
>>
>> I don't follow this. I guess you are trying to have a different design.
>> Would it be much easier if you can write this in the form of a patch? 
>>
>>
>>> - Each memory tier corresponds to a range of distance, for example,
>>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>>> 
>>> - The core API of memory tier could be
>>> 
>>>     struct memory_tier *find_create_memory_tier(int distance);
>>> 
>>>   it will find the memory tier which covers "distance" in the memory
>>>   tier list, or create a new memory tier if not found.
>>> 
>>
>> I was expecting this to be internal to dax kmem. How dax kmem maps
>> "abstract distance" to a memory tier. At this point this patchset is
>> keeping all that for a future patchset. 
>>
>
> This shows how i was expecting "abstract distance" to be integrated.
>

Thanks!

To make the first version as simple as possible, I think we can just use
some static "abstract distance" for dax_kmem, e.g., 250, because we
use it only for PMEM for now.  We can enhance dax_kmem later.

IMHO, we should get the core framework right first.

- A device driver should report the capability (or performance level) of
  the hardware to the memory tier core via abstract distance.  This can
  be done via some global data structure (e.g. node_distances[]) at
  least in the first version.

- Memory tier core determines the mapping from the abstract distance to
  the memory tier via abstract distance ranges, and allocates the struct
  memory_tier when necessary.  That is, the memory tier core, not the
  device drivers, decides whether to allocate a new memory tier or reuse
  an existing one for a NUMA node.

- It's better to place the NUMA node in the correct memory tier in the
  first place.  We should avoid placing the PMEM node in the default
  tier and then moving it to the correct memory tier.  That is, device
  drivers should report the abstract distance before onlining NUMA
  nodes.
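
The three points above can be modelled in userspace C roughly as below.
This is a minimal sketch under stated assumptions, not the kernel
implementation: node_distances[] and find_create_memory_tier() follow
the names proposed in this thread, while ADISTANCE_CHUNK, the
half-open ranges [0,100), [100,200), ..., the unsigned long node mask,
and online_node() are hypothetical simplifications.

```c
#include <stdlib.h>

#define MAX_NUMNODES    8    /* hypothetical small value for the sketch */
#define ADISTANCE_CHUNK 100  /* hypothetical static range width */

struct memory_tier {
	struct memory_tier *next;
	int adistance_start;    /* covers [start, start + ADISTANCE_CHUNK) */
	unsigned long nodemask; /* bit n set => node n is in this tier */
};

/* Tier list, kept sorted by ascending adistance_start (fastest first). */
static struct memory_tier *memory_tiers;

/* "Drivers" fill this in before a node is onlined. */
static int node_distances[MAX_NUMNODES];

/* Core: find the tier covering adistance, or allocate it on first use. */
static struct memory_tier *find_create_memory_tier(int adistance)
{
	struct memory_tier **link = &memory_tiers, *tier;
	int start;

	if (adistance < 0)
		return NULL;
	start = adistance - adistance % ADISTANCE_CHUNK;

	while (*link && (*link)->adistance_start < start)
		link = &(*link)->next;
	if (*link && (*link)->adistance_start == start)
		return *link;   /* reuse the existing tier */

	tier = calloc(1, sizeof(*tier));
	if (!tier)
		return NULL;
	tier->adistance_start = start;
	tier->next = *link;     /* insert in sorted position */
	*link = tier;
	return tier;
}

/* Core: called when a node is onlined; its distance must already be set. */
static struct memory_tier *online_node(int nid)
{
	struct memory_tier *tier;

	if (nid < 0 || nid >= MAX_NUMNODES)
		return NULL;
	tier = find_create_memory_tier(node_distances[nid]);
	if (tier)
		tier->nodemask |= 1UL << nid;
	return tier;
}
```

Because the driver only writes node_distances[] and the core alone
walks and extends the tier list, a PMEM node with distance 250 never
passes through the default tier on its way to the right one.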

Please check my reply to Wei too about my other suggestions for the
first version.

Best Regards,
Huang, Ying
Aneesh Kumar K.V July 18, 2022, 6:56 a.m. UTC | #9
On 7/18/22 11:28 AM, Alistair Popple wrote:
> 
> "Huang, Ying" <ying.huang@intel.com> writes:
> 
>> Wei Xu <weixugc@google.com> writes:
>>
>>> On Fri, Jul 15, 2022 at 12:53 AM Huang, Ying <ying.huang@intel.com> wrote:
>>>>
>>>> "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com> writes:
>>>>
>>>>> In the current kernel, memory tiers are defined implicitly via a
>>>>> demotion path relationship between NUMA nodes, which is created
>>>>> during the kernel initialization and updated when a NUMA node is
>>>>> hot-added or hot-removed.  The current implementation puts all
>>>>> nodes with CPU into the top tier, and builds the tier hierarchy
>>>>> tier-by-tier by establishing the per-node demotion targets based
>>>>> on the distances between nodes.
>>>>>
>>>>> This current memory tier kernel interface needs to be improved for
>>>>> several important use cases,
>>>>>
>>>>> The current tier initialization code always initializes
>>>>> each memory-only NUMA node into a lower tier.  But a memory-only
>>>>> NUMA node may have a high performance memory device (e.g. a DRAM
>>>>> device attached via CXL.mem or a DRAM-backed memory-only node on
>>>>> a virtual machine) and should be put into a higher tier.
>>>>>
>>>>> The current tier hierarchy always puts CPU nodes into the top
>>>>> tier. But on a system with HBM or GPU devices, the
>>>>> memory-only NUMA nodes mapping these devices should be in the
>>>>> top tier, and DRAM nodes with CPUs are better to be placed into the
>>>>> next lower tier.
>>>>>
>>>>> With current kernel higher tier node can only be demoted to selected nodes on the
>>>>> next lower tier as defined by the demotion path, not any other
>>>>> node from any lower tier.  This strict, hard-coded demotion order
>>>>> does not work in all use cases (e.g. some use cases may want to
>>>>> allow cross-socket demotion to another node in the same demotion
>>>>> tier as a fallback when the preferred demotion node is out of
>>>>> space), This demotion order is also inconsistent with the page
>>>>> allocation fallback order when all the nodes in a higher tier are
>>>>> out of space: The page allocation can fall back to any node from
>>>>> any lower tier, whereas the demotion order doesn't allow that.
>>>>>
>>>>> The current kernel also don't provide any interfaces for the
>>>>> userspace to learn about the memory tier hierarchy in order to
>>>>> optimize its memory allocations.
>>>>>
>>>>> This patch series address the above by defining memory tiers explicitly.
>>>>>
>>>>> This patch introduce explicity memory tiers. The tier ID value
>>>>> of a memory tier is used to derive the demotion order between
>>>>> NUMA nodes.
>>>>>
>>>>> For example, if we have 3 memtiers: memtier100, memtier200, memiter300
>>>>> then the memory tier order is: memtier300 -> memtier200 -> memtier100
>>>>> where memtier300 is the highest tier and memtier100 is the lowest tier.
>>>>>
>>>>> While reclaim we migrate pages from fast(higher) tiers to slow(lower)
>>>>> tiers when the fast(higher) tier is under memory pressure.
>>>>>
>>>>> This patchset introduce 3 memory tiers (memtier100, memtier200 and memtier300)
>>>>> which are created by different kernel subsystems. The default memory
>>>>> tier created by the kernel is memtier200. A kernel parameter is provided
>>>>> to override the default memory tier.
>>>>>
>>>>> Link: https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>>>>> Link: https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com
>>>>>
>>>>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>>>>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>>>>> ---
>>>>>  include/linux/memory-tiers.h | 15 +++++++
>>>>>  mm/Makefile                  |  1 +
>>>>>  mm/memory-tiers.c            | 78 ++++++++++++++++++++++++++++++++++++
>>>>>  3 files changed, 94 insertions(+)
>>>>>  create mode 100644 include/linux/memory-tiers.h
>>>>>  create mode 100644 mm/memory-tiers.c
>>>>>
>>>>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>>>>> new file mode 100644
>>>>> index 000000000000..a81dbc20e0d1
>>>>> --- /dev/null
>>>>> +++ b/include/linux/memory-tiers.h
>>>>> @@ -0,0 +1,15 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>> +#ifndef _LINUX_MEMORY_TIERS_H
>>>>> +#define _LINUX_MEMORY_TIERS_H
>>>>> +
>>>>> +#ifdef CONFIG_NUMA
>>>>> +
>>>>> +#define MEMORY_TIER_HBM_GPU  300
>>>>> +#define MEMORY_TIER_DRAM     200
>>>>> +#define MEMORY_TIER_PMEM     100
>>>>> +
>>>>> +#define DEFAULT_MEMORY_TIER  MEMORY_TIER_DRAM
>>>>> +#define MAX_MEMORY_TIER_ID   400
>>>>> +
>>>>> +#endif       /* CONFIG_NUMA */
>>>>> +#endif  /* _LINUX_MEMORY_TIERS_H */
>>>>> diff --git a/mm/Makefile b/mm/Makefile
>>>>> index 6f9ffa968a1a..d30acebc2164 100644
>>>>> --- a/mm/Makefile
>>>>> +++ b/mm/Makefile
>>>>> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>>>>>  obj-$(CONFIG_FAILSLAB) += failslab.o
>>>>>  obj-$(CONFIG_MEMTEST)                += memtest.o
>>>>>  obj-$(CONFIG_MIGRATION) += migrate.o
>>>>> +obj-$(CONFIG_NUMA) += memory-tiers.o
>>>>>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>>>>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>>>>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>>>>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>>>>> new file mode 100644
>>>>> index 000000000000..011877b6dbb9
>>>>> --- /dev/null
>>>>> +++ b/mm/memory-tiers.c
>>>>> @@ -0,0 +1,78 @@
>>>>> +// SPDX-License-Identifier: GPL-2.0
>>>>> +#include <linux/types.h>
>>>>> +#include <linux/nodemask.h>
>>>>> +#include <linux/slab.h>
>>>>> +#include <linux/lockdep.h>
>>>>> +#include <linux/moduleparam.h>
>>>>> +#include <linux/memory-tiers.h>
>>>>> +
>>>>> +struct memory_tier {
>>>>> +     struct list_head list;
>>>>> +     int id;
>>>>> +     nodemask_t nodelist;
>>>>> +};
>>>>> +
>>>>> +static DEFINE_MUTEX(memory_tier_lock);
>>>>> +static LIST_HEAD(memory_tiers);
>>>>> +
>>>>> +static void insert_memory_tier(struct memory_tier *memtier)
>>>>> +{
>>>>> +     struct list_head *ent;
>>>>> +     struct memory_tier *tmp_memtier;
>>>>> +
>>>>> +     lockdep_assert_held_once(&memory_tier_lock);
>>>>> +
>>>>> +     list_for_each(ent, &memory_tiers) {
>>>>> +             tmp_memtier = list_entry(ent, struct memory_tier, list);
>>>>> +             if (tmp_memtier->id < memtier->id) {
>>>>> +                     list_add_tail(&memtier->list, ent);
>>>>> +                     return;
>>>>> +             }
>>>>> +     }
>>>>> +     list_add_tail(&memtier->list, &memory_tiers);
>>>>> +}
>>>>> +
>>>>> +static struct memory_tier *register_memory_tier(unsigned int tier)
>>>>> +{
>>>>> +     struct memory_tier *memtier;
>>>>> +
>>>>> +     if (tier > MAX_MEMORY_TIER_ID)
>>>>> +             return ERR_PTR(-EINVAL);
>>>>> +
>>>>> +     memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
>>>>> +     if (!memtier)
>>>>> +             return ERR_PTR(-ENOMEM);
>>>>> +
>>>>> +     memtier->id   = tier;
>>>>> +
>>>>> +     insert_memory_tier(memtier);
>>>>> +
>>>>> +     return memtier;
>>>>> +}
>>>>> +
>>>>> +static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
>>>>> +core_param(default_memory_tier, default_memtier, uint, 0644);
>>>>> +
>>>>> +static int __init memory_tier_init(void)
>>>>> +{
>>>>> +     struct memory_tier *memtier;
>>>>> +
>>>>> +     /*
>>>>> +      * Register only default memory tier to hide all empty
>>>>> +      * memory tier from sysfs. Since this is early during
>>>>> +      * boot, we could avoid holding memtory_tier_lock. But
>>>>> +      * keep it simple by holding locks. So we can add lock
>>>>> +      * held debug checks in other functions.
>>>>> +      */
>>>>> +     mutex_lock(&memory_tier_lock);
>>>>> +     memtier = register_memory_tier(default_memtier);
>>>>> +     if (IS_ERR(memtier))
>>>>> +             panic("%s() failed to register memory tier: %ld\n",
>>>>> +                   __func__, PTR_ERR(memtier));
>>>>> +
>>>>> +     /* CPU only nodes are not part of memory tiers. */
>>>>> +     memtier->nodelist = node_states[N_MEMORY];
>>>>> +     mutex_unlock(&memory_tier_lock);
>>>>> +     return 0;
>>>>> +}
>>>>> +subsys_initcall(memory_tier_init);
>>>>
>>>> You dropped the original sysfs interface patches from the series, but
>>>> the kernel internal implementation is still for the original sysfs
>>>> interface.  For example, memory tier ID is for the original sysfs
>>>> interface, not for the new proposed sysfs interface.  So I suggest you
>>>> to implement with the new interface in mind.  What do you think about
>>>> the following design?
>>>>
>>>> - Each NUMA node belongs to a memory type, and each memory type
>>>>   corresponds to a "abstract distance", so each NUMA node corresonds to
>>>>   a "distance".  For simplicity, we can start with static distances, for
>>>>   example, DRAM (default): 150, PMEM: 250.
>>>
>>> I agree with this design, though I'd prefer the new attribute to not
>>> be named as "distance".  This is to both avoid the confusion with the
>>> SLIT distance and to avoid the misconception that only the latency
>>> matters, but the bandwidth doesn't.
>>>
>>> How about we call it "performance level" (perf_level) or something
>>> similar instead?
>>
>> I have no strong opinion on this.  Both "distance" or "perf_level" looks
>> OK to me.
>>
>>>> The distance of each NUMA
>>>>   node can be recorded in a global array,
>>>>
>>>>     int node_distances[MAX_NUMNODES];
>>>>
>>>>   or, just
>>>>
>>>>     pgdat->distance
>>>
>>> I think node_devices[] is a better place to record this new attribute.
>>> The HMAT performance data is also listed there.
>>
>> Firstly, we all agree that we need a place to record this information,
>> per node or per memory type.  Personally, I prefer to separate the data
>> and its interface (such as sysfs).
>>
>>>> - Each memory tier corresponds to a range of distance, for example,
>>>>   0-100, 100-200, 200-300, >300, we can start with static ranges too.
>>>>
>>>> - The core API of memory tier could be
>>>>
>>>>     struct memory_tier *find_create_memory_tier(int distance);
>>>>
>>>>   it will find the memory tier which covers "distance" in the memory
>>>>   tier list, or create a new memory tier if not found.
>>>>
>>>> - kmem_dax driver will setup distance for PMEM NUMA nodes before online
>>>>   them.
>>>
>>> This attribute should be a property of the NUMA node based on the
>>> device hardware.
>>
>> Yes.  Or a property of a memory type.
>>
>>> For PMEM, it is better to handle at the ACPI level.
>>> For example, we can consider initializing this attribute for a PMEM
>>> node in acpi_numa_memory_affinity_init() when the node is
>>> non-volatile.
>>
>> The abstract_distance/perf_level may be determined from multiple
>> information sources, e.g., ACPI SLIT/SRAT/HMAT, etc.  It should be the
>> responsibility of device drivers (e.g., kmem_dax) to determine the final
>> value of abstract_distance/perf_level based on the information
>> availability/priority and some specific knowledge of the hardware.  Yes,
>> ACPI SRAT is valuable to determine the abstract_distance/perf_level.
>> And, it's better for kmem_dax to use it to determine the final value of
>> abstract_distance/perf_level.
>>
>> To make the first version as simple as possible, I think we can just use
>> some static abstract_distance/perf_level in kmem_dax driver for the NUMA
>> nodes onlined by it.  Because we use the driver for PMEM only now.  We
>> can enhance the implementation later.
> 
> I agree. Ideally I think all this should be derived from ACPI tables,
> etc. However I think it will take a while for both FW and SW to make
> that information available and correct. Letting drivers initialise that
> for now at least should aid development in determining how performance
> levels should be set from multiple information sources, especially if
> there is no way of overriding it from userspace.
>

When we parse the firmware tables, node_devices is mostly not allocated
yet.  It gets allocated in register_one_node().  We can use a hotplug
callback like the one below.  This should also allow us to update
perf_level based on ACPI tables.

diff --git a/drivers/acpi/nfit/core.c b/drivers/acpi/nfit/core.c
index ae5f4acf2675..89b010e0461e 100644
--- a/drivers/acpi/nfit/core.c
+++ b/drivers/acpi/nfit/core.c
@@ -15,6 +15,7 @@
 #include <linux/sort.h>
 #include <linux/io.h>
 #include <linux/nd.h>
+#include <linux/memory.h>
 #include <asm/cacheflush.h>
 #include <acpi/nfit.h>
 #include "intel.h"
@@ -3470,6 +3471,45 @@ static struct acpi_driver acpi_nfit_driver = {
 	},
 };
 
+static int nfit_callback(struct notifier_block *self,
+			 unsigned long action, void *arg)
+{
+	bool found = false;
+	struct memory_notify *mnb = arg;
+	int nid = mnb->status_change_nid;
+	struct nfit_spa *nfit_spa;
+	struct acpi_nfit_desc *acpi_desc;
+
+	if (nid == NUMA_NO_NODE || action != MEM_ONLINE)
+		return NOTIFY_OK;
+
+	mutex_lock(&acpi_desc_lock);
+	list_for_each_entry(acpi_desc, &acpi_descs, list) {
+		mutex_lock(&acpi_desc->init_mutex);
+		list_for_each_entry(nfit_spa, &acpi_desc->spas, list) {
+			struct acpi_nfit_system_address *spa = nfit_spa->spa;
+			int target_node = pxm_to_node(spa->proximity_domain);
+
+			if (target_node == nid) {
+				node_devices[nid]->perf_level = 1;
+				found = true;
+				break;
+			}
+		}
+		mutex_unlock(&acpi_desc->init_mutex);
+		if (found)
+			break;
+	}
+	mutex_unlock(&acpi_desc_lock);
+	return NOTIFY_OK;
+}
+
+static struct notifier_block nfit_callback_nb = {
+	.notifier_call = nfit_callback,
+	.priority = 2,
+};
+
+
 static __init int nfit_init(void)
 {
 	int ret;
@@ -3509,7 +3549,11 @@ static __init int nfit_init(void)
 		nfit_mce_unregister();
 		destroy_workqueue(nfit_wq);
 	}
-
+	/*
+	 * register a memory hotplug notifier at prio 2 so that we
+	 * can update the perf level for the node.
+	 */
+	register_hotmemory_notifier(&nfit_callback_nb);
 	return ret;
 
 }
Huang, Ying July 18, 2022, 6:57 a.m. UTC | #10
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 7/15/22 1:23 PM, Huang, Ying wrote:

[snip]

>> 
>> You dropped the original sysfs interface patches from the series, but
>> the kernel internal implementation is still for the original sysfs
>> interface.  For example, memory tier ID is for the original sysfs
>> interface, not for the new proposed sysfs interface.  So I suggest that you
>> implement with the new interface in mind.  What do you think about
>> the following design?
>> 
>
> Sorry, I am not able to follow you here. This patchset completely drops
> exposing memory tiers to userspace via sysfs. Instead it allows
> creation of memory tiers with a specific tierID from within the kernel/device driver.
> The default tierID is 200 and dax kmem creates a memory tier with tierID 100.
>
>
>> - Each NUMA node belongs to a memory type, and each memory type
>>   corresponds to an "abstract distance", so each NUMA node corresponds to
>>   a "distance".  For simplicity, we can start with static distances, for
>>   example, DRAM (default): 150, PMEM: 250.  The distance of each NUMA
>>   node can be recorded in a global array,
>> 
>>     int node_distances[MAX_NUMNODES];
>> 
>>   or, just
>> 
>>     pgdat->distance
>> 
>
> I don't follow this. I guess you are trying to have a different design.
> Would it be easier if you could write this in the form of a patch?

I have written some pseudocode below to show my basic idea.

#define MEMORY_TIER_ADISTANCE_DRAM	150
#define MEMORY_TIER_ADISTANCE_PMEM	250

struct memory_tier {
	/* abstract distance range covered by the memory tier */
	int adistance_start;
	int adistance_len;
	struct list_head list;
	nodemask_t nodemask;
};

/* RCU list of memory tiers */
static LIST_HEAD(memory_tiers);

/* abstract distance of each NUMA node */
int node_adistances[MAX_NUMNODES];

struct memory_tier *find_create_memory_tier(int adistance)
{
	struct memory_tier *tier;

	list_for_each_entry(tier, &memory_tiers, list) {
		if (adistance >= tier->adistance_start &&
		    adistance < tier->adistance_start + tier->adistance_len)
			return tier;
	}
	/* allocate a new memory tier and return */
}

void memory_tier_add_node(int nid)
{
	int adistance;
	struct memory_tier *tier;

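	/* "?:" keeps a non-zero distance; "||" would collapse it to 0 or 1 */
	adistance = node_adistances[nid] ?: MEMORY_TIER_ADISTANCE_DRAM;
	tier = find_create_memory_tier(adistance);
	/* node_set() takes the nodemask itself, not a pointer to it */
	node_set(nid, tier->nodemask);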
	/* setup demotion data structure, etc */
}

static int __meminit migrate_on_reclaim_callback(struct notifier_block *self,
						 unsigned long action, void *_arg)
{
	struct memory_notify *arg = _arg;
	int nid;

	nid = arg->status_change_nid;
	if (nid < 0)
		return notifier_from_errno(0);

	switch (action) {
	case MEM_ONLINE:
		memory_tier_add_node(nid);
		break;
	}

	return notifier_from_errno(0);
}

/* kmem.c */
static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
{
	node_adistances[dev_dax->target_node] = MEMORY_TIER_ADISTANCE_PMEM;
	/* add_memory_driver_managed() */
}
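
For illustration, the range lookup in the pseudocode above can be tried out in user space. This is only a minimal sketch under stated assumptions: the fixed-size array, the 100-wide range (TIER_LEN), the MAX_TIERS cap, and the unsigned long node mask are hypothetical stand-ins for the kernel list, adistance_len, and nodemask_t, and abstract distances are assumed to be non-negative. Only the 150/250 distances come from the pseudocode.

```c
#include <stddef.h>

#define ADISTANCE_DRAM 150	/* from the pseudocode above */
#define ADISTANCE_PMEM 250	/* from the pseudocode above */
#define TIER_LEN       100	/* assumption: every tier covers 100 distances */
#define MAX_TIERS      8	/* assumption: cap for this sketch only */

struct tier {
	int adistance_start;	/* first abstract distance covered */
	int adistance_len;	/* number of abstract distances covered */
	unsigned long nodemask;	/* one bit per node (sketch only) */
};

static struct tier tiers[MAX_TIERS];
static size_t nr_tiers;

/* Return the tier covering adistance, creating it if needed; NULL if full. */
static struct tier *find_create_tier(int adistance)
{
	struct tier *t;
	size_t i;

	for (i = 0; i < nr_tiers; i++) {
		t = &tiers[i];
		if (adistance >= t->adistance_start &&
		    adistance < t->adistance_start + t->adistance_len)
			return t;
	}
	if (nr_tiers == MAX_TIERS)
		return NULL;

	/* New tier: start at the enclosing multiple of TIER_LEN. */
	t = &tiers[nr_tiers++];
	t->adistance_start = adistance - adistance % TIER_LEN;
	t->adistance_len = TIER_LEN;
	t->nodemask = 0;
	return t;
}

/*
 * Put node nid into its tier. A distance of 0 means "not set", so fall
 * back to the DRAM distance. nid must be smaller than the number of bits
 * in unsigned long for this sketch.
 */
static struct tier *tier_add_node(int nid, int node_adistance)
{
	int adistance = node_adistance ? node_adistance : ADISTANCE_DRAM;
	struct tier *t = find_create_tier(adistance);

	if (t)
		t->nodemask |= 1UL << nid;
	return t;
}
```

With these assumptions, a node with no distance set (0) and a node at distance 180 both land in the tier covering [100, 200), while a PMEM node at 250 gets its own tier covering [200, 300).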

[snip]

Best Regards,
Huang, Ying
Aneesh Kumar K.V July 18, 2022, 8 a.m. UTC | #11
On 7/18/22 12:27 PM, Huang, Ying wrote:
> [snip]


Implementing that, I ended up with the code below. The difference is that adistance_len is not a memory tier property;
instead it is a kernel parameter, like memory_tier_chunk_size, which can be tuned to create more memory tiers.
How about this? Not yet tested.

struct memory_tier {
	struct list_head list;
	int id;
	int perf_level;
	nodemask_t nodelist;
};

static LIST_HEAD(memory_tiers);
static DEFINE_MUTEX(memory_tier_lock);
static unsigned int default_memtier_perf_level = DEFAULT_MEMORY_TYPE_PERF;
core_param(default_memory_tier_perf_level, default_memtier_perf_level, uint, 0644);
static unsigned int memtier_perf_chunk_size = 150;
core_param(memory_tier_perf_chunk, memtier_perf_chunk_size, uint, 0644);

/*
 * performance levels are grouped into memtiers each of chunk size
 * memtier_perf_chunk
 */
static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
{
	bool found_slot = false;
	struct list_head *ent;
	struct memory_tier *memtier, *new_memtier;
	static int next_memtier_id = 0;
	/*
	 * zero is special in that it indicates uninitialized
	 * perf level by respective driver. Pick default memory
	 * tier perf level for that.
	 */
	if (!perf_level)
		perf_level = default_memtier_perf_level;

	lockdep_assert_held_once(&memory_tier_lock);

	list_for_each(ent, &memory_tiers) {
		memtier = list_entry(ent, struct memory_tier, list);
		if (perf_level >= memtier->perf_level &&
		    perf_level < memtier->perf_level + memtier_perf_chunk_size)
			return memtier;
		else if (perf_level < memtier->perf_level) {
			found_slot = true;
			break;
		}
	}

	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
	if (!new_memtier)
		return ERR_PTR(-ENOMEM);

	new_memtier->id = next_memtier_id++;
	/* rounddown(), not ALIGN_DOWN(): the chunk size need not be a power of two */
	new_memtier->perf_level = rounddown(perf_level, memtier_perf_chunk_size);
	if (found_slot)
		list_add_tail(&new_memtier->list, ent);
	else
		list_add_tail(&new_memtier->list, &memory_tiers);
	return new_memtier;
}

static int __init memory_tier_init(void)
{
	int node;
	struct memory_tier *memtier;

	/*
	 * Since this is early during boot, we could avoid
	 * holding memory_tier_lock. But keep it simple by
	 * holding locks, so that we can add lock-held debug checks
	 * in other functions.
	 */
	mutex_lock(&memory_tier_lock);
	memtier = find_create_memory_tier(default_memtier_perf_level);
	if (IS_ERR(memtier))
		panic("%s() failed to register memory tier: %ld\n",
		      __func__, PTR_ERR(memtier));

	/* CPU only nodes are not part of memory tiers. */
	memtier->nodelist = node_states[N_MEMORY];

	/*
	 * Nodes that are already online and that don't
	 * have a perf level assigned are assigned the default perf
	 * level.
	 */
	for_each_node_state(node, N_MEMORY) {
		struct node *node_property = node_devices[node];

		if (!node_property->perf_level)
			node_property->perf_level = default_memtier_perf_level;
	}
	mutex_unlock(&memory_tier_lock);
	return 0;
}
subsys_initcall(memory_tier_init);
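
The tier base in the code above is the perf level rounded down to a multiple of the chunk size. A point worth checking is that mask-based alignment, which is how the kernel's ALIGN_DOWN() works as far as I know, only gives the right answer for power-of-two sizes, and the default chunk size of 150 is not one. A small user-space sketch of the arithmetic follows; mask_align_down(), chunk_round_down(), and tier_base() are hypothetical names used only here, not kernel helpers.

```c
/*
 * Mask-based round-down, the same arithmetic as a power-of-two-only
 * alignment macro. Correct only when a is a power of two.
 */
static unsigned int mask_align_down(unsigned int x, unsigned int a)
{
	return x & ~(a - 1);
}

/* Modulo-based round-down. Correct for any non-zero chunk size. */
static unsigned int chunk_round_down(unsigned int x, unsigned int chunk)
{
	return x - x % chunk;
}

/* Hypothetical helper: base perf level of the tier that holds perf_level. */
static unsigned int tier_base(unsigned int perf_level, unsigned int chunk)
{
	return chunk_round_down(perf_level, chunk);
}
```

For example, with a chunk size of 150 a perf level of 200 should fall in the tier based at 150, which is what the modulo version returns; the mask version returns 72. With a power-of-two chunk size such as 128 the two agree.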
Huang, Ying July 18, 2022, 8:55 a.m. UTC | #12
Aneesh Kumar K V <aneesh.kumar@linux.ibm.com> writes:

> On 7/18/22 12:27 PM, Huang, Ying wrote:
>> [snip]
>
>
> Implementing that, I ended up with the code below. The difference is that adistance_len is not a memory tier property;
> instead it is a kernel parameter, like memory_tier_chunk_size, which can
> be tuned to create more memory tiers.

How to represent the abstract distance range of a memory tier has not
been decided yet.  perf_level_chunk_size or perf_level_granularity is another
possible solution.  But I don't think it should be a kernel parameter
for the first step.

> How about this? Not yet tested.
>
> struct memory_tier {
> 	struct list_head list;
> 	int id;

In fact, we don't need "id" for now.  So I suggest removing it.  We can
add it when we really need it.

> 	int perf_level;
> 	nodemask_t nodelist;
> };
>
> static LIST_HEAD(memory_tiers);
> static DEFINE_MUTEX(memory_tier_lock);
> static unsigned int default_memtier_perf_level = DEFAULT_MEMORY_TYPE_PERF;
> core_param(default_memory_tier_perf_level, default_memtier_perf_level, uint, 0644);
> static unsigned int memtier_perf_chunk_size = 150;
> core_param(memory_tier_perf_chunk, memtier_perf_chunk_size, uint, 0644);
>
> /*
>  * performance levels are grouped into memtiers each of chunk size
>  * memtier_perf_chunk
>  */
> static struct memory_tier *find_create_memory_tier(unsigned int perf_level)
> {
> 	bool found_slot = false;
> 	struct list_head *ent;
> 	struct memory_tier *memtier, *new_memtier;
> 	static int next_memtier_id = 0;
> 	/*
> 	 * zero is special in that it indicates uninitialized
> 	 * perf level by respective driver. Pick default memory
> 	 * tier perf level for that.
> 	 */
> 	if (!perf_level)
> 		perf_level = default_memtier_perf_level;
>
> 	lockdep_assert_held_once(&memory_tier_lock);
>
> 	list_for_each(ent, &memory_tiers) {
> 		memtier = list_entry(ent, struct memory_tier, list);
> 		if (perf_level >= memtier->perf_level &&
> 		    perf_level < memtier->perf_level + memtier_perf_chunk_size)
> 			return memtier;
> 		else if (perf_level < memtier->perf_level) {
> 			found_slot = true;
> 			break;
> 		}
> 	}
>
> 	new_memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> 	if (!new_memtier)
> 		return ERR_PTR(-ENOMEM);
>
> 	new_memtier->id = next_memtier_id++;
> 	/* rounddown(), not ALIGN_DOWN(): the chunk size need not be a power of two */
> 	new_memtier->perf_level = rounddown(perf_level, memtier_perf_chunk_size);
> 	if (found_slot)
> 		list_add_tail(&new_memtier->list, ent);
> 	else
> 		list_add_tail(&new_memtier->list, &memory_tiers);
> 	return new_memtier;
> }
>
> static int __init memory_tier_init(void)
> {
> 	int node;
> 	struct memory_tier *memtier;
>
> 	/*
> 	 * Since this is early during boot, we could avoid
> 	 * holding memory_tier_lock. But keep it simple by
> 	 * holding locks, so that we can add lock-held debug checks
> 	 * in other functions.
> 	 */
> 	mutex_lock(&memory_tier_lock);
> 	memtier = find_create_memory_tier(default_memtier_perf_level);
> 	if (IS_ERR(memtier))
> 		panic("%s() failed to register memory tier: %ld\n",
> 		      __func__, PTR_ERR(memtier));
>
> 	/* CPU only nodes are not part of memory tiers. */
> 	memtier->nodelist = node_states[N_MEMORY];
>
> 	/*
> 	 * Nodes that are already online and that don't
> 	 * have a perf level assigned are assigned the default perf
> 	 * level.
> 	 */
> 	for_each_node_state(node, N_MEMORY) {
> 		struct node *node_property = node_devices[node];
>
> 		if (!node_property->perf_level)
> 			node_property->perf_level = default_memtier_perf_level;
> 	}
> 	mutex_unlock(&memory_tier_lock);
> 	return 0;
> }
> subsys_initcall(memory_tier_init);

I think that this can be a starting point of our future discussion and
review.  Thanks!

Best Regards,
Huang, Ying
Patch

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..a81dbc20e0d1
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,15 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_NUMA
+
+#define MEMORY_TIER_HBM_GPU	300
+#define MEMORY_TIER_DRAM	200
+#define MEMORY_TIER_PMEM	100
+
+#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
+#define MAX_MEMORY_TIER_ID	400
+
+#endif	/* CONFIG_NUMA */
+#endif  /* _LINUX_MEMORY_TIERS_H */
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..d30acebc2164 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@  obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_NUMA) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..011877b6dbb9
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,78 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/lockdep.h>
+#include <linux/moduleparam.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	struct list_head list;
+	int id;
+	nodemask_t nodelist;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+	struct list_head *ent;
+	struct memory_tier *tmp_memtier;
+
+	lockdep_assert_held_once(&memory_tier_lock);
+
+	list_for_each(ent, &memory_tiers) {
+		tmp_memtier = list_entry(ent, struct memory_tier, list);
+		if (tmp_memtier->id < memtier->id) {
+			list_add_tail(&memtier->list, ent);
+			return;
+		}
+	}
+	list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier)
+{
+	struct memory_tier *memtier;
+
+	if (tier > MAX_MEMORY_TIER_ID)
+		return ERR_PTR(-EINVAL);
+
+	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!memtier)
+		return ERR_PTR(-ENOMEM);
+
+	memtier->id   = tier;
+
+	insert_memory_tier(memtier);
+
+	return memtier;
+}
+
+static unsigned int default_memtier = DEFAULT_MEMORY_TIER;
+core_param(default_memory_tier, default_memtier, uint, 0644);
+
+static int __init memory_tier_init(void)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * Register only the default memory tier to hide all empty
+	 * memory tiers from sysfs. Since this is early during
+	 * boot, we could avoid holding memory_tier_lock. But
+	 * keep it simple by holding locks, so that we can add
+	 * lock-held debug checks in other functions.
+	 */
+	mutex_lock(&memory_tier_lock);
+	memtier = register_memory_tier(default_memtier);
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+
+	/* CPU only nodes are not part of memory tiers. */
+	memtier->nodelist = node_states[N_MEMORY];
+	mutex_unlock(&memory_tier_lock);
+	return 0;
+}
+subsys_initcall(memory_tier_init);
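
The ordering kept by insert_memory_tier() in the patch above can be sketched in user space: a new tier is placed before the first existing tier with a smaller ID, so the list runs from the highest tier ID to the lowest. The singly linked list and the names tier_node and insert_tier() are assumptions made for this sketch; the patch itself uses the kernel's list_head.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct memory_tier with a plain next pointer. */
struct tier_node {
	int id;
	struct tier_node *next;
};

/*
 * Insert t so that the list stays sorted by descending id and return the
 * (possibly new) head. Entries with an id greater than or equal to the new
 * id are skipped, so t lands before the first entry with a smaller id.
 */
static struct tier_node *insert_tier(struct tier_node *head,
				     struct tier_node *t)
{
	struct tier_node **link = &head;

	while (*link && (*link)->id >= t->id)
		link = &(*link)->next;

	t->next = *link;
	*link = t;
	return head;
}
```

Inserting the IDs from memory-tiers.h in the order 200 (DRAM), 100 (PMEM), 300 (HBM/GPU) leaves the list as 300, 200, 100 under this sketch.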