
[v6,01/13] mm/demotion: Add support for explicit memory tiers

Message ID 20220610135229.182859-2-aneesh.kumar@linux.ibm.com (mailing list archive)
State New
Series mm/demotion: Memory tiers and demotion

Commit Message

Aneesh Kumar K.V June 10, 2022, 1:52 p.m. UTC
In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during the kernel initialization and updated when a NUMA node is
hot-added or hot-removed.  The current implementation puts all
nodes with CPU into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

The current memory tier kernel interface needs to be improved for
several important use cases:

* The current tier initialization code always initializes
  each memory-only NUMA node into a lower tier.  But a memory-only
  NUMA node may have a high performance memory device (e.g. a DRAM
  device attached via CXL.mem or a DRAM-backed memory-only node on
  a virtual machine) and should be put into a higher tier.

* The current tier hierarchy always puts CPU nodes into the top
  tier. But on a system with HBM or GPU devices, the
  memory-only NUMA nodes mapping these devices should be in the
  top tier, and DRAM nodes with CPUs are better placed into the
  next lower tier.

* With the current kernel, a higher tier node can only demote pages
  to selected nodes on the next lower tier as defined by the demotion
  path, not to any other node from any lower tier.  This strict,
  hard-coded demotion order does not work in all use cases (e.g. some
  use cases may want to allow cross-socket demotion to another node in
  the same demotion tier as a fallback when the preferred demotion
  node is out of space). This demotion order is also inconsistent with
  the page allocation fallback order when all the nodes in a higher
  tier are out of space: page allocation can fall back to any node
  from any lower tier, whereas the demotion order doesn't allow that.

* The current kernel also doesn't provide any interface for
  userspace to learn about the memory tier hierarchy in order to
  optimize its memory allocations.

This patch series addresses the above by defining memory tiers explicitly.

This patch introduces explicit memory tiers with ranks. The rank
value of a memory tier is used to derive the demotion order between
NUMA nodes. The memory tiers present in a system can be found under
the sysfs directory given at the end of this message.

"Rank" is an opaque value. Its absolute value doesn't have any
special meaning. But the rank values of different memtiers can be
compared with each other to determine the memory tier order.

For example, if we have 3 memtiers: memtier0, memtier1, memtier2, and
their rank values are 300, 200, 100, then the memory tier order is:
memtier0 -> memtier1 -> memtier2, where memtier0 is the highest tier
and memtier2 is the lowest tier.

The rank value of each memtier should be unique.

A higher-rank memory tier appears earlier in the demotion order
than a lower-rank memory tier, i.e. during reclaim we prefer to
demote pages to a node in a higher-rank memory tier over a node
in a lower-rank memory tier.
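As an illustration only (this is not part of the patch), the way rank
determines tier order can be sketched in userspace C. struct tier_model
and both helper functions are hypothetical names made up for this
example; the only assumption taken from the text above is that a higher
rank value means a higher tier.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical userspace model of a memory tier: only id and rank matter. */
struct tier_model {
	int id;		/* N in memtierN */
	int rank;	/* opaque value, only meaningful relative to other ranks */
};

/* qsort() comparator: the tier with the higher rank sorts first. */
static int cmp_rank_desc(const void *a, const void *b)
{
	const struct tier_model *ta = a;
	const struct tier_model *tb = b;

	return (tb->rank > ta->rank) - (tb->rank < ta->rank);
}

/* Arrange the tiers from highest rank to lowest rank. */
static void sort_demotion_order(struct tier_model *tiers, size_t n)
{
	qsort(tiers, n, sizeof(*tiers), cmp_rank_desc);
}

/* Print the resulting order as "memtierA -> memtierB -> ...". */
static void print_demotion_order(const struct tier_model *tiers, size_t n)
{
	for (size_t i = 0; i < n; i++)
		printf("memtier%d%s", tiers[i].id, i + 1 < n ? " -> " : "\n");
}
```

With the ranks from the example above (300, 200, 100), calling
sort_demotion_order() puts memtier0 first and memtier2 last regardless
of the order in which the tiers were listed.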

This patchset introduces 3 memory tiers (memtier0, memtier1 and memtier2)
which are created by different kernel subsystems. The default memory
tier created by the kernel is memtier1. Once created, these memory tiers
are not destroyed even if they don't have any NUMA nodes assigned to
them.

This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].

[1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com

The memory tiers are exposed in sysfs at
/sys/devices/system/memtier/memtierN/

The nodes which are part of a specific memory tier can be listed
via
/sys/devices/system/memtier/memtierN/nodelist

Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h | 20 ++++++++
 mm/Kconfig                   |  3 ++
 mm/Makefile                  |  1 +
 mm/memory-tiers.c            | 89 ++++++++++++++++++++++++++++++++++++
 4 files changed, 113 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

Comments

Huang, Ying June 13, 2022, 3:22 a.m. UTC | #1
Hi, Aneesh,

On Fri, 2022-06-10 at 19:22 +0530, Aneesh Kumar K.V wrote:
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..e17f6b4ee177
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +#ifdef CONFIG_TIERED_MEMORY
> +
> +#define MEMORY_TIER_HBM_GPU	0
> +#define MEMORY_TIER_DRAM	1
> +#define MEMORY_TIER_PMEM	2
> +
> +#define MEMORY_RANK_HBM_GPU	300
> +#define MEMORY_RANK_DRAM	200
> +#define MEMORY_RANK_PMEM	100
> +
> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> +#define MAX_MEMORY_TIERS  3
> +
> +#endif	/* CONFIG_TIERED_MEMORY */
> +
> +#endif
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 169e64192e48..bb5aa585ab41 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -614,6 +614,9 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>  config ARCH_ENABLE_THP_MIGRATION
>  	bool
>  
> 
> +config TIERED_MEMORY
> +	def_bool NUMA
> +

As Yang pointed out, why not just use CONFIG_NUMA?  I doubt the
added value of CONFIG_TIERED_MEMORY.

>  config HUGETLB_PAGE_SIZE_VARIABLE
>  	def_bool n
>  	help
> diff --git a/mm/Makefile b/mm/Makefile
> index 6f9ffa968a1a..482557fbc9d1 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMTEST)		+= memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> new file mode 100644
> index 000000000000..d9fa955f208e
> --- /dev/null
> +++ b/mm/memory-tiers.c
> @@ -0,0 +1,89 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/nodemask.h>
> +#include <linux/slab.h>
> +#include <linux/memory-tiers.h>
> +
> +struct memory_tier {
> +	struct list_head list;
> +	nodemask_t nodelist;
> +	int id;
> +	int rank;
> +};
> +
> +static DEFINE_MUTEX(memory_tier_lock);
> +static LIST_HEAD(memory_tiers);
> +
> +/*
> + * Keep it simple by having  direct mapping between
> + * tier index and rank value.
> + */
> +static inline int get_rank_from_tier(unsigned int tier)
> +{
> +	switch (tier) {
> +	case MEMORY_TIER_HBM_GPU:
> +		return MEMORY_RANK_HBM_GPU;
> +	case MEMORY_TIER_DRAM:
> +		return MEMORY_RANK_DRAM;
> +	case MEMORY_TIER_PMEM:
> +		return MEMORY_RANK_PMEM;
> +	}
> +	return -1;
> +}
> +
> +static void insert_memory_tier(struct memory_tier *memtier)
> +{
> +	struct list_head *ent;
> +	struct memory_tier *tmp_memtier;
> +
> +	list_for_each(ent, &memory_tiers) {
> +		tmp_memtier = list_entry(ent, struct memory_tier, list);

list_for_each_entry() ?

> +		if (tmp_memtier->rank < memtier->rank) {
> +			list_add_tail(&memtier->list, ent);

> +			return;
> +		}
> +	}
> +	list_add_tail(&memtier->list, &memory_tiers);
> +}
> +

IMHO, the locking requirements should be documented here in comments
to avoid confusion.

> +static struct memory_tier *register_memory_tier(unsigned int tier,
> +						unsigned int rank)
> +{
> +	struct memory_tier *memtier;
> +
> +	if (tier >= MAX_MEMORY_TIERS)
> +		return ERR_PTR(-EINVAL);
> +
> +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +	if (!memtier)
> +		return ERR_PTR(-ENOMEM);
> +
> +	memtier->id   = tier;
> +	memtier->rank = rank;
> +
> +	insert_memory_tier(memtier);
> +
> +	return memtier;
> +}
> +
> +static int __init memory_tier_init(void)
> +{
> +	struct memory_tier *memtier;
> +
> +	/*
> +	 * Register only default memory tier to hide all empty
> +	 * memory tier from sysfs.
> +	 */
> +	memtier = register_memory_tier(DEFAULT_MEMORY_TIER,
> +				       get_rank_from_tier(DEFAULT_MEMORY_TIER));
> +
> +	if (IS_ERR(memtier))
> +		panic("%s() failed to register memory tier: %ld\n",
> +		      __func__, PTR_ERR(memtier));
> +
> +	/* CPU only nodes are not part of memory tiers. */
> +	memtier->nodelist = node_states[N_MEMORY];
> +
> +	return 0;
> +}
> +subsys_initcall(memory_tier_init);

Best Regards,
Huang, Ying
Aneesh Kumar K.V June 13, 2022, 3:31 a.m. UTC | #2
On 6/13/22 8:52 AM, Ying Huang wrote:
> Hi, Aneesh,
> 
> On Fri, 2022-06-10 at 19:22 +0530, Aneesh Kumar K.V wrote:
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> new file mode 100644
>> index 000000000000..e17f6b4ee177
>> --- /dev/null
>> +++ b/include/linux/memory-tiers.h
>> @@ -0,0 +1,20 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_MEMORY_TIERS_H
>> +#define _LINUX_MEMORY_TIERS_H
>> +
>> +#ifdef CONFIG_TIERED_MEMORY
>> +
>> +#define MEMORY_TIER_HBM_GPU	0
>> +#define MEMORY_TIER_DRAM	1
>> +#define MEMORY_TIER_PMEM	2
>> +
>> +#define MEMORY_RANK_HBM_GPU	300
>> +#define MEMORY_RANK_DRAM	200
>> +#define MEMORY_RANK_PMEM	100
>> +
>> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
>> +#define MAX_MEMORY_TIERS  3
>> +
>> +#endif	/* CONFIG_TIERED_MEMORY */
>> +
>> +#endif
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 169e64192e48..bb5aa585ab41 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -614,6 +614,9 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>>   config ARCH_ENABLE_THP_MIGRATION
>>   	bool
>>   
>>
>> +config TIERED_MEMORY
>> +	def_bool NUMA
>> +
> 
> As Yang pointed out, why not just use CONFIG_NUMA?  I doubt the
> added value of CONFIG_TIERED_MEMORY.
> 

I decided to use TIERED_MEMORY to bring more clarity. It should be the same
now that we have moved the CONFIG_MIGRATION dependencies to runtime. IMHO
having CONFIG_TIERED_MEMORY is better than using CONFIG_NUMA.

>>   config HUGETLB_PAGE_SIZE_VARIABLE
>>   	def_bool n
>>   	help
>> diff --git a/mm/Makefile b/mm/Makefile
>> index 6f9ffa968a1a..482557fbc9d1 100644
>> --- a/mm/Makefile
>> +++ b/mm/Makefile
>> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>>   obj-$(CONFIG_FAILSLAB) += failslab.o
>>   obj-$(CONFIG_MEMTEST)		+= memtest.o
>>   obj-$(CONFIG_MIGRATION) += migrate.o
>> +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
>>   obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>>   obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>>   obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
>> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
>> new file mode 100644
>> index 000000000000..d9fa955f208e
>> --- /dev/null
>> +++ b/mm/memory-tiers.c
>> @@ -0,0 +1,89 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +#include <linux/types.h>
>> +#include <linux/nodemask.h>
>> +#include <linux/slab.h>
>> +#include <linux/memory-tiers.h>
>> +
>> +struct memory_tier {
>> +	struct list_head list;
>> +	nodemask_t nodelist;
>> +	int id;
>> +	int rank;
>> +};
>> +
>> +static DEFINE_MUTEX(memory_tier_lock);
>> +static LIST_HEAD(memory_tiers);
>> +
>> +/*
>> + * Keep it simple by having  direct mapping between
>> + * tier index and rank value.
>> + */
>> +static inline int get_rank_from_tier(unsigned int tier)
>> +{
>> +	switch (tier) {
>> +	case MEMORY_TIER_HBM_GPU:
>> +		return MEMORY_RANK_HBM_GPU;
>> +	case MEMORY_TIER_DRAM:
>> +		return MEMORY_RANK_DRAM;
>> +	case MEMORY_TIER_PMEM:
>> +		return MEMORY_RANK_PMEM;
>> +	}
>> +	return -1;
>> +}
>> +
>> +static void insert_memory_tier(struct memory_tier *memtier)
>> +{
>> +	struct list_head *ent;
>> +	struct memory_tier *tmp_memtier;
>> +
>> +	list_for_each(ent, &memory_tiers) {
>> +		tmp_memtier = list_entry(ent, struct memory_tier, list);
> 
> list_for_each_entry() ?
> 

The ent variable is used below, so I won't be able to use
list_for_each_entry().

>> +		if (tmp_memtier->rank < memtier->rank) {
>> +			list_add_tail(&memtier->list, ent);
> 
>> +			return;
>> +		}
>> +	}
>> +	list_add_tail(&memtier->list, &memory_tiers);
>> +}
>> +
> 
> IMHO, the locking requirements should be documented here in comments
> to avoid confusion.
> 

All those functions are called with memory_tier_lock held. In fact, all
list operations require that lock to be held. What details do you suggest we
document? Should I add an extra comment to the mutex itself? Adding locking
details to every function would duplicate the same information in
multiple places.

>> +static struct memory_tier *register_memory_tier(unsigned int tier,
>> +						unsigned int rank)
>> +{
>> +	struct memory_tier *memtier;
>> +
>> +	if (tier >= MAX_MEMORY_TIERS)
>> +		return ERR_PTR(-EINVAL);
>> +
>> +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
>> +	if (!memtier)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	memtier->id   = tier;
>> +	memtier->rank = rank;
>> +
>> +	insert_memory_tier(memtier);
>> +
>> +	return memtier;
>> +}
>> +
>> +static int __init memory_tier_init(void)
>> +{
>> +	struct memory_tier *memtier;
>> +
>> +	/*
>> +	 * Register only default memory tier to hide all empty
>> +	 * memory tier from sysfs.
>> +	 */
>> +	memtier = register_memory_tier(DEFAULT_MEMORY_TIER,
>> +				       get_rank_from_tier(DEFAULT_MEMORY_TIER));
>> +
>> +	if (IS_ERR(memtier))
>> +		panic("%s() failed to register memory tier: %ld\n",
>> +		      __func__, PTR_ERR(memtier));
>> +
>> +	/* CPU only nodes are not part of memory tiers. */
>> +	memtier->nodelist = node_states[N_MEMORY];
>> +
>> +	return 0;
>> +}
>> +subsys_initcall(memory_tier_init);
> 

-aneesh
Huang, Ying June 13, 2022, 5:30 a.m. UTC | #3
On Mon, 2022-06-13 at 09:01 +0530, Aneesh Kumar K V wrote:
> On 6/13/22 8:52 AM, Ying Huang wrote:
> > Hi, Aneesh,
> > 
> > On Fri, 2022-06-10 at 19:22 +0530, Aneesh Kumar K.V wrote:
> > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > > new file mode 100644
> > > index 000000000000..e17f6b4ee177
> > > --- /dev/null
> > > +++ b/include/linux/memory-tiers.h
> > > @@ -0,0 +1,20 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > +#define _LINUX_MEMORY_TIERS_H
> > > +
> > > +#ifdef CONFIG_TIERED_MEMORY
> > > +
> > > +#define MEMORY_TIER_HBM_GPU	0
> > > +#define MEMORY_TIER_DRAM	1
> > > +#define MEMORY_TIER_PMEM	2
> > > +
> > > +#define MEMORY_RANK_HBM_GPU	300
> > > +#define MEMORY_RANK_DRAM	200
> > > +#define MEMORY_RANK_PMEM	100
> > > +
> > > +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> > > +#define MAX_MEMORY_TIERS  3
> > > +
> > > +#endif	/* CONFIG_TIERED_MEMORY */
> > > +
> > > +#endif
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 169e64192e48..bb5aa585ab41 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -614,6 +614,9 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> > >   config ARCH_ENABLE_THP_MIGRATION
> > >   	bool
> > >   
> > > 
> > > 
> > > +config TIERED_MEMORY
> > > +	def_bool NUMA
> > > +
> > 
> > As Yang pointed out, why not just use CONFIG_NUMA?  I doubt the
> > added value of CONFIG_TIERED_MEMORY.
> > 
> 
> I decided to use TIERED_MEMORY to bring more clarity. It should be the same
> now that we have moved the CONFIG_MIGRATION dependencies to runtime. IMHO
> having CONFIG_TIERED_MEMORY is better than using CONFIG_NUMA.

I don't think CONFIG_TIERED_MEMORY brings much value.  It's better
to use CONFIG_NUMA directly.  But this is just my opinion.

> > >   config HUGETLB_PAGE_SIZE_VARIABLE
> > >   	def_bool n
> > >   	help
> > > diff --git a/mm/Makefile b/mm/Makefile
> > > index 6f9ffa968a1a..482557fbc9d1 100644
> > > --- a/mm/Makefile
> > > +++ b/mm/Makefile
> > > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
> > >   obj-$(CONFIG_FAILSLAB) += failslab.o
> > >   obj-$(CONFIG_MEMTEST)		+= memtest.o
> > >   obj-$(CONFIG_MIGRATION) += migrate.o
> > > +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
> > >   obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > >   obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > >   obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > > new file mode 100644
> > > index 000000000000..d9fa955f208e
> > > --- /dev/null
> > > +++ b/mm/memory-tiers.c
> > > @@ -0,0 +1,89 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include <linux/types.h>
> > > +#include <linux/nodemask.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/memory-tiers.h>
> > > +
> > > +struct memory_tier {
> > > +	struct list_head list;
> > > +	nodemask_t nodelist;
> > > +	int id;
> > > +	int rank;
> > > +};
> > > +
> > > +static DEFINE_MUTEX(memory_tier_lock);
> > > +static LIST_HEAD(memory_tiers);
> > > +
> > > +/*
> > > + * Keep it simple by having  direct mapping between
> > > + * tier index and rank value.
> > > + */
> > > +static inline int get_rank_from_tier(unsigned int tier)
> > > +{
> > > +	switch (tier) {
> > > +	case MEMORY_TIER_HBM_GPU:
> > > +		return MEMORY_RANK_HBM_GPU;
> > > +	case MEMORY_TIER_DRAM:
> > > +		return MEMORY_RANK_DRAM;
> > > +	case MEMORY_TIER_PMEM:
> > > +		return MEMORY_RANK_PMEM;
> > > +	}
> > > +	return -1;
> > > +}
> > > +
> > > +static void insert_memory_tier(struct memory_tier *memtier)
> > > +{
> > > +	struct list_head *ent;
> > > +	struct memory_tier *tmp_memtier;
> > > +
> > > +	list_for_each(ent, &memory_tiers) {
> > > +		tmp_memtier = list_entry(ent, struct memory_tier, list);
> > 
> > list_for_each_entry() ?
> > 
> 
> The ent variable is used below, so I won't be able to use
> list_for_each_entry().

ent == &tmp_memtier->list ?
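To make the point concrete, here is a minimal userspace sketch (not
kernel code, and not the patch itself). The list helpers are simplified
stand-ins for <linux/list.h> written for this example; only their names
and argument order mirror the kernel ones, and __typeof__ is a GNU C
extension. The sketch assumes the same descending-rank ordering as the
quoted insert_memory_tier().

```c
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel's <linux/list.h> helpers. */
struct list_head {
	struct list_head *next, *prev;
};

#define LIST_HEAD_INIT(name) { &(name), &(name) }
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
#define list_entry(ptr, type, member) container_of(ptr, type, member)

/* Iterate over the containing structures rather than the raw list nodes. */
#define list_for_each_entry(pos, head, member)				\
	for (pos = list_entry((head)->next, __typeof__(*pos), member);	\
	     &(pos)->member != (head);					\
	     pos = list_entry((pos)->member.next, __typeof__(*pos), member))

/* Link "entry" immediately before "head". */
static void list_add_tail(struct list_head *entry, struct list_head *head)
{
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
}

struct memory_tier {
	struct list_head list;
	int id;
	int rank;
};

static struct list_head memory_tiers = LIST_HEAD_INIT(memory_tiers);

/* Keep the list sorted by descending rank. */
static void insert_memory_tier(struct memory_tier *memtier)
{
	struct memory_tier *tmp_memtier;

	list_for_each_entry(tmp_memtier, &memory_tiers, list) {
		if (tmp_memtier->rank < memtier->rank) {
			/* &tmp_memtier->list is what the open-coded loop calls ent. */
			list_add_tail(&memtier->list, &tmp_memtier->list);
			return;
		}
	}
	/* No lower-rank tier found: append at the end of the list. */
	list_add_tail(&memtier->list, &memory_tiers);
}
```

Inserting tiers with ranks 100, 300 and 200 in that order leaves the
list as 300, 200, 100, the same result the open-coded loop produces,
without a separate struct list_head cursor.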

> > > +		if (tmp_memtier->rank < memtier->rank) {
> > > +			list_add_tail(&memtier->list, ent);
> > 
> > > +			return;
> > > +		}
> > > +	}
> > > +	list_add_tail(&memtier->list, &memory_tiers);
> > > +}
> > > +
> > 
> > IMHO, the locking requirements should be documented here in comments
> > to avoid confusion.
> > 
> 
> All those functions are called with memory_tier_lock held. In fact, all
> list operations require that lock to be held. What details do you suggest we
> document? Should I add an extra comment to the mutex itself? Adding locking
> details to every function would duplicate the same information in
> multiple places.

memory_tier_lock isn't held when register_memory_tier() is called in
this patch.  That will cause confusion.
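As a rough illustration of the kind of annotation meant here, below is a
single-threaded userspace sketch. Everything in it is hypothetical: the
bool flag merely stands in for the mutex, and tier_lock(), tier_unlock(),
insert_tier_locked() and register_tier() are made-up names. In the
kernel the runtime check would normally be lockdep_assert_held() on the
real mutex rather than assert() on a flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for memory_tier_lock: it only tracks "held". */
static bool memory_tier_lock_held;

static void tier_lock(void)
{
	assert(!memory_tier_lock_held);
	memory_tier_lock_held = true;
}

static void tier_unlock(void)
{
	assert(memory_tier_lock_held);
	memory_tier_lock_held = false;
}

#define MAX_TIERS 3
static int tier_rank[MAX_TIERS];
static int nr_tiers;

/*
 * Locking: must be called with memory_tier_lock held.
 * The assert() makes the documented requirement checkable at runtime.
 */
static void insert_tier_locked(int rank)
{
	assert(memory_tier_lock_held);
	tier_rank[nr_tiers++] = rank;
}

/*
 * Locking: takes memory_tier_lock itself; callers must not hold it.
 * Returns 0 on success, -1 if no tier slot is left.
 */
static int register_tier(int rank)
{
	int ret = -1;

	tier_lock();
	if (nr_tiers < MAX_TIERS) {
		insert_tier_locked(rank);
		ret = 0;
	}
	tier_unlock();
	return ret;
}
```

Stating the rule once per helper, next to a check that enforces it,
keeps the comment and the code from drifting apart; whether the lock is
taken inside register_memory_tier() or by its callers is a separate
design choice that this sketch does not settle.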

> > > +static struct memory_tier *register_memory_tier(unsigned int tier,
> > > +						unsigned int rank)
> > > +{
> > > +	struct memory_tier *memtier;
> > > +
> > > +	if (tier >= MAX_MEMORY_TIERS)
> > > +		return ERR_PTR(-EINVAL);
> > > +
> > > +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > > +	if (!memtier)
> > > +		return ERR_PTR(-ENOMEM);
> > > +
> > > +	memtier->id   = tier;
> > > +	memtier->rank = rank;
> > > +
> > > +	insert_memory_tier(memtier);
> > > +
> > > +	return memtier;
> > > +}
> > > +
> > > +static int __init memory_tier_init(void)
> > > +{
> > > +	struct memory_tier *memtier;
> > > +
> > > +	/*
> > > +	 * Register only default memory tier to hide all empty
> > > +	 * memory tier from sysfs.
> > > +	 */
> > > +	memtier = register_memory_tier(DEFAULT_MEMORY_TIER,
> > > +				       get_rank_from_tier(DEFAULT_MEMORY_TIER));
> > > +
> > > +	if (IS_ERR(memtier))
> > > +		panic("%s() failed to register memory tier: %ld\n",
> > > +		      __func__, PTR_ERR(memtier));
> > > +
> > > +	/* CPU only nodes are not part of memory tiers. */
> > > +	memtier->nodelist = node_states[N_MEMORY];
> > > +
> > > +	return 0;
> > > +}
> > > +subsys_initcall(memory_tier_init);
> > 

Best Regards,
Huang, Ying
Johannes Weiner June 13, 2022, 1:16 p.m. UTC | #4
On Mon, Jun 13, 2022 at 01:30:08PM +0800, Ying Huang wrote:
> On Mon, 2022-06-13 at 09:01 +0530, Aneesh Kumar K V wrote:
> > On 6/13/22 8:52 AM, Ying Huang wrote:
> > > On Fri, 2022-06-10 at 19:22 +0530, Aneesh Kumar K.V wrote:
> > > > +config TIERED_MEMORY
> > > > +	def_bool NUMA
> > > > +
> > > 
> > > > As Yang pointed out, why not just use CONFIG_NUMA?  I doubt the
> > > > added value of CONFIG_TIERED_MEMORY.
> > 
> > > I decided to use TIERED_MEMORY to bring more clarity. It should be the
> > > same now that we have moved the CONFIG_MIGRATION dependencies to runtime.
> > > IMHO having CONFIG_TIERED_MEMORY is better than using CONFIG_NUMA.
> 
> > I don't think CONFIG_TIERED_MEMORY brings much value.  It's better
> > to use CONFIG_NUMA directly.  But this is just my opinion.

I agree. As long as it's always built with CONFIG_NUMA, it's simply
NUMA code. Easy enough to modularize it later if somebody really wants
this to be configurable separately.
Aneesh Kumar K.V June 13, 2022, 1:28 p.m. UTC | #5
On 6/13/22 6:46 PM, Johannes Weiner wrote:
> On Mon, Jun 13, 2022 at 01:30:08PM +0800, Ying Huang wrote:
>> On Mon, 2022-06-13 at 09:01 +0530, Aneesh Kumar K V wrote:
>>> On 6/13/22 8:52 AM, Ying Huang wrote:
>>>> On Fri, 2022-06-10 at 19:22 +0530, Aneesh Kumar K.V wrote:
>>>>> +config TIERED_MEMORY
>>>>> +	def_bool NUMA
>>>>> +
>>>>
>>>> As Yang pointed out, why not just use CONFIG_NUMA?  I doubt the
>>>> added value of CONFIG_TIERED_MEMORY.
>>>
>>> I decided to use TIERED_MEMORY to bring more clarity. It should be the
>>> same now that we have moved the CONFIG_MIGRATION dependencies to runtime.
>>> IMHO having CONFIG_TIERED_MEMORY is better than using CONFIG_NUMA.
>>
>> I don't think CONFIG_TIERED_MEMORY brings much value.  It's better
>> to use CONFIG_NUMA directly.  But this is just my opinion.
> 
> I agree. As long as it's always built with CONFIG_NUMA, it's simply
> NUMA code. Easy enough to modularize it later if somebody really wants
> this to be configurable separately.

I was comparing,

#ifdef CONFIG_TIERED_MEMORY
struct memory_tier {

vs

#ifdef CONFIG_NUMA
struct memory_tier {

I will switch to CONFIG_NUMA in the next update since you are not 
finding it beneficial.

-aneesh
Aneesh Kumar K.V June 14, 2022, 8:20 a.m. UTC | #6
Ying Huang <ying.huang@intel.com> writes:

....
> 
>> All those functions are called with memory_tier_lock held. In fact, all
>> list operations require that lock to be held. What details do you suggest we
>> document? Should I add an extra comment to the mutex itself? Wouldn't adding
>> locking details to all the functions duplicate the same details in
>> multiple places?
>
> memory_tier_lock isn't held to call register_memory_tier() in this
> patch.  That will cause confusion.

Will this help explain it better?

modified   mm/memory-tiers.c
@@ -151,6 +151,11 @@ static void insert_memory_tier(struct memory_tier *memtier)
 	struct list_head *ent;
 	struct memory_tier *tmp_memtier;
 
+	if (IS_ENABLED(CONFIG_DEBUG_VM) && !mutex_is_locked(&memory_tier_lock)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
 	list_for_each(ent, &memory_tiers) {
 		tmp_memtier = list_entry(ent, struct memory_tier, list);
 		if (tmp_memtier->rank < memtier->rank) {
@@ -811,8 +816,12 @@ static int __init memory_tier_init(void)
 
 	/*
 	 * Register only default memory tier to hide all empty
-	 * memory tier from sysfs.
+	 * memory tier from sysfs. Since this is early during
+	 * boot, we could avoid holding memory_tier_lock. But
+	 * keep it simple by holding locks. We can add lock
+	 * held debug checks in other functions.
 	 */
+	mutex_lock(&memory_tier_lock);
 	memtier = register_memory_tier(DEFAULT_MEMORY_TIER,
 				       get_rank_from_tier(DEFAULT_MEMORY_TIER));
 
@@ -828,6 +837,7 @@ static int __init memory_tier_init(void)
 		NODE_DATA(node)->memtier = memtier;
 		node_set(node, memtier->nodelist);
 	}
+	mutex_unlock(&memory_tier_lock);
 	migrate_on_reclaim_init();
 
 	return 0;

-aneesh
Davidlohr Bueso June 14, 2022, 3:13 p.m. UTC | #7
>> memory_tier_lock isn't held to call register_memory_tier() in this
>> patch.  That will cause confusion.
>
>Will this help explain it better?
>
>modified   mm/memory-tiers.c
>@@ -151,6 +151,11 @@ static void insert_memory_tier(struct memory_tier *memtier)
> 	struct list_head *ent;
> 	struct memory_tier *tmp_memtier;
>
>+	if (IS_ENABLED(CONFIG_DEBUG_VM) && !mutex_is_locked(&memory_tier_lock)) {
>+		WARN_ON_ONCE(1);
>+		return;
>+	}

Why not just use lockdep here instead?

Patch

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..e17f6b4ee177
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,20 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_TIERED_MEMORY
+
+#define MEMORY_TIER_HBM_GPU	0
+#define MEMORY_TIER_DRAM	1
+#define MEMORY_TIER_PMEM	2
+
+#define MEMORY_RANK_HBM_GPU	300
+#define MEMORY_RANK_DRAM	200
+#define MEMORY_RANK_PMEM	100
+
+#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
+#define MAX_MEMORY_TIERS  3
+
+#endif	/* CONFIG_TIERED_MEMORY */
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..bb5aa585ab41 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -614,6 +614,9 @@  config ARCH_ENABLE_HUGEPAGE_MIGRATION
 config ARCH_ENABLE_THP_MIGRATION
 	bool
 
+config TIERED_MEMORY
+	def_bool NUMA
+
 config HUGETLB_PAGE_SIZE_VARIABLE
 	def_bool n
 	help
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..482557fbc9d1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@  obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..d9fa955f208e
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,89 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	struct list_head list;
+	nodemask_t nodelist;
+	int id;
+	int rank;
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+
+/*
+ * Keep it simple by having a direct mapping between
+ * tier index and rank value.
+ */
+static inline int get_rank_from_tier(unsigned int tier)
+{
+	switch (tier) {
+	case MEMORY_TIER_HBM_GPU:
+		return MEMORY_RANK_HBM_GPU;
+	case MEMORY_TIER_DRAM:
+		return MEMORY_RANK_DRAM;
+	case MEMORY_TIER_PMEM:
+		return MEMORY_RANK_PMEM;
+	}
+	return -1;
+}
+
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+	struct list_head *ent;
+	struct memory_tier *tmp_memtier;
+
+	list_for_each(ent, &memory_tiers) {
+		tmp_memtier = list_entry(ent, struct memory_tier, list);
+		if (tmp_memtier->rank < memtier->rank) {
+			list_add_tail(&memtier->list, ent);
+			return;
+		}
+	}
+	list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier,
+						unsigned int rank)
+{
+	struct memory_tier *memtier;
+
+	if (tier >= MAX_MEMORY_TIERS)
+		return ERR_PTR(-EINVAL);
+
+	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!memtier)
+		return ERR_PTR(-ENOMEM);
+
+	memtier->id   = tier;
+	memtier->rank = rank;
+
+	insert_memory_tier(memtier);
+
+	return memtier;
+}
+
+static int __init memory_tier_init(void)
+{
+	struct memory_tier *memtier;
+
+	/*
+	 * Register only default memory tier to hide all empty
+	 * memory tier from sysfs.
+	 */
+	memtier = register_memory_tier(DEFAULT_MEMORY_TIER,
+				       get_rank_from_tier(DEFAULT_MEMORY_TIER));
+
+	if (IS_ERR(memtier))
+		panic("%s() failed to register memory tier: %ld\n",
+		      __func__, PTR_ERR(memtier));
+
+	/* CPU only nodes are not part of memory tiers. */
+	memtier->nodelist = node_states[N_MEMORY];
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);