
[v5,1/9] mm/demotion: Add support for explicit memory tiers

Message ID 20220603134237.131362-2-aneesh.kumar@linux.ibm.com (mailing list archive)
State New
Series mm/demotion: Memory tiers and demotion

Commit Message

Aneesh Kumar K.V June 3, 2022, 1:42 p.m. UTC
In the current kernel, memory tiers are defined implicitly via a
demotion path relationship between NUMA nodes, which is created
during the kernel initialization and updated when a NUMA node is
hot-added or hot-removed.  The current implementation puts all
nodes with CPU into the top tier, and builds the tier hierarchy
tier-by-tier by establishing the per-node demotion targets based
on the distances between nodes.

This current memory tier kernel interface needs to be improved for
several important use cases:

The current tier initialization code always initializes
each memory-only NUMA node into a lower tier.  But a memory-only
NUMA node may have a high performance memory device (e.g. a DRAM
device attached via CXL.mem or a DRAM-backed memory-only node on
a virtual machine) and should be put into a higher tier.

The current tier hierarchy always puts CPU nodes into the top
tier. But on a system with HBM or GPU devices, the
memory-only NUMA nodes mapping these devices should be in the
top tier, and DRAM nodes with CPUs are better placed in the next
lower tier.

With the current kernel, a higher tier node can only be demoted to
selected nodes on the next lower tier as defined by the demotion
path, not to any other node from any lower tier.  This strict,
hard-coded demotion order does not work in all use cases (e.g. some
use cases may want to allow cross-socket demotion to another node
in the same demotion tier as a fallback when the preferred demotion
node is out of space).  This demotion order is also inconsistent
with the page allocation fallback order when all the nodes in a
higher tier are out of space: the page allocation can fall back to
any node from any lower tier, whereas the demotion order doesn't
allow that.

The current kernel also doesn't provide any interface for
userspace to learn about the memory tier hierarchy in order to
optimize its memory allocations.

This patch series addresses the above by defining memory tiers explicitly.

This patch introduces explicit memory tiers with ranks. The rank
value of a memory tier is used to derive the demotion order between
NUMA nodes. The memory tiers present in a system can be found at

/sys/devices/system/memtier/memtierN/

The nodes which are part of a specific memory tier can be listed
via
/sys/devices/system/memtier/memtierN/nodelist

"Rank" is an opaque value. Its absolute value doesn't have any
special meaning. But the rank values of different memtiers can be
compared with each other to determine the memory tier order.

For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
their rank values are 300, 200, 100, then the memory tier order is:
memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
and memtier1 is the lowest tier.

The rank value of each memtier should be unique.

A higher rank memory tier appears earlier in the demotion order
than a lower rank memory tier, i.e. during reclaim we prefer to
demote pages to a node in a higher rank memory tier over a node
in a lower rank memory tier.
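
For illustration only, here is a hypothetical helper (not part of this
patch) showing how the rank ordering translates into a demotion order:
because insert_memory_tier() in this patch keeps the global
memory_tiers list sorted by descending rank, the next-lower tier is
simply the next list entry.

/*
 * Hypothetical sketch, not part of this patch.  Assumes the
 * memory_tiers list is kept sorted by descending rank, as
 * insert_memory_tier() in this patch does.  Caller must hold
 * memory_tier_lock.
 */
static struct memory_tier *next_lower_memory_tier(struct memory_tier *memtier)
{
	lockdep_assert_held(&memory_tier_lock);

	if (list_is_last(&memtier->list, &memory_tiers))
		return NULL;	/* lowest tier: no demotion target */

	return list_next_entry(memtier, list);
}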

For now we are not adding support for a dynamic number of memory
tiers, but a future series supporting that is possible. Currently
the number of tiers supported is limited to MAX_MEMORY_TIERS (3).
When doing memory hotplug, if a NUMA node is not added to a memory
tier, it gets added to DEFAULT_MEMORY_TIER (1).
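
For illustration only, a minimal userspace sketch (not part of this
patch) that reads the sysfs interface described above; it assumes at
most MAX_MEMORY_TIERS (3) tiers and simply skips tiers that are not
registered:

#include <stdio.h>

int main(void)
{
	char path[64], buf[256];
	FILE *f;
	int tier;

	for (tier = 0; tier < 3; tier++) {
		/* rank: opaque value, larger means higher tier */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/memtier/memtier%d/rank", tier);
		f = fopen(path, "r");
		if (!f)
			continue;	/* tier not registered */
		if (fgets(buf, sizeof(buf), f))
			printf("memtier%d rank:  %s", tier, buf);
		fclose(f);

		/* nodelist: NUMA nodes that belong to this tier */
		snprintf(path, sizeof(path),
			 "/sys/devices/system/memtier/memtier%d/nodelist", tier);
		f = fopen(path, "r");
		if (!f)
			continue;
		if (fgets(buf, sizeof(buf), f))
			printf("memtier%d nodes: %s", tier, buf);
		fclose(f);
	}
	return 0;
}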

This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].

[1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com

Suggested-by: Wei Xu <weixugc@google.com>
Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
---
 include/linux/memory-tiers.h |  20 ++++
 mm/Kconfig                   |  11 ++
 mm/Makefile                  |   1 +
 mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
 4 files changed, 220 insertions(+)
 create mode 100644 include/linux/memory-tiers.h
 create mode 100644 mm/memory-tiers.c

Comments

Tim Chen June 7, 2022, 6:43 p.m. UTC | #1
On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> 
> 
> The nodes which are part of a specific memory tier can be listed
> via
> /sys/devices/system/memtier/memtierN/nodelist
> 
> "Rank" is an opaque value. Its absolute value doesn't have any
> special meaning. But the rank values of different memtiers can be
> compared with each other to determine the memory tier order.
> 
> For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
> their rank values are 300, 200, 100, then the memory tier order is:
> memtier0 -> memtier2 -> memtier1, 

Why is memtier2 (rank 100) higher than memtier1 (rank 200)?  Seems like
the order should be memtier0 -> memtier1 -> memtier2?
                    (rank 300)  (rank 200)  (rank 100)

> where memtier0 is the highest tier
> and memtier1 is the lowest tier.

I think memtier2 is the lowest as it has the lowest rank value.
> 
> The rank value of each memtier should be unique.
> 
> 
> +
> +static void memory_tier_device_release(struct device *dev)
> +{
> +	struct memory_tier *tier = to_memory_tier(dev);
> +

Do we need some ref counts on memory_tier?
If there is another device still using the same memtier,
free below could cause problem.

> +	kfree(tier);
> +}
> +
> 
...
> +static struct memory_tier *register_memory_tier(unsigned int tier)
> +{
> +	int error;
> +	struct memory_tier *memtier;
> +
> +	if (tier >= MAX_MEMORY_TIERS)
> +		return NULL;
> +
> +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +	if (!memtier)
> +		return NULL;
> +
> +	memtier->dev.id = tier;
> +	memtier->rank = get_rank_from_tier(tier);
> +	memtier->dev.bus = &memory_tier_subsys;
> +	memtier->dev.release = memory_tier_device_release;
> +	memtier->dev.groups = memory_tier_dev_groups;
> +

Should you take the mem_tier_lock before you insert to
memtier-list?

> +	insert_memory_tier(memtier);
> +
> +	error = device_register(&memtier->dev);
> +	if (error) {
> +		list_del(&memtier->list);
> +		put_device(&memtier->dev);
> +		return NULL;
> +	}
> +	return memtier;
> +}
> +
> +__maybe_unused // temporay to prevent warnings during bisects
> +static void unregister_memory_tier(struct memory_tier *memtier)
> +{

I think we should take mem_tier_lock before modifying memtier->list.

> +	list_del(&memtier->list);
> +	device_unregister(&memtier->dev);
> +}
> +
> 

Thanks.

Tim
Wei Xu June 7, 2022, 8:18 p.m. UTC | #2
On Tue, Jun 7, 2022 at 11:43 AM Tim Chen <tim.c.chen@linux.intel.com> wrote:
>
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> >
> >
> > The nodes which are part of a specific memory tier can be listed
> > via
> > /sys/devices/system/memtier/memtierN/nodelist
> >
> > "Rank" is an opaque value. Its absolute value doesn't have any
> > special meaning. But the rank values of different memtiers can be
> > compared with each other to determine the memory tier order.
> >
> > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
> > their rank values are 300, 200, 100, then the memory tier order is:
> > memtier0 -> memtier2 -> memtier1,
>
> Why is memtier2 (rank 100) higher than memtier1 (rank 200)?  Seems like
> the order should be memtier0 -> memtier1 -> memtier2?
>                     (rank 300)  (rank 200)  (rank 100)

I think this is a copy-and-modify typo from my original memory tiering
kernel interface RFC (v4,
https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com/T/):
where the rank values are 100, 10, 50 (i.e. the rank of memtier2 is
higher than that of memtier1).

> > where memtier0 is the highest tier
> > and memtier1 is the lowest tier.
>
> I think memtier2 is the lowest as it has the lowest rank value.
> >
> > The rank value of each memtier should be unique.
> >
> >
> > +
> > +static void memory_tier_device_release(struct device *dev)
> > +{
> > +     struct memory_tier *tier = to_memory_tier(dev);
> > +
>
> Do we need some ref counts on memory_tier?
> If there is another device still using the same memtier,
> free below could cause problem.
>
> > +     kfree(tier);
> > +}
> > +
> >
> ...
> > +static struct memory_tier *register_memory_tier(unsigned int tier)
> > +{
> > +     int error;
> > +     struct memory_tier *memtier;
> > +
> > +     if (tier >= MAX_MEMORY_TIERS)
> > +             return NULL;
> > +
> > +     memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > +     if (!memtier)
> > +             return NULL;
> > +
> > +     memtier->dev.id = tier;
> > +     memtier->rank = get_rank_from_tier(tier);
> > +     memtier->dev.bus = &memory_tier_subsys;
> > +     memtier->dev.release = memory_tier_device_release;
> > +     memtier->dev.groups = memory_tier_dev_groups;
> > +
>
> Should you take the mem_tier_lock before you insert to
> memtier-list?
>
> > +     insert_memory_tier(memtier);
> > +
> > +     error = device_register(&memtier->dev);
> > +     if (error) {
> > +             list_del(&memtier->list);
> > +             put_device(&memtier->dev);
> > +             return NULL;
> > +     }
> > +     return memtier;
> > +}
> > +
> > +__maybe_unused // temporay to prevent warnings during bisects
> > +static void unregister_memory_tier(struct memory_tier *memtier)
> > +{
>
> I think we should take mem_tier_lock before modifying memtier->list.
>
> > +     list_del(&memtier->list);
> > +     device_unregister(&memtier->dev);
> > +}
> > +
> >
>
> Thanks.
>
> Tim
>
>
Yang Shi June 7, 2022, 9:32 p.m. UTC | #3
On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
<aneesh.kumar@linux.ibm.com> wrote:
>
> In the current kernel, memory tiers are defined implicitly via a
> demotion path relationship between NUMA nodes, which is created
> during the kernel initialization and updated when a NUMA node is
> hot-added or hot-removed.  The current implementation puts all
> nodes with CPU into the top tier, and builds the tier hierarchy
> tier-by-tier by establishing the per-node demotion targets based
> on the distances between nodes.
>
> This current memory tier kernel interface needs to be improved for
> several important use cases,
>
> The current tier initialization code always initializes
> each memory-only NUMA node into a lower tier.  But a memory-only
> NUMA node may have a high performance memory device (e.g. a DRAM
> device attached via CXL.mem or a DRAM-backed memory-only node on
> a virtual machine) and should be put into a higher tier.
>
> The current tier hierarchy always puts CPU nodes into the top
> tier. But on a system with HBM or GPU devices, the
> memory-only NUMA nodes mapping these devices should be in the
> top tier, and DRAM nodes with CPUs are better to be placed into the
> next lower tier.
>
> With current kernel higher tier node can only be demoted to selected nodes on the
> next lower tier as defined by the demotion path, not any other
> node from any lower tier.  This strict, hard-coded demotion order
> does not work in all use cases (e.g. some use cases may want to
> allow cross-socket demotion to another node in the same demotion
> tier as a fallback when the preferred demotion node is out of
> space), This demotion order is also inconsistent with the page
> allocation fallback order when all the nodes in a higher tier are
> out of space: The page allocation can fall back to any node from
> any lower tier, whereas the demotion order doesn't allow that.
>
> The current kernel also don't provide any interfaces for the
> userspace to learn about the memory tier hierarchy in order to
> optimize its memory allocations.
>
> This patch series address the above by defining memory tiers explicitly.
>
> This patch introduce explicity memory tiers with ranks. The rank
> value of a memory tier is used to derive the demotion order between
> NUMA nodes. The memory tiers present in a system can be found at
>
> /sys/devices/system/memtier/memtierN/
>
> The nodes which are part of a specific memory tier can be listed
> via
> /sys/devices/system/memtier/memtierN/nodelist
>
> "Rank" is an opaque value. Its absolute value doesn't have any
> special meaning. But the rank values of different memtiers can be
> compared with each other to determine the memory tier order.
>
> For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
> their rank values are 300, 200, 100, then the memory tier order is:
> memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
> and memtier1 is the lowest tier.
>
> The rank value of each memtier should be unique.
>
> A higher rank memory tier will appear first in the demotion order
> than a lower rank memory tier. ie. while reclaim we choose a node
> in higher rank memory tier to demote pages to as compared to a node
> in a lower rank memory tier.
>
> For now we are not adding the dynamic number of memory tiers.
> But a future series supporting that is possible. Currently
> number of tiers supported is limitted to MAX_MEMORY_TIERS(3).
> When doing memory hotplug, if not added to a memory tier, the NUMA
> node gets added to DEFAULT_MEMORY_TIER(1).
>
> This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
>
> [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>
> Suggested-by: Wei Xu <weixugc@google.com>
> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  include/linux/memory-tiers.h |  20 ++++
>  mm/Kconfig                   |  11 ++
>  mm/Makefile                  |   1 +
>  mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
>  4 files changed, 220 insertions(+)
>  create mode 100644 include/linux/memory-tiers.h
>  create mode 100644 mm/memory-tiers.c
>
> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> new file mode 100644
> index 000000000000..e17f6b4ee177
> --- /dev/null
> +++ b/include/linux/memory-tiers.h
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +#ifdef CONFIG_TIERED_MEMORY
> +
> +#define MEMORY_TIER_HBM_GPU    0
> +#define MEMORY_TIER_DRAM       1
> +#define MEMORY_TIER_PMEM       2
> +
> +#define MEMORY_RANK_HBM_GPU    300
> +#define MEMORY_RANK_DRAM       200
> +#define MEMORY_RANK_PMEM       100
> +
> +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> +#define MAX_MEMORY_TIERS  3
> +
> +#endif /* CONFIG_TIERED_MEMORY */
> +
> +#endif
> diff --git a/mm/Kconfig b/mm/Kconfig
> index 169e64192e48..08a3d330740b 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>  config ARCH_ENABLE_THP_MIGRATION
>         bool
>
> +config TIERED_MEMORY
> +       bool "Support for explicit memory tiers"
> +       def_bool n
> +       depends on MIGRATION && NUMA
> +       help
> +         Support to split nodes into memory tiers explicitly and
> +         to demote pages on reclaim to lower tiers. This option
> +         also exposes sysfs interface to read nodes available in
> +         specific tier and to move specific node among different
> +         possible tiers.

IMHO we should not need a new kernel config. If tiering is not present
then there is just one tier on the system. And tiering is a kind of
hardware configuration, the information could be shown regardless of
whether demotion/promotion is supported/enabled or not.

> +
>  config HUGETLB_PAGE_SIZE_VARIABLE
>         def_bool n
>         help
> diff --git a/mm/Makefile b/mm/Makefile
> index 6f9ffa968a1a..482557fbc9d1 100644
> --- a/mm/Makefile
> +++ b/mm/Makefile
> @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
>  obj-$(CONFIG_FAILSLAB) += failslab.o
>  obj-$(CONFIG_MEMTEST)          += memtest.o
>  obj-$(CONFIG_MIGRATION) += migrate.o
> +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
>  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
>  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
>  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> new file mode 100644
> index 000000000000..7de18d94a08d
> --- /dev/null
> +++ b/mm/memory-tiers.c
> @@ -0,0 +1,188 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/types.h>
> +#include <linux/device.h>
> +#include <linux/nodemask.h>
> +#include <linux/slab.h>
> +#include <linux/memory-tiers.h>
> +
> +struct memory_tier {
> +       struct list_head list;
> +       struct device dev;
> +       nodemask_t nodelist;
> +       int rank;
> +};
> +
> +#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
> +
> +static struct bus_type memory_tier_subsys = {
> +       .name = "memtier",
> +       .dev_name = "memtier",
> +};
> +
> +static DEFINE_MUTEX(memory_tier_lock);
> +static LIST_HEAD(memory_tiers);
> +
> +
> +static ssize_t nodelist_show(struct device *dev,
> +                            struct device_attribute *attr, char *buf)
> +{
> +       struct memory_tier *memtier = to_memory_tier(dev);
> +
> +       return sysfs_emit(buf, "%*pbl\n",
> +                         nodemask_pr_args(&memtier->nodelist));
> +}
> +static DEVICE_ATTR_RO(nodelist);
> +
> +static ssize_t rank_show(struct device *dev,
> +                        struct device_attribute *attr, char *buf)
> +{
> +       struct memory_tier *memtier = to_memory_tier(dev);
> +
> +       return sysfs_emit(buf, "%d\n", memtier->rank);
> +}
> +static DEVICE_ATTR_RO(rank);
> +
> +static struct attribute *memory_tier_dev_attrs[] = {
> +       &dev_attr_nodelist.attr,
> +       &dev_attr_rank.attr,
> +       NULL
> +};
> +
> +static const struct attribute_group memory_tier_dev_group = {
> +       .attrs = memory_tier_dev_attrs,
> +};
> +
> +static const struct attribute_group *memory_tier_dev_groups[] = {
> +       &memory_tier_dev_group,
> +       NULL
> +};
> +
> +static void memory_tier_device_release(struct device *dev)
> +{
> +       struct memory_tier *tier = to_memory_tier(dev);
> +
> +       kfree(tier);
> +}
> +
> +/*
> + * Keep it simple by having  direct mapping between
> + * tier index and rank value.
> + */
> +static inline int get_rank_from_tier(unsigned int tier)
> +{
> +       switch (tier) {
> +       case MEMORY_TIER_HBM_GPU:
> +               return MEMORY_RANK_HBM_GPU;
> +       case MEMORY_TIER_DRAM:
> +               return MEMORY_RANK_DRAM;
> +       case MEMORY_TIER_PMEM:
> +               return MEMORY_RANK_PMEM;
> +       }
> +
> +       return 0;
> +}
> +
> +static void insert_memory_tier(struct memory_tier *memtier)
> +{
> +       struct list_head *ent;
> +       struct memory_tier *tmp_memtier;
> +
> +       list_for_each(ent, &memory_tiers) {
> +               tmp_memtier = list_entry(ent, struct memory_tier, list);
> +               if (tmp_memtier->rank < memtier->rank) {
> +                       list_add_tail(&memtier->list, ent);
> +                       return;
> +               }
> +       }
> +       list_add_tail(&memtier->list, &memory_tiers);
> +}
> +
> +static struct memory_tier *register_memory_tier(unsigned int tier)
> +{
> +       int error;
> +       struct memory_tier *memtier;
> +
> +       if (tier >= MAX_MEMORY_TIERS)
> +               return NULL;
> +
> +       memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> +       if (!memtier)
> +               return NULL;
> +
> +       memtier->dev.id = tier;
> +       memtier->rank = get_rank_from_tier(tier);
> +       memtier->dev.bus = &memory_tier_subsys;
> +       memtier->dev.release = memory_tier_device_release;
> +       memtier->dev.groups = memory_tier_dev_groups;
> +
> +       insert_memory_tier(memtier);
> +
> +       error = device_register(&memtier->dev);
> +       if (error) {
> +               list_del(&memtier->list);
> +               put_device(&memtier->dev);
> +               return NULL;
> +       }
> +       return memtier;
> +}
> +
> +__maybe_unused // temporay to prevent warnings during bisects
> +static void unregister_memory_tier(struct memory_tier *memtier)
> +{
> +       list_del(&memtier->list);
> +       device_unregister(&memtier->dev);
> +}
> +
> +static ssize_t
> +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +       return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
> +}
> +static DEVICE_ATTR_RO(max_tier);
> +
> +static ssize_t
> +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> +{
> +       return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER);
> +}
> +static DEVICE_ATTR_RO(default_tier);
> +
> +static struct attribute *memory_tier_attrs[] = {
> +       &dev_attr_max_tier.attr,
> +       &dev_attr_default_tier.attr,
> +       NULL
> +};
> +
> +static const struct attribute_group memory_tier_attr_group = {
> +       .attrs = memory_tier_attrs,
> +};
> +
> +static const struct attribute_group *memory_tier_attr_groups[] = {
> +       &memory_tier_attr_group,
> +       NULL,
> +};
> +
> +static int __init memory_tier_init(void)
> +{
> +       int ret;
> +       struct memory_tier *memtier;
> +
> +       ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> +       if (ret)
> +               panic("%s() failed to register subsystem: %d\n", __func__, ret);
> +
> +       /*
> +        * Register only default memory tier to hide all empty
> +        * memory tier from sysfs.
> +        */
> +       memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
> +       if (!memtier)
> +               panic("%s() failed to register memory tier: %d\n", __func__, ret);
> +
> +       /* CPU only nodes are not part of memory tiers. */
> +       memtier->nodelist = node_states[N_MEMORY];
> +
> +       return 0;
> +}
> +subsys_initcall(memory_tier_init);
> +
> --
> 2.36.1
>
Huang, Ying June 8, 2022, 1:34 a.m. UTC | #4
On Tue, 2022-06-07 at 14:32 -0700, Yang Shi wrote:
> On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
> > 
> > In the current kernel, memory tiers are defined implicitly via a
> > demotion path relationship between NUMA nodes, which is created
> > during the kernel initialization and updated when a NUMA node is
> > hot-added or hot-removed.  The current implementation puts all
> > nodes with CPU into the top tier, and builds the tier hierarchy
> > tier-by-tier by establishing the per-node demotion targets based
> > on the distances between nodes.
> > 
> > This current memory tier kernel interface needs to be improved for
> > several important use cases,
> > 
> > The current tier initialization code always initializes
> > each memory-only NUMA node into a lower tier.  But a memory-only
> > NUMA node may have a high performance memory device (e.g. a DRAM
> > device attached via CXL.mem or a DRAM-backed memory-only node on
> > a virtual machine) and should be put into a higher tier.
> > 
> > The current tier hierarchy always puts CPU nodes into the top
> > tier. But on a system with HBM or GPU devices, the
> > memory-only NUMA nodes mapping these devices should be in the
> > top tier, and DRAM nodes with CPUs are better to be placed into the
> > next lower tier.
> > 
> > With current kernel higher tier node can only be demoted to selected nodes on the
> > next lower tier as defined by the demotion path, not any other
> > node from any lower tier.  This strict, hard-coded demotion order
> > does not work in all use cases (e.g. some use cases may want to
> > allow cross-socket demotion to another node in the same demotion
> > tier as a fallback when the preferred demotion node is out of
> > space), This demotion order is also inconsistent with the page
> > allocation fallback order when all the nodes in a higher tier are
> > out of space: The page allocation can fall back to any node from
> > any lower tier, whereas the demotion order doesn't allow that.
> > 
> > The current kernel also don't provide any interfaces for the
> > userspace to learn about the memory tier hierarchy in order to
> > optimize its memory allocations.
> > 
> > This patch series address the above by defining memory tiers explicitly.
> > 
> > This patch introduce explicity memory tiers with ranks. The rank
> > value of a memory tier is used to derive the demotion order between
> > NUMA nodes. The memory tiers present in a system can be found at
> > 
> > /sys/devices/system/memtier/memtierN/
> > 
> > The nodes which are part of a specific memory tier can be listed
> > via
> > /sys/devices/system/memtier/memtierN/nodelist
> > 
> > "Rank" is an opaque value. Its absolute value doesn't have any
> > special meaning. But the rank values of different memtiers can be
> > compared with each other to determine the memory tier order.
> > 
> > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
> > their rank values are 300, 200, 100, then the memory tier order is:
> > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
> > and memtier1 is the lowest tier.
> > 
> > The rank value of each memtier should be unique.
> > 
> > A higher rank memory tier will appear first in the demotion order
> > than a lower rank memory tier. ie. while reclaim we choose a node
> > in higher rank memory tier to demote pages to as compared to a node
> > in a lower rank memory tier.
> > 
> > For now we are not adding the dynamic number of memory tiers.
> > But a future series supporting that is possible. Currently
> > number of tiers supported is limitted to MAX_MEMORY_TIERS(3).
> > When doing memory hotplug, if not added to a memory tier, the NUMA
> > node gets added to DEFAULT_MEMORY_TIER(1).
> > 
> > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
> > 
> > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> > 
> > Suggested-by: Wei Xu <weixugc@google.com>
> > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > ---
> >  include/linux/memory-tiers.h |  20 ++++
> >  mm/Kconfig                   |  11 ++
> >  mm/Makefile                  |   1 +
> >  mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
> >  4 files changed, 220 insertions(+)
> >  create mode 100644 include/linux/memory-tiers.h
> >  create mode 100644 mm/memory-tiers.c
> > 
> > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > new file mode 100644
> > index 000000000000..e17f6b4ee177
> > --- /dev/null
> > +++ b/include/linux/memory-tiers.h
> > @@ -0,0 +1,20 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_MEMORY_TIERS_H
> > +#define _LINUX_MEMORY_TIERS_H
> > +
> > +#ifdef CONFIG_TIERED_MEMORY
> > +
> > +#define MEMORY_TIER_HBM_GPU    0
> > +#define MEMORY_TIER_DRAM       1
> > +#define MEMORY_TIER_PMEM       2
> > +
> > +#define MEMORY_RANK_HBM_GPU    300
> > +#define MEMORY_RANK_DRAM       200
> > +#define MEMORY_RANK_PMEM       100
> > +
> > +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> > +#define MAX_MEMORY_TIERS  3
> > +
> > +#endif /* CONFIG_TIERED_MEMORY */
> > +
> > +#endif
> > diff --git a/mm/Kconfig b/mm/Kconfig
> > index 169e64192e48..08a3d330740b 100644
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> >  config ARCH_ENABLE_THP_MIGRATION
> >         bool
> > 
> > +config TIERED_MEMORY
> > +       bool "Support for explicit memory tiers"
> > +       def_bool n
> > +       depends on MIGRATION && NUMA
> > +       help
> > +         Support to split nodes into memory tiers explicitly and
> > +         to demote pages on reclaim to lower tiers. This option
> > +         also exposes sysfs interface to read nodes available in
> > +         specific tier and to move specific node among different
> > +         possible tiers.
> 
> IMHO we should not need a new kernel config. If tiering is not present
> then there is just one tier on the system. And tiering is a kind of
> hardware configuration, the information could be shown regardless of
> whether demotion/promotion is supported/enabled or not.

I think so too.  At least it appears unnecessary to let the user turn
it on/off at configuration time.

All the code should be enclosed by #if defined(CONFIG_NUMA) &&
defined(CONFIG_MIGRATION).  So we will not waste memory in small
systems.
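
For illustration only, one possible shape of that #ifdef arrangement
(not from the posted series; the helper below is hypothetical):

/* include/linux/memory-tiers.h */
#if defined(CONFIG_NUMA) && defined(CONFIG_MIGRATION)

#define MEMORY_TIER_HBM_GPU	0
#define MEMORY_TIER_DRAM	1
#define MEMORY_TIER_PMEM	2
#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
#define MAX_MEMORY_TIERS	3

int node_get_memory_tier_id(int node);		/* hypothetical helper */

#else /* !CONFIG_NUMA || !CONFIG_MIGRATION */

static inline int node_get_memory_tier_id(int node)
{
	return -1;	/* no tiering on !NUMA or !MIGRATION builds */
}

#endif

That way callers never need #ifdefs of their own.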

Best Regards,
Huang, Ying

> > +
> >  config HUGETLB_PAGE_SIZE_VARIABLE
> >         def_bool n
> >         help
> > diff --git a/mm/Makefile b/mm/Makefile
> > index 6f9ffa968a1a..482557fbc9d1 100644
> > --- a/mm/Makefile
> > +++ b/mm/Makefile
> > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
> >  obj-$(CONFIG_FAILSLAB) += failslab.o
> >  obj-$(CONFIG_MEMTEST)          += memtest.o
> >  obj-$(CONFIG_MIGRATION) += migrate.o
> > +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
> >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > new file mode 100644
> > index 000000000000..7de18d94a08d
> > --- /dev/null
> > +++ b/mm/memory-tiers.c
> > @@ -0,0 +1,188 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/types.h>
> > +#include <linux/device.h>
> > +#include <linux/nodemask.h>
> > +#include <linux/slab.h>
> > +#include <linux/memory-tiers.h>
> > +
> > +struct memory_tier {
> > +       struct list_head list;
> > +       struct device dev;
> > +       nodemask_t nodelist;
> > +       int rank;
> > +};
> > +
> > +#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
> > +
> > +static struct bus_type memory_tier_subsys = {
> > +       .name = "memtier",
> > +       .dev_name = "memtier",
> > +};
> > +
> > +static DEFINE_MUTEX(memory_tier_lock);
> > +static LIST_HEAD(memory_tiers);
> > +
> > +
> > +static ssize_t nodelist_show(struct device *dev,
> > +                            struct device_attribute *attr, char *buf)
> > +{
> > +       struct memory_tier *memtier = to_memory_tier(dev);
> > +
> > +       return sysfs_emit(buf, "%*pbl\n",
> > +                         nodemask_pr_args(&memtier->nodelist));
> > +}
> > +static DEVICE_ATTR_RO(nodelist);
> > +
> > +static ssize_t rank_show(struct device *dev,
> > +                        struct device_attribute *attr, char *buf)
> > +{
> > +       struct memory_tier *memtier = to_memory_tier(dev);
> > +
> > +       return sysfs_emit(buf, "%d\n", memtier->rank);
> > +}
> > +static DEVICE_ATTR_RO(rank);
> > +
> > +static struct attribute *memory_tier_dev_attrs[] = {
> > +       &dev_attr_nodelist.attr,
> > +       &dev_attr_rank.attr,
> > +       NULL
> > +};
> > +
> > +static const struct attribute_group memory_tier_dev_group = {
> > +       .attrs = memory_tier_dev_attrs,
> > +};
> > +
> > +static const struct attribute_group *memory_tier_dev_groups[] = {
> > +       &memory_tier_dev_group,
> > +       NULL
> > +};
> > +
> > +static void memory_tier_device_release(struct device *dev)
> > +{
> > +       struct memory_tier *tier = to_memory_tier(dev);
> > +
> > +       kfree(tier);
> > +}
> > +
> > +/*
> > + * Keep it simple by having  direct mapping between
> > + * tier index and rank value.
> > + */
> > +static inline int get_rank_from_tier(unsigned int tier)
> > +{
> > +       switch (tier) {
> > +       case MEMORY_TIER_HBM_GPU:
> > +               return MEMORY_RANK_HBM_GPU;
> > +       case MEMORY_TIER_DRAM:
> > +               return MEMORY_RANK_DRAM;
> > +       case MEMORY_TIER_PMEM:
> > +               return MEMORY_RANK_PMEM;
> > +       }
> > +
> > +       return 0;
> > +}
> > +
> > +static void insert_memory_tier(struct memory_tier *memtier)
> > +{
> > +       struct list_head *ent;
> > +       struct memory_tier *tmp_memtier;
> > +
> > +       list_for_each(ent, &memory_tiers) {
> > +               tmp_memtier = list_entry(ent, struct memory_tier, list);
> > +               if (tmp_memtier->rank < memtier->rank) {
> > +                       list_add_tail(&memtier->list, ent);
> > +                       return;
> > +               }
> > +       }
> > +       list_add_tail(&memtier->list, &memory_tiers);
> > +}
> > +
> > +static struct memory_tier *register_memory_tier(unsigned int tier)
> > +{
> > +       int error;
> > +       struct memory_tier *memtier;
> > +
> > +       if (tier >= MAX_MEMORY_TIERS)
> > +               return NULL;
> > +
> > +       memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > +       if (!memtier)
> > +               return NULL;
> > +
> > +       memtier->dev.id = tier;
> > +       memtier->rank = get_rank_from_tier(tier);
> > +       memtier->dev.bus = &memory_tier_subsys;
> > +       memtier->dev.release = memory_tier_device_release;
> > +       memtier->dev.groups = memory_tier_dev_groups;
> > +
> > +       insert_memory_tier(memtier);
> > +
> > +       error = device_register(&memtier->dev);
> > +       if (error) {
> > +               list_del(&memtier->list);
> > +               put_device(&memtier->dev);
> > +               return NULL;
> > +       }
> > +       return memtier;
> > +}
> > +
> > +__maybe_unused // temporay to prevent warnings during bisects
> > +static void unregister_memory_tier(struct memory_tier *memtier)
> > +{
> > +       list_del(&memtier->list);
> > +       device_unregister(&memtier->dev);
> > +}
> > +
> > +static ssize_t
> > +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> > +{
> > +       return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
> > +}
> > +static DEVICE_ATTR_RO(max_tier);
> > +
> > +static ssize_t
> > +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> > +{
> > +       return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER);
> > +}
> > +static DEVICE_ATTR_RO(default_tier);
> > +
> > +static struct attribute *memory_tier_attrs[] = {
> > +       &dev_attr_max_tier.attr,
> > +       &dev_attr_default_tier.attr,
> > +       NULL
> > +};
> > +
> > +static const struct attribute_group memory_tier_attr_group = {
> > +       .attrs = memory_tier_attrs,
> > +};
> > +
> > +static const struct attribute_group *memory_tier_attr_groups[] = {
> > +       &memory_tier_attr_group,
> > +       NULL,
> > +};
> > +
> > +static int __init memory_tier_init(void)
> > +{
> > +       int ret;
> > +       struct memory_tier *memtier;
> > +
> > +       ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> > +       if (ret)
> > +               panic("%s() failed to register subsystem: %d\n", __func__, ret);
> > +
> > +       /*
> > +        * Register only default memory tier to hide all empty
> > +        * memory tier from sysfs.
> > +        */
> > +       memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
> > +       if (!memtier)
> > +               panic("%s() failed to register memory tier: %d\n", __func__, ret);
> > +
> > +       /* CPU only nodes are not part of memory tiers. */
> > +       memtier->nodelist = node_states[N_MEMORY];
> > +
> > +       return 0;
> > +}
> > +subsys_initcall(memory_tier_init);
> > +
> > --
> > 2.36.1
> >
Aneesh Kumar K.V June 8, 2022, 4:30 a.m. UTC | #5
On 6/8/22 12:13 AM, Tim Chen wrote:
> On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
>>
>>
>> The nodes which are part of a specific memory tier can be listed
>> via
>> /sys/devices/system/memtier/memtierN/nodelist
>>
>> "Rank" is an opaque value. Its absolute value doesn't have any
>> special meaning. But the rank values of different memtiers can be
>> compared with each other to determine the memory tier order.
>>
>> For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
>> their rank values are 300, 200, 100, then the memory tier order is:
>> memtier0 -> memtier2 -> memtier1,
> 
> Why is memtier2 (rank 100) higher than memtier1 (rank 200)?  Seems like
> the order should be memtier0 -> memtier1 -> memtier2?
>                      (rank 300)  (rank 200)  (rank 100)
> 
>> where memtier0 is the highest tier
>> and memtier1 is the lowest tier.
> 
> I think memtier2 is the lowest as it has the lowest rank value.


typo error. Will fix that in the next update

>>
>> The rank value of each memtier should be unique.
>>
>>
>> +
>> +static void memory_tier_device_release(struct device *dev)
>> +{
>> +	struct memory_tier *tier = to_memory_tier(dev);
>> +
> 
> Do we need some ref counts on memory_tier?
> If there is another device still using the same memtier,
> free below could cause problem.
> 
>> +	kfree(tier);
>> +}
>> +
>>
> ...
>> +static struct memory_tier *register_memory_tier(unsigned int tier)
>> +{
>> +	int error;
>> +	struct memory_tier *memtier;
>> +
>> +	if (tier >= MAX_MEMORY_TIERS)
>> +		return NULL;
>> +
>> +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
>> +	if (!memtier)
>> +		return NULL;
>> +
>> +	memtier->dev.id = tier;
>> +	memtier->rank = get_rank_from_tier(tier);
>> +	memtier->dev.bus = &memory_tier_subsys;
>> +	memtier->dev.release = memory_tier_device_release;
>> +	memtier->dev.groups = memory_tier_dev_groups;
>> +
> 
> Should you take the mem_tier_lock before you insert to
> memtier-list?


Both register_memory_tier and unregister_memory_tier get called with 
memory_tier_lock held.

> 
>> +	insert_memory_tier(memtier);
>> +
>> +	error = device_register(&memtier->dev);
>> +	if (error) {
>> +		list_del(&memtier->list);
>> +		put_device(&memtier->dev);
>> +		return NULL;
>> +	}
>> +	return memtier;
>> +}
>> +
>> +__maybe_unused // temporay to prevent warnings during bisects
>> +static void unregister_memory_tier(struct memory_tier *memtier)
>> +{
> 
> I think we should take mem_tier_lock before modifying memtier->list.
> 

unregister_memory_tier gets called with memory_tier_lock held.

>> +	list_del(&memtier->list);
>> +	device_unregister(&memtier->dev);
>> +}
>> +
>>

-aneesh
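
For illustration only (not part of the posted patch), the locking rule
described above could also be stated in a comment and checked with
lockdep; the caller below is hypothetical:

/* Caller must hold memory_tier_lock. */
static void unregister_memory_tier(struct memory_tier *memtier)
{
	lockdep_assert_held(&memory_tier_lock);

	list_del(&memtier->list);
	device_unregister(&memtier->dev);
}

/* Hypothetical caller, showing the expected locking pattern. */
static void memtier_remove_if_empty(struct memory_tier *memtier)
{
	mutex_lock(&memory_tier_lock);
	if (nodes_empty(memtier->nodelist))
		unregister_memory_tier(memtier);
	mutex_unlock(&memory_tier_lock);
}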
Aneesh Kumar K.V June 8, 2022, 4:37 a.m. UTC | #6
On 6/8/22 12:13 AM, Tim Chen wrote:
...

>>
>> +
>> +static void memory_tier_device_release(struct device *dev)
>> +{
>> +	struct memory_tier *tier = to_memory_tier(dev);
>> +
> 
> Do we need some ref counts on memory_tier?
> If there is another device still using the same memtier,
> free below could cause problem.
> 
>> +	kfree(tier);
>> +}
>> +
>>
> ...

The lifecycle of the memory_tier struct is tied to the sysfs device
lifetime, i.e. memory_tier_device_release gets called only after the
last reference on that sysfs dev object is released. Hence we can be
sure there is no userspace that is keeping one of the memtier related
sysfs files open.

W.r.t. other memory devices sharing the same memtier, we unregister the
sysfs device only when the memory tier nodelist is empty, that is, when
no memory device is present in this memory tier.

-aneesh
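
For illustration only, a hypothetical lookup helper (not part of this
patch) expressing the lifetime rule above: any code that keeps the
pointer takes a device reference, and the kfree() in the release
callback only happens after the last put:

static struct memory_tier *get_memory_tier(int tier_id)
{
	struct memory_tier *memtier;

	mutex_lock(&memory_tier_lock);
	list_for_each_entry(memtier, &memory_tiers, list) {
		if (memtier->dev.id == tier_id) {
			get_device(&memtier->dev);
			mutex_unlock(&memory_tier_lock);
			return memtier;
		}
	}
	mutex_unlock(&memory_tier_lock);
	return NULL;
}

static void put_memory_tier(struct memory_tier *memtier)
{
	put_device(&memtier->dev);	/* release() frees after last put */
}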
Aneesh Kumar K.V June 8, 2022, 4:58 a.m. UTC | #7
On 6/8/22 3:02 AM, Yang Shi wrote:
> On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> <aneesh.kumar@linux.ibm.com> wrote:
>>
>> In the current kernel, memory tiers are defined implicitly via a
>> demotion path relationship between NUMA nodes, which is created
>> during the kernel initialization and updated when a NUMA node is
>> hot-added or hot-removed.  The current implementation puts all
>> nodes with CPU into the top tier, and builds the tier hierarchy
>> tier-by-tier by establishing the per-node demotion targets based
>> on the distances between nodes.
>>
>> This current memory tier kernel interface needs to be improved for
>> several important use cases,
>>
>> The current tier initialization code always initializes
>> each memory-only NUMA node into a lower tier.  But a memory-only
>> NUMA node may have a high performance memory device (e.g. a DRAM
>> device attached via CXL.mem or a DRAM-backed memory-only node on
>> a virtual machine) and should be put into a higher tier.
>>
>> The current tier hierarchy always puts CPU nodes into the top
>> tier. But on a system with HBM or GPU devices, the
>> memory-only NUMA nodes mapping these devices should be in the
>> top tier, and DRAM nodes with CPUs are better to be placed into the
>> next lower tier.
>>
>> With current kernel higher tier node can only be demoted to selected nodes on the
>> next lower tier as defined by the demotion path, not any other
>> node from any lower tier.  This strict, hard-coded demotion order
>> does not work in all use cases (e.g. some use cases may want to
>> allow cross-socket demotion to another node in the same demotion
>> tier as a fallback when the preferred demotion node is out of
>> space), This demotion order is also inconsistent with the page
>> allocation fallback order when all the nodes in a higher tier are
>> out of space: The page allocation can fall back to any node from
>> any lower tier, whereas the demotion order doesn't allow that.
>>
>> The current kernel also don't provide any interfaces for the
>> userspace to learn about the memory tier hierarchy in order to
>> optimize its memory allocations.
>>
>> This patch series address the above by defining memory tiers explicitly.
>>
>> This patch introduce explicity memory tiers with ranks. The rank
>> value of a memory tier is used to derive the demotion order between
>> NUMA nodes. The memory tiers present in a system can be found at
>>
>> /sys/devices/system/memtier/memtierN/
>>
>> The nodes which are part of a specific memory tier can be listed
>> via
>> /sys/devices/system/memtier/memtierN/nodelist
>>
>> "Rank" is an opaque value. Its absolute value doesn't have any
>> special meaning. But the rank values of different memtiers can be
>> compared with each other to determine the memory tier order.
>>
>> For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
>> their rank values are 300, 200, 100, then the memory tier order is:
>> memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
>> and memtier1 is the lowest tier.
>>
>> The rank value of each memtier should be unique.
>>
>> A higher rank memory tier will appear first in the demotion order
>> than a lower rank memory tier. ie. while reclaim we choose a node
>> in higher rank memory tier to demote pages to as compared to a node
>> in a lower rank memory tier.
>>
>> For now we are not adding the dynamic number of memory tiers.
>> But a future series supporting that is possible. Currently
>> number of tiers supported is limitted to MAX_MEMORY_TIERS(3).
>> When doing memory hotplug, if not added to a memory tier, the NUMA
>> node gets added to DEFAULT_MEMORY_TIER(1).
>>
>> This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
>>
>> [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
>>
>> Suggested-by: Wei Xu <weixugc@google.com>
>> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
>> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
>> ---
>>   include/linux/memory-tiers.h |  20 ++++
>>   mm/Kconfig                   |  11 ++
>>   mm/Makefile                  |   1 +
>>   mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
>>   4 files changed, 220 insertions(+)
>>   create mode 100644 include/linux/memory-tiers.h
>>   create mode 100644 mm/memory-tiers.c
>>
>> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
>> new file mode 100644
>> index 000000000000..e17f6b4ee177
>> --- /dev/null
>> +++ b/include/linux/memory-tiers.h
>> @@ -0,0 +1,20 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_MEMORY_TIERS_H
>> +#define _LINUX_MEMORY_TIERS_H
>> +
>> +#ifdef CONFIG_TIERED_MEMORY
>> +
>> +#define MEMORY_TIER_HBM_GPU    0
>> +#define MEMORY_TIER_DRAM       1
>> +#define MEMORY_TIER_PMEM       2
>> +
>> +#define MEMORY_RANK_HBM_GPU    300
>> +#define MEMORY_RANK_DRAM       200
>> +#define MEMORY_RANK_PMEM       100
>> +
>> +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
>> +#define MAX_MEMORY_TIERS  3
>> +
>> +#endif /* CONFIG_TIERED_MEMORY */
>> +
>> +#endif
>> diff --git a/mm/Kconfig b/mm/Kconfig
>> index 169e64192e48..08a3d330740b 100644
>> --- a/mm/Kconfig
>> +++ b/mm/Kconfig
>> @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
>>   config ARCH_ENABLE_THP_MIGRATION
>>          bool
>>
>> +config TIERED_MEMORY
>> +       bool "Support for explicit memory tiers"
>> +       def_bool n
>> +       depends on MIGRATION && NUMA
>> +       help
>> +         Support to split nodes into memory tiers explicitly and
>> +         to demote pages on reclaim to lower tiers. This option
>> +         also exposes sysfs interface to read nodes available in
>> +         specific tier and to move specific node among different
>> +         possible tiers.
> 
> IMHO we should not need a new kernel config. If tiering is not present
> then there is just one tier on the system. And tiering is a kind of
> hardware configuration, the information could be shown regardless of
> whether demotion/promotion is supported/enabled or not.
> 

This was added so that we could avoid scattering multiple

#if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)

checks around. Initially I had it as def_bool y with depends on
MIGRATION && NUMA, but it was later suggested that def_bool is not
recommended for newer config options.

How about

  config TIERED_MEMORY
  	bool "Support for explicit memory tiers"
-	def_bool n
-	depends on MIGRATION && NUMA
-	help
-	  Support to split nodes into memory tiers explicitly and
-	  to demote pages on reclaim to lower tiers. This option
-	  also exposes sysfs interface to read nodes available in
-	  specific tier and to move specific node among different
-	  possible tiers.
+	def_bool MIGRATION && NUMA

  config HUGETLB_PAGE_SIZE_VARIABLE
  	def_bool n

i.e., we just make it a Kconfig variable without exposing it to the user?

-aneesh
Huang, Ying June 8, 2022, 6:06 a.m. UTC | #8
On Wed, 2022-06-08 at 10:00 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:13 AM, Tim Chen wrote:
> > On Fri, 2022-06-03 at 19:12 +0530, Aneesh Kumar K.V wrote:
> > > 
> > > 
> > > The nodes which are part of a specific memory tier can be listed
> > > via
> > > /sys/devices/system/memtier/memtierN/nodelist
> > > 
> > > "Rank" is an opaque value. Its absolute value doesn't have any
> > > special meaning. But the rank values of different memtiers can be
> > > compared with each other to determine the memory tier order.
> > > 
> > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
> > > their rank values are 300, 200, 100, then the memory tier order is:
> > > memtier0 -> memtier2 -> memtier1,
> > 
> > Why is memtier2 (rank 100) higher than memtier1 (rank 200)?  Seems like
> > the order should be memtier0 -> memtier1 -> memtier2?
> >                      (rank 300)  (rank 200)  (rank 100)
> > 
> > > where memtier0 is the highest tier
> > > and memtier1 is the lowest tier.
> > 
> > I think memtier2 is the lowest as it has the lowest rank value.
> 
> 
> typo error. Will fix that in the next update
> 
> > > 
> > > The rank value of each memtier should be unique.
> > > 
> > > 
> > > +
> > > +static void memory_tier_device_release(struct device *dev)
> > > +{
> > > +	struct memory_tier *tier = to_memory_tier(dev);
> > > +
> > 
> > Do we need some ref counts on memory_tier?
> > If there is another device still using the same memtier,
> > free below could cause problem.
> > 
> > > +	kfree(tier);
> > > +}
> > > +
> > > 
> > ...
> > > +static struct memory_tier *register_memory_tier(unsigned int tier)
> > > +{
> > > +	int error;
> > > +	struct memory_tier *memtier;
> > > +
> > > +	if (tier >= MAX_MEMORY_TIERS)
> > > +		return NULL;
> > > +
> > > +	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > > +	if (!memtier)
> > > +		return NULL;
> > > +
> > > +	memtier->dev.id = tier;
> > > +	memtier->rank = get_rank_from_tier(tier);
> > > +	memtier->dev.bus = &memory_tier_subsys;
> > > +	memtier->dev.release = memory_tier_device_release;
> > > +	memtier->dev.groups = memory_tier_dev_groups;
> > > +
> > 
> > Should you take the mem_tier_lock before you insert to
> > memtier-list?
> 
> 
> Both register_memory_tier and unregister_memory_tier get called with 
> memory_tier_lock held.

Then please add locking requirements to the comments above these
functions.

Best Regards,
Huang, Ying

> > 
> > > +	insert_memory_tier(memtier);
> > > +
> > > +	error = device_register(&memtier->dev);
> > > +	if (error) {
> > > +		list_del(&memtier->list);
> > > +		put_device(&memtier->dev);
> > > +		return NULL;
> > > +	}
> > > +	return memtier;
> > > +}
> > > +
> > > +__maybe_unused // temporay to prevent warnings during bisects
> > > +static void unregister_memory_tier(struct memory_tier *memtier)
> > > +{
> > 
> > I think we should take mem_tier_lock before modifying memtier->list.
> > 
> 
> unregister_memory_tier get called with memory_tier_lock held.
> 
> > > +	list_del(&memtier->list);
> > > +	device_unregister(&memtier->dev);
> > > +}
> > > +
> > > 
> 
> -aneesh
Huang, Ying June 8, 2022, 6:10 a.m. UTC | #9
On Wed, 2022-06-08 at 10:07 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 12:13 AM, Tim Chen wrote:
> ...
> 
> > > 
> > > +
> > > +static void memory_tier_device_release(struct device *dev)
> > > +{
> > > +	struct memory_tier *tier = to_memory_tier(dev);
> > > +
> > 
> > Do we need some ref counts on memory_tier?
> > If there is another device still using the same memtier,
> > free below could cause problem.
> > 
> > > +	kfree(tier);
> > > +}
> > > +
> > > 
> > ...
> 
> The lifecycle of the memory_tier struct is tied to the sysfs device life 
> time. ie, memory_tier_device_relese get called only after the last 
> reference on that sysfs dev object is released. Hence we can be sure 
> there is no userspace that is keeping one of the memtier related sysfs 
> file open.
> 
> W.r.t other memory device sharing the same memtier, we unregister the
> sysfs device only when the memory tier nodelist is empty. That is no 
> memory device is present in this memory tier.

memory_tier isn't only used by user space.  It is used inside the
kernel too.  If some kernel code gets a pointer to struct memory_tier,
we need to guarantee the pointer will not be freed under us.  And as
Tim pointed out, we need to use it in hot paths (for statistics), so
some kind of RCU locking may be good.

Best Regards,
Huang, Ying
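
For illustration only, the kind of RCU-protected hot-path lookup being
suggested might look like the sketch below (hypothetical, not part of
this patch; writers would additionally need list_add_tail_rcu() /
list_del_rcu() and an RCU-deferred free):

static int node_memtier_rank(int node)
{
	struct memory_tier *memtier;
	int rank = -1;

	rcu_read_lock();
	list_for_each_entry_rcu(memtier, &memory_tiers, list) {
		if (node_isset(node, memtier->nodelist)) {
			rank = memtier->rank;
			break;
		}
	}
	rcu_read_unlock();

	return rank;
}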
Huang, Ying June 8, 2022, 6:18 a.m. UTC | #10
On Wed, 2022-06-08 at 10:28 +0530, Aneesh Kumar K V wrote:
> On 6/8/22 3:02 AM, Yang Shi wrote:
> > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> > > 
> > > In the current kernel, memory tiers are defined implicitly via a
> > > demotion path relationship between NUMA nodes, which is created
> > > during the kernel initialization and updated when a NUMA node is
> > > hot-added or hot-removed.  The current implementation puts all
> > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based
> > > on the distances between nodes.
> > > 
> > > This current memory tier kernel interface needs to be improved for
> > > several important use cases,
> > > 
> > > The current tier initialization code always initializes
> > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > a virtual machine) and should be put into a higher tier.
> > > 
> > > The current tier hierarchy always puts CPU nodes into the top
> > > tier. But on a system with HBM or GPU devices, the
> > > memory-only NUMA nodes mapping these devices should be in the
> > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > next lower tier.
> > > 
> > > With current kernel higher tier node can only be demoted to selected nodes on the
> > > next lower tier as defined by the demotion path, not any other
> > > node from any lower tier.  This strict, hard-coded demotion order
> > > does not work in all use cases (e.g. some use cases may want to
> > > allow cross-socket demotion to another node in the same demotion
> > > tier as a fallback when the preferred demotion node is out of
> > > space), This demotion order is also inconsistent with the page
> > > allocation fallback order when all the nodes in a higher tier are
> > > out of space: The page allocation can fall back to any node from
> > > any lower tier, whereas the demotion order doesn't allow that.
> > > 
> > > The current kernel also don't provide any interfaces for the
> > > userspace to learn about the memory tier hierarchy in order to
> > > optimize its memory allocations.
> > > 
> > > This patch series address the above by defining memory tiers explicitly.
> > > 
> > > This patch introduce explicity memory tiers with ranks. The rank
> > > value of a memory tier is used to derive the demotion order between
> > > NUMA nodes. The memory tiers present in a system can be found at
> > > 
> > > /sys/devices/system/memtier/memtierN/
> > > 
> > > The nodes which are part of a specific memory tier can be listed
> > > via
> > > /sys/devices/system/memtier/memtierN/nodelist
> > > 
> > > "Rank" is an opaque value. Its absolute value doesn't have any
> > > special meaning. But the rank values of different memtiers can be
> > > compared with each other to determine the memory tier order.
> > > 
> > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
> > > their rank values are 300, 200, 100, then the memory tier order is:
> > > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
> > > and memtier1 is the lowest tier.
> > > 
> > > The rank value of each memtier should be unique.
> > > 
> > > A higher rank memory tier will appear first in the demotion order
> > > than a lower rank memory tier. ie. while reclaim we choose a node
> > > in higher rank memory tier to demote pages to as compared to a node
> > > in a lower rank memory tier.
> > > 
> > > For now we are not adding the dynamic number of memory tiers.
> > > But a future series supporting that is possible. Currently
> > > number of tiers supported is limitted to MAX_MEMORY_TIERS(3).
> > > When doing memory hotplug, if not added to a memory tier, the NUMA
> > > node gets added to DEFAULT_MEMORY_TIER(1).
> > > 
> > > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
> > > 
> > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> > > 
> > > Suggested-by: Wei Xu <weixugc@google.com>
> > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > ---
> > >   include/linux/memory-tiers.h |  20 ++++
> > >   mm/Kconfig                   |  11 ++
> > >   mm/Makefile                  |   1 +
> > >   mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
> > >   4 files changed, 220 insertions(+)
> > >   create mode 100644 include/linux/memory-tiers.h
> > >   create mode 100644 mm/memory-tiers.c
> > > 
> > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > > new file mode 100644
> > > index 000000000000..e17f6b4ee177
> > > --- /dev/null
> > > +++ b/include/linux/memory-tiers.h
> > > @@ -0,0 +1,20 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > +#define _LINUX_MEMORY_TIERS_H
> > > +
> > > +#ifdef CONFIG_TIERED_MEMORY
> > > +
> > > +#define MEMORY_TIER_HBM_GPU    0
> > > +#define MEMORY_TIER_DRAM       1
> > > +#define MEMORY_TIER_PMEM       2
> > > +
> > > +#define MEMORY_RANK_HBM_GPU    300
> > > +#define MEMORY_RANK_DRAM       200
> > > +#define MEMORY_RANK_PMEM       100
> > > +
> > > +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> > > +#define MAX_MEMORY_TIERS  3
> > > +
> > > +#endif /* CONFIG_TIERED_MEMORY */
> > > +
> > > +#endif
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 169e64192e48..08a3d330740b 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> > >   config ARCH_ENABLE_THP_MIGRATION
> > >          bool
> > > 
> > > +config TIERED_MEMORY
> > > +       bool "Support for explicit memory tiers"
> > > +       def_bool n
> > > +       depends on MIGRATION && NUMA
> > > +       help
> > > +         Support to split nodes into memory tiers explicitly and
> > > +         to demote pages on reclaim to lower tiers. This option
> > > +         also exposes sysfs interface to read nodes available in
> > > +         specific tier and to move specific node among different
> > > +         possible tiers.
> > 
> > IMHO we should not need a new kernel config. If tiering is not present
> > then there is just one tier on the system. And tiering is a kind of
> > hardware configuration, the information could be shown regardless of
> > whether demotion/promotion is supported/enabled or not.
> > 
> 
> This was added so that we could avoid doing multiple
> 
> #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
> 
> Initially I had that as def_bool y and depends on MIGRATION && NUMA. But 
> it was later suggested that def_bool is not recommended for newer config.
> 
> How about
> 
>   config TIERED_MEMORY
>   	bool "Support for explicit memory tiers"

Don't we need to remove this line too to make it invisible to users?

Best Regards,
Huang, Ying

> -	def_bool n
> -	depends on MIGRATION && NUMA
> -	help
> -	  Support to split nodes into memory tiers explicitly and
> -	  to demote pages on reclaim to lower tiers. This option
> -	  also exposes sysfs interface to read nodes available in
> -	  specific tier and to move specific node among different
> -	  possible tiers.
> +	def_bool MIGRATION && NUMA
> 
>   config HUGETLB_PAGE_SIZE_VARIABLE
>   	def_bool n
> 
> ie, we just make it a Kconfig variable without exposing it to the user?
> 
> -aneesh
Aneesh Kumar K.V June 8, 2022, 8:04 a.m. UTC | #11
On 6/8/22 11:40 AM, Ying Huang wrote:
> On Wed, 2022-06-08 at 10:07 +0530, Aneesh Kumar K V wrote:
>> On 6/8/22 12:13 AM, Tim Chen wrote:
>> ...
>>
>>>>
>>>> +
>>>> +static void memory_tier_device_release(struct device *dev)
>>>> +{
>>>> +	struct memory_tier *tier = to_memory_tier(dev);
>>>> +
>>>
>>> Do we need some ref counts on memory_tier?
>>> If there is another device still using the same memtier,
>>> free below could cause problem.
>>>
>>>> +	kfree(tier);
>>>> +}
>>>> +
>>>>
>>> ...
>>
>> The lifecycle of the memory_tier struct is tied to the sysfs device life
>> time. ie, memory_tier_device_relese get called only after the last
>> reference on that sysfs dev object is released. Hence we can be sure
>> there is no userspace that is keeping one of the memtier related sysfs
>> file open.
>>
>> W.r.t other memory device sharing the same memtier, we unregister the
>> sysfs device only when the memory tier nodelist is empty. That is no
>> memory device is present in this memory tier.
> 
> memory_tier isn't only used by user space.  It is used inside kernel
> too.  If some kernel code get a pointer to struct memory_tier, we need
> to guarantee the pointer will not be freed under us. 

As mentioned above, the current patchset avoids doing that.

> And as Tim pointed
> out, we need to use it in hot path (for statistics), so some kind of rcu
> lock may be good.
> 

Sure, when that statistics code gets added, we can add the relevant kref 
and locking details.

-aneesh
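
[A rough, illustrative sketch of the kind of reference counting discussed
above, for when in-kernel users start caching struct memory_tier pointers.
The kref field and the memtier_get()/memtier_put() helpers are hypothetical
and not part of this series:]

  struct memory_tier {
  	struct list_head list;
  	struct device dev;
  	nodemask_t nodelist;
  	int rank;
  	struct kref refcount;	/* hypothetical new field, kref_init() at alloc */
  };

  static void memory_tier_free(struct kref *kref)
  {
  	kfree(container_of(kref, struct memory_tier, refcount));
  }

  /* In-kernel users pin the tier while they hold a pointer to it... */
  static inline void memtier_get(struct memory_tier *memtier)
  {
  	kref_get(&memtier->refcount);
  }

  /*
   * ...and the last put frees it; memory_tier_device_release() would then
   * call memtier_put() instead of kfree().
   */
  static inline void memtier_put(struct memory_tier *memtier)
  {
  	kref_put(&memtier->refcount, memory_tier_free);
  }

[This assumes <linux/kref.h>; RCU protection for lockless readers on the
reclaim hot path would come on top of this.]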
Johannes Weiner June 8, 2022, 2:11 p.m. UTC | #12
Hi Aneesh,

On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
> @@ -0,0 +1,20 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _LINUX_MEMORY_TIERS_H
> +#define _LINUX_MEMORY_TIERS_H
> +
> +#ifdef CONFIG_TIERED_MEMORY
> +
> +#define MEMORY_TIER_HBM_GPU	0
> +#define MEMORY_TIER_DRAM	1
> +#define MEMORY_TIER_PMEM	2
> +
> +#define MEMORY_RANK_HBM_GPU	300
> +#define MEMORY_RANK_DRAM	200
> +#define MEMORY_RANK_PMEM	100
> +
> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> +#define MAX_MEMORY_TIERS  3

I understand the names are somewhat arbitrary, and the tier ID space
can be expanded down the line by bumping MAX_MEMORY_TIERS.

But starting out with a packed ID space can get quite awkward for
users when new tiers - especially intermediate tiers - show up in
existing configurations. I mentioned in the other email that DRAM !=
DRAM, so new tiers seem inevitable already.

It could make sense to start with a bigger address space and spread
out the list of kernel default tiers a bit within it:

MEMORY_TIER_GPU		0
MEMORY_TIER_DRAM	10
MEMORY_TIER_PMEM	20

etc.
Aneesh Kumar K.V June 8, 2022, 2:21 p.m. UTC | #13
On 6/8/22 7:41 PM, Johannes Weiner wrote:
> Hi Aneesh,
> 
> On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
>> @@ -0,0 +1,20 @@
>> +/* SPDX-License-Identifier: GPL-2.0 */
>> +#ifndef _LINUX_MEMORY_TIERS_H
>> +#define _LINUX_MEMORY_TIERS_H
>> +
>> +#ifdef CONFIG_TIERED_MEMORY
>> +
>> +#define MEMORY_TIER_HBM_GPU	0
>> +#define MEMORY_TIER_DRAM	1
>> +#define MEMORY_TIER_PMEM	2
>> +
>> +#define MEMORY_RANK_HBM_GPU	300
>> +#define MEMORY_RANK_DRAM	200
>> +#define MEMORY_RANK_PMEM	100
>> +
>> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
>> +#define MAX_MEMORY_TIERS  3
> 
> I understand the names are somewhat arbitrary, and the tier ID space
> can be expanded down the line by bumping MAX_MEMORY_TIERS.
> 
> But starting out with a packed ID space can get quite awkward for
> users when new tiers - especially intermediate tiers - show up in
> existing configurations. I mentioned in the other email that DRAM !=
> DRAM, so new tiers seem inevitable already.
> 
> It could make sense to start with a bigger address space and spread
> out the list of kernel default tiers a bit within it:
> 
> MEMORY_TIER_GPU		0
> MEMORY_TIER_DRAM	10
> MEMORY_TIER_PMEM	20
> 

The tier index / tier id / tier device id doesn't have any special 
meaning. What is used to derive the demotion order is the memory tier rank, 
and the rank values are already spread out (300, 200, 100).

-aneesh
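
[To make the rank-based ordering concrete: insert_memory_tier() in this patch
keeps the memory_tiers list sorted by descending rank, so a demotion-target
lookup only has to walk to the next list entry. next_demotion_tier() below is
a hypothetical helper, not something this patch adds:]

  /* Caller is assumed to hold memory_tier_lock. */
  static struct memory_tier *next_demotion_tier(struct memory_tier *memtier)
  {
  	/* The list is ordered highest rank first. */
  	if (list_is_last(&memtier->list, &memory_tiers))
  		return NULL;	/* lowest-rank tier: nothing to demote to */

  	return list_next_entry(memtier, list);
  }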
Johannes Weiner June 8, 2022, 3:55 p.m. UTC | #14
Hello,

On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
> On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
> > @@ -0,0 +1,20 @@
> > +/* SPDX-License-Identifier: GPL-2.0 */
> > +#ifndef _LINUX_MEMORY_TIERS_H
> > +#define _LINUX_MEMORY_TIERS_H
> > +
> > +#ifdef CONFIG_TIERED_MEMORY
> > +
> > +#define MEMORY_TIER_HBM_GPU	0
> > +#define MEMORY_TIER_DRAM	1
> > +#define MEMORY_TIER_PMEM	2
> > +
> > +#define MEMORY_RANK_HBM_GPU	300
> > +#define MEMORY_RANK_DRAM	200
> > +#define MEMORY_RANK_PMEM	100
> > +
> > +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> > +#define MAX_MEMORY_TIERS  3
> 
> I understand the names are somewhat arbitrary, and the tier ID space
> can be expanded down the line by bumping MAX_MEMORY_TIERS.
> 
> But starting out with a packed ID space can get quite awkward for
> users when new tiers - especially intermediate tiers - show up in
> existing configurations. I mentioned in the other email that DRAM !=
> DRAM, so new tiers seem inevitable already.
> 
> It could make sense to start with a bigger address space and spread
> out the list of kernel default tiers a bit within it:
> 
> MEMORY_TIER_GPU		0
> MEMORY_TIER_DRAM	10
> MEMORY_TIER_PMEM	20

Forgive me if I'm asking a question that has been answered. I went
back to earlier threads and couldn't work it out - maybe there were
some off-list discussions? Anyway...

Why is there a distinction between tier ID and rank? I understand that
rank was added because tier IDs were too few. But if rank determines
ordering, what is the use of a separate tier ID? IOW, why not make the
tier ID space wider and have the kernel pick a few spread out defaults
based on known hardware, with plenty of headroom to be future proof.

  $ ls tiers
  100				# DEFAULT_TIER
  $ cat tiers/100/nodelist
  0-1				# conventional numa nodes

  <pmem is onlined>

  $ grep . tiers/*/nodelist
  tiers/100/nodelist:0-1	# conventional numa
  tiers/200/nodelist:2		# pmem

  $ grep . nodes/*/tier
  nodes/0/tier:100
  nodes/1/tier:100
  nodes/2/tier:200

  <unknown device is online as node 3, defaults to 100>

  $ grep . tiers/*/nodelist
  tiers/100/nodelist:0-1,3
  tiers/200/nodelist:2

  $ echo 300 >nodes/3/tier
  $ grep . tiers/*/nodelist
  tiers/100/nodelist:0-1
  tiers/200/nodelist:2
  tiers/300/nodelist:3

  $ echo 200 >nodes/3/tier
  $ grep . tiers/*/nodelist
  tiers/100/nodelist:0-1	
  tiers/200/nodelist:2-3

etc.
Aneesh Kumar K.V June 8, 2022, 4:13 p.m. UTC | #15
On 6/8/22 9:25 PM, Johannes Weiner wrote:
> Hello,
> 
> On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
>> On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
>>> @@ -0,0 +1,20 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>> +#ifndef _LINUX_MEMORY_TIERS_H
>>> +#define _LINUX_MEMORY_TIERS_H
>>> +
>>> +#ifdef CONFIG_TIERED_MEMORY
>>> +
>>> +#define MEMORY_TIER_HBM_GPU	0
>>> +#define MEMORY_TIER_DRAM	1
>>> +#define MEMORY_TIER_PMEM	2
>>> +
>>> +#define MEMORY_RANK_HBM_GPU	300
>>> +#define MEMORY_RANK_DRAM	200
>>> +#define MEMORY_RANK_PMEM	100
>>> +
>>> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
>>> +#define MAX_MEMORY_TIERS  3
>>
>> I understand the names are somewhat arbitrary, and the tier ID space
>> can be expanded down the line by bumping MAX_MEMORY_TIERS.
>>
>> But starting out with a packed ID space can get quite awkward for
>> users when new tiers - especially intermediate tiers - show up in
>> existing configurations. I mentioned in the other email that DRAM !=
>> DRAM, so new tiers seem inevitable already.
>>
>> It could make sense to start with a bigger address space and spread
>> out the list of kernel default tiers a bit within it:
>>
>> MEMORY_TIER_GPU		0
>> MEMORY_TIER_DRAM	10
>> MEMORY_TIER_PMEM	20
> 
> Forgive me if I'm asking a question that has been answered. I went
> back to earlier threads and couldn't work it out - maybe there were
> some off-list discussions? Anyway...
> 
> Why is there a distinction between tier ID and rank? I undestand that
> rank was added because tier IDs were too few. But if rank determines
> ordering, what is the use of a separate tier ID? IOW, why not make the
> tier ID space wider and have the kernel pick a few spread out defaults
> based on known hardware, with plenty of headroom to be future proof.
> 
>    $ ls tiers
>    100				# DEFAULT_TIER
>    $ cat tiers/100/nodelist
>    0-1				# conventional numa nodes
> 
>    <pmem is onlined>
> 
>    $ grep . tiers/*/nodelist
>    tiers/100/nodelist:0-1	# conventional numa
>    tiers/200/nodelist:2		# pmem
> 
>    $ grep . nodes/*/tier
>    nodes/0/tier:100
>    nodes/1/tier:100
>    nodes/2/tier:200
> 
>    <unknown device is online as node 3, defaults to 100>
> 
>    $ grep . tiers/*/nodelist
>    tiers/100/nodelist:0-1,3
>    tiers/200/nodelist:2
> 
>    $ echo 300 >nodes/3/tier
>    $ grep . tiers/*/nodelist
>    tiers/100/nodelist:0-1
>    tiers/200/nodelist:2
>    tiers/300/nodelist:3
> 
>    $ echo 200 >nodes/3/tier
>    $ grep . tiers/*/nodelist
>    tiers/100/nodelist:0-1	
>    tiers/200/nodelist:2-3
> 
> etc.

The tier ID is also used as the device id (memtier.dev.id). It was discussed 
that we would need the ability to change the rank value of a memory tier. If 
we make the rank value the same as the tier ID / tier device id, we will not 
be able to support that.

-aneesh
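
[A rough sketch of what that decoupling buys: because rank is a separate
field, a later patch could make it writable while the memtier device id stays
stable. rank_store() below is hypothetical; this patch exposes rank as
read-only:]

  static ssize_t rank_store(struct device *dev, struct device_attribute *attr,
  			  const char *buf, size_t count)
  {
  	struct memory_tier *memtier = to_memory_tier(dev);
  	int rank, ret;

  	ret = kstrtoint(buf, 10, &rank);
  	if (ret)
  		return ret;

  	mutex_lock(&memory_tier_lock);
  	/* A real version would also reject a rank that is already in use. */
  	list_del(&memtier->list);
  	memtier->rank = rank;
  	insert_memory_tier(memtier);	/* re-sort under the new rank */
  	mutex_unlock(&memory_tier_lock);

  	return count;
  }
  static DEVICE_ATTR_RW(rank);	/* would replace DEVICE_ATTR_RO(rank) */

[With tier ID tied to rank, the same operation would instead have to
unregister memtierN and register a new device, breaking any sysfs path names
userspace already refers to.]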
Yang Shi June 8, 2022, 4:37 p.m. UTC | #16
On Tue, Jun 7, 2022 at 6:34 PM Ying Huang <ying.huang@intel.com> wrote:
>
> On Tue, 2022-06-07 at 14:32 -0700, Yang Shi wrote:
> > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> > >
> > > In the current kernel, memory tiers are defined implicitly via a
> > > demotion path relationship between NUMA nodes, which is created
> > > during the kernel initialization and updated when a NUMA node is
> > > hot-added or hot-removed.  The current implementation puts all
> > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > tier-by-tier by establishing the per-node demotion targets based
> > > on the distances between nodes.
> > >
> > > This current memory tier kernel interface needs to be improved for
> > > several important use cases,
> > >
> > > The current tier initialization code always initializes
> > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > a virtual machine) and should be put into a higher tier.
> > >
> > > The current tier hierarchy always puts CPU nodes into the top
> > > tier. But on a system with HBM or GPU devices, the
> > > memory-only NUMA nodes mapping these devices should be in the
> > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > next lower tier.
> > >
> > > With current kernel higher tier node can only be demoted to selected nodes on the
> > > next lower tier as defined by the demotion path, not any other
> > > node from any lower tier.  This strict, hard-coded demotion order
> > > does not work in all use cases (e.g. some use cases may want to
> > > allow cross-socket demotion to another node in the same demotion
> > > tier as a fallback when the preferred demotion node is out of
> > > space), This demotion order is also inconsistent with the page
> > > allocation fallback order when all the nodes in a higher tier are
> > > out of space: The page allocation can fall back to any node from
> > > any lower tier, whereas the demotion order doesn't allow that.
> > >
> > > The current kernel also don't provide any interfaces for the
> > > userspace to learn about the memory tier hierarchy in order to
> > > optimize its memory allocations.
> > >
> > > This patch series address the above by defining memory tiers explicitly.
> > >
> > > This patch introduce explicity memory tiers with ranks. The rank
> > > value of a memory tier is used to derive the demotion order between
> > > NUMA nodes. The memory tiers present in a system can be found at
> > >
> > > /sys/devices/system/memtier/memtierN/
> > >
> > > The nodes which are part of a specific memory tier can be listed
> > > via
> > > /sys/devices/system/memtier/memtierN/nodelist
> > >
> > > "Rank" is an opaque value. Its absolute value doesn't have any
> > > special meaning. But the rank values of different memtiers can be
> > > compared with each other to determine the memory tier order.
> > >
> > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
> > > their rank values are 300, 200, 100, then the memory tier order is:
> > > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
> > > and memtier1 is the lowest tier.
> > >
> > > The rank value of each memtier should be unique.
> > >
> > > A higher rank memory tier will appear first in the demotion order
> > > than a lower rank memory tier. ie. while reclaim we choose a node
> > > in higher rank memory tier to demote pages to as compared to a node
> > > in a lower rank memory tier.
> > >
> > > For now we are not adding the dynamic number of memory tiers.
> > > But a future series supporting that is possible. Currently
> > > number of tiers supported is limitted to MAX_MEMORY_TIERS(3).
> > > When doing memory hotplug, if not added to a memory tier, the NUMA
> > > node gets added to DEFAULT_MEMORY_TIER(1).
> > >
> > > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
> > >
> > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> > >
> > > Suggested-by: Wei Xu <weixugc@google.com>
> > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > ---
> > >  include/linux/memory-tiers.h |  20 ++++
> > >  mm/Kconfig                   |  11 ++
> > >  mm/Makefile                  |   1 +
> > >  mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
> > >  4 files changed, 220 insertions(+)
> > >  create mode 100644 include/linux/memory-tiers.h
> > >  create mode 100644 mm/memory-tiers.c
> > >
> > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > > new file mode 100644
> > > index 000000000000..e17f6b4ee177
> > > --- /dev/null
> > > +++ b/include/linux/memory-tiers.h
> > > @@ -0,0 +1,20 @@
> > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > +#define _LINUX_MEMORY_TIERS_H
> > > +
> > > +#ifdef CONFIG_TIERED_MEMORY
> > > +
> > > +#define MEMORY_TIER_HBM_GPU    0
> > > +#define MEMORY_TIER_DRAM       1
> > > +#define MEMORY_TIER_PMEM       2
> > > +
> > > +#define MEMORY_RANK_HBM_GPU    300
> > > +#define MEMORY_RANK_DRAM       200
> > > +#define MEMORY_RANK_PMEM       100
> > > +
> > > +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> > > +#define MAX_MEMORY_TIERS  3
> > > +
> > > +#endif /* CONFIG_TIERED_MEMORY */
> > > +
> > > +#endif
> > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > index 169e64192e48..08a3d330740b 100644
> > > --- a/mm/Kconfig
> > > +++ b/mm/Kconfig
> > > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> > >  config ARCH_ENABLE_THP_MIGRATION
> > >         bool
> > >
> > > +config TIERED_MEMORY
> > > +       bool "Support for explicit memory tiers"
> > > +       def_bool n
> > > +       depends on MIGRATION && NUMA
> > > +       help
> > > +         Support to split nodes into memory tiers explicitly and
> > > +         to demote pages on reclaim to lower tiers. This option
> > > +         also exposes sysfs interface to read nodes available in
> > > +         specific tier and to move specific node among different
> > > +         possible tiers.
> >
> > IMHO we should not need a new kernel config. If tiering is not present
> > then there is just one tier on the system. And tiering is a kind of
> > hardware configuration, the information could be shown regardless of
> > whether demotion/promotion is supported/enabled or not.
>
> I think so too.  At least it appears unnecessary to let the user turn
> on/off it at configuration time.
>
> All the code should be enclosed by #if defined(CONFIG_NUMA) &&
> defined(CONIFIG_MIGRATION).  So we will not waste memory in small
> systems.

CONFIG_NUMA alone should be good enough. CONFIG_MIGRATION is enabled
by default if NUMA is enabled, and MIGRATION is only used to support
demotion/promotion. Memory tiers exist even when demotion/promotion
is not supported, right?

>
> Best Regards,
> Huang, Ying
>
> > > +
> > >  config HUGETLB_PAGE_SIZE_VARIABLE
> > >         def_bool n
> > >         help
> > > diff --git a/mm/Makefile b/mm/Makefile
> > > index 6f9ffa968a1a..482557fbc9d1 100644
> > > --- a/mm/Makefile
> > > +++ b/mm/Makefile
> > > @@ -92,6 +92,7 @@ obj-$(CONFIG_KFENCE) += kfence/
> > >  obj-$(CONFIG_FAILSLAB) += failslab.o
> > >  obj-$(CONFIG_MEMTEST)          += memtest.o
> > >  obj-$(CONFIG_MIGRATION) += migrate.o
> > > +obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
> > >  obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
> > >  obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
> > >  obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
> > > diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
> > > new file mode 100644
> > > index 000000000000..7de18d94a08d
> > > --- /dev/null
> > > +++ b/mm/memory-tiers.c
> > > @@ -0,0 +1,188 @@
> > > +// SPDX-License-Identifier: GPL-2.0
> > > +#include <linux/types.h>
> > > +#include <linux/device.h>
> > > +#include <linux/nodemask.h>
> > > +#include <linux/slab.h>
> > > +#include <linux/memory-tiers.h>
> > > +
> > > +struct memory_tier {
> > > +       struct list_head list;
> > > +       struct device dev;
> > > +       nodemask_t nodelist;
> > > +       int rank;
> > > +};
> > > +
> > > +#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
> > > +
> > > +static struct bus_type memory_tier_subsys = {
> > > +       .name = "memtier",
> > > +       .dev_name = "memtier",
> > > +};
> > > +
> > > +static DEFINE_MUTEX(memory_tier_lock);
> > > +static LIST_HEAD(memory_tiers);
> > > +
> > > +
> > > +static ssize_t nodelist_show(struct device *dev,
> > > +                            struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct memory_tier *memtier = to_memory_tier(dev);
> > > +
> > > +       return sysfs_emit(buf, "%*pbl\n",
> > > +                         nodemask_pr_args(&memtier->nodelist));
> > > +}
> > > +static DEVICE_ATTR_RO(nodelist);
> > > +
> > > +static ssize_t rank_show(struct device *dev,
> > > +                        struct device_attribute *attr, char *buf)
> > > +{
> > > +       struct memory_tier *memtier = to_memory_tier(dev);
> > > +
> > > +       return sysfs_emit(buf, "%d\n", memtier->rank);
> > > +}
> > > +static DEVICE_ATTR_RO(rank);
> > > +
> > > +static struct attribute *memory_tier_dev_attrs[] = {
> > > +       &dev_attr_nodelist.attr,
> > > +       &dev_attr_rank.attr,
> > > +       NULL
> > > +};
> > > +
> > > +static const struct attribute_group memory_tier_dev_group = {
> > > +       .attrs = memory_tier_dev_attrs,
> > > +};
> > > +
> > > +static const struct attribute_group *memory_tier_dev_groups[] = {
> > > +       &memory_tier_dev_group,
> > > +       NULL
> > > +};
> > > +
> > > +static void memory_tier_device_release(struct device *dev)
> > > +{
> > > +       struct memory_tier *tier = to_memory_tier(dev);
> > > +
> > > +       kfree(tier);
> > > +}
> > > +
> > > +/*
> > > + * Keep it simple by having  direct mapping between
> > > + * tier index and rank value.
> > > + */
> > > +static inline int get_rank_from_tier(unsigned int tier)
> > > +{
> > > +       switch (tier) {
> > > +       case MEMORY_TIER_HBM_GPU:
> > > +               return MEMORY_RANK_HBM_GPU;
> > > +       case MEMORY_TIER_DRAM:
> > > +               return MEMORY_RANK_DRAM;
> > > +       case MEMORY_TIER_PMEM:
> > > +               return MEMORY_RANK_PMEM;
> > > +       }
> > > +
> > > +       return 0;
> > > +}
> > > +
> > > +static void insert_memory_tier(struct memory_tier *memtier)
> > > +{
> > > +       struct list_head *ent;
> > > +       struct memory_tier *tmp_memtier;
> > > +
> > > +       list_for_each(ent, &memory_tiers) {
> > > +               tmp_memtier = list_entry(ent, struct memory_tier, list);
> > > +               if (tmp_memtier->rank < memtier->rank) {
> > > +                       list_add_tail(&memtier->list, ent);
> > > +                       return;
> > > +               }
> > > +       }
> > > +       list_add_tail(&memtier->list, &memory_tiers);
> > > +}
> > > +
> > > +static struct memory_tier *register_memory_tier(unsigned int tier)
> > > +{
> > > +       int error;
> > > +       struct memory_tier *memtier;
> > > +
> > > +       if (tier >= MAX_MEMORY_TIERS)
> > > +               return NULL;
> > > +
> > > +       memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
> > > +       if (!memtier)
> > > +               return NULL;
> > > +
> > > +       memtier->dev.id = tier;
> > > +       memtier->rank = get_rank_from_tier(tier);
> > > +       memtier->dev.bus = &memory_tier_subsys;
> > > +       memtier->dev.release = memory_tier_device_release;
> > > +       memtier->dev.groups = memory_tier_dev_groups;
> > > +
> > > +       insert_memory_tier(memtier);
> > > +
> > > +       error = device_register(&memtier->dev);
> > > +       if (error) {
> > > +               list_del(&memtier->list);
> > > +               put_device(&memtier->dev);
> > > +               return NULL;
> > > +       }
> > > +       return memtier;
> > > +}
> > > +
> > > +__maybe_unused // temporay to prevent warnings during bisects
> > > +static void unregister_memory_tier(struct memory_tier *memtier)
> > > +{
> > > +       list_del(&memtier->list);
> > > +       device_unregister(&memtier->dev);
> > > +}
> > > +
> > > +static ssize_t
> > > +max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> > > +{
> > > +       return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
> > > +}
> > > +static DEVICE_ATTR_RO(max_tier);
> > > +
> > > +static ssize_t
> > > +default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
> > > +{
> > > +       return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER);
> > > +}
> > > +static DEVICE_ATTR_RO(default_tier);
> > > +
> > > +static struct attribute *memory_tier_attrs[] = {
> > > +       &dev_attr_max_tier.attr,
> > > +       &dev_attr_default_tier.attr,
> > > +       NULL
> > > +};
> > > +
> > > +static const struct attribute_group memory_tier_attr_group = {
> > > +       .attrs = memory_tier_attrs,
> > > +};
> > > +
> > > +static const struct attribute_group *memory_tier_attr_groups[] = {
> > > +       &memory_tier_attr_group,
> > > +       NULL,
> > > +};
> > > +
> > > +static int __init memory_tier_init(void)
> > > +{
> > > +       int ret;
> > > +       struct memory_tier *memtier;
> > > +
> > > +       ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
> > > +       if (ret)
> > > +               panic("%s() failed to register subsystem: %d\n", __func__, ret);
> > > +
> > > +       /*
> > > +        * Register only default memory tier to hide all empty
> > > +        * memory tier from sysfs.
> > > +        */
> > > +       memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
> > > +       if (!memtier)
> > > +               panic("%s() failed to register memory tier: %d\n", __func__, ret);
> > > +
> > > +       /* CPU only nodes are not part of memory tiers. */
> > > +       memtier->nodelist = node_states[N_MEMORY];
> > > +
> > > +       return 0;
> > > +}
> > > +subsys_initcall(memory_tier_init);
> > > +
> > > --
> > > 2.36.1
> > >
>
>
Yang Shi June 8, 2022, 4:42 p.m. UTC | #17
On Tue, Jun 7, 2022 at 9:58 PM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 6/8/22 3:02 AM, Yang Shi wrote:
> > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> > <aneesh.kumar@linux.ibm.com> wrote:
> >>
> >> In the current kernel, memory tiers are defined implicitly via a
> >> demotion path relationship between NUMA nodes, which is created
> >> during the kernel initialization and updated when a NUMA node is
> >> hot-added or hot-removed.  The current implementation puts all
> >> nodes with CPU into the top tier, and builds the tier hierarchy
> >> tier-by-tier by establishing the per-node demotion targets based
> >> on the distances between nodes.
> >>
> >> This current memory tier kernel interface needs to be improved for
> >> several important use cases,
> >>
> >> The current tier initialization code always initializes
> >> each memory-only NUMA node into a lower tier.  But a memory-only
> >> NUMA node may have a high performance memory device (e.g. a DRAM
> >> device attached via CXL.mem or a DRAM-backed memory-only node on
> >> a virtual machine) and should be put into a higher tier.
> >>
> >> The current tier hierarchy always puts CPU nodes into the top
> >> tier. But on a system with HBM or GPU devices, the
> >> memory-only NUMA nodes mapping these devices should be in the
> >> top tier, and DRAM nodes with CPUs are better to be placed into the
> >> next lower tier.
> >>
> >> With current kernel higher tier node can only be demoted to selected nodes on the
> >> next lower tier as defined by the demotion path, not any other
> >> node from any lower tier.  This strict, hard-coded demotion order
> >> does not work in all use cases (e.g. some use cases may want to
> >> allow cross-socket demotion to another node in the same demotion
> >> tier as a fallback when the preferred demotion node is out of
> >> space), This demotion order is also inconsistent with the page
> >> allocation fallback order when all the nodes in a higher tier are
> >> out of space: The page allocation can fall back to any node from
> >> any lower tier, whereas the demotion order doesn't allow that.
> >>
> >> The current kernel also don't provide any interfaces for the
> >> userspace to learn about the memory tier hierarchy in order to
> >> optimize its memory allocations.
> >>
> >> This patch series address the above by defining memory tiers explicitly.
> >>
> >> This patch introduce explicity memory tiers with ranks. The rank
> >> value of a memory tier is used to derive the demotion order between
> >> NUMA nodes. The memory tiers present in a system can be found at
> >>
> >> /sys/devices/system/memtier/memtierN/
> >>
> >> The nodes which are part of a specific memory tier can be listed
> >> via
> >> /sys/devices/system/memtier/memtierN/nodelist
> >>
> >> "Rank" is an opaque value. Its absolute value doesn't have any
> >> special meaning. But the rank values of different memtiers can be
> >> compared with each other to determine the memory tier order.
> >>
> >> For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
> >> their rank values are 300, 200, 100, then the memory tier order is:
> >> memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
> >> and memtier1 is the lowest tier.
> >>
> >> The rank value of each memtier should be unique.
> >>
> >> A higher rank memory tier will appear first in the demotion order
> >> than a lower rank memory tier. ie. while reclaim we choose a node
> >> in higher rank memory tier to demote pages to as compared to a node
> >> in a lower rank memory tier.
> >>
> >> For now we are not adding the dynamic number of memory tiers.
> >> But a future series supporting that is possible. Currently
> >> number of tiers supported is limitted to MAX_MEMORY_TIERS(3).
> >> When doing memory hotplug, if not added to a memory tier, the NUMA
> >> node gets added to DEFAULT_MEMORY_TIER(1).
> >>
> >> This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
> >>
> >> [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> >>
> >> Suggested-by: Wei Xu <weixugc@google.com>
> >> Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> >> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> >> ---
> >>   include/linux/memory-tiers.h |  20 ++++
> >>   mm/Kconfig                   |  11 ++
> >>   mm/Makefile                  |   1 +
> >>   mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
> >>   4 files changed, 220 insertions(+)
> >>   create mode 100644 include/linux/memory-tiers.h
> >>   create mode 100644 mm/memory-tiers.c
> >>
> >> diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> >> new file mode 100644
> >> index 000000000000..e17f6b4ee177
> >> --- /dev/null
> >> +++ b/include/linux/memory-tiers.h
> >> @@ -0,0 +1,20 @@
> >> +/* SPDX-License-Identifier: GPL-2.0 */
> >> +#ifndef _LINUX_MEMORY_TIERS_H
> >> +#define _LINUX_MEMORY_TIERS_H
> >> +
> >> +#ifdef CONFIG_TIERED_MEMORY
> >> +
> >> +#define MEMORY_TIER_HBM_GPU    0
> >> +#define MEMORY_TIER_DRAM       1
> >> +#define MEMORY_TIER_PMEM       2
> >> +
> >> +#define MEMORY_RANK_HBM_GPU    300
> >> +#define MEMORY_RANK_DRAM       200
> >> +#define MEMORY_RANK_PMEM       100
> >> +
> >> +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> >> +#define MAX_MEMORY_TIERS  3
> >> +
> >> +#endif /* CONFIG_TIERED_MEMORY */
> >> +
> >> +#endif
> >> diff --git a/mm/Kconfig b/mm/Kconfig
> >> index 169e64192e48..08a3d330740b 100644
> >> --- a/mm/Kconfig
> >> +++ b/mm/Kconfig
> >> @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> >>   config ARCH_ENABLE_THP_MIGRATION
> >>          bool
> >>
> >> +config TIERED_MEMORY
> >> +       bool "Support for explicit memory tiers"
> >> +       def_bool n
> >> +       depends on MIGRATION && NUMA
> >> +       help
> >> +         Support to split nodes into memory tiers explicitly and
> >> +         to demote pages on reclaim to lower tiers. This option
> >> +         also exposes sysfs interface to read nodes available in
> >> +         specific tier and to move specific node among different
> >> +         possible tiers.
> >
> > IMHO we should not need a new kernel config. If tiering is not present
> > then there is just one tier on the system. And tiering is a kind of
> > hardware configuration, the information could be shown regardless of
> > whether demotion/promotion is supported/enabled or not.
> >
>
> This was added so that we could avoid doing multiple
>
> #if defined(CONFIG_MIGRATION) && defined(CONFIG_NUMA)
>
> Initially I had that as def_bool y and depends on MIGRATION && NUMA. But
> it was later suggested that def_bool is not recommended for newer config.
>
> How about
>
>   config TIERED_MEMORY
>         bool "Support for explicit memory tiers"
> -       def_bool n
> -       depends on MIGRATION && NUMA
> -       help
> -         Support to split nodes into memory tiers explicitly and
> -         to demote pages on reclaim to lower tiers. This option
> -         also exposes sysfs interface to read nodes available in
> -         specific tier and to move specific node among different
> -         possible tiers.
> +       def_bool MIGRATION && NUMA

CONFIG_NUMA should be good enough. Memory tiering doesn't have to mean
that demotion/promotion is supported, IMHO.

>
>   config HUGETLB_PAGE_SIZE_VARIABLE
>         def_bool n
>
> ie, we just make it a Kconfig variable without exposing it to the user?
>
> -aneesh
Johannes Weiner June 8, 2022, 6:16 p.m. UTC | #18
On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote:
> On 6/8/22 9:25 PM, Johannes Weiner wrote:
> > Hello,
> > 
> > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
> > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
> > > > @@ -0,0 +1,20 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > > +#define _LINUX_MEMORY_TIERS_H
> > > > +
> > > > +#ifdef CONFIG_TIERED_MEMORY
> > > > +
> > > > +#define MEMORY_TIER_HBM_GPU	0
> > > > +#define MEMORY_TIER_DRAM	1
> > > > +#define MEMORY_TIER_PMEM	2
> > > > +
> > > > +#define MEMORY_RANK_HBM_GPU	300
> > > > +#define MEMORY_RANK_DRAM	200
> > > > +#define MEMORY_RANK_PMEM	100
> > > > +
> > > > +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> > > > +#define MAX_MEMORY_TIERS  3
> > > 
> > > I understand the names are somewhat arbitrary, and the tier ID space
> > > can be expanded down the line by bumping MAX_MEMORY_TIERS.
> > > 
> > > But starting out with a packed ID space can get quite awkward for
> > > users when new tiers - especially intermediate tiers - show up in
> > > existing configurations. I mentioned in the other email that DRAM !=
> > > DRAM, so new tiers seem inevitable already.
> > > 
> > > It could make sense to start with a bigger address space and spread
> > > out the list of kernel default tiers a bit within it:
> > > 
> > > MEMORY_TIER_GPU		0
> > > MEMORY_TIER_DRAM	10
> > > MEMORY_TIER_PMEM	20
> > 
> > Forgive me if I'm asking a question that has been answered. I went
> > back to earlier threads and couldn't work it out - maybe there were
> > some off-list discussions? Anyway...
> > 
> > Why is there a distinction between tier ID and rank? I undestand that
> > rank was added because tier IDs were too few. But if rank determines
> > ordering, what is the use of a separate tier ID? IOW, why not make the
> > tier ID space wider and have the kernel pick a few spread out defaults
> > based on known hardware, with plenty of headroom to be future proof.
> > 
> >    $ ls tiers
> >    100				# DEFAULT_TIER
> >    $ cat tiers/100/nodelist
> >    0-1				# conventional numa nodes
> > 
> >    <pmem is onlined>
> > 
> >    $ grep . tiers/*/nodelist
> >    tiers/100/nodelist:0-1	# conventional numa
> >    tiers/200/nodelist:2		# pmem
> > 
> >    $ grep . nodes/*/tier
> >    nodes/0/tier:100
> >    nodes/1/tier:100
> >    nodes/2/tier:200
> > 
> >    <unknown device is online as node 3, defaults to 100>
> > 
> >    $ grep . tiers/*/nodelist
> >    tiers/100/nodelist:0-1,3
> >    tiers/200/nodelist:2
> > 
> >    $ echo 300 >nodes/3/tier
> >    $ grep . tiers/*/nodelist
> >    tiers/100/nodelist:0-1
> >    tiers/200/nodelist:2
> >    tiers/300/nodelist:3
> > 
> >    $ echo 200 >nodes/3/tier
> >    $ grep . tiers/*/nodelist
> >    tiers/100/nodelist:0-1	
> >    tiers/200/nodelist:2-3
> > 
> > etc.
> 
> tier ID is also used as device id memtier.dev.id. It was discussed that we
> would need the ability to change the rank value of a memory tier. If we make
> rank value same as tier ID or tier device id, we will not be able to support
> that.

Is the idea that you could change the rank of a collection of nodes in
one go? Rather than moving the nodes one by one into a new tier?

[ Sorry, I wasn't able to find this discussion. AFAICS the first
  patches in RFC4 already had the struct device { .id = tier }
  logic. Could you point me to it? In general it would be really
  helpful to maintain summarized rationales for such decisions in the
  cover letter to make sure things don't get lost over many, many
  threads, conferences, and video calls. ]
Aneesh Kumar K.V June 9, 2022, 2:33 a.m. UTC | #19
On 6/8/22 11:46 PM, Johannes Weiner wrote:
> On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote:
>> On 6/8/22 9:25 PM, Johannes Weiner wrote:
>>> Hello,
>>>
>>> On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
>>>> On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
>>>>> @@ -0,0 +1,20 @@
>>>>> +/* SPDX-License-Identifier: GPL-2.0 */
>>>>> +#ifndef _LINUX_MEMORY_TIERS_H
>>>>> +#define _LINUX_MEMORY_TIERS_H
>>>>> +
>>>>> +#ifdef CONFIG_TIERED_MEMORY
>>>>> +
>>>>> +#define MEMORY_TIER_HBM_GPU	0
>>>>> +#define MEMORY_TIER_DRAM	1
>>>>> +#define MEMORY_TIER_PMEM	2
>>>>> +
>>>>> +#define MEMORY_RANK_HBM_GPU	300
>>>>> +#define MEMORY_RANK_DRAM	200
>>>>> +#define MEMORY_RANK_PMEM	100
>>>>> +
>>>>> +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
>>>>> +#define MAX_MEMORY_TIERS  3
>>>>
>>>> I understand the names are somewhat arbitrary, and the tier ID space
>>>> can be expanded down the line by bumping MAX_MEMORY_TIERS.
>>>>
>>>> But starting out with a packed ID space can get quite awkward for
>>>> users when new tiers - especially intermediate tiers - show up in
>>>> existing configurations. I mentioned in the other email that DRAM !=
>>>> DRAM, so new tiers seem inevitable already.
>>>>
>>>> It could make sense to start with a bigger address space and spread
>>>> out the list of kernel default tiers a bit within it:
>>>>
>>>> MEMORY_TIER_GPU		0
>>>> MEMORY_TIER_DRAM	10
>>>> MEMORY_TIER_PMEM	20
>>>
>>> Forgive me if I'm asking a question that has been answered. I went
>>> back to earlier threads and couldn't work it out - maybe there were
>>> some off-list discussions? Anyway...
>>>
>>> Why is there a distinction between tier ID and rank? I undestand that
>>> rank was added because tier IDs were too few. But if rank determines
>>> ordering, what is the use of a separate tier ID? IOW, why not make the
>>> tier ID space wider and have the kernel pick a few spread out defaults
>>> based on known hardware, with plenty of headroom to be future proof.
>>>
>>>     $ ls tiers
>>>     100				# DEFAULT_TIER
>>>     $ cat tiers/100/nodelist
>>>     0-1				# conventional numa nodes
>>>
>>>     <pmem is onlined>
>>>
>>>     $ grep . tiers/*/nodelist
>>>     tiers/100/nodelist:0-1	# conventional numa
>>>     tiers/200/nodelist:2		# pmem
>>>
>>>     $ grep . nodes/*/tier
>>>     nodes/0/tier:100
>>>     nodes/1/tier:100
>>>     nodes/2/tier:200
>>>
>>>     <unknown device is online as node 3, defaults to 100>
>>>
>>>     $ grep . tiers/*/nodelist
>>>     tiers/100/nodelist:0-1,3
>>>     tiers/200/nodelist:2
>>>
>>>     $ echo 300 >nodes/3/tier
>>>     $ grep . tiers/*/nodelist
>>>     tiers/100/nodelist:0-1
>>>     tiers/200/nodelist:2
>>>     tiers/300/nodelist:3
>>>
>>>     $ echo 200 >nodes/3/tier
>>>     $ grep . tiers/*/nodelist
>>>     tiers/100/nodelist:0-1	
>>>     tiers/200/nodelist:2-3
>>>
>>> etc.
>>
>> tier ID is also used as device id memtier.dev.id. It was discussed that we
>> would need the ability to change the rank value of a memory tier. If we make
>> rank value same as tier ID or tier device id, we will not be able to support
>> that.
> 
> Is the idea that you could change the rank of a collection of nodes in
> one go? Rather than moving the nodes one by one into a new tier?
> 
> [ Sorry, I wasn't able to find this discussion. AFAICS the first
>    patches in RFC4 already had the struct device { .id = tier }
>    logic. Could you point me to it? In general it would be really
>    helpful to maintain summarized rationales for such decisions in the
>    coverletter to make sure things don't get lost over many, many
>    threads, conferences, and video calls. ]

Most of the discussion did not happen in the patch review email threads; see:

RFC: Memory Tiering Kernel Interfaces (v2)
https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com

RFC: Memory Tiering Kernel Interfaces (v4)
https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com

-aneesh
Huang, Ying June 9, 2022, 6:52 a.m. UTC | #20
On Wed, 2022-06-08 at 09:37 -0700, Yang Shi wrote:
> On Tue, Jun 7, 2022 at 6:34 PM Ying Huang <ying.huang@intel.com> wrote:
> > 
> > On Tue, 2022-06-07 at 14:32 -0700, Yang Shi wrote:
> > > On Fri, Jun 3, 2022 at 6:43 AM Aneesh Kumar K.V
> > > <aneesh.kumar@linux.ibm.com> wrote:
> > > > 
> > > > In the current kernel, memory tiers are defined implicitly via a
> > > > demotion path relationship between NUMA nodes, which is created
> > > > during the kernel initialization and updated when a NUMA node is
> > > > hot-added or hot-removed.  The current implementation puts all
> > > > nodes with CPU into the top tier, and builds the tier hierarchy
> > > > tier-by-tier by establishing the per-node demotion targets based
> > > > on the distances between nodes.
> > > > 
> > > > This current memory tier kernel interface needs to be improved for
> > > > several important use cases,
> > > > 
> > > > The current tier initialization code always initializes
> > > > each memory-only NUMA node into a lower tier.  But a memory-only
> > > > NUMA node may have a high performance memory device (e.g. a DRAM
> > > > device attached via CXL.mem or a DRAM-backed memory-only node on
> > > > a virtual machine) and should be put into a higher tier.
> > > > 
> > > > The current tier hierarchy always puts CPU nodes into the top
> > > > tier. But on a system with HBM or GPU devices, the
> > > > memory-only NUMA nodes mapping these devices should be in the
> > > > top tier, and DRAM nodes with CPUs are better to be placed into the
> > > > next lower tier.
> > > > 
> > > > With current kernel higher tier node can only be demoted to selected nodes on the
> > > > next lower tier as defined by the demotion path, not any other
> > > > node from any lower tier.  This strict, hard-coded demotion order
> > > > does not work in all use cases (e.g. some use cases may want to
> > > > allow cross-socket demotion to another node in the same demotion
> > > > tier as a fallback when the preferred demotion node is out of
> > > > space), This demotion order is also inconsistent with the page
> > > > allocation fallback order when all the nodes in a higher tier are
> > > > out of space: The page allocation can fall back to any node from
> > > > any lower tier, whereas the demotion order doesn't allow that.
> > > > 
> > > > The current kernel also don't provide any interfaces for the
> > > > userspace to learn about the memory tier hierarchy in order to
> > > > optimize its memory allocations.
> > > > 
> > > > This patch series address the above by defining memory tiers explicitly.
> > > > 
> > > > This patch introduce explicity memory tiers with ranks. The rank
> > > > value of a memory tier is used to derive the demotion order between
> > > > NUMA nodes. The memory tiers present in a system can be found at
> > > > 
> > > > /sys/devices/system/memtier/memtierN/
> > > > 
> > > > The nodes which are part of a specific memory tier can be listed
> > > > via
> > > > /sys/devices/system/memtier/memtierN/nodelist
> > > > 
> > > > "Rank" is an opaque value. Its absolute value doesn't have any
> > > > special meaning. But the rank values of different memtiers can be
> > > > compared with each other to determine the memory tier order.
> > > > 
> > > > For example, if we have 3 memtiers: memtier0, memtier1, memiter2, and
> > > > their rank values are 300, 200, 100, then the memory tier order is:
> > > > memtier0 -> memtier2 -> memtier1, where memtier0 is the highest tier
> > > > and memtier1 is the lowest tier.
> > > > 
> > > > The rank value of each memtier should be unique.
> > > > 
> > > > A higher rank memory tier will appear first in the demotion order
> > > > than a lower rank memory tier. ie. while reclaim we choose a node
> > > > in higher rank memory tier to demote pages to as compared to a node
> > > > in a lower rank memory tier.
> > > > 
> > > > For now we are not adding the dynamic number of memory tiers.
> > > > But a future series supporting that is possible. Currently
> > > > number of tiers supported is limitted to MAX_MEMORY_TIERS(3).
> > > > When doing memory hotplug, if not added to a memory tier, the NUMA
> > > > node gets added to DEFAULT_MEMORY_TIER(1).
> > > > 
> > > > This patch is based on the proposal sent by Wei Xu <weixugc@google.com> at [1].
> > > > 
> > > > [1] https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com
> > > > 
> > > > Suggested-by: Wei Xu <weixugc@google.com>
> > > > Signed-off-by: Jagdish Gediya <jvgediya@linux.ibm.com>
> > > > Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> > > > ---
> > > >  include/linux/memory-tiers.h |  20 ++++
> > > >  mm/Kconfig                   |  11 ++
> > > >  mm/Makefile                  |   1 +
> > > >  mm/memory-tiers.c            | 188 +++++++++++++++++++++++++++++++++++
> > > >  4 files changed, 220 insertions(+)
> > > >  create mode 100644 include/linux/memory-tiers.h
> > > >  create mode 100644 mm/memory-tiers.c
> > > > 
> > > > diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
> > > > new file mode 100644
> > > > index 000000000000..e17f6b4ee177
> > > > --- /dev/null
> > > > +++ b/include/linux/memory-tiers.h
> > > > @@ -0,0 +1,20 @@
> > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > > +#define _LINUX_MEMORY_TIERS_H
> > > > +
> > > > +#ifdef CONFIG_TIERED_MEMORY
> > > > +
> > > > +#define MEMORY_TIER_HBM_GPU    0
> > > > +#define MEMORY_TIER_DRAM       1
> > > > +#define MEMORY_TIER_PMEM       2
> > > > +
> > > > +#define MEMORY_RANK_HBM_GPU    300
> > > > +#define MEMORY_RANK_DRAM       200
> > > > +#define MEMORY_RANK_PMEM       100
> > > > +
> > > > +#define DEFAULT_MEMORY_TIER    MEMORY_TIER_DRAM
> > > > +#define MAX_MEMORY_TIERS  3
> > > > +
> > > > +#endif /* CONFIG_TIERED_MEMORY */
> > > > +
> > > > +#endif
> > > > diff --git a/mm/Kconfig b/mm/Kconfig
> > > > index 169e64192e48..08a3d330740b 100644
> > > > --- a/mm/Kconfig
> > > > +++ b/mm/Kconfig
> > > > @@ -614,6 +614,17 @@ config ARCH_ENABLE_HUGEPAGE_MIGRATION
> > > >  config ARCH_ENABLE_THP_MIGRATION
> > > >         bool
> > > > 
> > > > +config TIERED_MEMORY
> > > > +       bool "Support for explicit memory tiers"
> > > > +       def_bool n
> > > > +       depends on MIGRATION && NUMA
> > > > +       help
> > > > +         Support to split nodes into memory tiers explicitly and
> > > > +         to demote pages on reclaim to lower tiers. This option
> > > > +         also exposes sysfs interface to read nodes available in
> > > > +         specific tier and to move specific node among different
> > > > +         possible tiers.
> > > 
> > > IMHO we should not need a new kernel config. If tiering is not present
> > > then there is just one tier on the system. And tiering is a kind of
> > > hardware configuration, the information could be shown regardless of
> > > whether demotion/promotion is supported/enabled or not.
> > 
> > I think so too.  At least it appears unnecessary to let the user turn
> > on/off it at configuration time.
> > 
> > All the code should be enclosed by #if defined(CONFIG_NUMA) &&
> > defined(CONIFIG_MIGRATION).  So we will not waste memory in small
> > systems.
> 
> CONFIG_NUMA alone should be good enough. CONFIG_MIGRATION is enabled
> by default if NUMA is enabled. And MIGRATION is just used to support
> demotion/promotion. Memory tiers exist even though demotion/promotion
> is not supported, right?

Yes.  You are right.  For example, in the following patch, memory tiers
are used for allocation interleaving.

https://lore.kernel.org/lkml/20220607171949.85796-1-hannes@cmpxchg.org/

Best Regards,
Huang, Ying

Aneesh Kumar K.V June 9, 2022, 8:17 a.m. UTC | #21
On 6/8/22 10:12 PM, Yang Shi wrote:
> On Tue, Jun 7, 2022 at 9:58 PM Aneesh Kumar K V
> <aneesh.kumar@linux.ibm.com> wrote:

....

>>    config TIERED_MEMORY
>>          bool "Support for explicit memory tiers"
>> -       def_bool n
>> -       depends on MIGRATION && NUMA
>> -       help
>> -         Support to split nodes into memory tiers explicitly and
>> -         to demote pages on reclaim to lower tiers. This option
>> -         also exposes sysfs interface to read nodes available in
>> -         specific tier and to move specific node among different
>> -         possible tiers.
>> +       def_bool MIGRATION && NUMA
> 
> CONFIG_NUMA should be good enough. Memory tiering doesn't have to mean
> demotion/promotion has to be supported IMHO.
> 
>>
>>    config HUGETLB_PAGE_SIZE_VARIABLE
>>          def_bool n
>>
>> ie, we just make it a Kconfig variable without exposing it to the user?
>>

We can do that but that would also mean in order to avoid building the 
demotion targets etc we will now have to have multiple #ifdef 
CONFIG_MIGRATION in mm/memory-tiers.c . It builds without those #ifdef 
So these are not really build errors, but rather we will be building all 
the demotion targets for no real use with them.

What use case do you have for exposing memory tiers on a system with 
CONFIG_MIGRATION disabled? CONFIG_MIGRATION gets enabled in almost all 
configs these days due to its dependency on COMPACTION and 
TRANSPARENT_HUGEPAGE.

Unless there is a real need, I am wondering if we can avoid sprinkling 
#ifdef CONFIG_MIGRATION throughout mm/memory-tiers.c.

-aneesh
Johannes Weiner June 9, 2022, 1:55 p.m. UTC | #22
On Thu, Jun 09, 2022 at 08:03:26AM +0530, Aneesh Kumar K V wrote:
> On 6/8/22 11:46 PM, Johannes Weiner wrote:
> > On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote:
> > > On 6/8/22 9:25 PM, Johannes Weiner wrote:
> > > > Hello,
> > > > 
> > > > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:
> > > > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:
> > > > > > @@ -0,0 +1,20 @@
> > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > > > > +#define _LINUX_MEMORY_TIERS_H
> > > > > > +
> > > > > > +#ifdef CONFIG_TIERED_MEMORY
> > > > > > +
> > > > > > +#define MEMORY_TIER_HBM_GPU	0
> > > > > > +#define MEMORY_TIER_DRAM	1
> > > > > > +#define MEMORY_TIER_PMEM	2
> > > > > > +
> > > > > > +#define MEMORY_RANK_HBM_GPU	300
> > > > > > +#define MEMORY_RANK_DRAM	200
> > > > > > +#define MEMORY_RANK_PMEM	100
> > > > > > +
> > > > > > +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> > > > > > +#define MAX_MEMORY_TIERS  3
> > > > > 
> > > > > I understand the names are somewhat arbitrary, and the tier ID space
> > > > > can be expanded down the line by bumping MAX_MEMORY_TIERS.
> > > > > 
> > > > > But starting out with a packed ID space can get quite awkward for
> > > > > users when new tiers - especially intermediate tiers - show up in
> > > > > existing configurations. I mentioned in the other email that DRAM !=
> > > > > DRAM, so new tiers seem inevitable already.
> > > > > 
> > > > > It could make sense to start with a bigger address space and spread
> > > > > out the list of kernel default tiers a bit within it:
> > > > > 
> > > > > MEMORY_TIER_GPU		0
> > > > > MEMORY_TIER_DRAM	10
> > > > > MEMORY_TIER_PMEM	20
> > > > 
> > > > Forgive me if I'm asking a question that has been answered. I went
> > > > back to earlier threads and couldn't work it out - maybe there were
> > > > some off-list discussions? Anyway...
> > > > 
> > > > > Why is there a distinction between tier ID and rank? I understand that
> > > > rank was added because tier IDs were too few. But if rank determines
> > > > ordering, what is the use of a separate tier ID? IOW, why not make the
> > > > tier ID space wider and have the kernel pick a few spread out defaults
> > > > based on known hardware, with plenty of headroom to be future proof.
> > > > 
> > > >     $ ls tiers
> > > >     100				# DEFAULT_TIER
> > > >     $ cat tiers/100/nodelist
> > > >     0-1				# conventional numa nodes
> > > > 
> > > >     <pmem is onlined>
> > > > 
> > > >     $ grep . tiers/*/nodelist
> > > >     tiers/100/nodelist:0-1	# conventional numa
> > > >     tiers/200/nodelist:2		# pmem
> > > > 
> > > >     $ grep . nodes/*/tier
> > > >     nodes/0/tier:100
> > > >     nodes/1/tier:100
> > > >     nodes/2/tier:200
> > > > 
> > > >     <unknown device is online as node 3, defaults to 100>
> > > > 
> > > >     $ grep . tiers/*/nodelist
> > > >     tiers/100/nodelist:0-1,3
> > > >     tiers/200/nodelist:2
> > > > 
> > > >     $ echo 300 >nodes/3/tier
> > > >     $ grep . tiers/*/nodelist
> > > >     tiers/100/nodelist:0-1
> > > >     tiers/200/nodelist:2
> > > >     tiers/300/nodelist:3
> > > > 
> > > >     $ echo 200 >nodes/3/tier
> > > >     $ grep . tiers/*/nodelist
> > > >     tiers/100/nodelist:0-1	
> > > >     tiers/200/nodelist:2-3
> > > > 
> > > > etc.
> > > 
> > > tier ID is also used as device id memtier.dev.id. It was discussed that we
> > > would need the ability to change the rank value of a memory tier. If we make
> > > rank value same as tier ID or tier device id, we will not be able to support
> > > that.
> > 
> > Is the idea that you could change the rank of a collection of nodes in
> > one go? Rather than moving the nodes one by one into a new tier?
> > 
> > [ Sorry, I wasn't able to find this discussion. AFAICS the first
> >    patches in RFC4 already had the struct device { .id = tier }
> >    logic. Could you point me to it? In general it would be really
> >    helpful to maintain summarized rationales for such decisions in the
> >    coverletter to make sure things don't get lost over many, many
> >    threads, conferences, and video calls. ]
> 
> Most of the discussion happened not in the patch review email threads.
> 
> RFC: Memory Tiering Kernel Interfaces (v2)
> https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com
> 
> RFC: Memory Tiering Kernel Interfaces (v4)
> https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com

I read the RFCs, the discussions and your code. It's still not clear
why the tier/device ID and the rank need to be two separate,
user-visible things. There is only one tier of a given rank, so why
can't the rank be the unique device ID? dev->id = 100. One number. Or
use a unique device ID allocator if large numbers are causing problems
internally. But I don't see an explanation of why they need to be two
different things, let alone two different things in the user ABI.
Jonathan Cameron June 9, 2022, 2:22 p.m. UTC | #23
On Thu, 9 Jun 2022 09:55:45 -0400
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Thu, Jun 09, 2022 at 08:03:26AM +0530, Aneesh Kumar K V wrote:
> > On 6/8/22 11:46 PM, Johannes Weiner wrote:  
> > > On Wed, Jun 08, 2022 at 09:43:52PM +0530, Aneesh Kumar K V wrote:  
> > > > On 6/8/22 9:25 PM, Johannes Weiner wrote:  
> > > > > Hello,
> > > > > 
> > > > > On Wed, Jun 08, 2022 at 10:11:31AM -0400, Johannes Weiner wrote:  
> > > > > > On Fri, Jun 03, 2022 at 07:12:29PM +0530, Aneesh Kumar K.V wrote:  
> > > > > > > @@ -0,0 +1,20 @@
> > > > > > > +/* SPDX-License-Identifier: GPL-2.0 */
> > > > > > > +#ifndef _LINUX_MEMORY_TIERS_H
> > > > > > > +#define _LINUX_MEMORY_TIERS_H
> > > > > > > +
> > > > > > > +#ifdef CONFIG_TIERED_MEMORY
> > > > > > > +
> > > > > > > +#define MEMORY_TIER_HBM_GPU	0
> > > > > > > +#define MEMORY_TIER_DRAM	1
> > > > > > > +#define MEMORY_TIER_PMEM	2
> > > > > > > +
> > > > > > > +#define MEMORY_RANK_HBM_GPU	300
> > > > > > > +#define MEMORY_RANK_DRAM	200
> > > > > > > +#define MEMORY_RANK_PMEM	100
> > > > > > > +
> > > > > > > +#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
> > > > > > > +#define MAX_MEMORY_TIERS  3  
> > > > > > 
> > > > > > I understand the names are somewhat arbitrary, and the tier ID space
> > > > > > can be expanded down the line by bumping MAX_MEMORY_TIERS.
> > > > > > 
> > > > > > But starting out with a packed ID space can get quite awkward for
> > > > > > users when new tiers - especially intermediate tiers - show up in
> > > > > > existing configurations. I mentioned in the other email that DRAM !=
> > > > > > DRAM, so new tiers seem inevitable already.
> > > > > > 
> > > > > > It could make sense to start with a bigger address space and spread
> > > > > > out the list of kernel default tiers a bit within it:
> > > > > > 
> > > > > > MEMORY_TIER_GPU		0
> > > > > > MEMORY_TIER_DRAM	10
> > > > > > MEMORY_TIER_PMEM	20  
> > > > > 
> > > > > Forgive me if I'm asking a question that has been answered. I went
> > > > > back to earlier threads and couldn't work it out - maybe there were
> > > > > some off-list discussions? Anyway...
> > > > > 
> > > > > > Why is there a distinction between tier ID and rank? I understand that
> > > > > rank was added because tier IDs were too few. But if rank determines
> > > > > ordering, what is the use of a separate tier ID? IOW, why not make the
> > > > > tier ID space wider and have the kernel pick a few spread out defaults
> > > > > based on known hardware, with plenty of headroom to be future proof.
> > > > > 
> > > > >     $ ls tiers
> > > > >     100				# DEFAULT_TIER
> > > > >     $ cat tiers/100/nodelist
> > > > >     0-1				# conventional numa nodes
> > > > > 
> > > > >     <pmem is onlined>
> > > > > 
> > > > >     $ grep . tiers/*/nodelist
> > > > >     tiers/100/nodelist:0-1	# conventional numa
> > > > >     tiers/200/nodelist:2		# pmem
> > > > > 
> > > > >     $ grep . nodes/*/tier
> > > > >     nodes/0/tier:100
> > > > >     nodes/1/tier:100
> > > > >     nodes/2/tier:200
> > > > > 
> > > > >     <unknown device is online as node 3, defaults to 100>
> > > > > 
> > > > >     $ grep . tiers/*/nodelist
> > > > >     tiers/100/nodelist:0-1,3
> > > > >     tiers/200/nodelist:2
> > > > > 
> > > > >     $ echo 300 >nodes/3/tier
> > > > >     $ grep . tiers/*/nodelist
> > > > >     tiers/100/nodelist:0-1
> > > > >     tiers/200/nodelist:2
> > > > >     tiers/300/nodelist:3
> > > > > 
> > > > >     $ echo 200 >nodes/3/tier
> > > > >     $ grep . tiers/*/nodelist
> > > > >     tiers/100/nodelist:0-1	
> > > > >     tiers/200/nodelist:2-3
> > > > > 
> > > > > etc.  
> > > > 
> > > > tier ID is also used as device id memtier.dev.id. It was discussed that we
> > > > would need the ability to change the rank value of a memory tier. If we make
> > > > rank value same as tier ID or tier device id, we will not be able to support
> > > > that.  
> > > 
> > > Is the idea that you could change the rank of a collection of nodes in
> > > one go? Rather than moving the nodes one by one into a new tier?
> > > 
> > > [ Sorry, I wasn't able to find this discussion. AFAICS the first
> > >    patches in RFC4 already had the struct device { .id = tier }
> > >    logic. Could you point me to it? In general it would be really
> > >    helpful to maintain summarized rationales for such decisions in the
> > >    coverletter to make sure things don't get lost over many, many
> > >    threads, conferences, and video calls. ]  
> > 
> > Most of the discussion happened not in the patch review email threads.
> > 
> > RFC: Memory Tiering Kernel Interfaces (v2)
> > https://lore.kernel.org/linux-mm/CAAPL-u_diGYEb7+WsgqNBLRix-nRCk2SsDj6p9r8j5JZwOABZQ@mail.gmail.com
> > 
> > RFC: Memory Tiering Kernel Interfaces (v4)
> > https://lore.kernel.org/linux-mm/CAAPL-u9Wv+nH1VOZTj=9p9S70Y3Qz3+63EkqncRDdHfubsrjfw@mail.gmail.com  
> 
> I read the RFCs, the discussions and your code. It's still not clear
> why the tier/device ID and the rank need to be two separate,
> user-visible things. There is only one tier of a given rank, why can't
> the rank be the unique device id? dev->id = 100. One number. Or use a
> unique device id allocator if large numbers are causing problems
> internally. But I don't see an explanation why they need to be two
> different things, let alone two different things in the user ABI.

I think the discussion hinged on it making sense to be able to change
the rank of a tier rather than create a new tier and move things one by
one. The example was wanting to change the rank of a tier that was
created either by core code or by a subsystem.

E.g. if a GPU driver creates a tier, the assumption is that all similar
GPUs will default to the same tier (if hot plugged later, for example),
as the driver subsystem will keep a reference to the created tier.
Hence, if the user wants to change the order of that tier relative to
other tiers, the option of creating a new tier and moving the devices
would require us to have infrastructure to tell the GPU driver to use
the new tier for additional devices from then on.

Or we could go with new nodes not being assigned to any tier, with
userspace always responsible for that assignment.  That may be a problem
for anything relying on existing behavior, and it means that there must
always be a sensible userspace script...

Jonathan
Yang Shi June 9, 2022, 4:04 p.m. UTC | #24
On Thu, Jun 9, 2022 at 1:18 AM Aneesh Kumar K V
<aneesh.kumar@linux.ibm.com> wrote:
>
> On 6/8/22 10:12 PM, Yang Shi wrote:
> > On Tue, Jun 7, 2022 at 9:58 PM Aneesh Kumar K V
> > <aneesh.kumar@linux.ibm.com> wrote:
>
> ....
>
> >>    config TIERED_MEMORY
> >>          bool "Support for explicit memory tiers"
> >> -       def_bool n
> >> -       depends on MIGRATION && NUMA
> >> -       help
> >> -         Support to split nodes into memory tiers explicitly and
> >> -         to demote pages on reclaim to lower tiers. This option
> >> -         also exposes sysfs interface to read nodes available in
> >> -         specific tier and to move specific node among different
> >> -         possible tiers.
> >> +       def_bool MIGRATION && NUMA
> >
> > CONFIG_NUMA should be good enough. Memory tiering doesn't have to mean
> > demotion/promotion has to be supported IMHO.
> >
> >>
> >>    config HUGETLB_PAGE_SIZE_VARIABLE
> >>          def_bool n
> >>
> >> ie, we just make it a Kconfig variable without exposing it to the user?
> >>
>
> We can do that but that would also mean in order to avoid building the
> demotion targets etc we will now have to have multiple #ifdef
> CONFIG_MIGRATION in mm/memory-tiers.c . It builds without those #ifdef
> So these are not really build errors, but rather we will be building all
> the demotion targets for no real use with them.

Can we have default demotion targets for !MIGRATION? For example, all
demotion targets are -1.

>
> What usecase do you have to expose memory tiers on a system with
> CONFIG_MIGRATION disabled? CONFIG_MIGRATION gets enabled in almost all
> configs these days due to its dependency against COMPACTION and
> TRANSPARENT_HUGEPAGE.

Johannes's interleave series is an example,
https://lore.kernel.org/lkml/20220607171949.85796-1-hannes@cmpxchg.org/

It doesn't do any demotion/promotion; it just makes allocations
interleave across different tiers.

>
> Unless there is a real need, I am wondering if we can avoid sprinkling
> #ifdef CONFIG_MIGRATION in mm/memory-tiers.c
>
> -aneesh
Johannes Weiner June 9, 2022, 8:41 p.m. UTC | #25
On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> I think discussion hinged on it making sense to be able to change
> rank of a tier rather than create a new tier and move things one by one.
> Example was wanting to change the rank of a tier that was created
> either by core code or a subsystem.
> 
> E.g. If GPU driver creates a tier, assumption is all similar GPUs will
> default to the same tier (if hot plugged later for example) as the
> driver subsystem will keep a reference to the created tier.
> Hence if user wants to change the order of that relative to
> other tiers, the option of creating a new tier and moving the
> devices would then require us to have infrastructure to tell the GPU
> driver to now use the new tier for additional devices.

That's an interesting point, thanks for explaining.

But that could still happen when two drivers report the same tier and
one of them is wrong, right? You'd still need to separate out by hand
to adjust rank, as well as handle hotplug events. Driver collisions
are probable with coarse categories like gpu, dram, pmem.

Would it make more sense to have the platform/devicetree/driver
provide more fine-grained distance values similar to NUMA distances,
and have a driver-scope tunable to override/correct? And then have the
distance value function as the unique tier ID and rank in one.

That would allow device class reassignments, too, and it would work
with driver collisions where simple "tier stickiness" would
not. (Although collisions would be less likely to begin with given a
broader range of possible distance values.)

Going further, it could be useful to separate the business of hardware
properties (and configuring quirks) from the business of configuring
MM policies that should be applied to the resulting tier hierarchy.
They're somewhat orthogonal tuning tasks, and one of them might become
obsolete before the other (if the quality of distance values provided
by drivers improves before the quality of MM heuristics ;). Separating
them might help clarify the interface for both designers and users.

E.g. a memdev class scope with a driver-wide distance value, and a
memdev scope for per-device values that default to "inherit driver
value". The memtier subtree would then have an r/o structure, but
allow tuning per-tier interleaving ratio[1], demotion rules etc.

[1] https://lore.kernel.org/linux-mm/20220607171949.85796-1-hannes@cmpxchg.org/#t
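
(Sketch of the class vs. per-device scope idea above -- a standalone toy
with hypothetical names, not a proposed kernel API:)

/* Toy model: per-device distance defaults to the driver-wide value. */
#include <stdio.h>

#define DISTANCE_INHERIT (-1)

struct memdev_class {
        const char *name;
        int default_distance;           /* driver-wide, overridable */
};

struct memdev {
        const struct memdev_class *class;
        int distance;                   /* DISTANCE_INHERIT => class default */
};

static int effective_distance(const struct memdev *dev)
{
        return dev->distance == DISTANCE_INHERIT ?
                dev->class->default_distance : dev->distance;
}

int main(void)
{
        struct memdev_class pmem = { "pmem", 300 };
        struct memdev a = { &pmem, DISTANCE_INHERIT }; /* inherits 300 */
        struct memdev b = { &pmem, 260 };              /* per-device override */

        printf("a=%d b=%d\n", effective_distance(&a), effective_distance(&b));
        return 0;
}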
Huang, Ying June 10, 2022, 6:15 a.m. UTC | #26
On Thu, 2022-06-09 at 16:41 -0400, Johannes Weiner wrote:
> On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> > I think discussion hinged on it making sense to be able to change
> > rank of a tier rather than create a new tier and move things one by one.
> > Example was wanting to change the rank of a tier that was created
> > either by core code or a subsystem.
> > 
> > E.g. If GPU driver creates a tier, assumption is all similar GPUs will
> > default to the same tier (if hot plugged later for example) as the
> > driver subsystem will keep a reference to the created tier.
> > Hence if user wants to change the order of that relative to
> > other tiers, the option of creating a new tier and moving the
> > devices would then require us to have infrastructure to tell the GPU
> > driver to now use the new tier for additional devices.
> 
> That's an interesting point, thanks for explaining.

I have proposed to use sparse memory tier device IDs and remove rank. 
The response from Wei Xu is as follows:

"
Using the rank value directly as the device ID has some disadvantages:
- It is kind of unconventional to number devices in this way.
- We cannot assign DRAM nodes with CPUs with a specific memtier device
ID (even though this is not mandated by the "rank" proposal, I expect
the device will likely always be memtier1 in practice).
- It is possible that we may eventually allow the rank value to be
modified as a way to adjust the tier ordering.  We cannot do that
easily for device IDs.
"

in

https://lore.kernel.org/lkml/CAAPL-u9t=9hYfcXyCZwYFmVTUQGrWVq3cndpN+sqPSm5cwE4Yg@mail.gmail.com/

I think that your proposal below has resolved the latter "disadvantage".
So if the former one isn't so important, we can go ahead and remove "rank".
That will make memory tiers much easier to understand and use.

Best Regards,
Huang, Ying

> But that could still happen when two drivers report the same tier and
> one of them is wrong, right? You'd still need to separate out by hand
> to adjust rank, as well as handle hotplug events. Driver collisions
> are probable with coarse categories like gpu, dram, pmem.
> 
> Would it make more sense to have the platform/devicetree/driver
> provide more fine-grained distance values similar to NUMA distances,
> and have a driver-scope tunable to override/correct? And then have the
> distance value function as the unique tier ID and rank in one.
> 
> That would allow device class reassignments, too, and it would work
> with driver collisions where simple "tier stickiness" would
> not. (Although collisions would be less likely to begin with given a
> broader range of possible distance values.)
> 
> Going further, it could be useful to separate the business of hardware
> properties (and configuring quirks) from the business of configuring
> MM policies that should be applied to the resulting tier hierarchy.
> They're somewhat orthogonal tuning tasks, and one of them might become
> obsolete before the other (if the quality of distance values provided
> by drivers improves before the quality of MM heuristics ;). Separating
> them might help clarify the interface for both designers and users.
> 
> E.g. a memdev class scope with a driver-wide distance value, and a
> memdev scope for per-device values that default to "inherit driver
> value". The memtier subtree would then have an r/o structure, but
> allow tuning per-tier interleaving ratio[1], demotion rules etc.
> 
> [1] https://lore.kernel.org/linux-mm/20220607171949.85796-1-hannes@cmpxchg.org/#t
Jonathan Cameron June 10, 2022, 9:57 a.m. UTC | #27
On Thu, 9 Jun 2022 16:41:04 -0400
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> > I think discussion hinged on it making sense to be able to change
> > rank of a tier rather than create a new tier and move things one by one.
> > Example was wanting to change the rank of a tier that was created
> > either by core code or a subsystem.
> > 
> > E.g. If GPU driver creates a tier, assumption is all similar GPUs will
> > default to the same tier (if hot plugged later for example) as the
> > driver subsystem will keep a reference to the created tier.
> > Hence if user wants to change the order of that relative to
> > other tiers, the option of creating a new tier and moving the
> > devices would then require us to have infrastructure to tell the GPU
> > driver to now use the new tier for additional devices.  
> 
> That's an interesting point, thanks for explaining.
> 
> But that could still happen when two drivers report the same tier and
> one of them is wrong, right? You'd still need to separate out by hand
> to adjust rank, as well as handle hotplug events. Driver collisions
> are probable with coarse categories like gpu, dram, pmem.

There will always be cases that need hand tweaking.  Also I'd envision
some driver subsystems being clever enough to manage several tiers and
use the information available to them to assign appropriately.  This
is definitely true for CXL 2.0+ devices where we can have radically
different device types under the same driver (volatile, persistent,
direct connect, behind switches etc).  There will be some interesting
choices to make on groupings in big systems as we don't want too many
tiers unless we naturally demote multiple levels in one go.

> 
> Would it make more sense to have the platform/devicetree/driver
> provide more fine-grained distance values similar to NUMA distances,
> and have a driver-scope tunable to override/correct? And then have the
> distance value function as the unique tier ID and rank in one.

Absolutely a good thing to provide that information, but it's black
magic. There are too many contradicting metrics (latency vs bandwidth etc.),
even before considering a more complex system model like the one Jerome
Glisse proposed a few years back:
https://lore.kernel.org/all/20190118174512.GA3060@redhat.com/
CXL 2.0 got this more right than anything else I've seen, as it provides
a discoverable topology along with details like the latency to cross
between particular switch ports.  Actually using that data (other than by
throwing it at userspace controls for HPC apps etc.) is going to take some
figuring out.  Even the question of what and how we expose this info to
userspace is non-obvious.

The 'right' decision is also use-case specific, so what you'd do for
particular memory characteristics on a GPU is not the same as what
you'd do for the same characteristics on a memory-only device.

> 
> That would allow device class reassignments, too, and it would work
> with driver collisions where simple "tier stickiness" would
> not. (Although collisions would be less likely to begin with given a
> broader range of possible distance values.)

I think we definitely need the option to move individual nodes (in this
case nodes map to individual devices if characteristics vary between them)
around as well, but I think that's somewhat orthogonal to a good first guess.

> 
> Going further, it could be useful to separate the business of hardware
> properties (and configuring quirks) from the business of configuring
> MM policies that should be applied to the resulting tier hierarchy.
> They're somewhat orthogonal tuning tasks, and one of them might become
> obsolete before the other (if the quality of distance values provided
> by drivers improves before the quality of MM heuristics ;). Separating
> them might help clarify the interface for both designers and users.
> 
> E.g. a memdev class scope with a driver-wide distance value, and a
> memdev scope for per-device values that default to "inherit driver
> value". The memtier subtree would then have an r/o structure, but
> allow tuning per-tier interleaving ratio[1], demotion rules etc.

Ok that makes sense.  I'm not sure if that ends up as an implementation
detail, or affects the userspace interface of this particular element.

I'm not sure completely read-only is flexible enough (though mostly RO is
fine), as we keep sketching out cases where any attempt to do things
automatically does the wrong thing and where we need to add an extra tier
to get everything to work.  Short of having a lot of tiers I'm not sure
how we could have the default work well.  Maybe a lot of "tiers" is fine,
though perhaps we'd need to rename them if we go this way, since then they
don't really work as the current concept of a tier.

Imagine a system with a subtle difference between different memories,
such as a 10% latency increase for the same bandwidth.  Getting an
advantage from demoting to such a tier will require really stable usage
and long run times. Whilst you could design a demotion scheme that takes
that into account, I think we are a long way from that today.

Jonathan


> 
> [1] https://lore.kernel.org/linux-mm/20220607171949.85796-1-hannes@cmpxchg.org/#t
Johannes Weiner June 13, 2022, 2:05 p.m. UTC | #28
On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
> On Thu, 9 Jun 2022 16:41:04 -0400
> Johannes Weiner <hannes@cmpxchg.org> wrote:
> > On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> > Would it make more sense to have the platform/devicetree/driver
> > provide more fine-grained distance values similar to NUMA distances,
> > and have a driver-scope tunable to override/correct? And then have the
> > distance value function as the unique tier ID and rank in one.
> 
> Absolutely a good thing to provide that information, but it's black
> magic. There are too many contradicting metrics (latency vs bandwidth etc)
> even not including a more complex system model like Jerome Glisse proposed
> a few years back. https://lore.kernel.org/all/20190118174512.GA3060@redhat.com/
> CXL 2.0 got this more right than anything else I've seen as provides
> discoverable topology along with details like latency to cross between
> particular switch ports.  Actually using that data (other than by throwing
> it to userspace controls for HPC apps etc) is going to take some figuring out.
> Even the question of what + how we expose this info to userspace is non
> obvious.

Right, I don't think those would be scientifically accurate - but
neither is a number between 1 and 3. The way I look at it is more
about spreading out the address space a bit, to allow expressing
nuanced differences without risking conflicts and overlaps. Hopefully
this results in the shipped values stabilizing over time and thus
requiring less and less intervention and overriding from userspace.

> > Going further, it could be useful to separate the business of hardware
> > properties (and configuring quirks) from the business of configuring
> > MM policies that should be applied to the resulting tier hierarchy.
> > They're somewhat orthogonal tuning tasks, and one of them might become
> > obsolete before the other (if the quality of distance values provided
> > by drivers improves before the quality of MM heuristics ;). Separating
> > them might help clarify the interface for both designers and users.
> > 
> > E.g. a memdev class scope with a driver-wide distance value, and a
> > memdev scope for per-device values that default to "inherit driver
> > value". The memtier subtree would then have an r/o structure, but
> > allow tuning per-tier interleaving ratio[1], demotion rules etc.
> 
> Ok that makes sense.  I'm not sure if that ends up as an implementation
> detail, or effects the userspace interface of this particular element.
> 
> I'm not sure completely read only is flexible enough (though mostly RO is fine)
> as we keep sketching out cases where any attempt to do things automatically
> does the wrong thing and where we need to add an extra tier to get
> everything to work.  Short of having a lot of tiers I'm not sure how
> we could have the default work well.  Maybe a lot of "tiers" is fine
> though perhaps we need to rename them if going this way and then they
> don't really work as current concept of tier.
> 
> Imagine a system with subtle difference between different memories such
> as 10% latency increase for same bandwidth.  To get an advantage from
> demoting to such a tier will require really stable usage and long
> run times. Whilst you could design a demotion scheme that takes that
> into account, I think we are a long way from that today.

Good point: there can be a clear hardware difference, but it's a
policy choice whether the MM should treat them as one or two tiers.

What do you think of a per-driver/per-device (overridable) distance
number, combined with a configurable distance cutoff for what
constitutes separate tiers? E.g. cutoff=20 means two devices with
distances of 10 and 20 respectively would be in the same tier, devices
with 10 and 100 would be in separate ones. The kernel then generates
and populates the tiers based on distances and grouping cutoff, and
populates the memtier directory tree and nodemasks in sysfs.

It could be simple tier0, tier1, tier2 numbering again, but the
numbers now would mean something to the user. A rank tunable is no
longer necessary.
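
(As a rough illustration of that grouping -- a throwaway userspace sketch
with made-up distances and cutoff, not kernel code:)

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
        return *(const int *)a - *(const int *)b;
}

int main(void)
{
        int distance[] = { 20, 10, 120, 100, 300 };    /* per-device values */
        int n = sizeof(distance) / sizeof(distance[0]);
        int cutoff = 20;        /* tunable: max spread within one tier */
        int tier = 0, base;

        qsort(distance, n, sizeof(int), cmp_int);
        base = distance[0];
        printf("tier%d: %d", tier, distance[0]);
        for (int i = 1; i < n; i++) {
                /* start a new tier when the gap to the tier's first
                 * member exceeds the cutoff */
                if (distance[i] - base > cutoff) {
                        tier++;
                        base = distance[i];
                        printf("\ntier%d: %d", tier, distance[i]);
                } else {
                        printf(" %d", distance[i]);
                }
        }
        printf("\n");
        return 0;
}

That prints "tier0: 10 20", "tier1: 100 120", "tier2: 300", matching the
cutoff=20 example above.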

I think even the nodemasks in the memtier tree could be read-only
then, since corrections should only be necessary when either the
device distance is wrong or the tier grouping cutoff.

Can you think of scenarios where that scheme would fall apart?
Aneesh Kumar K.V June 13, 2022, 2:23 p.m. UTC | #29
On 6/13/22 7:35 PM, Johannes Weiner wrote:
> On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
>>

....

>> I'm not sure completely read only is flexible enough (though mostly RO is fine)
>> as we keep sketching out cases where any attempt to do things automatically
>> does the wrong thing and where we need to add an extra tier to get
>> everything to work.  Short of having a lot of tiers I'm not sure how
>> we could have the default work well.  Maybe a lot of "tiers" is fine
>> though perhaps we need to rename them if going this way and then they
>> don't really work as current concept of tier.
>>
>> Imagine a system with subtle difference between different memories such
>> as 10% latency increase for same bandwidth.  To get an advantage from
>> demoting to such a tier will require really stable usage and long
>> run times. Whilst you could design a demotion scheme that takes that
>> into account, I think we are a long way from that today.
> 
> Good point: there can be a clear hardware difference, but it's a
> policy choice whether the MM should treat them as one or two tiers.
> 
> What do you think of a per-driver/per-device (overridable) distance
> number, combined with a configurable distance cutoff for what
> constitutes separate tiers. E.g. cutoff=20 means two devices with
> distances of 10 and 20 respectively would be in the same tier, devices
> with 10 and 100 would be in separate ones. The kernel then generates
> and populates the tiers based on distances and grouping cutoff, and
> populates the memtier directory tree and nodemasks in sysfs.
> 

Right now core/generic code doesn't get involved in building tiers. It 
just defines three tiers where drivers can place the respective 
devices they manage. Wouldn't the above suggestion imply we are moving 
quite a lot of policy-decision logic into the generic code?

At some point, we will have to depend on more attributes than just 
distance (maybe HMAT?), and each driver should have the flexibility to 
place the device it is managing in a specific tier. By then we may 
decide to support more than the 3 static tiers which the core kernel 
currently does.

If the kernel still can't make the right decision, userspace could 
rearrange them in any order using rank values. Without something like 
rank, if userspace needs to fix things up, it gets hard with device 
hotplugging. I.e., the userspace policy could be that any new PMEM tier 
device that is hotplugged is parked with a very low rank value, and hence 
lowest in the demotion order by default (echo 10 > 
/sys/devices/system/memtier/memtier2/rank). After that, userspace could 
selectively move the new devices to the correct memory tier.


> It could be simple tier0, tier1, tier2 numbering again, but the
> numbers now would mean something to the user. A rank tunable is no
> longer necessary.
> 
> I think even the nodemasks in the memtier tree could be read-only
> then, since corrections should only be necessary when either the
> device distance is wrong or the tier grouping cutoff.
> 
> Can you think of scenarios where that scheme would fall apart?

-aneesh
Johannes Weiner June 13, 2022, 3:50 p.m. UTC | #30
On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
> On 6/13/22 7:35 PM, Johannes Weiner wrote:
> > On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
> > > I'm not sure completely read only is flexible enough (though mostly RO is fine)
> > > as we keep sketching out cases where any attempt to do things automatically
> > > does the wrong thing and where we need to add an extra tier to get
> > > everything to work.  Short of having a lot of tiers I'm not sure how
> > > we could have the default work well.  Maybe a lot of "tiers" is fine
> > > though perhaps we need to rename them if going this way and then they
> > > don't really work as current concept of tier.
> > > 
> > > Imagine a system with subtle difference between different memories such
> > > as 10% latency increase for same bandwidth.  To get an advantage from
> > > demoting to such a tier will require really stable usage and long
> > > run times. Whilst you could design a demotion scheme that takes that
> > > into account, I think we are a long way from that today.
> > 
> > Good point: there can be a clear hardware difference, but it's a
> > policy choice whether the MM should treat them as one or two tiers.
> > 
> > What do you think of a per-driver/per-device (overridable) distance
> > number, combined with a configurable distance cutoff for what
> > constitutes separate tiers. E.g. cutoff=20 means two devices with
> > distances of 10 and 20 respectively would be in the same tier, devices
> > with 10 and 100 would be in separate ones. The kernel then generates
> > and populates the tiers based on distances and grouping cutoff, and
> > populates the memtier directory tree and nodemasks in sysfs.
> > 
> 
> Right now core/generic code doesn't get involved in building tiers. It just
> defines three tiers where drivers could place the respective devices they
> manage. The above suggestion would imply we are moving quite a lot of policy
> decision logic into the generic code?.

No. The driver still chooses its own number, just from a wider
range. The only policy in generic code is the distance cutoff for
which devices are grouped into tiers together.

> At some point, we will have to depend on more attributes other than
> distance(may be HMAT?) and each driver should have the flexibility to place
> the device it is managing in a specific tier? By then we may decide to
> support more than 3 static tiers which the core kernel currently does.

If we start with a larger possible range of "distance" values right
away, we can still let the drivers ballpark into 3 tiers for now (100,
200, 300). But it will be easier to take additional metrics into
account later and fine tune accordingly (120, 260, 90 etc.) without
having to update all the other drivers as well.

> If the kernel still can't make the right decision, userspace could rearrange
> them in any order using rank values. Without something like rank, if
> userspace needs to fix things up,  it gets hard with device
> hotplugging. ie, the userspace policy could be that any new PMEM tier device
> that is hotplugged, park it with a very low-rank value and hence lowest in
> demotion order by default. (echo 10 >
> /sys/devices/system/memtier/memtier2/rank) . After that userspace could
> selectively move the new devices to the correct memory tier?

I had touched on this in the other email.

This doesn't work if two drivers that should have separate policies
collide into the same tier - which is very likely with just 3 tiers.
So it seems to me the main usecase for having a rank tunable falls
apart rather quickly until tiers are spaced out more widely. And it
does so at the cost of an, IMO, tricky to understand interface.

In the other email I had suggested the ability to override not just
the per-device distance, but also the driver default for new devices
to handle the hotplug situation.

This should be less policy than before. Driver default and per-device
distances (both overridable) combined with one tunable to set the
range of distances that get grouped into tiers.

With these parameters alone, you can generate an ordered list of tiers
and their devices. The tier numbers make sense, and no rank is needed.

Do you still need the ability to move nodes by writing nodemasks? I
don't think so. Assuming you would never want to have an actually
slower device in a higher tier than a faster device, the only time
you'd want to move a device is when the device's distance value is
wrong. So you override that (until you update to a fixed kernel).
Huang, Ying June 14, 2022, 6:48 a.m. UTC | #31
On Mon, 2022-06-13 at 11:50 -0400, Johannes Weiner wrote:
> On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
> > On 6/13/22 7:35 PM, Johannes Weiner wrote:
> > > On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
> > > > I'm not sure completely read only is flexible enough (though mostly RO is fine)
> > > > as we keep sketching out cases where any attempt to do things automatically
> > > > does the wrong thing and where we need to add an extra tier to get
> > > > everything to work.  Short of having a lot of tiers I'm not sure how
> > > > we could have the default work well.  Maybe a lot of "tiers" is fine
> > > > though perhaps we need to rename them if going this way and then they
> > > > don't really work as current concept of tier.
> > > > 
> > > > Imagine a system with subtle difference between different memories such
> > > > as 10% latency increase for same bandwidth.  To get an advantage from
> > > > demoting to such a tier will require really stable usage and long
> > > > run times. Whilst you could design a demotion scheme that takes that
> > > > into account, I think we are a long way from that today.
> > > 
> > > Good point: there can be a clear hardware difference, but it's a
> > > policy choice whether the MM should treat them as one or two tiers.
> > > 
> > > What do you think of a per-driver/per-device (overridable) distance
> > > number, combined with a configurable distance cutoff for what
> > > constitutes separate tiers. E.g. cutoff=20 means two devices with
> > > distances of 10 and 20 respectively would be in the same tier, devices
> > > with 10 and 100 would be in separate ones. The kernel then generates
> > > and populates the tiers based on distances and grouping cutoff, and
> > > populates the memtier directory tree and nodemasks in sysfs.
> > > 
> > 
> > Right now core/generic code doesn't get involved in building tiers. It just
> > defines three tiers where drivers could place the respective devices they
> > manage. The above suggestion would imply we are moving quite a lot of policy
> > decision logic into the generic code?.
> 
> No. The driver still chooses its own number, just from a wider
> range. The only policy in generic code is the distance cutoff for
> which devices are grouped into tiers together.
> 
> > At some point, we will have to depend on more attributes other than
> > distance(may be HMAT?) and each driver should have the flexibility to place
> > the device it is managing in a specific tier? By then we may decide to
> > support more than 3 static tiers which the core kernel currently does.
> 
> If we start with a larger possible range of "distance" values right
> away, we can still let the drivers ballpark into 3 tiers for now (100,
> 200, 300). But it will be easier to take additional metrics into
> account later and fine tune accordingly (120, 260, 90 etc.) without
> having to update all the other drivers as well.
> 
> > If the kernel still can't make the right decision, userspace could rearrange
> > them in any order using rank values. Without something like rank, if
> > userspace needs to fix things up,  it gets hard with device
> > hotplugging. ie, the userspace policy could be that any new PMEM tier device
> > that is hotplugged, park it with a very low-rank value and hence lowest in
> > demotion order by default. (echo 10 >
> > /sys/devices/system/memtier/memtier2/rank) . After that userspace could
> > selectively move the new devices to the correct memory tier?
> 
> I had touched on this in the other email.
> 
> This doesn't work if two drivers that should have separate policies
> collide into the same tier - which is very likely with just 3 tiers.
> So it seems to me the main usecase for having a rank tunable falls
> apart rather quickly until tiers are spaced out more widely. And it
> does so at the cost of an, IMO, tricky to understand interface.
> 
> In the other email I had suggested the ability to override not just
> the per-device distance, but also the driver default for new devices
> to handle the hotplug situation.
> 
> This should be less policy than before. Driver default and per-device
> distances (both overridable) combined with one tunable to set the
> range of distances that get grouped into tiers.
> 
> With these parameters alone, you can generate an ordered list of tiers
> and their devices. The tier numbers make sense, and no rank is needed.
> 
> Do you still need the ability to move nodes by writing nodemasks? I
> don't think so. Assuming you would never want to have an actually
> slower device in a higher tier than a faster device, the only time
> you'd want to move a device is when the device's distance value is
> wrong. So you override that (until you update to a fixed kernel).

This sounds good to me.  In this way, we override the driver parameter
instead of the memory tiers themselves.  So I guess when we do that, the
memory tier of the NUMA nodes controlled by the driver will be changed.
Or will all memory tiers be regenerated?

I have a suggestion.  Instead of an abstract distance number, how about
using memory latency and bandwidth directly?  These can be obtained from
HMAT directly when necessary.  Even if they are not available there,
they may be measured at runtime by the drivers.

Best Regards,
Huang, Ying
Aneesh Kumar K.V June 14, 2022, 8:01 a.m. UTC | #32
On 6/13/22 9:20 PM, Johannes Weiner wrote:
> On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
>> On 6/13/22 7:35 PM, Johannes Weiner wrote:
>>> On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
>>>> I'm not sure completely read only is flexible enough (though mostly RO is fine)
>>>> as we keep sketching out cases where any attempt to do things automatically
>>>> does the wrong thing and where we need to add an extra tier to get
>>>> everything to work.  Short of having a lot of tiers I'm not sure how
>>>> we could have the default work well.  Maybe a lot of "tiers" is fine
>>>> though perhaps we need to rename them if going this way and then they
>>>> don't really work as current concept of tier.
>>>>
>>>> Imagine a system with subtle difference between different memories such
>>>> as 10% latency increase for same bandwidth.  To get an advantage from
>>>> demoting to such a tier will require really stable usage and long
>>>> run times. Whilst you could design a demotion scheme that takes that
>>>> into account, I think we are a long way from that today.
>>>
>>> Good point: there can be a clear hardware difference, but it's a
>>> policy choice whether the MM should treat them as one or two tiers.
>>>
>>> What do you think of a per-driver/per-device (overridable) distance
>>> number, combined with a configurable distance cutoff for what
>>> constitutes separate tiers. E.g. cutoff=20 means two devices with
>>> distances of 10 and 20 respectively would be in the same tier, devices
>>> with 10 and 100 would be in separate ones. The kernel then generates
>>> and populates the tiers based on distances and grouping cutoff, and
>>> populates the memtier directory tree and nodemasks in sysfs.
>>>
>>
>> Right now core/generic code doesn't get involved in building tiers. It just
>> defines three tiers where drivers could place the respective devices they
>> manage. The above suggestion would imply we are moving quite a lot of policy
>> decision logic into the generic code?.
> 
> No. The driver still chooses its own number, just from a wider
> range. The only policy in generic code is the distance cutoff for
> which devices are grouped into tiers together.
> 
>> At some point, we will have to depend on more attributes other than
>> distance(may be HMAT?) and each driver should have the flexibility to place
>> the device it is managing in a specific tier? By then we may decide to
>> support more than 3 static tiers which the core kernel currently does.
> 
> If we start with a larger possible range of "distance" values right
> away, we can still let the drivers ballpark into 3 tiers for now (100,
> 200, 300). But it will be easier to take additional metrics into
> account later and fine tune accordingly (120, 260, 90 etc.) without
> having to update all the other drivers as well.
> 
>> If the kernel still can't make the right decision, userspace could rearrange
>> them in any order using rank values. Without something like rank, if
>> userspace needs to fix things up,  it gets hard with device
>> hotplugging. ie, the userspace policy could be that any new PMEM tier device
>> that is hotplugged, park it with a very low-rank value and hence lowest in
>> demotion order by default. (echo 10 >
>> /sys/devices/system/memtier/memtier2/rank) . After that userspace could
>> selectively move the new devices to the correct memory tier?
> 
> I had touched on this in the other email.
> 
> This doesn't work if two drivers that should have separate policies
> collide into the same tier - which is very likely with just 3 tiers.
> So it seems to me the main usecase for having a rank tunable falls
> apart rather quickly until tiers are spaced out more widely. And it
> does so at the cost of an, IMO, tricky to understand interface.
> 

Considering the kernel has a static map for these tiers, how can two drivers
end up using the same tier? If a new driver is going to manage a memory
device with different characteristics from the one managed by dax/kmem,
we will end up adding

#define MEMORY_TIER_NEW_DEVICE 4

The new driver will never use MEMORY_TIER_PMEM.

What can happen is that two devices managed by dax/kmem which should be in
two memory tiers get assigned the same memory tier, because the dax/kmem
driver added both devices to the same memory tier.

In the future we would avoid that by using more device properties like HMAT
to create additional memory tiers with different rank values, i.e. we would
do that in dax/kmem's create_tier_from_rank().


> In the other email I had suggested the ability to override not just
> the per-device distance, but also the driver default for new devices
> to handle the hotplug situation.
> 

I understand that the driver override will be done via module parameters.
How will we implement the device override? For example, in the case of the
dax/kmem driver, will the device override be per dax device? What interface
will we use to set the override?

IIUC in the above proposal the dax/kmem will do

node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax));

get_device_tier_index(struct dev_dax *dev)
{
    return dax_kmem_tier_index; // module parameter
}

Are you suggesting to add a dev_dax property to override the tier defaults? 

> This should be less policy than before. Driver default and per-device
> distances (both overridable) combined with one tunable to set the
> range of distances that get grouped into tiers.
> 

Can you elaborate more on how the distance value will be used? The
device/device NUMA node can have different distance values from other
NUMA nodes. How do we group them? For example, the earlier discussion
outlined three different topologies. Can you elaborate on how we would
end up grouping them using distance?

For example, in the topology below node 2 is at distance 30 from node 0
and 40 from node 1, so how will we classify node 2?


Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.

		  20
  Node 0 (DRAM)  ----  Node 1 (DRAM)
       |        \   /       |
       | 30    40 X 40      | 30
       |        /   \       |
  Node 2 (PMEM)  ----  Node 3 (PMEM)
		  40

node distances:
node   0    1    2    3
   0  10   20   30   40
   1  20   10   40   30
   2  30   40   10   40
   3  40   30   40   10

Node 0 & 1 are DRAM nodes.
Node 2 is a PMEM node and closer to node 0.

		  20
  Node 0 (DRAM)  ----  Node 1 (DRAM)
       |            /
       | 30       / 40
       |        /
  Node 2 (PMEM)

node distances:
node   0    1    2
   0  10   20   30
   1  20   10   40
   2  30   40   10


Node 0 is a DRAM node with CPU.
Node 1 is a GPU node.
Node 2 is a PMEM node.
Node 3 is a large, slow DRAM node without CPU.

		    100
     Node 0 (DRAM)  ----  Node 1 (GPU)
    /     |               /    |
   /40    |30        120 /     | 110
  |       |             /      |
  |  Node 2 (PMEM) ----       /
  |        \                 /
   \     80 \               /
    ------- Node 3 (Slow DRAM)

node distances:
node    0    1    2    3
   0   10  100   30   40
   1  100   10  120  110
   2   30  120   10   80
   3   40  110   80   10

> With these parameters alone, you can generate an ordered list of tiers
> and their devices. The tier numbers make sense, and no rank is needed.
> 
> Do you still need the ability to move nodes by writing nodemasks? I
> don't think so. Assuming you would never want to have an actually
> slower device in a higher tier than a faster device, the only time
> you'd want to move a device is when the device's distance value is
> wrong. So you override that (until you update to a fixed kernel).
Jonathan Cameron June 14, 2022, 4:45 p.m. UTC | #33
On Mon, 13 Jun 2022 10:05:06 -0400
Johannes Weiner <hannes@cmpxchg.org> wrote:

> On Fri, Jun 10, 2022 at 10:57:08AM +0100, Jonathan Cameron wrote:
> > On Thu, 9 Jun 2022 16:41:04 -0400
> > Johannes Weiner <hannes@cmpxchg.org> wrote:  
> > > On Thu, Jun 09, 2022 at 03:22:43PM +0100, Jonathan Cameron wrote:
> > > Would it make more sense to have the platform/devicetree/driver
> > > provide more fine-grained distance values similar to NUMA distances,
> > > and have a driver-scope tunable to override/correct? And then have the
> > > distance value function as the unique tier ID and rank in one.  
> > 
> > Absolutely a good thing to provide that information, but it's black
> > magic. There are too many contradicting metrics (latency vs bandwidth etc)
> > even not including a more complex system model like Jerome Glisse proposed
> > a few years back. https://lore.kernel.org/all/20190118174512.GA3060@redhat.com/
> > CXL 2.0 got this more right than anything else I've seen as provides
> > discoverable topology along with details like latency to cross between
> > particular switch ports.  Actually using that data (other than by throwing
> > it to userspace controls for HPC apps etc) is going to take some figuring out.
> > Even the question of what + how we expose this info to userspace is non
> > obvious.  

Was offline for a few days.  At risk of splitting a complex thread
even more....

> 
> Right, I don't think those would be scientifically accurate - but
> neither is a number between 1 and 3.

The 3 tiers in this proposal are just a starting point (and one I'd
expect we'll move beyond very quickly) - the aim is to define a userspace
interface that is flexible enough, but then only use a tiny bit of that
flexibility to get an initial version in place.  Even relatively trivial
CXL systems will include:

1) Direct connected volatile memory (similar to a memory-only NUMA node / socket)
2) Direct connected non-volatile (similar to a pmem NUMA node, but maybe not
   similar enough to fuse with socket-connected pmem)
3) Switch connected volatile memory (typically a disaggregated memory device,
   so huge, high bandwidth, not great latency)
4) Switch connected non-volatile (typically huge, high bandwidth, even worse
   latency).
5) Much more fun if we care about bandwidth, with interleaving going on
   in hardware across either similar, or mixed, sets of switch-connected
   and direct-connected devices.

Sure, we might fuse some of those.  But just the CXL driver is likely to have
groups separate enough that we want to handle them as 4 tiers and migrate
between those tiers...  Obviously we might want a clever strategy for cold /
hot migration!

> The way I look at it is more
> about spreading out the address space a bit, to allow expressing
> nuanced differences without risking conflicts and overlaps. Hopefully
> this results in the shipped values stabilizing over time and thus
> requiring less and less intervention and overriding from userspace.

I don't think they ever will stabilize, because the right answer isn't
definable in terms of just one number.  We'll end up with the old mess of
magic values in SLIT, in which systems have been tuned against particular
use cases. HMAT was meant to solve that, but it's not yet clear if it will.

> 
> > > Going further, it could be useful to separate the business of hardware
> > > properties (and configuring quirks) from the business of configuring
> > > MM policies that should be applied to the resulting tier hierarchy.
> > > They're somewhat orthogonal tuning tasks, and one of them might become
> > > obsolete before the other (if the quality of distance values provided
> > > by drivers improves before the quality of MM heuristics ;). Separating
> > > them might help clarify the interface for both designers and users.
> > > 
> > > E.g. a memdev class scope with a driver-wide distance value, and a
> > > memdev scope for per-device values that default to "inherit driver
> > > value". The memtier subtree would then have an r/o structure, but
> > > allow tuning per-tier interleaving ratio[1], demotion rules etc.  
> > 
> > Ok that makes sense.  I'm not sure if that ends up as an implementation
> > detail, or effects the userspace interface of this particular element.
> > 
> > I'm not sure completely read only is flexible enough (though mostly RO is fine)
> > as we keep sketching out cases where any attempt to do things automatically
> > does the wrong thing and where we need to add an extra tier to get
> > everything to work.  Short of having a lot of tiers I'm not sure how
> > we could have the default work well.  Maybe a lot of "tiers" is fine
> > though perhaps we need to rename them if going this way and then they
> > don't really work as current concept of tier.
> > 
> > Imagine a system with subtle difference between different memories such
> > as 10% latency increase for same bandwidth.  To get an advantage from
> > demoting to such a tier will require really stable usage and long
> > run times. Whilst you could design a demotion scheme that takes that
> > into account, I think we are a long way from that today.  
> 
> Good point: there can be a clear hardware difference, but it's a
> policy choice whether the MM should treat them as one or two tiers.
> 
> What do you think of a per-driver/per-device (overridable) distance
> number, combined with a configurable distance cutoff for what
> constitutes separate tiers. E.g. cutoff=20 means two devices with
> distances of 10 and 20 respectively would be in the same tier, devices
> with 10 and 100 would be in separate ones. The kernel then generates
> and populates the tiers based on distances and grouping cutoff, and
> populates the memtier directory tree and nodemasks in sysfs.

I think we'll need something along those lines, though I was envisioning
it sitting at the level of what we do with the tiers, rather than how
we create them.  So particular use cases would decide to treat
sets of tiers as if they were one.  Have enough tiers and we'll end up
with k-means or similar to figure out the groupings. Of course there
is then a sort of 'tier group for use XX' concept, so maybe not much
difference until we have a bunch of use cases.

> 
> It could be simple tier0, tier1, tier2 numbering again, but the
> numbers now would mean something to the user. A rank tunable is no
> longer necessary.

This feels like it might make tier assignments a bit less stable
and hence run into the question of how to hook up accounting. It's not my
area of expertise, but it was put forward as one of the reasons
we didn't want hotplug to potentially end up shuffling other tiers
around.  The desire was for a 'stable' entity.  We can avoid that with
'space' between them, but then we sort of still have rank, just in a
form that makes updating it messy (we'd need to create a new tier to do
it).

> 
> I think even the nodemasks in the memtier tree could be read-only
> then, since corrections should only be necessary when either the
> device distance is wrong or the tier grouping cutoff.
> 
> Can you think of scenarios where that scheme would fall apart?

Simplest (I think) is the GPU one. Often those have very nice
memory that we CPU software developers would love to use, but
some pesky GPGPU folk think it is for GPU-related data. Anyhow, folk
who care about GPUs have requested that it be in a tier of
lower rank than main memory.

If you just categorize it by performance (from CPUs) then it
might well end up elsewhere.  These folk do want to demote
to CPU-attached DRAM though.  Which raises the question of
'distance between what and what?'

That is definitely a policy decision, and one we can't get from perf
characteristics.  It's a blurry line. There are classes
of fairly low-spec memory-attached accelerators on the horizon.
For those, preventing migration to the memory they are associated
with might generally not make sense.

Tweaking policy by messing with anything that claims to be a
distance is a bit nasty, as it looks like the SLIT table tuning
that still happens. We could have a per-device rank though
and make it clear this isn't cleanly related to any perf
characteristics.  So ultimately that moves rank to devices
and then we have to put them into nodes. Not sure it gains
us much other than seeming more complex to me.

Jonathan
Johannes Weiner June 14, 2022, 6:56 p.m. UTC | #34
On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:
> On 6/13/22 9:20 PM, Johannes Weiner wrote:
> > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
> >> If the kernel still can't make the right decision, userspace could rearrange
> >> them in any order using rank values. Without something like rank, if
> >> userspace needs to fix things up,  it gets hard with device
> >> hotplugging. ie, the userspace policy could be that any new PMEM tier device
> >> that is hotplugged, park it with a very low-rank value and hence lowest in
> >> demotion order by default. (echo 10 >
> >> /sys/devices/system/memtier/memtier2/rank) . After that userspace could
> >> selectively move the new devices to the correct memory tier?
> > 
> > I had touched on this in the other email.
> > 
> > This doesn't work if two drivers that should have separate policies
> > collide into the same tier - which is very likely with just 3 tiers.
> > So it seems to me the main usecase for having a rank tunable falls
> > apart rather quickly until tiers are spaced out more widely. And it
> > does so at the cost of an, IMO, tricky to understand interface.
> > 
> 
> Considering the kernel has a static map for these tiers, how can two drivers
> end up using the same tier? If a new driver is going to manage a memory
> device that is of different characteristics than the one managed by dax/kmem,
> we will end up adding 
> 
> #define MEMORY_TIER_NEW_DEVICE 4
> 
> The new driver will never use MEMORY_TIER_PMEM
> 
> What can happen is two devices that are managed by DAX/kmem that
> should be in two memory tiers get assigned the same memory tier
> because the dax/kmem driver added both the device to the same memory tier.
> 
> In the future we would avoid that by using more device properties like HMAT
> to create additional memory tiers with different rank values. ie, we would
> do in the dax/kmem create_tier_from_rank() .

Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
DRAMs of different speeds etc.

I also like Huang's idea of using latency characteristics instead of
abstract distances. Though I'm not quite sure how feasible this is in
the short term, and share some concerns that Jonathan raised. But I
think a wider possible range to begin with makes sense in any case.

> > In the other email I had suggested the ability to override not just
> > the per-device distance, but also the driver default for new devices
> > to handle the hotplug situation.
> > 
> 
> I understand that the driver override will be done via module parameters.
> How will we implement device override? For example in case of dax/kmem driver
> the device override will be per dax device? What interface will we use to set the override?
> 
> IIUC in the above proposal the dax/kmem will do
> 
> node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax));
> 
> get_device_tier_index(struct dev_dax *dev)
> {
>     return dax_kmem_tier_index; // module parameter
> }
> 
> Are you suggesting to add a dev_dax property to override the tier defaults?

I was thinking a new struct memdevice and struct memtype(?). Every
driver implementing memory devices like this sets those up and
registers them with generic code and preset parameters. The generic
code creates sysfs directories and allows overriding the parameters.

struct memdevice {
	struct device dev;
	unsigned long distance;
	struct list_head siblings;
	/* nid? ... */
};

struct memtype {
	struct device_type type;
	unsigned long default_distance;
	struct list_head devices;
};

That forms the (tweakable) tree describing physical properties.

From that, the kernel then generates the ordered list of tiers.
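
Purely as a sketch of how those per-device and per-type distances could
combine (none of these helpers exist; the two structs mirror the proposal
above, the rest is hypothetical):

#include <stdio.h>

/*
 * Hypothetical: a device distance of 0 means "inherit the driver
 * (memtype) default"; anything else is a per-device override.
 */
struct memtype {
	const char *name;
	unsigned long default_distance;
};

struct memdevice {
	const char *name;
	const struct memtype *type;
	unsigned long distance;
};

static unsigned long effective_distance(const struct memdevice *dev)
{
	return dev->distance ? dev->distance : dev->type->default_distance;
}

int main(void)
{
	const struct memtype cxl_dram = { "cxl_dram", 20 };
	const struct memtype pmem     = { "pmem",     80 };

	const struct memdevice devs[] = {
		{ "mem0", &cxl_dram, 0 },	/* inherits 20 */
		{ "mem1", &cxl_dram, 35 },	/* per-device override */
		{ "mem2", &pmem,     0 },	/* inherits 80 */
	};

	for (unsigned int i = 0; i < sizeof(devs) / sizeof(devs[0]); i++)
		printf("%s (%s): effective distance %lu\n", devs[i].name,
		       devs[i].type->name, effective_distance(&devs[i]));
	return 0;
}

Sorting those effective distances and applying a grouping cutoff (as
sketched earlier in the thread) would then yield the ordered tier list.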

> > This should be less policy than before. Driver default and per-device
> > distances (both overridable) combined with one tunable to set the
> > range of distances that get grouped into tiers.
> > 
> 
> Can you elaborate more on how distance value will be used? The device/device NUMA node can have
> different distance value from other NUMA nodes. How do we group them?
> for ex: earlier discussion did outline three different topologies. Can you
> ellaborate how we would end up grouping them using distance?
> 
> For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Nodes
> so how will we classify node 2?
> 
> 
> Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> 
> 		  20
>   Node 0 (DRAM)  ----  Node 1 (DRAM)
>        |        \   /       |
>        | 30    40 X 40      | 30
>        |        /   \       |
>   Node 2 (PMEM)  ----  Node 3 (PMEM)
> 		  40
> 
> node distances:
> node   0    1    2    3
>    0  10   20   30   40
>    1  20   10   40   30
>    2  30   40   10   40
>    3  40   30   40   10

I'm fairly confused by this example. Do all nodes have CPUs? Isn't
this just classic NUMA, where optimizing for locality makes the most
sense, rather than tiering?

Forget the interface for a second, I have no idea how tiering on such
a system would work. One CPU's lower tier can be another CPU's
toptier. There is no lowest rung from which to actually *reclaim*
pages. Would the CPUs just demote in circles?

And the coldest pages on one socket would get demoted into another
socket and displace what that socket considers hot local memory?

I feel like I'm missing something.

When we're talking about tiered memory, I'm thinking about CPUs
utilizing more than one memory node. If those other nodes have CPUs,
you can't reliably establish a singular tier order anymore and it
becomes classic NUMA, no?
Aneesh Kumar K.V June 15, 2022, 6:23 a.m. UTC | #35
On 6/15/22 12:26 AM, Johannes Weiner wrote:

....

>> What can happen is two devices that are managed by DAX/kmem that
>> should be in two memory tiers get assigned the same memory tier
>> because the dax/kmem driver added both the device to the same memory tier.
>>
>> In the future we would avoid that by using more device properties like HMAT
>> to create additional memory tiers with different rank values. ie, we would
>> do in the dax/kmem create_tier_from_rank() .
> 
> Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
> DRAMs of different speeds etc.
> 
> I also like Huang's idea of using latency characteristics instead of
> abstract distances. Though I'm not quite sure how feasible this is in
> the short term, and share some concerns that Jonathan raised. But I
> think a wider possible range to begin with makes sense in any case.
> 

How about the below proposal? 

In this proposal, we use the tier ID as the value that determines the position
of the memory tier in the demotion order. A higher value of tier ID indicates a
higher memory tier. Memory demotion happens from a higher memory tier to a lower
memory tier.

By default, memory gets hotplugged into 'default_memory_tier'. There is a core
kernel parameter "default_memory_tier" which can be updated if the user wants to
modify the default tier ID.

The dax/kmem driver uses the "dax_kmem_memtier" module parameter to determine the
memory tier to which DAX/kmem memory will be added.

dax_kmem_memtier and default_memory_tier default to 100 and 200 respectively.

Later, as we update dax/kmem to use additional device attributes, the driver will
be able to place new devices in different memory tiers. As we do that, it is
expected that users will have the ability to override these device attributes and
control which memory tiers the devices will be placed in.

New memory tiers can also be created by using the node/memtier attribute.
Moving a NUMA node to a non-existent memory tier results in creating a
new memory tier. So if the kernel's default placement of memory devices
in memory tiers is not preferred, userspace could choose to create a
completely new memory tier hierarchy using this interface. Memory tiers
get deleted when they end up with an empty nodelist.

# cat /sys/module/kernel/parameters/default_memory_tier 
200
# cat /sys/module/kmem/parameters/dax_kmem_memtier 
100

# ls /sys/devices/system/memtier/
default_tier  max_tier  memtier200  power  uevent
# ls /sys/devices/system/memtier/memtier200/nodelist 
/sys/devices/system/memtier/memtier200/nodelist
# cat  /sys/devices/system/memtier/memtier200/nodelist 
1-3
# echo 20 > /sys/devices/system/node/node1/memtier 
# 
# ls /sys/devices/system/memtier/
default_tier  max_tier  memtier20  memtier200  power  uevent
# cat  /sys/devices/system/memtier/memtier20/nodelist 
1
# 

# echo 10 > /sys/module/kmem/parameters/dax_kmem_memtier 
# echo dax0.0 > /sys/bus/dax/drivers/device_dax/unbind 
# echo dax0.0 > /sys/bus/dax/drivers/kmem/new_id 
# 
# ls /sys/devices/system/memtier/
default_tier  max_tier  memtier10  memtier20  memtier200  power  uevent
# cat  /sys/devices/system/memtier/memtier10/nodelist 
4
# 

# grep . /sys/devices/system/memtier/memtier*/nodelist
/sys/devices/system/memtier/memtier10/nodelist:4
/sys/devices/system/memtier/memtier200/nodelist:2-3
/sys/devices/system/memtier/memtier20/nodelist:1

demotion order details for the above will be
lower tier mask for node 1 is 4 and preferred demotion node is  4
lower tier mask for node 2 is 1,4 and preferred demotion node is 1
lower tier mask for node 3 is 1,4 and preferred demotion node is 1
lower tier mask for node 4  None
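
To make the ordering rule concrete, a small sketch that derives the masks
above from per-node tier IDs (the node-to-tier assignments mirror the
example; real target selection would also weigh NUMA distance, here the
node in the highest lower tier is simply preferred):

#include <stdio.h>

#define NR_NODES 5	/* nodes 1..4 as in the example, node 0 unused */

/* tier ID per node: node1 -> 20, node2/3 -> 200, node4 -> 10 */
static const int tier_of_node[NR_NODES] = { -1, 20, 200, 200, 10 };

int main(void)
{
	for (int node = 1; node < NR_NODES; node++) {
		int preferred = -1;

		printf("node %d: lower tier mask =", node);
		for (int other = 1; other < NR_NODES; other++) {
			/* only nodes in strictly lower tiers are demotion targets */
			if (tier_of_node[other] >= tier_of_node[node])
				continue;
			printf(" %d", other);
			if (preferred < 0 ||
			    tier_of_node[other] > tier_of_node[preferred])
				preferred = other;
		}
		if (preferred < 0)
			printf(" none\n");
		else
			printf(", preferred demotion node = %d\n", preferred);
	}
	return 0;
}

Its output matches the four lines listed above.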

:/sys/devices/system/memtier#  ls
default_tier  max_tier  memtier10  memtier20  memtier200  power  uevent
:/sys/devices/system/memtier#  cat memtier20/nodelist 
1
:/sys/devices/system/memtier# echo 200 > ../node/node1/memtier 
:/sys/devices/system/memtier# ls
default_tier  max_tier  memtier10  memtier200  power  uevent
:/sys/devices/system/memtier# 




>>> In the other email I had suggested the ability to override not just
>>> the per-device distance, but also the driver default for new devices
>>> to handle the hotplug situation.
>>>

.....

>>
>> Can you elaborate more on how distance value will be used? The device/device NUMA node can have
>> different distance value from other NUMA nodes. How do we group them?
>> for ex: earlier discussion did outline three different topologies. Can you
>> ellaborate how we would end up grouping them using distance?
>>
>> For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Nodes
>> so how will we classify node 2?
>>
>>
>> Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
>>
>> 		  20
>>   Node 0 (DRAM)  ----  Node 1 (DRAM)
>>        |        \   /       |
>>        | 30    40 X 40      | 30
>>        |        /   \       |
>>   Node 2 (PMEM)  ----  Node 3 (PMEM)
>> 		  40
>>
>> node distances:
>> node   0    1    2    3
>>    0  10   20   30   40
>>    1  20   10   40   30
>>    2  30   40   10   40
>>    3  40   30   40   10
> 
> I'm fairly confused by this example. Do all nodes have CPUs? Isn't
> this just classic NUMA, where optimizing for locality makes the most
> sense, rather than tiering?
> 

Node 2 and Node 3 will be memory-only NUMA nodes.

> Forget the interface for a second, I have no idea how tiering on such
> a system would work. One CPU's lower tier can be another CPU's
> toptier. There is no lowest rung from which to actually *reclaim*
> pages. Would the CPUs just demote in circles?
> 
> And the coldest pages on one socket would get demoted into another
> socket and displace what that socket considers hot local memory?
> 
> I feel like I missing something.
> 
> When we're talking about tiered memory, I'm thinking about CPUs
> utilizing more than one memory node. If those other nodes have CPUs,
> you can't reliably establish a singular tier order anymore and it
> becomes classic NUMA, no?
Huang, Ying June 16, 2022, 1:11 a.m. UTC | #36
On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
> On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:
> > On 6/13/22 9:20 PM, Johannes Weiner wrote:
> > > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
> > > > If the kernel still can't make the right decision, userspace could rearrange
> > > > them in any order using rank values. Without something like rank, if
> > > > userspace needs to fix things up,  it gets hard with device
> > > > hotplugging. ie, the userspace policy could be that any new PMEM tier device
> > > > that is hotplugged, park it with a very low-rank value and hence lowest in
> > > > demotion order by default. (echo 10 >
> > > > /sys/devices/system/memtier/memtier2/rank) . After that userspace could
> > > > selectively move the new devices to the correct memory tier?
> > > 
> > > I had touched on this in the other email.
> > > 
> > > This doesn't work if two drivers that should have separate policies
> > > collide into the same tier - which is very likely with just 3 tiers.
> > > So it seems to me the main usecase for having a rank tunable falls
> > > apart rather quickly until tiers are spaced out more widely. And it
> > > does so at the cost of an, IMO, tricky to understand interface.
> > > 
> > 
> > Considering the kernel has a static map for these tiers, how can two drivers
> > end up using the same tier? If a new driver is going to manage a memory
> > device that is of different characteristics than the one managed by dax/kmem,
> > we will end up adding 
> > 
> > #define MEMORY_TIER_NEW_DEVICE 4
> > 
> > The new driver will never use MEMORY_TIER_PMEM
> > 
> > What can happen is two devices that are managed by DAX/kmem that
> > should be in two memory tiers get assigned the same memory tier
> > because the dax/kmem driver added both the device to the same memory tier.
> > 
> > In the future we would avoid that by using more device properties like HMAT
> > to create additional memory tiers with different rank values. ie, we would
> > do in the dax/kmem create_tier_from_rank() .
> 
> Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
> DRAMs of different speeds etc.
> 
> I also like Huang's idea of using latency characteristics instead of
> abstract distances. Though I'm not quite sure how feasible this is in
> the short term, and share some concerns that Jonathan raised. But I
> think a wider possible range to begin with makes sense in any case.
> 
> > > In the other email I had suggested the ability to override not just
> > > the per-device distance, but also the driver default for new devices
> > > to handle the hotplug situation.
> > > 
> > 
> > I understand that the driver override will be done via module parameters.
> > How will we implement device override? For example in case of dax/kmem driver
> > the device override will be per dax device? What interface will we use to set the override?
> > 
> > IIUC in the above proposal the dax/kmem will do
> > 
> > node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax));
> > 
> > get_device_tier_index(struct dev_dax *dev)
> > {
> >     return dax_kmem_tier_index; // module parameter
> > }
> > 
> > Are you suggesting to add a dev_dax property to override the tier defaults?
> 
> I was thinking a new struct memdevice and struct memtype(?). Every
> driver implementing memory devices like this sets those up and
> registers them with generic code and preset parameters. The generic
> code creates sysfs directories and allows overriding the parameters.
> 
> struct memdevice {
> 	struct device dev;
> 	unsigned long distance;
> 	struct list_head siblings;
> 	/* nid? ... */
> };
> 
> struct memtype {
> 	struct device_type type;
> 	unsigned long default_distance;
> 	struct list_head devices;
> };
> 
> That forms the (tweakable) tree describing physical properties.

In general, I think memtype is a good idea.  I have suggested
something similar before.  It can describe the characteristics of a
specific type of memory (the same memory media with different interfaces
(e.g., CXL or DIMM) will be different memory types).  And they can
be used to provide overriding information.

As for memdevice, I think that we already have "node" to represent
them in sysfs.  Do we really need another one?  Is it sufficient to
add some links to node in the appropriate directory?  For example,
make the memtype class device sit under the physical device (e.g. the CXL
device), and create links to node inside the memtype class device directory?
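
Something like the following layout, just to make that concrete (all
paths and names here are hypothetical):

/sys/bus/cxl/devices/mem0/                  <- physical device
    memtype0/                               <- memtype class device
        default_distance
        distance                            <- per-device override
        node2 -> ../../../../devices/system/node/node2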

> From that, the kernel then generates the ordered list of tiers.

As Jonathan Cameron pointed out, we may need the memory tier ID to be
stable if possible.  I know this isn't an easy task.  At least we can
make the default memory tier (CPU-local DRAM) ID stable (for example
make it always 128)?  That provides an anchor for users to understand.

Best Regards,
Huang, Ying

> > > This should be less policy than before. Driver default and per-device
> > > distances (both overridable) combined with one tunable to set the
> > > range of distances that get grouped into tiers.
> > > 
> > 
> > Can you elaborate more on how distance value will be used? The device/device NUMA node can have
> > different distance value from other NUMA nodes. How do we group them?
> > for ex: earlier discussion did outline three different topologies. Can you
> > ellaborate how we would end up grouping them using distance?
> > 
> > For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Nodes
> > so how will we classify node 2?
> > 
> > 
> > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> > 
> > 		  20
> >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> >        |        \   /       |
> >        | 30    40 X 40      | 30
> >        |        /   \       |
> >   Node 2 (PMEM)  ----  Node 3 (PMEM)
> > 		  40
> > 
> > node distances:
> > node   0    1    2    3
> >    0  10   20   30   40
> >    1  20   10   40   30
> >    2  30   40   10   40
> >    3  40   30   40   10
> 
> I'm fairly confused by this example. Do all nodes have CPUs? Isn't
> this just classic NUMA, where optimizing for locality makes the most
> sense, rather than tiering?
> 
> Forget the interface for a second, I have no idea how tiering on such
> a system would work. One CPU's lower tier can be another CPU's
> toptier. There is no lowest rung from which to actually *reclaim*
> pages. Would the CPUs just demote in circles?
> 
> And the coldest pages on one socket would get demoted into another
> socket and displace what that socket considers hot local memory?
> 
> I feel like I missing something.
> 
> When we're talking about tiered memory, I'm thinking about CPUs
> utilizing more than one memory node. If those other nodes have CPUs,
> you can't reliably establish a singular tier order anymore and it
> becomes classic NUMA, no?
Wei Xu June 16, 2022, 3:45 a.m. UTC | #37
On Wed, Jun 15, 2022 at 6:11 PM Ying Huang <ying.huang@intel.com> wrote:
>
> On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
> > On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:
> > > On 6/13/22 9:20 PM, Johannes Weiner wrote:
> > > > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:
> > > > > If the kernel still can't make the right decision, userspace could rearrange
> > > > > them in any order using rank values. Without something like rank, if
> > > > > userspace needs to fix things up,  it gets hard with device
> > > > > hotplugging. ie, the userspace policy could be that any new PMEM tier device
> > > > > that is hotplugged, park it with a very low-rank value and hence lowest in
> > > > > demotion order by default. (echo 10 >
> > > > > /sys/devices/system/memtier/memtier2/rank) . After that userspace could
> > > > > selectively move the new devices to the correct memory tier?
> > > >
> > > > I had touched on this in the other email.
> > > >
> > > > This doesn't work if two drivers that should have separate policies
> > > > collide into the same tier - which is very likely with just 3 tiers.
> > > > So it seems to me the main usecase for having a rank tunable falls
> > > > apart rather quickly until tiers are spaced out more widely. And it
> > > > does so at the cost of an, IMO, tricky to understand interface.
> > > >
> > >
> > > Considering the kernel has a static map for these tiers, how can two drivers
> > > end up using the same tier? If a new driver is going to manage a memory
> > > device that is of different characteristics than the one managed by dax/kmem,
> > > we will end up adding
> > >
> > > #define MEMORY_TIER_NEW_DEVICE 4
> > >
> > > The new driver will never use MEMORY_TIER_PMEM
> > >
> > > What can happen is two devices that are managed by DAX/kmem that
> > > should be in two memory tiers get assigned the same memory tier
> > > because the dax/kmem driver added both the device to the same memory tier.
> > >
> > > In the future we would avoid that by using more device properties like HMAT
> > > to create additional memory tiers with different rank values. ie, we would
> > > do in the dax/kmem create_tier_from_rank() .
> >
> > Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
> > DRAMs of different speeds etc.
> >
> > I also like Huang's idea of using latency characteristics instead of
> > abstract distances. Though I'm not quite sure how feasible this is in
> > the short term, and share some concerns that Jonathan raised. But I
> > think a wider possible range to begin with makes sense in any case.
> >
> > > > In the other email I had suggested the ability to override not just
> > > > the per-device distance, but also the driver default for new devices
> > > > to handle the hotplug situation.
> > > >
> > >
> > > I understand that the driver override will be done via module parameters.
> > > How will we implement device override? For example in case of dax/kmem driver
> > > the device override will be per dax device? What interface will we use to set the override?
> > >
> > > IIUC in the above proposal the dax/kmem will do
> > >
> > > node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax));
> > >
> > > get_device_tier_index(struct dev_dax *dev)
> > > {
> > >     return dax_kmem_tier_index; // module parameter
> > > }
> > >
> > > Are you suggesting to add a dev_dax property to override the tier defaults?
> >
> > I was thinking a new struct memdevice and struct memtype(?). Every
> > driver implementing memory devices like this sets those up and
> > registers them with generic code and preset parameters. The generic
> > code creates sysfs directories and allows overriding the parameters.
> >
> > struct memdevice {
> >       struct device dev;
> >       unsigned long distance;
> >       struct list_head siblings;
> >       /* nid? ... */
> > };
> >
> > struct memtype {
> >       struct device_type type;
> >       unsigned long default_distance;
> >       struct list_head devices;
> > };
> >
> > That forms the (tweakable) tree describing physical properties.
>
> In general, I think memtype is a good idea.  I have suggested
> something similar before.  It can describe the characters of a
> specific type of memory (same memory media with different interface
> (e.g., CXL, or DIMM) will be different memory types).  And they can
> be used to provide overriding information.
>
> As for memdevice, I think that we already have "node" to represent
> them in sysfs.  Do we really need another one?  Is it sufficient to
> add some links to node in the appropriate directory?  For example,
> make memtype class device under the physical device (e.g. CXL device),
> and create links to node inside the memtype class device directory?
>
> > From that, the kernel then generates the ordered list of tiers.
>
> As Jonathan Cameron pointed, we may need the memory tier ID to be
> stable if possible.  I know this isn't a easy task.  At least we can
> make the default memory tier (CPU local DRAM) ID stable (for example
> make it always 128)?  That provides an anchor for users to understand.

One of the motivations of introducing "rank" is to allow memory tier
ID to be stable, at least for the well-defined tiers such as the
default memory tier.  The default memory tier can be moved around in
the tier hierarchy by adjusting its rank position relative to other
tiers, but its device ID can remain the same, e.g. always 1.

> Best Regards,
> Huang, Ying
>
> > > > This should be less policy than before. Driver default and per-device
> > > > distances (both overridable) combined with one tunable to set the
> > > > range of distances that get grouped into tiers.
> > > >
> > >
> > > Can you elaborate more on how distance value will be used? The device/device NUMA node can have
> > > different distance value from other NUMA nodes. How do we group them?
> > > for ex: earlier discussion did outline three different topologies. Can you
> > > ellaborate how we would end up grouping them using distance?
> > >
> > > For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Nodes
> > > so how will we classify node 2?
> > >
> > >
> > > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> > >
> > >               20
> > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > >        |        \   /       |
> > >        | 30    40 X 40      | 30
> > >        |        /   \       |
> > >   Node 2 (PMEM)  ----  Node 3 (PMEM)
> > >               40
> > >
> > > node distances:
> > > node   0    1    2    3
> > >    0  10   20   30   40
> > >    1  20   10   40   30
> > >    2  30   40   10   40
> > >    3  40   30   40   10
> >
> > I'm fairly confused by this example. Do all nodes have CPUs? Isn't
> > this just classic NUMA, where optimizing for locality makes the most
> > sense, rather than tiering?
> >
> > Forget the interface for a second, I have no idea how tiering on such
> > a system would work. One CPU's lower tier can be another CPU's
> > toptier. There is no lowest rung from which to actually *reclaim*
> > pages. Would the CPUs just demote in circles?
> >
> > And the coldest pages on one socket would get demoted into another
> > socket and displace what that socket considers hot local memory?
> >
> > I feel like I missing something.
> >
> > When we're talking about tiered memory, I'm thinking about CPUs
> > utilizing more than one memory node. If those other nodes have CPUs,
> > you can't reliably establish a singular tier order anymore and it
> > becomes classic NUMA, no?
>
>
>
Aneesh Kumar K.V June 16, 2022, 4:47 a.m. UTC | #38
On 6/16/22 9:15 AM, Wei Xu wrote:
> On Wed, Jun 15, 2022 at 6:11 PM Ying Huang <ying.huang@intel.com> wrote:
>>
>> On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
>>> On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:

....

>> As Jonathan Cameron pointed, we may need the memory tier ID to be
>> stable if possible.  I know this isn't a easy task.  At least we can
>> make the default memory tier (CPU local DRAM) ID stable (for example
>> make it always 128)?  That provides an anchor for users to understand.
> 
> One of the motivations of introducing "rank" is to allow memory tier
> ID to be stable, at least for the well-defined tiers such as the
> default memory tier.  The default memory tier can be moved around in
> the tier hierarchy by adjusting its rank position relative to other
> tiers, but its device ID can remain the same, e.g. always 1.
> 

With /sys/devices/system/memtier/default_tier userspace will be able to query
the default tier details. Did you get to look at 

https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com

Any reason why that will not work with all the requirements we had?

-aneesh
Huang, Ying June 16, 2022, 5:51 a.m. UTC | #39
On Thu, 2022-06-16 at 10:17 +0530, Aneesh Kumar K V wrote:
> On 6/16/22 9:15 AM, Wei Xu wrote:
> > On Wed, Jun 15, 2022 at 6:11 PM Ying Huang <ying.huang@intel.com> wrote:
> > > 
> > > On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
> > > > On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:
> 
> ....
> 
> > > As Jonathan Cameron pointed, we may need the memory tier ID to be
> > > stable if possible.  I know this isn't a easy task.  At least we can
> > > make the default memory tier (CPU local DRAM) ID stable (for example
> > > make it always 128)?  That provides an anchor for users to understand.
> > 
> > One of the motivations of introducing "rank" is to allow memory tier
> > ID to be stable, at least for the well-defined tiers such as the
> > default memory tier.  The default memory tier can be moved around in
> > the tier hierarchy by adjusting its rank position relative to other
> > tiers, but its device ID can remain the same, e.g. always 1.
> > 
> 
> With /sys/devices/system/memtier/default_tier userspace will be able query
> the default tier details.
> 

Yes.  This is a way to address the memory tier ID stability issue too. 
Another choice is to make default_tier a symbolic link.


Best Regards,
Huang, Ying
Jonathan Cameron June 17, 2022, 10:41 a.m. UTC | #40
On Thu, 16 Jun 2022 09:11:24 +0800
Ying Huang <ying.huang@intel.com> wrote:

> On Tue, 2022-06-14 at 14:56 -0400, Johannes Weiner wrote:
> > On Tue, Jun 14, 2022 at 01:31:37PM +0530, Aneesh Kumar K V wrote:  
> > > On 6/13/22 9:20 PM, Johannes Weiner wrote:  
> > > > On Mon, Jun 13, 2022 at 07:53:03PM +0530, Aneesh Kumar K V wrote:  
> > > > > If the kernel still can't make the right decision, userspace could rearrange
> > > > > them in any order using rank values. Without something like rank, if
> > > > > userspace needs to fix things up,  it gets hard with device
> > > > > hotplugging. ie, the userspace policy could be that any new PMEM tier device
> > > > > that is hotplugged, park it with a very low-rank value and hence lowest in
> > > > > demotion order by default. (echo 10 >
> > > > > /sys/devices/system/memtier/memtier2/rank) . After that userspace could
> > > > > selectively move the new devices to the correct memory tier?  
> > > > 
> > > > I had touched on this in the other email.
> > > > 
> > > > This doesn't work if two drivers that should have separate policies
> > > > collide into the same tier - which is very likely with just 3 tiers.
> > > > So it seems to me the main usecase for having a rank tunable falls
> > > > apart rather quickly until tiers are spaced out more widely. And it
> > > > does so at the cost of an, IMO, tricky to understand interface.
> > > >   
> > > 
> > > Considering the kernel has a static map for these tiers, how can two drivers
> > > end up using the same tier? If a new driver is going to manage a memory
> > > device that is of different characteristics than the one managed by dax/kmem,
> > > we will end up adding 
> > > 
> > > #define MEMORY_TIER_NEW_DEVICE 4
> > > 
> > > The new driver will never use MEMORY_TIER_PMEM
> > > 
> > > What can happen is two devices that are managed by DAX/kmem that
> > > should be in two memory tiers get assigned the same memory tier
> > > because the dax/kmem driver added both the device to the same memory tier.
> > > 
> > > In the future we would avoid that by using more device properties like HMAT
> > > to create additional memory tiers with different rank values. ie, we would
> > > do in the dax/kmem create_tier_from_rank() .  
> > 
> > Yes, that's the type of collision I mean. Two GPUs, two CXL-attached
> > DRAMs of different speeds etc.
> > 
> > I also like Huang's idea of using latency characteristics instead of
> > abstract distances. Though I'm not quite sure how feasible this is in
> > the short term, and share some concerns that Jonathan raised. But I
> > think a wider possible range to begin with makes sense in any case.
> >   
> > > > In the other email I had suggested the ability to override not just
> > > > the per-device distance, but also the driver default for new devices
> > > > to handle the hotplug situation.
> > > >   
> > > 
> > > I understand that the driver override will be done via module parameters.
> > > How will we implement device override? For example in case of dax/kmem driver
> > > the device override will be per dax device? What interface will we use to set the override?
> > > 
> > > IIUC in the above proposal the dax/kmem will do
> > > 
> > > node_create_and_set_memory_tier(numa_node, get_device_tier_index(dev_dax));
> > > 
> > > get_device_tier_index(struct dev_dax *dev)
> > > {
> > >     return dax_kmem_tier_index; // module parameter
> > > }
> > > 
> > > Are you suggesting to add a dev_dax property to override the tier defaults?  
> > 
> > I was thinking a new struct memdevice and struct memtype(?). Every
> > driver implementing memory devices like this sets those up and
> > registers them with generic code and preset parameters. The generic
> > code creates sysfs directories and allows overriding the parameters.
> > 
> > struct memdevice {
> > 	struct device dev;
> > 	unsigned long distance;
> > 	struct list_head siblings;
> > 	/* nid? ... */
> > };
> > 
> > struct memtype {
> > 	struct device_type type;
> > 	unsigned long default_distance;
> > 	struct list_head devices;
> > };
> > 
> > That forms the (tweakable) tree describing physical properties.  
> 
> In general, I think memtype is a good idea.  I have suggested
> something similar before.  It can describe the characters of a
> specific type of memory (same memory media with different interface
> (e.g., CXL, or DIMM) will be different memory types).  And they can
> be used to provide overriding information.
I'm not sure whether you are suggesting the interface as one element of
distinguishing types, or as *the* element - just in case it's 'the element':
ignore the next bit if not ;)

Memory "interface" isn't going to be enough of a distinction.  If you want to have
a default distance it would need to be different for cases where the
same 'type' of RAM has very different characteristics. Applies everywhere
but given CXL 'defines' a lot of this - if we just have DRAM attached
via CXL:

1. 16-lane direct attached DRAM device.  (low latency - high bw)
2. 4x 16-lane direct attached DRAM interleaved (low latency - very high bw)
3. 4-lane direct attached DRAM device (low latency - low bandwidth)
4. 16-lane to single switch, 4x 4-lane devices interleaved (mid latency - high bw)
5. 4-lane to single switch, 4x 4-lane devices interleaved (mid latency, mid bw)
6. 4x 16-lane so 4 switch, each switch to 4 DRAM devices (mid latency, very high bw)
(7. 16-lane direct attached NVRAM (midish latency, high bw - perf-wise might be
    similar to 4).)

It could be a lot more complex, but hopefully that conveys that 'type'
is next to useless for characterizing things unless we have a very large number
of potential subtypes. If we were on the current tiering proposal
we'd just have the CXL subsystem manage multiple tiers to cover what is
attached.

> 
> As for memdevice, I think that we already have "node" to represent
> them in sysfs.  Do we really need another one?  Is it sufficient to
> add some links to node in the appropriate directory?  For example,
> make memtype class device under the physical device (e.g. CXL device),
> and create links to node inside the memtype class device directory?
> 
> > From that, the kernel then generates the ordered list of tiers.  
> 
> As Jonathan Cameron pointed, we may need the memory tier ID to be
> stable if possible.  I know this isn't a easy task.  At least we can
> make the default memory tier (CPU local DRAM) ID stable (for example
> make it always 128)?  That provides an anchor for users to understand.
> 
> Best Regards,
> Huang, Ying
> 
> > > > This should be less policy than before. Driver default and per-device
> > > > distances (both overridable) combined with one tunable to set the
> > > > range of distances that get grouped into tiers.
> > > >   
> > > 
> > > Can you elaborate more on how distance value will be used? The device/device NUMA node can have
> > > different distance value from other NUMA nodes. How do we group them?
> > > for ex: earlier discussion did outline three different topologies. Can you
> > > ellaborate how we would end up grouping them using distance?
> > > 
> > > For ex: in the topology below node 2 is at distance 30 from Node0 and 40 from Nodes
> > > so how will we classify node 2?
> > > 
> > > 
> > > Node 0 & 1 are DRAM nodes, node 2 & 3 are PMEM nodes.
> > > 
> > > 		  20
> > >   Node 0 (DRAM)  ----  Node 1 (DRAM)
> > >        |        \   /       |
> > >        | 30    40 X 40      | 30
> > >        |        /   \       |
> > >   Node 2 (PMEM)  ----  Node 3 (PMEM)
> > > 		  40
> > > 
> > > node distances:
> > > node   0    1    2    3
> > >    0  10   20   30   40
> > >    1  20   10   40   30
> > >    2  30   40   10   40
> > >    3  40   30   40   10  
> > 
> > I'm fairly confused by this example. Do all nodes have CPUs? Isn't
> > this just classic NUMA, where optimizing for locality makes the most
> > sense, rather than tiering?
> > 
> > Forget the interface for a second, I have no idea how tiering on such
> > a system would work. One CPU's lower tier can be another CPU's
> > toptier. There is no lowest rung from which to actually *reclaim*
> > pages. Would the CPUs just demote in circles?
> > 
> > And the coldest pages on one socket would get demoted into another
> > socket and displace what that socket considers hot local memory?
> > 
> > I feel like I missing something.
> > 
> > When we're talking about tiered memory, I'm thinking about CPUs
> > utilizing more than one memory node. If those other nodes have CPUs,
> > you can't reliably establish a singular tier order anymore and it
> > becomes classic NUMA, no?  
> 
>
Aneesh Kumar K.V June 21, 2022, 8:27 a.m. UTC | #41
On 6/14/22 10:15 PM, Jonathan Cameron wrote:
> 

...

>>
>> It could be simple tier0, tier1, tier2 numbering again, but the
>> numbers now would mean something to the user. A rank tunable is no
>> longer necessary.
> 
> This feels like it might make tier assignments a bit less stable
> and hence run into question of how to hook up accounting. Not my
> area of expertise though, but it was put forward as one of the reasons
> we didn't want hotplug to potentially end up shuffling other tiers
> around.  The desire was for a 'stable' entity.  Can avoid that with
> 'space' between them but then we sort of still have rank, just in a
> form that makes updating it messy (need to create a new tier to do
> it).
> 
>>

How about we do what is proposed here 

https://lore.kernel.org/linux-mm/7b72ccf4-f4ae-cb4e-f411-74d055482026@linux.ibm.com

The cgroup accounting patch posted here https://lore.kernel.org/linux-mm/cover.1655242024.git.tim.c.chen@linux.intel.com

looks at top-tier accounting per cgroup, and I am not sure what tier ID stability is expected
for top-tier accounting.

-aneesh
diff mbox series

Patch

diff --git a/include/linux/memory-tiers.h b/include/linux/memory-tiers.h
new file mode 100644
index 000000000000..e17f6b4ee177
--- /dev/null
+++ b/include/linux/memory-tiers.h
@@ -0,0 +1,20 @@ 
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_MEMORY_TIERS_H
+#define _LINUX_MEMORY_TIERS_H
+
+#ifdef CONFIG_TIERED_MEMORY
+
+#define MEMORY_TIER_HBM_GPU	0
+#define MEMORY_TIER_DRAM	1
+#define MEMORY_TIER_PMEM	2
+
+#define MEMORY_RANK_HBM_GPU	300
+#define MEMORY_RANK_DRAM	200
+#define MEMORY_RANK_PMEM	100
+
+#define DEFAULT_MEMORY_TIER	MEMORY_TIER_DRAM
+#define MAX_MEMORY_TIERS  3
+
+#endif	/* CONFIG_TIERED_MEMORY */
+
+#endif
diff --git a/mm/Kconfig b/mm/Kconfig
index 169e64192e48..08a3d330740b 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -614,6 +614,17 @@  config ARCH_ENABLE_HUGEPAGE_MIGRATION
 config ARCH_ENABLE_THP_MIGRATION
 	bool
 
+config TIERED_MEMORY
+	bool "Support for explicit memory tiers"
+	def_bool n
+	depends on MIGRATION && NUMA
+	help
+	  Support to split nodes into memory tiers explicitly and
+	  to demote pages on reclaim to lower tiers. This option
+	  also exposes sysfs interface to read nodes available in
+	  specific tier and to move specific node among different
+	  possible tiers.
+
 config HUGETLB_PAGE_SIZE_VARIABLE
 	def_bool n
 	help
diff --git a/mm/Makefile b/mm/Makefile
index 6f9ffa968a1a..482557fbc9d1 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -92,6 +92,7 @@  obj-$(CONFIG_KFENCE) += kfence/
 obj-$(CONFIG_FAILSLAB) += failslab.o
 obj-$(CONFIG_MEMTEST)		+= memtest.o
 obj-$(CONFIG_MIGRATION) += migrate.o
+obj-$(CONFIG_TIERED_MEMORY) += memory-tiers.o
 obj-$(CONFIG_DEVICE_MIGRATION) += migrate_device.o
 obj-$(CONFIG_TRANSPARENT_HUGEPAGE) += huge_memory.o khugepaged.o
 obj-$(CONFIG_PAGE_COUNTER) += page_counter.o
diff --git a/mm/memory-tiers.c b/mm/memory-tiers.c
new file mode 100644
index 000000000000..7de18d94a08d
--- /dev/null
+++ b/mm/memory-tiers.c
@@ -0,0 +1,188 @@ 
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/types.h>
+#include <linux/device.h>
+#include <linux/nodemask.h>
+#include <linux/slab.h>
+#include <linux/memory-tiers.h>
+
+struct memory_tier {
+	struct list_head list;
+	struct device dev;
+	nodemask_t nodelist;
+	int rank;
+};
+
+#define to_memory_tier(device) container_of(device, struct memory_tier, dev)
+
+static struct bus_type memory_tier_subsys = {
+	.name = "memtier",
+	.dev_name = "memtier",
+};
+
+static DEFINE_MUTEX(memory_tier_lock);
+static LIST_HEAD(memory_tiers);
+
+
+static ssize_t nodelist_show(struct device *dev,
+			     struct device_attribute *attr, char *buf)
+{
+	struct memory_tier *memtier = to_memory_tier(dev);
+
+	return sysfs_emit(buf, "%*pbl\n",
+			  nodemask_pr_args(&memtier->nodelist));
+}
+static DEVICE_ATTR_RO(nodelist);
+
+static ssize_t rank_show(struct device *dev,
+			 struct device_attribute *attr, char *buf)
+{
+	struct memory_tier *memtier = to_memory_tier(dev);
+
+	return sysfs_emit(buf, "%d\n", memtier->rank);
+}
+static DEVICE_ATTR_RO(rank);
+
+static struct attribute *memory_tier_dev_attrs[] = {
+	&dev_attr_nodelist.attr,
+	&dev_attr_rank.attr,
+	NULL
+};
+
+static const struct attribute_group memory_tier_dev_group = {
+	.attrs = memory_tier_dev_attrs,
+};
+
+static const struct attribute_group *memory_tier_dev_groups[] = {
+	&memory_tier_dev_group,
+	NULL
+};
+
+static void memory_tier_device_release(struct device *dev)
+{
+	struct memory_tier *tier = to_memory_tier(dev);
+
+	kfree(tier);
+}
+
+/*
+ * Keep it simple by having  direct mapping between
+ * tier index and rank value.
+ */
+static inline int get_rank_from_tier(unsigned int tier)
+{
+	switch (tier) {
+	case MEMORY_TIER_HBM_GPU:
+		return MEMORY_RANK_HBM_GPU;
+	case MEMORY_TIER_DRAM:
+		return MEMORY_RANK_DRAM;
+	case MEMORY_TIER_PMEM:
+		return MEMORY_RANK_PMEM;
+	}
+
+	return 0;
+}
+
+static void insert_memory_tier(struct memory_tier *memtier)
+{
+	struct list_head *ent;
+	struct memory_tier *tmp_memtier;
+
+	list_for_each(ent, &memory_tiers) {
+		tmp_memtier = list_entry(ent, struct memory_tier, list);
+		if (tmp_memtier->rank < memtier->rank) {
+			list_add_tail(&memtier->list, ent);
+			return;
+		}
+	}
+	list_add_tail(&memtier->list, &memory_tiers);
+}
+
+static struct memory_tier *register_memory_tier(unsigned int tier)
+{
+	int error;
+	struct memory_tier *memtier;
+
+	if (tier >= MAX_MEMORY_TIERS)
+		return NULL;
+
+	memtier = kzalloc(sizeof(struct memory_tier), GFP_KERNEL);
+	if (!memtier)
+		return NULL;
+
+	memtier->dev.id = tier;
+	memtier->rank = get_rank_from_tier(tier);
+	memtier->dev.bus = &memory_tier_subsys;
+	memtier->dev.release = memory_tier_device_release;
+	memtier->dev.groups = memory_tier_dev_groups;
+
+	insert_memory_tier(memtier);
+
+	error = device_register(&memtier->dev);
+	if (error) {
+		list_del(&memtier->list);
+		put_device(&memtier->dev);
+		return NULL;
+	}
+	return memtier;
+}
+
+__maybe_unused // temporary, to prevent warnings during bisects
+static void unregister_memory_tier(struct memory_tier *memtier)
+{
+	list_del(&memtier->list);
+	device_unregister(&memtier->dev);
+}
+
+static ssize_t
+max_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "%d\n", MAX_MEMORY_TIERS);
+}
+static DEVICE_ATTR_RO(max_tier);
+
+static ssize_t
+default_tier_show(struct device *dev, struct device_attribute *attr, char *buf)
+{
+	return sysfs_emit(buf, "memtier%d\n", DEFAULT_MEMORY_TIER);
+}
+static DEVICE_ATTR_RO(default_tier);
+
+static struct attribute *memory_tier_attrs[] = {
+	&dev_attr_max_tier.attr,
+	&dev_attr_default_tier.attr,
+	NULL
+};
+
+static const struct attribute_group memory_tier_attr_group = {
+	.attrs = memory_tier_attrs,
+};
+
+static const struct attribute_group *memory_tier_attr_groups[] = {
+	&memory_tier_attr_group,
+	NULL,
+};
+
+static int __init memory_tier_init(void)
+{
+	int ret;
+	struct memory_tier *memtier;
+
+	ret = subsys_system_register(&memory_tier_subsys, memory_tier_attr_groups);
+	if (ret)
+		panic("%s() failed to register subsystem: %d\n", __func__, ret);
+
+	/*
+	 * Register only default memory tier to hide all empty
+	 * memory tier from sysfs.
+	 */
+	memtier = register_memory_tier(DEFAULT_MEMORY_TIER);
+	if (!memtier)
+		panic("%s() failed to register memory tier: %d\n", __func__, ret);
+
+	/* CPU only nodes are not part of memory tiers. */
+	memtier->nodelist = node_states[N_MEMORY];
+
+	return 0;
+}
+subsys_initcall(memory_tier_init);
+
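
For reference, with only the default DRAM tier registered by this patch,
the resulting sysfs layout looks roughly like the following (node numbers
are illustrative):

# ls /sys/devices/system/memtier/
default_tier  max_tier  memtier1  power  uevent
# cat /sys/devices/system/memtier/max_tier
3
# cat /sys/devices/system/memtier/default_tier
memtier1
# cat /sys/devices/system/memtier/memtier1/nodelist
0-1
# cat /sys/devices/system/memtier/memtier1/rank
200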