[RFC,v2,0/2] Node migration between memory tiers

Message ID	20231213175329.594-1-sthanneeru.opensrc@micron.com
From: <sthanneeru.opensrc@micron.com>
To: <sthanneeru.opensrc@micron.com>, <linux-cxl@vger.kernel.org>, <linux-mm@kvack.org>
CC: <sthanneeru@micron.com>, <aneesh.kumar@linux.ibm.com>, <dan.j.williams@intel.com>, <ying.huang@intel.com>, <gregory.price@memverge.com>, <mhocko@suse.com>, <tj@kernel.org>, <john@jagalactic.com>, <emirakhur@micron.com>, <vtavarespetr@micron.com>, <Ravis.OpenSrc@micron.com>, <Jonathan.Cameron@huawei.com>, <linux-kernel@vger.kernel.org>
Subject: [RFC PATCH v2 0/2] Node migration between memory tiers
Date: Wed, 13 Dec 2023 23:23:27 +0530

Series: Node migration between memory tiers
  [RFC,v2,0/2] Node migration between memory tiers
  [1/2] base/node: Add sysfs for memtier_override
  [2/2] memory tier: Support node migration between tiers

Srinivasulu Opensrc Dec. 13, 2023, 5:53 p.m. UTC

From: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>

The memory tiers feature allows nodes with similar memory types
or performance characteristics to be grouped together in a
memory tier. However, there is currently no provision for
moving a node from one tier to another on demand.

This patch series aims to support node migration between tiers
on demand by sysadmin/root user using the provided sysfs for
node migration.

To migrate a node to a tier, write the target tier ID to that node's
memtier_override sysfs attribute.

Example: Move node2 from its default tier (i.e., 4) to memory tier 2

1. To check the current memtier of node2
$ cat /sys/devices/system/node/node2/memtier_override
memory_tier4

2. To migrate node2 to memory_tier2
$ echo 2 > /sys/devices/system/node/node2/memtier_override
$ cat /sys/devices/system/node/node2/memtier_override
memory_tier2
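
The two steps above could be wrapped in a small helper along the following lines. This is only a sketch: memtier_override is the attribute proposed by this series, so it exists only on kernels carrying these patches, and the NODE_SYSFS_ROOT variable and the move_node_to_tier name are hypothetical, introduced here so the helper can be pointed at a scratch directory for a dry run.

```shell
#!/bin/sh
# Sketch only: memtier_override is the attribute proposed by this RFC.
# NODE_SYSFS_ROOT is a hypothetical knob for dry runs against a scratch
# directory; by default it points at the real node sysfs directory.
NODE_SYSFS_ROOT="${NODE_SYSFS_ROOT:-/sys/devices/system/node}"

# move_node_to_tier NODE_ID TIER_ID
# Prints the attribute value before and after the write.
# Returns non-zero if the attribute is missing or the write fails.
move_node_to_tier() {
    node_id="$1"
    tier_id="$2"
    attr="$NODE_SYSFS_ROOT/node$node_id/memtier_override"

    if [ ! -w "$attr" ]; then
        echo "node$node_id: $attr missing or not writable" >&2
        return 1
    fi

    echo "before: $(cat "$attr")"
    # The attribute takes the bare tier id (e.g. 2), as in the example above.
    echo "$tier_id" > "$attr" || return 1
    echo "after:  $(cat "$attr")"
}
```

Run as root, `move_node_to_tier 2 2` would perform the same write as step 2 and print the value read back from the attribute.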

Use cases:

1. Useful to move CXL nodes to the right tiers from userspace, when
   the hardware fails to assign the tiers correctly based on
   memory types.

   On some platforms we have observed CXL memory being assigned to
   the same tier as DDR memory. This is arguably a system firmware
   bug, but it is true that tiers represent *ranges* of performance
   and we believe it's important for the system operator to have
   the ability to override bad firmware or OS decisions about tier
   assignment as a fail-safe against potential bad outcomes.

2. Useful if we want interleave weights to be applied on memory tiers
   instead of nodes.

   In a previous thread, Huang Ying <ying.huang@intel.com> thought
   this feature might be useful to overcome limitations of systems
   where nodes with different bandwidth characteristics are grouped
   in a single tier.
   https://lore.kernel.org/lkml/87a5rw1wu8.fsf@yhuang6-desk2.ccr.corp.intel.com/
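
To check whether a system shows the situation described in use case 1, the existing memory-tiering sysfs directory can be listed. The sketch below assumes the mainline layout /sys/devices/virtual/memory_tiering/memory_tierN/nodelist; the MEMTIER_SYSFS_ROOT variable and the list_tiers name are hypothetical, added only so the helper can be tried against a scratch directory.

```shell
#!/bin/sh
# MEMTIER_SYSFS_ROOT is a hypothetical override for dry runs; the default
# is assumed to be the mainline memory-tiering sysfs directory.
MEMTIER_SYSFS_ROOT="${MEMTIER_SYSFS_ROOT:-/sys/devices/virtual/memory_tiering}"

# list_tiers: print one "memory_tierN: nodelist" line per memory tier.
list_tiers() {
    for tier_dir in "$MEMTIER_SYSFS_ROOT"/memory_tier*; do
        # Skip the unexpanded glob when no tier directories exist.
        [ -r "$tier_dir/nodelist" ] || continue
        echo "$(basename "$tier_dir"): $(cat "$tier_dir/nodelist")"
    done
}
```

If a single tier's nodelist contains both the DRAM node IDs and the CXL node IDs, the platform has grouped them together, which is the case the override is meant to correct.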

=============
Version Notes:

V2 : Changed interface to memtier_override from adistance_offset.
memtier_override was recommended by
1. John Groves <john@jagalactic.com>
2. Ravi Shankar <ravis.opensrc@micron.com>
3. Brice Goglin <Brice.Goglin@inria.fr>

V1 : Introduced adistance_offset sysfs.

=============

Srinivasulu Thanneeru (2):
  base/node: Add sysfs for memtier_override
  memory tier: Support node migration between tiers

 Documentation/ABI/stable/sysfs-devices-node |  7 ++
 drivers/base/node.c                         | 47 ++++++++++++
 include/linux/memory-tiers.h                | 11 +++
 include/linux/node.h                        | 11 +++
 mm/memory-tiers.c                           | 85 ++++++++++++---------
 5 files changed, 125 insertions(+), 36 deletions(-)

Huang, Ying Dec. 15, 2023, 5:02 a.m. UTC | #1

<sthanneeru.opensrc@micron.com> writes:

> From: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
>
> The memory tiers feature allows nodes with similar memory types
> or performance characteristics to be grouped together in a
> memory tier. However, there is currently no provision for
> moving a node from one tier to another on demand.
>
> This patch series aims to support node migration between tiers
> on demand by sysadmin/root user using the provided sysfs for
> node migration.
>
> To migrate a node to a tier, the corresponding node’s sysfs
> memtier_override is written with target tier id.
>
> Example: Move node2 to memory tier2 from its default tier(i.e 4)
>
> 1. To check current memtier of node2
> $cat  /sys/devices/system/node/node2/memtier_override
> memory_tier4
>
> 2. To migrate node2 to memory_tier2
> $echo 2 > /sys/devices/system/node/node2/memtier_override
> $cat  /sys/devices/system/node/node2/memtier_override
> memory_tier2
>
> Usecases:
>
> 1. Useful to move cxl nodes to the right tiers from userspace, when
>    the hardware fails to assign the tiers correctly based on
>    memorytypes.
>
>    On some platforms we have observed cxl memory being assigned to
>    the same tier as DDR memory. This is arguably a system firmware
>    bug, but it is true that tiers represent *ranges* of performance
>    and we believe it's important for the system operator to have
>    the ability to override bad firmware or OS decisions about tier
>    assignment as a fail-safe against potential bad outcomes.
>
> 2. Useful if we want interleave weights to be applied on memory tiers
>    instead of nodes.
> In a previous thread, Huang Ying <ying.huang@intel.com> thought
> this feature might be useful to overcome limitations of systems
> where nodes with different bandwidth characteristics are grouped
> in a single tier.
> https://lore.kernel.org/lkml/87a5rw1wu8.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> =============
> Version Notes:
>
> V2 : Changed interface to memtier_override from adistance_offset.
> memtier_override was recommended by
> 1. John Groves <john@jagalactic.com>
> 2. Ravi Shankar <ravis.opensrc@micron.com>
> 3. Brice Goglin <Brice.Goglin@inria.fr>

It appears that you ignored my comments for V1 as follows ...

https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/

--
Best Regards,
Huang, Ying

> V1 : Introduced adistance_offset sysfs.
>
> =============
>
> Srinivasulu Thanneeru (2):
>   base/node: Add sysfs for memtier_override
>   memory tier: Support node migration between tiers
>
>  Documentation/ABI/stable/sysfs-devices-node |  7 ++
>  drivers/base/node.c                         | 47 ++++++++++++
>  include/linux/memory-tiers.h                | 11 +++
>  include/linux/node.h                        | 11 +++
>  mm/memory-tiers.c                           | 85 ++++++++++++---------
>  5 files changed, 125 insertions(+), 36 deletions(-)

Gregory Price Dec. 15, 2023, 5:42 p.m. UTC | #2

On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
> <sthanneeru.opensrc@micron.com> writes:
> 
> > =============
> > Version Notes:
> >
> > V2 : Changed interface to memtier_override from adistance_offset.
> > memtier_override was recommended by
> > 1. John Groves <john@jagalactic.com>
> > 2. Ravi Shankar <ravis.opensrc@micron.com>
> > 3. Brice Goglin <Brice.Goglin@inria.fr>
> 
> It appears that you ignored my comments for V1 as follows ...
> 
> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
> https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/
> 

Not speaking for the group, just chiming in because I'd discussed it
with them.

"Memory Type" is a bit nebulous.  Is a Micron Type-3 with performance X
and an SK Hynix Type-3 with performance Y a "Different type", or are
they the "Same Type" given that they're both Type 3 backed by some form
of DDR?  Is socket placement of those devices relevant for determining
"Type"?  Is whether they are behind a switch relevant for determining
"Type"? "Type" is frustrating when everything we're talking about
managing is "Type-3" with different performance.

A concrete example:
To the system, a Multi-Headed Single Logical Device (MH-SLD) looks
exactly the same as a standard SLD.  I may want to have some
combination of local memory expansion devices on the majority of my
expansion slots, but reserve 1 slot on each socket for a connection to
the MH-SLD.  As of right now, there is no good way to differentiate the
devices in terms of "Type", and even if you had that, the tiering
system would still lump them together.

Similarly, an initial run of switches may or may not allow enumeration
of devices behind them (depending on the configuration), so you may end up
with a static NUMA node that "looks like" another SLD despite it being
some definition of "GFAM".  Does the number of hops matter in determining
"Type"?

So I really don't think "Type" is useful for determining tier placement.

As of right now, the system lumps DRAM nodes as one tier, and pretty
much everything else as "the other tier". To me, this patch set is an
initial pass meant to allow user-control over tier composition while
the internal mechanism is sussed out and the environment develops.

In general, a release valve that lets you redefine tiers is very welcome
for testing and validation of different setups while the industry evolves.

Just my two cents.

~Gregory

> --
> Best Regards,
> Huang, Ying
>

Huang, Ying Dec. 18, 2023, 5:55 a.m. UTC | #3

Gregory Price <gregory.price@memverge.com> writes:

> On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
>> <sthanneeru.opensrc@micron.com> writes:
>> 
>> > =============
>> > Version Notes:
>> >
>> > V2 : Changed interface to memtier_override from adistance_offset.
>> > memtier_override was recommended by
>> > 1. John Groves <john@jagalactic.com>
>> > 2. Ravi Shankar <ravis.opensrc@micron.com>
>> > 3. Brice Goglin <Brice.Goglin@inria.fr>
>> 
>> It appears that you ignored my comments for V1 as follows ...
>> 
>> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/
>> 
>
> Not speaking for the group, just chiming in because i'd discussed it
> with them.
>
> "Memory Type" is a bit nebulous.  Is a Micron Type-3 with performance X
> and an SK Hynix Type-3 with performance Y a "Different type", or are
> they the "Same Type" given that they're both Type 3 backed by some form
> of DDR?  Is socket placement of those devices relevant for determining
> "Type"?  Is whether they are behind a switch relevant for determining
> "Type"? "Type" is frustrating when everything we're talking about
> managing is "Type-3" with difference performance.
>
> A concrete example:
> To the system, a Multi-Headed Single Logical Device (MH-SLD) looks
> exactly the same as an standard SLD.  I may want to have some
> combination of local memory expansion devices on the majority of my
> expansion slots, but reserve 1 slot on each socket for a connection to
> the MH-SLD.   As of right now: There is no good way to differentiate the
> devices in terms of "Type" - and even if you had that, the tiering
> system would still lump them together.
>
> Similarly, an initial run of switches may or may not allow enumeration
> of devices behind it (depends on the configuration), so you may end up
> with a static numa node that "looks like" another SLD - despite it being
> some definition of "GFAM".  Do number of hops matter in determining
> "Type"?

In the original design, memory devices of the same memory type are
managed by the same device driver, linked to the system in the same way
(including switches), and built with the same media.  So, their
performance is the same too.  And, like memory tiers, memory types are
orthogonal to sockets.  Do you think the definition itself is clear enough?

I admit "memory type" is a confusing name.  Do you have a better
suggestion?

> So I really don't think "Type" is useful for determining tier placement.
>
> As of right now, the system lumps DRAM nodes as one tier, and pretty
> much everything else as "the other tier". To me, this patch set is an
> initial pass meant to allow user-control over tier composition while
> the internal mechanism is sussed out and the environment develops.

The patchset to identify the performance of memory devices and put them
in proper "memory types" and memory tiers via HMAT has been merged as of
v6.7-rc1.

      07a8bdd4120c (memory tiering: add abstract distance calculation algorithms management, 2023-09-26)
      d0376aac59a1 (acpi, hmat: refactor hmat_register_target_initiators(), 2023-09-26)
      3718c02dbd4c (acpi, hmat: calculate abstract distance with HMAT, 2023-09-26)
      6bc2cfdf82d5 (dax, kmem: calculate abstract distance with general interface, 2023-09-26)
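
As a rough illustration of how an abstract distance maps to a tier, the arithmetic can be sketched as below. The constants are assumptions modelled on include/linux/memory-tiers.h as recalled here (a chunk of 2^7 abstract-distance units per tier, default DRAM placed at 4.5 chunks, tier id taken as the abstract distance shifted right by the chunk bits); they should be checked against the kernel tree in use. The function names are hypothetical.

```shell
#!/bin/sh
# Assumed constant, modelled on include/linux/memory-tiers.h; verify it
# against your kernel tree before relying on the numbers.
MEMTIER_CHUNK_BITS="${MEMTIER_CHUNK_BITS:-7}"

# adistance_to_tier ADISTANCE: print the tier id for an abstract distance,
# assuming tier id = adistance >> MEMTIER_CHUNK_BITS.
adistance_to_tier() {
    echo $(( $1 >> MEMTIER_CHUNK_BITS ))
}

# dram_adistance: assumed default DRAM abstract distance, 4.5 chunks.
dram_adistance() {
    chunk=$(( 1 << MEMTIER_CHUNK_BITS ))
    echo $(( 4 * chunk + (chunk >> 1) ))
}
```

Under these assumptions, `adistance_to_tier "$(dram_adistance)"` yields 4, consistent with DRAM appearing in memory_tier4 in this thread. A device whose computed abstract distance falls in the same chunk would land in that tier as well.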

> In general, a release valve that lets you redefine tiers is very welcome
> for testing and validation of different setups while the industry evolves.
>
> Just my two cents.

--
Best Regards,
Huang, Ying

Srinivasulu Thanneeru Dec. 18, 2023, 8:56 a.m. UTC | #4

Micron Confidential




Huang, Ying Dec. 19, 2023, 3:57 a.m. UTC | #5

Hi, Srinivasulu,

Please use an email client that works for kernel patch review.  Your
email is hard to read.  It's hard to identify which part is your text
and which part is my text.  Please refer to,

https://www.kernel.org/doc/html/latest/process/email-clients.html

Or something similar, for example,

https://elinux.org/Mail_client_tips

Srinivasulu Thanneeru <sthanneeru@micron.com> writes:

> <sthanneeru.opensrc@micron.com> writes:
>
>> From: Srinivasulu Thanneeru <sthanneeru.opensrc@micron.com>
>>
>> The memory tiers feature allows nodes with similar memory types
>> or performance characteristics to be grouped together in a
>> memory tier. However, there is currently no provision for
>> moving a node from one tier to another on demand.
>>
>> This patch series aims to support node migration between tiers
>> on demand by sysadmin/root user using the provided sysfs for
>> node migration.
>>
>> To migrate a node to a tier, the corresponding node’s sysfs
>> memtier_override is written with target tier id.
>>
>> Example: Move node2 to memory tier2 from its default tier(i.e 4)
>>
>> 1. To check current memtier of node2
>> $cat  /sys/devices/system/node/node2/memtier_override
>> memory_tier4
>>
>> 2. To migrate node2 to memory_tier2
>> $echo 2 > /sys/devices/system/node/node2/memtier_override
>> $cat  /sys/devices/system/node/node2/memtier_override
>> memory_tier2
>>
>> Usecases:
>>
>> 1. Useful to move cxl nodes to the right tiers from userspace, when
>>    the hardware fails to assign the tiers correctly based on
>>    memorytypes.
>>
>>    On some platforms we have observed cxl memory being assigned to
>>    the same tier as DDR memory. This is arguably a system firmware
>>    bug, but it is true that tiers represent *ranges* of performance
>>    and we believe it's important for the system operator to have
>>    the ability to override bad firmware or OS decisions about tier
>>    assignment as a fail-safe against potential bad outcomes.
>>
>> 2. Useful if we want interleave weights to be applied on memory tiers
>>    instead of nodes.
>> In a previous thread, Huang Ying <ying.huang@intel.com> thought
>> this feature might be useful to overcome limitations of systems
>> where nodes with different bandwidth characteristics are grouped
>> in a single tier.
>> https://lore.kernel.org/lkml/87a5rw1wu8.fsf@yhuang6-desk2.ccr.corp.intel.com/
>>
>> =============
>> Version Notes:
>>
>> V2 : Changed interface to memtier_override from adistance_offset.
>> memtier_override was recommended by
>> 1. John Groves <john@jagalactic.com>
>> 2. Ravi Shankar <ravis.opensrc@micron.com>
>> 3. Brice Goglin <Brice.Goglin@inria.fr>
>
> It appears that you ignored my comments for V1 as follows ...
>
> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> Thank you Huang, Ying for pointing to this.
>
> https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>
> In the presentation above, the adistance_offsets are per memtype.
> We believe that adistance_offset per node is more suitable and flexible
> since we can change it per node. If we keep adistance_offset per memtype,
> then we cannot change it for a specific node of a given memtype.

Why do you need to change it for a specific node?  Why not change it
for all nodes of a given memtype?

> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> I guess that you need to move all NUMA nodes with same performance
> metrics together?  If so, that is why we previously proposed to place
> the knob in "memory_type". (From: Huang, Ying)
>
> Yes, memory_type would group the related memories together as a single tier.
> We should also have the flexibility to move nodes between tiers, to address
> the issues described in the use cases above.
>
> https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> This patch provides a way to move a node to the correct tier.
> We observed in test setups where DRAM and CXL are put under the same
> tier (memory_tier4).
> By using this patch, we can move the CXL node away from the DRAM-linked
> tier4 and put it in the desired tier.

Good!  Can you give more details?  So I can resend the patch with your
supporting data.

--
Best Regards,
Huang, Ying

> Regards,
> Srini
>
> --
> Best Regards,
> Huang, Ying
>
>> V1 : Introduced adistance_offset sysfs.
>>
>> =============
>>
>> Srinivasulu Thanneeru (2):
>>   base/node: Add sysfs for memtier_override
>>   memory tier: Support node migration between tiers
>>
>>  Documentation/ABI/stable/sysfs-devices-node |  7 ++
>>  drivers/base/node.c                         | 47 ++++++++++++
>>  include/linux/memory-tiers.h                | 11 +++
>>  include/linux/node.h                        | 11 +++
>>  mm/memory-tiers.c                           | 85 ++++++++++++---------
>>  5 files changed, 125 insertions(+), 36 deletions(-)

Srinivasulu Thanneeru Jan. 3, 2024, 5:26 a.m. UTC | #6

Hi Huang, Ying,

My apologies for the wrong mail reply format; my mail client settings got changed on my PC.
Please find comments below inline.

Regards,
Srini


> Gregory Price <gregory.price@memverge.com> writes:
>
> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
> >> <sthanneeru.opensrc@micron.com> writes:
> >>
> >> > =============
> >> > Version Notes:
> >> >
> >> > V2 : Changed interface to memtier_override from adistance_offset.
> >> > memtier_override was recommended by
> >> > 1. John Groves <john@jagalactic.com>
> >> > 2. Ravi Shankar <ravis.opensrc@micron.com>
> >> > 3. Brice Goglin <Brice.Goglin@inria.fr>
> >>
> >> It appears that you ignored my comments for V1 as follows ...
> >>
> >>
> >> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/

Thank you, Huang, Ying for pointing to this.
https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf

In the presentation above, the adistance_offsets are per memtype.
We believe that adistance_offset per node is more suitable and flexible,
since we can change it per node. If we keep adistance_offset per memtype,
then we cannot change it for a specific node of a given memtype.

> >>
> >> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/

Yes, memory_type would group the related memories together as a single tier.
We should also have the flexibility to move nodes between tiers, to address the issues
described in the use cases above.

> >>
> >> https://lore.kernel.org/lkml/87a5qp2et0.fsf@yhuang6-desk2.ccr.corp.intel.com/

This patch provides a way to move a node to the correct tier.
We observed test setups where DRAM and CXL are put under the same
tier (memory_tier4).
By using this patch, we can move the CXL node away from the DRAM-linked tier (memory_tier4)
and put it in the desired tier.

> >>
> >
> > Not speaking for the group, just chiming in because i'd discussed it
> > with them.
> >
> > "Memory Type" is a bit nebulous.  Is a Micron Type-3 with performance X
> > and an SK Hynix Type-3 with performance Y a "Different type", or are
> > they the "Same Type" given that they're both Type 3 backed by some form
> > of DDR?  Is socket placement of those devices relevant for determining
> > "Type"?  Is whether they are behind a switch relevant for determining
> > "Type"? "Type" is frustrating when everything we're talking about
> > managing is "Type-3" with difference performance.
> >
> > A concrete example:
> > To the system, a Multi-Headed Single Logical Device (MH-SLD) looks
> > exactly the same as an standard SLD.  I may want to have some
> > combination of local memory expansion devices on the majority of my
> > expansion slots, but reserve 1 slot on each socket for a connection to
> > the MH-SLD.   As of right now: There is no good way to differentiate the
> > devices in terms of "Type" - and even if you had that, the tiering
> > system would still lump them together.
> >
> > Similarly, an initial run of switches may or may not allow enumeration
> > of devices behind it (depends on the configuration), so you may end up
> > with a static numa node that "looks like" another SLD - despite it being
> > some definition of "GFAM".  Do number of hops matter in determining
> > "Type"?
>
> In the original design, the memory devices of same memory type are
> managed by the same device driver, linked with system in same way
> (including switches), built with same media.  So, the performance is
> same too.  And, same as memory tiers, memory types are orthogonal to
> sockets.  Do you think the definition itself is clear enough?
>
> I admit "memory type" is a confusing name.  Do you have some better
> suggestion?
>
> > So I really don't think "Type" is useful for determining tier placement.
> >
> > As of right now, the system lumps DRAM nodes as one tier, and pretty
> > much everything else as "the other tier". To me, this patch set is an
> > initial pass meant to allow user-control over tier composition while
> > the internal mechanism is sussed out and the environment develops.
>
> The patchset to identify the performance of memory devices and put them
> in proper "memory types" and memory tiers via HMAT has been merged by
> v6.7-rc1.
>
>       07a8bdd4120c (memory tiering: add abstract distance calculation algorithms management, 2023-09-26)
>       d0376aac59a1 (acpi, hmat: refactor hmat_register_target_initiators(), 2023-09-26)
>       3718c02dbd4c (acpi, hmat: calculate abstract distance with HMAT, 2023-09-26)
>       6bc2cfdf82d5 (dax, kmem: calculate abstract distance with general interface, 2023-09-26)
>
> > In general, a release valve that lets you redefine tiers is very welcome
> > for testing and validation of different setups while the industry evolves.
> >
> > Just my two cents.
>
> --
> Best Regards,
> Huang, Ying

Huang, Ying Jan. 3, 2024, 6:07 a.m. UTC | #7

Srinivasulu Thanneeru <sthanneeru@micron.com> writes:

> Hi Huang, Ying,
>
> My apologies for the wrong mail reply format; my mail client settings got changed on my PC.
> Please find comments below inline.
>
> Regards,
> Srini
>
>
>> Gregory Price <gregory.price@memverge.com> writes:
>>
>> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
>> >> <sthanneeru.opensrc@micron.com> writes:
>> >>
>> >> > =============
>> >> > Version Notes:
>> >> >
>> >> > V2 : Changed interface to memtier_override from adistance_offset.
>> >> > memtier_override was recommended by
>> >> > 1. John Groves <john@jagalactic.com>
>> >> > 2. Ravi Shankar <ravis.opensrc@micron.com>
>> >> > 3. Brice Goglin <Brice.Goglin@inria.fr>
>> >>
>> >> It appears that you ignored my comments for V1 as follows ...
>> >>
>> >>
>> >> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> Thank you, Huang, Ying for pointing to this.
> https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>
> In the presentation above, the adistance_offsets are per memtype.
> We believe that adistance_offset per node is more suitable and flexible,
> since we can change it per node. If we keep adistance_offset per memtype,
> then we cannot change it for a specific node of a given memtype.
>
>> >>
>> >> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
>
> Yes, memory_type would group the related memories together as a single tier.
> We should also have the flexibility to move nodes between tiers, to address the issues
> described in the use cases above.

We don't pursue absolute flexibility.  We add necessary flexibility
only.  Why do you need this kind of flexibility?  Can you provide some
use cases where memory_type based "adistance_offset" doesn't work?

--
Best Regards,
Huang, Ying

Srinivasulu Thanneeru Jan. 3, 2024, 7:56 a.m. UTC | #8

> Srinivasulu Thanneeru <sthanneeru@micron.com> writes:
>
> > Hi Huang, Ying,
> >
> > My apologies for the wrong mail reply format; my mail client settings got changed on my PC.
> > Please find comments below inline.
> >
> > Regards,
> > Srini
> >
> >
> >> Gregory Price <gregory.price@memverge.com> writes:
> >>
> >> > On Fri, Dec 15, 2023 at 01:02:59PM +0800, Huang, Ying wrote:
> >> >> <sthanneeru.opensrc@micron.com> writes:
> >> >>
> >> >> > =============
> >> >> > Version Notes:
> >> >> >
> >> >> > V2 : Changed interface to memtier_override from adistance_offset.
> >> >> > memtier_override was recommended by
> >> >> > 1. John Groves <john@jagalactic.com>
> >> >> > 2. Ravi Shankar <ravis.opensrc@micron.com>
> >> >> > 3. Brice Goglin <Brice.Goglin@inria.fr>
> >> >>
> >> >> It appears that you ignored my comments for V1 as follows ...
> >> >>
> >> >>
> >> >> https://lore.kernel.org/lkml/87o7f62vur.fsf@yhuang6-desk2.ccr.corp.intel.com/
> >
> > Thank you, Huang, Ying for pointing to this.
> >
> > https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> >
> > In the presentation above, the adistance_offsets are per memtype.
> > We believe that a per-node adistance_offset is more suitable and flexible,
> > since we can change it per node. If we keep adistance_offset per memtype,
> > then we cannot change it for a specific node of a given memtype.
> >
> >> >>
> >> >> https://lore.kernel.org/lkml/87jzpt2ft5.fsf@yhuang6-desk2.ccr.corp.intel.com/
> >
> > Yes, memory_type would be grouping the related memories together as a
> > single tier.
> > We should also have the flexibility to move nodes between tiers, to address
> > the issues described in the use cases above.
>
> We don't pursue absolute flexibility.  We add necessary flexibility
> only.  Why do you need this kind of flexibility?  Can you provide some
> use cases where memory_type based "adistance_offset" doesn't work?

- /sys/devices/virtual/memory_type/memory_type/adistance_offset
The memory_type based "adistance_offset" provides a way to move all nodes of the same memory_type (e.g. all CXL nodes)
to a different tier.

By contrast, /sys/devices/system/node/node2/memtier_override provides a way to migrate a single node from one tier to another.
Consider a case where we would like to move two CXL nodes into two different tiers in the future.
So, I thought it would be good to have the flexibility at the node level instead of at the memory_type level.
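
To illustrate the difference between the two interfaces, here is a toy
model (not kernel code). The topology, the abstract distance values, the
128-wide tier chunk and the helper names are all assumptions made for
this example. A per-memory_type offset shifts every node of that type
together, while a per-node override moves only the node it is written to.

```python
# Toy model only: chunk size and adistance values are illustrative
# assumptions, not values taken from the patch series.
CHUNK_BITS = 7  # assumed: one tier per 128-wide abstract-distance chunk


def tier_of(adistance: int) -> int:
    """Map an abstract distance to a tier id by chunk."""
    return adistance >> CHUNK_BITS


# node id -> memory_type (hypothetical: two CXL nodes share one type)
node_type = {0: "dram", 1: "cxl", 2: "cxl"}
# memory_type -> base abstract distance (hypothetical values)
type_adistance = {"dram": 576, "cxl": 1088}


def tiers(type_offset=None, node_override=None):
    """Return {node: tier} for a per-type offset and/or per-node override."""
    type_offset = type_offset or {}
    node_override = node_override or {}
    result = {}
    for node, mtype in node_type.items():
        adist = type_adistance[mtype] + type_offset.get(mtype, 0)
        # A per-node override, when present, wins over the computed tier.
        result[node] = node_override.get(node, tier_of(adist))
    return result


baseline = tiers()                         # both CXL nodes share a tier
by_type = tiers(type_offset={"cxl": 128})  # both CXL nodes move together
by_node = tiers(node_override={2: 9})      # only node 2 moves
print(baseline, by_type, by_node)
```

In this model the per-type offset can never separate node 1 from node 2,
which is the case the per-node interface is meant to cover.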

>
> --
> Best Regards,
> Huang, Ying

Huang, Ying Jan. 3, 2024, 8:29 a.m. UTC | #9

Srinivasulu Thanneeru <sthanneeru@micron.com> writes:

>> We don't pursue absolute flexibility.  We add necessary flexibility
>> only.  Why do you need this kind of flexibility?  Can you provide some
>> use cases where memory_type based "adistance_offset" doesn't work?
>
> - /sys/devices/virtual/memory_type/memory_type/adistance_offset
> The memory_type based "adistance_offset" provides a way to move all nodes of the same memory_type (e.g. all CXL nodes)
> to a different tier.

We will not put the CXL nodes with different performance metrics in one
memory_type.  If so, do you still need to move one of them?

> By contrast, /sys/devices/system/node/node2/memtier_override provides a way to migrate a single node from one tier to another.
> Consider a case where we would like to move two CXL nodes into two different tiers in the future.
> So, I thought it would be good to have the flexibility at the node level instead of at the memory_type level.

--
Best Regards,
Huang, Ying

Srinivasulu Thanneeru Jan. 3, 2024, 8:47 a.m. UTC | #10

> On Wednesday, January 3, 2024 2:00 PM, Huang, Ying <ying.huang@intel.com> wrote:
>
> Srinivasulu Thanneeru <sthanneeru@micron.com> writes:
>
> >> We don't pursue absolute flexibility.  We add necessary flexibility
> >> only.  Why do you need this kind of flexibility?  Can you provide some
> >> use cases where memory_type based "adistance_offset" doesn't work?
> >
> > - /sys/devices/virtual/memory_type/memory_type/adistance_offset
> > The memory_type based "adistance_offset" provides a way to move all nodes
> > of the same memory_type (e.g. all CXL nodes)
> > to a different tier.
>
> We will not put the CXL nodes with different performance metrics in one
> memory_type.  If so, do you still need to move one of them?

From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
abstract_distance_offset: override by users to deal with firmware issue.

Say the firmware configures a CXL node into the wrong tier. Similarly, it may also configure all CXL nodes into a single memtype, so that all of these nodes fall into a single wrong tier.
In this case, would a per-node adistance_offset be good to have?

--
Srini
> > By contrast, /sys/devices/system/node/node2/memtier_override provides a
> > way to migrate a single node from one tier to another.
> > Consider a case where we would like to move two CXL nodes into two
> > different tiers in the future.
> > So, I thought it would be good to have the flexibility at the node level
> > instead of at the memory_type level.
>
> --
> Best Regards,
> Huang, Ying

Huang, Ying Jan. 4, 2024, 6:05 a.m. UTC | #11

Srinivasulu Thanneeru <sthanneeru@micron.com> writes:

>> On Wednesday, January 3, 2024 2:00 PM, Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Srinivasulu Thanneeru <sthanneeru@micron.com> writes:
>>
>> >> We don't pursue absolute flexibility.  We add necessary flexibility
>> >> only.  Why do you need this kind of flexibility?  Can you provide some
>> >> use cases where memory_type based "adistance_offset" doesn't work?
>> >
>> > - /sys/devices/virtual/memory_type/memory_type/adistance_offset
>> > The memory_type based "adistance_offset" provides a way to move all nodes
>> > of the same memory_type (e.g. all CXL nodes)
>> > to a different tier.
>>
>> We will not put the CXL nodes with different performance metrics in one
>> memory_type.  If so, do you still need to move one of them?
>
> From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> abstract_distance_offset: override by users to deal with firmware issue.
>
> Say the firmware configures a CXL node into the wrong tier. Similarly,
> it may also configure all CXL nodes into a single memtype, so that
> all of these nodes fall into a single wrong tier.
> In this case, would a per-node adistance_offset be good to have?

I think that it's better to fix the erroneous firmware if possible.  And
these are only theoretical, not practical issues.  Do you have some
practical issues?

I understand that users may want to move nodes between memory tiers for
different policy choices.  For that, memory_type based adistance_offset
should be good.

> --
> Srini
>> > By contrast, /sys/devices/system/node/node2/memtier_override provides a
>> > way to migrate a single node from one tier to another.
>> > Consider a case where we would like to move two CXL nodes into two
>> > different tiers in the future.
>> > So, I thought it would be good to have the flexibility at the node level
>> > instead of at the memory_type level.

--
Best Regards,
Huang, Ying

Gregory Price Jan. 8, 2024, 5:04 p.m. UTC | #12

On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >
> > From https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> > abstract_distance_offset: override by users to deal with firmware issue.
> >
> > Say the firmware configures a CXL node into the wrong tier. Similarly,
> > it may also configure all CXL nodes into a single memtype, so that
> > all of these nodes fall into a single wrong tier.
> > In this case, would a per-node adistance_offset be good to have?
> 
> I think that it's better to fix the erroneous firmware if possible.  And
> these are only theoretical, not practical issues.  Do you have some
> practical issues?
> 
> I understand that users may want to move nodes between memory tiers for
> different policy choices.  For that, memory_type based adistance_offset
> should be good.
> 

There's actually an affirmative case to change memory tiering to allow
either movement of nodes between tiers, or at least base placement on
HMAT information. Preferably, membership would be changeable to allow
hotplug/DCD to be managed (there's no guarantee that the memory passed
through will always be what HMAT says on initial boot).

https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/

This group wants to enable passing CXL memory through to KVM/QEMU
(i.e. host CXL expander memory passed through to the guest), and
allow the guest to apply memory tiering.

There are multiple issues with this, presently:

1. The QEMU CXL virtual device is not and probably never will be
   performant enough for commodity-class virtualization.  The
   reason is that the virtual CXL device is built off the I/O
   virtualization stack, which treats memory accesses as I/O accesses.

   KVM also seems incompatible with the design of the CXL memory device
   in general, but this problem may or may not be a blocker.

   As a result, access to a virtual CXL memory device leads to QEMU
   crawling to a halt - and this is unlikely to change.

   There is presently no good way forward to create a performant virtual
   CXL device in QEMU.  This means the memory tiering component in the
   kernel is functionally useless for virtual CXL memory, because...

2. When passing memory through as an explicit NUMA node, but not as
   part of a CXL memory device, the nodes are lumped together in the
   DRAM tier.

None of this has to do with firmware.

Memory-type is an awful way of denoting membership of a tier, but we
have HMAT information that can be passed through via QEMU:

-object memory-backend-ram,size=4G,id=ram-node0 \
-object memory-backend-ram,size=4G,id=ram-node1 \
-numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
-numa node,initiator=0,nodeid=1,memdev=ram-node1 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
-numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
-numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
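
As a rough illustration of what that command line could mean for tier
placement, here is a small Python model.  It is a sketch under
assumptions, not the kernel code: it assumes the scaling I understand
mt_perf_to_adistance() to use (abstract distance grows with latency and
shrinks with bandwidth, relative to default DRAM), that the single
latency/bandwidth value given to QEMU applies to both reads and writes,
and that node 0 is the DRAM baseline.  The constants are meant to mirror
MEMTIER_CHUNK_BITS and MEMTIER_ADISTANCE_DRAM and should be checked
against your tree; NodePerf, perf_to_adistance and tier_of are names
made up for this sketch.

```python
from dataclasses import dataclass

# Assumed to mirror include/linux/memory-tiers.h; verify against your tree.
MEMTIER_CHUNK_BITS = 10
MEMTIER_CHUNK_SIZE = 1 << MEMTIER_CHUNK_BITS
MEMTIER_ADISTANCE_DRAM = 4 * MEMTIER_CHUNK_SIZE + (MEMTIER_CHUNK_SIZE >> 1)


@dataclass(frozen=True)
class NodePerf:
    """Hypothetical container for one node's HMAT numbers."""
    latency: int    # access latency, same unit for every node
    bandwidth: int  # access bandwidth, same unit for every node


def perf_to_adistance(perf: NodePerf, dram: NodePerf) -> int:
    """Scale the DRAM abstract distance by relative latency and bandwidth."""
    if min(perf.latency, perf.bandwidth, dram.latency, dram.bandwidth) <= 0:
        raise ValueError("missing performance data")
    # Read and write values are assumed equal, so each sum is 2 * value.
    lat, dram_lat = 2 * perf.latency, 2 * dram.latency
    bw, dram_bw = 2 * perf.bandwidth, 2 * dram.bandwidth
    # Integer arithmetic, evaluated left to right.
    return MEMTIER_ADISTANCE_DRAM * lat // dram_lat * dram_bw // bw


def tier_of(adistance: int) -> int:
    """Tier index is the abstract distance rounded down to a chunk."""
    return adistance >> MEMTIER_CHUNK_BITS


# Values taken from the -numa hmat-lb options above.
node0 = NodePerf(latency=10, bandwidth=10485760)
node1 = NodePerf(latency=20, bandwidth=5242880)

for name, perf in (("node0", node0), ("node1", node1)):
    adist = perf_to_adistance(perf, dram=node0)
    print(name, "adistance", adist, "tier", tier_of(adist))
```

Under these assumptions node1 ends up with four times the abstract
distance of node0 (twice the latency, half the bandwidth) and therefore
in a different tier.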

Not only would it be nice if we could change tier membership based on
this data, it's realistically the only way to allow guests to accomplish
memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.

~Gregory

Huang, Ying Jan. 9, 2024, 3:41 a.m. UTC | #13

Gregory Price <gregory.price@memverge.com> writes:

> On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> >
>> > From  https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>> > abstract_distance_offset: override by users to deal with firmware issue.
>> >
>> > say firmware can configure the cxl node into wrong tiers, similar to
>> > that it may also configure all cxl nodes into single memtype, hence
>> > all these nodes can fall into a single wrong tier.
>> > In this case, per node adistance_offset would be good to have ?
>> 
>> I think that it's better to fix the error firmware if possible.  And
>> these are only theoretical, not practical issues.  Do you have some
>> practical issues?
>> 
>> I understand that users may want to move nodes between memory tiers for
>> different policy choices.  For that, memory_type based adistance_offset
>> should be good.
>> 
>
> There's actually an affirmative case to change memory tiering to allow
> either movement of nodes between tiers, or at least base placement on
> HMAT information. Preferably, membership would be changable to allow
> hotplug/DCD to be managed (there's no guarantee that the memory passed
> through will always be what HMAT says on initial boot).

IIUC, from Jonathan Cameron as below, the performance of memory
shouldn't change even for DCD devices.

https://lore.kernel.org/linux-mm/20231103141636.000007e4@Huawei.com/

It's possible for the performance of a NUMA node to change if we
hot-remove a memory device, then hot-add a different memory
device.  It's hoped that the CDAT changes too.

So, all in all, HMAT + CDAT can help us to put the memory device in
appropriate memory tiers.  Now, we have HMAT support in upstream.  We
will work on CDAT support.

--
Best Regards,
Huang, Ying

> https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>
> This group wants to enable passing CXL memory through to KVM/QEMU
> (i.e. host CXL expander memory passed through to the guest), and
> allow the guest to apply memory tiering.
>
> There are multiple issues with this, presently:
>
> 1. The QEMU CXL virtual device is not and probably never will be
>    performant enough to be a commodity class virtualization.  The
>    reason is that the virtual CXL device is built off the I/O
>    virtualization stack, which treats memory accesses as I/O accesses.
>
>    KVM also seems incompatible with the design of the CXL memory device
>    in general, but this problem may or may not be a blocker.
>
>    As a result, access to virtual CXL memory device leads to QEMU
>    crawling to a halt - and this is unlikely to change.
>
>    There is presently no good way forward to create a performant virtual
>    CXL device in QEMU.  This means the memory tiering component in the
>    kernel is functionally useless for virtual CXL memory, because...
>
> 2. When passing memory through as an explicit NUMA node, but not as
>    part of a CXL memory device, the nodes are lumped together in the
>    DRAM tier.
>
> None of this has to do with firmware.
>
> Memory-type is an awful way of denoting membership of a tier, but we
> have HMAT information that can be passed through via QEMU:
>
> -object memory-backend-ram,size=4G,id=ram-node0 \
> -object memory-backend-ram,size=4G,id=ram-node1 \
> -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>
> Not only would it be nice if we could change tier membership based on
> this data, it's realistically the only way to allow guests to accomplish
> memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>
> ~Gregory

Jonathan Cameron Jan. 9, 2024, 3:50 p.m. UTC | #14

On Tue, 09 Jan 2024 11:41:11 +0800
"Huang, Ying" <ying.huang@intel.com> wrote:

> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:  
> >> >
> >> > From  https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> >> > abstract_distance_offset: override by users to deal with firmware issue.
> >> >
> >> > say firmware can configure the cxl node into wrong tiers, similar to
> >> > that it may also configure all cxl nodes into single memtype, hence
> >> > all these nodes can fall into a single wrong tier.
> >> > In this case, per node adistance_offset would be good to have ?  
> >> 
> >> I think that it's better to fix the error firmware if possible.  And
> >> these are only theoretical, not practical issues.  Do you have some
> >> practical issues?
> >> 
> >> I understand that users may want to move nodes between memory tiers for
> >> different policy choices.  For that, memory_type based adistance_offset
> >> should be good.
> >>   
> >
> > There's actually an affirmative case to change memory tiering to allow
> > either movement of nodes between tiers, or at least base placement on
> > HMAT information. Preferably, membership would be changable to allow
> > hotplug/DCD to be managed (there's no guarantee that the memory passed
> > through will always be what HMAT says on initial boot).  
> 
> IIUC, from Jonathan Cameron as below, the performance of memory
> shouldn't change even for DCD devices.
> 
> https://lore.kernel.org/linux-mm/20231103141636.000007e4@Huawei.com/
> 
> It's possible to change the performance of a NUMA node changed, if we
> hot-remove a memory device, then hot-add another different memory
> device.  It's hoped that the CDAT changes too.

Not supported, but ACPI has _HMA methods to in theory allow changing
HMAT values based on firmware notifications...  So we 'could' make
it work for HMAT based description.

Ultimately my current thinking is we'll end up emulating CXL type3
devices (hiding topology complexity) and you can update CDAT but
IIRC that is only meant to be for degraded situations - so if you
want multiple performance regions, CDAT should describe them from the start.

> 
> So, all in all, HMAT + CDAT can help us to put the memory device in
> appropriate memory tiers.  Now, we have HMAT support in upstream.  We
> will working on CDAT support.
> 
> --
> Best Regards,
> Huang, Ying
> 
> > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> >
> > This group wants to enable passing CXL memory through to KVM/QEMU
> > (i.e. host CXL expander memory passed through to the guest), and
> > allow the guest to apply memory tiering.
> >
> > There are multiple issues with this, presently:
> >
> > 1. The QEMU CXL virtual device is not and probably never will be
> >    performant enough to be a commodity class virtualization.

I'd flex that a bit - we will end up with a solution for virtualization but
it isn't the emulation that is there today because it's not possible to
emulate some of the topology in a performant manner (interleaving with sub
page granularity / interleaving at all (to a lesser degree)). There are
ways to do better than we are today, but they start to look like
software disaggregated memory setups (think lots of page faults in the host).

> >  The
> >    reason is that the virtual CXL device is built off the I/O
> >    virtualization stack, which treats memory accesses as I/O accesses.

That will remain true for complex emulation, but it needn't always be
the case. 
I'm not 100% sure we can make it work but my current thinking is:

When decoders are set up: Check if there is any interleaving going on.
  interleaving happening: Current functionally correct path.
  no interleaving: More conventional memory access path.
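
Purely to make that decision concrete, a minimal Python sketch follows.
It is not QEMU code and does not reflect any existing QEMU API:
AccessPath, choose_access_path and the "interleave ways per decoder"
input are all hypothetical names, and the only logic it encodes is the
check described above (any interleaving keeps the current emulated path,
none allows a more conventional memory path).

```python
from enum import Enum
from typing import Iterable


class AccessPath(Enum):
    """Hypothetical labels for the two paths discussed above."""
    EMULATED = "current functionally correct path"
    DIRECT = "more conventional memory access path"


def choose_access_path(interleave_ways: Iterable[int]) -> AccessPath:
    """Decide the path once decoders are set up.

    interleave_ways holds one entry per committed decoder; a value of 1
    means that decoder does not interleave.
    """
    ways = list(interleave_ways)
    if any(w < 1 for w in ways):
        raise ValueError("interleave ways must be >= 1")
    if any(w > 1 for w in ways):
        # Some decoder interleaves, so stay on the emulated path.
        return AccessPath.EMULATED
    return AccessPath.DIRECT


print(choose_access_path([1, 1]).value)
print(choose_access_path([1, 2]).value)
```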

> >
> >    KVM also seems incompatible with the design of the CXL memory device
> >    in general, but this problem may or may not be a blocker.

That's true if we are doing fine grained routing but as above we can
probably avoid that.

> >
> >    As a result, access to virtual CXL memory device leads to QEMU
> >    crawling to a halt - and this is unlikely to change.

In general yes, but hopefully not for carefully configured cases (the
simple one of direct connect single device, no host interleaving for example).

> >
> >    There is presently no good way forward to create a performant virtual
> >    CXL device in QEMU.  This means the memory tiering component in the
> >    kernel is functionally useless for virtual CXL memory, because...

Agreed - nothing there yet and I don't think the question of CXL virtualization
in general is anywhere near solved...  Maybe emulating a CXL device doesn't
make sense, maybe we end up extending virtio-mem instead.
Needs some PoC work to flesh this out. (it's about number 3 on my list of
stuff to look at this year)

> >
> > 2. When passing memory through as an explicit NUMA node, but not as
> >    part of a CXL memory device, the nodes are lumped together in the
> >    DRAM tier.
> >
> > None of this has to do with firmware.
> >
> > Memory-type is an awful way of denoting membership of a tier, but we
> > have HMAT information that can be passed through via QEMU:
> >
> > -object memory-backend-ram,size=4G,id=ram-node0 \
> > -object memory-backend-ram,size=4G,id=ram-node1 \
> > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> >
> > Not only would it be nice if we could change tier membership based on
> > this data, it's realistically the only way to allow guests to accomplish
> > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.

This I fully agree with.  There will be systems with a bunch of normal DDR with different
access characteristics irrespective of CXL. + likely HMAT solutions will be used
before we get anything more complex in place for CXL.

Jonathan

p.s. I'd love to see _HMA handling implemented in the kernel.. Would trail blaze what
we will probably need to do for fiddly CXL cases where performance degrades on old devices
etc.

> >
> > ~Gregory  
>

Gregory Price Jan. 9, 2024, 5:34 p.m. UTC | #15

On Tue, Jan 09, 2024 at 11:41:11AM +0800, Huang, Ying wrote:
> Gregory Price <gregory.price@memverge.com> writes:
> 
> > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >> >
> >> > From  https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
> >> > abstract_distance_offset: override by users to deal with firmware issue.
> >> >
> >> > say firmware can configure the cxl node into wrong tiers, similar to
> >> > that it may also configure all cxl nodes into single memtype, hence
> >> > all these nodes can fall into a single wrong tier.
> >> > In this case, per node adistance_offset would be good to have ?
> >> 
> >> I think that it's better to fix the error firmware if possible.  And
> >> these are only theoretical, not practical issues.  Do you have some
> >> practical issues?
> >> 
> >> I understand that users may want to move nodes between memory tiers for
> >> different policy choices.  For that, memory_type based adistance_offset
> >> should be good.
> >> 
> >
> > There's actually an affirmative case to change memory tiering to allow
> > either movement of nodes between tiers, or at least base placement on
> > HMAT information. Preferably, membership would be changable to allow
> > hotplug/DCD to be managed (there's no guarantee that the memory passed
> > through will always be what HMAT says on initial boot).
> 
> IIUC, from Jonathan Cameron as below, the performance of memory
> shouldn't change even for DCD devices.
> 
> https://lore.kernel.org/linux-mm/20231103141636.000007e4@Huawei.com/
> 
> It's possible to change the performance of a NUMA node changed, if we
> hot-remove a memory device, then hot-add another different memory
> device.  It's hoped that the CDAT changes too.
> 
> So, all in all, HMAT + CDAT can help us to put the memory device in
> appropriate memory tiers.  Now, we have HMAT support in upstream.  We
> will working on CDAT support.

That should be sufficient assuming the `-numa hmat-lb` setting in QEMU
does the right thing.  I suppose we also need to figure out a way to set
CDAT information for a memory device that isn't related to CXL (from the
perspective of the guest).  I'll take a look if I get cycles.

~Gregory

Gregory Price Jan. 9, 2024, 5:59 p.m. UTC | #16

On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> On Tue, 09 Jan 2024 11:41:11 +0800
> "Huang, Ying" <ying.huang@intel.com> wrote:
> > Gregory Price <gregory.price@memverge.com> writes:
> > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:  
> > It's possible to change the performance of a NUMA node changed, if we
> > hot-remove a memory device, then hot-add another different memory
> > device.  It's hoped that the CDAT changes too.
> 
> Not supported, but ACPI has _HMA methods to in theory allow changing
> HMAT values based on firmware notifications...  So we 'could' make
> it work for HMAT based description.
> 
> Ultimately my current thinking is we'll end up emulating CXL type3
> devices (hiding topology complexity) and you can update CDAT but
> IIRC that is only meant to be for degraded situations - so if you
> want multiple performance regions, CDAT should describe them form the start.
> 

That was my thought.  I don't think it's particularly *realistic* for
HMAT/CDAT values to change at runtime, but I can imagine a case where
it could be valuable.

> > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> > >
> > > This group wants to enable passing CXL memory through to KVM/QEMU
> > > (i.e. host CXL expander memory passed through to the guest), and
> > > allow the guest to apply memory tiering.
> > >
> > > There are multiple issues with this, presently:
> > >
> > > 1. The QEMU CXL virtual device is not and probably never will be
> > >    performant enough to be a commodity class virtualization.
> 
> I'd flex that a bit - we will end up with a solution for virtualization but
> it isn't the emulation that is there today because it's not possible to
> emulate some of the topology in a peformant manner (interleaving with sub
> page granularity / interleaving at all (to a lesser degree)). There are
> ways to do better than we are today, but they start to look like
> software dissagregated memory setups (think lots of page faults in the host).
>

Agreed, the emulated device as-is can't be the virtualization device,
but it doesn't mean it can't be the basis for it.

My thought is, if you want to pass host CXL *memory* through to the
guest, you don't actually care to pass CXL *control* through to the
guest.  That control lies pretty squarely with the host/hypervisor.

So, at least in theory, you can just cut the type3 device out of the
QEMU configuration entirely and just pass it through as a distinct numa
node with specific hmat qualities.

Barring that, if we must go through the type3 device, the question is
how difficult would it be to just make a stripped down type3 device
to provide the informational components, but hack off anything
topology/interleave related? Then you just do direct passthrough as you
described below.

qemu/kvm would report errors if you tried to touch the naughty bits.

The second question is... is that device "compliant" or does it need
super special handling from the kernel driver :D?  If what I described
is not "compliant", then it's probably a bad idea, and KVM/QEMU should
just hide the CXL device entirely from the guest (for this use case)
and just pass the memory through as a numa node.

Which gets us back to: The memory-tiering component needs a way to
place nodes in different tiers based on HMAT/CDAT/User Whim. All three
of those seem like totally valid ways to go about it.

> > >
> > > 2. When passing memory through as an explicit NUMA node, but not as
> > >    part of a CXL memory device, the nodes are lumped together in the
> > >    DRAM tier.
> > >
> > > None of this has to do with firmware.
> > >
> > > Memory-type is an awful way of denoting membership of a tier, but we
> > > have HMAT information that can be passed through via QEMU:
> > >
> > > -object memory-backend-ram,size=4G,id=ram-node0 \
> > > -object memory-backend-ram,size=4G,id=ram-node1 \
> > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> > >
> > > Not only would it be nice if we could change tier membership based on
> > > this data, it's realistically the only way to allow guests to accomplish
> > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> 
> This I fully agree with.  There will be systems with a bunch of normal DDR with different
> access characteristics irrespective of CXL. + likely HMAT solutions will be used
> before we get anything more complex in place for CXL.
> 

Had not even considered this, but that's completely accurate as well.

And more discretely: What of devices that don't provide HMAT/CDAT? That
isn't necessarily a violation of any standard.  There probably could be
a release valve for us to still make those devices useful.

The concern I have with not implementing a movement mechanism *at all*
is that a one-size-fits-all initial-placement heuristic feels gross
when we're, at least ideologically, moving toward "software defined memory".

Personally I think the movement mechanism is a good idea that gets folks
where they're going sooner, and it doesn't hurt anything by existing. We
can change the initial placement mechanism too.

</2cents>

~Gregory

Hao Xiang Jan. 10, 2024, 12:28 a.m. UTC | #17

On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@memverge.com> wrote:
>
> On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> > On Tue, 09 Jan 2024 11:41:11 +0800
> > "Huang, Ying" <ying.huang@intel.com> wrote:
> > > Gregory Price <gregory.price@memverge.com> writes:
> > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> > > It's possible to change the performance of a NUMA node changed, if we
> > > hot-remove a memory device, then hot-add another different memory
> > > device.  It's hoped that the CDAT changes too.
> >
> > Not supported, but ACPI has _HMA methods to in theory allow changing
> > HMAT values based on firmware notifications...  So we 'could' make
> > it work for HMAT based description.
> >
> > Ultimately my current thinking is we'll end up emulating CXL type3
> > devices (hiding topology complexity) and you can update CDAT but
> > IIRC that is only meant to be for degraded situations - so if you
> > want multiple performance regions, CDAT should describe them form the start.
> >
>
> That was my thought.  I don't think it's particularly *realistic* for
> HMAT/CDAT values to change at runtime, but I can imagine a case where
> it could be valuable.
>
> > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> > > >
> > > > This group wants to enable passing CXL memory through to KVM/QEMU
> > > > (i.e. host CXL expander memory passed through to the guest), and
> > > > allow the guest to apply memory tiering.
> > > >
> > > > There are multiple issues with this, presently:
> > > >
> > > > 1. The QEMU CXL virtual device is not and probably never will be
> > > >    performant enough to be a commodity class virtualization.
> >
> > I'd flex that a bit - we will end up with a solution for virtualization but
> > it isn't the emulation that is there today because it's not possible to
> > emulate some of the topology in a peformant manner (interleaving with sub
> > page granularity / interleaving at all (to a lesser degree)). There are
> > ways to do better than we are today, but they start to look like
> > software dissagregated memory setups (think lots of page faults in the host).
> >
>
> Agreed, the emulated device as-is can't be the virtualization device,
> but it doesn't mean it can't be the basis for it.
>
> My thought is, if you want to pass host CXL *memory* through to the
> guest, you don't actually care to pass CXL *control* through to the
> guest.  That control lies pretty squarely with the host/hypervisor.
>
> So, at least in theory, you can just cut the type3 device out of the
> QEMU configuration entirely and just pass it through as a distinct numa
> node with specific hmat qualities.
>
> Barring that, if we must go through the type3 device, the question is
> how difficult would it be to just make a stripped down type3 device
> to provide the informational components, but hack off anything
> topology/interleave related? Then you just do direct passthrough as you
> described below.
>
> qemu/kvm would report errors if you tried to touch the naughty bits.
>
> The second question is... is that device "compliant" or does it need
> super special handling from the kernel driver :D?  If what i described
> is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> just hide the CXL device entirely from the guest (for this use case)
> and just pass the memory through as a numa node.
>
> Which gets us back to: The memory-tiering component needs a way to
> place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> of those seem like totally valid ways to go about it.
>
> > > >
> > > > 2. When passing memory through as an explicit NUMA node, but not as
> > > >    part of a CXL memory device, the nodes are lumped together in the
> > > >    DRAM tier.
> > > >
> > > > None of this has to do with firmware.
> > > >
> > > > Memory-type is an awful way of denoting membership of a tier, but we
> > > > have HMAT information that can be passed through via QEMU:
> > > >
> > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> > > >
> > > > Not only would it be nice if we could change tier membership based on
> > > > this data, it's realistically the only way to allow guests to accomplish
> > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> >
> > This I fully agree with.  There will be systems with a bunch of normal DDR with different
> > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> > before we get anything more complex in place for CXL.
> >
>
> Had not even considered this, but that's completely accurate as well.
>
> And more discretely: What of devices that don't provide HMAT/CDAT? That
> isn't necessarily a violation of any standard.  There probably could be
> a release valve for us to still make those devices useful.
>
> The concern I have with not implementing a movement mechanism *at all*
> is that a one-size-fits-all initial-placement heuristic feels gross
> when we're, at least ideologically, moving toward "software defined memory".
>
> Personally I think the movement mechanism is a good idea that gets folks
> where they're going sooner, and it doesn't hurt anything by existing. We
> can change the initial placement mechanism too.

I think providing users a way to "FIX" the memory tiering is a backup
option. Given that DDRs with different access characteristics provide
the relevant CDAT/HMAT information, the kernel should be able to
correctly establish memory tiering on boot.

The current memory tiering code has:
1) memory_tier_init() to iterate through all boot-onlined memory
nodes. All nodes are assumed to be in the fast tier (adistance
MEMTIER_ADISTANCE_DRAM is used).
2) dev_dax_kmem_probe() to iterate through all devdax-controlled memory
nodes. This is the place where the kernel reads the memory attributes
from HMAT and places the memory nodes into the correct tier
(devdax-controlled CXL, pmem, etc).

If we want DDRs with different memory characteristics to be put into
the correct tier (as in the guest VM memory tiering case), we probably
need a third path to iterate the boot-onlined memory nodes and also be
able to read their memory attributes. I don't think we can do that in
1) because the ACPI subsystem is not yet initialized.
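
To make the ordering problem easier to see, here is a toy Python model
of the two existing paths plus the suggested third one.  It is only a
sketch: the function names boot_pass, kmem_probe and late_boot_node_pass
are invented for illustration, the abstract distance values are example
numbers, and the DRAM default of 4608 is an assumption about
MEMTIER_ADISTANCE_DRAM.

```python
from typing import Dict, Iterable

MEMTIER_ADISTANCE_DRAM = 4608  # assumed default DRAM abstract distance


def boot_pass(boot_nodes: Iterable[int]) -> Dict[int, int]:
    """Path 1: every boot-onlined node is assumed to be default DRAM."""
    return {node: MEMTIER_ADISTANCE_DRAM for node in boot_nodes}


def kmem_probe(adistances: Dict[int, int], node: int, hmat_adistance: int) -> None:
    """Path 2: a devdax-managed node is placed from HMAT-derived data."""
    adistances[node] = hmat_adistance


def late_boot_node_pass(adistances: Dict[int, int],
                        boot_nodes: Iterable[int],
                        hmat: Dict[int, int]) -> None:
    """Hypothetical path 3: once ACPI is up, revisit the boot-onlined
    nodes and replace the default wherever HMAT data is available."""
    for node in boot_nodes:
        if node in hmat:
            adistances[node] = hmat[node]


# Guest with two boot-onlined DDR nodes that differ in performance.
boot_nodes = [0, 1]
hmat = {0: 4608, 1: 18432}  # example HMAT-derived abstract distances

adistances = boot_pass(boot_nodes)
print("after path 1:", adistances)   # both nodes share the DRAM default
late_boot_node_pass(adistances, boot_nodes, hmat)
print("after path 3:", adistances)   # node 1 now has its own value
```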

>
> </2cents>
>
> ~Gregory

Huang, Ying Jan. 10, 2024, 5:47 a.m. UTC | #18

Gregory Price <gregory.price@memverge.com> writes:

> On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
>> On Tue, 09 Jan 2024 11:41:11 +0800
>> "Huang, Ying" <ying.huang@intel.com> wrote:
>> > Gregory Price <gregory.price@memverge.com> writes:
>> > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:  
>> > It's possible to change the performance of a NUMA node changed, if we
>> > hot-remove a memory device, then hot-add another different memory
>> > device.  It's hoped that the CDAT changes too.
>> 
>> Not supported, but ACPI has _HMA methods to in theory allow changing
>> HMAT values based on firmware notifications...  So we 'could' make
>> it work for HMAT based description.
>> 
>> Ultimately my current thinking is we'll end up emulating CXL type3
>> devices (hiding topology complexity) and you can update CDAT but
>> IIRC that is only meant to be for degraded situations - so if you
>> want multiple performance regions, CDAT should describe them form the start.
>> 
>
> That was my thought.  I don't think it's particularly *realistic* for
> HMAT/CDAT values to change at runtime, but I can imagine a case where
> it could be valuable.
>
>> > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>> > >
>> > > This group wants to enable passing CXL memory through to KVM/QEMU
>> > > (i.e. host CXL expander memory passed through to the guest), and
>> > > allow the guest to apply memory tiering.
>> > >
>> > > There are multiple issues with this, presently:
>> > >
>> > > 1. The QEMU CXL virtual device is not and probably never will be
>> > >    performant enough to be a commodity class virtualization.
>> 
>> I'd flex that a bit - we will end up with a solution for virtualization but
>> it isn't the emulation that is there today because it's not possible to
>> emulate some of the topology in a peformant manner (interleaving with sub
>> page granularity / interleaving at all (to a lesser degree)). There are
>> ways to do better than we are today, but they start to look like
>> software dissagregated memory setups (think lots of page faults in the host).
>>
>
> Agreed, the emulated device as-is can't be the virtualization device,
> but it doesn't mean it can't be the basis for it.
>
> My thought is, if you want to pass host CXL *memory* through to the
> guest, you don't actually care to pass CXL *control* through to the
> guest.  That control lies pretty squarely with the host/hypervisor.
>
> So, at least in theory, you can just cut the type3 device out of the
> QEMU configuration entirely and just pass it through as a distinct numa
> node with specific hmat qualities.
>
> Barring that, if we must go through the type3 device, the question is
> how difficult would it be to just make a stripped down type3 device
> to provide the informational components, but hack off anything
> topology/interleave related? Then you just do direct passthrough as you
> described below.
>
> qemu/kvm would report errors if you tried to touch the naughty bits.
>
> The second question is... is that device "compliant" or does it need
> super special handling from the kernel driver :D?  If what i described
> is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> just hide the CXL device entirely from the guest (for this use case)
> and just pass the memory through as a numa node.
>
> Which gets us back to: The memory-tiering component needs a way to
> place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> of those seem like totally valid ways to go about it.
>
>> > >
>> > > 2. When passing memory through as an explicit NUMA node, but not as
>> > >    part of a CXL memory device, the nodes are lumped together in the
>> > >    DRAM tier.
>> > >
>> > > None of this has to do with firmware.
>> > >
>> > > Memory-type is an awful way of denoting membership of a tier, but we
>> > > have HMAT information that can be passed through via QEMU:
>> > >
>> > > -object memory-backend-ram,size=4G,id=ram-node0 \
>> > > -object memory-backend-ram,size=4G,id=ram-node1 \
>> > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
>> > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
>> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
>> > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
>> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
>> > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>> > >
>> > > Not only would it be nice if we could change tier membership based on
>> > > this data, it's realistically the only way to allow guests to accomplish
>> > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>> 
>> This I fully agree with.  There will be systems with a bunch of normal DDR with different
>> access characteristics irrespective of CXL. + likely HMAT solutions will be used
>> before we get anything more complex in place for CXL.
>> 
>
> Had not even considered this, but that's completely accurate as well.
>
> And more discretely: What of devices that don't provide HMAT/CDAT? That
> isn't necessarily a violation of any standard.  There probably could be
> a release valve for us to still make those devices useful.
>
> The concern I have with not implementing a movement mechanism *at all*
> is that a one-size-fits-all initial-placement heuristic feels gross
> when we're, at least ideologically, moving toward "software defined memory".
>
> Personally I think the movement mechanism is a good idea that gets folks
> where they're going sooner, and it doesn't hurt anything by existing. We
> can change the initial placement mechanism too.
>
> </2cents>

Providing hardware information from user space is a last resort.
We should try to avoid that if possible.

Per my understanding, per-memory-type abstract distance overriding is for
applying a specific policy, while per-memory-node abstract distance
overriding is for providing missing hardware information.
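
To make the distinction concrete, here is a minimal sketch (Python, purely
illustrative) of how an abstract distance could map to a tier and how the two
kinds of override differ.  The chunk size, the DRAM default, the node/type
tables, and the offset values are all assumptions made up for the example;
they are not the kernel's data structures or the interface proposed in this
series.

```python
# Illustrative model only: names and numbers are assumptions, not kernel code.
CHUNK_BITS = 7                                        # assumed chunk granularity
CHUNK_SIZE = 1 << CHUNK_BITS
ADISTANCE_DRAM = 4 * CHUNK_SIZE + (CHUNK_SIZE >> 1)   # assumed DRAM default


def tier_of(adistance: int) -> int:
    """Tier id is the abstract distance rounded down to a chunk."""
    return adistance >> CHUNK_BITS


def place_nodes(node_type, type_adistance, type_offset=None, node_offset=None):
    """Return {node: tier} after applying optional per-type and per-node offsets."""
    type_offset = type_offset or {}
    node_offset = node_offset or {}
    tiers = {}
    for node, mtype in node_type.items():
        adist = type_adistance[mtype]
        adist += type_offset.get(mtype, 0)   # policy applied to a whole memory type
        adist += node_offset.get(node, 0)    # correction applied to a single node
        tiers[node] = tier_of(adist)
    return tiers


# Hypothetical system: nodes 1 and 2 share a type that was given the DRAM
# abstract distance, so everything is lumped into one tier.
node_type = {0: "dram", 1: "cxl", 2: "cxl"}
type_adistance = {"dram": ADISTANCE_DRAM, "cxl": ADISTANCE_DRAM}

baseline = place_nodes(node_type, type_adistance)
by_type = place_nodes(node_type, type_adistance,
                      type_offset={"cxl": 2 * CHUNK_SIZE})
by_node = place_nodes(node_type, type_adistance,
                      node_offset={2: 2 * CHUNK_SIZE})
```

In this model the per-type offset moves both nodes of the type together,
while the per-node offset moves only node 2 and leaves node 1 where it was.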

--
Best Regards,
Huang, Ying

Huang, Ying Jan. 10, 2024, 6:06 a.m. UTC | #19

Jonathan Cameron <Jonathan.Cameron@Huawei.com> writes:

> On Tue, 09 Jan 2024 11:41:11 +0800
> "Huang, Ying" <ying.huang@intel.com> wrote:
>
>> Gregory Price <gregory.price@memverge.com> writes:
>> 
>> > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:  
>> >> >
>> >> > From  https://lpc.events/event/16/contributions/1209/attachments/1042/1995/Live%20In%20a%20World%20With%20Multiple%20Memory%20Types.pdf
>> >> > abstract_distance_offset: override by users to deal with firmware issue.
>> >> >
>> >> > say firmware can configure the cxl node into wrong tiers, similar to
>> >> > that it may also configure all cxl nodes into single memtype, hence
>> >> > all these nodes can fall into a single wrong tier.
>> >> > In this case, per node adistance_offset would be good to have ?  
>> >> 
>> >> I think that it's better to fix the error firmware if possible.  And
>> >> these are only theoretical, not practical issues.  Do you have some
>> >> practical issues?
>> >> 
>> >> I understand that users may want to move nodes between memory tiers for
>> >> different policy choices.  For that, memory_type based adistance_offset
>> >> should be good.
>> >>   
>> >
>> > There's actually an affirmative case to change memory tiering to allow
>> > either movement of nodes between tiers, or at least base placement on
>> > HMAT information. Preferably, membership would be changable to allow
>> > hotplug/DCD to be managed (there's no guarantee that the memory passed
>> > through will always be what HMAT says on initial boot).  
>> 
>> IIUC, from Jonathan Cameron as below, the performance of memory
>> shouldn't change even for DCD devices.
>> 
>> https://lore.kernel.org/linux-mm/20231103141636.000007e4@Huawei.com/
>> 
>> It's possible to change the performance of a NUMA node changed, if we
>> hot-remove a memory device, then hot-add another different memory
>> device.  It's hoped that the CDAT changes too.
>
> Not supported, but ACPI has _HMA methods to in theory allow changing
> HMAT values based on firmware notifications...  So we 'could' make
> it work for HMAT based description.
>
> Ultimately my current thinking is we'll end up emulating CXL type3
> devices (hiding topology complexity) and you can update CDAT but
> IIRC that is only meant to be for degraded situations - so if you
> want multiple performance regions, CDAT should describe them form the start.

Thank you very much for the input!  So, to support degraded performance, we
will need to move a NUMA node between memory tiers.  And, per my
understanding, we should do that in the kernel.

>> 
>> So, all in all, HMAT + CDAT can help us to put the memory device in
>> appropriate memory tiers.  Now, we have HMAT support in upstream.  We
>> will working on CDAT support.
>> 

--
Best Regards,
Huang, Ying

Jonathan Cameron Jan. 10, 2024, 2:11 p.m. UTC | #20

On Tue, 9 Jan 2024 12:59:19 -0500
Gregory Price <gregory.price@memverge.com> wrote:

> On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> > On Tue, 09 Jan 2024 11:41:11 +0800
> > "Huang, Ying" <ying.huang@intel.com> wrote:  
> > > Gregory Price <gregory.price@memverge.com> writes:  
> > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:    
> > > It's possible to change the performance of a NUMA node changed, if we
> > > hot-remove a memory device, then hot-add another different memory
> > > device.  It's hoped that the CDAT changes too.  
> > 
> > Not supported, but ACPI has _HMA methods to in theory allow changing
> > HMAT values based on firmware notifications...  So we 'could' make
> > it work for HMAT based description.
> > 
> > Ultimately my current thinking is we'll end up emulating CXL type3
> > devices (hiding topology complexity) and you can update CDAT but
> > IIRC that is only meant to be for degraded situations - so if you
> > want multiple performance regions, CDAT should describe them form the start.
> >   
> 
> That was my thought.  I don't think it's particularly *realistic* for
> HMAT/CDAT values to change at runtime, but I can imagine a case where
> it could be valuable.

For now I'm thinking we might spit that CDAT info out via a tracepoint if
it happens, but given it's degraded perf only maybe we don't care.

HMAT is more interesting because it may be used by a firmware first
model to paper over some weird hardware being hotplugged, or for giggles
a hypervisor moving memory around under the hood (think powering down
whole DRAM controllers etc).

Anyhow, that's highly speculative and whoever cares about it can
make it work! :)

> 
> > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> > > >
> > > > This group wants to enable passing CXL memory through to KVM/QEMU
> > > > (i.e. host CXL expander memory passed through to the guest), and
> > > > allow the guest to apply memory tiering.
> > > >
> > > > There are multiple issues with this, presently:
> > > >
> > > > 1. The QEMU CXL virtual device is not and probably never will be
> > > >    performant enough to be a commodity class virtualization.  
> > 
> > I'd flex that a bit - we will end up with a solution for virtualization but
> > it isn't the emulation that is there today because it's not possible to
> > emulate some of the topology in a peformant manner (interleaving with sub
> > page granularity / interleaving at all (to a lesser degree)). There are
> > ways to do better than we are today, but they start to look like
> > software dissagregated memory setups (think lots of page faults in the host).
> >  
> 
> Agreed, the emulated device as-is can't be the virtualization device,
> but it doesn't mean it can't be the basis for it.
> 
> My thought is, if you want to pass host CXL *memory* through to the
> guest, you don't actually care to pass CXL *control* through to the
> guest.  That control lies pretty squarely with the host/hypervisor.
> 
> So, at least in theory, you can just cut the type3 device out of the
> QEMU configuration entirely and just pass it through as a distinct numa
> node with specific hmat qualities.
> 
> Barring that, if we must go through the type3 device, the question is
> how difficult would it be to just make a stripped down type3 device
> to provide the informational components, but hack off anything
> topology/interleave related? Then you just do direct passthrough as you
> described below.

Not stripped down as such, just lock the decoders as if a firmware had
configured it (in reality the config will be really really simple).
The kernel stack handles that fine today.  The only dynamic bit
would be the DC related part.  Not sure our lockdown support in the
emulated device is complete (some of it is there but might have missed
some registers).

> 
> qemu/kvm would report errors if you tried to touch the naughty bits.

Might do that as a temporary step along the way to enabling things, but given
CXL assumes that the host firmware 'might' have configured everything and
locked it (kernel may be booting out of CXL memory for instance) it should
'just work' without needing this.
 
> The second question is... is that device "compliant" or does it need
> super special handling from the kernel driver :D?  If what i described
> is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> just hide the CXL device entirely from the guest (for this use case)
> and just pass the memory through as a numa node.

Would need to be compliant or very nearly so - I can see we might advertise
no interleave support even though not setting any of the interleave address
bits is technically a spec violation.  However, I don't think we need to
do that because of decoder locking.  We advertise interleave options but
don't allow the current setting to be changed.

If someone manually resets the bus they are on their own though :(
(that will clear the lock registers as it's the same as removing power).

> 
> Which gets us back to: The memory-tiering component needs a way to
> place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> of those seem like totally valid ways to go about it.
> 
> > > >
> > > > 2. When passing memory through as an explicit NUMA node, but not as
> > > >    part of a CXL memory device, the nodes are lumped together in the
> > > >    DRAM tier.
> > > >
> > > > None of this has to do with firmware.
> > > >
> > > > Memory-type is an awful way of denoting membership of a tier, but we
> > > > have HMAT information that can be passed through via QEMU:
> > > >
> > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> > > >
> > > > Not only would it be nice if we could change tier membership based on
> > > > this data, it's realistically the only way to allow guests to accomplish
> > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.  
> > 
> > This I fully agree with.  There will be systems with a bunch of normal DDR with different
> > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> > before we get anything more complex in place for CXL.
> >   
> 
> Had not even considered this, but that's completely accurate as well.
> 
> And more discretely: What of devices that don't provide HMAT/CDAT? That
> isn't necessarily a violation of any standard.  There probably could be
> a release valve for us to still make those devices useful.

I'd argue any such device needs some driver support. Release valve is they
provide the info from that driver, just like the CDAT solution is doing.

If they don't then meh, their system is borked so they'll add it
fairly quickly!

> 
> The concern I have with not implementing a movement mechanism *at all*
> is that a one-size-fits-all initial-placement heuristic feels gross
> when we're, at least ideologically, moving toward "software defined memory".
> 
> Personally I think the movement mechanism is a good idea that gets folks
> where they're going sooner, and it doesn't hurt anything by existing. We
> can change the initial placement mechanism too.

I've no problem with a movement mechanism. Hopefully in the long run it
never gets used though! Maybe in short term it's out of tree code.

Jonathan

> 
> </2cents>
> 
> ~Gregory

Jonathan Cameron Jan. 10, 2024, 2:18 p.m. UTC | #21

On Tue, 9 Jan 2024 16:28:15 -0800
Hao Xiang <hao.xiang@bytedance.com> wrote:

> On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@memverge.com> wrote:
> >
> > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:  
> > > On Tue, 09 Jan 2024 11:41:11 +0800
> > > "Huang, Ying" <ying.huang@intel.com> wrote:  
> > > > Gregory Price <gregory.price@memverge.com> writes:  
> > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:  
> > > > It's possible to change the performance of a NUMA node changed, if we
> > > > hot-remove a memory device, then hot-add another different memory
> > > > device.  It's hoped that the CDAT changes too.  
> > >
> > > Not supported, but ACPI has _HMA methods to in theory allow changing
> > > HMAT values based on firmware notifications...  So we 'could' make
> > > it work for HMAT based description.
> > >
> > > Ultimately my current thinking is we'll end up emulating CXL type3
> > > devices (hiding topology complexity) and you can update CDAT but
> > > IIRC that is only meant to be for degraded situations - so if you
> > > want multiple performance regions, CDAT should describe them form the start.
> > >  
> >
> > That was my thought.  I don't think it's particularly *realistic* for
> > HMAT/CDAT values to change at runtime, but I can imagine a case where
> > it could be valuable.
> >  
> > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> > > > >
> > > > > This group wants to enable passing CXL memory through to KVM/QEMU
> > > > > (i.e. host CXL expander memory passed through to the guest), and
> > > > > allow the guest to apply memory tiering.
> > > > >
> > > > > There are multiple issues with this, presently:
> > > > >
> > > > > 1. The QEMU CXL virtual device is not and probably never will be
> > > > >    performant enough to be a commodity class virtualization.  
> > >
> > > I'd flex that a bit - we will end up with a solution for virtualization but
> > > it isn't the emulation that is there today because it's not possible to
> > > emulate some of the topology in a peformant manner (interleaving with sub
> > > page granularity / interleaving at all (to a lesser degree)). There are
> > > ways to do better than we are today, but they start to look like
> > > software dissagregated memory setups (think lots of page faults in the host).
> > >  
> >
> > Agreed, the emulated device as-is can't be the virtualization device,
> > but it doesn't mean it can't be the basis for it.
> >
> > My thought is, if you want to pass host CXL *memory* through to the
> > guest, you don't actually care to pass CXL *control* through to the
> > guest.  That control lies pretty squarely with the host/hypervisor.
> >
> > So, at least in theory, you can just cut the type3 device out of the
> > QEMU configuration entirely and just pass it through as a distinct numa
> > node with specific hmat qualities.
> >
> > Barring that, if we must go through the type3 device, the question is
> > how difficult would it be to just make a stripped down type3 device
> > to provide the informational components, but hack off anything
> > topology/interleave related? Then you just do direct passthrough as you
> > described below.
> >
> > qemu/kvm would report errors if you tried to touch the naughty bits.
> >
> > The second question is... is that device "compliant" or does it need
> > super special handling from the kernel driver :D?  If what i described
> > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> > just hide the CXL device entirely from the guest (for this use case)
> > and just pass the memory through as a numa node.
> >
> > Which gets us back to: The memory-tiering component needs a way to
> > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> > of those seem like totally valid ways to go about it.
> >  
> > > > >
> > > > > 2. When passing memory through as an explicit NUMA node, but not as
> > > > >    part of a CXL memory device, the nodes are lumped together in the
> > > > >    DRAM tier.
> > > > >
> > > > > None of this has to do with firmware.
> > > > >
> > > > > Memory-type is an awful way of denoting membership of a tier, but we
> > > > > have HMAT information that can be passed through via QEMU:
> > > > >
> > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> > > > >
> > > > > Not only would it be nice if we could change tier membership based on
> > > > > this data, it's realistically the only way to allow guests to accomplish
> > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.  
> > >
> > > This I fully agree with.  There will be systems with a bunch of normal DDR with different
> > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> > > before we get anything more complex in place for CXL.
> > >  
> >
> > Had not even considered this, but that's completely accurate as well.
> >
> > And more discretely: What of devices that don't provide HMAT/CDAT? That
> > isn't necessarily a violation of any standard.  There probably could be
> > a release valve for us to still make those devices useful.
> >
> > The concern I have with not implementing a movement mechanism *at all*
> > is that a one-size-fits-all initial-placement heuristic feels gross
> > when we're, at least ideologically, moving toward "software defined memory".
> >
> > Personally I think the movement mechanism is a good idea that gets folks
> > where they're going sooner, and it doesn't hurt anything by existing. We
> > can change the initial placement mechanism too.  
> 
> I think providing users a way to "FIX" the memory tiering is a backup
> option. Given that DDRs with different access characteristics provide
> the relevant CDAT/HMAT information, the kernel should be able to
> correctly establish memory tiering on boot.

Include hotplug and I'll be happier!  I know that's messy though.

> Current memory tiering code has
> 1) memory_tier_init() to iterate through all boot onlined memory
> nodes. All nodes are assumed to be fast tier (adistance
> MEMTIER_ADISTANCE_DRAM is used).
> 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
> nodes. This is the place the kernel reads the memory attributes from
> HMAT and recognizes the memory nodes into the correct tier (devdax
> controlled CXL, pmem, etc).
> If we want DDRs with different memory characteristics to be put into
> the correct tier (as in the guest VM memory tiering case), we probably
> need a third path to iterate the boot onlined memory nodes and also be
> able to read their memory attributes. I don't think we can do that in
> 1) because the ACPI subsystem is not yet initialized.

Can we move it later in general?  Or drag HMAT parsing earlier?
ACPI table availability is pretty early, it's just that we don't bother
with HMAT because nothing early uses it.
IIRC SRAT parsing occurs way before memory_tier_init() will be called.

Jonathan



> 
> >
> > </2cents>
> >
> > ~Gregory

Hao Xiang Jan. 10, 2024, 7:29 p.m. UTC | #22

On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
<Jonathan.Cameron@huawei.com> wrote:
>
> On Tue, 9 Jan 2024 16:28:15 -0800
> Hao Xiang <hao.xiang@bytedance.com> wrote:
>
> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@memverge.com> wrote:
> > >
> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> > > > On Tue, 09 Jan 2024 11:41:11 +0800
> > > > "Huang, Ying" <ying.huang@intel.com> wrote:
> > > > > Gregory Price <gregory.price@memverge.com> writes:
> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> > > > > It's possible to change the performance of a NUMA node changed, if we
> > > > > hot-remove a memory device, then hot-add another different memory
> > > > > device.  It's hoped that the CDAT changes too.
> > > >
> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
> > > > HMAT values based on firmware notifications...  So we 'could' make
> > > > it work for HMAT based description.
> > > >
> > > > Ultimately my current thinking is we'll end up emulating CXL type3
> > > > devices (hiding topology complexity) and you can update CDAT but
> > > > IIRC that is only meant to be for degraded situations - so if you
> > > > want multiple performance regions, CDAT should describe them form the start.
> > > >
> > >
> > > That was my thought.  I don't think it's particularly *realistic* for
> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
> > > it could be valuable.
> > >
> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> > > > > >
> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
> > > > > > (i.e. host CXL expander memory passed through to the guest), and
> > > > > > allow the guest to apply memory tiering.
> > > > > >
> > > > > > There are multiple issues with this, presently:
> > > > > >
> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
> > > > > >    performant enough to be a commodity class virtualization.
> > > >
> > > > I'd flex that a bit - we will end up with a solution for virtualization but
> > > > it isn't the emulation that is there today because it's not possible to
> > > > emulate some of the topology in a peformant manner (interleaving with sub
> > > > page granularity / interleaving at all (to a lesser degree)). There are
> > > > ways to do better than we are today, but they start to look like
> > > > software dissagregated memory setups (think lots of page faults in the host).
> > > >
> > >
> > > Agreed, the emulated device as-is can't be the virtualization device,
> > > but it doesn't mean it can't be the basis for it.
> > >
> > > My thought is, if you want to pass host CXL *memory* through to the
> > > guest, you don't actually care to pass CXL *control* through to the
> > > guest.  That control lies pretty squarely with the host/hypervisor.
> > >
> > > So, at least in theory, you can just cut the type3 device out of the
> > > QEMU configuration entirely and just pass it through as a distinct numa
> > > node with specific hmat qualities.
> > >
> > > Barring that, if we must go through the type3 device, the question is
> > > how difficult would it be to just make a stripped down type3 device
> > > to provide the informational components, but hack off anything
> > > topology/interleave related? Then you just do direct passthrough as you
> > > described below.
> > >
> > > qemu/kvm would report errors if you tried to touch the naughty bits.
> > >
> > > The second question is... is that device "compliant" or does it need
> > > super special handling from the kernel driver :D?  If what i described
> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> > > just hide the CXL device entirely from the guest (for this use case)
> > > and just pass the memory through as a numa node.
> > >
> > > Which gets us back to: The memory-tiering component needs a way to
> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> > > of those seem like totally valid ways to go about it.
> > >
> > > > > >
> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
> > > > > >    part of a CXL memory device, the nodes are lumped together in the
> > > > > >    DRAM tier.
> > > > > >
> > > > > > None of this has to do with firmware.
> > > > > >
> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
> > > > > > have HMAT information that can be passed through via QEMU:
> > > > > >
> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> > > > > >
> > > > > > Not only would it be nice if we could change tier membership based on
> > > > > > this data, it's realistically the only way to allow guests to accomplish
> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> > > >
> > > > This I fully agree with.  There will be systems with a bunch of normal DDR with different
> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> > > > before we get anything more complex in place for CXL.
> > > >
> > >
> > > Had not even considered this, but that's completely accurate as well.
> > >
> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
> > > isn't necessarily a violation of any standard.  There probably could be
> > > a release valve for us to still make those devices useful.
> > >
> > > The concern I have with not implementing a movement mechanism *at all*
> > > is that a one-size-fits-all initial-placement heuristic feels gross
> > > when we're, at least ideologically, moving toward "software defined memory".
> > >
> > > Personally I think the movement mechanism is a good idea that gets folks
> > > where they're going sooner, and it doesn't hurt anything by existing. We
> > > can change the initial placement mechanism too.
> >
> > I think providing users a way to "FIX" the memory tiering is a backup
> > option. Given that DDRs with different access characteristics provide
> > the relevant CDAT/HMAT information, the kernel should be able to
> > correctly establish memory tiering on boot.
>
> Include hotplug and I'll be happier!  I know that's messy though.
>
> > Current memory tiering code has
> > 1) memory_tier_init() to iterate through all boot onlined memory
> > nodes. All nodes are assumed to be fast tier (adistance
> > MEMTIER_ADISTANCE_DRAM is used).
> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
> > nodes. This is the place the kernel reads the memory attributes from
> > HMAT and recognizes the memory nodes into the correct tier (devdax
> > controlled CXL, pmem, etc).
> > If we want DDRs with different memory characteristics to be put into
> > the correct tier (as in the guest VM memory tiering case), we probably
> > need a third path to iterate the boot onlined memory nodes and also be
> > able to read their memory attributes. I don't think we can do that in
> > 1) because the ACPI subsystem is not yet initialized.
>
> Can we move it later in general?  Or drag HMAT parsing earlier?
> ACPI table availability is pretty early, it's just that we don't bother
> with HMAT because nothing early uses it.
> IIRC SRAT parsing occurs way before memory_tier_init() will be called.

I tested the call sequence under a debugger earlier. hmat_init() is
called after memory_tier_init(). Let me poke around and see what our
options are.
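
A toy model of the ordering problem, for illustration only (Python).  The
list of initcall levels follows the usual kernel sequence; the level assigned
to each of the two functions below is a placeholder chosen to reproduce the
ordering seen in the debugger, not a statement of how they are registered in
the tree.

```python
# Illustrative model of initcall ordering; level assignments are assumptions.
INITCALL_LEVELS = ["pure", "core", "postcore", "arch", "subsys",
                   "fs", "rootfs", "device", "late"]


def run_order(registrations):
    """Return function names in the order a level-by-level pass runs them."""
    rank = {level: i for i, level in enumerate(INITCALL_LEVELS)}
    # sorted() is stable, so functions at the same level keep their
    # registration order (a stand-in for link order).
    ordered = sorted(registrations, key=lambda reg: rank[reg[1]])
    return [name for name, _level in ordered]


def runs_before(registrations, first, second):
    """True if `first` is invoked before `second`."""
    order = run_order(registrations)
    return order.index(first) < order.index(second)


# Placeholder levels that reproduce the observed sequence.
observed = [("hmat_init", "device"), ("memory_tier_init", "subsys")]

# One option under discussion: run the tier setup later.
moved_later = [("hmat_init", "device"), ("memory_tier_init", "late")]

# The other option: make the HMAT data available earlier.
hmat_earlier = [("hmat_init", "arch"), ("memory_tier_init", "subsys")]
```

In the model, either change makes the HMAT data available before the tier
setup runs; which one is workable in practice depends on what each function
needs to have been initialized before it.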

>
> Jonathan
>
>
>
> >
> > >
> > > </2cents>
> > >
> > > ~Gregory
>

Huang, Ying Jan. 12, 2024, 7 a.m. UTC | #23

Hao Xiang <hao.xiang@bytedance.com> writes:

> On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
> <Jonathan.Cameron@huawei.com> wrote:
>>
>> On Tue, 9 Jan 2024 16:28:15 -0800
>> Hao Xiang <hao.xiang@bytedance.com> wrote:
>>
>> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@memverge.com> wrote:
>> > >
>> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
>> > > > On Tue, 09 Jan 2024 11:41:11 +0800
>> > > > "Huang, Ying" <ying.huang@intel.com> wrote:
>> > > > > Gregory Price <gregory.price@memverge.com> writes:
>> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> > > > > It's possible to change the performance of a NUMA node changed, if we
>> > > > > hot-remove a memory device, then hot-add another different memory
>> > > > > device.  It's hoped that the CDAT changes too.
>> > > >
>> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
>> > > > HMAT values based on firmware notifications...  So we 'could' make
>> > > > it work for HMAT based description.
>> > > >
>> > > > Ultimately my current thinking is we'll end up emulating CXL type3
>> > > > devices (hiding topology complexity) and you can update CDAT but
>> > > > IIRC that is only meant to be for degraded situations - so if you
>> > > > want multiple performance regions, CDAT should describe them form the start.
>> > > >
>> > >
>> > > That was my thought.  I don't think it's particularly *realistic* for
>> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
>> > > it could be valuable.
>> > >
>> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>> > > > > >
>> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
>> > > > > > (i.e. host CXL expander memory passed through to the guest), and
>> > > > > > allow the guest to apply memory tiering.
>> > > > > >
>> > > > > > There are multiple issues with this, presently:
>> > > > > >
>> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
>> > > > > >    performant enough to be a commodity class virtualization.
>> > > >
>> > > > I'd flex that a bit - we will end up with a solution for virtualization but
>> > > > it isn't the emulation that is there today because it's not possible to
>> > > > emulate some of the topology in a peformant manner (interleaving with sub
>> > > > page granularity / interleaving at all (to a lesser degree)). There are
>> > > > ways to do better than we are today, but they start to look like
>> > > > software dissagregated memory setups (think lots of page faults in the host).
>> > > >
>> > >
>> > > Agreed, the emulated device as-is can't be the virtualization device,
>> > > but it doesn't mean it can't be the basis for it.
>> > >
>> > > My thought is, if you want to pass host CXL *memory* through to the
>> > > guest, you don't actually care to pass CXL *control* through to the
>> > > guest.  That control lies pretty squarely with the host/hypervisor.
>> > >
>> > > So, at least in theory, you can just cut the type3 device out of the
>> > > QEMU configuration entirely and just pass it through as a distinct numa
>> > > node with specific hmat qualities.
>> > >
>> > > Barring that, if we must go through the type3 device, the question is
>> > > how difficult would it be to just make a stripped down type3 device
>> > > to provide the informational components, but hack off anything
>> > > topology/interleave related? Then you just do direct passthrough as you
>> > > described below.
>> > >
>> > > qemu/kvm would report errors if you tried to touch the naughty bits.
>> > >
>> > > The second question is... is that device "compliant" or does it need
>> > > super special handling from the kernel driver :D?  If what i described
>> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
>> > > just hide the CXL device entirely from the guest (for this use case)
>> > > and just pass the memory through as a numa node.
>> > >
>> > > Which gets us back to: The memory-tiering component needs a way to
>> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
>> > > of those seem like totally valid ways to go about it.
>> > >
>> > > > > >
>> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
>> > > > > >    part of a CXL memory device, the nodes are lumped together in the
>> > > > > >    DRAM tier.
>> > > > > >
>> > > > > > None of this has to do with firmware.
>> > > > > >
>> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
>> > > > > > have HMAT information that can be passed through via QEMU:
>> > > > > >
>> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
>> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
>> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
>> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
>> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
>> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
>> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
>> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>> > > > > >
>> > > > > > Not only would it be nice if we could change tier membership based on
>> > > > > > this data, it's realistically the only way to allow guests to accomplish
>> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>> > > >
>> > > > This I fully agree with.  There will be systems with a bunch of normal DDR with different
>> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
>> > > > before we get anything more complex in place for CXL.
>> > > >
>> > >
>> > > Had not even considered this, but that's completely accurate as well.
>> > >
>> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
>> > > isn't necessarily a violation of any standard.  There probably could be
>> > > a release valve for us to still make those devices useful.
>> > >
>> > > The concern I have with not implementing a movement mechanism *at all*
>> > > is that a one-size-fits-all initial-placement heuristic feels gross
>> > > when we're, at least ideologically, moving toward "software defined memory".
>> > >
>> > > Personally I think the movement mechanism is a good idea that gets folks
>> > > where they're going sooner, and it doesn't hurt anything by existing. We
>> > > can change the initial placement mechanism too.
>> >
>> > I think providing users a way to "FIX" the memory tiering is a backup
>> > option. Given that DDRs with different access characteristics provide
>> > the relevant CDAT/HMAT information, the kernel should be able to
>> > correctly establish memory tiering on boot.
>>
>> Include hotplug and I'll be happier!  I know that's messy though.
>>
>> > Current memory tiering code has
>> > 1) memory_tier_init() to iterate through all boot onlined memory
>> > nodes. All nodes are assumed to be fast tier (adistance
>> > MEMTIER_ADISTANCE_DRAM is used).
>> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
>> > nodes. This is the place the kernel reads the memory attributes from
>> > HMAT and recognizes the memory nodes into the correct tier (devdax
>> > controlled CXL, pmem, etc).
>> > If we want DDRs with different memory characteristics to be put into
>> > the correct tier (as in the guest VM memory tiering case), we probably
>> > need a third path to iterate the boot onlined memory nodes and also be
>> > able to read their memory attributes. I don't think we can do that in
>> > 1) because the ACPI subsystem is not yet initialized.
>>
>> Can we move it later in general?  Or drag HMAT parsing earlier?
>> ACPI table availability is pretty early, it's just that we don't bother
>> with HMAT because nothing early uses it.
>> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
>
> I tested the call sequence under a debugger earlier. hmat_init() is
> called after memory_tier_init(). Let me poke around and see what our
> options are.

This sounds reasonable.

Please keep in mind that we need a way to identify the baseline memory
type (default_dram_type).  A simple method is to use the NUMA nodes with
CPUs attached.  But I remember Aneesh saying that, on their systems, some
NUMA nodes without CPUs need to be put in default_dram_type too.  We
need a way to identify those nodes.

--
Best Regards,
Huang, Ying

Hao Xiang Jan. 12, 2024, 8:14 a.m. UTC | #24

On Thu, Jan 11, 2024 at 11:02 PM Huang, Ying <ying.huang@intel.com> wrote:
>
> Hao Xiang <hao.xiang@bytedance.com> writes:
>
> > On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
> > <Jonathan.Cameron@huawei.com> wrote:
> >>
> >> On Tue, 9 Jan 2024 16:28:15 -0800
> >> Hao Xiang <hao.xiang@bytedance.com> wrote:
> >>
> >> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@memverge.com> wrote:
> >> > >
> >> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
> >> > > > On Tue, 09 Jan 2024 11:41:11 +0800
> >> > > > "Huang, Ying" <ying.huang@intel.com> wrote:
> >> > > > > Gregory Price <gregory.price@memverge.com> writes:
> >> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
> >> > > > > It's possible for the performance of a NUMA node to change, if we
> >> > > > > hot-remove a memory device, then hot-add another different memory
> >> > > > > device.  It's hoped that the CDAT changes too.
> >> > > >
> >> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
> >> > > > HMAT values based on firmware notifications...  So we 'could' make
> >> > > > it work for HMAT based description.
> >> > > >
> >> > > > Ultimately my current thinking is we'll end up emulating CXL type3
> >> > > > devices (hiding topology complexity) and you can update CDAT but
> >> > > > IIRC that is only meant to be for degraded situations - so if you
> >> > > > want multiple performance regions, CDAT should describe them from the start.
> >> > > >
> >> > >
> >> > > That was my thought.  I don't think it's particularly *realistic* for
> >> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
> >> > > it could be valuable.
> >> > >
> >> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
> >> > > > > >
> >> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
> >> > > > > > (i.e. host CXL expander memory passed through to the guest), and
> >> > > > > > allow the guest to apply memory tiering.
> >> > > > > >
> >> > > > > > There are multiple issues with this, presently:
> >> > > > > >
> >> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
> >> > > > > >    performant enough to be a commodity class virtualization.
> >> > > >
> >> > > > I'd flex that a bit - we will end up with a solution for virtualization but
> >> > > > it isn't the emulation that is there today because it's not possible to
> >> > > > emulate some of the topology in a performant manner (interleaving with sub
> >> > > > page granularity / interleaving at all (to a lesser degree)). There are
> >> > > > ways to do better than we are today, but they start to look like
> >> > > > software disaggregated memory setups (think lots of page faults in the host).
> >> > > >
> >> > >
> >> > > Agreed, the emulated device as-is can't be the virtualization device,
> >> > > but it doesn't mean it can't be the basis for it.
> >> > >
> >> > > My thought is, if you want to pass host CXL *memory* through to the
> >> > > guest, you don't actually care to pass CXL *control* through to the
> >> > > guest.  That control lies pretty squarely with the host/hypervisor.
> >> > >
> >> > > So, at least in theory, you can just cut the type3 device out of the
> >> > > QEMU configuration entirely and just pass it through as a distinct numa
> >> > > node with specific hmat qualities.
> >> > >
> >> > > Barring that, if we must go through the type3 device, the question is
> >> > > how difficult would it be to just make a stripped down type3 device
> >> > > to provide the informational components, but hack off anything
> >> > > topology/interleave related? Then you just do direct passthrough as you
> >> > > described below.
> >> > >
> >> > > qemu/kvm would report errors if you tried to touch the naughty bits.
> >> > >
> >> > > The second question is... is that device "compliant" or does it need
> >> > > super special handling from the kernel driver :D?  If what i described
> >> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
> >> > > just hide the CXL device entirely from the guest (for this use case)
> >> > > and just pass the memory through as a numa node.
> >> > >
> >> > > Which gets us back to: The memory-tiering component needs a way to
> >> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
> >> > > of those seem like totally valid ways to go about it.
> >> > >
> >> > > > > >
> >> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
> >> > > > > >    part of a CXL memory device, the nodes are lumped together in the
> >> > > > > >    DRAM tier.
> >> > > > > >
> >> > > > > > None of this has to do with firmware.
> >> > > > > >
> >> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
> >> > > > > > have HMAT information that can be passed through via QEMU:
> >> > > > > >
> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
> >> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
> >> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
> >> > > > > >
> >> > > > > > Not only would it be nice if we could change tier membership based on
> >> > > > > > this data, it's realistically the only way to allow guests to accomplish
> >> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
> >> > > >
> >> > > > This I fully agree with.  There will be systems with a bunch of normal DDR with different
> >> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
> >> > > > before we get anything more complex in place for CXL.
> >> > > >
> >> > >
> >> > > Had not even considered this, but that's completely accurate as well.
> >> > >
> >> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
> >> > > isn't necessarily a violation of any standard.  There probably could be
> >> > > a release valve for us to still make those devices useful.
> >> > >
> >> > > The concern I have with not implementing a movement mechanism *at all*
> >> > > is that a one-size-fits-all initial-placement heuristic feels gross
> >> > > when we're, at least ideologically, moving toward "software defined memory".
> >> > >
> >> > > Personally I think the movement mechanism is a good idea that gets folks
> >> > > where they're going sooner, and it doesn't hurt anything by existing. We
> >> > > can change the initial placement mechanism too.
> >> >
> >> > I think providing users a way to "FIX" the memory tiering is a backup
> >> > option. Given that DDRs with different access characteristics provide
> >> > the relevant CDAT/HMAT information, the kernel should be able to
> >> > correctly establish memory tiering on boot.
> >>
> >> Include hotplug and I'll be happier!  I know that's messy though.
> >>
> >> > Current memory tiering code has
> >> > 1) memory_tier_init() to iterate through all boot onlined memory
> >> > nodes. All nodes are assumed to be fast tier (adistance
> >> > MEMTIER_ADISTANCE_DRAM is used).
> >> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
> >> > nodes. This is the place the kernel reads the memory attributes from
> >> > HMAT and recognizes the memory nodes into the correct tier (devdax
> >> > controlled CXL, pmem, etc).
> >> > If we want DDRs with different memory characteristics to be put into
> >> > the correct tier (as in the guest VM memory tiering case), we probably
> >> > need a third path to iterate the boot onlined memory nodes and also be
> >> > able to read their memory attributes. I don't think we can do that in
> >> > 1) because the ACPI subsystem is not yet initialized.
> >>
> >> Can we move it later in general?  Or drag HMAT parsing earlier?
> >> ACPI table availability is pretty early, it's just that we don't bother
> >> with HMAT because nothing early uses it.
> >> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
> >
> > I tested the call sequence under a debugger earlier. hmat_init() is
> > called after memory_tier_init(). Let me poke around and see what our
> > options are.
>
> This sounds reasonable.
>
> Please keep in mind that we need a way to identify the base line memory
> type(default_dram_type).  A simple method is to use NUMA nodes with CPU
> attached.  But I remember that Aneesh said that some NUMA nodes without
> CPU will need to be put in default_dram_type too on their systems.  We
> need a way to identify that.

Yes, I am prototyping along the lines you described. In
memory_tier_init(), we will set the memory tier only for the NUMA
nodes with CPUs. In hmat_init(), I am trying to call back into mm to
finish the memory tier initialization for the CPUless NUMA nodes. If a
CPUless NUMA node can't get an effective adistance from
mt_calc_adistance(), we will fall back to adding that node to
default_dram_type.
The other thing I want to try is calling mt_calc_adistance() on
a memory node with CPUs to see what adistance is returned.
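
The two-phase flow described above can be modelled in userspace roughly as
follows. This is only a sketch, not kernel code: struct sketch_node, the
sketch_* helpers and the SKETCH_* constants are hypothetical names made up
for illustration, the DRAM value is a placeholder rather than the kernel's
MEMTIER_ADISTANCE_DRAM, and sketch_calc_adistance() merely stands in for
mt_calc_adistance() without matching its real signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Placeholder values for illustration; not taken from the kernel. */
#define SKETCH_ADISTANCE_DRAM  576
#define SKETCH_ADISTANCE_UNSET (-1)

struct sketch_node {
	bool has_cpu;        /* node has CPUs attached */
	bool has_perf;       /* HMAT/CDAT performance data is available */
	int  perf_adistance; /* adistance derived from that data, if any */
	int  adistance;      /* result; UNSET until a phase assigns it */
};

/* Hypothetical stand-in for mt_calc_adistance(): true on success. */
static bool sketch_calc_adistance(const struct sketch_node *n, int *adist)
{
	if (!n->has_perf)
		return false;
	*adist = n->perf_adistance;
	return true;
}

/* Phase 1, at memory_tier_init() time: only nodes with CPUs are placed. */
static void sketch_early_init(struct sketch_node *nodes, size_t count)
{
	assert(nodes != NULL || count == 0);
	for (size_t i = 0; i < count; i++)
		nodes[i].adistance = nodes[i].has_cpu ?
			SKETCH_ADISTANCE_DRAM : SKETCH_ADISTANCE_UNSET;
}

/* Phase 2, at hmat_init() time: finish the CPUless nodes. */
static void sketch_late_init(struct sketch_node *nodes, size_t count)
{
	assert(nodes != NULL || count == 0);
	for (size_t i = 0; i < count; i++) {
		int adist;

		if (nodes[i].has_cpu)
			continue; /* already placed in phase 1 */
		if (!sketch_calc_adistance(&nodes[i], &adist))
			adist = SKETCH_ADISTANCE_DRAM; /* fall back to default DRAM */
		nodes[i].adistance = adist;
	}
}
```

Calling sketch_early_init() and then sketch_late_init() on the same array
leaves CPU nodes and CPUless nodes without performance data in the default
DRAM class, while CPUless nodes with performance data keep the adistance
derived from it.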

>
> --
> Best Regards,
> Huang, Ying

Huang, Ying Jan. 15, 2024, 1:24 a.m. UTC | #25

Hao Xiang <hao.xiang@bytedance.com> writes:

> On Thu, Jan 11, 2024 at 11:02 PM Huang, Ying <ying.huang@intel.com> wrote:
>>
>> Hao Xiang <hao.xiang@bytedance.com> writes:
>>
>> > On Wed, Jan 10, 2024 at 6:18 AM Jonathan Cameron
>> > <Jonathan.Cameron@huawei.com> wrote:
>> >>
>> >> On Tue, 9 Jan 2024 16:28:15 -0800
>> >> Hao Xiang <hao.xiang@bytedance.com> wrote:
>> >>
>> >> > On Tue, Jan 9, 2024 at 9:59 AM Gregory Price <gregory.price@memverge.com> wrote:
>> >> > >
>> >> > > On Tue, Jan 09, 2024 at 03:50:49PM +0000, Jonathan Cameron wrote:
>> >> > > > On Tue, 09 Jan 2024 11:41:11 +0800
>> >> > > > "Huang, Ying" <ying.huang@intel.com> wrote:
>> >> > > > > Gregory Price <gregory.price@memverge.com> writes:
>> >> > > > > > On Thu, Jan 04, 2024 at 02:05:01PM +0800, Huang, Ying wrote:
>> >> > > > > It's possible for the performance of a NUMA node to change, if we
>> >> > > > > hot-remove a memory device, then hot-add another different memory
>> >> > > > > device.  It's hoped that the CDAT changes too.
>> >> > > >
>> >> > > > Not supported, but ACPI has _HMA methods to in theory allow changing
>> >> > > > HMAT values based on firmware notifications...  So we 'could' make
>> >> > > > it work for HMAT based description.
>> >> > > >
>> >> > > > Ultimately my current thinking is we'll end up emulating CXL type3
>> >> > > > devices (hiding topology complexity) and you can update CDAT but
>> >> > > > IIRC that is only meant to be for degraded situations - so if you
>> >> > > > want multiple performance regions, CDAT should describe them from the start.
>> >> > > >
>> >> > >
>> >> > > That was my thought.  I don't think it's particularly *realistic* for
>> >> > > HMAT/CDAT values to change at runtime, but I can imagine a case where
>> >> > > it could be valuable.
>> >> > >
>> >> > > > > > https://lore.kernel.org/linux-cxl/CAAYibXjZ0HSCqMrzXGv62cMLncS_81R3e1uNV5Fu4CPm0zAtYw@mail.gmail.com/
>> >> > > > > >
>> >> > > > > > This group wants to enable passing CXL memory through to KVM/QEMU
>> >> > > > > > (i.e. host CXL expander memory passed through to the guest), and
>> >> > > > > > allow the guest to apply memory tiering.
>> >> > > > > >
>> >> > > > > > There are multiple issues with this, presently:
>> >> > > > > >
>> >> > > > > > 1. The QEMU CXL virtual device is not and probably never will be
>> >> > > > > >    performant enough to be a commodity class virtualization.
>> >> > > >
>> >> > > > I'd flex that a bit - we will end up with a solution for virtualization but
>> >> > > > it isn't the emulation that is there today because it's not possible to
>> >> > > > emulate some of the topology in a performant manner (interleaving with sub
>> >> > > > page granularity / interleaving at all (to a lesser degree)). There are
>> >> > > > ways to do better than we are today, but they start to look like
>> >> > > > software disaggregated memory setups (think lots of page faults in the host).
>> >> > > >
>> >> > >
>> >> > > Agreed, the emulated device as-is can't be the virtualization device,
>> >> > > but it doesn't mean it can't be the basis for it.
>> >> > >
>> >> > > My thought is, if you want to pass host CXL *memory* through to the
>> >> > > guest, you don't actually care to pass CXL *control* through to the
>> >> > > guest.  That control lies pretty squarely with the host/hypervisor.
>> >> > >
>> >> > > So, at least in theory, you can just cut the type3 device out of the
>> >> > > QEMU configuration entirely and just pass it through as a distinct numa
>> >> > > node with specific hmat qualities.
>> >> > >
>> >> > > Barring that, if we must go through the type3 device, the question is
>> >> > > how difficult would it be to just make a stripped down type3 device
>> >> > > to provide the informational components, but hack off anything
>> >> > > topology/interleave related? Then you just do direct passthrough as you
>> >> > > described below.
>> >> > >
>> >> > > qemu/kvm would report errors if you tried to touch the naughty bits.
>> >> > >
>> >> > > The second question is... is that device "compliant" or does it need
>> >> > > super special handling from the kernel driver :D?  If what i described
>> >> > > is not "compliant", then it's probably a bad idea, and KVM/QEMU should
>> >> > > just hide the CXL device entirely from the guest (for this use case)
>> >> > > and just pass the memory through as a numa node.
>> >> > >
>> >> > > Which gets us back to: The memory-tiering component needs a way to
>> >> > > place nodes in different tiers based on HMAT/CDAT/User Whim. All three
>> >> > > of those seem like totally valid ways to go about it.
>> >> > >
>> >> > > > > >
>> >> > > > > > 2. When passing memory through as an explicit NUMA node, but not as
>> >> > > > > >    part of a CXL memory device, the nodes are lumped together in the
>> >> > > > > >    DRAM tier.
>> >> > > > > >
>> >> > > > > > None of this has to do with firmware.
>> >> > > > > >
>> >> > > > > > Memory-type is an awful way of denoting membership of a tier, but we
>> >> > > > > > have HMAT information that can be passed through via QEMU:
>> >> > > > > >
>> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node0 \
>> >> > > > > > -object memory-backend-ram,size=4G,id=ram-node1 \
>> >> > > > > > -numa node,nodeid=0,cpus=0-4,memdev=ram-node0 \
>> >> > > > > > -numa node,initiator=0,nodeid=1,memdev=ram-node1 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-latency,latency=10 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=0,hierarchy=memory,data-type=access-bandwidth,bandwidth=10485760 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-latency,latency=20 \
>> >> > > > > > -numa hmat-lb,initiator=0,target=1,hierarchy=memory,data-type=access-bandwidth,bandwidth=5242880
>> >> > > > > >
>> >> > > > > > Not only would it be nice if we could change tier membership based on
>> >> > > > > > this data, it's realistically the only way to allow guests to accomplish
>> >> > > > > > memory tiering w/ KVM/QEMU and CXL memory passed through to the guest.
>> >> > > >
>> >> > > > This I fully agree with.  There will be systems with a bunch of normal DDR with different
>> >> > > > access characteristics irrespective of CXL. + likely HMAT solutions will be used
>> >> > > > before we get anything more complex in place for CXL.
>> >> > > >
>> >> > >
>> >> > > Had not even considered this, but that's completely accurate as well.
>> >> > >
>> >> > > And more discretely: What of devices that don't provide HMAT/CDAT? That
>> >> > > isn't necessarily a violation of any standard.  There probably could be
>> >> > > a release valve for us to still make those devices useful.
>> >> > >
>> >> > > The concern I have with not implementing a movement mechanism *at all*
>> >> > > is that a one-size-fits-all initial-placement heuristic feels gross
>> >> > > when we're, at least ideologically, moving toward "software defined memory".
>> >> > >
>> >> > > Personally I think the movement mechanism is a good idea that gets folks
>> >> > > where they're going sooner, and it doesn't hurt anything by existing. We
>> >> > > can change the initial placement mechanism too.
>> >> >
>> >> > I think providing users a way to "FIX" the memory tiering is a backup
>> >> > option. Given that DDRs with different access characteristics provide
>> >> > the relevant CDAT/HMAT information, the kernel should be able to
>> >> > correctly establish memory tiering on boot.
>> >>
>> >> Include hotplug and I'll be happier!  I know that's messy though.
>> >>
>> >> > Current memory tiering code has
>> >> > 1) memory_tier_init() to iterate through all boot onlined memory
>> >> > nodes. All nodes are assumed to be fast tier (adistance
>> >> > MEMTIER_ADISTANCE_DRAM is used).
>> >> > 2) dev_dax_kmem_probe to iterate through all devdax controlled memory
>> >> > nodes. This is the place the kernel reads the memory attributes from
>> >> > HMAT and recognizes the memory nodes into the correct tier (devdax
>> >> > controlled CXL, pmem, etc).
>> >> > If we want DDRs with different memory characteristics to be put into
>> >> > the correct tier (as in the guest VM memory tiering case), we probably
>> >> > need a third path to iterate the boot onlined memory nodes and also be
>> >> > able to read their memory attributes. I don't think we can do that in
>> >> > 1) because the ACPI subsystem is not yet initialized.
>> >>
>> >> Can we move it later in general?  Or drag HMAT parsing earlier?
>> >> ACPI table availability is pretty early, it's just that we don't bother
>> >> with HMAT because nothing early uses it.
>> >> IIRC SRAT parsing occurs way before memory_tier_init() will be called.
>> >
>> > I tested the call sequence under a debugger earlier. hmat_init() is
>> > called after memory_tier_init(). Let me poke around and see what our
>> > options are.
>>
>> This sounds reasonable.
>>
>> Please keep in mind that we need a way to identify the base line memory
>> type(default_dram_type).  A simple method is to use NUMA nodes with CPU
>> attached.  But I remember that Aneesh said that some NUMA nodes without
>> CPU will need to be put in default_dram_type too on their systems.  We
>> need a way to identify that.
>
> Yes, I am doing some prototyping the way you described. In
> memory_tier_init(), we will just set the memory tier for the NUMA
> nodes with CPU. In hmat_init(), I am trying to call back to mm to
> finish the memory tier initialization for the CPUless NUMA nodes. If a
> CPUless numa node can't get the effective adistance from
> mt_calc_adistance(), we will fallback to add that node to
> default_dram_type.

Sounds reasonable to me.

> The other thing I want to experiment is to call mt_calc_adistance() on
> a memory node with CPU and see what kind of adistance will be
> returned.

Anyway, we need a baseline to start from.  The abstract distance is
calculated from the ratio of a node's performance to that of the
default DRAM node.
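
To make the ratio concrete, here is a small userspace sketch. The exact
formula is not spelled out in this thread, so the scaling below (latency
ratio times inverse bandwidth ratio, applied to a baseline distance) is an
assumption, as are the names struct sketch_perf, sketch_perf_to_adistance()
and the placeholder SKETCH_ADISTANCE_DRAM. The sample numbers in the usage
note are the latency/bandwidth values from the QEMU hmat-lb example quoted
earlier in the thread.

```c
#include <assert.h>

/* Placeholder baseline for illustration; not the kernel's constant. */
#define SKETCH_ADISTANCE_DRAM 576LL

struct sketch_perf {
	long long latency;   /* access latency, same unit for all nodes */
	long long bandwidth; /* access bandwidth, same unit for all nodes */
};

/*
 * Scale the DRAM baseline by how much slower (latency) and how much
 * narrower (bandwidth) a node is than the default DRAM node.
 * Returns -1 if a divisor would be zero.
 */
static long long sketch_perf_to_adistance(const struct sketch_perf *node,
					  const struct sketch_perf *dram)
{
	assert(node != 0 && dram != 0);
	if (dram->latency <= 0 || node->bandwidth <= 0)
		return -1;

	return SKETCH_ADISTANCE_DRAM *
	       node->latency / dram->latency *
	       dram->bandwidth / node->bandwidth;
}
```

With the DRAM node at latency 10 and bandwidth 10485760, a node at latency
20 and bandwidth 5242880 comes out at four times the baseline distance under
this assumed formula, and the DRAM node itself maps back to the baseline.
Without a baseline node there is nothing to divide by, which is why the
default DRAM type has to be identified first.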

--
Best Regards,
Huang, Ying
