
[3/4] ACPI/MRRM: Add "node" symlink to /sys/devices/system/memory/rangeX

Message ID 20250210211223.6139-4-tony.luck@intel.com (mailing list archive)
State Changes Requested, archived
Series Add interfaces for ACPI MRRM table

Commit Message

Luck, Tony Feb. 10, 2025, 9:12 p.m. UTC
Users will likely want to know which node owns each memory range
and which CPUs are local to the range.

Add a symlink to the node directory to provide both pieces of information.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 drivers/acpi/acpi_mrrm.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

Comments

Greg Kroah-Hartman Feb. 11, 2025, 6:51 a.m. UTC | #1
On Mon, Feb 10, 2025 at 01:12:22PM -0800, Tony Luck wrote:
> Users will likely want to know which node owns each memory range
> and which CPUs are local to the range.
> 
> Add a symlink to the node directory to provide both pieces of information.
> 
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  drivers/acpi/acpi_mrrm.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/drivers/acpi/acpi_mrrm.c b/drivers/acpi/acpi_mrrm.c
> index 51ed9064e025..28b484943bbd 100644
> --- a/drivers/acpi/acpi_mrrm.c
> +++ b/drivers/acpi/acpi_mrrm.c
> @@ -119,6 +119,31 @@ static struct attribute *memory_range_attrs[] = {
>  
>  ATTRIBUTE_GROUPS(memory_range);
>  
> +static __init int add_node_link(struct mrrm_mem_range_entry *entry)
> +{
> +	struct node *node = NULL;
> +	int ret = 0;
> +	int nid;
> +
> +	for_each_online_node(nid) {
> +		for (int z = 0; z < MAX_NR_ZONES; z++) {
> +			struct zone *zone = NODE_DATA(nid)->node_zones + z;
> +
> +			if (!populated_zone(zone))
> +				continue;
> +			if (zone_intersects(zone, PHYS_PFN(entry->base), PHYS_PFN(entry->length))) {
> +				node = node_devices[zone->node];
> +				goto found;
> +			}
> +		}
> +	}
> +found:
> +	if (node)
> +		ret = sysfs_create_link(&entry->dev.kobj, &node->dev.kobj, "node");

What is going to remove this symlink if the memory goes away?  Or do
these never get removed?

symlinks in sysfs created like this always worry me.  What is going to
use it?

thanks,

greg k-h
David Hildenbrand Feb. 11, 2025, 1:27 p.m. UTC | #2
On 11.02.25 07:51, Greg Kroah-Hartman wrote:
> On Mon, Feb 10, 2025 at 01:12:22PM -0800, Tony Luck wrote:
>> Users will likely want to know which node owns each memory range
>> and which CPUs are local to the range.
>>
>> Add a symlink to the node directory to provide both pieces of information.
>>
>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>> ---
>>   drivers/acpi/acpi_mrrm.c | 29 +++++++++++++++++++++++++++++
>>   1 file changed, 29 insertions(+)
>>
>> diff --git a/drivers/acpi/acpi_mrrm.c b/drivers/acpi/acpi_mrrm.c
>> index 51ed9064e025..28b484943bbd 100644
>> --- a/drivers/acpi/acpi_mrrm.c
>> +++ b/drivers/acpi/acpi_mrrm.c
>> @@ -119,6 +119,31 @@ static struct attribute *memory_range_attrs[] = {
>>   
>>   ATTRIBUTE_GROUPS(memory_range);
>>   
>> +static __init int add_node_link(struct mrrm_mem_range_entry *entry)
>> +{
>> +	struct node *node = NULL;
>> +	int ret = 0;
>> +	int nid;
>> +
>> +	for_each_online_node(nid) {
>> +		for (int z = 0; z < MAX_NR_ZONES; z++) {
>> +			struct zone *zone = NODE_DATA(nid)->node_zones + z;
>> +
>> +			if (!populated_zone(zone))
>> +				continue;
>> +			if (zone_intersects(zone, PHYS_PFN(entry->base), PHYS_PFN(entry->length))) {
>> +				node = node_devices[zone->node];
>> +				goto found;
>> +			}
>> +		}
>> +	}
>> +found:
>> +	if (node)
>> +		ret = sysfs_create_link(&entry->dev.kobj, &node->dev.kobj, "node");
> 
> What is going to remove this symlink if the memory goes away?  Or do
> these never get removed?
> 
> symlinks in sysfs created like this always worry me.  What is going to
> use it?

On top of that, we seem to be building a separate hierarchy here.

/sys/devices/system/memory/ operates in memory block granularity.

/sys/devices/system/node/nodeX/ links to memory blocks that belong to it.

Why is the memory-block granularity insufficient, and why do we have to 
squeeze in another range API here?
Luck, Tony Feb. 11, 2025, 5:02 p.m. UTC | #3
>> +	if (node)
>> +		ret = sysfs_create_link(&entry->dev.kobj, &node->dev.kobj, "node");
>
> What is going to remove this symlink if the memory goes away?  Or do
> these never get removed?

There's currently no method for runtime changes to these memory ranges. They
are described by a static ACPI table.  I need to poke the folks that came up
with this to ask how memory hotplug will be handled (since CXL seems to be
making that fashionable again).

> symlinks in sysfs created like this always worry me.  What is going to
> use it?

<hand waves>User space tools that want to understand what the "per-region"
monitoring and control features are actually operating on.</hand waves>

-Tony
Luck, Tony Feb. 11, 2025, 6:05 p.m. UTC | #4
> > What is going to remove this symlink if the memory goes away?  Or do
> > these never get removed?
> >
> > symlinks in sysfs created like this always worry me.  What is going to
> > use it?
>
> On top of that, we seem to be building a separate hierarchy here.
>
> /sys/devices/system/memory/ operates in memory block granularity.

What defines the memory blocks? I'd initially assumed some connection
to the ACPI SRAT table. But on my test system there are only three
entries in SRAT that define non-zero sized memory blocks (two on
socket/node 0 and one on socket/node 1), yet there are:
    memory0 .. memory32 directories
in /sys/devices/system/memory.

The phys_device and phys_index files aren't helping me figure out
what each of them means.

> /sys/devices/system/node/nodeX/ links to memory blocks that belong to it.
>
> Why is the memory-block granularity insufficient, and why do we have to
> squeeze in another range API here?

If an MRRM range consists of some set of memory blocks (making
sure that no memory block spans across MRRM range boundaries),
then I could add the {local,remote}_region_id files into the memory
block directories.

This could work now while the region assignments are done by the
BIOS. But in the future, when the OS gets the opportunity to change them,
it might be weird if an MRRM range consists of multiple memory
blocks, since the region_ids in all of them would update together.

/sys/devices/system/memory seemed like a logical place for
memory ranges. But should I jump up a level and make a new
/sys/devices/system/memory_regions directory to expose these
ranges?

-Tony
Greg Kroah-Hartman Feb. 12, 2025, 7:48 a.m. UTC | #5
On Tue, Feb 11, 2025 at 05:02:11PM +0000, Luck, Tony wrote:
> >> +	if (node)
> >> +		ret = sysfs_create_link(&entry->dev.kobj, &node->dev.kobj, "node");
> >
> > What is going to remove this symlink if the memory goes away?  Or do
> > these never get removed?
> 
> There's currently no method for runtime changes to these memory ranges. They
> are described by a static ACPI table.  I need to poke the folks that came up
> with this to ask how memory hotplug will be handled (since CXL seems to be
> making that fashionable again).

ACPI should be supporting memory hotplug today, at the very least
"memory add", so surely you have some old boxes to test this with?

> > symlinks in sysfs created like this always worry me.  What is going to
> > use it?
> 
> <hand waves>User space tools that want to understand what the "per-region"
> monitoring and control features are actually operating on.</hand waves>

If you don't have a real user today, please don't include it now.  Wait
until it is actually needed.

thanks,

greg k-h
David Hildenbrand Feb. 13, 2025, 1:30 p.m. UTC | #6
On 11.02.25 19:05, Luck, Tony wrote:
>>> What is going to remove this symlink if the memory goes away?  Or do
>>> these never get removed?
>>>
>>> symlinks in sysfs created like this always worry me.  What is going to
>>> use it?
>>
>> On top of that, we seem to be building a separate hierarchy here.
>>
>> /sys/devices/system/memory/ operates in memory block granularity.
> 
> What defines the memory blocks? I'd initially assumed some connection
> to the ACPI SRAT table. But on my test system there are only three
> entries in SRAT that define non-zero sized memory blocks (two on
> socket/node 0 and one on socket/node 1), yet there are:
>      memory0 .. memory32 directories
> in /sys/devices/system/memory.

Each memory block is the same size (e.g., 128 MiB .. 2 GiB on x86-64).

The default is memory section granularity (e.g., 128 MiB on x86-64), but 
some configs allow for increasing it: see 
arch/x86/mm/init_64.c:memory_block_size_bytes(), and in particular 
probe_memory_block_size().

They define the granularity in which we can online/offline/add/remove 
physical memory managed by the buddy.

We create these blocks during boot/during hotplug, and link them to the 
relevant nodes.

They do not reflect the HW state, but the state in which Linux manages 
that memory (through the buddy).

> 
> The phys_device and phys_index files aren't helping me figure out
> what each of them means.

Yes, see Documentation/admin-guide/mm/memory-hotplug.rst

phys_device is a legacy thing for s390x, and phys_index is just the 
memory block ID.

You can derive the address range corresponding to a memory block using 
the ID.

/sys/devices/system/memory/block_size_bytes tells you the size of each 
block.

Address range of block X:
   [ X*block_size_bytes .. (X+1)*block_size_bytes )
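
For illustration only (nothing like this exists in the tree), a tiny
userspace C sketch that derives the range from block_size_bytes; note
that sysfs reports the block size in hex:

    /* Print the physical address range covered by memoryX. */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
            unsigned long long block_size, id;
            FILE *f;

            if (argc != 2)
                    return 1;

            f = fopen("/sys/devices/system/memory/block_size_bytes", "r");
            if (!f || fscanf(f, "%llx", &block_size) != 1)
                    return 1;
            fclose(f);

            id = strtoull(argv[1], NULL, 0);
            /* Address range of block X: [ X*block_size .. (X+1)*block_size ) */
            printf("memory%llu: 0x%llx-0x%llx\n", id,
                   id * block_size, (id + 1) * block_size - 1);
            return 0;
    }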


Now, the whole interface here is designed for handling memory hotplug:

obj-$(CONFIG_MEMORY_HOTPLUG) += memory.o

It's worth noting that

1) Blocks might not be all-memory (e.g., memory holes). In that case,
    offlining/unplug is not supported.
2) Blocks might span multiple NUMA nodes (e.g., node ends / starts in
    the middle of a block). Similarly, in that case
    offlining/unplug is not supported.

I assume 1) is not a problem. I assume 2) could be a problem for your 
use case.

> 
>> /sys/devices/system/node/nodeX/ links to memory blocks that belong to it.
>>
>> Why is the memory-block granularity insufficient, and why do we have to
>> squeeze in another range API here?
> 
> If an MRRM range consists of some set of memory blocks (making
> sure that no memory block spans across MRRM range boundaries),
> then I could add the {local,remote}_region_id files into the memory
> block directories.
> 
> This could work now while the region assignments are done by the
> BIOS. But in the future, when the OS gets the opportunity to change them,
> it might be weird if an MRRM range consists of multiple memory
> blocks, since the region_ids in all of them would update together.

What about memory ranges not managed by the buddy (e.g., dax/pmem ranges 
not exposed to the buddy through dax/kmem driver, memory hidden from 
Linux using mem=X etc.)?

> 
> /sys/devices/system/memory seemed like a logical place for
> memory ranges. But should I jump up a level and make a new
> /sys/devices/system/memory_regions directory to expose these
> ranges?

Let's take one step back. We do have

1) /proc/iomem to list physical device ranges, without a notion of nodes 
/ other information. Maybe we could extend it, but it might be hard, 
depending on *what* information we need to expose and for which memory.

/proc/iomem also doesn't indicate "System RAM" for memory not managed by 
the buddy.

2) /sys/devices/system/memory/memoryX and /sys/devices/system/node/

Again, the memory part is more hotplug focused, and we treat 
individual memory blocks as "memory block devices".


Reading:

"
The MRRM solution is to tag physical address ranges with "region IDs"
so that platform firmware[1] can indicate the type of memory for each
range (with separate tags available for local vs. remote access to
each range).

The region IDs will be used to provide separate event counts for each
region for "perf" and for the "resctrl" file system to monitor and
control memory bandwidth in each region.

Users will need to know the address range(s) that are part of each
region."

A couple of questions:

a) How volatile is that information at runtime? Can ranges / IDs change?
    I read above that user space might in the future be able to
    reconfigure the ranges.

b) How is hotplug/unplug handled?

c) How are memory ranges not managed by Linux handled?

It might make sense to expose what you need in a more specialized, 
acpi/MRRM/perf specific form, and not as generic as you currently 
envision it.
Luck, Tony Feb. 13, 2025, 7:05 p.m. UTC | #7
> A couple of questions:
>
> a) How volatile is that information at runtime? Can ranges / IDs change?
>     I read above that user space might in the future be able to
>     reconfigure the ranges.

The initial implementation has the BIOS making all the choices. When there
is a system that supports OS changes, I envision making the "local_region_id"
and "remote_region_id" files writable so the sysadmin can change them.

Note that the address ranges are fixed (and this isn't going to change).
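
Roughly what I have in mind for the writable case (just a sketch, nothing
is implemented yet, and mrrm_set_local_region_id() is a placeholder for
whatever the eventual update mechanism turns out to be):

    static ssize_t local_region_id_store(struct device *dev,
                                         struct device_attribute *attr,
                                         const char *buf, size_t count)
    {
            struct mrrm_mem_range_entry *entry =
                    container_of(dev, struct mrrm_mem_range_entry, dev);
            unsigned int id;
            int ret;

            ret = kstrtouint(buf, 0, &id);
            if (ret)
                    return ret;

            /* Placeholder for the (not yet defined) OS-side update hook. */
            ret = mrrm_set_local_region_id(entry, id);

            return ret ? ret : count;
    }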

> b) How is hotplug/unplug handled?

I'm looking for answers to this very good question. Plausibly systems
might reserve address ranges for later hotplug. Those ranges could be
enumerated in the MRRM table. But that is just a guess.

> c) How are memory ranges not managed by Linux handled?

It appears that all system memory is included in the range information,
so accesses by the BIOS to reserved memory ranges would be counted
and controlled (unless SMI were to disable it on entry). I'll ask this
question.

> It might make sense to expose what you need in a more specialized,
> acpi/MRRM/perf specific form, and not as generic as you currently
> envision it.

Agreed. The only thing going for /sys/devices/system/memory is the name.
The actual semantics of everything below there don't match well with this
usage.

Rafael: How do you feel about this (not implemented yet, just looking for
a new spot to expose this)?

$ cd /sys/firmware/acpi
$ tree memory_ranges/
memory_ranges/
├── range0
│   ├── base
│   ├── length
│   ├── local_region_id
│   └── remote_region_id
├── range1
│   ├── base
│   ├── length
│   ├── local_region_id
│   └── remote_region_id
└── range2
    ├── base
    ├── length
    ├── local_region_id
    └── remote_region_id

4 directories, 12 files
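
If that location works, the plumbing would presumably hang off acpi_kobj
(the kobject behind /sys/firmware/acpi). A minimal sketch, assuming the
existing rangeX attributes simply get re-parented there:

    static struct kobject *mrrm_kobj;

    static __init int mrrm_create_memory_ranges(void)
    {
            /* Parent directory: /sys/firmware/acpi/memory_ranges */
            mrrm_kobj = kobject_create_and_add("memory_ranges", acpi_kobj);
            if (!mrrm_kobj)
                    return -ENOMEM;

            /*
             * Each rangeN would become a child of mrrm_kobj carrying the
             * base, length, local_region_id and remote_region_id attributes.
             */
            return 0;
    }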

-Tony

Patch

diff --git a/drivers/acpi/acpi_mrrm.c b/drivers/acpi/acpi_mrrm.c
index 51ed9064e025..28b484943bbd 100644
--- a/drivers/acpi/acpi_mrrm.c
+++ b/drivers/acpi/acpi_mrrm.c
@@ -119,6 +119,31 @@  static struct attribute *memory_range_attrs[] = {
 
 ATTRIBUTE_GROUPS(memory_range);
 
+static __init int add_node_link(struct mrrm_mem_range_entry *entry)
+{
+	struct node *node = NULL;
+	int ret = 0;
+	int nid;
+
+	for_each_online_node(nid) {
+		for (int z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = NODE_DATA(nid)->node_zones + z;
+
+			if (!populated_zone(zone))
+				continue;
+			if (zone_intersects(zone, PHYS_PFN(entry->base), PHYS_PFN(entry->length))) {
+				node = node_devices[zone->node];
+				goto found;
+			}
+		}
+	}
+found:
+	if (node)
+		ret = sysfs_create_link(&entry->dev.kobj, &node->dev.kobj, "node");
+
+	return ret;
+}
+
 static __init int add_boot_memory_ranges(void)
 {
 	char name[16];
@@ -140,6 +165,10 @@  static __init int add_boot_memory_ranges(void)
 			put_device(&entry->dev);
 			return ret;
 		}
+
+		ret = add_node_link(entry);
+		if (ret)
+			break;
 	}
 
 	return ret;