Message ID: 20250210211223.6139-4-tony.luck@intel.com
State: Changes Requested, archived
Series: Add interfaces for ACPI MRRM table
On Mon, Feb 10, 2025 at 01:12:22PM -0800, Tony Luck wrote:
> Users will likely want to know which node owns each memory range
> and which CPUs are local to the range.
>
> Add a symlink to the node directory to provide both pieces of information.
>
> Signed-off-by: Tony Luck <tony.luck@intel.com>
> ---
>  drivers/acpi/acpi_mrrm.c | 29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
>
> diff --git a/drivers/acpi/acpi_mrrm.c b/drivers/acpi/acpi_mrrm.c
> index 51ed9064e025..28b484943bbd 100644
> --- a/drivers/acpi/acpi_mrrm.c
> +++ b/drivers/acpi/acpi_mrrm.c
> @@ -119,6 +119,31 @@ static struct attribute *memory_range_attrs[] = {
>
>  ATTRIBUTE_GROUPS(memory_range);
>
> +static __init int add_node_link(struct mrrm_mem_range_entry *entry)
> +{
> +	struct node *node = NULL;
> +	int ret = 0;
> +	int nid;
> +
> +	for_each_online_node(nid) {
> +		for (int z = 0; z < MAX_NR_ZONES; z++) {
> +			struct zone *zone = NODE_DATA(nid)->node_zones + z;
> +
> +			if (!populated_zone(zone))
> +				continue;
> +			if (zone_intersects(zone, PHYS_PFN(entry->base), PHYS_PFN(entry->length))) {
> +				node = node_devices[zone->node];
> +				goto found;
> +			}
> +		}
> +	}
> +found:
> +	if (node)
> +		ret = sysfs_create_link(&entry->dev.kobj, &node->dev.kobj, "node");

What is going to remove this symlink if the memory goes away?  Or do
these never get removed?

symlinks in sysfs created like this always worry me.  What is going to
use it?

thanks,

greg k-h
On 11.02.25 07:51, Greg Kroah-Hartman wrote:
> On Mon, Feb 10, 2025 at 01:12:22PM -0800, Tony Luck wrote:
>> Users will likely want to know which node owns each memory range
>> and which CPUs are local to the range.
>>
>> Add a symlink to the node directory to provide both pieces of information.
>>
>> Signed-off-by: Tony Luck <tony.luck@intel.com>
>> ---
>>  drivers/acpi/acpi_mrrm.c | 29 +++++++++++++++++++++++++++++
>>  1 file changed, 29 insertions(+)
>>
>> diff --git a/drivers/acpi/acpi_mrrm.c b/drivers/acpi/acpi_mrrm.c
>> index 51ed9064e025..28b484943bbd 100644
>> --- a/drivers/acpi/acpi_mrrm.c
>> +++ b/drivers/acpi/acpi_mrrm.c
>> @@ -119,6 +119,31 @@ static struct attribute *memory_range_attrs[] = {
>>
>>  ATTRIBUTE_GROUPS(memory_range);
>>
>> +static __init int add_node_link(struct mrrm_mem_range_entry *entry)
>> +{
>> +	struct node *node = NULL;
>> +	int ret = 0;
>> +	int nid;
>> +
>> +	for_each_online_node(nid) {
>> +		for (int z = 0; z < MAX_NR_ZONES; z++) {
>> +			struct zone *zone = NODE_DATA(nid)->node_zones + z;
>> +
>> +			if (!populated_zone(zone))
>> +				continue;
>> +			if (zone_intersects(zone, PHYS_PFN(entry->base), PHYS_PFN(entry->length))) {
>> +				node = node_devices[zone->node];
>> +				goto found;
>> +			}
>> +		}
>> +	}
>> +found:
>> +	if (node)
>> +		ret = sysfs_create_link(&entry->dev.kobj, &node->dev.kobj, "node");
>
> What is going to remove this symlink if the memory goes away?  Or do
> these never get removed?
>
> symlinks in sysfs created like this always worry me.  What is going to
> use it?

On top of that, we seem to be building a separate hierarchy here.

/sys/devices/system/memory/ operates in memory block granularity.

/sys/devices/system/node/nodeX/ links to memory blocks that belong to it.

Why is the memory-block granularity insufficient, and why do we have to
squeeze in another range API here?
>> +	if (node)
>> +		ret = sysfs_create_link(&entry->dev.kobj, &node->dev.kobj, "node");
>
> What is going to remove this symlink if the memory goes away?  Or do
> these never get removed?

There's currently no method for runtime changes to these memory ranges. They
are described by a static ACPI table. I need to poke the folks that came up
with this to ask how memory hotplug will be handled (since CXL seems to be
making that fashionable again).

> symlinks in sysfs created like this always worry me.  What is going to
> use it?

<hand waves>User space tools that want to understand what the "per-region"
monitoring and control features are actually operating on.</hand waves>

-Tony
> > What is going to remove this symlink if the memory goes away?  Or do
> > these never get removed?
> >
> > symlinks in sysfs created like this always worry me.  What is going to
> > use it?
>
> On top of that, we seem to be building a separate hierarchy here.
>
> /sys/devices/system/memory/ operates in memory block granularity.

What defines the memory blocks? I'd initially assumed some connection
to the ACPI SRAT table. But on my test system there are only three
entries in SRAT that define non-zero sized memory blocks (two on
socket/node 0 and one on socket/node 1), yet there are
memory0 .. memory32 directories in /sys/devices/system/memory.

The phys_device and phys_index files aren't helping me figure out
what each of them means.

> /sys/devices/system/node/nodeX/ links to memory blocks that belong to it.
>
> Why is the memory-block granularity insufficient, and why do we have to
> squeeze in another range API here?

If an MRRM range consists of some set of memory blocks (making
sure that no memory block spans across MRRM range boundaries),
then I could add the {local,remote}_region_id files into the memory
block directories.

This could work now while the region assignments are done by the
BIOS. But in the future when the OS gets the opportunity to change them
it might be weird if an MRRM range consists of multiple memory
blocks, since the region_ids in each all update together.

/sys/devices/system/memory seemed like a logical place for
memory ranges. But should I jump up a level and make a new
/sys/devices/system/memory_regions directory to expose these
ranges?

-Tony
On Tue, Feb 11, 2025 at 05:02:11PM +0000, Luck, Tony wrote:
> >> +	if (node)
> >> +		ret = sysfs_create_link(&entry->dev.kobj, &node->dev.kobj, "node");
> >
> > What is going to remove this symlink if the memory goes away?  Or do
> > these never get removed?
>
> There's currently no method for runtime changes to these memory ranges. They
> are described by a static ACPI table. I need to poke the folks that came up
> with this to ask how memory hotplug will be handled (since CXL seems to be
> making that fashionable again).

ACPI should be supporting memory hotplug today, at the very least
"memory add", so surely you have some old boxes to test this with?

> > symlinks in sysfs created like this always worry me.  What is going to
> > use it?
>
> <hand waves>User space tools that want to understand what the "per-region"
> monitoring and control features are actually operating on.</hand waves>

If you don't have a real user today, please don't include it now.  Wait
until it is actually needed.

thanks,

greg k-h
On 11.02.25 19:05, Luck, Tony wrote:
>>> What is going to remove this symlink if the memory goes away?  Or do
>>> these never get removed?
>>>
>>> symlinks in sysfs created like this always worry me.  What is going to
>>> use it?
>>
>> On top of that, we seem to be building a separate hierarchy here.
>>
>> /sys/devices/system/memory/ operates in memory block granularity.
>
> What defines the memory blocks? I'd initially assumed some connection
> to the ACPI SRAT table. But on my test system there are only three
> entries in SRAT that define non-zero sized memory blocks (two on
> socket/node 0 and one on socket/node 1), yet there are
> memory0 .. memory32 directories in /sys/devices/system/memory.

Each memory block is the same size (e.g., 128 MiB .. 2 GiB on x86-64).
The default is memory section granularity (e.g., 128 MiB on x86-64),
but some configs allow for increasing it: see
arch/x86/mm/init_64.c:memory_block_size_bytes(), and in particular
probe_memory_block_size().

They define the granularity in which we can online/offline/add/remove
physical memory managed by the buddy.

We create these blocks during boot/during hotplug, and link them to the
relevant nodes. They do not reflect the HW state, but the state in which
Linux manages that memory (through the buddy).

>
> The phys_device and phys_index files aren't helping me figure out
> what each of them means.

Yes, see Documentation/admin-guide/mm/memory-hotplug.rst

phys_device is a legacy thing for s390x, and phys_index is just the
memory block ID.

You can derive the address range corresponding to a memory block using
the ID. /sys/devices/system/memory/block_size_bytes tells you the size
of each block.

Address range of block X: [ X*block_size_bytes .. (X+1)*block_size_bytes )

Now, the whole interface here is designed for handling memory hotplug:

obj-$(CONFIG_MEMORY_HOTPLUG) += memory.o

It's worth noting that

1) Blocks might not be all-memory (e.g., memory holes). In that case,
   offlining/unplug is not supported.

2) Blocks might span multiple NUMA nodes (e.g., node ends / starts in
   the middle of a block). Similarly, in that case offlining/unplug is
   not supported.

I assume 1) is not a problem. I assume 2) could be a problem for your
use case.

>
>> /sys/devices/system/node/nodeX/ links to memory blocks that belong to it.
>>
>> Why is the memory-block granularity insufficient, and why do we have to
>> squeeze in another range API here?
>
> If an MRRM range consists of some set of memory blocks (making
> sure that no memory block spans across MRRM range boundaries),
> then I could add the {local,remote}_region_id files into the memory
> block directories.
>
> This could work now while the region assignments are done by the
> BIOS. But in the future when the OS gets the opportunity to change them
> it might be weird if an MRRM range consists of multiple memory
> blocks, since the region_ids in each all update together.

What about memory ranges not managed by the buddy (e.g., dax/pmem ranges
not exposed to the buddy through the dax/kmem driver, memory hidden from
Linux using mem=X etc.)?

>
> /sys/devices/system/memory seemed like a logical place for
> memory ranges. But should I jump up a level and make a new
> /sys/devices/system/memory_regions directory to expose these
> ranges?

Let's take one step back. We do have

1) /proc/iomem to list physical device ranges, without a notion of
   nodes / other information.

Maybe we could extend it, but it might be hard. Depending on *what*
information we need to expose and for which memory.

/proc/iomem also doesn't indicate "System RAM" for memory not managed
by the buddy.

2) /sys/devices/system/memory/memoryX and /sys/devices/system/node/

Again, the memory part is more hotplug focused, and we treat individual
memory blocks as "memory block devices".

Reading:

"
The MRRM solution is to tag physical address ranges with "region IDs"
so that platform firmware[1] can indicate the type of memory for each
range (with separate tags available for local vs. remote access to
each range).

The region IDs will be used to provide separate event counts for each
region for "perf" and for the "resctrl" file system to monitor and
control memory bandwidth in each region.

Users will need to know the address range(s) that are part of each
region.
"

A couple of questions:

a) How volatile is that information at runtime? Can ranges / IDs change?
   I read above that user space might in the future be able to
   reconfigure the ranges.

b) How is hotplug/unplug handled?

c) How are memory ranges not managed by Linux handled?

It might make sense to expose what you need in a more specialized,
acpi/MRRM/perf specific form, and not as generic as you currently
envision it.
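[Editorial aside: the block-ID-to-address relation described above can be checked from user space. A minimal sketch in C follows; only the standard sysfs paths are assumed, everything else is illustrative and skips most error handling.]

#include <stdio.h>
#include <stdlib.h>

/* Print the physical address range covered by memory block <id>, using
 * [ id*block_size_bytes .. (id+1)*block_size_bytes ). The sysfs file
 * contains the block size as a hexadecimal value. */
int main(int argc, char **argv)
{
	unsigned long long block_size, id;
	FILE *f = fopen("/sys/devices/system/memory/block_size_bytes", "r");

	if (!f || argc < 2 || fscanf(f, "%llx", &block_size) != 1)
		return 1;
	fclose(f);

	id = strtoull(argv[1], NULL, 0);
	printf("memory%llu: [0x%llx .. 0x%llx)\n",
	       id, id * block_size, (id + 1) * block_size);
	return 0;
}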
> A couple of questions:
>
> a) How volatile is that information at runtime? Can ranges / IDs change?
>    I read above that user space might in the future be able to
>    reconfigure the ranges.

Initial implementation has BIOS making all the choices. When there is a
system that supports OS changes I envision making the "local_region_id"
and "remote_region_id" files writable for the sysadmin to make changes.

Note that the address ranges are fixed (and this isn't going to change).

> b) How is hotplug/unplug handled?

I'm looking for answers to this very good question. Plausibly systems
might reserve address ranges for later hotplug. Those ranges could be
enumerated in the MRRM table. But that is just a guess.

> c) How are memory ranges not managed by Linux handled?

It appears that all system memory is included in the range information,
so access by BIOS to reserved memory ranges would be counted and
controlled (unless SMI were to disable on entry). I'll ask this question.

> It might make sense to expose what you need in a more specialized,
> acpi/MRRM/perf specific form, and not as generic as you currently
> envision it.

Agreed. The only thing going for /sys/devices/system/memory is the name.
The actual semantics of everything below there don't match well with
this usage.

Rafael: How do you feel about this (not implemented yet, just looking
for a new spot to expose this)?

$ cd /sys/firmware/acpi
$ tree memory_ranges/
memory_ranges/
├── range0
│   ├── base
│   ├── length
│   ├── local_region_id
│   └── remote_region_id
├── range1
│   ├── base
│   ├── length
│   ├── local_region_id
│   └── remote_region_id
└── range2
    ├── base
    ├── length
    ├── local_region_id
    └── remote_region_id

4 directories, 12 files

-Tony
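[Editorial aside: a rough kernel-side sketch of how one such range directory could expose read-only files as device attributes. The struct layout and names used here (mrrm_range, the region-ID field) are assumptions for illustration, not the actual acpi_mrrm.c code.]

#include <linux/device.h>
#include <linux/sysfs.h>

/* Illustrative only -- layout and names are assumptions, not the driver. */
struct mrrm_range {
	struct device dev;
	u64 base;
	u64 length;
	u8 local_region_id;
};

static ssize_t base_show(struct device *dev, struct device_attribute *attr,
			 char *buf)
{
	struct mrrm_range *r = container_of(dev, struct mrrm_range, dev);

	return sysfs_emit(buf, "0x%llx\n", r->base);
}
static DEVICE_ATTR_RO(base);

static ssize_t local_region_id_show(struct device *dev,
				    struct device_attribute *attr, char *buf)
{
	struct mrrm_range *r = container_of(dev, struct mrrm_range, dev);

	return sysfs_emit(buf, "%u\n", r->local_region_id);
}
/* Would become DEVICE_ATTR_RW() if/when the OS may change region IDs. */
static DEVICE_ATTR_RO(local_region_id);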
diff --git a/drivers/acpi/acpi_mrrm.c b/drivers/acpi/acpi_mrrm.c
index 51ed9064e025..28b484943bbd 100644
--- a/drivers/acpi/acpi_mrrm.c
+++ b/drivers/acpi/acpi_mrrm.c
@@ -119,6 +119,31 @@ static struct attribute *memory_range_attrs[] = {
 
 ATTRIBUTE_GROUPS(memory_range);
 
+static __init int add_node_link(struct mrrm_mem_range_entry *entry)
+{
+	struct node *node = NULL;
+	int ret = 0;
+	int nid;
+
+	for_each_online_node(nid) {
+		for (int z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = NODE_DATA(nid)->node_zones + z;
+
+			if (!populated_zone(zone))
+				continue;
+			if (zone_intersects(zone, PHYS_PFN(entry->base), PHYS_PFN(entry->length))) {
+				node = node_devices[zone->node];
+				goto found;
+			}
+		}
+	}
+found:
+	if (node)
+		ret = sysfs_create_link(&entry->dev.kobj, &node->dev.kobj, "node");
+
+	return ret;
+}
+
 static __init int add_boot_memory_ranges(void)
 {
 	char name[16];
@@ -140,6 +165,10 @@ static __init int add_boot_memory_ranges(void)
 			put_device(&entry->dev);
 			return ret;
 		}
+
+		ret = add_node_link(entry);
+		if (ret)
+			break;
 	}
 
 	return ret;
Users will likely want to know which node owns each memory range
and which CPUs are local to the range.

Add a symlink to the node directory to provide both pieces of information.

Signed-off-by: Tony Luck <tony.luck@intel.com>
---
 drivers/acpi/acpi_mrrm.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)