Message ID | 20240924101400.5186-1-bernhard.kaindl@cloud.com (mailing list archive) |
---|---|
State | New |
Headers | show |
Series | Fix XEN_SYSCTL_numainfo[node].memsize for memory holes | expand |
On 24.09.2024 12:14, Bernhard Kaindl wrote: > Fix a long-standing issue (known at since 2014) with the numainfo call. > > The Hypercall `XEN_SYSCTL_numainfo` returns `memsize` for each NUMA node: > > xl info -n: > node: memsize memfree distances > 0: -> 67584 <- 60672 10,21 <- memsize is off by 2048 MB > 1: 65536 60958 21,10 > > So far, `memsize` is calculated from `NODE_DATA->node_spanned_pages`. > It includes memory holes, leading to wrong memsize on x86. Depending on what "memsize" means, it is or is not wrong that way. I'm not sure we can change it like that, at the very least not without bumping the interface version and proving that in-tree uses (if any) are either unaffected or improved. > This patch gets the sum of E820_RAM entries for each NUMA node on boot, > stores it in NODE_DATA->node_present_pages and uses it for `memsize`. > > It also increases it like `total_pages` on memory_add() for memory hotplug. > > The new NODE_DATA->node_present_pages can be slighly lower than the > physical node's RAM due to reserved memory for some of the NUMA nodes. The introduction and maintenance of ->node_present_pages wants to be a separate, prereq change imo. > --- a/xen/arch/x86/x86_64/mm.c > +++ b/xen/arch/x86/x86_64/mm.c > @@ -1333,6 +1333,8 @@ int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm) > /* We can't revert any more */ > share_hotadd_m2p_table(&info); > transfer_pages_to_heap(&info); > + /* Update the node's present pages (like the total_pages of the system) */ > + NODE_DATA(node)->node_present_pages += epfn - spfn; Nit: Blank line ahead of the insertion please. > --- a/xen/common/numa.c > +++ b/xen/common/numa.c > @@ -504,10 +504,22 @@ void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end) > { > unsigned long start_pfn = paddr_to_pfn(start); > unsigned long end_pfn = paddr_to_pfn(end); > + paddr_t map_start, map_end; > + int i = 0, err; arch_get_ram_range()'s first parameter is unsigned int. > NODE_DATA(nodeid)->node_start_pfn = start_pfn; > NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn; > > + /* Add RAM pages within the node to get the present pages for memsize infos */ > + NODE_DATA(nodeid)->node_present_pages = 0; > + while ( (err = arch_get_ram_range(i++, &map_start, &map_end)) != -ENOENT ) { > + if ( err || map_start >= end || map_end <= start ) > + continue; /* Skip non-RAM and maps outside of the node's memory range */ > + /* Add memory that is in the node's memory range (within start and end): */ > + map_start = max(map_start, start); > + map_end = min(map_end, end); > + NODE_DATA(nodeid)->node_present_pages += (map_end - map_start) >> PAGE_SHIFT; > + } Style (whole block): Brace placement, line length. I'm also not convinced the actual calculation is correct: Neither map_start nor map_end need to be page aligned aiui, and hence the present result doesn't necessarily give the actual number of pages (that are usable, and hence meaningful to the consumer of the field). Blank line here please. > @@ -675,7 +687,7 @@ static void cf_check dump_numa(unsigned char key) > mfn_t mfn = _mfn(node_start_pfn(i) + 1); > > printk("NODE%u start->%lu size->%lu free->%lu\n", > - i, node_start_pfn(i), node_spanned_pages(i), > + i, node_start_pfn(i), node_present_pages(i), "size" here really can mean two things. I'd suggest to keep printing node_spanned_pages() and add printing of node_present_pages(). Jan
diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c index b2a280fba3..a22aa45060 100644 --- a/xen/arch/x86/x86_64/mm.c +++ b/xen/arch/x86/x86_64/mm.c @@ -1333,6 +1333,8 @@ int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm) /* We can't revert any more */ share_hotadd_m2p_table(&info); transfer_pages_to_heap(&info); + /* Update the node's present pages (like the total_pages of the system) */ + NODE_DATA(node)->node_present_pages += epfn - spfn; return 0; diff --git a/xen/common/numa.c b/xen/common/numa.c index 28a09766fa..d68cbea44c 100644 --- a/xen/common/numa.c +++ b/xen/common/numa.c @@ -504,10 +504,22 @@ void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end) { unsigned long start_pfn = paddr_to_pfn(start); unsigned long end_pfn = paddr_to_pfn(end); + paddr_t map_start, map_end; + int i = 0, err; NODE_DATA(nodeid)->node_start_pfn = start_pfn; NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn; + /* Add RAM pages within the node to get the present pages for memsize infos */ + NODE_DATA(nodeid)->node_present_pages = 0; + while ( (err = arch_get_ram_range(i++, &map_start, &map_end)) != -ENOENT ) { + if ( err || map_start >= end || map_end <= start ) + continue; /* Skip non-RAM and maps outside of the node's memory range */ + /* Add memory that is in the node's memory range (within start and end): */ + map_start = max(map_start, start); + map_end = min(map_end, end); + NODE_DATA(nodeid)->node_present_pages += (map_end - map_start) >> PAGE_SHIFT; + } node_set_online(nodeid); } @@ -675,7 +687,7 @@ static void cf_check dump_numa(unsigned char key) mfn_t mfn = _mfn(node_start_pfn(i) + 1); printk("NODE%u start->%lu size->%lu free->%lu\n", - i, node_start_pfn(i), node_spanned_pages(i), + i, node_start_pfn(i), node_present_pages(i), avail_node_heap_pages(i)); /* Sanity check mfn_to_nid() */ if ( node_spanned_pages(i) > 1 && mfn_to_nid(mfn) != i ) diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c index d02f44fe3a..cba6d3cfea 100644 --- a/xen/common/sysctl.c +++ b/xen/common/sysctl.c @@ -316,7 +316,7 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl) { if ( node_online(i) ) { - meminfo.memsize = node_spanned_pages(i) << PAGE_SHIFT; + meminfo.memsize = node_present_pages(i) << PAGE_SHIFT; meminfo.memfree = avail_node_heap_pages(i) << PAGE_SHIFT; } else diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h index fd1511a6fb..c860f3ad1c 100644 --- a/xen/include/xen/numa.h +++ b/xen/include/xen/numa.h @@ -71,6 +71,7 @@ extern nodeid_t *memnodemap; struct node_data { unsigned long node_start_pfn; unsigned long node_spanned_pages; + unsigned long node_present_pages; }; extern struct node_data node_data[]; @@ -91,6 +92,7 @@ static inline nodeid_t mfn_to_nid(mfn_t mfn) #define node_start_pfn(nid) (NODE_DATA(nid)->node_start_pfn) #define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages) +#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages) #define node_end_pfn(nid) (NODE_DATA(nid)->node_start_pfn + \ NODE_DATA(nid)->node_spanned_pages) @@ -123,6 +125,7 @@ extern void numa_set_processor_nodes_parsed(nodeid_t node); extern mfn_t first_valid_mfn; #define node_spanned_pages(nid) (max_page - mfn_x(first_valid_mfn)) +#define node_present_pages(nid) total_pages #define node_start_pfn(nid) mfn_x(first_valid_mfn) #define __node_distance(a, b) 20
Fix a long-standing issue (known at since 2014) with the numainfo call. The Hypercall `XEN_SYSCTL_numainfo` returns `memsize` for each NUMA node: xl info -n: node: memsize memfree distances 0: -> 67584 <- 60672 10,21 <- memsize is off by 2048 MB 1: 65536 60958 21,10 So far, `memsize` is calculated from `NODE_DATA->node_spanned_pages`. It includes memory holes, leading to wrong memsize on x86. This patch gets the sum of E820_RAM entries for each NUMA node on boot, stores it in NODE_DATA->node_present_pages and uses it for `memsize`. It also increases it like `total_pages` on memory_add() for memory hotplug. The new NODE_DATA->node_present_pages can be slighly lower than the physical node's RAM due to reserved memory for some of the NUMA nodes. For example, on this example system, NODE_DATA->node_present_pages reports 63.5GB of usable memory on the 1st NUMA node with this patch. This patch uses `arch_get_ram_range()` which is an architecture-provided call that all NUMA architectures already need to provide for iterating over the usable RAM of the system. - Tested on 2-socket and a 4-socket x86 systems - Memory hot-add not tested, but is identical to bumping total_pages. Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com> --- xen/arch/x86/x86_64/mm.c | 2 ++ xen/common/numa.c | 14 +++++++++++++- xen/common/sysctl.c | 2 +- xen/include/xen/numa.h | 3 +++ 4 files changed, 19 insertions(+), 2 deletions(-)