diff mbox series

Fix XEN_SYSCTL_numainfo[node].memsize for memory holes

Message ID 20240924101400.5186-1-bernhard.kaindl@cloud.com (mailing list archive)
State New
Headers show
Series Fix XEN_SYSCTL_numainfo[node].memsize for memory holes | expand

Commit Message

Bernhard Kaindl Sept. 24, 2024, 10:14 a.m. UTC
Fix a long-standing issue (known at since 2014) with the numainfo call.

The Hypercall `XEN_SYSCTL_numainfo` returns `memsize` for each NUMA node:

xl info -n:
node:    memsize    memfree    distances
   0:  -> 67584 <-   60672      10,21  <- memsize is off by 2048 MB
   1:     65536      60958      21,10

So far, `memsize` is calculated from `NODE_DATA->node_spanned_pages`.
It includes memory holes, leading to wrong memsize on x86.

This patch gets the sum of E820_RAM entries for each NUMA node on boot,
stores it in NODE_DATA->node_present_pages and uses it for `memsize`.

It also increases it like `total_pages` on memory_add() for memory hotplug.

The new NODE_DATA->node_present_pages can be slighly lower than the
physical node's RAM due to reserved memory for some of the NUMA nodes.

For example, on this example system, NODE_DATA->node_present_pages
reports 63.5GB of usable memory on the 1st NUMA node with this patch.

This patch uses `arch_get_ram_range()` which is an architecture-provided
call that all NUMA architectures already need to provide for iterating
over the usable RAM of the system.

- Tested on 2-socket and a 4-socket x86 systems
- Memory hot-add not tested, but is identical to bumping total_pages.

Signed-off-by: Bernhard Kaindl <bernhard.kaindl@cloud.com>
---
 xen/arch/x86/x86_64/mm.c |  2 ++
 xen/common/numa.c        | 14 +++++++++++++-
 xen/common/sysctl.c      |  2 +-
 xen/include/xen/numa.h   |  3 +++
 4 files changed, 19 insertions(+), 2 deletions(-)

Comments

Jan Beulich Sept. 24, 2024, 2:07 p.m. UTC | #1
On 24.09.2024 12:14, Bernhard Kaindl wrote:
> Fix a long-standing issue (known at since 2014) with the numainfo call.
> 
> The Hypercall `XEN_SYSCTL_numainfo` returns `memsize` for each NUMA node:
> 
> xl info -n:
> node:    memsize    memfree    distances
>    0:  -> 67584 <-   60672      10,21  <- memsize is off by 2048 MB
>    1:     65536      60958      21,10
> 
> So far, `memsize` is calculated from `NODE_DATA->node_spanned_pages`.
> It includes memory holes, leading to wrong memsize on x86.

Depending on what "memsize" means, it is or is not wrong that way. I'm
not sure we can change it like that, at the very least not without
bumping the interface version and proving that in-tree uses (if any)
are either unaffected or improved.

> This patch gets the sum of E820_RAM entries for each NUMA node on boot,
> stores it in NODE_DATA->node_present_pages and uses it for `memsize`.
> 
> It also increases it like `total_pages` on memory_add() for memory hotplug.
> 
> The new NODE_DATA->node_present_pages can be slighly lower than the
> physical node's RAM due to reserved memory for some of the NUMA nodes.

The introduction and maintenance of ->node_present_pages wants to be a
separate, prereq change imo.

> --- a/xen/arch/x86/x86_64/mm.c
> +++ b/xen/arch/x86/x86_64/mm.c
> @@ -1333,6 +1333,8 @@ int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm)
>      /* We can't revert any more */
>      share_hotadd_m2p_table(&info);
>      transfer_pages_to_heap(&info);
> +    /* Update the node's present pages (like the total_pages of the system) */
> +    NODE_DATA(node)->node_present_pages += epfn - spfn;

Nit: Blank line ahead of the insertion please.

> --- a/xen/common/numa.c
> +++ b/xen/common/numa.c
> @@ -504,10 +504,22 @@ void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
>  {
>      unsigned long start_pfn = paddr_to_pfn(start);
>      unsigned long end_pfn = paddr_to_pfn(end);
> +    paddr_t map_start, map_end;
> +    int i = 0, err;

arch_get_ram_range()'s first parameter is unsigned int.

>      NODE_DATA(nodeid)->node_start_pfn = start_pfn;
>      NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
>  
> +    /* Add RAM pages within the node to get the present pages for memsize infos */
> +    NODE_DATA(nodeid)->node_present_pages = 0;
> +    while ( (err = arch_get_ram_range(i++, &map_start, &map_end)) != -ENOENT ) {
> +        if ( err || map_start >= end || map_end <= start )
> +            continue;  /* Skip non-RAM and maps outside of the node's memory range */
> +        /* Add memory that is in the node's memory range (within start and end): */
> +        map_start = max(map_start, start);
> +        map_end = min(map_end, end);
> +        NODE_DATA(nodeid)->node_present_pages += (map_end - map_start) >> PAGE_SHIFT;
> +    }

Style (whole block): Brace placement, line length.

I'm also not convinced the actual calculation is correct: Neither map_start
nor map_end need to be page aligned aiui, and hence the present result doesn't
necessarily give the actual number of pages (that are usable, and hence
meaningful to the consumer of the field).

Blank line here please.

> @@ -675,7 +687,7 @@ static void cf_check dump_numa(unsigned char key)
>          mfn_t mfn = _mfn(node_start_pfn(i) + 1);
>  
>          printk("NODE%u start->%lu size->%lu free->%lu\n",
> -               i, node_start_pfn(i), node_spanned_pages(i),
> +               i, node_start_pfn(i), node_present_pages(i),

"size" here really can mean two things. I'd suggest to keep printing
node_spanned_pages() and add printing of node_present_pages().

Jan
diff mbox series

Patch

diff --git a/xen/arch/x86/x86_64/mm.c b/xen/arch/x86/x86_64/mm.c
index b2a280fba3..a22aa45060 100644
--- a/xen/arch/x86/x86_64/mm.c
+++ b/xen/arch/x86/x86_64/mm.c
@@ -1333,6 +1333,8 @@  int memory_add(unsigned long spfn, unsigned long epfn, unsigned int pxm)
     /* We can't revert any more */
     share_hotadd_m2p_table(&info);
     transfer_pages_to_heap(&info);
+    /* Update the node's present pages (like the total_pages of the system) */
+    NODE_DATA(node)->node_present_pages += epfn - spfn;
 
     return 0;
 
diff --git a/xen/common/numa.c b/xen/common/numa.c
index 28a09766fa..d68cbea44c 100644
--- a/xen/common/numa.c
+++ b/xen/common/numa.c
@@ -504,10 +504,22 @@  void __init setup_node_bootmem(nodeid_t nodeid, paddr_t start, paddr_t end)
 {
     unsigned long start_pfn = paddr_to_pfn(start);
     unsigned long end_pfn = paddr_to_pfn(end);
+    paddr_t map_start, map_end;
+    int i = 0, err;
 
     NODE_DATA(nodeid)->node_start_pfn = start_pfn;
     NODE_DATA(nodeid)->node_spanned_pages = end_pfn - start_pfn;
 
+    /* Add RAM pages within the node to get the present pages for memsize infos */
+    NODE_DATA(nodeid)->node_present_pages = 0;
+    while ( (err = arch_get_ram_range(i++, &map_start, &map_end)) != -ENOENT ) {
+        if ( err || map_start >= end || map_end <= start )
+            continue;  /* Skip non-RAM and maps outside of the node's memory range */
+        /* Add memory that is in the node's memory range (within start and end): */
+        map_start = max(map_start, start);
+        map_end = min(map_end, end);
+        NODE_DATA(nodeid)->node_present_pages += (map_end - map_start) >> PAGE_SHIFT;
+    }
     node_set_online(nodeid);
 }
 
@@ -675,7 +687,7 @@  static void cf_check dump_numa(unsigned char key)
         mfn_t mfn = _mfn(node_start_pfn(i) + 1);
 
         printk("NODE%u start->%lu size->%lu free->%lu\n",
-               i, node_start_pfn(i), node_spanned_pages(i),
+               i, node_start_pfn(i), node_present_pages(i),
                avail_node_heap_pages(i));
         /* Sanity check mfn_to_nid() */
         if ( node_spanned_pages(i) > 1 && mfn_to_nid(mfn) != i )
diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
index d02f44fe3a..cba6d3cfea 100644
--- a/xen/common/sysctl.c
+++ b/xen/common/sysctl.c
@@ -316,7 +316,7 @@  long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
                 {
                     if ( node_online(i) )
                     {
-                        meminfo.memsize = node_spanned_pages(i) << PAGE_SHIFT;
+                        meminfo.memsize = node_present_pages(i) << PAGE_SHIFT;
                         meminfo.memfree = avail_node_heap_pages(i) << PAGE_SHIFT;
                     }
                     else
diff --git a/xen/include/xen/numa.h b/xen/include/xen/numa.h
index fd1511a6fb..c860f3ad1c 100644
--- a/xen/include/xen/numa.h
+++ b/xen/include/xen/numa.h
@@ -71,6 +71,7 @@  extern nodeid_t *memnodemap;
 struct node_data {
     unsigned long node_start_pfn;
     unsigned long node_spanned_pages;
+    unsigned long node_present_pages;
 };
 
 extern struct node_data node_data[];
@@ -91,6 +92,7 @@  static inline nodeid_t mfn_to_nid(mfn_t mfn)
 
 #define node_start_pfn(nid)     (NODE_DATA(nid)->node_start_pfn)
 #define node_spanned_pages(nid) (NODE_DATA(nid)->node_spanned_pages)
+#define node_present_pages(nid) (NODE_DATA(nid)->node_present_pages)
 #define node_end_pfn(nid)       (NODE_DATA(nid)->node_start_pfn + \
                                  NODE_DATA(nid)->node_spanned_pages)
 
@@ -123,6 +125,7 @@  extern void numa_set_processor_nodes_parsed(nodeid_t node);
 extern mfn_t first_valid_mfn;
 
 #define node_spanned_pages(nid) (max_page - mfn_x(first_valid_mfn))
+#define node_present_pages(nid) total_pages
 #define node_start_pfn(nid) mfn_x(first_valid_mfn)
 #define __node_distance(a, b) 20