
API to query NUMA node of mfn

Message ID 20170710131323.GF2461@localhost.localdomain (mailing list archive)
State New, archived

Commit Message

Konrad Rzeszutek Wilk July 10, 2017, 1:13 p.m. UTC
On Mon, Jul 10, 2017 at 04:41:35AM -0600, Jan Beulich wrote:
> >>> On 10.07.17 at 12:10, <olaf@aepfle.de> wrote:
> > I would like to verify on which NUMA node the PFNs used by a HVM guest
> > are located. Is there an API for that? Something like:
> > 
> >   foreach (pfn, domid)
> >     mfns_per_node[pfn_to_node(pfn)]++
> >   foreach (node)
> >     printk("%x %x\n", node, mfns_per_node[node])
> 
> phys_to_nid() ?

So I wrote some code for exactly this for Xen 4.4.4, along with
creation of a PGM map to see the NUMA node locality.

Attaching them here.
> 
> Jan
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> https://lists.xen.org/xen-devel
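For context, a minimal sketch of the per-node tally Olaf outlines, using
phys_to_nid() as Jan suggests; the helper below is only illustrative and is
not part of the attached series:

    /* Illustrative only: count a domain's pages per NUMA node inside Xen. */
    static void dump_domain_numa_locality(struct domain *d)
    {
        unsigned int mfns_per_node[MAX_NUMNODES] = { 0 };
        struct page_info *page;
        unsigned int node;

        spin_lock(&d->page_alloc_lock);
        page_list_for_each ( page, &d->page_list )
            mfns_per_node[phys_to_nid(page_to_maddr(page))]++;
        spin_unlock(&d->page_alloc_lock);

        for_each_online_node ( node )
            printk("d%d node %u: %u pages\n",
                   d->domain_id, node, mfns_per_node[node]);
    }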
From a5e039801c989df29b704a4a5256715321906535 Mon Sep 17 00:00:00 2001
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Date: Tue, 6 Jun 2017 20:31:21 -0400
Subject: [PATCH 1/7] xen/x86: XENDOMCTL_get_memlist: Make it work

This hypercall has a number of problems which this patch
fixes.

Specifically, it is not preemption capable, takes a nested lock,
and the data is stale by the time the caller gets it.

The nested lock (and order inversion) is due to the
copy_to_guest_offset call. The particular implementation
(see __hvm_copy) makes P2M calls (p2m_mem_paging_populate), which
take the p2m_lock.

We avoid this by taking the p2m lock early (before the page_alloc_lock)
in:

if ( !guest_handle_okay(domctl->u.getmemlist.buffer, max_pfns) )

which takes the p2m lock and then unlocks it. And since the handle
checks out, we can use the fast variant of copy_to_guest (which still
takes the p2m lock).

We extend this thinking to the copying of the values to the
guest. The loop that copies mfns[] to the buffer potentially
takes the p2m lock on every iteration. So, to avoid holding the
page_alloc_lock across those copies, we create a temporary array
(mfns) which is filled while holding the page_alloc_lock. We then
hold no locks (other than the domctl lock) while copying to the
guest.

Preemption is now supported, and we also honor 'start_pfn',
which is renamed to 'index' since there is no enforced order
in which the pages correspond to PFNs.

All of these are fixed by this patch. It also means that callers
of xc_get_pfn_list have to take into account that max_pfns may
differ from num_pfns and loop around.

See the patches "libxc: Use XENDOMCTL_get_memlist properly"
and "xen-mceinj: Loop around xc_get_pfn_list".

Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
---
 xen/arch/x86/domctl.c       | 76 ++++++++++++++++++++++++++++++---------------
 xen/arch/x86/mm/hap/hap.c   |  1 +
 xen/arch/x86/mm/p2m-ept.c   |  2 ++
 xen/include/asm-x86/p2m.h   |  2 ++
 xen/include/public/domctl.h | 36 ++++++++++++++++-----
 5 files changed, 84 insertions(+), 33 deletions(-)
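To illustrate the caller-side contract described above, here is a hedged
sketch of the loop a consumer might run; xc_get_memlist() is a hypothetical
stand-in for the updated wrapper from the referenced libxc patch, not an
existing function:

    /*
     * Hypothetical caller loop (xc_get_memlist() is assumed to fill 'buf'
     * with up to 'max' MFNs starting at 'index', and to return the
     * hypercall's num_pfns and version; xch/domid come from the caller).
     */
    uint64_t buf[1024];
    uint64_t max = 1024, num = 0, version = 0, prev_version = 0;
    uint64_t index = 0;
    int first = 1;

    for ( ;; )
    {
        if ( xc_get_memlist(xch, domid, index, buf, max, &num, &version) < 0 )
            break;                      /* Hypercall error. */

        if ( !first && version != prev_version )
        {
            /* The P2M changed under us: discard results and start over. */
            index = 0;
            first = 1;
            continue;
        }
        first = 0;
        prev_version = version;

        if ( num == 0 )                 /* Nothing left past 'index': done. */
            break;

        /* ... consume buf[0 .. num - 1] ... */

        index += num;    /* Preemption may have returned fewer than 'max'. */
    }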

Comments

Olaf Hering July 10, 2017, 1:35 p.m. UTC | #1
On Mon, Jul 10, Konrad Rzeszutek Wilk wrote:

> So I wrote some code for exactly this for Xen 4.4.4, along with
> creation of a PGM map to see the NUMA node locality.

Are you planning to prepare that for staging at some point? I have not
checked whether this series is already merged.

Olaf
Konrad Rzeszutek Wilk July 10, 2017, 8:29 p.m. UTC | #2
On Mon, Jul 10, 2017 at 03:35:33PM +0200, Olaf Hering wrote:
> On Mon, Jul 10, Konrad Rzeszutek Wilk wrote:
> 
> > So I wrote some code for exactly this for Xen 4.4.4, along with
> > creation of a PGM map to see the NUMA node locality.
> 
> Are you planning to prepare that for staging at some point? I have not
> checked whether this series is already merged.

At some point, when life is not as crazy. You are of course
welcome to see if these patches help you, and if they do and
you are itching to get them into the repo, to upstream them.

> 
> Olaf

Patch

diff --git a/xen/arch/x86/domctl.c b/xen/arch/x86/domctl.c
index bebe1fb..3af6b39 100644
--- a/xen/arch/x86/domctl.c
+++ b/xen/arch/x86/domctl.c
@@ -325,57 +325,83 @@  long arch_do_domctl(
 
     case XEN_DOMCTL_getmemlist:
     {
-        int i;
+#define XEN_DOMCTL_getmemlist_max_pfns (GB(1) / PAGE_SIZE)
+        unsigned int i = 0, idx = 0;
         unsigned long max_pfns = domctl->u.getmemlist.max_pfns;
+        unsigned long index = domctl->u.getmemlist.index;
         uint64_t mfn;
         struct page_info *page;
+        uint64_t *mfns;
 
         if ( unlikely(d->is_dying) ) {
             ret = -EINVAL;
             break;
         }
+        /* XSA-74: This sub-hypercall is fixed. */
 
-        /*
-         * XSA-74: This sub-hypercall is broken in several ways:
-         * - lock order inversion (p2m locks inside page_alloc_lock)
-         * - no preemption on huge max_pfns input
-         * - not (re-)checking d->is_dying with page_alloc_lock held
-         * - not honoring start_pfn input (which libxc also doesn't set)
-         * Additionally it is rather useless, as the result is stale by the
-         * time the caller gets to look at it.
-         * As it only has a single, non-production consumer (xen-mceinj),
-         * rather than trying to fix it we restrict it for the time being.
-         */
-        if ( /* No nested locks inside copy_to_guest_offset(). */
-             paging_mode_external(current->domain) ||
-             /* Arbitrary limit capping processing time. */
-             max_pfns > GB(4) / PAGE_SIZE )
+        ret = -E2BIG;
+        if ( max_pfns > XEN_DOMCTL_getmemlist_max_pfns )
+            max_pfns = XEN_DOMCTL_getmemlist_max_pfns;
+
+        /* Report the max number we are OK with. */
+        if ( !max_pfns && guest_handle_is_null(domctl->u.getmemlist.buffer) )
         {
-            ret = -EOPNOTSUPP;
+            domctl->u.getmemlist.max_pfns = XEN_DOMCTL_getmemlist_max_pfns;
+            copyback = 1;
             break;
         }
 
-        spin_lock(&d->page_alloc_lock);
+        ret = -EINVAL;
+        if ( !guest_handle_okay(domctl->u.getmemlist.buffer, max_pfns) )
+            break;
+
+        mfns = xmalloc_array(uint64_t, max_pfns);
+        if ( !mfns )
+        {
+            ret = -ENOMEM;
+            break;
+        }
 
-        ret = i = 0;
+        ret = -EINVAL;
+        spin_lock(&d->page_alloc_lock);
         page_list_for_each(page, &d->page_list)
         {
-            if ( i >= max_pfns )
+            if ( idx >= max_pfns )
                 break;
+
+            if ( index > i++ )
+                continue;
+
+            if ( idx && !(idx & 0xFF) && hypercall_preempt_check() )
+                break;
+
             mfn = page_to_mfn(page);
-            if ( copy_to_guest_offset(domctl->u.getmemlist.buffer,
-                                      i, &mfn, 1) )
+            mfns[idx++] = mfn;
+        }
+        spin_unlock(&d->page_alloc_lock);
+
+        ret = 0;
+        for ( i = 0; i < idx; i++ )
+        {
+
+            if ( __copy_to_guest_offset(domctl->u.getmemlist.buffer,
+                                        i, &mfns[i], 1) )
             {
                 ret = -EFAULT;
                 break;
             }
-			++i;
 		}
 
-        spin_unlock(&d->page_alloc_lock);
-
         domctl->u.getmemlist.num_pfns = i;
+        /*
+         * A poor man's way of keeping track of P2M changes. If the P2M
+         * is changed the version will change as well and the caller
+         * can redo its list.
+         */
+        domctl->u.getmemlist.version = p2m_get_hostp2m(d)->version;
+
         copyback = 1;
+        xfree(mfns);
     }
     break;
 
diff --git a/xen/arch/x86/mm/hap/hap.c b/xen/arch/x86/mm/hap/hap.c
index ccc4174..0406c2a 100644
--- a/xen/arch/x86/mm/hap/hap.c
+++ b/xen/arch/x86/mm/hap/hap.c
@@ -709,6 +709,7 @@  hap_write_p2m_entry(struct vcpu *v, unsigned long gfn, l1_pgentry_t *p,
     if ( old_flags & _PAGE_PRESENT )
         flush_tlb_mask(d->domain_dirty_cpumask);
 
+    p2m_get_hostp2m(d)->version++;
     paging_unlock(d);
 
     if ( flush_nestedp2m )
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 72b3d0a..7da5b06 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -674,6 +674,8 @@  void ept_sync_domain(struct p2m_domain *p2m)
 {
     struct domain *d = p2m->domain;
     struct ept_data *ept = &p2m->ept;
+
+    p2m->version++;
     /* Only if using EPT and this domain has some VCPUs to dirty. */
     if ( !paging_mode_hap(d) || !d->vcpu || !d->vcpu[0] )
         return;
diff --git a/xen/include/asm-x86/p2m.h b/xen/include/asm-x86/p2m.h
index fcb50b1..b0549e8 100644
--- a/xen/include/asm-x86/p2m.h
+++ b/xen/include/asm-x86/p2m.h
@@ -293,6 +293,8 @@  struct p2m_domain {
         struct ept_data ept;
         /* NPT-equivalent structure could be added here. */
     };
+    /* OVM: Every update to P2M increases this version. */
+    unsigned long version;
 };
 
 /* get host p2m table */
diff --git a/xen/include/public/domctl.h b/xen/include/public/domctl.h
index 27f5001..2a25079 100644
--- a/xen/include/public/domctl.h
+++ b/xen/include/public/domctl.h
@@ -118,16 +118,36 @@  typedef struct xen_domctl_getdomaininfo xen_domctl_getdomaininfo_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_getdomaininfo_t);
 
 
-/* XEN_DOMCTL_getmemlist */
+/*
+ * XEN_DOMCTL_getmemlist
+ * Retrieve an array of mfns of the guest.
+ *
+ * If the hypercall returns a zero value, then it has copied 'num_pfns'
+ * (up to 'max_pfns') of the MFNs into 'buffer', along with an updated
+ * 'version' (it may be the same across hypercalls; if it varies, the
+ * data is stale and it is recommended that the caller restart with
+ * 'index' set to zero).
+ *
+ * If 'max_pfns' is zero and 'buffer' is NULL, the hypercall returns
+ * -E2BIG and updates 'max_pfns' with the recommended value to be used.
+ *
+ * Note that due to the asynchronous nature of hypercalls the domain might
+ * have added or removed MFNs, making this information stale. It is the
+ * responsibility of the toolstack to check the 'version' field between
+ * each invocation. If the version differs, it should discard the stale
+ * data and start from scratch. It is OK for the toolstack to continue
+ * with the new 'version' value.
+ */
 struct xen_domctl_getmemlist {
-    /* IN variables. */
-    /* Max entries to write to output buffer. */
+    /* IN/OUT: Max entries to write to output buffer. If max_pfns is zero and
+     * buffer is NULL, this returns the recommended max size of buffer. */
     uint64_aligned_t max_pfns;
-    /* Start index in guest's page list. */
-    uint64_aligned_t start_pfn;
-    XEN_GUEST_HANDLE_64(uint64) buffer;
-    /* OUT variables. */
-    uint64_aligned_t num_pfns;
+    uint64_aligned_t index;     /* IN: Start index in guest's page list. */
+    XEN_GUEST_HANDLE_64(uint64) buffer; /* IN: If NULL with max_pfns == 0, then
+                                         * max_pfns has the recommended value. */
+    uint64_aligned_t version;   /* IN/OUT: If value differs, prior calls may
+                                 * have stale data. */
+    uint64_aligned_t num_pfns;  /* OUT: Number (up to max_pfns) copied. */
 };
 typedef struct xen_domctl_getmemlist xen_domctl_getmemlist_t;
 DEFINE_XEN_GUEST_HANDLE(xen_domctl_getmemlist_t);
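
Finally, a short sketch of the buffer-size probe described in the new comment
above, from the toolstack side; the do_domctl() plumbing and error reporting
are assumed, and the surrounding code is illustrative only:

    /* Illustrative probe: max_pfns == 0 with a NULL buffer asks for the limit. */
    struct xen_domctl domctl;

    memset(&domctl, 0, sizeof(domctl));
    domctl.cmd = XEN_DOMCTL_getmemlist;
    domctl.domain = domid;
    domctl.u.getmemlist.max_pfns = 0;
    /* buffer stays a NULL guest handle thanks to the memset above. */

    if ( do_domctl(xch, &domctl) < 0 )
    {
        /* -E2BIG is the expected outcome; max_pfns now holds the
         * recommended value.  Allocate that many uint64_t entries and
         * run the loop sketched earlier, advancing 'index' by 'num_pfns'
         * after each call. */
        uint64_t recommended = domctl.u.getmemlist.max_pfns;
        (void)recommended;
    }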