diff mbox series

[RFC,2,DO,NOT,APPLY] introduce VCPUOP_register_runstate_phys_memory_area hypercall

Message ID 1558721577-13958-1-git-send-email-andrii.anisov@gmail.com (mailing list archive)
State Superseded
Series [RFC,2,DO,NOT,APPLY] introduce VCPUOP_register_runstate_phys_memory_area hypercall

Commit Message

Andrii Anisov May 24, 2019, 6:12 p.m. UTC
From: Andrii Anisov <andrii_anisov@epam.com>

An RFC version of runstate registration with a physical address.
Runstate area access is implemented by mapping the area once per update,
covering all accesses within that update.

Signed-off-by: Andrii Anisov <andrii_anisov@epam.com>
---
 xen/arch/arm/domain.c     |  63 ++++++++++++++++++++++++++---
 xen/common/domain.c       | 101 ++++++++++++++++++++++++++++++++++++++++++++--
 xen/include/public/vcpu.h |  15 +++++++
 xen/include/xen/sched.h   |  28 +++++++++----
 4 files changed, 190 insertions(+), 17 deletions(-)

Comments

Andrii Anisov May 24, 2019, 6:12 p.m. UTC | #1
From: Andrii Anisov <andrii_anisov@epam.com>

Following the discussion [1], this series introduces and implements a runstate
registration interface which uses a guest physical address instead of a
virtual one.
The new hypercall employs the same data structures as its predecessor, but
expects the vcpu_runstate_info structure not to cross a page boundary.
The interface is implemented so that the vcpu_runstate_info structure is
mapped into the hypervisor while the hypercall is processed and is accessed
directly during updates. This runstate area mapping follows the vcpu_info
structure registration.

A permanent mapping of the runstate area would consume vmap area on arm32,
which is limited to 1G. That might still be acceptable, because it would be
possible to enlarge the arm32 virtual address area by reworking the address
space.

The series is tested on ARM64 and build-tested for x86. I'd appreciate it if
someone could check it on x86.
The Linux kernel patch is here [2], though it is for 4.14. I am not yet fully
convinced of the absolute correctness of that patch; it should be further
aligned with the Linux community.

Changes in:

  v3: This version again implements the mapping-on-init approach.
      Patches are squashed and refactored so that the virt and phys
      interfaces cannot function simultaneously but replace one another on
      init.
      In order to measure the performance impact of a permanent mapping vs.
      mapping on access, two RFC patches were written which follow the
      mapping-on-access approach, with a small difference:
       - RFC 1 - using copy_to_guest_phys_flush_dcache() for each access to
         the runstate area.
       - RFC 2 - mapping the runstate area before all update manipulations
         and unmapping it afterwards.

      The RFC patches were implemented for ARM only, because the performance
      measurements were done on an ARM64 machine.

      Performance measurements of the two approaches (runstate mapped on
      access vs. mapped on registration) were made. The test setup is as
      follows:

      Thin Dom0 (Linux with initramfs) with DomD running a rich Yocto Linux.
      3D benchmark numbers in DomD are compared. The benchmark is GLMark2.
      GLMark2 is run at different resolutions in order to produce different
      IRQ loads: 320x240 produces a high IRQ load, while 1920x1080 produces
      a low one. Separately, the DomD benchmark run was also tested with a
      primitive Dom0 CPU burner (dd) running alongside, in order to stimulate
      VCPU(dX)->VCPU(dY) switches rather than VCPU(dX)->idle->VCPU(dX).
      The results:

                            mapped         mapped         mapped
                            on init        on access      on update
                                           RFC 1          RFC 2
      GLMark2 320x240       2906           2856 (-2%)     2903 (0%)
          +Dom0 CPUBurn     2166           2106 (-3%)     2134 (-1%)
      GLMark2 800x600       2396           2367 (-1%)     2393 (0%)
          +Dom0 CPUBurn     1958           1911 (-2%)     1942 (-1%)
      GLMark2 1920x1080     939            936  (0%)      935  (0%)
          +Dom0 CPUBurn     909            901  (-1%)     907  (0%)

      IRQ latency differences were also checked using TBM in a setup similar
      to [5]. Please note that the IRQ rate is one per 30 seconds, and only
      the VCPU->idle->VCPU use-case is considered. The results (in ns; timer
      granularity 120ns):

      mapped on init:
          max=10080 warm_max=8760 min=6600 avg=6699
      mapped on access (RFC 1):
          max=10440 warm_max=7560 min=7320 avg=7419
      mapped on update (RFC 2):
          max=11520 warm_max=7920 min=7200 avg=7299

  v2: The new runstate interface implementation was reconsidered. The new
      interface is made independent of the old one: it does not share the
      runstate_area field and consequently avoids excessive concurrency with
      the old runstate interface usage.
      Locks were introduced in order to resolve possible concurrency between
      runstate area registration and usage.
      Comments from Jan Beulich [3][4] about coding style nits were addressed,
      though some of them became obsolete with the refactoring and a few are
      picked up in this thread for further discussion.

      Performance measurements of the two approaches (runstate mapped on
      access vs. mapped on registration) were made. The test setup is as
      follows:

      Thin Dom0 (Linux with initramfs) with DomD running a rich Yocto Linux.
      3D benchmark numbers in DomD are compared. The benchmark is GLMark2.
      GLMark2 is run at different resolutions in order to produce different
      IRQ loads: 320x240 produces a high IRQ load, while 1920x1080 produces
      a low one. Separately, the DomD benchmark run was also tested with a
      primitive Dom0 CPU burner (dd) running alongside, in order to stimulate
      VCPU(dX)->VCPU(dY) switches rather than VCPU(dX)->idle->VCPU(dX).
      The results:

                            mapped               mapped
                            on access            on init
      GLMark2 320x240       2852                 2877          +0.8%
          +Dom0 CPUBurn     2088                 2094          +0.2%
      GLMark2 800x600       2368                 2375          +0.3%
          +Dom0 CPUBurn     1868                 1921          +2.8%
      GLMark2 1920x1080     931                  931            0%
          +Dom0 CPUBurn     892                  894           +0.2%

      Please note that "mapped on access" means using the old runstate
      registering interface. A runstate update in this case still often fails
      to map the runstate area, as in [5], despite the fact that our Linux
      kernel does not have KPTI enabled. So the runstate area update, in this
      case, is effectively shortened.


      IRQ latency differences were also checked using TBM in a setup similar
      to [5]. Please note that the IRQ rate is one per 30 seconds, and only
      the VCPU->idle->VCPU use-case is considered. The results (in ns; timer
      granularity 120ns):

      mapped on access:
          max=9960 warm_max=8640 min=7200 avg=7626
      mapped on init:
          max=9480 warm_max=8400 min=7080 avg=7341

      Unfortunately there are no consistent results yet from profiling with
      Lauterbach PowerTrace. We are still in communication with the tracer
      vendor in order to set up the proper configuration.



[1] https://lists.xenproject.org/archives/html/xen-devel/2019-02/msg00416.html
[2] https://github.com/aanisov/linux/commit/ba34d2780f57ea43f81810cd695aace7b55c0f29
[3] https://lists.xenproject.org/archives/html/xen-devel/2019-03/msg00936.html
[4] https://lists.xenproject.org/archives/html/xen-devel/2019-03/msg00934.html
[5] https://lists.xenproject.org/archives/html/xen-devel/2019-01/msg02369.html
[6] https://lists.xenproject.org/archives/html/xen-devel/2018-12/msg02297.html


Andrii Anisov (1):
  xen: introduce VCPUOP_register_runstate_phys_memory_area hypercall

 xen/arch/arm/domain.c        |  58 ++++++++++++++++++---
 xen/arch/x86/domain.c        |  99 ++++++++++++++++++++++++++++++++---
 xen/arch/x86/x86_64/domain.c |  16 +++++-
 xen/common/domain.c          | 121 +++++++++++++++++++++++++++++++++++++++----
 xen/include/public/vcpu.h    |  15 ++++++
 xen/include/xen/sched.h      |  28 +++++++---
 6 files changed, 306 insertions(+), 31 deletions(-)
Julien Grall May 28, 2019, 8:59 a.m. UTC | #2
Hi Andrii,

I am not answering on the content yet; I will do that separately. The threading 
for this series looks quite confusing. The head of the thread is this patch 
(i.e. RFC 2), but then you have a cover letter within the thread tagged "V3".

From what I understand, the 2 e-mails tagged "V3" are the original version, 
whereas RFC 2 and RFC 3 are variants. Am I correct?

If so, for next time, I would recommend having the cover letter first and then 
sending all the patches "In-Reply-To" the cover letter. This makes it easier to 
track a series.

Cheers,
Andrii Anisov May 28, 2019, 9:17 a.m. UTC | #3
Hello Julien,

On 28.05.19 11:59, Julien Grall wrote:
> I am not answering on the content yet; I will do that separately. The threading for this series looks quite confusing. The head of the thread is this patch (i.e. RFC 2), but then you have a cover letter within the thread tagged "V3".

Actually, I noticed the mangled threading, but I am not sure why it happened. I just passed a folder with those patches to git-send-email.

> From what I understand, the 2 e-mails tagged "V3" are the original version, whereas RFC 2 and RFC 3 are variants. Am I correct?

Yes, sure.

> If so, for next time, I would recommend having the cover letter first and then sending all the patches "In-Reply-To" the cover letter. This makes it easier to track a series.

It might help. But previously it worked well without additional tricks.
Julien Grall May 28, 2019, 9:23 a.m. UTC | #4
On 28/05/2019 10:17, Andrii Anisov wrote:
> Hello Julien,
> 
> On 28.05.19 11:59, Julien Grall wrote:
>> I am not answering on the content yet; I will do that separately. The 
>> threading for this series looks quite confusing. The head of the thread is 
>> this patch (i.e. RFC 2), but then you have a cover letter within the thread 
>> tagged "V3".
> 
> Actually, I noticed the mangled threading, but I am not sure why it happened. 
> I just passed a folder with those patches to git-send-email.

IIRC, git-send-email will use the first e-mail in alphabetical order as the 
"top e-mail" and all the others will be sent "In-Reply-To" it.

> 
>> From what I understand, the 2 e-mails tagged "V3" are the original version, 
>> whereas RFC 2 and RFC 3 are variants. Am I correct?
> 
> Yes, sure.
> 
>> If so, for next time, I would recommend having the cover letter first and 
>> then sending all the patches "In-Reply-To" the cover letter. This makes it 
>> easier to track a series.
> 
> It might help. But previously it worked well without additional tricks.

In general, all the patches sent start with a number followed by the title. 
The number ensures the correct ordering. If you don't have a number, then you 
fall back to alphabetical order.

Cheers,
Andrii Anisov May 28, 2019, 9:36 a.m. UTC | #5
On 28.05.19 12:23, Julien Grall wrote:
> In general, all the patches sent start with a number followed by the title. The number ensures the correct ordering. If you don't have a number, then you fall back to alphabetical order.
Ah, yes, sure. I renamed the RFC 1 patch file manually, but somehow missed renaming the RFC 2 patch file.

Patch

diff --git a/xen/arch/arm/domain.c b/xen/arch/arm/domain.c
index a9f7ff5..04c4cff 100644
--- a/xen/arch/arm/domain.c
+++ b/xen/arch/arm/domain.c
@@ -274,17 +274,15 @@  static void ctxt_switch_to(struct vcpu *n)
     virt_timer_restore(n);
 }
 
-/* Update per-VCPU guest runstate shared memory area (if registered). */
-static void update_runstate_area(struct vcpu *v)
+static void update_runstate_by_gvaddr(struct vcpu *v)
 {
     void __user *guest_handle = NULL;
 
-    if ( guest_handle_is_null(runstate_guest(v)) )
-        return;
+    ASSERT(!guest_handle_is_null(runstate_guest_virt(v)));
 
     if ( VM_ASSIST(v->domain, runstate_update_flag) )
     {
-        guest_handle = &v->runstate_guest.p->state_entry_time + 1;
+        guest_handle = &v->runstate_guest.virt.p->state_entry_time + 1;
         guest_handle--;
         v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
         __raw_copy_to_guest(guest_handle,
@@ -292,7 +290,7 @@  static void update_runstate_area(struct vcpu *v)
         smp_wmb();
     }
 
-    __copy_to_guest(runstate_guest(v), &v->runstate, 1);
+    __copy_to_guest(runstate_guest_virt(v), &v->runstate, 1);
 
     if ( guest_handle )
     {
@@ -303,6 +301,58 @@  static void update_runstate_area(struct vcpu *v)
     }
 }
 
+extern int map_runstate_area(struct vcpu *v, struct vcpu_runstate_info **area);
+extern void unmap_runstate_area(struct vcpu_runstate_info *area);
+
+static void update_runstate_by_gpaddr(struct vcpu *v)
+{
+    struct vcpu_runstate_info *runstate;
+
+    if ( map_runstate_area(v, &runstate) )
+        return;
+
+    if ( VM_ASSIST(v->domain, runstate_update_flag) )
+    {
+        runstate->state_entry_time |= XEN_RUNSTATE_UPDATE;
+        smp_wmb();
+        v->runstate.state_entry_time |= XEN_RUNSTATE_UPDATE;
+    }
+
+    memcpy(runstate, &v->runstate, sizeof(v->runstate));
+
+    if ( VM_ASSIST(v->domain, runstate_update_flag) )
+    {
+        runstate->state_entry_time &= ~XEN_RUNSTATE_UPDATE;
+        smp_wmb();
+        v->runstate.state_entry_time &= ~XEN_RUNSTATE_UPDATE;
+    }
+
+    unmap_runstate_area(runstate);
+}
+
+/* Update per-VCPU guest runstate shared memory area (if registered). */
+static void update_runstate_area(struct vcpu *v)
+{
+    if ( xchg(&v->runstate_in_use, 1) )
+        return;
+
+    switch ( v->runstate_guest_type )
+    {
+    case RUNSTATE_NONE:
+       break;
+
+    case RUNSTATE_VADDR:
+       update_runstate_by_gvaddr(v);
+       break;
+
+    case RUNSTATE_PADDR:
+       update_runstate_by_gpaddr(v);
+       break;
+    }
+
+    xchg(&v->runstate_in_use, 0);
+}
+
 static void schedule_tail(struct vcpu *prev)
 {
     ctxt_switch_from(prev);
@@ -998,6 +1048,7 @@  long do_arm_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) a
     {
         case VCPUOP_register_vcpu_info:
         case VCPUOP_register_runstate_memory_area:
+        case VCPUOP_register_runstate_phys_memory_area:
             return do_vcpu_op(cmd, vcpuid, arg);
         default:
             return -EINVAL;
diff --git a/xen/common/domain.c b/xen/common/domain.c
index 32bca8d..f167a68 100644
--- a/xen/common/domain.c
+++ b/xen/common/domain.c
@@ -700,6 +700,68 @@  int rcu_lock_live_remote_domain_by_id(domid_t dom, struct domain **d)
     return 0;
 }
 
+void unmap_runstate_area(struct vcpu_runstate_info *area)
+{
+    mfn_t mfn;
+
+    ASSERT(area != NULL);
+
+    mfn = _mfn(domain_page_map_to_mfn(area));
+
+    unmap_domain_page_global((void *)
+                             ((unsigned long)area &
+                              PAGE_MASK));
+
+    put_page_and_type(mfn_to_page(mfn));
+}
+
+int map_runstate_area(struct vcpu *v, struct vcpu_runstate_info **area)
+{
+    unsigned long offset = v->runstate_guest.phys & ~PAGE_MASK;
+    gfn_t gfn = gaddr_to_gfn(v->runstate_guest.phys);
+    struct domain *d = v->domain;
+    void *mapping;
+    struct page_info *page;
+    size_t size = sizeof(struct vcpu_runstate_info);
+
+    if ( offset > (PAGE_SIZE - size) )
+        return -EINVAL;
+
+    page = get_page_from_gfn(d, gfn_x(gfn), NULL, P2M_ALLOC);
+    if ( !page )
+        return -EINVAL;
+
+    if ( !get_page_type(page, PGT_writable_page) )
+    {
+        put_page(page);
+        return -EINVAL;
+    }
+
+    mapping = __map_domain_page_global(page);
+
+    if ( mapping == NULL )
+    {
+        put_page_and_type(page);
+        return -ENOMEM;
+    }
+
+    *area = mapping + offset;
+
+    return 0;
+}
+
+static void discard_runstate_area(struct vcpu *v)
+{
+    v->runstate_guest_type = RUNSTATE_NONE;
+}
+
+static void discard_runstate_area_locked(struct vcpu *v)
+{
+    while ( xchg(&v->runstate_in_use, 1) );
+    discard_runstate_area(v);
+    xchg(&v->runstate_in_use, 0);
+}
+
 int domain_kill(struct domain *d)
 {
     int rc = 0;
@@ -738,7 +800,10 @@  int domain_kill(struct domain *d)
         if ( cpupool_move_domain(d, cpupool0) )
             return -ERESTART;
         for_each_vcpu ( d, v )
+        {
+            discard_runstate_area_locked(v);
             unmap_vcpu_info(v);
+        }
         d->is_dying = DOMDYING_dead;
         /* Mem event cleanup has to go here because the rings 
          * have to be put before we call put_domain. */
@@ -1192,7 +1257,7 @@  int domain_soft_reset(struct domain *d)
 
     for_each_vcpu ( d, v )
     {
-        set_xen_guest_handle(runstate_guest(v), NULL);
+        discard_runstate_area_locked(v);
         unmap_vcpu_info(v);
     }
 
@@ -1520,18 +1585,46 @@  long do_vcpu_op(int cmd, unsigned int vcpuid, XEN_GUEST_HANDLE_PARAM(void) arg)
             break;
 
         rc = 0;
-        runstate_guest(v) = area.addr.h;
+
+        while ( xchg(&v->runstate_in_use, 1) );
+
+        discard_runstate_area(v);
+
+        runstate_guest_virt(v) = area.addr.h;
+        v->runstate_guest_type = RUNSTATE_VADDR;
 
         if ( v == current )
         {
-            __copy_to_guest(runstate_guest(v), &v->runstate, 1);
+            __copy_to_guest(runstate_guest_virt(v), &v->runstate, 1);
         }
         else
         {
             vcpu_runstate_get(v, &runstate);
-            __copy_to_guest(runstate_guest(v), &runstate, 1);
+            __copy_to_guest(runstate_guest_virt(v), &runstate, 1);
         }
 
+        xchg(&v->runstate_in_use, 0);
+
+        break;
+    }
+
+    case VCPUOP_register_runstate_phys_memory_area:
+    {
+        struct vcpu_register_runstate_memory_area area;
+
+        rc = -EFAULT;
+        if ( copy_from_guest(&area, arg, 1) )
+             break;
+
+        while ( xchg(&v->runstate_in_use, 1) );
+
+        discard_runstate_area(v);
+        v->runstate_guest.phys = area.addr.p;
+        v->runstate_guest_type = RUNSTATE_PADDR;
+
+        xchg(&v->runstate_in_use, 0);
+        rc = 0;
+
         break;
     }
 
diff --git a/xen/include/public/vcpu.h b/xen/include/public/vcpu.h
index 3623af9..d7da4a3 100644
--- a/xen/include/public/vcpu.h
+++ b/xen/include/public/vcpu.h
@@ -235,6 +235,21 @@  struct vcpu_register_time_memory_area {
 typedef struct vcpu_register_time_memory_area vcpu_register_time_memory_area_t;
 DEFINE_XEN_GUEST_HANDLE(vcpu_register_time_memory_area_t);
 
+/*
+ * Register a shared memory area from which the guest may obtain its own
+ * runstate information without needing to execute a hypercall.
+ * Notes:
+ *  1. The registered address must be a guest physical address.
+ *  2. The registered runstate area must not cross a page boundary.
+ *  3. Only one shared area may be registered per VCPU. The shared area is
+ *     updated by the hypervisor each time the VCPU is scheduled. Thus
+ *     runstate.state will always be RUNSTATE_running and
+ *     runstate.state_entry_time will indicate the system time at which the
+ *     VCPU was last scheduled to run.
+ * @extra_arg == pointer to vcpu_register_runstate_memory_area structure.
+ */
+#define VCPUOP_register_runstate_phys_memory_area 14
+
 #endif /* __XEN_PUBLIC_VCPU_H__ */
 
 /*
diff --git a/xen/include/xen/sched.h b/xen/include/xen/sched.h
index edee52d..8ac597b 100644
--- a/xen/include/xen/sched.h
+++ b/xen/include/xen/sched.h
@@ -163,17 +163,31 @@  struct vcpu
     void            *sched_priv;    /* scheduler-specific data */
 
     struct vcpu_runstate_info runstate;
+
+    enum {
+        RUNSTATE_NONE = 0,
+        RUNSTATE_PADDR = 1,
+        RUNSTATE_VADDR = 2,
+    } runstate_guest_type;
+
+    unsigned long runstate_in_use;
+
+    union
+    {
 #ifndef CONFIG_COMPAT
-# define runstate_guest(v) ((v)->runstate_guest)
-    XEN_GUEST_HANDLE(vcpu_runstate_info_t) runstate_guest; /* guest address */
+# define runstate_guest_virt(v) ((v)->runstate_guest.virt)
+           XEN_GUEST_HANDLE(vcpu_runstate_info_t) virt; /* guest address */
 #else
-# define runstate_guest(v) ((v)->runstate_guest.native)
-    union {
-        XEN_GUEST_HANDLE(vcpu_runstate_info_t) native;
-        XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat;
-    } runstate_guest; /* guest address */
+# define runstate_guest_virt(v) ((v)->runstate_guest.virt.native)
+           union {
+               XEN_GUEST_HANDLE(vcpu_runstate_info_t) native;
+               XEN_GUEST_HANDLE(vcpu_runstate_info_compat_t) compat;
+           } virt; /* guest address */
 #endif
 
+        paddr_t phys;
+    } runstate_guest;
+
     /* last time when vCPU is scheduled out */
     uint64_t last_run_time;