[v7,00/10] KVM: Dirty ring support (QEMU part)

Message ID 20210506160549.130416-1-peterx@redhat.com (mailing list archive)

Peter Xu May 6, 2021, 4:05 p.m. UTC
This is v7 of the qemu dirty ring interface support.

v7:
- Rebase to latest master commit d45a5270d07

v6:
- Fix slots_lock init [Keqian, Paolo]
- Comment above KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 on todo (to enable
  KVM_CLEAR_DIRTY_LOG for dirty ring too) [Keqian, Paolo]
- Fix comment for CPUState [Keqian]

v5:
- rebase
- dropped patch "update-linux-headers: Include const.h" after rebase
- dropped patch "KVM: Fixup kvm_log_clear_one_slot() ioctl return check" since
  similar patch got merged recently (38e0b7904eca7cd32f8953c3)

========= v4 cover letter below =========

Content-wise it is the same as v3, but there are a few things to mention
besides the rebase itself:

  - I picked up two patches from Eric Farman for the linux-header updates (from
    Eric's v3 series) for convenience, just in case any of the series would get
    queued by a maintainer.

  - One more patch is added: "KVM: Disable manual dirty log when dirty ring
    enabled".  I found this when testing the branch after rebasing to latest
    qemu: not only is the manual dirty log capability unnecessary for kvm
    dirty ring, but more importantly INITIALLY_ALL_SET is incompatible with
    kvm dirty ring and could silently crash the guest after migration.  For
    this new commit, I touched up "KVM: Add dirty-gfn-count property" a bit.

  - A few more documentation lines in qemu-options.hx.

  - I removed the RFC tag after kernel series got merged.

Again, this is only the 1st step to support dirty ring.  Ideally dirty ring
should grant QEMU the possibility to remove the whole layered dirty bitmap so
that dirty ring will work similarly to auto-converge, but better: we will
simply throttle vcpus with the dirty ring kvm exit rather than explicitly
adding a timer to stop the vcpu thread from entering the guest again (like
what we did with current migration auto-converge).  Some more information can
also be found in the kvm forum 2020 talk regarding kvm dirty ring (slides
21/22 [1]).

That next step (to remove all the dirty bitmaps, as mentioned above) is still
open for discussion: firstly, I don't know whether there's anything I've
overlooked there.  Meanwhile, it only serves the huge-VM case, and may not be
very helpful for the many major scenarios where VMs are not that huge.

There are probably other ways to fix huge VM migration issues, mainly focusing
on responsiveness and convergence.  For example, Google has proposed a new
userfaultfd kernel capability called "minor modes" [2] to track page minor
faults, which could eventually serve that purpose too using postcopy.
That's another long story so I'll stop here, but I mention it as a marker
along with the dirty ring series so there'll still be a record to reference.

That said, I still think this series is well worth merging even if we don't
pursue the next steps yet, since dirty ring is disabled by default, and we can
always build upon this series.

Please review, thanks.

V3: https://lore.kernel.org/qemu-devel/20200523232035.1029349-1-peterx@redhat.com/
    (V3 contains all the pre-v3 changelog)

QEMU branch for testing (requires kernel version 5.11-rc1+):
    https://github.com/xzpeter/qemu/tree/kvm-dirty-ring

[1] https://static.sched.com/hosted_files/kvmforum2020/97/kvm_dirty_ring_peter.pdf
[2] https://lore.kernel.org/lkml/20210107190453.3051110-1-axelrasmussen@google.com/

---------------------------8<---------------------------------

Overview
========

KVM dirty ring is a new interface to pass over dirty bits from kernel
to the userspace.  Instead of using a bitmap for each memory region,
the dirty ring contains an array of dirtied GPAs to fetch, one ring
per vcpu.
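To make the layout concrete, here is a minimal C sketch of a ring entry,
mirroring the kernel UAPI `struct kvm_dirty_gfn` from linux/kvm.h (flags,
memslot id, page offset within the slot).  The `dirty_gfn_to_gpa()` helper
and the 4 KiB page-size assumption are illustrative only, not part of the
series:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Mirror of the kernel UAPI struct kvm_dirty_gfn (linux/kvm.h): each
 * ring entry names one dirtied guest page by memslot id plus the page
 * offset (in pages) within that slot.
 */
struct kvm_dirty_gfn {
    uint32_t flags;
    uint32_t slot;
    uint64_t offset;
};

/*
 * Hypothetical helper: recover the guest physical address of a dirtied
 * page, assuming 4 KiB pages and a known base GPA for the memslot.
 */
static uint64_t dirty_gfn_to_gpa(uint64_t slot_base_gpa,
                                 const struct kvm_dirty_gfn *gfn)
{
    return slot_base_gpa + (gfn->offset << 12);
}
```

Each entry is 16 bytes, so the ring stays compact even for large entry counts.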

There are a few major changes compared to how the old dirty logging
interface works:

- Granularity of dirty bits

  The KVM dirty ring interface does not offer memory region level
  granularity to collect dirty bits (i.e., per KVM memory
  slot).  Instead the dirty bits are collected globally for all the vcpus
  at once.  The major effect is on the VGA part, because VGA dirty tracking
  is enabled as long as the device is created, and it used memory
  region granularity.  Now that operation will be amplified to a VM-wide
  sync.  Maybe there's a smarter way to do the same thing in VGA with
  the new interface, but so far I don't see it affecting much, at least
  on regular VMs.

- Collection of dirty bits

  The old dirty logging interface collects KVM dirty bits when
  synchronizing dirty bits.  The KVM dirty ring interface instead uses a
  standalone thread to do that.  So when another thread (e.g., the
  migration thread) wants to synchronize the dirty bits, it simply
  kicks the reaper thread and waits until it flushes all the dirty bits to
  the ramblock dirty bitmap.
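The reap step above can be sketched as follows.  This is a simplified,
self-contained model of the per-vcpu reap loop (the real code is
kvm_dirty_ring_reap_one() in accel/kvm/kvm-all.c); the flag values match the
kernel UAPI, but the function name and the in-memory ring are illustrative:

```c
#include <assert.h>
#include <stdint.h>

/* Flag values from the kernel UAPI (linux/kvm.h). */
#define KVM_DIRTY_GFN_F_DIRTY 0x1
#define KVM_DIRTY_GFN_F_RESET 0x2

struct kvm_dirty_gfn {
    uint32_t flags;
    uint32_t slot;
    uint64_t offset;
};

/*
 * Sketch of reaping one vcpu's ring: starting at the vcpu's fetch
 * index, consume entries while they are marked dirty, flag each as
 * collected (RESET) so the kernel can recycle it on the next
 * KVM_RESET_DIRTY_RINGS, and advance the fetch index.  Returns the
 * number of entries reaped.
 */
static uint32_t reap_one_ring(struct kvm_dirty_gfn *ring, uint32_t ring_size,
                              uint32_t *fetch_index)
{
    uint32_t count = 0;

    while (1) {
        struct kvm_dirty_gfn *cur = &ring[*fetch_index % ring_size];
        if (!(cur->flags & KVM_DIRTY_GFN_F_DIRTY)) {
            break;              /* no more dirtied entries to consume */
        }
        /* The real reaper marks cur->slot/cur->offset dirty in the
         * ramblock dirty bitmap here. */
        cur->flags = KVM_DIRTY_GFN_F_RESET;
        (*fetch_index)++;
        count++;
    }
    return count;
}
```

The fetch index is per-vcpu state, which is why the ring can be consumed
without coordinating with other vcpus' rings.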

A new parameter "dirty-ring-size" is added to "-accel kvm".  By
default, dirty ring is still disabled (size==0).  To enable it, use:

  -accel kvm,dirty-ring-size=65536

This establishes a dirty ring with 65536 entries per vcpu.  Then when we
migrate, it'll use the dirty ring.
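As a back-of-envelope check on what that costs in memory, each ring entry is a
16-byte `struct kvm_dirty_gfn`, so the byte size per vcpu is just the entry
count times 16 (this mirrors the ring-size computation done when enabling
KVM_CAP_DIRTY_LOG_RING; the helper name here is illustrative):

```c
#include <assert.h>
#include <stdint.h>

/* Mirror of the kernel UAPI entry layout (16 bytes). */
struct kvm_dirty_gfn {
    uint32_t flags;
    uint32_t slot;
    uint64_t offset;
};

/* Bytes of ring memory per vcpu for a given entry count. */
static uint64_t dirty_ring_bytes(uint32_t entries)
{
    return (uint64_t)entries * sizeof(struct kvm_dirty_gfn);
}
```

So dirty-ring-size=65536 costs 1 MiB of ring memory per vcpu, independent of
guest memory size, unlike the dirty bitmap, which grows linearly with guest
RAM.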

I gave it a shot with a 24G guest, 8 vcpus, using a 10g NIC as the migration
channel.  When idle or with a small dirty workload, I don't observe a major
difference in total migration time.  With a higher random dirty workload
(800MB/s dirty rate upon 20G memory), results are worse for kvm dirty
ring.  Total migration time (ping-pong migrate 6 times, in seconds):

|-------------------------+---------------|
| dirty ring (4k entries) | dirty logging |
|-------------------------+---------------|
|                      70 |            58 |
|                      78 |            70 |
|                      72 |            48 |
|                      74 |            52 |
|                      83 |            49 |
|                      65 |            54 |
|-------------------------+---------------|

Summary:

dirty ring average:    73s
dirty logging average: 55s

KVM dirty ring is slower in the above case.  The numbers suggest
that dirty logging is still preferred as the default, because
small/medium VMs are still the major cases and high dirty workloads
happen frequently too.  And that's what this series does.

TODO:

- Consider dropping the BQL dependency: then we can run the reaper thread in
  parallel with the main thread.  Needs some thought around the race conditions.

- Consider dropping the kvmslot bitmap: logically this can be dropped with kvm
  dirty ring; not only for space saving, but also because it's yet another
  layer linear in guest mem size, which is against the whole idea of kvm dirty
  ring.  This should make the above numbers (of kvm dirty ring) even smaller
  (but still may not be as good as dirty logging with such a high workload).

Please refer to the code and comments for more information.

Thanks,

Peter Xu (10):
  memory: Introduce log_sync_global() to memory listener
  KVM: Use a big lock to replace per-kml slots_lock
  KVM: Create the KVMSlot dirty bitmap on flag changes
  KVM: Provide helper to get kvm dirty log
  KVM: Provide helper to sync dirty bitmap from slot to ramblock
  KVM: Simplify dirty log sync in kvm_set_phys_mem
  KVM: Cache kvm slot dirty bitmap size
  KVM: Add dirty-gfn-count property
  KVM: Disable manual dirty log when dirty ring enabled
  KVM: Dirty ring support

 accel/kvm/kvm-all.c      | 593 +++++++++++++++++++++++++++++++++------
 accel/kvm/trace-events   |   7 +
 include/exec/memory.h    |  12 +
 include/hw/core/cpu.h    |   7 +
 include/sysemu/kvm_int.h |   7 +-
 qemu-options.hx          |  12 +
 softmmu/memory.c         |  33 ++-
 7 files changed, 572 insertions(+), 99 deletions(-)

--
2.31.1

Comments

Paolo Bonzini May 14, 2021, 3:38 p.m. UTC | #1
On 06/05/21 18:05, Peter Xu wrote:
> This is v7 of the qemu dirty ring interface support.
> 
> v7:
> - Rebase to latest master commit d45a5270d07

Queued, thanks!

I only made a small change to rename the property from dirty-gfn-count
to dirty-ring-size, since (assuming the user knows what gfn means)
it's not clear that it's related to the ring buffer support.

Thanks,

Paolo

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index fd8ce2e0b2..aa785b7171 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -146,9 +146,8 @@ struct KVMState
          KVMMemoryListener *ml;
          AddressSpace *as;
      } *as;
-    bool kvm_dirty_ring_enabled;    /* Whether KVM dirty ring is enabled */
-    uint64_t kvm_dirty_ring_size;   /* Size of the per-vcpu dirty ring */
-    uint32_t kvm_dirty_gfn_count;   /* Number of dirty GFNs per ring */
+    uint64_t kvm_dirty_ring_bytes;  /* Size of the per-vcpu dirty ring */
+    uint32_t kvm_dirty_ring_size;   /* Number of dirty GFNs per ring */
      struct KVMDirtyRingReaper reaper;
  };
  
@@ -725,14 +724,14 @@ static void dirty_gfn_set_collected(struct kvm_dirty_gfn *gfn)
  static uint32_t kvm_dirty_ring_reap_one(KVMState *s, CPUState *cpu)
  {
      struct kvm_dirty_gfn *dirty_gfns = cpu->kvm_dirty_gfns, *cur;
-    uint32_t gfn_count = s->kvm_dirty_gfn_count;
+    uint32_t ring_size = s->kvm_dirty_ring_size;
      uint32_t count = 0, fetch = cpu->kvm_fetch_index;
  
-    assert(dirty_gfns && gfn_count);
+    assert(dirty_gfns && ring_size);
      trace_kvm_dirty_ring_reap_vcpu(cpu->cpu_index);
  
      while (true) {
-        cur = &dirty_gfns[fetch % gfn_count];
+        cur = &dirty_gfns[fetch % ring_size];
          if (!dirty_gfn_is_dirtied(cur)) {
              break;
          }
@@ -1389,7 +1388,7 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
                   *
                   * Not easy.  Let's cross the fingers until it's fixed.
                   */
-                if (kvm_state->kvm_dirty_ring_enabled) {
+                if (kvm_state->kvm_dirty_ring_size) {
                      kvm_dirty_ring_reap_locked(kvm_state);
                  } else {
                      kvm_slot_get_dirty_log(kvm_state, mem);
@@ -2445,24 +2444,24 @@ static int kvm_init(MachineState *ms)
       * Enable KVM dirty ring if supported, otherwise fall back to
       * dirty logging mode
       */
-    if (s->kvm_dirty_gfn_count > 0) {
-        uint64_t ring_size;
+    if (s->kvm_dirty_ring_size > 0) {
+        uint64_t ring_bytes;
  
-        ring_size = s->kvm_dirty_gfn_count * sizeof(struct kvm_dirty_gfn);
+        ring_bytes = s->kvm_dirty_ring_size * sizeof(struct kvm_dirty_gfn);
  
          /* Read the max supported pages */
          ret = kvm_vm_check_extension(s, KVM_CAP_DIRTY_LOG_RING);
          if (ret > 0) {
-            if (ring_size > ret) {
-                error_report("KVM dirty GFN count %" PRIu32 " too big "
+            if (ring_bytes > ret) {
+                error_report("KVM dirty ring size %" PRIu32 " too big "
                               "(maximum is %ld).  Please use a smaller value.",
-                             s->kvm_dirty_gfn_count,
+                             s->kvm_dirty_ring_size,
                               ret / sizeof(struct kvm_dirty_gfn));
                  ret = -EINVAL;
                  goto err;
              }
  
-            ret = kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING, 0, ring_size);
+            ret = kvm_vm_enable_cap(s, KVM_CAP_DIRTY_LOG_RING, 0, ring_bytes);
              if (ret) {
                  error_report("Enabling of KVM dirty ring failed: %d. "
                               "Suggested mininum value is 1024. "
@@ -2470,8 +2469,7 @@ static int kvm_init(MachineState *ms)
                  goto err;
              }
  
-            s->kvm_dirty_ring_size = ring_size;
-            s->kvm_dirty_ring_enabled = true;
+            s->kvm_dirty_ring_bytes = ring_bytes;
          }
      }
  
@@ -3552,17 +3550,17 @@ bool kvm_kernel_irqchip_split(void)
      return kvm_state->kernel_irqchip_split == ON_OFF_AUTO_ON;
  }
  
-static void kvm_get_dirty_gfn_count(Object *obj, Visitor *v,
+static void kvm_get_dirty_ring_size(Object *obj, Visitor *v,
                                      const char *name, void *opaque,
                                      Error **errp)
  {
      KVMState *s = KVM_STATE(obj);
-    uint32_t value = s->kvm_dirty_gfn_count;
+    uint32_t value = s->kvm_dirty_ring_size;
  
      visit_type_uint32(v, name, &value, errp);
  }
  
-static void kvm_set_dirty_gfn_count(Object *obj, Visitor *v,
+static void kvm_set_dirty_ring_size(Object *obj, Visitor *v,
                                      const char *name, void *opaque,
                                      Error **errp)
  {
@@ -3576,7 +3574,7 @@ static void kvm_set_dirty_gfn_count(Object *obj, Visitor *v,
          return;
      }
  
-    s->kvm_dirty_gfn_count = value;
+    s->kvm_dirty_ring_size = value;
  }
  
  static void kvm_accel_instance_init(Object *obj)
@@ -3587,7 +3585,7 @@ static void kvm_accel_instance_init(Object *obj)
      s->kernel_irqchip_allowed = true;
      s->kernel_irqchip_split = ON_OFF_AUTO_AUTO;
      /* KVM dirty ring is by default off */
-    s->kvm_dirty_gfn_count = 0;
+    s->kvm_dirty_ring_size = 0;
  }
  
  static void kvm_accel_class_init(ObjectClass *oc, void *data)
@@ -3610,11 +3608,11 @@ static void kvm_accel_class_init(ObjectClass *oc, void *data)
      object_class_property_set_description(oc, "kvm-shadow-mem",
          "KVM shadow MMU size");
  
-    object_class_property_add(oc, "dirty-gfn-count", "uint32",
-        kvm_get_dirty_gfn_count, kvm_set_dirty_gfn_count,
+    object_class_property_add(oc, "dirty-ring-size", "uint32",
+        kvm_get_dirty_ring_size, kvm_set_dirty_ring_size,
          NULL, NULL);
-    object_class_property_set_description(oc, "dirty-gfn-count",
-        "KVM dirty GFN count (=0 to disable dirty ring)");
+    object_class_property_set_description(oc, "dirty-ring-size",
+        "Size of KVM dirty page ring buffer (default: 0, i.e. use bitmap)");
  }
  
  static const TypeInfo kvm_accel_type = {
diff --git a/qemu-options.hx b/qemu-options.hx
index acd8b4f6f9..31931f0923 100644
--- a/qemu-options.hx
+++ b/qemu-options.hx
@@ -141,7 +141,7 @@ DEF("accel", HAS_ARG, QEMU_OPTION_accel,
      "                kvm-shadow-mem=size of KVM shadow MMU in bytes\n"
      "                split-wx=on|off (enable TCG split w^x mapping)\n"
      "                tb-size=n (TCG translation block cache size)\n"
-    "                dirty-gfn-count=n (KVM dirty ring GFN count, default 0)\n"
+    "                dirty-ring-size=n (KVM dirty ring GFN count, default 0)\n"
      "                thread=single|multi (enable multi-threaded TCG)\n", QEMU_ARCH_ALL)
  SRST
  ``-accel name[,prop=value[,...]]``
@@ -183,15 +183,15 @@ SRST
          incompatible TCG features have been enabled (e.g.
          icount/replay).
  
-    ``dirty-gfn-count=n``
-        When KVM accelerator is used, it controls the per-vcpu KVM dirty ring
-        size (number of entries one dirty ring contains, per-vcpu). It should
+    ``dirty-ring-size=n``
+        When the KVM accelerator is used, it controls the size of the per-vCPU
+        dirty page ring buffer (number of entries for each vCPU). It should
          be a value that is power of two, and it should be 1024 or bigger (but
          still less than the maximum value that the kernel supports).  4096
          could be a good initial value if you have no idea which is the best.
          Set this value to 0 to disable the feature.  By default, this feature
-        is disabled (dirty-gfn-count=0).  When enabled, it'll automatically
-        replace the kvm get dirty log feature.
+        is disabled (dirty-ring-size=0).  When enabled, KVM will instead
+        record dirty pages in a bitmap.
  
  ERST
Peter Xu May 14, 2021, 3:51 p.m. UTC | #2
On Fri, May 14, 2021 at 05:38:17PM +0200, Paolo Bonzini wrote:
> On 06/05/21 18:05, Peter Xu wrote:
> > This is v7 of the qemu dirty ring interface support.
> > 
> > v7:
> > - Rebase to latest master commit d45a5270d07
> 
> Queued, thanks!
> 
> I only made a small change to rename the property from dirty-gfn-count
> to dirty-ring-size, since (assuming the user knows what gfn means)
> it's not clear that it's related to the ring buffer support.

Yeah, previously the only concern was to make clear that it's the entry count
rather than a size in bytes; however with the man page and all the other
places describing it, this looks good.  Thanks Paolo!