[00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG

Message ID	20190508061523.17666-1-peterx@redhat.com (mailing list archive)
Headers	show Return-Path: <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org> From: Peter Xu <peterx@redhat.com> To: qemu-devel@nongnu.org Date: Wed, 8 May 2019 14:15:12 +0800 Message-Id: <20190508061523.17666-1-peterx@redhat.com> Subject: [Qemu-devel] [PATCH 00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG Precedence: list Cc: Laurent Vivier <lvivier@redhat.com>, Paolo Bonzini <pbonzini@redhat.com>, "Dr . David Alan Gilbert" <dgilbert@redhat.com>, peterx@redhat.com, Juan Quintela <quintela@redhat.com> Errors-To: qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org Sender: "Qemu-devel" <qemu-devel-bounces+patchwork-qemu-devel=patchwork.kernel.org@nongnu.org>
Series	kvm/migration: support KVM_CLEAR_DIRTY_LOG \| expand [00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG [01/11] migration: No need to take rcu during sync_dirty_bitmap [02/11] memory: Remove memory_region_get_dirty() [03/11] memory: Don't set migration bitmap when without migration [04/11] bitmap: Add bitmap_copy_with_{src\|dst}_offset() [05/11] memory: Pass mr into snapshot_and_clear_dirty [06/11] memory: Introduce memory listener hook log_clear() [07/11] kvm: Update comments for sync_dirty_bitmap [08/11] kvm: Persistent per kvmslot dirty bitmap [09/11] kvm: Introduce slots lock for memory listener [10/11] kvm: Support KVM_CLEAR_DIRTY_LOG [11/11] migration: Split log_clear() into smaller chunks

Peter Xu May 8, 2019, 6:15 a.m. UTC

Summary
=====================

Based-on: <20190426062705.4651-1-peterx@redhat.com>

(It's "[PATCH] checkpatch: allow SPDX-License-Identifier", not a big
 deal, just to make sure no spoil message since one patch used SPDX
 license identifier and checkpatch doesn't like it...)

This series allows QEMU to start using the new KVM_CLEAR_DIRTY_LOG
interface. For more on KVM_CLEAR_DIRTY_LOG itself, please refer to:

  https://github.com/torvalds/linux/blob/master/Documentation/virtual/kvm/api.txt#L3810

The QEMU work (which is this series) is pushed too, please find the
tree here:

  https://github.com/xzpeter/qemu/tree/kvm-clear-dirty-log

Meanwhile, For anyone who really wants to try this out, please also
upgrade the host kernel and use the tree here:

  https://github.com/xzpeter/linux/tree/kvm-clear-dirty-fixes

The new kernel is required because there are still some fixes to make
the whole thing work, which have not yet been finalized and they
haven't reached Linux master.

Design
===================

I started with a naive/stupid design that I always pass all 1's to the
KVM for a memory range to clear all the dirty bits within that memory
range, but then I encountered guest oops - it's simply because we
can't clear any dirty bit from QEMU if we are not _sure_ that the bit
is dirty in the kernel.  Otherwise we might accidentally clear a bit
that we don't even know of (e.g., the bit was clear in migration's
dirty bitmap in QEMU) but actually that page was just being written so
QEMU will never remember to migrate that new page again.

The new design is focused on a dirty bitmap cache within the QEMU kvm
layer (which is per kvm memory slot).  With that we know what's dirty
in the kernel previously (note! the kernel bitmap is still growing all
the time so the cache will only be a subset of the realtime kernel
bitmap but that's far enough for us) and with that we'll be sure to
not accidentally clear unknown dirty pages.

With this method, we can also avoid race when multiple users (e.g.,
DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the bit for
multiple time.  If without the kvm memory slot cached dirty bitmap we
won't be able to know which bit has been cleared and then if we send
the CLEAR operation upon the same bit twice (or more) we can still
face the same issue to clear something accidentally while we
shouldn't.

Summary: we really need to be careful on what bit to clear otherwise
we can face anything after the migration completes.  And I hope this
series has considered all about this.

Besides the new KVM cache layer and the new ioctl support, this series
introduced the memory_region_clear_dirty_bitmap() in the memory API
layer to allow clearing dirty bits of a specific memory range within
the memory region.

Implementations
============================

Patch 1-3:  these should be nothing directly related to the series but
            they are things I found during working on it.  They can be
            picked even earlier if reviewers are happy with them.

Patch 4:    pre-work on bitmap operations, and within the patch I added
            the first unit test for utils/bitmap.c.

Patch 5-6:  the new memory API interface.  Since no one is providing
            log_clear() yet so it's not working yet.  Note that this
            only splits the dirty clear operation from sync but it
            hasn't yet been splitted into smaller chunk so it's not
            really helpful for us yet.

Patch 7-10: kvm support of KVM_CLEAR_DIRTY_LOG.

Patch 11:   do the log_clear() splitting for the case of migration.
            Also a new parameter is introduced to define the block
            size of the small chunks (the unit to clear dirty bits)

Tests
===========================

- make check
- migrate idle/memory-heavy guests

(Not yet tested with huge guests but it'll be more than welcomed if
 someone has the resource and wants to give it a shot)

Please have a look, thanks.

Peter Xu (11):
  migration: No need to take rcu during sync_dirty_bitmap
  memory: Remove memory_region_get_dirty()
  memory: Don't set migration bitmap when without migration
  bitmap: Add bitmap_copy_with_{src|dst}_offset()
  memory: Pass mr into snapshot_and_clear_dirty
  memory: Introduce memory listener hook log_clear()
  kvm: Update comments for sync_dirty_bitmap
  kvm: Persistent per kvmslot dirty bitmap
  kvm: Introduce slots lock for memory listener
  kvm: Support KVM_CLEAR_DIRTY_LOG
  migration: Split log_clear() into smaller chunks

 accel/kvm/kvm-all.c      | 287 ++++++++++++++++++++++++++++++++++-----
 accel/kvm/trace-events   |   1 +
 exec.c                   |  15 +-
 include/exec/memory.h    |  36 ++---
 include/exec/ram_addr.h  |  91 ++++++++++++-
 include/qemu/bitmap.h    |   9 ++
 include/sysemu/kvm_int.h |   4 +
 memory.c                 |  64 +++++++--
 migration/migration.c    |   4 +
 migration/migration.h    |  27 ++++
 migration/ram.c          |  45 ++++++
 migration/trace-events   |   1 +
 tests/Makefile.include   |   2 +
 tests/test-bitmap.c      |  81 +++++++++++
 util/bitmap.c            |  73 ++++++++++
 15 files changed, 671 insertions(+), 69 deletions(-)
 create mode 100644 tests/test-bitmap.c

Paolo Bonzini May 8, 2019, 10:09 a.m. UTC | #1

On 08/05/19 01:15, Peter Xu wrote:
> Design
> ===================
> 
> I started with a naive/stupid design that I always pass all 1's to the
> KVM for a memory range to clear all the dirty bits within that memory
> range, but then I encountered guest oops - it's simply because we
> can't clear any dirty bit from QEMU if we are not _sure_ that the bit
> is dirty in the kernel.  Otherwise we might accidentally clear a bit
> that we don't even know of (e.g., the bit was clear in migration's
> dirty bitmap in QEMU) but actually that page was just being written so
> QEMU will never remember to migrate that new page again.
> 
> The new design is focused on a dirty bitmap cache within the QEMU kvm
> layer (which is per kvm memory slot).  With that we know what's dirty
> in the kernel previously (note! the kernel bitmap is still growing all
> the time so the cache will only be a subset of the realtime kernel
> bitmap but that's far enough for us) and with that we'll be sure to
> not accidentally clear unknown dirty pages.
> 
> With this method, we can also avoid race when multiple users (e.g.,
> DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the bit for
> multiple time.  If without the kvm memory slot cached dirty bitmap we
> won't be able to know which bit has been cleared and then if we send
> the CLEAR operation upon the same bit twice (or more) we can still
> face the same issue to clear something accidentally while we
> shouldn't.
> 
> Summary: we really need to be careful on what bit to clear otherwise
> we can face anything after the migration completes.  And I hope this
> series has considered all about this.

The disadvantage of this is that you won't clear in the kernel those
dirty bits that come from other sources (e.g. vhost or
address_space_map).  This can lead to double-copying of pages.

Migration already makes a local copy in rb->bmap, and
memory_region_snapshot_and_clear_dirty can also do the clear.  Would it
be possible to invoke the clear using rb->bmap instead of the KVMSlot's
new bitmap?

Thanks,

Paolo

> Besides the new KVM cache layer and the new ioctl support, this series
> introduced the memory_region_clear_dirty_bitmap() in the memory API
> layer to allow clearing dirty bits of a specific memory range within
> the memory region.

Peter Xu May 8, 2019, 11:39 a.m. UTC | #2

On Wed, May 08, 2019 at 12:09:00PM +0200, Paolo Bonzini wrote:
> On 08/05/19 01:15, Peter Xu wrote:
> > Design
> > ===================
> > 
> > I started with a naive/stupid design that I always pass all 1's to the
> > KVM for a memory range to clear all the dirty bits within that memory
> > range, but then I encountered guest oops - it's simply because we
> > can't clear any dirty bit from QEMU if we are not _sure_ that the bit
> > is dirty in the kernel.  Otherwise we might accidentally clear a bit
> > that we don't even know of (e.g., the bit was clear in migration's
> > dirty bitmap in QEMU) but actually that page was just being written so
> > QEMU will never remember to migrate that new page again.
> > 
> > The new design is focused on a dirty bitmap cache within the QEMU kvm
> > layer (which is per kvm memory slot).  With that we know what's dirty
> > in the kernel previously (note! the kernel bitmap is still growing all
> > the time so the cache will only be a subset of the realtime kernel
> > bitmap but that's far enough for us) and with that we'll be sure to
> > not accidentally clear unknown dirty pages.
> > 
> > With this method, we can also avoid race when multiple users (e.g.,
> > DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the bit for
> > multiple time.  If without the kvm memory slot cached dirty bitmap we
> > won't be able to know which bit has been cleared and then if we send
> > the CLEAR operation upon the same bit twice (or more) we can still
> > face the same issue to clear something accidentally while we
> > shouldn't.
> > 
> > Summary: we really need to be careful on what bit to clear otherwise
> > we can face anything after the migration completes.  And I hope this
> > series has considered all about this.
> 
> The disadvantage of this is that you won't clear in the kernel those
> dirty bits that come from other sources (e.g. vhost or
> address_space_map).  This can lead to double-copying of pages.
> 
> Migration already makes a local copy in rb->bmap, and
> memory_region_snapshot_and_clear_dirty can also do the clear.  Would it
> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
> new bitmap?

Do you mean to invoke the clear notifications and deliver the bitmap
from the very top of the stack (for VGA it would probably be the
DirtyBitmapSnapshot, for MIGRATION it's the rb->bmap)?

Actually that's what I did in the first version before I post the work
but I noticed that there seems to have a race condition with the
design.  The problem is we have multiple copies of the same dirty
bitmap from KVM and the race can happen with those multiple users
(bitmaps of the users can be a merged version containing KVM and other
sources like vhost, address_space_map, etc. but let's just make it
simpler to not have them yet).  Here's one possible race condition
(assuming migration has started):

  (1) page P is written by the guest (let's version its data as P1),
      its dirty bit D is set in KVM

  (2) QEMU sync dirty log, fetch D for page P (which is set).  D is
      not cleared in KVM due to MANUAL_PROTECT cap.

  (3) QEMU copies the D bit to all users of dirty bmap (MIGRATION and
      VGA as example).

  (4) MIGRATION code collects bit D in migration bitmap, clear it,
      send CLEAR to KVM to clear the bit on remote (then D in KVM is
      cleared too), send the page P with content P1 to destination.
      CAUTION: when reach here VGA bitmap still has the D bit set.

  (5) page P is written by the guest again (let's version its data as
      P2), bit D set again in KVM.

  (6) VGA code collectes bit D (note! we haven't synced the log again
      so this is the _old_ dirty bit came from step 3 above) in VGA
      bitmap, clear it, send CLEAR to KVM to clear the bit on remote.
      Then D bit in KVM is cleared again.  Until here, D bit in all
      bitmaps are cleared (MIGRATION, VGA, KVM).

  (7) page P is never written again until migration completes (no
      matter whether we sync again D bit will be cleared).

  (8) On destination VM page P will contain content P1 rather than P2,
      this is data loss...

After I noticed it, I added the kvm bitmap layer with the lock to make
sure we won't send the 2nd clear at step (6) above because with the
per memslot bitmap we will be able to know that we've already cleared
a bit by one user so we shouldn't clear it again (as long as we don't
SYNC again).  However later on I just noticed maybe I don't even need
the top layer bitmap at all, because it seems to not bring much
benefit but instead it'll bring lots of more complexity - I'll need to
do bitmap convertions (because upper layer bitmaps are per ramblock,
and kvm bitmaps are per memslot), not to mention I'll need to AND the
two bitmaps (upper layer bitmap, and the QEMU kvm cached bitmap) to
avoid the above race.  So at last I simply removed the top layer
bitmap and only keep the per memslot bitmap as what this series did.

Thanks,

Paolo Bonzini May 8, 2019, 11:55 a.m. UTC | #3

On 08/05/19 06:39, Peter Xu wrote:
>> The disadvantage of this is that you won't clear in the kernel those
>> dirty bits that come from other sources (e.g. vhost or
>> address_space_map).  This can lead to double-copying of pages.
>>
>> Migration already makes a local copy in rb->bmap, and
>> memory_region_snapshot_and_clear_dirty can also do the clear.  Would it
>> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
>> new bitmap?
>
> Actually that's what I did in the first version before I post the work
> but I noticed that there seems to have a race condition with the
> design.  The problem is we have multiple copies of the same dirty
> bitmap from KVM and the race can happen with those multiple users
> (bitmaps of the users can be a merged version containing KVM and other
> sources like vhost, address_space_map, etc. but let's just make it
> simpler to not have them yet).

I see now.  And in fact the same double-copying inefficiency happens
already without this series, so you are improving the situation anyway.

Have you done any kind of benchmarking already?

Thanks,

Paolo

>   (1) page P is written by the guest (let's version its data as P1),
>       its dirty bit D is set in KVM
> 
>   (2) QEMU sync dirty log, fetch D for page P (which is set).  D is
>       not cleared in KVM due to MANUAL_PROTECT cap.
> 
>   (3) QEMU copies the D bit to all users of dirty bmap (MIGRATION and
>       VGA as example).
> 
>   (4) MIGRATION code collects bit D in migration bitmap, clear it,
>       send CLEAR to KVM to clear the bit on remote (then D in KVM is
>       cleared too), send the page P with content P1 to destination.
>       CAUTION: when reach here VGA bitmap still has the D bit set.
> 
>   (5) page P is written by the guest again (let's version its data as
>       P2), bit D set again in KVM.
> 
>   (6) VGA code collectes bit D (note! we haven't synced the log again
>       so this is the _old_ dirty bit came from step 3 above) in VGA
>       bitmap, clear it, send CLEAR to KVM to clear the bit on remote.
>       Then D bit in KVM is cleared again.  Until here, D bit in all
>       bitmaps are cleared (MIGRATION, VGA, KVM).
> 
>   (7) page P is never written again until migration completes (no
>       matter whether we sync again D bit will be cleared).
> 
>   (8) On destination VM page P will contain content P1 rather than P2,
>       this is data loss...

Peter Xu May 9, 2019, 2:33 a.m. UTC | #4

On Wed, May 08, 2019 at 01:55:07PM +0200, Paolo Bonzini wrote:
> On 08/05/19 06:39, Peter Xu wrote:
> >> The disadvantage of this is that you won't clear in the kernel those
> >> dirty bits that come from other sources (e.g. vhost or
> >> address_space_map).  This can lead to double-copying of pages.
> >>
> >> Migration already makes a local copy in rb->bmap, and
> >> memory_region_snapshot_and_clear_dirty can also do the clear.  Would it
> >> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
> >> new bitmap?
> >
> > Actually that's what I did in the first version before I post the work
> > but I noticed that there seems to have a race condition with the
> > design.  The problem is we have multiple copies of the same dirty
> > bitmap from KVM and the race can happen with those multiple users
> > (bitmaps of the users can be a merged version containing KVM and other
> > sources like vhost, address_space_map, etc. but let's just make it
> > simpler to not have them yet).
> 
> I see now.  And in fact the same double-copying inefficiency happens
> already without this series, so you are improving the situation anyway.
> 
> Have you done any kind of benchmarking already?

Not yet.  I posted the series for some initial reviews first before
moving on with performance tests.

My plan of the test scenario could be:

- find a guest with relatively large memory (I would guess it is
  better to have memory like 64G or even more to make some big
  difference)

- run random dirty memory workload upon most of the mem, with dirty
  rate X Bps.

- setup the migration bandwidth to Y Bps (Y should be bigger than X
  but not that big.  One could be X=800M and Y=1G to emulate 10G nic
  with a workload that we can still converge with precopy only) and
  start precopy migration.

- measure total migration time with CLEAR_LOG on & off. We should
  expect the guest to have these with CLEAR_LOG: (1) not hang during
  log_sync, and (2) migration should complete faster.

Does above test plan makes sense?

If both the QEMU/KVM changes looks ok in general, I can at least try
this on some smaller guests (I can manage ~10G mem guests with my own
hosts, but I can also try to find some bigger ones).

Thanks,

Peter Xu May 16, 2019, 9:26 a.m. UTC | #5

On Thu, May 09, 2019 at 10:33:19AM +0800, Peter Xu wrote:
> On Wed, May 08, 2019 at 01:55:07PM +0200, Paolo Bonzini wrote:
> > On 08/05/19 06:39, Peter Xu wrote:
> > >> The disadvantage of this is that you won't clear in the kernel those
> > >> dirty bits that come from other sources (e.g. vhost or
> > >> address_space_map).  This can lead to double-copying of pages.
> > >>
> > >> Migration already makes a local copy in rb->bmap, and
> > >> memory_region_snapshot_and_clear_dirty can also do the clear.  Would it
> > >> be possible to invoke the clear using rb->bmap instead of the KVMSlot's
> > >> new bitmap?
> > >
> > > Actually that's what I did in the first version before I post the work
> > > but I noticed that there seems to have a race condition with the
> > > design.  The problem is we have multiple copies of the same dirty
> > > bitmap from KVM and the race can happen with those multiple users
> > > (bitmaps of the users can be a merged version containing KVM and other
> > > sources like vhost, address_space_map, etc. but let's just make it
> > > simpler to not have them yet).
> > 
> > I see now.  And in fact the same double-copying inefficiency happens
> > already without this series, so you are improving the situation anyway.
> > 
> > Have you done any kind of benchmarking already?
> 
> Not yet.  I posted the series for some initial reviews first before
> moving on with performance tests.
> 
> My plan of the test scenario could be:
> 
> - find a guest with relatively large memory (I would guess it is
>   better to have memory like 64G or even more to make some big
>   difference)
> 
> - run random dirty memory workload upon most of the mem, with dirty
>   rate X Bps.
> 
> - setup the migration bandwidth to Y Bps (Y should be bigger than X
>   but not that big.  One could be X=800M and Y=1G to emulate 10G nic
>   with a workload that we can still converge with precopy only) and
>   start precopy migration.
> 
> - measure total migration time with CLEAR_LOG on & off. We should
>   expect the guest to have these with CLEAR_LOG: (1) not hang during
>   log_sync, and (2) migration should complete faster.

Some updates on performance numbers.

Summary: the ideal case below shows ~40% or even more time reduced to
migrate the same VM with the same workload.  In other words, it could
be seen as ~40% faster than before.

Test environment: 13G guest, 10G test mem (so I leave 3G untouched),
dirty rate 900MB/s, BW 10Gbps to emulate ixgbe, downtime 100ms.

IO pattern: I pre-fault all the 10G mem then I do random writes (with
command "mig_mon mm_dirty 10240 900 random" [1]) upon the test memory
with a constant dirty rate (900MB/s, as mentioned).  Then I migrate
during the IOs.

Here's the total migration time of such VM (for each scenario, I run
the migration 5 times then I get an average migration total time
used):

   |--------------+---------------------+-------------|
   | scenario     | migration times (s) | average (s) |
   |--------------+---------------------+-------------|
   | no CLEAR_LOG | 55, 54, 56, 74, 54  |          58 |
   | 1G chunk     | 40, 39, 41, 39, 40  |          40 |
   | 128M chunk   | 38, 40, 37, 40, 38  |          38 |
   | 16M chunk    | 42, 40, 38, 41, 38  |          39 |
   | 1M chunk     | 37, 40, 36, 40, 39  |          38 |
   |--------------+---------------------+-------------|

The first "no CLEAR_LOG" means the master branch which still uses the
GET_DIRTY only.  The latter four scenarios are all with the new
CLEAR_LOG interface, aka, this series.  The test result shows that
128M chunk size seems to be a good default value instead of 1G (which
this series used).  I'll adjust that accordingly when I post the next
version.

[1] https://github.com/xzpeter/clibs/blob/master/bsd/mig_mon/mig_mon.c

Regards,

[00/11] kvm/migration: support KVM_CLEAR_DIRTY_LOG

Message

Comments