Message ID: 20190508061523.17666-1-peterx@redhat.com (mailing list archive)
Series: kvm/migration: support KVM_CLEAR_DIRTY_LOG
On 08/05/19 01:15, Peter Xu wrote:
> Design
> ===================
>
> I started with a naive/stupid design where I always pass all 1's to
> KVM for a memory range to clear all the dirty bits within that memory
> range, but then I encountered a guest oops - simply because we can't
> clear any dirty bit from QEMU unless we are _sure_ that the bit is
> dirty in the kernel. Otherwise we might accidentally clear a bit that
> we don't even know of (e.g., the bit was clear in migration's dirty
> bitmap in QEMU) while that page was actually just being written, so
> QEMU would never remember to migrate that new page again.
>
> The new design is focused on a dirty bitmap cache within the QEMU kvm
> layer (which is per kvm memory slot). With that we know what was
> dirty in the kernel previously (note! the kernel bitmap is still
> growing all the time, so the cache will only be a subset of the
> real-time kernel bitmap, but that's enough for us) and with that
> we'll be sure not to accidentally clear unknown dirty pages.
>
> With this method we can also avoid races when multiple users (e.g.,
> DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear the same
> bit multiple times. Without the kvm memory slot's cached dirty
> bitmap we wouldn't be able to know which bits have been cleared, and
> if we sent the CLEAR operation on the same bit twice (or more) we
> could still face the same issue of accidentally clearing something
> we shouldn't.
>
> Summary: we really need to be careful about which bits we clear,
> otherwise anything can happen after the migration completes. I hope
> this series has taken all of this into account.

The disadvantage of this is that you won't clear in the kernel those
dirty bits that come from other sources (e.g. vhost or
address_space_map). This can lead to double-copying of pages.

Migration already makes a local copy in rb->bmap, and
memory_region_snapshot_and_clear_dirty can also do the clear. Would it
be possible to invoke the clear using rb->bmap instead of the KVMSlot's
new bitmap?

Thanks,

Paolo

> Besides the new KVM cache layer and the new ioctl support, this
> series introduces memory_region_clear_dirty_bitmap() in the memory
> API layer to allow clearing dirty bits of a specific memory range
> within a memory region.
On Wed, May 08, 2019 at 12:09:00PM +0200, Paolo Bonzini wrote:
> On 08/05/19 01:15, Peter Xu wrote:
> > Design
> > ===================
> >
> > I started with a naive/stupid design where I always pass all 1's
> > to KVM for a memory range to clear all the dirty bits within that
> > memory range, but then I encountered a guest oops - simply because
> > we can't clear any dirty bit from QEMU unless we are _sure_ that
> > the bit is dirty in the kernel. Otherwise we might accidentally
> > clear a bit that we don't even know of (e.g., the bit was clear in
> > migration's dirty bitmap in QEMU) while that page was actually
> > just being written, so QEMU would never remember to migrate that
> > new page again.
> >
> > The new design is focused on a dirty bitmap cache within the QEMU
> > kvm layer (which is per kvm memory slot). With that we know what
> > was dirty in the kernel previously (note! the kernel bitmap is
> > still growing all the time, so the cache will only be a subset of
> > the real-time kernel bitmap, but that's enough for us) and with
> > that we'll be sure not to accidentally clear unknown dirty pages.
> >
> > With this method we can also avoid races when multiple users
> > (e.g., DIRTY_MEMORY_VGA and DIRTY_MEMORY_MIGRATION) want to clear
> > the same bit multiple times. Without the kvm memory slot's cached
> > dirty bitmap we wouldn't be able to know which bits have been
> > cleared, and if we sent the CLEAR operation on the same bit twice
> > (or more) we could still face the same issue of accidentally
> > clearing something we shouldn't.
> >
> > Summary: we really need to be careful about which bits we clear,
> > otherwise anything can happen after the migration completes. I
> > hope this series has taken all of this into account.
>
> The disadvantage of this is that you won't clear in the kernel those
> dirty bits that come from other sources (e.g. vhost or
> address_space_map). This can lead to double-copying of pages.
>
> Migration already makes a local copy in rb->bmap, and
> memory_region_snapshot_and_clear_dirty can also do the clear. Would
> it be possible to invoke the clear using rb->bmap instead of the
> KVMSlot's new bitmap?

Do you mean to invoke the clear notifications and deliver the bitmap
from the very top of the stack (for VGA it would probably be the
DirtyBitmapSnapshot, for MIGRATION it's rb->bmap)?

Actually that's what I did in the first version before I posted this
work, but I noticed that there seems to be a race condition with that
design. The problem is that we have multiple copies of the same dirty
bitmap from KVM, and the race can happen between those multiple users
(the users' bitmaps can be merged versions containing KVM and other
sources like vhost, address_space_map, etc., but let's keep it simple
and leave those out for now). Here's one possible race condition
(assuming migration has started):

  (1) page P is written by the guest (let's version its data as P1);
      its dirty bit D is set in KVM

  (2) QEMU syncs the dirty log and fetches D for page P (which is
      set). D is not cleared in KVM due to the MANUAL_PROTECT cap.

  (3) QEMU copies the D bit to all users of the dirty bitmap
      (MIGRATION and VGA, for example).

  (4) MIGRATION code collects bit D in the migration bitmap, clears
      it, sends CLEAR to KVM to clear the bit on the kernel side
      (then D in KVM is cleared too), and sends page P with content
      P1 to the destination. CAUTION: when we reach here, the VGA
      bitmap still has the D bit set.

  (5) page P is written by the guest again (let's version its data as
      P2); bit D is set again in KVM.

  (6) VGA code collects bit D (note! we haven't synced the log again,
      so this is the _old_ dirty bit that came from step 3 above) in
      the VGA bitmap, clears it, and sends CLEAR to KVM to clear the
      bit on the kernel side. Then the D bit in KVM is cleared again.
      At this point the D bit is cleared in all bitmaps (MIGRATION,
      VGA, KVM).

  (7) page P is never written again until migration completes (no
      matter whether we sync again, D will stay clear).

  (8) On the destination, page P will contain content P1 rather than
      P2 - this is data loss...

After I noticed it, I added the kvm bitmap layer with a lock to make
sure we won't send the 2nd CLEAR at step (6) above, because with the
per-memslot bitmap we are able to know that we've already cleared a
bit for one user, so we shouldn't clear it again (as long as we don't
SYNC again). However, later on I noticed that maybe I don't even need
the top-layer bitmap at all, because it doesn't seem to bring much
benefit but would bring a lot more complexity - I would need to do
bitmap conversions (because the upper-layer bitmaps are per ramblock,
while the kvm bitmaps are per memslot), not to mention I would need to
AND the two bitmaps (the upper-layer bitmap and the QEMU kvm cached
bitmap) to avoid the above race. So in the end I simply removed the
top-layer bitmap and kept only the per-memslot bitmap, as this series
does.

Thanks,
On 08/05/19 06:39, Peter Xu wrote:
>> The disadvantage of this is that you won't clear in the kernel those
>> dirty bits that come from other sources (e.g. vhost or
>> address_space_map). This can lead to double-copying of pages.
>>
>> Migration already makes a local copy in rb->bmap, and
>> memory_region_snapshot_and_clear_dirty can also do the clear. Would
>> it be possible to invoke the clear using rb->bmap instead of the
>> KVMSlot's new bitmap?
>
> Actually that's what I did in the first version before I posted this
> work, but I noticed that there seems to be a race condition with
> that design. The problem is that we have multiple copies of the
> same dirty bitmap from KVM, and the race can happen between those
> multiple users (the users' bitmaps can be merged versions containing
> KVM and other sources like vhost, address_space_map, etc., but let's
> keep it simple and leave those out for now).

I see now. And in fact the same double-copying inefficiency happens
already without this series, so you are improving the situation anyway.

Have you done any kind of benchmarking already?

Thanks,

Paolo

> (1) page P is written by the guest (let's version its data as P1);
>     its dirty bit D is set in KVM
>
> (2) QEMU syncs the dirty log and fetches D for page P (which is
>     set). D is not cleared in KVM due to the MANUAL_PROTECT cap.
>
> (3) QEMU copies the D bit to all users of the dirty bitmap
>     (MIGRATION and VGA, for example).
>
> (4) MIGRATION code collects bit D in the migration bitmap, clears
>     it, sends CLEAR to KVM to clear the bit on the kernel side
>     (then D in KVM is cleared too), and sends page P with content
>     P1 to the destination. CAUTION: when we reach here, the VGA
>     bitmap still has the D bit set.
>
> (5) page P is written by the guest again (let's version its data as
>     P2); bit D is set again in KVM.
>
> (6) VGA code collects bit D (note! we haven't synced the log again,
>     so this is the _old_ dirty bit that came from step 3 above) in
>     the VGA bitmap, clears it, and sends CLEAR to KVM to clear the
>     bit on the kernel side. Then the D bit in KVM is cleared again.
>     At this point the D bit is cleared in all bitmaps (MIGRATION,
>     VGA, KVM).
>
> (7) page P is never written again until migration completes (no
>     matter whether we sync again, D will stay clear).
>
> (8) On the destination, page P will contain content P1 rather than
>     P2 - this is data loss...
On Wed, May 08, 2019 at 01:55:07PM +0200, Paolo Bonzini wrote:
> On 08/05/19 06:39, Peter Xu wrote:
> >> The disadvantage of this is that you won't clear in the kernel
> >> those dirty bits that come from other sources (e.g. vhost or
> >> address_space_map). This can lead to double-copying of pages.
> >>
> >> Migration already makes a local copy in rb->bmap, and
> >> memory_region_snapshot_and_clear_dirty can also do the clear.
> >> Would it be possible to invoke the clear using rb->bmap instead
> >> of the KVMSlot's new bitmap?
> >
> > Actually that's what I did in the first version before I posted
> > this work, but I noticed that there seems to be a race condition
> > with that design. The problem is that we have multiple copies of
> > the same dirty bitmap from KVM, and the race can happen between
> > those multiple users (the users' bitmaps can be merged versions
> > containing KVM and other sources like vhost, address_space_map,
> > etc., but let's keep it simple and leave those out for now).
>
> I see now. And in fact the same double-copying inefficiency happens
> already without this series, so you are improving the situation
> anyway.
>
> Have you done any kind of benchmarking already?

Not yet. I posted the series for some initial reviews first before
moving on with performance tests.

My plan for the test scenario:

- find a guest with relatively large memory (I would guess something
  like 64G or even more is better, to show a big difference)

- run a random dirty memory workload over most of the memory, with
  dirty rate X Bps

- set the migration bandwidth to Y Bps (Y should be bigger than X but
  not by much; one choice could be X=800M and Y=1G to emulate a 10G
  NIC with a workload that can still converge with precopy only) and
  start precopy migration

- measure total migration time with CLEAR_LOG on & off

With CLEAR_LOG we should expect the guest to (1) not hang during
log_sync, and (2) complete migration faster.

Does the above test plan make sense? If both the QEMU and KVM changes
look OK in general, I can at least try this on some smaller guests (I
can manage ~10G-mem guests on my own hosts, but I can also try to
find some bigger ones).

Thanks,
On Thu, May 09, 2019 at 10:33:19AM +0800, Peter Xu wrote:
> On Wed, May 08, 2019 at 01:55:07PM +0200, Paolo Bonzini wrote:
> > [...]
> >
> > Have you done any kind of benchmarking already?
>
> Not yet. I posted the series for some initial reviews first before
> moving on with performance tests.
>
> My plan for the test scenario:
>
> - find a guest with relatively large memory (I would guess something
>   like 64G or even more is better, to show a big difference)
>
> - run a random dirty memory workload over most of the memory, with
>   dirty rate X Bps
>
> - set the migration bandwidth to Y Bps (Y should be bigger than X
>   but not by much; one choice could be X=800M and Y=1G to emulate a
>   10G NIC with a workload that can still converge with precopy only)
>   and start precopy migration
>
> - measure total migration time with CLEAR_LOG on & off
>
> With CLEAR_LOG we should expect the guest to (1) not hang during
> log_sync, and (2) complete migration faster.

Some updates on performance numbers.

Summary: the ideal cases below show total migration time reduced by
roughly 35% (a 58s average down to 38s) for the same VM with the same
workload. In other words, it can be seen as ~35% faster than before.

Test environment: 13G guest, 10G test mem (so I leave 3G untouched),
dirty rate 900MB/s, BW 10Gbps to emulate ixgbe, downtime 100ms.

IO pattern: I pre-fault all of the 10G mem, then do random writes
(with the command "mig_mon mm_dirty 10240 900 random" [1]) over the
test memory at a constant dirty rate (900MB/s, as mentioned). Then I
migrate during the IOs.

Here's the total migration time of such a VM (for each scenario, I
run the migration 5 times and take the average total time):

  |--------------+---------------------+-------------|
  | scenario     | migration times (s) | average (s) |
  |--------------+---------------------+-------------|
  | no CLEAR_LOG | 55, 54, 56, 74, 54  | 58          |
  | 1G chunk     | 40, 39, 41, 39, 40  | 40          |
  | 128M chunk   | 38, 40, 37, 40, 38  | 38          |
  | 16M chunk    | 42, 40, 38, 41, 38  | 39          |
  | 1M chunk     | 37, 40, 36, 40, 39  | 38          |
  |--------------+---------------------+-------------|

"no CLEAR_LOG" is the master branch, which still uses GET_DIRTY only.
The other four scenarios all use the new CLEAR_LOG interface, i.e.,
this series.

The results show that a 128M chunk size seems to be a better default
value than the 1G this series used. I'll adjust that accordingly when
I post the next version.

[1] https://github.com/xzpeter/clibs/blob/master/bsd/mig_mon/mig_mon.c

Regards,