
[v1] migration: skip sending ram pages released by virtio-balloon driver.

Message ID 1457082167-12254-1-git-send-email-jitendra.kolhe@hpe.com (mailing list archive)
State New, archived

Commit Message

Jitendra Kolhe March 4, 2016, 9:02 a.m. UTC
While measuring live migration performance for a qemu/kvm guest, it
was observed that qemu maintains no intelligence about guest ram
pages which have been released by the guest balloon driver, and
treats such pages like any other normal guest ram pages. This has a
direct impact on overall migration time for a guest which has
released (ballooned out) memory to the host.

In the case of SD-X, where we can configure large guests with 1TB of
memory and a considerable amount of memory released by the balloon
driver to the host, the migration time gets worse.

The solution proposed below is local to qemu (and does not require
any modification to the Linux kernel or any guest driver). We have
verified the fix for large (1TB) guests on SD-X; in a case where 90% of
memory is released by the balloon driver, the migration time for an
idle guest reduces from ~1200 secs to ~600 secs.

Further details can be found in patch commit text (below).

During live migration, as part of the 1st iteration, ram_save_iterate()
-> ram_find_and_save_block() will try to migrate ram pages even if they
were released by the virtio-balloon driver as part of dynamic memory delete.
Even though the pages which are returned to the host by the virtio-balloon
driver are zero pages, the migration algorithm will still end up
scanning each entire page, via ram_find_and_save_block() -> ram_save_page/
ram_save_compressed_page -> save_zero_page() -> is_zero_range(). We
also end up sending some control information over the network for these
pages during migration. This adds to the total migration time.

The proposed fix uses the existing bitmap infrastructure to create
a virtio-balloon bitmap. Each bit in the bitmap represents a guest ram
page of size 1UL << VIRTIO_BALLOON_PFN_SHIFT. The bitmap covers the
entire guest ram, up to the maximum configured memory. Guest ram pages
returned to the host by the virtio-balloon driver are represented by 1
in the bitmap. During live migration, each guest ram page (host VA
offset) is checked against the virtio-balloon bitmap; if the bit is
set, the corresponding ram page is excluded from scanning and from
sending control information during migration. The bitmap is also
migrated to the target as part of every ram_save_iterate loop, and
after the guest is stopped the remaining balloon bitmap is migrated as
part of the balloon driver save/load interface.

With the proposed fix, the average migration time for an idle guest
with 1TB maximum memory and 64 vCPUs:
 - reduces from ~1200 secs to ~600 secs, with guest memory ballooned
   down to 128GB (~10% of 1TB);
 - reduces from ~1300 secs to ~1200 secs (~7%), with guest memory
   ballooned down to 896GB (~90% of 1TB);
 - with no ballooning configured, we don’t expect to see any impact
   on total migration time.

The optimization gets temporarily disabled if a balloon operation is
in progress. Since the optimization skips scanning and migrating control
information for ballooned-out pages, we might otherwise skip guest ram
pages in cases where the guest balloon driver has freed a ram page back
to the guest but not yet informed the host/qemu about it
(VIRTIO_BALLOON_F_MUST_TELL_HOST); with the optimization enabled, we
might then skip migrating ram pages which the guest is using. Since this
problem is specific to balloon deflate (leak), the
balloon-operation-in-progress check can be restricted to the leak
operation only.

The optimization also gets permanently disabled (for all subsequent
migrations) in case any of the migrations uses the postcopy capability.
In the case of postcopy, the balloon bitmap would need to be sent after
vm_stop, which has a significant impact on downtime. Moreover, since
applications in the guest won’t actually be faulting on ram pages which
are already ballooned out, the proposed optimization will not show any
improvement in migration time during postcopy.

Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
---
 balloon.c                          | 253 ++++++++++++++++++++++++++++++++++++-
 exec.c                             |   3 +
 hw/virtio/virtio-balloon.c         |  35 ++++-
 include/hw/virtio/virtio-balloon.h |   1 +
 include/migration/migration.h      |   1 +
 include/sysemu/balloon.h           |  15 ++-
 migration/migration.c              |   9 ++
 migration/ram.c                    |  23 +++-
 qapi-schema.json                   |   5 +-
 9 files changed, 337 insertions(+), 8 deletions(-)

Comments

Eric Blake March 7, 2016, 5:05 p.m. UTC | #1
On 03/04/2016 02:02 AM, Jitendra Kolhe wrote:
> While measuring live migration performance for qemu/kvm guest, it
> was observed that the qemu doesn’t maintain any intelligence for the
> guest ram pages which are release by the guest balloon driver and
> treat such pages as any other normal guest ram pages. This has direct
> impact on overall migration time for the guest which has released
> (ballooned out) memory to the host.
> 

> Signed-off-by: Jitendra Kolhe <jitendra.kolhe@hpe.com>
> ---
>  balloon.c                          | 253 ++++++++++++++++++++++++++++++++++++-
>  exec.c                             |   3 +
>  hw/virtio/virtio-balloon.c         |  35 ++++-
>  include/hw/virtio/virtio-balloon.h |   1 +
>  include/migration/migration.h      |   1 +
>  include/sysemu/balloon.h           |  15 ++-
>  migration/migration.c              |   9 ++
>  migration/ram.c                    |  23 +++-
>  qapi-schema.json                   |   5 +-
>  9 files changed, 337 insertions(+), 8 deletions(-)
> 

> +++ b/qapi-schema.json
> @@ -544,11 +544,14 @@
>  #          been migrated, pulling the remaining pages along as needed. NOTE: If
>  #          the migration fails during postcopy the VM will fail.  (since 2.5)
>  #
> +# @skip-balloon: Skip scaning ram pages released by virtio-balloon driver.

s/scaning/scanning/

> +#          (since 2.5)

You've missed 2.5.  In fact, this is borderline between new feature and
bug fix, so you may have even missed 2.6 since soft freeze has already
passed, in which case this should read 2.7.

Does this need to be an option, or should it be unconditionally enabled?
Roman Kagan March 10, 2016, 9:49 a.m. UTC | #2
On Fri, Mar 04, 2016 at 02:32:47PM +0530, Jitendra Kolhe wrote:
> Even though the pages which are returned to the host by virtio-balloon
> driver are zero pages, the migration algorithm will still end up
> scanning the entire page ram_find_and_save_block() -> ram_save_page/
> ram_save_compressed_page -> save_zero_page() -> is_zero_range().  We
> also end-up sending some control information over network for these
> page during migration. This adds to total migration time.

I wonder if it is the scanning for zeros or sending the whiteout which
affects the total migration time more.  If it is the former (as I would
expect) then a rather local change to is_zero_range() to make use of the
mapping information before scanning would get you all the speedups
without protocol changes, interfering with postcopy etc.

Roman.
Jitendra Kolhe March 11, 2016, 5:59 a.m. UTC | #3
On 3/10/2016 3:19 PM, Roman Kagan wrote:
> On Fri, Mar 04, 2016 at 02:32:47PM +0530, Jitendra Kolhe wrote:
>> Even though the pages which are returned to the host by virtio-balloon
>> driver are zero pages, the migration algorithm will still end up
>> scanning the entire page ram_find_and_save_block() -> ram_save_page/
>> ram_save_compressed_page -> save_zero_page() -> is_zero_range().  We
>> also end-up sending some control information over network for these
>> page during migration. This adds to total migration time.
>
> I wonder if it is the scanning for zeros or sending the whiteout which
> affects the total migration time more.  If it is the former (as I would
> expect) then a rather local change to is_zero_range() to make use of the
> mapping information before scanning would get you all the speedups
> without protocol changes, interfering with postcopy etc.
>
> Roman.
>

Localizing the solution to the zero page scan check is a good idea. I
too agree that most of the time is spent in scanning for zero pages, in
which case we should be able to localize the solution to
is_zero_range(). However, in the case of ballooned-out pages (which can
be seen as a subset of guest zero pages) we also spend a very small
portion of total migration time in sending the control information,
which can also be avoided.
From my tests for a 16GB idle guest of which 12GB was ballooned out, the
zero page scan time for the 12GB of ballooned-out pages was ~1789 ms,
and save_page_header + qemu_put_byte(f, 0); for the same 12GB of
ballooned-out pages was ~556 ms. Total migration time was ~8000 ms.
     if (is_zero_range(p, TARGET_PAGE_SIZE)) {
         acct_info.dup_pages++;
         *bytes_transferred += save_page_header(f, block,
                                                offset | RAM_SAVE_FLAG_COMPRESS);
         qemu_put_byte(f, 0);
         *bytes_transferred += 1;
         pages = 1;
     }
Would moving the solution to save_zero_page() be good enough?

Thanks,
- Jitendra
Liang Li March 11, 2016, 7:25 a.m. UTC | #4
> On 3/10/2016 3:19 PM, Roman Kagan wrote:
> > On Fri, Mar 04, 2016 at 02:32:47PM +0530, Jitendra Kolhe wrote:
> >> Even though the pages which are returned to the host by
> >> virtio-balloon driver are zero pages, the migration algorithm will
> >> still end up scanning the entire page ram_find_and_save_block() ->
> >> ram_save_page/ ram_save_compressed_page -> save_zero_page() ->
> >> is_zero_range().  We also end-up sending some control information
> >> over network for these page during migration. This adds to total migration
> time.
> >
> > I wonder if it is the scanning for zeros or sending the whiteout which
> > affects the total migration time more.  If it is the former (as I
> > would
> > expect) then a rather local change to is_zero_range() to make use of
> > the mapping information before scanning would get you all the speedups
> > without protocol changes, interfering with postcopy etc.
> >
> > Roman.
> >
> 
> Localizing the solution to zero page scan check is a good idea. I too agree that
> most of the time is send in scanning for zero page in which case we should be
> able to localize solution to is_zero_range().
> However in case of ballooned out pages (which can be seen as a subset of
> guest zero pages) we also spend a very small portion of total migration time
> in sending the control information, which can be also avoided.
>  From my tests for 16GB idle guest of which 12GB was ballooned out, the
> zero page scan time for 12GB ballooned out pages was ~1789 ms and
> save_page_header + qemu_put_byte(f, 0); for same 12GB ballooned out
> pages was ~556 ms. Total migration time was ~8000 ms

How did you do the tests? ~556 ms seems too long for putting several bytes into the buffer.
It's likely the time you measured contains the portion spent processing the other 4GB of guest memory pages.

Liang
 
>      if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>          acct_info.dup_pages++;
>          *bytes_transferred += save_page_header(f, block,
>                                                 offset | RAM_SAVE_FLAG_COMPRESS);
>          qemu_put_byte(f, 0);
>          *bytes_transferred += 1;
>          pages = 1;
>      }
> Would moving the solution to save_zero_page() be good enough?
> 
> Thanks,
> - Jitendra
Jitendra Kolhe March 11, 2016, 10:20 a.m. UTC | #5
On 3/11/2016 12:55 PM, Li, Liang Z wrote:
>> On 3/10/2016 3:19 PM, Roman Kagan wrote:
>>> On Fri, Mar 04, 2016 at 02:32:47PM +0530, Jitendra Kolhe wrote:
>>>> Even though the pages which are returned to the host by
>>>> virtio-balloon driver are zero pages, the migration algorithm will
>>>> still end up scanning the entire page ram_find_and_save_block() ->
>>>> ram_save_page/ ram_save_compressed_page -> save_zero_page() ->
>>>> is_zero_range().  We also end-up sending some control information
>>>> over network for these page during migration. This adds to total migration
>> time.
>>>
>>> I wonder if it is the scanning for zeros or sending the whiteout which
>>> affects the total migration time more.  If it is the former (as I
>>> would
>>> expect) then a rather local change to is_zero_range() to make use of
>>> the mapping information before scanning would get you all the speedups
>>> without protocol changes, interfering with postcopy etc.
>>>
>>> Roman.
>>>
>>
>> Localizing the solution to zero page scan check is a good idea. I too agree that
>> most of the time is send in scanning for zero page in which case we should be
>> able to localize solution to is_zero_range().
>> However in case of ballooned out pages (which can be seen as a subset of
>> guest zero pages) we also spend a very small portion of total migration time
>> in sending the control information, which can be also avoided.
>>   From my tests for 16GB idle guest of which 12GB was ballooned out, the
>> zero page scan time for 12GB ballooned out pages was ~1789 ms and
>> save_page_header + qemu_put_byte(f, 0); for same 12GB ballooned out
>> pages was ~556 ms. Total migration time was ~8000 ms
>
> How did you do the tests? ~ 556ms seems too long for putting several bytes to the buffer.
> It's likely the time you measured contains the portion to processes the other 4GB guest memory pages.
>
> Liang
>

I modified save_zero_page() as below and updated the timers only for
ballooned-out pages, so is_zero_page() should return true (also
qemu_balloon_bitmap_test() from my patchset returned 1).
With the below instrumentation, I got t1 = ~1789ms and t2 = ~556ms. Also
the total migration time noted (~8000ms) is for the unmodified qemu source.
It seems to add up to the final migration time with the proposed patchset.

Here is the last entry for “another round” of the test; this time it’s ~547ms:
JK: block=7f5417a345e0, offset=3ffe42020, zero_page_scan_time=1218 us,
save_page_header_time=184 us, total_save_zero_page_time=1453 us
cumulated vals: zero_page_scan_time=1723920378 us,
save_page_header_time=547514618 us, total_save_zero_page_time=2371059239 us

static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
                          uint8_t *p, uint64_t *bytes_transferred)
{
    int pages = -1;
    int64_t time1, time2, time3, time4;
    static int64_t t1 = 0, t2 = 0, t3 = 0;

    time1 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
    if (is_zero_range(p, TARGET_PAGE_SIZE)) {
        time2 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
        acct_info.dup_pages++;
        *bytes_transferred += save_page_header(f, block,
                                               offset | RAM_SAVE_FLAG_COMPRESS);
        qemu_put_byte(f, 0);
        time3 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
        *bytes_transferred += 1;
        pages = 1;
    }
    time4 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);

    if (qemu_balloon_bitmap_test(block, offset) == 1) {
        t1 += (time2 - time1);
        t2 += (time3 - time2);
        t3 += (time4 - time1);
        fprintf(stderr, "block=%lx, offset=%lx, zero_page_scan_time=%ld us, "
                        "save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n"
                        "cumulated vals: zero_page_scan_time=%ld us, "
                        "save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n",
                        (unsigned long)block, (unsigned long)offset,
                        (time2 - time1), (time3 - time2), (time4 - time1),
                        t1, t2, t3);
    }
    return pages;
}

Thanks,
- Jitendra

>>       if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>>           acct_info.dup_pages++;
>>           *bytes_transferred += save_page_header(f, block,
>>                                                  offset | RAM_SAVE_FLAG_COMPRESS);
>>           qemu_put_byte(f, 0);
>>           *bytes_transferred += 1;
>>           pages = 1;
>>       }
>> Would moving the solution to save_zero_page() be good enough?
>>
>> Thanks,
>> - Jitendra
>
Liang Li March 11, 2016, 10:54 a.m. UTC | #6
> >>> I wonder if it is the scanning for zeros or sending the whiteout
> >>> which affects the total migration time more.  If it is the former
> >>> (as I would
> >>> expect) then a rather local change to is_zero_range() to make use of
> >>> the mapping information before scanning would get you all the
> >>> speedups without protocol changes, interfering with postcopy etc.
> >>>
> >>> Roman.
> >>>
> >>
> >> Localizing the solution to zero page scan check is a good idea. I too
> >> agree that most of the time is send in scanning for zero page in
> >> which case we should be able to localize solution to is_zero_range().
> >> However in case of ballooned out pages (which can be seen as a subset
> >> of guest zero pages) we also spend a very small portion of total
> >> migration time in sending the control information, which can be also
> avoided.
> >>   From my tests for 16GB idle guest of which 12GB was ballooned out,
> >> the zero page scan time for 12GB ballooned out pages was ~1789 ms and
> >> save_page_header + qemu_put_byte(f, 0); for same 12GB ballooned out
> >> pages was ~556 ms. Total migration time was ~8000 ms
> >
> > How did you do the tests? ~ 556ms seems too long for putting several
> bytes to the buffer.
> > It's likely the time you measured contains the portion to processes the
> other 4GB guest memory pages.
> >
> > Liang
> >
> 
> I modified save_zero_page() as below and updated timers only for ballooned
> out pages so is_zero_page() should return true(also
> qemu_balloon_bitmap_test() from my patchset returned 1) With below
> instrumentation, I got t1 = ~1789ms and t2 = ~556ms. Also the total migration
> time noted (~8000ms) is for unmodified qemu source.

You mean the total live migration time for the unmodified qemu and the 'you modified for test' qemu
are almost the same?

> It seems to addup to final migration time with proposed patchset.
> 
> Here is the last entry for "another round" of test, this time its ~547ms
> JK: block=7f5417a345e0, offset=3ffe42020, zero_page_scan_time=1218 us,
> save_page_header_time=184 us, total_save_zero_page_time=1453 us
> cumulated vals: zero_page_scan_time=1723920378 us,
> save_page_header_time=547514618 us,
> total_save_zero_page_time=2371059239 us
> 
> static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>                            uint8_t *p, uint64_t *bytes_transferred) {
>      int pages = -1;
>      int64_t time1, time2, time3, time4;
>      static int64_t t1 = 0, t2 = 0, t3 = 0;
> 
>      time1 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>      if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>          time2 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>          acct_info.dup_pages++;
>          *bytes_transferred += save_page_header(f, block,
>                                                 offset | RAM_SAVE_FLAG_COMPRESS);
>          qemu_put_byte(f, 0);
>          time3 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>          *bytes_transferred += 1;
>          pages = 1;
>      }
>      time4 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
> 
>      if (qemu_balloon_bitmap_test(block, offset) == 1) {
>          t1 += (time2-time1);
>          t2 += (time3-time2);
>          t3 += (time4-time1);
>          fprintf(stderr, "block=%lx, offset=%lx, zero_page_scan_time=%ld us,
> save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n"
>                          "cumulated vals: zero_page_scan_time=%ld us,
> save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n",
>                           (unsigned long)block, (unsigned long)offset,
>                           (time2-time1), (time3-time2), (time4-time1), t1, t2, t3);
>      }
>      return pages;
> }
> 

Thanks for your description.
The issue here is that there are too many qemu_clock_get_ns() calls; the cost of the function
itself may become the main time-consuming operation. You can measure the time consumed
by the qemu_clock_get_ns() calls you added for the test by comparing the result with a
version which does not add them.

Liang
Jitendra Kolhe March 11, 2016, 2:39 p.m. UTC | #7
On 3/11/2016 4:24 PM, Li, Liang Z wrote:
>>>>> I wonder if it is the scanning for zeros or sending the whiteout
>>>>> which affects the total migration time more.  If it is the former
>>>>> (as I would
>>>>> expect) then a rather local change to is_zero_range() to make use of
>>>>> the mapping information before scanning would get you all the
>>>>> speedups without protocol changes, interfering with postcopy etc.
>>>>>
>>>>> Roman.
>>>>>
>>>>
>>>> Localizing the solution to zero page scan check is a good idea. I too
>>>> agree that most of the time is send in scanning for zero page in
>>>> which case we should be able to localize solution to is_zero_range().
>>>> However in case of ballooned out pages (which can be seen as a subset
>>>> of guest zero pages) we also spend a very small portion of total
>>>> migration time in sending the control information, which can be also
>> avoided.
>>>>    From my tests for 16GB idle guest of which 12GB was ballooned out,
>>>> the zero page scan time for 12GB ballooned out pages was ~1789 ms and
>>>> save_page_header + qemu_put_byte(f, 0); for same 12GB ballooned out
>>>> pages was ~556 ms. Total migration time was ~8000 ms
>>>
>>> How did you do the tests? ~ 556ms seems too long for putting several
>> bytes to the buffer.
>>> It's likely the time you measured contains the portion to processes the
>> other 4GB guest memory pages.
>>>
>>> Liang
>>>
>>
>> I modified save_zero_page() as below and updated timers only for ballooned
>> out pages so is_zero_page() should return true(also
>> qemu_balloon_bitmap_test() from my patchset returned 1) With below
>> instrumentation, I got t1 = ~1789ms and t2 = ~556ms. Also the total migration
>> time noted (~8000ms) is for unmodified qemu source.
>
> You mean the total live migration time for the unmodified qemu and the 'you modified for test' qemu
> are almost the same?
>

Not sure I understand the question, but if 'you modified for test' means
the below modifications to save_zero_page(), then the answer is no. Here
is what I tried; let’s say we have 3 versions of qemu (below timings are
for a 16GB idle guest with 12GB ballooned out):

v1. Unmodified qemu – absolutely no code change – total migration time
= ~7600ms (I rounded this one to ~8000ms).
v2. Modified qemu 1 – with the proposed patch set (which skips both zero
page scan and migrating control information for ballooned-out pages) –
total migration time = ~5700ms.
v3. Modified qemu 2 – only with changes to save_zero_page() as discussed
in the previous mail (and of course using the proposed patch set only to
maintain the bitmap for ballooned-out pages) – total migration time is
irrelevant in this case.
Total zero page scan time = ~1789ms.
Total (save_page_header + qemu_put_byte(f, 0)) = ~556ms.
Everything seems to add up here (may not be exact): 5700 + 1789 + 559 = ~8000ms.

I see 2 factors that we have not considered in this add-up: a. overhead
of migrating the balloon bitmap to the target, and b. as you mentioned
below, the overhead of qemu_clock_get_ns().

>> It seems to addup to final migration time with proposed patchset.
>>
>> Here is the last entry for "another round" of test, this time its ~547ms
>> JK: block=7f5417a345e0, offset=3ffe42020, zero_page_scan_time=1218 us,
>> save_page_header_time=184 us, total_save_zero_page_time=1453 us
>> cumulated vals: zero_page_scan_time=1723920378 us,
>> save_page_header_time=547514618 us,
>> total_save_zero_page_time=2371059239 us
>>
>> static int save_zero_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>>                             uint8_t *p, uint64_t *bytes_transferred) {
>>       int pages = -1;
>>       int64_t time1, time2, time3, time4;
>>       static int64_t t1 = 0, t2 = 0, t3 = 0;
>>
>>       time1 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>       if (is_zero_range(p, TARGET_PAGE_SIZE)) {
>>           time2 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>           acct_info.dup_pages++;
>>           *bytes_transferred += save_page_header(f, block,
>>                                                  offset | RAM_SAVE_FLAG_COMPRESS);
>>           qemu_put_byte(f, 0);
>>           time3 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>           *bytes_transferred += 1;
>>           pages = 1;
>>       }
>>       time4 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
>>
>>       if (qemu_balloon_bitmap_test(block, offset) == 1) {
>>           t1 += (time2-time1);
>>           t2 += (time3-time2);
>>           t3 += (time4-time1);
>>           fprintf(stderr, "block=%lx, offset=%lx, zero_page_scan_time=%ld us,
>> save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n"
>>                           "cumulated vals: zero_page_scan_time=%ld us,
>> save_page_header_time=%ld us, total_save_zero_page_time=%ld us\n",
>>                            (unsigned long)block, (unsigned long)offset,
>>                            (time2-time1), (time3-time2), (time4-time1), t1, t2, t3);
>>       }
>>       return pages;
>> }
>>
>
> Thanks for your  description.
> The issue here is that there are too many qemu_clock_get_ns() call,  the cost of the function
> itself may become the main time consuming operation.  You can measure the time consumed
> by  the qemu_clock_get_ns() you added for test by comparing the result with the version
> which not add the qemu_clock_get_ns().
>
> Liang
>

Yes, we can try to measure the overhead of the qemu_clock_get_ns() calls
and see if things add up perfectly.

Thanks,
- Jitendra
Jitendra Kolhe March 15, 2016, 1:20 p.m. UTC | #8
On 3/11/2016 8:09 PM, Jitendra Kolhe wrote:
>> You mean the total live migration time for the unmodified qemu and the
>> 'you modified for test' qemu
>> are almost the same?
>>
>
> Not sure I understand the question, but if 'you modified for test' means
> below modifications to save_zero_page(), then answer is no. Here is what
> I tried, let’s say we have 3 versions of qemu (below timings are for
> 16GB idle guest with 12GB ballooned out)
>
> v1. Unmodified qemu – absolutely not code change – Total Migration time
> = ~7600ms (I rounded this one to ~8000ms)
> v2. Modified qemu 1 – with proposed patch set (which skips both zero
> pages scan and migrating control information for ballooned out pages) -
> Total Migration time = ~5700ms
> v3. Modified qemu 2 – only with changes to save_zero_page() as discussed
> in previous mail (and of course using proposed patch set only to
> maintain bitmap for ballooned out pages) – Total migration time is
> irrelevant in this case.
> Total Zero page scan time = ~1789ms
> Total (save_page_header + qemu_put_byte(f, 0)) = ~556ms.
> Everything seems to add up here (may not be exact) – 5700+1789+559 =
> ~8000ms
>
> I see 2 factors that we have not considered in this add up a. overhead
> for migrating balloon bitmap to target and b. as you mentioned below
> overhead of qemu_clock_get_ns().

Missed one more factor: testing each page against the balloon bitmap
during migration, which consumes around ~320ms for the same
configuration. If we remove this overhead, which is introduced by the
proposed patch set, from the above calculation, we almost get the total
migration time for unmodified qemu (5700 - 320 + 1789 + 559 = ~7700ms).

Thanks,
- Jitendra
Roman Kagan March 18, 2016, 11:27 a.m. UTC | #9
[ Sorry I've lost this thread with email setup changes on my side;
catching up ]

On Tue, Mar 15, 2016 at 06:50:45PM +0530, Jitendra Kolhe wrote:
> On 3/11/2016 8:09 PM, Jitendra Kolhe wrote:
> > Here is what
> >I tried, let’s say we have 3 versions of qemu (below timings are for
> >16GB idle guest with 12GB ballooned out)
> >
> >v1. Unmodified qemu – absolutely not code change – Total Migration time
> >= ~7600ms (I rounded this one to ~8000ms)
> >v2. Modified qemu 1 – with proposed patch set (which skips both zero
> >pages scan and migrating control information for ballooned out pages) -
> >Total Migration time = ~5700ms
> >v3. Modified qemu 2 – only with changes to save_zero_page() as discussed
> >in previous mail (and of course using proposed patch set only to
> >maintain bitmap for ballooned out pages) – Total migration time is
> >irrelevant in this case.
> >Total Zero page scan time = ~1789ms
> >Total (save_page_header + qemu_put_byte(f, 0)) = ~556ms.
> >Everything seems to add up here (may not be exact) – 5700+1789+559 =
> >~8000ms
> >
> >I see 2 factors that we have not considered in this add up a. overhead
> >for migrating balloon bitmap to target and b. as you mentioned below
> >overhead of qemu_clock_get_ns().
> 
> Missed one more factor of testing each page against balloon bitmap during
> migration, which is consuming around ~320ms for same configuration. If we
> remove this overhead which is introduced by proposed patch set from above
> calculation we almost get total migration time for unmodified qemu
> (5700-320+1789+559=~7700ms)

I'm a bit lost in the numbers you quote, so let me try with
back-of-the-envelope calculation.

First off, the way you identify pages that don't need to be sent is
basically orthogonal to how you optimize the protocol to send them.  So
teaching is_zero_range() to consult unmapped or ballooned out page map
looks like a low-hanging fruit that may benefit the migration time by
avoiding scanning the memory, without protocol changes. [And vice versa,
if sending the zero pages bitmap brought so big benefit it would make
sense to apply it to pages found by scanning, too].

Now regarding the protocol:

 - as a first approximation, let's speak in terms of transferred data
   size

 - consider a VM using 1/10 of its memory (I think this can be
   considered an extreme of over-provisioning)

 - a whiteout is 3 decimal orders smaller than a page, so with zero
   pages replaced by whiteouts (current protocol) the overall
   transferred data size for zero pages is on the order of a percent of
   the total transferred data size

 - zero page bitmap would reduce that further by a couple of orders

So, if this calculation is not totally off, extending the protocol to
use zero page bitmaps is unlikely to give an improvement at more than a
percent level.

I'm not sure it pays off the extra code paths and incompatible protocol
changes...

Roman.
Jitendra Kolhe March 22, 2016, 5:47 a.m. UTC | #10
On 3/18/2016 4:57 PM, Roman Kagan wrote:
> [ Sorry I've lost this thread with email setup changes on my side;
> catching up ]
> 
> On Tue, Mar 15, 2016 at 06:50:45PM +0530, Jitendra Kolhe wrote:
>> On 3/11/2016 8:09 PM, Jitendra Kolhe wrote:
>>> Here is what
>>> I tried, let’s say we have 3 versions of qemu (below timings are for
>>> 16GB idle guest with 12GB ballooned out)
>>>
>>> v1. Unmodified qemu – absolutely not code change – Total Migration time
>>> = ~7600ms (I rounded this one to ~8000ms)
>>> v2. Modified qemu 1 – with proposed patch set (which skips both zero
>>> pages scan and migrating control information for ballooned out pages) -
>>> Total Migration time = ~5700ms
>>> v3. Modified qemu 2 – only with changes to save_zero_page() as discussed
>>> in previous mail (and of course using proposed patch set only to
>>> maintain bitmap for ballooned out pages) – Total migration time is
>>> irrelevant in this case.
>>> Total Zero page scan time = ~1789ms
>>> Total (save_page_header + qemu_put_byte(f, 0)) = ~556ms.
>>> Everything seems to add up here (may not be exact) – 5700+1789+559 =
>>> ~8000ms
>>>
>>> I see 2 factors that we have not considered in this add up a. overhead
>>> for migrating balloon bitmap to target and b. as you mentioned below
>>> overhead of qemu_clock_get_ns().
>>
>> Missed one more factor of testing each page against balloon bitmap during
>> migration, which is consuming around ~320ms for same configuration. If we
>> remove this overhead which is introduced by proposed patch set from above
>> calculation we almost get total migration time for unmodified qemu
>> (5700-320+1789+559=~7700ms)

Thanks for your response. Just to clarify my understanding first: by
"protocol" do you mean saving or sending header or control information
per page during migration?
My response below is based on that assumption.

> 
> I'm a bit lost in the numbers you quote, so let me try with
> back-of-the-envelope calculation.
> 
> First off, the way you identify pages that don't need to be sent is
> basically orthogonal to how you optimize the protocol to send them.  So
> teaching is_zero_range() to consult unmapped or ballooned out page map
> looks like a low-hanging fruit that may benefit the migration time by
> avoiding scanning the memory, without protocol changes. 

Yes, the intention of the proposed patch is not to optimize the existing
protocol, which is used to send control or header information during migration.
Changes to is_zero_range() alone should still show a benefit in migration time.

> [And vice versa,
> if sending the zero pages bitmap brought so big benefit it would make
> sense to apply it to pages found by scanning, too].
> 

I am not sure we would see much benefit from this; with the timings we
are seeing, the cost of testing against a bitmap vs. sending control or
header information is not huge.
With the proposed patch we already spend time testing against the bitmap
to avoid the zero page scan.

> Now regarding the protocol:
> 
>  - as a first approximation, let's speak in terms of transferred data
>    size
> 
>  - consider a VM using 1/10 of its memory (I think this can be
>    considered an extreme of over-provisioning)
> 
>  - a whiteout is 3 decimal orders smaller than a page, so with zero
>    pages replaced by whiteouts (current protocol) the overall
>    transferred data size for zero pages is on the order of a percent of
>    the total transferred data size
> 
>  - zero page bitmap would reduce that further by a couple of orders
> 
> So, if this calculation is not totally off, extending the protocol to
> use zero page bitmaps is unlikely to give an improvement at more than a
> percent level.
> 

I agree that the current protocol has already reduced the total transferred
data size to less than a percent of actually sending the zero pages.
But here we are talking about reducing it even further, by not sending the
control or header information at all.
On my test setup, the average zero page scan time for 12GB of zero pages
is around 1789ms, and the time taken to send the header or control
information for the same 12GB of zero pages is around 559ms, which is
roughly 30% of the zero page scan time.

I think the question here is: should we consider ballooned-out pages to be
guest pages and treat them like any other guest ram pages, expecting the
existing protocol to take care of them, or should we treat them as
non-guest-ram pages, in which case it may be fine to skip the standard
protocol?
Note that the proposed patch is only focused on ballooned-out pages, which
are a subset of the guest zero page set.

> I'm not sure it pays off the extra code paths and incompatible protocol
> changes...
> 
> Roman.
> 

If skipping the control or header information for “only” ballooned-out
pages raises doubts about protocol compatibility, then yes, I agree it is
not worth the gain we see. We can still localize the solution to the
is_zero_range() scan and avoid scanning for zero pages.

Thanks,
- Jitendra
diff mbox

Patch

diff --git a/balloon.c b/balloon.c
index f2ef50c..937c55e 100644
--- a/balloon.c
+++ b/balloon.c
@@ -33,11 +33,33 @@ 
 #include "qmp-commands.h"
 #include "qapi/qmp/qerror.h"
 #include "qapi/qmp/qjson.h"
+#include "migration/migration.h"
+#include "exec/ram_addr.h"
+#include "qemu/typedefs.h"
 
+#define BALLOON_BITMAP_DISABLE_FLAG -1UL
+typedef enum {
+    BALLOON_BITMAP_DISABLE_NONE = 1, /* Enabled */
+    BALLOON_BITMAP_DISABLE_CURRENT,
+    BALLOON_BITMAP_DISABLE_PERNAMENT,
+} BalloonBitmapDisableState;
 static QEMUBalloonEvent *balloon_event_fn;
 static QEMUBalloonStatus *balloon_stat_fn;
+static QEMUBalloonInProgress *balloon_in_progress_fn;
 static void *balloon_opaque;
 static bool balloon_inhibited;
+static unsigned long balloon_bitmap_pages;
+static unsigned int  balloon_bitmap_pfn_shift;
+static unsigned long balloon_min_bitmap_offset;
+static unsigned long balloon_max_bitmap_offset;
+static QemuMutex balloon_bitmap_mutex;
+static BalloonBitmapDisableState balloon_bitmap_disable_state;
+static bool balloon_bitmap_xfered;
+
+static struct BitmapRcu {
+    struct rcu_head rcu;
+    unsigned long *bmap;
+} *balloon_bitmap_rcu;
 
 bool qemu_balloon_is_inhibited(void)
 {
@@ -49,6 +71,21 @@  void qemu_balloon_inhibit(bool state)
     balloon_inhibited = state;
 }
 
+void qemu_mutex_lock_balloon_bitmap(void)
+{
+    qemu_mutex_lock(&balloon_bitmap_mutex);
+}
+
+void qemu_mutex_unlock_balloon_bitmap(void)
+{
+    qemu_mutex_unlock(&balloon_bitmap_mutex);
+}
+
+void qemu_balloon_reset_bitmap_data(void)
+{
+    balloon_bitmap_xfered = false;
+}
+
 static bool have_balloon(Error **errp)
 {
     if (kvm_enabled() && !kvm_has_sync_mmu()) {
@@ -65,9 +102,12 @@  static bool have_balloon(Error **errp)
 }
 
 int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
-                             QEMUBalloonStatus *stat_func, void *opaque)
+                             QEMUBalloonStatus *stat_func,
+                             QEMUBalloonInProgress *in_progress_func,
+                             void *opaque, int pfn_shift)
 {
-    if (balloon_event_fn || balloon_stat_fn || balloon_opaque) {
+    if (balloon_event_fn || balloon_stat_fn ||
+        balloon_in_progress_fn || balloon_opaque) {
         /* We're already registered one balloon handler.  How many can
          * a guest really have?
          */
@@ -75,17 +115,38 @@  int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
     }
     balloon_event_fn = event_func;
     balloon_stat_fn = stat_func;
+    balloon_in_progress_fn = in_progress_func;
     balloon_opaque = opaque;
+
+    qemu_mutex_init(&balloon_bitmap_mutex);
+    balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_NONE;
+    balloon_bitmap_pfn_shift = pfn_shift;
+    balloon_bitmap_pages = (last_ram_offset() >> balloon_bitmap_pfn_shift);
+    balloon_bitmap_rcu = g_new0(struct BitmapRcu, 1);
+    balloon_bitmap_rcu->bmap = bitmap_new(balloon_bitmap_pages);
+    bitmap_clear(balloon_bitmap_rcu->bmap, 0, balloon_bitmap_pages);
     return 0;
 }
 
+static void balloon_bitmap_free(struct BitmapRcu *bmap)
+{
+    g_free(bmap->bmap);
+    g_free(bmap);
+}
+
 void qemu_remove_balloon_handler(void *opaque)
 {
+    struct BitmapRcu *bitmap = balloon_bitmap_rcu;
     if (balloon_opaque != opaque) {
         return;
     }
+    atomic_rcu_set(&balloon_bitmap_rcu, NULL);
+    if (bitmap) {
+        call_rcu(bitmap, balloon_bitmap_free, rcu);
+    }
     balloon_event_fn = NULL;
     balloon_stat_fn = NULL;
+    balloon_in_progress_fn = NULL;
     balloon_opaque = NULL;
 }
 
@@ -116,3 +177,191 @@  void qmp_balloon(int64_t target, Error **errp)
     trace_balloon_event(balloon_opaque, target);
     balloon_event_fn(balloon_opaque, target);
 }
+
+/* Handle Ram hotplug case, only called in case old < new */
+int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new)
+{
+    struct BitmapRcu *old_bitmap = balloon_bitmap_rcu, *bitmap;
+    unsigned long old_offset, new_offset;
+
+    if (!balloon_bitmap_rcu) {
+        return -1;
+    }
+
+    old_offset = (old >> balloon_bitmap_pfn_shift);
+    new_offset = (new >> balloon_bitmap_pfn_shift);
+
+    bitmap = g_new(struct BitmapRcu, 1);
+    bitmap->bmap = bitmap_new(new_offset);
+
+    qemu_mutex_lock_balloon_bitmap();
+    bitmap_clear(bitmap->bmap, 0,
+                 balloon_bitmap_pages + new_offset - old_offset);
+    bitmap_copy(bitmap->bmap, old_bitmap->bmap, old_offset);
+
+    atomic_rcu_set(&balloon_bitmap_rcu, bitmap);
+    balloon_bitmap_pages += new_offset - old_offset;
+    qemu_mutex_unlock_balloon_bitmap();
+    call_rcu(old_bitmap, balloon_bitmap_free, rcu);
+
+    return 0;
+}
+
+/* Should be called with balloon bitmap mutex lock held */
+int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate)
+{
+    unsigned long *bitmap;
+    unsigned long offset = 0;
+
+    if (!balloon_bitmap_rcu) {
+        return -1;
+    }
+
+    offset = (addr >> balloon_bitmap_pfn_shift);
+    if (balloon_bitmap_xfered) {
+        if (offset < balloon_min_bitmap_offset) {
+            balloon_min_bitmap_offset = offset;
+        }
+        if (offset > balloon_max_bitmap_offset) {
+            balloon_max_bitmap_offset = offset;
+        }
+    }
+
+    rcu_read_lock();
+    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
+    if (deflate == 0) {
+        set_bit(offset, bitmap);
+    } else {
+        clear_bit(offset, bitmap);
+    }
+    rcu_read_unlock();
+    return 0;
+}
+
+void qemu_balloon_bitmap_setup(void)
+{
+    if (migrate_postcopy_ram()) {
+        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERNAMENT;
+    } else if ((!balloon_bitmap_rcu || !migrate_skip_balloon()) &&
+               (balloon_bitmap_disable_state !=
+                BALLOON_BITMAP_DISABLE_PERNAMENT)) {
+        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_CURRENT;
+    }
+}
+
+int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr)
+{
+    unsigned long *bitmap;
+    ram_addr_t base;
+    unsigned long nr = 0;
+    int ret = 0;
+
+    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_CURRENT ||
+        balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERNAMENT) {
+        return 0;
+    }
+    balloon_in_progress_fn(balloon_opaque, &ret);
+    if (ret == 1) {
+        return 0;
+    }
+
+    rcu_read_lock();
+    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
+    base = rb->offset >> balloon_bitmap_pfn_shift;
+    nr = base + (addr >> balloon_bitmap_pfn_shift);
+    if (test_bit(nr, bitmap)) {
+        ret = 1;
+    }
+    rcu_read_unlock();
+    return ret;
+}
+
+int qemu_balloon_bitmap_save(QEMUFile *f)
+{
+    unsigned long *bitmap;
+    unsigned long offset = 0, next = 0, len = 0;
+    unsigned long tmpoffset = 0, tmplimit = 0;
+
+    if (balloon_bitmap_disable_state == BALLOON_BITMAP_DISABLE_PERNAMENT) {
+        qemu_put_be64(f, BALLOON_BITMAP_DISABLE_FLAG);
+        return 0;
+    }
+
+    qemu_mutex_lock_balloon_bitmap();
+    if (balloon_bitmap_xfered) {
+        tmpoffset = balloon_min_bitmap_offset;
+        tmplimit  = balloon_max_bitmap_offset;
+    } else {
+        balloon_bitmap_xfered = true;
+        tmpoffset = offset;
+        tmplimit  = balloon_bitmap_pages;
+    }
+
+    balloon_min_bitmap_offset = balloon_bitmap_pages;
+    balloon_max_bitmap_offset = 0;
+
+    qemu_put_be64(f, balloon_bitmap_pages);
+    qemu_put_be64(f, tmpoffset);
+    qemu_put_be64(f, tmplimit);
+    rcu_read_lock();
+    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
+    while (tmpoffset < tmplimit) {
+        unsigned long next_set_bit, start_set_bit;
+        next_set_bit = find_next_bit(bitmap, balloon_bitmap_pages, tmpoffset);
+        start_set_bit = next_set_bit;
+        if (next_set_bit == balloon_bitmap_pages) {
+            len = 0;
+            next = start_set_bit;
+            qemu_put_be64(f, next);
+            qemu_put_be64(f, len);
+            break;
+        }
+        next_set_bit = find_next_zero_bit(bitmap,
+                                          balloon_bitmap_pages,
+                                          ++next_set_bit);
+        len = (next_set_bit - start_set_bit);
+        next = start_set_bit;
+        qemu_put_be64(f, next);
+        qemu_put_be64(f, len);
+        tmpoffset = next + len;
+    }
+    rcu_read_unlock();
+    qemu_mutex_unlock_balloon_bitmap();
+    return 0;
+}
+
+int qemu_balloon_bitmap_load(QEMUFile *f)
+{
+    unsigned long *bitmap;
+    unsigned long next = 0, len = 0;
+    unsigned long tmpoffset = 0, tmplimit = 0;
+
+    if (!balloon_bitmap_rcu) {
+        return -1;
+    }
+
+    qemu_mutex_lock_balloon_bitmap();
+    balloon_bitmap_pages = qemu_get_be64(f);
+    if (balloon_bitmap_pages == BALLOON_BITMAP_DISABLE_FLAG) {
+        balloon_bitmap_disable_state = BALLOON_BITMAP_DISABLE_PERNAMENT;
+        qemu_mutex_unlock_balloon_bitmap();
+        return 0;
+    }
+    tmpoffset = qemu_get_be64(f);
+    tmplimit  = qemu_get_be64(f);
+    rcu_read_lock();
+    bitmap = atomic_rcu_read(&balloon_bitmap_rcu)->bmap;
+    while (tmpoffset < tmplimit) {
+        next = qemu_get_be64(f);
+        len  = qemu_get_be64(f);
+        if (len == 0) {
+            break;
+        }
+        bitmap_set(bitmap, next, len);
+        tmpoffset = next + len;
+    }
+    rcu_read_unlock();
+    qemu_mutex_unlock_balloon_bitmap();
+    return 0;
+}
+
diff --git a/exec.c b/exec.c
index c62c439..42144ba 100644
--- a/exec.c
+++ b/exec.c
@@ -58,6 +58,7 @@ 
 #ifndef _WIN32
 #include "qemu/mmap-alloc.h"
 #endif
+#include "sysemu/balloon.h"
 
 //#define DEBUG_SUBPAGE
 
@@ -1594,6 +1595,8 @@  static ram_addr_t ram_block_add(RAMBlock *new_block, Error **errp)
     if (new_ram_size > old_ram_size) {
         migration_bitmap_extend(old_ram_size, new_ram_size);
         dirty_memory_extend(old_ram_size, new_ram_size);
+        qemu_balloon_bitmap_extend(old_ram_size << TARGET_PAGE_BITS,
+                                   new_ram_size << TARGET_PAGE_BITS);
     }
     /* Keep the list sorted from biggest to smallest block.  Unlike QTAILQ,
      * QLIST (which has an RCU-friendly variant) does not have insertion at
diff --git a/hw/virtio/virtio-balloon.c b/hw/virtio/virtio-balloon.c
index e9c30e9..e3011f8 100644
--- a/hw/virtio/virtio-balloon.c
+++ b/hw/virtio/virtio-balloon.c
@@ -27,6 +27,7 @@ 
 #include "qapi/visitor.h"
 #include "qapi-event.h"
 #include "trace.h"
+#include "migration/migration.h"
 
 #if defined(__linux__)
 #include <sys/mman.h>
@@ -213,11 +214,13 @@  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
     VirtQueueElement *elem;
     MemoryRegionSection section;
 
+    qemu_mutex_lock_balloon_bitmap();
     for (;;) {
         size_t offset = 0;
         uint32_t pfn;
         elem = virtqueue_pop(vq, sizeof(VirtQueueElement));
         if (!elem) {
+            qemu_mutex_unlock_balloon_bitmap();
             return;
         }
 
@@ -241,6 +244,7 @@  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
             addr = section.offset_within_region;
             balloon_page(memory_region_get_ram_ptr(section.mr) + addr,
                          !!(vq == s->dvq));
+            qemu_balloon_bitmap_update(addr, !!(vq == s->dvq));
             memory_region_unref(section.mr);
         }
 
@@ -248,6 +252,7 @@  static void virtio_balloon_handle_output(VirtIODevice *vdev, VirtQueue *vq)
         virtio_notify(vdev, vq);
         g_free(elem);
     }
+    qemu_mutex_unlock_balloon_bitmap();
 }
 
 static void virtio_balloon_receive_stats(VirtIODevice *vdev, VirtQueue *vq)
@@ -293,6 +298,16 @@  out:
     }
 }
 
+static void virtio_balloon_migration_state_changed(Notifier *notifier,
+                                                   void *data)
+{
+    MigrationState *mig = data;
+
+    if (migration_has_failed(mig)) {
+        qemu_balloon_reset_bitmap_data();
+    }
+}
+
 static void virtio_balloon_get_config(VirtIODevice *vdev, uint8_t *config_data)
 {
     VirtIOBalloon *dev = VIRTIO_BALLOON(vdev);
@@ -372,6 +387,16 @@  static void virtio_balloon_stat(void *opaque, BalloonInfo *info)
                                              VIRTIO_BALLOON_PFN_SHIFT);
 }
 
+static void virtio_balloon_in_progress(void *opaque, int *status)
+{
+    VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
+    if (cpu_to_le32(dev->actual) != cpu_to_le32(dev->num_pages)) {
+        *status = 1;
+        return;
+    }
+    *status = 0;
+}
+
 static void virtio_balloon_to_target(void *opaque, ram_addr_t target)
 {
     VirtIOBalloon *dev = VIRTIO_BALLOON(opaque);
@@ -399,6 +424,7 @@  static void virtio_balloon_save_device(VirtIODevice *vdev, QEMUFile *f)
 
     qemu_put_be32(f, s->num_pages);
     qemu_put_be32(f, s->actual);
+    qemu_balloon_bitmap_save(f);
 }
 
 static int virtio_balloon_load(QEMUFile *f, void *opaque, int version_id)
@@ -416,6 +442,7 @@  static int virtio_balloon_load_device(VirtIODevice *vdev, QEMUFile *f,
 
     s->num_pages = qemu_get_be32(f);
     s->actual = qemu_get_be32(f);
+    qemu_balloon_bitmap_load(f);
     return 0;
 }
 
@@ -429,7 +456,9 @@  static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
                 sizeof(struct virtio_balloon_config));
 
     ret = qemu_add_balloon_handler(virtio_balloon_to_target,
-                                   virtio_balloon_stat, s);
+                                   virtio_balloon_stat,
+                                   virtio_balloon_in_progress, s,
+                                   VIRTIO_BALLOON_PFN_SHIFT);
 
     if (ret < 0) {
         error_setg(errp, "Only one balloon device is supported");
@@ -443,6 +472,9 @@  static void virtio_balloon_device_realize(DeviceState *dev, Error **errp)
 
     reset_stats(s);
 
+    s->migration_state_notifier.notify = virtio_balloon_migration_state_changed;
+    add_migration_state_change_notifier(&s->migration_state_notifier);
+
     register_savevm(dev, "virtio-balloon", -1, 1,
                     virtio_balloon_save, virtio_balloon_load, s);
 }
@@ -452,6 +484,7 @@  static void virtio_balloon_device_unrealize(DeviceState *dev, Error **errp)
     VirtIODevice *vdev = VIRTIO_DEVICE(dev);
     VirtIOBalloon *s = VIRTIO_BALLOON(dev);
 
+    remove_migration_state_change_notifier(&s->migration_state_notifier);
     balloon_stats_destroy_timer(s);
     qemu_remove_balloon_handler(s);
     unregister_savevm(dev, "virtio-balloon", s);
diff --git a/include/hw/virtio/virtio-balloon.h b/include/hw/virtio/virtio-balloon.h
index 35f62ac..1ded5a9 100644
--- a/include/hw/virtio/virtio-balloon.h
+++ b/include/hw/virtio/virtio-balloon.h
@@ -43,6 +43,7 @@  typedef struct VirtIOBalloon {
     int64_t stats_last_update;
     int64_t stats_poll_interval;
     uint32_t host_features;
+    Notifier migration_state_notifier;
 } VirtIOBalloon;
 
 #endif
diff --git a/include/migration/migration.h b/include/migration/migration.h
index ac2c12c..6c1d1af 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -267,6 +267,7 @@  void migrate_del_blocker(Error *reason);
 
 bool migrate_postcopy_ram(void);
 bool migrate_zero_blocks(void);
+bool migrate_skip_balloon(void);
 
 bool migrate_auto_converge(void);
 
diff --git a/include/sysemu/balloon.h b/include/sysemu/balloon.h
index 3f976b4..5325c38 100644
--- a/include/sysemu/balloon.h
+++ b/include/sysemu/balloon.h
@@ -15,14 +15,27 @@ 
 #define _QEMU_BALLOON_H
 
 #include "qapi-types.h"
+#include "migration/qemu-file.h"
 
 typedef void (QEMUBalloonEvent)(void *opaque, ram_addr_t target);
 typedef void (QEMUBalloonStatus)(void *opaque, BalloonInfo *info);
+typedef void (QEMUBalloonInProgress) (void *opaque, int *status);
 
 int qemu_add_balloon_handler(QEMUBalloonEvent *event_func,
-			     QEMUBalloonStatus *stat_func, void *opaque);
+                             QEMUBalloonStatus *stat_func,
+                             QEMUBalloonInProgress *progress_func,
+                             void *opaque, int pfn_shift);
 void qemu_remove_balloon_handler(void *opaque);
 bool qemu_balloon_is_inhibited(void);
 void qemu_balloon_inhibit(bool state);
+void qemu_mutex_lock_balloon_bitmap(void);
+void qemu_mutex_unlock_balloon_bitmap(void);
+void qemu_balloon_reset_bitmap_data(void);
+void qemu_balloon_bitmap_setup(void);
+int qemu_balloon_bitmap_extend(ram_addr_t old, ram_addr_t new);
+int qemu_balloon_bitmap_update(ram_addr_t addr, int deflate);
+int qemu_balloon_bitmap_test(RAMBlock *rb, ram_addr_t addr);
+int qemu_balloon_bitmap_save(QEMUFile *f);
+int qemu_balloon_bitmap_load(QEMUFile *f);
 
 #endif
diff --git a/migration/migration.c b/migration/migration.c
index 0129d9f..2dd9fa0 100644
--- a/migration/migration.c
+++ b/migration/migration.c
@@ -1199,6 +1199,15 @@  int migrate_use_xbzrle(void)
     return s->enabled_capabilities[MIGRATION_CAPABILITY_XBZRLE];
 }
 
+bool migrate_skip_balloon(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_SKIP_BALLOON];
+}
+
 int64_t migrate_xbzrle_cache_size(void)
 {
     MigrationState *s;
diff --git a/migration/ram.c b/migration/ram.c
index 704f6a9..f18725a 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -40,6 +40,7 @@ 
 #include "trace.h"
 #include "exec/ram_addr.h"
 #include "qemu/rcu_queue.h"
+#include "sysemu/balloon.h"
 
 #ifdef DEBUG_MIGRATION_RAM
 #define DPRINTF(fmt, ...) \
@@ -65,6 +66,7 @@  static uint64_t bitmap_sync_count;
 #define RAM_SAVE_FLAG_XBZRLE   0x40
 /* 0x80 is reserved in migration.h start with 0x100 next */
 #define RAM_SAVE_FLAG_COMPRESS_PAGE    0x100
+#define RAM_SAVE_FLAG_BALLOON  0x200
 
 static const uint8_t ZERO_TARGET_PAGE[TARGET_PAGE_SIZE];
 
@@ -1355,9 +1357,16 @@  static int ram_find_and_save_block(QEMUFile *f, bool last_stage,
         }
 
         if (found) {
-            pages = ram_save_host_page(ms, f, &pss,
-                                       last_stage, bytes_transferred,
-                                       dirty_ram_abs);
+            /* skip saving ram host page if the corresponding guest page
+             * is ballooned out
+             */
+            if (qemu_balloon_bitmap_test(pss.block, pss.offset) != 1) {
+                pages = ram_save_host_page(ms, f, &pss,
+                                           last_stage, bytes_transferred,
+                                           dirty_ram_abs);
+            } else {
+                migration_bitmap_clear_dirty(dirty_ram_abs);
+            }
         }
     } while (!pages && again);
 
@@ -1959,6 +1968,7 @@  static int ram_save_setup(QEMUFile *f, void *opaque)
 
     rcu_read_unlock();
 
+    qemu_balloon_bitmap_setup();
     ram_control_before_iterate(f, RAM_CONTROL_SETUP);
     ram_control_after_iterate(f, RAM_CONTROL_SETUP);
 
@@ -1984,6 +1994,9 @@  static int ram_save_iterate(QEMUFile *f, void *opaque)
 
     ram_control_before_iterate(f, RAM_CONTROL_ROUND);
 
+    qemu_put_be64(f, RAM_SAVE_FLAG_BALLOON);
+    qemu_balloon_bitmap_save(f);
+
     t0 = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
     i = 0;
     while ((ret = qemu_file_rate_limit(f)) == 0) {
@@ -2493,6 +2506,10 @@  static int ram_load(QEMUFile *f, void *opaque, int version_id)
             }
             break;
 
+        case RAM_SAVE_FLAG_BALLOON:
+            qemu_balloon_bitmap_load(f);
+            break;
+
         case RAM_SAVE_FLAG_COMPRESS:
             ch = qemu_get_byte(f);
             ram_handle_compressed(host, ch, TARGET_PAGE_SIZE);
diff --git a/qapi-schema.json b/qapi-schema.json
index 7b8f2a1..1b8111c 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -544,11 +544,14 @@ 
 #          been migrated, pulling the remaining pages along as needed. NOTE: If
 #          the migration fails during postcopy the VM will fail.  (since 2.5)
 #
+# @skip-balloon: Skip scanning ram pages released by virtio-balloon driver.
+#          (since 2.5)
+#
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
   'data': ['xbzrle', 'rdma-pin-all', 'auto-converge', 'zero-blocks',
-           'compress', 'events', 'x-postcopy-ram'] }
+           'compress', 'events', 'x-postcopy-ram', 'skip-balloon'] }
 
 ##
 # @MigrationCapabilityStatus