
[1/8] migration: stop compressing page in migration thread

Message ID 20180313075739.11194-2-xiaoguangrong@tencent.com (mailing list archive)
State New, archived

Commit Message

Xiao Guangrong March 13, 2018, 7:57 a.m. UTC
From: Xiao Guangrong <xiaoguangrong@tencent.com>

As compression is heavy work, do not do it in the migration thread;
instead, post the page out as a normal page

Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
---
 migration/ram.c | 32 ++++++++++++++++----------------
 1 file changed, 16 insertions(+), 16 deletions(-)

Comments

Dr. David Alan Gilbert March 15, 2018, 10:25 a.m. UTC | #1
* guangrong.xiao@gmail.com (guangrong.xiao@gmail.com) wrote:
> From: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> As compression is a heavy work, do not do it in migration thread,
> instead, we post it out as a normal page
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> ---
>  migration/ram.c | 32 ++++++++++++++++----------------

Hi,
  Do you have some performance numbers to show this helps?  Were those
taken on a normal system or were they taken with one of the compression
accelerators (which I think the compression migration was designed for)?

>  1 file changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c
> index 7266351fd0..615693f180 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1132,7 +1132,7 @@ static int ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss,
>      int pages = -1;
>      uint64_t bytes_xmit = 0;
>      uint8_t *p;
> -    int ret, blen;
> +    int ret;
>      RAMBlock *block = pss->block;
>      ram_addr_t offset = pss->page << TARGET_PAGE_BITS;
>  
> @@ -1162,23 +1162,23 @@ static int ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss,
>          if (block != rs->last_sent_block) {
>              flush_compressed_data(rs);
>              pages = save_zero_page(rs, block, offset);
> -            if (pages == -1) {
> -                /* Make sure the first page is sent out before other pages */
> -                bytes_xmit = save_page_header(rs, rs->f, block, offset |
> -                                              RAM_SAVE_FLAG_COMPRESS_PAGE);
> -                blen = qemu_put_compression_data(rs->f, p, TARGET_PAGE_SIZE,
> -                                                 migrate_compress_level());
> -                if (blen > 0) {
> -                    ram_counters.transferred += bytes_xmit + blen;
> -                    ram_counters.normal++;
> -                    pages = 1;
> -                } else {
> -                    qemu_file_set_error(rs->f, blen);
> -                    error_report("compressed data failed!");
> -                }
> -            }
>              if (pages > 0) {
>                  ram_release_pages(block->idstr, offset, pages);
> +            } else {
> +                /*
> +                 * Make sure the first page is sent out before other pages.
> +                 *
> +                 * we post it as normal page as compression will take much
> +                 * CPU resource.
> +                 */
> +                ram_counters.transferred += save_page_header(rs, rs->f, block,
> +                                                offset | RAM_SAVE_FLAG_PAGE);
> +                qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE,
> +                                      migrate_release_ram() &
> +                                      migration_in_postcopy());
> +                ram_counters.transferred += TARGET_PAGE_SIZE;
> +                ram_counters.normal++;
> +                pages = 1;


However, the code and idea look OK, so

Reviewed-by: Dr. David Alan Gilbert <dgilbert@redhat.com>

>              }
>          } else {
>              pages = save_zero_page(rs, block, offset);
> -- 
> 2.14.3
> 
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Xiao Guangrong March 16, 2018, 8:05 a.m. UTC | #2
Hi David,

Thanks for your review.

On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote:

>>   migration/ram.c | 32 ++++++++++++++++----------------
> 
> Hi,
>    Do you have some performance numbers to show this helps?  Were those
> taken on a normal system or were they taken with one of the compression
> accelerators (which I think the compression migration was designed for)?

Yes, I have tested it on my desktop (i7-4790 + 16G) by locally live migrating
a VM which has 8 vCPUs + 6G memory, with the max-bandwidth limited to 350.

During the migration, a workload with 8 threads repeatedly writes the whole
6G of memory in the VM. Before this patchset its bandwidth is ~25 mbps; after
applying, the bandwidth is ~50 mbps.

BTW, compression will use almost all of the available bandwidth after the
rest of our work, which I will post out part by part.
Dr. David Alan Gilbert March 19, 2018, 12:11 p.m. UTC | #3
* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> Hi David,
> 
> Thanks for your review.
> 
> On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote:
> 
> > >   migration/ram.c | 32 ++++++++++++++++----------------
> > 
> > Hi,
> >    Do you have some performance numbers to show this helps?  Were those
> > taken on a normal system or were they taken with one of the compression
> > accelerators (which I think the compression migration was designed for)?
> 
> Yes, i have tested it on my desktop, i7-4790 + 16G, by locally live migrate
> the VM which has 8 vCPUs + 6G memory and the max-bandwidth is limited to 350.
> 
> During the migration, a workload which has 8 threads repeatedly written total
> 6G memory in the VM. Before this patchset, its bandwidth is ~25 mbps, after
> applying, the bandwidth is ~50 mbps.

OK, that's good - worth adding those notes to your cover letter.
I wonder how well it works with compression acceleration hardware; I
can't see anything in this series making it worse.

> BTW, Compression will use almost all valid bandwidth after all of our work
> which i will post it out part by part.

Oh, that will be very nice.

Dave

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Peter Xu March 21, 2018, 8:19 a.m. UTC | #4
On Fri, Mar 16, 2018 at 04:05:14PM +0800, Xiao Guangrong wrote:
> 
> Hi David,
> 
> Thanks for your review.
> 
> On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote:
> 
> > >   migration/ram.c | 32 ++++++++++++++++----------------
> > 
> > Hi,
> >    Do you have some performance numbers to show this helps?  Were those
> > taken on a normal system or were they taken with one of the compression
> > accelerators (which I think the compression migration was designed for)?
> 
> Yes, i have tested it on my desktop, i7-4790 + 16G, by locally live migrate
> the VM which has 8 vCPUs + 6G memory and the max-bandwidth is limited to 350.
> 
> During the migration, a workload which has 8 threads repeatedly written total
> 6G memory in the VM. Before this patchset, its bandwidth is ~25 mbps, after
> applying, the bandwidth is ~50 mbps.

Hi, Guangrong,

Not really review comments, but I got some questions. :)

IIUC this patch only changes the behavior when last_sent_block
changes.  I see that the performance is doubled after the change,
which is really promising.  However, I don't fully understand why it
brings such a big difference, considering that IMHO the current code
sends dirty pages per-RAMBlock.  I mean, IMHO last_sent_block should
not change frequently?  Or am I wrong?

Another follow-up question: have you measured how long it takes to
compress a 4k page, and how long to send it?  I think "sending the
page" is not really meaningful considering that we just put the page
into a buffer (which should be extremely fast since we don't really
flush it every time); however, I would be curious how slow compressing
a page is.
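One rough way to answer this is a micro-benchmark that times many compressions of a single 4 KiB page. Below is a hedged sketch: to keep it dependency-free it uses a trivial run-length encoder as a stand-in for zlib (so the absolute numbers mean little), and all names are invented:

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define PAGE_SIZE 4096

/* Toy run-length encoder: emits (count, byte) pairs; worst case 2x input.
 * A stand-in for qemu_put_compression_data()/zlib, just to have work to time. */
size_t rle_compress(const uint8_t *in, size_t len, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < len; ) {
        uint8_t b = in[i];
        size_t run = 1;
        while (i + run < len && in[i + run] == b && run < 255) {
            run++;
        }
        out[o++] = (uint8_t)run;
        out[o++] = b;
        i += run;
    }
    return o;
}

/* Compress the same page n times and return the mean microseconds per page. */
double time_compress_us(const uint8_t *page, int n)
{
    static uint8_t out[2 * PAGE_SIZE];
    volatile size_t sink = 0;       /* keep the work from being optimized out */
    clock_t start = clock();
    for (int i = 0; i < n; i++) {
        sink += rle_compress(page, PAGE_SIZE, out);
    }
    clock_t end = clock();
    (void)sink;
    return (double)(end - start) * 1e6 / CLOCKS_PER_SEC / n;
}
```

Replacing the RLE stand-in with a real zlib call at `migrate_compress_level()` would give the number asked about here; comparing it against the time for `qemu_put_buffer_async()` (basically a copy into the stream buffer) would quantify the gap.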

Thanks,

> 
> BTW, Compression will use almost all valid bandwidth after all of our work
> which i will post it out part by part.
>
Xiao Guangrong March 22, 2018, 11:38 a.m. UTC | #5
On 03/21/2018 04:19 PM, Peter Xu wrote:
> On Fri, Mar 16, 2018 at 04:05:14PM +0800, Xiao Guangrong wrote:
>>
>> Hi David,
>>
>> Thanks for your review.
>>
>> On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote:
>>
>>>>    migration/ram.c | 32 ++++++++++++++++----------------
>>>
>>> Hi,
>>>     Do you have some performance numbers to show this helps?  Were those
>>> taken on a normal system or were they taken with one of the compression
>>> accelerators (which I think the compression migration was designed for)?
>>
>> Yes, i have tested it on my desktop, i7-4790 + 16G, by locally live migrate
>> the VM which has 8 vCPUs + 6G memory and the max-bandwidth is limited to 350.
>>
>> During the migration, a workload which has 8 threads repeatedly written total
>> 6G memory in the VM. Before this patchset, its bandwidth is ~25 mbps, after
>> applying, the bandwidth is ~50 mbps.
> 
> Hi, Guangrong,
> 
> Not really review comments, but I got some questions. :)

Your comments are always valuable to me! :)

> 
> IIUC this patch will only change the behavior when last_sent_block
> changed.  I see that the performance is doubled after the change,
> which is really promising.  However I don't fully understand why it
> brings such a big difference considering that IMHO current code is
> sending dirty pages per-RAMBlock.  I mean, IMHO last_sent_block should
> not change frequently?  Or am I wrong?

It depends on the configuration: each memory region which is RAM- or
file-backed has a RAMBlock.

Actually, more of the benefit comes from the fact that the performance &
throughput of the multiple compression threads has improved, as the
threads are fed by the migration thread and the result is consumed by
the migration thread.

> 
> Another follow-up question would be: have you measured how long time
> needed to compress a 4k page, and how many time to send it?  I think
> "sending the page" is not really meaningful considering that we just
> put a page into the buffer (which should be extremely fast since we
> don't really flush it every time), however I would be curious on how
> slow would compressing a page be.

I haven't benchmarked the performance of zlib; I think it is a
CPU-intensive workload, particularly as there is no compression
accelerator (e.g., QAT) on our production systems. BTW, we were using
lzo instead of zlib, which worked better for some workloads.

Putting a page into the buffer depends on the network, i.e., if the
network is congested it can take a long time. :)
Peter Xu March 26, 2018, 9:02 a.m. UTC | #6
On Thu, Mar 22, 2018 at 07:38:07PM +0800, Xiao Guangrong wrote:
> 
> 
> On 03/21/2018 04:19 PM, Peter Xu wrote:
> > On Fri, Mar 16, 2018 at 04:05:14PM +0800, Xiao Guangrong wrote:
> > > 
> > > Hi David,
> > > 
> > > Thanks for your review.
> > > 
> > > On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote:
> > > 
> > > > >    migration/ram.c | 32 ++++++++++++++++----------------
> > > > 
> > > > Hi,
> > > >     Do you have some performance numbers to show this helps?  Were those
> > > > taken on a normal system or were they taken with one of the compression
> > > > accelerators (which I think the compression migration was designed for)?
> > > 
> > > Yes, i have tested it on my desktop, i7-4790 + 16G, by locally live migrate
> > > the VM which has 8 vCPUs + 6G memory and the max-bandwidth is limited to 350.
> > > 
> > > During the migration, a workload which has 8 threads repeatedly written total
> > > 6G memory in the VM. Before this patchset, its bandwidth is ~25 mbps, after
> > > applying, the bandwidth is ~50 mbps.
> > 
> > Hi, Guangrong,
> > 
> > Not really review comments, but I got some questions. :)
> 
> Your comments are always valuable to me! :)
> 
> > 
> > IIUC this patch will only change the behavior when last_sent_block
> > changed.  I see that the performance is doubled after the change,
> > which is really promising.  However I don't fully understand why it
> > brings such a big difference considering that IMHO current code is
> > sending dirty pages per-RAMBlock.  I mean, IMHO last_sent_block should
> > not change frequently?  Or am I wrong?
> 
> It's depends on the configuration, each memory-region which is ram or
> file backend has a RAMBlock.
> 
> Actually, more benefits comes from the fact that the performance & throughput
> of the multithreads has been improved as the threads is fed by the
> migration thread and the result is consumed by the migration
> thread.

I'm not sure whether I got your points - I think you mean that the
compression threads and the migration thread can form a better
pipeline if the migration thread does not do any compression at all.

I think I agree with that.

However, it does not really explain to me why a very rare event
(sending the first page of a RAMBlock, considering that bitmap sync is
rare) can greatly affect the performance (it shows a doubled boost).

Btw, about the numbers: IMHO they might not be really "true numbers".
Or say, even if the bandwidth is doubled, IMHO it does not mean the
performance is doubled, because the data has changed.

Previously there were only compressed pages, and now for each cycle of
the RAMBlock loop we'll send a normal page (then we'll have more to
send).  So IMHO we don't really know whether we sent more pages with
this patch; we only know that we sent more bytes (e.g., an extreme case
is that the extra 25 mbps is entirely caused by those normal pages, and
we could be sending exactly the same number of pages as before, or even
fewer).
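This caveat can be made concrete with back-of-envelope arithmetic (a sketch; the bandwidths and compression ratio below are invented for illustration):

```c
/* Pages per second that fit into a given byte budget, for a mix of
 * normal (uncompressed) and compressed pages.  Illustrative only;
 * headers and flushing overhead are ignored. */
double pages_per_sec(double bytes_per_sec, double normal_frac,
                     double page_size, double comp_ratio)
{
    /* average bytes on the wire per page sent */
    double avg_bytes = normal_frac * page_size
                     + (1.0 - normal_frac) * page_size * comp_ratio;
    return bytes_per_sec / avg_bytes;
}
```

With made-up numbers — a 10:1 compression ratio, 25 mbps (~3.1 MB/s) of purely compressed pages versus 50 mbps with 30% normal pages — the first case moves ~7600 pages/s and the second only ~4100 pages/s: doubled bandwidth, yet fewer pages, which is exactly the ambiguity raised here.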

> 
> > 
> > Another follow-up question would be: have you measured how long time
> > needed to compress a 4k page, and how many time to send it?  I think
> > "sending the page" is not really meaningful considering that we just
> > put a page into the buffer (which should be extremely fast since we
> > don't really flush it every time), however I would be curious on how
> > slow would compressing a page be.
> 
> I haven't benchmark the performance of zlib, i think it is CPU intensive
> workload, particularly, there no compression-accelerator (e.g, QAT) on
> our production. BTW, we were using lzo instead of zlib which worked
> better for some workload.

Never mind. Good to know about that.

> 
> Putting a page into buffer should depend on the network, i,e, if the
> network is congested it should take long time. :)

Again, considering that I don't know much about compression (especially
since I have hardly used it), mine are only questions, which should not
block your patches from being queued/merged/reposted when proper. :)

Thanks,
Xiao Guangrong March 26, 2018, 3:43 p.m. UTC | #7
On 03/26/2018 05:02 PM, Peter Xu wrote:
> On Thu, Mar 22, 2018 at 07:38:07PM +0800, Xiao Guangrong wrote:
>>
>>
>> On 03/21/2018 04:19 PM, Peter Xu wrote:
>>> On Fri, Mar 16, 2018 at 04:05:14PM +0800, Xiao Guangrong wrote:
>>>>
>>>> Hi David,
>>>>
>>>> Thanks for your review.
>>>>
>>>> On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote:
>>>>
>>>>>>     migration/ram.c | 32 ++++++++++++++++----------------
>>>>>
>>>>> Hi,
>>>>>      Do you have some performance numbers to show this helps?  Were those
>>>>> taken on a normal system or were they taken with one of the compression
>>>>> accelerators (which I think the compression migration was designed for)?
>>>>
>>>> Yes, i have tested it on my desktop, i7-4790 + 16G, by locally live migrate
>>>> the VM which has 8 vCPUs + 6G memory and the max-bandwidth is limited to 350.
>>>>
>>>> During the migration, a workload which has 8 threads repeatedly written total
>>>> 6G memory in the VM. Before this patchset, its bandwidth is ~25 mbps, after
>>>> applying, the bandwidth is ~50 mbps.
>>>
>>> Hi, Guangrong,
>>>
>>> Not really review comments, but I got some questions. :)
>>
>> Your comments are always valuable to me! :)
>>
>>>
>>> IIUC this patch will only change the behavior when last_sent_block
>>> changed.  I see that the performance is doubled after the change,
>>> which is really promising.  However I don't fully understand why it
>>> brings such a big difference considering that IMHO current code is
>>> sending dirty pages per-RAMBlock.  I mean, IMHO last_sent_block should
>>> not change frequently?  Or am I wrong?
>>
>> It's depends on the configuration, each memory-region which is ram or
>> file backend has a RAMBlock.
>>
>> Actually, more benefits comes from the fact that the performance & throughput
>> of the multithreads has been improved as the threads is fed by the
>> migration thread and the result is consumed by the migration
>> thread.
> 
> I'm not sure whether I got your points - I think you mean that the
> compression threads and the migration thread can form a better
> pipeline if the migration thread does not do any compression at all.
> 
> I think I agree with that.
> 
> However it does not really explain to me on why a very rare event
> (sending the first page of a RAMBlock, considering bitmap sync is
> rare) can greatly affect the performance (it shows a doubled boost).
> 

I understand it is tricky indeed, but it is not very hard to explain.
The multiple threads (using 8 CPUs in our test) stay idle for a long
time with the original code; after our patch, since the normal page is
posted out asynchronously, which is extremely fast as you said (the
network is almost idle in the current implementation), there is a long
window in which the CPUs can be used effectively to generate more
compressed data than before.
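The pipeline effect described here can be seen in a toy producer/consumer model (a sketch with invented names, not QEMU code): the "migration thread" never waits on compression — when the queue of compression work is full it posts the page as a normal page instead of blocking, so the worker threads stay continuously fed:

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_LEN   4
#define NUM_WORKERS 2
#define TOTAL_PAGES 1000

typedef struct {
    int queued;              /* pages handed to compression threads */
    bool done;               /* producer finished */
    int compressed, normal;  /* result counters */
    pthread_mutex_t lock;
    pthread_cond_t cond;
} Pipe;

static void *worker(void *arg)
{
    Pipe *p = arg;
    pthread_mutex_lock(&p->lock);
    for (;;) {
        while (p->queued == 0 && !p->done) {
            pthread_cond_wait(&p->cond, &p->lock);
        }
        if (p->queued == 0 && p->done) {
            break;
        }
        p->queued--;
        pthread_mutex_unlock(&p->lock);
        for (volatile int i = 0; i < 2000; i++) { } /* stand-in for zlib work */
        pthread_mutex_lock(&p->lock);
        p->compressed++;
    }
    pthread_mutex_unlock(&p->lock);
    return NULL;
}

/* Producer ("migration thread"): never blocks on a full queue; a page
 * that cannot be queued is posted as a normal page instead. */
int run_pipeline(void)
{
    Pipe p = { .lock = PTHREAD_MUTEX_INITIALIZER,
               .cond = PTHREAD_COND_INITIALIZER };
    pthread_t th[NUM_WORKERS];
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_create(&th[i], NULL, worker, &p);
    }
    for (int i = 0; i < TOTAL_PAGES; i++) {
        pthread_mutex_lock(&p.lock);
        if (p.queued < QUEUE_LEN) {
            p.queued++;                 /* feed a compression thread */
            pthread_cond_signal(&p.cond);
        } else {
            p.normal++;                 /* queue full: post uncompressed, don't wait */
        }
        pthread_mutex_unlock(&p.lock);
    }
    pthread_mutex_lock(&p.lock);
    p.done = true;
    pthread_cond_broadcast(&p.cond);
    pthread_mutex_unlock(&p.lock);
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_join(th[i], NULL);
    }
    return p.compressed + p.normal;     /* every page went out one way or another */
}
```

In the pre-patch shape, the producer would instead do compression itself for the first page of a block, stalling the feed — that stall is one plausible source of the idle time the 8 CPUs showed before this patch.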

> Btw, about the numbers: IMHO the numbers might not be really "true
> numbers".  Or say, even the bandwidth is doubled, IMHO it does not
> mean the performance is doubled. Becasue the data has changed.
> 
> Previously there were only compressed pages, and now for each cycle of
> RAMBlock looping we'll send a normal page (then we'll get more thing
> to send).  So IMHO we don't really know whether we sent more pages
> with this patch, we can only know we sent more bytes (e.g., an extreme
> case is that the extra 25Mbps/s are all caused by those normal pages,
> and we can be sending exactly the same number of pages like before, or
> even worse?).
> 

The current implementation uses the CPU very ineffectively (addressing
that is our next work to be posted out), so, given that the network is
almost idle, posting more data out is a better choice; furthermore, the
migration thread plays a central role in the parallelism, so it had
better be fast.

>>
>>>
>>> Another follow-up question would be: have you measured how long time
>>> needed to compress a 4k page, and how many time to send it?  I think
>>> "sending the page" is not really meaningful considering that we just
>>> put a page into the buffer (which should be extremely fast since we
>>> don't really flush it every time), however I would be curious on how
>>> slow would compressing a page be.
>>
>> I haven't benchmark the performance of zlib, i think it is CPU intensive
>> workload, particularly, there no compression-accelerator (e.g, QAT) on
>> our production. BTW, we were using lzo instead of zlib which worked
>> better for some workload.
> 
> Never mind. Good to know about that.
> 
>>
>> Putting a page into buffer should depend on the network, i,e, if the
>> network is congested it should take long time. :)
> 
> Again, considering that I don't know much on compression (especially I
> hardly used that) mine are only questions, which should not block your
> patches to be either queued/merged/reposted when proper. :)

Yes, I see. The discussion can potentially lead to a better solution.

Thanks for your comment, Peter!
Peter Xu March 27, 2018, 7:33 a.m. UTC | #8
On Mon, Mar 26, 2018 at 11:43:33PM +0800, Xiao Guangrong wrote:
> 
> 
> On 03/26/2018 05:02 PM, Peter Xu wrote:
> > On Thu, Mar 22, 2018 at 07:38:07PM +0800, Xiao Guangrong wrote:
> > > 
> > > 
> > > On 03/21/2018 04:19 PM, Peter Xu wrote:
> > > > On Fri, Mar 16, 2018 at 04:05:14PM +0800, Xiao Guangrong wrote:
> > > > > 
> > > > > Hi David,
> > > > > 
> > > > > Thanks for your review.
> > > > > 
> > > > > On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote:
> > > > > 
> > > > > > >     migration/ram.c | 32 ++++++++++++++++----------------
> > > > > > 
> > > > > > Hi,
> > > > > >      Do you have some performance numbers to show this helps?  Were those
> > > > > > taken on a normal system or were they taken with one of the compression
> > > > > > accelerators (which I think the compression migration was designed for)?
> > > > > 
> > > > > Yes, i have tested it on my desktop, i7-4790 + 16G, by locally live migrate
> > > > > the VM which has 8 vCPUs + 6G memory and the max-bandwidth is limited to 350.
> > > > > 
> > > > > During the migration, a workload which has 8 threads repeatedly written total
> > > > > 6G memory in the VM. Before this patchset, its bandwidth is ~25 mbps, after
> > > > > applying, the bandwidth is ~50 mbps.
> > > > 
> > > > Hi, Guangrong,
> > > > 
> > > > Not really review comments, but I got some questions. :)
> > > 
> > > Your comments are always valuable to me! :)
> > > 
> > > > 
> > > > IIUC this patch will only change the behavior when last_sent_block
> > > > changed.  I see that the performance is doubled after the change,
> > > > which is really promising.  However I don't fully understand why it
> > > > brings such a big difference considering that IMHO current code is
> > > > sending dirty pages per-RAMBlock.  I mean, IMHO last_sent_block should
> > > > not change frequently?  Or am I wrong?
> > > 
> > > It's depends on the configuration, each memory-region which is ram or
> > > file backend has a RAMBlock.
> > > 
> > > Actually, more benefits comes from the fact that the performance & throughput
> > > of the multithreads has been improved as the threads is fed by the
> > > migration thread and the result is consumed by the migration
> > > thread.
> > 
> > I'm not sure whether I got your points - I think you mean that the
> > compression threads and the migration thread can form a better
> > pipeline if the migration thread does not do any compression at all.
> > 
> > I think I agree with that.
> > 
> > However it does not really explain to me on why a very rare event
> > (sending the first page of a RAMBlock, considering bitmap sync is
> > rare) can greatly affect the performance (it shows a doubled boost).
> > 
> 
> I understand it is trick indeed, but it is not very hard to explain.
> Multi-threads (using 8 CPUs in our test) keep idle for a long time
> for the origin code, however, after our patch, as the normal is
> posted out async-ly that it's extremely fast as you said (the network
> is almost idle for current implementation) so it has a long time that
> the CPUs can be used effectively to generate more compressed data than
> before.

Ah.  If the compression threads are consuming more CPU after this
patch, then it can persuade me far better than the original numbers,
since AFAICT that means it's the real part of bandwidth that is
boosted (the first pages of RAMBlocks are not sent via compression
threads), and I suppose it proves a better pipeline.

> 
> > Btw, about the numbers: IMHO the numbers might not be really "true
> > numbers".  Or say, even the bandwidth is doubled, IMHO it does not
> > mean the performance is doubled. Becasue the data has changed.
> > 
> > Previously there were only compressed pages, and now for each cycle of
> > RAMBlock looping we'll send a normal page (then we'll get more thing
> > to send).  So IMHO we don't really know whether we sent more pages
> > with this patch, we can only know we sent more bytes (e.g., an extreme
> > case is that the extra 25Mbps/s are all caused by those normal pages,
> > and we can be sending exactly the same number of pages like before, or
> > even worse?).
> > 
> 
> Current implementation uses CPU very ineffectively (it's our next work
> to be posted out) that the network is almost idle so posting more data
> out is a better choice,further more, migration thread plays a role for
> parallel, it'd better to make it fast.
> 
> > > 
> > > > 
> > > > Another follow-up question would be: have you measured how long time
> > > > needed to compress a 4k page, and how many time to send it?  I think
> > > > "sending the page" is not really meaningful considering that we just
> > > > put a page into the buffer (which should be extremely fast since we
> > > > don't really flush it every time), however I would be curious on how
> > > > slow would compressing a page be.
> > > 
> > > I haven't benchmark the performance of zlib, i think it is CPU intensive
> > > workload, particularly, there no compression-accelerator (e.g, QAT) on
> > > our production. BTW, we were using lzo instead of zlib which worked
> > > better for some workload.
> > 
> > Never mind. Good to know about that.
> > 
> > > 
> > > Putting a page into buffer should depend on the network, i,e, if the
> > > network is congested it should take long time. :)
> > 
> > Again, considering that I don't know much on compression (especially I
> > hardly used that) mine are only questions, which should not block your
> > patches to be either queued/merged/reposted when proper. :)
> 
> Yes, i see. The discussion can potentially raise a better solution.
> 
> Thanks for your comment, Peter!

I think I have no problem on this patch.  Please take my r-b if you
like:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks!
Xiao Guangrong March 27, 2018, 3:24 p.m. UTC | #9
On 03/28/2018 11:01 AM, Wang, Wei W wrote:
> On Tuesday, March 13, 2018 3:58 PM, Xiao Guangrong wrote:
>>
>> As compression is a heavy work, do not do it in migration thread, instead, we
>> post it out as a normal page
>>
>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> 
> 
> Hi Guangrong,
> 
> Dave asked me to help review your patch, so I will just drop my 2 cents wherever possible, and hope that could be inspiring for your work.

Thank you both for the nice help on the work. :)

> 
> 
>> ---
>>   migration/ram.c | 32 ++++++++++++++++----------------
>>   1 file changed, 16 insertions(+), 16 deletions(-)
>>
>> diff --git a/migration/ram.c b/migration/ram.c index
>> 7266351fd0..615693f180 100644
>> --- a/migration/ram.c
>> +++ b/migration/ram.c
>> @@ -1132,7 +1132,7 @@ static int ram_save_compressed_page(RAMState
>> *rs, PageSearchStatus *pss,
>>       int pages = -1;
>>       uint64_t bytes_xmit = 0;
>>       uint8_t *p;
>> -    int ret, blen;
>> +    int ret;
>>       RAMBlock *block = pss->block;
>>       ram_addr_t offset = pss->page << TARGET_PAGE_BITS;
>>
>> @@ -1162,23 +1162,23 @@ static int
>> ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss,
>>           if (block != rs->last_sent_block) {
>>               flush_compressed_data(rs);
>>               pages = save_zero_page(rs, block, offset);
>> -            if (pages == -1) {
>> -                /* Make sure the first page is sent out before other pages */
>> -                bytes_xmit = save_page_header(rs, rs->f, block, offset |
>> -                                              RAM_SAVE_FLAG_COMPRESS_PAGE);
>> -                blen = qemu_put_compression_data(rs->f, p, TARGET_PAGE_SIZE,
>> -                                                 migrate_compress_level());
>> -                if (blen > 0) {
>> -                    ram_counters.transferred += bytes_xmit + blen;
>> -                    ram_counters.normal++;
>> -                    pages = 1;
>> -                } else {
>> -                    qemu_file_set_error(rs->f, blen);
>> -                    error_report("compressed data failed!");
>> -                }
>> -            }
>>               if (pages > 0) {
>>                   ram_release_pages(block->idstr, offset, pages);
>> +            } else {
>> +                /*
>> +                 * Make sure the first page is sent out before other pages.
>> +                 *
>> +                 * we post it as normal page as compression will take much
>> +                 * CPU resource.
>> +                 */
>> +                ram_counters.transferred += save_page_header(rs, rs->f, block,
>> +                                                offset | RAM_SAVE_FLAG_PAGE);
>> +                qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE,
>> +                                      migrate_release_ram() &
>> +                                      migration_in_postcopy());
>> +                ram_counters.transferred += TARGET_PAGE_SIZE;
>> +                ram_counters.normal++;
>> +                pages = 1;
>>               }
>>           } else {
>>               pages = save_zero_page(rs, block, offset);
>> --
> 
> I agree that this patch is an improvement for the current implementation. So just pile up mine here:
> Reviewed-by: Wei Wang <wei.w.wang@intel.com>

Thanks.

> 
> 
> If you are interested in something more aggressive, I can share an alternative approach, which I think would be better. Please see below.
> 
> Actually, we can use the multi-threaded compression for the first page as well, which will not block the migration thread progress. The advantage is that we can enjoy the compression benefit for the first page and meanwhile not blocking the migration thread - the page is given to a compression thread and compressed asynchronously to the migration thread execution.
> 

Yes, it is a good point.

> The main barrier to achieving the above is that we need to make sure the first page of each block is sent first in the multi-threaded environment. We can twist the current implementation to achieve that, which is not hard:
> 
> For example, we can add a new flag to RAMBlock - bool first_page_added. Each compression thread then needs to:
> 1) check if this is the first page of the block;
> 2) if it is the first page, set block->first_page_added after sending the page;
> 3) if it is not the first page, wait to send the page until block->first_page_added is set.


So another barrier is introduced, which hurts the parallelism...

Hmm, this point needs more deliberate consideration; let me think it over after this work.

Thank you.
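For reference, the ordering in Wei's steps 1)-3) above could be sketched with a condition variable (hypothetical names; `BlockOrder` stands in for per-RAMBlock state, and a real implementation would write to the stream outside the lock):

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical per-RAMBlock ordering state for the proposed scheme. */
typedef struct {
    bool first_page_added;   /* set once the block's first page is out */
    int pages_sent;          /* stand-in for actually writing to the stream */
    pthread_mutex_t lock;
    pthread_cond_t first_sent;
} BlockOrder;

/* Called by a compression thread when its page is ready to be sent. */
void send_in_order(BlockOrder *b, bool is_first_page)
{
    pthread_mutex_lock(&b->lock);
    if (is_first_page) {
        b->pages_sent++;                        /* step 2: first page goes out... */
        b->first_page_added = true;
        pthread_cond_broadcast(&b->first_sent); /* ...then release any waiters */
    } else {
        while (!b->first_page_added) {          /* step 3: wait for the first page */
            pthread_cond_wait(&b->first_sent, &b->lock);
        }
        b->pages_sent++;
    }
    pthread_mutex_unlock(&b->lock);
}
```

This makes the trade-off visible: every non-first page of a freshly started block parks its compression thread on the condition variable until the first page clears, which is exactly the extra barrier to parallelism worried about above.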
Dr. David Alan Gilbert March 27, 2018, 7:12 p.m. UTC | #10
* Xiao Guangrong (guangrong.xiao@gmail.com) wrote:
> 
> 
> On 03/26/2018 05:02 PM, Peter Xu wrote:
> > On Thu, Mar 22, 2018 at 07:38:07PM +0800, Xiao Guangrong wrote:
> > > 
> > > 
> > > On 03/21/2018 04:19 PM, Peter Xu wrote:
> > > > On Fri, Mar 16, 2018 at 04:05:14PM +0800, Xiao Guangrong wrote:
> > > > > 
> > > > > Hi David,
> > > > > 
> > > > > Thanks for your review.
> > > > > 
> > > > > On 03/15/2018 06:25 PM, Dr. David Alan Gilbert wrote:
> > > > > 
> > > > > > >     migration/ram.c | 32 ++++++++++++++++----------------
> > > > > > 
> > > > > > Hi,
> > > > > >      Do you have some performance numbers to show this helps?  Were those
> > > > > > taken on a normal system or were they taken with one of the compression
> > > > > > accelerators (which I think the compression migration was designed for)?
> > > > > 
> > > > > Yes, i have tested it on my desktop, i7-4790 + 16G, by locally live migrate
> > > > > the VM which has 8 vCPUs + 6G memory and the max-bandwidth is limited to 350.
> > > > > 
> > > > > During the migration, a workload which has 8 threads repeatedly written total
> > > > > 6G memory in the VM. Before this patchset, its bandwidth is ~25 mbps, after
> > > > > applying, the bandwidth is ~50 mbps.
> > > > 
> > > > Hi, Guangrong,
> > > > 
> > > > Not really review comments, but I got some questions. :)
> > > 
> > > Your comments are always valuable to me! :)
> > > 
> > > > 
> > > > IIUC this patch will only change the behavior when last_sent_block
> > > > changed.  I see that the performance is doubled after the change,
> > > > which is really promising.  However I don't fully understand why it
> > > > brings such a big difference considering that IMHO current code is
> > > > sending dirty pages per-RAMBlock.  I mean, IMHO last_sent_block should
> > > > not change frequently?  Or am I wrong?
> > > 
> > > It depends on the configuration: each memory region which is RAM or
> > > file backed has a RAMBlock.
> > > 
> > > Actually, more of the benefit comes from the fact that the performance &
> > > throughput of the multiple threads has been improved, as the threads are
> > > fed by the migration thread and the result is consumed by the migration
> > > thread.
> > 
> > I'm not sure whether I got your points - I think you mean that the
> > compression threads and the migration thread can form a better
> > pipeline if the migration thread does not do any compression at all.
> > 
> > I think I agree with that.
> > 
> > However it does not really explain to me why a very rare event
> > (sending the first page of a RAMBlock, considering bitmap sync is
> > rare) can greatly affect the performance (it shows a doubled boost).
> > 
> 
> I understand it is tricky indeed, but it is not very hard to explain.
> The multiple threads (using 8 CPUs in our test) stay idle for a long time
> with the original code; after our patch, as the normal page is posted
> out asynchronously, which is extremely fast as you said (the network
> is almost idle for the current implementation), there is a long stretch
> in which the CPUs can be used effectively to generate more compressed
> data than before.

One thing to try, to explain Peter's worry, would be, for testing, to
add a counter to see how often this case triggers, and perhaps add
some debug to see when;  Peter's right that flipping between the
RAMBlocks seems odd, unless you're either doing lots of iterations or
have lots of separate RAMBlocks for some reason.

Dave

> > Btw, about the numbers: IMHO the numbers might not be really "true
> > numbers".  Or say, even if the bandwidth is doubled, IMHO it does not
> > mean the performance is doubled, because the data has changed.
> > 
> > Previously there were only compressed pages, and now for each cycle of
> > RAMBlock looping we'll send a normal page (then we'll get more things
> > to send).  So IMHO we don't really know whether we sent more pages
> > with this patch; we can only know we sent more bytes (e.g., an extreme
> > case is that the extra 25 Mbps are all caused by those normal pages,
> > and we can be sending exactly the same number of pages as before, or
> > even worse?).
> > 
> 
> The current implementation uses the CPU very ineffectively (improving that
> is our next work to be posted out) and the network is almost idle, so
> posting more data out is a better choice; furthermore, the migration
> thread plays a role in the parallelism, so it'd better be fast.
> 
> > > 
> > > > 
> > > > Another follow-up question would be: have you measured how much time
> > > > is needed to compress a 4k page, and how much time to send it?  I think
> > > > "sending the page" is not really meaningful considering that we just
> > > > put a page into the buffer (which should be extremely fast since we
> > > > don't really flush it every time), however I would be curious how
> > > > slow compressing a page would be.
> > > 
> > > I haven't benchmarked the performance of zlib; I think it is a
> > > CPU-intensive workload, particularly as there is no compression
> > > accelerator (e.g., QAT) on our production systems. BTW, we were using
> > > lzo instead of zlib, which worked better for some workloads.
> > 
> > Never mind. Good to know about that.
> > 
> > > 
> > > Putting a page into the buffer should depend on the network, i.e., if
> > > the network is congested it should take a long time. :)
> > 
> > Again, considering that I don't know much about compression (especially
> > as I have hardly used it), mine are only questions, which should not
> > block your patches from being queued/merged/reposted when proper. :)
> 
> Yes, I see. The discussion can potentially raise a better solution.
> 
> Thanks for your comment, Peter!
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
Wang, Wei W March 28, 2018, 3:01 a.m. UTC | #11
On Tuesday, March 13, 2018 3:58 PM, Xiao Guangrong wrote:
> 
> As compression is a heavy work, do not do it in migration thread, instead, we
> post it out as a normal page
> 
> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>


Hi Guangrong,

Dave asked me to help review your patch, so I will just drop my 2 cents wherever possible, and hope that could be inspiring for your work.


> ---
>  migration/ram.c | 32 ++++++++++++++++----------------
>  1 file changed, 16 insertions(+), 16 deletions(-)
> 
> diff --git a/migration/ram.c b/migration/ram.c index
> 7266351fd0..615693f180 100644
> --- a/migration/ram.c
> +++ b/migration/ram.c
> @@ -1132,7 +1132,7 @@ static int ram_save_compressed_page(RAMState
> *rs, PageSearchStatus *pss,
>      int pages = -1;
>      uint64_t bytes_xmit = 0;
>      uint8_t *p;
> -    int ret, blen;
> +    int ret;
>      RAMBlock *block = pss->block;
>      ram_addr_t offset = pss->page << TARGET_PAGE_BITS;
> 
> @@ -1162,23 +1162,23 @@ static int
> ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss,
>          if (block != rs->last_sent_block) {
>              flush_compressed_data(rs);
>              pages = save_zero_page(rs, block, offset);
> -            if (pages == -1) {
> -                /* Make sure the first page is sent out before other pages */
> -                bytes_xmit = save_page_header(rs, rs->f, block, offset |
> -                                              RAM_SAVE_FLAG_COMPRESS_PAGE);
> -                blen = qemu_put_compression_data(rs->f, p, TARGET_PAGE_SIZE,
> -                                                 migrate_compress_level());
> -                if (blen > 0) {
> -                    ram_counters.transferred += bytes_xmit + blen;
> -                    ram_counters.normal++;
> -                    pages = 1;
> -                } else {
> -                    qemu_file_set_error(rs->f, blen);
> -                    error_report("compressed data failed!");
> -                }
> -            }
>              if (pages > 0) {
>                  ram_release_pages(block->idstr, offset, pages);
> +            } else {
> +                /*
> +                 * Make sure the first page is sent out before other pages.
> +                 *
> +                 * we post it as normal page as compression will take much
> +                 * CPU resource.
> +                 */
> +                ram_counters.transferred += save_page_header(rs, rs->f, block,
> +                                                offset | RAM_SAVE_FLAG_PAGE);
> +                qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE,
> +                                      migrate_release_ram() &
> +                                      migration_in_postcopy());
> +                ram_counters.transferred += TARGET_PAGE_SIZE;
> +                ram_counters.normal++;
> +                pages = 1;
>              }
>          } else {
>              pages = save_zero_page(rs, block, offset);
> --

I agree that this patch is an improvement for the current implementation. So just pile up mine here:
Reviewed-by: Wei Wang <wei.w.wang@intel.com>


If you are interested in something more aggressive, I can share an alternative approach, which I think would be better. Please see below.

Actually, we can use the multi-threaded compression for the first page as well, which will not block the migration thread's progress. The advantage is that we can enjoy the compression benefit for the first page while not blocking the migration thread - the page is given to a compression thread and compressed asynchronously to the migration thread's execution.

The main barrier to achieving the above is that we need to make sure the first page of each block is sent first in the multi-threaded environment. We can twist the current implementation to achieve that, which is not hard:

For example, we can add a new flag to RAMBlock - bool first_page_added. In each thread of compression, they need
1) check if this is the first page of the block.
2) If it is the first page, set block->first_page_added after sending the page;
3) If it is not the first page, wait to send the page only when block->first_page_added is set.

Best,
Wei
Wang, Wei W March 28, 2018, 7:30 a.m. UTC | #12
On 03/27/2018 11:24 PM, Xiao Guangrong wrote:
>
>
> On 03/28/2018 11:01 AM, Wang, Wei W wrote:
>> On Tuesday, March 13, 2018 3:58 PM, Xiao Guangrong wrote:
>>>
>>> As compression is a heavy work, do not do it in migration thread, 
>>> instead, we
>>> post it out as a normal page
>>>
>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
>>
>>
>> Hi Guangrong,
>>
>> Dave asked me to help review your patch, so I will just drop my 2 
>> cents wherever possible, and hope that could be inspiring for your work.
>
> Thank you both for the nice help on the work. :)
>
>>
>>
>>> ---
>>>   migration/ram.c | 32 ++++++++++++++++----------------
>>>   1 file changed, 16 insertions(+), 16 deletions(-)
>>>
>>> diff --git a/migration/ram.c b/migration/ram.c index
>>> 7266351fd0..615693f180 100644
>>> --- a/migration/ram.c
>>> +++ b/migration/ram.c
>>> @@ -1132,7 +1132,7 @@ static int ram_save_compressed_page(RAMState
>>> *rs, PageSearchStatus *pss,
>>>       int pages = -1;
>>>       uint64_t bytes_xmit = 0;
>>>       uint8_t *p;
>>> -    int ret, blen;
>>> +    int ret;
>>>       RAMBlock *block = pss->block;
>>>       ram_addr_t offset = pss->page << TARGET_PAGE_BITS;
>>>
>>> @@ -1162,23 +1162,23 @@ static int
>>> ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss,
>>>           if (block != rs->last_sent_block) {
>>>               flush_compressed_data(rs);
>>>               pages = save_zero_page(rs, block, offset);
>>> -            if (pages == -1) {
>>> -                /* Make sure the first page is sent out before 
>>> other pages */
>>> -                bytes_xmit = save_page_header(rs, rs->f, block, 
>>> offset |
>>> - RAM_SAVE_FLAG_COMPRESS_PAGE);
>>> -                blen = qemu_put_compression_data(rs->f, p, 
>>> TARGET_PAGE_SIZE,
>>> - migrate_compress_level());
>>> -                if (blen > 0) {
>>> -                    ram_counters.transferred += bytes_xmit + blen;
>>> -                    ram_counters.normal++;
>>> -                    pages = 1;
>>> -                } else {
>>> -                    qemu_file_set_error(rs->f, blen);
>>> -                    error_report("compressed data failed!");
>>> -                }
>>> -            }
>>>               if (pages > 0) {
>>>                   ram_release_pages(block->idstr, offset, pages);
>>> +            } else {
>>> +                /*
>>> +                 * Make sure the first page is sent out before 
>>> other pages.
>>> +                 *
>>> +                 * we post it as normal page as compression will 
>>> take much
>>> +                 * CPU resource.
>>> +                 */
>>> +                ram_counters.transferred += save_page_header(rs, 
>>> rs->f, block,
>>> +                                                offset | 
>>> RAM_SAVE_FLAG_PAGE);
>>> +                qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE,
>>> +                                      migrate_release_ram() &
>>> + migration_in_postcopy());
>>> +                ram_counters.transferred += TARGET_PAGE_SIZE;
>>> +                ram_counters.normal++;
>>> +                pages = 1;
>>>               }
>>>           } else {
>>>               pages = save_zero_page(rs, block, offset);
>>> -- 
>>
>> I agree that this patch is an improvement for the current 
>> implementation. So just pile up mine here:
>> Reviewed-by: Wei Wang <wei.w.wang@intel.com>
>
> Thanks.
>
>>
>>
>> If you are interested in something more aggressive, I can share an 
>> alternative approach, which I think would be better. Please see below.
>>
>> Actually, we can use the multi-threaded compression for the first 
>> page as well, which will not block the migration thread progress. The 
>> advantage is that we can enjoy the compression benefit for the first 
>> page and meanwhile not blocking the migration thread - the page is 
>> given to a compression thread and compressed asynchronously to the 
>> migration thread execution.
>>
>
> Yes, it is a good point.
>
>> The main barrier to achieving the above that is that we need to make 
>> sure the first page of each block is sent first in the multi-threaded 
>> environment. We can twist the current implementation to achieve that, 
>> which is not hard:
>>
>> For example, we can add a new flag to RAMBlock - bool 
>> first_page_added. In each thread of compression, they need
>> 1) check if this is the first page of the block.
>> 2) If it is the first page, set block->first_page_added after sending 
>> the page;
>> 3) If it is not the first the page, wait to send the page only when 
>> block->first_page_added is set.
>
>
> So there is another barrier introduced which hurts the parallelism...
>
> Hmm, we need more deliberate consideration on this point, let me think 
> it over after this work.
>

Sure. Just a reminder: this doesn't have to be a barrier to the
compression, it is just used to serialize sending the pages.

Btw, this reminds me of a possible bug in this patch (also in the current
upstream code): there appears to be no guarantee that the first page
will be sent before the others. The migration thread and the compression
threads use different buffers. The migration thread just puts the first
page into its buffer first; the second page is put into the compression
thread's buffer later. There appears to be no guarantee that the migration
thread will flush its buffer before the compression thread.

Best,
Wei
Peter Xu March 28, 2018, 7:37 a.m. UTC | #13
On Wed, Mar 28, 2018 at 03:30:06PM +0800, Wei Wang wrote:
> On 03/27/2018 11:24 PM, Xiao Guangrong wrote:
> > 
> > 
> > On 03/28/2018 11:01 AM, Wang, Wei W wrote:
> > > On Tuesday, March 13, 2018 3:58 PM, Xiao Guangrong wrote:
> > > > 
> > > > As compression is a heavy work, do not do it in migration
> > > > thread, instead, we
> > > > post it out as a normal page
> > > > 
> > > > Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
> > > 
> > > 
> > > Hi Guangrong,
> > > 
> > > Dave asked me to help review your patch, so I will just drop my 2
> > > cents wherever possible, and hope that could be inspiring for your
> > > work.
> > 
> > Thank you both for the nice help on the work. :)
> > 
> > > 
> > > 
> > > > ---
> > > >   migration/ram.c | 32 ++++++++++++++++----------------
> > > >   1 file changed, 16 insertions(+), 16 deletions(-)
> > > > 
> > > > diff --git a/migration/ram.c b/migration/ram.c index
> > > > 7266351fd0..615693f180 100644
> > > > --- a/migration/ram.c
> > > > +++ b/migration/ram.c
> > > > @@ -1132,7 +1132,7 @@ static int ram_save_compressed_page(RAMState
> > > > *rs, PageSearchStatus *pss,
> > > >       int pages = -1;
> > > >       uint64_t bytes_xmit = 0;
> > > >       uint8_t *p;
> > > > -    int ret, blen;
> > > > +    int ret;
> > > >       RAMBlock *block = pss->block;
> > > >       ram_addr_t offset = pss->page << TARGET_PAGE_BITS;
> > > > 
> > > > @@ -1162,23 +1162,23 @@ static int
> > > > ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss,
> > > >           if (block != rs->last_sent_block) {
> > > >               flush_compressed_data(rs);
> > > >               pages = save_zero_page(rs, block, offset);
> > > > -            if (pages == -1) {
> > > > -                /* Make sure the first page is sent out before
> > > > other pages */
> > > > -                bytes_xmit = save_page_header(rs, rs->f, block,
> > > > offset |
> > > > - RAM_SAVE_FLAG_COMPRESS_PAGE);
> > > > -                blen = qemu_put_compression_data(rs->f, p,
> > > > TARGET_PAGE_SIZE,
> > > > - migrate_compress_level());
> > > > -                if (blen > 0) {
> > > > -                    ram_counters.transferred += bytes_xmit + blen;
> > > > -                    ram_counters.normal++;
> > > > -                    pages = 1;
> > > > -                } else {
> > > > -                    qemu_file_set_error(rs->f, blen);
> > > > -                    error_report("compressed data failed!");
> > > > -                }
> > > > -            }
> > > >               if (pages > 0) {
> > > >                   ram_release_pages(block->idstr, offset, pages);
> > > > +            } else {
> > > > +                /*
> > > > +                 * Make sure the first page is sent out before
> > > > other pages.
> > > > +                 *
> > > > +                 * we post it as normal page as compression
> > > > will take much
> > > > +                 * CPU resource.
> > > > +                 */
> > > > +                ram_counters.transferred +=
> > > > save_page_header(rs, rs->f, block,
> > > > +                                                offset |
> > > > RAM_SAVE_FLAG_PAGE);
> > > > +                qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE,
> > > > +                                      migrate_release_ram() &
> > > > + migration_in_postcopy());
> > > > +                ram_counters.transferred += TARGET_PAGE_SIZE;
> > > > +                ram_counters.normal++;
> > > > +                pages = 1;
> > > >               }
> > > >           } else {
> > > >               pages = save_zero_page(rs, block, offset);
> > > > -- 
> > > 
> > > I agree that this patch is an improvement for the current
> > > implementation. So just pile up mine here:
> > > Reviewed-by: Wei Wang <wei.w.wang@intel.com>
> > 
> > Thanks.
> > 
> > > 
> > > 
> > > If you are interested in something more aggressive, I can share an
> > > alternative approach, which I think would be better. Please see
> > > below.
> > > 
> > > Actually, we can use the multi-threaded compression for the first
> > > page as well, which will not block the migration thread progress.
> > > The advantage is that we can enjoy the compression benefit for the
> > > first page and meanwhile not blocking the migration thread - the
> > > page is given to a compression thread and compressed asynchronously
> > > to the migration thread execution.
> > > 
> > 
> > Yes, it is a good point.
> > 
> > > The main barrier to achieving the above that is that we need to make
> > > sure the first page of each block is sent first in the
> > > multi-threaded environment. We can twist the current implementation
> > > to achieve that, which is not hard:
> > > 
> > > For example, we can add a new flag to RAMBlock - bool
> > > first_page_added. In each thread of compression, they need
> > > 1) check if this is the first page of the block.
> > > 2) If it is the first page, set block->first_page_added after
> > > sending the page;
> > > 3) If it is not the first the page, wait to send the page only when
> > > block->first_page_added is set.
> > 
> > 
> > So there is another barrier introduced which hurts the parallel...
> > 
> > Hmm, we need more deliberate consideration on this point, let me think
> > it over after this work.
> > 
> 
> Sure. Just a reminder, this doesn't have to be a barrier to the compression,
> it is just used to serialize sending the pages.
> 
> Btw, this reminds me a possible bug in this patch (also in the current
> upstream code): there appears to be no guarantee that the first page will be
> sent before others. The migration thread and the compression thread use
> different buffers. The migration thread just puts the first page into its
> buffer first,  the second page is put to the compression thread buffer
> later. There appears to be no guarantee that the migration thread will flush
> its buffer before the compression thread.

IIUC finally the compression buffers will be queued into the migration
IO stream, so they are still serialized.

In compress_page_with_multi_thread() there is:

        bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);

comp_param[idx].file should be the compression buffer.

rs->f should be the migration IO stream. Thanks,
Wang, Wei W March 28, 2018, 8:30 a.m. UTC | #14
On 03/28/2018 03:37 PM, Peter Xu wrote:
> On Wed, Mar 28, 2018 at 03:30:06PM +0800, Wei Wang wrote:
>> On 03/27/2018 11:24 PM, Xiao Guangrong wrote:
>>>
>>> On 03/28/2018 11:01 AM, Wang, Wei W wrote:
>>>> On Tuesday, March 13, 2018 3:58 PM, Xiao Guangrong wrote:
>>>>> As compression is a heavy work, do not do it in migration
>>>>> thread, instead, we
>>>>> post it out as a normal page
>>>>>
>>>>> Signed-off-by: Xiao Guangrong <xiaoguangrong@tencent.com>
>>>>
>>>> Hi Guangrong,
>>>>
>>>> Dave asked me to help review your patch, so I will just drop my 2
>>>> cents wherever possible, and hope that could be inspiring for your
>>>> work.
>>> Thank you both for the nice help on the work. :)
>>>
>>>>
>>>>> ---
>>>>>    migration/ram.c | 32 ++++++++++++++++----------------
>>>>>    1 file changed, 16 insertions(+), 16 deletions(-)
>>>>>
>>>>> diff --git a/migration/ram.c b/migration/ram.c index
>>>>> 7266351fd0..615693f180 100644
>>>>> --- a/migration/ram.c
>>>>> +++ b/migration/ram.c
>>>>> @@ -1132,7 +1132,7 @@ static int ram_save_compressed_page(RAMState
>>>>> *rs, PageSearchStatus *pss,
>>>>>        int pages = -1;
>>>>>        uint64_t bytes_xmit = 0;
>>>>>        uint8_t *p;
>>>>> -    int ret, blen;
>>>>> +    int ret;
>>>>>        RAMBlock *block = pss->block;
>>>>>        ram_addr_t offset = pss->page << TARGET_PAGE_BITS;
>>>>>
>>>>> @@ -1162,23 +1162,23 @@ static int
>>>>> ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss,
>>>>>            if (block != rs->last_sent_block) {
>>>>>                flush_compressed_data(rs);
>>>>>                pages = save_zero_page(rs, block, offset);
>>>>> -            if (pages == -1) {
>>>>> -                /* Make sure the first page is sent out before
>>>>> other pages */
>>>>> -                bytes_xmit = save_page_header(rs, rs->f, block,
>>>>> offset |
>>>>> - RAM_SAVE_FLAG_COMPRESS_PAGE);
>>>>> -                blen = qemu_put_compression_data(rs->f, p,
>>>>> TARGET_PAGE_SIZE,
>>>>> - migrate_compress_level());
>>>>> -                if (blen > 0) {
>>>>> -                    ram_counters.transferred += bytes_xmit + blen;
>>>>> -                    ram_counters.normal++;
>>>>> -                    pages = 1;
>>>>> -                } else {
>>>>> -                    qemu_file_set_error(rs->f, blen);
>>>>> -                    error_report("compressed data failed!");
>>>>> -                }
>>>>> -            }
>>>>>                if (pages > 0) {
>>>>>                    ram_release_pages(block->idstr, offset, pages);
>>>>> +            } else {
>>>>> +                /*
>>>>> +                 * Make sure the first page is sent out before
>>>>> other pages.
>>>>> +                 *
>>>>> +                 * we post it as normal page as compression
>>>>> will take much
>>>>> +                 * CPU resource.
>>>>> +                 */
>>>>> +                ram_counters.transferred +=
>>>>> save_page_header(rs, rs->f, block,
>>>>> +                                                offset |
>>>>> RAM_SAVE_FLAG_PAGE);
>>>>> +                qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE,
>>>>> +                                      migrate_release_ram() &
>>>>> + migration_in_postcopy());
>>>>> +                ram_counters.transferred += TARGET_PAGE_SIZE;
>>>>> +                ram_counters.normal++;
>>>>> +                pages = 1;
>>>>>                }
>>>>>            } else {
>>>>>                pages = save_zero_page(rs, block, offset);
>>>>> -- 
>>>> I agree that this patch is an improvement for the current
>>>> implementation. So just pile up mine here:
>>>> Reviewed-by: Wei Wang <wei.w.wang@intel.com>
>>> Thanks.
>>>
>>>>
>>>> If you are interested in something more aggressive, I can share an
>>>> alternative approach, which I think would be better. Please see
>>>> below.
>>>>
>>>> Actually, we can use the multi-threaded compression for the first
>>>> page as well, which will not block the migration thread progress.
>>>> The advantage is that we can enjoy the compression benefit for the
>>>> first page and meanwhile not blocking the migration thread - the
>>>> page is given to a compression thread and compressed asynchronously
>>>> to the migration thread execution.
>>>>
>>> Yes, it is a good point.
>>>
>>>> The main barrier to achieving the above that is that we need to make
>>>> sure the first page of each block is sent first in the
>>>> multi-threaded environment. We can twist the current implementation
>>>> to achieve that, which is not hard:
>>>>
>>>> For example, we can add a new flag to RAMBlock - bool
>>>> first_page_added. In each thread of compression, they need
>>>> 1) check if this is the first page of the block.
>>>> 2) If it is the first page, set block->first_page_added after
>>>> sending the page;
>>>> 3) If it is not the first the page, wait to send the page only when
>>>> block->first_page_added is set.
>>>
>>> So there is another barrier introduced which hurts the parallel...
>>>
>>> Hmm, we need more deliberate consideration on this point, let me think
>>> it over after this work.
>>>
>> Sure. Just a reminder, this doesn't have to be a barrier to the compression,
>> it is just used to serialize sending the pages.
>>
>> Btw, this reminds me a possible bug in this patch (also in the current
>> upstream code): there appears to be no guarantee that the first page will be
>> sent before others. The migration thread and the compression thread use
>> different buffers. The migration thread just puts the first page into its
>> buffer first,  the second page is put to the compression thread buffer
>> later. There appears to be no guarantee that the migration thread will flush
>> its buffer before the compression thread.
> IIUC finally the compression buffers will be queued into the migration
> IO stream, so they are still serialized.
>
> In compress_page_with_multi_thread() there is:
>
>          bytes_xmit = qemu_put_qemu_file(rs->f, comp_param[idx].file);
>
> comp_param[idx].file should be the compression buffer.
>
> rs->f should be the migration IO stream.

OK, thanks. It turns out that the comp_param[idx].file is not writable 
currently. This needs an extra copy, which could be avoided with the 
above approach.

Best,
Wei

Patch

diff --git a/migration/ram.c b/migration/ram.c
index 7266351fd0..615693f180 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -1132,7 +1132,7 @@  static int ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss,
     int pages = -1;
     uint64_t bytes_xmit = 0;
     uint8_t *p;
-    int ret, blen;
+    int ret;
     RAMBlock *block = pss->block;
     ram_addr_t offset = pss->page << TARGET_PAGE_BITS;
 
@@ -1162,23 +1162,23 @@  static int ram_save_compressed_page(RAMState *rs, PageSearchStatus *pss,
         if (block != rs->last_sent_block) {
             flush_compressed_data(rs);
             pages = save_zero_page(rs, block, offset);
-            if (pages == -1) {
-                /* Make sure the first page is sent out before other pages */
-                bytes_xmit = save_page_header(rs, rs->f, block, offset |
-                                              RAM_SAVE_FLAG_COMPRESS_PAGE);
-                blen = qemu_put_compression_data(rs->f, p, TARGET_PAGE_SIZE,
-                                                 migrate_compress_level());
-                if (blen > 0) {
-                    ram_counters.transferred += bytes_xmit + blen;
-                    ram_counters.normal++;
-                    pages = 1;
-                } else {
-                    qemu_file_set_error(rs->f, blen);
-                    error_report("compressed data failed!");
-                }
-            }
             if (pages > 0) {
                 ram_release_pages(block->idstr, offset, pages);
+            } else {
+                /*
+                 * Make sure the first page is sent out before other pages.
+                 *
+                 * we post it as normal page as compression will take much
+                 * CPU resource.
+                 */
+                ram_counters.transferred += save_page_header(rs, rs->f, block,
+                                                offset | RAM_SAVE_FLAG_PAGE);
+                qemu_put_buffer_async(rs->f, p, TARGET_PAGE_SIZE,
+                                      migrate_release_ram() &
+                                      migration_in_postcopy());
+                ram_counters.transferred += TARGET_PAGE_SIZE;
+                ram_counters.normal++;
+                pages = 1;
             }
         } else {
             pages = save_zero_page(rs, block, offset);