
[RFC,4/4] migration/multifd/zero-copy: Flush only the LRU half of the header array

Message ID 20221025044730.319941-5-leobras@redhat.com (mailing list archive)
State New, archived
Series MultiFD zero-copy improvements

Commit Message

Leonardo Bras Oct. 25, 2022, 4:47 a.m. UTC
Zero-copy multifd migration sends both the header and the memory pages in a
single syscall. Since it's necessary to flush before reusing a header, a
header array was implemented, so each write call uses a different header
from the array, and flushing only takes place after all headers have been
used, meaning one flush for every N writes.

This method has a bottleneck, though: after the last write, the flush has to
wait for all outstanding writes to finish, which can be many, meaning the
recvmsg() syscall in qio_channel_socket_flush() gets called a lot. On top of
that, it creates a period, between the flush and the next write, when the
I/O queue is empty and nothing is being sent.

To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
until at most half of the array is still in use (i.e. the LRU half of the
array can be reused).

Flushing only the LRU half of the array is much faster, since it does not
have to wait for the most recent writes to finish, which makes up for
having to flush twice per array.

As the main benefit, this approach keeps the I/O queue from going empty
while there is still data to be sent, making it easier to sustain maximum
I/O throughput while consuming less CPU time.

Signed-off-by: Leonardo Bras <leobras@redhat.com>
---
 migration/multifd.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)
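
For illustration only (not part of the patch): a minimal, self-contained
sketch of the header-ring scheme described above, with a flush every half
array. HDR_RING_SZ, hdr_flush() and hdr_ring_advance() are hypothetical
stand-ins for HEADER_ARR_SZ and the multifd helpers in the diff below, and
the "flush" here merely simulates completions.

/*
 * Minimal sketch of the "flush only the LRU half" scheme.  All names are
 * hypothetical stand-ins, not the actual multifd code.
 */
#include <stdio.h>

#define HDR_RING_SZ 1024            /* in-flight header slots per channel */

static unsigned pending_writes;     /* writes queued but not yet completed */

/* Stand-in for qio_channel_flush(c, max_pending, errp): wait until at
 * most max_pending writes are still unacknowledged. */
static int hdr_flush(unsigned max_pending)
{
    while (pending_writes > max_pending) {
        pending_writes--;           /* pretend one completion arrived */
    }
    return 0;
}

/* Called once per zero-copy write; returns the next slot index to use. */
static unsigned hdr_ring_advance(unsigned idx)
{
    pending_writes++;
    idx = (idx + 1) % HDR_RING_SZ;

    /* Each time half of the ring has been handed out, wait until at most
     * the newest half is still in flight, so the LRU half can be reused
     * while the NIC keeps draining the rest of the queue. */
    if (idx % (HDR_RING_SZ / 2) == 0) {
        hdr_flush(HDR_RING_SZ / 2);
    }
    return idx;
}

int main(void)
{
    unsigned idx = 0;

    for (int i = 0; i < 4 * HDR_RING_SZ; i++) {
        idx = hdr_ring_advance(idx);
    }
    printf("pending writes after %d sends: %u\n", 4 * HDR_RING_SZ,
           pending_writes);
    return 0;
}

A full flush (max_pending = 0), as done in multifd_send_sync_main() in the
diff below, would instead drain the queue completely.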

Comments

Daniel P. Berrangé Oct. 25, 2022, 8:35 a.m. UTC | #1
On Tue, Oct 25, 2022 at 01:47:31AM -0300, Leonardo Bras wrote:
> Zero-copy multifd migration sends both the header and the memory pages in a
> single syscall. Since it's necessary to flush before reusing a header, a
> header array was implemented, so each write call uses a different header
> from the array, and flushing only takes place after all headers have been
> used, meaning one flush for every N writes.
> 
> This method has a bottleneck, though: after the last write, the flush has to
> wait for all outstanding writes to finish, which can be many, meaning the
> recvmsg() syscall in qio_channel_socket_flush() gets called a lot. On top of
> that, it creates a period, between the flush and the next write, when the
> I/O queue is empty and nothing is being sent.
> 
> To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
> until at most half of the array is still in use (i.e. the LRU half of the
> array can be reused).
> 
> Flushing only the LRU half of the array is much faster, since it does not
> have to wait for the most recent writes to finish, which makes up for
> having to flush twice per array.
> 
> As the main benefit, this approach keeps the I/O queue from going empty
> while there is still data to be sent, making it easier to sustain maximum
> I/O throughput while consuming less CPU time.

Doesn't this defeat the reason for adding the flush in the first
place, which was to ensure that a migration iteration was fully
sent before starting the next iteration over RAM ? If it is OK to
only partially flush on each iteration, then why do we need to
flush at all ?


With regards,
Daniel
Juan Quintela Oct. 25, 2022, 10:07 a.m. UTC | #2
Daniel P. Berrangé <berrange@redhat.com> wrote:
> On Tue, Oct 25, 2022 at 01:47:31AM -0300, Leonardo Bras wrote:
>> Zero-copy multifd migration sends both the header and the memory pages in a
>> single syscall. Since it's necessary to flush before reusing a header, a
>> header array was implemented, so each write call uses a different header
>> from the array, and flushing only takes place after all headers have been
>> used, meaning one flush for every N writes.
>> 
>> This method has a bottleneck, though: after the last write, the flush has to
>> wait for all outstanding writes to finish, which can be many, meaning the
>> recvmsg() syscall in qio_channel_socket_flush() gets called a lot. On top of
>> that, it creates a period, between the flush and the next write, when the
>> I/O queue is empty and nothing is being sent.
>> 
>> To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
>> until at most half of the array is still in use (i.e. the LRU half of the
>> array can be reused).
>> 
>> Flushing only the LRU half of the array is much faster, since it does not
>> have to wait for the most recent writes to finish, which makes up for
>> having to flush twice per array.
>> 
>> As the main benefit, this approach keeps the I/O queue from going empty
>> while there is still data to be sent, making it easier to sustain maximum
>> I/O throughput while consuming less CPU time.
>
> Doesn't this defeat the reason for adding the flush in the first
> place, which was to ensure that a migration iteration was fully
> sent before starting the next iteration over RAM ? If it is OK to
> only partially flush on each iteration, then why do we need to
> flush at all ?

Now we need to do the flush in two places:
- on sync_main (the old place)
- on the migration thread, when we run out of array entries.
  This one has nothing to do with correctness; it is only there to get
  free array entries back.

So, in this regard, I think that the patches are correct.

But on the other hand, I am not sure that I like the size of the array.
Leonardo is using 1024 entries for each migration channel.  That means it
allows 1024 entries * 512 KB per packet = 512 MB of unflushed data in each
channel.  I think that this is still too much.

I will need some data from testing, but my understanding of how zero copy
works is that having around 10 MB in flight in each channel would be more
than enough to saturate the link.  And once the data in flight is that
small, we can just flush when we run out of array entries.

My idea here was to work the size out the other way around: add a parameter
where the user says how much memory they can make available for mlocking,
and just use memory/channels array entries on each channel.  That will:

a - limit the amount of mlocked memory that we need
    10 MB/channel for 10 channels is 100 MB of mlocked memory, which looks
    reasonable for a guest that has many gigabytes of RAM.

b - We don't synchronize after each write, because I still claim that
    doing a non-asynchronous write on the channel just syncs everything
    (otherwise I can't see how the hardware could even work).
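
A rough sketch of that sizing, purely for illustration; the
zerocopy_mlock_limit parameter and helper names below are assumptions, not
part of this series, and the 512 KB packet size follows the numbers quoted
above.

/* Hypothetical sizing helper: derive the per-channel header-array length
 * from a user-supplied mlock budget instead of hardcoding 1024 entries. */
#include <stdio.h>
#include <stddef.h>

#define PACKET_SIZE (512 * 1024)     /* bytes queued per zero-copy write */

static unsigned entries_per_channel(size_t zerocopy_mlock_limit,
                                    unsigned channels)
{
    return zerocopy_mlock_limit / (channels * (size_t)PACKET_SIZE);
}

int main(void)
{
    size_t limit = 100 * 1024 * 1024;    /* 100 MB total mlock budget */
    unsigned channels = 10;

    /* 100 MB / (10 channels * 512 KB) = 20 entries per channel, versus
     * 1024 entries (512 MB unflushed) per channel in the current RFC. */
    printf("%u entries per channel\n", entries_per_channel(limit, channels));
    return 0;
}

With the example budget above, the array shrinks from 1024 to 20 entries
per channel, which is the kind of value the proposed testing would either
confirm or push up.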

So I guess the best way to decide this is:
- get a couple of nice 100 Gigabit networking cards
- enable 16 channels or so, so we know that the CPU is not going to be
  the bottleneck
- test with this patch
- remove the last two patches and test with several array sizes
  (10, 20, 30, ...); we will see that after some size, performance does
  not improve at all
- take that value as the default one.

What do you think?

Later, Juan.

PD.  Yes, I don't like to add another parameter, so either you can
     recompile with different values, or we will not need the parameter
     once we find the right value.
Leonardo Bras Oct. 25, 2022, 1:47 p.m. UTC | #3
On Tue, 2022-10-25 at 12:07 +0200, Juan Quintela wrote:
> Daniel P. Berrangé <berrange@redhat.com> wrote:
> > On Tue, Oct 25, 2022 at 01:47:31AM -0300, Leonardo Bras wrote:
> > > Zero-copy multifd migration sends both the header and the memory pages in a
> > > single syscall. Since it's necessary to flush before reusing a header, a
> > > header array was implemented, so each write call uses a different header
> > > from the array, and flushing only takes place after all headers have been
> > > used, meaning one flush for every N writes.
> > > 
> > > This method has a bottleneck, though: after the last write, the flush has to
> > > wait for all outstanding writes to finish, which can be many, meaning the
> > > recvmsg() syscall in qio_channel_socket_flush() gets called a lot. On top of
> > > that, it creates a period, between the flush and the next write, when the
> > > I/O queue is empty and nothing is being sent.
> > > 
> > > To avoid that, use qio_channel_flush()'s new max_pending parameter to wait
> > > until at most half of the array is still in use (i.e. the LRU half of the
> > > array can be reused).
> > > 
> > > Flushing only the LRU half of the array is much faster, since it does not
> > > have to wait for the most recent writes to finish, which makes up for
> > > having to flush twice per array.
> > > 
> > > As the main benefit, this approach keeps the I/O queue from going empty
> > > while there is still data to be sent, making it easier to sustain maximum
> > > I/O throughput while consuming less CPU time.
> > 
> > Doesn't this defeat the reason for adding the flush in the first
> > place, which was to ensure that a migration iteration was fully
> > sent before starting the next iteration over RAM ? If it is OK to
> > only partially flush on each iteration, then why do we need to
> > flush at all ?
> 
> Now we need to do the flush in two places:
> - on sync_main (the old place)
> - on the migration thread, when we run out of array entries.
>   This one has nothing to do with correctness; it is only there to get
>   free array entries back.

That's correct! In fact, sync_main (the old place) will call flush with
max_remaining = 0, meaning it will only return when there is nothing else
pending.

> 
> So, in this regard, I think that the patches are correct.
> 
> But on the other hand, I am not sure that I like the size of the array.
> Leonardo is using 1024 entries for each migration channel.  That means it
> allows 1024 entries * 512 KB per packet = 512 MB of unflushed data in each
> channel.  I think that this is still too much.

The size is correct, meaning we need to allow 512MB of locked memory per multifd
channel.

> 
> I will need some data from testing, but my understanding of how zero copy
> works is that having around 10 MB in flight in each channel would be more
> than enough to saturate the link.  And once the data in flight is that
> small, we can just flush when we run out of array entries.
> 
> My idea here was to work the size out the other way around: add a parameter
> where the user says how much memory they can make available for mlocking,
> and just use memory/channels array entries on each channel.  That will:
> 
> a - limit the amount of mlocked memory that we need
>     10 MB/channel for 10 channels is 100 MB of mlocked memory, which looks
>     reasonable for a guest that has many gigabytes of RAM.
> 
> b - We don't synchronize after each write, because I still claim that
>     doing a non-asynchronous write on the channel just syncs everything
>     (otherwise I can't see how the hardware could even work).

So the user sets the amount of locked memory available and we split it
among the channels during migration. Seems a good strategy, since it will
exploit the hardware well regardless of the channel count.

> 
> So I guess the best way to decide this is:
> - get a couple of nice 100 Gigabit networking cards
> - enable 16 channels or so, so we know that the CPU is not going to be
>   the bottleneck
> - test with this patch
> - remove the last two patches and test with several array sizes
>   (10, 20, 30, ...); we will see that after some size, performance does
>   not improve at all
> - take that value as the default one.
> 
> What do you think?

Most of it is good. I only disagree on removing the last two patches.

I did some tests with smaller array sizes, and without the last two patches
they would totally destroy performance. I blame the fact that once every N
writes it has to wait for the queue to be completely empty before adding
more data for sending.

Also, waiting for the completion of a send that was just scheduled takes
multiple recvmsg() syscalls, raising CPU usage.
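
For context, completions of MSG_ZEROCOPY sends are reaped by reading
notifications from the socket's error queue with recvmsg(MSG_ERRQUEUE);
qio_channel_socket_flush() does something along these lines. The sketch
below is an illustrative stand-alone version of that loop, not QEMU code,
and simplifies the checks a real implementation would do.

/*
 * Illustrative sketch of reaping one MSG_ZEROCOPY completion notification.
 * Error handling and the cmsg level/type checks are simplified.
 */
#include <errno.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/errqueue.h>

#ifndef SO_ZEROCOPY
#define SO_ZEROCOPY 60
#endif

/* Returns 1 if a completion notification was read (highest completed
 * sequence number stored in *hi), 0 if none was pending, -1 on error. */
static int reap_zerocopy_completion(int fd, uint32_t *hi)
{
    char control[CMSG_SPACE(sizeof(struct sock_extended_err))] = { 0 };
    struct msghdr msg = { 0 };
    struct cmsghdr *cm;

    msg.msg_control = control;
    msg.msg_controllen = sizeof(control);

    if (recvmsg(fd, &msg, MSG_ERRQUEUE | MSG_DONTWAIT) < 0) {
        return errno == EAGAIN ? 0 : -1;
    }

    for (cm = CMSG_FIRSTHDR(&msg); cm; cm = CMSG_NXTHDR(&msg, cm)) {
        struct sock_extended_err *serr = (void *)CMSG_DATA(cm);

        if (serr->ee_errno == 0 &&
            serr->ee_origin == SO_EE_ORIGIN_ZEROCOPY) {
            *hi = serr->ee_data;    /* last send completed in this batch */
            return 1;
        }
    }
    return 0;
}

int main(void)
{
    int one = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    uint32_t hi = 0;

    /* Needed before MSG_ZEROCOPY sends generate notifications; may fail
     * on old kernels, which is fine for this demonstration. */
    setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one));

    printf("pending completion read: %d (hi=%u)\n",
           reap_zerocopy_completion(fd, &hi), hi);
    close(fd);
    return 0;
}

Each such recvmsg() only reports sends that have already completed, which
is why waiting on the most recent writes right after queuing them costs
extra syscalls.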

Other than that, seems a good strategy.

Thank you both for reviewing!

Best regards,
Leo

> 
> Later, Juan.
> 
> PD.  Yes, I don't like to add another parameter, so either you can
>      recompile with different values, or we will not need the parameter
>      once we find the right value.
>

Patch

diff --git a/migration/multifd.c b/migration/multifd.c
index c5d1f911a4..fe9df460f6 100644
--- a/migration/multifd.c
+++ b/migration/multifd.c
@@ -569,12 +569,13 @@  void multifd_save_cleanup(void)
     multifd_send_state = NULL;
 }
 
-static int multifd_zero_copy_flush(QIOChannel *c)
+static int multifd_zero_copy_flush(QIOChannel *c,
+                                   int max_remaining)
 {
     int ret;
     Error *err = NULL;
 
-    ret = qio_channel_flush(c, 0, &err);
+    ret = qio_channel_flush(c, max_remaining, &err);
     if (ret < 0) {
         error_report_err(err);
         return -1;
@@ -636,7 +637,7 @@  int multifd_send_sync_main(QEMUFile *f)
         qemu_mutex_unlock(&p->mutex);
         qemu_sem_post(&p->sem);
 
-        if (flush_zero_copy && p->c && (multifd_zero_copy_flush(p->c) < 0)) {
+        if (flush_zero_copy && p->c && (multifd_zero_copy_flush(p->c, 0) < 0)) {
             return -1;
         }
     }
@@ -719,12 +720,17 @@  static void *multifd_send_thread(void *opaque)
 
             if (use_zero_copy_send) {
                 p->packet_idx = (p->packet_idx + 1) % HEADER_ARR_SZ;
-
-                if (!p->packet_idx && (multifd_zero_copy_flush(p->c) < 0)) {
+                /*
+                 * When half of the array has been used, flush to make sure
+                 * the next half is available
+                 */
+                if (!(p->packet_idx % (HEADER_ARR_SZ / 2)) &&
+                    (multifd_zero_copy_flush(p->c, HEADER_ARR_SZ / 2) < 0)) {
                     break;
                 }
                 header = (void *)p->packet + p->packet_idx * p->packet_len;
             }
+
             qemu_mutex_lock(&p->mutex);
             p->pending_job--;
             qemu_mutex_unlock(&p->mutex);